From dotanb at dev.mellanox.co.il  Tue May  1 00:07:18 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 01 May 2007 10:07:18 +0300
Subject: [ofa-general] Re: [ewg] APM Example
In-Reply-To: <ada7irtlsbi.fsf@cisco.com>
References: <87aa148d0704261534s67c37c7amdcd01d5f664a768b@mail.gmail.com>	<adairbipxry.fsf@cisco.com>	<87aa148d0704261549l64d0c9cfy7d29eddcfd89561c@mail.gmail.com>	<ada647ipwti.fsf@cisco.com>	<87aa148d0704280711x9244561pf0abef8100a53887@mail.gmail.com>
	<ada7irtlsbi.fsf@cisco.com>
Message-ID: <4636E726.6010804@dev.mellanox.co.il>

Roland Dreier wrote:
>  > > You don't know the time that the transition occurred, except that it
>  > > is between when you called modify QP and when it returned.  But an
>  > > asynchronous event doesn't really help, does it?
>
>  > It does help. APM is not only defined for network fault tolerance, it can
>  > also be used for load-balancing. With this event, one can know when
>  > the path is loaded and it is safe to call modify_qp.
>
> I guess I don't really understand how you're using this event.  What
> advantage is there in getting an async event at some unknown time
> (maybe before the modify QP operation returns, maybe after)?  What
> does it let you do that you can't do with the verbs architecture as
> defined strictly by the verbs?
>   

Roland is right, this event isn't defined in the IB spec.

If you wish to know when it is safe to call the modify_qp verb, you can
call query_qp and check path_mig_state: if the state is ARMED, it is
safe to use the alternate path.
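
As a rough illustration of the check described above, here is a minimal C
sketch. The enum and the helper name are stand-ins invented for this
example; with libibverbs the state would come from ibv_query_qp() called
with the IBV_QP_PATH_MIG_STATE mask, where the three states are named
IBV_MIG_MIGRATED, IBV_MIG_REARM and IBV_MIG_ARMED.

```c
#include <stdbool.h>

/* Local stand-ins for the three path migration states (hypothetical
 * names; libibverbs calls them IBV_MIG_MIGRATED, IBV_MIG_REARM and
 * IBV_MIG_ARMED). */
enum mig_state {
    MIG_MIGRATED,   /* migration done, alternate path not loaded */
    MIG_REARM,      /* alternate path is being loaded */
    MIG_ARMED       /* alternate path loaded and ready */
};

/* Hypothetical helper: given the path_mig_state reported by a QP query,
 * say whether the alternate path can be used. Only ARMED qualifies. */
bool alternate_path_ready(enum mig_state path_mig_state)
{
    return path_mig_state == MIG_ARMED;
}
```

A caller would poll the query until this returns true instead of waiting
for an asynchronous event.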


Dotan


From dotanb at dev.mellanox.co.il  Tue May  1 00:09:31 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 01 May 2007 10:09:31 +0300
Subject: [ofa-general] OFED 1.2 RC2 <-> WinIB 1.3
In-Reply-To: <46361E15.1050006@hp.com>
References: <4631D27B.10301@holografika.com>
	<4634AD01.5010909@dev.mellanox.co.il> <46361E15.1050006@hp.com>
Message-ID: <4636E7AB.4030908@dev.mellanox.co.il>

Rick Jones wrote:
> Dotan Barak wrote:
>> Hi Peter.
>>
>> Kovacs Peter Tamas wrote:
>>
>>> Dear all,
>>>
>>> I've tried to do some speed tests between a Linux and a Windows box 
>>> using InfiniBand.
>>> I've installed OFED 1.2 RC2 to a Fedora Core 6 x64 box, and 
>>> connected it to a Windows XP x64 box with WinIB 1.3, both machines 
>>> having a Mellanox MHES-14XTC.
>>
>>
>> As far as I know, the performance tests in Windows and in OFED cannot 
>> work together (even if they have the same name).
>
> I wonder if the SDP_mumble tests in netperf top of trunk would work?
Any test that works between two different OSes (for example, Windows and
Linux) over Ethernet should also work over SDP (or IPoIB), because they
are wire protocols.


Dotan


From jwong at datallegro.com  Tue May  1 00:06:18 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Tue, 1 May 2007 03:06:18 -0400
Subject: [ofa-general] RE: Trouble installing OFED1.2 with kernel
References: <A382D4292574EB47A85B8159A6AED1A1011315FA@FPNYEXCBE02.opus-i.corp>
	<20070501040329.GK13293@mellanox.co.il>
	<A382D4292574EB47A85B8159A6AED1A18305A1@FPNYEXCBE02.opus-i.corp>
	<20070501064802.GM13293@mellanox.co.il>
Message-ID: <A382D4292574EB47A85B8159A6AED1A18305A3@FPNYEXCBE02.opus-i.corp>

I have downloaded the kernel source from
http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.18.8.tar.gz
and have gunzipped and untarred it.

Below is the definition of struct inode from linux-2.6.18.8/include/linux/fs.h.  The i_private pointer does not exist in it.  Please advise.

Thanks,
Jeff
 

struct inode {
	struct hlist_node	i_hash;
	struct list_head	i_list;
	struct list_head	i_sb_list;
	struct list_head	i_dentry;
	unsigned long		i_ino;
	atomic_t		i_count;
	umode_t			i_mode;
	unsigned int		i_nlink;
	uid_t			i_uid;
	gid_t			i_gid;
	dev_t			i_rdev;
	loff_t			i_size;
	struct timespec		i_atime;
	struct timespec		i_mtime;
	struct timespec		i_ctime;
	unsigned int		i_blkbits;
	unsigned long		i_blksize;
	unsigned long		i_version;
	blkcnt_t		i_blocks;
	unsigned short          i_bytes;
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	struct mutex		i_mutex;
	struct rw_semaphore	i_alloc_sem;
	struct inode_operations	*i_op;
	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
	struct super_block	*i_sb;
	struct file_lock	*i_flock;
	struct address_space	*i_mapping;
	struct address_space	i_data;
#ifdef CONFIG_QUOTA
	struct dquot		*i_dquot[MAXQUOTAS];
#endif
	/* These three should probably be a union */
	struct list_head	i_devices;
	struct pipe_inode_info	*i_pipe;
	struct block_device	*i_bdev;
	struct cdev		*i_cdev;
	int			i_cindex;

	__u32			i_generation;

#ifdef CONFIG_DNOTIFY
	unsigned long		i_dnotify_mask; /* Directory notify events */
	struct dnotify_struct	*i_dnotify; /* for directory notifications */
#endif

#ifdef CONFIG_INOTIFY
	struct list_head	inotify_watches; /* watches on this inode */
	struct mutex		inotify_mutex;	/* protects the watches list */
#endif

	unsigned long		i_state;
	unsigned long		dirtied_when;	/* jiffies of first dirtying */

	unsigned int		i_flags;

	atomic_t		i_writecount;
	void			*i_security;
	union {
		void		*generic_ip;
	} u;
#ifdef __NEED_I_SIZE_ORDERED
	seqcount_t		i_size_seqcount;
#endif
};
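
One common way to bridge a field rename like this is a compatibility
macro. The sketch below is only an illustration under that assumption:
the thread does not show what the OFED backport patches actually contain,
and struct inode_2_6_18, set_private and get_private are names made up
for the example. It relies on the u.generic_ip member visible in the
struct above, which, as far as I know, later kernels expose as i_private.

```c
/* Cut-down stand-in for the 2.6.18 struct inode quoted above; only the
 * private-data union is kept. */
struct inode_2_6_18 {
    unsigned long i_ino;
    union {
        void *generic_ip;
    } u;
};

/* Compatibility alias: code written against the newer field name
 * compiles against the older layout. */
#define i_private u.generic_ip

/* Store driver-private data through the newer name. */
void set_private(struct inode_2_6_18 *inode, void *data)
{
    inode->i_private = data;    /* expands to inode->u.generic_ip */
}

/* Fetch driver-private data through the newer name. */
void *get_private(const struct inode_2_6_18 *inode)
{
    return inode->i_private;
}
```

If such a macro is applied to a tree whose struct inode already has
i_private and no u.generic_ip, or the other way round, the build fails
with a "not a member" error of the kind reported here.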


-----Original Message-----
From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il]
Sent: Tue 5/1/2007 2:48 AM
To: Jeffrey Wong
Cc: Michael S. Tsirkin; general at lists.openfabrics.org
Subject: Re: Trouble installing OFED1.2 with kernel
 
I don't think you are actually using the kernel from kernel.org:
we test-build these nightly.

Quoting Jeffrey Wong <jwong at datallegro.com>:
Subject: RE: Trouble installing OFED1.2 with kernel

When I compile the ulp/ipoib and ib_ipath modules, I get an error message saying that i_private is not a member of the inode structure.  I'm using the 2.6.18-8 kernel src from kernel.org.

Any reasons why I would be getting this error message?

Thanks,
Jeff


-----Original Message-----
From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il]
Sent: Tue 5/1/2007 12:03 AM
To: Jeffrey Wong
Cc: general at lists.openfabrics.org
Subject: Re: Trouble installing OFED1.2 with kernel
 
> Quoting Jeffrey Wong <jwong at datallegro.com>:
> Subject: Re: Trouble installing OFED1.2 with kernel
> 
> Is there a workaround for the i_private member of the inode structure either in
> the kernel or in the OFED 1.2 software?
> 
> I want to be able to compile the ipoib drivers and I cannot with the error
> i_private not being a member of inode struct.
> 
> What does the ulp/ipoib do?
> 
> I want to be able to test out the ibverbs library and ipoib library to compare
> performance.
> 
>  
> 
> Thanks.
> 
>  
> 
> Jeff

OFED 1.2 supports the RHEL5 kernel. Shouldn't the Centos kernel be identical?

-- 
MST


-- 
MST


From mst at dev.mellanox.co.il  Tue May  1 00:22:12 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 1 May 2007 10:22:12 +0300
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
In-Reply-To: <A382D4292574EB47A85B8159A6AED1A18305A3@FPNYEXCBE02.opus-i.corp>
References: <A382D4292574EB47A85B8159A6AED1A1011315FA@FPNYEXCBE02.opus-i.corp>
	<20070501040329.GK13293@mellanox.co.il>
	<A382D4292574EB47A85B8159A6AED1A18305A1@FPNYEXCBE02.opus-i.corp>
	<20070501064802.GM13293@mellanox.co.il>
	<A382D4292574EB47A85B8159A6AED1A18305A3@FPNYEXCBE02.opus-i.corp>
Message-ID: <20070501072212.GR13293@mellanox.co.il>

> Quoting Jeffrey Wong <jwong at datallegro.com>:
> Subject: RE: Trouble installing OFED1.2 with kernel
> 
> I have downloaded the kernel src from
> http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.18.8.tar.gz
> I have gunzip and untarred the directory.
> 
> In the file linux-2.6.18.8/include/linux/fs.h.  Here is the structure
> definition of inode.  When I look below the i_private ptr does not exist. 
> Please advise.

Yes but that kernel would be named 2.6.18.8, not 2.6.18-8.el5.


-- 
MST


From vlad at dev.mellanox.co.il  Tue May  1 00:44:44 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 01 May 2007 10:44:44 +0300
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
In-Reply-To: <20070501072212.GR13293@mellanox.co.il>
References: <A382D4292574EB47A85B8159A6AED1A1011315FA@FPNYEXCBE02.opus-i.corp>
	<20070501040329.GK13293@mellanox.co.il>
	<A382D4292574EB47A85B8159A6AED1A18305A1@FPNYEXCBE02.opus-i.corp>
	<20070501064802.GM13293@mellanox.co.il>
	<A382D4292574EB47A85B8159A6AED1A18305A3@FPNYEXCBE02.opus-i.corp>
	<20070501072212.GR13293@mellanox.co.il>
Message-ID: <1178005484.7789.4.camel@vladsk-laptop>

On Tue, 2007-05-01 at 10:22 +0300, Michael S. Tsirkin wrote:
> > Quoting Jeffrey Wong <jwong at datallegro.com>:
> > Subject: RE: Trouble installing OFED1.2 with kernel
> > 
> > I have downloaded the kernel src from
> > http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.18.8.tar.gz
> > I have gunzip and untarred the directory.
> > 
> > In the file linux-2.6.18.8/include/linux/fs.h.  Here is the structure
> > definition of inode.  When I look below the i_private ptr does not exist. 
> > Please advise.
> 
> Yes but that kernel would be named 2.6.18.8, not 2.6.18-8.el5.
> 
> 

Jeff,
If you name a kernel from kernel.org in the 2.6.18-*el5* style, then
OFED-1.2 applies the backport patches for Red Hat 5.0. This is the
reason for your failures.

So, to fix this, rename your kernel; then you can install OFED-1.2 with
ipath and ipoib.


-- 
Vladimir Sokolovsky <vlad at dev.mellanox.co.il>
Mellanox Technologies Ltd.


From jwong at datallegro.com  Tue May  1 00:59:02 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Tue, 1 May 2007 03:59:02 -0400
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
References: <A382D4292574EB47A85B8159A6AED1A1011315FA@FPNYEXCBE02.opus-i.corp><20070501040329.GK13293@mellanox.co.il><A382D4292574EB47A85B8159A6AED1A18305A1@FPNYEXCBE02.opus-i.corp><20070501064802.GM13293@mellanox.co.il><A382D4292574EB47A85B8159A6AED1A18305A3@FPNYEXCBE02.opus-i.corp><20070501072212.GR13293@mellanox.co.il>
	<1178005484.7789.4.camel@vladsk-laptop>
Message-ID: <A382D4292574EB47A85B8159A6AED1A18305A5@FPNYEXCBE02.opus-i.corp>

I see. So I should have never renamed my kernel from 2.6.18.8 to 2.6.18.8-el5_x86.  So this will install once I rename my kernel back to 2.6.18.8?

Thanks for the info.


Jeff



From vlad at dev.mellanox.co.il  Tue May  1 01:48:41 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 01 May 2007 11:48:41 +0300
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
In-Reply-To: <A382D4292574EB47A85B8159A6AED1A18305A5@FPNYEXCBE02.opus-i.corp>
References: <A382D4292574EB47A85B8159A6AED1A1011315FA@FPNYEXCBE02.opus-i.corp>
	<20070501040329.GK13293@mellanox.co.il>
	<A382D4292574EB47A85B8159A6AED1A18305A1@FPNYEXCBE02.opus-i.corp>
	<20070501064802.GM13293@mellanox.co.il>
	<A382D4292574EB47A85B8159A6AED1A18305A3@FPNYEXCBE02.opus-i.corp>
	<20070501072212.GR13293@mellanox.co.il>
	<1178005484.7789.4.camel@vladsk-laptop>
	<A382D4292574EB47A85B8159A6AED1A18305A5@FPNYEXCBE02.opus-i.corp>
Message-ID: <1178009321.7789.8.camel@vladsk-laptop>

On Tue, 2007-05-01 at 03:59 -0400, Jeffrey Wong wrote:
> I see. So I should have never renamed my kernel from 2.6.18.8 to 2.6.18.8-el5_x86.  So this will install once I rename my kernel back to 2.6.18.8?
> 
> Thanks for the info.
> 
> 
> Jeff
> 

Yes.
The configure script uses the pattern 2.6.18-*el5* to select backport
patches. The backport patches for 2.6.18* from kernel.org differ from
those for 2.6.18 from RHEL5.0.
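
To make the effect of that pattern concrete, here is a small C sketch
that applies the same glob with POSIX fnmatch(). It is an assumption that
the configure script matches in this way (the script itself is not shown
in this thread), and looks_like_rhel5_kernel is a name invented for the
example.

```c
#include <fnmatch.h>
#include <stdbool.h>

/* Glob quoted in this thread for recognising RHEL5-style kernel names. */
#define RHEL5_KERNEL_GLOB "2.6.18-*el5*"

/* Hypothetical helper: true when a kernel release string matches the
 * RHEL5 glob, in which case the RHEL5 backport set would be selected. */
bool looks_like_rhel5_kernel(const char *release)
{
    /* fnmatch() returns 0 on a match and FNM_NOMATCH otherwise. */
    return fnmatch(RHEL5_KERNEL_GLOB, release, 0) == 0;
}
```

With this helper, "2.6.18-8.el5" matches and plain "2.6.18.8" does not,
which is why the name given to the kernel decides which patches are
applied.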

Regards,
Vladimir




From vlad at lists.openfabrics.org  Tue May  1 02:37:11 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue,  1 May 2007 02:37:11 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070501-0200 daily build status
Message-ID: <20070501093712.4CA32E60811@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From mst at dev.mellanox.co.il  Tue May  1 05:53:21 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 1 May 2007 15:53:21 +0300
Subject: [ofa-general] Fwd: [ANNOUNCE] GIT 1.5.1.3
Message-ID: <20070501125321.GC8447@mellanox.co.il>

FYI.
I think we want to update - the mmap fix looks important enough.
Sasha?

----- Forwarded message from Junio C Hamano <junkio at cox.net> -----

Subject: [ANNOUNCE] GIT 1.5.1.3
Date: Tue, 1 May 2007 06:08:58 +0300
From: Junio C Hamano <junkio at cox.net>

The latest maintenance release GIT 1.5.1.3 is available at the
usual places:

  http://www.kernel.org/pub/software/scm/git/

  git-1.5.1.3.tar.{gz,bz2}			(tarball)
  git-htmldocs-1.5.1.3.tar.{gz,bz2}		(preformatted docs)
  git-manpages-1.5.1.3.tar.{gz,bz2}		(preformatted docs)
  RPMS/$arch/git-*-1.5.1.3-1.$arch.rpm	(RPM)

GIT v1.5.1.3 Release Notes
==========================

Fixes since v1.5.1.2
--------------------

* Bugfixes

  - git-add tried to optimize by finding common leading
    directories across its arguments but botched, causing very
    confused behaviour.

  - unofficial rpm.spec file shipped with git was letting
    ETC_GITCONFIG set to /usr/etc/gitconfig.  Tweak the official
    Makefile to make it harder for distro people to make the
    same mistake, by setting the variable to /etc/gitconfig if
    prefix is set to /usr.

  - git-svn inconsistently stripped away username from the URL
    only when svnsync_props was in use.

  - git-svn got confused when handling symlinks on Mac OS.

  - git-send-email was not quoting recipient names that have
    period '.' in them.  Also it did not allow overriding
    envelope sender, which made it impossible to send patches to
    certain subscriber-only lists.

  - built-in write_tree() routine had a sequence that renamed a
    file that is still open, which some systems did not like.

  - when memory is very tight, sliding mmap code to read
    packfiles incorrectly closed the fd that was still being
    used to read the pack.

  - import-tars contributed front-end for fastimport was passing
    wrong directory modes without checking.

  - git-fastimport trusted its input too much and allowed to
    create corrupt tree objects with entries without a name.

  - git-fetch needlessly barfed when too long reflog action
    description was given by the caller.

Also contains various documentation updates.

----------------------------------------------------------------

Changes since v1.5.1.2 are as follows:

Adam Roben (5):
      Remove usernames from all commit messages, not just when using svmprops
      git-svn: Don't rely on $_ after making a function call
      git-svn: Ignore usernames in URLs in find_by_url
      git-svn: Added 'find-rev' command
      git-svn: Add 'find-rev' command

Alex Riesen (1):
      Fix handle leak in write_tree

Andrew Ruder (8):
      Removing -n option from git-diff-files documentation
      Document additional options for git-fetch
      Update git-fmt-merge documentation
      Update git-grep documentation
      Update -L documentation for git-blame/git-annotate
      Update git-http-push documentation
      Update git-local-fetch documentation
      Update git-http-fetch documentation

Brian Gernhardt (2):
      Reverse the order of -b and --track in the man page.
      Ignore all man sections as they are generated files.

Gerrit Pape (1):
      Documentation/git-reset.txt: suggest git commit --amend in example.

Jari Aalto (3):
      Clarify SubmittingPatches Checklist
      git.7: Mention preformatted html doc location
      send-email documentation: clarify --smtp-server

Johannes Schindelin (2):
      dir.c(common_prefix): Fix two bugs
      import-tars: be nice to wrong directory modes

Josh Triplett (3):
      Fix typo in git-am: s/Was is/Was it/
      Create a sysconfdir variable, and use it for ETC_GITCONFIG
      Add missing reference to GIT_COMMITTER_DATE in git-commit-tree documentation

Julian Phillips (1):
      http.c: Fix problem with repeated calls of http_init

Junio C Hamano (8):
      Build RPM with ETC_GITCONFIG=/etc/gitconfig
      applymbox & quiltimport: typofix.
      Start preparing for 1.5.1.3
      Do not barf on too long action description
      Update .mailmap with "Michael"
      Fix import-tars fix.
      Fix symlink handling in git-svn, related to PerlIO
      GIT v1.5.1.3

Michele Ballabio (1):
      git shortlog documentation: add long options and fix a typo

Robin H. Johnson (10):
      Document --dry-run parameter to send-email.
      Prefix Dry- to the message status to denote dry-runs.
      Debugging cleanup improvements
      Change the scope of the $cc variable as it is not needed outside of send_message.
      Perform correct quoting of recipient names.
      Validate @recipients before using it for sendmail and Net::SMTP.
      Ensure clean addresses are always used with Net::SMTP
      Allow users to optionally specify their envelope sender.
      Document --dry-run and envelope-sender for git-send-email.
      Sanitize @to recipients.

Shawn O. Pearce (3):
      Actually handle some-low memory conditions
      Don't allow empty pathnames in fast-import
      Catch empty pathnames in trees during fsck


-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

----- End forwarded message -----

-- 
MST


From mhagen at iol.unh.edu  Tue May  1 07:20:53 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Tue, 1 May 2007 10:20:53 -0400 (EDT)
Subject: [ofa-general] [PATCH] infiniband: modify ammasso driver to use send
 with invalidate
Message-ID: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>

Modification to the ammasso driver to use the iWARP verbs SEND with INV
and SEND with SE and INV.

--- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30 13:12:54.000000000 -0400
+++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30 16:24:38.000000000 -0400
@@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str

 		switch (ib_wr->opcode) {
 		case IB_WR_SEND:
-			if (ib_wr->send_flags & IB_SEND_SOLICITED) {
+			if (ib_wr->send_flags & IB_SEND_SOLICITED
+				&& ib_wr->send_flags & IB_SEND_INVALIDATE) {
+				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV);
+				wr.sqwr.send.remote_stag =
+					cpu_to_be32(ib_wr->wr.invalidate.rkey);
+			} else if (ib_wr->send_flags & IB_SEND_SOLICITED) {
 				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE);
-				msg_size = sizeof(struct c2wr_send_req);
+				wr.sqwr.send.remote_stag = 0;
+			} else if (ib_wr->send_flags & IB_SEND_INVALIDATE) {
+				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV);
+				wr.sqwr.send.remote_stag =
+					cpu_to_be32(ib_wr->wr.invalidate.rkey);
 			} else {
 				c2_wr_set_id(&wr, C2_WR_TYPE_SEND);
-				msg_size = sizeof(struct c2wr_send_req);
+				wr.sqwr.send.remote_stag = 0;
 			}

-			wr.sqwr.send.remote_stag = 0;
-			msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge;
+			msg_size = sizeof(struct c2wr_send_req);
+			msg_size += sizeof(struct c2_data_addr)	* ib_wr->num_sge;
 			if (ib_wr->num_sge > qp->send_sgl_depth) {
 				err = -EINVAL;
 				break;
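

The flag handling in the hunk above can be read as a small decision
table. The following standalone C sketch models it outside the kernel;
the flag bit positions follow the proposed ib_verbs.h change, while
enum wr_type, struct send_choice and choose_send are names invented for
this illustration and the enum values are arbitrary rather than taken
from the amso1100 headers.

```c
/* Send flag bits as numbered in the proposed ib_verbs.h change. */
#define SEND_SOLICITED   (1 << 2)
#define SEND_INVALIDATE  (1 << 4)

/* Stand-ins for the driver's work-request types (values arbitrary). */
enum wr_type {
    WR_TYPE_SEND,
    WR_TYPE_SEND_SE,
    WR_TYPE_SEND_INV,
    WR_TYPE_SEND_SE_INV
};

struct send_choice {
    enum wr_type type;
    unsigned int remote_stag;   /* 0 unless an STag is invalidated */
};

/* Model of the patched IB_WR_SEND branch: pick the work-request type
 * from the flag combination and carry the rkey only when invalidating.
 * The driver also byte-swaps the rkey with cpu_to_be32(); that step is
 * left out of this model. */
struct send_choice choose_send(int send_flags, unsigned int rkey)
{
    struct send_choice c = { WR_TYPE_SEND, 0 };
    int se  = (send_flags & SEND_SOLICITED) != 0;
    int inv = (send_flags & SEND_INVALIDATE) != 0;

    if (se && inv)
        c.type = WR_TYPE_SEND_SE_INV;
    else if (se)
        c.type = WR_TYPE_SEND_SE;
    else if (inv)
        c.type = WR_TYPE_SEND_INV;

    if (inv)
        c.remote_stag = rkey;
    return c;
}
```

Testing the combined case first matters: if the plain solicited check
came first, a request with both flags set would never reach the
invalidate variant.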


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From mhagen at iol.unh.edu  Tue May  1 07:22:34 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Tue, 1 May 2007 10:22:34 -0400 (EDT)
Subject: [ofa-general] Re: [PATCH] infiniband: add support for invalidate
	stag
In-Reply-To: <20070501035708.GJ13293@mellanox.co.il>
References: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu>
	<20070501035708.GJ13293@mellanox.co.il>
Message-ID: <36387.132.177.125.178.1178029354.squirrel@postal.iol.unh.edu>

I don't believe so. I just sent out modifications to the Ammasso driver on
another thread that might clear this up.  The modifications to the driver
should show how these new verbs could be used.

>> Quoting mhagen at iol.unh.edu <mhagen at iol.unh.edu>:
>> Subject: [PATCH] infiniband: add support for invalidate stag
>>
>> Patch to add support for the iWARP verbs SEND with INV and SEND with SE
>> and INV.
>>
>> --- linux-2.6.21.1/include/rdma/ib_verbs.h    2007-04-28 15:35:02.677618096 -0400
>> +++ linux-2.6.21.1/include/rdma/ib_verbs.h    2007-04-28 15:29:16.200290656 -0400
>> @@ -611,7 +611,8 @@ enum ib_send_flags {
>>      IB_SEND_FENCE        = 1,
>>      IB_SEND_SIGNALED    = (1<<1),
>>      IB_SEND_SOLICITED    = (1<<2),
>> -    IB_SEND_INLINE        = (1<<3)
>> +    IB_SEND_INLINE        = (1<<3),
>> +    IB_SEND_INVALIDATE    = (1<<4)
>>  };
>>
>>  struct ib_sge {
>> @@ -646,6 +647,9 @@ struct ib_send_wr {
>>              u16    pkey_index; /* valid for GSI only */
>>              u8    port_num;   /* valid for DR SMPs on switch only */
>>          } ud;
>> +        struct {
>> +            u32    rkey;
>> +        } invalidate;
>>      } wr;
>>  };
>
> Shouldn't this rather be part of rc wr?
>
> --
> MST
>


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From jlentini at netapp.com  Tue May  1 09:13:53 2007
From: jlentini at netapp.com (James Lentini)
Date: Tue, 1 May 2007 12:13:53 -0400 (EDT)
Subject: [ofa-general] Re: [PATCH] uDAPL OFED 1.2 RC2 build issue on ia64 and
	RHEL5 
In-Reply-To: <000001c78925$18867a10$ff0da8c0@amr.corp.intel.com>
References: <000001c78925$18867a10$ff0da8c0@amr.corp.intel.com>
Message-ID: <Pine.LNX.4.64.0705011213320.10530@jlentini-linux.nane.netapp.com>


On Fri, 27 Apr 2007, Arlin Davis wrote:

> Fixes build problems with ia64 and RHEL5 with atomic operations. 
> Patch was tested on ia64 RHEL4 and RHEL5 using dtest/dapltest.
> 
> James, can you review this before I push.

Looks good.


From loic at myri.com  Tue May  1 09:26:48 2007
From: loic at myri.com (Loic Prylli)
Date: Tue, 01 May 2007 12:26:48 -0400
Subject: [ofa-general] Re: IPoIB forwarding
In-Reply-To: <20070501015731.3568d28b.billfink@mindspring.com>
References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov>	<20070425124652.GG1624@mellanox.co.il>	<6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov>	<20070426161409.GF15540@mellanox.co.il>	<6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov>	<20070426180618.GJ15540@mellanox.co.il>	<6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov>	<46325DF3.2050203@hp.com>	<6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov>	<46327A07.1000404@hp.com>	<6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov>	<4632894D.40705@hp.com>	<20070428025117.a3b1200a.billfink@mindspring.com>	<46362244.9030406@hp.com>
	<20070501015731.3568d28b.billfink@mindspring.com>
Message-ID: <46376A48.3050102@myri.com>

On 5/1/2007 1:57 AM, Bill Fink wrote:
> On Mon, 30 Apr 2007, Rick Jones wrote:
>
>   
>> Ethtool -i on the interface reports 1.2.0 as the driver version.
>>     
>
> Perhaps it would be useful to have different version strings for
> the in-kernel Linux version and the Myricom externally provided
> version.  Just a thought.
>   


Indeed, and that is already the case as of the March-21 git (or any myri10ge
 version >= 1.3.0). The in-kernel version will show something like
1.3.0-1.226, while the external version will only show 1.3.0.


Loic


From monil at voltaire.com  Tue May  1 09:37:24 2007
From: monil at voltaire.com (Moni Levy)
Date: Tue, 1 May 2007 19:37:24 +0300
Subject: [ofa-general] Re: pkey change handling patch
In-Reply-To: <ada3b2nrlgl.fsf@cisco.com>
References: <20070417223547.GI25314@mellanox.co.il>
	<20070419203705.GA613@mellanox.co.il>
	<6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com>
	<adaabwwxrzw.fsf@cisco.com>
	<6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com>
	<20070426133442.GJ32513@mellanox.co.il>
	<6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com>
	<20070426134331.GL32513@mellanox.co.il>
	<6a122cc00704260653i706caaecu1b8265157805912@mail.gmail.com>
	<ada3b2nrlgl.fsf@cisco.com>
Message-ID: <6a122cc00705010937r162b53b1jafad6d7b8055bea5@mail.gmail.com>

On 4/26/07, Roland Dreier <rdreier at cisco.com> wrote:
>  > > Let's do it over query_pkey/query_port for now.
>  > > Long term providers will just optimize these I think.
>  >
>  > How ? Caching at device driver level ?
>
> Yes... for the most part, it should be much easier to do within the
> driver.  For example mthca, mlx4 and ipath at least know exactly when
> the P_Key table is being changed and can just snoop the operation
> without needing to worry about deferring things to a workqueue, etc.
>
> ehca seems to have a hypercall that returns the whole P_Key table in
> one go.
>
> I think it would be fine to change the interface to something like
>
>         query_pkey(struct ib_device *dev, u8 port, u16 start_index,
>                    u16 num_pkeys, u16 *pkey)
>
> that returns a block of P_Keys in one go, but I don't see it as that critical.

That's exactly what I meant, and yes I agree it's not urgent.
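
As a rough illustration of the block-style query discussed above, here is a
user-space sketch, not the kernel interface. The names struct pkey_cache,
PKEY_TABLE_LEN and query_pkey_block are made up for this example, and the
table size is an arbitrary assumption. It only shows the idea of returning a
run of P_Keys from a cached table in one call, with a bounds check on the
requested range:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_TABLE_LEN 8	/* hypothetical table size for this sketch */

/* Hypothetical stand-in for a driver's cached per-port P_Key table. */
struct pkey_cache {
	uint16_t table[PKEY_TABLE_LEN];
};

/*
 * Copy num_pkeys entries starting at start_index into pkey[].
 * Returns 0 on success, -EINVAL if an argument is NULL or the
 * requested range falls outside the table.
 */
int query_pkey_block(const struct pkey_cache *cache, uint16_t start_index,
		     uint16_t num_pkeys, uint16_t *pkey)
{
	size_t i;

	if (!cache || !pkey)
		return -EINVAL;
	if ((size_t)start_index + num_pkeys > PKEY_TABLE_LEN)
		return -EINVAL;

	for (i = 0; i < num_pkeys; i++)
		pkey[i] = cache->table[start_index + i];
	return 0;
}
```

A caller that wants the whole table would pass start_index 0 and num_pkeys
equal to the table length, instead of looping over single-index queries.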

>


From swise at opengridcomputing.com  Tue May  1 10:18:56 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 01 May 2007 12:18:56 -0500
Subject: [ofa-general] Requesting CQ notifications
In-Reply-To: <adaodlbvqv2.fsf@cisco.com>
References: <462FD3F7.1010304@evergrid.com>  <adaodlbvqv2.fsf@cisco.com>
Message-ID: <1178039936.2309.67.camel@stevo-desktop>

On Wed, 2007-04-25 at 18:58 -0700, Roland Dreier wrote:
>  > Is there a differentiation between multiple CQEs already being in the CQ
>  > vs. CQEs arriving in the CQ when using completion
>  > notifications?
>  > 
>  > For example, assume I have the following order of events:
>  > 
>  > 
>  > 	2 CQEs arrive
>  > 
>  > 	select() returns readable for comp. channel
>  > 
>  > 	ibv_get_cq_event() returns event
>  > 
>  > 	ibv_req_notify_cq(cq, 0)
>  > 
>  > 	ibv_poll_cq(cq, 1, &cqe) returns 1
>  > 
>  > 	ibv_ack_cq_events(cq, 1)
>  > 
>  > 
>  > Will the comp. channel receive another event for the second CQE even
>  > if it had arrived before ibv_req_notify_cq() was called?
> 
> This is really an ill-posed question: according to the semantics
> defined by the verbs spec, the presence or absence of the second CQE
> is not defined until you poll the CQ again.
> 
> In practice we can look at what real hardware does, and the answer is
> "it depends."  Some adapters (eg mthca, mlx4) will generate an event
> immediately if ibv_req_notify_cq() is called for a CQ that contains an
> unpolled CQE, while other adapters (eg ipath, ehca) will only generate
> an event when a CQE is added after the call to ibv_req_notify_cq().
> 

cxgb3 behaves like ipath/ehca, i.e. only the arrival of a new CQE
generates the notification event.
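
To make the difference concrete, here is a toy model of the two behaviors
described above. It is not the libibverbs API: struct sim_cq and the sim_*
helpers are invented for this sketch, and the "event fires immediately and
the CQ stays unarmed" behavior is an assumption about how the first class of
adapters acts. It suggests that re-arming and then polling until the CQ is
empty leaves no stranded CQE in either model, while polling only one CQE (as
in the sequence from the question) can leave one behind with no event
pending:

```c
#include <stdbool.h>

/* Toy model of a completion queue; not the libibverbs API. */
struct sim_cq {
	int pending;		/* CQEs waiting to be polled */
	bool armed;		/* a notification has been requested */
	bool event_on_arm;	/* true: event if CQEs are present when armed */
	int events;		/* events delivered on the completion channel */
};

/* A new CQE arrives; an armed CQ delivers one event and disarms. */
void sim_add_cqe(struct sim_cq *cq)
{
	cq->pending++;
	if (cq->armed) {
		cq->armed = false;
		cq->events++;
	}
}

/* Model of requesting a notification for the next completion. */
void sim_req_notify(struct sim_cq *cq)
{
	if (cq->event_on_arm && cq->pending > 0) {
		cq->events++;	/* fires right away; CQ stays unarmed */
		return;
	}
	cq->armed = true;
}

/* Poll a single CQE; returns 1 if one was consumed, 0 if the CQ is empty. */
int sim_poll(struct sim_cq *cq)
{
	if (cq->pending == 0)
		return 0;
	cq->pending--;
	return 1;
}

/* Re-arm first, then poll until empty. Returns the number of CQEs handled. */
int handle_event_drain(struct sim_cq *cq)
{
	int handled = 0;

	sim_req_notify(cq);
	while (sim_poll(cq))
		handled++;
	return handled;
}

/* The sequence from the question: re-arm, then poll only one CQE. */
int handle_event_poll_one(struct sim_cq *cq)
{
	sim_req_notify(cq);
	return sim_poll(cq);
}
```

In the drain variant the adapters that fire on arming may deliver one extra
event for which the handler finds nothing to poll, which is harmless as long
as the handler tolerates an empty CQ.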


>  - R.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From swise at opengridcomputing.com  Tue May  1 10:26:32 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 01 May 2007 12:26:32 -0500
Subject: [ofa-general] Re: [PATCH] infiniband: add support for
	invalidate stag
In-Reply-To: <20070501035708.GJ13293@mellanox.co.il>
References: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu>
	<20070501035708.GJ13293@mellanox.co.il>
Message-ID: <1178040392.2309.72.camel@stevo-desktop>

On Tue, 2007-05-01 at 06:57 +0300, Michael S. Tsirkin wrote:
> > Quoting mhagen at iol.unh.edu <mhagen at iol.unh.edu>:
> > Subject: [PATCH] infiniband: add support for invalidate stag
> > 
> > Patch to add support for the iWARP verbs SEND with INV and SEND with SE
> > and INV.
> > 
> > --- linux-2.6.21.1/include/rdma/ib_verbs.h    2007-04-28
> > 15:35:02.677618096 -0400
> > +++ linux-2.6.21.1/include/rdma/ib_verbs.h    2007-04-28
> > 15:29:16.200290656 -0400
> > @@ -611,7 +611,8 @@ enum ib_send_flags {
> >      IB_SEND_FENCE        = 1,
> >      IB_SEND_SIGNALED    = (1<<1),
> >      IB_SEND_SOLICITED    = (1<<2),
> > -    IB_SEND_INLINE        = (1<<3)
> > +    IB_SEND_INLINE        = (1<<3),
> > +    IB_SEND_INVALIDATE    = (1<<4)
> >  };
> > 
> >  struct ib_sge {
> > @@ -646,6 +647,9 @@ struct ib_send_wr {
> >              u16    pkey_index; /* valid for GSI only */
> >              u8    port_num;   /* valid for DR SMPs on switch only */
> >          } ud;
> > +        struct {
> > +            u32    rkey;
> > +        } invalidate;
> >      } wr;
> >  };
> 
> Shouldn't this rather be part of rc wr?

What's an rc wr?

He added the invalidate struct to the union part of the ib_send_wr.  It's
analogous to the rdma struct in that union, in that it carries the additional
values passed in the send wr for an iWARP send w/invalidate and send-se
w/invalidate.  That seems reasonable to me...
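
For readers who have not looked at the structure, a stripped-down sketch of
the idea follows. The flag values match the quoted patch; struct send_wr,
enum wire_type and pick_wire_type are illustrative names only and not the
kernel definitions, and the sketch skips the byte-order conversion a real
driver would do. The rkey lives in its own union member and is only
meaningful when the invalidate flag is set:

```c
#include <stdint.h>

/* Flag values as in the quoted patch; the rest is illustrative. */
enum send_flags {
	SEND_FENCE      = 1,
	SEND_SIGNALED   = 1 << 1,
	SEND_SOLICITED  = 1 << 2,
	SEND_INLINE     = 1 << 3,
	SEND_INVALIDATE = 1 << 4,
};

/* Hypothetical on-the-wire send variants. */
enum wire_type {
	WIRE_SEND,
	WIRE_SEND_SE,
	WIRE_SEND_INV,
	WIRE_SEND_SE_INV,
};

/* Simplified work request: one union member per kind of extra data. */
struct send_wr {
	int send_flags;
	union {
		struct {
			uint64_t remote_addr;
			uint32_t rkey;
		} rdma;
		struct {
			uint32_t rkey;	/* STag to invalidate at the peer */
		} invalidate;
	} wr;
};

/*
 * Choose the send variant from the flags and report the STag to put
 * in the request (0 when no invalidate was asked for).
 */
enum wire_type pick_wire_type(const struct send_wr *wr, uint32_t *stag_out)
{
	int se = wr->send_flags & SEND_SOLICITED;
	int inv = wr->send_flags & SEND_INVALIDATE;

	*stag_out = inv ? wr->wr.invalidate.rkey : 0;
	if (se && inv)
		return WIRE_SEND_SE_INV;
	if (se)
		return WIRE_SEND_SE;
	if (inv)
		return WIRE_SEND_INV;
	return WIRE_SEND;
}
```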


> 


From swise at opengridcomputing.com  Tue May  1 10:34:11 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 01 May 2007 12:34:11 -0500
Subject: [ofa-general] [PATCH] infiniband: modify ammasso driver to use
	send with invalidate
In-Reply-To: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
Message-ID: <1178040851.2309.75.camel@stevo-desktop>

The code looks correct.  

I'd make the msg_size lines just one statement:

	msg_size = sizeof(struct c2wr_send_req) +
		   sizeof(struct c2_data_addr) * ib_wr->num_sge;

Have you tested that it works? 


Steve.


On Tue, 2007-05-01 at 10:20 -0400, mhagen at iol.unh.edu wrote:
> Modification to the ammasso driver to use the iWARP verbs SEND with INV
> and SEND with SE and INV.
> 
> --- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30
> 13:12:54.000000000 -0400
> +++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30
> 16:24:38.000000000 -0400
> @@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str
> 
>  		switch (ib_wr->opcode) {
>  		case IB_WR_SEND:
> -			if (ib_wr->send_flags & IB_SEND_SOLICITED) {
> +			if (ib_wr->send_flags & IB_SEND_SOLICITED
> +				&& ib_wr->send_flags & IB_SEND_INVALIDATE) {
> +				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV);
> +				wr.sqwr.send.remote_stag =
> +					cpu_to_be32(ib_wr->wr.invalidate.rkey);
> +			} else if (ib_wr->send_flags & IB_SEND_SOLICITED) {
>  				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE);
> -				msg_size = sizeof(struct c2wr_send_req);
> +				wr.sqwr.send.remote_stag = 0;
> +			} else if (ib_wr->send_flags & IB_SEND_INVALIDATE) {
> +				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV);
> +				wr.sqwr.send.remote_stag =
> +					cpu_to_be32(ib_wr->wr.invalidate.rkey);
>  			} else {
>  				c2_wr_set_id(&wr, C2_WR_TYPE_SEND);
> -				msg_size = sizeof(struct c2wr_send_req);
> +				wr.sqwr.send.remote_stag = 0;
>  			}
> 
> -			wr.sqwr.send.remote_stag = 0;
> -			msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge;
> +			msg_size = sizeof(struct c2wr_send_req);
> +			msg_size += sizeof(struct c2_data_addr)	* ib_wr->num_sge;
>  			if (ib_wr->num_sge > qp->send_sgl_depth) {
>  				err = -EINVAL;
>  				break;
> 
> 
> 


From mhagen at iol.unh.edu  Tue May  1 10:39:00 2007
From: mhagen at iol.unh.edu (Mikkel Hagen)
Date: Tue, 01 May 2007 13:39:00 -0400
Subject: [ofa-general] [PATCH] infiniband: modify ammasso driver to use
	send with invalidate
In-Reply-To: <1178040851.2309.75.camel@stevo-desktop>
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
	<1178040851.2309.75.camel@stevo-desktop>
Message-ID: <46377B34.9040605@iol.unh.edu>

I don't believe that we can make it into one line as Roland pointed out 
earlier - it introduces an accumulation bug because it is within a while 
loop.

Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium	
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


Steve Wise wrote:
> The code looks correct.  
>
> I'd make the msg_size lines just one statement:
>
> 	msg_size = sizeof(struct c2wr_send_req) +
> 		   sizeof(struct c2_data_addr) * ib_wr->num_sge;
>
> Have you tested that it works? 
>
>
> Steve.
>
>
> On Tue, 2007-05-01 at 10:20 -0400, mhagen at iol.unh.edu wrote:
>   
>> Modification to the ammasso driver to use the iWARP verbs SEND with INV
>> and SEND with SE and INV.
>>
>> --- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30
>> 13:12:54.000000000 -0400
>> +++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30
>> 16:24:38.000000000 -0400
>> @@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str
>>
>>  		switch (ib_wr->opcode) {
>>  		case IB_WR_SEND:
>> -			if (ib_wr->send_flags & IB_SEND_SOLICITED) {
>> +			if (ib_wr->send_flags & IB_SEND_SOLICITED
>> +				&& ib_wr->send_flags & IB_SEND_INVALIDATE) {
>> +				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV);
>> +				wr.sqwr.send.remote_stag =
>> +					cpu_to_be32(ib_wr->wr.invalidate.rkey);
>> +			} else if (ib_wr->send_flags & IB_SEND_SOLICITED) {
>>  				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE);
>> -				msg_size = sizeof(struct c2wr_send_req);
>> +				wr.sqwr.send.remote_stag = 0;
>> +			} else if (ib_wr->send_flags & IB_SEND_INVALIDATE) {
>> +				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV);
>> +				wr.sqwr.send.remote_stag =
>> +					cpu_to_be32(ib_wr->wr.invalidate.rkey);
>>  			} else {
>>  				c2_wr_set_id(&wr, C2_WR_TYPE_SEND);
>> -				msg_size = sizeof(struct c2wr_send_req);
>> +				wr.sqwr.send.remote_stag = 0;
>>  			}
>>
>> -			wr.sqwr.send.remote_stag = 0;
>> -			msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge;
>> +			msg_size = sizeof(struct c2wr_send_req);
>> +			msg_size += sizeof(struct c2_data_addr)	* ib_wr->num_sge;
>>  			if (ib_wr->num_sge > qp->send_sgl_depth) {
>>  				err = -EINVAL;
>>  				break;
>>
>>
>>
>>     


From swise at opengridcomputing.com  Tue May  1 10:43:24 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 01 May 2007 12:43:24 -0500
Subject: [ofa-general] [PATCH] infiniband: modify ammasso driver to use
	send with invalidate
In-Reply-To: <46377B34.9040605@iol.unh.edu>
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
	<1178040851.2309.75.camel@stevo-desktop> <46377B34.9040605@iol.unh.edu>
Message-ID: <1178041404.2309.80.camel@stevo-desktop>

No, the accumulation bug was because you were always doing a +=.

> >
> > 	msg_size = sizeof(struct c2wr_send_req) +
> > 		   sizeof(struct c2_data_addr) * ib_wr->num_sge;

This always assigns to msg_size.
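
A small stand-alone illustration of the difference, assuming a loop over
several work requests. REQ_HDR_SIZE and DATA_ADDR_SIZE are made-up stand-ins
for the two sizeof() values, and both function names are hypothetical:

```c
#include <stddef.h>

#define REQ_HDR_SIZE   32	/* hypothetical sizeof(struct c2wr_send_req) */
#define DATA_ADDR_SIZE 16	/* hypothetical sizeof(struct c2_data_addr) */

/*
 * Accumulating form: msg_size is only ever updated with +=, so each
 * request inherits the total of all the requests before it.
 */
void sizes_accumulating(const int *num_sge, size_t n, size_t *out)
{
	size_t msg_size = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		msg_size += REQ_HDR_SIZE;
		msg_size += DATA_ADDR_SIZE * (size_t)num_sge[i];
		out[i] = msg_size;
	}
}

/*
 * Single-statement form: a plain assignment recomputes msg_size on
 * every pass, so nothing carries over between requests.
 */
void sizes_assigned(const int *num_sge, size_t n, size_t *out)
{
	size_t msg_size;
	size_t i;

	for (i = 0; i < n; i++) {
		msg_size = REQ_HDR_SIZE +
			   DATA_ADDR_SIZE * (size_t)num_sge[i];
		out[i] = msg_size;
	}
}
```

With two requests of 1 and 2 SGEs, the accumulating form reports 48 and then
112 bytes, while the assigned form reports 48 and 64.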


From mhagen at iol.unh.edu  Tue May  1 11:05:58 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Tue, 1 May 2007 14:05:58 -0400 (EDT)
Subject: [ofa-general] [PATCH] infiniband: modify ammasso driver to use 
	send with invalidate
In-Reply-To: <1178041404.2309.80.camel@stevo-desktop>
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
	<1178040851.2309.75.camel@stevo-desktop> <46377B34.9040605@iol.unh.edu>
	<1178041404.2309.80.camel@stevo-desktop>
Message-ID: <47342.132.177.125.178.1178042758.squirrel@postal.iol.unh.edu>

Good point!

--- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30
13:12:54.000000000 -0400
+++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-05-01
14:04:07.000000000 -0400
@@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str

 		switch (ib_wr->opcode) {
 		case IB_WR_SEND:
-			if (ib_wr->send_flags & IB_SEND_SOLICITED) {
+			if (ib_wr->send_flags & IB_SEND_SOLICITED
+				&& ib_wr->send_flags & IB_SEND_INVALIDATE) {
+				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV);
+				wr.sqwr.send.remote_stag =
+					cpu_to_be32(ib_wr->wr.invalidate.rkey);
+			} else if (ib_wr->send_flags & IB_SEND_SOLICITED) {
 				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE);
-				msg_size = sizeof(struct c2wr_send_req);
+				wr.sqwr.send.remote_stag = 0;
+			} else if (ib_wr->send_flags & IB_SEND_INVALIDATE) {
+				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV);
+				wr.sqwr.send.remote_stag =
+					cpu_to_be32(ib_wr->wr.invalidate.rkey);
 			} else {
 				c2_wr_set_id(&wr, C2_WR_TYPE_SEND);
-				msg_size = sizeof(struct c2wr_send_req);
+				wr.sqwr.send.remote_stag = 0;
 			}

-			wr.sqwr.send.remote_stag = 0;
-			msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge;
+			msg_size = sizeof(struct c2wr_send_req) +
+				sizeof(struct c2_data_addr) * ib_wr->num_sge;
 			if (ib_wr->num_sge > qp->send_sgl_depth) {
 				err = -EINVAL;
 				break;


> No, the accumulation bug was because you were always doing a +=.
>
>> >
>> > 	msg_size = sizeof(struct c2wr_send_req) +
>> > 		   sizeof(struct c2_data_addr) * ib_wr->num_sge;
>
> This always assigns to msg_size.
>
>
>


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From etta at systemfabricworks.com  Tue May  1 11:58:35 2007
From: etta at systemfabricworks.com (Chieng Etta)
Date: Tue, 1 May 2007 13:58:35 -0500
Subject: [ofa-general] OFED 1.2-rc2 - Multipath failover stress test results
Message-ID: <000b01c78c22$b8b84810$c801a8c0@ettac>

All,

 
SFW has completed the SRP multipath failover stress test on the following
builds and OSes.

*	OFED 1.2-rc2 - SLES10 x86 and SLES10 x86_64
*	04192007-0600 - RHEL 5 x86_64. 

 
The I/O was running on each platform for more than 10 hours during the
failovers.  No I/O error occurred.  Please see attached for details.

 
Thanks,

Etta  

-------------- next part --------------
A non-text attachment was scrubbed...
Name: OFED1 2rc2_multipath_test_report.xls
Type: application/vnd.ms-excel
Size: 28672 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070501/07fa2013/attachment.xls>

From jwong at datallegro.com  Tue May  1 13:10:40 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Tue, 1 May 2007 16:10:40 -0400
Subject: [ofa-general] Errors after install for openibd start
Message-ID: <A382D4292574EB47A85B8159A6AED1A101131963@FPNYEXCBE02.opus-i.corp>

Hello,

I have successfully run the ./install.sh script with kernel
2.6.18-8.1.1.el5 

I did not reboot the machine.

After installing and configuring the ports using the defaults, I tried
to execute the command:

/etc/init.d/openibd start

 
I have truncated the errors to show an example.

 
Any suggestions?

Thanks,

 
Jeff

 
I am getting the following errors:

 
ib_ipath: disagrees about version of symbol ib_unregister_device
ib_ipath: Unknown symbol ib_unregister_device
ib_ipath: disagrees about version of symbol ib_register_device
ib_ipath: Unknown symbol ib_register_device
ib_ipath: disagrees about version of symbol ib_dispatch_event
ib_ipath: Unknown symbol ib_dispatch_event
ib_ipath: disagrees about version of symbol ib_dealloc_device
ib_ipath: Unknown symbol ib_dealloc_device
ib_ipath: disagrees about version of symbol ib_alloc_device
ib_ipath: Unknown symbol ib_alloc_device
ib_ipoib: disagrees about version of symbol ib_unregister_client
ib_ipoib: Unknown symbol ib_unregister_client
ib_ipoib: disagrees about version of symbol ib_create_cq
ib_ipoib: Unknown symbol ib_create_cq
ib_ipoib: Unknown symbol ib_sa_register_client
ib_ipoib: disagrees about version of symbol ib_cm_listen
ib_ipoib: Unknown symbol ib_cm_listen


From sweitzen at cisco.com  Tue May  1 13:38:55 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 1 May 2007 13:38:55 -0700
Subject: [ofa-general] Errors after install for openibd start
In-Reply-To: <A382D4292574EB47A85B8159A6AED1A101131963@FPNYEXCBE02.opus-i.corp>
References: <A382D4292574EB47A85B8159A6AED1A101131963@FPNYEXCBE02.opus-i.corp>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303765EB7@xmb-sjc-216.amer.cisco.com>

Try rebooting, and see if it still happens.
 
Scott
 
________________________________

From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jeffrey Wong
Sent: Tuesday, May 01, 2007 1:11 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] Errors after install for openibd start


	Hello,

	I have successfully run the ./install.sh script with kernel
2.6.18-8.1.1.el5 

	I did not reboot the machine.

	After installing and configuring the ports using the defaults, I
tried to execute the command:

	/etc/init.d/openibd start

	 
	I have truncated the errors to show an example.

	 
	Any suggestions?

	Thanks,

	 
	Jeff

	 
	I am getting the following errors:

	 
	ib_ipath: disagrees about version of symbol ib_unregister_device

	ib_ipath: Unknown symbol ib_unregister_device

	ib_ipath: disagrees about version of symbol ib_register_device

	ib_ipath: Unknown symbol ib_register_device

	ib_ipath: disagrees about version of symbol ib_dispatch_event

	ib_ipath: Unknown symbol ib_dispatch_event

	ib_ipath: disagrees about version of symbol ib_dealloc_device

	ib_ipath: Unknown symbol ib_dealloc_device

	ib_ipath: disagrees about version of symbol ib_alloc_device

	ib_ipath: Unknown symbol ib_alloc_device

	ib_ipoib: disagrees about version of symbol ib_unregister_client

	ib_ipoib: Unknown symbol ib_unregister_client

	ib_ipoib: disagrees about version of symbol ib_create_cq

	ib_ipoib: Unknown symbol ib_create_cq

	ib_ipoib: Unknown symbol ib_sa_register_client

	ib_ipoib: disagrees about version of symbol ib_cm_listen

	ib_ipoib: Unknown symbol ib_cm_listen


From swise at opengridcomputing.com  Tue May  1 13:46:44 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 01 May 2007 15:46:44 -0500
Subject: [ofa-general] requisite problems with rhel5
Message-ID: <1178052404.2309.132.camel@stevo-desktop>

I just noticed a false prereq problem when running build.sh from
ofed-1.2 on rhel5 _with_ a kernel.org kernel installed.  Here's the
issue:

build_env.sh checks for the existence of /etc/redhat-release to determine if
the distro is from redhat.  If that file exists -and- the kernel
`uname -r` ends in *el5, then it sets DISTRIBUTION=redhat5.  Otherwise
it sets DISTRIBUTION=redhat.

The mvapich2 prereq stuff that checks for sysfsutils-devel vs
libsysfs-devel uses the DISTRIBUTION variable to decide which rpm it
needs to prereq.  However, on my system, the distro _is_ RHEL5, but I
installed 2.6.20.8 on it.  So the prereqs fail because build_env.sh
incorrectly picks redhat as the distro instead of redhat5.

I think build_env.sh shouldn't use the kernel uname to determine if the
distro is redhat5.  Rather, it should grok the contents
of /etc/redhat-release to determine if it's rhel5 or not...

Is this worthy of fixing for 1.2?


Steve.


From mst at dev.mellanox.co.il  Tue May  1 13:51:38 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 1 May 2007 23:51:38 +0300
Subject: [ofa-general] Re: [PATCH] infiniband: add support for invalidate
	stag
In-Reply-To: <1178040392.2309.72.camel@stevo-desktop>
References: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu>
	<20070501035708.GJ13293@mellanox.co.il>
	<1178040392.2309.72.camel@stevo-desktop>
Message-ID: <20070501205138.GG8447@mellanox.co.il>

> He added the invalidate struct to the union part of the ib_send_wr.  It's
> analogous to the rdma struct in that union, in that it carries the additional
> values passed in the send wr for an iWARP send w/invalidate and send-se
> w/invalidate.  That seems reasonable to me...

I have re-read the patch, and I agree it's reasonable.

-- 
MST


From swise at opengridcomputing.com  Tue May  1 13:58:03 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 01 May 2007 15:58:03 -0500
Subject: [ofa-general] requisite problems with rhel5
In-Reply-To: <1178052404.2309.132.camel@stevo-desktop>
References: <1178052404.2309.132.camel@stevo-desktop>
Message-ID: <1178053083.2309.137.camel@stevo-desktop>


On Tue, 2007-05-01 at 15:46 -0500, Steve Wise wrote:

> I think build_env.sh shouldn't use the kernel uname to determine if the
> distro is redhat5.  Rather, it should grok the contents
> of /etc/redhat-release to determine if it's rhel5 or not...
> 
> Is this worthy of fixing for 1.2?

Maybe this?

--- build_env.sh.org	2007-05-01 10:53:22.000000000 -0500
+++ build_env.sh	2007-05-01 10:54:54.000000000 -0500
@@ -288,8 +288,8 @@ elif [ -f /etc/fedora-release ]; then
 elif [ -f /etc/rocks-release ]; then
     DISTRIBUTION="Rocks"
 elif [ -f /etc/redhat-release ]; then
-        case ${K_VER} in
-                *el5*)
+        case $(cat /etc/redhat-release) in
+                *"release 5"*)
                 DISTRIBUTION="redhat5"
                 ;;
                 *)


From mst at dev.mellanox.co.il  Tue May  1 14:02:52 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 00:02:52 +0300
Subject: [ofa-general] Fwd: ipath-23-srp-limit-queued-commands.patch,
Message-ID: <20070501210252.GH8447@mellanox.co.il>


> ----- Forwarded message from Vu Pham <vuhuong at mellanox.com> -----
> 
> Subject: ipath-23-srp-limit-queued-commands.patch,
> Date: Tue, 1 May 2007 19:30:48 +0300
> From: Vu Pham <vuhuong at mellanox.com>
> 
> Hi,
>    This patch,
> kernel_patches/fixes/ipath-23-srp-limit-queued-commands.patch,
> which changes .can_queue = SRP_SQ_SIZE   to   .can_queue = 1,
> hurts our srp performance. It limits our srp performance to
> ~210 MB/s - without it our srp performance can reach 1.35
> GB/s using the same configuration.
> 
> Please remove it or apply it only for those who choose ipath as
> their hw
> 
> thanks,
> -vu
> 
> ----- End forwarded message -----

I missed the fact that the patch with ipath- prefix actually
changed SRP for all devices. How come?

The comment says:

	Limit the number of queued SCSI commands (over SRP) to one.

	This patch works around a limitation that requires the number of SRP
	requests in progress to one. Further investigation of this limitation
	continues.

	From: Jeremy Brown <jeremy.brown at qlogic.com>

And this was queued on Feb 7, apparently with no progress in the investigation
since. So not only is this not doing the right thing, AFAIK no problem was
reported publicly, and no one seems likely to find out why it is somehow
helpful either.

At this point I'm inclined to think the right thing is to remove this hack.
We can add a module parameter to limit the number of commands if
that's the only thing that makes qlogic hardware tick.

Bryan, are there other such hacks? I really expect the ipath-XX series to
only touch the ipath driver.

-- 
MST


From pradeep at us.ibm.com  Tue May  1 14:04:32 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Tue, 1 May 2007 14:04:32 -0700
Subject: [ofa-general] please review IPOIB CM NOSRQ patch
Message-ID: <OF99ABDD94.5823E4CB-ON882572CE.00738199-882572CE.0073D2E2@us.ibm.com>

Would one of you have the bandwidth to review the IPOIB CM NOSRQ patch
(v3) that I submitted last week?

Pradeep
pradeep at us.ibm.com


From sashak at voltaire.com  Tue May  1 14:06:35 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 2 May 2007 00:06:35 +0300
Subject: [ofa-general] Fwd: [ANNOUNCE] GIT 1.5.1.3
References: <20070501125321.GC8447@mellanox.co.il>
Message-ID: <39C75744D164D948A170E9792AF8E7CA185D3F@exil.voltaire.com>

We will upgrade after Sonoma.
 
Sasha

________________________________

From: general-bounces at lists.openfabrics.org on behalf of Michael S. Tsirkin
Sent: Tue 5/1/2007 3:53 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] Fwd: [ANNOUNCE] GIT 1.5.1.3


FYI.
I think we want to update - the mmap fix looks important enough.
Sasha?

----- Forwarded message from Junio C Hamano <junkio at cox.net> -----

Subject: [ANNOUNCE] GIT 1.5.1.3
Date: Tue, 1 May 2007 06:08:58 +0300
From: Junio C Hamano <junkio at cox.net>

The latest maintenance release GIT 1.5.1.3 is available at the
usual places:

  http://www.kernel.org/pub/software/scm/git/

  git-1.5.1.3.tar.{gz,bz2}                      (tarball)
  git-htmldocs-1.5.1.3.tar.{gz,bz2}             (preformatted docs)
  git-manpages-1.5.1.3.tar.{gz,bz2}             (preformatted docs)
  RPMS/$arch/git-*-1.5.1.3-1.$arch.rpm  (RPM)

GIT v1.5.1.3 Release Notes
==========================

Fixes since v1.5.1.2
--------------------

* Bugfixes

  - git-add tried to optimize by finding common leading
    directories across its arguments but botched, causing very
    confused behaviour.

  - unofficial rpm.spec file shipped with git was letting
    ETC_GITCONFIG set to /usr/etc/gitconfig.  Tweak the official
    Makefile to make it harder for distro people to make the
    same mistake, by setting the variable to /etc/gitconfig if
    prefix is set to /usr.

  - git-svn inconsistently stripped away username from the URL
    only when svnsync_props was in use.

  - git-svn got confused when handling symlinks on Mac OS.

  - git-send-email was not quoting recipient names that have
    period '.' in them.  Also it did not allow overriding
    envelope sender, which made it impossible to send patches to
    certain subscriber-only lists.

  - built-in write_tree() routine had a sequence that renamed a
    file that is still open, which some systems did not like.

  - when memory is very tight, sliding mmap code to read
    packfiles incorrectly closed the fd that was still being
    used to read the pack.

  - import-tars contributed front-end for fastimport was passing
    wrong directory modes without checking.

  - git-fastimport trusted its input too much and allowed to
    create corrupt tree objects with entries without a name.

  - git-fetch needlessly barfed when too long reflog action
    description was given by the caller.

Also contains various documentation updates.

----------------------------------------------------------------

Changes since v1.5.1.2 are as follows:

Adam Roben (5):
      Remove usernames from all commit messages, not just when using svmprops
      git-svn: Don't rely on $_ after making a function call
      git-svn: Ignore usernames in URLs in find_by_url
      git-svn: Added 'find-rev' command
      git-svn: Add 'find-rev' command

Alex Riesen (1):
      Fix handle leak in write_tree

Andrew Ruder (8):
      Removing -n option from git-diff-files documentation
      Document additional options for git-fetch
      Update git-fmt-merge documentation
      Update git-grep documentation
      Update -L documentation for git-blame/git-annotate
      Update git-http-push documentation
      Update git-local-fetch documentation
      Update git-http-fetch documentation

Brian Gernhardt (2):
      Reverse the order of -b and --track in the man page.
      Ignore all man sections as they are generated files.

Gerrit Pape (1):
      Documentation/git-reset.txt: suggest git commit --amend in example.

Jari Aalto (3):
      Clarify SubmittingPatches Checklist
      git.7: Mention preformatted html doc location
      send-email documentation: clarify --smtp-server

Johannes Schindelin (2):
      dir.c(common_prefix): Fix two bugs
      import-tars: be nice to wrong directory modes

Josh Triplett (3):
      Fix typo in git-am: s/Was is/Was it/
      Create a sysconfdir variable, and use it for ETC_GITCONFIG
      Add missing reference to GIT_COMMITTER_DATE in git-commit-tree documentation

Julian Phillips (1):
      http.c: Fix problem with repeated calls of http_init

Junio C Hamano (8):
      Build RPM with ETC_GITCONFIG=/etc/gitconfig
      applymbox & quiltimport: typofix.
      Start preparing for 1.5.1.3
      Do not barf on too long action description
      Update .mailmap with "Michael"
      Fix import-tars fix.
      Fix symlink handling in git-svn, related to PerlIO
      GIT v1.5.1.3

Michele Ballabio (1):
      git shortlog documentation: add long options and fix a typo

Robin H. Johnson (10):
      Document --dry-run parameter to send-email.
      Prefix Dry- to the message status to denote dry-runs.
      Debugging cleanup improvements
      Change the scope of the $cc variable as it is not needed outside of send_message.
      Perform correct quoting of recipient names.
      Validate @recipients before using it for sendmail and Net::SMTP.
      Ensure clean addresses are always used with Net::SMTP
      Allow users to optionally specify their envelope sender.
      Document --dry-run and envelope-sender for git-send-email.
      Sanitize @to recipients.

Shawn O. Pearce (3):
      Actually handle some-low memory conditions
      Don't allow empty pathnames in fast-import
      Catch empty pathnames in trees during fsck


-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

----- End forwarded message -----

--
MST


From mst at dev.mellanox.co.il  Tue May  1 14:11:04 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 00:11:04 +0300
Subject: [ofa-general] Re: please review IPOIB CM NOSRQ patch
In-Reply-To: <OF99ABDD94.5823E4CB-ON882572CE.00738199-882572CE.0073D2E2@us.ibm.com>
References: <OF99ABDD94.5823E4CB-ON882572CE.00738199-882572CE.0073D2E2@us.ibm.com>
Message-ID: <20070501211104.GK8447@mellanox.co.il>

> Quoting Pradeep Satyanarayana <pradeep at us.ibm.com>:
> Subject: please review IPOIB CM NOSRQ patch
> 
> Would one of you have the bandwidth to review the IPOIB CM NOSRQ patch 
> (v3) that I submitted last week.
> 
> Pradeep
> pradeep at us.ibm.com

OK, but could you please send a version that isn't line-wrapped?

Take a look at how it's formatted e.g. here:
http://article.gmane.org/gmane.linux.drivers.openib/39021


-- 
MST


From tziporet at dev.mellanox.co.il  Tue May  1 14:13:52 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 01 May 2007 14:13:52 -0700
Subject: [ofa-general] Release version of Ofed v1.2
In-Reply-To: <9BEB932202A05B488722B05D2374A1DA03C7F87A@mtv-amer001e--3.americas.sgi.com>
References: <9BEB932202A05B488722B05D2374A1DA03C7F87A@mtv-amer001e--3.americas.sgi.com>
Message-ID: <4637AD90.2060202@mellanox.co.il>

Scott Shaw wrote:
> When will the general release of ofed v1.2 be available?  Also is the OS
> requirement going to be SUSE10 SP1?  
>
> Will ofed v1.2 work with SUSE10 without service packs?
>
> Thanks,
> Scott 
>
>   
General release is planned for May 15, but the actual date depends on 
stability and some hard bugs we are still trying to fix.
It should support SLES10 SP1 (currently tested with SP1 RC1).
BTW Moiz from Novell said that SLES10 SP1 will include OFED 1.2 as an 
add-on package from Novell.


Tziporet


From pradeep at us.ibm.com  Tue May  1 14:36:40 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Tue, 1 May 2007 14:36:40 -0700
Subject: [ofa-general] Re: please review IPOIB CM NOSRQ patch
In-Reply-To: <20070501211104.GK8447@mellanox.co.il>
Message-ID: <OF73784047.9B384CF4-ON882572CE.0075D739-882572CE.0076C3FC@us.ibm.com>

Thanks for spending cycles on this! There are a few long lines that are
broken up, but most of it is not line-wrapped like the previous versions.
Would it be possible to look at the attached text file instead? That
should be formatted as expected.

Pradeep
pradeep at us.ibm.com

"Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote on 05/01/2007 02:11:04 
PM:

> > Quoting Pradeep Satyanarayana <pradeep at us.ibm.com>:
> > Subject: please review IPOIB CM NOSRQ patch
> > 
> > Would one of you have the bandwidth to review the IPOIB CM NOSRQ patch 
> > (v3) that I submitted last week.
> > 
> > Pradeep
> > pradeep at us.ibm.com
> 
> OK, but could you please send a version that isn't line-wrapped?
> 
> Take a look at how it's formatted e.g. here:
> http://article.gmane.org/gmane.linux.drivers.openib/39021
> 
> 
> -- 
> MST


From jwong at datallegro.com  Tue May  1 15:04:38 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Tue, 1 May 2007 18:04:38 -0400
Subject: [ofa-general] Errors after install for openibd start
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303765EB7@xmb-sjc-216.amer.cisco.com>
Message-ID: <A382D4292574EB47A85B8159A6AED1A101131A6B@FPNYEXCBE02.opus-i.corp>

Thanks,

Rebooting solved the problems.

 
Jeff

	 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070501/62690a52/attachment.html>

From loic at myri.com  Tue May  1 15:05:24 2007
From: loic at myri.com (Loic Prylli)
Date: Tue, 01 May 2007 15:05:24 -0700
Subject: [ofa-general] Re: IPoIB forwarding
In-Reply-To: <46365BD4.5060607@hp.com>
References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov>	<20070425124652.GG1624@mellanox.co.il>	<6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov>	<20070426161409.GF15540@mellanox.co.il>	<6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov>	<20070426180618.GJ15540@mellanox.co.il>	<6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov>	<46325DF3.2050203@hp.com>	<6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov>	<46327A07.1000404@hp.com>	<6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov>	<4632894D.40705@hp.com>	<20070428025117.a3b1200a.billfink@mindspring.com>
	<4634F49F.9030408@myri.com> <46365BD4.5060607@hp.com>
Message-ID: <4637B9A4.2050103@myri.com>

On 4/30/2007 2:12 PM, Rick Jones wrote:
>
> Speaking of defaults, it would seem that the external 1.2.0 driver 
> comes with 9000 bytes as the default MTU?  At least I think that is 
> what I am seeing now that I've started looking more closely.
>
> rick jones


That's the same for the in-kernel-tree code (9K MTU by default). 
Assuming this is not wanted, I will submit a patch for that.


Loic


From rick.jones2 at hp.com  Tue May  1 15:12:08 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Tue, 01 May 2007 15:12:08 -0700
Subject: [ofa-general] Re: IPoIB forwarding
In-Reply-To: <4637B9A4.2050103@myri.com>
References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov>	<20070425124652.GG1624@mellanox.co.il>	<6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov>	<20070426161409.GF15540@mellanox.co.il>	<6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov>	<20070426180618.GJ15540@mellanox.co.il>	<6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov>	<46325DF3.2050203@hp.com>	<6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov>	<46327A07.1000404@hp.com>	<6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov>	<4632894D.40705@hp.com>	<20070428025117.a3b1200a.billfink@mindspring.com>	<4634F49F.9030408@myri.com>
	<46365BD4.5060607@hp.com> <4637B9A4.2050103@myri.com>
Message-ID: <4637BB38.9020809@hp.com>

Loic Prylli wrote:
> On 4/30/2007 2:12 PM, Rick Jones wrote:
> 
>>
>> Speaking of defaults, it would seem that the external 1.2.0 driver 
>> comes with 9000 bytes as the default MTU?  At least I think that is 
>> what I am seeing now that I've started looking more closely.
>>
>> rick jones
> 
> 
> 
> That's the same for the in-kernel-tree code (9K MTU by default). 
> Assuming this is not wanted, I will submit a patch for that.

While I like what that does for performance, and at the risk of putting 
words into the mouths of netdev, I suspect that 1500 bytes is indeed the 
desired default.  It matches the IEEE specs, I've yet to see a switch 
which enabled "Jumbo Frames" by default, and not everything out there even 
believes that Jumbo Frames means a 9000 byte MTU, etc.  I think 
that 1500 bytes for an "Ethernet" device remains in line with the 
principle of least surprise.

rick jones


From swise at opengridcomputing.com  Tue May  1 16:03:16 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 01 May 2007 18:03:16 -0500
Subject: [ofa-general] [PATCH librdmacm] rping: Transfer rkey/addr/len
	information in network byte order.
In-Reply-To: <1177515271.22094.33.camel@stevo-desktop>
References: <1177515271.22094.33.camel@stevo-desktop>
Message-ID: <1178060596.2309.195.camel@stevo-desktop>

Sean,

This patch regresses rping.  I failed to test it on AMD64<->AMD64 (ie
like endian systems).  I will provide another patch shortly, or we can
undo the broken rping patch for -rc3.  Whatever you think is best.

Sorry for this!

Steve.


On Wed, 2007-04-25 at 10:34 -0500, Steve Wise wrote:
> Sean,
> 
> This patch enables rping between a BE and LE system.  Tested on IBM
> PPC64 <-> AMD64.
> 
> Transfer rkey/addr/len information in network byte order.
> 
> Signed-off-by: Steve Wise <swise at opengridcomputing.com>
> ---
> 
>  examples/rping.c |    7 ++++---
>  1 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/examples/rping.c b/examples/rping.c
> index 0441300..17b0000 100644
> --- a/examples/rping.c
> +++ b/examples/rping.c
> @@ -47,6 +47,7 @@ #include <pthread.h>
>  #include <inttypes.h>
>  
>  #include <rdma/rdma_cma.h>
> +#include <infiniband/arch.h>
>  
>  static int debug = 0;
>  #define DEBUG_LOG if (debug) printf
> @@ -239,9 +240,9 @@ static int server_recv(struct rping_cb *
>  		return -1;
>  	}
>  
> -	cb->remote_rkey = cb->recv_buf.rkey;
> -	cb->remote_addr = cb->recv_buf.buf;
> -	cb->remote_len  = cb->recv_buf.size;
> +	cb->remote_rkey = ntohl(cb->recv_buf.rkey);
> +	cb->remote_addr = ntohll(cb->recv_buf.buf);
> +	cb->remote_len  = ntohl(cb->recv_buf.size);
>  	DEBUG_LOG("Received rkey %x addr %" PRIx64 "len %d from peer\n",
>  		  cb->remote_rkey, cb->remote_addr, cb->remote_len);
>  
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
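
One plausible reading of the like-endian regression is that the quoted hunk converts only on the receive side, so two little-endian hosts end up byte-swapping values that were sent in host order. That is an inference from the diff above, not something stated in the thread. A minimal sketch of a symmetric conversion follows; struct wire_info, wire_pack(), wire_unpack(), hton64() and ntoh64() are illustrative names rather than rping code, and the 64-bit helpers are built from htonl()/ntohl() instead of the ntohll() the patch takes from <infiniband/arch.h>.

```c
#include <arpa/inet.h>  /* htonl, ntohl */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire layout modeled on the rkey/addr/len fields in the patch. */
struct wire_info {
	uint64_t buf;   /* remote address, network byte order on the wire */
	uint32_t rkey;  /* remote key, network byte order on the wire */
	uint32_t size;  /* buffer length, network byte order on the wire */
};

/* Host to network order for 64 bits: high 32-bit word goes first. */
static uint64_t hton64(uint64_t v)
{
	uint32_t half[2];
	uint64_t out;

	half[0] = htonl((uint32_t)(v >> 32));
	half[1] = htonl((uint32_t)v);
	memcpy(&out, half, sizeof(out));
	return out;
}

/* Network to host order for 64 bits; exact inverse of hton64(). */
static uint64_t ntoh64(uint64_t v)
{
	uint32_t half[2];

	memcpy(half, &v, sizeof(v));
	return ((uint64_t)ntohl(half[0]) << 32) | ntohl(half[1]);
}

/* Sender side: convert every field before it goes on the wire. */
static void wire_pack(struct wire_info *w, uint64_t addr,
		      uint32_t rkey, uint32_t len)
{
	assert(w != NULL);
	w->buf  = hton64(addr);
	w->rkey = htonl(rkey);
	w->size = htonl(len);
}

/* Receiver side: mirror wire_pack() field by field. */
static void wire_unpack(const struct wire_info *w, uint64_t *addr,
			uint32_t *rkey, uint32_t *len)
{
	assert(w != NULL && addr != NULL && rkey != NULL && len != NULL);
	*addr = ntoh64(w->buf);
	*rkey = ntohl(w->rkey);
	*len  = ntohl(w->size);
}
```

With both sides converting, the round trip gives back the original values whether the two hosts share an endianness or not.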


From jwong at datallegro.com  Tue May  1 16:50:59 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Tue, 1 May 2007 19:50:59 -0400
Subject: [ofa-general] Help building sdp library.  
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303765EB7@xmb-sjc-216.amer.cisco.com>
Message-ID: <A382D4292574EB47A85B8159A6AED1A101131B18@FPNYEXCBE02.opus-i.corp>

Hello,

I am trying to do an OFED 1.2 install of all the modules now. 

I was able to compile and install the Basic selection, and now I am trying
to install the All selection. 

When I install with this selection, I get an error while
compiling the libsdp directory.

 
It looks like the build is failing because I don't have the 32-bit
compiler.  

Is there a workaround to compile only the 64-bit version?

I added build_32bit=0 to build_env.sh for the Basic install, but it
doesn't seem to apply to libsdp when I use the All selection.

 
Thanks in advance.

 
Jeff

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070501/49dd2746/attachment.html>

From jwong at datallegro.com  Tue May  1 17:38:37 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Tue, 1 May 2007 20:38:37 -0400
Subject: [ofa-general] Errors when compiling ib-bonding module.
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303765EB7@xmb-sjc-216.amer.cisco.com>
Message-ID: <A382D4292574EB47A85B8159A6AED1A101131B85@FPNYEXCBE02.opus-i.corp>

Hello,

I am using kernel 2.6.18-8.1.1.el5 x86_64

I have changed build_env.sh to set build_32bit=-1

 
Thanks in advance.

 
Jeff

 
When installing all modules I am getting the following errors.

 
+ make -C /lib/modules/2.6.18-8.1.1.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding
make: Entering directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64'
  CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o
In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:78:
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_inactive_flags':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: (Each undeclared identifier is reported only once
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: for each function it appears in.)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_active_flags':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_compute_features':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1233: warning: comparison of distinct pointer types lacks a cast
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_enslave':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1449: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_release':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1848: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_arp_rcv':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:2548: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_netdev_event':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:3390: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_init':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4374: warning: assignment discards qualifiers from pointer target type
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4386: error: 'IFF_BONDING' undeclared (first use in this function)
make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o] Error 1
make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding] Error 2
make: Leaving directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64'
+ echo ' Building  IB bonding driver failed'
 Building  IB bonding driver failed
+ exit 1
error: Bad exit status from /var/tmp/rpm-tmp.99179 (%build)

 
Jeff Wong, Linux Software Engineer

(949) 680-3066 - Office

(949) 680-3001 - Fax

 
jwong at datallegro.com

www.datallegro.com

 
85Enterprise, 2nd Floor, Aliso Viejo, CA 92656 

 
________________________________

From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
Sent: Tuesday, May 01, 2007 1:39 PM
To: Jeffrey Wong; general at lists.openfabrics.org
Subject: RE: [ofa-general] Errors after install for openibd start

 
Try rebooting, and see if it still happens.

 
Scott

 
________________________________

From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jeffrey Wong
Sent: Tuesday, May 01, 2007 1:11 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] Errors after install for openibd start

	Hello,

	I have successfully run the ./install.sh script with kernel
2.6.18-8.1.1.el5 

	I did not reboot the machine.

	After installing and configuring the ports using the defaults, I
tried to execute the command:

	/etc/init.d/openibd start

	 
	I have truncated the errors to show an example.

	 
	Any suggestions?

	Thanks,

	 
	Jeff

	 
	I am getting the following errors:

	 
	ib_ipath: disagrees about version of symbol ib_unregister_device

	ib_ipath: Unknown symbol ib_unregister_device

	ib_ipath: disagrees about version of symbol ib_register_device

	ib_ipath: Unknown symbol ib_register_device

	ib_ipath: disagrees about version of symbol ib_dispatch_event

	ib_ipath: Unknown symbol ib_dispatch_event

	ib_ipath: disagrees about version of symbol ib_dealloc_device

	ib_ipath: Unknown symbol ib_dealloc_device

	ib_ipath: disagrees about version of symbol ib_alloc_device

	ib_ipath: Unknown symbol ib_alloc_device

	ib_ipoib: disagrees about version of symbol ib_unregister_client

	ib_ipoib: Unknown symbol ib_unregister_client

	ib_ipoib: disagrees about version of symbol ib_create_cq

	ib_ipoib: Unknown symbol ib_create_cq

	ib_ipoib: Unknown symbol ib_sa_register_client

	ib_ipoib: disagrees about version of symbol ib_cm_listen

	ib_ipoib: Unknown symbol ib_cm_listen

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070501/73df6b4c/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.jpg
Type: image/jpeg
Size: 10836 bytes
Desc: image001.jpg
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070501/73df6b4c/attachment.jpg>

From sweitzen at cisco.com  Tue May  1 19:08:09 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 1 May 2007 19:08:09 -0700
Subject: [ofa-general] bugs 541 and 465: slow IPoIB CM HA failover and
	eventual failing IPoIB HA
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA30376607F@xmb-sjc-216.amer.cisco.com>

With IPoIB HA (both ipoibtools and ib-bonding), I am seeing slow IPoIB
CM HA failover, and eventually IPoIB stops working after enough
failovers.  I am running netperf -D traffic between two IPoIB HA hosts,
while flipping the 4 host IB ports one at a time (port 1 down, sleep,
port 1 up, sleep, ..., port 4 down, sleep, port 4 up, sleep) in a loop.
 
This is a very easy test to set up.  Can Mellanox and Voltaire please
try to reproduce the problem?
 
I think this problem must be fixed for OFED 1.2 rc3.
 
Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070501/7767c4ba/attachment.html>

From mst at dev.mellanox.co.il  Tue May  1 19:50:55 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 05:50:55 +0300
Subject: [ofa-general] Re: Requesting CQ notifications
In-Reply-To: <adaodlbvqv2.fsf@cisco.com>
References: <462FD3F7.1010304@evergrid.com> <adaodlbvqv2.fsf@cisco.com>
Message-ID: <20070502025055.GM8447@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: Requesting CQ notifications
> 
>  > Is there a differentiation between multiple CQE's being in the CQ
>  > vs. CQE's being arriving into the CQ when using completion
>  > notifications?
>  > 
>  > For example, assume I have the following order of events:
>  > 
>  > 
>  > 	2 CQEs arrive
>  > 
>  > 	select() returns readable for comp. channel
>  > 
>  > 	ibv_get_cq_event() returns event
>  > 
>  > 	ibv_req_notify_cq(cq, 0)
>  > 
>  > 	ibv_poll_cq(cq, 1, &cqe) returns 1
>  > 
>  > 	ibv_ack_cq_events(cq, 1)
>  > 
>  > 
>  > Will the comp. channel receive another event for the second CQE even
>  > if it had arrived before ibv_req_notify_cq() was called?
> 
> This is really an ill-posed question: according to the semantics
> defined by the verbs spec, the presence or absence of the second CQE
> is not defined until you poll the CQ again.
> 
> In practice we can look at what real hardware does, and the answer is
> "it depends."  Some adapters (eg mthca, mlx4) will generate an event
> immediately if ibv_req_notify_cq() is called for a CQ that contains an
> unpolled CQE,

This is not exact. mthca/mlx4 will generate an event immediately
only for an unpolled CQE *that was not present in the CQ at the
time the previous event was generated*.
So for mthca the answer is yes only if the CQE arrived
between the calls to select and ibv_req_notify_cq.

> while other adapters (eg ipath, ehca) will only generate
> an event when a CQE is added after the call to ibv_req_notify_cq().
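
Since adapters differ here, a common defensive pattern is to re-arm the CQ and then poll until it is empty, so that a CQE already sitting in the CQ is never left waiting for an event that may not come. The sketch below is a toy model in plain C that assumes the conservative behaviour (an event only for CQEs added after the re-arm); struct toy_cq and the toy_* functions are hypothetical stand-ins and not the libibverbs API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a completion queue; not the libibverbs API. */
struct toy_cq {
	int  pending;  /* CQEs waiting to be polled */
	bool armed;    /* a notification has been requested */
	int  events;   /* events delivered to the "completion channel" */
};

/* Hardware side: a completion arrives; an armed CQ fires once and disarms. */
static void toy_cq_complete(struct toy_cq *cq)
{
	cq->pending++;
	if (cq->armed) {
		cq->armed = false;
		cq->events++;
	}
}

/* Conservative re-arm: only completions added after this call raise an event. */
static void toy_req_notify(struct toy_cq *cq)
{
	cq->armed = true;
}

/* Poll one CQE; returns 1 if one was consumed, 0 if the CQ is empty. */
static int toy_poll_one(struct toy_cq *cq)
{
	if (cq->pending == 0)
		return 0;
	cq->pending--;
	return 1;
}

/* Event handler: re-arm first, then drain. Returns the number of CQEs handled. */
static int toy_handle_event(struct toy_cq *cq)
{
	int handled = 0;

	assert(cq->events > 0);
	cq->events--;             /* consume (and acknowledge) the event */
	toy_req_notify(cq);       /* re-arm before polling */
	while (toy_poll_one(cq))  /* drain: do not stop after the first CQE */
		handled++;
	return handled;
}
```

In the two-CQE scenario from the original question, the handler picks up both CQEs on the single event, so it does not matter whether the adapter would have raised a second one.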


-- 
MST


From mike.heffner at evergrid.com  Tue May  1 20:23:52 2007
From: mike.heffner at evergrid.com (Mike Heffner)
Date: Tue, 01 May 2007 23:23:52 -0400
Subject: [ofa-general] Re: Requesting CQ notifications
In-Reply-To: <20070502025055.GM8447@mellanox.co.il>
References: <462FD3F7.1010304@evergrid.com> <adaodlbvqv2.fsf@cisco.com>
	<20070502025055.GM8447@mellanox.co.il>
Message-ID: <46380448.1020401@evergrid.com>

Michael S. Tsirkin wrote:
>> Quoting Roland Dreier <rdreier at cisco.com>:
>> Subject: Re: Requesting CQ notifications

> 
> This is not exact. mthca/mlx4 will generate an event immediately
> only for unpolled CQE *that was not present in CQ at the
> time the previous event was generated*.
> So the answer for mthca is yes only if the CQE arrived
> between calls to select and ibv_req_notify_cq.
> 

Is there any method by which you can query the total number of CQEs in 
the CQ at an instantaneous point in time (ie., after you had called 
ibv_req_notify_cq() to get notification of *new* CQEs)?

Mike

-- 

   Mike Heffner <mike.heffner at evergrid.com>
   EverGrid Software
   Blacksburg, VA USA

   Voice: (540) 443-3500 #603


From rajib.majumder at credit-suisse.com  Tue May  1 20:34:01 2007
From: rajib.majumder at credit-suisse.com (Majumder, Rajib)
Date: Wed, 2 May 2007 11:34:01 +0800
Subject: [ofa-general] OFED SDP & IPoIB
Message-ID: <F444CAE5E62A714C9F45AA292785BED3223F7022@esng11p33001.sg.csfb.com>

Hi,

I have the following questions regarding OFED SDP. 

1) Does SDP support zcopy yet? If yes, is it for aio calls or sync/non-blocking socket calls as well? 
2) Does SDP work on 10GigE iWARP, apart from IB? 
3) Does IPoIB support IP multicast?  

Thanks

Rajib

==============================================================================
Please access the attached hyperlink for an important electronic communications disclaimer: 

http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
==============================================================================
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070502/95513ad2/attachment.html>

From mshefty at ichips.intel.com  Tue May  1 21:01:13 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 01 May 2007 21:01:13 -0700
Subject: [ofa-general] [PATCH librdmacm] rping: Transfer rkey/addr/len
	information in network byte order.
In-Reply-To: <1178060596.2309.195.camel@stevo-desktop>
References: <1177515271.22094.33.camel@stevo-desktop>
	<1178060596.2309.195.camel@stevo-desktop>
Message-ID: <46380D09.5070906@ichips.intel.com>

> This patch regresses rping.  I failed to test it on AMD64<->AMD64 (ie
> like endian systems).  I will provide another patch shortly, or we can
> undo the broken rping patch for -rc3.  Whatever you think is best.

Let's fix it.  Please create a patch on top of this that fixes the problem.

Thanks

- Sean


From Arkady.Kanevsky at netapp.com  Tue May  1 21:39:21 2007
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Wed, 2 May 2007 00:39:21 -0400
Subject: [ofa-general] minutes from socket over RDMA discussion at workshop
Message-ID: <C98692FD98048C41885E0B0FACD9DFB80431C1CB@exnane01.hq.netapp.com>

Dear group,
enclosed is the discussion we had at the Sonoma workshop.
 
Here are rough minutes.
 
We have not tried submitting SDP to kernel.org.
The current SDP performance is not very good.
IPoIB connected mode has much better bandwidth.
But SDP has better latency and less overhead.
IPoIB connected mode scalability has not been stressed yet.
 
What are API requirements?
Socket over RDMA sounds like RDS for financial services.
 
Thanks,
 
 
Arkady Kanevsky                       email: arkady at netapp.com

Network Appliance Inc.               phone: 781-768-5395

1601 Trapelo Rd. - Suite 16.        Fax: 781-895-1195

Waltham, MA 02451                   central phone: 781-768-5300

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070502/bb3c4a95/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: socket_sonoma_2007.pdf
Type: application/octet-stream
Size: 59813 bytes
Desc: socket_sonoma_2007.pdf
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070502/bb3c4a95/attachment.obj>

From sweitzen at cisco.com  Tue May  1 21:48:26 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 1 May 2007 21:48:26 -0700
Subject: [ofa-general] minutes from socket over RDMA discussion at workshop
In-Reply-To: <C98692FD98048C41885E0B0FACD9DFB80431C1CB@exnane01.hq.netapp.com>
References: <C98692FD98048C41885E0B0FACD9DFB80431C1CB@exnane01.hq.netapp.com>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA3037660DF@xmb-sjc-216.amer.cisco.com>

No, this is not right.  SDP has better latency and better throughput
than IPoIB CM, but also uses more CPU.
 
Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

________________________________

	From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Kanevsky,
Arkady
	Sent: Tuesday, May 01, 2007 9:39 PM
	To: openib-general
	Subject: [ofa-general] minutes from socket over RDMA discussion
at workshop
	
	
	Dear group,
	enclosed is the discussion we had at the Sonoma workshop.
	 
	Here are rough minutes.
	 
	We have not tried submitting SDP to kernel.org.
	The current SDP performance is not very good.
	IPoIB connected mode has much better bandwidth.
	But SDP has better latency and less overhead.
	IPoIB connected mode scalability has not been stressed yet.
	 
	What are API requirements?
	Socket over RDMA sounds like RDS for financial services.
	 
	Thanks,
	 
	 
	Arkady Kanevsky                       email: arkady at netapp.com

	Network Appliance Inc.               phone: 781-768-5395

	1601 Trapelo Rd. - Suite 16.        Fax: 781-895-1195

	Waltham, MA 02451                   central phone: 781-768-5300

	 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070501/9f3d5907/attachment.html>

From Ashish.Batwara at lsi.com  Tue May  1 22:53:06 2007
From: Ashish.Batwara at lsi.com (Batwara, Ashish)
Date: Tue, 1 May 2007 23:53:06 -0600
Subject: [ofa-general] We are seeing SYNDROME_LOCAL_PROT_ERR status on CQE
	with Mellanox Arbel HCA in memfree mode 
In-Reply-To: <mailman.225.1172180352.32035.openib-general@openib.org>
Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A016398AA@NAMAIL2.ad.lsil.com>

Any idea what causes this error? We see it when we use FMR. Is there
any special setting that the HCA needs in order to work with FMR?



From tziporet at dev.mellanox.co.il  Tue May  1 23:05:23 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 01 May 2007 23:05:23 -0700
Subject: [ofa-general] OFED SDP & IPoIB
In-Reply-To: <F444CAE5E62A714C9F45AA292785BED3223F7022@esng11p33001.sg.csfb.com>
References: <F444CAE5E62A714C9F45AA292785BED3223F7022@esng11p33001.sg.csfb.com>
Message-ID: <46382A23.2030403@mellanox.co.il>

Majumder, Rajib wrote:
>
> Hi,
>
> I have the following questions regarding OFED SDP.
>
> 1) Does SDP support zcopy yet? If yes, is it for aio calls or 
> sync/non-blocking socket calls as well?
>
No for both
>
>
> 2) Does SDP work on 10GigE iWARP, apart from IB?
>
No
>
>
> 3) Does IPoIB support IP multicast?
>
Yes
>
Tziporet


From vlad at dev.mellanox.co.il  Tue May  1 23:33:09 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 02 May 2007 09:33:09 +0300
Subject: [ofa-general] Help building sdp library.
In-Reply-To: <A382D4292574EB47A85B8159A6AED1A101131B18@FPNYEXCBE02.opus-i.corp>
References: <A382D4292574EB47A85B8159A6AED1A101131B18@FPNYEXCBE02.opus-i.corp>
Message-ID: <1178087589.14131.3.camel@vladsk-laptop>

On Tue, 2007-05-01 at 19:50 -0400, Jeffrey Wong wrote:
> Hello,
> I am trying to do an OFED 2.1 install for all the modules now. 
> I was able to compile and install the Basic install and now I am
> trying to install the all selection part. 
> When I try to install with this selection I am getting an error when
> compiling the libsdp directory.
>  
> It looks like since I don’t have the 32 bit compiler, the build is
> failing.  
> Is there a workaround to only compile the 64 bit version?
> I added into the build_env.sh 
> build_32bit=0 to build the Basic install but it doesn’t seem to apply
> to the libsdp when I try use the other selection of installing all.
>  
>  
> Thanks in advance.
>  
> Jeff
> 

Hi,
Try the latest OFED-1.2 from http://www.openfabrics.org/builds/ofed-1.2/
It should be fixed there.
Note that you don't have to edit build_env.sh to change the value of
the build_32bit variable. It is enough to run 'export build_32bit=0'
and then run the install.


-- 
Vladimir Sokolovsky <vlad at dev.mellanox.co.il>
Mellanox Technologies Ltd.


From mst at dev.mellanox.co.il  Tue May  1 23:46:48 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 09:46:48 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <OF88F0F777.9937D2C9-ON882572CB.00013288-882572CB.0004BCA0@us.ibm.com>
References: <OF88F0F777.9937D2C9-ON882572CB.00013288-882572CB.0004BCA0@us.ibm.com>
Message-ID: <20070502064549.GN8447@mellanox.co.il>

OK, we are making progress (line-wrapping issues aside :). And there seems to
be some whitespace damage, too. Pls take care of this.

I think the handle_rx_wc split is going in the right direction,
but let's take this through all the datapath.

I went over the patch in a bit more depth, and I have some questions:

> +	for (i = 0; i < ipoib_recvq_size; ++i) {
> +		if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index,

...

> +		if (ipoib_cm_post_receive(dev, i << 32 | index)) {

1. It seems there are multiple QPs mapped to a single CQ -
   and each gets ipoib_recvq_size recv WRs above.
   Is that right? How do you prevent CQ overrun then?

> +	/* Find an empty rx_index_ring[] entry */
> +	for (index = 0; index < NOSRQ_INDEX_RING_SIZE; index++)
> +		if (priv->cm.rx_index_ring[index] == NULL)
> +			break; 
> +
> +	if ( index == NOSRQ_INDEX_RING_SIZE) {
> +		spin_unlock_irq(&priv->lock);
> +		printk(KERN_WARNING "NOSRQ supports a max of %d RC "
> +		       "QPs. That limit has now been reached\n",
> +		       NOSRQ_INDEX_RING_SIZE);
> +		return -EINVAL;
> +	}

2. So, when QP limit has been reached, remote side will get
   a reject with custom reject reason?
   If so, it seems that since the remote does not know
   what the reason for reject is, it'll just retry
   on the next packet, and again and again. Basically,
   connectivity is denied where it previously worked fine
   by falling back on datagram mode?

   One way to fix this, could be to try and use a reject reason
   that will tell the remote "I'm busy, switch to datagram mode
   for a loooooong time". Using path mtu discovery here might be useful
   to actually have it come back and retry after several minutes.

   *In theory*, we could get this even with SRQ -
   if the *HCA* starts running out of RC QPs - it is just
   never happening in practice as current HCAs support #QPs larger
   than a maximum IB subnet size.
   So I might post a patch to implement this, stay tuned.

> +	spin_lock_irqsave(&priv->lock, flags);
> +	rx_ptr = priv->cm.rx_index_ring[index];
> +	spin_unlock_irqrestore(&priv->lock, flags);

3. You never actually test the rx_ptr that you got.
   So why does locking help?
   A better way to destroy QPs might be to move them to the error state first.

   We actually need something like this for CM too - stay tuned for a patch.

I also commented on some style issues below.

> Note 1: I have retained the code to avoid IB_WC_RETRY_EXC_ERR while performing
> interoperability tests As discussed in this mailing list that may be a CM bug or
> have the various HCA address it. Hence I would like to seperate out that issue
> from this patch.
> At a future point when the issue gets resolved I can provide
> another patch to change the retry_count values back to 0 if need be.

The correct way to separate it, in my opinion, is to set retry_count = 0,
and (for now) apply a work-around patch at your site before testing.
We really don't want to paper over this bug, in my opinion.

A general suggestion, before we dive into code: document, first of
all, data structures, then functions.
Rest of code quite often can be made self documenting.
Stuff like if (!srq) /* no SRQ */, and } /* end of loop */
is really not telling us anything useful.

> --- a/linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib.h	2007-04-24 18:10:17.000000000 -0700
> +++ b//linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib.h	2007-04-25 10:11:34.000000000 -0700
> @@ -99,6 +99,12 @@ enum {
>  #define	IPOIB_OP_RECV   (1ul << 31)
>  #ifdef CONFIG_INFINIBAND_IPOIB_CM
>  #define	IPOIB_CM_OP_SRQ (1ul << 30)
> +#define IPOIB_CM_OP_NOSRQ (1ul << 29)
> +
> +/* These two go hand in hand */
> +#define NOSRQ_INDEX_RING_SIZE 1024
> +#define NOSRQ_INDEX_MASK      0x00000000000003ff
> +

When you need a comment for 2 lines of code, you know
something's really obscure.

How about
#define NOSRQ_INDEX_MASK (NOSRQ_INDEX_RING_SIZE - 1)

and we can kill the comment?
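
A minimal userspace sketch of that suggestion, assuming the ring size
stays a power of two (the mask trick only works then). The static
assert and the nosrq_index() helper are illustrative additions, not
part of the patch:

```c
#include <stdint.h>

#define NOSRQ_INDEX_RING_SIZE 1024
/* Derived, so the two values cannot drift apart. */
#define NOSRQ_INDEX_MASK      (NOSRQ_INDEX_RING_SIZE - 1)

/* The mask is only valid when the size is a power of two. */
_Static_assert((NOSRQ_INDEX_RING_SIZE & (NOSRQ_INDEX_RING_SIZE - 1)) == 0,
	       "NOSRQ_INDEX_RING_SIZE must be a power of two");

/* Hypothetical helper: extract the QP index from a 64-bit id. */
static inline uint32_t nosrq_index(uint64_t id)
{
	return (uint32_t)(id & NOSRQ_INDEX_MASK);
}
```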

>  #else
>  #define	IPOIB_CM_OP_SRQ (0)
>  #endif
> @@ -136,9 +142,11 @@ struct ipoib_cm_data {
>  struct ipoib_cm_rx {
>  	struct ib_cm_id     *id;
>  	struct ib_qp        *qp;
> +	struct ipoib_cm_rx_buf *rx_ring;

Alignment's broken here.

>  	struct list_head     list;
>  	struct net_device   *dev;
>  	unsigned long        jiffies;
> +	u32		     index;

index and rx_ring are only valid for non-srq code, right?
I think we need a comment of some kind to tell us this.

>  };
>  
>  struct ipoib_cm_tx {
> @@ -177,6 +185,7 @@ struct ipoib_cm_dev_priv {
>  	struct ib_wc            ibwc[IPOIB_NUM_WC];
>  	struct ib_sge           rx_sge[IPOIB_CM_RX_SG];
>  	struct ib_recv_wr       rx_wr;
> +	struct ipoib_cm_rx	**rx_index_ring;
>  };
>  
>  /*

Isn't "ring" a bit of a misnomer?
Also - you have multiple QPs mapped to a single CQ - how do you prevent CQ overrun?
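
To make the overrun concern concrete: with one receive queue per QP,
the number of receive completions that can be outstanding on the shared
CQ grows with the number of QPs. A back-of-the-envelope sketch; the
function names are made up, and the receive queue size of 128 used
below is an illustrative assumption, not a value taken from the patch:

```c
#include <stdint.h>

/* Worst case: every posted receive WR on every QP completes
 * before the CQ is polled. */
static uint64_t worst_case_rx_cqes(uint32_t num_qps, uint32_t recvq_size)
{
	return (uint64_t)num_qps * recvq_size;
}

/* Returns 1 if a CQ with cq_entries slots can overrun. */
static int cq_can_overrun(uint32_t num_qps, uint32_t recvq_size,
			  uint32_t cq_entries)
{
	return worst_case_rx_cqes(num_qps, recvq_size) > cq_entries;
}
```

With 1024 QPs (NOSRQ_INDEX_RING_SIZE) and 128 receive WRs each, that is
131072 potential completions, so a CQ sized for a single receive queue
would not be enough.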
      
> --- a/linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-04-24 18:10:17.000000000 -0700
> +++ b//linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-04-27 14:03:40.000000000 -0700
> @@ -76,7 +76,7 @@ static void ipoib_cm_dma_unmap_rx(struct
>  		ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE);
>  }
>  
> -static int ipoib_cm_post_receive(struct net_device *dev, int id)
> +static int post_receive_srq(struct net_device *dev, u64 id)
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  	struct ib_recv_wr *bad_wr;
> @@ -85,13 +85,14 @@ static int ipoib_cm_post_receive(struct 
>  	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ;
>  
>  	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
> -		priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i];
> +		priv->cm.rx_sge[i].addr = 
> +		priv->cm.srq_ring[id].mapping[i];


The line wasn't too long here, so why wrap it?
And continuation lines need to be shifted *significantly*
to the right.

>  	ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr);
>  	if (unlikely(ret)) {
>  		ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret);
>  		ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
> -				      priv->cm.srq_ring[id].mapping);
> +			              priv->cm.srq_ring[id].mapping);

what's the deal here?

>  		dev_kfree_skb_any(priv->cm.srq_ring[id].skb);
>  		priv->cm.srq_ring[id].skb = NULL;
>  	}
> @@ -99,12 +100,69 @@ static int ipoib_cm_post_receive(struct 
>  	return ret;
>  }
>  
> -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags,
> +static int post_receive_nosrq(struct net_device *dev, u64 id)
> +{
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +	struct ib_recv_wr *bad_wr;
> +	int i, ret;
> +	u32 index;
> +	u64 wr_id;
> +	struct ipoib_cm_rx *rx_ptr;
> +	unsigned long flags;
> +
> +	index = id  & NOSRQ_INDEX_MASK ;
> +	wr_id = id >> 32;

So wr_id only ever has its lower 32 bits set - why make it u64 then?

> +	/* There is a slender chance of a race between the stale_task
> +	 * running after a period of inactivity and the receipt of
> +	 * a packet being processed at about the same instant. 
> +	 * Hence the lock */

I think you can get rid of this, by changing the stale task code:
move QP to error, and wait for WRs posted to complete.
Then there won't be any more completions for this QP.

As it is, I'm not convinced you can't get a completion after
QP has been removed out of the array - so it seems the race hasn't
been solved here?

We actually need something like this for CM too -
stay tuned for a patch.

> +	spin_lock_irqsave(&priv->lock, flags);
> +	rx_ptr = priv->cm.rx_index_ring[index];
> +	spin_unlock_irqrestore(&priv->lock, flags);
> +
> +	priv->cm.rx_wr.wr_id = wr_id << 32 | index | IPOIB_CM_OP_NOSRQ;

Isn't this just id, again?
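
A small userspace sketch of why I think so, assuming (as in the callers
in this patch) that id is built as i << 32 | index with index below
NOSRQ_INDEX_RING_SIZE. The helper names are made up for illustration:

```c
#include <stdint.h>

#define NOSRQ_INDEX_RING_SIZE 1024
#define NOSRQ_INDEX_MASK      (NOSRQ_INDEX_RING_SIZE - 1)

/* Pack a buffer slot and a QP index the way the callers do. */
static uint64_t pack_id(uint32_t slot, uint32_t index)
{
	return ((uint64_t)slot << 32) | index;
}

/* Split and re-join, as post_receive_nosrq() does before OR-ing
 * in the IPOIB_CM_OP_NOSRQ flag. */
static uint64_t split_and_rejoin(uint64_t id)
{
	uint32_t index = (uint32_t)(id & NOSRQ_INDEX_MASK);
	uint64_t wr_id = id >> 32;

	return wr_id << 32 | index;
}
```

Under that assumption the round trip returns id unchanged, so the
function could use id | IPOIB_CM_OP_NOSRQ directly.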

> +	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
> +		priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i];
> +
> +	ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr);
> +	if (unlikely(ret)) {
> +		ipoib_warn(priv, "post recv failed for buf %d (%d)\n",
> +		           wr_id, ret);
> +		ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
> +		                      rx_ptr->rx_ring[wr_id].mapping);
> +		dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb);
> +		rx_ptr->rx_ring[wr_id].skb = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static int ipoib_cm_post_receive(struct net_device *dev, u64 id)
> +{
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +	int ret;
> +
> +	if (priv->cm.srq) 
> +		ret = post_receive_srq(dev, id);
> +	else 
> +		ret = post_receive_nosrq(dev, id);
> +
> +	return ret;
> +}

I think you can split this one now that srq/nonsrq completions are
handled separately.

> +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, 
> +					     int frags,
>  					     u64 mapping[IPOIB_CM_RX_SG])
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  	struct sk_buff *skb;
>  	int i;
> +	struct ipoib_cm_rx *rx_ptr;
> +	u32 index, wr_id;
> +	unsigned long flags;
>  
>  	skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12);
>  	if (unlikely(!skb))
> @@ -123,7 +181,7 @@ static struct sk_buff *ipoib_cm_alloc_rx
>  		return NULL;
>  	}
>  
> -	for (i = 0; i < frags; i++) {
> +	for (i = 0; i < frags; i++) { 

what's the deal here?

>  		struct page *page = alloc_page(GFP_ATOMIC);
>  
>  		if (!page)
> @@ -136,7 +194,17 @@ static struct sk_buff *ipoib_cm_alloc_rx
>  			goto partial_error;
>  	}
>  
> -	priv->cm.srq_ring[id].skb = skb;
> +	if (priv->cm.srq) 
> +		priv->cm.srq_ring[id].skb = skb;
> +	else {
> +		index = id  & NOSRQ_INDEX_MASK ;
> +		wr_id = id >> 32;
> +		spin_lock_irqsave(&priv->lock, flags);
> +		rx_ptr = priv->cm.rx_index_ring[index];
> +		spin_unlock_irqrestore(&priv->lock, flags);

See above about the locking here. Try to get rid of it - this is datapath.

> +
> +		rx_ptr->rx_ring[wr_id].skb = skb;
> +	}
>  	return skb;
>  
>  partial_error:

A branch on the datapath just for 2 lines that differ
is not worth it. Just keep the common code in ipoib_cm_alloc_rx_skb,
and move the lines that differ to the call sites.

> @@ -157,13 +225,20 @@ static struct ib_qp *ipoib_cm_create_rx_
>  	struct ib_qp_init_attr attr = {
>  		.send_cq = priv->cq, /* does not matter, we never send anything */
>  		.recv_cq = priv->cq,
> -		.srq = priv->cm.srq,
>  		.cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */
> +		.cap.max_recv_wr = ipoib_recvq_size + 1,
>  		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
> +		.cap.max_recv_sge = IPOIB_CM_RX_SG, /* Is this correct? */

I don't think we should set both attr.srq and max_recv_sge
for a QP connected to SRQ.

>  		.sq_sig_type = IB_SIGNAL_ALL_WR,
>  		.qp_type = IB_QPT_RC,
>  		.qp_context = p,
>  	};
> +
> +	if (priv->cm.srq)
> +		attr.srq = priv->cm.srq;
> +	else
> +		attr.srq = NULL;

Since attr has an initializer, attr.srq is already 0
unless you set it.
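
This is standard C behaviour: members omitted from a designated
initializer are zero-initialized, so pointers come out NULL. A tiny
illustration with a made-up struct (not the real ib_qp_init_attr):

```c
#include <stddef.h>

/* Hypothetical stand-in for a struct with an optional pointer. */
struct demo_attr {
	void *srq;
	int   max_send_wr;
	int   max_recv_wr;
};

/* Only max_send_wr is named; srq and max_recv_wr are zeroed. */
static struct demo_attr make_attr(void)
{
	struct demo_attr attr = {
		.max_send_wr = 1,
	};

	return attr;
}
```

So the else branch that assigns NULL adds nothing.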

> +
>  	return ib_create_qp(priv->pd, &attr);
>  }
>  
> @@ -198,6 +273,7 @@ static int ipoib_cm_modify_rx_qp(struct 
>  		ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret);
>  		return ret;
>  	}
> +	

Kill this.

>  	return 0;
>  }
>  
> @@ -217,12 +293,87 @@ static int ipoib_cm_send_rep(struct net_
>  	rep.flow_control = 0;
>  	rep.rnr_retry_count = req->rnr_retry_count;
>  	rep.target_ack_delay = 20; /* FIXME */
> -	rep.srq = 1;
>  	rep.qp_num = qp->qp_num;
>  	rep.starting_psn = psn;
> +	
> +	if (priv->cm.srq)
> +		rep.srq = 1;
> +	else
> +		rep.srq = 0;
>  	return ib_send_cm_rep(cm_id, &rep);
>  }
>  
> +int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id,  struct ipoib_cm_rx *p, unsigned psn)

This one's too long I think.

> +{
> +	struct net_device *dev = cm_id->context;
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +	int ret;
> +	u32 qp_num, index;
> +	u64 i;
> +
> +	qp_num = p->qp->qp_num;
> +	/* Allocate space for the rx_ring here */

You mostly want to kill such comments - they take up code lines
and don't really tell anything useful.

> +	p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring,
> +			     GFP_KERNEL);
> +	if (p->rx_ring == NULL)
> +		return -ENOMEM;
> +
> +	cm_id->context = p;
> +	p->jiffies = jiffies;
> +	spin_lock_irq(&priv->lock);
> +	list_add(&p->list, &priv->cm.passive_ids);
> +		
> +	/* Find an empty rx_index_ring[] entry */

And this.

> +	for (index = 0; index < NOSRQ_INDEX_RING_SIZE; index++)
> +		if (priv->cm.rx_index_ring[index] == NULL)
> +			break; 

No == NULL tests please.

> +
> +	if ( index == NOSRQ_INDEX_RING_SIZE) {
> +		spin_unlock_irq(&priv->lock);
> +		printk(KERN_WARNING "NOSRQ supports a max of %d RC "
> +		       "QPs. That limit has now been reached\n",
> +		       NOSRQ_INDEX_RING_SIZE);
> +		return -EINVAL;
> +	}

So, when QP limit has been reached, connectivity is denied
where it previously worked fine in datagram mode?
This looks like an important regression.

> +	/* Store the pointer to retrieve it later using the index */

Kill this too.

> +	priv->cm.rx_index_ring[index] = p;
> +	spin_unlock_irq(&priv->lock);
> +	p->index = index;
> +
> +	ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
> +	if (ret) {
> +		ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret);
> +		goto err_modify_nosrq;
> +	}
> +
> +	for (i = 0; i < ipoib_recvq_size; ++i) {
> +		if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index,
> +					   IPOIB_CM_RX_SG - 1,
> +					   p->rx_ring[i].mapping)) {
> +			ipoib_warn(priv, "failed to allocate receive "
> +			           "buffer %d\n", i);
> +			ipoib_cm_dev_cleanup(dev);
> +			/* Free rx_ring previously allocated */

And this.

> +			kfree(p->rx_ring);
> +			return -ENOMEM;
> +		}
> +
> +		/* Can we call the nosrq version? */

what's the deal here?

> +		if (ipoib_cm_post_receive(dev, i << 32 | index)) {
> +			ipoib_warn(priv, "ipoib_ib_post_receive "
> +			           "failed for  buf %d\n", i);
> +			ipoib_cm_dev_cleanup(dev);
> +			return -EIO;
> +		}
> +	} /* end for */

And surely this.

> +
> +	return 0;
> +
> +err_modify_nosrq:
> +	return ret;
> +}
> +
>  static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
>  {
>  	struct net_device *dev = cm_id->context;
> @@ -243,10 +394,17 @@ static int ipoib_cm_req_handler(struct i
>  		goto err_qp;
>  	}
>  
> -	psn = random32() & 0xffffff;
> -	ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
> -	if (ret)
> -		goto err_modify;
> +	if (priv->cm.srq == NULL) { /* NOSRQ */

No == NULL tests please. Also - what does the comment tell us
that we don't already know?

> +		psn = random32() & 0xffffff;

random call could be in common code?

> +		if (ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn))
> +			goto err_modify;
> +	} else { /* SRQ */

What does the comment tell us that we don't already know?

> +		p->rx_ring = NULL; /* This is used only by NOSRQ */
> +		psn = random32() & 0xffffff;
> +		ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
> +		if (ret)
> +			goto err_modify;
> +	}
>  
>  	ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn);
>  	if (ret) {
> @@ -254,11 +412,13 @@ static int ipoib_cm_req_handler(struct i
>  		goto err_rep;
>  	}
>  
> -	cm_id->context = p;
> -	p->jiffies = jiffies;
> -	spin_lock_irq(&priv->lock);
> -	list_add(&p->list, &priv->cm.passive_ids);
> -	spin_unlock_irq(&priv->lock);
> +	if (priv->cm.srq) {
> +		cm_id->context = p;
> +		p->jiffies = jiffies;
> +		spin_lock_irq(&priv->lock);
> +		list_add(&p->list, &priv->cm.passive_ids);
> +		spin_unlock_irq(&priv->lock);
> +	}
>  	queue_delayed_work(ipoib_workqueue,
>  			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  	return 0;
> @@ -339,23 +499,40 @@ static void skb_put_frags(struct sk_buff
>  	}
>  }
>  
> -void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
> +static void timer_check(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
> +{
> +	unsigned long flags;
> +
> +	if (time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {

Now that it's a separate function, we can
if (time_before(....))
	return;
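
Roughly like this. A userspace sketch with stand-in macros that mirror
the kernel's wraparound-safe jiffies comparisons; the sketch_* names
and the simplified function are for illustration only:

```c
/* Stand-ins for the kernel's time_after_eq()/time_before():
 * the signed difference stays correct across counter wraparound.
 * sketch_time_before() is the exact negation of the other one. */
#define sketch_time_after_eq(a, b) ((long)((a) - (b)) >= 0)
#define sketch_time_before(a, b)   ((long)((a) - (b)) < 0)

/* Returns 1 if the timestamp was refreshed, 0 if it was too early. */
static int timer_check_sketch(unsigned long now, unsigned long *stamp,
			      unsigned long update_time)
{
	if (sketch_time_before(now, *stamp + update_time))
		return 0;	/* early return keeps the body unindented */

	*stamp = now;
	return 1;
}
```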

> +		spin_lock_irqsave(&priv->lock, flags);
> +		p->jiffies = jiffies;
> +		/* Move this entry to list head, but do
> +		 * not re-add it if it has been removed. */
> +		if (!list_empty(&p->list))
> +			list_move(&p->list, &priv->cm.passive_ids);
> +		spin_unlock_irqrestore(&priv->lock, flags);
> +		queue_delayed_work(ipoib_workqueue,
> +				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
> +	}
> +}
> +static int handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc)

Why is making this an int a good idea?
You aren't doing anything useful with this down the line.

>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
> -	unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ;
>  	struct sk_buff *skb, *newskb;
> +	u64 mapping[IPOIB_CM_RX_SG], wr_id;
>  	struct ipoib_cm_rx *p;
>  	unsigned long flags;
> -	u64 mapping[IPOIB_CM_RX_SG];
> -	int frags;
> +	int frags, ret;
> +
> +	wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ;

I like initing the variable at declaration site.
If you want to change the style, maybe make it a separate patch?

>  	ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n",
>  		       wr_id, wc->status);
>  
>  	if (unlikely(wr_id >= ipoib_recvq_size)) {
> -		ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
> -			   wr_id, ipoib_recvq_size);
> -		return;
> +		ipoib_warn(priv, "cm recv completion event with wrid %d "
> +		           "(> %d)\n", wr_id, ipoib_recvq_size);

the line wasn't too long before, so why wrap it?

> +		return 1; 
>  	}
>  
>  	skb  = priv->cm.srq_ring[wr_id].skb;
> @@ -365,22 +542,12 @@ void ipoib_cm_handle_rx_wc(struct net_de
>  			   "(status=%d, wrid=%d vend_err %x)\n",
>  			   wc->status, wr_id, wc->vendor_err);
>  		++priv->stats.rx_dropped;
> -		goto repost;
> +		goto repost_srq;
>  	}
>  
>  	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
>  		p = wc->qp->qp_context;
> -		if (time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
> -			spin_lock_irqsave(&priv->lock, flags);
> -			p->jiffies = jiffies;
> -			/* Move this entry to list head, but do
> -			 * not re-add it if it has been removed. */
> -			if (!list_empty(&p->list))
> -				list_move(&p->list, &priv->cm.passive_ids);
> -			spin_unlock_irqrestore(&priv->lock, flags);
> -			queue_delayed_work(ipoib_workqueue,
> -					   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
> -		}
> +		timer_check(priv, p);
>  	}
>  
>  	frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
> @@ -388,22 +555,119 @@ void ipoib_cm_handle_rx_wc(struct net_de
>  
>  	newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping);
>  	if (unlikely(!newskb)) {
> -		/*
> -		 * If we can't allocate a new RX buffer, dump
> -		 * this packet and reuse the old buffer.
> -		 */
> -		ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
> +                /*
> +                 * If we can't allocate a new RX buffer, dump
> +                 * this packet and reuse the old buffer.
> +                 */
> +                ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
> +                ++priv->stats.rx_dropped;
> +                goto repost_srq;
> +        }
> +
> +	ipoib_cm_dma_unmap_rx(priv, frags, 
> +	                      priv->cm.srq_ring[wr_id].mapping);
> +	memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, 
> +	       (frags + 1) * sizeof *mapping);
> +	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
> +		       wc->byte_len, wc->slid);
> +
> +	skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); 
> +
> +	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
> +	skb->mac.raw = skb->data;
> +	skb_pull(skb, IPOIB_ENCAP_LEN);
> +
> +	dev->last_rx = jiffies;
> +	++priv->stats.rx_packets;
> +	priv->stats.rx_bytes += skb->len;
> +
> +	skb->dev = dev;
> +	/* XXX get correct PACKET_ type here */
> +	skb->pkt_type = PACKET_HOST;
> +
> +	netif_rx_ni(skb);
> +
> +repost_srq:

Labels don't need to be unique cross-function.
So you can call this one repost:

> +	ret = ipoib_cm_post_receive(dev, wr_id);
> +
> +	if (unlikely(ret)) {
> +		ipoib_warn(priv, "ipoib_cm_post_receive failed for buf %d\n", 
> +		           wr_id);
> +		return 1;
> +	}
> +
> +	return 0;
> +
> +}
> +
> +static int handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc)
> +{
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +	struct sk_buff *skb, *newskb;
> +	u64 mapping[IPOIB_CM_RX_SG], wr_id;
> +	u32 index;
> +	struct ipoib_cm_rx *p, *rx_ptr;
> +	unsigned long flags;
> +	int frags, ret;
> +
> +
> +	wr_id = wc->wr_id >> 32;
> +
> +	ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n",
> +		       wr_id, wc->status);
> +
> +	if (unlikely(wr_id >= ipoib_recvq_size)) {
> +		ipoib_warn(priv, "cm recv completion event with wrid %d "
> +		           "(> %d)\n", wr_id, ipoib_recvq_size);
> +		return 1;
> +	}
> +
> +	index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ) & NOSRQ_INDEX_MASK ;
> +	spin_lock_irqsave(&priv->lock, flags);
> +	rx_ptr = priv->cm.rx_index_ring[index];
> +	spin_unlock_irqrestore(&priv->lock, flags);
> +
> +	skb = rx_ptr->rx_ring[wr_id].skb;
> +
> +	if (unlikely(wc->status != IB_WC_SUCCESS)) {
> +		ipoib_dbg(priv, "cm recv error "
> +			   "(status=%d, wrid=%d vend_err %x)\n",
> +			   wc->status, wr_id, wc->vendor_err);
>  		++priv->stats.rx_dropped;
> -		goto repost;
> +		goto repost_nosrq;
>  	}
>  
> -	ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping);
> -	memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping);
> +	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
> +		/* There are no guarantees that wc->qp is not NULL for HCAs 
> +	 	* that do not support SRQ. */ 
> +		p = rx_ptr;
> +		timer_check(priv, p);
> +	}
> +
> +	frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
> +					      (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE;
> +
> +	newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags,
> +				       mapping);
> +	if (unlikely(!newskb)) {
> +                /*
> +                 * If we can't allocate a new RX buffer, dump
> +                 * this packet and reuse the old buffer.
> +                 */
> +                ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
> +                ++priv->stats.rx_dropped;
> +                goto repost_nosrq;
> +        }
> +
> +	ipoib_cm_dma_unmap_rx(priv, frags, 
> +	                      rx_ptr->rx_ring[wr_id].mapping);
> +	memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping, 
> +	       (frags + 1) * sizeof *mapping);
>  
>  	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
>  		       wc->byte_len, wc->slid);
>  
> -	skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb);
> +	skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); 
>  
>  	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
>  	skb->mac.raw = skb->data;
> @@ -416,12 +680,34 @@ void ipoib_cm_handle_rx_wc(struct net_de
>  	skb->dev = dev;
>  	/* XXX get correct PACKET_ type here */
>  	skb->pkt_type = PACKET_HOST;
> +
>  	netif_rx_ni(skb);
>  
> -repost:
> -	if (unlikely(ipoib_cm_post_receive(dev, wr_id)))
> -		ipoib_warn(priv, "ipoib_cm_post_receive failed "
> -			   "for buf %d\n", wr_id);
> +repost_nosrq:

Labels don't need to be unique cross-function.
So you can call this one repost:

> +	ret = ipoib_cm_post_receive(dev, wr_id << 32 | index);
> +
> +	if (unlikely(ret)) {
> +		ipoib_warn(priv, "ipoib_cm_post_receive failed for buf %d\n", 
> +		           wr_id);
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
> +{
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +	int ret;
> +	
> +
> +	if (priv->cm.srq) 
> +		ret = handle_rx_wc_srq(dev, wc);
> +	else 
> +		ret = handle_rx_wc_nosrq(dev, wc);
> +
> +	if (unlikely(ret)) 
> +                ipoib_warn(priv, "Error processing rx wc\n");
>  }

See below about this.

>  static inline int post_send(struct ipoib_dev_priv *priv,
> @@ -606,6 +892,22 @@ int ipoib_cm_dev_open(struct net_device 
>  	return 0;
>  }
>  
> +static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
> +{

I suggest you lose the _nosrq suffix and just do
if (priv->cm.srq)
	return;
at the top of the function.

> +	int i;
> +
> +	for(i = 0; i < ipoib_recvq_size; ++i)
> +		if(p->rx_ring[i].skb) {
> +			ipoib_cm_dma_unmap_rx(priv,
> +				         IPOIB_CM_RX_SG - 1,
> +					 p->rx_ring[i].mapping);
> +			dev_kfree_skb_any(p->rx_ring[i].skb);
> +			p->rx_ring[i].skb = NULL;
> +		}
> +		kfree(p->rx_ring);
> +}
> +
> +

Lose the double empty lines.

>  void ipoib_cm_dev_stop(struct net_device *dev)
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
> @@ -618,6 +920,8 @@ void ipoib_cm_dev_stop(struct net_device
>  	spin_lock_irq(&priv->lock);
>  	while (!list_empty(&priv->cm.passive_ids)) {
>  		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
> +		if (priv->cm.srq == NULL) 
> +			free_resources_nosrq(priv, p);

No == NULL tests please.

>  		list_del_init(&p->list);
>  		spin_unlock_irq(&priv->lock);
>  		ib_destroy_cm_id(p->id);
> @@ -703,9 +1007,14 @@ static struct ib_qp *ipoib_cm_create_tx_
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  	struct ib_qp_init_attr attr = {};
>  	attr.recv_cq = priv->cq;
> -	attr.srq = priv->cm.srq;
> +	if (priv->cm.srq)
> +		attr.srq = priv->cm.srq;
> +	else
> +		attr.srq = NULL;
>  	attr.cap.max_send_wr = ipoib_sendq_size;
> +	attr.cap.max_recv_wr = 1; /* Not in MST code */
>  	attr.cap.max_send_sge = 1;
> +	attr.cap.max_recv_sge = 1; /* Not in MST code */

I don't think we should set both attr.srq and max_recv_sge
for a QP connected to SRQ.

>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
>  	attr.qp_type = IB_QPT_RC;
>  	attr.send_cq = cq;
> @@ -742,10 +1051,13 @@ static int ipoib_cm_send_req(struct net_
>  	req.responder_resources	      = 4;
>  	req.remote_cm_response_timeout = 20;
>  	req.local_cm_response_timeout  = 20;
> -	req.retry_count 	      = 0; /* RFC draft warns against retries */
> -	req.rnr_retry_count 	      = 0; /* RFC draft warns against retries */
> +	req.retry_count 	      = 6; /* RFC draft warns against retries */
> +	req.rnr_retry_count 	      = 6;/* RFC draft warns against retries */
>  	req.max_cm_retries 	      = 15;
> -	req.srq 	              = 1;
> +	if (priv->cm.srq)
> +		req.srq               = 1;
> +	else
> +		req.srq               = 0;
>  	return ib_send_cm_req(id, &req);
>  }
>  
> @@ -1089,6 +1401,10 @@ static void ipoib_cm_stale_task(struct w
>  		p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list);
>  		if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT))
>  			break;
> +		if (priv->cm.srq == NULL) { /* NOSRQ */

No == NULL tests please. Also - what does the comment tell us?

> +			free_resources_nosrq(priv, p);
> +			priv->cm.rx_index_ring[p->index] = NULL;
> +		}
>  		list_del_init(&p->list);
>  		spin_unlock_irq(&priv->lock);
>  		ib_destroy_cm_id(p->id);
> @@ -1143,16 +1459,40 @@ int ipoib_cm_add_mode_attr(struct net_de
>  	return device_create_file(&dev->dev, &dev_attr_mode);
>  }
>  
> +static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv)
> +{
> +	struct ib_srq_init_attr srq_init_attr;
> +	int ret;
> +
> +	srq_init_attr.attr.max_wr = ipoib_recvq_size;
> +	srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG;
> +
> +	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
> +	if (IS_ERR(priv->cm.srq)) {
> +		ret = PTR_ERR(priv->cm.srq);
> +		priv->cm.srq = NULL;
> +		return ret;
> +	}
> +
> +	priv->cm.srq_ring = kzalloc(ipoib_recvq_size * 
> +		                    sizeof *priv->cm.srq_ring, 
> +			            GFP_KERNEL);
> +	if (!priv->cm.srq_ring) {
> +		printk(KERN_WARNING "%s: failed to allocate CM ring "
> +		       "(%d entries)\n",
> +	       	       priv->ca->name, ipoib_recvq_size);
> +		ipoib_cm_dev_cleanup(dev);

Since you have separated create_srq from cm_dev_init,
calling ipoib_cm_dev_cleanup from it looks wrong.

> +		return -ENOMEM;
> +	}
> +
> +	return 0;
> +}
> +
>  int ipoib_cm_dev_init(struct net_device *dev)
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
> -	struct ib_srq_init_attr srq_init_attr = {
> -		.attr = {
> -			.max_wr  = ipoib_recvq_size,
> -			.max_sge = IPOIB_CM_RX_SG
> -		}
> -	};
> -	int ret, i;
> +	int ret, i, supports_srq;
> +	struct ib_device_attr attr;
>  
>  	INIT_LIST_HEAD(&priv->cm.passive_ids);
>  	INIT_LIST_HEAD(&priv->cm.reap_list);
> @@ -1164,21 +1504,26 @@ int ipoib_cm_dev_init(struct net_device 
>  
>  	skb_queue_head_init(&priv->cm.skb_queue);
>  
> -	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
> -	if (IS_ERR(priv->cm.srq)) {
> -		ret = PTR_ERR(priv->cm.srq);
> -		priv->cm.srq = NULL;
> +	if (ret = ib_query_device(priv->ca, &attr))
>  		return ret;

I think a cleaner way would be to just test device->create_srq.

> +	if (attr.max_srq)
> +		supports_srq = 1; /* This device supports SRQ */
> +	else {
> +		supports_srq = 0;	
>  	}
>  
> -	priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring,
> -				    GFP_KERNEL);
> -	if (!priv->cm.srq_ring) {
> -		printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n",
> -		       priv->ca->name, ipoib_recvq_size);
> -		ipoib_cm_dev_cleanup(dev);
> -		return -ENOMEM;
> -	}
> +	if (supports_srq) {
> +		if (ret = create_srq(dev, priv))
> +			return ret;
> +			
> +		priv->cm.rx_index_ring = NULL; /* Not needed for SRQ */
> +	} else {
> +		priv->cm.srq = NULL;
> +		priv->cm.srq_ring = NULL;
> +		priv->cm.rx_index_ring = kzalloc(NOSRQ_INDEX_RING_SIZE * 
> +					 sizeof *priv->cm.rx_index_ring,
> +					 GFP_KERNEL);
> +	} 
>  
>  	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
>  		priv->cm.rx_sge[i].lkey	= priv->mr->lkey;

do we really need supports_srq variable? It's only used once ...

> @@ -1190,19 +1535,25 @@ int ipoib_cm_dev_init(struct net_device 
>  	priv->cm.rx_wr.sg_list = priv->cm.rx_sge;
>  	priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG;
>  
> -	for (i = 0; i < ipoib_recvq_size; ++i) {
> -		if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1,
> +	/* One can post receive buffers even before the RX QP is created 
> +	 * only in the SRQ case. Therefore for NOSRQ we skip the rest of init 
> +	 * and do that in ipoib_cm_req_handler() */
> +
> +	if (priv->cm.srq) {
> +		for (i = 0; i < ipoib_recvq_size; ++i) {
> +			if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1,
>  					   priv->cm.srq_ring[i].mapping)) {
> -			ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
> -			ipoib_cm_dev_cleanup(dev);
> -			return -ENOMEM;
> -		}
> -		if (ipoib_cm_post_receive(dev, i)) {
> -			ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i);
> -			ipoib_cm_dev_cleanup(dev);
> -			return -EIO;
> +				ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
> +				ipoib_cm_dev_cleanup(dev);
> +				return -ENOMEM;
> +			}
> +			if (ipoib_cm_post_receive(dev, i)) {
> +				ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i);
> +				ipoib_cm_dev_cleanup(dev);
> +				return -EIO;
> +			}
>  		}
> -	}
> +	} /* if SRQ */
>  
>  	priv->dev->dev_addr[0] = IPOIB_FLAGS_RC;
>  	return 0;

When you start adding /* if SRQ */ comments near the closing bracket,
you know your nesting is too deep.
How about
	if (!priv->cm.srq)
		goto done;

> --- a/linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-04-24 18:10:17.000000000 -0700
> +++ b//linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-04-25 10:11:34.000000000 -0700
> @@ -282,7 +282,7 @@ static void ipoib_ib_handle_tx_wc(struct
>  
>  static void ipoib_ib_handle_wc(struct net_device *dev, struct ib_wc *wc)
>  {
> -	if (wc->wr_id & IPOIB_CM_OP_SRQ)
> +	if ((wc->wr_id & IPOIB_CM_OP_SRQ) || (wc->wr_id & IPOIB_CM_OP_NOSRQ))
>  		ipoib_cm_handle_rx_wc(dev, wc);
>  	else if (wc->wr_id & IPOIB_OP_RECV)
>  		ipoib_ib_handle_rx_wc(dev, wc);

So you have a branch on IPOIB_CM_OP_NOSRQ here, and you have
a branch on priv->srq down the line.

What I suggest instead is to split ipoib_ib_completion into SRQ/non-SRQ
variants, which will completely avoid the extra branch cost at runtime.
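
One way such a split could look, as a rough sketch only: the SRQ decision
is taken once at init time by selecting a handler, so the per-completion
path does not test the mode again. All types and names here (fake_dev,
fake_wc, handle_wc_srq, handle_wc_nosrq) are hypothetical and not taken
from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical work-completion and device types, for illustration only. */
struct fake_wc { uint64_t wr_id; };
struct fake_dev;
typedef void (*completion_fn)(struct fake_dev *dev, const struct fake_wc *wc);

struct fake_dev {
	completion_fn handle_wc; /* chosen once at init time */
	int srq_hits;
	int nosrq_hits;
};

static void handle_wc_srq(struct fake_dev *dev, const struct fake_wc *wc)
{
	(void)wc;
	dev->srq_hits++;
}

static void handle_wc_nosrq(struct fake_dev *dev, const struct fake_wc *wc)
{
	(void)wc;
	dev->nosrq_hits++;
}

/* The SRQ decision is made here, not on every completion. */
static void fake_dev_init(struct fake_dev *dev, int have_srq)
{
	assert(dev != NULL);
	dev->srq_hits = 0;
	dev->nosrq_hits = 0;
	dev->handle_wc = have_srq ? handle_wc_srq : handle_wc_nosrq;
}

/* Poll-loop body: dispatch without testing the SRQ mode again. */
static void fake_completion(struct fake_dev *dev, const struct fake_wc *wcs, size_t n)
{
	size_t i;

	for (i = 0; i < n; ++i)
		dev->handle_wc(dev, &wcs[i]);
}
```

An indirect call has a cost of its own, so in practice the two variants
would more likely be registered as different completion callbacks when the
CQ is created; the function pointer here just keeps the sketch short.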

-- 
MST


From mst at dev.mellanox.co.il  Wed May  2 00:08:49 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 10:08:49 +0300
Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to use
	send with invalidate
In-Reply-To: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
Message-ID: <20070502070849.GO8447@mellanox.co.il>

> -			if (ib_wr->send_flags & IB_SEND_SOLICITED) {
> +			if (ib_wr->send_flags & IB_SEND_SOLICITED
> +				&& ib_wr->send_flags & IB_SEND_INVALIDATE) {

How about
	if (ib_wr->send_flags & (IB_SEND_SOLICITED | IB_SEND_INVALIDATE))
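
One caveat worth checking before adopting the combined mask: masking with
(A | B) is nonzero when either flag is set, whereas the patch's
"flags & A && flags & B" requires both. The sketch below uses made-up flag
values (FLAG_SOLICITED and FLAG_INVALIDATE are placeholders, not the real
IB_SEND_* constants) to show the two tests side by side.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit values; the real IB_SEND_* constants may differ. */
#define FLAG_SOLICITED  (1u << 2)
#define FLAG_INVALIDATE (1u << 4)

/* True when at least one bit of the mask is set in flags. */
static int any_set(uint32_t flags, uint32_t mask)
{
	assert(mask != 0); /* an empty mask would make the test meaningless */
	return (flags & mask) != 0;
}

/* True only when every bit of the mask is set, matching "f & A && f & B". */
static int all_set(uint32_t flags, uint32_t mask)
{
	assert(mask != 0);
	return (flags & mask) == mask;
}
```

If the intent is "both flags", the all_set form keeps a single comparison
while preserving the semantics of the original patch.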

-- 
MST


From mst at dev.mellanox.co.il  Wed May  2 00:15:21 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 10:15:21 +0300
Subject: [ofa-general] Re: Requesting CQ notifications
In-Reply-To: <46380448.1020401@evergrid.com>
References: <462FD3F7.1010304@evergrid.com> <adaodlbvqv2.fsf@cisco.com>
	<20070502025055.GM8447@mellanox.co.il>
	<46380448.1020401@evergrid.com>
Message-ID: <20070502071521.GR8447@mellanox.co.il>

> Quoting Mike Heffner <mike.heffner at evergrid.com>:
> Subject: Re: Requesting CQ notifications
> 
> Michael S. Tsirkin wrote:
> >>Quoting Roland Dreier <rdreier at cisco.com>:
> >>Subject: Re: Requesting CQ notifications
> 
> >
> >This is not exact. mthca/mlx4 will generate an event immediately
> >only for unpolled CQE *that was not present in CQ at the
> >time the previous event was generated*.
> >So the answer for mthca is yes only if the CQE arrived
> >between calls to select and ibv_req_notify_cq.
> >
> 
> Is there any method by which you can query the total number of CQEs in 
> the CQ at an instantaneous point in time (ie., after you had called 
> ibv_req_notify_cq() to get notification of *new* CQEs)?

Not really - what are you trying to do?

-- 
MST


From vlad at lists.openfabrics.org  Wed May  2 02:37:40 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed,  2 May 2007 02:37:40 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070502-0200 daily build status
Message-ID: <20070502093740.53DEDE6089D@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:


From mst at dev.mellanox.co.il  Wed May  2 05:31:12 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 15:31:12 +0300
Subject: [ofa-general] [PATCH] ipoib/cm: compliance fix
Message-ID: <20070502123112.GI22292@mellanox.co.il>

IPoIB CM spec allows the use of a single connection in both active->passive and
passive->active directions.  Current code does not do this, but if the remote
ever tries to, we oops when we try to look up the passive connection.
Fix by checking qp_context before use.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

I noticed this bug while experimenting with changes to IPoIB/CM code.
Important enough for -stable?

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 0c4e59b..1778fd6 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -370,7 +370,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 
 	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
 		p = wc->qp->qp_context;
-		if (time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
+		if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
 			spin_lock_irqsave(&priv->lock, flags);
 			p->jiffies = jiffies;
 			/* Move this entry to list head, but do


-- 
MST


From dotanb at dev.mellanox.co.il  Wed May  2 06:08:48 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Wed, 02 May 2007 16:08:48 +0300
Subject: [ofa-general] We are seeing SYNDROME_LOCAL_PROT_ERR status on
	CQE	with Mellanox Arbel HCA in memfree mode
In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A016398AA@NAMAIL2.ad.lsil.com>
References: <01B9E81EECACE94DBBD0A556E768FB8A016398AA@NAMAIL2.ad.lsil.com>
Message-ID: <46388D60.5010803@dev.mellanox.co.il>

Batwara, Ashish wrote:
> Any idea why this error? We see this error when we use FMR? Are there
> any special setting that HCA needs to work with FMR?
>   
Check the WR of this completion:
do you have any violation in the scatter/gather element?
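
As a rough illustration of what such a violation looks like: a local
protection error is typically reported when a scatter/gather entry
references memory that is not covered by the memory region its lkey
names. The check below is a sketch with hypothetical types (fake_mr,
fake_sge), not the verbs structures themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified views of a memory region and an s/g entry. */
struct fake_mr  { uint64_t start; uint64_t length; uint32_t lkey; };
struct fake_sge { uint64_t addr;  uint32_t length; uint32_t lkey; };

/* Returns 1 when the entry lies entirely inside the region and the keys match. */
static int sge_within_mr(const struct fake_sge *sge, const struct fake_mr *mr)
{
	uint64_t offset;

	assert(sge != NULL && mr != NULL);

	if (sge->lkey != mr->lkey)
		return 0;
	if (sge->addr < mr->start)
		return 0;

	/* Compare offsets rather than end addresses to avoid overflow. */
	offset = sge->addr - mr->start;
	if (offset > mr->length)
		return 0;
	return sge->length <= mr->length - offset;
}
```

Running every entry of the failing WR through a check like this, against
the mapping that was current when the WR was posted, is one way to narrow
down which element is at fault.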


thanks
Dotan


From steffen.persvold at scali.com  Wed May  2 07:00:14 2007
From: steffen.persvold at scali.com (Steffen Persvold)
Date: Wed, 2 May 2007 10:00:14 -0400
Subject: [ofa-general] OFED 1.2 RC2 on rhel4u4 x86_64
Message-ID: <D6A583C768392A4D8B297C500CDD54B50157369E@mse11be1.mse11.exchange.ms>

Folks,
 
I used the build.sh script to build the above mentioned packages on rhel4u4 x86_64, but for some reason it only compiles 32bit libraries (even if the packages are named x86_64) :
 
# rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
x86_64
 
(after installing it) :
 
# file /usr/lib/libibverbs.so.1.0.0
/usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), not stripped

What did I do wrong ??
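
For reference, the 32-bit/64-bit distinction that file(1) reports comes
from the fifth byte of the ELF header, e_ident[EI_CLASS]. A small sketch
of that check, operating on bytes already read from the library file; the
function name elf_class_bits is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 32 or 64 for a recognizable ELF header, 0 otherwise. */
static int elf_class_bits(const unsigned char *hdr, size_t len)
{
	assert(hdr != NULL);

	if (len < 5)
		return 0;
	/* ELF magic: 0x7f 'E' 'L' 'F' */
	if (hdr[0] != 0x7f || hdr[1] != 'E' || hdr[2] != 'L' || hdr[3] != 'F')
		return 0;

	switch (hdr[4]) {   /* e_ident[EI_CLASS] */
	case 1:
		return 32;  /* ELFCLASS32 */
	case 2:
		return 64;  /* ELFCLASS64 */
	default:
		return 0;
	}
}
```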
 
Cheers,
Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070502/330b2548/attachment.html>

From vlad at dev.mellanox.co.il  Wed May  2 07:05:28 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 02 May 2007 17:05:28 +0300
Subject: [ofa-general] Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
In-Reply-To: <D6A583C768392A4D8B297C500CDD54B50157369E@mse11be1.mse11.exchange.ms>
References: <D6A583C768392A4D8B297C500CDD54B50157369E@mse11be1.mse11.exchange.ms>
Message-ID: <1178114728.14131.30.camel@vladsk-laptop>

Don't you have /usr/lib64/libibverbs.so.1.0.0?

Regards,
Vladimir

On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote:
> Folks,
>  
> I used the build.sh script to build the above mentioned packages on
> rhel4u4 x86_64, but for some reason it only compiles 32bit libraries
> (even if the packages are named x86_64) :
>  
> # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
> x86_64
>  
> (after installing it) :
>  
> # file /usr/lib/libibverbs.so.1.0.0
> /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel
> 80386, version 1 (SYSV), not stripped
> 
> What did I do wrong ??
>  
> Cheers,
> Steffen Persvold
> Technical Director Americas
> tel. 508-281-7100 x401
> fax. 508-281-7171
> 
> http://www.scali.com/
> Scaling the Linux datacenter
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


From jackm at dev.mellanox.co.il  Wed May  2 07:12:24 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 2 May 2007 17:12:24 +0300
Subject: [ofa-general] [PATCH]  libmlx4: fix post inline when posting a list
Message-ID: <200705021712.24400.jackm@dev.mellanox.co.il>

Need to set the inl parameter to zero for each inline post when posting
a wr-list of inlines, so that the value of inl reflects that specific
work request and is not cumulative.
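
The effect can be seen in a stripped-down sketch. The types inl_sge and
inl_wr and the function fake_post_inline_list are hypothetical, not the
libmlx4 code; the reset_per_wr switch exists only to contrast the old and
new behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified work request list for illustration. */
struct inl_sge { unsigned length; };
struct inl_wr {
	struct inl_wr *next;
	const struct inl_sge *sg_list;
	int num_sge;
	unsigned inline_len; /* filled in by the posting routine */
};

/*
 * Walks a list of work requests and records the inline byte count of
 * each one. With reset_per_wr == 0 the counter leaks from one request
 * into the next, which is the behaviour the patch removes.
 */
static void fake_post_inline_list(struct inl_wr *wr, int reset_per_wr)
{
	unsigned inl = 0;
	int i;

	for (; wr; wr = wr->next) {
		assert(wr->num_sge == 0 || wr->sg_list != NULL);
		if (reset_per_wr)
			inl = 0; /* the fix: start counting afresh for this WR */
		for (i = 0; i < wr->num_sge; ++i)
			inl += wr->sg_list[i].length;
		wr->inline_len = inl;
	}
}
```

For two requests of 8 and 4 inline bytes, the cumulative variant records
8 and 12, while the per-request reset records 8 and 4.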

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

diff --git a/src/userspace/libmlx4/src/qp.c b/src/userspace/libmlx4/src/qp.c
index 76abf75..a70e5f2 100644
--- a/src/userspace/libmlx4/src/qp.c
+++ b/src/userspace/libmlx4/src/qp.c
@@ -217,6 +217,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 			if (wr->num_sge) {
 				struct mlx4_wqe_inline_seg *seg = wqe;
 
+				inl = 0;
 				wqe += sizeof *seg;
 				for (i = 0; i < wr->num_sge; ++i) {
 					uint32_t len = wr->sg_list[i].length;


From jackm at dev.mellanox.co.il  Wed May  2 07:14:05 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 2 May 2007 17:14:05 +0300
Subject: [ofa-general] [patch] mlx4_ib: return proper num s/g entries for rq
	at create_qp
Message-ID: <200705021714.05933.jackm@dev.mellanox.co.il>

Fix number of scatter-gather entries returned for receive queue at qp creation.

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 53aedfb..33db96c 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -236,7 +236,7 @@ static int set_qp_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
 	cap->max_send_wr  = qp->sq.max;
 	cap->max_recv_wr  = qp->rq.max;
 	cap->max_send_sge = qp->sq.max_gs;
-	cap->max_recv_sge = qp->sq.max_gs;
+	cap->max_recv_sge = qp->rq.max_gs;
 	cap->max_inline_data = (1 << qp->sq.wqe_shift) - send_wqe_overhead(type) -
 		sizeof (struct mlx4_wqe_inline_seg);
 

From steffen.persvold at scali.com  Wed May  2 07:20:26 2007
From: steffen.persvold at scali.com (Steffen Persvold)
Date: Wed, 2 May 2007 10:20:26 -0400
Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
References: <D6A583C768392A4D8B297C500CDD54B50157369E@mse11be1.mse11.exchange.ms>
	<1178114728.14131.30.camel@vladsk-laptop>
Message-ID: <D6A583C768392A4D8B297C500CDD54B5015736A3@mse11be1.mse11.exchange.ms>

Nope :
 
 
[redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0
[redhat-release-4ES-5.5]#

So the RPM got built, but without 64bit libraries. Now if it was the other way around (i.e no 32bit libraries) I could have understood it (as 32bit is an option on x86_64), but not having the native 64bit libraries is not so easy to understand :)
 
cheers,
Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070502/4cf28f40/attachment.html>

From steffen.persvold at scali.com  Wed May  2 07:30:56 2007
From: steffen.persvold at scali.com (Steffen Persvold)
Date: Wed, 2 May 2007 10:30:56 -0400
Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
References: <D6A583C768392A4D8B297C500CDD54B50157369E@mse11be1.mse11.exchange.ms><1178114728.14131.30.camel@vladsk-laptop>
	<D6A583C768392A4D8B297C500CDD54B5015736A3@mse11be1.mse11.exchange.ms>
Message-ID: <D6A583C768392A4D8B297C500CDD54B5015736A5@mse11be1.mse11.exchange.ms>

Also,
 
If I look at the /etc/ld.so.conf.d/ofed.conf file I have :
 
# cat ofed.conf
/usr/lib
/usr/lib

 
which seems kinda weird ? :)
 
Cheers,
 
Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070502/ba9a0d8d/attachment.html>

From swise at opengridcomputing.com  Wed May  2 07:56:46 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 02 May 2007 09:56:46 -0500
Subject: [ofa-general] [PATCH librdmacm] rping: Transfer rkey/addr/len
	information in network byte order.
In-Reply-To: <46380D09.5070906@ichips.intel.com>
References: <1177515271.22094.33.camel@stevo-desktop>
	<1178060596.2309.195.camel@stevo-desktop>
	<46380D09.5070906@ichips.intel.com>
Message-ID: <1178117806.18609.25.camel@stevo-desktop>

On Tue, 2007-05-01 at 21:01 -0700, Sean Hefty wrote:
> > This patch regresses rping.  I failed to test it on AMD64<->AMD64 (ie
> > like endian systems).  I will provide another patch shortly, or we can
> > undo the broken rping patch for -rc3.  Whatever you think is best.
> 
> Let's fix it.  Please create a patch on top of this that fixes the problem.
> 
> Thanks
> 
> - Sean

Here is the fix.  Tested with:

ppc64 client, amd64 server
ppc64 server, amd64 client
amd64 client, amd64 server


---

Fix regression introduced by 88fc0cb21698dfb5d7660eecf7dddd0531fc8021.

From: Steve Wise <swise at opengridcomputing.com>

- swizzle memory info when sending it to peer.
- fixed printf format
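
For readers unfamiliar with the term, "swizzle" here means converting each
field to network byte order before it goes on the wire and back again on
receipt. A self-contained sketch of that round trip follows; the struct
wire_info and the sketch_htonll/sketch_ntohll helpers are illustrative
assumptions and not the rping definitions.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Wire format: every field is stored in network (big-endian) byte order. */
struct wire_info {
	uint64_t buf;
	uint32_t rkey;
	uint32_t size;
};

/* 64-bit host-to-network conversion that works on any host endianness. */
static uint64_t sketch_htonll(uint64_t host)
{
	unsigned char b[8];
	uint64_t net;
	int i;

	for (i = 0; i < 8; ++i)
		b[i] = (unsigned char)(host >> (56 - 8 * i));
	memcpy(&net, b, sizeof net);
	return net;
}

/* Inverse of sketch_htonll. */
static uint64_t sketch_ntohll(uint64_t net)
{
	unsigned char b[8];
	uint64_t host = 0;
	int i;

	memcpy(b, &net, sizeof net);
	for (i = 0; i < 8; ++i)
		host = (host << 8) | b[i];
	return host;
}

/* Sender side: convert host values to wire order. */
static void wire_pack(struct wire_info *w, uint64_t addr, uint32_t rkey, uint32_t size)
{
	assert(w != NULL);
	w->buf  = sketch_htonll(addr);
	w->rkey = htonl(rkey);
	w->size = htonl(size);
}

/* Receiver side: convert wire order back to host values. */
static void wire_unpack(const struct wire_info *w, uint64_t *addr, uint32_t *rkey, uint32_t *size)
{
	assert(w != NULL && addr != NULL && rkey != NULL && size != NULL);
	*addr = sketch_ntohll(w->buf);
	*rkey = ntohl(w->rkey);
	*size = ntohl(w->size);
}
```

Converting on only one side goes unnoticed between two big-endian hosts
but breaks as soon as a little-endian host is involved, which is why
testing like-endian pairs as well as mixed pairs matters.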

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 examples/rping.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/examples/rping.c b/examples/rping.c
index 17b0000..bccabb0 100644
--- a/examples/rping.c
+++ b/examples/rping.c
@@ -243,7 +243,7 @@ static int server_recv(struct rping_cb *
 	cb->remote_rkey = ntohl(cb->recv_buf.rkey);
 	cb->remote_addr = ntohll(cb->recv_buf.buf);
 	cb->remote_len  = ntohl(cb->recv_buf.size);
-	DEBUG_LOG("Received rkey %x addr %" PRIx64 "len %d from peer\n",
+	DEBUG_LOG("Received rkey %x addr %" PRIx64 " len %d from peer\n",
 		  cb->remote_rkey, cb->remote_addr, cb->remote_len);
 
 	if (cb->state <= CONNECTED || cb->state == RDMA_WRITE_COMPLETE)
@@ -614,12 +614,12 @@ static void rping_format_send(struct rpi
 {
 	struct rping_rdma_info *info = &cb->send_buf;
 
-	info->buf = (uint64_t) (unsigned long) buf;
-	info->rkey = mr->rkey;
-	info->size = cb->size;
+	info->buf = htonll((uint64_t) (unsigned long) buf);
+	info->rkey = htonl(mr->rkey);
+	info->size = htonl(cb->size);
 
 	DEBUG_LOG("RDMA addr %" PRIx64" rkey %x len %d\n",
-		  info->buf, info->rkey, info->size);
+		  ntohll(info->buf), ntohl(info->rkey), ntohl(info->size));
 }
 
 static int rping_test_server(struct rping_cb *cb)


From steffen.persvold at scali.com  Wed May  2 08:30:44 2007
From: steffen.persvold at scali.com (Steffen Persvold)
Date: Wed, 2 May 2007 11:30:44 -0400
Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
References: <D6A583C768392A4D8B297C500CDD54B50157369E@mse11be1.mse11.exchange.ms><1178114728.14131.30.camel@vladsk-laptop>
	<D6A583C768392A4D8B297C500CDD54B5015736A3@mse11be1.mse11.exchange.ms>
	<D6A583C768392A4D8B297C500CDD54B5015736A5@mse11be1.mse11.exchange.ms>
Message-ID: <D6A583C768392A4D8B297C500CDD54B5015736A7@mse11be1.mse11.exchange.ms>

Hmm,
 
so I tried something. I put :
 
build_32bit=0

into my ofed.conf file and rebuilt (build.sh -c ofed.conf). This time it built 64bit libraries, but it puts them in the wrong directory :
 
# rpm -qpl ../libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0

# file /usr/lib/libibverbs.so.1.0.0
/usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped

So what's up ??
 
Cheers,
Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070502/ddb13b6e/attachment.html>

From amitk at mellanox.co.il  Wed May  2 08:41:59 2007
From: amitk at mellanox.co.il (Amit Krig)
Date: Wed, 2 May 2007 18:41:59 +0300
Subject: [ofa-general] RE: bugs 541 and 465: slow IPoIB CM HA failover and
	eventual failing IPoIB HA
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA30376607F@xmb-sjc-216.amer.cisco.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA30376607F@xmb-sjc-216.amer.cisco.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C9016761E8@mtlexch01.mtl.com>

Thanks for the update, Yohad will reproduce this failure in our labs
 
Amit

________________________________

From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
Sent: Wednesday, May 02, 2007 5:08 AM
To: ewg at lists.openfabrics.org; Scott Weitzenkamp (sweitzen); Tziporet
Koren; Amit Krig; Michael S. Tsirkin; Roland Dreier (rdreier); Moni
Shoua; Moni Levy
Cc: openib
Subject: bugs 541 and 465: slow IPoIB CM HA failover and eventual
failing IPoIB HA


With IPoIB HA (both ipoibtools and ib-bonding), I am seeing slow IPoIB
CM HA failover, and eventually IPoIB stops working after enough
failovers.  I am running netperf -D traffic between two IPoIB HA hosts,
while flipping the 4 host IB ports one at a time (port 1 down, sleep,
port 1 up, sleep, ..., port 4 down, sleep, port 4 up, sleep) in a loop.
 
This is a very easy test to set up.  Can Mellanox and Voltaire please
try to reproduce the problem?
 
I think this problem must be fixed for OFED 1.2 rc3.
 
Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070502/a6f4167b/attachment.html>

From yosefe at voltaire.com  Wed May  2 08:54:26 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 02 May 2007 18:54:26 +0300
Subject: [ofa-general] [PATCH 0/3] pkey change handling
Message-ID: <4638B432.3060801@voltaire.com>

There are 3 patches in this series.

The issue addressed is keeping ipoib interfaces alive when the order of the port's pkey table changes.
Pkey-to-index queries were using a cache. However, the cache might not be up to date when
ipoib asks it to resolve a pkey, so a direct query must be used instead. On the other hand, in
build_mlx_header the pkey query must be atomic, so the driver keeps its own pkey cache,
which is non-blocking and always updated before ipoib is notified of the event.
In addition, remove the delayed pkey initialization thread and instead start the interface on a pkey
change notification.

1: ipoib: handle pkey change notifications by restarting the qp, which revalidates
            the qp's pkey index in case the pkeys were shuffled.
          remove the pkey polling thread and, upon pkey change events, bring up
            interfaces for which pkeys were not found.

2: core: remove the infiniband cache and replace it with blocking calls. update its users.

3: mthca: put a pkey cache in the provider.
          update the cache on pkey table SMPs.
          use it to answer pkey_query.
          use the cache in the build_mlx_header atomic context.
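
To make the failure mode concrete, here is a sketch of a pkey-to-index
lookup over a plain array standing in for the port's pkey table. The
function name find_pkey_index is hypothetical, and ignoring the
membership bit (bit 15) when comparing is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 and stores the index on success, -1 if the pkey is absent. */
static int find_pkey_index(const uint16_t *table, size_t len,
			   uint16_t pkey, uint16_t *index)
{
	size_t i;

	assert(table != NULL && index != NULL);

	for (i = 0; i < len; ++i) {
		/* Compare the 15-bit key value; bit 15 is the membership bit. */
		if ((table[i] & 0x7fff) == (pkey & 0x7fff)) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -1;
}
```

An index resolved before the SM reshuffles the table can point at a
different pkey afterwards, so the lookup has to be repeated against the
current table whenever a pkey change is reported.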


Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
--


From yosefe at voltaire.com  Wed May  2 08:56:23 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 02 May 2007 18:56:23 +0300
Subject: [ofa-general] [PATCH 1/3 v4] ipoib: restart interfaces on pkey
	change events
In-Reply-To: <4638B432.3060801@voltaire.com>
References: <4638B432.3060801@voltaire.com>
Message-ID: <4638B4A7.6080803@voltaire.com>

This issue was found during partitioning & SM fail over testing. The fix was tested with pkey reshuffling, removal and addition every few seconds concurrent with OFED restart.

Changes from v1:
	* added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
	* fixed a bug in device extraction from the work struct
	* removed some warnings in case they are caused due to missing PKEY as 
	  this seems like a valid flow now.

Changes from v2:
	* less/fixed debug prints - (MST remark)
	* flush_restart_qp stuff renamed to just restart_qp (MST remark)
	* the patch now depends on Roland's "IPoIB: Only handle async events for one port"
	
Changed from v3:
	* We now reschedule that qp_restart_task in case the PKEY cache was not 
	  coherent.
	  
Changed from v4:
	* We do not reschedule qp_restart_task, but assume that the cache is coherent
	* Do not restart QP if iface is not initialized, but do restart if not ADMIN_UP
	* Restart child interfaces first, so if parent is down child is still restarted
	* Remove the pkey polling thread and pkey delayed initialization
	* If an interface is brought up but pkey is not found, mark it with IPOIB_PKEY_NEEDED
	  and when a pkey event arrives, try to restart it

SM reconfiguration or failover possibly causes a shuffling of the values in the port pkey
table. The current implementation only queries for the index of the pkey once, when it creates
the device QP and after that moves it into working state, and hence does not address this
scenario. Fix this by using the PKEY_CHANGE event as a trigger to reconfigure the device QP.
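
The event-driven flow can be sketched as a small state machine. This is
an illustration under stated assumptions, not the patch itself: the
sk_iface type, the SK_* flags and the exact-match lookup are all
hypothetical, and restarting only when the index actually changed is a
simplification chosen for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SK_PKEY_ASSIGNED = 1, SK_PKEY_NEEDED = 2 };

/* Hypothetical per-interface state. */
struct sk_iface {
	uint16_t pkey;
	uint16_t pkey_index;
	unsigned flags;
	int qp_restarts; /* counts simulated QP reconfigurations */
};

/* Exact-match lookup in the current pkey table; 0 on success, -1 if absent. */
static int sk_lookup(const uint16_t *table, size_t len, uint16_t pkey, uint16_t *index)
{
	size_t i;

	for (i = 0; i < len; ++i) {
		if (table[i] == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -1;
}

/* Bring-up: if the pkey is missing, remember that it is still needed. */
static int sk_iface_up(struct sk_iface *ifc, const uint16_t *table, size_t len)
{
	assert(ifc != NULL && table != NULL);

	if (sk_lookup(table, len, ifc->pkey, &ifc->pkey_index)) {
		ifc->flags |= SK_PKEY_NEEDED;
		ifc->flags &= ~(unsigned)SK_PKEY_ASSIGNED;
		return -1;
	}
	ifc->flags |= SK_PKEY_ASSIGNED;
	ifc->flags &= ~(unsigned)SK_PKEY_NEEDED;
	return 0;
}

/* PKEY_CHANGE handler: re-resolve the index and reconfigure if it moved. */
static void sk_pkey_event(struct sk_iface *ifc, const uint16_t *table, size_t len)
{
	uint16_t old_index = ifc->pkey_index;
	unsigned was_assigned = ifc->flags & SK_PKEY_ASSIGNED;

	if (sk_iface_up(ifc, table, len))
		return; /* still missing; wait for the next event */
	if (!was_assigned || old_index != ifc->pkey_index)
		ifc->qp_restarts++; /* stands in for reconfiguring the QP */
}
```

No polling thread is needed in this shape: an interface whose pkey is
absent simply stays marked until an event delivers a table that contains
it.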

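When an interface is brought up, it may not find its pkey in the port pkey table. In that case it is
marked with IPOIB_PKEY_NEEDED and restarted when a pkey change event arrives.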


Signed-off-by: Moni Levy <monil at voltaire.com>
Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |   10 -
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  142 ++++++++-----------------
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |   11 -
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   11 +
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |   21 +--
 5 files changed, 74 insertions(+), 121 deletions(-)

Index: b/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-02 17:48:05.276713741 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-02 17:48:30.149283427 +0300
@@ -80,7 +80,7 @@ enum {
 	IPOIB_FLAG_INITIALIZED    = 1,
 	IPOIB_FLAG_ADMIN_UP 	  = 2,
 	IPOIB_PKEY_ASSIGNED 	  = 3,
-	IPOIB_PKEY_STOP 	  = 4,
+	IPOIB_PKEY_NEEDED		  = 4,
 	IPOIB_FLAG_SUBINTERFACE   = 5,
 	IPOIB_MCAST_RUN 	  = 6,
 	IPOIB_STOP_REAPER         = 7,
@@ -202,9 +202,9 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
+	struct work_struct pkey_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
 
@@ -333,12 +333,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
@@ -384,9 +385,6 @@ void ipoib_event(struct ib_event_handler
 int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey);
 int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
 
-void ipoib_pkey_poll(struct work_struct *work);
-int ipoib_pkey_dev_delay_open(struct net_device *dev);
-
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
 
 #define IPOIB_FLAGS_RC          0x80
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-02 17:48:05.276713741 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-02 18:04:16.512553724 +0300
@@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -441,28 +441,10 @@ int ipoib_ib_dev_open(struct net_device 
 	return 0;
 }
 
-static void ipoib_pkey_dev_check_presence(struct net_device *dev)
-{
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	u16 pkey_index = 0;
-
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index))
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
-	else
-		set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
-}
-
 int ipoib_ib_dev_up(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	ipoib_pkey_dev_check_presence(dev);
-
-	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
-		ipoib_dbg(priv, "PKEY is not assigned.\n");
-		return 0;
-	}
-
 	set_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
 
 	return ipoib_mcast_start_thread(dev);
@@ -477,16 +459,6 @@ int ipoib_ib_dev_down(struct net_device 
 	clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
 	netif_carrier_off(dev);
 
-	/* Shutdown the P_Key thread if still active */
-	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
-		mutex_lock(&pkey_mutex);
-		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
-		mutex_unlock(&pkey_mutex);
-		if (flush)
-			flush_workqueue(ipoib_workqueue);
-	}
-
 	ipoib_mcast_stop_thread(dev, flush);
 	ipoib_mcast_dev_flush(dev);
 
@@ -508,7 +480,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +553,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,14 +595,31 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
-		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
+	mutex_lock(&priv->vlan_mutex);
+
+	/* Flush any child interfaces */
+	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, restart_qp);
+
+	mutex_unlock(&priv->vlan_mutex);
+
+	/*
+	 * If the device is not initialized because it needs a pkey,
+	 * try to reopen it
+	 */
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
+		if (restart_qp && test_bit(IPOIB_PKEY_NEEDED, &priv->flags)
+		    && test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) {
+			/* if this iface needs pkey, try to assign it one */
+			ipoib_open(priv->dev);
+		}
+		else
+			ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
 
@@ -642,6 +632,12 @@ void ipoib_ib_dev_flush(struct work_stru
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (restart_qp) {
+		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) )
+			ipoib_ib_dev_stop(dev, 0);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +646,25 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
+	/* we only restart the QP in case of pkey change event */
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_task);
 
-	mutex_unlock(&priv->vlan_mutex);
+	/* restart the QP in case of pkey change event */
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -672,54 +679,3 @@ void ipoib_ib_dev_cleanup(struct net_dev
 	ipoib_transport_dev_cleanup(dev);
 }
 
-/*
- * Delayed P_Key Assigment Interim Support
- *
- * The following is initial implementation of delayed P_Key assigment
- * mechanism. It is using the same approach implemented for the multicast
- * group join. The single goal of this implementation is to quickly address
- * Bug #2507. This implementation will probably be removed when the P_Key
- * change async notification is available.
- */
-
-void ipoib_pkey_poll(struct work_struct *work)
-{
-	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
-	struct net_device *dev = priv->dev;
-
-	ipoib_pkey_dev_check_presence(dev);
-
-	if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
-		ipoib_open(dev);
-	else {
-		mutex_lock(&pkey_mutex);
-		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
-					   HZ);
-		mutex_unlock(&pkey_mutex);
-	}
-}
-
-int ipoib_pkey_dev_delay_open(struct net_device *dev)
-{
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-
-	/* Look for the interface pkey value in the IB Port P_Key table and */
-	/* set the interface pkey assigment flag                            */
-	ipoib_pkey_dev_check_presence(dev);
-
-	/* P_Key value not assigned yet - start polling */
-	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
-		mutex_lock(&pkey_mutex);
-		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
-		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
-				   HZ);
-		mutex_unlock(&pkey_mutex);
-		return 1;
-	}
-
-	return 0;
-}
Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-02 17:48:05.276713741 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-02 17:48:30.150283249 +0300
@@ -100,14 +100,11 @@ int ipoib_open(struct net_device *dev)
 
 	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
 
-	if (ipoib_pkey_dev_delay_open(dev))
-		return 0;
-
 	if (ipoib_ib_dev_open(dev))
-		return -EINVAL;
+		return test_bit(IPOIB_PKEY_NEEDED, &priv->flags) ? 0 : -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +149,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +987,7 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-02 17:48:05.277713563 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-02 17:48:30.151283071 +0300
@@ -232,9 +232,10 @@ static int ipoib_mcast_join_finish(struc
 		ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid),
 					 &mcast->mcmember.mgid);
 		if (ret < 0) {
-			ipoib_warn(priv, "couldn't attach QP to multicast group "
-				   IPOIB_GID_FMT "\n",
-				   IPOIB_GID_ARG(mcast->mcmember.mgid));
+			if (ret != -ENXIO) /* No pkey found */
+				ipoib_warn(priv, "couldn't attach QP to multicast group "
+					   IPOIB_GID_FMT "\n",
+					   IPOIB_GID_ARG(mcast->mcmember.mgid));
 
 			clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags);
 			return ret;
@@ -312,7 +313,7 @@ ipoib_mcast_sendonly_join_complete(int s
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
 
 	if (status) {
-		if (mcast->logcount++ < 20)
+		if (mcast->logcount++ < 20 && status != -ENXIO)
 			ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for "
 					IPOIB_GID_FMT ", status %d\n",
 					IPOIB_GID_ARG(mcast->mcmember.mgid), status);
@@ -420,7 +421,7 @@ static int ipoib_mcast_join_complete(int
 					", status %d\n",
 					IPOIB_GID_ARG(mcast->mcmember.mgid),
 					status);
-		} else {
+		} else if (status != -ENXIO) {
 			ipoib_warn(priv, "multicast join failed for "
 				   IPOIB_GID_FMT ", status %d\n",
 				   IPOIB_GID_ARG(mcast->mcmember.mgid),
Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-02 17:48:05.277713563 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-02 17:48:30.152282893 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,12 +47,12 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+		set_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
 	}
-	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	clear_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 
 	/* set correct QKey for QP */
 	qp_attr->qkey = priv->qkey;
@@ -103,12 +101,12 @@ int ipoib_init_qp(struct net_device *dev
 	 * The port has to be assigned to the respective IB partition in
 	 * advance.
 	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
+	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
 	if (ret) {
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+		set_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 		return ret;
 	}
-	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	clear_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 
 	qp_attr.qp_state = IB_QPS_INIT;
 	qp_attr.qkey = 0;
@@ -238,7 +236,7 @@ void ipoib_transport_dev_cleanup(struct 
 			ipoib_warn(priv, "ib_qp_destroy failed\n");
 
 		priv->qp = NULL;
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+		clear_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 	}
 
 	if (ib_destroy_cq(priv->cq))
@@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler
 		container_of(handler, struct ipoib_dev_priv, event_handler);
 
 	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
 	     record->event == IB_EVENT_PORT_ACTIVE ||
 	     record->event == IB_EVENT_LID_CHANGE  ||
 	     record->event == IB_EVENT_SM_CHANGE   ||
@@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler
 	    record->element.port_num == priv->port) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
+		record->element.port_num == priv->port) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_task);
 	}
 }


From yosefe at voltaire.com  Wed May  2 08:57:09 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 02 May 2007 18:57:09 +0300
Subject: [ofa-general] [PATCH 2/3] remove ib pkey gid and lmc cache
In-Reply-To: <4638B432.3060801@voltaire.com>
References: <4638B432.3060801@voltaire.com>
Message-ID: <4638B4D5.7050709@voltaire.com>

Remove IB cache from core

* Remove pkey, gid, and lmc caches
* Reimplement ib_find_gid and ib_find_pkey on top of blocking device queries
* Modify users of the cache to use these methods
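
A rough sketch of the query-based lookup named in the second bullet, assuming a
per-entry query callback in place of the device query. Everything here
(sketch_find_pkey, sketch_query, sketch_table, the ignore_membership flag) is
hypothetical and runs in user space. The flag is included because the removed
ib_find_cached_pkey() compared only the low 15 bits of each entry, while the
ib_find_pkey() added in this patch compares the full 16-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-entry query callback standing in for a blocking device query. */
typedef int (*query_pkey_fn)(void *ctx, int index, uint16_t *pkey);

/*
 * Scan the table one query at a time. Returns 0 and sets *index on a match,
 * the callback's error code if a query fails, or -1 if no entry matches
 * (the patch itself returns -ENOENT in that case).
 */
static int sketch_find_pkey(query_pkey_fn query, void *ctx, int tbl_len,
			    uint16_t pkey, int ignore_membership,
			    uint16_t *index)
{
	uint16_t mask = ignore_membership ? 0x7fff : 0xffff;
	uint16_t tmp;
	int i, ret;

	assert(query != NULL && index != NULL);
	for (i = 0; i < tbl_len; ++i) {
		ret = query(ctx, i, &tmp);
		if (ret)
			return ret;	/* propagate the query failure */
		if ((tmp & mask) == (pkey & mask)) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -1;
}

/* Example backing store: a plain array read one entry per call. */
struct sketch_table {
	const uint16_t *entries;
	int len;
};

static int sketch_query(void *ctx, int index, uint16_t *pkey)
{
	const struct sketch_table *t = ctx;

	if (index < 0 || index >= t->len)
		return -2;	/* arbitrary error code for an out-of-range index */
	*pkey = t->entries[index];
	return 0;
}
```

With a table holding 0xffff, 0x0001 and 0x8002, a search for 0x8001 fails when the
full value is compared and matches index 1 when the top (membership) bit is ignored.
Callers that relied on the old masking behavior may want to check which comparison
they expect.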


Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/cache.c         |  398 --------------------------------
 include/rdma/ib_cache.h                 |  118 ---------
 drivers/infiniband/core/Makefile        |    2 
 drivers/infiniband/core/cm.c            |    8 
 drivers/infiniband/core/cma.c           |    9 
 drivers/infiniband/core/core_priv.h     |    3 
 drivers/infiniband/core/device.c        |  143 ++++++++++-
 drivers/infiniband/core/mad.c           |    5 
 drivers/infiniband/core/multicast.c     |    3 
 drivers/infiniband/core/sa_query.c      |    3 
 drivers/infiniband/core/verbs.c         |    3 
 drivers/infiniband/hw/mthca/mthca_av.c  |    3 
 drivers/infiniband/hw/mthca/mthca_qp.c  |   10 
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |    3 
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |    2 
 drivers/infiniband/ulp/srp/ib_srp.c     |    6 
 include/rdma/ib_verbs.h                 |   37 ++
 17 files changed, 196 insertions(+), 560 deletions(-)

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-02 17:47:50.517342683 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-02 17:48:30.719181916 +0300
@@ -149,6 +149,20 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+
+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
+}
+
+
 /**
  * ib_alloc_device - allocate an IB device struct
  * @size:size of structure to allocate
@@ -592,6 +606,128 @@ int ib_modify_port(struct ib_device *dev
 }
 EXPORT_SYMBOL(ib_modify_port);
 
+/**
+ * ib_find_gid - Returns the port number and index of a GID
+ * @device: Device to query.
+ * @gid: GID to look for
+ * @port_num: Returned port number
+ * @index: Returned index
+ *
+ * ib_find_gid() searches the GID tables of all ports of @device for
+ * @gid and returns the port number and table index where it was found
+ */
+int ib_find_gid(struct ib_device *device,
+		       union ib_gid	    *gid,
+		       u8               *port_num,
+		       u16              *index)
+{
+	struct ib_port_attr *tprops = NULL;
+	union ib_gid tmp_gid;
+	int ret;
+	int port;
+	int i;
+
+	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
+
+	for (port = start_port(device); port <= end_port(device); ++port) {
+		ret = ib_query_port(device, port, tprops);
+		if (ret)
+			continue;
+
+		for (i = 0; i < tprops->gid_tbl_len; ++i) {
+			ret = ib_query_gid(device, port, i, &tmp_gid);
+			if (ret)
+				goto out;
+			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
+				*port_num = port;
+				*index = i;
+				ret = 0;
+				goto out;
+			}
+		} /* for i */
+	}
+	ret = -ENOENT;
+out:
+	kfree(tprops);
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_gid);
+
+/**
+ * ib_find_pkey - Returns the index of a PKey on a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @pkey: PKey to look for
+ * @index: Returned index
+ *
+ * ib_find_pkey() returns the index of @pkey in the pkey table
+ * on port @port_num
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8                port_num,
+			u16               pkey,
+			u16              *index)
+{
+	struct ib_port_attr *tprops = NULL;
+	int ret;
+	int i = -1;
+	u16 tmp_pkey;
+
+	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
+
+	ret = ib_query_port(device, port_num, tprops);
+	if (ret) {
+		printk(KERN_WARNING "ib_query_port failed, ret = %d\n", ret);
+		goto out;
+	}
+
+	for (i = 0; i < tprops->pkey_tbl_len; ++i) {
+		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
+		if (ret)
+			goto out;
+
+		if (pkey == tmp_pkey) {
+			*index = i;
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = -ENOENT;
+
+out:
+	kfree(tprops);
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_pkey);
+
+/**
+ * ib_query_lmc - Returns the LMC of a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @lmc: Returned LMC
+ *
+ * ib_query_lmc() returns the LID mask control associated
+ * with port @port_num
+ */
+int ib_query_lmc(struct ib_device *device,
+		      u8                port_num,
+		      u8                *lmc)
+{
+	struct ib_port_attr *tprops = NULL;
+	int ret;
+
+	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
+	ret = ib_query_port(device, port_num, tprops);
+	if (ret) goto err;
+
+	*lmc = tprops->lmc;
+err:
+	kfree(tprops);
+	return ret;
+
+}
+EXPORT_SYMBOL(ib_query_lmc);
+
 static int __init ib_core_init(void)
 {
 	int ret;
@@ -600,18 +736,11 @@ static int __init ib_core_init(void)
 	if (ret)
 		printk(KERN_WARNING "Couldn't create InfiniBand device class\n");
 
-	ret = ib_cache_setup();
-	if (ret) {
-		printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n");
-		ib_sysfs_cleanup();
-	}
-
 	return ret;
 }
 
 static void __exit ib_core_cleanup(void)
 {
-	ib_cache_cleanup();
 	ib_sysfs_cleanup();
 }
 
Index: b/drivers/infiniband/core/cache.c
===================================================================
--- a/drivers/infiniband/core/cache.c	2007-05-02 17:47:49.878456482 +0300
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,398 +0,0 @@
-/*
- * Copyright (c) 2004 Topspin Communications.  All rights reserved.
- * Copyright (c) 2005 Intel Corporation. All rights reserved.
- * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
- * Copyright (c) 2005 Voltaire, Inc. All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses.  You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- *
- * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $
- */
-
-#include <linux/module.h>
-#include <linux/errno.h>
-#include <linux/slab.h>
-
-#include <rdma/ib_cache.h>
-
-#include "core_priv.h"
-
-struct ib_pkey_cache {
-	int             table_len;
-	u16             table[0];
-};
-
-struct ib_gid_cache {
-	int             table_len;
-	union ib_gid    table[0];
-};
-
-struct ib_update_work {
-	struct work_struct work;
-	struct ib_device  *device;
-	u8                 port_num;
-};
-
-static inline int start_port(struct ib_device *device)
-{
-	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
-}
-
-static inline int end_port(struct ib_device *device)
-{
-	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
-		0 : device->phys_port_cnt;
-}
-
-int ib_get_cached_gid(struct ib_device *device,
-		      u8                port_num,
-		      int               index,
-		      union ib_gid     *gid)
-{
-	struct ib_gid_cache *cache;
-	unsigned long flags;
-	int ret = 0;
-
-	if (port_num < start_port(device) || port_num > end_port(device))
-		return -EINVAL;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-
-	cache = device->cache.gid_cache[port_num - start_port(device)];
-
-	if (index < 0 || index >= cache->table_len)
-		ret = -EINVAL;
-	else
-		*gid = cache->table[index];
-
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_get_cached_gid);
-
-int ib_find_cached_gid(struct ib_device *device,
-		       union ib_gid	*gid,
-		       u8               *port_num,
-		       u16              *index)
-{
-	struct ib_gid_cache *cache;
-	unsigned long flags;
-	int p, i;
-	int ret = -ENOENT;
-
-	*port_num = -1;
-	if (index)
-		*index = -1;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-
-	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
-		cache = device->cache.gid_cache[p];
-		for (i = 0; i < cache->table_len; ++i) {
-			if (!memcmp(gid, &cache->table[i], sizeof *gid)) {
-				*port_num = p + start_port(device);
-				if (index)
-					*index = i;
-				ret = 0;
-				goto found;
-			}
-		}
-	}
-found:
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_find_cached_gid);
-
-int ib_get_cached_pkey(struct ib_device *device,
-		       u8                port_num,
-		       int               index,
-		       u16              *pkey)
-{
-	struct ib_pkey_cache *cache;
-	unsigned long flags;
-	int ret = 0;
-
-	if (port_num < start_port(device) || port_num > end_port(device))
-		return -EINVAL;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-
-	cache = device->cache.pkey_cache[port_num - start_port(device)];
-
-	if (index < 0 || index >= cache->table_len)
-		ret = -EINVAL;
-	else
-		*pkey = cache->table[index];
-
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_get_cached_pkey);
-
-int ib_find_cached_pkey(struct ib_device *device,
-			u8                port_num,
-			u16               pkey,
-			u16              *index)
-{
-	struct ib_pkey_cache *cache;
-	unsigned long flags;
-	int i;
-	int ret = -ENOENT;
-
-	if (port_num < start_port(device) || port_num > end_port(device))
-		return -EINVAL;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-
-	cache = device->cache.pkey_cache[port_num - start_port(device)];
-
-	*index = -1;
-
-	for (i = 0; i < cache->table_len; ++i)
-		if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) {
-			*index = i;
-			ret = 0;
-			break;
-		}
-
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_find_cached_pkey);
-
-int ib_get_cached_lmc(struct ib_device *device,
-		      u8                port_num,
-		      u8                *lmc)
-{
-	unsigned long flags;
-	int ret = 0;
-
-	if (port_num < start_port(device) || port_num > end_port(device))
-		return -EINVAL;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-	*lmc = device->cache.lmc_cache[port_num - start_port(device)];
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_get_cached_lmc);
-
-static void ib_cache_update(struct ib_device *device,
-			    u8                port)
-{
-	struct ib_port_attr       *tprops = NULL;
-	struct ib_pkey_cache      *pkey_cache = NULL, *old_pkey_cache;
-	struct ib_gid_cache       *gid_cache = NULL, *old_gid_cache;
-	int                        i;
-	int                        ret;
-
-	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
-	if (!tprops)
-		return;
-
-	ret = ib_query_port(device, port, tprops);
-	if (ret) {
-		printk(KERN_WARNING "ib_query_port failed (%d) for %s\n",
-		       ret, device->name);
-		goto err;
-	}
-
-	pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
-			     sizeof *pkey_cache->table, GFP_KERNEL);
-	if (!pkey_cache)
-		goto err;
-
-	pkey_cache->table_len = tprops->pkey_tbl_len;
-
-	gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len *
-			    sizeof *gid_cache->table, GFP_KERNEL);
-	if (!gid_cache)
-		goto err;
-
-	gid_cache->table_len = tprops->gid_tbl_len;
-
-	for (i = 0; i < pkey_cache->table_len; ++i) {
-		ret = ib_query_pkey(device, port, i, pkey_cache->table + i);
-		if (ret) {
-			printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n",
-			       ret, device->name, i);
-			goto err;
-		}
-	}
-
-	for (i = 0; i < gid_cache->table_len; ++i) {
-		ret = ib_query_gid(device, port, i, gid_cache->table + i);
-		if (ret) {
-			printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n",
-			       ret, device->name, i);
-			goto err;
-		}
-	}
-
-	write_lock_irq(&device->cache.lock);
-
-	old_pkey_cache = device->cache.pkey_cache[port - start_port(device)];
-	old_gid_cache  = device->cache.gid_cache [port - start_port(device)];
-
-	device->cache.pkey_cache[port - start_port(device)] = pkey_cache;
-	device->cache.gid_cache [port - start_port(device)] = gid_cache;
-
-	device->cache.lmc_cache[port - start_port(device)] = tprops->lmc;
-
-	write_unlock_irq(&device->cache.lock);
-
-	kfree(old_pkey_cache);
-	kfree(old_gid_cache);
-	kfree(tprops);
-	return;
-
-err:
-	kfree(pkey_cache);
-	kfree(gid_cache);
-	kfree(tprops);
-}
-
-static void ib_cache_task(struct work_struct *_work)
-{
-	struct ib_update_work *work =
-		container_of(_work, struct ib_update_work, work);
-
-	ib_cache_update(work->device, work->port_num);
-	kfree(work);
-}
-
-static void ib_cache_event(struct ib_event_handler *handler,
-			   struct ib_event *event)
-{
-	struct ib_update_work *work;
-
-	if (event->event == IB_EVENT_PORT_ERR    ||
-	    event->event == IB_EVENT_PORT_ACTIVE ||
-	    event->event == IB_EVENT_LID_CHANGE  ||
-	    event->event == IB_EVENT_PKEY_CHANGE ||
-	    event->event == IB_EVENT_SM_CHANGE   ||
-	    event->event == IB_EVENT_CLIENT_REREGISTER) {
-		work = kmalloc(sizeof *work, GFP_ATOMIC);
-		if (work) {
-			INIT_WORK(&work->work, ib_cache_task);
-			work->device   = event->device;
-			work->port_num = event->element.port_num;
-			schedule_work(&work->work);
-		}
-	}
-}
-
-static void ib_cache_setup_one(struct ib_device *device)
-{
-	int p;
-
-	rwlock_init(&device->cache.lock);
-
-	device->cache.pkey_cache =
-		kmalloc(sizeof *device->cache.pkey_cache *
-			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
-	device->cache.gid_cache =
-		kmalloc(sizeof *device->cache.gid_cache *
-			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
-
-	device->cache.lmc_cache = kmalloc(sizeof *device->cache.lmc_cache *
-					  (end_port(device) -
-					   start_port(device) + 1),
-					  GFP_KERNEL);
-
-	if (!device->cache.pkey_cache || !device->cache.gid_cache ||
-	    !device->cache.lmc_cache) {
-		printk(KERN_WARNING "Couldn't allocate cache "
-		       "for %s\n", device->name);
-		goto err;
-	}
-
-	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
-		device->cache.pkey_cache[p] = NULL;
-		device->cache.gid_cache [p] = NULL;
-		ib_cache_update(device, p + start_port(device));
-	}
-
-	INIT_IB_EVENT_HANDLER(&device->cache.event_handler,
-			      device, ib_cache_event);
-	if (ib_register_event_handler(&device->cache.event_handler))
-		goto err_cache;
-
-	return;
-
-err_cache:
-	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
-		kfree(device->cache.pkey_cache[p]);
-		kfree(device->cache.gid_cache[p]);
-	}
-
-err:
-	kfree(device->cache.pkey_cache);
-	kfree(device->cache.gid_cache);
-	kfree(device->cache.lmc_cache);
-}
-
-static void ib_cache_cleanup_one(struct ib_device *device)
-{
-	int p;
-
-	ib_unregister_event_handler(&device->cache.event_handler);
-	flush_scheduled_work();
-
-	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
-		kfree(device->cache.pkey_cache[p]);
-		kfree(device->cache.gid_cache[p]);
-	}
-
-	kfree(device->cache.pkey_cache);
-	kfree(device->cache.gid_cache);
-	kfree(device->cache.lmc_cache);
-}
-
-static struct ib_client cache_client = {
-	.name   = "cache",
-	.add    = ib_cache_setup_one,
-	.remove = ib_cache_cleanup_one
-};
-
-int __init ib_cache_setup(void)
-{
-	return ib_register_client(&cache_client);
-}
-
-void __exit ib_cache_cleanup(void)
-{
-	ib_unregister_client(&cache_client);
-}
Index: b/include/rdma/ib_cache.h
===================================================================
--- a/include/rdma/ib_cache.h	2007-05-02 17:47:13.398954200 +0300
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,118 +0,0 @@
-/*
- * Copyright (c) 2004 Topspin Communications.  All rights reserved.
- * Copyright (c) 2005 Intel Corporation. All rights reserved.
- * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses.  You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- *
- * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $
- */
-
-#ifndef _IB_CACHE_H
-#define _IB_CACHE_H
-
-#include <rdma/ib_verbs.h>
-
-/**
- * ib_get_cached_gid - Returns a cached GID table entry
- * @device: The device to query.
- * @port_num: The port number of the device to query.
- * @index: The index into the cached GID table to query.
- * @gid: The GID value found at the specified index.
- *
- * ib_get_cached_gid() fetches the specified GID table entry stored in
- * the local software cache.
- */
-int ib_get_cached_gid(struct ib_device    *device,
-		      u8                   port_num,
-		      int                  index,
-		      union ib_gid        *gid);
-
-/**
- * ib_find_cached_gid - Returns the port number and GID table index where
- *   a specified GID value occurs.
- * @device: The device to query.
- * @gid: The GID value to search for.
- * @port_num: The port number of the device where the GID value was found.
- * @index: The index into the cached GID table where the GID was found.  This
- *   parameter may be NULL.
- *
- * ib_find_cached_gid() searches for the specified GID value in
- * the local software cache.
- */
-int ib_find_cached_gid(struct ib_device *device,
-		       union ib_gid	*gid,
-		       u8               *port_num,
-		       u16              *index);
-
-/**
- * ib_get_cached_pkey - Returns a cached PKey table entry
- * @device: The device to query.
- * @port_num: The port number of the device to query.
- * @index: The index into the cached PKey table to query.
- * @pkey: The PKey value found at the specified index.
- *
- * ib_get_cached_pkey() fetches the specified PKey table entry stored in
- * the local software cache.
- */
-int ib_get_cached_pkey(struct ib_device    *device_handle,
-		       u8                   port_num,
-		       int                  index,
-		       u16                 *pkey);
-
-/**
- * ib_find_cached_pkey - Returns the PKey table index where a specified
- *   PKey value occurs.
- * @device: The device to query.
- * @port_num: The port number of the device to search for the PKey.
- * @pkey: The PKey value to search for.
- * @index: The index into the cached PKey table where the PKey was found.
- *
- * ib_find_cached_pkey() searches the specified PKey table in
- * the local software cache.
- */
-int ib_find_cached_pkey(struct ib_device    *device,
-			u8                   port_num,
-			u16                  pkey,
-			u16                 *index);
-
-/**
- * ib_get_cached_lmc - Returns a cached lmc table entry
- * @device: The device to query.
- * @port_num: The port number of the device to query.
- * @lmc: The lmc value for the specified port for that device.
- *
- * ib_get_cached_lmc() fetches the specified lmc table entry stored in
- * the local software cache.
- */
-int ib_get_cached_lmc(struct ib_device *device,
-		      u8                port_num,
-		      u8                *lmc);
-
-#endif /* _IB_CACHE_H */
Index: b/include/rdma/ib_verbs.h
===================================================================
--- a/include/rdma/ib_verbs.h	2007-05-02 17:47:13.398954200 +0300
+++ b/include/rdma/ib_verbs.h	2007-05-02 17:48:30.741177998 +0300
@@ -1134,6 +1134,43 @@ int ib_modify_port(struct ib_device *dev
 		   struct ib_port_modify *port_modify);
 
 /**
+ * ib_find_gid - Returns the port number and index of a GID
+ * @device: Device to query.
+ * @gid: GID to look for
+ * @port_num: Returned port number
+ * @index: Returned index
+ *
+ * ib_find_gid() searches the GID tables of @device for @gid and
+ * returns the port number and table index where it was found
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		u8 *port_num, u16 *index);
+
+/**
+ * ib_find_pkey - Returns the index of a PKey on a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @pkey: PKey to look for
+ * @index: Returned index
+ *
+ * ib_find_pkey() returns the index of @pkey in the pkey table
+ * on port @port_num
+ */
+int ib_find_pkey(struct ib_device *device, u8 port_num,
+		 u16 pkey, u16 *index);
+
+/**
+ * ib_query_lmc - Returns the LMC of a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @lmc: Returned LMC
+ *
+ * ib_query_lmc() returns the LID mask control associated
+ * with port @port_num
+ */
+int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc);
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *
Index: b/drivers/infiniband/core/Makefile
===================================================================
--- a/drivers/infiniband/core/Makefile	2007-05-02 17:47:49.333553540 +0300
+++ b/drivers/infiniband/core/Makefile	2007-05-02 17:48:30.741177998 +0300
@@ -8,7 +8,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	
 					$(user_access-y)
 
 ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
-				device.o fmr_pool.o cache.o
+				device.o fmr_pool.o
 
 ib_mad-y :=			mad.o smi.o agent.o mad_rmpp.o
 
Index: b/drivers/infiniband/core/cm.c
===================================================================
--- a/drivers/infiniband/core/cm.c	2007-05-02 17:47:49.762477140 +0300
+++ b/drivers/infiniband/core/cm.c	2007-05-02 17:48:30.744177464 +0300
@@ -46,8 +46,8 @@
 #include <linux/spinlock.h>
 #include <linux/workqueue.h>
 
-#include <rdma/ib_cache.h>
 #include <rdma/ib_cm.h>
+#include <rdma/ib_verbs.h>
 #include "cm_msgs.h"
 
 MODULE_AUTHOR("Sean Hefty");
@@ -275,7 +275,7 @@ static int cm_init_av_by_path(struct ib_
 
 	read_lock_irqsave(&cm.device_lock, flags);
 	list_for_each_entry(cm_dev, &cm.device_list, list) {
-		if (!ib_find_cached_gid(cm_dev->device, &path->sgid,
+		if (!ib_find_gid(cm_dev->device, &path->sgid,
 					&p, NULL)) {
 			port = &cm_dev->port[p-1];
 			break;
@@ -286,7 +286,7 @@ static int cm_init_av_by_path(struct ib_
 	if (!port)
 		return -EINVAL;
 
-	ret = ib_find_cached_pkey(cm_dev->device, port->port_num,
+	ret = ib_find_pkey(cm_dev->device, port->port_num,
 				  be16_to_cpu(path->pkey), &av->pkey_index);
 	if (ret)
 		return ret;
@@ -1382,7 +1382,7 @@ static int cm_req_handler(struct cm_work
 	cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]);
 	ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av);
 	if (ret) {
-		ib_get_cached_gid(work->port->cm_dev->device,
+		ib_query_gid(work->port->cm_dev->device,
 				  work->port->port_num, 0, &work->path[0].sgid);
 		ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
 			       &work->path[0].sgid, sizeof work->path[0].sgid,
Index: b/drivers/infiniband/core/cma.c
===================================================================
--- a/drivers/infiniband/core/cma.c	2007-05-02 17:47:50.749301367 +0300
+++ b/drivers/infiniband/core/cma.c	2007-05-02 17:48:30.746177108 +0300
@@ -41,7 +41,6 @@
 
 #include <rdma/rdma_cm.h>
 #include <rdma/rdma_cm_ib.h>
-#include <rdma/ib_cache.h>
 #include <rdma/ib_cm.h>
 #include <rdma/ib_sa.h>
 #include <rdma/iw_cm.h>
@@ -325,7 +324,7 @@ static int cma_acquire_dev(struct rdma_i
 	}
 
 	list_for_each_entry(cma_dev, &dev_list, list) {
-		ret = ib_find_cached_gid(cma_dev->device, &gid,
+		ret = ib_find_gid(cma_dev->device, &gid,
 					 &id_priv->id.port_num, NULL);
 		if (!ret) {
 			ret = cma_set_qkey(cma_dev->device,
@@ -514,7 +513,7 @@ static int cma_ib_init_qp_attr(struct rd
 	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
 	int ret;
 
-	ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num,
+	ret = ib_find_pkey(id_priv->id.device, id_priv->id.port_num,
 				  ib_addr_get_pkey(dev_addr),
 				  &qp_attr->pkey_index);
 	if (ret)
@@ -1658,11 +1657,11 @@ static int cma_bind_loopback(struct rdma
 	cma_dev = list_entry(dev_list.next, struct cma_device, list);
 
 port_found:
-	ret = ib_get_cached_gid(cma_dev->device, p, 0, &gid);
+	ret = ib_query_gid(cma_dev->device, p, 0, &gid);
 	if (ret)
 		goto out;
 
-	ret = ib_get_cached_pkey(cma_dev->device, p, 0, &pkey);
+	ret = ib_query_pkey(cma_dev->device, p, 0, &pkey);
 	if (ret)
 		goto out;
 
Index: b/drivers/infiniband/core/mad.c
===================================================================
--- a/drivers/infiniband/core/mad.c	2007-05-02 17:47:50.423359423 +0300
+++ b/drivers/infiniband/core/mad.c	2007-05-02 17:48:30.748176751 +0300
@@ -34,7 +34,6 @@
  * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $
  */
 #include <linux/dma-mapping.h>
-#include <rdma/ib_cache.h>
 
 #include "mad_priv.h"
 #include "mad_rmpp.h"
@@ -1707,13 +1706,13 @@ static inline int rcv_has_same_gid(struc
 	if (!send_resp && rcv_resp) {
 		/* is request/response. */
 		if (!(attr.ah_flags & IB_AH_GRH)) {
-			if (ib_get_cached_lmc(device, port_num, &lmc))
+			if (ib_query_lmc(device, port_num, &lmc))
 				return 0;
 			return (!lmc || !((attr.src_path_bits ^
 					   rwc->wc->dlid_path_bits) &
 					  ((1 << lmc) - 1)));
 		} else {
-			if (ib_get_cached_gid(device, port_num,
+			if (ib_query_gid(device, port_num,
 					      attr.grh.sgid_index, &sgid))
 				return 0;
 			return !memcmp(sgid.raw, rwc->recv_buf.grh->dgid.raw,
Index: b/drivers/infiniband/core/multicast.c
===================================================================
--- a/drivers/infiniband/core/multicast.c	2007-05-02 17:47:51.014254173 +0300
+++ b/drivers/infiniband/core/multicast.c	2007-05-02 17:48:30.749176573 +0300
@@ -38,7 +38,6 @@
 #include <linux/bitops.h>
 #include <linux/random.h>
 
-#include <rdma/ib_cache.h>
 #include "sa.h"
 
 static void mcast_add_one(struct ib_device *device);
@@ -686,7 +685,7 @@ int ib_init_ah_from_mcmember(struct ib_d
 	u16 gid_index;
 	u8 p;
 
-	ret = ib_find_cached_gid(device, &rec->port_gid, &p, &gid_index);
+	ret = ib_find_gid(device, &rec->port_gid, &p, &gid_index);
 	if (ret)
 		return ret;
 
Index: b/drivers/infiniband/core/sa_query.c
===================================================================
--- a/drivers/infiniband/core/sa_query.c	2007-05-02 17:47:49.689490140 +0300
+++ b/drivers/infiniband/core/sa_query.c	2007-05-02 17:48:30.749176573 +0300
@@ -47,7 +47,6 @@
 #include <linux/workqueue.h>
 
 #include <rdma/ib_pack.h>
-#include <rdma/ib_cache.h>
 #include "sa.h"
 
 MODULE_AUTHOR("Roland Dreier");
@@ -477,7 +476,7 @@ int ib_init_ah_from_path(struct ib_devic
 		ah_attr->ah_flags = IB_AH_GRH;
 		ah_attr->grh.dgid = rec->dgid;
 
-		ret = ib_find_cached_gid(device, &rec->sgid, &port_num,
+		ret = ib_find_gid(device, &rec->sgid, &port_num,
 					 &gid_index);
 		if (ret)
 			return ret;
Index: b/drivers/infiniband/core/verbs.c
===================================================================
--- a/drivers/infiniband/core/verbs.c	2007-05-02 17:47:49.091596637 +0300
+++ b/drivers/infiniband/core/verbs.c	2007-05-02 17:48:30.750176395 +0300
@@ -43,7 +43,6 @@
 #include <linux/string.h>
 
 #include <rdma/ib_verbs.h>
-#include <rdma/ib_cache.h>
 
 int ib_rate_to_mult(enum ib_rate rate)
 {
@@ -159,7 +158,7 @@ int ib_init_ah_from_wc(struct ib_device 
 		ah_attr->ah_flags = IB_AH_GRH;
 		ah_attr->grh.dgid = grh->sgid;
 
-		ret = ib_find_cached_gid(device, &grh->dgid, &port_num,
+		ret = ib_find_gid(device, &grh->dgid, &port_num,
 					 &gid_index);
 		if (ret)
 			return ret;
Index: b/drivers/infiniband/hw/mthca/mthca_av.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_av.c	2007-05-02 17:47:53.157872352 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_av.c	2007-05-02 17:48:30.751176217 +0300
@@ -37,7 +37,6 @@
 #include <linux/slab.h>
 
 #include <rdma/ib_verbs.h>
-#include <rdma/ib_cache.h>
 
 #include "mthca_dev.h"
 
@@ -279,7 +278,7 @@ int mthca_read_ah(struct mthca_dev *dev,
 			(be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff;
 		header->grh.flow_label    =
 			ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff);
-		ib_get_cached_gid(&dev->ib_dev,
+		ib_query_gid(&dev->ib_dev,
 				  be32_to_cpu(ah->av->port_pd) >> 24,
 				  ah->av->gid_index % dev->limits.gid_table_len,
 				  &header->grh.source_gid);
Index: b/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-02 17:47:53.153873064 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-02 18:04:14.123981858 +0300
@@ -40,9 +40,8 @@
 
 #include <asm/io.h>
 
-#include <rdma/ib_verbs.h>
-#include <rdma/ib_cache.h>
 #include <rdma/ib_pack.h>
+#include <rdma/ib_verbs.h>
 
 #include "mthca_dev.h"
 #include "mthca_cmd.h"
@@ -1485,11 +1484,10 @@ static int build_mlx_header(struct mthca
 		sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE;
 	sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED);
 	if (!sqp->qp.ibqp.qp_num)
-		ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port,
-				   sqp->pkey_index, &pkey);
+		ib_query_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey);
 	else
-		ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port,
-				   wr->wr.ud.pkey_index, &pkey);
+		ib_query_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey);
+
 	sqp->ud_header.bth.pkey = cpu_to_be16(pkey);
 	sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn);
 	sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1));
Index: b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-02 17:47:52.042071098 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-02 17:48:30.753175861 +0300
@@ -33,7 +33,6 @@
  */
 
 #include <rdma/ib_cm.h>
-#include <rdma/ib_cache.h>
 #include <net/dst.h>
 #include <net/icmp.h>
 #include <linux/icmpv6.h>
@@ -759,7 +758,7 @@ static int ipoib_cm_modify_tx_init(struc
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
 	int qp_attr_mask, ret;
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index);
+	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index);
 	if (ret) {
 		ipoib_warn(priv, "pkey 0x%x not in cache: %d\n", priv->pkey, ret);
 		return ret;
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-02 17:48:30.150283249 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-02 17:48:30.754175683 +0300
@@ -38,7 +38,7 @@
 #include <linux/delay.h>
 #include <linux/dma-mapping.h>
 
-#include <rdma/ib_cache.h>
+#include <rdma/ib_verbs.h>
 
 #include "ipoib.h"
 
Index: b/drivers/infiniband/ulp/srp/ib_srp.c
===================================================================
--- a/drivers/infiniband/ulp/srp/ib_srp.c	2007-05-02 17:47:52.336018740 +0300
+++ b/drivers/infiniband/ulp/srp/ib_srp.c	2007-05-02 17:48:30.755175505 +0300
@@ -48,8 +48,6 @@
 #include <scsi/scsi_dbg.h>
 #include <scsi/srp.h>
 
-#include <rdma/ib_cache.h>
-
 #include "ib_srp.h"
 
 #define DRV_NAME	"ib_srp"
@@ -164,7 +162,7 @@ static int srp_init_qp(struct srp_target
 	if (!attr)
 		return -ENOMEM;
 
-	ret = ib_find_cached_pkey(target->srp_host->dev->dev,
+	ret = ib_find_pkey(target->srp_host->dev->dev,
 				  target->srp_host->port,
 				  be16_to_cpu(target->path.pkey),
 				  &attr->pkey_index);
@@ -1780,7 +1778,7 @@ static ssize_t srp_create_target(struct 
 	if (ret)
 		goto err;
 
-	ib_get_cached_gid(host->dev->dev, host->port, 0, &target->path.sgid);
+	ib_query_gid(host->dev->dev, host->port, 0, &target->path.sgid);
 
 	printk(KERN_DEBUG PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x "
 	       "service_id %016llx dgid %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
Index: b/drivers/infiniband/core/core_priv.h
===================================================================
--- a/drivers/infiniband/core/core_priv.h	2007-05-02 17:47:50.519342327 +0300
+++ b/drivers/infiniband/core/core_priv.h	2007-05-02 17:48:30.755175505 +0300
@@ -46,7 +46,4 @@ void ib_device_unregister_sysfs(struct i
 int  ib_sysfs_setup(void);
 void ib_sysfs_cleanup(void);
 
-int  ib_cache_setup(void);
-void ib_cache_cleanup(void);
-
 #endif /* _CORE_PRIV_H */


From yosefe at voltaire.com  Wed May  2 08:57:50 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 02 May 2007 18:57:50 +0300
Subject: [ofa-general] [PATCH 3/3] mthca: provider-level caching of pkeys
In-Reply-To: <4638B432.3060801@voltaire.com>
References: <4638B432.3060801@voltaire.com>
Message-ID: <4638B4FE.8010605@voltaire.com>

Add provider-level caching of pkeys to mthca

* have the driver intercept SMPs that are pkey table notifications,
  and update its internal cache with the new values.
* modify query_pkey to use this cache instead of doing a blocking HW
  call
* while creating a MLX QP, use this cache


Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/hw/mthca/mthca_dev.h      |   12 +
 drivers/infiniband/hw/mthca/mthca_mad.c      |    5 
 drivers/infiniband/hw/mthca/mthca_provider.c |  167 +++++++++++++++++++++++----
 drivers/infiniband/hw/mthca/mthca_qp.c       |    5 
 include/rdma/ib_smi.h                        |    1 
 5 files changed, 163 insertions(+), 27 deletions(-)

Index: b/drivers/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_dev.h	2007-05-02 17:47:52.931912600 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h	2007-05-02 17:48:31.525038376 +0300
@@ -49,6 +49,8 @@
 
 #include <asm/semaphore.h>
 
+#include <rdma/ib_smi.h>
+
 #include "mthca_provider.h"
 #include "mthca_doorbell.h"
 
@@ -287,6 +289,11 @@ struct mthca_catas_err {
 	struct list_head	list;
 };
 
+struct mthca_pkey_cache {
+	int		table_len;
+	u16		table[0];
+};
+
 extern struct mutex mthca_device_mutex;
 
 struct mthca_dev {
@@ -360,6 +367,9 @@ struct mthca_dev {
 	struct ib_ah         *sm_ah[MTHCA_MAX_PORTS];
 	spinlock_t            sm_lock;
 	u8                    rate[MTHCA_MAX_PORTS];
+
+	rwlock_t               pkey_cache_lock;
+	struct mthca_pkey_cache *pkey_cache[MTHCA_MAX_PORTS];
 };
 
 #ifdef CONFIG_INFINIBAND_MTHCA_DEBUG
@@ -585,6 +595,8 @@ int mthca_process_mad(struct ib_device *
 int mthca_create_agents(struct mthca_dev *dev);
 void mthca_free_agents(struct mthca_dev *dev);
 
+int mthca_cache_update(struct mthca_dev *mdev, struct ib_smp *smp, u8 port_num);
+
 static inline struct mthca_dev *to_mdev(struct ib_device *ibdev)
 {
 	return container_of(ibdev, struct mthca_dev, ib_dev);
Index: b/drivers/infiniband/hw/mthca/mthca_mad.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_mad.c	2007-05-02 17:47:53.067888380 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_mad.c	2007-05-02 17:48:31.525038376 +0300
@@ -134,6 +134,11 @@ static void smp_snoop(struct ib_device *
 		}
 
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) {
+
+			/* update pkey cache from a snooped MAD */
+			mthca_dbg(to_mdev(ibdev), "pkey change at port %d\n", port_num);
+			mthca_cache_update(to_mdev(ibdev), (struct ib_smp*) mad, port_num);
+
 			event.device           = ibdev;
 			event.event            = IB_EVENT_PKEY_CHANGE;
 			event.element.port_num = port_num;
Index: b/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_provider.c	2007-05-02 17:47:52.996901024 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c	2007-05-02 17:48:31.526038198 +0300
@@ -243,36 +243,27 @@ out:
 static int mthca_query_pkey(struct ib_device *ibdev,
 			    u8 port, u16 index, u16 *pkey)
 {
-	struct ib_smp *in_mad  = NULL;
-	struct ib_smp *out_mad = NULL;
-	int err = -ENOMEM;
-	u8 status;
+	struct mthca_dev * mdev;
+	unsigned long flags;
 
-	in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
-	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
-	if (!in_mad || !out_mad)
-		goto out;
+	mdev = to_mdev(ibdev);
+	read_lock_irqsave(&mdev->pkey_cache_lock, flags);
 
-	init_query_mad(in_mad);
-	in_mad->attr_id  = IB_SMP_ATTR_PKEY_TABLE;
-	in_mad->attr_mod = cpu_to_be32(index / 32);
+	if (port < 1 || port > mdev->ib_dev.phys_port_cnt ||
+		index >= mdev->pkey_cache[ port - 1 ]->table_len ) {
+		mthca_warn(mdev, "pkey request at %d[%d] is out of range %d[%d] - %d[%d]\n",
+					port, index,
+					1, 0,
+					mdev->ib_dev.phys_port_cnt, mdev->pkey_cache[ port - 1 ]->table_len -1);
 
-	err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1,
-			    port, NULL, NULL, in_mad, out_mad,
-			    &status);
-	if (err)
-		goto out;
-	if (status) {
-		err = -EINVAL;
-		goto out;
+		read_unlock_irqrestore(&mdev->pkey_cache_lock, flags);
+		return -EINVAL;
 	}
 
-	*pkey = be16_to_cpu(((__be16 *) out_mad->data)[index % 32]);
+	*pkey = mdev->pkey_cache[ port - 1 ]->table[ index ];
 
- out:
-	kfree(in_mad);
-	kfree(out_mad);
-	return err;
+	read_unlock_irqrestore(&mdev->pkey_cache_lock, flags);
+	return 0;
 }
 
 static int mthca_query_gid(struct ib_device *ibdev, u8 port,
@@ -1259,6 +1250,127 @@ out:
 	return err;
 }
 
+/*
+ * Initialize cache:
+ *  ask the SM for the table
+ */
+static int mthca_cache_init(struct mthca_dev *mdev)
+{
+	struct ib_smp *in_mad  = NULL;
+	struct ib_smp *out_mad = NULL;
+	struct ib_port_attr *tprops = NULL;
+	unsigned int i;
+	unsigned int tbl_len;
+
+	int err = -ENOMEM;
+	u8 status;
+
+	rwlock_init(&mdev->pkey_cache_lock);
+
+	mthca_dbg(mdev, "setting up PKey cache\n");
+
+	memset(mdev->pkey_cache, 0, sizeof mdev->pkey_cache);
+
+	tprops = kmalloc( sizeof * tprops, GFP_KERNEL );
+	in_mad  = kmalloc(sizeof *in_mad, GFP_KERNEL);
+	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
+
+	if (!tprops || !in_mad || !out_mad)
+		goto out;
+
+	for ( i = 0; i < mdev->ib_dev.phys_port_cnt; ++i ) {
+
+		/* find out how many pkeys this port holds */
+		err = mthca_query_port(&mdev->ib_dev, i+1, tprops);
+		if (err)
+			continue;
+
+		/* allocate cache */
+		tbl_len = tprops->pkey_tbl_len;
+		mdev->pkey_cache[ i ] = kmalloc(sizeof(struct mthca_pkey_cache)
+						+ tbl_len *	sizeof(u16), GFP_KERNEL);
+		if ( ! mdev->pkey_cache[ i ] )
+			goto out;
+
+		mdev->pkey_cache[ i ]->table_len = tbl_len;
+
+		while (tbl_len) {
+
+			/* send pkey query mad */
+			memset(in_mad, 0, sizeof * in_mad);
+			init_query_mad(in_mad);
+			in_mad->attr_id  = IB_SMP_ATTR_PKEY_TABLE;
+			in_mad->attr_mod = cpu_to_be32( (tbl_len-1) / IB_SMP_NUM_PKEY_ENTRIES);
+
+			err = mthca_MAD_IFC(mdev, 1, 1,
+				    i + 1, NULL, NULL, in_mad, out_mad,
+				    &status);
+
+			if (err || status)
+				break;
+
+			mthca_cache_update(mdev, out_mad, i + 1);
+			tbl_len -= IB_SMP_NUM_PKEY_ENTRIES;
+		}
+	}
+
+out:
+	kfree(in_mad);
+	kfree(out_mad);
+	kfree(tprops);
+	return err;
+}
+
+/*
+ * Destroy the pkey cache
+ */
+static void mthca_cache_destroy(struct mthca_dev *mdev)
+{
+	int i;
+	for ( i = 0; i < mdev->ib_dev.phys_port_cnt; ++i ) {
+		kfree( mdev->pkey_cache[ i ] );
+	}
+}
+
+/*
+ * We snooped a pkey-table mad
+ * extract the new pkey table, and update our internal cache
+ */
+int mthca_cache_update(struct mthca_dev *mdev, struct ib_smp *smp, u8 port_num)
+{
+	unsigned int table_offset;
+	unsigned long flags;
+	int i;
+	struct mthca_pkey_cache *pkey_cache;
+	u16	*entry;
+
+	table_offset = ( be32_to_cpu(smp->attr_mod) & 0xFFFF ) *
+										IB_SMP_NUM_PKEY_ENTRIES;
+
+	mthca_dbg(mdev, "port %d: new pkey table at offset %d\n",
+					port_num, table_offset);
+
+	write_lock_irqsave(&mdev->pkey_cache_lock, flags);
+
+	pkey_cache = mdev->pkey_cache[ port_num - 1 ];
+
+	if (pkey_cache->table_len < IB_SMP_NUM_PKEY_ENTRIES + table_offset) {
+		mthca_warn(mdev, "pkey table out of range - ignoring\n");
+		write_unlock_irqrestore(&mdev->pkey_cache_lock, flags);
+		return -EINVAL;
+	}
+
+	/* update the cache */
+	entry = pkey_cache->table + table_offset;
+	for ( i = 0; i < IB_SMP_NUM_PKEY_ENTRIES; ++i ) {
+		u16 pkey = be16_to_cpu ( *( ( (u16*)smp->data ) + i ) );
+		*(entry++) = pkey;
+	}
+
+	write_unlock_irqrestore(&mdev->pkey_cache_lock, flags);
+	return 0;
+}
+
 int mthca_register_device(struct mthca_dev *dev)
 {
 	int ret;
@@ -1365,6 +1477,12 @@ int mthca_register_device(struct mthca_d
 
 	mutex_init(&dev->cap_mask_mutex);
 
+	ret = mthca_cache_init(dev);
+	if (ret) {
+		mthca_cache_destroy(dev);
+		return ret;
+	}
+
 	ret = ib_register_device(&dev->ib_dev);
 	if (ret)
 		return ret;
@@ -1387,4 +1505,5 @@ void mthca_unregister_device(struct mthc
 {
 	mthca_stop_catas_poll(dev);
 	ib_unregister_device(&dev->ib_dev);
+	mthca_cache_destroy(dev);
 }
Index: b/include/rdma/ib_smi.h
===================================================================
--- a/include/rdma/ib_smi.h	2007-05-02 17:47:12.741071381 +0300
+++ b/include/rdma/ib_smi.h	2007-05-02 17:48:31.527038020 +0300
@@ -43,6 +43,7 @@
 
 #define IB_SMP_DATA_SIZE			64
 #define IB_SMP_MAX_PATH_HOPS			64
+#define IB_SMP_NUM_PKEY_ENTRIES		32
 
 struct ib_smp {
 	u8	base_version;
Index: b/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-02 17:48:30.752176039 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-02 17:48:31.528037842 +0300
@@ -41,7 +41,6 @@
 #include <asm/io.h>
 
 #include <rdma/ib_pack.h>
-#include <rdma/ib_verbs.h>
 
 #include "mthca_dev.h"
 #include "mthca_cmd.h"
@@ -1484,9 +1483,9 @@ static int build_mlx_header(struct mthca
 		sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE;
 	sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED);
 	if (!sqp->qp.ibqp.qp_num)
-		ib_query_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey);
+		dev->ib_dev.query_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey);
 	else
-		ib_query_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey);
+		dev->ib_dev.query_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey);
 
 	sqp->ud_header.bth.pkey = cpu_to_be16(pkey);
 	sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn);
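
For anyone who wants to try the caching scheme outside the driver, here is
a minimal userspace sketch of the same idea: a per-port table with a
flexible array member, updated one 32-entry block at a time and read back
by index. The names (pkey_cache_alloc, pkey_cache_update, pkey_cache_query)
are made up for illustration and are not part of the patch. ntohs() stands
in for be16_to_cpu(), and the rwlock the driver takes is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <arpa/inet.h>	/* ntohs() stands in for the kernel's be16_to_cpu() */

#define PKEY_BLOCK_ENTRIES 32	/* pkeys carried by one PKeyTable block */

/* Hypothetical userspace analogue of struct mthca_pkey_cache. */
struct pkey_cache {
	unsigned int table_len;
	uint16_t table[];	/* flexible array member, host byte order */
};

static struct pkey_cache *pkey_cache_alloc(unsigned int table_len)
{
	struct pkey_cache *c;

	c = calloc(1, sizeof *c + table_len * sizeof c->table[0]);
	if (c)
		c->table_len = table_len;
	return c;		/* NULL on allocation failure */
}

/* Copy one block of big-endian pkeys into the cache at block * 32. */
static int pkey_cache_update(struct pkey_cache *c, uint16_t block,
			     const uint16_t *be_data)
{
	unsigned int offset = (unsigned int)block * PKEY_BLOCK_ENTRIES;
	unsigned int i;

	assert(c != NULL && be_data != NULL);
	if (offset + PKEY_BLOCK_ENTRIES > c->table_len)
		return -EINVAL;	/* block lies outside the table: ignore it */

	for (i = 0; i < PKEY_BLOCK_ENTRIES; ++i)
		c->table[offset + i] = ntohs(be_data[i]);
	return 0;
}

/* Bounds-check the index before touching the table. */
static int pkey_cache_query(const struct pkey_cache *c, unsigned int index,
			    uint16_t *pkey)
{
	assert(c != NULL && pkey != NULL);
	if (index >= c->table_len)
		return -EINVAL;
	*pkey = c->table[index];
	return 0;
}
```

In this sketch a table of 64 entries accepts blocks 0 and 1 and rejects
block 2, and a query at index 64 fails with -EINVAL.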


From rick.jones2 at hp.com  Wed May  2 09:25:59 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Wed, 02 May 2007 09:25:59 -0700
Subject: [ofa-general] minutes from socket over RDMA discussion at workshop
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA3037660DF@xmb-sjc-216.amer.cisco.com>
References: <C98692FD98048C41885E0B0FACD9DFB80431C1CB@exnane01.hq.netapp.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3037660DF@xmb-sjc-216.amer.cisco.com>
Message-ID: <4638BB97.8050403@hp.com>

Scott Weitzenkamp (sweitzen) wrote:
> No, this is not right.  SDP has better latency and better throughput 
> than IPoIB CM, but also uses more CPU.

I can confirm that with some netperf numbers I've been gathering:

SD = Service Demand - usec of CPU consumed per KB or per transaction; smaller is better
SDx = Service Demand Xmit; SDr = Service Demand Recv
* back to back - no switch.
9k - 9000 byte MTU.  As it turns out, ... this
      is the _default_ in the version of the myri10ge driver given to the
      author to use.

		      RedHat Enterprise Linux 5

                              Bulk Transfer                  "Latency"
                          Unidir            Bidir
     Card          Mbit/s SDx   SDr   Mbit/s SDx   SDr   Tran/s SDx   SDr
---------------------------------------------------------------------------
                    nnnn  nnnnn nnnnn  nnnn  nnnnn nnnnn  nnnnn nnnnn nnnnn
  AD313A  IPoIB     2970  4.418 4.544  3530  3.59  3.95   19290 n/a   n/a
  AD313A  SDP       7810  0.453 1.048 12820  0.69  0.68   38030 26.29 26.29
  AD313A  SDP p=0   7810  0.346 0.527 12670  0.42  0.043  19380 n/a   n/a
  AD144A  IP
  Myri10G IP 9k     9320  0.862 0.949 10950  1.00  0.86   19260 19.67 16.18 *
  Myri10G IP 9k msi 9320  0.449 0.672 10840  0.63  0.62   19430 11.68 11.56
  Myri10G IP    msi 7020  0.525 1.790  9820  1.22  1.22   not measured

Service demand for IPoIB TCP_RR and SDP p=0 not shown here because
netperf could never hit the confidence intervals for CPU utilization.
See the raw data for details.

* No switch - systems connected back-to-back

p=0 - recv_poll and send_poll module parameters of ib_sdp set to zero
       to address CPU util issue for SDP_RR test, where the equivalent
       of an entire core was consumed on each side

msi - MSI or Message Signaled Interrupts enabled

1.2.0 version of the myri10ge driver
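
To make the unit concrete, here is a sketch of the arithmetic that "usec
of CPU per KB or per transaction" implies. It is not netperf's code; the
function name and parameters are made up, and it assumes CPU utilization
is given as a percentage averaged over all CPUs in the system.

```c
#include <assert.h>

/*
 * Hypothetical helper, not taken from netperf.
 *
 * cpu_util_pct: average CPU utilization during the run, 0..100,
 *               averaged over all CPUs
 * num_cpus:     number of CPUs in the system
 * elapsed_sec:  length of the run in seconds
 * units:        KB transferred (bulk tests) or transactions completed
 *               (request/response tests)
 *
 * Returns usec of CPU time consumed per unit.
 */
static double demo_service_demand(double cpu_util_pct, unsigned int num_cpus,
				  double elapsed_sec, double units)
{
	double cpu_usec;

	assert(units > 0.0);
	cpu_usec = (cpu_util_pct / 100.0) * num_cpus * elapsed_sec * 1e6;
	return cpu_usec / units;
}
```

For example, 25% utilization on a 4-CPU system over a 10 second run that
moved 10,000,000 KB works out to 1.0 usec/KB.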

wrt zero copy and sockets and such, long long ago, in an OS far away - HP-UX 
9.X, HP had a zero-copy for TCP/IP over FDDI.  This was when MTUs and page sizes 
were still well-aligned (both 4K, well 4K and change for the FDDI MTU).  The 
zero copy there was copy-on-write, and it was enabled only when an application 
made an explicit setsockopt() call telling the transport the application knew 
this was going to be happening.  Then it was up to the application to make sure 
it did not try to access the address range of a previous send() call until after 
it knew the transport was no longer referencing it.

ftp://ftp.cup.hp.com/dist/networking/briefs/copyavoid.pdf

rick jones


From rdreier at cisco.com  Wed May  2 10:16:24 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 02 May 2007 10:16:24 -0700
Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to
	use send with invalidate
In-Reply-To: <20070502070849.GO8447@mellanox.co.il> (Michael S. Tsirkin's
	message of "Wed, 2 May 2007 10:08:49 +0300")
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
	<20070502070849.GO8447@mellanox.co.il>
Message-ID: <ada4pmvi1sn.fsf@cisco.com>

 > > +			if (ib_wr->send_flags & IB_SEND_SOLICITED
 > > +				&& ib_wr->send_flags & IB_SEND_INVALIDATE) {
 > 
 > How about
 > 	if (ib_wr->send_flags & (IB_SEND_SOLICITED | IB_SEND_INVALIDATE))

These two aren't equivalent -- the first has an &&, yours works like ||.
Which is correct?
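
To spell out the difference, here is a small userspace sketch. The DEMO_
flag values are made up for illustration and are not the kernel's
definitions. If "both flags set" is what is wanted, the one-mask form has
to compare the result against the mask.

```c
#include <assert.h>

/* Made-up flag values, for illustration only. */
enum {
	DEMO_SEND_SOLICITED  = 1 << 2,
	DEMO_SEND_INVALIDATE = 1 << 4,
};

/* Original form: true only when BOTH flags are set. */
static int demo_both_set(int flags)
{
	return (flags & DEMO_SEND_SOLICITED) &&
	       (flags & DEMO_SEND_INVALIDATE);
}

/* Suggested form: true when EITHER flag is set. */
static int demo_any_set(int flags)
{
	return (flags & (DEMO_SEND_SOLICITED | DEMO_SEND_INVALIDATE)) != 0;
}

/* One-mask form that keeps the "both" meaning. */
static int demo_all_set(int flags)
{
	int mask = DEMO_SEND_SOLICITED | DEMO_SEND_INVALIDATE;

	return (flags & mask) == mask;
}

/* The cases where the three forms agree and disagree. */
static void demo_flag_checks(void)
{
	int one  = DEMO_SEND_SOLICITED;
	int both = DEMO_SEND_SOLICITED | DEMO_SEND_INVALIDATE;

	assert(!demo_both_set(one) && demo_any_set(one) && !demo_all_set(one));
	assert(demo_both_set(both) && demo_any_set(both) && demo_all_set(both));
}
```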


From mst at dev.mellanox.co.il  Wed May  2 10:18:29 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 20:18:29 +0300
Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache
In-Reply-To: <4638B4D5.7050709@voltaire.com>
References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com>
Message-ID: <20070502171829.GO22292@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCH 2/3] remove ib pkey gid and lmc cache
> 
> Remove IB cache from core
> 
> * Remove pkey, gid, and lmc caches
> * Rewrite ib_find_gid and ib_find_pkey over blocking device queries 
> * Modify users of the cache to use these methods
> 

That's what we wanted to do, all right.  But there are some issues here.

> Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
> ---
>  drivers/infiniband/core/cache.c         |  398 --------------------------------
>  include/rdma/ib_cache.h                 |  118 ---------
>  drivers/infiniband/core/Makefile        |    2 
>  drivers/infiniband/core/cm.c            |    8 
>  drivers/infiniband/core/cma.c           |    9 
>  drivers/infiniband/core/core_priv.h     |    3 
>  drivers/infiniband/core/device.c        |  143 ++++++++++-
>  drivers/infiniband/core/mad.c           |    5 
>  drivers/infiniband/core/multicast.c     |    3 
>  drivers/infiniband/core/sa_query.c      |    3 
>  drivers/infiniband/core/verbs.c         |    3 
>  drivers/infiniband/hw/mthca/mthca_av.c  |    3 
>  drivers/infiniband/hw/mthca/mthca_qp.c  |   10 
>  drivers/infiniband/ulp/ipoib/ipoib_cm.c |    3 
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c |    2 
>  drivers/infiniband/ulp/srp/ib_srp.c     |    6 
>  include/rdma/ib_verbs.h                 |   37 ++
>  17 files changed, 196 insertions(+), 560 deletions(-)
> 
> Index: b/drivers/infiniband/core/device.c
> ===================================================================
> --- a/drivers/infiniband/core/device.c	2007-05-02 17:47:50.517342683 +0300
> +++ b/drivers/infiniband/core/device.c	2007-05-02 17:48:30.719181916 +0300
> @@ -149,6 +149,20 @@ static int alloc_name(char *name)
>  	return 0;
>  }
>  
> +
> +static inline int start_port(struct ib_device *device)
> +{
> +	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
> +}
> +
> +
> +static inline int end_port(struct ib_device *device)
> +{
> +	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
> +		0 : device->phys_port_cnt;
> +}
> +
> +
>  /**
>   * ib_alloc_device - allocate an IB device struct
>   * @size:size of structure to allocate

No double-spacing please.
A single empty line is enough for separation.

> @@ -592,6 +606,128 @@ int ib_modify_port(struct ib_device *dev
>  }
>  EXPORT_SYMBOL(ib_modify_port);
>  
> +/**
> + * ib_find_gid - Returns the port number and index of a GID
> + * @device: Device to query.

Kill the "."

> + * @gid: GID to look for
> + * @port_num: Returned port number
> + * @index: Returned index
> + *
> + * ib_find_gid() returns the index of @pkey in the pkey table
> + * on port @port_num
> + */


The description is not really clear: it talks about @pkey and the pkey
table, which looks copied from ib_find_pkey. For comparison, here's what
we had for ib_find_cached_gid:
> - * ib_find_cached_gid - Returns the port number and GID table index where
> - *   a specified GID value occurs.
> - * @device: The device to query.
> - * @gid: The GID value to search for.
> - * @port_num: The port number of the device where the GID value was found.
> - * @index: The index into the cached GID table where the GID was found.  This
> - *   parameter may be NULL.
> - *
> - * ib_find_cached_gid() searches for the specified GID value in
> - * the local software cache.

So how about just removing the last 2 lines (which don't apply now)
and reusing the description as is?
	
> + int ib_find_gid(struct ib_device *device,
> +		       union ib_gid	    *gid,
> +		       u8               *port_num,
> +		       u16              *index)
> +{

What's going on with the alignment/whitespace here? There is a stray
space before "int", and the parameter lines mix tabs and spaces.

> +	struct ib_port_attr *tprops = NULL;
> +	union ib_gid tmp_gid;
> +	int ret;
> +	int port;
> +	int i;

Just one int will do:
	int i, port, ret;

> +	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);

Why GFP_ATOMIC?
What if the allocation fails? tprops is handed to ib_query_port()
without a NULL check.
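
A rough userspace sketch of the shape being asked for, not kernel code.
Everything named here (struct port_attr, union gid, query_port(),
query_gid(), find_gid(), the fake GID table) is a hypothetical stand-in
for the kernel types and verbs, and malloc() stands in for kmalloc().
It shows the allocation being checked before use, a single int
declaration, and @index being optional, as the old ib_find_cached_gid
documentation allowed:

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-ins for the kernel types; not the real API. */
struct port_attr {
	int gid_tbl_len;
};

union gid {
	unsigned char raw[16];
};

#define NUM_PORTS   2
#define GID_TBL_LEN 4

/* Fake per-port GID tables standing in for the hardware. */
static union gid fake_gid_table[NUM_PORTS][GID_TBL_LEN];

static int query_port(int port, struct port_attr *attr)
{
	if (port < 1 || port > NUM_PORTS)
		return -EINVAL;
	attr->gid_tbl_len = GID_TBL_LEN;
	return 0;
}

static int query_gid(int port, int index, union gid *gid)
{
	if (port < 1 || port > NUM_PORTS || index < 0 || index >= GID_TBL_LEN)
		return -EINVAL;
	*gid = fake_gid_table[port - 1][index];
	return 0;
}

/* Model of ib_find_gid(): allocation is checked, @index may be NULL. */
static int find_gid(const union gid *gid, unsigned char *port_num,
		    unsigned short *index)
{
	struct port_attr *tprops;
	union gid tmp_gid;
	int i, port, ret;

	/* kmalloc() in the kernel; the right GFP flag depends on the caller. */
	tprops = malloc(sizeof *tprops);
	if (!tprops)
		return -ENOMEM;

	for (port = 1; port <= NUM_PORTS; ++port) {
		ret = query_port(port, tprops);
		if (ret)
			continue;

		for (i = 0; i < tprops->gid_tbl_len; ++i) {
			ret = query_gid(port, i, &tmp_gid);
			if (ret)
				goto out;
			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
				*port_num = port;
				if (index)
					*index = i;
				ret = 0;
				goto out;
			}
		}
	}
	ret = -ENOENT;
out:
	free(tprops);
	return ret;
}
```

Whether GFP_ATOMIC is actually needed in the real function depends on
the contexts it is called from; the sketch does not settle that.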

> +
> +	for (port = start_port(device); port <= end_port(device); ++port) {
> +		ret = ib_query_port(device, port, tprops);
> +		if (ret)
> +			continue;
> +
> +		for (i = 0; i < tprops->gid_tbl_len; ++i) {
> +			ret = ib_query_gid(device, port, i, &tmp_gid);
> +			if (ret)
> +				goto out;
> +			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
> +				*port_num = port;
> +				*index = i;
> +				ret = 0;
> +				goto out;
> +			}
> +		} /* for i */

Kill the comment please.

> +	}
> +	ret = -ENOENT;
> +out:
> +	kfree(tprops);
> +	return ret;
> +}
> +EXPORT_SYMBOL(ib_find_gid);

Mostly the same comments apply to the other functions below.

> +/**
> + * ib_find_pkey - Returns the index of a PKey on a port
> + * @device: Device to query.
> + * @port_num: Port to query on
> + * @pkey: PKey to look for
> + * @index: Returned index
> + *
> + * ib_find_pkey() returns the index of @pkey in the pkey table
> + * on port @port_num
> + */
> +int ib_find_pkey(struct ib_device *device,
> +			u8                port_num,
> +			u16               pkey,
> +			u16              *index)
> +{
> +	struct ib_port_attr *tprops = NULL;
> +	int ret;
> +	int i = -1;

What does the -1 do here? The for loop sets i to 0 before it is ever read.

> +	u16 tmp_pkey;
> +
> +	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
> +
> +	ret = ib_query_port(device, port_num, tprops);
> +	if (ret) {
> +		printk(KERN_WARNING "ib_query_port failed , ret = %d\n", ret);
> +		goto out;
> +	}
> +
> +	for (i = 0; i < tprops->pkey_tbl_len; ++i) {
> +		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
> +		if (ret)
> +			goto out;
> +
> +		if (pkey == tmp_pkey) {
> +			*index = i;
> +			ret = 0;
> +			goto out;
> +		}
> +	}
> +	ret = -ENOENT;
> +
> +out:
> +	kfree(tprops);
> +	return ret;
> +}
> +EXPORT_SYMBOL(ib_find_pkey);
> +
> +/**
> + * ib_query_lmc - Returns the LMC of a port
> + * @device: Device to query.
> + * @port_num: Port to query on
> + * @lmc: Returned LMC
> + *
> + * ib_query_lmc() returns the LID mask control associated
> + * with port @port_num
> + */
> +int ib_query_lmc(struct ib_device *device,
> +		      u8                port_num,
> +		      u8                *lmc)
> +{
> +	struct ib_port_attr *tprops = NULL;
> +	int ret;
> +
> +	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
> +	ret = ib_query_port(device, port_num, tprops);
> +	if (ret) goto err;

goto belongs on a line of its own.

> +
> +	*lmc = tprops->lmc;
> +err:
> +	kfree(tprops);
> +	return ret;
> +
> +}
> +EXPORT_SYMBOL(ib_query_lmc);
> +
>  static int __init ib_core_init(void)
>  {
>  	int ret;
> @@ -600,18 +736,11 @@ static int __init ib_core_init(void)
>  	if (ret)
>  		printk(KERN_WARNING "Couldn't create InfiniBand device class\n");
>  
> -	ret = ib_cache_setup();
> -	if (ret) {
> -		printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n");
> -		ib_sysfs_cleanup();
> -	}
> -
>  	return ret;
>  }
>  
>  static void __exit ib_core_cleanup(void)
>  {
> -	ib_cache_cleanup();
>  	ib_sysfs_cleanup();
>  }
>  
> Index: b/drivers/infiniband/core/cache.c
> ===================================================================
> --- a/drivers/infiniband/core/cache.c	2007-05-02 17:47:49.878456482 +0300
> +++ /dev/null	1970-01-01 00:00:00.000000000 +0000
> @@ -1,398 +0,0 @@
> -/*
> - * Copyright (c) 2004 Topspin Communications.  All rights reserved.
> - * Copyright (c) 2005 Intel Corporation. All rights reserved.
> - * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
> - * Copyright (c) 2005 Voltaire, Inc. All rights reserved.
> - *
> - * This software is available to you under a choice of one of two
> - * licenses.  You may choose to be licensed under the terms of the GNU
> - * General Public License (GPL) Version 2, available from the file
> - * COPYING in the main directory of this source tree, or the
> - * OpenIB.org BSD license below:
> - *
> - *     Redistribution and use in source and binary forms, with or
> - *     without modification, are permitted provided that the following
> - *     conditions are met:
> - *
> - *      - Redistributions of source code must retain the above
> - *        copyright notice, this list of conditions and the following
> - *        disclaimer.
> - *
> - *      - Redistributions in binary form must reproduce the above
> - *        copyright notice, this list of conditions and the following
> - *        disclaimer in the documentation and/or other materials
> - *        provided with the distribution.
> - *
> - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> - * SOFTWARE.
> - *
> - * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $
> - */
> -
> -#include <linux/module.h>
> -#include <linux/errno.h>
> -#include <linux/slab.h>
> -
> -#include <rdma/ib_cache.h>
> -
> -#include "core_priv.h"
> -
> -struct ib_pkey_cache {
> -	int             table_len;
> -	u16             table[0];
> -};
> -
> -struct ib_gid_cache {
> -	int             table_len;
> -	union ib_gid    table[0];
> -};
> -
> -struct ib_update_work {
> -	struct work_struct work;
> -	struct ib_device  *device;
> -	u8                 port_num;
> -};
> -
> -static inline int start_port(struct ib_device *device)
> -{
> -	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
> -}
> -
> -static inline int end_port(struct ib_device *device)
> -{
> -	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
> -		0 : device->phys_port_cnt;
> -}
> -
> -int ib_get_cached_gid(struct ib_device *device,
> -		      u8                port_num,
> -		      int               index,
> -		      union ib_gid     *gid)
> -{
> -	struct ib_gid_cache *cache;
> -	unsigned long flags;
> -	int ret = 0;
> -
> -	if (port_num < start_port(device) || port_num > end_port(device))
> -		return -EINVAL;
> -
> -	read_lock_irqsave(&device->cache.lock, flags);
> -
> -	cache = device->cache.gid_cache[port_num - start_port(device)];
> -
> -	if (index < 0 || index >= cache->table_len)
> -		ret = -EINVAL;
> -	else
> -		*gid = cache->table[index];
> -
> -	read_unlock_irqrestore(&device->cache.lock, flags);
> -
> -	return ret;
> -}
> -EXPORT_SYMBOL(ib_get_cached_gid);
> -
> -int ib_find_cached_gid(struct ib_device *device,
> -		       union ib_gid	*gid,
> -		       u8               *port_num,
> -		       u16              *index)
> -{
> -	struct ib_gid_cache *cache;
> -	unsigned long flags;
> -	int p, i;
> -	int ret = -ENOENT;
> -
> -	*port_num = -1;
> -	if (index)
> -		*index = -1;
> -
> -	read_lock_irqsave(&device->cache.lock, flags);
> -
> -	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
> -		cache = device->cache.gid_cache[p];
> -		for (i = 0; i < cache->table_len; ++i) {
> -			if (!memcmp(gid, &cache->table[i], sizeof *gid)) {
> -				*port_num = p + start_port(device);
> -				if (index)
> -					*index = i;
> -				ret = 0;
> -				goto found;
> -			}
> -		}
> -	}
> -found:
> -	read_unlock_irqrestore(&device->cache.lock, flags);
> -
> -	return ret;
> -}
> -EXPORT_SYMBOL(ib_find_cached_gid);
> -
> -int ib_get_cached_pkey(struct ib_device *device,
> -		       u8                port_num,
> -		       int               index,
> -		       u16              *pkey)
> -{
> -	struct ib_pkey_cache *cache;
> -	unsigned long flags;
> -	int ret = 0;
> -
> -	if (port_num < start_port(device) || port_num > end_port(device))
> -		return -EINVAL;
> -
> -	read_lock_irqsave(&device->cache.lock, flags);
> -
> -	cache = device->cache.pkey_cache[port_num - start_port(device)];
> -
> -	if (index < 0 || index >= cache->table_len)
> -		ret = -EINVAL;
> -	else
> -		*pkey = cache->table[index];
> -
> -	read_unlock_irqrestore(&device->cache.lock, flags);
> -
> -	return ret;
> -}
> -EXPORT_SYMBOL(ib_get_cached_pkey);
> -
> -int ib_find_cached_pkey(struct ib_device *device,
> -			u8                port_num,
> -			u16               pkey,
> -			u16              *index)
> -{
> -	struct ib_pkey_cache *cache;
> -	unsigned long flags;
> -	int i;
> -	int ret = -ENOENT;
> -
> -	if (port_num < start_port(device) || port_num > end_port(device))
> -		return -EINVAL;
> -
> -	read_lock_irqsave(&device->cache.lock, flags);
> -
> -	cache = device->cache.pkey_cache[port_num - start_port(device)];
> -
> -	*index = -1;
> -
> -	for (i = 0; i < cache->table_len; ++i)
> -		if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) {
> -			*index = i;
> -			ret = 0;
> -			break;
> -		}
> -
> -	read_unlock_irqrestore(&device->cache.lock, flags);
> -
> -	return ret;
> -}
> -EXPORT_SYMBOL(ib_find_cached_pkey);
> -
> -int ib_get_cached_lmc(struct ib_device *device,
> -		      u8                port_num,
> -		      u8                *lmc)
> -{
> -	unsigned long flags;
> -	int ret = 0;
> -
> -	if (port_num < start_port(device) || port_num > end_port(device))
> -		return -EINVAL;
> -
> -	read_lock_irqsave(&device->cache.lock, flags);
> -	*lmc = device->cache.lmc_cache[port_num - start_port(device)];
> -	read_unlock_irqrestore(&device->cache.lock, flags);
> -
> -	return ret;
> -}
> -EXPORT_SYMBOL(ib_get_cached_lmc);
> -
> -static void ib_cache_update(struct ib_device *device,
> -			    u8                port)
> -{
> -	struct ib_port_attr       *tprops = NULL;
> -	struct ib_pkey_cache      *pkey_cache = NULL, *old_pkey_cache;
> -	struct ib_gid_cache       *gid_cache = NULL, *old_gid_cache;
> -	int                        i;
> -	int                        ret;
> -
> -	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
> -	if (!tprops)
> -		return;
> -
> -	ret = ib_query_port(device, port, tprops);
> -	if (ret) {
> -		printk(KERN_WARNING "ib_query_port failed (%d) for %s\n",
> -		       ret, device->name);
> -		goto err;
> -	}
> -
> -	pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
> -			     sizeof *pkey_cache->table, GFP_KERNEL);
> -	if (!pkey_cache)
> -		goto err;
> -
> -	pkey_cache->table_len = tprops->pkey_tbl_len;
> -
> -	gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len *
> -			    sizeof *gid_cache->table, GFP_KERNEL);
> -	if (!gid_cache)
> -		goto err;
> -
> -	gid_cache->table_len = tprops->gid_tbl_len;
> -
> -	for (i = 0; i < pkey_cache->table_len; ++i) {
> -		ret = ib_query_pkey(device, port, i, pkey_cache->table + i);
> -		if (ret) {
> -			printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n",
> -			       ret, device->name, i);
> -			goto err;
> -		}
> -	}
> -
> -	for (i = 0; i < gid_cache->table_len; ++i) {
> -		ret = ib_query_gid(device, port, i, gid_cache->table + i);
> -		if (ret) {
> -			printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n",
> -			       ret, device->name, i);
> -			goto err;
> -		}
> -	}
> -
> -	write_lock_irq(&device->cache.lock);
> -
> -	old_pkey_cache = device->cache.pkey_cache[port - start_port(device)];
> -	old_gid_cache  = device->cache.gid_cache [port - start_port(device)];
> -
> -	device->cache.pkey_cache[port - start_port(device)] = pkey_cache;
> -	device->cache.gid_cache [port - start_port(device)] = gid_cache;
> -
> -	device->cache.lmc_cache[port - start_port(device)] = tprops->lmc;
> -
> -	write_unlock_irq(&device->cache.lock);
> -
> -	kfree(old_pkey_cache);
> -	kfree(old_gid_cache);
> -	kfree(tprops);
> -	return;
> -
> -err:
> -	kfree(pkey_cache);
> -	kfree(gid_cache);
> -	kfree(tprops);
> -}
> -
> -static void ib_cache_task(struct work_struct *_work)
> -{
> -	struct ib_update_work *work =
> -		container_of(_work, struct ib_update_work, work);
> -
> -	ib_cache_update(work->device, work->port_num);
> -	kfree(work);
> -}
> -
> -static void ib_cache_event(struct ib_event_handler *handler,
> -			   struct ib_event *event)
> -{
> -	struct ib_update_work *work;
> -
> -	if (event->event == IB_EVENT_PORT_ERR    ||
> -	    event->event == IB_EVENT_PORT_ACTIVE ||
> -	    event->event == IB_EVENT_LID_CHANGE  ||
> -	    event->event == IB_EVENT_PKEY_CHANGE ||
> -	    event->event == IB_EVENT_SM_CHANGE   ||
> -	    event->event == IB_EVENT_CLIENT_REREGISTER) {
> -		work = kmalloc(sizeof *work, GFP_ATOMIC);
> -		if (work) {
> -			INIT_WORK(&work->work, ib_cache_task);
> -			work->device   = event->device;
> -			work->port_num = event->element.port_num;
> -			schedule_work(&work->work);
> -		}
> -	}
> -}
> -
> -static void ib_cache_setup_one(struct ib_device *device)
> -{
> -	int p;
> -
> -	rwlock_init(&device->cache.lock);
> -
> -	device->cache.pkey_cache =
> -		kmalloc(sizeof *device->cache.pkey_cache *
> -			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
> -	device->cache.gid_cache =
> -		kmalloc(sizeof *device->cache.gid_cache *
> -			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
> -
> -	device->cache.lmc_cache = kmalloc(sizeof *device->cache.lmc_cache *
> -					  (end_port(device) -
> -					   start_port(device) + 1),
> -					  GFP_KERNEL);
> -
> -	if (!device->cache.pkey_cache || !device->cache.gid_cache ||
> -	    !device->cache.lmc_cache) {
> -		printk(KERN_WARNING "Couldn't allocate cache "
> -		       "for %s\n", device->name);
> -		goto err;
> -	}
> -
> -	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
> -		device->cache.pkey_cache[p] = NULL;
> -		device->cache.gid_cache [p] = NULL;
> -		ib_cache_update(device, p + start_port(device));
> -	}
> -
> -	INIT_IB_EVENT_HANDLER(&device->cache.event_handler,
> -			      device, ib_cache_event);
> -	if (ib_register_event_handler(&device->cache.event_handler))
> -		goto err_cache;
> -
> -	return;
> -
> -err_cache:
> -	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
> -		kfree(device->cache.pkey_cache[p]);
> -		kfree(device->cache.gid_cache[p]);
> -	}
> -
> -err:
> -	kfree(device->cache.pkey_cache);
> -	kfree(device->cache.gid_cache);
> -	kfree(device->cache.lmc_cache);
> -}
> -
> -static void ib_cache_cleanup_one(struct ib_device *device)
> -{
> -	int p;
> -
> -	ib_unregister_event_handler(&device->cache.event_handler);
> -	flush_scheduled_work();
> -
> -	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
> -		kfree(device->cache.pkey_cache[p]);
> -		kfree(device->cache.gid_cache[p]);
> -	}
> -
> -	kfree(device->cache.pkey_cache);
> -	kfree(device->cache.gid_cache);
> -	kfree(device->cache.lmc_cache);
> -}
> -
> -static struct ib_client cache_client = {
> -	.name   = "cache",
> -	.add    = ib_cache_setup_one,
> -	.remove = ib_cache_cleanup_one
> -};
> -
> -int __init ib_cache_setup(void)
> -{
> -	return ib_register_client(&cache_client);
> -}
> -
> -void __exit ib_cache_cleanup(void)
> -{
> -	ib_unregister_client(&cache_client);
> -}
> Index: b/include/rdma/ib_cache.h
> ===================================================================
> --- a/include/rdma/ib_cache.h	2007-05-02 17:47:13.398954200 +0300
> +++ /dev/null	1970-01-01 00:00:00.000000000 +0000
> @@ -1,118 +0,0 @@
> -/*
> - * Copyright (c) 2004 Topspin Communications.  All rights reserved.
> - * Copyright (c) 2005 Intel Corporation. All rights reserved.
> - * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
> - *
> - * This software is available to you under a choice of one of two
> - * licenses.  You may choose to be licensed under the terms of the GNU
> - * General Public License (GPL) Version 2, available from the file
> - * COPYING in the main directory of this source tree, or the
> - * OpenIB.org BSD license below:
> - *
> - *     Redistribution and use in source and binary forms, with or
> - *     without modification, are permitted provided that the following
> - *     conditions are met:
> - *
> - *      - Redistributions of source code must retain the above
> - *        copyright notice, this list of conditions and the following
> - *        disclaimer.
> - *
> - *      - Redistributions in binary form must reproduce the above
> - *        copyright notice, this list of conditions and the following
> - *        disclaimer in the documentation and/or other materials
> - *        provided with the distribution.
> - *
> - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> - * SOFTWARE.
> - *
> - * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $
> - */
> -
> -#ifndef _IB_CACHE_H
> -#define _IB_CACHE_H
> -
> -#include <rdma/ib_verbs.h>
> -
> -/**
> - * ib_get_cached_gid - Returns a cached GID table entry
> - * @device: The device to query.
> - * @port_num: The port number of the device to query.
> - * @index: The index into the cached GID table to query.
> - * @gid: The GID value found at the specified index.
> - *
> - * ib_get_cached_gid() fetches the specified GID table entry stored in
> - * the local software cache.
> - */
> -int ib_get_cached_gid(struct ib_device    *device,
> -		      u8                   port_num,
> -		      int                  index,
> -		      union ib_gid        *gid);
> -
> -/**
> - * ib_find_cached_gid - Returns the port number and GID table index where
> - *   a specified GID value occurs.
> - * @device: The device to query.
> - * @gid: The GID value to search for.
> - * @port_num: The port number of the device where the GID value was found.
> - * @index: The index into the cached GID table where the GID was found.  This
> - *   parameter may be NULL.
> - *
> - * ib_find_cached_gid() searches for the specified GID value in
> - * the local software cache.
> - */
> -int ib_find_cached_gid(struct ib_device *device,
> -		       union ib_gid	*gid,
> -		       u8               *port_num,
> -		       u16              *index);
> -
> -/**
> - * ib_get_cached_pkey - Returns a cached PKey table entry
> - * @device: The device to query.
> - * @port_num: The port number of the device to query.
> - * @index: The index into the cached PKey table to query.
> - * @pkey: The PKey value found at the specified index.
> - *
> - * ib_get_cached_pkey() fetches the specified PKey table entry stored in
> - * the local software cache.
> - */
> -int ib_get_cached_pkey(struct ib_device    *device_handle,
> -		       u8                   port_num,
> -		       int                  index,
> -		       u16                 *pkey);
> -
> -/**
> - * ib_find_cached_pkey - Returns the PKey table index where a specified
> - *   PKey value occurs.
> - * @device: The device to query.
> - * @port_num: The port number of the device to search for the PKey.
> - * @pkey: The PKey value to search for.
> - * @index: The index into the cached PKey table where the PKey was found.
> - *
> - * ib_find_cached_pkey() searches the specified PKey table in
> - * the local software cache.
> - */
> -int ib_find_cached_pkey(struct ib_device    *device,
> -			u8                   port_num,
> -			u16                  pkey,
> -			u16                 *index);
> -
> -/**
> - * ib_get_cached_lmc - Returns a cached lmc table entry
> - * @device: The device to query.
> - * @port_num: The port number of the device to query.
> - * @lmc: The lmc value for the specified port for that device.
> - *
> - * ib_get_cached_lmc() fetches the specified lmc table entry stored in
> - * the local software cache.
> - */
> -int ib_get_cached_lmc(struct ib_device *device,
> -		      u8                port_num,
> -		      u8                *lmc);
> -
> -#endif /* _IB_CACHE_H */
> Index: b/include/rdma/ib_verbs.h
> ===================================================================
> --- a/include/rdma/ib_verbs.h	2007-05-02 17:47:13.398954200 +0300
> +++ b/include/rdma/ib_verbs.h	2007-05-02 17:48:30.741177998 +0300
> @@ -1134,6 +1134,43 @@ int ib_modify_port(struct ib_device *dev
>  		   struct ib_port_modify *port_modify);
>  
>  /**
> + * ib_find_gid - Returns the port number and index of a GID
> + * @device: Device to query.
> + * @gid: GID to look for
> + * @port_num: Returned port number
> + * @index: Returned index
> + *
> + * ib_find_gid() returns the index of @pkey in the pkey table
> + * on port @port_num
> + */
> + int ib_find_gid(struct ib_device *device, union ib_gid *gid,
> +            u8 *port_num, u16 *index);
> +
> +/**
> + * ib_find_pkey - Returns the index of a PKey on a port
> + * @device: Device to query.
> + * @port_num: Port to query on
> + * @pkey: PKey to look for
> + * @index: Returned index
> + *
> + * ib_find_pkey() returns the index of @pkey in the pkey table
> + * on port @port_num
> + */
> +int ib_find_pkey(struct ib_device *device,	u8 port_num,
> +			  u16 pkey, u16 *index);
> +
> +/**
> + * ib_query_lmc - Returns the LMC of a port
> + * @device: Device to query.
> + * @port_num: Port to query on
> + * @lmc: Returned LMC
> + *
> + * ib_query_lmc() returns the LID mask control associated
> + * with port @port_num
> + */
> +int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc);
> +

I don't think we need this one in ib_verbs.h - it just does query_port once.
Let's keep the API simple. The only user is in mad.c - move
it there and make it static.
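
As a rough userspace illustration of that suggestion (again not kernel
code), the sketch below uses hypothetical stand-ins: struct port_attr
and query_port() play the role of struct ib_port_attr and
ib_query_port(), and get_lmc() and same_path_bits() are made-up names.
It shows a file-local helper that does one port query and reads the LMC,
feeding the path-bits comparison used in rcv_has_same_gid():

```c
#include <errno.h>

/* Hypothetical stand-ins for struct ib_port_attr and ib_query_port(). */
struct port_attr {
	unsigned char lmc;
};

static int query_port(int port_num, struct port_attr *attr)
{
	if (port_num < 1 || port_num > 2)
		return -EINVAL;
	attr->lmc = (port_num == 1) ? 0 : 3;	/* made-up values */
	return 0;
}

/*
 * File-local helper: one port query, then read the LMC field.
 * The attribute struct lives on the stack only to keep the sketch short.
 */
static int get_lmc(int port_num, unsigned char *lmc)
{
	struct port_attr attr;
	int ret;

	ret = query_port(port_num, &attr);
	if (ret)
		return ret;

	*lmc = attr.lmc;
	return 0;
}

/* The path-bits comparison from rcv_has_same_gid(), using that helper. */
static int same_path_bits(int port_num, unsigned char src_path_bits,
			  unsigned char dlid_path_bits)
{
	unsigned char lmc;

	if (get_lmc(port_num, &lmc))
		return 0;
	return !lmc || !((src_path_bits ^ dlid_path_bits) & ((1 << lmc) - 1));
}
```

Keeping the helper static leaves nothing new to export from ib_verbs.h.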

> +/**
>   * ib_alloc_pd - Allocates an unused protection domain.
>   * @device: The device on which to allocate the protection domain.
>   *
> Index: b/drivers/infiniband/core/Makefile
> ===================================================================
> --- a/drivers/infiniband/core/Makefile	2007-05-02 17:47:49.333553540 +0300
> +++ b/drivers/infiniband/core/Makefile	2007-05-02 17:48:30.741177998 +0300
> @@ -8,7 +8,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	
>  					$(user_access-y)
>  
>  ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
> -				device.o fmr_pool.o cache.o
> +				device.o fmr_pool.o
>  
>  ib_mad-y :=			mad.o smi.o agent.o mad_rmpp.o
>  
> Index: b/drivers/infiniband/core/cm.c
> ===================================================================
> --- a/drivers/infiniband/core/cm.c	2007-05-02 17:47:49.762477140 +0300
> +++ b/drivers/infiniband/core/cm.c	2007-05-02 17:48:30.744177464 +0300
> @@ -46,8 +46,8 @@
>  #include <linux/spinlock.h>
>  #include <linux/workqueue.h>
>  
> -#include <rdma/ib_cache.h>
>  #include <rdma/ib_cm.h>
> +#include <rdma/ib_verbs.h>
>  #include "cm_msgs.h"
>  
>  MODULE_AUTHOR("Sean Hefty");
> @@ -275,7 +275,7 @@ static int cm_init_av_by_path(struct ib_
>  
>  	read_lock_irqsave(&cm.device_lock, flags);
>  	list_for_each_entry(cm_dev, &cm.device_list, list) {
> -		if (!ib_find_cached_gid(cm_dev->device, &path->sgid,
> +		if (!ib_find_gid(cm_dev->device, &path->sgid,
>  					&p, NULL)) {
>  			port = &cm_dev->port[p-1];
>  			break;
> @@ -286,7 +286,7 @@ static int cm_init_av_by_path(struct ib_
>  	if (!port)
>  		return -EINVAL;
>  
> -	ret = ib_find_cached_pkey(cm_dev->device, port->port_num,
> +	ret = ib_find_pkey(cm_dev->device, port->port_num,
>  				  be16_to_cpu(path->pkey), &av->pkey_index);
>  	if (ret)
>  		return ret;
> @@ -1382,7 +1382,7 @@ static int cm_req_handler(struct cm_work
>  	cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]);
>  	ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av);
>  	if (ret) {
> -		ib_get_cached_gid(work->port->cm_dev->device,
> +		ib_query_gid(work->port->cm_dev->device,
>  				  work->port->port_num, 0, &work->path[0].sgid);
>  		ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>  			       &work->path[0].sgid, sizeof work->path[0].sgid,
> Index: b/drivers/infiniband/core/cma.c
> ===================================================================
> --- a/drivers/infiniband/core/cma.c	2007-05-02 17:47:50.749301367 +0300
> +++ b/drivers/infiniband/core/cma.c	2007-05-02 17:48:30.746177108 +0300
> @@ -41,7 +41,6 @@
>  
>  #include <rdma/rdma_cm.h>
>  #include <rdma/rdma_cm_ib.h>
> -#include <rdma/ib_cache.h>
>  #include <rdma/ib_cm.h>
>  #include <rdma/ib_sa.h>
>  #include <rdma/iw_cm.h>
> @@ -325,7 +324,7 @@ static int cma_acquire_dev(struct rdma_i
>  	}
>  
>  	list_for_each_entry(cma_dev, &dev_list, list) {
> -		ret = ib_find_cached_gid(cma_dev->device, &gid,
> +		ret = ib_find_gid(cma_dev->device, &gid,
>  					 &id_priv->id.port_num, NULL);
>  		if (!ret) {
>  			ret = cma_set_qkey(cma_dev->device,
> @@ -514,7 +513,7 @@ static int cma_ib_init_qp_attr(struct rd
>  	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
>  	int ret;
>  
> -	ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num,
> +	ret = ib_find_pkey(id_priv->id.device, id_priv->id.port_num,
>  				  ib_addr_get_pkey(dev_addr),
>  				  &qp_attr->pkey_index);
>  	if (ret)
> @@ -1658,11 +1657,11 @@ static int cma_bind_loopback(struct rdma
>  	cma_dev = list_entry(dev_list.next, struct cma_device, list);
>  
>  port_found:
> -	ret = ib_get_cached_gid(cma_dev->device, p, 0, &gid);
> +	ret = ib_query_gid(cma_dev->device, p, 0, &gid);
>  	if (ret)
>  		goto out;
>  
> -	ret = ib_get_cached_pkey(cma_dev->device, p, 0, &pkey);
> +	ret = ib_query_pkey(cma_dev->device, p, 0, &pkey);
>  	if (ret)
>  		goto out;
>  
> Index: b/drivers/infiniband/core/mad.c
> ===================================================================
> --- a/drivers/infiniband/core/mad.c	2007-05-02 17:47:50.423359423 +0300
> +++ b/drivers/infiniband/core/mad.c	2007-05-02 17:48:30.748176751 +0300
> @@ -34,7 +34,6 @@
>   * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $
>   */
>  #include <linux/dma-mapping.h>
> -#include <rdma/ib_cache.h>
>  
>  #include "mad_priv.h"
>  #include "mad_rmpp.h"
> @@ -1707,13 +1706,13 @@ static inline int rcv_has_same_gid(struc
>  	if (!send_resp && rcv_resp) {
>  		/* is request/response. */
>  		if (!(attr.ah_flags & IB_AH_GRH)) {
> -			if (ib_get_cached_lmc(device, port_num, &lmc))
> +			if (ib_query_lmc(device, port_num, &lmc))

Just call ib_query_port() here.

>  				return 0;
>  			return (!lmc || !((attr.src_path_bits ^
>  					   rwc->wc->dlid_path_bits) &
>  					  ((1 << lmc) - 1)));
>  		} else {
> -			if (ib_get_cached_gid(device, port_num,
> +			if (ib_query_gid(device, port_num,
>  					      attr.grh.sgid_index, &sgid))
>  				return 0;
>  			return !memcmp(sgid.raw, rwc->recv_buf.grh->dgid.raw,
> Index: b/drivers/infiniband/core/multicast.c
> ===================================================================
> --- a/drivers/infiniband/core/multicast.c	2007-05-02 17:47:51.014254173 +0300
> +++ b/drivers/infiniband/core/multicast.c	2007-05-02 17:48:30.749176573 +0300
> @@ -38,7 +38,6 @@
>  #include <linux/bitops.h>
>  #include <linux/random.h>
>  
> -#include <rdma/ib_cache.h>
>  #include "sa.h"
>  
>  static void mcast_add_one(struct ib_device *device);
> @@ -686,7 +685,7 @@ int ib_init_ah_from_mcmember(struct ib_d
>  	u16 gid_index;
>  	u8 p;
>  
> -	ret = ib_find_cached_gid(device, &rec->port_gid, &p, &gid_index);
> +	ret = ib_find_gid(device, &rec->port_gid, &p, &gid_index);
>  	if (ret)
>  		return ret;
>  
> Index: b/drivers/infiniband/core/sa_query.c
> ===================================================================
> --- a/drivers/infiniband/core/sa_query.c	2007-05-02 17:47:49.689490140 +0300
> +++ b/drivers/infiniband/core/sa_query.c	2007-05-02 17:48:30.749176573 +0300
> @@ -47,7 +47,6 @@
>  #include <linux/workqueue.h>
>  
>  #include <rdma/ib_pack.h>
> -#include <rdma/ib_cache.h>
>  #include "sa.h"
>  
>  MODULE_AUTHOR("Roland Dreier");
> @@ -477,7 +476,7 @@ int ib_init_ah_from_path(struct ib_devic
>  		ah_attr->ah_flags = IB_AH_GRH;
>  		ah_attr->grh.dgid = rec->dgid;
>  
> -		ret = ib_find_cached_gid(device, &rec->sgid, &port_num,
> +		ret = ib_find_gid(device, &rec->sgid, &port_num,
>  					 &gid_index);
>  		if (ret)
>  			return ret;
> Index: b/drivers/infiniband/core/verbs.c
> ===================================================================
> --- a/drivers/infiniband/core/verbs.c	2007-05-02 17:47:49.091596637 +0300
> +++ b/drivers/infiniband/core/verbs.c	2007-05-02 17:48:30.750176395 +0300
> @@ -43,7 +43,6 @@
>  #include <linux/string.h>
>  
>  #include <rdma/ib_verbs.h>
> -#include <rdma/ib_cache.h>
>  
>  int ib_rate_to_mult(enum ib_rate rate)
>  {
> @@ -159,7 +158,7 @@ int ib_init_ah_from_wc(struct ib_device 
>  		ah_attr->ah_flags = IB_AH_GRH;
>  		ah_attr->grh.dgid = grh->sgid;
>  
> -		ret = ib_find_cached_gid(device, &grh->dgid, &port_num,
> +		ret = ib_find_gid(device, &grh->dgid, &port_num,
>  					 &gid_index);
>  		if (ret)
>  			return ret;
> Index: b/drivers/infiniband/hw/mthca/mthca_av.c
> ===================================================================
> --- a/drivers/infiniband/hw/mthca/mthca_av.c	2007-05-02 17:47:53.157872352 +0300
> +++ b/drivers/infiniband/hw/mthca/mthca_av.c	2007-05-02 17:48:30.751176217 +0300
> @@ -37,7 +37,6 @@
>  #include <linux/slab.h>
>  
>  #include <rdma/ib_verbs.h>
> -#include <rdma/ib_cache.h>
>  
>  #include "mthca_dev.h"
>  
> @@ -279,7 +278,7 @@ int mthca_read_ah(struct mthca_dev *dev,
>  			(be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff;
>  		header->grh.flow_label    =
>  			ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff);
> -		ib_get_cached_gid(&dev->ib_dev,
> +		ib_query_gid(&dev->ib_dev,
>  				  be32_to_cpu(ah->av->port_pd) >> 24,
>  				  ah->av->gid_index % dev->limits.gid_table_len,
>  				  &header->grh.source_gid);
> Index: b/drivers/infiniband/hw/mthca/mthca_qp.c
> ===================================================================
> --- a/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-02 17:47:53.153873064 +0300
> +++ b/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-02 18:04:14.123981858 +0300
> @@ -40,9 +40,8 @@
>  
>  #include <asm/io.h>
>  
> -#include <rdma/ib_verbs.h>
> -#include <rdma/ib_cache.h>
>  #include <rdma/ib_pack.h>
> +#include <rdma/ib_verbs.h>
>  
>  #include "mthca_dev.h"
>  #include "mthca_cmd.h"
> @@ -1485,11 +1484,10 @@ static int build_mlx_header(struct mthca
>  		sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE;
>  	sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED);
>  	if (!sqp->qp.ibqp.qp_num)
> -		ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port,
> -				   sqp->pkey_index, &pkey);
> +		ib_query_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey);
>  	else
> -		ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port,
> -				   wr->wr.ud.pkey_index, &pkey);
> +		ib_query_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey);
> +
>  	sqp->ud_header.bth.pkey = cpu_to_be16(pkey);
>  	sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn);
>  	sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1));
> Index: b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> ===================================================================
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-02 17:47:52.042071098 +0300
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-02 17:48:30.753175861 +0300
> @@ -33,7 +33,6 @@
>   */
>  
>  #include <rdma/ib_cm.h>
> -#include <rdma/ib_cache.h>
>  #include <net/dst.h>
>  #include <net/icmp.h>
>  #include <linux/icmpv6.h>
> @@ -759,7 +758,7 @@ static int ipoib_cm_modify_tx_init(struc
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  	struct ib_qp_attr qp_attr;
>  	int qp_attr_mask, ret;
> -	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index);
> +	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index);
>  	if (ret) {
>  		ipoib_warn(priv, "pkey 0x%x not in cache: %d\n", priv->pkey, ret);
>  		return ret;
> Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> ===================================================================
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-02 17:48:30.150283249 +0300
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-02 17:48:30.754175683 +0300
> @@ -38,7 +38,7 @@
>  #include <linux/delay.h>
>  #include <linux/dma-mapping.h>
>  
> -#include <rdma/ib_cache.h>
> +#include <rdma/ib_verbs.h>
>  
>  #include "ipoib.h"
>  
> Index: b/drivers/infiniband/ulp/srp/ib_srp.c
> ===================================================================
> --- a/drivers/infiniband/ulp/srp/ib_srp.c	2007-05-02 17:47:52.336018740 +0300
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c	2007-05-02 17:48:30.755175505 +0300
> @@ -48,8 +48,6 @@
>  #include <scsi/scsi_dbg.h>
>  #include <scsi/srp.h>
>  
> -#include <rdma/ib_cache.h>
> -
>  #include "ib_srp.h"
>  
>  #define DRV_NAME	"ib_srp"
> @@ -164,7 +162,7 @@ static int srp_init_qp(struct srp_target
>  	if (!attr)
>  		return -ENOMEM;
>  
> -	ret = ib_find_cached_pkey(target->srp_host->dev->dev,
> +	ret = ib_find_pkey(target->srp_host->dev->dev,
>  				  target->srp_host->port,
>  				  be16_to_cpu(target->path.pkey),
>  				  &attr->pkey_index);
> @@ -1780,7 +1778,7 @@ static ssize_t srp_create_target(struct 
>  	if (ret)
>  		goto err;
>  
> -	ib_get_cached_gid(host->dev->dev, host->port, 0, &target->path.sgid);
> +	ib_query_gid(host->dev->dev, host->port, 0, &target->path.sgid);
>  
>  	printk(KERN_DEBUG PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x "
>  	       "service_id %016llx dgid %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
> Index: b/drivers/infiniband/core/core_priv.h
> ===================================================================
> --- a/drivers/infiniband/core/core_priv.h	2007-05-02 17:47:50.519342327 +0300
> +++ b/drivers/infiniband/core/core_priv.h	2007-05-02 17:48:30.755175505 +0300
> @@ -46,7 +46,4 @@ void ib_device_unregister_sysfs(struct i
>  int  ib_sysfs_setup(void);
>  void ib_sysfs_cleanup(void);
>  
> -int  ib_cache_setup(void);
> -void ib_cache_cleanup(void);
> -
>  #endif /* _CORE_PRIV_H */
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

-- 
MST


From swise at opengridcomputing.com  Wed May  2 10:30:46 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 02 May 2007 12:30:46 -0500
Subject: [ofa-general] man pages for the rdma-cm
Message-ID: <1178127046.18609.107.camel@stevo-desktop>

Sean, 

Are there man pages for the rdma-cm in the pipeline?  I think it would
be great (a requirement?) to have these for ofed-1.2, since we do have
the other verbs man pages.

I don't know whether this is in progress or whether we are looking for
volunteers...

<steve looks away not making eye contact> ;-)


Steve.


From swise at opengridcomputing.com  Wed May  2 10:37:40 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 02 May 2007 12:37:40 -0500
Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to
	use send with invalidate
In-Reply-To: <ada4pmvi1sn.fsf@cisco.com>
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
	<20070502070849.GO8447@mellanox.co.il>  <ada4pmvi1sn.fsf@cisco.com>
Message-ID: <1178127460.18609.111.camel@stevo-desktop>

On Wed, 2007-05-02 at 10:16 -0700, Roland Dreier wrote:
>  > > +			if (ib_wr->send_flags & IB_SEND_SOLICITED
>  > > +				&& ib_wr->send_flags & IB_SEND_INVALIDATE) {
>  > 
>  > How about
>  > 	if (ib_wr->send_flags & (IB_SEND_SOLICITED | IB_SEND_INVALIDATE))
> 
> These two aren't equivalent -- the first has an &&, yours works like ||.
> Which is correct?

The test needs to be: 'if both are set'.

Michael's version says: 'if either is set'.
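
A minimal sketch of the difference between the two tests. The EX_* flag
values below are made up for illustration and are not the real
IB_SEND_* constants from <rdma/ib_verbs.h>:

```c
#include <stdbool.h>

/* Hypothetical flag bits, standing in for IB_SEND_SOLICITED and
 * IB_SEND_INVALIDATE. The real values are defined by the kernel headers. */
enum {
	EX_SEND_SOLICITED  = 1 << 0,
	EX_SEND_INVALIDATE = 1 << 1,
};

/* True only when both bits are set: the behaviour of the && test. */
static bool both_set(int send_flags)
{
	const int mask = EX_SEND_SOLICITED | EX_SEND_INVALIDATE;

	return (send_flags & mask) == mask;
}

/* True when at least one bit is set: the behaviour of
 * "flags & (A | B)" used directly as a condition. */
static bool either_set(int send_flags)
{
	return (send_flags & (EX_SEND_SOLICITED | EX_SEND_INVALIDATE)) != 0;
}
```

If a single-mask form is preferred, comparing the masked value against
the mask (as in both_set) keeps the 'both' semantics.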




From mhagen at iol.unh.edu  Wed May  2 10:39:35 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Wed, 2 May 2007 13:39:35 -0400 (EDT)
Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to 
	use send with invalidate
In-Reply-To: <ada4pmvi1sn.fsf@cisco.com>
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
	<20070502070849.GO8447@mellanox.co.il> <ada4pmvi1sn.fsf@cisco.com>
Message-ID: <48959.132.177.125.178.1178127575.squirrel@postal.iol.unh.edu>

I believe that they are the same, thanks for the nice addition Michael.
New patch follows.

--- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30 13:12:54.000000000 -0400
+++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-05-02 13:18:17.000000000 -0400
@@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str

 		switch (ib_wr->opcode) {
 		case IB_WR_SEND:
-			if (ib_wr->send_flags & IB_SEND_SOLICITED) {
+			if (ib_wr->send_flags &
+				(IB_SEND_SOLICITED | IB_SEND_INVALIDATE)) {
+				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV);
+				wr.sqwr.send.remote_stag =
+					cpu_to_be32(ib_wr->wr.invalidate.rkey);
+			} else if (ib_wr->send_flags & IB_SEND_SOLICITED) {
 				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE);
-				msg_size = sizeof(struct c2wr_send_req);
+				wr.sqwr.send.remote_stag = 0;
+			} else if (ib_wr->send_flags & IB_SEND_INVALIDATE) {
+				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV);
+				wr.sqwr.send.remote_stag =
+					cpu_to_be32(ib_wr->wr.invalidate.rkey);
 			} else {
 				c2_wr_set_id(&wr, C2_WR_TYPE_SEND);
-				msg_size = sizeof(struct c2wr_send_req);
+				wr.sqwr.send.remote_stag = 0;
 			}

-			wr.sqwr.send.remote_stag = 0;
-			msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge;
+			msg_size = sizeof(struct c2wr_send_req) +
+				sizeof(struct c2_data_addr) * ib_wr->num_sge;
 			if (ib_wr->num_sge > qp->send_sgl_depth) {
 				err = -EINVAL;
 				break;


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From swise at opengridcomputing.com  Wed May  2 10:46:50 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 02 May 2007 12:46:50 -0500
Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to
	use send with invalidate
In-Reply-To: <48959.132.177.125.178.1178127575.squirrel@postal.iol.unh.edu>
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
	<20070502070849.GO8447@mellanox.co.il> <ada4pmvi1sn.fsf@cisco.com>
	<48959.132.177.125.178.1178127575.squirrel@postal.iol.unh.edu>
Message-ID: <1178128010.18609.116.camel@stevo-desktop>

On Wed, 2007-05-02 at 13:39 -0400, mhagen at iol.unh.edu wrote:
> I believe that they are the same, thanks for the nice addition Michael.
> New patch follows.
> 
> --- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30 13:12:54.000000000 -0400
> +++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-05-02 13:18:17.000000000 -0400
> @@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str
> 
>  		switch (ib_wr->opcode) {
>  		case IB_WR_SEND:
> -			if (ib_wr->send_flags & IB_SEND_SOLICITED) {
> +			if (ib_wr->send_flags &
> +				(IB_SEND_SOLICITED | IB_SEND_INVALIDATE)) {

This will set the opcode to SEND_SE_INV if either SEND_SOLICITED -or-
SEND_INV is set. That's incorrect; you want your older code...


> +				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV);
> +				wr.sqwr.send.remote_stag =
> +					cpu_to_be32(ib_wr->wr.invalidate.rkey);
> +			} else if (ib_wr->send_flags & IB_SEND_SOLICITED) {
>  				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE);
> -				msg_size = sizeof(struct c2wr_send_req);
> +				wr.sqwr.send.remote_stag = 0;
> +			} else if (ib_wr->send_flags & IB_SEND_INVALIDATE) {
> +				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV);
> +				wr.sqwr.send.remote_stag =
> +					cpu_to_be32(ib_wr->wr.invalidate.rkey);
>  			} else {
>  				c2_wr_set_id(&wr, C2_WR_TYPE_SEND);
> -				msg_size = sizeof(struct c2wr_send_req);
> +				wr.sqwr.send.remote_stag = 0;
>  			}
> 
> -			wr.sqwr.send.remote_stag = 0;
> -			msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge;
> +			msg_size = sizeof(struct c2wr_send_req) +
> +				sizeof(struct c2_data_addr) * ib_wr->num_sge;
>  			if (ib_wr->num_sge > qp->send_sgl_depth) {
>  				err = -EINVAL;
>  				break;
> 
> 


From rdreier at cisco.com  Wed May  2 10:49:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 02 May 2007 10:49:47 -0700
Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache
In-Reply-To: <4638B4D5.7050709@voltaire.com> (Yosef Etigin's message of "Wed,
	02 May 2007 18:57:09 +0300")
References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com>
Message-ID: <adatzuvglok.fsf@cisco.com>

 > @@ -279,7 +278,7 @@ int mthca_read_ah(struct mthca_dev *dev,
 >  			(be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff;
 >  		header->grh.flow_label    =
 >  			ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff);
 > -		ib_get_cached_gid(&dev->ib_dev,
 > +		ib_query_gid(&dev->ib_dev,
 >  				  be32_to_cpu(ah->av->port_pd) >> 24,
 >  				  ah->av->gid_index % dev->limits.gid_table_len,
 >  				  &header->grh.source_gid);
 > Index: b/drivers/infiniband/hw/mthca/mthca_qp.c
 > ===================================================================
 > --- a/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-02 17:47:53.153873064 +0300
 > +++ b/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-02 18:04:14.123981858 +0300
 > @@ -40,9 +40,8 @@
 >  
 >  #include <asm/io.h>
 >  
 > -#include <rdma/ib_verbs.h>
 > -#include <rdma/ib_cache.h>
 >  #include <rdma/ib_pack.h>
 > +#include <rdma/ib_verbs.h>
 >  
 >  #include "mthca_dev.h"
 >  #include "mthca_cmd.h"
 > @@ -1485,11 +1484,10 @@ static int build_mlx_header(struct mthca
 >  		sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE;
 >  	sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED);
 >  	if (!sqp->qp.ibqp.qp_num)
 > -		ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port,
 > -				   sqp->pkey_index, &pkey);
 > +		ib_query_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey);
 >  	else
 > -		ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port,
 > -				   wr->wr.ud.pkey_index, &pkey);
 > +		ib_query_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey);
 > +
 >  	sqp->ud_header.bth.pkey = cpu_to_be16(pkey);
 >  	sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn);
 >  	sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1));

These mthca changes can't be correct -- you are adding calls to
sleeping functions from build_mlx_header(), which is always called
with a spinlock held.  You'll have to update mthca to keep track of
the GID and P_Key tables internally to fix this.

Please test your code with CONFIG_DEBUG_SPINLOCK_SLEEP=y at least to
see if there are any other places that are using the cache because
they're not allowed to sleep.
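
A rough user-space sketch of the idea, assuming a driver-private copy of
the P_Key table that is refreshed from a context that may sleep and read
from a context that may not. All ex_* names and the table size are
hypothetical and are not taken from mthca; locking between the update
and lookup paths is deliberately left out:

```c
#include <stdint.h>

#define EX_PKEY_TABLE_LEN 64	/* hypothetical table size */

/* Hypothetical driver-private snapshot of one port's P_Key table. */
struct ex_pkey_cache {
	uint16_t table[EX_PKEY_TABLE_LEN];
};

/* Refresh path: intended to run where sleeping is allowed (for example
 * after a port event), copying the result of a blocking query into the
 * private table. Entries beyond the table size are ignored. */
static void ex_pkey_cache_update(struct ex_pkey_cache *c,
				 const uint16_t *fresh, int len)
{
	int i;

	for (i = 0; i < len && i < EX_PKEY_TABLE_LEN; ++i)
		c->table[i] = fresh[i];
}

/* Lookup path: a plain memory read that never blocks, so it could be
 * called with a spinlock held. Returns 0 on success, -1 on a bad index. */
static int ex_pkey_cache_get(const struct ex_pkey_cache *c, int index,
			     uint16_t *pkey)
{
	if (index < 0 || index >= EX_PKEY_TABLE_LEN)
		return -1;
	*pkey = c->table[index];
	return 0;
}
```

A real driver would also have to serialize updates against concurrent
lookups and keep a matching table for GIDs.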

 - R.


From mhagen at iol.unh.edu  Wed May  2 10:52:15 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Wed, 2 May 2007 13:52:15 -0400 (EDT)
Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to 
	use send with invalidate
In-Reply-To: <1178128010.18609.116.camel@stevo-desktop>
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
	<20070502070849.GO8447@mellanox.co.il> <ada4pmvi1sn.fsf@cisco.com>
	<48959.132.177.125.178.1178127575.squirrel@postal.iol.unh.edu>
	<1178128010.18609.116.camel@stevo-desktop>
Message-ID: <41704.132.177.125.178.1178128335.squirrel@postal.iol.unh.edu>

Ok, here is the patch reverted again.

--- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30 13:12:54.000000000 -0400
+++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-05-02 13:50:25.000000000 -0400
@@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str

 		switch (ib_wr->opcode) {
 		case IB_WR_SEND:
-			if (ib_wr->send_flags & IB_SEND_SOLICITED) {
+			if (ib_wr->send_flags & IB_SEND_SOLICITED
+				&& ib_wr->send_flags & IB_SEND_INVALIDATE) {
+				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV);
+				wr.sqwr.send.remote_stag =
+					cpu_to_be32(ib_wr->wr.invalidate.rkey);
+			} else if (ib_wr->send_flags & IB_SEND_SOLICITED) {
 				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE);
-				msg_size = sizeof(struct c2wr_send_req);
+				wr.sqwr.send.remote_stag = 0;
+			} else if (ib_wr->send_flags & IB_SEND_INVALIDATE) {
+				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV);
+				wr.sqwr.send.remote_stag =
+					cpu_to_be32(ib_wr->wr.invalidate.rkey);
 			} else {
 				c2_wr_set_id(&wr, C2_WR_TYPE_SEND);
-				msg_size = sizeof(struct c2wr_send_req);
+				wr.sqwr.send.remote_stag = 0;
 			}

-			wr.sqwr.send.remote_stag = 0;
-			msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge;
+			msg_size = sizeof(struct c2wr_send_req) +
+				sizeof(struct c2_data_addr) * ib_wr->num_sge;
 			if (ib_wr->num_sge > qp->send_sgl_depth) {
 				err = -EINVAL;
 				break;


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From rdreier at cisco.com  Wed May  2 10:55:09 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 02 May 2007 10:55:09 -0700
Subject: [ofa-general] Re: [PATCH 3/3] mthca: provider-level caching of pkeys
In-Reply-To: <4638B4FE.8010605@voltaire.com> (Yosef Etigin's message of "Wed,
	02 May 2007 18:57:50 +0300")
References: <4638B432.3060801@voltaire.com> <4638B4FE.8010605@voltaire.com>
Message-ID: <adaps5jglfm.fsf@cisco.com>

Oh, I didn't see this patch before...
Anyway, along with all the minor whitespace etc. problems, there are two
big issues: 1) this patch needs to come _before_ the previous 2/3 patch
(or else the intermediate state is buggy) and 2) you need a GID cache too.


From rdreier at cisco.com  Wed May  2 10:56:49 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 02 May 2007 10:56:49 -0700
Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to
	use send with invalidate
In-Reply-To: <41704.132.177.125.178.1178128335.squirrel@postal.iol.unh.edu>
	(mhagen@iol.unh.edu's message of "Wed,
	2 May 2007 13:52:15 -0400 (EDT)")
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
	<20070502070849.GO8447@mellanox.co.il> <ada4pmvi1sn.fsf@cisco.com>
	<48959.132.177.125.178.1178127575.squirrel@postal.iol.unh.edu>
	<1178128010.18609.116.camel@stevo-desktop>
	<41704.132.177.125.178.1178128335.squirrel@postal.iol.unh.edu>
Message-ID: <adalkg7glcu.fsf@cisco.com>

OK, can you please send both patches one more time with a proper
changelog and Signed-off-by line?

Thanks...

(BTW, please don't include subscriber-only lists like
ofalab at iol.unh.edu in the cc -- it's annoying to have all my replies
generate a bounce)


From mst at dev.mellanox.co.il  Wed May  2 11:23:15 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 21:23:15 +0300
Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache
In-Reply-To: <4638B4D5.7050709@voltaire.com>
References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com>
Message-ID: <20070502182315.GQ22292@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCH 2/3] remove ib pkey gid and lmc cache
> 
> Remove IB cache from core
> 
> * Remove pkey, gid, and lmc caches
> * Rewrite ib_find_gid and ib_find_pkey over blocking device queries 
> * Modify users of the cache to use these methods
> 
> 
> Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
> ---
>  drivers/infiniband/core/cache.c         |  398 --------------------------------
>  include/rdma/ib_cache.h                 |  118 ---------
>  drivers/infiniband/core/Makefile        |    2 
>  drivers/infiniband/core/cm.c            |    8 
>  drivers/infiniband/core/cma.c           |    9 
>  drivers/infiniband/core/core_priv.h     |    3 
>  drivers/infiniband/core/device.c        |  143 ++++++++++-
>  drivers/infiniband/core/mad.c           |    5 
>  drivers/infiniband/core/multicast.c     |    3 
>  drivers/infiniband/core/sa_query.c      |    3 
>  drivers/infiniband/core/verbs.c         |    3 
>  drivers/infiniband/hw/mthca/mthca_av.c  |    3 
>  drivers/infiniband/hw/mthca/mthca_qp.c  |   10 
>  drivers/infiniband/ulp/ipoib/ipoib_cm.c |    3 
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c |    2 
>  drivers/infiniband/ulp/srp/ib_srp.c     |    6 
>  include/rdma/ib_verbs.h                 |   37 ++
>  17 files changed, 196 insertions(+), 560 deletions(-)

I think this should be split in two, as follows:

1. Implement ib_find_gid and ib_find_pkey over blocking device queries 
   + Modify core and ULPs to use these methods

This will already fix the IPoIB pkey bug you opened in bugzilla.
   
2. Modify mthca to keep its cache updated by snooping MADs, and remove the cache

Not really high priority.

-- 
MST


From mst at dev.mellanox.co.il  Wed May  2 11:42:43 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 21:42:43 +0300
Subject: [ofa-general] Re: minutes from socket over RDMA discussion at
	workshop
In-Reply-To: <4638BB97.8050403@hp.com>
References: <C98692FD98048C41885E0B0FACD9DFB80431C1CB@exnane01.hq.netapp.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3037660DF@xmb-sjc-216.amer.cisco.com>
	<4638BB97.8050403@hp.com>
Message-ID: <20070502184243.GR22292@mellanox.co.il>

> 
>                              Bulk Transfer                  "Latency"
>                          Unidir            Bidir
>     Card          Mbit/s SDx   SDr   Mbit/s SDx   SDr   Tran/s SDx   SDr
> ---------------------------------------------------------------------------
>                    nnnn  nnnnn nnnnn  nnnn  nnnnn nnnnn  nnnnn nnnnn nnnnn
>  AD313A  IPoIB     2970  4.418 4.544  3530  3.59  3.95   19290 n/a   n/a
>  AD313A  SDP       7810  0.453 1.048 12820  0.69  0.68   38030 26.29 26.29
>  AD313A  SDP p=0   7810  0.346 0.527 12670  0.42  0.043  19380 n/a   n/a

What's AD313A? What's the MTU for IPoIB (in OFED 1.2 it defaults to 64K)?

-- 
MST


From mst at dev.mellanox.co.il  Wed May  2 11:53:51 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 21:53:51 +0300
Subject: [ofa-general] [PATCH RFC] comp_vector support
Message-ID: <20070502185350.GS22292@mellanox.co.il>

This untested patch does the following:

1. extends ib_create_cq to pass in comp_vector parameter, and updates all ULP/providers.
2. mthca is enhanced to support multiple vectors if MSI-X is enabled.
3. uverbs and IPoIB CM are enhanced to use multiple vectors if available

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

I plan to test and repost this in earnest soon, but wanted to first hear
what people think about the API.
Note this closely parallels what we already have for userspace.
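
As an illustration only, one way a caller could choose the new
comp_vector argument is to spread its CQs round-robin over the vectors
the device reports. The helper name ex_pick_comp_vector is hypothetical
and is not part of this patch:

```c
/* Hypothetical helper: pick a completion vector for the cq_index-th CQ,
 * given the device's num_comp_vectors. Falls back to vector 0 when only
 * one vector exists or the inputs are out of range. */
static int ex_pick_comp_vector(int cq_index, int num_comp_vectors)
{
	if (num_comp_vectors <= 1 || cq_index < 0)
		return 0;
	return cq_index % num_comp_vectors;
}
```

The result would then be passed as the last argument of ib_create_cq()
as extended below; callers that do not care can keep passing 0.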

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 7fabb42..45d269b 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -161,9 +161,14 @@ static int alloc_name(char *name)
  */
 struct ib_device *ib_alloc_device(size_t size)
 {
+	struct ib_device *dev;
 	BUG_ON(size < sizeof (struct ib_device));
 
-	return kzalloc(size, GFP_KERNEL);
+	dev = kzalloc(size, GFP_KERNEL);
+	if (dev)
+		dev->num_comp_vectors = 1;
+
+	return dev;
 }
 EXPORT_SYMBOL(ib_alloc_device);
 
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 6edfecf..85ccf13 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2771,7 +2771,7 @@ static int ib_mad_port_open(struct ib_device *device,
 	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
 	port_priv->cq = ib_create_cq(port_priv->device,
 				     ib_mad_thread_completion_handler,
-				     NULL, port_priv, cq_size);
+				     NULL, port_priv, cq_size, 0);
 	if (IS_ERR(port_priv->cq)) {
 		printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n");
 		ret = PTR_ERR(port_priv->cq);
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 4fd75af..6b9390f 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -802,7 +802,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file,
 	INIT_LIST_HEAD(&obj->async_list);
 
 	cq = file->device->ib_dev->create_cq(file->device->ib_dev, cmd.cqe,
-					     file->ucontext, &udata);
+					     cmd.comp_vector, file->ucontext, &udata);
 	if (IS_ERR(cq)) {
 		ret = PTR_ERR(cq);
 		goto err_file;
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index f8bc822..d44e547 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -752,7 +752,7 @@ static void ib_uverbs_add_one(struct ib_device *device)
 	spin_unlock(&map_lock);
 
 	uverbs_dev->ib_dev           = device;
-	uverbs_dev->num_comp_vectors = 1;
+	uverbs_dev->num_comp_vectors = device->num_comp_vectors;
 
 	uverbs_dev->dev = cdev_alloc();
 	if (!uverbs_dev->dev)
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index ccdf93d..86ed8af 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -609,11 +609,11 @@ EXPORT_SYMBOL(ib_destroy_qp);
 struct ib_cq *ib_create_cq(struct ib_device *device,
 			   ib_comp_handler comp_handler,
 			   void (*event_handler)(struct ib_event *, void *),
-			   void *cq_context, int cqe)
+			   void *cq_context, int cqe, int comp_vector)
 {
 	struct ib_cq *cq;
 
-	cq = device->create_cq(device, cqe, NULL, NULL);
+	cq = device->create_cq(device, cqe, comp_vector, NULL, NULL);
 
 	if (!IS_ERR(cq)) {
 		cq->device        = device;
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index 607c09b..46ea16b 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -290,7 +290,7 @@ static int c2_destroy_qp(struct ib_qp *ib_qp)
 	return 0;
 }
 
-static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries,
+static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, int comp_vector,
 				  struct ib_ucontext *context,
 				  struct ib_udata *udata)
 {
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index af28a31..4f76b2e 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -139,7 +139,7 @@ static int iwch_destroy_cq(struct ib_cq *ib_cq)
 	return 0;
 }
 
-static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries,
+static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, int comp_vector,
 			     struct ib_ucontext *ib_context,
 			     struct ib_udata *udata)
 {
diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index e2cdc1a..67f0670 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -113,7 +113,7 @@ struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int real_qp_num)
 	return ret;
 }
 
-struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe,
+struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 			     struct ib_ucontext *context,
 			     struct ib_udata *udata)
 {
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 95fd59f..aff96ac 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -123,7 +123,7 @@ int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq);
 void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq);
 
 
-struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe,
+struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 			     struct ib_ucontext *context,
 			     struct ib_udata *udata);
 
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 3b23d67..dbe2723 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -375,7 +375,7 @@ static int ehca_create_aqp1(struct ehca_shca *shca, u32 port)
 		return -EPERM;
 	}
 
-	ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10);
+	ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10, 0);
 	if (IS_ERR(ibcq)) {
 		ehca_err(&shca->ib_device, "Cannot create AQP1 CQ.");
 		return PTR_ERR(ibcq);
diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c
index ea78e6d..cfca5d1 100644
--- a/drivers/infiniband/hw/ipath/ipath_cq.c
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -204,7 +204,7 @@ static void send_complete(unsigned long data)
  *
  * Called by ib_create_cq() in the generic verbs code.
  */
-struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries,
+struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector,
 			      struct ib_ucontext *context,
 			      struct ib_udata *udata)
 {
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h
index 7c4929f..865966d 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h
@@ -734,7 +734,7 @@ int ipath_destroy_srq(struct ib_srq *ibsrq);
 
 int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry);
 
-struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries,
+struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector,
 			      struct ib_ucontext *context,
 			      struct ib_udata *udata);
 
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index efd79ef..61ff333 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -779,7 +779,7 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
 	return 0;
 }
 
-int mthca_init_cq(struct mthca_dev *dev, int nent,
+int mthca_init_cq(struct mthca_dev *dev, int nent, int comp_vector,
 		  struct mthca_ucontext *ctx, u32 pdn,
 		  struct mthca_cq *cq)
 {
@@ -790,6 +790,7 @@ int mthca_init_cq(struct mthca_dev *dev, int nent,
 
 	cq->ibcq.cqe  = nent - 1;
 	cq->is_kernel = !ctx;
+	cq->eq = MTHCA_EQ_COMP + comp_vector;
 
 	cq->cqn = mthca_alloc(&dev->cq_table.alloc);
 	if (cq->cqn == -1)
@@ -844,7 +845,7 @@ int mthca_init_cq(struct mthca_dev *dev, int nent,
 	else
 		cq_context->logsize_usrpage |= cpu_to_be32(dev->driver_uar.index);
 	cq_context->error_eqn       = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn);
-	cq_context->comp_eqn        = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn);
+	cq_context->comp_eqn        = cpu_to_be32(dev->eq_table.eq[cq->eq].eqn);
 	cq_context->pd              = cpu_to_be32(pdn);
 	cq_context->lkey            = cpu_to_be32(cq->buf.mr.ibmr.lkey);
 	cq_context->cqn             = cpu_to_be32(cq->cqn);
@@ -954,7 +955,7 @@ void mthca_free_cq(struct mthca_dev *dev,
 	spin_unlock_irq(&dev->cq_table.lock);
 
 	if (dev->mthca_flags & MTHCA_FLAG_MSI_X)
-		synchronize_irq(dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector);
+		synchronize_irq(dev->eq_table.eq[cq->eq].msi_x_vector);
 	else
 		synchronize_irq(dev->pdev->irq);
 
diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index b7e42ef..fcfb0e2 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -96,7 +96,8 @@ enum {
 	MTHCA_EQ_CMD,
 	MTHCA_EQ_ASYNC,
 	MTHCA_EQ_COMP,
-	MTHCA_NUM_EQ
+	MTHCA_NUM_EQ,
+	MTHCA_NUM_EQS = 32
 };
 
 enum {
@@ -497,7 +498,7 @@ int mthca_poll_cq(struct ib_cq *ibcq, int num_entries,
 		  struct ib_wc *entry);
 int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
 int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
-int mthca_init_cq(struct mthca_dev *dev, int nent,
+int mthca_init_cq(struct mthca_dev *dev, int nent, int comp_vector,
 		  struct mthca_ucontext *ctx, u32 pdn,
 		  struct mthca_cq *cq);
 void mthca_free_cq(struct mthca_dev *dev,
diff --git a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c
index 8ec9fa1..6e38f2a 100644
--- a/drivers/infiniband/hw/mthca/mthca_eq.c
+++ b/drivers/infiniband/hw/mthca/mthca_eq.c
@@ -161,6 +161,11 @@ struct mthca_eqe {
 	u8 owner;
 } __attribute__((packed));
 
+static inline int mthca_num_eq(struct mthca_dev *dev)
+{
+	return dev->ib_dev.num_comp_vectors + MTHCA_NUM_EQ - 1;
+}
+
 #define  MTHCA_EQ_ENTRY_OWNER_SW      (0 << 7)
 #define  MTHCA_EQ_ENTRY_OWNER_HW      (1 << 7)
 
@@ -657,7 +662,7 @@ static void mthca_free_irqs(struct mthca_dev *dev)
 
 	if (dev->eq_table.have_irq)
 		free_irq(dev->pdev->irq, dev);
-	for (i = 0; i < MTHCA_NUM_EQ; ++i)
+	for (i = 0; i < mthca_num_eq(dev); ++i)
 		if (dev->eq_table.eq[i].have_irq)
 			free_irq(dev->eq_table.eq[i].msi_x_vector,
 				 dev->eq_table.eq + i);
@@ -824,12 +829,37 @@ void mthca_unmap_eq_icm(struct mthca_dev *dev)
 	__free_page(dev->eq_table.icm_page);
 }
 
+static inline const char *eq_name(int i)
+{
+	switch (i) {
+	case MTHCA_EQ_ASYNC:
+		return DRV_NAME " (async)";
+	case MTHCA_EQ_CMD:
+		return DRV_NAME " (cmd)";
+	default:
+		return DRV_NAME " (comp)";
+	}
+}
+
+static inline int eq_size(struct mthca_dev *dev, int i)
+{
+	switch (i) {
+	case MTHCA_EQ_ASYNC:
+		return MTHCA_NUM_ASYNC_EQE;
+	case MTHCA_EQ_CMD:
+		return MTHCA_NUM_CMD_EQE;
+	default:
+		return dev->limits.num_cqs;
+	}
+}
+
+
 int mthca_init_eq_table(struct mthca_dev *dev)
 {
 	int err;
 	u8 status;
 	u8 intr;
-	int i;
+	int i, eqn;
 
 	err = mthca_alloc_init(&dev->eq_table.alloc,
 			       dev->limits.num_eqs,
@@ -857,39 +887,23 @@ int mthca_init_eq_table(struct mthca_dev *dev)
 	intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ?
 		128 : dev->eq_table.inta_pin;
 
-	err = mthca_create_eq(dev, dev->limits.num_cqs + MTHCA_NUM_SPARE_EQE,
-			      (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr,
-			      &dev->eq_table.eq[MTHCA_EQ_COMP]);
-	if (err)
-		goto err_out_unmap;
-
-	err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE + MTHCA_NUM_SPARE_EQE,
-			      (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr,
-			      &dev->eq_table.eq[MTHCA_EQ_ASYNC]);
-	if (err)
-		goto err_out_comp;
-
-	err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE + MTHCA_NUM_SPARE_EQE,
-			      (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 130 : intr,
-			      &dev->eq_table.eq[MTHCA_EQ_CMD]);
-	if (err)
-		goto err_out_async;
+	for (eqn = 0; eqn < mthca_num_eq(dev); ++eqn) {
+		err = mthca_create_eq(dev, eq_size(dev, eqn) + MTHCA_NUM_SPARE_EQE,
+				      (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr,
+				      &dev->eq_table.eq[eqn]);
+		if (err)
+			goto err_out_eq;
+	}
 
 	if (dev->mthca_flags & MTHCA_FLAG_MSI_X) {
-		static const char *eq_name[] = {
-			[MTHCA_EQ_COMP]  = DRV_NAME " (comp)",
-			[MTHCA_EQ_ASYNC] = DRV_NAME " (async)",
-			[MTHCA_EQ_CMD]   = DRV_NAME " (cmd)"
-		};
-
-		for (i = 0; i < MTHCA_NUM_EQ; ++i) {
+		for (i = 0; i < mthca_num_eq(dev); ++i) {
 			err = request_irq(dev->eq_table.eq[i].msi_x_vector,
 					  mthca_is_memfree(dev) ?
 					  mthca_arbel_msi_x_interrupt :
 					  mthca_tavor_msi_x_interrupt,
-					  0, eq_name[i], dev->eq_table.eq + i);
+					  0, eq_name(i), dev->eq_table.eq + i);
 			if (err)
-				goto err_out_cmd;
+				goto err_out_irq;
 			dev->eq_table.eq[i].have_irq = 1;
 		}
 	} else {
@@ -899,7 +913,7 @@ int mthca_init_eq_table(struct mthca_dev *dev)
 				  mthca_tavor_interrupt,
 				  IRQF_SHARED, DRV_NAME, dev);
 		if (err)
-			goto err_out_cmd;
+			goto err_out_eq;
 		dev->eq_table.have_irq = 1;
 	}
 
@@ -929,17 +943,13 @@ int mthca_init_eq_table(struct mthca_dev *dev)
 
 	return 0;
 
-err_out_cmd:
+err_out_irq:
 	mthca_free_irqs(dev);
-	mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]);
 
-err_out_async:
-	mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]);
-
-err_out_comp:
-	mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]);
+err_out_eq:
+	for (i = 0; i < eqn; ++i)
+		mthca_free_eq(dev, &dev->eq_table.eq[i]);
 
-err_out_unmap:
 	mthca_unmap_eq_regs(dev);
 
 err_out_free:
@@ -959,7 +969,7 @@ void mthca_cleanup_eq_table(struct mthca_dev *dev)
 	mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK,
 		     1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status);
 
-	for (i = 0; i < MTHCA_NUM_EQ; ++i)
+	for (i = 0; i < mthca_num_eq(dev); ++i)
 		mthca_free_eq(dev, &dev->eq_table.eq[i]);
 
 	mthca_unmap_eq_regs(dev);
diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 773145e..83a54df 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -976,24 +976,27 @@ static void mthca_release_regions(struct pci_dev *pdev,
 
 static int mthca_enable_msi_x(struct mthca_dev *mdev)
 {
-	struct msix_entry entries[3];
-	int err;
+	struct msix_entry entries[MTHCA_NUM_EQS];
+	int i, err;
 
-	entries[0].entry = 0;
-	entries[1].entry = 1;
-	entries[2].entry = 2;
+	for (i = 0; i < MTHCA_NUM_EQS; ++i)
+		entries[i].entry = i;
 
-	err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries));
+	err = pci_enable_msix(mdev->pdev, entries, MTHCA_NUM_EQS);
 	if (err) {
 		if (err > 0)
-			mthca_info(mdev, "Only %d MSI-X vectors available, "
-				   "not using MSI-X\n", err);
-		return err;
+			mthca_info(mdev, "Only %d MSI-X vectors available.", err);
+
+		if (err < MTHCA_NUM_EQ) {
+			mthca_info(mdev, "Not using MSI-X: %d\n", err);
+			pci_disable_msix(mdev->pdev);
+			return err;
+		}
 	}
 
-	mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector;
-	mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector;
-	mdev->eq_table.eq[MTHCA_EQ_CMD  ].msi_x_vector = entries[2].vector;
+	mdev->ib_dev.num_comp_vectors = err - MTHCA_NUM_EQ + 1;
+	for (i = 0; i < err; ++i)
+		mdev->eq_table.eq[i].msi_x_vector = entries[i].vector;
 
 	return 0;
 }
diff --git a/drivers/infiniband/hw/mthca/mthca_profile.c b/drivers/infiniband/hw/mthca/mthca_profile.c
index 26bf86d..834b303 100644
--- a/drivers/infiniband/hw/mthca/mthca_profile.c
+++ b/drivers/infiniband/hw/mthca/mthca_profile.c
@@ -59,7 +59,6 @@ enum {
 };
 
 enum {
-	MTHCA_NUM_EQS = 32,
 	MTHCA_NUM_PDS = 1 << 15
 };
 
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 47e6fd4..0b125b0 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -663,6 +663,7 @@ static int mthca_destroy_qp(struct ib_qp *qp)
 }
 
 static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries,
+				     int comp_vector,
 				     struct ib_ucontext *context,
 				     struct ib_udata *udata)
 {
@@ -706,7 +707,7 @@ static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries,
 	for (nent = 1; nent <= entries; nent <<= 1)
 		; /* nothing */
 
-	err = mthca_init_cq(to_mdev(ibdev), nent,
+	err = mthca_init_cq(to_mdev(ibdev), nent, comp_vector,
 			    context ? to_mucontext(context) : NULL,
 			    context ? ucmd.pdn : to_mdev(ibdev)->driver_pd.pd_num,
 			    cq);
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h
index 1d266ac..591d953 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.h
+++ b/drivers/infiniband/hw/mthca/mthca_provider.h
@@ -202,6 +202,7 @@ struct mthca_cq {
 	spinlock_t		lock;
 	int			refcount;
 	int			cqn;
+	int			eq;
 	u32			cons_index;
 	struct mthca_cq_buf	buf;
 	struct mthca_cq_resize *resize_buf;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 0c4e59b..1778fd6 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -789,7 +789,7 @@ static int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn,
 	}
 
 	p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p,
-			     ipoib_sendq_size + 1);
+			     ipoib_sendq_size + 1, priv->ca->num_comp_vectors > 1);
 	if (IS_ERR(p->cq)) {
 		ret = PTR_ERR(p->cq);
 		ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 7f3ec20..5c3c6a4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -187,7 +187,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 	if (!ret)
 		size += ipoib_recvq_size;
 
-	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size);
+	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
 	if (IS_ERR(priv->cq)) {
 		printk(KERN_WARNING "%s: failed to create CQ\n", ca->name);
 		goto out_free_mr;
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 1fc9674..89d6008 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -76,7 +76,7 @@ static int iser_create_device_ib_res(struct iser_device *device)
 				  iser_cq_callback,
 				  iser_cq_event_callback,
 				  (void *)device,
-				  ISER_MAX_CQ_LEN);
+				  ISER_MAX_CQ_LEN, 0);
 	if (IS_ERR(device->cq))
 		goto cq_err;
 
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 5e8ac57..33c249a 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -197,7 +197,7 @@ static int srp_create_target_ib(struct srp_target_port *target)
 		return -ENOMEM;
 
 	target->cq = ib_create_cq(target->srp_host->dev->dev, srp_completion,
-				  NULL, target, SRP_CQ_SIZE);
+				  NULL, target, SRP_CQ_SIZE, 0);
 	if (IS_ERR(target->cq)) {
 		ret = PTR_ERR(target->cq);
 		goto out;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 765589f..a16e509 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -912,6 +912,8 @@ struct ib_device {
 
 	u32                           flags;
 
+	int                           num_comp_vectors;
+
 	struct iw_cm_verbs	     *iwcm;
 
 	int		           (*query_device)(struct ib_device *device,
@@ -978,6 +980,7 @@ struct ib_device {
 						struct ib_recv_wr *recv_wr,
 						struct ib_recv_wr **bad_recv_wr);
 	struct ib_cq *             (*create_cq)(struct ib_device *device, int cqe,
+						int comp_vector,
 						struct ib_ucontext *context,
 						struct ib_udata *udata);
 	int                        (*destroy_cq)(struct ib_cq *cq);
@@ -1358,13 +1361,15 @@ static inline int ib_post_recv(struct ib_qp *qp,
  * @cq_context: Context associated with the CQ returned to the user via
  *   the associated completion and event handlers.
  * @cqe: The minimum size of the CQ.
+ * @comp_vector: Completion vector used to signal completion events.
+ *     Must be >= 0 and < device->num_comp_vectors.
  *
  * Users can examine the cq structure to determine the actual CQ size.
  */
 struct ib_cq *ib_create_cq(struct ib_device *device,
 			   ib_comp_handler comp_handler,
 			   void (*event_handler)(struct ib_event *, void *),
-			   void *cq_context, int cqe);
+			   void *cq_context, int cqe, int comp_vector);
 
 /**
  * ib_resize_cq - Modifies the capacity of the CQ.
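
For readers following the new comp_vector argument: the kernel-doc above
asks callers to keep 0 <= comp_vector < num_comp_vectors, and the IPoIB CM
hunk passes "priv->ca->num_comp_vectors > 1", which evaluates to vector 1
when a second vector exists and to vector 0 otherwise.  A minimal user-space
sketch of that choice, next to a round-robin alternative, might look like
this.  struct fake_device and all three function names are hypothetical
illustrations, not part of the patch, and the round-robin policy is an
assumption rather than anything the patch implements.

```c
#include <assert.h>

/* Hypothetical stand-in for the one field of struct ib_device used here. */
struct fake_device {
	int num_comp_vectors;
};

/* The range the kernel-doc asks callers to respect. */
static int comp_vector_valid(const struct fake_device *dev, int comp_vector)
{
	return comp_vector >= 0 && comp_vector < dev->num_comp_vectors;
}

/* Same expression the IPoIB CM hunk passes: 1 if a second vector exists, else 0. */
static int second_vector_if_any(const struct fake_device *dev)
{
	return dev->num_comp_vectors > 1;
}

/* Alternative policy (an assumption, not in the patch): hand out vectors in turn. */
static int round_robin_vector(const struct fake_device *dev, unsigned int *next)
{
	int v;

	assert(dev->num_comp_vectors > 0);
	v = (int)((*next)++ % (unsigned int)dev->num_comp_vectors);
	assert(comp_vector_valid(dev, v));
	return v;
}
```

With four vectors, successive round_robin_vector() calls walk 0, 1, 2, 3 and
wrap back to 0, while second_vector_if_any() always answers 1.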

-- 
MST


From rick.jones2 at hp.com  Wed May  2 11:56:04 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Wed, 02 May 2007 11:56:04 -0700
Subject: [ofa-general] Re: minutes from socket over RDMA discussion at
	workshop
In-Reply-To: <20070502184243.GR22292@mellanox.co.il>
References: <C98692FD98048C41885E0B0FACD9DFB80431C1CB@exnane01.hq.netapp.com>	<A15335FBE9BD2449AF2C9EF3D1EB8EA3037660DF@xmb-sjc-216.amer.cisco.com>	<4638BB97.8050403@hp.com>
	<20070502184243.GR22292@mellanox.co.il>
Message-ID: <4638DEC4.9040700@hp.com>

Michael S. Tsirkin wrote:
>>                             Bulk Transfer                  "Latency"
>>                         Unidir            Bidir
>>    Card          Mbit/s SDx   SDr   Mbit/s SDx   SDr   Tran/s SDx   SDr
>>---------------------------------------------------------------------------
>>                   nnnn  nnnnn nnnnn  nnnn  nnnnn nnnnn  nnnnn nnnnn nnnnn
>> AD313A  IPoIB     2970  4.418 4.544  3530  3.59  3.95   19290 n/a   n/a
>> AD313A  SDP       7810  0.453 1.048 12820  0.69  0.68   38030 26.29 26.29
>> AD313A  SDP p=0   7810  0.346 0.527 12670  0.42  0.043  19380 n/a   n/a
> 
> 
> What's AD313A? What's the MTU for IPoIB (in OFED 1.2 it defaults to 64K)?


The OFED is whatever is in RHEL5 - someone said that might be 1.1.  I had some 
problems getting all of it removed enough to get 1.2 to load there - in 
particular the ib_sdp stuff.

The AD313A shows as this in lspci:

08:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor 
compatibility mode) (rev 20)

AD313A is how people order it from HP for their Integrity servers.  The work was 
targeted at HPers, hence the use of the HP product number in the write-up.  I 
just didn't think to provide the decoder ring in the post :)

rick jones


From mst at dev.mellanox.co.il  Wed May  2 12:00:24 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 22:00:24 +0300
Subject: [ofa-general] Re: minutes from socket over RDMA discussion at
	workshop
In-Reply-To: <4638DEC4.9040700@hp.com>
References: <C98692FD98048C41885E0B0FACD9DFB80431C1CB@exnane01.hq.netapp.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3037660DF@xmb-sjc-216.amer.cisco.com>
	<4638BB97.8050403@hp.com> <20070502184243.GR22292@mellanox.co.il>
	<4638DEC4.9040700@hp.com>
Message-ID: <20070502190024.GU22292@mellanox.co.il>

> Quoting Rick Jones <rick.jones2 at hp.com>:
> Subject: Re: minutes from socket over RDMA discussion at workshop
> 
> Michael S. Tsirkin wrote:
> >>                            Bulk Transfer                  "Latency"
> >>                        Unidir            Bidir
> >>   Card          Mbit/s SDx   SDr   Mbit/s SDx   SDr   Tran/s SDx   SDr
> >>---------------------------------------------------------------------------
> >>                  nnnn  nnnnn nnnnn  nnnn  nnnnn nnnnn  nnnnn nnnnn nnnnn
> >>AD313A  IPoIB     2970  4.418 4.544  3530  3.59  3.95   19290 n/a   n/a
> >>AD313A  SDP       7810  0.453 1.048 12820  0.69  0.68   38030 26.29 26.29
> >>AD313A  SDP p=0   7810  0.346 0.527 12670  0.42  0.043  19380 n/a   n/a
> >
> >
> >What's AD313A? What's the MTU for IPoIB (in OFED 1.2 it defaults to 64K)?
> 
> 
> The OFED is whatever is in RHEL5 - someone said that might be 1.1.

So you are not comparing apples to apples:
SDP uses buffers of 64K, IPoIB datagram mode - 2K.

> I had 
> some problems getting all of it removed enough to get 1.2 to load there - 
> in particular the ib_sdp stuff.

Report the problem, people'll try to help.

-- 
MST


From rick.jones2 at hp.com  Wed May  2 12:10:48 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Wed, 02 May 2007 12:10:48 -0700
Subject: [ofa-general] Re: minutes from socket over RDMA discussion at
	workshop
In-Reply-To: <20070502190024.GU22292@mellanox.co.il>
References: <C98692FD98048C41885E0B0FACD9DFB80431C1CB@exnane01.hq.netapp.com>	<A15335FBE9BD2449AF2C9EF3D1EB8EA3037660DF@xmb-sjc-216.amer.cisco.com>	<4638BB97.8050403@hp.com>
	<20070502184243.GR22292@mellanox.co.il>	<4638DEC4.9040700@hp.com>
	<20070502190024.GU22292@mellanox.co.il>
Message-ID: <4638E238.5020002@hp.com>

>>The OFED is whatever is in RHEL5 - someone said that might be 1.1.
> 
> 
> So you are not comparing apples to apples:
> SDP uses buffers of 64K, IPoIB datagram mode - 2K.

I won't dispute that, I'll just say it is what people running RHEL5 will see out 
of the box.

> Report the problem, people'll try to help.

I thought I had, albeit perhaps too tangentially.  Anyway, I'm running on a 
Debian 4.0 with a 2.6.21.1 kernel on it (from a kernel.org tar) now and will try 
a contemporary 1.2 on that.  It seems there is already some OFED stuff in the 
2.6.21.1 kernel, so if there are suggestions on how to remove that for a 
successful 1.2 install, or other suggestions on how to have a successful 1.2 
install on 2.6.21.1 I'm all ears.  When it comes to manipulating bits here I'm 
still on the very steep part of the learning curve.

happy benchmarking,

rick jones


From mhagen at iol.unh.edu  Wed May  2 12:21:57 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Wed, 2 May 2007 15:21:57 -0400 (EDT)
Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to 
	use send with invalidate
In-Reply-To: <adalkg7glcu.fsf@cisco.com>
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu>
	<20070502070849.GO8447@mellanox.co.il> <ada4pmvi1sn.fsf@cisco.com>
	<48959.132.177.125.178.1178127575.squirrel@postal.iol.unh.edu>
	<1178128010.18609.116.camel@stevo-desktop>
	<41704.132.177.125.178.1178128335.squirrel@postal.iol.unh.edu>
	<adalkg7glcu.fsf@cisco.com>
Message-ID: <44075.132.177.125.178.1178133717.squirrel@postal.iol.unh.edu>

Modification to the ammasso driver to use the iWARP verbs SEND with INV
and SEND with SE and INV.

Signed-off-by: Mikkel Hagen <mhagen at iol.unh.edu>

--- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30 13:12:54.000000000 -0400
+++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-05-02 13:50:25.000000000 -0400
@@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str

 		switch (ib_wr->opcode) {
 		case IB_WR_SEND:
-			if (ib_wr->send_flags & IB_SEND_SOLICITED) {
+			if (ib_wr->send_flags & IB_SEND_SOLICITED
+				&& ib_wr->send_flags & IB_SEND_INVALIDATE) {
+				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV);
+				wr.sqwr.send.remote_stag =
+					cpu_to_be32(ib_wr->wr.invalidate.rkey);
+			} else if (ib_wr->send_flags & IB_SEND_SOLICITED) {
 				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE);
-				msg_size = sizeof(struct c2wr_send_req);
+				wr.sqwr.send.remote_stag = 0;
+			} else if (ib_wr->send_flags & IB_SEND_INVALIDATE) {
+				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV);
+				wr.sqwr.send.remote_stag =
+					cpu_to_be32(ib_wr->wr.invalidate.rkey);
 			} else {
 				c2_wr_set_id(&wr, C2_WR_TYPE_SEND);
-				msg_size = sizeof(struct c2wr_send_req);
+				wr.sqwr.send.remote_stag = 0;
 			}

-			wr.sqwr.send.remote_stag = 0;
-			msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge;
+			msg_size = sizeof(struct c2wr_send_req) +
+				sizeof(struct c2_data_addr) * ib_wr->num_sge;
 			if (ib_wr->num_sge > qp->send_sgl_depth) {
 				err = -EINVAL;
 				break;
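
For anyone reading the hunk above: the four-way branch boils down to two
independent bits, solicited-event and invalidate, where only the invalidate
bit decides whether remote_stag carries an rkey.  A stand-alone sketch of
that mapping follows.  The flag values are the ones in enum ib_send_flags
once the companion ib_verbs.h patch is applied; the sketch_wr_type values
and pick_send_type() are hypothetical stand-ins, since the real
C2_WR_TYPE_* numbers live in the driver headers.

```c
#include <assert.h>

/* Flag values as in enum ib_send_flags with IB_SEND_INVALIDATE added. */
enum {
	IB_SEND_SOLICITED  = 1 << 2,
	IB_SEND_INVALIDATE = 1 << 4
};

/* Hypothetical stand-ins for the C2_WR_TYPE_SEND* identifiers. */
enum sketch_wr_type {
	SKETCH_SEND,
	SKETCH_SEND_SE,
	SKETCH_SEND_INV,
	SKETCH_SEND_SE_INV
};

/* Map send_flags to a WR type; *needs_stag says whether remote_stag gets an rkey. */
static enum sketch_wr_type pick_send_type(int send_flags, int *needs_stag)
{
	int se  = (send_flags & IB_SEND_SOLICITED) != 0;
	int inv = (send_flags & IB_SEND_INVALIDATE) != 0;

	assert(needs_stag);
	*needs_stag = inv;

	if (se && inv)
		return SKETCH_SEND_SE_INV;
	if (se)
		return SKETCH_SEND_SE;
	if (inv)
		return SKETCH_SEND_INV;
	return SKETCH_SEND;
}
```

Testing each bit once and then combining them keeps the stag decision in a
single place, which is one way to avoid repeating the cpu_to_be32()
assignment in two branches.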


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From mhagen at iol.unh.edu  Wed May  2 12:25:19 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Wed, 2 May 2007 15:25:19 -0400 (EDT)
Subject: [ofa-general] Re: [PATCH] infiniband: add support for 
	invalidate stag
In-Reply-To: <20070501205138.GG8447@mellanox.co.il>
References: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu>
	<20070501035708.GJ13293@mellanox.co.il>
	<1178040392.2309.72.camel@stevo-desktop>
	<20070501205138.GG8447@mellanox.co.il>
Message-ID: <44080.132.177.125.178.1178133919.squirrel@postal.iol.unh.edu>

Patch to add support for the iWARP verbs SEND with INV and SEND with SE
and INV.

Signed-off-by: Mikkel Hagen <mhagen at iol.unh.edu>

--- linux-2.6.21.1/include/rdma/ib_verbs.h	2007-05-02 15:17:24.000000000 -0400
+++ linux-2.6.21.1/include/rdma/ib_verbs.h	2007-05-02 15:19:05.000000000 -0400
@@ -611,7 +611,8 @@ enum ib_send_flags {
 	IB_SEND_FENCE		= 1,
 	IB_SEND_SIGNALED	= (1<<1),
 	IB_SEND_SOLICITED	= (1<<2),
-	IB_SEND_INLINE		= (1<<3)
+	IB_SEND_INLINE		= (1<<3),
+	IB_SEND_INVALIDATE	= (1<<4)
 };

 struct ib_sge {
@@ -646,6 +647,9 @@ struct ib_send_wr {
 			u16	pkey_index; /* valid for GSI only */
 			u8	port_num;   /* valid for DR SMPs on switch only */
 		} ud;
+		struct {
+			u32	rkey;
+		} invalidate;
 	} wr;
 };
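
To show how a caller might use the two additions together, here is a small
user-space sketch.  struct sketch_send_wr is a trimmed, hypothetical copy of
struct ib_send_wr that keeps only the fields touched here, and
request_invalidate() is an illustrative helper, not something the patch
provides.

```c
#include <assert.h>
#include <stdint.h>

enum {
	IB_SEND_SIGNALED   = 1 << 1,
	IB_SEND_INVALIDATE = 1 << 4	/* value added by the patch */
};

/* Trimmed, hypothetical copy of struct ib_send_wr. */
struct sketch_send_wr {
	int send_flags;
	union {
		struct {
			uint64_t remote_addr;
			uint32_t rkey;
		} rdma;
		struct {
			uint32_t rkey;	/* member added by the patch */
		} invalidate;
	} wr;
};

/* Ask for the peer's rkey to be invalidated when this send is delivered. */
static void request_invalidate(struct sketch_send_wr *wr, uint32_t rkey)
{
	assert(wr);
	wr->send_flags |= IB_SEND_INVALIDATE;
	wr->wr.invalidate.rkey = rkey;
}
```

Because invalidate shares the union with the other per-opcode members, the
rkey can only be carried on a request that does not need those members,
which fits the driver patch handling the flag under IB_WR_SEND only.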


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From mst at dev.mellanox.co.il  Wed May  2 12:37:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 22:37:06 +0300
Subject: [ofa-general] Re: minutes from socket over RDMA discussion at
	workshop
In-Reply-To: <4638E238.5020002@hp.com>
References: <C98692FD98048C41885E0B0FACD9DFB80431C1CB@exnane01.hq.netapp.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3037660DF@xmb-sjc-216.amer.cisco.com>
	<4638BB97.8050403@hp.com> <20070502184243.GR22292@mellanox.co.il>
	<4638DEC4.9040700@hp.com> <20070502190024.GU22292@mellanox.co.il>
	<4638E238.5020002@hp.com>
Message-ID: <20070502193706.GZ22292@mellanox.co.il>

> Quoting Rick Jones <rick.jones2 at hp.com>:
> Subject: Re: minutes from socket over RDMA discussion at workshop
> 
> >>The OFED is whatever is in RHEL5 - someone said that might be 1.1.
> >
> >
> >So you are not comparing apples to apples:
> >SDP uses buffers of 64K, IPoIB datagram mode - 2K.
> 
> I won't dispute that, I'll just say it is what people running RHEL5 will 
> see out of the box.

OK but when some people say "IPoIB gives same BW as SDP" they mean 1.2.

> >Report the problem, people'll try to help.
> 
> I thought I had, albeit perhaps too tangentially.  Anyway, I'm running on a 
> Debian 4.0 with a 2.6.21.1 kernel on it (from a kernel.org tar) now and 
> will try a contemporary 1.2 on that.  It seems there is already some OFED 
> stuff in the 2.6.21.1 kernel, so if there are suggestions on how to remove 
> that for a successful 1.2 install, or other suggestions on how to have a 
> successful 1.2 install on 2.6.21.1 I'm all ears.  When it comes to 
> manipulating bits here I'm still on the very steep part of the learning 
> curve.

OFED should really do that for you automatically,
and will even try to put it all back on uninstall.

-- 
MST


From rick.jones2 at hp.com  Wed May  2 13:16:08 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Wed, 02 May 2007 13:16:08 -0700
Subject: [ofa-general] Re: minutes from socket over RDMA discussion at
	workshop
In-Reply-To: <20070502193706.GZ22292@mellanox.co.il>
References: <C98692FD98048C41885E0B0FACD9DFB80431C1CB@exnane01.hq.netapp.com>	<A15335FBE9BD2449AF2C9EF3D1EB8EA3037660DF@xmb-sjc-216.amer.cisco.com>	<4638BB97.8050403@hp.com>
	<20070502184243.GR22292@mellanox.co.il>	<4638DEC4.9040700@hp.com>
	<20070502190024.GU22292@mellanox.co.il>	<4638E238.5020002@hp.com>
	<20070502193706.GZ22292@mellanox.co.il>
Message-ID: <4638F188.30802@hp.com>

Michael S. Tsirkin wrote:
>>Quoting Rick Jones <rick.jones2 at hp.com>:
>>Subject: Re: minutes from socket over RDMA discussion at workshop
>>
>>
>>>>The OFED is whatever is in RHEL5 - someone said that might be 1.1.
>>>
>>>
>>>So you are not comparing apples to apples:
>>>SDP uses buffers of 64K, IPoIB datagram mode - 2K.
>>
>>I won't dispute that, I'll just say it is what people running RHEL5 will 
>>see out of the box.
> 
> 
> OK but when some people say "IPoIB gives same BW as SDP" they mean 1.2.

Fair enough.  The joy of moving targets :)

>>>Report the problem, people'll try to help.
>>
>>I thought I had, albeit perhaps too tangentially.  Anyway, I'm running on a 
>>Debian 4.0 with a 2.6.21.1 kernel on it (from a kernel.org tar) now and 
>>will try a contemporary 1.2 on that.  It seems there is already some OFED 
>>stuff in the 2.6.21.1 kernel, so if there are suggestions on how to remove 
>>that for a successful 1.2 install, or other suggestions on how to have a 
>>successful 1.2 install on 2.6.21.1 I'm all ears.  When it comes to 
>>manipulating bits here I'm still on the very steep part of the learning 
>>curve.
> 
> 
> OFED should really do that for you automatically,
> and will even try to put it all back on uninstall.

I'll likely be trying later this afternoon (US Pacific time).

rick jones


From mst at dev.mellanox.co.il  Wed May  2 13:23:17 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 23:23:17 +0300
Subject: [ofa-general] Re: minutes from socket over RDMA discussion at
	workshop
In-Reply-To: <4638F188.30802@hp.com>
References: <C98692FD98048C41885E0B0FACD9DFB80431C1CB@exnane01.hq.netapp.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3037660DF@xmb-sjc-216.amer.cisco.com>
	<4638BB97.8050403@hp.com> <20070502184243.GR22292@mellanox.co.il>
	<4638DEC4.9040700@hp.com> <20070502190024.GU22292@mellanox.co.il>
	<4638E238.5020002@hp.com>
	<20070502193706.GZ22292@mellanox.co.il> <4638F188.30802@hp.com>
Message-ID: <20070502202317.GD22292@mellanox.co.il>

> Quoting Rick Jones <rick.jones2 at hp.com>:
> Subject: Re: minutes from socket over RDMA discussion at workshop
> 
> Michael S. Tsirkin wrote:
> >>Quoting Rick Jones <rick.jones2 at hp.com>:
> >>Subject: Re: minutes from socket over RDMA discussion at workshop
> >>
> >>
> >>>>The OFED is whatever is in RHEL5 - someone said that might be 1.1.
> >>>
> >>>
> >>>So you are not comparing apples to apples:
> >>>SDP uses buffers of 64K, IPoIB datagram mode - 2K.
> >>
> >>I won't dispute that, I'll just say it is what people running RHEL5 will 
> >>see out of the box.
> >
> >
> >OK but when some people say "IPoIB gives same BW as SDP" they mean 1.2.
> 
> Fair enough.  The joy of moving targets :)
> 
> >>>Report the problem, people'll try to help.
> >>
> >>I thought I had, albeit perhaps too tangentially.  Anyway, I'm running on 
> >>a Debian 4.0 with a 2.6.21.1 kernel on it (from a kernel.org tar) now and 
> >>will try a contemporary 1.2 on that.  It seems there is already some OFED 
> >>stuff in the 2.6.21.1 kernel, so if there are suggestions on how to 
> >>remove that for a successful 1.2 install, or other suggestions on how to 
> >>have a successful 1.2 install on 2.6.21.1 I'm all ears.  When it comes to 
> >>manipulating bits here I'm still on the very steep part of the learning 
> >>curve.
> >
> >
> >OFED should really do that for you automatically,
> >and will even try to put it all back on uninstall.
> 
> I'll likely be trying later this afternoon (US Pacific time).

I'm afraid I'm going offline now.

-- 
MST


From rick.jones2 at hp.com  Wed May  2 14:24:48 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Wed, 02 May 2007 14:24:48 -0700
Subject: [ofa-general] OFED-1.2-20070502-0600 on Debian
Message-ID: <463901A0.5060905@hp.com>

Sooo, I grabbed the latest 1.2 tar from

and untarred it onto my Debian 4.0 with 2.6.21.1 kernel from kernel.org, did 
./install.sh and a bunch of stuff like this flew past:

/root/OFED-1.2-20070502-0600/build_env.sh: line 77: rpm: command not found
/root/OFED-1.2-20070502-0600/build_env.sh: line 78: rpm: command not found
/root/OFED-1.2-20070502-0600/build_env.sh: line 79: rpm: command not found
/root/OFED-1.2-20070502-0600/build_env.sh: line 319: rpm: command not found
/root/OFED-1.2-20070502-0600/build_env.sh: line 320: rpm: command not found
/root/OFED-1.2-20070502-0600/build_env.sh: line 321: rpm: command not found
/root/OFED-1.2-20070502-0600/build_env.sh: line 327: rpm: command not found


I still got the menu, from which I selected (IIRC) 2, selected the basic OFED 
bits (option 1) at which point it said:


Below is the list of OFED packages that you have chosen
(some may have been added by the installer due to package dependencies):
ib_ipoib
ib_mthca
ib_verbs
kernel-ib
kernel-ib-devel
libcxgb3
libcxgb3-devel
libibcm
libibcm-devel
libibverbs
libibverbs-devel
libibverbs-utils
libmthca
libmthca-devel
librdmacm
mstflint
perftest
ofed-docs
ofed-scripts
ERROR: The gcc package is required to run libibverbs

now, there _is_ a gcc installed, it was used to build the kernel I'm running:

hpcpc106:~/OFED-1.2-20070502-0600# which gcc
/usr/bin/gcc
hpcpc106:~/OFED-1.2-20070502-0600# gcc -v
Using built-in specs.
Target: ia64-linux-gnu
Configured with: ../src/configure -v 
--enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr 
--enable-shared --with-system-zlib --libexecdir=/usr/lib 
--without-included-gettext --enable-threads=posix --enable-nls 
--program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu 
--enable-libstdcxx-debug --enable-mpfr --disable-libssp --with-system-libunwind 
--enable-checking=release ia64-linux-gnu
Thread model: posix
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

Is the script perhaps getting confused and thinking this is RHEL or something, 
or in using a Debian system am I going beyond what is "supported" and/or "known 
to work"?  The READMEs which came along with the bits don't seem to say much 
about Debian, just RedHat and SuSE.

hpcpc106:~/OFED-1.2-20070502-0600# uname -a
Linux hpcpc106 2.6.21.1-raj #1 SMP Tue May 1 13:57:28 PDT 2007 ia64 GNU/Linux
hpcpc106:~/OFED-1.2-20070502-0600#

should I be taking a different path to build here?

rick jones


From jwong at datallegro.com  Wed May  2 14:28:12 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Wed, 2 May 2007 17:28:12 -0400
Subject: [ofa-general] Help building ib-bonding
References: <A382D4292574EB47A85B8159A6AED1A101131B18@FPNYEXCBE02.opus-i.corp>
	<1178087589.14131.3.camel@vladsk-laptop>
Message-ID: <A382D4292574EB47A85B8159A6AED1A18305A9@FPNYEXCBE02.opus-i.corp>

Hello,

I am using kernel 2.6.18-8.1.1.el5 x86_64

I have changed the build_env.sh to have the build_32bit=-1

 
Thanks in advance.

 
Jeff

 
When installing all modules I am getting the following errors.

 
+ make -C /lib/modules/2.6.18-8.1.1.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding
make: Entering directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64'
  CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o
In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:78:
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_inactive_flags':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: (Each undeclared identifier is reported only once
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: for each function it appears in.)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_active_flags':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_compute_features':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1233: warning: comparison of distinct pointer types lacks a cast
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_enslave':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1449: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_release':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1848: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_arp_rcv':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:2548: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_netdev_event':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:3390: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_init':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4374: warning: assignment discards qualifiers from pointer target type
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4386: error: 'IFF_BONDING' undeclared (first use in this function)
make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o] Error 1
make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding] Error 2
make: Leaving directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64'
+ echo ' Building  IB bonding driver failed'
 Building  IB bonding driver failed
+ exit 1
error: Bad exit status from /var/tmp/rpm-tmp.99179 (%build)


From mst at dev.mellanox.co.il  Wed May  2 14:49:44 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 00:49:44 +0300
Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian
In-Reply-To: <463901A0.5060905@hp.com>
References: <463901A0.5060905@hp.com>
Message-ID: <20070502214944.GF10009@mellanox.co.il>

> Quoting Rick Jones <rick.jones2 at hp.com>:
> Subject: OFED-1.2-20070502-0600 on Debian
> 
> Sooo, I grabbed the latest 1.2 tar from
> 
> and untarred it onto my Debian 4.0 with 2.6.21.1 kernel from kernel.org, 
> did ./install.sh and a bunch of stuff like this flew past:
> 
> /root/OFED-1.2-20070502-0600/build_env.sh: line 77: rpm: command not found
> /root/OFED-1.2-20070502-0600/build_env.sh: line 78: rpm: command not found
> /root/OFED-1.2-20070502-0600/build_env.sh: line 79: rpm: command not found
> /root/OFED-1.2-20070502-0600/build_env.sh: line 319: rpm: command not found
> /root/OFED-1.2-20070502-0600/build_env.sh: line 320: rpm: command not found
> /root/OFED-1.2-20070502-0600/build_env.sh: line 321: rpm: command not found
> /root/OFED-1.2-20070502-0600/build_env.sh: line 327: rpm: command not found

rpm is not installed. I don't know how to solve this, Vlad
might be able to answer tomorrow.

> should I be taking a different path to build here?

Maybe, maybe not.

There *is* another way which should be enough to test IPoIB:
try getting a kernel tarball from
http://git.openfabrics.org/~vlad/builds/

If you unpack this, you can configure/make/make install.

The installer will back up your original modules under the prefix.
Keep the source around and you'll be able to run make uninstall
to get back to the original system.

Note 1: default configure settings are often not what you want:
run ./configure --help first of all to see which modules to select
(--with-ipoib-mod and --with-mthca-mod I think) and to set a prefix.
Note 2: having the quilt tool installed is recommended - it will let you
add/remove patches later.
Note 3: this way you get no userspace. openfabrics tarballs
are under the same directory, and a similar method works there.
External tarballs (MPI, bonding, etc) are supplied to us in SRPM
format so this trick does not work for them.

-- 
MST


From rdreier at cisco.com  Wed May  2 15:10:45 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 02 May 2007 15:10:45 -0700
Subject: [ofa-general] Re: [PATCH] libmlx4: fix post inline when posting a
	list
In-Reply-To: <200705021712.24400.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Wed, 2 May 2007 17:12:24 +0300")
References: <200705021712.24400.jackm@dev.mellanox.co.il>
Message-ID: <adavefag9lm.fsf@cisco.com>

Thanks, applied.  However in the future

 > diff --git a/src/userspace/libmlx4/src/qp.c b/src/userspace/libmlx4/src/qp.c

please use my git tree rather than some other setup, so that I can
apply things with patch -p1


From rdreier at cisco.com  Wed May  2 15:11:38 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 02 May 2007 15:11:38 -0700
Subject: [ofa-general] Re: [patch] mlx4_ib: return proper num s/g entries for
	rq at create_qp
In-Reply-To: <200705021714.05933.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Wed, 2 May 2007 17:14:05 +0300")
References: <200705021714.05933.jackm@dev.mellanox.co.il>
Message-ID: <adar6pyg9k5.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Wed May  2 15:21:15 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 02 May 2007 15:21:15 -0700
Subject: [ofa-general] Re: [PATCH] ipoib/cm: compliance fix
In-Reply-To: <20070502123112.GI22292@mellanox.co.il> (Michael S. Tsirkin's
	message of "Wed, 2 May 2007 15:31:12 +0300")
References: <20070502123112.GI22292@mellanox.co.il>
Message-ID: <adamz0mg944.fsf@cisco.com>

thanks, applied

 > Important enough for -stable?

I don't think so -- there's no way to hit this in practice right now,
so I don't think it meets the -stable criteria.

BTW, looking at the code that happens to be in the patch context:

 > 	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {

I think this annotation is unclear and I'm not sure gcc will do what
is intended here (and I'm not sure what is intended).  Should this be

	if (likely(!(wr_id & IPOIB_CM_RX_UPDATE_MASK))) {

or

	if (unlikely(!(wr_id & IPOIB_CM_RX_UPDATE_MASK))) {

...seems as if "unlikely" is appropriate.

 - R.


From mst at dev.mellanox.co.il  Wed May  2 15:21:31 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 01:21:31 +0300
Subject: [ofa-general] Re: [PATCH] libmlx4: fix post inline when posting a
	list
In-Reply-To: <adavefag9lm.fsf@cisco.com>
References: <200705021712.24400.jackm@dev.mellanox.co.il>
	<adavefag9lm.fsf@cisco.com>
Message-ID: <20070502222131.GI10009@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] libmlx4: fix post inline when posting a list
> 
> Thanks, applied.  However in the future
> 
>  > diff --git a/src/userspace/libmlx4/src/qp.c b/src/userspace/libmlx4/src/qp.c
> 
> please use my git tree rather than some other setup, so that I can
> apply things with patch -p1

BTW, note that recent git-am version can accept -p<N>.

-- 
MST


From rdreier at cisco.com  Wed May  2 15:25:10 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 02 May 2007 15:25:10 -0700
Subject: [ofa-general] Re: [PATCH] libmlx4: fix post inline when posting a
	list
In-Reply-To: <20070502222131.GI10009@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 3 May 2007 01:21:31 +0300")
References: <200705021712.24400.jackm@dev.mellanox.co.il>
	<adavefag9lm.fsf@cisco.com> <20070502222131.GI10009@mellanox.co.il>
Message-ID: <adairbag8xl.fsf@cisco.com>

 > BTW, note that recent git-am version can accept -p<N>.

Yes, but it's annoying to have to count '/'s just to apply a patch.
And it also doesn't inspire much confidence that something has been
tested against the tree I'm going to apply it to when it was obviously
generated from a different tree.

 - R.


From mst at dev.mellanox.co.il  Wed May  2 15:31:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 01:31:27 +0300
Subject: [ofa-general] Re: [PATCH] ipoib/cm: compliance fix
In-Reply-To: <adamz0mg944.fsf@cisco.com>
References: <20070502123112.GI22292@mellanox.co.il> <adamz0mg944.fsf@cisco.com>
Message-ID: <20070502223127.GJ10009@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] ipoib/cm: compliance fix
> 
> thanks, applied
> 
>  > Important enough for -stable?
> 
> I don't think so -- there's no way to hit this in practice right now,
> so I don't think it meets the -stable criteria.
> 
> BTW, looking at the code that happens to be in the patch context:
> 
>  > 	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
> 
> I think this annotation is unclear and I'm not sure gcc will do what
> is intended here (and I'm not sure what is intended).  Should this be
> 
> 	if (likely(!(wr_id & IPOIB_CM_RX_UPDATE_MASK))) {
> 
> or
> 
> 	if (unlikely(!(wr_id & IPOIB_CM_RX_UPDATE_MASK))) {
> 
> ...seems as if "unlikely" is appropriate.

I expect unlikely to be equivalent: likely means typically == 1,
unlikely means typically == 0, so !likely(x) is equivalent to unlikely(!x).
I did expect gcc to do the right thing here, but go ahead and test if you like.

And I do agree the "unlikely" version is clearer.
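
For reference, a minimal user-space sketch of the equivalence described
above. It assumes the usual kernel definitions of likely()/unlikely() in
terms of gcc's __builtin_expect; RX_UPDATE_MASK is a hypothetical stand-in
for IPOIB_CM_RX_UPDATE_MASK, whose real value is not shown in this thread.

```c
#include <stdint.h>

/* Assumed to match the kernel's definitions in include/linux/compiler.h. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for IPOIB_CM_RX_UPDATE_MASK (illustration only). */
#define RX_UPDATE_MASK 0x1ULL

/* Form found in the patch context: negation applied outside the hint. */
static inline int taken_not_likely(uint64_t wr_id)
{
	if (!likely(wr_id & RX_UPDATE_MASK))
		return 1;
	return 0;
}

/* Form suggested in review: negation applied inside the hint. */
static inline int taken_unlikely_not(uint64_t wr_id)
{
	if (unlikely(!(wr_id & RX_UPDATE_MASK)))
		return 1;
	return 0;
}
```

Both helpers take the branch for exactly the same wr_id values, and both
tell the compiler that the branch is rarely taken; the second form only
makes that intent easier to read.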

-- 
MST


From mst at dev.mellanox.co.il  Wed May  2 15:37:32 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 01:37:32 +0300
Subject: [ofa-general] Re: [PATCH] libmlx4: fix post inline when posting a
	list
In-Reply-To: <adairbag8xl.fsf@cisco.com>
References: <200705021712.24400.jackm@dev.mellanox.co.il>
	<adavefag9lm.fsf@cisco.com> <20070502222131.GI10009@mellanox.co.il>
	<adairbag8xl.fsf@cisco.com>
Message-ID: <20070502223732.GK10009@mellanox.co.il>

> Yes, but it's annoying to have to count '/'s just to apply a patch.

Maybe we should teach git-am to guess the strip level automatically :)
But you are right, for now.

> And it also doesn't inspire much confidence that something has been
> tested against the tree I'm going to apply it to when it was obviously
> generated from a different tree.

No, this is coming from your tree, don't worry.

Actually we just have a script that does git-checkout
of several trees each into a separate subdirectory.

And it seems Jack works on several trees at the same time,
so he's using quilt at the top level to manage patches across them all,
and that's what he sent you, instead of generating the
patches with git, which would have gotten the strip level right
but is otherwise equivalent.

Hope that's clear.

-- 
MST


From pradeep at us.ibm.com  Wed May  2 17:30:06 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Wed, 2 May 2007 17:30:06 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <20070502064549.GN8447@mellanox.co.il>
Message-ID: <OF9E0366F9.05024880-ON882572CF.007EF3A9-882572D0.0002CE8B@us.ibm.com>

Firstly, thanks for the review, Michael. My responses/questions are below, and
yes, I will fix some of the style issues that
you have pointed out. The new functions (and labels) have the srq/nosrq
suffixes for maintainability purposes
and also to keep a structure similar to the current IPOIB.

Pradeep
pradeep at us.ibm.com

"Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote on 05/01/2007 11:46:48 
PM:

> OK, we are making progress (line-wrapping issues aside :). And there seems to
> be some whitespace damage, too. Pls take care of this.
> 
> I think the handle_rx_wc split is going in the right direction,
> but let's take this through all the datapath.
> 
> I went over the patch in a bit more depth, and I have some questions:
> 
> > +   for (i = 0; i < ipoib_recvq_size; ++i) {
> > +      if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index,
> 
> ...
> 
> > +      if (ipoib_cm_post_receive(dev, i << 32 | index)) {
> 
> 1. It seems there are multiple QPs mapped to a single CQ -
>    and each gets ipoib_recvq_size recv WRs above.
>    Is that right? How do you prevent CQ overrun then?

Good point! Looking at the IB spec, it appears that a CQ overflow
results in a Local Work Queue catastrophic error and puts the QP
(receiver side) in an error state. Hence, I am speculating that the
sending side will see an error. This will result in the sending side
destroying the QP and sending a DREQ message, which will remove the
receive side QP.

A new set of QPs will be created on the send side (this is RC) and
the connection setup starts over again. It will continue, but at a
degraded rate. Is this correct? What other alternative do you suggest:
creating a CQ per QP? Is the max number of CQs an issue to consider if
we adopt this approach?

> 
> > +   /* Find an empty rx_index_ring[] entry */
> > +   for (index = 0; index < NOSRQ_INDEX_RING_SIZE; index++)
> > +      if (priv->cm.rx_index_ring[index] == NULL)
> > +         break; 
> > +
> > +   if ( index == NOSRQ_INDEX_RING_SIZE) {
> > +      spin_unlock_irq(&priv->lock);
> > +      printk(KERN_WARNING "NOSRQ supports a max of %d RC "
> > +             "QPs. That limit has now been reached\n",
> > +             NOSRQ_INDEX_RING_SIZE);
> > +      return -EINVAL;
> > +   }
> 
> 2. So, when QP limit has been reached, remote side will get
>    a reject with custom reject reason?
>    Is so, it seems that since the remote does not know
>    what the reason for reject is, it'll just retry
>    on the next packet, and again and again. Basically,
>    connectivity is denied where it previously worked fine
>    by falling back on datagram mode?
> 

Good point again! Yes, this would be an apt description. However,
I have a few questions (see below).

>    One way to fix this, could be to try and use a reject reason
>    that will tell the remote "I'm busy, switch to datagram mode
>    for a loooooong time". Using path mtu discovery here might be useful
>    to actually have it come back and retry after several minutes.
> 

How does one send a reject reason: through the CM? I could unset the bit
IPOIB_FLAG_ADMIN_CM in flag, but will that not transition all the QPs
to datagram mode? What we need is a mechanism that will let the current
set of QPs stay in connected mode, and transition only the new ones to
datagram mode if connected mode cannot be supported, as in this case.
How can that be done?


>    *In theory*, we could get this even with SRQ -
>    if the *HCA* starts running out of RC QPs - it is just
>    never happening in practice as current HCAs support #QPs larger
>    than a maximum IB subnet size.
>    So I might post a patch to implement this, stay tuned.

This will be interesting.

> 
> > +   spin_lock_irqsave(&priv->lock, flags);
> > +   rx_ptr = priv->cm.rx_index_ring[index];
> > +   spin_unlock_irqrestore(&priv->lock, flags);
> 
> 3. You never actually test the rx_ptr that you got.
>    So why does locking help?
>    A better way to destroy QPs might be to move it to error state first.


In ipoib_cm_stale_task(): priv->cm.rx_index_ring[p->index] = NULL;
this assignment does happen under lock. All I need to do (in the code snippet
above you point out) is check if rx_ptr == NULL, and if so drop the packet.
I did think about this one, but never implemented it.

> 
>    We actually need something like this for CM too - stay tuned for a patch.
> 
> I also commented on some style issues below.
> 
> > Note 1: I have retained the code to avoid IB_WC_RETRY_EXC_ERR while performing
> > interoperability tests. As discussed in this mailing list that may be a CM bug or
> > have the various HCA address it. Hence I would like to separate out that issue
> > from this patch.
> > At a future point when the issue gets resolved I can provide
> > another patch to change the retry_count values back to 0 if need be.
> 
> The correct way to separate it, in my opinion, is to set retry_count = 0,
> and (for now) apply a work-around patch at your site before testing.
> We really don't want to paper over this bug, in my opinion.

Ok, will reset this back to 0, but that is not my preferred way. If someone
were to pick up the code and try it with retry_count=0, the HCAs will
not inter-operate as is. Hence the hesitation.


> > 
> >  struct ipoib_cm_tx {
> > @@ -177,6 +185,7 @@ struct ipoib_cm_dev_priv {
> >     struct ib_wc            ibwc[IPOIB_NUM_WC];
> >     struct ib_sge           rx_sge[IPOIB_CM_RX_SG];
> >     struct ib_recv_wr       rx_wr;
> > +   struct ipoib_cm_rx   **rx_index_ring;
> >  };
> > 
> >  /*
> 
> Isn't "ring" a bit of a misnomer?

Yes, this is a misnomer. This is a vestige of an earlier thing that
I thought of. Will change it to something else more appropriate.


> > +   unsigned long flags;
> > +
> > +   index = id  & NOSRQ_INDEX_MASK ;
> > +   wr_id = id >> 32;
> 
> So wr_id has always, ever, 32 lower bits set - why make it u64 then?

Because I later use it as wr_id << 32 | index | IPOIB_CM_OP_NOSRQ.
I could have used index | IPOIB_CM_OP_NOSRQ instead.
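
The id layout discussed here (receive slot in the upper 32 bits, QP index
in the lower bits) can be sketched as below. This is only an illustration:
the helper names are made up, and the value used for NOSRQ_INDEX_MASK is an
assumption, since the patch's real definition is not quoted in this thread.
Note that if the loop counter in "i << 32 | index" is a plain 32-bit int,
the shift is undefined in C, so the sketch widens to 64 bits first.

```c
#include <stdint.h>

/* Assumed value for illustration; the patch's real mask is not shown. */
#define NOSRQ_INDEX_MASK 0xffffffffULL

/* Hypothetical helper: build the 64-bit id from a receive slot and QP index. */
static inline uint64_t nosrq_pack_id(uint32_t slot, uint32_t index)
{
	/* Widen before shifting: shifting a 32-bit value by 32 is undefined. */
	return ((uint64_t)slot << 32) | index;
}

/* Hypothetical helper: recover the QP index, as in "id & NOSRQ_INDEX_MASK". */
static inline uint32_t nosrq_id_to_index(uint64_t id)
{
	return (uint32_t)(id & NOSRQ_INDEX_MASK);
}

/* Hypothetical helper: recover the receive slot, as in "id >> 32". */
static inline uint32_t nosrq_id_to_slot(uint64_t id)
{
	return (uint32_t)(id >> 32);
}
```

With this layout, unpacking and repacking an id gives back the same value,
which is the point behind the "isn't this just id, again?" question; only
an extra flag such as IPOIB_CM_OP_NOSRQ would change it.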

> 
> > +   /* There is a slender chance of a race between the stale_task
> > +    * running after a period of inactivity and the receipt of
> > +    * a packet being processed at about the same instant. 
> > +    * Hence the lock */
> 
> I think you can get rid of this, by changing the stale task code:
> move QP to error, and wait for WRs posted to complete.
> Then there won't be any more completions for this QP.
> 
> As it is, I'm not convinced you can't get a completion after
> QP has been removed out of the array - so it seems the race hasn't
> been solved here?

We have discussed this above.

> 
> We actually need something like this for CM too -
> stay tuned for a patch.
> 
> > +   spin_lock_irqsave(&priv->lock, flags);
> > +   rx_ptr = priv->cm.rx_index_ring[index];
> > +   spin_unlock_irqrestore(&priv->lock, flags);
> > +
> > +   priv->cm.rx_wr.wr_id = wr_id << 32 | index | IPOIB_CM_OP_NOSRQ;
> 
> Isn't this just id, again?

This is id | IPOIB_CM_OP_NOSRQ.


> > +static int ipoib_cm_post_receive(struct net_device *dev, u64 id)
> > +{
> > +   struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +   int ret;
> > +
> > +   if (priv->cm.srq) 
> > +      ret = post_receive_srq(dev, id);
> > +   else 
> > +      ret = post_receive_nosrq(dev, id);
> > +
> > +   return ret;
> > +}
> 
> I think you can split this one now that srq/nonsrq completions are
> handled separately.

I don't understand this comment.


From pradeep at us.ibm.com  Wed May  2 18:39:15 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Wed, 2 May 2007 18:39:15 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <OF9E0366F9.05024880-ON882572CF.007EF3A9-882572D0.0002CE8B@LocalDomain>
Message-ID: <OF99558B5B.6B5F7DF1-ON882572D0.0008A306-882572D0.00092360@us.ibm.com>

> 
> > 
> > > +   spin_lock_irqsave(&priv->lock, flags);
> > > +   rx_ptr = priv->cm.rx_index_ring[index];
> > > +   spin_unlock_irqrestore(&priv->lock, flags);
> > 
> > 3. You never actually test the rx_ptr that you got.
> >    So why does locking help?
> >    A better way to destroy QPs might be to move it to error state first.
> 
> In ipoib_cm_stale_task(): priv->cm.rx_index_ring[p->index] = NULL;
> this assignment does happen under lock. All I need to do (in the code snippet
> above you point out) is check if rx_ptr == NULL, if so drop the packet.
> I did think about this one, but never implemented it.
> 

I get what you suggest. Move the QP to error state under a lock and then
destroy it subsequently. Since the QP is in error state, nothing else should
come through and we can eliminate the locking, right?

Yes, this is doable; we just need to check whether rx_ptr == NULL and
drop the packet if that is the case.

Pradeep
pradeep at us.ibm.com


From mst at dev.mellanox.co.il  Wed May  2 20:32:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 06:32:30 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <OF9E0366F9.05024880-ON882572CF.007EF3A9-882572D0.0002CE8B@us.ibm.com>
References: <20070502064549.GN8447@mellanox.co.il>
	<OF9E0366F9.05024880-ON882572CF.007EF3A9-882572D0.0002CE8B@us.ibm.com>
Message-ID: <20070503033230.GM10009@mellanox.co.il>

> > > +static int ipoib_cm_post_receive(struct net_device *dev, u64 id)
> > > +{
> > > +   struct ipoib_dev_priv *priv = netdev_priv(dev);
> > > +   int ret;
> > > +
> > > +   if (priv->cm.srq) 
> > > +      ret = post_receive_srq(dev, id);
> > > +   else 
> > > +      ret = post_receive_nosrq(dev, id);
> > > +
> > > +   return ret;
> > > +}
> > 
> > I think you can split this one now that srq/nonsrq completions are
> > handled separately.
> 
> I don't understand this comment.

Since you now have 2 handle_wc routines for srq/nonsrq,
call the appropriate one directly. Generally, I think we can
get rid of if (srq) tests on data path.
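
A rough sketch of that idea, with made-up types and handler names (only
post_receive_srq and post_receive_nosrq come from the patch under review):
once completions are handled by separate routines, each routine can call
its own post function directly, so no per-packet test of priv->cm.srq is
needed.

```c
#include <stdint.h>

/* Hypothetical bookkeeping, standing in for the real driver state. */
struct rx_stats {
	unsigned int srq_posts;
	unsigned int nosrq_posts;
};

/* Stand-ins for the patch's post_receive_srq()/post_receive_nosrq(). */
static inline int post_receive_srq(struct rx_stats *s, uint64_t id)
{
	(void)id;
	s->srq_posts++;
	return 0;
}

static inline int post_receive_nosrq(struct rx_stats *s, uint64_t id)
{
	(void)id;
	s->nosrq_posts++;
	return 0;
}

/* Hypothetical SRQ completion handler: reposts on the SRQ path directly. */
static inline int handle_rx_wc_srq(struct rx_stats *s, uint64_t id)
{
	return post_receive_srq(s, id);
}

/* Hypothetical non-SRQ completion handler: no "if (srq)" on the data path. */
static inline int handle_rx_wc_nosrq(struct rx_stats *s, uint64_t id)
{
	return post_receive_nosrq(s, id);
}
```

The srq/nosrq decision is then made once, when choosing which handler to
install, rather than on every received packet.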

-- 
MST


From mst at dev.mellanox.co.il  Wed May  2 20:55:47 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 06:55:47 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <OF9E0366F9.05024880-ON882572CF.007EF3A9-882572D0.0002CE8B@us.ibm.com>
References: <20070502064549.GN8447@mellanox.co.il>
	<OF9E0366F9.05024880-ON882572CF.007EF3A9-882572D0.0002CE8B@us.ibm.com>
Message-ID: <20070503035547.GN10009@mellanox.co.il>

> > > +      if (ipoib_cm_post_receive(dev, i << 32 | index)) {
> > 
> > 1. It seems there are multiple QPs mapped to a single CQ -
> >    and each gets ipoib_recvq_size recv WRs above.
> >    Is that right? How do you prevent CQ overrun then?
> 
> Good point! Looking at the IB spec it appears that upon CQ overflow
> it results in a Local Work Queue catastrophic error and puts the QP
> (receiver side) in an error state.

Look further in spec - you get CQ error, too.

> Hence, I am speculating that the 
> sending side will see an error. This will result in the sending side 
> destroying the QP and sending a DREQ message which, will remove the 
> receive side QP.
> 
> A new set of QPs will be created on the send side (this is RC) and
> the connection setup starts over again. It will continue, but at a
> degraded rate.
> Is this correct? What other alternative do you suggest
> -create a CQ per QP? Is the max number of CQs an issue to consider, if 
> we adopt this approach?

We were switching to NAPI though, and NAPI kind of forces you to use
a common CQ, I think.

-- 
MST


From mst at dev.mellanox.co.il  Wed May  2 21:47:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 07:47:11 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <OF9E0366F9.05024880-ON882572CF.007EF3A9-882572D0.0002CE8B@us.ibm.com>
References: <20070502064549.GN8447@mellanox.co.il>
	<OF9E0366F9.05024880-ON882572CF.007EF3A9-882572D0.0002CE8B@us.ibm.com>
Message-ID: <20070503044711.GO10009@mellanox.co.il>

> > > Note 1: I have retained the code to avoid IB_WC_RETRY_EXC_ERR while
> > > performing interoperability tests. As discussed in this mailing list that
> > > may be a CM bug or have the various HCA address it. Hence I would like to
> > > separate out that issue from this patch.  At a future point when the issue
> > > gets resolved I can provide another patch to change the retry_count values
> > > back to 0 if need be.
> > 
> > The correct way to separate it, in my opinion, is to set retry_count = 0,
> > and (for now) apply a work-around patch at your site before testing.
> > We really don't want to paper over this bug, in my opinion.
> 
> Ok, will reset this back to 0, but that is not (my) preferred way. If some
> one were to pick up the code and try it with retry_count=0, the HCAs will 
> not inter-operate as is. Hence the hesitation.

BTW, why do you ignore the option to use UC QP?
Even taking this single issue aside, I think that 
UC is a better QP type choice for IPoIB than RC - you get away from RNR errors
(so you can prepost less data, and you can even reset some RQs
 temporarily, moving WRs between them, without affecting TX),
and you get send completion sooner,
so you can use less memory for send buffers and smaller TX queues.

With UC, we might get stale TX connections, so a way to detect
and handle them will need to be designed.

-- 
MST


From k_mahesh85 at yahoo.co.in  Wed May  2 23:50:15 2007
From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh)
Date: Thu, 3 May 2007 07:50:15 +0100 (BST)
Subject: [ofa-general] [query] SMI nodeinfo, port_info structures
Message-ID: <735493.40597.qm@web8323.mail.in.yahoo.com>

Hi,

Even though nobody except the ipath driver currently uses the nodeinfo and port_info structures, shouldn't these structures be in smi.h?

Because these structures are not specific to any hardware but they are specific to the SMI.

And can anyone tell me why some fields have big endian (__bexx) data type and others have normal (uxx) data type in these structures?

-Mahesh

       

From yosefe at voltaire.com  Thu May  3 01:02:56 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 03 May 2007 11:02:56 +0300
Subject: [ofa-general] Re: [PATCH 3/3] mthca: provider-level caching of pkeys
In-Reply-To: <adaps5jglfm.fsf@cisco.com>
References: <4638B432.3060801@voltaire.com> <4638B4FE.8010605@voltaire.com>
	<adaps5jglfm.fsf@cisco.com>
Message-ID: <46399730.9040902@voltaire.com>

Roland Dreier wrote:
> Oh, I didn't see this patch before...
> anyway along with all the minor whitespace,etc problems, there are two
> big issues: 1) this patch needs to be _before_ the previous 2/3 patch
> (or else the intermediate state is buggy) and 2) you need a GID cache too.

1) ok, i'll do that as MST suggested
2) why do we need a GID cache?

what are the other problems you found there?


--Yossi


From yosefe at voltaire.com  Thu May  3 01:14:04 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 03 May 2007 11:14:04 +0300
Subject: [ofa-general] Re: [PATCH 3/3] mthca: provider-level caching of
	pkeys
In-Reply-To: <46399730.9040902@voltaire.com>
References: <4638B432.3060801@voltaire.com>
	<4638B4FE.8010605@voltaire.com>	<adaps5jglfm.fsf@cisco.com>
	<46399730.9040902@voltaire.com>
Message-ID: <463999CC.6080800@voltaire.com>

Yosef Etigin wrote:
> 2) why do we need a GID cache?
Sorry, didn't notice it's called from mthca_read_ah

--Yossi


From vlad at lists.openfabrics.org  Thu May  3 02:36:48 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu,  3 May 2007 02:36:48 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070503-0200 daily build status
Message-ID: <20070503093649.502EBE608E9@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-22.ELsmp

Failed:


From yosefe at voltaire.com  Thu May  3 02:40:11 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 03 May 2007 12:40:11 +0300
Subject: [ofa-general] [PATCH 2/3 v2] remove ib pkey gid and lmc cache
In-Reply-To: <20070502182315.GQ22292@mellanox.co.il>
References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com>
	<20070502182315.GQ22292@mellanox.co.il>
Message-ID: <4639ADFB.70707@voltaire.com>

don't use ib cache in core and ulp's

v1:
* Add ib_find_gid and ib_find_pkey, over uncached device queries
* Modify users of the cache in core and ulp's to use the uncached methods

changes in v2:
* don't remove the cache completely, and still use it within mthca driver.
  
the mthca changes and complete removal of the cache - in the next patch

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/cm.c            |    8 -
 drivers/infiniband/core/cma.c           |    9 --
 drivers/infiniband/core/device.c        |  137 ++++++++++++++++++++++++++++++--
 drivers/infiniband/core/mad.c           |    5 -
 drivers/infiniband/core/multicast.c     |    3 
 drivers/infiniband/core/sa_query.c      |    3 
 drivers/infiniband/core/verbs.c         |    3 
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |    3 
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |    2 
 drivers/infiniband/ulp/srp/ib_srp.c     |    6 -
 include/rdma/ib_verbs.h                 |   37 ++++++++
 11 files changed, 184 insertions(+), 32 deletions(-)
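
For readers following the lookup logic, here is a minimal userspace sketch
of the linear table search that ib_find_pkey performs over uncached queries.
The demo_* names and struct demo_port are hypothetical stand-ins, not kernel
APIs. Several callers in this series pass NULL for an output index, so the
sketch assumes the index pointer is optional.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a port's P_Key table; not a kernel type. */
struct demo_port {
	const uint16_t *pkey_tbl;
	int pkey_tbl_len;
};

/* Stand-in for ib_query_pkey(): read one table entry by index. */
static int demo_query_pkey(const struct demo_port *port, int i, uint16_t *pkey)
{
	if (i < 0 || i >= port->pkey_tbl_len)
		return -EINVAL;
	*pkey = port->pkey_tbl[i];
	return 0;
}

/*
 * Linear search in the style of ib_find_pkey(): returns 0 on a match,
 * -ENOENT when the key is absent, or the error from the query.
 * Assumption: @index may be NULL when the caller only needs to know
 * whether the key is present.
 */
static int demo_find_pkey(const struct demo_port *port, uint16_t pkey,
			  uint16_t *index)
{
	uint16_t tmp;
	int ret;
	int i;

	for (i = 0; i < port->pkey_tbl_len; ++i) {
		ret = demo_query_pkey(port, i, &tmp);
		if (ret)
			return ret;
		if (tmp == pkey) {
			if (index)
				*index = (uint16_t)i;
			return 0;
		}
	}
	return -ENOENT;
}
```

Every lookup walks the table one query at a time, so the cost grows with
the table length. That is the trade-off of dropping the cache.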

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-03 11:25:58.878535870 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-03 12:34:12.639368278 +0300
@@ -149,6 +149,20 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+
+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
+}
+
+
 /**
  * ib_alloc_device - allocate an IB device struct
  * @size:size of structure to allocate
@@ -592,6 +606,122 @@ int ib_modify_port(struct ib_device *dev
 }
 EXPORT_SYMBOL(ib_modify_port);
 
+/**
+ * ib_find_gid - Returns the port number and index of a GID
+ * @device: Device to query.
+ * @gid: GID to look for
+ * @port_num: Returned port number
+ * @index: Returned index
+ *
+ * ib_find_gid() searches the GID table of every port on @device and
+ * returns the port number and table index at which @gid was found
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		       u8 *port_num, u16 *index)
+{
+	struct ib_port_attr *tprops = NULL;
+	union ib_gid tmp_gid;
+	int ret;
+	int port;
+	int i;
+
+	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
+
+	for (port = start_port(device); port <= end_port(device); ++port) {
+		ret = ib_query_port(device, port, tprops);
+		if (ret)
+			continue;
+
+		for (i = 0; i < tprops->gid_tbl_len; ++i) {
+			ret = ib_query_gid(device, port, i, &tmp_gid);
+			if (ret)
+				goto out;
+			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
+				*port_num = port;
+				if (index)
+					*index = i;
+				goto out;
+			}
+		} /* for i */
+	}
+	ret = -ENOENT;
+out:
+	kfree(tprops);
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_gid);
+
+/**
+ * ib_find_pkey - Returns the index of a PKey on a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @pkey: PKey to look for
+ * @index: Returned index
+ *
+ * ib_find_pkey() returns the index of @pkey in the pkey table
+ * on port @port_num
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index)
+{
+	struct ib_port_attr *tprops = NULL;
+	int ret;
+	int i = -1;
+	u16 tmp_pkey;
+
+	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
+
+	ret = ib_query_port(device, port_num, tprops);
+	if (ret) {
+		printk(KERN_WARNING "ib_query_port failed, ret = %d\n", ret);
+		goto out;
+	}
+
+	for (i = 0; i < tprops->pkey_tbl_len; ++i) {
+		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
+		if (ret)
+			goto out;
+
+		if (pkey == tmp_pkey) {
+			*index = i;
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = -ENOENT;
+
+out:
+	kfree(tprops);
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_pkey);
+
+/**
+ * ib_query_lmc - Returns the LMC of a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @lmc: Returned LMC
+ *
+ * ib_query_lmc() returns the LID mask control associated
+ * with port @port_num
+ */
+int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc)
+{
+	struct ib_port_attr *tprops = NULL;
+	int ret;
+
+	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
+	ret = ib_query_port(device, port_num, tprops);
+	if (ret) goto err;
+
+	*lmc = tprops->lmc;
+err:
+	kfree(tprops);
+	return ret;
+
+}
+EXPORT_SYMBOL(ib_query_lmc);
+
 static int __init ib_core_init(void)
 {
 	int ret;
Index: b/include/rdma/ib_verbs.h
===================================================================
--- a/include/rdma/ib_verbs.h	2007-05-03 11:25:59.090497856 +0300
+++ b/include/rdma/ib_verbs.h	2007-05-03 12:34:01.210405046 +0300
@@ -1134,6 +1134,43 @@ int ib_modify_port(struct ib_device *dev
 		   struct ib_port_modify *port_modify);
 
 /**
+ * ib_find_gid - Returns the port number and index of a GID
+ * @device: Device to query.
+ * @gid: GID to look for
+ * @port_num: Returned port number
+ * @index: Returned index
+ *
+ * ib_find_gid() searches the GID table of every port on @device and
+ * returns the port number and table index at which @gid was found
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		       u8 *port_num, u16 *index);
+
+/**
+ * ib_find_pkey - Returns the index of a PKey on a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @pkey: PKey to look for
+ * @index: Returned index
+ *
+ * ib_find_pkey() returns the index of @pkey in the pkey table
+ * on port @port_num
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index);
+
+/**
+ * ib_query_lmc - Returns the LMC of a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @lmc: Returned LMC
+ *
+ * ib_query_lmc() returns the LID mask control associated
+ * with port @port_num
+ */
+int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc);
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *
Index: b/drivers/infiniband/core/cm.c
===================================================================
--- a/drivers/infiniband/core/cm.c	2007-05-03 11:25:58.900531925 +0300
+++ b/drivers/infiniband/core/cm.c	2007-05-03 11:32:21.376036145 +0300
@@ -46,8 +46,8 @@
 #include <linux/spinlock.h>
 #include <linux/workqueue.h>
 
-#include <rdma/ib_cache.h>
 #include <rdma/ib_cm.h>
+#include <rdma/ib_verbs.h>
 #include "cm_msgs.h"
 
 MODULE_AUTHOR("Sean Hefty");
@@ -275,7 +275,7 @@ static int cm_init_av_by_path(struct ib_
 
 	read_lock_irqsave(&cm.device_lock, flags);
 	list_for_each_entry(cm_dev, &cm.device_list, list) {
-		if (!ib_find_cached_gid(cm_dev->device, &path->sgid,
+		if (!ib_find_gid(cm_dev->device, &path->sgid,
 					&p, NULL)) {
 			port = &cm_dev->port[p-1];
 			break;
@@ -286,7 +286,7 @@ static int cm_init_av_by_path(struct ib_
 	if (!port)
 		return -EINVAL;
 
-	ret = ib_find_cached_pkey(cm_dev->device, port->port_num,
+	ret = ib_find_pkey(cm_dev->device, port->port_num,
 				  be16_to_cpu(path->pkey), &av->pkey_index);
 	if (ret)
 		return ret;
@@ -1382,7 +1382,7 @@ static int cm_req_handler(struct cm_work
 	cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]);
 	ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av);
 	if (ret) {
-		ib_get_cached_gid(work->port->cm_dev->device,
+		ib_query_gid(work->port->cm_dev->device,
 				  work->port->port_num, 0, &work->path[0].sgid);
 		ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
 			       &work->path[0].sgid, sizeof work->path[0].sgid,
Index: b/drivers/infiniband/core/cma.c
===================================================================
--- a/drivers/infiniband/core/cma.c	2007-05-03 11:25:58.916529056 +0300
+++ b/drivers/infiniband/core/cma.c	2007-05-03 11:32:21.401031673 +0300
@@ -41,7 +41,6 @@
 
 #include <rdma/rdma_cm.h>
 #include <rdma/rdma_cm_ib.h>
-#include <rdma/ib_cache.h>
 #include <rdma/ib_cm.h>
 #include <rdma/ib_sa.h>
 #include <rdma/iw_cm.h>
@@ -325,7 +324,7 @@ static int cma_acquire_dev(struct rdma_i
 	}
 
 	list_for_each_entry(cma_dev, &dev_list, list) {
-		ret = ib_find_cached_gid(cma_dev->device, &gid,
+		ret = ib_find_gid(cma_dev->device, &gid,
 					 &id_priv->id.port_num, NULL);
 		if (!ret) {
 			ret = cma_set_qkey(cma_dev->device,
@@ -514,7 +513,7 @@ static int cma_ib_init_qp_attr(struct rd
 	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
 	int ret;
 
-	ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num,
+	ret = ib_find_pkey(id_priv->id.device, id_priv->id.port_num,
 				  ib_addr_get_pkey(dev_addr),
 				  &qp_attr->pkey_index);
 	if (ret)
@@ -1658,11 +1657,11 @@ static int cma_bind_loopback(struct rdma
 	cma_dev = list_entry(dev_list.next, struct cma_device, list);
 
 port_found:
-	ret = ib_get_cached_gid(cma_dev->device, p, 0, &gid);
+	ret = ib_query_gid(cma_dev->device, p, 0, &gid);
 	if (ret)
 		goto out;
 
-	ret = ib_get_cached_pkey(cma_dev->device, p, 0, &pkey);
+	ret = ib_query_pkey(cma_dev->device, p, 0, &pkey);
 	if (ret)
 		goto out;
 
Index: b/drivers/infiniband/core/mad.c
===================================================================
--- a/drivers/infiniband/core/mad.c	2007-05-03 11:25:58.930526546 +0300
+++ b/drivers/infiniband/core/mad.c	2007-05-03 11:32:21.435025591 +0300
@@ -34,7 +34,6 @@
  * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $
  */
 #include <linux/dma-mapping.h>
-#include <rdma/ib_cache.h>
 
 #include "mad_priv.h"
 #include "mad_rmpp.h"
@@ -1707,13 +1706,13 @@ static inline int rcv_has_same_gid(struc
 	if (!send_resp && rcv_resp) {
 		/* is request/response. */
 		if (!(attr.ah_flags & IB_AH_GRH)) {
-			if (ib_get_cached_lmc(device, port_num, &lmc))
+			if (ib_query_lmc(device, port_num, &lmc))
 				return 0;
 			return (!lmc || !((attr.src_path_bits ^
 					   rwc->wc->dlid_path_bits) &
 					  ((1 << lmc) - 1)));
 		} else {
-			if (ib_get_cached_gid(device, port_num,
+			if (ib_query_gid(device, port_num,
 					      attr.grh.sgid_index, &sgid))
 				return 0;
 			return !memcmp(sgid.raw, rwc->recv_buf.grh->dgid.raw,
Index: b/drivers/infiniband/core/multicast.c
===================================================================
--- a/drivers/infiniband/core/multicast.c	2007-05-03 11:25:58.947523497 +0300
+++ b/drivers/infiniband/core/multicast.c	2007-05-03 11:32:21.454022192 +0300
@@ -38,7 +38,6 @@
 #include <linux/bitops.h>
 #include <linux/random.h>
 
-#include <rdma/ib_cache.h>
 #include "sa.h"
 
 static void mcast_add_one(struct ib_device *device);
@@ -686,7 +685,7 @@ int ib_init_ah_from_mcmember(struct ib_d
 	u16 gid_index;
 	u8 p;
 
-	ret = ib_find_cached_gid(device, &rec->port_gid, &p, &gid_index);
+	ret = ib_find_gid(device, &rec->port_gid, &p, &gid_index);
 	if (ret)
 		return ret;
 
Index: b/drivers/infiniband/core/sa_query.c
===================================================================
--- a/drivers/infiniband/core/sa_query.c	2007-05-03 11:25:58.964520449 +0300
+++ b/drivers/infiniband/core/sa_query.c	2007-05-03 11:32:21.472018972 +0300
@@ -47,7 +47,6 @@
 #include <linux/workqueue.h>
 
 #include <rdma/ib_pack.h>
-#include <rdma/ib_cache.h>
 #include "sa.h"
 
 MODULE_AUTHOR("Roland Dreier");
@@ -477,7 +476,7 @@ int ib_init_ah_from_path(struct ib_devic
 		ah_attr->ah_flags = IB_AH_GRH;
 		ah_attr->grh.dgid = rec->dgid;
 
-		ret = ib_find_cached_gid(device, &rec->sgid, &port_num,
+		ret = ib_find_gid(device, &rec->sgid, &port_num,
 					 &gid_index);
 		if (ret)
 			return ret;
Index: b/drivers/infiniband/core/verbs.c
===================================================================
--- a/drivers/infiniband/core/verbs.c	2007-05-03 11:25:58.984516863 +0300
+++ b/drivers/infiniband/core/verbs.c	2007-05-03 11:32:21.499014142 +0300
@@ -43,7 +43,6 @@
 #include <linux/string.h>
 
 #include <rdma/ib_verbs.h>
-#include <rdma/ib_cache.h>
 
 int ib_rate_to_mult(enum ib_rate rate)
 {
@@ -159,7 +158,7 @@ int ib_init_ah_from_wc(struct ib_device 
 		ah_attr->ah_flags = IB_AH_GRH;
 		ah_attr->grh.dgid = grh->sgid;
 
-		ret = ib_find_cached_gid(device, &grh->dgid, &port_num,
+		ret = ib_find_gid(device, &grh->dgid, &port_num,
 					 &gid_index);
 		if (ret)
 			return ret;
Index: b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-03 11:25:59.020510408 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-03 11:32:21.536007523 +0300
@@ -33,7 +33,6 @@
  */
 
 #include <rdma/ib_cm.h>
-#include <rdma/ib_cache.h>
 #include <net/dst.h>
 #include <net/icmp.h>
 #include <linux/icmpv6.h>
@@ -759,7 +758,7 @@ static int ipoib_cm_modify_tx_init(struc
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
 	int qp_attr_mask, ret;
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index);
+	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index);
 	if (ret) {
 		ipoib_warn(priv, "pkey 0x%x not in cache: %d\n", priv->pkey, ret);
 		return ret;
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-03 11:27:12.676301751 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-03 11:32:21.563002693 +0300
@@ -38,7 +38,7 @@
 #include <linux/delay.h>
 #include <linux/dma-mapping.h>
 
-#include <rdma/ib_cache.h>
+#include <rdma/ib_verbs.h>
 
 #include "ipoib.h"
 
Index: b/drivers/infiniband/ulp/srp/ib_srp.c
===================================================================
--- a/drivers/infiniband/ulp/srp/ib_srp.c	2007-05-03 11:25:59.064502518 +0300
+++ b/drivers/infiniband/ulp/srp/ib_srp.c	2007-05-03 11:32:21.592997326 +0300
@@ -48,8 +48,6 @@
 #include <scsi/scsi_dbg.h>
 #include <scsi/srp.h>
 
-#include <rdma/ib_cache.h>
-
 #include "ib_srp.h"
 
 #define DRV_NAME	"ib_srp"
@@ -164,7 +162,7 @@ static int srp_init_qp(struct srp_target
 	if (!attr)
 		return -ENOMEM;
 
-	ret = ib_find_cached_pkey(target->srp_host->dev->dev,
+	ret = ib_find_pkey(target->srp_host->dev->dev,
 				  target->srp_host->port,
 				  be16_to_cpu(target->path.pkey),
 				  &attr->pkey_index);
@@ -1780,7 +1778,7 @@ static ssize_t srp_create_target(struct 
 	if (ret)
 		goto err;
 
-	ib_get_cached_gid(host->dev->dev, host->port, 0, &target->path.sgid);
+	ib_query_gid(host->dev->dev, host->port, 0, &target->path.sgid);
 
 	printk(KERN_DEBUG PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x "
 	       "service_id %016llx dgid %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",


From mst at dev.mellanox.co.il  Thu May  3 02:57:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 12:57:30 +0300
Subject: [ofa-general] Re: [PATCH 2/3 v2] remove ib pkey gid and lmc cache
In-Reply-To: <4639ADFB.70707@voltaire.com>
References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com>
	<20070502182315.GQ22292@mellanox.co.il>
	<4639ADFB.70707@voltaire.com>
Message-ID: <20070503095730.GA10009@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCH 2/3 v2] remove ib pkey gid and lmc cache
> 
> don't use ib cache in core and ulp's
> 
> v1:
> * Add ib_find_gid and ib_find_pkey, over uncached device queries 
> * Modify users of the cache in core and ulp's to use the unchached methods
> 
> changes from v2:
> * don't remove the cache compelely, and still use it within mthca driver.
>   
> the mthca changes and complete removal of the cache - in the next patch
> 
> Signed-off-by: Yosef Etigin <yosefe at voltaire.com>

You haven't addressed the rest of my comments, though.
http://article.gmane.org/gmane.linux.drivers.openib/39173
Why is that?

-- 
MST


From jackm at dev.mellanox.co.il  Thu May  3 02:59:03 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 3 May 2007 12:59:03 +0300
Subject: [ofa-general] Re: [PATCH] libmlx4: fix post inline when posting a
	list
In-Reply-To: <adavefag9lm.fsf@cisco.com>
References: <200705021712.24400.jackm@dev.mellanox.co.il>
	<adavefag9lm.fsf@cisco.com>
Message-ID: <200705031259.03428.jackm@dev.mellanox.co.il>

On Thursday 03 May 2007 01:10, Roland Dreier wrote:
> please use my git tree rather than some other setup, so that I can
> apply things with patch -p1

In our build, we have a directory of userspace fixes (patches) that get applied while
the fix approval/commit process is in progress.  Once a fix is committed, the associated
patch is removed from the build (since we fetch/rebase or pull).

I apologize that I did not pay attention to the directories -- I just copied the patch
we are maintaining in that user-space fixes directory (which is at the level indicated
by the patch).

The irony is that I started off with a patch which was generated via git-diff on my 
libmlx4 git clone, and modified the file paths to comply with our 
userspace patch fixes requirements.

Next time, I'll make sure to send the original patch which is generated by
git-diff on my libmlx4 clone.

- Jack 
> 


From mst at dev.mellanox.co.il  Thu May  3 03:48:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 13:48:06 +0300
Subject: [ofa-general] [PATCH 0 of 3] comp_vector kernel support
Message-ID: <20070503104806.GC10009@mellanox.co.il>

The following patch series adds completion vector support in the kernel:
1. Extend ib_create_cq to pass in a comp_vector parameter
2. Update all ULPs and providers
3. Enhance mthca to support multiple vectors if MSI-X is enabled on SMP
4. Have other providers report support for a single completion vector
5. Enhance uverbs and IPoIB CM to use multiple vectors if available

Please consider for 2.6.22.

-- 
MST


From mst at dev.mellanox.co.il  Thu May  3 03:48:47 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 13:48:47 +0300
Subject: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector support in
	core
Message-ID: <20070503104847.GD10009@mellanox.co.il>

Extend ib_create_cq to pass in comp_vector parameter -
this parallels our userspace API.
Update all ULPs and providers.
Make uverbs use multiple vectors if available.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

Note: since num_comp_vectors = 0 is not legal, and to minimize provider churn,
I set num_comp_vectors to a sane value in core. Providers can increase that.
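
To illustrate the intended contract, here is a small userspace sketch of
the default and of the range rule documented for comp_vector. The demo_*
names are hypothetical and are not part of this patch. In particular, the
patch documents the valid range but does not add a range check in core, and
the round-robin helper is only one possible policy a consumer might choose.

```c
#include <errno.h>

/* Hypothetical stand-in for the relevant field of struct ib_device. */
struct demo_device {
	int num_comp_vectors;
};

/* Mirrors the core default: one vector unless a provider raises it. */
static void demo_init_device(struct demo_device *dev)
{
	dev->num_comp_vectors = 1;
}

/* Returns 0 if comp_vector is in [0, num_comp_vectors), else -EINVAL. */
static int demo_check_comp_vector(const struct demo_device *dev,
				  int comp_vector)
{
	if (comp_vector < 0 || comp_vector >= dev->num_comp_vectors)
		return -EINVAL;
	return 0;
}

/*
 * Assumed policy, not from the patch: spread successive CQs across the
 * available vectors. Relies on num_comp_vectors being at least 1.
 */
static int demo_pick_comp_vector(const struct demo_device *dev,
				 unsigned int nth_cq)
{
	return (int)(nth_cq % (unsigned int)dev->num_comp_vectors);
}
```

With the default of one vector, every consumer ends up on vector 0, which
matches the literal 0 passed by the ULPs updated below.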

Index: linux-2.6/drivers/infiniband/core/device.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/core/device.c
+++ linux-2.6/drivers/infiniband/core/device.c
@@ -161,9 +161,14 @@ static int alloc_name(char *name)
  */
 struct ib_device *ib_alloc_device(size_t size)
 {
+	struct ib_device *dev;
 	BUG_ON(size < sizeof (struct ib_device));
 
-	return kzalloc(size, GFP_KERNEL);
+	dev = kzalloc(size, GFP_KERNEL);
+	if (dev)
+		dev->num_comp_vectors = 1;
+
+	return dev;
 }
 EXPORT_SYMBOL(ib_alloc_device);
 
Index: linux-2.6/drivers/infiniband/core/mad.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/core/mad.c
+++ linux-2.6/drivers/infiniband/core/mad.c
@@ -2767,7 +2767,7 @@ static int ib_mad_port_open(struct ib_de
 	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
 	port_priv->cq = ib_create_cq(port_priv->device,
 				     ib_mad_thread_completion_handler,
-				     NULL, port_priv, cq_size);
+				     NULL, port_priv, cq_size, 0);
 	if (IS_ERR(port_priv->cq)) {
 		printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n");
 		ret = PTR_ERR(port_priv->cq);
Index: linux-2.6/drivers/infiniband/core/uverbs_cmd.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/core/uverbs_cmd.c
+++ linux-2.6/drivers/infiniband/core/uverbs_cmd.c
@@ -802,6 +802,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uv
 	INIT_LIST_HEAD(&obj->async_list);
 
 	cq = file->device->ib_dev->create_cq(file->device->ib_dev, cmd.cqe,
+					     cmd.comp_vector,
 					     file->ucontext, &udata);
 	if (IS_ERR(cq)) {
 		ret = PTR_ERR(cq);
Index: linux-2.6/drivers/infiniband/core/uverbs_main.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/core/uverbs_main.c
+++ linux-2.6/drivers/infiniband/core/uverbs_main.c
@@ -752,7 +752,7 @@ static void ib_uverbs_add_one(struct ib_
 	spin_unlock(&map_lock);
 
 	uverbs_dev->ib_dev           = device;
-	uverbs_dev->num_comp_vectors = 1;
+	uverbs_dev->num_comp_vectors = device->num_comp_vectors;
 
 	uverbs_dev->dev = cdev_alloc();
 	if (!uverbs_dev->dev)
Index: linux-2.6/drivers/infiniband/core/verbs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/core/verbs.c
+++ linux-2.6/drivers/infiniband/core/verbs.c
@@ -609,11 +609,11 @@ EXPORT_SYMBOL(ib_destroy_qp);
 struct ib_cq *ib_create_cq(struct ib_device *device,
 			   ib_comp_handler comp_handler,
 			   void (*event_handler)(struct ib_event *, void *),
-			   void *cq_context, int cqe)
+			   void *cq_context, int cqe, int comp_vector)
 {
 	struct ib_cq *cq;
 
-	cq = device->create_cq(device, cqe, NULL, NULL);
+	cq = device->create_cq(device, cqe, comp_vector, NULL, NULL);
 
 	if (!IS_ERR(cq)) {
 		cq->device        = device;
Index: linux-2.6/drivers/infiniband/hw/amso1100/c2_provider.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/amso1100/c2_provider.c
+++ linux-2.6/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -290,7 +290,7 @@ static int c2_destroy_qp(struct ib_qp *i
 	return 0;
 }
 
-static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries,
+static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, int vector,
 				  struct ib_ucontext *context,
 				  struct ib_udata *udata)
 {
Index: linux-2.6/drivers/infiniband/hw/cxgb3/iwch_provider.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ linux-2.6/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -139,7 +139,7 @@ static int iwch_destroy_cq(struct ib_cq 
 	return 0;
 }
 
-static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries,
+static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, int vector,
 			     struct ib_ucontext *ib_context,
 			     struct ib_udata *udata)
 {
Index: linux-2.6/drivers/infiniband/hw/ehca/ehca_cq.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_cq.c
+++ linux-2.6/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -113,7 +113,7 @@ struct ehca_qp* ehca_cq_get_qp(struct eh
 	return ret;
 }
 
-struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe,
+struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 			     struct ib_ucontext *context,
 			     struct ib_udata *udata)
 {
Index: linux-2.6/drivers/infiniband/hw/ehca/ehca_iverbs.h
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ linux-2.6/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -123,7 +123,7 @@ int ehca_destroy_eq(struct ehca_shca *sh
 void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq);
 
 
-struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe,
+struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 			     struct ib_ucontext *context,
 			     struct ib_udata *udata);
 
Index: linux-2.6/drivers/infiniband/hw/ehca/ehca_main.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_main.c
+++ linux-2.6/drivers/infiniband/hw/ehca/ehca_main.c
@@ -375,7 +375,7 @@ static int ehca_create_aqp1(struct ehca_
 		return -EPERM;
 	}
 
-	ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10);
+	ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10, 0);
 	if (IS_ERR(ibcq)) {
 		ehca_err(&shca->ib_device, "Cannot create AQP1 CQ.");
 		return PTR_ERR(ibcq);
Index: linux-2.6/drivers/infiniband/hw/ipath/ipath_cq.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ipath/ipath_cq.c
+++ linux-2.6/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -170,7 +170,7 @@ static void send_complete(unsigned long 
  *
  * Called by ib_create_cq() in the generic verbs code.
  */
-struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries,
+struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector,
 			      struct ib_ucontext *context,
 			      struct ib_udata *udata)
 {
Index: linux-2.6/drivers/infiniband/hw/ipath/ipath_verbs.h
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ipath/ipath_verbs.h
+++ linux-2.6/drivers/infiniband/hw/ipath/ipath_verbs.h
@@ -710,7 +710,7 @@ void ipath_cq_enter(struct ipath_cq *cq,
 
 int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry);
 
-struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries,
+struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector,
 			      struct ib_ucontext *context,
 			      struct ib_udata *udata);
 
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -663,6 +663,7 @@ static int mthca_destroy_qp(struct ib_qp
 }
 
 static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries,
+				     int comp_vector,
 				     struct ib_ucontext *context,
 				     struct ib_udata *udata)
 {
Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -793,7 +793,7 @@ static int ipoib_cm_tx_init(struct ipoib
 	}
 
 	p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p,
-			     ipoib_sendq_size + 1);
+			     ipoib_sendq_size + 1, 0);
 	if (IS_ERR(p->cq)) {
 		ret = PTR_ERR(p->cq);
 		ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret);
Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -187,7 +187,7 @@ int ipoib_transport_dev_init(struct net_
 	if (!ret)
 		size += ipoib_recvq_size;
 
-	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size);
+	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
 	if (IS_ERR(priv->cq)) {
 		printk(KERN_WARNING "%s: failed to create CQ\n", ca->name);
 		goto out_free_mr;
Index: linux-2.6/drivers/infiniband/ulp/iser/iser_verbs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
+++ linux-2.6/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -76,7 +76,7 @@ static int iser_create_device_ib_res(str
 				  iser_cq_callback,
 				  iser_cq_event_callback,
 				  (void *)device,
-				  ISER_MAX_CQ_LEN);
+				  ISER_MAX_CQ_LEN, 0);
 	if (IS_ERR(device->cq))
 		goto cq_err;
 
Index: linux-2.6/drivers/infiniband/ulp/srp/ib_srp.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/srp/ib_srp.c
+++ linux-2.6/drivers/infiniband/ulp/srp/ib_srp.c
@@ -197,7 +197,7 @@ static int srp_create_target_ib(struct s
 		return -ENOMEM;
 
 	target->cq = ib_create_cq(target->srp_host->dev->dev, srp_completion,
-				  NULL, target, SRP_CQ_SIZE);
+				  NULL, target, SRP_CQ_SIZE, 0);
 	if (IS_ERR(target->cq)) {
 		ret = PTR_ERR(target->cq);
 		goto out;
Index: linux-2.6/include/rdma/ib_verbs.h
===================================================================
--- linux-2.6.orig/include/rdma/ib_verbs.h
+++ linux-2.6/include/rdma/ib_verbs.h
@@ -912,6 +912,8 @@ struct ib_device {
 
 	u32                           flags;
 
+	int                           num_comp_vectors;
+
 	struct iw_cm_verbs	     *iwcm;
 
 	int		           (*query_device)(struct ib_device *device,
@@ -978,6 +980,7 @@ struct ib_device {
 						struct ib_recv_wr *recv_wr,
 						struct ib_recv_wr **bad_recv_wr);
 	struct ib_cq *             (*create_cq)(struct ib_device *device, int cqe,
+						int comp_vector,
 						struct ib_ucontext *context,
 						struct ib_udata *udata);
 	int                        (*destroy_cq)(struct ib_cq *cq);
@@ -1358,13 +1361,15 @@ static inline int ib_post_recv(struct ib
  * @cq_context: Context associated with the CQ returned to the user via
  *   the associated completion and event handlers.
  * @cqe: The minimum size of the CQ.
+ * @comp_vector: Completion vector used to signal completion events.
+ *     Must be >= 0 and < device->num_comp_vectors.
  *
  * Users can examine the cq structure to determine the actual CQ size.
  */
 struct ib_cq *ib_create_cq(struct ib_device *device,
 			   ib_comp_handler comp_handler,
 			   void (*event_handler)(struct ib_event *, void *),
-			   void *cq_context, int cqe);
+			   void *cq_context, int cqe, int comp_vector);
 
 /**
  * ib_resize_cq - Modifies the capacity of the CQ.

-- 
MST


From mst at dev.mellanox.co.il  Thu May  3 03:49:24 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 13:49:24 +0300
Subject: [ofa-general] [PATCH 2 of 3] IB/mthca: support multiple completion
	vectors
Message-ID: <20070503104924.GE10009@mellanox.co.il>

Support 2 completion vectors in mthca on SMP if MSI-X is enabled

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

I don't know how many vectors make sense, so I decided to be
conservative here, since each EQ consumes a lot of memory by default.
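
As a sanity check on the EQ indexing, here is a small userspace sketch of
the arithmetic used in this patch. The DEMO_* names are hypothetical copies
of the mthca constants, and the sketch assumes MTHCA_EQ_CMD is the first
enumerator (value 0), which the hunk context does not show.

```c
/* Copies of the enum from mthca_dev.h; assumes the CMD entry is 0. */
enum {
	DEMO_EQ_CMD,
	DEMO_EQ_ASYNC,
	DEMO_EQ_COMP,
	DEMO_NUM_EQ,
	DEMO_COMP_VECTORS = 2,
	DEMO_MAX_EQS = DEMO_NUM_EQ + DEMO_COMP_VECTORS - 1
};

/* Same arithmetic as mthca_num_eq(): number of EQs actually in use. */
static int demo_num_eq(int num_comp_vectors)
{
	return num_comp_vectors + DEMO_NUM_EQ - 1;
}

/* Same arithmetic as the cq->eq assignment in mthca_init_cq(). */
static int demo_cq_eq_index(int comp_vector)
{
	return DEMO_EQ_COMP + comp_vector;
}
```

The scheme depends on the completion EQ being the last named entry, so that
the extra completion EQs occupy the slots directly after it in eq[].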

Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_cq.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_cq.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -779,7 +779,7 @@ int mthca_arbel_arm_cq(struct ib_cq *ibc
 	return 0;
 }
 
-int mthca_init_cq(struct mthca_dev *dev, int nent,
+int mthca_init_cq(struct mthca_dev *dev, int nent, int comp_vector,
 		  struct mthca_ucontext *ctx, u32 pdn,
 		  struct mthca_cq *cq)
 {
@@ -790,6 +790,7 @@ int mthca_init_cq(struct mthca_dev *dev,
 
 	cq->ibcq.cqe  = nent - 1;
 	cq->is_kernel = !ctx;
+	cq->eq = MTHCA_EQ_COMP + comp_vector;
 
 	cq->cqn = mthca_alloc(&dev->cq_table.alloc);
 	if (cq->cqn == -1)
@@ -844,7 +845,7 @@ int mthca_init_cq(struct mthca_dev *dev,
 	else
 		cq_context->logsize_usrpage |= cpu_to_be32(dev->driver_uar.index);
 	cq_context->error_eqn       = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn);
-	cq_context->comp_eqn        = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn);
+	cq_context->comp_eqn        = cpu_to_be32(dev->eq_table.eq[cq->eq].eqn);
 	cq_context->pd              = cpu_to_be32(pdn);
 	cq_context->lkey            = cpu_to_be32(cq->buf.mr.ibmr.lkey);
 	cq_context->cqn             = cpu_to_be32(cq->cqn);
@@ -954,7 +955,7 @@ void mthca_free_cq(struct mthca_dev *dev
 	spin_unlock_irq(&dev->cq_table.lock);
 
 	if (dev->mthca_flags & MTHCA_FLAG_MSI_X)
-		synchronize_irq(dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector);
+		synchronize_irq(dev->eq_table.eq[cq->eq].msi_x_vector);
 	else
 		synchronize_irq(dev->pdev->irq);
 
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_dev.h
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -96,7 +96,9 @@ enum {
 	MTHCA_EQ_CMD,
 	MTHCA_EQ_ASYNC,
 	MTHCA_EQ_COMP,
-	MTHCA_NUM_EQ
+	MTHCA_NUM_EQ,
+	MTHCA_COMP_VECTORS = 2,
+	MTHCA_MAX_EQS = MTHCA_NUM_EQ + MTHCA_COMP_VECTORS - 1
 };
 
 enum {
@@ -230,7 +232,7 @@ struct mthca_eq_table {
 	void __iomem      *clr_int;
 	u32                clr_mask;
 	u32                arm_mask;
-	struct mthca_eq    eq[MTHCA_NUM_EQ];
+	struct mthca_eq    eq[MTHCA_MAX_EQS];
 	u64                icm_virt;
 	struct page       *icm_page;
 	dma_addr_t         icm_dma;
@@ -497,7 +499,7 @@ int mthca_poll_cq(struct ib_cq *ibcq, in
 		  struct ib_wc *entry);
 int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
 int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
-int mthca_init_cq(struct mthca_dev *dev, int nent,
+int mthca_init_cq(struct mthca_dev *dev, int nent, int comp_vector,
 		  struct mthca_ucontext *ctx, u32 pdn,
 		  struct mthca_cq *cq);
 void mthca_free_cq(struct mthca_dev *dev,
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_eq.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_eq.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_eq.c
@@ -161,6 +161,11 @@ struct mthca_eqe {
 	u8 owner;
 } __attribute__((packed));
 
+static inline int mthca_num_eq(struct mthca_dev *dev)
+{
+	return dev->ib_dev.num_comp_vectors + MTHCA_NUM_EQ - 1;
+}
+
 #define  MTHCA_EQ_ENTRY_OWNER_SW      (0 << 7)
 #define  MTHCA_EQ_ENTRY_OWNER_HW      (1 << 7)
 
@@ -579,8 +584,7 @@ static int mthca_create_eq(struct mthca_
 
 	dev->eq_table.arm_mask |= eq->eqn_mask;
 
-	mthca_dbg(dev, "Allocated EQ %d with %d entries\n",
-		  eq->eqn, eq->nent);
+	mthca_dbg(dev, "Allocated EQ %d with %d entries\n", eq->eqn, eq->nent);
 
 	return err;
 
@@ -657,7 +661,7 @@ static void mthca_free_irqs(struct mthca
 
 	if (dev->eq_table.have_irq)
 		free_irq(dev->pdev->irq, dev);
-	for (i = 0; i < MTHCA_NUM_EQ; ++i)
+	for (i = 0; i < mthca_num_eq(dev); ++i)
 		if (dev->eq_table.eq[i].have_irq)
 			free_irq(dev->eq_table.eq[i].msi_x_vector,
 				 dev->eq_table.eq + i);
@@ -824,12 +828,37 @@ void mthca_unmap_eq_icm(struct mthca_dev
 	__free_page(dev->eq_table.icm_page);
 }
 
+static inline const char *eq_name(int i)
+{
+	switch (i) {
+	case MTHCA_EQ_ASYNC:
+		return DRV_NAME " (async)";
+	case MTHCA_EQ_CMD:
+		return DRV_NAME " (cmd)";
+	default:
+		return DRV_NAME " (comp)";
+	}
+}
+
+static inline int eq_size(struct mthca_dev *dev, int i)
+{
+	switch (i) {
+	case MTHCA_EQ_ASYNC:
+		return MTHCA_NUM_ASYNC_EQE;
+	case MTHCA_EQ_CMD:
+		return MTHCA_NUM_CMD_EQE;
+	default:
+		return dev->limits.num_cqs;
+	}
+}
+
+
 int mthca_init_eq_table(struct mthca_dev *dev)
 {
 	int err;
 	u8 status;
 	u8 intr;
-	int i;
+	int i, eqn;
 
 	err = mthca_alloc_init(&dev->eq_table.alloc,
 			       dev->limits.num_eqs,
@@ -857,39 +886,29 @@ int mthca_init_eq_table(struct mthca_dev
 	intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ?
 		128 : dev->eq_table.inta_pin;
 
-	err = mthca_create_eq(dev, dev->limits.num_cqs + MTHCA_NUM_SPARE_EQE,
-			      (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr,
-			      &dev->eq_table.eq[MTHCA_EQ_COMP]);
-	if (err)
-		goto err_out_unmap;
-
-	err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE + MTHCA_NUM_SPARE_EQE,
-			      (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr,
-			      &dev->eq_table.eq[MTHCA_EQ_ASYNC]);
-	if (err)
-		goto err_out_comp;
-
-	err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE + MTHCA_NUM_SPARE_EQE,
-			      (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 130 : intr,
-			      &dev->eq_table.eq[MTHCA_EQ_CMD]);
-	if (err)
-		goto err_out_async;
+	for (eqn = 0; eqn < mthca_num_eq(dev); ++eqn) {
+		err = mthca_create_eq(dev, eq_size(dev, eqn) + MTHCA_NUM_SPARE_EQE,
+				      (dev->mthca_flags & MTHCA_FLAG_MSI_X) ?
+				      128 + eqn : intr,
+				      &dev->eq_table.eq[eqn]);
+		if (err)
+			goto err_out_eq;
+	}
 
 	if (dev->mthca_flags & MTHCA_FLAG_MSI_X) {
-		static const char *eq_name[] = {
-			[MTHCA_EQ_COMP]  = DRV_NAME " (comp)",
-			[MTHCA_EQ_ASYNC] = DRV_NAME " (async)",
-			[MTHCA_EQ_CMD]   = DRV_NAME " (cmd)"
-		};
-
-		for (i = 0; i < MTHCA_NUM_EQ; ++i) {
+		for (i = 0; i < mthca_num_eq(dev); ++i) {
 			err = request_irq(dev->eq_table.eq[i].msi_x_vector,
 					  mthca_is_memfree(dev) ?
 					  mthca_arbel_msi_x_interrupt :
 					  mthca_tavor_msi_x_interrupt,
-					  0, eq_name[i], dev->eq_table.eq + i);
-			if (err)
-				goto err_out_cmd;
+					  0, eq_name(i), dev->eq_table.eq + i);
+			if (err) {
+				mthca_err(dev, "Failed to request IRQ %d for EQ %d (%d),"
+					 " aborting.\n",
+					 dev->eq_table.eq[i].msi_x_vector,
+					 dev->eq_table.eq[i].eqn, i);
+				goto err_out_irq;
+			}
 			dev->eq_table.eq[i].have_irq = 1;
 		}
 	} else {
@@ -898,8 +917,11 @@ int mthca_init_eq_table(struct mthca_dev
 				  mthca_arbel_interrupt :
 				  mthca_tavor_interrupt,
 				  IRQF_SHARED, DRV_NAME, dev);
-		if (err)
-			goto err_out_cmd;
+		if (err) {
+			mthca_err(dev, "Failed to request IRQ %d, aborting.\n",
+				  dev->pdev->irq);
+			goto err_out_eq;
+		}
 		dev->eq_table.have_irq = 1;
 	}
 
@@ -921,7 +943,7 @@ int mthca_init_eq_table(struct mthca_dev
 		mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n",
 			   dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status);
 
-	for (i = 0; i < MTHCA_NUM_EQ; ++i)
+	for (i = 0; i < mthca_num_eq(dev); ++i)
 		if (mthca_is_memfree(dev))
 			arbel_eq_req_not(dev, dev->eq_table.eq[i].eqn_mask);
 		else
@@ -929,17 +951,13 @@ int mthca_init_eq_table(struct mthca_dev
 
 	return 0;
 
-err_out_cmd:
+err_out_irq:
 	mthca_free_irqs(dev);
-	mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]);
 
-err_out_async:
-	mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]);
-
-err_out_comp:
-	mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]);
+err_out_eq:
+	for (i = 0; i < eqn; ++i)
+		mthca_free_eq(dev, &dev->eq_table.eq[i]);
 
-err_out_unmap:
 	mthca_unmap_eq_regs(dev);
 
 err_out_free:
@@ -959,7 +977,7 @@ void mthca_cleanup_eq_table(struct mthca
 	mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK,
 		     1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status);
 
-	for (i = 0; i < MTHCA_NUM_EQ; ++i)
+	for (i = 0; i < mthca_num_eq(dev); ++i)
 		mthca_free_eq(dev, &dev->eq_table.eq[i]);
 
 	mthca_unmap_eq_regs(dev);
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_main.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_main.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_main.c
@@ -39,6 +39,7 @@
 #include <linux/errno.h>
 #include <linux/pci.h>
 #include <linux/interrupt.h>
+#include <linux/cpumask.h>
 
 #include "mthca_dev.h"
 #include "mthca_config_reg.h"
@@ -976,24 +977,35 @@ static void mthca_release_regions(struct
 
 static int mthca_enable_msi_x(struct mthca_dev *mdev)
 {
-	struct msix_entry entries[3];
-	int err;
+	struct msix_entry entries[MTHCA_MAX_EQS];
+	int i, err, num;
 
-	entries[0].entry = 0;
-	entries[1].entry = 1;
-	entries[2].entry = 2;
-
-	err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries));
-	if (err) {
-		if (err > 0)
-			mthca_info(mdev, "Only %d MSI-X vectors available, "
-				   "not using MSI-X\n", err);
-		return err;
+	num = min(mdev->limits.num_eqs - mdev->limits.reserved_eqs, MTHCA_MAX_EQS);
+	num = min(num_possible_cpus(), num);
+
+	for (i = 0; i < num; ++i)
+		entries[i].entry = i;
+
+	for (;;) {
+		err = pci_enable_msix(mdev->pdev, entries, num);
+		if (!err)
+			break;
+		else if (err < 0)
+			return err;
+
+		if (err < MTHCA_NUM_EQ) {
+			mthca_info(mdev, "Only %d MSI-X vectors available. "
+				   "Not using MSI-X\n", err);
+			pci_disable_msix(mdev->pdev);
+			return err;
+		}
+
+		num = err;
 	}
 
-	mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector;
-	mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector;
-	mdev->eq_table.eq[MTHCA_EQ_CMD  ].msi_x_vector = entries[2].vector;
+	mdev->ib_dev.num_comp_vectors = num - MTHCA_NUM_EQ + 1;
+	for (i = 0; i < num; ++i)
+		mdev->eq_table.eq[i].msi_x_vector = entries[i].vector;
 
 	return 0;
 }
@@ -1115,12 +1127,6 @@ static int __mthca_init_one(struct pci_d
 		goto err_free_dev;
 	}
 
-	if (msi_x && !mthca_enable_msi_x(mdev))
-		mdev->mthca_flags |= MTHCA_FLAG_MSI_X;
-	if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) &&
-	    !pci_enable_msi(pdev))
-		mdev->mthca_flags |= MTHCA_FLAG_MSI;
-
 	if (mthca_cmd_init(mdev)) {
 		mthca_err(mdev, "Failed to init command interface, aborting.\n");
 		goto err_free_dev;
@@ -1144,6 +1150,12 @@ static int __mthca_init_one(struct pci_d
 		mthca_warn(mdev, "If you have problems, try updating your HCA FW.\n");
 	}
 
+	if (msi_x && !mthca_enable_msi_x(mdev))
+		mdev->mthca_flags |= MTHCA_FLAG_MSI_X;
+	if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) &&
+	    !pci_enable_msi(pdev))
+		mdev->mthca_flags |= MTHCA_FLAG_MSI;
+
 	err = mthca_setup_hca(mdev);
 	if (err)
 		goto err_close;
@@ -1180,17 +1192,17 @@ err_cleanup:
 	mthca_cleanup_uar_table(mdev);
 
 err_close:
+	if (mdev->mthca_flags & MTHCA_FLAG_MSI_X)
+		pci_disable_msix(pdev);
+	if (mdev->mthca_flags & MTHCA_FLAG_MSI)
+		pci_disable_msi(pdev);
+
 	mthca_close_hca(mdev);
 
 err_cmd:
 	mthca_cmd_cleanup(mdev);
 
 err_free_dev:
-	if (mdev->mthca_flags & MTHCA_FLAG_MSI_X)
-		pci_disable_msix(pdev);
-	if (mdev->mthca_flags & MTHCA_FLAG_MSI)
-		pci_disable_msi(pdev);
-
 	ib_dealloc_device(&mdev->ib_dev);
 
 err_free_res:
@@ -1231,14 +1243,15 @@ static void __mthca_remove_one(struct pc
 		iounmap(mdev->kar);
 		mthca_uar_free(mdev, &mdev->driver_uar);
 		mthca_cleanup_uar_table(mdev);
-		mthca_close_hca(mdev);
-		mthca_cmd_cleanup(mdev);
 
 		if (mdev->mthca_flags & MTHCA_FLAG_MSI_X)
 			pci_disable_msix(pdev);
 		if (mdev->mthca_flags & MTHCA_FLAG_MSI)
 			pci_disable_msi(pdev);
 
+		mthca_close_hca(mdev);
+		mthca_cmd_cleanup(mdev);
+
 		ib_dealloc_device(&mdev->ib_dev);
 		mthca_release_regions(pdev, mdev->mthca_flags &
 				      MTHCA_FLAG_DDR_HIDDEN);
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -707,7 +707,7 @@ static struct ib_cq *mthca_create_cq(str
 	for (nent = 1; nent <= entries; nent <<= 1)
 		; /* nothing */
 
-	err = mthca_init_cq(to_mdev(ibdev), nent,
+	err = mthca_init_cq(to_mdev(ibdev), nent, comp_vector,
 			    context ? to_mucontext(context) : NULL,
 			    context ? ucmd.pdn : to_mdev(ibdev)->driver_pd.pd_num,
 			    cq);
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.h
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.h
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.h
@@ -202,6 +202,7 @@ struct mthca_cq {
 	spinlock_t		lock;
 	int			refcount;
 	int			cqn;
+	int			eq;
 	u32			cons_index;
 	struct mthca_cq_buf	buf;
 	struct mthca_cq_resize *resize_buf;
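
For readers following the MSI-X changes to mthca_enable_msi_x() in this patch, the retry loop can be modelled in plain userspace C. This is only a sketch under stated assumptions: fake_enable_msix() is a hypothetical stand-in that mimics the return convention the patch relies on (0 on success, a negative value on failure, or a positive count of vectors that could be allocated), negotiate_comp_vectors() is an illustrative name, and the constants mirror the enum added to mthca_dev.h.

```c
#include <assert.h>

/* Mirrors the enum in mthca_dev.h after the patch (names shortened). */
enum {
	EQ_CMD,
	EQ_ASYNC,
	EQ_COMP,
	NUM_EQ,                               /* 3 */
	COMP_VECTORS = 2,
	MAX_EQS = NUM_EQ + COMP_VECTORS - 1   /* 4 */
};

/*
 * Hypothetical stand-in for the MSI-X enable call: 0 on success,
 * negative on hard failure, otherwise the number of vectors that
 * are actually available when the request was too large.
 */
static int fake_enable_msix(int requested, int available)
{
	if (available <= 0)
		return -1;
	return requested <= available ? 0 : available;
}

/*
 * Returns the number of completion vectors obtained, or -1 when
 * fewer than NUM_EQ vectors are available and MSI-X should not be
 * used.  The sketch assumes the caller asks for at least NUM_EQ.
 */
int negotiate_comp_vectors(int want, int available)
{
	int num = want < MAX_EQS ? want : MAX_EQS;
	int err;

	assert(want >= NUM_EQ);

	for (;;) {
		err = fake_enable_msix(num, available);
		if (!err)
			break;
		if (err < NUM_EQ)
			return -1;   /* hard failure or too few vectors */
		num = err;           /* retry with what is available */
	}

	/* One vector each for cmd and async; the rest are completion EQs. */
	return num - NUM_EQ + 1;
}
```

With four vectors available the sketch yields two completion vectors, with three it yields one, and with two it gives up on MSI-X. The assertion on want matters: if the requested count could drop below NUM_EQ (for example when it is capped by the CPU count), the final subtraction would produce zero or a negative number of completion vectors.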

-- 
MST


From mst at dev.mellanox.co.il  Thu May  3 03:49:55 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 13:49:55 +0300
Subject: [ofa-general] [PATCH 3 of 3] ipoib/cm: separate comp vectors to
	RX/TX
Message-ID: <20070503104955.GF10009@mellanox.co.il>

Enhance IPoIB to use multiple completion vectors when available.
On mthca, this increases netperf bandwidth by about 5% with the
same or lower service demand.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -793,7 +793,7 @@ static int ipoib_cm_tx_init(struct ipoib
 	}
 
 	p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p,
-			     ipoib_sendq_size + 1, 0);
+			     ipoib_sendq_size + 1, priv->ca->num_comp_vectors > 1);
 	if (IS_ERR(p->cq)) {
 		ret = PTR_ERR(p->cq);
 		ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret);
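
The last argument to ib_create_cq() in the hunk above is the completion vector index. The expression priv->ca->num_comp_vectors > 1 evaluates to 1 when a second vector exists and to 0 otherwise, so TX completions move to vector 1 while RX presumably stays on vector 0. A minimal userspace sketch of that selection follows; pick_tx_vector() and RX_VECTOR are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Assumed: RX completions are left on the first vector. */
enum { RX_VECTOR = 0 };

/*
 * Pick the completion vector for the TX CQ: vector 1 if the device
 * exposes more than one completion vector, otherwise share vector 0
 * with RX.  A C comparison yields exactly 0 or 1, which is the value
 * the patch passes straight through as the vector index.
 */
int pick_tx_vector(int num_comp_vectors)
{
	assert(num_comp_vectors >= 0);
	return num_comp_vectors > 1;
}
```

The appeal of this form is that it needs no branch and degrades to the old behaviour (everything on vector 0) on devices that report a single vector.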

-- 
MST


From mst at dev.mellanox.co.il  Thu May  3 05:22:15 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 15:22:15 +0300
Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache
In-Reply-To: <4639D16F.4060807@voltaire.com>
References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com>
	<20070502171829.GO22292@mellanox.co.il>
	<4639D16F.4060807@voltaire.com>
Message-ID: <20070503122215.GA9719@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: [PATCH 2/3] remove ib pkey gid and lmc cache
> 
> Michael S. Tsirkin wrote:
> >>+ * ib_query_lmc() returns the LID mask control associated
> >>+ * with port @port_num
> >>+ */
> >>+int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc);
> >>+
> > 
> > 
> > I don't think we need this one in ib_verbs.h - it just does query_port once.
> > Let's keep the API simple. The only user is in mad.c - move
> > it there and make it static.
> > 
> > 
> 
> why keep ib_query_lmc anyway if we won't use it?

Actually, I think I see a problem with changing
ib_get_cached_lmc -> ib_query_lmc: it is called on the data path in mad.c.
Calling ib_query_port there will slow down MAD processing significantly,
because it's hard for the driver to cache all of the portinfo state
(e.g. how do you cache phys_state?).

But mad.c is actually seeing all MADs, too, so maybe the right thing
is to cache lmc directly there.
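
One way to picture caching the LMC where the MADs are already being seen, as a userspace sketch only. The struct and function names below are hypothetical and are not part of mad.c; the sketch assumes the MAD layer can observe each PortInfo update for a port and is handed the LMC value from it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-port cache kept by the MAD layer. */
struct lmc_cache {
	uint8_t lmc;    /* LID mask control, 0..7 */
	int     valid;  /* set once a PortInfo update has been seen */
};

/* Called when a PortInfo update for this port passes through. */
void lmc_cache_update(struct lmc_cache *c, uint8_t lmc)
{
	assert(c);
	c->lmc = lmc & 0x7;   /* LMC is a 3-bit field */
	c->valid = 1;
}

/*
 * Data-path check: does dlid address this port?  With LMC = n the
 * port answers to 2^n consecutive LIDs starting at base_lid, so the
 * low n bits of the destination LID are ignored.  Returns -1 when
 * nothing has been cached yet and the caller has to fall back to a
 * slow query.
 */
int lmc_cache_lid_matches(const struct lmc_cache *c,
			  uint16_t base_lid, uint16_t dlid)
{
	uint16_t mask;

	assert(c);
	if (!c->valid)
		return -1;

	mask = (uint16_t)~((1u << c->lmc) - 1);
	return (dlid & mask) == (base_lid & mask);
}
```

The point of the design is that the data path only reads one byte from memory; the cost of learning the LMC is paid when the PortInfo update goes by, not per MAD.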

-- 
MST


From yosefe at voltaire.com  Thu May  3 05:28:52 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 03 May 2007 15:28:52 +0300
Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache
In-Reply-To: <20070503122215.GA9719@mellanox.co.il>
References: <4638B432.3060801@voltaire.com>
	<4638B4D5.7050709@voltaire.com>	<20070502171829.GO22292@mellanox.co.il>	<4639D16F.4060807@voltaire.com>
	<20070503122215.GA9719@mellanox.co.il>
Message-ID: <4639D584.3010706@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>Subject: Re: [PATCH 2/3] remove ib pkey gid and lmc cache
>>
>>Michael S. Tsirkin wrote:
>>
>>>>+ * ib_query_lmc() returns the LID mask control associated
>>>>+ * with port @port_num
>>>>+ */
>>>>+int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc);
>>>>+
>>>
>>>
>>>I don't think we need this one in ib_verbs.h - it just does query_port once.
>>>Let's keep the API simple. The only user is in mad.c - move
>>>it there and make it static.
>>>
>>>
>>
>>why keep ib_query_lmc anyway if we won't use it?
> 
> 
> Actually, I think I see a problem with changing
> ib_get_cached_lmc -> ib_query_lmc: it is called on the data path in mad.c.
> Calling ib_query_port there will slow down MAD processing significantly,
> because it's hard for the driver to cache all of the portinfo state
> (e.g. how do you cache phys_state?).
> 
> But mad.c is actually seeing all MADs, too, so maybe the right thing
> is to cache lmc directly there.
> 
So how about having mad.c use the cache for now (like mthca), and
giving it an LMC cache of its own once the cache is actually removed?


From mst at dev.mellanox.co.il  Thu May  3 05:49:56 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 15:49:56 +0300
Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache
In-Reply-To: <4639D584.3010706@voltaire.com>
References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com>
	<20070502171829.GO22292@mellanox.co.il>
	<4639D16F.4060807@voltaire.com>
	<20070503122215.GA9719@mellanox.co.il>
	<4639D584.3010706@voltaire.com>
Message-ID: <20070503124956.GB9719@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: [PATCH 2/3] remove ib pkey gid and lmc cache
> 
> Michael S. Tsirkin wrote:
> >>Quoting Yosef Etigin <yosefe at voltaire.com>:
> >>Subject: Re: [PATCH 2/3] remove ib pkey gid and lmc cache
> >>
> >>Michael S. Tsirkin wrote:
> >>
> >>>>+ * ib_query_lmc() returns the LID mask control associated
> >>>>+ * with port @port_num
> >>>>+ */
> >>>>+int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc);
> >>>>+
> >>>
> >>>
> >>>I don't think we need this one in ib_verbs.h - it just does query_port once.
> >>>Let's keep the API simple. The only user is in mad.c - move
> >>>it there and make it static.
> >>>
> >>>
> >>
> >>why keep ib_query_lmc anyway if we won't use it?
> > 
> > 
> > Actually, I think I see a problem with changing
> > ib_get_cached_lmc -> ib_query_lmc: it is called on the data path in mad.c.
> > Calling ib_query_port there will slow down MAD processing significantly,
> > because it's hard for the driver to cache all of the portinfo state
> > (e.g. how do you cache phys_state?).
> > 
> > But mad.c is actually seeing all MADs, too, so maybe the right thing
> > is to cache lmc directly there.
> > 
> So how about mad.c will use the cache for now (like mthca),
> and will have a lmc cache of its own after the cache is really removed?

OK, I guess. But I wonder whether other changes might affect e.g. connection
setup for other pieces, too.

We started by discussing a race condition with IPoIB.
So again, my suggestion for now (from the archives) would be:

>  > - a patch to add ib_find_pkey() and ib_find_gid() to core
>  > - a patch to replace cache usage in IPoIB with uncached
>  >   hardware accesses on top of this
>  > - ipoib pkey change handling patch on top of these

Here's a link to the discussion we had in April:
http://www.mail-archive.com/general at lists.openfabrics.org/msg01613.html
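
The first item in that list, a lookup helper built on uncached queries, might look roughly like the following userspace sketch. It is not the proposed ib_find_pkey() itself: query_pkey_fn, find_pkey() and array_query() are hypothetical stand-ins for a per-index hardware query, and comparing only the low 15 bits (ignoring the membership bit) is an assumption about the desired matching rule.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-index query, standing in for an uncached device query. */
typedef int (*query_pkey_fn)(void *ctx, uint16_t index, uint16_t *pkey);

/*
 * Linear search of the P_Key table.  Returns 0 and stores the index
 * on success, -1 if the key is absent or a query fails.
 */
int find_pkey(query_pkey_fn query, void *ctx, uint16_t tbl_len,
	      uint16_t pkey, uint16_t *index)
{
	uint16_t i, tmp;

	assert(query && index);

	for (i = 0; i < tbl_len; ++i) {
		if (query(ctx, i, &tmp))
			return -1;
		/* Assumed rule: bit 15 is the membership bit, so mask it off. */
		if ((tmp & 0x7fff) == (pkey & 0x7fff)) {
			*index = i;
			return 0;
		}
	}
	return -1;
}

/* Example backend for the sketch: the "hardware" table is just an array. */
int array_query(void *ctx, uint16_t index, uint16_t *pkey)
{
	*pkey = ((const uint16_t *)ctx)[index];
	return 0;
}
```

Because every call walks the table with one query per entry, a helper of this shape suits control-path users such as P_Key change handling rather than per-packet lookups.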

-- 
MST


From dotanb at dev.mellanox.co.il  Thu May  3 05:57:03 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Thu, 03 May 2007 15:57:03 +0300
Subject: [ofa-general] man pages for the rdma-cm
In-Reply-To: <1178127046.18609.107.camel@stevo-desktop>
References: <1178127046.18609.107.camel@stevo-desktop>
Message-ID: <4639DC1F.2030905@dev.mellanox.co.il>

Steve Wise wrote:
> Sean, 
>
> Are there man pages for the rdma-cm in the pipeline?  I think it would
> be great (requirement?) to have these for ofed-1.2 since we do have the
> other verbs man pages.  
>   
Man pages are very important to new users (this is why I wrote the man 
pages for libibverbs).

I believe that every library with an API that is installed by the OFED 
package should have man pages.
I think that this should be one of the goals for OFED 1.3.


thanks
Dotan


From vlad at dev.mellanox.co.il  Thu May  3 06:07:06 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Thu, 03 May 2007 16:07:06 +0300
Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
In-Reply-To: <D6A583C768392A4D8B297C500CDD54B5015736A7@mse11be1.mse11.exchange.ms>
References: <D6A583C768392A4D8B297C500CDD54B50157369E@mse11be1.mse11.exchange.ms>
	<1178114728.14131.30.camel@vladsk-laptop>
	<D6A583C768392A4D8B297C500CDD54B5015736A3@mse11be1.mse11.exchange.ms>
	<D6A583C768392A4D8B297C500CDD54B5015736A5@mse11be1.mse11.exchange.ms>
	<D6A583C768392A4D8B297C500CDD54B5015736A7@mse11be1.mse11.exchange.ms>
Message-ID: <1178197626.6580.38.camel@vladsk-laptop>

Please see if this happens in OFED-1.2-20070503-0600.
But first uninstall the previous OFED version with the ofed_uninstall.sh
command.

Thanks,

Regards,
Vladimir

On Wed, 2007-05-02 at 11:30 -0400, Steffen Persvold wrote:
> Hmm,
>  
> so I tried something. I put :
>  
> build_32bit=0
> 
> into my ofed.conf file and rebuilt (build.sh -c ofed.conf). This time
> it built 64bit libraries, but it puts them in the wrong directory :
>  
> # rpm -qpl ../libibverbs-1.1-0.x86_64.rpm
> /etc/ld.so.conf.d/ofed.conf
> /usr/lib/libibverbs.so.1
> /usr/lib/libibverbs.so.1.0.0
> 
> # file /usr/lib/libibverbs.so.1.0.0
> /usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD
> x86-64, version 1 (SYSV), not stripped
> 
> So what's up ??
>  
> Cheers,
> Steffen Persvold
> Technical Director Americas
> tel. 508-281-7100 x401
> fax. 508-281-7171
> 
> http://www.scali.com/
> Scaling the Linux datacenter
> 
> 
> ______________________________________________________________________
> From: Steffen Persvold
> Sent: Wed 5/2/2007 10:30 AM
> To: Steffen Persvold; Vladimir Sokolovsky
> Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> 
> 
> Also,
>  
> If I look at the /etc/ld.so.conf.d/ofed.conf file I have :
>  
> # cat ofed.conf
> /usr/lib
> /usr/lib
> 
>  
> which seems kinda weird ? :)
>  
> Cheers,
>  
> Steffen Persvold
> Technical Director Americas
> tel. 508-281-7100 x401
> fax. 508-281-7171
> 
> http://www.scali.com/
> Scaling the Linux datacenter
> 
> 
> ______________________________________________________________________
> From: ewg-bounces at lists.openfabrics.org on behalf of Steffen Persvold
> Sent: Wed 5/2/2007 10:20 AM
> To: Vladimir Sokolovsky
> Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> 
> 
> Nope :
>  
>  
> [redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
> /etc/ld.so.conf.d/ofed.conf
> /usr/lib/libibverbs.so.1
> /usr/lib/libibverbs.so.1.0.0
> [redhat-release-4ES-5.5]#
> 
> So the RPM got built, but without 64bit libraries. Now if it was the
> other way around (i.e no 32bit libraries) I could have understood it
> (as 32bit is an option on x86_64), but not having the native 64bit
> libraries is not so easy to understand :)
>  
> cheers,
> Steffen Persvold
> Technical Director Americas
> tel. 508-281-7100 x401
> fax. 508-281-7171
> 
> http://www.scali.com/
> Scaling the Linux datacenter
> 
> 
> ______________________________________________________________________
> From: ewg-bounces at lists.openfabrics.org on behalf of Vladimir
> Sokolovsky
> Sent: Wed 5/2/2007 10:05 AM
> To: Steffen Persvold
> Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> Subject: Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> 
> 
> Don't you have /usr/lib64/libibverbs.so.1.0.0?
> 
> Regards,
> Vladimir
> 
> On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote:
> > Folks,
> > 
> > I used the build.sh script to build the above mentioned packages on
> > rhel4u4 x86_64, but for some reason it only compiles 32bit libraries
> > (even if the packages are named x86_64) :
> > 
> > # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
> > x86_64
> > 
> > (after installing it) :
> > 
> > # file /usr/lib/libibverbs.so.1.0.0
> > /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel
> > 80386, version 1 (SYSV), not stripped
> >
> > What did I do wrong ??
> > 
> > Cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> > _______________________________________________
> > ewg mailing list
> > ewg at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> 
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> 
> 
> 


From vlad at dev.mellanox.co.il  Thu May  3 07:17:11 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Thu, 03 May 2007 17:17:11 +0300
Subject: [ofa-general] Re: Help building ib-bonding
In-Reply-To: <A382D4292574EB47A85B8159A6AED1A18305A9@FPNYEXCBE02.opus-i.corp>
References: <A382D4292574EB47A85B8159A6AED1A101131B18@FPNYEXCBE02.opus-i.corp>
	<1178087589.14131.3.camel@vladsk-laptop>
	<A382D4292574EB47A85B8159A6AED1A18305A9@FPNYEXCBE02.opus-i.corp>
Message-ID: <1178201831.6580.49.camel@vladsk-laptop>

Hi Moni,
Please check ib-bonding on updated RH5.0 and FedoraC6
See also https://bugs.openfabrics.org/show_bug.cgi?id=595

Thanks,

Regards,
Vladimir

On Wed, 2007-05-02 at 17:28 -0400, Jeffrey Wong wrote:
> Hello,
> 
> I am using kernel 2.6.18-8.1.1.el5 x86_64
> 
> I have changed the build_env.sh to have the build_32bit=-1
> 
>  
> 
> Thanks in advance.
> 
>  
> 
> Jeff
> 
>  
> 
> 
> 
> When installing all modules I am getting the following errors.
> 
>  
> 
> 
> 
> + make -C /lib/modules/2.6.18-8.1.1.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding
> make: Entering directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64'
>   CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o
> In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:78:
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_inactive_flags':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: (Each undeclared identifier is reported only once
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: for each function it appears in.)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_active_flags':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_compute_features':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1233: warning: comparison of distinct pointer types lacks a cast
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_enslave':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1449: error: 'IFF_BONDING' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_release':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1848: error: 'IFF_BONDING' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_arp_rcv':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:2548: error: 'IFF_BONDING' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_netdev_event':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:3390: error: 'IFF_BONDING' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_init':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4374: warning: assignment discards qualifiers from pointer target type
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4386: error: 'IFF_BONDING' undeclared (first use in this function)
> make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o] Error 1
> make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding] Error 2
> make: Leaving directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64'
> + echo ' Building  IB bonding driver failed'
>  Building  IB bonding driver failed
> + exit 1
> error: Bad exit status from /var/tmp/rpm-tmp.99179 (%build)
> 


From jsquyres at cisco.com  Thu May  3 07:28:37 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 3 May 2007 07:28:37 -0700
Subject: [ofa-general] OFED [bi-]weekly teleconferences
Message-ID: <D18DB84B-09A5-45C7-89B8-9412BE3D9DA2@cisco.com>

Tziporet and I chatted in Sonoma this week and decided that it would  
be good to keep the weekly OFED teleconferences going throughout the  
month of May.

You should already have teleconferences on your calendar for May 7  
(next Monday) and May 21.  I will be sending around an Outlook  
invitation shortly for May 14 and May 28.

-- 
Jeff Squyres
Cisco Systems


From monis at voltaire.com  Thu May  3 07:48:01 2007
From: monis at voltaire.com (Moni Shoua)
Date: Thu, 3 May 2007 17:48:01 +0300
Subject: [ofa-general] RE: Help building ib-bonding
In-Reply-To: <1178201831.6580.49.camel@vladsk-laptop>
Message-ID: <39C75744D164D948A170E9792AF8E7CA244F55@exil.voltaire.com>

I am going to release ib-bonding with a fix for bug 595 today (release
10).
However, this bug doesn't say anything about using build_32bit=-1, so
I'll have to look into it later (assuming that this is a problem).

> -----Original Message-----
> From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il] 
> Sent: Thursday, May 03, 2007 5:17 PM
> To: Moni Shoua; Moni Shoua
> Cc: Jeffrey Wong; general at lists.openfabrics.org
> Subject: Re: Help building ib-bonding
> 
> Hi Moni,
> Please check ib-bonding on updated RH5.0 and FedoraC6 See 
> also https://bugs.openfabrics.org/show_bug.cgi?id=595
> 
> Thanks,
> 
> Regards,
> Vladimir
> 
> On Wed, 2007-05-02 at 17:28 -0400, Jeffrey Wong wrote:
> > Hello,
> > 
> > I am using kernel 2.6-18.8.1.1.el5 x86_64
> > 
> > I have changed the build_env.sh to have the build_32bit=-1
> > 
> >  
> > 
> > Thanks in advance.
> > 
> >  
> > 
> > Jeff
> > 
> >  
> > 
> > 
> > 
> > When installing all modules I am getting the following errors.
> > 
> >  
> > 
> > 
> > 
> > + make -C /lib/modules/2.6.18-8.1.1.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding
> > make: Entering directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64'
> >   CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o
> > In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:78:
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_inactive_flags':
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: (Each undeclared identifier is reported only once
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: for each function it appears in.)
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_active_flags':
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_compute_features':
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1233: warning: comparison of distinct pointer types lacks a cast
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_enslave':
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1449: error: 'IFF_BONDING' undeclared (first use in this function)
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_release':
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1848: error: 'IFF_BONDING' undeclared (first use in this function)
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_arp_rcv':
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:2548: error: 'IFF_BONDING' undeclared (first use in this function)
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_netdev_event':
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:3390: error: 'IFF_BONDING' undeclared (first use in this function)
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_init':
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4374: warning: assignment discards qualifiers from pointer target type
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4386: error: 'IFF_BONDING' undeclared (first use in this function)
> > make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o] Error 1
> > make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding] Error 2
> > make: Leaving directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64'
> > + echo ' Building  IB bonding driver failed'
> >  Building  IB bonding driver failed
> > + exit 1
> > error: Bad exit status from /var/tmp/rpm-tmp.99179 (%build)
> > 
> 
> 


From vlad at dev.mellanox.co.il  Thu May  3 07:58:24 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Thu, 03 May 2007 17:58:24 +0300
Subject: [ofa-general] RE: Help building ib-bonding
In-Reply-To: <39C75744D164D948A170E9792AF8E7CA244F55@exil.voltaire.com>
References: <39C75744D164D948A170E9792AF8E7CA244F55@exil.voltaire.com>
Message-ID: <1178204304.6580.53.camel@vladsk-laptop>

On Thu, 2007-05-03 at 17:48 +0300, Moni Shoua wrote:
> I am going to release ib-bonding with a fix for bug 595 today (release
> 10)
> However, this bug doesn't say anything about using build_32bit=-1 so
> I'll have to look into it later (assuming that this is a problem)
> 
Skip this part (build_32bit=-1). 

Please check also bug: https://bugs.openfabrics.org/show_bug.cgi?id=589


-- 
Vladimir Sokolovsky <vlad at dev.mellanox.co.il>
Mellanox Technologies Ltd.


From steffen.persvold at scali.com  Thu May  3 08:25:46 2007
From: steffen.persvold at scali.com (Steffen Persvold)
Date: Thu, 3 May 2007 11:25:46 -0400
Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
In-Reply-To: <1178197626.6580.38.camel@vladsk-laptop>
References: <D6A583C768392A4D8B297C500CDD54B50157369E@mse11be1.mse11.exchange.ms>
	<1178114728.14131.30.camel@vladsk-laptop>
	<D6A583C768392A4D8B297C500CDD54B5015736A3@mse11be1.mse11.exchange.ms>
	<D6A583C768392A4D8B297C500CDD54B5015736A5@mse11be1.mse11.exchange.ms>
	<D6A583C768392A4D8B297C500CDD54B5015736A7@mse11be1.mse11.exchange.ms>
	<1178197626.6580.38.camel@vladsk-laptop>
Message-ID: <D6A583C768392A4D8B297C500CDD54B5017699F0@mse11be1.mse11.exchange.ms>

Vladimir,

Nope. Still the same issue. The RPMs will only contain one set of
libraries and it is always in /usr/lib (if I set the build_32bit=0
option I get the 64bit libraries but in the wrong directory).

Seriously, am I the only one seeing this ? I would think rhel4 u4 was a
very normal test platform ?

Cheers,

Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter


> -----Original Message-----
> From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il]
> Sent: Thursday, May 03, 2007 9:07 AM
> To: Steffen Persvold
> Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> 
> Please see if this happens in OFED-1.2-20070503-0600.
> But first uninstall the previous OFED version with ofed_uninstall.sh
> command.
> 
> Thanks,
> 
> Regards,
> Vladimir
> 
> On Wed, 2007-05-02 at 11:30 -0400, Steffen Persvold wrote:
> > Hmm,
> >
> > so I tried something. I put :
> >
> > build_32bit=0
> >
> > into my ofed.conf file and rebuilt (build.sh -c ofed.conf). This time
> > it built 64bit libraries, but it puts them in the wrong directory :
> >
> > # rpm -qpl ../libibverbs-1.1-0.x86_64.rpm
> > /etc/ld.so.conf.d/ofed.conf
> > /usr/lib/libibverbs.so.1
> > /usr/lib/libibverbs.so.1.0.0
> >
> > # file /usr/lib/libibverbs.so.1.0.0
> > /usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD
> > x86-64, version 1 (SYSV), not stripped
> >
> > So what's up ??
> >
> > Cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
______________________________________________________________________
> > From: Steffen Persvold
> > Sent: Wed 5/2/2007 10:30 AM
> > To: Steffen Persvold; Vladimir Sokolovsky
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Also,
> >
> > If I look at the /etc/ld.so.conf.d/ofed.conf file I have :
> >
> > # cat ofed.conf
> > /usr/lib
> > /usr/lib
> >
> >
> > which seems kinda weird ? :)
> >
> > Cheers,
> >
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
______________________________________________________________________
> > From: ewg-bounces at lists.openfabrics.org on behalf of Steffen Persvold
> > Sent: Wed 5/2/2007 10:20 AM
> > To: Vladimir Sokolovsky
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Nope :
> >
> >
> > [redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
> > /etc/ld.so.conf.d/ofed.conf
> > /usr/lib/libibverbs.so.1
> > /usr/lib/libibverbs.so.1.0.0
> > [redhat-release-4ES-5.5]#
> >
> > So the RPM got built, but without 64bit libraries. Now if it was the
> > other way around (i.e no 32bit libraries) I could have understood it
> > (as 32bit is an option on x86_64), but not having the native 64bit
> > libraries is not so easy to understand :)
> >
> > cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
______________________________________________________________________
> > From: ewg-bounces at lists.openfabrics.org on behalf of Vladimir
> > Sokolovsky
> > Sent: Wed 5/2/2007 10:05 AM
> > To: Steffen Persvold
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Don't you have /usr/lib64/libibverbs.so.1.0.0?
> >
> > Regards,
> > Vladimir
> >
> > On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote:
> > > Folks,
> > >
> > > I used the build.sh script to build the above mentioned packages on
> > > rhel4u4 x86_64, but for some reason it only compiles 32bit libraries
> > > (even if the packages are named x86_64) :
> > >
> > > # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
> > > x86_64
> > >
> > > (after installing it) :
> > >
> > > # file /usr/lib/libibverbs.so.1.0.0
> > > /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel
> > > 80386, version 1 (SYSV), not stripped
> > >
> > > What did I do wrong ??
> > >
> > > Cheers,
> > > Steffen Persvold
> > > Technical Director Americas
> > > tel. 508-281-7100 x401
> > > fax. 508-281-7171
> > >
> > > http://www.scali.com/
> > > Scaling the Linux datacenter
> > > _______________________________________________
> > > ewg mailing list
> > > ewg at lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> >
> >
> >
> >


From cap at nsc.liu.se  Thu May  3 09:06:36 2007
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Thu, 3 May 2007 18:06:36 +0200
Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
In-Reply-To: <D6A583C768392A4D8B297C500CDD54B5017699F0@mse11be1.mse11.exchange.ms>
References: <D6A583C768392A4D8B297C500CDD54B50157369E@mse11be1.mse11.exchange.ms>
	<1178197626.6580.38.camel@vladsk-laptop>
	<D6A583C768392A4D8B297C500CDD54B5017699F0@mse11be1.mse11.exchange.ms>
Message-ID: <200705031806.36851.cap@nsc.liu.se>

On Thursday 03 May 2007, Steffen Persvold wrote:
> Vladimir,
>
> Nope. Still the same issue. The RPMs will only contain one set of
> libraries and it is always in /usr/lib (if I set the build_32bit=0
> option I get the 64bit libraries but in the wrong directory).
>
> Seriously, am I the only one seeing this ? I would think rhel4 u4 was a
> very normal test platform ?

Hello Steffen,

Being a curious person I tried to build 1.2-20070503-0600 on one of my 
centos-4.4 x86_64 boxes. I had only x86_64 packages so the build 
warned "glibc-devel 32bit is required for 32-bit libraries. Building 64-bit 
only". The result was fine (all packages named x86_64 and all libs 
in /usr/lib64).

/Peter
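
The check that "file" performs in this thread (32-bit vs 64-bit shared object) comes down to one byte of the ELF header: after the 4-byte magic, the EI_CLASS byte at offset 4 is 1 for 32-bit and 2 for 64-bit objects. A minimal sketch in C, assuming the first bytes of the library have already been read into memory; elf_bits() is a hypothetical helper name, not something shipped with OFED:

```c
#include <assert.h>
#include <stddef.h>

/* Values of the EI_CLASS byte (offset 4 of the ELF identification). */
enum { ELF_CLASS_32 = 1, ELF_CLASS_64 = 2 };

/*
 * Hypothetical helper: return 32 or 64 for a buffer that starts with a
 * valid ELF identification, or 0 if the buffer is too short, lacks the
 * "\177ELF" magic, or carries an unknown class byte.
 */
int elf_bits(const unsigned char *hdr, size_t len)
{
    assert(hdr != NULL);

    if (len < 5)
        return 0;
    if (hdr[0] != 0x7f || hdr[1] != 'E' || hdr[2] != 'L' || hdr[3] != 'F')
        return 0;

    switch (hdr[4]) {
    case ELF_CLASS_32:
        return 32;
    case ELF_CLASS_64:
        return 64;
    default:
        return 0;
    }
}
```

Feeding it the first five bytes of each library under /usr/lib and /usr/lib64 would show whether a 64-bit object landed in the 32-bit directory, which is the symptom reported above.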

> Cheers,
>
> Steffen Persvold

From mshefty at ichips.intel.com  Thu May  3 09:48:14 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 03 May 2007 09:48:14 -0700
Subject: [ofa-general] man pages for the rdma-cm
In-Reply-To: <1178127046.18609.107.camel@stevo-desktop>
References: <1178127046.18609.107.camel@stevo-desktop>
Message-ID: <463A124E.4020604@ichips.intel.com>

> Are there man pages for the rdma-cm in the pipeline?  I think it would
> be great (requirement?) to have these for ofed-1.2 since we do have the
> other verbs man pages.  
> 
> I didn't know if this was in-progress or are we looking for
> volunteers...

I don't have man pages, but I did update the comments in the rdma_cm header file 
to assist in the auto generation of man pages.  The results weren't quite what I 
was wanting (I think I was using some tool that created kernel man pages).  It 
probably wouldn't take that long to auto generate some pages, then manually 
fix-up any issues.

What's the release date for RC3?

- Sean


From swise at opengridcomputing.com  Thu May  3 09:51:24 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 03 May 2007 11:51:24 -0500
Subject: [ofa-general] man pages for the rdma-cm
In-Reply-To: <463A124E.4020604@ichips.intel.com>
References: <1178127046.18609.107.camel@stevo-desktop>
	<463A124E.4020604@ichips.intel.com>
Message-ID: <1178211084.27558.8.camel@stevo-desktop>

On Thu, 2007-05-03 at 09:48 -0700, Sean Hefty wrote:
> > Are there man pages for the rdma-cm in the pipeline?  I think it would
> > be great (requirement?) to have these for ofed-1.2 since we do have the
> > other verbs man pages.  
> > 
> > I didn't know if this was in-progress or are we looking for
> > volunteers...
> 
> I don't have man pages, but I did update the comments in the rdma_cm header file 
> to assist in the auto generation of man pages.  The results weren't quite what I 
> was wanting (I think I was using some tool that created kernel man pages).  It 
> probably wouldn't take that long to auto generate some pages, then manually 
> fix-up any issues.
> 
> What's the release date for RC3?
> 

I believe it is today.


From xma at us.ibm.com  Thu May  3 09:54:12 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 3 May 2007 09:54:12 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <20070503044711.GO10009@mellanox.co.il>
Message-ID: <OFB9815AE0.60DB073D-ON872572D0.005CB2E0-882572D0.005CC7B0@us.ibm.com>


> BTW, why do you ignore the option to use UC QP?
> MST
Unfortunately, eHCA doesn't support UC in current version. Next generation
will have RC w/i SRQ support.

Shirley Ma

From sean.hefty at intel.com  Thu May  3 09:55:02 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 3 May 2007 09:55:02 -0700
Subject: [ofa-general] man pages for the rdma-cm
In-Reply-To: <1178211084.27558.8.camel@stevo-desktop>
Message-ID: <000101c78da3$cb32fad0$3b78e984@amr.corp.intel.com>

>I believe it is today.

So, there's about a 0% chance of this making RC 3 then...  I will try to get
this done by early next week.


From swise at opengridcomputing.com  Thu May  3 09:58:12 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 03 May 2007 11:58:12 -0500
Subject: [ofa-general] [PATCH librdmacm] rping: Transfer rkey/addr/len
	information in network byte order.
In-Reply-To: <1178117806.18609.25.camel@stevo-desktop>
References: <1177515271.22094.33.camel@stevo-desktop>
	<1178060596.2309.195.camel@stevo-desktop>
	<46380D09.5070906@ichips.intel.com>
	<1178117806.18609.25.camel@stevo-desktop>
Message-ID: <1178211492.27558.12.camel@stevo-desktop>

Hey,

We need this in -rc3.


Steve.


On Wed, 2007-05-02 at 09:56 -0500, Steve Wise wrote:
> On Tue, 2007-05-01 at 21:01 -0700, Sean Hefty wrote:
> > > This patch regresses rping.  I failed to test it on AMD64<->AMD64 (ie
> > > like endian systems).  I will provide another patch shortly, or we can
> > > undo the broken rping patch for -rc3.  Whatever you think is best.
> > 
> > Let's fix it.  Please create a patch on top of this that fixes the problem.
> > 
> > Thanks
> > 
> > - Sean
> 
> Here is the fix.  Tested with:
> 
> ppc64 client, amd64 server
> ppc64 server, amd64 client
> amd64 client, amd64 server
> 
> 
> ---
> 
> Fix regression introduced by 88fc0cb21698dfb5d7660eecf7dddd0531fc8021.
> 
> From: Steve Wise <swise at opengridcomputing.com>
> 
> - swizzle memory info when sending it to peer.
> - fixed printf format
> 
> Signed-off-by: Steve Wise <swise at opengridcomputing.com>
> ---
> 
>  examples/rping.c |   10 +++++-----
>  1 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/examples/rping.c b/examples/rping.c
> index 17b0000..bccabb0 100644
> --- a/examples/rping.c
> +++ b/examples/rping.c
> @@ -243,7 +243,7 @@ static int server_recv(struct rping_cb *
>  	cb->remote_rkey = ntohl(cb->recv_buf.rkey);
>  	cb->remote_addr = ntohll(cb->recv_buf.buf);
>  	cb->remote_len  = ntohl(cb->recv_buf.size);
> -	DEBUG_LOG("Received rkey %x addr %" PRIx64 "len %d from peer\n",
> +	DEBUG_LOG("Received rkey %x addr %" PRIx64 " len %d from peer\n",
>  		  cb->remote_rkey, cb->remote_addr, cb->remote_len);
>  
>  	if (cb->state <= CONNECTED || cb->state == RDMA_WRITE_COMPLETE)
> @@ -614,12 +614,12 @@ static void rping_format_send(struct rpi
>  {
>  	struct rping_rdma_info *info = &cb->send_buf;
>  
> -	info->buf = (uint64_t) (unsigned long) buf;
> -	info->rkey = mr->rkey;
> -	info->size = cb->size;
> +	info->buf = htonll((uint64_t) (unsigned long) buf);
> +	info->rkey = htonl(mr->rkey);
> +	info->size = htonl(cb->size);
>  
>  	DEBUG_LOG("RDMA addr %" PRIx64" rkey %x len %d\n",
> -		  info->buf, info->rkey, info->size);
> +		  ntohll(info->buf), ntohl(info->rkey), ntohl(info->size));
>  }
>  
>  static int rping_test_server(struct rping_cb *cb)
> 
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
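
For readers following the patch, the idea is that every field placed on the wire is converted to network (big-endian) byte order by the sender and converted back by the receiver, so mixed ppc64/amd64 pairs agree on the values. A minimal sketch, assuming a struct laid out like the fields in the diff; wire_info, wire_format(), wire_parse(), my_htonll() and my_ntohll() are hypothetical names for illustration, and rping's own htonll/ntohll definitions may differ:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl, ntohl */

/* Hypothetical wire layout mirroring the buf/rkey/size fields in the diff. */
struct wire_info {
    uint64_t buf;
    uint32_t rkey;
    uint32_t size;
};

/* 64-bit host-to-network conversion built from two htonl() calls. */
uint64_t my_htonll(uint64_t x)
{
    uint32_t hi = htonl((uint32_t)(x >> 32));
    uint32_t lo = htonl((uint32_t)x);
    uint64_t out;

    /* High word first, so the bytes in memory are big-endian on any host. */
    memcpy(&out, &hi, sizeof hi);
    memcpy((unsigned char *)&out + sizeof hi, &lo, sizeof lo);
    return out;
}

/* Inverse of my_htonll(). */
uint64_t my_ntohll(uint64_t x)
{
    uint32_t hi, lo;

    memcpy(&hi, &x, sizeof hi);
    memcpy(&lo, (unsigned char *)&x + sizeof hi, sizeof lo);
    return ((uint64_t)ntohl(hi) << 32) | ntohl(lo);
}

/* Sender side: store every field in network byte order. */
void wire_format(struct wire_info *info, uint64_t addr,
                 uint32_t rkey, uint32_t len)
{
    assert(info != NULL);
    info->buf  = my_htonll(addr);
    info->rkey = htonl(rkey);
    info->size = htonl(len);
}

/* Receiver side: convert every field back to host byte order. */
void wire_parse(const struct wire_info *info, uint64_t *addr,
                uint32_t *rkey, uint32_t *len)
{
    assert(info != NULL && addr != NULL && rkey != NULL && len != NULL);
    *addr = my_ntohll(info->buf);
    *rkey = ntohl(info->rkey);
    *len  = ntohl(info->size);
}
```

The regression described above is what happens when only one side converts: between two like-endian little-endian hosts the receiver byte-swaps values that were never swapped by the sender.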


From swise at opengridcomputing.com  Thu May  3 09:59:21 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 03 May 2007 11:59:21 -0500
Subject: [ofa-general] man pages for the rdma-cm
In-Reply-To: <000101c78da3$cb32fad0$3b78e984@amr.corp.intel.com>
References: <000101c78da3$cb32fad0$3b78e984@amr.corp.intel.com>
Message-ID: <1178211561.27558.14.camel@stevo-desktop>

On Thu, 2007-05-03 at 09:55 -0700, Sean Hefty wrote:
> >I believe it is today.
> 
> So, there's about a 0% chance of this making RC 3 then...  I will try to get
> this done by early next week.

If you want me to review the text, lemme know...


From halr at voltaire.com  Thu May  3 09:59:39 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 May 2007 12:59:39 -0400
Subject: [ofa-general] Re: [PATCH] osm: source and destination strings
	overlap when using sprintf()
In-Reply-To: <4636E4A7.7060108@dev.mellanox.co.il>
References: <462C7C21.7010004@dev.mellanox.co.il>
	<20070423101738.GG4579@mellanox.co.il>
	<462E80A3.5060503@dev.mellanox.co.il>
	<20070501005101.GA26019@sashak.voltaire.com>
	<4636E4A7.7060108@dev.mellanox.co.il>
Message-ID: <1178211572.32222.3479.camel@hal.voltaire.com>

On Tue, 2007-05-01 at 02:56, Yevgeny Kliteynik wrote:
> Sasha Khapyorsky wrote:
> > On 01:11 Wed 25 Apr     , Yevgeny Kliteynik wrote:
> >> Michael S. Tsirkin wrote:
> >>> Since you seem to do a strcat, which does a strlen anyway, how about, for example:
> >>>
> >>> -      sprintf( buf_line1,"%s 0x%01x |",
> >>> -               buf_line1, p_vla_tbl->vl_entry[i].vl);
> >>> +      sprintf( buf_line1 + strlen(buf_line1)," 0x%01x |",
> >>> +               p_vla_tbl->vl_entry[i].vl);
> >>>
> >>> and so on in all the other places?
> >> Agree.
> >> I'll send a new patch later.
> > 
> > Or like this:
> > 
> > +      int n = 0;
> > ...
> > -      sprintf( buf_line1,"%s 0x%01x |",
> > -               buf_line1, p_vla_tbl->vl_entry[i].vl);
> > +      n += sprintf( buf_line1 + n," 0x%01x |",
> > +                    p_vla_tbl->vl_entry[i].vl);
> > 
> > , so strlen() rerunning in loop is not needed anymore.
> 
> Right, it does look better.

So is someone going to submit this patch ? Thanks.

-- Hal
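
As a reference for whoever picks this up: passing the destination buffer as a "%s" argument to sprintf() makes source and destination overlap, which the C standard leaves undefined. Tracking the write offset avoids that. A minimal sketch of the offset-tracking variant, assuming the entries are available as a plain array of VL values; format_vl_line() is a hypothetical name, and it uses snprintf() rather than the sprintf() shown in the thread so that a short buffer is reported instead of overrun:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Append " 0x<vl> |" for each entry, keeping a running offset n so that
 * buf is never passed as its own format argument and strlen() is not
 * rerun on every iteration.
 * Returns the number of characters written (excluding the NUL), or -1
 * if the buffer is too small.
 */
int format_vl_line(char *buf, size_t size,
                   const unsigned char *vl, size_t count)
{
    size_t n = 0;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        int w = snprintf(buf + n, size - n, " 0x%01x |", (unsigned)vl[i]);

        if (w < 0 || (size_t)w >= size - n)
            return -1;          /* output error or truncation */
        n += (size_t)w;
    }
    return (int)n;
}
```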

> -- Yevgeny
> 
> > Sasha
> > 
> 


From halr at voltaire.com  Thu May  3 10:01:51 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 May 2007 13:01:51 -0400
Subject: [ofa-general] [query] SMI nodeinfo, port_info structures
In-Reply-To: <735493.40597.qm@web8323.mail.in.yahoo.com>
References: <735493.40597.qm@web8323.mail.in.yahoo.com>
Message-ID: <1178211580.32222.3481.camel@hal.voltaire.com>

Hi Mahesh,

On Thu, 2007-05-03 at 02:50, Keshetti Mahesh wrote:
> Hi,
> 
> Even though nobody except the ipath driver currently uses the structures
> nodeinfo and port_info, shouldn't these structures be in
> smi.h?

I don't think so. 

> Because these structures are not specific to any hardware but they are
> specific to the SMI.

SMI has nothing to do with those SM attributes.

> And can anyone tell me why some fields have big endian (__bexx) data
> type and others have normal (uxx) data type in these structures?

What structures (in what file(s)) are you referring to ?

-- Hal

> -Mahesh
> 
> 
> 


From rick.jones2 at hp.com  Thu May  3 10:32:10 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Thu, 03 May 2007 10:32:10 -0700
Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian
In-Reply-To: <20070502214944.GF10009@mellanox.co.il>
References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il>
Message-ID: <463A1C9A.6060706@hp.com>

> rpm is not installed. I don't know how to solve this, Vlad
> might be able to answer tomorrow.

That would be cool.  Push comes to shove I can probably put my two systems on an 
external network if there is a need for a "laying on of hands."

>>should I be taking a different path to build here?
> 
> 
> Maybe, maybe not.
> 
> There *is* another way which should be enough to test IPoIB:
> try getting a kernel tarball from
> http://git.openfabrics.org/~vlad/builds/
> 
> If you unpack this, you can configure/make/make install.
> 
> Installer will backup your original modules under the prefix.
> Keep the source around and you'll be able to make uninstall
> to get back to original system.
> 
> Note 1: default configure settings are often not what you want:
> run ./configure --help first of all to see which modules to select
> (--with-ipoib-mod and --with-mthca-mod I think) and to set a prefix.
> Note 2: having quilt tool installed is recommended - will let you
> add/remove patches later.
> Note 3: this way you get no userspace. openfabrics tarballs
> are under the same directory, and a similar method works there.
> external tarballs (MPI, bonding, etc) are supplied to us in SRPM
> format so this trick does not work for them.

Seems I found little joy there too, probably my own fault.  The environment is a 
Debian 4.0. The kernel is called:

hpcpc107:~/ofa_1_2_kernel-20070502-0200# uname -a
Linux hpcpc107 2.6.21.1-raj #1 SMP Tue May 1 14:11:27 PDT 2007 ia64 GNU/Linux


The sources to which are at:

/root/linux-2.6.21.1

My configure line was:

./configure --with-ipoib-mod --with-mthca-mod --with-sdp-mod --prefix=/root/save

I didn't save the first set of configure output :(  Subsequent configures give:

hpcpc107:~/ofa_1_2_kernel-20070502-0200# ./configure --with-ipoib-mod --with-mthca-mod --with-sdp-mod --prefix=/root/save
mkdir -p /root/ofa_1_2_kernel-20070502-0200/patches
touch /root/ofa_1_2_kernel-20070502-0200/patches/quiltrc
/root/ofa_1_2_kernel-20070502-0200/kernel_patches/fixes/add_orig_dgid_to_sysfs.patch
/usr/bin/quilt --quiltrc /root/ofa_1_2_kernel-20070502-0200/patches/quiltrc import /root/ofa_1_2_kernel-20070502-0200/kernel_patches/fixes/add_orig_dgid_to_sysfs.patch
Patch add_orig_dgid_to_sysfs.patch is applied

Failed executing /usr/bin/quilt
hpcpc107:~/ofa_1_2_kernel-20070502-0200#

Make then reports:

hpcpc107:~/ofa_1_2_kernel-20070502-0200# make
Building kernel modules
Kernel version: 2.6.21.1-raj
Modules directory: //lib/modules/2.6.21.1-raj/updates
Kernel sources: /lib/modules/2.6.21.1-raj/build
env EXTRA_CFLAGS="  -I/root/ofa_1_2_kernel-20070502-0200/include 
-I/root/ofa_1_2_kernel-20070502-0200/drivers/infiniband/include \
                 -I/root/ofa_1_2_kernel-20070502-0200/drivers/infiniband/ulp/ipoib \
                 -I/root/ofa_1_2_kernel-20070502-0200/drivers/infiniband/debug \
 
-I/root/ofa_1_2_kernel-20070502-0200/drivers/infiniband/hw/cxgb3/core \
                 -I/root/ofa_1_2_kernel-20070502-0200/drivers/net/cxgb3 \
                 -I/root/ofa_1_2_kernel-20070502-0200/drivers/net/rds " \
         make -C /lib/modules/2.6.21.1-raj/build 
SUBDIRS="/root/ofa_1_2_kernel-20070502-0200" KERNELRELEASE=2.6.21.1-raj \
                 EXTRAVERSION=.1-raj V=1  \
                 CONFIG_INFINIBAND= \
                 CONFIG_INFINIBAND_IPOIB=m \
                 CONFIG_INFINIBAND_IPOIB_CM=y \
                 CONFIG_INFINIBAND_SDP=m \
                 CONFIG_INFINIBAND_SRP= \
                 CONFIG_INFINIBAND_USER_MAD= \
                 CONFIG_INFINIBAND_USER_ACCESS= \
                 CONFIG_INFINIBAND_ADDR_TRANS= \
                 CONFIG_INFINIBAND_MTHCA=m \
                 CONFIG_INFINIBAND_IPOIB_DEBUG=y \
                 CONFIG_INFINIBAND_ISER= \
                 CONFIG_SCSI_ISCSI_ATTRS= \
                 CONFIG_ISCSI_TCP= \
                 CONFIG_INFINIBAND_EHCA= \
                 CONFIG_INFINIBAND_EHCA_SCALING= \
                 CONFIG_RDS= \
                 CONFIG_RDS_IB= \
                 CONFIG_RDS_TCP= \
                 CONFIG_RDS_DEBUG= \
                 CONFIG_INFINIBAND_IPOIB_DEBUG_DATA= \
                 CONFIG_INFINIBAND_SDP_SEND_ZCOPY= \
                 CONFIG_INFINIBAND_SDP_RECV_ZCOPY= \
                 CONFIG_INFINIBAND_SDP_DEBUG=y \
                 CONFIG_INFINIBAND_SDP_DEBUG_DATA= \
                 CONFIG_INFINIBAND_IPATH= \
                 CONFIG_INFINIBAND_MTHCA_DEBUG=y \
                 CONFIG_INFINIBAND_MADEYE= \
                 CONFIG_INFINIBAND_VNIC= \
                 CONFIG_INFINIBAND_VNIC_DEBUG= \
                 CONFIG_INFINIBAND_VNIC_STATS= \
                 CONFIG_CHELSIO_T3= \
                 CONFIG_INFINIBAND_CXGB3= \
                 CONFIG_INFINIBAND_CXGB3_DEBUG= \
                 LINUXINCLUDE=' \
                  \
                 -I/root/ofa_1_2_kernel-20070502-0200/include \
                 -I/root/ofa_1_2_kernel-20070502-0200/drivers/infiniband/include \
                 -Iinclude \
                 $(if $(KBUILD_SRC),-Iinclude2 -I$(srctree)/include) \
                 -include include/linux/autoconf.h \
                 -include 
/root/ofa_1_2_kernel-20070502-0200/include/linux/autoconf.h \
                 ' \
                 modules
make[1]: Entering directory `/root/linux-2.6.21.1'
test -e include/linux/autoconf.h -a -e include/config/auto.conf || (           \
         echo;                                                           \
         echo "  ERROR: Kernel configuration is invalid.";               \
         echo "         include/linux/autoconf.h or include/config/auto.conf are 
missing.";      \
         echo "         Run 'make oldconfig && make prepare' on kernel src to 
fix it.";  \
         echo;                                                           \
         /bin/false)
mkdir -p /root/ofa_1_2_kernel-20070502-0200/.tmp_versions
rm -f /root/ofa_1_2_kernel-20070502-0200/.tmp_versions/*
make -f scripts/Makefile.build obj=/root/ofa_1_2_kernel-20070502-0200
   Building modules, stage 2.
make -f /root/linux-2.6.21.1/scripts/Makefile.modpost
   scripts/mod/modpost -m  -i /root/linux-2.6.21.1/Module.symvers -I 
/root/ofa_1_2_kernel-20070502-0200/Module.symvers -o 
/root/ofa_1_2_kernel-20070502-0200/Module.symvers -w vmlinux
make[1]: Leaving directory `/root/linux-2.6.21.1'

so I go ahead and do as the output suggests, make oldconfig and make prepare in 
the kernel source directory, come back to the ofa kernel directory, type make 
and get:

...

make[1]: Entering directory `/root/linux-2.6.21.1'
test -e include/linux/autoconf.h -a -e include/config/auto.conf || (           \
         echo;                                                           \
         echo "  ERROR: Kernel configuration is invalid.";               \
         echo "         include/linux/autoconf.h or include/config/auto.conf are 
missing.";      \
         echo "         Run 'make oldconfig && make prepare' on kernel src to 
fix it.";  \
         echo;                                                           \
         /bin/false)
mkdir -p /root/ofa_1_2_kernel-20070502-0200/.tmp_versions
rm -f /root/ofa_1_2_kernel-20070502-0200/.tmp_versions/*
make -f scripts/Makefile.build obj=/root/ofa_1_2_kernel-20070502-0200
   Building modules, stage 2.
make -f /root/linux-2.6.21.1/scripts/Makefile.modpost
   scripts/mod/modpost -m  -i /root/linux-2.6.21.1/Module.symvers -I 
/root/ofa_1_2_kernel-20070502-0200/Module.symvers -o 
/root/ofa_1_2_kernel-20070502-0200/Module.symvers -w vmlinux
make[1]: Leaving directory `/root/linux-2.6.21.1'

again.

rick jones
still rather kernel clueless


From mst at dev.mellanox.co.il  Thu May  3 10:48:17 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 20:48:17 +0300
Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian
In-Reply-To: <463A1C9A.6060706@hp.com>
References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il>
	<463A1C9A.6060706@hp.com>
Message-ID: <20070503174817.GC9719@mellanox.co.il>

> make[1]: Entering directory `/root/linux-2.6.21.1'
> test -e include/linux/autoconf.h -a -e include/config/auto.conf || (        
> \
>         echo;                                                           \
>         echo "  ERROR: Kernel configuration is invalid.";               \
>         echo "         include/linux/autoconf.h or include/config/auto.conf 
>         are missing.";      \
>         echo "         Run 'make oldconfig && make prepare' on kernel src 
>         to fix it.";  \

This is the kernel's message, not ours - is this the source you built the kernel from?
If you go into /root/linux-2.6.21.1 as root and do make modules,
does it succeed?

-- 
MST


From rick.jones2 at hp.com  Thu May  3 10:59:15 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Thu, 03 May 2007 10:59:15 -0700
Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian
In-Reply-To: <20070503174817.GC9719@mellanox.co.il>
References: <463901A0.5060905@hp.com>
	<20070502214944.GF10009@mellanox.co.il>	<463A1C9A.6060706@hp.com>
	<20070503174817.GC9719@mellanox.co.il>
Message-ID: <463A22F3.4090108@hp.com>

Michael S. Tsirkin wrote:
>>make[1]: Entering directory `/root/linux-2.6.21.1'
>>test -e include/linux/autoconf.h -a -e include/config/auto.conf || (        
>>\
>>        echo;                                                           \
>>        echo "  ERROR: Kernel configuration is invalid.";               \
>>        echo "         include/linux/autoconf.h or include/config/auto.conf 
>>        are missing.";      \
>>        echo "         Run 'make oldconfig && make prepare' on kernel src 
>>        to fix it.";  \
> 
> 
> This is the kernel's message, not ours - is this the source you built the kernel from?
> If you go into /root/linux-2.6.21.1 as root and do make modules,
> does it succeed?

yes.  some warnings at the beginning about some modules and section mismatches 
but it seems to complete.

rick jones


From pradeep at us.ibm.com  Thu May  3 11:14:56 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Thu, 3 May 2007 11:14:56 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
Message-ID: <OF24698A01.6E469469-ON882572D0.006418A5-882572D0.00644BF2@us.ibm.com>

"Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote on 05/02/2007 08:55:47 
PM:

> > > > +      if (ipoib_cm_post_receive(dev, i << 32 | index)) {
> > > 
> > > 1. It seems there are multiple QPs mapped to a single CQ -
> > >    and each gets ipoib_recvq_size recv WRs above.
> > >    Is that right? How do you prevent CQ overrun then?
> > 
> > Good point! Looking at the IB spec it appears that upon CQ overflow
> > it results in a Local Work Queue catastrophic error and puts the QP
> > (receiver side) in an error state.
> 
> Look further in spec - you get CQ error, too.
> 
> > Hence, I am speculating that the 
> > sending side will see an error. This will result in the sending side 
> > destroying the QP and sending a DREQ message, which will remove the 
> > receive side QP.
> > 
> > A new set of QPs will be created on the send side (this is RC) and
> > the connection setup starts over again. It will continue, but at a
> > degraded rate.
> > Is this correct? What other alternative do you suggest
> > - create a CQ per QP? Is the max number of CQs an issue to consider, if
> > we adopt this approach?
> 
> We were switching to NAPI though, and NAPI kind of forces you to use
> a common CQ, I think.

What if in ipoib_transport_dev_init() size is changed to something like:

size = ipoib_sendq_size + NOSRQ_INDEX_RING_SIZE * ipoib_recvq_size + 1;

used by ib_create_cq() call for the NOSRQ case only? Yes, we will end up
consuming a lot more memory -do you see any (other) problems with that?
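
As a rough illustration of what that sizing implies, here is a small
user-space sketch of the arithmetic.  The helper name nosrq_cq_entries()
and the sample queue depths mentioned afterwards are assumptions made for
this example, not values taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NOSRQ_INDEX_RING_SIZE 1024

/*
 * Hypothetical helper: number of CQ entries needed if every one of the
 * NOSRQ_INDEX_RING_SIZE possible RC QPs can have recvq_size receives
 * outstanding on the shared CQ, plus the send queue and one spare entry.
 */
static uint64_t nosrq_cq_entries(uint64_t sendq_size, uint64_t recvq_size)
{
	/* Keep the inputs small enough that the product cannot overflow. */
	assert(sendq_size <= UINT32_MAX && recvq_size <= UINT32_MAX);
	return sendq_size + NOSRQ_INDEX_RING_SIZE * recvq_size + 1;
}
```

With, say, a send queue of 64 and a receive queue of 128 (assumed values),
this comes to 64 + 1024 * 128 + 1 = 131137 entries, which shows how fast
the CQ grows with the receive queue depth.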

Pradeep
pradeep at us.ibm.com


From rdreier at cisco.com  Thu May  3 11:33:29 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 11:33:29 -0700
Subject: [ofa-general] Re: [PATCH 0 of 3] comp_vector kernel support
In-Reply-To: <20070503104806.GC10009@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 3 May 2007 13:48:06 +0300")
References: <20070503104806.GC10009@mellanox.co.il>
Message-ID: <adahcqteozq.fsf@cisco.com>

 > 1. extends ib_create_cq to pass in comp_vector parameter
 > 2. Update all ULP/providers
 > 3. mthca is enhanced to support multiple vectors if MSI-X is enabled on SMP
 > 4. Other providers report support for a single completion vector
 > 5. uverbs and IPoIB CM are enhanced to use multiple vectors if available

 > Please consider for 2.6.22.

This is good work, but given that this has just appeared halfway
through the 2.6.22 merge window I don't think we should merge it
right now.  Rather, let's definitely get it into 2.6.23.


From rdreier at cisco.com  Thu May  3 11:35:51 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 11:35:51 -0700
Subject: [ofa-general] Re: [PATCH 1 of 3] IB/verbs: add cq comp_vector
	support in core
In-Reply-To: <20070503104847.GD10009@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 3 May 2007 13:48:47 +0300")
References: <20070503104847.GD10009@mellanox.co.il>
Message-ID: <adad51heovs.fsf@cisco.com>

 > Note: since num_comp_vectors = 0 is not legal, and to minimize provider churn,
 > I set num_comp_vectors to a sane value in core. Providers can increase that.

I would actually prefer to see providers updated to set this
explicitly.  Right now the rule is ib_alloc_device() returns a
completely zeroed out structure -- that's much easier to understand
than initializing some fields but not others.  And there are other
fields we could set defaults for, like phys_port_cnt == 0 is not legal
either, but we don't try to set that in the core.

 - R.


From rdreier at cisco.com  Thu May  3 11:37:43 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 11:37:43 -0700
Subject: [ofa-general] Re: [PATCH 2 of 3] IB/mthca: support multiple
	completion vectors
In-Reply-To: <20070503104924.GE10009@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 3 May 2007 13:49:24 +0300")
References: <20070503104924.GE10009@mellanox.co.il>
Message-ID: <ada8xc5eoso.fsf@cisco.com>

 > I don't know how many vectors make sense, so I decided to be
 > conservative here, since each EQ consumes a lot of memory by default.

I think #vectors == O(#CPUs) is what we should aim for.

Also another useful thing for NUMA systems might be to allocate the EQ
memory from the CPU where that interrupt will be targeted.  But I
don't know exactly the best way to do that, and I think we can leave
that for later.

 - R.


From rdreier at cisco.com  Thu May  3 11:39:43 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 11:39:43 -0700
Subject: [ofa-general] Re: [PATCH 3 of 3] ipoib/cm: separate comp vectors to
	RX/TX
In-Reply-To: <20070503104955.GF10009@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 3 May 2007 13:49:55 +0300")
References: <20070503104955.GF10009@mellanox.co.il>
Message-ID: <ada4pmteopc.fsf@cisco.com>

 > Enhance ipoib to use multiple completion vectors if available.
 > On mthca, this increases netperf BW by some 5% with
 > same or lower service demand.

I wonder if this is the best way to use multiple vectors in IPoIB.
Shirley has pointed out that right now we can't scale to both ports on
2-port adapters because both ports end up using the same interrupt.
So maybe we want to target each port to separate completion vectors
when available?

 - R.


From xma at us.ibm.com  Thu May  3 11:57:32 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 3 May 2007 11:57:32 -0700
Subject: [ofa-general] Re: [PATCH 3 of 3] ipoib/cm: separate comp vectors
	to	RX/TX
In-Reply-To: <ada4pmteopc.fsf@cisco.com>
Message-ID: <OFC8ED822E.AD693034-ON872572D0.0067E2D0-882572D0.00682A1F@us.ibm.com>


general-bounces at lists.openfabrics.org wrote on 05/03/2007 11:39:43 AM:

>  > Enhance ipoib to use multiple completion vectors if available.
>  > On mthca, this increases netperf BW by some 5% with
>  > same or lower service demand.
>
> I wonder if this is the best way to use multiple vectors in IPoIB.
> Shirley has pointed out that right now we can't scale to both ports on
> 2-port adapters because both ports end up using the same interrupt.
> So maybe we want to target each port to separate completion vectors
> when available?
>
>  - R.

Yes, it would be better to do perPort/perCompletion/perEvent/perInterrupt.
It also helps latency, not just throughput.

Shirley Ma
IBM Linux Technology Center

From rdreier at cisco.com  Thu May  3 12:01:03 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 12:01:03 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <OF88F0F777.9937D2C9-ON882572CB.00013288-882572CB.0004BCA0@us.ibm.com>
	(Pradeep Satyanarayana's message of "Fri,
	27 Apr 2007 18:51:14 -0600")
References: <OF88F0F777.9937D2C9-ON882572CB.00013288-882572CB.0004BCA0@us.ibm.com>
Message-ID: <adazm4ld95c.fsf@cisco.com>

 > +#define IPOIB_CM_OP_NOSRQ (1ul << 29)

I don't understand the point of this... the only places you do anything
with it are:

 > +       priv->cm.rx_wr.wr_id = wr_id << 32 | index | IPOIB_CM_OP_NOSRQ;
 > +       index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ) & NOSRQ_INDEX_MASK ;
 > +       if ((wc->wr_id & IPOIB_CM_OP_SRQ) || (wc->wr_id & IPOIB_CM_OP_NOSRQ))

so probably the most sensible thing to do is just to rename
IPOIB_CM_OP_SRQ to IPOIB_OP_CM_RECV.

 > +/* These two go hand in hand */
 > +#define NOSRQ_INDEX_RING_SIZE 1024
 > +#define NOSRQ_INDEX_MASK      0x00000000000003ff

Rather than having a comment, I would just do

#define NOSRQ_INDEX_RING_SIZE 1024
#define NOSRQ_INDEX_MASK      (NOSRQ_INDEX_RING_SIZE - 1)

also I think the RING name is wrong -- it's not a ring, it's a table,
right?  I don't like having a static limit on the number of nosrq
connections; could this be a hash table instead?
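
The derived-mask suggestion above can be sketched as follows.  This is a
user-space illustration only: nosrq_index() is a hypothetical helper, and
static_assert stands in for what kernel code would normally express with
BUILD_BUG_ON().

```c
#include <assert.h>
#include <stdint.h>

#define NOSRQ_INDEX_RING_SIZE 1024
#define NOSRQ_INDEX_MASK      (NOSRQ_INDEX_RING_SIZE - 1)

/* The "size - 1" mask is only valid when the size is a power of two. */
static_assert((NOSRQ_INDEX_RING_SIZE & (NOSRQ_INDEX_RING_SIZE - 1)) == 0,
	      "NOSRQ_INDEX_RING_SIZE must be a power of two");

/* Hypothetical helper: extract the table index from a 64-bit wr_id. */
static inline uint64_t nosrq_index(uint64_t wr_id)
{
	return wr_id & NOSRQ_INDEX_MASK;
}
```

Deriving the mask means the two constants cannot drift apart, and the
compile-time check catches a size that would make the mask meaningless.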

Some of these changes seem strange:

 > -               priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i];
 > +               priv->cm.rx_sge[i].addr = 
 > +               priv->cm.srq_ring[id].mapping[i];

 > -                                     priv->cm.srq_ring[id].mapping);
 > +                                     priv->cm.srq_ring[id].mapping);

please try to put any changes like this that you want to make into a
separate patch.

 > +       if (priv->cm.srq) 
 > +               priv->cm.srq_ring[id].skb = skb;
 > +       else {
 > +               index = id  & NOSRQ_INDEX_MASK ;
 > +               wr_id = id >> 32;
 > +               spin_lock_irqsave(&priv->lock, flags);
 > +               rx_ptr = priv->cm.rx_index_ring[index];
 > +               spin_unlock_irqrestore(&priv->lock, flags);
 > +
 > +               rx_ptr->rx_ring[wr_id].skb = skb;

why does the nosrq case need to take a lock when the srq case doesn't?
A comment would be welcome here...
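
For reference, the encoding that the quoted hunk decodes (table index in
the low bits, per-connection ring slot in the upper 32 bits) could be
sketched as below.  The helper names are made up for this example, and the
sketch leaves out the IPOIB_CM_OP_* flag bits.  One detail worth noting:
if the slot is a plain 32-bit int, as "i << 32 | index" elsewhere in the
patch suggests, the value has to be widened before shifting, because
shifting a 32-bit integer by 32 is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

#define NOSRQ_INDEX_RING_SIZE 1024
#define NOSRQ_INDEX_MASK      (NOSRQ_INDEX_RING_SIZE - 1)

/* Hypothetical helper: build a wr_id from a ring slot and a table index. */
static inline uint64_t nosrq_pack_wr_id(uint32_t slot, uint32_t index)
{
	assert(index <= (uint32_t)NOSRQ_INDEX_MASK);
	/* Widen before shifting; a 32-bit shift by 32 would be undefined. */
	return ((uint64_t)slot << 32) | index;
}

/* Hypothetical helper: recover the table index (low bits). */
static inline uint32_t nosrq_unpack_index(uint64_t wr_id)
{
	return (uint32_t)(wr_id & NOSRQ_INDEX_MASK);
}

/* Hypothetical helper: recover the ring slot (upper 32 bits). */
static inline uint32_t nosrq_unpack_slot(uint64_t wr_id)
{
	return (uint32_t)(wr_id >> 32);
}
```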

 > -               .srq = priv->cm.srq,
 > 
 > +       if (priv->cm.srq)
 > +               attr.srq = priv->cm.srq;
 > +       else
 > +               attr.srq = NULL;

isn't the code you use to replace the assignment just an obfuscated
version of the original assignment?

 > -       rep.srq = 1;

 > +       if (priv->cm.srq)
 > +               rep.srq = 1;
 > +       else
 > +               rep.srq = 0;

similarly I would rather see "rep.srq = !!priv->cm.srq"
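
A tiny user-space sketch of the "!!" idiom, with hypothetical stand-in
structs (fake_srq, fake_rep) rather than the real IPoIB types:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fields touched by the patch. */
struct fake_srq { int unused; };
struct fake_rep { unsigned int srq; };

static void fill_rep(struct fake_rep *rep, const struct fake_srq *srq)
{
	assert(rep != NULL);
	/* !! collapses any non-NULL pointer to 1 and NULL to 0. */
	rep->srq = !!srq;
}
```

The single assignment behaves the same as the four-line if/else in the
patch, which is why the shorter form reads better.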

 > +       /* Allocate space for the rx_ring here */
 > +       p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring,
 > +                            GFP_KERNEL);

in general comments are good, but I don't like to see redundancy like:

	/* do something here */
        do_something();

 > +       if ( index == NOSRQ_INDEX_RING_SIZE) {

no space between ( and index please.

 > +               printk(KERN_WARNING "NOSRQ supports a max of %d RC "
 > +                      "QPs. That limit has now been reached\n",
 > +                      NOSRQ_INDEX_RING_SIZE);

ipoib_warn() instead of printk?  Also isn't this going to flood logs
if the remote side keeps trying to connect?

 > +       ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
 > +       if (ret) {
 > +               ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret);
 > +               goto err_modify_nosrq;
 > +       }

It's good to goto to unwind code, but in this case you just have a
return at err_modify_nosrq -- why not just return directly?  However
you seem to leak rx_ring here, so it would be better to use the unwind
code more consistently instead of using return later.
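
A minimal user-space sketch of the unwind style suggested above, using
calloc()/free() in place of the kernel allocators.  The names rx_ctx,
rx_ctx_init() and fake_step() are hypothetical and only serve to show the
shape; this is not code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct rx_ctx {
	int *ring;
	int *bufs;
};

/* Stand-in for a setup step that may fail (e.g. modify QP, post receive). */
static int fake_step(int fail)
{
	return fail ? -EIO : 0;
}

static int rx_ctx_init(struct rx_ctx *p, size_t n, int fail_modify, int fail_post)
{
	int ret;

	assert(p != NULL && n > 0);
	p->ring = NULL;
	p->bufs = NULL;

	p->ring = calloc(n, sizeof *p->ring);
	if (!p->ring)
		return -ENOMEM;

	ret = fake_step(fail_modify);
	if (ret)
		goto err_free_ring;

	p->bufs = calloc(n, sizeof *p->bufs);
	if (!p->bufs) {
		ret = -ENOMEM;
		goto err_free_ring;
	}

	ret = fake_step(fail_post);
	if (ret)
		goto err_free_bufs;

	return 0;

	/* Labels release resources in reverse order of acquisition. */
err_free_bufs:
	free(p->bufs);
	p->bufs = NULL;
err_free_ring:
	free(p->ring);
	p->ring = NULL;
	return ret;
}
```

Because every failure path funnels through the same labels, a later step
cannot forget to free something an earlier step allocated.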

 > +       for (i = 0; i < ipoib_recvq_size; ++i) {
 > +               if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index,
 > +                                          IPOIB_CM_RX_SG - 1,
 > +                                          p->rx_ring[i].mapping)) {
 > +                       ipoib_warn(priv, "failed to allocate receive "
 > +                                  "buffer %d\n", i);
 > +                       ipoib_cm_dev_cleanup(dev);
 > +                       /* Free rx_ring previously allocated */

this comment doesn't tell me anything I couldn't see for myself

 > +                       kfree(p->rx_ring);
 > +                       return -ENOMEM;
 > +               }
 > +
 > +               /* Can we call the nosrq version? */
 > +               if (ipoib_cm_post_receive(dev, i << 32 | index)) {
 > +                       ipoib_warn(priv, "ipoib_ib_post_receive "
 > +                                  "failed for  buf %d\n", i);
 > +                       ipoib_cm_dev_cleanup(dev);

seems like you're missing the call to kfree(p->rx_ring) here?
this code could probably benefit from a goto to unwind code.

 > +                       return -EIO;
 > +               }


 > +       } /* end for */

and another useless comment here...

 > +       if (priv->cm.srq == NULL) { /* NOSRQ */

I prefer "if (!priv->cm.srq)" to "== NULL".  Also I don't think this
comment tells me anything.

 > +               psn = random32() & 0xffffff;
 > +               if (ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn))
 > +                       goto err_modify;
 > +       } else { /* SRQ */
 > +               p->rx_ring = NULL; /* This is used only by NOSRQ */
 > +               psn = random32() & 0xffffff;
 > +               ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
 > +               if (ret)
 > +                       goto err_modify;
 > +       }

move the psn assignment out of the if?  

 > +       struct ipoib_dev_priv *priv = netdev_priv(dev);
 > +       int ret;
 > + 
 > +

please avoid double spacing.

 > -       attr.srq = priv->cm.srq;
 > +       if (priv->cm.srq)
 > +               attr.srq = priv->cm.srq;
 > +       else
 > +               attr.srq = NULL;

adding the if seems like pure obfuscation here...

 > +       if (attr.max_srq)
 > +               supports_srq = 1; /* This device supports SRQ */
 > +       else {
 > +               supports_srq = 0; 
 >         }

I don't see what the supports_srq temporary variable buys you.  Also
please don't put { } around one-line blocks.

 > +               priv->cm.rx_index_ring = kzalloc(NOSRQ_INDEX_RING_SIZE * 
 > +                                        sizeof *priv->cm.rx_index_ring,
 > +                                        GFP_KERNEL);
 > +       } 

Handle allocation failure here?

 - R.


From etta at systemfabricworks.com  Thu May  3 12:26:05 2007
From: etta at systemfabricworks.com (Chieng Etta)
Date: Thu, 3 May 2007 14:26:05 -0500
Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
In-Reply-To: <D6A583C768392A4D8B297C500CDD54B5017699F0@mse11be1.mse11.exchange.ms>
Message-ID: <006f01c78db8$e57d1ff0$c801a8c0@ettac>

Hi Steffen,

After removing all the OFED packages by using ./uninstall.sh, I tried
./build.sh to build the RPMs, then installed libibverbs-1.1-0.x86_64.rpm onto
the system.  "libibverbs.so.1.0.0" was installed under the right directories
(/usr/lib and /usr/lib64).  Please see the output below.  
Thanks,
Etta

[root at sfw1 etc]# cat /etc/*release
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
[root at sfw1 etc]# uname -a
Linux sfw1.sfw.int 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

[root at sfw1 lib64]# pwd
/usr/lib64
[root at sfw1 lib64]# ll libibverbs*
ls: libibverbs*: No such file or directory

[root at sfw1 lib64]# rpm -aq |grep libibverbs

[root at sfw1 lib64]# cd - 
/root/images/OFED-1.2-rc2/RPMS/redhat-release-4AS-5.5
[root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0
/usr/lib64/libibverbs.so.1
/usr/lib64/libibverbs.so.1.0.0

[root at sfw1 redhat-release-4AS-5.5]# rpm -ivh libibverbs-1.1-0.x86_64.rpm
Preparing...             ########################################### [100%]
   1:libibverbs          ########################################### [100%]

[root at sfw1 redhat-release-4AS-5.5]# rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
x86_64

[root at sfw1 redhat-release-4AS-5.5]# cd -
/usr/lib64
[root at sfw1 lib64]# rpm -aq |grep libibverbs
libibverbs-1.1-0

[root at sfw1 lib64]# ll libibverbs*
lrwxrwxrwx  1 root root     19 May  3 13:50 libibverbs.so.1 -> libibverbs.so.1.0.0
-rwxr-xr-x  1 root root 200993 May  3 13:18 libibverbs.so.1.0.0

[root at sfw1 lib64]# file libibverbs.so.1.0.0
libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped

[root at sfw1 lib]# cd /usr/lib
[root at sfw1 lib]# file libibverbs.so.1.0.0
libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), not stripped

[root at sfw1 etc]# cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
/usr/ofed/lib64

[root at sfw1 etc]# cat /etc/ld.so.conf.d/ofed.conf
/usr/lib64
/usr/lib
    

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org
[mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Steffen Persvold
Sent: Thursday, May 03, 2007 10:26 AM
To: vlad at dev.mellanox.co.il
Cc: openfabrics-ewg at openib.org; openib-general at openib.org
Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64

Vladimir,

Nope. Still the same issue. The RPMs will only contain one set of
libraries and it is always in /usr/lib (if I set the build_32bit=0
option I get the 64bit libraries but in the wrong directory).

Seriously, am I the only one seeing this ? I would think rhel4 u4 was a
very normal test platform ?

Cheers,

Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter


> -----Original Message-----
> From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il]
> Sent: Thursday, May 03, 2007 9:07 AM
> To: Steffen Persvold
> Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> 
> Please see if this happens in OFED-1.2-20070503-0600.
> But first uninstall the previous OFED version with ofed_uninstall.sh
> command.
> 
> Thanks,
> 
> Regards,
> Vladimir
> 
> On Wed, 2007-05-02 at 11:30 -0400, Steffen Persvold wrote:
> > Hmm,
> >
> > so I tried something. I put :
> >
> > build_32bit=0
> >
> > into my ofed.conf file and rebuilt (build.sh -c ofed.conf). This time
> > it built 64bit libraries, but it puts them in the wrong directory :
> >
> > # rpm -qpl ../libibverbs-1.1-0.x86_64.rpm
> > /etc/ld.so.conf.d/ofed.conf
> > /usr/lib/libibverbs.so.1
> > /usr/lib/libibverbs.so.1.0.0
> >
> > # file /usr/lib/libibverbs.so.1.0.0
> > /usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD
> > x86-64, version 1 (SYSV), not stripped
> >
> > So what's up ??
> >
> > Cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
______________________________________________________________________
> > From: Steffen Persvold
> > Sent: Wed 5/2/2007 10:30 AM
> > To: Steffen Persvold; Vladimir Sokolovsky
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Also,
> >
> > If I look at the /etc/ld.so.conf/ofed.conf file I have :
> >
> > # cat ofed.conf
> > /usr/lib
> > /usr/lib
> >
> >
> > which seems kinda weird ? :)
> >
> > Cheers,
> >
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
______________________________________________________________________
> > From: ewg-bounces at lists.openfabrics.org on behalf of Steffen Persvold
> > Sent: Wed 5/2/2007 10:20 AM
> > To: Vladimir Sokolovsky
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Nope :
> >
> >
> > [redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
> > /etc/ld.so.conf.d/ofed.conf
> > /usr/lib/libibverbs.so.1
> > /usr/lib/libibverbs.so.1.0.0
> > [redhat-release-4ES-5.5]#
> >
> > So the RPM got built, but without 64bit libraries. Now if it was the
> > other way around (i.e no 32bit libraries) I could have understood it
> > (as 32bit is an option on x86_64), but not having the native 64bit
> > libraries is not so easy to understand :)
> >
> > cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
______________________________________________________________________
> > From: ewg-bounces at lists.openfabrics.org on behalf of Vladimir
> > Sokolovsky
> > Sent: Wed 5/2/2007 10:05 AM
> > To: Steffen Persvold
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Don't you have /usr/lib64/libibverbs.so.1.0.0?
> >
> > Regards,
> > Vladimir
> >
> > On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote:
> > > Folks,
> > >
> > > I used the build.sh script to build the above mentioned packages on
> > > rhel4u4 x86_64, but for some reason it only compiles 32bit libraries
> > > (even if the packages are named x86_64) :
> > >
> > > # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
> > > x86_64
> > >
> > > (after installing it) :
> > >
> > > # file /usr/lib/libibverbs.so.1.0.0
> > > /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel
> > > 80386, version 1 (SYSV), not stripped
> > >
> > > What did I do wrong ??
> > >
> > > Cheers,
> > > Steffen Persvold
> > > Technical Director Americas
> > > tel. 508-281-7100 x401
> > > fax. 508-281-7171
> > >
> > > http://www.scali.com/
> > > Scaling the Linux datacenter
> > > _______________________________________________
> > > ewg mailing list
> > > ewg at lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> >


From mst at dev.mellanox.co.il  Thu May  3 12:29:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 22:29:06 +0300
Subject: [ofa-general] Re: [PATCH 2 of 3] IB/mthca: support multiple
	completion vectors
In-Reply-To: <ada8xc5eoso.fsf@cisco.com>
References: <20070503104924.GE10009@mellanox.co.il> <ada8xc5eoso.fsf@cisco.com>
Message-ID: <20070503192906.GE9719@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors
> 
>  > I don't know how many vectors make sense, so I decided to be
>  > conservative here, since each EQ consumes a lot of memory by default.
> 
> I think #vectors == O(#CPUs) is what we should aim for.

What do you suggest to do about memory scalability?

-- 
MST


From rdreier at cisco.com  Thu May  3 12:37:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 12:37:12 -0700
Subject: [ofa-general] Re: [PATCH 2 of 3] IB/mthca: support multiple
	completion vectors
In-Reply-To: <20070503192906.GE9719@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 3 May 2007 22:29:06 +0300")
References: <20070503104924.GE10009@mellanox.co.il>
	<ada8xc5eoso.fsf@cisco.com> <20070503192906.GE9719@mellanox.co.il>
Message-ID: <adar6pxd7h3.fsf@cisco.com>

 > > I think #vectors == O(#CPUs) is what we should aim for.
 > 
 > What do you suggest to do about memory scalability?

Right now we default to 64K CQs, so each completion EQ uses 64K * 32 bytes,
which is 2 MB.  Which is a lot but not a killer on modern servers.
And I would expect memory to scale as O(#CPUs) too.  So I think 1 or 2
completion EQs per CPU is the right amount.
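
The arithmetic can be written out as a small user-space sketch.  The two
constants come from the figures above; comp_eq_bytes() and its parameters
are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Figures from the message: 64K entries per completion EQ, 32 bytes each. */
#define EQ_ENTRIES     (64 * 1024)
#define EQ_ENTRY_BYTES 32

/* Hypothetical helper: total bytes consumed by all completion EQs. */
static uint64_t comp_eq_bytes(unsigned int ncpus, unsigned int eqs_per_cpu)
{
	assert(eqs_per_cpu > 0);
	/* Widen first so the multiplication is done in 64 bits. */
	return (uint64_t)ncpus * eqs_per_cpu * EQ_ENTRIES * EQ_ENTRY_BYTES;
}
```

For example, 8 CPUs with 2 completion EQs each would come to 32 MB under
these assumptions.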

I guess it would be nice if there was something tricky we could do to
adjust things on the fly but I don't think it's worth getting too tricky.

 - R.


From ralphc at pathscale.com  Thu May  3 12:40:51 2007
From: ralphc at pathscale.com (Ralph Campbell)
Date: Thu, 03 May 2007 12:40:51 -0700
Subject: [ofa-general] IB/ipath - fix two more spin lock problems
Message-ID: <1178221251.3407.111.camel@brick.pathscale.com>

IB/ipath - fix two more spin lock problems

Fix a missing unlock in ipath_rc_rcv_resp()
and remove an extra unlock from ipath_rc_rcv_error().

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>

diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c
index 9e68c91..f4d729d 100644
--- a/drivers/infiniband/hw/ipath/ipath_rc.c
+++ b/drivers/infiniband/hw/ipath/ipath_rc.c
@@ -1257,6 +1257,7 @@ ack_err:
 	wc.dlid_path_bits = 0;
 	wc.port_num = 0;
 	ipath_sqerror_qp(qp, &wc);
+	spin_unlock_irqrestore(&qp->s_lock, flags);
 bail:
 	return;
 }
@@ -1436,7 +1437,6 @@ static inline int ipath_rc_rcv_error(struct ipath_ibdev *dev,
 		break;
 	}
 	qp->r_nak_state = 0;
-	spin_unlock_irq(&qp->s_lock);
 	tasklet_hi_schedule(&qp->s_task);
 
 unlock_done:


From ralph.campbell at qlogic.com  Thu May  3 12:43:03 2007
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 03 May 2007 12:43:03 -0700
Subject: [ofa-general] IB/ipath - fix a race condition when generating ACKs
Message-ID: <1178221383.3407.115.camel@brick.pathscale.com>

IB/ipath - fix a race condition when generating ACKs

Fix a problem where simple ACKs can be sent ahead of RDMA read responses
thus implicitly NAKing the RDMA read.

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>
Signed-off-by: Robert Walsh <robert.walsh at qlogic.com>

diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c
index f4d729d..1915771 100644
--- a/drivers/infiniband/hw/ipath/ipath_rc.c
+++ b/drivers/infiniband/hw/ipath/ipath_rc.c
@@ -98,13 +98,21 @@ static int ipath_make_rc_ack(struct ipath_qp *qp,
 	case OP(RDMA_READ_RESPONSE_LAST):
 	case OP(RDMA_READ_RESPONSE_ONLY):
 	case OP(ATOMIC_ACKNOWLEDGE):
-		qp->s_ack_state = OP(ACKNOWLEDGE);
+		/*
+		 * We can increment the tail pointer now that the last
+		 * response has been sent instead of only being
+		 * constructed.
+		 */
+		if (++qp->s_tail_ack_queue > IPATH_MAX_RDMA_ATOMIC)
+			qp->s_tail_ack_queue = 0;
 		/* FALLTHROUGH */
+	case OP(SEND_ONLY):
 	case OP(ACKNOWLEDGE):
 		/* Check for no next entry in the queue. */
 		if (qp->r_head_ack_queue == qp->s_tail_ack_queue) {
 			if (qp->s_flags & IPATH_S_ACK_PENDING)
 				goto normal;
+			qp->s_ack_state = OP(ACKNOWLEDGE);
 			goto bail;
 		}
 
@@ -117,12 +125,8 @@ static int ipath_make_rc_ack(struct ipath_qp *qp,
 			if (len > pmtu) {
 				len = pmtu;
 				qp->s_ack_state = OP(RDMA_READ_RESPONSE_FIRST);
-			} else {
+			} else
 				qp->s_ack_state = OP(RDMA_READ_RESPONSE_ONLY);
-				if (++qp->s_tail_ack_queue >
-				    IPATH_MAX_RDMA_ATOMIC)
-					qp->s_tail_ack_queue = 0;
-			}
 			ohdr->u.aeth = ipath_compute_aeth(qp);
 			hwords++;
 			qp->s_ack_rdma_psn = e->psn;
@@ -139,8 +143,6 @@ static int ipath_make_rc_ack(struct ipath_qp *qp,
 				cpu_to_be32(e->atomic_data);
 			hwords += sizeof(ohdr->u.at) / sizeof(u32);
 			bth2 = e->psn;
-			if (++qp->s_tail_ack_queue > IPATH_MAX_RDMA_ATOMIC)
-				qp->s_tail_ack_queue = 0;
 		}
 		bth0 = qp->s_ack_state << 24;
 		break;
@@ -156,8 +158,6 @@ static int ipath_make_rc_ack(struct ipath_qp *qp,
 			ohdr->u.aeth = ipath_compute_aeth(qp);
 			hwords++;
 			qp->s_ack_state = OP(RDMA_READ_RESPONSE_LAST);
-			if (++qp->s_tail_ack_queue > IPATH_MAX_RDMA_ATOMIC)
-				qp->s_tail_ack_queue = 0;
 		}
 		bth0 = qp->s_ack_state << 24;
 		bth2 = qp->s_ack_rdma_psn++ & IPATH_PSN_MASK;
@@ -171,7 +171,7 @@ static int ipath_make_rc_ack(struct ipath_qp *qp,
 		 * the ACK before setting s_ack_state to ACKNOWLEDGE
 		 * (see above).
 		 */
-		qp->s_ack_state = OP(ATOMIC_ACKNOWLEDGE);
+		qp->s_ack_state = OP(SEND_ONLY);
 		qp->s_flags &= ~IPATH_S_ACK_PENDING;
 		qp->s_cur_sge = NULL;
 		if (qp->s_nak_state)
@@ -223,7 +223,7 @@ int ipath_make_rc_req(struct ipath_qp *qp,
 	/* Sending responses has higher priority over sending requests. */
 	if ((qp->r_head_ack_queue != qp->s_tail_ack_queue ||
 	     (qp->s_flags & IPATH_S_ACK_PENDING) ||
-	     qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE) &&
+	     qp->s_ack_state != OP(ACKNOWLEDGE)) &&
 	    ipath_make_rc_ack(qp, ohdr, pmtu, bth0p, bth2p))
 		goto done;
 
@@ -585,7 +585,9 @@ static void send_rc_ack(struct ipath_qp *qp)
 	unsigned long flags;
 
 	/* Don't send ACK or NAK if a RDMA read or atomic is pending. */
-	if (qp->r_head_ack_queue != qp->s_tail_ack_queue)
+	if (qp->r_head_ack_queue != qp->s_tail_ack_queue ||
+	    (qp->s_flags & IPATH_S_ACK_PENDING) ||
+	    qp->s_ack_state != OP(ACKNOWLEDGE))
 		goto queue_ack;
 
 	/* Construct the header. */


From mst at dev.mellanox.co.il  Thu May  3 12:46:37 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 22:46:37 +0300
Subject: [ofa-general] Re: [PATCH 2 of 3] IB/mthca: support multiple
	completion vectors
In-Reply-To: <adar6pxd7h3.fsf@cisco.com>
References: <20070503104924.GE10009@mellanox.co.il> <ada8xc5eoso.fsf@cisco.com>
	<20070503192906.GE9719@mellanox.co.il> <adar6pxd7h3.fsf@cisco.com>
Message-ID: <20070503194637.GF9719@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors
> 
>  > > I think #vectors == O(#CPUs) is what we should aim for.
>  > 
>  > What do you suggest to do about memory scalability?
> 
> Right now we default to 64K CQs, so each completion EQ uses 64K * 32 bytes,
> which is 2 MB.  Which is a lot but not a killer on modern servers.
> And I would expect memory to scale as O(#CPUs) too.  So I think 1 or 2
> completion EQs per CPU is the right amount.

With dual-core, I'm not sure memory scales as fast as #CPUs anymore.

> I guess it would be nice if there was something tricky we could do to
> adjust things on the fly but I don't think it's worth getting too tricky.

Problem is, #CPUs is not a static value anymore.
With CPU hotplug num_possible_cpus is quite often 4 or 8 even though
one actually has a single one present.

-- 
MST


From mst at dev.mellanox.co.il  Thu May  3 12:49:38 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 22:49:38 +0300
Subject: [ofa-general] Re: [PATCH 2 of 3] IB/mthca: support multiple
	completion vectors
In-Reply-To: <ada8xc5eoso.fsf@cisco.com>
References: <20070503104924.GE10009@mellanox.co.il> <ada8xc5eoso.fsf@cisco.com>
Message-ID: <20070503194938.GG9719@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors
> 
>  > I don't know how many vectors make sense, so I decided to be
>  > conservative here, since each EQ consumes a lot of memory by default.
> 
> I think #vectors == O(#CPUs) is what we should aim for.
> 
> Also another useful thing for NUMA systems might be to allocate the EQ
> memory from the CPU where that interrupt will be targeted.

Can't interrupts migrate between CPUs?

> But I
> don't know exactly the best way to do that, and I think we can leave
> that for later.

OTOH, especially for latency, it might be best to only have 1 interrupt,
and service that on the node closest to device.

-- 
MST


From rick.jones2 at hp.com  Thu May  3 13:10:45 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Thu, 03 May 2007 13:10:45 -0700
Subject: [ofa-general] Re: [PATCH 2 of 3] IB/mthca: support
	multiple	completion vectors
In-Reply-To: <20070503194938.GG9719@mellanox.co.il>
References: <20070503104924.GE10009@mellanox.co.il> <ada8xc5eoso.fsf@cisco.com>
	<20070503194938.GG9719@mellanox.co.il>
Message-ID: <463A41C5.6050106@hp.com>

Michael S. Tsirkin wrote:
>>Quoting Roland Dreier <rdreier at cisco.com>:
>>Subject: Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors
>>
>> > I don't know how many vectors make sense, so I decided to be
>> > conservative here, since each EQ consumes a lot of memory by default.
>>
>>I think #vectors == O(#CPUs) is what we should aim for.
>>
>>Also another useful thing for NUMA systems might be to allocate the EQ
>>memory from the CPU where that interrupt will be targeted.
> 
> 
> Can't interrupts migrate between CPUs?

Only if someone leaves the irqbalancer running.  Given that it isn't presently 
NUMA-aware (plusungood) I at least tend to disable it.  Apart from it, then 
generally only under explicit administrator command would an interrupt be 
migrated from one CPU to another.

> 
> OTOH, especially for latency, it might be best to only have 1 interrupt,
> and service that on the node closest to device.
> 

topological proximity is a good thing.  there can be more than 1 core 
"topologically close" to the I/O card.

rickjones


From HNGUYEN at de.ibm.com  Thu May  3 13:20:40 2007
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Thu, 3 May 2007 22:20:40 +0200
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <OFB9815AE0.60DB073D-ON872572D0.005CB2E0-882572D0.005CC7B0@us.ibm.com>
Message-ID: <OFD79C9EC3.EC33A65E-ONC12572D0.006F41B8-C12572D0.006FC160@de.ibm.com>

> > BTW, why do you ignore the option to use UC QP?
> > MST
> Unfortunately, eHCA doesn't support UC in current version. Next
> generation will have RC w/i SRQ support.
Current ehca surely supports UC. Just give ibv_uc_pingpong a try.
Nam


From rdreier at cisco.com  Thu May  3 13:30:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 13:30:00 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <OFD79C9EC3.EC33A65E-ONC12572D0.006F41B8-C12572D0.006FC160@de.ibm.com>
	(Hoang-Nam Nguyen's message of "Thu, 3 May 2007 22:20:40 +0200")
References: <OFD79C9EC3.EC33A65E-ONC12572D0.006F41B8-C12572D0.006FC160@de.ibm.com>
Message-ID: <adaejlxd513.fsf@cisco.com>

 > Current ehca surely supports UC. Just give ibv_uc_pingpong a try.

Thanks... in that case I would definitely suggest investigating using
UC for IPoIB connected mode when SRQs are not available.

In fact assuming the IBA work to add the ability to attach UC QPs to
SRQs is completed, I think we would probably want to move IPoIB CM to
using UC exclusively.

 - R.


From rdreier at cisco.com  Thu May  3 13:31:38 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 13:31:38 -0700
Subject: [ofa-general] IB/ipath - fix two more spin lock problems
In-Reply-To: <1178221251.3407.111.camel@brick.pathscale.com> (Ralph Campbell's
	message of "Thu, 03 May 2007 12:40:51 -0700")
References: <1178221251.3407.111.camel@brick.pathscale.com>
Message-ID: <adaabwld4yd.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Thu May  3 13:33:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 13:33:00 -0700
Subject: [ofa-general] Re: IB/ipath - fix a race condition when generating
	ACKs
In-Reply-To: <1178221383.3407.115.camel@brick.pathscale.com> (Ralph Campbell's
	message of "Thu, 03 May 2007 12:43:03 -0700")
References: <1178221383.3407.115.camel@brick.pathscale.com>
Message-ID: <ada6479d4w3.fsf@cisco.com>

applied, thanks


From xma at us.ibm.com  Thu May  3 13:38:47 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 3 May 2007 13:38:47 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <adaejlxd513.fsf@cisco.com>
Message-ID: <OF274F3EF1.ABBDE6C2-ON872572D0.00715866-882572D0.00716F14@us.ibm.com>


Roland Dreier <rdreier at cisco.com> wrote on 05/03/2007 01:30:00 PM:

>  > Current ehca surely supports UC. Just give ibv_uc_pingpong a try.
>
> Thanks... in that case I would definitely suggest investigating using
> UC for IPoIB connected mode when SRQs are not available.
>
> In fact assuming the IBA work to add the ability to attach UC QPs to
> SRQs is completed, I think we would probably want to move IPoIB CM to
> using UC exclusively.
>
>  - R.

Agree.

Thanks
Shirley Ma


From rdreier at cisco.com  Thu May  3 13:44:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 13:44:07 -0700
Subject: [ofa-general] Last chance to NAK IPoIB NAPI
Message-ID: <adawszpbpt4.fsf@cisco.com>

I think we have consensus on merging IPoIB NAPI for 2.6.22, so I'll
just post the patches one more time and ask Linus to pull tomorrow.
If you don't think we should merge this, please let me know now!

 - R.


From pradeep at us.ibm.com  Thu May  3 13:49:34 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Thu, 3 May 2007 13:49:34 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <adaejlxd513.fsf@cisco.com>
Message-ID: <OFB3F1F7E8.C8AA6081-ON882572D0.00715D85-882572D0.007273D6@us.ibm.com>

general-bounces at lists.openfabrics.org wrote on 05/03/2007 01:30:00 PM:

>  > Current ehca surely supports UC. Just give ibv_uc_pingpong a try.
> 
> Thanks... in that case I would definitely suggest investigating using
> UC for IPoIB connected mode when SRQs are not available.
> 
> In fact assuming the IBA work to add the ability to attach UC QPs to
> SRQs is completed, I think we would probably want to move IPoIB CM to
> using UC exclusively.

Then in the interim, how do we address the interoperability issue between,
say, Topspin and IBM HCAs using connected mode? Switch to datagram mode?

This discussion started from using a non-zero retry count. That is still
easily solvable by patching the CM and/or drivers and resetting the
retry_count back to 0.

Switching to datagram mode will result in too big a performance drop. I am
not inclined to go in that direction.

Pradeep
pradeep at us.ibm.com 


From rdreier at cisco.com  Thu May  3 13:49:44 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 13:49:44 -0700
Subject: [ofa-general] [PATCH 1/2] IB: Return "maybe missed event" hint from
	ib_req_notify_cq()
In-Reply-To: <adawszpbpt4.fsf@cisco.com> (Roland Dreier's message of "Thu,
	03 May 2007 13:44:07 -0700")
References: <adawszpbpt4.fsf@cisco.com>
Message-ID: <adasladbpjr.fsf@cisco.com>

The semantics defined by the InfiniBand specification say that
completion events are only generated when a completion is added to a
completion queue (CQ) after completion notification is requested.  In
other words, this means that the following race is possible:

	while (CQ is not empty)
		ib_poll_cq(CQ);
	// new completion is added after while loop is exited
	ib_req_notify_cq(CQ);
	// no event is generated for the existing completion

To close this race, the IB spec recommends doing another poll of the
CQ after requesting notification.

However, it is not always possible to arrange code this way (for
example, we have found that NAPI for IPoIB cannot poll after
requesting notification).  Also, some hardware (eg Mellanox HCAs)
actually will generate an event for completions added before the call
to ib_req_notify_cq() -- which is allowed by the spec, since there's
no way for any upper-layer consumer to know exactly when a completion
was really added -- so the extra poll of the CQ is just a waste.

Motivated by this, we add a new flag "IB_CQ_REPORT_MISSED_EVENTS" for
ib_req_notify_cq() so that it can return a hint about whether a
completion may have been added before the request for notification.
The return value of ib_req_notify_cq() is extended so:

	 < 0	means an error occurred while requesting notification
	== 0	means notification was requested successfully, and if
		IB_CQ_REPORT_MISSED_EVENTS was passed in, then no
		events were missed and it is safe to wait for another
		event.
	 > 0	is only returned if IB_CQ_REPORT_MISSED_EVENTS was
		passed in.  It means that the consumer must poll the
		CQ again to make sure it is empty to avoid the race
		described above.

We add a flag to enable this behavior rather than turning it on
unconditionally, because checking for missed events may incur
significant overhead for some low-level drivers, and consumers that
don't care about the results of this test shouldn't be forced to pay
for the test.
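
The consumer-side pattern implied by these return values can be sketched with a small user-space model. This is an illustration only, not kernel code: every toy_* name and the "late" field are invented here, the flag values are copied from the enum added by this patch, and the model assumes hardware that follows the spec strictly (no event for a completion added before the CQ is armed).

```c
#include <assert.h>

/* Flag values as defined by the patch; the toy_* names below are invented. */
enum {
	IB_CQ_SOLICITED            = 1 << 0,
	IB_CQ_NEXT_COMP            = 1 << 1,
	IB_CQ_SOLICITED_MASK       = IB_CQ_SOLICITED | IB_CQ_NEXT_COMP,
	IB_CQ_REPORT_MISSED_EVENTS = 1 << 2,
};

struct toy_cq {
	int pending;	/* completions waiting in the queue */
	int armed;	/* notification requested, event not yet generated */
	int events;	/* completion events generated so far */
	int late;	/* completions that land just before the next arm */
};

/* "Hardware" side: an event fires only if the CQ is armed on insertion. */
void toy_add_completion(struct toy_cq *cq)
{
	cq->pending++;
	if (cq->armed) {
		cq->armed = 0;
		cq->events++;
	}
}

/* Remove one completion; returns 1 if one was removed, 0 if empty. */
int toy_poll_cq(struct toy_cq *cq)
{
	if (cq->pending == 0)
		return 0;
	cq->pending--;
	return 1;
}

/* Arm the CQ; with the hint flag, report whether entries are pending. */
int toy_req_notify_cq(struct toy_cq *cq, int flags)
{
	int type = flags & IB_CQ_SOLICITED_MASK;

	if (type != IB_CQ_SOLICITED && type != IB_CQ_NEXT_COMP)
		return -1;
	cq->armed = 1;
	if (flags & IB_CQ_REPORT_MISSED_EVENTS)
		return cq->pending != 0;
	return 0;
}

/* Consumer: drain, arm, and drain again while the hint is > 0.
 * Returns the number of completions handled, or < 0 on error. */
int toy_drain_and_arm(struct toy_cq *cq)
{
	int handled = 0;
	int ret;

	for (;;) {
		while (toy_poll_cq(cq))
			handled++;

		/* Simulated race window: entries added after the last poll. */
		while (cq->late > 0) {
			cq->late--;
			toy_add_completion(cq);
		}

		ret = toy_req_notify_cq(cq, IB_CQ_NEXT_COMP |
					    IB_CQ_REPORT_MISSED_EVENTS);
		if (ret < 0)
			return ret;
		if (ret == 0)
			return handled;	/* safe to wait for the next event */
	}
}
```

With two completions pending and one arriving inside the race window, toy_drain_and_arm() handles all three and leaves the CQ armed, so a later completion generates an event as expected.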

Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
 drivers/infiniband/hw/amso1100/c2.h         |    2 +-
 drivers/infiniband/hw/amso1100/c2_cq.c      |   16 ++++++++---
 drivers/infiniband/hw/cxgb3/cxio_hal.c      |    3 ++
 drivers/infiniband/hw/cxgb3/iwch_provider.c |    8 +++--
 drivers/infiniband/hw/ehca/ehca_iverbs.h    |    2 +-
 drivers/infiniband/hw/ehca/ehca_reqs.c      |   14 +++++++--
 drivers/infiniband/hw/ehca/ipz_pt_fn.h      |    8 +++++
 drivers/infiniband/hw/ipath/ipath_cq.c      |   15 +++++++---
 drivers/infiniband/hw/ipath/ipath_verbs.h   |    2 +-
 drivers/infiniband/hw/mthca/mthca_cq.c      |   12 +++++---
 drivers/infiniband/hw/mthca/mthca_dev.h     |    4 +-
 include/rdma/ib_verbs.h                     |   40 +++++++++++++++++++++------
 12 files changed, 93 insertions(+), 33 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h
index 04a9db5..fa58200 100644
--- a/drivers/infiniband/hw/amso1100/c2.h
+++ b/drivers/infiniband/hw/amso1100/c2.h
@@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq);
 extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index);
 extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index);
 extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry);
-extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags);
 
 /* CM */
 extern int c2_llp_connect(struct iw_cm_id *cm_id,
diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index 5175c99..d2b3366 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -217,17 +217,19 @@ int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry)
 	return npolled;
 }
 
-int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags)
 {
 	struct c2_mq_shared __iomem *shared;
 	struct c2_cq *cq;
+	unsigned long flags;
+	int ret = 0;
 
 	cq = to_c2cq(ibcq);
 	shared = cq->mq.peer;
 
-	if (notify == IB_CQ_NEXT_COMP)
+	if ((notify_flags & IB_CQ_SOLICITED_MASK) == IB_CQ_NEXT_COMP)
 		writeb(C2_CQ_NOTIFICATION_TYPE_NEXT, &shared->notification_type);
-	else if (notify == IB_CQ_SOLICITED)
+	else if ((notify_flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED)
 		writeb(C2_CQ_NOTIFICATION_TYPE_NEXT_SE, &shared->notification_type);
 	else
 		return -EINVAL;
@@ -241,7 +243,13 @@ int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
 	 */
 	readb(&shared->armed);
 
-	return 0;
+	if (notify_flags & IB_CQ_REPORT_MISSED_EVENTS) {
+		spin_lock_irqsave(&cq->lock, flags);
+		ret = !c2_mq_empty(&cq->mq);
+		spin_unlock_irqrestore(&cq->lock, flags);
+	}
+
+	return ret;
 }
 
 static void c2_free_cq_buf(struct c2_dev *c2dev, struct c2_mq *mq)
diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c
index f5e9aee..76049af 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.c
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c
@@ -114,7 +114,10 @@ int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq,
 				return -EIO;
 			}
 		}
+
+		return 1;
 	}
+
 	return 0;
 }
 
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index af28a31..b7a2183 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -292,7 +292,7 @@ static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata)
 #endif
 }
 
-static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags)
 {
 	struct iwch_dev *rhp;
 	struct iwch_cq *chp;
@@ -303,7 +303,7 @@ static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
 
 	chp = to_iwch_cq(ibcq);
 	rhp = chp->rhp;
-	if (notify == IB_CQ_SOLICITED)
+	if ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED)
 		cq_op = CQ_ARM_SE;
 	else
 		cq_op = CQ_ARM_AN;
@@ -317,9 +317,11 @@ static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
 	PDBG("%s rptr 0x%x\n", __FUNCTION__, chp->cq.rptr);
 	err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0);
 	spin_unlock_irqrestore(&chp->lock, flag);
-	if (err)
+	if (err < 0)
 		printk(KERN_ERR MOD "Error %d rearming CQID 0x%x\n", err,
 		       chp->cq.cqid);
+	if (err > 0 && !(flags & IB_CQ_REPORT_MISSED_EVENTS))
+		err = 0;
 	return err;
 }
 
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 95fd59f..9e5460d 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -135,7 +135,7 @@ int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc);
 
 int ehca_peek_cq(struct ib_cq *cq, int wc_cnt);
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify);
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags notify_flags);
 
 struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 			     struct ib_qp_init_attr *init_attr,
diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index 08d3f89..caec9de 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -634,11 +634,13 @@ poll_cq_exit0:
 	return ret;
 }
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify)
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags notify_flags)
 {
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
+	unsigned long spl_flags;
+	int ret = 0;
 
-	switch (cq_notify) {
+	switch (notify_flags & IB_CQ_SOLICITED_MASK) {
 	case IB_CQ_SOLICITED:
 		hipz_set_cqx_n0(my_cq, 1);
 		break;
@@ -649,5 +651,11 @@ int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify)
 		return -EINVAL;
 	}
 
-	return 0;
+	if (notify_flags & IB_CQ_REPORT_MISSED_EVENTS) {
+		spin_lock_irqsave(&my_cq->spinlock, spl_flags);
+		ret = ipz_qeit_is_valid(&my_cq->ipz_queue);
+		spin_unlock_irqrestore(&my_cq->spinlock, spl_flags);
+	}
+
+	return ret;
 }
diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h
index 8199c45..57f141a 100644
--- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h
+++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h
@@ -140,6 +140,14 @@ static inline void *ipz_qeit_get_inc_valid(struct ipz_queue *queue)
 	return cqe;
 }
 
+static inline int ipz_qeit_is_valid(struct ipz_queue *queue)
+{
+	struct ehca_cqe *cqe = ipz_qeit_get(queue);
+	u32 cqe_flags = cqe->cqe_flags;
+
+	return cqe_flags >> 7 == (queue->toggle_state & 1);
+}
+
 /*
  * returns and resets Queue Entry iterator
  * returns address (kv) of first Queue Entry
diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c
index ea78e6d..1eb204c 100644
--- a/drivers/infiniband/hw/ipath/ipath_cq.c
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -340,17 +340,18 @@ int ipath_destroy_cq(struct ib_cq *ibcq)
 /**
  * ipath_req_notify_cq - change the notification type for a completion queue
  * @ibcq: the completion queue
- * @notify: the type of notification to request
+ * @notify_flags: the type of notification to request
  *
  * Returns 0 for success.
  *
  * This may be called from interrupt context.  Also called by
  * ib_req_notify_cq() in the generic verbs code.
  */
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags)
 {
 	struct ipath_cq *cq = to_icq(ibcq);
 	unsigned long flags;
+	int ret = 0;
 
 	spin_lock_irqsave(&cq->lock, flags);
 	/*
@@ -358,9 +359,15 @@ int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
 	 * any other transitions (see C11-31 and C11-32 in ch. 11.4.2.2).
 	 */
 	if (cq->notify != IB_CQ_NEXT_COMP)
-		cq->notify = notify;
+		cq->notify = notify_flags & IB_CQ_SOLICITED_MASK;
+
+	if ((notify_flags & IB_CQ_REPORT_MISSED_EVENTS) &&
+	    cq->queue->head != cq->queue->tail)
+		ret = 1;
+
 	spin_unlock_irqrestore(&cq->lock, flags);
-	return 0;
+
+	return ret;
 }
 
 /**
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h
index 7c4929f..6662380 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h
@@ -740,7 +740,7 @@ struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries,
 
 int ipath_destroy_cq(struct ib_cq *ibcq);
 
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags);
 
 int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata);
 
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index efd79ef..cf0868f 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -726,11 +726,12 @@ repoll:
 	return err == 0 || err == -EAGAIN ? npolled : err;
 }
 
-int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify)
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags)
 {
 	__be32 doorbell[2];
 
-	doorbell[0] = cpu_to_be32((notify == IB_CQ_SOLICITED ?
+	doorbell[0] = cpu_to_be32(((flags & IB_CQ_SOLICITED_MASK) ==
+				   IB_CQ_SOLICITED ?
 				   MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL :
 				   MTHCA_TAVOR_CQ_DB_REQ_NOT)      |
 				  to_mcq(cq)->cqn);
@@ -743,7 +744,7 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify)
 	return 0;
 }
 
-int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags)
 {
 	struct mthca_cq *cq = to_mcq(ibcq);
 	__be32 doorbell[2];
@@ -755,7 +756,8 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
 
 	doorbell[0] = ci;
 	doorbell[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) |
-				  (notify == IB_CQ_SOLICITED ? 1 : 2));
+				  ((flags & IB_CQ_SOLICITED_MASK) ==
+				   IB_CQ_SOLICITED ? 1 : 2));
 
 	mthca_write_db_rec(doorbell, cq->arm_db);
 
@@ -766,7 +768,7 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
 	wmb();
 
 	doorbell[0] = cpu_to_be32((sn << 28)                       |
-				  (notify == IB_CQ_SOLICITED ?
+				  ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ?
 				   MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL :
 				   MTHCA_ARBEL_CQ_DB_REQ_NOT)      |
 				  cq->cqn);
diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index b7e42ef..9bae3cc 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -495,8 +495,8 @@ void mthca_unmap_eq_icm(struct mthca_dev *dev);
 
 int mthca_poll_cq(struct ib_cq *ibcq, int num_entries,
 		  struct ib_wc *entry);
-int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
-int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags);
+int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags);
 int mthca_init_cq(struct mthca_dev *dev, int nent,
 		  struct mthca_ucontext *ctx, u32 pdn,
 		  struct mthca_cq *cq);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 765589f..529a69d 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -431,9 +431,11 @@ struct ib_wc {
 	u8			port_num;	/* valid only for DR SMPs on switches */
 };
 
-enum ib_cq_notify {
-	IB_CQ_SOLICITED,
-	IB_CQ_NEXT_COMP
+enum ib_cq_notify_flags {
+	IB_CQ_SOLICITED			= 1 << 0,
+	IB_CQ_NEXT_COMP			= 1 << 1,
+	IB_CQ_SOLICITED_MASK		= IB_CQ_SOLICITED | IB_CQ_NEXT_COMP,
+	IB_CQ_REPORT_MISSED_EVENTS	= 1 << 2,
 };
 
 enum ib_srq_attr_mask {
@@ -987,7 +989,7 @@ struct ib_device {
 					      struct ib_wc *wc);
 	int                        (*peek_cq)(struct ib_cq *cq, int wc_cnt);
 	int                        (*req_notify_cq)(struct ib_cq *cq,
-						    enum ib_cq_notify cq_notify);
+						    enum ib_cq_notify_flags flags);
 	int                        (*req_ncomp_notif)(struct ib_cq *cq,
 						      int wc_cnt);
 	struct ib_mr *             (*get_dma_mr)(struct ib_pd *pd,
@@ -1414,14 +1416,34 @@ int ib_peek_cq(struct ib_cq *cq, int wc_cnt);
 /**
  * ib_req_notify_cq - Request completion notification on a CQ.
  * @cq: The CQ to generate an event for.
- * @cq_notify: If set to %IB_CQ_SOLICITED, completion notification will
- *   occur on the next solicited event. If set to %IB_CQ_NEXT_COMP,
- *   notification will occur on the next completion.
+ * @flags:
+ *   Must contain exactly one of %IB_CQ_SOLICITED or %IB_CQ_NEXT_COMP
+ *   to request an event on the next solicited event or next work
+ *   completion of any type, respectively. %IB_CQ_REPORT_MISSED_EVENTS
+ *   may also be |ed in to request a hint about missed events, as
+ *   described below.
+ *
+ * Return Value:
+ *    < 0 means an error occurred while requesting notification
+ *   == 0 means notification was requested successfully, and if
+ *        IB_CQ_REPORT_MISSED_EVENTS was passed in, then no events
+ *        were missed and it is safe to wait for another event.  In
+ *        this case it is guaranteed that any work completions added
+ *        to the CQ since the last CQ poll will trigger a completion
+ *        notification event.
+ *    > 0 is only returned if IB_CQ_REPORT_MISSED_EVENTS was passed
+ *        in.  It means that the consumer must poll the CQ again to
+ *        make sure it is empty to avoid missing an event because of a
+ *        race between requesting notification and an entry being
+ *        added to the CQ.  This return value means it is possible
+ *        (but not guaranteed) that a work completion has been added
+ *        to the CQ since the last poll without triggering a
+ *        completion notification event.
  */
 static inline int ib_req_notify_cq(struct ib_cq *cq,
-				   enum ib_cq_notify cq_notify)
+				   enum ib_cq_notify_flags flags)
 {
-	return cq->device->req_notify_cq(cq, cq_notify);
+	return cq->device->req_notify_cq(cq, flags);
 }
 
 /**
-- 
1.5.1.2
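
A note on the enum change in the patch above: the old enum ib_cq_notify had no initializers, so its members were 0 and 1 and could not be ORed with a further flag. The new enum uses distinct bits, and drivers recover the notification type with IB_CQ_SOLICITED_MASK. A small stand-alone sketch of that decoding follows; the enum is copied from the patch, while notify_type() and wants_missed_hint() are hypothetical helpers that do not exist in the kernel.

```c
#include <assert.h>

/* Copied from the include/rdma/ib_verbs.h hunk of the patch. */
enum ib_cq_notify_flags {
	IB_CQ_SOLICITED			= 1 << 0,
	IB_CQ_NEXT_COMP			= 1 << 1,
	IB_CQ_SOLICITED_MASK		= IB_CQ_SOLICITED | IB_CQ_NEXT_COMP,
	IB_CQ_REPORT_MISSED_EVENTS	= 1 << 2,
};

/* Hypothetical helper: return the notification type, or -1 unless exactly
 * one of IB_CQ_SOLICITED / IB_CQ_NEXT_COMP is set, as the kernel-doc
 * comment requires. */
int notify_type(int flags)
{
	int type = flags & IB_CQ_SOLICITED_MASK;

	if (type != IB_CQ_SOLICITED && type != IB_CQ_NEXT_COMP)
		return -1;
	return type;
}

/* Hypothetical helper: did the caller ask for the missed-events hint? */
int wants_missed_hint(int flags)
{
	return (flags & IB_CQ_REPORT_MISSED_EVENTS) != 0;
}
```

Note that not every driver in the patch rejects a malformed flags value: c2_arm_cq() and ehca_req_notify_cq() return -EINVAL, while iwch_arm_cq() and the mthca functions fall back to next-completion behavior.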


From rdreier at cisco.com  Thu May  3 13:50:10 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 13:50:10 -0700
Subject: [ofa-general] IPoIB: Convert to NAPI
In-Reply-To: <adawszpbpt4.fsf@cisco.com> (Roland Dreier's message of "Thu,
	03 May 2007 13:44:07 -0700")
References: <adawszpbpt4.fsf@cisco.com>
Message-ID: <adaodl1bpj1.fsf@cisco.com>

Convert the IP-over-InfiniBand network device driver over to using
NAPI to handle all completions (both receive and send).

Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h      |    1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c   |    2 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c   |   90 +++++++++++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_main.c |    2 +
 4 files changed, 75 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index fd55826..15867af 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -311,6 +311,7 @@ extern struct workqueue_struct *ipoib_workqueue;
 
 /* functions */
 
+int ipoib_poll(struct net_device *dev, int *budget);
 void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr);
 
 struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 0c4e59b..6f78ae0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -416,7 +416,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	skb->dev = dev;
 	/* XXX get correct PACKET_ type here */
 	skb->pkt_type = PACKET_HOST;
-	netif_rx_ni(skb);
+	netif_receive_skb(skb);
 
 repost:
 	if (unlikely(ipoib_cm_post_receive(dev, wr_id)))
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 1bdb910..fbc7371 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -226,7 +226,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		skb->dev = dev;
 		/* XXX get correct PACKET_ type here */
 		skb->pkt_type = PACKET_HOST;
-		netif_rx_ni(skb);
+		netif_receive_skb(skb);
 	} else {
 		ipoib_dbg_data(priv, "dropping loopback packet\n");
 		dev_kfree_skb_any(skb);
@@ -280,28 +280,64 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 			   wc->status, wr_id, wc->vendor_err);
 }
 
-static void ipoib_ib_handle_wc(struct net_device *dev, struct ib_wc *wc)
+int ipoib_poll(struct net_device *dev, int *budget)
 {
-	if (wc->wr_id & IPOIB_CM_OP_SRQ)
-		ipoib_cm_handle_rx_wc(dev, wc);
-	else if (wc->wr_id & IPOIB_OP_RECV)
-		ipoib_ib_handle_rx_wc(dev, wc);
-	else
-		ipoib_ib_handle_tx_wc(dev, wc);
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int max = min(*budget, dev->quota);
+	int done;
+	int t;
+	int empty;
+	int n, i;
+
+	done  = 0;
+	empty = 0;
+
+	while (max) {
+		t = min(IPOIB_NUM_WC, max);
+		n = ib_poll_cq(priv->cq, t, priv->ibwc);
+
+		for (i = 0; i < n; ++i) {
+			struct ib_wc *wc = priv->ibwc + i;
+
+			if (wc->wr_id & IPOIB_CM_OP_SRQ) {
+				++done;
+				--max;
+				ipoib_cm_handle_rx_wc(dev, wc);
+			} else if (wc->wr_id & IPOIB_OP_RECV) {
+				++done;
+				--max;
+				ipoib_ib_handle_rx_wc(dev, wc);
+			} else
+				ipoib_ib_handle_tx_wc(dev, wc);
+		}
+
+		if (n != t) {
+			empty = 1;
+			break;
+		}
+	}
+
+	dev->quota -= done;
+	*budget    -= done;
+
+	if (empty) {
+		netif_rx_complete(dev);
+		if (unlikely(ib_req_notify_cq(priv->cq,
+					      IB_CQ_NEXT_COMP |
+					      IB_CQ_REPORT_MISSED_EVENTS))) {
+			netif_rx_reschedule(dev, 0);
+			return 1;
+		}
+
+		return 0;
+	}
+
+	return 1;
 }
 
 void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr)
 {
-	struct net_device *dev = (struct net_device *) dev_ptr;
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	int n, i;
-
-	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
-	do {
-		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc);
-		for (i = 0; i < n; ++i)
-			ipoib_ib_handle_wc(dev, priv->ibwc + i);
-	} while (n == IPOIB_NUM_WC);
+	netif_rx_schedule(dev_ptr);
 }
 
 static inline int post_send(struct ipoib_dev_priv *priv,
@@ -514,9 +550,10 @@ int ipoib_ib_dev_stop(struct net_device *dev)
 	struct ib_qp_attr qp_attr;
 	unsigned long begin;
 	struct ipoib_tx_buf *tx_req;
-	int i;
+	int i, n;
 
 	clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags);
+	netif_poll_disable(dev);
 
 	ipoib_cm_dev_stop(dev);
 
@@ -568,6 +605,18 @@ int ipoib_ib_dev_stop(struct net_device *dev)
 			goto timeout;
 		}
 
+		do {
+			n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
+			for (i = 0; i < n; ++i) {
+				if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
+					ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
+				else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
+					ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
+				else
+					ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
+			}
+		} while (n == IPOIB_NUM_WC);
+
 		msleep(1);
 	}
 
@@ -596,6 +645,9 @@ timeout:
 		msleep(1);
 	}
 
+	netif_poll_enable(dev);
+	ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP);
+
 	return 0;
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index b4c380c..0a428f2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -948,6 +948,8 @@ static void ipoib_setup(struct net_device *dev)
 	dev->hard_header 	 = ipoib_hard_header;
 	dev->set_multicast_list  = ipoib_set_mcast_list;
 	dev->neigh_setup         = ipoib_neigh_setup_dev;
+	dev->poll                = ipoib_poll;
+	dev->weight              = 100;
 
 	dev->watchdog_timeo 	 = HZ;
 
-- 
1.5.1.2
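
The budget accounting in ipoib_poll() above can be modeled in user space. In the patch, the CQ is polled in batches of at most IPOIB_NUM_WC, only receive completions count against the budget, and a short batch means the CQ was drained. The sketch below reproduces just that loop under simplifying assumptions: all toy_* names and TOY_NUM_WC are invented, the dev->quota bookkeeping is folded into a single budget argument, and the re-arm step (netif_rx_complete() followed by ib_req_notify_cq() with IB_CQ_REPORT_MISSED_EVENTS) is left out.

```c
#include <assert.h>

#define TOY_NUM_WC 4	/* batch size; stands in for IPOIB_NUM_WC */

struct toy_wc {
	int is_recv;	/* 1 for a receive completion, 0 for a send */
};

struct toy_cq {
	const struct toy_wc *entries;
	int count;	/* total entries in the model queue */
	int next;	/* index of the next entry to hand out */
};

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Copy up to 'max' completions into 'out'; returns how many were copied. */
int toy_poll_cq(struct toy_cq *cq, int max, struct toy_wc *out)
{
	int n = 0;

	while (n < max && cq->next < cq->count)
		out[n++] = cq->entries[cq->next++];
	return n;
}

/* Returns 1 if there may be more work, 0 if the CQ was drained.
 * *rx_done receives the number of receive completions handled. */
int toy_poll(struct toy_cq *cq, int budget, int *rx_done)
{
	struct toy_wc batch[TOY_NUM_WC];
	int max = budget;
	int done = 0;
	int empty = 0;
	int t, n, i;

	while (max) {
		t = min_int(TOY_NUM_WC, max);
		n = toy_poll_cq(cq, t, batch);

		for (i = 0; i < n; ++i) {
			if (batch[i].is_recv) {
				++done;
				--max;
			}
			/* send completions are handled but cost no budget */
		}

		if (n != t) {	/* short batch: nothing left in the CQ */
			empty = 1;
			break;
		}
	}

	*rx_done = done;
	return !empty;
}
```

With a queue of four receives and two sends and a budget of 2, the first call stops after two receives and asks to be polled again; a second call with a larger budget drains the rest.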


From rdreier at cisco.com  Thu May  3 13:52:23 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 13:52:23 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <OFB3F1F7E8.C8AA6081-ON882572D0.00715D85-882572D0.007273D6@us.ibm.com>
	(Pradeep Satyanarayana's message of "Thu,
	3 May 2007 13:49:34 -0700")
References: <OFB3F1F7E8.C8AA6081-ON882572D0.00715D85-882572D0.007273D6@us.ibm.com>
Message-ID: <adak5vpbpfc.fsf@cisco.com>

 > Then in the interim, how do we address the interoperability issue between,
 > say, Topspin and IBM HCAs using connected mode? Switch to datagram mode?

I think that we need to understand the root cause of the issue and try
to solve it without going to a non-zero retry count if possible.
Right now the theory is that it is a bug in the CM, namely that the
respective HCA ack delays are not taken into account.  So we should
fix that bug.

 - R.


From tziporet at dev.mellanox.co.il  Thu May  3 14:07:50 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 03 May 2007 14:07:50 -0700
Subject: [ofa-general] OFED 1.2 RC3 is delayed for Monday next week (May 7)
Message-ID: <463A4F26.3010804@mellanox.co.il>

Hi All,
Since some of the critical bugs are not solved yet we decided to delay 
the release to Monday May 7.

This is the list of critical bugs that should be fixed for RC3:
bug_id 	bug_severity 	assigned_to 	short_short_desc
574 	blocker 	raisch at de.ibm.com 	ehca driver fails while running openmpi
420 	critical 	monil at voltaire.com 	PKey table reordering caused by SM failover stops ipoib traffic
577 	critical 	rolandd at cisco.com 	SRP multipath failover too slow (minutes, not seconds)
465 	critical 	mst at mellanox.co.il 	IPoIB HA fails after several hours of failovers
549 	critical 	amip at dev.mellanox.co.il 	SDP Policy need to be consistent
597 	critical 	vlad at mellanox.co.il 	support RHEL4U5 in OFED 1.2
499 	major 	vlad at mellanox.co.il 	module compiled over ofed won't load due to symbol version mismatch
519 	major 	pasha at mellanox.co.il 	MVAPICH I APPLICATION  ABORTS WITH PARTITIONS CONFIGURED
534 	major 	vlad at mellanox.co.il 	SLES9 - Installer fails on declarations - OFED 1.2-20070409
530 	major 	dannyz at mellanox.co.il 	ibdiagnet -r fails on RHEL5 i686
538 	major 	monis at voltaire.com 	integrate IPoIB bonding with IPoIB CM
541 	major 	mst at mellanox.co.il 	slow failover with IPoIB CM bonding/ipoibtools HA
558 	major 	rolandd at cisco.com 	tvflash configure fails on SLES10 SP1 RC2


All owners of blocker and critical bugs - please reply with status of 
the bug resolution

Thanks,
Tziporet

From swise at opengridcomputing.com  Thu May  3 14:14:28 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 03 May 2007 16:14:28 -0500
Subject: [ofa-general] OFED 1.2 RC3 is delayed for Monday next week (May 7)
In-Reply-To: <463A4F26.3010804@mellanox.co.il>
References: <463A4F26.3010804@mellanox.co.il>
Message-ID: <1178226868.27558.58.camel@stevo-desktop>

We also need 599 (rping regression; I just opened this) included in
-rc3.  I've submitted the patch and I believe Sean has it ready to go.


Steve.


On Thu, 2007-05-03 at 14:07 -0700, Tziporet Koren wrote:
> Hi All,
> Since some of the critical bugs are not solved yet we decided to delay
> the release to Monday May 7.
> 
> This is the list of critical bugs that should be fixed for RC3:
> bug_id 	bug_severity 	assigned_to 	short_short_desc
> 574 	blocker 	raisch at de.ibm.com 	ehca driver fails while running openmpi
> 420 	critical 	monil at voltaire.com 	PKey table reordering caused by SM failover stops ipoib traffic
> 577 	critical 	rolandd at cisco.com 	SRP multipath failover too slow (minutes, not seconds)
> 465 	critical 	mst at mellanox.co.il 	IPoIB HA fails after several hours of failovers
> 549 	critical 	amip at dev.mellanox.co.il 	SDP Policy need to be consistent
> 597 	critical 	vlad at mellanox.co.il 	support RHEL4U5 in OFED 1.2
> 499 	major 	vlad at mellanox.co.il 	module compiled over ofed won't load due to symbol version mismatch
> 519 	major 	pasha at mellanox.co.il 	MVAPICH I APPLICATION ABORTS WITH PARTITIONS CONFIGURED
> 534 	major 	vlad at mellanox.co.il 	SLES9 - Installer fails on declarations - OFED 1.2-20070409
> 530 	major 	dannyz at mellanox.co.il 	ibdiagnet -r fails on RHEL5 i686
> 538 	major 	monis at voltaire.com 	integrate IPoIB bonding with IPoIB CM
> 541 	major 	mst at mellanox.co.il 	slow failover with IPoIB CM bonding/ipoibtools HA
> 558 	major 	rolandd at cisco.com 	tvflash configure fails on SLES10 SP1 RC2
> 
> 
> All owners of blocker and critical bugs - please reply with status of
> the bug resolution
> 
> Thanks,
> Tziporet


From tziporet at dev.mellanox.co.il  Thu May  3 14:27:14 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 03 May 2007 14:27:14 -0700
Subject: [ofa-general] man pages for the rdma-cm
In-Reply-To: <1178211084.27558.8.camel@stevo-desktop>
References: <1178127046.18609.107.camel@stevo-desktop>	<463A124E.4020604@ichips.intel.com>
	<1178211084.27558.8.camel@stevo-desktop>
Message-ID: <463A53B2.2040103@mellanox.co.il>

Steve Wise wrote:
>>
>> What's the release date for RC3?
>>     
>
> I believe it is today.
>
>   

It was just delayed to Monday. If you can do it today or tomorrow, we may
be able to integrate it.

Tziporet


From sean.hefty at intel.com  Thu May  3 14:46:04 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 3 May 2007 14:46:04 -0700
Subject: [ofa-general] man pages for the rdma-cm
In-Reply-To: <463A53B2.2040103@mellanox.co.il>
Message-ID: <000201c78dcc$73a2ea90$3b78e984@amr.corp.intel.com>

>Was just delayed to Monday. If you can do it today/tomorrow we may be
>able to integrate it

I will try to complete the man pages for at least the APIs by tomorrow.

I'm about 70% done writing them, but still need to tie them in with the build
scripts.

Steve, I will push the rping changes in with these changes.

- Sean


From xma at us.ibm.com  Thu May  3 14:55:49 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 3 May 2007 14:55:49 -0700
Subject: [ofa-general] Re: [PATCH 3 of 3] ipoib/cm: separate comp vectors
	to	RX/TX
In-Reply-To: <OFC8ED822E.AD693034-ON872572D0.0067E2D0-882572D0.00682A1F@LocalDomain>
Message-ID: <OF252E2B2E.C7515EF6-ON872572D0.00784F92-882572D0.00787C5A@us.ibm.com>


> Yes, it would be better to do
> perPort/perCompletion/perEvent/perInterrupt. It also helps latency
> not just throughput.

It might even need multiple completions, multiple events, and multiple interrupt
associations for CM mode, along with CPU affinity.

thanks
Shirley Ma
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070503/37f2e3ff/attachment.html>

From steffen.persvold at scali.com  Thu May  3 15:38:27 2007
From: steffen.persvold at scali.com (Steffen Persvold)
Date: Thu, 3 May 2007 18:38:27 -0400
Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
References: <006f01c78db8$e57d1ff0$c801a8c0@ettac>
Message-ID: <D6A583C768392A4D8B297C500CDD54B5015736BC@mse11be1.mse11.exchange.ms>

So I don't understand it then... Why do my RPMs contain only one of the two versions? I'm running on ES and not AS, but that shouldn't really matter...
 
This output that you list :
 
[root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0
/usr/lib64/libibverbs.so.1
/usr/lib64/libibverbs.so.1.0.0

Is exactly what I would have expected as well, but my RPM says :
 
[root at pe1850-1 redhat-release-4ES-5.5]# pwd
/root/OFED-1.2-rc2/RPMS/redhat-release-4ES-5.5
[root at pe1850-1 redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0

I'm looking through the build log (/tmp/OFED.build.xxx.log) and both versions get compiled, but it looks like the 32bit libraries (which get compiled last) overwrite the 64bit libraries in the "make install" section because both end up in /usr/lib :
 
(64bit section of the build) :
 
/usr/bin/install -c src/.libs/libibverbs.so.1.0.0 /var/tmp/OFED/usr/lib/libibverbs.so.1.0.0
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so.1 || { rm -f libibverbs.so.1 && ln -s libibverbs.so.1.0.0 libibverbs.so.1; }; })
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so || { rm -f libibverbs.so && ln -s libibverbs.so.1.0.0 libibverbs.so; }; })

 
(32bit section of the build) :
/usr/bin/install -c src/.libs/libibverbs.so.1.0.0 /var/tmp/OFED/usr/lib/libibverbs.so.1.0.0
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so.1 || { rm -f libibverbs.so.1 && ln -s libibverbs.so.1.0.0 libibverbs.so.1; }; })
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so || { rm -f libibverbs.so && ln -s libibverbs.so.1.0.0 libibverbs.so; }; })

 
So the question is, why is the 64bit section ending up in <buildpath>/usr/lib in the first place ???
 
I do see this though :
 
/bin/rm -f /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache
cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs
Running: env ac_cv_lib_ibverbs_ibv_get_device_list=yes ac_cv_header_infiniband_driver_h=yes ac_cv_func_ibv_read_sysfs_file=yes ac_cv_func_ibv_dontfork_range=yes ac_cv_func_ibv_dofork_range=yes ac_cv_func_ibv_register_driver=yes HAVE_IBV_DEVICE_LIBRARY_EXTENSION_TRUE=yes  ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache --disable-libcheck --prefix /usr --libdir /usr/lib --mandir=/usr/man --sysconfdir=/usr/etc CPPFLAGS="-I../libibverbs/include"

 
--libdir /usr/lib ??? shouldn't that be --libdir /usr/lib64 for the 64bit section ?
 
Cheers,
Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter

________________________________

From: Chieng Etta [mailto:etta at systemfabricworks.com]
Sent: Thu 5/3/2007 3:26 PM
To: Steffen Persvold; vlad at dev.mellanox.co.il
Cc: openfabrics-ewg at openib.org; openib-general at openib.org
Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64


Hi Steffen,

After removing all the OFED packages by using ./uninstall.sh, I tried
./build.sh to build the RPMs then installed libibverbs-1.1-0.x86_64.rpm onto
system.  "libibverbs.so.1.0.0" was installed under the right directories
(/usr/lib and /usr/lib64).  Please see the output below. 
Thanks,
Etta

[root at sfw1 etc]# cat /etc/*release
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
[root at sfw1 etc]# uname -a
Linux sfw1.sfw.int 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64
x86_64 x86_64 GNU/Linux

[root at sfw1 lib64]# pwd
/usr/lib64
[root at sfw1 lib64]# ll libibverbs*
ls: libibverbs*: No such file or directory

[root at sfw1 lib64]# rpm -aq |grep libibverbs

[root at sfw1 lib64]# cd -
/root/images/OFED-1.2-rc2/RPMS/redhat-release-4AS-5.5
[root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0
/usr/lib64/libibverbs.so.1
/usr/lib64/libibverbs.so.1.0.0

[root at sfw1 redhat-release-4AS-5.5]# rpm -ivh libibverbs-1.1-0.x86_64.rpm
Preparing...             ########################################### [100%]
   1:libibverbs          ########################################### [100%]

[root at sfw1 redhat-release-4AS-5.5]# rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
x86_64

[root at sfw1 redhat-release-4AS-5.5]# cd -
/usr/lib64
[root at sfw1 lib64]# rpm -aq |grep libibverbs
libibverbs-1.1-0

[root at sfw1 lib64]# ll libibverbs*
lrwxrwxrwx  1 root root     19 May  3 13:50 libibverbs.so.1 ->
libibverbs.so.1.0.0
-rwxr-xr-x  1 root root 200993 May  3 13:18 libibverbs.so.1.0.0

[root at sfw1 lib64]# file libibverbs.so.1.0.0
libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD x86-64, version 1
(SYSV), not stripped

[root at sfw1 lib]# cd /usr/lib
[root at sfw1 lib]# file libibverbs.so.1.0.0
libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel 80386, version 1
(SYSV), not stripped

[root at sfw1 etc]# cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
/usr/ofed/lib64

[root at sfw1 etc]# cat /etc/ld.so.conf.d/ofed.conf
/usr/lib64
/usr/lib
   

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org
[mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Steffen Persvold
Sent: Thursday, May 03, 2007 10:26 AM
To: vlad at dev.mellanox.co.il
Cc: openfabrics-ewg at openib.org; openib-general at openib.org
Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64

Vladimir,

Nope. Still the same issue. The RPMs will only contain one set of
libraries and it is always in /usr/lib (if I set the build_32bit=0
option I get the 64bit libraries but in the wrong directory).

Seriously, am I the only one seeing this ? I would think rhel4 u4 was a
very normal test platform ?

Cheers,

Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter


> -----Original Message-----
> From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il]
> Sent: Thursday, May 03, 2007 9:07 AM
> To: Steffen Persvold
> Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
>
> Please see if this happens in OFED-1.2-20070503-0600.
> But first uninstall the previous OFED version with ofed_uninstall.sh
> command.
>
> Thanks,
>
> Regards,
> Vladimir
>
> On Wed, 2007-05-02 at 11:30 -0400, Steffen Persvold wrote:
> > Hmm,
> >
> > so I tried something. I put :
> >
> > build_32bit=0
> >
> > into my ofed.conf file and rebuilt (build.sh -c ofed.conf). This time
> > it built 64bit libraries, but it puts them in the wrong directory :
> >
> > # rpm -qpl ../libibverbs-1.1-0.x86_64.rpm
> > /etc/ld.so.conf.d/ofed.conf
> > /usr/lib/libibverbs.so.1
> > /usr/lib/libibverbs.so.1.0.0
> >
> > # file /usr/lib/libibverbs.so.1.0.0
> > /usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD
> > x86-64, version 1 (SYSV), not stripped
> >
> > So what's up ??
> >
> > Cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
______________________________________________________________________
> > From: Steffen Persvold
> > Sent: Wed 5/2/2007 10:30 AM
> > To: Steffen Persvold; Vladimir Sokolovsky
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Also,
> >
> > If I look at the /etc/ld.so.conf.d/ofed.conf file I have :
> >
> > # cat ofed.conf
> > /usr/lib
> > /usr/lib
> >
> >
> > which seems kinda weird ? :)
> >
> > Cheers,
> >
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
______________________________________________________________________
> > From: ewg-bounces at lists.openfabrics.org on behalf of Steffen Persvold
> > Sent: Wed 5/2/2007 10:20 AM
> > To: Vladimir Sokolovsky
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Nope :
> >
> >
> > [redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
> > /etc/ld.so.conf.d/ofed.conf
> > /usr/lib/libibverbs.so.1
> > /usr/lib/libibverbs.so.1.0.0
> > [redhat-release-4ES-5.5]#
> >
> > So the RPM got built, but without 64bit libraries. Now if it was the
> > other way around (i.e no 32bit libraries) I could have understood it
> > (as 32bit is an option on x86_64), but not having the native 64bit
> > libraries is not so easy to understand :)
> >
> > cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
______________________________________________________________________
> > From: ewg-bounces at lists.openfabrics.org on behalf of Vladimir
> > Sokolovsky
> > Sent: Wed 5/2/2007 10:05 AM
> > To: Steffen Persvold
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Don't you have /usr/lib64/libibverbs.so.1.0.0?
> >
> > Regards,
> > Vladimir
> >
> > On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote:
> > > Folks,
> > >
> > > I used the build.sh script to build the above mentioned packages on
> > > rhel4u4 x86_64, but for some reason it only compiles 32bit libraries
> > > (even if the packages are named x86_64) :
> > >
> > > # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
> > > x86_64
> > >
> > > (after installing it) :
> > >
> > > # file /usr/lib/libibverbs.so.1.0.0
> > > /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel
> > > 80386, version 1 (SYSV), not stripped
> > >
> > > What did I do wrong ??
> > >
> > > Cheers,
> > > Steffen Persvold
> > > Technical Director Americas
> > > tel. 508-281-7100 x401
> > > fax. 508-281-7171
> > >
> > > http://www.scali.com/
> > > Scaling the Linux datacenter
> > > _______________________________________________
> > > ewg mailing list
> > > ewg at lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> >
> >
> >
> >



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070503/fe6321ff/attachment.html>

From lawver1 at llnl.gov  Thu May  3 16:37:46 2007
From: lawver1 at llnl.gov (Bryan Lawver)
Date: Thu, 03 May 2007 16:37:46 -0700
Subject: [ofa-general] Re: IPoIB forwarding
In-Reply-To: <4637B9A4.2050103@myri.com>
References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov>
	<20070425124652.GG1624@mellanox.co.il>
	<6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov>
	<20070426161409.GF15540@mellanox.co.il>
	<6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov>
	<20070426180618.GJ15540@mellanox.co.il>
	<6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov>
	<46325DF3.2050203@hp.com>
	<6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov>
	<46327A07.1000404@hp.com>
	<6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov>
	<4632894D.40705@hp.com>
	<20070428025117.a3b1200a.billfink@mindspring.com>
	<4634F49F.9030408@myri.com> <46365BD4.5060607@hp.com>
	<4637B9A4.2050103@myri.com>
Message-ID: <6.1.2.0.2.20070503163428.15b51750@mail.llnl.gov>

I have been able to install and use the 1.3.0 myricom driver and everything 
works as I expected and performance is pretty decent.  Interesting little 
side tour through various drivers...The router node sees almost no load 
which is really encouraging.

Thanks,
bryan

At 03:05 PM 5/1/2007, Loic Prylli wrote:
>On 4/30/2007 2:12 PM, Rick Jones wrote:
>>
>>Speaking of defaults, it would seem that the external 1.2.0 driver comes 
>>with 9000 bytes as the default MTU?  At least I think that is what I am 
>>seeing now that I've started looking more closely.
>>
>>rick jones
>
>
>That's the same for the in-kernel-tree code (9K MTU by default). Assuming 
>this is not wanted, I will submit a patch for that.
>
>
>Loic


From pradeep at us.ibm.com  Thu May  3 17:12:42 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Thu, 3 May 2007 17:12:42 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <adazm4ld95c.fsf@cisco.com>
Message-ID: <OFCDE4B60E.5DD36823-ON882572D0.00830759-882572D1.000136D8@us.ibm.com>

Thanks for the review. MST had already pointed out some of these issues. I
will respond to the additional ones that you raise.

Pradeep
pradeep at us.ibm.com

Roland Dreier <rdreier at cisco.com> wrote on 05/03/2007 12:01:03 PM:

>  > +#define IPOIB_CM_OP_NOSRQ (1ul << 29)
> 
> I don't understand the point of this... the only places you do anything
> with it are:
> 
>  > +       priv->cm.rx_wr.wr_id = wr_id << 32 | index | IPOIB_CM_OP_NOSRQ;
>  > +       index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ) & NOSRQ_INDEX_MASK ;
>  > +       if ((wc->wr_id & IPOIB_CM_OP_SRQ) || (wc->wr_id & IPOIB_CM_OP_NOSRQ))
> 
> so probably the most sensible thing to do is just to rename
> IPOIB_CM_OP_SRQ to IPOIB_OP_CM_RECV.

Agreed.

> 
>  > +/* These two go hand in hand */
>  > +#define NOSRQ_INDEX_RING_SIZE 1024
>  > +#define NOSRQ_INDEX_MASK      0x00000000000003ff
> 
> Rather than having a comment, I would just do
> 
> #define NOSRQ_INDEX_RING_SIZE 1024
> #define NOSRQ_INDEX_MASK      (NOSRQ_INDEX_RING_SIZE - 1)
> 
> also I think the RING name is wrong -- it's not a ring, it's a table,
> right?  I don't like having a static limit on the number of nosrq
> connections; could this be a hash table instead?
> 

I will just call this an array. Nosrq will hog memory and my thought was
that 1024 was pretty large. I envisioned using nosrq for a small number
(maybe a few dozen), and so did not think it was necessary to make this a
module parameter either. What do you suggest?
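
For illustration only, here is a rough user-space sketch of the wr_id
packing under review, with the mask derived from the table size as Roland
suggests. The names NOSRQ_INDEX_TABLE_SIZE, OP_RECV_FLAG, pack_wr_id(),
wr_id_index() and wr_id_buf() are made up for this sketch and are not from
the patch. It assumes the buffer id is a 32-bit value, which has to be
widened to 64 bits before it is shifted left by 32.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the macros discussed in the review. */
#define NOSRQ_INDEX_TABLE_SIZE 1024u
#define NOSRQ_INDEX_MASK       (NOSRQ_INDEX_TABLE_SIZE - 1)
#define OP_RECV_FLAG           (UINT64_C(1) << 29)

/* Deriving the mask as (size - 1) is only valid for a power-of-two size. */
_Static_assert((NOSRQ_INDEX_TABLE_SIZE & (NOSRQ_INDEX_TABLE_SIZE - 1)) == 0,
               "table size must be a power of two");

/* Pack buffer id (upper 32 bits), table index (low bits) and the flag. */
static uint64_t pack_wr_id(uint32_t buf_id, uint32_t index)
{
    return ((uint64_t)buf_id << 32) | (index & NOSRQ_INDEX_MASK) | OP_RECV_FLAG;
}

/* Recover the table index; the flag bit lies outside the mask. */
static uint32_t wr_id_index(uint64_t wr_id)
{
    return (uint32_t)(wr_id & NOSRQ_INDEX_MASK);
}

/* Recover the buffer id from the upper 32 bits. */
static uint32_t wr_id_buf(uint64_t wr_id)
{
    return (uint32_t)(wr_id >> 32);
}
```

With the mask tied to the size, changing the table size cannot leave a
stale mask behind, which is the point of dropping the "hand in hand"
comment.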

> 
>  > -       rep.srq = 1;
> 
>  > +       if (priv->cm.srq)
>  > +               rep.srq = 1;
>  > +       else
>  > +               rep.srq = 0;
> 
> similarly I would rather see "rep.srq = !!priv->cm.srq"

ok
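
A tiny sketch of the "!!" idiom mentioned above, in plain C. The helper
name srq_flag() is hypothetical; the only claim is that a double negation
turns any non-NULL pointer into exactly 1 and NULL into 0, which replaces
the if/else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 1 if a shared receive queue is present, else 0. */
static int srq_flag(const void *srq)
{
    return !!srq;   /* any non-NULL pointer becomes exactly 1 */
}
```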

> 
>  > +       /* Allocate space for the rx_ring here */
>  > +       p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring,
>  > +                            GFP_KERNEL);
> 
> 
>  > +               printk(KERN_WARNING "NOSRQ supports a max of %d RC "
>  > +                      "QPs. That limit has now been reached\n",
>  > +                      NOSRQ_INDEX_RING_SIZE);
> 
> ipoib_warn() instead of printk?  Also isn't this going to flood logs
> if the remote side keeps trying to connect?

As you describe, the remote side will continue to attempt connecting and
will fail. That is a pretty serious scenario. Hence I leaned towards
flooding the logs rather than losing this among a ton of other messages;
there may be application-specific messages too. I can change this to
ipoib_warn().

> 
>  > +       ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
>  > +       if (ret) {
>  > +               ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret);
>  > +               goto err_modify_nosrq;
>  > +       }
> 
> It's good to goto to unwind code, but in this case you just have a
> return at err_modify_nosrq -- why not just return directly?  However
> you seem to leak rx_ring here, so it would be better to use the unwind
> code more consistently instead of using return later.

Yes, there is a leak here - will fix that.

>  > +                       kfree(p->rx_ring);
>  > +                       return -ENOMEM;
>  > +               }
>  > +
>  > +               /* Can we call the nosrq version? */
>  > +               if (ipoib_cm_post_receive(dev, i << 32 | index)) {
>  > +                       ipoib_warn(priv, "ipoib_ib_post_receive "
>  > +                                  "failed for  buf %d\n", i);
>  > +                       ipoib_cm_dev_cleanup(dev);
> 
> seems like you're missing the call to kfree(p->rx_ring) here?
> this code could probably benefit from a goto to unwind code.

Yes, there is a leak here too - will fix.
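
As a rough illustration of the unwind style Roland asks for, here is a
user-space sketch, not the driver code. struct rx_ctx, rx_setup() and the
fail_* switches are hypothetical; calloc()/free() stand in for
kzalloc()/kfree(), and the two failure points stand in for the modify-QP
and post-receive steps. Every failure after the allocation goes through
one label, so the ring cannot leak.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-connection receive context. */
struct rx_ctx {
    int *rx_ring;
};

static int rx_setup(struct rx_ctx *p, size_t entries,
                    int fail_modify, int fail_post)
{
    int ret;

    p->rx_ring = calloc(entries, sizeof *p->rx_ring);
    if (!p->rx_ring)
        return -ENOMEM;              /* nothing to unwind yet */

    ret = fail_modify ? -EINVAL : 0; /* stand-in for the modify-QP step */
    if (ret)
        goto err_free_ring;

    ret = fail_post ? -EIO : 0;      /* stand-in for posting receives */
    if (ret)
        goto err_free_ring;

    return 0;

err_free_ring:
    /* Single exit path for every failure after the allocation. */
    free(p->rx_ring);
    p->rx_ring = NULL;
    return ret;
}
```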


From rdreier at cisco.com  Thu May  3 17:16:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 17:16:07 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <OFCDE4B60E.5DD36823-ON882572D0.00830759-882572D1.000136D8@us.ibm.com>
	(Pradeep Satyanarayana's message of "Thu,
	3 May 2007 17:12:42 -0700")
References: <OFCDE4B60E.5DD36823-ON882572D0.00830759-882572D1.000136D8@us.ibm.com>
Message-ID: <aday7k5a1fc.fsf@cisco.com>

 > > also I think the RING name is wrong -- it's not a ring, it's a table,
 > > right?  I don't like having a static limit on the number of nosrq
 > > connections; could this be a hash table instead?
 > > 
 > 
 > I will just call this an array. Nosrq will hog memory and my thought was
 > that 1024 was pretty large. I envisioned using nosrq for a small number
 > (maybe a few dozen), and so did not think it was necessary to make this a
 > module parameter either. What do you suggest?

Maybe make it a hash table of size 32 or 64 or something like that.
You use less memory in the expected case, and degrade fairly
gracefully when things get bigger.  If you want to get really fancy,
make it a hash table that you grow when it gets too full.
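
Roughly what I have in mind, as a user-space sketch only. The names
(struct conn, conn_table, conn_insert(), conn_lookup()) and the choice of
a 32-bit connection id as the key are assumptions for illustration; the
growing variant is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CONN_HASH_BUCKETS 64u       /* small in the expected case */

struct conn {
    uint32_t     id;                /* hypothetical connection key */
    void        *rx;                /* per-connection receive state */
    struct conn *next;              /* chain for ids sharing a bucket */
};

struct conn_table {
    struct conn *bucket[CONN_HASH_BUCKETS];
};

/* Multiplicative hash; the top 6 bits select one of the 64 buckets. */
static unsigned conn_hash(uint32_t id)
{
    return (uint32_t)(id * UINT32_C(2654435761)) >> 26;
}

/* Add a connection; returns 0 on success, -1 if allocation fails. */
static int conn_insert(struct conn_table *t, uint32_t id, void *rx)
{
    struct conn *c = malloc(sizeof *c);
    unsigned h = conn_hash(id);

    if (!c)
        return -1;
    c->id = id;
    c->rx = rx;
    c->next = t->bucket[h];         /* push onto the bucket's chain */
    t->bucket[h] = c;
    return 0;
}

/* Return the state stored for id, or NULL if it is not in the table. */
static void *conn_lookup(const struct conn_table *t, uint32_t id)
{
    for (const struct conn *c = t->bucket[conn_hash(id)]; c; c = c->next)
        if (c->id == id)
            return c->rx;
    return NULL;
}
```

With more than 64 connections the chains simply get longer, so lookups
slow down gradually instead of hitting a hard limit.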

I agree that we don't want yet another module parameter that has to be
tuned here.

 - R.


From pradeep at us.ibm.com  Thu May  3 21:49:40 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Thu, 3 May 2007 21:49:40 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <aday7k5a1fc.fsf@cisco.com>
Message-ID: <OFB7C8A015.2D9343C8-ON882572D1.00191B99-882572D1.001A923B@us.ibm.com>

Roland Dreier <rdreier at cisco.com> wrote on 05/03/2007 05:16:07 PM:

>  > > also I think the RING name is wrong -- it's not a ring, it's a table,
>  > > right?  I don't like having a static limit on the number of nosrq
>  > > connections; could this be a hash table instead?
>  > > 
>  > 
>  > I will just call this an array. Nosrq will hog memory and my thought was
>  > that 1024 was pretty large. I envisioned using nosrq for a small number
>  > (maybe a few dozen), and so did not think it was necessary to make this a
>  > module parameter either. What do you suggest?
> 
> Maybe make it a hash table of size 32 or 64 or something like that.
> You use less memory in the expected case, and degrade fairly
> gracefully when things get bigger.  If you want to get really fancy,
> make it a hash table that you grow when it gets too full.
> 

The only time we search the array is to find an empty slot in which to
store a pointer to ipoib_cm_rx. This happens upon receipt of a REQ. We
perform no other lookups. On receipt of a packet we have the index
encoded in wr_id, and so use that to retrieve the ipoib_cm_rx pointer.
We don't need a hash table for this. We could use a head and tail pointer
to reduce the search.
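
One way to avoid the search entirely, sketched in user-space C. Note that
this uses a free list of indices rather than the head and tail pointers
mentioned above, and all names (struct slot_table, slot_alloc(),
slot_free()) are hypothetical. Lookup by index stays a plain array access;
claiming and releasing a slot are both O(1).

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_COUNT 1024

struct slot_table {
    void *slot[SLOT_COUNT];       /* index -> per-connection state */
    int   next_free[SLOT_COUNT];  /* singly linked list of free indices */
    int   free_head;              /* -1 when the table is full */
};

/* Mark every slot empty and chain all indices onto the free list. */
static void slot_table_init(struct slot_table *t)
{
    for (int i = 0; i < SLOT_COUNT; i++) {
        t->slot[i] = NULL;
        t->next_free[i] = i + 1;
    }
    t->next_free[SLOT_COUNT - 1] = -1;
    t->free_head = 0;
}

/* Claim a free index without scanning; returns -1 if all slots are used. */
static int slot_alloc(struct slot_table *t, void *state)
{
    int i = t->free_head;

    if (i < 0)
        return -1;
    t->free_head = t->next_free[i];
    t->slot[i] = state;
    return i;
}

/* Release an index so that a later REQ can reuse it. */
static void slot_free(struct slot_table *t, int i)
{
    t->slot[i] = NULL;
    t->next_free[i] = t->free_head;
    t->free_head = i;
}
```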

Pradeep
pradeep at us.ibm.com


From k_mahesh85 at yahoo.co.in  Thu May  3 22:04:53 2007
From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh)
Date: Fri, 4 May 2007 06:04:53 +0100 (BST)
Subject: [ofa-general] [query] SMI nodeinfo, port_info structures
In-Reply-To: <1178211580.32222.3481.camel@hal.voltaire.com>
Message-ID: <984925.59702.qm@web8327.mail.in.yahoo.com>

>SMI has nothing to do with those SM attributes.

Yes, you are right. SMI has nothing to do with them right now. But if some
other hardware vendor wants to implement the SMA in the host software (like
ipath) in future, he again needs to implement those structures (nodeinfo and
port_info, attributes of SM) in his driver.
We can avoid this situation (duplicate declarations of the same structure)
by declaring the above mentioned structures in the core layer.

>What structures (in what file(s)) are you referring to ?

Here I am referring to the structures for some SM attributes like nodeinfo and port_info which are currently declared in ipath driver. 

Some fields in those structures have big endian (__bexx) alignment and others
 have CPU  (uxx) alignment.

e.g: in struct port_info  declared in ipath driver (ipath_mad.c), the mkey is declared as __be64 mkey whereas  the local port number is declared as 
u8 local_port_num.

-Mahesh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070504/e07ebec5/attachment.html>

From rdreier at cisco.com  Thu May  3 22:12:40 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 22:12:40 -0700
Subject: [ofa-general] [query] SMI nodeinfo, port_info structures
In-Reply-To: <984925.59702.qm@web8327.mail.in.yahoo.com> (Keshetti Mahesh's
	message of "Fri, 4 May 2007 06:04:53 +0100 (BST)")
References: <984925.59702.qm@web8327.mail.in.yahoo.com>
Message-ID: <adaps5h9np3.fsf@cisco.com>

[please try to keep your line lengths below 80 columns]

 > we can avoid this situation (duplicate declarations of same
 > structure) by declaring the above mentioned structures in the core
 > layer.

I think a patch moving structures defined by the IB spec to a more
appropriate location would be fine.

 > Some fields in those structures have big endian (__bexx) alignment
 > and others have CPU (uxx) alignment.
 > 
 > e.g: in struct port_info declared in ipath driver (ipath_mad.c),
 > the mkey is declared as __be64 mkey whereas the local port number
 > is declared as u8 local_port_num.

Think about it... what could endianness mean for a single-byte field?
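
To spell it out with a small user-space sketch (put_be64(), get_be64() and
put_u8() are hypothetical helpers, not kernel APIs): a 64-bit field has
eight bytes whose order on the wire must be fixed, while a one-byte field
has only one possible layout, so there is nothing to convert.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 64-bit value in big-endian order, most significant byte first. */
static void put_be64(uint8_t out[8], uint64_t v)
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Read a big-endian 64-bit value back into CPU byte order. */
static uint64_t get_be64(const uint8_t in[8])
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | in[i];
    return v;
}

/* A one-byte field is stored as-is; byte order does not apply. */
static void put_u8(uint8_t out[1], uint8_t v)
{
    out[0] = v;
}
```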


From k_mahesh85 at yahoo.co.in  Thu May  3 22:22:16 2007
From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh)
Date: Fri, 4 May 2007 06:22:16 +0100 (BST)
Subject: [ofa-general] [query] SMI nodeinfo, port_info structures
In-Reply-To: <adaps5h9np3.fsf@cisco.com>
Message-ID: <496423.96231.qm@web8314.mail.in.yahoo.com>


> [please try to keep your line lengths below 80 columns]

sure..

>I think a patch moving structures defined by the IB spec to a more
>appropriate location would be fine.

Isn't include/rdma/ib_smi.h an appropriate location?

 >Think about it... what could endianness mean for a single-byte field?

Yes.. got the point. 

-Mahesh

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070504/7cd5085f/attachment.html>

From HNGUYEN at de.ibm.com  Thu May  3 23:18:56 2007
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 4 May 2007 08:18:56 +0200
Subject: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector support
	in core
In-Reply-To: <20070503104847.GD10009@mellanox.co.il>
Message-ID: <OF4D9CA54F.FE204874-ONC12572D1.0021602A-C12572D1.0022B0EB@de.ibm.com>

Hello Michael and Roland!
How about a new verb that returns the number of CQs associated with a
comp_vector, like this:

/**
 * Returns number of cqs assigned to comp_vector
 * @return < 0 in error case eg invalid comp_vector
 */
int ib_query_comp_vector(struct ib_device *dev, int comp_vector);

A consumer or ULP would be able to pick an "empty" comp_vector.
Of course, that does not prevent a given comp_vector (or its IRQ) from
being "overloaded", but that's another topic.
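
A rough sketch of how a consumer might use such a count, in plain C. The
array cq_count[] and the helper pick_least_loaded_vector() are hypothetical;
the array stands in for whatever the proposed query would report for each
vector.

```c
#include <assert.h>

/*
 * Return the index of the completion vector with the fewest CQs attached,
 * or -1 if there are no vectors. On a tie the lowest index wins.
 */
static int pick_least_loaded_vector(const int *cq_count, int num_vectors)
{
    int best = -1;

    for (int v = 0; v < num_vectors; v++)
        if (best < 0 || cq_count[v] < cq_count[best])
            best = v;
    return best;
}
```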
Thanks
Nam
PS: I'm waiting for your comments first, a patch will come later.


From mst at dev.mellanox.co.il  Thu May  3 23:25:26 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 4 May 2007 09:25:26 +0300
Subject: [ofa-general] Re: [PATCH 0 of 3] comp_vector kernel support
In-Reply-To: <adahcqteozq.fsf@cisco.com>
References: <20070503104806.GC10009@mellanox.co.il> <adahcqteozq.fsf@cisco.com>
Message-ID: <20070504062526.GB4829@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH 0 of 3] comp_vector kernel support
> 
>  > 1. extends ib_create_cq to pass in comp_vector parameter
>  > 2. Update all ULP/providers
>  > 3. mthca is enhanced to support multiple vectors if MSI-X is enabled on SMP
>  > 4. Other providers report support for a single completion vector
>  > 5. uverbs and IPoIB CM are enhanced to use multiple vectors if available
> 
>  > Please consider for 2.6.22.
> 
> This is good work, but given that this has just appeared halfway
> through the 2.6.22 merge window I don't think we should just merge it
> just now.  Rather, let's definitely get it into 2.6.23.

How about just patches 1 and 2?
They don't do anything to *kernel* ULPs by themselves,
and give userspace ULPs opportunity to start using the feature.
We'll learn from that, and enhance kernel ULPs by 2.6.23.

-- 
MST


From mst at dev.mellanox.co.il  Thu May  3 23:29:46 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 4 May 2007 09:29:46 +0300
Subject: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector support
	in core
In-Reply-To: <OF4D9CA54F.FE204874-ONC12572D1.0021602A-C12572D1.0022B0EB@de.ibm.com>
References: <20070503104847.GD10009@mellanox.co.il>
	<OF4D9CA54F.FE204874-ONC12572D1.0021602A-C12572D1.0022B0EB@de.ibm.com>
Message-ID: <20070504062946.GC4829@mellanox.co.il>

> Quoting Hoang-Nam Nguyen <HNGUYEN at de.ibm.com>:
> Subject: Re: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector support in core
> 
> Hello Michael and Roland!
> How about a new verb delivering number of cqs associated with a
> comp_vector like this
> 
> /**
>  * Returns number of cqs assigned to comp_vector
>  * @return < 0 in error case eg invalid comp_vector
>  */
> int ib_query_comp_vector(struct ib_device *dev, int comp_vector);
> 
> A consumer or ULP would be able to pick an "empty" comp_vector.
> Of course, that does not prevent a given comp_vector (or its IRQ) from
> being "overloaded", but that's another topic.
> Thanks
> Nam
> PS: I'm waiting for your comments first, a patch will come later.

I'm not convinced it's an interesting metric.
A CQ which has multiple QPs assigned to it might get more traffic
than a CQ which only has a single QP.

My gut feeling would be that once ULPs learn to use multiple vectors,
each ULP will spread across them evenly, without help from the provider.
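
Something as simple as a round-robin cursor inside the ULP would do, for
example (a sketch in plain C; struct vector_cursor and next_comp_vector()
are made-up names, and num_vectors is assumed to be at least 1):

```c
#include <assert.h>

/* Hypothetical per-ULP cursor; each new CQ takes the next vector in turn. */
struct vector_cursor {
    unsigned next;          /* vector to hand out on the next call */
    unsigned num_vectors;   /* must be at least 1 */
};

/* Return the vector for the next CQ and advance, wrapping at the end. */
static unsigned next_comp_vector(struct vector_cursor *c)
{
    unsigned v = c->next;

    c->next = (c->next + 1) % c->num_vectors;
    return v;
}
```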

-- 
MST


From mst at dev.mellanox.co.il  Fri May  4 00:25:25 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 4 May 2007 10:25:25 +0300
Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian
In-Reply-To: <463A22F3.4090108@hp.com>
References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il>
	<463A1C9A.6060706@hp.com> <20070503174817.GC9719@mellanox.co.il>
	<463A22F3.4090108@hp.com>
Message-ID: <20070504072525.GE4829@mellanox.co.il>

> Quoting Rick Jones <rick.jones2 at hp.com>:
> Subject: Re: OFED-1.2-20070502-0600 on Debian
> 
> Michael S. Tsirkin wrote:
> >>make[1]: Entering directory `/root/linux-2.6.21.1'
> >>test -e include/linux/autoconf.h -a -e include/config/auto.conf || (      \
> >>       echo;                                                           \
> >>       echo "  ERROR: Kernel configuration is invalid.";               \
> >>       echo "         include/linux/autoconf.h or include/config/auto.conf are missing.";      \
> >>       echo "         Run 'make oldconfig && make prepare' on kernel src to fix it.";  \
> >
> >
> >This is the kernel's message, not ours - is this the source you built the
> >kernel from?
> >If you go into /root/linux-2.6.21.1 as root and do make modules,
> >does it succeed?
> 
> yes.  some warnings at the beginning about some modules and section 
> mismatches but it seems to complete.

Okay ... so, do you see include/linux/autoconf.h there?

-- 
MST


From mst at dev.mellanox.co.il  Fri May  4 00:32:15 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 4 May 2007 10:32:15 +0300
Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian
In-Reply-To: <463A22F3.4090108@hp.com>
References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il>
	<463A1C9A.6060706@hp.com> <20070503174817.GC9719@mellanox.co.il>
	<463A22F3.4090108@hp.com>
Message-ID: <20070504073215.GG4829@mellanox.co.il>

> Quoting Rick Jones <rick.jones2 at hp.com>:
> Subject: Re: OFED-1.2-20070502-0600 on Debian
> 
> Michael S. Tsirkin wrote:
> >>make[1]: Entering directory `/root/linux-2.6.21.1'
> >>test -e include/linux/autoconf.h -a -e include/config/auto.conf || (      \
> >>       echo;                                                           \
> >>       echo "  ERROR: Kernel configuration is invalid.";               \
> >>       echo "         include/linux/autoconf.h or include/config/auto.conf are missing.";      \
> >>       echo "         Run 'make oldconfig && make prepare' on kernel src to fix it.";  \
> >
> >
> >This is the kernel's message, not ours - is this the source you built the
> >kernel from?
> >If you go into /root/linux-2.6.21.1 as root and do make modules,
> >does it succeed?
> 
> yes.  some warnings at the beginning about some modules and section 
> mismatches but it seems to complete.

I just tried this on my ubuntu laptop with the same result.
We'll work on fixing this by Monday.

-- 
MST


From mst at dev.mellanox.co.il  Fri May  4 00:42:41 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 4 May 2007 10:42:41 +0300
Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian
In-Reply-To: <463A1C9A.6060706@hp.com>
References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il>
	<463A1C9A.6060706@hp.com>
Message-ID: <20070504074241.GH4829@mellanox.co.il>

> >There *is* another way which should be enough to test IPoIB:
> >try getting a kernel tarball from
> >http://git.openfabrics.org/~vlad/builds/
> >
> >If you unpack this, you can configure/make/make install.
> >
> >Installer will backup your original modules under the prefix.
> >Keep the source around and you'll be able to make uninstall
> >to get back to original system.
> >
> >Note 1: default configure settings are often not what you want:
> >run ./configure --help first of all to see which modules to select
> >(--with-ipoib-mod and --with-mthca-mod I think) and to set a prefix.
> >Note 2: having quilt tool installed is recommended - will let you
> >add/remove patches later.
> >Note 3: this way you get no userspace. openfabrics tarballs
> >are under the same directory, and a similar method works there.
> >external tarballs (MPI, bonding, etc) are supplied to us in SRPM
> >format so this trick does not work for them.
> 
> Seems I found little joy there too, probably my own fault.  The environment 
> is a Debian 4.0. The kernel is called:
> 
> hpcpc107:~/ofa_1_2_kernel-20070502-0200# uname -a
> Linux hpcpc107 2.6.21.1-raj #1 SMP Tue May 1 14:11:27 PDT 2007 ia64 
> GNU/Linux
> 
> 
> The sources to which are at:
> 
> /root/linux-2.6.21.1
> 
> My configure line was:
> 
> ./configure --with-ipoib-mod --with-mthca-mod --with-sdp-mod 
> --prefix=/root/save

You must add --with-core-mod and it starts to rip.
Vlad, I think configure should be smart enough to know that
selecting any modules should enable core, too. OK?

However, I noticed that 2.6.21 isn't supported yet (build actually fails).
I'll try to add support by Monday, for now latest supported kernel is 2.6.20.y.

> I didn't save the first set of configure output :(  Subsequent configures 
> give:

Yes, it's a known limitation currently.
There are 3 possible work-arounds:
- remove the build directory and re-open it
- run
  >quilt pop -a
  >rm -fr patches .pc
  before second configure run
- add --without-patch to second configure run (only works
  if you did not change the kernel version to build for)

-- 
MST


From mst at dev.mellanox.co.il  Fri May  4 00:54:59 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 4 May 2007 10:54:59 +0300
Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian
In-Reply-To: <463A22F3.4090108@hp.com>
References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il>
	<463A1C9A.6060706@hp.com> <20070503174817.GC9719@mellanox.co.il>
	<463A22F3.4090108@hp.com>
Message-ID: <20070504075459.GI4829@mellanox.co.il>

OK.
Apply these 2 patches after configure:

-- 
MST
-------------- next part --------------
commit ecbb416939da77c0d107409976499724baddce7b
Author: Alexey Kuznetsov <kuznet at ms2.inr.ac.ru>
Date:   Sat Mar 24 12:52:16 2007 -0700

    [NET]: Fix neighbour destructor handling.
    
    ->neigh_destructor() is killed (not used), replaced with
    ->neigh_cleanup(), which is called when neighbor entry goes to dead
    state. At this point everything is still valid: neigh->dev,
    neigh->parms etc.
    
    The device should guarantee that dead neighbor entries (neigh->dead !=
    0) do not get private part initialized, otherwise nobody will cleanup
    it.
    
    I think this is enough for ipoib which is the only user of this thing.
    Initialization private part of neighbor entries happens in ipib
    start_xmit routine, which is not reached when device is down.  But it
    would be better to add explicit test for neigh->dead in any case.
    
    Signed-off-by: David S. Miller <davem at davemloft.net>

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 0741c6d..f2a40ae 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -814,7 +814,7 @@ static void ipoib_set_mcast_list(struct net_device *dev)
 	queue_work(ipoib_workqueue, &priv->restart_task);
 }
 
-static void ipoib_neigh_destructor(struct neighbour *n)
+static void ipoib_neigh_cleanup(struct neighbour *n)
 {
 	struct ipoib_neigh *neigh;
 	struct ipoib_dev_priv *priv = netdev_priv(n->dev);
@@ -822,7 +822,7 @@ static void ipoib_neigh_destructor(struct neighbour *n)
 	struct ipoib_ah *ah = NULL;
 
 	ipoib_dbg(priv,
-		  "neigh_destructor for %06x " IPOIB_GID_FMT "\n",
+		  "neigh_cleanup for %06x " IPOIB_GID_FMT "\n",
 		  IPOIB_QPN(n->ha),
 		  IPOIB_GID_RAW_ARG(n->ha + 4));
 
@@ -874,7 +874,7 @@ void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh)
 
 static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms)
 {
-	parms->neigh_destructor = ipoib_neigh_destructor;
+	parms->neigh_cleanup = ipoib_neigh_cleanup;
 
 	return 0;
 }
-------------- next part --------------
commit 43cb76d91ee85f579a69d42bc8efc08bac560278
Author: Greg Kroah-Hartman <gregkh at suse.de>
Date:   Tue Apr 9 12:14:34 2002 -0700

    Network: convert network devices to use struct device instead of class_device
    
    This lets the network core have the ability to handle suspend/resume
    issues, if it wants to.
    
    Thanks to Frederik Deweerdt <frederik.deweerdt at gmail.com> for the arm
    driver fixes.
    
    Signed-off-by: Greg Kroah-Hartman <gregkh at suse.de>

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 705eb1d..af5ee2e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -958,16 +958,17 @@ struct ipoib_dev_priv *ipoib_intf_alloc(const char *name)
 	return netdev_priv(dev);
 }
 
-static ssize_t show_pkey(struct class_device *cdev, char *buf)
+static ssize_t show_pkey(struct device *dev,
+			 struct device_attribute *attr, char *buf)
 {
-	struct ipoib_dev_priv *priv =
-		netdev_priv(container_of(cdev, struct net_device, class_dev));
+	struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev));
 
 	return sprintf(buf, "0x%04x\n", priv->pkey);
 }
-static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL);
+static DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL);
 
-static ssize_t create_child(struct class_device *cdev,
+static ssize_t create_child(struct device *dev,
+			    struct device_attribute *attr,
 			    const char *buf, size_t count)
 {
 	int pkey;
@@ -985,14 +986,14 @@ static ssize_t create_child(struct class_device *cdev,
 	 */
 	pkey |= 0x8000;
 
-	ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev),
-			     pkey);
+	ret = ipoib_vlan_add(to_net_dev(dev), pkey);
 
 	return ret ? ret : count;
 }
-static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child);
+static DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child);
 
-static ssize_t delete_child(struct class_device *cdev,
+static ssize_t delete_child(struct device *dev,
+			    struct device_attribute *attr,
 			    const char *buf, size_t count)
 {
 	int pkey;
@@ -1004,18 +1005,16 @@ static ssize_t delete_child(struct class_device *cdev,
 	if (pkey < 0 || pkey > 0xffff)
 		return -EINVAL;
 
-	ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev),
-				pkey);
+	ret = ipoib_vlan_delete(to_net_dev(dev), pkey);
 
 	return ret ? ret : count;
 
 }
-static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child);
+static DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child);
 
 int ipoib_add_pkey_attr(struct net_device *dev)
 {
-	return class_device_create_file(&dev->class_dev,
-					&class_device_attr_pkey);
+	return device_create_file(&dev->dev, &dev_attr_pkey);
 }
 
 static struct net_device *ipoib_add_port(const char *format,
@@ -1083,11 +1082,9 @@ static struct net_device *ipoib_add_port(const char *format,
 
 	if (ipoib_add_pkey_attr(priv->dev))
 		goto sysfs_failed;
-	if (class_device_create_file(&priv->dev->class_dev,
-				     &class_device_attr_create_child))
+	if (device_create_file(&priv->dev->dev, &dev_attr_create_child))
 		goto sysfs_failed;
-	if (class_device_create_file(&priv->dev->class_dev,
-				     &class_device_attr_delete_child))
+	if (device_create_file(&priv->dev->dev, &dev_attr_delete_child))
 		goto sysfs_failed;
 
 	return priv->dev;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index f887780..085eafe 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -42,15 +42,15 @@
 
 #include "ipoib.h"
 
-static ssize_t show_parent(struct class_device *class_dev, char *buf)
+static ssize_t show_parent(struct device *d, struct device_attribute *attr,
+			   char *buf)
 {
-	struct net_device *dev =
-		container_of(class_dev, struct net_device, class_dev);
+	struct net_device *dev = to_net_dev(d);
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	return sprintf(buf, "%s\n", priv->parent->name);
 }
-static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL);
+static DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL);
 
 int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 {
@@ -118,8 +118,7 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 	if (ipoib_add_pkey_attr(priv->dev))
 		goto sysfs_failed;
 
-	if (class_device_create_file(&priv->dev->class_dev,
-				     &class_device_attr_parent))
+	if (device_create_file(&priv->dev->dev, &dev_attr_parent))
 		goto sysfs_failed;
 
 	list_add_tail(&priv->list, &ppriv->child_intfs);

From HNGUYEN at de.ibm.com  Fri May  4 02:09:06 2007
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 4 May 2007 11:09:06 +0200
Subject: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector
	support	in core
In-Reply-To: <20070504062946.GC4829@mellanox.co.il>
Message-ID: <OFEA31DEAE.9BB11420-ONC12572D1.00313764-C12572D1.003244EE@de.ibm.com>

> > Hello Michael and Roland!
> > How about a new verb delivering number of cqs associated with a
> > comp_vector like this
> >
> > /**
> >  * Returns number of cqs assigned to comp_vector
> >  * @return < 0 in error case eg invalid comp_vector
> >  */
> > int ib_query_comp_vector(struct ib_device *dev, int comp_vector);
> >
> > A consumer or ULP would be able to pick an "empty" comp_vector.
> > Surely that does not prevent that a certain comp_vector resp. IRQ
> > can be "overloaded", and that's another topic.
> > Thanks
> > Nam
> > PS: I'm waiting for your comments first, a patch will come later.
>
> I'm not convinced it's an interesting metric.
> A CQ which has multiple QPs assigned to it might get more traffic
> than a CQ which only has a single QP.

That's true for the association between a CQ and its QPs.

> My gut feeling would be that once ULPs learn to use multiple vectors,
> each ULP will spread across them evenly, without help from provider.

Enabling multiple vectors is to be done by a provider. A ULP can utilize
them by checking num_comp_vector. The simple metric above could be provided
in ib_core. Anyway, as Shirley stated in her email, using comp_vector
per port will help a lot, at least on ppc64 and with ehca - for other
HCAs I haven't benchmarked. And that metric allows a ULP to implement
such an approach.
Nam
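
To make the idea of picking an "empty" comp_vector concrete, here is a
minimal userspace sketch. It is only an illustration under assumptions:
ib_query_comp_vector() is the proposed verb and does not exist yet, so the
per-vector CQ counts are modelled as a plain array (cq_count), and the
function name pick_least_loaded_vector is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: cq_count[v] holds the number of CQs currently
 * assigned to completion vector v. In the proposal this number would
 * come from ib_query_comp_vector(dev, v), which is not an existing verb.
 *
 * Returns the index of the vector with the fewest CQs; ties go to the
 * lowest index.
 */
int pick_least_loaded_vector(const int *cq_count, int num_comp_vectors)
{
	int best = 0;
	int v;

	assert(cq_count != NULL);
	assert(num_comp_vectors > 0);

	for (v = 1; v < num_comp_vectors; v++) {
		if (cq_count[v] < cq_count[best])
			best = v;
	}
	return best;
}
```

As noted in the discussion, a CQ count says nothing about how much traffic
each CQ carries, so this picks the emptiest vector, not necessarily the
least busy one.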


From mst at dev.mellanox.co.il  Fri May  4 02:13:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 4 May 2007 12:13:06 +0300
Subject: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector
	support	in core
In-Reply-To: <OFEA31DEAE.9BB11420-ONC12572D1.00313764-C12572D1.003244EE@de.ibm.com>
References: <20070504062946.GC4829@mellanox.co.il>
	<OFEA31DEAE.9BB11420-ONC12572D1.00313764-C12572D1.003244EE@de.ibm.com>
Message-ID: <20070504091306.GJ4829@mellanox.co.il>

> using comp_vector
> per port will help a lot, at least on ppc64 and with ehca - for other
> HCAs I haven't benchmarked. And that metric allows ULP to implement
> such one approach.

Looks like a bit of overdesign. I think you can just set
	comp_vector = (port_num - 1) % num_comp_vectors
without any special metrics
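
A small sketch of that per-port mapping, assuming port numbers start at 1.
In C the % operator binds tighter than binary minus, so the subtraction has
to be parenthesized to keep the result inside 0 .. num_comp_vectors - 1.
The helper name comp_vector_for_port is made up for illustration.

```c
#include <assert.h>

/*
 * Map a 1-based port number onto a completion vector index in the
 * range [0, num_comp_vectors).
 */
int comp_vector_for_port(int port_num, int num_comp_vectors)
{
	assert(port_num >= 1);
	assert(num_comp_vectors > 0);

	/* Parentheses matter: "port_num - 1 % n" parses as port_num - (1 % n). */
	return (port_num - 1) % num_comp_vectors;
}
```

With two vectors, ports 1, 2 and 3 map to vectors 0, 1 and 0.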

-- 
MST


From vlad at lists.openfabrics.org  Fri May  4 02:37:28 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri,  4 May 2007 02:37:28 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070504-0200 daily build status
Message-ID: <20070504093728.C6CFDE60979@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.15
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:


From HNGUYEN at de.ibm.com  Fri May  4 02:38:27 2007
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 4 May 2007 11:38:27 +0200
Subject: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq
	comp_vector	support	in core
In-Reply-To: <20070504091306.GJ4829@mellanox.co.il>
Message-ID: <OFF76266AC.386922B0-ONC12572D1.0034DD8C-C12572D1.0034F4FC@de.ibm.com>

"Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote on 04.05.2007 11:13:06:

> > using comp_vector
> > per port will help a lot, at least on ppc64 and with ehca - for other
> > HCAs I haven't benchmarked. And that metric allows ULP to implement
> > such one approach.
>
> Looks like a bit of overdesign. I think you can just set
>    comp_vector = (port_num - 1) % num_comp_vectors
> without any special metrics
Right, looks much simpler for that per-port-purpose.
Nam


From halr at voltaire.com  Fri May  4 03:39:04 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 May 2007 06:39:04 -0400
Subject: [ofa-general] [query] SMI nodeinfo, port_info structures
In-Reply-To: <adaps5h9np3.fsf@cisco.com>
References: <984925.59702.qm@web8327.mail.in.yahoo.com>
	<adaps5h9np3.fsf@cisco.com>
Message-ID: <1178275139.32222.70093.camel@hal.voltaire.com>

On Fri, 2007-05-04 at 01:12, Roland Dreier wrote:
> [please try to keep your line lengths below 80 columns]
> 
>  > we can avoid this situation (duplicate declarations of same
>  > structure) by declaring the above mentioned structures in the core
>  > layer.
> 
> I think a patch moving structures defined by the IB spec to a more
> appropriate location would be fine.

Sure; currently ipath is the only one which needed these for its soft
SMA so there was no push to do this.

-- Hal

>  > Some fields in those structures have big endian (__bexx) alignment
>  > and others have CPU (uxx) alignment.
>  > 
>  > e.g: in struct port_info declared in ipath driver (ipath_mad.c),
>  > the mkey is declared as __be64 mkey whereas the local port number
>  > is declared as u8 local_port_num.
> 
> Think about it... what could endianness mean for a single-byte field?


From halr at voltaire.com  Fri May  4 03:40:38 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 May 2007 06:40:38 -0400
Subject: [ofa-general] [query] SMI nodeinfo, port_info structures
In-Reply-To: <496423.96231.qm@web8314.mail.in.yahoo.com>
References: <496423.96231.qm@web8314.mail.in.yahoo.com>
Message-ID: <1178275237.32222.70180.camel@hal.voltaire.com>

On Fri, 2007-05-04 at 01:22, Keshetti Mahesh wrote:
> > [please try to keep your line lengths below 80 columns]
> 
> sure..
> 
> >I think a patch moving structures defined by the IB spec to a more
> >appropriate location would be fine.
> 
> Isn't include/rdma/ib_smi.h an appropriate location?

Not really as this is for SMI which is lower in the architecture than SM
class attributes. Maybe ib_mad.h or some new header files (ib_sma.h and
ib_pma.h) are more appropriate. What do others think ?

-- Hal

> >Think about it... what could endianness mean for a single-byte field?
> 
> Yes.. got the point. 
> 
> -Mahesh


From mst at dev.mellanox.co.il  Fri May  4 03:57:18 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 4 May 2007 13:57:18 +0300
Subject: [ofa-general] Re: [query] SMI nodeinfo, port_info structures
In-Reply-To: <984925.59702.qm@web8327.mail.in.yahoo.com>
References: <1178211580.32222.3481.camel@hal.voltaire.com>
	<984925.59702.qm@web8327.mail.in.yahoo.com>
Message-ID: <20070504105718.GL4829@mellanox.co.il>

> Quoting Keshetti Mahesh <k_mahesh85 at yahoo.co.in>:
> Subject: Re: [query] SMI nodeinfo, port_info structures
> 
> >SMI has nothing to do with those SM attributes.
> 
> Yes you are right. SMI has nothing to do with them right now. But if some other
> hardware vendor wants to implement the SMA in the host software (like ipath) in
> future he again needs to implement those structures (nodeinfo and port_info,
> attributes of SM ) in his driver.
> We can avoid this situation (duplicate declarations of the same structure) by
> declaring the above mentioned structures in the core layer.

Why not wait till this actually happens?

-- 
MST


From etta at systemfabricworks.com  Fri May  4 07:28:23 2007
From: etta at systemfabricworks.com (Chieng Etta)
Date: Fri, 4 May 2007 09:28:23 -0500
Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
In-Reply-To: <D6A583C768392A4D8B297C500CDD54B5015736BC@mse11be1.mse11.exchange.ms>
Message-ID: <008d01c78e58$78df8810$c801a8c0@ettac>

Steffen,

 
The installation should be the same on either ES or AS.   

I assume that your system should have /usr/lib64 directory.  Would you be
able to install rc2 by using ./install.sh script?

 
Thanks,

Etta

 
  _____  

From: Steffen Persvold [mailto:steffen.persvold at scali.com] 
Sent: Thursday, May 03, 2007 5:38 PM
To: Chieng Etta; vlad at dev.mellanox.co.il
Cc: openfabrics-ewg at openib.org; openib-general at openib.org
Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64

 
So I don't understand it then... Why are my RPMs only containing one of the
two versions. I'm running on ES and not AS but that shouldn't really
matter...

 
This output that you list :

 
[root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0
/usr/lib64/libibverbs.so.1
/usr/lib64/libibverbs.so.1.0.0

Is exactly what I would have expected as well, but my RPM says :

 
[root at pe1850-1 redhat-release-4ES-5.5]# pwd
/root/OFED-1.2-rc2/RPMS/redhat-release-4ES-5.5
[root at pe1850-1 redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0

I'm looking through the build log (/tmp/OFED.build.xxx.log) and both versions
get compiled, but it looks like the 32bit libraries (which get compiled
last) overwrite the 64bit libraries in the "make install" section because
both end up in /usr/lib :

 
(64bit section of the build) :

 
/usr/bin/install -c src/.libs/libibverbs.so.1.0.0
/var/tmp/OFED/usr/lib/libibverbs.so.1.0.0
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so.1
|| { rm -f libibverbs.so.1 && ln -s libibverbs.so.1.0.0 libibverbs.so.1; };
})
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so ||
{ rm -f libibverbs.so && ln -s libibverbs.so.1.0.0 libibverbs.so; }; })

 
(32bit section of the build) :

/usr/bin/install -c src/.libs/libibverbs.so.1.0.0
/var/tmp/OFED/usr/lib/libibverbs.so.1.0.0
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so.1
|| { rm -f libibverbs.so.1 && ln -s libibverbs.so.1.0.0 libibverbs.so.1; };
})
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so ||
{ rm -f libibverbs.so && ln -s libibverbs.so.1.0.0 libibverbs.so; }; })

 
So the question is, why is the 64bit section ending up in
<buildpath>/usr/lib in the first place ???

 
I do see this though :

 
/bin/rm -f /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache
cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs
Running: env ac_cv_lib_ibverbs_ibv_get_device_list=yes
ac_cv_header_infiniband_driver_h=yes ac_cv_func_ibv_read_sysfs_file=yes
ac_cv_func_ibv_dontfork_range=yes ac_cv_func_ibv_dofork_range=yes
ac_cv_func_ibv_register_driver=yes
HAVE_IBV_DEVICE_LIBRARY_EXTENSION_TRUE=yes  ./configure
--cache-file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache
--disable-libcheck --prefix /usr --libdir /usr/lib --mandir=/usr/man
--sysconfdir=/usr/etc CPPFLAGS="-I../libibverbs/include"
 
--libdir /usr/lib ??? shouldn't that be --libdir /usr/lib64 for the 64bit
section ?

 
Cheers,

Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171
 
http://www.scali.com/
Scaling the Linux datacenter

 
  _____  

From: Chieng Etta [mailto:etta at systemfabricworks.com]
Sent: Thu 5/3/2007 3:26 PM
To: Steffen Persvold; vlad at dev.mellanox.co.il
Cc: openfabrics-ewg at openib.org; openib-general at openib.org
Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64

Hi Steffen,

After removing all the OFED packages by using ./uninstall.sh, I tried
./build.sh to build the RPMs then installed libibverbs-1.1-0.x86_64.rpm onto
system.  "libibverbs.so.1.0.0" was installed under the right directories
(/usr/lib and /usr/lib64).  Please see the output below. 
Thanks,
Etta

[root at sfw1 etc]# cat /etc/*release
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
[root at sfw1 etc]# uname -a
Linux sfw1.sfw.int 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64
x86_64 x86_64 GNU/Linux

[root at sfw1 lib64]# pwd
/usr/lib64
[root at sfw1 lib64]# ll libibverbs*
ls: libibverbs*: No such file or directory

[root at sfw1 lib64]# rpm -aq |grep libibverbs

[root at sfw1 lib64]# cd -
/root/images/OFED-1.2-rc2/RPMS/redhat-release-4AS-5.5
[root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0
/usr/lib64/libibverbs.so.1
/usr/lib64/libibverbs.so.1.0.0

[root at sfw1 redhat-release-4AS-5.5]# rpm -ivh libibverbs-1.1-0.x86_64.rpm
Preparing...             ########################################### [100%]
   1:libibverbs          ########################################### [100%]

[root at sfw1 redhat-release-4AS-5.5]# rpm -qp --qf "%{arch}\n"
libibverbs-1.1-0.x86_64.rpm
x86_64

[root at sfw1 redhat-release-4AS-5.5]# cd -
/usr/lib64
[root at sfw1 lib64]# rpm -aq |grep libibverbs
libibverbs-1.1-0

[root at sfw1 lib64]# ll libibverbs*
lrwxrwxrwx  1 root root     19 May  3 13:50 libibverbs.so.1 ->
libibverbs.so.1.0.0
-rwxr-xr-x  1 root root 200993 May  3 13:18 libibverbs.so.1.0.0

[root at sfw1 lib64]# file libibverbs.so.1.0.0
libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD x86-64, version 1
(SYSV), not stripped

[root at sfw1 lib]# cd /usr/lib
[root at sfw1 lib]# file libibverbs.so.1.0.0
libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel 80386, version 1
(SYSV), not stripped

[root at sfw1 etc]# cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
/usr/ofed/lib64

[root at sfw1 etc]# cat /etc/ld.so.conf.d/ofed.conf
/usr/lib64
/usr/lib
   

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org
[mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Steffen Persvold
Sent: Thursday, May 03, 2007 10:26 AM
To: vlad at dev.mellanox.co.il
Cc: openfabrics-ewg at openib.org; openib-general at openib.org
Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64

Vladimir,

Nope. Still the same issue. The RPMs will only contain one set of
libraries and it is always in /usr/lib (if I set the build_32bit=0
option I get the 64bit libraries but in the wrong directory).

Seriously, am I the only one seeing this ? I would think rhel4 u4 was a
very normal test platform ?

Cheers,

Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter


> -----Original Message-----
> From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il]
> Sent: Thursday, May 03, 2007 9:07 AM
> To: Steffen Persvold
> Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
>
> Please see if this happens in OFED-1.2-20070503-0600.
> But first uninstall the previous OFED version with ofed_uninstall.sh
> command.
>
> Thanks,
>
> Regards,
> Vladimir
>
> On Wed, 2007-05-02 at 11:30 -0400, Steffen Persvold wrote:
> > Hmm,
> >
> > so I tried something. I put :
> >
> > build_32bit=0
> >
> > into my ofed.conf file and rebuilt (build.sh -c ofed.conf). This time
> > it built 64bit libraries, but it puts them in the wrong directory :
> >
> > # rpm -qpl ../libibverbs-1.1-0.x86_64.rpm
> > /etc/ld.so.conf.d/ofed.conf
> > /usr/lib/libibverbs.so.1
> > /usr/lib/libibverbs.so.1.0.0
> >
> > # file /usr/lib/libibverbs.so.1.0.0
> > /usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD
> > x86-64, version 1 (SYSV), not stripped
> >
> > So what's up ??
> >
> > Cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
> > ______________________________________________________________________
> > From: Steffen Persvold
> > Sent: Wed 5/2/2007 10:30 AM
> > To: Steffen Persvold; Vladimir Sokolovsky
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Also,
> >
> > If I look at the /etc/ld.so.conf/ofed.conf file I have :
> >
> > # cat ofed.conf
> > /usr/lib
> > /usr/lib
> >
> >
> > which seems kinda weird ? :)
> >
> > Cheers,
> >
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
> > ______________________________________________________________________
> > From: ewg-bounces at lists.openfabrics.org on behalf of Steffen Persvold
> > Sent: Wed 5/2/2007 10:20 AM
> > To: Vladimir Sokolovsky
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Nope :
> >
> >
> > [redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
> > /etc/ld.so.conf.d/ofed.conf
> > /usr/lib/libibverbs.so.1
> > /usr/lib/libibverbs.so.1.0.0
> > [redhat-release-4ES-5.5]#
> >
> > So the RPM got built, but without 64bit libraries. Now if it was the
> > other way around (i.e no 32bit libraries) I could have understood it
> > (as 32bit is an option on x86_64), but not having the native 64bit
> > libraries is not so easy to understand :)
> >
> > cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
> > ______________________________________________________________________
> > From: ewg-bounces at lists.openfabrics.org on behalf of Vladimir
> > Sokolovsky
> > Sent: Wed 5/2/2007 10:05 AM
> > To: Steffen Persvold
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Don't you have /usr/lib64/libibverbs.so.1.0.0?
> >
> > Regards,
> > Vladimir
> >
> > On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote:
> > > Folks,
> > >
> > > I used the build.sh script to build the above mentioned packages on
> > > rhel4u4 x86_64, but for some reason it only compiles 32bit libraries
> > > (even if the packages are named x86_64) :
> > >
> > > # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
> > > x86_64
> > >
> > > (after installing it) :
> > >
> > > # file /usr/lib/libibverbs.so.1.0.0
> > > /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel
> > > 80386, version 1 (SYSV), not stripped
> > >
> > > What did I do wrong ??
> > >
> > > Cheers,
> > > Steffen Persvold
> > > Technical Director Americas
> > > tel. 508-281-7100 x401
> > > fax. 508-281-7171
> > >
> > > http://www.scali.com/
> > > Scaling the Linux datacenter
> > > _______________________________________________
> > > ewg mailing list
> > > ewg at lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> >
> >
> >
> >


From rick.jones2 at hp.com  Fri May  4 10:11:28 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Fri, 04 May 2007 10:11:28 -0700
Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian
In-Reply-To: <20070504075459.GI4829@mellanox.co.il>
References: <463901A0.5060905@hp.com>
	<20070502214944.GF10009@mellanox.co.il>	<463A1C9A.6060706@hp.com>
	<20070503174817.GC9719@mellanox.co.il>	<463A22F3.4090108@hp.com>
	<20070504075459.GI4829@mellanox.co.il>
Message-ID: <463B6940.2000500@hp.com>

Michael S. Tsirkin wrote:
> OK.
> Apply these 2 patches after configure:

Are they already in the latest nightly?

rick jones

> 
> 
> 
> ------------------------------------------------------------------------
> 
> commit ecbb416939da77c0d107409976499724baddce7b
> Author: Alexey Kuznetsov <kuznet at ms2.inr.ac.ru>
> Date:   Sat Mar 24 12:52:16 2007 -0700
> 
>     [NET]: Fix neighbour destructor handling.
>     
>     ->neigh_destructor() is killed (not used), replaced with
>     ->neigh_cleanup(), which is called when neighbor entry goes to dead
>     state. At this point everything is still valid: neigh->dev,
>     neigh->parms etc.
>     
>     The device should guarantee that dead neighbor entries (neigh->dead !=
>     0) do not get private part initialized, otherwise nobody will cleanup
>     it.
>     
>     I think this is enough for ipoib which is the only user of this thing.
>     Initialization private part of neighbor entries happens in ipib
>     start_xmit routine, which is not reached when device is down.  But it
>     would be better to add explicit test for neigh->dead in any case.
>     
>     Signed-off-by: David S. Miller <davem at davemloft.net>
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 0741c6d..f2a40ae 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -814,7 +814,7 @@ static void ipoib_set_mcast_list(struct net_device *dev)
>  	queue_work(ipoib_workqueue, &priv->restart_task);
>  }
>  
> -static void ipoib_neigh_destructor(struct neighbour *n)
> +static void ipoib_neigh_cleanup(struct neighbour *n)
>  {
>  	struct ipoib_neigh *neigh;
>  	struct ipoib_dev_priv *priv = netdev_priv(n->dev);
> @@ -822,7 +822,7 @@ static void ipoib_neigh_destructor(struct neighbour *n)
>  	struct ipoib_ah *ah = NULL;
>  
>  	ipoib_dbg(priv,
> -		  "neigh_destructor for %06x " IPOIB_GID_FMT "\n",
> +		  "neigh_cleanup for %06x " IPOIB_GID_FMT "\n",
>  		  IPOIB_QPN(n->ha),
>  		  IPOIB_GID_RAW_ARG(n->ha + 4));
>  
> @@ -874,7 +874,7 @@ void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh)
>  
>  static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms)
>  {
> -	parms->neigh_destructor = ipoib_neigh_destructor;
> +	parms->neigh_cleanup = ipoib_neigh_cleanup;
>  
>  	return 0;
>  }
> 
> 
> ------------------------------------------------------------------------
> 
> commit 43cb76d91ee85f579a69d42bc8efc08bac560278
> Author: Greg Kroah-Hartman <gregkh at suse.de>
> Date:   Tue Apr 9 12:14:34 2002 -0700
> 
>     Network: convert network devices to use struct device instead of class_device
>     
>     This lets the network core have the ability to handle suspend/resume
>     issues, if it wants to.
>     
>     Thanks to Frederik Deweerdt <frederik.deweerdt at gmail.com> for the arm
>     driver fixes.
>     
>     Signed-off-by: Greg Kroah-Hartman <gregkh at suse.de>
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 705eb1d..af5ee2e 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -958,16 +958,17 @@ struct ipoib_dev_priv *ipoib_intf_alloc(const char *name)
>  	return netdev_priv(dev);
>  }
>  
> -static ssize_t show_pkey(struct class_device *cdev, char *buf)
> +static ssize_t show_pkey(struct device *dev,
> +			 struct device_attribute *attr, char *buf)
>  {
> -	struct ipoib_dev_priv *priv =
> -		netdev_priv(container_of(cdev, struct net_device, class_dev));
> +	struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev));
>  
>  	return sprintf(buf, "0x%04x\n", priv->pkey);
>  }
> -static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL);
> +static DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL);
>  
> -static ssize_t create_child(struct class_device *cdev,
> +static ssize_t create_child(struct device *dev,
> +			    struct device_attribute *attr,
>  			    const char *buf, size_t count)
>  {
>  	int pkey;
> @@ -985,14 +986,14 @@ static ssize_t create_child(struct class_device *cdev,
>  	 */
>  	pkey |= 0x8000;
>  
> -	ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev),
> -			     pkey);
> +	ret = ipoib_vlan_add(to_net_dev(dev), pkey);
>  
>  	return ret ? ret : count;
>  }
> -static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child);
> +static DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child);
>  
> -static ssize_t delete_child(struct class_device *cdev,
> +static ssize_t delete_child(struct device *dev,
> +			    struct device_attribute *attr,
>  			    const char *buf, size_t count)
>  {
>  	int pkey;
> @@ -1004,18 +1005,16 @@ static ssize_t delete_child(struct class_device *cdev,
>  	if (pkey < 0 || pkey > 0xffff)
>  		return -EINVAL;
>  
> -	ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev),
> -				pkey);
> +	ret = ipoib_vlan_delete(to_net_dev(dev), pkey);
>  
>  	return ret ? ret : count;
>  
>  }
> -static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child);
> +static DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child);
>  
>  int ipoib_add_pkey_attr(struct net_device *dev)
>  {
> -	return class_device_create_file(&dev->class_dev,
> -					&class_device_attr_pkey);
> +	return device_create_file(&dev->dev, &dev_attr_pkey);
>  }
>  
>  static struct net_device *ipoib_add_port(const char *format,
> @@ -1083,11 +1082,9 @@ static struct net_device *ipoib_add_port(const char *format,
>  
>  	if (ipoib_add_pkey_attr(priv->dev))
>  		goto sysfs_failed;
> -	if (class_device_create_file(&priv->dev->class_dev,
> -				     &class_device_attr_create_child))
> +	if (device_create_file(&priv->dev->dev, &dev_attr_create_child))
>  		goto sysfs_failed;
> -	if (class_device_create_file(&priv->dev->class_dev,
> -				     &class_device_attr_delete_child))
> +	if (device_create_file(&priv->dev->dev, &dev_attr_delete_child))
>  		goto sysfs_failed;
>  
>  	return priv->dev;
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
> index f887780..085eafe 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
> @@ -42,15 +42,15 @@
>  
>  #include "ipoib.h"
>  
> -static ssize_t show_parent(struct class_device *class_dev, char *buf)
> +static ssize_t show_parent(struct device *d, struct device_attribute *attr,
> +			   char *buf)
>  {
> -	struct net_device *dev =
> -		container_of(class_dev, struct net_device, class_dev);
> +	struct net_device *dev = to_net_dev(d);
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  
>  	return sprintf(buf, "%s\n", priv->parent->name);
>  }
> -static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL);
> +static DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL);
>  
>  int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
>  {
> @@ -118,8 +118,7 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
>  	if (ipoib_add_pkey_attr(priv->dev))
>  		goto sysfs_failed;
>  
> -	if (class_device_create_file(&priv->dev->class_dev,
> -				     &class_device_attr_parent))
> +	if (device_create_file(&priv->dev->dev, &dev_attr_parent))
>  		goto sysfs_failed;
>  
>  	list_add_tail(&priv->list, &ppriv->child_intfs);


From mhagen at iol.unh.edu  Fri May  4 12:39:21 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Fri, 4 May 2007 15:39:21 -0400 (EDT)
Subject: [ofa-general] [PATCH] infiniband: add userspace support for
	invalidate stag
Message-ID: <53312.132.177.125.178.1178307561.squirrel@postal.iol.unh.edu>

--- linux-2.6.21.1/include/rdma/ib_user_verbs.h	2007-05-02 15:35:13.000000000 -0400
+++ linux-2.6.21.1/include/rdma/ib_user_verbs.h	2007-05-02 15:53:40.000000000 -0400
@@ -553,6 +553,10 @@ struct ib_uverbs_send_wr {
 			__u32 remote_qkey;
 			__u32 reserved;
 		} ud;
+		struct {
+			__u32 rkey;
+			__u32 reserved;
+		} invalidate;
 	} wr;
 };


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From mhagen at iol.unh.edu  Fri May  4 12:39:55 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Fri, 4 May 2007 15:39:55 -0400 (EDT)
Subject: [ofa-general] [PATCH] infiniband: add userspace support for
	invalidate stag
Message-ID: <53313.132.177.125.178.1178307595.squirrel@postal.iol.unh.edu>

--- linux-2.6.21.1/drivers/infiniband/core/uverbs_cmd.c	2007-05-04 14:25:50.000000000 -0400
+++ linux-2.6.21.1/drivers/infiniband/core/uverbs_cmd.c	2007-05-04 14:47:42.000000000 -0400
@@ -1507,6 +1507,12 @@ ssize_t ib_uverbs_post_send(struct ib_uv
 				next->wr.atomic.swap = user_wr->wr.atomic.swap;
 				next->wr.atomic.rkey = user_wr->wr.atomic.rkey;
 				break;
+			case IB_WR_SEND:
+				if (next->send_flags & IB_SEND_INVALIDATE) {
+					next->wr.invalidate.rkey =
+						user_wr->wr.invalidate.rkey;
+				}
+				break;
 			default:
 				break;
 			}


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From pradeeps at linux.vnet.ibm.com  Fri May  4 12:41:18 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Fri, 04 May 2007 12:41:18 -0700
Subject: [ofa-general] Queue Pair in error state
Message-ID: <463B8C5E.3060005@linux.vnet.ibm.com>

If packets are received by a queue pair that has gone into an error state,
which of the following is to be expected:

1. The packet gets dropped by the hardware and the sender is notified with
an error.
2. The packet gets delivered to the receiver and the work completion
handler needs to deal with it.

Pradeep


From mhagen at iol.unh.edu  Fri May  4 12:40:55 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Fri, 4 May 2007 15:40:55 -0400 (EDT)
Subject: [ofa-general] [PATCH] infiniband: add userspace support for
	invalidate stag
Message-ID: <53314.132.177.125.178.1178307655.squirrel@postal.iol.unh.edu>


--- ofa_1_2_user-20070502-0200/src/userspace/libibverbs/include/infiniband/verbs.h	2007-05-03 10:11:23.000000000 -0400
+++ ofa_1_2_user-20070502-0200/src/userspace/libibverbs/include/infiniband/verbs.h	2007-05-03 10:12:32.000000000 -0400
@@ -492,7 +492,8 @@ enum ibv_send_flags {
 	IBV_SEND_FENCE		= 1 << 0,
 	IBV_SEND_SIGNALED	= 1 << 1,
 	IBV_SEND_SOLICITED	= 1 << 2,
-	IBV_SEND_INLINE		= 1 << 3
+	IBV_SEND_INLINE		= 1 << 3,
+	IBV_SEND_INVALIDATE	= 1 << 4
 };

 struct ibv_sge {
@@ -525,6 +526,9 @@ struct ibv_send_wr {
 			uint32_t	remote_qpn;
 			uint32_t	remote_qkey;
 		} ud;
+		struct {
+			uint32_t	rkey;
+		} invalidate;
 	} wr;
 };


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From mhagen at iol.unh.edu  Fri May  4 12:41:34 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Fri, 4 May 2007 15:41:34 -0400 (EDT)
Subject: [ofa-general] [PATCH] infiniband: add userspace support for
	invalidate stag
Message-ID: <43731.132.177.125.178.1178307694.squirrel@postal.iol.unh.edu>

--- ofa_1_2_user-20070502-0200/src/userspace/libibverbs/include/infiniband/kern-abi.h	2007-05-03 10:36:13.000000000 -0400
+++ ofa_1_2_user-20070502-0200/src/userspace/libibverbs/include/infiniband/kern-abi.h	2007-05-03 10:37:39.000000000 -0400
@@ -592,6 +592,10 @@ struct ibv_kern_send_wr {
 			__u32 remote_qkey;
 			__u32 reserved;
 		} ud;
+		struct {
+			__u32 rkey;
+			__u32 reserved;
+		} invalidate;
 	} wr;
 };


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From mhagen at iol.unh.edu  Fri May  4 12:42:09 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Fri, 4 May 2007 15:42:09 -0400 (EDT)
Subject: [ofa-general] [PATCH] infiniband: add userspace support for
	invalidate stag
Message-ID: <43733.132.177.125.178.1178307729.squirrel@postal.iol.unh.edu>


--- ofa_1_2_user-20070502-0200/src/userspace/libibverbs/src/cmd.c	2007-05-02 05:00:25.000000000 -0400
+++ ofa_1_2_user-20070502-0200/src/userspace/libibverbs/src/cmd.c	2007-05-04 15:19:36.000000000 -0400
@@ -857,6 +857,12 @@ int ibv_cmd_post_send(struct ibv_qp *ibq
 				tmp->wr.atomic.swap = i->wr.atomic.swap;
 				tmp->wr.atomic.rkey = i->wr.atomic.rkey;
 				break;
+			case IBV_WR_SEND:
+				if (tmp->send_flags & IBV_SEND_INVALIDATE) {
+					tmp->wr.invalidate.rkey =
+						i->wr.invalidate.rkey;
+				}
+				break;
 			default:
 				break;
 			}


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From steffen.persvold at scali.com  Fri May  4 14:29:34 2007
From: steffen.persvold at scali.com (Steffen Persvold)
Date: Fri, 4 May 2007 17:29:34 -0400
Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
References: <008d01c78e58$78df8810$c801a8c0@ettac>
Message-ID: <D6A583C768392A4D8B297C500CDD54B5015736BE@mse11be1.mse11.exchange.ms>

Etta,
 
Of course my system has the /usr/lib64 directory. Using install or build doesn't seem to make a difference. The problem seems to be that when the 64bit libraries are compiled and installed, they're installed in <RPM build path>/usr/lib and not <RPM build path>/usr/lib64, so when rpmbuild gets to compiling and installing the 32bit libraries, the 64bit libraries are overwritten. I don't know enough about the Makefiles and configure scripts inside the .src.rpm files to understand exactly why the build installs the 64bit libraries in /usr/lib and not in /usr/lib64.
 
Anyone have any insight on that ??
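
For anyone who wants to check which variant actually landed in a given path, independent of the package name: the ELF class byte is what file(1) reports as "32-bit" or "64-bit". A minimal sketch, assuming nothing beyond the ELF identification layout. The function names are made up for illustration and are not part of OFED:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Offset and values of the class byte in the ELF identification block. */
enum { ELF_IDENT_CLASS = 4, ELF_CLASS_32 = 1, ELF_CLASS_64 = 2 };

/*
 * Return 32 or 64 for a valid ELF identification block, or 0 if the
 * buffer is too short or does not start with the ELF magic.
 */
int elf_word_size(const unsigned char *ident, size_t len)
{
	assert(ident != NULL);
	if (len < 5)
		return 0;
	if (ident[0] != 0x7f || ident[1] != 'E' ||
	    ident[2] != 'L' || ident[3] != 'F')
		return 0;
	switch (ident[ELF_IDENT_CLASS]) {
	case ELF_CLASS_32:
		return 32;
	case ELF_CLASS_64:
		return 64;
	default:
		return 0;
	}
}

/* Read the first bytes of a file and classify it; 0 on any error. */
int elf_word_size_of_file(const char *path)
{
	unsigned char ident[5];
	size_t n;
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "rb");
	if (!f)
		return 0;
	n = fread(ident, 1, sizeof(ident), f);
	fclose(f);
	return elf_word_size(ident, n);
}
```

Calling elf_word_size_of_file() on the libibverbs.so.1.0.0 under the RPM build root after each pass would show whether the second pass replaced the output of the first.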
 
Cheers,
 
Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter

________________________________

From: Chieng Etta [mailto:etta at systemfabricworks.com]
Sent: Fri 5/4/2007 10:28 AM
To: Steffen Persvold; vlad at dev.mellanox.co.il
Cc: openfabrics-ewg at openib.org; openib-general at openib.org
Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64


Steffen,

 
The installation should be the same on either ES or AS.   

I assume that your system has a /usr/lib64 directory.  Are you able to install rc2 by using the ./install.sh script?

 
Thanks,

Etta

 
________________________________

From: Steffen Persvold [mailto:steffen.persvold at scali.com] 
Sent: Thursday, May 03, 2007 5:38 PM
To: Chieng Etta; vlad at dev.mellanox.co.il
Cc: openfabrics-ewg at openib.org; openib-general at openib.org
Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64

 
So I don't understand it then... Why do my RPMs contain only one of the two versions? I'm running on ES and not AS but that shouldn't really matter...

 
This output that you list :

 
[root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0
/usr/lib64/libibverbs.so.1
/usr/lib64/libibverbs.so.1.0.0

Is exactly what I would have expected as well, but my RPM says :

 
[root at pe1850-1 redhat-release-4ES-5.5]# pwd
/root/OFED-1.2-rc2/RPMS/redhat-release-4ES-5.5
[root at pe1850-1 redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0

I'm looking through the build log (/tmp/OFED.build.xxx.log) and both versions get compiled, but it looks like the 32bit libraries (which get compiled last) overwrite the 64bit libraries in the "make install" section because both end up in /usr/lib :

 
(64bit section of the build) :

 
/usr/bin/install -c src/.libs/libibverbs.so.1.0.0 /var/tmp/OFED/usr/lib/libibverbs.so.1.0.0
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so.1 || { rm -f libibverbs.so.1 && ln -s libibverbs.so.1.0.0 libibverbs.so.1; }; })
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so || { rm -f libibverbs.so && ln -s libibverbs.so.1.0.0 libibverbs.so; }; })

 
(32bit section of the build) :

/usr/bin/install -c src/.libs/libibverbs.so.1.0.0 /var/tmp/OFED/usr/lib/libibverbs.so.1.0.0
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so.1 || { rm -f libibverbs.so.1 && ln -s libibverbs.so.1.0.0 libibverbs.so.1; }; })
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so || { rm -f libibverbs.so && ln -s libibverbs.so.1.0.0 libibverbs.so; }; })

 
So the question is, why is the 64bit section ending up in <buildpath>/usr/lib in the first place ???

 
I do see this though :

 
/bin/rm -f /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache
cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs
Running: env ac_cv_lib_ibverbs_ibv_get_device_list=yes ac_cv_header_infiniband_driver_h=yes ac_cv_func_ibv_read_sysfs_file=yes ac_cv_func_ibv_dontfork_range=yes ac_cv_func_ibv_dofork_range=yes ac_cv_func_ibv_register_driver=yes HAVE_IBV_DEVICE_LIBRARY_EXTENSION_TRUE=yes  ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache --disable-libcheck --prefix /usr --libdir /usr/lib --mandir=/usr/man --sysconfdir=/usr/etc CPPFLAGS="-I../libibverbs/include"

 
--libdir /usr/lib ??? shouldn't that be --libdir /usr/lib64 for the 64bit section ?
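
To spell out what I would have expected the build to do, here is a sketch of the libdir choice per build pass. This is purely illustrative: the helper name, its parameters, and the list of architectures are my assumptions (following the usual RHEL multilib layout), not anything taken from the OFED build scripts.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: pick the library directory for one build pass.
 * arch is the RPM-style architecture name; want_32bit selects the
 * 32bit compatibility pass on a 64bit host.
 */
const char *libdir_for_pass(const char *arch, int want_32bit)
{
	assert(arch != NULL);
	/* Native 64bit pass on a multilib architecture: use lib64. */
	if (!want_32bit &&
	    (strcmp(arch, "x86_64") == 0 || strcmp(arch, "ppc64") == 0))
		return "/usr/lib64";
	/* 32bit pass, or an architecture without a lib64 split. */
	return "/usr/lib";
}
```

With a rule like this the two passes would write to different directories, and the second "make install" could not overwrite the first.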

 
Cheers,

Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171
 
http://www.scali.com/
Scaling the Linux datacenter

 
________________________________

From: Chieng Etta [mailto:etta at systemfabricworks.com]
Sent: Thu 5/3/2007 3:26 PM
To: Steffen Persvold; vlad at dev.mellanox.co.il
Cc: openfabrics-ewg at openib.org; openib-general at openib.org
Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64

Hi Steffen,

After removing all the OFED packages by using ./uninstall.sh, I tried
./build.sh to build the RPMs, then installed libibverbs-1.1-0.x86_64.rpm onto
the system.  "libibverbs.so.1.0.0" was installed under the right directories
(/usr/lib and /usr/lib64).  Please see the output below.
Thanks,
Etta

[root at sfw1 etc]# cat /etc/*release
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
[root at sfw1 etc]# uname -a
Linux sfw1.sfw.int 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

[root at sfw1 lib64]# pwd
/usr/lib64
[root at sfw1 lib64]# ll libibverbs*
ls: libibverbs*: No such file or directory

[root at sfw1 lib64]# rpm -aq |grep libibverbs

[root at sfw1 lib64]# cd -
/root/images/OFED-1.2-rc2/RPMS/redhat-release-4AS-5.5
[root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0
/usr/lib64/libibverbs.so.1
/usr/lib64/libibverbs.so.1.0.0

[root at sfw1 redhat-release-4AS-5.5]# rpm -ivh libibverbs-1.1-0.x86_64.rpm
Preparing...             ########################################### [100%]
   1:libibverbs          ########################################### [100%]

[root at sfw1 redhat-release-4AS-5.5]# rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
x86_64

[root at sfw1 redhat-release-4AS-5.5]# cd -
/usr/lib64
[root at sfw1 lib64]# rpm -aq |grep libibverbs
libibverbs-1.1-0

[root at sfw1 lib64]# ll libibverbs*
lrwxrwxrwx  1 root root     19 May  3 13:50 libibverbs.so.1 -> libibverbs.so.1.0.0
-rwxr-xr-x  1 root root 200993 May  3 13:18 libibverbs.so.1.0.0

[root at sfw1 lib64]# file libibverbs.so.1.0.0
libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped

[root at sfw1 lib]# cd /usr/lib
[root at sfw1 lib]# file libibverbs.so.1.0.0
libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), not stripped

[root at sfw1 etc]# cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
/usr/ofed/lib64

[root at sfw1 etc]# cat /etc/ld.so.conf.d/ofed.conf
/usr/lib64
/usr/lib
   

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org
[mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Steffen Persvold
Sent: Thursday, May 03, 2007 10:26 AM
To: vlad at dev.mellanox.co.il
Cc: openfabrics-ewg at openib.org; openib-general at openib.org
Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64

Vladimir,

Nope. Still the same issue. The RPMs will only contain one set of
libraries and it is always in /usr/lib (if I set the build_32bit=0
option I get the 64bit libraries but in the wrong directory).

Seriously, am I the only one seeing this ? I would think rhel4 u4 was a
very normal test platform ?

Cheers,

Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter


> -----Original Message-----
> From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il]
> Sent: Thursday, May 03, 2007 9:07 AM
> To: Steffen Persvold
> Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
>
> Please see if this happens in OFED-1.2-20070503-0600.
> But first uninstall the previous OFED version with ofed_uninstall.sh
> command.
>
> Thanks,
>
> Regards,
> Vladimir
>
> On Wed, 2007-05-02 at 11:30 -0400, Steffen Persvold wrote:
> > Hmm,
> >
> > so I tried something. I put :
> >
> > build_32bit=0
> >
> > into my ofed.conf file and rebuilt (build.sh -c ofed.conf). This time
> > it built 64bit libraries, but it puts them in the wrong directory :
> >
> > # rpm -qpl ../libibverbs-1.1-0.x86_64.rpm
> > /etc/ld.so.conf.d/ofed.conf
> > /usr/lib/libibverbs.so.1
> > /usr/lib/libibverbs.so.1.0.0
> >
> > # file /usr/lib/libibverbs.so.1.0.0
> > /usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD
> > x86-64, version 1 (SYSV), not stripped
> >
> > So what's up ??
> >
> > Cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
______________________________________________________________________
> > From: Steffen Persvold
> > Sent: Wed 5/2/2007 10:30 AM
> > To: Steffen Persvold; Vladimir Sokolovsky
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Also,
> >
> > If I look at the /etc/ld.so.conf.d/ofed.conf file I have :
> >
> > # cat ofed.conf
> > /usr/lib
> > /usr/lib
> >
> >
> > which seems kinda weird ? :)
> >
> > Cheers,
> >
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
______________________________________________________________________
> > From: ewg-bounces at lists.openfabrics.org on behalf of Steffen
Persvold
> > Sent: Wed 5/2/2007 10:20 AM
> > To: Vladimir Sokolovsky
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Nope :
> >
> >
> > [redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
> > /etc/ld.so.conf.d/ofed.conf
> > /usr/lib/libibverbs.so.1
> > /usr/lib/libibverbs.so.1.0.0
> > [redhat-release-4ES-5.5]#
> >
> > So the RPM got built, but without 64bit libraries. Now if it was the
> > other way around (i.e no 32bit libraries) I could have understood it
> > (as 32bit is an option on x86_64), but not having the native 64bit
> > libraries is not so easy to understand :)
> >
> > cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> >
______________________________________________________________________
> > From: ewg-bounces at lists.openfabrics.org on behalf of Vladimir
> > Sokolovsky
> > Sent: Wed 5/2/2007 10:05 AM
> > To: Steffen Persvold
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Don't you have /usr/lib64/libibverbs.so.1.0.0?
> >
> > Regards,
> > Vladimir
> >
> > On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote:
> > > Folks,
> > >
> > > I used the build.sh script to build the above mentioned packages on
> > > rhel4u4 x86_64, but for some reason it only compiles 32bit libraries
> > > (even if the packages are named x86_64) :
> > >
> > > # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
> > > x86_64
> > >
> > > (after installing it) :
> > >
> > > # file /usr/lib/libibverbs.so.1.0.0
> > > /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel
> > > 80386, version 1 (SYSV), not stripped
> > >
> > > What did I do wrong ??
> > >
> > > Cheers,
> > > Steffen Persvold
> > > Technical Director Americas
> > > tel. 508-281-7100 x401
> > > fax. 508-281-7171
> > >
> > > http://www.scali.com/
> > > Scaling the Linux datacenter


From rdreier at cisco.com  Fri May  4 14:50:04 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 04 May 2007 14:50:04 -0700
Subject: [ofa-general] [PATCH] infiniband: add userspace support for
	invalidate stag
In-Reply-To: <53312.132.177.125.178.1178307561.squirrel@postal.iol.unh.edu>
	(mhagen@iol.unh.edu's message of "Fri,
	4 May 2007 15:39:21 -0400 (EDT)")
References: <53312.132.177.125.178.1178307561.squirrel@postal.iol.unh.edu>
Message-ID: <adalkg49s37.fsf@cisco.com>

A few general things:
 - please always submit patches with a changelog entry and
   Signed-off-by: line
 - please send patches in logical chunks.  Usually I'm complaining
   about people combining unrelated things into one patch, but in this
   case I think you divided the patch up too much -- rather than 5
   patches, this should probably be one kernel patch and one userspace
   patch.
 - please make libibverbs patches apply to the libibverbs git tree
   with -p1.  You seem to have generated patches against an OFED package.

OK, with that out of the way, I think there are still some issues to
sort out with how to handle send with invalidate from userspace.
These patches don't address the case of new userspace with
send-with-invalidate support talking to an unpatched kernel -- it
seems that send-with-invalidate would be silently turned into a plain
send request, which is not a very good failure mode.

I don't know what the right solution is yet -- a kernel ABI bump for
this one case (send with invalidate support for userspace drivers that
don't do kernel bypass == amso1100) is ugly.  Maybe we also need a
device capabilities bit that says whether send-with-invalidate is
supported?
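
To make the failure mode concrete, here is a toy model of the marshalling step. It is only a sketch: the structs are trimmed, hypothetical stand-ins for the real work request structures, copy_wr() is an invented name, and the flag values are the ones from the posted libibverbs patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values follow the posted libibverbs patch. */
enum {
	SEND_SIGNALED   = 1 << 1,	/* IBV_SEND_SIGNALED */
	SEND_INVALIDATE = 1 << 4,	/* proposed IBV_SEND_INVALIDATE */
};

/* Hypothetical, trimmed stand-ins for the user and kernel requests. */
struct user_wr   { int send_flags; uint32_t invalidate_rkey; };
struct kernel_wr { int send_flags; uint32_t invalidate_rkey; };

/*
 * Model of the copy from the user request to the kernel request.
 * kernel_is_patched == 0 mimics a kernel with no code that copies the
 * rkey.  Both variants return 0, which is the problem: the caller sees
 * success in either case.
 */
int copy_wr(int kernel_is_patched, const struct user_wr *in,
	    struct kernel_wr *out)
{
	assert(in != NULL && out != NULL);
	out->send_flags = in->send_flags;
	out->invalidate_rkey = 0;
	if (kernel_is_patched && (in->send_flags & SEND_INVALIDATE))
		out->invalidate_rkey = in->invalidate_rkey;
	return 0;
}
```

In the unpatched case the rkey never reaches the driver and nothing tells userspace that the invalidate was dropped.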

 - R.


From rdreier at cisco.com  Fri May  4 14:52:49 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 04 May 2007 14:52:49 -0700
Subject: [ofa-general] Re: [PATCH 0 of 3] comp_vector kernel support
In-Reply-To: <20070504062526.GB4829@mellanox.co.il> (Michael S. Tsirkin's
	message of "Fri, 4 May 2007 09:25:26 +0300")
References: <20070503104806.GC10009@mellanox.co.il>
	<adahcqteozq.fsf@cisco.com> <20070504062526.GB4829@mellanox.co.il>
Message-ID: <adad51g9rym.fsf@cisco.com>

 > How about just patches 1 and 2?
 > They don't do anything to *kernel* ULPs by themselves,
 > and give userspace ULPs opportunity to start using the feature.
 > We'll learn from that, and enhance kernel ULPs by 2.6.23.

I guess I could see doing the first patch (just support multiple
vectors in the kernel without changing any drivers).  That way it
would be easy to experiment with patched drivers that enable multiple
vectors.

I think there's still some figuring out to do about how many EQs to
enable, etc, and I think it would be better to prevent drivers from
escaping into the wild before we have a better handle on the issues.

 - R.


From swise at opengridcomputing.com  Fri May  4 17:32:58 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 04 May 2007 19:32:58 -0500
Subject: [ofa-general] [PATCH] infiniband: add userspace support for
	invalidate stag
In-Reply-To: <adalkg49s37.fsf@cisco.com>
References: <53312.132.177.125.178.1178307561.squirrel@postal.iol.unh.edu>
	<adalkg49s37.fsf@cisco.com>
Message-ID: <1178325178.3011.4.camel@stevo-laptop>

On Fri, 2007-05-04 at 14:50 -0700, Roland Dreier wrote:
> A few general things:
>  - please always submit patches with a changelog entry and
>    Signed-off-by: line
>  - please send patches in logical chunks.  Usually I'm complaining
>    about people combining unrelated things into one patch, but in this
>    case I think you divided the patch up too much -- rather than 5
>    patches, this should probably be one kernel patch and one userspace
>    patch.
>  - please make libibverbs patches apply to the libibverbs git tree
>    with -p1.  You seem to have generated patches against an OFED package.
> 
> OK, with that out of the way, I think there are still some issues to
> sort out with how to handle send with invalidate from userspace.
> These patches don't address the case of new userspace with
> send-with-invalidate support talking to an unpatched kernel -- it
> seems that send-with-invalidate would be silently turned into a plain
> send request, which is not a very good failure mode.
> 
> I don't know what the right solution is yet -- a kernel ABI bump for
> this one case (send with invalidate support for userspace drivers that
> don't do kernel bypass == amso1100) is ugly.  Maybe we also need a
> device capabilities bit that says whether send-with-invalidate is
> supported?
> 

There already exists a SEND-INV capabilities flag.

<snip>
        IB_DEVICE_SEND_W_INV            = (1<<16),

I think with the capabilities flag, we shouldn't worry about changing
the ABI.  But the drivers will need to set this flag.  Amso currently
does...
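
Roughly, the check I have in mind looks like the sketch below. The struct and function names are invented for illustration; only the two bit values come from this thread (IB_DEVICE_SEND_W_INV above and IBV_SEND_INVALIDATE from the posted patch), and where the check would actually live (library or provider driver) is still open.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum { DEVICE_SEND_W_INV = 1 << 16 };	/* IB_DEVICE_SEND_W_INV */
enum { SEND_INVALIDATE   = 1 << 4 };	/* proposed IBV_SEND_INVALIDATE */

/* Hypothetical, trimmed stand-in for the queried device attributes. */
struct device_attr_model {
	uint32_t device_cap_flags;
};

/*
 * Return 0 if a send with these flags may be posted on the device,
 * -EOPNOTSUPP if it asks for invalidate on a device that does not
 * advertise the capability.
 */
int check_send_flags(const struct device_attr_model *attr, int send_flags)
{
	assert(attr != NULL);
	if ((send_flags & SEND_INVALIDATE) &&
	    !(attr->device_cap_flags & DEVICE_SEND_W_INV))
		return -EOPNOTSUPP;
	return 0;
}
```

That way an application gets an explicit error instead of a plain send when the device does not advertise the flag.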

Steve.




From vlad at lists.openfabrics.org  Sat May  5 02:37:30 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat,  5 May 2007 02:37:30 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070505-0200 daily build status
Message-ID: <20070505093730.F188DE60927@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From swise at opengridcomputing.com  Sat May  5 09:00:59 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 05 May 2007 11:00:59 -0500
Subject: [ofa-general] [PATCH] infiniband: add userspace support for
	invalidate stag
In-Reply-To: <1178325178.3011.4.camel@stevo-laptop>
References: <53312.132.177.125.178.1178307561.squirrel@postal.iol.unh.edu>
	<adalkg49s37.fsf@cisco.com>  <1178325178.3011.4.camel@stevo-laptop>
Message-ID: <1178380859.8125.2.camel@stevo-desktop>

On Fri, 2007-05-04 at 19:32 -0500, Steve Wise wrote:
> On Fri, 2007-05-04 at 14:50 -0700, Roland Dreier wrote:
> > A few general things:
> >  - please always submit patches with a changelog entry and
> >    Signed-off-by: line
> >  - please send patches in logical chunks.  Usually I'm complaining
> >    about people combining unrelated things into one patch, but in this
> >    case I think you divided the patch up too much -- rather than 5
> >    patches, this should probably be one kernel patch and one userspace
> >    patch.
> >  - please make libibverbs patches apply to the libibverbs git tree
> >    with -p1.  You seem to have generated patches against an OFED package.
> > 
> > OK, with that out of the way, I think there are still some issues to
> > sort out with how to handle send with invalidate from userspace.
> > These patches don't address the case of new userspace with
> > send-with-invalidate support talking to an unpatched kernel -- it
> > seems that send-with-invalidate would be silently turned into a plain
> > send request, which is not a very good failure mode.
> > 
> > I don't know what the right solution is yet -- a kernel ABI bump for
> > this one case (send with invalidate support for userspace drivers that
> > don't do kernel bypass == amso1100) is ugly.  Maybe we also need a
> > device capabilities bit that says whether send-with-invalidate is
> > supported?
> > 
> 
> There already exists a SEND-INV capabilities flag.
> 
> <snip>
>         IB_DEVICE_SEND_W_INV            = (1<<16),
> 
> I think with the capabilities flag, we shouldn't worry about changing
> the ABI.  But the drivers will need to set this flag.  Amso currently
> does...

Actually, since Amso has set this flag since day one, it doesn't really
solve the ABI issue Roland describes.


Steve.
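
For readers following the thread, the capability check discussed above can be sketched roughly as follows. This is a minimal illustration only: the `demo_` names and the attribute struct are hypothetical stand-ins, and only the bit position (1<<16) is taken from the snippet quoted above. As the follow-up notes, such a check tells you what the device advertises; it does not by itself detect an unpatched kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the flag quoted above; the bit position matches
 * the quoted snippet (1 << 16), the names are for illustration only. */
enum demo_device_cap_flags {
	DEMO_DEVICE_SEND_W_INV = 1u << 16,
};

/* Hypothetical stand-in for the attributes a device query would return. */
struct demo_device_attr {
	uint32_t device_cap_flags;
};

/* Return true only when the device advertises send-with-invalidate. */
static bool demo_can_send_with_inv(const struct demo_device_attr *attr)
{
	return (attr->device_cap_flags & DEMO_DEVICE_SEND_W_INV) != 0;
}
```

A caller would fall back to a plain send, or refuse the request outright, when this returns false.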


From halr at voltaire.com  Sat May  5 10:38:48 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 May 2007 13:38:48 -0400
Subject: [ofa-general] [PATCH] IB/core: Enhance SMI for switch support
Message-ID: <1178386725.32222.188297.camel@hal.voltaire.com>

IB/core: Enhance SMI for switch support

SMI is extended for switch (intermediate hop) support. Care has
been taken to keep the CA (and router) code paths as close as
possible to how they were prior to adding this support.

Signed-off-by: Suresh Shelvapille <suri at baymicrosystems.com>
Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c
index ecd1a30..7583941 100644
--- a/drivers/infiniband/core/agent.c
+++ b/drivers/infiniband/core/agent.c
@@ -3,7 +3,7 @@
  * Copyright (c) 2004, 2005 Infinicon Corporation.  All rights reserved.
  * Copyright (c) 2004, 2005 Intel Corporation.  All rights reserved.
  * Copyright (c) 2004, 2005 Topspin Corporation.  All rights reserved.
- * Copyright (c) 2004, 2005 Voltaire Corporation.  All rights reserved.
+ * Copyright (c) 2004-2007 Voltaire Corporation.  All rights reserved.
  * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -34,7 +34,6 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: agent.c 1389 2004-12-27 22:56:47Z roland $
  */
 
 #include <linux/slab.h>
@@ -42,6 +41,7 @@
 
 #include "agent.h"
 #include "smi.h"
+#include "mad_priv.h"
 
 #define SPFX "ib_agent: "
 
@@ -87,8 +87,13 @@ int agent_send_response(struct ib_mad *m
 	struct ib_mad_send_buf *send_buf;
 	struct ib_ah *ah;
 	int ret;
+	struct ib_mad_send_wr_private *mad_send_wr;
+
+	if (device->node_type == RDMA_NODE_IB_SWITCH)
+		port_priv = ib_get_agent_port(device, 0);
+	else
+		port_priv = ib_get_agent_port(device, port_num);
 
-	port_priv = ib_get_agent_port(device, port_num);
 	if (!port_priv) {
 		printk(KERN_ERR SPFX "Unable to find port agent\n");
 		return -ENODEV;
@@ -113,6 +118,14 @@ int agent_send_response(struct ib_mad *m
 
 	memcpy(send_buf->mad, mad, sizeof *mad);
 	send_buf->ah = ah;
+	
+	if (device->node_type == RDMA_NODE_IB_SWITCH){
+		mad_send_wr = container_of(send_buf,
+				  	   struct ib_mad_send_wr_private,
+					   send_buf);
+		mad_send_wr->send_wr.wr.ud.port_num = port_num;
+	}
+	
 	if ((ret = ib_post_send_mad(send_buf, NULL))) {
 		printk(KERN_ERR SPFX "ib_post_send_mad error:%d\n", ret);
 		goto err2;
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 6edfecf..70b4adc 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -675,10 +675,16 @@ static int handle_outgoing_dr_smp(struct
 	struct ib_mad_port_private *port_priv;
 	struct ib_mad_agent_private *recv_mad_agent = NULL;
 	struct ib_device *device = mad_agent_priv->agent.device;
-	u8 port_num = mad_agent_priv->agent.port_num;
+	u8 port_num;
 	struct ib_wc mad_wc;
 	struct ib_send_wr *send_wr = &mad_send_wr->send_wr;
 
+	if (device->node_type == RDMA_NODE_IB_SWITCH &&
+	    smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
+		port_num = send_wr->wr.ud.port_num;
+	else
+		port_num = mad_agent_priv->agent.port_num;
+
 	/*
 	 * Directed route handling starts if the initial LID routed part of
 	 * a request or the ending LID routed part of a response is empty.
@@ -1839,6 +1845,7 @@ static void ib_mad_recv_done_handler(str
 	struct ib_mad_private *recv, *response;
 	struct ib_mad_list_head *mad_list;
 	struct ib_mad_agent_private *mad_agent;
+	int port_num;
 
 	response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL);
 	if (!response)
@@ -1872,25 +1879,50 @@ static void ib_mad_recv_done_handler(str
 	if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num))
 		goto out;
 
+	if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH)
+		port_num = wc->port_num;
+	else
+		port_num = port_priv->port_num;
+
 	if (recv->mad.mad.mad_hdr.mgmt_class ==
 	    IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) {
+		enum smi_forward_action retsmi;
+
 		if (smi_handle_dr_smp_recv(&recv->mad.smp,
 					   port_priv->device->node_type,
-					   port_priv->port_num,
+					   port_num,
 					   port_priv->device->phys_port_cnt) ==
 					   IB_SMI_DISCARD)
 			goto out;
 
-		if (smi_check_forward_dr_smp(&recv->mad.smp) == IB_SMI_LOCAL)
+		retsmi = smi_check_forward_dr_smp(&recv->mad.smp);
+		if (retsmi == IB_SMI_LOCAL)
 			goto local;
 
-		if (smi_handle_dr_smp_send(&recv->mad.smp,
-					   port_priv->device->node_type,
-					   port_priv->port_num) == IB_SMI_DISCARD)
-			goto out;
+		if (retsmi == IB_SMI_SEND) { /* don't forward */
+			if (smi_handle_dr_smp_send(&recv->mad.smp,
+						   port_priv->device->node_type,
+						   port_num) == IB_SMI_DISCARD)
+				goto out;
+
+			if (smi_check_local_smp(&recv->mad.smp, port_priv->device) == IB_SMI_DISCARD)
+				goto out;
+		} else if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) {
+			/* forward case for switches */
+			memcpy(response, recv, sizeof(*response));
+			response->header.recv_wc.wc = &response->header.wc;
+			response->header.recv_wc.recv_buf.mad = &response->mad.mad;
+			response->header.recv_wc.recv_buf.grh = &response->grh;
+
+			if (!agent_send_response(&response->mad.mad,
+						 &response->grh, wc,
+						 port_priv->device,
+						 smi_get_fwd_port(&recv->mad.smp),
+						 qp_info->qp->qp_num))
+				response = NULL;
 
-		if (smi_check_local_smp(&recv->mad.smp, port_priv->device) == IB_SMI_DISCARD)
 			goto out;
+		}
 	}
 
 local:
@@ -1919,7 +1951,7 @@ local:
 				agent_send_response(&response->mad.mad,
 						    &recv->grh, wc,
 						    port_priv->device,
-						    port_priv->port_num,
+						    port_num,
 						    qp_info->qp->qp_num);
 				goto out;
 			}
diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c
index 2bca753..8723675 100644
--- a/drivers/infiniband/core/smi.c
+++ b/drivers/infiniband/core/smi.c
@@ -192,7 +192,7 @@ enum smi_action smi_handle_dr_smp_recv(s
 			}
 			/* smp->hop_ptr updated when sending */
 			return (node_type == RDMA_NODE_IB_SWITCH ?
-				IB_SMI_HANDLE: IB_SMI_DISCARD);
+				IB_SMI_HANDLE : IB_SMI_DISCARD);
 		}
 
 		/* C14-13:4 -- hop_ptr = 0 -> give to SM */
@@ -211,7 +211,7 @@ enum smi_forward_action smi_check_forwar
 	if (!ib_get_smp_direction(smp)) {
 		/* C14-9:2 -- intermediate hop */
 		if (hop_ptr && hop_ptr < hop_cnt)
-			return IB_SMI_SEND;
+			return IB_SMI_FORWARD;
 
 		/* C14-9:3 -- at the end of the DR segment of path */
 		if (hop_ptr == hop_cnt)
@@ -224,7 +224,7 @@ enum smi_forward_action smi_check_forwar
 	} else {
 		/* C14-13:2  -- intermediate hop */
 		if (2 <= hop_ptr && hop_ptr <= hop_cnt)
-			return IB_SMI_SEND;
+			return IB_SMI_FORWARD;
 
 		/* C14-13:3 -- at the end of the DR segment of path */
 		if (hop_ptr == 1)
@@ -233,3 +233,13 @@ enum smi_forward_action smi_check_forwar
 	}
 	return IB_SMI_LOCAL;
 }
+
+/*
+ * Return the forwarding port number from initial_path for outgoing SMP and
+ * from return_path for returning SMP
+ */
+int smi_get_fwd_port(struct ib_smp *smp)
+{
+	return (!ib_get_smp_direction(smp) ? smp->initial_path[smp->hop_ptr+1] :
+		smp->return_path[smp->hop_ptr-1]);
+}
diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h
index 9a4b349..1cfc298 100644
--- a/drivers/infiniband/core/smi.h
+++ b/drivers/infiniband/core/smi.h
@@ -48,10 +48,12 @@ enum smi_action {
 enum smi_forward_action {
 	IB_SMI_LOCAL,	/* SMP should be completed up the stack */
 	IB_SMI_SEND,	/* received DR SMP should be forwarded to the send queue */
+	IB_SMI_FORWARD	/* SMP should be forwarded (for switches only) */
 };
 
 enum smi_action smi_handle_dr_smp_recv(struct ib_smp *smp, u8 node_type,
 				       int port_num, int phys_port_cnt);
+int smi_get_fwd_port(struct ib_smp *smp);
 extern enum smi_forward_action smi_check_forward_dr_smp(struct ib_smp *smp);
 extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp,
 					      u8 node_type, int port_num);
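
The port selection done by smi_get_fwd_port() in the patch above can be tried outside the kernel with a small stand-alone sketch. The struct below is a simplified, hypothetical stand-in for struct ib_smp (only the fields used here are kept, and the direction bit is reduced to a plain int), so treat it as an illustration of the indexing rather than the kernel code itself.

```c
#include <stdint.h>

#define DEMO_SMP_MAX_PATH_HOPS 64

/* Simplified, hypothetical stand-in for struct ib_smp. */
struct demo_smp {
	int     returning;	/* 0 = outgoing request, 1 = returning response */
	uint8_t hop_ptr;
	uint8_t hop_cnt;
	uint8_t initial_path[DEMO_SMP_MAX_PATH_HOPS];
	uint8_t return_path[DEMO_SMP_MAX_PATH_HOPS];
};

/* Outgoing SMPs leave through the next entry of initial_path; returning
 * SMPs leave through the previous entry of return_path.  The caller is
 * assumed to have already established that this is an intermediate hop,
 * so both indices stay inside the arrays. */
static int demo_get_fwd_port(const struct demo_smp *smp)
{
	if (!smp->returning)
		return smp->initial_path[smp->hop_ptr + 1];
	return smp->return_path[smp->hop_ptr - 1];
}
```

For example, an outgoing SMP with hop_ptr = 1 is forwarded out of the port stored in initial_path[2], while a returning SMP with hop_ptr = 2 is forwarded out of the port stored in return_path[1].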


From mst at dev.mellanox.co.il  Sat May  5 13:12:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 5 May 2007 23:12:00 +0300
Subject: [ofa-general] Re: [PATCH 0 of 3] comp_vector kernel support
In-Reply-To: <adad51g9rym.fsf@cisco.com>
References: <20070503104806.GC10009@mellanox.co.il> <adahcqteozq.fsf@cisco.com>
	<20070504062526.GB4829@mellanox.co.il> <adad51g9rym.fsf@cisco.com>
Message-ID: <20070505201200.GA20811@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH 0 of 3] comp_vector kernel support
> 
>  > How about just patches 1 and 2?
>  > They don't do anything to *kernel* ULPs by themselves,
>  > and give userspace ULPs opportunity to start using the feature.
>  > We'll learn from that, and enhance kernel ULPs by 2.6.23.
> 
> I guess I could see doing the first patch (just support multiple
> vectors in the kernel without changing any drivers).  That way it
> would be easy to experiment with patched drivers that enable multiple
> vectors.

OK, let's do it.

> I think there's still some figuring out to do about how many EQs to
> enable, etc, and I think it would be better to prevent drivers from
> escaping into the wild before we have a better handle on the issues.

How about applying mthca patch, but changing MTHCA_COMP_VECTORS to 1?
That would make it easy to do experiments at the ULP level.

-- 
MST


From xma at us.ibm.com  Sat May  5 23:33:49 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Sat, 5 May 2007 23:33:49 -0700
Subject: [ofa-general] IPoIB: Convert to NAPI
In-Reply-To: <adaodl1bpj1.fsf@cisco.com>
Message-ID: <OF6475F0E6.47EA2F9D-ON872572D3.002308C0-882572D3.0024133F@us.ibm.com>


Roland,

      This patch looks good. I am working on a patch to split the CQ into
a sendCQ and a recvCQ, each with its own interrupt associated with a
different CPU, to reduce latency and improve uni- and bi-directional BW. I
would like to compare the performance difference. I hope there is no
conflict.

Thanks
Shirley Ma
IBM Linux Technology Center

From vlad at lists.openfabrics.org  Sun May  6 02:37:35 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun,  6 May 2007 02:37:35 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070506-0200 daily build status
Message-ID: <20070506093736.23E64E6083A@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:


From dotanb at dev.mellanox.co.il  Sun May  6 05:20:06 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 06 May 2007 15:20:06 +0300
Subject: [ofa-general] does the mlx4 low level driver support working with
 multicast groups from user level?
Message-ID: <463DC7F6.2070209@dev.mellanox.co.il>

Hi Roland.

When I executed ibv_devinfo and checked the multicast props of the 
device I got the following values:

sw180:~ # ibv_devinfo -v | grep cast
    This will severely limit memory registrations.
        max_mcast_grp:                  8192
        max_mcast_qp_attach:            0
        max_total_mcast_qp_attach:      0

It seems that the IB low level driver (drivers/infiniband/hw/mlx4) 
doesn't fill the attribute max_mcast_qp_attach.

When I tried to use multicast groups from user level I got weird failures.


Does the low level driver support working with multicast groups from 
user level?


thanks
Dotan


From sashak at voltaire.com  Sun May  6 05:43:33 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 6 May 2007 15:43:33 +0300
Subject: [ofa-general] Re: [PATCH] osm: source and destination strings
	overlap when using sprintf()
In-Reply-To: <1178211572.32222.3479.camel@hal.voltaire.com>
References: <462C7C21.7010004@dev.mellanox.co.il>
	<20070423101738.GG4579@mellanox.co.il>
	<462E80A3.5060503@dev.mellanox.co.il>
	<20070501005101.GA26019@sashak.voltaire.com>
	<4636E4A7.7060108@dev.mellanox.co.il>
	<1178211572.32222.3479.camel@hal.voltaire.com>
Message-ID: <20070506124333.GB9692@sashak.voltaire.com>

On 12:59 Thu 03 May     , Hal Rosenstock wrote:
> On Tue, 2007-05-01 at 02:56, Yevgeny Kliteynik wrote:
> > Sasha Khapyorsky wrote:
> > > On 01:11 Wed 25 Apr     , Yevgeny Kliteynik wrote:
> > >> Michael S. Tsirkin wrote:
> > >>> Since you seem to do a strcat which does an anyway, how about, for example:
> > >>>
> > >>> -      sprintf( buf_line1,"%s 0x%01x |",
> > >>> -               buf_line1, p_vla_tbl->vl_entry[i].vl);
> > >>> +      sprintf( buf_line1 + strlen(buf_line1)," 0x%01x |",
> > >>> +               p_vla_tbl->vl_entry[i].vl);
> > >>>
> > >>> and so on in all the other places?
> > >> Agree.
> > >> I'll send a new patch later.
> > > 
> > > Or like this:
> > > 
> > > +      int n = 0;
> > > ...
> > > -      sprintf( buf_line1,"%s 0x%01x |",
> > > -               buf_line1, p_vla_tbl->vl_entry[i].vl);
> > > +      n += sprintf( buf_line1 + n," 0x%01x |",
> > > +                    p_vla_tbl->vl_entry[i].vl);
> > > 
> > > , so strlen() rerunning in loop is not needed anymore.
> > 
> > Right, it does look better.
> 
> So is someone going to submit this patch ? Thanks.

Will do.

Sasha
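
The idea agreed on above (keep a running offset from the return values instead of calling strlen() on every iteration) might look like this as a stand-alone function. It is a sketch, not the osm_helper.c code: the function name and signature are made up for illustration, and it uses snprintf() with an explicit bound, which is a defensive choice that is not part of the proposal in this thread.

```c
#include <stddef.h>
#include <stdio.h>

/* Format `count` values as " 0x%01X |" cells into buf, tracking the write
 * offset from the snprintf() return values instead of calling strlen().
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the buffer is too small. */
static int demo_format_cells(char *buf, size_t size,
			     const unsigned *vals, int count)
{
	size_t n = 0;
	int i;

	if (size == 0)
		return -1;
	buf[0] = '\0';	/* keep buf a valid string even when count == 0 */

	for (i = 0; i < count; i++) {
		int ret = snprintf(buf + n, size - n, " 0x%01X |", vals[i]);

		if (ret < 0 || (size_t)ret >= size - n)
			return -1;
		n += (size_t)ret;
	}
	return (int)n;
}
```

Note the explicit buf[0] = '\0' before the loop: once the memset() or manual termination is dropped in favour of a running offset, the buffer is otherwise left uninitialized whenever the loop body never runs.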


From sashak at voltaire.com  Sun May  6 06:03:52 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 6 May 2007 16:03:52 +0300
Subject: [ofa-general] [PATCH TRIVIAL] opensm/osm_helper: remove repeated
	strlen() calls
In-Reply-To: <20070506124333.GB9692@sashak.voltaire.com>
References: <462C7C21.7010004@dev.mellanox.co.il>
	<20070423101738.GG4579@mellanox.co.il>
	<462E80A3.5060503@dev.mellanox.co.il>
	<20070501005101.GA26019@sashak.voltaire.com>
	<4636E4A7.7060108@dev.mellanox.co.il>
	<1178211572.32222.3479.camel@hal.voltaire.com>
	<20070506124333.GB9692@sashak.voltaire.com>
Message-ID: <20070506130352.GC9692@sashak.voltaire.com>


Replace repeated strlen() calls used in sprintf() by actual string
length accumulated from sprintf() return values.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/opensm/osm_helper.c |   56 +++++++++++++++++++++--------------------------
 1 files changed, 25 insertions(+), 31 deletions(-)

diff --git a/osm/opensm/osm_helper.c b/osm/opensm/osm_helper.c
index a1a2e93..b424e84 100644
--- a/osm/opensm/osm_helper.c
+++ b/osm/opensm/osm_helper.c
@@ -1145,22 +1145,22 @@ osm_dump_multipath_record(
   IN const ib_multipath_rec_t* const p_mpr,
   IN const osm_log_level_t log_level )
 {
-  int i;
   char buf_line[1024];
   ib_gid_t const *p_gid;
+  int i, n;
 
   if( osm_log_is_active( p_log, log_level ) )
   {
-    memset(buf_line, 0, sizeof(buf_line));
+    n = 0;
     p_gid = p_mpr->gids;
     if ( p_mpr->sgid_count )
     {
       for (i = 0; i < p_mpr->sgid_count; i++)
       {
-        sprintf( buf_line + strlen(buf_line), "\t\t\t\tsgid%02d.................."
-                 "0x%016" PRIx64 " : 0x%016" PRIx64 "\n",
-                 i + 1, cl_ntoh64( p_gid->unicast.prefix ),
-                 cl_ntoh64( p_gid->unicast.interface_id ) );
+        n += sprintf( buf_line + n, "\t\t\t\tsgid%02d.................."
+                      "0x%016" PRIx64 " : 0x%016" PRIx64 "\n",
+                      i + 1, cl_ntoh64( p_gid->unicast.prefix ),
+                      cl_ntoh64( p_gid->unicast.interface_id ) );
         p_gid++;
       }
     }
@@ -1168,10 +1168,10 @@ osm_dump_multipath_record(
     {
       for (i = 0; i < p_mpr->dgid_count; i++)
       {
-        sprintf( buf_line + strlen(buf_line), "\t\t\t\tdgid%02d.................."
-                 "0x%016" PRIx64 " : 0x%016" PRIx64 "\n",
-                 i + 1, cl_ntoh64( p_gid->unicast.prefix ),
-                 cl_ntoh64( p_gid->unicast.interface_id ) );
+        n += sprintf( buf_line + n, "\t\t\t\tdgid%02d.................."
+                      "0x%016" PRIx64 " : 0x%016" PRIx64 "\n",
+                      i + 1, cl_ntoh64( p_gid->unicast.prefix ),
+                      cl_ntoh64( p_gid->unicast.interface_id ) );
         p_gid++;
       }
     }
@@ -1650,15 +1650,14 @@ osm_dump_pkey_block(
   IN const ib_pkey_table_t* const p_pkey_tbl,
   IN const osm_log_level_t log_level )
 {
-  int i;
   char buf_line[1024];
+  int i, n;
 
   if( osm_log_is_active( p_log, log_level ) )
   {
-    buf_line[0] = '\0';
-    for (i = 0; i < 32; i++)
-      sprintf( buf_line + strlen(buf_line)," 0x%04x |",
-               cl_ntoh16(p_pkey_tbl->pkey_entry[i]));
+    for (i = 0, n = 0; i < 32; i++)
+      n += sprintf( buf_line + n," 0x%04x |",
+                    cl_ntoh16(p_pkey_tbl->pkey_entry[i]));
 
     osm_log( p_log, log_level,
              "P_Key table dump:\n"
@@ -1684,18 +1683,17 @@ osm_dump_slvl_map_table(
   IN const ib_slvl_table_t* const p_slvl_tbl,
   IN const osm_log_level_t log_level )
 {
-  uint8_t i;
   char buf_line1[1024];
   char buf_line2[1024];
+  int n;
+  uint8_t i;
 
   if( osm_log_is_active( p_log, log_level ) )
   {
-    buf_line1[0] = '\0';
-    buf_line2[0] = '\0';
-    for (i = 0; i < 16; i++)
-      sprintf( buf_line1 + strlen(buf_line1)," %-2u |", i);
-    for (i = 0; i < 16; i++)
-      sprintf( buf_line2 + strlen(buf_line2),"0x%01X |",
+    for (i = 0, n = 0; i < 16; i++)
+      n += sprintf( buf_line1 + n," %-2u |", i);
+    for (i = 0, n = 0; i < 16; i++)
+      n += sprintf( buf_line2 + n,"0x%01X |",
                ib_slvl_table_get(p_slvl_tbl, i));
     osm_log( p_log, log_level,
              "SLtoVL dump:\n"
@@ -1721,20 +1719,16 @@ osm_dump_vl_arb_table(
   IN const ib_vl_arb_table_t* const p_vla_tbl,
   IN const osm_log_level_t log_level )
 {
-  int i;
   char buf_line1[1024];
   char buf_line2[1024];
+  int i, n;
 
   if( osm_log_is_active( p_log, log_level ) )
   {
-    buf_line1[0] = '\0';
-    buf_line2[0] = '\0';
-    for (i = 0; i < 32; i++)
-      sprintf( buf_line1 + strlen(buf_line1)," 0x%01X |",
-               p_vla_tbl->vl_entry[i].vl);
-    for (i = 0; i < 32; i++)
-      sprintf( buf_line2 + strlen(buf_line2)," 0x%01X |",
-               p_vla_tbl->vl_entry[i].weight);
+    for (i = 0, n = 0; i < 32; i++)
+      n += sprintf( buf_line1 + n," 0x%01X |", p_vla_tbl->vl_entry[i].vl);
+    for (i = 0, n = 0; i < 32; i++)
+      n += sprintf( buf_line2 + n," 0x%01X |", p_vla_tbl->vl_entry[i].weight);
     osm_log( p_log, log_level,
              "VlArb dump:\n"
              "\t\t\tport_guid...........0x%016" PRIx64 "\n"
-- 
1.5.1.rc1.18.ga41b4


From dotanb at dev.mellanox.co.il  Sun May  6 05:58:28 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 06 May 2007 15:58:28 +0300
Subject: [ofa-general] Queue Pair in error state
In-Reply-To: <463B8C5E.3060005@linux.vnet.ibm.com>
References: <463B8C5E.3060005@linux.vnet.ibm.com>
Message-ID: <463DD0F4.2050709@dev.mellanox.co.il>

Pradeep Satyanarayana wrote:
> If packets are received by a queue pair that has gone to an error 
> state, which of the following is to be expected:
>
> 1. It gets dropped by the hardware and the sender will be notified with 
> an error.
> 2. The packet gets delivered to the receiver and the work completion 
> handler needs to deal with it.
I believe that the first scenario will occur.

The responder QP is in error state so all of the incoming packets will 
be dropped by the HCA.
The requestor QP, which won't get any ack (or nack), will eventually get 
a retry exceeded and move to error state as well.


Dotan
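
The sequence described above can be pictured with a toy model. Everything in this sketch is hypothetical (the `demo_` types, the two-state QP, the absence of timers); it assumes a reliable connection where the requester retransmits up to retry_cnt times when no ACK or NAK arrives, and it only illustrates the order of events, not any real verbs API.

```c
#include <stdbool.h>

enum demo_qp_state { DEMO_QPS_RTS, DEMO_QPS_ERR };

struct demo_qp {
	enum demo_qp_state state;
	int retry_cnt;	/* retries allowed after the first attempt */
};

/* Responder side: a QP in the error state silently drops the packet,
 * so no ACK or NAK goes back.  Returns true if an ACK is generated. */
static bool demo_responder_receive(const struct demo_qp *responder)
{
	return responder->state == DEMO_QPS_RTS;
}

/* Requester side: send once, then retry up to retry_cnt times.
 * Returns the number of transmissions made; when every attempt goes
 * unanswered the requester itself moves to the error state (the
 * "retry exceeded" outcome). */
static int demo_requester_send(struct demo_qp *requester,
			       const struct demo_qp *responder)
{
	int attempts = 0;

	while (attempts <= requester->retry_cnt) {
		attempts++;
		if (demo_responder_receive(responder))
			return attempts;
	}
	requester->state = DEMO_QPS_ERR;
	return attempts;
}
```

With retry_cnt set to 7 and the responder in the error state, the model makes 8 transmissions and then puts the requester in the error state as well; with a healthy responder it returns after the first transmission.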


From dotanb at dev.mellanox.co.il  Sun May  6 06:46:42 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 06 May 2007 16:46:42 +0300
Subject: [ofa-general] [PATCH] libibverbs/ibv_devinfo : Print the number of
	max_vl_num as a number
Message-ID: <1178459202.16752.1.camel@mtldesk014.lab.mtl.com>

Print max_vl_num as the number of VLs it represents rather than as the raw enumerated value.

Signed-off-by: Dotan Barak <dotanb at mellanox.co.il>

---

diff --git a/examples/devinfo.c b/examples/devinfo.c
index 28cf8d1..40575c6 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -135,6 +135,18 @@ static const char *speed_str(uint8_t speed)
 	}
 }
 
+static const char *vl_str(uint8_t vl_num)
+{
+	switch (vl_num) {
+	case 1:  return "1";
+	case 2:  return "2";
+	case 3:  return "4";
+	case 4:  return "8";
+	case 5:  return "15";
+	default: return "invalid value";
+	}
+}
+
 static int print_all_port_gids(struct ibv_context *ctx, uint8_t port_num, int tbl_len)
 {
 	union ibv_gid gid;
@@ -266,7 +278,7 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port)
 		if (verbose) {
 			printf("\t\t\tmax_msg_sz:\t\t0x%x\n", port_attr.max_msg_sz);
 			printf("\t\t\tport_cap_flags:\t\t0x%08x\n", port_attr.port_cap_flags);
-			printf("\t\t\tmax_vl_num:\t\t%d\n", port_attr.max_vl_num);
+			printf("\t\t\tmax_vl_num:\t\t%s\n", vl_str(port_attr.max_vl_num));
 			printf("\t\t\tbad_pkey_cntr:\t\t0x%x\n", port_attr.bad_pkey_cntr);
 			printf("\t\t\tqkey_viol_cntr:\t\t0x%x\n", port_attr.qkey_viol_cntr);
 			printf("\t\t\tsm_sl:\t\t\t%d\n", port_attr.sm_sl);


From rdreier at cisco.com  Sun May  6 08:41:50 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 06 May 2007 08:41:50 -0700
Subject: [ofa-general] Re: does the mlx4 low level driver support working
	with multicast groups from user level?
References: <463DC7F6.2070209@dev.mellanox.co.il>
Message-ID: <adamz0i7ydd.fsf@cisco.com>

 > When I executed ibv_devinfo and checked the multicast props of the
 > device I got the following values:
 > 
 > sw180:~ # ibv_devinfo -v | grep cast
 >    This will severely limit memory registrations.
 >        max_mcast_grp:                  8192
 >        max_mcast_qp_attach:            0
 >        max_total_mcast_qp_attach:      0
 > 
 > It seems that the IB low level driver (drivers/infiniband/hw/mlx4)
 > doesn't fill the attribute max_mcast_qp_attach.

Yes, that code is missing.

 > When I tried to use multicast groups from user level I got weird failures.
 > 
 > Does the low level driver support working with multicast groups from
 > user level?

There's nothing special to do to handle userspace multicast groups.
The multicast groups work well enough for IPoIB to work for me, but I
haven't done any real testing.

It should work but there's probably a silly bug somewhere.  I just
fixed one such bug but without knowing what your weird failures are,
it's hard to say whether it would affect your tests.

 - R.


From eli at mellanox.co.il  Sun May  6 08:44:06 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Sun, 06 May 2007 18:44:06 +0300
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: free doorbel fix
Message-ID: <1178466276.20653.127.camel@mtls03>

When freeing an entry from order 1, the index field ends up shifted
twice and the resulting index is wrong causing corruption of the
data structure.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: connectx_kernel/drivers/infiniband/hw/mlx4/doorbell.c
===================================================================
--- connectx_kernel.orig/drivers/infiniband/hw/mlx4/doorbell.c	2007-05-06 18:24:54.000000000 +0300
+++ connectx_kernel/drivers/infiniband/hw/mlx4/doorbell.c	2007-05-06 18:29:32.000000000 +0300
@@ -136,9 +136,9 @@ void mlx4_ib_db_free(struct mlx4_ib_dev 
 	if (db->order == 0 && test_bit(i ^ 1, db->u.pgdir->order0)) {
 		clear_bit(i ^ 1, db->u.pgdir->order0);
 		++o;
+		i >>= o;
 	}
 
-	i >>= o;
 	set_bit(i, db->u.pgdir->bits[o]);
 
 	if (bitmap_full(db->u.pgdir->order1, MLX4_IB_DB_PER_PAGE / 2)) {


From eli at mellanox.co.il  Sun May  6 08:53:19 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Sun, 06 May 2007 18:53:19 +0300
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: fix doorbell allocations
Message-ID: <1178466829.5013.1.camel@mtls03>

These allocations are done under a spinlock and should be made with
GFP_ATOMIC flags to prevent a deadlock.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: connectx_kernel/drivers/infiniband/hw/mlx4/doorbell.c
===================================================================
--- connectx_kernel.orig/drivers/infiniband/hw/mlx4/doorbell.c	2007-05-06 10:38:26.000000000 +0300
+++ connectx_kernel/drivers/infiniband/hw/mlx4/doorbell.c	2007-05-06 10:43:08.000000000 +0300
@@ -47,7 +47,7 @@ static struct mlx4_ib_db_pgdir *mlx4_ib_
 {
 	struct mlx4_ib_db_pgdir *pgdir;
 
-	pgdir = kzalloc(sizeof *pgdir, GFP_KERNEL);
+	pgdir = kzalloc(sizeof *pgdir, GFP_ATOMIC);
 	if (!pgdir)
 		return NULL;
 
@@ -56,7 +56,7 @@ static struct mlx4_ib_db_pgdir *mlx4_ib_
 	pgdir->bits[1] = pgdir->order1;
 	pgdir->db_page = dma_alloc_coherent(dev->ib_dev.dma_device,
 					    PAGE_SIZE, &pgdir->db_dma,
-					    GFP_KERNEL);
+					    GFP_ATOMIC);
 	if (!pgdir->db_page) {
 		kfree(pgdir);
 		return NULL;


From rdreier at cisco.com  Sun May  6 09:21:38 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 06 May 2007 09:21:38 -0700
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: free doorbel fix
In-Reply-To: <1178466276.20653.127.camel@mtls03> (Eli Cohen's message of "Sun,
	06 May 2007 18:44:06 +0300")
References: <1178466276.20653.127.camel@mtls03>
Message-ID: <adahcqp9b3h.fsf@cisco.com>

Thanks, good catch... I fixed it this way:

commit e5b1dd9313497cc22ae171ab6cccb7eb044aba53
Author: Eli Cohen <eli at mellanox.co.il>
Date:   Sun May 6 09:20:13 2007 -0700

    When freeing an entry from order 1, the index field ends up shifted
    twice and the resulting index is wrong causing corruption of the
    data structure.
    
    Signed-off-by: Eli Cohen <eli at mellanox.co.il>
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c
index 4b564d5..e55c286 100644
--- a/drivers/infiniband/hw/mlx4/doorbell.c
+++ b/drivers/infiniband/hw/mlx4/doorbell.c
@@ -132,7 +132,6 @@ void mlx4_ib_db_free(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db)
 	spin_lock(&dev->pgdir_lock);
 
 	o = db->order;
-	i = db->index >> db->order;
 
 	if (db->order == 0 && test_bit(i ^ 1, db->u.pgdir->order0)) {
 		clear_bit(i ^ 1, db->u.pgdir->order0);


From rdreier at cisco.com  Sun May  6 09:27:58 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 06 May 2007 09:27:58 -0700
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: free doorbel fix
In-Reply-To: <adahcqp9b3h.fsf@cisco.com> (Roland Dreier's message of "Sun,
	06 May 2007 09:21:38 -0700")
References: <1178466276.20653.127.camel@mtls03> <adahcqp9b3h.fsf@cisco.com>
Message-ID: <adaabwh9asx.fsf@cisco.com>

err, like this really:

commit 19219048ce32931392ca703f4cd9d54a8926215b
Author: Eli Cohen <eli at mellanox.co.il>
Date:   Sun May 6 09:27:29 2007 -0700

    IB/mlx4: Fix free of doorbell record buddies
    
    When freeing an entry from order 1, the index field ends up shifted
    twice and the resulting index is wrong causing corruption of the
    data structure.
    
    Signed-off-by: Eli Cohen <eli at mellanox.co.il>
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c
index 4b564d5..0515052 100644
--- a/drivers/infiniband/hw/mlx4/doorbell.c
+++ b/drivers/infiniband/hw/mlx4/doorbell.c
@@ -131,8 +131,7 @@ void mlx4_ib_db_free(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db)
 
 	spin_lock(&dev->pgdir_lock);
 
-	o = db->order;
-	i = db->index >> db->order;
+	i = db->index;
 
 	if (db->order == 0 && test_bit(i ^ 1, db->u.pgdir->order0)) {
 		clear_bit(i ^ 1, db->u.pgdir->order0);


From rdreier at cisco.com  Sun May  6 09:28:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 06 May 2007 09:28:07 -0700
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: fix doorbell allocations
In-Reply-To: <1178466829.5013.1.camel@mtls03> (Eli Cohen's message of "Sun,
	06 May 2007 18:53:19 +0300")
References: <1178466829.5013.1.camel@mtls03>
Message-ID: <ada8xc19aso.fsf@cisco.com>

another good catch.  let's make the lock a mutex instead, rather than
relying on atomic allocations:

commit 7a62f478170f69225fa8f35d0502dbaf26652615
Author: Roland Dreier <rolandd at cisco.com>
Date:   Sun May 6 09:26:16 2007 -0700

    IB/mlx4: Convert pgdir_lock to pgdir_mutex
    
    Doorbell record pages are allocated inside the pgdir lock, so change
    the lock to a mutex so we can use GFP_KERNEL allocations.
    
    Pointed out by Eli Cohen <eli at mellanox.co.il>.
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c
index e55c286..2e36cee 100644
--- a/drivers/infiniband/hw/mlx4/doorbell.c
+++ b/drivers/infiniband/hw/mlx4/doorbell.c
@@ -101,7 +101,7 @@ int mlx4_ib_db_alloc(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db, int order)
 	struct mlx4_ib_db_pgdir *pgdir;
 	int ret = 0;
 
-	spin_lock(&dev->pgdir_lock);
+	mutex_lock(&dev->pgdir_mutex);
 
 	list_for_each_entry(pgdir, &dev->pgdir_list, list)
 		if (!mlx4_ib_alloc_db_from_pgdir(pgdir, db, order))
@@ -119,7 +119,7 @@ int mlx4_ib_db_alloc(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db, int order)
 	WARN_ON(mlx4_ib_alloc_db_from_pgdir(pgdir, db, order));
 
 out:
-	spin_unlock(&dev->pgdir_lock);
+	mutex_unlock(&dev->pgdir_mutex);
 
 	return ret;
 }
@@ -129,7 +129,7 @@ void mlx4_ib_db_free(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db)
 	int o;
 	int i;
 
-	spin_lock(&dev->pgdir_lock);
+	mutex_lock(&dev->pgdir_mutex);
 
 	o = db->order;
 
@@ -148,7 +148,7 @@ void mlx4_ib_db_free(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db)
 		kfree(db->u.pgdir);
 	}
 
-	spin_unlock(&dev->pgdir_lock);
+	mutex_unlock(&dev->pgdir_mutex);
 }
 
 struct mlx4_ib_user_db_page {
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index b3af928..5ef6d19 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -490,7 +490,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 		goto err_uar;
 
 	INIT_LIST_HEAD(&ibdev->pgdir_list);
-	spin_lock_init(&ibdev->pgdir_lock);
+	mutex_init(&ibdev->pgdir_mutex);
 
 	ibdev->dev = dev;
 
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index bb866b0..62be599 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -152,7 +152,7 @@ struct mlx4_ib_dev {
 	void __iomem	       *uar_map;
 
 	struct list_head	pgdir_list;
-	spinlock_t		pgdir_lock;
+	struct mutex		pgdir_mutex;
 
 	struct mlx4_uar		priv_uar;
 	u32			priv_pdn;


From rdreier at cisco.com  Sun May  6 09:30:09 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 06 May 2007 09:30:09 -0700
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: free doorbel fix
In-Reply-To: <adaabwh9asx.fsf@cisco.com> (Roland Dreier's message of "Sun,
	06 May 2007 09:27:58 -0700")
References: <1178466276.20653.127.camel@mtls03> <adahcqp9b3h.fsf@cisco.com>
	<adaabwh9asx.fsf@cisco.com>
Message-ID: <ada4pmp9apa.fsf@cisco.com>

err, one more try:

commit 49b070c5a9473fabb379c82761ecf8c573a9b548
Author: Eli Cohen <eli at mellanox.co.il>
Date:   Sun May 6 09:29:28 2007 -0700

    IB/mlx4: Fix free of doorbell record buddies
    
    When freeing an entry from order 1, the index field ends up shifted
    twice and the resulting index is wrong causing corruption of the
    data structure.
    
    Signed-off-by: Eli Cohen <eli at mellanox.co.il>
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c
index 4b564d5..acb4ce2 100644
--- a/drivers/infiniband/hw/mlx4/doorbell.c
+++ b/drivers/infiniband/hw/mlx4/doorbell.c
@@ -132,7 +132,7 @@ void mlx4_ib_db_free(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db)
 	spin_lock(&dev->pgdir_lock);
 
 	o = db->order;
-	i = db->index >> db->order;
+	i = db->index;
 
 	if (db->order == 0 && test_bit(i ^ 1, db->u.pgdir->order0)) {
 		clear_bit(i ^ 1, db->u.pgdir->order0);


From sashak at voltaire.com  Sun May  6 10:41:38 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 6 May 2007 20:41:38 +0300
Subject: [ofa-general] [PATCH TRIVIAL] opensm: remove unneeded run-time check
Message-ID: <20070506174138.GI9692@sashak.voltaire.com>

Remove the unneeded run-time NULL pointer check (the free() that
follows is not under this check anyway).

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/opensm/osm_node.c |   20 +++++++++-----------
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/osm/opensm/osm_node.c b/osm/opensm/osm_node.c
index 3f96c16..e725fd5 100644
--- a/osm/opensm/osm_node.c
+++ b/osm/opensm/osm_node.c
@@ -147,20 +147,17 @@ void
 osm_node_destroy(
   IN osm_node_t *p_node )
 {
+  osm_physp_t *p_physp;
   uint16_t i;
 
-  /* Cleanup all PhysPorts */
-  if( p_node != NULL )
+  /*
+    Cleanup all physports 
+  */
+  for( i = 0; i < p_node->physp_tbl_size; i++ )
   {
-    /*
-      Cleanup all physports 
-    */
-    for( i = 0; i < p_node->physp_tbl_size; i++ )
-    {
-      osm_physp_t *p_physp = osm_node_get_physp_ptr( p_node, i );
-      if (p_physp) 
-        osm_physp_destroy( p_physp );
-    }
+    p_physp = osm_node_get_physp_ptr( p_node, i );
+    if (p_physp) 
+      osm_physp_destroy( p_physp );
   }
 }
 
@@ -170,6 +167,7 @@ void
 osm_node_delete(
   IN OUT osm_node_t** const p_node )
 {
+  CL_ASSERT(p_node && *p_node);
   osm_node_destroy( *p_node );
   free( *p_node );
   *p_node = NULL;
-- 
1.5.2.rc2.20.gac2a
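
The delete pattern after this change can be sketched in standalone C
as follows. This is a minimal illustration, not the OpenSM code: the
model_* types and functions are hypothetical, and the standard
assert() stands in for CL_ASSERT. The point is that the NULL check
lives once, as a debug assertion in the delete wrapper, and the
destroy helper can then assume a valid pointer.

```c
#include <assert.h>
#include <stdlib.h>

struct model_port {
	int id;
};

struct model_node {
	size_t port_tbl_size;
	struct model_port **port_tbl;	/* entries may be NULL */
};

/* Allocate a node with 'n' empty port slots; returns NULL on failure. */
static struct model_node *model_node_new(size_t n)
{
	struct model_node *node = calloc(1, sizeof(*node));

	if (!node)
		return NULL;
	node->port_tbl = calloc(n, sizeof(*node->port_tbl));
	if (!node->port_tbl) {
		free(node);
		return NULL;
	}
	node->port_tbl_size = n;
	return node;
}

/* Release everything the node owns; the caller guarantees node != NULL. */
static void model_node_destroy(struct model_node *node)
{
	size_t i;

	for (i = 0; i < node->port_tbl_size; i++)
		free(node->port_tbl[i]);	/* free(NULL) is a no-op */
	free(node->port_tbl);
}

/* Destroy and free the node, then clear the caller's pointer. */
static void model_node_delete(struct model_node **pp_node)
{
	assert(pp_node && *pp_node);
	model_node_destroy(*pp_node);
	free(*pp_node);
	*pp_node = NULL;
}
```

A caller passes the address of its own pointer, for example
model_node_delete(&node), and finds the pointer set to NULL afterwards.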


From sashak at voltaire.com  Sun May  6 10:44:31 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 6 May 2007 20:44:31 +0300
Subject: [ofa-general] [PATCH TRIVIAL] opensm: make osm_node_destroy() static
In-Reply-To: <20070506174138.GI9692@sashak.voltaire.com>
References: <20070506174138.GI9692@sashak.voltaire.com>
Message-ID: <20070506174431.GJ9692@sashak.voltaire.com>


This makes the locally used osm_node_destroy() function static.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/include/opensm/osm_node.h |   28 ----------------------------
 osm/opensm/osm_node.c         |    2 +-
 2 files changed, 1 insertions(+), 29 deletions(-)

diff --git a/osm/include/opensm/osm_node.h b/osm/include/opensm/osm_node.h
index 035ecef..a841de7 100644
--- a/osm/include/opensm/osm_node.h
+++ b/osm/include/opensm/osm_node.h
@@ -149,34 +149,6 @@ typedef struct _osm_node
 *	Node object
 *********/
 
-/****f* OpenSM: Node/osm_node_destroy
-* NAME
-*	osm_node_destroy
-*
-* DESCRIPTION
-*	The osm_node_destroy function destroys a node, releasing
-*	all resources.
-*
-* SYNOPSIS
-*/void
-osm_node_destroy(
-  IN osm_node_t *p_node );
-/*
-* PARAMETERS
-*	p_node
-*		[in] Pointer a Node object to destroy.
-*
-* RETURN VALUE
-*	This function does not return a value.
-*
-* NOTES
-*	Performs any necessary cleanup of the specified Node object.
-*	This function should only be called after a call to osm_node_new.
-*
-* SEE ALSO
-*	Node object, osm_node_new
-*********/
-
 /****f* OpenSM: Node/osm_node_delete
 * NAME
 *	osm_node_delete
diff --git a/osm/opensm/osm_node.c b/osm/opensm/osm_node.c
index e725fd5..80a7465 100644
--- a/osm/opensm/osm_node.c
+++ b/osm/opensm/osm_node.c
@@ -143,7 +143,7 @@ osm_node_new(
 
 /**********************************************************************
  **********************************************************************/
-void
+static void
 osm_node_destroy(
   IN osm_node_t *p_node )
 {
-- 
1.5.2.rc2.20.gac2a


From sashak at voltaire.com  Sun May  6 11:19:37 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 6 May 2007 21:19:37 +0300
Subject: [ofa-general] [PATCH TRIVIAL] opensm: trivial osm_port cleanups
Message-ID: <20070506181937.GK9692@sashak.voltaire.com>


This removes the meaningless osm_port_construct() and osm_port_destroy()
functions and makes the locally used osm_port_init() static.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/include/opensm/osm_port.h |  111 +---------------------------------------
 osm/opensm/osm_port.c         |   14 +++--
 2 files changed, 11 insertions(+), 114 deletions(-)

diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h
index 347ab3b..775f228 100644
--- a/osm/include/opensm/osm_port.h
+++ b/osm/include/opensm/osm_port.h
@@ -289,7 +289,7 @@ osm_physp_destroy(
 *	osm_physp_init.
 *
 * SEE ALSO
-*	Port, osm_port_init, osm_port_destroy
+*	Port
 *********/
 
 /****f* OpenSM: Physical Port/osm_physp_is_valid
@@ -1313,70 +1313,6 @@ typedef struct _osm_port
 *	Port, Physical Port, Physical Port Table
 *********/
 
-/****f* OpenSM: Port/osm_port_construct
-* NAME
-*	osm_port_construct
-*
-* DESCRIPTION
-*	This function constructs a Port object.
-*
-* SYNOPSIS
-*/
-static inline void
-osm_port_construct(
-	IN osm_port_t* const p_port )
-{
-	memset( p_port, 0, sizeof(*p_port) );
-	cl_qlist_init( &p_port->mcm_list );
-}
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to a Port object to construct.
-*
-* RETURN VALUE
-*	This function does not return a value.
-*
-* NOTES
-*	Allows calling osm_port_init, and osm_port_destroy.
-*
-*	Calling osm_port_construct is a prerequisite to calling any other
-*	method except osm_port_init.
-*
-* SEE ALSO
-*	Port, osm_port_init, osm_port_destroy
-*********/
-
-/****f* OpenSM: Port/osm_port_destroy
-* NAME
-*	osm_port_destroy
-*
-* DESCRIPTION
-*	This function destroys a Port object.
-*
-* SYNOPSIS
-*/
-void
-osm_port_destroy(
-  IN osm_port_t* const p_port );
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to a Port object to construct.
-*
-* RETURN VALUE
-*	This function does not return a value.
-*
-* NOTES
-*	Performs any necessary cleanup of the specified Port object.
-*	Further operations should not be attempted on the destroyed object.
-*	This function should only be called after a call to osm_port_construct
-*	or osm_port_init.
-*
-* SEE ALSO
-*	Port, osm_port_init, osm_port_destroy
-*********/
-
 /****f* OpenSM: Port/osm_port_delete
 * NAME
 *	osm_port_delete
@@ -1386,14 +1322,9 @@ osm_port_destroy(
 *
 * SYNOPSIS
 */
-inline static void
+void
 osm_port_delete(
-	IN OUT osm_port_t** const pp_port )
-{
-	osm_port_destroy( *pp_port );
-	free( *pp_port );
-	*pp_port = NULL;
-}
+	IN OUT osm_port_t** const pp_port );
 /*
 * PARAMETERS
 *	pp_port
@@ -1407,42 +1338,6 @@ osm_port_delete(
 *	Performs any necessary cleanup of the specified Port object.
 *
 * SEE ALSO
-*	Port, osm_port_init, osm_port_destroy
-*********/
-
-/****f* OpenSM: Port/osm_port_init
-* NAME
-*	osm_port_init
-*
-* DESCRIPTION
-*	This function initializes a Port object.
-*
-* SYNOPSIS
-*/
-void
-osm_port_init(
-	IN osm_port_t* const p_port,
-	IN const ib_node_info_t* p_ni,
-	IN const struct _osm_node* const p_parent_node );
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to a Port object to initialize.
-*
-*	p_ni
-*		[in] Pointer to the NodeInfo attribute relavent for this port.
-*
-*	p_parent_node
-*		[in] Pointer to the initialized parent osm_node_t object
-*		that owns this port.
-*
-* RETURN VALUE
-*	None.
-*
-* NOTES
-*	Allows calling other port methods.
-*
-* SEE ALSO
 *	Port
 *********/
 
diff --git a/osm/opensm/osm_port.c b/osm/opensm/osm_port.c
index ec2998c..260e28a 100644
--- a/osm/opensm/osm_port.c
+++ b/osm/opensm/osm_port.c
@@ -154,16 +154,18 @@ osm_physp_init(
 /**********************************************************************
  **********************************************************************/
 void
-osm_port_destroy(
-  IN osm_port_t* const p_port )
+osm_port_delete(
+  IN OUT osm_port_t** const pp_port )
 {
   /* cleanup all mcm recs attached */
-  osm_port_remove_all_mgrp( p_port );
+  osm_port_remove_all_mgrp( *pp_port );
+  free( *pp_port );
+  *pp_port = NULL;
 }
 
 /**********************************************************************
  **********************************************************************/
-void
+static void
 osm_port_init(
   IN osm_port_t* const p_port,
   IN const ib_node_info_t* p_ni,
@@ -178,8 +180,8 @@ osm_port_init(
   CL_ASSERT( p_ni );
   CL_ASSERT( p_parent_node );
 
-  osm_port_construct( p_port );
-
+  memset( p_port, 0, sizeof(*p_port) );
+  cl_qlist_init( &p_port->mcm_list );
   p_port->p_node = (struct _osm_node *)p_parent_node;
   port_guid = p_ni->port_guid;
   p_port->guid = port_guid;
-- 
1.5.2.rc2.20.gac2a


From sashak at voltaire.com  Sun May  6 13:00:13 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 6 May 2007 23:00:13 +0300
Subject: [ofa-general] [PATCH] opensm: consolidate CA and router PortInfo
	receiving code
Message-ID: <20070506200013.GL9692@sashak.voltaire.com>


Consolidate the CA and router PortInfo receive-processing code.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/opensm/osm_port_info_rcv.c |   36 +-----------------------------------
 1 files changed, 1 insertions(+), 35 deletions(-)

diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c
index e12daa6..f23410b 100644
--- a/osm/opensm/osm_port_info_rcv.c
+++ b/osm/opensm/osm_port_info_rcv.c
@@ -406,37 +406,6 @@ __osm_pi_rcv_process_ca_port(
   OSM_LOG_EXIT( p_rcv->p_log );
 }
 
-/**********************************************************************
- **********************************************************************/
-static void
-__osm_pi_rcv_process_router_port(
-  IN const osm_pi_rcv_t* const p_rcv,
-  IN osm_node_t* const p_node,
-  IN osm_physp_t* const p_physp,
-  IN const ib_port_info_t* const p_pi )
-{
-  ib_net16_t orig_lid;
-
-  OSM_LOG_ENTER( p_rcv->p_log, __osm_pi_rcv_process_router_port );
-
-  UNUSED_PARAM( p_node );
-
-  /*
-    Update the PortInfo attribute.
-  */
-  osm_physp_set_port_info( p_physp, p_pi );
-
-  if ( (orig_lid = osm_physp_trim_base_lid_to_valid_range( p_physp ) ) )
-    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
-	     "__osm_pi_rcv_process_router_port: ERR 0F09: "
-	     "Invalid base LID 0x%x corrected\n",
-	     cl_ntoh16 ( orig_lid) );
-
-  __osm_pi_rcv_process_endport(p_rcv, p_physp, p_pi);
-
-  OSM_LOG_EXIT( p_rcv->p_log );
-}
-
 #define IBM_VENDOR_ID  (0x5076)
 /**********************************************************************
  **********************************************************************/
@@ -851,13 +820,10 @@ osm_pi_rcv_process(
     switch( osm_node_get_type( p_node ) )
     {
     case IB_NODE_TYPE_CA:
+    case IB_NODE_TYPE_ROUTER:
       __osm_pi_rcv_process_ca_port( p_rcv,
                                     p_node, p_physp, p_pi );
       break;
-    case IB_NODE_TYPE_ROUTER:
-      __osm_pi_rcv_process_router_port( p_rcv,
-                                        p_node, p_physp, p_pi );
-      break;
     case IB_NODE_TYPE_SWITCH:
       __osm_pi_rcv_process_switch_port( p_rcv,
                                         p_node, p_physp, p_pi );
-- 
1.5.2.rc2.20.gac2a
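
The shape of the consolidated dispatch can be shown in isolation. The
sketch below is an assumption-laden illustration, not OpenSM code: the
model_* enums and model_pi_dispatch() are hypothetical, and the
function returns a tag instead of calling a real handler. It shows two
case labels sharing one body, so that CA and router ports take the
same end-port path while switch ports keep their own.

```c
#include <assert.h>

enum model_node_type {
	MODEL_NODE_CA,
	MODEL_NODE_SWITCH,
	MODEL_NODE_ROUTER
};

enum model_handler {
	MODEL_HANDLE_ENDPORT,
	MODEL_HANDLE_SWITCH,
	MODEL_HANDLE_UNKNOWN
};

/* Pick the handler for a received PortInfo based on the node type. */
static enum model_handler model_pi_dispatch(int node_type)
{
	switch (node_type) {
	case MODEL_NODE_CA:
	case MODEL_NODE_ROUTER:
		/* CA and router ports share a single code path. */
		return MODEL_HANDLE_ENDPORT;
	case MODEL_NODE_SWITCH:
		return MODEL_HANDLE_SWITCH;
	default:
		return MODEL_HANDLE_UNKNOWN;
	}
}
```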


From mst at dev.mellanox.co.il  Sun May  6 13:07:52 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 6 May 2007 23:07:52 +0300
Subject: [ofa-general] Fwd: Re: using stgit/guilt for public branches
Message-ID: <20070506200752.GB30339@mellanox.co.il>

FYI, some more discussion forwarded from the git mailing list.

Executive summary: it's possible to make a repository managed
by stgit public, but the tools that would let multiple
developers clone and work on such a repository
do not seem to be there yet.

----- Forwarded message from Yann Dirson <ydirson at altern.org> -----

Subject: Re: using stgit/guilt for public branches
Date: Sat, 5 May 2007 00:37:41 +0300
In-Reply-To: <20070504052042.GA4829 at mellanox.co.il>
References: <20070425122048.GD1624 at mellanox.co.il> <20070425191838.GA6267 at filer.fsl.cs.sunysb.edu> <200704252337.05851.robin.rosenberg.lists at dewire.com> <20070503205836.GA19253 at nan92-1-81-57-214-146.fbx.proxad.net> <20070504052042.GA4829 at mellanox.co.il>
From: Yann Dirson <ydirson at altern.org>

On Fri, May 04, 2007 at 08:20:59AM +0300, Michael S. Tsirkin wrote:
> > Quoting Yann Dirson <ydirson at altern.org>:
> > Subject: Re: using stgit/guilt for public branches
> > 
> > On Wed, Apr 25, 2007 at 11:37:05PM +0200, Robin Rosenberg wrote:
> > > onsdag 25 april 2007 skrev Josef Sipek:
> > > > On Wed, Apr 25, 2007 at 03:20:49PM +0300, Michael S. Tsirkin wrote:
> > > [...]
> > > > > I am concerned that publishing a git branch managed by stg/guilt
> > > > > would present problems: it seems that every time patches are re-ordered,
> > > > > a patch is re-written or removed, or we update from upstream,
> > > > > everyone who pulls the tree branch will have a hard-to-resolve conflict.
> > > > > 
> > > > > Is that really a problem? If so, would it be possible to work around this
> > > > > somehow?
> > > > 
> > > > I thought about this problem a while back when I was trying to decide how to
> > > > manage the Unionfs git repository. I came to the conclusion, that there was
> > > > no clean way of doing this (at least not using guilt - I can't really speak
> > > > for stgit, as I don't know how it does things exactly).
> > > 
> > > StGit has the same problem. Publishing such a branch is only for viewing if
> > > you want to publish the tip, like the pu branch in the Git repo. You shouldn't
> > > merge from pu either.
> > 
> > You are right, in that what can be done with such branches is limited.
> > BUT you can safely "stg branch --create" off any remote stgit stack.
> > Then you can "stg rebase origin/master" to port your stack to the new
> > tip of the remote stack.
> 
> OK.
> What happens if someone clones the repo, then reorders patches,
> drops some of them, adds new patches in the middle of the stack?

You can't do that out of the box, since you don't get a real stack
when you clone it, you only get the refs.  You would need to uncommit
patches manually, and there will not be much support to help you.

Now you're forcing me to unveil my secret plans :)

1. it would be quite easy to reconstruct a full-fledged stack from
those refs, and since you get the remote patchlogs, we could also
fetch any former version of the patch that would be still available
(more work for "stg clone")

2. if no one beats me to doing that, I'll enhance patchlogs some day to
record branching in patchlogs (eg. from "stg branch --clone" or "stg
pick"), as well as merges (eg. from "stg sync")

Note that proper merging from patchlog history will require working at
the meta-diff (ie. "diffs of diffs of trees") level, just like proper
merging at tree-level requires working at the diff level.  I don't
think we have the tools for this yet, so we still have a long way to
go :)


Best regards,
-- 
Yann.
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

----- End forwarded message -----

-- 
MST


From rdreier at cisco.com  Sun May  6 14:35:22 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 06 May 2007 14:35:22 -0700
Subject: [ofa-general] Re: [PATCH 1 of 3] IB/verbs: add cq comp_vector
	support in core
In-Reply-To: <20070503104847.GD10009@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 3 May 2007 13:48:47 +0300")
References: <20070503104847.GD10009@mellanox.co.il>
Message-ID: <adazm4h7i05.fsf@cisco.com>

OK, I added at least this to my tree for now.  I haven't had a chance
to think about the mthca changes one way or another yet...

(I changed your patch to move the assignment of num_comp_vectors into
the individual drivers)
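
As a rough userspace picture of the interface described in the commit
below: each device advertises how many completion vectors it has, and
the CQ creation call takes the vector to use. Everything in this
sketch is hypothetical (the model_* names, the return convention), and
the range check in particular is an assumption of the sketch rather
than something this patch adds to ib_create_cq().

```c
#include <assert.h>

struct model_device {
	int num_comp_vectors;	/* set by each driver; 1 for now */
};

struct model_cq {
	struct model_device *device;
	int cqe;
	int comp_vector;
};

/*
 * Create a CQ bound to one completion vector.  Returns 0 on success
 * and -1 if the requested vector does not exist on this device
 * (assumed policy, for illustration only).
 */
static int model_create_cq(struct model_device *dev, struct model_cq *cq,
			   int cqe, int comp_vector)
{
	if (comp_vector < 0 || comp_vector >= dev->num_comp_vectors)
		return -1;

	cq->device = dev;
	cq->cqe = cqe;
	cq->comp_vector = comp_vector;
	return 0;
}
```

With num_comp_vectors set to 1, a caller that passes 0 for the vector,
as all ULPs do after this patch, always succeeds in this model.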

commit c15f960a112f8f0158e24b801bdce40da52ce485
Author: Michael S. Tsirkin <mst at dev.mellanox.co.il>
Date:   Thu May 3 13:48:47 2007 +0300

    IB: Add CQ comp_vector support
    
    Add a num_comp_vectors member to struct ib_device and extend
    ib_create_cq() to pass in a comp_vector parameter -- this parallels
    the userspace libibverbs API.  Update all hardware drivers to set
    num_comp_vectors to 1 and have all ULPs pass 0 for the comp_vector
    value.  Pass the value of num_comp_vectors to userspace rather than
    hard-coding a value of 1.
    
    We want multiple CQ event vector support (via MSI-X or similar for
    adapters that can generate multiple interrupts), but it's not clear
    how many vectors we want, or how we want to deal with policy issues
    such as how to decide which vector to use or how to set up interrupt
    affinity.  This patch is useful for experimenting, since no core
    changes will be necessary when updating a driver to support multiple
    vectors, and we know that we want to make at least these changes
    anyway.
    
    Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 6edfecf..85ccf13 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2771,7 +2771,7 @@ static int ib_mad_port_open(struct ib_device *device,
 	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
 	port_priv->cq = ib_create_cq(port_priv->device,
 				     ib_mad_thread_completion_handler,
-				     NULL, port_priv, cq_size);
+				     NULL, port_priv, cq_size, 0);
 	if (IS_ERR(port_priv->cq)) {
 		printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n");
 		ret = PTR_ERR(port_priv->cq);
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 4fd75af..bab6676 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -802,6 +802,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file,
 	INIT_LIST_HEAD(&obj->async_list);
 
 	cq = file->device->ib_dev->create_cq(file->device->ib_dev, cmd.cqe,
+					     cmd.comp_vector,
 					     file->ucontext, &udata);
 	if (IS_ERR(cq)) {
 		ret = PTR_ERR(cq);
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index f8bc822..d44e547 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -752,7 +752,7 @@ static void ib_uverbs_add_one(struct ib_device *device)
 	spin_unlock(&map_lock);
 
 	uverbs_dev->ib_dev           = device;
-	uverbs_dev->num_comp_vectors = 1;
+	uverbs_dev->num_comp_vectors = device->num_comp_vectors;
 
 	uverbs_dev->dev = cdev_alloc();
 	if (!uverbs_dev->dev)
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index ccdf93d..86ed8af 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -609,11 +609,11 @@ EXPORT_SYMBOL(ib_destroy_qp);
 struct ib_cq *ib_create_cq(struct ib_device *device,
 			   ib_comp_handler comp_handler,
 			   void (*event_handler)(struct ib_event *, void *),
-			   void *cq_context, int cqe)
+			   void *cq_context, int cqe, int comp_vector)
 {
 	struct ib_cq *cq;
 
-	cq = device->create_cq(device, cqe, NULL, NULL);
+	cq = device->create_cq(device, cqe, comp_vector, NULL, NULL);
 
 	if (!IS_ERR(cq)) {
 		cq->device        = device;
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index 607c09b..1091662 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -290,7 +290,7 @@ static int c2_destroy_qp(struct ib_qp *ib_qp)
 	return 0;
 }
 
-static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries,
+static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, int vector,
 				  struct ib_ucontext *context,
 				  struct ib_udata *udata)
 {
@@ -795,6 +795,7 @@ int c2_register_device(struct c2_dev *dev)
 	memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
 	memcpy(&dev->ibdev.node_guid, dev->pseudo_netdev->dev_addr, 6);
 	dev->ibdev.phys_port_cnt = 1;
+	dev->ibdev.num_comp_vectors = 1;
 	dev->ibdev.dma_device = &dev->pcidev->dev;
 	dev->ibdev.query_device = c2_query_device;
 	dev->ibdev.query_port = c2_query_port;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 93038c0..78a495f 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -139,7 +139,7 @@ static int iwch_destroy_cq(struct ib_cq *ib_cq)
 	return 0;
 }
 
-static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries,
+static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, int vector,
 			     struct ib_ucontext *ib_context,
 			     struct ib_udata *udata)
 {
@@ -1110,6 +1110,7 @@ int iwch_register_device(struct iwch_dev *dev)
 	dev->ibdev.node_type = RDMA_NODE_RNIC;
 	memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC));
 	dev->ibdev.phys_port_cnt = dev->rdev.port_info.nports;
+	dev->ibdev.num_comp_vectors = 1;
 	dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev);
 	dev->ibdev.query_device = iwch_query_device;
 	dev->ibdev.query_port = iwch_query_port;
diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index e2cdc1a..67f0670 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -113,7 +113,7 @@ struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int real_qp_num)
 	return ret;
 }
 
-struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe,
+struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 			     struct ib_ucontext *context,
 			     struct ib_udata *udata)
 {
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 95fd59f..aff96ac 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -123,7 +123,7 @@ int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq);
 void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq);
 
 
-struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe,
+struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 			     struct ib_ucontext *context,
 			     struct ib_udata *udata);
 
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 3b23d67..77bb36b 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -313,6 +313,7 @@ int ehca_init_device(struct ehca_shca *shca)
 
 	shca->ib_device.node_type           = RDMA_NODE_IB_CA;
 	shca->ib_device.phys_port_cnt       = shca->num_ports;
+	shca->ib_device.num_comp_vectors    = 1;
 	shca->ib_device.dma_device          = &shca->ibmebus_dev->ofdev.dev;
 	shca->ib_device.query_device        = ehca_query_device;
 	shca->ib_device.query_port          = ehca_query_port;
@@ -375,7 +376,7 @@ static int ehca_create_aqp1(struct ehca_shca *shca, u32 port)
 		return -EPERM;
 	}
 
-	ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10);
+	ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10, 0);
 	if (IS_ERR(ibcq)) {
 		ehca_err(&shca->ib_device, "Cannot create AQP1 CQ.");
 		return PTR_ERR(ibcq);
diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c
index 4715f89..00d3eb9 100644
--- a/drivers/infiniband/hw/ipath/ipath_cq.c
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -204,7 +204,7 @@ static void send_complete(unsigned long data)
  *
  * Called by ib_create_cq() in the generic verbs code.
  */
-struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries,
+struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector,
 			      struct ib_ucontext *context,
 			      struct ib_udata *udata)
 {
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
index b676ea8..12933e7 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -1561,6 +1561,7 @@ int ipath_register_ib_device(struct ipath_devdata *dd)
 		(1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV);
 	dev->node_type = RDMA_NODE_IB_CA;
 	dev->phys_port_cnt = 1;
+	dev->num_comp_vectors = 1;
 	dev->dma_device = &dd->pcidev->dev;
 	dev->query_device = ipath_query_device;
 	dev->modify_device = ipath_modify_device;
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h
index ac66c00..2d734fb 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h
@@ -735,7 +735,7 @@ int ipath_destroy_srq(struct ib_srq *ibsrq);
 
 int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry);
 
-struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries,
+struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector,
 			      struct ib_ucontext *context,
 			      struct ib_udata *udata);
 
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 47e6fd4..1c05486 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -663,6 +663,7 @@ static int mthca_destroy_qp(struct ib_qp *qp)
 }
 
 static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries,
+				     int comp_vector,
 				     struct ib_ucontext *context,
 				     struct ib_udata *udata)
 {
@@ -1292,6 +1293,7 @@ int mthca_register_device(struct mthca_dev *dev)
 		(1ull << IB_USER_VERBS_CMD_DETACH_MCAST);
 	dev->ib_dev.node_type            = RDMA_NODE_IB_CA;
 	dev->ib_dev.phys_port_cnt        = dev->limits.num_ports;
+	dev->ib_dev.num_comp_vectors     = 1;
 	dev->ib_dev.dma_device           = &dev->pdev->dev;
 	dev->ib_dev.query_device         = mthca_query_device;
 	dev->ib_dev.query_port           = mthca_query_port;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 1e27930..b8089a0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -793,7 +793,7 @@ static int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn,
 	}
 
 	p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p,
-			     ipoib_sendq_size + 1);
+			     ipoib_sendq_size + 1, 0);
 	if (IS_ERR(p->cq)) {
 		ret = PTR_ERR(p->cq);
 		ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 7f3ec20..5c3c6a4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -187,7 +187,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 	if (!ret)
 		size += ipoib_recvq_size;
 
-	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size);
+	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
 	if (IS_ERR(priv->cq)) {
 		printk(KERN_WARNING "%s: failed to create CQ\n", ca->name);
 		goto out_free_mr;
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 1fc9674..89d6008 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -76,7 +76,7 @@ static int iser_create_device_ib_res(struct iser_device *device)
 				  iser_cq_callback,
 				  iser_cq_event_callback,
 				  (void *)device,
-				  ISER_MAX_CQ_LEN);
+				  ISER_MAX_CQ_LEN, 0);
 	if (IS_ERR(device->cq))
 		goto cq_err;
 
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 3468ae1..39bf057 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -197,7 +197,7 @@ static int srp_create_target_ib(struct srp_target_port *target)
 		return -ENOMEM;
 
 	target->cq = ib_create_cq(target->srp_host->dev->dev, srp_completion,
-				  NULL, target, SRP_CQ_SIZE);
+				  NULL, target, SRP_CQ_SIZE, 0);
 	if (IS_ERR(target->cq)) {
 		ret = PTR_ERR(target->cq);
 		goto out;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 765589f..17cc309 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -912,6 +912,8 @@ struct ib_device {
 
 	u32                           flags;
 
+	int			      num_comp_vectors;
+
 	struct iw_cm_verbs	     *iwcm;
 
 	int		           (*query_device)(struct ib_device *device,
@@ -978,6 +980,7 @@ struct ib_device {
 						struct ib_recv_wr *recv_wr,
 						struct ib_recv_wr **bad_recv_wr);
 	struct ib_cq *             (*create_cq)(struct ib_device *device, int cqe,
+						int comp_vector,
 						struct ib_ucontext *context,
 						struct ib_udata *udata);
 	int                        (*destroy_cq)(struct ib_cq *cq);
@@ -1358,13 +1361,15 @@ static inline int ib_post_recv(struct ib_qp *qp,
  * @cq_context: Context associated with the CQ returned to the user via
  *   the associated completion and event handlers.
  * @cqe: The minimum size of the CQ.
+ * @comp_vector: Completion vector used to signal completion events.
+ *     Must be >= 0 and < device->num_comp_vectors.
  *
  * Users can examine the cq structure to determine the actual CQ size.
  */
 struct ib_cq *ib_create_cq(struct ib_device *device,
 			   ib_comp_handler comp_handler,
 			   void (*event_handler)(struct ib_event *, void *),
-			   void *cq_context, int cqe);
+			   void *cq_context, int cqe, int comp_vector);
 
 /**
  * ib_resize_cq - Modifies the capacity of the CQ.
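
The range rule documented for comp_vector above can be sketched in plain
userspace C. This is only an illustration under stated assumptions:
struct fake_ib_device, check_comp_vector() and pick_comp_vector() are
hypothetical names and are not part of the patch or of the verbs API.
Only the num_comp_vectors field and the ">= 0 and < num_comp_vectors"
rule come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the one struct ib_device field the patch adds. */
struct fake_ib_device {
	int num_comp_vectors;
};

/*
 * Return 0 when comp_vector is usable on this device, -EINVAL otherwise.
 * Mirrors the documented rule: 0 <= comp_vector < num_comp_vectors.
 */
int check_comp_vector(const struct fake_ib_device *dev, int comp_vector)
{
	assert(dev != NULL);
	if (comp_vector < 0 || comp_vector >= dev->num_comp_vectors)
		return -EINVAL;
	return 0;
}

/*
 * One possible (assumed) policy: spread CQs across vectors by CPU number.
 * With num_comp_vectors == 1, as every driver in this patch sets it,
 * the result is always vector 0.
 */
int pick_comp_vector(const struct fake_ib_device *dev, int cpu)
{
	assert(dev != NULL);
	if (dev->num_comp_vectors <= 0 || cpu < 0)
		return 0;
	return cpu % dev->num_comp_vectors;
}
```

All callers converted in the patch pass 0, which is the only value that
satisfies the rule while every driver reports a single vector.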


From dannyz at mellanox.co.il  Sun May  6 15:28:43 2007
From: dannyz at mellanox.co.il (Danny Zarko)
Date: Mon, 7 May 2007 01:28:43 +0300
Subject: [ofa-general] RE: OFED 1.2 RC3 is delayed for Monday next week (May
	7)
In-Reply-To: <463A4F26.3010804@mellanox.co.il>
Message-ID: <6C2C79E72C305246B504CBA17B5500C90172404F@mtlexch01.mtl.com>

The bug could not be reproduced at Mellanox. We will not be able to
handle it before next week.

________________________________

From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] 
Sent: Thursday, May 03, 2007 5:08 PM
To: EWG; Christoph Raisch; Moni Levy; Roland Dreier; Michael S. Tsirkin;
Ami Perlmutter; Vladimir Sokolovsky; Pavel Shamis; Danny Zarko
Cc: OpenFabrics General
Subject: OFED 1.2 RC3 is delayed for Monday next week (May 7)


Hi All,
Since some of the critical bugs are not solved yet, we decided to delay
the release to Monday, May 7.

This is the list of critical bugs that should be fixed for RC3:

Bug ID  Severity  Assigned to                Summary
574     blocker   raisch at de.ibm.com       ehca driver fails while running openmpi
420     critical  monil at voltaire.com      PKey table reordering caused by SM failover stops ipoib traffic
577     critical  rolandd at cisco.com       SRP multipath failover too slow (minutes, not seconds)
465     critical  mst at mellanox.co.il      IPoIB HA fails after several hours of failovers
549     critical  amip at dev.mellanox.co.il SDP Policy need to be consistent
597     critical  vlad at mellanox.co.il     support RHEL4U5 in OFED 1.2
499     major     vlad at mellanox.co.il     module compiled over ofed won't load due to symbol version mismatch
519     major     pasha at mellanox.co.il    MVAPICH I APPLICATION ABORTS WITH PARTITIONS CONFIGURED
534     major     vlad at mellanox.co.il     SLES9 - Installer fails on declarations - OFED 1.2-20070409
530     major     dannyz at mellanox.co.il   ibdiagnet -r fails on RHEL5 i686
538     major     monis at voltaire.com      integrate IPoIB bonding with IPoIB CM
541     major     mst at mellanox.co.il      slow failover with IPoIB CM bonding/ipoibtools HA
558     major     rolandd at cisco.com       tvflash configure fails on SLES10 SP1 RC2


All owners of blocker and critical bugs - please reply with the status
of the bug resolution.

Thanks,
Tziporet


From sean.hefty at intel.com  Sun May  6 21:17:18 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Sun, 6 May 2007 21:17:18 -0700
Subject: [ofa-general] RE: man pages for the rdma-cm
In-Reply-To: <1178127046.18609.107.camel@stevo-desktop>
Message-ID: <000001c7905e$9a562190$95fd070a@amr.corp.intel.com>

>Are there man pages for the rdma-cm in the pipeline?  I think it would
>be great (requirement?) to have these for ofed-1.2 since we do have the
>other verbs man pages.

I've added man pages for the APIs and test programs to my master and ofed_1_2
branches.  If anyone gets a chance, I'd appreciate someone looking them over.  I
plan on requesting that they be pulled into the rc3 release.

- Sean


From rdreier at cisco.com  Sun May  6 21:19:18 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 06 May 2007 21:19:18 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <adaps5d6zax.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This is the second batch of merges for 2.6.22 -- mostly fixes, but
also the conversion of IPoIB to use NAPI:

Ishai Rabinovitz (1):
      IB/srp: Add orig_dgid sysfs attribute to scsi_host

Michael S. Tsirkin (4):
      IB/mthca: Work around kernel QP starvation
      IPoIB/cm: Fix error handling in ipoib_cm_dev_open()
      IPoIB/cm: Don't crash if remote side uses one QP for both directions
      IB: Add CQ comp_vector support

Ralph Campbell (4):
      IB/ipath: Don't call spin_lock_irq() from interrupt context
      IB/ipath: Don't put QP in timeout queue if waiting to send
      IB/ipath: Fix two more spin lock problems
      IB/ipath: Fix a race condition when generating ACKs

Robert Walsh (1):
      IB/ipath: Don't corrupt pending mmap list when unmapped objects are freed

Roland Dreier (4):
      IB/srp: Set proc_name
      IB/fmr_pool: Add prefix to all printks
      IB: Return "maybe missed event" hint from ib_req_notify_cq()
      IPoIB: Convert to NAPI

Steve Wise (4):
      RDMA/cxgb3: Fix TERM codes
      RDMA/cxgb3: Fail qp creation if the requested max_inline is too large
      RDMA/cxgb3: Initialize cpu_idx field in cpl_close_listserv_req message
      RDMA/cxgb3: Support for new abort logic

 drivers/infiniband/core/fmr_pool.c           |   32 +++++----
 drivers/infiniband/core/mad.c                |    2 +-
 drivers/infiniband/core/uverbs_cmd.c         |    1 +
 drivers/infiniband/core/uverbs_main.c        |    2 +-
 drivers/infiniband/core/verbs.c              |    4 +-
 drivers/infiniband/hw/amso1100/c2.h          |    2 +-
 drivers/infiniband/hw/amso1100/c2_cq.c       |   16 ++++-
 drivers/infiniband/hw/amso1100/c2_provider.c |    3 +-
 drivers/infiniband/hw/cxgb3/cxio_hal.c       |    3 +
 drivers/infiniband/hw/cxgb3/cxio_wr.h        |    1 +
 drivers/infiniband/hw/cxgb3/iwch_cm.c        |   19 ++++++
 drivers/infiniband/hw/cxgb3/iwch_cm.h        |    6 ++
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |   14 +++-
 drivers/infiniband/hw/cxgb3/iwch_qp.c        |   69 +++++++++++---------
 drivers/infiniband/hw/ehca/ehca_cq.c         |    2 +-
 drivers/infiniband/hw/ehca/ehca_iverbs.h     |    4 +-
 drivers/infiniband/hw/ehca/ehca_main.c       |    3 +-
 drivers/infiniband/hw/ehca/ehca_reqs.c       |   14 +++-
 drivers/infiniband/hw/ehca/ipz_pt_fn.h       |    8 ++
 drivers/infiniband/hw/ipath/ipath_cq.c       |   68 ++++++++++----------
 drivers/infiniband/hw/ipath/ipath_mmap.c     |   64 +++++++++++++++++--
 drivers/infiniband/hw/ipath/ipath_qp.c       |   52 +++++++++------
 drivers/infiniband/hw/ipath/ipath_rc.c       |   55 ++++++++--------
 drivers/infiniband/hw/ipath/ipath_srq.c      |   55 ++++++++--------
 drivers/infiniband/hw/ipath/ipath_verbs.c    |    4 +
 drivers/infiniband/hw/ipath/ipath_verbs.h    |   24 +++++--
 drivers/infiniband/hw/mthca/mthca_cq.c       |   12 ++--
 drivers/infiniband/hw/mthca/mthca_dev.h      |    4 +-
 drivers/infiniband/hw/mthca/mthca_provider.c |    2 +
 drivers/infiniband/hw/mthca/mthca_qp.c       |   13 ++++
 drivers/infiniband/ulp/ipoib/ipoib.h         |    1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c      |   14 +++--
 drivers/infiniband/ulp/ipoib/ipoib_ib.c      |   89 ++++++++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_main.c    |    2 +
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c   |    2 +-
 drivers/infiniband/ulp/iser/iser_verbs.c     |    2 +-
 drivers/infiniband/ulp/srp/ib_srp.c          |   27 +++++++-
 drivers/infiniband/ulp/srp/ib_srp.h          |    1 +
 drivers/net/cxgb3/version.h                  |    4 +-
 include/rdma/ib_verbs.h                      |   47 +++++++++++---
 40 files changed, 508 insertions(+), 239 deletions(-)


From yosefe at voltaire.com  Sun May  6 23:40:05 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 09:40:05 +0300
Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache
In-Reply-To: <20070503124956.GB9719@mellanox.co.il>
References: <4638B432.3060801@voltaire.com>
	<4638B4D5.7050709@voltaire.com>	<20070502171829.GO22292@mellanox.co.il>	<4639D16F.4060807@voltaire.com>	<20070503122215.GA9719@mellanox.co.il>	<4639D584.3010706@voltaire.com>
	<20070503124956.GB9719@mellanox.co.il>
Message-ID: <463EC9C5.2010509@voltaire.com>


How about keeping the cache, but keeping it always up to date by
registering it to process incoming MADs instead of events?


From vlad at lists.openfabrics.org  Mon May  7 02:37:55 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon,  7 May 2007 02:37:55 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070507-0200 daily build status
Message-ID: <20070507093756.27DD9E60838@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-22.ELsmp

Failed:
Build failed on i686 with linux-2.6.21.1


From bs at q-leap.de  Mon May  7 02:40:10 2007
From: bs at q-leap.de (Bernd Schubert)
Date: Mon, 7 May 2007 11:40:10 +0200
Subject: [ofa-general] [PATCH] IB/ipath - Don't call spin_lock_irq() from
	interrupt context
In-Reply-To: <1177697471.3407.14.camel@brick.pathscale.com>
References: <1177697471.3407.14.camel@brick.pathscale.com>
Message-ID: <200705071140.10854.bs@q-leap.de>

On Friday 27 April 2007 20:11:11 Ralph Campbell wrote:
> This patch fixes the problem reported by Bernd Schubert <bs at q-leap.de>
> with kernel debug options enabled.
> BUG: at kernel/lockdep.c:1860 trace_hardirqs_on()
>
> Hopefully, this can be included in OFED 1.2 as well as
> going upstream.
>
> Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>
>
> diff -r 97262e873c51 drivers/infiniband/hw/ipath/ipath_rc.c

This file looks significantly different from the version in 2.6.20 and 2.6.21.
Where can I get the latest version of the driver?

Thanks in advance,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH


From eli at mellanox.co.il  Mon May  7 04:52:22 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 07 May 2007 14:52:22 +0300
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: commands timeout
Message-ID: <1178538772.10759.2.camel@mtls03>

When the system is busy it may happen that the command actually
completed but it took more than the specified timeout till the
task executing the command was actually given CPU time. This test
checks that the completion is really missing before failing.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: connectx_kernel/drivers/net/mlx4/cmd.c
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/cmd.c	2007-05-07 12:32:35.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/cmd.c	2007-05-07 14:30:26.000000000 +0300
@@ -272,10 +272,11 @@ static int mlx4_cmd_wait(struct mlx4_dev
 	mlx4_cmd_post(dev, in_param, out_param ? *out_param : 0,
 		      in_modifier, op_modifier, op, context->token, 1);
 
-	if (!wait_for_completion_timeout(&context->done, msecs_to_jiffies(timeout))) {
-		err = -EBUSY;
-		goto out;
-	}
+	if (!wait_for_completion_timeout(&context->done, msecs_to_jiffies(timeout)))
+		if (!context->done.done) {
+			err = -EBUSY;
+			goto out;
+		}
 
 	err = context->result;
 	if (err)


From mst at dev.mellanox.co.il  Mon May  7 05:04:29 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 May 2007 15:04:29 +0300
Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout
In-Reply-To: <1178538772.10759.2.camel@mtls03>
References: <1178538772.10759.2.camel@mtls03>
Message-ID: <20070507115714.GC29350@mellanox.co.il>

> Quoting Eli Cohen <eli at mellanox.co.il>:
> Subject: [PATCH] IB/mlx4 mlx4_ib: commands timeout
> 
> When the system is busy it may happen that the command actually
> completed but it took more than the specified timeout till the
> task executing the command was actually given CPU time. This test
> checks that the completion is really missing before failing.
> 
> Signed-off-by: Eli Cohen <eli at mellanox.co.il>

How likely is this to help in practice?

-- 
MST


From eli at mellanox.co.il  Mon May  7 05:18:00 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 07 May 2007 15:18:00 +0300
Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout
In-Reply-To: <20070507115714.GC29350@mellanox.co.il>
References: <1178538772.10759.2.camel@mtls03>
	<20070507115714.GC29350@mellanox.co.il>
Message-ID: <1178540310.10759.9.camel@mtls03>

On Mon, 2007-05-07 at 15:04 +0300, Michael S. Tsirkin wrote:
> How likely is this to help in practice?
> 
Like I said, when the system is very busy. In this case the command may
actually complete very soon but wait_for_completion_timeout() will
nevertheless return zero since the task did not get CPU time before the
specified timeout expired. In this case we would like to check if done
is signaled and thus not fail the command.
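
The decision described above can be modelled in userspace as follows.
This is a sketch only, not the mlx4 code: model_completion and
cmd_wait_status() are made-up names, and the model ignores the locking
that the kernel's struct completion uses internally.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model of the kernel's struct completion. */
struct model_completion {
	unsigned int done;	/* non-zero once the completion was signalled */
};

/*
 * wait_ret models the return value of wait_for_completion_timeout():
 * 0 means the timeout expired, non-zero means the wait succeeded.
 * Returns 0 if the command should be treated as completed, and
 * -EBUSY if it really is still outstanding.
 */
int cmd_wait_status(unsigned long wait_ret, const struct model_completion *c)
{
	assert(c != NULL);
	if (wait_ret != 0)
		return 0;
	/* Timed out: fail only if the completion was never signalled. */
	return c->done ? 0 : -EBUSY;
}
```

The middle case is the one the patch targets: the wait reports a
timeout, but the completion has in fact been signalled, so the command
is not failed.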


From halr at voltaire.com  Mon May  7 05:32:45 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 May 2007 08:32:45 -0400
Subject: [ofa-general] Re: [PATCH TRIVIAL] opensm/osm_helper: remove repeated
	strlen() calls
In-Reply-To: <20070506130352.GC9692@sashak.voltaire.com>
References: <462C7C21.7010004@dev.mellanox.co.il>
	<20070423101738.GG4579@mellanox.co.il>
	<462E80A3.5060503@dev.mellanox.co.il>
	<20070501005101.GA26019@sashak.voltaire.com>
	<4636E4A7.7060108@dev.mellanox.co.il>
	<1178211572.32222.3479.camel@hal.voltaire.com>
	<20070506124333.GB9692@sashak.voltaire.com>
	<20070506130352.GC9692@sashak.voltaire.com>
Message-ID: <1178541140.32222.348653.camel@hal.voltaire.com>

On Sun, 2007-05-06 at 09:03, Sasha Khapyorsky wrote:
> Replace repeated strlen() calls used in sprintf() by actual string
> length accumulated from sprintf() return values.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied to master only (as this is a cleanup rather than a bug
fix). Let me know if you think this should be applied to ofed_1_2.
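
For readers who have not seen the pattern, here is a small standalone
sketch of accumulating the length from the formatting call's return
value instead of calling strlen() on the buffer each time. It is not the
osm_helper code: format_pairs() and its arguments are invented for
illustration, and it uses snprintf() rather than sprintf() so that the
example can also report a buffer that is too small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Append "key=value;" pairs to buf, tracking the write offset from
 * snprintf() return values instead of calling strlen(buf) each time.
 * Returns the final string length, or -1 if the buffer is too small.
 */
int format_pairs(char *buf, size_t size, const char *const *keys,
		 const int *values, size_t count)
{
	size_t len = 0;
	size_t i;

	assert(buf != NULL && size > 0);
	buf[0] = '\0';
	for (i = 0; i < count; i++) {
		int n = snprintf(buf + len, size - len, "%s=%d;",
				 keys[i], values[i]);

		if (n < 0 || (size_t)n >= size - len)
			return -1;	/* encoding error or truncated output */
		len += (size_t)n;
	}
	return (int)len;
}
```

Each call writes at buf + len, so the cost of appending no longer grows
with the amount of text already in the buffer.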

-- Hal


From yosefe at voltaire.com  Mon May  7 05:52:49 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 15:52:49 +0300
Subject: [ofa-general] [PATCH 0/6 v2] fix pkey change handling and remove the
	cahce
Message-ID: <463F2121.5080803@voltaire.com>

The issue addressed is keeping ipoib interfaces alive even when the order of a port's
pkey table changes. pkey-to-index queries were using a cache. However, the cache might
not be up to date when ipoib asks it to resolve a pkey, so a direct query must be used
instead. On the other hand, in build_mlx_header the pkey query must be atomic, so the
driver will keep its own pkey cache, which is non-blocking and always updated before
ipoib is notified of the event.
In addition, remove the delayed pkey initialization thread; instead, start the interface
on pkey change notification.

changes from v1:
 * code style fixes
 * reorganize patches
 * mthca: add gid cache
 * mad: add lmc cache

patch ordering:
1. core: add blocking device queries
2. core,ulp: use blocking queries
3. ipoib: handle pkey event
4. mthca: cache gids and pkeys
5. mad: cache lmc
6. core: remove cache
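
The ordering requirement on the driver-side cache (refresh first, notify
second) can be sketched in userspace like this. It is a model under
stated assumptions, not the mthca code from patch 4: every name below
(model_port, model_find_cached_pkey, model_pkey_change, the table size)
is hypothetical, and the locking a real driver would need is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PKEY_TBL_LEN 4	/* hypothetical table size */

struct model_port {
	uint16_t cached[MODEL_PKEY_TBL_LEN];	/* driver-private pkey copy */
	uint16_t watched_pkey;			/* pkey the consumer uses */
	int seen_index;				/* index seen at event time */
	void (*notify)(struct model_port *port);	/* consumer event hook */
};

/* Non-blocking lookup: only touches the in-memory copy. */
int model_find_cached_pkey(const struct model_port *port, uint16_t pkey,
			   uint16_t *index)
{
	uint16_t i;

	assert(port != NULL && index != NULL);
	for (i = 0; i < MODEL_PKEY_TBL_LEN; i++) {
		if (port->cached[i] == pkey) {
			*index = i;
			return 0;
		}
	}
	return -ENOENT;
}

/* Consumer hook: re-resolve the watched pkey when the table changed. */
void model_consumer_on_event(struct model_port *port)
{
	uint16_t index;

	if (model_find_cached_pkey(port, port->watched_pkey, &index))
		port->seen_index = -1;
	else
		port->seen_index = index;
}

/* Table change: refresh the copy first, and only then notify. */
void model_pkey_change(struct model_port *port,
		       const uint16_t new_tbl[MODEL_PKEY_TBL_LEN])
{
	assert(port != NULL && new_tbl != NULL);
	memcpy(port->cached, new_tbl, sizeof port->cached);
	if (port->notify)
		port->notify(port);
}
```

Because the copy is refreshed before the hook runs, the consumer's
lookup during the event already sees the reordered table.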


From yosefe at voltaire.com  Mon May  7 05:54:34 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 15:54:34 +0300
Subject: [ofa-general] [PATCH 1/6 v2] fix pkey change handling and remove the
	cahce
In-Reply-To: <463F2121.5080803@voltaire.com>
References: <463F2121.5080803@voltaire.com>
Message-ID: <463F218A.7030400@voltaire.com>

core: uncached "find gid" and "find pkey" queries

* Add ib_find_gid and ib_find_pkey over possibly blocking device
  queries. 

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/device.c |   96 +++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h          |   23 +++++++++
 2 files changed, 119 insertions(+)

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-06 09:16:18.000000000 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-06 09:33:50.000000000 +0300
@@ -149,6 +149,18 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
+}
+
 /**
  * ib_alloc_device - allocate an IB device struct
  * @size:size of structure to allocate
@@ -592,6 +604,90 @@ int ib_modify_port(struct ib_device *dev
 }
 EXPORT_SYMBOL(ib_modify_port);
 
+/**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		       u8 *port_num, u16 *index)
+{
+	struct ib_port_attr *tprops = NULL;
+	union ib_gid tmp_gid;
+	int ret, port, i;
+
+	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
+
+	for (port = start_port(device); port <= end_port(device); ++port) {
+		ret = ib_query_port(device, port, tprops);
+		if (ret)
+			continue;
+
+		for (i = 0; i < tprops->gid_tbl_len; ++i) {
+			ret = ib_query_gid(device, port, i, &tmp_gid);
+			if (ret)
+				goto out;
+			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
+				*port_num = port;
+				*index = i;
+				ret = 0;
+				goto out;
+			}
+		}
+	}
+	ret = -ENOENT;
+out:
+	kfree(tprops);
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_gid);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index)
+{
+	struct ib_port_attr *tprops = NULL;
+	int ret, i;
+	u16 tmp_pkey;
+
+	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
+
+	ret = ib_query_port(device, port_num, tprops);
+	if (ret) {
+		printk(KERN_WARNING "ib_query_port failed , ret = %d\n", ret);
+		goto out;
+	}
+
+	for (i = 0; i < tprops->pkey_tbl_len; ++i) {
+		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
+		if (ret)
+			goto out;
+
+		if (pkey == tmp_pkey) {
+			*index = i;
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = -ENOENT;
+
+out:
+	kfree(tprops);
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_pkey);
+
 static int __init ib_core_init(void)
 {
 	int ret;
Index: b/include/rdma/ib_verbs.h
===================================================================
--- a/include/rdma/ib_verbs.h	2007-05-06 09:16:18.000000000 +0300
+++ b/include/rdma/ib_verbs.h	2007-05-06 09:16:22.000000000 +0300
@@ -1134,6 +1134,29 @@ int ib_modify_port(struct ib_device *dev
 		   struct ib_port_modify *port_modify);
 
 /**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		       u8 *port_num, u16 *index);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index);
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *
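
The kernel-doc above says @index may be NULL. A search that honours
that contract could look like the userspace sketch below. It is an
illustration only: model_gid, model_gid_port and model_find_gid() are
hypothetical names, the tables are plain arrays rather than device
queries, and ports are numbered from 1 as on an IB CA.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for union ib_gid and a port's GID table. */
struct model_gid {
	uint8_t raw[16];
};

struct model_gid_port {
	const struct model_gid *gids;
	int gid_tbl_len;
};

/*
 * Linear search over every port's GID table.  Both output pointers are
 * optional: each is written only when the caller supplied it.
 * Returns 0 on a match and -ENOENT when the GID is not present.
 */
int model_find_gid(const struct model_gid_port *ports, int num_ports,
		   const struct model_gid *gid,
		   uint8_t *port_num, uint16_t *index)
{
	int p, i;

	assert(ports != NULL && gid != NULL);
	for (p = 0; p < num_ports; p++) {
		for (i = 0; i < ports[p].gid_tbl_len; i++) {
			if (memcmp(&ports[p].gids[i], gid, sizeof *gid))
				continue;
			if (port_num)
				*port_num = (uint8_t)(p + 1);
			if (index)
				*index = (uint16_t)i;
			return 0;
		}
	}
	return -ENOENT;
}
```

A kernel version would additionally have to handle a failed kmalloc()
before using the port attributes; that part has no equivalent in this
model.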


From yosefe at voltaire.com  Mon May  7 05:55:40 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 15:55:40 +0300
Subject: [ofa-general] [PATCH 2/6 v2] fix pkey change handling and remove the
	cahce
In-Reply-To: <463F2121.5080803@voltaire.com>
References: <463F2121.5080803@voltaire.com>
Message-ID: <463F21CC.1060807@voltaire.com>

core, ulp: don't use ib_cache

* Modify users of the ib cache in: core, ipoib, srp, to use
  blocking device queries.

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/cm.c            |    8 ++++----
 drivers/infiniband/core/cma.c           |    9 ++++-----
 drivers/infiniband/core/multicast.c     |    3 +--
 drivers/infiniband/core/sa_query.c      |    3 +--
 drivers/infiniband/core/verbs.c         |    3 +--
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |    3 +--
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |    4 ++--
 drivers/infiniband/ulp/srp/ib_srp.c     |    6 ++----
 8 files changed, 16 insertions(+), 23 deletions(-)

Index: b/drivers/infiniband/core/cm.c
===================================================================
--- a/drivers/infiniband/core/cm.c	2007-05-06 09:26:12.000000000 +0300
+++ b/drivers/infiniband/core/cm.c	2007-05-06 09:26:17.000000000 +0300
@@ -46,8 +46,8 @@
 #include <linux/spinlock.h>
 #include <linux/workqueue.h>
 
-#include <rdma/ib_cache.h>
 #include <rdma/ib_cm.h>
+#include <rdma/ib_verbs.h>
 #include "cm_msgs.h"
 
 MODULE_AUTHOR("Sean Hefty");
@@ -275,7 +275,7 @@ static int cm_init_av_by_path(struct ib_
 
 	read_lock_irqsave(&cm.device_lock, flags);
 	list_for_each_entry(cm_dev, &cm.device_list, list) {
-		if (!ib_find_cached_gid(cm_dev->device, &path->sgid,
+		if (!ib_find_gid(cm_dev->device, &path->sgid,
 					&p, NULL)) {
 			port = &cm_dev->port[p-1];
 			break;
@@ -286,7 +286,7 @@ static int cm_init_av_by_path(struct ib_
 	if (!port)
 		return -EINVAL;
 
-	ret = ib_find_cached_pkey(cm_dev->device, port->port_num,
+	ret = ib_find_pkey(cm_dev->device, port->port_num,
 				  be16_to_cpu(path->pkey), &av->pkey_index);
 	if (ret)
 		return ret;
@@ -1382,7 +1382,7 @@ static int cm_req_handler(struct cm_work
 	cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]);
 	ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av);
 	if (ret) {
-		ib_get_cached_gid(work->port->cm_dev->device,
+		ib_query_gid(work->port->cm_dev->device,
 				  work->port->port_num, 0, &work->path[0].sgid);
 		ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
 			       &work->path[0].sgid, sizeof work->path[0].sgid,
Index: b/drivers/infiniband/core/cma.c
===================================================================
--- a/drivers/infiniband/core/cma.c	2007-05-06 09:26:12.000000000 +0300
+++ b/drivers/infiniband/core/cma.c	2007-05-06 09:26:17.000000000 +0300
@@ -41,7 +41,6 @@
 
 #include <rdma/rdma_cm.h>
 #include <rdma/rdma_cm_ib.h>
-#include <rdma/ib_cache.h>
 #include <rdma/ib_cm.h>
 #include <rdma/ib_sa.h>
 #include <rdma/iw_cm.h>
@@ -325,7 +324,7 @@ static int cma_acquire_dev(struct rdma_i
 	}
 
 	list_for_each_entry(cma_dev, &dev_list, list) {
-		ret = ib_find_cached_gid(cma_dev->device, &gid,
+		ret = ib_find_gid(cma_dev->device, &gid,
 					 &id_priv->id.port_num, NULL);
 		if (!ret) {
 			ret = cma_set_qkey(cma_dev->device,
@@ -514,7 +513,7 @@ static int cma_ib_init_qp_attr(struct rd
 	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
 	int ret;
 
-	ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num,
+	ret = ib_find_pkey(id_priv->id.device, id_priv->id.port_num,
 				  ib_addr_get_pkey(dev_addr),
 				  &qp_attr->pkey_index);
 	if (ret)
@@ -1658,11 +1657,11 @@ static int cma_bind_loopback(struct rdma
 	cma_dev = list_entry(dev_list.next, struct cma_device, list);
 
 port_found:
-	ret = ib_get_cached_gid(cma_dev->device, p, 0, &gid);
+	ret = ib_query_gid(cma_dev->device, p, 0, &gid);
 	if (ret)
 		goto out;
 
-	ret = ib_get_cached_pkey(cma_dev->device, p, 0, &pkey);
+	ret = ib_query_pkey(cma_dev->device, p, 0, &pkey);
 	if (ret)
 		goto out;
 
Index: b/drivers/infiniband/core/multicast.c
===================================================================
--- a/drivers/infiniband/core/multicast.c	2007-05-06 09:26:12.000000000 +0300
+++ b/drivers/infiniband/core/multicast.c	2007-05-06 09:26:17.000000000 +0300
@@ -38,7 +38,6 @@
 #include <linux/bitops.h>
 #include <linux/random.h>
 
-#include <rdma/ib_cache.h>
 #include "sa.h"
 
 static void mcast_add_one(struct ib_device *device);
@@ -686,7 +685,7 @@ int ib_init_ah_from_mcmember(struct ib_d
 	u16 gid_index;
 	u8 p;
 
-	ret = ib_find_cached_gid(device, &rec->port_gid, &p, &gid_index);
+	ret = ib_find_gid(device, &rec->port_gid, &p, &gid_index);
 	if (ret)
 		return ret;
 
Index: b/drivers/infiniband/core/sa_query.c
===================================================================
--- a/drivers/infiniband/core/sa_query.c	2007-05-06 09:26:12.000000000 +0300
+++ b/drivers/infiniband/core/sa_query.c	2007-05-06 09:26:17.000000000 +0300
@@ -47,7 +47,6 @@
 #include <linux/workqueue.h>
 
 #include <rdma/ib_pack.h>
-#include <rdma/ib_cache.h>
 #include "sa.h"
 
 MODULE_AUTHOR("Roland Dreier");
@@ -477,7 +476,7 @@ int ib_init_ah_from_path(struct ib_devic
 		ah_attr->ah_flags = IB_AH_GRH;
 		ah_attr->grh.dgid = rec->dgid;
 
-		ret = ib_find_cached_gid(device, &rec->sgid, &port_num,
+		ret = ib_find_gid(device, &rec->sgid, &port_num,
 					 &gid_index);
 		if (ret)
 			return ret;
Index: b/drivers/infiniband/core/verbs.c
===================================================================
--- a/drivers/infiniband/core/verbs.c	2007-05-06 09:26:12.000000000 +0300
+++ b/drivers/infiniband/core/verbs.c	2007-05-06 09:26:17.000000000 +0300
@@ -43,7 +43,6 @@
 #include <linux/string.h>
 
 #include <rdma/ib_verbs.h>
-#include <rdma/ib_cache.h>
 
 int ib_rate_to_mult(enum ib_rate rate)
 {
@@ -159,7 +158,7 @@ int ib_init_ah_from_wc(struct ib_device 
 		ah_attr->ah_flags = IB_AH_GRH;
 		ah_attr->grh.dgid = grh->sgid;
 
-		ret = ib_find_cached_gid(device, &grh->dgid, &port_num,
+		ret = ib_find_gid(device, &grh->dgid, &port_num,
 					 &gid_index);
 		if (ret)
 			return ret;
Index: b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-06 09:26:12.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-06 09:26:17.000000000 +0300
@@ -33,7 +33,6 @@
  */
 
 #include <rdma/ib_cm.h>
-#include <rdma/ib_cache.h>
 #include <net/dst.h>
 #include <net/icmp.h>
 #include <linux/icmpv6.h>
@@ -759,7 +758,7 @@ static int ipoib_cm_modify_tx_init(struc
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
 	int qp_attr_mask, ret;
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index);
+	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index);
 	if (ret) {
 		ipoib_warn(priv, "pkey 0x%x not in cache: %d\n", priv->pkey, ret);
 		return ret;
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-06 09:26:12.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-07 14:28:48.625133165 +0300
@@ -38,7 +38,7 @@
 #include <linux/delay.h>
 #include <linux/dma-mapping.h>
 
-#include <rdma/ib_cache.h>
+#include <rdma/ib_verbs.h>
 
 #include "ipoib.h"
 
@@ -446,7 +446,7 @@ static void ipoib_pkey_dev_check_presenc
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	u16 pkey_index = 0;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index))
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index))
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 	else
 		set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
Index: b/drivers/infiniband/ulp/srp/ib_srp.c
===================================================================
--- a/drivers/infiniband/ulp/srp/ib_srp.c	2007-05-06 09:26:12.000000000 +0300
+++ b/drivers/infiniband/ulp/srp/ib_srp.c	2007-05-06 09:26:17.000000000 +0300
@@ -48,8 +48,6 @@
 #include <scsi/scsi_dbg.h>
 #include <scsi/srp.h>
 
-#include <rdma/ib_cache.h>
-
 #include "ib_srp.h"
 
 #define DRV_NAME	"ib_srp"
@@ -164,7 +162,7 @@ static int srp_init_qp(struct srp_target
 	if (!attr)
 		return -ENOMEM;
 
-	ret = ib_find_cached_pkey(target->srp_host->dev->dev,
+	ret = ib_find_pkey(target->srp_host->dev->dev,
 				  target->srp_host->port,
 				  be16_to_cpu(target->path.pkey),
 				  &attr->pkey_index);
@@ -1780,7 +1778,7 @@ static ssize_t srp_create_target(struct 
 	if (ret)
 		goto err;
 
-	ib_get_cached_gid(host->dev->dev, host->port, 0, &target->path.sgid);
+	ib_query_gid(host->dev->dev, host->port, 0, &target->path.sgid);
 
 	printk(KERN_DEBUG PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x "
 	       "service_id %016llx dgid %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",


From yosefe at voltaire.com  Mon May  7 05:57:27 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 15:57:27 +0300
Subject: [ofa-general] [PATCH 3/6 v2] fix pkey change handling and remove the
	cache
In-Reply-To: <463F2121.5080803@voltaire.com>
References: <463F2121.5080803@voltaire.com>
Message-ID: <463F2237.7050809@voltaire.com>

ipoib: handle pkey change events

This issue was found during partitioning & SM fail over testing.

 * Add a flush flag to ipoib_ib_dev_stop(), as ipoib_ib_dev_down() already has
 * Fix a bug in extracting the device from the work struct
 * Drop the warnings that are triggered by a missing PKEY, since this is
   now a valid flow
 * Assume that the cache is coherent: do not retry on cache queries
 * Restart child interfaces before the parent
 * Remove the pkey polling thread and the delayed pkey initialization
 * If an interface is brought up but its pkey is not found, mark it with
   IPOIB_PKEY_NEEDED and try to restart it when a pkey event arrives.

SM reconfiguration or failover may shuffle the values in the port pkey table.
The current implementation queries the index of the pkey only once, when it
creates the device QP and moves it into working state, so it does not handle
this scenario. Fix this by using the PKEY_CHANGE event as a trigger to
reconfigure the device QP.
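
To illustrate the scenario described above, here is a minimal userspace
sketch (not the driver code). It models a port pkey table whose entries are
shuffled, and an interface that resolves its pkey index again on every pkey
change event. The names port_model, iface_model, find_pkey and resolve_pkey
are hypothetical, the table length is arbitrary, and the membership bit of
the pkey is ignored for simplicity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_LEN 4 /* hypothetical table size for this sketch */

/* Hypothetical stand-in for a port's P_Key table. */
struct port_model {
	uint16_t pkey_table[TABLE_LEN];
};

/* Hypothetical stand-in for the per-interface state. */
struct iface_model {
	uint16_t pkey;        /* partition key the interface wants */
	uint16_t pkey_index;  /* index that would be programmed into the QP */
	bool     pkey_needed; /* mirrors the idea behind IPOIB_PKEY_NEEDED */
};

/* Linear search; returns 0 and stores the index, or -1 if absent. */
int find_pkey(const struct port_model *port, uint16_t pkey, uint16_t *index)
{
	for (size_t i = 0; i < TABLE_LEN; i++) {
		if (port->pkey_table[i] == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -1;
}

/* Called at open time and again on every pkey change event. */
void resolve_pkey(struct iface_model *iface, const struct port_model *port)
{
	if (find_pkey(port, iface->pkey, &iface->pkey_index))
		iface->pkey_needed = true;  /* wait for the next event */
	else
		iface->pkey_needed = false; /* index is current again */
}
```

An index resolved only once goes stale as soon as the table is reordered.
Calling resolve_pkey() from the event path either picks up the new index or
leaves the interface marked as waiting for its pkey.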


Signed-off-by: Moni Levy <monil at voltaire.com>
Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |   10 -
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  144 ++++++++-----------------
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |   11 -
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   11 +
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |   21 +--
 5 files changed, 76 insertions(+), 121 deletions(-)

Index: b/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-06 09:26:08.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-06 09:26:18.000000000 +0300
@@ -80,7 +80,7 @@ enum {
 	IPOIB_FLAG_INITIALIZED    = 1,
 	IPOIB_FLAG_ADMIN_UP 	  = 2,
 	IPOIB_PKEY_ASSIGNED 	  = 3,
-	IPOIB_PKEY_STOP 	  = 4,
+	IPOIB_PKEY_NEEDED		  = 4,
 	IPOIB_FLAG_SUBINTERFACE   = 5,
 	IPOIB_MCAST_RUN 	  = 6,
 	IPOIB_STOP_REAPER         = 7,
@@ -202,9 +202,9 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
+	struct work_struct pkey_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
 
@@ -333,12 +333,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
@@ -384,9 +385,6 @@ void ipoib_event(struct ib_event_handler
 int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey);
 int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
 
-void ipoib_pkey_poll(struct work_struct *work);
-int ipoib_pkey_dev_delay_open(struct net_device *dev);
-
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
 
 #define IPOIB_FLAGS_RC          0x80
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-06 09:26:17.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-06 09:26:26.000000000 +0300
@@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -441,28 +441,10 @@ int ipoib_ib_dev_open(struct net_device 
 	return 0;
 }
 
-static void ipoib_pkey_dev_check_presence(struct net_device *dev)
-{
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	u16 pkey_index = 0;
-
-	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index))
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
-	else
-		set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
-}
-
 int ipoib_ib_dev_up(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	ipoib_pkey_dev_check_presence(dev);
-
-	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
-		ipoib_dbg(priv, "PKEY is not assigned.\n");
-		return 0;
-	}
-
 	set_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
 
 	return ipoib_mcast_start_thread(dev);
@@ -477,16 +459,6 @@ int ipoib_ib_dev_down(struct net_device 
 	clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
 	netif_carrier_off(dev);
 
-	/* Shutdown the P_Key thread if still active */
-	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
-		mutex_lock(&pkey_mutex);
-		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
-		mutex_unlock(&pkey_mutex);
-		if (flush)
-			flush_workqueue(ipoib_workqueue);
-	}
-
 	ipoib_mcast_stop_thread(dev, flush);
 	ipoib_mcast_dev_flush(dev);
 
@@ -508,7 +480,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +553,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,14 +595,33 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
-		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
+	mutex_lock(&priv->vlan_mutex);
+
+	/* Flush any child interfaces */
+	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, restart_qp);
+
+	mutex_unlock(&priv->vlan_mutex);
+
+	/*
+	 * If the device is not initialized since it needs a pkey -
+	 * try to reopen it
+	 */
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
+
+		if (restart_qp
+			&& test_bit(IPOIB_PKEY_NEEDED, &priv->flags)
+		    && test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) {
+			/* this iface needs pkey, try to bring it up */
+			ipoib_open(priv->dev);
+		}
+		else
+			ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
 
@@ -642,6 +634,12 @@ void ipoib_ib_dev_flush(struct work_stru
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (restart_qp) {
+		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) )
+			ipoib_ib_dev_stop(dev, 0);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +648,25 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
+	/* we only restart the QP in case of pkey change event */
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_task);
 
-	mutex_unlock(&priv->vlan_mutex);
+	/* restart the QP in case of pkey change event */
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -672,54 +681,3 @@ void ipoib_ib_dev_cleanup(struct net_dev
 	ipoib_transport_dev_cleanup(dev);
 }
 
-/*
- * Delayed P_Key Assigment Interim Support
- *
- * The following is initial implementation of delayed P_Key assigment
- * mechanism. It is using the same approach implemented for the multicast
- * group join. The single goal of this implementation is to quickly address
- * Bug #2507. This implementation will probably be removed when the P_Key
- * change async notification is available.
- */
-
-void ipoib_pkey_poll(struct work_struct *work)
-{
-	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
-	struct net_device *dev = priv->dev;
-
-	ipoib_pkey_dev_check_presence(dev);
-
-	if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
-		ipoib_open(dev);
-	else {
-		mutex_lock(&pkey_mutex);
-		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
-					   HZ);
-		mutex_unlock(&pkey_mutex);
-	}
-}
-
-int ipoib_pkey_dev_delay_open(struct net_device *dev)
-{
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-
-	/* Look for the interface pkey value in the IB Port P_Key table and */
-	/* set the interface pkey assigment flag                            */
-	ipoib_pkey_dev_check_presence(dev);
-
-	/* P_Key value not assigned yet - start polling */
-	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
-		mutex_lock(&pkey_mutex);
-		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
-		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
-				   HZ);
-		mutex_unlock(&pkey_mutex);
-		return 1;
-	}
-
-	return 0;
-}
Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-06 09:26:08.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-06 09:26:18.000000000 +0300
@@ -100,14 +100,11 @@ int ipoib_open(struct net_device *dev)
 
 	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
 
-	if (ipoib_pkey_dev_delay_open(dev))
-		return 0;
-
 	if (ipoib_ib_dev_open(dev))
-		return -EINVAL;
+		return test_bit(IPOIB_PKEY_NEEDED, &priv->flags) ? 0 : -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +149,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +987,7 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-06 09:26:08.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-06 09:26:18.000000000 +0300
@@ -232,9 +232,10 @@ static int ipoib_mcast_join_finish(struc
 		ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid),
 					 &mcast->mcmember.mgid);
 		if (ret < 0) {
-			ipoib_warn(priv, "couldn't attach QP to multicast group "
-				   IPOIB_GID_FMT "\n",
-				   IPOIB_GID_ARG(mcast->mcmember.mgid));
+			if (ret != -ENXIO) /* No pkey found */
+				ipoib_warn(priv, "couldn't attach QP to multicast group "
+					   IPOIB_GID_FMT "\n",
+					   IPOIB_GID_ARG(mcast->mcmember.mgid));
 
 			clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags);
 			return ret;
@@ -312,7 +313,7 @@ ipoib_mcast_sendonly_join_complete(int s
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
 
 	if (status) {
-		if (mcast->logcount++ < 20)
+		if (mcast->logcount++ < 20 && status != -ENXIO)
 			ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for "
 					IPOIB_GID_FMT ", status %d\n",
 					IPOIB_GID_ARG(mcast->mcmember.mgid), status);
@@ -420,7 +421,7 @@ static int ipoib_mcast_join_complete(int
 					", status %d\n",
 					IPOIB_GID_ARG(mcast->mcmember.mgid),
 					status);
-		} else {
+		} else if (status != -ENXIO) {
 			ipoib_warn(priv, "multicast join failed for "
 				   IPOIB_GID_FMT ", status %d\n",
 				   IPOIB_GID_ARG(mcast->mcmember.mgid),
Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-06 09:26:08.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-06 09:26:18.000000000 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,12 +47,12 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+		set_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
 	}
-	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	clear_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 
 	/* set correct QKey for QP */
 	qp_attr->qkey = priv->qkey;
@@ -103,12 +101,12 @@ int ipoib_init_qp(struct net_device *dev
 	 * The port has to be assigned to the respective IB partition in
 	 * advance.
 	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
+	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
 	if (ret) {
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+		set_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 		return ret;
 	}
-	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	clear_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 
 	qp_attr.qp_state = IB_QPS_INIT;
 	qp_attr.qkey = 0;
@@ -238,7 +236,7 @@ void ipoib_transport_dev_cleanup(struct 
 			ipoib_warn(priv, "ib_qp_destroy failed\n");
 
 		priv->qp = NULL;
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+		clear_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 	}
 
 	if (ib_destroy_cq(priv->cq))
@@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler
 		container_of(handler, struct ipoib_dev_priv, event_handler);
 
 	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
 	     record->event == IB_EVENT_PORT_ACTIVE ||
 	     record->event == IB_EVENT_LID_CHANGE  ||
 	     record->event == IB_EVENT_SM_CHANGE   ||
@@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler
 	    record->element.port_num == priv->port) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
+		record->element.port_num == priv->port) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_task);
 	}
 }


From yosefe at voltaire.com  Mon May  7 05:58:22 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 15:58:22 +0300
Subject: [ofa-general] [PATCH 4/6 v2] fix pkey change handling and remove the
	cache
In-Reply-To: <463F2121.5080803@voltaire.com>
References: <463F2121.5080803@voltaire.com>
Message-ID: <463F226E.9040700@voltaire.com>

mthca: cache pkeys and gids

* Use incoming MADs to update the internal cache: use PKEY_TABLE MADs
  to update the pkey table cache, and GUID_INFO and PORT_INFO MADs to
  update the gid table cache (they update the guid table and the gid
  prefix, respectively).
* Modify query_pkey and query_gid to use this cache, which makes them
  non-blocking.
* While creating a MLX QP, use these functions instead of the cache
  from ib core.
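
The block-wise cache update described above can be sketched in userspace as
follows. This is an illustration under stated assumptions, not the driver
code: cache_update_block and cache_query are hypothetical names, locking is
omitted, and byte-order conversion is left out. The only fact taken from the
patch is that one PKEY_TABLE MAD carries 32 entries and that the attribute
modifier selects the block.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PKEYS_PER_BLOCK 32 /* one P_Key table MAD carries 32 entries */

/*
 * Copy one 32-entry block into a cached table.
 * 'block' is the block number taken from the attribute modifier.
 * Returns 0 on success, -1 if the block would overrun the table.
 */
int cache_update_block(uint16_t *table, size_t table_len,
		       const uint16_t *data, unsigned int block)
{
	size_t first = (size_t)block * PKEYS_PER_BLOCK;

	/* reject blocks that do not fit entirely inside the table */
	if (first > table_len || table_len - first < PKEYS_PER_BLOCK)
		return -1;

	memcpy(table + first, data, PKEYS_PER_BLOCK * sizeof(*table));
	return 0;
}

/* Lookup served from the cached copy, so no MAD has to be sent. */
int cache_query(const uint16_t *table, size_t table_len,
		size_t index, uint16_t *pkey)
{
	if (index >= table_len)
		return -1;

	*pkey = table[index];
	return 0;
}
```

Because a query only reads memory, it can be called from contexts that must
not sleep. A real implementation would also need a lock around both the
update and the read so that a reader never sees a half-written block.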


Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/hw/mthca/mthca_av.c       |    3 
 drivers/infiniband/hw/mthca/mthca_dev.h      |   20 +
 drivers/infiniband/hw/mthca/mthca_mad.c      |    3 
 drivers/infiniband/hw/mthca/mthca_provider.c |  284 ++++++++++++++++++++-------
 drivers/infiniband/hw/mthca/mthca_qp.c       |    5 
 include/rdma/ib_smi.h                        |    4 
 6 files changed, 245 insertions(+), 74 deletions(-)

Index: b/drivers/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_dev.h	2007-05-07 14:28:47.574320783 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h	2007-05-07 14:28:55.365929626 +0300
@@ -49,6 +49,8 @@
 
 #include <asm/semaphore.h>
 
+#include <rdma/ib_smi.h>
+
 #include "mthca_provider.h"
 #include "mthca_doorbell.h"
 
@@ -287,6 +289,19 @@ struct mthca_catas_err {
 	struct list_head	list;
 };
 
+struct mthca_pkey_cache {
+	rwlock_t lock;
+	int      table_len;
+	u16      table[0];
+};
+
+struct mthca_gid_cache {
+	rwlock_t lock;
+	u64      gid_prefix;
+	int      table_len;
+	u64      guid_table[0];
+};
+
 extern struct mutex mthca_device_mutex;
 
 struct mthca_dev {
@@ -360,6 +375,9 @@ struct mthca_dev {
 	struct ib_ah         *sm_ah[MTHCA_MAX_PORTS];
 	spinlock_t            sm_lock;
 	u8                    rate[MTHCA_MAX_PORTS];
+
+	struct mthca_pkey_cache *pkey_cache[MTHCA_MAX_PORTS];
+	struct mthca_gid_cache *gid_cache[MTHCA_MAX_PORTS];
 };
 
 #ifdef CONFIG_INFINIBAND_MTHCA_DEBUG
@@ -585,6 +603,8 @@ int mthca_process_mad(struct ib_device *
 int mthca_create_agents(struct mthca_dev *dev);
 void mthca_free_agents(struct mthca_dev *dev);
 
+int mthca_cache_update(struct mthca_dev *mdev, u8 port_num, struct ib_mad *mad);
+
 static inline struct mthca_dev *to_mdev(struct ib_device *ibdev)
 {
 	return container_of(ibdev, struct mthca_dev, ib_dev);
Index: b/drivers/infiniband/hw/mthca/mthca_mad.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_mad.c	2007-05-07 14:28:47.574320783 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_mad.c	2007-05-07 14:28:55.366929448 +0300
@@ -139,6 +139,9 @@ static void smp_snoop(struct ib_device *
 			event.element.port_num = port_num;
 			ib_dispatch_event(&event);
 		}
+
+		/* update cache with the incoming mad */
+		mthca_cache_update(to_mdev(ibdev), port_num, mad);
 	}
 }
 
Index: b/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_provider.c	2007-05-07 14:28:47.575320605 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c	2007-05-07 14:28:55.367929269 +0300
@@ -243,87 +243,44 @@ out:
 static int mthca_query_pkey(struct ib_device *ibdev,
 			    u8 port, u16 index, u16 *pkey)
 {
-	struct ib_smp *in_mad  = NULL;
-	struct ib_smp *out_mad = NULL;
-	int err = -ENOMEM;
-	u8 status;
+	struct mthca_dev *mdev;
+	struct mthca_pkey_cache *pkey_cache;
+	unsigned long flags;
 
-	in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
-	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
-	if (!in_mad || !out_mad)
-		goto out;
+	mdev = to_mdev(ibdev);
 
-	init_query_mad(in_mad);
-	in_mad->attr_id  = IB_SMP_ATTR_PKEY_TABLE;
-	in_mad->attr_mod = cpu_to_be32(index / 32);
-
-	err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1,
-			    port, NULL, NULL, in_mad, out_mad,
-			    &status);
-	if (err)
-		goto out;
-	if (status) {
-		err = -EINVAL;
-		goto out;
+	if (port < 1 || port > mdev->ib_dev.phys_port_cnt ||
+		index >= mdev->pkey_cache[ port - 1 ]->table_len ) {
+		return -EINVAL;
 	}
 
-	*pkey = be16_to_cpu(((__be16 *) out_mad->data)[index % 32]);
-
- out:
-	kfree(in_mad);
-	kfree(out_mad);
-	return err;
+	pkey_cache = mdev->pkey_cache[ port - 1 ];
+	read_lock_irqsave(&pkey_cache->lock, flags);
+	*pkey = be16_to_cpu( pkey_cache->table[ index ] );
+	read_unlock_irqrestore(&pkey_cache->lock, flags);
+	return 0;
 }
 
 static int mthca_query_gid(struct ib_device *ibdev, u8 port,
 			   int index, union ib_gid *gid)
 {
-	struct ib_smp *in_mad  = NULL;
-	struct ib_smp *out_mad = NULL;
-	int err = -ENOMEM;
-	u8 status;
-
-	in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
-	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
-	if (!in_mad || !out_mad)
-		goto out;
+	struct mthca_dev * mdev;
+	unsigned long flags;
+	struct mthca_gid_cache *gid_cache;
 
-	init_query_mad(in_mad);
-	in_mad->attr_id  = IB_SMP_ATTR_PORT_INFO;
-	in_mad->attr_mod = cpu_to_be32(port);
+	mdev = to_mdev(ibdev);
 
-	err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1,
-			    port, NULL, NULL, in_mad, out_mad,
-			    &status);
-	if (err)
-		goto out;
-	if (status) {
-		err = -EINVAL;
-		goto out;
-	}
-
-	memcpy(gid->raw, out_mad->data + 8, 8);
-
-	init_query_mad(in_mad);
-	in_mad->attr_id  = IB_SMP_ATTR_GUID_INFO;
-	in_mad->attr_mod = cpu_to_be32(index / 8);
-
-	err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1,
-			    port, NULL, NULL, in_mad, out_mad,
-			    &status);
-	if (err)
-		goto out;
-	if (status) {
-		err = -EINVAL;
-		goto out;
+	if (port < 1 || port > mdev->ib_dev.phys_port_cnt ||
+		index >= mdev->gid_cache[ port - 1 ]->table_len ) {
+		return -EINVAL;
 	}
 
-	memcpy(gid->raw + 8, out_mad->data + (index % 8) * 8, 8);
-
- out:
-	kfree(in_mad);
-	kfree(out_mad);
-	return err;
+	gid_cache = mdev->gid_cache[ port - 1 ];
+	read_lock_irqsave(&gid_cache->lock, flags);
+	memcpy( gid->raw, &gid_cache->gid_prefix, 8);
+	memcpy( gid->raw + 8, gid_cache->guid_table + index, 8);
+	read_unlock_irqrestore(&gid_cache->lock, flags);
+	return 0;
 }
 
 static struct ib_ucontext *mthca_alloc_ucontext(struct ib_device *ibdev,
@@ -1259,6 +1216,189 @@ out:
 	return err;
 }
 
+/* update a cached table */
+static int mthca_cache_update_table(struct mthca_dev *mdev,
+			void *table, int table_size,
+			void *data, int data_size, int table_offset)
+{
+
+	/* make sure the offset is valid */
+	if (table_size < table_offset+data_size) {
+		mthca_warn(mdev, "cache table offset out of range - ignoring\n");
+		return -EINVAL;
+	}
+
+	/* update the cache */
+	memcpy((u8*)table+table_offset, data, data_size);
+
+	return 0;
+}
+
+/* update the cache with mad */
+int mthca_cache_update(struct mthca_dev *mdev, u8 port_num, struct ib_mad *mad)
+{
+	struct mthca_pkey_cache *pkey_cache;
+	struct mthca_gid_cache *gid_cache;
+	unsigned long flags;
+	struct ib_smp *smp;
+	unsigned int offset;
+	int ret = 0;
+
+	smp = (struct ib_smp*)mad;
+	offset = ( be32_to_cpu(smp->attr_mod) & 0xFFFF );
+	//TODO check if port# is valid
+
+	switch (mad->mad_hdr.attr_id) {
+	case IB_SMP_ATTR_PKEY_TABLE:
+		mthca_dbg(mdev, "port %d: pkey table change\n", port_num);
+		pkey_cache = mdev->pkey_cache[ port_num - 1 ];
+		write_lock_irqsave(&pkey_cache->lock, flags);
+		mthca_cache_update_table(mdev,
+				pkey_cache->table, pkey_cache->table_len * sizeof (u16),
+				smp->data, IB_SMP_NUM_PKEY_ENTRIES * sizeof (u16),
+				offset * IB_SMP_NUM_PKEY_ENTRIES * sizeof (u16));
+		write_unlock_irqrestore(&pkey_cache->lock, flags);
+		break;
+
+	case IB_SMP_ATTR_GUID_INFO:
+		mthca_dbg(mdev, "port %d: guid table change\n", port_num);
+		gid_cache = mdev->gid_cache[ port_num - 1 ];
+		write_lock_irqsave(&gid_cache->lock, flags);
+		mthca_cache_update_table(mdev,
+				gid_cache->guid_table, gid_cache->table_len * sizeof (u64),
+				smp->data, IB_SMP_NUM_GUID_ENTRIES * sizeof (u64),
+				offset * IB_SMP_NUM_GUID_ENTRIES * sizeof (u64));
+		write_unlock_irqrestore(&gid_cache->lock, flags);
+		break;
+
+	case IB_SMP_ATTR_PORT_INFO:
+		mthca_dbg(mdev, "port %d: port info change\n", port_num);
+		gid_cache = mdev->gid_cache[ port_num - 1 ];
+		write_lock_irqsave(&gid_cache->lock, flags);
+		gid_cache->gid_prefix = *(u64*)(smp->data + 8);
+		write_unlock_irqrestore(&gid_cache->lock, flags);
+		break;
+	}
+	return ret;
+}
+
+static int mthca_cache_init(struct mthca_dev *mdev)
+{
+	struct ib_smp *in_mad  = NULL;
+	struct ib_smp *out_mad = NULL;
+	struct mthca_pkey_cache *pkey_cache;
+	struct mthca_gid_cache *gid_cache;
+	unsigned int i, offset;
+	u8 status;
+	int err = -ENOMEM;
+
+	memset(mdev->pkey_cache, 0, sizeof mdev->pkey_cache);
+
+	in_mad  = kmalloc(sizeof *in_mad, GFP_KERNEL);
+	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
+
+	if (!in_mad || !out_mad)
+		goto out;
+
+	for ( i = 0; i < mdev->ib_dev.phys_port_cnt; ++i ) {
+
+		unsigned int port = i + 1;
+
+		/* allocate pkey cache */
+		mdev->pkey_cache[ i ] = pkey_cache = kmalloc(sizeof *pkey_cache
+				+ mdev->limits.pkey_table_len * sizeof(u16), GFP_KERNEL);
+		if ( ! pkey_cache )
+			goto out;
+
+		rwlock_init(&pkey_cache->lock);
+
+		/* populate pkey table */
+		pkey_cache->table_len = mdev->limits.pkey_table_len;
+		for (offset = 0; offset < pkey_cache->table_len;
+				offset += IB_SMP_NUM_PKEY_ENTRIES) {
+
+			memset(in_mad, 0, sizeof *in_mad);
+			init_query_mad(in_mad);
+			in_mad->attr_id  = IB_SMP_ATTR_PKEY_TABLE;
+			in_mad->attr_mod = cpu_to_be32( offset / IB_SMP_NUM_PKEY_ENTRIES);
+
+			err = mthca_MAD_IFC(mdev, 1, 1,
+				    port, NULL, NULL, in_mad, out_mad,
+				    &status);
+
+			if (err || status)
+				break;
+
+			mthca_cache_update_table(mdev,
+					pkey_cache->table, pkey_cache->table_len * sizeof (u16),
+					out_mad->data, IB_SMP_NUM_PKEY_ENTRIES * sizeof (u16),
+					offset * sizeof (u16));
+		}
+
+		/* allocate gid cache */
+		mdev->gid_cache[ i ] = gid_cache = kmalloc(sizeof *gid_cache
+				+ mdev->limits.gid_table_len * sizeof(u64), GFP_KERNEL);
+		if ( !gid_cache )
+			goto out;
+
+		rwlock_init(&gid_cache->lock);
+
+		/* populate guid table */
+		gid_cache->table_len = mdev->limits.gid_table_len;
+		for (offset = 0; offset < gid_cache->table_len;
+				offset += IB_SMP_NUM_GUID_ENTRIES) {
+
+			memset(in_mad, 0, sizeof *in_mad);
+			init_query_mad(in_mad);
+			in_mad->attr_id  = IB_SMP_ATTR_GUID_INFO;
+			in_mad->attr_mod = cpu_to_be32( offset / IB_SMP_NUM_GUID_ENTRIES);
+
+			err = mthca_MAD_IFC(mdev, 1, 1,
+				    port, NULL, NULL, in_mad, out_mad,
+				    &status);
+
+			if (err || status)
+				break;
+
+			mthca_cache_update_table(mdev,
+					gid_cache->guid_table, gid_cache->table_len * sizeof (u64),
+					out_mad->data, IB_SMP_NUM_GUID_ENTRIES * sizeof (u64),
+					offset * sizeof (u64));
+		}
+
+		/* read gid prefix */
+		init_query_mad(in_mad);
+		in_mad->attr_id  = IB_SMP_ATTR_PORT_INFO;
+		in_mad->attr_mod = cpu_to_be32(port);
+
+		err = mthca_MAD_IFC(mdev, 1, 1,
+			    port, NULL, NULL, in_mad, out_mad,
+			    &status);
+
+		if (err || status)
+			continue;
+
+		mdev->gid_cache[ i ]->gid_prefix = *(u64*)(out_mad->data + 8);
+	}
+
+out:
+	kfree(in_mad);
+	kfree(out_mad);
+	return err;
+}
+
+/*
+ * Destroy the cache
+ */
+static void mthca_cache_destroy(struct mthca_dev *mdev)
+{
+	int i;
+	for ( i = 0; i < mdev->ib_dev.phys_port_cnt; ++i ) {
+		kfree( mdev->pkey_cache[ i ] );
+		kfree( mdev->gid_cache[ i ] );
+	}
+}
+
 int mthca_register_device(struct mthca_dev *dev)
 {
 	int ret;
@@ -1365,6 +1505,12 @@ int mthca_register_device(struct mthca_d
 
 	mutex_init(&dev->cap_mask_mutex);
 
+	ret = mthca_cache_init(dev);
+	if (ret) {
+		mthca_cache_destroy(dev);
+		return ret;
+	}
+
 	ret = ib_register_device(&dev->ib_dev);
 	if (ret)
 		return ret;
@@ -1387,4 +1533,6 @@ void mthca_unregister_device(struct mthc
 {
 	mthca_stop_catas_poll(dev);
 	ib_unregister_device(&dev->ib_dev);
+	mthca_cache_destroy(dev);
 }
+
Index: b/include/rdma/ib_smi.h
===================================================================
--- a/include/rdma/ib_smi.h	2007-05-07 14:28:47.576320426 +0300
+++ b/include/rdma/ib_smi.h	2007-05-07 14:28:55.367929269 +0300
@@ -42,7 +42,9 @@
 #include <rdma/ib_mad.h>
 
 #define IB_SMP_DATA_SIZE			64
-#define IB_SMP_MAX_PATH_HOPS			64
+#define IB_SMP_MAX_PATH_HOPS		64
+#define IB_SMP_NUM_PKEY_ENTRIES		32
+#define IB_SMP_NUM_GUID_ENTRIES		8
 
 struct ib_smp {
 	u8	base_version;
Index: b/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-07 14:28:47.575320605 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-07 14:28:55.369928912 +0300
@@ -41,7 +41,6 @@
 #include <asm/io.h>
 
 #include <rdma/ib_verbs.h>
-#include <rdma/ib_cache.h>
 #include <rdma/ib_pack.h>
 
 #include "mthca_dev.h"
@@ -1485,10 +1484,10 @@ static int build_mlx_header(struct mthca
 		sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE;
 	sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED);
 	if (!sqp->qp.ibqp.qp_num)
-		ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port,
+		dev->ib_dev.query_pkey(&dev->ib_dev, sqp->qp.port,
 				   sqp->pkey_index, &pkey);
 	else
-		ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port,
+		dev->ib_dev.query_pkey(&dev->ib_dev, sqp->qp.port,
 				   wr->wr.ud.pkey_index, &pkey);
 	sqp->ud_header.bth.pkey = cpu_to_be16(pkey);
 	sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn);
Index: b/drivers/infiniband/hw/mthca/mthca_av.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_av.c	2007-05-07 14:28:47.575320605 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_av.c	2007-05-07 14:28:55.369928912 +0300
@@ -37,7 +37,6 @@
 #include <linux/slab.h>
 
 #include <rdma/ib_verbs.h>
-#include <rdma/ib_cache.h>
 
 #include "mthca_dev.h"
 
@@ -279,7 +278,7 @@ int mthca_read_ah(struct mthca_dev *dev,
 			(be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff;
 		header->grh.flow_label    =
 			ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff);
-		ib_get_cached_gid(&dev->ib_dev,
+		dev->ib_dev.query_gid(&dev->ib_dev,
 				  be32_to_cpu(ah->av->port_pd) >> 24,
 				  ah->av->gid_index % dev->limits.gid_table_len,
 				  &header->grh.source_gid);


From yosefe at voltaire.com  Mon May  7 05:59:23 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 15:59:23 +0300
Subject: [ofa-general] [PATCH 5/6 v2] fix pkey change handling and remove the
	cahce
In-Reply-To: <463F2121.5080803@voltaire.com>
References: <463F2121.5080803@voltaire.com>
Message-ID: <463F22AB.3090704@voltaire.com>

mad: cache port lmc

* Instead of using the ib cache, the mad core keeps an up-to-date
  lmc for each port inside the port_priv struct. It is updated by
  incoming PORT_INFO mads.
* Use the uncached version of "query gid".
  This query will be cache-optimized at the provider level.
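
The lmc handling described above can be sketched in isolation as follows.
This is a minimal user-space illustration, not code from the patch: the
helper names lmc_from_port_info() and path_bits_match() are hypothetical,
and the byte offset (34) and mask (0x7) are simply the values the patch
itself uses when it snoops PORT_INFO.

```c
#include <stdint.h>

/*
 * Hypothetical helper: extract the LMC from the data area of a
 * PortInfo SMP.  Offset 34 and mask 0x7 mirror the values used in
 * the patch; they are not re-derived from the IB spec here.
 */
static inline uint8_t lmc_from_port_info(const uint8_t *smp_data)
{
	return smp_data[34] & 0x7;
}

/*
 * Hypothetical helper: nonzero when the path bits of the two LIDs
 * agree in the low 'lmc' bits.  With lmc == 0 there are no path
 * bits, so every pair matches.
 */
static inline int path_bits_match(uint8_t lmc, uint8_t src_path_bits,
				  uint8_t dlid_path_bits)
{
	return !lmc ||
	       !((src_path_bits ^ dlid_path_bits) & ((1 << lmc) - 1));
}
```

The comparison is the same expression rcv_has_same_gid() evaluates once
it has the lmc value in hand; only the source of the lmc changes.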

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/mad.c      |   31 +++++++++++++++++++++++++++----
 drivers/infiniband/core/mad_priv.h |    1 +
 2 files changed, 28 insertions(+), 4 deletions(-)

Index: b/drivers/infiniband/core/mad.c
===================================================================
--- a/drivers/infiniband/core/mad.c	2007-05-07 14:31:49.304874864 +0300
+++ b/drivers/infiniband/core/mad.c	2007-05-07 14:31:59.320086832 +0300
@@ -34,7 +34,6 @@
  * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $
  */
 #include <linux/dma-mapping.h>
-#include <rdma/ib_cache.h>
 
 #include "mad_priv.h"
 #include "mad_rmpp.h"
@@ -1707,13 +1706,12 @@ static inline int rcv_has_same_gid(struc
 	if (!send_resp && rcv_resp) {
 		/* is request/response. */
 		if (!(attr.ah_flags & IB_AH_GRH)) {
-			if (ib_get_cached_lmc(device, port_num, &lmc))
-				return 0;
+			lmc = atomic_read(&mad_agent_priv->qp_info->port_priv->port_lmc);
 			return (!lmc || !((attr.src_path_bits ^
 					   rwc->wc->dlid_path_bits) &
 					  ((1 << lmc) - 1)));
 		} else {
-			if (ib_get_cached_gid(device, port_num,
+			if (ib_query_gid(device, port_num,
 					      attr.grh.sgid_index, &sgid))
 				return 0;
 			return !memcmp(sgid.raw, rwc->recv_buf.grh->dgid.raw,
@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str
 	recv->header.recv_wc.recv_buf.mad = &recv->mad.mad;
 	recv->header.recv_wc.recv_buf.grh = &recv->grh;
 
+	/* update our lmc cache with port info smps */
+	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
+	     recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
+	    && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
+		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
+	{
+		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);
+	}
+
 	if (atomic_read(&qp_info->snoop_count))
 		snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS);
 
@@ -2747,6 +2754,7 @@ static int ib_mad_port_open(struct ib_de
 {
 	int ret, cq_size;
 	struct ib_mad_port_private *port_priv;
+	struct ib_port_attr *port_attr;
 	unsigned long flags;
 	char name[sizeof "ib_mad123"];
 
@@ -2764,6 +2772,19 @@ static int ib_mad_port_open(struct ib_de
 	init_mad_qp(port_priv, &port_priv->qp_info[0]);
 	init_mad_qp(port_priv, &port_priv->qp_info[1]);
 
+	port_attr = kmalloc(sizeof *port_attr, GFP_KERNEL);
+	if (!port_attr) {
+		printk(KERN_ERR PFX "No memory for ib_port_attr\n");
+		return -ENOMEM;
+	}
+
+	if (ib_query_port(device, port_num, port_attr)) {
+		printk(KERN_ERR PFX "Couldn't query port %d\n", port_num);
+		ret = -EINVAL;
+		goto error2;
+	}
+	atomic_set(&port_priv->port_lmc, port_attr->lmc);
+
 	cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2;
 	port_priv->cq = ib_create_cq(port_priv->device,
 				     ib_mad_thread_completion_handler,
@@ -2834,6 +2855,8 @@ error4:
 	cleanup_recv_queue(&port_priv->qp_info[1]);
 	cleanup_recv_queue(&port_priv->qp_info[0]);
 error3:
+	kfree(port_attr);
+error2:
 	kfree(port_priv);
 
 	return ret;
Index: b/drivers/infiniband/core/mad_priv.h
===================================================================
--- a/drivers/infiniband/core/mad_priv.h	2007-05-07 14:32:34.000000000 +0300
+++ b/drivers/infiniband/core/mad_priv.h	2007-05-07 14:33:28.856102158 +0300
@@ -200,6 +200,7 @@ struct ib_mad_port_private {
 	struct list_head port_list;
 	struct ib_device *device;
 	int port_num;
+	atomic_t port_lmc;
 	struct ib_cq *cq;
 	struct ib_pd *pd;
 	struct ib_mr *mr;


From yosefe at voltaire.com  Mon May  7 06:00:24 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 16:00:24 +0300
Subject: [ofa-general] [PATCH 6/6 v2] fix pkey change handling and remove the
	cahce
In-Reply-To: <463F2121.5080803@voltaire.com>
References: <463F2121.5080803@voltaire.com>
Message-ID: <463F22E8.4020406@voltaire.com>

core: remove the cache

* Remove the core cache completely.

This patch depends on the previous patches in the series, which remove
the uses of this cache from: core, ipoib, srp, mthca, mad

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/cache.c     |  398 ------------------------------------
 include/rdma/ib_cache.h             |  118 ----------
 drivers/infiniband/core/Makefile    |    2 
 drivers/infiniband/core/core_priv.h |    3 
 drivers/infiniband/core/device.c    |    7 
 5 files changed, 1 insertion(+), 527 deletions(-)

Index: b/drivers/infiniband/core/Makefile
===================================================================
--- a/drivers/infiniband/core/Makefile	2007-05-06 09:33:50.000000000 +0300
+++ b/drivers/infiniband/core/Makefile	2007-05-07 14:28:56.187782888 +0300
@@ -8,7 +8,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	
 					$(user_access-y)
 
 ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
-				device.o fmr_pool.o cache.o
+				device.o fmr_pool.o
 
 ib_mad-y :=			mad.o smi.o agent.o mad_rmpp.o
 
Index: b/drivers/infiniband/core/core_priv.h
===================================================================
--- a/drivers/infiniband/core/core_priv.h	2007-05-06 09:33:50.000000000 +0300
+++ b/drivers/infiniband/core/core_priv.h	2007-05-07 14:28:56.188782710 +0300
@@ -46,7 +46,4 @@ void ib_device_unregister_sysfs(struct i
 int  ib_sysfs_setup(void);
 void ib_sysfs_cleanup(void);
 
-int  ib_cache_setup(void);
-void ib_cache_cleanup(void);
-
 #endif /* _CORE_PRIV_H */
Index: b/include/rdma/ib_cache.h
===================================================================
--- a/include/rdma/ib_cache.h	2007-05-06 09:33:50.000000000 +0300
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,118 +0,0 @@
-/*
- * Copyright (c) 2004 Topspin Communications.  All rights reserved.
- * Copyright (c) 2005 Intel Corporation. All rights reserved.
- * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses.  You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- *
- * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $
- */
-
-#ifndef _IB_CACHE_H
-#define _IB_CACHE_H
-
-#include <rdma/ib_verbs.h>
-
-/**
- * ib_get_cached_gid - Returns a cached GID table entry
- * @device: The device to query.
- * @port_num: The port number of the device to query.
- * @index: The index into the cached GID table to query.
- * @gid: The GID value found at the specified index.
- *
- * ib_get_cached_gid() fetches the specified GID table entry stored in
- * the local software cache.
- */
-int ib_get_cached_gid(struct ib_device    *device,
-		      u8                   port_num,
-		      int                  index,
-		      union ib_gid        *gid);
-
-/**
- * ib_find_cached_gid - Returns the port number and GID table index where
- *   a specified GID value occurs.
- * @device: The device to query.
- * @gid: The GID value to search for.
- * @port_num: The port number of the device where the GID value was found.
- * @index: The index into the cached GID table where the GID was found.  This
- *   parameter may be NULL.
- *
- * ib_find_cached_gid() searches for the specified GID value in
- * the local software cache.
- */
-int ib_find_cached_gid(struct ib_device *device,
-		       union ib_gid	*gid,
-		       u8               *port_num,
-		       u16              *index);
-
-/**
- * ib_get_cached_pkey - Returns a cached PKey table entry
- * @device: The device to query.
- * @port_num: The port number of the device to query.
- * @index: The index into the cached PKey table to query.
- * @pkey: The PKey value found at the specified index.
- *
- * ib_get_cached_pkey() fetches the specified PKey table entry stored in
- * the local software cache.
- */
-int ib_get_cached_pkey(struct ib_device    *device_handle,
-		       u8                   port_num,
-		       int                  index,
-		       u16                 *pkey);
-
-/**
- * ib_find_cached_pkey - Returns the PKey table index where a specified
- *   PKey value occurs.
- * @device: The device to query.
- * @port_num: The port number of the device to search for the PKey.
- * @pkey: The PKey value to search for.
- * @index: The index into the cached PKey table where the PKey was found.
- *
- * ib_find_cached_pkey() searches the specified PKey table in
- * the local software cache.
- */
-int ib_find_cached_pkey(struct ib_device    *device,
-			u8                   port_num,
-			u16                  pkey,
-			u16                 *index);
-
-/**
- * ib_get_cached_lmc - Returns a cached lmc table entry
- * @device: The device to query.
- * @port_num: The port number of the device to query.
- * @lmc: The lmc value for the specified port for that device.
- *
- * ib_get_cached_lmc() fetches the specified lmc table entry stored in
- * the local software cache.
- */
-int ib_get_cached_lmc(struct ib_device *device,
-		      u8                port_num,
-		      u8                *lmc);
-
-#endif /* _IB_CACHE_H */
Index: b/drivers/infiniband/core/cache.c
===================================================================
--- a/drivers/infiniband/core/cache.c	2007-05-06 09:33:50.000000000 +0300
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,398 +0,0 @@
-/*
- * Copyright (c) 2004 Topspin Communications.  All rights reserved.
- * Copyright (c) 2005 Intel Corporation. All rights reserved.
- * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
- * Copyright (c) 2005 Voltaire, Inc. All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses.  You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- *
- * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $
- */
-
-#include <linux/module.h>
-#include <linux/errno.h>
-#include <linux/slab.h>
-
-#include <rdma/ib_cache.h>
-
-#include "core_priv.h"
-
-struct ib_pkey_cache {
-	int             table_len;
-	u16             table[0];
-};
-
-struct ib_gid_cache {
-	int             table_len;
-	union ib_gid    table[0];
-};
-
-struct ib_update_work {
-	struct work_struct work;
-	struct ib_device  *device;
-	u8                 port_num;
-};
-
-static inline int start_port(struct ib_device *device)
-{
-	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
-}
-
-static inline int end_port(struct ib_device *device)
-{
-	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
-		0 : device->phys_port_cnt;
-}
-
-int ib_get_cached_gid(struct ib_device *device,
-		      u8                port_num,
-		      int               index,
-		      union ib_gid     *gid)
-{
-	struct ib_gid_cache *cache;
-	unsigned long flags;
-	int ret = 0;
-
-	if (port_num < start_port(device) || port_num > end_port(device))
-		return -EINVAL;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-
-	cache = device->cache.gid_cache[port_num - start_port(device)];
-
-	if (index < 0 || index >= cache->table_len)
-		ret = -EINVAL;
-	else
-		*gid = cache->table[index];
-
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_get_cached_gid);
-
-int ib_find_cached_gid(struct ib_device *device,
-		       union ib_gid	*gid,
-		       u8               *port_num,
-		       u16              *index)
-{
-	struct ib_gid_cache *cache;
-	unsigned long flags;
-	int p, i;
-	int ret = -ENOENT;
-
-	*port_num = -1;
-	if (index)
-		*index = -1;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-
-	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
-		cache = device->cache.gid_cache[p];
-		for (i = 0; i < cache->table_len; ++i) {
-			if (!memcmp(gid, &cache->table[i], sizeof *gid)) {
-				*port_num = p + start_port(device);
-				if (index)
-					*index = i;
-				ret = 0;
-				goto found;
-			}
-		}
-	}
-found:
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_find_cached_gid);
-
-int ib_get_cached_pkey(struct ib_device *device,
-		       u8                port_num,
-		       int               index,
-		       u16              *pkey)
-{
-	struct ib_pkey_cache *cache;
-	unsigned long flags;
-	int ret = 0;
-
-	if (port_num < start_port(device) || port_num > end_port(device))
-		return -EINVAL;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-
-	cache = device->cache.pkey_cache[port_num - start_port(device)];
-
-	if (index < 0 || index >= cache->table_len)
-		ret = -EINVAL;
-	else
-		*pkey = cache->table[index];
-
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_get_cached_pkey);
-
-int ib_find_cached_pkey(struct ib_device *device,
-			u8                port_num,
-			u16               pkey,
-			u16              *index)
-{
-	struct ib_pkey_cache *cache;
-	unsigned long flags;
-	int i;
-	int ret = -ENOENT;
-
-	if (port_num < start_port(device) || port_num > end_port(device))
-		return -EINVAL;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-
-	cache = device->cache.pkey_cache[port_num - start_port(device)];
-
-	*index = -1;
-
-	for (i = 0; i < cache->table_len; ++i)
-		if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) {
-			*index = i;
-			ret = 0;
-			break;
-		}
-
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_find_cached_pkey);
-
-int ib_get_cached_lmc(struct ib_device *device,
-		      u8                port_num,
-		      u8                *lmc)
-{
-	unsigned long flags;
-	int ret = 0;
-
-	if (port_num < start_port(device) || port_num > end_port(device))
-		return -EINVAL;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-	*lmc = device->cache.lmc_cache[port_num - start_port(device)];
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_get_cached_lmc);
-
-static void ib_cache_update(struct ib_device *device,
-			    u8                port)
-{
-	struct ib_port_attr       *tprops = NULL;
-	struct ib_pkey_cache      *pkey_cache = NULL, *old_pkey_cache;
-	struct ib_gid_cache       *gid_cache = NULL, *old_gid_cache;
-	int                        i;
-	int                        ret;
-
-	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
-	if (!tprops)
-		return;
-
-	ret = ib_query_port(device, port, tprops);
-	if (ret) {
-		printk(KERN_WARNING "ib_query_port failed (%d) for %s\n",
-		       ret, device->name);
-		goto err;
-	}
-
-	pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
-			     sizeof *pkey_cache->table, GFP_KERNEL);
-	if (!pkey_cache)
-		goto err;
-
-	pkey_cache->table_len = tprops->pkey_tbl_len;
-
-	gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len *
-			    sizeof *gid_cache->table, GFP_KERNEL);
-	if (!gid_cache)
-		goto err;
-
-	gid_cache->table_len = tprops->gid_tbl_len;
-
-	for (i = 0; i < pkey_cache->table_len; ++i) {
-		ret = ib_query_pkey(device, port, i, pkey_cache->table + i);
-		if (ret) {
-			printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n",
-			       ret, device->name, i);
-			goto err;
-		}
-	}
-
-	for (i = 0; i < gid_cache->table_len; ++i) {
-		ret = ib_query_gid(device, port, i, gid_cache->table + i);
-		if (ret) {
-			printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n",
-			       ret, device->name, i);
-			goto err;
-		}
-	}
-
-	write_lock_irq(&device->cache.lock);
-
-	old_pkey_cache = device->cache.pkey_cache[port - start_port(device)];
-	old_gid_cache  = device->cache.gid_cache [port - start_port(device)];
-
-	device->cache.pkey_cache[port - start_port(device)] = pkey_cache;
-	device->cache.gid_cache [port - start_port(device)] = gid_cache;
-
-	device->cache.lmc_cache[port - start_port(device)] = tprops->lmc;
-
-	write_unlock_irq(&device->cache.lock);
-
-	kfree(old_pkey_cache);
-	kfree(old_gid_cache);
-	kfree(tprops);
-	return;
-
-err:
-	kfree(pkey_cache);
-	kfree(gid_cache);
-	kfree(tprops);
-}
-
-static void ib_cache_task(struct work_struct *_work)
-{
-	struct ib_update_work *work =
-		container_of(_work, struct ib_update_work, work);
-
-	ib_cache_update(work->device, work->port_num);
-	kfree(work);
-}
-
-static void ib_cache_event(struct ib_event_handler *handler,
-			   struct ib_event *event)
-{
-	struct ib_update_work *work;
-
-	if (event->event == IB_EVENT_PORT_ERR    ||
-	    event->event == IB_EVENT_PORT_ACTIVE ||
-	    event->event == IB_EVENT_LID_CHANGE  ||
-	    event->event == IB_EVENT_PKEY_CHANGE ||
-	    event->event == IB_EVENT_SM_CHANGE   ||
-	    event->event == IB_EVENT_CLIENT_REREGISTER) {
-		work = kmalloc(sizeof *work, GFP_ATOMIC);
-		if (work) {
-			INIT_WORK(&work->work, ib_cache_task);
-			work->device   = event->device;
-			work->port_num = event->element.port_num;
-			schedule_work(&work->work);
-		}
-	}
-}
-
-static void ib_cache_setup_one(struct ib_device *device)
-{
-	int p;
-
-	rwlock_init(&device->cache.lock);
-
-	device->cache.pkey_cache =
-		kmalloc(sizeof *device->cache.pkey_cache *
-			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
-	device->cache.gid_cache =
-		kmalloc(sizeof *device->cache.gid_cache *
-			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
-
-	device->cache.lmc_cache = kmalloc(sizeof *device->cache.lmc_cache *
-					  (end_port(device) -
-					   start_port(device) + 1),
-					  GFP_KERNEL);
-
-	if (!device->cache.pkey_cache || !device->cache.gid_cache ||
-	    !device->cache.lmc_cache) {
-		printk(KERN_WARNING "Couldn't allocate cache "
-		       "for %s\n", device->name);
-		goto err;
-	}
-
-	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
-		device->cache.pkey_cache[p] = NULL;
-		device->cache.gid_cache [p] = NULL;
-		ib_cache_update(device, p + start_port(device));
-	}
-
-	INIT_IB_EVENT_HANDLER(&device->cache.event_handler,
-			      device, ib_cache_event);
-	if (ib_register_event_handler(&device->cache.event_handler))
-		goto err_cache;
-
-	return;
-
-err_cache:
-	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
-		kfree(device->cache.pkey_cache[p]);
-		kfree(device->cache.gid_cache[p]);
-	}
-
-err:
-	kfree(device->cache.pkey_cache);
-	kfree(device->cache.gid_cache);
-	kfree(device->cache.lmc_cache);
-}
-
-static void ib_cache_cleanup_one(struct ib_device *device)
-{
-	int p;
-
-	ib_unregister_event_handler(&device->cache.event_handler);
-	flush_scheduled_work();
-
-	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
-		kfree(device->cache.pkey_cache[p]);
-		kfree(device->cache.gid_cache[p]);
-	}
-
-	kfree(device->cache.pkey_cache);
-	kfree(device->cache.gid_cache);
-	kfree(device->cache.lmc_cache);
-}
-
-static struct ib_client cache_client = {
-	.name   = "cache",
-	.add    = ib_cache_setup_one,
-	.remove = ib_cache_cleanup_one
-};
-
-int __init ib_cache_setup(void)
-{
-	return ib_register_client(&cache_client);
-}
-
-void __exit ib_cache_cleanup(void)
-{
-	ib_unregister_client(&cache_client);
-}
Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-07 14:28:54.229132596 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-07 14:28:56.192781996 +0300
@@ -696,18 +696,11 @@ static int __init ib_core_init(void)
 	if (ret)
 		printk(KERN_WARNING "Couldn't create InfiniBand device class\n");
 
-	ret = ib_cache_setup();
-	if (ret) {
-		printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n");
-		ib_sysfs_cleanup();
-	}
-
 	return ret;
 }
 
 static void __exit ib_core_cleanup(void)
 {
-	ib_cache_cleanup();
 	ib_sysfs_cleanup();
 }
 

From halr at voltaire.com  Mon May  7 06:13:06 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 May 2007 09:13:06 -0400
Subject: [ofa-general] Re: [PATCH TRIVIAL] opensm: remove unneeded run-time
	check
In-Reply-To: <20070506174138.GI9692@sashak.voltaire.com>
References: <20070506174138.GI9692@sashak.voltaire.com>
Message-ID: <1178543540.32222.350561.camel@hal.voltaire.com>

On Sun, 2007-05-06 at 13:41, Sasha Khapyorsky wrote:
> remove unneeded run-time NULL pointer check (followed free() is not
> under this check anyway).
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied (to master only).

-- Hal


From halr at voltaire.com  Mon May  7 06:16:21 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 May 2007 09:16:21 -0400
Subject: [ofa-general] Re: [PATCH TRIVIAL] opensm: make osm_node_destroy()
	static
In-Reply-To: <20070506174431.GJ9692@sashak.voltaire.com>
References: <20070506174138.GI9692@sashak.voltaire.com>
	<20070506174431.GJ9692@sashak.voltaire.com>
Message-ID: <1178543690.32222.350646.camel@hal.voltaire.com>

On Sun, 2007-05-06 at 13:44, Sasha Khapyorsky wrote:
> This makes locally used osm_node_destroy() function static
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied (to master only).

Doesn't the same apply to the other osm_xxx_destroy functions? If
so, shouldn't they also be made static?

-- Hal


From mst at dev.mellanox.co.il  Mon May  7 06:26:02 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 May 2007 16:26:02 +0300
Subject: [ofa-general] Re: [PATCH 1/6 v2] fix pkey change handling and remove
	the cahce
In-Reply-To: <463F218A.7030400@voltaire.com>
References: <463F2121.5080803@voltaire.com> <463F218A.7030400@voltaire.com>
Message-ID: <20070507132602.GG29350@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCH 1/6 v2] fix pkey change handling and remove the cahce
> 
> core: uncached "find gid" and "find pkey" queries
> 
> * Add ib_find_gid and ib_find_pkey over possibly blocking device
>   queries. 

Before I look into this deeper, a note on submission format: please do not
use the same subject for all patches in the series. For example, the subject
of this one should have been:

Subject: [PATCH 1/6 v2] IB/core: add uncached "find gid" and "find pkey" queries

And then there won't be a need to add '*' before the actual description.

-- 
MST


From mst at dev.mellanox.co.il  Mon May  7 06:31:40 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 May 2007 16:31:40 +0300
Subject: [ofa-general] Re: [PATCH 4/6 v2] fix pkey change handling and remove
	the cahce
In-Reply-To: <463F226E.9040700@voltaire.com>
References: <463F2121.5080803@voltaire.com> <463F226E.9040700@voltaire.com>
Message-ID: <20070507133140.GH29350@mellanox.co.il>

> @@ -139,6 +139,9 @@ static void smp_snoop(struct ib_device *
>  			event.element.port_num = port_num;
>  			ib_dispatch_event(&event);
>  		}
> +
> +		/* update cache with the incoming mad */

Please don't add such comments: the name mthca_cache_update is clear enough.

> +		mthca_cache_update(to_mdev(ibdev), port_num, mad);
>  	}
>  }

This will generate the event first, update cache after this, right?
If so, there is still a window where e.g. ipoib gets
a pkey change event, performs a query and gets a stale value.
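
The ordering concern can be shown with a small single-threaded sketch.
Everything here is hypothetical (struct port_cache, handle_pkey_set(),
record_pkey0() are illustrative names, not mthca code), and it ignores
locking entirely; it only shows that refreshing the table before
notifying consumers lets a consumer that queries from its handler see
the new value.

```c
#include <stddef.h>
#include <stdint.h>

#define PKEY_TABLE_LEN 4	/* illustrative size only */

struct port_cache {
	uint16_t pkey[PKEY_TABLE_LEN];
};

/* Consumer callback type: invoked when the pkey table has changed. */
typedef void (*event_cb)(const struct port_cache *cache, void *ctx);

/* Example consumer: re-reads pkey index 0 as soon as it is notified. */
static void record_pkey0(const struct port_cache *cache, void *ctx)
{
	*(uint16_t *)ctx = cache->pkey[0];
}

/*
 * Handle an incoming table update.  The cache is refreshed first and
 * the consumer is notified second, so a query made from the callback
 * cannot observe the old table.
 */
static void handle_pkey_set(struct port_cache *cache,
			    const uint16_t *new_table,
			    event_cb notify, void *ctx)
{
	size_t i;

	for (i = 0; i < PKEY_TABLE_LEN; ++i)
		cache->pkey[i] = new_table[i];

	notify(cache, ctx);
}
```

With the two steps swapped, record_pkey0() would store the old entry,
which is the stale read described above.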

-- 
MST


From yosefe at voltaire.com  Mon May  7 06:41:27 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 16:41:27 +0300
Subject: [ofa-general] Re: [PATCH 4/6 v2] fix pkey change handling and remove
	the cahce
In-Reply-To: <20070507133140.GH29350@mellanox.co.il>
References: <463F2121.5080803@voltaire.com> <463F226E.9040700@voltaire.com>
	<20070507133140.GH29350@mellanox.co.il>
Message-ID: <463F2C87.7090002@voltaire.com>

Michael S. Tsirkin wrote:
>>@@ -139,6 +139,9 @@ static void smp_snoop(struct ib_device *
>> 			event.element.port_num = port_num;
>> 			ib_dispatch_event(&event);
>> 		}
>>+
>>+		/* update cache with the incoming mad */
> 
> 
> Please don't add such comments: the name mthca_cache_update is clear enough.
> 
> 
>>+		mthca_cache_update(to_mdev(ibdev), port_num, mad);
>> 	}
>> }
> 
> 
> This will generate the event first, update cache after this, right?
> If so, there is still a window where e.g. ipoib gets
> a pkey change event, performs a query and gets a stale value.
> 

Right.
Did you find more issues in this patch?


From mst at dev.mellanox.co.il  Mon May  7 06:50:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 May 2007 16:50:30 +0300
Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and remove
	the cahce
In-Reply-To: <463F22AB.3090704@voltaire.com>
References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com>
Message-ID: <20070507135030.GI29350@mellanox.co.il>

> @@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str
>  	recv->header.recv_wc.recv_buf.mad = &recv->mad.mad;
>  	recv->header.recv_wc.recv_buf.grh = &recv->grh;
>  
> +	/* update our lmc cache with port info smps */
> +	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
> +	     recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
> +	    && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
> +		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
> +	{
> +		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);
> +	}
> +
>  	if (atomic_read(&qp_info->snoop_count))
>  		snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS);
>  

Why is this an atomic?
The comment does not seem to tell us anything useful. Remove it?
These 8 lines seem to violate coding style rules in at least 3 different ways :)


-- 
MST


From tziporet at dev.mellanox.co.il  Mon May  7 06:59:04 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 07 May 2007 16:59:04 +0300
Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout
In-Reply-To: <1178540310.10759.9.camel@mtls03>
References: <1178538772.10759.2.camel@mtls03>	<20070507115714.GC29350@mellanox.co.il>
	<1178540310.10759.9.camel@mtls03>
Message-ID: <463F30A8.5050005@mellanox.co.il>

Eli Cohen wrote:
> On Mon, 2007-05-07 at 15:04 +0300, Michael S. Tsirkin wrote:
>   
>> How likely is this to help in practice?
>>
>>     
> Like I said, when the system is very busy. In this case the command may
> actually complete very soon but wait_for_completion_timeout() will
> nevertheless return zero since the task did not get CPU time before the
> specified timeout expired. In this case we would like to check if done
> is signaled and thus not fail the command.
>   
This is not a theoretical issue - we actually saw this problem here with 
some tests.

Tziporet


From mst at dev.mellanox.co.il  Mon May  7 07:06:19 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 May 2007 17:06:19 +0300
Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout
In-Reply-To: <463F30A8.5050005@mellanox.co.il>
References: <1178538772.10759.2.camel@mtls03>
	<20070507115714.GC29350@mellanox.co.il>
	<1178540310.10759.9.camel@mtls03> <463F30A8.5050005@mellanox.co.il>
Message-ID: <20070507140619.GK29350@mellanox.co.il>

> Quoting Tziporet Koren <tziporet at dev.mellanox.co.il>:
> Subject: Re: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout
> 
> Eli Cohen wrote:
> >On Mon, 2007-05-07 at 15:04 +0300, Michael S. Tsirkin wrote:
> >  
> >>How likely is this to help in practice?
> >>
> >>    
> >Like I said, when the system is very busy. In this case the command may
> >actually complete very soon but wait_for_completion_timeout() will
> >nevertheless return zero since the task did not get CPU time before the
> >specified timeout expired. In this case we would like to check if done
> >is signaled and thus not fail the command.
> >  
> This is not a theoretical issue - we actually saw this problem here with 
> some tests.

I wonder whether this is applicable to mthca as well, then.

-- 
MST


From rdreier at cisco.com  Mon May  7 07:12:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 07 May 2007 07:12:47 -0700
Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout
In-Reply-To: <1178538772.10759.2.camel@mtls03> (Eli Cohen's message of "Mon,
	07 May 2007 14:52:22 +0300")
References: <1178538772.10759.2.camel@mtls03>
Message-ID: <adad51c7me8.fsf@cisco.com>

 > When the system is busy it may happen that the command actually
 > completed but it took more than the specified timeout till the
 > task executing the command was actually given CPU time. This test
 > checks that the completion is really missing before failing.

 > +	if (!wait_for_completion_timeout(&context->done, msecs_to_jiffies(timeout)))
 > +		if (!context->done.done) {
 > +			err = -EBUSY;
 > +			goto out;
 > +		}

This seems more like a bug in wait_for_completion_timeout().  Anyway,
it's definitely not OK to poke inside the definition of struct
completion in driver code, so we need to find a different way to solve
this.

BTW the same completion handling code is in mthca -- is this also a
problem there?

 - R.


From yosefe at voltaire.com  Mon May  7 07:18:51 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 17:18:51 +0300
Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and remove
	the cahce
In-Reply-To: <20070507135030.GI29350@mellanox.co.il>
References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com>
	<20070507135030.GI29350@mellanox.co.il>
Message-ID: <463F354B.8030908@voltaire.com>

Michael S. Tsirkin wrote:
>>@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str
>> 	recv->header.recv_wc.recv_buf.mad = &recv->mad.mad;
>> 	recv->header.recv_wc.recv_buf.grh = &recv->grh;
>> 
>>+	/* update our lmc cache with port info smps */
>>+	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
>>+	     recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
>>+	    && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
>>+		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
>>+	{
>>+		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);
>>+	}
>>+
>> 	if (atomic_read(&qp_info->snoop_count))
>> 		snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS);
>> 
> 
> 
> Why is this an atomic?

I thought there might be a race between this and where we read the lmc (rcv_has_same_gid)

> The comment does not seem to tell us anything useful. Remove it?
> These 8 lines seem to violate coding style rules in at least 3 different ways::)
> 
	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
		 recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
		&& (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);

is that better?


From eli at mellanox.co.il  Mon May  7 07:22:50 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 07 May 2007 17:22:50 +0300
Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout
In-Reply-To: <adad51c7me8.fsf@cisco.com>
References: <1178538772.10759.2.camel@mtls03>  <adad51c7me8.fsf@cisco.com>
Message-ID: <1178547800.10759.13.camel@mtls03>

On Mon, 2007-05-07 at 07:12 -0700, Roland Dreier wrote:
> > When the system is busy it may happen that the command actually
>  > completed but it took more than the specified timeout till the
>  > task executing the command was actually given CPU time. This test
>  > checks that the completion is really missing before failing.
> 
>  > +	if (!wait_for_completion_timeout(&context->done, msecs_to_jiffies(timeout)))
>  > +		if (!context->done.done) {
>  > +			err = -EBUSY;
>  > +			goto out;
>  > +		}
> 
> This seems more like a bug in wait_for_completion_timeout().  Anyway,
> it's definitely not OK to poke inside the definition of struct
> completion in driver code, so we need to find a different way to solve
> this.
> 
I agree that wait_for_completion_timeout() should give an indication of
this special case, but it does not. The only way we can know of such a
situation now is by poking into done or poking into the EQ, which is
worse.


> BTW the same completion handling code is in mthca -- is this also a
> problem there?
> 
We saw this with the mthca port for connectx.

>  - R.
> 


From svenar at simula.no  Mon May  7 07:42:54 2007
From: svenar at simula.no (svenar at simula.no)
Date: Mon, 7 May 2007 16:42:54 +0200 (CEST)
Subject: [ofa-general] ibdiagnet credit checks
Message-ID: <37241.192.9.112.188.1178548974.squirrel@webmail.uio.no>

Hi,

I have a question regarding ibdiagnet and credit loop checking. In debug.tcl
there seem to be two different credit checks:

# report credit loops
ibdmCalcMinHopTables $fabric
set roots [ibdmFindRootNodesByMinHop $fabric]
if {[llength $roots]} {
    inform "-I-reporting:found.roots" $roots
    ibdmReportNonUpDownCa2CaPaths $fabric $roots
} else {
    ibdmAnalyzeLoops $fabric
}

What is the difference between these two checks? From a brief inspection
of the relevant code I would guess that "ibdmReportNonUpDownCa2CaPaths"
checks the routing table for violations of the UpDown rule, while
"ibdmAnalyzeLoops" checks the routing table for cyclic dependencies. Is
this correct?

Best regards,
Sven-Arne


From halr at voltaire.com  Mon May  7 07:47:28 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 May 2007 10:47:28 -0400
Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and
	remove the cahce
In-Reply-To: <463F354B.8030908@voltaire.com>
References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com>
	<20070507135030.GI29350@mellanox.co.il> <463F354B.8030908@voltaire.com>
Message-ID: <1178549162.32222.355374.camel@hal.voltaire.com>

Hi Yosef,

On Mon, 2007-05-07 at 10:18, Yosef Etigin wrote:
> Michael S. Tsirkin wrote:
> >>@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str
> >> 	recv->header.recv_wc.recv_buf.mad = &recv->mad.mad;
> >> 	recv->header.recv_wc.recv_buf.grh = &recv->grh;
> >> 
> >>+	/* update our lmc cache with port info smps */
> >>+	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
> >>+	     recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
> >>+	    && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
> >>+		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
> >>+	{
> >>+		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);
> >>+	}
> >>+
> >> 	if (atomic_read(&qp_info->snoop_count))
> >> 		snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS);
> >> 
> > 
> > 
> > Why is this an atomic?
> 
> I thought there might be a race between this and where we read the lmc (rcv_has_same_gid)
> 
> > The comment does not seem to tell us anything useful. Remove it?
> > These 8 lines seem to violate coding style rules in at least 3 different ways::)
> > 
> 	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
> 		 recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
> 		&& (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
> 		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
> 		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);

Should at least a #define be used for smp.data[34], if not a struct, so it
is clearer what is going on here?

I haven't yet had a chance to look at the rest of the patch.

-- Hal

> is that better?
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From mst at dev.mellanox.co.il  Mon May  7 07:53:20 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 May 2007 17:53:20 +0300
Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout
In-Reply-To: <adad51c7me8.fsf@cisco.com>
References: <1178538772.10759.2.camel@mtls03> <adad51c7me8.fsf@cisco.com>
Message-ID: <20070507145320.GA15275@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout
> 
>  > When the system is busy it may happen that the command actually
>  > completed but it took more than the specified timeout till the
>  > task executing the command was actually given CPU time. This test
>  > checks that the completion is really missing before failing.
> 
>  > +	if (!wait_for_completion_timeout(&context->done, msecs_to_jiffies(timeout)))
>  > +		if (!context->done.done) {
>  > +			err = -EBUSY;
>  > +			goto out;
>  > +		}
> 
> This seems more like a bug in wait_for_completion_timeout().  Anyway,
> it's definitely not OK to poke inside the definition of struct
> completion in driver code, so we need to find a different way to solve
> this.
> 
> BTW the same completion handling code is in mthca -- is this also a
> problem there?

Google gave me this:
http://lkml.org/lkml/2007/3/1/156
so it seems a similar problem was observed in mthca.

Thomas, Ingo, I think you were the ones to propose
wait_for_completion_timeout()/wait_for_completion_interruptible_timeout():
would it make sense to change these functions to return -ETIMEDOUT on
timeout, 0 on success? No one seems to use the actual timeout value,
as far as I can see.
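
For illustration, the proposed return convention can be modeled in user
space roughly as follows. This is a sketch, not kernel code: struct
completion_model, wait_model() and the two sleep helpers are hypothetical
names, and the sleep callback merely stands in for schedule_timeout(). It
assumes that the completion is consumed only on the success path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* User-space stand-in for struct completion; only the counter is modeled. */
struct completion_model {
	unsigned int done;
};

/*
 * Stand-in for schedule_timeout(): returns the remaining timeout and may
 * set x->done to model a completion that arrives while the task sleeps.
 */
typedef unsigned long (*sleep_fn)(struct completion_model *x,
				  unsigned long timeout);

/* Returns 0 if the completion was seen, -ETIMEDOUT otherwise. */
static int wait_model(struct completion_model *x, unsigned long timeout,
		      sleep_fn do_sleep)
{
	assert(x != NULL && do_sleep != NULL);

	while (!x->done) {
		timeout = do_sleep(x, timeout);
		/* Check done first: a late wake-up must not mask success. */
		if (x->done)
			break;
		if (!timeout)
			return -ETIMEDOUT;
	}
	x->done--;	/* consume the completion only on success */
	return 0;
}

/* Busy system: the full timeout elapses, yet the command did complete. */
static unsigned long sleep_completes_late(struct completion_model *x,
					  unsigned long timeout)
{
	(void)timeout;
	x->done = 1;
	return 0;
}

/* The command never completes. */
static unsigned long sleep_never_completes(struct completion_model *x,
					   unsigned long timeout)
{
	(void)x;
	(void)timeout;
	return 0;
}
```

With this ordering, a waiter that wakes up only after the timeout has
expired still reports success if the completion was signaled in the
meantime, so the caller never has to look inside the completion itself.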

Would something like the following untested patch make sense?

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

--

diff --git a/include/linux/completion.h b/include/linux/completion.h
index 268c5a4..84360c8 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -44,11 +44,10 @@ static inline void init_completion(struct completion *x)
 
 extern void FASTCALL(wait_for_completion(struct completion *));
 extern int FASTCALL(wait_for_completion_interruptible(struct completion *x));
-extern unsigned long FASTCALL(wait_for_completion_timeout(struct completion *x,
-						   unsigned long timeout));
-extern unsigned long FASTCALL(wait_for_completion_interruptible_timeout(
-			struct completion *x, unsigned long timeout));
-
+extern int FASTCALL(wait_for_completion_timeout(struct completion *x,
+						unsigned long timeout));
+extern int FASTCALL(wait_for_completion_interruptible_timeout(struct completion *x,
+							      unsigned long timeout));
 extern void FASTCALL(complete(struct completion *));
 extern void FASTCALL(complete_all(struct completion *));
 
diff --git a/kernel/sched.c b/kernel/sched.c
index b9a6837..5ee3df6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3661,9 +3661,10 @@ void fastcall __sched wait_for_completion(struct completion *x)
 }
 EXPORT_SYMBOL(wait_for_completion);
 
-unsigned long fastcall __sched
+int fastcall __sched
 wait_for_completion_timeout(struct completion *x, unsigned long timeout)
 {
+	int ret = 0;
 	might_sleep();
 
 	spin_lock_irq(&x->wait.lock);
@@ -3672,22 +3673,24 @@ wait_for_completion_timeout(struct completion *x, unsigned long timeout)
 
 		wait.flags |= WQ_FLAG_EXCLUSIVE;
 		__add_wait_queue_tail(&x->wait, &wait);
-		do {
+		for (;;) {
 			__set_current_state(TASK_UNINTERRUPTIBLE);
 			spin_unlock_irq(&x->wait.lock);
 			timeout = schedule_timeout(timeout);
 			spin_lock_irq(&x->wait.lock);
+			if (x->done)
+				break;
 			if (!timeout) {
-				__remove_wait_queue(&x->wait, &wait);
-				goto out;
+				ret = -ETIMEDOUT;
+				break;
 			}
-		} while (!x->done);
+		}
 		__remove_wait_queue(&x->wait, &wait);
 	}
 	x->done--;
 out:
 	spin_unlock_irq(&x->wait.lock);
-	return timeout;
+	return ret;
 }
 EXPORT_SYMBOL(wait_for_completion_timeout);
 
@@ -3724,10 +3727,12 @@ out:
 }
 EXPORT_SYMBOL(wait_for_completion_interruptible);
 
-unsigned long fastcall __sched
+int fastcall __sched
 wait_for_completion_interruptible_timeout(struct completion *x,
 					  unsigned long timeout)
 {
+	int ret = 0;
+
 	might_sleep();
 
 	spin_lock_irq(&x->wait.lock);
@@ -3736,7 +3741,7 @@ wait_for_completion_interruptible_timeout(struct completion *x,
 
 		wait.flags |= WQ_FLAG_EXCLUSIVE;
 		__add_wait_queue_tail(&x->wait, &wait);
-		do {
+		for (;;) {
 			if (signal_pending(current)) {
 				timeout = -ERESTARTSYS;
 				__remove_wait_queue(&x->wait, &wait);
@@ -3746,17 +3751,19 @@ wait_for_completion_interruptible_timeout(struct completion *x,
 			spin_unlock_irq(&x->wait.lock);
 			timeout = schedule_timeout(timeout);
 			spin_lock_irq(&x->wait.lock);
+			if (x->done)
+				break;
 			if (!timeout) {
-				__remove_wait_queue(&x->wait, &wait);
-				goto out;
+				ret = -ETIMEDOUT;
+				break;
 			}
-		} while (!x->done);
+		}
 		__remove_wait_queue(&x->wait, &wait);
 	}
 	x->done--;
 out:
 	spin_unlock_irq(&x->wait.lock);
-	return timeout;
+	return ret;
 }
 EXPORT_SYMBOL(wait_for_completion_interruptible_timeout);
 

-- 
MST


From mst at dev.mellanox.co.il  Mon May  7 07:56:53 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 May 2007 17:56:53 +0300
Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and remove
	the cahce
In-Reply-To: <463F354B.8030908@voltaire.com>
References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com>
	<20070507135030.GI29350@mellanox.co.il>
	<463F354B.8030908@voltaire.com>
Message-ID: <20070507145653.GB15275@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: [PATCH 5/6 v2] fix pkey change handling and remove the cahce
> 
> Michael S. Tsirkin wrote:
> >>@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str
> >> 	recv->header.recv_wc.recv_buf.mad = &recv->mad.mad;
> >> 	recv->header.recv_wc.recv_buf.grh = &recv->grh;
> >> 
> >>+	/* update our lmc cache with port info smps */
> >>+	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
> >>+	     recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
> >>+	    && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
> >>+		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
> >>+	{
> >>+		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);
> >>+	}
> >>+
> >> 	if (atomic_read(&qp_info->snoop_count))
> >> 		snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS);
> >> 
> > 
> > 
> > Why is this an atomic?
> 
> I thought there might be a race between this and where we read the lmc (rcv_has_same_gid)

Aren't all incoming MADs on a port handled over a single threaded WQ?
And how would atomics help?

> > The comment does not seem to tell us anything useful. Remove it?
> > These 8 lines seem to violate coding style rules in at least 3 different ways::)
> > 
> 	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
> 		 recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
> 		&& (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
> 		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
> 		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);
> 
> is that better?

Move && to the end of each line, and kill the extra () around single comparisons.
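
As a rough sketch of that layout, here is the same test written as a
stand-alone predicate. The struct, the enum names and is_port_info_set()
are hypothetical stand-ins so that the fragment compiles outside the
kernel; attr_id is compared in host byte order here, which is a
simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in values and a simplified header, for illustration only. */
enum {
	MGMT_CLASS_SUBN_LID_ROUTED	= 0x01,
	MGMT_CLASS_SUBN_DIRECTED_ROUTE	= 0x81,
	MGMT_METHOD_SET			= 0x02,
	SMP_ATTR_PORT_INFO		= 0x0015,	/* host byte order here */
};

struct mad_hdr_model {
	uint8_t  mgmt_class;
	uint8_t  method;
	uint16_t attr_id;
};

/* True for a PortInfo Set carried by either subnet management class. */
static bool is_port_info_set(const struct mad_hdr_model *hdr)
{
	assert(hdr != NULL);

	return (hdr->mgmt_class == MGMT_CLASS_SUBN_LID_ROUTED ||
		hdr->mgmt_class == MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
	       hdr->attr_id == SMP_ATTR_PORT_INFO &&
	       hdr->method == MGMT_METHOD_SET;
}
```

Only the || group keeps its parentheses, since it is the one place where
operator precedence would otherwise change the meaning.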

-- 
MST


From yosefe at voltaire.com  Mon May  7 08:09:46 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 18:09:46 +0300
Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and remove
	the cahce
In-Reply-To: <20070507145653.GB15275@mellanox.co.il>
References: <463F2121.5080803@voltaire.com>
	<463F22AB.3090704@voltaire.com>	<20070507135030.GI29350@mellanox.co.il>	<463F354B.8030908@voltaire.com>
	<20070507145653.GB15275@mellanox.co.il>
Message-ID: <463F413A.5020308@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>Subject: Re: [PATCH 5/6 v2] fix pkey change handling and remove the cahce
>>
>>Michael S. Tsirkin wrote:
>>
>>>>@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str
>>>>	recv->header.recv_wc.recv_buf.mad = &recv->mad.mad;
>>>>	recv->header.recv_wc.recv_buf.grh = &recv->grh;
>>>>
>>>>+	/* update our lmc cache with port info smps */
>>>>+	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
>>>>+	     recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
>>>>+	    && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
>>>>+		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
>>>>+	{
>>>>+		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);
>>>>+	}
>>>>+
>>>>	if (atomic_read(&qp_info->snoop_count))
>>>>		snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS);
>>>>
>>>
>>>
>>>Why is this an atomic?
>>
>>I thought there might be a race between this and where we read the lmc (rcv_has_same_gid)
> 
> 
> Aren't all incoming MADs on a port handled over a single threaded WQ?
> And how would atomics help?
> 
Yes. It is not atomic any more.

> 
>>>The comment does not seem to tell us anything useful. Remove it?
>>>These 8 lines seem to violate coding style rules in at least 3 different ways::)
>>>
>>
>>	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
>>		 recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
>>		&& (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
>>		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
>>		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);
>>
>>is that better?
> 
> 
> Move && to the end of each line, and kill the extra () around single comparisons.
> 

ok.


From yosefe at voltaire.com  Mon May  7 08:12:19 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 18:12:19 +0300
Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and
	remove the cahce
In-Reply-To: <1178549162.32222.355374.camel@hal.voltaire.com>
References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com>	
	<20070507135030.GI29350@mellanox.co.il>
	<463F354B.8030908@voltaire.com>
	<1178549162.32222.355374.camel@hal.voltaire.com>
Message-ID: <463F41D3.4050603@voltaire.com>

Hal Rosenstock wrote:
> Hi Yosef,
> 
> On Mon, 2007-05-07 at 10:18, Yosef Etigin wrote:
> 
>>Michael S. Tsirkin wrote:
>>
>>>>@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str
>>>>	recv->header.recv_wc.recv_buf.mad = &recv->mad.mad;
>>>>	recv->header.recv_wc.recv_buf.grh = &recv->grh;
>>>>
>>>>+	/* update our lmc cache with port info smps */
>>>>+	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
>>>>+	     recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
>>>>+	    && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
>>>>+		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
>>>>+	{
>>>>+		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);
>>>>+	}
>>>>+
>>>>	if (atomic_read(&qp_info->snoop_count))
>>>>		snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS);
>>>>
>>>
>>>
>>>Why is this an atomic?
>>
>>I thought there might be a race between this and where we read the lmc (rcv_has_same_gid)
>>
>>
>>>The comment does not seem to tell us anything useful. Remove it?
>>>These 8 lines seem to violate coding style rules in at least 3 different ways::)
>>>
>>
>>	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
>>		 recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
>>		&& (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
>>		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
>>		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);
> 
> 
> Should at least a #define be used for smp.data[34], if not a struct, so it
> is clearer what is going on here?
> 

You mean something like:
#define LMC_FROM_PORT_INFO(data) ( ( (char*)(data) )[34] & 0x07 ) ?
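
A typed helper with named constants is one possible alternative to the
macro. The sketch below is only an illustration: PORT_INFO_LMC_OFFSET,
PORT_INFO_LMC_MASK and port_info_lmc() are made-up names, while the offset
and mask values are the ones already used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the offset and mask are the ones from the patch. */
#define PORT_INFO_LMC_OFFSET	34
#define PORT_INFO_LMC_MASK	0x07

/* Extract the LMC bits from a raw PortInfo attribute buffer. */
static inline uint8_t port_info_lmc(const uint8_t *data)
{
	assert(data != NULL);

	return data[PORT_INFO_LMC_OFFSET] & PORT_INFO_LMC_MASK;
}
```

Compared with the macro, the function takes an unsigned byte pointer, so
the caller gets type checking and no sign extension from a plain char cast.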

> I haven't yet had a chance to look at the rest of the patch.
> 
> -- Hal
> 
> 
>>is that better?
> 
> 


From eli at mellanox.co.il  Mon May  7 08:20:33 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 07 May 2007 18:20:33 +0300
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts
Message-ID: <1178551555.17477.0.camel@mtls03>

In order to prevent losing interrupts, all EQs must be rearmed
whenever an interrupt occurs, regardless of whether that interrupt
was generated for a given EQ or not.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: connectx_kernel/drivers/net/mlx4/eq.c
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/eq.c	2007-05-07 12:32:35.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/eq.c	2007-05-07 17:33:09.000000000 +0300
@@ -266,13 +266,17 @@ static irqreturn_t mlx4_interrupt(int ir
 {
 	struct mlx4_dev *dev = dev_ptr;
 	struct mlx4_priv *priv = mlx4_priv(dev);
+	struct mlx4_eq *eq;
 	int work = 0;
 	int i;
 
 	writel(priv->eq_table.clr_mask, priv->eq_table.clr_int);
 
-	for (i = 0; i < MLX4_EQ_CATAS; ++i)
-		work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]);
+	for (i = 0; i < MLX4_EQ_CATAS; ++i) {
+		eq = &priv->eq_table.eq[i];
+		work |= mlx4_eq_int(dev, eq);
+		eq_set_ci(eq, 1);
+	}
 
 	return IRQ_RETVAL(work);
 }
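
The intent of the loop can be sketched with a small user-space model. All
names here (eq_model, eq_poll_model, interrupt_model, NUM_EQ_MODEL) are
hypothetical, and the model assumes that an EQ which is not rearmed will
not raise the next interrupt; it does not describe the actual hardware
interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_EQ_MODEL 3	/* hypothetical number of event queues */

struct eq_model {
	unsigned int pending;	/* events waiting to be consumed */
	bool armed;		/* will this EQ raise the next interrupt? */
};

/* Consume all pending events; returns nonzero if any work was done. */
static int eq_poll_model(struct eq_model *eq)
{
	int work = eq->pending != 0;

	eq->pending = 0;
	return work;
}

/* One shared interrupt: poll every EQ and rearm it even if it was idle. */
static int interrupt_model(struct eq_model *eqs, size_t n)
{
	int work = 0;
	size_t i;

	assert(eqs != NULL);

	for (i = 0; i < n; ++i) {
		work |= eq_poll_model(&eqs[i]);
		eqs[i].armed = true;
	}
	return work;
}
```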


From mst at dev.mellanox.co.il  Mon May  7 08:30:25 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 May 2007 18:30:25 +0300
Subject: [ofa-general] Re: [PATCH 3/6 v2] fix pkey change handling and remove
	the cahce
In-Reply-To: <463F2237.7050809@voltaire.com>
References: <463F2121.5080803@voltaire.com> <463F2237.7050809@voltaire.com>
Message-ID: <20070507153025.GD15275@mellanox.co.il>

All in all, this patch tries to do many things at once.  I wonder whether you
can split the patch in 2: fix the pkey change case separately, and remove pkey
polling separately.

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCH 3/6 v2] fix pkey change handling and remove the cahce
> 
> ipoib: handle pkey change events
> 
> This issue was found during partitioning & SM fail over testing.
> 
>  * added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
>  * fixed a bug in device extraction from the work struct
>  * removed some warnings in case they are caused due to missing PKEY as 
> 	  this seems like a valid flow now.

This seems to remove a useful tool for debugging invalid pkeys.
Why is this a valid flow now?

>  * Assume that the cache is coherent - do not retry on cache queries
>  * Restart child interfaces first before parent

Why? Is this related to pkey change somehow?

>  * Remove the pkey polling thread and pkey delayed initiallization
>  * If an interface is brought up but pkey is not found, mark it with
>    IPOIB_PKEY_NEEDED and when a pkey event arrives, try to restart it.
> 
> SM reconfiguration or failover possibly causes a shuffling of the values in the port
> pkey table. The current implementation only queries for the index of the pkey once,
> when it creates the device QP and after that moves it into working state, and hence
> does not address this scenario. Fix this by using the PKEY_CHANGE event as a trigger
> to reconfigure the device QP.
> 
> 
> Signed-off-by: Moni Levy <monil at voltaire.com>
> Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib.h           |   10 -
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  144 ++++++++-----------------
>  drivers/infiniband/ulp/ipoib/ipoib_main.c      |   11 -
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   11 +
>  drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |   21 +--
>  5 files changed, 76 insertions(+), 121 deletions(-)
> 
> Index: b/drivers/infiniband/ulp/ipoib/ipoib.h
> ===================================================================
> --- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-06 09:26:08.000000000 +0300
> +++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-06 09:26:18.000000000 +0300
> @@ -80,7 +80,7 @@ enum {
>  	IPOIB_FLAG_INITIALIZED    = 1,
>  	IPOIB_FLAG_ADMIN_UP 	  = 2,
>  	IPOIB_PKEY_ASSIGNED 	  = 3,
> -	IPOIB_PKEY_STOP 	  = 4,
> +	IPOIB_PKEY_NEEDED		  = 4,
>  	IPOIB_FLAG_SUBINTERFACE   = 5,
>  	IPOIB_MCAST_RUN 	  = 6,
>  	IPOIB_STOP_REAPER         = 7,
> @@ -202,9 +202,9 @@ struct ipoib_dev_priv {
>  	struct list_head multicast_list;
>  	struct rb_root multicast_tree;
>  
> -	struct delayed_work pkey_task;
>  	struct delayed_work mcast_task;
>  	struct work_struct flush_task;
> +	struct work_struct pkey_task;
>  	struct work_struct restart_task;
>  	struct delayed_work ah_reap_task;
>  
> @@ -333,12 +333,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
>  
>  int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
>  void ipoib_ib_dev_flush(struct work_struct *work);
> +void ipoib_pkey_event(struct work_struct *work);
>  void ipoib_ib_dev_cleanup(struct net_device *dev);
>  
>  int ipoib_ib_dev_open(struct net_device *dev);
>  int ipoib_ib_dev_up(struct net_device *dev);
>  int ipoib_ib_dev_down(struct net_device *dev, int flush);
> -int ipoib_ib_dev_stop(struct net_device *dev);
> +int ipoib_ib_dev_stop(struct net_device *dev, int flush);
>  
>  int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
>  void ipoib_dev_cleanup(struct net_device *dev);
> @@ -384,9 +385,6 @@ void ipoib_event(struct ib_event_handler
>  int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey);
>  int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
>  
> -void ipoib_pkey_poll(struct work_struct *work);
> -int ipoib_pkey_dev_delay_open(struct net_device *dev);
> -
>  #ifdef CONFIG_INFINIBAND_IPOIB_CM
>  
>  #define IPOIB_FLAGS_RC          0x80
> Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> ===================================================================
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-06 09:26:17.000000000 +0300
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-06 09:26:26.000000000 +0300
> @@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device 
>  	ret = ipoib_ib_post_receives(dev);
>  	if (ret) {
>  		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
> -		ipoib_ib_dev_stop(dev);
> +		ipoib_ib_dev_stop(dev, 1);
>  		return -1;
>  	}
>  
>  	ret = ipoib_cm_dev_open(dev);
>  	if (ret) {
>  		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
> -		ipoib_ib_dev_stop(dev);
> +		ipoib_ib_dev_stop(dev, 1);
>  		return -1;
>  	}
>  
> @@ -441,28 +441,10 @@ int ipoib_ib_dev_open(struct net_device 
>  	return 0;
>  }
>  
> -static void ipoib_pkey_dev_check_presence(struct net_device *dev)
> -{
> -	struct ipoib_dev_priv *priv = netdev_priv(dev);
> -	u16 pkey_index = 0;
> -
> -	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index))
> -		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> -	else
> -		set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> -}
> -
>  int ipoib_ib_dev_up(struct net_device *dev)
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  
> -	ipoib_pkey_dev_check_presence(dev);
> -
> -	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
> -		ipoib_dbg(priv, "PKEY is not assigned.\n");
> -		return 0;
> -	}
> -
>  	set_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
>  
>  	return ipoib_mcast_start_thread(dev);
> @@ -477,16 +459,6 @@ int ipoib_ib_dev_down(struct net_device 
>  	clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
>  	netif_carrier_off(dev);
>  
> -	/* Shutdown the P_Key thread if still active */
> -	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
> -		mutex_lock(&pkey_mutex);
> -		set_bit(IPOIB_PKEY_STOP, &priv->flags);
> -		cancel_delayed_work(&priv->pkey_task);
> -		mutex_unlock(&pkey_mutex);
> -		if (flush)
> -			flush_workqueue(ipoib_workqueue);
> -	}
> -
>  	ipoib_mcast_stop_thread(dev, flush);
>  	ipoib_mcast_dev_flush(dev);
>  
> @@ -508,7 +480,7 @@ static int recvs_pending(struct net_devi
>  	return pending;
>  }
>  
> -int ipoib_ib_dev_stop(struct net_device *dev)
> +int ipoib_ib_dev_stop(struct net_device *dev, int flush)
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  	struct ib_qp_attr qp_attr;
> @@ -581,7 +553,8 @@ timeout:
>  	/* Wait for all AHs to be reaped */
>  	set_bit(IPOIB_STOP_REAPER, &priv->flags);
>  	cancel_delayed_work(&priv->ah_reap_task);
> -	flush_workqueue(ipoib_workqueue);
> +	if (flush)
> +		flush_workqueue(ipoib_workqueue);
>  
>  	begin = jiffies;
>  
> @@ -622,14 +595,33 @@ int ipoib_ib_dev_init(struct net_device 
>  	return 0;
>  }
>  
> -void ipoib_ib_dev_flush(struct work_struct *work)
> +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp)
>  {
> -	struct ipoib_dev_priv *cpriv, *priv =
> -		container_of(work, struct ipoib_dev_priv, flush_task);
> +	struct ipoib_dev_priv *cpriv;
>  	struct net_device *dev = priv->dev;
>  
> -	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
> -		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
> +	mutex_lock(&priv->vlan_mutex);
> +
> +	/* Flush any child interfaces */
> +	list_for_each_entry(cpriv, &priv->child_intfs, list)
> +		__ipoib_ib_dev_flush(cpriv, restart_qp);
> +
> +	mutex_unlock(&priv->vlan_mutex);
> +
> +	/*
> +	 * If the device is not initiallized since it needs a pkey -
> +	 * try to reopen it
> +	 */
> +	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
> +
> +		if (restart_qp
> +			&& test_bit(IPOIB_PKEY_NEEDED, &priv->flags)
> +		    && test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) {
> +			/* this iface needs pkey, try to bring it up */
> +			ipoib_open(priv->dev);
> +		}
> +		else
> +			ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
>  		return;
>  	}

Clean up the above please.

> @@ -642,6 +634,12 @@ void ipoib_ib_dev_flush(struct work_stru
>  
>  	ipoib_ib_dev_down(dev, 0);
>  
> +	if (restart_qp) {
> +		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) )
> +			ipoib_ib_dev_stop(dev, 0);
> +		ipoib_ib_dev_open(dev);
> +	}
> +
>  	/*
>  	 * The device could have been brought down between the start and when
>  	 * we get here, don't bring it back up if it's not configured up

I find these if (restart_qp) branches somewhat confusing.
Why is this flag tested in 2 places?

> @@ -650,14 +648,25 @@ void ipoib_ib_dev_flush(struct work_stru
>  		ipoib_ib_dev_up(dev);
>  		ipoib_mcast_restart_task(&priv->restart_task);
>  	}
> +}
>  
> -	mutex_lock(&priv->vlan_mutex);
> +void ipoib_ib_dev_flush(struct work_struct *work)
> +{
> +	struct ipoib_dev_priv *priv =
> +		container_of(work, struct ipoib_dev_priv, flush_task);
> +	/* we only restart the QP in case of pkey change event */

Kill the comment please.

> +	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
> +	__ipoib_ib_dev_flush(priv, 0);
> +}
>  
> -	/* Flush any child interfaces too */
> -	list_for_each_entry(cpriv, &priv->child_intfs, list)
> -		ipoib_ib_dev_flush(&cpriv->flush_task);
> +void ipoib_pkey_event(struct work_struct *work)
> +{
> +	struct ipoib_dev_priv *priv =
> +		container_of(work, struct ipoib_dev_priv, pkey_task);
>  
> -	mutex_unlock(&priv->vlan_mutex);
> +	/* restart the QP in case of pkey change event */
> +	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);

Kill the comment please.

> +	__ipoib_ib_dev_flush(priv, 1);
>  }
>  
>  void ipoib_ib_dev_cleanup(struct net_device *dev)
> @@ -672,54 +681,3 @@ void ipoib_ib_dev_cleanup(struct net_dev
>  	ipoib_transport_dev_cleanup(dev);
>  }
>  
> -/*
> - * Delayed P_Key Assigment Interim Support
> - *
> - * The following is initial implementation of delayed P_Key assigment
> - * mechanism. It is using the same approach implemented for the multicast
> - * group join. The single goal of this implementation is to quickly address
> - * Bug #2507. This implementation will probably be removed when the P_Key
> - * change async notification is available.
> - */
> -
> -void ipoib_pkey_poll(struct work_struct *work)
> -{
> -	struct ipoib_dev_priv *priv =
> -		container_of(work, struct ipoib_dev_priv, pkey_task.work);
> -	struct net_device *dev = priv->dev;
> -
> -	ipoib_pkey_dev_check_presence(dev);
> -
> -	if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
> -		ipoib_open(dev);
> -	else {
> -		mutex_lock(&pkey_mutex);
> -		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
> -			queue_delayed_work(ipoib_workqueue,
> -					   &priv->pkey_task,
> -					   HZ);
> -		mutex_unlock(&pkey_mutex);
> -	}
> -}
> -
> -int ipoib_pkey_dev_delay_open(struct net_device *dev)
> -{
> -	struct ipoib_dev_priv *priv = netdev_priv(dev);
> -
> -	/* Look for the interface pkey value in the IB Port P_Key table and */
> -	/* set the interface pkey assigment flag                            */
> -	ipoib_pkey_dev_check_presence(dev);
> -
> -	/* P_Key value not assigned yet - start polling */
> -	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
> -		mutex_lock(&pkey_mutex);
> -		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
> -		queue_delayed_work(ipoib_workqueue,
> -				   &priv->pkey_task,
> -				   HZ);
> -		mutex_unlock(&pkey_mutex);
> -		return 1;
> -	}
> -
> -	return 0;
> -}
>
> Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> ===================================================================
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-06 09:26:08.000000000 +0300
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-06 09:26:18.000000000 +0300
> @@ -100,14 +100,11 @@ int ipoib_open(struct net_device *dev)
>  
>  	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
>  
> -	if (ipoib_pkey_dev_delay_open(dev))
> -		return 0;
> -
>  	if (ipoib_ib_dev_open(dev))
> -		return -EINVAL;
> +		return test_bit(IPOIB_PKEY_NEEDED, &priv->flags) ? 0 : -EINVAL;
>  
>  	if (ipoib_ib_dev_up(dev)) {
> -		ipoib_ib_dev_stop(dev);
> +		ipoib_ib_dev_stop(dev, 1);
>  		return -EINVAL;
>  	}
>  
> @@ -152,7 +149,7 @@ static int ipoib_stop(struct net_device 
>  	flush_workqueue(ipoib_workqueue);
>  
>  	ipoib_ib_dev_down(dev, 1);
> -	ipoib_ib_dev_stop(dev);
> +	ipoib_ib_dev_stop(dev, 1);
>  
>  	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
>  		struct ipoib_dev_priv *cpriv;
> @@ -990,7 +987,7 @@ static void ipoib_setup(struct net_devic
>  	INIT_LIST_HEAD(&priv->dead_ahs);
>  	INIT_LIST_HEAD(&priv->multicast_list);
>  
> -	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
> +	INIT_WORK(&priv->pkey_task, ipoib_pkey_event);
>  	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
>  	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
>  	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
> Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> ===================================================================
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-06 09:26:08.000000000 +0300
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-06 09:26:18.000000000 +0300
> @@ -232,9 +232,10 @@ static int ipoib_mcast_join_finish(struc
>  		ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid),
>  					 &mcast->mcmember.mgid);
>  		if (ret < 0) {
> -			ipoib_warn(priv, "couldn't attach QP to multicast group "
> -				   IPOIB_GID_FMT "\n",
> -				   IPOIB_GID_ARG(mcast->mcmember.mgid));
> +			if (ret != -ENXIO) /* No pkey found */
> +				ipoib_warn(priv, "couldn't attach QP to multicast group "
> +					   IPOIB_GID_FMT "\n",
> +					   IPOIB_GID_ARG(mcast->mcmember.mgid));
>  
>  			clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags);
>  			return ret;
> @@ -312,7 +313,7 @@ ipoib_mcast_sendonly_join_complete(int s
>  		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
>  
>  	if (status) {
> -		if (mcast->logcount++ < 20)
> +		if (mcast->logcount++ < 20 && status != -ENXIO)
>  			ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for "
>  					IPOIB_GID_FMT ", status %d\n",
>  					IPOIB_GID_ARG(mcast->mcmember.mgid), status);
> @@ -420,7 +421,7 @@ static int ipoib_mcast_join_complete(int
>  					", status %d\n",
>  					IPOIB_GID_ARG(mcast->mcmember.mgid),
>  					status);
> -		} else {
> +		} else if (status != -ENXIO) {
>  			ipoib_warn(priv, "multicast join failed for "
>  				   IPOIB_GID_FMT ", status %d\n",
>  				   IPOIB_GID_ARG(mcast->mcmember.mgid),
>
> Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
> ===================================================================
> --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-06 09:26:08.000000000 +0300
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-06 09:26:18.000000000 +0300
> @@ -33,8 +33,6 @@
>   * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
>   */
>  
> -#include <rdma/ib_cache.h>
> -
>  #include "ipoib.h"
>  
>  int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
> @@ -49,12 +47,12 @@ int ipoib_mcast_attach(struct net_device
>  	if (!qp_attr)
>  		goto out;
>  
> -	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
> -		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
> +		clear_bit(IPOIB_PKEY_NEEDED, &priv->flags);
>  		ret = -ENXIO;
>  		goto out;
>  	}
> -	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +	set_bit(IPOIB_PKEY_NEEDED, &priv->flags);
>  
>  	/* set correct QKey for QP */
>  	qp_attr->qkey = priv->qkey;
> @@ -103,12 +101,12 @@ int ipoib_init_qp(struct net_device *dev
>  	 * The port has to be assigned to the respective IB partition in
>  	 * advance.
>  	 */
> -	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
> +	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
>  	if (ret) {
> -		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +		set_bit(IPOIB_PKEY_NEEDED, &priv->flags);
>  		return ret;
>  	}
> -	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +	clear_bit(IPOIB_PKEY_NEEDED, &priv->flags);
>  
>  	qp_attr.qp_state = IB_QPS_INIT;
>  	qp_attr.qkey = 0;
> @@ -238,7 +236,7 @@ void ipoib_transport_dev_cleanup(struct 
>  			ipoib_warn(priv, "ib_qp_destroy failed\n");
>  
>  		priv->qp = NULL;
> -		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +		clear_bit(IPOIB_PKEY_NEEDED, &priv->flags);
>  	}
>  
>  	if (ib_destroy_cq(priv->cq))
> @@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler
>  		container_of(handler, struct ipoib_dev_priv, event_handler);
>  
>  	if ((record->event == IB_EVENT_PORT_ERR    ||
> -	     record->event == IB_EVENT_PKEY_CHANGE ||
>  	     record->event == IB_EVENT_PORT_ACTIVE ||
>  	     record->event == IB_EVENT_LID_CHANGE  ||
>  	     record->event == IB_EVENT_SM_CHANGE   ||
> @@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler
>  	    record->element.port_num == priv->port) {
>  		ipoib_dbg(priv, "Port state change event\n");
>  		queue_work(ipoib_workqueue, &priv->flush_task);
> +	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
> +		record->element.port_num == priv->port) {
> +		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
> +		queue_work(ipoib_workqueue, &priv->pkey_task);
>  	}
>  }

-- 
MST


From halr at voltaire.com  Mon May  7 09:04:38 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 May 2007 12:04:38 -0400
Subject: [ofa-general] Re: [PATCH] opensm: consolidate CA and router PortInfo
	receiving code
In-Reply-To: <20070506200013.GL9692@sashak.voltaire.com>
References: <20070506200013.GL9692@sashak.voltaire.com>
Message-ID: <1178553448.32222.358968.camel@hal.voltaire.com>

On Sun, 2007-05-06 at 16:00, Sasha Khapyorsky wrote:
> Consolidate CA and router PortInfo receiving processing code.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied (to master only).

-- Hal


From halr at voltaire.com  Mon May  7 09:16:07 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 May 2007 12:16:07 -0400
Subject: [ofa-general] Re: [PATCH TRIVIAL] opensm: trivial osm_port cleanups
In-Reply-To: <20070506181937.GK9692@sashak.voltaire.com>
References: <20070506181937.GK9692@sashak.voltaire.com>
Message-ID: <1178554553.32222.359968.camel@hal.voltaire.com>

On Sun, 2007-05-06 at 14:19, Sasha Khapyorsky wrote:
> This removes the meaningless osm_port_construct() and osm_port_destroy()
> functions and makes the locally used osm_port_init() static.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied (to master only).

-- Hal


From halr at voltaire.com  Mon May  7 09:21:33 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 May 2007 12:21:33 -0400
Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and
	remove the cahce
In-Reply-To: <463F41D3.4050603@voltaire.com>
References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com>
	<20070507135030.GI29350@mellanox.co.il> <463F354B.8030908@voltaire.com>
	<1178549162.32222.355374.camel@hal.voltaire.com>
	<463F41D3.4050603@voltaire.com>
Message-ID: <1178554880.32222.360219.camel@hal.voltaire.com>

On Mon, 2007-05-07 at 11:12, Yosef Etigin wrote:
> Hal Rosenstock wrote:
> > Hi Yosef,
> > 
> > On Mon, 2007-05-07 at 10:18, Yosef Etigin wrote:
> > 
> >>Michael S. Tsirkin wrote:
> >>
> >>>>@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str
> >>>>	recv->header.recv_wc.recv_buf.mad = &recv->mad.mad;
> >>>>	recv->header.recv_wc.recv_buf.grh = &recv->grh;
> >>>>
> >>>>+	/* update our lmc cache with port info smps */
> >>>>+	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
> >>>>+	     recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
> >>>>+	    && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
> >>>>+		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
> >>>>+	{
> >>>>+		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);
> >>>>+	}
> >>>>+
> >>>>	if (atomic_read(&qp_info->snoop_count))
> >>>>		snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS);
> >>>>
> >>>
> >>>
> >>>Why is this an atomic?
> >>
> >>I thought there might be a race between this and where we read the lmc (rcv_has_same_gid)
> >>
> >>
> >>>The comment does not seem to tell us anything useful. Remove it?
> >>>These 8 lines seem to violate coding style rules in at least 3 different ways::)
> >>>
> >>
> >>	if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
> >>		 recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
> >>		&& (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO)
> >>		&& (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET))
> >>		atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7);
> > 
> > 
> > Should at least a #define be used for smp.data[34], if not a struct, so it
> > is clearer what is going on here?
> > 
> 
> you mean something like:
> #define LMC_FROM_PORT_INFO(data) ( ( (char*)(data) )[34] & 0x07 ) ?

Yes, something along those lines at a minimum.
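
A rough sketch of what such a helper might look like. The names PORT_INFO_LMC_OFFSET, PORT_INFO_LMC_MASK and lmc_from_port_info() are made up for illustration and do not come from the patch or from any kernel header; only the offset 34 and the mask 0x7 are taken from the quoted code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the values 34 and 0x07 come from the quoted patch. */
#define PORT_INFO_LMC_OFFSET 34
#define PORT_INFO_LMC_MASK   0x07

/* Return the LMC field (low 3 bits of byte 34) of a PortInfo payload. */
static inline uint8_t lmc_from_port_info(const uint8_t *data)
{
	assert(data != NULL);
	return data[PORT_INFO_LMC_OFFSET] & PORT_INFO_LMC_MASK;
}
```

Taking a typed pointer instead of casting inside a macro keeps the byte access in one place, so a later switch to a proper PortInfo struct would only touch this helper.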

-- Hal

> > I haven't yet had a chance to look at the rest of the patch.
> > 
> > -- Hal
> > 
> > 
> >>is that better?
> > 
> > 
> 


From rdreier at cisco.com  Mon May  7 09:40:28 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 07 May 2007 09:40:28 -0700
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts
In-Reply-To: <1178551555.17477.0.camel@mtls03> (Eli Cohen's message of "Mon,
	07 May 2007 18:20:33 +0300")
References: <1178551555.17477.0.camel@mtls03>
Message-ID: <adar6ps60zn.fsf@cisco.com>

Thanks... should we optimize out the

	if (eqes_found)
		eq_set_ci(eq, 1);

at the end of mlx4_eq_int() now?  Actually the best fix is probably:

diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c
index 8d641b8..acf1c80 100644
--- a/drivers/net/mlx4/eq.c
+++ b/drivers/net/mlx4/eq.c
@@ -249,8 +249,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
 		}
 	}
 
-	if (eqes_found)
-		eq_set_ci(eq, 1);
+	eq_set_ci(eq, 1);
 
 	return eqes_found;
 }

because it seems sort of strange if we ever don't rearm the EQ on an
MSI-X interrupt.

What do you think?

On the other hand, this patch (and your patch) rearms the EQ on shared
interrupts for other devices too.  Can't be helped I guess.
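
To make the trade-off concrete, here is a toy model of the handler with the unconditional re-arm. Everything in it (struct toy_eq, toy_eq_set_ci(), toy_eq_int(), the "armed" flag) is a simplified stand-in invented for illustration, not the real mlx4 structures or hardware behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of an event queue; not the real struct mlx4_eq. */
struct toy_eq {
	uint32_t cons_index;	/* next entry software will consume */
	uint32_t prod_index;	/* next entry the "hardware" will write */
	bool armed;		/* will the next event raise an interrupt? */
};

/* Publish the consumer index and optionally request notification (re-arm). */
static void toy_eq_set_ci(struct toy_eq *eq, bool req_not)
{
	if (req_not)
		eq->armed = true;
}

/* Interrupt handler: drain pending entries, then re-arm unconditionally. */
static int toy_eq_int(struct toy_eq *eq)
{
	int eqes_found = 0;

	while (eq->cons_index != eq->prod_index) {
		++eq->cons_index;
		eqes_found = 1;
	}

	toy_eq_set_ci(eq, true);	/* no "if (eqes_found)" guard */
	return eqes_found;
}
```

In this model the handler leaves the EQ armed even when it found nothing to consume, which is the shared-interrupt case mentioned above: harmless, just an extra re-arm.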

 - R.


From Kapil.Dukle at med.ge.com  Mon May  7 09:41:28 2007
From: Kapil.Dukle at med.ge.com (Dukle, Kapil (GE Healthcare))
Date: Mon, 7 May 2007 12:41:28 -0400
Subject: [ofa-general] Infiniband data transfer across servers w/ different
	IB drivers
Message-ID: <DE4D96C8DFF3B94BACC3B6FE3B7D140103DA0FD1@CINMLVEM11.e2k.ad.ge.com>

Hi,

I am currently experimenting with Infiniband data transfers across
servers with different operating systems.
Is it possible for two servers with different Infiniband drivers (and
OS) to communicate for data transfers, as in the example below?

Server A runs VxWorks and uses SBS IB driver modules and APIs
 
Server B runs Linux and uses OFED 1.0 drivers and APIs

- Is it possible for these servers to transfer data across Infiniband
the way they are currently set up?
OR 
- Would I need to update Server A to have the OFED 1.0 IB drivers?

Let me know if you need any more information that might help answer
these questions...


Thanks,



From boris at mellanox.com  Mon May  7 09:45:08 2007
From: boris at mellanox.com (Boris Shpolyansky)
Date: Mon, 7 May 2007 09:45:08 -0700
Subject: [ofa-general] Infiniband data transfer across servers w/
	differentIB drivers
In-Reply-To: <DE4D96C8DFF3B94BACC3B6FE3B7D140103DA0FD1@CINMLVEM11.e2k.ad.ge.com>
Message-ID: <1E3DCD1C63492545881FACB6063A57C1D524ED@mtiexch01.mti.com>

I am not familiar with the SBS IB driver for VxWorks, but in general any
IB-compliant HCA should talk to any other IB-compliant switch/HCA
regardless of the driver implementation.
Make sure to run an SM on one of the ends to enable link establishment.
 
Boris
 


From swise at opengridcomputing.com  Mon May  7 09:49:36 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 07 May 2007 11:49:36 -0500
Subject: [ofa-general] RE: man pages for the rdma-cm
In-Reply-To: <000001c7905e$9a562190$95fd070a@amr.corp.intel.com>
References: <000001c7905e$9a562190$95fd070a@amr.corp.intel.com>
Message-ID: <1178556576.30571.79.camel@stevo-desktop>

On Sun, 2007-05-06 at 21:17 -0700, Sean Hefty wrote:
> >Are there man pages for the rdma-cm in the pipeline?  I think it would
> >be great (requirement?) to have these for ofed-1.2 since we do have the
> >other verbs man pages.
> 
> I've added man pages for the APIs and test programs to my master and ofed_1_2
> branches.  If anyone gets a chance, I'd appreciate someone looking them over.  I
> plan on requesting that they be pulled into the rc3 release.
> 
> - Sean

Hey Sean, the pages look good!

Here are a few comments.  Consider them for inclusion, but what you've
done so far is a great start.

- are the events described anywhere?  Maybe they should be described in
rdma_get_cm_event?

- rdma_accept / rdma_connect: describe the conn_param fields.

- rdma_bind_addr: binding to port 0 will cause the rdma-cm to select an
available port.

- no pages for get_src_port/get_dst_port

- rdma_connect - "connected" and "unconnected" when discussing cm_ids is
misleading. Perhaps "reliable connection" vs "unreliable datagram"?

- rdma_create_event_channel: it would be nice to mention that the fd can
be used like any other fd (made non blocking, poll()/select()able, etc).

- rdma_disconnect - for iWARP connections, this initiates a RDMAC Verbs
"normal close".  If the connection was properly quiesced by the
application, then the QP will end up back in IDLE, but if the connection
was not quiesced, then the connection will be terminated and the QP will
end up in ERROR.  Dunno if we want to describe this in detail?

- Also, it might be nice to have some sort of overview man page that
maps the expected event flows for connection setup and teardown.  Maybe
'man rdmacm' gets you some overview?
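
For the rdma_create_event_channel point, a minimal sketch of the "use it like any other fd" pattern. The helper names make_nonblocking() and wait_readable() are hypothetical. With librdmacm one would pass the fd member of the channel returned by rdma_create_event_channel() and call rdma_get_cm_event() once it is readable; the helpers themselves are plain POSIX, so they can be tried on any descriptor (a pipe, say) without RDMA hardware.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; returns 0 on success, -1 on error. */
static int make_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
static int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

A man page example along these lines would let an application fold the event channel into its existing poll()/select() loop instead of dedicating a thread to blocking reads.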


Steve.


From koen.segers at vrt.be  Mon May  7 09:49:22 2007
From: koen.segers at vrt.be (Koen Segers)
Date: Mon, 07 May 2007 18:49:22 +0200
Subject: [ofa-general] DDR and SDR
Message-ID: <1178556562.8727.3.camel@KOEN>

A simple question:

Is it possible to connect an SDR HCA to a DDR switch?
If so, what happens with the data that is sent from a DDR HCA to the SDR
HCA?

Regards,

Koen

From yosefe at voltaire.com  Mon May  7 09:54:07 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 19:54:07 +0300
Subject: [ofa-general] Re: [PATCH 3/6 v2] fix pkey change handling and remove
	the cahce
In-Reply-To: <20070507153025.GD15275@mellanox.co.il>
References: <463F2121.5080803@voltaire.com> <463F2237.7050809@voltaire.com>
	<20070507153025.GD15275@mellanox.co.il>
Message-ID: <463F59AF.70501@voltaire.com>

Michael S. Tsirkin wrote:
> All in all, this patch tries to do many things at once.  I wonder whether you
> can split the patch in 2: fix the pkey change case separately, and remove pkey
> polling separately.
> 
> 
I'm not sure it's necessary. What I had in mind is that the polling was created
because we did not handle events, so now that we handle them we should update the
way ipoib handles pkey changes.

>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>Subject: [PATCH 3/6 v2] fix pkey change handling and remove the cahce
>>
>>ipoib: handle pkey change events
>>
>>This issue was found during partitioning & SM fail over testing.
>>
>> * added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
>> * fixed a bug in device extraction from the work struct
>> * removed some warnings in case they are caused due to missing PKEY as 
>>	  this seems like a valid flow now.
> 
> 
> This seems to remove a useful tool for debugging invalid pkeys.
> Why is this a valid flow now?
> 
>
restored to previous state.
>> * Assume that the cache is coherent - do not retry on cache queries
>> * Restart child interfaces first before parent
> 
> 
> Why? Is this related to pkey change somehow?
> 
comment removed.

>>@@ -642,6 +634,12 @@ void ipoib_ib_dev_flush(struct work_stru
>> 
>> 	ipoib_ib_dev_down(dev, 0);
>> 
>>+	if (restart_qp) {
>>+		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) )
>>+			ipoib_ib_dev_stop(dev, 0);
>>+		ipoib_ib_dev_open(dev);
>>+	}
>>+
>> 	/*
>> 	 * The device could have been brought down between the start and when
>> 	 * we get here, don't bring it back up if it's not configured up
> 
> 
> I find these if (restart_qp) branches somewhat confusing.
> Why is this flag tested in 2 places?
> 
> 

First test: reopen a device that is still waiting for its pkey; this is done
only in the restart_qp flow.
Second test: decide whether to restart the QP at all.
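
The decision made by the two tests can be modelled roughly as follows. This is a toy sketch: struct toy_priv, enum toy_action and toy_flush_action() are invented for illustration, plain booleans stand in for the IPOIB_* flag bits, and any checks that sit between the two tests in the driver but are not visible in the diff context are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the IPOIB_* flag bits; not the driver's real state. */
struct toy_priv {
	bool initialized;	/* IPOIB_FLAG_INITIALIZED */
	bool admin_up;		/* IPOIB_FLAG_ADMIN_UP */
	bool pkey_needed;	/* IPOIB_PKEY_NEEDED */
};

enum toy_action {
	TOY_SKIP,		/* not initialized, nothing to flush */
	TOY_REOPEN,		/* first test: device was waiting for its pkey */
	TOY_FLUSH,		/* plain flush, QP left alone */
	TOY_FLUSH_RESTART_QP	/* second test: flush and restart the QP */
};

static enum toy_action toy_flush_action(const struct toy_priv *p,
					bool restart_qp)
{
	if (!p->initialized) {
		if (restart_qp && p->pkey_needed && p->admin_up)
			return TOY_REOPEN;
		return TOY_SKIP;
	}
	return restart_qp ? TOY_FLUSH_RESTART_QP : TOY_FLUSH;
}
```

Seen this way, restart_qp selects between two different jobs depending on whether the device ever came up, which is why it shows up in both branches.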

These and the rest of the comments are addressed below.

ipoib: handle pkey change events

This issue was found during partitioning & SM fail over testing.

 * Add a flush flag to ipoib_ib_dev_stop(), like the one ipoib_ib_dev_down() has
 * Fix a bug in device extraction from the work struct
 * Restart child interfaces before the parent
 * Remove the pkey polling thread and pkey delayed initialization
 * If an interface is brought up but its pkey is not found, mark it with
   IPOIB_PKEY_NEEDED and when a pkey event arrives, try to restart it.

SM reconfiguration or failover possibly causes a shuffling of the values in the port
pkey table. The current implementation only queries for the index of the pkey once,
when it creates the device QP and after that moves it into working state, and hence
does not address this scenario. Fix this by using the PKEY_CHANGE event as a trigger
to reconfigure the device QP.
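
As an illustration of the shuffling problem, the sketch below looks up the same pkey in a table before and after the SM rewrites it. toy_find_pkey() is a hypothetical helper, not the kernel's ib_find_pkey(); comparing only the low 15 bits (ignoring the membership bit) is an assumption made here for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Find the table index holding pkey, ignoring the membership (top) bit.
 * Returns 0 and stores the index on success, -1 if the pkey is absent.
 */
static int toy_find_pkey(const uint16_t *table, size_t len,
			 uint16_t pkey, uint16_t *index)
{
	for (size_t i = 0; i < len; ++i) {
		if ((table[i] & 0x7fff) == (pkey & 0x7fff)) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -1;
}
```

A QP that was programmed with the index found in the first table would keep using a stale slot after the second table is installed, hence the need to redo the lookup on PKEY_CHANGE.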


Signed-off-by: Moni Levy <monil at voltaire.com>
Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |   10 --
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |  140 +++++++++--------------------
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |   11 --
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |   21 ++--
 4 files changed, 66 insertions(+), 116 deletions(-)

Index: b/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-07 15:42:23.262692889 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-07 15:43:05.685154318 +0300
@@ -80,7 +80,7 @@ enum {
 	IPOIB_FLAG_INITIALIZED    = 1,
 	IPOIB_FLAG_ADMIN_UP 	  = 2,
 	IPOIB_PKEY_ASSIGNED 	  = 3,
-	IPOIB_PKEY_STOP 	  = 4,
+	IPOIB_PKEY_NEEDED		  = 4,
 	IPOIB_FLAG_SUBINTERFACE   = 5,
 	IPOIB_MCAST_RUN 	  = 6,
 	IPOIB_STOP_REAPER         = 7,
@@ -202,9 +202,9 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
+	struct work_struct pkey_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
 
@@ -333,12 +333,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
@@ -384,9 +385,6 @@ void ipoib_event(struct ib_event_handler
 int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey);
 int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
 
-void ipoib_pkey_poll(struct work_struct *work);
-int ipoib_pkey_dev_delay_open(struct net_device *dev);
-
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
 
 #define IPOIB_FLAGS_RC          0x80
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-07 15:43:05.074262877 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-07 19:48:28.843156398 +0300
@@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -441,28 +441,10 @@ int ipoib_ib_dev_open(struct net_device 
 	return 0;
 }
 
-static void ipoib_pkey_dev_check_presence(struct net_device *dev)
-{
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	u16 pkey_index = 0;
-
-	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index))
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
-	else
-		set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
-}
-
 int ipoib_ib_dev_up(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	ipoib_pkey_dev_check_presence(dev);
-
-	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
-		ipoib_dbg(priv, "PKEY is not assigned.\n");
-		return 0;
-	}
-
 	set_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
 
 	return ipoib_mcast_start_thread(dev);
@@ -477,16 +459,6 @@ int ipoib_ib_dev_down(struct net_device 
 	clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
 	netif_carrier_off(dev);
 
-	/* Shutdown the P_Key thread if still active */
-	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
-		mutex_lock(&pkey_mutex);
-		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
-		mutex_unlock(&pkey_mutex);
-		if (flush)
-			flush_workqueue(ipoib_workqueue);
-	}
-
 	ipoib_mcast_stop_thread(dev, flush);
 	ipoib_mcast_dev_flush(dev);
 
@@ -508,7 +480,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +553,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,14 +595,30 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
-		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
+	mutex_lock(&priv->vlan_mutex);
+
+	/* Flush any child interfaces */
+	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, restart_qp);
+
+	mutex_unlock(&priv->vlan_mutex);
+
+	/*
+	 * If the device is not initialized because it still needs a pkey,
+	 * try to reopen it
+	 */
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
+		if (restart_qp &&
+		    test_bit(IPOIB_PKEY_NEEDED, &priv->flags) &&
+		    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+			ipoib_open(priv->dev);
+		else
+			ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
 
@@ -642,6 +631,12 @@ void ipoib_ib_dev_flush(struct work_stru
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (restart_qp) {
+		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
+			ipoib_ib_dev_stop(dev, 0);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +645,24 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	mutex_unlock(&priv->vlan_mutex);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_task);
+
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -672,54 +677,3 @@ void ipoib_ib_dev_cleanup(struct net_dev
 	ipoib_transport_dev_cleanup(dev);
 }
 
-/*
- * Delayed P_Key Assigment Interim Support
- *
- * The following is initial implementation of delayed P_Key assigment
- * mechanism. It is using the same approach implemented for the multicast
- * group join. The single goal of this implementation is to quickly address
- * Bug #2507. This implementation will probably be removed when the P_Key
- * change async notification is available.
- */
-
-void ipoib_pkey_poll(struct work_struct *work)
-{
-	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
-	struct net_device *dev = priv->dev;
-
-	ipoib_pkey_dev_check_presence(dev);
-
-	if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
-		ipoib_open(dev);
-	else {
-		mutex_lock(&pkey_mutex);
-		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
-					   HZ);
-		mutex_unlock(&pkey_mutex);
-	}
-}
-
-int ipoib_pkey_dev_delay_open(struct net_device *dev)
-{
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-
-	/* Look for the interface pkey value in the IB Port P_Key table and */
-	/* set the interface pkey assigment flag                            */
-	ipoib_pkey_dev_check_presence(dev);
-
-	/* P_Key value not assigned yet - start polling */
-	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
-		mutex_lock(&pkey_mutex);
-		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
-		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
-				   HZ);
-		mutex_unlock(&pkey_mutex);
-		return 1;
-	}
-
-	return 0;
-}
Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-07 15:42:23.101721494 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-07 15:43:05.687153963 +0300
@@ -100,14 +100,11 @@ int ipoib_open(struct net_device *dev)
 
 	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
 
-	if (ipoib_pkey_dev_delay_open(dev))
-		return 0;
-
 	if (ipoib_ib_dev_open(dev))
-		return -EINVAL;
+		return test_bit(IPOIB_PKEY_NEEDED, &priv->flags) ? 0 : -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +149,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +987,7 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-07 15:42:23.387670681 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-07 15:43:05.688153785 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,12 +47,12 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+		clear_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
 	}
-	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	set_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 
 	/* set correct QKey for QP */
 	qp_attr->qkey = priv->qkey;
@@ -103,12 +101,12 @@ int ipoib_init_qp(struct net_device *dev
 	 * The port has to be assigned to the respective IB partition in
 	 * advance.
 	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
+	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
 	if (ret) {
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+		set_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 		return ret;
 	}
-	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	clear_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 
 	qp_attr.qp_state = IB_QPS_INIT;
 	qp_attr.qkey = 0;
@@ -238,7 +236,7 @@ void ipoib_transport_dev_cleanup(struct 
 			ipoib_warn(priv, "ib_qp_destroy failed\n");
 
 		priv->qp = NULL;
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+		clear_bit(IPOIB_PKEY_NEEDED, &priv->flags);
 	}
 
 	if (ib_destroy_cq(priv->cq))
@@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler
 		container_of(handler, struct ipoib_dev_priv, event_handler);
 
 	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
 	     record->event == IB_EVENT_PORT_ACTIVE ||
 	     record->event == IB_EVENT_LID_CHANGE  ||
 	     record->event == IB_EVENT_SM_CHANGE   ||
@@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler
 	    record->element.port_num == priv->port) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
+		record->element.port_num == priv->port) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_task);
 	}
 }


From mhagen at iol.unh.edu  Mon May  7 09:58:26 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Mon, 7 May 2007 12:58:26 -0400 (EDT)
Subject: [ofa-general] Re: [PATCH] infiniband: add userspace support for
	invalidate stag
In-Reply-To: <53312.132.177.125.178.1178307563.squirrel@postal.iol.unh.edu>
References: <53312.132.177.125.178.1178307563.squirrel@postal.iol.unh.edu>
Message-ID: <60316.132.177.125.178.1178557106.squirrel@postal.iol.unh.edu>

Add userspace support for iWARP verbs Send w/ INV and Send w/ SE and INV.

Signed-off-by: Mikkel Hagen <mhagen at iol.unh.edu>

--- linux-2.6.21.1/include/rdma/ib_user_verbs.h        2007-05-02 15:35:13.000000000 -0400
+++ linux-2.6.21.1/include/rdma/ib_user_verbs.h        2007-05-02 15:53:40.000000000 -0400
@@ -553,6 +553,10 @@ struct ib_uverbs_send_wr {
                         __u32 remote_qkey;
                         __u32 reserved;
                 } ud;
+                struct {
+                        __u32 rkey;
+                        __u32 reserved;
+                } invalidate;
         } wr;
 };

-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From mhagen at iol.unh.edu  Mon May  7 09:59:59 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Mon, 7 May 2007 12:59:59 -0400 (EDT)
Subject: [ofa-general] Re: [PATCH] infiniband: add userspace support for
	invalidate stag
In-Reply-To: <53313.132.177.125.178.1178307596.squirrel@postal.iol.unh.edu>
References: <53313.132.177.125.178.1178307596.squirrel@postal.iol.unh.edu>
Message-ID: <47431.132.177.125.178.1178557199.squirrel@postal.iol.unh.edu>

Add userspace support for iWARP verbs Send w/ INV and Send w/ SE and INV.

Signed-off-by: Mikkel Hagen <mhagen at iol.unh.edu>


--- linux-2.6.21.1/drivers/infiniband/core/uverbs_cmd.c        2007-05-04 14:25:50.000000000 -0400
+++ linux-2.6.21.1/drivers/infiniband/core/uverbs_cmd.c        2007-05-04 14:47:42.000000000 -0400
@@ -1507,6 +1507,12 @@ ssize_t ib_uverbs_post_send(struct ib_uv
                                 next->wr.atomic.swap = user_wr->wr.atomic.swap;
                                 next->wr.atomic.rkey = user_wr->wr.atomic.rkey;
                                 break;
+                        case IB_WR_SEND:
+                                if(next->send_flags & IB_SEND_INVALIDATE) {
+                                        next->wr.invalidate.rkey =
+                                                user_wr->wr.invalidate.rkey;
+                                }
+                                break;
                         default:
                                 break;
                         }

-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From todd.rimmer at qlogic.com  Mon May  7 10:00:23 2007
From: todd.rimmer at qlogic.com (Todd Rimmer)
Date: Mon, 7 May 2007 12:00:23 -0500
Subject: [ofa-general] DDR and SDR
In-Reply-To: <1178556562.8727.3.camel@KOEN>
Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE06119251D38@EPEXCH2.qlogic.org>


> From: Koen Segers
> Sent: Monday, May 07, 2007 12:49 PM
> To: openib-general at openib.org
> Subject: [ofa-general] DDR and SDR
> 
> A simple question:
> 
> Is it possible to connect a SDR HCA to a DDR switch?

Yes, at the time of Link Layer training, the link speed and width are
negotiated down to the highest common speed/width.  Hence when an SDR
HCA is connected to a DDR switch, the HCA's link and the corresponding
switch port will run at SDR speeds.

> If so, what happens with the data that is sent from a DDR HCA to the SDR
> HCA?
In IB every Path Record, Multicast group and Address Vector has a
"static rate".  This represents the speed of the path between 2 nodes.
When a DDR HCA sends to an SDR HCA, it should have obtained a static
rate from the SA showing a 10Gb/s rate (for a 4x SDR path), in which
case the DDR HCA will pace its sending so as not to exceed SDR speeds.
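
The negotiation and pacing described above can be sketched in a few lines
of C.  This is only an illustration under stated assumptions: the type and
function names (port_caps, negotiate_link, path_rate_mbps) are made up for
the example and are not part of any IB stack, and it assumes each port
supports every speed and width up to its maximum, so the "highest common"
value is simply the smaller of the two.

```c
#include <assert.h>
#include <stddef.h>

/* Signalling generations; per-lane rates are the usual 2.5 Gb/s (SDR)
 * and 5 Gb/s (DDR) figures, expressed in Mb/s. */
enum lane_speed { SPEED_SDR = 0, SPEED_DDR = 1 };
static const unsigned lane_mbps[] = { 2500, 5000 };

/* Hypothetical description of what one port is capable of. */
struct port_caps {
	enum lane_speed max_speed;
	unsigned max_width;	/* number of lanes, e.g. 1 or 4 */
};

/* Result of link training between two ports. */
struct link_state {
	enum lane_speed speed;
	unsigned width;
};

/* Assumption: a port supports everything up to its maximum, so the
 * highest common speed/width is the smaller of the two maxima. */
static struct link_state negotiate_link(struct port_caps a, struct port_caps b)
{
	struct link_state l;

	l.speed = a.max_speed < b.max_speed ? a.max_speed : b.max_speed;
	l.width = a.max_width < b.max_width ? a.max_width : b.max_width;
	return l;
}

static unsigned link_rate_mbps(struct link_state l)
{
	return lane_mbps[l.speed] * l.width;
}

/* The rate a sender should pace to: the slowest link along the path. */
static unsigned path_rate_mbps(const struct link_state *links, size_t n)
{
	unsigned rate;
	size_t i;

	assert(n > 0);
	rate = link_rate_mbps(links[0]);
	for (i = 1; i < n; i++)
		if (link_rate_mbps(links[i]) < rate)
			rate = link_rate_mbps(links[i]);
	return rate;
}
```

For a 4x DDR HCA talking through a 4x DDR switch to a 4x SDR HCA, the first
link trains at DDR and the second at SDR, and the path rate comes out as the
SDR figure, which is what the static rate is meant to convey to the sender.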

Todd Rimmer
Chief Architect 
QLogic System Interconnect Group
Voice: 610-233-4852     Fax: 610-233-4777
Todd.Rimmer at QLogic.com  www.QLogic.com


From mshefty at ichips.intel.com  Mon May  7 10:03:33 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 07 May 2007 10:03:33 -0700
Subject: [ofa-general] RE: man pages for the rdma-cm
In-Reply-To: <1178556576.30571.79.camel@stevo-desktop>
References: <000001c7905e$9a562190$95fd070a@amr.corp.intel.com>
	<1178556576.30571.79.camel@stevo-desktop>
Message-ID: <463F5BE5.8030806@ichips.intel.com>

> Here are a few comments.  Consider them for inclusion, but what you've
> done so far is a great start.

Thanks for the feedback.  I'll try to update this before RC3 freezes.

> - rdma_disconnect - for iWARP connections, this initiates a RDMAC Verbs
> "normal close".  If the connection was properly quiesced by the
> application, then the QP will end up back in IDLE, but if the connection
> was not quiesced, then the connection will be terminated and the QP will
> end up in ERROR.  Dunno if we want to describe this in detail?

Are all work requests flushed in both cases?  I don't know if we need to go into 
details about which state the QP ends up in, unless the behavior differences are 
visible to the user.

> - Also, it might be nice to have some sort of overview man page that
> maps the expected event flows for connection setup and teardown.  Maybe
> 'man rdmacm' gets you some overview?

I agree that this would be nice.  Is there a standard way of doing this?

- Sean


From mhagen at iol.unh.edu  Mon May  7 10:04:13 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Mon, 7 May 2007 13:04:13 -0400 (EDT)
Subject: [ofa-general] [PATCH] infiniband: add userspace support for 
	invalidate stag
In-Reply-To: <1178380859.8125.2.camel@stevo-desktop>
References: <53312.132.177.125.178.1178307561.squirrel@postal.iol.unh.edu>
	<adalkg49s37.fsf@cisco.com> <1178325178.3011.4.camel@stevo-laptop>
	<1178380859.8125.2.camel@stevo-desktop>
Message-ID: <47434.132.177.125.178.1178557453.squirrel@postal.iol.unh.edu>

Well, I resubmitted the kernel level changes with the comment and
signed-off-by fields.  I will wait on resubmitting the userspace changes.

My only real contribution to the discussion of what we should do is to
suggest that we keep them in a user-patches dir for a while (until a
couple of kernel revs ship with invalidate support) and then move them
into the main code base.


> On Fri, 2007-05-04 at 19:32 -0500, Steve Wise wrote:
>> On Fri, 2007-05-04 at 14:50 -0700, Roland Dreier wrote:
>> > A few general things:
>> >  - please always submit patches with a changelog entry and
>> >    Signed-off-by: line
>> >  - please send patches in logical chunks.  Usually I'm complaining
>> >    about people combining unrelated things into one patch, but in this
>> >    case I think you divided the patch up too much -- rather than 5
>> >    patches, this should probably be one kernel patch and one userspace
>> >    patch.
>> >  - please make libibverbs patches apply to the libibverbs git tree
>> >    with -p1.  You seem to have generated patches against an OFED
>> >    package.
>> >
>> > OK, with that out of the way, I think there are still some issues to
>> > sort out with how to handle send with invalidate from userspace.
>> > These patches don't address the case of new userspace with
>> > send-with-invalidate support talking to an unpatched kernel -- it
>> > seems that send-with-invalidate would be silently turned into a plain
>> > send request, which is not a very good failure mode.
>> >
>> > I don't know what the right solution is yet -- a kernel ABI bump for
>> > this one case (send with invalidate support for userspace drivers that
>> > don't do kernel bypass == amso1100) is ugly.  Maybe we also need a
>> > device capabilities bit that says whether send-with-invalidate is
>> > supported?
>> >
>>
>> There already exists a SEND-INV capabilities flag.
>>
>> <snip>
>>         IB_DEVICE_SEND_W_INV            = (1<<16),
>>
>> I think with the capabilities flag, we shouldn't worry about changing
>> the ABI.  But the drivers will need to set this flag.  Amso currently
>> does...
>
> Actually, since Amso has set this flag since day one, it doesn't really
> solve the ABI issue Roland describes.
>
>
> Steve.
>
>


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From swise at opengridcomputing.com  Mon May  7 10:27:03 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 07 May 2007 12:27:03 -0500
Subject: [ofa-general] RE: man pages for the rdma-cm
In-Reply-To: <463F5BE5.8030806@ichips.intel.com>
References: <000001c7905e$9a562190$95fd070a@amr.corp.intel.com>
	<1178556576.30571.79.camel@stevo-desktop>
	<463F5BE5.8030806@ichips.intel.com>
Message-ID: <1178558823.30571.97.camel@stevo-desktop>

On Mon, 2007-05-07 at 10:03 -0700, Sean Hefty wrote:
> > Here are a few comments.  Consider them for inclusion, but what you've
> > done so far is a great start.
> 
> Thanks for the feedback.  I'll try to update this before RC3 freezes.
> 
> > - rdma_disconnect - for iWARP connections, this initiates a RDMAC Verbs
> > "normal close".  If the connection was properly quiesced by the
> > application, then the QP will end up back in IDLE, but if the connection
> > was not quiesced, then the connection will be terminated and the QP will
> > end up in ERROR.  Dunno if we want to describe this in detail?
> 
> Are all work requests flushed in both cases?  I don't know if we need to go into 
> details about which state the QP ends up in, unless the behavior differences are 
> visible to the user.
> 

In the "normal close" case, the user is responsible for quiescing the SQ.
In both cases the RQ entries are flushed.

We can omit this for now if you want.

> > - Also, it might be nice to have some sort of overview man page that
> > maps the expected event flows for connection setup and teardown.  Maybe
> > 'man rdmacm' gets you some overview?
> 
> I agree that this would be nice.  Is there a standard way of doing this?
> 

There's a 'tcp' man page to describe tcp.  So I think it's OK to have a
'rdmacm' or 'rdmacma' man page.

Steve.


From pradeeps at linux.vnet.ibm.com  Mon May  7 10:32:09 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Mon, 07 May 2007 10:32:09 -0700
Subject: [ofa-general] Question about git tree
Message-ID: <463F6299.8050106@linux.vnet.ibm.com>

Roland,

Last night you submitted the NAPI work for 2.6.22. When I checked a few 
minutes ago I saw that the NAPI work has been merged into the for-linus 
tree and not the for-2.6.22 tree.

I want to merge and test my patch against the latest tree - which git 
tree should I use? Can you please provide insight into how this 
procedure works, or if it is documented please provide a pointer.

Pradeep


From sean.hefty at intel.com  Mon May  7 11:39:46 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 7 May 2007 11:39:46 -0700
Subject: [ofa-general] [PATCH 0/3] rdma/cm: cleanup device removal
	synchronization
In-Reply-To: <000401c787ca$f37d7ee0$2ad8180a@amr.corp.intel.com>
Message-ID: <000101c790d7$1642e680$8698070a@amr.corp.intel.com>

Here's a couple of patches that make the device removal synchronization
in the rdma_cm a little more explicit, along with one fix to add in
missing synchronization.

With these patches, it's now possible to call rdma_disconnect() after
receiving a device removal event.

I plan to push these changes to my git tree and request that they
be pulled into 2.6.22 within the next couple of days if there are no
issues.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>


From sean.hefty at intel.com  Mon May  7 11:42:16 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 7 May 2007 11:42:16 -0700
Subject: [ofa-general] [PATCH 1/3] rdma/cm: simplify device removal handling
	code
In-Reply-To: <000101c790d7$1642e680$8698070a@amr.corp.intel.com>
Message-ID: <000201c790d7$6fac26f0$8698070a@amr.corp.intel.com>

Add a new routine and rename another to encapsulate common code for
synchronizing with device removal. 
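
As a rough illustration of the pattern this patch encapsulates, here is a
user-space sketch: check the expected state under a lock and, only if it
matches, take a reference that holds off device removal.  All names
(sketch_id, sketch_disable_remove, ...) are hypothetical; a C11 atomic_flag
spinlock stands in for the kernel spinlock, and the "wake the waiter" step
is reduced to a return value.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

enum sketch_state { SK_IDLE, SK_LISTEN, SK_CONNECT, SK_DEVICE_REMOVAL };

struct sketch_id {
	atomic_flag lock;		/* stands in for id_priv->lock */
	enum sketch_state state;
	atomic_int dev_remove;		/* handlers currently blocking removal */
};

static void sketch_init(struct sketch_id *id, enum sketch_state state)
{
	atomic_flag_clear(&id->lock);
	id->state = state;
	atomic_init(&id->dev_remove, 0);
}

static void sketch_lock(struct sketch_id *id)
{
	while (atomic_flag_test_and_set(&id->lock))
		;	/* spin until the flag is released */
}

static void sketch_unlock(struct sketch_id *id)
{
	atomic_flag_clear(&id->lock);
}

/* Take a "removal disabled" reference only if the id is in the expected
 * state; the check and the increment happen under the same lock. */
static int sketch_disable_remove(struct sketch_id *id,
				 enum sketch_state expected)
{
	int ret;

	sketch_lock(id);
	if (id->state == expected) {
		atomic_fetch_add(&id->dev_remove, 1);
		ret = 0;
	} else {
		ret = -EINVAL;
	}
	sketch_unlock(id);
	return ret;
}

/* Drop the reference.  Returns nonzero when this was the last holder,
 * which is the point where the kernel code wakes the removal path. */
static int sketch_enable_remove(struct sketch_id *id)
{
	int old = atomic_fetch_sub(&id->dev_remove, 1);

	assert(old > 0);	/* must pair with a successful disable */
	return old == 1;
}
```

The point of folding the state check into the helper is that a handler
which loses the race with removal never touches the counter at all, so it
can simply return instead of jumping to a cleanup label.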

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 drivers/infiniband/core/cma.c |   89 ++++++++++++++++++++++-------------------
 1 files changed, 48 insertions(+), 41 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index fde92ce..d026764 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -346,7 +346,23 @@ static void cma_deref_id(struct rdma_id_private *id_priv)
 		complete(&id_priv->comp);
 }
 
-static void cma_release_remove(struct rdma_id_private *id_priv)
+static int cma_disable_remove(struct rdma_id_private *id_priv,
+			      enum cma_state state)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&id_priv->lock, flags);
+	if (id_priv->state == state) {
+		atomic_inc(&id_priv->dev_remove);
+		ret = 0;
+	} else
+		ret = -EINVAL;
+	spin_unlock_irqrestore(&id_priv->lock, flags);
+	return ret;
+}
+
+static void cma_enable_remove(struct rdma_id_private *id_priv)
 {
 	if (atomic_dec_and_test(&id_priv->dev_remove))
 		wake_up(&id_priv->wait_remove);
@@ -884,9 +900,8 @@ static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event)
 	struct rdma_cm_event event;
 	int ret = 0;
 
-	atomic_inc(&id_priv->dev_remove);
-	if (!cma_comp(id_priv, CMA_CONNECT))
-		goto out;
+	if (cma_disable_remove(id_priv, CMA_CONNECT))
+		return 0;
 
 	memset(&event, 0, sizeof event);
 	switch (ib_event->event) {
@@ -942,12 +957,12 @@ static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event)
 		/* Destroy the CM ID by returning a non-zero value. */
 		id_priv->cm_id.ib = NULL;
 		cma_exch(id_priv, CMA_DESTROYING);
-		cma_release_remove(id_priv);
+		cma_enable_remove(id_priv);
 		rdma_destroy_id(&id_priv->id);
 		return ret;
 	}
 out:
-	cma_release_remove(id_priv);
+	cma_enable_remove(id_priv);
 	return ret;
 }
 
@@ -1057,11 +1072,8 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event)
 	int offset, ret;
 
 	listen_id = cm_id->context;
-	atomic_inc(&listen_id->dev_remove);
-	if (!cma_comp(listen_id, CMA_LISTEN)) {
-		ret = -ECONNABORTED;
-		goto out;
-	}
+	if (cma_disable_remove(listen_id, CMA_LISTEN))
+		return -ECONNABORTED;
 
 	memset(&event, 0, sizeof event);
 	offset = cma_user_data_offset(listen_id->id.ps);
@@ -1101,11 +1113,11 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event)
 
 release_conn_id:
 	cma_exch(conn_id, CMA_DESTROYING);
-	cma_release_remove(conn_id);
+	cma_enable_remove(conn_id);
 	rdma_destroy_id(&conn_id->id);
 
 out:
-	cma_release_remove(listen_id);
+	cma_enable_remove(listen_id);
 	return ret;
 }
 
@@ -1214,12 +1226,12 @@ static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
 		/* Destroy the CM ID by returning a non-zero value. */
 		id_priv->cm_id.iw = NULL;
 		cma_exch(id_priv, CMA_DESTROYING);
-		cma_release_remove(id_priv);
+		cma_enable_remove(id_priv);
 		rdma_destroy_id(&id_priv->id);
 		return ret;
 	}
 
-	cma_release_remove(id_priv);
+	cma_enable_remove(id_priv);
 	return ret;
 }
 
@@ -1234,11 +1246,8 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id,
 	int ret;
 
 	listen_id = cm_id->context;
-	atomic_inc(&listen_id->dev_remove);
-	if (!cma_comp(listen_id, CMA_LISTEN)) {
-		ret = -ECONNABORTED;
-		goto out;
-	}
+	if (cma_disable_remove(listen_id, CMA_LISTEN))
+		return -ECONNABORTED;
 
 	/* Create a new RDMA id for the new IW CM ID */
 	new_cm_id = rdma_create_id(listen_id->id.event_handler,
@@ -1255,13 +1264,13 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id,
 	dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr);
 	if (!dev) {
 		ret = -EADDRNOTAVAIL;
-		cma_release_remove(conn_id);
+		cma_enable_remove(conn_id);
 		rdma_destroy_id(new_cm_id);
 		goto out;
 	}
 	ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL);
 	if (ret) {
-		cma_release_remove(conn_id);
+		cma_enable_remove(conn_id);
 		rdma_destroy_id(new_cm_id);
 		goto out;
 	}
@@ -1270,7 +1279,7 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id,
 	ret = cma_acquire_dev(conn_id);
 	mutex_unlock(&lock);
 	if (ret) {
-		cma_release_remove(conn_id);
+		cma_enable_remove(conn_id);
 		rdma_destroy_id(new_cm_id);
 		goto out;
 	}
@@ -1293,14 +1302,14 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id,
 		/* User wants to destroy the CM ID */
 		conn_id->cm_id.iw = NULL;
 		cma_exch(conn_id, CMA_DESTROYING);
-		cma_release_remove(conn_id);
+		cma_enable_remove(conn_id);
 		rdma_destroy_id(&conn_id->id);
 	}
 
 out:
 	if (dev)
 		dev_put(dev);
-	cma_release_remove(listen_id);
+	cma_enable_remove(listen_id);
 	return ret;
 }
 
@@ -1519,7 +1528,7 @@ static void cma_work_handler(struct work_struct *_work)
 		destroy = 1;
 	}
 out:
-	cma_release_remove(id_priv);
+	cma_enable_remove(id_priv);
 	cma_deref_id(id_priv);
 	if (destroy)
 		rdma_destroy_id(&id_priv->id);
@@ -1711,13 +1720,13 @@ static void addr_handler(int status, struct sockaddr *src_addr,
 
 	if (id_priv->id.event_handler(&id_priv->id, &event)) {
 		cma_exch(id_priv, CMA_DESTROYING);
-		cma_release_remove(id_priv);
+		cma_enable_remove(id_priv);
 		cma_deref_id(id_priv);
 		rdma_destroy_id(&id_priv->id);
 		return;
 	}
 out:
-	cma_release_remove(id_priv);
+	cma_enable_remove(id_priv);
 	cma_deref_id(id_priv);
 }
 
@@ -2042,11 +2051,10 @@ static int cma_sidr_rep_handler(struct ib_cm_id *cm_id,
 	struct ib_cm_sidr_rep_event_param *rep = &ib_event->param.sidr_rep_rcvd;
 	int ret = 0;
 
-	memset(&event, 0, sizeof event);
-	atomic_inc(&id_priv->dev_remove);
-	if (!cma_comp(id_priv, CMA_CONNECT))
-		goto out;
+	if (cma_disable_remove(id_priv, CMA_CONNECT))
+		return 0;
 
+	memset(&event, 0, sizeof event);
 	switch (ib_event->event) {
 	case IB_CM_SIDR_REQ_ERROR:
 		event.event = RDMA_CM_EVENT_UNREACHABLE;
@@ -2084,12 +2092,12 @@ static int cma_sidr_rep_handler(struct ib_cm_id *cm_id,
 		/* Destroy the CM ID by returning a non-zero value. */
 		id_priv->cm_id.ib = NULL;
 		cma_exch(id_priv, CMA_DESTROYING);
-		cma_release_remove(id_priv);
+		cma_enable_remove(id_priv);
 		rdma_destroy_id(&id_priv->id);
 		return ret;
 	}
 out:
-	cma_release_remove(id_priv);
+	cma_enable_remove(id_priv);
 	return ret;
 }
 
@@ -2499,10 +2507,9 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast)
 	int ret;
 
 	id_priv = mc->id_priv;
-	atomic_inc(&id_priv->dev_remove);
-	if (!cma_comp(id_priv, CMA_ADDR_BOUND) &&
-	    !cma_comp(id_priv, CMA_ADDR_RESOLVED))
-		goto out;
+	if (cma_disable_remove(id_priv, CMA_ADDR_BOUND) &&
+	    cma_disable_remove(id_priv, CMA_ADDR_RESOLVED))
+		return 0;
 
 	if (!status && id_priv->id.qp)
 		status = ib_attach_mcast(id_priv->id.qp, &multicast->rec.mgid,
@@ -2524,12 +2531,12 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast)
 	ret = id_priv->id.event_handler(&id_priv->id, &event);
 	if (ret) {
 		cma_exch(id_priv, CMA_DESTROYING);
-		cma_release_remove(id_priv);
+		cma_enable_remove(id_priv);
 		rdma_destroy_id(&id_priv->id);
 		return 0;
 	}
-out:
-	cma_release_remove(id_priv);
+
+	cma_enable_remove(id_priv);
 	return 0;
 }
 

From sean.hefty at intel.com  Mon May  7 11:43:38 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 7 May 2007 11:43:38 -0700
Subject: [ofa-general] [PATCH 2/3] rdma/cm: Fix synchronization with device
	removal in cma_iw_handler
In-Reply-To: <000201c790d7$6fac26f0$8698070a@amr.corp.intel.com>
Message-ID: <000301c790d7$a089c3e0$8698070a@amr.corp.intel.com>

The cma_iw_handler needs to validate the state of the rdma_cm_id before
processing a new connection request to ensure that a device removal is
not already being processed for the same rdma_cm_id.  Without the state
check, the user can receive simultaneous callbacks for the same cm_id, or
a callback after they've destroyed the cm_id.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 drivers/infiniband/core/cma.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index d026764..cfd57b4 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1183,9 +1183,10 @@ static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
 	struct sockaddr_in *sin;
 	int ret = 0;
 
-	memset(&event, 0, sizeof event);
-	atomic_inc(&id_priv->dev_remove);
+	if (cma_disable_remove(id_priv, CMA_CONNECT))
+		return 0;
 
+	memset(&event, 0, sizeof event);
 	switch (iw_event->event) {
 	case IW_CM_EVENT_CLOSE:
 		event.event = RDMA_CM_EVENT_DISCONNECTED;


From sean.hefty at intel.com  Mon May  7 11:45:23 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 7 May 2007 11:45:23 -0700
Subject: [ofa-general] [PATCH 3/3] rdma/cm: Add check to validate that cm_id
	is bound to a device
In-Reply-To: <000101c790d7$1642e680$8698070a@amr.corp.intel.com>
Message-ID: <000401c790d7$df1b38a0$8698070a@amr.corp.intel.com>

Several checks in the rdma_cm check against the state of the
cm_id, but only to validate that the cm_id is bound to an underlying
transport specific CM and an RDMA device.  Make the check explicit
in what we're trying to check for, since we're not synchronizing
against the cm_id state.

This will allow a user to disconnect a cm_id or reject a connection
after receiving a device removal event.
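
To make the difference concrete, here is a small stand-alone sketch that
contrasts a state-based check with a "bound to a device" check.  The types
and names (bd_id, bd_disconnect_old, bd_disconnect_new) are invented for the
example and only model the validation step, not the actual disconnect.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum bd_state { BD_CONNECT, BD_DISCONNECT, BD_DEVICE_REMOVAL };

/* Hypothetical, heavily reduced stand-in for the private cm_id. */
struct bd_id {
	enum bd_state state;
	void *device;	/* non-NULL once bound to an RDMA device */
	void *cm_id;	/* non-NULL once a transport CM id exists */
};

static int bd_has_cm_dev(const struct bd_id *id)
{
	assert(id != NULL);
	return id->device != NULL && id->cm_id != NULL;
}

/* State-based validation: refuses anything outside CONNECT/DISCONNECT. */
static int bd_disconnect_old(const struct bd_id *id)
{
	assert(id != NULL);
	if (id->state != BD_CONNECT && id->state != BD_DISCONNECT)
		return -EINVAL;
	return 0;
}

/* Binding-based validation: only asks whether there is something
 * to disconnect, whatever state the id happens to be in. */
static int bd_disconnect_new(const struct bd_id *id)
{
	if (!bd_has_cm_dev(id))
		return -EINVAL;
	return 0;
}
```

With an id that is still bound but has already moved to the device-removal
state, the state-based variant fails while the binding-based one succeeds,
which is the behaviour change described above.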

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 drivers/infiniband/core/cma.c |   12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index cfd57b4..2eb52b7 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -368,6 +368,11 @@ static void cma_enable_remove(struct rdma_id_private *id_priv)
 		wake_up(&id_priv->wait_remove);
 }
 
+static int cma_has_cm_dev(struct rdma_id_private *id_priv)
+{
+	return (id_priv->id.device && id_priv->cm_id.ib);
+}
+
 struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
 				  void *context, enum rdma_port_space ps)
 {
@@ -2422,7 +2427,7 @@ int rdma_notify(struct rdma_cm_id *id, enum ib_event_type event)
 	int ret;
 
 	id_priv = container_of(id, struct rdma_id_private, id);
-	if (!cma_comp(id_priv, CMA_CONNECT))
+	if (!cma_has_cm_dev(id_priv))
 		return -EINVAL;
 
 	switch (id->device->node_type) {
@@ -2444,7 +2449,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data,
 	int ret;
 
 	id_priv = container_of(id, struct rdma_id_private, id);
-	if (!cma_comp(id_priv, CMA_CONNECT))
+	if (!cma_has_cm_dev(id_priv))
 		return -EINVAL;
 
 	switch (rdma_node_get_transport(id->device->node_type)) {
@@ -2475,8 +2480,7 @@ int rdma_disconnect(struct rdma_cm_id *id)
 	int ret;
 
 	id_priv = container_of(id, struct rdma_id_private, id);
-	if (!cma_comp(id_priv, CMA_CONNECT) &&
-	    !cma_comp(id_priv, CMA_DISCONNECT))
+	if (!cma_has_cm_dev(id_priv))
 		return -EINVAL;
 
 	switch (rdma_node_get_transport(id->device->node_type)) {


From mst at dev.mellanox.co.il  Mon May  7 13:03:15 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 May 2007 23:03:15 +0300
Subject: [ofa-general] [PATCH] ipoib/cm: make stale task actually run once in
	a while
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA3037668FC@xmb-sjc-216.amer.cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C90BFCB3@mtlexch01.mtl.com>
	<6C2C79E72C305246B504CBA17B5500C9076E27@mtlexch01.mtl.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3037668FC@xmb-sjc-216.amer.cisco.com>
Message-ID: <20070507200315.GD22341@mellanox.co.il>

When there are active passive connections, the stale task never runs:
every 4 RX CQEs we call queue_delayed_work again, which delays it by some
10 minutes.  As a result, on a noisy system with failing ports, we slowly run
out of resources, slowing connection setup down and eventually failing.

What we actually want is to start the stale task when the first
passive connection is added, and rerun it every 10 min as long
as there are outstanding passive connections.

As a happy side effect, this removes some code from RX data path.
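
The intended scheduling policy can be modelled in user space as follows.
This is a sketch only: stale_sim and its helpers are hypothetical names, a
counter stands in for the passive_ids list, and a boolean stands in for the
queued delayed work.

```c
#include <assert.h>
#include <stdbool.h>

struct stale_sim {
	unsigned passive_conns;	/* stands in for priv->cm.passive_ids */
	bool task_pending;	/* stands in for the queued delayed work */
	unsigned arm_count;	/* how many times the task was queued */
};

static void arm_stale_task(struct stale_sim *s)
{
	if (!s->task_pending) {
		s->task_pending = true;
		s->arm_count++;
	}
}

/* Queue the task only on the empty -> non-empty transition. */
static void add_passive_conn(struct stale_sim *s)
{
	if (s->passive_conns == 0)
		arm_stale_task(s);
	s->passive_conns++;
}

/* Runs when the delay expires: reap `stale` connections, then re-arm
 * only if some passive connections are still outstanding. */
static void run_stale_task(struct stale_sim *s, unsigned stale)
{
	assert(s->task_pending);
	s->task_pending = false;
	if (stale > s->passive_conns)
		stale = s->passive_conns;
	s->passive_conns -= stale;
	if (s->passive_conns > 0)
		arm_stale_task(s);
}
```

Note that nothing on the receive path touches the timer in this model:
adding further connections or receiving data never pushes the task back.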

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

Scott, I think this might address bugs 541 and 465: slow IPoIB CM HA failover
and eventual failing IPoIB HA. Could you test this please?

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 2b242a4..b77e8d7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -258,10 +258,11 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	cm_id->context = p;
 	p->jiffies = jiffies;
 	spin_lock_irqsave(&priv->lock, flags);
+	if (list_empty(&priv->cm.passive_ids))
+		queue_delayed_work(ipoib_workqueue,
+				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 	list_add(&p->list, &priv->cm.passive_ids);
 	spin_unlock_irqrestore(&priv->lock, flags);
-	queue_delayed_work(ipoib_workqueue,
-			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 	return 0;
 
 err_rep:
@@ -380,8 +381,6 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 			if (!list_empty(&p->list))
 				list_move(&p->list, &priv->cm.passive_ids);
 			spin_unlock_irqrestore(&priv->lock, flags);
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 		}
 	}
 
@@ -1104,6 +1103,10 @@ static void ipoib_cm_stale_task(struct work_struct *work)
 		kfree(p);
 		spin_lock_irqsave(&priv->lock, flags);
 	}
+
+	if (!list_empty(&priv->cm.passive_ids))
+		queue_delayed_work(ipoib_workqueue,
+				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 	spin_unlock_irqrestore(&priv->lock, flags);
 }
 
-- 
MST


From rdreier at cisco.com  Mon May  7 13:25:19 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 07 May 2007 13:25:19 -0700
Subject: [ofa-general] Re: Question about git tree
In-Reply-To: <463F6299.8050106@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Mon, 07 May 2007 10:32:09 -0700")
References: <463F6299.8050106@linux.vnet.ibm.com>
Message-ID: <adamz0g5qkw.fsf@cisco.com>

 > Last night you submitted the NAPI work for 2.6.22. When I checked a
 > few minutes ago I saw that the NAPI work has been merged into the
 > for-linus tree and not the for-2.6.22 tree.

Yes, that was a temporary situation until Linus pulled everything into
his tree (which he now has done).

 > I want to merge and test my patch against the latest tree - which git
 > tree should I use? Can you please provide insight into how this
 > procedure works, or if it is documented please provide a pointer.

Your question is actually a fairly deep one.  In fact in the git world
the concept of "latest tree" is not defined.  It is quite common, for
example, to have some fixes queued in for-2.6.22 and some new features
queued in for-2.6.23.  And for-mm in general is something
like the union of everything that has a chance at being merged within
the next couple of kernel releases.

So I guess you just have to use some judgement and look at which tree
has things that are likely to impact what you're working on.

 - R.


From swise at opengridcomputing.com  Mon May  7 15:09:21 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 07 May 2007 17:09:21 -0500
Subject: [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
Message-ID: <1178575761.30571.175.camel@stevo-desktop>

On Sat, 2007-04-28 at 16:20 -0400, Jeff Squyres wrote:
> You'd probably be better asking this question on the Open MPI mailing  
> lists, not here.  :-)
> 
> FWIW, yes, adding RDMA CM support has actually been on my to-do list  
> for a while, but it keeps getting bumped by higher priority items.   
> It would be *much* better if some iWARP companies got involved in  
> Open MPI...
> 

Hey Jeff, 

Chelsio's gonna pony up the resources to get this work done asap.  Do
you have any thoughts on how we can collaborate on this project?  I'm
familiar with mvapich, not ompi, so I need to go do some homework.  But
any pointers on the connection setup design for ompi would be great.

I'm CCing devel at openmpi.org in case anyone else is interested in
helping.  Chelsio can provide rnic HW...


Thanks,

Steve.


> 
> 
> On Apr 28, 2007, at 4:16 PM, Steve Wise wrote:
> 
> > Is anyone working on adding RDMA-CM support to OpenMPI?
> >
> > Thanks,
> >
> > Steve.
> >
> >
> >
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/ 
> > openib-general
> 
> 


From swise at opengridcomputing.com  Mon May  7 15:52:26 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 07 May 2007 17:52:26 -0500
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <1178575761.30571.175.camel@stevo-desktop>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
Message-ID: <1178578346.30571.183.camel@stevo-desktop>

Also, there appears to be a DAPL BTL in OMPI.  Is this BTL complete and
enabled for the ofed-1.2 udapl library? 


Steve.


On Mon, 2007-05-07 at 17:09 -0500, Steve Wise wrote:
> On Sat, 2007-04-28 at 16:20 -0400, Jeff Squyres wrote:
> > You'd probably be better asking this question on the Open MPI mailing  
> > lists, not here.  :-)
> > 
> > FWIW, yes, adding RDMA CM support has actually been on my to-do list  
> > for a while, but it keeps getting bumped by higher priority items.   
> > It would be *much* better if some iWARP companies got involved in  
> > Open MPI...
> > 
> 
> Hey Jeff, 
> 
> Chelsio's gonna pony up the resources to get this work done asap.  Do
> you have any thoughts on how we can collaborate on this project?  I'm
> familiar with mvapich, not ompi, so I need to go do some homework.  But
> any pointers on the connection setup design for ompi would be great.
> 
> I'm CCing devel at openmpi.org in case anyone else is interested in
> helping.  Chelsio can provide rnic HW...
> 
> 
> Thanks,
> 
> Steve.
> 
> 
> 
> > 
> > 
> > On Apr 28, 2007, at 4:16 PM, Steve Wise wrote:
> > 
> > > Is anyone working on adding RDMA-CM support to OpenMPI?
> > >
> > > Thanks,
> > >
> > > Steve.
> > >
> > >
> > >
> > 
> > 
> 
> _______________________________________________
> devel mailing list
> devel at open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


From jsquyres at cisco.com  Mon May  7 17:37:17 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 7 May 2007 20:37:17 -0400
Subject: [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <1178575761.30571.175.camel@stevo-desktop>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
Message-ID: <95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com>

On May 7, 2007, at 6:09 PM, Steve Wise wrote:

>> You'd probably be better asking this question on the Open MPI mailing
>> lists, not here.  :-)
>>
>> FWIW, yes, adding RDMA CM support has actually been on my to-do list
>> for a while, but it keeps getting bumped by higher priority items.
>> It would be *much* better if some iWARP companies got involved in
>> Open MPI...
>
> Chelsio's gonna pony up the resources to get this work done asap.  Do
> you have any thoughts on how we can collaborate on this project?  I'm
> familiar with mvapich, not ompi, so I need to go do some homework.   
> But
> any pointers on the connection setup design for ompi would be great.

Excellent!  Let's chat on the phone tomorrow -- this would probably  
be the best way to start.

We will need a signed Open MPI 3rd party contribution agreement from  
either you and/or Chelsio (whoever owns the intellectual property  
that will be contributed).  See http://www.open-mpi.org/community/ 
contribute/.

> I'm CCing devel at openmpi.org in case anyone else is interested in
> helping.  Chelsio can provide rnic HW...

Anyone else here interested?  Free hardware!  :-)

-- 
Jeff Squyres
Cisco Systems


From jsquyres at cisco.com  Mon May  7 17:39:58 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 7 May 2007 20:39:58 -0400
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <1178578346.30571.183.camel@stevo-desktop>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
	<1178578346.30571.183.camel@stevo-desktop>
Message-ID: <BB132B4A-DED0-41B4-8B44-6ADAB5F9740B@cisco.com>

On May 7, 2007, at 6:52 PM, Steve Wise wrote:

> Also, there appears to be a DAPL BTL in OMPI.  Is this BTL complete  
> and
> enabled for the ofed-1.2 udapl library?

Yes, it is complete and is well-tested in Solaris.

It is not well tested in Linux/OFED (we've been concentrating on the  
verbs interface on the OFED side of things -- the "openib" BTL [we  
never renamed it when OpenIB changed names to OpenFabrics]).  In  
fact, we've had scattered reports of it not working properly in Linux/ 
OFED, but those could well have been pilot error (i.e., me not trying  
to run properly -- I know just about zilch about udapl).

-- 
Jeff Squyres
Cisco Systems


From afriedle at indiana.edu  Mon May  7 17:54:26 2007
From: afriedle at indiana.edu (Andrew Friedley)
Date: Mon, 07 May 2007 20:54:26 -0400
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com>
References: <1177791386.4615.8.camel@stevo-laptop>	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>	<1178575761.30571.175.camel@stevo-desktop>
	<95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com>
Message-ID: <463FCA42.3000104@indiana.edu>

Jeff Squyres wrote:
> On May 7, 2007, at 6:09 PM, Steve Wise wrote:
> 
>>> You'd probably be better asking this question on the Open MPI mailing
>>> lists, not here.  :-)
>>>
>>> FWIW, yes, adding RDMA CM support has actually been on my to-do list
>>> for a while, but it keeps getting bumped by higher priority items.
>>> It would be *much* better if some iWARP companies got involved in
>>> Open MPI...
>> Chelsio's gonna pony up the resources to get this work done asap.  Do
>> you have any thoughts on how we can collaborate on this project?  I'm
>> familiar with mvapich, not ompi, so I need to go do some homework.   
>> But
>> any pointers on the connection setup design for ompi would be great.
> 
> Excellent!  Let's chat on the phone tomorrow -- this would probably  
> be the best way to start.
> 
> We will need a signed Open MPI 3rd party contribution agreement from  
> either you and/or Chelsio (whoever owns the intellectual property  
> that will be contributed).  See http://www.open-mpi.org/community/ 
> contribute/.
> 
>> I'm CCing devel at openmpi.org in case anyone else is interested in
>> helping.  Chelsio can provide rnic HW...
> 
> Anyone else here interested?  Free hardware!  :-)

Hmm I'm interested.  I've already done some work switching over to RDMA 
CM for some research stuff I've been doing; it's not publicly accessible 
w/o the 3rd party agreement.  I can help answer questions on what 
exactly needs to change, and do some testing.

Andrew


From rdreier at cisco.com  Mon May  7 19:40:29 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 07 May 2007 19:40:29 -0700
Subject: [ofa-general] [last RFC] mlx4 (Mellanox ConnectX adapter) InfiniBand
	drivers
Message-ID: <adaejls597m.fsf@cisco.com>

I've added my InfiniBand drivers for the new Mellanox ConnectX adapter
to what's queued up for 2.6.22 in:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-2.6.22

This is still a new driver, with some things missing and undoubtedly
some bugs and opportunities for cleanup, but I trust myself to keep
improving the driver even after it's upstream.  Unless I hear a good
reason why I shouldn't, I'll ask Linus to pull this tomorrow.

I received no responses to my earlier posts, so I'm not going to spam
everyone with a big patch series again.  But here's the diffstat at
least -- if you want to see details, just pull the git URL above.

commit 0c2f16963d60c30920ee4fb3c900ae29d6ed0f74
Author: Roland Dreier <rolandd at cisco.com>
Date:   Mon May 7 15:48:06 2007 -0700

    IB/mlx4: Add a driver for Mellanox ConnectX InfiniBand adapters
    
    Add an InfiniBand driver for Mellanox ConnectX adapters.  Because
    these adapters can also be used as ethernet NICs and Fibre Channel
    HBAs, the driver is split into two modules:
    
      mlx4_core: Handles low-level things like device initialization and
        processing firmware commands.  Also controls resource allocation
        so that the InfiniBand, ethernet and FC functions can share a
        device without stepping on each other.
    
      mlx4_ib: Handles InfiniBand-specific things; plugs into the
        InfiniBand midlayer.
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

 drivers/infiniband/Kconfig            |    2 +
 drivers/infiniband/Makefile           |    1 +
 drivers/infiniband/hw/mlx4/Kconfig    |    9 +
 drivers/infiniband/hw/mlx4/Makefile   |    3 +
 drivers/infiniband/hw/mlx4/ah.c       |  100 +++
 drivers/infiniband/hw/mlx4/cq.c       |  525 +++++++++++++
 drivers/infiniband/hw/mlx4/doorbell.c |  216 ++++++
 drivers/infiniband/hw/mlx4/mad.c      |  339 +++++++++
 drivers/infiniband/hw/mlx4/main.c     |  651 +++++++++++++++++
 drivers/infiniband/hw/mlx4/mlx4_ib.h  |  285 ++++++++
 drivers/infiniband/hw/mlx4/mr.c       |  184 +++++
 drivers/infiniband/hw/mlx4/qp.c       | 1294 +++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/mlx4/srq.c      |  334 +++++++++
 drivers/infiniband/hw/mlx4/user.h     |   92 +++
 drivers/net/Kconfig                   |   14 +
 drivers/net/Makefile                  |    1 +
 drivers/net/mlx4/Makefile             |    4 +
 drivers/net/mlx4/alloc.c              |  179 +++++
 drivers/net/mlx4/catas.c              |   70 ++
 drivers/net/mlx4/cmd.c                |  429 +++++++++++
 drivers/net/mlx4/cq.c                 |  254 +++++++
 drivers/net/mlx4/eq.c                 |  696 ++++++++++++++++++
 drivers/net/mlx4/fw.c                 |  775 ++++++++++++++++++++
 drivers/net/mlx4/fw.h                 |  167 +++++
 drivers/net/mlx4/icm.c                |  379 ++++++++++
 drivers/net/mlx4/icm.h                |  135 ++++
 drivers/net/mlx4/intf.c               |  165 +++++
 drivers/net/mlx4/main.c               |  936 ++++++++++++++++++++++++
 drivers/net/mlx4/mcg.c                |  380 ++++++++++
 drivers/net/mlx4/mlx4.h               |  348 +++++++++
 drivers/net/mlx4/mr.c                 |  479 ++++++++++++
 drivers/net/mlx4/pd.c                 |  102 +++
 drivers/net/mlx4/profile.c            |  238 ++++++
 drivers/net/mlx4/qp.c                 |  273 +++++++
 drivers/net/mlx4/reset.c              |  181 +++++
 drivers/net/mlx4/srq.c                |  227 ++++++
 include/linux/mlx4/cmd.h              |  178 +++++
 include/linux/mlx4/cq.h               |  123 ++++
 include/linux/mlx4/device.h           |  331 +++++++++
 include/linux/mlx4/doorbell.h         |   97 +++
 include/linux/mlx4/driver.h           |   59 ++
 include/linux/mlx4/qp.h               |  288 ++++++++
 include/linux/mlx4/srq.h              |   42 ++
 43 files changed, 11585 insertions(+), 0 deletions(-)


From benh at kernel.crashing.org  Mon May  7 20:21:56 2007
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 08 May 2007 13:21:56 +1000
Subject: [ofa-general] Incorrect atomic usage in ipath driver
Message-ID: <1178594516.14928.62.camel@localhost.localdomain>

Hi !

So I see this construct:

	/* There is already a thread processing this queue. */
	if (test_and_set_bit(0, &dd->ipath_rcv_pending))
		goto bail;

	.../...

done:
	clear_bit(0, &dd->ipath_rcv_pending);
	smp_mb__after_clear_bit();

So that's basically an attempt at doing a spinlock. The problem is your
barrier is wrong at the end. Better would be:


done:
	smp_mb__before_clear_bit();
	clear_bit(0, &dd->ipath_rcv_pending);


Though it's still less optimal than doing:

	if (!spin_trylock(...))
		goto bail;

	.../...

done:
	spin_unlock(...)

If you really want to stick to bitops, then you may want to look at
Nick's upcoming patches adding some bitops with appropriate lock
semantics.
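
To make the ordering requirement concrete outside the kernel, here is a
minimal userspace sketch using C11 atomics instead of the kernel bitops.
It is only an analogy: rcv_pending, queue_passes and process_rcv_queue()
are made-up names, not ipath code. The acquire ordering on the
test-and-set and the release ordering on the clear give the lock/unlock
semantics that the barrier placed before clear_bit() is meant to provide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for bit 0 of dd->ipath_rcv_pending. */
static atomic_flag rcv_pending = ATOMIC_FLAG_INIT;

/* Data protected by the flag; only touched while the flag is held. */
static int queue_passes;

/*
 * Returns true if this caller processed the queue, false if another
 * thread was already processing it.
 */
static bool process_rcv_queue(void)
{
	/* Acquire: accesses below cannot be reordered above the "lock". */
	if (atomic_flag_test_and_set_explicit(&rcv_pending,
					      memory_order_acquire))
		return false;

	queue_passes++;

	/*
	 * Release: accesses above are ordered before the flag is cleared,
	 * so the next owner sees them. Putting the barrier after the
	 * clear would not guarantee that.
	 */
	atomic_flag_clear_explicit(&rcv_pending, memory_order_release);
	return true;
}
```

A release store is a one-way barrier, which is weaker than a full
smp_mb(), but it is what an unlock operation needs.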

Cheers,
Ben.


From mst at dev.mellanox.co.il  Mon May  7 22:07:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 08:07:00 +0300
Subject: [ofa-general] memory leak in cm.c?
Message-ID: <20070508050700.GI22341@mellanox.co.il>

Hi!
I applied the following patch to cm.c, and it crashed after
some duplicate reqs were detected. Does this indicate a
memory leak in cm?

---

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 842cd0b..3f95eae 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -790,6 +790,7 @@ retest:
 		cm_free_work(work);
 	kfree(cm_id_priv->compare_data);
 	kfree(cm_id_priv->private_data);
+	BUG_ON(cm_id_priv->timewait_info);
 	kfree(cm_id_priv);
 }
 

-- 
MST


From eli at mellanox.co.il  Mon May  7 23:47:26 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Tue, 08 May 2007 09:47:26 +0300
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts
In-Reply-To: <adar6ps60zn.fsf@cisco.com>
References: <1178551555.17477.0.camel@mtls03>  <adar6ps60zn.fsf@cisco.com>
Message-ID: <1178606876.17477.15.camel@mtls03>

On Mon, 2007-05-07 at 09:40 -0700, Roland Dreier wrote:
> Thanks... should we optimize out the
> 
> 	if (eqes_found)
> 		eq_set_ci(eq, 1);
> 
> at the end of mlx4_eq_int() now?
I think we should


>   Actually the best fix is probably:
> 
> diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c
> index 8d641b8..acf1c80 100644
> --- a/drivers/net/mlx4/eq.c
> +++ b/drivers/net/mlx4/eq.c
> @@ -249,8 +249,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
>  		}
>  	}
>  
> -	if (eqes_found)
> -		eq_set_ci(eq, 1);
> +	eq_set_ci(eq, 1);
>  
>  	return eqes_found;
>  }
> 
This will not ensure arming all EQs for all interrupts and we will face
the same problem of losing interrupts.

Index: connectx_kernel/drivers/net/mlx4/eq.c
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/eq.c	2007-05-06 17:34:12.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/eq.c	2007-05-08 09:37:50.000000000 +0300
@@ -256,9 +256,6 @@ static int mlx4_eq_int(struct mlx4_dev *
 		}
 	}
 
-	if (eqes_found)
-		eq_set_ci(eq, 1);
-
 	return eqes_found;
 }
 
@@ -266,13 +263,17 @@ static irqreturn_t mlx4_interrupt(int ir
 {
 	struct mlx4_dev *dev = dev_ptr;
 	struct mlx4_priv *priv = mlx4_priv(dev);
+	struct mlx4_eq *eq;
 	int work = 0;
 	int i;
 
 	writel(priv->eq_table.clr_mask, priv->eq_table.clr_int);
 
-	for (i = 0; i < MLX4_EQ_CATAS; ++i)
-		work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]);
+	for (i = 0; i < MLX4_EQ_CATAS; ++i) {
+		eq = &priv->eq_table.eq[i];
+		work |= mlx4_eq_int(dev, eq);
+		eq_set_ci(eq, 1);
+	}
 
 	return IRQ_RETVAL(work);
 }
@@ -283,6 +284,7 @@ static irqreturn_t mlx4_msi_x_interrupt(
 	struct mlx4_dev *dev = eq->dev;
 
 	mlx4_eq_int(dev, eq);
+	eq_set_ci(eq, 1);
 
 	/* MSI-X vectors always belong to us */
 	return IRQ_HANDLED;


> because it seems sort of strange if we ever don't rearm the EQ on an
> MSI-X interrupt.
> 
> What do you think?

Actually I think the following patch can do the work and is similar to
what we did for mthca/Hermon


> 
> On the other hand, this patch (and your patch) rearms the EQ on shared
> interrupts for other devices too.  Can't be helped I guess.
> 
>  - R.
> 


From eli at mellanox.co.il  Tue May  8 02:37:22 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Tue, 08 May 2007 12:37:22 +0300
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_core:  fix qp free sync
Message-ID: <1178617072.17477.45.camel@mtls03>

Fix the missing initialization of the free completion for the qp, and use
logic similar to cq when closing the qp. The problem first shows up when
using qp events: complete() attempts to acquire an uninitialized
spinlock.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: connectx_kernel/drivers/net/mlx4/qp.c
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/qp.c	2007-05-07 17:48:17.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/qp.c	2007-05-08 12:07:33.000000000 +0300
@@ -185,6 +185,9 @@ int mlx4_qp_alloc(struct mlx4_dev *dev, 
 	if (err)
 		goto err_put_cmpt;
 
+	atomic_set(&qp->refcount, 1);
+	init_completion(&qp->free);
+
 	return 0;
 
 err_put_cmpt:
@@ -225,6 +228,10 @@ void mlx4_qp_free(struct mlx4_dev *dev, 
 {
 	struct mlx4_qp_table *qp_table = &mlx4_priv(dev)->qp_table;
 
+	if (atomic_dec_and_test(&qp->refcount))
+		complete(&qp->free);
+	wait_for_completion(&qp->free);
+
 	mlx4_table_put(dev, &qp_table->cmpt_table, qp->qpn);
 	mlx4_table_put(dev, &qp_table->rdmarc_table, qp->qpn);
 	mlx4_table_put(dev, &qp_table->altc_table, qp->qpn);
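
The refcount-plus-completion pattern used in the patch above can be
sketched in userspace roughly as follows. This is an illustration under
assumptions, not driver code: struct completion is emulated with a
pthread mutex and condition variable, and toy_qp, toy_qp_get() and
toy_qp_put() are hypothetical names. The object starts with one
reference; the free path drops that reference and then waits until the
last holder (for example an event handler) has dropped its own.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct completion. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool            done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Hypothetical object mirroring the two fields the patch initializes. */
struct toy_qp {
	atomic_int        refcount;
	struct completion free;
};

static void toy_qp_alloc(struct toy_qp *qp)
{
	/* Both must be set up before any event can reference the qp. */
	atomic_init(&qp->refcount, 1);
	init_completion(&qp->free);
}

/* Taken by a user of the qp, e.g. an event handler. */
static void toy_qp_get(struct toy_qp *qp)
{
	atomic_fetch_add(&qp->refcount, 1);
}

/* Dropping the last reference signals the waiter in toy_qp_free(). */
static void toy_qp_put(struct toy_qp *qp)
{
	if (atomic_fetch_sub(&qp->refcount, 1) == 1)
		complete(&qp->free);
}

static void toy_qp_free(struct toy_qp *qp)
{
	toy_qp_put(qp);                  /* drop the initial reference */
	wait_for_completion(&qp->free);  /* wait for outstanding users */
}
```

If init_completion() is skipped, complete() operates on an
uninitialized lock, which corresponds to the failure described in the
patch description.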


From jackm at dev.mellanox.co.il  Tue May  8 02:38:41 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 8 May 2007 12:38:41 +0300
Subject: [ofa-general] no SRQ empty check in libmthca and in mlx2 kernel
	modules
Message-ID: <200705081238.41255.jackm@dev.mellanox.co.il>

It looks to me like there is no check for "no more available WQEs" when posting
SRQ reads. See libmlx4/src/srq.c and drivers/infiniband/hw/mlx4/srq.c.
There is no check in either place if srq_head = srq_tail, or some equivalent check.
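
To illustrate the kind of check I mean, here is a small standalone
sketch. It is not the libmlx4 or mlx4 code; it assumes the free WQEs are
kept on a linked free list with one entry always held in reserve, so
that head == tail means "nothing left to post". All names (toy_srq,
toy_srq_post_recv, toy_srq_free_wqe) are made up for the example.

```c
#include <assert.h>
#include <errno.h>

#define TOY_SRQ_SIZE 4

struct toy_srq {
	int next_free[TOY_SRQ_SIZE]; /* next_free[i]: free WQE after i */
	int head;                    /* next WQE to hand out */
	int tail;                    /* last free WQE, kept in reserve */
};

static void toy_srq_init(struct toy_srq *srq)
{
	for (int i = 0; i < TOY_SRQ_SIZE; i++)
		srq->next_free[i] = (i + 1 < TOY_SRQ_SIZE) ? i + 1 : -1;
	srq->head = 0;
	srq->tail = TOY_SRQ_SIZE - 1;
}

/* Returns the index of the WQE used, or -ENOMEM if the SRQ is full. */
static int toy_srq_post_recv(struct toy_srq *srq)
{
	int wqe;

	/* The check in question: no more available WQEs. */
	if (srq->head == srq->tail)
		return -ENOMEM;

	wqe = srq->head;
	srq->head = srq->next_free[wqe];
	return wqe;
}

/* Called when a receive completes and its WQE can be reused. */
static void toy_srq_free_wqe(struct toy_srq *srq, int wqe)
{
	srq->next_free[srq->tail] = wqe;
	srq->next_free[wqe] = -1;
	srq->tail = wqe;
}
```

Without the head == tail test, a post on a full queue would follow a
stale link and overwrite a WQE that the hardware still owns.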

- Jack


From vlad at lists.openfabrics.org  Tue May  8 02:38:12 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue,  8 May 2007 02:38:12 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070508-0200 daily build status
Message-ID: <20070508093812.9A193E603C1@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:
Build failed on i686 with linux-2.6.21.1


From tziporet at dev.mellanox.co.il  Tue May  8 05:40:02 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 08 May 2007 15:40:02 +0300
Subject: [ofa-general] [PATCH 0/3] rdma/cm: cleanup device
	removal	synchronization
In-Reply-To: <000101c790d7$1642e680$8698070a@amr.corp.intel.com>
References: <000101c790d7$1642e680$8698070a@amr.corp.intel.com>
Message-ID: <46406FA2.9060802@mellanox.co.il>

Sean Hefty wrote:
> Here's a couple of patches that make the device removal synchronization
> in the rdma_cm a little more explicit, along with one fix to add in
> missing synchronization.
>
> With these patches, it's now possible to call rdma_disconnect() after
> receiving a device removal event.
>
> I plan on pushing these changes to my git tree and request that they
> be pulled into 2.6.22 within the next couple of days if there are no
> issues.
>
>   
Hi Sean,
Do you think we want these for OFED 1.2 too?
If yes, please prepare patches against the OFED 1.2 git tree too.

Thanks,
Tziporet


From jsquyres at cisco.com  Tue May  8 06:16:57 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 8 May 2007 09:16:57 -0400
Subject: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work
In-Reply-To: <464044D4.5010501@lfbs.rwth-aachen.de>
References: <462E13A6.3030207@lfbs.rwth-aachen.de> <462E1DFE.5010703@Sun.COM>
	<46305D0A.5020900@lfbs.rwth-aachen.de> <4630EFDE.8070608@Sun.COM>
	<464044D4.5010501@lfbs.rwth-aachen.de>
Message-ID: <054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com>

I'm forwarding this to the OpenFabrics general list -- as it just
came up the other day, we know that Open MPI's UDAPL support works on
Solaris, but we have done little/no testing of it on OFED (I
personally know almost nothing about UDAPL).

Can the UDAPL OFED wizards shed any light on the error messages that
are listed below?  In particular, these seem to be worrisome:

>  setup_listener Permission denied
>  setup_listener Address already in use
and
>  create_qp Address already in use

Thanks...


On May 8, 2007, at 5:37 AM, Boris Bierbaum wrote:

> Hi,
>
> we (my colleague Andreas and I) are still trying to solve this problem.
> I have compiled some additional information; maybe somebody has an idea
> about what's going on.
>
> OS: Debian GNU/Linux 4.0, Kernel 2.6.18, x86, 32-Bit
> IB software: OFED 1.1
> SM: OpenSM from OFED 1.1
> uDAPL: DAPL reference implementation version gamma 3.02 (using DAPL from
> OFED 1.1 doesn't change anything, I suppose it's the same code, at least
> roughly)
> Test program: Intel MPI Benchmarks Version 2.3
> OpenMPI version: 1.2.1
>
> Running OpenMPI directly over IB verbs (mpirun --mca btl self,sm,openib
> ...) works. Here's the output of ibv_devinfo and ifconfig for the two
> nodes on which we tried to run the benchmark (ulimit -l is unlimited on
> both machines):
>
> ------------ 1st node -------------------------------
>
> boris at pd-04:/work/boris/IMB_2.3/src$ /opt/infiniband/bin/ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         1.2.0
>         node_guid:                      0002:c902:0020:b528
>         sys_image_guid:                 0002:c902:0020:b52b
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25204
>         hw_ver:                         0xA0
>         board_id:                       MT_0230000001
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 1
>                         port_lid:               9
>                         port_lmc:               0x00
>
> boris at pd-04:/work/boris/IMB_2.3/src$ /sbin/ifconfig
>
> ...
>
> ib0       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
>           inet addr:192.168.0.14  Bcast:192.168.0.255  Mask:255.255.255.0
>           inet6 addr: fe80::202:c902:20:b529/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>           RX packets:67 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:16 errors:0 dropped:2 overruns:0 carrier:0
>           collisions:0 txqueuelen:128
>           RX bytes:3752 (3.6 KiB)  TX bytes:968 (968.0 b)
>
> ...
>
> ------------ 2nd node -------------------------------
>
> boris at pd-05:~$  /opt/infiniband/bin/ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         1.2.0
>         node_guid:                      0002:c902:0020:b4f4
>         sys_image_guid:                 0002:c902:0020:b4f7
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25204
>         hw_ver:                         0xA0
>         board_id:                       MT_0230000001
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 1
>                         port_lid:               10
>                         port_lmc:               0x00
>
> boris at pd-05:~$ /sbin/ifconfig
>
> ...
>
> ib0       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
>           inet addr:192.168.0.15  Bcast:192.168.0.255  Mask:255.255.255.0
>           inet6 addr: fe80::202:c902:20:b4f5/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>           RX packets:67 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:18 errors:0 dropped:2 overruns:0 carrier:0
>           collisions:0 txqueuelen:128
>           RX bytes:3752 (3.6 KiB)  TX bytes:1088 (1.0 KiB)
>
>
> ...
>
> -------------------------------------------------------------------------
>
>
> Here's the output from the failed run, with every DAT and DAPL debug
> output enabled:
>
>
>
> boris at pd-04:/work/boris/IMB_2.3/src$ mpirun -np 2 -x DAT_DBG_TYPE -x
> DAPL_DBG_TYPE -x DAT_OVERRIDE --mca btl self,sm,udapl --host  
> pd-04,pd-05
> /work/boris/IMB_2.3/src/IMB-MPI1 pingpong
> DAT Registry: Started (dat_init)
> DAT Registry: static registry file
> </home/boris/dapl_on_dope_gamma3.2/doc/dat.conf>
>
> DAT Registry: token
>  type  string
>  value <OpenIB-cma>
>
>
> DAT Registry: token
>  type  string
>  value <u1.2>
>
>
> DAT Registry: token
>  type  string
>  value <nonthreadsafe>
>
>
> DAT Registry: token
>  type  string
>  value <default>
>
>
> DAT Registry: token
>  type  string
>  value
> </home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/ 
> libdapl_openib_cma.so>
>
>
> DAT Registry: token
>  type  string
>  value <mv_dapl.1.2>
>
>
> DAT Registry: token
>  type  string
>  value <ib0 0>
>
>
> DAT Registry: token
>  type  string
>  value <>
>
>
> DAT Registry: token
>  type  eor
>  value <>
>
>
> DAT Registry: entry
>  ia_name OpenIB-cma
>  api_version
>      type 0x0
>      major.minor 1.2
>  is_thread_safe 0
>  is_default 1
>  lib_path
> /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/ 
> libdapl_openib_cma.so
>  provider_version
>      id mv_dapl
>      major.minor 1.2
>  ia_params ib0 0
>
> DAT Registry: loading provider for OpenIB-cma
>
> DAT Registry: token
>  type  eof
>  value <>
>
> DAT Registry: dat_registry_list_providers () called
> DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called
> DAT Registry: IA OpenIB-cma, trying to load library
> /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/ 
> libdapl_openib_cma.so
> DAPL: NOT Setting Loopback
>  dapl_ib_init:
> DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0)
>  open_hca: ib0 - 0x807cf28
>  ib_thread_init(17919)
>  ib_thread_init: waiting for ib_thread
>  ib_thread(17919,0xa7b08bb0): ENTER: pipe 8 ucma 12
> DAT Registry: Started (dat_init)
> DAT Registry: static registry file
> </home/boris/dapl_on_dope_gamma3.2/doc/dat.conf>
>
> DAT Registry: token
>  type  string
>  value <OpenIB-cma>
>
>
> DAT Registry: token
>  type  string
>  value <u1.2>
>
>
> DAT Registry: token
>  type  string
>  value <nonthreadsafe>
>
>
> DAT Registry: token
>  type  string
>  value <default>
>
>
> DAT Registry: token
>  type  string
>  value
> </home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/ 
> libdapl_openib_cma.so>
>
>
> DAT Registry: token
>  type  string
>  value <mv_dapl.1.2>
>
>
> DAT Registry: token
>  type  string
>  value <ib0 0>
>
>
> DAT Registry: token
>  type  string
>  value <>
>
>
> DAT Registry: token
>  type  eor
>  value <>
>
>
> DAT Registry: entry
>  ia_name OpenIB-cma
>  api_version
>      type 0x0
>      major.minor 1.2
>  is_thread_safe 0
>  is_default 1
>  lib_path
> /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/ 
> libdapl_openib_cma.so
>  provider_version
>      id mv_dapl
>      major.minor 1.2
>  ia_params ib0 0
>
> DAT Registry: loading provider for OpenIB-cma
>
> DAT Registry: token
>  type  eof
>  value <>
>
> DAT Registry: dat_registry_list_providers () called
> DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called
> DAT Registry: IA OpenIB-cma, trying to load library
> /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/ 
> libdapl_openib_cma.so
>  ib_thread_init(17919) exit
> DAPL: NOT Setting Loopback
>  dapl_ib_init:
> DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0)
>  open_hca: ib0 - 0x807cf18
>  ib_thread_init(12326)
>  ib_thread_init: waiting for ib_thread
>  ib_thread(12326,0xa7b75bb0): ENTER: pipe 8 ucma 12
>  ib_thread_init(12326) exit
>  getipaddr: family 2 port 0 addr 192.168.0.14
>  open_hca: ctx=0x809ecd0 port=1 GID subnet fe80000000000000 id
> 0002c9020020b529
>  open_hca: ib0, AF_INET 192.168.0.14 INLINE_MAX=128
>  ib_thread(17919) poll_event:  async=0x1 pipe=0x1 cm=0x0 cq=0x0
>  ib_thread(17919) poll_fd: hca[134729592]=0xb, async=8 pipe=12  
> cm=13 cq=d
>  query_hca: ib0 AF_INET  192.168.0.14
>  query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
>  query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0  
> rd_io 4
>  setup_async_cb: ia 0x80a1648 type 0 hdl (nil) cb 0xa7b1ec6c ctx  
> 0x80a16d0
>  setup_async_cb: ia 0x80a1648 type 1 hdl (nil) cb 0xa7b1e9c0 ctx  
> 0x80a16d0
>  setup_async_cb: ia 0x80a1648 type 3 hdl (nil) cb 0xa7b1eb50 ctx  
> 0x80a1648
> dat_set_handle 0x80a1648 to 1
> dat_get_ia_handle from 1 to 0x80a1648
>  pd_alloc: pd_handle=0x80a1928
> dat_get_ia_handle from 1 to 0x80a1648
>  query_hca: ib0 AF_INET  192.168.0.14
>  query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
>  query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0  
> rd_io 4
> dat_get_ia_handle from 1 to 0x80a1648
>  cq_object_create: (0x80a1958,0x80a1a44)
> dapls_ib_cq_alloc: evd 0x80a1958 cqlen=32
> dapls_ib_cq_alloc: new_cq 0x80a1a68 cqlen=63
>  setup_async_cb: ia 0x80a1648 type 2 hdl 0x80a1958 cb 0xa7b1f174 ctx
> 0x80a1958
> dat_get_ia_handle from 1 to 0x80a1648
> dat_get_ia_handle from 1 to 0x80a1648
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Address already in use
>  listen(ia_ptr 0x80a1648 SID 1025 sp 0x80a7a00 conn 0x80a7a70 id  
> 134904736)
>  listen(conn=0x80a7a70 cm_id=134904736)
> dat_get_ia_handle from 1 to 0x80a1648
>  mr_register: ia=0x80a1648, lmr=0x80a3718 va=0x80ae000 ln=266240  
> pv=0x0
>  mr_register: mr=0x80a37c8 h 4 pd 0x80a1928 ctx 0x809ecd0
> lkey=0x72002700 rkey=0x72002700 priv=41000
> dat_get_ia_handle from 1 to 0x80a1648
>  mr_register: ia=0x80a1648, lmr=0x80a7f18 va=0x80ef000 ln=528384  
> pv=0x0
>  mr_register: mr=0x80a7fc8 h 5 pd 0x80a1928 ctx 0x809ecd0
> lkey=0xf2002800 rkey=0xf2002800 priv=81000
>  getipaddr: family 2 port 0 addr 192.168.0.15
>  open_hca: ctx=0x809ecc0 port=1 GID subnet fe80000000000000 id
> 0002c9020020b4f5
>  open_hca: ib0, AF_INET 192.168.0.15 INLINE_MAX=128
>  ib_thread(12326) poll_event:  async=0x1 pipe=0x1 cm=0x0 cq=0x0
>  ib_thread(12326) poll_fd: hca[134729576]=0xb, async=8 pipe=12  
> cm=13 cq=d
>  query_hca: ib0 AF_INET  192.168.0.15
>  query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
>  query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0  
> rd_io 4
>  setup_async_cb: ia 0x80a1638 type 0 hdl (nil) cb 0xa7b8bc6c ctx  
> 0x80a16c0
>  setup_async_cb: ia 0x80a1638 type 1 hdl (nil) cb 0xa7b8b9c0 ctx  
> 0x80a16c0
>  setup_async_cb: ia 0x80a1638 type 3 hdl (nil) cb 0xa7b8bb50 ctx  
> 0x80a1638
> dat_set_handle 0x80a1638 to 1
> dat_get_ia_handle from 1 to 0x80a1638
>  pd_alloc: pd_handle=0x80a1918
> dat_get_ia_handle from 1 to 0x80a1638
>  query_hca: ib0 AF_INET  192.168.0.15
>  query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
>  query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0  
> rd_io 4
> dat_get_ia_handle from 1 to 0x80a1638
>  cq_object_create: (0x80a1948,0x80a1a34)
> dapls_ib_cq_alloc: evd 0x80a1948 cqlen=32
> dapls_ib_cq_alloc: new_cq 0x80a1a58 cqlen=63
>  setup_async_cb: ia 0x80a1638 type 2 hdl 0x80a1948 cb 0xa7b8c174 ctx
> 0x80a1948
> dat_get_ia_handle from 1 to 0x80a1638
> dat_get_ia_handle from 1 to 0x80a1638
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  listen(ia_ptr 0x80a1638 SID 1024 sp 0x80a7a00 conn 0x80a7a70 id  
> 134904736)
>  listen(conn=0x80a7a70 cm_id=134904736)
> dat_get_ia_handle from 1 to 0x80a1638
>  mr_register: ia=0x80a1638, lmr=0x80a3708 va=0x80ae000 ln=266240  
> pv=0x0
>  mr_register: mr=0x80a37b8 h 1 pd 0x80a1918 ctx 0x809ecc0
> lkey=0x60002400 rkey=0x60002400 priv=41000
> dat_get_ia_handle from 1 to 0x80a1638
>  mr_register: ia=0x80a1638, lmr=0x80a7ee8 va=0x80ef000 ln=528384  
> pv=0x0
>  mr_register: mr=0x80a7f98 h 2 pd 0x80a1918 ctx 0x809ecc0
> lkey=0x60002500 rkey=0x60002500 priv=81000
> #---------------------------------------------------
> #    Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
> #---------------------------------------------------
> # Date       : Tue May  8 11:16:58 2007
> # Machine    : i686
> # System     : Linux
> # Release    : 2.6.18
> # Version    : #1 SMP Tue Nov 14 18:02:03 CET 2006
>
> #
> # Minimum message length in bytes:   0
> # Maximum message length in bytes:   16777216
> #
> # MPI_Datatype                   :   MPI_BYTE
> # MPI_Datatype for reductions    :   MPI_FLOAT
> # MPI_Op                         :   MPI_SUM
> #
> #
>
> # List of Benchmarks to run:
>
> # PingPong
> dat_get_ia_handle from 1 to 0x80a1638
>  query_hca: MAX msg 2147483648 dto 16384 iov 30 rdma i4,o4
>  qp_alloc: ia_ptr 0x80a1638 ep_ptr 0x81741f8 ep_ctx_ptr 0x81741f8
>  create_qp Address already in use
>
> -------------------------------------------------------------------------
>
> The job hangs at this point. From the output of another simple test
> program I assume that it hangs inside a receive operation. Of course,
> I have noticed the "Permission denied" messages, but I don't think that
> the problem is there. These messages seem to come from RDMA CM when
> things are set up, but the execution continues from there on and I have
> seen these messages on successful DAPL runs, too. I'm not very familiar
> with RDMA CM, though, so I don't know the cause of these messages.
>
> That's a lot of information, I know, but it would be great if someone
> would have a look at it.
>
> Thanks in advance
> Boris
>
>
>
> Donald Kerr wrote:
>> I have not tried Open MPI uDAPL on Linux nor do I have access to a Linux
>> box so I am having a difficult time trying to find a way to help you
>> debug this issue.
>>
>> -DON
>>
>> Andreas Kuntze wrote:
>>
>>> On Linux you needn't initialise the dat registry. Your program prints:
>>> "provider 1: OpenIB-cma". I successfully tested Intel MPI and mvapich2
>>> with uDAPL.
>>>
>>> Andreas
>>>
>>> Donald Kerr wrote:
>>>
>>>
>>>> Andreas,
>>>>
>>>> I am going to guess at a minimum the interfaces are up and you can
>>>> ping them.  On Solaris there is an additional step required and that
>>>> is initializing the dat registry. If "/usr/sbin/datadm -v" does not
>>>> show some driver output then you would need to run
>>>> "/usr/sbin/datadm -a /usr/share/dat/SUNWudaplt.conf". I don't know if
>>>> there is an equivalent on Linux.
>>>>
>>>> Attached is a simple udapl program which will check if the interfaces
>>>> are available in the dat registry.
>>>>
>>>> -DON
>>>>
>>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users at open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>
>
>
> -- 
> |  _  RWTH | Boris Bierbaum
> |_|_`_     | Lehrstuhl fuer Betriebssysteme
>    | |_) _  | RWTH Aachen D-52056 Aachen
>      |_)(_` | Tel: +49-241-80-27805
>         ._) | Fax: +49-241-80-22339
> <config.log.gz>
> <ompi_info.out.gz>


-- 
Jeff Squyres
Cisco Systems


From mst at dev.mellanox.co.il  Tue May  8 06:34:59 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 16:34:59 +0300
Subject: [ofa-general] Re: ofa_1_2_kernel 20070508-0200 daily build status
In-Reply-To: <20070508093812.9A193E603C1@openfabrics.org>
References: <20070508093812.9A193E603C1@openfabrics.org>
Message-ID: <20070508133459.GQ21591@mellanox.co.il>

> Failed:
> Build failed on i686 with linux-2.6.21.1

OK, there were some build failures in ipoib, rds and cxgb3.
I picked the ipoib and cxgb3 patches from 2.6.21 git,
and now it builds. We missed the 20070508 build, but the fix will be in
the next daily.

Steve, you might want to review
the patch under kernel_patches/backports/2.6.21/,
and/or test OFED there, on your hardware.

-- 
MST


From swise at opengridcomputing.com  Tue May  8 06:47:09 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 08 May 2007 08:47:09 -0500
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <BB132B4A-DED0-41B4-8B44-6ADAB5F9740B@cisco.com>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
	<1178578346.30571.183.camel@stevo-desktop>
	<BB132B4A-DED0-41B4-8B44-6ADAB5F9740B@cisco.com>
Message-ID: <1178632029.3056.3.camel@stevo-desktop>

On Mon, 2007-05-07 at 20:39 -0400, Jeff Squyres wrote:
> On May 7, 2007, at 6:52 PM, Steve Wise wrote:
> 
> > Also, there appears to be a DAPL BTL in OMPI.  Is this BTL complete and
> > enabled for the ofed-1.2 udapl library?
> 
> Yes, it is complete and is well-tested in Solaris.
> 
> It is not well tested in Linux/OFED (we've been concentrating on the
> verbs interface on the OFED side of things -- the "openib" BTL [we
> never renamed it when OpenIB changed names to OpenFabrics]).  In
> fact, we've had scattered reports of it not working properly in
> Linux/OFED, but those could well have been pilot error (i.e., me not
> trying to run properly -- I know just about zilch about udapl).
> 

The reason I'm asking is twofold:  

1) this can get OMPI running on iwarp devices today if it works.

2) the udapl code can be a model for the rdma-cm piece, since the two
are similar (client/server connection setup, ipaddr/port based, etc.)...

I'll try it out on T3.


Steve.


From yosefe at voltaire.com  Tue May  8 06:54:52 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 16:54:52 +0300
Subject: [ofa-general] [PATCHv3 0/2] pkey change handling - fix bug #577
Message-ID: <4640812C.6060003@voltaire.com>

These two patches fix bug #577: PKey table reordering caused by SM failover stops ipoib traffic
patch 1: add uncached device queries to core
patch 2: restart ipoib qp on pkey change event, and use uncached queries on qp init

--


From yosefe at voltaire.com  Tue May  8 07:03:34 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 17:03:34 +0300
Subject: [ofa-general] [PATCHv2 1/2] core: uncached "find gid" and "find
	pkey" queries
In-Reply-To: <4640812C.6060003@voltaire.com>
References: <4640812C.6060003@voltaire.com>
Message-ID: <46408336.8080908@voltaire.com>

Add ib_find_gid and ib_find_pkey over uncached device queries.
The calls might block but the returns are always up-to-date.


Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/device.c |   96 +++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h          |   23 +++++++++
 2 files changed, 119 insertions(+)

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-07 15:42:19.000000000 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-08 11:16:35.049600754 +0300
@@ -149,6 +149,18 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
+}
+
 /**
  * ib_alloc_device - allocate an IB device struct
  * @size:size of structure to allocate
@@ -592,6 +604,90 @@ int ib_modify_port(struct ib_device *dev
 }
 EXPORT_SYMBOL(ib_modify_port);
 
+/**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		       u8 *port_num, u16 *index)
+{
+	struct ib_port_attr *tprops = NULL;
+	union ib_gid tmp_gid;
+	int ret, port, i;
+
+	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
+
+	for (port = start_port(device); port <= end_port(device); ++port) {
+		ret = ib_query_port(device, port, tprops);
+		if (ret)
+			continue;
+
+		for (i = 0; i < tprops->gid_tbl_len; ++i) {
+			ret = ib_query_gid(device, port, i, &tmp_gid);
+			if (ret)
+				goto out;
+			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
+				*port_num = port;
+				*index = i;
+				ret = 0;
+				goto out;
+			}
+		}
+	}
+	ret = -ENOENT;
+out:
+	kfree(tprops);
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_gid);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index)
+{
+	struct ib_port_attr *tprops = NULL;
+	int ret, i;
+	u16 tmp_pkey;
+
+	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
+
+	ret = ib_query_port(device, port_num, tprops);
+	if (ret) {
+		printk(KERN_WARNING "ib_query_port failed, ret = %d\n", ret);
+		goto out;
+	}
+
+	for (i = 0; i < tprops->pkey_tbl_len; ++i) {
+		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
+		if (ret)
+			goto out;
+
+		if (pkey == tmp_pkey) {
+			*index = i;
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = -ENOENT;
+
+out:
+	kfree(tprops);
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_pkey);
+
 static int __init ib_core_init(void)
 {
 	int ret;
Index: b/include/rdma/ib_verbs.h
===================================================================
--- a/include/rdma/ib_verbs.h	2007-05-07 15:41:13.000000000 +0300
+++ b/include/rdma/ib_verbs.h	2007-05-07 15:43:04.000000000 +0300
@@ -1134,6 +1134,29 @@ int ib_modify_port(struct ib_device *dev
 		   struct ib_port_modify *port_modify);
 
 /**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		       u8 *port_num, u16 *index);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index);
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *
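
The kernel-doc in the patch above says @index may be NULL, while the posted
ib_find_gid() stores through it unconditionally and uses the kmalloc() result
without a check. The following is a minimal userspace sketch of the same
search with both guards added. It is not the posted code: every mock_* name,
the fixed table sizes, and the in-memory table that stands in for
ib_query_port()/ib_query_gid() are assumptions made for illustration only.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_MAX_PORTS   2
#define MOCK_GID_TBL_LEN 4

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
union mock_gid { uint8_t raw[16]; };
struct mock_port_attr { int gid_tbl_len; };

struct mock_device {
	int phys_port_cnt;	/* ports are numbered 1..phys_port_cnt */
	union mock_gid gid_tbl[MOCK_MAX_PORTS][MOCK_GID_TBL_LEN];
};

/* Stand-in for ib_query_port(): reports the GID table length of a port. */
static int mock_query_port(const struct mock_device *dev, int port,
			   struct mock_port_attr *attr)
{
	if (port < 1 || port > dev->phys_port_cnt || port > MOCK_MAX_PORTS)
		return -EINVAL;
	attr->gid_tbl_len = MOCK_GID_TBL_LEN;
	return 0;
}

/* Same search shape as the patch, with the allocation and @index guarded. */
int mock_find_gid(const struct mock_device *dev, const union mock_gid *gid,
		  uint8_t *port_num, uint16_t *index)
{
	struct mock_port_attr *tprops;
	int ret, port, i;

	tprops = malloc(sizeof *tprops);
	if (!tprops)
		return -ENOMEM;

	for (port = 1; port <= dev->phys_port_cnt; ++port) {
		ret = mock_query_port(dev, port, tprops);
		if (ret)
			continue;

		for (i = 0; i < tprops->gid_tbl_len; ++i) {
			if (!memcmp(&dev->gid_tbl[port - 1][i], gid, sizeof *gid)) {
				*port_num = (uint8_t)port;
				if (index)	/* @index is optional */
					*index = (uint16_t)i;
				ret = 0;
				goto out;
			}
		}
	}
	ret = -ENOENT;
out:
	free(tprops);
	return ret;
}
```

A caller that only needs the port can pass NULL for the last argument; a
caller that needs both passes the address of a u16-sized variable, as in the
patch.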


From yosefe at voltaire.com  Tue May  8 07:04:16 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 17:04:16 +0300
Subject: [ofa-general] [PATCHv2 1/2] ipoib: handle pkey change events
In-Reply-To: <4640812C.6060003@voltaire.com>
References: <4640812C.6060003@voltaire.com>
Message-ID: <46408360.3040006@voltaire.com>

This issue was found during partitioning & SM failover testing.

 * Add a flush flag to ipoib_ib_dev_stop(), like ipoib_ib_dev_down()
 * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
 * Upon PKEY_CHANGE event, schedule a work that restarts the QP
 * Restart child interfaces before parent
 * Use uncached pkey query upon qp initialization

SM reconfiguration or failover possibly causes a shuffling of the values in the port
pkey table. The current implementation only queries for the index of the pkey once,
when it creates the device QP and after that moves it into working state, and hence
does not address this scenario. Fix this by using the PKEY_CHANGE event as a trigger
to reconfigure the device QP.


Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |    6 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   62 +++++++++++++++++++++--------
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |    7 +--
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |   11 ++---
 4 files changed, 59 insertions(+), 27 deletions(-)

Index: b/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 12:13:39.481155747 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 12:15:14.716172776 +0300
@@ -202,11 +202,12 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
+	struct delayed_work pkey_poll_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
+	struct work_struct pkey_event_task;
 
 	struct ib_device *ca;
 	u8            	  port;
@@ -333,12 +334,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 12:13:39.481155747 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 12:57:20.842183673 +0300
@@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -481,7 +481,7 @@ int ipoib_ib_dev_down(struct net_device 
 	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
 		mutex_lock(&pkey_mutex);
 		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
+		cancel_delayed_work(&priv->pkey_poll_task);
 		mutex_unlock(&pkey_mutex);
 		if (flush)
 			flush_workqueue(ipoib_workqueue);
@@ -508,7 +508,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +581,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,13 +623,24 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
+	mutex_lock(&priv->vlan_mutex);
+
+	/* Flush any child interfaces */
+	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, restart_qp);
+
+	mutex_unlock(&priv->vlan_mutex);
+
+	/*
+	 * If the device is not initialized since it needs a pkey -
+	 * try to reopen it
+	 */
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
 		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
@@ -642,6 +654,12 @@ void ipoib_ib_dev_flush(struct work_stru
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (restart_qp) {
+		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
+			ipoib_ib_dev_stop(dev, 0);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +668,24 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	mutex_unlock(&priv->vlan_mutex);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_event_task);
+
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -685,7 +713,7 @@ void ipoib_ib_dev_cleanup(struct net_dev
 void ipoib_pkey_poll(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
+		container_of(work, struct ipoib_dev_priv, pkey_poll_task.work);
 	struct net_device *dev = priv->dev;
 
 	ipoib_pkey_dev_check_presence(dev);
@@ -696,7 +724,7 @@ void ipoib_pkey_poll(struct work_struct 
 		mutex_lock(&pkey_mutex);
 		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
+					   &priv->pkey_poll_task,
 					   HZ);
 		mutex_unlock(&pkey_mutex);
 	}
@@ -715,7 +743,7 @@ int ipoib_pkey_dev_delay_open(struct net
 		mutex_lock(&pkey_mutex);
 		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
 		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
+				   &priv->pkey_poll_task,
 				   HZ);
 		mutex_unlock(&pkey_mutex);
 		return 1;
Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 12:13:39.481155747 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 12:20:51.605085249 +0300
@@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev)
 		return -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 12:13:39.481155747 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 12:57:41.882456471 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
@@ -103,7 +101,7 @@ int ipoib_init_qp(struct net_device *dev
 	 * The port has to be assigned to the respective IB partition in
 	 * advance.
 	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
+	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
 	if (ret) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		return ret;
@@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler
 		container_of(handler, struct ipoib_dev_priv, event_handler);
 
 	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
 	     record->event == IB_EVENT_PORT_ACTIVE ||
 	     record->event == IB_EVENT_LID_CHANGE  ||
 	     record->event == IB_EVENT_SM_CHANGE   ||
@@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler
 	    record->element.port_num == priv->port) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
+		record->element.port_num == priv->port) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_event_task);
 	}
 }
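
The failure mode described in the log message can be modelled in a few lines
of userspace C: a QP that resolved its PKey index once keeps pointing at the
old slot after the table is reordered, and re-resolving the index on the
change event repairs it. This is only a sketch under stated assumptions: the
mock_* types and functions are hypothetical, the table is a plain array, and
none of this is the IPoIB code from the patch.

```c
#include <errno.h>
#include <stdint.h>

#define MOCK_PKEY_TBL_LEN 4

/* Hypothetical model of a port's PKey table; not a kernel structure. */
struct mock_port {
	uint16_t pkey_tbl[MOCK_PKEY_TBL_LEN];
};

/* Hypothetical model of the QP state that depends on the PKey index. */
struct mock_qp {
	uint16_t pkey;		/* partition the interface belongs to */
	uint16_t pkey_index;	/* index programmed into the QP */
};

/* Uncached lookup: always walks the current table contents. */
int mock_find_pkey(const struct mock_port *port, uint16_t pkey,
		   uint16_t *index)
{
	int i;

	for (i = 0; i < MOCK_PKEY_TBL_LEN; ++i) {
		if (port->pkey_tbl[i] == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -ENOENT;
}

/* Models QP init: resolve the index once and program it into the QP. */
int mock_qp_init(struct mock_qp *qp, const struct mock_port *port)
{
	return mock_find_pkey(port, qp->pkey, &qp->pkey_index);
}

/* Nonzero if the programmed index still selects the QP's own PKey. */
int mock_qp_index_valid(const struct mock_qp *qp,
			const struct mock_port *port)
{
	return port->pkey_tbl[qp->pkey_index] == qp->pkey;
}

/* Models the PKEY_CHANGE work item: re-resolve the index from scratch. */
int mock_pkey_change_event(struct mock_qp *qp, const struct mock_port *port)
{
	return mock_qp_init(qp, port);
}
```

Initialising the QP against a table, swapping two table entries, and then
calling mock_qp_index_valid() shows the stale index; calling
mock_pkey_change_event() afterwards brings the index back in line with the
table.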


From mst at dev.mellanox.co.il  Tue May  8 07:17:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 17:17:27 +0300
Subject: [ofa-general] libmlx4 wc flash
Message-ID: <20070508141727.GR21591@mellanox.co.il>

Roland,
	libmlx4 has this comment:

	/* FIXME flush wc buffers */

	and since it does *not* currently actually flush the buffers, if we
	enable WC for blueflame, WRs get mixed in the WC buffer, and the QP gets
	corrupted/stuck.

It seems we should have arch.h under mthca and stick
some macro like wc_wmb() in there.

Or, would infiniband/arch.h under libibverbs be a better place?

If WC is not enabled, userspace can avoid the flush - so, should we
return such a bit as part of kernel abi?
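
As a rough sketch of what such a macro and its use might look like: the name
wc_wmb() is taken from the text above, but the per-architecture bodies, the
post_wr_copy() helper, and the wc_enabled flag are assumptions for
illustration, not libmlx4 or libibverbs code. On x86_64 a store fence
(sfence) is used; elsewhere the sketch falls back to the compiler's
full-barrier builtin, which may not be sufficient on every architecture.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * wc_wmb(): make earlier write-combining stores visible before any
 * later stores.  The architecture choices here are assumptions.
 */
#if defined(__x86_64__)
#define wc_wmb() __asm__ __volatile__("sfence" ::: "memory")
#else
/* Conservative fallback: full barrier via the GCC/Clang builtin. */
#define wc_wmb() __sync_synchronize()
#endif

/*
 * Copy one work request into a buffer that is, hypothetically, mapped
 * write-combining, then flush so that the next request cannot be
 * combined with this one.  wc_enabled models the "is WC on?" bit that
 * the kernel could report; when it is zero the flush is skipped.
 */
void post_wr_copy(volatile uint32_t *dst, const uint32_t *src,
		  size_t words, int wc_enabled)
{
	size_t i;

	for (i = 0; i < words; ++i)
		dst[i] = src[i];

	if (wc_enabled)
		wc_wmb();
}
```

Keeping the flush behind a flag matches the question above: if the mapping is
not write-combining, the barrier is pure overhead on the fast path.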

-- 
MST


From swise at opengridcomputing.com  Tue May  8 07:32:35 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 08 May 2007 09:32:35 -0500
Subject: [ofa-general] Re: ofa_1_2_kernel 20070508-0200 daily build status
In-Reply-To: <20070508133459.GQ21591@mellanox.co.il>
References: <20070508093812.9A193E603C1@openfabrics.org>
	<20070508133459.GQ21591@mellanox.co.il>
Message-ID: <1178634755.3056.7.camel@stevo-desktop>

On Tue, 2007-05-08 at 16:34 +0300, Michael S. Tsirkin wrote:
> > Failed:
> > Build failed on i686 with linux-2.6.21.1
> 
> OK, there were some build failures in ipoib, rds and cxgb3.
> I picked the ipoib and cxgb3 patches from 2.6.21 git,
> and now it builds. We missed the 20070508 build, but the fix will be in
> the next daily.
> 
> Steve, you might want to review
> the patch under kernel_patches/backports/2.6.21/,
> and/or test OFED there, on your hardware.
> 

looks good.

steve.


From mst at dev.mellanox.co.il  Tue May  8 08:03:56 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 18:03:56 +0300
Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events
In-Reply-To: <46408360.3040006@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
Message-ID: <20070508150356.GT21591@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCHv2 1/2] ipoib: handle pkey change events
> 
> This issue was found during partitioning & SM failover testing.
> 
>  * Add a flush flag to ipoib_ib_dev_stop(), like ipoib_ib_dev_down()
>  * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
>  * Upon PKEY_CHANGE event, schedule a work that restarts the QP
>  * Restart child interfaces before parent

What's the reason for this change?
Is this a separate bugfix?
You might want to put this info in the log.

>  * Use uncached pkey query upon qp initialization
> 
> SM reconfiguration or failover possibly causes a shuffling of the values in the port
> pkey table. The current implementation only queries for the index of the pkey once,
> when it creates the device QP and after that moves it into working state, and hence
> does not address this scenario. Fix this by using the PKEY_CHANGE event as a trigger
> to reconfigure the device QP.
> 
> 
> Signed-off-by: Yosef Etigin <yosefe at voltaire.com>

Btw, pls try making log lines a bit shorter - git log shifts everything
to the right.

-- 
MST


From mst at dev.mellanox.co.il  Tue May  8 08:09:07 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 18:09:07 +0300
Subject: [ofa-general] Re: [PATCHv2 1/2] core: uncached "find gid" and "find
	pkey" queries
In-Reply-To: <46408336.8080908@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com>
Message-ID: <20070508150907.GU21591@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries
> 
> Add ib_find_gid and ib_find_pkey over uncached device queries.
> The calls might block but the returns are always up-to-date.
> 
> 
> Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
> ---
>  drivers/infiniband/core/device.c |   96 +++++++++++++++++++++++++++++++++++++++
>  include/rdma/ib_verbs.h          |   23 +++++++++
>  2 files changed, 119 insertions(+)
> 
> Index: b/drivers/infiniband/core/device.c
> ===================================================================
> --- a/drivers/infiniband/core/device.c	2007-05-07 15:42:19.000000000 +0300
> +++ b/drivers/infiniband/core/device.c	2007-05-08 11:16:35.049600754 +0300
> @@ -149,6 +149,18 @@ static int alloc_name(char *name)
>  	return 0;
>  }
>  
> +static inline int start_port(struct ib_device *device)
> +{
> +	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
> +}
> +
> +
> +static inline int end_port(struct ib_device *device)
> +{
> +	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
> +		0 : device->phys_port_cnt;
> +}
> +
>  /**
>   * ib_alloc_device - allocate an IB device struct
>   * @size:size of structure to allocate
> @@ -592,6 +604,90 @@ int ib_modify_port(struct ib_device *dev
>  }
>  EXPORT_SYMBOL(ib_modify_port);
>  
> +/**
> + * ib_find_gid - Returns the port number and GID table index where
> + *   a specified GID value occurs.
> + * @device: The device to query.
> + * @gid: The GID value to search for.
> + * @port_num: The port number of the device where the GID value was found.
> + * @index: The index into the GID table where the GID was found.  This
> + *   parameter may be NULL.
> + */
> +int ib_find_gid(struct ib_device *device, union ib_gid *gid,
> +		       u8 *port_num, u16 *index)
> +{
> +	struct ib_port_attr *tprops = NULL;
> +	union ib_gid tmp_gid;
> +	int ret, port, i;
> +
> +	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
> +
> +	for (port = start_port(device); port <= end_port(device); ++port) {
> +		ret = ib_query_port(device, port, tprops);
> +		if (ret)
> +			continue;
> +
> +		for (i = 0; i < tprops->gid_tbl_len; ++i) {
> +			ret = ib_query_gid(device, port, i, &tmp_gid);
> +			if (ret)
> +				goto out;
> +			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
> +				*port_num = port;
> +				*index = i;
> +				ret = 0;
> +				goto out;
> +			}
> +		}
> +	}
> +	ret = -ENOENT;
> +out:
> +	kfree(tprops);
> +	return ret;
> +}
> +EXPORT_SYMBOL(ib_find_gid);
> +
> +/**
> + * ib_find_pkey - Returns the PKey table index where a specified
> + *   PKey value occurs.
> + * @device: The device to query.
> + * @port_num: The port number of the device to search for the PKey.
> + * @pkey: The PKey value to search for.
> + * @index: The index into the PKey table where the PKey was found.
> + */
> +int ib_find_pkey(struct ib_device *device,
> +			u8 port_num, u16 pkey, u16 *index)
> +{
> +	struct ib_port_attr *tprops = NULL;
> +	int ret, i;
> +	u16 tmp_pkey;
> +
> +	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
> +
> +	ret = ib_query_port(device, port_num, tprops);
> +	if (ret) {
> +		printk(KERN_WARNING "ib_query_port failed , ret = %d\n", ret);
> +		goto out;
> +	}
> +
> +	for (i = 0; i < tprops->pkey_tbl_len; ++i) {
> +		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
> +		if (ret)
> +			goto out;
> +
> +		if (pkey == tmp_pkey) {
> +			*index = i;
> +			ret = 0;
> +			goto out;
> +		}
> +	}
> +	ret = -ENOENT;
> +
> +out:
> +	kfree(tprops);
> +	return ret;
> +}
> +EXPORT_SYMBOL(ib_find_pkey);
> +
>  static int __init ib_core_init(void)
>  {
>  	int ret;

OK, looks good - later, providers will be able to optimize these
by caching ib_query_pkey/ib_query_gid calls.

But I see a problem here in that ib_query_port is a call providers
won't be able to optimize out (because it includes e.g. port state),
and it seems a waste.

Is that right?

One way out would be to pass the table length in to ib_find_pkey/ib_find_gid
as an extra parameter, and cache that at the ULP level.

-- 
MST


From yosefe at voltaire.com  Tue May  8 08:11:09 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 18:11:09 +0300
Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events
In-Reply-To: <20070508150356.GT21591@mellanox.co.il>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508150356.GT21591@mellanox.co.il>
Message-ID: <4640930D.9010800@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>Subject: [PATCHv2 1/2] ipoib: handle pkey change events
>>
>>This issue was found during partitioning & SM fail over testing.
>>
>> * Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
>> * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
>> * Upon PKEY_CHANGE event, schedule a work that restarts the QP
>> * Restart child interfaces before parent
> 
> 
> What's the reason for this change?
> Is this a separate bugfix?
> You might want to put this info in the log.
> 

The reason is that if the children are restarted after the parent, and the parent is
not up, then the flush function returns immediately due to the INITIALIZED bit test.
Now I think that we might use a goto statement instead.

> 
>> * Use uncached pkey query upon qp initiallization
>>
>>SM reconfiguration or failover possibly causes a shuffling of the values in the port
>>pkey table. The current implementation only queries for the index of the pkey once,
>>when it creates the device QP and after that moves it into working state, and hence
>>does not address this scenario. Fix this by using the PKEY_CHANGE event as a trigger
>>to reconfigure the device QP.
>>
>>
>>Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
> 
> 
> Btw, pls try making log lines a bit shorter - git log shifts everything
> to the right.
> 
Ok


From mst at dev.mellanox.co.il  Tue May  8 08:19:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 18:19:27 +0300
Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events
In-Reply-To: <4640930D.9010800@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508150356.GT21591@mellanox.co.il>
	<4640930D.9010800@voltaire.com>
Message-ID: <20070508151927.GW21591@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events
> 
> Michael S. Tsirkin wrote:
> >>Quoting Yosef Etigin <yosefe at voltaire.com>:
> >>Subject: [PATCHv2 1/2] ipoib: handle pkey change events
> >>
> >>This issue was found during partitioning & SM fail over testing.
> >>
> >> * Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
> >> * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
> >> * Upon PKEY_CHANGE event, schedule a work that restarts the QP
> >> * Restart child interfaces before parent
> > 
> > 
> > What's the reason for this change?
> > Is this a separate bugfix?
> > You might want to put this info in the log.
> > 
> 
> The reason is that if the child are restarted after the parent, and the parent is
> not up, then the flush function returns immediately due to the INITIALLIZED bit test.
> Now I think that we might use a goto statement instead.

So ... what's the problem? I still don't see it.

-- 
MST


From yosefe at voltaire.com  Tue May  8 08:19:32 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 18:19:32 +0300
Subject: [ofa-general] Re: [PATCHv2 1/2] core: uncached "find gid" and "find
	pkey" queries
In-Reply-To: <20070508150907.GU21591@mellanox.co.il>
References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com>
	<20070508150907.GU21591@mellanox.co.il>
Message-ID: <46409504.9000802@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>Subject: [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries
>>
>>Add ib_find_gid and ib_find_pkey over uncached device queries.
>>The calls might block but the returns are always up-to-date.
>>
>>
>>Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
> 
> 
> OK, look good - later, providers will be able to optimize these
> by caching ib_query_pkey/ib_query_gid calls.
> 
> But I see a problem here in that ib_query_port is a call providers
> won't be able to optimize out (because it includes e.g. port state),
> and it seems a waste.
> 
> Is that right?
> 
> One way out would be to pass the table length in to ib_find_pkey/ib_find_gid
> as an extra parameter, and cache that at the ULP level.
> 
The provider might try to remember the port state after each MAD we see, but that
looks like too much to demand from it.

Anyway, since the information about the port table length does not come from MADs
but from device properties, the core can set each of these lengths during initialization,
and use them in the ib_find_* functions.
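
As a rough illustration of the idea above, here is a minimal user-space
sketch in which the table length is recorded once and the lookup loops
over it without a port query. The sketch_port structure and the sketch_*
functions are hypothetical stand-ins invented for this example; they are
not kernel APIs, and a real ib_find_pkey may well look different.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PKEYS 8

/* Hypothetical stand-in for one port of a device; not a kernel structure. */
struct sketch_port {
	int pkey_tbl_len;                 /* recorded once, at initialization */
	uint16_t pkeys[SKETCH_MAX_PKEYS]; /* live table; the SM may rewrite it */
};

/* Stand-in for ib_query_pkey(): reads one entry of the live table. */
static int sketch_query_pkey(const struct sketch_port *port, int index,
			     uint16_t *pkey)
{
	if (index < 0 || index >= port->pkey_tbl_len)
		return -EINVAL;
	*pkey = port->pkeys[index];
	return 0;
}

/*
 * Find the table index that holds @pkey.  The loop bound is the cached
 * length, so no port query is issued per lookup.  @index may be NULL
 * when the caller only wants to know whether the pkey is present.
 */
static int sketch_find_pkey(const struct sketch_port *port, uint16_t pkey,
			    uint16_t *index)
{
	uint16_t tmp;
	int i, ret;

	for (i = 0; i < port->pkey_tbl_len; ++i) {
		ret = sketch_query_pkey(port, i, &tmp);
		if (ret)
			return ret;
		if (tmp == pkey) {
			if (index)
				*index = (uint16_t)i;
			return 0;
		}
	}
	return -ENOENT;
}
```

Only the length is cached here, not the table contents, so a lookup made
after the entries have been shuffled still returns the current index.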


From sashak at voltaire.com  Tue May  8 08:29:44 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 8 May 2007 18:29:44 +0300
Subject: [ofa-general] Re: [PATCH TRIVIAL] opensm/osm_helper: remove repeated
	strlen() calls
In-Reply-To: <1178541140.32222.348653.camel@hal.voltaire.com>
References: <462C7C21.7010004@dev.mellanox.co.il>
	<20070423101738.GG4579@mellanox.co.il>
	<462E80A3.5060503@dev.mellanox.co.il>
	<20070501005101.GA26019@sashak.voltaire.com>
	<4636E4A7.7060108@dev.mellanox.co.il>
	<1178211572.32222.3479.camel@hal.voltaire.com>
	<20070506124333.GB9692@sashak.voltaire.com>
	<20070506130352.GC9692@sashak.voltaire.com>
	<1178541140.32222.348653.camel@hal.voltaire.com>
Message-ID: <20070508152944.GN9692@sashak.voltaire.com>

On 08:32 Mon 07 May     , Hal Rosenstock wrote:
> On Sun, 2007-05-06 at 09:03, Sasha Khapyorsky wrote:
> > Replace repeated strlen() calls used in sprintf() by actual string
> > length accumulated from sprintf() return values.
> > 
> > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> 
> Thanks. Applied to master only (as this is a cleanup rather than a bug
> fix). Let me know if you think this should be applied to ofed_1_2.

This is a minor improvement and not critical for ofed_1_2.

Sasha


From mst at dev.mellanox.co.il  Tue May  8 08:26:50 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 18:26:50 +0300
Subject: [ofa-general] Re: [PATCHv2 1/2] core: uncached "find gid" and "find
	pkey" queries
In-Reply-To: <46409504.9000802@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com>
	<20070508150907.GU21591@mellanox.co.il>
	<46409504.9000802@voltaire.com>
Message-ID: <20070508152650.GA5845@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries
> 
> Michael S. Tsirkin wrote:
> >>Quoting Yosef Etigin <yosefe at voltaire.com>:
> >>Subject: [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries
> >>
> >>Add ib_find_gid and ib_find_pkey over uncached device queries.
> >>The calls might block but the returns are always up-to-date.
> >>
> >>
> >>Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
> > 
> > 
> > OK, look good - later, providers will be able to optimize these
> > by caching ib_query_pkey/ib_query_gid calls.
> > 
> > But I see a problem here in that ib_query_port is a call providers
> > won't be able to optimize out (because it includes e.g. port state),
> > and it seems a waste.
> > 
> > Is that right?
> > 
> > One way out would be to pass the table length in to ib_find_pkey/ib_find_gid
> > as an extra parameter, and cache that at the ULP level.
> > 
> provider might try to remember the port state after each mad we see.. but it
> looks like too much to demand from it.

Port can go down without any MADs.

> Anyway, since the information about the port table length does not come from mads
> but from device properties, the core can set each of these lengths during initiallization,
> and use them in ib_find_* functions.

Passing it in looks simpler ... but maybe you're right.
Patch?

-- 
MST


From yosefe at voltaire.com  Tue May  8 08:38:57 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 18:38:57 +0300
Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events
In-Reply-To: <20070508151927.GW21591@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408360.3040006@voltaire.com>	<20070508150356.GT21591@mellanox.co.il>	<4640930D.9010800@voltaire.com>
	<20070508151927.GW21591@mellanox.co.il>
Message-ID: <46409991.2070305@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events
>>
>>Michael S. Tsirkin wrote:
>>
>>>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>>>Subject: [PATCHv2 1/2] ipoib: handle pkey change events
>>>>
>>>>This issue was found during partitioning & SM fail over testing.
>>>>
>>>>* Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
>>>>* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
>>>>* Upon PKEY_CHANGE event, schedule a work that restarts the QP
>>>>* Restart child interfaces before parent
>>>
>>>
>>>What's the reason for this change?
>>>Is this a separate bugfix?
>>>You might want to put this info in the log.
>>>
>>
>>The reason is that if the child are restarted after the parent, and the parent is
>>not up, then the flush function returns immediately due to the INITIALLIZED bit test.
>>Now I think that we might use a goto statement instead.
> 
> 
> So ... what the problem? I still don't see it.
> 
If I get a pkey change event, I want to restart all active ifaces on my port. If the parent
is not marked with IPOIB_FLAG_INITIALIZED, the function returns before it has a chance to
recursively restart the child ifaces.
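
The ordering problem described above can be modelled in user space.
This is only a sketch under stated assumptions: struct sketch_intf and
both sketch_flush_* functions are hypothetical names made up for the
example, the flag merely stands in for IPOIB_FLAG_INITIALIZED, and the
real flush path does considerably more work and takes vlan_mutex.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_CHILDREN 4

/* Hypothetical stand-in for an IPoIB interface; not struct ipoib_dev_priv. */
struct sketch_intf {
	bool initialized;  /* models IPOIB_FLAG_INITIALIZED */
	int flush_count;   /* how many times this interface was flushed */
	struct sketch_intf *children[SKETCH_MAX_CHILDREN];
	size_t nr_children;
};

/*
 * Parent-first ordering: the flag is tested before the children are
 * visited, so a parent that is down hides all of its children.
 */
static void sketch_flush_parent_first(struct sketch_intf *intf)
{
	size_t i;

	if (!intf->initialized)
		return;
	intf->flush_count++;
	for (i = 0; i < intf->nr_children; ++i)
		sketch_flush_parent_first(intf->children[i]);
}

/*
 * Children-first ordering: the children are visited before the flag
 * test, so they are flushed even when the parent is down.
 */
static void sketch_flush_children_first(struct sketch_intf *intf)
{
	size_t i;

	for (i = 0; i < intf->nr_children; ++i)
		sketch_flush_children_first(intf->children[i]);
	if (!intf->initialized)
		return;
	intf->flush_count++;
}
```

With a parent that is down and one child that is up, the parent-first
version never reaches the child, while the children-first version
flushes the child once and still skips the parent.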


From mst at dev.mellanox.co.il  Tue May  8 08:53:07 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 18:53:07 +0300
Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events
In-Reply-To: <46409991.2070305@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508150356.GT21591@mellanox.co.il>
	<4640930D.9010800@voltaire.com>
	<20070508151927.GW21591@mellanox.co.il>
	<46409991.2070305@voltaire.com>
Message-ID: <20070508155307.GC5845@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events
> 
> Michael S. Tsirkin wrote:
> >>Quoting Yosef Etigin <yosefe at voltaire.com>:
> >>Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events
> >>
> >>Michael S. Tsirkin wrote:
> >>
> >>>>Quoting Yosef Etigin <yosefe at voltaire.com>:
> >>>>Subject: [PATCHv2 1/2] ipoib: handle pkey change events
> >>>>
> >>>>This issue was found during partitioning & SM fail over testing.
> >>>>
> >>>>* Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
> >>>>* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
> >>>>* Upon PKEY_CHANGE event, schedule a work that restarts the QP
> >>>>* Restart child interfaces before parent
> >>>
> >>>
> >>>What's the reason for this change?
> >>>Is this a separate bugfix?
> >>>You might want to put this info in the log.
> >>>
> >>
> >>The reason is that if the child are restarted after the parent, and the parent is
> >>not up, then the flush function returns immediately due to the INITIALLIZED bit test.
> >>Now I think that we might use a goto statement instead.
> > 
> > 
> > So ... what the problem? I still don't see it.
> > 
> If I get pkey change event, I want to restart all active ifaces on my port. If
> the parent is not marked with IPOIB_FLAG_INITIALIZED, the function returns
> before it has a chance to recursively restart child ifaces.

So now, if restart_qp is set, you are going to open it, and it's not
initialized?

BTW, if the interface is not initialized, isn't the QP in reset already?
So can't we just move the code that assigns the pkey to the open call?

Another idea - wouldn't it be cleaner to have a function ipoib_restart_qp
(functionally similar to ib_dev_flush, but also changing the QP)
than adding a flag to ib_dev_flush?


-- 
MST


From yosefe at voltaire.com  Tue May  8 09:00:10 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 19:00:10 +0300
Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events
In-Reply-To: <20070508155307.GC5845@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408360.3040006@voltaire.com>	<20070508150356.GT21591@mellanox.co.il>	<4640930D.9010800@voltaire.com>	<20070508151927.GW21591@mellanox.co.il>	<46409991.2070305@voltaire.com>
	<20070508155307.GC5845@mellanox.co.il>
Message-ID: <46409E8A.6040408@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events
>>
>>Michael S. Tsirkin wrote:
>>
>>>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>>>Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events
>>>>
>>>>Michael S. Tsirkin wrote:
>>>>
>>>>
>>>>>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>>>>>Subject: [PATCHv2 1/2] ipoib: handle pkey change events
>>>>>>
>>>>>>This issue was found during partitioning & SM fail over testing.
>>>>>>
>>>>>>* Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
>>>>>>* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
>>>>>>* Upon PKEY_CHANGE event, schedule a work that restarts the QP
>>>>>>* Restart child interfaces before parent
>>>>>
>>>>>
>>>>>What's the reason for this change?
>>>>>Is this a separate bugfix?
>>>>>You might want to put this info in the log.
>>>>>
>>>>
>>>>The reason is that if the child are restarted after the parent, and the parent is
>>>>not up, then the flush function returns immediately due to the INITIALLIZED bit test.
>>>>Now I think that we might use a goto statement instead.
>>>
>>>
>>>So ... what the problem? I still don't see it.
>>>
>>
>>If I get pkey change event, I want to restart all active ifaces on my port. If
>>the parent is not marked with IPOIB_FLAG_INITIALIZED, the function returns
>>before it has a chance to recursively restart child ifaces.
> 
> 
> So now, if restart_qp is set, you are going to open it, and it's not
> initialized?
>
> BTW, if the interface is not initialized, is not QP in reset already?
> So can't we just move the code that assign the pkey to the open call?
> 
No - I'm going to open its *child* interface. The problem is that parents "mask out"
their children.

> Another idea - won't it be cleaner to have a function ipoib_restart_qp
> (functionally similiar to ib_dev_flush, but also changing the QP)
> than adding a flag to ib_dev_flush?
> 
It might be, but we wanted to avoid code duplication (the only difference is 2-3 lines)


From mst at dev.mellanox.co.il  Tue May  8 09:27:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 19:27:27 +0300
Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events
In-Reply-To: <46408360.3040006@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
Message-ID: <20070508162727.GD5845@mellanox.co.il>

> @@ -622,13 +623,24 @@ int ipoib_ib_dev_init(struct net_device 
>  	return 0;
>  }
>  
> -void ipoib_ib_dev_flush(struct work_struct *work)
> +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp)
>  {
> -	struct ipoib_dev_priv *cpriv, *priv =
> -		container_of(work, struct ipoib_dev_priv, flush_task);
> +	struct ipoib_dev_priv *cpriv;
>  	struct net_device *dev = priv->dev;
>  
> -	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
> +	mutex_lock(&priv->vlan_mutex);
> +
> +	/* Flush any child interfaces */
> +	list_for_each_entry(cpriv, &priv->child_intfs, list)
> +		__ipoib_ib_dev_flush(cpriv, restart_qp);
> +
> +	mutex_unlock(&priv->vlan_mutex);
> +
> +	/*
> +	 * If the device is not initiallized since it needs a pkey -
> +	 * try to reopen it
> +	 */

Kill this comment - typos and all.

> +	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
>  		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
>  		return;
>  	}

-- 
MST


From mst at dev.mellanox.co.il  Tue May  8 09:30:03 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 19:30:03 +0300
Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events
In-Reply-To: <46409E8A.6040408@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508150356.GT21591@mellanox.co.il>
	<4640930D.9010800@voltaire.com>
	<20070508151927.GW21591@mellanox.co.il>
	<46409991.2070305@voltaire.com>
	<20070508155307.GC5845@mellanox.co.il>
	<46409E8A.6040408@voltaire.com>
Message-ID: <20070508163003.GE5845@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events
> 
> Michael S. Tsirkin wrote:
> >>Quoting Yosef Etigin <yosefe at voltaire.com>:
> >>Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events
> >>
> >>Michael S. Tsirkin wrote:
> >>
> >>>>Quoting Yosef Etigin <yosefe at voltaire.com>:
> >>>>Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events
> >>>>
> >>>>Michael S. Tsirkin wrote:
> >>>>
> >>>>
> >>>>>>Quoting Yosef Etigin <yosefe at voltaire.com>:
> >>>>>>Subject: [PATCHv2 1/2] ipoib: handle pkey change events
> >>>>>>
> >>>>>>This issue was found during partitioning & SM fail over testing.
> >>>>>>
> >>>>>>* Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
> >>>>>>* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
> >>>>>>* Upon PKEY_CHANGE event, schedule a work that restarts the QP
> >>>>>>* Restart child interfaces before parent
> >>>>>
> >>>>>
> >>>>>What's the reason for this change?
> >>>>>Is this a separate bugfix?
> >>>>>You might want to put this info in the log.
> >>>>>
> >>>>
> >>>>The reason is that if the child are restarted after the parent, and the parent is
> >>>>not up, then the flush function returns immediately due to the INITIALLIZED bit test.
> >>>>Now I think that we might use a goto statement instead.
> >>>
> >>>
> >>>So ... what the problem? I still don't see it.
> >>>
> >>
> >>If I get pkey change event, I want to restart all active ifaces on my port. If
> >>the parent is not marked with IPOIB_FLAG_INITIALIZED, the function returns
> >>before it has a chance to recursively restart child ifaces.
> > 
> > 
> > So now, if restart_qp is set, you are going to open it, and it's not
> > initialized?
> >
> > BTW, if the interface is not initialized, is not QP in reset already?
> > So can't we just move the code that assign the pkey to the open call?
> > 
> No - I'm going to open its *child* interface. The problem is that parents "mask out"
> their children.

Aha, I see this now. You might want to explain this in the comment.
Something like:

/* Flush any child interfaces too -
 * they might be up even if the parent is down */

> > Another idea - won't it be cleaner to have a function ipoib_restart_qp
> > (functionally similiar to ib_dev_flush, but also changing the QP)
> > than adding a flag to ib_dev_flush?
> > 
> It might be, but we wanted to avoid code duplication (the only difference is 2-3 lines)

OK, it's a valid approach too.

-- 
MST


From swise at opengridcomputing.com  Tue May  8 09:31:12 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 08 May 2007 11:31:12 -0500
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
	<1178578346.30571.183.camel@stevo-desktop>
	<BB132B4A-DED0-41B4-8B44-6ADAB5F9740B@cisco.com>
	<1178632029.3056.3.camel@stevo-desktop>
	<73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com>
Message-ID: <1178641872.6064.12.camel@stevo-desktop>

On Tue, 2007-05-08 at 11:50 -0400, Jeff Squyres wrote:
> In the "FYI" category...
> 
> There was discussion about the udapl BTL over OFED today on the  
> weekly developer teleconference (per my earlier post, a user is  
> reporting that it doesn't work).  Andrew Friedley is going to work  
> with the Sun developers -- he thinks he might know where the problem  
> is coming from but is in process of physically relocating, and  
> therefore couldn't look at it until late next week at the earliest.

> Sun may be able to pick up the issue -- but if so, I don't know what  
> their timeframe will be (and it may depend on the severity of the  
> problem).
> 

Well I've tried OMPI on ofed-1.2 udapl today and it doesn't work.  I'm
debugging now.


> 
> On May 8, 2007, at 9:47 AM, Steve Wise wrote:
> 
> > On Mon, 2007-05-07 at 20:39 -0400, Jeff Squyres wrote:
> >> On May 7, 2007, at 6:52 PM, Steve Wise wrote:
> >>
> >>> Also, there appears to be a DAPL BTL in OMPI.  Is this BTL complete
> >>> and
> >>> enabled for the ofed-1.2 udapl library?
> >>
> >> Yes, it is complete and is well-tested in Solaris.
> >>
> >> It is not well tested in Linux/OFED (we've been concentrating on the
> >> verbs interface on the OFED side of things -- the "openib" BTL [we
> >> never renamed it when OpenIB changed names to OpenFabrics]).  In
> >> fact, we've had scattered reports of it not working properly in  
> >> Linux/
> >> OFED, but those could well have been pilot error (i.e., me not trying
> >> to run properly -- I know just about zilch about udapl).
> >>
> >
> > The reason I'm asking is twofold:
> >
> > 1) this can get OMPI running on iwarp devices today if it works.
> >
> > 2) the udapl code can be a model for the rdma-cm piece, since the two
> > are similar (client / server connection set, ipaddr/port based,  
> > etc)...
> >
> > I'll try it out on T3.
> >
> >
> > Steve.
> 
> 


From yosefe at voltaire.com  Tue May  8 09:43:41 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 19:43:41 +0300
Subject: [ofa-general] [PATCHv3 1/2] ipoib: handle pkey change events
In-Reply-To: <20070508162727.GD5845@mellanox.co.il>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508162727.GD5845@mellanox.co.il>
Message-ID: <4640A8BD.4000405@voltaire.com>

This issue was found during partitioning & SM fail over testing.

 * Add a flush flag to ipoib_ib_dev_stop(), as in ipoib_ib_dev_down()
 * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
 * Upon PKEY_CHANGE event, schedule a work that restarts the QP
 * Restart child interfaces before parent. They might be up even if the
   parent is down
 * Use uncached pkey query upon QP initialization

SM reconfiguration or failover may shuffle the values in the port pkey
table. The current implementation queries for the index of the pkey only
once, when it creates the device QP, and after that moves the QP into
working state, so it does not handle this scenario. Fix this by using
the PKEY_CHANGE event as a trigger to reconfigure the device QP.

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |    6 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   59 ++++++++++++++++++++---------
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |    7 +--
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |   11 ++---
 4 files changed, 56 insertions(+), 27 deletions(-)

Index: b/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 15:46:53.767972077 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 16:45:44.768882483 +0300
@@ -202,11 +202,12 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
+	struct delayed_work pkey_poll_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
+	struct work_struct pkey_event_task;
 
 	struct ib_device *ca;
 	u8            	  port;
@@ -333,12 +334,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 15:46:53.784969043 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 19:37:12.841977849 +0300
@@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -481,7 +481,7 @@ int ipoib_ib_dev_down(struct net_device 
 	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
 		mutex_lock(&pkey_mutex);
 		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
+		cancel_delayed_work(&priv->pkey_poll_task);
 		mutex_unlock(&pkey_mutex);
 		if (flush)
 			flush_workqueue(ipoib_workqueue);
@@ -508,7 +508,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +581,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,13 +623,21 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
+	mutex_lock(&priv->vlan_mutex);
+
+	/* Flush any child interfaces too -
+ 	 * they might be up even if the parent is down */
+ 	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, restart_qp);
+
+	mutex_unlock(&priv->vlan_mutex);
+
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
 		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
@@ -642,6 +651,12 @@ void ipoib_ib_dev_flush(struct work_stru
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (restart_qp) {
+		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
+			ipoib_ib_dev_stop(dev, 0);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +665,24 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	mutex_unlock(&priv->vlan_mutex);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_event_task);
+
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -685,7 +710,7 @@ void ipoib_ib_dev_cleanup(struct net_dev
 void ipoib_pkey_poll(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
+		container_of(work, struct ipoib_dev_priv, pkey_poll_task.work);
 	struct net_device *dev = priv->dev;
 
 	ipoib_pkey_dev_check_presence(dev);
@@ -696,7 +721,7 @@ void ipoib_pkey_poll(struct work_struct 
 		mutex_lock(&pkey_mutex);
 		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
+					   &priv->pkey_poll_task,
 					   HZ);
 		mutex_unlock(&pkey_mutex);
 	}
@@ -715,7 +740,7 @@ int ipoib_pkey_dev_delay_open(struct net
 		mutex_lock(&pkey_mutex);
 		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
 		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
+				   &priv->pkey_poll_task,
 				   HZ);
 		mutex_unlock(&pkey_mutex);
 		return 1;
Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 15:46:53.805965295 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 16:45:44.768882483 +0300
@@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev)
 		return -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 15:46:53.877952447 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 16:45:44.769882305 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
@@ -103,7 +101,7 @@ int ipoib_init_qp(struct net_device *dev
 	 * The port has to be assigned to the respective IB partition in
 	 * advance.
 	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
+	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
 	if (ret) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		return ret;
@@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler
 		container_of(handler, struct ipoib_dev_priv, event_handler);
 
 	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
 	     record->event == IB_EVENT_PORT_ACTIVE ||
 	     record->event == IB_EVENT_LID_CHANGE  ||
 	     record->event == IB_EVENT_SM_CHANGE   ||
@@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler
 	    record->element.port_num == priv->port) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
+		record->element.port_num == priv->port) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_event_task);
 	}
 }
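
For readers following the thread, the failure mode described above can be
sketched in user-space C. This is only an illustration, not kernel code: the
table contents, pkey values and the helpers find_pkey(), sm_reshuffle() and
index_is_current() are all hypothetical. It shows why an index looked up once
goes stale after the table is reshuffled, and why looking it up again on a
PKEY_CHANGE-style trigger recovers.

```c
#include <stdint.h>

#define TBL_LEN 4

/* Hypothetical stand-in for one port's pkey table. */
static uint16_t pkey_table[TBL_LEN] = { 0xffff, 0x8001, 0x8002, 0x0000 };

/* Linear search: returns 0 and stores the index, or -1 if pkey is absent. */
int find_pkey(uint16_t pkey, uint16_t *index)
{
    for (int i = 0; i < TBL_LEN; ++i) {
        if (pkey_table[i] == pkey) {
            *index = (uint16_t)i;
            return 0;
        }
    }
    return -1;
}

/* Simulate an SM reconfiguration that swaps two table entries. */
void sm_reshuffle(void)
{
    uint16_t tmp = pkey_table[1];

    pkey_table[1] = pkey_table[2];
    pkey_table[2] = tmp;
}

/* Returns 1 if a previously looked-up index still points at pkey. */
int index_is_current(uint16_t pkey, uint16_t index)
{
    return index < TBL_LEN && pkey_table[index] == pkey;
}
```

In this sketch an index obtained before sm_reshuffle() no longer refers to
the same pkey afterwards; calling find_pkey() again yields the new index.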


From yosefe at voltaire.com  Tue May  8 09:45:05 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 19:45:05 +0300
Subject: [ofa-general] [PATCHv3 1/2] core: uncached "find gid" and "find
	pkey" queries
In-Reply-To: <20070508152650.GA5845@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408336.8080908@voltaire.com>	<20070508150907.GU21591@mellanox.co.il>	<46409504.9000802@voltaire.com>
	<20070508152650.GA5845@mellanox.co.il>
Message-ID: <4640A911.8000609@voltaire.com>


* Add ib_find_gid and ib_find_pkey, built on uncached device queries.
  The calls might block, but the results are always up to date.
* Cache the pkey and gid table lengths in core to avoid port info queries.


Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/device.c |  140 +++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h          |   25 ++++++
 2 files changed, 165 insertions(+)

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-08 15:46:36.773005388 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-08 19:30:53.095613249 +0300
@@ -149,6 +149,18 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
+}
+
 /**
  * ib_alloc_device - allocate an IB device struct
  * @size:size of structure to allocate
@@ -208,6 +220,56 @@ static int add_client_context(struct ib_
 	return 0;
 }
 
+/* read the lengths of pkey,gid tables on each port */
+static inline int read_port_table_lengths(struct ib_device *device)
+{
+	struct ib_port_attr *tprops = NULL;
+	int num_ports, ret = -ENOMEM;
+	u8 port_index;
+
+	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
+	if (!tprops)
+		goto out;
+
+	num_ports = end_port(device) - start_port(device) + 1;
+
+	device->pkey_tbl_len = kmalloc(num_ports *
+				sizeof *device->pkey_tbl_len, GFP_KERNEL);
+	if (!device->pkey_tbl_len)
+		goto out;
+
+	device->gid_tbl_len = kmalloc(num_ports *
+				sizeof *device->gid_tbl_len, GFP_KERNEL);
+	if (!device->gid_tbl_len)
+		goto err1;
+
+	for (port_index = 0; port_index < num_ports; ++port_index) {
+		ret = ib_query_port(device, port_index + start_port(device),
+				tprops);
+		if (ret)
+			goto err2;
+
+		device->pkey_tbl_len[ port_index ] = tprops->pkey_tbl_len;
+		device->gid_tbl_len[ port_index ] = tprops->gid_tbl_len;
+	}
+
+	ret = 0;
+	goto out;
+err2:
+	kfree(device->gid_tbl_len);
+err1:
+	kfree(device->pkey_tbl_len);
+out:
+	kfree(tprops);
+	return ret;
+}
+
+static inline void free_port_table_lengths(struct ib_device *device)
+{
+	kfree(device->gid_tbl_len);
+	kfree(device->pkey_tbl_len);
+}
+
 /**
  * ib_register_device - Register an IB device with IB core
  * @device:Device to register
@@ -239,6 +301,13 @@ int ib_register_device(struct ib_device 
 	spin_lock_init(&device->event_handler_lock);
 	spin_lock_init(&device->client_data_lock);
 
+	ret = read_port_table_lengths(device);
+	if (ret) {
+		printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n",
+		       device->name);
+		goto out;
+	}
+
 	ret = ib_device_register_sysfs(device);
 	if (ret) {
 		printk(KERN_WARNING "Couldn't register device %s with driver model\n",
@@ -284,6 +353,8 @@ void ib_unregister_device(struct ib_devi
 
 	list_del(&device->core_list);
 
+	free_port_table_lengths(device);
+
 	mutex_unlock(&device_mutex);
 
 	spin_lock_irqsave(&device->client_data_lock, flags);
@@ -592,6 +663,75 @@ int ib_modify_port(struct ib_device *dev
 }
 EXPORT_SYMBOL(ib_modify_port);
 
+/**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		       u8 *port_num, u16 *index)
+{
+	union ib_gid tmp_gid;
+	int ret, port, i, tbl_len;
+
+	for (port = start_port(device); port <= end_port(device); ++port) {
+		tbl_len = device->gid_tbl_len[ port - start_port(device) ];
+		for (i = 0; i < tbl_len; ++i) {
+			ret = ib_query_gid(device, port, i, &tmp_gid);
+			if (ret)
+				goto out;
+			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
+				*port_num = port;
+				*index = i;
+				ret = 0;
+				goto out;
+			}
+		}
+	}
+	ret = -ENOENT;
+out:
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_gid);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index)
+{
+	int ret, i, tbl_len;
+	u16 tmp_pkey;
+
+	tbl_len = device->pkey_tbl_len[ port_num - start_port(device) ];
+	for (i = 0; i < tbl_len; ++i) {
+		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
+		if (ret)
+			goto out;
+
+		if (pkey == tmp_pkey) {
+			*index = i;
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = -ENOENT;
+
+out:
+	kfree(tprops);
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_pkey);
+
 static int __init ib_core_init(void)
 {
 	int ret;
Index: b/include/rdma/ib_verbs.h
===================================================================
--- a/include/rdma/ib_verbs.h	2007-05-08 15:45:45.199210546 +0300
+++ b/include/rdma/ib_verbs.h	2007-05-08 18:48:23.334763770 +0300
@@ -1058,6 +1058,8 @@ struct ib_device {
 	__be64			     node_guid;
 	u8                           node_type;
 	u8                           phys_port_cnt;
+	int                          *pkey_tbl_len;
+	int                          *gid_tbl_len;
 };
 
 struct ib_client {
@@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev
 		   struct ib_port_modify *port_modify);
 
 /**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		       u8 *port_num, u16 *index);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index);
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *
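
The lookup pattern used by ib_find_pkey above (query each table entry in
turn, stop on the first match, pass a query error straight back) can be
tried outside the kernel with a mock. A minimal sketch, assuming a
hypothetical mock_query_pkey() in place of the real per-entry device query;
the table values and function names are made up for illustration.

```c
#include <errno.h>
#include <stdint.h>

#define MOCK_TBL_LEN 3

/* Hypothetical mock of a per-entry device query; a real one may block. */
int mock_query_pkey(int index, uint16_t *pkey)
{
    static const uint16_t table[MOCK_TBL_LEN] = { 0xffff, 0x8001, 0x8002 };

    if (index < 0 || index >= MOCK_TBL_LEN)
        return -EINVAL;
    *pkey = table[index];
    return 0;
}

/*
 * Search by querying each entry instead of reading a cached copy.
 * Returns 0 and sets *index on a match, -ENOENT if no entry matches,
 * or the query's own error code if a query fails.
 */
int find_pkey_uncached(int tbl_len, uint16_t pkey, uint16_t *index)
{
    uint16_t tmp;
    int i, ret;

    for (i = 0; i < tbl_len; ++i) {
        ret = mock_query_pkey(i, &tmp);
        if (ret)
            return ret;
        if (tmp == pkey) {
            *index = (uint16_t)i;
            return 0;
        }
    }
    return -ENOENT;
}
```

Passing a cached table length, as the patch does, keeps the loop bound from
needing its own port query on every call.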


From jsquyres at cisco.com  Tue May  8 08:50:56 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 8 May 2007 11:50:56 -0400
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <1178632029.3056.3.camel@stevo-desktop>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
	<1178578346.30571.183.camel@stevo-desktop>
	<BB132B4A-DED0-41B4-8B44-6ADAB5F9740B@cisco.com>
	<1178632029.3056.3.camel@stevo-desktop>
Message-ID: <73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com>

In the "FYI" category...

There was discussion about the udapl BTL over OFED today on the  
weekly developer teleconference (per my earlier post, a user is  
reporting that it doesn't work).  Andrew Friedley is going to work  
with the Sun developers -- he thinks he might know where the problem  
is coming from but is in the process of physically relocating, and  
therefore couldn't look at it until late next week at the earliest.

Sun may be able to pick up the issue -- but if so, I don't know what  
their timeframe will be (and it may depend on the severity of the  
problem).


On May 8, 2007, at 9:47 AM, Steve Wise wrote:

> On Mon, 2007-05-07 at 20:39 -0400, Jeff Squyres wrote:
>> On May 7, 2007, at 6:52 PM, Steve Wise wrote:
>>
>>> Also, there appears to be a DAPL BTL in OMPI.  Is this BTL complete
>>> and
>>> enabled for the ofed-1.2 udapl library?
>>
>> Yes, it is complete and is well-tested in Solaris.
>>
>> It is not well tested in Linux/OFED (we've been concentrating on the
>> verbs interface on the OFED side of things -- the "openib" BTL [we
>> never renamed it when OpenIB changed names to OpenFabrics]).  In
>> fact, we've had scattered reports of it not working properly in  
>> Linux/
>> OFED, but those could well have been pilot error (i.e., me not trying
>> to run properly -- I know just about zilch about udapl).
>>
>
> The reason I'm asking is twofold:
>
> 1) this can get OMPI running on iwarp devices today if it works.
>
> 2) the udapl code can be a model for the rdma-cm piece, since the two
> are similar (client / server connection set, ipaddr/port based,  
> etc)...
>
> I'll try it out on T3.
>
>
> Steve.


-- 
Jeff Squyres
Cisco Systems


From sean.hefty at intel.com  Tue May  8 08:46:36 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 8 May 2007 08:46:36 -0700
Subject: [ofa-general] RE: memory leak in cm.c?
In-Reply-To: <20070508050700.GI22341@mellanox.co.il>
Message-ID: <000101c79188$10043740$39d1180a@amr.corp.intel.com>

>I applied the following patch to cm.c, and it crashed after
>some duplicate reqs were detected. Does this indicate a
>memory leak in cm?

In this case, the timewait_info structure is freed directly in the
cm_req_handler (line 1373).  The pointer is just not cleared.

- Sean


From swise at opengridcomputing.com  Tue May  8 10:29:21 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 08 May 2007 12:29:21 -0500
Subject: OMPI over OFA udapl (was Re: [OMPI devel] [ofa-general] OpenMPI
	and RDMA-CM)
In-Reply-To: <1178641872.6064.12.camel@stevo-desktop>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
	<1178578346.30571.183.camel@stevo-desktop>
	<BB132B4A-DED0-41B4-8B44-6ADAB5F9740B@cisco.com>
	<1178632029.3056.3.camel@stevo-desktop>
	<73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com>
	<1178641872.6064.12.camel@stevo-desktop>
Message-ID: <1178645361.6064.35.camel@stevo-desktop>


> 
> Well I've tried OMPI on ofed-1.2 udapl today and it doesn't work.  I'm
> debugging now.
> 

Here's part of the problem (from ompi/btl/udapl/btl_udapl.c):

    /* TODO - big bad evil hack! */
    /* uDAPL doesn't ever seem to keep track of ports with addresses.  This
       becomes a problem when we use dat_ep_query() to obtain a remote address
       on an endpoint.  In this case, both the DAT_PORT_QUAL and the sin_port
       field in the DAT_SOCK_ADDR are 0, regardless of the actual port. This is
       a problem when we have more than one uDAPL process per IA - these
       processes will have exactly the same address, as the port is all
       we have to differentiate who is who.  Thus, our uDAPL EP -> BTL EP
       matching algorithm will break down.

       So, we insert the port we used for our PSP into the DAT_SOCK_ADDR for
       this IA.  uDAPL then conveniently propagates this to where we need it.
     */
    ((struct sockaddr_in*)attr.ia_address_ptr)->sin_port = htons(port);
    ((struct sockaddr_in*)&btl->udapl_addr.addr)->sin_port = htons(port);

The OMPI code stuffs the port chosen by udapl for a listening endpoint
into the ia address memory (which is owned by the udapl layer btw).
There's a slight problem with that:  The OFA udapl openib_cma code binds
cm_id's to this ia_address regularly.  When an hca is opened, a cm_id is
bound to this address to obtain the local hca port number and gid that
is being used.  In addition, a cm_id is bound to this address each time
an endpoint is created (either at ep_create time or ep_connect time).
So that ia_address field is used by the dapl cm to create local
cm_ids...  Since the port was always zero, the rdma-cma would choose a
unique port for each cm_id bound to that address.

But once OMPI sets the port field to non-zero, the rdma_cma fails all the
subsequent rdma_bind_addr() calls since the port is already in use.
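
The same behavior can be seen with ordinary sockets, which makes for a rough
analogy (this is plain TCP bind(), not rdma_bind_addr(), and the helper name
bind_any() is made up for this sketch): binding to port 0 gets a fresh port
each time, while a second bind to a port that is already taken fails with
EADDRINUSE.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind a new TCP socket to the wildcard address on the given port
 * (host byte order; 0 lets the stack choose). On success returns the
 * fd and stores the port actually bound; on failure returns -errno.
 */
int bind_any(uint16_t port, uint16_t *bound_port)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof sin;
    int fd, err;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -errno;

    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&sin, sizeof sin) < 0 ||
        getsockname(fd, (struct sockaddr *)&sin, &len) < 0) {
        err = errno;
        close(fd);
        return -err;
    }
    *bound_port = ntohs(sin.sin_port);
    return fd;
}
```

Two bind_any(0, ...) calls should come back with different ports; feeding the
first port into a third call while the first socket is still open should
fail, which mirrors what the cm_ids run into once the sockaddr carries a
non-zero port.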

Perhaps this hack really is a workaround for a DAPL bug where somebody's
dapl wasn't tracking port numbers correctly?

I think there are three issues here:

1) OMPI shouldn't be stepping on the ia_address.

2) OFA udapl should probably be explicitly binding local cm_ids to port
zero.

3) dat_ep_query() should be returning the correct port numbers...


I'm going to run a few experiments:

1) remove the OMPI hack and see if things work fine for OFA udapl.
Perhaps OFA udapl correctly tracks ports on endpoints?

2) leave OMPI as-is and change OFA udapl to not assume the ia_addr
sockaddr has a 0 port in it.  


Steve.


From sean.hefty at intel.com  Tue May  8 10:29:35 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 8 May 2007 10:29:35 -0700
Subject: [ofa-general] RE: man pages for the rdma-cm
In-Reply-To: <1178556576.30571.79.camel@stevo-desktop>
Message-ID: <000201c79196$72e8a680$39d1180a@amr.corp.intel.com>

I updated the man pages in my master branch and pushed the changes out.  Details
below.

>- are the events described anywhere?  Maybe they should be described in
>rdma_get_cm_event?

done

>- rdma_accept / rdma_connect: describe the conn_param fields.

done

>- rdma_bind_addr: binding to port 0 will cause the rdma-cm to select an
>available port.

added

>- no pages for get_src_port/get_dst_port

not added yet

>- rdma_connect - "connected" and "unconnected" when discussing cm_ids is
>misleading. Perhaps "reliable connection" vs "unreliable datagram"?

I reworked the wording here to clarify that the behavior is based on the port
space associated with the cm_id.

>- rdma_create_event_channel: it would be nice to mention that the fd can
>be used like any other fd (made non blocking, poll()/select()able, etc).

added
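
As a sketch of what "used like any other fd" means in practice, the helpers
below work on any file descriptor. They are hypothetical names for this
note, not part of librdmacm, and can be exercised with a pipe standing in
for the event channel's fd.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Returns 1 if fd is readable within timeout_ms, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    return (n > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Non-blocking read: returns bytes read, or -errno (e.g. -EAGAIN). */
long try_read(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);

    return n < 0 ? -(long)errno : (long)n;
}
```

With an event channel, the idea would be to wait until the channel's fd is
readable and only then fetch the event, rather than blocking in the fetch
call itself.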

>- Also, it might be nice to have some sort of overview man page that
>maps the expected event flows for connection setup and teardown.  Maybe
>'man rdmacm' gets you some overview?

I added an rdma_cm man page that gives an overview.  I still need to add
references to this man page from the other API man pages, which I'll do before
pushing into OFED.

- Sean


From swise at opengridcomputing.com  Tue May  8 10:51:30 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 08 May 2007 12:51:30 -0500
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <463FCA42.3000104@indiana.edu>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
	<95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com>
	<463FCA42.3000104@indiana.edu>
Message-ID: <1178646690.6064.38.camel@stevo-desktop>

> >> Chelsio's gonna pony up the resources to get this work done asap.  Do
> >> you have any thoughts on how we can collaborate on this project?  I'm
> >> familiar with mvapich, not ompi, so I need to go do some homework.   
> >> But
> >> any pointers on the connection setup design for ompi would be great.
> > 
> > Excellent!  Let's chat on the phone tomorrow -- this would probably  
> > be the best way to start.
> > 
> > We will need a signed Open MPI 3rd party contribution agreement from  
> > either you and/or Chelsio (whoever owns the intellectual property  
> > that will be contributed).  See http://www.open-mpi.org/community/ 
> > contribute/.
> > 
> >> I'm CCing devel at openmpi.org in case anyone else is interested in
> >> helping.  Chelsio can provide rnic HW...
> > 
> > Anyone else here interested?  Free hardware!  :-)
> 
> Hmm I'm interested.  I've already done some work switching over to RDMA 
> CM for some research stuff I've been doing; it's not publicly accessible 
> w/o the 3rd party agreement.  I can help answer questions on what 
> exactly needs to change, and do some testing.
> 
> Andrew

I'm working on the 3rd party agreement from chelsio now.  Stay tuned!

Steve.


From afriedle at open-mpi.org  Tue May  8 10:57:44 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Tue, 08 May 2007 13:57:44 -0400
Subject: [OMPI devel] OMPI over OFA udapl (was Re: [ofa-general] OpenMPI
	and	RDMA-CM)
In-Reply-To: <1178645361.6064.35.camel@stevo-desktop>
References: <1177791386.4615.8.camel@stevo-laptop>	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>	<1178575761.30571.175.camel@stevo-desktop>	<1178578346.30571.183.camel@stevo-desktop>	<BB132B4A-DED0-41B4-8B44-6ADAB5F9740B@cisco.com>	<1178632029.3056.3.camel@stevo-desktop>	<73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com>	<1178641872.6064.12.camel@stevo-desktop>
	<1178645361.6064.35.camel@stevo-desktop>
Message-ID: <4640BA18.7060104@open-mpi.org>

Steve Wise wrote:
>> Well I've tried OMPI on ofed-1.2 udapl today and it doesn't work.  I'm
>> debugging now.
>>
> 
> Here's part of the problem (from ompi/btl/udapl/btl_udapl.c):
> 
>     /* TODO - big bad evil hack! */
>     /* uDAPL doesn't ever seem to keep track of ports with addresses.  This
>        becomes a problem when we use dat_ep_query() to obtain a remote address
>        on an endpoint.  In this case, both the DAT_PORT_QUAL and the sin_port
>        field in the DAT_SOCK_ADDR are 0, regardless of the actual port. This is
>        a problem when we have more than one uDAPL process per IA - these
>        processes will have exactly the same address, as the port is all
>        we have to differentiate who is who.  Thus, our uDAPL EP -> BTL EP
>        matching algorithm will break down.
> 
>        So, we insert the port we used for our PSP into the DAT_SOCK_ADDR for
>        this IA.  uDAPL then conveniently propagates this to where we need it.
>      */
>     ((struct sockaddr_in*)attr.ia_address_ptr)->sin_port = htons(port);
>     ((struct sockaddr_in*)&btl->udapl_addr.addr)->sin_port = htons(port);
> 
> The OMPI code stuffs the port chosen by udapl for a listening endpoint
> into the ia address memory (which is owned by the udapl layer btw).
> There's a slight problem with that:  The OFA udapl openib_cma code binds
> cm_id's to this ia_address regularly.  When an hca is opened, a cm_id is
> bound to this address to obtain the local hca port number and gid that
> is being used.  In addition, a cm_id is bound to this address each time
> an endpoint is created (either at ep_create time or ep_connect time).
> So that ia_address field is used by the dapl cm to create local
> cm_ids...  Since the port was always zero, the rdma-cma would choose a
> unique port for each cm_id bound to that address.   
> 
> But once OMPI sets the port field to non-zero, the rdma_cma fails all the
> subsequent rdma_bind_addr() calls since the port is already in use.
> 
> Perhaps this hack really is a workaround for a DAPL bug where somebody's
> dapl wasn't tracking port numbers correctly?

Yep. My memory is dim, but I think that was OFED's DAPL, or it was in 
the generic part of DAPL that all implementations seem to share.

As hinted by the comment (I wrote it by the way), I think the best 
solution would be if dat_ep_query() returned the port number correctly. 
  Most of uDAPL seems to just pass around pointers to internal data 
structures (which I'm not sure is the best idea in the world), so it 
didn't seem like a trivial fix to me at the time.  I remember 
considering reporting this as a bug, but I didn't because the uDAPL 
standard didn't seem to enforce any requirements on passing the port 
number around with the address, so it technically wasn't wrong.

Was the OFED uDAPL code switched from something else to RDMA CM at some 
point?  I'm almost certain I was running fine on OFED's uDAPL at one 
point (in fact, a lot of the uDAPL BTL development I did was using the 
OFED stack).


> I'm going to run a few experiments:
> 
> 1) remove the OMPI hack and see if things work fine for OFA udapl.
> Perhaps OFA udapl correctly tracks ports on endpoints?

Doubt it, but worth trying.

> 2) leave OMPI as-is and change OFA udapl to not assume the ia_addr
> sockaddr has a 0 port in it.  

Pretty sure this will work, don't know if it's the correct solution though.

Andrew


From jlentini at netapp.com  Tue May  8 11:38:06 2007
From: jlentini at netapp.com (James Lentini)
Date: Tue, 8 May 2007 14:38:06 -0400 (EDT)
Subject: [ofa-general] [IPoIB][RFC] remove redundant gid query
Message-ID: <Pine.LNX.4.64.0705081427320.27590@jlentini-linux.nane.netapp.com>


Both ipoib_add_port() and ipoib_mcast_join_task() query the GID at 
index 0 to setup the ipoib_dev_priv structure's local_gid and the 
net_device structure's dev_addr. There does not appear to be a way for 
ipoib_mcast_join_task() to be executed before ipoib_add_port() 
completes. Therefore, the work done in ipoib_mcast_join_task() appears 
to be redundant.

Signed-off-by: James Lentini <jlentini at netapp.com>

--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-08 14:34:15.000000000 -0400
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-08 14:34:07.000000000 -0400
@@ -523,11 +523,6 @@ void ipoib_mcast_join_task(struct work_s
 	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
 		return;
 
-	if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid))
-		ipoib_warn(priv, "ib_gid_entry_get() failed\n");
-	else
-		memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid));
-
 	{
 		struct ib_port_attr attr;
 

From ardavis at ichips.intel.com  Tue May  8 12:09:02 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 08 May 2007 12:09:02 -0700
Subject: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work
In-Reply-To: <054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com>
References: <462E13A6.3030207@lfbs.rwth-aachen.de>
	<462E1DFE.5010703@Sun.COM>	<46305D0A.5020900@lfbs.rwth-aachen.de>
	<4630EFDE.8070608@Sun.COM>	<464044D4.5010501@lfbs.rwth-aachen.de>
	<054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com>
Message-ID: <4640CACE.8070201@ichips.intel.com>

Jeff Squyres wrote:

> I'm forwarding this to the OpenFabrics general list -- as it just  
> came up the other day, we know that Open MPI's UDAPL support works on  
> Solaris, but we have done little/no testing of it on OFED (I  
> personally know almost nothing about UDPAL).
>
> Can the UDAPL OFED wizards shed any light on the error messages that  
> are listed below?  In particular, these seem to be worrisome:
>
>>  setup_listener Permission denied
>
>  setup_listener Address already in use

These failures are from rdma_cm_bind indicating the port is already 
bound to this IA address. How are you creating the service point?
dat_psp_create or dat_psp_create_any? If it is psp_create_any then you 
will see some failures until it  gets to a free port. That is normal. 
Just make sure your create call returns DAT_SUCCESS.

>>  create_qp Address already in use
>
This is a real problem with the bind: the port is already in use. Not sure 
why this would fail since the current version of OFED uDAPL uses a 
wildcard port when binding and uses the address from the open;  I 
remember an issue a while back with rdma_cm and wildcard ports. What 
version of OFED are you using?

-arlin
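
The "some failures until it gets to a free port" behaviour can be
pictured with ordinary TCP sockets. This is an analogy only, not the
uDAPL or rdma_cm code: try successive ports, treat EADDRINUSE as
"keep going", and stop at the first bind that succeeds. try_bind()
and bind_first_free() are hypothetical names.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try to bind a TCP socket to the given port on the wildcard address.
 * Returns the socket fd, or -1 with errno set by bind()/socket(). */
static int try_bind(uint16_t port)
{
    struct sockaddr_in sa;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    sa.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
        int saved = errno;      /* close() may clobber errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Scan [first, last] and return a socket bound to the first free port.
 * "Address already in use" is expected along the way and not fatal. */
static int bind_first_free(uint16_t first, uint16_t last, uint16_t *chosen)
{
    for (uint32_t p = first; p <= last; p++) {
        int fd = try_bind((uint16_t)p);

        if (fd >= 0) {
            *chosen = (uint16_t)p;
            return fd;
        }
        if (errno != EADDRINUSE)
            return -1;          /* a real error, stop scanning */
    }
    errno = EADDRINUSE;         /* every port in the range was taken */
    return -1;
}
```

With a scan like this, intermediate bind failures are noise; only the
final return value says whether a service point was created.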


From mst at dev.mellanox.co.il  Tue May  8 12:14:49 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 22:14:49 +0300
Subject: [ofa-general] [PATCH for-2.6.22] IB/mthca: fix RESET to ERR
	transition
Message-ID: <20070508191449.GA10845@mellanox.co.il>

According to IB spec, QP can be moved from RESET to ERROR state,
but mthca firmware does not support this. Work around this
by moving the QP to INIT with dummy parameters first.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

--

diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index 1c6b63a..dfcb038 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -295,7 +295,7 @@ static int to_mthca_st(int transport)
 	}
 }
 
-static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr,
+static void store_attrs(struct mthca_sqp *sqp, const struct ib_qp_attr *attr,
 			int attr_mask)
 {
 	if (attr_mask & IB_QP_PKEY_INDEX)
@@ -327,7 +327,7 @@ static void init_port(struct mthca_dev *dev, int port)
 		mthca_warn(dev, "INIT_IB returned status %02x.\n", status);
 }
 
-static __be32 get_hw_access_flags(struct mthca_qp *qp, struct ib_qp_attr *attr,
+static __be32 get_hw_access_flags(struct mthca_qp *qp, const struct ib_qp_attr *attr,
 				  int attr_mask)
 {
 	u8 dest_rd_atomic;
@@ -510,7 +510,7 @@ out:
 	return err;
 }
 
-static int mthca_path_set(struct mthca_dev *dev, struct ib_ah_attr *ah,
+static int mthca_path_set(struct mthca_dev *dev, const struct ib_ah_attr *ah,
 			  struct mthca_qp_path *path, u8 port)
 {
 	path->g_mylmc     = ah->src_path_bits & 0x7f;
@@ -538,12 +538,12 @@ static int mthca_path_set(struct mthca_dev *dev, struct ib_ah_attr *ah,
 	return 0;
 }
 
-int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask,
-		    struct ib_udata *udata)
+static int __mthca_modify_qp(struct ib_qp *ibqp,
+			     const struct ib_qp_attr *attr, int attr_mask,
+			     enum ib_qp_state cur_state, enum ib_qp_state new_state)
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
-	enum ib_qp_state cur_state, new_state;
 	struct mthca_mailbox *mailbox;
 	struct mthca_qp_param *qp_param;
 	struct mthca_qp_context *qp_context;
@@ -551,60 +551,6 @@ int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask,
 	u8 status;
 	int err = -EINVAL;
 
-	mutex_lock(&qp->mutex);
-
-	if (attr_mask & IB_QP_CUR_STATE) {
-		cur_state = attr->cur_qp_state;
-	} else {
-		spin_lock_irq(&qp->sq.lock);
-		spin_lock(&qp->rq.lock);
-		cur_state = qp->state;
-		spin_unlock(&qp->rq.lock);
-		spin_unlock_irq(&qp->sq.lock);
-	}
-
-	new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state;
-
-	if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask)) {
-		mthca_dbg(dev, "Bad QP transition (transport %d) "
-			  "%d->%d with attr 0x%08x\n",
-			  qp->transport, cur_state, new_state,
-			  attr_mask);
-		goto out;
-	}
-
-	if (cur_state == new_state && cur_state == IB_QPS_RESET) {
-		err = 0;
-		goto out;
-	}
-
-	if ((attr_mask & IB_QP_PKEY_INDEX) &&
-	     attr->pkey_index >= dev->limits.pkey_table_len) {
-		mthca_dbg(dev, "P_Key index (%u) too large. max is %d\n",
-			  attr->pkey_index, dev->limits.pkey_table_len-1);
-		goto out;
-	}
-
-	if ((attr_mask & IB_QP_PORT) &&
-	    (attr->port_num == 0 || attr->port_num > dev->limits.num_ports)) {
-		mthca_dbg(dev, "Port number (%u) is invalid\n", attr->port_num);
-		goto out;
-	}
-
-	if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC &&
-	    attr->max_rd_atomic > dev->limits.max_qp_init_rdma) {
-		mthca_dbg(dev, "Max rdma_atomic as initiator %u too large (max is %d)\n",
-			  attr->max_rd_atomic, dev->limits.max_qp_init_rdma);
-		goto out;
-	}
-
-	if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC &&
-	    attr->max_dest_rd_atomic > 1 << dev->qp_table.rdb_shift) {
-		mthca_dbg(dev, "Max rdma_atomic as responder %u too large (max %d)\n",
-			  attr->max_dest_rd_atomic, 1 << dev->qp_table.rdb_shift);
-		goto out;
-	}
-
 	mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL);
 	if (IS_ERR(mailbox)) {
 		err = PTR_ERR(mailbox);
@@ -878,7 +824,98 @@ int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask,
 
 out_mailbox:
 	mthca_free_mailbox(dev, mailbox);
+out:
+	return err;
+}
+
+static const struct ib_qp_attr mthca_qp_attr = { .port_num = 1};
+static const int mthca_qp_attr_mask_table[IB_QPT_UD + 1] = {
+		[IB_QPT_UD]  = (IB_QP_PKEY_INDEX		|
+				IB_QP_PORT			|
+				IB_QP_QKEY),
+		[IB_QPT_UC]  = (IB_QP_PKEY_INDEX		|
+				IB_QP_PORT			|
+				IB_QP_ACCESS_FLAGS),
+		[IB_QPT_RC]  = (IB_QP_PKEY_INDEX		|
+				IB_QP_PORT			|
+				IB_QP_ACCESS_FLAGS),
+		[IB_QPT_SMI] = (IB_QP_PKEY_INDEX		|
+				IB_QP_QKEY),
+		[IB_QPT_GSI] = (IB_QP_PKEY_INDEX		|
+				IB_QP_QKEY),
+};
+
+int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask,
+		    struct ib_udata *udata)
+{
+	struct mthca_dev *dev = to_mdev(ibqp->device);
+	struct mthca_qp *qp = to_mqp(ibqp);
+	enum ib_qp_state cur_state, new_state;
+	int err = -EINVAL;
+
+	mutex_lock(&qp->mutex);
+	if (attr_mask & IB_QP_CUR_STATE) {
+		cur_state = attr->cur_qp_state;
+	} else {
+		spin_lock_irq(&qp->sq.lock);
+		spin_lock(&qp->rq.lock);
+		cur_state = qp->state;
+		spin_unlock(&qp->rq.lock);
+		spin_unlock_irq(&qp->sq.lock);
+	}
+
+	new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state;
+
+	if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask)) {
+		mthca_dbg(dev, "Bad QP transition (transport %d) "
+			  "%d->%d with attr 0x%08x\n",
+			  qp->transport, cur_state, new_state,
+			  attr_mask);
+		goto out;
+	}
+
+	if ((attr_mask & IB_QP_PKEY_INDEX) &&
+	     attr->pkey_index >= dev->limits.pkey_table_len) {
+		mthca_dbg(dev, "P_Key index (%u) too large. max is %d\n",
+			  attr->pkey_index, dev->limits.pkey_table_len-1);
+		goto out;
+	}
+
+	if ((attr_mask & IB_QP_PORT) &&
+	    (attr->port_num == 0 || attr->port_num > dev->limits.num_ports)) {
+		mthca_dbg(dev, "Port number (%u) is invalid\n", attr->port_num);
+		goto out;
+	}
+
+	if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC &&
+	    attr->max_rd_atomic > dev->limits.max_qp_init_rdma) {
+		mthca_dbg(dev, "Max rdma_atomic as initiator %u too large (max is %d)\n",
+			  attr->max_rd_atomic, dev->limits.max_qp_init_rdma);
+		goto out;
+	}
+
+	if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC &&
+	    attr->max_dest_rd_atomic > 1 << dev->qp_table.rdb_shift) {
+		mthca_dbg(dev, "Max rdma_atomic as responder %u too large (max %d)\n",
+			  attr->max_dest_rd_atomic, 1 << dev->qp_table.rdb_shift);
+		goto out;
+	}
+
+	if (cur_state == new_state && cur_state == IB_QPS_RESET) {
+		err = 0;
+		goto out;
+	}
+
+	if (cur_state == IB_QPS_RESET && new_state == IB_QPS_ERR) {
+		err = __mthca_modify_qp(ibqp, &mthca_qp_attr,
+				       	mthca_qp_attr_mask_table[ibqp->qp_type],
+					IB_QPS_RESET, IB_QPS_INIT);
+		if (err)
+			goto out;
+		cur_state = IB_QPS_INIT;
+	}
 
+	err = __mthca_modify_qp(ibqp, attr, attr_mask, cur_state, new_state);
 out:
 	mutex_unlock(&qp->mutex);
 	return err;

-- 
MST
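
The shape of the workaround above, reduced to a user-space sketch:
when the lower layer rejects one transition, the wrapper performs it
as two transitions that are accepted. This is not the driver code.
fake_qp, hw_modify() and modify_qp() are hypothetical, and hw_modify()
only simulates a firmware that refuses RESET -> ERR.

```c
#include <errno.h>

enum qp_state { QPS_RESET, QPS_INIT, QPS_ERR };

struct fake_qp {
    enum qp_state state;
};

/* Stand-in for the firmware command: it refuses RESET -> ERR directly
 * and rejects calls whose 'cur' does not match the real state. */
static int hw_modify(struct fake_qp *qp, enum qp_state cur,
                     enum qp_state next)
{
    if (qp->state != cur)
        return -EINVAL;
    if (cur == QPS_RESET && next == QPS_ERR)
        return -EINVAL;
    qp->state = next;
    return 0;
}

/* Wrapper: split the unsupported transition into RESET -> INIT
 * followed by INIT -> ERR; pass everything else straight through. */
static int modify_qp(struct fake_qp *qp, enum qp_state next)
{
    enum qp_state cur = qp->state;
    int err;

    if (cur == QPS_RESET && next == QPS_ERR) {
        err = hw_modify(qp, QPS_RESET, QPS_INIT);
        if (err)
            return err;
        cur = QPS_INIT;
    }
    return hw_modify(qp, cur, next);
}
```

In the real patch the intermediate step also needs plausible dummy
attributes (port, P_Key index and so on), which is what the static
attribute and mask tables provide.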


From ardavis at ichips.intel.com  Tue May  8 12:37:25 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 08 May 2007 12:37:25 -0700
Subject: [OMPI devel] OMPI over OFA udapl (was Re: [ofa-general] OpenMPI
	and	RDMA-CM)
In-Reply-To: <4640BA18.7060104@open-mpi.org>
References: <1177791386.4615.8.camel@stevo-laptop>	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>	<1178575761.30571.175.camel@stevo-desktop>	<1178578346.30571.183.camel@stevo-desktop>	<BB132B4A-DED0-41B4-8B44-6ADAB5F9740B@cisco.com>	<1178632029.3056.3.camel@stevo-desktop>	<73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com>	<1178641872.6064.12.camel@stevo-desktop>	<1178645361.6064.35.camel@stevo-desktop>
	<4640BA18.7060104@open-mpi.org>
Message-ID: <4640D175.2080207@ichips.intel.com>

Andrew Friedley wrote:

>
> Yep. My memory is dim, but I think that was OFED's DAPL, or it was in 
> the generic part of DAPL that all implementations seem to share.
>
> As hinted by the comment (I wrote it by the way), I think the best 
> solution would be if dat_ep_query() returned the port number 
> correctly.  Most of uDAPL seems to just pass around pointers to 
> internal data structures (which I'm not sure is the best idea in the 
> world), so it didn't seem like a trivial fix to me at the time.  I 
> remember considering reporting this as a bug, but I didn't because the 
> uDAPL standard didn't seem to enforce any requirements on passing the 
> port number around with the address, so it technically wasn't wrong.

I tend to agree. The common code should query the actual provider to get 
local address for the EP and not assume it is the IA address from the 
HCA used during the open. They are after all different bindings. I will 
take a look at the code.

>
> Was the OFED uDAPL code switched from something else to RDMA CM at 
> some point?  I'm almost certain I was running fine on OFED's uDAPL at 
> one point (in fact, a lot of the uDAPL BTL development I did was using 
> the OFED stack).

We had several iterations while waiting for the rdma_cm code to become 
available. I am guessing that you were using one of the early versions 
that used sockets to set up the QPs.

-arlin


From swise at opengridcomputing.com  Tue May  8 12:52:59 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 08 May 2007 14:52:59 -0500
Subject: [OMPI devel] OMPI over OFA udapl (was Re: [ofa-general]
	OpenMPI and	RDMA-CM)
In-Reply-To: <4640BA18.7060104@open-mpi.org>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
	<1178578346.30571.183.camel@stevo-desktop>
	<BB132B4A-DED0-41B4-8B44-6ADAB5F9740B@cisco.com>
	<1178632029.3056.3.camel@stevo-desktop>
	<73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com>
	<1178641872.6064.12.camel@stevo-desktop>
	<1178645361.6064.35.camel@stevo-desktop>
	<4640BA18.7060104@open-mpi.org>
Message-ID: <1178653979.11455.4.camel@stevo-desktop>

On Tue, 2007-05-08 at 13:57 -0400, Andrew Friedley wrote:
> Steve Wise wrote:
> >> Well I've tried OMPI on ofed-1.2 udapl today and it doesn't work.  I'm
> >> debugging now.
> >>
> > 
> > Here's part of the problem (from ompi/btl/udapl/btl_udapl.c):
> > 
> >     /* TODO - big bad evil hack! */
> >     /* uDAPL doesn't ever seem to keep track of ports with addresses.  This
> >        becomes a problem when we use dat_ep_query() to obtain a remote address
> >        on an endpoint.  In this case, both the DAT_PORT_QUAL and the sin_port
> >        field in the DAT_SOCK_ADDR are 0, regardless of the actual port. This is
> >        a problem when we have more than one uDAPL process per IA - these
> >        processes will have exactly the same address, as the port is all
> >        we have to differentiate who is who.  Thus, our uDAPL EP -> BTL EP
> >        matching algorithm will break down.
> > 
> >        So, we insert the port we used for our PSP into the DAT_SOCK_ADDR for
> >        this IA.  uDAPL then conveniently propagates this to where we need it.
> >      */
> >     ((struct sockaddr_in*)attr.ia_address_ptr)->sin_port = htons(port);
> >     ((struct sockaddr_in*)&btl->udapl_addr.addr)->sin_port = htons(port);
> > 
> > The OMPI code stuffs the port chosen by udapl for a listening endpoint
> > into the ia address memory (which is owned by the udapl layer btw).
> > There's a slight problem with that:  The OFA udapl openib_cma code binds
> > cm_id's to this ia_address regularly.  When an hca is opened, a cm_id is
> > bound to this address to obtain the local hca port number and gid that
> > is being used.  In addition, a cm_id is bound to this address each time
> > an endpoint is created (either at ep_create time or ep_connect time).
> > So that ia_address field is used by the dapl cm to create local
> > cm_ids...  Since the port was always zero, the rdma-cma would choose a
> > unique port for each cm_id bound to that address.
> > 
> > But when OMPI sets the port field to non-zero, the rdma_cma fails all the
> > subsequent rdma_bind_addr() calls because the port is already in use.
> > 
> > Perhaps this hack really is a workaround for a DAPL bug where somebody's
> > dapl wasn't tracking port numbers correctly?
> 
> Yep. My memory is dim, but I think that was OFED's DAPL, or it was in 
> the generic part of DAPL that all implementations seem to share.
> 
> As hinted by the comment (I wrote it by the way), I think the best 
> solution would be if dat_ep_query() returned the port number correctly. 
>   Most of uDAPL seems to just pass around pointers to internal data 
> structures (which I'm not sure is the best idea in the world), so it 
> didn't seem like a trivial fix to me at the time.  I remember 
> considering reporting this as a bug, but I didn't because the uDAPL 
> standard didn't seem to enforce any requirements on passing the port 
> number around with the address, so it technically wasn't wrong.
> 
> Was the OFED uDAPL code switched from something else to RDMA CM at some 
> point?  I'm almost certain I was running fine on OFED's uDAPL at one 
> point (in fact, a lot of the uDAPL BTL development I did was using the 
> OFED stack).

Yes, the OFA uDAPL was changed from using the ib-cm to the rdma-cm a
while back.  Perhaps you ran on the ib-cm version?  And, the rdma-cma
started using port numbers and enforcing uniqueness even more recently I
think.

Perhaps Don Kerr has some insight on how the Sun uDAPL behaves?  Should
OMPI still need this hack?


Steve.


From swise at opengridcomputing.com  Tue May  8 12:55:24 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 08 May 2007 14:55:24 -0500
Subject: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work
In-Reply-To: <4640CACE.8070201@ichips.intel.com>
References: <462E13A6.3030207@lfbs.rwth-aachen.de>
	<462E1DFE.5010703@Sun.COM>	<46305D0A.5020900@lfbs.rwth-aachen.de>
	<4630EFDE.8070608@Sun.COM>	<464044D4.5010501@lfbs.rwth-aachen.de>
	<054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com>
	<4640CACE.8070201@ichips.intel.com>
Message-ID: <1178654124.11455.6.camel@stevo-desktop>

BTW: 

We have 2 threads on this topic.

See my other emails describing the issue...


On Tue, 2007-05-08 at 12:09 -0700, Arlin Davis wrote:
> Jeff Squyres wrote:
> 
> > I'm forwarding this to the OpenFabrics general list -- as it just  
> > came up the other day, we know that Open MPI's UDAPL support works on  
> > Solaris, but we have done little/no testing of it on OFED (I  
> > personally know almost nothing about UDAPL).
> >
> > Can the UDAPL OFED wizards shed any light on the error messages that  
> > are listed below?  In particular, these seem to be worrisome:
> >
> >>  setup_listener Permission denied
> >
> >  setup_listener Address already in use
> 
> These failures are from rdma_cm_bind indicating the port is already 
> bound to this IA address. How are you creating the service point?
> dat_psp_create or dat_psp_create_any? If it is psp_create_any then you 
> will see some failures until it  gets to a free port. That is normal. 
> Just make sure your create call returns DAT_SUCCESS.
> 
> >>  create_qp Address already in use
> >
> This is a real problem with the bind, port is already in use. Not sure 
> why this would fail since the current version of OFED uDAPL uses a 
> wildcard port when binding and uses the address from the open;  I 
> remember an issue a while back with rdma_cm and wildcard ports. What 
> version of OFED are you using?
> 
> -arlin
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From ardavis at ichips.intel.com  Tue May  8 12:55:39 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 08 May 2007 12:55:39 -0700
Subject: OMPI over OFA udapl (was Re: [OMPI devel] [ofa-general] OpenMPI
	and RDMA-CM)
In-Reply-To: <1178645361.6064.35.camel@stevo-desktop>
References: <1177791386.4615.8.camel@stevo-laptop>	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>	<1178575761.30571.175.camel@stevo-desktop>	<1178578346.30571.183.camel@stevo-desktop>	<BB132B4A-DED0-41B4-8B44-6ADAB5F9740B@cisco.com>	<1178632029.3056.3.camel@stevo-desktop>	<73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com>	<1178641872.6064.12.camel@stevo-desktop>
	<1178645361.6064.35.camel@stevo-desktop>
Message-ID: <4640D5BB.8060104@ichips.intel.com>

Steve Wise wrote:

>1) OMPI shouldn't be stepping on the ia_address.
>  
>
strongly agree

>2) OFA udapl should probably be explicitly binding local cm_ids to port
>zero.
>  
>
current implementation uses port zero on ep_create and ia_open.

>3) dat_ep_query() should be returning the correct port numbers...
>  
>
agree. I also don't like that the common code hands out pointers to 
internal structures...

I will work on a patch that will ensure compatibility with other 
providers but allow the openib_cma provider to return the port on the 
ep_query.

-arlin
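
On point 1, the safer pattern for a consumer is to copy the
provider-owned address and set the port in the copy, so that later
binds against the original still see port zero. A minimal sketch with
plain sockaddr_in follows. addr_with_port() is a hypothetical helper,
not a DAT or Open MPI function.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>

/* Return a private copy of a provider-owned address with the port
 * filled in. The provider's structure is only read, never written. */
static struct sockaddr_in addr_with_port(const struct sockaddr_in *provider_addr,
                                         uint16_t port)
{
    struct sockaddr_in copy = *provider_addr;

    copy.sin_port = htons(port);
    return copy;
}
```

The const qualifier on the argument makes the ownership rule visible
to the compiler as well as to the reader.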


From swise at opengridcomputing.com  Tue May  8 12:58:08 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 08 May 2007 14:58:08 -0500
Subject: OMPI over OFA udapl (was Re: [OMPI devel] [ofa-general]
	OpenMPI and RDMA-CM)
In-Reply-To: <4640D5BB.8060104@ichips.intel.com>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
	<1178578346.30571.183.camel@stevo-desktop>
	<BB132B4A-DED0-41B4-8B44-6ADAB5F9740B@cisco.com>
	<1178632029.3056.3.camel@stevo-desktop>
	<73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com>
	<1178641872.6064.12.camel@stevo-desktop>
	<1178645361.6064.35.camel@stevo-desktop>
	<4640D5BB.8060104@ichips.intel.com>
Message-ID: <1178654288.11455.8.camel@stevo-desktop>

On Tue, 2007-05-08 at 12:55 -0700, Arlin Davis wrote:
> Steve Wise wrote:
> 
> >1) OMPI shouldn't be stepping on the ia_address.
> >  
> >
> strongly agree
> 
> >2) OFA udapl should probably be explicitly binding local cm_ids to port
> >zero.
> >  
> >
> current implementation uses port zero on ep_create and ia_open.
> 
> >3) dat_ep_query() should be returning the correct port numbers...
> >  
> >
> agree. I also don't like that the common code hands out pointers to 
> internal structures...
> 
> I will work on a patch that will ensure compatibility with other 
> providers but allow the openib_cma provider to return the port on the 
> ep_query.
> 
> -arlin

Cool!  I'll test this over iWARP when you have something...


From jsquyres at cisco.com  Tue May  8 12:34:12 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 8 May 2007 15:34:12 -0400
Subject: Fwd: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work
References: <4640CACE.8070201@ichips.intel.com>
Message-ID: <BF658CBB-A50B-45AC-A426-5B3539D26270@cisco.com>

Re-forwarding to OMPI list; because of the OMPI list anti-spam  
checks, Arlin's post didn't make it through to our list when he  
originally posted.


Begin forwarded message:

> From: Arlin Davis <ardavis at ichips.intel.com>
> Date: May 8, 2007 3:09:02 PM EDT
> To: Jeff Squyres <jsquyres at cisco.com>
> Cc: Open MPI Users <users at open-mpi.org>, OpenFabrics General  
> <general at lists.openfabrics.org>
> Subject: Re: [ofa-general] Re: [OMPI users] openMPI over uDAPL  
> doesn't work
>
> Jeff Squyres wrote:
>
>> I'm forwarding this to the OpenFabrics general list -- as it just   
>> came up the other day, we know that Open MPI's UDAPL support works  
>> on  Solaris, but we have done little/no testing of it on OFED (I   
>> personally know almost nothing about UDAPL).
>>
>> Can the UDAPL OFED wizards shed any light on the error messages  
>> that are listed below?  In particular, these seem to be worrisome:
>>
>>>  setup_listener Permission denied
>>
>>  setup_listener Address already in use
>
> These failures are from rdma_cm_bind indicating the port is already  
> bound to this IA address. How are you creating the service point?
> dat_psp_create or dat_psp_create_any? If it is psp_create_any then  
> you will see some failures until it  gets to a free port. That is  
> normal. Just make sure your create call returns DAT_SUCCESS.
>
>>>  create_qp Address already in use
>>
> This is a real problem with the bind, port is already in use. Not  
> sure why this would fail since the current version of OFED uDAPL  
> uses a wildcard port when binding and uses the address from the  
> open;  I remember an issue a while back with rdma_cm and wildcard  
> ports. What version of OFED are you using?
>
> -arlin


-- 
Jeff Squyres
Cisco Systems


From ggrundstrom at NetEffect.com  Tue May  8 13:14:39 2007
From: ggrundstrom at NetEffect.com (Glenn Grundstrom)
Date: Tue, 8 May 2007 15:14:39 -0500
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <1178646690.6064.38.camel@stevo-desktop>
References: <1177791386.4615.8.camel@stevo-laptop><98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com><1178575761.30571.175.camel@stevo-desktop><95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com><463FCA42.3000104@indiana.edu>
	<1178646690.6064.38.camel@stevo-desktop>
Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC06F967F9@venom2>

Steve/Andrew,

It sounds like you've got a handle on the development side.  I'd be
willing to provide additional testing resources.  Let me know how I can
help test.  I'll also check on the 3rd party agreement.

Glenn.

-----Original Message-----
From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Steve Wise
Sent: Tuesday, May 08, 2007 12:52 PM
To: Andrew Friedley
Cc: Devel at openmpi.org; Open MPI Developers; general; Asgeir Eiriksson
Subject: Re: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM

> >> Chelsio's gonna pony up the resources to get this work done asap.
> >> Do you have any thoughts on how we can collaborate on this project?
> >> I'm familiar with mvapich, not ompi, so I need to go do some homework.
> >> But any pointers on the connection setup design for ompi would be
> >> great.
> > 
> > Excellent!  Let's chat on the phone tomorrow -- this would probably 
> > be the best way to start.
> > 
> > We will need a signed Open MPI 3rd party contribution agreement from
> > either you and/or Chelsio (whoever owns the intellectual property 
> > that will be contributed).  See
> > http://www.open-mpi.org/community/contribute/.
> > 
> >> I'm CCing devel at openmpi.org in case anyone else is interested in 
> >> helping.  Chelsio can provide rnic HW...
> > 
> > Anyone else here interested?  Free hardware!  :-)
> 
> Hmm I'm interested.  I've already done some work switching over to 
> RDMA CM for some research stuff I've been doing; it's not publicly 
> accessible w/o the 3rd party agreement.  I can help answer questions 
> on what exactly needs to change, and do some testing.
> 
> Andrew

I'm working on the 3rd party agreement from chelsio now.  Stay tuned!

Steve.




From swise at opengridcomputing.com  Tue May  8 13:15:53 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 08 May 2007 15:15:53 -0500
Subject: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work
In-Reply-To: <4640CACE.8070201@ichips.intel.com>
References: <462E13A6.3030207@lfbs.rwth-aachen.de>
	<462E1DFE.5010703@Sun.COM>	<46305D0A.5020900@lfbs.rwth-aachen.de>
	<4630EFDE.8070608@Sun.COM>	<464044D4.5010501@lfbs.rwth-aachen.de>
	<054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com>
	<4640CACE.8070201@ichips.intel.com>
Message-ID: <1178655353.11455.14.camel@stevo-desktop>

> > Can the UDAPL OFED wizards shed any light on the error messages that  
> > are listed below?  In particular, these seem to be worrisome:
> >
> >>  setup_listener Permission denied
> >
> >  setup_listener Address already in use
> 
> These failures are from rdma_cm_bind indicating the port is already 
> bound to this IA address. How are you creating the service point?
> dat_psp_create or dat_psp_create_any? If it is psp_create_any then you 
> will see some failures until it  gets to a free port. That is normal. 
> Just make sure your create call returns DAT_SUCCESS.
> 

Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down
and let the rdma-cma pick an available port number?
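
For comparison, this is what "pass a port of zero and let the lower
layer pick" looks like with ordinary TCP sockets. It is an analogy,
not rdma_cm code: bind with sin_port = 0, then ask which port was
assigned. bind_any_port() is a hypothetical name.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind to port 0 (wildcard) and report the port the kernel assigned.
 * Returns the socket fd, or -1 on failure. */
static int bind_any_port(uint16_t *assigned)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof sa;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    sa.sin_port = 0;            /* zero means "pick one for me" */
    if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0 ||
        getsockname(fd, (struct sockaddr *)&sa, &len) < 0) {
        close(fd);
        return -1;
    }
    *assigned = ntohs(sa.sin_port);
    return fd;
}
```

There is no scan and no transient "Address already in use" error; the
caller learns the chosen port after the fact.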


From Don.Kerr at Sun.COM  Tue May  8 13:21:18 2007
From: Don.Kerr at Sun.COM (Donald Kerr)
Date: Tue, 08 May 2007 16:21:18 -0400
Subject: [OMPI devel] OMPI over OFA udapl (was Re: [ofa-general]	OpenMPI
	and	RDMA-CM)
In-Reply-To: <1178653979.11455.4.camel@stevo-desktop>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
	<1178578346.30571.183.camel@stevo-desktop>
	<BB132B4A-DED0-41B4-8B44-6ADAB5F9740B@cisco.com>
	<1178632029.3056.3.camel@stevo-desktop>
	<73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com>
	<1178641872.6064.12.camel@stevo-desktop>
	<1178645361.6064.35.camel@stevo-desktop>
	<4640BA18.7060104@open-mpi.org>
	<1178653979.11455.4.camel@stevo-desktop>
Message-ID: <4640DBBE.5000601@Sun.COM>


Steve Wise wrote:

>On Tue, 2007-05-08 at 13:57 -0400, Andrew Friedley wrote:
>  
>
>>Steve Wise wrote:
>>    
>>
>>>>Well I've tried OMPI on ofed-1.2 udapl today and it doesn't work.  I'm
>>>>debugging now.
>>>>
>>>>        
>>>>
>>>Here's part of the problem (from ompi/btl/udapl/btl_udapl.c):
>>>
>>>    /* TODO - big bad evil hack! */
>>>    /* uDAPL doesn't ever seem to keep track of ports with addresses.  This
>>>       becomes a problem when we use dat_ep_query() to obtain a remote address
>>>       on an endpoint.  In this case, both the DAT_PORT_QUAL and the sin_port
>>>       field in the DAT_SOCK_ADDR are 0, regardless of the actual port. This is
>>>       a problem when we have more than one uDAPL process per IA - these
>>>       processes will have exactly the same address, as the port is all
>>>       we have to differentiate who is who.  Thus, our uDAPL EP -> BTL EP
>>>       matching algorithm will break down.
>>>
>>>       So, we insert the port we used for our PSP into the DAT_SOCK_ADDR for
>>>       this IA.  uDAPL then conveniently propagates this to where we need it.
>>>     */
>>>    ((struct sockaddr_in*)attr.ia_address_ptr)->sin_port = htons(port);
>>>    ((struct sockaddr_in*)&btl->udapl_addr.addr)->sin_port = htons(port);
>>>
>>>The OMPI code stuffs the port chosen by udapl for a listening endpoint
>>>into the ia address memory (which is owned by the udapl layer btw).
>>>There's a slight problem with that:  The OFA udapl openib_cma code binds
>>>cm_id's to this ia_address regularly.  When an hca is opened, a cm_id is
>>>bound to this address to obtain the local hca port number and gid that
>>>is being used.  In addition, a cm_id is bound to this address each time
>>>an endpoint is created (either at ep_create time or ep_connect time).
>>>So that ia_address field is used by the dapl cm to create local
>>>cm_ids...  Since the port was always zero, the rdma-cma would choose a
>>>unique port for each cm_id bound to that address.
>>>
>>>But when OMPI sets the port field to non-zero, the rdma_cma fails all the
>>>subsequent rdma_bind_addr() calls because the port is already in use.
>>>
>>>Perhaps this hack really is a workaround for a DAPL bug where somebody's
>>>dapl wasn't tracking port numbers correctly?
>>>      
>>>
>>Yep. My memory is dim, but I think that was OFED's DAPL, or it was in 
>>the generic part of DAPL that all implementations seem to share.
>>
>>As hinted by the comment (I wrote it by the way), I think the best 
>>solution would be if dat_ep_query() returned the port number correctly. 
>>  Most of uDAPL seems to just pass around pointers to internal data 
>>structures (which I'm not sure is the best idea in the world), so it 
>>didn't seem like a trivial fix to me at the time.  I remember 
>>considering reporting this as a bug, but I didn't because the uDAPL 
>>standard didn't seem to enforce any requirements on passing the port 
>>number around with the address, so it technically wasn't wrong.
>>
>>Was the OFED uDAPL code switched from something else to RDMA CM at some 
>>point?  I'm almost certain I was running fine on OFED's uDAPL at one 
>>point (in fact, a lot of the uDAPL BTL development I did was using the 
>>OFED stack).
>>    
>>
>
>Yes, the OFA uDAPL was changed from using the ib-cm to the rdma-cm a
>while back.  Perhaps you ran on the ib-cm version?  And, the rdma-cma
>started using port numbers and enforcing uniqueness even more recently I
>think.
>
>Perhaps Don Kerr has some insight on how the Sun uDAPL behaves?  Should
>OMPI still need this hack?
>  
>
From what I recall (and Andrew can probably set me straight if I get 
this wrong), this hack was included because we were not able to pull the 
remote port from dat_ep_query. If dat_ep_query supplies that data then 
we could probably do away with the hack.

I have not heard back from the developer at Sun who implemented uDAPL 
for Solaris. My thought is that it was also based on the older ib-cm, but 
I will confirm. I submitted a bug against Solaris uDAPL to provide the 
port via dat_ep_query a while back and it looks like it has been fixed; I 
just have not tested this because we weren't using it.

-DON

>
>Steve.
>
>  
>


From arthur.jones at qlogic.com  Tue May  8 13:25:58 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 08 May 2007 13:25:58 -0700
Subject: [ofa-general] [PATCH] IB/ipath -- shadow the gpio_mask register
Message-ID: <20070508202557.27647.47035.stgit@bauxite.internal.keyresearch.com>

GPIO interrupts which have the gpio_mask bits set are
no longer unlikely.  Remove the unlikely annotation in
the interrupt handler and keep a shadow copy of the
gpio_mask register.

Signed-off-by: Arthur Jones <arthur.jones at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_iba6120.c |    7 +++----
 drivers/infiniband/hw/ipath/ipath_intr.c    |    7 +++----
 drivers/infiniband/hw/ipath/ipath_kernel.h  |    2 ++
 drivers/infiniband/hw/ipath/ipath_verbs.c   |   12 ++++++------
 4 files changed, 14 insertions(+), 14 deletions(-)
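
The shadowing idea, sketched outside the driver: keep a software copy
of a register that only this code writes, update the copy and write
it out, and let hot paths consult the copy instead of reading the
hardware. toy_dev and the functions below are invented for
illustration; the driver itself uses dd->ipath_gpio_mask and
ipath_write_kreg() as shown in the diff.

```c
#include <stdint.h>

/* Toy device: 'reg' stands in for a register that is slow to read,
 * 'shadow' is the software copy, 'reads' counts register reads. */
struct toy_dev {
    uint64_t reg;
    uint64_t shadow;
    unsigned reads;
};

static uint64_t reg_read(struct toy_dev *d)
{
    d->reads++;
    return d->reg;
}

static void reg_write(struct toy_dev *d, uint64_t v)
{
    d->reg = v;
}

/* Set mask bits via the shadow: update the copy, then write it out.
 * No read of the register is needed for the read-modify-write. */
static void mask_set_bits(struct toy_dev *d, uint64_t bits)
{
    d->shadow |= bits;
    reg_write(d, d->shadow);
}

/* Hot path (for example an interrupt handler): use the shadow. */
static uint64_t mask_get(const struct toy_dev *d)
{
    return d->shadow;
}
```

This only stays correct if every write to the register goes through
the shadow, which is why the patch touches each site that modifies
the mask.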

diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c
index fb58154..c21d99b 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6120.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c
@@ -747,7 +747,6 @@ static void ipath_pe_quiet_serdes(struct ipath_devdata *dd)
 
 static int ipath_pe_intconfig(struct ipath_devdata *dd)
 {
-	u64 val;
 	u32 chiprev;
 
 	/*
@@ -760,9 +759,9 @@ static int ipath_pe_intconfig(struct ipath_devdata *dd)
 	if ((chiprev & INFINIPATH_R_CHIPREVMINOR_MASK) > 1) {
 		/* Rev2+ reports extra errors via internal GPIO pins */
 		dd->ipath_flags |= IPATH_GPIO_ERRINTRS;
-		val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask);
-		val |= IPATH_GPIO_ERRINTR_MASK;
-		ipath_write_kreg( dd, dd->ipath_kregs->kr_gpio_mask, val);
+		dd->ipath_gpio_mask |= IPATH_GPIO_ERRINTR_MASK;
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask,
+				 dd->ipath_gpio_mask);
 	}
 	return 0;
 }
diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index 45d0331..a90d3b5 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -1056,7 +1056,7 @@ irqreturn_t ipath_intr(int irq, void *data)
 			gpiostatus &= ~(1 << IPATH_GPIO_PORT0_BIT);
 			chk0rcv = 1;
 		}
-		if (unlikely(gpiostatus)) {
+		if (gpiostatus) {
 			/*
 			 * Some unexpected bits remain. If they could have
 			 * caused the interrupt, complain and clear.
@@ -1065,9 +1065,8 @@ irqreturn_t ipath_intr(int irq, void *data)
 			 * GPIO interrupts, possibly on a "three strikes"
 			 * basis.
 			 */
-			u32 mask;
-			mask = ipath_read_kreg32(
-				dd, dd->ipath_kregs->kr_gpio_mask);
+			const u32 mask = (u32) dd->ipath_gpio_mask;
+
 			if (mask & gpiostatus) {
 				ipath_dbg("Unexpected GPIO IRQ bits %x\n",
 				  gpiostatus & mask);
diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
index e900c25..12194f3 100644
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -397,6 +397,8 @@ struct ipath_devdata {
 	unsigned long ipath_pioavailshadow[8];
 	/* shadow of kr_gpio_out, for rmw ops */
 	u64 ipath_gpio_out;
+	/* shadow the gpio mask register */
+	u64 ipath_gpio_mask;
 	/* kr_revision shadow */
 	u64 ipath_revision;
 	/*
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
index 12933e7..bb70845 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -1387,13 +1387,12 @@ static int enable_timer(struct ipath_devdata *dd)
 	 * processing.
 	 */
 	if (dd->ipath_flags & IPATH_GPIO_INTR) {
-		u64 val;
 		ipath_write_kreg(dd, dd->ipath_kregs->kr_debugportselect,
 				 0x2074076542310ULL);
 		/* Enable GPIO bit 2 interrupt */
-		val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask);
-		val |= (u64) (1 << IPATH_GPIO_PORT0_BIT);
-		ipath_write_kreg( dd, dd->ipath_kregs->kr_gpio_mask, val);
+		dd->ipath_gpio_mask |= (u64) (1 << IPATH_GPIO_PORT0_BIT);
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask,
+				 dd->ipath_gpio_mask);
 	}
 
 	init_timer(&dd->verbs_timer);
@@ -1412,8 +1411,9 @@ static int disable_timer(struct ipath_devdata *dd)
                 u64 val;
                 /* Disable GPIO bit 2 interrupt */
                 val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask);
-                val &= ~((u64) (1 << IPATH_GPIO_PORT0_BIT));
-                ipath_write_kreg( dd, dd->ipath_kregs->kr_gpio_mask, val);
+		dd->ipath_gpio_mask &= ~((u64) (1 << IPATH_GPIO_PORT0_BIT));
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask,
+				 dd->ipath_gpio_mask);
 		/*
 		 * We might want to undo changes to debugportselect,
 		 * but how?


From mst at dev.mellanox.co.il  Tue May  8 13:28:36 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 23:28:36 +0300
Subject: [ofa-general] Re: [PATCHv3 1/2] ipoib: handle pkey change events
In-Reply-To: <4640A8BD.4000405@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508162727.GD5845@mellanox.co.il>
	<4640A8BD.4000405@voltaire.com>
Message-ID: <20070508202836.GG10845@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCHv3 1/2] ipoib: handle pkey change events

This should have been 1 of 2, is that right?
-- 
MST


From mst at dev.mellanox.co.il  Tue May  8 13:33:18 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 23:33:18 +0300
Subject: [ofa-general] Re: [PATCHv3 1/2] core: uncached "find gid" and "find
	pkey" queries
In-Reply-To: <4640A911.8000609@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com>
	<20070508150907.GU21591@mellanox.co.il>
	<46409504.9000802@voltaire.com>
	<20070508152650.GA5845@mellanox.co.il>
	<4640A911.8000609@voltaire.com>
Message-ID: <20070508203318.GH10845@mellanox.co.il>


Don't put whitespace after [ and before ].

+		device->pkey_tbl_len[ port_index ] = tprops->pkey_tbl_len;
+		device->gid_tbl_len[ port_index ] = tprops->gid_tbl_len;

whitespace damage here

+		tbl_len = device->gid_tbl_len[ port - start_port(device) ];

and here

+	tbl_len = device->pkey_tbl_len[ port_num - start_port(device) ];

and here

<plug>
Have you read the boring list of rules?
http://git.openfabrics.org/~mst/boring.txt
</plug>

-- 
MST


From tziporet at mellanox.co.il  Tue May  8 13:45:58 2007
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 8 May 2007 23:45:58 +0300
Subject: [ofa-general] OFED 1.2 RC3 delayed
Message-ID: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>

Hi All,
In the OFED meeting yesterday we decided that OFED 1.2 RC3 will be out
once the bugs 420 and 465 are resolved.
The tentative date is Thursday, May 10.

If these bugs are not fixed this week, we will have to reconsider
this decision next week.

Tziporet Koren
Software Director
Mellanox Technologies
mailto: tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380


From swise at opengridcomputing.com  Tue May  8 13:56:05 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 08 May 2007 15:56:05 -0500
Subject: [ofa-general] OFED 1.2 RC3 delayed
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
Message-ID: <1178657765.11455.32.camel@stevo-desktop>

I would like the group to consider including changes needed to OMPI
and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.  

This will provide OMPI support over iwarp devices via udapl until we can
get rdma-cm support added to OMPI.  


Steve.
  

On Tue, 2007-05-08 at 23:45 +0300, Tziporet Koren wrote:
> Hi All, 
> In the OFED meeting yesterday we decided that OFED 1.2 RC3 will be out
> once the bugs 420 and 465 are resolved. 
> Tentative date is Thursday may 10. 
> 
> If these bugs will not be fixed this week we will have to reconsider
> this decision next week.
> 
> Tziporet Koren 
> Software Director 
> Mellanox Technologies 
> mailto: tziporet at mellanox.co.il
> Tel +972-4-9097200, ext 380
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From ardavis at ichips.intel.com  Tue May  8 13:56:46 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 08 May 2007 13:56:46 -0700
Subject: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work
In-Reply-To: <1178655353.11455.14.camel@stevo-desktop>
References: <462E13A6.3030207@lfbs.rwth-aachen.de>	
	<462E1DFE.5010703@Sun.COM>	<46305D0A.5020900@lfbs.rwth-aachen.de>	
	<4630EFDE.8070608@Sun.COM>	<464044D4.5010501@lfbs.rwth-aachen.de>	
	<054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com>	
	<4640CACE.8070201@ichips.intel.com>
	<1178655353.11455.14.camel@stevo-desktop>
Message-ID: <4640E40E.6000803@ichips.intel.com>

Steve Wise wrote:

>>>Can the UDAPL OFED wizards shed any light on the error messages that  
>>>are listed below?  In particular, these seem to be worrysome:
>>>
>>>      
>>>
>>>> setup_listener Permission denied
>>>>        
>>>>
>>> setup_listener Address already in use
>>>      
>>>
>>These failures are from rdma_cm_bind indicating the port is already 
>>bound to this IA address. How are you creating the service point?
>>dat_psp_create or dat_psp_create_any? If it is psp_create_any then you 
>>will see some failures until it  gets to a free port. That is normal. 
>>Just make sure your create call returns DAT_SUCCESS.
>>
>>    
>>
>
>Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down
>and let the rdma-cma pick an available port number?
>
>  
>
That would work fine if the provider interface allowed the port to be 
returned. I  will take a look and see if we can improve on this common 
code seeding method.
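
For comparison, the plain BSD sockets API shows the behaviour Steve is
asking for: bind to port 0, then ask the stack which port it picked. This
is only an analogy, not uDAPL or librdmacm code, and the helper name
bind_any_port() is made up for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind a TCP socket to loopback with port 0 so the
 * kernel picks a free port, then read the chosen port back.
 * Returns the port in host byte order, or -1 on error.
 */
int bind_any_port(void)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int port = -1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* 0 = let the stack choose */

	/* getsockname() is what "returns the port" to the caller */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
	    getsockname(fd, (struct sockaddr *)&addr, &len) == 0)
		port = ntohs(addr.sin_port);

	close(fd);
	return port;
}
```

The point of the sketch is the second call: passing zero down is only
useful if the layer above has some way to learn the port that was chosen.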


From tom.mitchell at qlogic.com  Tue May  8 14:31:03 2007
From: tom.mitchell at qlogic.com (Tom Mitchell)
Date: Tue, 8 May 2007 14:31:03 -0700
Subject: [ofa-general] Re: Incorrect atomic usage in ipath driver
In-Reply-To: <1178594516.14928.62.camel@localhost.localdomain>
References: <1178594516.14928.62.camel@localhost.localdomain>
Message-ID: <20070508213103.GC19539@pathscale.com>


Thank you for the feedback.
Part of this code path is necessary for an early revision
of the hardware.  It may be important on ppc.  As it is now,
it is a don't care on x86_64.

The responsible engineers here have seen this and will investigate further.


On May 08 01:21, Benjamin Herrenschmidt wrote:
> Hi !
> 
> So I see this construct:
> 
> 	/* There is already a thread processing this queue. */
> 	if (test_and_set_bit(0, &dd->ipath_rcv_pending))
> 		goto bail;
> 
> 	.../...
> 
> done:
> 	clear_bit(0, &dd->ipath_rcv_pending);
> 	smp_mb__after_clear_bit();
> 
> So that's basically an attempt at doing a spinlock. The problem is your
> barrier is wrong at the end. Better would be:
> 
> 
> done:
> 	smp_mb__before_clear_bit();
> 	clear_bit(0, &dd->ipath_rcv_pending);
> 
> 
> Though it's still less optimal that doing:
> 
> 	if (!spin_trylock(...))
> 		goto bail;
> 
> 	.../...
> 
> done:
> 	spin_unlock(...)
> 
> If you really want to stick to bitops, then you may want to look at
> Nick's upcoming patches adding some bitops with appropriate lock
> semantics.
> 
> Cheers,
> Ben.
> 
> 

-- 
	T o m   M i t c h e l l
	Host Solutions Group
	QLogic Corp.  http://www.qlogic.com
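
As a user-space illustration of the ordering Ben describes, here is a
minimal sketch using C11 atomics rather than the kernel bitops. The flag
and function names are hypothetical stand-ins for ipath_rcv_pending; the
point is only that the take side needs acquire ordering and the release
side needs its barrier before the clear, not after it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the ipath_rcv_pending bit. */
static atomic_flag rcv_pending = ATOMIC_FLAG_INIT;

/*
 * Try to take the "lock". Returns true if we got it, false if another
 * thread already holds it. Acquire ordering keeps the critical section
 * from being reordered ahead of the flag being set.
 */
bool rcv_trylock(void)
{
	return !atomic_flag_test_and_set_explicit(&rcv_pending,
						  memory_order_acquire);
}

/*
 * Drop the "lock". Release ordering makes every store done in the
 * critical section visible before the flag can be seen as clear, which
 * is the role of a barrier placed before the clear.
 */
void rcv_unlock(void)
{
	atomic_flag_clear_explicit(&rcv_pending, memory_order_release);
}
```

A caller would bail out when rcv_trylock() returns false and call
rcv_unlock() on the way out, mirroring the bail/done labels above.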


From stan.smith at intel.com  Tue May  8 16:28:29 2007
From: stan.smith at intel.com (Smith, Stan)
Date: Tue, 8 May 2007 16:28:29 -0700
Subject: [ofa-general] WinOF 1.0 RC1 available for testing
Message-ID: <55CE0347B98FCA468923E5FBC25CB4DCE8AC77@orsmsx413.amr.corp.intel.com>


For those who have an interest in Windows, otherwise flames > /dev/null

Please find WinOF 1.0 RC1 (Windows OpenFabrics Release Candidate #1)
'WinOF_1-0_RC1.zip' at 

 http://www.openfabrics.org/~woody/WinOF_1.0/
 
Suggestions can be directed towards 'ofw at lists.openfabrics.org'.


From arlin.r.davis at intel.com  Tue May  8 16:44:56 2007
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Tue, 8 May 2007 16:44:56 -0700
Subject: [ofa-general] RE: OMPI over OFA udapl [PATCH]
In-Reply-To: <1178654288.11455.8.camel@stevo-desktop>
Message-ID: <000001c791ca$e2853330$4297070a@amr.corp.intel.com>


>-----Original Message-----
>From: Steve Wise [mailto:swise at opengridcomputing.com]
>
>Cool!  I'll test this over iWARP when you have something...


Steve,

Can you try this patch? I also included a change to dtest to query the EP. Make sure you have the
latest librdmacm fixes. A late-breaking fix just went in for a bug that overwrote the port during
the rdma_bind_addr call.

Signed-off-by: Arlin Davis <ardavis at ichips.intel.com>

diff --git a/dapl/openib_cma/dapl_ib_cm.c b/dapl/openib_cma/dapl_ib_cm.c
index 8bdd0eb..4639f87 100755
--- a/dapl/openib_cma/dapl_ib_cm.c
+++ b/dapl/openib_cma/dapl_ib_cm.c
@@ -891,6 +891,9 @@ dapls_ib_accept_connection(IN DAT_CR_HANDLE cr_handle,
 		goto bail;
 	}
 
+	/* save remote port for ep query */
+	ep_ptr->param.remote_port_qual = rdma_get_dst_port(cr_conn->cm_id);
+
 	return DAT_SUCCESS;
 bail:
 	rdma_reject(cr_conn->cm_id, NULL, 0);
diff --git a/dapl/openib_cma/dapl_ib_qp.c b/dapl/openib_cma/dapl_ib_qp.c
old mode 100644
new mode 100755
index f1e1671..69c49a9
--- a/dapl/openib_cma/dapl_ib_qp.c
+++ b/dapl/openib_cma/dapl_ib_qp.c
@@ -179,14 +179,19 @@ DAT_RETURN dapls_ib_qp_alloc(IN DAPL_IA *ia_ptr,
 	conn->route_retries = dapl_os_get_env_val("DAPL_CM_ROUTE_RETRY_COUNT", 
 						    IB_ROUTE_RETRY_COUNT);
 
+	/* setup up ep->param to reference the bound local address and port */
+	ep_ptr->param.local_ia_address_ptr = &cm_id->route.addr.src_addr;
+	ep_ptr->param.local_port_qual = rdma_get_src_port(cm_id);
+		
 	ep_ptr->qp_handle = conn;
 	ep_ptr->qp_state = IB_QP_STATE_INIT;
 	
 	dapl_dbg_log(DAPL_DBG_TYPE_EP,
-		     " qp_alloc: qpn %p sq %d,%d rq %d,%d\n", 
+		     " qp_alloc: qpn %p sq %d,%d rq %d,%d port=%d\n", 
 		     ep_ptr->qp_handle->cm_id->qp->qp_num,
 		     qp_create.cap.max_send_wr,qp_create.cap.max_send_sge,
-		     qp_create.cap.max_recv_wr,qp_create.cap.max_recv_sge);
+		     qp_create.cap.max_recv_wr,qp_create.cap.max_recv_sge,
+		     ep_ptr->param.local_port_qual);
 	
 	return DAT_SUCCESS;
 bail:
diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c
old mode 100644
new mode 100755
index ec8a82e..68a3cbe
--- a/test/dtest/dtest.c
+++ b/test/dtest/dtest.c
@@ -106,6 +106,7 @@ static DAT_RMR_CONTEXT    rmr_context_send_msg;
 static DAT_VLEN           registered_size_send_msg;
 static DAT_VADDR          registered_addr_send_msg;
 static DAT_EP_ATTR        ep_attr;
+static DAT_EP_PARAM       ep_param;
 char                      hostname[256] = {0};
 char                      provider[256] = DAPL_PROVIDER;
 
@@ -329,6 +330,25 @@ main(int argc, char **argv)
        } else
                LOGPRINTF("%d EP created %p \n", getpid(), h_ep);
 
+       /* query EP for local address information */
+	ret = dat_ep_query( h_ep, DAT_EP_FIELD_ALL, &ep_param );
+	if(ret != DAT_SUCCESS) {
+               fprintf(stderr, "%d Error dat_ep_query: %s\n",
+                       getpid(),DT_RetToString(ret));
+               goto cleanup;
+       } else
+               LOGPRINTF("%d EP queried %p \n", getpid(), h_ep);
+
+       printf("%d Query EP: family %d port %d addr %d.%d.%d.%d (%d)\n", getpid(),
+		((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_family,
+		((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_port,
+		((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_addr.s_addr >> 0 & 0xff,
+		((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_addr.s_addr >> 8 & 0xff,
+		((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_addr.s_addr >> 16 & 0xff,
+		((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_addr.s_addr >> 24 & 0xff,
+		ep_param.local_port_qual); 
+       fflush(stdout);
+            
        /*
         * register message buffers, establish connection, and
         * exchange DMA RMR information info via messages
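
A side note on the printf in the dtest hunk: it prints sin_port as stored
and pulls the octets out of s_addr by shifting, and both depend on host
byte order. A minimal sketch of a byte-order-independent alternative,
using the standard inet_ntop() and ntohs() calls, is below. The helper
name format_sockaddr_in() is hypothetical and not part of dtest.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: format an IPv4 sockaddr as "a.b.c.d:port".
 * inet_ntop() and ntohs() take care of byte order, so the text is the
 * same on little- and big-endian hosts.
 * Returns 0 on success, -1 on a non-IPv4 address or a short buffer.
 */
int format_sockaddr_in(const struct sockaddr_in *sin, char *buf, size_t len)
{
	char ip[INET_ADDRSTRLEN];
	int n;

	if (sin->sin_family != AF_INET)
		return -1;
	if (!inet_ntop(AF_INET, &sin->sin_addr, ip, sizeof(ip)))
		return -1;

	n = snprintf(buf, len, "%s:%u", ip, (unsigned)ntohs(sin->sin_port));
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

In dtest this could be fed the pointer from ep_param.local_ia_address_ptr
after the dat_ep_query call, assuming the provider returns an AF_INET
address there.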


From sobebike at gmail.com  Tue May  8 17:37:38 2007
From: sobebike at gmail.com (SoBeBike)
Date: Tue, 8 May 2007 19:37:38 -0500
Subject: [ofa-general] abi_compat
Message-ID: <dedddf10705081737q4647e743j3e068bde27e8ba78@mail.gmail.com>

Under what conditions is the field abi_compat of struct ibv_context
set to non-zero? I'm encountering a situation where it is set when
coding to verbs on a clean OFED 1.2 install. Seems odd that it would
be set since I suspected that it would only occur for verbs 1.0/1.1
compatibility.

thanks!


From rdreier at cisco.com  Tue May  8 17:51:08 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 08 May 2007 17:51:08 -0700
Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_core:  fix qp free sync
In-Reply-To: <1178617072.17477.45.camel@mtls03> (Eli Cohen's message of "Tue,
	08 May 2007 12:37:22 +0300")
References: <1178617072.17477.45.camel@mtls03>
Message-ID: <ada64726cqr.fsf@cisco.com>

Thanks, I rolled this up into what I'll merge upstream.


From rdreier at cisco.com  Tue May  8 17:57:24 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 08 May 2007 17:57:24 -0700
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts
In-Reply-To: <1178606876.17477.15.camel@mtls03> (Eli Cohen's message of "Tue,
	08 May 2007 09:47:26 +0300")
References: <1178551555.17477.0.camel@mtls03> <adar6ps60zn.fsf@cisco.com>
	<1178606876.17477.15.camel@mtls03>
Message-ID: <ada1whq6cgb.fsf@cisco.com>

 > > @@ -249,8 +249,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
 > >  		}
 > >  	}
 > >  
 > > -	if (eqes_found)
 > > -		eq_set_ci(eq, 1);
 > > +	eq_set_ci(eq, 1);
 > >  
 > >  	return eqes_found;
 > >  }

 > This will not ensure arming all EQs for all interrupts and we will face
 > the same problem of losing interrupts.

I don't understand what you mean here.  How is unconditionally arming
the EQ at the end of mlx4_eq_int() any different from your proposed
patch?  My change calls eq_set_ci() at the end of every call to
mlx4_eq_int(), and your change calls eq_set_ci() after every call to
mlx4_eq_int().  I'm probably missing something obvious, but I really
don't see it right now.

 - R.


From rdreier at cisco.com  Tue May  8 17:58:51 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 08 May 2007 17:58:51 -0700
Subject: [ofa-general] [PATCH] IB/ipath -- shadow the gpio_mask register
In-Reply-To: <20070508202557.27647.47035.stgit@bauxite.internal.keyresearch.com>
	(Arthur Jones's message of "Tue, 08 May 2007 13:25:58 -0700")
References: <20070508202557.27647.47035.stgit@bauxite.internal.keyresearch.com>
Message-ID: <adawszi4xtg.fsf@cisco.com>

 > GPIO interrupts which have the gpio_mask bits set are
 > no longer unlikely.  remove the unlikely annotation in
 > the interrupt handler and keep a shadow copy of the
 > gpio_mask register.

A better changelog would be appreciated here... I can see deleting the
unlikely() if it's no longer appropriate, but why keep a shadow copy
of the register?  Because this is now a hotter path and you want to
avoid the MMIO read?

 - R.
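
To make the question concrete, here is a small stand-alone model of the
shadow-register pattern. It is a sketch under assumptions, not the ipath
code: struct fake_dev and the helpers are invented for illustration, the
"register" is just a struct field, and it assumes the driver is the only
writer of the register so the shadow never goes stale.

```c
#include <stdint.h>

struct fake_dev {
	uint64_t hw_mask;	/* stands in for the MMIO gpio_mask register */
	uint64_t shadow_mask;	/* software copy, kept equal to hw_mask */
	unsigned mmio_reads;	/* counts simulated register reads */
};

/* Simulated accessors; on real hardware the read is the slow part. */
uint64_t reg_read(struct fake_dev *d)
{
	d->mmio_reads++;
	return d->hw_mask;
}

void reg_write(struct fake_dev *d, uint64_t val)
{
	d->hw_mask = val;
}

/* Read-modify-write on the shadow, then a single write to hardware. */
void mask_set_bits(struct fake_dev *d, uint64_t bits)
{
	d->shadow_mask |= bits;
	reg_write(d, d->shadow_mask);
}

void mask_clear_bits(struct fake_dev *d, uint64_t bits)
{
	d->shadow_mask &= ~bits;
	reg_write(d, d->shadow_mask);
}

/* Hot path (e.g. an interrupt handler): no register read at all. */
uint64_t mask_get(const struct fake_dev *d)
{
	return d->shadow_mask;
}
```

With this layout, every update costs one write and the interrupt path
costs no register access, which is presumably the motivation being asked
about.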


From rdreier at cisco.com  Tue May  8 18:06:56 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 08 May 2007 18:06:56 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <adasla64xfz.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will merge the mlx4 drivers for new Mellanox adapters:

Roland Dreier (3):
      IB/uverbs: Export ib_umem_get()/ib_umem_release() to modules
      IB: Put rlimit accounting struct in struct ib_umem
      IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters

 drivers/infiniband/Kconfig                       |    7 +
 drivers/infiniband/Makefile                      |    1 +
 drivers/infiniband/core/Makefile                 |    4 +-
 drivers/infiniband/core/device.c                 |    2 +
 drivers/infiniband/core/{uverbs_mem.c => umem.c} |  153 ++-
 drivers/infiniband/core/uverbs.h                 |    6 +-
 drivers/infiniband/core/uverbs_cmd.c             |   60 +-
 drivers/infiniband/core/uverbs_main.c            |   11 +-
 drivers/infiniband/hw/amso1100/c2_provider.c     |   42 +-
 drivers/infiniband/hw/amso1100/c2_provider.h     |    1 +
 drivers/infiniband/hw/cxgb3/iwch_provider.c      |   28 +-
 drivers/infiniband/hw/cxgb3/iwch_provider.h      |    1 +
 drivers/infiniband/hw/ehca/ehca_classes.h        |    1 +
 drivers/infiniband/hw/ehca/ehca_iverbs.h         |    3 +-
 drivers/infiniband/hw/ehca/ehca_mrmw.c           |   69 +-
 drivers/infiniband/hw/ipath/ipath_mr.c           |   38 +-
 drivers/infiniband/hw/ipath/ipath_verbs.h        |    5 +-
 drivers/infiniband/hw/mlx4/Kconfig               |    9 +
 drivers/infiniband/hw/mlx4/Makefile              |    3 +
 drivers/infiniband/hw/mlx4/ah.c                  |  100 ++
 drivers/infiniband/hw/mlx4/cq.c                  |  525 +++++++++
 drivers/infiniband/hw/mlx4/doorbell.c            |  216 ++++
 drivers/infiniband/hw/mlx4/mad.c                 |  339 ++++++
 drivers/infiniband/hw/mlx4/main.c                |  651 +++++++++++
 drivers/infiniband/hw/mlx4/mlx4_ib.h             |  285 +++++
 drivers/infiniband/hw/mlx4/mr.c                  |  184 +++
 drivers/infiniband/hw/mlx4/qp.c                  | 1294 ++++++++++++++++++++++
 drivers/infiniband/hw/mlx4/srq.c                 |  334 ++++++
 drivers/infiniband/hw/mlx4/user.h                |   92 ++
 drivers/infiniband/hw/mthca/mthca_provider.c     |   38 +-
 drivers/infiniband/hw/mthca/mthca_provider.h     |    1 +
 drivers/net/Kconfig                              |   14 +
 drivers/net/Makefile                             |    1 +
 drivers/net/mlx4/Makefile                        |    4 +
 drivers/net/mlx4/alloc.c                         |  179 +++
 drivers/net/mlx4/catas.c                         |   70 ++
 drivers/net/mlx4/cmd.c                           |  429 +++++++
 drivers/net/mlx4/cq.c                            |  254 +++++
 drivers/net/mlx4/eq.c                            |  696 ++++++++++++
 drivers/net/mlx4/fw.c                            |  775 +++++++++++++
 drivers/net/mlx4/fw.h                            |  167 +++
 drivers/net/mlx4/icm.c                           |  379 +++++++
 drivers/net/mlx4/icm.h                           |  135 +++
 drivers/net/mlx4/intf.c                          |  165 +++
 drivers/net/mlx4/main.c                          |  936 ++++++++++++++++
 drivers/net/mlx4/mcg.c                           |  380 +++++++
 drivers/net/mlx4/mlx4.h                          |  348 ++++++
 drivers/net/mlx4/mr.c                            |  479 ++++++++
 drivers/net/mlx4/pd.c                            |  102 ++
 drivers/net/mlx4/profile.c                       |  238 ++++
 drivers/net/mlx4/qp.c                            |  280 +++++
 drivers/net/mlx4/reset.c                         |  181 +++
 drivers/net/mlx4/srq.c                           |  227 ++++
 include/linux/mlx4/cmd.h                         |  178 +++
 include/linux/mlx4/cq.h                          |  123 ++
 include/linux/mlx4/device.h                      |  331 ++++++
 include/linux/mlx4/doorbell.h                    |   97 ++
 include/linux/mlx4/driver.h                      |   59 +
 include/linux/mlx4/qp.h                          |  288 +++++
 include/linux/mlx4/srq.h                         |   42 +
 include/rdma/ib_umem.h                           |   81 ++
 include/rdma/ib_verbs.h                          |   28 +-
 62 files changed, 11951 insertions(+), 218 deletions(-)
 rename drivers/infiniband/core/{uverbs_mem.c => umem.c} (59%)
 create mode 100644 drivers/infiniband/hw/mlx4/Kconfig
 create mode 100644 drivers/infiniband/hw/mlx4/Makefile
 create mode 100644 drivers/infiniband/hw/mlx4/ah.c
 create mode 100644 drivers/infiniband/hw/mlx4/cq.c
 create mode 100644 drivers/infiniband/hw/mlx4/doorbell.c
 create mode 100644 drivers/infiniband/hw/mlx4/mad.c
 create mode 100644 drivers/infiniband/hw/mlx4/main.c
 create mode 100644 drivers/infiniband/hw/mlx4/mlx4_ib.h
 create mode 100644 drivers/infiniband/hw/mlx4/mr.c
 create mode 100644 drivers/infiniband/hw/mlx4/qp.c
 create mode 100644 drivers/infiniband/hw/mlx4/srq.c
 create mode 100644 drivers/infiniband/hw/mlx4/user.h
 create mode 100644 drivers/net/mlx4/Makefile
 create mode 100644 drivers/net/mlx4/alloc.c
 create mode 100644 drivers/net/mlx4/catas.c
 create mode 100644 drivers/net/mlx4/cmd.c
 create mode 100644 drivers/net/mlx4/cq.c
 create mode 100644 drivers/net/mlx4/eq.c
 create mode 100644 drivers/net/mlx4/fw.c
 create mode 100644 drivers/net/mlx4/fw.h
 create mode 100644 drivers/net/mlx4/icm.c
 create mode 100644 drivers/net/mlx4/icm.h
 create mode 100644 drivers/net/mlx4/intf.c
 create mode 100644 drivers/net/mlx4/main.c
 create mode 100644 drivers/net/mlx4/mcg.c
 create mode 100644 drivers/net/mlx4/mlx4.h
 create mode 100644 drivers/net/mlx4/mr.c
 create mode 100644 drivers/net/mlx4/pd.c
 create mode 100644 drivers/net/mlx4/profile.c
 create mode 100644 drivers/net/mlx4/qp.c
 create mode 100644 drivers/net/mlx4/reset.c
 create mode 100644 drivers/net/mlx4/srq.c
 create mode 100644 include/linux/mlx4/cmd.h
 create mode 100644 include/linux/mlx4/cq.h
 create mode 100644 include/linux/mlx4/device.h
 create mode 100644 include/linux/mlx4/doorbell.h
 create mode 100644 include/linux/mlx4/driver.h
 create mode 100644 include/linux/mlx4/qp.h
 create mode 100644 include/linux/mlx4/srq.h
 create mode 100644 include/rdma/ib_umem.h


From rdreier at cisco.com  Tue May  8 18:08:54 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 08 May 2007 18:08:54 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070508141727.GR21591@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 8 May 2007 17:17:27 +0300")
References: <20070508141727.GR21591@mellanox.co.il>
Message-ID: <adaodku4xcp.fsf@cisco.com>

 > 	libmlx4 has this comments:
 > 
 > 	/* FIXME flush wc buffers */
 > 
 > 	and since it does *not* currently actually flush the buffers, if we
 > 	enable WC for blueflame, WRs gets mixed in the WC buffer, and QP gets
 > 	corrupted/stuck.
 > 
 > It seems we should we have arch.h under mthca and stick
 > some macro like wc_wmb() in there.
 > 
 > Or, would infiniband/arch.h under libibverbs be a better place?

I think we should add it to infiniband/arch.h but then also have an
#ifndef wc_wmb in libmlx4 until libibverbs with the define is ubiquitous.
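
Roughly what such a fallback could look like is sketched below. The
choice of instruction is an assumption made for illustration (sfence on
x86_64, a compiler full barrier elsewhere), and copy_and_flush() is a
hypothetical caller, not libmlx4 code.

```c
/* Fallback only; a libibverbs that already defines wc_wmb() wins. */
#ifndef wc_wmb
#  if defined(__x86_64__)
#    define wc_wmb() __asm__ __volatile__("sfence" ::: "memory")
#  else
#    define wc_wmb() __sync_synchronize()	/* GCC/Clang full barrier */
#  endif
#endif

/*
 * Hypothetical caller: copy a work request into a (write-combining)
 * buffer, then flush so that a later copy cannot be merged with it.
 */
void copy_and_flush(volatile unsigned long *dst, const unsigned long *src,
		    unsigned n)
{
	unsigned i;

	for (i = 0; i < n; i++)
		dst[i] = src[i];
	wc_wmb();
}
```

Once every deployed libibverbs ships the macro in infiniband/arch.h, the
#ifndef block simply compiles away.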

 > If WC is not enabled, userspace can avoid the flush - so, should we
 > return such a bit as part of kernel abi?

Maybe, although I'm not sure it's worth it.

 - R.


From rdreier at cisco.com  Tue May  8 18:10:26 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 08 May 2007 18:10:26 -0700
Subject: [ofa-general] Re: no SRQ empty check in libmthca and in mlx2 kernel
	modules
In-Reply-To: <200705081238.41255.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Tue, 8 May 2007 12:38:41 +0300")
References: <200705081238.41255.jackm@dev.mellanox.co.il>
Message-ID: <adak5vi4xa5.fsf@cisco.com>

 > It looks to me like there is no check for "no more available WQEs" when posting
 > SRQ reads. See libmlx4/src/srq.c and drivers/infiniband/hw/mlx4/srq.c.
 > There is no check in either place if srq_head = srq_tail, or some equivalent check.

Yes, you're right.
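
For reference, the kind of check being discussed can be modelled as
follows. This is a toy free-list queue written for this note, with made-up
names and sizes; it only shows where a head == tail test would sit, under
the assumption that one entry is always kept in reserve as the list tail.

```c
#include <errno.h>

#define SRQ_SIZE 4	/* hypothetical queue size */

struct toy_srq {
	int next[SRQ_SIZE];	/* free-list links between WQE indices */
	int head;		/* first free WQE */
	int tail;		/* last free WQE, never handed out */
};

void srq_init(struct toy_srq *q)
{
	int i;

	for (i = 0; i < SRQ_SIZE - 1; i++)
		q->next[i] = i + 1;
	q->next[SRQ_SIZE - 1] = -1;
	q->head = 0;
	q->tail = SRQ_SIZE - 1;
}

/* Returns the WQE index used, or -ENOMEM when no WQE is available. */
int srq_post_recv(struct toy_srq *q)
{
	int idx;

	if (q->head == q->tail)		/* the check in question */
		return -ENOMEM;

	idx = q->head;
	q->head = q->next[idx];
	return idx;
}

/* Called when a completion hands WQE 'idx' back to the queue. */
void srq_free_wqe(struct toy_srq *q, int idx)
{
	q->next[q->tail] = idx;
	q->next[idx] = -1;
	q->tail = idx;
}
```

Without the head == tail test, a post on an exhausted queue would follow
the tail's link and hand out an index that is not on the free list.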


From weiny2 at llnl.gov  Tue May  8 18:49:38 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 8 May 2007 18:49:38 -0700
Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager
Message-ID: <20070508184938.311b1c8f.weiny2@llnl.gov>

I would like to submit to the list a performance manager which I have been
working on for OpenSM.

It is implemented as the first proposed architecture model set forth by Hal (as
an integrated thread in OpenSM).  As such, it works fine on our small test
cluster, but there is some concern about its scalability.

I have extended this architecture with an idea of my own: a pluggable module
for the "event database".  With this interface one could write their own data
reduction, logging, and tracking methods.  Here at LLNL I propose to use this
to add counter and subnet events directly to our management database, which is
used to show system status to our operators.  Other installations might prefer
other methods of logging, SNMP for example.  This patch includes a "reference"
implementation of this "event database" which stores the information
internally until the user requests a "dump".
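
To give a feel for the plug-in idea, here is a minimal sketch of such an
interface with an in-memory backend that holds records until a dump is
requested. It is not the interface from the patch (see osm_event_db.h
there); the struct, function names and record layout are assumptions made
up for this illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical plug-in interface: one table of function pointers. */
struct event_db_ops {
	void *(*create)(void);
	int   (*add_port_counters)(void *db, uint64_t node_guid, uint8_t port,
				   uint64_t xmit_data, uint64_t rcv_data);
	int   (*dump)(void *db, FILE *out);	/* returns records written */
	void  (*destroy)(void *db);
};

/* Reference backend: keep everything in memory until dump() is called. */
#define MEM_DB_MAX 64

struct mem_rec { uint64_t guid; uint8_t port; uint64_t xmit, rcv; };
struct mem_db  { struct mem_rec rec[MEM_DB_MAX]; int count; };

static void *mem_create(void)
{
	return calloc(1, sizeof(struct mem_db));
}

static int mem_add(void *p, uint64_t guid, uint8_t port,
		   uint64_t xmit, uint64_t rcv)
{
	struct mem_db *db = p;

	if (db->count == MEM_DB_MAX)
		return -1;	/* table full */
	db->rec[db->count++] = (struct mem_rec){ guid, port, xmit, rcv };
	return 0;
}

static int mem_dump(void *p, FILE *out)
{
	struct mem_db *db = p;
	int i;

	for (i = 0; i < db->count; i++)
		fprintf(out, "0x%016llx port %u xmit %llu rcv %llu\n",
			(unsigned long long)db->rec[i].guid,
			(unsigned)db->rec[i].port,
			(unsigned long long)db->rec[i].xmit,
			(unsigned long long)db->rec[i].rcv);
	return db->count;
}

static void mem_destroy(void *p)
{
	free(p);
}

const struct event_db_ops mem_event_db = {
	mem_create, mem_add, mem_dump, mem_destroy
};
```

A site-specific backend (SQL, SNMP, and so on) would supply its own table
of the same four functions, and the PerfMgr thread would only ever call
through the table.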

Let the flames begin,
Ira Weiny
weiny2 at llnl.gov


>From 4ce288b6a5a371872cf160f6d4e29e768a065cb9 Mon Sep 17 00:00:00 2001
From: Ira K. Weiny <weiny2 at llnl.gov>
Date: Tue, 24 Apr 2007 23:44:15 -0700
Subject: [PATCH] OpenSM Proposed Perf Manager

   Features include:
      * Create "PerfMgr" thread and sweep all ports on the subnet every
        sweep_time seconds
      * port counter clear on overflow
      * pluggable architecture for the "event" database
      * Output machine and human readable output in the default event database
        dump
      * Control using the "perfmgr" command in the console

   Known Issues
      * Not tested at scale.
      * Event database should record trap events and other "interesting" subnet
        events.
      * port counter log warnings should be configurable, not hard coded.
      * partitions are not handled yet.
      * Code might not be as pristine as I would like

   Enable using --enable-perf-mgr

Signed-off-by: Ira K. Weiny <weiny2 at llnl.gov>
---
 osm/Makefile.am                   |    3 +-
 osm/config/osmvsel.m4             |   26 ++
 osm/configure.in                  |    5 +-
 osm/eventdb/Makefile.am           |   37 ++
 osm/eventdb/autogen.sh            |   15 +
 osm/eventdb/configure.in          |   70 ++++
 osm/eventdb/libibeventdb.map      |    5 +
 osm/eventdb/libibeventdb.spec.in  |   38 ++
 osm/eventdb/libibeventdb.ver      |    9 +
 osm/eventdb/src/ibeventdb.c       |  622 +++++++++++++++++++++++++++++++++
 osm/include/Makefile.am           |    2 +
 osm/include/iba/ib_types.h        |   74 ++++
 osm/include/opensm/osm_base.h     |   23 ++
 osm/include/opensm/osm_event_db.h |  151 ++++++++
 osm/include/opensm/osm_madw.h     |   40 +++
 osm/include/opensm/osm_msgdef.h   |    1 +
 osm/include/opensm/osm_opensm.h   |    4 +
 osm/include/opensm/osm_perfmgr.h  |  223 ++++++++++++
 osm/include/opensm/osm_subnet.h   |   18 +
 osm/opensm.spec.in                |   11 +-
 osm/opensm/Makefile.am            |    5 +-
 osm/opensm/configure.in           |    3 +
 osm/opensm/main.c                 |   19 +
 osm/opensm/osm_console.c          |   78 +++++
 osm/opensm/osm_event_db.c         |  172 +++++++++
 osm/opensm/osm_opensm.c           |   24 ++
 osm/opensm/osm_perfmgr.c          |  686 +++++++++++++++++++++++++++++++++++++
 osm/opensm/osm_subnet.c           |   51 +++
 osm/opensm/osm_trap_rcv.c         |   15 +
 29 files changed, 2425 insertions(+), 5 deletions(-)

diff --git a/osm/Makefile.am b/osm/Makefile.am
index ec66883..32f5f64 100644
--- a/osm/Makefile.am
+++ b/osm/Makefile.am
@@ -1,6 +1,7 @@
 
 # note that order matters: make the libs first then use them 
-SUBDIRS 		= complib libvendor opensm osmtest include
+SUBDIRS 		= complib libvendor opensm osmtest include $(EVENTDB)
+DIST_SUBDIRS = complib libvendor opensm osmtest include eventdb
 
 # this will control the update of the files in order
 MAINTAINERCLEANFILES = Makefile.in aclocal.m4 configure config-h.in 
diff --git a/osm/config/osmvsel.m4 b/osm/config/osmvsel.m4
index 9234f36..ce6039c 100644
--- a/osm/config/osmvsel.m4
+++ b/osm/config/osmvsel.m4
@@ -180,3 +180,29 @@ if test "$disable_libcheck" != "yes"; th
 fi
 # --- END OPENIB_APP_OSMV_CHECK_HEADER ---
 ]) dnl OPENIB_APP_OSMV_CHECK_HEADER
+
+
+AC_DEFUN([OPENIB_OSM_PERF_MGR_SEL], [
+# --- BEGIN OPENIB_OSM_PERF_MGR_SEL ---
+
+dnl enable the perf-mgr
+AC_ARG_ENABLE(perf-mgr,
+[  --enable-perf-mgr Enable the performance manager (default no)],
+   [case $enableval in
+     yes) perf_mgr=yes ;;
+     no)  perf_mgr=no ;;
+   esac],
+   perf_mgr=no)
+if test $perf_mgr = yes; then
+  AC_DEFINE(ENABLE_OSM_PERF_MGR,
+	    1,
+	    [Define as 1 if you want to enable the performance manager])
+  EVENTDB=eventdb
+else
+  EVENTDB=
+fi
+AC_SUBST([EVENTDB])
+
+# --- END OPENIB_OSM_PERF_MGR_SEL ---
+]) dnl OPENIB_OSM_PERF_MGR_SEL
+
diff --git a/osm/configure.in b/osm/configure.in
index eb6552f..94d4483 100644
--- a/osm/configure.in
+++ b/osm/configure.in
@@ -27,11 +27,14 @@ AC_ARG_ENABLE(debug,
 esac],[debug=false])
 AM_CONDITIONAL(DEBUG, test x$debug = xtrue)
 
+dnl select performance manager or not
+OPENIB_OSM_PERF_MGR_SEL
+
 dnl Provide user option to select vendor
 OPENIB_APP_OSMV_SEL
 
 dnl Configure the following subdirs
-AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest include)
+AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest include eventdb)
 
 dnl Create the following Makefiles
 AC_OUTPUT(Makefile)
diff --git a/osm/eventdb/Makefile.am b/osm/eventdb/Makefile.am
new file mode 100644
index 0000000..18f2db9
--- /dev/null
+++ b/osm/eventdb/Makefile.am
@@ -0,0 +1,37 @@
+
+INCLUDES = -I$(srcdir)/../include \
+	   -I$(includedir)/infiniband
+
+lib_LTLIBRARIES = libibeventdb.la
+
+if DEBUG
+DBGFLAGS = -ggdb -D_DEBUG_
+else
+DBGFLAGS = -g
+endif
+
+libibeventdb_la_CFLAGS = -Wall $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -Wno-deprecated-declarations
+
+if HAVE_LD_VERSION_SCRIPT
+    libibeventdb_version_script = -Wl,--version-script=$(srcdir)/libibeventdb.map
+else
+    libibeventdb_version_script =
+endif
+
+libibeventdb_la_SOURCES = src/ibeventdb.c
+libibeventdb_la_LDFLAGS = -version-info $(ibeventdb_api_version) \
+	 -export-dynamic $(libibeventdb_version_script)
+libibeventdb_la_LIBADD = -L../complib $(OSMV_LDADD) -losmcomp
+libibeventdb_la_DEPENDENCIES = $(srcdir)/libibeventdb.map
+
+libibeventdbincludedir = $(includedir)/infiniband/complib
+
+libibeventdbinclude_HEADERS =
+
+# headers are distributed as part of the include dir
+EXTRA_DIST = $(srcdir)/libibeventdb.spec.in $(srcdir)/libibeventdb.map \
+	$(srcdir)/libibeventdb.ver
+
+dist-hook: libibeventdb.spec
+	cp libibeventdb.spec $(distdir)
+
diff --git a/osm/eventdb/autogen.sh b/osm/eventdb/autogen.sh
new file mode 100755
index 0000000..ec20fc5
--- /dev/null
+++ b/osm/eventdb/autogen.sh
@@ -0,0 +1,15 @@
+#! /bin/sh
+
+# Change dir, since the utilities below assume they run in the project dir
+cd ${0%*/*}
+
+# create the config dir if it does not exist
+test -d config || mkdir config
+
+set -x
+(aclocal -I config -I ../config 2>&1 ) && \
+(libtoolize --force --copy) && \
+(autoheader) && \
+(automake --foreign --add-missing --copy) && \
+autoconf
+
diff --git a/osm/eventdb/configure.in b/osm/eventdb/configure.in
new file mode 100644
index 0000000..f5fa345
--- /dev/null
+++ b/osm/eventdb/configure.in
@@ -0,0 +1,70 @@
+dnl Process this file with autoconf to produce a configure script.
+
+AC_PREREQ(2.57)
+AC_INIT(libibeventdb, 1.0.0, openib-general@openib.org)
+AC_CONFIG_AUX_DIR(config)
+AM_CONFIG_HEADER(config.h)
+AM_INIT_AUTOMAKE
+
+dnl the library version info is available in the file: libibeventdb.ver
+ibeventdb_api_version=`grep LIBVERSION $srcdir/libibeventdb.ver | sed 's/LIBVERSION=//'`
+if test -z "$ibeventdb_api_version"; then
+   ibeventdb_api_version=1:0:0
+fi
+AC_SUBST(ibeventdb_api_version)
+
+dnl Checks for programs
+AC_PROG_CC
+AC_PROG_GCC_TRADITIONAL
+AC_PROG_LIBTOOL
+
+dnl Checks for libraries
+AC_CHECK_LIB(pthread, pthread_mutex_init, [],
+	AC_MSG_ERROR([pthread_mutex_init() not found.  libibeventdb requires libpthread.]))
+
+dnl Checks for header files.
+AC_HEADER_STDC
+AC_CHECK_HEADERS([fcntl.h stdlib.h string.h sys/ioctl.h sys/time.h syslog.h unistd.h])
+
+dnl Checks for library functions
+AC_FUNC_MALLOC
+AC_FUNC_MEMCMP
+AC_CHECK_FUNC([time])
+dnl AC_CHECK_FUNC([cl_plock_excl_acquire], [],
+dnl AC_MSG_ERROR([cl_plock_excl_acquire not found, libibeventdb requires libosmcomp]))
+
+dnl Checks for typedefs, structures, and compiler characteristics.
+AC_C_CONST
+AC_C_INLINE
+AC_TYPE_SIZE_T
+AC_HEADER_TIME
+
+dnl We use --version-script with ld if possible
+AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script,
+    if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then
+        ac_cv_version_script=yes
+    else
+        ac_cv_version_script=no
+    fi)
+
+AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes")
+
+dnl Support debug mode build - if enable-debug provided the DEBUG variable is set
+AC_ARG_ENABLE(debug,
+[  --enable-debug Turn on debug mode],
+[case "${enableval}" in
+  yes) debug=true ;;
+  no)  debug=false ;;
+  *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;;
+esac],[debug=false])
+AM_CONDITIONAL(DEBUG, test x$debug = xtrue)
+
+# we have to revive the env CFLAGS as somehow they are being overwritten...
+# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering
+# for why they should NEVER be modified by the configure to allow for user
+# overrides.
+CFLAGS=$ac_env_CFLAGS_value
+
+
+AC_CONFIG_FILES([Makefile libibeventdb.spec])
+AC_OUTPUT
diff --git a/osm/eventdb/libibeventdb.map b/osm/eventdb/libibeventdb.map
new file mode 100644
index 0000000..ca4f78c
--- /dev/null
+++ b/osm/eventdb/libibeventdb.map
@@ -0,0 +1,5 @@
+OSMPMDB_1.0 {
+	global:
+      __osm_event_db;
+	local: *;
+};
diff --git a/osm/eventdb/libibeventdb.spec.in b/osm/eventdb/libibeventdb.spec.in
new file mode 100644
index 0000000..ac66545
--- /dev/null
+++ b/osm/eventdb/libibeventdb.spec.in
@@ -0,0 +1,38 @@
+
+%define ver @VERSION@
+%define RELEASE 1
+%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE}
+
+Summary: OpenIB InfiniBand OpenSM Event Database Library
+Name: libibeventdb
+Version: %ver
+Release: %rel%{?dist}
+License: GPL/BSD
+Group: System Environment/Libraries
+BuildRoot: %{_tmppath}/%{name}-%{version}-root
+Source: http://openib.org/downloads/%{name}-%{version}.tar.gz
+Url: http://openib.org/
+Requires: opensm
+
+%description
+libibeventdb provides a default plugin for the OpenSM event database
+
+%prep
+%setup -q
+
+%build
+%configure
+make
+
+%install
+make DESTDIR=${RPM_BUILD_ROOT} install
+# remove unpackaged files from the buildroot
+rm -f $RPM_BUILD_ROOT%{_libdir}/*.la
+
+%clean
+rm -rf $RPM_BUILD_ROOT
+
+%files
+%defattr(-,root,root)
+%{_libdir}/libibeventdb*.so.*
+%doc ChangeLog
diff --git a/osm/eventdb/libibeventdb.ver b/osm/eventdb/libibeventdb.ver
new file mode 100644
index 0000000..7a703b7
--- /dev/null
+++ b/osm/eventdb/libibeventdb.ver
@@ -0,0 +1,9 @@
+# In this file we track the current API version
+# of the vendor interface (and libraries)
+# The version is built of the following
+# three numbers:
+# API_REV:RUNNING_REV:AGE
+# API_REV - advance on any added API
+# RUNNING_REV - advance on any change to the vendor files
+# AGE - number of backward versions the API still supports
+LIBVERSION=1:0:0
diff --git a/osm/eventdb/src/ibeventdb.c b/osm/eventdb/src/ibeventdb.c
new file mode 100644
index 0000000..e98f85c
--- /dev/null
+++ b/osm/eventdb/src/ibeventdb.c
@@ -0,0 +1,622 @@
+/*
+ * Copyright (c) 2007 The Regents of the University of California.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#include <errno.h>
+#include <string.h>
+#include <stdlib.h>
+#include <time.h>
+#include <stdio.h>
+#include <inttypes.h>
+#include <opensm/osm_event_db.h>
+#include <complib/cl_qmap.h>
+#include <complib/cl_passivelock.h>
+
+/**
+ * Port counter object.
+ * Store all the port counters for a single port.
+ */
+typedef struct _osm_event_pc {
+	struct {
+		uint64_t symbol_err_cnt;
+		uint64_t link_err_recover;
+		uint64_t link_downed;
+		uint64_t rcv_err;
+		uint64_t rcv_rem_phys_err;
+		uint64_t rcv_switch_relay_err;
+		uint64_t xmit_discards;
+		uint64_t xmit_constraint_err;
+		uint64_t rcv_constraint_err;
+		uint64_t link_int_err;
+		uint64_t buffer_overrun_err;
+		uint64_t vl15_dropped;
+		uint64_t xmit_data;
+		uint64_t rcv_data;
+		uint64_t xmit_pkts;
+		uint64_t rcv_pkts;
+		time_t   last_reset;
+	} totals;
+	osm_pc_reading_t previous;
+} osm_event_pc_t;
+
+/**
+ * group port counters for ports into the nodes
+ */
+typedef struct _osm_pc_node {
+	cl_map_item_t  map_item; /* must be first */
+	uint64_t       node_guid;
+	osm_event_pc_t   *ports;
+	uint8_t        num_ports;
+} osm_pc_node_t;
+
+/**
+ * all nodes in the system.
+ */
+typedef struct _osm_pc_db {
+	cl_qmap_t   pc_data; /* stores type (osm_pc_node_t *) */
+	cl_plock_t  lock;
+	osm_log_t  *osm_log;
+} osm_pc_db_t;
+
+
+/** =========================================================================
+ */
+static void *
+db_construct(osm_log_t *osm_log)
+{
+	/* use the default */
+	osm_pc_db_t *db = malloc(sizeof(*db));
+	if (!db) {
+		return (NULL);
+	}
+	cl_plock_construct(&(db->lock));
+	cl_plock_init(&(db->lock));
+	cl_qmap_init(&(db->pc_data));
+	db->osm_log = osm_log;
+	return ((void *)db);
+}
+
+/** =========================================================================
+ */
+static void
+db_destroy(void *_db)
+{
+	osm_pc_db_t *db = (osm_pc_db_t *)_db;
+	cl_plock_excl_acquire(&(db->lock));
+	/* remove all the items in the qmap */
+	while (!cl_is_qmap_empty(&(db->pc_data))) {
+		cl_map_item_t *rc = cl_qmap_head(&(db->pc_data));
+		cl_qmap_remove_item(&(db->pc_data), rc);
+	}
+	cl_plock_release(&(db->lock));
+	cl_plock_destroy(&(db->lock));
+	free(db);
+}
+
+/** =========================================================================
+ */
+static osm_pc_node_t *
+malloc_node(void *_db, uint64_t guid, uint8_t num_ports)
+{
+	int            i = 0;
+	time_t         cur_time = 0;
+	osm_pc_node_t *rc = malloc(sizeof(*rc));
+	if (!rc)
+		return (NULL);
+
+	rc->ports = calloc(num_ports, sizeof(osm_event_pc_t));
+	if (!rc->ports) {
+		goto free_rc;
+	}
+	rc->num_ports = num_ports;
+	rc->node_guid = guid;
+
+	cur_time = time(NULL);
+	for (i = 0; i < num_ports; i++) {
+		rc->ports[i].totals.last_reset = cur_time;
+		rc->ports[i].previous.time = cur_time;
+	}
+
+	return (rc);
+free_rc:
+	free(rc);
+	return (NULL);
+}
+
+/** =========================================================================
+ */
+static void
+free_node(osm_pc_node_t *node)
+{
+	if (!node)
+		return;
+	if (node->ports)
+		free(node->ports);
+	free(node);
+}
+
+/* insert nodes to the database */
+static osm_event_db_err_t
+insert(void *_db, osm_pc_node_t *node)
+{
+	osm_pc_db_t *db = (osm_pc_db_t *)_db;
+	cl_map_item_t *rc = cl_qmap_insert(&(db->pc_data), node->node_guid, (cl_map_item_t *)node);
+	if ((void *)rc != (void *)node)
+		return (OSM_EVENT_DB_FAIL);
+	return (OSM_EVENT_DB_SUCCESS);
+}
+
+/**********************************************************************
+ * Internal call db->lock should be held when calling
+ **********************************************************************/
+static inline osm_pc_node_t *
+get(void *_db, uint64_t guid)
+{
+	osm_pc_db_t *db = (osm_pc_db_t *)_db;
+	cl_map_item_t       *rc = cl_qmap_get(&(db->pc_data), guid);
+	const cl_map_item_t *end = cl_qmap_end(&(db->pc_data));
+	if (rc == end)
+		return (NULL);
+	return ((osm_pc_node_t *)rc);
+}
+
+/** =========================================================================
+ */
+static osm_event_db_err_t
+db_create_entry(void *_db, uint64_t guid, uint8_t num_ports)
+{
+	osm_pc_db_t        *db = (osm_pc_db_t *)_db;
+	osm_event_db_err_t  rc = OSM_EVENT_DB_SUCCESS;
+	cl_plock_excl_acquire(&(db->lock));
+	if (!get(db, guid)) {
+		osm_pc_node_t *pc_node = malloc_node(db, guid, num_ports);
+		if (!pc_node) {
+			rc = OSM_EVENT_DB_NOMEM;
+			goto Exit;
+		}
+		if (insert(db, pc_node)) {
+			free_node(pc_node);
+			rc = OSM_EVENT_DB_FAIL;
+			goto Exit;
+		}
+	}
+Exit:
+	cl_plock_release(&(db->lock));
+	return (rc);
+}
+
+/**********************************************************************
+ **********************************************************************/
+static osm_event_db_err_t
+db_get_prev(void *_db, uint64_t guid,
+		uint8_t port, osm_pc_reading_t *reading)
+{
+	osm_pc_db_t *db = (osm_pc_db_t *)_db;
+	osm_pc_node_t       *node = NULL;
+	cl_map_item_t       *rc = NULL;
+	const cl_map_item_t *end = NULL;
+
+	cl_plock_acquire(&(db->lock));
+	rc = cl_qmap_get(&(db->pc_data), guid);
+	end = cl_qmap_end(&(db->pc_data));
+	if (rc == end) {
+		cl_plock_release(&(db->lock)); /* never return with the lock held */
+		return (OSM_EVENT_DB_GUIDNOTFOUND);
+	}
+	node = (osm_pc_node_t *)rc;
+	if (port >= node->num_ports) {
+		cl_plock_release(&(db->lock));
+		return (OSM_EVENT_DB_PORTNOTFOUND);
+	}
+	*reading = node->ports[port].previous;
+	cl_plock_release(&(db->lock));
+	return (OSM_EVENT_DB_SUCCESS);
+}
+
+/**********************************************************************
+ * Output a tab-delimited dump of the port counters
+ **********************************************************************/
+static void
+__dump_node_mr(osm_pc_node_t *node, FILE *fp)
+{
+	int i = 0;
+
+	fprintf(fp, "\nGUID            Port\t%s\t%s\t"
+			"%s\t%s\t%s\t%s\t%s\t%s\t%s\t"
+			"%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
+			"symbol_err_cnt",
+			"link_err_recover",
+			"link_downed",
+			"rcv_err",
+			"rcv_rem_phys_err",
+			"rcv_switch_relay_err",
+			"xmit_discards",
+			"xmit_constraint_err",
+			"rcv_constraint_err",
+			"link_int_err",
+			"buf_overrun_err",
+			"vl15_dropped",
+			"xmit_data",
+			"rcv_data",
+			"xmit_pkts",
+			"rcv_pkts");
+	for (i = 1; i < node->num_ports; i++)
+	{
+		fprintf(fp, "0x%" PRIx64 "\t%d\t%"PRIu64"\t%"PRIu64"\t"
+			"%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t"
+			"%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t"
+			"%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t"
+			"%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t%"PRIu64"\n",
+			node->node_guid,
+			i,
+			node->ports[i].totals.symbol_err_cnt,
+			node->ports[i].totals.link_err_recover,
+			node->ports[i].totals.link_downed,
+			node->ports[i].totals.rcv_err,
+			node->ports[i].totals.rcv_rem_phys_err,
+			node->ports[i].totals.rcv_switch_relay_err,
+			node->ports[i].totals.xmit_discards,
+			node->ports[i].totals.xmit_constraint_err,
+			node->ports[i].totals.rcv_constraint_err,
+			node->ports[i].totals.link_int_err,
+			node->ports[i].totals.buffer_overrun_err,
+			node->ports[i].totals.vl15_dropped,
+			node->ports[i].totals.xmit_data,
+			node->ports[i].totals.rcv_data,
+			node->ports[i].totals.xmit_pkts,
+			node->ports[i].totals.rcv_pkts
+			);
+	}
+}
+
+/**********************************************************************
+ * Output a human readable output of the port counters
+ **********************************************************************/
+static void
+__dump_node_hr(osm_pc_node_t *node, FILE *fp)
+{
+	int i = 0;
+
+	fprintf(fp, "\n");
+	for (i = 1; i < node->num_ports; i++)
+	{
+		fprintf(fp, "GUID 0x%"PRIx64": Port %d:\n"
+			"     symbol_err_cnt: %"PRIu64"\n"
+			"     link_err_recover: %"PRIu64"\n"
+			"     link_downed: %"PRIu64"\n"
+			"     rcv_err: %"PRIu64"\n"
+			"     rcv_rem_phys_err: %"PRIu64"\n"
+			"     rcv_switch_relay_err: %"PRIu64"\n"
+			"     xmit_discards: %"PRIu64"\n"
+			"     xmit_constraint_err: %"PRIu64"\n"
+			"     rcv_constraint_err: %"PRIu64"\n"
+			"     link_int_err: %"PRIu64"\n"
+			"     buf_overrun_err: %"PRIu64"\n"
+			"     vl15_dropped: %"PRIu64"\n"
+			"     xmit_data: %"PRIu64"\n"
+			"     rcv_data: %"PRIu64"\n"
+			"     xmit_pkts: %"PRIu64"\n"
+			"     rcv_pkts: %"PRIu64"\n"
+			,
+			node->node_guid,
+			i,
+			node->ports[i].totals.symbol_err_cnt,
+			node->ports[i].totals.link_err_recover,
+			node->ports[i].totals.link_downed,
+			node->ports[i].totals.rcv_err,
+			node->ports[i].totals.rcv_rem_phys_err,
+			node->ports[i].totals.rcv_switch_relay_err,
+			node->ports[i].totals.xmit_discards,
+			node->ports[i].totals.xmit_constraint_err,
+			node->ports[i].totals.rcv_constraint_err,
+			node->ports[i].totals.link_int_err,
+			node->ports[i].totals.buffer_overrun_err,
+			node->ports[i].totals.vl15_dropped,
+			node->ports[i].totals.xmit_data,
+			node->ports[i].totals.rcv_data,
+			node->ports[i].totals.xmit_pkts,
+			node->ports[i].totals.rcv_pkts
+			);
+	}
+}
+
+/* Define a context for the __db_dump callback */
+typedef struct {
+	FILE                *fp;
+	osm_event_db_dump_t  dump_type;
+} dump_context_t;
+
+/**********************************************************************
+ **********************************************************************/
+static void
+__db_dump(cl_map_item_t * const p_map_item, void *context )
+{
+	osm_pc_node_t  *node = (osm_pc_node_t *)p_map_item;
+	dump_context_t *c = (dump_context_t *)context;
+	FILE           *fp = c->fp;
+
+	switch (c->dump_type)
+	{
+		case OSM_EVENT_DB_DUMP_MR:
+			__dump_node_mr(node, fp);
+			break;
+		case OSM_EVENT_DB_DUMP_HR:
+		default:
+			__dump_node_hr(node, fp);
+			break;
+	}
+}
+
+/**********************************************************************
+ * dump the data to the file "file"
+ **********************************************************************/
+static osm_event_db_err_t
+db_dump(void *_db, char *file, osm_event_db_dump_t dump_type)
+{
+	osm_pc_db_t    *db = (osm_pc_db_t *)_db;
+	dump_context_t  context;
+
+	context.fp = fopen(file, "w+");
+	if (!context.fp)
+		return (OSM_EVENT_DB_FAIL);
+	context.dump_type = dump_type;
+
+	cl_plock_acquire(&(db->lock));
+        cl_qmap_apply_func(&(db->pc_data), __db_dump, (void *)&context);
+	cl_plock_release(&(db->lock));
+	fclose(context.fp);
+	return (OSM_EVENT_DB_SUCCESS);
+}
+
+/**********************************************************************
+ * Callback used by db_clear_port_counters() below
+ **********************************************************************/
+static void
+__clear_counters(cl_map_item_t * const p_map_item, void *context )
+{
+	osm_pc_node_t *node = (osm_pc_node_t *)p_map_item;
+	int            i = 0;
+	for (i = 0; i < node->num_ports; i++) {
+		node->ports[i].totals.symbol_err_cnt = 0;
+		node->ports[i].totals.link_err_recover = 0;
+		node->ports[i].totals.link_downed = 0;
+		node->ports[i].totals.rcv_err = 0;
+		node->ports[i].totals.rcv_rem_phys_err = 0;
+		node->ports[i].totals.rcv_switch_relay_err = 0;
+		node->ports[i].totals.xmit_discards = 0;
+		node->ports[i].totals.xmit_constraint_err = 0;
+		node->ports[i].totals.rcv_constraint_err = 0;
+		node->ports[i].totals.link_int_err = 0;
+		node->ports[i].totals.buffer_overrun_err = 0;
+		node->ports[i].totals.vl15_dropped = 0;
+		node->ports[i].totals.xmit_data = 0;
+		node->ports[i].totals.rcv_data = 0;
+		node->ports[i].totals.xmit_pkts = 0;
+		node->ports[i].totals.rcv_pkts = 0;
+		node->ports[i].totals.last_reset = time(NULL);
+	}
+}
+
+/**********************************************************************
+ * Clear the counters from the db
+ **********************************************************************/
+static void
+db_clear_port_counters(void *_db)
+{
+	osm_pc_db_t *db = (osm_pc_db_t *)_db;
+	cl_plock_excl_acquire(&(db->lock));
+	cl_qmap_apply_func(&(db->pc_data), __clear_counters, (void *)db);
+	cl_plock_release(&(db->lock));
+}
+
+#if 0
+/**********************************************************************
+ * Dump a reading vs the previous reading to stdout
+ **********************************************************************/
+static void
+dump_reading(osm_event_pc_t *port, ib_port_counters_t *cur)
+{
+	printf("sym %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->symbol_err_cnt),
+			cl_ntoh16(port->previous.reading.symbol_err_cnt), port->totals.symbol_err_cnt);
+	printf("ler %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->link_err_recover),
+		cl_ntoh16(port->previous.reading.link_err_recover), port->totals.link_err_recover);
+	printf("ld %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->link_downed),
+		cl_ntoh16(port->previous.reading.link_downed), port->totals.link_downed);
+	printf("re %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->rcv_err),
+		cl_ntoh16(port->previous.reading.rcv_err), port->totals.rcv_err);
+	printf("rrp %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->rcv_rem_phys_err),
+		cl_ntoh16(port->previous.reading.rcv_rem_phys_err), port->totals.rcv_rem_phys_err);
+	printf("rsr %u - %u (%" PRIx64 ")\n",
+		cl_ntoh16(cur->rcv_switch_relay_err),
+		cl_ntoh16(port->previous.reading.rcv_switch_relay_err), port->totals.rcv_switch_relay_err);
+	printf("xd %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->xmit_discards),
+		cl_ntoh16(port->previous.reading.xmit_discards), port->totals.xmit_discards);
+	printf("xce %u - %u (%" PRIx64 ")\n",
+		cl_ntoh16(cur->xmit_constraint_err),
+		cl_ntoh16(port->previous.reading.xmit_constraint_err), port->totals.xmit_constraint_err);
+	printf("rce %u - %u (%" PRIx64 ")\n",
+		cl_ntoh16(cur->rcv_constraint_err),
+		cl_ntoh16(port->previous.reading.rcv_constraint_err), port->totals.rcv_constraint_err);
+	printf("li %x - %x (%" PRIx64 ")\n",
+		cl_ntoh16(cur->link_int_buffer_overrun),
+		cl_ntoh16(port->previous.reading.link_int_buffer_overrun), port->totals.link_int_err);
+	printf("bo %x - %x (%" PRIx64 ")\n",
+		cl_ntoh16(cur->link_int_buffer_overrun),
+		cl_ntoh16(port->previous.reading.link_int_buffer_overrun), port->totals.buffer_overrun_err);
+	printf("vld %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->vl15_dropped),
+		cl_ntoh16(port->previous.reading.vl15_dropped), port->totals.vl15_dropped);
+	
+	printf("xd %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->xmit_data),
+		cl_ntoh32(port->previous.reading.xmit_data), port->totals.xmit_data);
+	printf("rd %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->rcv_data),
+		cl_ntoh32(port->previous.reading.rcv_data), port->totals.rcv_data);
+	printf("xp %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->xmit_pkts),
+		cl_ntoh32(port->previous.reading.xmit_pkts), port->totals.xmit_pkts);
+	printf("rp %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->rcv_pkts),
+		cl_ntoh32(port->previous.reading.rcv_pkts), port->totals.rcv_pkts);
+}
+#endif
+
+/**********************************************************************
+ * Clear the previous reading stored for the given port
+ **********************************************************************/
+static osm_event_db_err_t
+db_clear_prev_pc(void *_db, uint64_t guid, uint8_t port)
+{
+	osm_pc_db_t *db = (osm_pc_db_t *)_db;
+	osm_event_pc_t        *p_port = NULL;
+	osm_pc_node_t      *p_node = NULL;
+	ib_port_counters_t *previous = NULL;
+	osm_event_db_err_t     rc = OSM_EVENT_DB_SUCCESS;
+
+	cl_plock_excl_acquire(&(db->lock));
+	p_node = get(db, guid);
+	if (!p_node) {
+		cl_plock_release(&(db->lock)); /* never return with the lock held */
+		return (OSM_EVENT_DB_GUIDNOTFOUND);
+	}
+	if (port >= p_node->num_ports) {
+		cl_plock_release(&(db->lock));
+		return (OSM_EVENT_DB_PORTNOTFOUND);
+	}
+	p_port = &(p_node->ports[port]);
+	previous = &(p_node->ports[port].previous.reading);
+	memset(previous, 0, sizeof(*previous));
+	p_port->previous.time = time(NULL);
+
+	cl_plock_release(&(db->lock));
+	return (rc);
+}
+
+/**********************************************************************
+ * Add the reading to the osm_pc_node_t
+ **********************************************************************/
+static osm_event_db_err_t
+db_add_reading(void *_db, uint64_t guid,
+                   uint8_t port, ib_port_counters_t *reading)
+{
+	osm_pc_db_t *db = (osm_pc_db_t *)_db;
+	osm_event_pc_t        *p_port = NULL;
+	osm_pc_node_t      *p_node = NULL;
+	ib_port_counters_t *previous = NULL;
+	osm_event_db_err_t     rc = OSM_EVENT_DB_SUCCESS;
+
+	cl_plock_excl_acquire(&(db->lock));
+	p_node = get(db, guid);
+	if (!p_node) {
+		cl_plock_release(&(db->lock)); /* never return with the lock held */
+		return (OSM_EVENT_DB_GUIDNOTFOUND);
+	}
+	if (port >= p_node->num_ports) {
+		cl_plock_release(&(db->lock));
+		return (OSM_EVENT_DB_PORTNOTFOUND);
+	}
+	p_port = &(p_node->ports[port]);
+	previous = &(p_node->ports[port].previous.reading);
+#if 0
+	dump_reading(p_port, reading);
+#endif
+
+	/* calculate changes from previous reading */
+	p_port->totals.symbol_err_cnt
+		+= (cl_ntoh16(reading->symbol_err_cnt)
+				- cl_ntoh16(previous->symbol_err_cnt));
+	p_port->totals.link_err_recover
+		+= (reading->link_err_recover - previous->link_err_recover);
+	p_port->totals.link_downed
+		+= (reading->link_downed - previous->link_downed);
+	p_port->totals.rcv_err
+		+= (cl_ntoh16(reading->rcv_err)
+				- cl_ntoh16(previous->rcv_err));
+	p_port->totals.rcv_rem_phys_err
+		+= (cl_ntoh16(reading->rcv_rem_phys_err)
+				- cl_ntoh16(previous->rcv_rem_phys_err));
+	p_port->totals.rcv_switch_relay_err
+		+= (cl_ntoh16(reading->rcv_switch_relay_err)
+				- cl_ntoh16(previous->rcv_switch_relay_err));
+	p_port->totals.xmit_discards
+		+= (cl_ntoh16(reading->xmit_discards)
+				- cl_ntoh16(previous->xmit_discards));
+	p_port->totals.xmit_constraint_err
+		+= (reading->xmit_constraint_err - previous->xmit_constraint_err);
+	p_port->totals.rcv_constraint_err
+		+= (reading->rcv_constraint_err - previous->rcv_constraint_err);
+	p_port->totals.link_int_err
+		+= PC_LINK_INT(reading->link_int_buffer_overrun)
+			- PC_LINK_INT(previous->link_int_buffer_overrun);
+	p_port->totals.buffer_overrun_err
+		+= PC_BUF_OVERRUN(reading->link_int_buffer_overrun)
+			- PC_BUF_OVERRUN(previous->link_int_buffer_overrun);
+	p_port->totals.vl15_dropped
+		+= (cl_ntoh16(reading->vl15_dropped)
+				- cl_ntoh16(previous->vl15_dropped));
+	
+	p_port->totals.xmit_data
+		+= (cl_ntoh32(reading->xmit_data)
+				- cl_ntoh32(previous->xmit_data));
+	p_port->totals.rcv_data
+		+= (cl_ntoh32(reading->rcv_data)
+				- cl_ntoh32(previous->rcv_data));
+	p_port->totals.xmit_pkts
+		+= (cl_ntoh32(reading->xmit_pkts)
+				- cl_ntoh32(previous->xmit_pkts));
+	p_port->totals.rcv_pkts
+		+= (cl_ntoh32(reading->rcv_pkts)
+				- cl_ntoh32(previous->rcv_pkts));
+
+	p_port->previous.reading = *reading;
+	p_port->previous.time = time(NULL);
+
+	cl_plock_release(&(db->lock));
+	return (rc);
+}
+
+/** =========================================================================
+ * Define the object symbol for loading
+ */
+__osm_event_db_t __osm_event_db =
+{
+	.interface_version = OSM_EVENT_DB_INTERFACE_VER,
+	.construct = db_construct,
+	.destroy = db_destroy,
+	.create_entry = db_create_entry,
+	.get_prev_pc = db_get_prev,
+	.dump = db_dump,
+	.clear_port_counters = db_clear_port_counters,
+	.add_pc_reading = db_add_reading,
+	.clear_prev_pc = db_clear_prev_pc
+};
+
diff --git a/osm/include/Makefile.am b/osm/include/Makefile.am
index 8499d3b..fd874c8 100644
--- a/osm/include/Makefile.am
+++ b/osm/include/Makefile.am
@@ -87,6 +87,8 @@ EXTRA_DIST = \
 	$(srcdir)/opensm/osm_drop_mgr.h \
 	$(srcdir)/opensm/osm_port_info_rcv.h \
 	$(srcdir)/opensm/osm_state_mgr_ctrl.h \
+	$(srcdir)/opensm/osm_perfmgr.h \
+	$(srcdir)/opensm/osm_event_db.h \
 	$(srcdir)/complib/cl_thread_osd.h \
 	$(srcdir)/complib/cl_packon.h \
 	$(srcdir)/complib/cl_atomic_osd.h \
diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
index b3937cb..2a4057b 100644
--- a/osm/include/iba/ib_types.h
+++ b/osm/include/iba/ib_types.h
@@ -7353,6 +7353,80 @@ typedef struct _ib_inform_info_record
 }	PACK_SUFFIX ib_inform_info_record_t;
 #include <complib/cl_packoff.h>
 
+/****s* IBA Base: Types/ib_perfmgr_mad_t
+* NAME
+*	ib_perfmgr_mad_t
+*
+* DESCRIPTION
+*	IBA defined Perf Management MAD (16.3.1)
+*
+* SYNOPSIS
+*/
+#include <complib/cl_packon.h>
+typedef struct _ib_perfmgr_mad
+{
+	ib_mad_t		header;
+	uint8_t			resv[40];
+
+#define	IB_PM_DATA_SIZE		192
+	uint8_t			data[IB_PM_DATA_SIZE];
+
+}	PACK_SUFFIX ib_perfmgr_mad_t;
+#include <complib/cl_packoff.h>
+/*
+* FIELDS
+*	header
+*		Common MAD header.
+*
+*	resv
+*		Reserved.
+*
+*	data
+*		Performance Management payload.  The structure and content of this field
+*		depend upon the method, attr_id, and attr_mod fields in the header.
+*
+* SEE ALSO
+* ib_mad_t
+*********/
+
+/****s* IBA Base: Types/ib_port_counters_t
+* NAME
+*	ib_port_counters_t
+*
+* DESCRIPTION
+*	IBA defined PortCounters Attribute. (16.1.3.5)
+*
+* SYNOPSIS
+*/
+#include <complib/cl_packon.h>
+typedef struct _ib_port_counters
+{
+	uint8_t 			reserved;
+	uint8_t                         port_select;
+	ib_net16_t                      counter_select;
+	ib_net16_t                      symbol_err_cnt;
+	uint8_t                         link_err_recover;
+	uint8_t                         link_downed;
+	ib_net16_t                      rcv_err;
+	ib_net16_t                      rcv_rem_phys_err;
+	ib_net16_t                      rcv_switch_relay_err;
+	ib_net16_t                      xmit_discards;
+	uint8_t                         xmit_constraint_err;
+	uint8_t                         rcv_constraint_err;
+	uint8_t                         res1;
+	uint8_t                         link_int_buffer_overrun;
+	ib_net16_t                      res2;
+	ib_net16_t                      vl15_dropped;
+	ib_net32_t                      xmit_data;
+	ib_net32_t                      rcv_data;
+	ib_net32_t                      xmit_pkts;
+	ib_net32_t                      rcv_pkts;
+}	PACK_SUFFIX ib_port_counters_t;
+#include <complib/cl_packoff.h>
+
+#define PC_LINK_INT(integ_buf_over) (((integ_buf_over) & 0xF0) >> 4)
+#define PC_BUF_OVERRUN(integ_buf_over) ((integ_buf_over) & 0x0F)
+
 /****d* IBA Base: Types/DM_SVC_NAME
 * NAME
 *	DM_SVC_NAME
diff --git a/osm/include/opensm/osm_base.h b/osm/include/opensm/osm_base.h
index b38b511..51cef49 100644
--- a/osm/include/opensm/osm_base.h
+++ b/osm/include/opensm/osm_base.h
@@ -448,6 +448,29 @@ BEGIN_C_DECLS
 */
 #define OSM_SM_DEFAULT_QP1_SEND_SIZE 256
 
+/****d* OpenSM: Base/OSM_PM_DEFAULT_QP1_RCV_SIZE
+* NAME
+*   OSM_PM_DEFAULT_QP1_RCV_SIZE
+*
+* DESCRIPTION
+*   Specifies the default size (in MADs) of the QP1 receive queue
+*
+* SYNOPSIS
+*/
+#define OSM_PM_DEFAULT_QP1_RCV_SIZE 256
+/***********/
+
+/****d* OpenSM: Base/OSM_PM_DEFAULT_QP1_SEND_SIZE
+* NAME
+*   OSM_PM_DEFAULT_QP1_SEND_SIZE
+*
+* DESCRIPTION
+*   Specifies the default size (in MADs) of the QP1 send queue
+*
+* SYNOPSIS
+*/
+#define OSM_PM_DEFAULT_QP1_SEND_SIZE 256
+
 
 /****d* OpenSM: Base/OSM_SM_DEFAULT_POLLING_TIMEOUT_MILLISECS
 * NAME
diff --git a/osm/include/opensm/osm_event_db.h b/osm/include/opensm/osm_event_db.h
new file mode 100644
index 0000000..17effaf
--- /dev/null
+++ b/osm/include/opensm/osm_event_db.h
@@ -0,0 +1,151 @@
+/*
+ * Copyright (c) 2007 The Regents of the University of California.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef _OSM_EVENT_DB_H_
+#define _OSM_EVENT_DB_H_
+
+#include <time.h>
+#include <opensm/osm_log.h>
+#include <iba/ib_types.h>
+
+#ifdef __cplusplus
+#  define BEGIN_C_DECLS extern "C" {
+#  define END_C_DECLS   }
+#else /* !__cplusplus */
+#  define BEGIN_C_DECLS
+#  define END_C_DECLS
+#endif /* __cplusplus */
+
+BEGIN_C_DECLS
+
+/****h* OpenSM/Event Database
+* DESCRIPTION
+*       Database interface to record subnet events
+*
+*       Implementations of this object _MUST_ be thread safe.
+*
+* AUTHOR
+*	Ira Weiny, LLNL
+*
+*********/
+
+typedef enum {
+	OSM_EVENT_DB_SUCCESS = 0,
+	OSM_EVENT_DB_FAIL,
+	OSM_EVENT_DB_NOMEM,
+	OSM_EVENT_DB_GUIDNOTFOUND,
+	OSM_EVENT_DB_PORTNOTFOUND
+} osm_event_db_err_t;
+
+/** =========================================================================
+ * Port counter reading
+ */
+typedef struct {
+	ib_port_counters_t reading;
+	time_t             time;
+} osm_pc_reading_t;
+
+/** =========================================================================
+ * Dump output options
+ */
+typedef enum {
+	OSM_EVENT_DB_DUMP_HR = 0, /* Human readable */
+	OSM_EVENT_DB_DUMP_MR      /* Machine readable */
+} osm_event_db_dump_t;
+
+/** =========================================================================
+ * Plugin creators should allocate an object of this type
+ *    (name __osm_event_db_t)
+ * The version should be set to OSM_EVENT_DB_INTERFACE_VER
+ */
+#define OSM_EVENT_DB_INTERFACE_VER (1)
+typedef struct
+{
+	int                 interface_version;
+	void               *(*construct)(osm_log_t *osm_log);
+	void                (*destroy)(void *db);
+	osm_event_db_err_t  (*create_entry)(void *db, uint64_t guid, uint8_t num_ports);
+	osm_event_db_err_t  (*get_prev_pc)(void *db, uint64_t guid,
+				uint8_t port, osm_pc_reading_t *reading);
+	osm_event_db_err_t  (*dump)(void *db, char *file, osm_event_db_dump_t dump_type);
+	void                (*clear_port_counters)(void *db);
+	osm_event_db_err_t  (*add_pc_reading)(void *db, uint64_t guid,
+				uint8_t port, ib_port_counters_t *reading);
+	osm_event_db_err_t  (*clear_prev_pc)(void *db, uint64_t guid, uint8_t port);
+} __osm_event_db_t;
+
+/** =========================================================================
+ * The database structure which should be considered opaque
+ */
+typedef struct {
+	void             *handle;
+	__osm_event_db_t *db_impl;
+	void             *db_data;
+	osm_log_t        *p_log;
+} osm_event_db_t;
+
+
+/**
+ * functions
+ */
+osm_event_db_t     *osm_event_db_construct(osm_log_t *p_log, char *type);
+void                osm_event_db_destroy(osm_event_db_t *db);
+
+osm_event_db_err_t  osm_event_db_create_entry(osm_event_db_t *db, uint64_t guid,
+					uint8_t num_ports);
+osm_event_db_err_t  osm_event_db_get_prev_pc(osm_event_db_t *db,
+					uint64_t guid, uint8_t port,
+					osm_pc_reading_t *reading);
+osm_event_db_err_t  osm_event_db_dump(osm_event_db_t *db, char *file,
+					osm_event_db_dump_t dump_type);
+osm_event_db_err_t  osm_event_db_add_pc_reading(osm_event_db_t *db, uint64_t guid,
+					uint8_t port, ib_port_counters_t *reading);
+void                osm_event_db_clear_port_counters(osm_event_db_t *db);
+osm_event_db_err_t  osm_event_db_clear_prev_pc(osm_event_db_t *db, uint64_t guid,
+					uint8_t port);
+
+#if 0
+/* work out the tracking of notice (trap) events. */
+
+typedef struct {
+	ib_mad_notice_attr_t reading;
+	time_t               time;
+} osm_notice_reading_t;
+osm_event_db_err_t  osm_event_db_add_notice_reading(osm_event_db_t *db, uint64_t guid,
+					uint8_t port, ib_mad_notice_attr_t *reading);
+#endif
+
+END_C_DECLS
+
+#endif		/* _OSM_EVENT_DB_H_ */
+
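
The plugin contract described in osm_event_db.h (export an object named
__osm_event_db whose interface_version is OSM_EVENT_DB_INTERFACE_VER) could be
met by something like the sketch below. It is not part of the patch. The
null_* functions and null_db_t are hypothetical, and the typedefs at the top
are stand-ins so the example compiles on its own; a real plugin would include
<opensm/osm_event_db.h> and be built as lib<type>.so so that
osm_event_db_construct() can find it.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Stand-ins (assumptions) so the sketch compiles alone; a real plugin
 * would take these from <opensm/osm_event_db.h> and <iba/ib_types.h>. */
typedef struct osm_log osm_log_t;
typedef struct { uint32_t placeholder; } ib_port_counters_t;
typedef enum {
	OSM_EVENT_DB_SUCCESS = 0, OSM_EVENT_DB_FAIL, OSM_EVENT_DB_NOMEM,
	OSM_EVENT_DB_GUIDNOTFOUND, OSM_EVENT_DB_PORTNOTFOUND
} osm_event_db_err_t;
typedef struct { ib_port_counters_t reading; time_t time; } osm_pc_reading_t;
typedef enum { OSM_EVENT_DB_DUMP_HR = 0, OSM_EVENT_DB_DUMP_MR } osm_event_db_dump_t;
#define OSM_EVENT_DB_INTERFACE_VER (1)
typedef struct {
	int interface_version;
	void *(*construct)(osm_log_t *osm_log);
	void (*destroy)(void *db);
	osm_event_db_err_t (*create_entry)(void *db, uint64_t guid, uint8_t num_ports);
	osm_event_db_err_t (*get_prev_pc)(void *db, uint64_t guid,
				uint8_t port, osm_pc_reading_t *reading);
	osm_event_db_err_t (*dump)(void *db, char *file, osm_event_db_dump_t dump_type);
	void (*clear_port_counters)(void *db);
	osm_event_db_err_t (*add_pc_reading)(void *db, uint64_t guid,
				uint8_t port, ib_port_counters_t *reading);
	osm_event_db_err_t (*clear_prev_pc)(void *db, uint64_t guid, uint8_t port);
} __osm_event_db_t;

/* Hypothetical plugin state: counts created entries, stores no readings. */
typedef struct { unsigned entries; } null_db_t;

static void *null_construct(osm_log_t *log)
{
	(void)log;
	return calloc(1, sizeof(null_db_t));
}

static void null_destroy(void *db)
{
	free(db);
}

static osm_event_db_err_t null_create_entry(void *db, uint64_t guid, uint8_t num_ports)
{
	(void)guid; (void)num_ports;
	((null_db_t *)db)->entries++;
	return OSM_EVENT_DB_SUCCESS;
}

/* Nothing is ever stored, so there is never a previous reading. */
static osm_event_db_err_t null_get_prev_pc(void *db, uint64_t guid,
				uint8_t port, osm_pc_reading_t *reading)
{
	(void)db; (void)guid; (void)port; (void)reading;
	return OSM_EVENT_DB_GUIDNOTFOUND;
}

static osm_event_db_err_t null_dump(void *db, char *file, osm_event_db_dump_t dump_type)
{
	(void)db; (void)file; (void)dump_type;
	return OSM_EVENT_DB_SUCCESS;
}

static void null_clear_port_counters(void *db)
{
	(void)db;
}

static osm_event_db_err_t null_add_pc_reading(void *db, uint64_t guid,
				uint8_t port, ib_port_counters_t *reading)
{
	(void)db; (void)guid; (void)port; (void)reading;
	return OSM_EVENT_DB_SUCCESS;
}

static osm_event_db_err_t null_clear_prev_pc(void *db, uint64_t guid, uint8_t port)
{
	(void)db; (void)guid; (void)port;
	return OSM_EVENT_DB_SUCCESS;
}

/* The loader looks this symbol up by name with dlsym(). */
__osm_event_db_t __osm_event_db = {
	OSM_EVENT_DB_INTERFACE_VER,
	null_construct, null_destroy, null_create_entry, null_get_prev_pc,
	null_dump, null_clear_port_counters, null_add_pc_reading, null_clear_prev_pc
};
```

Because the header requires implementations to be thread safe, a plugin that
keeps real state would also need its own locking around that state; the
sketch leaves this out since it stores nothing beyond a counter.
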
diff --git a/osm/include/opensm/osm_madw.h b/osm/include/opensm/osm_madw.h
index 95be0f4..80258f4 100644
--- a/osm/include/opensm/osm_madw.h
+++ b/osm/include/opensm/osm_madw.h
@@ -315,6 +315,19 @@ typedef struct _osm_vla_context
 } osm_vla_context_t;
 /*********/
 
+/****s* OpenSM: MAD Wrapper/osm_perfmgr_context_t
+* DESCRIPTION
+*	Context for Performance manager queries
+*/
+typedef struct _osm_perfmgr_context {
+  uint64_t node_guid;
+  uint16_t port;
+  uint8_t num_ports;
+  uint8_t mad_method; /* was this a get or a set */
+  struct timeval query_start;
+} osm_perfmgr_context_t;
+/*********/
+
 #ifndef OSM_VENDOR_INTF_OPENIB
 /****s* OpenSM: MAD Wrapper/osm_arbitrary_context_t
 * NAME
@@ -354,6 +367,7 @@ typedef union _osm_madw_context
 	osm_slvl_context_t	slvl_context;
 	osm_pkey_context_t	pkey_context;
 	osm_vla_context_t	vla_context;
+	osm_perfmgr_context_t	perfmgr_context;
 #ifndef OSM_VENDOR_INTF_OPENIB
 	osm_arbitrary_context_t arb_context;
 #endif
@@ -639,6 +653,32 @@ osm_madw_get_sa_mad_ptr(
 *	MAD Wrapper object, osm_madw_construct, osm_madw_destroy
 *********/
 
+/****f* OpenSM: MAD Wrapper/osm_madw_get_perfmgr_mad_ptr
+* DESCRIPTION
+*	Gets a pointer to the PerfMgr MAD in this MAD wrapper.
+*
+* SYNOPSIS
+*/
+static inline ib_perfmgr_mad_t*
+osm_madw_get_perfmgr_mad_ptr(
+	IN const osm_madw_t* const p_madw )
+{
+	return((ib_perfmgr_mad_t*)p_madw->p_mad);
+}
+/*
+* PARAMETERS
+*	p_madw
+*		[in] Pointer to an osm_madw_t object.
+*
+* RETURN VALUES
+*	Pointer to the start of the PM MAD.
+*
+* NOTES
+*
+* SEE ALSO
+*	MAD Wrapper object, osm_madw_construct, osm_madw_destroy
+*********/
+
 /****f* OpenSM: MAD Wrapper/osm_madw_get_ni_context_ptr
 * NAME
 *	osm_madw_get_ni_context_ptr
diff --git a/osm/include/opensm/osm_msgdef.h b/osm/include/opensm/osm_msgdef.h
index a90e3b9..6732992 100644
--- a/osm/include/opensm/osm_msgdef.h
+++ b/osm/include/opensm/osm_msgdef.h
@@ -186,6 +186,7 @@ enum
 #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP)
 	OSM_MSG_MAD_MULTIPATH_RECORD,
 #endif
+	OSM_MSG_MAD_PORT_COUNTERS,
 	OSM_MSG_MAX
 };
 
diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h
index 482de28..bdaa8f3 100644
--- a/osm/include/opensm/osm_opensm.h
+++ b/osm/include/opensm/osm_opensm.h
@@ -57,6 +57,7 @@
 #include <opensm/osm_log.h>
 #include <opensm/osm_sm.h>
 #include <opensm/osm_sa.h>
+#include <opensm/osm_perfmgr.h>
 #include <opensm/osm_db.h>
 #include <opensm/osm_subnet.h>
 #include <opensm/osm_mad_pool.h>
@@ -157,6 +158,9 @@ typedef struct _osm_opensm_t
   osm_subn_t		subn;
   osm_sm_t		sm;
   osm_sa_t		sa;
+#ifdef ENABLE_OSM_PERF_MGR
+  osm_perfmgr_t         perfmgr;
+#endif /* ENABLE_OSM_PERF_MGR */
   osm_db_t		db;
   osm_mad_pool_t	mad_pool;
   osm_vendor_t		*p_vendor;
diff --git a/osm/include/opensm/osm_perfmgr.h b/osm/include/opensm/osm_perfmgr.h
new file mode 100644
index 0000000..6138ec3
--- /dev/null
+++ b/osm/include/opensm/osm_perfmgr.h
@@ -0,0 +1,223 @@
+/*
+ * Copyright (c) 2007 The Regents of the University of California.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef _OSM_PERFMGR_H_
+#define _OSM_PERFMGR_H_
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#ifdef ENABLE_OSM_PERF_MGR
+
+#include <iba/ib_types.h>
+#include <complib/cl_passivelock.h>
+#include <complib/cl_event.h>
+#include <complib/cl_thread.h>
+#include <opensm/osm_subnet.h>
+#include <opensm/osm_req.h>
+#include <opensm/osm_log.h>
+#include <opensm/osm_event_db.h>
+#include <opensm/osm_sm.h>
+#include <opensm/osm_base.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif /* __cplusplus */
+
+/****h* OpenSM/PERFMGR
+* NAME
+*	PERFMGR
+*
+* DESCRIPTION
+*       Performance manager thread which takes care of polling the fabric for
+*       Port counters values.
+*
+*	The PERFMGR object is thread safe.
+*
+* AUTHOR
+*	Ira Weiny, LLNL
+*
+*********/
+
+#define OSM_PERFMGR_DEFAULT_SWEEP_TIME_S 180
+#define OSM_PERFMGR_DEFAULT_DUMP_FILE OSM_DEFAULT_TMP_DIR "/osm_port_counters.log"
+#define OSM_DEFAULT_EVENT_PLUGIN "ibeventdb"
+
+/****s* OpenSM: PERFMGR/osm_perfmgr_state_t */
+typedef enum
+{
+  PERFMGR_STATE_DISABLE,
+  PERFMGR_STATE_ENABLED,
+  PERFMGR_STATE_NO_DB
+} osm_perfmgr_state_t;
+
+/****s* OpenSM: PERFMGR/osm_perfmgr_t
+*  This object should be treated as opaque and should
+*  be manipulated only through the provided functions.
+*/
+typedef struct _osm_perfmgr
+{
+  osm_thread_state_t    thread_state;
+  cl_event_t            sig_sweep;
+  cl_thread_t           sweeper;
+  osm_subn_t           *subn;
+  osm_sm_t             *sm;
+  cl_plock_t           *lock;
+  osm_log_t            *log;
+  osm_mad_pool_t       *mad_pool;
+  atomic32_t            trans_id;
+  osm_vendor_t         *vendor;
+  osm_bind_handle_t     bind_handle;
+  cl_disp_reg_handle_t  pc_disp_h;
+  osm_perfmgr_state_t   state;
+  uint16_t              sweep_time_s;
+  char                 *db_file;
+  char                 *event_db_dump_file;
+  char                 *event_db_plugin;
+  osm_event_db_t       *db;
+} osm_perfmgr_t;
+/*
+* FIELDS
+*	subn
+*	      Subnet object for this subnet.
+*
+*	log
+*	      Pointer to the log object.
+*
+*	mad_pool
+*		Pointer to the MAD pool.
+*
+*       event_db_dump_file
+*               File to be used to dump the Port Counters
+*
+*	mad_ctrl
+*		Mad Controller
+*********/
+
+/****f* OpenSM: Creation Functions */
+void osm_perfmgr_shutdown(osm_perfmgr_t *const p_perfmgr );
+void osm_perfmgr_destroy(osm_perfmgr_t * const p_perfmgr );
+
+/****f* OpenSM: Inline accessor functions */
+inline static void osm_perfmgr_set_state(osm_perfmgr_t *p_perfmgr,
+		osm_perfmgr_state_t state)
+{
+	p_perfmgr->state = state;
+}
+inline static osm_perfmgr_state_t osm_perfmgr_get_state(osm_perfmgr_t
+		*p_perfmgr) { return (p_perfmgr->state); }
+inline static char *osm_perfmgr_get_state_str(osm_perfmgr_t *p_perfmgr)
+{
+	switch (p_perfmgr->state)
+	{
+		case PERFMGR_STATE_DISABLE: return ("Disabled"); break;
+		case PERFMGR_STATE_ENABLED: return ("Enabled"); break;
+		case PERFMGR_STATE_NO_DB: return ("No Database"); break;
+	}
+	return ("UNKNOWN");
+}
+inline static void osm_perfmgr_set_sweep_time_s(osm_perfmgr_t *p_perfmgr, uint16_t time_s)
+{
+	p_perfmgr->sweep_time_s = time_s;
+	cl_event_signal(&p_perfmgr->sig_sweep);
+}
+inline static uint16_t osm_perfmgr_get_sweep_time_s(osm_perfmgr_t *p_perfmgr)
+{
+	return (p_perfmgr->sweep_time_s);
+}
+void osm_perfmgr_clear_counters(osm_perfmgr_t *p_perfmgr);
+void osm_perfmgr_dump_counters(osm_perfmgr_t *p_perfmgr,
+		osm_event_db_dump_t dump_type);
+
+ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * const p_perfmgr, const ib_net64_t port_guid);
+
+#if 0
+/* Work out the tracking of notice events */
+ib_api_status_t osm_report_notice_to_perfmgr(osm_log_t *const p_log, osm_subn_t *p_subn,
+					ib_mad_notice_attr_t *p_ntc );
+#endif
+
+/****f* OpenSM: PERFMGR/osm_perfmgr_init */
+ib_api_status_t
+osm_perfmgr_init(
+	osm_perfmgr_t* const perfmgr,
+	osm_subn_t* const subn,
+        osm_sm_t * const sm,
+	osm_log_t* const log,
+	osm_mad_pool_t * const mad_pool,
+	osm_vendor_t * const vendor,
+        cl_dispatcher_t* const disp,
+   	cl_plock_t* const lock,
+	const osm_subn_opt_t * const p_opt );
+/*
+* PARAMETERS
+*	perfmgr
+*		[in] Pointer to an osm_perfmgr_t object to initialize.
+*
+*	subn
+*		[in] Pointer to the Subnet object for this subnet.
+*
+*	sm
+*		[in] Pointer to the SM object for this subnet.
+*
+*	log
+*		[in] Pointer to the log object.
+*
+*	mad_pool
+*		[in] Pointer to the MAD pool.
+*
+*	vendor
+*		[in] Pointer to the vendor specific interfaces object.
+*
+*	disp
+*		[in] Pointer to the OpenSM central Dispatcher.
+*
+*	lock
+*		[in] Pointer to the OpenSM serializing lock.
+*
+*	p_opt
+*		[in] Starting options
+*
+* RETURN VALUES
+*	IB_SUCCESS if the PERFMGR object was initialized successfully.
+*********/
+
+#ifdef __cplusplus
+}
+#endif /* __cplusplus */
+
+#endif /* ENABLE_OSM_PERF_MGR */
+
+#endif		/* _OSM_PERFMGR_H_ */
+
diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h
index fc52b5e..0fdc18b 100644
--- a/osm/include/opensm/osm_subnet.h
+++ b/osm/include/opensm/osm_subnet.h
@@ -291,6 +291,12 @@ typedef struct _osm_subn_opt
   osm_qos_options_t        qos_rtr_options;
   boolean_t                enable_quirks;
   boolean_t                no_clients_rereg;
+#ifdef ENABLE_OSM_PERF_MGR
+  boolean_t                perfmgr;
+  uint16_t                 perfmgr_sweep_time_s;
+  char *                   event_db_dump_file;
+  char *                   event_db_plugin;
+#endif /* ENABLE_OSM_PERF_MGR */
 } osm_subn_opt_t;
 /*
 * FIELDS
@@ -468,6 +474,18 @@ typedef struct _osm_subn_opt
 *	sm_inactive
 *		OpenSM will start with SM in not active state.
 *	
+*	perfmgr
+*		Enable or disable the performance manager
+*
+*	perfmgr_sweep_time_s
+*		Define the period of PM sweep (in seconds).
+*
+*       event_db_dump_file
+*               File to dump the event database to
+*
+*       event_db_plugin
+*               Name of the event database plugin to load
+*
 *	qos_options
 *		Default set of QoS options
 *
diff --git a/osm/opensm.spec.in b/osm/opensm.spec.in
index c4e1798..8857a7b 100644
--- a/osm/opensm.spec.in
+++ b/osm/opensm.spec.in
@@ -38,10 +38,19 @@ Static libraries and header files for Op
 %define _disable_console_socket --disable-console-socket
 %endif
 
+%if %{?_with_perf_mgr:1}%{!?_with_perf_mgr:0}
+%define _enable_perf_mgr --enable-perf-mgr
+%endif
+%if %{?_without_perf_mgr:1}%{!?_without_perf_mgr:0}
+%define _disable_perf_mgr --disable-perf-mgr
+%endif
+
 %build
 %configure \
         %{?_enable_console_socket} \
-        %{?_disable_console_socket}
+        %{?_disable_console_socket} \
+        %{?_enable_perf_mgr} \
+        %{?_disable_perf_mgr}
 make %{?_smp_mflags}
 
 %install
diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am
index e2520b8..9a1f6f4 100644
--- a/osm/opensm/Makefile.am
+++ b/osm/opensm/Makefile.am
@@ -55,7 +55,8 @@ opensm_SOURCES = main.c osm_console.c os
 		 osm_trap_rcv.c osm_ucast_mgr.c osm_ucast_updn.c \
 		 osm_ucast_lash.c osm_ucast_file.c osm_ucast_ftree.c \
 		 osm_vl15intf.c osm_vl_arb_rcv.c \
-		 st.c
+		 st.c \
+		 osm_perfmgr.c osm_event_db.c
 if OSMV_OPENIB
 opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
 opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
@@ -78,7 +79,7 @@ endif
 # we always give precedence to local tree libs and then use the pre-installed ones.
 opensm_LDADD = -L../complib -L../libvendor -L. $(OSMV_LDADD) -lopensm -losmcomp -losmvendor
 
-opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread
+opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread -ldl
 
 opensmincludedir = $(includedir)/infiniband/opensm
 
diff --git a/osm/opensm/configure.in b/osm/opensm/configure.in
index ad3333a..9e23719 100644
--- a/osm/opensm/configure.in
+++ b/osm/opensm/configure.in
@@ -78,6 +78,9 @@ if test $console_socket = yes; then
 	    [Define as 1 if you want to enable a console on a socket connection])
 fi
 
+dnl select performance manager or not
+OPENIB_OSM_PERF_MGR_SEL
+
 dnl Provide user option to select vendor
 OPENIB_APP_OSMV_SEL
 
diff --git a/osm/opensm/main.c b/osm/opensm/main.c
index 153e44d..4fa3563 100644
--- a/osm/opensm/main.c
+++ b/osm/opensm/main.c
@@ -59,6 +59,7 @@
 #include <opensm/osm_version.h>
 #include <opensm/osm_opensm.h>
 #include <opensm/osm_console.h>
+#include <opensm/osm_perfmgr.h>
 
 volatile unsigned int osm_exit_flag = 0;
 
@@ -273,6 +274,13 @@ show_usage(void)
   printf("-I\n"
          "--inactive\n"
          "           Start SM in inactive rather than normal init SM state.\n\n");
+#ifdef ENABLE_OSM_PERF_MGR
+  printf( "--pm\n"
+          "          Activate the performance manager.\n\n");
+  printf( "--pm_sweep_time_s\n"
+          "          Define the period for PerfMgr sweeps (in seconds) default %ds.\n\n",
+	  OSM_PERFMGR_DEFAULT_SWEEP_TIME_S);
+#endif /* ENABLE_OSM_PERF_MGR */
   printf( "-v\n"
           "--verbose\n"
           "          This option increases the log verbosity level.\n"
@@ -630,6 +638,8 @@ main(
 #endif
       {  "daemon",        0, NULL, 'B'},
       {  "inactive",      0, NULL, 'I'},
+      {  "pm",            0, NULL, 1}, /* no short options for PM stuff */
+      {  "pm_sweep_time_s", 1, NULL, 2},
       {  NULL,            0, NULL,  0 }  /* Required at the end of the array */
     };
 
@@ -907,6 +917,15 @@ main(
       printf(" SM started in inactive state\n");
       break;
 
+#ifdef ENABLE_OSM_PERF_MGR
+    case 1:
+      opt.perfmgr = TRUE;
+      break;
+    case 2:
+      opt.perfmgr_sweep_time_s = atoi(optarg);
+      break;
+#endif /* ENABLE_OSM_PERF_MGR */
+
     case 'h':
     case '?':
     case ':':
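
The --pm_sweep_time_s handling in main.c stores atoi(optarg) in a uint16_t,
so non-numeric input becomes 0 and values above 65535 wrap around. A stricter
parse could look like the sketch below. It is not part of the patch:
parse_sweep_time_s is a hypothetical helper, and rejecting a sweep time of 0
is an assumption about what the sweeper should accept.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not in the patch): parse a sweep time in seconds.
 * Returns 0 and stores the value in *out on success; returns -1 and leaves
 * *out untouched on malformed, zero, or out-of-range input. */
static int parse_sweep_time_s(const char *arg, uint16_t *out)
{
	char *end;
	unsigned long val;

	if (!arg || *arg == '\0')
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 10);

	/* Reject conversion errors, trailing junk, 0, and anything that
	 * does not fit the uint16_t field used by the perfmgr options. */
	if (errno != 0 || *end != '\0' || val == 0 || val > UINT16_MAX)
		return -1;

	*out = (uint16_t)val;
	return 0;
}
```

The same helper would fit the "perfmgr sweep_time" console command, which
converts its argument the same way.
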
diff --git a/osm/opensm/osm_console.c b/osm/opensm/osm_console.c
index 38b978a..d6c30d8 100644
--- a/osm/opensm/osm_console.c
+++ b/osm/opensm/osm_console.c
@@ -52,6 +52,7 @@
 #include <ctype.h>
 #include <opensm/osm_console.h>
 #include <opensm/osm_version.h>
+#include <opensm/osm_perfmgr.h>
 
 struct command {
 	char *name;
@@ -136,6 +137,20 @@ static void help_logflush(FILE *out, int
 	fprintf(out, "logflush -- flush the osm.log file\n");
 }
 
+#ifdef ENABLE_OSM_PERF_MGR
+static void help_perfmgr(FILE *out, int detail)
+{
+	fprintf(out, "perfmgr [enable|disable|clear_counters|dump_counters|sweep_time][seconds]\n");
+	if (detail) {
+		fprintf(out, "perfmgr -- print the performance manager state\n");
+		fprintf(out, "   [enable|disable] -- change the perfmgr state\n");
+		fprintf(out, "   [sweep_time] -- change the perfmgr sweep time (requires [seconds] option)\n");
+		fprintf(out, "   [clear_counters] -- clear the counters stored\n");
+		fprintf(out, "   [dump_counters [mach]] -- dump the counters\n");
+	}
+}
+#endif /* ENABLE_OSM_PERF_MGR */
+
 /* more help routines go here */
 
 static void help_parse(char **p_last, osm_opensm_t *p_osm, FILE *out)
@@ -427,6 +442,66 @@ static void logflush_parse(char **p_last
 	fflush(p_osm->log.out_port);
 }
 
+#ifdef ENABLE_OSM_PERF_MGR
+static void perfmgr_parse(char **p_last, osm_opensm_t *p_osm, FILE *out)
+{
+	char *p_cmd;
+
+	p_cmd = next_token(p_last);
+	if (p_cmd)
+	{
+	   if (strcmp(p_cmd, "enable") == 0)
+	   {
+		   osm_perfmgr_set_state(&(p_osm->perfmgr), PERFMGR_STATE_ENABLED);
+	   }
+	   else if (strcmp(p_cmd, "disable") == 0)
+	   {
+		   osm_perfmgr_set_state(&(p_osm->perfmgr), PERFMGR_STATE_DISABLE);
+	   }
+	   else if (strcmp(p_cmd, "clear_counters") == 0)
+	   {
+		   osm_perfmgr_clear_counters(&(p_osm->perfmgr));
+	   }
+	   else if (strcmp(p_cmd, "dump_counters") == 0)
+	   {
+		p_cmd = next_token(p_last);
+		if (p_cmd && (strcmp(p_cmd, "mach") == 0)) {
+			osm_perfmgr_dump_counters(&(p_osm->perfmgr),
+					OSM_EVENT_DB_DUMP_MR);
+		} else {
+			osm_perfmgr_dump_counters(&(p_osm->perfmgr),
+					OSM_EVENT_DB_DUMP_HR);
+		}
+	   }
+	   else if (strcmp(p_cmd, "sweep_time") == 0)
+	   {
+		p_cmd = next_token(p_last);
+		if (p_cmd)
+		{
+			uint16_t time_s = atoi(p_cmd);
+		   	osm_perfmgr_set_sweep_time_s(&(p_osm->perfmgr), time_s);
+		}
+		else
+		{
+			fprintf(out, "sweep_time requires a time specified\n");
+		}
+	   }
+	   else
+	   {
+		fprintf(out, "\"%s\" option not found\n", p_cmd);
+	   }
+	} else {
+		fprintf(out, "Performance Manager status:\n"
+			     "state      : %s\n"
+		             "sweep time : %us\n"
+		        ,
+			osm_perfmgr_get_state_str(&(p_osm->perfmgr)),
+			osm_perfmgr_get_sweep_time_s(&(p_osm->perfmgr))
+			);
+	}
+}
+#endif /* ENABLE_OSM_PERF_MGR */
+
 /* This is public to be able to close it on exit */
 void osm_console_close_socket(osm_opensm_t *p_osm)
 {
@@ -456,6 +531,9 @@ static const struct command console_cmds
 	{ "resweep",	&help_resweep,		&resweep_parse},
 	{ "status",	&help_status,		&status_parse},
 	{ "logflush",	&help_logflush,		&logflush_parse},
+#ifdef ENABLE_OSM_PERF_MGR
+	{ "perfmgr",	&help_perfmgr,		&perfmgr_parse},
+#endif /* ENABLE_OSM_PERF_MGR */
 	{ NULL,		NULL,			NULL}	/* end of array */
 };
 
diff --git a/osm/opensm/osm_event_db.c b/osm/opensm/osm_event_db.c
new file mode 100644
index 0000000..90ca8da
--- /dev/null
+++ b/osm/opensm/osm_event_db.c
@@ -0,0 +1,172 @@
+/*
+ * Copyright (c) 2007 The Regents of the University of California.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#include <stdlib.h>
+#include <errno.h>
+#include <limits.h>
+#include <dlfcn.h>
+#include <sys/stat.h>
+
+#include <opensm/osm_event_db.h>
+
+/** =========================================================================
+ */
+osm_event_db_t *
+osm_event_db_construct(osm_log_t *p_log, char *type)
+{
+	char            lib_name[PATH_MAX];
+	osm_event_db_t *rc = NULL;
+
+	if (!type)
+		return (NULL);
+
+	/* find the plugin */
+	snprintf(lib_name, PATH_MAX, "lib%s.so", type);
+
+	rc = malloc(sizeof(*rc));
+	if (!rc)
+		return (NULL);
+
+	rc->handle = dlopen(lib_name, RTLD_LAZY);
+	if (!rc->handle)
+	{
+		osm_log(p_log, OSM_LOG_ERROR,
+			"Failed to open PM Database \"%s\" : \"%s\"\n",
+			lib_name, dlerror());
+		goto DLOPENFAIL;
+	}
+
+	rc->db_impl = (__osm_event_db_t *)dlsym(rc->handle, "__osm_event_db");
+	if (!rc->db_impl)
+	{
+		osm_log(p_log, OSM_LOG_ERROR,
+			"Failed to find __osm_event_db symbol in \"%s\" : \"%s\"\n",
+			lib_name, dlerror());
+		goto Exit;
+	}
+
+	/* Check the version to make sure this module will work with us */
+	if (rc->db_impl->interface_version != OSM_EVENT_DB_INTERFACE_VER)
+	{
+		osm_log(p_log, OSM_LOG_ERROR,
+			"__osm_event_db symbol is the wrong version %d != %d\n",
+			rc->db_impl->interface_version,
+			OSM_EVENT_DB_INTERFACE_VER);
+		goto Exit;
+	}
+
+	rc->db_data = rc->db_impl->construct(p_log);
+
+	if (!rc->db_data)
+		goto Exit;
+
+	rc->p_log = p_log;
+	return (rc);
+
+Exit:
+	dlclose(rc->handle);
+DLOPENFAIL:
+	free(rc);
+	return (NULL);
+}
+
+/** =========================================================================
+ */
+void
+osm_event_db_destroy(osm_event_db_t *db)
+{
+	if (db) {
+		db->db_impl->destroy(db->db_data);
+		dlclose(db->handle);
+		free(db);
+	}
+}
+
+/** =========================================================================
+ */
+osm_event_db_err_t
+osm_event_db_create_entry(osm_event_db_t *db, uint64_t guid, uint8_t num_ports)
+{
+	return(db->db_impl->create_entry(db->db_data, guid, num_ports));
+}
+
+/**********************************************************************
+ **********************************************************************/
+osm_event_db_err_t osm_event_db_get_prev_pc(osm_event_db_t *db, uint64_t guid,
+		uint8_t port, osm_pc_reading_t *reading)
+{
+	return (db->db_impl->get_prev_pc(db->db_data, guid, port, reading));
+}
+
+/**********************************************************************
+ * dump the data to the file "file"
+ **********************************************************************/
+osm_event_db_err_t
+osm_event_db_dump(osm_event_db_t *db, char *file, osm_event_db_dump_t dump_type)
+{
+	return (db->db_impl->dump(db->db_data, file, dump_type));
+}
+
+/**********************************************************************
+ * Clear the port counters from the db
+ **********************************************************************/
+void osm_event_db_clear_port_counters(osm_event_db_t *db)
+{
+	db->db_impl->clear_port_counters(db->db_data);
+}
+
+/**********************************************************************
+ * Add the reading to the osm_pm_node_t
+ **********************************************************************/
+osm_event_db_err_t
+osm_event_db_add_pc_reading(osm_event_db_t *db, uint64_t guid,
+                   uint8_t port, ib_port_counters_t *reading)
+{
+	return (db->db_impl->add_pc_reading(db->db_data, guid,
+				port, reading));
+}
+
+/**********************************************************************
+ * Clear the previous port counter reading for the given port
+ **********************************************************************/
+osm_event_db_err_t
+osm_event_db_clear_prev_pc(osm_event_db_t *db, uint64_t guid, uint8_t port)
+{
+	return (db->db_impl->clear_prev_pc(db->db_data, guid, port));
+}
+
diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c
index 8430605..fa572c5 100644
--- a/osm/opensm/osm_opensm.c
+++ b/osm/opensm/osm_opensm.c
@@ -172,6 +172,9 @@ osm_opensm_destroy(
      p_osm->routing_engine.delete(p_osm->routing_engine.context);
    osm_sa_destroy( &p_osm->sa );
    osm_sm_destroy( &p_osm->sm );
+#ifdef ENABLE_OSM_PERF_MGR
+   osm_perfmgr_destroy( &p_osm->perfmgr );
+#endif /* ENABLE_OSM_PERF_MGR */
    osm_db_destroy( &p_osm->db );
    osm_vl15_destroy( &p_osm->vl15, &p_osm->mad_pool );
    osm_mad_pool_destroy( &p_osm->mad_pool );
@@ -286,6 +289,21 @@ osm_opensm_init(
    if( status != IB_SUCCESS )
       goto Exit;
 
+#ifdef ENABLE_OSM_PERF_MGR
+   status = osm_perfmgr_init( &p_osm->perfmgr,
+                         &p_osm->subn,
+			 &p_osm->sm,
+                         &p_osm->log,
+			 &p_osm->mad_pool,
+			 p_osm->p_vendor,
+			 &p_osm->disp,
+			 &p_osm->lock,
+			 p_opt);
+
+   if( status != IB_SUCCESS )
+      goto Exit;
+#endif /* ENABLE_OSM_PERF_MGR */
+
    if( p_opt->routing_engine_name &&
        setup_routing_engine(p_osm, p_opt->routing_engine_name)) {
       osm_log( &p_osm->log, OSM_LOG_VERBOSE,
@@ -319,6 +337,12 @@ osm_opensm_bind(
    if( status != IB_SUCCESS )
       goto Exit;
 
+#ifdef ENABLE_OSM_PERF_MGR
+   status = osm_perfmgr_bind( &p_osm->perfmgr, guid );
+   if( status != IB_SUCCESS )
+      goto Exit;
+#endif /* ENABLE_OSM_PERF_MGR */
+
  Exit:
    OSM_LOG_EXIT( &p_osm->log );
    return ( status );
diff --git a/osm/opensm/osm_perfmgr.c b/osm/opensm/osm_perfmgr.c
new file mode 100644
index 0000000..297a0e2
--- /dev/null
+++ b/osm/opensm/osm_perfmgr.c
@@ -0,0 +1,686 @@
+/*
+ * Copyright (c) 2007 The Regents of the University of California.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+
+/*
+ * Abstract:
+ *    Implementation of osm_perfmgr_t.
+ *
+ * Author:
+ *    Ira Weiny, LLNL
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#ifdef ENABLE_OSM_PERF_MGR
+
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <poll.h>
+#include <netinet/in.h>
+#include <complib/cl_debug.h>
+#include <iba/ib_types.h>
+#include <errno.h>
+#include <sys/time.h>
+#include <opensm/osm_perfmgr.h>
+#include <opensm/osm_log.h>
+#include <opensm/osm_node.h>
+#include <complib/cl_thread.h>
+#include <vendor/osm_vendor_api.h>
+
+#define  OSM_PERFMGR_INITIAL_TID_VALUE 0xcafe
+
+/**********************************************************************
+ * Receive the MAD from the vendor layer and post it for processing by the
+ * dispatcher.
+ **********************************************************************/
+static void
+osm_perfmgr_mad_recv_callback(osm_madw_t *p_madw, void* bind_context,
+   				osm_madw_t *p_req_madw )
+{
+	osm_perfmgr_t      *pm = (osm_perfmgr_t *)bind_context;
+	cl_status_t         cl_status = CL_SUCCESS;
+	
+	OSM_LOG_ENTER( pm->log, osm_pm_mad_recv_callback );
+	
+	osm_madw_copy_context( p_madw, p_req_madw );
+	osm_mad_pool_put( pm->mad_pool, p_req_madw );
+	
+	/* post this message for later processing. */
+	cl_status = cl_disp_post(pm->pc_disp_h, OSM_MSG_MAD_PORT_COUNTERS,
+	      	           	(void *)p_madw, NULL, NULL);
+#if 0
+	do {
+		struct timeval      rcv_time;
+		gettimeofday(&rcv_time, NULL);
+		osm_log(pm->log, OSM_LOG_INFO,
+			"perfmgr rcv time %ld\n",
+			rcv_time.tv_usec -
+			p_madw->context.perfmgr_context.query_start.tv_usec);
+	} while (0);
+#endif
+	OSM_LOG_EXIT( pm->log );
+}
+
+/**********************************************************************
+ * Process errors from the MAD send.
+ **********************************************************************/
+static void
+osm_perfmgr_mad_send_err_callback(void* bind_context, osm_madw_t *p_madw)
+{
+	osm_perfmgr_t *pm = (osm_perfmgr_t *)bind_context;
+	osm_madw_context_t *context = &(p_madw->context);
+	
+	OSM_LOG_ENTER( pm->log, osm_pm_mad_send_err_callback );
+	
+	osm_log( pm->log, OSM_LOG_ERROR,
+	           "osm_pm_mad_send_err_callback: 0x%" PRIx64 " port %d\n",
+	      	  context->perfmgr_context.node_guid,
+	      	  context->perfmgr_context.port);
+	
+	osm_mad_pool_put( pm->mad_pool, p_madw );
+	
+	OSM_LOG_EXIT( pm->log );
+}
+
+/**********************************************************************
+ * Bind the PM to the vendor layer for MAD sends/receives
+ **********************************************************************/
+ib_api_status_t
+osm_perfmgr_bind(osm_perfmgr_t * const pm, const ib_net64_t port_guid)
+{
+	osm_bind_info_t bind_info;
+	ib_api_status_t status = IB_SUCCESS;
+	
+	OSM_LOG_ENTER( pm->log, osm_pm_bind );
+	
+	if( pm->bind_handle != OSM_BIND_INVALID_HANDLE ) {
+		osm_log( pm->log, OSM_LOG_ERROR,
+		         "osm_pm_mad_ctrl_bind: Multiple binds not allowed\n" );
+		status = IB_ERROR;
+		goto Exit;
+	}
+	
+	bind_info.port_guid = port_guid;
+	bind_info.mad_class = IB_MCLASS_PERF;
+	bind_info.class_version = 1;
+	bind_info.is_responder = FALSE;
+	bind_info.is_report_processor = FALSE;
+	bind_info.is_trap_processor = FALSE;
+	bind_info.recv_q_size = OSM_PM_DEFAULT_QP1_RCV_SIZE;
+	bind_info.send_q_size = OSM_PM_DEFAULT_QP1_SEND_SIZE;
+	
+	osm_log( pm->log, OSM_LOG_VERBOSE,
+	         "osm_pm_mad_bind: "
+	         "Binding to port GUID 0x%" PRIx64 "\n",
+	         cl_ntoh64( port_guid ) );
+	
+	pm->bind_handle = osm_vendor_bind( pm->vendor,
+	                                  &bind_info,
+	                                  pm->mad_pool,
+	                                  osm_perfmgr_mad_recv_callback,
+	                                  osm_perfmgr_mad_send_err_callback,
+	                                  pm );
+	
+	if( pm->bind_handle == OSM_BIND_INVALID_HANDLE ) {
+		status = IB_ERROR;
+		osm_log( pm->log, OSM_LOG_ERROR,
+		         "osm_pm_mad_bind: Vendor specific bind failed (%s)\n",
+		         ib_get_err_str(status) );
+		goto Exit;
+	}
+
+Exit:
+ 	OSM_LOG_EXIT( pm->log );
+	return( status );
+}
+
+/**********************************************************************
+ * Unbind the PM from the vendor layer for MAD sends/receives
+ **********************************************************************/
+void
+osm_perfmgr_mad_unbind(osm_perfmgr_t * const pm)
+{
+	OSM_LOG_ENTER( pm->log, osm_perfmgr_mad_unbind );
+	if( pm->bind_handle == OSM_BIND_INVALID_HANDLE ) {
+		osm_log( pm->log, OSM_LOG_ERROR,
+		         "osm_pm_mad_unbind: No previous bind\n" );
+		goto Exit;
+	}
+	osm_vendor_unbind( pm->bind_handle );
+Exit:
+	OSM_LOG_EXIT( pm->log );
+}
+
+/**********************************************************************
+ * Given a node and a port return the appropriate lid to query that port
+ **********************************************************************/
+static ib_net16_t
+get_lid(osm_node_t *p_node, uint8_t port)
+{
+	ib_net16_t lid = 0;
+	
+	switch (p_node->node_info.node_type)
+	{
+		case IB_NODE_TYPE_CA:
+		case IB_NODE_TYPE_ROUTER:
+			  lid = osm_node_get_base_lid(p_node, port);
+			  break;
+		case IB_NODE_TYPE_SWITCH:
+			  lid = osm_node_get_base_lid(p_node, 0);
+			  break;
+		default:
+			  break;
+	}
+	return (lid);
+}
+
+/**********************************************************************
+ * Form the Port Counter MAD and send the MAD for a single port.
+ **********************************************************************/
+static ib_api_status_t
+osm_perfmgr_send_pc_mad(osm_perfmgr_t *perfmgr, ib_net16_t dest_lid, uint8_t port,
+			uint8_t mad_method, osm_madw_context_t* const p_context )
+{
+	ib_api_status_t     status = IB_SUCCESS;
+	ib_port_counters_t *port_counter = NULL;
+	ib_perfmgr_mad_t   *pm_mad = NULL;
+	osm_madw_t         *p_madw = NULL;
+	
+	OSM_LOG_ENTER(perfmgr->log, osm_perfmgr_send_pc_mad);
+	
+	p_madw = osm_mad_pool_get(perfmgr->mad_pool, perfmgr->bind_handle, MAD_BLOCK_SIZE, NULL);
+	if (p_madw == NULL)
+		return (IB_INSUFFICIENT_MEMORY);
+	
+	pm_mad = osm_madw_get_perfmgr_mad_ptr(p_madw);
+	
+	/* build the mad */
+	pm_mad->header.base_ver = 1;
+	pm_mad->header.mgmt_class = IB_MCLASS_PERF;
+	pm_mad->header.class_ver = 1;
+	pm_mad->header.method = mad_method;
+	pm_mad->header.status = 0;
+	pm_mad->header.class_spec = 0;
+	pm_mad->header.trans_id = cl_hton64((uint64_t)cl_atomic_inc(&(perfmgr->trans_id)));
+	pm_mad->header.attr_id = IB_MAD_ATTR_PORT_CNTRS;
+	pm_mad->header.resv = 0;
+	pm_mad->header.attr_mod = 0;
+	
+	port_counter = (ib_port_counters_t *)&(pm_mad->data);
+	memset(port_counter, 0, sizeof(*port_counter));
+	port_counter->port_select = port;
+	port_counter->counter_select = 0xFFFF;
+	
+	p_madw->mad_addr.dest_lid = dest_lid;
+	p_madw->mad_addr.addr_type.gsi.remote_qp = cl_hton32(1);
+	p_madw->mad_addr.addr_type.gsi.remote_qkey = cl_hton32(IB_QP1_WELL_KNOWN_Q_KEY);
+	/* FIXME what about other partitions */
+	p_madw->mad_addr.addr_type.gsi.pkey = cl_hton16(0xFFFF);
+	p_madw->mad_addr.addr_type.gsi.service_level = 0;
+	p_madw->mad_addr.addr_type.gsi.global_route = FALSE;
+	p_madw->resp_expected = TRUE;
+	
+	if( p_context )
+		p_madw->context = *p_context;
+	
+	status = osm_vendor_send(perfmgr->bind_handle, p_madw, TRUE);
+	
+	OSM_LOG_EXIT(perfmgr->log);
+	return( status );
+}
+
+/**********************************************************************
+ * query the Port Counters of all the nodes in the subnet.
+ **********************************************************************/
+static void
+__osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context )
+{
+	ib_api_status_t     status = IB_SUCCESS;
+	uint8_t             port = 0;
+	osm_perfmgr_t      *pm = (osm_perfmgr_t *)context;
+	osm_node_t         *p_node = (osm_node_t *)p_map_item;
+	uint8_t             node_desc[IB_NODE_DESCRIPTION_SIZE];
+	osm_madw_context_t  mad_context;
+	uint8_t             num_ports = 0;
+	uint64_t            node_guid = 0;
+	
+	OSM_LOG_ENTER( pm->log, __osm_pm_query_counters );
+	
+	memcpy(node_desc, p_node->node_desc.description,
+			IB_NODE_DESCRIPTION_SIZE);
+	node_desc[IB_NODE_DESCRIPTION_SIZE-1] = '\0';
+	
+	num_ports = osm_node_get_num_physp(p_node);
+	node_guid = cl_ntoh64(p_node->node_info.node_guid);
+	
+	/* make sure we have a database object ready to store this information */
+	if (osm_event_db_create_entry(pm->db, node_guid, num_ports) !=
+	      	  OSM_EVENT_DB_SUCCESS)
+	{
+		osm_log(pm->log, OSM_LOG_ERROR,
+			"PerfMgr DB create entry failed for 0x%" PRIx64 " : %s\n",
+			node_guid, strerror(errno));
+		goto Exit;
+	}
+	
+	/* issue the queries for each port */
+	for (port = 1; port < num_ports; port++)
+	{
+		ib_net16_t lid = get_lid(p_node, port);
+		if (lid == 0)
+		{
+			osm_log(pm->log, OSM_LOG_DEBUG,
+				"WARN: node 0x%" PRIx64 " port %d (%s): port out of range, skipping\n",
+				cl_ntoh64(p_node->node_info.node_guid), port, node_desc);
+			continue;
+		}
+		
+		mad_context.perfmgr_context.node_guid = node_guid;
+		mad_context.perfmgr_context.port = port;
+		mad_context.perfmgr_context.num_ports = num_ports;
+		mad_context.perfmgr_context.mad_method = IB_MAD_METHOD_GET;
+#if 0
+		gettimeofday(&(mad_context.perfmgr_context.query_start), NULL);
+#endif
+		osm_log(pm->log, OSM_LOG_VERBOSE,
+				"   Getting stats for node 0x%" PRIx64 " port %d (lid %X) (%s)\n",
+				node_guid, port, cl_ntoh16(lid), node_desc);
+		status = osm_perfmgr_send_pc_mad(pm, lid, port, IB_MAD_METHOD_GET, &mad_context);
+		if (status != IB_SUCCESS)
+		{
+		      osm_log(pm->log, OSM_LOG_ERROR,
+				"Failed to issue port counter query for node 0x%" PRIx64 " port %d (%s)\n",
+				node_guid, port, node_desc);
+		}
+	}
+Exit:
+	OSM_LOG_EXIT( pm->log );
+}
+
+/**********************************************************************
+ * Main PerfMgr Thread.
+ * Loop continuously and query the performance counters.
+ **********************************************************************/
+void
+__osm_perfmgr_sweeper(void *p_ptr)
+{
+	ib_api_status_t status;
+	osm_perfmgr_t *const pm = ( osm_perfmgr_t * ) p_ptr;
+	
+	OSM_LOG_ENTER( pm->log, __osm_pm_sweeper );
+	
+	if( pm->thread_state == OSM_THREAD_STATE_INIT )
+		pm->thread_state = OSM_THREAD_STATE_RUN;
+	
+	while( pm->thread_state == OSM_THREAD_STATE_RUN ) {
+		/*  do the sweep only if we are in MASTER state
+		 *  AND we have been activated.
+		 *  FIXME put something in here to try and reduce the load on the system
+		 *  when it is not IDLE.
+		if (pm->sm->state_mgr.state != OSM_SM_STATE_IDLE)
+		 */
+		if( pm->subn->sm_state == IB_SMINFO_STATE_MASTER
+		    && pm->state == PERFMGR_STATE_ENABLED) {
+#if 0
+			struct timeval before, after;
+			gettimeofday(&before, NULL);
+#endif
+			/* for each node query their counters */
+			cl_plock_acquire(pm->lock);
+			osm_log(pm->log, OSM_LOG_VERBOSE, "Gathering PerfMgr stats\n");
+			cl_qmap_apply_func(&(pm->subn->node_guid_tbl),
+			    	  __osm_perfmgr_query_counters, (void *)pm);
+			cl_plock_release(pm->lock);
+#if 0
+			gettimeofday(&after, NULL);
+			osm_log(pm->log, OSM_LOG_INFO,
+				"total sweep time : %ld us\n", after.tv_usec - before.tv_usec);
+#endif
+		}
+
+		/* Wait for a forced sweep or period timeout. */
+		status = cl_event_wait_on( &pm->sig_sweep,
+		                   		pm->sweep_time_s * 1000000,
+		                   		TRUE );
+	}
+	
+	OSM_LOG_EXIT( pm->log );
+}
+
+/**********************************************************************
+ **********************************************************************/
+void
+osm_perfmgr_shutdown(osm_perfmgr_t * const pm)
+{
+	OSM_LOG_ENTER( pm->log, osm_perfmgr_shutdown );
+	osm_perfmgr_mad_unbind(pm);
+	OSM_LOG_EXIT( pm->log );
+}
+
+/**********************************************************************
+ **********************************************************************/
+void
+osm_perfmgr_destroy(osm_perfmgr_t * const pm)
+{
+	OSM_LOG_ENTER( pm->log, osm_perfmgr_destroy );
+	free(pm->event_db_dump_file);
+	free(pm->event_db_plugin);
+	osm_event_db_destroy(pm->db);
+	OSM_LOG_EXIT( pm->log );
+}
+
+/**********************************************************************
+ * Return 1 if the counter is at or above its overflow threshold
+ **********************************************************************/
+int counter_overflow_4(uint8_t val)
+{
+	return (val >= 10);
+}
+int counter_overflow_8(uint8_t val)
+{
+	return (val >= (UINT8_MAX - (UINT8_MAX/4)));
+}
+int counter_overflow_16(uint16_t val)
+{
+	return (cl_ntoh16(val) >= (UINT16_MAX - (UINT16_MAX/4)));
+}
+int counter_overflow_32(uint32_t val)
+{
+	return (cl_ntoh32(val) >= (UINT32_MAX - (UINT32_MAX/4)));
+}
+
+/**********************************************************************
+ * Check if the port counters have overflowed and if so issue a clear MAD to
+ * the port.
+ **********************************************************************/
+static void
+osm_perfmgr_check_clear(osm_perfmgr_t *pm, uint64_t node_guid,
+	     uint8_t port, int num_ports, ib_port_counters_t *cr)
+{
+  	osm_madw_context_t  mad_context;
+
+  	OSM_LOG_ENTER( pm->log, osm_pm_check_clear );
+	if (counter_overflow_16(cr->symbol_err_cnt)
+		|| counter_overflow_8(cr->link_err_recover)
+		|| counter_overflow_8(cr->link_downed)
+		|| counter_overflow_16(cr->rcv_err)
+		|| counter_overflow_16(cr->rcv_rem_phys_err)
+		|| counter_overflow_16(cr->rcv_switch_relay_err)
+		|| counter_overflow_16(cr->xmit_discards)
+		|| counter_overflow_8(cr->xmit_constraint_err)
+		|| counter_overflow_8(cr->rcv_constraint_err)
+		|| counter_overflow_4(PC_LINK_INT(cr->link_int_buffer_overrun))
+		|| counter_overflow_4(PC_BUF_OVERRUN(cr->link_int_buffer_overrun))
+		|| counter_overflow_16(cr->vl15_dropped)
+		|| counter_overflow_32(cr->xmit_data)
+		|| counter_overflow_32(cr->rcv_data)
+		|| counter_overflow_32(cr->xmit_pkts)
+		|| counter_overflow_32(cr->rcv_pkts)
+		)
+	{
+		osm_log(pm->log, OSM_LOG_INFO,
+			"Counter overflow: 0x%" PRIx64 " port %d; clearing counters\n",
+			node_guid, port);
+  		osm_node_t *p_node = NULL;
+		ib_net16_t  lid = 0;
+        	cl_plock_acquire(pm->lock);
+        	p_node = (osm_node_t *)cl_qmap_get(&(pm->subn->node_guid_tbl),
+						cl_hton64(node_guid));
+    		lid = get_lid(p_node, port);
+        	cl_plock_release(pm->lock);
+    		if (lid == 0)
+    		{
+    			osm_log(pm->log, OSM_LOG_INFO,
+    				"Failed to clear counters for node 0x%" PRIx64 " port %d; failed to get lid\n",
+    				node_guid, port);
+        		goto Exit;
+    		}
+    		mad_context.perfmgr_context.node_guid = node_guid;
+    		mad_context.perfmgr_context.port = port;
+    		mad_context.perfmgr_context.num_ports = num_ports;
+    		mad_context.perfmgr_context.mad_method = IB_MAD_METHOD_SET;
+		/* clear port counter */
+		osm_perfmgr_send_pc_mad(pm, lid, port, IB_MAD_METHOD_SET, &mad_context);
+	}
+Exit:
+  	OSM_LOG_EXIT( pm->log );
+}
+
+/**********************************************************************
+ * Check values for logging of errors
+ **********************************************************************/
+static void
+osm_perfmgr_log_events(osm_perfmgr_t *pm, uint64_t node_guid, uint8_t port,
+			ib_port_counters_t *reading)
+{
+	osm_pc_reading_t    prev_read;
+	ib_port_counters_t *prev;
+	time_t              time_diff = 0;
+  	osm_event_db_err_t  err = osm_event_db_get_prev_pc(pm->db, node_guid, port, &prev_read);
+  	if (err != OSM_EVENT_DB_SUCCESS)
+  	{
+		osm_log(pm->log, OSM_LOG_VERBOSE,
+			"failed to find previous reading for 0x%" PRIx64 " port %u\n",
+			node_guid, port);
+		return;
+  	}
+	time_diff = (time(NULL) - prev_read.time);
+	prev = &(prev_read.reading);
+
+	/* FIXME these events should be defineable by the user in a config
+	 * file somewhere. */
+	if (cl_ntoh16(reading->symbol_err_cnt) > cl_ntoh16(prev->symbol_err_cnt)) {
+		osm_log(pm->log, OSM_LOG_ERROR,
+			"Found %u Symbol errors in %lu sec on node 0x%" PRIx64 " port %u\n",
+			(cl_ntoh16(reading->symbol_err_cnt) - cl_ntoh16(prev->symbol_err_cnt)),
+			time_diff,
+			node_guid,
+			port);
+	}
+	if (cl_ntoh16(reading->rcv_err) > cl_ntoh16(prev->rcv_err)) {
+		osm_log(pm->log, OSM_LOG_ERROR,
+			"Found %u Receive errors in %lu sec on node 0x%" PRIx64 " port %u\n",
+			(cl_ntoh16(reading->rcv_err) - cl_ntoh16(prev->rcv_err)),
+			time_diff,
+			node_guid,
+			port);
+	}
+	if (cl_ntoh16(reading->xmit_discards) > cl_ntoh16(prev->xmit_discards)) {
+		osm_log(pm->log, OSM_LOG_ERROR,
+			"Found %u XMIT Discards in %lu sec on node 0x%" PRIx64 " port %u\n",
+			(cl_ntoh16(reading->xmit_discards) - cl_ntoh16(prev->xmit_discards)),
+			time_diff,
+			node_guid,
+			port);
+	}
+}
+
+
+/**********************************************************************
+ * The dispatcher uses a thread pool which will call this function when we have
+ * a thread available to process our MAD received from the wire.
+ **********************************************************************/
+static void
+osm_pc_rcv_process(void *context, void *data)
+{
+	osm_perfmgr_t      *const pm = (osm_perfmgr_t *)context;
+	osm_madw_t         *p_madw = (osm_madw_t *)data;
+	osm_madw_context_t *mad_context = &(p_madw->context);
+	ib_port_counters_t *counter_reading =
+				(ib_port_counters_t *)&(osm_madw_get_perfmgr_mad_ptr(p_madw)->data);
+	uint64_t            node_guid = mad_context->perfmgr_context.node_guid;
+	uint8_t             port_num = mad_context->perfmgr_context.port;
+	int                 num_ports = mad_context->perfmgr_context.num_ports;
+	
+	OSM_LOG_ENTER( pm->log, osm_pc_rcv_process );
+	
+	osm_log(pm->log, OSM_LOG_VERBOSE,
+	      	  "Processing received MAD context 0x%" PRIx64 " port %u/%d\n",
+	      	  node_guid, port_num, num_ports);
+	
+	/* log any critical events from this reading */
+	osm_perfmgr_log_events(pm, node_guid, port_num, counter_reading);
+	
+	if (mad_context->perfmgr_context.mad_method == IB_MAD_METHOD_GET)
+		osm_event_db_add_pc_reading(pm->db, node_guid, port_num, counter_reading);
+	else
+		osm_event_db_clear_prev_pc(pm->db, node_guid, port_num);
+	osm_perfmgr_check_clear(pm, node_guid, port_num, num_ports, counter_reading);
+	
+#if 0
+	do {
+		struct timeval      proc_time;
+		gettimeofday(&proc_time, NULL);
+		osm_log(pm->log, OSM_LOG_INFO,
+			"perfmgr done processing time %ld\n",
+			proc_time.tv_usec -
+			p_madw->context.perfmgr_context.query_start.tv_usec);
+	} while (0);
+#endif
+
+	osm_mad_pool_put( pm->mad_pool, p_madw );
+	
+	OSM_LOG_EXIT( pm->log );
+}
+
+/**********************************************************************
+ * Initialize the PERFMGR object
+ **********************************************************************/
+ib_api_status_t
+osm_perfmgr_init(
+	osm_perfmgr_t * const pm,
+	osm_subn_t * const subn,
+	osm_sm_t * const sm,
+	osm_log_t * const log,
+	osm_mad_pool_t * const mad_pool,
+	osm_vendor_t * const vendor,
+	cl_dispatcher_t* const disp,
+	cl_plock_t* const lock,
+	const osm_subn_opt_t * const p_opt )
+{
+	ib_api_status_t    status = IB_SUCCESS;
+	
+	OSM_LOG_ENTER( log, osm_pm_init );
+	
+	osm_log(log, OSM_LOG_VERBOSE, "initing PM\n");
+	
+	memset( pm, 0, sizeof( *pm ) );
+	
+	cl_event_construct(&pm->sig_sweep);
+	cl_event_init(&pm->sig_sweep, FALSE);
+	pm->subn = subn;
+	pm->sm = sm;
+	pm->log = log;
+	pm->mad_pool = mad_pool;
+	pm->vendor = vendor;
+	pm->trans_id = OSM_PERFMGR_INITIAL_TID_VALUE;
+	pm->lock = lock;
+	pm->state = p_opt->perfmgr ? PERFMGR_STATE_ENABLED : PERFMGR_STATE_DISABLE;
+	pm->sweep_time_s = p_opt->perfmgr_sweep_time_s;
+	pm->event_db_dump_file = strdup(p_opt->event_db_dump_file);
+	pm->event_db_plugin = strdup(p_opt->event_db_plugin);
+	
+	pm->db = osm_event_db_construct(pm->log, pm->event_db_plugin);
+	if (!pm->db)
+	{
+	      pm->state = PERFMGR_STATE_NO_DB;
+	      goto Exit;
+	}
+	
+	pm->pc_disp_h = cl_disp_register(disp, OSM_MSG_MAD_PORT_COUNTERS,
+	                              osm_pc_rcv_process, pm);
+	if( pm->pc_disp_h == CL_DISP_INVALID_HANDLE )
+		goto Exit;
+	
+	pm->thread_state = OSM_THREAD_STATE_INIT;
+	status = cl_thread_init( &pm->sweeper, __osm_perfmgr_sweeper, pm,
+	                       "PerfMgr sweeper" );
+	if( status != IB_SUCCESS )
+	 	goto Exit;
+	
+Exit:
+	OSM_LOG_EXIT( log );
+	return ( status );
+}
+
+/**********************************************************************
+ * Clear the counters from the db
+ **********************************************************************/
+void
+osm_perfmgr_clear_counters(osm_perfmgr_t *pm)
+{
+	/**
+	 * FIXME todo issue clear on the fabric?
+	 */
+	osm_event_db_clear_port_counters(pm->db);
+  	osm_log( pm->log, OSM_LOG_INFO, "PerfMgr counters cleared\n");
+}
+
+/*******************************************************************
+ * Have the DB dump its information to the file specified.
+ *******************************************************************/
+void
+osm_perfmgr_dump_counters(osm_perfmgr_t *pm, osm_event_db_dump_t dump_type)
+{
+	if (osm_event_db_dump(pm->db, pm->event_db_dump_file, dump_type) != 0)
+	{
+      		osm_log( pm->log, OSM_LOG_ERROR,
+               		"DB dump port counters: failed to write file %s : %s\n",
+               		pm->event_db_dump_file, strerror(errno));
+	}
+}
+
+#if 0
+/*******************************************************************
+ * Use this later to track events on the fabric
+ **********************************************************************/
+ib_api_status_t
+osm_report_notice_to_perfmgr(osm_log_t* const log, osm_subn_t*  subn,
+  			ib_mad_notice_attr_t *p_ntc )
+{
+  OSM_LOG_ENTER( log, osm_report_trap_to_pm );
+  if ((p_ntc->generic_type & 0x80)
+	  && (cl_ntoh16(p_ntc->g_or_v.generic.trap_num) == 128)) {
+	  osm_log( log, OSM_LOG_INFO, "PerfMgr notified of trap 128\n");
+  }
+  OSM_LOG_EXIT( log );
+  return (IB_SUCCESS);
+}
+#endif
+
+#endif /* ENABLE_OSM_PERF_MGR */
+
diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
index c8c3ddc..77c19a5 100644
--- a/osm/opensm/osm_subnet.c
+++ b/osm/opensm/osm_subnet.c
@@ -66,6 +66,7 @@
 #include <opensm/osm_multicast.h>
 #include <opensm/osm_inform.h>
 #include <opensm/osm_console.h>
+#include <opensm/osm_perfmgr.h>
 
 #if defined(PATH_MAX)
 #define OSM_PATH_MAX	(PATH_MAX + 1)
@@ -471,6 +472,12 @@ osm_subn_set_default_opt(
   p_opt->honor_guid2lid_file = FALSE;
   p_opt->daemon = FALSE;
   p_opt->sm_inactive = FALSE;
+#ifdef ENABLE_OSM_PERF_MGR
+  p_opt->perfmgr = FALSE;
+  p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S;
+  p_opt->event_db_dump_file = OSM_PERFMGR_DEFAULT_DUMP_FILE;
+  p_opt->event_db_plugin = OSM_DEFAULT_EVENT_PLUGIN;
+#endif /* ENABLE_OSM_PERF_MGR */
 
   p_opt->dump_files_dir = getenv("OSM_TMP_DIR");
   if (!p_opt->dump_files_dir || !(*p_opt->dump_files_dir))
@@ -1076,6 +1083,24 @@ osm_subn_parse_conf_file(
         "sm_inactive",
         p_key, p_val, &p_opts->sm_inactive);
 
+#ifdef ENABLE_OSM_PERF_MGR
+      __osm_subn_opts_unpack_boolean(
+        "perfmgr",
+        p_key, p_val, &p_opts->perfmgr);
+
+      __osm_subn_opts_unpack_uint16(
+        "perfmgr_sweep_time_s",
+        p_key, p_val, &p_opts->perfmgr_sweep_time_s);
+
+      __osm_subn_opts_unpack_charp(
+        "event_db_dump_file",
+        p_key, p_val, &p_opts->event_db_dump_file);
+
+      __osm_subn_opts_unpack_charp(
+        "event_db_plugin",
+        p_key, p_val, &p_opts->event_db_plugin);
+#endif /* ENABLE_OSM_PERF_MGR */
+
       subn_parse_qos_options("qos",
         p_key, p_val, &p_opts->qos_options);
 
@@ -1321,6 +1346,32 @@ osm_subn_write_conf_file(
     p_opts->sm_inactive ? "TRUE" : "FALSE"
     );
 
+#ifdef ENABLE_OSM_PERF_MGR
+  fprintf(
+    opts_file,
+    "#\n# Performance Manager Options\n#\n"
+    "# perfmgr enable\n"
+    "perfmgr %s\n\n"
+    "# sweep time in seconds\n"
+    "perfmgr_sweep_time_s %d\n\n"
+    ,
+    p_opts->perfmgr ? "TRUE" : "FALSE",
+    p_opts->perfmgr_sweep_time_s
+    );
+
+  fprintf(
+    opts_file,
+    "#\n# Event DB Options\n#\n"
+    "# Dump file to dump the events to\n"
+    "event_db_dump_file %s\n\n"
+    "# Event db plugin\n"
+    "event_db_plugin %s\n\n"
+    ,
+    p_opts->event_db_dump_file,
+    p_opts->event_db_plugin
+    );
+#endif /* ENABLE_OSM_PERF_MGR */
+
   fprintf( 
     opts_file,
     "#\n# DEBUG FEATURES\n#\n"
diff --git a/osm/opensm/osm_trap_rcv.c b/osm/opensm/osm_trap_rcv.c
index 0858968..19be781 100644
--- a/osm/opensm/osm_trap_rcv.c
+++ b/osm/opensm/osm_trap_rcv.c
@@ -698,6 +698,21 @@ __osm_trap_rcv_process_request(
     goto Exit;
   }
 
+#ifdef ENABLE_OSM_PERF_MGR
+#if 0
+  /* we still need to work out how this will work */
+  status = osm_report_notice_to_perfmgr(p_rcv->p_log, p_rcv->p_subn, p_ntci);
+  if( status != IB_SUCCESS )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "__osm_trap_rcv_process_request: ERR 3803: "
+             "Error sending trap reports (%s)\n",
+             ib_get_err_str( status ) );
+    goto Exit;
+  }
+#endif
+#endif /* ENABLE_OSM_PERF_MGR */
+
  Exit:
   OSM_LOG_EXIT( p_rcv->p_log );
 }
-- 
1.4.4
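
The counter checks in the patch above work on wire-order (big-endian)
values: a counter is flagged once its host-order value reaches three
quarters of its range, and deltas between readings only make sense after
the same conversion. The following is a minimal standalone sketch of
that logic, not the OpenSM code itself. All function names here are
hypothetical, and the two byte-order helpers merely stand in for
cl_hton16()/cl_ntoh16().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-ins for cl_hton16()/cl_ntoh16(); names are hypothetical. */
static uint16_t host_to_wire16(uint16_t host)
{
	uint8_t b[2] = { (uint8_t)(host >> 8), (uint8_t)(host & 0xff) };
	uint16_t wire;

	memcpy(&wire, b, sizeof wire);
	return wire;
}

static uint16_t wire_to_host16(uint16_t wire)
{
	uint8_t b[2];

	memcpy(b, &wire, sizeof b);
	return (uint16_t)((b[0] << 8) | b[1]);
}

/* Nonzero once a wire-order counter reaches 3/4 of its 16-bit range. */
static int near_overflow_16(uint16_t wire)
{
	return wire_to_host16(wire) >= (UINT16_MAX - (UINT16_MAX / 4));
}

/* New events between two wire-order readings; 0 if the counter went back. */
static unsigned delta_16(uint16_t cur_wire, uint16_t prev_wire)
{
	uint16_t cur = wire_to_host16(cur_wire);
	uint16_t prev = wire_to_host16(prev_wire);

	return cur > prev ? (unsigned)(cur - prev) : 0;
}

/* Small self-check exercising the helpers above. */
void counter_sketch_check(void)
{
	assert(!near_overflow_16(host_to_wire16(49151)));
	assert(near_overflow_16(host_to_wire16(49152)));
	/* 0x0100 vs 0x00ff: comparing the raw wire values on a
	 * little-endian host would order these the wrong way round. */
	assert(delta_16(host_to_wire16(0x0100), host_to_wire16(0x00ff)) == 1);
}
```

Converting before comparing matters on little-endian hosts, where the
raw wire value of 0x0100 is numerically smaller than that of 0x00ff.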


From arthur.jones at qlogic.com  Tue May  8 19:19:04 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 8 May 2007 19:19:04 -0700
Subject: [ofa-general] [PATCH] IB/ipath -- shadow the gpio_mask register
In-Reply-To: <adawszi4xtg.fsf@cisco.com>
References: <20070508202557.27647.47035.stgit@bauxite.internal.keyresearch.com>
	<adawszi4xtg.fsf@cisco.com>
Message-ID: <20070509021904.GA16964@bauxite.pathscale.com>

hi roland, ...

On Tue, May 08, 2007 at 05:58:51PM -0700, Roland Dreier wrote:
>  > GPIO interrupts which have the gpio_mask bits set are
>  > no longer unlikely.  remove the unlikely annotation in
>  > the interrupt handler and keep a shadow copy of the
>  > gpio_mask register.
> 
> A better changelog would be appreciated here... I can see deleting the
> unlikely() if it's no longer appropriate, but why keep a shadow copy
> of the register?  Because this is now a hotter path and you want to
> avoid the MMIO read?

exactly.  shall i add that and resend?

arthur
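
For readers following along, the shadow-register idea agreed on above
can be sketched in plain C: software keeps the last value it wrote to
the mask register, so the hot interrupt path tests the in-memory copy
instead of issuing an MMIO read. This is only an illustration under
stated assumptions; struct fake_dev and every function below are
hypothetical and are not the ipath driver's names, and a plain struct
field stands in for the real device register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device model: hw_gpio_mask stands in for the MMIO register. */
struct fake_dev {
	uint64_t hw_gpio_mask;      /* "hardware" register */
	uint64_t gpio_mask_shadow;  /* last value software wrote to it */
	unsigned mmio_reads;        /* counts simulated (slow) register reads */
};

/* Simulated MMIO accessors. */
static uint64_t mmio_read_mask(struct fake_dev *d)
{
	d->mmio_reads++;
	return d->hw_gpio_mask;
}

static void mmio_write_mask(struct fake_dev *d, uint64_t val)
{
	d->hw_gpio_mask = val;
}

/* Slow path: update the shadow first, then push it to the register. */
static void set_gpio_mask_bits(struct fake_dev *d, uint64_t bits)
{
	d->gpio_mask_shadow |= bits;
	mmio_write_mask(d, d->gpio_mask_shadow);
}

/* Hot path: decide from the shadow copy, no register read needed. */
static int gpio_irq_wanted(const struct fake_dev *d, uint64_t status)
{
	return (status & d->gpio_mask_shadow) != 0;
}

/* Debug helper: costs one register read, so keep it off the hot path. */
static int shadow_in_sync(struct fake_dev *d)
{
	return mmio_read_mask(d) == d->gpio_mask_shadow;
}
```

The trade-off is that every write to the register has to go through the
helper that also updates the shadow, otherwise the two can drift apart.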


From ogerlitz at voltaire.com  Tue May  8 22:37:02 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 09 May 2007 08:37:02 +0300
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <463FCA42.3000104@indiana.edu>
References: <1177791386.4615.8.camel@stevo-laptop>	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>	<1178575761.30571.175.camel@stevo-desktop>	<95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com>
	<463FCA42.3000104@indiana.edu>
Message-ID: <46415DFE.9030807@voltaire.com>

Andrew Friedley wrote:
> Jeff Squyres wrote:
>>>> FWIW, yes, adding RDMA CM support has actually been on my to-do list
>>>> for a while, but it keeps getting bumped by higher priority items.
>>>> It would be *much* better if some iWARP companies got involved in
>>>> Open MPI...

> Hmm I'm interested.  I've already done some work switching over to RDMA 
> CM for some research stuff I've been doing; it's not publicly accessible 
> w/o the 3rd party agreement.  I can help answer questions on what 
> exactly needs to change, and do some testing.

Zooming out a bit from the "how to make ofed's udapl work for ompi"
thread, my thinking is that enabling the ompi udapl btl is actually only
the first step, and that for production/long-term use you want to have
an rdmacm btl. There are many arguments for this; the quickest ones I
can make are:

A) it seems that ompi would want to use not only RC but also UD
multicast and unicast, which are not covered by udapl

B) there's actually no real justification for maintaining two APIs
(namely udapl vs libibverbs/librdmacm), so down the road only one of
them will survive (udapl is implemented ***over*** libibverbs/librdmacm,
so if the latter dies, so does udapl). Specifically, I hear here and
there that the OFED stack is now on its way to being deployed all over
the place, in particular in commercial Unix OSs (which want modern code
that supports IPoIB-CM, RDS, SRP, iSER, etc., you name it), so
eventually the rdmacm btl can also be used over Solaris et al.

Or.


From yosefe at voltaire.com  Tue May  8 23:59:26 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 09 May 2007 09:59:26 +0300
Subject: [ofa-general] Re: [PATCHv3 1/2] core: uncached "find gid" and "find
	pkey" queries
In-Reply-To: <20070508203318.GH10845@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408336.8080908@voltaire.com>	<20070508150907.GU21591@mellanox.co.il>	<46409504.9000802@voltaire.com>	<20070508152650.GA5845@mellanox.co.il>	<4640A911.8000609@voltaire.com>
	<20070508203318.GH10845@mellanox.co.il>
Message-ID: <4641714E.6050806@voltaire.com>

Michael S. Tsirkin wrote:
> 
> <plug>
> Have you read the boring list of rules?
> http://git.openfabrics.org/~mst/boring.txt
> </plug>
> 
Thanks for the pointer.


core: uncached "find gid" and "find pkey" queries

* Add ib_find_gid and ib_find_pkey over uncached device queries.
  The calls might block but the returns are always up-to-date.
* Cache pkey and gid table lengths in core to avoid port info queries.


Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/device.c |  139 +++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h          |   25 +++++++
 2 files changed, 164 insertions(+)

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-08 15:46:36.000000000 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-09 09:54:54.242486631 +0300
@@ -149,6 +149,18 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
+}
+
 /**
  * ib_alloc_device - allocate an IB device struct
  * @size:size of structure to allocate
@@ -208,6 +220,56 @@ static int add_client_context(struct ib_
 	return 0;
 }
 
+/* read the lengths of pkey,gid tables on each port */
+static inline int read_port_table_lengths(struct ib_device *device)
+{
+	struct ib_port_attr *tprops = NULL;
+	int num_ports, ret = -ENOMEM;
+	u8 port_index;
+
+	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
+	if (!tprops)
+		goto out;
+
+	num_ports = end_port(device) - start_port(device) + 1;
+
+	device->pkey_tbl_len = kmalloc(num_ports *
+				sizeof *device->pkey_tbl_len, GFP_KERNEL);
+	if (!device->pkey_tbl_len)
+		goto out;
+
+	device->gid_tbl_len = kmalloc(num_ports *
+				sizeof *device->gid_tbl_len, GFP_KERNEL);
+	if (!device->gid_tbl_len)
+		goto err1;
+
+	for (port_index = 0; port_index < num_ports; ++port_index) {
+		ret = ib_query_port(device, port_index + start_port(device),
+				tprops);
+		if (ret)
+			goto err2;
+
+		device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len;
+		device->gid_tbl_len[port_index] = tprops->gid_tbl_len;
+	}
+
+	ret = 0;
+	goto out;
+err2:
+	kfree(device->gid_tbl_len);
+err1:
+	kfree(device->pkey_tbl_len);
+out:
+	kfree(tprops);
+	return ret;
+}
+
+static inline void free_port_table_lengths(struct ib_device *device)
+{
+	kfree(device->gid_tbl_len);
+	kfree(device->pkey_tbl_len);
+}
+
 /**
  * ib_register_device - Register an IB device with IB core
  * @device:Device to register
@@ -239,6 +301,13 @@ int ib_register_device(struct ib_device 
 	spin_lock_init(&device->event_handler_lock);
 	spin_lock_init(&device->client_data_lock);
 
+	ret = read_port_table_lengths(device);
+	if (ret) {
+		printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n",
+		       device->name);
+		goto out;
+	}
+
 	ret = ib_device_register_sysfs(device);
 	if (ret) {
 		printk(KERN_WARNING "Couldn't register device %s with driver model\n",
@@ -284,6 +353,8 @@ void ib_unregister_device(struct ib_devi
 
 	list_del(&device->core_list);
 
+	free_port_table_lengths(device);
+
 	mutex_unlock(&device_mutex);
 
 	spin_lock_irqsave(&device->client_data_lock, flags);
@@ -592,6 +663,74 @@ int ib_modify_port(struct ib_device *dev
 }
 EXPORT_SYMBOL(ib_modify_port);
 
+/**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter must not be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		       u8 *port_num, u16 *index)
+{
+	union ib_gid tmp_gid;
+	int ret, port, i, tbl_len;
+
+	for (port = start_port(device); port <= end_port(device); ++port) {
+		tbl_len = device->gid_tbl_len[port - start_port(device)];
+		for (i = 0; i < tbl_len; ++i) {
+			ret = ib_query_gid(device, port, i, &tmp_gid);
+			if (ret)
+				goto out;
+			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
+				*port_num = port;
+				*index = i;
+				ret = 0;
+				goto out;
+			}
+		}
+	}
+	ret = -ENOENT;
+out:
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_gid);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index)
+{
+	int ret, i, tbl_len;
+	u16 tmp_pkey;
+
+	tbl_len = device->pkey_tbl_len[port_num - start_port(device)];
+	for (i = 0; i < tbl_len; ++i) {
+		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
+		if (ret)
+			goto out;
+
+		if (pkey == tmp_pkey) {
+			*index = i;
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = -ENOENT;
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_pkey);
+
 static int __init ib_core_init(void)
 {
 	int ret;
Index: b/include/rdma/ib_verbs.h
===================================================================
--- a/include/rdma/ib_verbs.h	2007-05-08 15:45:45.000000000 +0300
+++ b/include/rdma/ib_verbs.h	2007-05-08 18:48:23.000000000 +0300
@@ -1058,6 +1058,8 @@ struct ib_device {
 	__be64			     node_guid;
 	u8                           node_type;
 	u8                           phys_port_cnt;
+	int                          *pkey_tbl_len;
+	int                          *gid_tbl_len;
 };
 
 struct ib_client {
@@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev
 		   struct ib_port_modify *port_modify);
 
 /**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		       u8 *port_num, u16 *index);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index);
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *


From yosefe at voltaire.com  Wed May  9 00:00:05 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 09 May 2007 10:00:05 +0300
Subject: [ofa-general] Re: [PATCHv3 1/2] ipoib: handle pkey change events
In-Reply-To: <20070508202836.GG10845@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408360.3040006@voltaire.com>	<20070508162727.GD5845@mellanox.co.il>	<4640A8BD.4000405@voltaire.com>
	<20070508202836.GG10845@mellanox.co.il>
Message-ID: <46417175.1060505@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>Subject: [PATCHv3 1/2] ipoib: handle pkey change events
> 
> 
> This should have been 1 of 2, is that right?

Yes, it should have been 2/2.


From mst at dev.mellanox.co.il  Wed May  9 00:07:59 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 10:07:59 +0300
Subject: [ofa-general] Re: [PATCHv3 1/2] core: uncached "find gid" and "find
	pkey" queries
In-Reply-To: <4641714E.6050806@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com>
	<20070508150907.GU21591@mellanox.co.il>
	<46409504.9000802@voltaire.com>
	<20070508152650.GA5845@mellanox.co.il>
	<4640A911.8000609@voltaire.com>
	<20070508203318.GH10845@mellanox.co.il>
	<4641714E.6050806@voltaire.com>
Message-ID: <20070509070759.GA18513@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries
> 
> Michael S. Tsirkin wrote:
> > 
> > <plug>
> > Have you read the boring list of rules?
> > http://git.openfabrics.org/~mst/boring.txt
> > </plug>
> > 
> Thanks for the pointer.

This still violates rule 4c in the above (chapter 2 in CodingStyle).

-- 
MST


From eli at mellanox.co.il  Wed May  9 00:19:53 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 09 May 2007 10:19:53 +0300
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts
In-Reply-To: <ada1whq6cgb.fsf@cisco.com>
References: <1178551555.17477.0.camel@mtls03> <adar6ps60zn.fsf@cisco.com>
	<1178606876.17477.15.camel@mtls03>  <ada1whq6cgb.fsf@cisco.com>
Message-ID: <1178695223.24989.42.camel@mtls03>

On Tue, 2007-05-08 at 17:57 -0700, Roland Dreier wrote:
> > > @@ -249,8 +249,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
>  > >  		}
>  > >  	}
>  > >  
>  > > -	if (eqes_found)
>  > > -		eq_set_ci(eq, 1);
>  > > +	eq_set_ci(eq, 1);
>  > >  
>  > >  	return eqes_found;
>  > >  }
> 
>  > This will not ensure arming all EQs for all interrupts and we will face
>  > the same problem of losing interrupts.
> 
> I don't understand what you mean here.  How is unconditionally arming
> the EQ at the end of mlx4_eq_int() any different from your proposed
> patch?  My change calls eq_set_ci() at the end of every call to
> mlx4_eq_int(), and your change calls eq_set_ci() after every call to
> mlx4_eq_int().  I'm probably missing something obvious, but I really
> don't see it right now.
> 

The difference between what I propose and what you propose is that my
version unconditionally arms ALL EQs regardless of whether we find any
EQEs in them while you arm only the EQs in which you find EQEs. The
justification for doing this comes from the following scenario. Suppose
we have two EQs, 0 and 1:

1. An event is generated on EQ1.
2. EQ1 posts an EQE.
3. A set interrupt message is sent. Very soon after that ...
4. An event is generated on EQ0.
5. EQ0 posts an EQE.
6. The interrupt handler is called and does:
        a. clear interrupt
        b. poll EQ0 but there is nothing there since the EQE is not yet
           in memory.
        c. poll EQ1, find an EQE, arm EQ1

Now we have an unconsumed EQE in EQ0 but it is not armed.

Remember that the same is true for Arbel but there we arm all the EQs in
a single write to the device.
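
The scenario above can be replayed in a small user-space model. This is
only a sketch: struct model_eq and the model_* helpers are hypothetical
names, not mlx4 driver code, and the rule that posting an EQE disarms the
EQ is a simplifying assumption made here for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one event queue (not the driver's struct mlx4_eq). */
struct model_eq {
	int in_flight;	/* EQEs posted by the device, not yet in host memory */
	int in_memory;	/* EQEs the interrupt handler can already see */
	bool armed;	/* true if the EQ may raise a new interrupt */
};

/* Device side: posting an EQE fires the EQ, which leaves it disarmed. */
static void model_post_event(struct model_eq *eq)
{
	eq->in_flight++;
	eq->armed = false;
}

/* Device side: the posted EQEs become visible in host memory. */
static void model_land(struct model_eq *eq)
{
	eq->in_memory += eq->in_flight;
	eq->in_flight = 0;
}

/* Host side: consume the visible EQEs, then arm according to the policy. */
static int model_eq_int(struct model_eq *eq, bool arm_always)
{
	int found;

	assert(eq);
	found = eq->in_memory;
	eq->in_memory = 0;
	if (found || arm_always)
		eq->armed = true;
	return found;
}

/* An EQ is stuck if it holds an unconsumed EQE but cannot interrupt. */
static bool model_eq_stuck(const struct model_eq *eq)
{
	return !eq->armed && (eq->in_flight + eq->in_memory) > 0;
}

/* Replay the two-EQ race; returns true if EQ0 ends up stuck. */
static bool model_scenario_loses_event(bool arm_always)
{
	struct model_eq eq0 = { 0, 0, true };
	struct model_eq eq1 = { 0, 0, true };

	model_post_event(&eq1);
	model_land(&eq1);		/* EQ1's EQE is in memory */
	model_post_event(&eq0);		/* EQ0's EQE is still in flight */

	model_eq_int(&eq0, arm_always);	/* handler polls EQ0: nothing yet */
	model_eq_int(&eq1, arm_always);	/* handler polls EQ1: finds an EQE */

	model_land(&eq0);		/* the late EQE reaches memory */
	return model_eq_stuck(&eq0);
}
```

In this model, model_scenario_loses_event(false) leaves EQ0 with an
unconsumed EQE and no way to interrupt, while
model_scenario_loses_event(true) does not.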


From boris at lfbs.RWTH-Aachen.DE  Wed May  9 00:24:56 2007
From: boris at lfbs.RWTH-Aachen.DE (Boris Bierbaum)
Date: Wed, 09 May 2007 09:24:56 +0200
Subject: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work
In-Reply-To: <1178655353.11455.14.camel@stevo-desktop>
References: <462E13A6.3030207@lfbs.rwth-aachen.de> <462E1DFE.5010703@Sun.COM>
	<46305D0A.5020900@lfbs.rwth-aachen.de> <4630EFDE.8070608@Sun.COM>
	<464044D4.5010501@lfbs.rwth-aachen.de>
	<054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com>
	<4640CACE.8070201@ichips.intel.com>
	<1178655353.11455.14.camel@stevo-desktop>
Message-ID: <46417748.4020602@lfbs.rwth-aachen.de>

It has been explained in a different thread on [ofa-general] that the
problem lies in a combination of the OpenIB-cma provider not setting the
local and remote port numbers on endpoints correctly and Open MPI
stepping over the IA to save the port number to circumvent this problem,
thereby confusing the provider.

I commented out line 197 in ompi/mca/btl/udapl/btl_udapl.c (Open MPI
1.2.1 release) and this fixes the problem. As the problem in the
provider is currently being fixed, the whole saving of the port number
in the uDAPL BTL code will be unnecessary in the future.

Steve Wise wrote:
>>> Can the UDAPL OFED wizards shed any light on the error messages that  
>>> are listed below?  In particular, these seem to be worrisome:
>>>
>>>>  setup_listener Permission denied
>>>  setup_listener Address already in use
>> These failures are from rdma_cm_bind indicating the port is already 
>> bound to this IA address. How are you creating the service point?
>> dat_psp_create or dat_psp_create_any? If it is psp_create_any then you 
>> will see some failures until it  gets to a free port. That is normal. 
>> Just make sure your create call returns DAT_SUCCESS.
>>
> 
> Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down
> and let the rdma-cma pick an available port number?
> 
> 
> 


-- 
|  _  RWTH | Boris Bierbaum
|_|_`_     | Lehrstuhl fuer Betriebssysteme
   | |_) _  | RWTH Aachen D-52056 Aachen
     |_)(_` | Tel: +49-241-80-27805
        ._) | Fax: +49-241-80-22339


From yosefe at voltaire.com  Wed May  9 01:11:38 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 09 May 2007 11:11:38 +0300
Subject: [ofa-general] Re: [PATCHv3 1/2] core: uncached "find gid" and "find
	pkey" queries
In-Reply-To: <20070509070759.GA18513@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408336.8080908@voltaire.com>	<20070508150907.GU21591@mellanox.co.il>	<46409504.9000802@voltaire.com>	<20070508152650.GA5845@mellanox.co.il>	<4640A911.8000609@voltaire.com>	<20070508203318.GH10845@mellanox.co.il>	<4641714E.6050806@voltaire.com>
	<20070509070759.GA18513@mellanox.co.il>
Message-ID: <4641823A.8040100@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>Subject: Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries
>>
>>Michael S. Tsirkin wrote:
>>
>>><plug>
>>>Have you read the boring list of rules?
>>>http://git.openfabrics.org/~mst/boring.txt
>>></plug>
>>>
>>Thanks for the pointer.
> 
> 
> This still violates rule 4c in the above (chapter 2 in CodingStyle).
> 
Isn't chapter 2 about placing braces?


core: uncached "find gid" and "find pkey" queries

* Add ib_find_gid and ib_find_pkey over uncached device queries.
  The calls might block, but the returned values are always up to date.
* Cache pkey and gid table lengths in core to avoid port info queries.


Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/device.c |  138 +++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h          |   25 +++++++
 2 files changed, 163 insertions(+)

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-08 15:46:36.000000000 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-09 11:08:29.913598989 +0300
@@ -149,6 +149,18 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
+}
+
 /**
  * ib_alloc_device - allocate an IB device struct
  * @size:size of structure to allocate
@@ -208,6 +220,55 @@ static int add_client_context(struct ib_
 	return 0;
 }
 
+/* read the lengths of pkey,gid tables on each port */
+static inline int read_port_table_lengths(struct ib_device *device)
+{
+	struct ib_port_attr *tprops = NULL;
+	int num_ports, ret = -ENOMEM;
+	u8 port_index;
+
+	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
+	if (!tprops)
+		goto out;
+
+	num_ports = end_port(device) - start_port(device) + 1;
+
+	device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len *
+						num_ports, GFP_KERNEL);
+	if (!device->pkey_tbl_len)
+		goto out;
+
+	device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len *
+						num_ports, GFP_KERNEL);
+	if (!device->gid_tbl_len)
+		goto err1;
+
+	for (port_index = 0; port_index < num_ports; ++port_index) {
+		ret = ib_query_port(device, port_index + start_port(device),
+					tprops);
+		if (ret)
+			goto err2;
+		device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len;
+		device->gid_tbl_len[port_index] = tprops->gid_tbl_len;
+	}
+
+	ret = 0;
+	goto out;
+err2:
+	kfree(device->gid_tbl_len);
+err1:
+	kfree(device->pkey_tbl_len);
+out:
+	kfree(tprops);
+	return ret;
+}
+
+static inline void free_port_table_lengths(struct ib_device *device)
+{
+	kfree(device->gid_tbl_len);
+	kfree(device->pkey_tbl_len);
+}
+
 /**
  * ib_register_device - Register an IB device with IB core
  * @device:Device to register
@@ -239,6 +300,13 @@ int ib_register_device(struct ib_device 
 	spin_lock_init(&device->event_handler_lock);
 	spin_lock_init(&device->client_data_lock);
 
+	ret = read_port_table_lengths(device);
+	if (ret) {
+		printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n",
+		       device->name);
+		goto out;
+	}
+
 	ret = ib_device_register_sysfs(device);
 	if (ret) {
 		printk(KERN_WARNING "Couldn't register device %s with driver model\n",
@@ -284,6 +352,8 @@ void ib_unregister_device(struct ib_devi
 
 	list_del(&device->core_list);
 
+	free_port_table_lengths(device);
+
 	mutex_unlock(&device_mutex);
 
 	spin_lock_irqsave(&device->client_data_lock, flags);
@@ -592,6 +662,74 @@ int ib_modify_port(struct ib_device *dev
 }
 EXPORT_SYMBOL(ib_modify_port);
 
+/**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		       u8 *port_num, u16 *index)
+{
+	union ib_gid tmp_gid;
+	int ret, port, i, tbl_len;
+
+	for (port = start_port(device); port <= end_port(device); ++port) {
+		tbl_len = device->gid_tbl_len[port - start_port(device)];
+		for (i = 0; i < tbl_len; ++i) {
+			ret = ib_query_gid(device, port, i, &tmp_gid);
+			if (ret)
+				goto out;
+			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
+				*port_num = port;
+				*index = i;
+				ret = 0;
+				goto out;
+			}
+		}
+	}
+	ret = -ENOENT;
+out:
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_gid);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index)
+{
+	int ret, i, tbl_len;
+	u16 tmp_pkey;
+
+	tbl_len = device->pkey_tbl_len[port_num - start_port(device)];
+	for (i = 0; i < tbl_len; ++i) {
+		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
+		if (ret)
+			goto out;
+
+		if (pkey == tmp_pkey) {
+			*index = i;
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = -ENOENT;
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_pkey);
+
 static int __init ib_core_init(void)
 {
 	int ret;
Index: b/include/rdma/ib_verbs.h
===================================================================
--- a/include/rdma/ib_verbs.h	2007-05-08 15:45:45.000000000 +0300
+++ b/include/rdma/ib_verbs.h	2007-05-08 18:48:23.000000000 +0300
@@ -1058,6 +1058,8 @@ struct ib_device {
 	__be64			     node_guid;
 	u8                           node_type;
 	u8                           phys_port_cnt;
+	int                          *pkey_tbl_len;
+	int                          *gid_tbl_len;
 };
 
 struct ib_client {
@@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev
 		   struct ib_port_modify *port_modify);
 
 /**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		       u8 *port_num, u16 *index);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index);
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *
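
The per-port length arrays in this patch are indexed by
port - start_port(device), because a switch exposes only port 0 while
other node types use ports 1..phys_port_cnt. A minimal user-space sketch
of that mapping follows. The model_* names are hypothetical, and the
range check is an addition for illustration, since ib_find_pkey as posted
uses port_num without validating it.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the device's node type. */
enum model_node_type { MODEL_NODE_CA, MODEL_NODE_SWITCH };

/* First valid port number: 0 on a switch, 1 otherwise. */
static int model_start_port(enum model_node_type type)
{
	return type == MODEL_NODE_SWITCH ? 0 : 1;
}

/* Last valid port number: 0 on a switch, phys_port_cnt otherwise. */
static int model_end_port(enum model_node_type type, int phys_port_cnt)
{
	return type == MODEL_NODE_SWITCH ? 0 : phys_port_cnt;
}

/* Number of slots needed in each per-port length array. */
static int model_num_ports(enum model_node_type type, int phys_port_cnt)
{
	return model_end_port(type, phys_port_cnt) -
	       model_start_port(type) + 1;
}

/* Map a port number to its array slot, or -EINVAL if out of range. */
static int model_port_to_slot(enum model_node_type type, int phys_port_cnt,
			      int port)
{
	assert(phys_port_cnt >= 0);
	if (port < model_start_port(type) ||
	    port > model_end_port(type, phys_port_cnt))
		return -EINVAL;
	return port - model_start_port(type);
}
```

For a two-port CA this gives slots 0 and 1 for ports 1 and 2, and rejects
ports 0 and 3; for a switch only port 0 maps to a slot.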


From mst at dev.mellanox.co.il  Wed May  9 01:43:13 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 11:43:13 +0300
Subject: [ofa-general] Re: [PATCHv3 1/2] core: uncached "find gid" and "find
	pkey" queries
In-Reply-To: <4641823A.8040100@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com>
	<20070508150907.GU21591@mellanox.co.il>
	<46409504.9000802@voltaire.com>
	<20070508152650.GA5845@mellanox.co.il>
	<4640A911.8000609@voltaire.com>
	<20070508203318.GH10845@mellanox.co.il>
	<4641714E.6050806@voltaire.com>
	<20070509070759.GA18513@mellanox.co.il>
	<4641823A.8040100@voltaire.com>
Message-ID: <20070509084312.GA6974@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries
> 
> Michael S. Tsirkin wrote:
> >>Quoting Yosef Etigin <yosefe at voltaire.com>:
> >>Subject: Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries
> >>
> >>Michael S. Tsirkin wrote:
> >>
> >>><plug>
> >>>Have you read the boring list of rules?
> >>>http://git.openfabrics.org/~mst/boring.txt
> >>></plug>
> >>>
> >>Thanks for the pointer.
> > 
> > 
> > This still violates rule 4c in the above (chapter 2 in CodingStyle).
> > 
> Isn't chapter 2 about placing braces?

Yes, I see you've fixed this. Some last pedantic nits:

> core: uncached "find gid" and "find pkey" queries
> 
> * Add ib_find_gid and ib_find_pkey over uncached device queries.
>   The calls might block, but the returned values are always up to date.
> * Cache pkey and gid table lengths in core to avoid port info queries.
> 
> 
> Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
> ---
>  drivers/infiniband/core/device.c |  138 +++++++++++++++++++++++++++++++++++++++
>  include/rdma/ib_verbs.h          |   25 +++++++
>  2 files changed, 163 insertions(+)
> 
> Index: b/drivers/infiniband/core/device.c
> ===================================================================
> --- a/drivers/infiniband/core/device.c	2007-05-08 15:46:36.000000000 +0300
> +++ b/drivers/infiniband/core/device.c	2007-05-09 11:08:29.913598989 +0300
> @@ -149,6 +149,18 @@ static int alloc_name(char *name)
>  	return 0;
>  }
>  
> +static inline int start_port(struct ib_device *device)
> +{
> +	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
> +}
> +
> +
> +static inline int end_port(struct ib_device *device)
> +{
> +	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
> +		0 : device->phys_port_cnt;
> +}
> +
>  /**
>   * ib_alloc_device - allocate an IB device struct
>   * @size:size of structure to allocate
> @@ -208,6 +220,55 @@ static int add_client_context(struct ib_
>  	return 0;
>  }
>  
> +/* read the lengths of pkey,gid tables on each port */
> +static inline int read_port_table_lengths(struct ib_device *device)

This function is too big to be inline.

> +{
> +	struct ib_port_attr *tprops = NULL;
> +	int num_ports, ret = -ENOMEM;
> +	u8 port_index;
> +
> +	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
> +	if (!tprops)
> +		goto out;
> +
> +	num_ports = end_port(device) - start_port(device) + 1;
> +
> +	device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len *
> +						num_ports, GFP_KERNEL);
> +	if (!device->pkey_tbl_len)
> +		goto out;
> +
> +	device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len *
> +						num_ports, GFP_KERNEL);
> +	if (!device->gid_tbl_len)
> +		goto err1;
> +
> +	for (port_index = 0; port_index < num_ports; ++port_index) {
> +		ret = ib_query_port(device, port_index + start_port(device),
> +					tprops);
> +		if (ret)
> +			goto err2;
> +		device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len;
> +		device->gid_tbl_len[port_index] = tprops->gid_tbl_len;
> +	}
> +
> +	ret = 0;
> +	goto out;
> +err2:
> +	kfree(device->gid_tbl_len);
> +err1:
> +	kfree(device->pkey_tbl_len);
> +out:
> +	kfree(tprops);
> +	return ret;
> +}
> +
> +static inline void free_port_table_lengths(struct ib_device *device)
> +{
> +	kfree(device->gid_tbl_len);
> +	kfree(device->pkey_tbl_len);
> +}
> +
>  /**
>   * ib_register_device - Register an IB device with IB core
>   * @device:Device to register
> @@ -239,6 +300,13 @@ int ib_register_device(struct ib_device 
>  	spin_lock_init(&device->event_handler_lock);
>  	spin_lock_init(&device->client_data_lock);
>  
> +	ret = read_port_table_lengths(device);
> +	if (ret) {
> +		printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n",
> +		       device->name);
> +		goto out;
> +	}
> +
>  	ret = ib_device_register_sysfs(device);
>  	if (ret) {
>  		printk(KERN_WARNING "Couldn't register device %s with driver model\n",
> @@ -284,6 +352,8 @@ void ib_unregister_device(struct ib_devi
>  
>  	list_del(&device->core_list);
>  
> +	free_port_table_lengths(device);
> +
>  	mutex_unlock(&device_mutex);
>  
>  	spin_lock_irqsave(&device->client_data_lock, flags);
> @@ -592,6 +662,74 @@ int ib_modify_port(struct ib_device *dev
>  }
>  EXPORT_SYMBOL(ib_modify_port);
>  
> +/**
> + * ib_find_gid - Returns the port number and GID table index where
> + *   a specified GID value occurs.
> + * @device: The device to query.
> + * @gid: The GID value to search for.
> + * @port_num: The port number of the device where the GID value was found.
> + * @index: The index into the GID table where the GID was found.  This
> + *   parameter may be NULL.
> + */
> +int ib_find_gid(struct ib_device *device, union ib_gid *gid,
> +		       u8 *port_num, u16 *index)

Either indent with tabs only here, or use spaces to align continuation at (.

> +{
> +	union ib_gid tmp_gid;
> +	int ret, port, i, tbl_len;
> +
> +	for (port = start_port(device); port <= end_port(device); ++port) {
> +		tbl_len = device->gid_tbl_len[port - start_port(device)];
> +		for (i = 0; i < tbl_len; ++i) {
> +			ret = ib_query_gid(device, port, i, &tmp_gid);
> +			if (ret)
> +				goto out;
> +			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
> +				*port_num = port;
> +				*index = i;
> +				ret = 0;
> +				goto out;
> +			}
> +		}
> +	}
> +	ret = -ENOENT;
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL(ib_find_gid);
> +
> +/**
> + * ib_find_pkey - Returns the PKey table index where a specified
> + *   PKey value occurs.
> + * @device: The device to query.
> + * @port_num: The port number of the device to search for the PKey.
> + * @pkey: The PKey value to search for.
> + * @index: The index into the PKey table where the PKey was found.
> + */
> +int ib_find_pkey(struct ib_device *device,
> +			u8 port_num, u16 pkey, u16 *index)
> +{
> +	int ret, i, tbl_len;
> +	u16 tmp_pkey;
> +
> +	tbl_len = device->pkey_tbl_len[port_num - start_port(device)];
> +	for (i = 0; i < tbl_len; ++i) {
> +		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
> +		if (ret)
> +			goto out;
> +
> +		if (pkey == tmp_pkey) {
> +			*index = i;
> +			ret = 0;
> +			goto out;
> +		}
> +	}
> +	ret = -ENOENT;
> +
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL(ib_find_pkey);
> +
>  static int __init ib_core_init(void)
>  {
>  	int ret;
> Index: b/include/rdma/ib_verbs.h
> ===================================================================
> --- a/include/rdma/ib_verbs.h	2007-05-08 15:45:45.000000000 +0300
> +++ b/include/rdma/ib_verbs.h	2007-05-08 18:48:23.000000000 +0300
> @@ -1058,6 +1058,8 @@ struct ib_device {
>  	__be64			     node_guid;
>  	u8                           node_type;
>  	u8                           phys_port_cnt;
> +	int                          *pkey_tbl_len;
> +	int                          *gid_tbl_len;
>  };
>  
>  struct ib_client {
> @@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev
>  		   struct ib_port_modify *port_modify);
>  
>  /**
> + * ib_find_gid - Returns the port number and GID table index where
> + *   a specified GID value occurs.
> + * @device: The device to query.
> + * @gid: The GID value to search for.
> + * @port_num: The port number of the device where the GID value was found.
> + * @index: The index into the GID table where the GID was found.  This
> + *   parameter may be NULL.
> + */
> +int ib_find_gid(struct ib_device *device, union ib_gid *gid,
> +		       u8 *port_num, u16 *index);

And here, too.

> +
> +/**
> + * ib_find_pkey - Returns the PKey table index where a specified
> + *   PKey value occurs.
> + * @device: The device to query.
> + * @port_num: The port number of the device to search for the PKey.
> + * @pkey: The PKey value to search for.
> + * @index: The index into the PKey table where the PKey was found.
> + */
> +int ib_find_pkey(struct ib_device *device,
> +			u8 port_num, u16 pkey, u16 *index);
> +
> +/**
>   * ib_alloc_pd - Allocates an unused protection domain.
>   * @device: The device on which to allocate the protection domain.
>   *

-- 
MST


From yosefe at voltaire.com  Wed May  9 01:52:15 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 09 May 2007 11:52:15 +0300
Subject: [ofa-general] Re: [PATCHv3 1/2] core: uncached "find gid" and "find
	pkey" queries
In-Reply-To: <20070509084312.GA6974@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408336.8080908@voltaire.com>	<20070508150907.GU21591@mellanox.co.il>	<46409504.9000802@voltaire.com>	<20070508152650.GA5845@mellanox.co.il>	<4640A911.8000609@voltaire.com>	<20070508203318.GH10845@mellanox.co.il>	<4641714E.6050806@voltaire.com>	<20070509070759.GA18513@mellanox.co.il>	<4641823A.8040100@voltaire.com>
	<20070509084312.GA6974@mellanox.co.il>
Message-ID: <46418BBF.10801@voltaire.com>


* Add ib_find_gid and ib_find_pkey over uncached device queries.
  The calls might block, but the returned values are always up to date.
* Cache pkey and gid table lengths in core to avoid port info queries.


Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/device.c |  138 +++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h          |   25 +++++++
 2 files changed, 163 insertions(+)

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-08 15:46:36.000000000 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-09 11:47:22.096064221 +0300
@@ -149,6 +149,18 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
+}
+
 /**
  * ib_alloc_device - allocate an IB device struct
  * @size:size of structure to allocate
@@ -208,6 +220,55 @@ static int add_client_context(struct ib_
 	return 0;
 }
 
+/* read the lengths of pkey,gid tables on each port */
+static int read_port_table_lengths(struct ib_device *device)
+{
+	struct ib_port_attr *tprops = NULL;
+	int num_ports, ret = -ENOMEM;
+	u8 port_index;
+
+	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
+	if (!tprops)
+		goto out;
+
+	num_ports = end_port(device) - start_port(device) + 1;
+
+	device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len *
+						num_ports, GFP_KERNEL);
+	if (!device->pkey_tbl_len)
+		goto out;
+
+	device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len *
+						num_ports, GFP_KERNEL);
+	if (!device->gid_tbl_len)
+		goto err1;
+
+	for (port_index = 0; port_index < num_ports; ++port_index) {
+		ret = ib_query_port(device, port_index + start_port(device),
+					tprops);
+		if (ret)
+			goto err2;
+		device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len;
+		device->gid_tbl_len[port_index] = tprops->gid_tbl_len;
+	}
+
+	ret = 0;
+	goto out;
+err2:
+	kfree(device->gid_tbl_len);
+err1:
+	kfree(device->pkey_tbl_len);
+out:
+	kfree(tprops);
+	return ret;
+}
+
+static inline void free_port_table_lengths(struct ib_device *device)
+{
+	kfree(device->gid_tbl_len);
+	kfree(device->pkey_tbl_len);
+}
+
 /**
  * ib_register_device - Register an IB device with IB core
  * @device:Device to register
@@ -239,6 +300,13 @@ int ib_register_device(struct ib_device 
 	spin_lock_init(&device->event_handler_lock);
 	spin_lock_init(&device->client_data_lock);
 
+	ret = read_port_table_lengths(device);
+	if (ret) {
+		printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n",
+		       device->name);
+		goto out;
+	}
+
 	ret = ib_device_register_sysfs(device);
 	if (ret) {
 		printk(KERN_WARNING "Couldn't register device %s with driver model\n",
@@ -284,6 +352,8 @@ void ib_unregister_device(struct ib_devi
 
 	list_del(&device->core_list);
 
+	free_port_table_lengths(device);
+
 	mutex_unlock(&device_mutex);
 
 	spin_lock_irqsave(&device->client_data_lock, flags);
@@ -592,6 +662,74 @@ int ib_modify_port(struct ib_device *dev
 }
 EXPORT_SYMBOL(ib_modify_port);
 
+/**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+			u8 *port_num, u16 *index)
+{
+	union ib_gid tmp_gid;
+	int ret, port, i, tbl_len;
+
+	for (port = start_port(device); port <= end_port(device); ++port) {
+		tbl_len = device->gid_tbl_len[port - start_port(device)];
+		for (i = 0; i < tbl_len; ++i) {
+			ret = ib_query_gid(device, port, i, &tmp_gid);
+			if (ret)
+				goto out;
+			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
+				*port_num = port;
+				*index = i;
+				ret = 0;
+				goto out;
+			}
+		}
+	}
+	ret = -ENOENT;
+out:
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_gid);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index)
+{
+	int ret, i, tbl_len;
+	u16 tmp_pkey;
+
+	tbl_len = device->pkey_tbl_len[port_num - start_port(device)];
+	for (i = 0; i < tbl_len; ++i) {
+		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
+		if (ret)
+			goto out;
+
+		if (pkey == tmp_pkey) {
+			*index = i;
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = -ENOENT;
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_pkey);
+
 static int __init ib_core_init(void)
 {
 	int ret;
Index: b/include/rdma/ib_verbs.h
===================================================================
--- a/include/rdma/ib_verbs.h	2007-05-08 15:45:45.000000000 +0300
+++ b/include/rdma/ib_verbs.h	2007-05-09 11:47:55.006221894 +0300
@@ -1058,6 +1058,8 @@ struct ib_device {
 	__be64			     node_guid;
 	u8                           node_type;
 	u8                           phys_port_cnt;
+	int                          *pkey_tbl_len;
+	int                          *gid_tbl_len;
 };
 
 struct ib_client {
@@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev
 		   struct ib_port_modify *port_modify);
 
 /**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+			u8 *port_num, u16 *index);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index);
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *
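
The kernel-doc for ib_find_gid says @index may be NULL, while the loop in
the patch stores through it unconditionally. One possible way to
reconcile the two is to guard the store. The sketch below shows this on a
user-space model; struct model_gid, struct model_port and model_find_gid
are hypothetical stand-ins, not the kernel types or the posted function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for union ib_gid: a 16-byte raw GID. */
struct model_gid {
	unsigned char raw[16];
};

/* Hypothetical stand-in for one port's GID table. */
struct model_port {
	const struct model_gid *gids;
	int tbl_len;
};

/*
 * Linear search over every port's table.  On a match, *port_num is
 * always written; *index is written only if the caller supplied it.
 * Returns 0 on success and -ENOENT if the GID is in no table.
 */
static int model_find_gid(const struct model_port *ports, int num_ports,
			  int first_port, const struct model_gid *gid,
			  unsigned char *port_num, unsigned short *index)
{
	int p, i;

	assert(ports && gid && port_num);
	for (p = 0; p < num_ports; ++p) {
		for (i = 0; i < ports[p].tbl_len; ++i) {
			if (memcmp(&ports[p].gids[i], gid, sizeof *gid))
				continue;
			*port_num = (unsigned char)(first_port + p);
			if (index)	/* the NULL guard under discussion */
				*index = (unsigned short)i;
			return 0;
		}
	}
	return -ENOENT;
}
```

A caller that only needs the port number can then pass NULL for the last
argument, as the comment block permits.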


From eli at mellanox.co.il  Wed May  9 01:12:43 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 09 May 2007 11:12:43 +0300
Subject: [ofa-general] [PATCH] IB/core user memory registrations
Message-ID: <1178698393.26046.8.camel@mtls03>

Fix missing initialization of write_mtt_size.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: connectx_kernel/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- connectx_kernel.orig/drivers/infiniband/hw/mthca/mthca_provider.c	2007-05-08 15:48:57.000000000 +0300
+++ connectx_kernel/drivers/infiniband/hw/mthca/mthca_provider.c	2007-05-08 17:17:03.000000000 +0300
@@ -1020,7 +1020,7 @@ static struct ib_mr *mthca_reg_user_mr(s
 	int shift, n, len;
 	int i, j, k;
 	int err = 0;
-	int write_mtt_size;
+	int write_mtt_size = mthca_write_mtt_size(dev);
 
 	mr = kmalloc(sizeof *mr, GFP_KERNEL);
 	if (!mr)


From vlad at lists.openfabrics.org  Wed May  9 02:30:10 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed,  9 May 2007 02:30:10 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070509-0200 daily build status
Message-ID: <20070509093010.A59C3E60823@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.17

Failed:


From mst at dev.mellanox.co.il  Wed May  9 02:35:48 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 12:35:48 +0300
Subject: [ofa-general] Re: [PATCHv3 1/2] ipoib: handle pkey change events
In-Reply-To: <4640A8BD.4000405@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508162727.GD5845@mellanox.co.il>
	<4640A8BD.4000405@voltaire.com>
Message-ID: <20070509093548.GA7683@mellanox.co.il>

> @@ -642,6 +651,12 @@ void ipoib_ib_dev_flush(struct work_stru
>  
>  	ipoib_ib_dev_down(dev, 0);
>  
> +	if (restart_qp) {
> +		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
> +			ipoib_ib_dev_stop(dev, 0);
> +		ipoib_ib_dev_open(dev);
> +	}
> +
>  	/*
>  	 * The device could have been brought down between the start and when
>  	 * we get here, don't bring it back up if it's not configured up

This is something that still puzzles me

1. We have tested IPOIB_FLAG_INITIALIZED above already, didn't we?
   Did you observe it flipping in testing? If yes there's some race ...

2. Let's assume that device is not initialized:
   how come you are calling ipoib_ib_dev_open on it here?

-- 
MST


From yosefe at voltaire.com  Wed May  9 04:01:26 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 09 May 2007 14:01:26 +0300
Subject: [ofa-general] Re: [PATCHv3 1/2] ipoib: handle pkey change events
In-Reply-To: <20070509093548.GA7683@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408360.3040006@voltaire.com>	<20070508162727.GD5845@mellanox.co.il>	<4640A8BD.4000405@voltaire.com>
	<20070509093548.GA7683@mellanox.co.il>
Message-ID: <4641AA06.1050002@voltaire.com>

Michael S. Tsirkin wrote:
>>@@ -642,6 +651,12 @@ void ipoib_ib_dev_flush(struct work_stru
>> 
>> 	ipoib_ib_dev_down(dev, 0);
>> 
>>+	if (restart_qp) {
>>+		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
>>+			ipoib_ib_dev_stop(dev, 0);
>>+		ipoib_ib_dev_open(dev);
>>+	}
>>+
>> 	/*
>> 	 * The device could have been brought down between the start and when
>> 	 * we get here, don't bring it back up if it's not configured up
> 
> 
> This is something that still puzzles me
> 
> 1. We have tested IPOIB_FLAG_INITIALIZED above already, didn't we?
>    Did you observe it flipping in testing? If yes there's some race ...
> 
> 2. Let's assume that device is not initialized:
>    how come you are calling ipoib_ib_dev_open on it here?
> 

Option 2 is true. This test is a leftover from a previous version of the patch
and should be removed.
--


This issue was found during partitioning & SM fail over testing.

 * Add a flush flag to ipoib_ib_dev_stop(), as ipoib_ib_dev_down() already has
 * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
 * Upon a PKEY_CHANGE event, schedule a work item that restarts the QP
 * Restart child interfaces before the parent. They might be up even if the
   parent is down
 * Use an uncached pkey query upon QP initialization

SM reconfiguration or failover can shuffle the values in the port pkey
table. The current implementation queries the index of the pkey only once,
when it creates the device QP and moves it into the working state, and so
does not handle this scenario. Fix this by using the PKEY_CHANGE event as a
trigger to reconfigure the device QP.

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |    6 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   59 ++++++++++++++++++++---------
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |    7 +--
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |   11 ++---
 4 files changed, 56 insertions(+), 27 deletions(-)

Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 15:46:53.000000000 +0300
+++ infiniband/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 16:45:44.000000000 +0300
@@ -202,11 +202,12 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
+	struct delayed_work pkey_poll_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
+	struct work_struct pkey_event_task;
 
 	struct ib_device *ca;
 	u8            	  port;
@@ -333,12 +334,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 15:46:53.000000000 +0300
+++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-09 13:56:00.754030478 +0300
@@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -481,7 +481,7 @@ int ipoib_ib_dev_down(struct net_device 
 	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
 		mutex_lock(&pkey_mutex);
 		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
+		cancel_delayed_work(&priv->pkey_poll_task);
 		mutex_unlock(&pkey_mutex);
 		if (flush)
 			flush_workqueue(ipoib_workqueue);
@@ -508,7 +508,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +581,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,13 +623,21 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
+	mutex_lock(&priv->vlan_mutex);
+
+	/* Flush any child interfaces too -
+ 	 * they might be up even if the parent is down */
+ 	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, restart_qp);
+
+	mutex_unlock(&priv->vlan_mutex);
+
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
 		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
@@ -642,6 +651,11 @@ void ipoib_ib_dev_flush(struct work_stru
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (restart_qp) {
+		ipoib_ib_dev_stop(dev, 0);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +664,24 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	mutex_unlock(&priv->vlan_mutex);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_event_task);
+
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -685,7 +709,7 @@ void ipoib_ib_dev_cleanup(struct net_dev
 void ipoib_pkey_poll(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
+		container_of(work, struct ipoib_dev_priv, pkey_poll_task.work);
 	struct net_device *dev = priv->dev;
 
 	ipoib_pkey_dev_check_presence(dev);
@@ -696,7 +720,7 @@ void ipoib_pkey_poll(struct work_struct 
 		mutex_lock(&pkey_mutex);
 		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
+					   &priv->pkey_poll_task,
 					   HZ);
 		mutex_unlock(&pkey_mutex);
 	}
@@ -715,7 +739,7 @@ int ipoib_pkey_dev_delay_open(struct net
 		mutex_lock(&pkey_mutex);
 		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
 		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
+				   &priv->pkey_poll_task,
 				   HZ);
 		mutex_unlock(&pkey_mutex);
 		return 1;
Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 15:46:53.000000000 +0300
+++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 16:45:44.000000000 +0300
@@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev)
 		return -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 15:46:53.000000000 +0300
+++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 16:45:44.000000000 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
@@ -103,7 +101,7 @@ int ipoib_init_qp(struct net_device *dev
 	 * The port has to be assigned to the respective IB partition in
 	 * advance.
 	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
+	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
 	if (ret) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		return ret;
@@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler
 		container_of(handler, struct ipoib_dev_priv, event_handler);
 
 	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
 	     record->event == IB_EVENT_PORT_ACTIVE ||
 	     record->event == IB_EVENT_LID_CHANGE  ||
 	     record->event == IB_EVENT_SM_CHANGE   ||
@@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler
 	    record->element.port_num == priv->port) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
+		record->element.port_num == priv->port) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_event_task);
 	}
 }
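
As a rough illustration of the scenario in the patch description (this is
not part of the patch): the sketch below models a port pkey table as a plain
array and shows why an index resolved once can go stale after the SM
reshuffles the table. find_pkey_index() is a hypothetical userspace stand-in
for the lookup, not the kernel's ib_find_pkey(), and it compares full 16-bit
values as a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a pkey table lookup: scan the table for
 * pkey and store its position in *index.  Returns 0 on success and
 * -1 if the value is not present.
 */
static int find_pkey_index(const uint16_t *table, size_t len,
			   uint16_t pkey, uint16_t *index)
{
	size_t i;

	assert(table != NULL && index != NULL);

	for (i = 0; i < len; i++) {
		if (table[i] == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -1;
}
```

If the same pkey value moves to a different slot after an SM failover, a QP
that still carries the old index ends up using the wrong partition, which is
why the lookup has to be repeated when a PKEY_CHANGE event arrives.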


From rdreier at cisco.com  Wed May  9 04:03:55 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 09 May 2007 04:03:55 -0700
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts
In-Reply-To: <1178695223.24989.42.camel@mtls03> (Eli Cohen's message of "Wed,
	09 May 2007 10:19:53 +0300")
References: <1178551555.17477.0.camel@mtls03> <adar6ps60zn.fsf@cisco.com>
	<1178606876.17477.15.camel@mtls03> <ada1whq6cgb.fsf@cisco.com>
	<1178695223.24989.42.camel@mtls03>
Message-ID: <adafy6645t0.fsf@cisco.com>

 > > I don't understand what you mean here.  How is unconditionally arming
 > > the EQ at the end of mlx4_eq_int() any different from your proposed
 > > patch?  My change calls eq_set_ci() at the end of every call to
 > > mlx4_eq_int(), and your change calls eq_set_ci() after every call to
 > > mlx4_eq_int().  I'm probably missing something obvious, but I really
 > > don't see it right now.

 > The difference between what I propose and what you propose is that my
 > version unconditionally arms ALL EQs regardless of whether we find any
 > EQEs in them while you arm only the EQs in which you find EQEs. The
 > justification for doing this comes from the following scenario. Suppose
 > we have two EQs, 0 and 1:

I understand all that.  The question is, what's the difference between
my version (which is in my tree now), which does:

	mlx4_eq_int(...eq...)
	{
		...
		eq_set_ci(eq, 1);
	
		return eqes_found;
	}

and your version, which does

	mlx4_eq_int(eq);
        eq_set_ci(eq, 1);

for every call to mlx4_eq_int()?  Why does it matter if the
eq_set_ci() is inside mlx4_eq_int() or outside?

 - R.


From mst at dev.mellanox.co.il  Wed May  9 04:26:26 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 14:26:26 +0300
Subject: [ofa-general] Re: [PATCHv3 1/2] ipoib: handle pkey change events
In-Reply-To: <4641AA06.1050002@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508162727.GD5845@mellanox.co.il>
	<4640A8BD.4000405@voltaire.com>
	<20070509093548.GA7683@mellanox.co.il>
	<4641AA06.1050002@voltaire.com>
Message-ID: <20070509112626.GA10068@mellanox.co.il>

OK, this looks pretty good to me. I found one coding style violation:

> @@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler
>  	    record->element.port_num == priv->port) {
>  		ipoib_dbg(priv, "Port state change event\n");
>  		queue_work(ipoib_workqueue, &priv->flush_task);
> +	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
> +		record->element.port_num == priv->port) {
> +		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
> +		queue_work(ipoib_workqueue, &priv->pkey_event_task);
>  	}
>  }

This violates the "Breaking long lines" rule again. It should be:

> +	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
> +		   record->element.port_num == priv->port) {
> +		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
> +		queue_work(ipoib_workqueue, &priv->pkey_event_task);
>  	}

-- 
MST


From eli at mellanox.co.il  Wed May  9 04:28:22 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 09 May 2007 14:28:22 +0300
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts
In-Reply-To: <adafy6645t0.fsf@cisco.com>
References: <1178551555.17477.0.camel@mtls03> <adar6ps60zn.fsf@cisco.com>
	<1178606876.17477.15.camel@mtls03> <ada1whq6cgb.fsf@cisco.com>
	<1178695223.24989.42.camel@mtls03>  <adafy6645t0.fsf@cisco.com>
Message-ID: <1178710133.27749.4.camel@mtls03>

On Wed, 2007-05-09 at 04:03 -0700, Roland Dreier wrote:

> I understand all that.  The question is, what's the difference between
> my version (which is in my tree now), which does:
> 
> 	mlx4_eq_int(...eq...)
> 	{
> 		...
> 		eq_set_ci(eq, 1);
> 	
> 		return eqes_found;
> 	}
> 
> and your version, which does
> 
> 	mlx4_eq_int(eq);
>         eq_set_ci(eq, 1);
> 
> for every call to mlx4_eq_int()?  Why does it matter if the
> eq_set_ci() is inside mlx4_eq_int() or outside?
> 
>  - R.

Oh I see, you're right - your version also arms all the EQs.
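
To make the equivalence concrete, here is a minimal userspace sketch. The
struct and all mock_* names are hypothetical and only model the control
flow discussed in this thread; they are not the mlx4 code. Both variants
re-arm the EQ on every call, whether or not any EQEs were found.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an event queue. */
struct mock_eq {
	int pending;	/* EQEs waiting to be consumed */
	int armed;	/* number of times the EQ was re-armed */
};

/* Stand-in for eq_set_ci(eq, 1): update the consumer index and re-arm. */
static void mock_eq_set_ci(struct mock_eq *eq)
{
	eq->armed++;
}

/* Variant A: the handler arms the EQ itself, unconditionally, at the end. */
static int eq_int_arm_inside(struct mock_eq *eq)
{
	int eqes_found;

	assert(eq != NULL);
	eqes_found = eq->pending > 0;
	eq->pending = 0;
	mock_eq_set_ci(eq);

	return eqes_found;
}

/* Variant B, part 1: the handler only consumes EQEs. */
static int eq_int_no_arm(struct mock_eq *eq)
{
	int eqes_found;

	assert(eq != NULL);
	eqes_found = eq->pending > 0;
	eq->pending = 0;

	return eqes_found;
}

/* Variant B, part 2: the caller arms the EQ after every handler call. */
static int eq_int_arm_outside(struct mock_eq *eq)
{
	int eqes_found = eq_int_no_arm(eq);

	mock_eq_set_ci(eq);

	return eqes_found;
}
```

For an empty EQ and a non-empty EQ alike, each variant leaves the arm count
incremented by exactly one per call, so the placement of the arming call
does not change which EQs get armed.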


From mst at dev.mellanox.co.il  Wed May  9 04:28:29 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 14:28:29 +0300
Subject: [ofa-general] Re: [PATCH] IB/core user memory registrations
In-Reply-To: <1178698393.26046.8.camel@mtls03>
References: <1178698393.26046.8.camel@mtls03>
Message-ID: <20070509112829.GB10068@mellanox.co.il>

> Quoting Eli Cohen <eli at mellanox.co.il>:
> Subject: [PATCH] IB/core user memory registrations
> 
> fix missing initialization of write_mtt_size 
> 
> Signed-off-by: Eli Cohen <eli at mellanox.co.il>

This is actually IB/mthca, right? Wow, this seems to fix breakage introduced by
the latest core changes, is that right?

I'm not sure how I could have missed this - I need to go back and re-review
that patch.

-- 
MST


From fenkes at de.ibm.com  Wed May  9 04:47:56 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Wed, 9 May 2007 13:47:56 +0200
Subject: [ofa-general] [PATCH 1/6] IB/ehca: Serialize hypervisor calls in
	ehca_register_mr()
Message-ID: <200705091347.57470.fenkes@de.ibm.com>

From: Stefan Roscher <stefan.roscher at de.ibm.com>

Some pSeries hypervisor versions show a race condition in the allocate MR hCall.
Serialize this call per adapter to circumvent this problem.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_classes.h |    1 +
 drivers/infiniband/hw/ehca/ehca_main.c    |    2 ++
 drivers/infiniband/hw/ehca/hcp_if.c       |   14 ++++++++++++--
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 10fb8fb..4bc5cb3 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -276,6 +276,7 @@ void ehca_cleanup_mrmw_cache(void);
 
 extern spinlock_t ehca_qp_idr_lock;
 extern spinlock_t ehca_cq_idr_lock;
+extern spinlock_t hcall_lock;
 extern struct idr ehca_qp_idr;
 extern struct idr ehca_cq_idr;
 
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 2d37054..2e27e68 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -98,6 +98,7 @@ MODULE_PARM_DESC(scaling_code,
 
 spinlock_t ehca_qp_idr_lock;
 spinlock_t ehca_cq_idr_lock;
+spinlock_t hcall_lock;
 DEFINE_IDR(ehca_qp_idr);
 DEFINE_IDR(ehca_cq_idr);
 
@@ -817,6 +818,7 @@ int __init ehca_module_init(void)
 	idr_init(&ehca_cq_idr);
 	spin_lock_init(&ehca_qp_idr_lock);
 	spin_lock_init(&ehca_cq_idr_lock);
+	spin_lock_init(&hcall_lock);
 
 	INIT_LIST_HEAD(&shca_list);
 	spin_lock_init(&shca_list_lock);
diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index b564fcd..bb76134 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -154,7 +154,9 @@ static long ehca_plpar_hcall9(unsigned l
 			      unsigned long arg9)
 {
 	long ret;
-	int i, sleep_msecs;
+	int i, sleep_msecs, lock_is_set = 0;
+	unsigned long flags;
+
 
 	ehca_gen_dbg("opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx "
 		     "arg5=%lx arg6=%lx arg7=%lx arg8=%lx arg9=%lx",
@@ -162,10 +164,18 @@ static long ehca_plpar_hcall9(unsigned l
 		     arg8, arg9);
 
 	for (i = 0; i < 5; i++) {
+		if ((opcode == H_ALLOC_RESOURCE) && (arg2 == 5)) {
+			spin_lock_irqsave(&hcall_lock, flags);
+			lock_is_set = 1;
+		}
+
 		ret = plpar_hcall9(opcode, outs,
 				   arg1, arg2, arg3, arg4, arg5,
 				   arg6, arg7, arg8, arg9);
 
+		if (lock_is_set)
+			spin_unlock_irqrestore(&hcall_lock, flags);
+
 		if (H_IS_LONG_BUSY(ret)) {
 			sleep_msecs = get_longbusy_msecs(ret);
 			msleep_interruptible(sleep_msecs);
@@ -193,11 +203,11 @@ static long ehca_plpar_hcall9(unsigned l
 			     opcode, ret, outs[0], outs[1], outs[2], outs[3],
 			     outs[4], outs[5], outs[6], outs[7], outs[8]);
 		return ret;
-
 	}
 
 	return H_BUSY;
 }
+
 u64 hipz_h_alloc_resource_eq(const struct ipz_adapter_handle adapter_handle,
 			     struct ehca_pfeq *pfeq,
 			     const u32 neq_control,
-- 
1.4.2.1
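
For readers who want to see the locking pattern in isolation, here is a
minimal userspace sketch (not part of the patch). It uses a C11 atomic_flag
as a stand-in for the kernel spinlock; the MOCK_* constants, mock_hcall()
and serialized_hcall() are hypothetical names. The value 5 mirrors the
arg2 == 5 test in the patch, which presumably selects the MR allocation
case. Only the matching call is serialized, and the lock is dropped before
any retry or sleep would happen.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MOCK_H_ALLOC_RESOURCE	1UL	/* hypothetical opcode value */
#define MOCK_RES_TYPE_MR	5UL	/* mirrors the arg2 == 5 test */

static atomic_flag mock_hcall_lock = ATOMIC_FLAG_INIT;
static int locked_calls;	/* how many calls ran under the lock */

typedef long (*hcall_fn)(unsigned long opcode, unsigned long arg2);

/* Hypothetical hypervisor call: just returns a value derived from its args. */
static long mock_hcall(unsigned long opcode, unsigned long arg2)
{
	return (long)(opcode + arg2);
}

/* Take the lock only around the call that needs to be serialized. */
static long serialized_hcall(hcall_fn fn, unsigned long opcode,
			     unsigned long arg2)
{
	int need_lock = (opcode == MOCK_H_ALLOC_RESOURCE &&
			 arg2 == MOCK_RES_TYPE_MR);
	long ret;

	assert(fn != NULL);

	if (need_lock) {
		while (atomic_flag_test_and_set(&mock_hcall_lock))
			;	/* spin until the lock is free */
		locked_calls++;
	}

	ret = fn(opcode, arg2);

	if (need_lock)
		atomic_flag_clear(&mock_hcall_lock);

	return ret;
}
```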


From fenkes at de.ibm.com  Wed May  9 04:48:01 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Wed, 9 May 2007 13:48:01 +0200
Subject: [ofa-general] [PATCH 2/6] IB/ehca: correctly set GRH mask bit in
	ehca_modify_qp()
Message-ID: <200705091348.02396.fenkes@de.ibm.com>

The driver needs to always supply the "GRH present" flag to the hypervisor,
whether it's true or false. Not supplying it (i.e. not setting the
corresponding mask bit) amounts to a "perhaps", which we don't want.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_qp.c |   12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index df0516f..e21d796 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -968,17 +968,21 @@ static int internal_modify_qp(struct ib_
 			((ehca_mult - 1) / ah_mult) : 0;
 		else
 			mqpcb->max_static_rate = 0;
-
 		update_mask |= EHCA_BMASK_SET(MQPCB_MASK_MAX_STATIC_RATE, 1);
 
 		/*
+		 * Always supply the GRH flag, even if it's zero, to give the
+		 * hypervisor a clear "yes" or "no" instead of a "perhaps"
+		 */
+		update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG, 1);
+
+		/*
 		 * only if GRH is TRUE we might consider SOURCE_GID_IDX
 		 * and DEST_GID otherwise phype will return H_ATTR_PARM!!!
 		 */
 		if (attr->ah_attr.ah_flags == IB_AH_GRH) {
-			mqpcb->send_grh_flag = 1 << 31;
-			update_mask |=
-				EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG, 1);
+			mqpcb->send_grh_flag = 1;
+
 			mqpcb->source_gid_idx = attr->ah_attr.grh.sgid_index;
 			update_mask |=
 				EHCA_BMASK_SET(MQPCB_MASK_SOURCE_GID_IDX, 1);
-- 
1.4.2.1
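
A minimal sketch of the idea behind this change (not part of the patch):
the mask bit says "this field is valid", and the field itself carries the
yes/no answer. The MASK_* bit positions, struct mock_qpcb and
build_update_mask() are hypothetical names used only for illustration; the
real driver uses EHCA_BMASK_SET() and the MQPCB_MASK_* definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MASK_SEND_GRH_FLAG	(1u << 0)	/* hypothetical bit positions */
#define MASK_SOURCE_GID_IDX	(1u << 1)

/* Hypothetical cut-down control block with only the fields used here. */
struct mock_qpcb {
	uint32_t send_grh_flag;
	uint32_t source_gid_idx;
};

static uint32_t build_update_mask(struct mock_qpcb *cb, int has_grh,
				  uint32_t sgid_index)
{
	uint32_t update_mask = 0;

	assert(cb != NULL);

	/* Always mark the GRH flag as supplied, even when it is zero. */
	update_mask |= MASK_SEND_GRH_FLAG;
	cb->send_grh_flag = has_grh ? 1 : 0;

	/* GID-related fields are only supplied when a GRH is present. */
	if (has_grh) {
		cb->source_gid_idx = sgid_index;
		update_mask |= MASK_SOURCE_GID_IDX;
	}

	return update_mask;
}
```

With has_grh == 0 the returned mask still contains MASK_SEND_GRH_FLAG, so
the receiver of the control block sees an explicit "no" rather than an
unspecified field.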


From fenkes at de.ibm.com  Wed May  9 04:48:25 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Wed, 9 May 2007 13:48:25 +0200
Subject: [ofa-general] [PATCH 5/6] IB/ehca: beautify sysfs attribute code,
	fix compiler warnings
Message-ID: <200705091348.26426.fenkes@de.ibm.com>

eHCA's sysfs attributes are now being created via sysfs_create_group(),
making the process neatly table-driven. The return value is checked, thus
fixing a few compiler warnings.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_main.c |   86 ++++++++++++++------------------
 1 files changed, 37 insertions(+), 49 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 2e27e68..dc736e8 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -454,15 +454,14 @@ static ssize_t ehca_store_debug_level(st
 DRIVER_ATTR(debug_level, S_IRUSR | S_IWUSR,
 	    ehca_show_debug_level, ehca_store_debug_level);
 
-void ehca_create_driver_sysfs(struct ibmebus_driver *drv)
-{
-	driver_create_file(&drv->driver, &driver_attr_debug_level);
-}
+static struct attribute *ehca_drv_attrs[] = {
+	&driver_attr_debug_level.attr,
+	NULL
+};
 
-void ehca_remove_driver_sysfs(struct ibmebus_driver *drv)
-{
-	driver_remove_file(&drv->driver, &driver_attr_debug_level);
-}
+static struct attribute_group ehca_drv_attr_grp = {
+	.attrs = ehca_drv_attrs
+};
 
 #define EHCA_RESOURCE_ATTR(name)                                           \
 static ssize_t  ehca_show_##name(struct device *dev,                       \
@@ -524,44 +523,28 @@ static ssize_t ehca_show_adapter_handle(
 }
 static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, NULL);
 
+static struct attribute *ehca_dev_attrs[] = {
+	&dev_attr_adapter_handle.attr,
+	&dev_attr_num_ports.attr,
+	&dev_attr_hw_ver.attr,
+	&dev_attr_max_eq.attr,
+	&dev_attr_cur_eq.attr,
+	&dev_attr_max_cq.attr,
+	&dev_attr_cur_cq.attr,
+	&dev_attr_max_qp.attr,
+	&dev_attr_cur_qp.attr,
+	&dev_attr_max_mr.attr,
+	&dev_attr_cur_mr.attr,
+	&dev_attr_max_mw.attr,
+	&dev_attr_cur_mw.attr,
+	&dev_attr_max_pd.attr,
+	&dev_attr_max_ah.attr,
+	NULL
+};
 
-void ehca_create_device_sysfs(struct ibmebus_dev *dev)
-{
-	device_create_file(&dev->ofdev.dev, &dev_attr_adapter_handle);
-	device_create_file(&dev->ofdev.dev, &dev_attr_num_ports);
-	device_create_file(&dev->ofdev.dev, &dev_attr_hw_ver);
-	device_create_file(&dev->ofdev.dev, &dev_attr_max_eq);
-	device_create_file(&dev->ofdev.dev, &dev_attr_cur_eq);
-	device_create_file(&dev->ofdev.dev, &dev_attr_max_cq);
-	device_create_file(&dev->ofdev.dev, &dev_attr_cur_cq);
-	device_create_file(&dev->ofdev.dev, &dev_attr_max_qp);
-	device_create_file(&dev->ofdev.dev, &dev_attr_cur_qp);
-	device_create_file(&dev->ofdev.dev, &dev_attr_max_mr);
-	device_create_file(&dev->ofdev.dev, &dev_attr_cur_mr);
-	device_create_file(&dev->ofdev.dev, &dev_attr_max_mw);
-	device_create_file(&dev->ofdev.dev, &dev_attr_cur_mw);
-	device_create_file(&dev->ofdev.dev, &dev_attr_max_pd);
-	device_create_file(&dev->ofdev.dev, &dev_attr_max_ah);
-}
-
-void ehca_remove_device_sysfs(struct ibmebus_dev *dev)
-{
-	device_remove_file(&dev->ofdev.dev, &dev_attr_adapter_handle);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_num_ports);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_hw_ver);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_max_eq);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_cur_eq);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_max_cq);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_cur_cq);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_max_qp);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_cur_qp);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_max_mr);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_cur_mr);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_max_mw);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_cur_mw);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_max_pd);
-	device_remove_file(&dev->ofdev.dev, &dev_attr_max_ah);
-}
+static struct attribute_group ehca_dev_attr_grp = {
+	.attrs = ehca_dev_attrs
+};
 
 static int __devinit ehca_probe(struct ibmebus_dev *dev,
 				const struct of_device_id *id)
@@ -669,7 +652,10 @@ static int __devinit ehca_probe(struct i
 		}
 	}
 
-	ehca_create_device_sysfs(dev);
+	ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp);
+	if (ret) /* only complain; we can live without attributes */
+		ehca_err(&shca->ib_device,
+			 "Cannot create device attributes  ret=%d", ret);
 
 	spin_lock(&shca_list_lock);
 	list_add(&shca->shca_list, &shca_list);
@@ -721,7 +707,7 @@ static int __devexit ehca_remove(struct 
 	struct ehca_shca *shca = dev->ofdev.dev.driver_data;
 	int ret;
 
-	ehca_remove_device_sysfs(dev);
+	sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp);
 
 	if (ehca_open_aqp1 == 1) {
 		int i;
@@ -840,7 +826,9 @@ int __init ehca_module_init(void)
 		goto module_init2;
 	}
 
-	ehca_create_driver_sysfs(&ehca_driver);
+	ret = sysfs_create_group(&ehca_driver.driver.kobj, &ehca_drv_attr_grp);
+	if (ret) /* only complain; we can live without attributes */
+		ehca_gen_err("Cannot create driver attributes  ret=%d", ret);
 
 	if (ehca_poll_all_eqs != 1) {
 		ehca_gen_err("WARNING!!!");
@@ -867,7 +855,7 @@ void __exit ehca_module_exit(void)
 	if (ehca_poll_all_eqs == 1)
 		del_timer_sync(&poll_eqs_timer);
 
-	ehca_remove_driver_sysfs(&ehca_driver);
+	sysfs_remove_group(&ehca_driver.driver.kobj, &ehca_drv_attr_grp);
 	ibmebus_unregister_driver(&ehca_driver);
 
 	ehca_destroy_slab_caches();
-- 
1.4.2.1


From fenkes at de.ibm.com  Wed May  9 04:48:11 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Wed, 9 May 2007 13:48:11 +0200
Subject: [ofa-general] [PATCH 3/6] IB/ehca: Fix AQP0/1 QP number
Message-ID: <200705091348.12551.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

AQP0/1 should report qp_num={0|1} and the actual QP# should be stored in
struct ehca_qp, not the other way round.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_qp.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index e21d796..b5bc787 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -523,6 +523,8 @@ struct ib_qp *ehca_create_qp(struct ib_p
 		goto create_qp_exit1;
 	}
 
+	my_qp->ib_qp.qp_num = my_qp->real_qp_num;
+
 	switch (init_attr->qp_type) {
 	case IB_QPT_RC:
 	        if (isdaqp == 0) {
@@ -568,7 +570,7 @@ struct ib_qp *ehca_create_qp(struct ib_p
 			parms.act_nr_recv_wqes = init_attr->cap.max_recv_wr;
 			parms.act_nr_send_sges = init_attr->cap.max_send_sge;
 			parms.act_nr_recv_sges = init_attr->cap.max_recv_sge;
-			my_qp->real_qp_num =
+			my_qp->ib_qp.qp_num =
 				(init_attr->qp_type == IB_QPT_SMI) ? 0 : 1;
 		}
 
@@ -595,7 +597,6 @@ struct ib_qp *ehca_create_qp(struct ib_p
 	my_qp->ib_qp.recv_cq = init_attr->recv_cq;
 	my_qp->ib_qp.send_cq = init_attr->send_cq;
 
-	my_qp->ib_qp.qp_num = my_qp->real_qp_num;
 	my_qp->ib_qp.qp_type = init_attr->qp_type;
 
 	my_qp->qp_type = init_attr->qp_type;
-- 
1.4.2.1


From fenkes at de.ibm.com  Wed May  9 04:48:20 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Wed, 9 May 2007 13:48:20 +0200
Subject: [ofa-general] [PATCH 4/6] IB/ehca: remove _irqsave, move #ifdef
Message-ID: <200705091348.20808.fenkes@de.ibm.com>

- In ehca_process_eq(), we're IRQ safe throughout the whole function, so we
  don't need another _irqsave in the middle of flight.

- take_over_work() is only called by comp_pool_callback(), so it can move
  into the same #ifdef block.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_irq.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c
index f284be1..f172013 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.c
+++ b/drivers/infiniband/hw/ehca/ehca_irq.c
@@ -517,12 +517,11 @@ void ehca_process_eq(struct ehca_shca *s
 			else {
 				struct ehca_cq *cq = eq->eqe_cache[i].cq;
 				comp_event_callback(cq);
-				spin_lock_irqsave(&ehca_cq_idr_lock, flags);
+				spin_lock(&ehca_cq_idr_lock);
 				cq->nr_events--;
 				if (!cq->nr_events)
 					wake_up(&cq->wait_completion);
-				spin_unlock_irqrestore(&ehca_cq_idr_lock,
-						       flags);
+				spin_unlock(&ehca_cq_idr_lock);
 			}
 		} else {
 			ehca_dbg(&shca->ib_device, "Got non completion event");
@@ -711,6 +710,7 @@ static void destroy_comp_task(struct ehc
 		kthread_stop(task);
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
 static void take_over_work(struct ehca_comp_pool *pool,
 			   int cpu)
 {
@@ -735,7 +735,6 @@ static void take_over_work(struct ehca_c
 
 }
 
-#ifdef CONFIG_HOTPLUG_CPU
 static int comp_pool_callback(struct notifier_block *nfb,
 			      unsigned long action,
 			      void *hcpu)
-- 
1.4.2.1


From fenkes at de.ibm.com  Wed May  9 04:48:31 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Wed, 9 May 2007 13:48:31 +0200
Subject: [ofa-general] [PATCH 6/6] IB/ehca: disable scaling code by default,
	bump version number
Message-ID: <200705091348.31742.fenkes@de.ibm.com>

- Scaling code is still considered experimental, so disable it by default
- Increase version to SVNEHCA_0023

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_main.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index dc736e8..233788a 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -52,7 +52,7 @@ #include "hcp_if.h"
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_AUTHOR("Christoph Raisch <raisch at de.ibm.com>");
 MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver");
-MODULE_VERSION("SVNEHCA_0022");
+MODULE_VERSION("SVNEHCA_0023");
 
 int ehca_open_aqp1     = 0;
 int ehca_debug_level   = 0;
@@ -62,7 +62,7 @@ int ehca_use_hp_mr     = 0;
 int ehca_port_act_time = 30;
 int ehca_poll_all_eqs  = 1;
 int ehca_static_rate   = -1;
-int ehca_scaling_code  = 1;
+int ehca_scaling_code  = 0;
 
 module_param_named(open_aqp1,     ehca_open_aqp1,     int, 0);
 module_param_named(debug_level,   ehca_debug_level,   int, 0);
@@ -799,7 +799,7 @@ int __init ehca_module_init(void)
 	int ret;
 
 	printk(KERN_INFO "eHCA Infiniband Device Driver "
-	       "(Rel.: SVNEHCA_0022)\n");
+	       "(Rel.: SVNEHCA_0023)\n");
 	idr_init(&ehca_qp_idr);
 	idr_init(&ehca_cq_idr);
 	spin_lock_init(&ehca_qp_idr_lock);
-- 
1.4.2.1


From jsquyres at cisco.com  Wed May  9 04:51:07 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 9 May 2007 07:51:07 -0400
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <46415DFE.9030807@voltaire.com>
References: <1177791386.4615.8.camel@stevo-laptop>	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>	<1178575761.30571.175.camel@stevo-desktop>	<95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com>
	<463FCA42.3000104@indiana.edu> <46415DFE.9030807@voltaire.com>
Message-ID: <E1CF7EBB-8BD8-4654-95C3-A410850FCC18@cisco.com>

On May 9, 2007, at 1:37 AM, Or Gerlitz wrote:

> Zooming out a bit from the "how to make ofed's udapl work for
> ompi" thread, my thinking is that the ompi udapl btl enablement is
> actually only the first step, whereas for production/long-term/etc. you
> want to have an rdmacm btl.

I think this is a bit of a misunderstanding.  The "BTL" in Open MPI  
is a byte transfer layer; it is a point-to-point abstraction for  
moving bytes between two processes.  BTL components (read: plugins)  
are typically distinguished by the underlying protocols used.  For  
example, we have an RC verbs-based BTL and we have a separate uDAPL- 
based BTL.  Andrew is also working on a research-quality UD verbs- 
based BTL.

Hence, how a particular BTL component makes connections between  
process peers is really a side-effect of moving bytes around, and not  
the focus of the BTL.  So having a "rdmacm" BTL doesn't really make  
sense.  If both the RC and UD verbs-based BTLs someday use the RDMA  
CM for connections, we might abstract the connection management out  
to a common piece of code between the two.  But that's a different  
issue.  If we end up having a mixed BTL someday that uses both RC and  
UD, then the need for the common code may go away.  But that's in the  
future.
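
The separation described above can be pictured with a rough C sketch. Every name in it (conn_mgr, btl, btl_send) is hypothetical and made up for illustration; this is not Open MPI's actual BTL interface, only a way to show a byte-moving component that treats connection setup as a shared, pluggable detail:

```c
#include <stddef.h>
#include <string.h>

#define MAX_PEERS 4
#define WIRE_SZ   64

/* Hypothetical shared connection manager: tracks which peers are wired up. */
struct conn_mgr {
    int connected[MAX_PEERS];
};

static int conn_mgr_connect(struct conn_mgr *cm, int peer)
{
    if (peer < 0 || peer >= MAX_PEERS)
        return -1;
    cm->connected[peer] = 1;
    return 0;
}

/* Hypothetical byte-transfer component, named after its wire protocol. */
struct btl {
    const char *protocol;            /* e.g. "rc-verbs" or "ud-verbs" */
    struct conn_mgr *cm;             /* connection setup is a pluggable detail */
    char wire[MAX_PEERS][WIRE_SZ];   /* stand-in for the transport itself */
};

/* Move bytes to a peer; returns bytes "sent", or 0 if there is no connection. */
static size_t btl_send(struct btl *b, int peer, const void *buf, size_t len)
{
    if (peer < 0 || peer >= MAX_PEERS || !b->cm->connected[peer])
        return 0;
    if (len > WIRE_SZ)
        len = WIRE_SZ;
    memcpy(b->wire[peer], buf, len);
    return len;
}
```

In this sketch two components (say one labeled "rc-verbs" and one labeled "ud-verbs") can point at the same conn_mgr, which is the "common piece of code" idea; neither component is defined by how its connections were made.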

> The reasoning here rests on many arguments; the quickest I can
> make are:
>
> A) it seems that ompi would want to use not only RC but also
> UD multicast and unicast, which are not covered by udapl
>
> B) there's actually no real justification to maintain two APIs
> (namely udapl vs libibverbs/librdmacm), so down the road, only one
> of them would survive (udapl is implemented ***over*** libibverbs/
> librdmacm, so if the latter dies, so does udapl). Specifically, I
> hear here and there that the OFED stack is now on its way to being
> deployed all over the place, notably in commercial Unix OSs
> (which want modern! code that supports IPoIB-CM, RDS, SRP, iSER, etc.,
> you name it), so eventually the rdmacm btl can also be used over
> Solaris et al.

I think that's not quite the point.

1. A piece of history: the uDAPL BTL was originally developed by a  
grad student just as an excuse to learn the BTL interface and OMPI  
internals.  We already had an RC verbs-based BTL at the time.

2. When Sun joined Open MPI, they took over the development and  
maintenance of the uDAPL BTL because uDAPL is the only high  
performance stack on Solaris.

3. It's fine that Sun will someday support the same verbs interface  
that OFED does.  But *today*, they don't.  So for their current  
customers, they need to support uDAPL.  As such, we have done little/ 
no testing of uDAPL on OFED since Sun took over the uDAPL BTL -- all  
testing since that point has been on Solaris uDAPL.  All of our Linux/ 
OFED efforts have been on the verbs interface.

4. The Open MPI focus on uDAPL over OFED at the moment is simply to
jump-start iWARP testing.  Both NetEffect and Chelsio have chimed in
to say that they will do the RDMA CM work for Open MPI, but uDAPL can
serve as a temporary workaround that is usable [effectively]
immediately while they get up to speed on the Open MPI code base and
do the RDMA CM work.

-- 
Jeff Squyres
Cisco Systems


From yosefe at voltaire.com  Wed May  9 04:53:33 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 09 May 2007 14:53:33 +0300
Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <20070509112626.GA10068@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408360.3040006@voltaire.com>	<20070508162727.GD5845@mellanox.co.il>	<4640A8BD.4000405@voltaire.com>	<20070509093548.GA7683@mellanox.co.il>	<4641AA06.1050002@voltaire.com>
	<20070509112626.GA10068@mellanox.co.il>
Message-ID: <4641B63D.4010602@voltaire.com>

Michael S. Tsirkin wrote:
> OK. looks pretty good to me. One coding style violation I found:
> 
> 
fixed

--
This issue was found during partitioning & SM fail over testing.

 * Add a flush flag to ipoib_ib_dev_stop(), like the one
   ipoib_ib_dev_down() already takes
 * Rename the polling work to 'pkey_poll_task' to avoid ambiguity
 * Upon a PKEY_CHANGE event, schedule a work item that restarts the QP
 * Restart child interfaces before the parent; they might be up even if
   the parent is down
 * Use an uncached pkey query upon QP initialization

SM reconfiguration or failover may shuffle the values in the port pkey
table. The current implementation queries for the index of the pkey only
once, when it creates the device QP and moves it into working state, so a
later reordering leaves the QP using a stale index. Fix this by using the
PKEY_CHANGE event as a trigger to reconfigure the device QP.

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |    6 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   59 ++++++++++++++++++++---------
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |    7 +--
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |   11 ++---
 4 files changed, 56 insertions(+), 27 deletions(-)

Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 15:46:53.000000000 +0300
+++ infiniband/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 16:45:44.000000000 +0300
@@ -202,11 +202,12 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
+	struct delayed_work pkey_poll_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
+	struct work_struct pkey_event_task;
 
 	struct ib_device *ca;
 	u8            	  port;
@@ -333,12 +334,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 15:46:53.000000000 +0300
+++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-09 13:56:00.754030478 +0300
@@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -481,7 +481,7 @@ int ipoib_ib_dev_down(struct net_device 
 	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
 		mutex_lock(&pkey_mutex);
 		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
+		cancel_delayed_work(&priv->pkey_poll_task);
 		mutex_unlock(&pkey_mutex);
 		if (flush)
 			flush_workqueue(ipoib_workqueue);
@@ -508,7 +508,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +581,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,13 +623,21 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
+	mutex_lock(&priv->vlan_mutex);
+
+	/* Flush any child interfaces too -
+	 * they might be up even if the parent is down */
+	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, restart_qp);
+
+	mutex_unlock(&priv->vlan_mutex);
+
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
 		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
@@ -642,6 +651,11 @@ void ipoib_ib_dev_flush(struct work_stru
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (restart_qp) {
+		ipoib_ib_dev_stop(dev, 0);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +664,24 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	mutex_unlock(&priv->vlan_mutex);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_event_task);
+
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -685,7 +709,7 @@ void ipoib_ib_dev_cleanup(struct net_dev
 void ipoib_pkey_poll(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
+		container_of(work, struct ipoib_dev_priv, pkey_poll_task.work);
 	struct net_device *dev = priv->dev;
 
 	ipoib_pkey_dev_check_presence(dev);
@@ -696,7 +720,7 @@ void ipoib_pkey_poll(struct work_struct 
 		mutex_lock(&pkey_mutex);
 		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
+					   &priv->pkey_poll_task,
 					   HZ);
 		mutex_unlock(&pkey_mutex);
 	}
@@ -715,7 +739,7 @@ int ipoib_pkey_dev_delay_open(struct net
 		mutex_lock(&pkey_mutex);
 		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
 		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
+				   &priv->pkey_poll_task,
 				   HZ);
 		mutex_unlock(&pkey_mutex);
 		return 1;
Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 15:46:53.000000000 +0300
+++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 16:45:44.000000000 +0300
@@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev)
 		return -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 15:46:53.000000000 +0300
+++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-09 14:51:32.684627634 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
@@ -103,7 +101,7 @@ int ipoib_init_qp(struct net_device *dev
 	 * The port has to be assigned to the respective IB partition in
 	 * advance.
 	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
+	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
 	if (ret) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		return ret;
@@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler
 		container_of(handler, struct ipoib_dev_priv, event_handler);
 
 	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
 	     record->event == IB_EVENT_PORT_ACTIVE ||
 	     record->event == IB_EVENT_LID_CHANGE  ||
 	     record->event == IB_EVENT_SM_CHANGE   ||
@@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler
 	    record->element.port_num == priv->port) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
+		   record->element.port_num == priv->port) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_event_task);
 	}
 }


From mst at dev.mellanox.co.il  Wed May  9 05:07:42 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 15:07:42 +0300
Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <4641B63D.4010602@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508162727.GD5845@mellanox.co.il>
	<4640A8BD.4000405@voltaire.com>
	<20070509093548.GA7683@mellanox.co.il>
	<4641AA06.1050002@voltaire.com>
	<20070509112626.GA10068@mellanox.co.il>
	<4641B63D.4010602@voltaire.com>
Message-ID: <20070509120742.GC10068@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: [PATCHv3 2/2] ipoib: handle pkey change events
> 
> Michael S. Tsirkin wrote:
> > OK. looks pretty good to me. One coding style violation I found:
> > 
> > 
> fixed

OK, Ack for this latest revision.
I'm quite happy with the latest state of these 2 patches - they are small,
clean, fix a real bug, and move us in the direction of gradually
phasing out the cache as we agreed we want to.

Please post the final revision of them (in a new
thread), so it's clear for Roland what to take up for 2.6.22
(you can label them [PATCHv4 for-2.6.22 1 of 2] for clarity).

We'll also stick them in OFED assuming no one objects by tomorrow.

-- 
MST


From yosefe at voltaire.com  Wed May  9 05:14:56 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 09 May 2007 15:14:56 +0300
Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <20070509120742.GC10068@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408360.3040006@voltaire.com>	<20070508162727.GD5845@mellanox.co.il>	<4640A8BD.4000405@voltaire.com>	<20070509093548.GA7683@mellanox.co.il>	<4641AA06.1050002@voltaire.com>	<20070509112626.GA10068@mellanox.co.il>	<4641B63D.4010602@voltaire.com>
	<20070509120742.GC10068@mellanox.co.il>
Message-ID: <4641BB40.9090208@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>Subject: Re: [PATCHv3 2/2] ipoib: handle pkey change events
>>
>>Michael S. Tsirkin wrote:
>>
>>>OK. looks pretty good to me. One coding style violation I found:
>>>
>>>
>>
>>fixed
> 
> 
> OK, Ack for this latest revision.
> I'm quite happy with the latest state of these 2 patches - they are small,
> clean, fix a real bug, and move us in the direction of gradually
> phasing out the cache as we agreed we want to.
> 
> Please post the final revision of them (in a new
> thread), so it's clear for Roland what to take up for 2.6.22
> (you can label them [PATCHv4 for-2.6.22 1 of 2] for clarity).
> 
> We'll also stick them in OFED assuming no one objects by tomorrow.
> 
ACK


From yosefe at voltaire.com  Wed May  9 05:17:09 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 09 May 2007 15:17:09 +0300
Subject: [ofa-general] [PATCHv4 for-2.6.22 0 of 2] pkey change handling - fix
	bug #577
Message-ID: <4641BBC5.7040106@voltaire.com>

These two patches fix bug #577: PKey table reordering caused by SM failover stops ipoib traffic
patch 1: add uncached device queries to core
patch 2: restart ipoib qp on pkey change event, and use uncached queries on qp init
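
To illustrate the failure mode and the fix, here is a small user-space C sketch. It is only a simulation under stated assumptions: port_pkey_tbl, find_pkey_uncached() and sm_reorder() are hypothetical stand-ins invented for this example, not kernel functions, and the table contents are arbitrary. It shows why an index saved before a table reorder goes stale, while a fresh scan of the live table still finds the pkey:

```c
#include <stddef.h>
#include <stdint.h>

#define PKEY_TBL_LEN 4

/* Stand-in for one port's PKey table as the SM last programmed it. */
static uint16_t port_pkey_tbl[PKEY_TBL_LEN] = { 0xffff, 0x8001, 0x8002, 0x0000 };

/* Scan the live table on every call instead of trusting a saved index. */
static int find_pkey_uncached(const uint16_t *tbl, size_t len,
                              uint16_t pkey, uint16_t *index)
{
    size_t i;

    for (i = 0; i < len; ++i) {
        if (tbl[i] == pkey) {
            *index = (uint16_t)i;
            return 0;
        }
    }
    return -1;  /* pkey not present in the table */
}

/* Simulate an SM failover that rewrites the table in a different order. */
static void sm_reorder(uint16_t *tbl)
{
    uint16_t tmp = tbl[0];

    tbl[0] = tbl[2];
    tbl[2] = tmp;
}
```

With this table, pkey 0xffff is found at index 0 before sm_reorder() and at index 2 afterwards; a QP still configured with index 0 would then be using pkey 0x8002, which is the situation the PKEY_CHANGE handler in patch 2 is meant to recover from.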

--


From yosefe at voltaire.com  Wed May  9 05:20:42 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 09 May 2007 15:20:42 +0300
Subject: [ofa-general] [PATCHv4 for-2.6.22 1 of 2] core: uncached "find gid"
 and "find pkey" queries
In-Reply-To: <4641BBC5.7040106@voltaire.com>
References: <4641BBC5.7040106@voltaire.com>
Message-ID: <4641BC9A.2050409@voltaire.com>


* Add ib_find_gid and ib_find_pkey on top of uncached device queries.
  These calls may block, but their results are always up to date.
* Cache the pkey and gid table lengths in core to avoid port info queries.


Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/device.c |  138 +++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h          |   25 +++++++
 2 files changed, 163 insertions(+)

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-08 15:46:36.000000000 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-09 11:47:22.096064221 +0300
@@ -149,6 +149,18 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
+}
+
 /**
  * ib_alloc_device - allocate an IB device struct
  * @size:size of structure to allocate
@@ -208,6 +220,55 @@ static int add_client_context(struct ib_
 	return 0;
 }
 
+/* read the lengths of pkey,gid tables on each port */
+static int read_port_table_lengths(struct ib_device *device)
+{
+	struct ib_port_attr *tprops = NULL;
+	int num_ports, ret = -ENOMEM;
+	u8 port_index;
+
+	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
+	if (!tprops)
+		goto out;
+
+	num_ports = end_port(device) - start_port(device) + 1;
+
+	device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len *
+						num_ports, GFP_KERNEL);
+	if (!device->pkey_tbl_len)
+		goto out;
+
+	device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len *
+						num_ports, GFP_KERNEL);
+	if (!device->gid_tbl_len)
+		goto err1;
+
+	for (port_index = 0; port_index < num_ports; ++port_index) {
+		ret = ib_query_port(device, port_index + start_port(device),
+					tprops);
+		if (ret)
+			goto err2;
+		device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len;
+		device->gid_tbl_len[port_index] = tprops->gid_tbl_len;
+	}
+
+	ret = 0;
+	goto out;
+err2:
+	kfree(device->gid_tbl_len);
+err1:
+	kfree(device->pkey_tbl_len);
+out:
+	kfree(tprops);
+	return ret;
+}
+
+static inline void free_port_table_lengths(struct ib_device *device)
+{
+	kfree(device->gid_tbl_len);
+	kfree(device->pkey_tbl_len);
+}
+
 /**
  * ib_register_device - Register an IB device with IB core
  * @device:Device to register
@@ -239,6 +300,13 @@ int ib_register_device(struct ib_device 
 	spin_lock_init(&device->event_handler_lock);
 	spin_lock_init(&device->client_data_lock);
 
+	ret = read_port_table_lengths(device);
+	if (ret) {
+		printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n",
+		       device->name);
+		goto out;
+	}
+
 	ret = ib_device_register_sysfs(device);
 	if (ret) {
 		printk(KERN_WARNING "Couldn't register device %s with driver model\n",
@@ -284,6 +352,8 @@ void ib_unregister_device(struct ib_devi
 
 	list_del(&device->core_list);
 
+	free_port_table_lengths(device);
+
 	mutex_unlock(&device_mutex);
 
 	spin_lock_irqsave(&device->client_data_lock, flags);
@@ -592,6 +662,74 @@ int ib_modify_port(struct ib_device *dev
 }
 EXPORT_SYMBOL(ib_modify_port);
 
+/**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+			u8 *port_num, u16 *index)
+{
+	union ib_gid tmp_gid;
+	int ret, port, i, tbl_len;
+
+	for (port = start_port(device); port <= end_port(device); ++port) {
+		tbl_len = device->gid_tbl_len[port - start_port(device)];
+		for (i = 0; i < tbl_len; ++i) {
+			ret = ib_query_gid(device, port, i, &tmp_gid);
+			if (ret)
+				goto out;
+			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
+				*port_num = port;
+				*index = i;
+				ret = 0;
+				goto out;
+			}
+		}
+	}
+	ret = -ENOENT;
+out:
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_gid);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index)
+{
+	int ret, i, tbl_len;
+	u16 tmp_pkey;
+
+	tbl_len = device->pkey_tbl_len[port_num - start_port(device)];
+	for (i = 0; i < tbl_len; ++i) {
+		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
+		if (ret)
+			goto out;
+
+		if (pkey == tmp_pkey) {
+			*index = i;
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = -ENOENT;
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_pkey);
+
 static int __init ib_core_init(void)
 {
 	int ret;
Index: b/include/rdma/ib_verbs.h
===================================================================
--- a/include/rdma/ib_verbs.h	2007-05-08 15:45:45.000000000 +0300
+++ b/include/rdma/ib_verbs.h	2007-05-09 11:47:55.006221894 +0300
@@ -1058,6 +1058,8 @@ struct ib_device {
 	__be64			     node_guid;
 	u8                           node_type;
 	u8                           phys_port_cnt;
+	int                          *pkey_tbl_len;
+	int                          *gid_tbl_len;
 };
 
 struct ib_client {
@@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev
 		   struct ib_port_modify *port_modify);
 
 /**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+			u8 *port_num, u16 *index);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index);
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *


From yosefe at voltaire.com  Wed May  9 05:20:44 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 09 May 2007 15:20:44 +0300
Subject: [ofa-general] [PATCHv4 for-2.6.22 2 of 2] ipoib: handle pkey change
	events
In-Reply-To: <4641BBC5.7040106@voltaire.com>
References: <4641BBC5.7040106@voltaire.com>
Message-ID: <4641BC9C.8060501@voltaire.com>


This issue was found during partitioning & SM fail over testing.

 * Add a flush flag to ipoib_ib_dev_stop(), like the one
   ipoib_ib_dev_down() already takes
 * Rename the polling work to 'pkey_poll_task' to avoid ambiguity
 * Upon a PKEY_CHANGE event, schedule a work item that restarts the QP
 * Restart child interfaces before the parent; they might be up even if
   the parent is down
 * Use an uncached pkey query upon QP initialization

SM reconfiguration or failover may shuffle the values in the port pkey
table. The current implementation queries for the index of the pkey only
once, when it creates the device QP and moves it into working state, so a
later reordering leaves the QP using a stale index. Fix this by using the
PKEY_CHANGE event as a trigger to reconfigure the device QP.

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |    6 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   59 ++++++++++++++++++++---------
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |    7 +--
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |   11 ++---
 4 files changed, 56 insertions(+), 27 deletions(-)

Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 15:46:53.000000000 +0300
+++ infiniband/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 16:45:44.000000000 +0300
@@ -202,11 +202,12 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
+	struct delayed_work pkey_poll_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
+	struct work_struct pkey_event_task;
 
 	struct ib_device *ca;
 	u8            	  port;
@@ -333,12 +334,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 15:46:53.000000000 +0300
+++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-09 13:56:00.754030478 +0300
@@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -481,7 +481,7 @@ int ipoib_ib_dev_down(struct net_device 
 	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
 		mutex_lock(&pkey_mutex);
 		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
+		cancel_delayed_work(&priv->pkey_poll_task);
 		mutex_unlock(&pkey_mutex);
 		if (flush)
 			flush_workqueue(ipoib_workqueue);
@@ -508,7 +508,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +581,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,13 +623,21 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
+	mutex_lock(&priv->vlan_mutex);
+
+	/* Flush any child interfaces too -
+ 	 * they might be up even if the parent is down */
+ 	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, restart_qp);
+
+	mutex_unlock(&priv->vlan_mutex);
+
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
 		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
@@ -642,6 +651,11 @@ void ipoib_ib_dev_flush(struct work_stru
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (restart_qp) {
+		ipoib_ib_dev_stop(dev, 0);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +664,24 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	mutex_unlock(&priv->vlan_mutex);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_event_task);
+
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -685,7 +709,7 @@ void ipoib_ib_dev_cleanup(struct net_dev
 void ipoib_pkey_poll(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
+		container_of(work, struct ipoib_dev_priv, pkey_poll_task.work);
 	struct net_device *dev = priv->dev;
 
 	ipoib_pkey_dev_check_presence(dev);
@@ -696,7 +720,7 @@ void ipoib_pkey_poll(struct work_struct 
 		mutex_lock(&pkey_mutex);
 		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
+					   &priv->pkey_poll_task,
 					   HZ);
 		mutex_unlock(&pkey_mutex);
 	}
@@ -715,7 +739,7 @@ int ipoib_pkey_dev_delay_open(struct net
 		mutex_lock(&pkey_mutex);
 		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
 		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
+				   &priv->pkey_poll_task,
 				   HZ);
 		mutex_unlock(&pkey_mutex);
 		return 1;
Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 15:46:53.000000000 +0300
+++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 16:45:44.000000000 +0300
@@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev)
 		return -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 15:46:53.000000000 +0300
+++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-09 14:51:32.684627634 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
@@ -103,7 +101,7 @@ int ipoib_init_qp(struct net_device *dev
 	 * The port has to be assigned to the respective IB partition in
 	 * advance.
 	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
+	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
 	if (ret) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		return ret;
@@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler
 		container_of(handler, struct ipoib_dev_priv, event_handler);
 
 	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
 	     record->event == IB_EVENT_PORT_ACTIVE ||
 	     record->event == IB_EVENT_LID_CHANGE  ||
 	     record->event == IB_EVENT_SM_CHANGE   ||
@@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler
 	    record->element.port_num == priv->port) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
+		   record->element.port_num == priv->port) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_event_task);
 	}
 }


From mst at dev.mellanox.co.il  Wed May  9 05:23:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 15:23:00 +0300
Subject: [ofa-general] Re: [PATCHv4 for-2.6.22 1 of 2] core: uncached "find
	gid" and "find pkey" queries
In-Reply-To: <4641BC9A.2050409@voltaire.com>
References: <4641BBC5.7040106@voltaire.com> <4641BC9A.2050409@voltaire.com>
Message-ID: <20070509122300.GE10068@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCHv4 for-2.6.22 1 of 2] core: uncached "find gid" and "find pkey" queries
> 
> 
> * Add ib_find_gid and ib_find_pkey over uncached device queries.
>   The calls might block but the returns are always up-to-date. 
> * Cache pkey and gid table lengths in core to avoid port info queries.
> 
> 
> Signed-off-by: Yosef Etigin <yosefe at voltaire.com>

Acked-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
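
For readers following the thread, the "uncached find pkey" idea quoted above can be illustrated with a small user-space sketch. It is not the code from the patch: struct demo_port, demo_query_pkey() and demo_find_pkey() are hypothetical names, the device query is simulated by an array read, and comparing only the low 15 bits (ignoring the membership bit) is an assumption about the matching policy. Only the table length is cached; each entry is fetched on every lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a port: the live P_Key table plus a cached length. */
struct demo_port {
	const uint16_t *pkey_tbl;	/* what a device query would return right now */
	int pkey_tbl_len;		/* table length, cached once */
	int queries;			/* counts simulated device queries */
};

/* Simulated uncached query of a single table entry; 0 on success. */
static int demo_query_pkey(struct demo_port *port, int index, uint16_t *pkey)
{
	if (index < 0 || index >= port->pkey_tbl_len)
		return -EINVAL;
	port->queries++;
	*pkey = port->pkey_tbl[index];
	return 0;
}

/*
 * Walk the table entry by entry, querying each one. Bit 15 of a P_Key is
 * the membership bit, so only the low 15 bits are compared here
 * (assumption: both membership types should match).
 */
static int demo_find_pkey(struct demo_port *port, uint16_t pkey, uint16_t *index)
{
	uint16_t tmp;
	int i, ret;

	for (i = 0; i < port->pkey_tbl_len; i++) {
		ret = demo_query_pkey(port, i, &tmp);
		if (ret)
			return ret;
		if ((tmp & 0x7fff) == (pkey & 0x7fff)) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -ENOENT;
}
```

The trade-off is the one named in the patch description: every lookup costs up to one query per table entry (and in the kernel such queries may block), but the result reflects the table as it is at the time of the call.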

-- 
MST


From mst at dev.mellanox.co.il  Wed May  9 05:23:37 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 15:23:37 +0300
Subject: [ofa-general] Re: [PATCHv4 for-2.6.22 2 of 2] ipoib: handle pkey
	change events
In-Reply-To: <4641BC9C.8060501@voltaire.com>
References: <4641BBC5.7040106@voltaire.com> <4641BC9C.8060501@voltaire.com>
Message-ID: <20070509122337.GF10068@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCHv4 for-2.6.22 2 of 2] ipoib: handle pkey change events
> 
> 
> This issue was found during partitioning & SM fail over testing.
> 
>  * Add a flush flag to ipoib_ib_dev_stop(), as ipoib_ib_dev_down() already has
>  * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
>  * Upon a PKEY_CHANGE event, schedule a work item that restarts the QP
>  * Restart child interfaces before the parent. They might be up even if the
>    parent is down
>  * Use an uncached pkey query upon QP initialization
> 
> SM reconfiguration or failover can shuffle the values in the port pkey
> table. The current implementation queries for the index of the pkey only
> once, when it creates the device QP and moves it into the working state,
> and hence does not address this scenario. Fix this by using the
> PKEY_CHANGE event as a trigger to reconfigure the device QP.
> 
> Signed-off-by: Yosef Etigin <yosefe at voltaire.com>

Acked-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
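
The failure mode in the quoted description can be shown with a minimal user-space sketch. All names here (struct demo_qp, demo_lookup(), demo_handle_event() and the DEMO_* events) are hypothetical and only model the idea: a QP remembers a pkey index resolved at setup time, a table reshuffle makes that index point at a different pkey, and only a pkey-change event triggers a fresh lookup and a QP restart.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event types standing in for the IB async events. */
enum demo_event { DEMO_PORT_ACTIVE, DEMO_LID_CHANGE, DEMO_PKEY_CHANGE };

struct demo_qp {
	uint16_t pkey;		/* partition the interface belongs to */
	int pkey_index;		/* index programmed into the QP, -1 if unresolved */
	int restarts;		/* how often the QP was re-initialized */
};

/* Return the table index holding pkey, or -1 if it is absent. */
static int demo_lookup(const uint16_t *tbl, int len, uint16_t pkey)
{
	int i;

	for (i = 0; i < len; i++)
		if (tbl[i] == pkey)
			return i;
	return -1;
}

/* Nonzero if the index stored in the QP still refers to the QP's pkey. */
static int demo_qp_index_valid(const struct demo_qp *qp,
			       const uint16_t *tbl, int len)
{
	return qp->pkey_index >= 0 && qp->pkey_index < len &&
	       tbl[qp->pkey_index] == qp->pkey;
}

/*
 * Port-state style events only flush and leave the QP alone; a pkey
 * change re-resolves the index and counts as a QP restart.
 */
static void demo_handle_event(struct demo_qp *qp, enum demo_event ev,
			      const uint16_t *tbl, int len)
{
	if (ev != DEMO_PKEY_CHANGE)
		return;
	qp->pkey_index = demo_lookup(tbl, len, qp->pkey);
	qp->restarts++;
}
```

With a table { 0xffff, 0x8001 } that is reordered to { 0x8001, 0xffff }, a QP set up for 0x8001 keeps index 1 after a LID change event and therefore refers to the wrong partition; only the pkey-change path moves it to index 0.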


-- 
MST


From mst at dev.mellanox.co.il  Wed May  9 05:25:58 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 15:25:58 +0300
Subject: [ofa-general] Re: [PATCHv4 for-2.6.22 0 of 2] pkey change handling -
	fix bug #577
In-Reply-To: <4641BBC5.7040106@voltaire.com>
References: <4641BBC5.7040106@voltaire.com>
Message-ID: <20070509122558.GG10068@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCHv4 for-2.6.22 0 of 2] pkey change handling - fix bug #577
> 
> These two patches fix bug #577: PKey table reordering caused by SM failover stops ipoib traffic
> patch 1: add uncached device queries to core
> patch 2: restart ipoib qp on pkey change event, and use uncached queries on qp init

Looks good to go to me - not only do these two patches fix a real bug,
but they also move us in the direction we decided we want to go - getting
rid of the cache.

Acked-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

Roland, what do you think?

-- 
MST


From cap at nsc.liu.se  Wed May  9 05:39:57 2007
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Wed, 9 May 2007 14:39:57 +0200
Subject: [ofa-general] ofa_1_2_kernel 20070508-0200 daily build status
In-Reply-To: <20070508093812.9A193E603C1@openfabrics.org>
References: <20070508093812.9A193E603C1@openfabrics.org>
Message-ID: <200705091440.01872.cap@nsc.liu.se>

Not related to the failed 2.6.21.1 below, but, are there any plans to add the 
EL5 and EL4u5 kernels to the list? (2.6.18-8.el5 and 2.6.9-55.{EL,ELsmp}).

Also, out of curiosity, what is the 2.6.16.43-0.3 (below)?

/Peter

On Tuesday 08 May 2007, Vladimir Sokolovsky wrote:
...
> Passed on x86_64 with linux-2.6.16.43-0.3-smp
> Passed on ia64 with linux-2.6.16
> Passed on ia64 with linux-2.6.17
> Passed on x86_64 with linux-2.6.16.21-0.8-smp
> Passed on ia64 with linux-2.6.19
> Passed on x86_64 with linux-2.6.9-42.ELsmp
> Passed on x86_64 with linux-2.6.9-34.ELsmp
> Passed on x86_64 with linux-2.6.9-22.ELsmp
> Passed on x86_64 with linux-2.6.18-1.2798.fc6
> Passed on ia64 with linux-2.6.16.21-0.8-default
>
> Failed:
> Build failed on i686 with linux-2.6.21.1

From mst at dev.mellanox.co.il  Wed May  9 05:45:21 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 15:45:21 +0300
Subject: [ofa-general] ofa_1_2_kernel 20070508-0200 daily build status
In-Reply-To: <200705091440.01872.cap@nsc.liu.se>
References: <20070508093812.9A193E603C1@openfabrics.org>
	<200705091440.01872.cap@nsc.liu.se>
Message-ID: <20070509124521.GI10068@mellanox.co.il>

> Quoting Peter Kjellstrom <cap at nsc.liu.se>:
> Subject: Re: [ofa-general] ofa_1_2_kernel 20070508-0200 daily build status
> 
> Not related to the failed 2.6.21.1 below, but, are there any plans to add the 
> EL5 and EL4u5 kernels to the list? (2.6.18-8.el5 and 2.6.9-55.{EL,ELsmp}).

We do test on them locally, but haven't had the time to prepare these for cross-build yet.
Can you do this?

> Also, out of curiosity, what is the 2.6.16.43-0.3 (below)?

SLES10 I think.


-- 
MST


From fenkes at de.ibm.com  Wed May  9 05:46:23 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Wed, 9 May 2007 14:46:23 +0200
Subject: [ofa-general] [PATCH 0/6] IB/ehca: Assorted patches
Message-ID: <200705091446.23783.fenkes@de.ibm.com>

Here's a set of patches containing various improvements and bugfixes for the
IBM eHCA InfiniBand driver, bumping the version number to SVNEHCA_0023. The
patches are, in detail:

#1 - Serialize hypervisor calls in ehca_register_mr()
#2 - correctly set GRH mask bit in ehca_modify_qp()
#3 - Fix AQP0/1 QP number
#4 - remove _irqsave where it's not needed;
     move an #ifdef to where it makes even better sense
#5 - beautify sysfs attribute code and fix compiler warnings
#6 - disable scaling code by default and bump version number

The patches are ready for inclusion into 2.6.22 and apply on top of Roland's
git tree (which has been pulled by Linus recently, so they should apply there,
too).

Cheers,
  Joachim

-- 
Joachim Fenkes  --  eHCA Linux Driver Developer and Hardware Tamer
IBM Deutschland Entwicklung GmbH  --  Dept. 3627 (I/O Firmware Dev. 2)
Schoenaicher Strasse 220  --  71032 Boeblingen  --  Germany
eMail: fenkes at de.ibm.com  --  Phone: +49 7031 16 1239


From afriedle at open-mpi.org  Wed May  9 06:13:19 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Wed, 09 May 2007 09:13:19 -0400
Subject: [OMPI users] [ofa-general] Re: openMPI over uDAPL doesn't work
In-Reply-To: <46417748.4020602@lfbs.rwth-aachen.de>
References: <462E13A6.3030207@lfbs.rwth-aachen.de>
	<462E1DFE.5010703@Sun.COM>	<46305D0A.5020900@lfbs.rwth-aachen.de>
	<4630EFDE.8070608@Sun.COM>	<464044D4.5010501@lfbs.rwth-aachen.de>	<054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com>	<4640CACE.8070201@ichips.intel.com>	<1178655353.11455.14.camel@stevo-desktop>
	<46417748.4020602@lfbs.rwth-aachen.de>
Message-ID: <4641C8EF.5080709@open-mpi.org>

You say that fixes the problem; does it work even when running more than 
one MPI process per node? (That is the case the hack fixes.) Simply 
doing an mpirun with a -np parameter higher than the number of nodes you 
have set up should trigger this case; make sure to use '-mca btl 
udapl,self' (i.e. not SM or anything else).

Andrew

Boris Bierbaum wrote:
> It has been explained in a different thread on [ofa-general] that the
> problem lies in a combination of the OpenIB-cma provider not setting the
> local and remote port numbers on endpoints correctly and Open MPI
> stepping over the IA to save the port number to circumvent this problem,
> thereby confusing the provider.
> 
> I commented out line 197 in ompi/mca/btl/udapl/btl_udapl.c (Open MPI
> 1.2.1 release) and this fixes the problem. As the problem in the
> provider is currently being fixed, the whole saving of the port number
> in the uDAPL BTL code will be unnecessary in the future.
> 
> Steve Wise wrote:
>>>> Can the UDAPL OFED wizards shed any light on the error messages that  
>>>> are listed below?  In particular, these seem to be worrisome:
>>>>
>>>>>  setup_listener Permission denied
>>>>  setup_listener Address already in use
>>> These failures are from rdma_cm_bind indicating the port is already 
>>> bound to this IA address. How are you creating the service point?
>>> dat_psp_create or dat_psp_create_any? If it is psp_create_any then you 
>>> will see some failures until it  gets to a free port. That is normal. 
>>> Just make sure your create call returns DAT_SUCCESS.
>>>
>> Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down
>> and let the rdma-cma pick an available port number?
>>
>>
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>>
> 
> 


From swise at opengridcomputing.com  Wed May  9 06:41:30 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 09 May 2007 08:41:30 -0500
Subject: [ofa-general] OMPI over ofed udapl - bugs opened
In-Reply-To: <4640FDE9.9010000@ichips.intel.com>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
Message-ID: <1178718090.382.4.camel@stevo-desktop>


606 opened to track the udapl change.

607 opened to track the ompi change to remove the port number stashing
hack.

Status: I have a patch from Arlin to test today.  I will test with that
patch and with the OMPI port hack removed.  Stay tuned...


Steve.

On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote:
> Steve Wise wrote:
> 
> >I would like the group to consider including changes needed to OMPI
> >and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.  
> >
> >This will provide OMPI support over iwarp devices via udapl until we can
> >get rdma-cm support added to OMPI.  
> >
> >
> >Steve.
> >  
> >  
> >
> Steve, can you open a bug to track this?


From boris at lfbs.RWTH-Aachen.DE  Wed May  9 06:50:30 2007
From: boris at lfbs.RWTH-Aachen.DE (Boris Bierbaum)
Date: Wed, 09 May 2007 15:50:30 +0200
Subject: [OMPI users] [ofa-general] Re: openMPI over uDAPL doesn't work
In-Reply-To: <4641C8EF.5080709@open-mpi.org>
References: <462E13A6.3030207@lfbs.rwth-aachen.de> <462E1DFE.5010703@Sun.COM>
	<46305D0A.5020900@lfbs.rwth-aachen.de> <4630EFDE.8070608@Sun.COM>
	<464044D4.5010501@lfbs.rwth-aachen.de>
	<054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com>
	<4640CACE.8070201@ichips.intel.com>
	<1178655353.11455.14.camel@stevo-desktop>
	<46417748.4020602@lfbs.rwth-aachen.de> <4641C8EF.5080709@open-mpi.org>
Message-ID: <4641D1A6.30603@lfbs.rwth-aachen.de>

I've run the whole IMB Benchmark Suite on 2, 3, and 4 nodes with 2
processes per node and --mca btl udapl,self. I didn't encounter any problems.

The comment above line 197 says that dat_ep_query() returns wrong port
numbers (which it does indeed), but I can't find any call to
dat_ep_query() in the uDAPL BTL code. Maybe the comment is out of date?

Boris


Andrew Friedley wrote:
> You say that fixes the problem; does it work even when running more than 
> one MPI process per node? (That is the case the hack fixes.) Simply 
> doing an mpirun with a -np parameter higher than the number of nodes you 
> have set up should trigger this case; make sure to use '-mca btl 
> udapl,self' (i.e. not SM or anything else).
> 
> Andrew
> 
> Boris Bierbaum wrote:
>> It has been explained in a different thread on [ofa-general] that the
>> problem lies in a combination of the OpenIB-cma provider not setting the
>> local and remote port numbers on endpoints correctly and Open MPI
>> stepping over the IA to save the port number to circumvent this problem,
>> thereby confusing the provider.
>>
>> I commented out line 197 in ompi/mca/btl/udapl/btl_udapl.c (Open MPI
>> 1.2.1 release) and this fixes the problem. As the problem in the
>> provider is currently being fixed, the whole saving of the port number
>> in the uDAPL BTL code will be unnecessary in the future.
>>
>> Steve Wise wrote:
>>>>> Can the UDAPL OFED wizards shed any light on the error messages that  
>>>>> are listed below?  In particular, these seem to be worrisome:
>>>>>
>>>>>>  setup_listener Permission denied
>>>>>  setup_listener Address already in use
>>>> These failures are from rdma_cm_bind indicating the port is already 
>>>> bound to this IA address. How are you creating the service point?
>>>> dat_psp_create or dat_psp_create_any? If it is psp_create_any then you 
>>>> will see some failures until it  gets to a free port. That is normal. 
>>>> Just make sure your create call returns DAT_SUCCESS.
>>>>
>>> Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down
>>> and let the rdma-cma pick an available port number?
>>>
>>>
>>>
>>> _______________________________________________
>>> general mailing list
>>> general at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>
>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>>>
>>
> _______________________________________________
> users mailing list
> users at open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


-- 
|  _  RWTH | Boris Bierbaum
|_|_`_     | Lehrstuhl fuer Betriebssysteme
   | |_) _  | RWTH Aachen D-52056 Aachen
     |_)(_` | Tel: +49-241-80-27805
        ._) | Fax: +49-241-80-22339


From erezz at voltaire.com  Wed May  9 06:54:29 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 09 May 2007 16:54:29 +0300
Subject: [ofa-general] [PATCH 0/2] IB/iser: Add open-iscsi over iSER support
 for RHAS4 up3 & up4 in OFED
Message-ID: <4641D295.5060907@voltaire.com>

The following patches add the required backports & kernel addons in order to support open-iscsi over iSER in RHAS4 up3 & up4 in OFED (currently SLES 10, SLES 10 sp1 & RHEL 5 are supported).

-- 
Erez Zilber   |  972-9-971-7689
Software Engineer, Storage Team
Voltaire - The Grid Backbone
www.voltaire.com
  
From erezz at voltaire.com  Wed May  9 06:57:01 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 09 May 2007 16:57:01 +0300
Subject: [ofa-general] [PATCH 1/2] IB/iser: add open-iscsi over iSER support
 for RHAS4 in OFED scripts
In-Reply-To: <4641D295.5060907@voltaire.com>
References: <4641D295.5060907@voltaire.com>
Message-ID: <4641D32D.6030505@voltaire.com>

Add support for open-iscsi over iSER in RHAS4 in OFED's scripts.

Signed-off-by: Erez Zilber <erezz at voltaire.com>
---
 build.sh     |    2 +-
 build_env.sh |    4 ++--
 install.sh   |    2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/build.sh b/build.sh
index d54c55d..be2d1e6 100755
--- a/build.sh
+++ b/build.sh
@@ -344,7 +344,7 @@ open-iscsi()
             SuSE)
 	    ex "$MV -f ${RPM_DIR}/RPMS/$build_arch/${OPEN_ISCSI_SUSE_NAME}-${OPEN_ISCSI_VERSION}.${build_arch}.rpm $RPMS"
 	    ;;
-            redhat5)
+            redhat|redhat5)
 	    ex "$MV -f ${RPM_DIR}/RPMS/$build_arch/${OPEN_ISCSI_REDHAT_NAME}-${OPEN_ISCSI_VERSION}.${build_arch}.rpm $RPMS"
             ;;
 	    *)
diff --git a/build_env.sh b/build_env.sh
index 6e65b21..49821b4 100644
--- a/build_env.sh
+++ b/build_env.sh
@@ -135,7 +135,7 @@ IB_KERNEL_PACKAGES="${IB_KERNEL_PACKAGES
 # Iser
 # Currently iSER is supported only on SLES10 & RHEL5
 case ${K_VER} in
-        2.6.16.*-*-*|2.6.*.el5)
+        2.6.16.*-*-*|2.6.*.el5|2.6.9-*.EL*)
         IB_KERNEL_PACKAGES="${IB_KERNEL_PACKAGES} ib_iser"
         ;;
 esac
@@ -1998,7 +1998,7 @@ set_package_deps()
                     ib_iser)
 			# Currently iSER is supported only on SLES10 & RHEL5
                         case ${K_VER} in
-                        2.6.16.*-*-*|2.6.*.el5)
+                        2.6.16.*-*-*|2.6.*.el5|2.6.9-*.EL*)
                             OFA_KERNEL_PACKAGES=$(echo "$OFA_KERNEL_PACKAGES ib_verbs ${ll_driver} ib_iser" | tr -s ' ' '\n' | sort -n | uniq)
                             OFA_PACKAGES=$(echo "$OFA_PACKAGES kernel-ib" | tr -s ' ' '\n' | sort -n | uniq)
                             EXTRA_PACKAGES=$(echo "$EXTRA_PACKAGES open-iscsi" | tr -s ' ' '\n' | sort -rn | uniq)
diff --git a/install.sh b/install.sh
index f9ed6da..dadc144 100755
--- a/install.sh
+++ b/install.sh
@@ -990,7 +990,7 @@ #    fi    
                 err_echo "${OPEN_ISCSI_SUSE_NAME}-${OPEN_ISCSI_VERSION}.${build_arch}.rpm not found under ${RPMS}."
             fi
             ;;
-            redhat5)
+            redhat|redhat5)
 	    if [ -f ${RPMS}/${OPEN_ISCSI_REDHAT_NAME}-${OPEN_ISCSI_VERSION}.${build_arch}.rpm ]; then
                 ex "$RPM -Uhv --oldpackage ${RPMS}/${OPEN_ISCSI_REDHAT_NAME}-${OPEN_ISCSI_VERSION}.${build_arch}.rpm"
             else
-- 
1.4.2

  
From erezz at voltaire.com  Wed May  9 06:58:34 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 09 May 2007 16:58:34 +0300
Subject: [ofa-general] [PATCH 2/2] IB/iser: add backport & kernel addons for
 open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <4641D295.5060907@voltaire.com>
References: <4641D295.5060907@voltaire.com>
Message-ID: <4641D38A.8040406@voltaire.com>


Add the required backport patches & kernel addons for open-iscsi
over iSER in RHAS4 up3 and up4.

Signed-off-by: Erez Zilber <erezz at voltaire.com>
---
 .../2.6.9_U3/include/linux/attribute_container.h   |   71 +++
 .../backport/2.6.9_U3/include/linux/klist.h        |   61 ++
 .../backport/2.6.9_U3/include/scsi/scsi_device.h   |   19 +
 .../2.6.9_U3/include/scsi/scsi_transport.h         |    8 
 .../2.6.9_U3/include/src/attribute_container.c     |  438 +++++++++++++++++
 kernel_addons/backport/2.6.9_U3/include/src/base.h |    1 
 kernel_addons/backport/2.6.9_U3/include/src/init.c |   26 +
 .../backport/2.6.9_U3/include/src/klist.c          |  287 +++++++++++
 .../backport/2.6.9_U3/include/src/kref_new.c       |   29 +
 kernel_addons/backport/2.6.9_U3/include/src/scsi.c |   50 ++
 .../backport/2.6.9_U3/include/src/scsi_lib.c       |  164 ++++++
 .../backport/2.6.9_U3/include/src/scsi_scan.c      |   48 ++
 .../2.6.9_U3/include/src/transport_class.c         |  280 +++++++++++
 .../2.6.9_U4/include/linux/attribute_container.h   |   71 +++
 .../backport/2.6.9_U4/include/linux/klist.h        |   61 ++
 .../backport/2.6.9_U4/include/scsi/scsi_device.h   |   19 +
 .../2.6.9_U4/include/scsi/scsi_transport.h         |    8 
 .../2.6.9_U4/include/src/attribute_container.c     |  438 +++++++++++++++++
 kernel_addons/backport/2.6.9_U4/include/src/base.h |    1 
 kernel_addons/backport/2.6.9_U4/include/src/init.c |   26 +
 .../backport/2.6.9_U4/include/src/klist.c          |  287 +++++++++++
 .../backport/2.6.9_U4/include/src/kref_new.c       |   29 +
 kernel_addons/backport/2.6.9_U4/include/src/scsi.c |   50 ++
 .../backport/2.6.9_U4/include/src/scsi_lib.c       |  164 ++++++
 .../backport/2.6.9_U4/include/src/scsi_scan.c      |   48 ++
 .../2.6.9_U4/include/src/transport_class.c         |  280 +++++++++++
 .../backport/2.6.9_U3/add_iscsi_proto_h.patch      |  591 +++++++++++++++++++++++
 kernel_patches/backport/2.6.9_U3/add_iser.patch    |   13 
 .../backport/2.6.9_U3/add_memory_h.patch           |   93 ++++
 .../backport/2.6.9_U3/add_open_iscsi.patch         |  504 ++++++++++++++++++++
 .../backport/2.6.9_U3/add_open_iscsi_h.patch       |   60 ++
 .../backport/2.6.9_U3/add_transport_class_h.patch  |  104 ++++
 .../2.6.9_U3/fix_inclusion_order_iscsi_iser.patch  |   13 +
 .../backport/2.6.9_U3/linux_stuff_to_2_6_17.patch  |   58 ++
 .../2.6.9_U3/netlink-01-add_netlink_h.patch        |  247 ++++++++++
 .../2.6.9_U3/netlink-02-netlink_h_for_rh4.patch    |  200 ++++++++
 .../backport/2.6.9_U4/add_iscsi_proto_h.patch      |  591 +++++++++++++++++++++++
 kernel_patches/backport/2.6.9_U4/add_iser.patch    |   13 
 .../backport/2.6.9_U4/add_memory_h.patch           |   93 ++++
 .../backport/2.6.9_U4/add_open_iscsi.patch         |  504 ++++++++++++++++++++
 .../backport/2.6.9_U4/add_open_iscsi_h.patch       |   60 ++
 .../backport/2.6.9_U4/add_transport_class_h.patch  |  104 ++++
 .../2.6.9_U4/fix_inclusion_order_iscsi_iser.patch  |   13 +
 .../backport/2.6.9_U4/linux_stuff_to_2_6_17.patch  |   58 ++
 .../2.6.9_U4/netlink-01-add_netlink_h.patch        |  247 ++++++++++
 .../2.6.9_U4/netlink-02-netlink_h_for_rh4.patch    |  200 ++++++++
 46 files changed, 6728 insertions(+), 2 deletions(-)

diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/attribute_container.h b/kernel_addons/backport/2.6.9_U3/include/linux/attribute_container.h
new file mode 100644
index 0000000..93bfb0b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/linux/attribute_container.h
@@ -0,0 +1,71 @@
+/*
+ * class_container.h - a generic container for all classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ */
+
+#ifndef _ATTRIBUTE_CONTAINER_H_
+#define _ATTRIBUTE_CONTAINER_H_
+
+#include <linux/device.h>
+#include <linux/list.h>
+#include <linux/klist.h>
+#include <linux/spinlock.h>
+
+struct attribute_container {
+	struct list_head	node;
+	struct klist		containers;
+	struct class		*class;
+	struct class_device_attribute **attrs;
+	int (*match)(struct attribute_container *, struct device *);
+#define	ATTRIBUTE_CONTAINER_NO_CLASSDEVS	0x01
+	unsigned long		flags;
+};
+
+static inline int
+attribute_container_no_classdevs(struct attribute_container *atc)
+{
+	return atc->flags & ATTRIBUTE_CONTAINER_NO_CLASSDEVS;
+}
+
+static inline void
+attribute_container_set_no_classdevs(struct attribute_container *atc)
+{
+	atc->flags |= ATTRIBUTE_CONTAINER_NO_CLASSDEVS;
+}
+
+int attribute_container_register(struct attribute_container *cont);
+int attribute_container_unregister(struct attribute_container *cont);
+void attribute_container_create_device(struct device *dev,
+				       int (*fn)(struct attribute_container *,
+						 struct device *,
+						 struct class_device *));
+void attribute_container_add_device(struct device *dev,
+				    int (*fn)(struct attribute_container *,
+					      struct device *,
+					      struct class_device *));
+void attribute_container_remove_device(struct device *dev,
+				       void (*fn)(struct attribute_container *,
+						  struct device *,
+						  struct class_device *));
+void attribute_container_device_trigger(struct device *dev, 
+					int (*fn)(struct attribute_container *,
+						  struct device *,
+						  struct class_device *));
+void attribute_container_trigger(struct device *dev, 
+				 int (*fn)(struct attribute_container *,
+					   struct device *));
+int attribute_container_add_attrs(struct class_device *classdev);
+int attribute_container_add_class_device(struct class_device *classdev);
+int attribute_container_add_class_device_adapter(struct attribute_container *cont,
+						 struct device *dev,
+						 struct class_device *classdev);
+void attribute_container_remove_attrs(struct class_device *classdev);
+void attribute_container_class_device_del(struct class_device *classdev);
+struct attribute_container *attribute_container_classdev_to_container(struct class_device *);
+struct class_device *attribute_container_find_class_device(struct attribute_container *, struct device *);
+struct class_device_attribute **attribute_container_classdev_to_attrs(const struct class_device *classdev);
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/klist.h b/kernel_addons/backport/2.6.9_U3/include/linux/klist.h
new file mode 100644
index 0000000..7407125
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/linux/klist.h
@@ -0,0 +1,61 @@
+/*
+ *	klist.h - Some generic list helpers, extending struct list_head a bit.
+ *
+ *	Implementations are found in lib/klist.c
+ *
+ *
+ *	Copyright (C) 2005 Patrick Mochel
+ *
+ *	This file is released under the GPL v2.
+ */
+
+#ifndef _LINUX_KLIST_H
+#define _LINUX_KLIST_H
+
+#include <linux/spinlock.h>
+#include <linux/completion.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+
+struct klist_node;
+struct klist {
+	spinlock_t		k_lock;
+	struct list_head	k_list;
+	void			(*get)(struct klist_node *);
+	void			(*put)(struct klist_node *);
+};
+
+
+extern void klist_init(struct klist * k, void (*get)(struct klist_node *),
+		       void (*put)(struct klist_node *));
+
+struct klist_node {
+	struct klist		* n_klist;
+	struct list_head	n_node;
+	struct kref		n_ref;
+	struct completion	n_removed;
+};
+
+extern void klist_add_tail(struct klist_node * n, struct klist * k);
+extern void klist_add_head(struct klist_node * n, struct klist * k);
+
+extern void klist_del(struct klist_node * n);
+extern void klist_remove(struct klist_node * n);
+
+extern int klist_node_attached(struct klist_node * n);
+
+
+struct klist_iter {
+	struct klist		* i_klist;
+	struct list_head	* i_head;
+	struct klist_node	* i_cur;
+};
+
+
+extern void klist_iter_init(struct klist * k, struct klist_iter * i);
+extern void klist_iter_init_node(struct klist * k, struct klist_iter * i, 
+				 struct klist_node * n);
+extern void klist_iter_exit(struct klist_iter * i);
+extern struct klist_node * klist_next(struct klist_iter * i);
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_device.h b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_device.h
new file mode 100644
index 0000000..f353e0b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_device.h
@@ -0,0 +1,19 @@
+#ifndef _SCSI_SCSI_DEVICE_H_BACKPORT
+#define _SCSI_SCSI_DEVICE_H_BACKPORT
+
+#include_next <scsi/scsi_device.h>
+
+#include <linux/device.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/workqueue.h>
+#include <asm/atomic.h>
+
+struct scsi_lun;
+
+extern void int_to_scsilun(unsigned int, struct scsi_lun *);
+extern void scsi_target_block(struct device *);
+extern void scsi_target_unblock(struct device *);
+extern void starget_for_each_device(struct scsi_target *, void *,
+		     void (*fn)(struct scsi_device *, void *));
+#endif /* _SCSI_SCSI_DEVICE_H_BACKPORT */
diff --git a/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_transport.h b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_transport.h
new file mode 100644
index 0000000..99c2b12
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_transport.h
@@ -0,0 +1,8 @@
+#ifndef _SCSI_SCSI_TRANSPORT_H_BACKPORT
+#define _SCSI_SCSI_TRANSPORT_H_BACKPORT
+
+#include_next <scsi/scsi_transport.h>
+
+#include <linux/transport_class.h>
+
+#endif /* _SCSI_SCSI_TRANSPORT_H_BACKPORT */
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/attribute_container.c b/kernel_addons/backport/2.6.9_U3/include/src/attribute_container.c
new file mode 100644
index 0000000..44948d1
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/attribute_container.c
@@ -0,0 +1,438 @@
+/*
+ * attribute_container.c - implementation of a simple container for classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ *
+ * The basic idea here is to enable a device to be attached to an
+ * arbitrary number of classes without having to allocate storage for them.
+ * Instead, the contained classes select the devices they need to attach
+ * to via a matching function.
+ */
+
+#include <linux/attribute_container.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/module.h>
+
+#include "base.h"
+
+/* This is a private structure used to tie the classdev and the
+ * container .. it should never be visible outside this file */
+struct internal_container {
+	struct klist_node node;
+	struct attribute_container *cont;
+	struct class_device classdev;
+};
+
+static void internal_container_klist_get(struct klist_node *n)
+{
+	struct internal_container *ic =
+		container_of(n, struct internal_container, node);
+	class_device_get(&ic->classdev);
+}
+
+static void internal_container_klist_put(struct klist_node *n)
+{
+	struct internal_container *ic =
+		container_of(n, struct internal_container, node);
+	class_device_put(&ic->classdev);
+}
+
+
+/**
+ * attribute_container_classdev_to_container - given a classdev, return the container
+ *
+ * @classdev: the class device created by attribute_container_add_device.
+ *
+ * Returns the container associated with this classdev.
+ */
+struct attribute_container *
+attribute_container_classdev_to_container(struct class_device *classdev)
+{
+	struct internal_container *ic =
+		container_of(classdev, struct internal_container, classdev);
+	return ic->cont;
+}
+EXPORT_SYMBOL_GPL(attribute_container_classdev_to_container);
+
+static struct list_head attribute_container_list;
+
+static DECLARE_MUTEX(attribute_container_mutex);
+
+/**
+ * attribute_container_register - register an attribute container
+ *
+ * @cont: The container to register.  This must be allocated by the
+ *        caller and should also be zeroed by it.
+ */
+int
+attribute_container_register(struct attribute_container *cont)
+{
+	INIT_LIST_HEAD(&cont->node);
+	klist_init(&cont->containers,internal_container_klist_get,
+		   internal_container_klist_put);
+		
+	down(&attribute_container_mutex);
+	list_add_tail(&cont->node, &attribute_container_list);
+	up(&attribute_container_mutex);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(attribute_container_register);
+
+/**
+ * attribute_container_unregister - remove a container registration
+ *
+ * @cont: previously registered container to remove
+ */
+int
+attribute_container_unregister(struct attribute_container *cont)
+{
+	int retval = -EBUSY;
+	down(&attribute_container_mutex);
+	spin_lock(&cont->containers.k_lock);
+	if (!list_empty(&cont->containers.k_list))
+		goto out;
+	retval = 0;
+	list_del(&cont->node);
+ out:
+	spin_unlock(&cont->containers.k_lock);
+	up(&attribute_container_mutex);
+	return retval;
+		
+}
+EXPORT_SYMBOL_GPL(attribute_container_unregister);
+
+/* private function used as class release */
+static void attribute_container_release(struct class_device *classdev)
+{
+	struct internal_container *ic 
+		= container_of(classdev, struct internal_container, classdev);
+	struct device *dev = classdev->dev;
+
+	kfree(ic);
+	put_device(dev);
+}
+
+/**
+ * attribute_container_add_device - see if any container is interested in dev
+ *
+ * @dev: device to add attributes to
+ * @fn:	 function to trigger addition of class device.
+ *
+ * This function allocates storage for the class device(s) to be
+ * attached to dev (one for each matching attribute_container).  If no
+ * fn is provided, the code will simply register the class device via
+ * class_device_add.  If a function is provided, it is expected to add
+ * the class device at the appropriate time.  One of the things that
+ * might be necessary is to allocate and initialise the classdev and
+ * then add it at a later time.  To do this, call this routine for
+ * allocation and initialisation and then use
+ * attribute_container_device_trigger() to call class_device_add() on
+ * it.  Note: after this, the class device contains a reference to dev
+ * which is not relinquished until the release of the classdev.
+ */
+void
+attribute_container_add_device(struct device *dev,
+			       int (*fn)(struct attribute_container *,
+					 struct device *,
+					 struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+
+		if (attribute_container_no_classdevs(cont))
+			continue;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		ic = kzalloc(sizeof(*ic), GFP_KERNEL);
+		if (!ic) {
+			dev_printk(KERN_ERR, dev, "failed to allocate class container\n");
+			continue;
+		}
+
+		ic->cont = cont;
+		class_device_initialize(&ic->classdev);
+		ic->classdev.dev = get_device(dev);
+		ic->classdev.class = cont->class;
+		cont->class->release = attribute_container_release;
+		strcpy(ic->classdev.class_id, dev->bus_id);
+		if (fn)
+			fn(cont, dev, &ic->classdev);
+		else
+			attribute_container_add_class_device(&ic->classdev);
+		klist_add_tail(&ic->node, &cont->containers);
+	}
+	up(&attribute_container_mutex);
+}
+
+/* FIXME: can't break out of this unless klist_iter_exit is also
+ * called before doing the break
+ */
+#define klist_for_each_entry(pos, head, member, iter) \
+	for (klist_iter_init(head, iter); (pos = ({ \
+		struct klist_node *n = klist_next(iter); \
+		n ? container_of(n, typeof(*pos), member) : \
+			({ klist_iter_exit(iter) ; NULL; }); \
+	}) ) != NULL; )
+			
+
+/**
+ * attribute_container_remove_device - make device eligible for removal.
+ *
+ * @dev:  The generic device
+ * @fn:	  A function to call to remove the device
+ *
+ * This routine triggers device removal.  If fn is NULL, then it is
+ * simply done via class_device_unregister (note that if something
+ * still has a reference to the classdev, then the memory occupied
+ * will not be freed until the classdev is released).  If you want a
+ * two phase release: remove from visibility and then delete the
+ * device, then you should use this routine with a fn that calls
+ * class_device_del() and then use
+ * attribute_container_device_trigger() to do the final put on the
+ * classdev.
+ */
+void
+attribute_container_remove_device(struct device *dev,
+				  void (*fn)(struct attribute_container *,
+					     struct device *,
+					     struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+		struct klist_iter iter;
+
+		if (attribute_container_no_classdevs(cont))
+			continue;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		klist_for_each_entry(ic, &cont->containers, node, &iter) {
+			if (dev != ic->classdev.dev)
+				continue;
+			klist_del(&ic->node);
+			if (fn)
+				fn(cont, dev, &ic->classdev);
+			else {
+				attribute_container_remove_attrs(&ic->classdev);
+				class_device_unregister(&ic->classdev);
+			}
+		}
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_device_trigger - execute a trigger for each matching classdev
+ *
+ * @dev:  The generic device to run the trigger for
+ * @fn:	  the function to execute for each classdev.
+ *
+ * This function is for executing a trigger when you need to know both
+ * the container and the classdev.  If you only care about the
+ * container, then use attribute_container_trigger() instead.
+ */
+void
+attribute_container_device_trigger(struct device *dev, 
+				   int (*fn)(struct attribute_container *,
+					     struct device *,
+					     struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+		struct klist_iter iter;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		if (attribute_container_no_classdevs(cont)) {
+			fn(cont, dev, NULL);
+			continue;
+		}
+
+		klist_for_each_entry(ic, &cont->containers, node, &iter) {
+			if (dev == ic->classdev.dev)
+				fn(cont, dev, &ic->classdev);
+		}
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_trigger - trigger a function for each matching container
+ *
+ * @dev:  The generic device to activate the trigger for
+ * @fn:	  the function to trigger
+ *
+ * This routine triggers a function that only needs to know the
+ * matching containers (not the classdev) associated with a device.
+ * It is more lightweight than attribute_container_device_trigger, so
+ * should be used in preference unless the triggering function
+ * actually needs to know the classdev.
+ */
+void
+attribute_container_trigger(struct device *dev,
+			    int (*fn)(struct attribute_container *,
+				      struct device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		if (cont->match(cont, dev))
+			fn(cont, dev);
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_add_attrs - add attributes
+ *
+ * @classdev: The class device
+ *
+ * This simply creates all the class device sysfs files from the
+ * attributes listed in the container
+ */
+int
+attribute_container_add_attrs(struct class_device *classdev)
+{
+	struct attribute_container *cont =
+		attribute_container_classdev_to_container(classdev);
+	struct class_device_attribute **attrs =	cont->attrs;
+	int i, error;
+
+	if (!attrs)
+		return 0;
+
+	for (i = 0; attrs[i]; i++) {
+		error = class_device_create_file(classdev, attrs[i]);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/**
+ * attribute_container_add_class_device - same function as class_device_add
+ *
+ * @classdev:	the class device to add
+ *
+ * This performs essentially the same function as class_device_add except for
+ * attribute containers, namely add the classdev to the system and then
+ * create the attribute files
+ */
+int
+attribute_container_add_class_device(struct class_device *classdev)
+{
+	int error = class_device_add(classdev);
+	if (error)
+		return error;
+	return attribute_container_add_attrs(classdev);
+}
+
+/**
+ * attribute_container_add_class_device_adapter - simple adapter for triggers
+ *
+ * This function is identical to attribute_container_add_class_device except
+ * that it is designed to be called from the triggers
+ */
+int
+attribute_container_add_class_device_adapter(struct attribute_container *cont,
+					     struct device *dev,
+					     struct class_device *classdev)
+{
+	return attribute_container_add_class_device(classdev);
+}
+
+/**
+ * attribute_container_remove_attrs - remove any attribute files
+ *
+ * @classdev: The class device to remove the files from
+ *
+ */
+void
+attribute_container_remove_attrs(struct class_device *classdev)
+{
+	struct attribute_container *cont =
+		attribute_container_classdev_to_container(classdev);
+	struct class_device_attribute **attrs =	cont->attrs;
+	int i;
+
+	if (!attrs)
+		return;
+
+	for (i = 0; attrs[i]; i++)
+		class_device_remove_file(classdev, attrs[i]);
+}
+
+/**
+ * attribute_container_class_device_del - equivalent of class_device_del
+ *
+ * @classdev: the class device
+ *
+ * This function simply removes all the attribute files and then calls
+ * class_device_del.
+ */
+void
+attribute_container_class_device_del(struct class_device *classdev)
+{
+	attribute_container_remove_attrs(classdev);
+	class_device_del(classdev);
+}
+
+/**
+ * attribute_container_find_class_device - find the corresponding class_device
+ *
+ * @cont:	the container
+ * @dev:	the generic device
+ *
+ * Looks up the device in the container's list of class devices and returns
+ * the corresponding class_device.
+ */
+struct class_device *
+attribute_container_find_class_device(struct attribute_container *cont,
+				      struct device *dev)
+{
+	struct class_device *cdev = NULL;
+	struct internal_container *ic;
+	struct klist_iter iter;
+
+	klist_for_each_entry(ic, &cont->containers, node, &iter) {
+		if (ic->classdev.dev == dev) {
+			cdev = &ic->classdev;
+			/* FIXME: must exit iterator then break */
+			klist_iter_exit(&iter);
+			break;
+		}
+	}
+
+	return cdev;
+}
+EXPORT_SYMBOL_GPL(attribute_container_find_class_device);
+
+int
+attribute_container_init(void)
+{
+	INIT_LIST_HEAD(&attribute_container_list);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(attribute_container_init);
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/base.h b/kernel_addons/backport/2.6.9_U3/include/src/base.h
new file mode 100644
index 0000000..a5f8936
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/base.h
@@ -0,0 +1 @@
+extern int attribute_container_init(void);
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/init.c b/kernel_addons/backport/2.6.9_U3/include/src/init.c
new file mode 100644
index 0000000..15f0bc6
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/init.c
@@ -0,0 +1,26 @@
+/*
+ *
+ * Copyright (c) 2002-3 Patrick Mochel
+ * Copyright (c) 2002-3 Open Source Development Labs
+ *
+ * This file is released under the GPLv2
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/memory.h>
+
+#include "base.h"
+
+/**
+ *	driver_init - initialize driver model.
+ *
+ *	Call the driver model init functions to initialize their
+ *	subsystems. Called early from init/main.c.
+ */
+
+void __init driver_init(void)
+{
+	attribute_container_init();
+}
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/klist.c b/kernel_addons/backport/2.6.9_U3/include/src/klist.c
new file mode 100644
index 0000000..3b29ebc
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/klist.c
@@ -0,0 +1,287 @@
+/*
+ *	klist.c - Routines for manipulating klists.
+ *
+ *
+ *	This klist interface provides a couple of structures that wrap around 
+ *	struct list_head to provide explicit list "head" (struct klist) and 
+ *	list "node" (struct klist_node) objects. For struct klist, a spinlock
+ *	is included that protects access to the actual list itself. struct 
+ *	klist_node provides a pointer to the klist that owns it and a kref
+ *	reference count that indicates the number of current users of that node
+ *	in the list.
+ *
+ *	The entire point is to provide an interface for iterating over a list
+ *	that is safe and allows for modification of the list during the
+ *	iteration (e.g. insertion and removal), including modification of the
+ *	current node on the list.
+ *
+ *	It works using a 3rd object type - struct klist_iter - that is declared
+ *	and initialized before an iteration. klist_next() is used to acquire the
+ *	next element in the list. It returns NULL if there are no more items.
+ *	Internally, that routine takes the klist's lock, decrements the reference
+ *	count of the previous klist_node and increments the count of the next
+ *	klist_node. It then drops the lock and returns.
+ *
+ *	There are primitives for adding and removing nodes to/from a klist. 
+ *	When deleting, klist_del() will simply decrement the reference count. 
+ *	Only when the count goes to 0 is the node removed from the list. 
+ *	klist_remove() will try to delete the node from the list and block
+ *	until it is actually removed. This is useful for objects (like devices)
+ *	that have been removed from the system and must be freed (but must wait
+ *	until all accessors have finished).
+ *
+ *	Copyright (C) 2005 Patrick Mochel
+ *
+ *	This file is released under the GPL v2.
+ */
+
+#include <linux/klist.h>
+#include <linux/module.h>
+
+
+/**
+ *	klist_init - Initialize a klist structure. 
+ *	@k:	The klist we're initializing.
+ *	@get:	The get function for the embedding object (NULL if none)
+ *	@put:	The put function for the embedding object (NULL if none)
+ *
+ * Initialises the klist structure.  If the klist_node structures are
+ * going to be embedded in refcounted objects (necessary for safe
+ * deletion) then the get/put arguments are used to initialise
+ * functions that take and release references on the embedding
+ * objects.
+ */
+
+void klist_init(struct klist * k, void (*get)(struct klist_node *),
+		void (*put)(struct klist_node *))
+{
+	INIT_LIST_HEAD(&k->k_list);
+	spin_lock_init(&k->k_lock);
+	k->get = get;
+	k->put = put;
+}
+
+EXPORT_SYMBOL_GPL(klist_init);
+
+
+static void add_head(struct klist * k, struct klist_node * n)
+{
+	spin_lock(&k->k_lock);
+	list_add(&n->n_node, &k->k_list);
+	spin_unlock(&k->k_lock);
+}
+
+static void add_tail(struct klist * k, struct klist_node * n)
+{
+	spin_lock(&k->k_lock);
+	list_add_tail(&n->n_node, &k->k_list);
+	spin_unlock(&k->k_lock);
+}
+
+
+static void klist_node_init(struct klist * k, struct klist_node * n)
+{
+	INIT_LIST_HEAD(&n->n_node);
+	init_completion(&n->n_removed);
+	kref_init(&n->n_ref);
+	n->n_klist = k;
+	if (k->get)
+		k->get(n);
+}
+
+
+/**
+ *	klist_add_head - Initialize a klist_node and add it to front.
+ *	@n:	node we're adding.
+ *	@k:	klist it's going on.
+ */
+
+void klist_add_head(struct klist_node * n, struct klist * k)
+{
+	klist_node_init(k, n);
+	add_head(k, n);
+}
+
+EXPORT_SYMBOL_GPL(klist_add_head);
+
+
+/**
+ *	klist_add_tail - Initialize a klist_node and add it to back.
+ *	@n:	node we're adding.
+ *	@k:	klist it's going on.
+ */
+
+void klist_add_tail(struct klist_node * n, struct klist * k)
+{
+	klist_node_init(k, n);
+	add_tail(k, n);
+}
+
+EXPORT_SYMBOL_GPL(klist_add_tail);
+
+
+static void klist_release(struct kref * kref)
+{
+	struct klist_node * n = container_of(kref, struct klist_node, n_ref);
+
+	list_del(&n->n_node);
+	complete(&n->n_removed);
+	n->n_klist = NULL;
+}
+
+static int klist_dec_and_del(struct klist_node * n)
+{
+	return kref_put_new(&n->n_ref, klist_release);
+}
+
+
+/**
+ *	klist_del - Decrement the reference count of node and try to remove.
+ *	@n:	node we're deleting.
+ */
+
+void klist_del(struct klist_node * n)
+{
+	struct klist * k = n->n_klist;
+	void (*put)(struct klist_node *) = k->put;
+
+	spin_lock(&k->k_lock);
+	if (!klist_dec_and_del(n))
+		put = NULL;
+	spin_unlock(&k->k_lock);
+	if (put)
+		put(n);
+}
+
+EXPORT_SYMBOL_GPL(klist_del);
+
+
+/**
+ *	klist_remove - Decrement the refcount of node and wait for it to go away.
+ *	@n:	node we're removing.
+ */
+
+void klist_remove(struct klist_node * n)
+{
+	klist_del(n);
+	wait_for_completion(&n->n_removed);
+}
+
+EXPORT_SYMBOL_GPL(klist_remove);
+
+
+/**
+ *	klist_node_attached - Say whether a node is bound to a list or not.
+ *	@n:	Node that we're testing.
+ */
+
+int klist_node_attached(struct klist_node * n)
+{
+	return (n->n_klist != NULL);
+}
+
+EXPORT_SYMBOL_GPL(klist_node_attached);
+
+
+/**
+ *	klist_iter_init_node - Initialize a klist_iter structure.
+ *	@k:	klist we're iterating.
+ *	@i:	klist_iter we're filling.
+ *	@n:	node to start with.
+ *
+ *	Similar to klist_iter_init(), but starts the action off with @n, 
+ *	instead of with the list head.
+ */
+
+void klist_iter_init_node(struct klist * k, struct klist_iter * i, struct klist_node * n)
+{
+	i->i_klist = k;
+	i->i_head = &k->k_list;
+	i->i_cur = n;
+	if (n)
+		kref_get(&n->n_ref);
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_init_node);
+
+
+/**
+ *	klist_iter_init - Initialize a klist_iter structure.
+ *	@k:	klist we're iterating.
+ *	@i:	klist_iter structure we're filling.
+ *
+ *	Similar to klist_iter_init_node(), but start with the list head.
+ */
+
+void klist_iter_init(struct klist * k, struct klist_iter * i)
+{
+	klist_iter_init_node(k, i, NULL);
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_init);
+
+
+/**
+ *	klist_iter_exit - Finish a list iteration.
+ *	@i:	Iterator structure.
+ *
+ *	Must be called when done iterating over list, as it decrements the 
+ *	refcount of the current node. Necessary in case iteration exited before
+ *	the end of the list was reached, and always good form.
+ */
+
+void klist_iter_exit(struct klist_iter * i)
+{
+	if (i->i_cur) {
+		klist_del(i->i_cur);
+		i->i_cur = NULL;
+	}
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_exit);
+
+
+static struct klist_node * to_klist_node(struct list_head * n)
+{
+	return container_of(n, struct klist_node, n_node);
+}
+
+
+/**
+ *	klist_next - Ante up next node in list.
+ *	@i:	Iterator structure.
+ *
+ *	First grab list lock. Decrement the reference count of the previous
+ *	node, if there was one. Grab the next node, increment its reference 
+ *	count, drop the lock, and return that next node.
+ */
+
+struct klist_node * klist_next(struct klist_iter * i)
+{
+	struct list_head * next;
+	struct klist_node * lnode = i->i_cur;
+	struct klist_node * knode = NULL;
+	void (*put)(struct klist_node *) = i->i_klist->put;
+
+	spin_lock(&i->i_klist->k_lock);
+	if (lnode) {
+		next = lnode->n_node.next;
+		if (!klist_dec_and_del(lnode))
+			put = NULL;
+	} else
+		next = i->i_head->next;
+
+	if (next != i->i_head) {
+		knode = to_klist_node(next);
+		kref_get(&knode->n_ref);
+	}
+	i->i_cur = knode;
+	spin_unlock(&i->i_klist->k_lock);
+	if (put && lnode)
+		put(lnode);
+	return knode;
+}
+
+EXPORT_SYMBOL_GPL(klist_next);
+
+
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/kref_new.c b/kernel_addons/backport/2.6.9_U3/include/src/kref_new.c
new file mode 100644
index 0000000..d45bb3f
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/kref_new.c
@@ -0,0 +1,29 @@
+#include <linux/kref.h>
+#include <linux/module.h>
+
+/**
+ * kref_put_new - decrement refcount for object.
+ * @kref: object.
+ * @release: pointer to the function that will clean up the object when the
+ *           last reference to the object is released.
+ *           This pointer is required, and it is not acceptable to pass kfree
+ *           in as this function.
+ *
+ * Decrement the refcount, and if 0, call release().
+ * Return 1 if the object was removed, otherwise return 0.  Beware, if this
+ * function returns 0, you still cannot count on the kref remaining in
+ * memory.  Only use the return value if you want to see if the kref is now
+ * gone, not present.
+ */
+int kref_put_new(struct kref *kref, void (*release)(struct kref *kref))
+{
+        WARN_ON(release == NULL);
+        WARN_ON(release == (void (*)(struct kref *))kfree);
+
+        if (atomic_dec_and_test(&kref->refcount)) {
+                release(kref);
+                return 1;
+        }
+        return 0;
+}
+EXPORT_SYMBOL(kref_put_new);
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/scsi.c b/kernel_addons/backport/2.6.9_U3/include/src/scsi.c
new file mode 100644
index 0000000..8c833c0
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/scsi.c
@@ -0,0 +1,50 @@
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/delay.h>
+#include <linux/init.h>
+#include <linux/completion.h>
+#include <linux/unistd.h>
+#include <linux/spinlock.h>
+#include <linux/kmod.h>
+#include <linux/interrupt.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
+#include <linux/mutex.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_dbg.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_tcq.h>
+
+/**
+ * starget_for_each_device  -  helper to walk all devices of a target
+ * @starget:	target whose devices we want to iterate over.
+ *
+ * This traverses each device of @starget.  The devices have
+ * a reference that must be released by scsi_host_put when breaking
+ * out of the loop.
+ */
+void starget_for_each_device(struct scsi_target *starget, void * data,
+		     void (*fn)(struct scsi_device *, void *))
+{
+	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
+	struct scsi_device *sdev;
+
+	printk("%s: entry\n", __FUNCTION__);
+	shost_for_each_device(sdev, shost) {
+		if ((sdev->channel == starget->channel) &&
+		    (sdev->id == starget->id))
+			fn(sdev, data);
+	}
+	printk("%s: exit\n", __FUNCTION__);
+}
+EXPORT_SYMBOL(starget_for_each_device);
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/scsi_lib.c b/kernel_addons/backport/2.6.9_U3/include/src/scsi_lib.c
new file mode 100644
index 0000000..327b53f
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/scsi_lib.c
@@ -0,0 +1,164 @@
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/completion.h>
+#include <linux/kernel.h>
+#include <linux/mempool.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <linux/hardirq.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_dbg.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_driver.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_host.h>
+
+int scsi_is_target_device(const struct device *dev)
+{
+        char *str = dev->bus_id;
+
+	if (strncmp(str, "target", 6) == 0) {
+		return 1;
+	}
+
+        return 0;
+}
+
+/**
+ * scsi_internal_device_block - internal function to put a device
+ *                              temporarily into the SDEV_BLOCK state
+ * @sdev:       device to block
+ *
+ * Block request made by scsi lld's to temporarily stop all
+ * scsi commands on the specified device.  Called from interrupt
+ * or normal process context.
+ *
+ * Returns zero if successful or error if not
+ *
+ * Notes:
+ *      This routine transitions the device to the SDEV_BLOCK state
+ *      (which must be a legal transition).  When the device is in this
+ *      state, all commands are deferred until the scsi lld reenables
+ *      the device with scsi_device_unblock or device_block_tmo fires.
+ *      This routine assumes the host_lock is held on entry.
+ **/
+int
+scsi_internal_device_block(struct scsi_device *sdev)
+{
+        request_queue_t *q = sdev->request_queue;
+        unsigned long flags;
+        int err = 0;
+
+        err = scsi_device_set_state(sdev, SDEV_BLOCK);
+        if (err)
+		return err;
+                
+        /*
+         * The device has transitioned to SDEV_BLOCK.  Stop the
+         * block layer from calling the midlayer with this device's
+         * request queue.
+         */
+        spin_lock_irqsave(q->queue_lock, flags);
+        blk_stop_queue(q);
+        spin_unlock_irqrestore(q->queue_lock, flags);
+
+        return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_internal_device_block);
+
+/**
+ * scsi_internal_device_unblock - resume a device after a block request
+ * @sdev:       device to resume
+ *
+ * Called by scsi lld's or the midlayer to restart the device queue
+ * for the previously suspended scsi device.  Called from interrupt or
+ * normal process context.
+ *
+ * Returns zero if successful or error if not.
+ *
+ * Notes:
+ *      This routine transitions the device to the SDEV_RUNNING state
+ *      (which must be a legal transition) allowing the midlayer to
+ *      goose the queue for this device.  This routine assumes the
+ *      host_lock is held upon entry.
+ **/
+int
+scsi_internal_device_unblock(struct scsi_device *sdev)
+{
+        request_queue_t *q = sdev->request_queue;
+        int err;
+        unsigned long flags;
+
+
+        /*
+         * Try to transition the scsi device to SDEV_RUNNING
+         * and goose the device queue if successful.
+         */
+        err = scsi_device_set_state(sdev, SDEV_RUNNING);
+        if (err)
+		return err;
+                
+        spin_lock_irqsave(q->queue_lock, flags);
+        blk_start_queue(q);
+        spin_unlock_irqrestore(q->queue_lock, flags);
+
+        return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_internal_device_unblock);
+
+static void
+device_block(struct scsi_device *sdev, void *data)
+{
+        scsi_internal_device_block(sdev);
+}
+
+static int
+target_block(struct device *dev, void *data)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_block);
+
+        return 0;
+}
+
+void
+scsi_target_block(struct device *dev)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_block);
+        else
+                device_for_each_child(dev, NULL, target_block);
+}
+EXPORT_SYMBOL_GPL(scsi_target_block);
+
+static void
+device_unblock(struct scsi_device *sdev, void *data)
+{
+        scsi_internal_device_unblock(sdev);
+}
+
+static int
+target_unblock(struct device *dev, void *data)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_unblock);
+        return 0;
+}
+
+void
+scsi_target_unblock(struct device *dev)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_unblock);
+        else
+                device_for_each_child(dev, NULL, target_unblock);
+}
+EXPORT_SYMBOL_GPL(scsi_target_unblock);
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/scsi_scan.c b/kernel_addons/backport/2.6.9_U3/include/src/scsi_scan.c
new file mode 100644
index 0000000..b7b7674
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/scsi_scan.c
@@ -0,0 +1,48 @@
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/spinlock.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_driver.h>
+#include <scsi/scsi_devinfo.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_transport.h>
+#include <scsi/scsi_eh.h>
+
+/**
+ * int_to_scsilun: reverts an int into a scsi_lun
+ * @lun:        integer to be reverted
+ * @scsilun:    struct scsi_lun to be set.
+ *
+ * Description:
+ *     Reverts the functionality of the scsilun_to_int, which packed
+ *     an 8-byte lun value into an int. This routine unpacks the int
+ *     back into the lun value.
+ *     Note: the scsilun_to_int() routine does not truly handle all
+ *     8 bytes of the lun value. This function restores only as much
+ *     as was set by the routine.
+ *
+ * Notes:
+ *     Given the integer 0x0b030a04, this function fills in a
+ *     struct scsi_lun of: 0a 04 0b 03 00 00 00 00
+ *
+ **/
+void int_to_scsilun(unsigned int lun, struct scsi_lun *scsilun)
+{
+        int i;
+
+        memset(scsilun->scsi_lun, 0, sizeof(scsilun->scsi_lun));
+
+        for (i = 0; i < sizeof(lun); i += 2) {
+                scsilun->scsi_lun[i] = (lun >> 8) & 0xFF;
+                scsilun->scsi_lun[i+1] = lun & 0xFF;
+                lun = lun >> 16;
+        }
+}
+EXPORT_SYMBOL(int_to_scsilun);
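
Reader's note (not part of the patch): the byte order produced by
int_to_scsilun() above is easy to misread, so here is a minimal user-space
sketch of the same loop. The names struct scsi_lun_demo, int_to_scsilun_demo
and int_to_scsilun_demo_check are hypothetical stand-ins, not kernel symbols,
and the sketch assumes unsigned int is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space stand-in for the kernel's struct scsi_lun. */
struct scsi_lun_demo {
	uint8_t scsi_lun[8];
};

/*
 * Same loop as int_to_scsilun() in the patch: the integer is consumed
 * 16 bits at a time, least significant chunk first, and each chunk is
 * stored high byte first.
 */
void int_to_scsilun_demo(unsigned int lun, struct scsi_lun_demo *out)
{
	size_t i;

	memset(out->scsi_lun, 0, sizeof(out->scsi_lun));

	for (i = 0; i < sizeof(lun); i += 2) {
		out->scsi_lun[i]     = (lun >> 8) & 0xFF;
		out->scsi_lun[i + 1] = lun & 0xFF;
		lun >>= 16;
	}
}

/* Checks the example given in the kernel-doc comment of the patch. */
void int_to_scsilun_demo_check(void)
{
	static const uint8_t expected[8] = {
		0x0a, 0x04, 0x0b, 0x03, 0x00, 0x00, 0x00, 0x00
	};
	struct scsi_lun_demo l;

	int_to_scsilun_demo(0x0b030a04u, &l);
	assert(memcmp(l.scsi_lun, expected, sizeof(expected)) == 0);
}
```

Calling int_to_scsilun_demo_check() from a test program exercises the
example from the comment. Only the low four bytes are ever written, which
matches the note in the patch that the routine restores only as much as
scsilun_to_int() packed.
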
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/transport_class.c b/kernel_addons/backport/2.6.9_U3/include/src/transport_class.c
new file mode 100644
index 0000000..f25e7c6
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/transport_class.c
@@ -0,0 +1,280 @@
+/*
+ * transport_class.c - implementation of generic transport classes
+ *                     using attribute_containers
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ *
+ * The basic idea here is to allow any "device controller" (which
+ * would most often be a Host Bus Adapter) to use the services of one
+ * or more transport classes for performing transport specific
+ * services.  Transport specific services are things that the generic
+ * command layer doesn't want to know about (speed settings, line
+ * conditioning, etc), but which the user might be interested in.
+ * Thus, the HBA's use the routines exported by the transport classes
+ * to perform these functions.  The transport classes export certain
+ * values to the user via sysfs using attribute containers.
+ *
+ * Note: because not every HBA will care about every transport
+ * attribute, there's a many to one relationship that goes like this:
+ *
+ * transport class<-----attribute container<----class device
+ *
+ * Usually the attribute container is per-HBA, but the design doesn't
+ * mandate that.  Although most of the services will be specific to
+ * the actual external storage connection used by the HBA, the generic
+ * transport class is framed entirely in terms of generic devices to
+ * allow it to be used by any physical HBA in the system.
+ */
+#include <linux/attribute_container.h>
+#include <linux/transport_class.h>
+
+/**
+ * transport_class_register - register an initial transport class
+ *
+ * @tclass:	a pointer to the transport class structure to be initialised
+ *
+ * The transport class contains an embedded class which is used to
+ * identify it.  The caller should initialise this structure with
+ * zeros and then initialise the generic class with the actual
+ * transport class unique name.  There's a macro
+ * DECLARE_TRANSPORT_CLASS() to do this (declared classes still must
+ * be registered).
+ *
+ * Returns 0 on success or error on failure.
+ */
+int transport_class_register(struct transport_class *tclass)
+{
+	return class_register(&tclass->class);
+}
+EXPORT_SYMBOL_GPL(transport_class_register);
+
+/**
+ * transport_class_unregister - unregister a previously registered class
+ *
+ * @tclass: The transport class to unregister
+ *
+ * Must be called prior to deallocating the memory for the transport
+ * class.
+ */
+void transport_class_unregister(struct transport_class *tclass)
+{
+	class_unregister(&tclass->class);
+}
+EXPORT_SYMBOL_GPL(transport_class_unregister);
+
+static int anon_transport_dummy_function(struct transport_container *tc,
+					 struct device *dev,
+					 struct class_device *cdev)
+{
+	/* do nothing */
+	return 0;
+}
+
+/**
+ * anon_transport_class_register - register an anonymous class
+ *
+ * @atc: The anon transport class to register
+ *
+ * The anonymous transport class contains both a transport class and a
+ * container.  The idea of an anonymous class is that it never
+ * actually has any device attributes associated with it (and thus
+ * saves on container storage).  So it can only be used for triggering
+ * events.  Use prezero and then use DECLARE_ANON_TRANSPORT_CLASS() to
+ * initialise the anon transport class storage.
+ */
+int anon_transport_class_register(struct anon_transport_class *atc)
+{
+	int error;
+	atc->container.class = &atc->tclass.class;
+	attribute_container_set_no_classdevs(&atc->container);
+	error = attribute_container_register(&atc->container);
+	if (error)
+		return error;
+	atc->tclass.setup = anon_transport_dummy_function;
+	atc->tclass.remove = anon_transport_dummy_function;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(anon_transport_class_register);
+
+/**
+ * anon_transport_class_unregister - unregister an anon class
+ *
+ * @atc: Pointer to the anon transport class to unregister
+ *
+ * Must be called prior to deallocating the memory for the anon
+ * transport class.
+ */
+void anon_transport_class_unregister(struct anon_transport_class *atc)
+{
+	attribute_container_unregister(&atc->container);
+}
+EXPORT_SYMBOL_GPL(anon_transport_class_unregister);
+
+static int transport_setup_classdev(struct attribute_container *cont,
+				    struct device *dev,
+				    struct class_device *classdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+	struct transport_container *tcont = attribute_container_to_transport_container(cont);
+
+	if (tclass->setup)
+		tclass->setup(tcont, dev, classdev);
+
+	return 0;
+}
+
+/**
+ * transport_setup_device - declare a new dev for transport class association
+ *			    but don't make it visible yet.
+ *
+ * @dev: the generic device representing the entity being added
+ *
+ * Usually, dev represents some component in the HBA system (either
+ * the HBA itself or a device remote across the HBA bus).  This
+ * routine is simply a trigger point to see if any set of transport
+ * classes wishes to associate with the added device.  This allocates
+ * storage for the class device and initialises it, but does not yet
+ * add it to the system or add attributes to it (you do this with
+ * transport_add_device).  If you have no need for separate setup
+ * and add operations, use transport_register_device (see
+ * transport_class.h).
+ */
+
+void transport_setup_device(struct device *dev)
+{
+	attribute_container_add_device(dev, transport_setup_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_setup_device);
+
+static int transport_add_class_device(struct attribute_container *cont,
+				      struct device *dev,
+				      struct class_device *classdev)
+{
+	int error = attribute_container_add_class_device(classdev);
+	struct transport_container *tcont = 
+		attribute_container_to_transport_container(cont);
+
+	if (!error && tcont->statistics)
+		error = sysfs_create_group(&classdev->kobj, tcont->statistics);
+
+	return error;
+}
+
+
+/**
+ * transport_add_device - declare a new dev for transport class association
+ *
+ * @dev: the generic device representing the entity being added
+ *
+ * Usually, dev represents some component in the HBA system (either
+ * the HBA itself or a device remote across the HBA bus).  This
+ * routine is simply a trigger point used to add the device to the
+ * system and register attributes for it.
+ */
+
+void transport_add_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_add_class_device);
+}
+EXPORT_SYMBOL_GPL(transport_add_device);
+
+static int transport_configure(struct attribute_container *cont,
+			       struct device *dev,
+			       struct class_device *cdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+	struct transport_container *tcont = attribute_container_to_transport_container(cont);
+
+	if (tclass->configure)
+		tclass->configure(tcont, dev, cdev);
+
+	return 0;
+}
+
+/**
+ * transport_configure_device - configure an already set up device
+ *
+ * @dev: generic device representing device to be configured
+ *
+ * The idea of configure is simply to provide a point within the setup
+ * process to allow the transport class to extract information from a
+ * device after it has been set up.  This is used in SCSI because we
+ * have to have a setup device to begin using the HBA, but after we
+ * send the initial inquiry, we use configure to extract the device
+ * parameters.  The device need not have been added to be configured.
+ */
+void transport_configure_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_configure);
+}
+EXPORT_SYMBOL_GPL(transport_configure_device);
+
+static int transport_remove_classdev(struct attribute_container *cont,
+				     struct device *dev,
+				     struct class_device *classdev)
+{
+	struct transport_container *tcont = 
+		attribute_container_to_transport_container(cont);
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+
+	if (tclass->remove)
+		tclass->remove(tcont, dev, classdev);
+
+	if (tclass->remove != anon_transport_dummy_function) {
+		if (tcont->statistics)
+			sysfs_remove_group(&classdev->kobj, tcont->statistics);
+		attribute_container_class_device_del(classdev);
+	}
+
+	return 0;
+}
+
+
+/**
+ * transport_remove_device - remove the visibility of a device
+ *
+ * @dev: generic device to remove
+ *
+ * This call removes the visibility of the device (to the user from
+ * sysfs), but does not destroy it.  To eliminate a device entirely
+ * you must also call transport_destroy_device.  If you don't need to
+ * do remove and destroy as separate operations, use
+ * transport_unregister_device() (see transport_class.h) which will
+ * perform both calls for you.
+ */
+void transport_remove_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_remove_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_remove_device);
+
+static void transport_destroy_classdev(struct attribute_container *cont,
+				      struct device *dev,
+				      struct class_device *classdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+
+	if (tclass->remove != anon_transport_dummy_function)
+		class_device_put(classdev);
+}
+
+
+/**
+ * transport_destroy_device - destroy a removed device
+ *
+ * @dev: device to eliminate from the transport class.
+ *
+ * This call triggers the elimination of storage associated with the
+ * transport classdev.  Note: all it really does is relinquish a
+ * reference to the classdev.  The memory will not be freed until the
+ * last reference goes to zero.  Note also that the classdev retains a
+ * reference count on dev, so dev too will remain for as long as the
+ * transport class device remains around.
+ */
+void transport_destroy_device(struct device *dev)
+{
+	attribute_container_remove_device(dev, transport_destroy_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_destroy_device);
diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/attribute_container.h b/kernel_addons/backport/2.6.9_U4/include/linux/attribute_container.h
new file mode 100644
index 0000000..93bfb0b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/linux/attribute_container.h
@@ -0,0 +1,71 @@
+/*
+ * attribute_container.h - a generic container for all classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ */
+
+#ifndef _ATTRIBUTE_CONTAINER_H_
+#define _ATTRIBUTE_CONTAINER_H_
+
+#include <linux/device.h>
+#include <linux/list.h>
+#include <linux/klist.h>
+#include <linux/spinlock.h>
+
+struct attribute_container {
+	struct list_head	node;
+	struct klist		containers;
+	struct class		*class;
+	struct class_device_attribute **attrs;
+	int (*match)(struct attribute_container *, struct device *);
+#define	ATTRIBUTE_CONTAINER_NO_CLASSDEVS	0x01
+	unsigned long		flags;
+};
+
+static inline int
+attribute_container_no_classdevs(struct attribute_container *atc)
+{
+	return atc->flags & ATTRIBUTE_CONTAINER_NO_CLASSDEVS;
+}
+
+static inline void
+attribute_container_set_no_classdevs(struct attribute_container *atc)
+{
+	atc->flags |= ATTRIBUTE_CONTAINER_NO_CLASSDEVS;
+}
+
+int attribute_container_register(struct attribute_container *cont);
+int attribute_container_unregister(struct attribute_container *cont);
+void attribute_container_create_device(struct device *dev,
+				       int (*fn)(struct attribute_container *,
+						 struct device *,
+						 struct class_device *));
+void attribute_container_add_device(struct device *dev,
+				    int (*fn)(struct attribute_container *,
+					      struct device *,
+					      struct class_device *));
+void attribute_container_remove_device(struct device *dev,
+				       void (*fn)(struct attribute_container *,
+						  struct device *,
+						  struct class_device *));
+void attribute_container_device_trigger(struct device *dev, 
+					int (*fn)(struct attribute_container *,
+						  struct device *,
+						  struct class_device *));
+void attribute_container_trigger(struct device *dev, 
+				 int (*fn)(struct attribute_container *,
+					   struct device *));
+int attribute_container_add_attrs(struct class_device *classdev);
+int attribute_container_add_class_device(struct class_device *classdev);
+int attribute_container_add_class_device_adapter(struct attribute_container *cont,
+						 struct device *dev,
+						 struct class_device *classdev);
+void attribute_container_remove_attrs(struct class_device *classdev);
+void attribute_container_class_device_del(struct class_device *classdev);
+struct attribute_container *attribute_container_classdev_to_container(struct class_device *);
+struct class_device *attribute_container_find_class_device(struct attribute_container *, struct device *);
+struct class_device_attribute **attribute_container_classdev_to_attrs(const struct class_device *classdev);
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/klist.h b/kernel_addons/backport/2.6.9_U4/include/linux/klist.h
new file mode 100644
index 0000000..7407125
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/linux/klist.h
@@ -0,0 +1,61 @@
+/*
+ *	klist.h - Some generic list helpers, extending struct list_head a bit.
+ *
+ *	Implementations are found in lib/klist.c
+ *
+ *
+ *	Copyright (C) 2005 Patrick Mochel
+ *
+ *	This file is released under the GPL v2.
+ */
+
+#ifndef _LINUX_KLIST_H
+#define _LINUX_KLIST_H
+
+#include <linux/spinlock.h>
+#include <linux/completion.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+
+struct klist_node;
+struct klist {
+	spinlock_t		k_lock;
+	struct list_head	k_list;
+	void			(*get)(struct klist_node *);
+	void			(*put)(struct klist_node *);
+};
+
+
+extern void klist_init(struct klist * k, void (*get)(struct klist_node *),
+		       void (*put)(struct klist_node *));
+
+struct klist_node {
+	struct klist		* n_klist;
+	struct list_head	n_node;
+	struct kref		n_ref;
+	struct completion	n_removed;
+};
+
+extern void klist_add_tail(struct klist_node * n, struct klist * k);
+extern void klist_add_head(struct klist_node * n, struct klist * k);
+
+extern void klist_del(struct klist_node * n);
+extern void klist_remove(struct klist_node * n);
+
+extern int klist_node_attached(struct klist_node * n);
+
+
+struct klist_iter {
+	struct klist		* i_klist;
+	struct list_head	* i_head;
+	struct klist_node	* i_cur;
+};
+
+
+extern void klist_iter_init(struct klist * k, struct klist_iter * i);
+extern void klist_iter_init_node(struct klist * k, struct klist_iter * i, 
+				 struct klist_node * n);
+extern void klist_iter_exit(struct klist_iter * i);
+extern struct klist_node * klist_next(struct klist_iter * i);
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_device.h b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_device.h
new file mode 100644
index 0000000..f353e0b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_device.h
@@ -0,0 +1,19 @@
+#ifndef _SCSI_SCSI_DEVICE_H_BACKPORT
+#define _SCSI_SCSI_DEVICE_H_BACKPORT
+
+#include_next <scsi/scsi_device.h>
+
+#include <linux/device.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/workqueue.h>
+#include <asm/atomic.h>
+
+struct scsi_lun;
+
+extern void int_to_scsilun(unsigned int, struct scsi_lun *);
+extern void scsi_target_block(struct device *);
+extern void scsi_target_unblock(struct device *);
+extern void starget_for_each_device(struct scsi_target *, void *,
+		     void (*fn)(struct scsi_device *, void *));
+#endif /* _SCSI_SCSI_DEVICE_H_BACKPORT */
diff --git a/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_transport.h b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_transport.h
new file mode 100644
index 0000000..99c2b12
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_transport.h
@@ -0,0 +1,8 @@
+#ifndef _SCSI_SCSI_TRANSPORT_H_BACKPORT
+#define _SCSI_SCSI_TRANSPORT_H_BACKPORT
+
+#include_next <scsi/scsi_transport.h>
+
+#include <linux/transport_class.h>
+
+#endif /* _SCSI_SCSI_TRANSPORT_H_BACKPORT */
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/attribute_container.c b/kernel_addons/backport/2.6.9_U4/include/src/attribute_container.c
new file mode 100644
index 0000000..44948d1
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/attribute_container.c
@@ -0,0 +1,438 @@
+/*
+ * attribute_container.c - implementation of a simple container for classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ *
+ * The basic idea here is to enable a device to be attached to an
+ * arbitrary number of classes without having to allocate storage for them.
+ * Instead, the contained classes select the devices they need to attach
+ * to via a matching function.
+ */
+
+#include <linux/attribute_container.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/module.h>
+
+#include "base.h"
+
+/* This is a private structure used to tie the classdev and the
+ * container .. it should never be visible outside this file */
+struct internal_container {
+	struct klist_node node;
+	struct attribute_container *cont;
+	struct class_device classdev;
+};
+
+static void internal_container_klist_get(struct klist_node *n)
+{
+	struct internal_container *ic =
+		container_of(n, struct internal_container, node);
+	class_device_get(&ic->classdev);
+}
+
+static void internal_container_klist_put(struct klist_node *n)
+{
+	struct internal_container *ic =
+		container_of(n, struct internal_container, node);
+	class_device_put(&ic->classdev);
+}
+
+
+/**
+ * attribute_container_classdev_to_container - given a classdev, return the container
+ *
+ * @classdev: the class device created by attribute_container_add_device.
+ *
+ * Returns the container associated with this classdev.
+ */
+struct attribute_container *
+attribute_container_classdev_to_container(struct class_device *classdev)
+{
+	struct internal_container *ic =
+		container_of(classdev, struct internal_container, classdev);
+	return ic->cont;
+}
+EXPORT_SYMBOL_GPL(attribute_container_classdev_to_container);
+
+static struct list_head attribute_container_list;
+
+static DECLARE_MUTEX(attribute_container_mutex);
+
+/**
+ * attribute_container_register - register an attribute container
+ *
+ * @cont: The container to register.  This must be allocated by the
+ *        caller and should also be zeroed by it.
+ */
+int
+attribute_container_register(struct attribute_container *cont)
+{
+	INIT_LIST_HEAD(&cont->node);
+	klist_init(&cont->containers,internal_container_klist_get,
+		   internal_container_klist_put);
+		
+	down(&attribute_container_mutex);
+	list_add_tail(&cont->node, &attribute_container_list);
+	up(&attribute_container_mutex);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(attribute_container_register);
+
+/**
+ * attribute_container_unregister - remove a container registration
+ *
+ * @cont: previously registered container to remove
+ */
+int
+attribute_container_unregister(struct attribute_container *cont)
+{
+	int retval = -EBUSY;
+	down(&attribute_container_mutex);
+	spin_lock(&cont->containers.k_lock);
+	if (!list_empty(&cont->containers.k_list))
+		goto out;
+	retval = 0;
+	list_del(&cont->node);
+ out:
+	spin_unlock(&cont->containers.k_lock);
+	up(&attribute_container_mutex);
+	return retval;
+		
+}
+EXPORT_SYMBOL_GPL(attribute_container_unregister);
+
+/* private function used as class release */
+static void attribute_container_release(struct class_device *classdev)
+{
+	struct internal_container *ic 
+		= container_of(classdev, struct internal_container, classdev);
+	struct device *dev = classdev->dev;
+
+	kfree(ic);
+	put_device(dev);
+}
+
+/**
+ * attribute_container_add_device - see if any container is interested in dev
+ *
+ * @dev: device to add attributes to
+ * @fn:	 function to trigger addition of class device.
+ *
+ * This function allocates storage for the class device(s) to be
+ * attached to dev (one for each matching attribute_container).  If no
+ * fn is provided, the code will simply register the class device via
+ * class_device_add.  If a function is provided, it is expected to add
+ * the class device at the appropriate time.  One of the things that
+ * might be necessary is to allocate and initialise the classdev and
+ * then add it at a later time.  To do this, call this routine for
+ * allocation and initialisation and then use
+ * attribute_container_device_trigger() to call class_device_add() on
+ * it.  Note: after this, the class device contains a reference to dev
+ * which is not relinquished until the release of the classdev.
+ */
+void
+attribute_container_add_device(struct device *dev,
+			       int (*fn)(struct attribute_container *,
+					 struct device *,
+					 struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+
+		if (attribute_container_no_classdevs(cont))
+			continue;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		ic = kzalloc(sizeof(*ic), GFP_KERNEL);
+		if (!ic) {
+			dev_printk(KERN_ERR, dev, "failed to allocate class container\n");
+			continue;
+		}
+
+		ic->cont = cont;
+		class_device_initialize(&ic->classdev);
+		ic->classdev.dev = get_device(dev);
+		ic->classdev.class = cont->class;
+		cont->class->release = attribute_container_release;
+		strcpy(ic->classdev.class_id, dev->bus_id);
+		if (fn)
+			fn(cont, dev, &ic->classdev);
+		else
+			attribute_container_add_class_device(&ic->classdev);
+		klist_add_tail(&ic->node, &cont->containers);
+	}
+	up(&attribute_container_mutex);
+}
+
+/* FIXME: can't break out of this unless klist_iter_exit is also
+ * called before doing the break
+ */
+#define klist_for_each_entry(pos, head, member, iter) \
+	for (klist_iter_init(head, iter); (pos = ({ \
+		struct klist_node *n = klist_next(iter); \
+		n ? container_of(n, typeof(*pos), member) : \
+			({ klist_iter_exit(iter) ; NULL; }); \
+	}) ) != NULL; )
+			
+
+/**
+ * attribute_container_remove_device - make device eligible for removal.
+ *
+ * @dev:  The generic device
+ * @fn:	  A function to call to remove the device
+ *
+ * This routine triggers device removal.  If fn is NULL, then it is
+ * simply done via class_device_unregister (note that if something
+ * still has a reference to the classdev, then the memory occupied
+ * will not be freed until the classdev is released).  If you want a
+ * two phase release: remove from visibility and then delete the
+ * device, then you should use this routine with a fn that calls
+ * class_device_del() and then use
+ * attribute_container_device_trigger() to do the final put on the
+ * classdev.
+ */
+void
+attribute_container_remove_device(struct device *dev,
+				  void (*fn)(struct attribute_container *,
+					     struct device *,
+					     struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+		struct klist_iter iter;
+
+		if (attribute_container_no_classdevs(cont))
+			continue;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		klist_for_each_entry(ic, &cont->containers, node, &iter) {
+			if (dev != ic->classdev.dev)
+				continue;
+			klist_del(&ic->node);
+			if (fn)
+				fn(cont, dev, &ic->classdev);
+			else {
+				attribute_container_remove_attrs(&ic->classdev);
+				class_device_unregister(&ic->classdev);
+			}
+		}
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_device_trigger - execute a trigger for each matching classdev
+ *
+ * @dev:  The generic device to run the trigger for
+ * @fn:	  the function to execute for each classdev.
+ *
+ * This function is for executing a trigger when you need to know both
+ * the container and the classdev.  If you only care about the
+ * container, then use attribute_container_trigger() instead.
+ */
+void
+attribute_container_device_trigger(struct device *dev, 
+				   int (*fn)(struct attribute_container *,
+					     struct device *,
+					     struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+		struct klist_iter iter;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		if (attribute_container_no_classdevs(cont)) {
+			fn(cont, dev, NULL);
+			continue;
+		}
+
+		klist_for_each_entry(ic, &cont->containers, node, &iter) {
+			if (dev == ic->classdev.dev)
+				fn(cont, dev, &ic->classdev);
+		}
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_trigger - trigger a function for each matching container
+ *
+ * @dev:  The generic device to activate the trigger for
+ * @fn:	  the function to trigger
+ *
+ * This routine triggers a function that only needs to know the
+ * matching containers (not the classdev) associated with a device.
+ * It is more lightweight than attribute_container_device_trigger, so
+ * should be used in preference unless the triggering function
+ * actually needs to know the classdev.
+ */
+void
+attribute_container_trigger(struct device *dev,
+			    int (*fn)(struct attribute_container *,
+				      struct device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		if (cont->match(cont, dev))
+			fn(cont, dev);
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_add_attrs - add attributes
+ *
+ * @classdev: The class device
+ *
+ * This simply creates all the class device sysfs files from the
+ * attributes listed in the container
+ */
+int
+attribute_container_add_attrs(struct class_device *classdev)
+{
+	struct attribute_container *cont =
+		attribute_container_classdev_to_container(classdev);
+	struct class_device_attribute **attrs =	cont->attrs;
+	int i, error;
+
+	if (!attrs)
+		return 0;
+
+	for (i = 0; attrs[i]; i++) {
+		error = class_device_create_file(classdev, attrs[i]);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/**
+ * attribute_container_add_class_device - same function as class_device_add
+ *
+ * @classdev:	the class device to add
+ *
+ * This performs essentially the same function as class_device_add except for
+ * attribute containers, namely add the classdev to the system and then
+ * create the attribute files
+ */
+int
+attribute_container_add_class_device(struct class_device *classdev)
+{
+	int error = class_device_add(classdev);
+	if (error)
+		return error;
+	return attribute_container_add_attrs(classdev);
+}
+
+/**
+ * attribute_container_add_class_device_adapter - simple adapter for triggers
+ *
+ * This function is identical to attribute_container_add_class_device except
+ * that it is designed to be called from the triggers
+ */
+int
+attribute_container_add_class_device_adapter(struct attribute_container *cont,
+					     struct device *dev,
+					     struct class_device *classdev)
+{
+	return attribute_container_add_class_device(classdev);
+}
+
+/**
+ * attribute_container_remove_attrs - remove any attribute files
+ *
+ * @classdev: The class device to remove the files from
+ *
+ */
+void
+attribute_container_remove_attrs(struct class_device *classdev)
+{
+	struct attribute_container *cont =
+		attribute_container_classdev_to_container(classdev);
+	struct class_device_attribute **attrs =	cont->attrs;
+	int i;
+
+	if (!attrs)
+		return;
+
+	for (i = 0; attrs[i]; i++)
+		class_device_remove_file(classdev, attrs[i]);
+}
+
+/**
+ * attribute_container_class_device_del - equivalent of class_device_del
+ *
+ * @classdev: the class device
+ *
+ * This function simply removes all the attribute files and then calls
+ * class_device_del.
+ */
+void
+attribute_container_class_device_del(struct class_device *classdev)
+{
+	attribute_container_remove_attrs(classdev);
+	class_device_del(classdev);
+}
+
+/**
+ * attribute_container_find_class_device - find the corresponding class_device
+ *
+ * @cont:	the container
+ * @dev:	the generic device
+ *
+ * Looks up the device in the container's list of class devices and returns
+ * the corresponding class_device.
+ */
+struct class_device *
+attribute_container_find_class_device(struct attribute_container *cont,
+				      struct device *dev)
+{
+	struct class_device *cdev = NULL;
+	struct internal_container *ic;
+	struct klist_iter iter;
+
+	klist_for_each_entry(ic, &cont->containers, node, &iter) {
+		if (ic->classdev.dev == dev) {
+			cdev = &ic->classdev;
+			/* FIXME: must exit iterator then break */
+			klist_iter_exit(&iter);
+			break;
+		}
+	}
+
+	return cdev;
+}
+EXPORT_SYMBOL_GPL(attribute_container_find_class_device);
+
+int
+attribute_container_init(void)
+{
+	INIT_LIST_HEAD(&attribute_container_list);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(attribute_container_init);
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/base.h b/kernel_addons/backport/2.6.9_U4/include/src/base.h
new file mode 100644
index 0000000..a5f8936
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/base.h
@@ -0,0 +1 @@
+extern int attribute_container_init(void);
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/init.c b/kernel_addons/backport/2.6.9_U4/include/src/init.c
new file mode 100644
index 0000000..15f0bc6
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/init.c
@@ -0,0 +1,26 @@
+/*
+ *
+ * Copyright (c) 2002-3 Patrick Mochel
+ * Copyright (c) 2002-3 Open Source Development Labs
+ *
+ * This file is released under the GPLv2
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/memory.h>
+
+#include "base.h"
+
+/**
+ *	driver_init - initialize driver model.
+ *
+ *	Call the driver model init functions to initialize their
+ *	subsystems. Called early from init/main.c.
+ */
+
+void __init driver_init(void)
+{
+	attribute_container_init();
+}
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/klist.c b/kernel_addons/backport/2.6.9_U4/include/src/klist.c
new file mode 100644
index 0000000..3b29ebc
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/klist.c
@@ -0,0 +1,287 @@
+/*
+ *	klist.c - Routines for manipulating klists.
+ *
+ *
+ *	This klist interface provides a couple of structures that wrap around 
+ *	struct list_head to provide explicit list "head" (struct klist) and 
+ *	list "node" (struct klist_node) objects. For struct klist, a spinlock
+ *	is included that protects access to the actual list itself. struct 
+ *	klist_node provides a pointer to the klist that owns it and a kref
+ *	reference count that indicates the number of current users of that node
+ *	in the list.
+ *
+ *	The entire point is to provide an interface for iterating over a list
+ *	that is safe and allows for modification of the list during the
+ *	iteration (e.g. insertion and removal), including modification of the
+ *	current node on the list.
+ *
+ *	It works using a 3rd object type - struct klist_iter - that is declared
+ *	and initialized before an iteration. klist_next() is used to acquire the
+ *	next element in the list. It returns NULL if there are no more items.
+ *	Internally, that routine takes the klist's lock, decrements the reference
+ *	count of the previous klist_node and increments the count of the next
+ *	klist_node. It then drops the lock and returns.
+ *
+ *	There are primitives for adding and removing nodes to/from a klist. 
+ *	When deleting, klist_del() will simply decrement the reference count. 
+ *	Only when the count goes to 0 is the node removed from the list. 
+ *	klist_remove() will try to delete the node from the list and block
+ *	until it is actually removed. This is useful for objects (like devices)
+ *	that have been removed from the system and must be freed (but must wait
+ *	until all accessors have finished).
+ *
+ *	Copyright (C) 2005 Patrick Mochel
+ *
+ *	This file is released under the GPL v2.
+ */
+
+#include <linux/klist.h>
+#include <linux/module.h>
+
+
+/**
+ *	klist_init - Initialize a klist structure. 
+ *	@k:	The klist we're initializing.
+ *	@get:	The get function for the embedding object (NULL if none)
+ *	@put:	The put function for the embedding object (NULL if none)
+ *
+ * Initialises the klist structure.  If the klist_node structures are
+ * going to be embedded in refcounted objects (necessary for safe
+ * deletion) then the get/put arguments are used to initialise
+ * functions that take and release references on the embedding
+ * objects.
+ */
+
+void klist_init(struct klist * k, void (*get)(struct klist_node *),
+		void (*put)(struct klist_node *))
+{
+	INIT_LIST_HEAD(&k->k_list);
+	spin_lock_init(&k->k_lock);
+	k->get = get;
+	k->put = put;
+}
+
+EXPORT_SYMBOL_GPL(klist_init);
+
+
+static void add_head(struct klist * k, struct klist_node * n)
+{
+	spin_lock(&k->k_lock);
+	list_add(&n->n_node, &k->k_list);
+	spin_unlock(&k->k_lock);
+}
+
+static void add_tail(struct klist * k, struct klist_node * n)
+{
+	spin_lock(&k->k_lock);
+	list_add_tail(&n->n_node, &k->k_list);
+	spin_unlock(&k->k_lock);
+}
+
+
+static void klist_node_init(struct klist * k, struct klist_node * n)
+{
+	INIT_LIST_HEAD(&n->n_node);
+	init_completion(&n->n_removed);
+	kref_init(&n->n_ref);
+	n->n_klist = k;
+	if (k->get)
+		k->get(n);
+}
+
+
+/**
+ *	klist_add_head - Initialize a klist_node and add it to front.
+ *	@n:	node we're adding.
+ *	@k:	klist it's going on.
+ */
+
+void klist_add_head(struct klist_node * n, struct klist * k)
+{
+	klist_node_init(k, n);
+	add_head(k, n);
+}
+
+EXPORT_SYMBOL_GPL(klist_add_head);
+
+
+/**
+ *	klist_add_tail - Initialize a klist_node and add it to back.
+ *	@n:	node we're adding.
+ *	@k:	klist it's going on.
+ */
+
+void klist_add_tail(struct klist_node * n, struct klist * k)
+{
+	klist_node_init(k, n);
+	add_tail(k, n);
+}
+
+EXPORT_SYMBOL_GPL(klist_add_tail);
+
+
+static void klist_release(struct kref * kref)
+{
+	struct klist_node * n = container_of(kref, struct klist_node, n_ref);
+
+	list_del(&n->n_node);
+	complete(&n->n_removed);
+	n->n_klist = NULL;
+}
+
+static int klist_dec_and_del(struct klist_node * n)
+{
+	return kref_put_new(&n->n_ref, klist_release);
+}
+
+
+/**
+ *	klist_del - Decrement the reference count of node and try to remove.
+ *	@n:	node we're deleting.
+ */
+
+void klist_del(struct klist_node * n)
+{
+	struct klist * k = n->n_klist;
+	void (*put)(struct klist_node *) = k->put;
+
+	spin_lock(&k->k_lock);
+	if (!klist_dec_and_del(n))
+		put = NULL;
+	spin_unlock(&k->k_lock);
+	if (put)
+		put(n);
+}
+
+EXPORT_SYMBOL_GPL(klist_del);
+
+
+/**
+ *	klist_remove - Decrement the refcount of node and wait for it to go away.
+ *	@n:	node we're removing.
+ */
+
+void klist_remove(struct klist_node * n)
+{
+	klist_del(n);
+	wait_for_completion(&n->n_removed);
+}
+
+EXPORT_SYMBOL_GPL(klist_remove);
+
+
+/**
+ *	klist_node_attached - Say whether a node is bound to a list or not.
+ *	@n:	Node that we're testing.
+ */
+
+int klist_node_attached(struct klist_node * n)
+{
+	return (n->n_klist != NULL);
+}
+
+EXPORT_SYMBOL_GPL(klist_node_attached);
+
+
+/**
+ *	klist_iter_init_node - Initialize a klist_iter structure.
+ *	@k:	klist we're iterating.
+ *	@i:	klist_iter we're filling.
+ *	@n:	node to start with.
+ *
+ *	Similar to klist_iter_init(), but starts the action off with @n, 
+ *	instead of with the list head.
+ */
+
+void klist_iter_init_node(struct klist * k, struct klist_iter * i, struct klist_node * n)
+{
+	i->i_klist = k;
+	i->i_head = &k->k_list;
+	i->i_cur = n;
+	if (n)
+		kref_get(&n->n_ref);
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_init_node);
+
+
+/**
+ *	klist_iter_init - Initialize a klist_iter structure.
+ *	@k:	klist we're iterating.
+ *	@i:	klist_iter structure we're filling.
+ *
+ *	Similar to klist_iter_init_node(), but starts with the list head.
+ */
+
+void klist_iter_init(struct klist * k, struct klist_iter * i)
+{
+	klist_iter_init_node(k, i, NULL);
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_init);
+
+
+/**
+ *	klist_iter_exit - Finish a list iteration.
+ *	@i:	Iterator structure.
+ *
+ *	Must be called when done iterating over list, as it decrements the 
+ *	refcount of the current node. Necessary in case iteration exited before
+ *	the end of the list was reached, and always good form.
+ */
+
+void klist_iter_exit(struct klist_iter * i)
+{
+	if (i->i_cur) {
+		klist_del(i->i_cur);
+		i->i_cur = NULL;
+	}
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_exit);
+
+
+static struct klist_node * to_klist_node(struct list_head * n)
+{
+	return container_of(n, struct klist_node, n_node);
+}
+
+
+/**
+ *	klist_next - Ante up next node in list.
+ *	@i:	Iterator structure.
+ *
+ *	First grab list lock. Decrement the reference count of the previous
+ *	node, if there was one. Grab the next node, increment its reference 
+ *	count, drop the lock, and return that next node.
+ */
+
+struct klist_node * klist_next(struct klist_iter * i)
+{
+	struct list_head * next;
+	struct klist_node * lnode = i->i_cur;
+	struct klist_node * knode = NULL;
+	void (*put)(struct klist_node *) = i->i_klist->put;
+
+	spin_lock(&i->i_klist->k_lock);
+	if (lnode) {
+		next = lnode->n_node.next;
+		if (!klist_dec_and_del(lnode))
+			put = NULL;
+	} else
+		next = i->i_head->next;
+
+	if (next != i->i_head) {
+		knode = to_klist_node(next);
+		kref_get(&knode->n_ref);
+	}
+	i->i_cur = knode;
+	spin_unlock(&i->i_klist->k_lock);
+	if (put && lnode)
+		put(lnode);
+	return knode;
+}
+
+EXPORT_SYMBOL_GPL(klist_next);
+
+
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/kref_new.c b/kernel_addons/backport/2.6.9_U4/include/src/kref_new.c
new file mode 100644
index 0000000..d45bb3f
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/kref_new.c
@@ -0,0 +1,29 @@
+#include <linux/kref.h>
+#include <linux/module.h>
+
+/**
+ * kref_put_new - decrement refcount for object.
+ * @kref: object.
+ * @release: pointer to the function that will clean up the object when the
+ *           last reference to the object is released.
+ *           This pointer is required, and it is not acceptable to pass kfree
+ *           in as this function.
+ *
+ * Decrement the refcount, and if 0, call release().
+ * Return 1 if the object was removed, otherwise return 0.  Beware, if this
+ * function returns 0, you still cannot count on the kref remaining in
+ * memory.  Only use the return value to check whether the kref is now
+ * gone, not whether it is still present.
+ */
+int kref_put_new(struct kref *kref, void (*release)(struct kref *kref))
+{
+        WARN_ON(release == NULL);
+        WARN_ON(release == (void (*)(struct kref *))kfree);
+
+        if (atomic_dec_and_test(&kref->refcount)) {
+                release(kref);
+                return 1;
+        }
+        return 0;
+}
+EXPORT_SYMBOL(kref_put_new);
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/scsi.c b/kernel_addons/backport/2.6.9_U4/include/src/scsi.c
new file mode 100644
index 0000000..8c833c0
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/scsi.c
@@ -0,0 +1,50 @@
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/delay.h>
+#include <linux/init.h>
+#include <linux/completion.h>
+#include <linux/unistd.h>
+#include <linux/spinlock.h>
+#include <linux/kmod.h>
+#include <linux/interrupt.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
+#include <linux/mutex.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_dbg.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_tcq.h>
+
+/**
+ * starget_for_each_device  -  helper to walk all devices of a target
+ * @starget:	target whose devices we want to iterate over.
+ *
+ * This traverses over each device of @starget.  The devices have
+ * a reference that must be released by scsi_host_put when breaking
+ * out of the loop.
+ */
+void starget_for_each_device(struct scsi_target *starget, void * data,
+		     void (*fn)(struct scsi_device *, void *))
+{
+	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
+	struct scsi_device *sdev;
+
+	printk("%s: entry\n", __FUNCTION__);
+	shost_for_each_device(sdev, shost) {
+		if ((sdev->channel == starget->channel) &&
+		    (sdev->id == starget->id))
+			fn(sdev, data);
+	}
+	printk("%s: exit\n", __FUNCTION__);
+}
+EXPORT_SYMBOL(starget_for_each_device);
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/scsi_lib.c b/kernel_addons/backport/2.6.9_U4/include/src/scsi_lib.c
new file mode 100644
index 0000000..327b53f
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/scsi_lib.c
@@ -0,0 +1,164 @@
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/completion.h>
+#include <linux/kernel.h>
+#include <linux/mempool.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <linux/hardirq.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_dbg.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_driver.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_host.h>
+
+int scsi_is_target_device(const struct device *dev)
+{
+        const char *str = dev->bus_id;
+
+	if (strncmp(str, "target", 6) == 0) {
+		return 1;
+	}
+
+        return 0;
+}
+
+/**
+ * scsi_internal_device_block - internal function to put a device
+ *                              temporarily into the SDEV_BLOCK state
+ * @sdev:       device to block
+ *
+ * Block request made by scsi lld's to temporarily stop all
+ * scsi commands on the specified device.  Called from interrupt
+ * or normal process context.
+ *
+ * Returns zero if successful or error if not
+ *
+ * Notes:
+ *      This routine transitions the device to the SDEV_BLOCK state
+ *      (which must be a legal transition).  When the device is in this
+ *      state, all commands are deferred until the scsi lld reenables
+ *      the device with scsi_device_unblock or device_block_tmo fires.
+ *      This routine assumes the host_lock is held on entry.
+ **/
+int
+scsi_internal_device_block(struct scsi_device *sdev)
+{
+        request_queue_t *q = sdev->request_queue;
+        unsigned long flags;
+        int err = 0;
+
+        err = scsi_device_set_state(sdev, SDEV_BLOCK);
+        if (err)
+		return err;
+                
+        /*
+         * The device has transitioned to SDEV_BLOCK.  Stop the
+         * block layer from calling the midlayer with this device's
+         * request queue.
+         */
+        spin_lock_irqsave(q->queue_lock, flags);
+        blk_stop_queue(q);
+        spin_unlock_irqrestore(q->queue_lock, flags);
+
+        return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_internal_device_block);
+
+/**
+ * scsi_internal_device_unblock - resume a device after a block request
+ * @sdev:       device to resume
+ *
+ * Called by scsi lld's or the midlayer to restart the device queue
+ * for the previously suspended scsi device.  Called from interrupt or
+ * normal process context.
+ *
+ * Returns zero if successful or error if not.
+ *
+ * Notes:
+ *      This routine transitions the device to the SDEV_RUNNING state
+ *      (which must be a legal transition) allowing the midlayer to
+ *      goose the queue for this device.  This routine assumes the
+ *      host_lock is held upon entry.
+ **/
+int
+scsi_internal_device_unblock(struct scsi_device *sdev)
+{
+        request_queue_t *q = sdev->request_queue;
+        int err;
+        unsigned long flags;
+
+
+        /*
+         * Try to transition the scsi device to SDEV_RUNNING
+         * and goose the device queue if successful.
+         */
+        err = scsi_device_set_state(sdev, SDEV_RUNNING);
+        if (err)
+		return err;
+                
+        spin_lock_irqsave(q->queue_lock, flags);
+        blk_start_queue(q);
+        spin_unlock_irqrestore(q->queue_lock, flags);
+
+        return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_internal_device_unblock);
+
+static void
+device_block(struct scsi_device *sdev, void *data)
+{
+        scsi_internal_device_block(sdev);
+}
+
+static int
+target_block(struct device *dev, void *data)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_block);
+
+        return 0;
+}
+
+void
+scsi_target_block(struct device *dev)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_block);
+        else
+                device_for_each_child(dev, NULL, target_block);
+}
+EXPORT_SYMBOL_GPL(scsi_target_block);
+
+static void
+device_unblock(struct scsi_device *sdev, void *data)
+{
+        scsi_internal_device_unblock(sdev);
+}
+
+static int
+target_unblock(struct device *dev, void *data)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_unblock);
+        return 0;
+}
+
+void
+scsi_target_unblock(struct device *dev)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_unblock);
+        else
+                device_for_each_child(dev, NULL, target_unblock);
+}
+EXPORT_SYMBOL_GPL(scsi_target_unblock);
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/scsi_scan.c b/kernel_addons/backport/2.6.9_U4/include/src/scsi_scan.c
new file mode 100644
index 0000000..b7b7674
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/scsi_scan.c
@@ -0,0 +1,48 @@
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/spinlock.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_driver.h>
+#include <scsi/scsi_devinfo.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_transport.h>
+#include <scsi/scsi_eh.h>
+
+/**
+ * int_to_scsilun: converts an int back into a scsi_lun
+ * @lun:        integer to be converted
+ * @scsilun:    struct scsi_lun to be set.
+ *
+ * Description:
+ *     Reverses the functionality of scsilun_to_int(), which packed
+ *     an 8-byte lun value into an int. This routine unpacks the int
+ *     back into the lun value.
+ *     Note: the scsilun_to_int() routine does not truly handle all
+ *     8 bytes of the lun value. This function restores only as much
+ *     as was set by that routine.
+ *
+ * Notes:
+ *     Given an integer : 0x0b030a04,  this function fills in a
+ *     struct scsi_lun of: 0a 04 0b 03 00 00 00 00
+ *
+ **/
+void int_to_scsilun(unsigned int lun, struct scsi_lun *scsilun)
+{
+        int i;
+
+        memset(scsilun->scsi_lun, 0, sizeof(scsilun->scsi_lun));
+
+        for (i = 0; i < sizeof(lun); i += 2) {
+                scsilun->scsi_lun[i] = (lun >> 8) & 0xFF;
+                scsilun->scsi_lun[i+1] = lun & 0xFF;
+                lun = lun >> 16;
+        }
+}
+EXPORT_SYMBOL(int_to_scsilun);
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/transport_class.c b/kernel_addons/backport/2.6.9_U4/include/src/transport_class.c
new file mode 100644
index 0000000..f25e7c6
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/transport_class.c
@@ -0,0 +1,280 @@
+/*
+ * transport_class.c - implementation of generic transport classes
+ *                     using attribute_containers
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ *
+ * The basic idea here is to allow any "device controller" (which
+ * would most often be a Host Bus Adapter) to use the services of one
+ * or more transport classes for performing transport specific
+ * services.  Transport specific services are things that the generic
+ * command layer doesn't want to know about (speed settings, line
+ * conditioning, etc), but which the user might be interested in.
+ * Thus, the HBAs use the routines exported by the transport classes
+ * to perform these functions.  The transport classes export certain
+ * values to the user via sysfs using attribute containers.
+ *
+ * Note: because not every HBA will care about every transport
+ * attribute, there's a many to one relationship that goes like this:
+ *
+ * transport class<-----attribute container<----class device
+ *
+ * Usually the attribute container is per-HBA, but the design doesn't
+ * mandate that.  Although most of the services will be specific to
+ * the actual external storage connection used by the HBA, the generic
+ * transport class is framed entirely in terms of generic devices to
+ * allow it to be used by any physical HBA in the system.
+ */
+#include <linux/attribute_container.h>
+#include <linux/transport_class.h>
+
+/**
+ * transport_class_register - register an initial transport class
+ *
+ * @tclass:	a pointer to the transport class structure to be initialised
+ *
+ * The transport class contains an embedded class which is used to
+ * identify it.  The caller should initialise this structure with
+ * zeros and then initialise the generic class with the actual
+ * transport class unique name.  There's a macro
+ * DECLARE_TRANSPORT_CLASS() to do this (declared classes still must
+ * be registered).
+ *
+ * Returns 0 on success or error on failure.
+ */
+int transport_class_register(struct transport_class *tclass)
+{
+	return class_register(&tclass->class);
+}
+EXPORT_SYMBOL_GPL(transport_class_register);
+
+/**
+ * transport_class_unregister - unregister a previously registered class
+ *
+ * @tclass: The transport class to unregister
+ *
+ * Must be called prior to deallocating the memory for the transport
+ * class.
+ */
+void transport_class_unregister(struct transport_class *tclass)
+{
+	class_unregister(&tclass->class);
+}
+EXPORT_SYMBOL_GPL(transport_class_unregister);
+
+static int anon_transport_dummy_function(struct transport_container *tc,
+					 struct device *dev,
+					 struct class_device *cdev)
+{
+	/* do nothing */
+	return 0;
+}
+
+/**
+ * anon_transport_class_register - register an anonymous class
+ *
+ * @atc: The anon transport class to register
+ *
+ * The anonymous transport class contains both a transport class and a
+ * container.  The idea of an anonymous class is that it never
+ * actually has any device attributes associated with it (and thus
+ * saves on container storage).  So it can only be used for triggering
+ * events.  Zero the storage and then use DECLARE_ANON_TRANSPORT_CLASS() to
+ * initialise the anon transport class storage.
+ */
+int anon_transport_class_register(struct anon_transport_class *atc)
+{
+	int error;
+	atc->container.class = &atc->tclass.class;
+	attribute_container_set_no_classdevs(&atc->container);
+	error = attribute_container_register(&atc->container);
+	if (error)
+		return error;
+	atc->tclass.setup = anon_transport_dummy_function;
+	atc->tclass.remove = anon_transport_dummy_function;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(anon_transport_class_register);
+
+/**
+ * anon_transport_class_unregister - unregister an anon class
+ *
+ * @atc: Pointer to the anon transport class to unregister
+ *
+ * Must be called prior to deallocating the memory for the anon
+ * transport class.
+ */
+void anon_transport_class_unregister(struct anon_transport_class *atc)
+{
+	attribute_container_unregister(&atc->container);
+}
+EXPORT_SYMBOL_GPL(anon_transport_class_unregister);
+
+static int transport_setup_classdev(struct attribute_container *cont,
+				    struct device *dev,
+				    struct class_device *classdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+	struct transport_container *tcont = attribute_container_to_transport_container(cont);
+
+	if (tclass->setup)
+		tclass->setup(tcont, dev, classdev);
+
+	return 0;
+}
+
+/**
+ * transport_setup_device - declare a new dev for transport class association
+ *			    but don't make it visible yet.
+ *
+ * @dev: the generic device representing the entity being added
+ *
+ * Usually, dev represents some component in the HBA system (either
+ * the HBA itself or a device remote across the HBA bus).  This
+ * routine is simply a trigger point to see if any set of transport
+ * classes wishes to associate with the added device.  This allocates
+ * storage for the class device and initialises it, but does not yet
+ * add it to the system or add attributes to it (you do this with
+ * transport_add_device).  If you have no need for separate setup
+ * and add operations, use transport_register_device (see
+ * transport_class.h).
+ */
+
+void transport_setup_device(struct device *dev)
+{
+	attribute_container_add_device(dev, transport_setup_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_setup_device);
+
+static int transport_add_class_device(struct attribute_container *cont,
+				      struct device *dev,
+				      struct class_device *classdev)
+{
+	int error = attribute_container_add_class_device(classdev);
+	struct transport_container *tcont = 
+		attribute_container_to_transport_container(cont);
+
+	if (!error && tcont->statistics)
+		error = sysfs_create_group(&classdev->kobj, tcont->statistics);
+
+	return error;
+}
+
+
+/**
+ * transport_add_device - declare a new dev for transport class association
+ *
+ * @dev: the generic device representing the entity being added
+ *
+ * Usually, dev represents some component in the HBA system (either
+ * the HBA itself or a device remote across the HBA bus).  This
+ * routine is simply a trigger point used to add the device to the
+ * system and register attributes for it.
+ */
+
+void transport_add_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_add_class_device);
+}
+EXPORT_SYMBOL_GPL(transport_add_device);
+
+static int transport_configure(struct attribute_container *cont,
+			       struct device *dev,
+			       struct class_device *cdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+	struct transport_container *tcont = attribute_container_to_transport_container(cont);
+
+	if (tclass->configure)
+		tclass->configure(tcont, dev, cdev);
+
+	return 0;
+}
+
+/**
+ * transport_configure_device - configure an already set up device
+ *
+ * @dev: generic device representing device to be configured
+ *
+ * The idea of configure is simply to provide a point within the setup
+ * process to allow the transport class to extract information from a
+ * device after it has been setup.  This is used in SCSI because we
+ * have to have a setup device to begin using the HBA, but after we
+ * send the initial inquiry, we use configure to extract the device
+ * parameters.  The device need not have been added to be configured.
+ */
+void transport_configure_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_configure);
+}
+EXPORT_SYMBOL_GPL(transport_configure_device);
+
+static int transport_remove_classdev(struct attribute_container *cont,
+				     struct device *dev,
+				     struct class_device *classdev)
+{
+	struct transport_container *tcont = 
+		attribute_container_to_transport_container(cont);
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+
+	if (tclass->remove)
+		tclass->remove(tcont, dev, classdev);
+
+	if (tclass->remove != anon_transport_dummy_function) {
+		if (tcont->statistics)
+			sysfs_remove_group(&classdev->kobj, tcont->statistics);
+		attribute_container_class_device_del(classdev);
+	}
+
+	return 0;
+}
+
+
+/**
+ * transport_remove_device - remove the visibility of a device
+ *
+ * @dev: generic device to remove
+ *
+ * This call removes the visibility of the device (to the user from
+ * sysfs), but does not destroy it.  To eliminate a device entirely
+ * you must also call transport_destroy_device.  If you don't need to
+ * do remove and destroy as separate operations, use
+ * transport_unregister_device() (see transport_class.h) which will
+ * perform both calls for you.
+ */
+void transport_remove_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_remove_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_remove_device);
+
+static void transport_destroy_classdev(struct attribute_container *cont,
+				      struct device *dev,
+				      struct class_device *classdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+
+	if (tclass->remove != anon_transport_dummy_function)
+		class_device_put(classdev);
+}
+
+
+/**
+ * transport_destroy_device - destroy a removed device
+ *
+ * @dev: device to eliminate from the transport class.
+ *
+ * This call triggers the elimination of storage associated with the
+ * transport classdev.  Note: all it really does is relinquish a
+ * reference to the classdev.  The memory will not be freed until the
+ * last reference goes to zero.  Note also that the classdev retains a
+ * reference count on dev, so dev too will remain for as long as the
+ * transport class device remains around.
+ */
+void transport_destroy_device(struct device *dev)
+{
+	attribute_container_remove_device(dev, transport_destroy_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_destroy_device);
diff --git a/kernel_patches/backport/2.6.9_U3/add_iscsi_proto_h.patch b/kernel_patches/backport/2.6.9_U3/add_iscsi_proto_h.patch
new file mode 100644
index 0000000..c4df6bb
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/add_iscsi_proto_h.patch
@@ -0,0 +1,591 @@
+diff -rupN linux-2.6.20-like-rh4/include/scsi/iscsi_proto.h linux-2.6.20/include/scsi/iscsi_proto.h
+--- linux-2.6.20-like-rh4/include/scsi/iscsi_proto.h	1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.20/include/scsi/iscsi_proto.h	2007-02-04 20:44:54.000000000 +0200
+@@ -0,0 +1,587 @@
++/*
++ * RFC 3720 (iSCSI) protocol data types
++ *
++ * Copyright (C) 2005 Dmitry Yusupov
++ * Copyright (C) 2005 Alex Aizman
++ * maintained by open-iscsi at googlegroups.com
++ *
++ * This program is free software; you can redistribute it and/or modify
++ * it under the terms of the GNU General Public License as published
++ * by the Free Software Foundation; either version 2 of the License, or
++ * (at your option) any later version.
++ *
++ * This program is distributed in the hope that it will be useful, but
++ * WITHOUT ANY WARRANTY; without even the implied warranty of
++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
++ * General Public License for more details.
++ *
++ * See the file COPYING included with this distribution for more details.
++ */
++
++#ifndef ISCSI_PROTO_H
++#define ISCSI_PROTO_H
++
++#define ISCSI_DRAFT20_VERSION	0x00
++
++/* default iSCSI listen port for incoming connections */
++#define ISCSI_LISTEN_PORT	3260
++
++/* Padding word length */
++#define PAD_WORD_LEN		4
++
++/*
++ * useful common (control and data paths) macros
++ */
++#define ntoh24(p) (((p)[0] << 16) | ((p)[1] << 8) | ((p)[2]))
++#define hton24(p, v) { \
++        (p)[0] = (((v) >> 16) & 0xFF); \
++        (p)[1] = (((v) >> 8) & 0xFF); \
++        (p)[2] = ((v) & 0xFF); \
++}
++#define zero_data(p) {(p)[0]=0;(p)[1]=0;(p)[2]=0;}
++
++/*
++ * iSCSI Template Message Header
++ */
++struct iscsi_hdr {
++	uint8_t		opcode;
++	uint8_t		flags;		/* Final bit */
++	uint8_t		rsvd2[2];
++	uint8_t		hlength;	/* AHSs total length */
++	uint8_t		dlength[3];	/* Data length */
++	uint8_t		lun[8];
++	__be32		itt;		/* Initiator Task Tag */
++	__be32		ttt;		/* Target Task Tag */
++	__be32		statsn;
++	__be32		exp_statsn;
++	__be32		max_statsn;
++	uint8_t		other[12];
++};
++
++/************************* RFC 3720 Begin *****************************/
++
++#define ISCSI_RESERVED_TAG		0xffffffff
++
++/* Opcode encoding bits */
++#define ISCSI_OP_RETRY			0x80
++#define ISCSI_OP_IMMEDIATE		0x40
++#define ISCSI_OPCODE_MASK		0x3F
++
++/* Initiator Opcode values */
++#define ISCSI_OP_NOOP_OUT		0x00
++#define ISCSI_OP_SCSI_CMD		0x01
++#define ISCSI_OP_SCSI_TMFUNC		0x02
++#define ISCSI_OP_LOGIN			0x03
++#define ISCSI_OP_TEXT			0x04
++#define ISCSI_OP_SCSI_DATA_OUT		0x05
++#define ISCSI_OP_LOGOUT			0x06
++#define ISCSI_OP_SNACK			0x10
++
++#define ISCSI_OP_VENDOR1_CMD		0x1c
++#define ISCSI_OP_VENDOR2_CMD		0x1d
++#define ISCSI_OP_VENDOR3_CMD		0x1e
++#define ISCSI_OP_VENDOR4_CMD		0x1f
++
++/* Target Opcode values */
++#define ISCSI_OP_NOOP_IN		0x20
++#define ISCSI_OP_SCSI_CMD_RSP		0x21
++#define ISCSI_OP_SCSI_TMFUNC_RSP	0x22
++#define ISCSI_OP_LOGIN_RSP		0x23
++#define ISCSI_OP_TEXT_RSP		0x24
++#define ISCSI_OP_SCSI_DATA_IN		0x25
++#define ISCSI_OP_LOGOUT_RSP		0x26
++#define ISCSI_OP_R2T			0x31
++#define ISCSI_OP_ASYNC_EVENT		0x32
++#define ISCSI_OP_REJECT			0x3f
++
++struct iscsi_ahs_hdr {
++	__be16 ahslength;
++	uint8_t ahstype;
++	uint8_t ahspec[5];
++};
++
++#define ISCSI_AHSTYPE_CDB		1
++#define ISCSI_AHSTYPE_RLENGTH		2
++
++/* iSCSI PDU Header */
++struct iscsi_cmd {
++	uint8_t opcode;
++	uint8_t flags;
++	__be16 rsvd2;
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	__be32 itt;	/* Initiator Task Tag */
++	__be32 data_length;
++	__be32 cmdsn;
++	__be32 exp_statsn;
++	uint8_t cdb[16];	/* SCSI Command Block */
++	/* Additional Data (Command Dependent) */
++};
++
++/* Command PDU flags */
++#define ISCSI_FLAG_CMD_FINAL		0x80
++#define ISCSI_FLAG_CMD_READ		0x40
++#define ISCSI_FLAG_CMD_WRITE		0x20
++#define ISCSI_FLAG_CMD_ATTR_MASK	0x07	/* 3 bits */
++
++/* SCSI Command Attribute values */
++#define ISCSI_ATTR_UNTAGGED		0
++#define ISCSI_ATTR_SIMPLE		1
++#define ISCSI_ATTR_ORDERED		2
++#define ISCSI_ATTR_HEAD_OF_QUEUE	3
++#define ISCSI_ATTR_ACA			4
++
++struct iscsi_rlength_ahdr {
++	__be16 ahslength;
++	uint8_t ahstype;
++	uint8_t reserved;
++	__be32 read_length;
++};
++
++/* SCSI Response Header */
++struct iscsi_cmd_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t response;
++	uint8_t cmd_status;
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	rsvd1;
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	__be32	exp_datasn;
++	__be32	bi_residual_count;
++	__be32	residual_count;
++	/* Response or Sense Data (optional) */
++};
++
++/* Command Response PDU flags */
++#define ISCSI_FLAG_CMD_BIDI_OVERFLOW	0x10
++#define ISCSI_FLAG_CMD_BIDI_UNDERFLOW	0x08
++#define ISCSI_FLAG_CMD_OVERFLOW		0x04
++#define ISCSI_FLAG_CMD_UNDERFLOW	0x02
++
++/* iSCSI Status values. Valid if Rsp Selector bit is not set */
++#define ISCSI_STATUS_CMD_COMPLETED	0
++#define ISCSI_STATUS_TARGET_FAILURE	1
++#define ISCSI_STATUS_SUBSYS_FAILURE	2
++
++/* Asynchronous Event Header */
++struct iscsi_async {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2[2];
++	uint8_t rsvd3;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	uint8_t rsvd4[8];
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	uint8_t async_event;
++	uint8_t async_vcode;
++	__be16	param1;
++	__be16	param2;
++	__be16	param3;
++	uint8_t rsvd5[4];
++};
++
++/* iSCSI Event Codes */
++#define ISCSI_ASYNC_MSG_SCSI_EVENT			0
++#define ISCSI_ASYNC_MSG_REQUEST_LOGOUT			1
++#define ISCSI_ASYNC_MSG_DROPPING_CONNECTION		2
++#define ISCSI_ASYNC_MSG_DROPPING_ALL_CONNECTIONS	3
++#define ISCSI_ASYNC_MSG_PARAM_NEGOTIATION		4
++#define ISCSI_ASYNC_MSG_VENDOR_SPECIFIC			255
++
++/* NOP-Out Message */
++struct iscsi_nopout {
++	uint8_t opcode;
++	uint8_t flags;
++	__be16	rsvd2;
++	uint8_t rsvd3;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	ttt;	/* Target Transfer Tag */
++	__be32	cmdsn;
++	__be32	exp_statsn;
++	uint8_t rsvd4[16];
++};
++
++/* NOP-In Message */
++struct iscsi_nopin {
++	uint8_t opcode;
++	uint8_t flags;
++	__be16	rsvd2;
++	uint8_t rsvd3;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	ttt;	/* Target Transfer Tag */
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	uint8_t rsvd4[12];
++};
++
++/* SCSI Task Management Message Header */
++struct iscsi_tm {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd1[2];
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	rtt;	/* Reference Task Tag */
++	__be32	cmdsn;
++	__be32	exp_statsn;
++	__be32	refcmdsn;
++	__be32	exp_datasn;
++	uint8_t rsvd2[8];
++};
++
++#define ISCSI_FLAG_TM_FUNC_MASK			0x7F
++
++/* Function values */
++#define ISCSI_TM_FUNC_ABORT_TASK		1
++#define ISCSI_TM_FUNC_ABORT_TASK_SET		2
++#define ISCSI_TM_FUNC_CLEAR_ACA			3
++#define ISCSI_TM_FUNC_CLEAR_TASK_SET		4
++#define ISCSI_TM_FUNC_LOGICAL_UNIT_RESET	5
++#define ISCSI_TM_FUNC_TARGET_WARM_RESET		6
++#define ISCSI_TM_FUNC_TARGET_COLD_RESET		7
++#define ISCSI_TM_FUNC_TASK_REASSIGN		8
++
++/* SCSI Task Management Response Header */
++struct iscsi_tm_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t response;	/* see Response values below */
++	uint8_t qualifier;
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd2[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	rtt;	/* Reference Task Tag */
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	uint8_t rsvd3[12];
++};
++
++/* Response values */
++#define ISCSI_TMF_RSP_COMPLETE		0x00
++#define ISCSI_TMF_RSP_NO_TASK		0x01
++#define ISCSI_TMF_RSP_NO_LUN		0x02
++#define ISCSI_TMF_RSP_TASK_ALLEGIANT	0x03
++#define ISCSI_TMF_RSP_NO_FAILOVER	0x04
++#define ISCSI_TMF_RSP_NOT_SUPPORTED	0x05
++#define ISCSI_TMF_RSP_AUTH_FAILED	0x06
++#define ISCSI_TMF_RSP_REJECTED		0xff
++
++/* Ready To Transfer Header */
++struct iscsi_r2t_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2[2];
++	uint8_t	hlength;
++	uint8_t	dlength[3];
++	uint8_t lun[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	ttt;	/* Target Transfer Tag */
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	__be32	r2tsn;
++	__be32	data_offset;
++	__be32	data_length;
++};
++
++/* SCSI Data Hdr */
++struct iscsi_data {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2[2];
++	uint8_t rsvd3;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	__be32	itt;
++	__be32	ttt;
++	__be32	rsvd4;
++	__be32	exp_statsn;
++	__be32	rsvd5;
++	__be32	datasn;
++	__be32	offset;
++	__be32	rsvd6;
++	/* Payload */
++};
++
++/* SCSI Data Response Hdr */
++struct iscsi_data_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2;
++	uint8_t cmd_status;
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	__be32	itt;
++	__be32	ttt;
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	__be32	datasn;
++	__be32	offset;
++	__be32	residual_count;
++};
++
++/* Data Response PDU flags */
++#define ISCSI_FLAG_DATA_ACK		0x40
++#define ISCSI_FLAG_DATA_OVERFLOW	0x04
++#define ISCSI_FLAG_DATA_UNDERFLOW	0x02
++#define ISCSI_FLAG_DATA_STATUS		0x01
++
++/* Text Header */
++struct iscsi_text {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2[2];
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd4[8];
++	__be32	itt;
++	__be32	ttt;
++	__be32	cmdsn;
++	__be32	exp_statsn;
++	uint8_t rsvd5[16];
++	/* Text - key=value pairs */
++};
++
++#define ISCSI_FLAG_TEXT_CONTINUE	0x40
++
++/* Text Response Header */
++struct iscsi_text_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2[2];
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd4[8];
++	__be32	itt;
++	__be32	ttt;
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	uint8_t rsvd5[12];
++	/* Text Response - key=value pairs */
++};
++
++/* Login Header */
++struct iscsi_login {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t max_version;	/* Max. version supported */
++	uint8_t min_version;	/* Min. version supported */
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t isid[6];	/* Initiator Session ID */
++	__be16	tsih;	/* Target Session Handle */
++	__be32	itt;	/* Initiator Task Tag */
++	__be16	cid;
++	__be16	rsvd3;
++	__be32	cmdsn;
++	__be32	exp_statsn;
++	uint8_t rsvd5[16];
++};
++
++/* Login PDU flags */
++#define ISCSI_FLAG_LOGIN_TRANSIT		0x80
++#define ISCSI_FLAG_LOGIN_CONTINUE		0x40
++#define ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK	0x0C	/* 2 bits */
++#define ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK	0x03	/* 2 bits */
++
++#define ISCSI_LOGIN_CURRENT_STAGE(flags) \
++	(((flags) & ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK) >> 2)
++#define ISCSI_LOGIN_NEXT_STAGE(flags) \
++	((flags) & ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK)
++
++/* Login Response Header */
++struct iscsi_login_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t max_version;	/* Max. version supported */
++	uint8_t active_version;	/* Active version */
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t isid[6];	/* Initiator Session ID */
++	__be16	tsih;	/* Target Session Handle */
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	rsvd3;
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	uint8_t status_class;	/* see Login RSP status classes below */
++	uint8_t status_detail;	/* see Login RSP Status details below */
++	uint8_t rsvd4[10];
++};
++
++/* Login stage (phase) codes for CSG, NSG */
++#define ISCSI_INITIAL_LOGIN_STAGE		(-1)
++#define ISCSI_SECURITY_NEGOTIATION_STAGE	0
++#define ISCSI_OP_PARMS_NEGOTIATION_STAGE	1
++#define ISCSI_FULL_FEATURE_PHASE		3
++
++/* Login Status response classes */
++#define ISCSI_STATUS_CLS_SUCCESS		0x00
++#define ISCSI_STATUS_CLS_REDIRECT		0x01
++#define ISCSI_STATUS_CLS_INITIATOR_ERR		0x02
++#define ISCSI_STATUS_CLS_TARGET_ERR		0x03
++
++/* Login Status response detail codes */
++/* Class-0 (Success) */
++#define ISCSI_LOGIN_STATUS_ACCEPT		0x00
++
++/* Class-1 (Redirection) */
++#define ISCSI_LOGIN_STATUS_TGT_MOVED_TEMP	0x01
++#define ISCSI_LOGIN_STATUS_TGT_MOVED_PERM	0x02
++
++/* Class-2 (Initiator Error) */
++#define ISCSI_LOGIN_STATUS_INIT_ERR		0x00
++#define ISCSI_LOGIN_STATUS_AUTH_FAILED		0x01
++#define ISCSI_LOGIN_STATUS_TGT_FORBIDDEN	0x02
++#define ISCSI_LOGIN_STATUS_TGT_NOT_FOUND	0x03
++#define ISCSI_LOGIN_STATUS_TGT_REMOVED		0x04
++#define ISCSI_LOGIN_STATUS_NO_VERSION		0x05
++#define ISCSI_LOGIN_STATUS_ISID_ERROR		0x06
++#define ISCSI_LOGIN_STATUS_MISSING_FIELDS	0x07
++#define ISCSI_LOGIN_STATUS_CONN_ADD_FAILED	0x08
++#define ISCSI_LOGIN_STATUS_NO_SESSION_TYPE	0x09
++#define ISCSI_LOGIN_STATUS_NO_SESSION		0x0a
++#define ISCSI_LOGIN_STATUS_INVALID_REQUEST	0x0b
++
++/* Class-3 (Target Error) */
++#define ISCSI_LOGIN_STATUS_TARGET_ERROR		0x00
++#define ISCSI_LOGIN_STATUS_SVC_UNAVAILABLE	0x01
++#define ISCSI_LOGIN_STATUS_NO_RESOURCES		0x02
++
++/* Logout Header */
++struct iscsi_logout {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd1[2];
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd2[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be16	cid;
++	uint8_t rsvd3[2];
++	__be32	cmdsn;
++	__be32	exp_statsn;
++	uint8_t rsvd4[16];
++};
++
++/* Logout PDU flags */
++#define ISCSI_FLAG_LOGOUT_REASON_MASK	0x7F
++
++/* logout reason_code values */
++
++#define ISCSI_LOGOUT_REASON_CLOSE_SESSION	0
++#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION	1
++#define ISCSI_LOGOUT_REASON_RECOVERY		2
++#define ISCSI_LOGOUT_REASON_AEN_REQUEST		3
++
++/* Logout Response Header */
++struct iscsi_logout_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t response;	/* see Logout response values below */
++	uint8_t rsvd2;
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd3[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	rsvd4;
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	__be32	rsvd5;
++	__be16	t2wait;
++	__be16	t2retain;
++	__be32	rsvd6;
++};
++
++/* logout response status values */
++
++#define ISCSI_LOGOUT_SUCCESS			0
++#define ISCSI_LOGOUT_CID_NOT_FOUND		1
++#define ISCSI_LOGOUT_RECOVERY_UNSUPPORTED	2
++#define ISCSI_LOGOUT_CLEANUP_FAILED		3
++
++/* SNACK Header */
++struct iscsi_snack {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2[14];
++	__be32	itt;
++	__be32	begrun;
++	__be32	runlength;
++	__be32	exp_statsn;
++	__be32	rsvd3;
++	__be32	exp_datasn;
++	uint8_t rsvd6[8];
++};
++
++/* SNACK PDU flags */
++#define ISCSI_FLAG_SNACK_TYPE_MASK	0x0F	/* 4 bits */
++
++/* Reject Message Header */
++struct iscsi_reject {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t reason;
++	uint8_t rsvd2;
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd3[8];
++	__be32  ffffffff;
++	uint8_t rsvd4[4];
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	__be32	datasn;
++	uint8_t rsvd5[8];
++	/* Text - Rejected hdr */
++};
++
++/* Reason for Reject */
++#define ISCSI_REASON_CMD_BEFORE_LOGIN	1
++#define ISCSI_REASON_DATA_DIGEST_ERROR	2
++#define ISCSI_REASON_DATA_SNACK_REJECT	3
++#define ISCSI_REASON_PROTOCOL_ERROR	4
++#define ISCSI_REASON_CMD_NOT_SUPPORTED	5
++#define ISCSI_REASON_IMM_CMD_REJECT		6
++#define ISCSI_REASON_TASK_IN_PROGRESS	7
++#define ISCSI_REASON_INVALID_SNACK		8
++#define ISCSI_REASON_BOOKMARK_INVALID	9
++#define ISCSI_REASON_BOOKMARK_NO_RESOURCES	10
++#define ISCSI_REASON_NEGOTIATION_RESET	11
++
++/* Max. number of Key=Value pairs in a text message */
++#define MAX_KEY_VALUE_PAIRS	8192
++
++/* maximum length for text keys/values */
++#define KEY_MAXLEN		64
++#define VALUE_MAXLEN		255
++#define TARGET_NAME_MAXLEN	VALUE_MAXLEN
++
++#define DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH	8192
++
++/************************* RFC 3720 End *****************************/
++
++#endif /* ISCSI_PROTO_H */
diff --git a/kernel_patches/backport/2.6.9_U3/add_iser.patch b/kernel_patches/backport/2.6.9_U3/add_iser.patch
new file mode 100644
index 0000000..0da53d2
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/add_iser.patch
@@ -0,0 +1,13 @@
+diff -rup linux-2.6.20/drivers/infiniband/ulp/iser/iser_initiator.c linux-2.6.20-backport-rh4-u3/drivers/infiniband/ulp/iser/iser_initiator.c
+--- linux-2.6.20/drivers/infiniband/ulp/iser/iser_initiator.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/infiniband/ulp/iser/iser_initiator.c	2007-03-26 11:27:11.000000000 +0200
+@@ -618,7 +618,8 @@ void iser_snd_completion(struct iser_des
+ 
+ 	if (resume_tx) {
+ 		iser_dbg("%ld resuming tx\n",jiffies);
+-		scsi_queue_work(conn->session->host, &conn->xmitwork);
++		//scsi_queue_work(conn->session->host, &conn->xmitwork);
++		schedule_work(&conn->xmitwork);
+ 	}
+ 
+ 	if (tx_desc->type == ISCSI_TX_CONTROL) { 
diff --git a/kernel_patches/backport/2.6.9_U3/add_memory_h.patch b/kernel_patches/backport/2.6.9_U3/add_memory_h.patch
new file mode 100644
index 0000000..5daad2e
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/add_memory_h.patch
@@ -0,0 +1,93 @@
+diff -rupN linux-2.6.20-like-rh4/include/linux/memory.h linux-2.6.20/include/linux/memory.h
+--- linux-2.6.20-like-rh4/include/linux/memory.h	1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.20/include/linux/memory.h	2007-02-04 20:44:54.000000000 +0200
+@@ -0,0 +1,89 @@
++/*
++ * include/linux/memory.h - generic memory definition
++ *
++ * This is mainly for topological representation. We define the
++ * basic "struct memory_block" here, which can be embedded in per-arch
++ * definitions or NUMA information.
++ *
++ * Basic handling of the devices is done in drivers/base/memory.c
++ * and system devices are handled in drivers/base/sys.c.
++ *
++ * Memory blocks are exported via sysfs in the class/memory/devices/
++ * directory.
++ *
++ */
++#ifndef _LINUX_MEMORY_H_
++#define _LINUX_MEMORY_H_
++
++#include <linux/sysdev.h>
++#include <linux/node.h>
++#include <linux/compiler.h>
++
++#include <asm/semaphore.h>
++
++struct memory_block {
++	unsigned long phys_index;
++	unsigned long state;
++	/*
++	 * This serializes all state change requests.  It isn't
++	 * held during creation because the control files are
++	 * created long after the critical areas during
++	 * initialization.
++	 */
++	struct semaphore state_sem;
++	int phys_device;		/* to which fru does this belong? */
++	void *hw;			/* optional pointer to fw/hw data */
++	int (*phys_callback)(struct memory_block *);
++	struct sys_device sysdev;
++};
++
++/* These states are exposed to userspace as text strings in sysfs */
++#define	MEM_ONLINE		(1<<0) /* exposed to userspace */
++#define	MEM_GOING_OFFLINE	(1<<1) /* exposed to userspace */
++#define	MEM_OFFLINE		(1<<2) /* exposed to userspace */
++
++/*
++ * All of these states are currently kernel-internal for notifying
++ * kernel components and architectures.
++ *
++ * For MEM_MAPPING_INVALID, all notifier chains with priority >0
++ * are called before pfn_to_page() becomes invalid.  The priority=0
++ * entry is reserved for the function that actually makes
++ * pfn_to_page() stop working.  Any notifiers that want to be called
++ * after that should have priority <0.
++ */
++#define	MEM_MAPPING_INVALID	(1<<3)
++
++struct notifier_block;
++struct mem_section;
++
++#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE
++static inline int memory_dev_init(void)
++{
++	return 0;
++}
++static inline int register_memory_notifier(struct notifier_block *nb)
++{
++	return 0;
++}
++static inline void unregister_memory_notifier(struct notifier_block *nb)
++{
++}
++#else
++extern int register_new_memory(struct mem_section *);
++extern int unregister_memory_section(struct mem_section *);
++extern int memory_dev_init(void);
++extern int remove_memory_block(unsigned long, struct mem_section *, int);
++
++#define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
++
++
++#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
++
++#define hotplug_memory_notifier(fn, pri) {			\
++	static struct notifier_block fn##_mem_nb =		\
++		{ .notifier_call = fn, .priority = pri };	\
++	register_memory_notifier(&fn##_mem_nb);			\
++}
++
++#endif /* _LINUX_MEMORY_H_ */
diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch
new file mode 100644
index 0000000..d77c663
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch
@@ -0,0 +1,504 @@
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.c linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.c
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.c	2007-04-01 13:11:17.000000000 +0300
+@@ -108,7 +108,7 @@ iscsi_hdr_digest(struct iscsi_conn *conn
+ {
+ 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+ 
+-	crypto_hash_digest(&tcp_conn->tx_hash, &buf->sg, buf->sg.length, crc);
++	crypto_digest_digest(tcp_conn->tx_tfm, &buf->sg, 1, crc);
+ 	buf->sg.length = tcp_conn->hdr_size;
+ }
+ 
+@@ -419,7 +419,7 @@ iscsi_r2t_rsp(struct iscsi_conn *conn, s
+ 	tcp_ctask->xmstate |= XMSTATE_SOL_HDR;
+ 	list_move_tail(&ctask->running, &conn->xmitqueue);
+ 
+-	scsi_queue_work(session->host, &conn->xmitwork);
++	schedule_work(&conn->xmitwork);
+ 	conn->r2t_pdus_cnt++;
+ 	spin_unlock(&session->lock);
+ 
+@@ -468,8 +468,7 @@ iscsi_tcp_hdr_recv(struct iscsi_conn *co
+ 
+ 		sg_init_one(&sg, (u8 *)hdr,
+ 			    sizeof(struct iscsi_hdr) + ahslen);
+-		crypto_hash_digest(&tcp_conn->rx_hash, &sg, sg.length,
+-				   (u8 *)&cdgst);
++		crypto_digest_digest(tcp_conn->rx_tfm, &sg, 1, (u8 *)&cdgst);
+ 		rdgst = *(uint32_t*)((char*)hdr + sizeof(struct iscsi_hdr) +
+ 				     ahslen);
+ 		if (cdgst != rdgst) {
+@@ -676,7 +675,7 @@ iscsi_tcp_copy(struct iscsi_conn *conn, 
+ }
+ 
+ static inline void
+-partial_sg_digest_update(struct hash_desc *desc, struct scatterlist *sg,
++partial_sg_digest_update(struct crypto_tfm *tfm, struct scatterlist *sg,
+ 			 int offset, int length)
+ {
+ 	struct scatterlist temp;
+@@ -684,7 +683,7 @@ partial_sg_digest_update(struct hash_des
+ 	memcpy(&temp, sg, sizeof(struct scatterlist));
+ 	temp.offset = offset;
+ 	temp.length = length;
+-	crypto_hash_update(desc, &temp, length);
++	crypto_digest_update(tfm, &temp, 1);
+ }
+ 
+ static void
+@@ -693,7 +692,7 @@ iscsi_recv_digest_update(struct iscsi_tc
+ 	struct scatterlist tmp;
+ 
+ 	sg_init_one(&tmp, buf, len);
+-	crypto_hash_update(&tcp_conn->rx_hash, &tmp, len);
++	crypto_digest_update(tcp_conn->rx_tfm, &tmp, 1);
+ }
+ 
+ static int iscsi_scsi_data_in(struct iscsi_conn *conn)
+@@ -747,12 +746,12 @@ static int iscsi_scsi_data_in(struct isc
+ 		if (!rc) {
+ 			if (conn->datadgst_en) {
+ 				if (!offset)
+-					crypto_hash_update(
+-							&tcp_conn->rx_hash,
+-							&sg[i], sg[i].length);
++					crypto_digest_update(
++							tcp_conn->rx_tfm,
++							&sg[i], 1);
+ 				else
+ 					partial_sg_digest_update(
+-							&tcp_conn->rx_hash,
++							tcp_conn->rx_tfm,
+ 							&sg[i],
+ 							sg[i].offset + offset,
+ 							sg[i].length - offset);
+@@ -766,10 +765,9 @@ static int iscsi_scsi_data_in(struct isc
+ 				/*
+ 				 * data-in is complete, but buffer not...
+ 				 */
+-				partial_sg_digest_update(&tcp_conn->rx_hash,
+-							 &sg[i],
+-							 sg[i].offset,
+-							 sg[i].length-rc);
++				partial_sg_digest_update(tcp_conn->rx_tfm,
++						&sg[i],
++						sg[i].offset, sg[i].length-rc);
+ 			rc = 0;
+ 			break;
+ 		}
+@@ -887,7 +885,7 @@ more:
+ 		rc = iscsi_tcp_hdr_recv(conn);
+ 		if (!rc && tcp_conn->in.datalen) {
+ 			if (conn->datadgst_en)
+-				crypto_hash_init(&tcp_conn->rx_hash);
++				crypto_digest_init(tcp_conn->rx_tfm);
+ 			tcp_conn->in_progress = IN_PROGRESS_DATA_RECV;
+ 		} else if (rc) {
+ 			iscsi_conn_failure(conn, rc);
+@@ -944,11 +942,11 @@ more:
+ 					  tcp_conn->in.padding);
+ 				memset(pad, 0, tcp_conn->in.padding);
+ 				sg_init_one(&sg, pad, tcp_conn->in.padding);
+-				crypto_hash_update(&tcp_conn->rx_hash,
+-						   &sg, sg.length);
++				crypto_digest_update(tcp_conn->rx_tfm,
++						     &sg, 1);
+ 			}
+-			crypto_hash_final(&tcp_conn->rx_hash,
+-					  (u8 *) &tcp_conn->in.datadgst);
++			crypto_digest_final(tcp_conn->rx_tfm,
++					    (u8 *) &tcp_conn->in.datadgst);
+ 			debug_tcp("rx digest 0x%x\n", tcp_conn->in.datadgst);
+ 			tcp_conn->in_progress = IN_PROGRESS_DDIGEST_RECV;
+ 			tcp_conn->data_copied = 0;
+@@ -1043,7 +1041,7 @@ iscsi_write_space(struct sock *sk)
+ 
+ 	tcp_conn->old_write_space(sk);
+ 	debug_tcp("iscsi_write_space: cid %d\n", conn->id);
+-	scsi_queue_work(conn->session->host, &conn->xmitwork);
++	schedule_work(&conn->xmitwork);
+ }
+ 
+ static void
+@@ -1193,7 +1191,7 @@ static inline void
+ iscsi_data_digest_init(struct iscsi_tcp_conn *tcp_conn,
+ 		      struct iscsi_tcp_cmd_task *tcp_ctask)
+ {
+-	crypto_hash_init(&tcp_conn->tx_hash);
++	crypto_digest_init(tcp_conn->tx_tfm);
+ 	tcp_ctask->digest_count = 4;
+ }
+ 
+@@ -1449,9 +1447,8 @@ iscsi_send_padding(struct iscsi_conn *co
+ 		iscsi_buf_init_iov(&tcp_ctask->sendbuf, (char*)&tcp_ctask->pad,
+ 				   tcp_ctask->pad_count);
+ 		if (conn->datadgst_en)
+-			crypto_hash_update(&tcp_conn->tx_hash,
+-					   &tcp_ctask->sendbuf.sg,
+-					   tcp_ctask->sendbuf.sg.length);
++			crypto_digest_update(tcp_conn->tx_tfm,
++					     &tcp_ctask->sendbuf.sg, 1);
+ 	} else if (!(tcp_ctask->xmstate & XMSTATE_W_RESEND_PAD))
+ 		return 0;
+ 
+@@ -1483,7 +1480,7 @@ iscsi_send_digest(struct iscsi_conn *con
+ 	tcp_conn = conn->dd_data;
+ 
+ 	if (!(tcp_ctask->xmstate & XMSTATE_W_RESEND_DATA_DIGEST)) {
+-		crypto_hash_final(&tcp_conn->tx_hash, (u8*)digest);
++		crypto_digest_final(tcp_conn->tx_tfm, (u8*)digest);
+ 		iscsi_buf_init_iov(buf, (char*)digest, 4);
+ 	}
+ 	tcp_ctask->xmstate &= ~XMSTATE_W_RESEND_DATA_DIGEST;
+@@ -1517,7 +1514,7 @@ iscsi_send_data(struct iscsi_cmd_task *c
+ 		rc = iscsi_sendpage(conn, sendbuf, count, &buf_sent);
+ 		*sent = *sent + buf_sent;
+ 		if (buf_sent && conn->datadgst_en)
+-			partial_sg_digest_update(&tcp_conn->tx_hash,
++			partial_sg_digest_update(tcp_conn->tx_tfm,
+ 				&sendbuf->sg, sendbuf->sg.offset + offset,
+ 				buf_sent);
+ 		if (!iscsi_buf_left(sendbuf) && *sg != tcp_ctask->bad_sg) {
+@@ -1774,22 +1771,18 @@ iscsi_tcp_conn_create(struct iscsi_cls_s
+ 	/* initial operational parameters */
+ 	tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
+ 
+-	tcp_conn->tx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-						  CRYPTO_ALG_ASYNC);
+-	tcp_conn->tx_hash.flags = 0;
+-	if (IS_ERR(tcp_conn->tx_hash.tfm))
++	tcp_conn->tx_tfm = crypto_alloc_tfm("crc32c", 0);
++	if (!tcp_conn->tx_tfm)
+ 		goto free_tcp_conn;
+ 
+-	tcp_conn->rx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-						  CRYPTO_ALG_ASYNC);
+-	tcp_conn->rx_hash.flags = 0;
+-	if (IS_ERR(tcp_conn->rx_hash.tfm))
++	tcp_conn->rx_tfm = crypto_alloc_tfm("crc32c", 0);
++	if (!tcp_conn->rx_tfm)
+ 		goto free_tx_tfm;
+ 
+ 	return cls_conn;
+ 
+ free_tx_tfm:
+-	crypto_free_hash(tcp_conn->tx_hash.tfm);
++	crypto_free_tfm(tcp_conn->tx_tfm);
+ free_tcp_conn:
+ 	kfree(tcp_conn);
+ tcp_conn_alloc_fail:
+@@ -1823,10 +1816,10 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_
+ 	iscsi_tcp_release_conn(conn);
+ 	iscsi_conn_teardown(cls_conn);
+ 
+-	if (tcp_conn->tx_hash.tfm)
+-		crypto_free_hash(tcp_conn->tx_hash.tfm);
+-	if (tcp_conn->rx_hash.tfm)
+-		crypto_free_hash(tcp_conn->rx_hash.tfm);
++	if (tcp_conn->tx_tfm)
++		crypto_free_tfm(tcp_conn->tx_tfm);
++	if (tcp_conn->rx_tfm)
++		crypto_free_tfm(tcp_conn->rx_tfm);
+ 
+ 	kfree(tcp_conn);
+ }
+@@ -2017,7 +2010,7 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
+ {
+ 	struct iscsi_conn *conn = cls_conn->dd_data;
+ 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+-	struct inet_sock *inet;
++	struct inet_opt *inet;
+ 	struct ipv6_pinfo *np;
+ 	struct sock *sk;
+ 	int len;
+@@ -2044,11 +2037,13 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
+ 		sk = tcp_conn->sock->sk;
+ 		if (sk->sk_family == PF_INET) {
+ 			inet = inet_sk(sk);
+-			len = sprintf(buf, NIPQUAD_FMT "\n",
++			len = sprintf(buf, "%u.%u.%u.%u\n",
+ 				      NIPQUAD(inet->daddr));
+ 		} else {
+ 			np = inet6_sk(sk);
+-			len = sprintf(buf, NIP6_FMT "\n", NIP6(np->daddr));
++			len = sprintf(buf,
++				"%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
++				NIP6(np->daddr));
+ 		}
+ 		mutex_unlock(&conn->xmitmutex);
+ 		break;
+@@ -2135,7 +2130,6 @@ static void iscsi_tcp_session_destroy(st
+ static struct scsi_host_template iscsi_sht = {
+ 	.name			= "iSCSI Initiator over TCP/IP",
+ 	.queuecommand           = iscsi_queuecommand,
+-	.change_queue_depth	= iscsi_change_queue_depth,
+ 	.can_queue		= ISCSI_XMIT_CMDS_MAX - 1,
+ 	.sg_tablesize		= ISCSI_SG_TABLESIZE,
+ 	.cmd_per_lun		= ISCSI_DEF_CMD_PER_LUN,
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.h linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.h
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.h	2007-04-01 13:11:55.000000000 +0300
+@@ -49,7 +49,6 @@
+ #define ISCSI_SG_TABLESIZE		SG_ALL
+ #define ISCSI_TCP_MAX_CMD_LEN		16
+ 
+-struct crypto_hash;
+ struct socket;
+ 
+ /* Socket connection recieve helper */
+@@ -93,8 +92,8 @@ struct iscsi_tcp_conn {
+ 	void			(*old_write_space)(struct sock *);
+ 
+ 	/* data and header digests */
+-	struct hash_desc	tx_hash;	/* CRC32C (Tx) */
+-	struct hash_desc	rx_hash;	/* CRC32C (Rx) */
++	struct crypto_tfm	*tx_tfm;	/* CRC32C (Tx) */
++	struct crypto_tfm	*rx_tfm;	/* CRC32C (Rx) */
+ 
+ 	/* MIB custom statistics */
+ 	uint32_t		sendpage_failures_cnt;
+diff -rup linux-2.6.20/drivers/scsi/libiscsi.c linux-2.6.20-backport-rh4-u3/drivers/scsi/libiscsi.c
+--- linux-2.6.20/drivers/scsi/libiscsi.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/libiscsi.c	2007-04-01 13:15:57.000000000 +0300
+@@ -23,6 +23,7 @@
+  */
+ #include <linux/types.h>
+ #include <linux/mutex.h>
++#include <linux/gfp.h>
+ #include <linux/kfifo.h>
+ #include <linux/delay.h>
+ #include <net/tcp.h>
+@@ -831,7 +832,7 @@ int iscsi_queuecommand(struct scsi_cmnd 
+ 		session->cmdsn, session->max_cmdsn - session->exp_cmdsn + 1);
+ 	spin_unlock(&session->lock);
+ 
+-	scsi_queue_work(host, &conn->xmitwork);
++	schedule_work(&conn->xmitwork);
+ 	return 0;
+ 
+ reject:
+@@ -932,7 +933,7 @@ iscsi_conn_send_generic(struct iscsi_con
+ 	else
+ 	        __kfifo_put(conn->mgmtqueue, (void*)&mtask, sizeof(void*));
+ 
+-	scsi_queue_work(session->host, &conn->xmitwork);
++	schedule_work(&conn->xmitwork);
+ 	return 0;
+ }
+ 
+@@ -1370,7 +1371,6 @@ iscsi_session_setup(struct iscsi_transpo
+ 	shost->max_lun = iscsit->max_lun;
+ 	shost->max_cmd_len = iscsit->max_cmd_len;
+ 	shost->transportt = scsit;
+-	shost->transportt->create_work_queue = 1;
+ 	*hostno = shost->host_no;
+ 
+ 	session = iscsi_hostdata(shost->hostdata);
+diff -rup linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c linux-2.6.20-backport-rh4-u3/drivers/scsi/scsi_transport_iscsi.c
+--- linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/scsi_transport_iscsi.c	2007-04-01 13:18:33.000000000 +0300
+@@ -29,11 +29,15 @@
+ #include <scsi/scsi_transport.h>
+ #include <scsi/scsi_transport_iscsi.h>
+ #include <scsi/iscsi_if.h>
++#include <linux/transport_class.h>
++#include <linux/netlink.h>
+ 
+ #define ISCSI_SESSION_ATTRS 11
+ #define ISCSI_CONN_ATTRS 11
+ #define ISCSI_HOST_ATTRS 0
+-#define ISCSI_TRANSPORT_VERSION "2.0-724"
++#define ISCSI_TRANSPORT_VERSION "2.0-754"
++
++#define SCAN_WILD_CARD   ~0
+ 
+ struct iscsi_internal {
+ 	int daemon_pid;
+@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
+ #define cdev_to_iscsi_internal(_cdev) \
+ 	container_of(_cdev, struct iscsi_internal, cdev)
+ 
++extern int attribute_container_init(void);
++
+ static void iscsi_transport_release(struct class_device *cdev)
+ {
+ 	struct iscsi_internal *priv = cdev_to_iscsi_internal(cdev);
+@@ -80,6 +86,17 @@ static struct class iscsi_transport_clas
+ 	.release = iscsi_transport_release,
+ };
+ 
++static void iscsi_host_class_release(struct class_device *class_dev)
++{
++	struct Scsi_Host *shost = transport_class_to_shost(class_dev);
++	put_device(&shost->shost_gendev);
++}
++
++struct class iscsi_host_class = {
++	.name = "iscsi_host",
++	.release = iscsi_host_class_release,
++};
++
+ static ssize_t
+ show_transport_handle(struct class_device *cdev, char *buf)
+ {
+@@ -115,10 +132,8 @@ static struct attribute_group iscsi_tran
+ 	.attrs = iscsi_transport_attrs,
+ };
+ 
+-static int iscsi_setup_host(struct transport_container *tc, struct device *dev,
+-			    struct class_device *cdev)
++static int iscsi_setup_host(struct Scsi_Host *shost)
+ {
+-	struct Scsi_Host *shost = dev_to_shost(dev);
+ 	struct iscsi_host *ihost = shost->shost_data;
+ 
+ 	memset(ihost, 0, sizeof(*ihost));
+@@ -127,12 +142,6 @@ static int iscsi_setup_host(struct trans
+ 	return 0;
+ }
+ 
+-static DECLARE_TRANSPORT_CLASS(iscsi_host_class,
+-			       "iscsi_host",
+-			       iscsi_setup_host,
+-			       NULL,
+-			       NULL);
+-
+ static DECLARE_TRANSPORT_CLASS(iscsi_session_class,
+ 			       "iscsi_session",
+ 			       NULL,
+@@ -216,28 +225,10 @@ static int iscsi_is_session_dev(const st
+ 	return dev->release == iscsi_session_release;
+ }
+ 
+-static int iscsi_user_scan(struct Scsi_Host *shost, uint channel,
+-			   uint id, uint lun)
+-{
+-	struct iscsi_host *ihost = shost->shost_data;
+-	struct iscsi_cls_session *session;
+-
+-	mutex_lock(&ihost->mutex);
+-	list_for_each_entry(session, &ihost->sessions, host_list) {
+-		if ((channel == SCAN_WILD_CARD || channel == 0) &&
+-		    (id == SCAN_WILD_CARD || id == session->target_id))
+-			scsi_scan_target(&session->dev, 0,
+-					 session->target_id, lun, 1);
+-	}
+-	mutex_unlock(&ihost->mutex);
+-
+-	return 0;
+-}
+-
+-static void session_recovery_timedout(struct work_struct *work)
++static void session_recovery_timedout(void *data)
+ {
+ 	struct iscsi_cls_session *session =
+-		container_of(work, struct iscsi_cls_session,
++		container_of(data, struct iscsi_cls_session,
+ 			     recovery_work.work);
+ 
+ 	dev_printk(KERN_INFO, &session->dev, "iscsi: session recovery timed "
+@@ -362,8 +353,6 @@ void iscsi_remove_session(struct iscsi_c
+ 	list_del(&session->host_list);
+ 	mutex_unlock(&ihost->mutex);
+ 
+-	scsi_remove_target(&session->dev);
+-
+ 	transport_unregister_device(&session->dev);
+ 	device_del(&session->dev);
+ }
+@@ -452,6 +441,7 @@ iscsi_create_conn(struct iscsi_cls_sessi
+ 		goto release_parent_ref;
+ 	}
+ 	transport_register_device(&conn->dev);
++
+ 	return conn;
+ 
+ release_parent_ref:
+@@ -606,9 +596,8 @@ iscsi_if_send_reply(int pid, int seq, in
+ 	struct nlmsghdr	*nlh;
+ 	int len = NLMSG_SPACE(size);
+ 	int flags = multi ? NLM_F_MULTI : 0;
+-	int t = done ? NLMSG_DONE : type;
+ 
+-	skb = alloc_skb(len, GFP_ATOMIC);
++	skb = alloc_skb(len, GFP_KERNEL);
+ 	/*
+ 	 * FIXME:
+ 	 * user is supposed to react on iferror == -ENOMEM;
+@@ -649,7 +638,7 @@ iscsi_if_get_stats(struct iscsi_transpor
+ 	do {
+ 		int actual_size;
+ 
+-		skbstat = alloc_skb(len, GFP_ATOMIC);
++		skbstat = alloc_skb(len, GFP_KERNEL);
+ 		if (!skbstat) {
+ 			dev_printk(KERN_ERR, &conn->dev, "iscsi: can not "
+ 				   "deliver stats: OOM\n");
+@@ -1269,24 +1258,6 @@ static int iscsi_conn_match(struct attri
+ 	return &priv->conn_cont.ac == cont;
+ }
+ 
+-static int iscsi_host_match(struct attribute_container *cont,
+-			    struct device *dev)
+-{
+-	struct Scsi_Host *shost;
+-	struct iscsi_internal *priv;
+-
+-	if (!scsi_is_host_device(dev))
+-		return 0;
+-
+-	shost = dev_to_shost(dev);
+-	if (!shost->transportt  ||
+-	    shost->transportt->host_attrs.ac.class != &iscsi_host_class.class)
+-		return 0;
+-
+-        priv = to_iscsi_internal(shost->transportt);
+-        return &priv->t.host_attrs.ac == cont;
+-}
+-
+ struct scsi_transport_template *
+ iscsi_register_transport(struct iscsi_transport *tt)
+ {
+@@ -1306,7 +1277,6 @@ iscsi_register_transport(struct iscsi_tr
+ 	INIT_LIST_HEAD(&priv->list);
+ 	priv->daemon_pid = -1;
+ 	priv->iscsi_transport = tt;
+-	priv->t.user_scan = iscsi_user_scan;
+ 
+ 	priv->cdev.class = &iscsi_transport_class;
+ 	snprintf(priv->cdev.class_id, BUS_ID_SIZE, "%s", tt->name);
+@@ -1319,12 +1289,11 @@ iscsi_register_transport(struct iscsi_tr
+ 		goto unregister_cdev;
+ 
+ 	/* host parameters */
+-	priv->t.host_attrs.ac.attrs = &priv->host_attrs[0];
+-	priv->t.host_attrs.ac.class = &iscsi_host_class.class;
+-	priv->t.host_attrs.ac.match = iscsi_host_match;
++
++	priv->t.host_attrs = &priv->host_attrs[0];
++	priv->t.host_class = &iscsi_host_class;
++	priv->t.host_setup = iscsi_setup_host;
+ 	priv->t.host_size = sizeof(struct iscsi_host);
+-	priv->host_attrs[0] = NULL;
+-	transport_container_register(&priv->t.host_attrs);
+ 
+ 	/* connection parameters */
+ 	priv->conn_cont.ac.attrs = &priv->conn_attrs[0];
+@@ -1402,7 +1371,6 @@ int iscsi_unregister_transport(struct is
+ 
+ 	transport_container_unregister(&priv->conn_cont);
+ 	transport_container_unregister(&priv->session_cont);
+-	transport_container_unregister(&priv->t.host_attrs);
+ 
+ 	sysfs_remove_group(&priv->cdev.kobj, &iscsi_transport_group);
+ 	class_device_unregister(&priv->cdev);
+@@ -1419,6 +1387,8 @@ static __init int iscsi_transport_init(v
+ 	printk(KERN_INFO "Loading iSCSI transport class v%s.\n",
+ 		ISCSI_TRANSPORT_VERSION);
+ 
++	attribute_container_init();
++
+ 	err = class_register(&iscsi_transport_class);
+ 	if (err)
+ 		return err; 
diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch
new file mode 100644
index 0000000..6dd4429
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch
@@ -0,0 +1,60 @@
+diff -rupN linux-2.6.20-rc7/include/scsi/iscsi_compat.h linux-2.6.9/include/scsi/iscsi_compat.h
+--- linux-2.6.20-rc7/include/scsi/iscsi_compat.h	1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.9/include/scsi/iscsi_compat.h	2007-02-08 08:45:39.000000000 +0200
+@@ -0,0 +1,16 @@
++#ifndef ISCSI_COMPAT
++#define ISCSI_COMPAT
++
++#include <linux/version.h>
++#include <linux/kernel.h>
++#include <scsi/scsi.h>
++
++#define __nlmsg_put(skb, daemon_pid, seq, type, len, flags) \
++       __nlmsg_put(skb, daemon_pid, 0, 0, len)
++
++#define netlink_kernel_create(uint, groups, input, mod) \
++       netlink_kernel_create(uint, input)
++
++#define gfp_t unsigned
++
++#endif /* ISCSI_COMPAT */
+diff -rupN linux-2.6.20-rc7/include/scsi/iscsi_if.h linux-2.6.9/include/scsi/iscsi_if.h
+--- linux-2.6.20-rc7/include/scsi/iscsi_if.h	2006-11-29 23:57:37.000000000 +0200
++++ linux-2.6.9/include/scsi/iscsi_if.h	2007-02-04 12:50:15.000000000 +0200
+@@ -277,7 +277,6 @@ enum iscsi_param {
+  * These flags describes reason of stop_conn() call
+  */
+ #define STOP_CONN_TERM		0x1
+-#define STOP_CONN_SUSPEND	0x2
+ #define STOP_CONN_RECOVER	0x3
+ 
+ #define ISCSI_STATS_CUSTOM_MAX		32
+diff -rupN linux-2.6.20-rc7/include/scsi/libiscsi.h linux-2.6.9/include/scsi/libiscsi.h
+--- linux-2.6.20-rc7/include/scsi/libiscsi.h	2007-02-07 11:10:56.000000000 +0200
++++ linux-2.6.9/include/scsi/libiscsi.h	2007-02-07 15:51:59.000000000 +0200
+@@ -25,10 +25,9 @@
+ 
+ #include <linux/types.h>
+ #include <linux/mutex.h>
+-#include <linux/timer.h>
+-#include <linux/workqueue.h>
+ #include <scsi/iscsi_proto.h>
+ #include <scsi/iscsi_if.h>
++#include <scsi/iscsi_compat.h>
+ 
+ struct scsi_transport_template;
+ struct scsi_device;
+diff -rupN linux-2.6.20-rc7/include/scsi/scsi_transport_iscsi.h linux-2.6.9/include/scsi/scsi_transport_iscsi.h
+--- linux-2.6.20-rc7/include/scsi/scsi_transport_iscsi.h	2007-02-07 11:10:56.000000000 +0200
++++ linux-2.6.9/include/scsi/scsi_transport_iscsi.h	2007-02-07 15:52:50.000000000 +0200
+@@ -24,7 +24,9 @@
+ #define SCSI_TRANSPORT_ISCSI_H
+ 
+ #include <linux/device.h>
+-#include <scsi/iscsi_if.h>
++#include "iscsi_if.h"
++#include "iscsi_compat.h"
++//#include <../drivers/scsi/transport_class.h>
+ 
+ struct scsi_transport_template;
+ struct iscsi_transport;
diff --git a/kernel_patches/backport/2.6.9_U3/add_transport_class_h.patch b/kernel_patches/backport/2.6.9_U3/add_transport_class_h.patch
new file mode 100644
index 0000000..f2425e0
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/add_transport_class_h.patch
@@ -0,0 +1,104 @@
+diff -rupN linux-2.6.20-like-rh4/include/linux/transport_class.h linux-2.6.20/include/linux/transport_class.h
+--- linux-2.6.20-like-rh4/include/linux/transport_class.h	1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.20/include/linux/transport_class.h	2007-02-04 20:44:54.000000000 +0200
+@@ -0,0 +1,100 @@
++/*
++ * transport_class.h - a generic container for all transport classes
++ *
++ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
++ *
++ * This file is licensed under GPLv2
++ */
++
++#ifndef _TRANSPORT_CLASS_H_
++#define _TRANSPORT_CLASS_H_
++
++#include <linux/device.h>
++#include <linux/attribute_container.h>
++
++struct transport_container;
++
++struct transport_class {
++	struct class class;
++	int (*setup)(struct transport_container *, struct device *,
++		     struct class_device *);
++	int (*configure)(struct transport_container *, struct device *,
++			 struct class_device *);
++	int (*remove)(struct transport_container *, struct device *,
++		      struct class_device *);
++};
++
++#define DECLARE_TRANSPORT_CLASS(cls, nm, su, rm, cfg)			\
++struct transport_class cls = {						\
++	.class = {							\
++		.name = nm,						\
++	},								\
++	.setup = su,							\
++	.remove = rm,							\
++	.configure = cfg,						\
++}
++
++
++struct anon_transport_class {
++	struct transport_class tclass;
++	struct attribute_container container;
++};
++
++#define DECLARE_ANON_TRANSPORT_CLASS(cls, mtch, cfg)		\
++struct anon_transport_class cls = {				\
++	.tclass = {						\
++		.configure = cfg,				\
++	},							\
++	. container = {						\
++		.match = mtch,					\
++	},							\
++}
++
++#define class_to_transport_class(x) \
++	container_of(x, struct transport_class, class)
++
++struct transport_container {
++	struct attribute_container ac;
++	struct attribute_group *statistics;
++};
++
++#define attribute_container_to_transport_container(x) \
++	container_of(x, struct transport_container, ac)
++
++void transport_remove_device(struct device *);
++void transport_add_device(struct device *);
++void transport_setup_device(struct device *);
++void transport_configure_device(struct device *);
++void transport_destroy_device(struct device *);
++
++static inline void
++transport_register_device(struct device *dev)
++{
++	transport_setup_device(dev);
++	transport_add_device(dev);
++}
++
++static inline void
++transport_unregister_device(struct device *dev)
++{
++	transport_remove_device(dev);
++	transport_destroy_device(dev);
++}
++
++static inline int transport_container_register(struct transport_container *tc)
++{
++	return attribute_container_register(&tc->ac);
++}
++
++static inline int transport_container_unregister(struct transport_container *tc)
++{
++	return attribute_container_unregister(&tc->ac);
++}
++
++int transport_class_register(struct transport_class *);
++int anon_transport_class_register(struct anon_transport_class *);
++void transport_class_unregister(struct transport_class *);
++void anon_transport_class_unregister(struct anon_transport_class *);
++
++
++#endif
diff --git a/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch
new file mode 100644
index 0000000..3c2a969
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch
@@ -0,0 +1,13 @@
+--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:13:43.000000000 +0200
++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:14:31.000000000 +0200
+@@ -70,9 +70,8 @@
+ #include <scsi/scsi_tcq.h>
+ #include <scsi/scsi_host.h>
+ #include <scsi/scsi.h>
+-#include <scsi/scsi_transport_iscsi.h>
+-
+ #include "iscsi_iser.h"
++#include <scsi/scsi_transport_iscsi.h>
+ 
+ static unsigned int iscsi_max_lun = 512;
+ module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO);
diff --git a/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch b/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch
index e84b964..52c0136 100644
--- a/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch
+++ b/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch
@@ -19,6 +19,62 @@ index 0000000..58cf933
 +++ b/drivers/infiniband/core/kfifo.c
 @@ -0,0 +1 @@
 +#include "src/kfifo.c"
+diff --git a/drivers/infiniband/core/init.c b/drivers/infiniband/core/init.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/init.c
+@@ -0,0 +1 @@
++#include "src/init.c"
+diff --git a/drivers/infiniband/core/attribute_container.c b/drivers/infiniband/core/attribute_container.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/attribute_container.c
+@@ -0,0 +1 @@
++#include "src/attribute_container.c"
+diff --git a/drivers/infiniband/core/transport_class.c b/drivers/infiniband/core/transport_class.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/transport_class.c
+@@ -0,0 +1 @@
++#include "src/transport_class.c"
+diff --git a/drivers/infiniband/core/klist.c b/drivers/infiniband/core/klist.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/klist.c
+@@ -0,0 +1 @@
++#include "src/klist.c"
+diff --git a/drivers/infiniband/core/scsi.c b/drivers/infiniband/core/scsi.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/scsi.c
+@@ -0,0 +1 @@
++#include "src/scsi.c"
+diff --git a/drivers/infiniband/core/scsi_lib.c b/drivers/infiniband/core/scsi_lib.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/scsi_lib.c
+@@ -0,0 +1 @@
++#include "src/scsi_lib.c"
+diff --git a/drivers/infiniband/core/scsi_scan.c b/drivers/infiniband/core/scsi_scan.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/scsi_scan.c
+@@ -0,0 +1 @@
++#include "src/scsi_scan.c"
+diff --git a/drivers/infiniband/core/kref_new.c b/drivers/infiniband/core/kref_new.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/kref_new.c
+@@ -0,0 +1 @@
++#include "src/kref_new.c"
 diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
 index 50fb1cd..456bfd0 100644
 --- a/drivers/infiniband/core/Makefile
@@ -28,4 +84,4 @@ index 50fb1cd..456bfd0 100644
  ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_mem.o \
  				uverbs_marshall.o
 +
-+ib_core-y +=			genalloc.o netevent.o kfifo.o
++ib_core-y +=			genalloc.o netevent.o kfifo.o scsi.o scsi_lib.o scsi_scan.o init.o attribute_container.o transport_class.o klist.o kref_new.o
diff --git a/kernel_patches/backport/2.6.9_U3/netlink-01-add_netlink_h.patch b/kernel_patches/backport/2.6.9_U3/netlink-01-add_netlink_h.patch
new file mode 100644
index 0000000..cc071ef
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/netlink-01-add_netlink_h.patch
@@ -0,0 +1,247 @@
+diff -rupN linux-2.6.20-like-rh4/include/linux/netlink.h linux-2.6.20/include/linux/netlink.h
+--- linux-2.6.20-like-rh4/include/linux/netlink.h	1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.20/include/linux/netlink.h	2007-02-04 20:44:54.000000000 +0200
+@@ -0,0 +1,243 @@
++#ifndef __LINUX_NETLINK_H
++#define __LINUX_NETLINK_H
++
++#include <linux/socket.h> /* for sa_family_t */
++#include <linux/types.h>
++
++#define NETLINK_ROUTE		0	/* Routing/device hook				*/
++#define NETLINK_UNUSED		1	/* Unused number				*/
++#define NETLINK_USERSOCK	2	/* Reserved for user mode socket protocols 	*/
++#define NETLINK_FIREWALL	3	/* Firewalling hook				*/
++#define NETLINK_INET_DIAG	4	/* INET socket monitoring			*/
++#define NETLINK_NFLOG		5	/* netfilter/iptables ULOG */
++#define NETLINK_XFRM		6	/* ipsec */
++#define NETLINK_SELINUX		7	/* SELinux event notifications */
++#define NETLINK_ISCSI		8	/* Open-iSCSI */
++#define NETLINK_AUDIT		9	/* auditing */
++#define NETLINK_FIB_LOOKUP	10	
++#define NETLINK_CONNECTOR	11
++#define NETLINK_NETFILTER	12	/* netfilter subsystem */
++#define NETLINK_IP6_FW		13
++#define NETLINK_DNRTMSG		14	/* DECnet routing messages */
++#define NETLINK_KOBJECT_UEVENT	15	/* Kernel messages to userspace */
++#define NETLINK_GENERIC		16
++/* leave room for NETLINK_DM (DM Events) */
++#define NETLINK_SCSITRANSPORT	18	/* SCSI Transports */
++
++#define MAX_LINKS 32		
++
++struct sockaddr_nl
++{
++	sa_family_t	nl_family;	/* AF_NETLINK	*/
++	unsigned short	nl_pad;		/* zero		*/
++	__u32		nl_pid;		/* process pid	*/
++       	__u32		nl_groups;	/* multicast groups mask */
++};
++
++struct nlmsghdr
++{
++	__u32		nlmsg_len;	/* Length of message including header */
++	__u16		nlmsg_type;	/* Message content */
++	__u16		nlmsg_flags;	/* Additional flags */
++	__u32		nlmsg_seq;	/* Sequence number */
++	__u32		nlmsg_pid;	/* Sending process PID */
++};
++
++/* Flags values */
++
++#define NLM_F_REQUEST		1	/* It is request message. 	*/
++#define NLM_F_MULTI		2	/* Multipart message, terminated by NLMSG_DONE */
++#define NLM_F_ACK		4	/* Reply with ack, with zero or error code */
++#define NLM_F_ECHO		8	/* Echo this request 		*/
++
++/* Modifiers to GET request */
++#define NLM_F_ROOT	0x100	/* specify tree	root	*/
++#define NLM_F_MATCH	0x200	/* return all matching	*/
++#define NLM_F_ATOMIC	0x400	/* atomic GET		*/
++#define NLM_F_DUMP	(NLM_F_ROOT|NLM_F_MATCH)
++
++/* Modifiers to NEW request */
++#define NLM_F_REPLACE	0x100	/* Override existing		*/
++#define NLM_F_EXCL	0x200	/* Do not touch, if it exists	*/
++#define NLM_F_CREATE	0x400	/* Create, if it does not exist	*/
++#define NLM_F_APPEND	0x800	/* Add to end of list		*/
++
++/*
++   4.4BSD ADD		NLM_F_CREATE|NLM_F_EXCL
++   4.4BSD CHANGE	NLM_F_REPLACE
++
++   True CHANGE		NLM_F_CREATE|NLM_F_REPLACE
++   Append		NLM_F_CREATE
++   Check		NLM_F_EXCL
++ */
++
++#define NLMSG_ALIGNTO	4
++#define NLMSG_ALIGN(len) ( ((len)+NLMSG_ALIGNTO-1) & ~(NLMSG_ALIGNTO-1) )
++#define NLMSG_HDRLEN	 ((int) NLMSG_ALIGN(sizeof(struct nlmsghdr)))
++#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(NLMSG_HDRLEN))
++#define NLMSG_SPACE(len) NLMSG_ALIGN(NLMSG_LENGTH(len))
++#define NLMSG_DATA(nlh)  ((void*)(((char*)nlh) + NLMSG_LENGTH(0)))
++#define NLMSG_NEXT(nlh,len)	 ((len) -= NLMSG_ALIGN((nlh)->nlmsg_len), \
++				  (struct nlmsghdr*)(((char*)(nlh)) + NLMSG_ALIGN((nlh)->nlmsg_len)))
++#define NLMSG_OK(nlh,len) ((len) >= (int)sizeof(struct nlmsghdr) && \
++			   (nlh)->nlmsg_len >= sizeof(struct nlmsghdr) && \
++			   (nlh)->nlmsg_len <= (len))
++#define NLMSG_PAYLOAD(nlh,len) ((nlh)->nlmsg_len - NLMSG_SPACE((len)))
++
++#define NLMSG_NOOP		0x1	/* Nothing.		*/
++#define NLMSG_ERROR		0x2	/* Error		*/
++#define NLMSG_DONE		0x3	/* End of a dump	*/
++#define NLMSG_OVERRUN		0x4	/* Data lost		*/
++
++#define NLMSG_MIN_TYPE		0x10	/* < 0x10: reserved control messages */
++
++struct nlmsgerr
++{
++	int		error;
++	struct nlmsghdr msg;
++};
++
++#define NETLINK_ADD_MEMBERSHIP	1
++#define NETLINK_DROP_MEMBERSHIP	2
++#define NETLINK_PKTINFO		3
++
++struct nl_pktinfo
++{
++	__u32	group;
++};
++
++#define NET_MAJOR 36		/* Major 36 is reserved for networking 						*/
++
++enum {
++	NETLINK_UNCONNECTED = 0,
++	NETLINK_CONNECTED,
++};
++
++/*
++ *  <------- NLA_HDRLEN ------> <-- NLA_ALIGN(payload)-->
++ * +---------------------+- - -+- - - - - - - - - -+- - -+
++ * |        Header       | Pad |     Payload       | Pad |
++ * |   (struct nlattr)   | ing |                   | ing |
++ * +---------------------+- - -+- - - - - - - - - -+- - -+
++ *  <-------------- nlattr->nla_len -------------->
++ */
++
++struct nlattr
++{
++	__u16           nla_len;
++	__u16           nla_type;
++};
++
++#define NLA_ALIGNTO		4
++#define NLA_ALIGN(len)		(((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))
++#define NLA_HDRLEN		((int) NLA_ALIGN(sizeof(struct nlattr)))
++
++#ifdef __KERNEL__
++
++#include <linux/capability.h>
++#include <linux/skbuff.h>
++
++struct netlink_skb_parms
++{
++	struct ucred		creds;		/* Skb credentials	*/
++	__u32			pid;
++	__u32			dst_group;
++	kernel_cap_t		eff_cap;
++	__u32			loginuid;	/* Login (audit) uid */
++	__u32			sid;		/* SELinux security id */
++};
++
++#define NETLINK_CB(skb)		(*(struct netlink_skb_parms*)&((skb)->cb))
++#define NETLINK_CREDS(skb)	(&NETLINK_CB((skb)).creds)
++
++
++extern struct sock *netlink_kernel_create(int unit, unsigned int groups, void (*input)(struct sock *sk, int len), struct module *module);
++extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
++extern int netlink_has_listeners(struct sock *sk, unsigned int group);
++extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 pid, int nonblock);
++extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 pid,
++			     __u32 group, gfp_t allocation);
++extern void netlink_set_err(struct sock *ssk, __u32 pid, __u32 group, int code);
++extern int netlink_register_notifier(struct notifier_block *nb);
++extern int netlink_unregister_notifier(struct notifier_block *nb);
++
++/* finegrained unicast helpers: */
++struct sock *netlink_getsockbyfilp(struct file *filp);
++int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock,
++		long timeo, struct sock *ssk);
++void netlink_detachskb(struct sock *sk, struct sk_buff *skb);
++int netlink_sendskb(struct sock *sk, struct sk_buff *skb, int protocol);
++
++/*
++ *	skb should fit one page. This choice is good for headerless malloc.
++ */
++#define NLMSG_GOODORDER 0
++#define NLMSG_GOODSIZE (SKB_MAX_ORDER(0, NLMSG_GOODORDER))
++#define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN)
++
++
++struct netlink_callback
++{
++	struct sk_buff	*skb;
++	struct nlmsghdr	*nlh;
++	int		(*dump)(struct sk_buff * skb, struct netlink_callback *cb);
++	int		(*done)(struct netlink_callback *cb);
++	int		family;
++	long		args[5];
++};
++
++struct netlink_notify
++{
++	int pid;
++	int protocol;
++};
++
++static __inline__ struct nlmsghdr *
++__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len, int flags)
++{
++	struct nlmsghdr *nlh;
++	int size = NLMSG_LENGTH(len);
++
++	nlh = (struct nlmsghdr*)skb_put(skb, NLMSG_ALIGN(size));
++	nlh->nlmsg_type = type;
++	nlh->nlmsg_len = size;
++	nlh->nlmsg_flags = flags;
++	nlh->nlmsg_pid = pid;
++	nlh->nlmsg_seq = seq;
++	memset(NLMSG_DATA(nlh) + len, 0, NLMSG_ALIGN(size) - size);
++	return nlh;
++}
++
++#define NLMSG_NEW(skb, pid, seq, type, len, flags) \
++({	if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) \
++		goto nlmsg_failure; \
++	__nlmsg_put(skb, pid, seq, type, len, flags); })
++
++#define NLMSG_PUT(skb, pid, seq, type, len) \
++	NLMSG_NEW(skb, pid, seq, type, len, 0)
++
++#define NLMSG_NEW_ANSWER(skb, cb, type, len, flags) \
++	NLMSG_NEW(skb, NETLINK_CB((cb)->skb).pid, \
++		  (cb)->nlh->nlmsg_seq, type, len, flags)
++
++#define NLMSG_END(skb, nlh) \
++({	(nlh)->nlmsg_len = (skb)->tail - (unsigned char *) (nlh); \
++	(skb)->len; })
++
++#define NLMSG_CANCEL(skb, nlh) \
++({	skb_trim(skb, (unsigned char *) (nlh) - (skb)->data); \
++	-1; })
++
++extern int netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
++			      struct nlmsghdr *nlh,
++			      int (*dump)(struct sk_buff *skb, struct netlink_callback*),
++			      int (*done)(struct netlink_callback*));
++
++
++#define NL_NONROOT_RECV 0x1
++#define NL_NONROOT_SEND 0x2
++extern void netlink_set_nonroot(int protocol, unsigned flag);
++
++#endif /* __KERNEL__ */
++
++#endif	/* __LINUX_NETLINK_H */
diff --git a/kernel_patches/backport/2.6.9_U3/netlink-02-netlink_h_for_rh4.patch b/kernel_patches/backport/2.6.9_U3/netlink-02-netlink_h_for_rh4.patch
new file mode 100644
index 0000000..d9ba403
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/netlink-02-netlink_h_for_rh4.patch
@@ -0,0 +1,200 @@
+diff -rup linux-2.6.20/include/linux/netlink.h linux-2.6.20-backport-rh4-u3/include/linux/netlink.h
+--- linux-2.6.20/include/linux/netlink.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/include/linux/netlink.h	2007-03-08 10:09:43.000000000 +0200
+@@ -5,24 +5,19 @@
+ #include <linux/types.h>
+ 
+ #define NETLINK_ROUTE		0	/* Routing/device hook				*/
+-#define NETLINK_UNUSED		1	/* Unused number				*/
++#define NETLINK_SKIP		1	/* Reserved for ENskip  			*/
+ #define NETLINK_USERSOCK	2	/* Reserved for user mode socket protocols 	*/
+ #define NETLINK_FIREWALL	3	/* Firewalling hook				*/
+-#define NETLINK_INET_DIAG	4	/* INET socket monitoring			*/
++#define NETLINK_TCPDIAG		4	/* TCP socket monitoring			*/
+ #define NETLINK_NFLOG		5	/* netfilter/iptables ULOG */
+ #define NETLINK_XFRM		6	/* ipsec */
+ #define NETLINK_SELINUX		7	/* SELinux event notifications */
+-#define NETLINK_ISCSI		8	/* Open-iSCSI */
++#define NETLINK_ISCSI		8
+ #define NETLINK_AUDIT		9	/* auditing */
+-#define NETLINK_FIB_LOOKUP	10	
+-#define NETLINK_CONNECTOR	11
+-#define NETLINK_NETFILTER	12	/* netfilter subsystem */
++#define NETLINK_ROUTE6		11	/* af_inet6 route comm channel */
+ #define NETLINK_IP6_FW		13
+ #define NETLINK_DNRTMSG		14	/* DECnet routing messages */
+-#define NETLINK_KOBJECT_UEVENT	15	/* Kernel messages to userspace */
+-#define NETLINK_GENERIC		16
+-/* leave room for NETLINK_DM (DM Events) */
+-#define NETLINK_SCSITRANSPORT	18	/* SCSI Transports */
++#define NETLINK_TAPBASE		16	/* 16 to 31 are ethertap */
+ 
+ #define MAX_LINKS 32		
+ 
+@@ -73,8 +68,7 @@ struct nlmsghdr
+ 
+ #define NLMSG_ALIGNTO	4
+ #define NLMSG_ALIGN(len) ( ((len)+NLMSG_ALIGNTO-1) & ~(NLMSG_ALIGNTO-1) )
+-#define NLMSG_HDRLEN	 ((int) NLMSG_ALIGN(sizeof(struct nlmsghdr)))
+-#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(NLMSG_HDRLEN))
++#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(sizeof(struct nlmsghdr)))
+ #define NLMSG_SPACE(len) NLMSG_ALIGN(NLMSG_LENGTH(len))
+ #define NLMSG_DATA(nlh)  ((void*)(((char*)nlh) + NLMSG_LENGTH(0)))
+ #define NLMSG_NEXT(nlh,len)	 ((len) -= NLMSG_ALIGN((nlh)->nlmsg_len), \
+@@ -89,23 +83,12 @@ struct nlmsghdr
+ #define NLMSG_DONE		0x3	/* End of a dump	*/
+ #define NLMSG_OVERRUN		0x4	/* Data lost		*/
+ 
+-#define NLMSG_MIN_TYPE		0x10	/* < 0x10: reserved control messages */
+-
+ struct nlmsgerr
+ {
+ 	int		error;
+ 	struct nlmsghdr msg;
+ };
+ 
+-#define NETLINK_ADD_MEMBERSHIP	1
+-#define NETLINK_DROP_MEMBERSHIP	2
+-#define NETLINK_PKTINFO		3
+-
+-struct nl_pktinfo
+-{
+-	__u32	group;
+-};
+-
+ #define NET_MAJOR 36		/* Major 36 is reserved for networking 						*/
+ 
+ enum {
+@@ -113,25 +96,6 @@ enum {
+ 	NETLINK_CONNECTED,
+ };
+ 
+-/*
+- *  <------- NLA_HDRLEN ------> <-- NLA_ALIGN(payload)-->
+- * +---------------------+- - -+- - - - - - - - - -+- - -+
+- * |        Header       | Pad |     Payload       | Pad |
+- * |   (struct nlattr)   | ing |                   | ing |
+- * +---------------------+- - -+- - - - - - - - - -+- - -+
+- *  <-------------- nlattr->nla_len -------------->
+- */
+-
+-struct nlattr
+-{
+-	__u16           nla_len;
+-	__u16           nla_type;
+-};
+-
+-#define NLA_ALIGNTO		4
+-#define NLA_ALIGN(len)		(((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))
+-#define NLA_HDRLEN		((int) NLA_ALIGN(sizeof(struct nlattr)))
+-
+ #ifdef __KERNEL__
+ 
+ #include <linux/capability.h>
+@@ -141,39 +105,42 @@ struct netlink_skb_parms
+ {
+ 	struct ucred		creds;		/* Skb credentials	*/
+ 	__u32			pid;
+-	__u32			dst_group;
++	__u32			groups;
++	__u32			dst_pid;
++	__u32			dst_groups;
+ 	kernel_cap_t		eff_cap;
+ 	__u32			loginuid;	/* Login (audit) uid */
+-	__u32			sid;		/* SELinux security id */
+ };
+ 
+ #define NETLINK_CB(skb)		(*(struct netlink_skb_parms*)&((skb)->cb))
+ #define NETLINK_CREDS(skb)	(&NETLINK_CB((skb)).creds)
+ 
+ 
+-extern struct sock *netlink_kernel_create(int unit, unsigned int groups, void (*input)(struct sock *sk, int len), struct module *module);
++extern int netlink_attach(int unit, int (*function)(int,struct sk_buff *skb));
++extern void netlink_detach(int unit);
++extern int netlink_post(int unit, struct sk_buff *skb);
++extern struct sock *netlink_kernel_create(int unit, void (*input)(struct sock *sk, int len));
+ extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
+-extern int netlink_has_listeners(struct sock *sk, unsigned int group);
+ extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 pid, int nonblock);
+ extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 pid,
+-			     __u32 group, gfp_t allocation);
++			     __u32 group, int allocation);
+ extern void netlink_set_err(struct sock *ssk, __u32 pid, __u32 group, int code);
+ extern int netlink_register_notifier(struct notifier_block *nb);
+ extern int netlink_unregister_notifier(struct notifier_block *nb);
+ 
+ /* finegrained unicast helpers: */
++struct sock *netlink_getsockbypid(struct sock *ssk, u32 pid);
+ struct sock *netlink_getsockbyfilp(struct file *filp);
+-int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock,
+-		long timeo, struct sock *ssk);
++int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock, long timeo);
+ void netlink_detachskb(struct sock *sk, struct sk_buff *skb);
+ int netlink_sendskb(struct sock *sk, struct sk_buff *skb, int protocol);
+ 
+ /*
+  *	skb should fit one page. This choice is good for headerless malloc.
++ *
++ *      FIXME: What is the best size for SLAB???? --ANK
+  */
+-#define NLMSG_GOODORDER 0
+-#define NLMSG_GOODSIZE (SKB_MAX_ORDER(0, NLMSG_GOODORDER))
+-#define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN)
++#define NLMSG_GOODSIZE (PAGE_SIZE - ((sizeof(struct sk_buff)+0xF)&~0xF))
+ 
+ 
+ struct netlink_callback
+@@ -183,7 +150,7 @@ struct netlink_callback
+ 	int		(*dump)(struct sk_buff * skb, struct netlink_callback *cb);
+ 	int		(*done)(struct netlink_callback *cb);
+ 	int		family;
+-	long		args[5];
++	long		args[4];
+ };
+ 
+ struct netlink_notify
+@@ -193,7 +160,7 @@ struct netlink_notify
+ };
+ 
+ static __inline__ struct nlmsghdr *
+-__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len, int flags)
++__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len)
+ {
+ 	struct nlmsghdr *nlh;
+ 	int size = NLMSG_LENGTH(len);
+@@ -201,32 +168,15 @@ __nlmsg_put(struct sk_buff *skb, u32 pid
+ 	nlh = (struct nlmsghdr*)skb_put(skb, NLMSG_ALIGN(size));
+ 	nlh->nlmsg_type = type;
+ 	nlh->nlmsg_len = size;
+-	nlh->nlmsg_flags = flags;
++	nlh->nlmsg_flags = 0;
+ 	nlh->nlmsg_pid = pid;
+ 	nlh->nlmsg_seq = seq;
+-	memset(NLMSG_DATA(nlh) + len, 0, NLMSG_ALIGN(size) - size);
+ 	return nlh;
+ }
+ 
+-#define NLMSG_NEW(skb, pid, seq, type, len, flags) \
+-({	if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) \
+-		goto nlmsg_failure; \
+-	__nlmsg_put(skb, pid, seq, type, len, flags); })
+-
+ #define NLMSG_PUT(skb, pid, seq, type, len) \
+-	NLMSG_NEW(skb, pid, seq, type, len, 0)
+-
+-#define NLMSG_NEW_ANSWER(skb, cb, type, len, flags) \
+-	NLMSG_NEW(skb, NETLINK_CB((cb)->skb).pid, \
+-		  (cb)->nlh->nlmsg_seq, type, len, flags)
+-
+-#define NLMSG_END(skb, nlh) \
+-({	(nlh)->nlmsg_len = (skb)->tail - (unsigned char *) (nlh); \
+-	(skb)->len; })
+-
+-#define NLMSG_CANCEL(skb, nlh) \
+-({	skb_trim(skb, (unsigned char *) (nlh) - (skb)->data); \
+-	-1; })
++({ if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) goto nlmsg_failure; \
++   __nlmsg_put(skb, pid, seq, type, len); })
+ 
+ extern int netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
+ 			      struct nlmsghdr *nlh,
diff --git a/kernel_patches/backport/2.6.9_U4/add_iscsi_proto_h.patch b/kernel_patches/backport/2.6.9_U4/add_iscsi_proto_h.patch
new file mode 100644
index 0000000..c4df6bb
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_iscsi_proto_h.patch
@@ -0,0 +1,591 @@
+diff -rupN linux-2.6.20-like-rh4/include/scsi/iscsi_proto.h linux-2.6.20/include/scsi/iscsi_proto.h
+--- linux-2.6.20-like-rh4/include/scsi/iscsi_proto.h	1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.20/include/scsi/iscsi_proto.h	2007-02-04 20:44:54.000000000 +0200
+@@ -0,0 +1,587 @@
++/*
++ * RFC 3720 (iSCSI) protocol data types
++ *
++ * Copyright (C) 2005 Dmitry Yusupov
++ * Copyright (C) 2005 Alex Aizman
++ * maintained by open-iscsi at googlegroups.com
++ *
++ * This program is free software; you can redistribute it and/or modify
++ * it under the terms of the GNU General Public License as published
++ * by the Free Software Foundation; either version 2 of the License, or
++ * (at your option) any later version.
++ *
++ * This program is distributed in the hope that it will be useful, but
++ * WITHOUT ANY WARRANTY; without even the implied warranty of
++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
++ * General Public License for more details.
++ *
++ * See the file COPYING included with this distribution for more details.
++ */
++
++#ifndef ISCSI_PROTO_H
++#define ISCSI_PROTO_H
++
++#define ISCSI_DRAFT20_VERSION	0x00
++
++/* default iSCSI listen port for incoming connections */
++#define ISCSI_LISTEN_PORT	3260
++
++/* Padding word length */
++#define PAD_WORD_LEN		4
++
++/*
++ * useful common (control and data paths) macros
++ */
++#define ntoh24(p) (((p)[0] << 16) | ((p)[1] << 8) | ((p)[2]))
++#define hton24(p, v) { \
++        p[0] = (((v) >> 16) & 0xFF); \
++        p[1] = (((v) >> 8) & 0xFF); \
++        p[2] = ((v) & 0xFF); \
++}
++#define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;}
++
++/*
++ * iSCSI Template Message Header
++ */
++struct iscsi_hdr {
++	uint8_t		opcode;
++	uint8_t		flags;		/* Final bit */
++	uint8_t		rsvd2[2];
++	uint8_t		hlength;	/* AHSs total length */
++	uint8_t		dlength[3];	/* Data length */
++	uint8_t		lun[8];
++	__be32		itt;		/* Initiator Task Tag */
++	__be32		ttt;		/* Target Task Tag */
++	__be32		statsn;
++	__be32		exp_statsn;
++	__be32		max_statsn;
++	uint8_t		other[12];
++};
++
++/************************* RFC 3720 Begin *****************************/
++
++#define ISCSI_RESERVED_TAG		0xffffffff
++
++/* Opcode encoding bits */
++#define ISCSI_OP_RETRY			0x80
++#define ISCSI_OP_IMMEDIATE		0x40
++#define ISCSI_OPCODE_MASK		0x3F
++
++/* Initiator Opcode values */
++#define ISCSI_OP_NOOP_OUT		0x00
++#define ISCSI_OP_SCSI_CMD		0x01
++#define ISCSI_OP_SCSI_TMFUNC		0x02
++#define ISCSI_OP_LOGIN			0x03
++#define ISCSI_OP_TEXT			0x04
++#define ISCSI_OP_SCSI_DATA_OUT		0x05
++#define ISCSI_OP_LOGOUT			0x06
++#define ISCSI_OP_SNACK			0x10
++
++#define ISCSI_OP_VENDOR1_CMD		0x1c
++#define ISCSI_OP_VENDOR2_CMD		0x1d
++#define ISCSI_OP_VENDOR3_CMD		0x1e
++#define ISCSI_OP_VENDOR4_CMD		0x1f
++
++/* Target Opcode values */
++#define ISCSI_OP_NOOP_IN		0x20
++#define ISCSI_OP_SCSI_CMD_RSP		0x21
++#define ISCSI_OP_SCSI_TMFUNC_RSP	0x22
++#define ISCSI_OP_LOGIN_RSP		0x23
++#define ISCSI_OP_TEXT_RSP		0x24
++#define ISCSI_OP_SCSI_DATA_IN		0x25
++#define ISCSI_OP_LOGOUT_RSP		0x26
++#define ISCSI_OP_R2T			0x31
++#define ISCSI_OP_ASYNC_EVENT		0x32
++#define ISCSI_OP_REJECT			0x3f
++
++struct iscsi_ahs_hdr {
++	__be16 ahslength;
++	uint8_t ahstype;
++	uint8_t ahspec[5];
++};
++
++#define ISCSI_AHSTYPE_CDB		1
++#define ISCSI_AHSTYPE_RLENGTH		2
++
++/* iSCSI PDU Header */
++struct iscsi_cmd {
++	uint8_t opcode;
++	uint8_t flags;
++	__be16 rsvd2;
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	__be32 itt;	/* Initiator Task Tag */
++	__be32 data_length;
++	__be32 cmdsn;
++	__be32 exp_statsn;
++	uint8_t cdb[16];	/* SCSI Command Block */
++	/* Additional Data (Command Dependent) */
++};
++
++/* Command PDU flags */
++#define ISCSI_FLAG_CMD_FINAL		0x80
++#define ISCSI_FLAG_CMD_READ		0x40
++#define ISCSI_FLAG_CMD_WRITE		0x20
++#define ISCSI_FLAG_CMD_ATTR_MASK	0x07	/* 3 bits */
++
++/* SCSI Command Attribute values */
++#define ISCSI_ATTR_UNTAGGED		0
++#define ISCSI_ATTR_SIMPLE		1
++#define ISCSI_ATTR_ORDERED		2
++#define ISCSI_ATTR_HEAD_OF_QUEUE	3
++#define ISCSI_ATTR_ACA			4
++
++struct iscsi_rlength_ahdr {
++	__be16 ahslength;
++	uint8_t ahstype;
++	uint8_t reserved;
++	__be32 read_length;
++};
++
++/* SCSI Response Header */
++struct iscsi_cmd_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t response;
++	uint8_t cmd_status;
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	rsvd1;
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	__be32	exp_datasn;
++	__be32	bi_residual_count;
++	__be32	residual_count;
++	/* Response or Sense Data (optional) */
++};
++
++/* Command Response PDU flags */
++#define ISCSI_FLAG_CMD_BIDI_OVERFLOW	0x10
++#define ISCSI_FLAG_CMD_BIDI_UNDERFLOW	0x08
++#define ISCSI_FLAG_CMD_OVERFLOW		0x04
++#define ISCSI_FLAG_CMD_UNDERFLOW	0x02
++
++/* iSCSI Status values. Valid if Rsp Selector bit is not set */
++#define ISCSI_STATUS_CMD_COMPLETED	0
++#define ISCSI_STATUS_TARGET_FAILURE	1
++#define ISCSI_STATUS_SUBSYS_FAILURE	2
++
++/* Asynchronous Event Header */
++struct iscsi_async {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2[2];
++	uint8_t rsvd3;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	uint8_t rsvd4[8];
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	uint8_t async_event;
++	uint8_t async_vcode;
++	__be16	param1;
++	__be16	param2;
++	__be16	param3;
++	uint8_t rsvd5[4];
++};
++
++/* iSCSI Event Codes */
++#define ISCSI_ASYNC_MSG_SCSI_EVENT			0
++#define ISCSI_ASYNC_MSG_REQUEST_LOGOUT			1
++#define ISCSI_ASYNC_MSG_DROPPING_CONNECTION		2
++#define ISCSI_ASYNC_MSG_DROPPING_ALL_CONNECTIONS	3
++#define ISCSI_ASYNC_MSG_PARAM_NEGOTIATION		4
++#define ISCSI_ASYNC_MSG_VENDOR_SPECIFIC			255
++
++/* NOP-Out Message */
++struct iscsi_nopout {
++	uint8_t opcode;
++	uint8_t flags;
++	__be16	rsvd2;
++	uint8_t rsvd3;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	ttt;	/* Target Transfer Tag */
++	__be32	cmdsn;
++	__be32	exp_statsn;
++	uint8_t rsvd4[16];
++};
++
++/* NOP-In Message */
++struct iscsi_nopin {
++	uint8_t opcode;
++	uint8_t flags;
++	__be16	rsvd2;
++	uint8_t rsvd3;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	ttt;	/* Target Transfer Tag */
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	uint8_t rsvd4[12];
++};
++
++/* SCSI Task Management Message Header */
++struct iscsi_tm {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd1[2];
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	rtt;	/* Reference Task Tag */
++	__be32	cmdsn;
++	__be32	exp_statsn;
++	__be32	refcmdsn;
++	__be32	exp_datasn;
++	uint8_t rsvd2[8];
++};
++
++#define ISCSI_FLAG_TM_FUNC_MASK			0x7F
++
++/* Function values */
++#define ISCSI_TM_FUNC_ABORT_TASK		1
++#define ISCSI_TM_FUNC_ABORT_TASK_SET		2
++#define ISCSI_TM_FUNC_CLEAR_ACA			3
++#define ISCSI_TM_FUNC_CLEAR_TASK_SET		4
++#define ISCSI_TM_FUNC_LOGICAL_UNIT_RESET	5
++#define ISCSI_TM_FUNC_TARGET_WARM_RESET		6
++#define ISCSI_TM_FUNC_TARGET_COLD_RESET		7
++#define ISCSI_TM_FUNC_TASK_REASSIGN		8
++
++/* SCSI Task Management Response Header */
++struct iscsi_tm_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t response;	/* see Response values below */
++	uint8_t qualifier;
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd2[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	rtt;	/* Reference Task Tag */
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	uint8_t rsvd3[12];
++};
++
++/* Response values */
++#define ISCSI_TMF_RSP_COMPLETE		0x00
++#define ISCSI_TMF_RSP_NO_TASK		0x01
++#define ISCSI_TMF_RSP_NO_LUN		0x02
++#define ISCSI_TMF_RSP_TASK_ALLEGIANT	0x03
++#define ISCSI_TMF_RSP_NO_FAILOVER	0x04
++#define ISCSI_TMF_RSP_NOT_SUPPORTED	0x05
++#define ISCSI_TMF_RSP_AUTH_FAILED	0x06
++#define ISCSI_TMF_RSP_REJECTED		0xff
++
++/* Ready To Transfer Header */
++struct iscsi_r2t_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2[2];
++	uint8_t	hlength;
++	uint8_t	dlength[3];
++	uint8_t lun[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	ttt;	/* Target Transfer Tag */
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	__be32	r2tsn;
++	__be32	data_offset;
++	__be32	data_length;
++};
++
++/* SCSI Data Hdr */
++struct iscsi_data {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2[2];
++	uint8_t rsvd3;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	__be32	itt;
++	__be32	ttt;
++	__be32	rsvd4;
++	__be32	exp_statsn;
++	__be32	rsvd5;
++	__be32	datasn;
++	__be32	offset;
++	__be32	rsvd6;
++	/* Payload */
++};
++
++/* SCSI Data Response Hdr */
++struct iscsi_data_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2;
++	uint8_t cmd_status;
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t lun[8];
++	__be32	itt;
++	__be32	ttt;
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	__be32	datasn;
++	__be32	offset;
++	__be32	residual_count;
++};
++
++/* Data Response PDU flags */
++#define ISCSI_FLAG_DATA_ACK		0x40
++#define ISCSI_FLAG_DATA_OVERFLOW	0x04
++#define ISCSI_FLAG_DATA_UNDERFLOW	0x02
++#define ISCSI_FLAG_DATA_STATUS		0x01
++
++/* Text Header */
++struct iscsi_text {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2[2];
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd4[8];
++	__be32	itt;
++	__be32	ttt;
++	__be32	cmdsn;
++	__be32	exp_statsn;
++	uint8_t rsvd5[16];
++	/* Text - key=value pairs */
++};
++
++#define ISCSI_FLAG_TEXT_CONTINUE	0x40
++
++/* Text Response Header */
++struct iscsi_text_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2[2];
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd4[8];
++	__be32	itt;
++	__be32	ttt;
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	uint8_t rsvd5[12];
++	/* Text Response - key=value pairs */
++};
++
++/* Login Header */
++struct iscsi_login {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t max_version;	/* Max. version supported */
++	uint8_t min_version;	/* Min. version supported */
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t isid[6];	/* Initiator Session ID */
++	__be16	tsih;	/* Target Session Handle */
++	__be32	itt;	/* Initiator Task Tag */
++	__be16	cid;
++	__be16	rsvd3;
++	__be32	cmdsn;
++	__be32	exp_statsn;
++	uint8_t rsvd5[16];
++};
++
++/* Login PDU flags */
++#define ISCSI_FLAG_LOGIN_TRANSIT		0x80
++#define ISCSI_FLAG_LOGIN_CONTINUE		0x40
++#define ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK	0x0C	/* 2 bits */
++#define ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK	0x03	/* 2 bits */
++
++#define ISCSI_LOGIN_CURRENT_STAGE(flags) \
++	((flags & ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK) >> 2)
++#define ISCSI_LOGIN_NEXT_STAGE(flags) \
++	(flags & ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK)
++
++/* Login Response Header */
++struct iscsi_login_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t max_version;	/* Max. version supported */
++	uint8_t active_version;	/* Active version */
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t isid[6];	/* Initiator Session ID */
++	__be16	tsih;	/* Target Session Handle */
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	rsvd3;
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	uint8_t status_class;	/* see Login RSP status classes below */
++	uint8_t status_detail;	/* see Login RSP Status details below */
++	uint8_t rsvd4[10];
++};
++
++/* Login stage (phase) codes for CSG, NSG */
++#define ISCSI_INITIAL_LOGIN_STAGE		-1
++#define ISCSI_SECURITY_NEGOTIATION_STAGE	0
++#define ISCSI_OP_PARMS_NEGOTIATION_STAGE	1
++#define ISCSI_FULL_FEATURE_PHASE		3
++
++/* Login Status response classes */
++#define ISCSI_STATUS_CLS_SUCCESS		0x00
++#define ISCSI_STATUS_CLS_REDIRECT		0x01
++#define ISCSI_STATUS_CLS_INITIATOR_ERR		0x02
++#define ISCSI_STATUS_CLS_TARGET_ERR		0x03
++
++/* Login Status response detail codes */
++/* Class-0 (Success) */
++#define ISCSI_LOGIN_STATUS_ACCEPT		0x00
++
++/* Class-1 (Redirection) */
++#define ISCSI_LOGIN_STATUS_TGT_MOVED_TEMP	0x01
++#define ISCSI_LOGIN_STATUS_TGT_MOVED_PERM	0x02
++
++/* Class-2 (Initiator Error) */
++#define ISCSI_LOGIN_STATUS_INIT_ERR		0x00
++#define ISCSI_LOGIN_STATUS_AUTH_FAILED		0x01
++#define ISCSI_LOGIN_STATUS_TGT_FORBIDDEN	0x02
++#define ISCSI_LOGIN_STATUS_TGT_NOT_FOUND	0x03
++#define ISCSI_LOGIN_STATUS_TGT_REMOVED		0x04
++#define ISCSI_LOGIN_STATUS_NO_VERSION		0x05
++#define ISCSI_LOGIN_STATUS_ISID_ERROR		0x06
++#define ISCSI_LOGIN_STATUS_MISSING_FIELDS	0x07
++#define ISCSI_LOGIN_STATUS_CONN_ADD_FAILED	0x08
++#define ISCSI_LOGIN_STATUS_NO_SESSION_TYPE	0x09
++#define ISCSI_LOGIN_STATUS_NO_SESSION		0x0a
++#define ISCSI_LOGIN_STATUS_INVALID_REQUEST	0x0b
++
++/* Class-3 (Target Error) */
++#define ISCSI_LOGIN_STATUS_TARGET_ERROR		0x00
++#define ISCSI_LOGIN_STATUS_SVC_UNAVAILABLE	0x01
++#define ISCSI_LOGIN_STATUS_NO_RESOURCES		0x02
++
++/* Logout Header */
++struct iscsi_logout {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd1[2];
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd2[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be16	cid;
++	uint8_t rsvd3[2];
++	__be32	cmdsn;
++	__be32	exp_statsn;
++	uint8_t rsvd4[16];
++};
++
++/* Logout PDU flags */
++#define ISCSI_FLAG_LOGOUT_REASON_MASK	0x7F
++
++/* logout reason_code values */
++
++#define ISCSI_LOGOUT_REASON_CLOSE_SESSION	0
++#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION	1
++#define ISCSI_LOGOUT_REASON_RECOVERY		2
++#define ISCSI_LOGOUT_REASON_AEN_REQUEST		3
++
++/* Logout Response Header */
++struct iscsi_logout_rsp {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t response;	/* see Logout response values below */
++	uint8_t rsvd2;
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd3[8];
++	__be32	itt;	/* Initiator Task Tag */
++	__be32	rsvd4;
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	__be32	rsvd5;
++	__be16	t2wait;
++	__be16	t2retain;
++	__be32	rsvd6;
++};
++
++/* logout response status values */
++
++#define ISCSI_LOGOUT_SUCCESS			0
++#define ISCSI_LOGOUT_CID_NOT_FOUND		1
++#define ISCSI_LOGOUT_RECOVERY_UNSUPPORTED	2
++#define ISCSI_LOGOUT_CLEANUP_FAILED		3
++
++/* SNACK Header */
++struct iscsi_snack {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t rsvd2[14];
++	__be32	itt;
++	__be32	begrun;
++	__be32	runlength;
++	__be32	exp_statsn;
++	__be32	rsvd3;
++	__be32	exp_datasn;
++	uint8_t rsvd6[8];
++};
++
++/* SNACK PDU flags */
++#define ISCSI_FLAG_SNACK_TYPE_MASK	0x0F	/* 4 bits */
++
++/* Reject Message Header */
++struct iscsi_reject {
++	uint8_t opcode;
++	uint8_t flags;
++	uint8_t reason;
++	uint8_t rsvd2;
++	uint8_t hlength;
++	uint8_t dlength[3];
++	uint8_t rsvd3[8];
++	__be32  ffffffff;
++	uint8_t rsvd4[4];
++	__be32	statsn;
++	__be32	exp_cmdsn;
++	__be32	max_cmdsn;
++	__be32	datasn;
++	uint8_t rsvd5[8];
++	/* Text - Rejected hdr */
++};
++
++/* Reason for Reject */
++#define ISCSI_REASON_CMD_BEFORE_LOGIN	1
++#define ISCSI_REASON_DATA_DIGEST_ERROR	2
++#define ISCSI_REASON_DATA_SNACK_REJECT	3
++#define ISCSI_REASON_PROTOCOL_ERROR	4
++#define ISCSI_REASON_CMD_NOT_SUPPORTED	5
++#define ISCSI_REASON_IMM_CMD_REJECT		6
++#define ISCSI_REASON_TASK_IN_PROGRESS	7
++#define ISCSI_REASON_INVALID_SNACK		8
++#define ISCSI_REASON_BOOKMARK_INVALID	9
++#define ISCSI_REASON_BOOKMARK_NO_RESOURCES	10
++#define ISCSI_REASON_NEGOTIATION_RESET	11
++
++/* Max. number of Key=Value pairs in a text message */
++#define MAX_KEY_VALUE_PAIRS	8192
++
++/* maximum length for text keys/values */
++#define KEY_MAXLEN		64
++#define VALUE_MAXLEN		255
++#define TARGET_NAME_MAXLEN	VALUE_MAXLEN
++
++#define DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH	8192
++
++/************************* RFC 3720 End *****************************/
++
++#endif /* ISCSI_PROTO_H */
diff --git a/kernel_patches/backport/2.6.9_U4/add_iser.patch b/kernel_patches/backport/2.6.9_U4/add_iser.patch
new file mode 100644
index 0000000..0da53d2
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_iser.patch
@@ -0,0 +1,13 @@
+diff -rup linux-2.6.20/drivers/infiniband/ulp/iser/iser_initiator.c linux-2.6.20-backport-rh4-u3/drivers/infiniband/ulp/iser/iser_initiator.c
+--- linux-2.6.20/drivers/infiniband/ulp/iser/iser_initiator.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/infiniband/ulp/iser/iser_initiator.c	2007-03-26 11:27:11.000000000 +0200
+@@ -618,7 +618,8 @@ void iser_snd_completion(struct iser_des
+ 
+ 	if (resume_tx) {
+ 		iser_dbg("%ld resuming tx\n",jiffies);
+-		scsi_queue_work(conn->session->host, &conn->xmitwork);
++		//scsi_queue_work(conn->session->host, &conn->xmitwork);
++		schedule_work(&conn->xmitwork);
+ 	}
+ 
+ 	if (tx_desc->type == ISCSI_TX_CONTROL) { 
diff --git a/kernel_patches/backport/2.6.9_U4/add_memory_h.patch b/kernel_patches/backport/2.6.9_U4/add_memory_h.patch
new file mode 100644
index 0000000..5daad2e
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_memory_h.patch
@@ -0,0 +1,93 @@
+diff -rupN linux-2.6.20-like-rh4/include/linux/memory.h linux-2.6.20/include/linux/memory.h
+--- linux-2.6.20-like-rh4/include/linux/memory.h	1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.20/include/linux/memory.h	2007-02-04 20:44:54.000000000 +0200
+@@ -0,0 +1,89 @@
++/*
++ * include/linux/memory.h - generic memory definition
++ *
++ * This is mainly for topological representation. We define the
++ * basic "struct memory_block" here, which can be embedded in per-arch
++ * definitions or NUMA information.
++ *
++ * Basic handling of the devices is done in drivers/base/memory.c
++ * and system devices are handled in drivers/base/sys.c.
++ *
++ * Memory blocks are exported via sysfs in the class/memory/devices/
++ * directory.
++ *
++ */
++#ifndef _LINUX_MEMORY_H_
++#define _LINUX_MEMORY_H_
++
++#include <linux/sysdev.h>
++#include <linux/node.h>
++#include <linux/compiler.h>
++
++#include <asm/semaphore.h>
++
++struct memory_block {
++	unsigned long phys_index;
++	unsigned long state;
++	/*
++	 * This serializes all state change requests.  It isn't
++	 * held during creation because the control files are
++	 * created long after the critical areas during
++	 * initialization.
++	 */
++	struct semaphore state_sem;
++	int phys_device;		/* to which fru does this belong? */
++	void *hw;			/* optional pointer to fw/hw data */
++	int (*phys_callback)(struct memory_block *);
++	struct sys_device sysdev;
++};
++
++/* These states are exposed to userspace as text strings in sysfs */
++#define	MEM_ONLINE		(1<<0) /* exposed to userspace */
++#define	MEM_GOING_OFFLINE	(1<<1) /* exposed to userspace */
++#define	MEM_OFFLINE		(1<<2) /* exposed to userspace */
++
++/*
++ * All of these states are currently kernel-internal for notifying
++ * kernel components and architectures.
++ *
++ * For MEM_MAPPING_INVALID, all notifier chains with priority >0
++ * are called before pfn_to_page() becomes invalid.  The priority=0
++ * entry is reserved for the function that actually makes
++ * pfn_to_page() stop working.  Any notifiers that want to be called
++ * after that should have priority <0.
++ */
++#define	MEM_MAPPING_INVALID	(1<<3)
++
++struct notifier_block;
++struct mem_section;
++
++#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE
++static inline int memory_dev_init(void)
++{
++	return 0;
++}
++static inline int register_memory_notifier(struct notifier_block *nb)
++{
++	return 0;
++}
++static inline void unregister_memory_notifier(struct notifier_block *nb)
++{
++}
++#else
++extern int register_new_memory(struct mem_section *);
++extern int unregister_memory_section(struct mem_section *);
++extern int memory_dev_init(void);
++extern int remove_memory_block(unsigned long, struct mem_section *, int);
++
++#define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
++
++
++#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
++
++#define hotplug_memory_notifier(fn, pri) {			\
++	static struct notifier_block fn##_mem_nb =		\
++		{ .notifier_call = fn, .priority = pri };	\
++	register_memory_notifier(&fn##_mem_nb);			\
++}
++
++#endif /* _LINUX_MEMORY_H_ */
diff --git a/kernel_patches/backport/2.6.9_U4/add_open_iscsi.patch b/kernel_patches/backport/2.6.9_U4/add_open_iscsi.patch
new file mode 100644
index 0000000..d77c663
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_open_iscsi.patch
@@ -0,0 +1,504 @@
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.c linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.c
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.c	2007-04-01 13:11:17.000000000 +0300
+@@ -108,7 +108,7 @@ iscsi_hdr_digest(struct iscsi_conn *conn
+ {
+ 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+ 
+-	crypto_hash_digest(&tcp_conn->tx_hash, &buf->sg, buf->sg.length, crc);
++	crypto_digest_digest(tcp_conn->tx_tfm, &buf->sg, 1, crc);
+ 	buf->sg.length = tcp_conn->hdr_size;
+ }
+ 
+@@ -419,7 +419,7 @@ iscsi_r2t_rsp(struct iscsi_conn *conn, s
+ 	tcp_ctask->xmstate |= XMSTATE_SOL_HDR;
+ 	list_move_tail(&ctask->running, &conn->xmitqueue);
+ 
+-	scsi_queue_work(session->host, &conn->xmitwork);
++	schedule_work(&conn->xmitwork);
+ 	conn->r2t_pdus_cnt++;
+ 	spin_unlock(&session->lock);
+ 
+@@ -468,8 +468,7 @@ iscsi_tcp_hdr_recv(struct iscsi_conn *co
+ 
+ 		sg_init_one(&sg, (u8 *)hdr,
+ 			    sizeof(struct iscsi_hdr) + ahslen);
+-		crypto_hash_digest(&tcp_conn->rx_hash, &sg, sg.length,
+-				   (u8 *)&cdgst);
++		crypto_digest_digest(tcp_conn->rx_tfm, &sg, 1, (u8 *)&cdgst);
+ 		rdgst = *(uint32_t*)((char*)hdr + sizeof(struct iscsi_hdr) +
+ 				     ahslen);
+ 		if (cdgst != rdgst) {
+@@ -676,7 +675,7 @@ iscsi_tcp_copy(struct iscsi_conn *conn, 
+ }
+ 
+ static inline void
+-partial_sg_digest_update(struct hash_desc *desc, struct scatterlist *sg,
++partial_sg_digest_update(struct crypto_tfm *tfm, struct scatterlist *sg,
+ 			 int offset, int length)
+ {
+ 	struct scatterlist temp;
+@@ -684,7 +683,7 @@ partial_sg_digest_update(struct hash_des
+ 	memcpy(&temp, sg, sizeof(struct scatterlist));
+ 	temp.offset = offset;
+ 	temp.length = length;
+-	crypto_hash_update(desc, &temp, length);
++	crypto_digest_update(tfm, &temp, 1);
+ }
+ 
+ static void
+@@ -693,7 +692,7 @@ iscsi_recv_digest_update(struct iscsi_tc
+ 	struct scatterlist tmp;
+ 
+ 	sg_init_one(&tmp, buf, len);
+-	crypto_hash_update(&tcp_conn->rx_hash, &tmp, len);
++	crypto_digest_update(tcp_conn->rx_tfm, &tmp, 1);
+ }
+ 
+ static int iscsi_scsi_data_in(struct iscsi_conn *conn)
+@@ -747,12 +746,12 @@ static int iscsi_scsi_data_in(struct isc
+ 		if (!rc) {
+ 			if (conn->datadgst_en) {
+ 				if (!offset)
+-					crypto_hash_update(
+-							&tcp_conn->rx_hash,
+-							&sg[i], sg[i].length);
++					crypto_digest_update(
++							tcp_conn->rx_tfm,
++							&sg[i], 1);
+ 				else
+ 					partial_sg_digest_update(
+-							&tcp_conn->rx_hash,
++							tcp_conn->rx_tfm,
+ 							&sg[i],
+ 							sg[i].offset + offset,
+ 							sg[i].length - offset);
+@@ -766,10 +765,9 @@ static int iscsi_scsi_data_in(struct isc
+ 				/*
+ 				 * data-in is complete, but buffer not...
+ 				 */
+-				partial_sg_digest_update(&tcp_conn->rx_hash,
+-							 &sg[i],
+-							 sg[i].offset,
+-							 sg[i].length-rc);
++				partial_sg_digest_update(tcp_conn->rx_tfm,
++						&sg[i],
++						sg[i].offset, sg[i].length-rc);
+ 			rc = 0;
+ 			break;
+ 		}
+@@ -887,7 +885,7 @@ more:
+ 		rc = iscsi_tcp_hdr_recv(conn);
+ 		if (!rc && tcp_conn->in.datalen) {
+ 			if (conn->datadgst_en)
+-				crypto_hash_init(&tcp_conn->rx_hash);
++				crypto_digest_init(tcp_conn->rx_tfm);
+ 			tcp_conn->in_progress = IN_PROGRESS_DATA_RECV;
+ 		} else if (rc) {
+ 			iscsi_conn_failure(conn, rc);
+@@ -944,11 +942,11 @@ more:
+ 					  tcp_conn->in.padding);
+ 				memset(pad, 0, tcp_conn->in.padding);
+ 				sg_init_one(&sg, pad, tcp_conn->in.padding);
+-				crypto_hash_update(&tcp_conn->rx_hash,
+-						   &sg, sg.length);
++				crypto_digest_update(tcp_conn->rx_tfm,
++						     &sg, 1);
+ 			}
+-			crypto_hash_final(&tcp_conn->rx_hash,
+-					  (u8 *) &tcp_conn->in.datadgst);
++			crypto_digest_final(tcp_conn->rx_tfm,
++					    (u8 *) &tcp_conn->in.datadgst);
+ 			debug_tcp("rx digest 0x%x\n", tcp_conn->in.datadgst);
+ 			tcp_conn->in_progress = IN_PROGRESS_DDIGEST_RECV;
+ 			tcp_conn->data_copied = 0;
+@@ -1043,7 +1041,7 @@ iscsi_write_space(struct sock *sk)
+ 
+ 	tcp_conn->old_write_space(sk);
+ 	debug_tcp("iscsi_write_space: cid %d\n", conn->id);
+-	scsi_queue_work(conn->session->host, &conn->xmitwork);
++	schedule_work(&conn->xmitwork);
+ }
+ 
+ static void
+@@ -1193,7 +1191,7 @@ static inline void
+ iscsi_data_digest_init(struct iscsi_tcp_conn *tcp_conn,
+ 		      struct iscsi_tcp_cmd_task *tcp_ctask)
+ {
+-	crypto_hash_init(&tcp_conn->tx_hash);
++	crypto_digest_init(tcp_conn->tx_tfm);
+ 	tcp_ctask->digest_count = 4;
+ }
+ 
+@@ -1449,9 +1447,8 @@ iscsi_send_padding(struct iscsi_conn *co
+ 		iscsi_buf_init_iov(&tcp_ctask->sendbuf, (char*)&tcp_ctask->pad,
+ 				   tcp_ctask->pad_count);
+ 		if (conn->datadgst_en)
+-			crypto_hash_update(&tcp_conn->tx_hash,
+-					   &tcp_ctask->sendbuf.sg,
+-					   tcp_ctask->sendbuf.sg.length);
++			crypto_digest_update(tcp_conn->tx_tfm,
++					     &tcp_ctask->sendbuf.sg, 1);
+ 	} else if (!(tcp_ctask->xmstate & XMSTATE_W_RESEND_PAD))
+ 		return 0;
+ 
+@@ -1483,7 +1480,7 @@ iscsi_send_digest(struct iscsi_conn *con
+ 	tcp_conn = conn->dd_data;
+ 
+ 	if (!(tcp_ctask->xmstate & XMSTATE_W_RESEND_DATA_DIGEST)) {
+-		crypto_hash_final(&tcp_conn->tx_hash, (u8*)digest);
++		crypto_digest_final(tcp_conn->tx_tfm, (u8*)digest);
+ 		iscsi_buf_init_iov(buf, (char*)digest, 4);
+ 	}
+ 	tcp_ctask->xmstate &= ~XMSTATE_W_RESEND_DATA_DIGEST;
+@@ -1517,7 +1514,7 @@ iscsi_send_data(struct iscsi_cmd_task *c
+ 		rc = iscsi_sendpage(conn, sendbuf, count, &buf_sent);
+ 		*sent = *sent + buf_sent;
+ 		if (buf_sent && conn->datadgst_en)
+-			partial_sg_digest_update(&tcp_conn->tx_hash,
++			partial_sg_digest_update(tcp_conn->tx_tfm,
+ 				&sendbuf->sg, sendbuf->sg.offset + offset,
+ 				buf_sent);
+ 		if (!iscsi_buf_left(sendbuf) && *sg != tcp_ctask->bad_sg) {
+@@ -1774,22 +1771,18 @@ iscsi_tcp_conn_create(struct iscsi_cls_s
+ 	/* initial operational parameters */
+ 	tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
+ 
+-	tcp_conn->tx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-						  CRYPTO_ALG_ASYNC);
+-	tcp_conn->tx_hash.flags = 0;
+-	if (IS_ERR(tcp_conn->tx_hash.tfm))
++	tcp_conn->tx_tfm = crypto_alloc_tfm("crc32c", 0);
++	if (!tcp_conn->tx_tfm)
+ 		goto free_tcp_conn;
+ 
+-	tcp_conn->rx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-						  CRYPTO_ALG_ASYNC);
+-	tcp_conn->rx_hash.flags = 0;
+-	if (IS_ERR(tcp_conn->rx_hash.tfm))
++	tcp_conn->rx_tfm = crypto_alloc_tfm("crc32c", 0);
++	if (!tcp_conn->rx_tfm)
+ 		goto free_tx_tfm;
+ 
+ 	return cls_conn;
+ 
+ free_tx_tfm:
+-	crypto_free_hash(tcp_conn->tx_hash.tfm);
++	crypto_free_tfm(tcp_conn->tx_tfm);
+ free_tcp_conn:
+ 	kfree(tcp_conn);
+ tcp_conn_alloc_fail:
+@@ -1823,10 +1816,10 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_
+ 	iscsi_tcp_release_conn(conn);
+ 	iscsi_conn_teardown(cls_conn);
+ 
+-	if (tcp_conn->tx_hash.tfm)
+-		crypto_free_hash(tcp_conn->tx_hash.tfm);
+-	if (tcp_conn->rx_hash.tfm)
+-		crypto_free_hash(tcp_conn->rx_hash.tfm);
++	if (tcp_conn->tx_tfm)
++		crypto_free_tfm(tcp_conn->tx_tfm);
++	if (tcp_conn->rx_tfm)
++		crypto_free_tfm(tcp_conn->rx_tfm);
+ 
+ 	kfree(tcp_conn);
+ }
+@@ -2017,7 +2010,7 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
+ {
+ 	struct iscsi_conn *conn = cls_conn->dd_data;
+ 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+-	struct inet_sock *inet;
++	struct inet_opt *inet;
+ 	struct ipv6_pinfo *np;
+ 	struct sock *sk;
+ 	int len;
+@@ -2044,11 +2037,13 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
+ 		sk = tcp_conn->sock->sk;
+ 		if (sk->sk_family == PF_INET) {
+ 			inet = inet_sk(sk);
+-			len = sprintf(buf, NIPQUAD_FMT "\n",
++			len = sprintf(buf, "%u.%u.%u.%u\n",
+ 				      NIPQUAD(inet->daddr));
+ 		} else {
+ 			np = inet6_sk(sk);
+-			len = sprintf(buf, NIP6_FMT "\n", NIP6(np->daddr));
++			len = sprintf(buf,
++				"%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
++				NIP6(np->daddr));
+ 		}
+ 		mutex_unlock(&conn->xmitmutex);
+ 		break;
+@@ -2135,7 +2130,6 @@ static void iscsi_tcp_session_destroy(st
+ static struct scsi_host_template iscsi_sht = {
+ 	.name			= "iSCSI Initiator over TCP/IP",
+ 	.queuecommand           = iscsi_queuecommand,
+-	.change_queue_depth	= iscsi_change_queue_depth,
+ 	.can_queue		= ISCSI_XMIT_CMDS_MAX - 1,
+ 	.sg_tablesize		= ISCSI_SG_TABLESIZE,
+ 	.cmd_per_lun		= ISCSI_DEF_CMD_PER_LUN,
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.h linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.h
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.h	2007-04-01 13:11:55.000000000 +0300
+@@ -49,7 +49,6 @@
+ #define ISCSI_SG_TABLESIZE		SG_ALL
+ #define ISCSI_TCP_MAX_CMD_LEN		16
+ 
+-struct crypto_hash;
+ struct socket;
+ 
+ /* Socket connection recieve helper */
+@@ -93,8 +92,8 @@ struct iscsi_tcp_conn {
+ 	void			(*old_write_space)(struct sock *);
+ 
+ 	/* data and header digests */
+-	struct hash_desc	tx_hash;	/* CRC32C (Tx) */
+-	struct hash_desc	rx_hash;	/* CRC32C (Rx) */
++	struct crypto_tfm	*tx_tfm;	/* CRC32C (Tx) */
++	struct crypto_tfm	*rx_tfm;	/* CRC32C (Rx) */
+ 
+ 	/* MIB custom statistics */
+ 	uint32_t		sendpage_failures_cnt;
+diff -rup linux-2.6.20/drivers/scsi/libiscsi.c linux-2.6.20-backport-rh4-u3/drivers/scsi/libiscsi.c
+--- linux-2.6.20/drivers/scsi/libiscsi.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/libiscsi.c	2007-04-01 13:15:57.000000000 +0300
+@@ -23,6 +23,7 @@
+  */
+ #include <linux/types.h>
+ #include <linux/mutex.h>
++#include <linux/gfp.h>
+ #include <linux/kfifo.h>
+ #include <linux/delay.h>
+ #include <net/tcp.h>
+@@ -831,7 +832,7 @@ int iscsi_queuecommand(struct scsi_cmnd 
+ 		session->cmdsn, session->max_cmdsn - session->exp_cmdsn + 1);
+ 	spin_unlock(&session->lock);
+ 
+-	scsi_queue_work(host, &conn->xmitwork);
++	schedule_work(&conn->xmitwork);
+ 	return 0;
+ 
+ reject:
+@@ -932,7 +933,7 @@ iscsi_conn_send_generic(struct iscsi_con
+ 	else
+ 	        __kfifo_put(conn->mgmtqueue, (void*)&mtask, sizeof(void*));
+ 
+-	scsi_queue_work(session->host, &conn->xmitwork);
++	schedule_work(&conn->xmitwork);
+ 	return 0;
+ }
+ 
+@@ -1370,7 +1371,6 @@ iscsi_session_setup(struct iscsi_transpo
+ 	shost->max_lun = iscsit->max_lun;
+ 	shost->max_cmd_len = iscsit->max_cmd_len;
+ 	shost->transportt = scsit;
+-	shost->transportt->create_work_queue = 1;
+ 	*hostno = shost->host_no;
+ 
+ 	session = iscsi_hostdata(shost->hostdata);
+diff -rup linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c linux-2.6.20-backport-rh4-u3/drivers/scsi/scsi_transport_iscsi.c
+--- linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/scsi_transport_iscsi.c	2007-04-01 13:18:33.000000000 +0300
+@@ -29,11 +29,15 @@
+ #include <scsi/scsi_transport.h>
+ #include <scsi/scsi_transport_iscsi.h>
+ #include <scsi/iscsi_if.h>
++#include <linux/transport_class.h>
++#include <linux/netlink.h>
+ 
+ #define ISCSI_SESSION_ATTRS 11
+ #define ISCSI_CONN_ATTRS 11
+ #define ISCSI_HOST_ATTRS 0
+-#define ISCSI_TRANSPORT_VERSION "2.0-724"
++#define ISCSI_TRANSPORT_VERSION "2.0-754"
++
++#define SCAN_WILD_CARD   ~0
+ 
+ struct iscsi_internal {
+ 	int daemon_pid;
+@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
+ #define cdev_to_iscsi_internal(_cdev) \
+ 	container_of(_cdev, struct iscsi_internal, cdev)
+ 
++extern int attribute_container_init(void);
++
+ static void iscsi_transport_release(struct class_device *cdev)
+ {
+ 	struct iscsi_internal *priv = cdev_to_iscsi_internal(cdev);
+@@ -80,6 +86,17 @@ static struct class iscsi_transport_clas
+ 	.release = iscsi_transport_release,
+ };
+ 
++static void iscsi_host_class_release(struct class_device *class_dev)
++{
++	struct Scsi_Host *shost = transport_class_to_shost(class_dev);
++	put_device(&shost->shost_gendev);
++}
++
++struct class iscsi_host_class = {
++	.name = "iscsi_host",
++	.release = iscsi_host_class_release,
++};
++
+ static ssize_t
+ show_transport_handle(struct class_device *cdev, char *buf)
+ {
+@@ -115,10 +132,8 @@ static struct attribute_group iscsi_tran
+ 	.attrs = iscsi_transport_attrs,
+ };
+ 
+-static int iscsi_setup_host(struct transport_container *tc, struct device *dev,
+-			    struct class_device *cdev)
++static int iscsi_setup_host(struct Scsi_Host *shost)
+ {
+-	struct Scsi_Host *shost = dev_to_shost(dev);
+ 	struct iscsi_host *ihost = shost->shost_data;
+ 
+ 	memset(ihost, 0, sizeof(*ihost));
+@@ -127,12 +142,6 @@ static int iscsi_setup_host(struct trans
+ 	return 0;
+ }
+ 
+-static DECLARE_TRANSPORT_CLASS(iscsi_host_class,
+-			       "iscsi_host",
+-			       iscsi_setup_host,
+-			       NULL,
+-			       NULL);
+-
+ static DECLARE_TRANSPORT_CLASS(iscsi_session_class,
+ 			       "iscsi_session",
+ 			       NULL,
+@@ -216,28 +225,10 @@ static int iscsi_is_session_dev(const st
+ 	return dev->release == iscsi_session_release;
+ }
+ 
+-static int iscsi_user_scan(struct Scsi_Host *shost, uint channel,
+-			   uint id, uint lun)
+-{
+-	struct iscsi_host *ihost = shost->shost_data;
+-	struct iscsi_cls_session *session;
+-
+-	mutex_lock(&ihost->mutex);
+-	list_for_each_entry(session, &ihost->sessions, host_list) {
+-		if ((channel == SCAN_WILD_CARD || channel == 0) &&
+-		    (id == SCAN_WILD_CARD || id == session->target_id))
+-			scsi_scan_target(&session->dev, 0,
+-					 session->target_id, lun, 1);
+-	}
+-	mutex_unlock(&ihost->mutex);
+-
+-	return 0;
+-}
+-
+-static void session_recovery_timedout(struct work_struct *work)
++static void session_recovery_timedout(void *data)
+ {
+ 	struct iscsi_cls_session *session =
+-		container_of(work, struct iscsi_cls_session,
++		container_of(data, struct iscsi_cls_session,
+ 			     recovery_work.work);
+ 
+ 	dev_printk(KERN_INFO, &session->dev, "iscsi: session recovery timed "
+@@ -362,8 +353,6 @@ void iscsi_remove_session(struct iscsi_c
+ 	list_del(&session->host_list);
+ 	mutex_unlock(&ihost->mutex);
+ 
+-	scsi_remove_target(&session->dev);
+-
+ 	transport_unregister_device(&session->dev);
+ 	device_del(&session->dev);
+ }
+@@ -452,6 +441,7 @@ iscsi_create_conn(struct iscsi_cls_sessi
+ 		goto release_parent_ref;
+ 	}
+ 	transport_register_device(&conn->dev);
++
+ 	return conn;
+ 
+ release_parent_ref:
+@@ -606,9 +596,8 @@ iscsi_if_send_reply(int pid, int seq, in
+ 	struct nlmsghdr	*nlh;
+ 	int len = NLMSG_SPACE(size);
+ 	int flags = multi ? NLM_F_MULTI : 0;
+-	int t = done ? NLMSG_DONE : type;
+ 
+-	skb = alloc_skb(len, GFP_ATOMIC);
++	skb = alloc_skb(len, GFP_KERNEL);
+ 	/*
+ 	 * FIXME:
+ 	 * user is supposed to react on iferror == -ENOMEM;
+@@ -649,7 +638,7 @@ iscsi_if_get_stats(struct iscsi_transpor
+ 	do {
+ 		int actual_size;
+ 
+-		skbstat = alloc_skb(len, GFP_ATOMIC);
++		skbstat = alloc_skb(len, GFP_KERNEL);
+ 		if (!skbstat) {
+ 			dev_printk(KERN_ERR, &conn->dev, "iscsi: can not "
+ 				   "deliver stats: OOM\n");
+@@ -1269,24 +1258,6 @@ static int iscsi_conn_match(struct attri
+ 	return &priv->conn_cont.ac == cont;
+ }
+ 
+-static int iscsi_host_match(struct attribute_container *cont,
+-			    struct device *dev)
+-{
+-	struct Scsi_Host *shost;
+-	struct iscsi_internal *priv;
+-
+-	if (!scsi_is_host_device(dev))
+-		return 0;
+-
+-	shost = dev_to_shost(dev);
+-	if (!shost->transportt  ||
+-	    shost->transportt->host_attrs.ac.class != &iscsi_host_class.class)
+-		return 0;
+-
+-        priv = to_iscsi_internal(shost->transportt);
+-        return &priv->t.host_attrs.ac == cont;
+-}
+-
+ struct scsi_transport_template *
+ iscsi_register_transport(struct iscsi_transport *tt)
+ {
+@@ -1306,7 +1277,6 @@ iscsi_register_transport(struct iscsi_tr
+ 	INIT_LIST_HEAD(&priv->list);
+ 	priv->daemon_pid = -1;
+ 	priv->iscsi_transport = tt;
+-	priv->t.user_scan = iscsi_user_scan;
+ 
+ 	priv->cdev.class = &iscsi_transport_class;
+ 	snprintf(priv->cdev.class_id, BUS_ID_SIZE, "%s", tt->name);
+@@ -1319,12 +1289,11 @@ iscsi_register_transport(struct iscsi_tr
+ 		goto unregister_cdev;
+ 
+ 	/* host parameters */
+-	priv->t.host_attrs.ac.attrs = &priv->host_attrs[0];
+-	priv->t.host_attrs.ac.class = &iscsi_host_class.class;
+-	priv->t.host_attrs.ac.match = iscsi_host_match;
++
++	priv->t.host_attrs = &priv->host_attrs[0];
++	priv->t.host_class = &iscsi_host_class;
++	priv->t.host_setup = iscsi_setup_host;
+ 	priv->t.host_size = sizeof(struct iscsi_host);
+-	priv->host_attrs[0] = NULL;
+-	transport_container_register(&priv->t.host_attrs);
+ 
+ 	/* connection parameters */
+ 	priv->conn_cont.ac.attrs = &priv->conn_attrs[0];
+@@ -1402,7 +1371,6 @@ int iscsi_unregister_transport(struct is
+ 
+ 	transport_container_unregister(&priv->conn_cont);
+ 	transport_container_unregister(&priv->session_cont);
+-	transport_container_unregister(&priv->t.host_attrs);
+ 
+ 	sysfs_remove_group(&priv->cdev.kobj, &iscsi_transport_group);
+ 	class_device_unregister(&priv->cdev);
+@@ -1419,6 +1387,8 @@ static __init int iscsi_transport_init(v
+ 	printk(KERN_INFO "Loading iSCSI transport class v%s.\n",
+ 		ISCSI_TRANSPORT_VERSION);
+ 
++	attribute_container_init();
++
+ 	err = class_register(&iscsi_transport_class);
+ 	if (err)
+ 		return err; 
diff --git a/kernel_patches/backport/2.6.9_U4/add_open_iscsi_h.patch b/kernel_patches/backport/2.6.9_U4/add_open_iscsi_h.patch
new file mode 100644
index 0000000..6dd4429
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_open_iscsi_h.patch
@@ -0,0 +1,60 @@
+diff -rupN linux-2.6.20-rc7/include/scsi/iscsi_compat.h linux-2.6.9/include/scsi/iscsi_compat.h
+--- linux-2.6.20-rc7/include/scsi/iscsi_compat.h	1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.9/include/scsi/iscsi_compat.h	2007-02-08 08:45:39.000000000 +0200
+@@ -0,0 +1,16 @@
++#ifndef ISCSI_COMPAT
++#define ISCSI_COMPAT
++
++#include <linux/version.h>
++#include <linux/kernel.h>
++#include <scsi/scsi.h>
++
++#define __nlmsg_put(skb, daemon_pid, seq, type, len, flags) \
++       __nlmsg_put(skb, daemon_pid, 0, 0, len)
++
++#define netlink_kernel_create(uint, groups, input, mod) \
++       netlink_kernel_create(uint, input)
++
++#define gfp_t unsigned
++
++#endif /* ISCSI_COMPAT */
+diff -rupN linux-2.6.20-rc7/include/scsi/iscsi_if.h linux-2.6.9/include/scsi/iscsi_if.h
+--- linux-2.6.20-rc7/include/scsi/iscsi_if.h	2006-11-29 23:57:37.000000000 +0200
++++ linux-2.6.9/include/scsi/iscsi_if.h	2007-02-04 12:50:15.000000000 +0200
+@@ -277,7 +277,6 @@ enum iscsi_param {
+  * These flags describes reason of stop_conn() call
+  */
+ #define STOP_CONN_TERM		0x1
+-#define STOP_CONN_SUSPEND	0x2
+ #define STOP_CONN_RECOVER	0x3
+ 
+ #define ISCSI_STATS_CUSTOM_MAX		32
+diff -rupN linux-2.6.20-rc7/include/scsi/libiscsi.h linux-2.6.9/include/scsi/libiscsi.h
+--- linux-2.6.20-rc7/include/scsi/libiscsi.h	2007-02-07 11:10:56.000000000 +0200
++++ linux-2.6.9/include/scsi/libiscsi.h	2007-02-07 15:51:59.000000000 +0200
+@@ -25,10 +25,9 @@
+ 
+ #include <linux/types.h>
+ #include <linux/mutex.h>
+-#include <linux/timer.h>
+-#include <linux/workqueue.h>
+ #include <scsi/iscsi_proto.h>
+ #include <scsi/iscsi_if.h>
++#include <scsi/iscsi_compat.h>
+ 
+ struct scsi_transport_template;
+ struct scsi_device;
+diff -rupN linux-2.6.20-rc7/include/scsi/scsi_transport_iscsi.h linux-2.6.9/include/scsi/scsi_transport_iscsi.h
+--- linux-2.6.20-rc7/include/scsi/scsi_transport_iscsi.h	2007-02-07 11:10:56.000000000 +0200
++++ linux-2.6.9/include/scsi/scsi_transport_iscsi.h	2007-02-07 15:52:50.000000000 +0200
+@@ -24,7 +24,9 @@
+ #define SCSI_TRANSPORT_ISCSI_H
+ 
+ #include <linux/device.h>
+-#include <scsi/iscsi_if.h>
++#include "iscsi_if.h"
++#include "iscsi_compat.h"
++//#include <../drivers/scsi/transport_class.h>
+ 
+ struct scsi_transport_template;
+ struct iscsi_transport;
diff --git a/kernel_patches/backport/2.6.9_U4/add_transport_class_h.patch b/kernel_patches/backport/2.6.9_U4/add_transport_class_h.patch
new file mode 100644
index 0000000..f2425e0
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_transport_class_h.patch
@@ -0,0 +1,104 @@
+diff -rupN linux-2.6.20-like-rh4/include/linux/transport_class.h linux-2.6.20/include/linux/transport_class.h
+--- linux-2.6.20-like-rh4/include/linux/transport_class.h	1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.20/include/linux/transport_class.h	2007-02-04 20:44:54.000000000 +0200
+@@ -0,0 +1,100 @@
++/*
++ * transport_class.h - a generic container for all transport classes
++ *
++ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
++ *
++ * This file is licensed under GPLv2
++ */
++
++#ifndef _TRANSPORT_CLASS_H_
++#define _TRANSPORT_CLASS_H_
++
++#include <linux/device.h>
++#include <linux/attribute_container.h>
++
++struct transport_container;
++
++struct transport_class {
++	struct class class;
++	int (*setup)(struct transport_container *, struct device *,
++		     struct class_device *);
++	int (*configure)(struct transport_container *, struct device *,
++			 struct class_device *);
++	int (*remove)(struct transport_container *, struct device *,
++		      struct class_device *);
++};
++
++#define DECLARE_TRANSPORT_CLASS(cls, nm, su, rm, cfg)			\
++struct transport_class cls = {						\
++	.class = {							\
++		.name = nm,						\
++	},								\
++	.setup = su,							\
++	.remove = rm,							\
++	.configure = cfg,						\
++}
++
++
++struct anon_transport_class {
++	struct transport_class tclass;
++	struct attribute_container container;
++};
++
++#define DECLARE_ANON_TRANSPORT_CLASS(cls, mtch, cfg)		\
++struct anon_transport_class cls = {				\
++	.tclass = {						\
++		.configure = cfg,				\
++	},							\
++	. container = {						\
++		.match = mtch,					\
++	},							\
++}
++
++#define class_to_transport_class(x) \
++	container_of(x, struct transport_class, class)
++
++struct transport_container {
++	struct attribute_container ac;
++	struct attribute_group *statistics;
++};
++
++#define attribute_container_to_transport_container(x) \
++	container_of(x, struct transport_container, ac)
++
++void transport_remove_device(struct device *);
++void transport_add_device(struct device *);
++void transport_setup_device(struct device *);
++void transport_configure_device(struct device *);
++void transport_destroy_device(struct device *);
++
++static inline void
++transport_register_device(struct device *dev)
++{
++	transport_setup_device(dev);
++	transport_add_device(dev);
++}
++
++static inline void
++transport_unregister_device(struct device *dev)
++{
++	transport_remove_device(dev);
++	transport_destroy_device(dev);
++}
++
++static inline int transport_container_register(struct transport_container *tc)
++{
++	return attribute_container_register(&tc->ac);
++}
++
++static inline int transport_container_unregister(struct transport_container *tc)
++{
++	return attribute_container_unregister(&tc->ac);
++}
++
++int transport_class_register(struct transport_class *);
++int anon_transport_class_register(struct anon_transport_class *);
++void transport_class_unregister(struct transport_class *);
++void anon_transport_class_unregister(struct anon_transport_class *);
++
++
++#endif
diff --git a/kernel_patches/backport/2.6.9_U4/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U4/fix_inclusion_order_iscsi_iser.patch
new file mode 100644
index 0000000..3c2a969
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/fix_inclusion_order_iscsi_iser.patch
@@ -0,0 +1,13 @@
+--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:13:43.000000000 +0200
++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:14:31.000000000 +0200
+@@ -70,9 +70,8 @@
+ #include <scsi/scsi_tcq.h>
+ #include <scsi/scsi_host.h>
+ #include <scsi/scsi.h>
+-#include <scsi/scsi_transport_iscsi.h>
+-
+ #include "iscsi_iser.h"
++#include <scsi/scsi_transport_iscsi.h>
+ 
+ static unsigned int iscsi_max_lun = 512;
+ module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO);
diff --git a/kernel_patches/backport/2.6.9_U4/linux_stuff_to_2_6_17.patch b/kernel_patches/backport/2.6.9_U4/linux_stuff_to_2_6_17.patch
index e84b964..52c0136 100644
--- a/kernel_patches/backport/2.6.9_U4/linux_stuff_to_2_6_17.patch
+++ b/kernel_patches/backport/2.6.9_U4/linux_stuff_to_2_6_17.patch
@@ -19,6 +19,62 @@ index 0000000..58cf933
 +++ b/drivers/infiniband/core/kfifo.c
 @@ -0,0 +1 @@
 +#include "src/kfifo.c"
+diff --git a/drivers/infiniband/core/init.c b/drivers/infiniband/core/init.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/init.c
+@@ -0,0 +1 @@
++#include "src/init.c"
+diff --git a/drivers/infiniband/core/attribute_container.c b/drivers/infiniband/core/attribute_container.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/attribute_container.c
+@@ -0,0 +1 @@
++#include "src/attribute_container.c"
+diff --git a/drivers/infiniband/core/transport_class.c b/drivers/infiniband/core/transport_class.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/transport_class.c
+@@ -0,0 +1 @@
++#include "src/transport_class.c"
+diff --git a/drivers/infiniband/core/klist.c b/drivers/infiniband/core/klist.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/klist.c
+@@ -0,0 +1 @@
++#include "src/klist.c"
+diff --git a/drivers/infiniband/core/scsi.c b/drivers/infiniband/core/scsi.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/scsi.c
+@@ -0,0 +1 @@
++#include "src/scsi.c"
+diff --git a/drivers/infiniband/core/scsi_lib.c b/drivers/infiniband/core/scsi_lib.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/scsi_lib.c
+@@ -0,0 +1 @@
++#include "src/scsi_lib.c"
+diff --git a/drivers/infiniband/core/scsi_scan.c b/drivers/infiniband/core/scsi_scan.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/scsi_scan.c
+@@ -0,0 +1 @@
++#include "src/scsi_scan.c"
+diff --git a/drivers/infiniband/core/kref_new.c b/drivers/infiniband/core/kref_new.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/infiniband/core/kref_new.c
+@@ -0,0 +1 @@
++#include "src/kref_new.c"
 diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
 index 50fb1cd..456bfd0 100644
 --- a/drivers/infiniband/core/Makefile
@@ -28,4 +84,4 @@ index 50fb1cd..456bfd0 100644
  ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_mem.o \
  				uverbs_marshall.o
 +
-+ib_core-y +=			genalloc.o netevent.o kfifo.o
++ib_core-y +=			genalloc.o netevent.o kfifo.o scsi.o scsi_lib.o scsi_scan.o init.o attribute_container.o transport_class.o klist.o kref_new.o
diff --git a/kernel_patches/backport/2.6.9_U4/netlink-01-add_netlink_h.patch b/kernel_patches/backport/2.6.9_U4/netlink-01-add_netlink_h.patch
new file mode 100644
index 0000000..cc071ef
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/netlink-01-add_netlink_h.patch
@@ -0,0 +1,247 @@
+diff -rupN linux-2.6.20-like-rh4/include/linux/netlink.h linux-2.6.20/include/linux/netlink.h
+--- linux-2.6.20-like-rh4/include/linux/netlink.h	1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.20/include/linux/netlink.h	2007-02-04 20:44:54.000000000 +0200
+@@ -0,0 +1,243 @@
++#ifndef __LINUX_NETLINK_H
++#define __LINUX_NETLINK_H
++
++#include <linux/socket.h> /* for sa_family_t */
++#include <linux/types.h>
++
++#define NETLINK_ROUTE		0	/* Routing/device hook				*/
++#define NETLINK_UNUSED		1	/* Unused number				*/
++#define NETLINK_USERSOCK	2	/* Reserved for user mode socket protocols 	*/
++#define NETLINK_FIREWALL	3	/* Firewalling hook				*/
++#define NETLINK_INET_DIAG	4	/* INET socket monitoring			*/
++#define NETLINK_NFLOG		5	/* netfilter/iptables ULOG */
++#define NETLINK_XFRM		6	/* ipsec */
++#define NETLINK_SELINUX		7	/* SELinux event notifications */
++#define NETLINK_ISCSI		8	/* Open-iSCSI */
++#define NETLINK_AUDIT		9	/* auditing */
++#define NETLINK_FIB_LOOKUP	10	
++#define NETLINK_CONNECTOR	11
++#define NETLINK_NETFILTER	12	/* netfilter subsystem */
++#define NETLINK_IP6_FW		13
++#define NETLINK_DNRTMSG		14	/* DECnet routing messages */
++#define NETLINK_KOBJECT_UEVENT	15	/* Kernel messages to userspace */
++#define NETLINK_GENERIC		16
++/* leave room for NETLINK_DM (DM Events) */
++#define NETLINK_SCSITRANSPORT	18	/* SCSI Transports */
++
++#define MAX_LINKS 32		
++
++struct sockaddr_nl
++{
++	sa_family_t	nl_family;	/* AF_NETLINK	*/
++	unsigned short	nl_pad;		/* zero		*/
++	__u32		nl_pid;		/* process pid	*/
++       	__u32		nl_groups;	/* multicast groups mask */
++};
++
++struct nlmsghdr
++{
++	__u32		nlmsg_len;	/* Length of message including header */
++	__u16		nlmsg_type;	/* Message content */
++	__u16		nlmsg_flags;	/* Additional flags */
++	__u32		nlmsg_seq;	/* Sequence number */
++	__u32		nlmsg_pid;	/* Sending process PID */
++};
++
++/* Flags values */
++
++#define NLM_F_REQUEST		1	/* It is request message. 	*/
++#define NLM_F_MULTI		2	/* Multipart message, terminated by NLMSG_DONE */
++#define NLM_F_ACK		4	/* Reply with ack, with zero or error code */
++#define NLM_F_ECHO		8	/* Echo this request 		*/
++
++/* Modifiers to GET request */
++#define NLM_F_ROOT	0x100	/* specify tree	root	*/
++#define NLM_F_MATCH	0x200	/* return all matching	*/
++#define NLM_F_ATOMIC	0x400	/* atomic GET		*/
++#define NLM_F_DUMP	(NLM_F_ROOT|NLM_F_MATCH)
++
++/* Modifiers to NEW request */
++#define NLM_F_REPLACE	0x100	/* Override existing		*/
++#define NLM_F_EXCL	0x200	/* Do not touch, if it exists	*/
++#define NLM_F_CREATE	0x400	/* Create, if it does not exist	*/
++#define NLM_F_APPEND	0x800	/* Add to end of list		*/
++
++/*
++   4.4BSD ADD		NLM_F_CREATE|NLM_F_EXCL
++   4.4BSD CHANGE	NLM_F_REPLACE
++
++   True CHANGE		NLM_F_CREATE|NLM_F_REPLACE
++   Append		NLM_F_CREATE
++   Check		NLM_F_EXCL
++ */
++
++#define NLMSG_ALIGNTO	4
++#define NLMSG_ALIGN(len) ( ((len)+NLMSG_ALIGNTO-1) & ~(NLMSG_ALIGNTO-1) )
++#define NLMSG_HDRLEN	 ((int) NLMSG_ALIGN(sizeof(struct nlmsghdr)))
++#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(NLMSG_HDRLEN))
++#define NLMSG_SPACE(len) NLMSG_ALIGN(NLMSG_LENGTH(len))
++#define NLMSG_DATA(nlh)  ((void*)(((char*)nlh) + NLMSG_LENGTH(0)))
++#define NLMSG_NEXT(nlh,len)	 ((len) -= NLMSG_ALIGN((nlh)->nlmsg_len), \
++				  (struct nlmsghdr*)(((char*)(nlh)) + NLMSG_ALIGN((nlh)->nlmsg_len)))
++#define NLMSG_OK(nlh,len) ((len) >= (int)sizeof(struct nlmsghdr) && \
++			   (nlh)->nlmsg_len >= sizeof(struct nlmsghdr) && \
++			   (nlh)->nlmsg_len <= (len))
++#define NLMSG_PAYLOAD(nlh,len) ((nlh)->nlmsg_len - NLMSG_SPACE((len)))
++
++#define NLMSG_NOOP		0x1	/* Nothing.		*/
++#define NLMSG_ERROR		0x2	/* Error		*/
++#define NLMSG_DONE		0x3	/* End of a dump	*/
++#define NLMSG_OVERRUN		0x4	/* Data lost		*/
++
++#define NLMSG_MIN_TYPE		0x10	/* < 0x10: reserved control messages */
++
++struct nlmsgerr
++{
++	int		error;
++	struct nlmsghdr msg;
++};
++
++#define NETLINK_ADD_MEMBERSHIP	1
++#define NETLINK_DROP_MEMBERSHIP	2
++#define NETLINK_PKTINFO		3
++
++struct nl_pktinfo
++{
++	__u32	group;
++};
++
++#define NET_MAJOR 36		/* Major 36 is reserved for networking 						*/
++
++enum {
++	NETLINK_UNCONNECTED = 0,
++	NETLINK_CONNECTED,
++};
++
++/*
++ *  <------- NLA_HDRLEN ------> <-- NLA_ALIGN(payload)-->
++ * +---------------------+- - -+- - - - - - - - - -+- - -+
++ * |        Header       | Pad |     Payload       | Pad |
++ * |   (struct nlattr)   | ing |                   | ing |
++ * +---------------------+- - -+- - - - - - - - - -+- - -+
++ *  <-------------- nlattr->nla_len -------------->
++ */
++
++struct nlattr
++{
++	__u16           nla_len;
++	__u16           nla_type;
++};
++
++#define NLA_ALIGNTO		4
++#define NLA_ALIGN(len)		(((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))
++#define NLA_HDRLEN		((int) NLA_ALIGN(sizeof(struct nlattr)))
++
++#ifdef __KERNEL__
++
++#include <linux/capability.h>
++#include <linux/skbuff.h>
++
++struct netlink_skb_parms
++{
++	struct ucred		creds;		/* Skb credentials	*/
++	__u32			pid;
++	__u32			dst_group;
++	kernel_cap_t		eff_cap;
++	__u32			loginuid;	/* Login (audit) uid */
++	__u32			sid;		/* SELinux security id */
++};
++
++#define NETLINK_CB(skb)		(*(struct netlink_skb_parms*)&((skb)->cb))
++#define NETLINK_CREDS(skb)	(&NETLINK_CB((skb)).creds)
++
++
++extern struct sock *netlink_kernel_create(int unit, unsigned int groups, void (*input)(struct sock *sk, int len), struct module *module);
++extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
++extern int netlink_has_listeners(struct sock *sk, unsigned int group);
++extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 pid, int nonblock);
++extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 pid,
++			     __u32 group, gfp_t allocation);
++extern void netlink_set_err(struct sock *ssk, __u32 pid, __u32 group, int code);
++extern int netlink_register_notifier(struct notifier_block *nb);
++extern int netlink_unregister_notifier(struct notifier_block *nb);
++
++/* finegrained unicast helpers: */
++struct sock *netlink_getsockbyfilp(struct file *filp);
++int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock,
++		long timeo, struct sock *ssk);
++void netlink_detachskb(struct sock *sk, struct sk_buff *skb);
++int netlink_sendskb(struct sock *sk, struct sk_buff *skb, int protocol);
++
++/*
++ *	skb should fit one page. This choice is good for headerless malloc.
++ */
++#define NLMSG_GOODORDER 0
++#define NLMSG_GOODSIZE (SKB_MAX_ORDER(0, NLMSG_GOODORDER))
++#define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN)
++
++
++struct netlink_callback
++{
++	struct sk_buff	*skb;
++	struct nlmsghdr	*nlh;
++	int		(*dump)(struct sk_buff * skb, struct netlink_callback *cb);
++	int		(*done)(struct netlink_callback *cb);
++	int		family;
++	long		args[5];
++};
++
++struct netlink_notify
++{
++	int pid;
++	int protocol;
++};
++
++static __inline__ struct nlmsghdr *
++__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len, int flags)
++{
++	struct nlmsghdr *nlh;
++	int size = NLMSG_LENGTH(len);
++
++	nlh = (struct nlmsghdr*)skb_put(skb, NLMSG_ALIGN(size));
++	nlh->nlmsg_type = type;
++	nlh->nlmsg_len = size;
++	nlh->nlmsg_flags = flags;
++	nlh->nlmsg_pid = pid;
++	nlh->nlmsg_seq = seq;
++	memset(NLMSG_DATA(nlh) + len, 0, NLMSG_ALIGN(size) - size);
++	return nlh;
++}
++
++#define NLMSG_NEW(skb, pid, seq, type, len, flags) \
++({	if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) \
++		goto nlmsg_failure; \
++	__nlmsg_put(skb, pid, seq, type, len, flags); })
++
++#define NLMSG_PUT(skb, pid, seq, type, len) \
++	NLMSG_NEW(skb, pid, seq, type, len, 0)
++
++#define NLMSG_NEW_ANSWER(skb, cb, type, len, flags) \
++	NLMSG_NEW(skb, NETLINK_CB((cb)->skb).pid, \
++		  (cb)->nlh->nlmsg_seq, type, len, flags)
++
++#define NLMSG_END(skb, nlh) \
++({	(nlh)->nlmsg_len = (skb)->tail - (unsigned char *) (nlh); \
++	(skb)->len; })
++
++#define NLMSG_CANCEL(skb, nlh) \
++({	skb_trim(skb, (unsigned char *) (nlh) - (skb)->data); \
++	-1; })
++
++extern int netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
++			      struct nlmsghdr *nlh,
++			      int (*dump)(struct sk_buff *skb, struct netlink_callback*),
++			      int (*done)(struct netlink_callback*));
++
++
++#define NL_NONROOT_RECV 0x1
++#define NL_NONROOT_SEND 0x2
++extern void netlink_set_nonroot(int protocol, unsigned flag);
++
++#endif /* __KERNEL__ */
++
++#endif	/* __LINUX_NETLINK_H */
diff --git a/kernel_patches/backport/2.6.9_U4/netlink-02-netlink_h_for_rh4.patch b/kernel_patches/backport/2.6.9_U4/netlink-02-netlink_h_for_rh4.patch
new file mode 100644
index 0000000..d9ba403
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/netlink-02-netlink_h_for_rh4.patch
@@ -0,0 +1,200 @@
+diff -rup linux-2.6.20/include/linux/netlink.h linux-2.6.20-backport-rh4-u3/include/linux/netlink.h
+--- linux-2.6.20/include/linux/netlink.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/include/linux/netlink.h	2007-03-08 10:09:43.000000000 +0200
+@@ -5,24 +5,19 @@
+ #include <linux/types.h>
+ 
+ #define NETLINK_ROUTE		0	/* Routing/device hook				*/
+-#define NETLINK_UNUSED		1	/* Unused number				*/
++#define NETLINK_SKIP		1	/* Reserved for ENskip  			*/
+ #define NETLINK_USERSOCK	2	/* Reserved for user mode socket protocols 	*/
+ #define NETLINK_FIREWALL	3	/* Firewalling hook				*/
+-#define NETLINK_INET_DIAG	4	/* INET socket monitoring			*/
++#define NETLINK_TCPDIAG		4	/* TCP socket monitoring			*/
+ #define NETLINK_NFLOG		5	/* netfilter/iptables ULOG */
+ #define NETLINK_XFRM		6	/* ipsec */
+ #define NETLINK_SELINUX		7	/* SELinux event notifications */
+-#define NETLINK_ISCSI		8	/* Open-iSCSI */
++#define NETLINK_ISCSI		8
+ #define NETLINK_AUDIT		9	/* auditing */
+-#define NETLINK_FIB_LOOKUP	10	
+-#define NETLINK_CONNECTOR	11
+-#define NETLINK_NETFILTER	12	/* netfilter subsystem */
++#define NETLINK_ROUTE6		11	/* af_inet6 route comm channel */
+ #define NETLINK_IP6_FW		13
+ #define NETLINK_DNRTMSG		14	/* DECnet routing messages */
+-#define NETLINK_KOBJECT_UEVENT	15	/* Kernel messages to userspace */
+-#define NETLINK_GENERIC		16
+-/* leave room for NETLINK_DM (DM Events) */
+-#define NETLINK_SCSITRANSPORT	18	/* SCSI Transports */
++#define NETLINK_TAPBASE		16	/* 16 to 31 are ethertap */
+ 
+ #define MAX_LINKS 32		
+ 
+@@ -73,8 +68,7 @@ struct nlmsghdr
+ 
+ #define NLMSG_ALIGNTO	4
+ #define NLMSG_ALIGN(len) ( ((len)+NLMSG_ALIGNTO-1) & ~(NLMSG_ALIGNTO-1) )
+-#define NLMSG_HDRLEN	 ((int) NLMSG_ALIGN(sizeof(struct nlmsghdr)))
+-#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(NLMSG_HDRLEN))
++#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(sizeof(struct nlmsghdr)))
+ #define NLMSG_SPACE(len) NLMSG_ALIGN(NLMSG_LENGTH(len))
+ #define NLMSG_DATA(nlh)  ((void*)(((char*)nlh) + NLMSG_LENGTH(0)))
+ #define NLMSG_NEXT(nlh,len)	 ((len) -= NLMSG_ALIGN((nlh)->nlmsg_len), \
+@@ -89,23 +83,12 @@ struct nlmsghdr
+ #define NLMSG_DONE		0x3	/* End of a dump	*/
+ #define NLMSG_OVERRUN		0x4	/* Data lost		*/
+ 
+-#define NLMSG_MIN_TYPE		0x10	/* < 0x10: reserved control messages */
+-
+ struct nlmsgerr
+ {
+ 	int		error;
+ 	struct nlmsghdr msg;
+ };
+ 
+-#define NETLINK_ADD_MEMBERSHIP	1
+-#define NETLINK_DROP_MEMBERSHIP	2
+-#define NETLINK_PKTINFO		3
+-
+-struct nl_pktinfo
+-{
+-	__u32	group;
+-};
+-
+ #define NET_MAJOR 36		/* Major 36 is reserved for networking 						*/
+ 
+ enum {
+@@ -113,25 +96,6 @@ enum {
+ 	NETLINK_CONNECTED,
+ };
+ 
+-/*
+- *  <------- NLA_HDRLEN ------> <-- NLA_ALIGN(payload)-->
+- * +---------------------+- - -+- - - - - - - - - -+- - -+
+- * |        Header       | Pad |     Payload       | Pad |
+- * |   (struct nlattr)   | ing |                   | ing |
+- * +---------------------+- - -+- - - - - - - - - -+- - -+
+- *  <-------------- nlattr->nla_len -------------->
+- */
+-
+-struct nlattr
+-{
+-	__u16           nla_len;
+-	__u16           nla_type;
+-};
+-
+-#define NLA_ALIGNTO		4
+-#define NLA_ALIGN(len)		(((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))
+-#define NLA_HDRLEN		((int) NLA_ALIGN(sizeof(struct nlattr)))
+-
+ #ifdef __KERNEL__
+ 
+ #include <linux/capability.h>
+@@ -141,39 +105,42 @@ struct netlink_skb_parms
+ {
+ 	struct ucred		creds;		/* Skb credentials	*/
+ 	__u32			pid;
+-	__u32			dst_group;
++	__u32			groups;
++	__u32			dst_pid;
++	__u32			dst_groups;
+ 	kernel_cap_t		eff_cap;
+ 	__u32			loginuid;	/* Login (audit) uid */
+-	__u32			sid;		/* SELinux security id */
+ };
+ 
+ #define NETLINK_CB(skb)		(*(struct netlink_skb_parms*)&((skb)->cb))
+ #define NETLINK_CREDS(skb)	(&NETLINK_CB((skb)).creds)
+ 
+ 
+-extern struct sock *netlink_kernel_create(int unit, unsigned int groups, void (*input)(struct sock *sk, int len), struct module *module);
++extern int netlink_attach(int unit, int (*function)(int,struct sk_buff *skb));
++extern void netlink_detach(int unit);
++extern int netlink_post(int unit, struct sk_buff *skb);
++extern struct sock *netlink_kernel_create(int unit, void (*input)(struct sock *sk, int len));
+ extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
+-extern int netlink_has_listeners(struct sock *sk, unsigned int group);
+ extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 pid, int nonblock);
+ extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 pid,
+-			     __u32 group, gfp_t allocation);
++			     __u32 group, int allocation);
+ extern void netlink_set_err(struct sock *ssk, __u32 pid, __u32 group, int code);
+ extern int netlink_register_notifier(struct notifier_block *nb);
+ extern int netlink_unregister_notifier(struct notifier_block *nb);
+ 
+ /* finegrained unicast helpers: */
++struct sock *netlink_getsockbypid(struct sock *ssk, u32 pid);
+ struct sock *netlink_getsockbyfilp(struct file *filp);
+-int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock,
+-		long timeo, struct sock *ssk);
++int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock, long timeo);
+ void netlink_detachskb(struct sock *sk, struct sk_buff *skb);
+ int netlink_sendskb(struct sock *sk, struct sk_buff *skb, int protocol);
+ 
+ /*
+  *	skb should fit one page. This choice is good for headerless malloc.
++ *
++ *      FIXME: What is the best size for SLAB???? --ANK
+  */
+-#define NLMSG_GOODORDER 0
+-#define NLMSG_GOODSIZE (SKB_MAX_ORDER(0, NLMSG_GOODORDER))
+-#define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN)
++#define NLMSG_GOODSIZE (PAGE_SIZE - ((sizeof(struct sk_buff)+0xF)&~0xF))
+ 
+ 
+ struct netlink_callback
+@@ -183,7 +150,7 @@ struct netlink_callback
+ 	int		(*dump)(struct sk_buff * skb, struct netlink_callback *cb);
+ 	int		(*done)(struct netlink_callback *cb);
+ 	int		family;
+-	long		args[5];
++	long		args[4];
+ };
+ 
+ struct netlink_notify
+@@ -193,7 +160,7 @@ struct netlink_notify
+ };
+ 
+ static __inline__ struct nlmsghdr *
+-__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len, int flags)
++__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len)
+ {
+ 	struct nlmsghdr *nlh;
+ 	int size = NLMSG_LENGTH(len);
+@@ -201,32 +168,15 @@ __nlmsg_put(struct sk_buff *skb, u32 pid
+ 	nlh = (struct nlmsghdr*)skb_put(skb, NLMSG_ALIGN(size));
+ 	nlh->nlmsg_type = type;
+ 	nlh->nlmsg_len = size;
+-	nlh->nlmsg_flags = flags;
++	nlh->nlmsg_flags = 0;
+ 	nlh->nlmsg_pid = pid;
+ 	nlh->nlmsg_seq = seq;
+-	memset(NLMSG_DATA(nlh) + len, 0, NLMSG_ALIGN(size) - size);
+ 	return nlh;
+ }
+ 
+-#define NLMSG_NEW(skb, pid, seq, type, len, flags) \
+-({	if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) \
+-		goto nlmsg_failure; \
+-	__nlmsg_put(skb, pid, seq, type, len, flags); })
+-
+ #define NLMSG_PUT(skb, pid, seq, type, len) \
+-	NLMSG_NEW(skb, pid, seq, type, len, 0)
+-
+-#define NLMSG_NEW_ANSWER(skb, cb, type, len, flags) \
+-	NLMSG_NEW(skb, NETLINK_CB((cb)->skb).pid, \
+-		  (cb)->nlh->nlmsg_seq, type, len, flags)
+-
+-#define NLMSG_END(skb, nlh) \
+-({	(nlh)->nlmsg_len = (skb)->tail - (unsigned char *) (nlh); \
+-	(skb)->len; })
+-
+-#define NLMSG_CANCEL(skb, nlh) \
+-({	skb_trim(skb, (unsigned char *) (nlh) - (skb)->data); \
+-	-1; })
++({ if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) goto nlmsg_failure; \
++   __nlmsg_put(skb, pid, seq, type, len); })
+ 
+ extern int netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
+ 			      struct nlmsghdr *nlh,
-- 
1.4.2


From swise at opengridcomputing.com  Wed May  9 07:30:36 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 09 May 2007 09:30:36 -0500
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <46415DFE.9030807@voltaire.com>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
	<95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com>
	<463FCA42.3000104@indiana.edu>  <46415DFE.9030807@voltaire.com>
Message-ID: <1178721036.382.16.camel@stevo-desktop>

On Wed, 2007-05-09 at 08:37 +0300, Or Gerlitz wrote:
> Andrew Friedley wrote:
> > Jeff Squyres wrote:
> >>>> FWIW, yes, adding RDMA CM support has actually been on my to-do list
> >>>> for a while, but it keeps getting bumped by higher priority items.
> >>>> It would be *much* better if some iWARP companies got involved in
> >>>> Open MPI...
> 
> > Hmm I'm interested.  I've already done some work switching over to RDMA 
> > CM for some research stuff I've been doing; it's not publicly accessible 
> > w/o the 3rd party agreement.  I can help answer questions on what 
> > exactly needs to change, and do some testing.
> 
> Zooming out a bit from the "how to make ofed's udapl work for
> ompi" thread, my thinking is that the ompi udapl btl enablement is
> actually only the first step, whereas for production/long-term use you
> want to have an rdmacm btl. There are many arguments for this; the
> quickest I can make are:
> 
> A) it seems that ompi would want to use not only RC but also UD
> multicast and unicast, which are not covered by udapl
> 
> B) there's actually no real justification for maintaining two APIs (namely
> udapl vs libibverbs/librdmacm), so down the road only one of them will
> survive (udapl is implemented ***over*** libibverbs/librdmacm, so if the
> latter dies, so does udapl). Specifically, I hear here and there that
> the OFED stack is now on its way to being deployed all over the place,
> including in commercial Unix OSs (which want modern! code that
> supports IPoIB-CM, RDS, SRP, iSER, etc., you name it), so eventually the
> rdmacm btl can also be used over Solaris et al.
> 

Agreed.  Enabling udapl will get OMPI over iwarp immediately (and
hopefully in ofed-1.2).  Post ofed-1.2, I think OMPI _should_ create an
rdma-cm btl.  That's the plan...

Steve.


From swise at opengridcomputing.com  Wed May  9 07:37:56 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 09 May 2007 09:37:56 -0500
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178718090.382.4.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
	<1178718090.382.4.camel@stevo-desktop>
Message-ID: <1178721476.382.18.camel@stevo-desktop>


Although as Boris pointed out, perhaps the hack in OMPI is no longer
needed at all...


On Wed, 2007-05-09 at 08:41 -0500, Steve Wise wrote:
> 606 opened to track the udapl change.
> 
> 607 opened to track the ompi change to remove the port number stashing
> hack.
> 
> Status: I have a patch from Arlin to test today.  I will test with that
> patch and with the OMPI port hack removed.  Stay tuned...
> 
> 
> 
> Steve.
> 
> On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote:
> > Steve Wise wrote:
> > 
> > >I would like the group to consider including changes needed to OMPI
> > >and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.  
> > >
> > >This will provide OMPI support over iwarp devices via udapl until we can
> > >get rdma-cm support added to OMPI.  
> > >
> > >
> > >Steve.
> > >  
> > >  
> > >
> > > Steve, can you open a bug to track this?
> 
> _______________________________________________
> devel mailing list
> devel at open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


From jsquyres at cisco.com  Wed May  9 08:25:43 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 9 May 2007 11:25:43 -0400
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178721476.382.18.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
Message-ID: <E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>

FWIW, I would marginally prefer if this bug is tracked in the Open  
MPI trac ticket system, not the OFA bugzilla (Steve W. will have  
write access there as soon as Chelsio submits their OMPI 3rd party  
contribution agreement).  We've traditionally [mostly] tracked OMPI  
bugs in the OMPI bug system and OFED-specific OMPI packaging problems  
in the OFA bugzilla.  It's a gray area, I admit.

But since I'm not the uDAPL maintainer in Open MPI, moving the bug  
over there will allow the Right people to see it (some OMPI  
developers are cross subscribed to the OFA general list, but not  
all).  For example, this udapl problem is likely related to the  
existing OMPI trac ticket 890
(https://svn.open-mpi.org/trac/ompi/ticket/890).


On May 9, 2007, at 10:37 AM, Steve Wise wrote:

>
> Although as Boris pointed out, perhaps the hack in OMPI is no longer
> needed at all...
>
>
> On Wed, 2007-05-09 at 08:41 -0500, Steve Wise wrote:
>> 606 opened to track the udapl change.
>>
>> 607 opened to track the ompi change to remove the port number  
>> stashing
>> hack.
>>
>> Status: I have a patch from Arlin to test today.  I will test with  
>> that
>> patch and with the OMPI port hack removed.  Stay tuned...
>>
>>
>>
>> Steve.
>>
>> On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote:
>>> Steve Wise wrote:
>>>
>>>> I would like the group to consider including changes needed to OMPI
>>>> and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.
>>>>
>>>> This will provide OMPI support over iwarp devices via udapl  
>>>> until we can
>>>> get rdma-cm support added to OMPI.
>>>>
>>>>
>>>> Steve.
>>>>
>>>>
>>>>
>>> Steve, can you open a bug to track this?
>>
>> _______________________________________________
>> devel mailing list
>> devel at open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
Cisco Systems


From jsquyres at cisco.com  Wed May  9 08:25:46 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 9 May 2007 11:25:46 -0400
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <1178721036.382.16.camel@stevo-desktop>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
	<95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com>
	<463FCA42.3000104@indiana.edu> <46415DFE.9030807@voltaire.com>
	<1178721036.382.16.camel@stevo-desktop>
Message-ID: <DA4A8E66-D0BF-4842-A529-022D7E35591A@cisco.com>

On May 9, 2007, at 10:30 AM, Steve Wise wrote:

> Agreed.  Enabling udapl will get OMPI over iwarp immediately (and
> hopefully in ofed-1.2).  Post ofed-1.2, I think OMPI _should_ create an
> rdma-cm btl.  That's the plan...

Yes and no.  Please see my other reply about an "rdma cm" BTL...

-- 
Jeff Squyres
Cisco Systems


From pradeeps at linux.vnet.ibm.com  Wed May  9 08:32:43 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Wed, 09 May 2007 08:32:43 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ)[PATCH V4] patch for review
Message-ID: <4641E99B.10706@linux.vnet.ibm.com>

Here is a fourth version of the IPOIB_CM_NOSRQ patch for review. This 
patch will benefit adapters that do not support shared receive queues.

This patch incorporates the following review comments from v3:
1. Incorporated review comments (related to style) from Roland Dreier 
and Michael Tsirkin
2. Fixed a couple of leaks in the error path (thanks to Roland Dreier 
for pointing them out).
3. Eliminated spin lock in data path (as suggested by Michael Tsirkin)
4. Changes to avoid CQ overflow (issue pointed out by Michael Tsirkin)
5. Send REJ when no RC QPs remain (credit Michael Tsirkin for the idea)
6. I have reset the retry_count to 0 in ipoib_cm_send_req()
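
A rough sketch of the arithmetic behind item 4, for reviewers who want to
check the CQ sizing: without an SRQ, every passive RC QP gets its own receive
ring, so the CQ has to hold one ring per slot of the index table. The helper
name nosrq_cq_size() and its base_size parameter are hypothetical and exist
only for this illustration; the patch does the addition inline in
ipoib_transport_dev_init().

```c
#include <assert.h>

/* Taken from the patch: maximum number of passive RC QPs without an SRQ. */
#define NOSRQ_INDEX_TABLE_SIZE 1024UL

/*
 * Hypothetical helper, not part of the patch: CQ entries needed when the
 * HCA has no SRQ. base_size is the CQ size computed before the NOSRQ
 * adjustment and is assumed to already include one CM receive ring of
 * recvq_size entries.
 */
static unsigned long nosrq_cq_size(unsigned long base_size,
				   unsigned long recvq_size)
{
	/* The base must already cover one CM receive ring. */
	assert(base_size >= recvq_size);

	/* Add one more ring for each of the remaining table slots. */
	return base_size + (NOSRQ_INDEX_TABLE_SIZE - 1) * recvq_size;
}
```

With illustrative values of base_size = 320 and recvq_size = 128, this gives
320 + 1023 * 128 = 131264 entries. A CQ of that size may be worth comparing
against the max_cqe value the device reports.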

This patch has been tested with linux-2.6.21-rc7 (derived from Roland's 
for-2.6.22 git tree on 05/07/2007) with Topspin and IBM HCAs on ppc64 
machines. I have run netperf between two IBM HCAs as well as between 
an IBM and a Topspin HCA.

Note 1: For interoperability, retry_count in ipoib_cm_send_req() may need 
to be changed to a nonzero value (3 has worked for me). This is a 
temporary workaround until the HCA and/or CM bug involving the HCA local 
ACK delay is fixed.

Note 2: I ran into problems trying to build Roland's git tree (on ppc64) 
that I downloaded on 05/07/2007. Hence I used only the infiniband/ 
subdirectory and had to make changes to use skb->mac.raw = skb->data 
instead of skb_reset_mac_header(skb). I did not want to submit a patch 
that was untested. This can be fixed with a subsequent patch when I get 
the tree to build.

Signed-off-by: Pradeep Satyanarayana
---

--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-07 16:05:32.000000000 -0700
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-07 17:42:14.000000000 -0700
@@ -97,9 +97,13 @@ enum {

  #define	IPOIB_OP_RECV   (1ul << 31)
  #ifdef CONFIG_INFINIBAND_IPOIB_CM
-#define	IPOIB_CM_OP_SRQ (1ul << 30)
+#define	IPOIB_CM_OP_RECV (1ul << 30)
+
+#define NOSRQ_INDEX_TABLE_SIZE 1024
+#define NOSRQ_INDEX_MASK      (NOSRQ_INDEX_TABLE_SIZE -1)
+
  #else
-#define	IPOIB_CM_OP_SRQ (0)
+#define	IPOIB_CM_OP_RECV (0)
  #endif

  /* structs */
@@ -133,11 +137,14 @@ struct ipoib_cm_data {
  };

  struct ipoib_cm_rx {
-	struct ib_cm_id     *id;
-	struct ib_qp        *qp;
-	struct list_head     list;
-	struct net_device   *dev;
-	unsigned long        jiffies;
+	struct ib_cm_id     	*id;
+	struct ib_qp        	*qp;
+	struct ipoib_cm_rx_buf 	*rx_ring; /* Used by NOSRQ only */
+	struct list_head     	 list;
+	struct net_device   	*dev;
+	unsigned long        	 jiffies;
+	u32			 index; /* wr_ids are distinguished by index
+					 * to identify the QP -NOSRQ only */
  };

  struct ipoib_cm_tx {
@@ -176,6 +183,8 @@ struct ipoib_cm_dev_priv {
  	struct ib_wc            ibwc[IPOIB_NUM_WC];
  	struct ib_sge           rx_sge[IPOIB_CM_RX_SG];
  	struct ib_recv_wr       rx_wr;
+	struct ipoib_cm_rx	**rx_index_table; /* See ipoib_cm_dev_init()
+						   *for usage of this element */
  };

  /*
@@ -521,10 +530,9 @@ static inline void ipoib_cm_skb_too_long
  	dev_kfree_skb_any(skb);
  }

-static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
  {
  }
-
  #endif

  #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-07 22:19:52.000000000 -0700
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-08 18:07:15.000000000 -0700
@@ -76,20 +76,20 @@ static void ipoib_cm_dma_unmap_rx(struct
  		ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, 
DMA_FROM_DEVICE);
  }

-static int ipoib_cm_post_receive(struct net_device *dev, int id)
+static int post_receive_srq(struct net_device *dev, u64 id)
  {
  	struct ipoib_dev_priv *priv = netdev_priv(dev);
  	struct ib_recv_wr *bad_wr;
  	int i, ret;

-	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ;
+	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV;

  	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
  		priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i];

  	ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr);
  	if (unlikely(ret)) {
-		ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret);
+		ipoib_warn(priv, "post srq failed for buf %ld (%d)\n", id, ret);
  		ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
  				      priv->cm.srq_ring[id].mapping);
  		dev_kfree_skb_any(priv->cm.srq_ring[id].skb);
@@ -99,12 +99,60 @@ static int ipoib_cm_post_receive(struct
  	return ret;
  }

-static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags,
+static int post_receive_nosrq(struct net_device *dev, u64 id)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_recv_wr *bad_wr;
+	int i, ret;
+	u32 index;
+	u32 wr_id;
+	struct ipoib_cm_rx *rx_ptr;
+
+	index = id  & NOSRQ_INDEX_MASK ;
+	wr_id = id >> 32;
+
+	rx_ptr = priv->cm.rx_index_table[index];
+
+	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV;
+
+	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
+		priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i];
+
+	ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr);
+	if (unlikely(ret)) {
+		ipoib_warn(priv, "post recv failed for buf %d (%d)\n",
+		           wr_id, ret);
+		ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
+		                      rx_ptr->rx_ring[wr_id].mapping);
+		dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb);
+		rx_ptr->rx_ring[wr_id].skb = NULL;
+	}
+
+	return ret;
+}
+
+static int ipoib_cm_post_receive(struct net_device *dev, u64 id)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int ret;
+
+	if (priv->cm.srq)
+		ret = post_receive_srq(dev, id);
+	else
+		ret = post_receive_nosrq(dev, id);
+
+	return ret;
+}
+
+static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id,
+					     int frags,
  					     u64 mapping[IPOIB_CM_RX_SG])
  {
  	struct ipoib_dev_priv *priv = netdev_priv(dev);
  	struct sk_buff *skb;
  	int i;
+	struct ipoib_cm_rx *rx_ptr;
+	u32 index, wr_id;

  	skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12);
  	if (unlikely(!skb))
@@ -136,7 +184,14 @@ static struct sk_buff *ipoib_cm_alloc_rx
  			goto partial_error;
  	}

-	priv->cm.srq_ring[id].skb = skb;
+	if (priv->cm.srq)
+		priv->cm.srq_ring[id].skb = skb;
+	else {
+		index = id  & NOSRQ_INDEX_MASK ;
+		wr_id = id >> 32;
+		rx_ptr = priv->cm.rx_index_table[index];
+		rx_ptr->rx_ring[wr_id].skb = skb;
+	}
  	return skb;

  partial_error:
@@ -159,11 +214,14 @@ static struct ib_qp *ipoib_cm_create_rx_
  		.recv_cq = priv->cq,
  		.srq = priv->cm.srq,
  		.cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */
+		.cap.max_recv_wr = ipoib_recvq_size + 1,
  		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
  		.sq_sig_type = IB_SIGNAL_ALL_WR,
  		.qp_type = IB_QPT_RC,
  		.qp_context = p,
  	};
+	if (!priv->cm.srq)
+		attr.cap.max_recv_sge = IPOIB_CM_RX_SG;
  	return ib_create_qp(priv->pd, &attr);
  }

@@ -217,12 +275,103 @@ static int ipoib_cm_send_rep(struct net_
  	rep.flow_control = 0;
  	rep.rnr_retry_count = req->rnr_retry_count;
  	rep.target_ack_delay = 20; /* FIXME */
-	rep.srq = 1;
  	rep.qp_num = qp->qp_num;
  	rep.starting_psn = psn;
+	rep.srq	= !!priv->cm.srq;
  	return ib_send_cm_rep(cm_id, &rep);
  }

+static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id,
+				        struct ipoib_cm_rx *p, unsigned psn)
+{
+	struct net_device *dev = cm_id->context;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int ret;
+	u32 qp_num, index;
+	u64 i;
+
+	qp_num = p->qp->qp_num;
+
+	/* In the SRQ case there is a common rx buffer called the srq_ring.
+	 * However, for the NOSRQ we create an rx_ring for every
+	 * struct ipoib_cm_rx.
+	 */
+	p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL);
+	if (!p->rx_ring) {
+		printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n",
+		       qp_num);
+		return -ENOMEM;
+	}
+
+	cm_id->context = p;
+	p->jiffies = jiffies;
+	spin_lock_irq(&priv->lock);
+	list_add(&p->list, &priv->cm.passive_ids);
+		
+	for (index = 0; index < NOSRQ_INDEX_TABLE_SIZE; index++)
+		if (priv->cm.rx_index_table[index] == NULL)
+			break;
+
+	if ( index == NOSRQ_INDEX_TABLE_SIZE) {
+		spin_unlock_irq(&priv->lock);
+		ipoib_warn(priv, "NOSRQ supports a max of %d RC "
+		       "QPs. That limit has now been reached\n",
+		       NOSRQ_INDEX_TABLE_SIZE);
+
+		/* We send a REJ to the remote side indicating that we
+		 * have no more free RC QPs and leave it to the remote side
+		 * to take appropriate action. This should leave the
+		 * current set of QPs unaffected and any subsequent REQs
+		 * will be able to use RC QPs if they are available.
+		 */
+		ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0);
+		ret = -EINVAL;
+		goto err_send_rej;
+	}
+
+	priv->cm.rx_index_table[index] = p;
+	spin_unlock_irq(&priv->lock);
+
+	/* We will subsequently use this stored pointer while freeing
+	 * resources in stale task */
+	p->index = index;
+
+	ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
+	if (ret) {
+		ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret);
+		ipoib_cm_dev_cleanup(dev);
+		goto err_modify_nosrq;
+	}
+
+	for (i = 0; i < ipoib_recvq_size; ++i) {
+		if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index,
+					   IPOIB_CM_RX_SG - 1,
+					   p->rx_ring[i].mapping)) {
+			ipoib_warn(priv, "failed to allocate receive "
+			           "buffer %ld\n", i);
+			ipoib_cm_dev_cleanup(dev);
+			ret = -ENOMEM;
+			goto err_alloc_and_post;
+		}
+
+		if (ipoib_cm_post_receive(dev, i << 32 | index)) {
+			ipoib_warn(priv, "ipoib_ib_post_receive "
+			           "failed for  buf %ld\n", i);
+			ipoib_cm_dev_cleanup(dev);
+			ret = -EIO;
+			goto err_alloc_and_post;
+		}
+	}
+
+	return 0;
+
+err_send_rej:
+err_modify_nosrq:
+err_alloc_and_post:
+	kfree(p->rx_ring);
+	return ret;
+}
+
  static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
  {
  	struct net_device *dev = cm_id->context;
@@ -233,8 +382,11 @@ static int ipoib_cm_req_handler(struct i

  	ipoib_dbg(priv, "REQ arrived\n");
  	p = kzalloc(sizeof *p, GFP_KERNEL);
-	if (!p)
+	if (!p) {
+		printk(KERN_WARNING "Failed to allocate RX control block when "
+		       "REQ arrived\n");
  		return -ENOMEM;
+	}
  	p->dev = dev;
  	p->id = cm_id;
  	p->qp = ipoib_cm_create_rx_qp(dev, p);
@@ -244,9 +396,15 @@ static int ipoib_cm_req_handler(struct i
  	}

  	psn = random32() & 0xffffff;
-	ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
-	if (ret)
-		goto err_modify;
+	if (!priv->cm.srq) {
+		if (ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn))
+			goto err_post_nosrq;
+	} else {
+		p->rx_ring = NULL;
+		ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
+		if (ret)
+			goto err_modify;
+	}

  	ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn);
  	if (ret) {
@@ -254,16 +412,19 @@ static int ipoib_cm_req_handler(struct i
  		goto err_rep;
  	}

-	cm_id->context = p;
-	p->jiffies = jiffies;
-	spin_lock_irq(&priv->lock);
-	list_add(&p->list, &priv->cm.passive_ids);
-	spin_unlock_irq(&priv->lock);
+	if (priv->cm.srq) {
+		cm_id->context = p;
+		p->jiffies = jiffies;
+		spin_lock_irq(&priv->lock);
+		list_add(&p->list, &priv->cm.passive_ids);
+		spin_unlock_irq(&priv->lock);
+	}
  	queue_delayed_work(ipoib_workqueue,
  			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
  	return 0;

  err_rep:
+err_post_nosrq:
  err_modify:
  	ib_destroy_qp(p->qp);
  err_qp:
@@ -339,48 +500,53 @@ static void skb_put_frags(struct sk_buff
  	}
  }

-void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+static void timer_check(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
+{
+	unsigned long flags;
+
+	if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
+		spin_lock_irqsave(&priv->lock, flags);
+		p->jiffies = jiffies;
+		/* Move this entry to list head, but do
+		 * not re-add it if it has been removed. */
+		if (!list_empty(&p->list))
+			list_move(&p->list, &priv->cm.passive_ids);
+		spin_unlock_irqrestore(&priv->lock, flags);
+		queue_delayed_work(ipoib_workqueue,
+				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
+	}
+}
+
+static void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc)
  {
  	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ;
  	struct sk_buff *skb, *newskb;
+	u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV;
  	struct ipoib_cm_rx *p;
-	unsigned long flags;
-	u64 mapping[IPOIB_CM_RX_SG];
-	int frags;
+	int frags, ret;

  	ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n",
  		       wr_id, wc->status);

  	if (unlikely(wr_id >= ipoib_recvq_size)) {
-		ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
-			   wr_id, ipoib_recvq_size);
-		return;
+		ipoib_warn(priv, "cm recv completion event with wrid %ld "
+		           "(> %d)\n", wr_id, ipoib_recvq_size);
+		return;
  	}

  	skb  = priv->cm.srq_ring[wr_id].skb;

  	if (unlikely(wc->status != IB_WC_SUCCESS)) {
  		ipoib_dbg(priv, "cm recv error "
-			   "(status=%d, wrid=%d vend_err %x)\n",
+			   "(status=%d, wrid=%ld vend_err %x)\n",
  			   wc->status, wr_id, wc->vendor_err);
  		++priv->stats.rx_dropped;
-		goto repost;
+		goto repost_srq;
  	}

  	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
  		p = wc->qp->qp_context;
-		if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
-			spin_lock_irqsave(&priv->lock, flags);
-			p->jiffies = jiffies;
-			/* Move this entry to list head, but do
-			 * not re-add it if it has been removed. */
-			if (!list_empty(&p->list))
-				list_move(&p->list, &priv->cm.passive_ids);
-			spin_unlock_irqrestore(&priv->lock, flags);
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
-		}
+		timer_check(priv, p);
  	}

  	frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
@@ -392,13 +558,113 @@ void ipoib_cm_handle_rx_wc(struct net_de
  		 * If we can't allocate a new RX buffer, dump
  		 * this packet and reuse the old buffer.
  		 */
-		ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
+		ipoib_dbg(priv, "failed to allocate receive buffer %ld\n", wr_id);
+                ++priv->stats.rx_dropped;
+                goto repost_srq;
+        }
+
+	ipoib_cm_dma_unmap_rx(priv, frags,
+	                      priv->cm.srq_ring[wr_id].mapping);
+	memcpy(priv->cm.srq_ring[wr_id].mapping, mapping,
+	       (frags + 1) * sizeof *mapping);
+	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
+		       wc->byte_len, wc->slid);
+
+	skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb);
+
+	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
+	skb->mac.raw = skb->data;
+	skb_pull(skb, IPOIB_ENCAP_LEN);
+
+	dev->last_rx = jiffies;
+	++priv->stats.rx_packets;
+	priv->stats.rx_bytes += skb->len;
+
+	skb->dev = dev;
+	/* XXX get correct PACKET_ type here */
+	skb->pkt_type = PACKET_HOST;
+	netif_rx_ni(skb);
+
+repost_srq:
+	ret = ipoib_cm_post_receive(dev, wr_id);
+
+	if (unlikely(ret))
+		ipoib_warn(priv, "ipoib_cm_post_receive failed for buf %ld\n",
+		           wr_id);
+
+}
+
+static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct sk_buff *skb, *newskb;
+	u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32;
+	u32 index;
+	struct ipoib_cm_rx *p, *rx_ptr;
+	int frags, ret;
+
+
+	ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n",
+		       wr_id, wc->status);
+
+	if (unlikely(wr_id >= ipoib_recvq_size)) {
+		ipoib_warn(priv, "cm recv completion event with wrid %ld "
+		           "(> %d)\n", wr_id, ipoib_recvq_size);
+		return;
+	}
+
+	index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK ;
+
+	/* This is the only place where rx_ptr could be a NULL - could
+	 * have just received a packet from a connection that has become
+	 * stale and so is going away. We will simply drop the packet and
+	 * let the hardware (it s IB_QPT_RC) handle the dropped packet.
+	 * In the timer_check() function below, p->jiffies is updated and
+	 * hence the connection will not be stale after that.
+	 */
+	rx_ptr = priv->cm.rx_index_table[index];
+	if (unlikely(!rx_ptr)) {
+		ipoib_warn(priv, "Received packet from a connection "
+		           "that is going away. Hardware will handle it.\n");
+		return;
+	}
+
+	skb = rx_ptr->rx_ring[wr_id].skb;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ipoib_dbg(priv, "cm recv error "
+			   "(status=%d, wrid=%ld vend_err %x)\n",
+			   wc->status, wr_id, wc->vendor_err);
+		++priv->stats.rx_dropped;
+		goto repost_nosrq;
+	}
+
+	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
+		/* There are no guarantees that wc->qp is not NULL for HCAs
+	 	* that do not support SRQ. */
+		p = rx_ptr;
+		timer_check(priv, p);
+	}
+
+	frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
+					      (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE;
+
+	newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags,
+				       mapping);
+	if (unlikely(!newskb)) {
+		/*
+		 * If we can't allocate a new RX buffer, dump
+		 * this packet and reuse the old buffer.
+		 */
+		ipoib_dbg(priv, "failed to allocate receive buffer %ld\n", wr_id);
  		++priv->stats.rx_dropped;
-		goto repost;
+		goto repost_nosrq;
  	}

-	ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping);
-	memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof 
*mapping);
+	ipoib_cm_dma_unmap_rx(priv, frags,
+	                      rx_ptr->rx_ring[wr_id].mapping);
+	memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping,
+	       (frags + 1) * sizeof *mapping);

  	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
  		       wc->byte_len, wc->slid);
@@ -418,10 +684,22 @@ void ipoib_cm_handle_rx_wc(struct net_de
  	skb->pkt_type = PACKET_HOST;
  	netif_receive_skb(skb);

-repost:
-	if (unlikely(ipoib_cm_post_receive(dev, wr_id)))
-		ipoib_warn(priv, "ipoib_cm_post_receive failed "
-			   "for buf %d\n", wr_id);
+repost_nosrq:
+	ret = ipoib_cm_post_receive(dev, wr_id << 32 | index);
+
+	if (unlikely(ret))
+		ipoib_warn(priv, "ipoib_cm_post_receive failed for buf %ld\n",
+		           wr_id);
+}
+
+void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	if (priv->cm.srq)
+		handle_rx_wc_srq(dev, wc);
+	else
+		handle_rx_wc_nosrq(dev, wc);
  }

  static inline int post_send(struct ipoib_dev_priv *priv,
@@ -609,6 +887,22 @@ int ipoib_cm_dev_open(struct net_device
  	return 0;
  }

+static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
+{
+	int i;
+
+	for(i = 0; i < ipoib_recvq_size; ++i)
+		if(p->rx_ring[i].skb) {
+			ipoib_cm_dma_unmap_rx(priv,
+				         IPOIB_CM_RX_SG - 1,
+					 p->rx_ring[i].mapping);
+			dev_kfree_skb_any(p->rx_ring[i].skb);
+			p->rx_ring[i].skb = NULL;
+		}
+		kfree(p->rx_ring);
+}
+
+
  void ipoib_cm_dev_stop(struct net_device *dev)
  {
  	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -622,6 +916,8 @@ void ipoib_cm_dev_stop(struct net_device
  	spin_lock_irq(&priv->lock);
  	while (!list_empty(&priv->cm.passive_ids)) {
  		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
+		if (!priv->cm.srq)
+			free_resources_nosrq(priv, p);
  		list_del_init(&p->list);
  		spin_unlock_irq(&priv->lock);
  		ib_destroy_cm_id(p->id);
@@ -709,7 +1005,9 @@ static struct ib_qp *ipoib_cm_create_tx_
  	attr.recv_cq = priv->cq;
  	attr.srq = priv->cm.srq;
  	attr.cap.max_send_wr = ipoib_sendq_size;
+	attr.cap.max_recv_wr = 1;
  	attr.cap.max_send_sge = 1;
+	attr.cap.max_recv_sge = 1;
  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
  	attr.qp_type = IB_QPT_RC;
  	attr.send_cq = cq;
@@ -749,7 +1047,7 @@ static int ipoib_cm_send_req(struct net_
  	req.retry_count 	      = 0; /* RFC draft warns against retries */
  	req.rnr_retry_count 	      = 0; /* RFC draft warns against retries */
  	req.max_cm_retries 	      = 15;
-	req.srq 	              = 1;
+	req.srq			      = !!priv->cm.srq;
  	return ib_send_cm_req(id, &req);
  }

@@ -1085,6 +1383,7 @@ static void ipoib_cm_stale_task(struct w
  	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
  						   cm.stale_task.work);
  	struct ipoib_cm_rx *p;
+	struct ib_qp_attr qp_attr;

  	spin_lock_irq(&priv->lock);
  	while (!list_empty(&priv->cm.passive_ids)) {
@@ -1093,6 +1392,12 @@ static void ipoib_cm_stale_task(struct w
  		p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list);
  		if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT))
  			break;
+		if (!priv->cm.srq) {
+			free_resources_nosrq(priv, p);
+			priv->cm.rx_index_table[p->index] = NULL;
+			qp_attr.qp_state = IB_QPS_ERR;
+			ib_modify_qp(p->qp, &qp_attr, IB_QP_STATE);
+		}
  		list_del_init(&p->list);
  		spin_unlock_irq(&priv->lock);
  		ib_destroy_cm_id(p->id);
@@ -1147,16 +1452,40 @@ int ipoib_cm_add_mode_attr(struct net_de
  	return device_create_file(&dev->dev, &dev_attr_mode);
  }

+static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv)
+{
+	struct ib_srq_init_attr srq_init_attr;
+	int ret;
+
+	srq_init_attr.attr.max_wr = ipoib_recvq_size;
+	srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG;
+
+	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
+	if (IS_ERR(priv->cm.srq)) {
+		ret = PTR_ERR(priv->cm.srq);
+		priv->cm.srq = NULL;
+		return ret;
+	}
+
+	priv->cm.srq_ring = kzalloc(ipoib_recvq_size *
+		                    sizeof *priv->cm.srq_ring,
+			            GFP_KERNEL);
+	if (!priv->cm.srq_ring) {
+		printk(KERN_WARNING "%s: failed to allocate CM ring "
+		       "(%d entries)\n",
+	       	       priv->ca->name, ipoib_recvq_size);
+		ipoib_cm_dev_cleanup(dev);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
  int ipoib_cm_dev_init(struct net_device *dev)
  {
  	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ib_srq_init_attr srq_init_attr = {
-		.attr = {
-			.max_wr  = ipoib_recvq_size,
-			.max_sge = IPOIB_CM_RX_SG
-		}
-	};
  	int ret, i;
+	struct ib_device_attr attr;

  	INIT_LIST_HEAD(&priv->cm.passive_ids);
  	INIT_LIST_HEAD(&priv->cm.reap_list);
@@ -1168,20 +1497,30 @@ int ipoib_cm_dev_init(struct net_device

  	skb_queue_head_init(&priv->cm.skb_queue);

-	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
-	if (IS_ERR(priv->cm.srq)) {
-		ret = PTR_ERR(priv->cm.srq);
-		priv->cm.srq = NULL;
+	if (ret = ib_query_device(priv->ca, &attr))
  		return ret;
-	}

-	priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring,
-				    GFP_KERNEL);
-	if (!priv->cm.srq_ring) {
-		printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n",
-		       priv->ca->name, ipoib_recvq_size);
-		ipoib_cm_dev_cleanup(dev);
-		return -ENOMEM;
+	if (attr.max_srq) {
+		/* This device supports SRQ */
+		if (ret = create_srq(dev, priv))
+			return ret;
+		priv->cm.rx_index_table = NULL;
+	} else {
+		priv->cm.srq = NULL;
+		priv->cm.srq_ring = NULL;
+
+		/* Every new REQ that arrives creates a struct ipoib_cm_rx.
+		 * These structures form a link list starting with the
+		 * passive_ids. For quick and easy access we maintain a table
+		 * of pointers to struct ipoib_cm_rx called the rx_index_table
+		 */
+		priv->cm.rx_index_table = kzalloc(NOSRQ_INDEX_TABLE_SIZE *
+					 sizeof *priv->cm.rx_index_table,
+					 GFP_KERNEL);
+		if (!priv->cm.rx_index_table) {
+			printk(KERN_WARNING "Failed to allocate NOSRQ_INDEX_TABLE\n");
+			return -ENOMEM;
+		}	
  	}

  	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
@@ -1194,17 +1533,23 @@ int ipoib_cm_dev_init(struct net_device
  	priv->cm.rx_wr.sg_list = priv->cm.rx_sge;
  	priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG;

-	for (i = 0; i < ipoib_recvq_size; ++i) {
-		if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1,
+	/* One can post receive buffers even before the RX QP is created
+	 * only in the SRQ case. Therefore for NOSRQ we skip the rest of init
+	 * and do that in ipoib_cm_req_handler() */
+
+	if (priv->cm.srq) {
+		for (i = 0; i < ipoib_recvq_size; ++i) {
+			if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1,
  					   priv->cm.srq_ring[i].mapping)) {
-			ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
-			ipoib_cm_dev_cleanup(dev);
-			return -ENOMEM;
-		}
-		if (ipoib_cm_post_receive(dev, i)) {
-			ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i);
-			ipoib_cm_dev_cleanup(dev);
-			return -EIO;
+				ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
+				ipoib_cm_dev_cleanup(dev);
+				return -ENOMEM;
+			}
+			if (ipoib_cm_post_receive(dev, i)) {
+				ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i);
+				ipoib_cm_dev_cleanup(dev);
+				return -EIO;
+			}
  		}
  	}

--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-07 22:31:33.000000000 -0700
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-07 17:29:52.000000000 -0700
@@ -299,7 +299,7 @@ int ipoib_poll(struct net_device *dev, i
  		for (i = 0; i < n; ++i) {
  			struct ib_wc *wc = priv->ibwc + i;

-			if (wc->wr_id & IPOIB_CM_OP_SRQ) {
+			if (wc->wr_id & IPOIB_CM_OP_RECV) {
  				++done;
  				--max;
  				ipoib_cm_handle_rx_wc(dev, wc);
@@ -607,7 +607,7 @@ int ipoib_ib_dev_stop(struct net_device
  		do {
  			n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
  			for (i = 0; i < n; ++i) {
-				if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
+				if (priv->ibwc[i].wr_id & IPOIB_CM_OP_RECV)
  					ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
  				else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
  					ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-07 16:05:32.000000000 -0700
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-07 17:13:28.000000000 -0700
@@ -187,6 +187,15 @@ int ipoib_transport_dev_init(struct net_
  	if (!ret)
  		size += ipoib_recvq_size;

+ 	/* We increase the size of the CQ in the NOSRQ case to prevent CQ
+ 	 * overflow. Every new REQ creates a new RX QP and each QP has an
+ 	 * RX ring associated with it. Therefore we could have
+ 	 * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs
+ 	 * in a CQ.
+ 	 */
+ 	if(!priv->cm.srq)
+ 		size += (NOSRQ_INDEX_TABLE_SIZE -1)* ipoib_recvq_size;
+
  	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, 
size, 0);
  	if (IS_ERR(priv->cq)) {
  		printk(KERN_WARNING "%s: failed to create CQ\n", ca->name);
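
For readers following the NOSRQ path in the patch above, here is a minimal
userspace sketch of how a work request ID appears to be laid out there: the
receive-ring slot sits in the upper 32 bits, the rx_index_table index in the
low 10 bits, and IPOIB_CM_OP_RECV (bit 30) marks the completion as a CM
receive. The function names nosrq_pack(), nosrq_slot() and nosrq_index() are
hypothetical and only illustrate the bit layout; the patch open-codes these
expressions, and the sketch uses stdint.h types in place of the kernel's
u32/u64.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirror the patch; the 64-bit constant is a choice of this sketch. */
#define IPOIB_CM_OP_RECV       (1ULL << 30)
#define NOSRQ_INDEX_TABLE_SIZE 1024
#define NOSRQ_INDEX_MASK       (NOSRQ_INDEX_TABLE_SIZE - 1)

/* Hypothetical helper: build the wr_id posted for ring slot 'slot' of the
 * connection stored at rx_index_table[index]. */
static uint64_t nosrq_pack(uint32_t slot, uint32_t index)
{
	assert(index < NOSRQ_INDEX_TABLE_SIZE);
	return ((uint64_t)slot << 32) | index | IPOIB_CM_OP_RECV;
}

/* Hypothetical helper: recover the ring slot (upper 32 bits). */
static uint32_t nosrq_slot(uint64_t wr_id)
{
	return (uint32_t)(wr_id >> 32);
}

/* Hypothetical helper: recover the table index (low bits, flag cleared). */
static uint32_t nosrq_index(uint64_t wr_id)
{
	return (uint32_t)((wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK);
}
```

For example, nosrq_pack(5, 17) yields 0x540000011, from which nosrq_slot()
returns 5 and nosrq_index() returns 17. Because the index occupies only the
low 10 bits, the flag bit and the slot never overlap it as long as the table
stays at 1024 entries.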


From Don.Kerr at Sun.COM  Wed May  9 08:42:08 2007
From: Don.Kerr at Sun.COM (Donald Kerr)
Date: Wed, 09 May 2007 11:42:08 -0400
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>
Message-ID: <4641EBD0.3000600@Sun.COM>


I agree OMPI trac ticket #890 should cover this. I will test the 
suggested fix, just removing that one line from btl_udapl.c, on Solaris. 
I am still not set up on Linux so hopefully Steve can confirm there.

-DON

Jeff Squyres wrote:

>FWIW, I would marginally prefer if this bug is tracked in the Open  
>MPI trac ticket system, not the OFA bugzilla (Steve W. will have  
>write access there as soon as Chelsio submits their OMPI 3rd party  
>contribution agreement).  We've traditionally [mostly] tracked OMPI  
>bugs in the OMPI bug system and OFED-specific OMPI packaging problems  
>in the OFA bugzilla.  It's a gray area, I admit.
>
>But since I'm not the uDAPL maintainer in Open MPI, moving the bug  
>over there will allow the Right people to see it (some OMPI  
>developers are cross subscribed to the OFA general list, but not  
>all).  For example, this udapl problem is likely related to the  
>existing OMPI trac ticket 890 (https://svn.open-mpi.org/trac/ompi/ 
>ticket/890).
>
>
>On May 9, 2007, at 10:37 AM, Steve Wise wrote:
>
>  
>
>>Although as Boris pointed out, perhaps the hack in OMPI is no longer
>>needed at all...
>>
>>
>>On Wed, 2007-05-09 at 08:41 -0500, Steve Wise wrote:
>>    
>>
>>>606 opened to track the udapl change.
>>>
>>>607 opened to track the ompi change to remove the port number  
>>>stashing
>>>hack.
>>>
>>>Status: I have a patch from Arlin to test today.  I will test with  
>>>that
>>>patch and with the OMPI port hack removed.  Stay tuned...
>>>
>>>
>>>
>>>Steve.
>>>
>>>On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote:
>>>      
>>>
>>>>Steve Wise wrote:
>>>>
>>>>        
>>>>
>>>>>I would like the group to consider including changes needed to OMPI
>>>>>and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.
>>>>>
>>>>>This will provide OMPI support over iwarp devices via udapl  
>>>>>until we can
>>>>>get rdma-cm support added to OMPI.
>>>>>
>>>>>
>>>>>Steve.
>>>>>
>>>>>
>>>>>
>>>>>          
>>>>>
>>>>Steve, can you open a bug to track this?
>>>>        
>>>>
>>>_______________________________________________
>>>devel mailing list
>>>devel at open-mpi.org
>>>http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>      
>>>
>
>
>  
>


From yosefe at voltaire.com  Wed May  9 08:47:48 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 09 May 2007 18:47:48 +0300
Subject: [ofa-general] Re: [PATCHv4 for-2.6.22 0 of 2] pkey change handling -
	fix bug #420
In-Reply-To: <4641BBC5.7040106@voltaire.com>
References: <4641BBC5.7040106@voltaire.com>
Message-ID: <4641ED24.6030303@voltaire.com>

Yosef Etigin wrote:
> These two patches fix bug #577: PKey table reordering caused by SM failover stops ipoib traffic

This should have been bug #420

> patch 1: add uncached device queries to core
> patch 2: restart ipoib qp on pkey change event, and use uncached queries on qp init
> 
> --
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From jimmy at hillraiser.com  Wed May  9 08:59:39 2007
From: jimmy at hillraiser.com (=?iso-8859-1?Q?Jimmy=20Hill?=)
Date: Wed, 09 May 2007 15:59:39 +0000
Subject: [ofa-general] verbs abi_compat
Message-ID: <20070509155939.17788.qmail@station183.com>

Under what conditions is the abi_compat field of struct ibv_context set to non-zero? I'm encountering a situation where it is set when coding to verbs on a clean OFED 1.2 install. This seems odd, since I expected it to be set only for verbs 1.0/1.1 compatibility.

thanks!


From ossrosch at linux.vnet.ibm.com  Wed May  9 09:24:53 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Wed, 9 May 2007 18:24:53 +0200
Subject: [ofa-general] Build problem with RHEL-4.5 and OFED-1.2
Message-ID: <200705091824.54394.ossrosch@linux.vnet.ibm.com>

Hi Doug,

I installed RHEL-4.5 on one of our ppc64 systems and noticed that the asm-ppc
directory is missing in /usr/src/kernels/2.6.9-55.EL/include. 
Normally I don't need this directory, but ibmebus.h includes
asm-ppc64/of_device.h, which in turn includes 
asm-ppc/of_device.h. Because this file is missing, I cannot build 
ehca and the ofed stack with today's ofed-1.2 daily build.

Did I do something wrong during installation?

Regards Stefan Roscher


From cap at nsc.liu.se  Wed May  9 09:28:54 2007
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Wed, 9 May 2007 18:28:54 +0200
Subject: [ofa-general] ofa_1_2_kernel 20070508-0200 daily build status
In-Reply-To: <20070509124521.GI10068@mellanox.co.il>
References: <20070508093812.9A193E603C1@openfabrics.org>
	<200705091440.01872.cap@nsc.liu.se>
	<20070509124521.GI10068@mellanox.co.il>
Message-ID: <200705091828.54260.cap@nsc.liu.se>

On Wednesday 09 May 2007, Michael S. Tsirkin wrote:
> > Quoting Peter Kjellstrom <cap at nsc.liu.se>:
> > Subject: Re: [ofa-general] ofa_1_2_kernel 20070508-0200 daily build
> > status
> >
> > Not related to the failed 2.6.21.1 below, but, are there any plans to add
> > the EL5 and EL4u5 kernels to the list? (2.6.18-8.el5 and
> > 2.6.9-55.{EL,ELsmp}).
>
> We do test on them locally, haven't the time to prepare these for
> cross-build yet. Can you do this?

Unfortunately I lack both time and resources to maintain an automated build 
verification system ;-(

/Peter

> > Also, out of curiosity, what is the 2.6.16.43-0.3 (below)?
>
> SLES10 I think.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070509/f8626d9a/attachment.sig>

From sean.hefty at intel.com  Wed May  9 10:16:00 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 9 May 2007 10:16:00 -0700
Subject: [ofa-general] [GIT PULL] OFED 1.2 librdmacm
Message-ID: <000001c7925d$b752cfe0$e598070a@amr.corp.intel.com>

Please pull in the latest librdmacm ofed_1_2 tree.  This will add a fix for
rping and man pages.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>


From mst at dev.mellanox.co.il  Wed May  9 10:41:38 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 20:41:38 +0300
Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <4641B63D.4010602@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508162727.GD5845@mellanox.co.il>
	<4640A8BD.4000405@voltaire.com>
	<20070509093548.GA7683@mellanox.co.il>
	<4641AA06.1050002@voltaire.com>
	<20070509112626.GA10068@mellanox.co.il>
	<4641B63D.4010602@voltaire.com>
Message-ID: <20070509174138.GB17734@mellanox.co.il>

> @@ -642,6 +651,11 @@ void ipoib_ib_dev_flush(struct work_stru
>  
>  	ipoib_ib_dev_down(dev, 0);
>  
> +	if (restart_qp) {
> +		ipoib_ib_dev_stop(dev, 0);
> +		ipoib_ib_dev_open(dev);
> +	}
> +
>  	/*
>  	 * The device could have been brought down between the start and when
>  	 * we get here, don't bring it back up if it's not configured up

By the way, I think I see a small issue now - if there's a
pkey change event, this will flush all interfaces, even if
the pkey changed is not used by ipoib at all.

How about:
- rename restart_qp flag to pkey_change_event
- do something like this at the beginning of the flush routine
	if (pkey_change_event &&
	    query_pkey(current index) == current_pkey)
		return;

Need to think what to do if index is not valid, but you get the idea.

This will remove all the extra flushes in the common case
where pkeys are not moved around too much.

-- 
MST


From mst at dev.mellanox.co.il  Wed May  9 10:46:25 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 20:46:25 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <4641E99B.10706@linux.vnet.ibm.com>
References: <4641E99B.10706@linux.vnet.ibm.com>
Message-ID: <20070509174625.GC17734@mellanox.co.il>

> +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
> +{
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +
> +	if (priv->cm.srq)
> +		handle_rx_wc_srq(dev, wc);
> +	else
> +		handle_rx_wc_nosrq(dev, wc);
>  }

I still think this conditional branch on datapath should be avoided
by using separate RX handlers for SRQ/non SRQ cases.
And same for the one on alloc_rx_skb.
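
Roughly what I mean, as a standalone sketch rather than real ipoib code: the handler is picked once at init time, so the per-completion path has no SRQ test. All names below are invented for illustration, and the handlers just return marker values so the selection is visible.

```c
#include <assert.h>
#include <stddef.h>

struct example_priv;
typedef int (*example_rx_handler)(struct example_priv *priv, int wc);

struct example_priv {
	void *srq;			/* non-NULL when an SRQ exists */
	example_rx_handler handle_rx;	/* chosen once, at init time */
};

/* Stand-ins for the SRQ and non-SRQ receive paths. */
static int example_rx_srq(struct example_priv *priv, int wc)
{
	(void)priv;
	return wc + 1000;
}

static int example_rx_nosrq(struct example_priv *priv, int wc)
{
	(void)priv;
	return wc + 2000;
}

static void example_init(struct example_priv *priv, void *srq)
{
	assert(priv != NULL);
	priv->srq = srq;
	priv->handle_rx = srq ? example_rx_srq : example_rx_nosrq;
}
```

Whether an indirect call actually beats a well-predicted branch would need measuring; the sketch only shows where the decision moves to.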

-- 
MST


From mst at dev.mellanox.co.il  Wed May  9 10:47:51 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 20:47:51 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <4641E99B.10706@linux.vnet.ibm.com>
References: <4641E99B.10706@linux.vnet.ibm.com>
Message-ID: <20070509174751.GD17734@mellanox.co.il>

> @@ -159,11 +214,14 @@ static struct ib_qp *ipoib_cm_create_rx_
>  		.recv_cq = priv->cq,
>  		.srq = priv->cm.srq,
>  		.cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */
> +		.cap.max_recv_wr = ipoib_recvq_size + 1,
>  		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
>  		.sq_sig_type = IB_SIGNAL_ALL_WR,
>  		.qp_type = IB_QPT_RC,
>  		.qp_context = p,
>  	};

Why aren't you using UC QPs here? With retry count 0, what is
the benefit of RC?

-- 
MST


From pradeeps at linux.vnet.ibm.com  Wed May  9 10:50:51 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Wed, 09 May 2007 10:50:51 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <20070509174625.GC17734@mellanox.co.il>
References: <4641E99B.10706@linux.vnet.ibm.com>
	<20070509174625.GC17734@mellanox.co.il>
Message-ID: <464209FB.8060006@linux.vnet.ibm.com>

Michael S. Tsirkin wrote:
>> +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
>> +{
>> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
>> +
>> +	if (priv->cm.srq)
>> +		handle_rx_wc_srq(dev, wc);
>> +	else
>> +		handle_rx_wc_nosrq(dev, wc);
>>  }
> 
> I still think this conditional branch on datapath should be avoided
> by using separate RX handlers for SRQ/non SRQ cases.
> And same for the one on alloc_rx_skb.
> 

I attempted implementing this. With NAPI now included, the code looked 
really ugly, so I decided not to do so.

Pradeep


From mst at dev.mellanox.co.il  Wed May  9 10:55:16 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 May 2007 20:55:16 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <464209FB.8060006@linux.vnet.ibm.com>
References: <4641E99B.10706@linux.vnet.ibm.com>
	<20070509174625.GC17734@mellanox.co.il>
	<464209FB.8060006@linux.vnet.ibm.com>
Message-ID: <20070509175516.GE17734@mellanox.co.il>

> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
> 
> Michael S. Tsirkin wrote:
> >>+void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
> >>+{
> >>+	struct ipoib_dev_priv *priv = netdev_priv(dev);
> >>+
> >>+	if (priv->cm.srq)
> >>+		handle_rx_wc_srq(dev, wc);
> >>+	else
> >>+		handle_rx_wc_nosrq(dev, wc);
> >> }
> >
> >I still think this conditional branch on datapath should be avoided
> >by using separate RX handlers for SRQ/non SRQ cases.
> >And same for the one on alloc_rx_skb.
> >
> 
> I attempted implementing this. With NAPI now included,
> the code looked really ugly, so I decided not to do so.

Why?
The only difference with NAPI is that instead of a separate
completion handler, you should have a separate poll routine.

-- 
MST


From pradeeps at linux.vnet.ibm.com  Wed May  9 10:56:51 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Wed, 09 May 2007 10:56:51 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <20070509174751.GD17734@mellanox.co.il>
References: <4641E99B.10706@linux.vnet.ibm.com>
	<20070509174751.GD17734@mellanox.co.il>
Message-ID: <46420B63.8080508@linux.vnet.ibm.com>

Michael S. Tsirkin wrote:
>> @@ -159,11 +214,14 @@ static struct ib_qp *ipoib_cm_create_rx_
>>  		.recv_cq = priv->cq,
>>  		.srq = priv->cm.srq,
>>  		.cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */
>> +		.cap.max_recv_wr = ipoib_recvq_size + 1,
>>  		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
>>  		.sq_sig_type = IB_SIGNAL_ALL_WR,
>>  		.qp_type = IB_QPT_RC,
>>  		.qp_context = p,
>>  	};
> 
> Why aren't you using UC QPs here? With retry count 0, what is
> the benefit of RC?
> 
The issue with switching only NOSRQ to UC is interoperability between 
HCAs. Switching IPOIB CM to UC mode would be good, but let us do all of 
it at one go.

Pradeep


From kschoche at scl.ameslab.gov  Wed May  9 11:08:17 2007
From: kschoche at scl.ameslab.gov (Kyle Schochenmaier)
Date: Wed, 09 May 2007 13:08:17 -0500
Subject: [ofa-general] ehca_mrmw patch
In-Reply-To: <200705091446.23783.fenkes@de.ibm.com>
References: <200705091446.23783.fenkes@de.ibm.com>
Message-ID: <46420E11.7080902@scl.ameslab.gov>

With the memory registration restrictions of the eHCA coupled with our 
applications which require large memory registrations, we've found that 
we can quickly trigger a case where ibv_reg_mr()  will return -EINVAL, 
when it should be returning -ENOMEM.  If we were able to differentiate 
this type of error from the default -EINVAL, we would be able to handle 
this in userspace by flushing cached entries and retrying the memory 
registration.

I've attached a patch to start the process, if there are other paths 
back to userspace that can return ENOMEM on a resource limitation we 
should also add that.
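
For reference, the userspace handling we have in mind looks roughly like the sketch below. It is not our actual code: the callbacks, the toy resource model, and the negative-errno return convention are all assumptions made so the retry loop can stand on its own.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callbacks: try_register returns 0 or a negative errno;
 * flush_cache evicts cached registrations and returns how many it freed. */
typedef int (*example_register_fn)(void *ctx, size_t len);
typedef int (*example_flush_fn)(void *ctx);

/* Retry only on -ENOMEM; any other error (e.g. -EINVAL) is returned as-is. */
static int example_register_with_retry(void *ctx, size_t len,
				       example_register_fn try_register,
				       example_flush_fn flush_cache,
				       int max_retries)
{
	int rc;

	assert(try_register != NULL && flush_cache != NULL);
	rc = try_register(ctx, len);
	while (rc == -ENOMEM && max_retries-- > 0) {
		if (flush_cache(ctx) <= 0)
			break;	/* nothing left to evict */
		rc = try_register(ctx, len);
	}
	return rc;
}

/* Toy resource model used only to exercise the loop. */
struct example_hca {
	size_t free_bytes;
	size_t cached_bytes;
};

static int example_try_register(void *ctx, size_t len)
{
	struct example_hca *h = ctx;

	if (len > h->free_bytes)
		return -ENOMEM;
	h->free_bytes -= len;
	return 0;
}

static int example_flush_cache(void *ctx)
{
	struct example_hca *h = ctx;

	if (h->cached_bytes == 0)
		return 0;
	h->free_bytes += h->cached_bytes;
	h->cached_bytes = 0;
	return 1;
}
```

The point is that this loop is only possible if resource exhaustion is reported distinctly from a genuinely invalid argument.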

thanks,

Kyle


-- 
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ehca_mrmw.patch
Type: text/x-patch
Size: 920 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070509/5c05d05e/attachment.bin>

From sean.hefty at intel.com  Wed May  9 11:39:59 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 9 May 2007 11:39:59 -0700
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull rdma-dev.git
Message-ID: <000101c79269$72c67aa0$e598070a@amr.corp.intel.com>

Roland, please pull from:

	git://git.openfabrics.org/~shefty/rdma-dev.git for-roland

This will clean up device removal synchronization in the rdma_cm.  The changes
are based on 2.6.21.

Sean Hefty (3):
      rdma/cm: simplify device removal handling code
      rdma/cm: Fix synchronization with device removal in cma_iw_handler
      rdma/cm: Add check to validate that cm_id is bound to a device.

 drivers/infiniband/core/cma.c |  106 +++++++++++++++++++++++------------------
 1 files changed, 59 insertions(+), 47 deletions(-)


From dledford at redhat.com  Wed May  9 12:05:35 2007
From: dledford at redhat.com (Doug Ledford)
Date: Wed, 09 May 2007 15:05:35 -0400
Subject: [ofa-general] Re: Build problem with RHEL-4.5 and OFED-1.2
In-Reply-To: <200705091824.54394.ossrosch@linux.vnet.ibm.com>
References: <200705091824.54394.ossrosch@linux.vnet.ibm.com>
Message-ID: <1178737535.2848.152.camel@fc6.xsintricity.com>

On Wed, 2007-05-09 at 18:24 +0200, Stefan Roscher wrote:
> Hi Doug,
> 
> I installed RHEL-4.5 on one of our ppc64 systems and noticed that the asm-ppc
> directory is missing in /usr/src/kernels/2.6.9-55.EL/include. 
> Normally I don't need this directory, but ibmebus.h includes
> asm-ppc64/of_device.h, which in turn includes 
> asm-ppc/of_device.h. Because this file is missing, I cannot build 
> ehca and the ofed stack with today's ofed-1.2 daily build.
> 
> Did I do something wrong during installation?
> 
> Regards Stefan Roscher

I'll look into it, but in the meantime, install the kernel src.rpm, go
into /usr/src/redhat/SPECS and run rpmbuild -bp kernel-2.6.spec and it
should create a complete source tree
in /usr/src/redhat/BUILD/kernel-2.6.18 that you can then get the asm-ppc
directory contents out of.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070509/da2d074b/attachment.sig>

From ardavis at ichips.intel.com  Wed May  9 12:45:52 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Wed, 09 May 2007 12:45:52 -0700
Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL
Message-ID: <464224F0.6020408@ichips.intel.com>


Vlad, please pull latest from uDAPL project (ofed_1_2 branch)

Signed-off-by: Arlin Davis <ardavis at ichips.intel.com>

Bug Fixes:
- 606: Return local and remote ports with dat_ep_query
- 585: Add bonding example to dat.conf


From swise at opengridcomputing.com  Wed May  9 12:54:58 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 09 May 2007 14:54:58 -0500
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <4641EBD0.3000600@Sun.COM>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>
	<4641EBD0.3000600@Sun.COM>
Message-ID: <1178740498.382.97.camel@stevo-desktop>

On Wed, 2007-05-09 at 11:42 -0400, Donald Kerr wrote:
> I agree OMPI trac ticket #890 should cover this. I will test the 
> suggested fix, just removing that one line from btl_udapl.c, on Solaris. 
> I am still not set up on Linux so hopefully Steve can confirm there.
> 

All,

First, I haven't tested Arlin's dat_ep_query() fix yet as we have
determined it's not needed.  The OMPI udapl btl never calls
dat_ep_query()... 

So running OMPI with the suggested fix (removing the overwriting of the
hca_addr port field in btl_udapl.c) over ofed udapl on chelsio's iwarp
rnic still doesn't work.  

There are two new issues so far:

1) this has uncovered a connection migration issue in the Chelsio
driver/firmware.  We are developing and testing a fix for this now.
Should be ready tomorrow hopefully.

2) OMPI is not adhering to the iwarp protocol requirement that the ULP,
in this case OMPI, initiating the iwarp connection (the side issuing the
dat_ep_connect() or rdma_connect()) _MUST_ be the first to send an RDMA
message.  So if a OMPI process _accepts_ an rdma connection, then it
cannot send on that connection until it receives some sort of rdma
operation from the client process.  It appears the current OMPI
connection setup model doesn't enforce this.

This combined with the bug above causes an immediate connection failure
on chelsio's rnic.  After I fix #1 above, things might get slightly
better but my guess is we will still have connection setup problems if
the server side sends before the client side finishes streaming->rdma
mode transition.  

There have been a series of discussions on the ofa general list about
this issue, and the conclusion to date is that it cannot be resolved in
the rdma-cm or iwarp-cm code of the linux rdma stack.  Mainly because
sending an RDMA message involves the ULP's work queue and completion
queue, so the CM cannot do this under the covers in a manner that
doesn't affect the application.  Thus, the applications must deal with
this.


Here is a possible solution: 

I assume in OMPI that connections are only initiated when the mpi
application does a send operation.   Given that, then udapl btl must
ensure that if a given rank accepts a connection, it cannot send
anything until the rank at the other end of the connection sends first.
Since the other side initiated the connection, it will have pending data
to send...
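
As a rough standalone sketch of that rule (not OMPI or uDAPL code; every name here is invented): the accepting side queues its sends until the first message from the initiator arrives, then drains the queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum example_role { EXAMPLE_ACTIVE, EXAMPLE_PASSIVE };

struct example_conn {
	enum example_role role;
	bool peer_has_sent;	/* set on the first message from the initiator */
	size_t queued;		/* sends held back on the accepting side */
	size_t posted;		/* sends actually handed to the transport */
};

static void example_conn_init(struct example_conn *c, enum example_role role)
{
	assert(c != NULL);
	c->role = role;
	c->peer_has_sent = false;
	c->queued = 0;
	c->posted = 0;
}

/* The initiator may always send; the acceptor only after hearing from it. */
static bool example_may_send(const struct example_conn *c)
{
	return c->role == EXAMPLE_ACTIVE || c->peer_has_sent;
}

static void example_send(struct example_conn *c)
{
	if (example_may_send(c))
		c->posted++;
	else
		c->queued++;
}

/* Called for every message received; releases any held-back sends. */
static void example_on_receive(struct example_conn *c)
{
	c->peer_has_sent = true;
	c->posted += c->queued;
	c->queued = 0;
}
```

This relies on the assumption stated above, that the initiating rank always has data to send; if it did not, the accepting side would wait forever.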

I haven't looked into how painful this will be to implement.

Thoughts?


FYI:

IETF Draft requiring this behavior:

http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-08.txt

See section 7 for specifics.

Steve.


From afriedle at open-mpi.org  Wed May  9 16:15:51 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Wed, 09 May 2007 16:15:51 -0700
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178740498.382.97.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>	<1178657765.11455.32.camel@stevo-desktop>	<4640FDE9.9010000@ichips.intel.com>	<1178718090.382.4.camel@stevo-desktop>	<1178721476.382.18.camel@stevo-desktop>	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop>
Message-ID: <46425627.8000903@open-mpi.org>


Steve Wise wrote:
> There have been a series of discussions on the ofa general list about
> this issue, and the conclusion to date is that it cannot be resolved in
> the rdma-cm or iwarp-cm code of the linux rdma stack.  Mainly because
> sending an RDMA message involves the ULP's work queue and completion
> queue, so the CM cannot do this under the covers in a manner that
> doesn't affect the application.  Thus, the applications must deal with
> this.

Why can't uDAPL deal with this?  As a uDAPL user, I really don't care 
what API uDAPL is using under the hood to move data from one place to 
another, nor the quirks of that API.  The whole point of uDAPL is to 
form a network-agnostic abstraction layer.  AFAIK, the uDAPL spec 
doesn't enforce any such requirement on RDMA communication either.  In 
my opinion, exposing such behavior above uDAPL is incorrect and is part 
of why uDAPL has seen limited adoption -- every single uDAPL 
implementation behaves in different ways, making it extremely difficult 
to write an application to work on any uDAPL implementation.  Sorry if 
this sounds harsh, but this comes from many hours of banging my head on 
the wall due to working around these sorts of problems :)

> 
> Here is a possible solution: 
> 
> I assume in OMPI that connections are only initiated when the mpi
> application does a send operation.   Given that, then udapl btl must
> ensure that if a given rank accepts a connection, it cannot send
> anything until the rank at the other end of the connection sends first.
> Since the other side initiated the connection, it will have pending data
> to send...
> 
> I haven't looked into how painful this will be to implement.
> 
> Thoughts?

Following on what I wrote above, I think Open MPI is the wrong place to 
be dealing with this.  There's enough of these hacks as it is; I'm not 
interested in seeing more get added.

Andrew


From Don.Kerr at Sun.COM  Wed May  9 13:20:23 2007
From: Don.Kerr at Sun.COM (Donald Kerr)
Date: Wed, 09 May 2007 16:20:23 -0400
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178740498.382.97.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>
	<4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop>
Message-ID: <46422D07.3050600@Sun.COM>

I am missing some context here. Where are you plugging iwarp and OMPI 
together?

Steve Wise wrote:

>On Wed, 2007-05-09 at 11:42 -0400, Donald Kerr wrote:
>  
>
>>I agree OMPI trac ticket #890 should cover this. I will test the 
>>suggested fix, just removing that one line from btl_udapl.c, on Solaris. 
>>I am still not set up on Linux so hopefully Steve can confirm there.
>>
>>    
>>
>
>All,
>
>First, I haven't tested Arlin's dat_ep_query() fix yet as we have
>determined it's not needed.  The OMPI udapl btl never calls
>dat_ep_query()... 
>
>So running OMPI with the suggested fix (removing the overwriting of the
>hca_addr port field in btl_udapl.c) over ofed udapl on chelsio's iwarp
>rnic still doesn't work.  
>
>There are two new issues so far:
>
>1) this has uncovered a connection migration issue in the Chelsio
>driver/firmware.  We are developing and testing a fix for this now.
>Should be ready tomorrow hopefully.
>
>2) OMPI is not adhering to the iwarp protocol requirement that the ULP,
>in this case OMPI, initiating the iwarp connection (the side issuing the
>dat_ep_connect() or rdma_connect()) _MUST_ be the first to send an RDMA
>message.  So if an OMPI process _accepts_ an rdma connection, then it
>cannot send on that connection until it receives some sort of rdma
>operation from the client process.  It appears the current OMPI
>connection setup model doesn't enforce this.
>
>This combined with the bug above causes an immediate connection failure
>on chelsio's rnic.  After I fix #1 above, things might get slightly
>better but my guess is we will still have connection setup problems if
>the server side sends before the client side finishes streaming->rdma
>mode transition.  
>
>There have been a series of discussions on the ofa general list about
>this issue, and the conclusion to date is that it cannot be resolved in
>the rdma-cm or iwarp-cm code of the linux rdma stack.  Mainly because
>sending an RDMA message involves the ULP's work queue and completion
>queue, so the CM cannot do this under the covers in a manner that
>doesn't affect the application.  Thus, the applications must deal with
>this.
>
>
>Here is a possible solution: 
>
>I assume in OMPI that connections are only initiated when the mpi
>application does a send operation.   Given that, then udapl btl must
>ensure that if a given rank accepts a connection, it cannot send
>anything until the rank at the other end of the connection sends first.
>Since the other side initiated the connection, it will have pending data
>to send...
>
>I haven't looked into how painful this will be to implement.
>
>Thoughts?
>
>
>FYI:
>
>IETF Draft requiring this behavior:
>
>http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-08.txt
>
>See section 7 for specifics.
>
>Steve.
>
>
>_______________________________________________
>devel mailing list
>devel at open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/devel
>  
>


From sashak at voltaire.com  Wed May  9 13:30:22 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed,  9 May 2007 23:30:22 +0300
Subject: [ofa-general] [PATCH 0/3] opensm: osm_port_t structure
	simplification.
Message-ID: <11787426251341-git-send-email-sashak@voltaire.com>

Hi Hal,

This simplifies the osm_port_t structure and related API functions.
The main idea is to stop duplicating the physical port pointer table
(from osm_node_t) and keep only one direct pointer to the appropriate
physical port (osm_physp_t).

Sasha


From sashak at voltaire.com  Wed May  9 13:30:24 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed,  9 May 2007 23:30:24 +0300
Subject: [ofa-general] [PATCH 2/3] opensm: eliminate node's physical ports
	table duplication in osm_port_t
In-Reply-To: <11787426251341-git-send-email-sashak@voltaire.com>
References: <11787426251341-git-send-email-sashak@voltaire.com>
Message-ID: <1178742625769-git-send-email-sashak@voltaire.com>

Eliminate duplication of osm_node's physical ports table in osm_port_t
object.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/include/opensm/osm_port.h    |   37 +++++++----------------
 osm/opensm/osm_pkey_rcv.c        |    2 +-
 osm/opensm/osm_port.c            |   60 +++++++++-----------------------------
 osm/opensm/osm_sa_link_record.c  |    4 +-
 osm/opensm/osm_sa_pkey_record.c  |    2 +-
 osm/opensm/osm_sa_slvl_record.c  |    4 +-
 osm/opensm/osm_sa_vlarb_record.c |    2 +-
 osm/opensm/osm_slvl_map_rcv.c    |    2 +-
 osm/opensm/osm_sm_state_mgr.c    |    2 +-
 osm/opensm/osm_subnet.c          |    4 +-
 osm/opensm/osm_vl_arb_rcv.c      |    2 +-
 11 files changed, 37 insertions(+), 84 deletions(-)

diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h
index 134012c..19a8502 100644
--- a/osm/include/opensm/osm_port.h
+++ b/osm/include/opensm/osm_port.h
@@ -1274,10 +1274,8 @@ typedef struct _osm_port
 	struct _osm_node		*p_node;
 	ib_net64_t			guid;
 	uint32_t			discovery_count;
-	uint8_t				default_port_num;
-	uint8_t				physp_tbl_size;
+	osm_physp_t			*p_physp;
 	cl_qlist_t			mcm_list;
-	osm_physp_t			*tbl[1];
 } osm_port_t;
 /*
 * FIELDS
@@ -1295,20 +1293,13 @@ typedef struct _osm_port
 *		during the current fabric sweep.  This number is reset
 *		to zero at the start of a sweep.
 *
-*	default_port_num
-*		Index of the physical port used when physical characteristics
-*		contained in the Physical Port are needed.
-*
-*	physp_tbl_size
-*		Number of physical ports associated with this logical port.
+*	p_physp
+*		The pointer to physical port used when physical
+*		characteristics contained in the Physical Port are needed.
 *
 *	mcm_list
 *		Multicast member list
 *
-*	tbl
-*		Array of pointers to Physical Port objects contained by this node.
-*     MUST BE LAST ELEMENT SINCE IT CAN GROW !!!
-*
 * SEE ALSO
 *	Port, Physical Port, Physical Port Table
 *********/
@@ -1386,10 +1377,8 @@ static inline ib_net16_t
 osm_port_get_base_lid(
 	IN const osm_port_t* const p_port )
 {
-	const osm_physp_t* const p_physp = p_port->tbl[p_port->default_port_num];
-	CL_ASSERT( p_physp );
-	CL_ASSERT( osm_physp_is_valid( p_physp ) );
-	return( osm_physp_get_base_lid( p_physp ));
+	CL_ASSERT( p_port->p_physp && osm_physp_is_valid( p_port->p_physp ) );
+	return( osm_physp_get_base_lid( p_port->p_physp ));
 }
 /*
 * PARAMETERS
@@ -1419,10 +1408,8 @@ static inline uint8_t
 osm_port_get_lmc(
 	IN const osm_port_t* const p_port )
 {
-	const osm_physp_t* const p_physp = p_port->tbl[p_port->default_port_num];
-	CL_ASSERT( p_physp );
-	CL_ASSERT( osm_physp_is_valid( p_physp ) );
-	return( osm_physp_get_lmc( p_physp ));
+	CL_ASSERT( p_port->p_physp && osm_physp_is_valid( p_port->p_physp ) );
+	return( osm_physp_get_lmc( p_port->p_physp ));
 }
 /*
 * PARAMETERS
@@ -1481,8 +1468,7 @@ osm_port_get_phys_ptr(
 	IN const osm_port_t* const p_port,
 	IN const uint8_t port_num )
 {
-	CL_ASSERT( port_num < p_port->physp_tbl_size );
-	return( p_port->tbl[port_num] );
+	return p_port->p_physp;
 }
 /*
 * PARAMETERS
@@ -1519,9 +1505,8 @@ osm_physp_t*
 osm_port_get_default_phys_ptr(
 	IN const osm_port_t* const p_port )
 {
-	CL_ASSERT( p_port->tbl[p_port->default_port_num] );
-	CL_ASSERT( osm_physp_is_valid( p_port->tbl[p_port->default_port_num] ) );
-	return( p_port->tbl[p_port->default_port_num] );
+	CL_ASSERT( osm_physp_is_valid( p_port->p_physp ) );
+	return p_port->p_physp;
 }
 /*
 * PARAMETERS
diff --git a/osm/opensm/osm_pkey_rcv.c b/osm/opensm/osm_pkey_rcv.c
index 76af9fc..0e0ec46 100644
--- a/osm/opensm/osm_pkey_rcv.c
+++ b/osm/opensm/osm_pkey_rcv.c
@@ -172,7 +172,7 @@ osm_pkey_rcv_process(
   else
   {
     p_physp = osm_port_get_default_phys_ptr(p_port);
-    port_num = p_port->default_port_num;
+    port_num = p_physp->port_num;
   }
 
   CL_ASSERT( p_physp );
diff --git a/osm/opensm/osm_port.c b/osm/opensm/osm_port.c
index 053fc22..b0949a0 100644
--- a/osm/opensm/osm_port.c
+++ b/osm/opensm/osm_port.c
@@ -174,7 +174,6 @@ osm_port_init(
   uint32_t port_index;
   ib_net64_t port_guid;
   osm_physp_t *p_physp;
-  uint32_t size;
 
   CL_ASSERT( p_port );
   CL_ASSERT( p_ni );
@@ -187,36 +186,24 @@ osm_port_init(
   p_port->guid = port_guid;
 
   /*
-    See comment in port_new for info about this...
-  */
-  size = p_ni->num_ports;
-
-  p_port->physp_tbl_size = (uint8_t)(size + 1);
-
-  /*
     Get the pointers to the physical node objects "owned" by this
     logical port GUID.
     For switches, all the ports are owned; for HCA's and routers,
     only the singular part that has this GUID is owned.
   */
-  p_port->default_port_num = 0xFF;
-  for( port_index = 0; port_index < p_port->physp_tbl_size; port_index++ )
+  for( port_index = 0; port_index < p_parent_node->physp_tbl_size; port_index++ )
   {
     p_physp = osm_node_get_physp_ptr( p_parent_node, port_index );
+    /*
+      Because much of the PortInfo data is only valid
+      for port 0 on switches, try to keep the lowest
+      possible port number for p_physp.
+    */
     if( osm_physp_is_valid( p_physp ) &&
-        port_guid == osm_physp_get_port_guid( p_physp ) )
-    {
-      p_port->tbl[port_index] = p_physp;
-      /*
-        Because much of the PortInfo data is only valid
-        for port 0 on switches, try to keep the lowest
-        possible value of default_port_num.
-      */
-      if( port_index < p_port->default_port_num )
-        p_port->default_port_num = (uint8_t)port_index;
+        port_guid == osm_physp_get_port_guid( p_physp ) ) {
+      p_port->p_physp = p_physp;
+      break;
     }
-    else
-      p_port->tbl[port_index] = NULL;
   }
 
   CL_ASSERT( p_port->default_port_num < 0xFF );
@@ -230,21 +217,11 @@ osm_port_new(
   IN const osm_node_t* const p_parent_node )
 {
   osm_port_t*  p_port;
-  uint32_t size;
-
-  /*
-    The port object already contains one physical port object pointer.
-    Therefore, subtract 1 from the number of physical ports
-    used by the switch.  This is not done for CA's since they
-    need to occupy 1 more physp pointer than they physically have since
-    we still reserve room for a "port 0".
-  */
-  size = p_ni->num_ports;
 
-  p_port = malloc( sizeof(*p_port) + sizeof(void *) * size );
+  p_port = malloc( sizeof(*p_port) );
   if( p_port != NULL )
   {
-    memset( p_port, 0, sizeof(*p_port) + sizeof(void *) * size );
+    memset( p_port, 0, sizeof(*p_port) );
     osm_port_init( p_port, p_ni, p_parent_node );
   }
 
@@ -326,7 +303,6 @@ osm_port_add_new_physp(
   p_physp = osm_node_get_physp_ptr( p_node, port_num );
   CL_ASSERT( osm_physp_is_valid( p_physp ) );
   CL_ASSERT( osm_physp_get_port_guid( p_physp ) == p_port->guid );
-  p_port->tbl[port_num] = p_physp;
 
   /*
     For switches, we generally want to use Port 0, which is
@@ -334,17 +310,9 @@ osm_port_add_new_physp(
     The LID value in the PortInfo for example, is only valid
     for port 0 on switches.
   */
-  if( !osm_physp_is_valid( p_port->tbl[p_port->default_port_num] ) )
-  {
-    p_port->default_port_num = port_num;
-  }
-  else
-  {
-    if(  port_num < p_port->default_port_num )
-    {
-      p_port->default_port_num = port_num;
-    }
-  }
+  if( !osm_physp_is_valid( osm_port_get_default_phys_ptr( p_port ) ) ||
+      port_num < p_port->p_physp->port_num )
+    p_port->p_physp = p_physp;
 }
 
 /**********************************************************************
diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c
index 18f655c..17df424 100644
--- a/osm/opensm/osm_sa_link_record.c
+++ b/osm/opensm/osm_sa_link_record.c
@@ -374,7 +374,7 @@ __osm_lr_rcv_get_port_links(
         port_num = p_lr->from_port_num;
         /* If the port number is out of the range of the p_src_port, then
            this couldn't be a relevant record. */
-        if (port_num < p_src_port->physp_tbl_size) 
+        if (port_num < p_src_port->p_node->physp_tbl_size)
         {          
           p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num );
           if (p_src_physp)
@@ -409,7 +409,7 @@ __osm_lr_rcv_get_port_links(
         port_num = p_lr->to_port_num;
         /* If the port number is out of the range of the p_dest_port, then
            this couldn't be a relevant record. */
-        if (port_num < p_dest_port->physp_tbl_size ) 
+        if (port_num < p_dest_port->p_node->physp_tbl_size )
         {
           p_dest_physp = osm_port_get_phys_ptr(
             p_dest_port, port_num );
diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c
index 0a199f1..8186603 100644
--- a/osm/opensm/osm_sa_pkey_record.c
+++ b/osm/opensm/osm_sa_pkey_record.c
@@ -239,7 +239,7 @@ __osm_sa_pkey_by_comp_mask(
   if ( p_port->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH )
   {
     /* we put it in the comp mask and port num */
-    port_num = p_port->default_port_num;
+    port_num = p_port->p_physp->port_num;
     osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
              "__osm_sa_pkey_by_comp_mask:  "
              "Using Physical Default Port Number: 0x%X (for End Node)\n",
diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c
index 3c4ff02..9fbb5c7 100644
--- a/osm/opensm/osm_sa_slvl_record.c
+++ b/osm/opensm/osm_sa_slvl_record.c
@@ -225,8 +225,8 @@ __osm_sa_slvl_by_comp_mask(
     osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
              "__osm_sa_slvl_by_comp_mask:  "
              "Using Physical Default Port Number: 0x%X (for End Node)\n",
-             p_port->default_port_num );
-    p_out_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num );
+             p_port->p_physp->port_num );
+    p_out_physp = p_port->p_physp;
     /* check that the p_out_physp and the p_req_physp share a pkey */
     if (osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_out_physp ))
     __osm_sa_slvl_create( p_rcv, p_out_physp, p_ctxt, 0 );
diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c
index 6df5ed9..97fe060 100644
--- a/osm/opensm/osm_sa_vlarb_record.c
+++ b/osm/opensm/osm_sa_vlarb_record.c
@@ -243,7 +243,7 @@ __osm_sa_vl_arb_by_comp_mask(
   if ( p_port->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH)
   {
     /* we put it in the comp mask and port num */
-    port_num = p_port->default_port_num;
+    port_num = p_port->p_physp->port_num;
     osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
              "__osm_sa_vl_arb_by_comp_mask:  "
              "Using Physical Default Port Number: 0x%X (for End Node)\n",
diff --git a/osm/opensm/osm_slvl_map_rcv.c b/osm/opensm/osm_slvl_map_rcv.c
index 3fa3a7e..b109f75 100644
--- a/osm/opensm/osm_slvl_map_rcv.c
+++ b/osm/opensm/osm_slvl_map_rcv.c
@@ -183,7 +183,7 @@ osm_slvl_rcv_process(
   else
   {
     p_physp = osm_port_get_default_phys_ptr(p_port);
-    out_port_num = p_port->default_port_num;
+    out_port_num = p_physp->port_num;
     in_port_num  = 0;
   }
 
diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c
index 3aa92c8..0034320 100644
--- a/osm/opensm/osm_sm_state_mgr.c
+++ b/osm/opensm/osm_sm_state_mgr.c
@@ -194,7 +194,7 @@ __osm_sm_state_mgr_send_local_port_info_req(
                          osm_physp_get_dr_path_ptr
                          ( osm_port_get_default_phys_ptr( p_port ) ),
                          IB_MAD_ATTR_PORT_INFO,
-                         cl_hton32( p_port->default_port_num ),
+                         cl_hton32( p_port->p_physp->port_num ),
                          CL_DISP_MSGID_NONE, &context );
 
    if( status != IB_SUCCESS )
diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
index c8c3ddc..3d9fdca 100644
--- a/osm/opensm/osm_subnet.c
+++ b/osm/opensm/osm_subnet.c
@@ -266,7 +266,7 @@ osm_get_gid_by_mad_addr(
                );
       return(IB_INVALID_PARAMETER);
     }
-    p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num);
+    p_physp = p_port->p_physp;
     p_gid->unicast.interface_id = p_physp->port_guid;
     p_gid->unicast.prefix = p_subn->opt.subnet_prefix;
   }
@@ -316,7 +316,7 @@ osm_get_physp_by_mad_addr(
     
       goto Exit;
     }
-    p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num);
+    p_physp = p_port->p_physp;
   }
   else
   {
diff --git a/osm/opensm/osm_vl_arb_rcv.c b/osm/opensm/osm_vl_arb_rcv.c
index 930360a..ed8dfc5 100644
--- a/osm/opensm/osm_vl_arb_rcv.c
+++ b/osm/opensm/osm_vl_arb_rcv.c
@@ -184,7 +184,7 @@ osm_vla_rcv_process(
   else
   {
     p_physp = osm_port_get_default_phys_ptr(p_port);
-    port_num = p_port->default_port_num;
+    port_num = p_physp->port_num;
   }
 
   CL_ASSERT( p_physp );
-- 
1.5.2.rc2.20.gac2a


From sashak at voltaire.com  Wed May  9 13:30:23 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed,  9 May 2007 23:30:23 +0300
Subject: [ofa-general] [PATCH 1/3] opensm: remove osm_port_get_num_physp()
	function
In-Reply-To: <11787426251341-git-send-email-sashak@voltaire.com>
References: <11787426251341-git-send-email-sashak@voltaire.com>
Message-ID: <11787426251658-git-send-email-sashak@voltaire.com>

This removes the osm_port_get_num_physp() function and uses the native,
node-oriented osm_node_get_num_physp() instead.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/include/opensm/osm_port.h       |   29 -----------------------------
 osm/opensm/osm_drop_mgr.c           |    2 +-
 osm/opensm/osm_link_mgr.c           |    2 +-
 osm/opensm/osm_qos.c                |    2 +-
 osm/opensm/osm_sa_link_record.c     |    8 ++++----
 osm/opensm/osm_sa_pkey_record.c     |    6 +++---
 osm/opensm/osm_sa_portinfo_record.c |    2 +-
 osm/opensm/osm_sa_slvl_record.c     |    2 +-
 osm/opensm/osm_sa_vlarb_record.c    |    6 +++---
 osm/opensm/osm_state_mgr.c          |    2 +-
 osm/opensm/osm_trap_rcv.c           |    2 +-
 11 files changed, 17 insertions(+), 46 deletions(-)
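
The call-site change is mechanical. A minimal sketch of the pattern, using
hypothetical demo_* stand-ins rather than the real OpenSM types:

```c
#include <assert.h>
#include <stdint.h>

typedef struct demo_node {
	uint8_t physp_tbl_size;
} demo_node_t;

typedef struct demo_port {
	demo_node_t *p_node;
} demo_port_t;

/* Node-side accessor, modelled on osm_node_get_num_physp(). */
static uint8_t demo_node_get_num_physp(const demo_node_t *p_node)
{
	assert(p_node);
	return p_node->physp_tbl_size;
}

/* What a caller does after this patch: ask the port's node directly
   instead of going through a port-level wrapper. */
static uint8_t demo_count_ports(const demo_port_t *p_port)
{
	assert(p_port);
	return demo_node_get_num_physp(p_port->p_node);
}
```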

diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h
index 6d51d2b..134012c 100644
--- a/osm/include/opensm/osm_port.h
+++ b/osm/include/opensm/osm_port.h
@@ -1467,35 +1467,6 @@ osm_port_get_guid(
 *	Port
 *********/
 
-/****f* OpenSM: Port/osm_port_get_num_physp
-* NAME
-*	osm_port_get_num_physp
-*
-* DESCRIPTION
-*	Returns the number of Physical Port objects associated with this port.
-*
-* SYNOPSIS
-*/
-static inline uint8_t
-osm_port_get_num_physp(
-	IN const osm_port_t* const p_port )
-{
-	return( p_port->physp_tbl_size );
-}
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to a Port object.
-*
-* RETURN VALUE
-*	Returns the number of Physical Port objects associated with this port.
-*
-* NOTES
-*
-* SEE ALSO
-*	Port
-*********/
-
 /****f* OpenSM: Port/osm_port_get_phys_ptr
 * NAME
 *	osm_port_get_phys_ptr
diff --git a/osm/opensm/osm_drop_mgr.c b/osm/opensm/osm_drop_mgr.c
index 0d08ff6..d091347 100644
--- a/osm/opensm/osm_drop_mgr.c
+++ b/osm/opensm/osm_drop_mgr.c
@@ -237,7 +237,7 @@ __osm_drop_mgr_remove_port(
     Re-initialize each Physical Port.
   */
 
-  num_physp = osm_port_get_num_physp( p_port );
+  num_physp = osm_node_get_num_physp( p_port->p_node );
   for( port_num = 0; port_num < num_physp; port_num++ )
   {
     p_physp = osm_port_get_phys_ptr( p_port, (uint8_t)port_num );
diff --git a/osm/opensm/osm_link_mgr.c b/osm/opensm/osm_link_mgr.c
index a1081bd..71c0495 100644
--- a/osm/opensm/osm_link_mgr.c
+++ b/osm/opensm/osm_link_mgr.c
@@ -426,7 +426,7 @@ __osm_link_mgr_process_port(
     with this Port.  Start iterating with port 1, since the linkstate
     is not applicable to the management port on switches.
   */
-  num_physp = osm_port_get_num_physp( p_port );
+  num_physp = osm_node_get_num_physp( p_port->p_node );
   for( i = 0; i < num_physp; i ++ )
   {
     /*
diff --git a/osm/opensm/osm_qos.c b/osm/opensm/osm_qos.c
index e71c053..11beaae 100644
--- a/osm/opensm/osm_qos.c
+++ b/osm/opensm/osm_qos.c
@@ -334,7 +334,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm)
 
 		p_node = p_port->p_node;
 		if (p_node->sw) {
-			num_physp = osm_port_get_num_physp(p_port);
+			num_physp = osm_node_get_num_physp(p_node);
 			for (i = 1; i < num_physp; i++) {
 				p_physp = osm_port_get_phys_ptr(p_port, i);
 				if (!p_physp || !osm_physp_is_valid(p_physp))
diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c
index 169e75e..18f655c 100644
--- a/osm/opensm/osm_sa_link_record.c
+++ b/osm/opensm/osm_sa_link_record.c
@@ -346,8 +346,8 @@ __osm_lr_rcv_get_port_links(
         that do not actually connect.  Don't bother screening
         for that here.
       */
-      num_ports = osm_port_get_num_physp( p_src_port );
-      dest_num_ports = osm_port_get_num_physp( p_dest_port );
+      num_ports = osm_node_get_num_physp( p_src_port->p_node );
+      dest_num_ports = osm_node_get_num_physp( p_dest_port->p_node );
       for( port_num = 1; port_num < num_ports; port_num++ )
       {
         p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num );
@@ -385,7 +385,7 @@ __osm_lr_rcv_get_port_links(
       }
       else
       {
-        num_ports = osm_port_get_num_physp( p_src_port );
+        num_ports = osm_node_get_num_physp( p_src_port->p_node );
         for( port_num = 1; port_num < num_ports; port_num++ )
         {
           p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num );
@@ -421,7 +421,7 @@ __osm_lr_rcv_get_port_links(
       }
       else
       {
-        num_ports = osm_port_get_num_physp( p_dest_port );
+        num_ports = osm_node_get_num_physp( p_dest_port->p_node );
         for( port_num = 1; port_num < num_ports; port_num++ )
         {
           p_dest_physp = osm_port_get_phys_ptr(
diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c
index 5eb15df..0a199f1 100644
--- a/osm/opensm/osm_sa_pkey_record.c
+++ b/osm/opensm/osm_sa_pkey_record.c
@@ -249,7 +249,7 @@ __osm_sa_pkey_by_comp_mask(
 
   if( comp_mask & IB_PKEY_COMPMASK_PORT )
   {
-    if (port_num < osm_port_get_num_physp( p_port ))
+    if (port_num < osm_node_get_num_physp( p_port->p_node ))
     {
       p_physp = osm_port_get_phys_ptr( p_port, port_num );
       /* Check that the p_physp is valid, and that is shares a pkey
@@ -263,13 +263,13 @@ __osm_sa_pkey_by_comp_mask(
       osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                "__osm_sa_pkey_by_comp_mask: ERR 4603: "
                "Given Physical Port Number: 0x%X is out of range should be < 0x%X\n",
-               port_num, osm_port_get_num_physp( p_port ));
+               port_num, osm_node_get_num_physp( p_port->p_node ));
       goto Exit;
     }
   }
   else
   {
-    num_ports = osm_port_get_num_physp( p_port );
+    num_ports = osm_node_get_num_physp( p_port->p_node );
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
       p_physp = osm_port_get_phys_ptr( p_port, port_num );
diff --git a/osm/opensm/osm_sa_portinfo_record.c b/osm/opensm/osm_sa_portinfo_record.c
index 5d9b1b2..9d4f18e 100644
--- a/osm/opensm/osm_sa_portinfo_record.c
+++ b/osm/opensm/osm_sa_portinfo_record.c
@@ -538,7 +538,7 @@ __osm_sa_pir_by_comp_mask(
   comp_mask = p_ctxt->comp_mask;
   p_req_physp = p_ctxt->p_req_physp;
 
-  num_ports = osm_port_get_num_physp( p_port );
+  num_ports = osm_node_get_num_physp( p_port->p_node );
 
   if( comp_mask & IB_PIR_COMPMASK_PORTNUM )
   {
diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c
index d831ffd..3c4ff02 100644
--- a/osm/opensm/osm_sa_slvl_record.c
+++ b/osm/opensm/osm_sa_slvl_record.c
@@ -213,7 +213,7 @@ __osm_sa_slvl_by_comp_mask(
 
   p_rcvd_rec = p_ctxt->p_rcvd_rec;
   comp_mask = p_ctxt->comp_mask;
-  num_ports = osm_port_get_num_physp( p_port );
+  num_ports = osm_node_get_num_physp( p_port->p_node );
   in_port_start = 0;
   in_port_end = num_ports;
   out_port_start = 0;
diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c
index f0ff957..6df5ed9 100644
--- a/osm/opensm/osm_sa_vlarb_record.c
+++ b/osm/opensm/osm_sa_vlarb_record.c
@@ -253,7 +253,7 @@ __osm_sa_vl_arb_by_comp_mask(
 
   if( comp_mask & IB_VLA_COMPMASK_OUT_PORT )
   {
-    if (port_num < osm_port_get_num_physp( p_port ))
+    if (port_num < osm_node_get_num_physp( p_port->p_node ))
     {
       p_physp = osm_port_get_phys_ptr( p_port, port_num );
       /* check that the p_physp is valid, and that the requester
@@ -267,13 +267,13 @@ __osm_sa_vl_arb_by_comp_mask(
       osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                "__osm_sa_vl_arb_by_comp_mask: ERR 2A03: "
                "Given Physical Port Number: 0x%X is out of range should be < 0x%X\n",
-               port_num, osm_port_get_num_physp( p_port ) );
+               port_num, osm_node_get_num_physp( p_port->p_node ) );
       goto Exit;
     }
   }
   else
   {
-    num_ports = osm_port_get_num_physp( p_port );
+    num_ports = osm_node_get_num_physp( p_port->p_node );
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
       p_physp = osm_port_get_phys_ptr( p_port, port_num );
diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c
index ddec10c..6f53e60 100644
--- a/osm/opensm/osm_state_mgr.c
+++ b/osm/opensm/osm_state_mgr.c
@@ -1284,7 +1284,7 @@ __osm_state_mgr_report(
       else
          start_port = 1;
 
-      num_ports = osm_port_get_num_physp( p_port );
+      num_ports = osm_node_get_num_physp( p_node );
       for( port_num = start_port; port_num < num_ports; port_num++ )
       {
          p_physp = osm_port_get_phys_ptr( p_port, port_num );
diff --git a/osm/opensm/osm_trap_rcv.c b/osm/opensm/osm_trap_rcv.c
index 0858968..ed507b6 100644
--- a/osm/opensm/osm_trap_rcv.c
+++ b/osm/opensm/osm_trap_rcv.c
@@ -108,7 +108,7 @@ __get_physp_by_lid_and_num(
   if (! p_port)
     return NULL;
 
-  if (osm_port_get_num_physp(p_port) < num)
+  if (osm_node_get_num_physp(p_port->p_node) < num)
     return NULL;
 
   return( osm_port_get_phys_ptr(p_port, num) );
-- 
1.5.2.rc2.20.gac2a


From sashak at voltaire.com  Wed May  9 13:30:25 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed,  9 May 2007 23:30:25 +0300
Subject: [ofa-general] [PATCH 3/3] opensm: remove some unneeded funcs
In-Reply-To: <11787426251341-git-send-email-sashak@voltaire.com>
References: <11787426251341-git-send-email-sashak@voltaire.com>
Message-ID: <11787426253080-git-send-email-sashak@voltaire.com>

This removes some functions that are no longer really needed:
osm_port_get_phys_ptr(), osm_port_get_default_phys_ptr() and
osm_port_get_parent_node().

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/include/opensm/osm_port.h        |  101 ----------------------------------
 osm/opensm/osm_drop_mgr.c            |    2 +-
 osm/opensm/osm_lid_mgr.c             |   14 +----
 osm/opensm/osm_link_mgr.c            |    2 +-
 osm/opensm/osm_mcast_mgr.c           |    2 +-
 osm/opensm/osm_node_info_rcv.c       |    6 +-
 osm/opensm/osm_pkey.c                |    8 +-
 osm/opensm/osm_pkey_mgr.c            |    8 +--
 osm/opensm/osm_pkey_rcv.c            |    4 +-
 osm/opensm/osm_port.c                |    6 +-
 osm/opensm/osm_port_info_rcv.c       |   12 ++--
 osm/opensm/osm_prtn.c                |    2 +-
 osm/opensm/osm_qos.c                 |    6 +-
 osm/opensm/osm_sa_informinfo.c       |    6 +-
 osm/opensm/osm_sa_lft_record.c       |    2 +-
 osm/opensm/osm_sa_link_record.c      |   18 +++---
 osm/opensm/osm_sa_mcmember_record.c  |    2 +-
 osm/opensm/osm_sa_mft_record.c       |    2 +-
 osm/opensm/osm_sa_multipath_record.c |   10 ++--
 osm/opensm/osm_sa_path_record.c      |   12 ++--
 osm/opensm/osm_sa_pkey_record.c      |    4 +-
 osm/opensm/osm_sa_portinfo_record.c  |    4 +-
 osm/opensm/osm_sa_service_record.c   |    2 +-
 osm/opensm/osm_sa_slvl_record.c      |    4 +-
 osm/opensm/osm_sa_sminfo_record.c    |    2 +-
 osm/opensm/osm_sa_sw_info_record.c   |    2 +-
 osm/opensm/osm_sa_vlarb_record.c     |    4 +-
 osm/opensm/osm_slvl_map_rcv.c        |    4 +-
 osm/opensm/osm_sm_state_mgr.c        |    5 +-
 osm/opensm/osm_state_mgr.c           |   13 ++--
 osm/opensm/osm_switch.c              |    6 +-
 osm/opensm/osm_trap_rcv.c            |    2 +-
 osm/opensm/osm_ucast_lash.c          |    2 +-
 osm/opensm/osm_ucast_mgr.c           |    7 +-
 osm/opensm/osm_ucast_updn.c          |    2 +-
 osm/opensm/osm_vl_arb_rcv.c          |    5 +-
 36 files changed, 88 insertions(+), 205 deletions(-)
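
After patch 2/3, osm_port_get_phys_ptr() ignores its port_num argument, so
callers that walk all ports of a node have to index the node's table, while
callers that want the default physp read p_port->p_physp directly. A rough
sketch of the two access patterns, with hypothetical demo_* names and a
simplified flat table:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct demo_physp {
	uint8_t port_num;
} demo_physp_t;

typedef struct demo_node {
	uint8_t physp_tbl_size;
	demo_physp_t *physp_tbl;	/* array of physp_tbl_size entries */
} demo_node_t;

typedef struct demo_port {
	demo_node_t *p_node;
	demo_physp_t *p_physp;		/* the "default" physical port */
} demo_port_t;

/* Per-index lookup goes through the node, modelled on
   osm_node_get_physp_ptr(). */
static demo_physp_t *demo_node_get_physp_ptr(const demo_node_t *p_node,
					     uint8_t port_num)
{
	assert(p_node && port_num < p_node->physp_tbl_size);
	return &p_node->physp_tbl[port_num];
}

/* Example walker: visits every physp of the port's node and sums the
   port numbers, just to show the iteration pattern. */
static unsigned demo_sum_port_nums(const demo_port_t *p_port)
{
	unsigned sum = 0;
	uint8_t i;

	for (i = 0; i < p_port->p_node->physp_tbl_size; i++)
		sum += demo_node_get_physp_ptr(p_port->p_node, i)->port_num;
	return sum;
}
```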

diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h
index 19a8502..df9065e 100644
--- a/osm/include/opensm/osm_port.h
+++ b/osm/include/opensm/osm_port.h
@@ -1454,107 +1454,6 @@ osm_port_get_guid(
 *	Port
 *********/
 
-/****f* OpenSM: Port/osm_port_get_phys_ptr
-* NAME
-*	osm_port_get_phys_ptr
-*
-* DESCRIPTION
-*	Gets the pointer to the specified Physical Port object.
-*
-* SYNOPSIS
-*/
-static inline osm_physp_t*
-osm_port_get_phys_ptr(
-	IN const osm_port_t* const p_port,
-	IN const uint8_t port_num )
-{
-	return p_port->p_physp;
-}
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to a Port object.
-*
-*	port_num
-*		[in] Number of physical port for which to return the
-*		osm_physp_t object.  If this port is on an HCA, then
-*		this value is ignored.
-*
-* RETURN VALUE
-*	Pointer to the Physical Port object.
-*
-* NOTES
-*
-* SEE ALSO
-*	Port
-*********/
-
-/****f* OpenSM: Port/osm_port_get_default_phys_ptr
-* NAME
-*	osm_port_get_default_phys_ptr
-*
-* DESCRIPTION
-*	Gets the pointer to the default Physical Port object.
-*	This call should only be used for non-switch ports in which there
-*	is a one-for-one mapping of port to physp.
-*
-* SYNOPSIS
-*/
-static inline
-osm_physp_t*
-osm_port_get_default_phys_ptr(
-	IN const osm_port_t* const p_port )
-{
-	CL_ASSERT( osm_physp_is_valid( p_port->p_physp ) );
-	return p_port->p_physp;
-}
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to a Port object.
-*
-* RETURN VALUE
-*	Pointer to the Physical Port object.
-*
-* NOTES
-*
-* SEE ALSO
-*	Port
-*********/
-
-/****f* OpenSM: Port/osm_port_get_parent_node
-* NAME
-*	osm_port_get_parent_node
-*
-* DESCRIPTION
-*	Gets the pointer to the this port's Node object.
-*
-* SYNOPSIS
-*/
-static inline struct _osm_node*
-osm_port_get_parent_node(
-	IN const osm_port_t* const p_port )
-{
-	return( p_port->p_node );
-}
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to a Port object.
-*
-*	port_num
-*		[in] Number of physical port for which to return the
-*		osm_physp_t object.
-*
-* RETURN VALUE
-*	Pointer to the Physical Port object.
-*
-* NOTES
-*
-* SEE ALSO
-*	Port
-*********/
-
 /****f* OpenSM: Port/osm_port_get_lid_range_ho
 * NAME
 *	osm_port_get_lid_range_ho
diff --git a/osm/opensm/osm_drop_mgr.c b/osm/opensm/osm_drop_mgr.c
index d091347..97a95c2 100644
--- a/osm/opensm/osm_drop_mgr.c
+++ b/osm/opensm/osm_drop_mgr.c
@@ -240,7 +240,7 @@ __osm_drop_mgr_remove_port(
   num_physp = osm_node_get_num_physp( p_port->p_node );
   for( port_num = 0; port_num < num_physp; port_num++ )
   {
-    p_physp = osm_port_get_phys_ptr( p_port, (uint8_t)port_num );
+    p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)port_num );
 
     if( p_physp )
     {
diff --git a/osm/opensm/osm_lid_mgr.c b/osm/opensm/osm_lid_mgr.c
index d856fb0..6712c6c 100644
--- a/osm/opensm/osm_lid_mgr.c
+++ b/osm/opensm/osm_lid_mgr.c
@@ -975,10 +975,7 @@ __osm_lid_mgr_set_physp_pi(
     Don't bother doing anything if this Physical Port is not valid.
     This allows simplified code in the caller.
   */
-  if( p_physp == NULL )
-    goto Exit;
-
-  if( !osm_physp_is_valid( p_physp ) )
+  if( p_physp == NULL || !osm_physp_is_valid( p_physp ) )
     goto Exit;
 
   port_num = osm_physp_get_port_num( p_physp );
@@ -1283,7 +1280,6 @@ __osm_lid_mgr_process_our_sm_node(
   osm_port_t     *p_port;
   uint16_t        min_lid_ho;
   uint16_t        max_lid_ho;
-  osm_physp_t    *p_physp;
   boolean_t       res = TRUE;
 
   OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_process_our_sm_node );
@@ -1336,9 +1332,7 @@ __osm_lid_mgr_process_our_sm_node(
     Set the PortInfo the Physical Port associated
     with this Port.
   */
-  p_physp = osm_port_get_default_phys_ptr( p_port );
-
-  __osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_physp, cl_hton16( min_lid_ho ) );
+  __osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_port->p_physp, cl_hton16( min_lid_ho ) );
 
  Exit:
   OSM_LOG_EXIT( p_mgr->p_log );
@@ -1404,7 +1398,6 @@ osm_lid_mgr_process_subnet(
   osm_port_t     *p_port;
   ib_net64_t      port_guid;
   uint16_t        min_lid_ho, max_lid_ho;
-  osm_physp_t    *p_physp;
   int             lid_changed;
 
   CL_ASSERT( p_mgr );
@@ -1460,9 +1453,8 @@ osm_lid_mgr_process_subnet(
                ", LID [0x%X,0x%X]\n", cl_ntoh64( port_guid ),
                min_lid_ho, max_lid_ho );
       
-      p_physp = osm_port_get_default_phys_ptr( p_port );
       /* the proc returns the fact it sent a set port info */
-      if (__osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_physp, cl_hton16( min_lid_ho )))
+      if (__osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_port->p_physp, cl_hton16( min_lid_ho )))
         p_mgr->send_set_reqs = TRUE;
     }
   } /* all ports */
diff --git a/osm/opensm/osm_link_mgr.c b/osm/opensm/osm_link_mgr.c
index 71c0495..a38d179 100644
--- a/osm/opensm/osm_link_mgr.c
+++ b/osm/opensm/osm_link_mgr.c
@@ -434,7 +434,7 @@ __osm_link_mgr_process_port(
       or if the state of the port is already better then the
       specified state.
     */
-    p_physp = osm_port_get_phys_ptr( p_port, (uint8_t)i );
+    p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)i );
     if( p_physp && osm_physp_is_valid( p_physp ) )
     {
       current_state = osm_physp_get_port_state( p_physp );
diff --git a/osm/opensm/osm_mcast_mgr.c b/osm/opensm/osm_mcast_mgr.c
index 0cdcc0e..f5059c9 100644
--- a/osm/opensm/osm_mcast_mgr.c
+++ b/osm/opensm/osm_mcast_mgr.c
@@ -1127,7 +1127,7 @@ osm_mcast_mgr_process_single(
     goto Exit;
   }
 
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
   if( p_physp == NULL )
   {
     osm_log( p_mgr->p_log, OSM_LOG_ERROR,
diff --git a/osm/opensm/osm_node_info_rcv.c b/osm/opensm/osm_node_info_rcv.c
index 364b07c..2c79056 100644
--- a/osm/opensm/osm_node_info_rcv.c
+++ b/osm/opensm/osm_node_info_rcv.c
@@ -791,12 +791,10 @@ __osm_ni_rcv_process_new(
              "Duplicate Port GUID 0x%" PRIx64 "! Found by the two directed routes:\n",
              cl_ntoh64( p_ni->port_guid ) );
     osm_dump_dr_path(p_rcv->p_log,
-                     osm_physp_get_dr_path_ptr(
-                       osm_port_get_default_phys_ptr ( p_port) ),
+                     osm_physp_get_dr_path_ptr(p_port->p_physp),
                      OSM_LOG_ERROR);
     osm_dump_dr_path(p_rcv->p_log,
-                     osm_physp_get_dr_path_ptr(
-                       osm_port_get_default_phys_ptr ( p_port_check) ),
+                     osm_physp_get_dr_path_ptr(p_port_check->p_physp),
                      OSM_LOG_ERROR);
     if ( p_rtr )
       osm_router_delete( &p_rtr );
diff --git a/osm/opensm/osm_pkey.c b/osm/opensm/osm_pkey.c
index be5578a..c0daa38 100644
--- a/osm/opensm/osm_pkey.c
+++ b/osm/opensm/osm_pkey.c
@@ -432,8 +432,8 @@ osm_port_share_pkey(
     goto Exit;
   }
 
-  p_physp1 = osm_port_get_default_phys_ptr(p_port_1);
-  p_physp2 = osm_port_get_default_phys_ptr(p_port_2);
+  p_physp1 = p_port_1->p_physp;
+  p_physp2 = p_port_2->p_physp;
 
   if (!p_physp1 || !p_physp2)
   {
@@ -478,7 +478,7 @@ osm_lid_share_pkey(
   }
   else
   {
-    p_physp1 = osm_port_get_default_phys_ptr(p_port1);
+    p_physp1 = p_port1->p_physp;
   }
 
   if  (osm_node_get_type( p_node2 ) == IB_NODE_TYPE_SWITCH)
@@ -487,7 +487,7 @@ osm_lid_share_pkey(
   }
   else
   {
-    p_physp2 = osm_port_get_default_phys_ptr(p_port2);
+    p_physp2 = p_port2->p_physp;
   }
 
   return(osm_physp_share_pkey(p_log, p_physp1, p_physp2));
diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c
index bbbe192..33ac8b5 100644
--- a/osm/opensm/osm_pkey_mgr.c
+++ b/osm/opensm/osm_pkey_mgr.c
@@ -310,7 +310,7 @@ static boolean_t pkey_mgr_update_port(
 
   memset(&empty_block, 0, sizeof(ib_pkey_table_t));
 
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
   if ( !osm_physp_is_valid( p_physp ) )
     return FALSE;
 
@@ -449,7 +449,7 @@ pkey_mgr_update_peer_port(
 
   memset(&empty_block, 0, sizeof(ib_pkey_table_t));
 
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
   if (!osm_physp_is_valid( p_physp ))
     return FALSE;
   peer = osm_physp_get_remote( p_physp );
@@ -532,7 +532,6 @@ osm_pkey_mgr_process(
   osm_prtn_t *p_prtn;
   osm_port_t *p_port;
   osm_signal_t signal = OSM_SIGNAL_DONE;
-  osm_node_t *p_node;
 
   CL_ASSERT( p_osm );
 
@@ -570,8 +569,7 @@ osm_pkey_mgr_process(
     p_next = cl_qmap_next( p_next );
     if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) )
       signal = OSM_SIGNAL_DONE_PENDING;
-    p_node = osm_port_get_parent_node( p_port );
-    if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) &&
+    if ( ( osm_node_get_type( p_port->p_node ) != IB_NODE_TYPE_SWITCH ) &&
 	 pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, 
 				    &p_osm->subn, p_port,
 				    !p_osm->subn.opt.no_partition_enforcement ) )
diff --git a/osm/opensm/osm_pkey_rcv.c b/osm/opensm/osm_pkey_rcv.c
index 0e0ec46..7c58d98 100644
--- a/osm/opensm/osm_pkey_rcv.c
+++ b/osm/opensm/osm_pkey_rcv.c
@@ -159,7 +159,7 @@ osm_pkey_rcv_process(
     goto Exit;
   }
 
-  p_node = osm_port_get_parent_node( p_port );
+  p_node = p_port->p_node;
   CL_ASSERT( p_node );
 
   block_num = (uint16_t)((cl_ntoh32(p_smp->attr_mod)) & 0x0000FFFF);
@@ -171,7 +171,7 @@ osm_pkey_rcv_process(
   }
   else
   {
-    p_physp = osm_port_get_default_phys_ptr(p_port);
+    p_physp = p_port->p_physp;
     port_num = p_physp->port_num;
   }
 
diff --git a/osm/opensm/osm_port.c b/osm/opensm/osm_port.c
index b0949a0..d6ea9a1 100644
--- a/osm/opensm/osm_port.c
+++ b/osm/opensm/osm_port.c
@@ -310,7 +310,7 @@ osm_port_add_new_physp(
     The LID value in the PortInfo for example, is only valid
     for port 0 on switches.
   */
-  if( !osm_physp_is_valid( osm_port_get_default_phys_ptr( p_port ) ) ||
+  if( !osm_physp_is_valid( p_port->p_physp ) ||
       port_num < p_port->p_physp->port_num )
     p_port->p_physp = p_physp;
 }
@@ -573,7 +573,7 @@ __osm_physp_get_dr_physp_set(
   }
 
   /* get the node of the SM */
-  p_node = osm_port_get_parent_node(p_port);
+  p_node = p_port->p_node;
   
   /* 
      traverse the path adding the nodes to the table 
@@ -740,7 +740,7 @@ osm_physp_replace_dr_path_with_alternate_dr_path(
      port we'll get the port connected to the rest of the subnet. If SM is
      running on SWITCH - we should try to get a dr path from all switch ports.
   */
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
 
   CL_ASSERT( p_physp );
   CL_ASSERT( osm_physp_is_valid( p_physp ) );
diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c
index 9bd75b5..0076b00 100644
--- a/osm/opensm/osm_port_info_rcv.c
+++ b/osm/opensm/osm_port_info_rcv.c
@@ -555,13 +555,13 @@ osm_pi_rcv_process_set(
 
   p_context = osm_madw_get_pi_context_ptr( p_madw );
 
-  p_physp = osm_port_get_phys_ptr( p_port, port_num );
-  CL_ASSERT( p_physp );
-  CL_ASSERT( osm_physp_is_valid( p_physp ) );
+  p_node = p_port->p_node;
+  CL_ASSERT( p_node );
+
+  p_physp = osm_node_get_physp_ptr( p_node, port_num );
+  CL_ASSERT( p_physp && osm_physp_is_valid( p_physp ) );
 
   port_guid = osm_physp_get_port_guid( p_physp );
-  p_node = osm_port_get_parent_node( p_port );
-  CL_ASSERT( p_node );
 
   p_smp = osm_madw_get_smp_ptr( p_madw );
   p_pi = (ib_port_info_t*)ib_smp_get_payload_ptr( p_smp );
@@ -743,7 +743,7 @@ osm_pi_rcv_process(
                cl_ntoh64( p_smp->trans_id ) );
     }
 
-    p_node = osm_port_get_parent_node( p_port );
+    p_node = p_port->p_node;
     p_physp = osm_node_get_physp_ptr( p_node, port_num );
 
     CL_ASSERT( p_node );
diff --git a/osm/opensm/osm_prtn.c b/osm/opensm/osm_prtn.c
index 4099cee..027a5a4 100644
--- a/osm/opensm/osm_prtn.c
+++ b/osm/opensm/osm_prtn.c
@@ -119,7 +119,7 @@ ib_api_status_t osm_prtn_add_port(osm_log_t *p_log, osm_subn_t *p_subn,
 		return status;
 	}
 
-	p_physp = osm_port_get_default_phys_ptr(p_port);
+	p_physp = p_port->p_physp;
 	if (!p_physp) {
 		osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: "
 			"no physical for port 0x%" PRIx64 "\n",
diff --git a/osm/opensm/osm_qos.c b/osm/opensm/osm_qos.c
index 11beaae..f426241 100644
--- a/osm/opensm/osm_qos.c
+++ b/osm/opensm/osm_qos.c
@@ -195,7 +195,7 @@ static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port,
 	if (osm_node_get_type(osm_physp_get_node_ptr(p)) == IB_NODE_TYPE_SWITCH) {
 		if (ib_port_info_get_vl_cap(&p->port_info) == 1) {
 			/* Check port 0's capability mask */
-			p_physp = osm_port_get_default_phys_ptr(p_port);
+			p_physp = p_port->p_physp;
 			if (!(p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_SL_MAP))
 				return IB_SUCCESS;
 		}
@@ -336,7 +336,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm)
 		if (p_node->sw) {
 			num_physp = osm_node_get_num_physp(p_node);
 			for (i = 1; i < num_physp; i++) {
-				p_physp = osm_port_get_phys_ptr(p_port, i);
+				p_physp = osm_node_get_physp_ptr(p_node, i);
 				if (!p_physp || !osm_physp_is_valid(p_physp))
 					continue;
 				status =
@@ -353,7 +353,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm)
 		else
 			cfg = &ca_config;
 
-		p_physp = osm_port_get_default_phys_ptr(p_port);
+		p_physp = p_port->p_physp;
 		if (!osm_physp_is_valid(p_physp))
 			continue;
 
diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c
index 340a7f1..6109c5d 100644
--- a/osm/opensm/osm_sa_informinfo.c
+++ b/osm/opensm/osm_sa_informinfo.c
@@ -194,7 +194,7 @@ __validate_ports_access_rights(
     }
 
     /* get the destination InformInfo physical port */
-    p_physp = osm_port_get_default_phys_ptr(p_port);
+    p_physp = p_port->p_physp;
 
     /* make sure that the requester and destination port can access each other 
        according to the current partitioning. */
@@ -244,7 +244,7 @@ __validate_ports_access_rights(
       if ( p_port == NULL )
         continue;
 
-      p_physp = osm_port_get_default_phys_ptr(p_port);
+      p_physp = p_port->p_physp;
       /* make sure that the requester and destination port can access 
          each other according to the current partitioning. */
       if (! osm_physp_share_pkey( p_rcv->p_log, p_physp, p_requester_physp))
@@ -405,7 +405,7 @@ __osm_sa_inform_info_rec_by_comp_mask(
   }
 
   /* get the subscriber InformInfo physical port */
-  p_subscriber_physp = osm_port_get_default_phys_ptr(p_subscriber_port);
+  p_subscriber_physp = p_subscriber_port->p_physp;
   /* make sure that the requester and subscriber port can access each other 
      according to the current partitioning. */
   if (! osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_subscriber_physp ))
diff --git a/osm/opensm/osm_sa_lft_record.c b/osm/opensm/osm_sa_lft_record.c
index b6333e7..c5cd9ca 100644
--- a/osm/opensm/osm_sa_lft_record.c
+++ b/osm/opensm/osm_sa_lft_record.c
@@ -244,7 +244,7 @@ __osm_lftr_rcv_by_comp_mask(
 
   /* check that the requester physp and the current physp are under
      the same partition. */
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
   if (! p_physp)
   {
     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c
index 17df424..5e4e35e 100644
--- a/osm/opensm/osm_sa_link_record.c
+++ b/osm/opensm/osm_sa_link_record.c
@@ -350,12 +350,12 @@ __osm_lr_rcv_get_port_links(
       dest_num_ports = osm_node_get_num_physp( p_dest_port->p_node );
       for( port_num = 1; port_num < num_ports; port_num++ )
       {
-        p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num );
+        p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num );
         for( dest_port_num = 1; dest_port_num < dest_num_ports;
              dest_port_num++ )
         {
-          p_dest_physp = osm_port_get_phys_ptr( p_dest_port,
-                                                dest_port_num );
+          p_dest_physp = osm_node_get_physp_ptr( p_dest_port->p_node,
+                                                 dest_port_num );
           /* both physical ports should be with data */
           if (p_src_physp && p_dest_physp)
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp,
@@ -376,7 +376,7 @@ __osm_lr_rcv_get_port_links(
            this couldn't be a relevant record. */
         if (port_num < p_src_port->p_node->physp_tbl_size)
         {          
-          p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num );
+          p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num );
           if (p_src_physp)
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp,
                                          NULL, comp_mask, p_list,
@@ -388,7 +388,7 @@ __osm_lr_rcv_get_port_links(
         num_ports = osm_node_get_num_physp( p_src_port->p_node );
         for( port_num = 1; port_num < num_ports; port_num++ )
         {
-          p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num );
+          p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num );
           if (p_src_physp)
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp,
                                          NULL, comp_mask, p_list,
@@ -411,8 +411,8 @@ __osm_lr_rcv_get_port_links(
            this couldn't be a relevant record. */
         if (port_num < p_dest_port->p_node->physp_tbl_size )
         {
-          p_dest_physp = osm_port_get_phys_ptr(
-            p_dest_port, port_num );
+          p_dest_physp = osm_node_get_physp_ptr(
+            p_dest_port->p_node, port_num );
           if (p_dest_physp)
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL,
                                          p_dest_physp, comp_mask,
@@ -424,8 +424,8 @@ __osm_lr_rcv_get_port_links(
         num_ports = osm_node_get_num_physp( p_dest_port->p_node );
         for( port_num = 1; port_num < num_ports; port_num++ )
         {
-          p_dest_physp = osm_port_get_phys_ptr(
-            p_dest_port, port_num );
+          p_dest_physp = osm_node_get_physp_ptr(
+            p_dest_port->p_node, port_num );
           if (p_dest_physp)
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL,
                                          p_dest_physp, comp_mask,
diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c
index 50c4f22..8241129 100644
--- a/osm/opensm/osm_sa_mcmember_record.c
+++ b/osm/opensm/osm_sa_mcmember_record.c
@@ -1570,7 +1570,7 @@ __osm_mcmr_rcv_join_mgrp(
     goto Exit;
   }
 
-  p_physp = osm_port_get_default_phys_ptr(p_port);
+  p_physp = p_port->p_physp;
   /* Check that the p_physp and the requester physp are in the same
      partition. */
   p_request_physp =
diff --git a/osm/opensm/osm_sa_mft_record.c b/osm/opensm/osm_sa_mft_record.c
index 005c9bd..7908583 100644
--- a/osm/opensm/osm_sa_mft_record.c
+++ b/osm/opensm/osm_sa_mft_record.c
@@ -250,7 +250,7 @@ __osm_mftr_rcv_by_comp_mask(
 
   /* check that the requester physp and the current physp are under
      the same partition. */
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
   if (! p_physp)
   {
     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
diff --git a/osm/opensm/osm_sa_multipath_record.c b/osm/opensm/osm_sa_multipath_record.c
index 0c5643e..06640d9 100644
--- a/osm/opensm/osm_sa_multipath_record.c
+++ b/osm/opensm/osm_sa_multipath_record.c
@@ -154,7 +154,7 @@ __osm_sa_multipath_rec_is_tavor_port(
   osm_node_t const* p_node;
   ib_net32_t vend_id;
 
-  p_node = osm_port_get_parent_node( p_port );
+  p_node = p_port->p_node;
   vend_id = ib_node_info_get_vendor_id( &p_node->node_info );
 
   return( (p_node->node_info.device_id == CL_HTON16(23108)) &&
@@ -255,8 +255,8 @@ __osm_mpr_rcv_get_path_parms(
 
   dest_lid = cl_hton16( dest_lid_ho );
 
-  p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port );
-  p_physp = osm_port_get_default_phys_ptr( p_src_port );
+  p_dest_physp = p_dest_port->p_physp;
+  p_physp = p_src_port->p_physp;
   p_pi = &p_physp->port_info;
 
   mtu = ib_port_info_get_mtu_cap( p_pi );
@@ -744,8 +744,8 @@ __osm_mpr_rcv_build_pr(
 
   OSM_LOG_ENTER( p_rcv->p_log, __osm_mpr_rcv_build_pr );
 
-  p_src_physp = osm_port_get_default_phys_ptr( p_src_port );
-  p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port );
+  p_src_physp = p_src_port->p_physp;
+  p_dest_physp = p_dest_port->p_physp;
 
   p_pr->dgid.unicast.prefix = osm_physp_get_subnet_prefix( p_dest_physp );
   p_pr->dgid.unicast.interface_id = osm_physp_get_port_guid( p_dest_physp );
diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c
index 1b0f89f..47d9c33 100644
--- a/osm/opensm/osm_sa_path_record.c
+++ b/osm/opensm/osm_sa_path_record.c
@@ -171,7 +171,7 @@ __osm_sa_path_rec_is_tavor_port(
   osm_node_t const* p_node;
   ib_net32_t vend_id;
 
-  p_node = osm_port_get_parent_node( p_port );
+  p_node = p_port->p_node;
   vend_id = ib_node_info_get_vendor_id( &p_node->node_info );
 	
   return( (p_node->node_info.device_id == CL_HTON16(23108)) &&
@@ -268,8 +268,8 @@ __osm_pr_rcv_get_path_parms(
 
   dest_lid = cl_hton16( dest_lid_ho );
 
-  p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port );
-  p_physp = osm_port_get_default_phys_ptr( p_src_port );
+  p_dest_physp = p_dest_port->p_physp;
+  p_physp = p_src_port->p_physp;
   p_pi = &p_physp->port_info;
 
   mtu = ib_port_info_get_mtu_cap( p_pi );
@@ -753,9 +753,9 @@ __osm_pr_rcv_build_pr(
 
   OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_build_pr );
 
-  p_src_physp = osm_port_get_default_phys_ptr( p_src_port );
+  p_src_physp = p_src_port->p_physp;
 #ifndef ROUTER_EXP
-  p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port );
+  p_dest_physp = p_dest_port->p_physp;
 
   p_pr->dgid.unicast.prefix = osm_physp_get_subnet_prefix( p_dest_physp );
   p_pr->dgid.unicast.interface_id = osm_physp_get_port_guid( p_dest_physp );
@@ -770,7 +770,7 @@ __osm_pr_rcv_build_pr(
     p_pr->dgid = *p_dgid;
   else
   {
-    p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port );
+    p_dest_physp = p_dest_port->p_physp;
 
     p_pr->dgid.unicast.prefix = osm_physp_get_subnet_prefix( p_dest_physp );
     p_pr->dgid.unicast.interface_id = osm_physp_get_port_guid( p_dest_physp );
diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c
index 8186603..8a71314 100644
--- a/osm/opensm/osm_sa_pkey_record.c
+++ b/osm/opensm/osm_sa_pkey_record.c
@@ -251,7 +251,7 @@ __osm_sa_pkey_by_comp_mask(
   {
     if (port_num < osm_node_get_num_physp( p_port->p_node ))
     {
-      p_physp = osm_port_get_phys_ptr( p_port, port_num );
+      p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
       /* Check that the p_physp is valid, and that is shares a pkey
          with the p_req_physp. */
       if( p_physp && osm_physp_is_valid( p_physp ) &&
@@ -272,7 +272,7 @@ __osm_sa_pkey_by_comp_mask(
     num_ports = osm_node_get_num_physp( p_port->p_node );
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
-      p_physp = osm_port_get_phys_ptr( p_port, port_num );
+      p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
       if( p_physp == NULL )
         continue;
 
diff --git a/osm/opensm/osm_sa_portinfo_record.c b/osm/opensm/osm_sa_portinfo_record.c
index 9d4f18e..74f53d6 100644
--- a/osm/opensm/osm_sa_portinfo_record.c
+++ b/osm/opensm/osm_sa_portinfo_record.c
@@ -544,7 +544,7 @@ __osm_sa_pir_by_comp_mask(
   {
     if (p_rcvd_rec->port_num < num_ports)
     {
-      p_physp = osm_port_get_phys_ptr( p_port, p_rcvd_rec->port_num );
+      p_physp = osm_node_get_physp_ptr( p_port->p_node, p_rcvd_rec->port_num );
       /* Check that the p_physp is valid, and that the p_physp and the
          p_req_physp share a pkey. */
       if( p_physp && osm_physp_is_valid( p_physp ) &&
@@ -556,7 +556,7 @@ __osm_sa_pir_by_comp_mask(
   {
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
-      p_physp = osm_port_get_phys_ptr( p_port, port_num );
+      p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
       if( p_physp == NULL )
         continue;
 
diff --git a/osm/opensm/osm_sa_service_record.c b/osm/opensm/osm_sa_service_record.c
index b23a12d..eff0b0a 100644
--- a/osm/opensm/osm_sa_service_record.c
+++ b/osm/opensm/osm_sa_service_record.c
@@ -213,7 +213,7 @@ __match_service_pkey_with_ports_pkey(
       /* check on the table of the default physical port of the service port */
       if ( !osm_physp_has_pkey( p_rcv->p_log,
                                 p_service_rec->service_pkey,
-                                osm_port_get_default_phys_ptr(service_port) ) )
+                                service_port->p_physp ) )
       {
         valid = FALSE;
         goto Exit;
diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c
index 9fbb5c7..e40ad61 100644
--- a/osm/opensm/osm_sa_slvl_record.c
+++ b/osm/opensm/osm_sa_slvl_record.c
@@ -243,7 +243,7 @@ __osm_sa_slvl_by_comp_mask(
     }
 
     for( out_port_num = out_port_start; out_port_num <= out_port_end; out_port_num++ ) {
-      p_out_physp = osm_port_get_phys_ptr( p_port, out_port_num );
+      p_out_physp = osm_node_get_physp_ptr( p_port->p_node, out_port_num );
       if( p_out_physp == NULL )
         continue;
 
@@ -256,7 +256,7 @@ __osm_sa_slvl_by_comp_mask(
           continue;
 #endif
 
-        p_in_physp = osm_port_get_phys_ptr( p_port, in_port_num );
+        p_in_physp = osm_node_get_physp_ptr( p_port->p_node, in_port_num );
         if( p_in_physp == NULL )
           continue;
 
diff --git a/osm/opensm/osm_sa_sminfo_record.c b/osm/opensm/osm_sa_sminfo_record.c
index 5e15f52..8c343b4 100644
--- a/osm/opensm/osm_sa_sminfo_record.c
+++ b/osm/opensm/osm_sa_sminfo_record.c
@@ -374,7 +374,7 @@ osm_smir_rcv_process(
     {
       if (FALSE ==
           osm_physp_share_pkey( p_rcv->p_log, p_req_physp,
-                                osm_port_get_default_phys_ptr( local_port ) ) )
+                                local_port->p_physp ) )
       {
         cl_plock_release( p_rcv->p_lock );
         osm_log( p_rcv->p_log, OSM_LOG_ERROR,
diff --git a/osm/opensm/osm_sa_sw_info_record.c b/osm/opensm/osm_sa_sw_info_record.c
index da65864..94b1ff9 100644
--- a/osm/opensm/osm_sa_sw_info_record.c
+++ b/osm/opensm/osm_sa_sw_info_record.c
@@ -245,7 +245,7 @@ __osm_sir_rcv_create_sir(
 
   /* check that the requester physp and the current physp are under
      the same partition. */
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
   if (! p_physp)
   {
     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c
index 97fe060..a462ee9 100644
--- a/osm/opensm/osm_sa_vlarb_record.c
+++ b/osm/opensm/osm_sa_vlarb_record.c
@@ -255,7 +255,7 @@ __osm_sa_vl_arb_by_comp_mask(
   {
     if (port_num < osm_node_get_num_physp( p_port->p_node ))
     {
-      p_physp = osm_port_get_phys_ptr( p_port, port_num );
+      p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
       /* check that the p_physp is valid, and that the requester
          and the p_physp share a pkey. */
       if( p_physp && osm_physp_is_valid( p_physp ) &&
@@ -276,7 +276,7 @@ __osm_sa_vl_arb_by_comp_mask(
     num_ports = osm_node_get_num_physp( p_port->p_node );
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
-      p_physp = osm_port_get_phys_ptr( p_port, port_num );
+      p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
       if( p_physp == NULL )
         continue;
 
diff --git a/osm/opensm/osm_slvl_map_rcv.c b/osm/opensm/osm_slvl_map_rcv.c
index b109f75..3352627 100644
--- a/osm/opensm/osm_slvl_map_rcv.c
+++ b/osm/opensm/osm_slvl_map_rcv.c
@@ -170,7 +170,7 @@ osm_slvl_rcv_process(
     goto Exit;
   }
 
-  p_node = osm_port_get_parent_node( p_port );
+  p_node = p_port->p_node;
   CL_ASSERT( p_node );
 
   /* in case of a non switch node the attr modifier should be ignored */
@@ -182,7 +182,7 @@ osm_slvl_rcv_process(
   }
   else
   {
-    p_physp = osm_port_get_default_phys_ptr(p_port);
+    p_physp = p_port->p_physp;
     out_port_num = p_physp->port_num;
     in_port_num  = 0;
   }
diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c
index 0034320..51df1df 100644
--- a/osm/opensm/osm_sm_state_mgr.c
+++ b/osm/opensm/osm_sm_state_mgr.c
@@ -192,7 +192,7 @@ __osm_sm_state_mgr_send_local_port_info_req(
 
    status = osm_req_get( p_sm_mgr->p_req,
                          osm_physp_get_dr_path_ptr
-                         ( osm_port_get_default_phys_ptr( p_port ) ),
+                         ( p_port->p_physp ),
                          IB_MAD_ATTR_PORT_INFO,
                          cl_hton32( p_port->p_physp->port_num ),
                          CL_DISP_MSGID_NONE, &context );
@@ -261,8 +261,7 @@ __osm_sm_state_mgr_send_master_sm_info_req(
    context.smi_context.set_method = FALSE;
 
    status = osm_req_get( p_sm_mgr->p_req,
-                         osm_physp_get_dr_path_ptr
-                         ( osm_port_get_default_phys_ptr( p_port ) ),
+                         osm_physp_get_dr_path_ptr(p_port->p_physp),
                          IB_MAD_ATTR_SM_INFO, 0, CL_DISP_MSGID_NONE,
                          &context );
 
diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c
index 6f53e60..6681cfc 100644
--- a/osm/opensm/osm_state_mgr.c
+++ b/osm/opensm/osm_state_mgr.c
@@ -849,7 +849,7 @@ __osm_state_mgr_is_sm_port_down(
       goto Exit;
    }
 
-   p_physp = osm_port_get_default_phys_ptr( p_port );
+   p_physp = p_port->p_physp;
 
    CL_ASSERT( p_physp );
    CL_ASSERT( osm_physp_is_valid( p_physp ) );
@@ -914,7 +914,7 @@ __osm_state_mgr_sweep_hop_1(
       goto Exit;
    }
 
-   p_node = osm_port_get_parent_node( p_port );
+   p_node = p_port->p_node;
    CL_ASSERT( p_node );
 
    port_num = ib_node_info_get_local_port_num( &p_node->node_info );
@@ -1277,7 +1277,7 @@ __osm_state_mgr_report(
                   cl_ntoh64( osm_port_get_guid( p_port ) ) );
       }
 
-      p_node = osm_port_get_parent_node( p_port );
+      p_node = p_port->p_node;
       node_type = osm_node_get_type( p_node );
       if( node_type == IB_NODE_TYPE_SWITCH )
          start_port = 0;
@@ -1287,7 +1287,7 @@ __osm_state_mgr_report(
       num_ports = osm_node_get_num_physp( p_node );
       for( port_num = start_port; port_num < num_ports; port_num++ )
       {
-         p_physp = osm_port_get_phys_ptr( p_port, port_num );
+         p_physp = osm_node_get_physp_ptr( p_node, port_num );
          if( ( p_physp == NULL ) || ( !osm_physp_is_valid( p_physp ) ) )
             continue;
 
@@ -1622,9 +1622,8 @@ __osm_state_mgr_send_handover(
    }
 
    status = osm_req_set( p_mgr->p_req,
-                         osm_physp_get_dr_path_ptr
-                         ( osm_port_get_default_phys_ptr( p_port ) ), payload,
-                         sizeof(payload),
+                         osm_physp_get_dr_path_ptr(p_port->p_physp),
+                         payload, sizeof(payload),
                          IB_MAD_ATTR_SM_INFO, IB_SMINFO_ATTR_MOD_HANDOVER,
                          CL_DISP_MSGID_NONE, &context );
 
diff --git a/osm/opensm/osm_switch.c b/osm/opensm/osm_switch.c
index 9273459..a79f5cd 100644
--- a/osm/opensm/osm_switch.c
+++ b/osm/opensm/osm_switch.c
@@ -291,7 +291,7 @@ osm_switch_recommend_path(
   }
   else
   {
-    p_physp = osm_port_get_default_phys_ptr(p_port);
+    p_physp = p_port->p_physp;
     if (!p_physp || !p_physp->p_remote_physp ||
         !p_physp->p_remote_physp->p_node->sw)
       return OSM_NO_PATH;
@@ -566,7 +566,7 @@ osm_switch_get_port_least_hops(
   }
   else
   {
-    osm_physp_t *p = osm_port_get_default_phys_ptr(p_port);
+    osm_physp_t *p = p_port->p_physp;
     uint8_t hops;
 
     if (!p || !p->p_remote_physp || !p->p_remote_physp->p_node->sw)
@@ -604,7 +604,7 @@ osm_switch_recommend_mcast_path(
   }
   else
   {
-    osm_physp_t *p_physp = osm_port_get_default_phys_ptr(p_port);
+    osm_physp_t *p_physp = p_port->p_physp;
     if (!p_physp || !p_physp->p_remote_physp ||
         !p_physp->p_remote_physp->p_node->sw)
       return OSM_NO_PATH;
diff --git a/osm/opensm/osm_trap_rcv.c b/osm/opensm/osm_trap_rcv.c
index ed507b6..0ec9a1f 100644
--- a/osm/opensm/osm_trap_rcv.c
+++ b/osm/opensm/osm_trap_rcv.c
@@ -111,7 +111,7 @@ __get_physp_by_lid_and_num(
   if (osm_node_get_num_physp(p_port->p_node) < num)
     return NULL;
 
-  return( osm_port_get_phys_ptr(p_port, num) );
+  return( osm_node_get_physp_ptr(p_port->p_node, num) );
 }
 
 /**********************************************************************
diff --git a/osm/opensm/osm_ucast_lash.c b/osm/opensm/osm_ucast_lash.c
index 4459d9f..5d32e89 100644
--- a/osm/opensm/osm_ucast_lash.c
+++ b/osm/opensm/osm_ucast_lash.c
@@ -170,7 +170,7 @@ static uint64_t osm_lash_get_switch_guid(IN const osm_switch_t *p_sw)
 
 static osm_switch_t *get_osm_switch_from_port(osm_port_t *port)
 {
-	osm_physp_t *p = osm_port_get_default_phys_ptr(port);
+	osm_physp_t *p = port->p_physp;
 	if (p->p_node->sw)
 		return p->p_node->sw;
 	else if (p->p_remote_physp->p_node->sw)
diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
index 7d3916b..2860e66 100644
--- a/osm/opensm/osm_ucast_mgr.c
+++ b/osm/opensm/osm_ucast_mgr.c
@@ -306,7 +306,7 @@ __osm_ucast_mgr_dump_ucast_routes(
     }
     else
     {
-      osm_physp_t *p_physp = osm_port_get_default_phys_ptr(p_port);
+      osm_physp_t *p_physp = p_port->p_physp;
       if( !p_physp || !p_physp->p_remote_physp ||
           !p_physp->p_remote_physp->p_node->sw )
         num_hops = OSM_NO_PATH;
@@ -413,7 +413,7 @@ ucast_mgr_dump_lfts(cl_map_item_t *p_map_item, void *cxt)
 
 		p_port = cl_ptr_vector_get(&p_mgr->p_subn->port_lid_tbl, lid);
 		if (p_port) {
-			p_node = osm_port_get_parent_node(p_port);
+			p_node = p_port->p_node;
 			fprintf(file, "%s portguid 0x016%" PRIx64 ": \'%s\'",
 				ib_get_node_type_str(osm_node_get_type(p_node)),
 				cl_ntoh64(osm_port_get_guid(p_port)),
@@ -671,8 +671,7 @@ __osm_ucast_mgr_process_port(
       if (!p_mgr->p_subn->opt.port_profile_switch_nodes)
       {
 	is_ignored_by_port_prof |=
-	  (osm_node_get_type(osm_port_get_parent_node(p_port)) ==
-	   IB_NODE_TYPE_SWITCH);
+	  (osm_node_get_type(p_port->p_node) == IB_NODE_TYPE_SWITCH);
       }
     }
 
diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c
index b15fe5e..d9446e9 100644
--- a/osm/opensm/osm_ucast_updn.c
+++ b/osm/opensm/osm_ucast_updn.c
@@ -792,7 +792,7 @@ __osm_updn_find_root_nodes_by_min_hop(
     p_next_port = (osm_port_t*)cl_qmap_next( &p_next_port->map_item );
     if ( osm_node_get_type(p_port->p_node) != IB_NODE_TYPE_SWITCH )
     {
-      p_physp = osm_port_get_default_phys_ptr(p_port);
+      p_physp = p_port->p_physp;
       self_lid_ho = cl_ntoh16( osm_physp_get_base_lid(p_physp) );
       numCas++;
       /* EZ:
diff --git a/osm/opensm/osm_vl_arb_rcv.c b/osm/opensm/osm_vl_arb_rcv.c
index ed8dfc5..f36751e 100644
--- a/osm/opensm/osm_vl_arb_rcv.c
+++ b/osm/opensm/osm_vl_arb_rcv.c
@@ -171,7 +171,7 @@ osm_vla_rcv_process(
     goto Exit;
   }
 
-  p_node = osm_port_get_parent_node( p_port );
+  p_node = p_port->p_node;
   CL_ASSERT( p_node );
 
   block_num = (uint8_t)(cl_ntoh32(p_smp->attr_mod) >> 16);
@@ -183,7 +183,7 @@ osm_vla_rcv_process(
   }
   else
   {
-    p_physp = osm_port_get_default_phys_ptr(p_port);
+    p_physp = p_port->p_physp;
     port_num = p_physp->port_num;
   }
 
@@ -239,4 +239,3 @@ osm_vla_rcv_process(
 
   OSM_LOG_EXIT( p_rcv->p_log );
 }
-
-- 
1.5.2.rc2.20.gac2a
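
For readers skimming the patch above, the pattern it applies can be
sketched in a few lines: the default physical port is read directly
from the port object (p_port->p_physp), and a numbered physical port is
looked up through the parent node rather than through the port. The
types and the demo_* names below are simplified, hypothetical stand-ins
and are not the real OpenSM definitions; this is only a sketch of the
access pattern, under the assumption that the node owns the table of
physical ports.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the real OpenSM structures. */
#define DEMO_MAX_PORTS 4

typedef struct demo_physp {
  uint8_t port_num;                 /* index of this physical port */
} demo_physp_t;

typedef struct demo_node {
  uint8_t      physp_tbl_size;      /* number of entries in physp_table */
  demo_physp_t physp_table[DEMO_MAX_PORTS];
} demo_node_t;

typedef struct demo_port {
  demo_node_t  *p_node;             /* parent node */
  demo_physp_t *p_physp;            /* default physical port */
} demo_port_t;

/* Look up a numbered physical port through the node (assumed owner). */
demo_physp_t *demo_node_get_physp_ptr(demo_node_t *p_node, uint8_t port_num)
{
  assert(port_num < p_node->physp_tbl_size);
  return &p_node->physp_table[port_num];
}

/* Fill the node's table so that entry i describes port i. */
void demo_node_init(demo_node_t *p_node, uint8_t num_ports)
{
  uint8_t i;

  assert(num_ports <= DEMO_MAX_PORTS);
  p_node->physp_tbl_size = num_ports;
  for (i = 0; i < num_ports; i++)
    p_node->physp_table[i].port_num = i;
}

/* Bind a port to its node and pick its default physical port. */
void demo_port_init(demo_port_t *p_port, demo_node_t *p_node,
                    uint8_t default_port_num)
{
  p_port->p_node = p_node;
  p_port->p_physp = demo_node_get_physp_ptr(p_node, default_port_num);
}
```

With these stand-ins, a caller that previously went through a port-level
accessor would write demo_node_get_physp_ptr(p_port->p_node, num) for a
numbered port and p_port->p_physp for the default one.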


From swise at opengridcomputing.com  Wed May  9 13:24:19 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 09 May 2007 15:24:19 -0500
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <46422D07.3050600@Sun.COM>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>
	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop>  <46422D07.3050600@Sun.COM>
Message-ID: <1178742259.382.112.camel@stevo-desktop>

On Wed, 2007-05-09 at 16:20 -0400, Donald Kerr wrote:
> I'm missing some context here. Where are you plugging iwarp and OMPI
> together?

ofed-1.2 supports iwarp and the chelsio rnic.  It can be accessed
directly via the ofa verbs and ofa rdma-cm _as well as_ via udapl.  

I'm attempting to run OMPI over udapl over chelsio's rnic.

Steve.


From Don.Kerr at Sun.COM  Wed May  9 13:27:18 2007
From: Don.Kerr at Sun.COM (Donald Kerr)
Date: Wed, 09 May 2007 16:27:18 -0400
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs	opened
In-Reply-To: <1178742259.382.112.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>
	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM>
	<1178742259.382.112.camel@stevo-desktop>
Message-ID: <46422EA6.3020006@Sun.COM>

So then I agree with Andrew, I think you are trying to impose 
restrictions on uDAPL which are not part of the Spec.

-DON

Steve Wise wrote:

>On Wed, 2007-05-09 at 16:20 -0400, Donald Kerr wrote:
>
>>I'm missing some context here. Where are you plugging iwarp and OMPI
>>together?
>
>ofed-1.2 supports iwarp and the chelsio rnic.  It can be accessed
>directly via the ofa verbs and ofa rdma-cm _as well as_ via udapl.
>
>I'm attempting to run OMPI over udapl over chelsio's rnic.
>
>Steve.


From swise at opengridcomputing.com  Wed May  9 13:33:39 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 09 May 2007 15:33:39 -0500
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs	opened
In-Reply-To: <46422EA6.3020006@Sun.COM>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>
	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM>
	<1178742259.382.112.camel@stevo-desktop>  <46422EA6.3020006@Sun.COM>
Message-ID: <1178742819.382.114.camel@stevo-desktop>

On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote:
> So then I agree with Andrew, I think you are trying to impose 
> restrictions on uDAPL which are not part of the Spec.
> 

true, but if you want a single btl for IB and IW, then you'll need to
address this issue in some way...


From Don.Kerr at Sun.COM  Wed May  9 13:45:16 2007
From: Don.Kerr at Sun.COM (Donald Kerr)
Date: Wed, 09 May 2007 16:45:16 -0400
Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl	-	bugs	opened
In-Reply-To: <1178742819.382.114.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>
	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM>
	<1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM>
	<1178742819.382.114.camel@stevo-desktop>
Message-ID: <464232DC.9010201@Sun.COM>

I guess I have not read enough about iwarp yet, but if iwarp is sitting
below ib verbs or udapl in the stack and is trying to impose
restrictions which ib verbs or udapl do not adhere to, then maybe iwarp
is in the wrong place in the ofed stack.

Having said that, I do agree the OMPI community needs to consider where
iwarp plays in its own stack, if it has not already.

Steve Wise wrote:

>On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote:
>
>>So then I agree with Andrew, I think you are trying to impose
>>restrictions on uDAPL which are not part of the Spec.
>
>true, but if you want a single btl for IB and IW, then you'll need to
>address this issue in some way...


From caitlinb at broadcom.com  Wed May  9 14:11:59 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Wed, 9 May 2007 14:11:59 -0700
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178740498.382.97.camel@stevo-desktop>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D03C331A7@NT-IRVA-0750.brcm.ad.broadcom.com>


> 
> 2) OMPI is not adhering to the iwarp protocol requirement
> that the ULP,
> in this case OMPI, initiating the iwarp connection (the side
> issuing the
> dat_ep_connect() or rdma_connect()) _MUST_ be the first to
> send an RDMA
> message.  So if an OMPI process _accepts_ an rdma connection, then it
> cannot send on that connection until it receives some sort of rdma
> operation from the client process.  It appears the current OMPI
> connection setup model doesn't enforce this.
> 

This is actually an MPA requirement, and according to *protocol* specs
having the active side send a zero length RDMA Write should be able
to fix the problem. However there is language in the RDMAC verbs that
clearly implies that the active side must Send something, and that an
RDMA Write is insufficient.

Therefore, the only truly safe thing for an iWARP btl to do (or a
udapl btl since that is also an iWARP btl) is to have the active
layer send an MPI Layer "nop" of some kind immediately after 
establishing the connection if there is nothing else to send.
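
As a rough illustration of that rule, here is a minimal sketch in Python. It is
only a model: `ActiveEndpoint`, `on_established`, the `wire` list and the `NOP`
marker are all hypothetical names chosen for this example, not part of OMPI,
uDAPL or the verbs API.

```python
from collections import deque

NOP = b""  # hypothetical zero-payload "nop" message at the MPI layer


class ActiveEndpoint:
    """Active (connecting) side of a connection; names are illustrative only."""

    def __init__(self, wire):
        self.wire = wire          # plain list standing in for the transport
        self.pending = deque()    # fragments queued before the connection was up

    def queue(self, frag):
        self.pending.append(frag)

    def on_established(self):
        # The active side must be the first to send. If nothing is queued,
        # send a nop so the passive side is never held off indefinitely.
        if not self.pending:
            self.wire.append(NOP)
        while self.pending:
            self.wire.append(self.pending.popleft())
```

The nop is sent only when the queue is empty, so a connection that already has
a real fragment waiting pays no extra message.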


From sashak at voltaire.com  Wed May  9 14:27:40 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 10 May 2007 00:27:40 +0300
Subject: [ofa-general] Re: [PATCH TRIVIAL] opensm: make osm_node_destroy()
	static
In-Reply-To: <1178543690.32222.350646.camel@hal.voltaire.com>
References: <20070506174138.GI9692@sashak.voltaire.com>
	<20070506174431.GJ9692@sashak.voltaire.com>
	<1178543690.32222.350646.camel@hal.voltaire.com>
Message-ID: <20070509212740.GV9692@sashak.voltaire.com>

On 09:16 Mon 07 May     , Hal Rosenstock wrote:
> On Sun, 2007-05-06 at 13:44, Sasha Khapyorsky wrote:
> > This makes locally used osm_node_destroy() function static
> > 
> > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> 
> Thanks. Applied (to master only).
> 
> Isn't the same applicable for the other osm_xxx_destroy functions ?

Only for those osm_xxx objects which have dynamic
constructors/destructors osm_xxx_new() and osm_xxx_delete().

> If
> so, shouldn't they also be made static ?

Yes, good idea.

Sasha


From jsquyres at cisco.com  Wed May  9 14:22:44 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 9 May 2007 17:22:44 -0400
Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
In-Reply-To: <464232DC.9010201@Sun.COM>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>
	<4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop>
	<46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop>
	<46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop>
	<464232DC.9010201@Sun.COM>
Message-ID: <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>

I talked with Steve a bunch on the phone about this.

1. This "connector must RDMA first" issue is an iWARP restriction --  
it's not specific to udapl or verbs.  For example, if you try to use  
udapl with iWARP on Solaris, you'll have the same issue (I have no  
idea whether you have iWARP drivers in Solaris or not).

2. Per his prior e-mail (which I didn't fully grok until I talked to  
him), using the RDMA CM in the openib BTL will not magically fix this  
issue for us.

3. So for any of the BTLs to support iWARP -- regardless of  
underlying protocol or OS -- they are going to have to obey this  
restriction.

4. Luckily, in iWARP, the restriction can be met by either send/ 
receive semantics *or* RDMA semantics.  You don't have to  
specifically use RDMA verbs semantics, for example.  This is good  
because of the way that OMPI works (the first fragment that will be  
transmitted is pretty much guaranteed to be a send/receive fragment,  
not an RDMA fragment) -- it makes the logistics slightly simpler.

Galen Shipman and I talked about this a bit and suggest the following:

- During the connection dance (probably for both the udapl and openib
BTLs), whichever peer ends up being the connection initiator can send
its pending fragment immediately.  Steve is right that there will
always be a pending fragment, because OMPI doesn't make a connection
until the first send.  Don't forget the race condition where 2 peers
simultaneously decide to initiate: this case is handled properly in
the OMPI code, but make sure you modify the side that ends up being
the actual initiator.

- The other peer (the receiver of the connection) must wait to send  
its pending fragment(s) until it receives the first frag from the  
connection initiator.  This can be accomplished either with another  
flag on the OMPI module struct or perhaps making it part of the  
connection protocol (i.e., don't transition the endpoint to be  
CONNECTED until the first fragment is received).  Either approach can
be used to queue up fragments on the receiver until the first
fragment is received from the initiator.  I'd have to look in the  
code deeper, but I'm *guessing* that it might be best to use the  
already-existing state flag (i.e., checking for CONNECTED) because  
then you won't be introducing any more conditionals in the critical  
path.
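
The second bullet could be sketched roughly as follows. This is a toy model in
Python under stated assumptions, not the OMPI code: `PassiveEndpoint`, `State`,
`send`, `on_recv` and the `wire` list are hypothetical names. It takes the
"reuse the existing state flag" variant, so the send path has a single state
check.

```python
from collections import deque
from enum import Enum, auto


class State(Enum):
    CONNECTING = auto()   # transport is up, but no fragment received yet
    CONNECTED = auto()    # first fragment from the initiator has arrived


class PassiveEndpoint:
    """Accepting side: holds its own fragments until the initiator has sent."""

    def __init__(self, wire):
        self.wire = wire              # plain list standing in for the transport
        self.state = State.CONNECTING
        self.pending = deque()

    def send(self, frag):
        # Single state check on the send path: queue until CONNECTED.
        if self.state is not State.CONNECTED:
            self.pending.append(frag)
            return
        self.wire.append(frag)

    def on_recv(self, frag):
        # The first fragment from the initiator completes the connection
        # dance; only then is the queued traffic flushed, in order.
        if self.state is State.CONNECTING:
            self.state = State.CONNECTED
            while self.pending:
                self.wire.append(self.pending.popleft())
        return frag
```

Whether the real endpoint state machine can absorb this without touching the
critical path is exactly the part that still needs a look at the code.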


On May 9, 2007, at 4:45 PM, Donald Kerr wrote:

> I guess I have not read enough about iwarp yet but if iwarp is sitting
> below ib verbs or udapl in the stack and is trying to impose
> restrictions which ib verbs or udapl do not adhere to then maybe iwarp
> is in the wrong place in the ofed stack.
>
> Having said that I do agree the OMPI community needs to consider where
> iwarp plays in its own stack. If it has not already.
>
> Steve Wise wrote:
>
>> On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote:
>>
>>
>>> So then I agree with Andrew, I think you are trying to impose
>>> restrictions on uDAPL which are not part of the Spec.
>>>
>>>
>>>
>>
>> true, but if you want a single btl for IB and IW, then you'll need to
>> address this issue in some way...


-- 
Jeff Squyres
Cisco Systems


From caitlinb at broadcom.com  Wed May  9 14:33:38 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Wed, 9 May 2007 14:33:38 -0700
Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
In-Reply-To: <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D03C331D9@NT-IRVA-0750.brcm.ad.broadcom.com>

Jeff Squyres wrote:

> 
> - The other peer (the receiver of the connection) must wait
> to send its pending fragment(s) until it receives the first
> frag from the connection initiator.  This can be accomplished
> either with another flag on the OMPI module struct or perhaps
> making it part of the connection protocol (i.e., don't
> transition the endpoint to be CONNECTED until the first
> fragment is received).  Either of which can be used to queue
> up fragments on the receiver until the first fragment is
> received from the initiator.  I'd have to look in the code
> deeper, but I'm *guessing* that it might be best to use the
> already-existing state flag (i.e., checking for CONNECTED)
> because then you won't be introducing any more conditionals
> in the critical path.
> 

The transport provider has several options on ensuring that
the passive side does not put a message on the wire before
the first message is received.

What the transport layer cannot do is create the first message
from the active side. Because it will have send/recv semantics
it will complete a receive work request, which the application
layer has to post with that expectation.

This nop does not have to be visible above OMPI, but I'm pretty
sure OMPI has to generate it. That isn't exactly fair to the 
application layer, but the RDMAC verbs are water under the
bridge. Assuming OMPI wants to work with *any* iWARP RNIC
then it needs to ensure that the active side will send something
promptly in all cases.
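
To make the resource point concrete, here is a small Python model of why the
nop cannot be invisible to the layer that posts receives. It is a sketch only:
`RecvQueue`, `post_recv`, `deliver` and `drain_nops` are hypothetical names,
and an empty payload stands in for the nop.

```python
from collections import deque


class RecvQueue:
    """Toy model of posted receive work requests (not a verbs API)."""

    def __init__(self):
        self.posted = deque()     # buffer ids posted by the application layer
        self.completions = []     # (buffer id, payload) pairs, in arrival order

    def post_recv(self, buf_id):
        self.posted.append(buf_id)

    def deliver(self, payload):
        # Any incoming send, including a zero-length nop, completes the
        # oldest posted receive. With none posted, the message cannot land.
        if not self.posted:
            raise RuntimeError("no receive posted for incoming send")
        self.completions.append((self.posted.popleft(), payload))


def drain_nops(rq):
    """Drop nop completions and report how many buffers they consumed."""
    real = [(b, p) for b, p in rq.completions if p != b""]
    consumed_by_nop = len(rq.completions) - len(real)
    return real, consumed_by_nop
```

In this model the layer that owns the receive queue has to budget one extra
posted receive per connection and re-post the buffer the nop used up, which is
why OMPI rather than the transport would have to generate and absorb it.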


From jsquyres at cisco.com  Wed May  9 14:38:02 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 9 May 2007 17:38:02 -0400
Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
In-Reply-To: <1EF1E44200D82B47BD5BA61171E8CE9D03C331D9@NT-IRVA-0750.brcm.ad.broadcom.com>
References: <1EF1E44200D82B47BD5BA61171E8CE9D03C331D9@NT-IRVA-0750.brcm.ad.broadcom.com>
Message-ID: <A31472F8-C4AD-41FA-9B71-04D4EB5549BF@cisco.com>

Understood, and I agree.

FWIW: note that the CONNECTED state that I referred to is internal to  
OMPI's endpoint abstraction (not an iwarp/udapl/verbs/etc. state).   
It's part of our connection dance protocol.


On May 9, 2007, at 5:33 PM, Caitlin Bestler wrote:

> Jeff Squyres wrote:
>
>>
>> - The other peer (the receiver of the connection) must wait
>> to send its pending fragment(s) until it receives the first
>> frag from the connection initiator.  This can be accomplished
>> either with another flag on the OMPI module struct or perhaps
>> making it part of the connection protocol (i.e., don't
>> transition the endpoint to be CONNECTED until the first
>> fragment is received).  Either of which can be used to queue
>> up fragments on the receiver until the first fragment is
>> received from the initiator.  I'd have to look in the code
>> deeper, but I'm *guessing* that it might be best to use the
>> already-existing state flag (i.e., checking for CONNECTED)
>> because then you won't be introducing any more conditionals
>> in the critical path.
>>
>
> The transport provider has several options on ensuring that
> the passive side does not put a message on the wire before
> the first message is received.
>
> What the transport layer cannot do is create the first message
> from the active side. Because it will have send/recv semantics
> it will complete a receive work request, which the application
> layer has to post with that expectation.
>
> this nop does not have to be visible above OMPI, but I'm pretty
> sure OMPI has to generate it. That isn't exactly fair to the
> application layer, but the RDMAC verbs are water under the
> bridge. Assuming OMPI wants to work with *any* iWARP RNIC
> then it needs to ensure that the active side will send something
> promptly in all cases.
>
>


-- 
Jeff Squyres
Cisco Systems


From swise at opengridcomputing.com  Wed May  9 14:44:52 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 09 May 2007 16:44:52 -0500
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <46425627.8000903@open-mpi.org>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop> <46425627.8000903@open-mpi.org>
Message-ID: <1178747092.382.125.camel@stevo-desktop>

On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote:
> 
> Steve Wise wrote:
> > There have been a series of discussions on the ofa general list about
> > this issue, and the conclusion to date is that it cannot be resolved in
> > the rdma-cm or iwarp-cm code of the linux rdma stack.  Mainly because
> > sending an RDMA message involves the ULP's work queue and completion
> > queue, so the CM cannot do this under the covers in a manner that
> > doesn't affect the application.  Thus, the applications must deal with
> > this.
> 
> Why can't uDAPL deal with this?  As a uDAPL user, I really don't care 
> what API uDAPL is using under the hood to move data from one place to 
> another, nor the quirks of that API.  The whole point of uDAPL is to 
> form a network-agnostic abstraction layer.  AFAIK, the uDAPL spec 
> doesn't enforce any such requirement on RDMA communication either.  In 
> my opinion, exposing such behavior above uDAPL is incorrect and is part 
> of why uDAPL has seen limited adoption -- every single uDAPL 
> implementation behaves in different ways, making it extremely difficult 
> to write an application to work on any uDAPL implementation.  Sorry if 
> this sounds harsh, but this comes from many hours of banging my head on 
> the wall due to working around these sorts of problems :)
> 

I understand your frustration.  I think the MPA protocol is deficient in
this respect and should have required the necessary "first FPDU" to be
sent under the covers by the RNICs. A RTR packet if you will.  To
resolve this issue "properly", in my opinion, would involve changing the
IETF MPA spec and also breaking all the existing iwarp HW.  We can't do
that.

The reason it is hard or impossible to solve this in the DAPL layer is
that any rdma operation on the QP affects the state of that QP and the
associated CQs.  In addition, if you use an RDMA send to enforce this you
impact the other side by consuming a RECV buffer. So it's hard if not
impossible to do this under the covers without affecting the
application's resources.

Also, the DAPL specification had a goal to not impose any additional
protocol on the wire.  If you add this under the covers, then you add
such a "protocol" and break interoperability between a connection
accessed via DAPL on one end and some other API on the other end.

> > 
> > Here is a possible solution: 
> > 
> > I assume in OMPI that connections are only initiated when the mpi
> > application does a send operation.   Given that, then udapl btl must
> > ensure that if a given rank accepts a connection, it cannot send
> > anything until the rank at the other end of the connection sends first.
> > Since the other side initiated the connection, it will have pending data
> > to send...
> > 
> > I haven't looked into how painful this will be to implement.
> > 
> > Thoughts?
> 
> Following on what I wrote above, I think Open MPI is the wrong place to 
> be dealing with this.  There's enough of these hacks as it is; I'm not 
> interested in seeing more get added.
> 

Unfortunately, I haven't been able to come up with a solution that works
with existing iWARP HW and is interoperable. 

Steve.


From afriedle at open-mpi.org  Wed May  9 17:46:12 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Wed, 09 May 2007 17:46:12 -0700
Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
In-Reply-To: <1EF1E44200D82B47BD5BA61171E8CE9D03C331A7@NT-IRVA-0750.brcm.ad.broadcom.com>
References: <1EF1E44200D82B47BD5BA61171E8CE9D03C331A7@NT-IRVA-0750.brcm.ad.broadcom.com>
Message-ID: <46426B54.3020105@open-mpi.org>


> Therefore, the only truly safe thing for an iWARP btl to do (or a
> udapl btl since that is also an iWARP btl) is to have the active
> layer send an MPI Layer "nop" of some kind immediately after 
> establishing the connection if there is nothing else to send.

This is fine for an iWARP/RDMACM/whatever BTL (or anything else that 
uses the OFA verbs interface(s)), but my argument is that uDAPL is NOT 
specifically there to support just iWARP (though it may include it), and 
that OFED's uDAPL should be adjusted to handle this.  Again, uDAPL is a 
network *independent* abstraction, so requiring network-dependent 
behavior from the uDAPL consumer is wrong.

A related question -- how does this 'connection initiator must send 
first' requirement relate to UD?

Andrew


From caitlinb at broadcom.com  Wed May  9 14:54:53 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Wed, 9 May 2007 14:54:53 -0700
Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
In-Reply-To: <46426B54.3020105@open-mpi.org>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D03C331F4@NT-IRVA-0750.brcm.ad.broadcom.com>

general-bounces at lists.openfabrics.org wrote:
>> Therefore, the only truly safe thing for an iWARP btl to do (or a
>> udapl btl since that is also an iWARP btl) is to have the active
>> layer send an MPI Layer "nop" of some kind immediately after
>> establishing the connection if there is nothing else to send.
> 
> This is fine for an iWARP/RDMACM/whatever BTL (or anything
> else that uses the OFA verbs interface(s)), but my argument
> is that uDAPL is NOT specifically there to support just iWARP
> (though it may include it), and that OFED's uDAPL should be
> adjusted to handle this.  Again, uDAPL is a network
> *independent* abstraction, so requiring network-dependent
> behavior from the uDAPL consumer is wrong.
>

DAPL strives to define network independent solutions. In this
case the network independent solution is that the active side
*always* sends the first message. This works for both iWARP
and InfiniBand. And away from the HPC market it is almost a
non-requirement, which is why the RDMAC managed to goof on
this in its specification. A zero-length RDMA Write is enough
to deal with the wire protocol problem, but people implemented
to the RDMAC verbs.

 
>
> A related question -- how does this 'connection initiator
> must send first' requirement relate to UD?
> 

iWARP UD is called UDP. It has nothing to do with MPA
or RDMA. An API that mapped to either IB UD or UDP is
definitely feasible, but hasn't been important enough
to anyone to draft as of yet.


From afriedle at open-mpi.org  Wed May  9 17:55:52 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Wed, 09 May 2007 17:55:52 -0700
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178747092.382.125.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>	<1178657765.11455.32.camel@stevo-desktop>	<4640FDE9.9010000@ichips.intel.com>	<1178718090.382.4.camel@stevo-desktop>	<1178721476.382.18.camel@stevo-desktop>	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>	<4641EBD0.3000600@Sun.COM>	<1178740498.382.97.camel@stevo-desktop>
	<46425627.8000903@open-mpi.org>
	<1178747092.382.125.camel@stevo-desktop>
Message-ID: <46426D98.1030406@open-mpi.org>


Steve Wise wrote:
> On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote:
>> Steve Wise wrote:
>>> There have been a series of discussions on the ofa general list about
>>> this issue, and the conclusion to date is that it cannot be resolved in
>>> the rdma-cm or iwarp-cm code of the linux rdma stack.  Mainly because
>>> sending an RDMA message involves the ULP's work queue and completion
>>> queue, so the CM cannot do this under the covers in a manner that
>>> doesn't affect the application.  Thus, the applications must deal with
>>> this.
>> Why can't uDAPL deal with this?  As a uDAPL user, I really don't care 
>> what API uDAPL is using under the hood to move data from one place to 
>> another, nor the quirks of that API.  The whole point of uDAPL is to 
>> form a network-agnostic abstraction layer.  AFAIK, the uDAPL spec 
>> doesn't enforce any such requirement on RDMA communication either.  In 
>> my opinion, exposing such behavior above uDAPL is incorrect and is part 
>> of why uDAPL has seen limited adoption -- every single uDAPL 
>> implementation behaves in different ways, making it extremely difficult 
>> to write an application to work on any uDAPL implementation.  Sorry if 
>> this sounds harsh, but this comes from many hours of banging my head on 
>> the wall due to working around these sorts of problems :)
>>
> 
> I understand your frustration.  I think the MPA protocol is deficient in
> this respect and should have required the necessary "first FPDU" to be
> sent under the covers by the RNICs. A RTR packet if you will.  To
> resolve this issue "properly", in my opinion, would involve changing the
> IETF MPA spec and also breaking all the existing iwarp HW.  We can't do
> that.

Understood.

> The reason it is hard or impossible to solve this in the DAPL layer is
> that any rdma operation on the QP affects the state of that QP and the
> associated CQs.  In addition, if you use an RDMA send to enforce this you
> impact the other side by consuming a RECV buffer. So it's hard if not
> impossible to do this under the covers without affecting the
> application's resources.

Is there no way to do this before passing connection established events 
to the uDAPL consumer?  I need to go read up on the uDAPL API to really 
understand why this wouldn't work.

> 
> Also, the DAPL specification had a goal to not impose any additional
> protocol on the wire.  If you add this under the covers, then you add
> such a "protocol" and break interoperability between a connection
> accessed via DAPL on one end and some other API on the other end.

So I guess there's no 'right' solution, at least at the uDAPL level. 
With RDMACM/OFA verbs, there's at least the argument that you can design 
the API/semantics however you please, while uDAPL is already standardized.

I hope you guys are documenting this in a way that makes this issue 
extremely clear to both uDAPL and OFA verbs (is this the right naming?) 
users.  Maybe it's been done already, but is it possible to emit some 
sort of loud warning/error when the accept()'ing side tries to send 
before a receive?

Andrew


From swise at opengridcomputing.com  Wed May  9 14:56:46 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 09 May 2007 16:56:46 -0500
Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
In-Reply-To: <46426B54.3020105@open-mpi.org>
References: <1EF1E44200D82B47BD5BA61171E8CE9D03C331A7@NT-IRVA-0750.brcm.ad.broadcom.com>
	<46426B54.3020105@open-mpi.org>
Message-ID: <1178747806.382.128.camel@stevo-desktop>

On Wed, 2007-05-09 at 17:46 -0700, Andrew Friedley wrote:
> > Therefore, the only truly safe thing for an iWARP btl to do (or a
> > udapl btl since that is also an iWARP btl) is to have the active
> > layer send an MPI Layer "nop" of some kind immediately after 
> > establishing the connection if there is nothing else to send.
> 
> This is fine for an iWARP/RDMACM/whatever BTL (or anything else that 
> uses the OFA verbs interface(s)), but my argument is that uDAPL is NOT 
> specifically there to support just iWARP (though it may include it), and 
> that OFED's uDAPL should be adjusted to handle this.  Again, uDAPL is a 
> network *independent* abstraction, so requiring network-dependent 
> behavior from the uDAPL consumer is wrong.
> 
> A related question -- how does this 'connection initiator must send 
> first' requirement relate to UD?
> 

It doesn't.  UD isn't supported in iWARP.


From ossrosch at linux.vnet.ibm.com  Wed May  9 14:57:59 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Wed, 9 May 2007 23:57:59 +0200
Subject: [ofa-general] Re: Build problem with RHEL-4.5 and OFED-1.2
In-Reply-To: <1178737535.2848.152.camel@fc6.xsintricity.com>
References: <200705091824.54394.ossrosch@linux.vnet.ibm.com>
	<1178737535.2848.152.camel@fc6.xsintricity.com>
Message-ID: <200705092357.59973.ossrosch@linux.vnet.ibm.com>

On Wednesday 09 May 2007 21:05, Doug Ledford wrote:
> On Wed, 2007-05-09 at 18:24 +0200, Stefan Roscher wrote:
> > Hi Doug,
> > 
> > I installed RHEL-4.5 on one of our ppc64 systems and recognized that asm-ppc
> > directory is missing in /usr/src/kernels/2.6.9-55.EL/include. 
> > Normally I don't need this directory, but ibmebus.h includes
> > asm-ppc64/of_device.h. And there asm-ppc64/of_device.h includes 
> > asm-ppc/of_device.h. Because this file is missing I can not build 
> > ehca and ofed stack with ofed-1.2 daily build from today.
> > 
> > Did I make something wrong during installation?
> > 
> > Regards Stefan Roscher
> 
> I'll look into it, but in the meantime, install the kernel src.rpm, go
> into /usr/src/redhat/SPEC and run rpmbuild --bp kernel-2.6.spec and it
> should create a complete source tree
> in /usr/src/redhat/BUILD/kernel-2.6.18 that you can then get the asm-ppc
> directory contents out of.
> 
> -- 
> Doug Ledford <dledford at redhat.com>
>               GPG KeyID: CFBFF194
>               http://people.redhat.com/dledford
> 
> Infiniband specific RPMs available at
>               http://people.redhat.com/dledford/Infiniband
> 
To create the backport patches for RHEL 4.5 I did it as you describe, but the
OFED build scripts don't use the kernel sources in
/usr/src/redhat/BUILD. OFED-1.2 uses the source link within
/lib/modules/kernel-x.x.x, which points into /usr/src/kernel; those
kernel sources were created during installation of RHEL 4.5. In this kernel
source the directory include/asm-ppc is missing.
This is why I did not find this problem while creating the
backport patches.

regards stefan


From mshefty at ichips.intel.com  Wed May  9 15:01:12 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 09 May 2007 15:01:12 -0700
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178747092.382.125.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>	<1178657765.11455.32.camel@stevo-desktop>	<4640FDE9.9010000@ichips.intel.com>	<1178718090.382.4.camel@stevo-desktop>	<1178721476.382.18.camel@stevo-desktop>	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>	<4641EBD0.3000600@Sun.COM>	<1178740498.382.97.camel@stevo-desktop>
	<46425627.8000903@open-mpi.org>
	<1178747092.382.125.camel@stevo-desktop>
Message-ID: <464244A8.4070406@ichips.intel.com>

> The reason it is hard or impossible to solve this in the DAPL layer is
> that any rdma operation on the QP affects the state of that QP and the
> associated CQs.  In addition, if you use an RDMA send to enforce this you
> impact the other side by consuming a RECV buffer. So it's hard if not
> impossible to do this under the covers without affecting the
> application's resources.

I agree that this is hard, but I don't believe that it's impossible.

> Also, the DAPL specification had a goal to not impose any additional
> protocol on the wire.  If you add this under the covers, then you add
> such a "protocol" and break interoperability between a connection
> accessed via DAPL on one end and some other API on the other end.

IMO, this is an unrealized dream.  DAPL does generate wire protocol.  For 
example, when running over IB, DAPL's selection of a service ID and CM protocol 
is visible on the wire.  A DAPL that establishes connections using the RDMA CM 
will likely have a different wire protocol than a version of DAPL that 
establishes connections talking directly to the IB CM.  The two DAPLs will not 
interoperate unless they agree on how they will map to service IDs and, in the 
case of using the RDMA CM, the format of the private data carried in the CM 
messages.

Even in the case of iWarp, DAPL's selection of a local port number affects the 
data visible on the wire.  To communicate, a remote end point must know how this 
mapping occurs.

- Sean


From caitlinb at broadcom.com  Wed May  9 15:03:08 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Wed, 9 May 2007 15:03:08 -0700
Subject: [ofa-general] RE: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <46425627.8000903@open-mpi.org>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D03C33203@NT-IRVA-0750.brcm.ad.broadcom.com>

devel-bounces at open-mpi.org wrote:
> Steve Wise wrote:
>> There have been a series of discussions on the ofa general list about
>> this issue, and the conclusion to date is that it cannot be resolved
>> in the rdma-cm or iwarp-cm code of the linux rdma stack.  Mainly
>> because sending an RDMA message involves the ULP's work queue and
>> completion queue, so the CM cannot do this under the covers in a
>> manner that doesn't affect the application.  Thus, the applications
>> must deal with this.
> 
> Why can't uDAPL deal with this?  As a uDAPL user, I really
> don't care what API uDAPL is using under the hood to move
> data from one place to another, nor the quirks of that API.
> The whole point of uDAPL is to form a network-agnostic
> abstraction layer.  AFAIK, the uDAPL spec doesn't enforce any
> such requirement on RDMA communication either.  In my
> opinion, exposing such behavior above uDAPL is incorrect and
> is part of why uDAPL has seen limited adoption -- every
> single uDAPL implementation behaves in different ways, making
> it extremely difficult to write an application to work on any
> uDAPL implementation.  Sorry if this sounds harsh, but this
> comes from many hours of banging my head on the wall due to
> working around these sorts of problems :)
> 

The simple answer is that uDAPL cannot deal with this.

The RDMAC verbs specification was overly focused on client/server
and therefore did not realize that there was any harm in requiring
that the active side did the first send. But given that DAPL could
not rewrite either the RDMAC or InfiniBand verbs it had to come up
with the best solution that matched the verbs as they were. One of
the explicit ground rules was that DAPL MUST support all RDMA devices
that were IBTA or RDMAC compliant. Given those rules, if the active
side does not send a message the passive side might be held off
indefinitely, and sending a message causes consumption of a receive
buffer and therefore cannot be transparent to the uDAPL consumer.

Given those constraints there is literally nothing that can be
done to work around this problem by either DAPL or OFA.


From swise at opengridcomputing.com  Wed May  9 15:15:15 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 09 May 2007 17:15:15 -0500
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <46426D98.1030406@open-mpi.org>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop> <46425627.8000903@open-mpi.org>
	<1178747092.382.125.camel@stevo-desktop>
	<46426D98.1030406@open-mpi.org>
Message-ID: <1178748915.382.145.camel@stevo-desktop>

On Wed, 2007-05-09 at 17:55 -0700, Andrew Friedley wrote:
> 
> Steve Wise wrote:
> > On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote:
> >> Steve Wise wrote:
> >>> There have been a series of discussions on the ofa general list about
> >>> this issue, and the conclusion to date is that it cannot be resolved in
> >>> the rdma-cm or iwarp-cm code of the linux rdma stack.  Mainly because
> >>> sending an RDMA message involves the ULP's work queue and completion
> >>> queue, so the CM cannot do this under the covers in a manner that
> >>> doesn't affect the application.  Thus, the applications must deal with
> >>> this.
> >> Why can't uDAPL deal with this?  As a uDAPL user, I really don't care 
> >> what API uDAPL is using under the hood to move data from one place to 
> >> another, nor the quirks of that API.  The whole point of uDAPL is to 
> >> form a network-agnostic abstraction layer.  AFAIK, the uDAPL spec 
> >> doesn't enforce any such requirement on RDMA communication either.  In 
> >> my opinion, exposing such behavior above uDAPL is incorrect and is part 
> >> of why uDAPL has seen limited adoption -- every single uDAPL 
> >> implementation behaves in different ways, making it extremely difficult 
> >> to write an application to work on any uDAPL implementation.  Sorry if 
> >> this sounds harsh, but this comes from many hours of banging my head on 
> >> the wall due to working around these sorts of problems :)
> >>
> > 
> > I understand your frustration.  I think the MPA protocol is deficient in
> > this respect and should have required the necessary "first FPDU" to be
> > sent under the covers by the RNICs. An RTR packet, if you will.  To
> > resolve this issue "properly", in my opinion, would involve changing the
> > IETF MPA spec and also breaking all the existing iwarp HW.  We can't do
> > that.
> 
> Understood.
> 
> > The reason it is hard or impossible to solve this in the DAPL layer is
> > that any rdma operation on the QP affects the state of that QP and the
> > associated CQs.  In addition, if you use an RDMA send to enforce this you
> > impact the other side by consuming a RECV buffer. So it's hard if not
> > impossible to do this under the covers without affecting the
> > application's resources.
> 
> Is there no way to do this before passing connection established events 
> to the uDAPL consumer?  I need to go read up on the uDAPL API to really 
> understand why this wouldn't work.
> 

Perhaps DAPL, or maybe even an OFA iWARP CM, could defer passing up the
"established" event on the passive side until an incoming SEND is
detected.  I know we've discussed this before, but I'm not sure why this
was not a workable solution.  Perhaps Caitlin or some iwarp folks can
recall?  

> > 
> > Also, the DAPL specification had a goal to not impose any additional
> > protocol on the wire.  If you add this under the covers, then you add
> > such a "protocol" and break interoperability between a connection
> > accessed via DAPL on one end and some other API on the other end.
> 
> So I guess there's no 'right' solution, at least at the uDAPL level. 
> With RDMACM/OFA verbs, there's at least the argument that you can design 
> the API/semantics however you please, while uDAPL is already standardized.

Yes, but it's still difficult to post a SEND under the covers because it
consumes the application resources in the form of QP and CQ space and a
RECV buffer.

So to date, we have...punted and pushed the problem to the ULP.

> 
> I hope you guys are documenting this in a way that makes this issue 
> extremely clear to both uDAPL and OFA verbs (is this the right naming?) 
> users.  Maybe it's been done already, but is it possible to emit some 
> sort of loud warning/error when the accept()'ing side tries to send 
> before a receive?
> 

The connection comes tumbling down.  How's that for loud? :)

Seriously though, it isn't documented well enough.  But we're bleeding
edge here. And I'm still hoping somebody will come up with an elegant
solution that doesn't break interoperability, applications and/or iwarp
hw (i'm a dreamer :). 


Steve.
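
To make the ULP-side workaround concrete, here is a minimal sketch in C.
It assumes the rule discussed in this thread: on iWARP the accepting
(passive) side must not send until it has received a message from the
connecting (active) side.  All names (struct conn, conn_send,
conn_on_recv) are hypothetical, and post_now() is only a stand-in for
posting a real send work request; this is not OMPI, uDAPL, or verbs code.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEFERRED 8
#define MAX_POSTED   32

enum conn_role { CONN_ACTIVE, CONN_PASSIVE };

struct conn {
	enum conn_role role;
	bool peer_has_sent;          /* true once any message has arrived */
	int deferred[MAX_DEFERRED];  /* sends held back on the passive side */
	size_t n_deferred;
	int posted[MAX_POSTED];      /* log of the first MAX_POSTED posted ids */
	size_t n_posted;             /* total number of sends posted */
};

void conn_init(struct conn *c, enum conn_role role)
{
	c->role = role;
	c->peer_has_sent = false;
	c->n_deferred = 0;
	c->n_posted = 0;
}

/* Stand-in for posting a send work request: it only logs the id. */
static void post_now(struct conn *c, int msg_id)
{
	if (c->n_posted < MAX_POSTED)
		c->posted[c->n_posted] = msg_id;
	c->n_posted++;
}

/*
 * Send or defer a message.
 * Returns 0 if posted, 1 if deferred, -1 if the deferral queue is full.
 */
int conn_send(struct conn *c, int msg_id)
{
	if (c->role == CONN_PASSIVE && !c->peer_has_sent) {
		if (c->n_deferred == MAX_DEFERRED)
			return -1;
		c->deferred[c->n_deferred++] = msg_id;
		return 1;
	}
	post_now(c, msg_id);
	return 0;
}

/*
 * Call on every receive completion.  The first one opens the gate and
 * flushes the deferred sends in their original order.
 * Returns the number of sends flushed by this call.
 */
int conn_on_recv(struct conn *c)
{
	size_t i;
	int flushed = 0;

	c->peer_has_sent = true;
	for (i = 0; i < c->n_deferred; i++) {
		post_now(c, c->deferred[i]);
		flushed++;
	}
	c->n_deferred = 0;
	return flushed;
}
```

The active side is never gated, so it can always supply the first
message; the passive side pays only the cost of a small queue.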


From swise at opengridcomputing.com  Wed May  9 15:18:00 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 09 May 2007 17:18:00 -0500
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <464244A8.4070406@ichips.intel.com>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop> <46425627.8000903@open-mpi.org>
	<1178747092.382.125.camel@stevo-desktop>
	<464244A8.4070406@ichips.intel.com>
Message-ID: <1178749080.382.148.camel@stevo-desktop>

On Wed, 2007-05-09 at 15:01 -0700, Sean Hefty wrote:
> > The reason it is hard or impossible to solve this in the DAPL layer is
> > that any rdma operation on the QP affects the state of that QP and the
> > associated CQs.  In addition, if you use an RDMA send to enforce this you
> > impact the other side by consuming a RECV buffer. So it's hard if not
> > impossible to do this under the covers without affecting the
> > application's resources.
> 
> I agree that this is hard, but I don't believe that it's impossible.
> 
> > Also, the DAPL specification had a goal to not impose any additional
> > protocol on the wire.  If you add this under the covers, then you add
> > such a "protocol" and break interoperability between a connection
> > accessed via DAPL on one end and some other API on the other end.
> 
> IMO, this is an unrealized dream.  DAPL does generate wire protocol.  For
> example, when running over IB, DAPL's selection of a service ID and CM protocol 
> is visible on the wire.  A DAPL that establishes connections using the RDMA CM 
> will likely have a different wire protocol than a version of DAPL that 
> establishes connections talking directly to the IB CM.  The two DAPLs will not 
> interoperate unless they agree on how they will map to service IDs and, in the 
> case of using the RDMA CM, the format of the private data carried in the CM 
> messages.

I wasn't aware of this.

> 
> Even in the case of iWarp, DAPL's selection of a local port number affects the 
> data visible on the wire.  To communicate, a remote end point must know how this
> mapping occurs.

You mean the local port on the active side?  The remote end point
doesn't need to know this at all...

Steve.


From caitlinb at broadcom.com  Wed May  9 15:25:06 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Wed, 9 May 2007 15:25:06 -0700
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178748915.382.145.camel@stevo-desktop>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D03C3322A@NT-IRVA-0750.brcm.ad.broadcom.com>

general-bounces at lists.openfabrics.org wrote:
> On Wed, 2007-05-09 at 17:55 -0700, Andrew Friedley wrote:
>> 
>> Steve Wise wrote:
>>> On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote:
>>>> Steve Wise wrote:
>>>>> There have been a series of discussions on the ofa general list
>>>>> about this issue, and the conclusion to date is that it cannot be
>>>>> resolved in the rdma-cm or iwarp-cm code of the linux rdma stack.
>>>>> Mainly because sending an RDMA message involves the ULP's work
>>>>> queue and completion queue, so the CM cannot do this under the
>>>>> covers in a manner that doesn't affect the application.  Thus, the
>>>>> applications must deal with this.
>>>> Why can't uDAPL deal with this?  As a uDAPL user, I really don't
>>>> care what API uDAPL is using under the hood to move data from one
>>>> place to another, nor the quirks of that API.  The whole point of
>>>> uDAPL is to form a network-agnostic abstraction layer. AFAIK, the
>>>> uDAPL spec doesn't enforce any such requirement on RDMA
>>>> communication either.  In my opinion, exposing such behavior above
>>>> uDAPL is incorrect and is part of why uDAPL has seen limited
>>>> adoption -- every single uDAPL implementation behaves in different
>>>> ways, making it extremely difficult to write an application to work
>>>> on any uDAPL implementation.  Sorry if this sounds harsh, but this
>>>> comes from many hours of banging my head on the wall due to working
>>>> around these sorts of problems :)
>>>> 
>>> 
>>> I understand your frustration.  I think the MPA protocol is
>>> deficient in this respect and should have required the necessary
>>> "first FPDU" to be sent under the covers by the RNICs. An RTR packet,
>>> if you will.  To resolve this issue "properly", in my opinion, would
>>> involve changing the IETF MPA spec and also breaking all the
>>> existing iwarp HW.  We can't do that.
>> 
>> Understood.
>> 
>>> The reason it is hard or impossible to solve this in the DAPL layer
>>> is that any rdma operation on the QP affects the state of that QP
>>> and the associated CQs.  In addition, if you use an RDMA send to
>>> enforce this you impact the other side by consuming a RECV buffer.
>>> So it's hard if not impossible to do this under the covers without
>>> affecting the application's resources.
>> 
>> Is there no way to do this before passing connection established
>> events to the uDAPL consumer?  I need to go read up on the uDAPL API
>> to really understand why this wouldn't work.
>> 
> 
> Perhaps DAPL, or maybe even an OFA iWARP CM, could defer
> passing up the "established" event on the passive side until
> an incoming SEND is detected.  I know we've discussed this
> before, but I'm not sure why this was not a workable
> solution.  Perhaps Caitlin or some iwarp folks can recall?
> 

That was what the RNIC-PI flag would have enabled. DAPL could
check for that flag in a transport/device-independent way, and
delay the established event until it was safe to post, but no
longer than required: for IB, and for iWARP NICs that fence the
first transmit, the Established event could be generated immediately.

So yes, the transport layer (OFA or DAPL) CAN hide this on 
the passive side.

But as you point out, that doesn't solve the problem of needing
the Send from the active side. Since the Consumer posts RECV
buffers *before* indicating whether the QP/EP will be used 
on the passive or active end, and there are no standard verbs
to jam a receive buffer to the head of an RQ, there is no way
to hide a send/recv exchange from the application layer.

The fact that it can't be made transparent on the active side
certainly diminishes the value of making it transparent on
the receive side. It's still a good idea, but I don't think 
it has percolated to the top of anyone's TODO list yet.
When it does, the RNIC-PI proposed flag is a simple capability
flag that is quite easy for any provider to statically set.
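
A rough sketch of the passive-side deferral described above, in C.  The
capability flag is modelled as a plain boolean (must_wait_for_peer); that
field, struct ep, and the ep_* functions are all hypothetical names for
illustration and do not come from RNIC-PI, DAPL, or the OFA CM.  The
sketch only shows the gating decision: report ESTABLISHED once, and on a
passive endpoint that needs it, only after the first inbound message.

```c
#include <stdbool.h>

struct ep {
	bool passive;             /* accepted rather than connected */
	bool must_wait_for_peer;  /* hypothetical provider capability flag */
	bool handshake_done;      /* connection setup has completed */
	bool first_msg_seen;      /* an inbound message has arrived */
	bool reported;            /* ESTABLISHED already delivered */
};

void ep_init(struct ep *e, bool passive, bool must_wait_for_peer)
{
	e->passive = passive;
	e->must_wait_for_peer = must_wait_for_peer;
	e->handshake_done = false;
	e->first_msg_seen = false;
	e->reported = false;
}

/* True when it is safe for the consumer to start posting sends. */
static bool ep_ready(const struct ep *e)
{
	if (!e->handshake_done)
		return false;
	if (e->passive && e->must_wait_for_peer && !e->first_msg_seen)
		return false;
	return true;
}

/* Returns true exactly once: when ESTABLISHED should be delivered. */
static bool ep_maybe_report(struct ep *e)
{
	if (e->reported || !ep_ready(e))
		return false;
	e->reported = true;
	return true;
}

/* Connection setup finished; may or may not release the event. */
bool ep_on_handshake_done(struct ep *e)
{
	e->handshake_done = true;
	return ep_maybe_report(e);
}

/* An inbound message arrived; releases a held-back event if any. */
bool ep_on_inbound_message(struct ep *e)
{
	e->first_msg_seen = true;
	return ep_maybe_report(e);
}
```

With the flag clear (IB, or an RNIC that fences the first transmit) the
event is delivered as soon as the handshake completes, so nothing is
delayed longer than required.  The active side is unaffected either way,
which is the limitation noted above.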


From afriedle at open-mpi.org  Wed May  9 18:26:14 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Wed, 09 May 2007 18:26:14 -0700
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178748915.382.145.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>	<1178657765.11455.32.camel@stevo-desktop>	<4640FDE9.9010000@ichips.intel.com>	<1178718090.382.4.camel@stevo-desktop>	<1178721476.382.18.camel@stevo-desktop>	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>	<4641EBD0.3000600@Sun.COM>	<1178740498.382.97.camel@stevo-desktop>
	<46425627.8000903@open-mpi.org>	<1178747092.382.125.camel@stevo-desktop>	<46426D98.1030406@open-mpi.org>
	<1178748915.382.145.camel@stevo-desktop>
Message-ID: <464274B6.2030508@open-mpi.org>

Steve Wise wrote:
>> I hope you guys are documenting this in a way that makes this issue 
>> extremely clear to both uDAPL and OFA verbs (is this the right naming?) 
>> users.  Maybe it's been done already, but is it possible to emit some 
>> sort of loud warning/error when the accept()'ing side tries to send 
>> before a receive?
>>
> 
> The connection comes tumbling down.  How's that for loud? :)

works :)

> Seriously though, it isn't documented well enough.  But we're bleeding
> edge here. And I'm still hoping somebody will come up with an elegant
> solution that doesn't break interoperability, applications and/or iwarp
> hw (i'm a dreamer :). 

Well, if documenting it once saves someone a headache and a few hours of 
their time, it's probably worth it.

Seems like everyone understands now what the problem is, that it sucks, 
and it can't be fixed lower down the stack :)  Thanks for explaining 
Caitlin/Steve.  As Jeff wrote, dealing with it in the BTLs really won't 
be that hard, just makes things a little more complicated to maintain.

Andrew


From sweitzen at cisco.com  Wed May  9 16:45:33 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 9 May 2007 16:45:33 -0700
Subject: [ofa-general] RE: [PATCH] ipoib/cm: make stale task actually run
	once in a while
In-Reply-To: <20070507200315.GD22341@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C90BFCB3@mtlexch01.mtl.com><6C2C79E72C305246B504CBA17B5500C9076E27@mtlexch01.mtl.com><A15335FBE9BD2449AF2C9EF3D1EB8EA3037668FC@xmb-sjc-216.amer.cisco.com>
	<20070507200315.GD22341@mellanox.co.il>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303820F76@xmb-sjc-216.amer.cisco.com>

I see a new patch, ipoib_correct_timers.patch, in OFED-1.2-20070509-0600.
Which patch should I try?

Scott 

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] 
> Sent: Monday, May 07, 2007 1:03 PM
> To: Scott Weitzenkamp (sweitzen)
> Cc: Yohad Dickman; Amit Krig; Tziporet Koren; 
> mst at mellanox.co.il; general at lists.openfabrics.org; Roland Dreier
> Subject: [PATCH] ipoib/cm: make stale task actually run once 
> in a while
> 
> In the presence of some active passive connections, the stale task would
> never run, since every 4 RX CQEs we repeat the queue_delayed_work call,
> which delays it for some 10 minutes.  As a result, on a noisy system with
> failing ports, we slowly run out of resources, slowing connection setup
> down and eventually failing.
> 
> What we actually want to do is - start stale task when a first
> passive connection is added, rerun it every 10 min as long
> as there are outstanding passive connections.
> 
> As a happy side effect, this removes some code from RX data path.
> 
> Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
> 
> ---
> 
> Scott, I think this might address bugs 541 and 465: slow IPoIB CM HA failover
> and eventual failing IPoIB HA. Could you test this please?
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> index 2b242a4..b77e8d7 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> @@ -258,10 +258,11 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
>  	cm_id->context = p;
>  	p->jiffies = jiffies;
>  	spin_lock_irqsave(&priv->lock, flags);
> +	if (list_empty(&priv->cm.passive_ids))
> +		queue_delayed_work(ipoib_workqueue,
> +				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  	list_add(&p->list, &priv->cm.passive_ids);
>  	spin_unlock_irqrestore(&priv->lock, flags);
> -	queue_delayed_work(ipoib_workqueue,
> -			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  	return 0;
>  
>  err_rep:
> @@ -380,8 +381,6 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
>  			if (!list_empty(&p->list))
>  				list_move(&p->list, &priv->cm.passive_ids);
>  			spin_unlock_irqrestore(&priv->lock, flags);
> -			queue_delayed_work(ipoib_workqueue,
> -					   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  		}
>  	}
>  
> @@ -1104,6 +1103,10 @@ static void ipoib_cm_stale_task(struct work_struct *work)
>  		kfree(p);
>  		spin_lock_irqsave(&priv->lock, flags);
>  	}
> +
> +	if (!list_empty(&priv->cm.passive_ids))
> +		queue_delayed_work(ipoib_workqueue,
> +				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  	spin_unlock_irqrestore(&priv->lock, flags);
>  }
>  
> -- 
> MST
> 
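
The rescheduling rule in the quoted description can be modelled with a
small simulation in C.  This is a toy model, not kernel code: it assumes,
as the description states, that each repeated queueing call pushes the
deadline back, and it measures time in abstract ticks.  STALE_DELAY stands
in for IPOIB_CM_RX_DELAY, and struct sim and the sim_* functions are
hypothetical names.  The stale task here only counts its runs; it does
not reap connections.

```c
#include <stdbool.h>

#define STALE_DELAY 10   /* ticks; stands in for IPOIB_CM_RX_DELAY */

struct sim {
	int now;          /* current tick */
	int due;          /* tick at which the stale task fires, -1 if idle */
	int n_passive;    /* outstanding passive connections */
	int runs;         /* how many times the stale task has run */
	bool rearm_on_rx; /* true: old behaviour, false: patched behaviour */
};

void sim_init(struct sim *s, bool rearm_on_rx)
{
	s->now = 0;
	s->due = -1;
	s->n_passive = 0;
	s->runs = 0;
	s->rearm_on_rx = rearm_on_rx;
}

static void arm(struct sim *s)
{
	s->due = s->now + STALE_DELAY;
}

/* A new passive connection is added to the list. */
void sim_add_passive(struct sim *s)
{
	/* Patched: arm only on the empty -> non-empty transition. */
	if (s->rearm_on_rx || s->n_passive == 0)
		arm(s);
	s->n_passive++;
}

/* A receive completion arrives on a passive connection. */
void sim_rx(struct sim *s)
{
	/* Old behaviour: every completion pushes the deadline back. */
	if (s->rearm_on_rx && s->n_passive > 0)
		arm(s);
}

/* Advance time by one tick and run the stale task if it is due. */
void sim_tick(struct sim *s)
{
	s->now++;
	if (s->due >= 0 && s->now >= s->due) {
		s->runs++;
		s->due = -1;
		/* Patched: the task re-queues itself while connections remain. */
		if (!s->rearm_on_rx && s->n_passive > 0)
			arm(s);
	}
}

/* One connection, one receive completion per tick; returns task runs. */
int sim_run_busy(bool rearm_on_rx, int ticks)
{
	struct sim s;
	int i;

	sim_init(&s, rearm_on_rx);
	sim_add_passive(&s);
	for (i = 0; i < ticks; i++) {
		sim_rx(&s);
		sim_tick(&s);
	}
	return s.runs;
}
```

Under steady traffic the old policy never lets the deadline arrive, while
the patched policy runs the task once per STALE_DELAY ticks for as long as
a passive connection exists.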


From rdreier at cisco.com  Wed May  9 20:00:58 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 09 May 2007 20:00:58 -0700
Subject: [ofa-general] Re: [PATCH] IB/core user memory registrations
In-Reply-To: <1178698393.26046.8.camel@mtls03> (Eli Cohen's message of "Wed,
	09 May 2007 11:12:43 +0300")
References: <1178698393.26046.8.camel@mtls03>
Message-ID: <adasla52xhx.fsf@cisco.com>

 > @@ -1020,7 +1020,7 @@ static struct ib_mr *mthca_reg_user_mr(s
 >  	int shift, n, len;
 >  	int i, j, k;
 >  	int err = 0;
 > -	int write_mtt_size;
 > +	int write_mtt_size = mthca_write_mtt_size(dev);
 >  
 >  	mr = kmalloc(sizeof *mr, GFP_KERNEL);
 >  	if (!mr)

Not sure I understand what this is fixing... can you be more explicit?
As far as I can see, the first use of write_mtt_size in mthca_reg_user_mr()
is the line

	write_mtt_size = min(mthca_write_mtt_size(dev), (int) (PAGE_SIZE / sizeof *pages));

so I'm not sure why we need another initialization?  (I'm looking at
Linus's latest tree, which contains the mlx4 merge)

 - R.


From rdreier at cisco.com  Wed May  9 20:02:35 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 09 May 2007 20:02:35 -0700
Subject: [ofa-general] [PATCH] IB/ipath -- shadow the gpio_mask register
In-Reply-To: <20070509021904.GA16964@bauxite.pathscale.com> (Arthur Jones's
	message of "Tue, 8 May 2007 19:19:04 -0700")
References: <20070508202557.27647.47035.stgit@bauxite.internal.keyresearch.com>
	<adawszi4xtg.fsf@cisco.com>
	<20070509021904.GA16964@bauxite.pathscale.com>
Message-ID: <adaodkt2xf8.fsf@cisco.com>

 > > A better changelog would be appreciated here... I can see deleting the
 > > unlikely() if it's no longer appropriate, but why keep a shadow copy
 > > of the register?  Because this is now a hotter path and you want to
 > > avoid the MMIO read?
 > 
 > exactly.  shall i add that and resend?

That would be great.  Also I'm wondering what changed to make this a
hotter path (just a reference to an earlier patch would be fine, or
you could be more explicit).

No rush because I'm traveling this week so I probably won't be able to
apply anything until Monday anyway.

 - R.


From rdreier at cisco.com  Wed May  9 20:03:03 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 09 May 2007 20:03:03 -0700
Subject: [ofa-general] Re: [PATCH 1/6] IB/ehca: Serialize hypervisor calls in
	ehca_register_mr()
In-Reply-To: <200705091347.57470.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Wed, 9 May 2007 13:47:56 +0200")
References: <200705091347.57470.fenkes@de.ibm.com>
Message-ID: <adak5vh2xeg.fsf@cisco.com>

thanks, it all looks fine... I'll apply when I'm back from my trip on Monday.


From rdreier at cisco.com  Wed May  9 20:06:46 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 09 May 2007 20:06:46 -0700
Subject: [ofa-general] verbs abi_compat
In-Reply-To: <20070509155939.17788.qmail@station183.com> (Jimmy Hill's message
	of "Wed, 09 May 2007 15:59:39 +0000")
References: <20070509155939.17788.qmail@station183.com>
Message-ID: <adafy652x89.fsf@cisco.com>

 > Under what conditions is the field abi_compat of struct ibv_context
 > set to non-zero? I'm encountering a situation where it is set
 > when coding to verbs on a clean OFED 1.2 install. Seems odd that it
 > would be set since I suspected that it would only occur for verbs
 > 1.0/1.1 compatibility.

Are you sure it's being set?  I think most drivers just use malloc()
to allocate the context structure so you could just be seeing
uninitialized memory there.

Anyway I'm not sure why you're looking at the field at all.  It's
really just internal to libibverbs.  If you want to understand how
things work, there's only one assignment to context->abi_compat in
libibverbs, in src/cmd.c so it shouldn't be too hard to figure out
(and add whatever tracing info you want).

 - R.


From rdreier at cisco.com  Wed May  9 20:07:08 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 09 May 2007 20:07:08 -0700
Subject: [ofa-general] Re: [GIT PULL] 2.6.22: please pull rdma-dev.git
In-Reply-To: <000101c79269$72c67aa0$e598070a@amr.corp.intel.com> (Sean Hefty's
	message of "Wed, 9 May 2007 11:39:59 -0700")
References: <000101c79269$72c67aa0$e598070a@amr.corp.intel.com>
Message-ID: <adabqgt2x7n.fsf@cisco.com>

ok, I'll look at this when I get back home on Monday.


From jimmy at hillraiser.com  Wed May  9 21:08:01 2007
From: jimmy at hillraiser.com (Jimmy Hill)
Date: Wed, 9 May 2007 23:08:01 -0500
Subject: [ofa-general] verbs abi_compat
In-Reply-To: <adafy652x89.fsf@cisco.com>
Message-ID: <HFEPKIFILMMCLHAOBMMOGEOJGLAA.jimmy@hillraiser.com>


> -----Original Message-----
>
>  > Under what conditions is the field abi_compat of struct ibv_context
>  > set to non-zero? I'm encountering a situation where it is set
>  > when coding to verbs on a clean OFED 1.2 install. Seems odd that it
>  > would be set since I suspected that it would only occur for verbs
>  > 1.0/1.1 compatibility.
>
> Are you sure it's being set?  I think most drivers just use malloc()
> to allocate the context structure so you could just be seeing
> uninitialized memory there.
>
> Anyway I'm not sure why you're looking at the field at all.  It's
> really just internal to libibverbs.  If you want to understand how
> things work, there's only one assignment to context->abi_compat in
> libibverbs, in src/cmd.c so it shouldn't be too hard to figure out
> (and add whatever tracing info you want).
>

It is set in that it is non-zero, but I agree, it has garbage in it...and
that's part of the problem. It is not being set in src/cmd.c, and has a
non-zero value. When I call ibv_alloc_pd, I'm ending up in
__ibv_alloc_pd_1_0 and that attempts to use context->real_context which is
non-zero garbage as well and I get a segmentation violation. The abi_compat
flag was what I thought was redirecting me into __ibv_alloc_pd_1_0 instead
of __ibv_alloc_pd (where it should be going).

So, maybe I asked the wrong question. Let me try a different approach. What
determines if ibv_alloc_pd resolves to __ibv_alloc_pd_1_0 or __ibv_alloc_pd?
If I can find out what is redirecting my call to the "compat" code, maybe I
can stop it and resolve the problem.

Thanks for the response - I appreciate your help!


From eli at mellanox.co.il  Wed May  9 22:38:39 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 10 May 2007 08:38:39 +0300
Subject: [ofa-general] Re: [PATCH] IB/core user memory registrations
In-Reply-To: <adasla52xhx.fsf@cisco.com>
References: <1178698393.26046.8.camel@mtls03>  <adasla52xhx.fsf@cisco.com>
Message-ID: <1178775549.7405.4.camel@mtls03>

On Wed, 2007-05-09 at 20:00 -0700, Roland Dreier wrote:
> > @@ -1020,7 +1020,7 @@ static struct ib_mr *mthca_reg_user_mr(s
>  >  	int shift, n, len;
>  >  	int i, j, k;
>  >  	int err = 0;
>  > -	int write_mtt_size;
>  > +	int write_mtt_size = mthca_write_mtt_size(dev);
>  >  
>  >  	mr = kmalloc(sizeof *mr, GFP_KERNEL);
>  >  	if (!mr)
> 
> Not sure I understand what this is fixing... can you be more explicit?
> As far as I can see, the first use of write_mtt_size in mthca_reg_user_mr()
> is the line
> 
> 	write_mtt_size = min(mthca_write_mtt_size(dev), (int) (PAGE_SIZE / sizeof *pages));
> 
> so I'm not sure why we need another initialization?  (I'm looking at
> Linus's latest tree, which contains the mlx4 merge)
> 
>  - R.

This initialization was not in the tree I was working on. Now I see it
is fixed in the updated tree. Thanks.


From yosefe at voltaire.com  Wed May  9 22:51:21 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 08:51:21 +0300
Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <20070509174138.GB17734@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408360.3040006@voltaire.com>	<20070508162727.GD5845@mellanox.co.il>	<4640A8BD.4000405@voltaire.com>	<20070509093548.GA7683@mellanox.co.il>	<4641AA06.1050002@voltaire.com>	<20070509112626.GA10068@mellanox.co.il>	<4641B63D.4010602@voltaire.com>
	<20070509174138.GB17734@mellanox.co.il>
Message-ID: <4642B2D9.60309@voltaire.com>

Michael S. Tsirkin wrote:
>>@@ -642,6 +651,11 @@ void ipoib_ib_dev_flush(struct work_stru
>> 
>> 	ipoib_ib_dev_down(dev, 0);
>> 
>>+	if (restart_qp) {
>>+		ipoib_ib_dev_stop(dev, 0);
>>+		ipoib_ib_dev_open(dev);
>>+	}
>>+
>> 	/*
>> 	 * The device could have been brought down between the start and when
>> 	 * we get here, don't bring it back up if it's not configured up
> 
> 
> By the way, I think I see a small issue now - if there's a
> pkey change event, this will flush all interfaces, even if
> the pkey changed is not used by ipoib at all.
> 
> How about:
> - rename restart_qp flag to pkey_change_event
> - do something like this at the beginning of the flush routine
> 	if (pkey_change_event &&
> 		query_pkey(current index) == current_pkey))
> 		return;
I think we should do the following: hold the index in dev_priv, set it outside
restart_qp, and use it in restart_qp as an input parameter. On flush, we find
and set it. This will prevent ~64 pkey queries, which are not yet cache-optimized.

> 
> Need to think what to do if index is not valid, but you get the idea.
> 
We can give up, clear PKEY_ASSIGNED flag, and let the polling do its job.

> This will remove all the extra flushes in the common case
> where pkeys are not moved around too much.
> 


From yosefe at voltaire.com  Thu May 10 00:25:15 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 10:25:15 +0300
Subject: [ofa-general] [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <20070509174138.GB17734@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408360.3040006@voltaire.com>	<20070508162727.GD5845@mellanox.co.il>	<4640A8BD.4000405@voltaire.com>	<20070509093548.GA7683@mellanox.co.il>	<4641AA06.1050002@voltaire.com>	<20070509112626.GA10068@mellanox.co.il>	<4641B63D.4010602@voltaire.com>
	<20070509174138.GB17734@mellanox.co.il>
Message-ID: <4642C8DB.1090205@voltaire.com>

This issue was found during partitioning & SM fail over testing.

 * Add a flush flag to ipoib_ib_dev_stop(), as ipoib_ib_dev_down() already has
 * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
 * Obtain the pkey index before entering init_qp, and save it in dev_priv
 * Upon a PKEY_CHANGE event, schedule a work item that restarts the QP
 * Precondition the restart on whether the pkey index has really changed.
   Use the cached pkey_index to test this.
 * Restart child interfaces before the parent. They might be up even if the
   parent is down
 * Use an uncached pkey query upon QP initialization

SM reconfiguration or failover possibly causes a shuffling of the values
in the port pkey table. The current implementation only queries for the
index of the pkey once, when it creates the device QP and after that moves
it into working state, and hence does not address this scenario. Fix this
by using the PKEY_CHANGE event as a trigger to reconfigure the device QP.

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |    7 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   95 +++++++++++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |    7 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |   26 ++-----
 4 files changed, 96 insertions(+), 39 deletions(-)
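
Not part of the patch: a stand-alone sketch in C of the index refresh
that ipoib_find_pkey_index() performs in the diff below.  The port P_Key
table is modelled as a plain array, and struct port_state, find_pkey(),
and refresh_pkey_index() are hypothetical names.  Comparing only the low
15 bits (ignoring the membership bit) is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct port_state {
	uint16_t pkey;        /* pkey this interface is bound to */
	uint16_t pkey_index;  /* cached index into the port table */
	bool assigned;        /* cached index is currently valid */
};

/*
 * Linear search of the table.  Assumption: the top bit is the
 * membership bit and is ignored when comparing.
 * Returns 0 and stores the index, or -1 if the pkey is absent.
 */
int find_pkey(const uint16_t *table, size_t len, uint16_t pkey,
	      uint16_t *index)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if ((table[i] & 0x7fff) == (pkey & 0x7fff)) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -1;
}

/*
 * Refresh the cached index.  On success returns 0 and, if 'changed'
 * is non-NULL, reports whether the index moved or was not valid before.
 * If the pkey has left the table, clears 'assigned' and returns -1.
 */
int refresh_pkey_index(struct port_state *st, const uint16_t *table,
		       size_t len, bool *changed)
{
	uint16_t new_index;

	if (find_pkey(table, len, st->pkey, &new_index)) {
		st->assigned = false;
		return -1;
	}

	if (changed)
		*changed = !st->assigned || st->pkey_index != new_index;

	st->pkey_index = new_index;
	st->assigned = true;
	return 0;
}
```

A caller handling a pkey change event would skip the QP restart when the
refresh succeeds and reports no change, which is the common case when the
SM rewrites the table without moving this pkey.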

Index: b/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-10 08:34:58.335171047 +0300
@@ -202,15 +202,17 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
+	struct delayed_work pkey_poll_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
+	struct work_struct pkey_event_task;
 
 	struct ib_device *ca;
 	u8            	  port;
 	u16           	  pkey;
+	u16               pkey_index;
 	struct ib_pd  	 *pd;
 	struct ib_mr  	 *mr;
 	struct ib_cq  	 *cq;
@@ -333,12 +335,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-10 10:19:35.041587502 +0300
@@ -408,11 +408,40 @@ void ipoib_reap_ah(struct work_struct *w
 		queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ);
 }
 
+static int ipoib_find_pkey_index(struct ipoib_dev_priv *priv, int *is_changed)
+{
+	u16 new_index;
+
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) {
+		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+		return -ENXIO;
+	}
+
+	if (is_changed)
+		*is_changed = !test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) ||
+				priv->pkey_index != new_index;
+
+	priv->pkey_index = new_index;
+	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	return 0;
+}
+
 int ipoib_ib_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
 
+	/*
+	 * Search through the port P_Key table for the requested pkey value.
+	 * The port has to be assigned to the respective IB partition in
+	 * advance.
+	 */
+	ret = ipoib_find_pkey_index(priv, NULL);
+	if (ret) {
+		ipoib_warn(priv, "pkey 0x%04x not found\n", priv->pkey);
+		return -1;
+	}
+
 	ret = ipoib_init_qp(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret);
@@ -422,14 +451,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -481,7 +510,7 @@ int ipoib_ib_dev_down(struct net_device 
 	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
 		mutex_lock(&pkey_mutex);
 		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
+		cancel_delayed_work(&priv->pkey_poll_task);
 		mutex_unlock(&pkey_mutex);
 		if (flush)
 			flush_workqueue(ipoib_workqueue);
@@ -508,7 +537,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +610,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,13 +652,22 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
+	int is_index_changed;
+
+	mutex_lock(&priv->vlan_mutex);
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
+	/* Flush any child interfaces too -
+ 	 * they might be up even if the parent is down */
+ 	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, pkey_event);
+
+	mutex_unlock(&priv->vlan_mutex);
+
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
 		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
@@ -638,10 +677,22 @@ void ipoib_ib_dev_flush(struct work_stru
 		return;
 	}
 
+	if (pkey_event &&
+	    !ipoib_find_pkey_index(priv, &is_index_changed) &&
+	    !is_index_changed) {
+	    	ipoib_dbg(priv, "Not flushing - pkey index not changed.\n");
+		return;
+	}
+
 	ipoib_dbg(priv, "flushing\n");
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (pkey_event) {
+		ipoib_ib_dev_stop(dev, 0);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +701,24 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	mutex_unlock(&priv->vlan_mutex);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_event_task);
+
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -685,7 +746,7 @@ void ipoib_ib_dev_cleanup(struct net_dev
 void ipoib_pkey_poll(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
+		container_of(work, struct ipoib_dev_priv, pkey_poll_task.work);
 	struct net_device *dev = priv->dev;
 
 	ipoib_pkey_dev_check_presence(dev);
@@ -696,7 +757,7 @@ void ipoib_pkey_poll(struct work_struct 
 		mutex_lock(&pkey_mutex);
 		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
+					   &priv->pkey_poll_task,
 					   HZ);
 		mutex_unlock(&pkey_mutex);
 	}
@@ -715,7 +776,7 @@ int ipoib_pkey_dev_delay_open(struct net
 		mutex_lock(&pkey_mutex);
 		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
 		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
+				   &priv->pkey_poll_task,
 				   HZ);
 		mutex_unlock(&pkey_mutex);
 		return 1;
Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-09 17:21:03.000000000 +0300
@@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev)
 		return -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-10 09:13:28.997127223 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
@@ -94,26 +92,17 @@ int ipoib_init_qp(struct net_device *dev
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
-	u16 pkey_index;
 	struct ib_qp_attr qp_attr;
 	int attr_mask;
 
-	/*
-	 * Search through the port P_Key table for the requested pkey value.
-	 * The port has to be assigned to the respective IB partition in
-	 * advance.
-	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
-	if (ret) {
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
-		return ret;
-	}
-	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	/* Make sure we have a valid pkey_index in priv->pkey_index */
+	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
+		return -1;
 
 	qp_attr.qp_state = IB_QPS_INIT;
 	qp_attr.qkey = 0;
 	qp_attr.port_num = priv->port;
-	qp_attr.pkey_index = pkey_index;
+	qp_attr.pkey_index = priv->pkey_index;
 	attr_mask =
 	    IB_QP_QKEY |
 	    IB_QP_PORT |
@@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler
 		container_of(handler, struct ipoib_dev_priv, event_handler);
 
 	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
 	     record->event == IB_EVENT_PORT_ACTIVE ||
 	     record->event == IB_EVENT_LID_CHANGE  ||
 	     record->event == IB_EVENT_SM_CHANGE   ||
@@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler
 	    record->element.port_num == priv->port) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
+		   record->element.port_num == priv->port) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_event_task);
 	}
 }


From vlad at mellanox.co.il  Thu May 10 00:27:15 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Thu, 10 May 2007 10:27:15 +0300
Subject: [ofa-general] Re: [GIT PULL] OFED 1.2 librdmacm
In-Reply-To: <000001c7925d$b752cfe0$e598070a@amr.corp.intel.com>
References: <000001c7925d$b752cfe0$e598070a@amr.corp.intel.com>
Message-ID: <1178782035.7967.2.camel@vladsk-laptop>

On Wed, 2007-05-09 at 10:16 -0700, Sean Hefty wrote:
> Please pull in the latest librdmacm ofed_1_2 tree.  This will add a fix for
> rping and man pages.
> 
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>

Done.


-- 
Vladimir Sokolovsky <vlad at mellanox.co.il>
Mellanox Technologies Ltd.


From vlad at mellanox.co.il  Thu May 10 00:28:11 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Thu, 10 May 2007 10:28:11 +0300
Subject: [ofa-general] Re: [GIT PULL] OFED 1.2 uDAPL
In-Reply-To: <464224F0.6020408@ichips.intel.com>
References: <464224F0.6020408@ichips.intel.com>
Message-ID: <1178782091.7967.4.camel@vladsk-laptop>

On Wed, 2007-05-09 at 12:45 -0700, Arlin Davis wrote:
> Vlad, please pull latest from uDAPL project (ofed_1_2 branch)
> 
> Signed-off by: Arlin Davis ardavis at ichips.intel.com
> 
> Bug Fixes:
> - 606: Return local and remote ports with dat_ep_query
> - 585: Add bonding example to dat.conf

Done.


-- 
Vladimir Sokolovsky <vlad at mellanox.co.il>
Mellanox Technologies Ltd.


From fenkes at de.ibm.com  Thu May 10 01:55:46 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 10 May 2007 10:55:46 +0200
Subject: [ofa-general] Re: [PATCH 1/6] IB/ehca: Serialize hypervisor calls in
	ehca_register_mr()
In-Reply-To: <200705091347.57470.fenkes@de.ibm.com>
References: <200705091347.57470.fenkes@de.ibm.com>
Message-ID: <200705101055.46433.fenkes@de.ibm.com>

On Wednesday 09 May 2007 13:47, Joachim Fenkes wrote:

> --- a/drivers/infiniband/hw/ehca/hcp_if.c
> +++ b/drivers/infiniband/hw/ehca/hcp_if.c
> @@ -154,7 +154,9 @@ static long ehca_plpar_hcall9(unsigned l
>  			      unsigned long arg9)
>  {
>  	long ret;
> -	int i, sleep_msecs;
> +	int i, sleep_msecs, lock_is_set = 0;
> +	unsigned long flags;
> +
>  
>  	ehca_gen_dbg("opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx "
>  		     "arg5=%lx arg6=%lx arg7=%lx arg8=%lx arg9=%lx",
> @@ -162,10 +164,18 @@ static long ehca_plpar_hcall9(unsigned l

Whoops, that's one empty line too many...
Roland, when you apply this patch, could you apply the following patch on top:

--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -157,7 +157,6 @@ static long ehca_plpar_hcall9(unsigned l
        int i, sleep_msecs, lock_is_set = 0;
        unsigned long flags;

-
        ehca_gen_dbg("opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx "
                     "arg5=%lx arg6=%lx arg7=%lx arg8=%lx arg9=%lx",
                     opcode, arg1, arg2, arg3, arg4, arg5, arg6, arg7,


Thanks!
  Joachim

-- 
Joachim Fenkes  --  eHCA Linux Driver Developer and Hardware Tamer
IBM Deutschland Entwicklung GmbH  --  Dept. 3627 (I/O Firmware Dev. 2)
Schoenaicher Strasse 220  --  71032 Boeblingen  --  Germany
eMail: fenkes at de.ibm.com  --  Phone: +49 7031 16 1239


From mst at dev.mellanox.co.il  Thu May 10 02:29:25 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 12:29:25 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons
	for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <4641D38A.8040406@voltaire.com>
References: <4641D295.5060907@voltaire.com> <4641D38A.8040406@voltaire.com>
Message-ID: <20070510092925.GB13655@mellanox.co.il>

> Quoting Erez Zilber <erezz at voltaire.com>:
> Subject: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
> 
> 
> Add the required backport patches & kernel addons for open-iscsi
> over iSER in RHAS4 up3 and up4.
> 
> Signed-off-by: Erez Zilber <erezz at voltaire.com>

In addition to posting patches, could you please publish a git tree to pull from?
This makes it easy to test-build the patch, as our build system
knows how to do a git checkout.

---

Two general comments:
A: Please move code from kernel_patches to kernel_addons as much
   as possible. There are many places where you just add new headers,
   add #include directives, change the function called, or
   remove extra parameters. All of this can and should be done through addons.

B: Please do not add code to core unless there is more than one user;
   add it to the iser module instead. This way, if there is a
   compilation failure there, you do not break core for people.

Some specifics below:
....

> diff --git a/kernel_patches/backport/2.6.9_U3/add_iscsi_proto_h.patch b/kernel_patches/backport/2.6.9_U3/add_iscsi_proto_h.patch
> new file mode 100644
> index 0000000..c4df6bb
> --- /dev/null
> +++ b/kernel_patches/backport/2.6.9_U3/add_iscsi_proto_h.patch
> @@ -0,0 +1,591 @@
> +diff -rupN linux-2.6.20-like-rh4/include/scsi/iscsi_proto.h linux-2.6.20/include/scsi/iscsi_proto.h
> +--- linux-2.6.20-like-rh4/include/scsi/iscsi_proto.h	1970-01-01 02:00:00.000000000 +0200
> ++++ linux-2.6.20/include/scsi/iscsi_proto.h	2007-02-04 20:44:54.000000000 +0200
> +@@ -0,0 +1,587 @@
> ++/*
> ++ * RFC 3720 (iSCSI) protocol data types
> ++ *
> ++ * Copyright (C) 2005 Dmitry Yusupov
> ++ * Copyright (C) 2005 Alex Aizman
> ++ * maintained by open-iscsi at googlegroups.com
> ++ *
> ++ * This program is free software; you can redistribute it and/or modify
> ++ * it under the terms of the GNU General Public License as published
> ++ * by the Free Software Foundation; either version 2 of the License, or
> ++ * (at your option) any later version.
> ++ *
> ++ * This program is distributed in the hope that it will be useful, but
> ++ * WITHOUT ANY WARRANTY; without even the implied warranty of
> ++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> ++ * General Public License for more details.
> ++ *
> ++ * See the file COPYING included with this distribution for more details.
> ++ */
> ++
> ++#ifndef ISCSI_PROTO_H
> ++#define ISCSI_PROTO_H
> ++
> ++#define ISCSI_DRAFT20_VERSION	0x00
> ++
> ++/* default iSCSI listen port for incoming connections */
> ++#define ISCSI_LISTEN_PORT	3260
> ++
> ++/* Padding word length */
> ++#define PAD_WORD_LEN		4
> ++
> ++/*
> ++ * useful common(control and data pathes) macro
> ++ */
> ++#define ntoh24(p) (((p)[0] << 16) | ((p)[1] << 8) | ((p)[2]))
> ++#define hton24(p, v) { \
> ++        p[0] = (((v) >> 16) & 0xFF); \
> ++        p[1] = (((v) >> 8) & 0xFF); \
> ++        p[2] = ((v) & 0xFF); \
> ++}
> ++#define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;}
> ++
> ++/*
> ++ * iSCSI Template Message Header
> ++ */
> ++struct iscsi_hdr {
> ++	uint8_t		opcode;
> ++	uint8_t		flags;		/* Final bit */
> ++	uint8_t		rsvd2[2];
> ++	uint8_t		hlength;	/* AHSs total length */
> ++	uint8_t		dlength[3];	/* Data length */
> ++	uint8_t		lun[8];
> ++	__be32		itt;		/* Initiator Task Tag */
> ++	__be32		ttt;		/* Target Task Tag */
> ++	__be32		statsn;
> ++	__be32		exp_statsn;
> ++	__be32		max_statsn;
> ++	uint8_t		other[12];
> ++};
> ++
> ++/************************* RFC 3720 Begin *****************************/
> ++
> ++#define ISCSI_RESERVED_TAG		0xffffffff
> ++
> ++/* Opcode encoding bits */
> ++#define ISCSI_OP_RETRY			0x80
> ++#define ISCSI_OP_IMMEDIATE		0x40
> ++#define ISCSI_OPCODE_MASK		0x3F
> ++
> ++/* Initiator Opcode values */
> ++#define ISCSI_OP_NOOP_OUT		0x00
> ++#define ISCSI_OP_SCSI_CMD		0x01
> ++#define ISCSI_OP_SCSI_TMFUNC		0x02
> ++#define ISCSI_OP_LOGIN			0x03
> ++#define ISCSI_OP_TEXT			0x04
> ++#define ISCSI_OP_SCSI_DATA_OUT		0x05
> ++#define ISCSI_OP_LOGOUT			0x06
> ++#define ISCSI_OP_SNACK			0x10
> ++
> ++#define ISCSI_OP_VENDOR1_CMD		0x1c
> ++#define ISCSI_OP_VENDOR2_CMD		0x1d
> ++#define ISCSI_OP_VENDOR3_CMD		0x1e
> ++#define ISCSI_OP_VENDOR4_CMD		0x1f
> ++
> ++/* Target Opcode values */
> ++#define ISCSI_OP_NOOP_IN		0x20
> ++#define ISCSI_OP_SCSI_CMD_RSP		0x21
> ++#define ISCSI_OP_SCSI_TMFUNC_RSP	0x22
> ++#define ISCSI_OP_LOGIN_RSP		0x23
> ++#define ISCSI_OP_TEXT_RSP		0x24
> ++#define ISCSI_OP_SCSI_DATA_IN		0x25
> ++#define ISCSI_OP_LOGOUT_RSP		0x26
> ++#define ISCSI_OP_R2T			0x31
> ++#define ISCSI_OP_ASYNC_EVENT		0x32
> ++#define ISCSI_OP_REJECT			0x3f
> ++
> ++struct iscsi_ahs_hdr {
> ++	__be16 ahslength;
> ++	uint8_t ahstype;
> ++	uint8_t ahspec[5];
> ++};
> ++
> ++#define ISCSI_AHSTYPE_CDB		1
> ++#define ISCSI_AHSTYPE_RLENGTH		2
> ++
> ++/* iSCSI PDU Header */
> ++struct iscsi_cmd {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	__be16 rsvd2;
> ++	uint8_t hlength;
> ++	uint8_t dlength[3];
> ++	uint8_t lun[8];
> ++	__be32 itt;	/* Initiator Task Tag */
> ++	__be32 data_length;
> ++	__be32 cmdsn;
> ++	__be32 exp_statsn;
> ++	uint8_t cdb[16];	/* SCSI Command Block */
> ++	/* Additional Data (Command Dependent) */
> ++};
> ++
> ++/* Command PDU flags */
> ++#define ISCSI_FLAG_CMD_FINAL		0x80
> ++#define ISCSI_FLAG_CMD_READ		0x40
> ++#define ISCSI_FLAG_CMD_WRITE		0x20
> ++#define ISCSI_FLAG_CMD_ATTR_MASK	0x07	/* 3 bits */
> ++
> ++/* SCSI Command Attribute values */
> ++#define ISCSI_ATTR_UNTAGGED		0
> ++#define ISCSI_ATTR_SIMPLE		1
> ++#define ISCSI_ATTR_ORDERED		2
> ++#define ISCSI_ATTR_HEAD_OF_QUEUE	3
> ++#define ISCSI_ATTR_ACA			4
> ++
> ++struct iscsi_rlength_ahdr {
> ++	__be16 ahslength;
> ++	uint8_t ahstype;
> ++	uint8_t reserved;
> ++	__be32 read_length;
> ++};
> ++
> ++/* SCSI Response Header */
> ++struct iscsi_cmd_rsp {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t response;
> ++	uint8_t cmd_status;
> ++	uint8_t hlength;
> ++	uint8_t dlength[3];
> ++	uint8_t rsvd[8];
> ++	__be32	itt;	/* Initiator Task Tag */
> ++	__be32	rsvd1;
> ++	__be32	statsn;
> ++	__be32	exp_cmdsn;
> ++	__be32	max_cmdsn;
> ++	__be32	exp_datasn;
> ++	__be32	bi_residual_count;
> ++	__be32	residual_count;
> ++	/* Response or Sense Data (optional) */
> ++};
> ++
> ++/* Command Response PDU flags */
> ++#define ISCSI_FLAG_CMD_BIDI_OVERFLOW	0x10
> ++#define ISCSI_FLAG_CMD_BIDI_UNDERFLOW	0x08
> ++#define ISCSI_FLAG_CMD_OVERFLOW		0x04
> ++#define ISCSI_FLAG_CMD_UNDERFLOW	0x02
> ++
> ++/* iSCSI Status values. Valid if Rsp Selector bit is not set */
> ++#define ISCSI_STATUS_CMD_COMPLETED	0
> ++#define ISCSI_STATUS_TARGET_FAILURE	1
> ++#define ISCSI_STATUS_SUBSYS_FAILURE	2
> ++
> ++/* Asynchronous Event Header */
> ++struct iscsi_async {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t rsvd2[2];
> ++	uint8_t rsvd3;
> ++	uint8_t dlength[3];
> ++	uint8_t lun[8];
> ++	uint8_t rsvd4[8];
> ++	__be32	statsn;
> ++	__be32	exp_cmdsn;
> ++	__be32	max_cmdsn;
> ++	uint8_t async_event;
> ++	uint8_t async_vcode;
> ++	__be16	param1;
> ++	__be16	param2;
> ++	__be16	param3;
> ++	uint8_t rsvd5[4];
> ++};
> ++
> ++/* iSCSI Event Codes */
> ++#define ISCSI_ASYNC_MSG_SCSI_EVENT			0
> ++#define ISCSI_ASYNC_MSG_REQUEST_LOGOUT			1
> ++#define ISCSI_ASYNC_MSG_DROPPING_CONNECTION		2
> ++#define ISCSI_ASYNC_MSG_DROPPING_ALL_CONNECTIONS	3
> ++#define ISCSI_ASYNC_MSG_PARAM_NEGOTIATION		4
> ++#define ISCSI_ASYNC_MSG_VENDOR_SPECIFIC			255
> ++
> ++/* NOP-Out Message */
> ++struct iscsi_nopout {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	__be16	rsvd2;
> ++	uint8_t rsvd3;
> ++	uint8_t dlength[3];
> ++	uint8_t lun[8];
> ++	__be32	itt;	/* Initiator Task Tag */
> ++	__be32	ttt;	/* Target Transfer Tag */
> ++	__be32	cmdsn;
> ++	__be32	exp_statsn;
> ++	uint8_t rsvd4[16];
> ++};
> ++
> ++/* NOP-In Message */
> ++struct iscsi_nopin {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	__be16	rsvd2;
> ++	uint8_t rsvd3;
> ++	uint8_t dlength[3];
> ++	uint8_t lun[8];
> ++	__be32	itt;	/* Initiator Task Tag */
> ++	__be32	ttt;	/* Target Transfer Tag */
> ++	__be32	statsn;
> ++	__be32	exp_cmdsn;
> ++	__be32	max_cmdsn;
> ++	uint8_t rsvd4[12];
> ++};
> ++
> ++/* SCSI Task Management Message Header */
> ++struct iscsi_tm {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t rsvd1[2];
> ++	uint8_t hlength;
> ++	uint8_t dlength[3];
> ++	uint8_t lun[8];
> ++	__be32	itt;	/* Initiator Task Tag */
> ++	__be32	rtt;	/* Reference Task Tag */
> ++	__be32	cmdsn;
> ++	__be32	exp_statsn;
> ++	__be32	refcmdsn;
> ++	__be32	exp_datasn;
> ++	uint8_t rsvd2[8];
> ++};
> ++
> ++#define ISCSI_FLAG_TM_FUNC_MASK			0x7F
> ++
> ++/* Function values */
> ++#define ISCSI_TM_FUNC_ABORT_TASK		1
> ++#define ISCSI_TM_FUNC_ABORT_TASK_SET		2
> ++#define ISCSI_TM_FUNC_CLEAR_ACA			3
> ++#define ISCSI_TM_FUNC_CLEAR_TASK_SET		4
> ++#define ISCSI_TM_FUNC_LOGICAL_UNIT_RESET	5
> ++#define ISCSI_TM_FUNC_TARGET_WARM_RESET		6
> ++#define ISCSI_TM_FUNC_TARGET_COLD_RESET		7
> ++#define ISCSI_TM_FUNC_TASK_REASSIGN		8
> ++
> ++/* SCSI Task Management Response Header */
> ++struct iscsi_tm_rsp {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t response;	/* see Response values below */
> ++	uint8_t qualifier;
> ++	uint8_t hlength;
> ++	uint8_t dlength[3];
> ++	uint8_t rsvd2[8];
> ++	__be32	itt;	/* Initiator Task Tag */
> ++	__be32	rtt;	/* Reference Task Tag */
> ++	__be32	statsn;
> ++	__be32	exp_cmdsn;
> ++	__be32	max_cmdsn;
> ++	uint8_t rsvd3[12];
> ++};
> ++
> ++/* Response values */
> ++#define ISCSI_TMF_RSP_COMPLETE		0x00
> ++#define ISCSI_TMF_RSP_NO_TASK		0x01
> ++#define ISCSI_TMF_RSP_NO_LUN		0x02
> ++#define ISCSI_TMF_RSP_TASK_ALLEGIANT	0x03
> ++#define ISCSI_TMF_RSP_NO_FAILOVER	0x04
> ++#define ISCSI_TMF_RSP_NOT_SUPPORTED	0x05
> ++#define ISCSI_TMF_RSP_AUTH_FAILED	0x06
> ++#define ISCSI_TMF_RSP_REJECTED		0xff
> ++
> ++/* Ready To Transfer Header */
> ++struct iscsi_r2t_rsp {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t rsvd2[2];
> ++	uint8_t	hlength;
> ++	uint8_t	dlength[3];
> ++	uint8_t lun[8];
> ++	__be32	itt;	/* Initiator Task Tag */
> ++	__be32	ttt;	/* Target Transfer Tag */
> ++	__be32	statsn;
> ++	__be32	exp_cmdsn;
> ++	__be32	max_cmdsn;
> ++	__be32	r2tsn;
> ++	__be32	data_offset;
> ++	__be32	data_length;
> ++};
> ++
> ++/* SCSI Data Hdr */
> ++struct iscsi_data {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t rsvd2[2];
> ++	uint8_t rsvd3;
> ++	uint8_t dlength[3];
> ++	uint8_t lun[8];
> ++	__be32	itt;
> ++	__be32	ttt;
> ++	__be32	rsvd4;
> ++	__be32	exp_statsn;
> ++	__be32	rsvd5;
> ++	__be32	datasn;
> ++	__be32	offset;
> ++	__be32	rsvd6;
> ++	/* Payload */
> ++};
> ++
> ++/* SCSI Data Response Hdr */
> ++struct iscsi_data_rsp {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t rsvd2;
> ++	uint8_t cmd_status;
> ++	uint8_t hlength;
> ++	uint8_t dlength[3];
> ++	uint8_t lun[8];
> ++	__be32	itt;
> ++	__be32	ttt;
> ++	__be32	statsn;
> ++	__be32	exp_cmdsn;
> ++	__be32	max_cmdsn;
> ++	__be32	datasn;
> ++	__be32	offset;
> ++	__be32	residual_count;
> ++};
> ++
> ++/* Data Response PDU flags */
> ++#define ISCSI_FLAG_DATA_ACK		0x40
> ++#define ISCSI_FLAG_DATA_OVERFLOW	0x04
> ++#define ISCSI_FLAG_DATA_UNDERFLOW	0x02
> ++#define ISCSI_FLAG_DATA_STATUS		0x01
> ++
> ++/* Text Header */
> ++struct iscsi_text {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t rsvd2[2];
> ++	uint8_t hlength;
> ++	uint8_t dlength[3];
> ++	uint8_t rsvd4[8];
> ++	__be32	itt;
> ++	__be32	ttt;
> ++	__be32	cmdsn;
> ++	__be32	exp_statsn;
> ++	uint8_t rsvd5[16];
> ++	/* Text - key=value pairs */
> ++};
> ++
> ++#define ISCSI_FLAG_TEXT_CONTINUE	0x40
> ++
> ++/* Text Response Header */
> ++struct iscsi_text_rsp {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t rsvd2[2];
> ++	uint8_t hlength;
> ++	uint8_t dlength[3];
> ++	uint8_t rsvd4[8];
> ++	__be32	itt;
> ++	__be32	ttt;
> ++	__be32	statsn;
> ++	__be32	exp_cmdsn;
> ++	__be32	max_cmdsn;
> ++	uint8_t rsvd5[12];
> ++	/* Text Response - key:value pairs */
> ++};
> ++
> ++/* Login Header */
> ++struct iscsi_login {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t max_version;	/* Max. version supported */
> ++	uint8_t min_version;	/* Min. version supported */
> ++	uint8_t hlength;
> ++	uint8_t dlength[3];
> ++	uint8_t isid[6];	/* Initiator Session ID */
> ++	__be16	tsih;	/* Target Session Handle */
> ++	__be32	itt;	/* Initiator Task Tag */
> ++	__be16	cid;
> ++	__be16	rsvd3;
> ++	__be32	cmdsn;
> ++	__be32	exp_statsn;
> ++	uint8_t rsvd5[16];
> ++};
> ++
> ++/* Login PDU flags */
> ++#define ISCSI_FLAG_LOGIN_TRANSIT		0x80
> ++#define ISCSI_FLAG_LOGIN_CONTINUE		0x40
> ++#define ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK	0x0C	/* 2 bits */
> ++#define ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK	0x03	/* 2 bits */
> ++
> ++#define ISCSI_LOGIN_CURRENT_STAGE(flags) \
> ++	((flags & ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK) >> 2)
> ++#define ISCSI_LOGIN_NEXT_STAGE(flags) \
> ++	(flags & ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK)
> ++
> ++/* Login Response Header */
> ++struct iscsi_login_rsp {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t max_version;	/* Max. version supported */
> ++	uint8_t active_version;	/* Active version */
> ++	uint8_t hlength;
> ++	uint8_t dlength[3];
> ++	uint8_t isid[6];	/* Initiator Session ID */
> ++	__be16	tsih;	/* Target Session Handle */
> ++	__be32	itt;	/* Initiator Task Tag */
> ++	__be32	rsvd3;
> ++	__be32	statsn;
> ++	__be32	exp_cmdsn;
> ++	__be32	max_cmdsn;
> ++	uint8_t status_class;	/* see Login RSP ststus classes below */
> ++	uint8_t status_detail;	/* see Login RSP Status details below */
> ++	uint8_t rsvd4[10];
> ++};
> ++
> ++/* Login stage (phase) codes for CSG, NSG */
> ++#define ISCSI_INITIAL_LOGIN_STAGE		-1
> ++#define ISCSI_SECURITY_NEGOTIATION_STAGE	0
> ++#define ISCSI_OP_PARMS_NEGOTIATION_STAGE	1
> ++#define ISCSI_FULL_FEATURE_PHASE		3
> ++
> ++/* Login Status response classes */
> ++#define ISCSI_STATUS_CLS_SUCCESS		0x00
> ++#define ISCSI_STATUS_CLS_REDIRECT		0x01
> ++#define ISCSI_STATUS_CLS_INITIATOR_ERR		0x02
> ++#define ISCSI_STATUS_CLS_TARGET_ERR		0x03
> ++
> ++/* Login Status response detail codes */
> ++/* Class-0 (Success) */
> ++#define ISCSI_LOGIN_STATUS_ACCEPT		0x00
> ++
> ++/* Class-1 (Redirection) */
> ++#define ISCSI_LOGIN_STATUS_TGT_MOVED_TEMP	0x01
> ++#define ISCSI_LOGIN_STATUS_TGT_MOVED_PERM	0x02
> ++
> ++/* Class-2 (Initiator Error) */
> ++#define ISCSI_LOGIN_STATUS_INIT_ERR		0x00
> ++#define ISCSI_LOGIN_STATUS_AUTH_FAILED		0x01
> ++#define ISCSI_LOGIN_STATUS_TGT_FORBIDDEN	0x02
> ++#define ISCSI_LOGIN_STATUS_TGT_NOT_FOUND	0x03
> ++#define ISCSI_LOGIN_STATUS_TGT_REMOVED		0x04
> ++#define ISCSI_LOGIN_STATUS_NO_VERSION		0x05
> ++#define ISCSI_LOGIN_STATUS_ISID_ERROR		0x06
> ++#define ISCSI_LOGIN_STATUS_MISSING_FIELDS	0x07
> ++#define ISCSI_LOGIN_STATUS_CONN_ADD_FAILED	0x08
> ++#define ISCSI_LOGIN_STATUS_NO_SESSION_TYPE	0x09
> ++#define ISCSI_LOGIN_STATUS_NO_SESSION		0x0a
> ++#define ISCSI_LOGIN_STATUS_INVALID_REQUEST	0x0b
> ++
> ++/* Class-3 (Target Error) */
> ++#define ISCSI_LOGIN_STATUS_TARGET_ERROR		0x00
> ++#define ISCSI_LOGIN_STATUS_SVC_UNAVAILABLE	0x01
> ++#define ISCSI_LOGIN_STATUS_NO_RESOURCES		0x02
> ++
> ++/* Logout Header */
> ++struct iscsi_logout {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t rsvd1[2];
> ++	uint8_t hlength;
> ++	uint8_t dlength[3];
> ++	uint8_t rsvd2[8];
> ++	__be32	itt;	/* Initiator Task Tag */
> ++	__be16	cid;
> ++	uint8_t rsvd3[2];
> ++	__be32	cmdsn;
> ++	__be32	exp_statsn;
> ++	uint8_t rsvd4[16];
> ++};
> ++
> ++/* Logout PDU flags */
> ++#define ISCSI_FLAG_LOGOUT_REASON_MASK	0x7F
> ++
> ++/* logout reason_code values */
> ++
> ++#define ISCSI_LOGOUT_REASON_CLOSE_SESSION	0
> ++#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION	1
> ++#define ISCSI_LOGOUT_REASON_RECOVERY		2
> ++#define ISCSI_LOGOUT_REASON_AEN_REQUEST		3
> ++
> ++/* Logout Response Header */
> ++struct iscsi_logout_rsp {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t response;	/* see Logout response values below */
> ++	uint8_t rsvd2;
> ++	uint8_t hlength;
> ++	uint8_t dlength[3];
> ++	uint8_t rsvd3[8];
> ++	__be32	itt;	/* Initiator Task Tag */
> ++	__be32	rsvd4;
> ++	__be32	statsn;
> ++	__be32	exp_cmdsn;
> ++	__be32	max_cmdsn;
> ++	__be32	rsvd5;
> ++	__be16	t2wait;
> ++	__be16	t2retain;
> ++	__be32	rsvd6;
> ++};
> ++
> ++/* logout response status values */
> ++
> ++#define ISCSI_LOGOUT_SUCCESS			0
> ++#define ISCSI_LOGOUT_CID_NOT_FOUND		1
> ++#define ISCSI_LOGOUT_RECOVERY_UNSUPPORTED	2
> ++#define ISCSI_LOGOUT_CLEANUP_FAILED		3
> ++
> ++/* SNACK Header */
> ++struct iscsi_snack {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t rsvd2[14];
> ++	__be32	itt;
> ++	__be32	begrun;
> ++	__be32	runlength;
> ++	__be32	exp_statsn;
> ++	__be32	rsvd3;
> ++	__be32	exp_datasn;
> ++	uint8_t rsvd6[8];
> ++};
> ++
> ++/* SNACK PDU flags */
> ++#define ISCSI_FLAG_SNACK_TYPE_MASK	0x0F	/* 4 bits */
> ++
> ++/* Reject Message Header */
> ++struct iscsi_reject {
> ++	uint8_t opcode;
> ++	uint8_t flags;
> ++	uint8_t reason;
> ++	uint8_t rsvd2;
> ++	uint8_t hlength;
> ++	uint8_t dlength[3];
> ++	uint8_t rsvd3[8];
> ++	__be32  ffffffff;
> ++	uint8_t rsvd4[4];
> ++	__be32	statsn;
> ++	__be32	exp_cmdsn;
> ++	__be32	max_cmdsn;
> ++	__be32	datasn;
> ++	uint8_t rsvd5[8];
> ++	/* Text - Rejected hdr */
> ++};
> ++
> ++/* Reason for Reject */
> ++#define ISCSI_REASON_CMD_BEFORE_LOGIN	1
> ++#define ISCSI_REASON_DATA_DIGEST_ERROR	2
> ++#define ISCSI_REASON_DATA_SNACK_REJECT	3
> ++#define ISCSI_REASON_PROTOCOL_ERROR	4
> ++#define ISCSI_REASON_CMD_NOT_SUPPORTED	5
> ++#define ISCSI_REASON_IMM_CMD_REJECT		6
> ++#define ISCSI_REASON_TASK_IN_PROGRESS	7
> ++#define ISCSI_REASON_INVALID_SNACK		8
> ++#define ISCSI_REASON_BOOKMARK_INVALID	9
> ++#define ISCSI_REASON_BOOKMARK_NO_RESOURCES	10
> ++#define ISCSI_REASON_NEGOTIATION_RESET	11
> ++
> ++/* Max. number of Key=Value pairs in a text message */
> ++#define MAX_KEY_VALUE_PAIRS	8192
> ++
> ++/* maximum length for text keys/values */
> ++#define KEY_MAXLEN		64
> ++#define VALUE_MAXLEN		255
> ++#define TARGET_NAME_MAXLEN	VALUE_MAXLEN
> ++
> ++#define DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH	8192
> ++
> ++/************************* RFC 3720 End *****************************/
> ++
> ++#endif /* ISCSI_PROTO_H */

Why isn't the above in addons?

...

> diff --git a/kernel_patches/backport/2.6.9_U3/add_memory_h.patch b/kernel_patches/backport/2.6.9_U3/add_memory_h.patch
> new file mode 100644
> index 0000000..5daad2e
> --- /dev/null
> +++ b/kernel_patches/backport/2.6.9_U3/add_memory_h.patch
> @@ -0,0 +1,93 @@
> +diff -rupN linux-2.6.20-like-rh4/include/linux/memory.h linux-2.6.20/include/linux/memory.h
> +--- linux-2.6.20-like-rh4/include/linux/memory.h	1970-01-01 02:00:00.000000000 +0200
> ++++ linux-2.6.20/include/linux/memory.h	2007-02-04 20:44:54.000000000 +0200
> +@@ -0,0 +1,89 @@
> ++/*
> ++ * include/linux/memory.h - generic memory definition
> ++ *
> ++ * This is mainly for topological representation. We define the
> ++ * basic "struct memory_block" here, which can be embedded in per-arch
> ++ * definitions or NUMA information.
> ++ *
> ++ * Basic handling of the devices is done in drivers/base/memory.c
> ++ * and system devices are handled in drivers/base/sys.c.
> ++ *
> ++ * Memory block are exported via sysfs in the class/memory/devices/
> ++ * directory.
> ++ *
> ++ */
> ++#ifndef _LINUX_MEMORY_H_
> ++#define _LINUX_MEMORY_H_
> ++
> ++#include <linux/sysdev.h>
> ++#include <linux/node.h>
> ++#include <linux/compiler.h>
> ++
> ++#include <asm/semaphore.h>
> ++
> ++struct memory_block {
> ++	unsigned long phys_index;
> ++	unsigned long state;
> ++	/*
> ++	 * This serializes all state change requests.  It isn't
> ++	 * held during creation because the control files are
> ++	 * created long after the critical areas during
> ++	 * initialization.
> ++	 */
> ++	struct semaphore state_sem;
> ++	int phys_device;		/* to which fru does this belong? */
> ++	void *hw;			/* optional pointer to fw/hw data */
> ++	int (*phys_callback)(struct memory_block *);
> ++	struct sys_device sysdev;
> ++};
> ++
> ++/* These states are exposed to userspace as text strings in sysfs */
> ++#define	MEM_ONLINE		(1<<0) /* exposed to userspace */
> ++#define	MEM_GOING_OFFLINE	(1<<1) /* exposed to userspace */
> ++#define	MEM_OFFLINE		(1<<2) /* exposed to userspace */
> ++
> ++/*
> ++ * All of these states are currently kernel-internal for notifying
> ++ * kernel components and architectures.
> ++ *
> ++ * For MEM_MAPPING_INVALID, all notifier chains with priority >0
> ++ * are called before pfn_to_page() becomes invalid.  The priority=0
> ++ * entry is reserved for the function that actually makes
> ++ * pfn_to_page() stop working.  Any notifiers that want to be called
> ++ * after that should have priority <0.
> ++ */
> ++#define	MEM_MAPPING_INVALID	(1<<3)
> ++
> ++struct notifier_block;
> ++struct mem_section;
> ++
> ++#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE
> ++static inline int memory_dev_init(void)
> ++{
> ++	return 0;
> ++}
> ++static inline int register_memory_notifier(struct notifier_block *nb)
> ++{
> ++	return 0;
> ++}
> ++static inline void unregister_memory_notifier(struct notifier_block *nb)
> ++{
> ++}
> ++#else
> ++extern int register_new_memory(struct mem_section *);
> ++extern int unregister_memory_section(struct mem_section *);
> ++extern int memory_dev_init(void);
> ++extern int remove_memory_block(unsigned long, struct mem_section *, int);
> ++
> ++#define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
> ++
> ++
> ++#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
> ++
> ++#define hotplug_memory_notifier(fn, pri) {			\
> ++	static struct notifier_block fn##_mem_nb =		\
> ++		{ .notifier_call = fn, .priority = pri };	\
> ++	register_memory_notifier(&fn##_mem_nb);			\
> ++}
> ++
> ++#endif /* _LINUX_MEMORY_H_ */

Why isn't this in addons?

> diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch
> new file mode 100644
> index 0000000..d77c663
> --- /dev/null
> +++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch
> @@ -0,0 +1,504 @@
> +diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.c linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.c
> +--- linux-2.6.20/drivers/scsi/iscsi_tcp.c	2007-02-04 20:44:54.000000000 +0200
> ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.c	2007-04-01 13:11:17.000000000 +0300

...
> +@@ -108,7 +108,7 @@ iscsi_hdr_digest(struct iscsi_conn *conn
> + {
> + 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
> + 
> +-	crypto_hash_digest(&tcp_conn->tx_hash, &buf->sg, buf->sg.length, crc);
> ++	crypto_digest_digest(tcp_conn->tx_tfm, &buf->sg, 1, crc);
> + 	buf->sg.length = tcp_conn->hdr_size;
> + }
> + 

You could make it a macro in addons if you had named the new field tx_hash.
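
A rough sketch of that idea, untested on RHAS4 and with a hypothetical shim:
if the backported field keeps the name tx_hash but is declared as a
struct crypto_tfm pointer, an addon header could map the new call onto the
old one. The types and the crypto_digest_digest() body below are user-space
stand-ins so the snippet compiles on its own, and the hard-coded nsg of 1
only holds for call sites that hash a single scatterlist entry, as the hunk
above does.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins; the real definitions live in the kernel headers. */
struct crypto_tfm { int calls; };
struct scatterlist { unsigned int length; };

/* Mock of the older digest entry point, with the argument order used in
 * the quoted hunk: (tfm, sg, nsg, out). */
static void crypto_digest_digest(struct crypto_tfm *tfm, struct scatterlist *sg,
                                 unsigned int nsg, unsigned char *out)
{
    assert(tfm != NULL && sg != NULL && out != NULL);
    tfm->calls++;
    out[0] = (unsigned char)nsg;   /* record nsg so the mapping is observable */
}

/* Hypothetical addon shim: keep the 2.6.20 call syntax and forward it to the
 * old API. Assumes the field is a "struct crypto_tfm *" named tx_hash, so
 * *(desc) yields the tfm pointer. The byte count is dropped; nsg is fixed at 1. */
#define crypto_hash_digest(desc, sg, nbytes, out) \
    crypto_digest_digest(*(desc), (sg), 1, (out))

/* Stand-in for the connection struct, reduced to the one field that matters. */
struct iscsi_tcp_conn_sketch { struct crypto_tfm *tx_hash; };

/* Call site written exactly as in the unpatched 2.6.20 source. */
static unsigned char hdr_digest_sketch(struct iscsi_tcp_conn_sketch *c,
                                       struct scatterlist *sg)
{
    unsigned char crc[4] = {0};

    crypto_hash_digest(&c->tx_hash, sg, sg->length, crc);
    return crc[0];
}
```

With a shim along these lines the call in iscsi_hdr_digest() could stay as it
is in 2.6.20, leaving only the field declaration in iscsi_tcp.h to patch.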

> +@@ -419,7 +419,7 @@ iscsi_r2t_rsp(struct iscsi_conn *conn, s
> + 	tcp_ctask->xmstate |= XMSTATE_SOL_HDR;
> + 	list_move_tail(&ctask->running, &conn->xmitqueue);
> + 
> +-	scsi_queue_work(session->host, &conn->xmitwork);
> ++	schedule_work(&conn->xmitwork);
> + 	conn->r2t_pdus_cnt++;
> + 	spin_unlock(&session->lock);
> + 

Can't this be done with a macro in addons?


Same for other places where this change was done.
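
A rough sketch of the macro approach, again as a userspace mock. The struct
definitions and the schedule_work() stub are hypothetical stand-ins, and
queue_xmit() is an invented call site. Note that such a macro moves the work
from a per-host queue to the shared one, which is the same trade-off the
patch already makes by hand.

```c
/* Hypothetical stand-ins for the kernel types. */
struct work_struct { int queued; };
struct Scsi_Host { int host_no; };

/* Stub for the generic workqueue call: nonzero if the work was queued,
 * zero if it was already pending. */
static int schedule_work(struct work_struct *work)
{
	if (work->queued)
		return 0;
	work->queued = 1;
	return 1;
}

/* Compat one-liner: evaluate and ignore the host argument, then fall back
 * to the shared workqueue. */
#define scsi_queue_work(host, work) \
	((void)(host), schedule_work(work))

/* Upstream-style call site, left unmodified. */
static int queue_xmit(struct Scsi_Host *host, struct work_struct *xmitwork)
{
	return scsi_queue_work(host, xmitwork);
}
```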

> +@@ -2044,11 +2037,13 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
> + 		sk = tcp_conn->sock->sk;
> + 		if (sk->sk_family == PF_INET) {
> + 			inet = inet_sk(sk);
> +-			len = sprintf(buf, NIPQUAD_FMT "\n",
> ++			len = sprintf(buf, "%u.%u.%u.%u\n",
> + 				      NIPQUAD(inet->daddr));
> + 		} else {
> + 			np = inet6_sk(sk);
> +-			len = sprintf(buf, NIP6_FMT "\n", NIP6(np->daddr));
> ++			len = sprintf(buf,
> ++				"%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
> ++				NIP6(np->daddr));
> + 		}
> + 		mutex_unlock(&conn->xmitmutex);
> + 		break;

NIP6_FMT should be defined in addons.
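
For illustration, the definitions could look roughly like this. This is a
userspace sketch: the format strings are the ones the patch open-codes, while
NIP6_WORD and format_ip6() are hypothetical helpers added here so the
fragment can be compiled outside the kernel.

```c
#include <stdio.h>
#include <netinet/in.h>

/* Only define the macros when the (old) kernel headers do not. */
#ifndef NIPQUAD_FMT
#define NIPQUAD_FMT "%u.%u.%u.%u"
#endif

#ifndef NIP6_FMT
#define NIP6_FMT "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x"
#endif

/* Hypothetical helper: the i-th 16-bit group of an IPv6 address,
 * assembled from bytes so that host byte order does not matter. */
#define NIP6_WORD(a, i) \
	((unsigned int)(((a).s6_addr[2 * (i)] << 8) | (a).s6_addr[2 * (i) + 1]))

#ifndef NIP6
#define NIP6(a) \
	NIP6_WORD(a, 0), NIP6_WORD(a, 1), NIP6_WORD(a, 2), NIP6_WORD(a, 3), \
	NIP6_WORD(a, 4), NIP6_WORD(a, 5), NIP6_WORD(a, 6), NIP6_WORD(a, 7)
#endif

/* Call site in the upstream style: format string plus argument macro. */
static int format_ip6(char *buf, size_t len, const struct in6_addr *addr)
{
	return snprintf(buf, len, NIP6_FMT "\n", NIP6(*addr));
}
```

With the macros supplied from addons, the sprintf() calls in
iscsi_tcp_conn_get_param() would not need to be touched by the backport.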


> +@@ -2135,7 +2130,6 @@ static void iscsi_tcp_session_destroy(st
> + static struct scsi_host_template iscsi_sht = {
> + 	.name			= "iSCSI Initiator over TCP/IP",
> + 	.queuecommand           = iscsi_queuecommand,
> +-	.change_queue_depth	= iscsi_change_queue_depth,
> + 	.can_queue		= ISCSI_XMIT_CMDS_MAX - 1,
> + 	.sg_tablesize		= ISCSI_SG_TABLESIZE,
> + 	.cmd_per_lun		= ISCSI_DEF_CMD_PER_LUN,
> +diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.h linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.h
> +--- linux-2.6.20/drivers/scsi/iscsi_tcp.h	2007-02-04 20:44:54.000000000 +0200
> ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.h	2007-04-01 13:11:55.000000000 +0300
> +@@ -49,7 +49,6 @@
> + #define ISCSI_SG_TABLESIZE		SG_ALL
> + #define ISCSI_TCP_MAX_CMD_LEN		16
> + 
> +-struct crypto_hash;
> + struct socket;
> + 
> + /* Socket connection recieve helper */
> +@@ -93,8 +92,8 @@ struct iscsi_tcp_conn {
> + 	void			(*old_write_space)(struct sock *);
> + 
> + 	/* data and header digests */
> +-	struct hash_desc	tx_hash;	/* CRC32C (Tx) */
> +-	struct hash_desc	rx_hash;	/* CRC32C (Rx) */
> ++	struct crypto_tfm	*tx_tfm;	/* CRC32C (Tx) */
> ++	struct crypto_tfm	*rx_tfm;	/* CRC32C (Rx) */
> + 
> + 	/* MIB custom statistics */
> + 	uint32_t		sendpage_failures_cnt;

Name the new field tx_hash (just change the type) and then
you will be able to replace a lot of changes by a one liner in addons.

> +diff -rup linux-2.6.20/drivers/scsi/libiscsi.c linux-2.6.20-backport-rh4-u3/drivers/scsi/libiscsi.c
> +--- linux-2.6.20/drivers/scsi/libiscsi.c	2007-02-04 20:44:54.000000000 +0200
> ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/libiscsi.c	2007-04-01 13:15:57.000000000 +0300
> +@@ -23,6 +23,7 @@
> +  */
> + #include <linux/types.h>
> + #include <linux/mutex.h>
> ++#include <linux/gfp.h>
> + #include <linux/kfifo.h>
> + #include <linux/delay.h>
> + #include <net/tcp.h>

Why does this need to be added?
Such stuff should be done through addons.

....

> +diff -rup linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c linux-2.6.20-backport-rh4-u3/drivers/scsi/scsi_transport_iscsi.c
> +--- linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c	2007-02-04 20:44:54.000000000 +0200
> ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/scsi_transport_iscsi.c	2007-04-01 13:18:33.000000000 +0300
> +@@ -29,11 +29,15 @@
> + #include <scsi/scsi_transport.h>
> + #include <scsi/scsi_transport_iscsi.h>
> + #include <scsi/iscsi_if.h>
> ++#include <linux/transport_class.h>
> ++#include <linux/netlink.h>

Do this through addons.

> + 
> + #define ISCSI_SESSION_ATTRS 11
> + #define ISCSI_CONN_ATTRS 11
> + #define ISCSI_HOST_ATTRS 0
> +-#define ISCSI_TRANSPORT_VERSION "2.0-724"
> ++#define ISCSI_TRANSPORT_VERSION "2.0-754"

Really a necessary change?

> ++
> ++#define SCAN_WILD_CARD   ~0

Do this through addons.

> + 
> + struct iscsi_internal {
> + 	int daemon_pid;
> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
> + #define cdev_to_iscsi_internal(_cdev) \
> + 	container_of(_cdev, struct iscsi_internal, cdev)
> + 
> ++extern int attribute_container_init(void);
> ++

This does not look scsi-related. Why does this belong here?

> +@@ -216,28 +225,10 @@ static int iscsi_is_session_dev(const st
> + 	return dev->release == iscsi_session_release;
> + }
> + 
> +-static int iscsi_user_scan(struct Scsi_Host *shost, uint channel,
> +-			   uint id, uint lun)
> +-{
> +-	struct iscsi_host *ihost = shost->shost_data;
> +-	struct iscsi_cls_session *session;
> +-
> +-	mutex_lock(&ihost->mutex);
> +-	list_for_each_entry(session, &ihost->sessions, host_list) {
> +-		if ((channel == SCAN_WILD_CARD || channel == 0) &&
> +-		    (id == SCAN_WILD_CARD || id == session->target_id))
> +-			scsi_scan_target(&session->dev, 0,
> +-					 session->target_id, lun, 1);
> +-	}
> +-	mutex_unlock(&ihost->mutex);
> +-
> +-	return 0;
> +-}
> +-
> +-static void session_recovery_timedout(struct work_struct *work)
> ++static void session_recovery_timedout(void *data)
> + {
> + 	struct iscsi_cls_session *session =
> +-		container_of(work, struct iscsi_cls_session,
> ++		container_of(data, struct iscsi_cls_session,
> + 			     recovery_work.work);
> + 
> + 	dev_printk(KERN_INFO, &session->dev, "iscsi: session recovery timed "

You should not need this.
This looks like it duplicates the work we did on
backporting work_struct to old kernels.
		   
> +@@ -452,6 +441,7 @@ iscsi_create_conn(struct iscsi_cls_sessi
> + 		goto release_parent_ref;
> + 	}
> + 	transport_register_device(&conn->dev);
> ++
> + 	return conn;
> + 
> + release_parent_ref:

Really necessary in a backport?

> +@@ -606,9 +596,8 @@ iscsi_if_send_reply(int pid, int seq, in
> + 	struct nlmsghdr	*nlh;
> + 	int len = NLMSG_SPACE(size);
> + 	int flags = multi ? NLM_F_MULTI : 0;
> +-	int t = done ? NLMSG_DONE : type;
> + 
> +-	skb = alloc_skb(len, GFP_ATOMIC);
> ++	skb = alloc_skb(len, GFP_KERNEL);
> + 	/*
> + 	 * FIXME:
> + 	 * user is supposed to react on iferror == -ENOMEM;

This looks really strange in a backport.

> +@@ -649,7 +638,7 @@ iscsi_if_get_stats(struct iscsi_transpor
> + 	do {
> + 		int actual_size;
> + 
> +-		skbstat = alloc_skb(len, GFP_ATOMIC);
> ++		skbstat = alloc_skb(len, GFP_KERNEL);
> + 		if (!skbstat) {
> + 			dev_printk(KERN_ERR, &conn->dev, "iscsi: can not "
> + 				   "deliver stats: OOM\n");

As does this.

....

> diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch
> new file mode 100644
> index 0000000..6dd4429
> --- /dev/null
> +++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch
> @@ -0,0 +1,60 @@
> +diff -rupN linux-2.6.20-rc7/include/scsi/iscsi_compat.h linux-2.6.9/include/scsi/iscsi_compat.h
> +--- linux-2.6.20-rc7/include/scsi/iscsi_compat.h	1970-01-01 02:00:00.000000000 +0200
> ++++ linux-2.6.9/include/scsi/iscsi_compat.h	2007-02-08 08:45:39.000000000 +0200

Why isn't this in addons?

> diff --git a/kernel_patches/backport/2.6.9_U3/add_transport_class_h.patch b/kernel_patches/backport/2.6.9_U3/add_transport_class_h.patch
> new file mode 100644
> index 0000000..f2425e0
> --- /dev/null
> +++ b/kernel_patches/backport/2.6.9_U3/add_transport_class_h.patch
> @@ -0,0 +1,104 @@
> +diff -rupN linux-2.6.20-like-rh4/include/linux/transport_class.h linux-2.6.20/include/linux/transport_class.h
> +--- linux-2.6.20-like-rh4/include/linux/transport_class.h	1970-01-01 02:00:00.000000000 +0200
> ++++ linux-2.6.20/include/linux/transport_class.h	2007-02-04 20:44:54.000000000 +0200

Why isn't this in addons?

> diff --git a/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch
> new file mode 100644
> index 0000000..3c2a969
> --- /dev/null
> +++ b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch
> @@ -0,0 +1,13 @@
> +--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:13:43.000000000 +0200
> ++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:14:31.000000000 +0200
> +@@ -70,9 +70,8 @@
> + #include <scsi/scsi_tcq.h>
> + #include <scsi/scsi_host.h>
> + #include <scsi/scsi.h>
> +-#include <scsi/scsi_transport_iscsi.h>
> +-
> + #include "iscsi_iser.h"
> ++#include <scsi/scsi_transport_iscsi.h>
> + 
> + static unsigned int iscsi_max_lun = 512;
> + module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO);

Looks like the right thing to do anyway.
So put it in fixes instead, and post upstream.

> diff --git a/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch b/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch
> index e84b964..52c0136 100644
> --- a/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch
> +++ b/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch
> @@ -19,6 +19,62 @@ index 0000000..58cf933
>  +++ b/drivers/infiniband/core/kfifo.c
>  @@ -0,0 +1 @@
>  +#include "src/kfifo.c"
> +diff --git a/drivers/infiniband/core/init.c b/drivers/infiniband/core/init.c
> +new file mode 100644
> +index 0000000..58cf933
> +--- /dev/null
> ++++ b/drivers/infiniband/core/init.c
> +@@ -0,0 +1 @@
> ++#include "src/init.c"
> +diff --git a/drivers/infiniband/core/attribute_container.c b/drivers/infiniband/core/attribute_container.c
> +new file mode 100644
> +index 0000000..58cf933
> +--- /dev/null
> ++++ b/drivers/infiniband/core/attribute_container.c
> +@@ -0,0 +1 @@
> ++#include "src/attribute_container.c"
> +diff --git a/drivers/infiniband/core/transport_class.c b/drivers/infiniband/core/transport_class.c
> +new file mode 100644
> +index 0000000..58cf933
> +--- /dev/null
> ++++ b/drivers/infiniband/core/transport_class.c
> +@@ -0,0 +1 @@
> ++#include "src/transport_class.c"
> +diff --git a/drivers/infiniband/core/klist.c b/drivers/infiniband/core/klist.c
> +new file mode 100644
> +index 0000000..58cf933
> +--- /dev/null
> ++++ b/drivers/infiniband/core/klist.c
> +@@ -0,0 +1 @@
> ++#include "src/klist.c"
> +diff --git a/drivers/infiniband/core/scsi.c b/drivers/infiniband/core/scsi.c
> +new file mode 100644
> +index 0000000..58cf933
> +--- /dev/null
> ++++ b/drivers/infiniband/core/scsi.c
> +@@ -0,0 +1 @@
> ++#include "src/scsi.c"
> +diff --git a/drivers/infiniband/core/scsi_lib.c b/drivers/infiniband/core/scsi_lib.c
> +new file mode 100644
> +index 0000000..58cf933
> +--- /dev/null
> ++++ b/drivers/infiniband/core/scsi_lib.c
> +@@ -0,0 +1 @@
> ++#include "src/scsi_lib.c"
> +diff --git a/drivers/infiniband/core/scsi_scan.c b/drivers/infiniband/core/scsi_scan.c
> +new file mode 100644
> +index 0000000..58cf933
> +--- /dev/null
> ++++ b/drivers/infiniband/core/scsi_scan.c
> +@@ -0,0 +1 @@
> ++#include "src/scsi_scan.c"
> +diff --git a/drivers/infiniband/core/kref_new.c b/drivers/infiniband/core/kref_new.c
> +new file mode 100644
> +index 0000000..58cf933
> +--- /dev/null
> ++++ b/drivers/infiniband/core/kref_new.c
> +@@ -0,0 +1 @@
> ++#include "src/kref_new.c"
>  diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
>  index 50fb1cd..456bfd0 100644
>  --- a/drivers/infiniband/core/Makefile
> @@ -28,4 +84,4 @@ index 50fb1cd..456bfd0 100644
>   ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_mem.o \
>   				uverbs_marshall.o
>  +
> -+ib_core-y +=			genalloc.o netevent.o kfifo.o
> ++ib_core-y +=			genalloc.o netevent.o kfifo.o scsi.o scsi_lib.o scsi_scan.o init.o attribute_container.o transport_class.o klist.o kref_new.o

Can we make these part of iser instead?
Linking scsi stuff into core does not look right.

> diff --git a/kernel_patches/backport/2.6.9_U3/netlink-01-add_netlink_h.patch b/kernel_patches/backport/2.6.9_U3/netlink-01-add_netlink_h.patch
> new file mode 100644
> index 0000000..cc071ef
> --- /dev/null
> +++ b/kernel_patches/backport/2.6.9_U3/netlink-01-add_netlink_h.patch
> @@ -0,0 +1,247 @@
> +diff -rupN linux-2.6.20-like-rh4/include/linux/netlink.h linux-2.6.20/include/linux/netlink.h
> +--- linux-2.6.20-like-rh4/include/linux/netlink.h	1970-01-01 02:00:00.000000000 +0200
> ++++ linux-2.6.20/include/linux/netlink.h	2007-02-04 20:44:54.000000000 +0200

Belongs in addons.

-- 
MST


From vlad at lists.openfabrics.org  Thu May 10 02:31:48 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu, 10 May 2007 02:31:48 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070510-0200 daily build status
Message-ID: <20070510093148.9781DE60828@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.12
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.15
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.14

Failed:


From rdreier at cisco.com  Thu May 10 03:42:28 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 10 May 2007 03:42:28 -0700
Subject: [ofa-general] verbs abi_compat
In-Reply-To: <HFEPKIFILMMCLHAOBMMOGEOJGLAA.jimmy@hillraiser.com> (Jimmy Hill's
	message of "Wed, 9 May 2007 23:08:01 -0500")
References: <HFEPKIFILMMCLHAOBMMOGEOJGLAA.jimmy@hillraiser.com>
Message-ID: <ada7irh2c4r.fsf@cisco.com>

 > It is set in that it is non-zero, but I agree, it has garbage in it...and
 > that's part of the problem. It is not being set in src/cmd.c, and has a
 > non-zero value. When I call ibv_alloc_pd, I'm ending up in
 > __ibv_alloc_pd_1_0 and that attempts to use context->real_context which is
 > non-zero garbage as well and I get a segmentation violation. The abi_compat
 > flag was what I thought was redirecting me into __ibv_alloc_pd_1_0 instead
 > of __ibv_alloc_pd (where it should be going).
 > 
 > So, maybe I asked the wrong question. Let me try a different approach. What
 > determines if ibv_alloc_pd resolves to __ibv_alloc_pd_1_0 or __ibv_alloc_pd?
 > If I can find out what is redirecting my call to the "compat" code, maybe I
 > can stop it and resolve the problem.

abi_compat has nothing to do with __ibv_alloc_pd vs. __ibv_alloc_pd_1_0.
Rather, that choice is made based on whether your app is linked
against the IBVERBS_1.1 or IBVERBS_1.0 ABI.  If you link against the
new library, you should get all IBVERBS_1.1 symbols; if you link
against libibverbs 1.0, you should get all IBVERBS_1.0 symbols.

Your problem might be that your app is getting __ibv_alloc_pd_1_0, but
it gets __ibv_open_device instead of __ibv_open_device_1_0 so the
context passed into __ibv_alloc_pd_1_0 is wrong.  Are you possibly
relinking only part of your app or something?
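
The failure mode can be modelled in a few lines of plain C. This is only a
toy model under stated assumptions, not libibverbs code: the struct layouts
are simplified, and the tag field is a hypothetical addition that lets the
model report a mismatch where the real 1.0 compat function would simply
follow a garbage real_context pointer.

```c
enum abi_tag { TAG_1_0, TAG_1_1 };

/* Model of the current context layout. */
struct ctx_1_1 {
	enum abi_tag tag;	/* hypothetical, not in the real struct */
	int cmd_fd;
};

/* Model of the 1.0 compat wrapper, which carries a pointer to the real
 * context. */
struct ctx_1_0 {
	enum abi_tag tag;	/* hypothetical, not in the real struct */
	struct ctx_1_1 *real_context;
};

static struct ctx_1_1 real_ctx = { TAG_1_1, 3 };
static struct ctx_1_0 compat_ctx = { TAG_1_0, &real_ctx };

/* What an app gets when its open call binds to the 1.1 or 1.0 symbol. */
static void *open_device_1_1(void) { return &real_ctx; }
static void *open_device_1_0(void) { return &compat_ctx; }

/* Model of the 1.0 alloc_pd wrapper: returns the cmd_fd it would use, or
 * -1 where the real code would dereference a bogus real_context. */
static int alloc_pd_1_0(void *context)
{
	/* The tag is the first member of both structs, so this read is valid. */
	if (*(enum abi_tag *)context != TAG_1_0)
		return -1;
	return ((struct ctx_1_0 *)context)->real_context->cmd_fd;
}
```

In this model, alloc_pd_1_0(open_device_1_0()) works, while
alloc_pd_1_0(open_device_1_1()) is the mixed case described above.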

 - R.


From yosefe at voltaire.com  Thu May 10 04:26:31 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 14:26:31 +0300
Subject: [ofa-general] [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <20070509174138.GB17734@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408360.3040006@voltaire.com>	<20070508162727.GD5845@mellanox.co.il>	<4640A8BD.4000405@voltaire.com>	<20070509093548.GA7683@mellanox.co.il>	<4641AA06.1050002@voltaire.com>	<20070509112626.GA10068@mellanox.co.il>	<4641B63D.4010602@voltaire.com>
	<20070509174138.GB17734@mellanox.co.il>
Message-ID: <46430167.3010106@voltaire.com>

Added - handling the case when a pkey of an interface is deleted and then restored
--

This issue was found during partitioning & SM fail over testing.

 * Add a flush flag to ipoib_ib_dev_stop(), like the one in ipoib_ib_dev_down()
 * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
 * Obtain the pkey index prior to entering init_qp, and save it in dev_priv
 * Upon a PKEY_CHANGE event, schedule a work item that restarts the QP.
 * Precondition the restart on whether the pkey index has really changed.
   Use the cached pkey_index to test this.
 * Restart child interfaces before the parent. They might be up even if the
   parent is down.
 * When an interface is restarted, queue delayed initialization, to handle
   the case where a pkey is deleted and later restored.
 * Use an uncached pkey query upon QP initialization

SM reconfiguration or failover possibly causes a shuffling of the values
in the port pkey table. The current implementation only queries for the
index of the pkey once, when it creates the device QP and after that moves
it into working state, and hence does not address this scenario. Fix this
by using the PKEY_CHANGE event as a trigger to reconfigure the device QP.
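
The index-change check can be illustrated with a small standalone model.
This is a sketch only: port_model, priv_model and the fixed table length are
hypothetical, the lookup uses an exact match as a simplification, and the
pkey_assigned field stands in for the IPOIB_PKEY_ASSIGNED flag bit.

```c
#include <stdint.h>

#define PKEY_TABLE_LEN 4

/* Hypothetical model of a port's P_Key table. */
struct port_model {
	uint16_t pkey_table[PKEY_TABLE_LEN];
};

/* Hypothetical model of the relevant dev_priv fields. */
struct priv_model {
	uint16_t pkey;
	uint16_t pkey_index;
	int pkey_assigned;	/* models IPOIB_PKEY_ASSIGNED */
};

/* Look up priv->pkey in the table. Returns 0 and refreshes the cached
 * index, or -1 (clearing the assigned flag) if the pkey is absent.
 * *is_changed, when requested, reports whether the index moved or the
 * pkey was previously unassigned. */
static int find_pkey_index(const struct port_model *port,
			   struct priv_model *priv, int *is_changed)
{
	uint16_t i;

	for (i = 0; i < PKEY_TABLE_LEN; i++)
		if (port->pkey_table[i] == priv->pkey)
			break;

	if (i == PKEY_TABLE_LEN) {
		priv->pkey_assigned = 0;
		return -1;
	}

	if (is_changed)
		*is_changed = !priv->pkey_assigned || priv->pkey_index != i;

	priv->pkey_index = i;
	priv->pkey_assigned = 1;
	return 0;
}
```

A shuffle that moves the pkey to another slot reports a change, so the QP is
restarted; a PKEY_CHANGE event that leaves the slot untouched does not.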

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |    7 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   96 +++++++++++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |    7 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |   26 ++-----
 4 files changed, 97 insertions(+), 39 deletions(-)

Index: b/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-10 08:34:58.335171047 +0300
@@ -202,15 +202,17 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
+	struct delayed_work pkey_poll_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
+	struct work_struct pkey_event_task;
 
 	struct ib_device *ca;
 	u8            	  port;
 	u16           	  pkey;
+	u16               pkey_index;
 	struct ib_pd  	 *pd;
 	struct ib_mr  	 *mr;
 	struct ib_cq  	 *cq;
@@ -333,12 +335,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-10 13:01:10.737347938 +0300
@@ -408,11 +408,40 @@ void ipoib_reap_ah(struct work_struct *w
 		queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ);
 }
 
+static int ipoib_find_pkey_index(struct ipoib_dev_priv *priv, int *is_changed)
+{
+	u16 new_index;
+
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) {
+		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+		return -ENXIO;
+	}
+
+	if (is_changed)
+		*is_changed = !test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) ||
+				priv->pkey_index != new_index;
+
+	priv->pkey_index = new_index;
+	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	return 0;
+}
+
 int ipoib_ib_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
 
+	/*
+	 * Search through the port P_Key table for the requested pkey value.
+	 * The port has to be assigned to the respective IB partition in
+	 * advance.
+	 */
+	ret = ipoib_find_pkey_index(priv, NULL);
+	if (ret) {
+		ipoib_warn(priv, "pkey 0x%04x not found\n", priv->pkey);
+		return -1;
+	}
+
 	ret = ipoib_init_qp(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret);
@@ -422,14 +451,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -481,7 +510,7 @@ int ipoib_ib_dev_down(struct net_device 
 	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
 		mutex_lock(&pkey_mutex);
 		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
+		cancel_delayed_work(&priv->pkey_poll_task);
 		mutex_unlock(&pkey_mutex);
 		if (flush)
 			flush_workqueue(ipoib_workqueue);
@@ -508,7 +537,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +610,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,13 +652,22 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
+	int is_index_changed;
+
+	mutex_lock(&priv->vlan_mutex);
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
+	/* Flush any child interfaces too -
+ 	 * they might be up even if the parent is down */
+ 	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, pkey_event);
+
+	mutex_unlock(&priv->vlan_mutex);
+
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
 		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
@@ -638,10 +677,23 @@ void ipoib_ib_dev_flush(struct work_stru
 		return;
 	}
 
+	if (pkey_event &&
+	    !ipoib_find_pkey_index(priv, &is_index_changed) &&
+	    !is_index_changed) {
+	    	ipoib_dbg(priv, "Not flushing - pkey index not changed.\n");
+		return;
+	}
+
 	ipoib_dbg(priv, "flushing\n");
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (pkey_event) {
+		ipoib_ib_dev_stop(dev, 0);
+		ipoib_pkey_dev_delay_open(dev);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +702,24 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	mutex_unlock(&priv->vlan_mutex);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_event_task);
+
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -685,7 +747,7 @@ void ipoib_ib_dev_cleanup(struct net_dev
 void ipoib_pkey_poll(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
+		container_of(work, struct ipoib_dev_priv, pkey_poll_task.work);
 	struct net_device *dev = priv->dev;
 
 	ipoib_pkey_dev_check_presence(dev);
@@ -696,7 +758,7 @@ void ipoib_pkey_poll(struct work_struct 
 		mutex_lock(&pkey_mutex);
 		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
+					   &priv->pkey_poll_task,
 					   HZ);
 		mutex_unlock(&pkey_mutex);
 	}
@@ -715,7 +777,7 @@ int ipoib_pkey_dev_delay_open(struct net
 		mutex_lock(&pkey_mutex);
 		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
 		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
+				   &priv->pkey_poll_task,
 				   HZ);
 		mutex_unlock(&pkey_mutex);
 		return 1;
Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-09 17:21:03.000000000 +0300
@@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev)
 		return -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-10 09:13:28.997127223 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
@@ -94,26 +92,17 @@ int ipoib_init_qp(struct net_device *dev
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
-	u16 pkey_index;
 	struct ib_qp_attr qp_attr;
 	int attr_mask;
 
-	/*
-	 * Search through the port P_Key table for the requested pkey value.
-	 * The port has to be assigned to the respective IB partition in
-	 * advance.
-	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
-	if (ret) {
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
-		return ret;
-	}
-	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	/* Make sure we have a valid pkey_index in priv->pkey_index */
+	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
+		return -1;
 
 	qp_attr.qp_state = IB_QPS_INIT;
 	qp_attr.qkey = 0;
 	qp_attr.port_num = priv->port;
-	qp_attr.pkey_index = pkey_index;
+	qp_attr.pkey_index = priv->pkey_index;
 	attr_mask =
 	    IB_QP_QKEY |
 	    IB_QP_PORT |
@@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler
 		container_of(handler, struct ipoib_dev_priv, event_handler);
 
 	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
 	     record->event == IB_EVENT_PORT_ACTIVE ||
 	     record->event == IB_EVENT_LID_CHANGE  ||
 	     record->event == IB_EVENT_SM_CHANGE   ||
@@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler
 	    record->element.port_num == priv->port) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
+		   record->element.port_num == priv->port) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_event_task);
 	}
 }


From mst at dev.mellanox.co.il  Thu May 10 05:01:44 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 15:01:44 +0300
Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <46430167.3010106@voltaire.com>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508162727.GD5845@mellanox.co.il>
	<4640A8BD.4000405@voltaire.com>
	<20070509093548.GA7683@mellanox.co.il>
	<4641AA06.1050002@voltaire.com>
	<20070509112626.GA10068@mellanox.co.il>
	<4641B63D.4010602@voltaire.com>
	<20070509174138.GB17734@mellanox.co.il>
	<46430167.3010106@voltaire.com>
Message-ID: <20070510120144.GF13655@mellanox.co.il>

OK, this is a whole different approach to the problem.
Seems to make sense to me.

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCHv3 2/2] ipoib: handle pkey change events
> 
> Added - handling the case when a pkey of an interface is deleted and then restored
>
> --
> 
> This issue was found during partitioning & SM fail over testing.
> 
>  * Add a flush flag to ipoib_ib_dev_stop(), like the one in ipoib_ib_dev_down()
>  * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
>  * Obtain the pkey index prior to entering init_qp, and save it in dev_priv
>  * Upon a PKEY_CHANGE event, schedule a work item that restarts the QP.
>  * Precondition the restart on whether the pkey index has really changed.
>    Use the cached pkey_index to test this.
>  * Restart child interfaces before the parent. They might be up even if the
>    parent is down.
>  * When an interface is restarted, queue delayed initialization, to handle
>    the case where a pkey is deleted and later restored.
>  * Use an uncached pkey query upon QP initialization
> 
> SM reconfiguration or failover possibly causes a shuffling of the values
> in the port pkey table. The current implementation only queries for the
> index of the pkey once, when it creates the device QP and after that moves
> it into working state, and hence does not address this scenario. Fix this
> by using the PKEY_CHANGE event as a trigger to reconfigure the device QP.
> 
> Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib.h       |    7 +-
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   96 +++++++++++++++++++++++------
>  drivers/infiniband/ulp/ipoib/ipoib_main.c  |    7 +-
>  drivers/infiniband/ulp/ipoib/ipoib_verbs.c |   26 ++-----
>  4 files changed, 97 insertions(+), 39 deletions(-)
> 
> Index: b/drivers/infiniband/ulp/ipoib/ipoib.h
> ===================================================================
> --- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 15:46:53.000000000 +0300
> +++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-10 08:34:58.335171047 +0300
> @@ -202,15 +202,17 @@ struct ipoib_dev_priv {
>  	struct list_head multicast_list;
>  	struct rb_root multicast_tree;
>  
> -	struct delayed_work pkey_task;
> +	struct delayed_work pkey_poll_task;
>  	struct delayed_work mcast_task;
>  	struct work_struct flush_task;
>  	struct work_struct restart_task;
>  	struct delayed_work ah_reap_task;
> +	struct work_struct pkey_event_task;
>  
>  	struct ib_device *ca;
>  	u8            	  port;
>  	u16           	  pkey;
> +	u16               pkey_index;
>  	struct ib_pd  	 *pd;
>  	struct ib_mr  	 *mr;
>  	struct ib_cq  	 *cq;
> @@ -333,12 +335,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
>  
>  int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
>  void ipoib_ib_dev_flush(struct work_struct *work);
> +void ipoib_pkey_event(struct work_struct *work);
>  void ipoib_ib_dev_cleanup(struct net_device *dev);
>  
>  int ipoib_ib_dev_open(struct net_device *dev);
>  int ipoib_ib_dev_up(struct net_device *dev);
>  int ipoib_ib_dev_down(struct net_device *dev, int flush);
> -int ipoib_ib_dev_stop(struct net_device *dev);
> +int ipoib_ib_dev_stop(struct net_device *dev, int flush);
>  
>  int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
>  void ipoib_dev_cleanup(struct net_device *dev);
> Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> ===================================================================
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 15:46:53.000000000 +0300
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-10 13:01:10.737347938 +0300
> @@ -408,11 +408,40 @@ void ipoib_reap_ah(struct work_struct *w
>  		queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ);
>  }
>  
> +static int ipoib_find_pkey_index(struct ipoib_dev_priv *priv, int *is_changed)
> +{
> +	u16 new_index;
> +
> +	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) {
> +		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +		return -ENXIO;
> +	}
> +
> +	if (is_changed)
> +		*is_changed = !test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) ||
> +				priv->pkey_index != new_index;
> +
> +	priv->pkey_index = new_index;
> +	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +	return 0;
> +}

I suggest open-coding this - the name ipoib_find_pkey_index
does not tell me that it actually sets flags, etc.

> @@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler
>  		container_of(handler, struct ipoib_dev_priv, event_handler);
>  
>  	if ((record->event == IB_EVENT_PORT_ERR    ||
> -	     record->event == IB_EVENT_PKEY_CHANGE ||
>  	     record->event == IB_EVENT_PORT_ACTIVE ||
>  	     record->event == IB_EVENT_LID_CHANGE  ||
>  	     record->event == IB_EVENT_SM_CHANGE   ||
> @@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler
>  	    record->element.port_num == priv->port) {
>  		ipoib_dbg(priv, "Port state change event\n");
>  		queue_work(ipoib_workqueue, &priv->flush_task);
> +	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
> +		   record->element.port_num == priv->port) {
> +		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
> +		queue_work(ipoib_workqueue, &priv->pkey_event_task);
>  	}
>  }

BTW, should we maybe do:
if (record->element.port_num != priv->port)
	return;

and then we won't have to do this test for each event type?

-- 
MST


From ogerlitz at voltaire.com  Thu May 10 05:07:57 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 10 May 2007 15:07:57 +0300
Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <20070510120144.GF13655@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408360.3040006@voltaire.com>	<20070508162727.GD5845@mellanox.co.il>	<4640A8BD.4000405@voltaire.com>	<20070509093548.GA7683@mellanox.co.il>	<4641AA06.1050002@voltaire.com>	<20070509112626.GA10068@mellanox.co.il>	<4641B63D.4010602@voltaire.com>	<20070509174138.GB17734@mellanox.co.il>	<46430167.3010106@voltaire.com>
	<20070510120144.GF13655@mellanox.co.il>
Message-ID: <46430B1D.1040905@voltaire.com>

Michael S. Tsirkin wrote:
>> @@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler
>>  		container_of(handler, struct ipoib_dev_priv, event_handler);
>>  
>>  	if ((record->event == IB_EVENT_PORT_ERR    ||
>> -	     record->event == IB_EVENT_PKEY_CHANGE ||
>>  	     record->event == IB_EVENT_PORT_ACTIVE ||
>>  	     record->event == IB_EVENT_LID_CHANGE  ||
>>  	     record->event == IB_EVENT_SM_CHANGE   ||
>> @@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler
>>  	    record->element.port_num == priv->port) {
>>  		ipoib_dbg(priv, "Port state change event\n");
>>  		queue_work(ipoib_workqueue, &priv->flush_task);
>> +	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
>> +		   record->element.port_num == priv->port) {
>> +		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
>> +		queue_work(ipoib_workqueue, &priv->pkey_event_task);
>>  	}
>>  }
> 
> BTW, should we maybe do:
> if (record->element.port_num != priv->port)
> 	return;
> 
> and then we won't have to do this test for each event type?

Just make sure that all the events covered by this check are port 
affiliated, i.e. that none of them has a wider scope.

Or.


From mst at dev.mellanox.co.il  Thu May 10 05:10:39 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 15:10:39 +0300
Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <46430B1D.1040905@voltaire.com>
References: <20070508162727.GD5845@mellanox.co.il>
	<4640A8BD.4000405@voltaire.com>
	<20070509093548.GA7683@mellanox.co.il>
	<4641AA06.1050002@voltaire.com>
	<20070509112626.GA10068@mellanox.co.il>
	<4641B63D.4010602@voltaire.com>
	<20070509174138.GB17734@mellanox.co.il>
	<46430167.3010106@voltaire.com>
	<20070510120144.GF13655@mellanox.co.il>
	<46430B1D.1040905@voltaire.com>
Message-ID: <20070510121039.GI13655@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events
> 
> Michael S. Tsirkin wrote:
> >>@@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler
> >> 		container_of(handler, struct ipoib_dev_priv, event_handler);
> >> 
> >> 	if ((record->event == IB_EVENT_PORT_ERR    ||
> >>-	     record->event == IB_EVENT_PKEY_CHANGE ||
> >> 	     record->event == IB_EVENT_PORT_ACTIVE ||
> >> 	     record->event == IB_EVENT_LID_CHANGE  ||
> >> 	     record->event == IB_EVENT_SM_CHANGE   ||
> >>@@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler
> >> 	    record->element.port_num == priv->port) {
> >> 		ipoib_dbg(priv, "Port state change event\n");
> >> 		queue_work(ipoib_workqueue, &priv->flush_task);
> >>+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
> >>+		   record->element.port_num == priv->port) {
> >>+		ipoib_dbg(priv, "pkey change event on port:%d\n", 
> >>priv->port);
> >>+		queue_work(ipoib_workqueue, &priv->pkey_event_task);
> >> 	}
> >> }
> >
> >BTW, should we maybe do:
> >if (record->element.port_num != priv->port)
> >	return;
> >
> >and then we won't have to do this test for each event type?
> 
> Just make sure that all the events covered by this check are port 
> affiliated, ie don't have a wider scope.
> 
> Or.


Well, we currently have:

void ipoib_event(struct ib_event_handler *handler,
                 struct ib_event *record)
{
        struct ipoib_dev_priv *priv =
                container_of(handler, struct ipoib_dev_priv, event_handler);

        if ((record->event == IB_EVENT_PORT_ERR    ||
             record->event == IB_EVENT_PKEY_CHANGE ||
             record->event == IB_EVENT_PORT_ACTIVE ||
             record->event == IB_EVENT_LID_CHANGE  ||
             record->event == IB_EVENT_SM_CHANGE   ||
             record->event == IB_EVENT_CLIENT_REREGISTER) &&
            record->element.port_num == priv->port) {
                ipoib_dbg(priv, "Port state change event\n");
                queue_work(ipoib_workqueue, &priv->flush_task);
        }
}

So this would not change anything, just clean up code a little.
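
For illustration, the cleanup combined with the split proposed in the
patch could look roughly like the sketch below. It is a userspace mock
with stand-in types, not the driver code: the enum values,
struct fake_priv and classify_event() are invented here, and the
return value stands in for which work item would be queued.

```c
/* Userspace sketch only: stand-in types, not the real ib_verbs definitions. */
enum fake_event {
	EV_PORT_ERR,
	EV_PKEY_CHANGE,
	EV_PORT_ACTIVE,
	EV_LID_CHANGE,
	EV_SM_CHANGE,
	EV_CLIENT_REREGISTER,
	EV_OTHER
};

/* Which work item the handler would queue (hypothetical names). */
enum fake_action { ACT_NONE, ACT_FLUSH, ACT_PKEY_EVENT };

struct fake_priv {
	int port;
};

/* Decide what to queue for an event; the port test is done once, up front. */
static enum fake_action classify_event(const struct fake_priv *priv,
				       enum fake_event event, int port_num)
{
	if (port_num != priv->port)
		return ACT_NONE;

	switch (event) {
	case EV_PORT_ERR:
	case EV_PORT_ACTIVE:
	case EV_LID_CHANGE:
	case EV_SM_CHANGE:
	case EV_CLIENT_REREGISTER:
		return ACT_FLUSH;	/* would queue flush_task */
	case EV_PKEY_CHANGE:
		return ACT_PKEY_EVENT;	/* would queue pkey_event_task */
	default:
		return ACT_NONE;
	}
}
```

The early return is only safe as long as every event handled after it
is port affiliated, which holds for the six event types listed above.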


-- 
MST


From ogerlitz at voltaire.com  Thu May 10 05:12:00 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 10 May 2007 15:12:00 +0300
Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <20070510121039.GI13655@mellanox.co.il>
References: <20070508162727.GD5845@mellanox.co.il>	<4640A8BD.4000405@voltaire.com>	<20070509093548.GA7683@mellanox.co.il>	<4641AA06.1050002@voltaire.com>	<20070509112626.GA10068@mellanox.co.il>	<4641B63D.4010602@voltaire.com>	<20070509174138.GB17734@mellanox.co.il>	<46430167.3010106@voltaire.com>	<20070510120144.GF13655@mellanox.co.il>	<46430B1D.1040905@voltaire.com>
	<20070510121039.GI13655@mellanox.co.il>
Message-ID: <46430C10.7040702@voltaire.com>

Michael S. Tsirkin wrote:
> Well, we currently have:
> 
> void ipoib_event(struct ib_event_handler *handler,
>                  struct ib_event *record)
> {
>         struct ipoib_dev_priv *priv =
>                 container_of(handler, struct ipoib_dev_priv, event_handler);
> 
>         if ((record->event == IB_EVENT_PORT_ERR    ||
>              record->event == IB_EVENT_PKEY_CHANGE ||
>              record->event == IB_EVENT_PORT_ACTIVE ||
>              record->event == IB_EVENT_LID_CHANGE  ||
>              record->event == IB_EVENT_SM_CHANGE   ||
>              record->event == IB_EVENT_CLIENT_REREGISTER) &&
>             record->element.port_num == priv->port) {
>                 ipoib_dbg(priv, "Port state change event\n");
>                 queue_work(ipoib_workqueue, &priv->flush_task);
>         }
> }
> 
> So this would not change anything, just clean up code a little.

OK


From ogerlitz at voltaire.com  Thu May 10 05:23:14 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 10 May 2007 15:23:14 +0300
Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed	udapl
	-	bugs	opened
In-Reply-To: <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>	<1178657765.11455.32.camel@stevo-desktop>	<4640FDE9.9010000@ichips.intel.com>	<1178718090.382.4.camel@stevo-desktop>	<1178721476.382.18.camel@stevo-desktop>	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop>	<46422D07.3050600@Sun.COM>
	<1178742259.382.112.camel@stevo-desktop>	<46422EA6.3020006@Sun.COM>
	<1178742819.382.114.camel@stevo-desktop>	<464232DC.9010201@Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
Message-ID: <46430EB2.7080703@voltaire.com>

Jeff Squyres wrote:
> Galen Shipman and I talked about this a bit and suggest the following:
> 
> - During the connection dance (probably for both the udapl and openib 
> BTLs), whichever peer ends up being the connection initiator (don't 
> forget about the race condition where 2 peers may simultaneously decide 
> to initiate -- this case is handled properly in the OMPI code; but just 
> make sure you modify the side that ends up being actual initiator), they 
> can send their pending fragment immediately (and Steve is right that 
> there will always be a pending fragment, because OMPI doesn't make a 
> connection until the first send).
> 
> - The other peer (the receiver of the connection) must wait to send its 
> pending fragment(s) until it receives the first frag from the connection 
> initiator.  This can be accomplished either with another flag on the 
> OMPI module struct or perhaps making it part of the connection protocol 
> (i.e., don't transition the endpoint to be CONNECTED until the first 
> fragment is received).  Either of which can be used to queue up 
> fragments on the receiver until the first fragment is received from the 
> initiator.  I'd have to look in the code deeper, but I'm *guessing* that 
> it might be best to use the already-existing state flag (i.e., checking 
> for CONNECTED) because then you won't be introducing any more 
> conditionals in the critical path.

A different approach which you might want to consider is to have, at the 
btl level, --two-- connections per <src,dst> pair of ranks, so if A wants 
to send to B it does so through the A --> B connection, and if B wants to 
send to A it does so through the B --> A connection. To some extent, this 
is the approach taken by IPoIB-CM (I am not familiar enough with the RFC 
to understand the reasoning, but I am quite sure this was the approach in 
the initial implementation). At first thought it might not seem very 
elegant, but once you work through the details (projected onto the ompi 
env) you might even find it nice.
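
To make the idea concrete, here is a rough sketch of the per-peer
state it implies. All names (struct endpoint, endpoint_send(),
endpoint_accept()) are hypothetical and not taken from the OMPI or
IPoIB-CM sources, and the "connect" is just a state change standing in
for the real handshake. Under these assumptions a rank only ever sends
on the connection it initiated itself.

```c
/* Hypothetical connection states; not taken from the OMPI source. */
enum conn_state { CONN_CLOSED, CONN_CONNECTED };

struct conn {
	enum conn_state state;
	int sent;		/* fragments sent on this connection */
};

/* One endpoint per peer, with a separate connection for each direction. */
struct endpoint {
	struct conn out;	/* we initiated it; we are the only sender */
	struct conn in;		/* the peer initiated it; we only receive */
};

/* Send a fragment: connect the outgoing side on first use, then send. */
static int endpoint_send(struct endpoint *ep)
{
	if (ep->out.state == CONN_CLOSED)
		ep->out.state = CONN_CONNECTED;	/* stand-in for the real connect */
	ep->out.sent++;
	return ep->out.sent;
}

/* The peer connected to us; this never touches the outgoing side. */
static void endpoint_accept(struct endpoint *ep)
{
	ep->in.state = CONN_CONNECTED;
}
```

In this sketch an incoming connect from the peer and our own first
send are independent of each other, at the cost of two connections per
pair of ranks.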

Or.


From yosefe at voltaire.com  Thu May 10 05:25:42 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 15:25:42 +0300
Subject: [ofa-general] [PATCHv4 2/2] ipoib: handle pkey change events
In-Reply-To: <20070510120144.GF13655@mellanox.co.il>
References: <4640812C.6060003@voltaire.com>
	<46408360.3040006@voltaire.com>	<20070508162727.GD5845@mellanox.co.il>	<4640A8BD.4000405@voltaire.com>	<20070509093548.GA7683@mellanox.co.il>	<4641AA06.1050002@voltaire.com>	<20070509112626.GA10068@mellanox.co.il>	<4641B63D.4010602@voltaire.com>	<20070509174138.GB17734@mellanox.co.il>	<46430167.3010106@voltaire.com>
	<20070510120144.GF13655@mellanox.co.il>
Message-ID: <46430F46.1080002@voltaire.com>

Comments:
1. the return -1 is consistent with all other "return -1" in ipoib_ib_dev_open
2. the polling thread is stopped in ipoib_ib_dev_down

Changes:
* remove ipoib_find_pkey()

--
This issue was found during partitioning & SM fail over testing.

 * Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
 * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
 * Obtain pkey index prior to entering init_qp, and save it in dev_priv
 * Upon PKEY_CHANGE event, schedule a work that restarts the qp.
 * Precondition the restart on whether the pkey index is really changed.
   Use the cached pkey_index to test this.  
 * Restart child interfaces before parent. They might be up even if the
   parent is down.
 * When interface is restarted, queue delayed initialization, to handle
   the case that a pkey is deleted and later restored.
 * Use uncached pkey query upon qp initialization

SM reconfiguration or failover possibly causes a shuffling of the values
in the port pkey table. The current implementation only queries for the
index of the pkey once, when it creates the device QP and after that moves
it into working state, and hence does not address this scenario. Fix this
by using the PKEY_CHANGE event as a trigger to reconfigure the device QP.

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |    7 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   88 +++++++++++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |    7 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |   26 ++------
 4 files changed, 89 insertions(+), 39 deletions(-)

Index: b/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-10 08:34:58.335171047 +0300
@@ -202,15 +202,17 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
+	struct delayed_work pkey_poll_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
+	struct work_struct pkey_event_task;
 
 	struct ib_device *ca;
 	u8            	  port;
 	u16           	  pkey;
+	u16               pkey_index;
 	struct ib_pd  	 *pd;
 	struct ib_mr  	 *mr;
 	struct ib_cq  	 *cq;
@@ -333,12 +335,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-10 15:16:47.592982550 +0300
@@ -413,6 +413,18 @@ int ipoib_ib_dev_open(struct net_device 
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
 
+	/*
+	 * Search through the port P_Key table for the requested pkey value.
+	 * The port has to be assigned to the respective IB partition in
+	 * advance.
+	 */
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) {
+		ipoib_warn(priv, "pkey 0x%04x not found\n", priv->pkey);
+		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+		return -1;
+	}
+	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+
 	ret = ipoib_init_qp(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret);
@@ -422,14 +434,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -481,7 +493,7 @@ int ipoib_ib_dev_down(struct net_device 
 	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
 		mutex_lock(&pkey_mutex);
 		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
+		cancel_delayed_work(&priv->pkey_poll_task);
 		mutex_unlock(&pkey_mutex);
 		if (flush)
 			flush_workqueue(ipoib_workqueue);
@@ -508,7 +520,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +593,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,13 +635,22 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
+	u16 new_index;
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
+	mutex_lock(&priv->vlan_mutex);
+
+	/* Flush any child interfaces too -
+ 	 * they might be up even if the parent is down */
+ 	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, pkey_event);
+
+	mutex_unlock(&priv->vlan_mutex);
+
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
 		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
@@ -638,10 +660,32 @@ void ipoib_ib_dev_flush(struct work_stru
 		return;
 	}
 
+	if (pkey_event) {
+		if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) {
+			clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+			ipoib_ib_dev_down(dev, 0);
+			ipoib_pkey_dev_delay_open(dev);
+			return;
+		}
+		set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+
+		/* restart qp only if pkey index has changed */
+		if (new_index == priv->pkey_index) {
+			ipoib_dbg(priv, "Not flushing - pkey index not changed.\n");
+			return;
+		}
+		priv->pkey_index = new_index;
+	}
+
 	ipoib_dbg(priv, "flushing\n");
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (pkey_event) {
+		ipoib_ib_dev_stop(dev, 0);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +694,24 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
+
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_event_task);
 
-	mutex_unlock(&priv->vlan_mutex);
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -685,7 +739,7 @@ void ipoib_ib_dev_cleanup(struct net_dev
 void ipoib_pkey_poll(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
+		container_of(work, struct ipoib_dev_priv, pkey_poll_task.work);
 	struct net_device *dev = priv->dev;
 
 	ipoib_pkey_dev_check_presence(dev);
@@ -696,7 +750,7 @@ void ipoib_pkey_poll(struct work_struct 
 		mutex_lock(&pkey_mutex);
 		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
+					   &priv->pkey_poll_task,
 					   HZ);
 		mutex_unlock(&pkey_mutex);
 	}
@@ -715,7 +769,7 @@ int ipoib_pkey_dev_delay_open(struct net
 		mutex_lock(&pkey_mutex);
 		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
 		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
+				   &priv->pkey_poll_task,
 				   HZ);
 		mutex_unlock(&pkey_mutex);
 		return 1;
Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-09 17:21:03.000000000 +0300
@@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev)
 		return -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-10 09:13:28.997127223 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
@@ -94,26 +92,17 @@ int ipoib_init_qp(struct net_device *dev
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
-	u16 pkey_index;
 	struct ib_qp_attr qp_attr;
 	int attr_mask;
 
-	/*
-	 * Search through the port P_Key table for the requested pkey value.
-	 * The port has to be assigned to the respective IB partition in
-	 * advance.
-	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
-	if (ret) {
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
-		return ret;
-	}
-	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	/* Make sure we have a valid pkey_index in priv->pkey_index */
+	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
+		return -1;
 
 	qp_attr.qp_state = IB_QPS_INIT;
 	qp_attr.qkey = 0;
 	qp_attr.port_num = priv->port;
-	qp_attr.pkey_index = pkey_index;
+	qp_attr.pkey_index = priv->pkey_index;
 	attr_mask =
 	    IB_QP_QKEY |
 	    IB_QP_PORT |
@@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler
 		container_of(handler, struct ipoib_dev_priv, event_handler);
 
 	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
 	     record->event == IB_EVENT_PORT_ACTIVE ||
 	     record->event == IB_EVENT_LID_CHANGE  ||
 	     record->event == IB_EVENT_SM_CHANGE   ||
@@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler
 	    record->element.port_num == priv->port) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
+		   record->element.port_num == priv->port) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_event_task);
 	}
 }


From jsquyres at cisco.com  Thu May 10 05:26:22 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 10 May 2007 08:26:22 -0400
Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed	udapl
	-	bugs	opened
In-Reply-To: <46430EB2.7080703@voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>	<1178657765.11455.32.camel@stevo-desktop>	<4640FDE9.9010000@ichips.intel.com>	<1178718090.382.4.camel@stevo-desktop>	<1178721476.382.18.camel@stevo-desktop>	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop>	<46422D07.3050600@Sun.COM>
	<1178742259.382.112.camel@stevo-desktop>	<46422EA6.3020006@Sun.COM>
	<1178742819.382.114.camel@stevo-desktop>	<464232DC.9010201@Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
	<46430EB2.7080703@voltaire.com>
Message-ID: <C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>

On May 10, 2007, at 8:23 AM, Or Gerlitz wrote:

> A different approach which you might want to consider is to have at  
> the btl level --two-- connections per <src,dst> ranks. so if A  
> wants to send B it does so through the A --> B connection and if B  
> wants to send A it does so through the B --> A connection. To some  
> extent, this is the approach taken by IPoIB-CM (I am not enough  
> into the RFC to understand the reasoning but i am quite sure this  
> was the approach in the initial implementation). At first thought  
> it mights seems not very elegant, but taking it into the details  
> (projected on the ompi env) you might find it  even nice.

What is the advantage of this approach?

-- 
Jeff Squyres
Cisco Systems


From mst at dev.mellanox.co.il  Thu May 10 05:38:55 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 15:38:55 +0300
Subject: [ofa-general] Re: [PATCHv4 2/2] ipoib: handle pkey change events
In-Reply-To: <46430F46.1080002@voltaire.com>
References: <20070508162727.GD5845@mellanox.co.il>
	<4640A8BD.4000405@voltaire.com>
	<20070509093548.GA7683@mellanox.co.il>
	<4641AA06.1050002@voltaire.com>
	<20070509112626.GA10068@mellanox.co.il>
	<4641B63D.4010602@voltaire.com>
	<20070509174138.GB17734@mellanox.co.il>
	<46430167.3010106@voltaire.com>
	<20070510120144.GF13655@mellanox.co.il>
	<46430F46.1080002@voltaire.com>
Message-ID: <20070510123855.GL13655@mellanox.co.il>

> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 15:46:53.000000000 +0300
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-10 15:16:47.592982550 +0300
> @@ -413,6 +413,18 @@ int ipoib_ib_dev_open(struct net_device 
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  	int ret;
>  
> +	/*
> +	 * Search through the port P_Key table for the requested pkey value.
> +	 * The port has to be assigned to the respective IB partition in
> +	 * advance.
> +	 */
> +	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) {
> +		ipoib_warn(priv, "pkey 0x%04x not found\n", priv->pkey);
> +		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +		return -1;
> +	}
> +	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +
>  	ret = ipoib_init_qp(dev);
>  	if (ret) {
>  		ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret);

Return a real error code here, e.g. -ENXIO, rather than -1.
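
Something along these lines, shown as a userspace sketch rather than
driver code: fake_find_pkey() is a made-up stand-in for ib_find_pkey()
that scans a plain array, and fake_dev_open() only models the pkey
lookup at the top of ipoib_ib_dev_open().

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for ib_find_pkey(): scan a table, return 0 and the index on a hit. */
static int fake_find_pkey(const unsigned short *table, size_t len,
			  unsigned short pkey, unsigned short *index)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (table[i] == pkey) {
			*index = (unsigned short)i;
			return 0;
		}
	}
	return -ENOENT;
}

/* Open path: report a missing pkey as -ENXIO instead of a bare -1. */
static int fake_dev_open(const unsigned short *table, size_t len,
			 unsigned short pkey, unsigned short *pkey_index)
{
	if (fake_find_pkey(table, len, pkey, pkey_index))
		return -ENXIO;
	return 0;
}
```

A caller can then tell "pkey not in the table" apart from other
failures by checking for -ENXIO.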

-- 
MST


From ossrosch at linux.vnet.ibm.com  Thu May 10 05:41:33 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 14:41:33 +0200
Subject: [ofa-general] [PATCH ofed-1.2-rc3 0/3] ehca: backport for rhel-4.5
Message-ID: <200705101441.33813.ossrosch@linux.vnet.ibm.com>

Hi,

these are the patches to backport the ehca driver to rhel-4.5. The patches
switch the driver back to the old mmap style due to the lack of
vm_insert_page() support in kernel 2.6.9. Additionally, we use the introduced
dma_ops mechanism in order to communicate with ibmebus.

regards Stefan


From ossrosch at linux.vnet.ibm.com  Thu May 10 05:41:43 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 14:41:43 +0200
Subject: [ofa-general] [PATCH ofed-1.2-rc3 1/3] ehca: backport for rhel-4.5 -
	hvcall.h
Message-ID: <200705101441.44286.ossrosch@linux.vnet.ibm.com>

use kmem_cache_t instead of struct kmem_cache and update hvcall.h


Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
---

drivers/infiniband/hw/ehca/ehca_av.c                         |    2
drivers/infiniband/hw/ehca/ehca_cq.c                         |    2
drivers/infiniband/hw/ehca/ehca_main.c                       |    2
drivers/infiniband/hw/ehca/ehca_mrmw.c                       |    4
drivers/infiniband/hw/ehca/ehca_pd.c                         |    2
drivers/infiniband/hw/ehca/ehca_qp.c                         |    2
kernel_addons/backport/2.6.9_U5/include/asm-powerpc/system.h |    1
kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h         |  167 +++++++++++
8 files changed, 174 insertions(+), 8 deletions(-)


diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_av.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_av.c
--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_av.c	2007-05-09 12:42:01.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_av.c	2007-05-09 12:42:34.000000000 +0200
@@ -48,7 +48,7 @@
 #include "ehca_iverbs.h"
 #include "hcp_if.h"
 
-static struct kmem_cache *av_cache;
+static kmem_cache_t *av_cache;
 
 struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
 {
diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c
--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c	2007-05-09 12:42:01.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c	2007-05-09 12:42:34.000000000 +0200
@@ -50,7 +50,7 @@
 #include "ehca_irq.h"
 #include "hcp_if.h"
 
-static struct kmem_cache *cq_cache;
+static kmem_cache_t *cq_cache;
 
 int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp)
 {
diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c
--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-09 12:42:01.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-09 12:42:34.000000000 +0200
@@ -465,7 +465,6 @@ void ehca_remove_driver_sysfs(struct ibm
 
 #define EHCA_RESOURCE_ATTR(name)                                           \
 static ssize_t  ehca_show_##name(struct device *dev,                       \
-				 struct device_attribute *attr,            \
 				 char *buf)                                \
 {									   \
 	struct ehca_shca *shca;						   \
@@ -513,7 +512,6 @@ EHCA_RESOURCE_ATTR(max_pd);
 EHCA_RESOURCE_ATTR(max_ah);
 
 static ssize_t ehca_show_adapter_handle(struct device *dev,
-					struct device_attribute *attr,
 					char *buf)
 {
 	struct ehca_shca *shca = dev->driver_data;
diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_mrmw.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_mrmw.c
--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_mrmw.c	2007-05-09 12:42:01.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_mrmw.c	2007-05-09 12:42:34.000000000 +0200
@@ -46,8 +46,8 @@
 #include "hcp_if.h"
 #include "hipz_hw.h"
 
-static struct kmem_cache *mr_cache;
-static struct kmem_cache *mw_cache;
+static kmem_cache_t *mr_cache;
+static kmem_cache_t *mw_cache;
 
 static struct ehca_mr *ehca_mr_new(void)
 {
diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_pd.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_pd.c
--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_pd.c	2007-05-09 12:42:01.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_pd.c	2007-05-09 12:42:34.000000000 +0200
@@ -43,7 +43,7 @@
 #include "ehca_tools.h"
 #include "ehca_iverbs.h"
 
-static struct kmem_cache *pd_cache;
+static kmem_cache_t *pd_cache;
 
 struct ib_pd *ehca_alloc_pd(struct ib_device *device,
 			    struct ib_ucontext *context, struct ib_udata *udata)
diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c
--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c	2007-05-09 12:42:01.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c	2007-05-09 12:42:34.000000000 +0200
@@ -51,7 +51,7 @@
 #include "hcp_if.h"
 #include "hipz_fns.h"
 
-static struct kmem_cache *qp_cache;
+static kmem_cache_t *qp_cache;
 
 /*
  * attributes not supported by query qp
diff -Nurp ofa_kernel-1.2_old/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h ofa_kernel-1.2_new/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h
--- ofa_kernel-1.2_old/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h	2007-05-09 12:48:09.000000000 +0200
+++ ofa_kernel-1.2_new/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h	2007-05-09 12:51:00.000000000 +0200
@@ -137,6 +137,173 @@ inline static long plpar_hcall9(unsigned
 	return regs[0];
 }
 
+inline static long plpar_hcall_7arg_7ret(unsigned long opcode,
+					 unsigned long arg1,    /* <R4  */
+					 unsigned long arg2,	/* <R5  */
+					 unsigned long arg3,	/* <R6  */
+					 unsigned long arg4,	/* <R7  */
+					 unsigned long arg5,	/* <R8  */
+					 unsigned long arg6,	/* <R9  */
+					 unsigned long arg7,	/* <R10 */
+					 unsigned long *out1,	/* <R4  */
+					 unsigned long *out2,	/* <R5  */
+					 unsigned long *out3,	/* <R6  */
+					 unsigned long *out4,	/* <R7  */
+					 unsigned long *out5,	/* <R8  */
+					 unsigned long *out6,	/* <R9  */
+					 unsigned long *out7	/* <R10 */
+    )
+{
+	unsigned long regs[11] = {opcode,
+				  arg1, arg2, arg3, arg4, arg5, arg6, arg7};
+
+	__asm__ __volatile__("mr 3,%10\n"
+			     "mr 4,%11\n"
+			     "mr 5,%12\n"
+			     "mr 6,%13\n"
+			     "mr 7,%14\n"
+			     "mr 8,%15\n"
+			     "mr 9,%16\n"
+			     "mr 10,%17\n"
+			     "mr 11,%18\n"
+			     "mr 12,%19\n"
+			     ".long 0x44000022\n"
+			     "mr %0,3\n"
+			     "mr %1,4\n"
+			     "mr %2,5\n"
+			     "mr %3,6\n"
+			     "mr %4,7\n"
+			     "mr %5,8\n"
+			     "mr %6,9\n"
+			     "mr %7,10\n"
+			     "mr %8,11\n"
+			     "mr %9,12\n":"=r"(regs[0]),
+			     "=r"(regs[1]), "=r"(regs[2]),
+			     "=r"(regs[3]), "=r"(regs[4]),
+			     "=r"(regs[5]), "=r"(regs[6]),
+			     "=r"(regs[7]), "=r"(regs[8]),
+			     "=r"(regs[9])
+			     :"r"(regs[0]), "r"(regs[1]),
+			     "r"(regs[2]), "r"(regs[3]),
+			     "r"(regs[4]), "r"(regs[5]),
+			     "r"(regs[6]), "r"(regs[7]),
+			     "r"(regs[8]), "r"(regs[9])
+			     :"r0", "r2", "r3", "r4", "r5", "r6", "r7",
+			     "r8", "r9", "r10", "r11", "r12", "cc",
+			     "xer", "ctr", "lr", "cr0", "cr1", "cr5",
+			     "cr6", "cr7");
+	*out1 = regs[1];
+	*out2 = regs[2];
+	*out3 = regs[3];
+	*out4 = regs[4];
+	*out5 = regs[5];
+	*out6 = regs[6];
+	*out7 = regs[7];
+
+	if (!H_isLongBusy(regs[0]) && (long)regs[0] < 0) {
+		printk(KERN_ERR "HCALL77_IN r3=%lx r4=%lx r5=%lx r6=%lx "
+		       "r7=%lx r8=%lx r9=%lx r10=%lx",
+		       opcode, arg1, arg2, arg3,
+		       arg4, arg5, arg6, arg7);
+		printk(KERN_ERR "HCALL77_OUT r3=%lx r4=%lx r5=%lx "
+		       "r6=%lx r7=%lx r8=%lx r9=%lx r10=%lx ",
+		       regs[0], regs[1],
+		       regs[2], regs[3],
+		       regs[4], regs[5],
+		       regs[6], regs[7]);
+	}
+	return regs[0];
+}
+
+inline static long plpar_hcall_9arg_9ret(unsigned long opcode,
+					 unsigned long arg1,	/* <R4  */
+					 unsigned long arg2,	/* <R5  */
+					 unsigned long arg3,	/* <R6  */
+					 unsigned long arg4,	/* <R7  */
+					 unsigned long arg5,	/* <R8  */
+					 unsigned long arg6,	/* <R9  */
+					 unsigned long arg7,	/* <R10 */
+					 unsigned long arg8,	/* <R11 */
+					 unsigned long arg9,	/* <R12 */
+					 unsigned long *out1,	/* <R4  */
+					 unsigned long *out2,	/* <R5  */
+					 unsigned long *out3,	/* <R6  */
+					 unsigned long *out4,	/* <R7  */
+					 unsigned long *out5,	/* <R8  */
+					 unsigned long *out6,	/* <R9  */
+					 unsigned long *out7,	/* <R10 */
+					 unsigned long *out8,	/* <R11 */
+					 unsigned long *out9	/* <R12 */
+    )
+{
+	unsigned long regs[11] = {opcode,
+				  arg1, arg2, arg3, arg4, arg5, arg6, arg7,
+				  arg8, arg9};
+
+	__asm__ __volatile__("mr 3,%10\n"
+			     "mr 4,%11\n"
+			     "mr 5,%12\n"
+			     "mr 6,%13\n"
+			     "mr 7,%14\n"
+			     "mr 8,%15\n"
+			     "mr 9,%16\n"
+			     "mr 10,%17\n"
+			     "mr 11,%18\n"
+			     "mr 12,%19\n"
+			     ".long 0x44000022\n"
+			     "mr %0,3\n"
+			     "mr %1,4\n"
+			     "mr %2,5\n"
+			     "mr %3,6\n"
+			     "mr %4,7\n"
+			     "mr %5,8\n"
+			     "mr %6,9\n"
+			     "mr %7,10\n"
+			     "mr %8,11\n"
+			     "mr %9,12\n":"=r"(regs[0]),
+			     "=r"(regs[1]), "=r"(regs[2]),
+			     "=r"(regs[3]), "=r"(regs[4]),
+			     "=r"(regs[5]), "=r"(regs[6]),
+			     "=r"(regs[7]), "=r"(regs[8]),
+			     "=r"(regs[9])
+			     :"r"(regs[0]), "r"(regs[1]),
+			     "r"(regs[2]), "r"(regs[3]),
+			     "r"(regs[4]), "r"(regs[5]),
+			     "r"(regs[6]), "r"(regs[7]),
+			     "r"(regs[8]), "r"(regs[9])
+			     :"r0", "r2", "r3", "r4", "r5", "r6", "r7",
+			     "r8", "r9", "r10", "r11", "r12", "cc",
+			     "xer", "ctr", "lr", "cr0", "cr1", "cr5",
+			     "cr6", "cr7");
+	*out1 = regs[1];
+	*out2 = regs[2];
+	*out3 = regs[3];
+	*out4 = regs[4];
+	*out5 = regs[5];
+	*out6 = regs[6];
+	*out7 = regs[7];
+	*out8 = regs[8];
+	*out9 = regs[9];
+
+	if (!H_isLongBusy(regs[0]) && (long)regs[0] < 0) {
+		printk(KERN_ERR "HCALL99_IN r3=%lx r4=%lx r5=%lx r6=%lx "
+		       "r7=%lx r8=%lx r9=%lx r10=%lx "
+		       "r11=%lx r12=%lx",
+		       opcode, arg1, arg2, arg3,
+		       arg4, arg5, arg6, arg7,
+		       arg8, arg9);
+		printk(KERN_ERR "HCALL99_OUT r3=%lx r4=%lx r5=%lx "
+		       "r6=%lx r7=%lx r8=%lx r9=%lx r10=%lx "
+		       "r11=%lx r12=%lx",
+		       regs[0], regs[1],
+		       regs[2], regs[3],
+		       regs[4], regs[5],
+		       regs[6], regs[7],
+		       regs[8], regs[9]);
+	}
+	return regs[0];
+}
+
 #endif /* __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif
diff -Nurp ofa_kernel-1.2_old/kernel_addons/backport/2.6.9_U5/include/asm-powerpc/system.h ofa_kernel-1.2_new/kernel_addons/backport/2.6.9_U5/include/asm-powerpc/system.h
--- ofa_kernel-1.2_old/kernel_addons/backport/2.6.9_U5/include/asm-powerpc/system.h	1970-01-01 01:00:00.000000000 +0100
+++ ofa_kernel-1.2_new/kernel_addons/backport/2.6.9_U5/include/asm-powerpc/system.h	2007-05-09 12:49:46.000000000 +0200
@@ -0,0 +1 @@
+#include <asm-ppc64/system.h>


From ossrosch at linux.vnet.ibm.com  Thu May 10 05:41:57 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 14:41:57 +0200
Subject: [ofa-general] [PATCH ofed-1.2-rc3 3/3] ehca: backport for rhel-4.5 -
	use introduced dma_ops
Message-ID: <200705101441.58102.ossrosch@linux.vnet.ibm.com>

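Use the introduced dma_ops: add ehca_dma.c, which implements the
ib_dma_mapping_ops callbacks on top of the ibmebus and generic DMA helpers,
and register the table through ib_device.dma_ops in ehca_init_device().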


Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
---
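
For readers unfamiliar with ops tables, here is a small user-space sketch of
the dispatch pattern this patch plugs into. Everything prefixed mock_ is
hypothetical and is not taken from ib_verbs.h. The sketch uses designated
initializers, which keep the table correct even if the order of the fields
changes; the patch itself initializes the table positionally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mock_device { int id; };

/* Mock ops table; field names are illustrative only. */
struct mock_dma_ops {
	int      (*mapping_error)(struct mock_device *dev, uint64_t dma_addr);
	uint64_t (*map_single)(struct mock_device *dev, void *cpu_addr,
			       size_t size);
	void     (*unmap_single)(struct mock_device *dev, uint64_t addr,
				 size_t size);
};

/* Same convention as ehca_mapping_error(): address 0 means failure. */
static int mock_mapping_error(struct mock_device *dev, uint64_t dma_addr)
{
	(void)dev;
	return dma_addr == 0;
}

/* Identity "mapping", good enough for the sketch. */
static uint64_t mock_map_single(struct mock_device *dev, void *cpu_addr,
				size_t size)
{
	(void)dev;
	(void)size;
	return (uint64_t)(uintptr_t)cpu_addr;
}

static void mock_unmap_single(struct mock_device *dev, uint64_t addr,
			      size_t size)
{
	(void)dev;
	(void)addr;
	(void)size;
}

/* Designated initializers: the order here does not have to match the struct. */
static const struct mock_dma_ops mock_ops = {
	.unmap_single  = mock_unmap_single,
	.mapping_error = mock_mapping_error,
	.map_single    = mock_map_single,
};

/* Caller side: dispatch through the table when a driver installed one. */
static uint64_t mock_dma_map_single(struct mock_device *dev,
				    const struct mock_dma_ops *ops,
				    void *cpu_addr, size_t size)
{
	if (ops)
		return ops->map_single(dev, cpu_addr, size);
	return 0; /* the sketch has no generic fallback path */
}
```

A caller would map a buffer with mock_dma_map_single() and then test the
result with mock_ops.mapping_error() before using it.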


Makefile    |    2
ehca_dma.c  |  194 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ehca_main.c |    2
3 files changed, 197 insertions(+), 1 deletion(-)


diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_dma.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_dma.c
--- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_dma.c	1970-01-01 01:00:00.000000000 +0100
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_dma.c	2007-05-03 16:25:30.000000000 +0200
@@ -0,0 +1,194 @@
+/*
+ *  IBM eServer eHCA Infiniband device driver for Linux on POWER
+ *
+ *  eHCA dma mapping via ibmebus
+ *
+ *  Authors: Stefan Roscher <stefan.roscher at de.ibm.com>
+ *           Hoang-Nam Nguyen <hnguyen at de.ibm.com>
+ *
+ *  Copyright (c) 2007 IBM Corporation
+ *
+ *  All rights reserved.
+ *
+ *  This source code is distributed under a dual license of GPL v2.0 and OpenIB
+ *  BSD.
+ *
+ * OpenIB BSD License
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials
+ * provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+ * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <asm/ibmebus.h>
+#include <rdma/ib_verbs.h>
+
+static int ehca_mapping_error(struct ib_device *dev, u64 dma_addr);
+
+static u64 ehca_dma_map_single(struct ib_device *dev,
+			        void *cpu_addr, size_t size,
+			        enum dma_data_direction direction);
+
+static void ehca_dma_unmap_single(struct ib_device *dev,
+				   u64 addr, size_t size,
+				  enum dma_data_direction direction);
+
+static u64 ehca_dma_map_page(struct ib_device *dev,
+			      struct page *page,
+			      unsigned long offset,
+			      size_t size,
+			     enum dma_data_direction direction);
+
+static void ehca_dma_unmap_page(struct ib_device *dev,
+				 u64 addr, size_t size,
+				enum dma_data_direction direction);
+
+int ehca_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents,
+		enum dma_data_direction direction);
+
+static void ehca_unmap_sg(struct ib_device *dev,
+			   struct scatterlist *sg, int nents,
+			  enum dma_data_direction direction);
+
+static u64 ehca_sg_dma_address(struct ib_device *dev, struct scatterlist *sg);
+
+static unsigned int ehca_sg_dma_len(struct ib_device *dev,
+				    struct scatterlist *sg);
+
+static void ehca_sync_single_for_cpu(struct ib_device *dev,
+				      u64 addr,
+				      size_t size,
+				     enum dma_data_direction dir);
+
+static void ehca_sync_single_for_device(struct ib_device *dev,
+					 u64 addr,
+					 size_t size,
+					enum dma_data_direction dir);
+
+static void *ehca_dma_alloc_coherent(struct ib_device *dev, size_t size,
+				     u64 *dma_handle, gfp_t flag);
+
+static void ehca_dma_free_coherent(struct ib_device *dev, size_t size,
+				   void *cpu_addr, dma_addr_t dma_handle);
+
+struct ib_dma_mapping_ops ehca_dma_mapping_ops = {
+	ehca_mapping_error,
+	ehca_dma_map_single,
+	ehca_dma_unmap_single,
+	ehca_dma_map_page,
+	ehca_dma_unmap_page,
+	ehca_map_sg,
+	ehca_unmap_sg,
+	ehca_sg_dma_address,
+	ehca_sg_dma_len,
+	ehca_sync_single_for_cpu,
+	ehca_sync_single_for_device,
+	ehca_dma_alloc_coherent,
+	ehca_dma_free_coherent
+};
+
+static int ehca_mapping_error(struct ib_device *dev, u64 dma_addr)
+{
+	return dma_addr == 0L;
+}
+
+static u64 ehca_dma_map_single(struct ib_device *dev,
+			        void *cpu_addr, size_t size,
+			        enum dma_data_direction direction)
+{
+	return ibmebus_map_single(dev, cpu_addr, size, direction);
+}
+
+static void ehca_dma_unmap_single(struct ib_device *dev,
+				   u64 addr, size_t size,
+				   enum dma_data_direction direction)
+{
+	ibmebus_unmap_single(dev, addr, size, direction);
+}
+
+static u64 ehca_dma_map_page(struct ib_device *dev,
+			      struct page *page,
+			      unsigned long offset,
+			      size_t size,
+			      enum dma_data_direction direction)
+{
+	return dma_map_page(dev->dma_device, page, offset, size, direction);
+}
+
+static void ehca_dma_unmap_page(struct ib_device *dev,
+				 u64 addr, size_t size,
+				 enum dma_data_direction direction)
+{
+	dma_unmap_page(dev->dma_device, addr, size, direction);
+}
+
+int ehca_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents,
+		 enum dma_data_direction direction)
+{
+	return ibmebus_map_sg(dev, sg, nents, direction);
+}
+
+static void ehca_unmap_sg(struct ib_device *dev,
+			   struct scatterlist *sg, int nents,
+			   enum dma_data_direction direction)
+{
+	ibmebus_unmap_sg(dev, sg, nents, direction);
+}
+
+static u64 ehca_sg_dma_address(struct ib_device *dev, struct scatterlist *sg)
+{
+	return sg_dma_address(sg);
+}
+
+static unsigned int ehca_sg_dma_len(struct ib_device *dev,
+				     struct scatterlist *sg)
+{
+	return sg_dma_len(sg);
+}
+
+static void ehca_sync_single_for_cpu(struct ib_device *dev,
+				      u64 addr,
+				      size_t size,
+				      enum dma_data_direction dir)
+{
+	dma_sync_single_for_cpu(dev->dma_device, addr, size, dir);
+}
+
+static void ehca_sync_single_for_device(struct ib_device *dev,
+					 u64 addr,
+					 size_t size,
+					 enum dma_data_direction dir)
+{
+	dma_sync_single_for_device(dev->dma_device, addr, size, dir);
+}
+
+static void *ehca_dma_alloc_coherent(struct ib_device *dev, size_t size,
+				      u64 *dma_handle, gfp_t flag)
+{
+	return ibmebus_alloc_coherent(dev, size, dma_handle, flag);
+}
+
+static void ehca_dma_free_coherent(struct ib_device *dev, size_t size,
+				    void *cpu_addr, dma_addr_t dma_handle)
+{
+	ibmebus_free_coherent(dev, size, cpu_addr, dma_handle);
+}
diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c
--- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_main.c	2007-04-29 15:10:56.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-03 16:19:28.000000000 +0200
@@ -279,6 +279,7 @@ init_node_guid1:
 
 int ehca_init_device(struct ehca_shca *shca)
 {
+	extern struct ib_dma_mapping_ops ehca_dma_mapping_ops;
 	int ret;
 
 	ret = init_node_guid(shca);
@@ -354,6 +355,7 @@ int ehca_init_device(struct ehca_shca *s
 	shca->ib_device.detach_mcast	    = ehca_detach_mcast;
 	/* shca->ib_device.process_mad	    = ehca_process_mad;	    */
 	shca->ib_device.mmap		    = ehca_mmap;
+	shca->ib_device.dma_ops             = &ehca_dma_mapping_ops;
 
 	return ret;
 }
diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/Makefile ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/Makefile
--- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/Makefile	2007-04-29 15:10:56.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/Makefile	2007-05-03 16:26:13.000000000 +0200
@@ -12,5 +12,5 @@ obj-$(CONFIG_INFINIBAND_EHCA) += ib_ehca
 
 ib_ehca-objs  = ehca_main.o ehca_hca.o ehca_mcast.o ehca_pd.o ehca_av.o ehca_eq.o \
 		ehca_cq.o ehca_qp.o ehca_sqp.o ehca_mrmw.o ehca_reqs.o ehca_irq.o \
-		ehca_uverbs.o ipz_pt_fn.o hcp_if.o hcp_phyp.o
+		ehca_uverbs.o ehca_dma.o ipz_pt_fn.o hcp_if.o hcp_phyp.o


From ossrosch at linux.vnet.ibm.com  Thu May 10 05:41:52 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 14:41:52 +0200
Subject: [ofa-general] [PATCH ofed-1.2-rc3 2/3] ehca: backport for rhel-4.5 -
	mmap functionality
Message-ID: <200705101441.52922.ossrosch@linux.vnet.ibm.com>

Change the ehca module back to the older mmap functionality, because kernel
2.6.9 lacks vm_insert_page() support.


Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
---
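
The mmap calls in this patch pack the resource token and a type code into
the file offset, and ehca_nopage() unpacks them again from vm_pgoff. A
user-space sketch of that round trip follows. The mock_ names and the
4K page shift are assumptions, and the decoding of the second nibble
(bits 24..27) is my guess at how the sub-resource is told apart; only the
handle and q_type extraction are visible in the diff.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_PAGE_SHIFT 12	/* assumption: 4K pages */

struct mock_decoded {
	uint32_t idr_handle;	/* token used for the idr lookup */
	uint32_t q_type;	/* bits 28..31: 1 = CQ, 2 = QP in this patch */
	uint32_t rsrc_type;	/* bits 24..27: assumed sub-resource selector */
};

/* Build the offset the way the create paths do: token << 32 | type bits. */
static uint64_t mock_encode_foffset(uint32_t token, uint32_t type_bits)
{
	/* Type codes such as 0x12000000 leave the low 24 bits clear. */
	assert((type_bits & 0x00FFFFFFu) == 0);
	return ((uint64_t)token << 32) | type_bits;
}

/* Undo the encoding starting from a page offset, as a nopage handler would. */
static struct mock_decoded mock_decode_pgoff(uint64_t vm_pgoff)
{
	uint64_t fileoffset = vm_pgoff << MOCK_PAGE_SHIFT;
	struct mock_decoded d;

	d.idr_handle = (uint32_t)(fileoffset >> 32);
	d.q_type = (uint32_t)((fileoffset >> 28) & 0xF);
	d.rsrc_type = (uint32_t)((fileoffset >> 24) & 0xF);
	return d;
}
```

Because the low 24 bits of the offset are always zero, converting to a page
offset and back loses nothing, for 4K and for 64K pages alike.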

ehca_classes.h |   29 +++-
ehca_cq.c      |   65 +++++++--
ehca_iverbs.h  |   10 +
ehca_main.c    |    8 -
ehca_qp.c      |   78 +++++++++--
ehca_uverbs.c  |  395 +++++++++++++++++++++++++++++++++------------------------

6 files changed, 379 insertions(+), 206 deletions(-)


diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_classes.h
--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h	2007-05-04 10:38:23.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_classes.h	2007-05-04 10:40:06.000000000 +0200
@@ -126,14 +126,13 @@ struct ehca_qp {
 	struct ipz_qp_handle ipz_qp_handle;
 	struct ehca_pfqp pf;
 	struct ib_qp_init_attr init_attr;
+	u64 uspace_squeue;
+	u64 uspace_rqueue;
+	u64 uspace_fwh;
 	struct ehca_cq *send_cq;
 	struct ehca_cq *recv_cq;
 	unsigned int sqerr_purgeflag;
 	struct hlist_node list_entries;
-	/* mmap counter for resources mapped into user space */
-	u32 mm_count_squeue;
-	u32 mm_count_rqueue;
-	u32 mm_count_galpa;
 };
 
 /* must be power of 2 */
@@ -150,6 +149,8 @@ struct ehca_cq {
 	struct ipz_cq_handle ipz_cq_handle;
 	struct ehca_pfcq pf;
 	spinlock_t cb_lock;
+	u64 uspace_queue;
+	u64 uspace_fwh;
 	struct hlist_head qp_hashtab[QP_HASHTAB_LEN];
 	struct list_head entry;
 	u32 nr_callbacks; /* #events assigned to cpu by scaling code */
@@ -157,9 +158,6 @@ struct ehca_cq {
 	wait_queue_head_t wait_completion;
 	spinlock_t task_lock;
 	u32 ownpid;
-	/* mmap counter for resources mapped into user space */
-	u32 mm_count_queue;
-	u32 mm_count_galpa;
 };
 
 enum ehca_mr_flag {
@@ -259,6 +257,20 @@ struct ehca_ucontext {
 	struct ib_ucontext ib_ucontext;
 };
 
+struct ehca_module *ehca_module_new(void);
+
+int ehca_module_delete(struct ehca_module *me);
+
+int ehca_eq_ctor(struct ehca_eq *eq);
+
+int ehca_eq_dtor(struct ehca_eq *eq);
+
+struct ehca_shca *ehca_shca_new(void);
+
+int ehca_shca_delete(struct ehca_shca *me);
+
+struct ehca_sport *ehca_sport_new(struct ehca_shca *anchor);
+
 int ehca_init_pd_cache(void);
 void ehca_cleanup_pd_cache(void);
 int ehca_init_cq_cache(void);
@@ -282,6 +294,7 @@ extern int ehca_use_hp_mr;
 extern int ehca_scaling_code;
 
 struct ipzu_queue_resp {
+	u64 queue;        /* points to first queue entry */
 	u32 qe_size;      /* queue entry size */
 	u32 act_nr_of_sg;
 	u32 queue_length; /* queue length allocated in bytes */
@@ -294,6 +307,7 @@ struct ehca_create_cq_resp {
 	u32 cq_number;
 	u32 token;
 	struct ipzu_queue_resp ipz_queue;
+	struct h_galpas galpas;
 };
 
 struct ehca_create_qp_resp {
@@ -306,6 +320,7 @@ struct ehca_create_qp_resp {
 	u32 dummy; /* padding for 8 byte alignment */
 	struct ipzu_queue_resp ipz_squeue;
 	struct ipzu_queue_resp ipz_rqueue;
+	struct h_galpas galpas;
 };
 
 struct ehca_alloc_cq_parms {
diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c
--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c	2007-05-04 10:38:23.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c	2007-05-04 10:40:06.000000000 +0200
@@ -268,6 +268,7 @@ struct ib_cq *ehca_create_cq(struct ib_d
 	if (context) {
 		struct ipz_queue *ipz_queue = &my_cq->ipz_queue;
 		struct ehca_create_cq_resp resp;
+		struct vm_area_struct *vma;
 		memset(&resp, 0, sizeof(resp));
 		resp.cq_number = my_cq->cq_number;
 		resp.token = my_cq->token;
@@ -276,14 +277,40 @@ struct ib_cq *ehca_create_cq(struct ib_d
 		resp.ipz_queue.queue_length = ipz_queue->queue_length;
 		resp.ipz_queue.pagesize = ipz_queue->pagesize;
 		resp.ipz_queue.toggle_state = ipz_queue->toggle_state;
+		ret = ehca_mmap_nopage(((u64)(my_cq->token) << 32) | 0x12000000,
+				       ipz_queue->queue_length,
+				       (void**)&resp.ipz_queue.queue,
+				       &vma);
+		if (ret) {
+			ehca_err(device, "Could not mmap queue pages");
+			cq = ERR_PTR(ret);
+			goto create_cq_exit4;
+		}
+		my_cq->uspace_queue = resp.ipz_queue.queue;
+		resp.galpas = my_cq->galpas;
+		ret = ehca_mmap_register(my_cq->galpas.user.fw_handle,
+					 (void**)&resp.galpas.kernel.fw_handle,
+					 &vma);
+		if (ret) {
+			ehca_err(device, "Could not mmap fw_handle");
+			cq = ERR_PTR(ret);
+			goto create_cq_exit5;
+		}
+		my_cq->uspace_fwh = (u64)resp.galpas.kernel.fw_handle;
 		if (ib_copy_to_udata(udata, &resp, sizeof(resp))) {
 			ehca_err(device, "Copy to udata failed.");
-			goto create_cq_exit4;
+			goto create_cq_exit6;
 		}
 	}
 
 	return cq;
 
+create_cq_exit6:
+	ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE);
+
+create_cq_exit5:
+	ehca_munmap(my_cq->uspace_queue, my_cq->ipz_queue.queue_length);
+
 create_cq_exit4:
 	ipz_queue_dtor(&my_cq->ipz_queue);
 
@@ -317,6 +344,7 @@ static int get_cq_nr_events(struct ehca_
 int ehca_destroy_cq(struct ib_cq *cq)
 {
 	u64 h_ret;
+	int ret;
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 	int cq_num = my_cq->cq_number;
 	struct ib_device *device = cq->device;
@@ -326,20 +354,6 @@ int ehca_destroy_cq(struct ib_cq *cq)
 	u32 cur_pid = current->tgid;
 	unsigned long flags;
 
-	if (cq->uobject) {
-		if (my_cq->mm_count_galpa || my_cq->mm_count_queue) {
-			ehca_err(device, "Resources still referenced in "
-				 "user space cq_num=%x", my_cq->cq_number);
-			return -EINVAL;
-		}
-		if (my_cq->ownpid != cur_pid) {
-			ehca_err(device, "Invalid caller pid=%x ownpid=%x "
-				 "cq_num=%x",
-				 cur_pid, my_cq->ownpid, my_cq->cq_number);
-			return -EINVAL;
-		}
-	}
-
 	spin_lock_irqsave(&ehca_cq_idr_lock, flags);
 	while (my_cq->nr_events) {
 		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
@@ -351,6 +365,25 @@ int ehca_destroy_cq(struct ib_cq *cq)
 	idr_remove(&ehca_cq_idr, my_cq->token);
 	spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 
+	if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) {
+		ehca_err(device, "Invalid caller pid=%x ownpid=%x",
+			 cur_pid, my_cq->ownpid);
+		return -EINVAL;
+	}
+
+	/* un-mmap if vma alloc */
+	if (my_cq->uspace_queue) {
+		ret = ehca_munmap(my_cq->uspace_queue,
+				  my_cq->ipz_queue.queue_length);
+		if (ret)
+			ehca_err(device, "Could not munmap queue ehca_cq=%p "
+				 "cq_num=%x", my_cq, cq_num);
+		ret = ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE);
+		if (ret)
+			ehca_err(device, "Could not munmap fwh ehca_cq=%p "
+				 "cq_num=%x", my_cq, cq_num);
+	}
+
 	h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0);
 	if (h_ret == H_R_STATE) {
 		/* cq in err: read err data and destroy it forcibly */
@@ -379,7 +412,7 @@ int ehca_resize_cq(struct ib_cq *cq, int
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 	u32 cur_pid = current->tgid;
 
-	if (cq->uobject && my_cq->ownpid != cur_pid) {
+	if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) {
 		ehca_err(cq->device, "Invalid caller pid=%x ownpid=%x",
 			 cur_pid, my_cq->ownpid);
 		return -EINVAL;
diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_iverbs.h ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_iverbs.h
--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_iverbs.h	2007-05-04 10:38:23.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_iverbs.h	2007-04-29 15:10:56.000000000 +0200
@@ -171,11 +171,19 @@ int ehca_mmap(struct ib_ucontext *contex
 
 void ehca_poll_eqs(unsigned long data);
 
+int ehca_mmap_nopage(u64 foffset, u64 length, void **mapped,
+		     struct vm_area_struct **vma);
+
+int ehca_mmap_register(u64 physical, void **mapped,
+		       struct vm_area_struct **vma);
+
+int ehca_munmap(unsigned long addr, size_t len);
+
 #ifdef CONFIG_PPC_64K_PAGES
 void *ehca_alloc_fw_ctrlblock(gfp_t flags);
 void ehca_free_fw_ctrlblock(void *ptr);
 #else
-#define ehca_alloc_fw_ctrlblock(flags) ((void*) get_zeroed_page(flags))
+#define ehca_alloc_fw_ctrlblock(flags) ((void *) get_zeroed_page(flags))
 #define ehca_free_fw_ctrlblock(ptr) free_page((unsigned long)(ptr))
 #endif
 
diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c
--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-04 10:38:23.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-04 10:40:06.000000000 +0200
@@ -52,7 +52,7 @@
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_AUTHOR("Christoph Raisch <raisch at de.ibm.com>");
 MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver");
-MODULE_VERSION("SVNEHCA_0022");
+MODULE_VERSION("SVNEHCA_0019");
 
 int ehca_open_aqp1     = 0;
 int ehca_debug_level   = 0;
@@ -293,7 +293,7 @@ int ehca_init_device(struct ehca_shca *s
 	strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX);
 	shca->ib_device.owner               = THIS_MODULE;
 
-	shca->ib_device.uverbs_abi_ver	    = 6;
+	shca->ib_device.uverbs_abi_ver	    = 5;
 	shca->ib_device.uverbs_cmd_mask	    =
 		(1ull << IB_USER_VERBS_CMD_GET_CONTEXT)		|
 		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE)	|
@@ -357,7 +357,7 @@ int ehca_init_device(struct ehca_shca *s
 	shca->ib_device.dealloc_fmr	    = ehca_dealloc_fmr;
 	shca->ib_device.attach_mcast	    = ehca_attach_mcast;
 	shca->ib_device.detach_mcast	    = ehca_detach_mcast;
-	/* shca->ib_device.process_mad	    = ehca_process_mad;     */
+	/* shca->ib_device.process_mad	    = ehca_process_mad;	    */
 	shca->ib_device.mmap		    = ehca_mmap;
 
 	return ret;
@@ -811,7 +811,7 @@ int __init ehca_module_init(void)
 	int ret;
 
 	printk(KERN_INFO "eHCA Infiniband Device Driver "
-	                 "(Rel.: SVNEHCA_0022)\n");
+	                 "(Rel.: SVNEHCA_0019)\n");
 	idr_init(&ehca_qp_idr);
 	idr_init(&ehca_cq_idr);
 	spin_lock_init(&ehca_qp_idr_lock);
diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c
--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c	2007-05-04 10:38:23.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c	2007-04-29 15:10:56.000000000 +0200
@@ -637,6 +637,7 @@ struct ib_qp *ehca_create_qp(struct ib_p
 		struct ipz_queue *ipz_rqueue = &my_qp->ipz_rqueue;
 		struct ipz_queue *ipz_squeue = &my_qp->ipz_squeue;
 		struct ehca_create_qp_resp resp;
+		struct vm_area_struct *vma;
 		memset(&resp, 0, sizeof(resp));
 
 		resp.qp_num = my_qp->real_qp_num;
@@ -650,21 +651,59 @@ struct ib_qp *ehca_create_qp(struct ib_p
 		resp.ipz_rqueue.queue_length = ipz_rqueue->queue_length;
 		resp.ipz_rqueue.pagesize = ipz_rqueue->pagesize;
 		resp.ipz_rqueue.toggle_state = ipz_rqueue->toggle_state;
+		ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x22000000,
+				       ipz_rqueue->queue_length,
+				       (void**)&resp.ipz_rqueue.queue,
+				       &vma);
+		if (ret) {
+			ehca_err(pd->device, "Could not mmap rqueue pages");
+			goto create_qp_exit3;
+		}
+		my_qp->uspace_rqueue = resp.ipz_rqueue.queue;
 		/* squeue properties */
 		resp.ipz_squeue.qe_size = ipz_squeue->qe_size;
 		resp.ipz_squeue.act_nr_of_sg = ipz_squeue->act_nr_of_sg;
 		resp.ipz_squeue.queue_length = ipz_squeue->queue_length;
 		resp.ipz_squeue.pagesize = ipz_squeue->pagesize;
 		resp.ipz_squeue.toggle_state = ipz_squeue->toggle_state;
+		ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x23000000,
+				       ipz_squeue->queue_length,
+				       (void**)&resp.ipz_squeue.queue,
+				       &vma);
+		if (ret) {
+			ehca_err(pd->device, "Could not mmap squeue pages");
+			goto create_qp_exit4;
+		}
+		my_qp->uspace_squeue = resp.ipz_squeue.queue;
+		/* fw_handle */
+		resp.galpas = my_qp->galpas;
+		ret = ehca_mmap_register(my_qp->galpas.user.fw_handle,
+					 (void**)&resp.galpas.kernel.fw_handle,
+					 &vma);
+		if (ret) {
+			ehca_err(pd->device, "Could not mmap fw_handle");
+			goto create_qp_exit5;
+		}
+		my_qp->uspace_fwh = (u64)resp.galpas.kernel.fw_handle;
+
 		if (ib_copy_to_udata(udata, &resp, sizeof resp)) {
 			ehca_err(pd->device, "Copy to udata failed");
 			ret = -EINVAL;
-			goto create_qp_exit3;
+			goto create_qp_exit6;
 		}
 	}
 
 	return &my_qp->ib_qp;
 
+create_qp_exit6:
+	ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE);
+
+create_qp_exit5:
+	ehca_munmap(my_qp->uspace_squeue, my_qp->ipz_squeue.queue_length);
+
+create_qp_exit4:
+	ehca_munmap(my_qp->uspace_rqueue, my_qp->ipz_rqueue.queue_length);
+
 create_qp_exit3:
 	ipz_queue_dtor(&my_qp->ipz_rqueue);
 	ipz_queue_dtor(&my_qp->ipz_squeue);
@@ -892,7 +931,7 @@ static int internal_modify_qp(struct ib_
 	     my_qp->qp_type == IB_QPT_SMI) &&
 	    statetrans == IB_QPST_SQE2RTS) {
 		/* mark next free wqe if kernel */
-		if (!ibqp->uobject) {
+		if (my_qp->uspace_squeue == 0) {
 			struct ehca_wqe *wqe;
 			/* lock send queue */
 			spin_lock_irqsave(&my_qp->spinlock_s, spl_flags);
@@ -1378,18 +1417,11 @@ int ehca_destroy_qp(struct ib_qp *ibqp)
 	enum ib_qp_type	qp_type;
 	unsigned long flags;
 
-	if (ibqp->uobject) {
-		if (my_qp->mm_count_galpa ||
-		    my_qp->mm_count_rqueue || my_qp->mm_count_squeue) {
-			ehca_err(ibqp->device, "Resources still referenced in "
-				 "user space qp_num=%x", ibqp->qp_num);
-			return -EINVAL;
-		}
-		if (my_pd->ownpid != cur_pid) {
-			ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x",
-				 cur_pid, my_pd->ownpid);
-			return -EINVAL;
-		}
+	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
+	    my_pd->ownpid != cur_pid) {
+		ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x",
+			 cur_pid, my_pd->ownpid);
+		return -EINVAL;
 	}
 
 	if (my_qp->send_cq) {
@@ -1407,6 +1439,24 @@ int ehca_destroy_qp(struct ib_qp *ibqp)
 	idr_remove(&ehca_qp_idr, my_qp->token);
 	spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
 
+	/* un-mmap if vma alloc */
+	if (my_qp->uspace_rqueue) {
+		ret = ehca_munmap(my_qp->uspace_rqueue,
+				  my_qp->ipz_rqueue.queue_length);
+		if (ret)
+			ehca_err(ibqp->device, "Could not munmap rqueue "
+				 "qp_num=%x", qp_num);
+		ret = ehca_munmap(my_qp->uspace_squeue,
+				  my_qp->ipz_squeue.queue_length);
+		if (ret)
+			ehca_err(ibqp->device, "Could not munmap squeue "
+				 "qp_num=%x", qp_num);
+		ret = ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE);
+		if (ret)
+			ehca_err(ibqp->device, "Could not munmap fwh qp_num=%x",
+				 qp_num);
+	}
+
 	h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp);
 	if (h_ret != H_SUCCESS) {
 		ehca_err(ibqp->device, "hipz_h_destroy_qp() failed rc=%lx "
diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_uverbs.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_uverbs.c
--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_uverbs.c	2007-05-04 10:38:23.000000000 +0200
+++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_uverbs.c	2007-04-29 15:10:56.000000000 +0200
@@ -68,183 +68,105 @@ int ehca_dealloc_ucontext(struct ib_ucon
 	return 0;
 }
 
-static void ehca_mm_open(struct vm_area_struct *vma)
+struct page *ehca_nopage(struct vm_area_struct *vma,
+			 unsigned long address, int *type)
 {
-	u32 *count = (u32*)vma->vm_private_data;
-	if (!count) {
-		ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx",
-			     vma->vm_start, vma->vm_end);
-		return;
-	}
-	(*count)++;
-	if (!(*count))
-		ehca_gen_err("Use count overflow vm_start=%lx vm_end=%lx",
-			     vma->vm_start, vma->vm_end);
-	ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x",
-		     vma->vm_start, vma->vm_end, *count);
-}
-
-static void ehca_mm_close(struct vm_area_struct *vma)
-{
-	u32 *count = (u32*)vma->vm_private_data;
-	if (!count) {
-		ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx",
-			     vma->vm_start, vma->vm_end);
-		return;
-	}
-	(*count)--;
-	ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x",
-		     vma->vm_start, vma->vm_end, *count);
-}
-
-static struct vm_operations_struct vm_ops = {
-	.open =	ehca_mm_open,
-	.close = ehca_mm_close,
-};
-
-static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas,
-			u32 *mm_count)
-{
-	int ret;
-	u64 vsize, physical;
-
-	vsize = vma->vm_end - vma->vm_start;
-	if (vsize != EHCA_PAGESIZE) {
-		ehca_gen_err("invalid vsize=%lx", vma->vm_end - vma->vm_start);
-		return -EINVAL;
-	}
-
-	physical = galpas->user.fw_handle;
-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	ehca_gen_dbg("vsize=%lx physical=%lx", vsize, physical);
-	/* VM_IO | VM_RESERVED are set by remap_pfn_range() */
-	ret = remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT,
-			      vsize, vma->vm_page_prot);
-	if (unlikely(ret)) {
-		ehca_gen_err("remap_pfn_range() failed ret=%x", ret);
-		return -ENOMEM;
-	}
-
-	vma->vm_private_data = mm_count;
-	(*mm_count)++;
-	vma->vm_ops = &vm_ops;
-
-	return 0;
-}
+	struct page *mypage = NULL;
+	u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT;
+	u32 idr_handle = fileoffset >> 32;
+	u32 q_type = (fileoffset >> 28) & 0xF;	  /* CQ, QP,...        */
+	u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */
+	u32 cur_pid = current->tgid;
+	unsigned long flags;
+	struct ehca_cq *cq;
+	struct ehca_qp *qp;
+	struct ehca_pd *pd;
+	u64 offset;
+	void *vaddr;
 
-static int ehca_mmap_queue(struct vm_area_struct *vma, struct ipz_queue *queue,
-			   u32 *mm_count)
-{
-	int ret;
-	u64 start, ofs;
-	struct page *page;
+	switch (q_type) {
+	case 1: /* CQ */
+		spin_lock_irqsave(&ehca_cq_idr_lock, flags);
+		cq = idr_find(&ehca_cq_idr, idr_handle);
+		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 
-	vma->vm_flags |= VM_RESERVED;
-	start = vma->vm_start;
-	for (ofs = 0; ofs < queue->queue_length; ofs += PAGE_SIZE) {
-		u64 virt_addr = (u64)ipz_qeit_calc(queue, ofs);
-		page = virt_to_page(virt_addr);
-		ret = vm_insert_page(vma, start, page);
-		if (unlikely(ret)) {
-			ehca_gen_err("vm_insert_page() failed rc=%x", ret);
-			return ret;
+		/* make sure this mmap really belongs to the authorized user */
+		if (!cq) {
+			ehca_gen_err("cq is NULL ret=NOPAGE_SIGBUS");
+			return NOPAGE_SIGBUS;
 		}
-		start +=  PAGE_SIZE;
-	}
-	vma->vm_private_data = mm_count;
-	(*mm_count)++;
-	vma->vm_ops = &vm_ops;
 
-	return 0;
-}
-
-static int ehca_mmap_cq(struct vm_area_struct *vma, struct ehca_cq *cq,
-			u32 rsrc_type)
-{
-	int ret;
-
-	switch (rsrc_type) {
-	case 1: /* galpa fw handle */
-		ehca_dbg(cq->ib_cq.device, "cq_num=%x fw", cq->cq_number);
-		ret = ehca_mmap_fw(vma, &cq->galpas, &cq->mm_count_galpa);
-		if (unlikely(ret)) {
+		if (cq->ownpid != cur_pid) {
 			ehca_err(cq->ib_cq.device,
-				 "ehca_mmap_fw() failed rc=%x cq_num=%x",
-				 ret, cq->cq_number);
-			return ret;
+				 "Invalid caller pid=%x ownpid=%x",
+				 cur_pid, cq->ownpid);
+			return NOPAGE_SIGBUS;
 		}
-		break;
 
-	case 2: /* cq queue_addr */
-		ehca_dbg(cq->ib_cq.device, "cq_num=%x queue", cq->cq_number);
-		ret = ehca_mmap_queue(vma, &cq->ipz_queue, &cq->mm_count_queue);
-		if (unlikely(ret)) {
-			ehca_err(cq->ib_cq.device,
-				 "ehca_mmap_queue() failed rc=%x cq_num=%x",
-				 ret, cq->cq_number);
-			return ret;
+		if (rsrc_type == 2) {
+			ehca_dbg(cq->ib_cq.device, "cq=%p cq queuearea", cq);
+			offset = address - vma->vm_start;
+			vaddr = ipz_qeit_calc(&cq->ipz_queue, offset);
+			ehca_dbg(cq->ib_cq.device, "offset=%lx vaddr=%p",
+				 offset, vaddr);
+			mypage = virt_to_page(vaddr);
 		}
 		break;
 
-	default:
-		ehca_err(cq->ib_cq.device, "bad resource type=%x cq_num=%x",
-			 rsrc_type, cq->cq_number);
-		return -EINVAL;
-	}
-
-	return 0;
-}
-
-static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp,
-			u32 rsrc_type)
-{
-	int ret;
+	case 2: /* QP */
+		spin_lock_irqsave(&ehca_qp_idr_lock, flags);
+		qp = idr_find(&ehca_qp_idr, idr_handle);
+		spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
 
-	switch (rsrc_type) {
-	case 1: /* galpa fw handle */
-		ehca_dbg(qp->ib_qp.device, "qp_num=%x fw", qp->ib_qp.qp_num);
-		ret = ehca_mmap_fw(vma, &qp->galpas, &qp->mm_count_galpa);
-		if (unlikely(ret)) {
-			ehca_err(qp->ib_qp.device,
-				 "remap_pfn_range() failed ret=%x qp_num=%x",
-				 ret, qp->ib_qp.qp_num);
-			return -ENOMEM;
+		/* make sure this mmap really belongs to the authorized user */
+		if (!qp) {
+			ehca_gen_err("qp is NULL ret=NOPAGE_SIGBUS");
+			return NOPAGE_SIGBUS;
 		}
-		break;
 
-	case 2: /* qp rqueue_addr */
-		ehca_dbg(qp->ib_qp.device, "qp_num=%x rqueue",
-			 qp->ib_qp.qp_num);
-		ret = ehca_mmap_queue(vma, &qp->ipz_rqueue, &qp->mm_count_rqueue);
-		if (unlikely(ret)) {
+		pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd);
+		if (pd->ownpid != cur_pid) {
 			ehca_err(qp->ib_qp.device,
-				 "ehca_mmap_queue(rq) failed rc=%x qp_num=%x",
-				 ret, qp->ib_qp.qp_num);
-			return ret;
+				 "Invalid caller pid=%x ownpid=%x",
+				 cur_pid, pd->ownpid);
+			return NOPAGE_SIGBUS;
 		}
-		break;
 
-	case 3: /* qp squeue_addr */
-		ehca_dbg(qp->ib_qp.device, "qp_num=%x squeue",
-			 qp->ib_qp.qp_num);
-		ret = ehca_mmap_queue(vma, &qp->ipz_squeue, &qp->mm_count_squeue);
-		if (unlikely(ret)) {
-			ehca_err(qp->ib_qp.device,
-				 "ehca_mmap_queue(sq) failed rc=%x qp_num=%x",
-				 ret, qp->ib_qp.qp_num);
-			return ret;
+		if (rsrc_type == 2) {	/* rqueue */
+			ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueuearea", qp);
+			offset = address - vma->vm_start;
+			vaddr = ipz_qeit_calc(&qp->ipz_rqueue, offset);
+			ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p",
+				 offset, vaddr);
+			mypage = virt_to_page(vaddr);
+		} else if (rsrc_type == 3) {	/* squeue */
+			ehca_dbg(qp->ib_qp.device, "qp=%p qp squeuearea", qp);
+			offset = address - vma->vm_start;
+			vaddr = ipz_qeit_calc(&qp->ipz_squeue, offset);
+			ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p",
+				 offset, vaddr);
+			mypage = virt_to_page(vaddr);
 		}
 		break;
 
 	default:
-		ehca_err(qp->ib_qp.device, "bad resource type=%x qp=num=%x",
-			 rsrc_type, qp->ib_qp.qp_num);
-		return -EINVAL;
+		ehca_gen_err("bad queue type %x", q_type);
+		return NOPAGE_SIGBUS;
 	}
 
-	return 0;
+	if (!mypage) {
+		ehca_gen_err("Invalid page adr==NULL ret=NOPAGE_SIGBUS");
+		return NOPAGE_SIGBUS;
+	}
+	get_page(mypage);
+
+	return mypage;
 }
 
+static struct vm_operations_struct ehcau_vm_ops = {
+	.nopage = ehca_nopage,
+};
+
 int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 {
 	u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT;
@@ -253,6 +175,7 @@ int ehca_mmap(struct ib_ucontext *contex
 	u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */
 	u32 cur_pid = current->tgid;
 	u32 ret;
+	u64 vsize, physical;
 	unsigned long flags;
 	struct ehca_cq *cq;
 	struct ehca_qp *qp;
@@ -278,12 +201,44 @@ int ehca_mmap(struct ib_ucontext *contex
 		if (!cq->ib_cq.uobject || cq->ib_cq.uobject->context != context)
 			return -EINVAL;
 
-		ret = ehca_mmap_cq(vma, cq, rsrc_type);
-		if (unlikely(ret)) {
-			ehca_err(cq->ib_cq.device,
-				 "ehca_mmap_cq() failed rc=%x cq_num=%x",
-				 ret, cq->cq_number);
-			return ret;
+		switch (rsrc_type) {
+		case 1: /* galpa fw handle */
+			ehca_dbg(cq->ib_cq.device, "cq=%p cq triggerarea", cq);
+			vma->vm_flags |= VM_RESERVED;
+			vsize = vma->vm_end - vma->vm_start;
+			if (vsize != EHCA_PAGESIZE) {
+				ehca_err(cq->ib_cq.device, "invalid vsize=%lx",
+					 vma->vm_end - vma->vm_start);
+				return -EINVAL;
+			}
+
+			physical = cq->galpas.user.fw_handle;
+			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+			vma->vm_flags |= VM_IO | VM_RESERVED;
+
+			ehca_dbg(cq->ib_cq.device,
+				 "vsize=%lx physical=%lx", vsize, physical);
+			ret = remap_pfn_range(vma, vma->vm_start,
+					      physical >> PAGE_SHIFT, vsize,
+					      vma->vm_page_prot);
+			if (ret) {
+				ehca_err(cq->ib_cq.device,
+					 "remap_pfn_range() failed ret=%x",
+					 ret);
+				return -ENOMEM;
+			}
+			break;
+
+		case 2: /* cq queue_addr */
+			ehca_dbg(cq->ib_cq.device, "cq=%p cq q_addr", cq);
+			vma->vm_flags |= VM_RESERVED;
+			vma->vm_ops = &ehcau_vm_ops;
+			break;
+
+		default:
+			ehca_err(cq->ib_cq.device, "bad resource type %x",
+				 rsrc_type);
+			return -EINVAL;
 		}
 		break;
 
@@ -307,12 +262,50 @@ int ehca_mmap(struct ib_ucontext *contex
 		if (!qp->ib_qp.uobject || qp->ib_qp.uobject->context != context)
 			return -EINVAL;
 
-		ret = ehca_mmap_qp(vma, qp, rsrc_type);
-		if (unlikely(ret)) {
-			ehca_err(qp->ib_qp.device,
-				 "ehca_mmap_qp() failed rc=%x qp_num=%x",
-				 ret, qp->ib_qp.qp_num);
-			return ret;
+		switch (rsrc_type) {
+		case 1: /* galpa fw handle */
+			ehca_dbg(qp->ib_qp.device, "qp=%p qp triggerarea", qp);
+			vma->vm_flags |= VM_RESERVED;
+			vsize = vma->vm_end - vma->vm_start;
+			if (vsize != EHCA_PAGESIZE) {
+				ehca_err(qp->ib_qp.device, "invalid vsize=%lx",
+					 vma->vm_end - vma->vm_start);
+				return -EINVAL;
+			}
+
+			physical = qp->galpas.user.fw_handle;
+			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+			vma->vm_flags |= VM_IO | VM_RESERVED;
+
+			ehca_dbg(qp->ib_qp.device, "vsize=%lx physical=%lx",
+				 vsize, physical);
+			ret = remap_pfn_range(vma, vma->vm_start,
+					      physical >> PAGE_SHIFT, vsize,
+					      vma->vm_page_prot);
+			if (ret) {
+				ehca_err(qp->ib_qp.device,
+					 "remap_pfn_range() failed ret=%x",
+					 ret);
+				return -ENOMEM;
+			}
+			break;
+
+		case 2: /* qp rqueue_addr */
+			ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueue_addr", qp);
+			vma->vm_flags |= VM_RESERVED;
+			vma->vm_ops = &ehcau_vm_ops;
+			break;
+
+		case 3: /* qp squeue_addr */
+			ehca_dbg(qp->ib_qp.device, "qp=%p qp squeue_addr", qp);
+			vma->vm_flags |= VM_RESERVED;
+			vma->vm_ops = &ehcau_vm_ops;
+			break;
+
+		default:
+			ehca_err(qp->ib_qp.device, "bad resource type %x",
+				 rsrc_type);
+			return -EINVAL;
 		}
 		break;
 
@@ -323,3 +316,77 @@ int ehca_mmap(struct ib_ucontext *contex
 
 	return 0;
 }
+
+int ehca_mmap_nopage(u64 foffset, u64 length, void **mapped,
+		     struct vm_area_struct **vma)
+{
+	down_write(&current->mm->mmap_sem);
+	*mapped = (void*)do_mmap(NULL,0, length, PROT_WRITE,
+				 MAP_SHARED | MAP_ANONYMOUS,
+				 foffset);
+	up_write(&current->mm->mmap_sem);
+	if (!(*mapped)) {
+		ehca_gen_err("couldn't mmap foffset=%lx length=%lx",
+			     foffset, length);
+		return -EINVAL;
+	}
+
+	*vma = find_vma(current->mm, (u64)*mapped);
+	if (!(*vma)) {
+		down_write(&current->mm->mmap_sem);
+		do_munmap(current->mm, (unsigned long)*mapped, length);
+		up_write(&current->mm->mmap_sem);
+		ehca_gen_err("couldn't find vma queue=%p", *mapped);
+		return -EINVAL;
+	}
+	(*vma)->vm_flags |= VM_RESERVED;
+	(*vma)->vm_ops = &ehcau_vm_ops;
+
+	return 0;
+}
+
+int ehca_mmap_register(u64 physical, void **mapped,
+		       struct vm_area_struct **vma)
+{
+	int ret;
+	unsigned long vsize;
+	/* ehca hw supports only 4k page */
+	ret = ehca_mmap_nopage(0, EHCA_PAGESIZE, mapped, vma);
+	if (ret) {
+		ehca_gen_err("couldn't mmap physical=%lx", physical);
+		return ret;
+	}
+
+	(*vma)->vm_flags |= VM_RESERVED;
+	vsize = (*vma)->vm_end - (*vma)->vm_start;
+	if (vsize != EHCA_PAGESIZE) {
+		ehca_gen_err("invalid vsize=%lx",
+			     (*vma)->vm_end - (*vma)->vm_start);
+		return -EINVAL;
+	}
+
+	(*vma)->vm_page_prot = pgprot_noncached((*vma)->vm_page_prot);
+	(*vma)->vm_flags |= VM_IO | VM_RESERVED;
+
+	ret = remap_pfn_range((*vma), (*vma)->vm_start,
+			      physical >> PAGE_SHIFT, vsize,
+			      (*vma)->vm_page_prot);
+	if (ret) {
+		ehca_gen_err("remap_pfn_range() failed ret=%x", ret);
+		return -ENOMEM;
+	}
+
+	return 0;
+
+}
+
+int ehca_munmap(unsigned long addr, size_t len) {
+	int ret = 0;
+	struct mm_struct *mm = current->mm;
+	if (mm) {
+		down_write(&mm->mmap_sem);
+		ret = do_munmap(mm, addr, len);
+		up_write(&mm->mmap_sem);
+	}
+	return ret;
+}
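
The mmap offset layout that ehca_nopage() and ehca_mmap() decode in the patch above can be sketched as a standalone pack/unpack pair. This is only an illustration in user-space C: the patch shows just the decode side, so the pack helper, the struct, and the enum names below are hypothetical, while the bit positions and the numeric values (1 = CQ, 2 = QP; 1 = fw handle, 2 = queue/rqueue, 3 = squeue) are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the numeric values follow the case labels in the patch. */
enum ehca_q_type { EHCA_Q_CQ = 1, EHCA_Q_QP = 2 };
enum ehca_rsrc_type { EHCA_RSRC_FW = 1, EHCA_RSRC_RQ = 2, EHCA_RSRC_SQ = 3 };

/* Hypothetical container for the three decoded fields. */
struct ehca_mmap_key {
	uint32_t idr_handle;	/* bits 63..32: idr token of the CQ or QP */
	uint32_t q_type;	/* bits 31..28: CQ, QP, ... */
	uint32_t rsrc_type;	/* bits 27..24: fw handle, rqueue, squeue */
};

/* Assumed encode side: build the file offset user space would pass to mmap(). */
static uint64_t ehca_pack_offset(uint32_t idr_handle, uint32_t q_type,
				 uint32_t rsrc_type)
{
	assert(q_type <= 0xF && rsrc_type <= 0xF);
	return ((uint64_t)idr_handle << 32) |
	       ((uint64_t)q_type << 28) |
	       ((uint64_t)rsrc_type << 24);
}

/* Decode side: same shifts and masks as in ehca_nopage() and ehca_mmap(). */
static struct ehca_mmap_key ehca_unpack_offset(uint64_t fileoffset)
{
	struct ehca_mmap_key key;

	key.idr_handle = (uint32_t)(fileoffset >> 32);
	key.q_type = (uint32_t)(fileoffset >> 28) & 0xF;
	key.rsrc_type = (uint32_t)(fileoffset >> 24) & 0xF;
	return key;
}
```

Packing idr handle 0x1234 as a QP send queue gives offset 0x0000123423000000, and unpacking that value returns the same three fields. The low 24 bits stay zero, which keeps the offset page aligned for any common page size.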


From mst at dev.mellanox.co.il  Thu May 10 05:49:29 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 15:49:29 +0300
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 1/3] ehca: backport for
	rhel-4.5 - hvcall.h
In-Reply-To: <200705101441.44286.ossrosch@linux.vnet.ibm.com>
References: <200705101441.44286.ossrosch@linux.vnet.ibm.com>
Message-ID: <20070510124929.GA22029@mellanox.co.il>

> Quoting Stefan Roscher <ossrosch at linux.vnet.ibm.com>:
> Subject: [PATCH ofed-1.2-rc3 1/3] ehca: backport for rhel-4.5 - hvcall.h
> 
> use kmem_cache_t instead of struct kmem_cache and update hvcall.h
> 
> 
> 
> Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
> ---

Format's wrong here:

> drivers/infiniband/hw/ehca/ehca_av.c                         |    2
> drivers/infiniband/hw/ehca/ehca_cq.c                         |    2
> drivers/infiniband/hw/ehca/ehca_main.c                       |    2
> drivers/infiniband/hw/ehca/ehca_mrmw.c                       |    4
> drivers/infiniband/hw/ehca/ehca_pd.c                         |    2
> drivers/infiniband/hw/ehca/ehca_qp.c                         |    2

These should be a patch in kernel_patches/backports/2.6.9_U5.

> kernel_addons/backport/2.6.9_U5/include/asm-powerpc/system.h |    1
> kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h         |  167 +++++++++++
	
And this part creates files under include/asm/hvcall.h so can be applied
directly.
-- 
MST


From yosefe at voltaire.com  Thu May 10 05:52:34 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 15:52:34 +0300
Subject: [ofa-general] Re: [PATCHv4 2/2] ipoib: handle pkey change events
In-Reply-To: <20070510123855.GL13655@mellanox.co.il>
References: <20070508162727.GD5845@mellanox.co.il>	<4640A8BD.4000405@voltaire.com>	<20070509093548.GA7683@mellanox.co.il>	<4641AA06.1050002@voltaire.com>	<20070509112626.GA10068@mellanox.co.il>	<4641B63D.4010602@voltaire.com>	<20070509174138.GB17734@mellanox.co.il>	<46430167.3010106@voltaire.com>	<20070510120144.GF13655@mellanox.co.il>	<46430F46.1080002@voltaire.com>
	<20070510123855.GL13655@mellanox.co.il>
Message-ID: <46431592.6080401@voltaire.com>

> 
> Return some error code -ENXIO.
> 
All other branches in this function return -1 (see next hunk)

Anyway, let it be -ENXIO.


--
This issue was found during partitioning & SM failover testing.

 * Add a flush flag to ipoib_ib_dev_stop(), as ipoib_ib_dev_down() has
 * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
 * Obtain the pkey index prior to entering init_qp, and save it in dev_priv
 * Upon a PKEY_CHANGE event, schedule a work that restarts the qp.
 * Precondition the restart on whether the pkey index has really changed.
   Use the cached pkey_index to test this.
 * Restart child interfaces before the parent. They might be up even if the
   parent is down.
 * When an interface is restarted, queue delayed initialization, to handle
   the case that a pkey is deleted and later restored.
 * Use an uncached pkey query upon qp initialization

SM reconfiguration or failover may shuffle the values in the port pkey
table. The current implementation queries the index of the pkey only once,
when it creates the device QP and moves it into working state, and hence
does not handle this scenario. Fix this by using the PKEY_CHANGE event as
a trigger to reconfigure the device QP.

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |    7 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   88 +++++++++++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |    7 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |   26 ++------
 4 files changed, 89 insertions(+), 39 deletions(-)

Index: b/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-10 08:34:58.335171047 +0300
@@ -202,15 +202,17 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
+	struct delayed_work pkey_poll_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
+	struct work_struct pkey_event_task;
 
 	struct ib_device *ca;
 	u8            	  port;
 	u16           	  pkey;
+	u16               pkey_index;
 	struct ib_pd  	 *pd;
 	struct ib_mr  	 *mr;
 	struct ib_cq  	 *cq;
@@ -333,12 +335,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-10 15:50:29.315183358 +0300
@@ -413,6 +413,18 @@ int ipoib_ib_dev_open(struct net_device 
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
 
+	/*
+	 * Search through the port P_Key table for the requested pkey value.
+	 * The port has to be assigned to the respective IB partition in
+	 * advance.
+	 */
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) {
+		ipoib_warn(priv, "pkey 0x%04x not found\n", priv->pkey);
+		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+		return -ENXIO;
+	}
+	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+
 	ret = ipoib_init_qp(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret);
@@ -422,14 +434,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -481,7 +493,7 @@ int ipoib_ib_dev_down(struct net_device 
 	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
 		mutex_lock(&pkey_mutex);
 		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
+		cancel_delayed_work(&priv->pkey_poll_task);
 		mutex_unlock(&pkey_mutex);
 		if (flush)
 			flush_workqueue(ipoib_workqueue);
@@ -508,7 +520,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +593,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,13 +635,22 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
+	u16 new_index;
+
+	mutex_lock(&priv->vlan_mutex);
+
+	/* Flush any child interfaces too -
+ 	 * they might be up even if the parent is down */
+ 	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, pkey_event);
+
+	mutex_unlock(&priv->vlan_mutex);
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
 		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
@@ -638,10 +660,32 @@ void ipoib_ib_dev_flush(struct work_stru
 		return;
 	}
 
+	if (pkey_event) {
+		if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) {
+			clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+			ipoib_ib_dev_down(dev, 0);
+			ipoib_pkey_dev_delay_open(dev);
+			return;
+		}
+		set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+
+		/* restart qp only if pkey index has changed */
+		if (new_index == priv->pkey_index) {
+			ipoib_dbg(priv, "Not flushing - pkey index not changed.\n");
+			return;
+		}
+		priv->pkey_index = new_index;
+	}
+
 	ipoib_dbg(priv, "flushing\n");
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (pkey_event) {
+		ipoib_ib_dev_stop(dev, 0);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +694,24 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	mutex_unlock(&priv->vlan_mutex);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_event_task);
+
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -685,7 +739,7 @@ void ipoib_ib_dev_cleanup(struct net_dev
 void ipoib_pkey_poll(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
+		container_of(work, struct ipoib_dev_priv, pkey_poll_task.work);
 	struct net_device *dev = priv->dev;
 
 	ipoib_pkey_dev_check_presence(dev);
@@ -696,7 +750,7 @@ void ipoib_pkey_poll(struct work_struct 
 		mutex_lock(&pkey_mutex);
 		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
+					   &priv->pkey_poll_task,
 					   HZ);
 		mutex_unlock(&pkey_mutex);
 	}
@@ -715,7 +769,7 @@ int ipoib_pkey_dev_delay_open(struct net
 		mutex_lock(&pkey_mutex);
 		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
 		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
+				   &priv->pkey_poll_task,
 				   HZ);
 		mutex_unlock(&pkey_mutex);
 		return 1;
Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-09 17:21:03.000000000 +0300
@@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev)
 		return -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-10 09:13:28.997127223 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
@@ -94,26 +92,17 @@ int ipoib_init_qp(struct net_device *dev
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
-	u16 pkey_index;
 	struct ib_qp_attr qp_attr;
 	int attr_mask;
 
-	/*
-	 * Search through the port P_Key table for the requested pkey value.
-	 * The port has to be assigned to the respective IB partition in
-	 * advance.
-	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
-	if (ret) {
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
-		return ret;
-	}
-	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	/* Make sure we have a valid pkey_index in priv->pkey_index */
+	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
+		return -1;
 
 	qp_attr.qp_state = IB_QPS_INIT;
 	qp_attr.qkey = 0;
 	qp_attr.port_num = priv->port;
-	qp_attr.pkey_index = pkey_index;
+	qp_attr.pkey_index = priv->pkey_index;
 	attr_mask =
 	    IB_QP_QKEY |
 	    IB_QP_PORT |
@@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler
 		container_of(handler, struct ipoib_dev_priv, event_handler);
 
 	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
 	     record->event == IB_EVENT_PORT_ACTIVE ||
 	     record->event == IB_EVENT_LID_CHANGE  ||
 	     record->event == IB_EVENT_SM_CHANGE   ||
@@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler
 	    record->element.port_num == priv->port) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
+		   record->element.port_num == priv->port) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_event_task);
 	}
 }
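
The decision that __ipoib_ib_dev_flush() makes on a pkey event in the patch above can be sketched outside the kernel. This is a simplified illustration, not the driver code: find_pkey() is a hypothetical stand-in for ib_find_pkey() that does an exact match on a plain array, and the enum names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome names for the three paths in the patch. */
enum pkey_action {
	PKEY_KEEP_QP,		/* index unchanged: nothing to restart */
	PKEY_RESTART_QP,	/* index moved: stop and reopen the QP */
	PKEY_WAIT_FOR_PKEY	/* pkey gone: bring down, poll until it returns */
};

/*
 * Stand-in for ib_find_pkey(): returns 0 and stores the index on success,
 * -1 if the pkey is not in the table. Exact match only (a simplification).
 */
static int find_pkey(const uint16_t *table, size_t len, uint16_t pkey,
		     uint16_t *index)
{
	size_t i;

	assert(table != NULL && index != NULL);
	for (i = 0; i < len; i++) {
		if (table[i] == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -1;
}

/* Compare the freshly looked-up index against the cached one. */
static enum pkey_action on_pkey_change(const uint16_t *table, size_t len,
				       uint16_t pkey, uint16_t *cached_index)
{
	uint16_t new_index;

	if (find_pkey(table, len, pkey, &new_index))
		return PKEY_WAIT_FOR_PKEY;
	if (new_index == *cached_index)
		return PKEY_KEEP_QP;
	*cached_index = new_index;	/* remember it for the next event */
	return PKEY_RESTART_QP;
}
```

With a cached index of 1 for pkey 0x8001, a table of {0xffff, 0x8001} leaves the QP alone, a shuffled table of {0x8001, 0xffff} asks for a restart and updates the cache to 0, and a table without 0x8001 falls into the wait path without touching the cache.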


From mst at dev.mellanox.co.il  Thu May 10 06:01:23 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 16:01:23 +0300
Subject: [ofa-general] Re: [PATCHv4 2/2] ipoib: handle pkey change events
In-Reply-To: <46431592.6080401@voltaire.com>
References: <20070509093548.GA7683@mellanox.co.il>
	<4641AA06.1050002@voltaire.com>
	<20070509112626.GA10068@mellanox.co.il>
	<4641B63D.4010602@voltaire.com>
	<20070509174138.GB17734@mellanox.co.il>
	<46430167.3010106@voltaire.com>
	<20070510120144.GF13655@mellanox.co.il>
	<46430F46.1080002@voltaire.com>
	<20070510123855.GL13655@mellanox.co.il>
	<46431592.6080401@voltaire.com>
Message-ID: <20070510130123.GC22029@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: [PATCHv4 2/2] ipoib: handle pkey change events
> 
> > 
> > Return some error code -ENXIO.
> > 
> All other branches in this function return -1 (see next hunk)

Oh. Right. I haven't looked.

> Anyway, let it be -ENXIO.

Up to you really, I take it back.

> @@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler
>  		container_of(handler, struct ipoib_dev_priv, event_handler);
>  
>  	if ((record->event == IB_EVENT_PORT_ERR    ||
> -	     record->event == IB_EVENT_PKEY_CHANGE ||
>  	     record->event == IB_EVENT_PORT_ACTIVE ||
>  	     record->event == IB_EVENT_LID_CHANGE  ||
>  	     record->event == IB_EVENT_SM_CHANGE   ||
> @@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler
>  	    record->element.port_num == priv->port) {
>  		ipoib_dbg(priv, "Port state change event\n");
>  		queue_work(ipoib_workqueue, &priv->flush_task);
> +	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
> +		   record->element.port_num == priv->port) {
> +		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
> +		queue_work(ipoib_workqueue, &priv->pkey_event_task);
>  	}
>  }

What do you think about my idea to do
if (record->element.port_num != priv->port)
	return
at the top?
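
A rough sketch of that shape, as standalone C rather than the real handler: the enum and function names are made up for illustration, only the events visible in the quoted hunks are listed (the condition elided between the two hunks is not reproduced), and the return value stands in for which work item would be queued.

```c
#include <assert.h>

/* Hypothetical stand-ins for the IB_EVENT_* values seen in the hunk. */
enum sketch_event {
	EV_PORT_ERR,
	EV_PORT_ACTIVE,
	EV_LID_CHANGE,
	EV_SM_CHANGE,
	EV_PKEY_CHANGE,
	EV_OTHER
};

/* Which work item the handler would queue. */
enum queued_task { TASK_NONE, TASK_FLUSH, TASK_PKEY_EVENT };

static enum queued_task dispatch_event(enum sketch_event event,
				       int event_port, int my_port)
{
	assert(my_port > 0);	/* IB port numbers start at 1 */

	/* Single port check at the top, instead of one per branch. */
	if (event_port != my_port)
		return TASK_NONE;

	switch (event) {
	case EV_PKEY_CHANGE:
		return TASK_PKEY_EVENT;
	case EV_PORT_ERR:
	case EV_PORT_ACTIVE:
	case EV_LID_CHANGE:
	case EV_SM_CHANGE:
		return TASK_FLUSH;
	default:
		return TASK_NONE;
	}
}
```

The port comparison then appears once, and each branch only has to look at the event type.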

Anyway, I think you've addressed all real issues - could you pls
post final version of both patches for OFED and 2.6.22?

-- 
MST


From ogerlitz at voltaire.com  Thu May 10 06:02:01 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 10 May 2007 16:02:01 +0300
Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed	udapl
	-	bugs	opened
In-Reply-To: <C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>	<1178657765.11455.32.camel@stevo-desktop>	<4640FDE9.9010000@ichips.intel.com>	<1178718090.382.4.camel@stevo-desktop>	<1178721476.382.18.camel@stevo-desktop>	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop>	<46422D07.3050600@Sun.COM>
	<1178742259.382.112.camel@stevo-desktop>	<46422EA6.3020006@Sun.COM>
	<1178742819.382.114.camel@stevo-desktop>	<464232DC.9010201@Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
	<46430EB2.7080703@voltaire.com>
	<C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>
Message-ID: <464317C9.3020601@voltaire.com>

Jeff Squyres wrote:
> On May 10, 2007, at 8:23 AM, Or Gerlitz wrote:
> 
>> A different approach which you might want to consider is to have at 
>> the btl level --two-- connections per <src,dst> ranks. so if A wants 
>> to send B it does so through the A --> B connection and if B wants to 
>> send A it does so through the B --> A connection. To some extent, this 
>> is the approach taken by IPoIB-CM (I am not enough into the RFC to 
>> understand the reasoning but I am quite sure this was the approach in 
>> the initial implementation). At first thought it might not seem very 
>> elegant, but once you work through the details (projected onto the 
>> OMPI environment) you might even find it nice.
> 
> What is the advantage of this approach?

To start with, my hope is at least to be able to play defense here, 
that is, to convince you that the disadvantages are minor. Only if 
that fails would I schedule myself some reading of the IPoIB-CM RFC to 
dig out the advantages.

Or.


From mst at dev.mellanox.co.il  Thu May 10 06:04:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 16:04:00 +0300
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 3/3] ehca: backport for
	rhel-4.5 - use introduced dma_ops
In-Reply-To: <200705101441.58102.ossrosch@linux.vnet.ibm.com>
References: <200705101441.58102.ossrosch@linux.vnet.ibm.com>
Message-ID: <20070510130400.GD22029@mellanox.co.il>

> Quoting Stefan Roscher <ossrosch at linux.vnet.ibm.com>:
> Subject: [PATCH ofed-1.2-rc3 3/3] ehca: backport for rhel-4.5 - use introduced dma_ops
> 
> use introduced dma_ops
> 
> 
> Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
> ---
> 
> 
> Makefile    |    2
> ehca_dma.c  |  194 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ehca_main.c |    2
> 3 files changed, 197 insertions(+), 1 deletion(-)
> 
> 
> 
> diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_dma.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_dma.c
> --- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_dma.c	1970-01-01 01:00:00.000000000 +0100
> +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_dma.c	2007-05-03 16:25:30.000000000 +0200

These patches belong in kernel_patches/backports.
So please post as such.

-- 
MST


From yosefe at voltaire.com  Thu May 10 06:11:35 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 16:11:35 +0300
Subject: [ofa-general] Re: [PATCHv4 2/2] ipoib: handle pkey change events
In-Reply-To: <20070510130123.GC22029@mellanox.co.il>
References: <20070509093548.GA7683@mellanox.co.il>	<4641AA06.1050002@voltaire.com>	<20070509112626.GA10068@mellanox.co.il>	<4641B63D.4010602@voltaire.com>	<20070509174138.GB17734@mellanox.co.il>	<46430167.3010106@voltaire.com>	<20070510120144.GF13655@mellanox.co.il>	<46430F46.1080002@voltaire.com>	<20070510123855.GL13655@mellanox.co.il>	<46431592.6080401@voltaire.com>
	<20070510130123.GC22029@mellanox.co.il>
Message-ID: <46431A07.4080205@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Yosef Etigin <yosefe at voltaire.com>:
>>Subject: Re: [PATCHv4 2/2] ipoib: handle pkey change events
>>
>>
>>>Return some error code -ENXIO.
>>>
>>
>>All other branches in this function return -1 (see next hunk)
> 
> 
> Oh. Right. I haven't looked.
> 
> 
>>Anyway, let it be -ENXIO.
> 
> 
> Up to you really, I take it back.
> 
> 
I'd leave it -1, for consistency.

>>@@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler
>> 		container_of(handler, struct ipoib_dev_priv, event_handler);
>> 
>> 	if ((record->event == IB_EVENT_PORT_ERR    ||
>>-	     record->event == IB_EVENT_PKEY_CHANGE ||
>> 	     record->event == IB_EVENT_PORT_ACTIVE ||
>> 	     record->event == IB_EVENT_LID_CHANGE  ||
>> 	     record->event == IB_EVENT_SM_CHANGE   ||
>>@@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler
>> 	    record->element.port_num == priv->port) {
>> 		ipoib_dbg(priv, "Port state change event\n");
>> 		queue_work(ipoib_workqueue, &priv->flush_task);
>>+	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
>>+		   record->element.port_num == priv->port) {
>>+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
>>+		queue_work(ipoib_workqueue, &priv->pkey_event_task);
>> 	}
>> }
> 
> 
> What do you think about my idea to do
> if (record->element.port_num != priv->port)
> 	return
> at the top?
> 
> Anyway, I think you've addressed all real issues - could you pls
> post final version of both patches for OFED and 2.6.22?
> 

What should the difference be between the OFED version and the 2.6.22 version?
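
The early-return idea quoted above could look roughly like the sketch
below. It is a userspace mock, not the ipoib driver: the enum values,
struct mock_record, struct mock_priv and classify_event() are hypothetical
stand-ins for the ib_event record, ipoib_dev_priv and the body of
ipoib_event(), and the returned action stands for queuing flush_task or
pkey_event_task. Only the event types visible in the quoted hunk are
modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel event types named in the hunk. */
enum mock_event {
	MOCK_EVENT_PORT_ERR,
	MOCK_EVENT_PORT_ACTIVE,
	MOCK_EVENT_LID_CHANGE,
	MOCK_EVENT_SM_CHANGE,
	MOCK_EVENT_PKEY_CHANGE,
	MOCK_EVENT_OTHER
};

/* What the handler would queue: nothing, the flush task, or the pkey task. */
enum mock_action { ACTION_NONE, ACTION_FLUSH, ACTION_PKEY };

struct mock_record {
	enum mock_event event;
	int port_num;
};

struct mock_priv {
	int port;
};

static enum mock_action classify_event(const struct mock_priv *priv,
				       const struct mock_record *record)
{
	assert(priv != NULL && record != NULL);

	/* Check the port once, up front, instead of in every branch. */
	if (record->port_num != priv->port)
		return ACTION_NONE;

	switch (record->event) {
	case MOCK_EVENT_PORT_ERR:
	case MOCK_EVENT_PORT_ACTIVE:
	case MOCK_EVENT_LID_CHANGE:
	case MOCK_EVENT_SM_CHANGE:
		return ACTION_FLUSH;	/* "Port state change event" path */
	case MOCK_EVENT_PKEY_CHANGE:
		return ACTION_PKEY;	/* "pkey change event" path */
	default:
		return ACTION_NONE;
	}
}
```

With the port test hoisted, each branch only has to look at the event
type, which is the simplification the quoted suggestion is after.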


From jsquyres at cisco.com  Thu May 10 06:11:38 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 10 May 2007 09:11:38 -0400
Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed	udapl
	-	bugs	opened
In-Reply-To: <464317C9.3020601@voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>	<1178657765.11455.32.camel@stevo-desktop>	<4640FDE9.9010000@ichips.intel.com>	<1178718090.382.4.camel@stevo-desktop>	<1178721476.382.18.camel@stevo-desktop>	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop>	<46422D07.3050600@Sun.COM>
	<1178742259.382.112.camel@stevo-desktop>	<46422EA6.3020006@Sun.COM>
	<1178742819.382.114.camel@stevo-desktop>	<464232DC.9010201@Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
	<46430EB2.7080703@voltaire.com>
	<C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>
	<464317C9.3020601@voltaire.com>
Message-ID: <A342C73A-E51B-4E96-A90F-78CDF2F2A38E@cisco.com>

On May 10, 2007, at 9:02 AM, Or Gerlitz wrote:

>>> A different approach which you might want to consider is to have  
>>> at the btl level --two-- connections per <src,dst> ranks. so if A  
>>> wants to send B it does so through the A --> B connection and if  
>>> B wants to send A it does so through the B --> A connection. To  
>>> some extent, this is the approach taken by IPoIB-CM (I am not  
>>> enough into the RFC to understand the reasoning but i am quite  
>>> sure this was the approach in the initial implementation). At  
>>> first thought it mights seems not very elegant, but taking it  
>>> into the details (projected on the ompi env) you might find it   
>>> even nice.
>> What is the advantage of this approach?
>
> To start with, my hope here is at least to be able play defensive  
> here, that is convince you that the disadvantages are minor, where  
> only if this fails, would schedule myself some reading into the  
> ipoib-cm rfc to dig the advantages.

I ask about the advantages because OMPI currently treats QPs as  
bidirectional.  Having OMPI treat them as unidirectional would be a  
change.  I'm not against such a change, but I think we'd need to be  
convinced that there are good reasons to do so.  For example, on the  
surface, it seems like this scheme would simply consume more QPs and  
potentially more registered memory (and is therefore unattractive).

-- 
Jeff Squyres
Cisco Systems


From ossrosch at linux.vnet.ibm.com  Thu May 10 06:12:34 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 15:12:34 +0200
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 1/3] ehca: backport for
	rhel-4.5 - hvcall.h
In-Reply-To: <20070510124929.GA22029@mellanox.co.il>
References: <200705101441.44286.ossrosch@linux.vnet.ibm.com>
	<20070510124929.GA22029@mellanox.co.il>
Message-ID: <200705101512.35152.ossrosch@linux.vnet.ibm.com>

On Thursday 10 May 2007 14:49, Michael S. Tsirkin wrote:
> > Quoting Stefan Roscher <ossrosch at linux.vnet.ibm.com>:
> > Subject: [PATCH ofed-1.2-rc3 1/3] ehca: backport for rhel-4.5 - hvcall.h
> > 
> > use kmem_cache_t instead of struct kmem_cache and update hvcall.h
> > 
> > 
> > 
> > Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
> > ---
> 
> Format's wrong here:

What is wrong with the format?

> 
> > drivers/infiniband/hw/ehca/ehca_av.c                         |    2
> > drivers/infiniband/hw/ehca/ehca_cq.c                         |    2
> > drivers/infiniband/hw/ehca/ehca_main.c                       |    2
> > drivers/infiniband/hw/ehca/ehca_mrmw.c                       |    4
> > drivers/infiniband/hw/ehca/ehca_pd.c                         |    2
> > drivers/infiniband/hw/ehca/ehca_qp.c                         |    2
> 
> These should be a patch in kernel_patches/backports/2.6.9_U5.

Yes, you are right; the correct patches will follow.


regards Stefan


From ogerlitz at voltaire.com  Thu May 10 06:30:27 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 10 May 2007 16:30:27 +0300
Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed	udapl
	-	bugs	opened
In-Reply-To: <A342C73A-E51B-4E96-A90F-78CDF2F2A38E@cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>	<1178657765.11455.32.camel@stevo-desktop>	<4640FDE9.9010000@ichips.intel.com>	<1178718090.382.4.camel@stevo-desktop>	<1178721476.382.18.camel@stevo-desktop>	<E170A1B6-DDE7-45EA-9AC0-E815281F745F@cisco.com>	<4641EBD0.3000600@Sun.COM>
	<1178740498.382.97.camel@stevo-desktop>	<46422D07.3050600@Sun.COM>
	<1178742259.382.112.camel@stevo-desktop>	<46422EA6.3020006@Sun.COM>
	<1178742819.382.114.camel@stevo-desktop>	<464232DC.9010201@Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
	<46430EB2.7080703@voltaire.com>
	<C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>
	<464317C9.3020601@voltaire.com>
	<A342C73A-E51B-4E96-A90F-78CDF2F2A38E@cisco.com>
Message-ID: <46431E73.70505@voltaire.com>

Jeff Squyres wrote:
> On May 10, 2007, at 9:02 AM, Or Gerlitz wrote:

>> To start with, my hope here is at least to be able play defensive 
>> here, that is convince you that the disadvantages are minor, where 
>> only if this fails, would schedule myself some reading into the 
>> ipoib-cm rfc to dig the advantages.

> I ask about the advantages because OMPI currently treats QP's as 
> bi-directional.  Having OMPI treat them at unidirectional would be a 
> change.  I'm not against such a change, but I think we'd need to be 
> convinced that there are good reasons to do so.  For example, on the 
> surface, it seems like this scheme would simply consume more QPs and 
> potentially more registered memory (and is therefore unattractive).

Indeed you would need two QPs per btl connection. However, for each 
direction you can make the relevant QP consume ~zero resources for the 
other direction, i.e., on side A:

for the A --> B QP : RX WR num  = 0, RX SG size = 0
for the B --> A QP : TX WR num  = 0, TX SG size = 0

and on side B the other way around. I think that IB disallows a zero 
WR num, so you would actually set it to 1. Note that since you use SRQ 
for large jobs, you have zero overhead for RX resources, leaving only 
this one TX WR of overhead for the "RX" connection on each side. This is 
the only memory-related overhead, since you don't have to allocate any 
extra buffers beyond what you do now.

Or.
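
A minimal sketch of the asymmetric sizing described above, under stated
assumptions: struct qp_cap_sketch is a local stand-in whose field names
follow the capability fields of libibverbs' struct ibv_qp_cap, and
send_only_cap(), recv_only_cap() and the min_wr parameter are
hypothetical. min_wr is the smallest queue size the provider accepts for
the unused direction; whether it may be 0 or has to be 1 is left open
here, as it is in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in; field names follow libibverbs' struct ibv_qp_cap. */
struct qp_cap_sketch {
	uint32_t max_send_wr;
	uint32_t max_recv_wr;
	uint32_t max_send_sge;
	uint32_t max_recv_sge;
};

/* Caps for the QP a side only sends on (A --> B as seen from A). */
static struct qp_cap_sketch send_only_cap(uint32_t depth, uint32_t sge,
					  uint32_t min_wr)
{
	struct qp_cap_sketch cap;

	assert(depth > 0 && sge > 0);
	cap.max_send_wr  = depth;
	cap.max_send_sge = sge;
	cap.max_recv_wr  = min_wr;	/* unused direction: 0, or 1 if required */
	cap.max_recv_sge = min_wr;
	return cap;
}

/* Caps for the QP a side only receives on (B --> A as seen from A). */
static struct qp_cap_sketch recv_only_cap(uint32_t depth, uint32_t sge,
					  uint32_t min_wr)
{
	struct qp_cap_sketch cap;

	assert(depth > 0 && sge > 0);
	cap.max_recv_wr  = depth;
	cap.max_recv_sge = sge;
	cap.max_send_wr  = min_wr;	/* unused direction: 0, or 1 if required */
	cap.max_send_sge = min_wr;
	return cap;
}

/* Work requests added by the unused direction of one QP. */
static uint32_t unused_wr_overhead(uint32_t min_wr)
{
	return min_wr;
}
```

With a real verbs provider these values would be copied into the cap
member of the QP init attributes before creating each QP. When receives
come from an SRQ, the per-QP receive depth stops mattering and only the
min_wr send entries on the receive-side QP remain as overhead.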


From mst at dev.mellanox.co.il  Thu May 10 06:36:14 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 16:36:14 +0300
Subject: [ofa-general] Re: [PATCHv4 2/2] ipoib: handle pkey change events
In-Reply-To: <46431A07.4080205@voltaire.com>
References: <20070509112626.GA10068@mellanox.co.il>
	<4641B63D.4010602@voltaire.com>
	<20070509174138.GB17734@mellanox.co.il>
	<46430167.3010106@voltaire.com>
	<20070510120144.GF13655@mellanox.co.il>
	<46430F46.1080002@voltaire.com>
	<20070510123855.GL13655@mellanox.co.il>
	<46431592.6080401@voltaire.com>
	<20070510130123.GC22029@mellanox.co.il>
	<46431A07.4080205@voltaire.com>
Message-ID: <20070510133614.GM13655@mellanox.co.il>

> > Anyway, I think you've addressed all real issues - could you pls
> > post final version of both patches for OFED and 2.6.22?
> > 
> 
> What should be the difference between for OFED and for 2.6.22?

I wouldn't expect any difference.

-- 
MST


From glebn at voltaire.com  Thu May 10 06:44:05 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Thu, 10 May 2007 16:44:05 +0300
Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed	udapl
	-	bugs	opened
In-Reply-To: <46431E73.70505@voltaire.com>
References: <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM>
	<1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
	<46430EB2.7080703@voltaire.com>
	<C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>
	<464317C9.3020601@voltaire.com>
	<A342C73A-E51B-4E96-A90F-78CDF2F2A38E@cisco.com>
	<46431E73.70505@voltaire.com>
Message-ID: <20070510134405.GH24497@minantech.com>

On Thu, May 10, 2007 at 04:30:27PM +0300, Or Gerlitz wrote:
> Jeff Squyres wrote:
> >On May 10, 2007, at 9:02 AM, Or Gerlitz wrote:
> 
> >>To start with, my hope here is at least to be able play defensive 
> >>here, that is convince you that the disadvantages are minor, where 
> >>only if this fails, would schedule myself some reading into the 
> >>ipoib-cm rfc to dig the advantages.
> 
> >I ask about the advantages because OMPI currently treats QP's as 
> >bi-directional.  Having OMPI treat them at unidirectional would be a 
> >change.  I'm not against such a change, but I think we'd need to be 
> >convinced that there are good reasons to do so.  For example, on the 
> >surface, it seems like this scheme would simply consume more QPs and 
> >potentially more registered memory (and is therefore unattractive).
> 
> Indeed you would need two QPs per btl connection, however, for each 
> direction you can make the relevant QP consume ~zero resources per the 
> other direction, ie on side A:
> 
> for the A --> B QP : RX WR num  = 0, RX SG size = 0
> for the B --> A QP : TX WR num  = 0, TX SG size = 0
> 
> and on side B the other way. I think that IB disallows to have zero len 
> WR num so you set it actually to 1. Note that since you use SRQ for 
> large jobs you have zero overhead for RX resources and this one TX WR 
> overhead for the "RX" connection on each side. This is the only memory 
> related overhead since you don't have to allocate any extra buffers over 
> what you do now.
> 
QP is a limited resource and we already have 2 per connection (and many
more if LMC is in use), so I don't see any reason to use this scheme only
to overcome the brain-damaged design of iWarp.

--
			Gleb.


From mst at dev.mellanox.co.il  Thu May 10 06:54:03 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 16:54:03 +0300
Subject: [ofa-general] Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed	udapl
	-	bugs	opened
In-Reply-To: <46431E73.70505@voltaire.com>
References: <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM>
	<1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
	<46430EB2.7080703@voltaire.com>
	<C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>
	<464317C9.3020601@voltaire.com>
	<A342C73A-E51B-4E96-A90F-78CDF2F2A38E@cisco.com>
	<46431E73.70505@voltaire.com>
Message-ID: <20070510135403.GP13655@mellanox.co.il>

> I think that IB disallows to have zero len 
> WR num so you set it actually to 1.

I don't think such limitation exists.

-- 
MST


From ogerlitz at voltaire.com  Thu May 10 06:57:09 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 10 May 2007 16:57:09 +0300
Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed	udapl
	-	bugs	opened
In-Reply-To: <20070510134405.GH24497@minantech.com>
References: <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM>
	<1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
	<46430EB2.7080703@voltaire.com>
	<C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>
	<464317C9.3020601@voltaire.com>
	<A342C73A-E51B-4E96-A90F-78CDF2F2A38E@cisco.com>
	<46431E73.70505@voltaire.com>
	<20070510134405.GH24497@minantech.com>
Message-ID: <464324B5.1080209@voltaire.com>

Gleb Natapov wrote:
> QP is a limited resource and we already have 2 per connection (and much
> more if LMC is in used), so I don't see any reason to use this scheme only
> to overcome brain damaged design of iWarp.

Fair enough, just note that **some** damage (which I understand is just 
to the extent of adding a flag somewhere) would be experienced by ompi 
people to support iwarp...

Or.


From mst at dev.mellanox.co.il  Thu May 10 06:58:16 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 16:58:16 +0300
Subject: [ofa-general] Fwd: Re: [ewg] Re: [OMPI devel] Re: OMPI over
	ofed udapl - bugs opened
Message-ID: <20070510135816.GQ13655@mellanox.co.il>

Not sure who first added the open-mpi list to Cc:, but please don't
do it for messages sent to openib-general in the future
since this is a subscriber-only list (see below).

----- Forwarded message from devel-owner at open-mpi.org -----

Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed	udapl -	bugs	opened
From: devel-owner at open-mpi.org
Date: Thu, 10 May 2007 09:54:06 -0400

You are not allowed to post to this mailing list, and your message has
been automatically rejected.  If you think that your messages are
being rejected in error, contact the mailing list owner at
devel-owner at open-mpi.org.


Date: Thu, 10 May 2007 16:54:03 +0300
From: "Michael S. Tsirkin" <mst at dev.mellanox.co.il>
Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed	udapl -	bugs	opened
Reply-To: "Michael S. Tsirkin" <mst at dev.mellanox.co.il>
References: <1178742259.382.112.camel at stevo-desktop> <46422EA6.3020006 at Sun.COM>
	<1178742819.382.114.camel at stevo-desktop> <464232DC.9010201 at Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD at cisco.com>
	<46430EB2.7080703 at voltaire.com>
	<C28A2DEB-47C4-438A-9C2A-A2C769DF38B2 at cisco.com>
	<464317C9.3020601 at voltaire.com>
	<A342C73A-E51B-4E96-A90F-78CDF2F2A38E at cisco.com>
	<46431E73.70505 at voltaire.com>
In-Reply-To: <46431E73.70505 at voltaire.com>

> I think that IB disallows to have zero len 
> WR num so you set it actually to 1.

I don't think such limitation exists.

-- 
MST


----- End forwarded message -----

-- 
MST


From swise at opengridcomputing.com  Thu May 10 07:09:16 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 10 May 2007 09:09:16 -0500
Subject: [ofa-general] Fwd: Re: [ewg] Re: [OMPI devel] Re: OMPI over
	ofed udapl - bugs opened
In-Reply-To: <20070510135816.GQ13655@mellanox.co.il>
References: <20070510135816.GQ13655@mellanox.co.il>
Message-ID: <1178806156.1519.12.camel@stevo-desktop>

My fault, but the issue in question is pertinent to both lists (OMPI
over ofa iwarp).  But since the ompi list is a closed list, I'll refrain
from doing this in the future.

Steve.


On Thu, 2007-05-10 at 16:58 +0300, Michael S. Tsirkin wrote:
> Not sure who first added open-mpi list to Cc:, but please don't
> do it for mesasges sent to openib-general in the future
> since this is a subscriber-only list (see below).
> 
> ----- Forwarded message from devel-owner at open-mpi.org -----
> 
> Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed	udapl -	bugs	opened
> From: devel-owner at open-mpi.org
> Date: Thu, 10 May 2007 09:54:06 -0400
> 
> You are not allowed to post to this mailing list, and your message has
> been automatically rejected.  If you think that your messages are
> being rejected in error, contact the mailing list owner at
> devel-owner at open-mpi.org.
> 
> 
> Date: Thu, 10 May 2007 16:54:03 +0300
> From: "Michael S. Tsirkin" <mst at dev.mellanox.co.il>
> Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed	udapl -	bugs	opened
> Reply-To: "Michael S. Tsirkin" <mst at dev.mellanox.co.il>
> References: <1178742259.382.112.camel at stevo-desktop> <46422EA6.3020006 at Sun.COM>
> 	<1178742819.382.114.camel at stevo-desktop> <464232DC.9010201 at Sun.COM>
> 	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD at cisco.com>
> 	<46430EB2.7080703 at voltaire.com>
> 	<C28A2DEB-47C4-438A-9C2A-A2C769DF38B2 at cisco.com>
> 	<464317C9.3020601 at voltaire.com>
> 	<A342C73A-E51B-4E96-A90F-78CDF2F2A38E at cisco.com>
> 	<46431E73.70505 at voltaire.com>
> In-Reply-To: <46431E73.70505 at voltaire.com>
> 
> > I think that IB disallows to have zero len 
> > WR num so you set it actually to 1.
> 
> I don't think such limitation exists.
> 
> -- 
> MST
> 
> 
> ----- End forwarded message -----
> 


From mst at dev.mellanox.co.il  Thu May 10 07:09:33 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 17:09:33 +0300
Subject: [ofa-general] Re: [PATCHv4 2/2] ipoib: handle pkey change events
In-Reply-To: <46430F46.1080002@voltaire.com>
References: <20070508162727.GD5845@mellanox.co.il>
	<4640A8BD.4000405@voltaire.com>
	<20070509093548.GA7683@mellanox.co.il>
	<4641AA06.1050002@voltaire.com>
	<20070509112626.GA10068@mellanox.co.il>
	<4641B63D.4010602@voltaire.com>
	<20070509174138.GB17734@mellanox.co.il>
	<46430167.3010106@voltaire.com>
	<20070510120144.GF13655@mellanox.co.il>
	<46430F46.1080002@voltaire.com>
Message-ID: <20070510140932.GB26302@mellanox.co.il>

> @@ -638,10 +660,32 @@ void ipoib_ib_dev_flush(struct work_stru
>  		return;
>  	}
>  
> +	if (pkey_event) {
> +		if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) {
> +			clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +			ipoib_ib_dev_down(dev, 0);
> +			ipoib_pkey_dev_delay_open(dev);
> +			return;
> +		}
> +		set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +
> +		/* restart qp only of pkey index is cahnged */
> +		if (new_index == priv->pkey_index) {
> +			ipoib_dbg(priv, "Not flushing - pkey index not changed.\n");
> +			return;
> +		}
> +		priv->pkey_index = new_index;
> +	}
> +
>  	ipoib_dbg(priv, "flushing\n");
>  

Say, what if IPOIB_PKEY_ASSIGNED was cleared previously?
priv->pkey_index will be wrong, won't it?

-- 
MST


From ossrosch at linux.vnet.ibm.com  Thu May 10 07:26:55 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 16:26:55 +0200
Subject: [ofa-general] [PATCH ofed-1.2-rc3 0/4] ehca: backport for rhel-4.5
Message-ID: <200705101626.56308.ossrosch@linux.vnet.ibm.com>

The patches I sent before
http://lists.openfabrics.org/pipermail/general/2007-May/036125.html
were in the wrong format and did not patch into the backport directory, so
I am now sending the changed patches.

Regards Stefan


From ossrosch at linux.vnet.ibm.com  Thu May 10 07:28:02 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 16:28:02 +0200
Subject: [ofa-general] [PATCH ofed-1.2-rc3 1/4] ehca: backport for rhel-4.5 -
	use kmem_cache_t instead of struct kmem_cache
Message-ID: <200705101628.03034.ossrosch@linux.vnet.ibm.com>


Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
---
backport_ehca_1_2.6.9.patch |   82 ++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 82 insertions(+)


diff -Nurp ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_1_2.6.9.patch ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_1_2.6.9.patch
--- ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_1_2.6.9.patch	1970-01-01 01:00:00.000000000 +0100
+++ ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_1_2.6.9.patch	2007-05-10 17:25:58.000000000 +0200
@@ -0,0 +1,82 @@
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_av.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_av.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_av.c	2007-05-09 12:42:01.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_av.c	2007-05-09 12:42:34.000000000 +0200
+@@ -48,7 +48,7 @@
+ #include "ehca_iverbs.h"
+ #include "hcp_if.h"
+ 
+-static struct kmem_cache *av_cache;
++static kmem_cache_t *av_cache;
+ 
+ struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
+ {
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c	2007-05-09 12:42:01.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c	2007-05-09 12:42:34.000000000 +0200
+@@ -50,7 +50,7 @@
+ #include "ehca_irq.h"
+ #include "hcp_if.h"
+ 
+-static struct kmem_cache *cq_cache;
++static kmem_cache_t *cq_cache;
+ 
+ int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp)
+ {
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-09 12:42:01.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-09 12:42:34.000000000 +0200
+@@ -465,7 +465,6 @@ void ehca_remove_driver_sysfs(struct ibm
+ 
+ #define EHCA_RESOURCE_ATTR(name)                                           \
+ static ssize_t  ehca_show_##name(struct device *dev,                       \
+-				 struct device_attribute *attr,            \
+ 				 char *buf)                                \
+ {									   \
+ 	struct ehca_shca *shca;						   \
+@@ -513,7 +512,6 @@ EHCA_RESOURCE_ATTR(max_pd);
+ EHCA_RESOURCE_ATTR(max_ah);
+ 
+ static ssize_t ehca_show_adapter_handle(struct device *dev,
+-					struct device_attribute *attr,
+ 					char *buf)
+ {
+ 	struct ehca_shca *shca = dev->driver_data;
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_mrmw.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_mrmw.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_mrmw.c	2007-05-09 12:42:01.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_mrmw.c	2007-05-09 12:42:34.000000000 +0200
+@@ -46,8 +46,8 @@
+ #include "hcp_if.h"
+ #include "hipz_hw.h"
+ 
+-static struct kmem_cache *mr_cache;
+-static struct kmem_cache *mw_cache;
++static kmem_cache_t *mr_cache;
++static kmem_cache_t *mw_cache;
+ 
+ static struct ehca_mr *ehca_mr_new(void)
+ {
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_pd.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_pd.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_pd.c	2007-05-09 12:42:01.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_pd.c	2007-05-09 12:42:34.000000000 +0200
+@@ -43,7 +43,7 @@
+ #include "ehca_tools.h"
+ #include "ehca_iverbs.h"
+ 
+-static struct kmem_cache *pd_cache;
++static kmem_cache_t *pd_cache;
+ 
+ struct ib_pd *ehca_alloc_pd(struct ib_device *device,
+ 			    struct ib_ucontext *context, struct ib_udata *udata)
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c	2007-05-09 12:42:01.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c	2007-05-09 12:42:34.000000000 +0200
+@@ -51,7 +51,7 @@
+ #include "hcp_if.h"
+ #include "hipz_fns.h"
+ 
+-static struct kmem_cache *qp_cache;
++static kmem_cache_t *qp_cache;
+ 
+ /*
+  * attributes not supported by query qp
+


From mst at dev.mellanox.co.il  Thu May 10 07:28:26 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 17:28:26 +0300
Subject: [ofa-general] Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed	udapl
	-	bugs	opened
In-Reply-To: <C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>
References: <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop>
	<46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop>
	<46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop>
	<464232DC.9010201@Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
	<46430EB2.7080703@voltaire.com>
	<C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>
Message-ID: <20070510142826.GE22029@mellanox.co.il>

> Quoting Jeff Squyres <jsquyres at cisco.com>:
> Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened
> 
> On May 10, 2007, at 8:23 AM, Or Gerlitz wrote:
> 
> >A different approach which you might want to consider is to have at  
> >the btl level --two-- connections per <src,dst> ranks. so if A  
> >wants to send B it does so through the A --> B connection and if B  
> >wants to send A it does so through the B --> A connection. To some  
> >extent, this is the approach taken by IPoIB-CM (I am not enough  
> >into the RFC to understand the reasoning but i am quite sure this  
> >was the approach in the initial implementation). At first thought  
> >it mights seems not very elegant, but taking it into the details  
> >(projected on the ompi env) you might find it  even nice.
> 
> What is the advantage of this approach?

Current ipoib cm uses this approach to simplify the implementation.
Overhead seems insignificant.

-- 
MST


From ossrosch at linux.vnet.ibm.com  Thu May 10 07:28:42 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 16:28:42 +0200
Subject: [ofa-general] [PATCH ofed-1.2-rc3 2/4] ehca: backport for rhel-4.5 -
	mmap functonality
Message-ID: <200705101628.43095.ossrosch@linux.vnet.ibm.com>


Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
---
backport_ehca_2_rhel45_umap.patch |  850 ++++++++++++++++++++++++++++++++++++++
1 files changed, 850 insertions(+)


diff -Nurp ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch
--- ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch	1970-01-01 01:00:00.000000000 +0100
+++ ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch	2007-05-10 17:27:33.000000000 +0200
@@ -0,0 +1,850 @@
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_classes.h
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_classes.h	2007-05-04 10:40:06.000000000 +0200
+@@ -126,14 +126,13 @@ struct ehca_qp {
+ 	struct ipz_qp_handle ipz_qp_handle;
+ 	struct ehca_pfqp pf;
+ 	struct ib_qp_init_attr init_attr;
++	u64 uspace_squeue;
++	u64 uspace_rqueue;
++	u64 uspace_fwh;
+ 	struct ehca_cq *send_cq;
+ 	struct ehca_cq *recv_cq;
+ 	unsigned int sqerr_purgeflag;
+ 	struct hlist_node list_entries;
+-	/* mmap counter for resources mapped into user space */
+-	u32 mm_count_squeue;
+-	u32 mm_count_rqueue;
+-	u32 mm_count_galpa;
+ };
+ 
+ /* must be power of 2 */
+@@ -150,6 +149,8 @@ struct ehca_cq {
+ 	struct ipz_cq_handle ipz_cq_handle;
+ 	struct ehca_pfcq pf;
+ 	spinlock_t cb_lock;
++	u64 uspace_queue;
++	u64 uspace_fwh;
+ 	struct hlist_head qp_hashtab[QP_HASHTAB_LEN];
+ 	struct list_head entry;
+ 	u32 nr_callbacks; /* #events assigned to cpu by scaling code */
+@@ -157,9 +158,6 @@ struct ehca_cq {
+ 	wait_queue_head_t wait_completion;
+ 	spinlock_t task_lock;
+ 	u32 ownpid;
+-	/* mmap counter for resources mapped into user space */
+-	u32 mm_count_queue;
+-	u32 mm_count_galpa;
+ };
+ 
+ enum ehca_mr_flag {
+@@ -259,6 +257,20 @@ struct ehca_ucontext {
+ 	struct ib_ucontext ib_ucontext;
+ };
+ 
++struct ehca_module *ehca_module_new(void);
++
++int ehca_module_delete(struct ehca_module *me);
++
++int ehca_eq_ctor(struct ehca_eq *eq);
++
++int ehca_eq_dtor(struct ehca_eq *eq);
++
++struct ehca_shca *ehca_shca_new(void);
++
++int ehca_shca_delete(struct ehca_shca *me);
++
++struct ehca_sport *ehca_sport_new(struct ehca_shca *anchor);
++
+ int ehca_init_pd_cache(void);
+ void ehca_cleanup_pd_cache(void);
+ int ehca_init_cq_cache(void);
+@@ -282,6 +294,7 @@ extern int ehca_use_hp_mr;
+ extern int ehca_scaling_code;
+ 
+ struct ipzu_queue_resp {
++	u64 queue;        /* points to first queue entry */
+ 	u32 qe_size;      /* queue entry size */
+ 	u32 act_nr_of_sg;
+ 	u32 queue_length; /* queue length allocated in bytes */
+@@ -294,6 +307,7 @@ struct ehca_create_cq_resp {
+ 	u32 cq_number;
+ 	u32 token;
+ 	struct ipzu_queue_resp ipz_queue;
++	struct h_galpas galpas;
+ };
+ 
+ struct ehca_create_qp_resp {
+@@ -306,6 +320,7 @@ struct ehca_create_qp_resp {
+ 	u32 dummy; /* padding for 8 byte alignment */
+ 	struct ipzu_queue_resp ipz_squeue;
+ 	struct ipzu_queue_resp ipz_rqueue;
++	struct h_galpas galpas;
+ };
+ 
+ struct ehca_alloc_cq_parms {
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c	2007-05-04 10:40:06.000000000 +0200
+@@ -268,6 +268,7 @@ struct ib_cq *ehca_create_cq(struct ib_d
+ 	if (context) {
+ 		struct ipz_queue *ipz_queue = &my_cq->ipz_queue;
+ 		struct ehca_create_cq_resp resp;
++		struct vm_area_struct *vma;
+ 		memset(&resp, 0, sizeof(resp));
+ 		resp.cq_number = my_cq->cq_number;
+ 		resp.token = my_cq->token;
+@@ -276,14 +277,40 @@ struct ib_cq *ehca_create_cq(struct ib_d
+ 		resp.ipz_queue.queue_length = ipz_queue->queue_length;
+ 		resp.ipz_queue.pagesize = ipz_queue->pagesize;
+ 		resp.ipz_queue.toggle_state = ipz_queue->toggle_state;
++		ret = ehca_mmap_nopage(((u64)(my_cq->token) << 32) | 0x12000000,
++				       ipz_queue->queue_length,
++				       (void**)&resp.ipz_queue.queue,
++				       &vma);
++		if (ret) {
++			ehca_err(device, "Could not mmap queue pages");
++			cq = ERR_PTR(ret);
++			goto create_cq_exit4;
++		}
++		my_cq->uspace_queue = resp.ipz_queue.queue;
++		resp.galpas = my_cq->galpas;
++		ret = ehca_mmap_register(my_cq->galpas.user.fw_handle,
++					 (void**)&resp.galpas.kernel.fw_handle,
++					 &vma);
++		if (ret) {
++			ehca_err(device, "Could not mmap fw_handle");
++			cq = ERR_PTR(ret);
++			goto create_cq_exit5;
++		}
++		my_cq->uspace_fwh = (u64)resp.galpas.kernel.fw_handle;
+ 		if (ib_copy_to_udata(udata, &resp, sizeof(resp))) {
+ 			ehca_err(device, "Copy to udata failed.");
+-			goto create_cq_exit4;
++			goto create_cq_exit6;
+ 		}
+ 	}
+ 
+ 	return cq;
+ 
++create_cq_exit6:
++	ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE);
++
++create_cq_exit5:
++	ehca_munmap(my_cq->uspace_queue, my_cq->ipz_queue.queue_length);
++
+ create_cq_exit4:
+ 	ipz_queue_dtor(&my_cq->ipz_queue);
+ 
+@@ -317,6 +344,7 @@ static int get_cq_nr_events(struct ehca_
+ int ehca_destroy_cq(struct ib_cq *cq)
+ {
+ 	u64 h_ret;
++	int ret;
+ 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
+ 	int cq_num = my_cq->cq_number;
+ 	struct ib_device *device = cq->device;
+@@ -326,20 +354,6 @@ int ehca_destroy_cq(struct ib_cq *cq)
+ 	u32 cur_pid = current->tgid;
+ 	unsigned long flags;
+ 
+-	if (cq->uobject) {
+-		if (my_cq->mm_count_galpa || my_cq->mm_count_queue) {
+-			ehca_err(device, "Resources still referenced in "
+-				 "user space cq_num=%x", my_cq->cq_number);
+-			return -EINVAL;
+-		}
+-		if (my_cq->ownpid != cur_pid) {
+-			ehca_err(device, "Invalid caller pid=%x ownpid=%x "
+-				 "cq_num=%x",
+-				 cur_pid, my_cq->ownpid, my_cq->cq_number);
+-			return -EINVAL;
+-		}
+-	}
+-
+ 	spin_lock_irqsave(&ehca_cq_idr_lock, flags);
+ 	while (my_cq->nr_events) {
+ 		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+@@ -351,6 +365,25 @@ int ehca_destroy_cq(struct ib_cq *cq)
+ 	idr_remove(&ehca_cq_idr, my_cq->token);
+ 	spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+ 
++	if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) {
++		ehca_err(device, "Invalid caller pid=%x ownpid=%x",
++			 cur_pid, my_cq->ownpid);
++		return -EINVAL;
++	}
++
++	/* un-mmap if vma alloc */
++	if (my_cq->uspace_queue) {
++		ret = ehca_munmap(my_cq->uspace_queue,
++				  my_cq->ipz_queue.queue_length);
++		if (ret)
++			ehca_err(device, "Could not munmap queue ehca_cq=%p "
++				 "cq_num=%x", my_cq, cq_num);
++		ret = ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE);
++		if (ret)
++			ehca_err(device, "Could not munmap fwh ehca_cq=%p "
++				 "cq_num=%x", my_cq, cq_num);
++	}
++
+ 	h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0);
+ 	if (h_ret == H_R_STATE) {
+ 		/* cq in err: read err data and destroy it forcibly */
+@@ -379,7 +412,7 @@ int ehca_resize_cq(struct ib_cq *cq, int
+ 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
+ 	u32 cur_pid = current->tgid;
+ 
+-	if (cq->uobject && my_cq->ownpid != cur_pid) {
++	if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) {
+ 		ehca_err(cq->device, "Invalid caller pid=%x ownpid=%x",
+ 			 cur_pid, my_cq->ownpid);
+ 		return -EINVAL;
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_iverbs.h ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_iverbs.h
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_iverbs.h	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_iverbs.h	2007-04-29 15:10:56.000000000 +0200
+@@ -171,11 +171,19 @@ int ehca_mmap(struct ib_ucontext *contex
+ 
+ void ehca_poll_eqs(unsigned long data);
+ 
++int ehca_mmap_nopage(u64 foffset, u64 length, void **mapped,
++		     struct vm_area_struct **vma);
++
++int ehca_mmap_register(u64 physical, void **mapped,
++		       struct vm_area_struct **vma);
++
++int ehca_munmap(unsigned long addr, size_t len);
++
+ #ifdef CONFIG_PPC_64K_PAGES
+ void *ehca_alloc_fw_ctrlblock(gfp_t flags);
+ void ehca_free_fw_ctrlblock(void *ptr);
+ #else
+-#define ehca_alloc_fw_ctrlblock(flags) ((void*) get_zeroed_page(flags))
++#define ehca_alloc_fw_ctrlblock(flags) ((void *) get_zeroed_page(flags))
+ #define ehca_free_fw_ctrlblock(ptr) free_page((unsigned long)(ptr))
+ #endif
+ 
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-04 10:40:06.000000000 +0200
+@@ -52,7 +52,7 @@
+ MODULE_LICENSE("Dual BSD/GPL");
+ MODULE_AUTHOR("Christoph Raisch <raisch at de.ibm.com>");
+ MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver");
+-MODULE_VERSION("SVNEHCA_0022");
++MODULE_VERSION("SVNEHCA_0019");
+ 
+ int ehca_open_aqp1     = 0;
+ int ehca_debug_level   = 0;
+@@ -293,7 +293,7 @@ int ehca_init_device(struct ehca_shca *s
+ 	strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX);
+ 	shca->ib_device.owner               = THIS_MODULE;
+ 
+-	shca->ib_device.uverbs_abi_ver	    = 6;
++	shca->ib_device.uverbs_abi_ver	    = 5;
+ 	shca->ib_device.uverbs_cmd_mask	    =
+ 		(1ull << IB_USER_VERBS_CMD_GET_CONTEXT)		|
+ 		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE)	|
+@@ -357,7 +357,7 @@ int ehca_init_device(struct ehca_shca *s
+ 	shca->ib_device.dealloc_fmr	    = ehca_dealloc_fmr;
+ 	shca->ib_device.attach_mcast	    = ehca_attach_mcast;
+ 	shca->ib_device.detach_mcast	    = ehca_detach_mcast;
+-	/* shca->ib_device.process_mad	    = ehca_process_mad;     */
++	/* shca->ib_device.process_mad	    = ehca_process_mad;	    */
+ 	shca->ib_device.mmap		    = ehca_mmap;
+ 
+ 	return ret;
+@@ -811,7 +811,7 @@ int __init ehca_module_init(void)
+ 	int ret;
+ 
+ 	printk(KERN_INFO "eHCA Infiniband Device Driver "
+-	                 "(Rel.: SVNEHCA_0022)\n");
++	                 "(Rel.: SVNEHCA_0019)\n");
+ 	idr_init(&ehca_qp_idr);
+ 	idr_init(&ehca_cq_idr);
+ 	spin_lock_init(&ehca_qp_idr_lock);
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c	2007-04-29 15:10:56.000000000 +0200
+@@ -637,6 +637,7 @@ struct ib_qp *ehca_create_qp(struct ib_p
+ 		struct ipz_queue *ipz_rqueue = &my_qp->ipz_rqueue;
+ 		struct ipz_queue *ipz_squeue = &my_qp->ipz_squeue;
+ 		struct ehca_create_qp_resp resp;
++		struct vm_area_struct *vma;
+ 		memset(&resp, 0, sizeof(resp));
+ 
+ 		resp.qp_num = my_qp->real_qp_num;
+@@ -650,21 +651,59 @@ struct ib_qp *ehca_create_qp(struct ib_p
+ 		resp.ipz_rqueue.queue_length = ipz_rqueue->queue_length;
+ 		resp.ipz_rqueue.pagesize = ipz_rqueue->pagesize;
+ 		resp.ipz_rqueue.toggle_state = ipz_rqueue->toggle_state;
++		ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x22000000,
++				       ipz_rqueue->queue_length,
++				       (void**)&resp.ipz_rqueue.queue,
++				       &vma);
++		if (ret) {
++			ehca_err(pd->device, "Could not mmap rqueue pages");
++			goto create_qp_exit3;
++		}
++		my_qp->uspace_rqueue = resp.ipz_rqueue.queue;
+ 		/* squeue properties */
+ 		resp.ipz_squeue.qe_size = ipz_squeue->qe_size;
+ 		resp.ipz_squeue.act_nr_of_sg = ipz_squeue->act_nr_of_sg;
+ 		resp.ipz_squeue.queue_length = ipz_squeue->queue_length;
+ 		resp.ipz_squeue.pagesize = ipz_squeue->pagesize;
+ 		resp.ipz_squeue.toggle_state = ipz_squeue->toggle_state;
++		ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x23000000,
++				       ipz_squeue->queue_length,
++				       (void**)&resp.ipz_squeue.queue,
++				       &vma);
++		if (ret) {
++			ehca_err(pd->device, "Could not mmap squeue pages");
++			goto create_qp_exit4;
++		}
++		my_qp->uspace_squeue = resp.ipz_squeue.queue;
++		/* fw_handle */
++		resp.galpas = my_qp->galpas;
++		ret = ehca_mmap_register(my_qp->galpas.user.fw_handle,
++					 (void**)&resp.galpas.kernel.fw_handle,
++					 &vma);
++		if (ret) {
++			ehca_err(pd->device, "Could not mmap fw_handle");
++			goto create_qp_exit5;
++		}
++		my_qp->uspace_fwh = (u64)resp.galpas.kernel.fw_handle;
++
+ 		if (ib_copy_to_udata(udata, &resp, sizeof resp)) {
+ 			ehca_err(pd->device, "Copy to udata failed");
+ 			ret = -EINVAL;
+-			goto create_qp_exit3;
++			goto create_qp_exit6;
+ 		}
+ 	}
+ 
+ 	return &my_qp->ib_qp;
+ 
++create_qp_exit6:
++	ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE);
++
++create_qp_exit5:
++	ehca_munmap(my_qp->uspace_squeue, my_qp->ipz_squeue.queue_length);
++
++create_qp_exit4:
++	ehca_munmap(my_qp->uspace_rqueue, my_qp->ipz_rqueue.queue_length);
++
+ create_qp_exit3:
+ 	ipz_queue_dtor(&my_qp->ipz_rqueue);
+ 	ipz_queue_dtor(&my_qp->ipz_squeue);
+@@ -892,7 +931,7 @@ static int internal_modify_qp(struct ib_
+ 	     my_qp->qp_type == IB_QPT_SMI) &&
+ 	    statetrans == IB_QPST_SQE2RTS) {
+ 		/* mark next free wqe if kernel */
+-		if (!ibqp->uobject) {
++		if (my_qp->uspace_squeue == 0) {
+ 			struct ehca_wqe *wqe;
+ 			/* lock send queue */
+ 			spin_lock_irqsave(&my_qp->spinlock_s, spl_flags);
+@@ -1378,18 +1417,11 @@ int ehca_destroy_qp(struct ib_qp *ibqp)
+ 	enum ib_qp_type	qp_type;
+ 	unsigned long flags;
+ 
+-	if (ibqp->uobject) {
+-		if (my_qp->mm_count_galpa ||
+-		    my_qp->mm_count_rqueue || my_qp->mm_count_squeue) {
+-			ehca_err(ibqp->device, "Resources still referenced in "
+-				 "user space qp_num=%x", ibqp->qp_num);
+-			return -EINVAL;
+-		}
+-		if (my_pd->ownpid != cur_pid) {
+-			ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x",
+-				 cur_pid, my_pd->ownpid);
+-			return -EINVAL;
+-		}
++	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
++	    my_pd->ownpid != cur_pid) {
++		ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x",
++			 cur_pid, my_pd->ownpid);
++		return -EINVAL;
+ 	}
+ 
+ 	if (my_qp->send_cq) {
+@@ -1407,6 +1439,24 @@ int ehca_destroy_qp(struct ib_qp *ibqp)
+ 	idr_remove(&ehca_qp_idr, my_qp->token);
+ 	spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+ 
++	/* un-mmap if vma alloc */
++	if (my_qp->uspace_rqueue) {
++		ret = ehca_munmap(my_qp->uspace_rqueue,
++				  my_qp->ipz_rqueue.queue_length);
++		if (ret)
++			ehca_err(ibqp->device, "Could not munmap rqueue "
++				 "qp_num=%x", qp_num);
++		ret = ehca_munmap(my_qp->uspace_squeue,
++				  my_qp->ipz_squeue.queue_length);
++		if (ret)
++			ehca_err(ibqp->device, "Could not munmap squeue "
++				 "qp_num=%x", qp_num);
++		ret = ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE);
++		if (ret)
++			ehca_err(ibqp->device, "Could not munmap fwh qp_num=%x",
++				 qp_num);
++	}
++
+ 	h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp);
+ 	if (h_ret != H_SUCCESS) {
+ 		ehca_err(ibqp->device, "hipz_h_destroy_qp() failed rc=%lx "
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_uverbs.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_uverbs.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_uverbs.c	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_uverbs.c	2007-04-29 15:10:56.000000000 +0200
+@@ -68,183 +68,105 @@ int ehca_dealloc_ucontext(struct ib_ucon
+ 	return 0;
+ }
+ 
+-static void ehca_mm_open(struct vm_area_struct *vma)
++struct page *ehca_nopage(struct vm_area_struct *vma,
++			 unsigned long address, int *type)
+ {
+-	u32 *count = (u32*)vma->vm_private_data;
+-	if (!count) {
+-		ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx",
+-			     vma->vm_start, vma->vm_end);
+-		return;
+-	}
+-	(*count)++;
+-	if (!(*count))
+-		ehca_gen_err("Use count overflow vm_start=%lx vm_end=%lx",
+-			     vma->vm_start, vma->vm_end);
+-	ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x",
+-		     vma->vm_start, vma->vm_end, *count);
+-}
+-
+-static void ehca_mm_close(struct vm_area_struct *vma)
+-{
+-	u32 *count = (u32*)vma->vm_private_data;
+-	if (!count) {
+-		ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx",
+-			     vma->vm_start, vma->vm_end);
+-		return;
+-	}
+-	(*count)--;
+-	ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x",
+-		     vma->vm_start, vma->vm_end, *count);
+-}
+-
+-static struct vm_operations_struct vm_ops = {
+-	.open =	ehca_mm_open,
+-	.close = ehca_mm_close,
+-};
+-
+-static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas,
+-			u32 *mm_count)
+-{
+-	int ret;
+-	u64 vsize, physical;
+-
+-	vsize = vma->vm_end - vma->vm_start;
+-	if (vsize != EHCA_PAGESIZE) {
+-		ehca_gen_err("invalid vsize=%lx", vma->vm_end - vma->vm_start);
+-		return -EINVAL;
+-	}
+-
+-	physical = galpas->user.fw_handle;
+-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+-	ehca_gen_dbg("vsize=%lx physical=%lx", vsize, physical);
+-	/* VM_IO | VM_RESERVED are set by remap_pfn_range() */
+-	ret = remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT,
+-			      vsize, vma->vm_page_prot);
+-	if (unlikely(ret)) {
+-		ehca_gen_err("remap_pfn_range() failed ret=%x", ret);
+-		return -ENOMEM;
+-	}
+-
+-	vma->vm_private_data = mm_count;
+-	(*mm_count)++;
+-	vma->vm_ops = &vm_ops;
+-
+-	return 0;
+-}
++	struct page *mypage = NULL;
++	u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT;
++	u32 idr_handle = fileoffset >> 32;
++	u32 q_type = (fileoffset >> 28) & 0xF;	  /* CQ, QP,...        */
++	u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */
++	u32 cur_pid = current->tgid;
++	unsigned long flags;
++	struct ehca_cq *cq;
++	struct ehca_qp *qp;
++	struct ehca_pd *pd;
++	u64 offset;
++	void *vaddr;
+ 
+-static int ehca_mmap_queue(struct vm_area_struct *vma, struct ipz_queue *queue,
+-			   u32 *mm_count)
+-{
+-	int ret;
+-	u64 start, ofs;
+-	struct page *page;
++	switch (q_type) {
++	case 1: /* CQ */
++		spin_lock_irqsave(&ehca_cq_idr_lock, flags);
++		cq = idr_find(&ehca_cq_idr, idr_handle);
++		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+ 
+-	vma->vm_flags |= VM_RESERVED;
+-	start = vma->vm_start;
+-	for (ofs = 0; ofs < queue->queue_length; ofs += PAGE_SIZE) {
+-		u64 virt_addr = (u64)ipz_qeit_calc(queue, ofs);
+-		page = virt_to_page(virt_addr);
+-		ret = vm_insert_page(vma, start, page);
+-		if (unlikely(ret)) {
+-			ehca_gen_err("vm_insert_page() failed rc=%x", ret);
+-			return ret;
++		/* make sure this mmap really belongs to the authorized user */
++		if (!cq) {
++			ehca_gen_err("cq is NULL ret=NOPAGE_SIGBUS");
++			return NOPAGE_SIGBUS;
+ 		}
+-		start +=  PAGE_SIZE;
+-	}
+-	vma->vm_private_data = mm_count;
+-	(*mm_count)++;
+-	vma->vm_ops = &vm_ops;
+ 
+-	return 0;
+-}
+-
+-static int ehca_mmap_cq(struct vm_area_struct *vma, struct ehca_cq *cq,
+-			u32 rsrc_type)
+-{
+-	int ret;
+-
+-	switch (rsrc_type) {
+-	case 1: /* galpa fw handle */
+-		ehca_dbg(cq->ib_cq.device, "cq_num=%x fw", cq->cq_number);
+-		ret = ehca_mmap_fw(vma, &cq->galpas, &cq->mm_count_galpa);
+-		if (unlikely(ret)) {
++		if (cq->ownpid != cur_pid) {
+ 			ehca_err(cq->ib_cq.device,
+-				 "ehca_mmap_fw() failed rc=%x cq_num=%x",
+-				 ret, cq->cq_number);
+-			return ret;
++				 "Invalid caller pid=%x ownpid=%x",
++				 cur_pid, cq->ownpid);
++			return NOPAGE_SIGBUS;
+ 		}
+-		break;
+ 
+-	case 2: /* cq queue_addr */
+-		ehca_dbg(cq->ib_cq.device, "cq_num=%x queue", cq->cq_number);
+-		ret = ehca_mmap_queue(vma, &cq->ipz_queue, &cq->mm_count_queue);
+-		if (unlikely(ret)) {
+-			ehca_err(cq->ib_cq.device,
+-				 "ehca_mmap_queue() failed rc=%x cq_num=%x",
+-				 ret, cq->cq_number);
+-			return ret;
++		if (rsrc_type == 2) {
++			ehca_dbg(cq->ib_cq.device, "cq=%p cq queuearea", cq);
++			offset = address - vma->vm_start;
++			vaddr = ipz_qeit_calc(&cq->ipz_queue, offset);
++			ehca_dbg(cq->ib_cq.device, "offset=%lx vaddr=%p",
++				 offset, vaddr);
++			mypage = virt_to_page(vaddr);
+ 		}
+ 		break;
+ 
+-	default:
+-		ehca_err(cq->ib_cq.device, "bad resource type=%x cq_num=%x",
+-			 rsrc_type, cq->cq_number);
+-		return -EINVAL;
+-	}
+-
+-	return 0;
+-}
+-
+-static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp,
+-			u32 rsrc_type)
+-{
+-	int ret;
++	case 2: /* QP */
++		spin_lock_irqsave(&ehca_qp_idr_lock, flags);
++		qp = idr_find(&ehca_qp_idr, idr_handle);
++		spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+ 
+-	switch (rsrc_type) {
+-	case 1: /* galpa fw handle */
+-		ehca_dbg(qp->ib_qp.device, "qp_num=%x fw", qp->ib_qp.qp_num);
+-		ret = ehca_mmap_fw(vma, &qp->galpas, &qp->mm_count_galpa);
+-		if (unlikely(ret)) {
+-			ehca_err(qp->ib_qp.device,
+-				 "remap_pfn_range() failed ret=%x qp_num=%x",
+-				 ret, qp->ib_qp.qp_num);
+-			return -ENOMEM;
++		/* make sure this mmap really belongs to the authorized user */
++		if (!qp) {
++			ehca_gen_err("qp is NULL ret=NOPAGE_SIGBUS");
++			return NOPAGE_SIGBUS;
+ 		}
+-		break;
+ 
+-	case 2: /* qp rqueue_addr */
+-		ehca_dbg(qp->ib_qp.device, "qp_num=%x rqueue",
+-			 qp->ib_qp.qp_num);
+-		ret = ehca_mmap_queue(vma, &qp->ipz_rqueue, &qp->mm_count_rqueue);
+-		if (unlikely(ret)) {
++		pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd);
++		if (pd->ownpid != cur_pid) {
+ 			ehca_err(qp->ib_qp.device,
+-				 "ehca_mmap_queue(rq) failed rc=%x qp_num=%x",
+-				 ret, qp->ib_qp.qp_num);
+-			return ret;
++				 "Invalid caller pid=%x ownpid=%x",
++				 cur_pid, pd->ownpid);
++			return NOPAGE_SIGBUS;
+ 		}
+-		break;
+ 
+-	case 3: /* qp squeue_addr */
+-		ehca_dbg(qp->ib_qp.device, "qp_num=%x squeue",
+-			 qp->ib_qp.qp_num);
+-		ret = ehca_mmap_queue(vma, &qp->ipz_squeue, &qp->mm_count_squeue);
+-		if (unlikely(ret)) {
+-			ehca_err(qp->ib_qp.device,
+-				 "ehca_mmap_queue(sq) failed rc=%x qp_num=%x",
+-				 ret, qp->ib_qp.qp_num);
+-			return ret;
++		if (rsrc_type == 2) {	/* rqueue */
++			ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueuearea", qp);
++			offset = address - vma->vm_start;
++			vaddr = ipz_qeit_calc(&qp->ipz_rqueue, offset);
++			ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p",
++				 offset, vaddr);
++			mypage = virt_to_page(vaddr);
++		} else if (rsrc_type == 3) {	/* squeue */
++			ehca_dbg(qp->ib_qp.device, "qp=%p qp squeuearea", qp);
++			offset = address - vma->vm_start;
++			vaddr = ipz_qeit_calc(&qp->ipz_squeue, offset);
++			ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p",
++				 offset, vaddr);
++			mypage = virt_to_page(vaddr);
+ 		}
+ 		break;
+ 
+ 	default:
+-		ehca_err(qp->ib_qp.device, "bad resource type=%x qp=num=%x",
+-			 rsrc_type, qp->ib_qp.qp_num);
+-		return -EINVAL;
++		ehca_gen_err("bad queue type %x", q_type);
++		return NOPAGE_SIGBUS;
+ 	}
+ 
+-	return 0;
++	if (!mypage) {
++		ehca_gen_err("Invalid page adr==NULL ret=NOPAGE_SIGBUS");
++		return NOPAGE_SIGBUS;
++	}
++	get_page(mypage);
++
++	return mypage;
+ }
+ 
++static struct vm_operations_struct ehcau_vm_ops = {
++	.nopage = ehca_nopage,
++};
++
+ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
+ {
+ 	u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT;
+@@ -253,6 +175,7 @@ int ehca_mmap(struct ib_ucontext *contex
+ 	u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */
+ 	u32 cur_pid = current->tgid;
+ 	u32 ret;
++	u64 vsize, physical;
+ 	unsigned long flags;
+ 	struct ehca_cq *cq;
+ 	struct ehca_qp *qp;
+@@ -278,12 +201,44 @@ int ehca_mmap(struct ib_ucontext *contex
+ 		if (!cq->ib_cq.uobject || cq->ib_cq.uobject->context != context)
+ 			return -EINVAL;
+ 
+-		ret = ehca_mmap_cq(vma, cq, rsrc_type);
+-		if (unlikely(ret)) {
+-			ehca_err(cq->ib_cq.device,
+-				 "ehca_mmap_cq() failed rc=%x cq_num=%x",
+-				 ret, cq->cq_number);
+-			return ret;
++		switch (rsrc_type) {
++		case 1: /* galpa fw handle */
++			ehca_dbg(cq->ib_cq.device, "cq=%p cq triggerarea", cq);
++			vma->vm_flags |= VM_RESERVED;
++			vsize = vma->vm_end - vma->vm_start;
++			if (vsize != EHCA_PAGESIZE) {
++				ehca_err(cq->ib_cq.device, "invalid vsize=%lx",
++					 vma->vm_end - vma->vm_start);
++				return -EINVAL;
++			}
++
++			physical = cq->galpas.user.fw_handle;
++			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
++			vma->vm_flags |= VM_IO | VM_RESERVED;
++
++			ehca_dbg(cq->ib_cq.device,
++				 "vsize=%lx physical=%lx", vsize, physical);
++			ret = remap_pfn_range(vma, vma->vm_start,
++					      physical >> PAGE_SHIFT, vsize,
++					      vma->vm_page_prot);
++			if (ret) {
++				ehca_err(cq->ib_cq.device,
++					 "remap_pfn_range() failed ret=%x",
++					 ret);
++				return -ENOMEM;
++			}
++			break;
++
++		case 2: /* cq queue_addr */
++			ehca_dbg(cq->ib_cq.device, "cq=%p cq q_addr", cq);
++			vma->vm_flags |= VM_RESERVED;
++			vma->vm_ops = &ehcau_vm_ops;
++			break;
++
++		default:
++			ehca_err(cq->ib_cq.device, "bad resource type %x",
++				 rsrc_type);
++			return -EINVAL;
+ 		}
+ 		break;
+ 
+@@ -307,12 +262,50 @@ int ehca_mmap(struct ib_ucontext *contex
+ 		if (!qp->ib_qp.uobject || qp->ib_qp.uobject->context != context)
+ 			return -EINVAL;
+ 
+-		ret = ehca_mmap_qp(vma, qp, rsrc_type);
+-		if (unlikely(ret)) {
+-			ehca_err(qp->ib_qp.device,
+-				 "ehca_mmap_qp() failed rc=%x qp_num=%x",
+-				 ret, qp->ib_qp.qp_num);
+-			return ret;
++		switch (rsrc_type) {
++		case 1: /* galpa fw handle */
++			ehca_dbg(qp->ib_qp.device, "qp=%p qp triggerarea", qp);
++			vma->vm_flags |= VM_RESERVED;
++			vsize = vma->vm_end - vma->vm_start;
++			if (vsize != EHCA_PAGESIZE) {
++				ehca_err(qp->ib_qp.device, "invalid vsize=%lx",
++					 vma->vm_end - vma->vm_start);
++				return -EINVAL;
++			}
++
++			physical = qp->galpas.user.fw_handle;
++			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
++			vma->vm_flags |= VM_IO | VM_RESERVED;
++
++			ehca_dbg(qp->ib_qp.device, "vsize=%lx physical=%lx",
++				 vsize, physical);
++			ret = remap_pfn_range(vma, vma->vm_start,
++					      physical >> PAGE_SHIFT, vsize,
++					      vma->vm_page_prot);
++			if (ret) {
++				ehca_err(qp->ib_qp.device,
++					 "remap_pfn_range() failed ret=%x",
++					 ret);
++				return -ENOMEM;
++			}
++			break;
++
++		case 2: /* qp rqueue_addr */
++			ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueue_addr", qp);
++			vma->vm_flags |= VM_RESERVED;
++			vma->vm_ops = &ehcau_vm_ops;
++			break;
++
++		case 3: /* qp squeue_addr */
++			ehca_dbg(qp->ib_qp.device, "qp=%p qp squeue_addr", qp);
++			vma->vm_flags |= VM_RESERVED;
++			vma->vm_ops = &ehcau_vm_ops;
++			break;
++
++		default:
++			ehca_err(qp->ib_qp.device, "bad resource type %x",
++				 rsrc_type);
++			return -EINVAL;
+ 		}
+ 		break;
+ 
+@@ -323,3 +316,77 @@ int ehca_mmap(struct ib_ucontext *contex
+ 
+ 	return 0;
+ }
++
++int ehca_mmap_nopage(u64 foffset, u64 length, void **mapped,
++		     struct vm_area_struct **vma)
++{
++	down_write(&current->mm->mmap_sem);
++	*mapped = (void *)do_mmap(NULL, 0, length, PROT_WRITE,
++				 MAP_SHARED | MAP_ANONYMOUS,
++				 foffset);
++	up_write(&current->mm->mmap_sem);
++	if (!(*mapped)) {
++		ehca_gen_err("couldn't mmap foffset=%lx length=%lx",
++			     foffset, length);
++		return -EINVAL;
++	}
++
++	*vma = find_vma(current->mm, (u64)*mapped);
++	if (!(*vma)) {
++		down_write(&current->mm->mmap_sem);
++		do_munmap(current->mm, (unsigned long)*mapped, length);
++		up_write(&current->mm->mmap_sem);
++		ehca_gen_err("couldn't find vma queue=%p", *mapped);
++		return -EINVAL;
++	}
++	(*vma)->vm_flags |= VM_RESERVED;
++	(*vma)->vm_ops = &ehcau_vm_ops;
++
++	return 0;
++}
++
++int ehca_mmap_register(u64 physical, void **mapped,
++		       struct vm_area_struct **vma)
++{
++	int ret;
++	unsigned long vsize;
++	/* ehca hw supports only 4k page */
++	ret = ehca_mmap_nopage(0, EHCA_PAGESIZE, mapped, vma);
++	if (ret) {
++		ehca_gen_err("couldn't mmap physical=%lx", physical);
++		return ret;
++	}
++
++	(*vma)->vm_flags |= VM_RESERVED;
++	vsize = (*vma)->vm_end - (*vma)->vm_start;
++	if (vsize != EHCA_PAGESIZE) {
++		ehca_gen_err("invalid vsize=%lx",
++			     (*vma)->vm_end - (*vma)->vm_start);
++		return -EINVAL;
++	}
++
++	(*vma)->vm_page_prot = pgprot_noncached((*vma)->vm_page_prot);
++	(*vma)->vm_flags |= VM_IO | VM_RESERVED;
++
++	ret = remap_pfn_range((*vma), (*vma)->vm_start,
++			      physical >> PAGE_SHIFT, vsize,
++			      (*vma)->vm_page_prot);
++	if (ret) {
++		ehca_gen_err("remap_pfn_range() failed ret=%x", ret);
++		return -ENOMEM;
++	}
++
++	return 0;
++
++}
++
++int ehca_munmap(unsigned long addr, size_t len) {
++	int ret = 0;
++	struct mm_struct *mm = current->mm;
++	if (mm) {
++		down_write(&mm->mmap_sem);
++		ret = do_munmap(mm, addr, len);
++		up_write(&mm->mmap_sem);
++	}
++	return ret;
++}

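Note on the mmap file offsets used above: ehca_create_cq() and
ehca_create_qp() pass offsets such as ((u64)token << 32) | 0x12000000,
and ehca_nopage()/ehca_mmap() decode them with shifts by 32, 28 and 24.
The following is only a user-space sketch of that encoding, not code
from the driver. The helper names (toy_encode_foffset,
toy_decode_foffset) and the enum names are made up for illustration;
the patch open-codes the shifts.

```c
#include <stdint.h>

/* Hypothetical names for the values seen in the patch. */
enum { TOY_Q_TYPE_CQ = 1, TOY_Q_TYPE_QP = 2 };
enum { TOY_RSRC_FW_HANDLE = 1, TOY_RSRC_QUEUE = 2, TOY_RSRC_SQUEUE = 3 };

/*
 * Pack the idr token into bits 63..32, the queue type into
 * bits 31..28 and the resource type into bits 27..24.
 */
static uint64_t toy_encode_foffset(uint32_t token, uint32_t q_type,
				   uint32_t rsrc_type)
{
	return ((uint64_t)token << 32) |
	       ((uint64_t)(q_type & 0xF) << 28) |
	       ((uint64_t)(rsrc_type & 0xF) << 24);
}

/* Reverse of the above, using the same shifts as ehca_nopage(). */
static void toy_decode_foffset(uint64_t foffset, uint32_t *token,
			       uint32_t *q_type, uint32_t *rsrc_type)
{
	*token = (uint32_t)(foffset >> 32);
	*q_type = (uint32_t)((foffset >> 28) & 0xF);
	*rsrc_type = (uint32_t)((foffset >> 24) & 0xF);
}
```

With this layout, 0x12000000 decodes to the CQ queue area, 0x22000000
to the QP receive queue and 0x23000000 to the QP send queue, which
matches the constants passed to ehca_mmap_nopage() above.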

From ossrosch at linux.vnet.ibm.com  Thu May 10 07:29:08 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 16:29:08 +0200
Subject: [ofa-general] [PATCH ofed-1.2-rc3 3/4] ehca: backport for rhel-4.5 -
	use introduced dma_ops
Message-ID: <200705101629.09382.ossrosch@linux.vnet.ibm.com>


Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
---
backport_ehca_3_rhel45_dma.patch |  226 +++++++++++++++++++++++++++++++++++++++
1 files changed, 226 insertions(+)
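
Note for reviewers: the new ehca_dma.c fills struct ib_dma_mapping_ops
with a positional initializer, so it relies on the member order of the
struct. The following is only a stand-alone sketch of the same
ops-table pattern written with designated initializers, which does not
depend on that order. All toy_* names are hypothetical and are not part
of ib_verbs.h or of this patch; the identity "mapping" is an assumption
made to keep the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_device {
	int id;
};

/* Cut-down stand-in for a DMA ops table (three members only). */
struct toy_dma_ops {
	int (*mapping_error)(struct toy_device *dev, uint64_t dma_addr);
	uint64_t (*map_single)(struct toy_device *dev, void *cpu_addr,
			       size_t size);
	void (*unmap_single)(struct toy_device *dev, uint64_t addr,
			     size_t size);
};

/* Same convention as ehca_mapping_error(): address 0 means failure. */
static int toy_mapping_error(struct toy_device *dev, uint64_t dma_addr)
{
	(void)dev;
	return dma_addr == 0;
}

/* Assumption: identity mapping instead of a real bus call. */
static uint64_t toy_map_single(struct toy_device *dev, void *cpu_addr,
			       size_t size)
{
	(void)dev;
	(void)size;
	return (uint64_t)(uintptr_t)cpu_addr;
}

static void toy_unmap_single(struct toy_device *dev, uint64_t addr,
			     size_t size)
{
	(void)dev;
	(void)addr;
	(void)size;
}

/* Designated initializers: each callback is bound by member name. */
static const struct toy_dma_ops toy_ops = {
	.mapping_error = toy_mapping_error,
	.map_single    = toy_map_single,
	.unmap_single  = toy_unmap_single,
};

/* Caller side: dispatch through the table if one is installed. */
static uint64_t toy_dma_map_single(struct toy_device *dev,
				   const struct toy_dma_ops *ops,
				   void *cpu_addr, size_t size)
{
	if (ops)
		return ops->map_single(dev, cpu_addr, size);
	return (uint64_t)(uintptr_t)cpu_addr;
}
```

Whether the positional form is kept for the backport or switched to
designated initializers is a style choice; the behaviour is the same
as long as the member order matches.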


diff -Nurp ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_3_rhel45_dma.patch ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_3_rhel45_dma.patch
--- ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_3_rhel45_dma.patch	1970-01-01 01:00:00.000000000 +0100
+++ ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_3_rhel45_dma.patch	2007-05-10 17:30:24.000000000 +0200
@@ -0,0 +1,226 @@
+diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_dma.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_dma.c
+--- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_dma.c	1970-01-01 01:00:00.000000000 +0100
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_dma.c	2007-05-03 16:25:30.000000000 +0200
+@@ -0,0 +1,194 @@
++/*
++ *  IBM eServer eHCA Infiniband device driver for Linux on POWER
++ *
++ *  eHCA dma mapping via ibmebus
++ *
++ *  Authors: Stefan Roscher <stefan.roscher at de.ibm.com>
++ *           Hoang-Nam Nguyen <hnguyen at de.ibm.com>
++ *
++ *  Copyright (c) 2007 IBM Corporation
++ *
++ *  All rights reserved.
++ *
++ *  This source code is distributed under a dual license of GPL v2.0 and OpenIB
++ *  BSD.
++ *
++ * OpenIB BSD License
++ *
++ * Redistribution and use in source and binary forms, with or without
++ * modification, are permitted provided that the following conditions are met:
++ *
++ * Redistributions of source code must retain the above copyright notice, this
++ * list of conditions and the following disclaimer.
++ *
++ * Redistributions in binary form must reproduce the above copyright notice,
++ * this list of conditions and the following disclaimer in the documentation
++ * and/or other materials
++ * provided with the distribution.
++ *
++ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
++ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
++ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
++ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
++ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
++ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
++ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
++ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
++ * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
++ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
++ * POSSIBILITY OF SUCH DAMAGE.
++ */
++
++#include <asm/ibmebus.h>
++#include <rdma/ib_verbs.h>
++
++static int ehca_mapping_error(struct ib_device *dev, u64 dma_addr);
++
++static u64 ehca_dma_map_single(struct ib_device *dev,
++			        void *cpu_addr, size_t size,
++			        enum dma_data_direction direction);
++
++static void ehca_dma_unmap_single(struct ib_device *dev,
++				   u64 addr, size_t size,
++				  enum dma_data_direction direction);
++
++static u64 ehca_dma_map_page(struct ib_device *dev,
++			      struct page *page,
++			      unsigned long offset,
++			      size_t size,
++			     enum dma_data_direction direction);
++
++static void ehca_dma_unmap_page(struct ib_device *dev,
++				 u64 addr, size_t size,
++				enum dma_data_direction direction);
++
++int ehca_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents,
++		enum dma_data_direction direction);
++
++static void ehca_unmap_sg(struct ib_device *dev,
++			   struct scatterlist *sg, int nents,
++			  enum dma_data_direction direction);
++
++static u64 ehca_sg_dma_address(struct ib_device *dev, struct scatterlist *sg);
++
++static unsigned int ehca_sg_dma_len(struct ib_device *dev,
++				    struct scatterlist *sg);
++
++static void ehca_sync_single_for_cpu(struct ib_device *dev,
++				      u64 addr,
++				      size_t size,
++				     enum dma_data_direction dir);
++
++static void ehca_sync_single_for_device(struct ib_device *dev,
++					 u64 addr,
++					 size_t size,
++					enum dma_data_direction dir);
++
++static void *ehca_dma_alloc_coherent(struct ib_device *dev, size_t size,
++				     u64 *dma_handle, gfp_t flag);
++
++static void ehca_dma_free_coherent(struct ib_device *dev, size_t size,
++				   void *cpu_addr, dma_addr_t dma_handle);
++
++struct ib_dma_mapping_ops ehca_dma_mapping_ops = {
++	ehca_mapping_error,
++	ehca_dma_map_single,
++	ehca_dma_unmap_single,
++	ehca_dma_map_page,
++	ehca_dma_unmap_page,
++	ehca_map_sg,
++	ehca_unmap_sg,
++	ehca_sg_dma_address,
++	ehca_sg_dma_len,
++	ehca_sync_single_for_cpu,
++	ehca_sync_single_for_device,
++	ehca_dma_alloc_coherent,
++	ehca_dma_free_coherent
++};
++
++static int ehca_mapping_error(struct ib_device *dev, u64 dma_addr)
++{
++	return dma_addr == 0L;
++}
++
++static u64 ehca_dma_map_single(struct ib_device *dev,
++			        void *cpu_addr, size_t size,
++			        enum dma_data_direction direction)
++{
++	return ibmebus_map_single(dev, cpu_addr, size, direction);
++}
++
++static void ehca_dma_unmap_single(struct ib_device *dev,
++				   u64 addr, size_t size,
++				   enum dma_data_direction direction)
++{
++	ibmebus_unmap_single(dev, addr, size, direction);
++}
++
++static u64 ehca_dma_map_page(struct ib_device *dev,
++			      struct page *page,
++			      unsigned long offset,
++			      size_t size,
++			      enum dma_data_direction direction)
++{
++	return dma_map_page(dev->dma_device, page, offset, size, direction);
++}
++
++static void ehca_dma_unmap_page(struct ib_device *dev,
++				 u64 addr, size_t size,
++				 enum dma_data_direction direction)
++{
++	dma_unmap_page(dev->dma_device, addr, size, direction);
++}
++
++int ehca_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents,
++		 enum dma_data_direction direction)
++{
++	return ibmebus_map_sg(dev, sg, nents, direction);
++}
++
++static void ehca_unmap_sg(struct ib_device *dev,
++			   struct scatterlist *sg, int nents,
++			   enum dma_data_direction direction)
++{
++	ibmebus_unmap_sg(dev, sg, nents, direction);
++}
++
++static u64 ehca_sg_dma_address(struct ib_device *dev, struct scatterlist *sg)
++{
++	return sg_dma_address(sg);
++}
++
++static unsigned int ehca_sg_dma_len(struct ib_device *dev,
++				     struct scatterlist *sg)
++{
++	return sg_dma_len(sg);
++}
++
++static void ehca_sync_single_for_cpu(struct ib_device *dev,
++				      u64 addr,
++				      size_t size,
++				      enum dma_data_direction dir)
++{
++	dma_sync_single_for_cpu(dev->dma_device, addr, size, dir);
++}
++
++static void ehca_sync_single_for_device(struct ib_device *dev,
++					 u64 addr,
++					 size_t size,
++					 enum dma_data_direction dir)
++{
++	dma_sync_single_for_device(dev->dma_device, addr, size, dir);
++}
++
++static void *ehca_dma_alloc_coherent(struct ib_device *dev, size_t size,
++				      u64 *dma_handle, gfp_t flag)
++{
++	return ibmebus_alloc_coherent(dev, size, dma_handle, flag);
++}
++
++static void ehca_dma_free_coherent(struct ib_device *dev, size_t size,
++				    void *cpu_addr, dma_addr_t dma_handle)
++{
++	ibmebus_free_coherent(dev, size, cpu_addr, dma_handle);
++}
+diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c
+--- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_main.c	2007-04-29 15:10:56.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-03 16:19:28.000000000 +0200
+@@ -279,6 +279,7 @@ init_node_guid1:
+ 
+ int ehca_init_device(struct ehca_shca *shca)
+ {
++	extern struct ib_dma_mapping_ops ehca_dma_mapping_ops;
+ 	int ret;
+ 
+ 	ret = init_node_guid(shca);
+@@ -354,6 +355,7 @@ int ehca_init_device(struct ehca_shca *s
+ 	shca->ib_device.detach_mcast	    = ehca_detach_mcast;
+ 	/* shca->ib_device.process_mad	    = ehca_process_mad;	    */
+ 	shca->ib_device.mmap		    = ehca_mmap;
++	shca->ib_device.dma_ops             = &ehca_dma_mapping_ops;
+ 
+ 	return ret;
+ }
+diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/Makefile ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/Makefile
+--- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/Makefile	2007-04-29 15:10:56.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/Makefile	2007-05-03 16:26:13.000000000 +0200
+@@ -12,5 +12,5 @@ obj-$(CONFIG_INFINIBAND_EHCA) += ib_ehca
+ 
+ ib_ehca-objs  = ehca_main.o ehca_hca.o ehca_mcast.o ehca_pd.o ehca_av.o ehca_eq.o \
+ 		ehca_cq.o ehca_qp.o ehca_sqp.o ehca_mrmw.o ehca_reqs.o ehca_irq.o \
+-		ehca_uverbs.o ipz_pt_fn.o hcp_if.o hcp_phyp.o
++		ehca_uverbs.o ehca_dma.o ipz_pt_fn.o hcp_if.o hcp_phyp.o


From ossrosch at linux.vnet.ibm.com  Thu May 10 07:30:08 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 16:30:08 +0200
Subject: [ofa-general] [PATCH ofed-1.2-rc3 4/4] ehca: backport for rhel-4.5 -
	create hvcall.h in kernel_addons
Message-ID: <200705101630.09153.ossrosch@linux.vnet.ibm.com>

Creates the files hvcall.h and system.h under kernel_addons/backport/2.6.9_U5/include.

Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
---
asm-powerpc/system.h |    1
asm/hvcall.h         |  309 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 310 insertions(+)


diff -Nurp ofa_kernel-1.2_old/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h ofa_kernel-1.2_new/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h
--- ofa_kernel-1.2_old/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h	1970-01-01 01:00:00.000000000 +0100
+++ ofa_kernel-1.2_new/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h	2007-05-10 18:14:12.000000000 +0200
@@ -0,0 +1,309 @@
+#ifndef ASM_HVCALL_BACKPORT_2616_H
+#define ASM_HVCALL_BACKPORT_2616_H
+
+#include_next <asm/hvcall.h>
+
+#ifdef __KERNEL__
+
+#define H_SUCCESS               H_Success
+#define H_BUSY                  H_Busy
+#define H_CONSTRAINED           H_Constrained
+#define H_PAGE_REGISTERED       15
+
+#define H_PARAMETER             H_Parameter
+#define H_NO_MEM                H_NoMem
+#define H_RESOURCE              H_Resource
+#define H_HARDWARE              H_Hardware
+#define H_ADAPTER_PARM          -17
+#define H_RH_PARM               -18
+#define H_RT_PARM               -22
+#define H_MLENGTH_PARM          -27
+#define H_MEM_PARM              -28
+#define H_MEM_ACCESS_PARM       -29
+#define H_ALIAS_EXIST           -39
+#define H_TABLE_FULL            -41
+#define H_NOT_ENOUGH_RESOURCES  -44
+#define H_R_STATE               -45
+
+#define H_CB_ALIGNMENT          4096
+
+#define H_RESET_EVENTS          0x15C
+#define H_ALLOC_RESOURCE        0x160
+#define H_FREE_RESOURCE         0x164
+#define H_MODIFY_QP             0x168
+#define H_QUERY_QP              0x16C
+#define H_REREGISTER_PMR        0x170
+#define H_REGISTER_SMR          0x174
+#define H_QUERY_MR              0x178
+#define H_QUERY_MW              0x17C
+#define H_QUERY_HCA             0x180
+#define H_QUERY_PORT            0x184
+#define H_MODIFY_PORT           0x188
+#define H_DEFINE_AQP1           0x18C
+#define H_DEFINE_AQP0           0x194
+#define H_RESIZE_MR             0x198
+#define H_ATTACH_MCQP           0x19C
+#define H_DETACH_MCQP           0x1A0
+#define H_REGISTER_RPAGES       0x1AC
+#define H_DISABLE_AND_GETC      0x1B0
+#define H_ERROR_DATA            0x1B4
+#define H_QUERY_INT_STATE       0x1E4
+
+#define H_LONG_BUSY_ORDER_1_MSEC   H_LongBusyOrder1msec
+#define H_LONG_BUSY_ORDER_10_MSEC  H_LongBusyOrder10msec
+#define H_LONG_BUSY_ORDER_100_MSEC H_LongBusyOrder100msec
+#define H_LONG_BUSY_ORDER_1_SEC    H_LongBusyOrder1sec
+#define H_LONG_BUSY_ORDER_10_SEC   H_LongBusyOrder10sec
+#define H_LONG_BUSY_ORDER_100_SEC  H_LongBusyOrder100sec
+#define H_IS_LONG_BUSY(x) ((x >= H_LongBusyStartRange) && (x <= H_LongBusyEndRange))
+
+
+#ifndef __ASSEMBLY__
+#include <linux/kernel.h>
+
+#define PLPAR_HCALL9_BUFSIZE 9
+inline static long plpar_hcall9(unsigned long opcode,
+				unsigned long *retbuf,
+				unsigned long arg1,	/* <R4  */
+				unsigned long arg2,	/* <R5  */
+				unsigned long arg3,	/* <R6  */
+				unsigned long arg4,	/* <R7  */
+				unsigned long arg5,	/* <R8  */
+				unsigned long arg6,	/* <R9  */
+				unsigned long arg7,	/* <R10 */
+				unsigned long arg8,	/* <R11 */
+				unsigned long arg9	/* <R12 */
+    )
+{
+	int i;
+	unsigned long regs[11] = {opcode,
+				  arg1, arg2, arg3, arg4, arg5, arg6, arg7,
+				  arg8, arg9};
+
+	__asm__ __volatile__("mr 3,%10\n"
+			     "mr 4,%11\n"
+			     "mr 5,%12\n"
+			     "mr 6,%13\n"
+			     "mr 7,%14\n"
+			     "mr 8,%15\n"
+			     "mr 9,%16\n"
+			     "mr 10,%17\n"
+			     "mr 11,%18\n"
+			     "mr 12,%19\n"
+			     ".long 0x44000022\n"
+			     "mr %0,3\n"
+			     "mr %1,4\n"
+			     "mr %2,5\n"
+			     "mr %3,6\n"
+			     "mr %4,7\n"
+			     "mr %5,8\n"
+			     "mr %6,9\n"
+			     "mr %7,10\n"
+			     "mr %8,11\n"
+			     "mr %9,12\n":"=r"(regs[0]),
+			     "=r"(regs[1]), "=r"(regs[2]),
+			     "=r"(regs[3]), "=r"(regs[4]),
+			     "=r"(regs[5]), "=r"(regs[6]),
+			     "=r"(regs[7]), "=r"(regs[8]),
+			     "=r"(regs[9])
+			     :"r"(regs[0]), "r"(regs[1]),
+			     "r"(regs[2]), "r"(regs[3]),
+			     "r"(regs[4]), "r"(regs[5]),
+			     "r"(regs[6]), "r"(regs[7]),
+			     "r"(regs[8]), "r"(regs[9])
+			     :"r0", "r2", "r3", "r4", "r5", "r6", "r7",
+			     "r8", "r9", "r10", "r11", "r12", "cc",
+			     "xer", "ctr", "lr", "cr0", "cr1", "cr5",
+			     "cr6", "cr7");
+	for (i = 0; i < 9; i++)
+		retbuf[i] = regs[i + 1];
+
+	if (!H_isLongBusy(regs[0]) && (long)regs[0] < 0) {
+		printk(KERN_ERR "HCALL99_IN r3=%lx r4=%lx r5=%lx r6=%lx "
+		       "r7=%lx r8=%lx r9=%lx r10=%lx "
+		       "r11=%lx r12=%lx",
+		       opcode, arg1, arg2, arg3,
+		       arg4, arg5, arg6, arg7,
+		       arg8, arg9);
+		printk(KERN_ERR "HCALL99_OUT r3=%lx r4=%lx r5=%lx "
+		       "r6=%lx r7=%lx r8=%lx r9=%lx r10=%lx "
+		       "r11=%lx r12=%lx",
+		       regs[0], regs[1],
+		       regs[2], regs[3],
+		       regs[4], regs[5],
+		       regs[6], regs[7],
+		       regs[8], regs[9]);
+	}
+	return regs[0];
+}
+
+inline static long plpar_hcall_7arg_7ret(unsigned long opcode,
+					 unsigned long arg1,    /* <R4  */
+					 unsigned long arg2,	/* <R5  */
+					 unsigned long arg3,	/* <R6  */
+					 unsigned long arg4,	/* <R7  */
+					 unsigned long arg5,	/* <R8  */
+					 unsigned long arg6,	/* <R9  */
+					 unsigned long arg7,	/* <R10 */
+					 unsigned long *out1,	/* <R4  */
+					 unsigned long *out2,	/* <R5  */
+					 unsigned long *out3,	/* <R6  */
+					 unsigned long *out4,	/* <R7  */
+					 unsigned long *out5,	/* <R8  */
+					 unsigned long *out6,	/* <R9  */
+					 unsigned long *out7	/* <R10 */
+    )
+{
+	unsigned long regs[11] = {opcode,
+				  arg1, arg2, arg3, arg4, arg5, arg6, arg7};
+
+	__asm__ __volatile__("mr 3,%10\n"
+			     "mr 4,%11\n"
+			     "mr 5,%12\n"
+			     "mr 6,%13\n"
+			     "mr 7,%14\n"
+			     "mr 8,%15\n"
+			     "mr 9,%16\n"
+			     "mr 10,%17\n"
+			     "mr 11,%18\n"
+			     "mr 12,%19\n"
+			     ".long 0x44000022\n"
+			     "mr %0,3\n"
+			     "mr %1,4\n"
+			     "mr %2,5\n"
+			     "mr %3,6\n"
+			     "mr %4,7\n"
+			     "mr %5,8\n"
+			     "mr %6,9\n"
+			     "mr %7,10\n"
+			     "mr %8,11\n"
+			     "mr %9,12\n":"=r"(regs[0]),
+			     "=r"(regs[1]), "=r"(regs[2]),
+			     "=r"(regs[3]), "=r"(regs[4]),
+			     "=r"(regs[5]), "=r"(regs[6]),
+			     "=r"(regs[7]), "=r"(regs[8]),
+			     "=r"(regs[9])
+			     :"r"(regs[0]), "r"(regs[1]),
+			     "r"(regs[2]), "r"(regs[3]),
+			     "r"(regs[4]), "r"(regs[5]),
+			     "r"(regs[6]), "r"(regs[7]),
+			     "r"(regs[8]), "r"(regs[9])
+			     :"r0", "r2", "r3", "r4", "r5", "r6", "r7",
+			     "r8", "r9", "r10", "r11", "r12", "cc",
+			     "xer", "ctr", "lr", "cr0", "cr1", "cr5",
+			     "cr6", "cr7");
+	*out1 = regs[1];
+	*out2 = regs[2];
+	*out3 = regs[3];
+	*out4 = regs[4];
+	*out5 = regs[5];
+	*out6 = regs[6];
+	*out7 = regs[7];
+
+	if (!H_isLongBusy(regs[0]) && (long)regs[0] < 0) {
+		printk(KERN_ERR "HCALL77_IN r3=%lx r4=%lx r5=%lx r6=%lx "
+		       "r7=%lx r8=%lx r9=%lx r10=%lx",
+		       opcode, arg1, arg2, arg3,
+		       arg4, arg5, arg6, arg7);
+		printk(KERN_ERR "HCALL77_OUT r3=%lx r4=%lx r5=%lx "
+		       "r6=%lx r7=%lx r8=%lx r9=%lx r10=%lx ",
+		       regs[0], regs[1],
+		       regs[2], regs[3],
+		       regs[4], regs[5],
+		       regs[6], regs[7]);
+	}
+	return regs[0];
+}
+
+inline static long plpar_hcall_9arg_9ret(unsigned long opcode,
+					 unsigned long arg1,	/* <R4  */
+					 unsigned long arg2,	/* <R5  */
+					 unsigned long arg3,	/* <R6  */
+					 unsigned long arg4,	/* <R7  */
+					 unsigned long arg5,	/* <R8  */
+					 unsigned long arg6,	/* <R9  */
+					 unsigned long arg7,	/* <R10 */
+					 unsigned long arg8,	/* <R11 */
+					 unsigned long arg9,	/* <R12 */
+					 unsigned long *out1,	/* <R4  */
+					 unsigned long *out2,	/* <R5  */
+					 unsigned long *out3,	/* <R6  */
+					 unsigned long *out4,	/* <R7  */
+					 unsigned long *out5,	/* <R8  */
+					 unsigned long *out6,	/* <R9  */
+					 unsigned long *out7,	/* <R10 */
+					 unsigned long *out8,	/* <R11 */
+					 unsigned long *out9	/* <R12 */
+    )
+{
+	unsigned long regs[11] = {opcode,
+				  arg1, arg2, arg3, arg4, arg5, arg6, arg7,
+				  arg8, arg9};
+
+	__asm__ __volatile__("mr 3,%10\n"
+			     "mr 4,%11\n"
+			     "mr 5,%12\n"
+			     "mr 6,%13\n"
+			     "mr 7,%14\n"
+			     "mr 8,%15\n"
+			     "mr 9,%16\n"
+			     "mr 10,%17\n"
+			     "mr 11,%18\n"
+			     "mr 12,%19\n"
+			     ".long 0x44000022\n"
+			     "mr %0,3\n"
+			     "mr %1,4\n"
+			     "mr %2,5\n"
+			     "mr %3,6\n"
+			     "mr %4,7\n"
+			     "mr %5,8\n"
+			     "mr %6,9\n"
+			     "mr %7,10\n"
+			     "mr %8,11\n"
+			     "mr %9,12\n":"=r"(regs[0]),
+			     "=r"(regs[1]), "=r"(regs[2]),
+			     "=r"(regs[3]), "=r"(regs[4]),
+			     "=r"(regs[5]), "=r"(regs[6]),
+			     "=r"(regs[7]), "=r"(regs[8]),
+			     "=r"(regs[9])
+			     :"r"(regs[0]), "r"(regs[1]),
+			     "r"(regs[2]), "r"(regs[3]),
+			     "r"(regs[4]), "r"(regs[5]),
+			     "r"(regs[6]), "r"(regs[7]),
+			     "r"(regs[8]), "r"(regs[9])
+			     :"r0", "r2", "r3", "r4", "r5", "r6", "r7",
+			     "r8", "r9", "r10", "r11", "r12", "cc",
+			     "xer", "ctr", "lr", "cr0", "cr1", "cr5",
+			     "cr6", "cr7");
+	*out1 = regs[1];
+	*out2 = regs[2];
+	*out3 = regs[3];
+	*out4 = regs[4];
+	*out5 = regs[5];
+	*out6 = regs[6];
+	*out7 = regs[7];
+	*out8 = regs[8];
+	*out9 = regs[9];
+
+	if (!H_isLongBusy(regs[0]) && (long)regs[0] < 0) {
+		printk(KERN_ERR "HCALL99_IN r3=%lx r4=%lx r5=%lx r6=%lx "
+		       "r7=%lx r8=%lx r9=%lx r10=%lx "
+		       "r11=%lx r12=%lx",
+		       opcode, arg1, arg2, arg3,
+		       arg4, arg5, arg6, arg7,
+		       arg8, arg9);
+		printk(KERN_ERR "HCALL99_OUT r3=%lx r4=%lx r5=%lx "
+		       "r6=%lx r7=%lx r8=%lx r9=%lx r10=%lx "
+		       "r11=%lx r12=%lx",
+		       regs[0], regs[1],
+		       regs[2], regs[3],
+		       regs[4], regs[5],
+		       regs[6], regs[7],
+		       regs[8], regs[9]);
+	}
+	return regs[0];
+}
+
+#endif /* __ASSEMBLY__ */
+#endif /* __KERNEL__ */
+#endif
diff -Nurp ofa_kernel-1.2_old/kernel_addons/backport/2.6.9_U5/include/asm-powerpc/system.h ofa_kernel-1.2_new/kernel_addons/backport/2.6.9_U5/include/asm-powerpc/system.h
--- ofa_kernel-1.2_old/kernel_addons/backport/2.6.9_U5/include/asm-powerpc/system.h	1970-01-01 01:00:00.000000000 +0100
+++ ofa_kernel-1.2_new/kernel_addons/backport/2.6.9_U5/include/asm-powerpc/system.h	2007-05-10 18:14:12.000000000 +0200
@@ -0,0 +1 @@
+#include <asm-ppc64/system.h>


From yosefe at voltaire.com  Thu May 10 07:34:50 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 17:34:50 +0300
Subject: [ofa-general] [PATCHv4 for 2.6.22 0/2] fix bug #420: ippib handling
 of pkey change events
Message-ID: <46432D8A.8030007@voltaire.com>

These two patches fix bug #420: PKey table reordering caused by SM failover stops ipoib traffic
patch 1: add uncached device queries to core
patch 2: restart ipoib qp on pkey change event, and use uncached queries on qp init

--


From yosefe at voltaire.com  Thu May 10 07:36:43 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 17:36:43 +0300
Subject: [ofa-general] [PATCHv4 for 2.6.22 1/2] core: uncached "find gid" and
 "find pkey" queries
In-Reply-To: <46432D8A.8030007@voltaire.com>
References: <46432D8A.8030007@voltaire.com>
Message-ID: <46432DFB.9070007@voltaire.com>


* Add ib_find_gid and ib_find_pkey, implemented over uncached device queries.
  The calls might block, but the results are always up to date.
* Cache the pkey and gid table lengths in core to avoid port info queries.
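
For readers who want the idea without the kernel context, here is a minimal
user-space sketch of the uncached lookup. It is only an illustration:
struct mock_port, mock_query_pkey and mock_find_pkey are hypothetical names
that stand in for the device, ib_query_pkey and ib_find_pkey, and the table
is a plain array rather than a real port P_Key table.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one port; not a kernel structure. */
struct mock_port {
	const uint16_t *pkey_tbl;	/* live table contents */
	int pkey_tbl_len;		/* length, read once at registration */
};

/* Stand-in for ib_query_pkey(): reads one entry straight from the "device". */
static int mock_query_pkey(const struct mock_port *port, int i, uint16_t *pkey)
{
	if (i < 0 || i >= port->pkey_tbl_len)
		return -EINVAL;
	*pkey = port->pkey_tbl[i];
	return 0;
}

/* Linear scan over the live table; returns -ENOENT if the pkey is absent. */
static int mock_find_pkey(const struct mock_port *port, uint16_t pkey,
			  uint16_t *index)
{
	uint16_t tmp;
	int i, ret;

	assert(index != NULL);
	for (i = 0; i < port->pkey_tbl_len; ++i) {
		ret = mock_query_pkey(port, i, &tmp);
		if (ret)
			return ret;
		if (tmp == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -ENOENT;
}
```

Because every call rescans the table, a lookup made after the table has been
reordered returns the new index; only the table length is cached.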


Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/core/device.c |  138 +++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h          |   25 +++++++
 2 files changed, 163 insertions(+)

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-08 15:46:36.000000000 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-09 11:47:22.096064221 +0300
@@ -149,6 +149,18 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
+}
+
 /**
  * ib_alloc_device - allocate an IB device struct
  * @size:size of structure to allocate
@@ -208,6 +220,55 @@ static int add_client_context(struct ib_
 	return 0;
 }
 
+/* read the lengths of pkey,gid tables on each port */
+static int read_port_table_lengths(struct ib_device *device)
+{
+	struct ib_port_attr *tprops = NULL;
+	int num_ports, ret = -ENOMEM;
+	u8 port_index;
+
+	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
+	if (!tprops)
+		goto out;
+
+	num_ports = end_port(device) - start_port(device) + 1;
+
+	device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len *
+						num_ports, GFP_KERNEL);
+	if (!device->pkey_tbl_len)
+		goto out;
+
+	device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len *
+						num_ports, GFP_KERNEL);
+	if (!device->gid_tbl_len)
+		goto err1;
+
+	for (port_index = 0; port_index < num_ports; ++port_index) {
+		ret = ib_query_port(device, port_index + start_port(device),
+					tprops);
+		if (ret)
+			goto err2;
+		device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len;
+		device->gid_tbl_len[port_index] = tprops->gid_tbl_len;
+	}
+
+	ret = 0;
+	goto out;
+err2:
+	kfree(device->gid_tbl_len);
+err1:
+	kfree(device->pkey_tbl_len);
+out:
+	kfree(tprops);
+	return ret;
+}
+
+static inline void free_port_table_lengths(struct ib_device *device)
+{
+	kfree(device->gid_tbl_len);
+	kfree(device->pkey_tbl_len);
+}
+
 /**
  * ib_register_device - Register an IB device with IB core
  * @device:Device to register
@@ -239,6 +300,13 @@ int ib_register_device(struct ib_device 
 	spin_lock_init(&device->event_handler_lock);
 	spin_lock_init(&device->client_data_lock);
 
+	ret = read_port_table_lengths(device);
+	if (ret) {
+		printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n",
+		       device->name);
+		goto out;
+	}
+
 	ret = ib_device_register_sysfs(device);
 	if (ret) {
 		printk(KERN_WARNING "Couldn't register device %s with driver model\n",
@@ -284,6 +352,8 @@ void ib_unregister_device(struct ib_devi
 
 	list_del(&device->core_list);
 
+	free_port_table_lengths(device);
+
 	mutex_unlock(&device_mutex);
 
 	spin_lock_irqsave(&device->client_data_lock, flags);
@@ -592,6 +662,74 @@ int ib_modify_port(struct ib_device *dev
 }
 EXPORT_SYMBOL(ib_modify_port);
 
+/**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+			u8 *port_num, u16 *index)
+{
+	union ib_gid tmp_gid;
+	int ret, port, i, tbl_len;
+
+	for (port = start_port(device); port <= end_port(device); ++port) {
+		tbl_len = device->gid_tbl_len[port - start_port(device)];
+		for (i = 0; i < tbl_len; ++i) {
+			ret = ib_query_gid(device, port, i, &tmp_gid);
+			if (ret)
+				goto out;
+			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
+				*port_num = port;
+				*index = i;
+				ret = 0;
+				goto out;
+			}
+		}
+	}
+	ret = -ENOENT;
+out:
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_gid);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index)
+{
+	int ret, i, tbl_len;
+	u16 tmp_pkey;
+
+	tbl_len = device->pkey_tbl_len[port_num - start_port(device)];
+	for (i = 0; i < tbl_len; ++i) {
+		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
+		if (ret)
+			goto out;
+
+		if (pkey == tmp_pkey) {
+			*index = i;
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = -ENOENT;
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_pkey);
+
 static int __init ib_core_init(void)
 {
 	int ret;
Index: b/include/rdma/ib_verbs.h
===================================================================
--- a/include/rdma/ib_verbs.h	2007-05-08 15:45:45.000000000 +0300
+++ b/include/rdma/ib_verbs.h	2007-05-09 11:47:55.006221894 +0300
@@ -1058,6 +1058,8 @@ struct ib_device {
 	__be64			     node_guid;
 	u8                           node_type;
 	u8                           phys_port_cnt;
+	int                          *pkey_tbl_len;
+	int                          *gid_tbl_len;
 };
 
 struct ib_client {
@@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev
 		   struct ib_port_modify *port_modify);
 
 /**
+ * ib_find_gid - Returns the port number and GID table index where
+ *   a specified GID value occurs.
+ * @device: The device to query.
+ * @gid: The GID value to search for.
+ * @port_num: The port number of the device where the GID value was found.
+ * @index: The index into the GID table where the GID was found.  This
+ *   parameter may be NULL.
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+			u8 *port_num, u16 *index);
+
+/**
+ * ib_find_pkey - Returns the PKey table index where a specified
+ *   PKey value occurs.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the PKey table where the PKey was found.
+ */
+int ib_find_pkey(struct ib_device *device,
+			u8 port_num, u16 pkey, u16 *index);
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *


From yosefe at voltaire.com  Thu May 10 07:38:02 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 17:38:02 +0300
Subject: [ofa-general] [PATCHv4 for 2.6.22 2/2] ipoib: handle pkey change
	events
In-Reply-To: <46432D8A.8030007@voltaire.com>
References: <46432D8A.8030007@voltaire.com>
Message-ID: <46432E4A.9080301@voltaire.com>


This issue was found during partitioning and SM failover testing.

 * Add a flush flag to ipoib_ib_dev_stop(), as ipoib_ib_dev_down() already has.
 * Rename the polling work to 'pkey_poll_task' to avoid ambiguity.
 * Obtain the pkey index before entering init_qp, and save it in dev_priv.
 * Upon a PKEY_CHANGE event, schedule a work item that restarts the qp.
 * Restart only if the pkey index has really changed.
   Use the cached pkey_index to test this.
 * Restart child interfaces before the parent. They might be up even if the
   parent is down.
 * When an interface is restarted, queue delayed initialization, to handle
   the case where a pkey is deleted and later restored.
 * Use an uncached pkey query upon qp initialization.

SM reconfiguration or failover possibly causes a shuffling of the values
in the port pkey table. The current implementation only queries for the
index of the pkey once, when it creates the device QP and after that moves
it into working state, and hence does not address this scenario. Fix this
by using the PKEY_CHANGE event as a trigger to reconfigure the device QP.
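
The decision taken on a PKEY_CHANGE event can be sketched in user space as
below. This is a simplified illustration under stated assumptions, not the
driver code: struct mock_priv, enum pkey_action and on_pkey_change are
hypothetical names, and the table scan stands in for ib_find_pkey.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcomes of handling a pkey change event. */
enum pkey_action {
	PKEY_ACTION_NONE,	/* index unchanged: leave the QP alone */
	PKEY_ACTION_RESTART_QP,	/* index moved: stop and reopen the QP */
	PKEY_ACTION_DELAY_OPEN	/* pkey gone: bring down, poll until it returns */
};

/* Hypothetical subset of the per-device state used by the decision. */
struct mock_priv {
	uint16_t pkey;		/* partition key the interface uses */
	uint16_t pkey_index;	/* index saved when the QP was last set up */
	int pkey_assigned;	/* mirrors the IPOIB_PKEY_ASSIGNED flag */
};

static enum pkey_action on_pkey_change(struct mock_priv *priv,
					const uint16_t *tbl, int tbl_len)
{
	int i;

	assert(priv != NULL && tbl != NULL);

	/* Fresh scan of the current table for our pkey. */
	for (i = 0; i < tbl_len; ++i)
		if (tbl[i] == priv->pkey)
			break;

	if (i == tbl_len) {
		priv->pkey_assigned = 0;
		return PKEY_ACTION_DELAY_OPEN;
	}
	priv->pkey_assigned = 1;

	/* Restart only if the index really moved. */
	if ((uint16_t)i == priv->pkey_index)
		return PKEY_ACTION_NONE;

	priv->pkey_index = (uint16_t)i;
	return PKEY_ACTION_RESTART_QP;
}
```

Comparing against the saved index keeps unrelated table updates from
restarting a QP that is still valid.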

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |    7 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   88 +++++++++++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |    7 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |   41 +++++--------
 4 files changed, 97 insertions(+), 46 deletions(-)

Index: b/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-10 08:34:58.335171047 +0300
@@ -202,15 +202,17 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
-	struct delayed_work pkey_task;
+	struct delayed_work pkey_poll_task;
 	struct delayed_work mcast_task;
 	struct work_struct flush_task;
 	struct work_struct restart_task;
 	struct delayed_work ah_reap_task;
+	struct work_struct pkey_event_task;
 
 	struct ib_device *ca;
 	u8            	  port;
 	u16           	  pkey;
+	u16               pkey_index;
 	struct ib_pd  	 *pd;
 	struct ib_mr  	 *mr;
 	struct ib_cq  	 *cq;
@@ -333,12 +335,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc(
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(struct work_struct *work);
+void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
 int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
 int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-10 16:09:38.296297842 +0300
@@ -413,6 +413,18 @@ int ipoib_ib_dev_open(struct net_device 
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
 
+	/*
+	 * Search through the port P_Key table for the requested pkey value.
+	 * The port has to be assigned to the respective IB partition in
+	 * advance.
+	 */
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) {
+		ipoib_warn(priv, "pkey 0x%04x not found\n", priv->pkey);
+		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+		return -1;
+	}
+	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+
 	ret = ipoib_init_qp(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret);
@@ -422,14 +434,14 @@ int ipoib_ib_dev_open(struct net_device 
 	ret = ipoib_ib_post_receives(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
 
@@ -481,7 +493,7 @@ int ipoib_ib_dev_down(struct net_device 
 	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
 		mutex_lock(&pkey_mutex);
 		set_bit(IPOIB_PKEY_STOP, &priv->flags);
-		cancel_delayed_work(&priv->pkey_task);
+		cancel_delayed_work(&priv->pkey_poll_task);
 		mutex_unlock(&pkey_mutex);
 		if (flush)
 			flush_workqueue(ipoib_workqueue);
@@ -508,7 +520,7 @@ static int recvs_pending(struct net_devi
 	return pending;
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev)
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -581,7 +593,8 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
 
 	begin = jiffies;
 
@@ -622,13 +635,22 @@ int ipoib_ib_dev_init(struct net_device 
 	return 0;
 }
 
-void ipoib_ib_dev_flush(struct work_struct *work)
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event)
 {
-	struct ipoib_dev_priv *cpriv, *priv =
-		container_of(work, struct ipoib_dev_priv, flush_task);
+	struct ipoib_dev_priv *cpriv;
 	struct net_device *dev = priv->dev;
+	u16 new_index;
 
-	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) {
+	mutex_lock(&priv->vlan_mutex);
+
+	/* Flush any child interfaces too -
+ 	 * they might be up even if the parent is down */
+ 	list_for_each_entry(cpriv, &priv->child_intfs, list)
+		__ipoib_ib_dev_flush(cpriv, pkey_event);
+
+	mutex_unlock(&priv->vlan_mutex);
+
+	if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) {
 		ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n");
 		return;
 	}
@@ -638,10 +660,32 @@ void ipoib_ib_dev_flush(struct work_stru
 		return;
 	}
 
+	if (pkey_event) {
+		if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) {
+			clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+			ipoib_ib_dev_down(dev, 0);
+			ipoib_pkey_dev_delay_open(dev);
+			return;
+		}
+		set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+
+		/* restart qp only if pkey index has changed */
+		if (new_index == priv->pkey_index) {
+			ipoib_dbg(priv, "Not flushing - pkey index not changed.\n");
+			return;
+		}
+		priv->pkey_index = new_index;
+	}
+
 	ipoib_dbg(priv, "flushing\n");
 
 	ipoib_ib_dev_down(dev, 0);
 
+	if (pkey_event) {
+		ipoib_ib_dev_stop(dev, 0);
+		ipoib_ib_dev_open(dev);
+	}
+
 	/*
 	 * The device could have been brought down between the start and when
 	 * we get here, don't bring it back up if it's not configured up
@@ -650,14 +694,24 @@ void ipoib_ib_dev_flush(struct work_stru
 		ipoib_ib_dev_up(dev);
 		ipoib_mcast_restart_task(&priv->restart_task);
 	}
+}
 
-	mutex_lock(&priv->vlan_mutex);
+void ipoib_ib_dev_flush(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, flush_task);
+
+	ipoib_dbg(priv, "Flushing %s\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 0);
+}
 
-	/* Flush any child interfaces too */
-	list_for_each_entry(cpriv, &priv->child_intfs, list)
-		ipoib_ib_dev_flush(&cpriv->flush_task);
+void ipoib_pkey_event(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv =
+		container_of(work, struct ipoib_dev_priv, pkey_event_task);
 
-	mutex_unlock(&priv->vlan_mutex);
+	ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name);
+	__ipoib_ib_dev_flush(priv, 1);
 }
 
 void ipoib_ib_dev_cleanup(struct net_device *dev)
@@ -685,7 +739,7 @@ void ipoib_ib_dev_cleanup(struct net_dev
 void ipoib_pkey_poll(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
-		container_of(work, struct ipoib_dev_priv, pkey_task.work);
+		container_of(work, struct ipoib_dev_priv, pkey_poll_task.work);
 	struct net_device *dev = priv->dev;
 
 	ipoib_pkey_dev_check_presence(dev);
@@ -696,7 +750,7 @@ void ipoib_pkey_poll(struct work_struct 
 		mutex_lock(&pkey_mutex);
 		if (!test_bit(IPOIB_PKEY_STOP, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
-					   &priv->pkey_task,
+					   &priv->pkey_poll_task,
 					   HZ);
 		mutex_unlock(&pkey_mutex);
 	}
@@ -715,7 +769,7 @@ int ipoib_pkey_dev_delay_open(struct net
 		mutex_lock(&pkey_mutex);
 		clear_bit(IPOIB_PKEY_STOP, &priv->flags);
 		queue_delayed_work(ipoib_workqueue,
-				   &priv->pkey_task,
+				   &priv->pkey_poll_task,
 				   HZ);
 		mutex_unlock(&pkey_mutex);
 		return 1;
Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-09 17:21:03.000000000 +0300
@@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev)
 		return -EINVAL;
 
 	if (ipoib_ib_dev_up(dev)) {
-		ipoib_ib_dev_stop(dev);
+		ipoib_ib_dev_stop(dev, 1);
 		return -EINVAL;
 	}
 
@@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device 
 	flush_workqueue(ipoib_workqueue);
 
 	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev);
+	ipoib_ib_dev_stop(dev, 1);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
@@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic
 	INIT_LIST_HEAD(&priv->dead_ahs);
 	INIT_LIST_HEAD(&priv->multicast_list);
 
-	INIT_DELAYED_WORK(&priv->pkey_task,    ipoib_pkey_poll);
+	INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll);
+	INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event);
 	INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
 	INIT_WORK(&priv->flush_task,   ipoib_ib_dev_flush);
 	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task);
Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-08 15:46:53.000000000 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-10 16:08:15.551192622 +0300
@@ -33,8 +33,6 @@
  * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $
  */
 
-#include <rdma/ib_cache.h>
-
 #include "ipoib.h"
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid)
@@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device
 	if (!qp_attr)
 		goto out;
 
-	if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) {
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		ret = -ENXIO;
 		goto out;
@@ -94,26 +92,17 @@ int ipoib_init_qp(struct net_device *dev
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
-	u16 pkey_index;
 	struct ib_qp_attr qp_attr;
 	int attr_mask;
 
-	/*
-	 * Search through the port P_Key table for the requested pkey value.
-	 * The port has to be assigned to the respective IB partition in
-	 * advance.
-	 */
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index);
-	if (ret) {
-		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
-		return ret;
-	}
-	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	/* Make sure we have a valid pkey_index in priv->pkey_index */
+	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
+		return -1;
 
 	qp_attr.qp_state = IB_QPS_INIT;
 	qp_attr.qkey = 0;
 	qp_attr.port_num = priv->port;
-	qp_attr.pkey_index = pkey_index;
+	qp_attr.pkey_index = priv->pkey_index;
 	attr_mask =
 	    IB_QP_QKEY |
 	    IB_QP_PORT |
@@ -258,15 +247,19 @@ void ipoib_event(struct ib_event_handler
 {
 	struct ipoib_dev_priv *priv =
 		container_of(handler, struct ipoib_dev_priv, event_handler);
-
-	if ((record->event == IB_EVENT_PORT_ERR    ||
-	     record->event == IB_EVENT_PKEY_CHANGE ||
-	     record->event == IB_EVENT_PORT_ACTIVE ||
-	     record->event == IB_EVENT_LID_CHANGE  ||
-	     record->event == IB_EVENT_SM_CHANGE   ||
-	     record->event == IB_EVENT_CLIENT_REREGISTER) &&
-	    record->element.port_num == priv->port) {
+
+	if (record->element.port_num != priv->port)
+		return;
+
+	if (record->event == IB_EVENT_PORT_ERR    ||
+	    record->event == IB_EVENT_PORT_ACTIVE ||
+	    record->event == IB_EVENT_LID_CHANGE  ||
+	    record->event == IB_EVENT_SM_CHANGE   ||
+	    record->event == IB_EVENT_CLIENT_REREGISTER) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
+	} else if (record->event == IB_EVENT_PKEY_CHANGE) {
+		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
+		queue_work(ipoib_workqueue, &priv->pkey_event_task);
 	}
 }
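
The reworked handler above first filters on the port number and then
routes P_Key changes to their own work item instead of the generic flush.
A minimal user-space sketch of that dispatch logic follows. The enum
values, struct fake_priv and dispatch_event() are hypothetical stand-ins
chosen for illustration; they are not the kernel's definitions.

```c
/* Hypothetical stand-ins for the kernel's ib_event_type values. */
enum fake_event {
	EV_PORT_ERR,
	EV_PORT_ACTIVE,
	EV_LID_CHANGE,
	EV_SM_CHANGE,
	EV_CLIENT_REREGISTER,
	EV_PKEY_CHANGE,
	EV_OTHER
};

/* Which work item the handler would queue (illustrative names). */
enum fake_action { ACT_NONE, ACT_FLUSH, ACT_PKEY_EVENT };

struct fake_priv {
	int port;	/* port this interface is bound to */
};

enum fake_action dispatch_event(const struct fake_priv *priv,
				enum fake_event event, int port_num)
{
	/* Events for other ports are ignored up front. */
	if (port_num != priv->port)
		return ACT_NONE;

	switch (event) {
	case EV_PORT_ERR:
	case EV_PORT_ACTIVE:
	case EV_LID_CHANGE:
	case EV_SM_CHANGE:
	case EV_CLIENT_REREGISTER:
		return ACT_FLUSH;	/* generic port state change */
	case EV_PKEY_CHANGE:
		return ACT_PKEY_EVENT;	/* handled by a separate task */
	default:
		return ACT_NONE;
	}
}
```

Checking the port first keeps the event-type test flat, which is what
makes the separate P_Key branch easy to add.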


From mst at dev.mellanox.co.il  Thu May 10 07:42:12 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 17:42:12 +0300
Subject: [ofa-general] Re: [PATCHv4 for 2.6.22 0/2] fix bug #420: ippib
	handling of pkey change events
In-Reply-To: <46432D8A.8030007@voltaire.com>
References: <46432D8A.8030007@voltaire.com>
Message-ID: <20070510144212.GH22029@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: [PATCHv4 for 2.6.22 0/2] fix bug #420: ippib handling of pkey change events
> 
> These two patches fix bug #420: PKey table reordering caused by SM failover stops ipoib traffic
> patch 1: add uncached device queries to core
> patch 2: restart ipoib qp on pkey change event, and use uncached queries on qp init

OK, that's pretty clean.
Acked-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

-- 
MST


From jsquyres at cisco.com  Thu May 10 07:42:01 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 10 May 2007 10:42:01 -0400
Subject: [ofa-general] Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed	udapl
	-	bugs	opened
In-Reply-To: <20070510142826.GE22029@mellanox.co.il>
References: <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop>
	<46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop>
	<46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop>
	<464232DC.9010201@Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
	<46430EB2.7080703@voltaire.com>
	<C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>
	<20070510142826.GE22029@mellanox.co.il>
Message-ID: <D8150EE2-F521-43C2-8B31-04655933E96A@cisco.com>

On May 10, 2007, at 10:28 AM, Michael S. Tsirkin wrote:

>> What is the advantage of this approach?
>
> Current ipoib cm uses this approach to simplify the implementation.
> Overhead seems insignificant.

I think MPI's requirements are a bit different than IPoIB.  See  
Gleb's response. It is not uncommon for MPI apps to have connections  
open to many peers simultaneously.  Registered memory / internal  
buffering usage is a Big Deal in the MPI / HPC community.

-- 
Jeff Squyres
Cisco Systems


From mst at dev.mellanox.co.il  Thu May 10 07:56:13 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 17:56:13 +0300
Subject: [ofa-general] Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed	udapl
	-	bugs	opened
In-Reply-To: <D8150EE2-F521-43C2-8B31-04655933E96A@cisco.com>
References: <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop>
	<46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop>
	<464232DC.9010201@Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
	<46430EB2.7080703@voltaire.com>
	<C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>
	<20070510142826.GE22029@mellanox.co.il>
	<D8150EE2-F521-43C2-8B31-04655933E96A@cisco.com>
Message-ID: <20070510145613.GR13655@mellanox.co.il>

> Quoting Jeff Squyres <jsquyres at cisco.com>:
> Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened
> 
> On May 10, 2007, at 10:28 AM, Michael S. Tsirkin wrote:
> 
> >>What is the advantage of this approach?
> >
> >Current ipoib cm uses this approach to simplify the implementation.
> >Overhead seems insignificant.
> 
> I think MPI's requirements are a bit different than IPoIB.  See  
> Gleb's response. It is not uncommon for MPI apps to have connections  
> open to many peers simultaneously.

You mean, hundreds of QPs between the same pair of hosts?
Yes, in this case you might start running out of QPs.

> Registered memory / internal  
> buffering usage is a Big Deal in the MPI / HPC community.

I don't see the connection with the # of QPs.
They are very cheap in memory.

-- 
MST


From rick.jones2 at hp.com  Thu May 10 10:01:37 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Thu, 10 May 2007 10:01:37 -0700
Subject: [ofa-general] is there an OFED way to putt VPD from an HCA?
Message-ID: <46434FF1.9020005@hp.com>

Hi -

I would like to pull vital product data (eg serial number) from an IB HCA which 
is "driven" via OFED bits.  Is there any OFED tool to do that or do I have to go 
hunt-down something HCA-vendor specific (mellanox in this case)?

thanks,

rick jones


From glebn at voltaire.com  Thu May 10 10:02:40 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Thu, 10 May 2007 20:02:40 +0300
Subject: [ofa-general] Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed	udapl
	-	bugs	opened
In-Reply-To: <20070510145613.GR13655@mellanox.co.il>
References: <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM>
	<1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM>
	<3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com>
	<46430EB2.7080703@voltaire.com>
	<C28A2DEB-47C4-438A-9C2A-A2C769DF38B2@cisco.com>
	<20070510142826.GE22029@mellanox.co.il>
	<D8150EE2-F521-43C2-8B31-04655933E96A@cisco.com>
	<20070510145613.GR13655@mellanox.co.il>
Message-ID: <20070510170240.GA32053@minantech.com>

On Thu, May 10, 2007 at 05:56:13PM +0300, Michael S. Tsirkin wrote:
> > Quoting Jeff Squyres <jsquyres at cisco.com>:
> > Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened
> > 
> > On May 10, 2007, at 10:28 AM, Michael S. Tsirkin wrote:
> > 
> > >>What is the advantage of this approach?
> > >
> > >Current ipoib cm uses this approach to simplify the implementation.
> > >Overhead seems insignificant.
> > 
> > I think MPI's requirements are a bit different than IPoIB.  See  
> > Gleb's response. It is not uncommon for MPI apps to have connections  
> > open to many peers simultaneously.
> 
> You mean, hundreds of QPs between the same pair of hosts?
> Yes, in this case you might start running out of QPs.

Why does it matter whether the QPs are between the same pair of hosts or not?
QPs are a global resource, aren't they?

> 
> > Registered memory / internal  
> > buffering usage is a Big Deal in the MPI / HPC community.
> 
> I don't see the connection with the # of QPs.
> They are very cheap in memory.
> 
4K is cheap?

--
			Gleb.
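
To put the 4K figure in perspective, the cost per process grows linearly
with the number of peers. A back-of-the-envelope sketch, taking the
4K-per-QP number from the message above at face value; qp_memory_bytes()
and the peer counts used with it are illustrative assumptions, and the
real footprint depends on the HCA, the driver and the queue depths.

```c
/* Per-QP footprint as quoted in the thread (4K); an assumption, not a
 * measured value. */
#define QP_BYTES_ASSUMED 4096ULL

/* Total QP memory for one process holding qps_per_peer QPs to each peer. */
unsigned long long qp_memory_bytes(unsigned long long peers,
				   unsigned long long qps_per_peer)
{
	return peers * qps_per_peer * QP_BYTES_ASSUMED;
}
```

With 1024 peers and one QP each this comes to 4 MiB per process, and to
32 MiB for a node running 8 such processes, before any registered
buffers are counted.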


From sweitzen at cisco.com  Thu May 10 10:59:27 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Thu, 10 May 2007 10:59:27 -0700
Subject: [ofa-general] is there an OFED way to putt VPD from an HCA?
In-Reply-To: <46434FF1.9020005@hp.com>
References: <46434FF1.9020005@hp.com>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303821207@xmb-sjc-216.amer.cisco.com>

For Mellanox HCAs, you can use mstflint (Mellanox util) or tvflash (Cisco util).  For example, from a Cisco-produced Mellanox HCA (YMMV):

[root at svbu-qa1850-1 ~]# mstflint -d mthca0 q
Image type:      Failsafe
FW Version:      1.2.0
I.S. Version:    1
Device ID:       25204
Chip Revision:   A0
GUID Des:        Node             Port1            Sys image
GUIDs:           0002c90200218140 0002c90200218141 0005ad000100d050
Board ID:        É,­
VSD:             É,­
PSID:
[root at svbu-qa1850-1 ~]# tvflash -i
HCA #0: MT25204, Cheetah DDR, revision 20
  Primary image is v1.2.000 build 3.2.0.140, with label 'HCA.Cheetah-DDR.20'
  Secondary image is v1.2.000 build 3.2.0.139, with label 'HCA.Cheetah-DDR.20'

  Vital Product Data
    Product Name: Cheetah DDR
    P/N: MHGS18-XTC
    E/C: A1
    S/N: MT0612X01178
    Freq/Power: PCIe x8
    Checksum: Ok
    Date Code: N/A

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Rick Jones
> Sent: Thursday, May 10, 2007 10:02 AM
> To: general at lists.openfabrics.org
> Subject: [ofa-general] is there an OFED way to putt VPD from an HCA?
> 
> Hi -
> 
> I would like to pull vital product data (eg serial number) 
> from an IB HCA which 
> is "driven" via OFED bits.  Is there any OFED tool to do that 
> or do I have to go 
> hunt-down something HCA-vendor specific (mellanox in this case)?
> 
> thanks,
> 
> rick jones
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit 
> http://openib.org/mailman/listinfo/openib-general
> 


From rdreier at cisco.com  Thu May 10 11:14:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 10 May 2007 11:14:07 -0700
Subject: [ofa-general] Re: [PATCHv4 for 2.6.22 0/2] fix bug #420: ippib
	handling of pkey change events
In-Reply-To: <20070510144212.GH22029@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 10 May 2007 17:42:12 +0300")
References: <46432D8A.8030007@voltaire.com>
	<20070510144212.GH22029@mellanox.co.il>
Message-ID: <adalkfw1r80.fsf@cisco.com>

thanks ... I haven't had a chance to follow all the discussion while
I'm traveling this week, but I'll deal with this next week.


From pradeeps at linux.vnet.ibm.com  Thu May 10 11:18:44 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Thu, 10 May 2007 11:18:44 -0700
Subject: [ofa-general] is there an OFED way to putt VPD from an HCA?
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303821207@xmb-sjc-216.amer.cisco.com>
References: <46434FF1.9020005@hp.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303821207@xmb-sjc-216.amer.cisco.com>
Message-ID: <46436204.40600@linux.vnet.ibm.com>

Scott Weitzenkamp (sweitzen) wrote:
> For Mellanox HCAs, you can use mstflint (Mellanox util) or tvflash (Cisco util).  For example, from a Cisco-produced Mellanox HCA (YMMV):
> 
> [root at svbu-qa1850-1 ~]# mstflint -d mthca0 q
> Image type:      Failsafe
> FW Version:      1.2.0
> I.S. Version:    1
> Device ID:       25204
> Chip Revision:   A0
> GUID Des:        Node             Port1            Sys image
> GUIDs:           0002c90200218140 0002c90200218141 0005ad000100d050
> Board ID:        É,­
> VSD:             É,­
> PSID:
> [root at svbu-qa1850-1 ~]# tvflash -i
> HCA #0: MT25204, Cheetah DDR, revision 20
>   Primary image is v1.2.000 build 3.2.0.140, with label 'HCA.Cheetah-DDR.20'
>   Secondary image is v1.2.000 build 3.2.0.139, with label 'HCA.Cheetah-DDR.20'
> 
>   Vital Product Data
>     Product Name: Cheetah DDR
>     P/N: MHGS18-XTC
>     E/C: A1
>     S/N: MT0612X01178
>     Freq/Power: PCIe x8
>     Checksum: Ok
>     Date Code: N/A
> 
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>  
> 
>> -----Original Message-----
>> From: general-bounces at lists.openfabrics.org 
>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Rick Jones
>> Sent: Thursday, May 10, 2007 10:02 AM
>> To: general at lists.openfabrics.org
>> Subject: [ofa-general] is there an OFED way to putt VPD from an HCA?
>>
>> Hi -
>>
>> I would like to pull vital product data (eg serial number) 
>> from an IB HCA which 
>> is "driven" via OFED bits.  Is there any OFED tool to do that 
>> or do I have to go 
>> hunt-down something HCA-vendor specific (mellanox in this case)?
>>
>> thanks,
>>
>> rick jones


There is also some similar information available in the 
/sys/class/infiniband directory (without using Mellanox or Cisco tools). 
For example, on a p5 (ppc64) system I get the following:

[root at elm3b37 mthca0]# pwd
/sys/class/infiniband/mthca0
[root at elm3b37 mthca0]# ls -l
total 0
-r--r--r-- 1 root root 4096 May 10 14:10 board_id
lrwxrwxrwx 1 root root    0 May  8 20:19 device -> ../../../devices/pci0002:00/0002:00:02.6/0002:d8:01.0/0002:d9:00.0
-r--r--r-- 1 root root 4096 May 10 14:10 fw_ver
-r--r--r-- 1 root root 4096 May 10 14:10 hca_type
-r--r--r-- 1 root root 4096 May 10 14:10 hw_rev
-rw-r--r-- 1 root root 4096 May 10 14:10 node_desc
-r--r--r-- 1 root root 4096 May 10 14:10 node_guid
-r--r--r-- 1 root root 4096 May 10 14:10 node_type
drwxr-xr-x 4 root root    0 May  8 20:19 ports
lrwxrwxrwx 1 root root    0 May 10 14:10 subsystem -> ../../../class/infiniband
-r--r--r-- 1 root root 4096 May 10 14:10 sys_image_guid
--w------- 1 root root 4096 May 10 14:10 uevent
[root at elm3b37 mthca0]# cat board_id
MT_0030000001
[root at elm3b37 mthca0]# cat fw_ver
3.5.0
[root at elm3b37 mthca0]# cat hca_type
MT23108
[root at elm3b37 mthca0]# cat hw_rev
a1
[root at elm3b37 mthca0]# cat node_desc
MT23108 InfiniHost Mellanox Technologies
[root at elm3b37 mthca0]# cat node_guid
0005:ad00:0003:0564
[root at elm3b37 mthca0]# cat node_type
1: CA
[root at elm3b37 mthca0]# cat sys_image_guid
0005:ad00:0003:0567
[root at elm3b37 mthca0]# cat uevent


Pradeep
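
The attributes listed above are plain text files, so they can also be
read from a program without any vendor tool. A minimal sketch, assuming
the directory layout shown in the listing; read_ib_attr() is a
hypothetical helper written for this note, not part of any OFED library.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of <base>/<dev>/<attr> into buf and strip the
 * trailing newline. Returns 0 on success, -1 on any failure.
 */
int read_ib_attr(const char *base, const char *dev, const char *attr,
		 char *buf, size_t len)
{
	char path[512];
	FILE *f;
	int n;

	if (len == 0)
		return -1;

	n = snprintf(path, sizeof(path), "%s/%s/%s", base, dev, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* path would have been truncated */

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

A call such as read_ib_attr("/sys/class/infiniband", "mthca0", "fw_ver",
buf, sizeof(buf)) would fetch the firmware version shown above. Note that
this covers only what the driver exports through sysfs; the serial number
from the VPD is not among the files in the listing.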


From raleigh at systemfabricworks.com  Thu May 10 11:20:37 2007
From: raleigh at systemfabricworks.com (Raleigh F Rinehart)
Date: Thu, 10 May 2007 13:20:37 -0500
Subject: [ofa-general] is there an OFED way to putt VPD from an HCA?
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303821207@xmb-sjc-216.amer.cisco.com>
References: <46434FF1.9020005@hp.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303821207@xmb-sjc-216.amer.cisco.com>
Message-ID: <46436275.80406@systemfabricworks.com>

Just FYI: with older versions of mstflint, such as the one I have (ofed_1_1),
specifying the device name 'mthcaX' doesn't work; one has to specify
the PCI ID instead, e.g.:
[root at tiger ~]> mstflint -d 07:00.0 q
Image type:      Failsafe
I.S. Version:    1
Chip Revision:   A1
GUID Des:        Node             Port1            Port2            Sys image
GUIDs:           0002c901078ce000 0002c901078ce001 0002c901078ce002 0002c901078ce000
Board ID:         (MT_0000000001)
VSD:
PSID:            MT_0000000001

or alternatively
[root at tiger ~]> mstflint -d /proc/bus/pci/07/00.0 q
...

-raleigh


Scott Weitzenkamp (sweitzen) wrote:
> For Mellanox HCAs, you can use mstflint (Mellanox util) or tvflash (Cisco util).  For example, from a Cisco-produced Mellanox HCA (YMMV):
>
> [root at svbu-qa1850-1 ~]# mstflint -d mthca0 q
> Image type:      Failsafe
> FW Version:      1.2.0
> I.S. Version:    1
> Device ID:       25204
> Chip Revision:   A0
> GUID Des:        Node             Port1            Sys image
> GUIDs:           0002c90200218140 0002c90200218141 0005ad000100d050
> Board ID:        É,­
> VSD:             É,­
> PSID:
> [root at svbu-qa1850-1 ~]# tvflash -i
> HCA #0: MT25204, Cheetah DDR, revision 20
>   Primary image is v1.2.000 build 3.2.0.140, with label 'HCA.Cheetah-DDR.20'
>   Secondary image is v1.2.000 build 3.2.0.139, with label 'HCA.Cheetah-DDR.20'
>
>   Vital Product Data
>     Product Name: Cheetah DDR
>     P/N: MHGS18-XTC
>     E/C: A1
>     S/N: MT0612X01178
>     Freq/Power: PCIe x8
>     Checksum: Ok
>     Date Code: N/A
>
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>  
>
>   
>> -----Original Message-----
>> From: general-bounces at lists.openfabrics.org 
>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Rick Jones
>> Sent: Thursday, May 10, 2007 10:02 AM
>> To: general at lists.openfabrics.org
>> Subject: [ofa-general] is there an OFED way to putt VPD from an HCA?
>>
>> Hi -
>>
>> I would like to pull vital product data (eg serial number) 
>> from an IB HCA which 
>> is "driven" via OFED bits.  Is there any OFED tool to do that 
>> or do I have to go 
>> hunt-down something HCA-vendor specific (mellanox in this case)?
>>
>> thanks,
>>
>> rick jones



From rdreier at cisco.com  Thu May 10 11:27:56 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 10 May 2007 11:27:56 -0700
Subject: [ofa-general] Re: [HOWTO] accessing the DMA mapped data
In-Reply-To: <464304DD.9070600@cdac.in> (Mahesh's message of "Thu,
	10 May 2007 17:11:17 +0530")
References: <4642D095.6070305@cdac.in> <20070510.014751.74749340.davem@daveml
	oft.net> <4642FFA7.4040303@cdac.in>
	<20070510.043233.08323827.davem@davemloft.net>
	<464304DD.9070600@cdac.in>
Message-ID: <ada4pmk1qkz.fsf@cisco.com>

 > Here I am dealing with an InfiniBand (see www.openfabrics.org)
 > network device driver. The layer above the driver is the standard
 > InfiniBand core interface. Now I have a situation where I need to
 > peek into the packets and do some modifications (some hacking). So I
 > just want to know whether I can access the original data region using
 > the bus address generated by dma_map_single.

If you give more details about what you're trying to do, maybe I can
suggest a good way to accomplish it.

Also for IB-related stuff you may want to at least CC
<general at lists.openfabrics.org> to get the best info.


From arthur.jones at qlogic.com  Thu May 10 12:10:49 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Thu, 10 May 2007 12:10:49 -0700
Subject: [ofa-general] [PATCH take2] IB/ipath -- shadow the gpio_mask
	register
Message-ID: <20070510191047.6876.80760.stgit@bauxite.internal.keyresearch.com>

Once upon a time, GPIO interrupts were rare.  But
then a chip bug in the waldo series forced the use of
a GPIO interrupt to signal packet reception.  This
greatly increased the frequency of GPIO interrupts
which have the gpio_mask bits set on the waldo chips.
Other bits in the gpio_status register are used for
the I2C clock and data lines; these bits are usually on.
An "unlikely" annotation left over from the old days
was improperly applied to these bits, and an unnecessary
chip mmio read was being performed in the interrupt fast
path on waldo.

Remove the stale unlikely annotation in the interrupt
handler and keep a shadow copy of the gpio_mask register
to avoid the slow mmio read when testing for interruptible
GPIO bits.

Signed-off-by: Arthur Jones <arthur.jones at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_iba6120.c |    7 +++----
 drivers/infiniband/hw/ipath/ipath_intr.c    |    7 +++----
 drivers/infiniband/hw/ipath/ipath_kernel.h  |    2 ++
 drivers/infiniband/hw/ipath/ipath_verbs.c   |   12 ++++++------
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c
index fb58154..c21d99b 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6120.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c
@@ -747,7 +747,6 @@ static void ipath_pe_quiet_serdes(struct ipath_devdata *dd)
 
 static int ipath_pe_intconfig(struct ipath_devdata *dd)
 {
-	u64 val;
 	u32 chiprev;
 
 	/*
@@ -760,9 +759,9 @@ static int ipath_pe_intconfig(struct ipath_devdata *dd)
 	if ((chiprev & INFINIPATH_R_CHIPREVMINOR_MASK) > 1) {
 		/* Rev2+ reports extra errors via internal GPIO pins */
 		dd->ipath_flags |= IPATH_GPIO_ERRINTRS;
-		val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask);
-		val |= IPATH_GPIO_ERRINTR_MASK;
-		ipath_write_kreg( dd, dd->ipath_kregs->kr_gpio_mask, val);
+		dd->ipath_gpio_mask |= IPATH_GPIO_ERRINTR_MASK;
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask,
+				 dd->ipath_gpio_mask);
 	}
 	return 0;
 }
diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index 45d0331..a90d3b5 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -1056,7 +1056,7 @@ irqreturn_t ipath_intr(int irq, void *data)
 			gpiostatus &= ~(1 << IPATH_GPIO_PORT0_BIT);
 			chk0rcv = 1;
 		}
-		if (unlikely(gpiostatus)) {
+		if (gpiostatus) {
 			/*
 			 * Some unexpected bits remain. If they could have
 			 * caused the interrupt, complain and clear.
@@ -1065,9 +1065,8 @@ irqreturn_t ipath_intr(int irq, void *data)
 			 * GPIO interrupts, possibly on a "three strikes"
 			 * basis.
 			 */
-			u32 mask;
-			mask = ipath_read_kreg32(
-				dd, dd->ipath_kregs->kr_gpio_mask);
+			const u32 mask = (u32) dd->ipath_gpio_mask;
+
 			if (mask & gpiostatus) {
 				ipath_dbg("Unexpected GPIO IRQ bits %x\n",
 				  gpiostatus & mask);
diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
index e900c25..12194f3 100644
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -397,6 +397,8 @@ struct ipath_devdata {
 	unsigned long ipath_pioavailshadow[8];
 	/* shadow of kr_gpio_out, for rmw ops */
 	u64 ipath_gpio_out;
+	/* shadow the gpio mask register */
+	u64 ipath_gpio_mask;
 	/* kr_revision shadow */
 	u64 ipath_revision;
 	/*
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
index 12933e7..bb70845 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -1387,13 +1387,12 @@ static int enable_timer(struct ipath_devdata *dd)
 	 * processing.
 	 */
 	if (dd->ipath_flags & IPATH_GPIO_INTR) {
-		u64 val;
 		ipath_write_kreg(dd, dd->ipath_kregs->kr_debugportselect,
 				 0x2074076542310ULL);
 		/* Enable GPIO bit 2 interrupt */
-		val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask);
-		val |= (u64) (1 << IPATH_GPIO_PORT0_BIT);
-		ipath_write_kreg( dd, dd->ipath_kregs->kr_gpio_mask, val);
+		dd->ipath_gpio_mask |= (u64) (1 << IPATH_GPIO_PORT0_BIT);
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask,
+				 dd->ipath_gpio_mask);
 	}
 
 	init_timer(&dd->verbs_timer);
@@ -1412,8 +1411,9 @@ static int disable_timer(struct ipath_devdata *dd)
                 u64 val;
                 /* Disable GPIO bit 2 interrupt */
                 val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask);
-                val &= ~((u64) (1 << IPATH_GPIO_PORT0_BIT));
-                ipath_write_kreg( dd, dd->ipath_kregs->kr_gpio_mask, val);
+		dd->ipath_gpio_mask &= ~((u64) (1 << IPATH_GPIO_PORT0_BIT));
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask,
+				 dd->ipath_gpio_mask);
 		/*
 		 * We might want to undo changes to debugportselect,
 		 * but how?
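
The shadow-register technique used by this patch can be sketched in
isolation. The code below is a user-space model under stated assumptions:
struct fake_dev, the fake_mmio_*() helpers and the gpio_mask_*() functions
are hypothetical names for illustration and do not correspond to the
ipath driver's API. The idea is that every update goes through the
software copy and is written through to the register, so the fast path
never needs to read the register back.

```c
#include <stdint.h>

/* Hypothetical device: one "slow" register plus a read counter. */
struct fake_dev {
	uint64_t hw_gpio_mask;		/* stands in for the chip register */
	uint64_t shadow_gpio_mask;	/* software copy, kept in sync */
	unsigned int mmio_reads;	/* counts simulated slow reads */
};

/* Simulated slow register read; the shadow exists to avoid calling this. */
uint64_t fake_mmio_read(struct fake_dev *dd)
{
	dd->mmio_reads++;
	return dd->hw_gpio_mask;
}

/* Simulated register write. */
void fake_mmio_write(struct fake_dev *dd, uint64_t val)
{
	dd->hw_gpio_mask = val;
}

/* Set bits: update the shadow, then write it through to the hardware. */
void gpio_mask_set(struct fake_dev *dd, uint64_t bits)
{
	dd->shadow_gpio_mask |= bits;
	fake_mmio_write(dd, dd->shadow_gpio_mask);
}

/* Clear bits the same way, again without reading the register. */
void gpio_mask_clear(struct fake_dev *dd, uint64_t bits)
{
	dd->shadow_gpio_mask &= ~bits;
	fake_mmio_write(dd, dd->shadow_gpio_mask);
}

/* Fast path: consult only the shadow when testing status bits. */
int gpio_bits_enabled(const struct fake_dev *dd, uint64_t status)
{
	return (dd->shadow_gpio_mask & status) != 0;
}
```

The shadow is only trustworthy if its initial value matches the register
and every writer goes through the helpers; a single direct register write
elsewhere would silently desynchronize the two.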


From sean.hefty at intel.com  Thu May 10 12:39:28 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 10 May 2007 12:39:28 -0700
Subject: [ofa-general] RFC: location for IB CM statistics
Message-ID: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com>

I'd like to start adding some statistical information to the IB CM to help
identify scalability or connectivity issues.  Some example statistics that I
would like to expose now are number of retried MADs, unmatched requests, total
number of connections, etc.  Longer term, additional statistics and information
on each connection could be added.

I'm looking for ideas on the best way to expose this sort of data.  Any
thoughts?

- Sean


From pradeeps at linux.vnet.ibm.com  Thu May 10 13:54:05 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Thu, 10 May 2007 13:54:05 -0700
Subject: [ofa-general] RFC: location for IB CM statistics
In-Reply-To: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com>
References: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com>
Message-ID: <4643866D.9010203@linux.vnet.ibm.com>

Sean Hefty wrote:
> I'd like to start adding some statistical information to the IB CM to help
> identify scalability or connectivity issues.  Some example statistics that I
> would like to expose now are number of retried MADs, unmatched requests, total
> number of connections, etc.  Longer term, additional statistics and information
> on each connection could be added.
> 
> I'm looking for ideas on the best way to expose this sort of data.  Any
> thoughts?
> 
> - Sean

This is a great idea and would be a big help in debugging problems and
identifying performance issues.

Would an approach similar to /sys/class/net/ be appealing, wherein stats
for the various devices are exposed and a utility akin to netstat
consolidates them?

Pradeep


From sashak at voltaire.com  Thu May 10 14:14:36 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 11 May 2007 00:14:36 +0300
Subject: [ofa-general] [PATCH 0/3 v2] opensm: osm_port_t structure
	simplification.
Message-ID: <11788316803259-git-send-email-sashak@voltaire.com>

Hi Hal,

This simplifies the osm_port_t structure and related API functions.
The main idea is to not keep a duplicate (of osm_node_t's) physical port
pointer table, but only one direct pointer to the appropriate physical
port (osm_physp_t).

The changes in the patch series are slightly reordered relative to the
original version, so that each patch "works" on its own (in the original
version, patch 2/3 broke things if applied without 3/3).

Sasha
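
The shape of the simplification can be illustrated with a small model.
This is only a sketch of the idea described above, assuming the node owns
the physical port table and the port keeps a back pointer to its node
plus one direct pointer; all demo_* types and functions are hypothetical
and are not the OpenSM definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PHYSP 4

struct demo_physp {
	uint8_t port_num;
};

/* The node is the single owner of the physical port table. */
struct demo_node {
	uint8_t num_physp;
	struct demo_physp physp_tbl[DEMO_MAX_PHYSP];
};

/* The port no longer duplicates the table; it points into the node's. */
struct demo_port {
	struct demo_node *p_node;
	struct demo_physp *p_physp;
};

uint8_t demo_node_get_num_physp(const struct demo_node *p_node)
{
	return p_node->num_physp;
}

/* Bounds-checked lookup in the node's table. */
struct demo_physp *demo_node_get_physp(struct demo_node *p_node, uint8_t num)
{
	return num < p_node->num_physp ? &p_node->physp_tbl[num] : NULL;
}

void demo_port_init(struct demo_port *p_port, struct demo_node *p_node,
		    uint8_t port_num)
{
	p_port->p_node = p_node;
	p_port->p_physp = demo_node_get_physp(p_node, port_num);
}
```

Callers that used to ask the port for its table size would instead go
through the node, as in demo_node_get_num_physp(p_port->p_node).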


From sashak at voltaire.com  Thu May 10 14:14:37 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 11 May 2007 00:14:37 +0300
Subject: [ofa-general] [PATCH 1/4 v2] opensm: remove osm_port_get_num_physp()
	function
In-Reply-To: <11788316803259-git-send-email-sashak@voltaire.com>
References: <11788316803259-git-send-email-sashak@voltaire.com>
Message-ID: <11788316803779-git-send-email-sashak@voltaire.com>

This removes osm_port_get_num_physp() function and instead uses native
node oriented osm_node_get_num_physp().

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/include/opensm/osm_port.h       |   29 -----------------------------
 osm/opensm/osm_drop_mgr.c           |    2 +-
 osm/opensm/osm_link_mgr.c           |    2 +-
 osm/opensm/osm_qos.c                |    2 +-
 osm/opensm/osm_sa_link_record.c     |    8 ++++----
 osm/opensm/osm_sa_pkey_record.c     |    6 +++---
 osm/opensm/osm_sa_portinfo_record.c |    2 +-
 osm/opensm/osm_sa_slvl_record.c     |    2 +-
 osm/opensm/osm_sa_vlarb_record.c    |    6 +++---
 osm/opensm/osm_state_mgr.c          |    2 +-
 osm/opensm/osm_trap_rcv.c           |    2 +-
 11 files changed, 17 insertions(+), 46 deletions(-)

diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h
index 6d51d2b..134012c 100644
--- a/osm/include/opensm/osm_port.h
+++ b/osm/include/opensm/osm_port.h
@@ -1467,35 +1467,6 @@ osm_port_get_guid(
 *	Port
 *********/
 
-/****f* OpenSM: Port/osm_port_get_num_physp
-* NAME
-*	osm_port_get_num_physp
-*
-* DESCRIPTION
-*	Returns the number of Physical Port objects associated with this port.
-*
-* SYNOPSIS
-*/
-static inline uint8_t
-osm_port_get_num_physp(
-	IN const osm_port_t* const p_port )
-{
-	return( p_port->physp_tbl_size );
-}
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to a Port object.
-*
-* RETURN VALUE
-*	Returns the number of Physical Port objects associated with this port.
-*
-* NOTES
-*
-* SEE ALSO
-*	Port
-*********/
-
 /****f* OpenSM: Port/osm_port_get_phys_ptr
 * NAME
 *	osm_port_get_phys_ptr
diff --git a/osm/opensm/osm_drop_mgr.c b/osm/opensm/osm_drop_mgr.c
index 0d08ff6..d091347 100644
--- a/osm/opensm/osm_drop_mgr.c
+++ b/osm/opensm/osm_drop_mgr.c
@@ -237,7 +237,7 @@ __osm_drop_mgr_remove_port(
     Re-initialize each Physical Port.
   */
 
-  num_physp = osm_port_get_num_physp( p_port );
+  num_physp = osm_node_get_num_physp( p_port->p_node );
   for( port_num = 0; port_num < num_physp; port_num++ )
   {
     p_physp = osm_port_get_phys_ptr( p_port, (uint8_t)port_num );
diff --git a/osm/opensm/osm_link_mgr.c b/osm/opensm/osm_link_mgr.c
index a1081bd..71c0495 100644
--- a/osm/opensm/osm_link_mgr.c
+++ b/osm/opensm/osm_link_mgr.c
@@ -426,7 +426,7 @@ __osm_link_mgr_process_port(
     with this Port.  Start iterating with port 1, since the linkstate
     is not applicable to the management port on switches.
   */
-  num_physp = osm_port_get_num_physp( p_port );
+  num_physp = osm_node_get_num_physp( p_port->p_node );
   for( i = 0; i < num_physp; i ++ )
   {
     /*
diff --git a/osm/opensm/osm_qos.c b/osm/opensm/osm_qos.c
index e71c053..11beaae 100644
--- a/osm/opensm/osm_qos.c
+++ b/osm/opensm/osm_qos.c
@@ -334,7 +334,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm)
 
 		p_node = p_port->p_node;
 		if (p_node->sw) {
-			num_physp = osm_port_get_num_physp(p_port);
+			num_physp = osm_node_get_num_physp(p_node);
 			for (i = 1; i < num_physp; i++) {
 				p_physp = osm_port_get_phys_ptr(p_port, i);
 				if (!p_physp || !osm_physp_is_valid(p_physp))
diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c
index 169e75e..18f655c 100644
--- a/osm/opensm/osm_sa_link_record.c
+++ b/osm/opensm/osm_sa_link_record.c
@@ -346,8 +346,8 @@ __osm_lr_rcv_get_port_links(
         that do not actually connect.  Don't bother screening
         for that here.
       */
-      num_ports = osm_port_get_num_physp( p_src_port );
-      dest_num_ports = osm_port_get_num_physp( p_dest_port );
+      num_ports = osm_node_get_num_physp( p_src_port->p_node );
+      dest_num_ports = osm_node_get_num_physp( p_dest_port->p_node );
       for( port_num = 1; port_num < num_ports; port_num++ )
       {
         p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num );
@@ -385,7 +385,7 @@ __osm_lr_rcv_get_port_links(
       }
       else
       {
-        num_ports = osm_port_get_num_physp( p_src_port );
+        num_ports = osm_node_get_num_physp( p_src_port->p_node );
         for( port_num = 1; port_num < num_ports; port_num++ )
         {
           p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num );
@@ -421,7 +421,7 @@ __osm_lr_rcv_get_port_links(
       }
       else
       {
-        num_ports = osm_port_get_num_physp( p_dest_port );
+        num_ports = osm_node_get_num_physp( p_dest_port->p_node );
         for( port_num = 1; port_num < num_ports; port_num++ )
         {
           p_dest_physp = osm_port_get_phys_ptr(
diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c
index 5eb15df..0a199f1 100644
--- a/osm/opensm/osm_sa_pkey_record.c
+++ b/osm/opensm/osm_sa_pkey_record.c
@@ -249,7 +249,7 @@ __osm_sa_pkey_by_comp_mask(
 
   if( comp_mask & IB_PKEY_COMPMASK_PORT )
   {
-    if (port_num < osm_port_get_num_physp( p_port ))
+    if (port_num < osm_node_get_num_physp( p_port->p_node ))
     {
       p_physp = osm_port_get_phys_ptr( p_port, port_num );
       /* Check that the p_physp is valid, and that is shares a pkey
@@ -263,13 +263,13 @@ __osm_sa_pkey_by_comp_mask(
       osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                "__osm_sa_pkey_by_comp_mask: ERR 4603: "
                "Given Physical Port Number: 0x%X is out of range should be < 0x%X\n",
-               port_num, osm_port_get_num_physp( p_port ));
+               port_num, osm_node_get_num_physp( p_port->p_node ));
       goto Exit;
     }
   }
   else
   {
-    num_ports = osm_port_get_num_physp( p_port );
+    num_ports = osm_node_get_num_physp( p_port->p_node );
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
       p_physp = osm_port_get_phys_ptr( p_port, port_num );
diff --git a/osm/opensm/osm_sa_portinfo_record.c b/osm/opensm/osm_sa_portinfo_record.c
index 5d9b1b2..9d4f18e 100644
--- a/osm/opensm/osm_sa_portinfo_record.c
+++ b/osm/opensm/osm_sa_portinfo_record.c
@@ -538,7 +538,7 @@ __osm_sa_pir_by_comp_mask(
   comp_mask = p_ctxt->comp_mask;
   p_req_physp = p_ctxt->p_req_physp;
 
-  num_ports = osm_port_get_num_physp( p_port );
+  num_ports = osm_node_get_num_physp( p_port->p_node );
 
   if( comp_mask & IB_PIR_COMPMASK_PORTNUM )
   {
diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c
index d831ffd..3c4ff02 100644
--- a/osm/opensm/osm_sa_slvl_record.c
+++ b/osm/opensm/osm_sa_slvl_record.c
@@ -213,7 +213,7 @@ __osm_sa_slvl_by_comp_mask(
 
   p_rcvd_rec = p_ctxt->p_rcvd_rec;
   comp_mask = p_ctxt->comp_mask;
-  num_ports = osm_port_get_num_physp( p_port );
+  num_ports = osm_node_get_num_physp( p_port->p_node );
   in_port_start = 0;
   in_port_end = num_ports;
   out_port_start = 0;
diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c
index f0ff957..6df5ed9 100644
--- a/osm/opensm/osm_sa_vlarb_record.c
+++ b/osm/opensm/osm_sa_vlarb_record.c
@@ -253,7 +253,7 @@ __osm_sa_vl_arb_by_comp_mask(
 
   if( comp_mask & IB_VLA_COMPMASK_OUT_PORT )
   {
-    if (port_num < osm_port_get_num_physp( p_port ))
+    if (port_num < osm_node_get_num_physp( p_port->p_node ))
     {
       p_physp = osm_port_get_phys_ptr( p_port, port_num );
       /* check that the p_physp is valid, and that the requester
@@ -267,13 +267,13 @@ __osm_sa_vl_arb_by_comp_mask(
       osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                "__osm_sa_vl_arb_by_comp_mask: ERR 2A03: "
                "Given Physical Port Number: 0x%X is out of range should be < 0x%X\n",
-               port_num, osm_port_get_num_physp( p_port ) );
+               port_num, osm_node_get_num_physp( p_port->p_node ) );
       goto Exit;
     }
   }
   else
   {
-    num_ports = osm_port_get_num_physp( p_port );
+    num_ports = osm_node_get_num_physp( p_port->p_node );
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
       p_physp = osm_port_get_phys_ptr( p_port, port_num );
diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c
index ddec10c..6f53e60 100644
--- a/osm/opensm/osm_state_mgr.c
+++ b/osm/opensm/osm_state_mgr.c
@@ -1284,7 +1284,7 @@ __osm_state_mgr_report(
       else
          start_port = 1;
 
-      num_ports = osm_port_get_num_physp( p_port );
+      num_ports = osm_node_get_num_physp( p_node );
       for( port_num = start_port; port_num < num_ports; port_num++ )
       {
          p_physp = osm_port_get_phys_ptr( p_port, port_num );
diff --git a/osm/opensm/osm_trap_rcv.c b/osm/opensm/osm_trap_rcv.c
index 0858968..ed507b6 100644
--- a/osm/opensm/osm_trap_rcv.c
+++ b/osm/opensm/osm_trap_rcv.c
@@ -108,7 +108,7 @@ __get_physp_by_lid_and_num(
   if (! p_port)
     return NULL;
 
-  if (osm_port_get_num_physp(p_port) < num)
+  if (osm_node_get_num_physp(p_port->p_node) < num)
     return NULL;
 
   return( osm_port_get_phys_ptr(p_port, num) );
-- 
1.5.2.rc2.20.gac2a


From sashak at voltaire.com  Thu May 10 14:14:38 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 11 May 2007 00:14:38 +0300
Subject: [ofa-general] [PATCH 2/4 v2] opensm: remove osm_port_get_phys_ptr()
In-Reply-To: <11788316803259-git-send-email-sashak@voltaire.com>
References: <11788316803259-git-send-email-sashak@voltaire.com>
Message-ID: <11788316803195-git-send-email-sashak@voltaire.com>

osm_port_get_phys_ptr() returns a pointer to a physical port by its
number, which is a node-related rather than a port-related operation.
This patch therefore replaces osm_port_get_phys_ptr() with
osm_node_get_physp_ptr().

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/include/opensm/osm_port.h       |   36 -----------------------------------
 osm/opensm/osm_drop_mgr.c           |    2 +-
 osm/opensm/osm_link_mgr.c           |    2 +-
 osm/opensm/osm_port_info_rcv.c      |   10 ++++----
 osm/opensm/osm_qos.c                |    2 +-
 osm/opensm/osm_sa_link_record.c     |   18 ++++++++--------
 osm/opensm/osm_sa_pkey_record.c     |    4 +-
 osm/opensm/osm_sa_portinfo_record.c |    4 +-
 osm/opensm/osm_sa_slvl_record.c     |    6 ++--
 osm/opensm/osm_sa_vlarb_record.c    |    4 +-
 osm/opensm/osm_state_mgr.c          |    2 +-
 osm/opensm/osm_subnet.c             |    4 +-
 osm/opensm/osm_trap_rcv.c           |    2 +-
 13 files changed, 30 insertions(+), 66 deletions(-)
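
Reader's note: the call-site change in this patch is mechanical. The
sketch below illustrates the pattern with simplified stand-in types.
physp_t, node_t, port_t, node_get_physp_ptr() and lookup_via_node() are
hypothetical names for illustration only, not the real OpenSM
definitions. The point it shows is that a lookup by port number goes
through the port's parent node instead of a per-port table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins, NOT the real OpenSM structures. */
typedef struct physp {
	uint8_t port_num;
	int valid;
} physp_t;

typedef struct node {
	uint8_t physp_tbl_size;	/* number of slots in tbl[] */
	physp_t *tbl[4];	/* physical ports, indexed by port number */
} node_t;

typedef struct port {
	node_t *p_node;		/* parent node that owns the table */
} port_t;

/* Node-level accessor: returns NULL when port_num is out of range. */
static physp_t *node_get_physp_ptr(const node_t *p_node, uint8_t port_num)
{
	if (port_num >= p_node->physp_tbl_size)
		return NULL;
	return p_node->tbl[port_num];
}

/* Call-site pattern after the change: go through p_port->p_node. */
static physp_t *lookup_via_node(const port_t *p_port, uint8_t port_num)
{
	return node_get_physp_ptr(p_port->p_node, port_num);
}
```

Unlike this sketch, the removed osm_port_get_phys_ptr() only asserted
the range, so callers in the diff below still check port_num against
osm_node_get_num_physp() before the lookup.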

diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h
index 134012c..feebf63 100644
--- a/osm/include/opensm/osm_port.h
+++ b/osm/include/opensm/osm_port.h
@@ -1467,42 +1467,6 @@ osm_port_get_guid(
 *	Port
 *********/
 
-/****f* OpenSM: Port/osm_port_get_phys_ptr
-* NAME
-*	osm_port_get_phys_ptr
-*
-* DESCRIPTION
-*	Gets the pointer to the specified Physical Port object.
-*
-* SYNOPSIS
-*/
-static inline osm_physp_t*
-osm_port_get_phys_ptr(
-	IN const osm_port_t* const p_port,
-	IN const uint8_t port_num )
-{
-	CL_ASSERT( port_num < p_port->physp_tbl_size );
-	return( p_port->tbl[port_num] );
-}
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to a Port object.
-*
-*	port_num
-*		[in] Number of physical port for which to return the
-*		osm_physp_t object.  If this port is on an HCA, then
-*		this value is ignored.
-*
-* RETURN VALUE
-*	Pointer to the Physical Port object.
-*
-* NOTES
-*
-* SEE ALSO
-*	Port
-*********/
-
 /****f* OpenSM: Port/osm_port_get_default_phys_ptr
 * NAME
 *	osm_port_get_default_phys_ptr
diff --git a/osm/opensm/osm_drop_mgr.c b/osm/opensm/osm_drop_mgr.c
index d091347..97a95c2 100644
--- a/osm/opensm/osm_drop_mgr.c
+++ b/osm/opensm/osm_drop_mgr.c
@@ -240,7 +240,7 @@ __osm_drop_mgr_remove_port(
   num_physp = osm_node_get_num_physp( p_port->p_node );
   for( port_num = 0; port_num < num_physp; port_num++ )
   {
-    p_physp = osm_port_get_phys_ptr( p_port, (uint8_t)port_num );
+    p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)port_num );
 
     if( p_physp )
     {
diff --git a/osm/opensm/osm_link_mgr.c b/osm/opensm/osm_link_mgr.c
index 71c0495..a38d179 100644
--- a/osm/opensm/osm_link_mgr.c
+++ b/osm/opensm/osm_link_mgr.c
@@ -434,7 +434,7 @@ __osm_link_mgr_process_port(
       or if the state of the port is already better then the
       specified state.
     */
-    p_physp = osm_port_get_phys_ptr( p_port, (uint8_t)i );
+    p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)i );
     if( p_physp && osm_physp_is_valid( p_physp ) )
     {
       current_state = osm_physp_get_port_state( p_physp );
diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c
index 9bd75b5..5d9c5c7 100644
--- a/osm/opensm/osm_port_info_rcv.c
+++ b/osm/opensm/osm_port_info_rcv.c
@@ -555,13 +555,13 @@ osm_pi_rcv_process_set(
 
   p_context = osm_madw_get_pi_context_ptr( p_madw );
 
-  p_physp = osm_port_get_phys_ptr( p_port, port_num );
-  CL_ASSERT( p_physp );
-  CL_ASSERT( osm_physp_is_valid( p_physp ) );
+  p_node = p_port->p_node;
+  CL_ASSERT( p_node );
+
+  p_physp = osm_node_get_physp_ptr( p_node, port_num );
+  CL_ASSERT( p_physp && osm_physp_is_valid( p_physp ) );
 
   port_guid = osm_physp_get_port_guid( p_physp );
-  p_node = osm_port_get_parent_node( p_port );
-  CL_ASSERT( p_node );
 
   p_smp = osm_madw_get_smp_ptr( p_madw );
   p_pi = (ib_port_info_t*)ib_smp_get_payload_ptr( p_smp );
diff --git a/osm/opensm/osm_qos.c b/osm/opensm/osm_qos.c
index 11beaae..1255169 100644
--- a/osm/opensm/osm_qos.c
+++ b/osm/opensm/osm_qos.c
@@ -336,7 +336,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm)
 		if (p_node->sw) {
 			num_physp = osm_node_get_num_physp(p_node);
 			for (i = 1; i < num_physp; i++) {
-				p_physp = osm_port_get_phys_ptr(p_port, i);
+				p_physp = osm_node_get_physp_ptr(p_node, i);
 				if (!p_physp || !osm_physp_is_valid(p_physp))
 					continue;
 				status =
diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c
index 18f655c..c6b7a7c 100644
--- a/osm/opensm/osm_sa_link_record.c
+++ b/osm/opensm/osm_sa_link_record.c
@@ -350,12 +350,12 @@ __osm_lr_rcv_get_port_links(
       dest_num_ports = osm_node_get_num_physp( p_dest_port->p_node );
       for( port_num = 1; port_num < num_ports; port_num++ )
       {
-        p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num );
+        p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num );
         for( dest_port_num = 1; dest_port_num < dest_num_ports;
              dest_port_num++ )
         {
-          p_dest_physp = osm_port_get_phys_ptr( p_dest_port,
-                                                dest_port_num );
+          p_dest_physp = osm_node_get_physp_ptr( p_dest_port->p_node,
+                                                 dest_port_num );
           /* both physical ports should be with data */
           if (p_src_physp && p_dest_physp)
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp,
@@ -376,7 +376,7 @@ __osm_lr_rcv_get_port_links(
            this couldn't be a relevant record. */
         if (port_num < p_src_port->physp_tbl_size) 
         {          
-          p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num );
+          p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num );
           if (p_src_physp)
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp,
                                          NULL, comp_mask, p_list,
@@ -388,7 +388,7 @@ __osm_lr_rcv_get_port_links(
         num_ports = osm_node_get_num_physp( p_src_port->p_node );
         for( port_num = 1; port_num < num_ports; port_num++ )
         {
-          p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num );
+          p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num );
           if (p_src_physp)
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp,
                                          NULL, comp_mask, p_list,
@@ -411,8 +411,8 @@ __osm_lr_rcv_get_port_links(
            this couldn't be a relevant record. */
         if (port_num < p_dest_port->physp_tbl_size ) 
         {
-          p_dest_physp = osm_port_get_phys_ptr(
-            p_dest_port, port_num );
+          p_dest_physp = osm_node_get_physp_ptr(
+            p_dest_port->p_node, port_num );
           if (p_dest_physp)
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL,
                                          p_dest_physp, comp_mask,
@@ -424,8 +424,8 @@ __osm_lr_rcv_get_port_links(
         num_ports = osm_node_get_num_physp( p_dest_port->p_node );
         for( port_num = 1; port_num < num_ports; port_num++ )
         {
-          p_dest_physp = osm_port_get_phys_ptr(
-            p_dest_port, port_num );
+          p_dest_physp = osm_node_get_physp_ptr(
+            p_dest_port->p_node, port_num );
           if (p_dest_physp)
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL,
                                          p_dest_physp, comp_mask,
diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c
index 0a199f1..a943fe0 100644
--- a/osm/opensm/osm_sa_pkey_record.c
+++ b/osm/opensm/osm_sa_pkey_record.c
@@ -251,7 +251,7 @@ __osm_sa_pkey_by_comp_mask(
   {
     if (port_num < osm_node_get_num_physp( p_port->p_node ))
     {
-      p_physp = osm_port_get_phys_ptr( p_port, port_num );
+      p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
       /* Check that the p_physp is valid, and that is shares a pkey
          with the p_req_physp. */
       if( p_physp && osm_physp_is_valid( p_physp ) &&
@@ -272,7 +272,7 @@ __osm_sa_pkey_by_comp_mask(
     num_ports = osm_node_get_num_physp( p_port->p_node );
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
-      p_physp = osm_port_get_phys_ptr( p_port, port_num );
+      p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
       if( p_physp == NULL )
         continue;
 
diff --git a/osm/opensm/osm_sa_portinfo_record.c b/osm/opensm/osm_sa_portinfo_record.c
index 9d4f18e..74f53d6 100644
--- a/osm/opensm/osm_sa_portinfo_record.c
+++ b/osm/opensm/osm_sa_portinfo_record.c
@@ -544,7 +544,7 @@ __osm_sa_pir_by_comp_mask(
   {
     if (p_rcvd_rec->port_num < num_ports)
     {
-      p_physp = osm_port_get_phys_ptr( p_port, p_rcvd_rec->port_num );
+      p_physp = osm_node_get_physp_ptr( p_port->p_node, p_rcvd_rec->port_num );
       /* Check that the p_physp is valid, and that the p_physp and the
          p_req_physp share a pkey. */
       if( p_physp && osm_physp_is_valid( p_physp ) &&
@@ -556,7 +556,7 @@ __osm_sa_pir_by_comp_mask(
   {
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
-      p_physp = osm_port_get_phys_ptr( p_port, port_num );
+      p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
       if( p_physp == NULL )
         continue;
 
diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c
index 3c4ff02..2f250d9 100644
--- a/osm/opensm/osm_sa_slvl_record.c
+++ b/osm/opensm/osm_sa_slvl_record.c
@@ -226,7 +226,7 @@ __osm_sa_slvl_by_comp_mask(
              "__osm_sa_slvl_by_comp_mask:  "
              "Using Physical Default Port Number: 0x%X (for End Node)\n",
              p_port->default_port_num );
-    p_out_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num );
+    p_out_physp = osm_port_get_default_phys_ptr( p_port );
     /* check that the p_out_physp and the p_req_physp share a pkey */
     if (osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_out_physp ))
     __osm_sa_slvl_create( p_rcv, p_out_physp, p_ctxt, 0 );
@@ -243,7 +243,7 @@ __osm_sa_slvl_by_comp_mask(
     }
 
     for( out_port_num = out_port_start; out_port_num <= out_port_end; out_port_num++ ) {
-      p_out_physp = osm_port_get_phys_ptr( p_port, out_port_num );
+      p_out_physp = osm_node_get_physp_ptr( p_port->p_node, out_port_num );
       if( p_out_physp == NULL )
         continue;
 
@@ -256,7 +256,7 @@ __osm_sa_slvl_by_comp_mask(
           continue;
 #endif
 
-        p_in_physp = osm_port_get_phys_ptr( p_port, in_port_num );
+        p_in_physp = osm_node_get_physp_ptr( p_port->p_node, in_port_num );
         if( p_in_physp == NULL )
           continue;
 
diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c
index 6df5ed9..9cd346c 100644
--- a/osm/opensm/osm_sa_vlarb_record.c
+++ b/osm/opensm/osm_sa_vlarb_record.c
@@ -255,7 +255,7 @@ __osm_sa_vl_arb_by_comp_mask(
   {
     if (port_num < osm_node_get_num_physp( p_port->p_node ))
     {
-      p_physp = osm_port_get_phys_ptr( p_port, port_num );
+      p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
       /* check that the p_physp is valid, and that the requester
          and the p_physp share a pkey. */
       if( p_physp && osm_physp_is_valid( p_physp ) &&
@@ -276,7 +276,7 @@ __osm_sa_vl_arb_by_comp_mask(
     num_ports = osm_node_get_num_physp( p_port->p_node );
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
-      p_physp = osm_port_get_phys_ptr( p_port, port_num );
+      p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
       if( p_physp == NULL )
         continue;
 
diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c
index 6f53e60..9aeec74 100644
--- a/osm/opensm/osm_state_mgr.c
+++ b/osm/opensm/osm_state_mgr.c
@@ -1287,7 +1287,7 @@ __osm_state_mgr_report(
       num_ports = osm_node_get_num_physp( p_node );
       for( port_num = start_port; port_num < num_ports; port_num++ )
       {
-         p_physp = osm_port_get_phys_ptr( p_port, port_num );
+         p_physp = osm_node_get_physp_ptr( p_node, port_num );
          if( ( p_physp == NULL ) || ( !osm_physp_is_valid( p_physp ) ) )
             continue;
 
diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
index c8c3ddc..8e0c53b 100644
--- a/osm/opensm/osm_subnet.c
+++ b/osm/opensm/osm_subnet.c
@@ -266,7 +266,7 @@ osm_get_gid_by_mad_addr(
                );
       return(IB_INVALID_PARAMETER);
     }
-    p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num);
+    p_physp = osm_port_get_default_phys_ptr( p_port );
     p_gid->unicast.interface_id = p_physp->port_guid;
     p_gid->unicast.prefix = p_subn->opt.subnet_prefix;
   }
@@ -316,7 +316,7 @@ osm_get_physp_by_mad_addr(
     
       goto Exit;
     }
-    p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num);
+    p_physp = osm_port_get_default_phys_ptr( p_port );
   }
   else
   {
diff --git a/osm/opensm/osm_trap_rcv.c b/osm/opensm/osm_trap_rcv.c
index ed507b6..0ec9a1f 100644
--- a/osm/opensm/osm_trap_rcv.c
+++ b/osm/opensm/osm_trap_rcv.c
@@ -111,7 +111,7 @@ __get_physp_by_lid_and_num(
   if (osm_node_get_num_physp(p_port->p_node) < num)
     return NULL;
 
-  return( osm_port_get_phys_ptr(p_port, num) );
+  return( osm_node_get_physp_ptr(p_port->p_node, num) );
 }
 
 /**********************************************************************
-- 
1.5.2.rc2.20.gac2a


From sashak at voltaire.com  Thu May 10 14:14:39 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 11 May 2007 00:14:39 +0300
Subject: [ofa-general] [PATCH 3/4 v2] opensm: eliminate node's physical ports
	table duplication in osm_port_t
In-Reply-To: <11788316803259-git-send-email-sashak@voltaire.com>
References: <11788316803259-git-send-email-sashak@voltaire.com>
Message-ID: <11788316801832-git-send-email-sashak@voltaire.com>

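Eliminate the duplication of osm_node's physical ports table in the
osm_port_t object. osm_port_t now holds a single p_physp pointer
instead of default_port_num, physp_tbl_size and tbl[].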

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/include/opensm/osm_port.h    |   34 +++++-------------
 osm/opensm/osm_pkey_rcv.c        |    2 +-
 osm/opensm/osm_port.c            |   70 ++++++++-----------------------------
 osm/opensm/osm_sa_link_record.c  |    4 +-
 osm/opensm/osm_sa_pkey_record.c  |    2 +-
 osm/opensm/osm_sa_slvl_record.c  |    2 +-
 osm/opensm/osm_sa_vlarb_record.c |    2 +-
 osm/opensm/osm_slvl_map_rcv.c    |    2 +-
 osm/opensm/osm_sm_state_mgr.c    |    2 +-
 osm/opensm/osm_vl_arb_rcv.c      |    2 +-
 10 files changed, 34 insertions(+), 88 deletions(-)
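
Reader's note: a minimal sketch of the resulting layout, under stated
assumptions. The types and the port_init() helper below are simplified,
hypothetical stand-ins and not the real OpenSM code. The sketch shows a
port object that keeps one pointer into its parent node's table, chosen
as the first valid physical port whose GUID matches, so the port number
can be read back from the physical port itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins, NOT the real OpenSM structures. */
typedef struct physp {
	uint64_t port_guid;
	uint8_t port_num;
	int valid;
} physp_t;

typedef struct node {
	uint8_t physp_tbl_size;
	physp_t physp_table[3];	/* the only copy of the table */
} node_t;

typedef struct port {
	node_t *p_node;
	uint64_t guid;
	physp_t *p_physp;	/* replaces the per-port table and index */
} port_t;

/* Pick the lowest-numbered valid physical port with a matching GUID. */
static void port_init(port_t *p_port, node_t *p_node, uint64_t guid)
{
	uint8_t i;

	p_port->p_node = p_node;
	p_port->guid = guid;
	p_port->p_physp = NULL;

	for (i = 0; i < p_node->physp_tbl_size; i++) {
		physp_t *p_physp = &p_node->physp_table[i];

		if (p_physp->valid && p_physp->port_guid == guid) {
			p_port->p_physp = p_physp;
			break;
		}
	}
}
```

With this layout, code that previously read p_port->default_port_num
reads p_port->p_physp->port_num, as the hunks below do.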

diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h
index feebf63..0873a7e 100644
--- a/osm/include/opensm/osm_port.h
+++ b/osm/include/opensm/osm_port.h
@@ -1274,10 +1274,8 @@ typedef struct _osm_port
 	struct _osm_node		*p_node;
 	ib_net64_t			guid;
 	uint32_t			discovery_count;
-	uint8_t				default_port_num;
-	uint8_t				physp_tbl_size;
+	osm_physp_t			*p_physp;
 	cl_qlist_t			mcm_list;
-	osm_physp_t			*tbl[1];
 } osm_port_t;
 /*
 * FIELDS
@@ -1295,20 +1293,13 @@ typedef struct _osm_port
 *		during the current fabric sweep.  This number is reset
 *		to zero at the start of a sweep.
 *
-*	default_port_num
-*		Index of the physical port used when physical characteristics
-*		contained in the Physical Port are needed.
-*
-*	physp_tbl_size
-*		Number of physical ports associated with this logical port.
+*	p_physp
+*		The pointer to physical port used when physical
+*		characteristics contained in the Physical Port are needed.
 *
 *	mcm_list
 *		Multicast member list
 *
-*	tbl
-*		Array of pointers to Physical Port objects contained by this node.
-*     MUST BE LAST ELEMENT SINCE IT CAN GROW !!!
-*
 * SEE ALSO
 *	Port, Physical Port, Physical Port Table
 *********/
@@ -1386,10 +1377,8 @@ static inline ib_net16_t
 osm_port_get_base_lid(
 	IN const osm_port_t* const p_port )
 {
-	const osm_physp_t* const p_physp = p_port->tbl[p_port->default_port_num];
-	CL_ASSERT( p_physp );
-	CL_ASSERT( osm_physp_is_valid( p_physp ) );
-	return( osm_physp_get_base_lid( p_physp ));
+	CL_ASSERT( p_port->p_physp && osm_physp_is_valid( p_port->p_physp ) );
+	return( osm_physp_get_base_lid( p_port->p_physp ));
 }
 /*
 * PARAMETERS
@@ -1419,10 +1408,8 @@ static inline uint8_t
 osm_port_get_lmc(
 	IN const osm_port_t* const p_port )
 {
-	const osm_physp_t* const p_physp = p_port->tbl[p_port->default_port_num];
-	CL_ASSERT( p_physp );
-	CL_ASSERT( osm_physp_is_valid( p_physp ) );
-	return( osm_physp_get_lmc( p_physp ));
+	CL_ASSERT( p_port->p_physp && osm_physp_is_valid( p_port->p_physp ) );
+	return( osm_physp_get_lmc( p_port->p_physp ));
 }
 /*
 * PARAMETERS
@@ -1483,9 +1470,8 @@ osm_physp_t*
 osm_port_get_default_phys_ptr(
 	IN const osm_port_t* const p_port )
 {
-	CL_ASSERT( p_port->tbl[p_port->default_port_num] );
-	CL_ASSERT( osm_physp_is_valid( p_port->tbl[p_port->default_port_num] ) );
-	return( p_port->tbl[p_port->default_port_num] );
+	CL_ASSERT( osm_physp_is_valid( p_port->p_physp ) );
+	return p_port->p_physp;
 }
 /*
 * PARAMETERS
diff --git a/osm/opensm/osm_pkey_rcv.c b/osm/opensm/osm_pkey_rcv.c
index 76af9fc..0e0ec46 100644
--- a/osm/opensm/osm_pkey_rcv.c
+++ b/osm/opensm/osm_pkey_rcv.c
@@ -172,7 +172,7 @@ osm_pkey_rcv_process(
   else
   {
     p_physp = osm_port_get_default_phys_ptr(p_port);
-    port_num = p_port->default_port_num;
+    port_num = p_physp->port_num;
   }
 
   CL_ASSERT( p_physp );
diff --git a/osm/opensm/osm_port.c b/osm/opensm/osm_port.c
index 053fc22..30e2ab2 100644
--- a/osm/opensm/osm_port.c
+++ b/osm/opensm/osm_port.c
@@ -174,7 +174,6 @@ osm_port_init(
   uint32_t port_index;
   ib_net64_t port_guid;
   osm_physp_t *p_physp;
-  uint32_t size;
 
   CL_ASSERT( p_port );
   CL_ASSERT( p_ni );
@@ -187,39 +186,25 @@ osm_port_init(
   p_port->guid = port_guid;
 
   /*
-    See comment in port_new for info about this...
-  */
-  size = p_ni->num_ports;
-
-  p_port->physp_tbl_size = (uint8_t)(size + 1);
-
-  /*
     Get the pointers to the physical node objects "owned" by this
     logical port GUID.
     For switches, all the ports are owned; for HCA's and routers,
     only the singular part that has this GUID is owned.
   */
-  p_port->default_port_num = 0xFF;
-  for( port_index = 0; port_index < p_port->physp_tbl_size; port_index++ )
+  for( port_index = 0; port_index < p_parent_node->physp_tbl_size; port_index++ )
   {
     p_physp = osm_node_get_physp_ptr( p_parent_node, port_index );
+    /*
+      Because much of the PortInfo data is only valid
+      for port 0 on switches, try to keep the lowest
+      possible value of default_port_num.
+    */
     if( osm_physp_is_valid( p_physp ) &&
-        port_guid == osm_physp_get_port_guid( p_physp ) )
-    {
-      p_port->tbl[port_index] = p_physp;
-      /*
-        Because much of the PortInfo data is only valid
-        for port 0 on switches, try to keep the lowest
-        possible value of default_port_num.
-      */
-      if( port_index < p_port->default_port_num )
-        p_port->default_port_num = (uint8_t)port_index;
+        port_guid == osm_physp_get_port_guid( p_physp ) ) {
+      p_port->p_physp = p_physp;
+      break;
     }
-    else
-      p_port->tbl[port_index] = NULL;
   }
-
-  CL_ASSERT( p_port->default_port_num < 0xFF );
 }
 
 /**********************************************************************
@@ -230,21 +215,11 @@ osm_port_new(
   IN const osm_node_t* const p_parent_node )
 {
   osm_port_t*  p_port;
-  uint32_t size;
-
-  /*
-    The port object already contains one physical port object pointer.
-    Therefore, subtract 1 from the number of physical ports
-    used by the switch.  This is not done for CA's since they
-    need to occupy 1 more physp pointer than they physically have since
-    we still reserve room for a "port 0".
-  */
-  size = p_ni->num_ports;
 
-  p_port = malloc( sizeof(*p_port) + sizeof(void *) * size );
+  p_port = malloc( sizeof(*p_port) );
   if( p_port != NULL )
   {
-    memset( p_port, 0, sizeof(*p_port) + sizeof(void *) * size );
+    memset( p_port, 0, sizeof(*p_port) );
     osm_port_init( p_port, p_ni, p_parent_node );
   }
 
@@ -315,18 +290,11 @@ osm_port_add_new_physp(
   IN osm_port_t* const p_port,
   IN const uint8_t port_num )
 {
-  osm_node_t *p_node;
   osm_physp_t *p_physp;
 
-  CL_ASSERT( port_num < p_port->physp_tbl_size );
-
-  p_node = p_port->p_node;
-  CL_ASSERT( p_node );
-
-  p_physp = osm_node_get_physp_ptr( p_node, port_num );
+  p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
   CL_ASSERT( osm_physp_is_valid( p_physp ) );
   CL_ASSERT( osm_physp_get_port_guid( p_physp ) == p_port->guid );
-  p_port->tbl[port_num] = p_physp;
 
   /*
     For switches, we generally want to use Port 0, which is
@@ -334,17 +302,9 @@ osm_port_add_new_physp(
     The LID value in the PortInfo for example, is only valid
     for port 0 on switches.
   */
-  if( !osm_physp_is_valid( p_port->tbl[p_port->default_port_num] ) )
-  {
-    p_port->default_port_num = port_num;
-  }
-  else
-  {
-    if(  port_num < p_port->default_port_num )
-    {
-      p_port->default_port_num = port_num;
-    }
-  }
+  if( !osm_physp_is_valid( osm_port_get_default_phys_ptr( p_port ) ) ||
+      port_num < p_port->p_physp->port_num )
+    p_port->p_physp = p_physp;
 }
 
 /**********************************************************************
diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c
index c6b7a7c..5e4e35e 100644
--- a/osm/opensm/osm_sa_link_record.c
+++ b/osm/opensm/osm_sa_link_record.c
@@ -374,7 +374,7 @@ __osm_lr_rcv_get_port_links(
         port_num = p_lr->from_port_num;
         /* If the port number is out of the range of the p_src_port, then
            this couldn't be a relevant record. */
-        if (port_num < p_src_port->physp_tbl_size) 
+        if (port_num < p_src_port->p_node->physp_tbl_size)
         {          
           p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num );
           if (p_src_physp)
@@ -409,7 +409,7 @@ __osm_lr_rcv_get_port_links(
         port_num = p_lr->to_port_num;
         /* If the port number is out of the range of the p_dest_port, then
            this couldn't be a relevant record. */
-        if (port_num < p_dest_port->physp_tbl_size ) 
+        if (port_num < p_dest_port->p_node->physp_tbl_size )
         {
           p_dest_physp = osm_node_get_physp_ptr(
             p_dest_port->p_node, port_num );
diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c
index a943fe0..8a71314 100644
--- a/osm/opensm/osm_sa_pkey_record.c
+++ b/osm/opensm/osm_sa_pkey_record.c
@@ -239,7 +239,7 @@ __osm_sa_pkey_by_comp_mask(
   if ( p_port->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH )
   {
     /* we put it in the comp mask and port num */
-    port_num = p_port->default_port_num;
+    port_num = p_port->p_physp->port_num;
     osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
              "__osm_sa_pkey_by_comp_mask:  "
              "Using Physical Default Port Number: 0x%X (for End Node)\n",
diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c
index 2f250d9..168901e 100644
--- a/osm/opensm/osm_sa_slvl_record.c
+++ b/osm/opensm/osm_sa_slvl_record.c
@@ -225,7 +225,7 @@ __osm_sa_slvl_by_comp_mask(
     osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
              "__osm_sa_slvl_by_comp_mask:  "
              "Using Physical Default Port Number: 0x%X (for End Node)\n",
-             p_port->default_port_num );
+             p_port->p_physp->port_num );
     p_out_physp = osm_port_get_default_phys_ptr( p_port );
     /* check that the p_out_physp and the p_req_physp share a pkey */
     if (osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_out_physp ))
diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c
index 9cd346c..a462ee9 100644
--- a/osm/opensm/osm_sa_vlarb_record.c
+++ b/osm/opensm/osm_sa_vlarb_record.c
@@ -243,7 +243,7 @@ __osm_sa_vl_arb_by_comp_mask(
   if ( p_port->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH)
   {
     /* we put it in the comp mask and port num */
-    port_num = p_port->default_port_num;
+    port_num = p_port->p_physp->port_num;
     osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
              "__osm_sa_vl_arb_by_comp_mask:  "
              "Using Physical Default Port Number: 0x%X (for End Node)\n",
diff --git a/osm/opensm/osm_slvl_map_rcv.c b/osm/opensm/osm_slvl_map_rcv.c
index 3fa3a7e..b109f75 100644
--- a/osm/opensm/osm_slvl_map_rcv.c
+++ b/osm/opensm/osm_slvl_map_rcv.c
@@ -183,7 +183,7 @@ osm_slvl_rcv_process(
   else
   {
     p_physp = osm_port_get_default_phys_ptr(p_port);
-    out_port_num = p_port->default_port_num;
+    out_port_num = p_physp->port_num;
     in_port_num  = 0;
   }
 
diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c
index 3aa92c8..0034320 100644
--- a/osm/opensm/osm_sm_state_mgr.c
+++ b/osm/opensm/osm_sm_state_mgr.c
@@ -194,7 +194,7 @@ __osm_sm_state_mgr_send_local_port_info_req(
                          osm_physp_get_dr_path_ptr
                          ( osm_port_get_default_phys_ptr( p_port ) ),
                          IB_MAD_ATTR_PORT_INFO,
-                         cl_hton32( p_port->default_port_num ),
+                         cl_hton32( p_port->p_physp->port_num ),
                          CL_DISP_MSGID_NONE, &context );
 
    if( status != IB_SUCCESS )
diff --git a/osm/opensm/osm_vl_arb_rcv.c b/osm/opensm/osm_vl_arb_rcv.c
index 930360a..ed8dfc5 100644
--- a/osm/opensm/osm_vl_arb_rcv.c
+++ b/osm/opensm/osm_vl_arb_rcv.c
@@ -184,7 +184,7 @@ osm_vla_rcv_process(
   else
   {
     p_physp = osm_port_get_default_phys_ptr(p_port);
-    port_num = p_port->default_port_num;
+    port_num = p_physp->port_num;
   }
 
   CL_ASSERT( p_physp );
-- 
1.5.2.rc2.20.gac2a


From sashak at voltaire.com  Thu May 10 14:14:40 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 11 May 2007 00:14:40 +0300
Subject: [ofa-general] [PATCH 4/4 v2] opensm: remove some unneeded funcs
In-Reply-To: <11788316803259-git-send-email-sashak@voltaire.com>
References: <11788316803259-git-send-email-sashak@voltaire.com>
Message-ID: <11788316801642-git-send-email-sashak@voltaire.com>

This removes two functions that are no longer needed,
osm_port_get_default_phys_ptr() and osm_port_get_parent_node().
Callers now access p_port->p_physp and p_port->p_node directly.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/include/opensm/osm_port.h        |   66 ----------------------------------
 osm/opensm/osm_lid_mgr.c             |   14 ++------
 osm/opensm/osm_mcast_mgr.c           |    2 +-
 osm/opensm/osm_node_info_rcv.c       |    6 +--
 osm/opensm/osm_pkey.c                |    8 ++--
 osm/opensm/osm_pkey_mgr.c            |    8 ++---
 osm/opensm/osm_pkey_rcv.c            |    4 +-
 osm/opensm/osm_port.c                |    6 ++--
 osm/opensm/osm_port_info_rcv.c       |    2 +-
 osm/opensm/osm_prtn.c                |    2 +-
 osm/opensm/osm_qos.c                 |    4 +-
 osm/opensm/osm_sa_informinfo.c       |    6 ++--
 osm/opensm/osm_sa_lft_record.c       |    2 +-
 osm/opensm/osm_sa_mcmember_record.c  |    2 +-
 osm/opensm/osm_sa_mft_record.c       |    2 +-
 osm/opensm/osm_sa_multipath_record.c |   10 +++---
 osm/opensm/osm_sa_path_record.c      |   12 +++---
 osm/opensm/osm_sa_service_record.c   |    2 +-
 osm/opensm/osm_sa_slvl_record.c      |    2 +-
 osm/opensm/osm_sa_sminfo_record.c    |    2 +-
 osm/opensm/osm_sa_sw_info_record.c   |    2 +-
 osm/opensm/osm_slvl_map_rcv.c        |    4 +-
 osm/opensm/osm_sm_state_mgr.c        |    5 +--
 osm/opensm/osm_state_mgr.c           |   11 +++---
 osm/opensm/osm_subnet.c              |    6 +--
 osm/opensm/osm_switch.c              |    6 ++--
 osm/opensm/osm_ucast_lash.c          |    2 +-
 osm/opensm/osm_ucast_mgr.c           |    7 ++--
 osm/opensm/osm_ucast_updn.c          |    2 +-
 osm/opensm/osm_vl_arb_rcv.c          |    5 +--
 30 files changed, 64 insertions(+), 148 deletions(-)

diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h
index 0873a7e..df9065e 100644
--- a/osm/include/opensm/osm_port.h
+++ b/osm/include/opensm/osm_port.h
@@ -1454,72 +1454,6 @@ osm_port_get_guid(
 *	Port
 *********/
 
-/****f* OpenSM: Port/osm_port_get_default_phys_ptr
-* NAME
-*	osm_port_get_default_phys_ptr
-*
-* DESCRIPTION
-*	Gets the pointer to the default Physical Port object.
-*	This call should only be used for non-switch ports in which there
-*	is a one-for-one mapping of port to physp.
-*
-* SYNOPSIS
-*/
-static inline
-osm_physp_t*
-osm_port_get_default_phys_ptr(
-	IN const osm_port_t* const p_port )
-{
-	CL_ASSERT( osm_physp_is_valid( p_port->p_physp ) );
-	return p_port->p_physp;
-}
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to a Port object.
-*
-* RETURN VALUE
-*	Pointer to the Physical Port object.
-*
-* NOTES
-*
-* SEE ALSO
-*	Port
-*********/
-
-/****f* OpenSM: Port/osm_port_get_parent_node
-* NAME
-*	osm_port_get_parent_node
-*
-* DESCRIPTION
-*	Gets the pointer to the this port's Node object.
-*
-* SYNOPSIS
-*/
-static inline struct _osm_node*
-osm_port_get_parent_node(
-	IN const osm_port_t* const p_port )
-{
-	return( p_port->p_node );
-}
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to a Port object.
-*
-*	port_num
-*		[in] Number of physical port for which to return the
-*		osm_physp_t object.
-*
-* RETURN VALUE
-*	Pointer to the Physical Port object.
-*
-* NOTES
-*
-* SEE ALSO
-*	Port
-*********/
-
 /****f* OpenSM: Port/osm_port_get_lid_range_ho
 * NAME
 *	osm_port_get_lid_range_ho
diff --git a/osm/opensm/osm_lid_mgr.c b/osm/opensm/osm_lid_mgr.c
index d856fb0..6712c6c 100644
--- a/osm/opensm/osm_lid_mgr.c
+++ b/osm/opensm/osm_lid_mgr.c
@@ -975,10 +975,7 @@ __osm_lid_mgr_set_physp_pi(
     Don't bother doing anything if this Physical Port is not valid.
     This allows simplified code in the caller.
   */
-  if( p_physp == NULL )
-    goto Exit;
-
-  if( !osm_physp_is_valid( p_physp ) )
+  if( p_physp == NULL || !osm_physp_is_valid( p_physp ) )
     goto Exit;
 
   port_num = osm_physp_get_port_num( p_physp );
@@ -1283,7 +1280,6 @@ __osm_lid_mgr_process_our_sm_node(
   osm_port_t     *p_port;
   uint16_t        min_lid_ho;
   uint16_t        max_lid_ho;
-  osm_physp_t    *p_physp;
   boolean_t       res = TRUE;
 
   OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_process_our_sm_node );
@@ -1336,9 +1332,7 @@ __osm_lid_mgr_process_our_sm_node(
     Set the PortInfo the Physical Port associated
     with this Port.
   */
-  p_physp = osm_port_get_default_phys_ptr( p_port );
-
-  __osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_physp, cl_hton16( min_lid_ho ) );
+  __osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_port->p_physp, cl_hton16( min_lid_ho ) );
 
  Exit:
   OSM_LOG_EXIT( p_mgr->p_log );
@@ -1404,7 +1398,6 @@ osm_lid_mgr_process_subnet(
   osm_port_t     *p_port;
   ib_net64_t      port_guid;
   uint16_t        min_lid_ho, max_lid_ho;
-  osm_physp_t    *p_physp;
   int             lid_changed;
 
   CL_ASSERT( p_mgr );
@@ -1460,9 +1453,8 @@ osm_lid_mgr_process_subnet(
                ", LID [0x%X,0x%X]\n", cl_ntoh64( port_guid ),
                min_lid_ho, max_lid_ho );
       
-      p_physp = osm_port_get_default_phys_ptr( p_port );
       /* the proc returns the fact it sent a set port info */
-      if (__osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_physp, cl_hton16( min_lid_ho )))
+      if (__osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_port->p_physp, cl_hton16( min_lid_ho )))
         p_mgr->send_set_reqs = TRUE;
     }
   } /* all ports */
diff --git a/osm/opensm/osm_mcast_mgr.c b/osm/opensm/osm_mcast_mgr.c
index 0cdcc0e..f5059c9 100644
--- a/osm/opensm/osm_mcast_mgr.c
+++ b/osm/opensm/osm_mcast_mgr.c
@@ -1127,7 +1127,7 @@ osm_mcast_mgr_process_single(
     goto Exit;
   }
 
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
   if( p_physp == NULL )
   {
     osm_log( p_mgr->p_log, OSM_LOG_ERROR,
diff --git a/osm/opensm/osm_node_info_rcv.c b/osm/opensm/osm_node_info_rcv.c
index 364b07c..2c79056 100644
--- a/osm/opensm/osm_node_info_rcv.c
+++ b/osm/opensm/osm_node_info_rcv.c
@@ -791,12 +791,10 @@ __osm_ni_rcv_process_new(
              "Duplicate Port GUID 0x%" PRIx64 "! Found by the two directed routes:\n",
              cl_ntoh64( p_ni->port_guid ) );
     osm_dump_dr_path(p_rcv->p_log,
-                     osm_physp_get_dr_path_ptr(
-                       osm_port_get_default_phys_ptr ( p_port) ),
+                     osm_physp_get_dr_path_ptr(p_port->p_physp),
                      OSM_LOG_ERROR);
     osm_dump_dr_path(p_rcv->p_log,
-                     osm_physp_get_dr_path_ptr(
-                       osm_port_get_default_phys_ptr ( p_port_check) ),
+                     osm_physp_get_dr_path_ptr(p_port_check->p_physp),
                      OSM_LOG_ERROR);
     if ( p_rtr )
       osm_router_delete( &p_rtr );
diff --git a/osm/opensm/osm_pkey.c b/osm/opensm/osm_pkey.c
index be5578a..c0daa38 100644
--- a/osm/opensm/osm_pkey.c
+++ b/osm/opensm/osm_pkey.c
@@ -432,8 +432,8 @@ osm_port_share_pkey(
     goto Exit;
   }
 
-  p_physp1 = osm_port_get_default_phys_ptr(p_port_1);
-  p_physp2 = osm_port_get_default_phys_ptr(p_port_2);
+  p_physp1 = p_port_1->p_physp;
+  p_physp2 = p_port_2->p_physp;
 
   if (!p_physp1 || !p_physp2)
   {
@@ -478,7 +478,7 @@ osm_lid_share_pkey(
   }
   else
   {
-    p_physp1 = osm_port_get_default_phys_ptr(p_port1);
+    p_physp1 = p_port1->p_physp;
   }
 
   if  (osm_node_get_type( p_node2 ) == IB_NODE_TYPE_SWITCH)
@@ -487,7 +487,7 @@ osm_lid_share_pkey(
   }
   else
   {
-    p_physp2 = osm_port_get_default_phys_ptr(p_port2);
+    p_physp2 = p_port2->p_physp;
   }
 
   return(osm_physp_share_pkey(p_log, p_physp1, p_physp2));
diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c
index bbbe192..33ac8b5 100644
--- a/osm/opensm/osm_pkey_mgr.c
+++ b/osm/opensm/osm_pkey_mgr.c
@@ -310,7 +310,7 @@ static boolean_t pkey_mgr_update_port(
 
   memset(&empty_block, 0, sizeof(ib_pkey_table_t));
 
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
   if ( !osm_physp_is_valid( p_physp ) )
     return FALSE;
 
@@ -449,7 +449,7 @@ pkey_mgr_update_peer_port(
 
   memset(&empty_block, 0, sizeof(ib_pkey_table_t));
 
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
   if (!osm_physp_is_valid( p_physp ))
     return FALSE;
   peer = osm_physp_get_remote( p_physp );
@@ -532,7 +532,6 @@ osm_pkey_mgr_process(
   osm_prtn_t *p_prtn;
   osm_port_t *p_port;
   osm_signal_t signal = OSM_SIGNAL_DONE;
-  osm_node_t *p_node;
 
   CL_ASSERT( p_osm );
 
@@ -570,8 +569,7 @@ osm_pkey_mgr_process(
     p_next = cl_qmap_next( p_next );
     if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) )
       signal = OSM_SIGNAL_DONE_PENDING;
-    p_node = osm_port_get_parent_node( p_port );
-    if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) &&
+    if ( ( osm_node_get_type( p_port->p_node ) != IB_NODE_TYPE_SWITCH ) &&
 	 pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, 
 				    &p_osm->subn, p_port,
 				    !p_osm->subn.opt.no_partition_enforcement ) )
diff --git a/osm/opensm/osm_pkey_rcv.c b/osm/opensm/osm_pkey_rcv.c
index 0e0ec46..7c58d98 100644
--- a/osm/opensm/osm_pkey_rcv.c
+++ b/osm/opensm/osm_pkey_rcv.c
@@ -159,7 +159,7 @@ osm_pkey_rcv_process(
     goto Exit;
   }
 
-  p_node = osm_port_get_parent_node( p_port );
+  p_node = p_port->p_node;
   CL_ASSERT( p_node );
 
   block_num = (uint16_t)((cl_ntoh32(p_smp->attr_mod)) & 0x0000FFFF);
@@ -171,7 +171,7 @@ osm_pkey_rcv_process(
   }
   else
   {
-    p_physp = osm_port_get_default_phys_ptr(p_port);
+    p_physp = p_port->p_physp;
     port_num = p_physp->port_num;
   }
 
diff --git a/osm/opensm/osm_port.c b/osm/opensm/osm_port.c
index 30e2ab2..eab86e1 100644
--- a/osm/opensm/osm_port.c
+++ b/osm/opensm/osm_port.c
@@ -302,7 +302,7 @@ osm_port_add_new_physp(
     The LID value in the PortInfo for example, is only valid
     for port 0 on switches.
   */
-  if( !osm_physp_is_valid( osm_port_get_default_phys_ptr( p_port ) ) ||
+  if( !osm_physp_is_valid( p_port->p_physp ) ||
       port_num < p_port->p_physp->port_num )
     p_port->p_physp = p_physp;
 }
@@ -565,7 +565,7 @@ __osm_physp_get_dr_physp_set(
   }
 
   /* get the node of the SM */
-  p_node = osm_port_get_parent_node(p_port);
+  p_node = p_port->p_node;
   
   /* 
      traverse the path adding the nodes to the table 
@@ -732,7 +732,7 @@ osm_physp_replace_dr_path_with_alternate_dr_path(
      port we'll get the port connected to the rest of the subnet. If SM is
      running on SWITCH - we should try to get a dr path from all switch ports.
   */
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
 
   CL_ASSERT( p_physp );
   CL_ASSERT( osm_physp_is_valid( p_physp ) );
diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c
index 5d9c5c7..0076b00 100644
--- a/osm/opensm/osm_port_info_rcv.c
+++ b/osm/opensm/osm_port_info_rcv.c
@@ -743,7 +743,7 @@ osm_pi_rcv_process(
                cl_ntoh64( p_smp->trans_id ) );
     }
 
-    p_node = osm_port_get_parent_node( p_port );
+    p_node = p_port->p_node;
     p_physp = osm_node_get_physp_ptr( p_node, port_num );
 
     CL_ASSERT( p_node );
diff --git a/osm/opensm/osm_prtn.c b/osm/opensm/osm_prtn.c
index 4099cee..027a5a4 100644
--- a/osm/opensm/osm_prtn.c
+++ b/osm/opensm/osm_prtn.c
@@ -119,7 +119,7 @@ ib_api_status_t osm_prtn_add_port(osm_log_t *p_log, osm_subn_t *p_subn,
 		return status;
 	}
 
-	p_physp = osm_port_get_default_phys_ptr(p_port);
+	p_physp = p_port->p_physp;
 	if (!p_physp) {
 		osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: "
 			"no physical for port 0x%" PRIx64 "\n",
diff --git a/osm/opensm/osm_qos.c b/osm/opensm/osm_qos.c
index 1255169..f426241 100644
--- a/osm/opensm/osm_qos.c
+++ b/osm/opensm/osm_qos.c
@@ -195,7 +195,7 @@ static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port,
 	if (osm_node_get_type(osm_physp_get_node_ptr(p)) == IB_NODE_TYPE_SWITCH) {
 		if (ib_port_info_get_vl_cap(&p->port_info) == 1) {
 			/* Check port 0's capability mask */
-			p_physp = osm_port_get_default_phys_ptr(p_port);
+			p_physp = p_port->p_physp;
 			if (!(p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_SL_MAP))
 				return IB_SUCCESS;
 		}
@@ -353,7 +353,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm)
 		else
 			cfg = &ca_config;
 
-		p_physp = osm_port_get_default_phys_ptr(p_port);
+		p_physp = p_port->p_physp;
 		if (!osm_physp_is_valid(p_physp))
 			continue;
 
diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c
index 340a7f1..6109c5d 100644
--- a/osm/opensm/osm_sa_informinfo.c
+++ b/osm/opensm/osm_sa_informinfo.c
@@ -194,7 +194,7 @@ __validate_ports_access_rights(
     }
 
     /* get the destination InformInfo physical port */
-    p_physp = osm_port_get_default_phys_ptr(p_port);
+    p_physp = p_port->p_physp;
 
     /* make sure that the requester and destination port can access each other 
        according to the current partitioning. */
@@ -244,7 +244,7 @@ __validate_ports_access_rights(
       if ( p_port == NULL )
         continue;
 
-      p_physp = osm_port_get_default_phys_ptr(p_port);
+      p_physp = p_port->p_physp;
       /* make sure that the requester and destination port can access 
          each other according to the current partitioning. */
       if (! osm_physp_share_pkey( p_rcv->p_log, p_physp, p_requester_physp))
@@ -405,7 +405,7 @@ __osm_sa_inform_info_rec_by_comp_mask(
   }
 
   /* get the subscriber InformInfo physical port */
-  p_subscriber_physp = osm_port_get_default_phys_ptr(p_subscriber_port);
+  p_subscriber_physp = p_subscriber_port->p_physp;
   /* make sure that the requester and subscriber port can access each other 
      according to the current partitioning. */
   if (! osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_subscriber_physp ))
diff --git a/osm/opensm/osm_sa_lft_record.c b/osm/opensm/osm_sa_lft_record.c
index b6333e7..c5cd9ca 100644
--- a/osm/opensm/osm_sa_lft_record.c
+++ b/osm/opensm/osm_sa_lft_record.c
@@ -244,7 +244,7 @@ __osm_lftr_rcv_by_comp_mask(
 
   /* check that the requester physp and the current physp are under
      the same partition. */
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
   if (! p_physp)
   {
     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c
index 50c4f22..8241129 100644
--- a/osm/opensm/osm_sa_mcmember_record.c
+++ b/osm/opensm/osm_sa_mcmember_record.c
@@ -1570,7 +1570,7 @@ __osm_mcmr_rcv_join_mgrp(
     goto Exit;
   }
 
-  p_physp = osm_port_get_default_phys_ptr(p_port);
+  p_physp = p_port->p_physp;
   /* Check that the p_physp and the requester physp are in the same
      partition. */
   p_request_physp =
diff --git a/osm/opensm/osm_sa_mft_record.c b/osm/opensm/osm_sa_mft_record.c
index 005c9bd..7908583 100644
--- a/osm/opensm/osm_sa_mft_record.c
+++ b/osm/opensm/osm_sa_mft_record.c
@@ -250,7 +250,7 @@ __osm_mftr_rcv_by_comp_mask(
 
   /* check that the requester physp and the current physp are under
      the same partition. */
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
   if (! p_physp)
   {
     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
diff --git a/osm/opensm/osm_sa_multipath_record.c b/osm/opensm/osm_sa_multipath_record.c
index 0c5643e..06640d9 100644
--- a/osm/opensm/osm_sa_multipath_record.c
+++ b/osm/opensm/osm_sa_multipath_record.c
@@ -154,7 +154,7 @@ __osm_sa_multipath_rec_is_tavor_port(
   osm_node_t const* p_node;
   ib_net32_t vend_id;
 
-  p_node = osm_port_get_parent_node( p_port );
+  p_node = p_port->p_node;
   vend_id = ib_node_info_get_vendor_id( &p_node->node_info );
 
   return( (p_node->node_info.device_id == CL_HTON16(23108)) &&
@@ -255,8 +255,8 @@ __osm_mpr_rcv_get_path_parms(
 
   dest_lid = cl_hton16( dest_lid_ho );
 
-  p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port );
-  p_physp = osm_port_get_default_phys_ptr( p_src_port );
+  p_dest_physp = p_dest_port->p_physp;
+  p_physp = p_src_port->p_physp;
   p_pi = &p_physp->port_info;
 
   mtu = ib_port_info_get_mtu_cap( p_pi );
@@ -744,8 +744,8 @@ __osm_mpr_rcv_build_pr(
 
   OSM_LOG_ENTER( p_rcv->p_log, __osm_mpr_rcv_build_pr );
 
-  p_src_physp = osm_port_get_default_phys_ptr( p_src_port );
-  p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port );
+  p_src_physp = p_src_port->p_physp;
+  p_dest_physp = p_dest_port->p_physp;
 
   p_pr->dgid.unicast.prefix = osm_physp_get_subnet_prefix( p_dest_physp );
   p_pr->dgid.unicast.interface_id = osm_physp_get_port_guid( p_dest_physp );
diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c
index 1b0f89f..47d9c33 100644
--- a/osm/opensm/osm_sa_path_record.c
+++ b/osm/opensm/osm_sa_path_record.c
@@ -171,7 +171,7 @@ __osm_sa_path_rec_is_tavor_port(
   osm_node_t const* p_node;
   ib_net32_t vend_id;
 
-  p_node = osm_port_get_parent_node( p_port );
+  p_node = p_port->p_node;
   vend_id = ib_node_info_get_vendor_id( &p_node->node_info );
 	
   return( (p_node->node_info.device_id == CL_HTON16(23108)) &&
@@ -268,8 +268,8 @@ __osm_pr_rcv_get_path_parms(
 
   dest_lid = cl_hton16( dest_lid_ho );
 
-  p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port );
-  p_physp = osm_port_get_default_phys_ptr( p_src_port );
+  p_dest_physp = p_dest_port->p_physp;
+  p_physp = p_src_port->p_physp;
   p_pi = &p_physp->port_info;
 
   mtu = ib_port_info_get_mtu_cap( p_pi );
@@ -753,9 +753,9 @@ __osm_pr_rcv_build_pr(
 
   OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_build_pr );
 
-  p_src_physp = osm_port_get_default_phys_ptr( p_src_port );
+  p_src_physp = p_src_port->p_physp;
 #ifndef ROUTER_EXP
-  p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port );
+  p_dest_physp = p_dest_port->p_physp;
 
   p_pr->dgid.unicast.prefix = osm_physp_get_subnet_prefix( p_dest_physp );
   p_pr->dgid.unicast.interface_id = osm_physp_get_port_guid( p_dest_physp );
@@ -770,7 +770,7 @@ __osm_pr_rcv_build_pr(
     p_pr->dgid = *p_dgid;
   else
   {
-    p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port );
+    p_dest_physp = p_dest_port->p_physp;
 
     p_pr->dgid.unicast.prefix = osm_physp_get_subnet_prefix( p_dest_physp );
     p_pr->dgid.unicast.interface_id = osm_physp_get_port_guid( p_dest_physp );
diff --git a/osm/opensm/osm_sa_service_record.c b/osm/opensm/osm_sa_service_record.c
index b23a12d..eff0b0a 100644
--- a/osm/opensm/osm_sa_service_record.c
+++ b/osm/opensm/osm_sa_service_record.c
@@ -213,7 +213,7 @@ __match_service_pkey_with_ports_pkey(
       /* check on the table of the default physical port of the service port */
       if ( !osm_physp_has_pkey( p_rcv->p_log,
                                 p_service_rec->service_pkey,
-                                osm_port_get_default_phys_ptr(service_port) ) )
+                                service_port->p_physp ) )
       {
         valid = FALSE;
         goto Exit;
diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c
index 168901e..e40ad61 100644
--- a/osm/opensm/osm_sa_slvl_record.c
+++ b/osm/opensm/osm_sa_slvl_record.c
@@ -226,7 +226,7 @@ __osm_sa_slvl_by_comp_mask(
              "__osm_sa_slvl_by_comp_mask:  "
              "Using Physical Default Port Number: 0x%X (for End Node)\n",
              p_port->p_physp->port_num );
-    p_out_physp = osm_port_get_default_phys_ptr( p_port );
+    p_out_physp = p_port->p_physp;
     /* check that the p_out_physp and the p_req_physp share a pkey */
     if (osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_out_physp ))
     __osm_sa_slvl_create( p_rcv, p_out_physp, p_ctxt, 0 );
diff --git a/osm/opensm/osm_sa_sminfo_record.c b/osm/opensm/osm_sa_sminfo_record.c
index 5e15f52..8c343b4 100644
--- a/osm/opensm/osm_sa_sminfo_record.c
+++ b/osm/opensm/osm_sa_sminfo_record.c
@@ -374,7 +374,7 @@ osm_smir_rcv_process(
     {
       if (FALSE ==
           osm_physp_share_pkey( p_rcv->p_log, p_req_physp,
-                                osm_port_get_default_phys_ptr( local_port ) ) )
+                                local_port->p_physp ) )
       {
         cl_plock_release( p_rcv->p_lock );
         osm_log( p_rcv->p_log, OSM_LOG_ERROR,
diff --git a/osm/opensm/osm_sa_sw_info_record.c b/osm/opensm/osm_sa_sw_info_record.c
index da65864..94b1ff9 100644
--- a/osm/opensm/osm_sa_sw_info_record.c
+++ b/osm/opensm/osm_sa_sw_info_record.c
@@ -245,7 +245,7 @@ __osm_sir_rcv_create_sir(
 
   /* check that the requester physp and the current physp are under
      the same partition. */
-  p_physp = osm_port_get_default_phys_ptr( p_port );
+  p_physp = p_port->p_physp;
   if (! p_physp)
   {
     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
diff --git a/osm/opensm/osm_slvl_map_rcv.c b/osm/opensm/osm_slvl_map_rcv.c
index b109f75..3352627 100644
--- a/osm/opensm/osm_slvl_map_rcv.c
+++ b/osm/opensm/osm_slvl_map_rcv.c
@@ -170,7 +170,7 @@ osm_slvl_rcv_process(
     goto Exit;
   }
 
-  p_node = osm_port_get_parent_node( p_port );
+  p_node = p_port->p_node;
   CL_ASSERT( p_node );
 
   /* in case of a non switch node the attr modifier should be ignored */
@@ -182,7 +182,7 @@ osm_slvl_rcv_process(
   }
   else
   {
-    p_physp = osm_port_get_default_phys_ptr(p_port);
+    p_physp = p_port->p_physp;
     out_port_num = p_physp->port_num;
     in_port_num  = 0;
   }
diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c
index 0034320..51df1df 100644
--- a/osm/opensm/osm_sm_state_mgr.c
+++ b/osm/opensm/osm_sm_state_mgr.c
@@ -192,7 +192,7 @@ __osm_sm_state_mgr_send_local_port_info_req(
 
    status = osm_req_get( p_sm_mgr->p_req,
                          osm_physp_get_dr_path_ptr
-                         ( osm_port_get_default_phys_ptr( p_port ) ),
+                         ( p_port->p_physp ),
                          IB_MAD_ATTR_PORT_INFO,
                          cl_hton32( p_port->p_physp->port_num ),
                          CL_DISP_MSGID_NONE, &context );
@@ -261,8 +261,7 @@ __osm_sm_state_mgr_send_master_sm_info_req(
    context.smi_context.set_method = FALSE;
 
    status = osm_req_get( p_sm_mgr->p_req,
-                         osm_physp_get_dr_path_ptr
-                         ( osm_port_get_default_phys_ptr( p_port ) ),
+                         osm_physp_get_dr_path_ptr(p_port->p_physp),
                          IB_MAD_ATTR_SM_INFO, 0, CL_DISP_MSGID_NONE,
                          &context );
 
diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c
index 9aeec74..6681cfc 100644
--- a/osm/opensm/osm_state_mgr.c
+++ b/osm/opensm/osm_state_mgr.c
@@ -849,7 +849,7 @@ __osm_state_mgr_is_sm_port_down(
       goto Exit;
    }
 
-   p_physp = osm_port_get_default_phys_ptr( p_port );
+   p_physp = p_port->p_physp;
 
    CL_ASSERT( p_physp );
    CL_ASSERT( osm_physp_is_valid( p_physp ) );
@@ -914,7 +914,7 @@ __osm_state_mgr_sweep_hop_1(
       goto Exit;
    }
 
-   p_node = osm_port_get_parent_node( p_port );
+   p_node = p_port->p_node;
    CL_ASSERT( p_node );
 
    port_num = ib_node_info_get_local_port_num( &p_node->node_info );
@@ -1277,7 +1277,7 @@ __osm_state_mgr_report(
                   cl_ntoh64( osm_port_get_guid( p_port ) ) );
       }
 
-      p_node = osm_port_get_parent_node( p_port );
+      p_node = p_port->p_node;
       node_type = osm_node_get_type( p_node );
       if( node_type == IB_NODE_TYPE_SWITCH )
          start_port = 0;
@@ -1622,9 +1622,8 @@ __osm_state_mgr_send_handover(
    }
 
    status = osm_req_set( p_mgr->p_req,
-                         osm_physp_get_dr_path_ptr
-                         ( osm_port_get_default_phys_ptr( p_port ) ), payload,
-                         sizeof(payload),
+                         osm_physp_get_dr_path_ptr(p_port->p_physp),
+                         payload, sizeof(payload),
                          IB_MAD_ATTR_SM_INFO, IB_SMINFO_ATTR_MOD_HANDOVER,
                          CL_DISP_MSGID_NONE, &context );
 
diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
index 8e0c53b..0484530 100644
--- a/osm/opensm/osm_subnet.c
+++ b/osm/opensm/osm_subnet.c
@@ -238,7 +238,6 @@ osm_get_gid_by_mad_addr(
 {
   const cl_ptr_vector_t*  p_tbl;
   const osm_port_t*       p_port = NULL;
-  const osm_physp_t*      p_physp = NULL;
 
   if ( p_gid == NULL ) 
   {
@@ -266,8 +265,7 @@ osm_get_gid_by_mad_addr(
                );
       return(IB_INVALID_PARAMETER);
     }
-    p_physp = osm_port_get_default_phys_ptr( p_port );
-    p_gid->unicast.interface_id = p_physp->port_guid;
+    p_gid->unicast.interface_id = p_port->p_physp->port_guid;
     p_gid->unicast.prefix = p_subn->opt.subnet_prefix;
   }
   else
@@ -316,7 +314,7 @@ osm_get_physp_by_mad_addr(
     
       goto Exit;
     }
-    p_physp = osm_port_get_default_phys_ptr( p_port );
+    p_physp = p_port->p_physp;
   }
   else
   {
diff --git a/osm/opensm/osm_switch.c b/osm/opensm/osm_switch.c
index 9273459..a79f5cd 100644
--- a/osm/opensm/osm_switch.c
+++ b/osm/opensm/osm_switch.c
@@ -291,7 +291,7 @@ osm_switch_recommend_path(
   }
   else
   {
-    p_physp = osm_port_get_default_phys_ptr(p_port);
+    p_physp = p_port->p_physp;
     if (!p_physp || !p_physp->p_remote_physp ||
         !p_physp->p_remote_physp->p_node->sw)
       return OSM_NO_PATH;
@@ -566,7 +566,7 @@ osm_switch_get_port_least_hops(
   }
   else
   {
-    osm_physp_t *p = osm_port_get_default_phys_ptr(p_port);
+    osm_physp_t *p = p_port->p_physp;
     uint8_t hops;
 
     if (!p || !p->p_remote_physp || !p->p_remote_physp->p_node->sw)
@@ -604,7 +604,7 @@ osm_switch_recommend_mcast_path(
   }
   else
   {
-    osm_physp_t *p_physp = osm_port_get_default_phys_ptr(p_port);
+    osm_physp_t *p_physp = p_port->p_physp;
     if (!p_physp || !p_physp->p_remote_physp ||
         !p_physp->p_remote_physp->p_node->sw)
       return OSM_NO_PATH;
diff --git a/osm/opensm/osm_ucast_lash.c b/osm/opensm/osm_ucast_lash.c
index 4459d9f..5d32e89 100644
--- a/osm/opensm/osm_ucast_lash.c
+++ b/osm/opensm/osm_ucast_lash.c
@@ -170,7 +170,7 @@ static uint64_t osm_lash_get_switch_guid(IN const osm_switch_t *p_sw)
 
 static osm_switch_t *get_osm_switch_from_port(osm_port_t *port)
 {
-	osm_physp_t *p = osm_port_get_default_phys_ptr(port);
+	osm_physp_t *p = port->p_physp;
 	if (p->p_node->sw)
 		return p->p_node->sw;
 	else if (p->p_remote_physp->p_node->sw)
diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
index 7d3916b..2860e66 100644
--- a/osm/opensm/osm_ucast_mgr.c
+++ b/osm/opensm/osm_ucast_mgr.c
@@ -306,7 +306,7 @@ __osm_ucast_mgr_dump_ucast_routes(
     }
     else
     {
-      osm_physp_t *p_physp = osm_port_get_default_phys_ptr(p_port);
+      osm_physp_t *p_physp = p_port->p_physp;
       if( !p_physp || !p_physp->p_remote_physp ||
           !p_physp->p_remote_physp->p_node->sw )
         num_hops = OSM_NO_PATH;
@@ -413,7 +413,7 @@ ucast_mgr_dump_lfts(cl_map_item_t *p_map_item, void *cxt)
 
 		p_port = cl_ptr_vector_get(&p_mgr->p_subn->port_lid_tbl, lid);
 		if (p_port) {
-			p_node = osm_port_get_parent_node(p_port);
+			p_node = p_port->p_node;
 			fprintf(file, "%s portguid 0x016%" PRIx64 ": \'%s\'",
 				ib_get_node_type_str(osm_node_get_type(p_node)),
 				cl_ntoh64(osm_port_get_guid(p_port)),
@@ -671,8 +671,7 @@ __osm_ucast_mgr_process_port(
       if (!p_mgr->p_subn->opt.port_profile_switch_nodes)
       {
 	is_ignored_by_port_prof |=
-	  (osm_node_get_type(osm_port_get_parent_node(p_port)) ==
-	   IB_NODE_TYPE_SWITCH);
+	  (osm_node_get_type(p_port->p_node) == IB_NODE_TYPE_SWITCH);
       }
     }
 
diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c
index b15fe5e..d9446e9 100644
--- a/osm/opensm/osm_ucast_updn.c
+++ b/osm/opensm/osm_ucast_updn.c
@@ -792,7 +792,7 @@ __osm_updn_find_root_nodes_by_min_hop(
     p_next_port = (osm_port_t*)cl_qmap_next( &p_next_port->map_item );
     if ( osm_node_get_type(p_port->p_node) != IB_NODE_TYPE_SWITCH )
     {
-      p_physp = osm_port_get_default_phys_ptr(p_port);
+      p_physp = p_port->p_physp;
       self_lid_ho = cl_ntoh16( osm_physp_get_base_lid(p_physp) );
       numCas++;
       /* EZ:
diff --git a/osm/opensm/osm_vl_arb_rcv.c b/osm/opensm/osm_vl_arb_rcv.c
index ed8dfc5..f36751e 100644
--- a/osm/opensm/osm_vl_arb_rcv.c
+++ b/osm/opensm/osm_vl_arb_rcv.c
@@ -171,7 +171,7 @@ osm_vla_rcv_process(
     goto Exit;
   }
 
-  p_node = osm_port_get_parent_node( p_port );
+  p_node = p_port->p_node;
   CL_ASSERT( p_node );
 
   block_num = (uint8_t)(cl_ntoh32(p_smp->attr_mod) >> 16);
@@ -183,7 +183,7 @@ osm_vla_rcv_process(
   }
   else
   {
-    p_physp = osm_port_get_default_phys_ptr(p_port);
+    p_physp = p_port->p_physp;
     port_num = p_physp->port_num;
   }
 
@@ -239,4 +239,3 @@ osm_vla_rcv_process(
 
   OSM_LOG_EXIT( p_rcv->p_log );
 }
-
-- 
1.5.2.rc2.20.gac2a
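
The refactor in the patch above can be sketched in isolation. This is a minimal illustration, not OpenSM code: physp_t, port_t, physp_is_valid(), port_get_default_physp() and port_num_or_error() are hypothetical, heavily reduced stand-ins for osm_physp_t, osm_port_t and the helpers touched by the diff. It assumes the only behavioural difference is that the removed accessor asserted validity, so callers that read p_port->p_physp directly keep an explicit NULL/validity check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for osm_physp_t and osm_port_t. */
typedef struct physp {
	int  port_num;
	bool valid;
} physp_t;

typedef struct port {
	physp_t *p_physp;
} port_t;

/* Stand-in validity test; expects a non-NULL pointer. */
static bool physp_is_valid(const physp_t *p_physp)
{
	return p_physp->valid;
}

/* Old style: a trivial accessor that asserts before returning the field. */
static physp_t *port_get_default_physp(const port_t *p_port)
{
	assert(p_port->p_physp != NULL && physp_is_valid(p_port->p_physp));
	return p_port->p_physp;
}

/* New style: read the field directly and check it explicitly,
 * mirroring the combined NULL/validity test in the patch.
 * Returns -1 when there is no usable physical port. */
static int port_num_or_error(const port_t *p_port)
{
	physp_t *p_physp = p_port->p_physp;

	if (p_physp == NULL || !physp_is_valid(p_physp))
		return -1;
	return p_physp->port_num;
}
```

With a port whose p_physp is NULL, port_num_or_error() returns -1, whereas the old-style accessor would trip its assertion when assertions are enabled.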


From jsquyres at cisco.com  Thu May 10 14:10:52 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 10 May 2007 17:10:52 -0400
Subject: [ofa-general] Fwd: [Netpipe] NetPIPE-3.7.1 released!
References: <46434F9E.6060206@scl.ameslab.gov>
Message-ID: <011D5F75-0AF2-41E8-A876-FD688D845066@cisco.com>

FYI.

Begin forwarded message:

> From: Troy Benjegerdes <troy at scl.ameslab.gov>
> Date: May 10, 2007 1:00:14 PM EDT
> To: netpipe at source.iprt.iastate.edu
> Subject: [Netpipe] NetPIPE-3.7.1 released!
>
> NetPIPE-3.7.1 has been released. See the trac wiki
>
> http://source.scl.ameslab.gov/trac/netpipe
>
> and the direct download link
>
> http://source.scl.ameslab.gov/NetPIPE/NetPIPE-3.7.1.tar.gz
>
>
>
>
> The major change from NetPIPE-3.7 is the ability of the ibv module to
> select the OpenFabrics adapter and port to use.
>
> FYI, for those of you wanting to send me patches against NetPIPE, I am
> planning on doing a large cosmetic code clean up to indent all the
> source files consistently. This means you will have some patch cleanup
> to do, so either send your patches now, or plan on some manual work to
> get them to re-apply.
>
> My preference is to follow the Linux coding style, with the exception of
> using 4-space tabs. If you have a strong opinion about coding style,
> speak now or forever hold your peace ;)
> _______________________________________________
> Netpipe mailing list
> Netpipe at lists.scl.ameslab.gov
> https://lists.scl.ameslab.gov/cgi-bin/mailman/listinfo/netpipe


-- 
Jeff Squyres
Cisco Systems


From pradeeps at linux.vnet.ibm.com  Thu May 10 14:26:10 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Thu, 10 May 2007 14:26:10 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <4641E99B.10706@linux.vnet.ibm.com>
References: <4641E99B.10706@linux.vnet.ibm.com>
Message-ID: <46438DF2.3080601@linux.vnet.ibm.com>

If there are no other issues than the small restructure suggestion that
Michael had, can this patch be merged into the for-2.6.22 tree?

Pradeep


From rdreier at cisco.com  Thu May 10 14:31:36 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 10 May 2007 14:31:36 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <46438DF2.3080601@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Thu, 10 May 2007 14:26:10 -0700")
References: <4641E99B.10706@linux.vnet.ibm.com>
	<46438DF2.3080601@linux.vnet.ibm.com>
Message-ID: <adar6poz7pj.fsf@cisco.com>

I need to read over the whole thread when I get back from my
trip... so it will be next week at the earliest.


From rdreier at cisco.com  Thu May 10 14:32:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 10 May 2007 14:32:12 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <46438DF2.3080601@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Thu, 10 May 2007 14:26:10 -0700")
References: <4641E99B.10706@linux.vnet.ibm.com>
	<46438DF2.3080601@linux.vnet.ibm.com>
Message-ID: <adamz0cz7oj.fsf@cisco.com>

 > If there are no other issues than the small restructure suggestion that
 > Michael had, can this patch be merged into the for-2.6.22 tree?

by the way, have you addressed that suggestion?


From xma at us.ibm.com  Thu May 10 14:34:49 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 10 May 2007 14:34:49 -0700
Subject: [ofa-general] RFC: location for IB CM statistics
In-Reply-To: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com>
Message-ID: <OFF95D0839.00890D76-ON872572D7.007602C0-882572D7.00767113@us.ibm.com>


Another place is /proc. Networking uses /proc/net for statistics, can we
have something like /proc/infiniband?

Thanks
Shirley Ma

From pradeeps at linux.vnet.ibm.com  Thu May 10 14:49:01 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Thu, 10 May 2007 14:49:01 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <adamz0cz7oj.fsf@cisco.com>
References: <4641E99B.10706@linux.vnet.ibm.com>
	<46438DF2.3080601@linux.vnet.ibm.com> <adamz0cz7oj.fsf@cisco.com>
Message-ID: <4643934D.4080306@linux.vnet.ibm.com>

Roland Dreier wrote:
>  > If there are no other issues than the small restructure suggestion that
>  > Michael had, can this patch be merged into the for-2.6.22 tree?
> 
> by the way, have you addressed that suggestion?
> 

In the submitted patch I have a conditional branch in
ipoib_cm_handle_rx_wc(), to handle_rx_wc_srq() or
handle_rx_wc_nosrq(), and he wanted me to have separate
handlers for the SRQ and NOSRQ case. Changing the code to do
that would make ipoib_poll() very messy and so I have not done
that.

I feel that should not stand in the way of merging this patch.
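
The dispatch described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: struct rx_ctx, struct rx_wc, the has_srq flag, the counters and the name cm_handle_rx_wc() are hypothetical stand-ins. Only the handler names handle_rx_wc_srq() and handle_rx_wc_nosrq() come from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receive context; the real driver structures are not shown here. */
struct rx_ctx {
	bool has_srq;        /* assumed flag: true when an SRQ is in use */
	int  srq_handled;    /* counters exist only to make the sketch observable */
	int  nosrq_handled;
};

/* Hypothetical work completion; only an id is modelled. */
struct rx_wc {
	unsigned int wr_id;
};

static void handle_rx_wc_srq(struct rx_ctx *ctx, const struct rx_wc *wc)
{
	(void)wc;
	ctx->srq_handled++;
}

static void handle_rx_wc_nosrq(struct rx_ctx *ctx, const struct rx_wc *wc)
{
	(void)wc;
	ctx->nosrq_handled++;
}

/* Single entry point: the poll loop calls this and does not need to
 * know which receive mode is active. */
static void cm_handle_rx_wc(struct rx_ctx *ctx, const struct rx_wc *wc)
{
	assert(ctx != NULL && wc != NULL);

	if (ctx->has_srq)
		handle_rx_wc_srq(ctx, wc);
	else
		handle_rx_wc_nosrq(ctx, wc);
}
```

The alternative of two separate top-level handlers would move this branch into the caller, which is the change to ipoib_poll() that the message argues against.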

Pradeep


From pradeeps at linux.vnet.ibm.com  Thu May 10 14:50:45 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Thu, 10 May 2007 14:50:45 -0700
Subject: [ofa-general] RFC: location for IB CM statistics
In-Reply-To: <OFF95D0839.00890D76-ON872572D7.007602C0-882572D7.00767113@us.ibm.com>
References: <OFF95D0839.00890D76-ON872572D7.007602C0-882572D7.00767113@us.ibm.com>
Message-ID: <464393B5.9030109@linux.vnet.ibm.com>

Shirley Ma wrote:
> Another place is /proc. Networking uses /proc/net for statistics, can we 
> have something like /proc/infiniband?
> 
> Thanks
> Shirley Ma
> 

My understanding is that the current thinking is /proc is for process
related info and the rest goes into sysfs.

Pradeep


From rick.jones2 at hp.com  Thu May 10 15:00:19 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Thu, 10 May 2007 15:00:19 -0700
Subject: [ofa-general] is there an OFED way to putt VPD from an HCA?
In-Reply-To: <46436275.80406@systemfabricworks.com>
References: <46434FF1.9020005@hp.com>	<A15335FBE9BD2449AF2C9EF3D1EB8EA303821207@xmb-sjc-216.amer.cisco.com>
	<46436275.80406@systemfabricworks.com>
Message-ID: <464395F3.9080004@hp.com>

Seems that none of those utilities went into Debian, which was the base distro I 
installed, and then on top of which I put the 2.6.21.1 kernel.  Of course I'm 
still having that "gcc rpm" not found issue trying to grab the whole OFED 1.2 
from 5/10, and an attempt to compile the ofa_kernel from 5.10 ended-up with some 
asm related errors which sadly I've not saved, but could I suppose recreate.

At this point I'm not sure whether I need to lay down a fresh set of kernel 
sources to allow things to patch correctly.

rick jones


From raleigh at systemfabricworks.com  Thu May 10 15:11:14 2007
From: raleigh at systemfabricworks.com (Raleigh F Rinehart)
Date: Thu, 10 May 2007 17:11:14 -0500
Subject: [ofa-general] is there an OFED way to putt VPD from an HCA?
In-Reply-To: <464395F3.9080004@hp.com>
References: <46434FF1.9020005@hp.com>	<A15335FBE9BD2449AF2C9EF3D1EB8EA303821207@xmb-sjc-216.amer.cisco.com>
	<46436275.80406@systemfabricworks.com> <464395F3.9080004@hp.com>
Message-ID: <46439882.7070601@systemfabricworks.com>

I don't know about the other issues as I haven't tried installing on 
Debian, but on my SLES 10 machine I had to build mstflint by hand from 
the sources in the OFED-1.1 release tarball.   The standard ofed 
install.sh script wouldn't build if I included it.
A simple 'make' run in 
"OFED-1.1/SOURCES/openib-1.1/src/userspace/mstflint"  worked just fine 
for me.

thanks,
-raleigh


Rick Jones wrote:
> Seems that none of those utilities went into Debian, which was the 
> base distro I installed, and then on top of which I put the 2.6.21.1 
> kernel.  Of course I'm still having that "gcc rpm" not found issue 
> trying to grab the whole OFED 1.2 from 5/10, and an attempt to compile 
> the ofa_kernel from 5.10 ended-up with some asm related errors which 
> sadly I've not saved, but could I suppose recreate.
>
> At this point I'm not sure if I don't need to lay-down a fresh set of 
> kernel sources to allow things to patch correctly.
>
> rick jones
>
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 3285 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070510/dc0ce654/attachment.bin>

From jimmy at hillraiser.com  Thu May 10 15:33:08 2007
From: jimmy at hillraiser.com (Jimmy Hill)
Date: Thu, 10 May 2007 22:33:08 +0000
Subject: [ofa-general] verbs abi_compat
Message-ID: <20070510223308.11276.qmail@station183.com>

>  
>  abi_compat has nothing to do with __ibv_alloc_pd vs. __ibv_alloc_pd_1_0.
>  Rather, that choice is made based on whether your app is linked
>  against the IBVERBS_1.1 or IBVERBS_1.0 ABI.  If you link against the
>  new library, you should get all IBVERBS_1.1 symbols; if you link
>  against libibverbs 1.0, you should get all IBVERBS_1.0 symbols.
>  
>  Your problem might be that your app is getting __ibv_alloc_pd_1_0, but
>  it gets __ibv_open_device instead of __ibv_open_device_1_0 so the
>  context passed into __ibv_alloc_pd_1_0 is wrong.  Are you possibly
>  relinking only part of your app or something?
>  

In my makefile I was linking against ibverbs (libibverbs.so.1) and
rdmacm (librdmacm.so.1). In the code, I create an RDMA CM ID and use the
context from it to create my other IB resources (e.g., PDs). It is a
long story why this is necessary, but suffice it to say I am doing
something similar to the uDAPL implementation.

I initially thought I had a compatibility issue between the ibverbs I
was linking against and the ibverbs that rdmacm was using, but there was
only one copy of each on my system. That is what pushed me down the
abi_compat path. It turns out that another module in my exe was linking
with the DAT library, which apparently pulled in the 1.0 verbs and thus
the compat code. I removed the reference to the DAT library, and now my
context from the RDMA CM ID gets me to the correct verbs. Thanks for
your help!


From sashak at voltaire.com  Thu May 10 15:43:08 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 11 May 2007 01:43:08 +0300
Subject: [ofa-general] [PATCH] opensm: more osm_*_construct/_init/_destroy
	cleanups
In-Reply-To: <20070509212740.GV9692@sashak.voltaire.com>
References: <20070506174138.GI9692@sashak.voltaire.com>
	<20070506174431.GJ9692@sashak.voltaire.com>
	<1178543690.32222.350646.camel@hal.voltaire.com>
	<20070509212740.GV9692@sashak.voltaire.com>
Message-ID: <20070510224308.GH9692@sashak.voltaire.com>

Hi Hal,

As suggested :)


This removes the unused osm_*_construct/_init/_destroy initializers
(or makes them static) for OpenSM objects where osm_*_new/_delete are
actually used.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/include/opensm/osm_inform.h      |   57 ++---------------
 osm/include/opensm/osm_lin_fwd_tbl.h |    5 +-
 osm/include/opensm/osm_mcm_info.h    |   93 +--------------------------
 osm/include/opensm/osm_mcm_port.h    |  117 +++-------------------------------
 osm/include/opensm/osm_mtree.h       |   88 -------------------------
 osm/include/opensm/osm_multicast.h   |  104 ++----------------------------
 osm/include/opensm/osm_router.h      |  106 ++----------------------------
 osm/include/opensm/osm_service.h     |   39 ++---------
 osm/opensm/osm_inform.c              |   32 +--------
 osm/opensm/osm_mcast_mgr.c           |    2 +-
 osm/opensm/osm_mcm_info.c            |   23 +------
 osm/opensm/osm_mcm_port.c            |   28 +--------
 osm/opensm/osm_mtree.c               |    4 +-
 osm/opensm/osm_multicast.c           |   18 +----
 osm/opensm/osm_router.c              |   46 +------------
 osm/opensm/osm_sa_mcmember_record.c  |    4 +-
 osm/opensm/osm_sa_service_record.c   |    4 +-
 osm/opensm/osm_service.c             |   13 +---
 osm/opensm/osm_subnet.c              |    4 +-
 19 files changed, 64 insertions(+), 723 deletions(-)
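
As a reading aid, the _new/_delete pairing that this cleanup converges on
can be sketched as below. It is loosely modelled on the
osm_mcm_info_new()/osm_mcm_info_delete() hunks further down; the demo_
names and the reserved field are hypothetical and are not part of OpenSM.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a small OpenSM object. */
typedef struct demo_mcm_info {
	uint16_t mlid;
	int      reserved;  /* exists only to show that _new zeroes the object */
} demo_mcm_info_t;

/* Allocate, zero and initialize in one step, so callers never see a
 * half-constructed object. Returns NULL if the allocation fails. */
demo_mcm_info_t *demo_mcm_info_new(uint16_t mlid)
{
	demo_mcm_info_t *p_mcm = malloc(sizeof(*p_mcm));

	if (p_mcm) {
		memset(p_mcm, 0, sizeof(*p_mcm));
		p_mcm->mlid = mlid;
	}
	return p_mcm;
}

/* Counterpart of demo_mcm_info_new(); the pointer must not be used
 * after this call. */
void demo_mcm_info_delete(demo_mcm_info_t *p_mcm)
{
	assert(p_mcm != NULL);
	free(p_mcm);
}
```

Folding construct and init into _new, and destroy into _delete, leaves
one obvious way to create and release each object, which is presumably
why the separate public initializers could be dropped or made static.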

diff --git a/osm/include/opensm/osm_inform.h b/osm/include/opensm/osm_inform.h
index 3e8e122..57ab05c 100644
--- a/osm/include/opensm/osm_inform.h
+++ b/osm/include/opensm/osm_inform.h
@@ -154,57 +154,12 @@ osm_infr_new(
 *	Allows calling other service record methods.
 *
 * SEE ALSO
-*	Inform Record, osm_infr_construct, osm_infr_destroy
+*	Inform Record, osm_infr_delete
 *********/
 
-/****f* OpenSM: Inform Record/osm_infr_init
+/****f* OpenSM: Inform Record/osm_infr_delete
 * NAME
-*	osm_infr_new
-*
-* DESCRIPTION
-*	Initializes the osm_infr_t structure.
-*
-* SYNOPSIS
-*/
-void
-osm_infr_init(
-	IN osm_infr_t* const p_infr,
-	IN const osm_infr_t *p_infr_rec );
-/*
-* PARAMETERS
-*	p_infr
-*		[in] Pointer to osm_infr_t structure
-*	p_inf_rec
-*		[in] Pointer to the ib_inform_info_record_t
-*
-* SEE ALSO
-*	Inform Record, osm_infr_construct, osm_infr_destroy
-*********/
-
-/****f* OpenSM: Inform Record/osm_infr_construct
-* NAME
-*	osm_infr_construct
-*
-* DESCRIPTION
-*	Constructs the osm_infr_t structure.
-*
-* SYNOPSIS
-*/
-void
-osm_infr_construct(
-	IN osm_infr_t* const p_infr );
-/*
-* PARAMETERS
-*	p_infr
-*		[in] Pointer to osm_infr_t structure
-*
-* SEE ALSO
-*	Inform Record, osm_infr_construct, osm_infr_destroy
-*********/
-
-/****f* OpenSM: Inform Record/osm_infr_destroy
-* NAME
-*	osm_infr_destroy
+*	osm_infr_delete
 *
 * DESCRIPTION
 *	Constructs the osm_infr_t structure.
@@ -212,7 +167,7 @@ osm_infr_construct(
 * SYNOPSIS
 */
 void
-osm_infr_destroy(
+osm_infr_delete(
 	IN osm_infr_t* const p_infr );
 /*
 * PARAMETERS
@@ -220,7 +175,7 @@ osm_infr_destroy(
 *		[in] Pointer to osm_infr_t structure
 *
 * SEE ALSO
-*	Inform Record, osm_infr_construct, osm_infr_destroy
+*	Inform Record, osm_infr_new
 *********/
 
 /****f* OpenSM: Inform Record/osm_infr_get_by_rec
@@ -251,7 +206,7 @@ osm_infr_get_by_rec(
 * RETURN
 *	The matching osm_infr_t
 * SEE ALSO
-*	Inform Record, osm_infr_construct, osm_infr_destroy
+*	Inform Record, osm_infr_new, osm_infr_delete
 *********/
 
 void
diff --git a/osm/include/opensm/osm_lin_fwd_tbl.h b/osm/include/opensm/osm_lin_fwd_tbl.h
index e059020..26d0465 100644
--- a/osm/include/opensm/osm_lin_fwd_tbl.h
+++ b/osm/include/opensm/osm_lin_fwd_tbl.h
@@ -93,9 +93,8 @@ BEGIN_C_DECLS
 */
 typedef struct _osm_lin_fwd_tbl
 {
-	uint16_t				size;
-	uint8_t					port_tbl[1];
-
+	uint16_t	size;
+	uint8_t		port_tbl[1];
 } osm_lin_fwd_tbl_t;
 /*
 * FIELDS
diff --git a/osm/include/opensm/osm_mcm_info.h b/osm/include/opensm/osm_mcm_info.h
index b6f6ee2..48d61f5 100644
--- a/osm/include/opensm/osm_mcm_info.h
+++ b/osm/include/opensm/osm_mcm_info.h
@@ -79,9 +79,8 @@ BEGIN_C_DECLS
 */
 typedef struct _osm_mcm_info
 {
-	cl_list_item_t				list_item;
-	ib_net16_t					mlid;
-
+	cl_list_item_t	list_item;
+	ib_net16_t	mlid;
 } osm_mcm_info_t;
 /*
 * FIELDS
@@ -94,94 +93,6 @@ typedef struct _osm_mcm_info
 * SEE ALSO
 *********/
 
-/****f* OpenSM: Multicast Member Info/osm_mcm_info_construct
-* NAME
-*	osm_mcm_info_construct
-*
-* DESCRIPTION
-*	This function constructs a Multicast Member Info object.
-*
-* SYNOPSIS
-*/
-static inline void
-osm_mcm_info_construct(
-	IN osm_mcm_info_t* const p_mcm )
-{
-	memset( p_mcm, 0, sizeof(*p_mcm) );
-}
-/*
-* PARAMETERS
-*	p_mcm
-*		[in] Pointer to a Multicast Member Info object to construct.
-*
-* RETURN VALUE
-*	This function does not return a value.
-*
-* NOTES
-*
-* SEE ALSO
-*********/
-
-/****f* OpenSM: Multicast Member Info/osm_mcm_info_destroy
-* NAME
-*	osm_mcm_info_destroy
-*
-* DESCRIPTION
-*	The osm_mcm_info_destroy function destroys the object, releasing
-*	all resources.
-*
-* SYNOPSIS
-*/
-void
-osm_mcm_info_destroy(
-	IN osm_mcm_info_t* const p_mcm );
-/*
-* PARAMETERS
-*	p_mcm
-*		[in] Pointer to a Multicast Member Info object to destroy.
-*
-* RETURN VALUE
-*	This function does not return a value.
-*
-* NOTES
-*	Performs any necessary cleanup of the specified Multicast Member Info object.
-*	Further operations should not be attempted on the destroyed object.
-*	This function should only be called after a call to osm_mtree_construct or
-*	osm_mtree_init.
-*
-* SEE ALSO
-*	Multicast Member Info object, osm_mtree_construct, osm_mtree_init
-*********/
-
-/****f* OpenSM: Multicast Member Info/osm_mcm_info_init
-* NAME
-*	osm_mcm_info_init
-*
-* DESCRIPTION
-*	Initializes a Multicast Member Info object for use.
-*
-* SYNOPSIS
-*/
-void
-osm_mcm_info_init(
-	IN osm_mcm_info_t* const p_mcm,
-	IN const ib_net16_t mlid );
-/*
-* PARAMETERS
-*	p_mcm
-*		[in] Pointer to an osm_mcm_info_t object to initialize.
-*
-*	mlid
-*		[in] MLID value for this multicast group.
-*
-* RETURN VALUES
-*	None.
-*
-* NOTES
-*
-* SEE ALSO
-*********/
-
 /****f* OpenSM: Multicast Member Info/osm_mcm_info_new
 * NAME
 *	osm_mcm_info_new
diff --git a/osm/include/opensm/osm_mcm_port.h b/osm/include/opensm/osm_mcm_port.h
index df30b84..634b3c7 100644
--- a/osm/include/opensm/osm_mcm_port.h
+++ b/osm/include/opensm/osm_mcm_port.h
@@ -103,112 +103,13 @@ typedef struct _osm_mcm_port
 *	MCM Port Object
 *********/
 
-/****f* OpenSM: MCM Port Object/osm_mcm_port_construct
+/****f* OpenSM: MCM Port Object/osm_mcm_port_new
 * NAME
-*	osm_mcm_port_construct
+*	osm_mcm_port_new
 *
 * DESCRIPTION
-*	This function constructs a MCM Port object.
-*
-* SYNOPSIS
-*/
-void
-osm_mcm_port_construct(
-	IN osm_mcm_port_t* const p_mcm );
-/*
-* PARAMETERS
-*	p_mcm
-*		[in] Pointer to a MCM Port Object to construct.
-*
-* RETURN VALUE
-*	This function does not return a value.
-*
-* NOTES
-*	Allows calling osm_mcm_port_init, osm_mcm_port_destroy.
-*
-*	Calling osm_mcm_port_construct is a prerequisite to calling any other
-*	method except osm_mcm_port_init.
-*
-* SEE ALSO
-*	MCM Port Object, osm_mcm_port_init, osm_mcm_port_destroy
-*********/
-
-/****f* OpenSM: MCM Port Object/osm_mcm_port_destroy
-* NAME
-*	osm_mcm_port_destroy
-*
-* DESCRIPTION
-*	The osm_mcm_port_destroy function destroys a MCM Port Object, releasing
-*	all resources.
-*
-* SYNOPSIS
-*/
-void
-osm_mcm_port_destroy(
-	IN osm_mcm_port_t* const p_mcm );
-/*
-* PARAMETERS
-*	p_mcm
-*		[in] Pointer to a MCM Port Object to destroy.
-*
-* RETURN VALUE
-*	This function does not return a value.
-*
-* NOTES
-*	Performs any necessary cleanup of the specified MCM Port Object.
-*	Further operations should not be attempted on the destroyed object.
-*	This function should only be called after a call to
-*	osm_mcm_port_construct or osm_mcm_port_init.
-*
-* SEE ALSO
-*	MCM Port Object, osm_mcm_port_construct, osm_mcm_port_init
-*********/
-
-/****f* OpenSM: MCM Port Object/osm_mcm_port_init
-* NAME
-*	osm_mcm_port_init
-*
-* DESCRIPTION
-*	The osm_mcm_port_init function initializes a MCM Port Object for use.
-*
-* SYNOPSIS
-*/
-void
-osm_mcm_port_init(
-	IN osm_mcm_port_t* const p_mcm,
-	IN const ib_gid_t* const p_port_gid,
-	IN const uint8_t   scope_state,
-   IN const boolean_t proxy_join );
-/*
-* PARAMETERS
-*	p_mcm
-*		[in] Pointer to an osm_mcm_port_t object to initialize.
-*
-*	p_port_gid
-*		[in] Pointer to the GID of the port to add to the multicast group.
-*
-*	scope_state
-*		[in] scope state of the join request
-*
-*  proxy_join
-*     [in] proxy_join state analyzed from the request
-*
-* RETURN VALUES
-*	None.
-*
-* NOTES
-*	Allows calling other MCM Port Object methods.
-*
-* SEE ALSO
-*	MCM Port Object, osm_mcm_port_construct, osm_mcm_port_destroy,
-*********/
-
-/****f* OpenSM: MCM Port Object/osm_mcm_port_init
-* NAME
-*	osm_mcm_port_init
-*
-* DESCRIPTION
-*	The osm_mcm_port_init function initializes a MCM Port Object for use.
+*	The osm_mcm_port_new function allocates and initializes a
+*	MCM Port Object for use.
 *
 * SYNOPSIS
 */
@@ -234,15 +135,15 @@ osm_mcm_port_new(
 * NOTES
 *
 * SEE ALSO
-*	MCM Port Object, osm_mcm_port_construct, osm_mcm_port_destroy,
+*	MCM Port Object, osm_mcm_port_delete,
 *********/
 
-/****f* OpenSM: MCM Port Object/osm_mcm_port_destroy
+/****f* OpenSM: MCM Port Object/osm_mcm_port_delete
 * NAME
-*	osm_mcm_port_destroy
+*	osm_mcm_port_delete
 *
 * DESCRIPTION
-*	The osm_mcm_port_destroy function destroys and dellallocates an
+*	The osm_mcm_port_delete function destroys and dellallocates an
 *	MCM Port Object, releasing all resources.
 *
 * SYNOPSIS
@@ -261,7 +162,7 @@ osm_mcm_port_delete(
 * NOTES
 *
 * SEE ALSO
-*	MCM Port Object, osm_mcm_port_construct, osm_mcm_port_init
+*	MCM Port Object, osm_mcm_port_new
 *********/
 
 END_C_DECLS
diff --git a/osm/include/opensm/osm_mtree.h b/osm/include/opensm/osm_mtree.h
index d349dc8..aa02cbb 100644
--- a/osm/include/opensm/osm_mtree.h
+++ b/osm/include/opensm/osm_mtree.h
@@ -138,94 +138,6 @@ typedef struct _osm_mtree_node
 * SEE ALSO
 *********/
 
-/****f* OpenSM: Multicast Tree/osm_mtree_node_construct
-* NAME
-*	osm_mtree_node_construct
-*
-* DESCRIPTION
-*	This function constructs a Multicast Tree Node object.
-*
-* SYNOPSIS
-*/
-static inline void
-osm_mtree_node_construct(
-	IN osm_mtree_node_t* const p_mtn )
-{
-	memset( p_mtn, 0, sizeof(*p_mtn) );
-}
-/*
-* PARAMETERS
-*	p_mtn
-*		[in] Pointer to a Multicast Tree Node object to construct.
-*
-* RETURN VALUE
-*	This function does not return a value.
-*
-* NOTES
-*
-* SEE ALSO
-*********/
-
-/****f* OpenSM: Multicast Tree/osm_mtree_node_destroy
-* NAME
-*	osm_mtree_node_destroy
-*
-* DESCRIPTION
-*	The osm_mtree_node_destroy function destroys a node, releasing
-*	all resources.
-*
-* SYNOPSIS
-*/
-void
-osm_mtree_node_destroy(
-	IN osm_mtree_node_t* const p_mtn );
-/*
-* PARAMETERS
-*	p_mtn
-*		[in] Pointer to a Multicast Tree Node object to destroy.
-*
-* RETURN VALUE
-*	This function does not return a value.
-*
-* NOTES
-*	Performs any necessary cleanup of the specified Multicast Tree object.
-*	Further operations should not be attempted on the destroyed object.
-*	This function should only be called after a call to osm_mtree_construct or
-*	osm_mtree_init.
-*
-* SEE ALSO
-*	Multicast Tree object, osm_mtree_construct, osm_mtree_init
-*********/
-
-/****f* OpenSM: Multicast Tree/osm_mtree_node_init
-* NAME
-*	osm_mtree_node_init
-*
-* DESCRIPTION
-*	Initializes a Multicast Tree Node object for use.
-*
-* SYNOPSIS
-*/
-void
-osm_mtree_node_init(
-	IN osm_mtree_node_t* const p_mtn,
-	IN const osm_switch_t* const p_sw );
-/*
-* PARAMETERS
-*	p_mtn
-*		[in] Pointer to an osm_mtree_node_t object to initialize.
-*
-*	p_sw
-*		[in] Pointer to the switch represented by this node.
-*
-* RETURN VALUES
-*	None.
-*
-* NOTES
-*
-* SEE ALSO
-*********/
-
 /****f* OpenSM: Multicast Tree/osm_mtree_node_new
 * NAME
 *	osm_mtree_node_new
diff --git a/osm/include/opensm/osm_multicast.h b/osm/include/opensm/osm_multicast.h
index b247e01..13a6fd1 100644
--- a/osm/include/opensm/osm_multicast.h
+++ b/osm/include/opensm/osm_multicast.h
@@ -126,9 +126,9 @@ osm_get_mcast_req_type_str(
 */
 typedef struct osm_mcast_mgr_ctxt
 {
-  ib_net16_t						mlid;
-  osm_mcast_req_type_t        req_type;
-  ib_net64_t                  port_guid;
+	ib_net16_t		mlid;
+	osm_mcast_req_type_t	req_type;
+	ib_net64_t		port_guid;
 } osm_mcast_mgr_ctxt_t;
 /*
 * FIELDS
@@ -246,98 +246,6 @@ typedef	void (*osm_mgrp_func_t)(
 * SEE ALSO
 *********/
 
-/****f* OpenSM: Multicast Group/osm_mgrp_construct
-* NAME
-*	osm_mgrp_construct
-*
-* DESCRIPTION
-*	This function constructs a Multicast Group.
-*
-* SYNOPSIS
-*/
-void
-osm_mgrp_construct(
-	IN osm_mgrp_t* const p_mgrp );
-/*
-* PARAMETERS
-*	p_mgrp
-*		[in] Pointer to a Multicast Group to construct.
-*
-* RETURN VALUE
-*	This function does not return a value.
-*
-* NOTES
-*	Allows calling osm_mgrp_init, osm_mgrp_destroy.
-*
-*	Calling osm_mgrp_construct is a prerequisite to calling any other
-*	method except osm_mgrp_init.
-*
-* SEE ALSO
-*	Multicast Group, osm_mgrp_init, osm_mgrp_destroy
-*********/
-
-/****f* OpenSM: Multicast Group/osm_mgrp_destroy
-* NAME
-*	osm_mgrp_destroy
-*
-* DESCRIPTION
-*	The osm_mgrp_destroy function destroys a Multicast Group, releasing
-*	all resources.
-*
-* SYNOPSIS
-*/
-void
-osm_mgrp_destroy(
-	IN osm_mgrp_t* const p_mgrp );
-/*
-* PARAMETERS
-*	p_mgrp
-*		[in] Pointer to a Muticast Group to destroy.
-*
-* RETURN VALUE
-*	This function does not return a value.
-*
-* NOTES
-*	Performs any necessary cleanup of the specified Multicast Group.
-*	Further operations should not be attempted on the destroyed object.
-*	This function should only be called after a call to osm_mgrp_construct or
-*	osm_mgrp_init.
-*
-* SEE ALSO
-*	Multicast Group, osm_mgrp_construct, osm_mgrp_init
-*********/
-
-/****f* OpenSM: Multicast Group/osm_mgrp_init
-* NAME
-*	osm_mgrp_init
-*
-* DESCRIPTION
-*	The osm_mgrp_init function initializes a Multicast Group for use.
-*
-* SYNOPSIS
-*/
-void
-osm_mgrp_init(
-	IN osm_mgrp_t* const p_mgrp,
-	IN const ib_net16_t mlid );
-/*
-* PARAMETERS
-*	p_mgrp
-*		[in] Pointer to an osm_mgrp_t object to initialize.
-*
-*	mlid
-*		[in] Multicast LID for this multicast group.
-*
-* RETURN VALUES
-*	None.
-*
-* NOTES
-*	Allows calling other Multicast Group methods.
-*
-* SEE ALSO
-*	Multicast Group, osm_mgrp_construct, osm_mgrp_destroy,
-*********/
-
 /****f* OpenSM: Multicast Group/osm_mgrp_new
 * NAME
 *	osm_mgrp_new
@@ -362,7 +270,7 @@ osm_mgrp_new(
 *	Allows calling other Multicast Group methods.
 *
 * SEE ALSO
-*	Multicast Group, osm_mgrp_construct, osm_mgrp_destroy,
+*	Multicast Group, osm_mgrp_delete,
 *********/
 
 /****f* OpenSM: Multicast Group/osm_mgrp_delete
@@ -388,7 +296,7 @@ osm_mgrp_delete(
 * NOTES
 *
 * SEE ALSO
-*	Multicast Group, osm_mgrp_construct, osm_mgrp_destroy,
+*	Multicast Group, osm_mgrp_new
 *********/
 
 /****f* OpenSM: Multicast Group/osm_mgrp_is_guid
@@ -568,7 +476,7 @@ osm_mgrp_is_port_present(
 void
 osm_mgrp_remove_port(
 	IN osm_subn_t* const p_subn,
-   IN osm_log_t* const p_log,
+	IN osm_log_t* const p_log,
 	IN osm_mgrp_t* const p_mgrp,
 	IN const ib_net64_t port_guid );
 /*
diff --git a/osm/include/opensm/osm_router.h b/osm/include/opensm/osm_router.h
index 49c5b46..db1dc13 100644
--- a/osm/include/opensm/osm_router.h
+++ b/osm/include/opensm/osm_router.h
@@ -100,8 +100,8 @@ BEGIN_C_DECLS
 */
 typedef struct _osm_router
 {
-	cl_map_item_t				map_item;
-	osm_port_t				*p_port;
+	cl_map_item_t	map_item;
+	osm_port_t	*p_port;
 } osm_router_t;
 /*
 * FIELDS
@@ -115,70 +115,9 @@ typedef struct _osm_router
 *	Router object
 *********/
 
-/****f* OpenSM: Router/osm_router_construct
+/****f* OpenSM: Router/osm_router_delete
 * NAME
-*	osm_router_construct
-*
-* DESCRIPTION
-*	This function constructs a Router object.
-*
-* SYNOPSIS
-*/
-void
-osm_router_construct(
-	IN osm_router_t* const p_rtr );
-/*
-* PARAMETERS
-*	p_rtr
-*		[in] Pointer to a Router object to construct.
-*
-* RETURN VALUE
-*	This function does not return a value.
-*
-* NOTES
-*	Allows calling osm_router_init, and osm_router_destroy.
-*
-*	Calling osm_router_construct is a prerequisite to calling any other
-*	method except osm_router_init.
-*
-* SEE ALSO
-*	Router object, osm_router_init, osm_router_destroy
-*********/
-
-/****f* OpenSM: Router/osm_router_destroy
-* NAME
-*	osm_router_destroy
-*
-* DESCRIPTION
-*	The osm_router_destroy function destroys the object, releasing
-*	all resources.
-*
-* SYNOPSIS
-*/
-void
-osm_router_destroy(
-	IN osm_router_t* const p_rtr );
-/*
-* PARAMETERS
-*	p_rtr
-*		[in] Pointer to the object to destroy.
-*
-* RETURN VALUE
-*	None.
-*
-* NOTES
-*	Performs any necessary cleanup of the specified object.
-*	Further operations should not be attempted on the destroyed object.
-*	This function should only be called after a call to osm_router_construct
-*	or osm_router_init.
-*
-* SEE ALSO
-*	Router object, osm_router_construct, osm_router_init
-*********/
-
-/****f* OpenSM: Router/osm_router_destroy
-* NAME
-*	osm_router_destroy
+*	osm_router_delete
 *
 * DESCRIPTION
 *	Destroys and deallocates the object.
@@ -199,38 +138,7 @@ osm_router_delete(
 * NOTES
 *
 * SEE ALSO
-*	Router object, osm_router_construct, osm_router_init
-*********/
-
-/****f* OpenSM: Router/osm_router_init
-* NAME
-*	osm_router_init
-*
-* DESCRIPTION
-*	The osm_router_init function initializes a Router object for use.
-*
-* SYNOPSIS
-*/
-ib_api_status_t
-osm_router_init(
-	IN osm_router_t* const p_rtr,
-	IN osm_port_t* const p_port );
-/*
-* PARAMETERS
-*	p_rtr
-*		[in] Pointer to an osm_router_t object to initialize.
-*
-*	p_port
-*		[in] Pointer to the port object of this router 
-*
-* RETURN VALUES
-*	IB_SUCCESS if the Router object was initialized successfully.
-*
-* NOTES
-*	Allows calling other node methods.
-*
-* SEE ALSO
-*	Router object, osm_router_construct, osm_router_destroy
+*	Router object, osm_router_new
 *********/
 
 /****f* OpenSM: Router/osm_router_new
@@ -238,7 +146,7 @@ osm_router_init(
 *	osm_router_new
 *
 * DESCRIPTION
-*	The osm_router_init function initializes a Router object for use.
+*	The osm_router_new function initializes a Router object for use.
 *
 * SYNOPSIS
 */
@@ -256,7 +164,7 @@ osm_router_new(
 * NOTES
 *
 * SEE ALSO
-*	Router object, osm_router_construct, osm_router_destroy,
+*	Router object, osm_router_new,
 *********/
 
 /****f* OpenSM: Router/osm_router_get_port_ptr
diff --git a/osm/include/opensm/osm_service.h b/osm/include/opensm/osm_service.h
index 2470650..7c7434c 100644
--- a/osm/include/opensm/osm_service.h
+++ b/osm/include/opensm/osm_service.h
@@ -146,13 +146,13 @@ osm_svcr_new(
 *	Allows calling other service record methods.
 *
 * SEE ALSO
-*	Service Record, osm_svcr_construct, osm_svcr_destroy
+*	Service Record, osm_svcr_delete
 *********/
 
 
 /****f* OpenSM: Service Record/osm_svcr_init
 * NAME
-*	osm_svcr_new
+*	osm_svcr_init
 *
 * DESCRIPTION
 *	Initializes the osm_svcr_t structure.
@@ -171,41 +171,20 @@ osm_svcr_init(
 *		[in] Pointer to the ib_service_record_t
 *
 * SEE ALSO
-*	Service Record, osm_svcr_construct, osm_svcr_destroy
-*********/
-
-/****f* OpenSM: Service Record/osm_svcr_construct
-* NAME
-*	osm_svcr_construct
-*
-* DESCRIPTION
-*	Constructs the osm_svcr_t structure.
-*
-* SYNOPSIS
-*/
-void
-osm_svcr_construct(
-	IN osm_svcr_t* const p_svcr );
-/*
-* PARAMETERS
-*	p_svc_rec
-*		[in] Pointer to osm_svcr_t structure
-*
-* SEE ALSO
-*	Service Record, osm_svcr_construct, osm_svcr_destroy
+*	Service Record
 *********/
 
-/****f* OpenSM: Service Record/osm_svcr_destroy
+/****f* OpenSM: Service Record/osm_svcr_delete
 * NAME
-*	osm_svcr_destroy
+*	osm_svcr_delete
 *
 * DESCRIPTION
-*	Constructs the osm_svcr_t structure.
+*	Deallocates the osm_svcr_t structure.
 *
 * SYNOPSIS
 */
 void
-osm_svcr_destroy(
+osm_svcr_delete(
 	IN osm_svcr_t* const p_svcr );
 /*
 * PARAMETERS
@@ -213,11 +192,9 @@ osm_svcr_destroy(
 *		[in] Pointer to osm_svcr_t structure
 *
 * SEE ALSO
-*	Service Record, osm_svcr_construct, osm_svcr_destroy
+*	Service Record, osm_svcr_new
 *********/
 
-
-
 osm_svcr_t*
 osm_svcr_get_by_rid(
 	IN osm_subn_t	const	*p_subn,
diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c
index e66c259..0069b59 100644
--- a/osm/opensm/osm_inform.c
+++ b/osm/opensm/osm_inform.c
@@ -67,16 +67,7 @@ typedef struct _osm_infr_match_ctxt
 /**********************************************************************
  **********************************************************************/
 void
-osm_infr_construct(
-  IN osm_infr_t* const p_infr )
-{
-  memset( p_infr, 0, sizeof(osm_infr_t) );
-}
-
-/**********************************************************************
- **********************************************************************/
-void
-osm_infr_destroy(
+osm_infr_delete(
   IN osm_infr_t* const p_infr )
 {
   free( p_infr );
@@ -84,21 +75,6 @@ osm_infr_destroy(
 
 /**********************************************************************
  **********************************************************************/
-void
-osm_infr_init(
-  IN osm_infr_t* const p_infr,
-  IN const osm_infr_t *p_infr_rec  )
-{
-  CL_ASSERT( p_infr );
-
-  /* what else do we need in the inform_record ??? */
-
-  /* copy the contents of the provided informinfo */
-  memcpy( p_infr, p_infr_rec, sizeof(osm_infr_t) );
-}
-
-/**********************************************************************
- **********************************************************************/
 osm_infr_t*
 osm_infr_new(
   IN const osm_infr_t *p_infr_rec )
@@ -109,9 +85,7 @@ osm_infr_new(
 
   p_infr = (osm_infr_t*)malloc( sizeof(osm_infr_t) );
   if( p_infr )
-  {
-    osm_infr_init( p_infr, p_infr_rec );
-  }
+    memcpy( p_infr, p_infr_rec, sizeof(osm_infr_t) );
 
   return( p_infr );
 }
@@ -369,7 +343,7 @@ osm_infr_remove_from_db(
   cl_qlist_remove_item( &p_subn->sa_infr_list,
                         &p_infr->list_item );
 
-  osm_infr_destroy( p_infr );
+  osm_infr_delete( p_infr );
 
   OSM_LOG_EXIT( p_log );
 }
diff --git a/osm/opensm/osm_mcast_mgr.c b/osm/opensm/osm_mcast_mgr.c
index f5059c9..508dd72 100644
--- a/osm/opensm/osm_mcast_mgr.c
+++ b/osm/opensm/osm_mcast_mgr.c
@@ -1697,7 +1697,7 @@ osm_mcast_mgr_process_mgrp_cb(
       cl_qmap_remove_item(&p_mgr->p_subn->mgrp_mlid_tbl,
                           (cl_map_item_t *)p_mgrp );
 
-      osm_mgrp_destroy(p_mgrp);
+      osm_mgrp_delete(p_mgrp);
     }
   }
 
diff --git a/osm/opensm/osm_mcm_info.c b/osm/opensm/osm_mcm_info.c
index f250c36..a550a1c 100644
--- a/osm/opensm/osm_mcm_info.c
+++ b/osm/opensm/osm_mcm_info.c
@@ -55,26 +55,6 @@
 
 /**********************************************************************
  **********************************************************************/
-void
-osm_mcm_info_destroy(
-  IN osm_mcm_info_t* const p_mcm )
-{
-  CL_ASSERT( p_mcm );
-}
-
-/**********************************************************************
- **********************************************************************/
-void
-osm_mcm_info_init(
-  IN osm_mcm_info_t* const p_mcm,
-  IN const ib_net16_t mlid )
-{
-  CL_ASSERT( p_mcm );
-  p_mcm->mlid = mlid;
-}
-
-/**********************************************************************
- **********************************************************************/
 osm_mcm_info_t*
 osm_mcm_info_new(
   IN const ib_net16_t mlid )
@@ -85,7 +65,7 @@ osm_mcm_info_new(
   if( p_mcm )
   {
     memset(p_mcm, 0, sizeof(*p_mcm) );
-    osm_mcm_info_init( p_mcm, mlid );
+    p_mcm->mlid = mlid;
   }
 
   return( p_mcm );
@@ -97,6 +77,5 @@ void
 osm_mcm_info_delete(
   IN osm_mcm_info_t* const p_mcm )
 {
-  osm_mcm_info_destroy( p_mcm );
   free( p_mcm );
 }
diff --git a/osm/opensm/osm_mcm_port.c b/osm/opensm/osm_mcm_port.c
index b617a9e..9e4dfe0 100644
--- a/osm/opensm/osm_mcm_port.c
+++ b/osm/opensm/osm_mcm_port.c
@@ -56,39 +56,16 @@
 
 /**********************************************************************
  **********************************************************************/
-void
-osm_mcm_port_construct(
-  IN osm_mcm_port_t* const p_mcm )
-{
-  memset( p_mcm, 0, sizeof(*p_mcm) );
-}
-
-/**********************************************************************
- **********************************************************************/
-void
-osm_mcm_port_destroy(
-  IN osm_mcm_port_t* const p_mcm )
-{
-  /*
-    Nothing to do?
-  */
-  UNUSED_PARAM( p_mcm );
-}
-
-/**********************************************************************
- **********************************************************************/
-void
+static void
 osm_mcm_port_init(
   IN osm_mcm_port_t* const p_mcm,
   IN const ib_gid_t* const p_port_gid,
   IN const uint8_t   scope_state,
   IN const boolean_t proxy_join )
 {
-  CL_ASSERT( p_mcm );
   CL_ASSERT( p_port_gid );
   CL_ASSERT( scope_state );
 
-  osm_mcm_port_construct( p_mcm );
   p_mcm->port_gid = *p_port_gid;
   p_mcm->scope_state = scope_state;
   p_mcm->proxy_join = proxy_join;
@@ -107,6 +84,7 @@ osm_mcm_port_new(
   p_mcm = malloc( sizeof(*p_mcm) );
   if( p_mcm )
   {
+    memset( p_mcm, 0, sizeof(*p_mcm) );
     osm_mcm_port_init( p_mcm, p_port_gid,
                        scope_state, proxy_join );
   }
@@ -121,7 +99,5 @@ osm_mcm_port_delete(
   IN osm_mcm_port_t* const p_mcm )
 {
   CL_ASSERT( p_mcm );
-
-  osm_mcm_port_destroy( p_mcm );
   free( p_mcm );
 }
diff --git a/osm/opensm/osm_mtree.c b/osm/opensm/osm_mtree.c
index 90576d3..c69f46c 100644
--- a/osm/opensm/osm_mtree.c
+++ b/osm/opensm/osm_mtree.c
@@ -55,7 +55,7 @@
 
 /**********************************************************************
  **********************************************************************/
-void
+static void
 osm_mtree_node_init(
   IN osm_mtree_node_t*     const p_mtn,
   IN const osm_switch_t*      const p_sw )
@@ -65,7 +65,7 @@ osm_mtree_node_init(
   CL_ASSERT( p_mtn );
   CL_ASSERT( p_sw );
 
-  osm_mtree_node_construct( p_mtn );
+  memset( p_mtn, 0, sizeof(*p_mtn) );
 
   p_mtn->p_sw = (osm_switch_t*)p_sw;
   p_mtn->max_children = p_sw->num_ports;
diff --git a/osm/opensm/osm_multicast.c b/osm/opensm/osm_multicast.c
index 2538a9a..db79fcd 100644
--- a/osm/opensm/osm_multicast.c
+++ b/osm/opensm/osm_multicast.c
@@ -77,17 +77,7 @@ osm_get_mcast_req_type_str(
 /**********************************************************************
  **********************************************************************/
 void
-osm_mgrp_construct(
-  IN osm_mgrp_t* const p_mgrp )
-{
-  memset( p_mgrp, 0, sizeof(*p_mgrp) );
-  cl_qmap_init( &p_mgrp->mcm_port_tbl );
-}
-
-/**********************************************************************
- **********************************************************************/
-void
-osm_mgrp_destroy(
+osm_mgrp_delete(
   IN osm_mgrp_t* const p_mgrp )
 {
   osm_mcm_port_t *p_mcm_port;
@@ -110,15 +100,15 @@ osm_mgrp_destroy(
 
 /**********************************************************************
  **********************************************************************/
-void
+static void
 osm_mgrp_init(
   IN osm_mgrp_t* const p_mgrp,
   IN const ib_net16_t mlid )
 {
-  CL_ASSERT( p_mgrp );
   CL_ASSERT( cl_ntoh16( mlid ) >= IB_LID_MCAST_START_HO );
 
-  osm_mgrp_construct( p_mgrp );
+  memset( p_mgrp, 0, sizeof(*p_mgrp) );
+  cl_qmap_init( &p_mgrp->mcm_port_tbl );
   p_mgrp->mlid = mlid;
   p_mgrp->last_change_id = 0;
   p_mgrp->last_tree_id = 0;
diff --git a/osm/opensm/osm_router.c b/osm/opensm/osm_router.c
index 4b6470c..544afec 100644
--- a/osm/opensm/osm_router.c
+++ b/osm/opensm/osm_router.c
@@ -50,54 +50,15 @@
 
 #include <stdlib.h>
 #include <string.h>
-#include <complib/cl_math.h>
 #include <iba/ib_types.h>
 #include <opensm/osm_router.h>
 
 /**********************************************************************
  **********************************************************************/
 void
-osm_router_construct(
-  IN osm_router_t* const p_rtr )
-{
-  CL_ASSERT( p_rtr );
-  memset( p_rtr, 0, sizeof(*p_rtr) );
-}
-
-/**********************************************************************
- **********************************************************************/
-ib_api_status_t
-osm_router_init(
-  IN osm_router_t* const p_rtr,
-  IN osm_port_t*   const p_port )
-{
-  ib_api_status_t  status = IB_SUCCESS;
-
-  CL_ASSERT( p_rtr );
-  CL_ASSERT( p_port );
-
-  osm_router_construct( p_rtr );
-
-  p_rtr->p_port = p_port;
-
-  return( status );
-}
-
-/**********************************************************************
- **********************************************************************/
-void
-osm_router_destroy(
-  IN osm_router_t* const p_rtr )
-{
-}
-
-/**********************************************************************
- **********************************************************************/
-void
 osm_router_delete(
   IN OUT osm_router_t** const pp_rtr )
 {
-  osm_router_destroy( *pp_rtr );
   free( *pp_rtr );
   *pp_rtr = NULL;
 }
@@ -108,16 +69,15 @@ osm_router_t*
 osm_router_new(
   IN osm_port_t* const p_port )
 {
-  ib_api_status_t status;
   osm_router_t *p_rtr;
 
+  CL_ASSERT( p_port );
+
   p_rtr = (osm_router_t*)malloc( sizeof(*p_rtr) );
   if( p_rtr )
   {
     memset( p_rtr, 0, sizeof(*p_rtr) );
-    status = osm_router_init( p_rtr, p_port );
-    if( status != IB_SUCCESS )
-      osm_router_delete( &p_rtr );
+    p_rtr->p_port = p_port;
   }
 
   return( p_rtr );
diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c
index 8241129..d27de5e 100644
--- a/osm/opensm/osm_sa_mcmember_record.c
+++ b/osm/opensm/osm_sa_mcmember_record.c
@@ -419,7 +419,7 @@ __cleanup_mgrp(
     {
       cl_qmap_remove_item(&p_rcv->p_subn->mgrp_mlid_tbl,
                           (cl_map_item_t *)p_mgrp );
-      osm_mgrp_destroy(p_mgrp);
+      osm_mgrp_delete(p_mgrp);
     }
   }
 }
@@ -1358,7 +1358,7 @@ osm_mcmr_rcv_create_new_mgrp(
              cl_ntoh16(mlid) );
     cl_qmap_remove_item(&p_rcv->p_subn->mgrp_mlid_tbl,
                         (cl_map_item_t *)p_prev_mgrp );
-    osm_mgrp_destroy( p_prev_mgrp );
+    osm_mgrp_delete( p_prev_mgrp );
   }
 
   cl_qmap_insert(&p_rcv->p_subn->mgrp_mlid_tbl,
diff --git a/osm/opensm/osm_sa_service_record.c b/osm/opensm/osm_sa_service_record.c
index eff0b0a..ded1cd2 100644
--- a/osm/opensm/osm_sa_service_record.c
+++ b/osm/opensm/osm_sa_service_record.c
@@ -1038,7 +1038,7 @@ osm_sr_rcv_process_delete_method(
   cl_qlist_insert_tail( &sr_list, (cl_list_item_t*)&p_sr_item->pool_item );
 
   if(p_svcr)
-    osm_svcr_destroy(p_svcr);
+    osm_svcr_delete(p_svcr);
 
   __osm_sr_rcv_respond( p_rcv, p_madw, &sr_list );
 
@@ -1186,7 +1186,7 @@ osm_sr_rcv_lease_cb(
                               p_rcv->p_log,
                               p_svcr);
 
-      osm_svcr_destroy(p_svcr);
+      osm_svcr_delete(p_svcr);
 
       p_list_item = p_next_list_item;
       continue;
diff --git a/osm/opensm/osm_service.c b/osm/opensm/osm_service.c
index e97d8c6..ba422b3 100644
--- a/osm/opensm/osm_service.c
+++ b/osm/opensm/osm_service.c
@@ -56,16 +56,7 @@
 /**********************************************************************
  **********************************************************************/
 void
-osm_svcr_construct(
-  IN osm_svcr_t* const p_svcr )
-{
-  memset( p_svcr, 0, sizeof(*p_svcr) );
-}
-
-/**********************************************************************
- **********************************************************************/
-void
-osm_svcr_destroy(
+osm_svcr_delete(
   IN osm_svcr_t* const p_svcr )
 {
   free( p_svcr);
@@ -102,7 +93,7 @@ osm_svcr_new(
   p_svcr = (osm_svcr_t*)malloc( sizeof(*p_svcr) );
   if( p_svcr )
   {
-    osm_svcr_construct( p_svcr );
+    memset( p_svcr, 0, sizeof(*p_svcr) );
     osm_svcr_init( p_svcr, p_svc_rec );
   }
 
diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
index 0484530..855d1ab 100644
--- a/osm/opensm/osm_subnet.c
+++ b/osm/opensm/osm_subnet.c
@@ -158,7 +158,7 @@ osm_subn_destroy(
   {
     p_mgrp = p_next_mgrp;
     p_next_mgrp = (osm_mgrp_t*)cl_qmap_next( &p_mgrp->map_item );
-    osm_mgrp_destroy( p_mgrp );
+    osm_mgrp_delete( p_mgrp );
   }
 
   p_next_infr = (osm_infr_t*)cl_qlist_head( &p_subn->sa_infr_list );
@@ -166,7 +166,7 @@ osm_subn_destroy(
   {
     p_infr = p_next_infr;
     p_next_infr = (osm_infr_t*)cl_qlist_next( &p_infr->list_item );
-    osm_infr_destroy( p_infr );
+    osm_infr_delete( p_infr );
   }
 
   cl_list_remove_all( &p_subn->new_ports_list );
-- 
1.5.2.rc2.20.gac2a
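
For readers skimming the diff, the pattern it converges on is a single
"new" routine that allocates, zeroes and fills in the object, paired with a
"delete" routine that frees it. Below is a minimal sketch of that shape,
modelled on the osm_router_new()/osm_router_delete() hunks above. The type
widget_t and its fields are hypothetical and are not part of OpenSM.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical object type, standing in for osm_router_t and friends. */
typedef struct widget {
    int port_num;   /* illustrative field, not from OpenSM */
    int state;      /* left at zero by widget_new() */
} widget_t;

/* Allocate, zero and fill in one step; returns NULL if malloc() fails. */
widget_t *widget_new(int port_num)
{
    widget_t *p_w;

    assert(port_num >= 0);

    p_w = malloc(sizeof(*p_w));
    if (p_w) {
        memset(p_w, 0, sizeof(*p_w));
        p_w->port_num = port_num;
    }
    return p_w;
}

/* Free the object and clear the caller's pointer. */
void widget_delete(widget_t **pp_w)
{
    free(*pp_w);
    *pp_w = NULL;
}
```

With separate construct/init/destroy steps gone, a caller only has to check
the result of widget_new() for NULL and later hand the pointer to
widget_delete().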


From ardavis at ichips.intel.com  Thu May 10 15:40:47 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Thu, 10 May 2007 15:40:47 -0700
Subject: [ofa-general] [ANNOUNCE] DAT 2.0 release branch available for OFED
Message-ID: <46439F6F.4030900@ichips.intel.com>


A new DAPL branch based on the new DAT 2.0 specification is ready 
for testing. This version requires OFED 1.2 verbs and rdmacm libraries.

git://git.openfabrics.org/~ardavis/scm/dapl.git
branch == dat2.0

This version can be built with or without IB extensions. The DAT 2.0 IB 
extension addendum is attached for reference. Basically, rdma_write with 
immediate and atomic operations are supported through the new 2.0 
extended interface. I have included a new test/dtest/dtestx.c that 
contains examples of the new extended operations.

To build with IB extensions:

./autogen.sh && ./configure --enable-ext-type=ib && make

-arlin


-------------- next part --------------
A non-text attachment was scrubbed...
Name: DAT_IB_Extensions_Final_draft.pdf
Type: application/pdf
Size: 107506 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070510/d67251f8/attachment.pdf>

From devesh28 at gmail.com  Thu May 10 23:31:46 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Fri, 11 May 2007 12:01:46 +0530
Subject: [ofa-general] [Query] ib add path record cache
Message-ID: <309a667c0705102331w7839d7et688f9bc00827338@mail.gmail.com>

Hello Sean,

With reference to our earlier discussion (see "[ofa-general] [RFC]
[PATCH 0/3] 2.6.22 or 23 ib: add path record cache") about adding
dummy entries to the local_sa_cache, would the following idea work?

A user command reads path records from a file and passes them to the
local_sa_cache module through a standard entry point (read/write). The
module treats each one as an incoming resolved path_record and adds it
to the cache in the normal fashion. Some device interface will possibly
need to be added, and port agent related issues need to be looked into.

Or, if you have a better idea, can we discuss it?

-Devesh


From vlad at lists.openfabrics.org  Fri May 11 02:30:47 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri, 11 May 2007 02:30:47 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070511-0200 daily build status
Message-ID: <20070511093047.5A21DE60824@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17

Failed:


From chevchenkovic at gmail.com  Fri May 11 02:54:05 2007
From: chevchenkovic at gmail.com (Chevchenkovic Chevchenkovic)
Date: Fri, 11 May 2007 15:24:05 +0530
Subject: [ofa-general] LMC read_bw test
Message-ID: <1c16cdf90705110254o7acf9996ye2bacbb995351b2e@mail.gmail.com>

Hi,
I have a problem with the following configuration:
node 1 : port 1  : LMC = 1 , LIDs = 12,13
node 2 : port 1  : LMC = 1 , LIDs = 18,19

When I run the read_bw test with the LIDs set to 12 and 18, the
test runs fine with no errors. But if I set them to 13 and 19, I
get an error during execution. The error is in the completion queue.
How do I get past this?
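
As background to the configuration above: in InfiniBand, a port with LMC = n
answers to 2^n consecutive LIDs starting at a base LID whose low n bits are
zero, which is why each port here has two LIDs (12,13 and 18,19). The
following is only a sketch of that arithmetic. The function names are
hypothetical and do not come from any OFED library, and the sketch says
nothing about why the test fails on the second LID of each pair.

```c
#include <assert.h>
#include <stdint.h>

/* Number of LIDs a port answers to for a given LMC (valid LMC: 0..7). */
unsigned lmc_lid_count(unsigned lmc)
{
    assert(lmc <= 7);
    return 1u << lmc;
}

/* Base LID of the range containing `lid`: clear the low `lmc` bits. */
uint16_t lmc_base_lid(uint16_t lid, unsigned lmc)
{
    return (uint16_t)(lid & ~(lmc_lid_count(lmc) - 1u));
}

/* Nonzero if `lid` belongs to the range that starts at `base_lid`. */
int lmc_lid_in_range(uint16_t base_lid, unsigned lmc, uint16_t lid)
{
    return lmc_base_lid(lid, lmc) == base_lid;
}
```

For node 1, lmc_base_lid(13, 1) gives 12, so LIDs 12 and 13 address the same
port; the same holds for 18 and 19 on node 2.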

Awaiting reply,
-chev


From halr at voltaire.com  Fri May 11 04:34:27 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 May 2007 07:34:27 -0400
Subject: [ofa-general] Re: [PATCH 1/3] opensm: remove
	osm_port_get_num_physp() function
In-Reply-To: <11787426251658-git-send-email-sashak@voltaire.com>
References: <11787426251341-git-send-email-sashak@voltaire.com>
	<11787426251658-git-send-email-sashak@voltaire.com>
Message-ID: <1178883254.25974.76454.camel@hal.voltaire.com>

On Wed, 2007-05-09 at 16:30, Sasha Khapyorsky wrote: 
> This removes osm_port_get_num_physp() function and instead uses native
> node oriented osm_node_get_num_physp().
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied (to master only).

-- Hal


From halr at voltaire.com  Fri May 11 04:34:56 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 May 2007 07:34:56 -0400
Subject: [ofa-general] Re: [PATCH 2/4 v2] opensm: remove
	osm_port_get_phys_ptr()
In-Reply-To: <11788316803195-git-send-email-sashak@voltaire.com>
References: <11788316803259-git-send-email-sashak@voltaire.com>
	<11788316803195-git-send-email-sashak@voltaire.com>
Message-ID: <1178883259.25974.76456.camel@hal.voltaire.com>

On Thu, 2007-05-10 at 17:14, Sasha Khapyorsky wrote:
> Function osm_port_get_phys_ptr() returns pointer to physical port by its
> number - and this should be node and not port related routine. So this
> patch replaces osm_port_get_phys_ptr() by osm_node_get_physp_ptr().
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied (to master only).

-- Hal


From halr at voltaire.com  Fri May 11 04:35:53 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 May 2007 07:35:53 -0400
Subject: [ofa-general] Re: [PATCH 3/4 v2] opensm: eliminate node's physical
	ports table duplication in osm_port_t
In-Reply-To: <11788316801832-git-send-email-sashak@voltaire.com>
References: <11788316803259-git-send-email-sashak@voltaire.com>
	<11788316801832-git-send-email-sashak@voltaire.com>
Message-ID: <1178883342.25974.76542.camel@hal.voltaire.com>

On Thu, 2007-05-10 at 17:14, Sasha Khapyorsky wrote:
> Eliminate duplication of osm_node's physical ports table in osm_port_t
> object.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied (to master only).

-- Hal


From halr at voltaire.com  Fri May 11 04:37:42 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 May 2007 07:37:42 -0400
Subject: [ofa-general] Re: [PATCH 4/4 v2] opensm: remove some unneeded funcs
In-Reply-To: <11788316801642-git-send-email-sashak@voltaire.com>
References: <11788316803259-git-send-email-sashak@voltaire.com>
	<11788316801642-git-send-email-sashak@voltaire.com>
Message-ID: <1178883355.25974.76544.camel@hal.voltaire.com>

On Thu, 2007-05-10 at 17:14, Sasha Khapyorsky wrote:
> This removes some not really needed functions
> osm_port_get_default_phys_ptr() and osm_port_get_parent_node().
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied (to master only).

-- Hal


From mst at dev.mellanox.co.il  Fri May 11 05:30:36 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 11 May 2007 15:30:36 +0300
Subject: [ofa-general] Re: is there an OFED way to putt VPD from an HCA?
In-Reply-To: <46434FF1.9020005@hp.com>
References: <46434FF1.9020005@hp.com>
Message-ID: <20070511123036.GB30092@mellanox.co.il>


> Quoting Rick Jones <rick.jones2 at hp.com>:
> Subject: is there an OFED way to putt VPD from an HCA?
> 
> Hi -
> 
> I would like to pull vital product data (eg serial number) from an IB HCA 
> which is "driven" via OFED bits.  Is there any OFED tool to do that or do I 
> have to go hunt-down something HCA-vendor specific (mellanox in this case)?

Under the mstflint directory, there is an "mstvpd" tool
which will dump out all of vpd.
It works for any PCI device, actually.

-- 
MST


From mst at dev.mellanox.co.il  Fri May 11 05:40:10 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 11 May 2007 15:40:10 +0300
Subject: [ofa-general] Re: is there an OFED way to putt VPD from an HCA?
In-Reply-To: <464395F3.9080004@hp.com>
References: <46434FF1.9020005@hp.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303821207@xmb-sjc-216.amer.cisco.com>
	<46436275.80406@systemfabricworks.com> <464395F3.9080004@hp.com>
Message-ID: <20070511124010.GC30092@mellanox.co.il>

> Quoting Rick Jones <rick.jones2 at hp.com>:
> Subject: Re: is there an OFED way to putt VPD from an HCA?
> 
> Seems that none of those utilities went into Debian, which was the base 
> distro I installed, and then on top of which I put the 2.6.21.1 kernel.

If you just want the utility, get it from source:

git clone git://git.openfabrics.org/~mst/mstflint.git
cd mstflint
make

This will generate mstvpd in the current directory.
Then give it the PCI device address:

./mstvpd 0000:02:00.0

> course I'm still having that "gcc rpm" not found issue trying to grab the 
> whole OFED 1.2 from 5/10, and an attempt to compile the ofa_kernel from 
> 5.10 ended-up with some asm related errors which sadly I've not saved, but 
> could I suppose recreate.

I think at that date 2.6.21 wasn't yet supported, but should work now.

> At this point I'm not sure if I don't need to lay-down a fresh set of 
> kernel sources to allow things to patch correctly.

Whether you need the OFED kernel bits depends on what you need to do,
you certainly do not need them just to look up the vpd.

-- 
MST


From mst at dev.mellanox.co.il  Fri May 11 06:36:39 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 11 May 2007 16:36:39 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <46438DF2.3080601@linux.vnet.ibm.com>
References: <4641E99B.10706@linux.vnet.ibm.com>
	<46438DF2.3080601@linux.vnet.ibm.com>
Message-ID: <20070511133639.GD30092@mellanox.co.il>

> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
> 
> If there are no other issues than the small restructure suggestion that
> Michael had, can this patch be merged into the for-2.6.22 tree?

I'm not sure.

I haven't the time, at the moment, to go over the patch again in depth.
Have the issues from this message been addressed?

http://www.mail-archive.com/general at lists.openfabrics.org/msg02056.html

From a quick review, it seems that the two most important issues have
not been addressed yet:

1. Testing device SRQ capability twice on each RX packet is just too ugly,
   and it *should* be possible to structure the code
   by separating common functionality in separate
   functions instead of scattering if (srq) tests around.

2. Once the number of created connections exceeds
   the constant that you allow, all attempts to communicate
   with this node over IP over IB will fail.
   A way needs to be designed to switch to the datagram mode,
   and to retry going back to connected after some time.
   [We actually have this theoretical issue in SRQ
    as well - it is just much more severe in the nonSRQ case].

If connected mode works much worse than datagram in some setups,
people won't be able to enable it by default.
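
To illustrate point 1: one possible way to avoid testing the SRQ capability
on every RX packet is to pick the SRQ or non-SRQ variant once, at device
setup, and have the common receive path call through it. This is a sketch
under that assumption, not necessarily the restructuring intended above. All
names (rx_ops, post_srq_receive, and so on) are hypothetical and are not
taken from the IPoIB sources, and the two variants return placeholder values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device receive operations; not the real IPoIB types. */
struct rx_ops {
    int (*post_receive)(int id);
};

/* Placeholder variants: a real driver would repost a receive buffer here. */
static int post_srq_receive(int id)   { return id; }
static int post_nosrq_receive(int id) { return id + 1000; }

static const struct rx_ops srq_ops   = { post_srq_receive };
static const struct rx_ops nosrq_ops = { post_nosrq_receive };

/* Decide once, at device setup, which variant this device uses. */
const struct rx_ops *rx_ops_select(int has_srq)
{
    return has_srq ? &srq_ops : &nosrq_ops;
}

/* Common RX path: no capability test here, only an indirect call. */
int handle_rx(const struct rx_ops *ops, int id)
{
    assert(ops != NULL);
    /* ... shared completion handling would go here ... */
    return ops->post_receive(id);
}
```

The capability check then lives in rx_ops_select() alone, and everything
shared between the two modes stays in handle_rx().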

-- 
MST


From mst at dev.mellanox.co.il  Fri May 11 07:14:29 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 11 May 2007 17:14:29 +0300
Subject: [ofa-general] Re: RFC: location for IB CM statistics
In-Reply-To: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com>
References: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com>
Message-ID: <20070511141429.GF30092@mellanox.co.il>

> Quoting Sean Hefty <sean.hefty at intel.com>:
> Subject: RFC: location for IB CM statistics
> 
> I'd like to start adding some statistical information to the IB CM to help
> identify scalability or connectivity issues.  Some example statistics that I
> would like to expose now are number of retried MADs, unmatched requests, total
> number of connections, etc.  Longer term, additional statistics and information
> on each connection could be added.
> 
> I'm looking for ideas on the best way to expose this sort of data.  Any
> thoughts?

I would start with adding data to debugfs - we don't have
to maintain format stability there.  When we are convinced
we got the format right, we'll be able to add data to
sysfs/proc as well.

-- 
MST


From mst at dev.mellanox.co.il  Fri May 11 07:35:10 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 11 May 2007 17:35:10 +0300
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 2/4] ehca: backport for
	rhel-4.5 - mmap functonality
In-Reply-To: <200705101628.43095.ossrosch@linux.vnet.ibm.com>
References: <200705101628.43095.ossrosch@linux.vnet.ibm.com>
Message-ID: <20070511143510.GG30092@mellanox.co.il>

Some questions:

+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-04 10:40:06.000000000 +0200
+@@ -52,7 +52,7 @@
+ MODULE_LICENSE("Dual BSD/GPL");
+ MODULE_AUTHOR("Christoph Raisch <raisch at de.ibm.com>");
+ MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver");
+-MODULE_VERSION("SVNEHCA_0022");
++MODULE_VERSION("SVNEHCA_0019");
+ 
+ int ehca_open_aqp1     = 0;
+ int ehca_debug_level   = 0;

Is this intentional?

+@@ -293,7 +293,7 @@ int ehca_init_device(struct ehca_shca *s
+ 	strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX);
+ 	shca->ib_device.owner               = THIS_MODULE;
+ 
+-	shca->ib_device.uverbs_abi_ver	    = 6;
++	shca->ib_device.uverbs_abi_ver	    = 5;
+ 	shca->ib_device.uverbs_cmd_mask	    =
+ 		(1ull << IB_USER_VERBS_CMD_GET_CONTEXT)		|
+ 		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE)	|
+@@ -357,7 +357,7 @@ int ehca_init_device(struct ehca_shca *s
+ 	shca->ib_device.dealloc_fmr	    = ehca_dealloc_fmr;
+ 	shca->ib_device.attach_mcast	    = ehca_attach_mcast;
+ 	shca->ib_device.detach_mcast	    = ehca_detach_mcast;
+-	/* shca->ib_device.process_mad	    = ehca_process_mad;     */
++	/* shca->ib_device.process_mad	    = ehca_process_mad;	    */
+ 	shca->ib_device.mmap		    = ehca_mmap;
+ 
+ 	return ret;

Is this really necessary?

+@@ -811,7 +811,7 @@ int __init ehca_module_init(void)
+ 	int ret;
+ 
+ 	printk(KERN_INFO "eHCA Infiniband Device Driver "
+-	                 "(Rel.: SVNEHCA_0022)\n");
++	                 "(Rel.: SVNEHCA_0019)\n");
+ 	idr_init(&ehca_qp_idr);
+ 	idr_init(&ehca_cq_idr);
+ 	spin_lock_init(&ehca_qp_idr_lock);

Is this intentional?


-- 
MST


From suri at baymicrosystems.com  Fri May 11 08:06:38 2007
From: suri at baymicrosystems.com (Suresh Shelvapille)
Date: Fri, 11 May 2007 11:06:38 -0400
Subject: [ofa-general] Sonoma Conf presentations
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403011C1559@G3W0634.americas.hpqcorp.net>
References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com><45DAB3FD.8060606@voltaire.com><349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net><4625C1C6.6040709@voltaire.com><349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net><462755BD.5020305@voltaire.com><349DCDA352EACF42A0C49FA6DCEA8403011C1415@G3W0634.americas.hpqcorp.net><4627804E.2040004@voltaire.com><349DCDA352EACF42A0C49FA6DCEA8403011C1497@G3W0634.americas.hpqcorp.net><46278780.2010900@voltaire.com>
	<349DCDA352EACF42A0C49FA6DCEA8403011C1559@G3W0634.americas.hpqcorp.net>
Message-ID: <05f401c793dd$fcc25c40$1914a8c0@surioffice>

Folks:

I was able to access the Day-1 presentations on the openfabrics.org site but not the 
Day-2/3 presentations (404 page not found error). Is the problem on my side, or have
the pages not been posted yet?

Thanks,
Suri


From ossrosch at linux.vnet.ibm.com  Fri May 11 08:38:12 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Fri, 11 May 2007 17:38:12 +0200
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 2/4] ehca: backport for
	rhel-4.5 - mmap functonality
In-Reply-To: <20070511143510.GG30092@mellanox.co.il>
References: <200705101628.43095.ossrosch@linux.vnet.ibm.com>
	<20070511143510.GG30092@mellanox.co.il>
Message-ID: <200705111738.12797.ossrosch@linux.vnet.ibm.com>

Hi Michael,

Thanks for reviewing. I have made the changes and am sending the new patch below.

On Friday 11 May 2007 16:35, Michael S. Tsirkin wrote:
> Some questions:

> + MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver");
> +-MODULE_VERSION("SVNEHCA_0022");
> ++MODULE_VERSION("SVNEHCA_0019");
> + 
> + int ehca_open_aqp1     = 0;
> + int ehca_debug_level   = 0;
> 
> Is this intentional?

No, I have deleted this hunk now.
> 
> + 	shca->ib_device.detach_mcast	    = ehca_detach_mcast;
> +-	/* shca->ib_device.process_mad	    = ehca_process_mad;     */
> ++	/* shca->ib_device.process_mad	    = ehca_process_mad;	    */
> + 	shca->ib_device.mmap		    = ehca_mmap;
> + 
> + 	return ret;
> 
> Is this really necessary?

No, I think there was an unintentional whitespace change. I have deleted this hunk.
> 
> +@@ -811,7 +811,7 @@ int __init ehca_module_init(void)
> + 	int ret;
> + 
> + 	printk(KERN_INFO "eHCA Infiniband Device Driver "
> +-	                 "(Rel.: SVNEHCA_0022)\n");
> ++	                 "(Rel.: SVNEHCA_0019)\n");
> + 	idr_init(&ehca_qp_idr);
> + 	idr_init(&ehca_cq_idr);
> + 	spin_lock_init(&ehca_qp_idr_lock);
> 
> Is this intentional?

No, I have deleted this hunk too.

Below is the new patch.

Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
---
backport_ehca_2_rhel45_umap.patch |  823 ++++++++++++++++++++++++++++++++++++++
1 files changed, 823 insertions(+)


diff -Nurp ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch
--- ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch	1970-01-01 01:00:00.000000000 +0100
+++ ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch	2007-05-11 19:13:51.000000000 +0200
@@ -0,0 +1,823 @@
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_classes.h
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_classes.h	2007-05-04 10:40:06.000000000 +0200
+@@ -126,14 +126,13 @@ struct ehca_qp {
+ 	struct ipz_qp_handle ipz_qp_handle;
+ 	struct ehca_pfqp pf;
+ 	struct ib_qp_init_attr init_attr;
++	u64 uspace_squeue;
++	u64 uspace_rqueue;
++	u64 uspace_fwh;
+ 	struct ehca_cq *send_cq;
+ 	struct ehca_cq *recv_cq;
+ 	unsigned int sqerr_purgeflag;
+ 	struct hlist_node list_entries;
+-	/* mmap counter for resources mapped into user space */
+-	u32 mm_count_squeue;
+-	u32 mm_count_rqueue;
+-	u32 mm_count_galpa;
+ };
+ 
+ /* must be power of 2 */
+@@ -150,6 +149,8 @@ struct ehca_cq {
+ 	struct ipz_cq_handle ipz_cq_handle;
+ 	struct ehca_pfcq pf;
+ 	spinlock_t cb_lock;
++	u64 uspace_queue;
++	u64 uspace_fwh;
+ 	struct hlist_head qp_hashtab[QP_HASHTAB_LEN];
+ 	struct list_head entry;
+ 	u32 nr_callbacks; /* #events assigned to cpu by scaling code */
+@@ -157,9 +158,6 @@ struct ehca_cq {
+ 	wait_queue_head_t wait_completion;
+ 	spinlock_t task_lock;
+ 	u32 ownpid;
+-	/* mmap counter for resources mapped into user space */
+-	u32 mm_count_queue;
+-	u32 mm_count_galpa;
+ };
+ 
+ enum ehca_mr_flag {
+@@ -259,6 +257,20 @@ struct ehca_ucontext {
+ 	struct ib_ucontext ib_ucontext;
+ };
+ 
++struct ehca_module *ehca_module_new(void);
++
++int ehca_module_delete(struct ehca_module *me);
++
++int ehca_eq_ctor(struct ehca_eq *eq);
++
++int ehca_eq_dtor(struct ehca_eq *eq);
++
++struct ehca_shca *ehca_shca_new(void);
++
++int ehca_shca_delete(struct ehca_shca *me);
++
++struct ehca_sport *ehca_sport_new(struct ehca_shca *anchor);
++
+ int ehca_init_pd_cache(void);
+ void ehca_cleanup_pd_cache(void);
+ int ehca_init_cq_cache(void);
+@@ -282,6 +294,7 @@ extern int ehca_use_hp_mr;
+ extern int ehca_scaling_code;
+ 
+ struct ipzu_queue_resp {
++	u64 queue;        /* points to first queue entry */
+ 	u32 qe_size;      /* queue entry size */
+ 	u32 act_nr_of_sg;
+ 	u32 queue_length; /* queue length allocated in bytes */
+@@ -294,6 +307,7 @@ struct ehca_create_cq_resp {
+ 	u32 cq_number;
+ 	u32 token;
+ 	struct ipzu_queue_resp ipz_queue;
++	struct h_galpas galpas;
+ };
+ 
+ struct ehca_create_qp_resp {
+@@ -306,6 +320,7 @@ struct ehca_create_qp_resp {
+ 	u32 dummy; /* padding for 8 byte alignment */
+ 	struct ipzu_queue_resp ipz_squeue;
+ 	struct ipzu_queue_resp ipz_rqueue;
++	struct h_galpas galpas;
+ };
+ 
+ struct ehca_alloc_cq_parms {
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c	2007-05-04 10:40:06.000000000 +0200
+@@ -268,6 +268,7 @@ struct ib_cq *ehca_create_cq(struct ib_d
+ 	if (context) {
+ 		struct ipz_queue *ipz_queue = &my_cq->ipz_queue;
+ 		struct ehca_create_cq_resp resp;
++		struct vm_area_struct *vma;
+ 		memset(&resp, 0, sizeof(resp));
+ 		resp.cq_number = my_cq->cq_number;
+ 		resp.token = my_cq->token;
+@@ -276,14 +277,40 @@ struct ib_cq *ehca_create_cq(struct ib_d
+ 		resp.ipz_queue.queue_length = ipz_queue->queue_length;
+ 		resp.ipz_queue.pagesize = ipz_queue->pagesize;
+ 		resp.ipz_queue.toggle_state = ipz_queue->toggle_state;
++		ret = ehca_mmap_nopage(((u64)(my_cq->token) << 32) | 0x12000000,
++				       ipz_queue->queue_length,
++				       (void**)&resp.ipz_queue.queue,
++				       &vma);
++		if (ret) {
++			ehca_err(device, "Could not mmap queue pages");
++			cq = ERR_PTR(ret);
++			goto create_cq_exit4;
++		}
++		my_cq->uspace_queue = resp.ipz_queue.queue;
++		resp.galpas = my_cq->galpas;
++		ret = ehca_mmap_register(my_cq->galpas.user.fw_handle,
++					 (void**)&resp.galpas.kernel.fw_handle,
++					 &vma);
++		if (ret) {
++			ehca_err(device, "Could not mmap fw_handle");
++			cq = ERR_PTR(ret);
++			goto create_cq_exit5;
++		}
++		my_cq->uspace_fwh = (u64)resp.galpas.kernel.fw_handle;
+ 		if (ib_copy_to_udata(udata, &resp, sizeof(resp))) {
+ 			ehca_err(device, "Copy to udata failed.");
+-			goto create_cq_exit4;
++			goto create_cq_exit6;
+ 		}
+ 	}
+ 
+ 	return cq;
+ 
++create_cq_exit6:
++	ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE);
++
++create_cq_exit5:
++	ehca_munmap(my_cq->uspace_queue, my_cq->ipz_queue.queue_length);
++
+ create_cq_exit4:
+ 	ipz_queue_dtor(&my_cq->ipz_queue);
+ 
+@@ -317,6 +344,7 @@ static int get_cq_nr_events(struct ehca_
+ int ehca_destroy_cq(struct ib_cq *cq)
+ {
+ 	u64 h_ret;
++	int ret;
+ 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
+ 	int cq_num = my_cq->cq_number;
+ 	struct ib_device *device = cq->device;
+@@ -326,20 +354,6 @@ int ehca_destroy_cq(struct ib_cq *cq)
+ 	u32 cur_pid = current->tgid;
+ 	unsigned long flags;
+ 
+-	if (cq->uobject) {
+-		if (my_cq->mm_count_galpa || my_cq->mm_count_queue) {
+-			ehca_err(device, "Resources still referenced in "
+-				 "user space cq_num=%x", my_cq->cq_number);
+-			return -EINVAL;
+-		}
+-		if (my_cq->ownpid != cur_pid) {
+-			ehca_err(device, "Invalid caller pid=%x ownpid=%x "
+-				 "cq_num=%x",
+-				 cur_pid, my_cq->ownpid, my_cq->cq_number);
+-			return -EINVAL;
+-		}
+-	}
+-
+ 	spin_lock_irqsave(&ehca_cq_idr_lock, flags);
+ 	while (my_cq->nr_events) {
+ 		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+@@ -351,6 +365,25 @@ int ehca_destroy_cq(struct ib_cq *cq)
+ 	idr_remove(&ehca_cq_idr, my_cq->token);
+ 	spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+ 
++	if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) {
++		ehca_err(device, "Invalid caller pid=%x ownpid=%x",
++			 cur_pid, my_cq->ownpid);
++		return -EINVAL;
++	}
++
++	/* un-mmap if vma alloc */
++	if (my_cq->uspace_queue ) {
++		ret = ehca_munmap(my_cq->uspace_queue,
++				  my_cq->ipz_queue.queue_length);
++		if (ret)
++			ehca_err(device, "Could not munmap queue ehca_cq=%p "
++				 "cq_num=%x", my_cq, cq_num);
++		ret = ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE);
++		if (ret)
++			ehca_err(device, "Could not munmap fwh ehca_cq=%p "
++				 "cq_num=%x", my_cq, cq_num);
++	}
++
+ 	h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0);
+ 	if (h_ret == H_R_STATE) {
+ 		/* cq in err: read err data and destroy it forcibly */
+@@ -379,7 +412,7 @@ int ehca_resize_cq(struct ib_cq *cq, int
+ 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
+ 	u32 cur_pid = current->tgid;
+ 
+-	if (cq->uobject && my_cq->ownpid != cur_pid) {
++	if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) {
+ 		ehca_err(cq->device, "Invalid caller pid=%x ownpid=%x",
+ 			 cur_pid, my_cq->ownpid);
+ 		return -EINVAL;
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_iverbs.h ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_iverbs.h
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_iverbs.h	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_iverbs.h	2007-04-29 15:10:56.000000000 +0200
+@@ -171,11 +171,19 @@ int ehca_mmap(struct ib_ucontext *contex
+ 
+ void ehca_poll_eqs(unsigned long data);
+ 
++int ehca_mmap_nopage(u64 foffset,u64 length,void **mapped,
++		     struct vm_area_struct **vma);
++
++int ehca_mmap_register(u64 physical,void **mapped,
++		       struct vm_area_struct **vma);
++
++int ehca_munmap(unsigned long addr, size_t len);
++
+ #ifdef CONFIG_PPC_64K_PAGES
+ void *ehca_alloc_fw_ctrlblock(gfp_t flags);
+ void ehca_free_fw_ctrlblock(void *ptr);
+ #else
+-#define ehca_alloc_fw_ctrlblock(flags) ((void*) get_zeroed_page(flags))
++#define ehca_alloc_fw_ctrlblock(flags) ((void *) get_zeroed_page(flags))
+ #define ehca_free_fw_ctrlblock(ptr) free_page((unsigned long)(ptr))
+ #endif
+ 
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c	2007-05-04 10:40:06.000000000 +0200
+@@ -293,7 +293,7 @@ int ehca_init_device(struct ehca_shca *s
+ 	strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX);
+ 	shca->ib_device.owner               = THIS_MODULE;
+ 
+-	shca->ib_device.uverbs_abi_ver	    = 6;
++	shca->ib_device.uverbs_abi_ver	    = 5;
+ 	shca->ib_device.uverbs_cmd_mask	    =
+ 		(1ull << IB_USER_VERBS_CMD_GET_CONTEXT)		|
+ 		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE)	|
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c	2007-04-29 15:10:56.000000000 +0200
+@@ -637,6 +637,7 @@ struct ib_qp *ehca_create_qp(struct ib_p
+ 		struct ipz_queue *ipz_rqueue = &my_qp->ipz_rqueue;
+ 		struct ipz_queue *ipz_squeue = &my_qp->ipz_squeue;
+ 		struct ehca_create_qp_resp resp;
++		struct vm_area_struct * vma;
+ 		memset(&resp, 0, sizeof(resp));
+ 
+ 		resp.qp_num = my_qp->real_qp_num;
+@@ -650,21 +651,59 @@ struct ib_qp *ehca_create_qp(struct ib_p
+ 		resp.ipz_rqueue.queue_length = ipz_rqueue->queue_length;
+ 		resp.ipz_rqueue.pagesize = ipz_rqueue->pagesize;
+ 		resp.ipz_rqueue.toggle_state = ipz_rqueue->toggle_state;
++		ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x22000000,
++				       ipz_rqueue->queue_length,
++				       (void**)&resp.ipz_rqueue.queue,
++				       &vma);
++		if (ret) {
++			ehca_err(pd->device, "Could not mmap rqueue pages");
++			goto create_qp_exit3;
++		}
++		my_qp->uspace_rqueue = resp.ipz_rqueue.queue;
+ 		/* squeue properties */
+ 		resp.ipz_squeue.qe_size = ipz_squeue->qe_size;
+ 		resp.ipz_squeue.act_nr_of_sg = ipz_squeue->act_nr_of_sg;
+ 		resp.ipz_squeue.queue_length = ipz_squeue->queue_length;
+ 		resp.ipz_squeue.pagesize = ipz_squeue->pagesize;
+ 		resp.ipz_squeue.toggle_state = ipz_squeue->toggle_state;
++		ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x23000000,
++				       ipz_squeue->queue_length,
++				       (void**)&resp.ipz_squeue.queue,
++				       &vma);
++		if (ret) {
++			ehca_err(pd->device, "Could not mmap squeue pages");
++			goto create_qp_exit4;
++		}
++		my_qp->uspace_squeue = resp.ipz_squeue.queue;
++		/* fw_handle */
++		resp.galpas = my_qp->galpas;
++		ret = ehca_mmap_register(my_qp->galpas.user.fw_handle,
++					 (void**)&resp.galpas.kernel.fw_handle,
++					 &vma);
++		if (ret) {
++			ehca_err(pd->device, "Could not mmap fw_handle");
++			goto create_qp_exit5;
++		}
++		my_qp->uspace_fwh = (u64)resp.galpas.kernel.fw_handle;
++
+ 		if (ib_copy_to_udata(udata, &resp, sizeof resp)) {
+ 			ehca_err(pd->device, "Copy to udata failed");
+ 			ret = -EINVAL;
+-			goto create_qp_exit3;
++			goto create_qp_exit6;
+ 		}
+ 	}
+ 
+ 	return &my_qp->ib_qp;
+ 
++create_qp_exit6:
++	ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE);
++
++create_qp_exit5:
++	ehca_munmap(my_qp->uspace_squeue, my_qp->ipz_squeue.queue_length);
++
++create_qp_exit4:
++	ehca_munmap(my_qp->uspace_rqueue, my_qp->ipz_rqueue.queue_length);
++
+ create_qp_exit3:
+ 	ipz_queue_dtor(&my_qp->ipz_rqueue);
+ 	ipz_queue_dtor(&my_qp->ipz_squeue);
+@@ -892,7 +931,7 @@ static int internal_modify_qp(struct ib_
+ 	     my_qp->qp_type == IB_QPT_SMI) &&
+ 	    statetrans == IB_QPST_SQE2RTS) {
+ 		/* mark next free wqe if kernel */
+-		if (!ibqp->uobject) {
++		if (my_qp->uspace_squeue == 0) {
+ 			struct ehca_wqe *wqe;
+ 			/* lock send queue */
+ 			spin_lock_irqsave(&my_qp->spinlock_s, spl_flags);
+@@ -1378,18 +1417,11 @@ int ehca_destroy_qp(struct ib_qp *ibqp)
+ 	enum ib_qp_type	qp_type;
+ 	unsigned long flags;
+ 
+-	if (ibqp->uobject) {
+-		if (my_qp->mm_count_galpa ||
+-		    my_qp->mm_count_rqueue || my_qp->mm_count_squeue) {
+-			ehca_err(ibqp->device, "Resources still referenced in "
+-				 "user space qp_num=%x", ibqp->qp_num);
+-			return -EINVAL;
+-		}
+-		if (my_pd->ownpid != cur_pid) {
+-			ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x",
+-				 cur_pid, my_pd->ownpid);
+-			return -EINVAL;
+-		}
++	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
++	    my_pd->ownpid != cur_pid) {
++		ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x",
++			 cur_pid, my_pd->ownpid);
++		return -EINVAL;
+ 	}
+ 
+ 	if (my_qp->send_cq) {
+@@ -1407,6 +1439,24 @@ int ehca_destroy_qp(struct ib_qp *ibqp)
+ 	idr_remove(&ehca_qp_idr, my_qp->token);
+ 	spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+ 
++	/* un-mmap if vma alloc */
++	if (my_qp->uspace_rqueue) {
++		ret = ehca_munmap(my_qp->uspace_rqueue,
++				  my_qp->ipz_rqueue.queue_length);
++		if (ret)
++			ehca_err(ibqp->device, "Could not munmap rqueue "
++				 "qp_num=%x", qp_num);
++		ret = ehca_munmap(my_qp->uspace_squeue,
++				  my_qp->ipz_squeue.queue_length);
++		if (ret)
++			ehca_err(ibqp->device, "Could not munmap squeue "
++				 "qp_num=%x", qp_num);
++		ret = ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE);
++		if (ret)
++			ehca_err(ibqp->device, "Could not munmap fwh qp_num=%x",
++				 qp_num);
++	}
++
+ 	h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp);
+ 	if (h_ret != H_SUCCESS) {
+ 		ehca_err(ibqp->device, "hipz_h_destroy_qp() failed rc=%lx "
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_uverbs.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_uverbs.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_uverbs.c	2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_uverbs.c	2007-04-29 15:10:56.000000000 +0200
+@@ -68,183 +68,105 @@ int ehca_dealloc_ucontext(struct ib_ucon
+ 	return 0;
+ }
+ 
+-static void ehca_mm_open(struct vm_area_struct *vma)
++struct page *ehca_nopage(struct vm_area_struct *vma,
++			 unsigned long address, int *type)
+ {
+-	u32 *count = (u32*)vma->vm_private_data;
+-	if (!count) {
+-		ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx",
+-			     vma->vm_start, vma->vm_end);
+-		return;
+-	}
+-	(*count)++;
+-	if (!(*count))
+-		ehca_gen_err("Use count overflow vm_start=%lx vm_end=%lx",
+-			     vma->vm_start, vma->vm_end);
+-	ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x",
+-		     vma->vm_start, vma->vm_end, *count);
+-}
+-
+-static void ehca_mm_close(struct vm_area_struct *vma)
+-{
+-	u32 *count = (u32*)vma->vm_private_data;
+-	if (!count) {
+-		ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx",
+-			     vma->vm_start, vma->vm_end);
+-		return;
+-	}
+-	(*count)--;
+-	ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x",
+-		     vma->vm_start, vma->vm_end, *count);
+-}
+-
+-static struct vm_operations_struct vm_ops = {
+-	.open =	ehca_mm_open,
+-	.close = ehca_mm_close,
+-};
+-
+-static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas,
+-			u32 *mm_count)
+-{
+-	int ret;
+-	u64 vsize, physical;
+-
+-	vsize = vma->vm_end - vma->vm_start;
+-	if (vsize != EHCA_PAGESIZE) {
+-		ehca_gen_err("invalid vsize=%lx", vma->vm_end - vma->vm_start);
+-		return -EINVAL;
+-	}
+-
+-	physical = galpas->user.fw_handle;
+-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+-	ehca_gen_dbg("vsize=%lx physical=%lx", vsize, physical);
+-	/* VM_IO | VM_RESERVED are set by remap_pfn_range() */
+-	ret = remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT,
+-			      vsize, vma->vm_page_prot);
+-	if (unlikely(ret)) {
+-		ehca_gen_err("remap_pfn_range() failed ret=%x", ret);
+-		return -ENOMEM;
+-	}
+-
+-	vma->vm_private_data = mm_count;
+-	(*mm_count)++;
+-	vma->vm_ops = &vm_ops;
+-
+-	return 0;
+-}
++	struct page *mypage = NULL;
++	u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT;
++	u32 idr_handle = fileoffset >> 32;
++	u32 q_type = (fileoffset >> 28) & 0xF;	  /* CQ, QP,...        */
++	u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */
++	u32 cur_pid = current->tgid;
++	unsigned long flags;
++	struct ehca_cq *cq;
++	struct ehca_qp *qp;
++	struct ehca_pd *pd;
++	u64 offset;
++	void *vaddr;
+ 
+-static int ehca_mmap_queue(struct vm_area_struct *vma, struct ipz_queue *queue,
+-			   u32 *mm_count)
+-{
+-	int ret;
+-	u64 start, ofs;
+-	struct page *page;
++	switch (q_type) {
++	case 1: /* CQ */
++		spin_lock_irqsave(&ehca_cq_idr_lock, flags);
++		cq = idr_find(&ehca_cq_idr, idr_handle);
++		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+ 
+-	vma->vm_flags |= VM_RESERVED;
+-	start = vma->vm_start;
+-	for (ofs = 0; ofs < queue->queue_length; ofs += PAGE_SIZE) {
+-		u64 virt_addr = (u64)ipz_qeit_calc(queue, ofs);
+-		page = virt_to_page(virt_addr);
+-		ret = vm_insert_page(vma, start, page);
+-		if (unlikely(ret)) {
+-			ehca_gen_err("vm_insert_page() failed rc=%x", ret);
+-			return ret;
++		/* make sure this mmap really belongs to the authorized user */
++		if (!cq) {
++			ehca_gen_err("cq is NULL ret=NOPAGE_SIGBUS");
++			return NOPAGE_SIGBUS;
+ 		}
+-		start +=  PAGE_SIZE;
+-	}
+-	vma->vm_private_data = mm_count;
+-	(*mm_count)++;
+-	vma->vm_ops = &vm_ops;
+ 
+-	return 0;
+-}
+-
+-static int ehca_mmap_cq(struct vm_area_struct *vma, struct ehca_cq *cq,
+-			u32 rsrc_type)
+-{
+-	int ret;
+-
+-	switch (rsrc_type) {
+-	case 1: /* galpa fw handle */
+-		ehca_dbg(cq->ib_cq.device, "cq_num=%x fw", cq->cq_number);
+-		ret = ehca_mmap_fw(vma, &cq->galpas, &cq->mm_count_galpa);
+-		if (unlikely(ret)) {
++		if (cq->ownpid != cur_pid) {
+ 			ehca_err(cq->ib_cq.device,
+-				 "ehca_mmap_fw() failed rc=%x cq_num=%x",
+-				 ret, cq->cq_number);
+-			return ret;
++				 "Invalid caller pid=%x ownpid=%x",
++				 cur_pid, cq->ownpid);
++			return NOPAGE_SIGBUS;
+ 		}
+-		break;
+ 
+-	case 2: /* cq queue_addr */
+-		ehca_dbg(cq->ib_cq.device, "cq_num=%x queue", cq->cq_number);
+-		ret = ehca_mmap_queue(vma, &cq->ipz_queue, &cq->mm_count_queue);
+-		if (unlikely(ret)) {
+-			ehca_err(cq->ib_cq.device,
+-				 "ehca_mmap_queue() failed rc=%x cq_num=%x",
+-				 ret, cq->cq_number);
+-			return ret;
++		if (rsrc_type == 2) {
++			ehca_dbg(cq->ib_cq.device, "cq=%p cq queuearea", cq);
++			offset = address - vma->vm_start;
++			vaddr = ipz_qeit_calc(&cq->ipz_queue, offset);
++			ehca_dbg(cq->ib_cq.device, "offset=%lx vaddr=%p",
++				 offset, vaddr);
++			mypage = virt_to_page(vaddr);
+ 		}
+ 		break;
+ 
+-	default:
+-		ehca_err(cq->ib_cq.device, "bad resource type=%x cq_num=%x",
+-			 rsrc_type, cq->cq_number);
+-		return -EINVAL;
+-	}
+-
+-	return 0;
+-}
+-
+-static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp,
+-			u32 rsrc_type)
+-{
+-	int ret;
++	case 2: /* QP */
++		spin_lock_irqsave(&ehca_qp_idr_lock, flags);
++		qp = idr_find(&ehca_qp_idr, idr_handle);
++		spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+ 
+-	switch (rsrc_type) {
+-	case 1: /* galpa fw handle */
+-		ehca_dbg(qp->ib_qp.device, "qp_num=%x fw", qp->ib_qp.qp_num);
+-		ret = ehca_mmap_fw(vma, &qp->galpas, &qp->mm_count_galpa);
+-		if (unlikely(ret)) {
+-			ehca_err(qp->ib_qp.device,
+-				 "remap_pfn_range() failed ret=%x qp_num=%x",
+-				 ret, qp->ib_qp.qp_num);
+-			return -ENOMEM;
++		/* make sure this mmap really belongs to the authorized user */
++		if (!qp) {
++			ehca_gen_err("qp is NULL ret=NOPAGE_SIGBUS");
++			return NOPAGE_SIGBUS;
+ 		}
+-		break;
+ 
+-	case 2: /* qp rqueue_addr */
+-		ehca_dbg(qp->ib_qp.device, "qp_num=%x rqueue",
+-			 qp->ib_qp.qp_num);
+-		ret = ehca_mmap_queue(vma, &qp->ipz_rqueue, &qp->mm_count_rqueue);
+-		if (unlikely(ret)) {
++		pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd);
++		if (pd->ownpid != cur_pid) {
+ 			ehca_err(qp->ib_qp.device,
+-				 "ehca_mmap_queue(rq) failed rc=%x qp_num=%x",
+-				 ret, qp->ib_qp.qp_num);
+-			return ret;
++				 "Invalid caller pid=%x ownpid=%x",
++				 cur_pid, pd->ownpid);
++			return NOPAGE_SIGBUS;
+ 		}
+-		break;
+ 
+-	case 3: /* qp squeue_addr */
+-		ehca_dbg(qp->ib_qp.device, "qp_num=%x squeue",
+-			 qp->ib_qp.qp_num);
+-		ret = ehca_mmap_queue(vma, &qp->ipz_squeue, &qp->mm_count_squeue);
+-		if (unlikely(ret)) {
+-			ehca_err(qp->ib_qp.device,
+-				 "ehca_mmap_queue(sq) failed rc=%x qp_num=%x",
+-				 ret, qp->ib_qp.qp_num);
+-			return ret;
++		if (rsrc_type == 2) {	/* rqueue */
++			ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueuearea", qp);
++			offset = address - vma->vm_start;
++			vaddr = ipz_qeit_calc(&qp->ipz_rqueue, offset);
++			ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p",
++				 offset, vaddr);
++			mypage = virt_to_page(vaddr);
++		} else if (rsrc_type == 3) {	/* squeue */
++			ehca_dbg(qp->ib_qp.device, "qp=%p qp squeuearea", qp);
++			offset = address - vma->vm_start;
++			vaddr = ipz_qeit_calc(&qp->ipz_squeue, offset);
++			ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p",
++				 offset, vaddr);
++			mypage = virt_to_page(vaddr);
+ 		}
+ 		break;
+ 
+ 	default:
+-		ehca_err(qp->ib_qp.device, "bad resource type=%x qp=num=%x",
+-			 rsrc_type, qp->ib_qp.qp_num);
+-		return -EINVAL;
++		ehca_gen_err("bad queue type %x", q_type);
++		return NOPAGE_SIGBUS;
+ 	}
+ 
+-	return 0;
++	if (!mypage) {
++		ehca_gen_err("Invalid page adr==NULL ret=NOPAGE_SIGBUS");
++		return NOPAGE_SIGBUS;
++	}
++	get_page(mypage);
++
++	return mypage;
+ }
+ 
++static struct vm_operations_struct ehcau_vm_ops = {
++	.nopage = ehca_nopage,
++};
++
+ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
+ {
+ 	u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT;
+@@ -253,6 +175,7 @@ int ehca_mmap(struct ib_ucontext *contex
+ 	u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */
+ 	u32 cur_pid = current->tgid;
+ 	u32 ret;
++	u64 vsize, physical;
+ 	unsigned long flags;
+ 	struct ehca_cq *cq;
+ 	struct ehca_qp *qp;
+@@ -278,12 +201,44 @@ int ehca_mmap(struct ib_ucontext *contex
+ 		if (!cq->ib_cq.uobject || cq->ib_cq.uobject->context != context)
+ 			return -EINVAL;
+ 
+-		ret = ehca_mmap_cq(vma, cq, rsrc_type);
+-		if (unlikely(ret)) {
+-			ehca_err(cq->ib_cq.device,
+-				 "ehca_mmap_cq() failed rc=%x cq_num=%x",
+-				 ret, cq->cq_number);
+-			return ret;
++		switch (rsrc_type) {
++		case 1: /* galpa fw handle */
++			ehca_dbg(cq->ib_cq.device, "cq=%p cq triggerarea", cq);
++			vma->vm_flags |= VM_RESERVED;
++			vsize = vma->vm_end - vma->vm_start;
++			if (vsize != EHCA_PAGESIZE) {
++				ehca_err(cq->ib_cq.device, "invalid vsize=%lx",
++					 vma->vm_end - vma->vm_start);
++				return -EINVAL;
++			}
++
++			physical = cq->galpas.user.fw_handle;
++			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
++			vma->vm_flags |= VM_IO | VM_RESERVED;
++
++			ehca_dbg(cq->ib_cq.device,
++				 "vsize=%lx physical=%lx", vsize, physical);
++			ret = remap_pfn_range(vma, vma->vm_start,
++					      physical >> PAGE_SHIFT, vsize,
++					      vma->vm_page_prot);
++			if (ret) {
++				ehca_err(cq->ib_cq.device,
++					 "remap_pfn_range() failed ret=%x",
++					 ret);
++				return -ENOMEM;
++			}
++			break;
++
++		case 2: /* cq queue_addr */
++			ehca_dbg(cq->ib_cq.device, "cq=%p cq q_addr", cq);
++			vma->vm_flags |= VM_RESERVED;
++			vma->vm_ops = &ehcau_vm_ops;
++			break;
++
++		default:
++			ehca_err(cq->ib_cq.device, "bad resource type %x",
++				 rsrc_type);
++			return -EINVAL;
+ 		}
+ 		break;
+ 
+@@ -307,12 +262,50 @@ int ehca_mmap(struct ib_ucontext *contex
+ 		if (!qp->ib_qp.uobject || qp->ib_qp.uobject->context != context)
+ 			return -EINVAL;
+ 
+-		ret = ehca_mmap_qp(vma, qp, rsrc_type);
+-		if (unlikely(ret)) {
+-			ehca_err(qp->ib_qp.device,
+-				 "ehca_mmap_qp() failed rc=%x qp_num=%x",
+-				 ret, qp->ib_qp.qp_num);
+-			return ret;
++		switch (rsrc_type) {
++		case 1: /* galpa fw handle */
++			ehca_dbg(qp->ib_qp.device, "qp=%p qp triggerarea", qp);
++			vma->vm_flags |= VM_RESERVED;
++			vsize = vma->vm_end - vma->vm_start;
++			if (vsize != EHCA_PAGESIZE) {
++				ehca_err(qp->ib_qp.device, "invalid vsize=%lx",
++					 vma->vm_end - vma->vm_start);
++				return -EINVAL;
++			}
++
++			physical = qp->galpas.user.fw_handle;
++			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
++			vma->vm_flags |= VM_IO | VM_RESERVED;
++
++			ehca_dbg(qp->ib_qp.device, "vsize=%lx physical=%lx",
++				 vsize, physical);
++			ret = remap_pfn_range(vma, vma->vm_start,
++					      physical >> PAGE_SHIFT, vsize,
++					      vma->vm_page_prot);
++			if (ret) {
++				ehca_err(qp->ib_qp.device,
++					 "remap_pfn_range() failed ret=%x",
++					 ret);
++				return -ENOMEM;
++			}
++			break;
++
++		case 2: /* qp rqueue_addr */
++			ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueue_addr", qp);
++			vma->vm_flags |= VM_RESERVED;
++			vma->vm_ops = &ehcau_vm_ops;
++			break;
++
++		case 3: /* qp squeue_addr */
++			ehca_dbg(qp->ib_qp.device, "qp=%p qp squeue_addr", qp);
++			vma->vm_flags |= VM_RESERVED;
++			vma->vm_ops = &ehcau_vm_ops;
++			break;
++
++		default:
++			ehca_err(qp->ib_qp.device, "bad resource type %x",
++				 rsrc_type);
++			return -EINVAL;
+ 		}
+ 		break;
+ 
+@@ -323,3 +316,77 @@ int ehca_mmap(struct ib_ucontext *contex
+ 
+ 	return 0;
+ }
++
++int ehca_mmap_nopage(u64 foffset, u64 length, void **mapped,
++		     struct vm_area_struct **vma)
++{
++	down_write(&current->mm->mmap_sem);
++	*mapped = (void*)do_mmap(NULL,0, length, PROT_WRITE,
++				 MAP_SHARED | MAP_ANONYMOUS,
++				 foffset);
++	up_write(&current->mm->mmap_sem);
++	if (!(*mapped)) {
++		ehca_gen_err("couldn't mmap foffset=%lx length=%lx",
++			     foffset, length);
++		return -EINVAL;
++	}
++
++	*vma = find_vma(current->mm, (u64)*mapped);
++	if (!(*vma)) {
++		down_write(&current->mm->mmap_sem);
++		do_munmap(current->mm, 0, length);
++		up_write(&current->mm->mmap_sem);
++		ehca_gen_err("couldn't find vma queue=%p", *mapped);
++		return -EINVAL;
++	}
++	(*vma)->vm_flags |= VM_RESERVED;
++	(*vma)->vm_ops = &ehcau_vm_ops;
++
++	return 0;
++}
++
++int ehca_mmap_register(u64 physical, void **mapped,
++		       struct vm_area_struct **vma)
++{
++	int ret;
++	unsigned long vsize;
++	/* ehca hw supports only 4k page */
++	ret = ehca_mmap_nopage(0, EHCA_PAGESIZE, mapped, vma);
++	if (ret) {
++		ehca_gen_err("could'nt mmap physical=%lx", physical);
++		return ret;
++	}
++
++	(*vma)->vm_flags |= VM_RESERVED;
++	vsize = (*vma)->vm_end - (*vma)->vm_start;
++	if (vsize != EHCA_PAGESIZE) {
++		ehca_gen_err("invalid vsize=%lx",
++			     (*vma)->vm_end - (*vma)->vm_start);
++		return -EINVAL;
++	}
++
++	(*vma)->vm_page_prot = pgprot_noncached((*vma)->vm_page_prot);
++	(*vma)->vm_flags |= VM_IO | VM_RESERVED;
++
++	ret = remap_pfn_range((*vma), (*vma)->vm_start,
++			      physical >> PAGE_SHIFT, vsize,
++			      (*vma)->vm_page_prot);
++	if (ret) {
++		ehca_gen_err("remap_pfn_range() failed ret=%x", ret);
++		return -ENOMEM;
++	}
++
++	return 0;
++
++}
++
++int ehca_munmap(unsigned long addr, size_t len) {
++	int ret = 0;
++	struct mm_struct *mm = current->mm;
++	if (mm) {
++		down_write(&mm->mmap_sem);
++		ret = do_munmap(mm, addr, len);
++		up_write(&mm->mmap_sem);
++	}
++	return ret;
++}


From rdreier at cisco.com  Fri May 11 08:41:25 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 11 May 2007 08:41:25 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070508141727.GR21591@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 8 May 2007 17:17:27 +0300")
References: <20070508141727.GR21591@mellanox.co.il>
Message-ID: <ada4pmjz7tm.fsf@cisco.com>

By the way, do you know what the best way to flush WC buffers for i386
is?  I know on x86-64 sfence is the way to go, and on ia64 I think we
want fc, but I'm not sure what the right thing is for old 32-bit
processors.

Also, does it make sense to think about using non-temporal stores
(movntq) to get the effect of WC without having to worry about setting
up PAT?
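
To make the question concrete, here is a rough sketch of the x86-64
variant: non-temporal stores followed by sfence. It is only an
illustration, not what libmlx4 does. The function name
wc_copy_and_flush is made up, it uses movnti (through the SSE2
intrinsic _mm_stream_si32) rather than movntq, and on builds without
SSE2 it assumes that plain stores plus a full compiler barrier are an
acceptable stand-in, which is exactly the part I'm unsure about for
old 32-bit processors.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <xmmintrin.h>	/* _mm_sfence */
#include <emmintrin.h>	/* _mm_stream_si32 */
#endif

/*
 * Hypothetical helper: copy n 32-bit words to dst and make sure the
 * stores have left any write-combining buffers before returning.
 */
static void wc_copy_and_flush(int32_t *dst, const int32_t *src, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
#if defined(__SSE2__)
		/* non-temporal store (movnti): bypasses the cache */
		_mm_stream_si32((int *)&dst[i], (int)src[i]);
#else
		/* assumption: no SSE2 available, fall back to a plain store */
		dst[i] = src[i];
#endif
	}

#if defined(__SSE2__)
	/* sfence orders and drains the preceding non-temporal stores */
	_mm_sfence();
#else
	/* assumption: a full barrier is the best we can do here */
	__sync_synchronize();
#endif
}
```

A caller would fill a descriptor in ordinary memory and then pass it
to wc_copy_and_flush() with dst pointing at the mapped doorbell or
BlueFlame page.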

 - R.


From sean.hefty at intel.com  Fri May 11 09:44:18 2007
From: sean.hefty at intel.com (sean.hefty)
Date: Fri, 11 May 2007 09:44:18 -0700
Subject: [ofa-general] RE: [Query] ib add path record cache
In-Reply-To: <309a667c0705102331w7839d7et688f9bc00827338@mail.gmail.com>
Message-ID: <000001c793eb$9e410f50$ff0da8c0@amr.corp.intel.com>

>One user command, reading path records from some file and passing this
>to local_sa_cache module using standard entry point (read/write),
>local_cache module is
>assuming it as a incoming resolved path_record, and adding it to the
>cache in normal fashion. possibly some device interface will be
>required to be added.
>port agent related issues needs to be looked into.
>
>Or some better idea if you have, can we discuss?

This sounds fine.  I'm still just not understanding the reasoning behind
populating the cache with dummy entries.  (I do think that being able to
populate it from a file could be useful to initially load the cache in fairly
static configurations.)

I should note that the cache flushes old entries after it performs a full
update.  So if these are entries that you want to remain in the cache after an
update, additional changes to the cache would be required.

- Sean


From sweitzen at cisco.com  Fri May 11 10:29:05 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Fri, 11 May 2007 10:29:05 -0700
Subject: [ofa-general] RE: [PATCH] ipoib/cm: make stale task actually run
	once in a while
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303820F76@xmb-sjc-216.amer.cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C90BFCB3@mtlexch01.mtl.com><6C2C79E72C305246B504CBA17B5500C9076E27@mtlexch01.mtl.com><A15335FBE9BD2449AF2C9EF3D1EB8EA3037668FC@xmb-sjc-216.amer.cisco.com>
	<20070507200315.GD22341@mellanox.co.il>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303820F76@xmb-sjc-216.amer.cisco.com>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303821664@xmb-sjc-216.amer.cisco.com>

I see the first patch is in OFED-1.2-20070511-0600 now; I'll try it out.

Scott 

> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) 
> Sent: Wednesday, May 09, 2007 4:46 PM
> To: Michael S. Tsirkin; Scott Weitzenkamp (sweitzen)
> Cc: Yohad Dickman; Amit Krig; Tziporet Koren; 
> mst at mellanox.co.il; general at lists.openfabrics.org; Roland Dreier
> Subject: RE: [PATCH] ipoib/cm: make stale task actually run 
> once in a while
> 
> I see a new patch ipoib_correct_timers.patch in 
> OFED-1.2-20070509-0600, which patch should I try?
> 
> Scott 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] 
> > Sent: Monday, May 07, 2007 1:03 PM
> > To: Scott Weitzenkamp (sweitzen)
> > Cc: Yohad Dickman; Amit Krig; Tziporet Koren; 
> > mst at mellanox.co.il; general at lists.openfabrics.org; Roland Dreier
> > Subject: [PATCH] ipoib/cm: make stale task actually run once 
> > in a while
> > 
> > In the presence of some active passive connections, stale 
> > task would never run,
> > since each 4 RX CQEs we repeat queue_delayed_work calls which 
> > delays it for some
> > 10 minutes.  As a result, on a noisy system with failing 
> > ports, we slowly run
> > out of resources - slowing connection setup down and 
> > eventually failing.
> > 
> > What we actually want to do is - start stale task when a first
> > passive connection is added, rerun it every 10 min as long
> > as there are outstanding passive connections.
> > 
> > As a happy side effect, this removes some code from RX data path.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
> > 
> > ---
> > 
> > Scott, I think this might address bugs 541 and 465: slow 
> > IPoIB CM HA failover
> > and eventual failing IPoIB HA. Could you test this please?
> > 
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
> > b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> > index 2b242a4..b77e8d7 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> > @@ -258,10 +258,11 @@ static int ipoib_cm_req_handler(struct 
> > ib_cm_id *cm_id, struct ib_cm_event *even
> >  	cm_id->context = p;
> >  	p->jiffies = jiffies;
> >  	spin_lock_irqsave(&priv->lock, flags);
> > +	if (list_empty(&priv->cm.passive_ids))
> > +		queue_delayed_work(ipoib_workqueue,
> > +				   &priv->cm.stale_task, 
> > IPOIB_CM_RX_DELAY);
> >  	list_add(&p->list, &priv->cm.passive_ids);
> >  	spin_unlock_irqrestore(&priv->lock, flags);
> > -	queue_delayed_work(ipoib_workqueue,
> > -			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
> >  	return 0;
> >  
> >  err_rep:
> > @@ -380,8 +381,6 @@ void ipoib_cm_handle_rx_wc(struct 
> > net_device *dev, struct ib_wc *wc)
> >  			if (!list_empty(&p->list))
> >  				list_move(&p->list, 
> > &priv->cm.passive_ids);
> >  			spin_unlock_irqrestore(&priv->lock, flags);
> > -			queue_delayed_work(ipoib_workqueue,
> > -					   
> > &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
> >  		}
> >  	}
> >  
> > @@ -1104,6 +1103,10 @@ static void ipoib_cm_stale_task(struct 
> > work_struct *work)
> >  		kfree(p);
> >  		spin_lock_irqsave(&priv->lock, flags);
> >  	}
> > +
> > +	if (!list_empty(&priv->cm.passive_ids))
> > +		queue_delayed_work(ipoib_workqueue,
> > +				   &priv->cm.stale_task, 
> > IPOIB_CM_RX_DELAY);
> >  	spin_unlock_irqrestore(&priv->lock, flags);
> >  }
> >  
> > -- 
> > MST
> > 


From pradeeps at linux.vnet.ibm.com  Fri May 11 11:03:47 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Fri, 11 May 2007 11:03:47 -0700
Subject: [ofa-general] UC mode benefits
Message-ID: <4644B003.2040707@linux.vnet.ibm.com>

This is an offshoot of the discussions that we have had on this mailing 
list about moving IPOIB CM to use UC mode at some point in the future.

Is it speculation that moving to UC mode will get us better performance
than RC mode? If you do have some hard data to that effect, can you
please share it?

Pradeep


From raleigh at systemfabricworks.com  Fri May 11 11:58:39 2007
From: raleigh at systemfabricworks.com (Raleigh F Rinehart)
Date: Fri, 11 May 2007 13:58:39 -0500
Subject: [ofa-general] RFC: location for IB CM statistics
In-Reply-To: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com>
References: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com>
Message-ID: <4644BCDF.1000300@systemfabricworks.com>

Sean Hefty wrote:
> I'd like to start adding some statistical information to the IB CM to help
> identify scalability or connectivity issues.  Some example statistics that I
> would like to expose now are number of retried MADs, unmatched requests, total
> number of connections, etc.  Longer term, additional statistics and information
> on each connection could be added.
>
> I'm looking for ideas on the best way to expose this sort of data.  Any
> thoughts?
>
> - Sean
>
>   
This may be way off on a tangent, but some users may also benefit from
the CM exposing these as mmapable memory so that one can do RDMA reads
to gather them.  This is useful in monitoring large clusters of IB nodes
as efficiently as possible.  This type of scenario has been discussed
and researched (see the OSU paper by Panda et al.).

-raleigh


From pradeeps at linux.vnet.ibm.com  Fri May 11 12:19:46 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Fri, 11 May 2007 12:19:46 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <20070511133639.GD30092@mellanox.co.il>
References: <4641E99B.10706@linux.vnet.ibm.com>
	<46438DF2.3080601@linux.vnet.ibm.com>
	<20070511133639.GD30092@mellanox.co.il>
Message-ID: <4644C1D2.6040103@linux.vnet.ibm.com>

Michael S. Tsirkin wrote:
>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>> Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
>>
>> If there are no other issues than the small restructure suggestion that
>> Michael had, can this patch be merged into the for-2.6.22 tree?
> 
> I'm not sure.
> 
> I haven't the time, at the moment, to go over the patch again in depth.
> Have the issues from this message been addressed?
> 
> http://www.mail-archive.com/general at lists.openfabrics.org/msg02056.html
> 
> Just a quick review, it seems that two most important issues have
> apparently not been addressed yet:
> 
> 1. Testing device SRQ capability twice on each RX packet is just too ugly,
>    and it *should* be possible to structure the code
>    by separating common functionality in separate
>    functions instead of scattering if (srq) tests around.

I have restructured the code as suggested. In the latest code, there are
only two places where SRQ capability is tested upon receipt of a packet:
a) ipoib_cm_handle_rx_wc()
b) ipoib_cm_post_receive()

Instead of the suggested change to ipoib_cm_handle_rx_packet() it is
possible to change ipoib_cm_post_receive() and call the srq and nosrq
versions directly, without mangling the code. However, I do not believe
that this should stop the code from being merged. This can be
handled as a separate patch.


> 
> 2. Once the number of created connections exceeds
>    the constant that you allow, all attempts to communicate
>    with this node over IP over IB will fail.
>    A way needs to be designed to switch to the datagram mode,
>    and to retry going back to connected after some time.
>    [We actually have this theoretical issue in SRQ
>     as well - it is just much more severe in the nonSRQ case].

Firstly, this has now been changed to send a REJ message to the remote
side indicating that there are no more free QPs. It is up to the application
to handle the situation. Previously, this was flagged as an error that
appeared in /var/log/messages.

However, here are a few other things we need to consider. Let us
compute the amount of memory consumed when we run into this situation:

In CM mode we use 64K packets. Assuming the rx_ring has 256 entries and
the current limitation of 1024 QPs, NOSRQ alone will consume 16GB of
memory. All else remaining the same, if we change the rx_ring size to
1024, NOSRQ will consume 64GB of memory.

This is huge and my guess is that on most systems, the application will 
run out of memory before it runs out of RC QPs (with NOSRQ).
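
These figures are just buffer size times ring entries times QP count.
A minimal sketch of that arithmetic follows. The helper name
nosrq_rx_bytes is made up for illustration, and the model assumes that
every slot of every per-QP rx_ring holds a full 64K buffer.

```c
#include <stdint.h>

/*
 * Hypothetical helper: receive-buffer memory in bytes when each QP
 * owns a fully posted RX ring (the NOSRQ case).
 */
static uint64_t nosrq_rx_bytes(uint64_t buf_bytes, uint64_t ring_entries,
			       uint64_t num_qps)
{
	return buf_bytes * ring_entries * num_qps;
}
```

With 64K buffers and 1024 QPs, nosrq_rx_bytes(64 * 1024, 256, 1024)
is 2^34 bytes (16GB), and a ring size of 1024 gives 2^36 bytes (64GB).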

Aside from this, I would like to understand how we switch just the
"current" QP to datagram mode; we would not want to switch all the
existing QPs to datagram mode, as that would be unacceptable. Also, we
should not prevent subsequent connections from using RC QPs. Is there
anything in the IB spec about this?

I think solving this is a fairly big issue and not just specific to
NOSRQ. NOSRQ is just exacerbating the situation. This can be dealt with
all at once with SRQ and NOSRQ, if need be.

Hence, I do not see these as impediments to the merge.

Pradeep


From sweitzen at cisco.com  Fri May 11 15:32:05 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Fri, 11 May 2007 15:32:05 -0700
Subject: [ofa-general] RE: [PATCH] ipoib/cm: make stale task actually run
	once in a while (DOES NOT HELP)
In-Reply-To: <20070507200315.GD22341@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C90BFCB3@mtlexch01.mtl.com><6C2C79E72C305246B504CBA17B5500C9076E27@mtlexch01.mtl.com><A15335FBE9BD2449AF2C9EF3D1EB8EA3037668FC@xmb-sjc-216.amer.cisco.com>
	<20070507200315.GD22341@mellanox.co.il>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038217F3@xmb-sjc-216.amer.cisco.com>

This patch, which is in OFED-1.2-20070511-0600, does NOT help.  I am
still seeing 105-second port failover times.  Amit, did you try it?

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] 
> Sent: Monday, May 07, 2007 1:03 PM
> To: Scott Weitzenkamp (sweitzen)
> Cc: Yohad Dickman; Amit Krig; Tziporet Koren; 
> mst at mellanox.co.il; general at lists.openfabrics.org; Roland Dreier
> Subject: [PATCH] ipoib/cm: make stale task actually run once 
> in a while
> 
> In the presence of some active passive connections, stale 
> task would never run,
> since each 4 RX CQEs we repeat queue_delayed_work calls which 
> delays it for some
> 10 minutes.  As a result, on a noisy system with failing 
> ports, we slowly run
> out of resources - slowing connection setup down and 
> eventually failing.
> 
> What we actually want to do is - start stale task when a first
> passive connection is added, rerun it every 10 min as long
> as there are outstanding passive connections.
> 
> As a happy side effect, this removes some code from RX data path.
> 
> Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
> 
> ---
> 
> Scott, I think this might address bugs 541 and 465: slow 
> IPoIB CM HA failover
> and eventual failing IPoIB HA. Could you test this please?
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
> b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> index 2b242a4..b77e8d7 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> @@ -258,10 +258,11 @@ static int ipoib_cm_req_handler(struct 
> ib_cm_id *cm_id, struct ib_cm_event *even
>  	cm_id->context = p;
>  	p->jiffies = jiffies;
>  	spin_lock_irqsave(&priv->lock, flags);
> +	if (list_empty(&priv->cm.passive_ids))
> +		queue_delayed_work(ipoib_workqueue,
> +				   &priv->cm.stale_task, 
> IPOIB_CM_RX_DELAY);
>  	list_add(&p->list, &priv->cm.passive_ids);
>  	spin_unlock_irqrestore(&priv->lock, flags);
> -	queue_delayed_work(ipoib_workqueue,
> -			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  	return 0;
>  
>  err_rep:
> @@ -380,8 +381,6 @@ void ipoib_cm_handle_rx_wc(struct 
> net_device *dev, struct ib_wc *wc)
>  			if (!list_empty(&p->list))
>  				list_move(&p->list, 
> &priv->cm.passive_ids);
>  			spin_unlock_irqrestore(&priv->lock, flags);
> -			queue_delayed_work(ipoib_workqueue,
> -					   
> &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  		}
>  	}
>  
> @@ -1104,6 +1103,10 @@ static void ipoib_cm_stale_task(struct 
> work_struct *work)
>  		kfree(p);
>  		spin_lock_irqsave(&priv->lock, flags);
>  	}
> +
> +	if (!list_empty(&priv->cm.passive_ids))
> +		queue_delayed_work(ipoib_workqueue,
> +				   &priv->cm.stale_task, 
> IPOIB_CM_RX_DELAY);
>  	spin_unlock_irqrestore(&priv->lock, flags);
>  }
>  
> -- 
> MST
> 


From devesh28 at gmail.com  Fri May 11 23:39:19 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Sat, 12 May 2007 12:09:19 +0530
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <000001c793eb$9e410f50$ff0da8c0@amr.corp.intel.com>
References: <309a667c0705102331w7839d7et688f9bc00827338@mail.gmail.com>
	<000001c793eb$9e410f50$ff0da8c0@amr.corp.intel.com>
Message-ID: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>

Thanks for replying. My comments are as follows.

On 5/11/07, sean.hefty <sean.hefty at intel.com> wrote:
> >One user command, reading path records from some file and passing this
> >to local_sa_cache module using standard entry point (read/write),
> >local_cache module is
> >assuming it as a incoming resolved path_record, and adding it to the
> >cache in normal fashion. possibly some device interface will be
> >required to be added.
> >port agent related issues needs to be looked into.
> >
> >Or some better idea if you have, can we discuss?
>
> This sounds fine.  I still just not understanding the reasoning behind

This can be treated as a facility similar to what we have in the ARP table
for TCP/IP. Secondly, this will help in debugging some new upcoming,
partially InfiniBand-compliant hardware.

> populating the cache with dummy entries.  (I do think that being able to
> populate it from a file could be useful to initially load the cache in fairly
> static configurations.)
This is one more benefit we will get: it avoids the initial traffic
that the local_sa_cache module generates on the network. If the cluster
is big and every node is building its cache DB, this generates a huge
traffic burst, and multi-pathing makes the case even worse.
Generating a static path_record initially is an issue!
>
> I should note that the cache flushes old entries after it performs a full
> update.  So if these are entries that you want to remain in the cache after an
Yes, I want them to remain in the DB; my idea is similar to the hard
coding of ARP table entries in TCP/IP.
How do you see this being achieved?
> update, additional changes of the cache would be required.
>
> - Sean
>


From vlad at lists.openfabrics.org  Sat May 12 02:30:24 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat, 12 May 2007 02:30:24 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070512-0200 daily build status
Message-ID: <20070512093025.32400E60823@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16

Failed:


From ianjiang.ict at gmail.com  Sat May 12 02:47:30 2007
From: ianjiang.ict at gmail.com (Ian Jiang)
Date: Sat, 12 May 2007 17:47:30 +0800
Subject: [ofa-general] [SRPT]multiple initiators supported?
Message-ID: <7b2fa1820705120247t1b232345w8373bb72416c5b28@mail.gmail.com>

Does the SRP target support multiple initiators?
I am using the SRP initiator and IB drivers in linux-2.6.20.
The SRP target is at
http://www.openfabrics.org/git/?p=~vu/srpt.git;a=summary
and the IB driver is OFED-1.1 with linux-2.6.16.13-4-default of Suse-10.1.


-- 
Ian Jiang


From mst at dev.mellanox.co.il  Sat May 12 10:29:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 12 May 2007 20:29:27 +0300
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <ada4pmjz7tm.fsf@cisco.com>
References: <20070508141727.GR21591@mellanox.co.il> <ada4pmjz7tm.fsf@cisco.com>
Message-ID: <20070512172927.GA5908@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: libmlx4 wc flash
> 
> By the way, do you know what the best way to flush WC buffers for i386
> is?  I know on x86-64 sfence is the way to go, and on ia64 I think we
> want fc, but I'm not sure what the right thing is for for old 32-bit
> processors.

Maybe just disable WC there?

> Also, does it make sense to think about using non-temporal stores
> (movntq) to get the effect of WC without having to worry about setting
> up PAT?

I don't think it works this way: if PAT is programmed to UC,
I think you get UC access with movntq. No?

-- 
MST


From mst at dev.mellanox.co.il  Sat May 12 13:06:35 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 12 May 2007 23:06:35 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <4644C1D2.6040103@linux.vnet.ibm.com>
References: <4641E99B.10706@linux.vnet.ibm.com>
	<46438DF2.3080601@linux.vnet.ibm.com>
	<20070511133639.GD30092@mellanox.co.il>
	<4644C1D2.6040103@linux.vnet.ibm.com>
Message-ID: <20070512200635.GB5908@mellanox.co.il>

> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
> 
> Michael S. Tsirkin wrote:
> >>Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> >>Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
> >>
> >>If there are no other issues than the small restructure suggestion that
> >>Michael had, can this patch be merged into the for-2.6.22 tree?
> >
> >I'm not sure.
> >
> >I haven't the time, at the moment, to go over the patch again in depth.
> >Have the issues from this message been addressed?
> >
> >http://www.mail-archive.com/general at lists.openfabrics.org/msg02056.html
> >
> >Just a quick review, it seems that two most important issues have
> >apparently not been addressed yet:
> >
> >1. Testing device SRQ capability twice on each RX packet is just too ugly,
> >   and it *should* be possible to structure the code
> >   by separating common functionality in separate
> >   functions instead of scattering if (srq) tests around.
> 
> I have restructured the code as suggested. In the latest code, there are
> only two places where SRQ capability is tested upon receipt of a packet:
> a) ipoib_cm_handle_rx_wc()
> b)ipoib_cm_post_receive()
> 
> Instead of the suggested change to ipoib_cm_handle_rx_packet() it is
> possible to change ipoib_cm_post_receive() and call the  srq and nosrq
> versions directly, without mangling the code. However, I do not believe 
> that this should be stopping us from the code being merged. This can 
> handled as a separate patch.

I actually suggested implementing separate poll routines
for srq and non-srq code. This way we won't have *any* if(srq)
tests on datapath.
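
A rough sketch of the shape I have in mind is below. All names are
made up for illustration and this is not the actual ipoib_cm code: the
SRQ capability is tested once at setup time to pick a handler, and the
poll loop then only makes an indirect call.

```c
/* Hypothetical receive context; the counters exist only for the demo. */
struct rx_ctx {
	int srq_handled;
	int nosrq_handled;
	void (*handle_rx)(struct rx_ctx *ctx);
};

/* SRQ flavour of the per-completion handler. */
static void handle_rx_srq(struct rx_ctx *ctx)
{
	ctx->srq_handled++;
}

/* Non-SRQ flavour of the per-completion handler. */
static void handle_rx_nosrq(struct rx_ctx *ctx)
{
	ctx->nosrq_handled++;
}

/* The capability is tested exactly once, when the context is set up. */
static void rx_ctx_init(struct rx_ctx *ctx, int has_srq)
{
	ctx->srq_handled = 0;
	ctx->nosrq_handled = 0;
	ctx->handle_rx = has_srq ? handle_rx_srq : handle_rx_nosrq;
}

/* Datapath: no if (srq) test per completion. */
static void rx_poll(struct rx_ctx *ctx, int completions)
{
	int i;

	for (i = 0; i < completions; i++)
		ctx->handle_rx(ctx);
}
```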

> >
> >2. Once the number of created connections exceeds
> >   the constant that you allow, all attempts to communicate
> >   with this node over IP over IB will fail.
> >   A way needs to be designed to switch to the datagram mode,
> >   and to retry going back to connected after some time.
> >   [We actually have this theoretical issue in SRQ
> >    as well - it is just much more severe in the nonSRQ case].
> 
> Firstly, this has now been changed to send a REJ message to the remote
> side indicating that there no more free QPs.

Since the HCA actually has free QPs - you are actually running out of buffers that
you are ready to prepost - one might argue about whether this is spec compliant
behaviour.  This is something that might better be checked with the IBTA.

> It is up to the application
> to handle the situation.

Since the application here is kernel IP over IB, it currently handles the
reject by dropping outstanding packets and retrying the connection on the next
packet to this dst.  So the specific node might be denied connectivity
potentially forever.

> Previously, this was flagged as an error that
> appeared in /var/log/messages.
> 
> However, here are a few other things we need to consider. Lets us
> compute the amount of memory consumed when we run into this situation:
> 
> In CM mode we use 64K packets. Assuming, the rx_ring has 256 entries and
> the current limitation of 1024 QPs, NOSRQ only will consume 16GB of 
> memory. All else remaining the same if we change the rx_ring size to 
> 1024, NOSRQ will consume 64GB of memory.
>
> This is huge and my guess is that on most systems, the application will 
> run out of memory before it runs out of RC QPs (with NOSRQ).
> 
> Aside from this I would like to understand how do we switch just the 
> "current" QP to datagram mode; we would not want to switch all the
> existing QPs to datagram mode -that would be unacceptable. Also, we
> should not prevent subsequent connections using RC QPs. Is there 
> anything in the IB spec about this?

Yes, this might need a solution at the protocol level, as you indicate above.

> I think solving this is a fairly big issue and not just specific to
> NOSRQ. NOSRQ is just exacerbating the situation. This can be dealt with
> all at once with SRQ and NOSRQ, if need be.

IMO, the memory scalability issue is specific to your code.
With the current code using a shared RQ, each connection needs
on the order of 1KByte of memory. So we need just 10MByte
for a typical 10000 node cluster.

> Hence, I do not see these as impediments to the merge.

In my humble opinion, we need a handle on the scalability issue
(other than crashing or denying service) before merging this,
otherwise IBM will be the first to object to making connected mode the default.

-- 
MST


From mst at dev.mellanox.co.il  Sat May 12 13:15:19 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 12 May 2007 23:15:19 +0300
Subject: [ofa-general] Re: [PATCH] ipoib/cm: make stale task actually run
	once in a while (DOES NOT HELP)
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038217F3@xmb-sjc-216.amer.cisco.com>
References: <20070507200315.GD22341@mellanox.co.il>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3038217F3@xmb-sjc-216.amer.cisco.com>
Message-ID: <20070512201519.GD5908@mellanox.co.il>

> Quoting Scott Weitzenkamp (sweitzen) <sweitzen at cisco.com>:
> Subject: RE: [PATCH] ipoib/cm: make stale task actually run once in a while (DOES NOT HELP)
> 
> This patch, which is in OFED-1.2-20070511-0600, does NOT help.  I am
> still seeing 105-second port failover times.  Amit, did you try it?

Same here. Still debugging.

-- 
MST


From rdreier at cisco.com  Sat May 12 14:20:18 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 12 May 2007 14:20:18 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070512172927.GA5908@mellanox.co.il> (Michael S. Tsirkin's
	message of "Sat, 12 May 2007 20:29:27 +0300")
References: <20070508141727.GR21591@mellanox.co.il>
	<ada4pmjz7tm.fsf@cisco.com> <20070512172927.GA5908@mellanox.co.il>
Message-ID: <adamz09yc19.fsf@cisco.com>

 > > By the way, do you know what the best way to flush WC buffers for i386
 > > is?  I know on x86-64 sfence is the way to go, and on ia64 I think we
 > > want fc, but I'm not sure what the right thing is for for old 32-bit
 > > processors.
 > 
 > Maybe just disable WC there?

I think we want to use write combining on 32-bit kernels or 32-bit
userspace.  But I don't want to rely on SSE2 instructions for i386 binaries.

 > I don't think it works this way: if PAT is programmed to UC,
 > I think you get UC access with movntq. No?

You're right -- I misremembered what the non-temporal stuff does, but
I just checked and the manual says:

 "The memory type of the region being written to can override the
  non-temporal hint, if the memory address specified for the
  non-temporal store is in an uncacheable (UC) or write protected (WP)
  memory region."

 - R.


From rdreier at cisco.com  Sat May 12 14:21:44 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 12 May 2007 14:21:44 -0700
Subject: [ofa-general] Re: UC mode benefits
In-Reply-To: <4644B003.2040707@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Fri, 11 May 2007 11:03:47 -0700")
References: <4644B003.2040707@linux.vnet.ibm.com>
Message-ID: <adairaxybyv.fsf@cisco.com>

 > Is it speculation that moving to UC mode will get us better performance
 > than RC mode or, if you do have some hard data to that effect can you
 > please share the same?

I don't think there will be any performance benefit.  The advantage is
just that lost messages will simply be dropped, without retries or
transitioning the QP to error.


From mst at dev.mellanox.co.il  Sat May 12 21:59:38 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 13 May 2007 07:59:38 +0300
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <adamz09yc19.fsf@cisco.com>
References: <20070508141727.GR21591@mellanox.co.il> <ada4pmjz7tm.fsf@cisco.com>
	<20070512172927.GA5908@mellanox.co.il> <adamz09yc19.fsf@cisco.com>
Message-ID: <20070513045921.GA7402@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: libmlx4 wc flash
> 
>  > > By the way, do you know what the best way to flush WC buffers for i386
>  > > is?  I know on x86-64 sfence is the way to go, and on ia64 I think we
>  > > want fc, but I'm not sure what the right thing is for for old 32-bit
>  > > processors.
>  > 
>  > Maybe just disable WC there?
> 
> I think we want to use write combining on 32-bit kernels or 32-bit
> userspace.  But I don't want to rely on SSE2 instructions for i386 binaries.
> 
>  > I don't think it works this way: if PAT is programmed to UC,
>  > I think you get UC access with movntq. No?
> 
> You're right -- I misremembered what the non-temporal stuff does, but
> I just checked and the manual says:
> 
>  "The memory type of the region being written to can override the
>   non-temporal hint, if the memory address specified for the
>   non-temporal store is in an uncacheable (UC) or write protected (WP)
>   memory region."

I just found this:
• Write Combining (WC) — System memory locations are not cached (as with
uncacheable memory) and coherency is not enforced by the processor’s bus
coherency protocol. Speculative reads are allowed. Writes may be delayed and
combined in the write combining buffer (WC buffer) to reduce memory accesses.
If the WC buffer is partially filled, the writes may be delayed until the next
occurrence of a serializing event; such as, an SFENCE or MFENCE instruction,
CPUID execution, a read or write to uncached memory, an interrupt occurrence,
or a LOCK instruction execution. This type of cache control is appropriate for
video frame buffers, where the order of writes is unimportant as long as the
writes update memory so they can be seen on the graphics display. See Section
10.3.1, “Buffering of Write Combining Memory Locations,” for more information
about caching the WC memory type. This memory type is available in the Pentium
Pro and Pentium II processors by programming the MTRRs or in the Pentium III,
Pentium 4, and Intel Xeon processors by programming the MTRRs or by selecting
it through the PAT.


But in another place it says confusingly:


Software should access semaphores (shared memory used for signalling between
multiple processors) using identical addresses and operand lengths. For
example, if one processor accesses a semaphore using a word access, other
processors should not access the semaphore using a byte access. Do not use
semaphores on the WC memory type.

So, could we use a LOCK instruction to fence WC writes out?

-- 
MST


From mst at dev.mellanox.co.il  Sat May 12 22:18:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 13 May 2007 08:18:06 +0300
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <ada4pmjz7tm.fsf@cisco.com>
References: <20070508141727.GR21591@mellanox.co.il> <ada4pmjz7tm.fsf@cisco.com>
Message-ID: <20070513051806.GB7402@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: libmlx4 wc flash
> 
> By the way, do you know what the best way to flush WC buffers for i386
> is?  I know on x86-64 sfence is the way to go, and on ia64 I think we
> want fc, but I'm not sure what the right thing is for for old 32-bit
> processors.

By the way, I just re-checked and it seems that WC support first
appeared in Pentium II systems. So I think we should be able to
use sfence if WC is enabled.
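
Something along these lines is what I have in mind. It is only a
sketch with made-up names, not the libmlx4 code, and it assumes the
CPU actually implements sfence; the fallback for other architectures
is also an assumption. In the demo the destination is ordinary memory,
whereas in real use it would be a WC mapping.

```c
#include <stdint.h>

/* Hypothetical wrapper: drain prior stores, including WC buffers. */
static inline void wc_flush(void)
{
#if defined(__i386__) || defined(__x86_64__)
	/* Assumes the processor supports the sfence instruction. */
	__asm__ __volatile__("sfence" ::: "memory");
#else
	/* Assumption: a full barrier is an acceptable stand-in elsewhere. */
	__sync_synchronize();
#endif
}

/* Copy one 64-byte block word by word, then flush the writes out. */
static void copy64_and_flush(volatile uint32_t *dst, const uint32_t *src)
{
	int i;

	for (i = 0; i < 16; i++)
		dst[i] = src[i];
	wc_flush();
}
```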
   
-- 
MST


From kliteyn at dev.mellanox.co.il  Sun May 13 01:03:43 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 13 May 2007 11:03:43 +0300
Subject: [ofa-general] [PATCH] osm: two similar bugs in Up/Dn routing
Message-ID: <4646C65F.1090705@dev.mellanox.co.il>

Hi Hal,

This small patch fixes two similar bugs in Up/Dn routing in OpenSM.
8-bit integers were used as indexes when scanning the subnet, which
in one case caused OpenSM to crash when the ranking "path" is longer
than 256 switches, and in the other case caused OpenSM to go into an
infinite loop when the fabric has more than 256 roots.
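
To illustrate the wrap-around, here is a small standalone sketch. It
is not the OpenSM code and the function names are hypothetical; the
max_iters cap exists only so that the demo itself terminates.

```c
#include <stdint.h>

/*
 * Count how many iterations an 8-bit index needs to reach `limit`.
 * For limit > 255 the index wraps from 255 back to 0 and never gets
 * there, so the loop only stops at the max_iters cap.
 */
static uint32_t iters_with_u8(uint32_t limit, uint32_t max_iters)
{
	uint8_t idx = 0;
	uint32_t n = 0;

	while (idx < limit && n < max_iters) {
		idx++;
		n++;
	}
	return n;
}

/* Same loop with a 32-bit index: terminates after `limit` steps. */
static uint32_t iters_with_u32(uint32_t limit, uint32_t max_iters)
{
	uint32_t idx = 0;
	uint32_t n = 0;

	while (idx < limit && n < max_iters) {
		idx++;
		n++;
	}
	return n;
}
```

With limit = 300 and a cap of 100000, the 8-bit version runs until the
cap while the 32-bit version stops after 300 iterations.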

Please apply both to ofed_1_2 and to master.

-- Yevgeny

Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/opensm/osm_ucast_updn.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c
index 78e8363..398a2b2 100644
--- a/osm/opensm/osm_ucast_updn.c
+++ b/osm/opensm/osm_ucast_updn.c
@@ -473,7 +473,7 @@ updn_subn_rank(
   IN updn_t* p_updn )
 {
   osm_switch_t *p_sw;
-  uint8_t rank = base_rank;
+  uint32_t rank = base_rank;
   osm_physp_t *p_physp, *p_remote_physp;
   cl_qlist_t list;
   cl_status_t did_cause_update;
@@ -636,7 +636,7 @@ __osm_subn_calc_up_down_min_hop_table(
   IN uint64_t* guid_list,
   IN updn_t* p_updn )
 {
-  uint8_t idx = 0;
+  uint32_t idx = 0;
   int status;
 
   OSM_LOG_ENTER( &p_updn->p_osm->log, osm_subn_calc_up_down_min_hop_table );
-- 
1.4.4.1.GIT


From dotanb at dev.mellanox.co.il  Sun May 13 01:59:09 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 13 May 2007 11:59:09 +0300
Subject: [ofa-general] LMC read_bw test
In-Reply-To: <1c16cdf90705110254o7acf9996ye2bacbb995351b2e@mail.gmail.com>
References: <1c16cdf90705110254o7acf9996ye2bacbb995351b2e@mail.gmail.com>
Message-ID: <4646D35D.3050708@dev.mellanox.co.il>

Chevchenkovic Chevchenkovic wrote:
> Hi,
> I had this problem. I had the following configuration problem:
> node 1 : port 1  : LMC = 1 , LIDs = 12,13
> node 2 : port 1  : LMC = 1 , LIDs = 18,19
>
> Now when I run the read_bw test with the lids set as 12 and 18, the
> test runs fine with no errors. But if I set the value to 13 and 19 , I
> get an error in execution. The error is in completion queue.
> How do i get over this?
How do you "tell" the test to use the LMC value?

(I have a tip: when the extra LIDs (which were given to the port due to
the LMC) are being used,
src_path_bits needs to be set as well.)
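
As a rough sketch of the arithmetic (the helper names are made up and
are not part of libibverbs): the path bits are the low LMC bits of the
LID, and that value is what goes into the src_path_bits field of the
address handle attributes.

```c
#include <stdint.h>

/* Number of LIDs assigned to a port with the given LMC: 2^lmc. */
static unsigned lids_per_port(uint8_t lmc)
{
	return 1u << lmc;
}

/* Path bits are the low `lmc` bits of the LID being used. */
static uint8_t lid_to_path_bits(uint16_t lid, uint8_t lmc)
{
	return (uint8_t)(lid & ((1u << lmc) - 1));
}
```

In your setup (LMC = 1), LIDs 12 and 18 give path bits 0, while LIDs 13
and 19 give path bits 1.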


thanks
Dotan


From vlad at mellanox.co.il  Sun May 13 02:01:03 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Sun, 13 May 2007 12:01:03 +0300
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 0/4] ehca: backport for
	rhel-4.5
In-Reply-To: <200705101626.56308.ossrosch@linux.vnet.ibm.com>
References: <200705101626.56308.ossrosch@linux.vnet.ibm.com>
Message-ID: <1179046863.9023.0.camel@vladsk-laptop>

On Thu, 2007-05-10 at 16:26 +0200, Stefan Roscher wrote:
> Because these patches
> http://lists.openfabrics.org/pipermail/general/2007-May/036125.html
> I send before were in frong format and did not patch into backport directory I
> send now the changed patches.
> 
> Regards Stefan
> 

Applied patches 1-4 and ofed-build-scripts patch.


-- 
Vladimir Sokolovsky <vlad at mellanox.co.il>
Mellanox Technologies Ltd.


From vlad at lists.openfabrics.org  Sun May 13 02:31:18 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun, 13 May 2007 02:31:18 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070513-0200 daily build status
Message-ID: <20070513093118.A366EE60836@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.19

Failed:


From erezz at voltaire.com  Sun May 13 02:38:38 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Sun, 13 May 2007 12:38:38 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons
 for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <20070510092925.GB13655@mellanox.co.il>
References: <4641D295.5060907@voltaire.com> <4641D38A.8040406@voltaire.com>
	<20070510092925.GB13655@mellanox.co.il>
Message-ID: <4646DC9E.9030706@voltaire.com>

Michael S. Tsirkin wrote:

>> Quoting Erez Zilber <erezz at voltaire.com>:
>> Subject: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
>>
>>
>> Add the required backport patches & kernel addons for open-iscsi
>> over iSER in RHAS4 up3 and up4.
>>
>> Signed-off-by: Erez Zilber <erezz at voltaire.com>
>>     
>
> In addition to posting patches, could you pls publish a git tree to pull from,
> please? This makes it easy to test-build the patch as our build system
> knows how to do git checkout.
>
> ---
>
> Two comments, generally
> A: Please move code from kernel_patches to kernel_addons as much
>    as possible. There are many places where you just add new headers,
>    or add #include directives, or change the function called or
>    remove extra parameters, all this can and should be done through addons.
>
> B: Please do not add code to core unless there is more than 1 user -
>    add it to the iser module instead. This way if there is
>    compilation failure there, you do not break core for people.
>
>   

Thanks for the feedback. I will make the fixes and post a new version soon.

Erez


From halr at voltaire.com  Sun May 13 05:39:35 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 May 2007 08:39:35 -0400
Subject: [ofa-general] Re: [PATCH] osm: two similar bugs in Up/Dn routing
In-Reply-To: <4646C65F.1090705@dev.mellanox.co.il>
References: <4646C65F.1090705@dev.mellanox.co.il>
Message-ID: <1179059958.1540.81616.camel@hal.voltaire.com>

Hi Yevgeny,

On Sun, 2007-05-13 at 04:03, Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> This small patch fixes two similar bugs in Up/Dn routing in OpenSM.
> A 8-bits integers were used as indexes when scanning subnet, which
> in one case caused OpenSM to crash when ranking "path" is longer 
> than 256 switches, and in other case caused OpenSM to go into infinite
> loop when fabric has more than 256 roots.

Good catch.

> Please apply both to ofed_1_2 and to master.
> 
> -- Yevgeny
> 
> Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied (to both master and ofed_1_2).

-- Hal


From dotanb at dev.mellanox.co.il  Sun May 13 07:06:40 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 13 May 2007 17:06:40 +0300
Subject: [ofa-general] RE: man pages for the rdma-cm
In-Reply-To: <000201c79196$72e8a680$39d1180a@amr.corp.intel.com>
References: <000201c79196$72e8a680$39d1180a@amr.corp.intel.com>
Message-ID: <46471B70.1070303@dev.mellanox.co.il>

Hi Sean & Steve.
> I added an rdma_cm man page that gives an overview.  I still need to add
> references to this man page from the other API man pages, which I'll do before
> pushing into OFED.
>   

The man pages for the rdma-cm are really great news, but their style is
different from that of the man pages
for the libibverbs utils and verbs.

For example:
you can compare the man pages of rping and ibv_devinfo and view the 
differences
(OPTIONS section was added, AUTHORS section is missing, the first line 
and the last line
  in the man pages are different).

The man page style of the verbs/functions is different as well.


I believe we want all of the man pages that are
installed with the OFED package to
have the same style.

Can you please fix this issue?

Thanks
Dotan


From jackm at dev.mellanox.co.il  Sun May 13 07:18:23 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 13 May 2007 17:18:23 +0300
Subject: [ofa-general] [PATCH] mlx4: fix uninitialized spinlock for 32-bit
	architectures
Message-ID: <200705131718.23298.jackm@dev.mellanox.co.il>

uar_lock spinlock was used in mlx4_ib_cq_arm without being
initialized (on 32-bit hosts, in 64-bit atomic writes).

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 7959698..ad107f2 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -561,6 +561,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 
 	spin_lock_init(&ibdev->sm_lock);
 	mutex_init(&ibdev->cap_mask_mutex);
+	MLX4_INIT_DOORBELL_LOCK(&ibdev->uar_lock);
 
 	if (ib_register_device(&ibdev->ib_dev))
 		goto err_map;


From mst at dev.mellanox.co.il  Sun May 13 08:56:47 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 13 May 2007 18:56:47 +0300
Subject: [ofa-general] [PATCH] mthca: fix posting >255 recv WRs
Message-ID: <20070513155647.GD16537@mellanox.co.il>

The following fixes posting lists of > 255 WRs for tavor:
rq.next_ind must be updated each doorbell, otherwise the
next doorbell will use an incorrect index.

Found by Ronni Zimmermann at Mellanox.
Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

--

diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index fee60c8..72fabb8 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -1862,6 +1862,7 @@ int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
 				      dev->kar + MTHCA_RECEIVE_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 
+			qp->rq.next_ind = ind;
 			qp->rq.head += MTHCA_TAVOR_MAX_WQES_PER_RECV_DB;
 			size0 = 0;
 		}

-- 
MST


From mst at dev.mellanox.co.il  Sun May 13 08:57:08 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 13 May 2007 18:57:08 +0300
Subject: [ofa-general] [PATCH] libmthca: fix posting >255 recv WRs
Message-ID: <20070513155708.GE16537@mellanox.co.il>

The following fixes posting lists of > 255 WRs for tavor:
rq.next_ind must be updated each doorbell, otherwise the
next doorbell will use an incorrect index.

Found by Ronni Zimmermann at Mellanox.
Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

--

Same as the kernel patch, really.

diff --git a/src/qp.c b/src/qp.c
index f2483e9..372a418 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -412,6 +412,7 @@ int mthca_tavor_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr,
 
 			mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_RECV_DOORBELL);
 
+			qp->rq.next_ind = ind;
 			qp->rq.head += MTHCA_TAVOR_MAX_WQES_PER_RECV_DB;
 			size0 = 0;
 		}

-- 
MST


From sashak at voltaire.com  Sun May 13 10:29:38 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 13 May 2007 20:29:38 +0300
Subject: [ofa-general] Re: [PATCH] osm: two similar bugs in Up/Dn routing
In-Reply-To: <1179059958.1540.81616.camel@hal.voltaire.com>
References: <4646C65F.1090705@dev.mellanox.co.il>
	<1179059958.1540.81616.camel@hal.voltaire.com>
Message-ID: <20070513172938.GG29746@sashak.voltaire.com>

On 08:39 Sun 13 May     , Hal Rosenstock wrote:
> Hi Yevgeny,
> 
> On Sun, 2007-05-13 at 04:03, Yevgeny Kliteynik wrote:
> > Hi Hal,
> > 
> > This small patch fixes two similar bugs in Up/Dn routing in OpenSM.
> > 8-bit integers were used as indexes when scanning the subnet, which
> > in one case caused OpenSM to crash when the ranking "path" is longer
> > than 256 switches, and in the other case caused OpenSM to go into an
> > infinite loop when the fabric has more than 256 roots.
> 
> Good catch.

Yes. And IMO this shows how overusing fixed-size integers can hurt.
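
To make the failure mode concrete, here is a small stand-alone sketch (not
the OpenSM code; the function names are made up for illustration) of a loop
whose index is a uint8_t. Once the bound exceeds 255 the index can never
reach it, so the sketch reports -1 at the point where the real loop would
wrap to 0 and spin forever.

```c
#include <stdint.h>

/*
 * Count iterations of "for (i = 0; i < limit; i++)" with an 8-bit index.
 * Returns the iteration count, or -1 if the index is about to wrap,
 * i.e. the real loop would never terminate.
 */
long sketch_count_u8(unsigned limit)
{
	uint8_t i;
	long steps = 0;

	for (i = 0; i < limit; i++) {
		steps++;
		if (i == UINT8_MAX)
			return -1;	/* the next i++ would wrap to 0 */
	}
	return steps;
}

/* Same loop with a plain unsigned index: terminates for any limit. */
long sketch_count_wide(unsigned limit)
{
	unsigned i;
	long steps = 0;

	for (i = 0; i < limit; i++)
		steps++;
	return steps;
}
```

A bound of 255 still works with the narrow index; 256 is the first value
that does not.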

Sasha


From sashak at voltaire.com  Sun May 13 12:55:39 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 13 May 2007 22:55:39 +0300
Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager
In-Reply-To: <20070508184938.311b1c8f.weiny2@llnl.gov>
References: <20070508184938.311b1c8f.weiny2@llnl.gov>
Message-ID: <20070513195539.GH29746@sashak.voltaire.com>

Hi Ira,

Thanks for the great work!

On 18:49 Tue 08 May     , Ira Weiny wrote:
> I would like to submit to the list a performance manager which I have been
> working on for OpenSM.
> 
> It is implemented as the first proposed architecture model set forth by Hal (As
> an integrated thread to OpenSM.)  As such it works fine on our small test
> cluster but there is some concern about its scalability.
> 
> I have extended this architecture with an idea of my own.  This idea is to have
> a plug-able module for the "event database".  With this interface one could
> write their own Data reduction, logging, and tracking methods.  Here at LLNL I
> propose to use this to add counter and subnet events directly to our management
> database which is used to show system status to our operators.  Other
> installations might prefer other methods of logging, SNMP for example.  This
> patch includes a "reference" implementation of this "event database" which
> stores the information internally until the user requests a "dump".

I like this event db idea, but I am not sure it should be an integral part
of the low-level perfmgr stuff. As currently implemented, PerfMgr just
doesn't work without such a plugin loaded: it unconditionally tries to
pull all port counters, but has nothing to do with them without the plugin.

Instead I would propose to have a builtin PerfMgr which will be able to
pull and store performance-related data and then call a "generic" event
manager which can process such data. This will also help to have a simpler
generic API for such an event db plugin, so other parts of OpenSM will be
able to report events using the same method(s). What do you think?
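
Something along these lines is what I have in mind. This is only a sketch
under my own assumptions: every name below (sketch_event_id,
sketch_event_plugin, sketch_event_mgr, sketch_event_report, ...) is
hypothetical and none of it comes from the patch. The point is that the
reporter keeps working when no plugin is loaded, and that any part of OpenSM
could use the same entry point.

```c
#include <stddef.h>

/* Hypothetical event identifiers; not taken from the patch. */
enum sketch_event_id {
	SKETCH_EVENT_PORT_COUNTERS,
	SKETCH_EVENT_TRAP,
	SKETCH_EVENT_MAX
};

/* A plugin is a callback plus its private state. */
struct sketch_event_plugin {
	void (*report)(void *plugin_data, enum sketch_event_id id,
		       const void *event_data);
	void *plugin_data;
};

/* Generic event manager; plugin is NULL when nothing is loaded. */
struct sketch_event_mgr {
	const struct sketch_event_plugin *plugin;
};

void sketch_event_mgr_init(struct sketch_event_mgr *mgr,
			   const struct sketch_event_plugin *plugin)
{
	mgr->plugin = plugin;
}

/*
 * Single entry point for all reporters (PerfMgr, trap receiver, ...).
 * Returns 1 if the event was handed to a plugin, 0 if it was dropped
 * because no plugin is loaded.
 */
int sketch_event_report(struct sketch_event_mgr *mgr,
			enum sketch_event_id id, const void *event_data)
{
	if (!mgr->plugin || !mgr->plugin->report)
		return 0;
	mgr->plugin->report(mgr->plugin->plugin_data, id, event_data);
	return 1;
}

/* Trivial example plugin callback: counts events in an int. */
void sketch_count_events(void *plugin_data, enum sketch_event_id id,
			 const void *event_data)
{
	(void)id;
	(void)event_data;
	++*(int *)plugin_data;
}
```

The builtin PerfMgr would keep its own counters either way and only call
the report function after storing them.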

Some patch-related comments are inline below.

Sasha

> 
> Let the flames begin,
> Ira Weiny
> weiny2 at llnl.gov
> 
> 
> 
> >From 4ce288b6a5a371872cf160f6d4e29e768a065cb9 Mon Sep 17 00:00:00 2001
> From: Ira K. Weiny <weiny2 at llnl.gov>
> Date: Tue, 24 Apr 2007 23:44:15 -0700
> Subject: [PATCH] OpenSM Proposed Perf Manager
> 
>    Features include:
>       * Create "PerfMgr" thread and sweep all ports on the subnet every
>         sweep_time seconds
>       * port counter clear on overflow
>       * pluggable architecture for the "event" database
>       * Output machine and human readable output in the default event database
>         dump
>       * Control using the "perfmgr" command in the console
> 
>    Known Issues
>       * Not tested at scale.
>       * Event database should record trap events and other "interesting" subnet
>         events.
>       * port counter log warnings should be configurable, not hard coded.
>       * partitions are not handled yet.
>       * Code might not be as pristine as I would like
> 
>    Enable using --enable-perf-mgr
> 
> Signed-off-by: Ira K. Weiny <weiny2 at llnl.gov>
> ---
>  osm/Makefile.am                   |    3 +-
>  osm/config/osmvsel.m4             |   26 ++
>  osm/configure.in                  |    5 +-
>  osm/eventdb/Makefile.am           |   37 ++
>  osm/eventdb/autogen.sh            |   15 +
>  osm/eventdb/configure.in          |   70 ++++
>  osm/eventdb/libibeventdb.map      |    5 +
>  osm/eventdb/libibeventdb.spec.in  |   38 ++
>  osm/eventdb/libibeventdb.ver      |    9 +
>  osm/eventdb/src/ibeventdb.c       |  622 +++++++++++++++++++++++++++++++++
>  osm/include/Makefile.am           |    2 +
>  osm/include/iba/ib_types.h        |   74 ++++
>  osm/include/opensm/osm_base.h     |   23 ++
>  osm/include/opensm/osm_event_db.h |  151 ++++++++
>  osm/include/opensm/osm_madw.h     |   40 +++
>  osm/include/opensm/osm_msgdef.h   |    1 +
>  osm/include/opensm/osm_opensm.h   |    4 +
>  osm/include/opensm/osm_perfmgr.h  |  223 ++++++++++++
>  osm/include/opensm/osm_subnet.h   |   18 +
>  osm/opensm.spec.in                |   11 +-
>  osm/opensm/Makefile.am            |    5 +-
>  osm/opensm/configure.in           |    3 +
>  osm/opensm/main.c                 |   19 +
>  osm/opensm/osm_console.c          |   78 +++++
>  osm/opensm/osm_event_db.c         |  172 +++++++++
>  osm/opensm/osm_opensm.c           |   24 ++
>  osm/opensm/osm_perfmgr.c          |  686 +++++++++++++++++++++++++++++++++++++
>  osm/opensm/osm_subnet.c           |   51 +++
>  osm/opensm/osm_trap_rcv.c         |   15 +
>  29 files changed, 2425 insertions(+), 5 deletions(-)
> 
> diff --git a/osm/Makefile.am b/osm/Makefile.am
> index ec66883..32f5f64 100644
> --- a/osm/Makefile.am
> +++ b/osm/Makefile.am
> @@ -1,6 +1,7 @@
>  
>  # note that order matters: make the libs first then use them 
> -SUBDIRS 		= complib libvendor opensm osmtest include
> +SUBDIRS 		= complib libvendor opensm osmtest include $(EVENTDB)
> +DIST_SUBDIRS = complib libvendor opensm osmtest include eventdb
>  
>  # this will control the update of the files in order
>  MAINTAINERCLEANFILES = Makefile.in aclocal.m4 configure config-h.in 
> diff --git a/osm/config/osmvsel.m4 b/osm/config/osmvsel.m4
> index 9234f36..ce6039c 100644
> --- a/osm/config/osmvsel.m4
> +++ b/osm/config/osmvsel.m4
> @@ -180,3 +180,29 @@ if test "$disable_libcheck" != "yes"; th
>  fi
>  # --- END OPENIB_APP_OSMV_CHECK_HEADER ---
>  ]) dnl OPENIB_APP_OSMV_CHECK_HEADER
> +
> +
> +AC_DEFUN([OPENIB_OSM_PERF_MGR_SEL], [
> +# --- BEGIN OPENIB_OSM_PERF_MGR_SEL ---
> +
> +dnl enable the perf-mgr
> +AC_ARG_ENABLE(perf-mgr,
> +[  --enable-perf-mgr Enable the performance manager (default no)],
> +   [case $enableval in
> +     yes) perf_mgr=yes ;;
> +     no)  perf_mgr=no ;;
> +   esac],
> +   perf_mgr=no)
> +if test $perf_mgr = yes; then
> +  AC_DEFINE(ENABLE_OSM_PERF_MGR,
> +	    1,
> +	    [Define as 1 if you want to enable the performance manager])
> +  EVENTDB=eventdb
> +else
> +  EVENTDB=
> +fi
> +AC_SUBST([EVENTDB])
> +
> +# --- END OPENIB_OSM_PERF_MGR_SEL ---
> +]) dnl OPENIB_OSM_PERF_MGR_SEL
> +
> diff --git a/osm/configure.in b/osm/configure.in
> index eb6552f..94d4483 100644
> --- a/osm/configure.in
> +++ b/osm/configure.in
> @@ -27,11 +27,14 @@ AC_ARG_ENABLE(debug,
>  esac],[debug=false])
>  AM_CONDITIONAL(DEBUG, test x$debug = xtrue)
>  
> +dnl select performance manager or not
> +OPENIB_OSM_PERF_MGR_SEL
> +
>  dnl Provide user option to select vendor
>  OPENIB_APP_OSMV_SEL
>  
>  dnl Configure the following subdirs
> -AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest include)
> +AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest include eventdb)
>  
>  dnl Create the following Makefiles
>  AC_OUTPUT(Makefile)
> diff --git a/osm/eventdb/Makefile.am b/osm/eventdb/Makefile.am
> new file mode 100644
> index 0000000..18f2db9
> --- /dev/null
> +++ b/osm/eventdb/Makefile.am
> @@ -0,0 +1,37 @@
> +
> +INCLUDES = -I$(srcdir)/../include \
> +	   -I$(includedir)/infiniband
> +
> +lib_LTLIBRARIES = libibeventdb.la
> +
> +if DEBUG
> +DBGFLAGS = -ggdb -D_DEBUG_
> +else
> +DBGFLAGS = -g
> +endif
> +
> +libibeventdb_la_CFLAGS = -Wall $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -Wno-deprecated-declarations
> +
> +if HAVE_LD_VERSION_SCRIPT
> +    libibeventdb_version_script = -Wl,--version-script=$(srcdir)/libibeventdb.map
> +else
> +    libibeventdb_version_script =
> +endif
> +
> +libibeventdb_la_SOURCES = src/ibeventdb.c
> +libibeventdb_la_LDFLAGS = -version-info $(ibeventdb_api_version) \
> +	 -export-dynamic $(libibeventdb_version_script)
> +libibeventdb_la_LIBADD = -L../complib $(OSMV_LDADD) -losmcomp
> +libibeventdb_la_DEPENDENCIES = $(srcdir)/libibeventdb.map
> +
> +libibeventdbincludedir = $(includedir)/infiniband/complib
> +
> +libibeventdbinclude_HEADERS =
> +
> +# headers are distributed as part of the include dir
> +EXTRA_DIST = $(srcdir)/libibeventdb.spec.in $(srcdir)/libibeventdb.map \
> +	$(srcdir)/libibeventdb.ver
> +
> +dist-hook: libibeventdb.spec
> +	cp libibeventdb.spec $(distdir)
> +
> diff --git a/osm/eventdb/autogen.sh b/osm/eventdb/autogen.sh
> new file mode 100755
> index 0000000..ec20fc5
> --- /dev/null
> +++ b/osm/eventdb/autogen.sh
> @@ -0,0 +1,15 @@
> +#! /bin/sh
> +
> +# We change dir since the later utilities assume to work in the project dir
> +cd ${0%*/*}
> +
> +# create config dir if not exist
> +test -d config || mkdir config
> +
> +set -x
> +(aclocal -I config -I ../config 2>&1 ) && \
> +(libtoolize --force --copy) && \
> +(autoheader) && \
> +(automake --foreign --add-missing --copy) && \
> +autoconf
> +
> diff --git a/osm/eventdb/configure.in b/osm/eventdb/configure.in
> new file mode 100644
> index 0000000..f5fa345
> --- /dev/null
> +++ b/osm/eventdb/configure.in
> @@ -0,0 +1,70 @@
> +dnl Process this file with autoconf to produce a configure script.
> +
> +AC_PREREQ(2.57)
> +AC_INIT(libibeventdb, 1.0.0, openib-general at openib.org)
> +AC_CONFIG_AUX_DIR(config)
> +AM_CONFIG_HEADER(config.h)
> +AM_INIT_AUTOMAKE
> +
> +dnl the library version info is available in the file: libibeventdb.ver
> +ibeventdb_api_version=`grep LIBVERSION $srcdir/libibeventdb.ver | sed 's/LIBVERSION=//'`
> +if test -z $ibeventdb_api_version; then
> +   ibeventdb_api_version=1:0:0
> +fi
> +AC_SUBST(ibeventdb_api_version)
> +
> +dnl Checks for programs
> +AC_PROG_CC
> +AC_PROG_GCC_TRADITIONAL
> +AC_PROG_LIBTOOL
> +
> +dnl Checks for libraries
> +AC_CHECK_LIB(pthread, pthread_mutex_init, [],
> +	AC_MSG_ERROR([pthread_mutex_init() not found.  libibeventdb requires libpthread.]))
> +
> +dnl Checks for header files.
> +AC_HEADER_STDC
> +AC_CHECK_HEADERS([fcntl.h stdlib.h string.h sys/ioctl.h sys/time.h syslog.h unistd.h])
> +
> +dnl Checks for library functions
> +AC_FUNC_MALLOC
> +AC_FUNC_MEMCMP
> +AC_CHECK_FUNC([time])
> +dnl AC_CHECK_FUNC([cl_plock_excl_acquire], [],
> +dnl AC_MSG_ERROR([cl_plock_excl_acquire not found, libibeventdb requires libosmcomp]))
> +
> +dnl Checks for typedefs, structures, and compiler characteristics.
> +AC_C_CONST
> +AC_C_INLINE
> +AC_TYPE_SIZE_T
> +AC_HEADER_TIME
> +
> +dnl We use --version-script with ld if possible
> +AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script,
> +    if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then
> +        ac_cv_version_script=yes
> +    else
> +        ac_cv_version_script=no
> +    fi)
> +
> +AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes")
> +
> +dnl Support debug mode build - if enable-debug provided the DEBUG variable is set
> +AC_ARG_ENABLE(debug,
> +[  --enable-debug Turn on debug mode],
> +[case "${enableval}" in
> +  yes) debug=true ;;
> +  no)  debug=false ;;
> +  *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;;
> +esac],[debug=false])
> +AM_CONDITIONAL(DEBUG, test x$debug = xtrue)
> +
> +# we have to revive the env CFLAGS as some how they are being overwritten...
> +# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering
> +# for why they should NEVER be modified by the configure to allow for user
> +# overrides.
> +CFLAGS=$ac_env_CFLAGS_value
> +
> +
> +AC_CONFIG_FILES([Makefile libibeventdb.spec])
> +AC_OUTPUT
> diff --git a/osm/eventdb/libibeventdb.map b/osm/eventdb/libibeventdb.map
> new file mode 100644
> index 0000000..ca4f78c
> --- /dev/null
> +++ b/osm/eventdb/libibeventdb.map
> @@ -0,0 +1,5 @@
> +OSMPMDB_1.0 {
> +	global:
> +      __osm_event_db;
> +	local: *;
> +};
> diff --git a/osm/eventdb/libibeventdb.spec.in b/osm/eventdb/libibeventdb.spec.in
> new file mode 100644
> index 0000000..ac66545
> --- /dev/null
> +++ b/osm/eventdb/libibeventdb.spec.in
> @@ -0,0 +1,38 @@
> +
> +%define ver @VERSION@
> +%define RELEASE 1
> +%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE}
> +
> +Summary: OpenIB InfiniBand OpenSM Component Library
> +Name: libibeventdb
> +Version: %ver
> +Release: %rel%{?dist}
> +License: GPL/BSD
> +Group: System Environment/Libraries
> +BuildRoot: %{_tmppath}/%{name}-%{version}-root
> +Source: http://openib.org/downloads/%{name}-%{version}.tar.gz
> +Url: http://openib.org/
> +Requires: opensm
> +
> +%description
> +libibeventdb provides a default plugin for the OpenSM event database
> +
> +%prep
> +%setup -q
> +
> +%build
> +%configure
> +make
> +
> +%install
> +make DESTDIR=${RPM_BUILD_ROOT} install
> +# remove unpackaged files from the buildroot
> +rm -f $RPM_BUILD_ROOT%{_libdir}/*.la
> +
> +%clean
> +rm -rf $RPM_BUILD_ROOT
> +
> +%files
> +%defattr(-,root,root)
> +%{_libdir}/libibeventdb*.so.*
> +%doc ChangeLog
> diff --git a/osm/eventdb/libibeventdb.ver b/osm/eventdb/libibeventdb.ver
> new file mode 100644
> index 0000000..7a703b7
> --- /dev/null
> +++ b/osm/eventdb/libibeventdb.ver
> @@ -0,0 +1,9 @@
> +# In this file we track the current API version
> +# of the vendor interface (and libraries)
> +# The version is built of the following 
> +# tree numbers:
> +# API_REV:RUNNING_REV:AGE
> +# API_REV - advance on any added API
> +# RUNNING_REV - advance any change to the vendor files
> +# AGE - number of backward versions the API still supports
> +LIBVERSION=1:0:0
> diff --git a/osm/eventdb/src/ibeventdb.c b/osm/eventdb/src/ibeventdb.c
> new file mode 100644
> index 0000000..e98f85c
> --- /dev/null
> +++ b/osm/eventdb/src/ibeventdb.c
> @@ -0,0 +1,622 @@
> +/*
> + * Copyright (c) 2007 The Regents of the University of California.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +
> +#if HAVE_CONFIG_H
> +#  include <config.h>
> +#endif /* HAVE_CONFIG_H */
> +
> +#include <errno.h>
> +#include <string.h>
> +#include <stdlib.h>
> +#include <time.h>
> +#include <dlfcn.h>
> +#include <stdint.h>
> +#include <opensm/osm_event_db.h>
> +#include <complib/cl_qmap.h>
> +#include <complib/cl_passivelock.h>
> +
> +/**
> + * Port counter object.
> + * Store all the port counters for a single port.
> + */
> +typedef struct _osm_event_pc {
> +	struct {
> +		uint64_t symbol_err_cnt;
> +		uint64_t link_err_recover;
> +		uint64_t link_downed;
> +		uint64_t rcv_err;
> +		uint64_t rcv_rem_phys_err;
> +		uint64_t rcv_switch_relay_err;
> +		uint64_t xmit_discards;
> +		uint64_t xmit_constraint_err;
> +		uint64_t rcv_constraint_err;
> +		uint64_t link_int_err;
> +		uint64_t buffer_overrun_err;
> +		uint64_t vl15_dropped;
> +		uint64_t xmit_data;
> +		uint64_t rcv_data;
> +		uint64_t xmit_pkts;
> +		uint64_t rcv_pkts;
> +		time_t   last_reset;
> +	} totals;
> +	osm_pc_reading_t previous;
> +} osm_event_pc_t;
> +
> +/**
> + * group port counters for ports into the nodes
> + */
> +typedef struct _osm_pc_node {
> +	cl_map_item_t  map_item; /* must be first */
> +	uint64_t       node_guid;
> +	osm_event_pc_t   *ports;
> +	uint8_t        num_ports;
> +} osm_pc_node_t;

Is it really necessary to keep osm_pc_node_t nodes in a separate db (qmap)?
Why not reuse the already existing maps in osm_subn_t (we could add a
'void *pm_data' field or similar to the osm_physp_t structure)?
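
Roughly like this. It is a sketch only: sketch_physp stands in for
osm_physp_t, and the pm_data field and the helper names are assumptions of
mine, not existing OpenSM code. The perf manager hangs its per-port data
off the object the subnet maps already hold, so no second lookup by GUID is
needed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for osm_physp_t; only the hypothetical pm_data field matters. */
struct sketch_physp {
	uint8_t port_num;
	void *pm_data;		/* opaque, owned by the perf manager */
};

/* Per-port perf manager data (a small subset of counters, for brevity). */
struct sketch_pm_port_data {
	uint64_t symbol_err_cnt;
	uint64_t xmit_data;
};

/* Return the port's perf data, allocating it zeroed on first use. */
struct sketch_pm_port_data *sketch_pm_get(struct sketch_physp *physp)
{
	if (!physp->pm_data)
		physp->pm_data = calloc(1, sizeof(struct sketch_pm_port_data));
	return physp->pm_data;	/* NULL if the allocation failed */
}

/* Drop the perf data, e.g. when the port object is destroyed. */
void sketch_pm_release(struct sketch_physp *physp)
{
	free(physp->pm_data);
	physp->pm_data = NULL;
}
```

The port object's own lifetime and locking would then cover the counters
too, which is part of why I would prefer it over a separate map.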

> +
> +/**
> + * all nodes in the system.
> + */
> +typedef struct _osm_pc_db {
> +	cl_qmap_t   pc_data; /* stores type (osm_pc_node_t *) */
> +	cl_plock_t  lock;
> +	osm_log_t  *osm_log;
> +} osm_pc_db_t;
> +
> +
> +/** =========================================================================
> + */
> +static void *
> +db_construct(osm_log_t *osm_log)
> +{
> +	/* use the default */
> +	osm_pc_db_t *db = malloc(sizeof(*db));
> +	if (!db) {
> +		return (NULL);
> +	}
> +	cl_plock_construct(&(db->lock));
> +	cl_plock_init(&(db->lock));
> +	cl_qmap_init(&(db->pc_data));
> +	db->osm_log = osm_log;
> +	return ((void *)db);
> +}
> +
> +/** =========================================================================
> + */
> +static void
> +db_destroy(void *_db)
> +{
> +	osm_pc_db_t *db = (osm_pc_db_t *)_db;
> +	cl_plock_excl_acquire(&(db->lock));
> +	/* remove all the items in the qmap */
> +	while (!cl_is_qmap_empty(&(db->pc_data))) {
> +		cl_map_item_t *rc = cl_qmap_head(&(db->pc_data));
> +		cl_qmap_remove_item(&(db->pc_data), rc);
> +	}
> +	cl_plock_release(&(db->lock));
> +	cl_plock_destroy(&(db->lock));
> +	free(db);
> +}
> +
> +/** =========================================================================
> + */
> +static osm_pc_node_t *
> +malloc_node(void *_db, uint64_t guid, uint8_t num_ports)
> +{
> +	int            i = 0;
> +	time_t         cur_time = 0;
> +	osm_pc_node_t *rc = malloc(sizeof(*rc));
> +	if (!rc)
> +		return (NULL);
> +
> +	rc->ports = calloc(num_ports, sizeof(osm_event_pc_t));
> +	if (!rc->ports) {
> +		goto free_rc;
> +	}
> +	rc->num_ports = num_ports;
> +	rc->node_guid = guid;
> +
> +	cur_time = time(NULL);
> +	for (i = 0; i < num_ports; i++) {
> +		rc->ports[i].totals.last_reset = cur_time;
> +		rc->ports[i].previous.time = cur_time;
> +	}
> +
> +	return (rc);
> +free_rc:
> +	free(rc);
> +	return (NULL);
> +}
> +
> +/** =========================================================================
> + */
> +static void
> +free_node(osm_pc_node_t *node)
> +{
> +	if (!node)
> +		return;
> +	if (node->ports)
> +		free(node->ports);
> +	free(node);
> +}
> +
> +/* insert nodes to the database */
> +static osm_event_db_err_t
> +insert(void *_db, osm_pc_node_t *node)
> +{
> +	osm_pc_db_t *db = (osm_pc_db_t *)_db;
> +	cl_map_item_t *rc = cl_qmap_insert(&(db->pc_data), node->node_guid, (cl_map_item_t *)node);
> +	if ((void *)rc != (void *)node)
> +		return (OSM_EVENT_DB_FAIL);
> +	return (OSM_EVENT_DB_SUCCESS);
> +}
> +
> +/**********************************************************************
> + * Internal call db->lock should be held when calling
> + **********************************************************************/
> +static inline osm_pc_node_t *
> +get(void *_db, uint64_t guid)
> +{
> +	osm_pc_db_t *db = (osm_pc_db_t *)_db;
> +	cl_map_item_t       *rc = cl_qmap_get(&(db->pc_data), guid);
> +	const cl_map_item_t *end = cl_qmap_end(&(db->pc_data));
> +	if (rc == end)
> +		return (NULL);
> +	return ((osm_pc_node_t *)rc);
> +}
> +
> +/** =========================================================================
> + */
> +static osm_event_db_err_t
> +db_create_entry(void *_db, uint64_t guid, uint8_t num_ports)
> +{
> +  osm_pc_db_t        *db = (osm_pc_db_t *)_db;
> +  osm_event_db_err_t  rc = OSM_EVENT_DB_SUCCESS;
> +  cl_plock_excl_acquire(&(db->lock));
> +  if (!get(db, guid)) {
> +        osm_pc_node_t *pc_node = malloc_node(db, guid, num_ports);
> +	if (!pc_node) {
> +		rc = OSM_EVENT_DB_NOMEM;
> +		goto Exit;
> +	}
> +	if (insert(db, pc_node)) {
> +		free_node(pc_node);
> +		rc = OSM_EVENT_DB_FAIL;
> +		goto Exit;
> +	}
> +  }
> +Exit:
> +  cl_plock_release(&(db->lock));
> +  return (rc);
> +}
> +
> +/**********************************************************************
> + **********************************************************************/
> +static osm_event_db_err_t
> +db_get_prev(void *_db, uint64_t guid,
> +		uint8_t port, osm_pc_reading_t *reading)
> +{
> +	osm_pc_db_t *db = (osm_pc_db_t *)_db;
> +	osm_pc_node_t       *node = NULL;
> +	cl_map_item_t       *rc = NULL;
> +	const cl_map_item_t *end = NULL;
> +
> +	cl_plock_acquire(&(db->lock));
> +
> +	rc = cl_qmap_get(&(db->pc_data), guid);
> +	end = cl_qmap_end(&(db->pc_data));
> +	if (rc == end)
> +		return (OSM_EVENT_DB_GUIDNOTFOUND);
> +
> +	node = (osm_pc_node_t *)rc;
> +	if (port >= node->num_ports)
> +		return (OSM_EVENT_DB_PORTNOTFOUND);
> +
> +	*reading = node->ports[port].previous;
> +
> +	cl_plock_release(&(db->lock));
> +	return (OSM_EVENT_DB_SUCCESS);
> +}
> +
> +/**********************************************************************
> + * Output a tab deliminated output of the port counters
> + **********************************************************************/
> +static void
> +__dump_node_mr(osm_pc_node_t *node, FILE *fp)
> +{
> +	int i = 0;
> +
> +	fprintf(fp, "\nGUID            Port\t%s\t%s\t"
> +			"%s\t%s\t%s\t%s\t%s\t%s\t%s\t"
> +			"%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
> +			"symbol_err_cnt",
> +			"link_err_recover",
> +			"link_downed",
> +			"rcv_err",
> +			"rcv_rem_phys_err",
> +			"rcv_switch_relay_err",
> +			"xmit_discards",
> +			"xmit_constraint_err",
> +			"rcv_constraint_err",
> +			"link_int_err",
> +			"buf_overrun_err",
> +			"vl15_dropped",
> +			"xmit_data",
> +			"rcv_data",
> +			"xmit_pkts",
> +			"rcv_pkts");
> +	for (i = 1; i < node->num_ports; i++)
> +	{
> +		fprintf(fp, "0x%" PRIx64 "\t%d\t%"PRIu64"\t%"PRIu64"\t"
> +			"%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t"
> +			"%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t"
> +			"%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t"
> +			"%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t%"PRIu64"\n",
> +			node->node_guid,
> +			i,
> +			node->ports[i].totals.symbol_err_cnt,
> +			node->ports[i].totals.link_err_recover,
> +			node->ports[i].totals.link_downed,
> +			node->ports[i].totals.rcv_err,
> +			node->ports[i].totals.rcv_rem_phys_err,
> +			node->ports[i].totals.rcv_switch_relay_err,
> +			node->ports[i].totals.xmit_discards,
> +			node->ports[i].totals.xmit_constraint_err,
> +			node->ports[i].totals.rcv_constraint_err,
> +			node->ports[i].totals.link_int_err,
> +			node->ports[i].totals.buffer_overrun_err,
> +			node->ports[i].totals.vl15_dropped,
> +			node->ports[i].totals.xmit_data,
> +			node->ports[i].totals.rcv_data,
> +			node->ports[i].totals.xmit_pkts,
> +			node->ports[i].totals.rcv_pkts
> +			);
> +	}
> +}
> +
> +/**********************************************************************
> + * Output a human readable output of the port counters
> + **********************************************************************/
> +static void
> +__dump_node_hr(osm_pc_node_t *node, FILE *fp)
> +{
> +	int i = 0;
> +
> +	fprintf(fp, "\n");
> +	for (i = 1; i < node->num_ports; i++)
> +	{
> +		fprintf(fp, "GUID 0x%"PRIx64": Port %d:\n"
> +			"     symbol_err_cnt: %"PRIu64"\n"
> +			"     link_err_recover: %"PRIu64"\n"
> +			"     link_downed: %"PRIu64"\n"
> +			"     rcv_err: %"PRIu64"\n"
> +			"     rcv_rem_phys_err: %"PRIu64"\n"
> +			"     rcv_switch_relay_err: %"PRIu64"\n"
> +			"     xmit_discards: %"PRIu64"\n"
> +			"     xmit_constraint_err: %"PRIu64"\n"
> +			"     rcv_constraint_err: %"PRIu64"\n"
> +			"     link_int_err: %"PRIu64"\n"
> +			"     buf_overrun_err: %"PRIu64"\n"
> +			"     vl15_dropped: %"PRIu64"\n"
> +			"     xmit_data: %"PRIu64"\n"
> +			"     rcv_data: %"PRIu64"\n"
> +			"     xmit_pkts: %"PRIu64"\n"
> +			"     rcv_pkts: %"PRIu64"\n"
> +			,
> +			node->node_guid,
> +			i,
> +			node->ports[i].totals.symbol_err_cnt,
> +			node->ports[i].totals.link_err_recover,
> +			node->ports[i].totals.link_downed,
> +			node->ports[i].totals.rcv_err,
> +			node->ports[i].totals.rcv_rem_phys_err,
> +			node->ports[i].totals.rcv_switch_relay_err,
> +			node->ports[i].totals.xmit_discards,
> +			node->ports[i].totals.xmit_constraint_err,
> +			node->ports[i].totals.rcv_constraint_err,
> +			node->ports[i].totals.link_int_err,
> +			node->ports[i].totals.buffer_overrun_err,
> +			node->ports[i].totals.vl15_dropped,
> +			node->ports[i].totals.xmit_data,
> +			node->ports[i].totals.rcv_data,
> +			node->ports[i].totals.xmit_pkts,
> +			node->ports[i].totals.rcv_pkts
> +			);
> +	}
> +}
> +
> +/* Define a context for the __db_dump callback */
> +typedef struct {
> +	FILE                *fp;
> +	osm_event_db_dump_t  dump_type;
> +} dump_context_t;
> +
> +/**********************************************************************
> + **********************************************************************/
> +static void
> +__db_dump(cl_map_item_t * const p_map_item, void *context )
> +{
> +	osm_pc_node_t  *node = (osm_pc_node_t *)p_map_item;
> +	dump_context_t *c = (dump_context_t *)context;
> +	FILE           *fp = c->fp;
> +
> +	switch (c->dump_type)
> +	{
> +		case OSM_EVENT_DB_DUMP_MR:
> +			__dump_node_mr(node, fp);
> +			break;
> +		case OSM_EVENT_DB_DUMP_HR:
> +		default:
> +			__dump_node_hr(node, fp);
> +			break;
> +	}
> +}
> +
> +/**********************************************************************
> + * dump the data to the file "file"
> + **********************************************************************/
> +static osm_event_db_err_t
> +db_dump(void *_db, char *file, osm_event_db_dump_t dump_type)
> +{
> +	osm_pc_db_t    *db = (osm_pc_db_t *)_db;
> +	dump_context_t  context;
> +
> +	context.fp = fopen(file, "w+");
> +	if (!context.fp)
> +		return (OSM_EVENT_DB_FAIL);
> +	context.dump_type = dump_type;
> +
> +	cl_plock_acquire(&(db->lock));
> +        cl_qmap_apply_func(&(db->pc_data), __db_dump, (void *)&context);
> +	cl_plock_release(&(db->lock));
> +	fclose(context.fp);
> +	return (OSM_EVENT_DB_SUCCESS);
> +}
> +
> +/**********************************************************************
> + * call back to support the below
> + **********************************************************************/
> +static void
> +__clear_counters(cl_map_item_t * const p_map_item, void *context )
> +{
> +	osm_pc_node_t *node = (osm_pc_node_t *)p_map_item;
> +	int            i = 0;
> +	for (i = 0; i < node->num_ports; i++) {
> +		node->ports[i].totals.symbol_err_cnt = 0;
> +		node->ports[i].totals.link_err_recover = 0;
> +		node->ports[i].totals.link_downed = 0;
> +		node->ports[i].totals.rcv_err = 0;
> +		node->ports[i].totals.rcv_rem_phys_err = 0;
> +		node->ports[i].totals.rcv_switch_relay_err = 0;
> +		node->ports[i].totals.xmit_discards = 0;
> +		node->ports[i].totals.xmit_constraint_err = 0;
> +		node->ports[i].totals.rcv_constraint_err = 0;
> +		node->ports[i].totals.link_int_err = 0;
> +		node->ports[i].totals.buffer_overrun_err = 0;
> +		node->ports[i].totals.vl15_dropped = 0;
> +		node->ports[i].totals.xmit_data = 0;
> +		node->ports[i].totals.rcv_data = 0;
> +		node->ports[i].totals.xmit_pkts = 0;
> +		node->ports[i].totals.rcv_pkts = 0;
> +		node->ports[i].totals.last_reset = time(NULL);
> +	}
> +}
> +
> +/**********************************************************************
> + * Clear the counters from the db
> + **********************************************************************/
> +static void
> +db_clear_port_counters(void *_db)
> +{
> +	osm_pc_db_t *db = (osm_pc_db_t *)_db;
> +	cl_plock_excl_acquire(&(db->lock));
> +	cl_qmap_apply_func(&(db->pc_data), __clear_counters, (void *)db);
> +	cl_plock_release(&(db->lock));
> +}
> +
> +#if 0
> +/**********************************************************************
> + * Dump a reading vs the previous reading to stdout
> + **********************************************************************/
> +static void
> +dump_reading(osm_event_pc_t *port, ib_port_counters_t *cur)
> +{
> +	printf("sym %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->symbol_err_cnt),
> +			cl_ntoh16(port->previous.reading.symbol_err_cnt), port->totals.symbol_err_cnt);
> +	printf("ler %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->link_err_recover),
> +		cl_ntoh16(port->previous.reading.link_err_recover), port->totals.link_err_recover);
> +	printf("ld %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->link_downed),
> +		cl_ntoh16(port->previous.reading.link_downed), port->totals.link_downed);
> +	printf("re %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->rcv_err),
> +		cl_ntoh16(port->previous.reading.rcv_err), port->totals.rcv_err);
> +	printf("rrp %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->rcv_rem_phys_err),
> +		cl_ntoh16(port->previous.reading.rcv_rem_phys_err), port->totals.rcv_rem_phys_err);
> +	printf("rsr %u - %u (%" PRIx64 ")\n",
> +		cl_ntoh16(cur->rcv_switch_relay_err),
> +		cl_ntoh16(port->previous.reading.rcv_switch_relay_err), port->totals.rcv_switch_relay_err);
> +	printf("xd %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->xmit_discards),
> +		cl_ntoh16(port->previous.reading.xmit_discards), port->totals.xmit_discards);
> +	printf("xce %u - %u (%" PRIx64 ")\n",
> +		cl_ntoh16(cur->xmit_constraint_err),
> +		cl_ntoh16(port->previous.reading.xmit_constraint_err), port->totals.xmit_constraint_err);
> +	printf("rce %u - %u (%" PRIx64 ")\n",
> +		cl_ntoh16(cur->rcv_constraint_err),
> +		cl_ntoh16(port->previous.reading.rcv_constraint_err), port->totals.rcv_constraint_err);
> +	printf("li %x - %x (%" PRIx64 ")\n",
> +		cl_ntoh16(cur->link_int_buffer_overrun),
> +		cl_ntoh16(port->previous.reading.link_int_buffer_overrun), port->totals.link_int_err);
> +	printf("bo %x - %x (%" PRIx64 ")\n",
> +		cl_ntoh16(cur->link_int_buffer_overrun),
> +		cl_ntoh16(port->previous.reading.link_int_buffer_overrun), port->totals.buffer_overrun_err);
> +	printf("vld %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->vl15_dropped),
> +		cl_ntoh16(port->previous.reading.vl15_dropped), port->totals.vl15_dropped);
> +	
> +	printf("xd %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->xmit_data),
> +		cl_ntoh32(port->previous.reading.xmit_data), port->totals.xmit_data);
> +	printf("rd %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->rcv_data),
> +		cl_ntoh32(port->previous.reading.rcv_data), port->totals.rcv_data);
> +	printf("xp %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->xmit_pkts),
> +		cl_ntoh32(port->previous.reading.xmit_pkts), port->totals.xmit_pkts);
> +	printf("rp %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->rcv_pkts),
> +		cl_ntoh32(port->previous.reading.rcv_pkts), port->totals.rcv_pkts);
> +}
> +#endif
> +
> +/**********************************************************************
> + * Add the reading to the osm_pc_node_t
> + **********************************************************************/
> +static osm_event_db_err_t
> +db_clear_prev_pc(void *_db, uint64_t guid, uint8_t port)
> +{
> +	osm_pc_db_t *db = (osm_pc_db_t *)_db;
> +	osm_event_pc_t        *p_port = NULL;
> +	osm_pc_node_t      *p_node = NULL;
> +	ib_port_counters_t *previous = NULL;
> +	osm_event_db_err_t     rc = OSM_EVENT_DB_SUCCESS;
> +
> +	cl_plock_excl_acquire(&(db->lock));
> +	p_node = get(db, guid);
> +
> +	if (!p_node)
> +		return (OSM_EVENT_DB_GUIDNOTFOUND);
> +
> +	if (port >= p_node->num_ports)
> +		return (OSM_EVENT_DB_PORTNOTFOUND);
> +
> +	p_port = &(p_node->ports[port]);
> +	previous = &(p_node->ports[port].previous.reading);
> +
> +	memset(previous, 0, sizeof(*previous));
> +	p_port->previous.time = time(NULL);
> +
> +	cl_plock_release(&(db->lock));
> +	return (rc);
> +}
> +
> +/**********************************************************************
> + * Add the reading to the osm_pc_node_t
> + **********************************************************************/
> +static osm_event_db_err_t
> +db_add_reading(void *_db, uint64_t guid,
> +                   uint8_t port, ib_port_counters_t *reading)
> +{
> +	osm_pc_db_t *db = (osm_pc_db_t *)_db;
> +	osm_event_pc_t        *p_port = NULL;
> +	osm_pc_node_t      *p_node = NULL;
> +	ib_port_counters_t *previous = NULL;
> +	osm_event_db_err_t     rc = OSM_EVENT_DB_SUCCESS;
> +
> +	cl_plock_excl_acquire(&(db->lock));
> +	p_node = get(db, guid);
> +
> +	if (!p_node)
> +		return (OSM_EVENT_DB_GUIDNOTFOUND);
> +
> +	if (port >= p_node->num_ports)
> +		return (OSM_EVENT_DB_PORTNOTFOUND);
> +
> +	p_port = &(p_node->ports[port]);
> +	previous = &(p_node->ports[port].previous.reading);
> +
> +#if 0
> +	dump_reading(p_port, reading);
> +#endif
> +
> +	/* calculate changes from previous reading */
> +	p_port->totals.symbol_err_cnt
> +		+= (cl_ntoh16(reading->symbol_err_cnt)
> +				- cl_ntoh16(previous->symbol_err_cnt));
> +	p_port->totals.link_err_recover
> +		+= (reading->link_err_recover - previous->link_err_recover);
> +	p_port->totals.link_downed
> +		+= (reading->link_downed - previous->link_downed);
> +	p_port->totals.rcv_err
> +		+= (cl_ntoh16(reading->rcv_err)
> +				- cl_ntoh16(previous->rcv_err));
> +	p_port->totals.rcv_rem_phys_err
> +		+= (cl_ntoh16(reading->rcv_rem_phys_err)
> +				- cl_ntoh16(previous->rcv_rem_phys_err));
> +	p_port->totals.rcv_switch_relay_err
> +		+= (cl_ntoh16(reading->rcv_switch_relay_err)
> +				- cl_ntoh16(previous->rcv_switch_relay_err));
> +	p_port->totals.xmit_discards
> +		+= (cl_ntoh16(reading->xmit_discards)
> +				- cl_ntoh16(previous->xmit_discards));
> +	p_port->totals.xmit_constraint_err
> +		+= (reading->xmit_constraint_err - previous->xmit_constraint_err);
> +	p_port->totals.rcv_constraint_err
> +		+= (reading->rcv_constraint_err - previous->rcv_constraint_err);
> +	p_port->totals.link_int_err
> +		+= PC_LINK_INT(reading->link_int_buffer_overrun)
> +			- PC_LINK_INT(previous->link_int_buffer_overrun);
> +	p_port->totals.buffer_overrun_err
> +		+= PC_BUF_OVERRUN(reading->link_int_buffer_overrun)
> +			- PC_BUF_OVERRUN(previous->link_int_buffer_overrun);
> +	p_port->totals.vl15_dropped
> +		+= (cl_ntoh16(reading->vl15_dropped)
> +				- cl_ntoh16(previous->vl15_dropped));
> +	
> +	p_port->totals.xmit_data
> +		+= (cl_ntoh32(reading->xmit_data)
> +				- cl_ntoh32(previous->xmit_data));
> +	p_port->totals.rcv_data
> +		+= (cl_ntoh32(reading->rcv_data)
> +				- cl_ntoh32(previous->rcv_data));
> +	p_port->totals.xmit_pkts
> +		+= (cl_ntoh32(reading->xmit_pkts)
> +				- cl_ntoh32(previous->xmit_pkts));
> +	p_port->totals.rcv_pkts
> +		+= (cl_ntoh32(reading->rcv_pkts)
> +				- cl_ntoh32(previous->rcv_pkts));
> +
> +	p_port->previous.reading = *reading;
> +	p_port->previous.time = time(NULL);
> +
> +	cl_plock_release(&(db->lock));
> +	return (rc);
> +}
> +
> +/** =========================================================================
> + * Define the object symbol for loading
> + */
> +__osm_event_db_t __osm_event_db =
> +{
> +interface_version: OSM_EVENT_DB_INTERFACE_VER,
> +construct : db_construct,
> +destroy : db_destroy,
> +create_entry : db_create_entry,
> +get_prev_pc : db_get_prev,
> +dump : db_dump,
> +clear_port_counters : db_clear_port_counters,
> +add_pc_reading : db_add_reading,
> +clear_prev_pc : db_clear_prev_pc
> +};
> +
> diff --git a/osm/include/Makefile.am b/osm/include/Makefile.am
> index 8499d3b..fd874c8 100644
> --- a/osm/include/Makefile.am
> +++ b/osm/include/Makefile.am
> @@ -87,6 +87,8 @@ EXTRA_DIST = \
>  	$(srcdir)/opensm/osm_drop_mgr.h \
>  	$(srcdir)/opensm/osm_port_info_rcv.h \
>  	$(srcdir)/opensm/osm_state_mgr_ctrl.h \
> +	$(srcdir)/opensm/osm_perfmgr.h \
> +	$(srcdir)/opensm/osm_event_db.h \
>  	$(srcdir)/complib/cl_thread_osd.h \
>  	$(srcdir)/complib/cl_packon.h \
>  	$(srcdir)/complib/cl_atomic_osd.h \
> diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
> index b3937cb..2a4057b 100644
> --- a/osm/include/iba/ib_types.h
> +++ b/osm/include/iba/ib_types.h
> @@ -7353,6 +7353,80 @@ typedef struct _ib_inform_info_record
>  }	PACK_SUFFIX ib_inform_info_record_t;
>  #include <complib/cl_packoff.h>
>  
> +/****s* IBA Base: Types/ib_perfmgr_mad_t
> +* NAME
> +*	ib_perfmgr_mad_t
> +*
> +* DESCRIPTION
> +*	IBA defined Perf Management MAD (16.3.1)
> +*
> +* SYNOPSIS
> +*/
> +#include <complib/cl_packon.h>
> +typedef struct _ib_perfmgr_mad
> +{
> +	ib_mad_t		header;
> +	uint8_t			resv[40];
> +
> +#define	IB_PM_DATA_SIZE		192
> +	uint8_t			data[IB_PM_DATA_SIZE];
> +
> +}	PACK_SUFFIX ib_perfmgr_mad_t;
> +#include <complib/cl_packoff.h>
> +/*
> +* FIELDS
> +*	header
> +*		Common MAD header.
> +*
> +*	resv
> +*		Reserved.
> +*
> +*	data
> +*		Performance Management payload.  The structure and content of this field
> +*		depend upon the method, attr_id, and attr_mod fields in the header.
> +*
> +* SEE ALSO
> +* ib_mad_t
> +*********/
> +
> +/****s* IBA Base: Types/ib_port_counters
> +* NAME
> +*	ib_port_counters_t
> +*
> +* DESCRIPTION
> +*	IBA defined PortCounters Attribute. (16.1.3.5)
> +*
> +* SYNOPSIS
> +*/
> +#include <complib/cl_packon.h>
> +typedef struct _ib_port_counters
> +{
> +	uint8_t 			reserved;
> +	uint8_t                         port_select;
> +	ib_net16_t                      counter_select;
> +	ib_net16_t                      symbol_err_cnt;
> +	uint8_t                         link_err_recover;
> +	uint8_t                         link_downed;
> +	ib_net16_t                      rcv_err;
> +	ib_net16_t                      rcv_rem_phys_err;
> +	ib_net16_t                      rcv_switch_relay_err;
> +	ib_net16_t                      xmit_discards;
> +	uint8_t                         xmit_constraint_err;
> +	uint8_t                         rcv_constraint_err;
> +	uint8_t                         res1;
> +	uint8_t                         link_int_buffer_overrun;
> +	ib_net16_t                      res2;
> +	ib_net16_t                      vl15_dropped;
> +	ib_net32_t                      xmit_data;
> +	ib_net32_t                      rcv_data;
> +	ib_net32_t                      xmit_pkts;
> +	ib_net32_t                      rcv_pkts;
> +}	PACK_SUFFIX ib_port_counters_t;
> +#include <complib/cl_packoff.h>
> +
> +#define PC_LINK_INT(integ_buf_over) ((integ_buf_over & 0xF0) >> 4)
> +#define PC_BUF_OVERRUN(integ_buf_over) (integ_buf_over & 0x0F)
> +
>  /****d* IBA Base: Types/DM_SVC_NAME
>  * NAME
>  *	DM_SVC_NAME
> diff --git a/osm/include/opensm/osm_base.h b/osm/include/opensm/osm_base.h
> index b38b511..51cef49 100644
> --- a/osm/include/opensm/osm_base.h
> +++ b/osm/include/opensm/osm_base.h
> @@ -448,6 +448,29 @@ BEGIN_C_DECLS
>  */
>  #define OSM_SM_DEFAULT_QP1_SEND_SIZE 256
>  
> +/****d* OpenSM: Base/OSM_PM_DEFAULT_QP1_RCV_SIZE
> +* NAME
> +*   OSM_PM_DEFAULT_QP1_RCV_SIZE
> +*
> +* DESCRIPTION
> +*   Specifies the default size (in MADs) of the QP1 receive queue
> +*
> +* SYNOPSIS
> +*/
> +#define OSM_PM_DEFAULT_QP1_RCV_SIZE 256
> +/***********/
> +
> +/****d* OpenSM: Base/OSM_PM_DEFAULT_QP1_SEND_SIZE
> +* NAME
> +*   OSM_PM_DEFAULT_QP1_SEND_SIZE
> +*
> +* DESCRIPTION
> +*   Specifies the default size (in MADs) of the QP1 send queue
> +*
> +* SYNOPSIS
> +*/
> +#define OSM_PM_DEFAULT_QP1_SEND_SIZE 256
> +
>  
>  /****d* OpenSM: Base/OSM_SM_DEFAULT_POLLING_TIMEOUT_MILLISECS
>  * NAME
> diff --git a/osm/include/opensm/osm_event_db.h b/osm/include/opensm/osm_event_db.h
> new file mode 100644
> index 0000000..17effaf
> --- /dev/null
> +++ b/osm/include/opensm/osm_event_db.h
> @@ -0,0 +1,151 @@
> +/*
> + * Copyright (c) 2007 The Regents of the University of California.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +
> +#ifndef _OSM_EVENT_DB_H_
> +#define _OSM_EVENT_DB_H_
> +
> +#include <time.h>
> +#include <opensm/osm_log.h>
> +#include <iba/ib_types.h>
> +
> +#ifdef __cplusplus
> +#  define BEGIN_C_DECLS extern "C" {
> +#  define END_C_DECLS   }
> +#else /* !__cplusplus */
> +#  define BEGIN_C_DECLS
> +#  define END_C_DECLS
> +#endif /* __cplusplus */
> +
> +BEGIN_C_DECLS
> +
> +/****h* OpenSM/Event Database
> +* DESCRIPTION
> +*       Database interface to record subnet events
> +*
> +*       Implementations of this object _MUST_ be thread safe.
> +*
> +* AUTHOR
> +*	Ira Weiny, LLNL
> +*
> +*********/
> +
> +typedef enum {
> +	OSM_EVENT_DB_SUCCESS = 0,
> +	OSM_EVENT_DB_FAIL,
> +	OSM_EVENT_DB_NOMEM,
> +	OSM_EVENT_DB_GUIDNOTFOUND,
> +	OSM_EVENT_DB_PORTNOTFOUND
> +} osm_event_db_err_t;
> +
> +/** =========================================================================
> + * Port counter reading
> + */
> +typedef struct {
> +	ib_port_counters_t reading;
> +	time_t             time;
> +} osm_pc_reading_t;
> +
> +/** =========================================================================
> + * Dump output options
> + */
> +typedef enum {
> +	OSM_EVENT_DB_DUMP_HR = 0, /* Human readable */
> +	OSM_EVENT_DB_DUMP_MR      /* Machine readable */
> +} osm_event_db_dump_t;
> +
> +/** =========================================================================
> + * Plugin creators should allocate an object of this type
> + *    (name __osm_event_db_t)
> + * The version should be set to OSM_EVENT_DB_INTERFACE_VER
> + */
> +#define OSM_EVENT_DB_INTERFACE_VER (1)
> +typedef struct
> +{
> +	int                 interface_version;
> +	void               *(*construct)(osm_log_t *osm_log);
> +	void                (*destroy)(void *db);
> +	osm_event_db_err_t  (*create_entry)(void *db, uint64_t guid, uint8_t num_ports);
> +	osm_event_db_err_t  (*get_prev_pc)(void *db, uint64_t guid,
> +				uint8_t port, osm_pc_reading_t *reading);
> +	osm_event_db_err_t  (*dump)(void *db, char *file, osm_event_db_dump_t dump_type);
> +	void                (*clear_port_counters)(void *db);
> +	osm_event_db_err_t  (*add_pc_reading)(void *db, uint64_t guid,
> +				uint8_t port, ib_port_counters_t *reading);
> +	osm_event_db_err_t  (*clear_prev_pc)(void *db, uint64_t guid, uint8_t port);
> +} __osm_event_db_t;
> +
> +/** =========================================================================
> + * The database structure which should be considered opaque
> + */
> +typedef struct {
> +	void             *handle;
> +	__osm_event_db_t *db_impl;
> +	void             *db_data;
> +	osm_log_t        *p_log;
> +} osm_event_db_t;
> +
> +
> +/**
> + * functions
> + */
> +osm_event_db_t     *osm_event_db_construct(osm_log_t *p_log, char *type);
> +void                osm_event_db_destroy(osm_event_db_t *db);
> +
> +osm_event_db_err_t  osm_event_db_create_entry(osm_event_db_t *db, uint64_t guid,
> +					uint8_t num_ports);
> +osm_event_db_err_t  osm_event_db_get_prev_pc(osm_event_db_t *db,
> +					uint64_t guid, uint8_t port,
> +					osm_pc_reading_t *reading);
> +osm_event_db_err_t  osm_event_db_dump(osm_event_db_t *db, char *file,
> +					osm_event_db_dump_t dump_type);
> +osm_event_db_err_t  osm_event_db_add_pc_reading(osm_event_db_t *db, uint64_t guid,
> +					uint8_t port, ib_port_counters_t *reading);
> +void                osm_event_db_clear_port_counters(osm_event_db_t *db);
> +osm_event_db_err_t  osm_event_db_clear_prev_pc(osm_event_db_t *db, uint64_t guid,
> +					uint8_t port);
> +
> +#if 0
> +/* work out the tracking of notice (trap) events. */
> +
> +typedef struct {
> +	ib_mad_notice_attr_t reading;
> +	time_t               time;
> +} osm_notice_reading_t;
> +osm_event_db_err_t  osm_event_db_add_notice_reading(osm_event_db_t *db, uint64_t guid,
> +					uint8_t port, ib_mad_notice_attr_t *reading);
> +#endif
> +
> +END_C_DECLS
> +
> +#endif		/* _OSM_EVENT_DB_H_ */
> +
> diff --git a/osm/include/opensm/osm_madw.h b/osm/include/opensm/osm_madw.h
> index 95be0f4..80258f4 100644
> --- a/osm/include/opensm/osm_madw.h
> +++ b/osm/include/opensm/osm_madw.h
> @@ -315,6 +315,19 @@ typedef struct _osm_vla_context
>  } osm_vla_context_t;
>  /*********/
>  
> +/****s* OpenSM: MAD Wrapper/osm_perfmgr_context_t
> +* DESCRIPTION
> +*	Context for Performance manager queries
> +*/
> +typedef struct _osm_perfmgr_context {
> +  uint64_t node_guid;
> +  uint16_t port;
> +  uint8_t num_ports;
> +  uint8_t mad_method; /* was this a get or a set */
> +  struct timeval query_start;
> +} osm_perfmgr_context_t;
> +/*********/
> +
>  #ifndef OSM_VENDOR_INTF_OPENIB
>  /****s* OpenSM: MAD Wrapper/osm_arbitrary_context_t
>  * NAME
> @@ -354,6 +367,7 @@ typedef union _osm_madw_context
>  	osm_slvl_context_t	slvl_context;
>  	osm_pkey_context_t	pkey_context;
>  	osm_vla_context_t	vla_context;
> +	osm_perfmgr_context_t	perfmgr_context;
>  #ifndef OSM_VENDOR_INTF_OPENIB
>  	osm_arbitrary_context_t arb_context;
>  #endif
> @@ -639,6 +653,32 @@ osm_madw_get_sa_mad_ptr(
>  *	MAD Wrapper object, osm_madw_construct, osm_madw_destroy
>  *********/
>  
> +/****f* OpenSM: MAD Wrapper/osm_madw_get_perfmgr_mad_ptr
> +* DESCRIPTION
> +*	Gets a pointer to the PerfMgr MAD in this MAD wrapper.
> +*
> +* SYNOPSIS
> +*/
> +static inline ib_perfmgr_mad_t*
> +osm_madw_get_perfmgr_mad_ptr(
> +	IN const osm_madw_t* const p_madw )
> +{
> +	return((ib_perfmgr_mad_t*)p_madw->p_mad);
> +}
> +/*
> +* PARAMETERS
> +*	p_madw
> +*		[in] Pointer to an osm_madw_t object.
> +*
> +* RETURN VALUES
> +*	Pointer to the start of the PM MAD.
> +*
> +* NOTES
> +*
> +* SEE ALSO
> +*	MAD Wrapper object, osm_madw_construct, osm_madw_destroy
> +*********/
> +
>  /****f* OpenSM: MAD Wrapper/osm_madw_get_ni_context_ptr
>  * NAME
>  *	osm_madw_get_ni_context_ptr
> diff --git a/osm/include/opensm/osm_msgdef.h b/osm/include/opensm/osm_msgdef.h
> index a90e3b9..6732992 100644
> --- a/osm/include/opensm/osm_msgdef.h
> +++ b/osm/include/opensm/osm_msgdef.h
> @@ -186,6 +186,7 @@ enum
>  #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP)
>  	OSM_MSG_MAD_MULTIPATH_RECORD,
>  #endif
> +	OSM_MSG_MAD_PORT_COUNTERS,
>  	OSM_MSG_MAX
>  };
>  
> diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h
> index 482de28..bdaa8f3 100644
> --- a/osm/include/opensm/osm_opensm.h
> +++ b/osm/include/opensm/osm_opensm.h
> @@ -57,6 +57,7 @@
>  #include <opensm/osm_log.h>
>  #include <opensm/osm_sm.h>
>  #include <opensm/osm_sa.h>
> +#include <opensm/osm_perfmgr.h>
>  #include <opensm/osm_db.h>
>  #include <opensm/osm_subnet.h>
>  #include <opensm/osm_mad_pool.h>
> @@ -157,6 +158,9 @@ typedef struct _osm_opensm_t
>    osm_subn_t		subn;
>    osm_sm_t		sm;
>    osm_sa_t		sa;
> +#ifdef ENABLE_OSM_PERF_MGR
> +  osm_perfmgr_t         perfmgr;
> +#endif /* ENABLE_OSM_PERF_MGR */
>    osm_db_t		db;
>    osm_mad_pool_t	mad_pool;
>    osm_vendor_t		*p_vendor;
> diff --git a/osm/include/opensm/osm_perfmgr.h b/osm/include/opensm/osm_perfmgr.h
> new file mode 100644
> index 0000000..6138ec3
> --- /dev/null
> +++ b/osm/include/opensm/osm_perfmgr.h
> @@ -0,0 +1,223 @@
> +/*
> + * Copyright (c) 2007 The Regents of the University of California.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +
> +#ifndef _OSM_PERFMGR_H_
> +#define _OSM_PERFMGR_H_
> +
> +#if HAVE_CONFIG_H
> +#  include <config.h>
> +#endif /* HAVE_CONFIG_H */
> +
> +#ifdef ENABLE_OSM_PERF_MGR
> +
> +#include <iba/ib_types.h>
> +#include <complib/cl_passivelock.h>
> +#include <complib/cl_event.h>
> +#include <complib/cl_thread.h>
> +#include <opensm/osm_subnet.h>
> +#include <opensm/osm_req.h>
> +#include <opensm/osm_log.h>
> +#include <opensm/osm_event_db.h>
> +#include <opensm/osm_sm.h>
> +#include <opensm/osm_base.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif /* __cplusplus */
> +
> +/****h* OpenSM/PERFMGR
> +* NAME
> +*	PERFMGR
> +*
> +* DESCRIPTION
> +*       Performance manager thread which takes care of polling the fabric for
> +*       Port counters values.
> +*
> +*	The PERFMGR object is thread safe.
> +*
> +* AUTHOR
> +*	Ira Weiny, LLNL
> +*
> +*********/
> +
> +#define OSM_PERFMGR_DEFAULT_SWEEP_TIME_S 180
> +#define OSM_PERFMGR_DEFAULT_DUMP_FILE OSM_DEFAULT_TMP_DIR "/osm_port_counters.log"
> +#define OSM_DEFAULT_EVENT_PLUGIN "ibeventdb"
> +
> +/****s* OpenSM: PERFMGR/osm_perfmgr_state_t */
> +typedef enum
> +{
> +  PERFMGR_STATE_DISABLE,
> +  PERFMGR_STATE_ENABLED,
> +  PERFMGR_STATE_NO_DB

Why is PERFMGR_STATE_NO_DB needed? Isn't it duplicated by
(pm->db == NULL)?

As a side effect of this duplication, when the DB was not found I can
still enable perfmgr with a console command, but it then obviously
crashes on the following 'dump'.
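
One possible shape for this, as a rough sketch only: drop the extra state
and have the setter refuse to enable when no database is loaded. The types
below are simplified stand-ins for the ones in the patch, and
perfmgr_set_state_checked() is a hypothetical name, not something the
patch provides.

```c
#include <stddef.h>

/* Simplified stand-ins for the patch's types (assumption: the real
 * osm_perfmgr_t carries many more fields than shown here). */
typedef enum {
	PERFMGR_STATE_DISABLE,
	PERFMGR_STATE_ENABLED
} perfmgr_state_t;

typedef struct {
	perfmgr_state_t state;
	void *db;	/* NULL when no event database plugin was loaded */
} perfmgr_t;

/* Hypothetical checked setter: refuse to enable without a database.
 * Returns 0 on success, -1 if the request was rejected. */
static int perfmgr_set_state_checked(perfmgr_t *pm, perfmgr_state_t state)
{
	if (state == PERFMGR_STATE_ENABLED && pm->db == NULL)
		return -1;
	pm->state = state;
	return 0;
}
```

With something like this the console 'enable' command could print an
error instead of leaving perfmgr enabled with nothing to dump.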

> +} osm_perfmgr_state_t;
> +
> +/****s* OpenSM: PERFMGR/osm_perfmgr_t
> +*  This object should be treated as opaque and should
> +*  be manipulated only through the provided functions.
> +*/
> +typedef struct _osm_perfmgr
> +{
> +  osm_thread_state_t    thread_state;
> +  cl_event_t            sig_sweep;
> +  cl_thread_t           sweeper;
> +  osm_subn_t           *subn;
> +  osm_sm_t             *sm;
> +  cl_plock_t           *lock;
> +  osm_log_t            *log;
> +  osm_mad_pool_t       *mad_pool;
> +  atomic32_t            trans_id;

Do we need a separate transaction ID generator for PerfMgr?

> +  osm_vendor_t         *vendor;
> +  osm_bind_handle_t     bind_handle;
> +  cl_disp_reg_handle_t  pc_disp_h;
> +  osm_perfmgr_state_t   state;
> +  uint16_t              sweep_time_s;
> +  char                 *db_file;
> +  char                 *event_db_dump_file;
> +  char                 *event_db_plugin;
> +  osm_event_db_t       *db;
> +} osm_perfmgr_t;
> +/*
> +* FIELDS
> +*	subn
> +*	      Subnet object for this subnet.
> +*
> +*	log
> +*	      Pointer to the log object.
> +*
> +*	mad_pool
> +*		Pointer to the MAD pool.
> +*
> +*       event_db_dump_file
> +*               File to be used to dump the Port Counters
> +*
> +*	mad_ctrl
> +*		Mad Controller
> +*********/
> +
> +/****f* OpenSM: Creation Functions */
> +void osm_perfmgr_shutdown(osm_perfmgr_t *const p_perfmgr );
> +void osm_perfmgr_destroy(osm_perfmgr_t * const p_perfmgr );
> +
> +/****f* OpenSM: Inline accessor functions */
> +inline static void osm_perfmgr_set_state(osm_perfmgr_t *p_perfmgr,
> +		osm_perfmgr_state_t state)
> +{
> +	p_perfmgr->state = state;
> +}
> +inline static osm_perfmgr_state_t osm_perfmgr_get_state(osm_perfmgr_t
> +		*p_perfmgr) { return (p_perfmgr->state); }
> +inline static char *osm_perfmgr_get_state_str(osm_perfmgr_t *p_perfmgr)
> +{
> +	switch (p_perfmgr->state)
> +	{
> +		case PERFMGR_STATE_DISABLE: return ("Disabled"); break;
> +		case PERFMGR_STATE_ENABLED: return ("Enabled"); break;
> +		case PERFMGR_STATE_NO_DB: return ("No Database"); break;
> +	}
> +	return ("UNKNOWN");
> +}
> +inline static void osm_perfmgr_set_sweep_time_s(osm_perfmgr_t *p_perfmgr, uint16_t time_s)
> +{
> +	p_perfmgr->sweep_time_s = time_s;
> +   cl_event_signal(&p_perfmgr->sig_sweep);
> +}
> +inline static uint16_t osm_perfmgr_get_sweep_time_s(osm_perfmgr_t *p_perfmgr)
> +{
> +	return (p_perfmgr->sweep_time_s);
> +}
> +void osm_perfmgr_clear_counters(osm_perfmgr_t *p_perfmgr);
> +void osm_perfmgr_dump_counters(osm_perfmgr_t *p_perfmgr,
> +		osm_event_db_dump_t dump_type);
> +
> +ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * const p_perfmgr, const ib_net64_t port_guid);
> +
> +#if 0
> +/* Work out the tracking of notice events */
> +ib_api_status_t osm_report_notice_to_perfmgr(osm_log_t *const p_log, osm_subn_t *p_subn,
> +  					ib_mad_notice_attr_t *p_ntc )
> +#endif
> +
> +/****f* OpenSM: PERFMGR/osm_perfmgr_init */
> +ib_api_status_t
> +osm_perfmgr_init(
> +	osm_perfmgr_t* const perfmgr,
> +	osm_subn_t* const subn,
> +        osm_sm_t * const sm,
> +	osm_log_t* const log,
> +	osm_mad_pool_t * const mad_pool,
> +	osm_vendor_t * const vendor,
> +        cl_dispatcher_t* const disp,
> +   	cl_plock_t* const lock,
> +	const osm_subn_opt_t * const p_opt );

The indentation is not uniform (the tab character is preferred) here and in
other places; there is also a lot of trailing whitespace in the patch.
You can run 'git-diff --color' to see the formatting issues.

> +/*
> +* PARAMETERS
> +*	perfmgr
> +*		[in] Pointer to an osm_perfmgr_t object to initialize.
> +*
> +*	subn
> +*		[in] Pointer to the Subnet object for this subnet.
> +*
> +*	sm
> +*		[in] Pointer to the Subnet object for this subnet.
> +*
> +*	log
> +*		[in] Pointer to the log object.
> +*
> +*	mad_pool
> +*		[in] Pointer to the MAD pool.
> +*
> +*	vendor
> +*		[in] Pointer to the vendor specific interfaces object.
> +*
> +*	disp
> +*		[in] Pointer to the OpenSM central Dispatcher.
> +*
> +*	lock
> +*		[in] Pointer to the OpenSM serializing lock.
> +*
> +*	p_opt
> +*		[in] Starting options
> +*
> +* RETURN VALUES
> +*	IB_SUCCESS if the PERFMGR object was initialized successfully.
> +*********/
> +
> +#ifdef __cplusplus
> +}
> +#endif /* __cplusplus */
> +
> +#endif /* ENABLE_OSM_PERF_MGR */
> +
> +#endif		/* _OSM_PERFMGR_H_ */
> +
> diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h
> index fc52b5e..0fdc18b 100644
> --- a/osm/include/opensm/osm_subnet.h
> +++ b/osm/include/opensm/osm_subnet.h
> @@ -291,6 +291,12 @@ typedef struct _osm_subn_opt
>    osm_qos_options_t        qos_rtr_options;
>    boolean_t                enable_quirks;
>    boolean_t                no_clients_rereg;
> +#ifdef ENABLE_OSM_PERF_MGR
> +  boolean_t                perfmgr;
> +  uint16_t                 perfmgr_sweep_time_s;
> +  char *                   event_db_dump_file;
> +  char *                   event_db_plugin;
> +#endif /* ENABLE_OSM_PERF_MGR */
>  } osm_subn_opt_t;
>  /*
>  * FIELDS
> @@ -468,6 +474,18 @@ typedef struct _osm_subn_opt
>  *	sm_inactive
>  *		OpenSM will start with SM in not active state.
>  *	
> +*	perfmgr
> +*		Enable or disable the performance manager
> +*
> +*	perfmgr_sweep_time_s
> +*		Define the period of PM sweep (in seconds).
> +*
> +*       event_db_dump_file
> +*               File to dump the event database to
> +*
> +*       event_db_plugin
> +*               specify the name of the event plugin
> +*
>  *	qos_options
>  *		Default set of QoS options
>  *
> diff --git a/osm/opensm.spec.in b/osm/opensm.spec.in
> index c4e1798..8857a7b 100644
> --- a/osm/opensm.spec.in
> +++ b/osm/opensm.spec.in
> @@ -38,10 +38,19 @@ Static libraries and header files for Op
>  %define _disable_console_socket --disable-console-socket
>  %endif
>  
> +%if %{?_with_perf_mgr:1}%{!?_with_perf_mgr:0}
> +%define _enable_perf_mgr --enable-perf-mgr
> +%endif
> +%if %{?_without_perf_mgr:1}%{!?_without_perf_mgr:0}
> +%define _disable_perf_mgr --disable-perf-mgr
> +%endif
> +
>  %build
>  %configure \
>          %{?_enable_console_socket} \
> -        %{?_disable_console_socket}
> +        %{?_disable_console_socket} \
> +        %{?_enable_perf_mgr} \
> +        %{?_disable_perf_mgr}
>  make %{?_smp_mflags}
>  
>  %install
> diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am
> index e2520b8..9a1f6f4 100644
> --- a/osm/opensm/Makefile.am
> +++ b/osm/opensm/Makefile.am
> @@ -55,7 +55,8 @@ opensm_SOURCES = main.c osm_console.c os
>  		 osm_trap_rcv.c osm_ucast_mgr.c osm_ucast_updn.c \
>  		 osm_ucast_lash.c osm_ucast_file.c osm_ucast_ftree.c \
>  		 osm_vl15intf.c osm_vl_arb_rcv.c \
> -		 st.c
> +		 st.c \
> +		 osm_perfmgr.c osm_event_db.c
>  if OSMV_OPENIB
>  opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
>  opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
> @@ -78,7 +79,7 @@ endif
>  # we always give precedence to local tree libs and then use the pre-installed ones.
>  opensm_LDADD = -L../complib -L../libvendor -L. $(OSMV_LDADD) -lopensm -losmcomp -losmvendor
>  
> -opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread
> +opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread -ldl
>  
>  opensmincludedir = $(includedir)/infiniband/opensm
>  
> diff --git a/osm/opensm/configure.in b/osm/opensm/configure.in
> index ad3333a..9e23719 100644
> --- a/osm/opensm/configure.in
> +++ b/osm/opensm/configure.in
> @@ -78,6 +78,9 @@ if test $console_socket = yes; then
>  	    [Define as 1 if you want to enable a console on a socket connection])
>  fi
>  
> +dnl select performance manager or not
> +OPENIB_OSM_PERF_MGR_SEL
> +
>  dnl Provide user option to select vendor
>  OPENIB_APP_OSMV_SEL
>  
> diff --git a/osm/opensm/main.c b/osm/opensm/main.c
> index 153e44d..4fa3563 100644
> --- a/osm/opensm/main.c
> +++ b/osm/opensm/main.c
> @@ -59,6 +59,7 @@
>  #include <opensm/osm_version.h>
>  #include <opensm/osm_opensm.h>
>  #include <opensm/osm_console.h>
> +#include <opensm/osm_perfmgr.h>
>  
>  volatile unsigned int osm_exit_flag = 0;
>  
> @@ -273,6 +274,13 @@ show_usage(void)
>    printf("-I\n"
>           "--inactive\n"
>           "           Start SM in inactive rather than normal init SM state.\n\n");
> +#ifdef ENABLE_OSM_PERF_MGR
> +  printf( "--pm\n"
> +          "          Activate the performance manager.\n\n");
> +  printf( "--pm_sweep_time_s\n"
> +          "          Define the period for PerfMgr sweeps (in seconds) default %ds.\n\n",
> +	  OSM_PERFMGR_DEFAULT_SWEEP_TIME_S);
> +#endif /* ENABLE_OSM_PERF_MGR */
>    printf( "-v\n"
>            "--verbose\n"
>            "          This option increases the log verbosity level.\n"
> @@ -630,6 +638,8 @@ main(
>  #endif
>        {  "daemon",        0, NULL, 'B'},
>        {  "inactive",      0, NULL, 'I'},
> +      {  "pm",            0, NULL, 1}, /* no short options for PM stuff */
> +      {  "pm_sweep_time_s", 1, NULL, 2},
>        {  NULL,            0, NULL,  0 }  /* Required at the end of the array */
>      };
>  
> @@ -907,6 +917,15 @@ main(
>        printf(" SM started in inactive state\n");
>        break;
>  
> +#ifdef ENABLE_OSM_PERF_MGR
> +    case 1:
> +      opt.perfmgr = TRUE;
> +      break;
> +    case 2:
> +      opt.perfmgr_sweep_time_s = atoi(optarg);

In case of user error we can get opt.perfmgr_sweep_time_s = 0 (or some other
strange value); I think at least minimal validation is needed here.
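
A minimal sketch of such a check, assuming a hypothetical helper
parse_sweep_time_s() (not part of the patch) built on strtoul(). It rejects
empty input, trailing garbage, zero, and anything that does not fit the
uint16_t field:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the patch: parse a sweep time given
 * in seconds. Returns 0 and stores the value in *out on success,
 * -1 on invalid input (in which case *out is left untouched). */
static int parse_sweep_time_s(const char *arg, uint16_t *out)
{
	char *end = NULL;
	unsigned long val;

	if (!arg || *arg == '\0')
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 10);

	/* Reject conversion errors, trailing characters, zero, and values
	 * too large for uint16_t (negative input wraps to a huge value and
	 * is caught by the range check). */
	if (errno != 0 || *end != '\0' || val == 0 || val > UINT16_MAX)
		return -1;

	*out = (uint16_t)val;
	return 0;
}
```

The same helper could also serve the console 'sweep_time' command, which
currently passes atoi() output straight to osm_perfmgr_set_sweep_time_s().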

> +      break;
> +#endif /* ENABLE_OSM_PERF_MGR */
> +
>      case 'h':
>      case '?':
>      case ':':
> diff --git a/osm/opensm/osm_console.c b/osm/opensm/osm_console.c
> index 38b978a..d6c30d8 100644
> --- a/osm/opensm/osm_console.c
> +++ b/osm/opensm/osm_console.c
> @@ -52,6 +52,7 @@
>  #include <ctype.h>
>  #include <opensm/osm_console.h>
>  #include <opensm/osm_version.h>
> +#include <opensm/osm_perfmgr.h>
>  
>  struct command {
>  	char *name;
> @@ -136,6 +137,20 @@ static void help_logflush(FILE *out, int
>  	fprintf(out, "logflush -- flush the osm.log file\n");
>  }
>  
> +#ifdef ENABLE_OSM_PERF_MGR
> +static void help_perfmgr(FILE *out, int detail)
> +{
> +	fprintf(out, "perfmgr [enable|disable|clear_counters|dump_counters|sweep_time][seconds]\n");
> +	if (detail) {
> +		fprintf(out, "perfmgr -- print the performance manager state\n");
> +		fprintf(out, "   [enable|disable] -- change the perfmgr state\n");
> +		fprintf(out, "   [sweep_time] -- change the perfmgr sweep time (requires [seconds] option)\n");
> +		fprintf(out, "   [clear_counters] -- clear the counters stored\n");
> +		fprintf(out, "   [dump_counters [mach]] -- dump the counters\n");
> +	}
> +}
> +#endif /* ENABLE_OSM_PERF_MGR */
> +
>  /* more help routines go here */
>  
>  static void help_parse(char **p_last, osm_opensm_t *p_osm, FILE *out)
> @@ -427,6 +442,66 @@ static void logflush_parse(char **p_last
>  	fflush(p_osm->log.out_port);
>  }
>  
> +#ifdef ENABLE_OSM_PERF_MGR
> +static void perfmgr_parse(char **p_last, osm_opensm_t *p_osm, FILE *out)
> +{
> +	char *p_cmd;
> +
> +	p_cmd = next_token(p_last);
> +	if (p_cmd)
> +	{
> +	   if (strcmp(p_cmd, "enable") == 0)
> +	   {
> +		   osm_perfmgr_set_state(&(p_osm->perfmgr), PERFMGR_STATE_ENABLED);
> +	   }
> +	   else if (strcmp(p_cmd, "disable") == 0)
> +	   {
> +		   osm_perfmgr_set_state(&(p_osm->perfmgr), PERFMGR_STATE_DISABLE);
> +	   }
> +	   else if (strcmp(p_cmd, "clear_counters") == 0)
> +	   {
> +		   osm_perfmgr_clear_counters(&(p_osm->perfmgr));
> +	   }
> +	   else if (strcmp(p_cmd, "dump_counters") == 0)
> +	   {
> +		p_cmd = next_token(p_last);
> +		if (p_cmd && (strcmp(p_cmd, "mach") == 0)) {
> +			osm_perfmgr_dump_counters(&(p_osm->perfmgr),
> +					OSM_EVENT_DB_DUMP_MR);
> +		} else {
> +			osm_perfmgr_dump_counters(&(p_osm->perfmgr),
> +					OSM_EVENT_DB_DUMP_HR);
> +		}
> +	   }
> +	   else if (strcmp(p_cmd, "sweep_time") == 0)
> +	   {
> +		p_cmd = next_token(p_last);
> +		if (p_cmd)
> +		{
> +			uint16_t time_s = atoi(p_cmd);
> +		   	osm_perfmgr_set_sweep_time_s(&(p_osm->perfmgr), time_s);
> +		}
> +		else
> +		{
> +			fprintf(out, "sweep_time requires a time specified\n");
> +		}
> +	   }
> +	   else
> +	   {
> +		fprintf(out, "\"%s\" option not found\n", p_cmd);
> +	   }
> +	} else {
> +		fprintf(out, "Performance Manager status:\n"
> +			     "state      : %s\n"
> +		             "sweep time : %us\n"
> +		        ,
> +			osm_perfmgr_get_state_str(&(p_osm->perfmgr)),
> +			osm_perfmgr_get_sweep_time_s(&(p_osm->perfmgr))
> +			);
> +	}
> +}
> +#endif /* ENABLE_OSM_PERF_MGR */
> +
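
The console path has the same problem: atoi() silently turns "abc" into 0
and truncates anything above 65535 when it is assigned to a uint16_t. A
minimal sketch of the kind of check I have in mind follows. The helper name
parse_sweep_time_s() is hypothetical, and rejecting 0 is my assumption about
what makes sense for a sweep period, not something the patch defines.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the patch: parse a sweep time given in
 * seconds. Returns 0 and stores the value on success, -1 on bad input. */
static int parse_sweep_time_s(const char *str, uint16_t *out)
{
	char *end = NULL;
	unsigned long val;

	if (!str || *str == '\0' || *str == '-')
		return -1;

	errno = 0;
	val = strtoul(str, &end, 10);
	if (errno != 0 || end == str || *end != '\0')
		return -1;	/* not a number, or trailing garbage */
	if (val == 0 || val > UINT16_MAX)
		return -1;	/* assumed invalid: zero, or too big for uint16_t */

	*out = (uint16_t)val;
	return 0;
}
```

perfmgr_parse() could then print an error to 'out' when this returns -1
instead of passing a bogus value to osm_perfmgr_set_sweep_time_s().
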
>  /* This is public to be able to close it on exit */
>  void osm_console_close_socket(osm_opensm_t *p_osm)
>  {
> @@ -456,6 +531,9 @@ static const struct command console_cmds
>  	{ "resweep",	&help_resweep,		&resweep_parse},
>  	{ "status",	&help_status,		&status_parse},
>  	{ "logflush",	&help_logflush,		&logflush_parse},
> +#ifdef ENABLE_OSM_PERF_MGR
> +	{ "perfmgr",	&help_perfmgr,		&perfmgr_parse},
> +#endif /* ENABLE_OSM_PERF_MGR */
>  	{ NULL,		NULL,			NULL}	/* end of array */
>  };
>  
> diff --git a/osm/opensm/osm_event_db.c b/osm/opensm/osm_event_db.c
> new file mode 100644
> index 0000000..90ca8da
> --- /dev/null
> +++ b/osm/opensm/osm_event_db.c
> @@ -0,0 +1,172 @@
> +/*
> + * Copyright (c) 2007 The Regents of the University of California.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +
> +
> +#if HAVE_CONFIG_H
> +#  include <config.h>
> +#endif /* HAVE_CONFIG_H */
> +
> +#include <stdlib.h>
> +#include <errno.h>
> +#include <limits.h>
> +#include <dlfcn.h>
> +#include <sys/stat.h>
> +
> +#include <opensm/osm_event_db.h>
> +
> +/** =========================================================================
> + */
> +osm_event_db_t *
> +osm_event_db_construct(osm_log_t *p_log, char *type)
> +{
> +	char            lib_name[PATH_MAX];
> +	osm_event_db_t *rc = NULL;
> +
> +	if (!type)
> +		return (NULL);
> +
> +	/* find the plugin */
> +	snprintf(lib_name, PATH_MAX, "lib%s.so", type);
> +
> +	rc = malloc(sizeof(*rc));
> +	if (!rc)
> +		return (NULL);
> +
> +	rc->handle = dlopen(lib_name, RTLD_LAZY);
> +	if (!rc->handle)
> +	{
> +		osm_log(p_log, OSM_LOG_ERROR,
> +			"Failed to open PM Database \"%s\" : \"%s\"\n",
> +			lib_name, dlerror());
> +		goto DLOPENFAIL;
> +	}
> +
> +	rc->db_impl = (__osm_event_db_t *)dlsym(rc->handle, "__osm_event_db");
> +	if (!rc->db_impl)
> +	{
> +		osm_log(p_log, OSM_LOG_ERROR,
> +			"Failed to find __osm_event_db symbol in \"%s\" : \"%s\"\n",
> +			lib_name, dlerror());
> +		goto Exit;
> +	}
> +
> +	/* Check the version to make sure this module will work with us */
> +	if (rc->db_impl->interface_version != OSM_EVENT_DB_INTERFACE_VER)
> +	{
> +		osm_log(p_log, OSM_LOG_ERROR,
> +			"__osm_event_db symbol is the wrong version %d != %d\n",
> +			rc->db_impl->interface_version,
> +			OSM_EVENT_DB_INTERFACE_VER);
> +		goto Exit;
> +	}
> +
> +	rc->db_data = rc->db_impl->construct(p_log);
> +
> +	if (!rc->db_data)
> +		goto Exit;
> +
> +	rc->p_log = p_log;
> +	return (rc);
> +
> +Exit:
> +	dlclose(rc->handle);
> +DLOPENFAIL:
> +	free(rc);
> +	return (NULL);
> +}
> +
> +/** =========================================================================
> + */
> +void
> +osm_event_db_destroy(osm_event_db_t *db)
> +{
> +	if (db)
> +	{
> +		db->db_impl->destroy(db->db_data);
> +		free(db);
> +	}
> +}
> +
> +/** =========================================================================
> + */
> +osm_event_db_err_t
> +osm_event_db_create_entry(osm_event_db_t *db, uint64_t guid, uint8_t num_ports)
> +{
> +	return(db->db_impl->create_entry(db->db_data, guid, num_ports));
> +}
> +
> +/**********************************************************************
> + **********************************************************************/
> +osm_event_db_err_t osm_event_db_get_prev_pc(osm_event_db_t *db, uint64_t guid,
> +		uint8_t port, osm_pc_reading_t *reading)
> +{
> +	return (db->db_impl->get_prev_pc(db->db_data, guid, port, reading));
> +}
> +
> +/**********************************************************************
> + * dump the data to the file "file"
> + **********************************************************************/
> +osm_event_db_err_t
> +osm_event_db_dump(osm_event_db_t *db, char *file, osm_event_db_dump_t dump_type)
> +{
> +	return (db->db_impl->dump(db->db_data, file, dump_type));
> +}
> +
> +/**********************************************************************
> + * Clear the port counters from the db
> + **********************************************************************/
> +void osm_event_db_clear_port_counters(osm_event_db_t *db)
> +{
> +	db->db_impl->clear_port_counters(db->db_data);
> +}
> +
> +/**********************************************************************
> + * Add the reading to the osm_pm_node_t
> + **********************************************************************/
> +osm_event_db_err_t
> +osm_event_db_add_pc_reading(osm_event_db_t *db, uint64_t guid,
> +                   uint8_t port, ib_port_counters_t *reading)
> +{
> +	return (db->db_impl->add_pc_reading(db->db_data, guid,
> +				port, reading));
> +}
> +
> +/**********************************************************************
> + * Add the reading to the osm_pm_node_t
> + **********************************************************************/
> +osm_event_db_err_t
> +osm_event_db_clear_prev_pc(osm_event_db_t *db, uint64_t guid, uint8_t port)
> +{
> +	return (db->db_impl->clear_prev_pc(db->db_data, guid, port));
> +}
> +
> diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c
> index 8430605..fa572c5 100644
> --- a/osm/opensm/osm_opensm.c
> +++ b/osm/opensm/osm_opensm.c
> @@ -172,6 +172,9 @@ osm_opensm_destroy(
>       p_osm->routing_engine.delete(p_osm->routing_engine.context);
>     osm_sa_destroy( &p_osm->sa );
>     osm_sm_destroy( &p_osm->sm );
> +#ifdef ENABLE_OSM_PERF_MGR
> +   osm_perfmgr_destroy( &p_osm->perfmgr );
> +#endif /* ENABLE_OSM_PERF_MGR */
>     osm_db_destroy( &p_osm->db );
>     osm_vl15_destroy( &p_osm->vl15, &p_osm->mad_pool );
>     osm_mad_pool_destroy( &p_osm->mad_pool );
> @@ -286,6 +289,21 @@ osm_opensm_init(
>     if( status != IB_SUCCESS )
>        goto Exit;
>  
> +#ifdef ENABLE_OSM_PERF_MGR
> +   status = osm_perfmgr_init( &p_osm->perfmgr,
> +                         &p_osm->subn,
> +			 &p_osm->sm,
> +                         &p_osm->log,
> +			 &p_osm->mad_pool,
> +			 p_osm->p_vendor,
> +			 &p_osm->disp,
> +			 &p_osm->lock,
> +			 p_opt);
> +
> +   if( status != IB_SUCCESS )
> +      goto Exit;
> +#endif /* ENABLE_OSM_PERF_MGR */
> +
>     if( p_opt->routing_engine_name &&
>         setup_routing_engine(p_osm, p_opt->routing_engine_name)) {
>        osm_log( &p_osm->log, OSM_LOG_VERBOSE,
> @@ -319,6 +337,12 @@ osm_opensm_bind(
>     if( status != IB_SUCCESS )
>        goto Exit;
>  
> +#ifdef ENABLE_OSM_PERF_MGR
> +   status = osm_perfmgr_bind( &p_osm->perfmgr, guid );
> +   if( status != IB_SUCCESS )
> +      goto Exit;
> +#endif /* ENABLE_OSM_PERF_MGR */
> +
>   Exit:
>     OSM_LOG_EXIT( &p_osm->log );
>     return ( status );
> diff --git a/osm/opensm/osm_perfmgr.c b/osm/opensm/osm_perfmgr.c
> new file mode 100644
> index 0000000..297a0e2
> --- /dev/null
> +++ b/osm/opensm/osm_perfmgr.c
> @@ -0,0 +1,686 @@
> +/*
> + * Copyright (c) 2007 The Regents of the University of California.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +
> +
> +/*
> + * Abstract:
> + *    Implementation of osm_perfmgr_t.
> + *
> + * Author:
> + *    Ira Weiny, LLNL
> + */
> +
> +#if HAVE_CONFIG_H
> +#  include <config.h>
> +#endif /* HAVE_CONFIG_H */
> +
> +#ifdef ENABLE_OSM_PERF_MGR
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <poll.h>
> +#include <netinet/in.h>
> +#include <complib/cl_debug.h>
> +#include <iba/ib_types.h>
> +#include <errno.h>
> +#include <sys/time.h>
> +#include <opensm/osm_perfmgr.h>
> +#include <opensm/osm_log.h>
> +#include <opensm/osm_node.h>
> +#include <complib/cl_thread.h>
> +#include <vendor/osm_vendor_api.h>
> +
> +#define  OSM_PERFMGR_INITIAL_TID_VALUE 0xcafe
> +
> +/**********************************************************************
> + * Recieve the MAD from the vendor layer and post it for processing by the
> + * dispatcher.
> + **********************************************************************/
> +static void
> +osm_perfmgr_mad_recv_callback(osm_madw_t *p_madw, void* bind_context,
> +   				osm_madw_t *p_req_madw )
> +{
> +	osm_perfmgr_t      *pm = (osm_perfmgr_t *)bind_context;
> +	cl_status_t         cl_status = CL_SUCCESS;
> +	
> +	OSM_LOG_ENTER( pm->log, osm_pm_mad_recv_callback );
                                ^^^^^^^^^^^^^^^^^^^^^^^^
I guess this should be 'osm_perfmgr_mad_recv_callback'.

> +	
> +	osm_madw_copy_context( p_madw, p_req_madw );
> +	osm_mad_pool_put( pm->mad_pool, p_req_madw );
> +	
> +	/* post this message for later processing. */
> +	cl_status = cl_disp_post(pm->pc_disp_h, OSM_MSG_MAD_PORT_COUNTERS,
> +	      	           	(void *)p_madw, NULL, NULL);
> +#if 0
> +	do {
> +		struct timeval      rcv_time;
> +		gettimeofday(&rcv_time, NULL);
> +		osm_log(pm->log, OSM_LOG_INFO,
> +			"perfmgr rcv time %ld\n",
> +			rcv_time.tv_usec -
> +			p_madw->context.perfmgr_context.query_start.tv_usec);
> +	} while (0);
> +#endif
> +	OSM_LOG_EXIT( pm->log );
> +}
> +
> +/**********************************************************************
> + * Process errors from the MAD send.
> + **********************************************************************/
> +static void
> +osm_perfmgr_mad_send_err_callback(void* bind_context, osm_madw_t *p_madw)
> +{
> +	osm_perfmgr_t *pm = (osm_perfmgr_t *)bind_context;
> +	osm_madw_context_t *context = &(p_madw->context);
> +	
> +	OSM_LOG_ENTER( pm->log, osm_pm_mad_send_err_callback );
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
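Ditto (the same applies to the other perfmgr functions: the names passed to
OSM_LOG_ENTER and the prefixes in the log strings still say osm_pm_*).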

> +	
> +	osm_log( pm->log, OSM_LOG_ERROR,
> +	           "osm_pm_mad_send_err_callback: 0x%" PRIx64 " port %d\n",
> +	      	  context->perfmgr_context.node_guid,
> +	      	  context->perfmgr_context.port);
> +	
> +	osm_mad_pool_put( pm->mad_pool, p_madw );
> +	
> +	OSM_LOG_EXIT( pm->log );
> +}
> +
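
One way to keep the logged names from drifting after a rename is to let the
compiler supply them. This is only a sketch: LOG_ENTER() and
example_callback() are hypothetical names, not the real OSM_LOG_ENTER macro,
and it writes into a buffer instead of the OpenSM log so it stays
self-contained.

```c
#include <stdio.h>

/* Hypothetical macro, not the real OSM_LOG_ENTER: C99 __func__ expands to
 * the name of the enclosing function, so no name has to be typed by hand. */
#define LOG_ENTER(buf, len) snprintf((buf), (len), "%s: [", __func__)

/* Hypothetical function standing in for one of the perfmgr callbacks. */
static void example_callback(char *buf, size_t len)
{
	LOG_ENTER(buf, len);
}
```

Whether something like this fits the existing logging macros is a separate
question; simply fixing the names by hand is fine for this patch.
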
> +/**********************************************************************
> + * Bind the PM to the vendor layer for MAD sends/receives
> + **********************************************************************/
> +ib_api_status_t
> +osm_perfmgr_bind(osm_perfmgr_t * const pm, const ib_net64_t port_guid)
> +{
> +	osm_bind_info_t bind_info;
> +	ib_api_status_t status = IB_SUCCESS;
> +	
> +	OSM_LOG_ENTER( pm->log, osm_pm_bind );
> +	
> +	if( pm->bind_handle != OSM_BIND_INVALID_HANDLE ) {
> +		osm_log( pm->log, OSM_LOG_ERROR,
> +		         "osm_pm_mad_ctrl_bind: Multiple binds not allowed\n" );
> +		status = IB_ERROR;
> +		goto Exit;
> +	}
> +	
> +	bind_info.port_guid = port_guid;
> +	bind_info.mad_class = IB_MCLASS_PERF;
> +	bind_info.class_version = 1;
> +	bind_info.is_responder = FALSE;
> +	bind_info.is_report_processor = FALSE;
> +	bind_info.is_trap_processor = FALSE;
> +	bind_info.recv_q_size = OSM_PM_DEFAULT_QP1_RCV_SIZE;
> +	bind_info.send_q_size = OSM_PM_DEFAULT_QP1_SEND_SIZE;
> +	
> +	osm_log( pm->log, OSM_LOG_VERBOSE,
> +	         "osm_pm_mad_bind: "
> +	         "Binding to port GUID 0x%" PRIx64 "\n",
> +	         cl_ntoh64( port_guid ) );
> +	
> +	pm->bind_handle = osm_vendor_bind( pm->vendor,
> +	                                  &bind_info,
> +	                                  pm->mad_pool,
> +	                                  osm_perfmgr_mad_recv_callback,
> +	                                  osm_perfmgr_mad_send_err_callback,
> +	                                  pm );
> +	
> +	if( pm->bind_handle == OSM_BIND_INVALID_HANDLE ) {
> +		status = IB_ERROR;
> +		osm_log( pm->log, OSM_LOG_ERROR,
> +		         "osm_pm_mad_bind: Vendor specific bind failed (%s)\n",
> +		         ib_get_err_str(status) );
> +		goto Exit;
> +	}
> +
> +Exit:
> + 	OSM_LOG_EXIT( pm->log );
> +	return( status );
> +}
> +
> +/**********************************************************************
> + * Unbind the PM to the vendor layer for MAD sends/receives
> + **********************************************************************/
> +void
> +osm_perfmgr_mad_unbind(osm_perfmgr_t * const pm)
> +{
> +	OSM_LOG_ENTER( pm->log, osm_sa_mad_ctrl_unbind );
> +	if( pm->bind_handle == OSM_BIND_INVALID_HANDLE ) {
> +		osm_log( pm->log, OSM_LOG_ERROR,
> +		         "osm_pm_mad_unbind: No previous bind\n" );
> +		goto Exit;
> +	}
> +	osm_vendor_unbind( pm->bind_handle );
> +Exit:
> +	OSM_LOG_EXIT( pm->log );
> +}
> +
> +/**********************************************************************
> + * Given a node and a port return the appropriate lid to query that port
> + **********************************************************************/
> +static ib_net16_t
> +get_lid(osm_node_t *p_node, uint8_t port)
> +{
> +	ib_net16_t lid = 0;
> +	
> +	switch (p_node->node_info.node_type)
> +	{
> +		case IB_NODE_TYPE_CA:
> +		case IB_NODE_TYPE_ROUTER:
> +			  lid = osm_node_get_base_lid(p_node, port);
> +			  break;
> +		case IB_NODE_TYPE_SWITCH:
> +			  lid = osm_node_get_base_lid(p_node, 0);
> +			  break;
> +		default:
> +			  break;
> +	}
> +	return (lid);
> +}
> +
> +/**********************************************************************
> + * Form the Port Counter MAD and send the MAD for a single port.
> + **********************************************************************/
> +static ib_api_status_t
> +osm_perfmgr_send_pc_mad(osm_perfmgr_t *perfmgr, ib_net16_t dest_lid, uint8_t port,
> +			uint8_t mad_method, osm_madw_context_t* const p_context )
> +{
> +	ib_api_status_t     status = IB_SUCCESS;
> +	ib_port_counters_t *port_counter = NULL;
> +	ib_perfmgr_mad_t   *pm_mad = NULL;
> +	osm_madw_t         *p_madw = NULL;
> +	
> +	OSM_LOG_ENTER(perfmgr->log, osm_perfmgr_send_pc_mad);
> +	
> +	p_madw = osm_mad_pool_get(perfmgr->mad_pool, perfmgr->bind_handle, MAD_BLOCK_SIZE, NULL);
> +	if (p_madw == NULL)
> +		return (IB_INSUFFICIENT_MEMORY);
> +	
> +	pm_mad = osm_madw_get_perfmgr_mad_ptr(p_madw);
> +	
> +	/* build the mad */
> +	pm_mad->header.base_ver = 1;
> +	pm_mad->header.mgmt_class = IB_MCLASS_PERF;
> +	pm_mad->header.class_ver = 1;
> +	pm_mad->header.method = mad_method;
> +	pm_mad->header.status = 0;
> +	pm_mad->header.class_spec = 0;
> +	pm_mad->header.trans_id = cl_hton64((uint64_t)cl_atomic_inc(&(perfmgr->trans_id)));
> +	pm_mad->header.attr_id = IB_MAD_ATTR_PORT_CNTRS;
> +	pm_mad->header.resv = 0;
> +	pm_mad->header.attr_mod = 0;
> +	
> +	port_counter = (ib_port_counters_t *)&(pm_mad->data);
> +	memset(port_counter, 0, sizeof(*port_counter));
> +	port_counter->port_select = port;
> +	port_counter->counter_select = 0xFFFF;
> +	
> +	p_madw->mad_addr.dest_lid = dest_lid;
> +	p_madw->mad_addr.addr_type.gsi.remote_qp = cl_hton32(1);
> +	p_madw->mad_addr.addr_type.gsi.remote_qkey = cl_hton32(IB_QP1_WELL_KNOWN_Q_KEY);
> +	/* FIXME what about other partitions */
> +	p_madw->mad_addr.addr_type.gsi.pkey = cl_hton16(0xFFFF);
> +	p_madw->mad_addr.addr_type.gsi.service_level = 0;
> +	p_madw->mad_addr.addr_type.gsi.global_route = FALSE;
> +	p_madw->resp_expected = TRUE;
> +	
> +	if( p_context )
> +		p_madw->context = *p_context;
> +	
> +	status = osm_vendor_send(perfmgr->bind_handle, p_madw, TRUE);
> +	
> +	OSM_LOG_EXIT(perfmgr->log);
> +	return( status );
> +}
> +
> +/**********************************************************************
> + * query the Port Counters of all the nodes in the subnet.
> + **********************************************************************/
> +static void
> +__osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context )
> +{
> +	ib_api_status_t     status = IB_SUCCESS;
> +	uint8_t             port = 0;
> +	osm_perfmgr_t      *pm = (osm_perfmgr_t *)context;
> +	osm_node_t         *p_node = (osm_node_t *)p_map_item;
> +	uint8_t             node_desc[IB_NODE_DESCRIPTION_SIZE];
> +	osm_madw_context_t  mad_context;
> +	uint8_t             num_ports = 0;
> +	uint64_t            node_guid = 0;
> +	
> +	OSM_LOG_ENTER( pm->log, __osm_pm_query_counters );
> +	
> +	memcpy(node_desc, p_node->node_desc.description,
> +			IB_NODE_DESCRIPTION_SIZE);
> +	node_desc[IB_NODE_DESCRIPTION_SIZE-1] = '\0';

There is already a null-terminated 'print_desc' field in the osm_node_t
structure, so this local copy should not be needed.

> +	
> +	num_ports = osm_node_get_num_physp(p_node);
> +	node_guid = cl_ntoh64(p_node->node_info.node_guid);
> +	
> +	/* make sure we have a database object ready to store this information */
> +	if (osm_event_db_create_entry(pm->db, node_guid, num_ports) !=
> +	      	  OSM_EVENT_DB_SUCCESS)
> +	{
> +		osm_log(pm->log, OSM_LOG_ERROR,
> +			"PerfMgr DB create entry failed for 0x%" PRIx64 " : %s\n",
> +			node_guid, strerror(errno));
> +		goto Exit;
> +	}
> +	
> +	/* issue the queries for each port */
> +	for (port = 1; port < num_ports; port++)
> +	{
> +		ib_net16_t lid = get_lid(p_node, port);
> +		if (lid == 0)
> +		{
> +			osm_log(pm->log, OSM_LOG_DEBUG,
> +				"WARN: node 0x%" PRIx64 " port %d (%s): port out of range, skipping\n",
> +				cl_ntoh64(p_node->node_info.node_guid), port, node_desc);
> +			continue;
> +		}
> +		
> +		mad_context.perfmgr_context.node_guid = node_guid;
> +		mad_context.perfmgr_context.port = port;
> +		mad_context.perfmgr_context.num_ports = num_ports;
> +		mad_context.perfmgr_context.mad_method = IB_MAD_METHOD_GET;
> +#if 0
> +		gettimeofday(&(mad_context.perfmgr_context.query_start), NULL);
> +#endif
> +		osm_log(pm->log, OSM_LOG_VERBOSE,
> +				"   Getting stats for node 0x%" PRIx64 " port %d (lid %X) (%s)\n",
> +				node_guid, port, cl_ntoh16(lid), node_desc);
> +		status = osm_perfmgr_send_pc_mad(pm, lid, port, IB_MAD_METHOD_GET, &mad_context);
> +		if (status != IB_SUCCESS)
> +		{
> +		      osm_log(pm->log, OSM_LOG_ERROR,
> +				"Failed to issue port counter query for node 0x%" PRIx64 " port %d (%s)\n",
> +				p_node->node_info.node_guid, port, node_desc);
> +		}
> +	}
> +Exit:
> +	OSM_LOG_EXIT( pm->log );
> +}
> +
> +/**********************************************************************
> + * Main PerfMgr Thread.
> + * Loop continueously and query the performance counters.
> + **********************************************************************/
> +void
> +__osm_perfmgr_sweeper(void *p_ptr)
> +{
> +	ib_api_status_t status;
> +	osm_perfmgr_t *const pm = ( osm_perfmgr_t * ) p_ptr;
> +	
> +	OSM_LOG_ENTER( pm->log, __osm_pm_sweeper );
> +	
> +	if( pm->thread_state == OSM_THREAD_STATE_INIT )
> +		pm->thread_state = OSM_THREAD_STATE_RUN;
> +	
> +	while( pm->thread_state == OSM_THREAD_STATE_RUN ) {
> +		/*  do the sweep only if we are in MASTER state
> +		 *  AND we have been activated.
> +		 *  FIXME put something in here to try and reduce the load on the system
> +		 *  when it is not IDLE.
> +		if (pm->sm->state_mgr.state != OSM_SM_STATE_IDLE)
> +		 */
> +		if( pm->subn->sm_state == IB_SMINFO_STATE_MASTER
> +		    && pm->state == PERFMGR_STATE_ENABLED) {
> +#if 0
> +			struct timeval before, after;
> +			gettimeofday(&before, NULL);
> +#endif
> +			/* for each node query their counters */
> +			cl_plock_acquire(pm->lock);
> +			osm_log(pm->log, OSM_LOG_VERBOSE, "Gathering PerfMgr stats\n");
> +			cl_qmap_apply_func(&(pm->subn->node_guid_tbl),
> +			    	  __osm_perfmgr_query_counters, (void *)pm);
> +			cl_plock_release(pm->lock);
> +#if 0
> +			gettimeofday(&after, NULL);
> +			osm_log(pm->log, OSM_LOG_INFO,
> +				"total sweep time : %ld us\n", after.tv_usec - before.tv_usec);
> +#endif
> +		}
> +
> +		/* Wait for a forced sweep or period timeout. */
> +		status = cl_event_wait_on( &pm->sig_sweep,
> +		                   		pm->sweep_time_s * 1000000,
> +		                   		TRUE );
> +	}
> +	
> +	OSM_LOG_EXIT( pm->log );
> +}
> +
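
Two arithmetic details in the sweeper look fragile to me. The disabled timing
code subtracts only tv_usec, which wraps every second, and
'pm->sweep_time_s * 1000000' is computed as int if sweep_time_s is a 16-bit
type (as the console code suggests), so it overflows for anything above
about 2147 seconds. A sketch of what I mean, with hypothetical helper names
and assuming the wait routine takes a 32-bit microsecond timeout:

```c
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical helpers, not part of the patch. */

/* Elapsed microseconds between two gettimeofday() samples, taking both
 * the seconds and the microseconds fields into account. */
static int64_t elapsed_us(const struct timeval *before,
			  const struct timeval *after)
{
	return (int64_t)(after->tv_sec - before->tv_sec) * 1000000 +
	       (after->tv_usec - before->tv_usec);
}

/* Seconds to microseconds, done in 64 bits and clamped to a 32-bit
 * timeout. The clamp stays below UINT32_MAX on the assumption that the
 * all-ones value may be reserved to mean "no timeout". */
static uint32_t sweep_timeout_us(uint16_t seconds)
{
	uint64_t us = (uint64_t)seconds * 1000000u;

	return (us >= UINT32_MAX) ? (UINT32_MAX - 1) : (uint32_t)us;
}
```
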
> +/**********************************************************************
> + **********************************************************************/
> +void
> +osm_perfmgr_shutdown(osm_perfmgr_t * const pm)
> +{
> +	OSM_LOG_ENTER( pm->log, osm_perfmgr_shutdown );
> +	osm_perfmgr_mad_unbind(pm);
> +	OSM_LOG_EXIT( pm->log );
> +}
> +
> +/**********************************************************************
> + **********************************************************************/
> +void
> +osm_perfmgr_destroy(osm_perfmgr_t * const pm)
> +{
> +	OSM_LOG_ENTER( pm->log, osm_perfmgr_destroy );
> +	free(pm->event_db_dump_file);
> +	free(pm->event_db_plugin);
> +	osm_event_db_destroy(pm->db);
> +	OSM_LOG_EXIT( pm->log );
> +}
> +
> +/**********************************************************************
> + * Return 1 if the value has overflowed
> + **********************************************************************/
> +int counter_overflow_4(uint8_t val)
> +{
> +	return (val >= 10);
> +}
> +int counter_overflow_8(uint8_t val)
> +{
> +	return (val >= (UINT8_MAX - (UINT8_MAX/4)));
> +}
> +int counter_overflow_16(uint16_t val)
> +{
> +	return (cl_ntoh16(val) >= (UINT16_MAX - (UINT16_MAX/4)));
> +}
> +int counter_overflow_32(uint32_t val)
> +{
> +	return (cl_ntoh32(val) >= (UINT32_MAX - (UINT32_MAX/4)));
> +}
> +
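
The 8/16/32-bit versions all trigger at three quarters of the range, but the
4-bit one uses 10 where the same formula would give 12. If three quarters is
the intent, a single helper would keep them consistent. This is a sketch
only: counter_near_max() is a hypothetical name and it expects the value
already converted to host byte order.

```c
#include <stdint.h>

/* Hypothetical helper: true once a counter of the given bit width has
 * reached max - max/4, i.e. roughly three quarters of its range.
 * val must be in host byte order. */
static int counter_near_max(uint32_t val, unsigned bits)
{
	uint32_t max = (bits >= 32) ? UINT32_MAX : ((UINT32_C(1) << bits) - 1);

	return val >= max - max / 4;
}
```

The thresholds this gives are 12, 192, 49152 and 3221225472 for 4, 8, 16 and
32 bits. Also, the counter_overflow_*() functions could be static.
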
> +/**********************************************************************
> + * Check if the port counters have overflowed and if so issue a clear MAD to
> + * the port.
> + **********************************************************************/
> +static void
> +osm_perfmgr_check_clear(osm_perfmgr_t *pm, uint64_t node_guid,
> +	     uint8_t port, int num_ports, ib_port_counters_t *cr)
> +{
> +  	osm_madw_context_t  mad_context;
> +
> +  	OSM_LOG_ENTER( pm->log, osm_pm_check_clear );
> +	if (counter_overflow_16(cr->symbol_err_cnt)
> +		|| counter_overflow_8(cr->link_err_recover)
> +		|| counter_overflow_8(cr->link_downed)
> +		|| counter_overflow_16(cr->rcv_err)
> +		|| counter_overflow_16(cr->rcv_rem_phys_err)
> +		|| counter_overflow_16(cr->rcv_switch_relay_err)
> +		|| counter_overflow_16(cr->xmit_discards)
> +		|| counter_overflow_8(cr->xmit_constraint_err)
> +		|| counter_overflow_8(cr->rcv_constraint_err)
> +		|| counter_overflow_4(PC_LINK_INT(cr->link_int_buffer_overrun))
> +		|| counter_overflow_4(PC_BUF_OVERRUN(cr->link_int_buffer_overrun))
> +		|| counter_overflow_16(cr->vl15_dropped)
> +		|| counter_overflow_32(cr->xmit_data)
> +		|| counter_overflow_32(cr->rcv_data)
> +		|| counter_overflow_32(cr->xmit_pkts)
> +		|| counter_overflow_32(cr->rcv_pkts)
> +		)
> +	{
> +		osm_log(pm->log, OSM_LOG_INFO,
> +			"Counter overflow: 0x%" PRIx64 " port %d; clearing counters\n",
> +			node_guid, port);
> +  		osm_node_t *p_node = NULL;
> +		ib_net16_t  lid = 0;
> +        	cl_plock_acquire(pm->lock);
> +        	p_node = (osm_node_t *)cl_qmap_get(&(pm->subn->node_guid_tbl),
> +						cl_hton64(node_guid));
> +    		lid = get_lid(p_node, port);
> +        	cl_plock_release(pm->lock);
> +    		if (lid == 0)
> +    		{
> +    			osm_log(pm->log, OSM_LOG_INFO,
> +    				"Failed to clear counters for node 0x%" PRIx64 " port %d; failed to get lid\n",
> +    				node_guid, port);
> +        		goto Exit;
> +    		}
> +    		mad_context.perfmgr_context.node_guid = node_guid;
> +    		mad_context.perfmgr_context.port = port;
> +    		mad_context.perfmgr_context.num_ports = num_ports;
> +    		mad_context.perfmgr_context.mad_method = IB_MAD_METHOD_SET;
> +		/* clear port counter */
> +		osm_perfmgr_send_pc_mad(pm, lid, port, IB_MAD_METHOD_SET, &mad_context);
> +	}
> +Exit:
> +  	OSM_LOG_EXIT( pm->log );
> +}
> +
> +/**********************************************************************
> + * Check values for logging of errors
> + **********************************************************************/
> +static void
> +osm_perfmgr_log_events(osm_perfmgr_t *pm, uint64_t node_guid, uint8_t port,
> +			ib_port_counters_t *reading)
> +{
> +	osm_pc_reading_t    prev_read;
> +	ib_port_counters_t *prev;
> +	time_t              time_diff = 0;
> +  	osm_event_db_err_t  err = osm_event_db_get_prev_pc(pm->db, node_guid, port, &prev_read);
> +  	if (err != OSM_EVENT_DB_SUCCESS)
> +  	{
> +		osm_log(pm->log, OSM_LOG_VERBOSE,
> +			"failed to find previous reading for 0x%" PRIx64 " port %u\n",
> +			node_guid, port);
> +		return;
> +  	}
> +	time_diff = (time(NULL) - prev_read.time);
> +	prev = &(prev_read.reading);
> +
> +	/* FIXME these events should be defineable by the user in a config
> +	 * file somewhere. */
> +	if (reading->symbol_err_cnt > prev->symbol_err_cnt) {
> +		osm_log(pm->log, OSM_LOG_ERROR,
> +			"Found %u Symbol errors in %lu sec on node 0x%" PRIx64 " port %u\n",
> +			(cl_ntoh16(reading->symbol_err_cnt) - cl_ntoh16(prev->symbol_err_cnt)),
> +			time_diff,
> +			node_guid,
> +			port);
> +	}
> +	if (reading->rcv_err > prev->rcv_err) {
> +		osm_log(pm->log, OSM_LOG_ERROR,
> +			"Found %u Recieve errors in %lu sec on node 0x%" PRIx64 " port %u\n",
> +			(cl_ntoh16(reading->rcv_err) - cl_ntoh16(prev->rcv_err)),
> +			time_diff,
> +			node_guid,
> +			port);
> +	}
> +	if (reading->xmit_discards > prev->xmit_discards) {
> +		osm_log(pm->log, OSM_LOG_ERROR,
> +			"Found %u XMIT Discards in %lu sec on node 0x%" PRIx64 " port %u\n",
> +			(cl_ntoh16(reading->xmit_discards) - cl_ntoh16(prev->xmit_discards)),
> +			time_diff,
> +			node_guid,
> +			port);
> +	}
> +}
> +
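
The comparisons above are done on the raw network-order values, while the
differences are computed after cl_ntoh16(). On a little-endian host the raw
comparison can give the wrong answer: a counter going from 0x00ff to 0x0100
compares as 0xff00 against 0x0001 and looks like it did not increase. A
sketch of what I would expect instead, using the standard ntohs() in place
of cl_ntoh16() so it builds on its own (pc_delta16() is a hypothetical name):

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical helper: increase between two 16-bit counters that are both
 * stored in network byte order. Returns 0 if the counter did not grow. */
static uint16_t pc_delta16(uint16_t cur_be, uint16_t prev_be)
{
	uint16_t cur = ntohs(cur_be);
	uint16_t prev = ntohs(prev_be);

	return (cur > prev) ? (uint16_t)(cur - prev) : 0;
}
```

Each 'if' would then test the converted delta for being non-zero and log
that same value.
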
> +
> +/**********************************************************************
> + * The dispatcher uses a thread pool which will call this function when we have
> + * a thread available to process our mad recieved from the wire.
> + **********************************************************************/
> +static void
> +osm_pc_rcv_process(void *context, void *data)
> +{
> +	osm_perfmgr_t      *const pm = (osm_perfmgr_t *)context;
> +	osm_madw_t         *p_madw = (osm_madw_t *)data;
> +	osm_madw_context_t *mad_context = &(p_madw->context);
> +	ib_port_counters_t *counter_reading =
> +				(ib_port_counters_t *)&(osm_madw_get_perfmgr_mad_ptr(p_madw)->data);
> +	uint64_t            node_guid = mad_context->perfmgr_context.node_guid;
> +	uint8_t             port_num = mad_context->perfmgr_context.port;
> +	int                 num_ports = mad_context->perfmgr_context.num_ports;
> +	
> +	OSM_LOG_ENTER( pm->log, osm_pc_rcv_process );
> +	
> +	osm_log(pm->log, OSM_LOG_VERBOSE,
> +	      	  "Processing recieved MAD context 0x%" PRIx64 " port %u/%d\n",
> +	      	  node_guid, port_num, num_ports);
> +	
> +	/* log any critical events from this reading */
> +	osm_perfmgr_log_events(pm, node_guid, port_num, counter_reading);
> +	
> +	if (mad_context->perfmgr_context.mad_method == IB_MAD_METHOD_GET)
> +		osm_event_db_add_pc_reading(pm->db, node_guid, port_num, counter_reading);
> +	else
> +		osm_event_db_clear_prev_pc(pm->db, node_guid, port_num);
> +	osm_perfmgr_check_clear(pm, node_guid, port_num, num_ports, counter_reading);
> +	
> +#if 0
> +	do {
> +		struct timeval      proc_time;
> +		gettimeofday(&proc_time, NULL);
> +		osm_log(pm->log, OSM_LOG_INFO,
> +			"perfmgr done processing time %ld\n",
> +			proc_time.tv_usec -
> +			p_madw->context.perfmgr_context.query_start.tv_usec);
> +	} while (0);
> +#endif
> +
> +	osm_mad_pool_put( pm->mad_pool, p_madw );
> +	
> +	OSM_LOG_EXIT( pm->log );
> +}
> +
> +/**********************************************************************
> + * Initialize the PERFMGR object
> + **********************************************************************/
> +ib_api_status_t
> +osm_perfmgr_init(
> +	osm_perfmgr_t * const pm,
> +	osm_subn_t * const subn,
> +	osm_sm_t * const sm,
> +	osm_log_t * const log,
> +	osm_mad_pool_t * const mad_pool,
> +	osm_vendor_t * const vendor,
> +	cl_dispatcher_t* const disp,
> +	cl_plock_t* const lock,
> +	const osm_subn_opt_t * const p_opt )
> +{
> +	ib_api_status_t    status = IB_SUCCESS;
> +	
> +	OSM_LOG_ENTER( log, osm_pm_init );
> +	
> +	osm_log(log, OSM_LOG_VERBOSE, "initing PM\n");
> +	
> +	memset( pm, 0, sizeof( *pm ) );
> +	
> +	cl_event_construct(&pm->sig_sweep);
> +	cl_event_init(&pm->sig_sweep, FALSE);
> +	pm->subn = subn;
> +	pm->sm = sm;
> +	pm->log = log;
> +	pm->mad_pool = mad_pool;
> +	pm->vendor = vendor;
> +	pm->trans_id = OSM_PERFMGR_INITIAL_TID_VALUE;
> +	pm->lock = lock;
> +	pm->state = p_opt->perfmgr ? PERFMGR_STATE_ENABLED : PERFMGR_STATE_DISABLE;
> +	pm->sweep_time_s = p_opt->perfmgr_sweep_time_s;
> +	pm->event_db_dump_file = strdup(p_opt->event_db_dump_file);
> +	pm->event_db_plugin = strdup(p_opt->event_db_plugin);
> +	
> +	pm->db = osm_event_db_construct(pm->log, pm->event_db_plugin);
> +	if (!pm->db)
> +	{
> +	      pm->state = PERFMGR_STATE_NO_DB;
> +	      goto Exit;
> +	}
> +	
> +	pm->pc_disp_h = cl_disp_register(disp, OSM_MSG_MAD_PORT_COUNTERS,
> +	                              osm_pc_rcv_process, pm);
> +	if( pm->pc_disp_h == CL_DISP_INVALID_HANDLE )
> +		goto Exit;
> +	
> +	pm->thread_state = OSM_THREAD_STATE_INIT;
> +	status = cl_thread_init( &pm->sweeper, __osm_perfmgr_sweeper, pm,
> +	                       "PerfMgr sweeper" );
> +	if( status != IB_SUCCESS )
> +	 	goto Exit;
> +	
> +Exit:
> +	OSM_LOG_EXIT( log );
> +	return ( status );
> +}
> +
> +/**********************************************************************
> + * Clear the counters from the db
> + **********************************************************************/
> +void
> +osm_perfmgr_clear_counters(osm_perfmgr_t *pm)
> +{
> +	/**
> +	 * FIXME todo issue clear on the fabric?
> +	 */
> +	osm_event_db_clear_port_counters(pm->db);
> +  	osm_log( pm->log, OSM_LOG_INFO, "PerfMgr counters cleared\n");
> +}
> +
> +/*******************************************************************
> + * Have the DB dump its information to the file specified.
> + *******************************************************************/
> +void
> +osm_perfmgr_dump_counters(osm_perfmgr_t *pm, osm_event_db_dump_t dump_type)
> +{
> +	if (osm_event_db_dump(pm->db, pm->event_db_dump_file, dump_type) != 0)
> +	{
> +      		osm_log( pm->log, OSM_LOG_ERROR,
> +               		"PB dump port counters: Failed to file %s : %s",
> +               		pm->event_db_dump_file, strerror(errno));
> +	}
> +}
> +
> +#if 0
> +/*******************************************************************
> + * Use this later to track events on the fabric
> + **********************************************************************/
> +ib_api_status_t
> +osm_report_notice_to_perfmgr(osm_log_t* const log, osm_subn_t*  subn,
> +  			ib_mad_notice_attr_t *p_ntc )
> +{
> +  OSM_LOG_ENTER( log, osm_report_trap_to_pm );
> +  if ((p_ntc->generic_type & 0x80)
> +	  && (cl_ntoh16(p_ntc->g_or_v.generic.trap_num) == 128)) {
> +	  osm_log( log, OSM_LOG_INFO, "PerfMgr notified of trap 128\n");
> +  }
> +  OSM_LOG_EXIT( log );
> +  return (IB_SUCCESS);
> +}
> +#endif
> +
> +#endif /* ENABLE_OSM_PERF_MGR */
> +
> diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
> index c8c3ddc..77c19a5 100644
> --- a/osm/opensm/osm_subnet.c
> +++ b/osm/opensm/osm_subnet.c
> @@ -66,6 +66,7 @@
>  #include <opensm/osm_multicast.h>
>  #include <opensm/osm_inform.h>
>  #include <opensm/osm_console.h>
> +#include <opensm/osm_perfmgr.h>
>  
>  #if defined(PATH_MAX)
>  #define OSM_PATH_MAX	(PATH_MAX + 1)
> @@ -471,6 +472,12 @@ osm_subn_set_default_opt(
>    p_opt->honor_guid2lid_file = FALSE;
>    p_opt->daemon = FALSE;
>    p_opt->sm_inactive = FALSE;
> +#ifdef ENABLE_OSM_PERF_MGR
> +  p_opt->perfmgr = FALSE;
> +  p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S;
> +  p_opt->event_db_dump_file = OSM_PERFMGR_DEFAULT_DUMP_FILE;
> +  p_opt->event_db_plugin = OSM_DEFAULT_EVENT_PLUGIN;
> +#endif /* ENABLE_OSM_PERF_MGR */
>  
>    p_opt->dump_files_dir = getenv("OSM_TMP_DIR");
>    if (!p_opt->dump_files_dir || !(*p_opt->dump_files_dir))
> @@ -1076,6 +1083,24 @@ osm_subn_parse_conf_file(
>          "sm_inactive",
>          p_key, p_val, &p_opts->sm_inactive);
>  
> +#ifdef ENABLE_OSM_PERF_MGR
> +      __osm_subn_opts_unpack_boolean(
> +        "perfmgr",
> +        p_key, p_val, &p_opts->perfmgr);
> +
> +      __osm_subn_opts_unpack_uint16(
> +        "perfmgr_sweep_time_s",
> +        p_key, p_val, &p_opts->perfmgr_sweep_time_s);
> +
> +      __osm_subn_opts_unpack_charp(
> +        "event_db_dump_file",
> +        p_key, p_val, &p_opts->event_db_dump_file);
> +
> +      __osm_subn_opts_unpack_charp(
> +        "event_db_plugin",
> +        p_key, p_val, &p_opts->event_db_plugin);
> +#endif /* ENABLE_OSM_PERF_MGR */
> +
>        subn_parse_qos_options("qos",
>          p_key, p_val, &p_opts->qos_options);
>  
> @@ -1321,6 +1346,32 @@ osm_subn_write_conf_file(
>      p_opts->sm_inactive ? "TRUE" : "FALSE"
>      );
>  
> +#ifdef ENABLE_OSM_PERF_MGR
> +  fprintf(
> +    opts_file,
> +    "#\n# Performance Manager Options\n#\n"
> +    "# perfmgr enable\n"
> +    "perfmgr %s\n\n"
> +    "# sweep time in seconds\n"
> +    "perfmgr_sweep_time_s %d\n\n"
> +    ,
> +    p_opts->perfmgr ? "TRUE" : "FALSE",
> +    p_opts->perfmgr_sweep_time_s
> +    );
> +
> +  fprintf(
> +    opts_file,
> +    "#\n# Event DB Options\n#\n"
> +    "# Dump file to dump the events to\n"
> +    "event_db_dump_file %s\n\n"
> +    "# Event db plugin\n"
> +    "event_db_plugin %s\n\n"
> +    ,
> +    p_opts->event_db_dump_file,
> +    p_opts->event_db_plugin
> +    );
> +#endif /* ENABLE_OSM_PERF_MGR */
> +
>    fprintf( 
>      opts_file,
>      "#\n# DEBUG FEATURES\n#\n"
> diff --git a/osm/opensm/osm_trap_rcv.c b/osm/opensm/osm_trap_rcv.c
> index 0858968..19be781 100644
> --- a/osm/opensm/osm_trap_rcv.c
> +++ b/osm/opensm/osm_trap_rcv.c
> @@ -698,6 +698,21 @@ __osm_trap_rcv_process_request(
>      goto Exit;
>    }
>  
> +#ifdef ENABLE_OSM_PERF_MGR
> +#if 0
> +  /* we still need to work out how this will work */
> +  status = osm_report_notice_to_perfmgr(p_rcv->p_log, p_rcv->p_subn, p_ntci);
> +  if( status != IB_SUCCESS )
> +  {
> +    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> +             "__osm_trap_rcv_process_request: ERR 3803: "
> +             "Error sending trap reports (%s)\n",
> +             ib_get_err_str( status ) );
> +    goto Exit;
> +  }
> +#endif
> +#endif /* ENABLE_OSM_PERF_MGR */
> +
>   Exit:
>    OSM_LOG_EXIT( p_rcv->p_log );
>  }
> -- 
> 1.4.4


From mst at dev.mellanox.co.il  Sun May 13 21:58:32 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 14 May 2007 07:58:32 +0300
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
Message-ID: <20070514045832.GA18615@mellanox.co.il>

Roland, please pick up the patches from:

	git://git.openfabrics.org/~mst/linux-2.6/.git master

This will pull in the following outstanding patches intended for 2.6.22. All of
them have been posted previously; I just cleaned up a couple of whitespace
errors reported by git-apply (let me know if you'd like me to re-post the
patches):

Michael S. Tsirkin (3):
      IB/mthca: fix posting >255 recv WRs for Tavor
      IB/mthca: fix RESET to ERROR transition
      ipoib/cm: optimize stale connection detection

Yosef Etigin (2):
      IB/core: add helpers for uncached gid/pkey queries
      IB/ipoib: handle pkey re-shuffling

-- 
MST


From kliteyn at dev.mellanox.co.il  Mon May 14 00:30:45 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 14 May 2007 10:30:45 +0300
Subject: [ofa-general] [PATCH] osm: integer indexes in fat-tree
Message-ID: <46481025.9080209@dev.mellanox.co.il>

Hi Hal,

Widening the integer rank indexes in fat-tree to 32 bits.
I'm not sure whether this is a bug: fat-tree routing does not rank switches
the same way up/down does. It marks the rank on all the leaf switches,
and only then starts BFS (starting from all the leaf switches and not
from the roots), so I don't think it can really overflow the existing
indexes. But who knows...
Fixing this won't hurt anyway.

Please apply to master.

-- Yevgeny

Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/opensm/osm_ucast_ftree.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)
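
For readers who want to see the ranking order described above, here is a
minimal sketch of a BFS that starts from all leaf switches at once. It is
not the osm_ucast_ftree.c code: toy_fabric_t, toy_rank_from_leaves, MAX_SW
and the adjacency matrix are all hypothetical, and the rank here simply
counts hops from the nearest leaf. Only the 0xFFFFFFFF "not ranked"
sentinel is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RANK_UNSET 0xFFFFFFFFu  /* same "not ranked" sentinel as the patch */
#define MAX_SW     64           /* hypothetical limit for this sketch */

/* Hypothetical toy topology: adj[i][j] != 0 means switches i and j are linked. */
typedef struct toy_fabric {
	size_t   num_sw;
	uint8_t  adj[MAX_SW][MAX_SW];
	uint8_t  is_leaf[MAX_SW];
	uint32_t rank[MAX_SW];
} toy_fabric_t;

/* Mark every leaf with rank 0, then run one BFS seeded with all leaves.
 * Returns the maximal rank that was assigned. */
static uint32_t toy_rank_from_leaves(toy_fabric_t *f)
{
	size_t queue[MAX_SW];   /* each switch is enqueued at most once */
	size_t head = 0, tail = 0;
	uint32_t max_rank = 0;

	assert(f->num_sw <= MAX_SW);

	/* Step 1: rank all the leaf switches before any traversal starts. */
	for (size_t i = 0; i < f->num_sw; i++) {
		f->rank[i] = RANK_UNSET;
		if (f->is_leaf[i]) {
			f->rank[i] = 0;
			queue[tail++] = i;
		}
	}

	/* Step 2: BFS outward; an unranked neighbor gets rank + 1. */
	while (head < tail) {
		size_t cur = queue[head++];

		for (size_t n = 0; n < f->num_sw; n++) {
			if (!f->adj[cur][n] || f->rank[n] != RANK_UNSET)
				continue;
			f->rank[n] = f->rank[cur] + 1;
			if (f->rank[n] > max_rank)
				max_rank = f->rank[n];
			queue[tail++] = n;
		}
	}
	return max_rank;
}
```

In this sketch the largest rank is bounded by the depth of the tree rather
than by the number of switches, which is the reason given above for doubting
that the old 8-bit field could overflow in practice.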

diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
index 7b6a6a5..ca51484 100644
--- a/osm/opensm/osm_ucast_ftree.c
+++ b/osm/opensm/osm_ucast_ftree.c
@@ -174,7 +174,7 @@ typedef struct ftree_sw_t_
 {
    cl_map_item_t          map_item;
    osm_switch_t         * p_osm_sw;
-   uint8_t                rank;
+   uint32_t               rank;
    ftree_tuple_t          tuple;
    ib_net16_t             base_lid;
    ftree_port_group_t  ** down_port_groups;
@@ -588,7 +588,7 @@ __osm_ftree_sw_create(
    memset(p_sw, 0, sizeof(ftree_sw_t));
 
    p_sw->p_osm_sw = p_osm_sw;
-   p_sw->rank = 0xFF;
+   p_sw->rank = 0xFFFFFFFF;
    __osm_ftree_tuple_init(p_sw->tuple);
 
    p_sw->base_lid = osm_node_get_base_lid(p_sw->p_osm_sw->p_node, 0);
@@ -678,7 +678,7 @@ static boolean_t
 __osm_ftree_sw_ranked(
    IN  ftree_sw_t * p_sw)
 {
-   return (p_sw->rank != 0xFF); 
+   return (p_sw->rank != 0xFFFFFFFF); 
 }
 
 /***************************************************/
@@ -1025,7 +1025,7 @@ __osm_ftree_fabric_destroy(ftree_fabric_
 /***************************************************/
 
 static void 
-__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint8_t rank)
+__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint32_t rank)
 {
    if (rank > p_ftree->tree_rank)
       p_ftree->tree_rank = rank;
@@ -1314,7 +1314,7 @@ __osm_ftree_fabric_assign_first_tuple(
    ftree_tuple_t new_tuple;
 
    __osm_ftree_tuple_init(new_tuple);
-   new_tuple[0] = p_sw->rank;
+   new_tuple[0] = (uint8_t)p_sw->rank;
    for (i = 1; i <= p_sw->rank; i++)
       new_tuple[i] = 0;
 
@@ -1374,7 +1374,7 @@ __osm_ftree_fabric_calculate_rank(
 {
    ftree_sw_t   * p_sw;
    ftree_sw_t   * p_next_sw;
-   uint16_t       max_rank = 0;
+   uint32_t       max_rank = 0;
 
    /* go over all the switches and find maximal switch rank */
 
-- 
1.4.4.1.GIT


From vlad at lists.openfabrics.org  Mon May 14 02:32:13 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 14 May 2007 02:32:13 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070514-0200 daily build status
Message-ID: <20070514093213.51F90E6082D@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.13

Failed:


From eli at mellanox.co.il  Mon May 14 01:35:43 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 14 May 2007 11:35:43 +0300
Subject: [ofa-general] [PATCH] IB/core free umem when mm is destroyed
Message-ID: <1179131773.7405.39.camel@mtls03>

Free umem even when the task's mm has already been destroyed by the time
ib_umem_release gets called; the early return in that case skipped the kfree.

Found by Dotan Barak at Mellanox
Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---
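
The pattern being fixed can be sketched in userspace as follows. This is
only an illustration under stated assumptions: toy_umem_t, toy_umem_get,
toy_umem_release and the toy_live_allocs counter are hypothetical stand-ins,
and the mm_alive flag models get_task_mm() returning NULL. None of it is
kernel code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ib_umem. */
typedef struct toy_umem {
	size_t length;
} toy_umem_t;

static int toy_live_allocs;  /* number of toy_umem objects not yet freed */

static toy_umem_t *toy_umem_get(size_t length)
{
	toy_umem_t *u = malloc(sizeof(*u));

	if (u) {
		u->length = length;
		toy_live_allocs++;
	}
	return u;
}

/* mm_alive == 0 models the case where the task's mm is already gone. */
static void toy_umem_release(toy_umem_t *u, int mm_alive)
{
	if (!mm_alive) {
		/* Early exit: the accounting is skipped, but the object
		 * itself still has to be freed on this path. */
		free(u);
		toy_live_allocs--;
		return;
	}

	/* ... accounting against the mm would happen here ... */
	free(u);
	toy_live_allocs--;
}
```

Dropping the free() from the early-exit branch leaves toy_live_allocs at a
non-zero value after every release with mm_alive == 0, which mirrors the
leak described in the changelog.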

Index: connectx_kernel/drivers/infiniband/core/umem.c
===================================================================
--- connectx_kernel.orig/drivers/infiniband/core/umem.c	2007-05-14 09:43:02.000000000 +0300
+++ connectx_kernel/drivers/infiniband/core/umem.c	2007-05-14 10:26:26.000000000 +0300
@@ -261,8 +261,10 @@ void ib_umem_release(struct ib_umem *ume
 	__ib_umem_release(umem->context->device, umem, 1);
 
 	mm = get_task_mm(current);
-	if (!mm)
+	if (!mm) {
+		kfree(umem);
 		return;
+	}
 
 	diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT;
 

From halr at voltaire.com  Mon May 14 03:58:34 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 06:58:34 -0400
Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager
In-Reply-To: <20070513195539.GH29746@sashak.voltaire.com>
References: <20070508184938.311b1c8f.weiny2@llnl.gov>
	<20070513195539.GH29746@sashak.voltaire.com>
Message-ID: <1179140285.1540.167239.camel@hal.voltaire.com>

On Sun, 2007-05-13 at 15:55, Sasha Khapyorsky wrote:
> Hi Ira,
> 
> Thanks for the great work!

Indeed :-)

> On 18:49 Tue 08 May     , Ira Weiny wrote:
> > I would like to submit to the list a performance manager which I have been
> > working on for OpenSM.
> > 
> > It is implemented as the first proposed architecture model set forth by Hal (As
> > an integrated thread to OpenSM.)  As such it works fine on our small test
> > cluster but there is some concern about its scalability.
> > 
> > I have extended this architecture with an idea of my own.  This idea is to have
> > a plug-able module for the "event database".  With this interface one could
> > write their own Data reduction, logging, and tracking methods.  Here at LLNL I
> > propose to use this to add counter and subnet events directly to our management
> > database which is used to show system status to our operators.  Other
> > installations might prefer other methods of logging, SNMP for example.  This
> > patch includes a "reference" implementation of this "event database" which
> > stores the information internally until the user requests a "dump".
> 
> I like this event db idea, but I'm not sure it should be an integral part
> of the low-level perfmgr stuff. As currently implemented, PerfMgr just
> doesn't work without such a plugin loaded: it unconditionally tries to
> pull all port counters, but has nothing to do with them without a plugin.
> 
> Instead I would propose to have a builtin PerfMgr which will be able to
> pull and store performance related data and then call a "generic" event
> manager which can process such data. This would also help to have a simpler
> generic API for such an event db plugin, so other parts of OpenSM will be
> able to report events using the same method(s). What do you think?

Sounds better to me. Ira ?
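
To make the split concrete, here is one possible shape for it, as a sketch
only. Every name below (toy_event_mgr_t, toy_perfmgr_t, toy_event_cb_t and
so on) is hypothetical and does not come from the patch; the point is just
that PerfMgr keeps its own totals and the event manager is an optional
consumer behind a single callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; none are taken from the patch. */
typedef enum toy_event_id {
	TOY_EVENT_PORT_COUNTERS,
	TOY_EVENT_TRAP
} toy_event_id_t;

/* Single generic entry point that any part of OpenSM could call. */
typedef void (*toy_event_cb_t)(void *plugin_ctx, toy_event_id_t id,
			       const void *data);

typedef struct toy_event_mgr {
	toy_event_cb_t report;      /* NULL when no plugin is loaded */
	void          *plugin_ctx;
} toy_event_mgr_t;

typedef struct toy_perfmgr {
	uint64_t         xmit_data_total;  /* stored by PerfMgr itself */
	toy_event_mgr_t *events;           /* optional consumer */
} toy_perfmgr_t;

/* PerfMgr always stores the reading; reporting is a separate, optional step. */
static void toy_perfmgr_add_reading(toy_perfmgr_t *pm, uint32_t xmit_data)
{
	pm->xmit_data_total += xmit_data;

	if (pm->events && pm->events->report)
		pm->events->report(pm->events->plugin_ctx,
				   TOY_EVENT_PORT_COUNTERS,
				   &pm->xmit_data_total);
}

/* Example plugin: counts how many events it was handed. */
static void toy_counting_plugin(void *ctx, toy_event_id_t id, const void *data)
{
	(void)id;
	(void)data;
	(*(unsigned *)ctx)++;
}
```

With this layout PerfMgr still works when no plugin is loaded, and a trap
handler could report through the same callback with a different event id.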

> Some patch related comments are inlined below.
> 
> Sasha
> 
> > 
> > Let the flames begin,
> > Ira Weiny
> > weiny2 at llnl.gov
> > 
> > 
> > 
> > From 4ce288b6a5a371872cf160f6d4e29e768a065cb9 Mon Sep 17 00:00:00 2001
> > From: Ira K. Weiny <weiny2 at llnl.gov>
> > Date: Tue, 24 Apr 2007 23:44:15 -0700
> > Subject: [PATCH] OpenSM Proposed Perf Manager
> > 
> >    Features include:
> >       * Create "PerfMgr" thread and sweep all ports on the subnet every
> >         sweep_time seconds
> >       * port counter clear on overflow
> >       * pluggable architecture for the "event" database
> >       * Output machine and human readable output in the default event database
> >         dump
> >       * Control using the "perfmgr" command in the console
> > 
> >    Known Issues
> >       * Not tested at scale.
> >       * Event database should record trap events and other "interesting" subnet
> >         events.
> >       * port counter log warnings should be configurable, not hard coded.
> >       * partitions are not handled yet.
> >       * Code might not be as pristine as I would like
> > 
> >    Enable using --enable-perf-mgr
> > 
> > Signed-off-by: Ira K. Weiny <weiny2 at llnl.gov>
> > ---
> >  osm/Makefile.am                   |    3 +-
> >  osm/config/osmvsel.m4             |   26 ++
> >  osm/configure.in                  |    5 +-
> >  osm/eventdb/Makefile.am           |   37 ++
> >  osm/eventdb/autogen.sh            |   15 +
> >  osm/eventdb/configure.in          |   70 ++++
> >  osm/eventdb/libibeventdb.map      |    5 +
> >  osm/eventdb/libibeventdb.spec.in  |   38 ++
> >  osm/eventdb/libibeventdb.ver      |    9 +
> >  osm/eventdb/src/ibeventdb.c       |  622 +++++++++++++++++++++++++++++++++
> >  osm/include/Makefile.am           |    2 +
> >  osm/include/iba/ib_types.h        |   74 ++++
> >  osm/include/opensm/osm_base.h     |   23 ++
> >  osm/include/opensm/osm_event_db.h |  151 ++++++++
> >  osm/include/opensm/osm_madw.h     |   40 +++
> >  osm/include/opensm/osm_msgdef.h   |    1 +
> >  osm/include/opensm/osm_opensm.h   |    4 +
> >  osm/include/opensm/osm_perfmgr.h  |  223 ++++++++++++
> >  osm/include/opensm/osm_subnet.h   |   18 +
> >  osm/opensm.spec.in                |   11 +-
> >  osm/opensm/Makefile.am            |    5 +-
> >  osm/opensm/configure.in           |    3 +
> >  osm/opensm/main.c                 |   19 +
> >  osm/opensm/osm_console.c          |   78 +++++
> >  osm/opensm/osm_event_db.c         |  172 +++++++++
> >  osm/opensm/osm_opensm.c           |   24 ++
> >  osm/opensm/osm_perfmgr.c          |  686 +++++++++++++++++++++++++++++++++++++
> >  osm/opensm/osm_subnet.c           |   51 +++
> >  osm/opensm/osm_trap_rcv.c         |   15 +
> >  29 files changed, 2425 insertions(+), 5 deletions(-)
> > 

[snip...]


> > diff --git a/osm/eventdb/src/ibeventdb.c b/osm/eventdb/src/ibeventdb.c
> > new file mode 100644
> > index 0000000..e98f85c
> > --- /dev/null
> > +++ b/osm/eventdb/src/ibeventdb.c
> > @@ -0,0 +1,622 @@
> > +/*
> > + * Copyright (c) 2007 The Regents of the University of California.
> > + *
> > + * This software is available to you under a choice of one of two
> > + * licenses.  You may choose to be licensed under the terms of the GNU
> > + * General Public License (GPL) Version 2, available from the file
> > + * COPYING in the main directory of this source tree, or the
> > + * OpenIB.org BSD license below:
> > + *
> > + *     Redistribution and use in source and binary forms, with or
> > + *     without modification, are permitted provided that the following
> > + *     conditions are met:
> > + *
> > + *      - Redistributions of source code must retain the above
> > + *        copyright notice, this list of conditions and the following
> > + *        disclaimer.
> > + *
> > + *      - Redistributions in binary form must reproduce the above
> > + *        copyright notice, this list of conditions and the following
> > + *        disclaimer in the documentation and/or other materials
> > + *        provided with the distribution.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > + * SOFTWARE.
> > + *
> > + */
> > +
> > +#if HAVE_CONFIG_H
> > +#  include <config.h>
> > +#endif /* HAVE_CONFIG_H */
> > +
> > +#include <errno.h>
> > +#include <string.h>
> > +#include <stdlib.h>
> > +#include <time.h>
> > +#include <dlfcn.h>
> > +#include <stdint.h>
> > +#include <opensm/osm_event_db.h>
> > +#include <complib/cl_qmap.h>
> > +#include <complib/cl_passivelock.h>
> > +
> > +/**
> > + * Port counter object.
> > + * Store all the port counters for a single port.
> > + */
> > +typedef struct _osm_event_pc {
> > +	struct {
> > +		uint64_t symbol_err_cnt;
> > +		uint64_t link_err_recover;
> > +		uint64_t link_downed;
> > +		uint64_t rcv_err;
> > +		uint64_t rcv_rem_phys_err;
> > +		uint64_t rcv_switch_relay_err;
> > +		uint64_t xmit_discards;
> > +		uint64_t xmit_constraint_err;
> > +		uint64_t rcv_constraint_err;
> > +		uint64_t link_int_err;
> > +		uint64_t buffer_overrun_err;
> > +		uint64_t vl15_dropped;
> > +		uint64_t xmit_data;
> > +		uint64_t rcv_data;
> > +		uint64_t xmit_pkts;
> > +		uint64_t rcv_pkts;
> > +		time_t   last_reset;
> > +	} totals;
> > +	osm_pc_reading_t previous;
> > +} osm_event_pc_t;
> > +
> > +/**
> > + * group port counters for ports into the nodes
> > + */
> > +typedef struct _osm_pc_node {
> > +	cl_map_item_t  map_item; /* must be first */
> > +	uint64_t       node_guid;
> > +	osm_event_pc_t   *ports;
> > +	uint8_t        num_ports;
> > +} osm_pc_node_t;
> 
> Is it really necessary to keep osm_pc_node_t nodes in a separate db (qmap)?
> Why not reuse the already existing maps in osm_subn_t (we could add a
> 'void *pm_data' or similar field to the osm_physp_t structure)?

My one concern would be evolving the PerfMgr. This is better now but is
this better when the PerfMgr is separated from the SM functionality ? I
know there are other things to untangle to get there.

-- Hal

[snip...]


From halr at voltaire.com  Mon May 14 04:04:56 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 07:04:56 -0400
Subject: [ofa-general] Re: [PATCH] osm: integer indexes in fat-tree
In-Reply-To: <46481025.9080209@dev.mellanox.co.il>
References: <46481025.9080209@dev.mellanox.co.il>
Message-ID: <1179140689.1540.167579.camel@hal.voltaire.com>

Hi Yevgeny,

On Mon, 2007-05-14 at 03:30, Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> Enhancing integer indexes in fat-tree to 32 bits.
> I'm not sure whether it's a bug - fat-tree routing makes indexing
> not the same way as up/down. It marks rank on all the leaf switches,
> and only then starts BFS (starting from all the leaf switches and not
> from roots), so I don't think that it can really overflow the existing
> indexes. But who knows...
> Fixing this won't hurt anyway.

No, it won't.

> Please apply to master.

See comment/question below.

-- Hal

> -- Yevgeny
> 
> Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  osm/opensm/osm_ucast_ftree.c |   12 ++++++------
>  1 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
> index 7b6a6a5..ca51484 100644
> --- a/osm/opensm/osm_ucast_ftree.c
> +++ b/osm/opensm/osm_ucast_ftree.c
> @@ -174,7 +174,7 @@ typedef struct ftree_sw_t_
>  {
>     cl_map_item_t          map_item;
>     osm_switch_t         * p_osm_sw;
> -   uint8_t                rank;
> +   uint32_t               rank;
>     ftree_tuple_t          tuple;
>     ib_net16_t             base_lid;
>     ftree_port_group_t  ** down_port_groups;
> @@ -588,7 +588,7 @@ __osm_ftree_sw_create(
>     memset(p_sw, 0, sizeof(ftree_sw_t));
>  
>     p_sw->p_osm_sw = p_osm_sw;
> -   p_sw->rank = 0xFF;
> +   p_sw->rank = 0xFFFFFFFF;
>     __osm_ftree_tuple_init(p_sw->tuple);
>  
>     p_sw->base_lid = osm_node_get_base_lid(p_sw->p_osm_sw->p_node, 0);
> @@ -678,7 +678,7 @@ static boolean_t
>  __osm_ftree_sw_ranked(
>     IN  ftree_sw_t * p_sw)
>  {
> -   return (p_sw->rank != 0xFF); 
> +   return (p_sw->rank != 0xFFFFFFFF); 
>  }
>  
>  /***************************************************/
> @@ -1025,7 +1025,7 @@ __osm_ftree_fabric_destroy(ftree_fabric_
>  /***************************************************/
>  
>  static void 
> -__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint8_t rank)
> +__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint32_t rank)
>  {
>     if (rank > p_ftree->tree_rank)
>        p_ftree->tree_rank = rank;
> @@ -1314,7 +1314,7 @@ __osm_ftree_fabric_assign_first_tuple(
>     ftree_tuple_t new_tuple;
>  
>     __osm_ftree_tuple_init(new_tuple);
> -   new_tuple[0] = p_sw->rank;
> +   new_tuple[0] = (uint8_t)p_sw->rank;

Should the declaration of ftree_tuple_t change ?

>     for (i = 1; i <= p_sw->rank; i++)
>        new_tuple[i] = 0;
>  
> @@ -1374,7 +1374,7 @@ __osm_ftree_fabric_calculate_rank(
>  {
>     ftree_sw_t   * p_sw;
>     ftree_sw_t   * p_next_sw;
> -   uint16_t       max_rank = 0;
> +   uint32_t       max_rank = 0;
>  
>     /* go over all the switches and find maximal switch rank */
>  


From halr at voltaire.com  Mon May 14 04:17:08 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 07:17:08 -0400
Subject: [ofa-general] Re: [PATCH] opensm: more
	osm_*_construct/_init/_destroy cleanups
In-Reply-To: <20070510224308.GH9692@sashak.voltaire.com>
References: <20070506174138.GI9692@sashak.voltaire.com>
	<20070506174431.GJ9692@sashak.voltaire.com>
	<1178543690.32222.350646.camel@hal.voltaire.com>
	<20070509212740.GV9692@sashak.voltaire.com>
	<20070510224308.GH9692@sashak.voltaire.com>
Message-ID: <1179141423.1540.168253.camel@hal.voltaire.com>

Hi Sasha,

On Thu, 2007-05-10 at 18:43, Sasha Khapyorsky wrote:
> Hi Hal,
> 
> As suggested :)
> 
> 
> This removes/makes static non used osm_*_construct/_init/_destroy
> initializers for OpenSM objects where osm*_new/_delete are actually
> used.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied (to master only).

-- Hal


From kliteyn at dev.mellanox.co.il  Mon May 14 04:33:05 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 14 May 2007 14:33:05 +0300
Subject: [ofa-general] Re: [PATCH] osm: integer indexes in fat-tree
In-Reply-To: <1179140689.1540.167579.camel@hal.voltaire.com>
References: <46481025.9080209@dev.mellanox.co.il>
	<1179140689.1540.167579.camel@hal.voltaire.com>
Message-ID: <464848F1.207@dev.mellanox.co.il>

Hi Hal,

Hal Rosenstock wrote:
> Hi Yevgeny,
> 
> On Mon, 2007-05-14 at 03:30, Yevgeny Kliteynik wrote:
>> Hi Hal,
>>
>> Enhancing integer indexes in fat-tree to 32 bits.
>> I'm not sure whether it's a bug - fat-tree routing makes indexing
>> not the same way as up/down. It marks rank on all the leaf switches,
>> and only then starts BFS (starting from all the leaf switches and not
>> from roots), so I don't think that it can really overflow the existing
>> indexes. But who knows...
>> Fixing this won't hurt anyway.
> 
> No, it won't.
> 
>> Please apply to master.
> 
> See comment/question below.
> 
> -- Hal
> 
>> -- Yevgeny
>>
>> Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  osm/opensm/osm_ucast_ftree.c |   12 ++++++------
>>  1 files changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
>> index 7b6a6a5..ca51484 100644
>> --- a/osm/opensm/osm_ucast_ftree.c
>> +++ b/osm/opensm/osm_ucast_ftree.c
>> @@ -174,7 +174,7 @@ typedef struct ftree_sw_t_
>>  {
>>     cl_map_item_t          map_item;
>>     osm_switch_t         * p_osm_sw;
>> -   uint8_t                rank;
>> +   uint32_t               rank;
>>     ftree_tuple_t          tuple;
>>     ib_net16_t             base_lid;
>>     ftree_port_group_t  ** down_port_groups;
>> @@ -588,7 +588,7 @@ __osm_ftree_sw_create(
>>     memset(p_sw, 0, sizeof(ftree_sw_t));
>>  
>>     p_sw->p_osm_sw = p_osm_sw;
>> -   p_sw->rank = 0xFF;
>> +   p_sw->rank = 0xFFFFFFFF;
>>     __osm_ftree_tuple_init(p_sw->tuple);
>>  
>>     p_sw->base_lid = osm_node_get_base_lid(p_sw->p_osm_sw->p_node, 0);
>> @@ -678,7 +678,7 @@ static boolean_t
>>  __osm_ftree_sw_ranked(
>>     IN  ftree_sw_t * p_sw)
>>  {
>> -   return (p_sw->rank != 0xFF); 
>> +   return (p_sw->rank != 0xFFFFFFFF); 
>>  }
>>  
>>  /***************************************************/
>> @@ -1025,7 +1025,7 @@ __osm_ftree_fabric_destroy(ftree_fabric_
>>  /***************************************************/
>>  
>>  static void 
>> -__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint8_t rank)
>> +__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint32_t rank)
>>  {
>>     if (rank > p_ftree->tree_rank)
>>        p_ftree->tree_rank = rank;
>> @@ -1314,7 +1314,7 @@ __osm_ftree_fabric_assign_first_tuple(
>>     ftree_tuple_t new_tuple;
>>  
>>     __osm_ftree_tuple_init(new_tuple);
>> -   new_tuple[0] = p_sw->rank;
>> +   new_tuple[0] = (uint8_t)p_sw->rank;
> 
> Should the declaration of ftree_tuple_t change ?

Only if there's a chance that we will build a switch with more than 256 ports.
And when I say "switch" I mean a single unit, not the one that has internal
topology with several building blocks.
But in that case lots of things should be fixed.
For instance, ib_node_info_t.num_ports is 8 bits too.
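
For what it's worth, the cast itself silently truncates: (uint8_t)300
yields 44. If one ever wanted to catch that case instead of relying on the
rank staying small, a checked store could look like the sketch below. The
helper toy_rank_to_tuple_slot is hypothetical and is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: store a 32-bit rank into an 8-bit tuple slot
 * only if the value fits; leave the slot untouched otherwise. */
static int toy_rank_to_tuple_slot(uint32_t rank, uint8_t *slot)
{
	if (rank > UINT8_MAX)
		return -1;          /* would be truncated by a plain cast */

	*slot = (uint8_t)rank;      /* safe: value is known to fit */
	return 0;
}
```

A caller would check the return value and log an error rather than building
a tuple from a truncated rank.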

-- Yevgeny

>>     for (i = 1; i <= p_sw->rank; i++)
>>        new_tuple[i] = 0;
>>  
>> @@ -1374,7 +1374,7 @@ __osm_ftree_fabric_calculate_rank(
>>  {
>>     ftree_sw_t   * p_sw;
>>     ftree_sw_t   * p_next_sw;
>> -   uint16_t       max_rank = 0;
>> +   uint32_t       max_rank = 0;
>>  
>>     /* go over all the switches and find maximal switch rank */
>>  
> 
> 


From halr at voltaire.com  Mon May 14 04:45:55 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 07:45:55 -0400
Subject: [ofa-general] Re: [PATCH] osm: integer indexes in fat-tree
In-Reply-To: <46481025.9080209@dev.mellanox.co.il>
References: <46481025.9080209@dev.mellanox.co.il>
Message-ID: <1179143150.1540.170000.camel@hal.voltaire.com>

On Mon, 2007-05-14 at 03:30, Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> Enhancing integer indexes in fat-tree to 32 bits.
> I'm not sure whether it's a bug - fat-tree routing makes indexing
> not the same way as up/down. It marks rank on all the leaf switches,
> and only then starts BFS (starting from all the leaf switches and not
> from roots), so I don't think that it can really overflow the existing
> indexes. But who knows...
> Fixing this won't hurt anyway.
> 
> Please apply to master.
> 
> -- Yevgeny
> 
> Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied (to master only).

-- Hal


From k_mahesh85 at yahoo.co.in  Mon May 14 04:54:12 2007
From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh)
Date: Mon, 14 May 2007 12:54:12 +0100 (BST)
Subject: [ofa-general] [query]  addressing the the IB switches using LID.
Message-ID: <572616.31409.qm@web8324.mail.in.yahoo.com>

In the case of an IB switch which is not running an IB subnet manager, is there any 
requirement for a LID? 
I mean, is there any situation where an IB switch will be addressed directly 
(using its LID) by the SM or any other node after the subnet initialization is complete?

-Mahesh

 			

From dotanb at dev.mellanox.co.il  Mon May 14 05:15:49 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Mon, 14 May 2007 15:15:49 +0300
Subject: [ofa-general] [query]  addressing the the IB switches using LID.
In-Reply-To: <572616.31409.qm@web8324.mail.in.yahoo.com>
References: <572616.31409.qm@web8324.mail.in.yahoo.com>
Message-ID: <464852F5.2010409@dev.mellanox.co.il>

Keshetti Mahesh wrote:
> In the case of a IB switch which is not running an IB subnet manager 
> is there any
> requirement of LID.
> I mean, is there any situation where an IB switch will be addressed 
> directly
> (using LID) by SM or any other node after the subnet initialization is 
> complete?
>
IB switches are not transparent and every IB switch should get a LID
(at least one port of the switch is connected to the subnet).

It doesn't matter if the SM is being executed on this switch or not.


Dotan


From eli at mellanox.co.il  Mon May 14 05:17:52 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 14 May 2007 15:17:52 +0300
Subject: [ofa-general] [PATCH] libmlx4: WQE shift calculation
Message-ID: <1179145102.25749.11.camel@mtls03>

For RC QPs we need to add the atomic segment size when calculating the
WQE size.

Found by Dotan Barak at Mellanox
Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Roland,
the code that calculates the WQ size is quite different between kernel and
user. I think it would be appropriate to write the code in a way that
allows it to be copied as is between kernel and user. Would you like me
to send such a patch?
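
To illustrate the calculation being fixed, here is a minimal sketch. The
16-byte segment sizes, the 64-byte minimum WQE and the helper names
rc_wqe_size and wqe_shift are assumptions made for illustration and are
not taken from the libmlx4 sources. It sums the segments an RC send WQE
may need and then rounds up to a power of two to get the shift:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed segment sizes in bytes, for illustration only. */
enum {
    CTRL_SEG_SIZE   = 16,
    RADDR_SEG_SIZE  = 16,
    ATOMIC_SEG_SIZE = 16,
    DATA_SEG_SIZE   = 16
};

/* Worst-case send WQE size for an RC QP with max_sge gather entries. */
static size_t rc_wqe_size(unsigned max_sge)
{
    size_t size;

    assert(max_sge > 0);
    size = CTRL_SEG_SIZE + (size_t)max_sge * DATA_SEG_SIZE;
    /* RC needs room for the remote address AND the atomic segment. */
    size += RADDR_SEG_SIZE + ATOMIC_SEG_SIZE;
    return size;
}

/* Smallest shift >= 6 such that (1 << shift) >= size. */
static unsigned wqe_shift(size_t size)
{
    unsigned shift = 6;

    while (((size_t)1 << shift) < size)
        ++shift;
    return shift;
}
```

With these assumed sizes, leaving out the atomic segment would
under-count the WQE by 16 bytes, which can be enough to select a shift
that is one too small.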

Index: connectx_user/src/userspace/libmlx4/src/qp.c
===================================================================
--- connectx_user.orig/src/userspace/libmlx4/src/qp.c	2007-05-14 17:43:10.000000000 +0300
+++ connectx_user/src/userspace/libmlx4/src/qp.c	2007-05-14 17:44:04.000000000 +0300
@@ -439,7 +439,8 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd,
 		break;
 
 	case IBV_QPT_RC:
-		size += sizeof (struct mlx4_wqe_raddr_seg);
+		size += sizeof (struct mlx4_wqe_raddr_seg) +
+			sizeof (struct mlx4_wqe_atomic_seg);
 		/*
 		 * An atomic op will require an atomic segment, a
 		 * remote address segment and one scatter entry.


From k_mahesh85 at yahoo.co.in  Mon May 14 05:26:28 2007
From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh)
Date: Mon, 14 May 2007 13:26:28 +0100 (BST)
Subject: [ofa-general] [query]  addressing the the IB switches using LID.
In-Reply-To: <464852F5.2010409@dev.mellanox.co.il>
Message-ID: <129174.21096.qm@web8320.mail.in.yahoo.com>

> IB switches are not transparent and every IB switch should get a LID
> (at least one port of the switch is connected to the subnet).
>
> It doesn't matter if the SM is being executed on this switch or not.

Yes, according to the IB architecture an IB switch should get a LID. Just out
of curiosity, I want to know whether the IB switch will actually be addressed
using that LID or not.

I mentioned the SM here because if the IB switch is running the SM then 
it needs to be reachable by LID from all nodes in order to answer SA
queries. But in the other case (no SM) I haven't yet seen any situation where the 
IB switch is addressed using the LID assigned to it.

-Mahesh

 			

From halr at voltaire.com  Mon May 14 05:27:37 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 08:27:37 -0400
Subject: [ofa-general] [query]  addressing the the IB switches using LID.
In-Reply-To: <464852F5.2010409@dev.mellanox.co.il>
References: <572616.31409.qm@web8324.mail.in.yahoo.com>
	<464852F5.2010409@dev.mellanox.co.il>
Message-ID: <1179145646.1540.172583.camel@hal.voltaire.com>

On Mon, 2007-05-14 at 08:15, Dotan Barak wrote:
> Keshetti Mahesh wrote:
> > In the case of a IB switch which is not running an IB subnet manager 
> > is there any
> > requirement of LID.
> > I mean, is there any situation where an IB switch will be addressed 
> > directly
> > (using LID) by SM or any other node after the subnet initialization is 
> > complete?
> >
> IB switches are not transparent and every IB switch should get a LID
> (at least one port of the switch is connected to the subnet).

Switch port 0 needs a LID. Switch external/physical ports do not get
them.

-- Hal

> It doesn't matter if the SM is being executed on this switch or not.
> 
> 
> Dotan


From halr at voltaire.com  Mon May 14 05:28:54 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 08:28:54 -0400
Subject: [ofa-general] Re: [query] addressing the the IB switches using LID.
In-Reply-To: <572616.31409.qm@web8324.mail.in.yahoo.com>
References: <572616.31409.qm@web8324.mail.in.yahoo.com>
Message-ID: <1179145661.1540.172585.camel@hal.voltaire.com>

On Mon, 2007-05-14 at 07:54, Keshetti Mahesh wrote:
> In the case of a IB switch which is not running an IB subnet manager
> is there any 
> requirement of LID. 
> I mean, is there any situation where an IB switch will be addressed
> directly 
> (using LID) by SM or any other node after the subnet initialization is
> complete?

This is purely SM policy. IBA does not dictate this and leaves it up to
the SM in question as to whether it uses LID routing or direct routing
to "talk" with nodes (including switches). Clearly, initialization
requires direct routing.
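
As a minimal sketch of what such a policy decision could look like (the
struct node_state fields and pick_smp_class are hypothetical, not OpenSM
code; only the two management class values come from IBA):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Management class values for subnet management packets (SMPs). */
#define MCLASS_SUBN_LID 0x01u  /* LID-routed SMP */
#define MCLASS_SUBN_DIR 0x81u  /* directed-route SMP */

/* Hypothetical per-node state an SM might track. */
struct node_state {
    uint16_t lid;                    /* 0 means no LID assigned yet */
    bool     lid_routes_programmed;  /* forwarding tables are set up */
};

/* Hypothetical policy: use LID routing only once it can work. */
static uint8_t pick_smp_class(const struct node_state *n, bool prefer_lid)
{
    assert(n != NULL);
    if (prefer_lid && n->lid != 0 && n->lid_routes_programmed)
        return MCLASS_SUBN_LID;
    /* During initialization only directed routing is possible. */
    return MCLASS_SUBN_DIR;
}
```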

-- Hal

> -Mahesh


From kliteyn at dev.mellanox.co.il  Mon May 14 05:30:51 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 14 May 2007 15:30:51 +0300
Subject: [ofa-general] [PATCH] osm: fat-tree optimization - creating internal
	nodes
Message-ID: <4648567B.3060809@dev.mellanox.co.il>

Hi Hal,

A small optimization to the creation of fat-tree internal data structures.
Using the pointers from osm_node to osm_switch that Sasha added
a while ago, it is enough to scan the OSM node_guid table only once
to create all the fat-tree internal nodes.
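
The idea can be sketched independently of the OSM types as follows. All
names here (struct node, struct sw, populate_nodes) are hypothetical
stand-ins, and the counters replace the real table insertions:

```c
#include <assert.h>
#include <stddef.h>

enum node_type { NODE_CA, NODE_SWITCH, NODE_ROUTER };

struct sw { int id; };              /* hypothetical switch object */

struct node {
    enum node_type type;
    struct sw *sw;                  /* non-NULL only for switches */
};

struct counts { size_t cas; size_t switches; };

/* One pass over the node table, dispatching on the node type. */
static struct counts populate_nodes(const struct node *nodes, size_t n)
{
    struct counts c = {0, 0};

    assert(nodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        switch (nodes[i].type) {
        case NODE_CA:
            c.cas++;                /* stand-in for adding the CA */
            break;
        case NODE_SWITCH:
            if (nodes[i].sw != NULL)
                c.switches++;       /* stand-in for adding node->sw */
            break;
        case NODE_ROUTER:
            break;                  /* routers are skipped */
        }
    }
    return c;
}
```

Because each node already points at its switch object, no second loop
over a separate switch table is needed.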

Please apply to master only.

-- Yevgeny

Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/opensm/osm_ucast_ftree.c |   51 +++++------------------------------------
 1 files changed, 7 insertions(+), 44 deletions(-)

diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
index ca51484..3bad2fc 100644
--- a/osm/opensm/osm_ucast_ftree.c
+++ b/osm/opensm/osm_ucast_ftree.c
@@ -2365,36 +2365,13 @@ __osm_ftree_fabric_route_to_switches(
  ***************************************************/
 
 static int 
-__osm_ftree_fabric_populate_switches(
-   IN  ftree_fabric_t * p_ftree)
-{
-   osm_switch_t * p_osm_sw;
-   osm_switch_t * p_next_osm_sw;
-
-   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_switches);
-
-   p_next_osm_sw = (osm_switch_t *)cl_qmap_head(&p_ftree->p_osm->subn.sw_guid_tbl);
-   while( p_next_osm_sw != (osm_switch_t *)cl_qmap_end(&p_ftree->p_osm->subn.sw_guid_tbl) )
-   {
-      p_osm_sw = p_next_osm_sw;
-      p_next_osm_sw = (osm_switch_t *)cl_qmap_next(&p_osm_sw->map_item );
-      __osm_ftree_fabric_add_sw(p_ftree,p_osm_sw);
-   }
-   OSM_LOG_EXIT(&p_ftree->p_osm->log);
-   return 0;
-} /* __osm_ftree_fabric_populate_switches() */
-
-/***************************************************
- ***************************************************/
-
-static int 
-__osm_ftree_fabric_populate_hcas(
+__osm_ftree_fabric_populate_nodes(
    IN  ftree_fabric_t * p_ftree)
 {
    osm_node_t   * p_osm_node;
    osm_node_t   * p_next_osm_node;
 
-   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_hcas);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_nodes);
 
    p_next_osm_node = (osm_node_t *)cl_qmap_head(&p_ftree->p_osm->subn.node_guid_tbl);
    while( p_next_osm_node != (osm_node_t *)cl_qmap_end(&p_ftree->p_osm->subn.node_guid_tbl) )
@@ -2409,11 +2386,11 @@ __osm_ftree_fabric_populate_hcas(
          case IB_NODE_TYPE_ROUTER:
             break;
          case IB_NODE_TYPE_SWITCH:
-            /* all the switches added separately */
+            __osm_ftree_fabric_add_sw(p_ftree,p_osm_node->sw);
             break;
          default:
             osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
-                    "__osm_ftree_fabric_populate_hcas: ERR AB0E: "
+                    "__osm_ftree_fabric_populate_nodes: ERR AB0E: "
                     "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
                     cl_ntoh64(osm_node_get_node_guid(p_osm_node)),
                     ib_get_node_type_str(osm_node_get_type(p_osm_node)));
@@ -2424,7 +2401,7 @@ __osm_ftree_fabric_populate_hcas(
 
    OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return 0;
-} /* __osm_ftree_fabric_populate_hcas() */
+} /* __osm_ftree_fabric_populate_nodes() */
 
 /***************************************************
  ***************************************************/
@@ -2962,22 +2939,8 @@ __osm_ftree_construct_fabric(
 
    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
            "__osm_ftree_construct_fabric: "
-           "Populating FatTree switch table\n");
-   /* ToDo: now that the pointer from node to switch exists,  
-      no need to fill the switch table in a separate loop */
-   if (__osm_ftree_fabric_populate_switches(p_ftree) != 0)
-   {
-      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
-              "Fabric topology is not fat-tree - "
-              "falling back to default routing\n");
-      status = -1;
-      goto Exit;
-   }
-
-   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
-           "__osm_ftree_construct_fabric: "
-           "Populating FatTree HCA table\n");
-   if (__osm_ftree_fabric_populate_hcas(p_ftree) != 0)
+           "Populating FatTree Switch and HCA tables\n");
+   if (__osm_ftree_fabric_populate_nodes(p_ftree) != 0)
    {
       osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric topology is not fat-tree - "
-- 
1.4.4.1.GIT


From k_mahesh85 at yahoo.co.in  Mon May 14 05:35:09 2007
From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh)
Date: Mon, 14 May 2007 13:35:09 +0100 (BST)
Subject: [ofa-general] Re: [query] addressing the the IB switches using LID.
In-Reply-To: <1179145661.1540.172585.camel@hal.voltaire.com>
Message-ID: <374012.81252.qm@web8316.mail.in.yahoo.com>

> This is purely SM policy. IBA does not dictate this and leaves it up to
> the SM in question as to whether it uses LID routing or direct routing
> to "talk" with nodes (including switches). Clearly, initialization
> requires direct routing.

What is the policy of the current subnet manager implementation,
i.e. OpenSM?

-Mahesh

       

From halr at voltaire.com  Mon May 14 05:38:53 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 08:38:53 -0400
Subject: [ofa-general] [query]  addressing the the IB switches using LID.
In-Reply-To: <129174.21096.qm@web8320.mail.in.yahoo.com>
References: <129174.21096.qm@web8320.mail.in.yahoo.com>
Message-ID: <1179146303.1540.173170.camel@hal.voltaire.com>

On Mon, 2007-05-14 at 08:26, Keshetti Mahesh wrote:
>         IB switches are not transparent and every IB switch should get
>         a LID
>         (at least one port of the switch is connected to the subnet).
>         
>         It doesn't matter if the SM is being executed on this switch
>         or not.
> Yes, According to the IB architecture IB switch should get a LID. Just
> out
> of curiosity I want to know whether the IB switch will be addressed
> using LID 
> or not.
> 
> I have mentioned SM here beacuse if the IB switch is running the SM
> then 
> it needs to be reachable using LID by all nodes inorder to answer the
> SA
> queries.

More than this.

>  But in the other case (no SM) I didn't see any situation yet where
> the 
> IB switch will be addressed using the LID assigned to it.

Operationally, it depends on the SM. You would also be relying on
something beyond the spec, so if the SM's behavior changed (and such a
change would be valid), things would stop working.

Also, there are some port 0 features which require the LID to be set.

Compliance-wise, it is a non-compliance for switch port 0 not to
have a LID.

-- Hal

> -Mahesh
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From k_mahesh85 at yahoo.co.in  Mon May 14 05:56:06 2007
From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh)
Date: Mon, 14 May 2007 13:56:06 +0100 (BST)
Subject: [ofa-general] Re: [query] addressing the the IB switches using LID.
In-Reply-To: <1179145661.1540.172585.camel@hal.voltaire.com>
Message-ID: <422405.87014.qm@web8316.mail.in.yahoo.com>


>This is purely SM policy. IBA does not dictate this and 
>leaves it up to the SM in question as to whether it uses 
>LID routing or direct routing to "talk" with nodes 
>(including switches). Clearly, initialization requires 
>direct routing.

What is the policy of the current subnet manager
implementation, i.e. OpenSM?

-Mahesh



From kliteyn at dev.mellanox.co.il  Mon May 14 06:02:31 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 14 May 2007 16:02:31 +0300
Subject: [ofa-general] [PATCH] osm: fat-tree optimization - improved ranking
Message-ID: <46485DE7.1050506@dev.mellanox.co.il>

Hi Hal,

This patch optimizes fabric ranking.
All the leaf switches are marked with rank and added to the BFS list,
and only then does ranking of the rest of the fabric begin.
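
Stripped of the OSM data structures, this is a multi-source BFS. A
minimal sketch follows, assuming a flat n*n adjacency matrix and a
fixed-size queue; rank_from_leaves, MAX_SW and UNRANKED are hypothetical
names used only for this illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNRANKED 0xFFFFFFFFu
#define MAX_SW   64           /* hypothetical limit for the sketch */

/*
 * adj[i * n + j] != 0 means switch i and switch j are linked.
 * Returns 0 on success, -1 if n exceeds the sketch's queue size.
 */
static int rank_from_leaves(const uint8_t *adj, size_t n,
                            const uint8_t *is_leaf, uint32_t *rank)
{
    size_t queue[MAX_SW];
    size_t head = 0, tail = 0;

    assert(adj != NULL && is_leaf != NULL && rank != NULL);
    if (n > MAX_SW)
        return -1;

    /* Step 1: mark every leaf with rank 0 and enqueue it. */
    for (size_t i = 0; i < n; i++) {
        rank[i] = is_leaf[i] ? 0 : UNRANKED;
        if (is_leaf[i])
            queue[tail++] = i;
    }

    /* Step 2: a single BFS that starts from all leaves at once. */
    while (head < tail) {
        size_t cur = queue[head++];

        for (size_t j = 0; j < n; j++) {
            if (adj[cur * n + j] && rank[j] == UNRANKED) {
                rank[j] = rank[cur] + 1;
                queue[tail++] = j;
            }
        }
    }
    return 0;
}
```

Since all leaves are in the queue before the scan starts, each switch is
ranked and enqueued at most once, instead of being revisited by a
separate BFS per leaf.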

I actually thought that this was the way I had originally
implemented it, as I mentioned in the patch that was dealing 
with 8 and 16 bit integers :)

A similar optimization may be applicable to up/dn routing - the roots
should be marked with rank 0 and only then should ranking of the rest of
the switches begin, but frankly, it practically doesn't reduce
the routing time, because ranking is only a small fraction of the 
routing runtime (I checked it on a 4K+ subnet).

In the case of fat-tree I'm going to need it anyway when I enhance
the routing to consider only a subset of HCAs in the routing balancing
(compute nodes vs. management nodes).

Please apply to master.

-- Yevgeny

Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

>From dfa455f86d9ac48ff5cefd38a009718e5aeab1f9 Mon Sep 17 00:00:00 2001
From: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
Date: Mon, 14 May 2007 15:45:00 +0300
Subject: [PATCH] DELETE AFTER UPDATE: ranking

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/opensm/osm_ucast_ftree.c |   83 +++++++++++++++++++++++++-----------------
 1 files changed, 49 insertions(+), 34 deletions(-)

diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
index 3bad2fc..84da3d7 100644
--- a/osm/opensm/osm_ucast_ftree.c
+++ b/osm/opensm/osm_ucast_ftree.c
@@ -2406,10 +2406,24 @@ __osm_ftree_fabric_populate_nodes(
 /***************************************************
  ***************************************************/
 
+static boolean_t 
+__osm_ftree_sw_update_rank(
+   IN  ftree_sw_t  * p_sw,
+   IN  uint32_t      new_rank)
+{
+   if (__osm_ftree_sw_ranked(p_sw) && p_sw->rank <= new_rank)
+      return FALSE;
+   p_sw->rank = new_rank;
+   return TRUE;
+
+}
+
+/***************************************************/
+
 static void
-__osm_ftree_rank_from_switch(
+__osm_ftree_rank_switches_from_leafs(
    IN  ftree_fabric_t * p_ftree, 
-   IN  ftree_sw_t *     p_starting_sw)
+   IN  cl_list_t      * p_ranking_bfs_list)
 {
    ftree_sw_t   * p_sw;
    ftree_sw_t   * p_remote_sw;
@@ -2417,19 +2431,11 @@ __osm_ftree_rank_from_switch(
    osm_node_t   * p_remote_node;
    osm_physp_t  * p_osm_port;
    uint8_t        i;
-   cl_list_t      bfs_list;
    ftree_sw_tbl_element_t * p_sw_tbl_element = NULL;
 
-   p_starting_sw->rank = 0;
-
-   /* Run BFS scan of the tree, starting from this switch */
-
-   cl_list_init(&bfs_list, cl_qmap_count(&p_ftree->sw_tbl));
-   cl_list_insert_tail(&bfs_list, &__osm_ftree_sw_tbl_element_create(p_starting_sw)->map_item);
-
-   while (!cl_is_list_empty(&bfs_list))
+   while (!cl_is_list_empty(p_ranking_bfs_list))
    {
-      p_sw_tbl_element = (ftree_sw_tbl_element_t *)cl_list_remove_head(&bfs_list);
+      p_sw_tbl_element = (ftree_sw_tbl_element_t *) cl_list_remove_head(p_ranking_bfs_list);
       p_sw = p_sw_tbl_element->p_sw;
       __osm_ftree_sw_tbl_element_destroy(p_sw_tbl_element);
 
@@ -2457,26 +2463,23 @@ __osm_ftree_rank_from_switch(
             /* remote node is not a switch */
             continue;
          }
-         if (__osm_ftree_sw_ranked(p_remote_sw) && p_remote_sw->rank <= (p_sw->rank + 1))
-            continue;
 
-         /* rank the remote switch and add it to the BFS list */
-         p_remote_sw->rank = p_sw->rank + 1;
-         cl_list_insert_tail(&bfs_list, 
-                             &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item);
+         /* if needed, rank the remote switch and add it to the BFS list */
+         if (__osm_ftree_sw_update_rank(p_remote_sw, p_sw->rank + 1))
+            cl_list_insert_tail(p_ranking_bfs_list, 
+                                &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item);
       }
    }
-   cl_list_destroy(&bfs_list);
-} /* __osm_ftree_rank_from_switch() */
 
+} /* __osm_ftree_rank_switches_from_leafs() */
 
-/***************************************************
- ***************************************************/
+/***************************************************/
 
 static int 
-__osm_ftree_rank_switches_from_hca(
+__osm_ftree_rank_leaf_switches(
    IN  ftree_fabric_t * p_ftree,
-   IN  ftree_hca_t    * p_hca)
+   IN  ftree_hca_t    * p_hca,
+   IN  cl_list_t      * p_ranking_bfs_list)
 {
    ftree_sw_t     * p_sw;
    osm_node_t     * p_osm_node = p_hca->p_osm_node;
@@ -2502,7 +2505,7 @@ __osm_ftree_rank_switches_from_hca(
          case IB_NODE_TYPE_CA:
             /* HCA connected directly to another HCA - not FatTree */
             osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
-                    "__osm_ftree_rank_switches_from_hca: ERR AB0F: "
+                    "__osm_ftree_rank_leaf_switches: ERR AB0F: "
                     "HCA conected directly to another HCA: "
                     "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n",
                     cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
@@ -2520,7 +2523,7 @@ __osm_ftree_rank_switches_from_hca(
 
          default:
             osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
-                    "__osm_ftree_rank_switches_from_hca: ERR AB10: "
+                    "__osm_ftree_rank_leaf_switches: ERR AB10: "
                     "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
                     cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node)),
                     ib_get_node_type_str(osm_node_get_type(p_remote_osm_node)));
@@ -2535,11 +2538,12 @@ __osm_ftree_rank_switches_from_hca(
 
       CL_ASSERT(p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl));
 
-      if (__osm_ftree_sw_ranked(p_sw) && p_sw->rank == 0)
+      if ( !__osm_ftree_sw_update_rank(p_sw,0) )
          continue;
 
+      /* if needed, rank the remote switch and add it to the BFS list */
       osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
-              "__osm_ftree_rank_switches_from_hca: "
+              "__osm_ftree_rank_leaf_switches: "
               "Marking rank of switch that is directly connected to HCA:\n"
               "                                            - HCA guid   : 0x%016" PRIx64 "\n"
               "                                            - Switch guid: 0x%016" PRIx64 "\n"
@@ -2547,13 +2551,14 @@ __osm_ftree_rank_switches_from_hca(
               cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
               cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)),
               cl_ntoh16(p_sw->base_lid));
-      __osm_ftree_rank_from_switch(p_ftree, p_sw);
+      cl_list_insert_tail(p_ranking_bfs_list, 
+                          &__osm_ftree_sw_tbl_element_create(p_sw)->map_item);
    }
 
  Exit:
    OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return res;
-} /* __osm_ftree_rank_switches_from_hca() */
+} /* __osm_ftree_rank_leaf_switches() */
 
 /***************************************************/
 
@@ -2789,18 +2794,21 @@ __osm_ftree_fabric_construct_sw_ports(
 /***************************************************
  ***************************************************/
 
-/* ToDo: improve ranking algorithm complexity
-   by propogating BFS from more nodes */ 
 static int
 __osm_ftree_fabric_perform_ranking(
    IN  ftree_fabric_t * p_ftree)
 {
    ftree_hca_t * p_hca;
    ftree_hca_t * p_next_hca;
+   cl_list_t     ranking_bfs_list;
    int res = 0;
 
    OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_perform_ranking);
 
+   /* Init the bfs list - the list of the switches that will be
+      initially filled with the leaf switches */
+   cl_list_init(&ranking_bfs_list, cl_qmap_count(&p_ftree->sw_tbl));
+
    /* Mark REVERSED rank of all the switches in the subnet. 
       Start from switches that are connected to hca's, and 
       scan all the switches in the subnet. */
@@ -2809,7 +2817,7 @@ __osm_ftree_fabric_perform_ranking(
    {
       p_hca = p_next_hca;
       p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item );
-      if (__osm_ftree_rank_switches_from_hca(p_ftree,p_hca) != 0)
+      if (__osm_ftree_rank_leaf_switches(p_ftree, p_hca, &ranking_bfs_list) != 0)
       {
          res = -1;
          osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
@@ -2819,7 +2827,14 @@ __osm_ftree_fabric_perform_ranking(
       }
    }
 
-   /* calculate and set FatTree rank */
+   /* Now rank rest of the switches in the fabric, while the
+      list already contains all the ranked leaf switches */
+   __osm_ftree_rank_switches_from_leafs(p_ftree, &ranking_bfs_list);
+   cl_list_destroy(&ranking_bfs_list);
+   
+   /* REVERSED ranking of all the switches completed.
+      Calculate and set FatTree rank */
+
    __osm_ftree_fabric_calculate_rank(p_ftree);
    osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
            "__osm_ftree_fabric_perform_ranking: "
-- 
1.4.4.1.GIT


From halr at voltaire.com  Mon May 14 06:07:19 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 09:07:19 -0400
Subject: [ofa-general] Re: [PATCH] osm: fat-tree optimization - creating
	internal nodes
In-Reply-To: <4648567B.3060809@dev.mellanox.co.il>
References: <4648567B.3060809@dev.mellanox.co.il>
Message-ID: <1179148023.1540.174674.camel@hal.voltaire.com>

Hi Yevgeny,

On Mon, 2007-05-14 at 08:30, Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> A small optimization to creation of fat-tree internal data structures.
> Using the pointers from osm_node to osm_switch that Sasha has added
> a while ago, it is enough to scan the OSM node_guid table only once
> to create all the fat-tree internal nodes.
> 
> Please apply to master only.
> 
> -- Yevgeny
> 
> Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied (to master only).

-- Hal


From halr at voltaire.com  Mon May 14 06:13:05 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 09:13:05 -0400
Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM/osm_ucast_ftree.c: Change HCA
	to CA in log messages
Message-ID: <1179148361.1540.175012.camel@hal.voltaire.com>

OpenSM/osm_ucast_ftree.c: Change HCA to CA in log messages

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
index 3bad2fc..eb33e5a 100644
--- a/osm/opensm/osm_ucast_ftree.c
+++ b/osm/opensm/osm_ucast_ftree.c
@@ -850,7 +850,7 @@ __osm_ftree_hca_dump(
 
    osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
            "__osm_ftree_hca_dump: "
-           "HCA GUID: 0x%016" PRIx64 ", Ports: %u UP\n",
+           "CA GUID: 0x%016" PRIx64 ", Ports: %u UP\n",
           cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), 
           p_hca->up_port_groups_num);
 
@@ -1124,7 +1124,7 @@ __osm_ftree_fabric_dump(ftree_fabric_t *
            "                       |-------------------------------|\n\n");
 
    osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
-           "__osm_ftree_fabric_dump: -- HCAs:\n");
+           "__osm_ftree_fabric_dump: -- CAs:\n");
 
    for ( p_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
          p_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl);
@@ -1174,7 +1174,7 @@ __osm_ftree_fabric_dump_general_info(
            p_ftree->tree_rank);
    osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
            "__osm_ftree_fabric_dump_general_info: "
-           "  - Fabric has %u HCAs, %u switches\n",
+           "  - Fabric has %u CAs, %u switches\n",
            cl_qmap_count(&p_ftree->hca_tbl),
            cl_qmap_count(&p_ftree->sw_tbl));
 
@@ -1886,7 +1886,7 @@ __osm_ftree_fabric_route_upgoing_by_goin
                                             p_min_port->remote_port_num);
          osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_upgoing_by_going_down: "
-                 "Switch %s: set path to HCA LID 0x%x through port %u\n",
+                 "Switch %s: set path to CA LID 0x%x through port %u\n",
                  __osm_ftree_tuple_to_str(p_remote_sw->tuple),
                  cl_ntoh16(target_lid),
                  p_min_port->remote_port_num);
@@ -2067,7 +2067,7 @@ __osm_ftree_fabric_route_downgoing_by_go
       {
          osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_downgoing_by_going_up: "
-                 " - Routing MAIN path for %s HCA LID 0x%x: %s --> %s\n",
+                 " - Routing MAIN path for %s CA LID 0x%x: %s --> %s\n",
                  (is_real_lid)? "real" : "DUMMY",
                  cl_ntoh16(target_lid),
                  __osm_ftree_tuple_to_str(p_sw->tuple),
@@ -2084,7 +2084,7 @@ __osm_ftree_fabric_route_downgoing_by_go
                                             p_min_port->remote_port_num);
          osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_downgoing_by_going_up: "
-                 "Switch %s: set path to HCA LID 0x%x through port %u\n",
+                 "Switch %s: set path to CA LID 0x%x through port %u\n",
                  __osm_ftree_tuple_to_str(p_remote_sw->tuple),
                  cl_ntoh16(target_lid),p_min_port->remote_port_num);
 
@@ -2250,7 +2250,7 @@ __osm_ftree_fabric_route_to_hcas(
                                             p_port->port_num);
          osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_to_hcas: "
-                 "Switch %s: set path to HCA LID 0x%x through port %u\n",
+                 "Switch %s: set path to CA LID 0x%x through port %u\n",
                  __osm_ftree_tuple_to_str(p_sw->tuple),
                  cl_ntoh16(remote_lid),
                  p_port->port_num);
@@ -2279,7 +2279,7 @@ __osm_ftree_fabric_route_to_hcas(
       if (p_ftree->max_hcas_per_leaf > p_sw->down_port_groups_num)
       {
          osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: "
-                 "Routing %u dummy HCAs\n",
+                 "Routing %u dummy CAs\n",
                  p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num);
          for ( j = 0;
                ((int)j) < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num);
@@ -2503,7 +2503,7 @@ __osm_ftree_rank_switches_from_hca(
             /* HCA connected directly to another HCA - not FatTree */
             osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
                     "__osm_ftree_rank_switches_from_hca: ERR AB0F: "
-                    "HCA conected directly to another HCA: "
+                    "CA conected directly to another CA: "
                     "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n",
                     cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
                     cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node)));
@@ -2540,8 +2540,8 @@ __osm_ftree_rank_switches_from_hca(
 
       osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
               "__osm_ftree_rank_switches_from_hca: "
-              "Marking rank of switch that is directly connected to HCA:\n"
-              "                                            - HCA guid   : 0x%016" PRIx64 "\n"
+              "Marking rank of switch that is directly connected to CA:\n"
+              "                                            - CA guid    : 0x%016" PRIx64 "\n"
               "                                            - Switch guid: 0x%016" PRIx64 "\n"
               "                                            - Switch LID : 0x%x\n",
               cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
@@ -2613,7 +2613,7 @@ __osm_ftree_fabric_construct_hca_ports(
             /* HCA connected directly to another HCA - not FatTree */
             osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
                     "__osm_ftree_fabric_construct_hca_ports: ERR AB11: "
-                    "HCA conected directly to another HCA: "
+                    "CA conected directly to another CA: "
                     "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n",
                     cl_ntoh64(osm_node_get_node_guid(p_node)),
                     cl_ntoh64(remote_node_guid));
@@ -2939,7 +2939,7 @@ __osm_ftree_construct_fabric(
 
    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
            "__osm_ftree_construct_fabric: "
-           "Populating FatTree Switch and HCA tables\n");
+           "Populating FatTree Switch and CA tables\n");
    if (__osm_ftree_fabric_populate_nodes(p_ftree) != 0)
    {
       osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
@@ -2952,7 +2952,7 @@ __osm_ftree_construct_fabric(
    if (cl_qmap_count(&p_ftree->hca_tbl) < 2)
    {
       osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
-              "Fabric has %u HCAa - topology is not fat-tree.\n"
+              "Fabric has %u CAa - topology is not fat-tree.\n"
               "Falling back to default routing.\n",
               cl_qmap_count(&p_ftree->hca_tbl));
       status = -1;
@@ -2983,7 +2983,7 @@ __osm_ftree_construct_fabric(
       we want the ports to have pointers to ftree_{sw,hca}_t objects.*/
    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
            "__osm_ftree_construct_fabric: "
-           "Populating HCA & switch ports\n");
+           "Populating CA & switch ports\n");
    if (__osm_ftree_fabric_populate_ports(p_ftree) != 0)
    {
       osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
@@ -3061,7 +3061,7 @@ __osm_ftree_do_routing(
            "Starting FatTree routing\n");
 
    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
-           "Filling switch forwarding tables for routes to HCAs\n");
+           "Filling switch forwarding tables for routes to CAs\n");
    __osm_ftree_fabric_route_to_hcas(p_ftree);
 
    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "


From kliteyn at dev.mellanox.co.il  Mon May 14 06:21:26 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 14 May 2007 16:21:26 +0300
Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM/osm_ucast_ftree.c: Change
	HCA	to CA in log messages
In-Reply-To: <1179148361.1540.175012.camel@hal.voltaire.com>
References: <1179148361.1540.175012.camel@hal.voltaire.com>
Message-ID: <46486256.6050806@dev.mellanox.co.il>

Hi Hal,

Hal Rosenstock wrote:
> OpenSM/osm_ucast_ftree.c: Change HCA to CA in log messages

Sure, makes sense.

--Yevgeny
 
> Signed-off-by: Hal Rosenstock <halr at voltaire.com>
> 
> diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
> index 3bad2fc..eb33e5a 100644
> --- a/osm/opensm/osm_ucast_ftree.c
> +++ b/osm/opensm/osm_ucast_ftree.c
> @@ -850,7 +850,7 @@ __osm_ftree_hca_dump(
>  
>     osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
>             "__osm_ftree_hca_dump: "
> -           "HCA GUID: 0x%016" PRIx64 ", Ports: %u UP\n",
> +           "CA GUID: 0x%016" PRIx64 ", Ports: %u UP\n",
>            cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), 
>            p_hca->up_port_groups_num);
>  
> @@ -1124,7 +1124,7 @@ __osm_ftree_fabric_dump(ftree_fabric_t *
>             "                       |-------------------------------|\n\n");
>  
>     osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
> -           "__osm_ftree_fabric_dump: -- HCAs:\n");
> +           "__osm_ftree_fabric_dump: -- CAs:\n");
>  
>     for ( p_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
>           p_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl);
> @@ -1174,7 +1174,7 @@ __osm_ftree_fabric_dump_general_info(
>             p_ftree->tree_rank);
>     osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
>             "__osm_ftree_fabric_dump_general_info: "
> -           "  - Fabric has %u HCAs, %u switches\n",
> +           "  - Fabric has %u CAs, %u switches\n",
>             cl_qmap_count(&p_ftree->hca_tbl),
>             cl_qmap_count(&p_ftree->sw_tbl));
>  
> @@ -1886,7 +1886,7 @@ __osm_ftree_fabric_route_upgoing_by_goin
>                                              p_min_port->remote_port_num);
>           osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
>                   "__osm_ftree_fabric_route_upgoing_by_going_down: "
> -                 "Switch %s: set path to HCA LID 0x%x through port %u\n",
> +                 "Switch %s: set path to CA LID 0x%x through port %u\n",
>                   __osm_ftree_tuple_to_str(p_remote_sw->tuple),
>                   cl_ntoh16(target_lid),
>                   p_min_port->remote_port_num);
> @@ -2067,7 +2067,7 @@ __osm_ftree_fabric_route_downgoing_by_go
>        {
>           osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
>                   "__osm_ftree_fabric_route_downgoing_by_going_up: "
> -                 " - Routing MAIN path for %s HCA LID 0x%x: %s --> %s\n",
> +                 " - Routing MAIN path for %s CA LID 0x%x: %s --> %s\n",
>                   (is_real_lid)? "real" : "DUMMY",
>                   cl_ntoh16(target_lid),
>                   __osm_ftree_tuple_to_str(p_sw->tuple),
> @@ -2084,7 +2084,7 @@ __osm_ftree_fabric_route_downgoing_by_go
>                                              p_min_port->remote_port_num);
>           osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
>                   "__osm_ftree_fabric_route_downgoing_by_going_up: "
> -                 "Switch %s: set path to HCA LID 0x%x through port %u\n",
> +                 "Switch %s: set path to CA LID 0x%x through port %u\n",
>                   __osm_ftree_tuple_to_str(p_remote_sw->tuple),
>                   cl_ntoh16(target_lid),p_min_port->remote_port_num);
>  
> @@ -2250,7 +2250,7 @@ __osm_ftree_fabric_route_to_hcas(
>                                              p_port->port_num);
>           osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
>                   "__osm_ftree_fabric_route_to_hcas: "
> -                 "Switch %s: set path to HCA LID 0x%x through port %u\n",
> +                 "Switch %s: set path to CA LID 0x%x through port %u\n",
>                   __osm_ftree_tuple_to_str(p_sw->tuple),
>                   cl_ntoh16(remote_lid),
>                   p_port->port_num);
> @@ -2279,7 +2279,7 @@ __osm_ftree_fabric_route_to_hcas(
>        if (p_ftree->max_hcas_per_leaf > p_sw->down_port_groups_num)
>        {
>           osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: "
> -                 "Routing %u dummy HCAs\n",
> +                 "Routing %u dummy CAs\n",
>                   p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num);
>           for ( j = 0;
>                 ((int)j) < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num);
> @@ -2503,7 +2503,7 @@ __osm_ftree_rank_switches_from_hca(
>              /* HCA connected directly to another HCA - not FatTree */
>              osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
>                      "__osm_ftree_rank_switches_from_hca: ERR AB0F: "
> -                    "HCA conected directly to another HCA: "
> +                    "CA conected directly to another CA: "
>                      "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n",
>                      cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
>                      cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node)));
> @@ -2540,8 +2540,8 @@ __osm_ftree_rank_switches_from_hca(
>  
>        osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
>                "__osm_ftree_rank_switches_from_hca: "
> -              "Marking rank of switch that is directly connected to HCA:\n"
> -              "                                            - HCA guid   : 0x%016" PRIx64 "\n"
> +              "Marking rank of switch that is directly connected to CA:\n"
> +              "                                            - CA guid    : 0x%016" PRIx64 "\n"
>                "                                            - Switch guid: 0x%016" PRIx64 "\n"
>                "                                            - Switch LID : 0x%x\n",
>                cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
> @@ -2613,7 +2613,7 @@ __osm_ftree_fabric_construct_hca_ports(
>              /* HCA connected directly to another HCA - not FatTree */
>              osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
>                      "__osm_ftree_fabric_construct_hca_ports: ERR AB11: "
> -                    "HCA conected directly to another HCA: "
> +                    "CA conected directly to another CA: "
>                      "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n",
>                      cl_ntoh64(osm_node_get_node_guid(p_node)),
>                      cl_ntoh64(remote_node_guid));
> @@ -2939,7 +2939,7 @@ __osm_ftree_construct_fabric(
>  
>     osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
>             "__osm_ftree_construct_fabric: "
> -           "Populating FatTree Switch and HCA tables\n");
> +           "Populating FatTree Switch and CA tables\n");
>     if (__osm_ftree_fabric_populate_nodes(p_ftree) != 0)
>     {
>        osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
> @@ -2952,7 +2952,7 @@ __osm_ftree_construct_fabric(
>     if (cl_qmap_count(&p_ftree->hca_tbl) < 2)
>     {
>        osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
> -              "Fabric has %u HCAa - topology is not fat-tree.\n"
> +              "Fabric has %u CAa - topology is not fat-tree.\n"
>                "Falling back to default routing.\n",
>                cl_qmap_count(&p_ftree->hca_tbl));
>        status = -1;
> @@ -2983,7 +2983,7 @@ __osm_ftree_construct_fabric(
>        we want the ports to have pointers to ftree_{sw,hca}_t objects.*/
>     osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
>             "__osm_ftree_construct_fabric: "
> -           "Populating HCA & switch ports\n");
> +           "Populating CA & switch ports\n");
>     if (__osm_ftree_fabric_populate_ports(p_ftree) != 0)
>     {
>        osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
> @@ -3061,7 +3061,7 @@ __osm_ftree_do_routing(
>             "Starting FatTree routing\n");
>  
>     osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
> -           "Filling switch forwarding tables for routes to HCAs\n");
> +           "Filling switch forwarding tables for routes to CAs\n");
>     __osm_ftree_fabric_route_to_hcas(p_ftree);
>  
>     osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
> 
> 
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 


From ossrosch at linux.vnet.ibm.com  Mon May 14 06:31:26 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Mon, 14 May 2007 15:31:26 +0200
Subject: [ofa-general] Re: [ewg] Re: Build problem with RHEL-4.5 and OFED-1.2
In-Reply-To: <200705092357.59973.ossrosch@linux.vnet.ibm.com>
References: <200705091824.54394.ossrosch@linux.vnet.ibm.com>
	<1178737535.2848.152.camel@fc6.xsintricity.com>
	<200705092357.59973.ossrosch@linux.vnet.ibm.com>
Message-ID: <200705141531.26635.ossrosch@linux.vnet.ibm.com>

Hi Doug,

Is there any news on this problem? Is it a problem with the OFED build or
a problem with Red Hat?
Should I open a Bugzilla entry to track this?

Regards Stefan
On Wednesday 09 May 2007 23:57, Stefan Roscher wrote:
> On Wednesday 09 May 2007 21:05, Doug Ledford wrote:
> > On Wed, 2007-05-09 at 18:24 +0200, Stefan Roscher wrote:
> > > Hi Doug,
> > > 
> > > I installed RHEL-4.5 on one of our ppc64 systems and noticed that the
> > > asm-ppc directory is missing in /usr/src/kernels/2.6.9-55.EL/include.
> > > Normally I don't need this directory, but ibmebus.h includes
> > > asm-ppc64/of_device.h, which in turn includes asm-ppc/of_device.h.
> > > Because this file is missing, I cannot build ehca and the OFED stack
> > > with today's ofed-1.2 daily build.
> > > 
> > > Did I do something wrong during installation?
> > > 
> > > Regards Stefan Roscher
> > 
> > I'll look into it, but in the meantime, install the kernel src.rpm, go
> > into /usr/src/redhat/SPECS and run rpmbuild --bp kernel-2.6.spec and it
> > should create a complete source tree
> > in /usr/src/redhat/BUILD/kernel-2.6.18 that you can then get the asm-ppc
> > directory contents out of.
> > 
> > -- 
> > Doug Ledford <dledford at redhat.com>
> >               GPG KeyID: CFBFF194
> >               http://people.redhat.com/dledford
> > 
> > Infiniband specific RPMs available at
> >               http://people.redhat.com/dledford/Infiniband
> > 
> To create the backport patches for RHEL 4.5 I did as you describe, but the
> OFED build scripts don't use the kernel sources in /usr/src/redhat/BUILD.
> OFED-1.2 uses the source link within /lib/modules/kernel-x.x.x, which
> points into /usr/src/kernel; these kernel sources were created during
> installation of RHEL 4.5, and the directory include/asm-ppc is missing
> from them.
> This is why I did not find this problem while creating the backport
> patches.
> 
> regards stefan
> 
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> 


From halr at voltaire.com  Mon May 14 06:30:52 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 09:30:52 -0400
Subject: [ofa-general] Re: [query] addressing the the IB switches using LID.
In-Reply-To: <374012.81252.qm@web8316.mail.in.yahoo.com>
References: <374012.81252.qm@web8316.mail.in.yahoo.com>
Message-ID: <1179149388.1540.175937.camel@hal.voltaire.com>

On Mon, 2007-05-14 at 08:35, Keshetti Mahesh wrote:
>         This is purely SM policy. IBA does not dictate this and leaves
>         it up to
>         the SM in question as to whether it uses LID routing or direct
>         routing
>         to "talk" with nodes (including switches). Clearly,
>         initialization
>         requires direct routing.
> what is the policy of current implementation of subnet manager 
> i.e. openSM?

OpenSM currently uses direct routing (DR) except when polling standby SMs,
but relying on this is error-prone and not compliant with IBA.

-- Hal

> -Mahesh


From halr at voltaire.com  Mon May 14 07:02:47 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 10:02:47 -0400
Subject: [ofa-general] Re: [PATCH] osm: fat-tree optimization - improved
	ranking
In-Reply-To: <46485DE7.1050506@dev.mellanox.co.il>
References: <46485DE7.1050506@dev.mellanox.co.il>
Message-ID: <1179151362.1540.177796.camel@hal.voltaire.com>

Hi Yevgeny,

On Mon, 2007-05-14 at 09:02, Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> This patch optimizes fabric ranking.
> All the leaf switches are marked with rank and added to the BFS list,
> and only then ranking of rest of the fabric begins.
> 
> I actually thought that this is the way I've originally
> implemented it, as I mentioned in the patch that was dealing 
> with 8 and 16 bit integers :)
> 
> Similar optimization may be applicable to up/dn routing - the roots
> should be marked with rank 0 and only then ranking of rest of the 
> switches should begin, but frankly, it practically doesn't reduce
> the routing time, because ranking is only a small fraction of the 
> routing runtime (I checked it on a 4K+ subnet).

It's still worth doing IMO. Can you look into this for up/down ?
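
To make the idea concrete, here is a minimal sketch of that ranking order:
every seed switch (roots for up/down, leaves for fat-tree) gets rank 0 and
is queued first, and only then does a single BFS rank the rest. This is not
OpenSM code; struct fabric, MAX_SW, add_link() and rank_from_seeds() are
hypothetical names made up for the illustration.

```c
#include <assert.h>

#define MAX_SW     64     /* hypothetical fixed capacity for this sketch */
#define RANK_UNSET (-1)

/* Hypothetical fabric: adj[i][0..deg[i]-1] lists the neighbours of switch i. */
struct fabric {
    int n;
    int deg[MAX_SW];
    int adj[MAX_SW][MAX_SW];
    int rank[MAX_SW];
};

/* Add a bidirectional link between switches a and b. */
static void add_link(struct fabric *f, int a, int b)
{
    f->adj[a][f->deg[a]++] = b;
    f->adj[b][f->deg[b]++] = a;
}

/* Rank every switch by its hop distance from the nearest seed switch. */
static void rank_from_seeds(struct fabric *f, const int *seeds, int n_seeds)
{
    int queue[MAX_SW];
    int head = 0, tail = 0;

    assert(f->n <= MAX_SW);
    for (int i = 0; i < f->n; i++)
        f->rank[i] = RANK_UNSET;

    /* Step 1: mark all seeds with rank 0 and put them on the BFS list. */
    for (int i = 0; i < n_seeds; i++) {
        if (f->rank[seeds[i]] == RANK_UNSET) {
            f->rank[seeds[i]] = 0;
            queue[tail++] = seeds[i];
        }
    }

    /* Step 2: only now rank the rest of the fabric, in one BFS pass.
     * Each switch is queued at most once, so the queue cannot overflow. */
    while (head < tail) {
        int sw = queue[head++];
        for (int k = 0; k < f->deg[sw]; k++) {
            int nb = f->adj[sw][k];
            if (f->rank[nb] == RANK_UNSET) {
                f->rank[nb] = f->rank[sw] + 1;
                queue[tail++] = nb;
            }
        }
    }
}
```

With all seeds queued up front, each switch is visited once, instead of
being re-ranked by a separate traversal started from every seed.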

> In case of fat-tree I'm going to need it anyway when I enhance
> the routing to consider only subset of HCAs in the routing balancing
> (compute nodes vs. management nodes).
> 
> Please apply to master.
> 
> -- Yevgeny
> 
> Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> 
> >From dfa455f86d9ac48ff5cefd38a009718e5aeab1f9 Mon Sep 17 00:00:00 2001
> From: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> Date: Mon, 14 May 2007 15:45:00 +0300
> Subject: [PATCH] DELETE AFTER UPDATE: ranking
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied (to master only).

-- Hal


From kliteyn at dev.mellanox.co.il  Mon May 14 07:07:26 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 14 May 2007 17:07:26 +0300
Subject: [ofa-general] Error message in OSM log when cached op file doesn't
	exist 
Message-ID: <46486D1E.6010408@dev.mellanox.co.il>

Hi Hal.

[snip]
> Date:   03/30/2007 12:24:12 AM
> OpenSM: Handle conf file open failures better
>     
> diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
> index 46315a5..746fbd1 100644
> --- a/osm/opensm/osm_subnet.c
> +++ b/osm/opensm/osm_subnet.c
> @@ -732,7 +732,7 @@ subn_dump_qos_options(
>  
>  /**********************************************************************
>   **********************************************************************/
> -void
> +ib_api_status_t
>  osm_subn_rescan_conf_file(
>    IN osm_subn_opt_t* const p_opts )
>  {
> @@ -751,7 +751,7 @@ osm_subn_rescan_conf_file(
>    
>    opts_file = fopen(file_name, "r");
>    if (!opts_file)
> -    return;
> +    return IB_ERROR;
[/snip]

This patch was applied to master a month and a half ago.
It handles opening the cached options file, and prints an error message
when OSM fails to open that file.

I actually don't like this, because now every time you run OpenSM on a
machine that doesn't have a cached options file (which is usually the
case) you get an error message.

There's no point in checking whether the file exists, because OSM runs
as root, and if it fails to open this file, it means that the file
doesn't exist or is inaccessible (broken mount, etc).

In any case, the user gets info on stdout about whether or not OpenSM is
using a cached options file.

Do you agree? Should I issue a patch?
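
One possible middle ground, sketched below, is to look at errno after the
failed fopen() and treat "file does not exist" as the normal case while
still reporting other failures. This is only an illustration, assuming
POSIX behaviour (fopen() sets errno on failure); enum cache_status and
open_cached_opts() are hypothetical names, and fprintf() stands in for
the OSM log.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result codes for this sketch; not OpenSM's ib_api_status_t. */
enum cache_status { CACHE_LOADED, CACHE_ABSENT, CACHE_ERROR };

/* Try to open the cached options file and classify the outcome. */
static enum cache_status open_cached_opts(const char *file_name, FILE **out)
{
    int err;

    assert(file_name != NULL && out != NULL);

    *out = fopen(file_name, "r");
    if (*out)
        return CACHE_LOADED;

    err = errno;  /* save it before another library call can overwrite it */

    if (err == ENOENT) {
        /* No cache file is the usual case: a note, not an error. */
        fprintf(stderr, "INFO: no cached options file %s, using defaults\n",
                file_name);
        return CACHE_ABSENT;
    }

    /* Anything else (EACCES, EIO, stale mount, ...) is worth an error. */
    fprintf(stderr, "ERROR: cannot open %s: %s\n", file_name, strerror(err));
    return CACHE_ERROR;
}
```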

-- Yevgeny


From kliteyn at dev.mellanox.co.il  Mon May 14 07:08:37 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 14 May 2007 17:08:37 +0300
Subject: [ofa-general] Re: [PATCH] osm: fat-tree optimization - improved
	ranking
In-Reply-To: <1179151362.1540.177796.camel@hal.voltaire.com>
References: <46485DE7.1050506@dev.mellanox.co.il>
	<1179151362.1540.177796.camel@hal.voltaire.com>
Message-ID: <46486D65.5000304@dev.mellanox.co.il>

Hal Rosenstock wrote:
> Hi Yevgeny,
> 
> On Mon, 2007-05-14 at 09:02, Yevgeny Kliteynik wrote:
>> Hi Hal,
>>
>> This patch optimizes fabric ranking.
>> All the leaf switches are marked with rank and added to the BFS list,
>> and only then ranking of rest of the fabric begins.
>>
>> I actually thought that this is the way I've originally
>> implemented it, as I mentioned in the patch that was dealing 
>> with 8 and 16 bit integers :)
>>
>> Similar optimization may be applicable to up/dn routing - the roots
>> should be marked with rank 0 and only then ranking of rest of the 
>> switches should begin, but frankly, it practically doesn't reduce
>> the routing time, because ranking is only a small fraction of the 
>> routing runtime (I checked it on a 4K+ subnet).
> 
> It's still worth doing IMO. Can you look into this for up/down ?

Sure.

--Yevgeny

> 
>> In case of fat-tree I'm going to need it anyway when I enhance
>> the routing to consider only subset of HCAs in the routing balancing
>> (compute nodes vs. management nodes).
>>
>> Please apply to master.
>>
>> -- Yevgeny
>>
>> Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>>
>> >From dfa455f86d9ac48ff5cefd38a009718e5aeab1f9 Mon Sep 17 00:00:00 2001
>> From: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> Date: Mon, 14 May 2007 15:45:00 +0300
>> Subject: [PATCH] DELETE AFTER UPDATE: ranking
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> 
> Thanks. Applied (to master only).
> 
> -- Hal
> 
> 


From mst at dev.mellanox.co.il  Mon May 14 07:14:50 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 14 May 2007 17:14:50 +0300
Subject: [ofa-general] [PATCH] IB/mthca: fix cq cleanup
Message-ID: <20070514141450.GB7989@mellanox.co.il>

mthca_cq_clean updated CQ consumer index without moving CQEs
to HW ownership. As a result, the same WRID might get reported twice,
resulting in use-after-free. This was observed in IPoIB CM.
Fix by moving all freed CQEs to HW ownership.
This fixes this bug: https://bugs.openfabrics.org/show_bug.cgi?id=617

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---
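Not part of the commit: a standalone sketch of why the ownership flag
matters. It models a tiny CQ ring in plain C; if freed entries were left
software-owned, the consumer would see the stale entry again once the index
wraps. The names mirror the driver loosely, but RING_SIZE, OWNER_HW,
release_cqes() and poll_one() are assumptions made for this illustration,
and there is no real barrier or doorbell here.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u      /* must be a power of two */
#define OWNER_HW  0x80u   /* hypothetical "owned by hardware" flag */

struct cqe {
    uint8_t  owner;
    uint32_t wrid;
};

struct cq {
    struct cqe ring[RING_SIZE];
    uint32_t   cons_index;   /* free-running, masked on access */
};

static struct cqe *get_cqe(struct cq *cq, uint32_t index)
{
    return &cq->ring[index & (RING_SIZE - 1)];
}

/* Software may consume an entry only while hardware does not own it. */
static int cqe_is_sw(const struct cqe *cqe)
{
    return !(cqe->owner & OWNER_HW);
}

/* Release nfreed entries starting at the consumer index. */
static void release_cqes(struct cq *cq, uint32_t nfreed)
{
    assert(nfreed <= RING_SIZE);

    /* The fix: hand every freed entry back to hardware first. */
    for (uint32_t i = 0; i < nfreed; ++i)
        get_cqe(cq, cq->cons_index + i)->owner = OWNER_HW;

    /* A real driver needs a write barrier here, before publishing. */
    cq->cons_index += nfreed;
}

/* Poll one completion: returns 1 and stores its WRID, or 0 if none. */
static int poll_one(struct cq *cq, uint32_t *wrid)
{
    struct cqe *cqe = get_cqe(cq, cq->cons_index);

    if (!cqe_is_sw(cqe))
        return 0;
    *wrid = cqe->wrid;
    release_cqes(cq, 1);
    return 1;
}
```

Dropping the loop in release_cqes() reproduces the reported symptom in this
model: after the consumer index wraps, poll_one() returns an old WRID a
second time.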

Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_cq.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_cq.c	2007-05-14 14:22:58.000000000 +0300
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_cq.c	2007-05-14 14:42:05.000000000 +0300
@@ -284,7 +284,7 @@ void mthca_cq_clean(struct mthca_dev *de
 {
 	struct mthca_cqe *cqe;
 	u32 prod_index;
-	int nfreed = 0;
+	int i, nfreed = 0;
 
 	spin_lock_irq(&cq->lock);
 
@@ -321,6 +321,8 @@ void mthca_cq_clean(struct mthca_dev *de
 	}
 
 	if (nfreed) {
+		for (i = 0; i < nfreed; ++i)
+			set_cqe_hw(get_cqe(cq, (cq->cons_index + i) & cq->ibcq.cqe));
 		wmb();
 		cq->cons_index += nfreed;
 		update_cons_index(dev, cq, nfreed);

-- 
MST


From mst at dev.mellanox.co.il  Mon May 14 07:15:14 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 14 May 2007 17:15:14 +0300
Subject: [ofa-general] [PATCH] libmthca: fix cq cleanup
Message-ID: <20070514141514.GC7989@mellanox.co.il>

mthca_cq_clean updated CQ consumer index without moving CQEs
to HW ownership. As a result, the same WRID might get reported twice,
resulting in use-after-free. This was observed in IPoIB CM.
Fix by moving all freed CQEs to HW ownership.
This fixes this bug: https://bugs.openfabrics.org/show_bug.cgi?id=617

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

diff --git a/src/cq.c b/src/cq.c
index 0aeb7a9..ecaed9c 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -550,7 +546,7 @@ void mthca_cq_clean(struct mthca_cq *cq, uint32_t qpn, struct mthca_srq *srq)
 {
 	struct mthca_cqe *cqe;
 	uint32_t prod_index;
-	int nfreed = 0;
+	int i, nfreed = 0;
 
 	pthread_spin_lock(&cq->lock);
 
@@ -584,6 +580,8 @@ void mthca_cq_clean(struct mthca_cq *cq, uint32_t qpn, struct mthca_srq *srq)
 	}
 
 	if (nfreed) {
+		for (i = 0; i < nfreed; ++i)
+			set_cqe_hw(get_cqe(cq, (cq->cons_index + i) & cq->ibv_cq.cqe));
 		wmb();
 		cq->cons_index += nfreed;
 		update_cons_index(cq, nfreed);


-- 
MST


From mst at dev.mellanox.co.il  Mon May 14 07:22:33 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 14 May 2007 17:22:33 +0300
Subject: [ofa-general] weird kconfig output
Message-ID: <20070514142233.GD7989@mellanox.co.il>

Doing make oldconfig on 2.6.22-rc1 (.config came from 2.6.21)
gave me this prompt, among the list of 10G/s adapters:
  Verbose debugging output (MLX4_DEBUG) [Y/n/?] (NEW)

Shouldn't I get prompted for mlx4 eth first?

-- 
MST


From halr at voltaire.com  Mon May 14 07:21:53 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 10:21:53 -0400
Subject: [ofa-general] Re: Error message in OSM log when cached op file
	doesn't exist
In-Reply-To: <46486D1E.6010408@dev.mellanox.co.il>
References: <46486D1E.6010408@dev.mellanox.co.il>
Message-ID: <1179152459.1540.178811.camel@hal.voltaire.com>

Hi Yevgeny,

On Mon, 2007-05-14 at 10:07, Yevgeny Kliteynik wrote:
> Hi Hal.
> 
> [snip]
> > Date:   03/30/2007 12:24:12 AM
> > OpenSM: Handle conf file open failures better
> >     
> > diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
> > index 46315a5..746fbd1 100644
> > --- a/osm/opensm/osm_subnet.c
> > +++ b/osm/opensm/osm_subnet.c
> > @@ -732,7 +732,7 @@ subn_dump_qos_options(
> >  
> >  /**********************************************************************
> >   **********************************************************************/
> > -void
> > +ib_api_status_t
> >  osm_subn_rescan_conf_file(
> >    IN osm_subn_opt_t* const p_opts )
> >  {
> > @@ -751,7 +751,7 @@ osm_subn_rescan_conf_file(
> >    
> >    opts_file = fopen(file_name, "r");
> >    if (!opts_file)
> > -    return;
> > +    return IB_ERROR;
> [/snip]
> 
> This patch was applied a month and a half ago (master).
> It handles opening cached options file, and prints error messages
> when OSM failed opening such file.
> 
> I actually don't like this thing, because now every time you run
> OpenSM on the machine that doesn't have any cached options file
> (which is usually the case) you get an error message.

Perhaps error is too severe as one can run just fine without this file
and there is no requirement to have it. Should it be some other type of
message instead ?

> There's no point checking whether the file exists, because osm runs
> as root, and if it fails opening this file, it means that the file
> doesn't exist or is inaccessible (broken mount, etc).

That's the most common use case (running OpenSM as root), but not the
only one.

> In any case, user gets info in stdout whether or not OpenSM is using
> cached options file.

Is there always a message in the log as well indicating this ?

-- Hal

> Do you agree? Should I issue a patch?
> 
> -- Yevgeny


From tziporet at dev.mellanox.co.il  Mon May 14 08:06:16 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 14 May 2007 18:06:16 +0300
Subject: [ofa-general] OFED 1.2 rc3 release
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com>
	<6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com>
Message-ID: <46487AE8.1020005@mellanox.co.il>

Hi,

OFED 1.2-RC3 is available on  _http://www.openfabrics.org/builds/ofed-1.2/_
File: OFED-1.2-rc3.tgz
To get BUILD_ID run ofed_info

Please report any issues in bugzilla _https://bugs.openfabrics.org/_

*RC4 due date is May 21*

Tziporet & Vlad

==================================================================================== 

*Release information:*

*OS support: *
Novell:
    - SLES 9.0 SP3
    - SLES10 (and SP1 RC2 partially tested)
Redhat:
    - Redhat EL4 up3, up4 and up5
    - Redhat EL5
kernel.org:
    - 2.6.20
    - 2.6.19

Note: Fedora C6 and SuSE Pro 10 are not part of the official list.
We keep the backport patches for these OSes and make sure OFED compiles
and loads properly, but we will not do a full QA cycle.

*Systems: *
    * x86_64
    * x86
    * ia64
    * ppc64

*Main changes from OFED-1.1-rc2:*

1. Fixed 49 bugs (see attachment for all bugs fixed)
2. Updated Open MPI to version 1.2.1
3. Added support for RHEL 4 up5
4. Updated documents (but not yet completed)

*Major limitations and known issues:*

567	blocker	rolandd at cisco.com	MPI does not work on RHEL5 ppc64
420	critical	monil at voltaire.com	PKey table reordering caused by SM failover stops ipoib traffic
607	critical	jsquyres at cisco.com	remove the hack to save the port number in the ia hca_address
608	critical	monis at voltaire.com	traffic fails to resume after SM failover with bonding interfaces
611	critical	swise at opengridcomputing.com	cxgb3: passive side connection transition from streaming to RDMA is broken
577	critical	rolandd at cisco.com	SRP multipath failover too slow (minutes, not seconds)
465	critical	mst at mellanox.co.il	IPoIB HA fails after several hours of failovers
549	critical	amip at dev.mellanox.co.il	SDP Policy need to be consistent
604	critical	mst at mellanox.co.il	Oops running UDP traffic with IPoIB CM
605	major	sean.hefty at intel.com	kernel oops in rdma_cm during module unload
614	major	halr at voltaire.com	All of the CM definitions should be removed from ib_types.h
534	major	vlad at mellanox.co.il	SLES9 - Installer fails on declarations - OFED 1.2-20070409
530	major	dannyz at mellanox.co.il	ibdiagnet -r fails on RHEL5 i686
538	major	monis at voltaire.com	integrate IPoIB bonding with IPoIB CM
541	major	mst at mellanox.co.il	slow failover with IPoIB CM bonding/ipoibtools HA
558	major	rolandd at cisco.com	tvflash configure fails on SLES10 SP1 RC2

See bugzilla for all open issues.

Tasks that should be completed for RC4 (due date is 21-May):

   1. Support SLES10 SP1 RC1
   3. Fix all blocker, critical and major bugs
   4. Prepare all documentation (release notes, README, etc.)


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070514/2149b9d3/attachment.html>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fixed_bugs-rc3.csv
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070514/2149b9d3/attachment.ksh>

From sean.hefty at intel.com  Mon May 14 08:11:53 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 14 May 2007 08:11:53 -0700
Subject: [ofa-general] RE: [Query] ib add path record cache
In-Reply-To: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
Message-ID: <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com>

>This can be treated as a facility similar to what we have in ARP table
>for TCP/IP. Secondly this will help in debugging of some new up-coming
>partially infiniband compliant hardware.

But unless such a path actually exists to the remote node, I don't see that it's
useful.  And if such a path exists, I would expect it to be returned by the SA.
Can you clarify its use more wrt the subnet in general?

>yes, I want them to remain in the DB, my idea is similar to the hard
>coding of ARP table entries in TCP/IP.
>How do you see this can be achieved?

A simple flag or setting the update counter on the added path to the maximum
should be sufficient.
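
As a rough sketch of the second option: reserve the maximum counter value
as a "never expires" marker and have the expiry pass skip it. The layout
below is purely hypothetical (struct path_entry, UPDATE_CNT_STATIC and
expire_stale() are made-up names, not the actual cache code), and it
assumes entries are stamped with the current update round when refreshed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: an entry with this counter was added by hand. */
#define UPDATE_CNT_STATIC UINT32_MAX

/* Hypothetical cache entry; a real path record carries far more fields. */
struct path_entry {
    uint16_t dlid;
    uint32_t update_cnt;   /* update round in which the entry was refreshed */
    int      valid;
};

/* Drop entries not refreshed in the current round; keep static ones.
 * Returns the number of entries dropped. */
static size_t expire_stale(struct path_entry *tbl, size_t n,
                           uint32_t current_cnt)
{
    size_t dropped = 0;

    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!tbl[i].valid || tbl[i].update_cnt == UPDATE_CNT_STATIC)
            continue;   /* empty slot, or pinned like a static ARP entry */
        if (tbl[i].update_cnt != current_cnt) {
            tbl[i].valid = 0;
            dropped++;
        }
    }
    return dropped;
}
```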

- Sean


From philippe.gregoire at cea.fr  Mon May 14 08:20:06 2007
From: philippe.gregoire at cea.fr (Philippe Gregoire)
Date: Mon, 14 May 2007 17:20:06 +0200
Subject: [ofa-general] suggested patch for partition membership definitiion
	in osm-partitions.conf
Message-ID: <46487E26.4040501@cea.fr>

Hi Hal,
the way to define partition membership for port GUIDs in the
osm-partitions.conf file is quite verbose, especially when you have a lot
of (full member) ports.

Here is a patch to allow a more compact partition membership definition.
It allows definition of a default membership for the port GUID list of a
partition. The old syntax is still usable.
old way :
G1 = 0x01 :  0x123=full, 0x124=full, 0x125=full, 0x126=full, 0x127=full ;
G1 = 0x01 :  0x128=full, 0x129=full, 0x567, 0x569=full

new way :
G1 = 0x01 , defmember=full : 0x123, 0x124, 0x125, 0x126, 0x127 ;
G1 = 0x01 , defmember=full :  0x128, 0x129, 0x567=limited, 0x569
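
The rule behind the new syntax can be sketched in a few lines of C: an
explicit "=full" or "=limited" on a port wins, otherwise the partition's
defmember applies. This is an illustration only, not the code from the
attached patch; resolve_membership() is a hypothetical helper, it expects
an entry with no surrounding whitespace, and it assumes the default without
defmember is limited, as in the old syntax.

```c
#include <assert.h>
#include <string.h>

enum membership { LIMITED, FULL };

/* Resolve one port entry such as "0x567=limited" or "0x123" against the
 * partition-wide default given by defmember. */
static enum membership resolve_membership(const char *entry,
                                          enum membership def)
{
    const char *eq;

    assert(entry != NULL);

    eq = strchr(entry, '=');
    if (!eq)
        return def;                 /* no explicit flag: use defmember */
    if (strcmp(eq + 1, "full") == 0)
        return FULL;
    if (strcmp(eq + 1, "limited") == 0)
        return LIMITED;
    return def;   /* unknown flag: a real parser would report an error */
}
```

So on a "defmember=full" line, "0x567=limited" stays limited while a bare
"0x123" becomes a full member.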

I also changed the opensm man page, as some lines (around limited/full
membership) are not well formatted.

This patch has been compiled and tested on our cluster with the 
following osm-partitions.conf :
G1  = 0x0001 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9, 
0x0005ad00000168ad, 0x0005ad0000000cb7=limited, 0x0008f10403962eb1 ;
G2  = 0x0002 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ;
G3  = 0x0003 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9, 
0x0008f10403962eb1 ;
G5  = 0x0005 , defmember=full : 0x0008f10403962eb1 ;
G10 = 0x0010 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ;
G70 = 0x0070 , defmember=full : 0x0005ad00000165f1 ;
G80 = 0x0080 , defmember=full : 0x0005ad00000165f1;
G80 = 0x0080 : 0x0005ad00000168ad;
G80 = 0x0080 , defmember=full : 0x0005ad0000016cb9;
G80 = 0x0080 , defmember=limited : 0x0005ad0000000cb7, 0x0008f10403962eb1;

Philippe

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: defmember.patch
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070514/d97cd743/attachment.ksh>

From philippe.gregoire at cea.fr  Mon May 14 08:26:55 2007
From: philippe.gregoire at cea.fr (Philippe Gregoire)
Date: Mon, 14 May 2007 17:26:55 +0200
Subject: [ofa-general] suggested patch for partition membership definitiion
 in osm-partitions.conf (fix)
Message-ID: <46487FBF.7020300@cea.fr>

This time, with the definitive patch (sorry).

Hi Hal,
the way to define partition membership for port GUIDs in the
osm-partitions.conf file is quite verbose, especially when you have a lot
of (full member) ports.

Here is a patch to allow a more compact partition membership definition.
It allows definition of a default membership for the port GUID list of a
partition. The old syntax is still usable.
old way :
G1 = 0x01 :  0x123=full, 0x124=full, 0x125=full, 0x126=full, 0x127=full ;
G1 = 0x01 :  0x128=full, 0x129=full, 0x567, 0x569=full

new way :
G1 = 0x01 , defmember=full : 0x123, 0x124, 0x125, 0x126, 0x127 ;
G1 = 0x01 , defmember=full :  0x128, 0x129, 0x567=limited, 0x569
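
To make the new syntax concrete, here is a small stand-alone C sketch (not
code from the patch) of how a port GUID list could be resolved against a
partition-wide default. `struct member` and `parse_guid_list` are names
invented for this illustration, and the sketch matches "full"/"limited"
exactly, which is stricter than the strncmp-based comparison the patch uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result record; not an OpenSM type. */
struct member {
	uint64_t guid;
	int full;	/* 1 = full member, 0 = limited member */
};

/*
 * Parse a comma-separated port GUID list such as
 * "0x128, 0x129, 0x567=limited, 0x569".
 * def_full is the partition-wide default (defmember); an explicit
 * "=full" or "=limited" on an entry overrides it.
 * Returns the number of entries stored in out, or -1 on a bad entry.
 */
int parse_guid_list(const char *list, int def_full,
		    struct member *out, int max)
{
	char buf[256];
	char *tok, *next;
	int n = 0;

	assert(list && out);
	if (strlen(list) >= sizeof(buf))
		return -1;
	strcpy(buf, list);

	for (tok = buf; tok; tok = next) {
		char *eq, *end;
		int full = def_full;

		/* cut the current entry off at the next comma */
		next = strchr(tok, ',');
		if (next)
			*next++ = '\0';
		if (n >= max)
			return -1;

		eq = strchr(tok, '=');
		if (eq) {
			*eq++ = '\0';
			eq += strspn(eq, " \t");
			eq[strcspn(eq, " \t;")] = '\0';
			if (!strcmp(eq, "full"))
				full = 1;
			else if (!strcmp(eq, "limited"))
				full = 0;
			else
				return -1;
		}

		/* base 0 accepts both 0x-prefixed hex and decimal */
		out[n].guid = strtoull(tok, &end, 0);
		if (end == tok)
			return -1;
		out[n].full = full;
		n++;
	}
	return n;
}
```

With def_full = 1, the second "new way" line above would yield 0x128, 0x129
and 0x569 as full members and 0x567 as a limited member.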

I also changed the opensm man page, as some lines (around limited/full
membership) are not well formatted.

This patch has been compiled and tested on our cluster with the
following osm-partitions.conf :
G1  = 0x0001 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9,
0x0005ad00000168ad, 0x0005ad0000000cb7=limited, 0x0008f10403962eb1 ;
G2  = 0x0002 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ;
G3  = 0x0003 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9,
0x0008f10403962eb1 ;
G5  = 0x0005 , defmember=full : 0x0008f10403962eb1 ;
G10 = 0x0010 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ;
G70 = 0x0070 , defmember=full : 0x0005ad00000165f1 ;
G80 = 0x0080 , defmember=full : 0x0005ad00000165f1;
G80 = 0x0080 : 0x0005ad00000168ad;
G80 = 0x0080 , defmember=full : 0x0005ad0000016cb9;
G80 = 0x0080 , defmember=limited : 0x0005ad0000000cb7, 0x0008f10403962eb1;

Philippe


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: defmember.patch
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070514/9df93ac1/attachment.ksh>

From monil at voltaire.com  Mon May 14 08:44:02 2007
From: monil at voltaire.com (Moni Levy)
Date: Mon, 14 May 2007 17:44:02 +0200
Subject: [ofa-general] Re: [ewg] OFED 1.2 rc3 release
In-Reply-To: <46487AE8.1020005@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com>
	<6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com>
	<46487AE8.1020005@mellanox.co.il>
Message-ID: <6a122cc00705140844k7f65a988x8746c4ea8474b920@mail.gmail.com>

On 5/14/07, Tziporet Koren <tziporet at dev.mellanox.co.il> wrote:

>
> *Major limitations and known issues:*
>
> 567 blocker  rolandd at cisco.com    MPI does not work on RHEL5 ppc64
> 420 critical monil at voltaire.com   PKey table reordering caused by SM failover stops ipoib traffic
>


 Tziporet, bug #420 was fixed and bugzilla was updated this morning

Moni


> 607 critical jsquyres at cisco.com          remove the hack to save the port number in the ia hca_address
> 608 critical monis at voltaire.com         traffic fails to resume after SM failover with bonding interfaces
> 611 critical swise at opengridcomputing.com cxgb3: passive side connection transition from streaming to RDMA is broken
> 577 critical rolandd at cisco.com          SRP multipath failover too slow (minutes, not seconds)
> 465 critical mst at mellanox.co.il         IPoIB HA fails after several hours of failovers
> 549 critical amip at dev.mellanox.co.il    SDP Policy need to be consistent
> 604 critical mst at mellanox.co.il         Oops running UDP traffic with IPoIB CM
> 605 major    sean.hefty at intel.com       kernel oops in rdma_cm during module unload
> 614 major    halr at voltaire.com          All of the CM definitions should be removed from ib_types.h
> 534 major    vlad at mellanox.co.il        SLES9 - Installer fails on declarations - OFED 1.2-20070409
> 530 major    dannyz at mellanox.co.il      ibdiagnet -r fails on RHEL5 i686
> 538 major    monis at voltaire.com         integrate IPoIB bonding with IPoIB CM
> 541 major    mst at mellanox.co.il         slow failover with IPoIB CM bonding/ipoibtools HA
> 558 major    rolandd at cisco.com          tvflash configure fails on SLES10 SP1 RC2
>

From halr at voltaire.com  Mon May 14 08:51:22 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 11:51:22 -0400
Subject: [ofa-general] Re: suggested patch for partition membership
	definitiion in osm-partitions.conf (fix)
In-Reply-To: <46487FBF.7020300@cea.fr>
References: <46487FBF.7020300@cea.fr>
Message-ID: <1179157835.1540.183713.camel@hal.voltaire.com>

Hi Philippe,

On Mon, 2007-05-14 at 11:26, Philippe Gregoire wrote:
> This time , with the definitive patch (sorry)

Can you resubmit this with a S-O-B line ?

> Hi Hal,
> the way to define in osm-partitions.conf file  partition membership for
> port guids is quite very verbose,
> specially when you have a lot of (full member) ports.

or lots of limited members, either way. This is an improvement in the
allowed syntax.

> Here is a patch to allow a more compact partition membership definition.
> It allows definition of a default
> membership partition for the port guid list. The old syntax is still usable.
> old way
> G1 = 0x01 :  0x123=full, 0x124=full, 0x125=full, 0x126=full, 0x127=full ;
> G1 = 0x01 :  0x128=full, 0x129=full, 0x567, 0x569=full
> 
> new way :
> G1 = 0x01 , defmember=full : 0x123, 0x124, 0x125, 0x126, 0x127 ;
> G1 = 0x01 , defmember=full :  0x128, 0x129, 0x567=limited, 0x569
> 
> I also changed the opensm man page, as some lines (around limited/full
> membership) are not well formatted.

Can you break this piece into 2 parts: fix formatting, and then add
defmember ?

> This patch has been compiled and tested on our cluster with the
> following osm-partitions.conf :
> G1  = 0x0001 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9,
> 0x0005ad00000168ad, 0x0005ad0000000cb7=limited, 0x0008f10403962eb1 ;
> G2  = 0x0002 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ;
> G3  = 0x0003 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9,
> 0x0008f10403962eb1 ;
> G5  = 0x0005 , defmember=full : 0x0008f10403962eb1 ;
> G10 = 0x0010 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ;
> G70 = 0x0070 , defmember=full : 0x0005ad00000165f1 ;
> G80 = 0x0080 , defmember=full : 0x0005ad00000165f1;
> G80 = 0x0080 : 0x0005ad00000168ad;
> G80 = 0x0080 , defmember=full : 0x0005ad0000016cb9;
> G80 = 0x0080 , defmember=limited : 0x0005ad0000000cb7, 0x0008f10403962eb1;

Thanks.

-- Hal

> Philippe
> 
> 
> 
> ______________________________________________________________________
> 
> --- opensm/osm_prtn_config.old.c	2007-04-18 11:54:29.000000000 +0200
> +++ opensm/osm_prtn_config.c	2007-05-14 17:14:42.228813361 +0200
> @@ -70,6 +70,7 @@
>  	osm_subn_t *p_subn;
>  	osm_prtn_t *p_prtn;
>  	unsigned    is_ipoib, mtu, rate, sl, scope;
> +	boolean_t   full;
>  };
>  
>  extern osm_prtn_t *osm_prtn_make_new(osm_log_t *p_log, osm_subn_t *p_subn,
> @@ -163,6 +164,14 @@
>  				" - skipped\n", lineno);
>  		else
>  			conf->sl = sl;
> +	} else if (!strncmp(flag, "defmember", len)) {
> +		if (!val || (strcmp(val, "limited") && strcmp(val, "full")))
> +			osm_log(conf->p_log, OSM_LOG_VERBOSE,
> +				"PARSE WARN: line %d: "
> +				"flag \'defmember\' requires valid value (limited or full)"
> +				" - skipped\n", lineno);
> +		else
> +			conf->full = strcmp(val, "full") == 0;
>  	} else {
>  			osm_log(conf->p_log, OSM_LOG_VERBOSE,
>  					  "PARSE WARN: line %d: "
> @@ -177,12 +186,14 @@
>  {
>  	osm_prtn_t *p = conf->p_prtn;
>  	ib_net64_t guid;
> -	boolean_t full = FALSE;
> +	boolean_t full = conf->full;
>  
>  	if (!name || !*name || !strncmp(name, "NONE", strlen(name)))
>  		return 0;
>  
>  	if (flag) {
> +		/* reset default membership to limited */
> +		full = FALSE;
>  		if (!strncmp(flag, "full", strlen(flag)))
>  			full = TRUE;
>  		else if (strncmp(flag, "limited", strlen(flag))) {
> @@ -275,6 +286,7 @@
>  	conf->p_prtn = NULL;
>  	conf->is_ipoib = 0;
>  	conf->sl = OSM_DEFAULT_SL;
> +	conf->full = FALSE;
>  	return conf;
>  }
>  
> --- man/opensm.8.old	2007-04-18 11:54:29.000000000 +0200
> +++ man/opensm.8	2007-05-14 16:19:11.747555126 +0200
> @@ -291,13 +291,15 @@
>  
>  Partition Definition:
>  
> -[PartitionName][=PKey][,flag[=value]]
> +[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
>  
>   PartitionName - string, will be used with logging. When omitted
>                   empty string will be used.
>   PKey          - P_Key value for this partition. Only low 15 bits will
>                   be used. When omitted will be autogenerated.
>   flag          - used to indicate IPoIB capability of this partition.
> + defmember=full|limited - specifies default membership for port guid. 
> +                 Default is limited.
>  
>  Currently recognized flags are:
>  
> @@ -317,10 +319,10 @@
>  
>  PortGUIDs list:
>  
> -PortGUID     - GUID of partition member EndPort. Hexadecimal numbers
> -               should start from 0x, decimal numbers are accepted too.
> -full or      - indicates full or limited membership for this port. When
> -  limited      omitted (or unrecognized) limited membership is assumed.
> + PortGUID         - GUID of partition member EndPort. Hexadecimal numbers
> +                   should start from 0x, decimal numbers are accepted too.
> + full or limited  - indicates full or limited membership for this port.
> +                   When omitted (or unrecognized) default (defmember) membership is assumed.
>  
>  There are two useful keywords for PortGUID definition:
>  
> @@ -346,12 +348,20 @@
>  
>  Examples:
>  
> -Default=0x7fff : ALL, SELF=full ;
> + Default=0x7fff : ALL, SELF=full ;
>  
> -NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
> + NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
>  
> -YetAnotherOne = 0x300 : SELF=full ;
> -YetAnotherOne = 0x300 : ALL=limited ;
> + YetAnotherOne = 0x300 : SELF=full ;
> + YetAnotherOne = 0x300 : ALL=limited ;
> +
> + ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
> + # 0x123453, 0x123454 will be limited
> + ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
> + # 0x123456, 0x123457 will be limited
> + ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full;
> + ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
> + ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
>  
>  Note:
>  
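
For reviewers, the per-port decision made by the C hunk above can be
restated as a small stand-alone function. This is a sketch of how I read the
patch, not OpenSM code: `resolve_membership` is a made-up name, and the
branch the hunk cuts off after the "limited" comparison is assumed to only
log a warning. Under that reading, an entry with an unrecognized flag ends up
limited regardless of defmember, which seems to differ from the proposed man
page wording ("default (defmember) membership is assumed"), so it may be
worth confirming which behaviour is intended.

```c
#include <assert.h>
#include <string.h>

/*
 * Resolve the membership of one port entry.
 * def_full: partition default set by defmember (1 = full, 0 = limited).
 * flag:     text after '=' on the port entry, or NULL when absent.
 * Returns 1 for full membership, 0 for limited.
 */
int resolve_membership(int def_full, const char *flag)
{
	int full = def_full;

	assert(def_full == 0 || def_full == 1);
	if (flag) {
		/* an explicit flag discards the partition default */
		full = 0;
		/* prefix match, as in the patch: "fu" counts as "full" */
		if (!strncmp(flag, "full", strlen(flag)))
			full = 1;
		/* anything else, recognized or not, stays limited */
	}
	return full;
}
```

Because the comparison length is strlen(flag), abbreviations such as "limi"
(used in the man page example) are accepted, and an empty flag string
compares equal to "full".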


From weiny2 at llnl.gov  Mon May 14 09:55:53 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 14 May 2007 09:55:53 -0700
Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager
In-Reply-To: <20070513195539.GH29746@sashak.voltaire.com>
References: <20070508184938.311b1c8f.weiny2@llnl.gov>
	<20070513195539.GH29746@sashak.voltaire.com>
Message-ID: <20070514095553.3ec3bec7.weiny2@llnl.gov>

On Sun, 13 May 2007 22:55:39 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi Ira,
> 
> Thanks for the great work!
> 
> On 18:49 Tue 08 May     , Ira Weiny wrote:
> > I would like to submit to the list a performance manager which I have been
> > working on for OpenSM.
> > 
> > It is implemented as the first proposed architecture model set forth by Hal (As
> > an integrated thread to OpenSM.)  As such it works fine on our small test
> > cluster but there is some concern about its scalability.
> > 
> > I have extended this architecture with an idea of my own.  This idea is to have
> > a plug-able module for the "event database".  With this interface one could
> > write their own Data reduction, logging, and tracking methods.  Here at LLNL I
> > propose to use this to add counter and subnet events directly to our management
> > database which is used to show system status to our operators.  Other
> > installations might prefer other methods of logging, SNMP for example.  This
> > patch includes a "reference" implementation of this "event database" which
> > stores the information internally until the user requests a "dump".
> 
> I like this event db idea, but not sure this should not be integral part
> of the low level perfmgr stuff - as it is currently implemented without
> such plugin loaded PerfMgr just doesn't work - this unconditionally tries
> to pull all ports counters, but has nothing to do with it without plugin.
> 
> Instead I would propose to have a builtin PerfMgr which will be able to
> pull and store performance related data and then to call "generic" event
> manager which can process such data. This also will help to have simpler
> generic API for such event db plugin so other parts of OpenSM will be
> able to report events using same method(s). What do you think?

This is a good idea.  I will think about how to make it work.

<snip>

> > +
> > +/**
> > + * group port counters for ports into the nodes
> > + */
> > +typedef struct _osm_pc_node {
> > +	cl_map_item_t  map_item; /* must be first */
> > +	uint64_t       node_guid;
> > +	osm_event_pc_t   *ports;
> > +	uint8_t        num_ports;
> > +} osm_pc_node_t;
> 
> Is it really needed to keep osm_pc_node_t nodes in separate db (qmap)?
> Why not to reuse already existed maps in osm_subn_t (we could add
> 'void *pm_data' or so field to osm_physp_t structure)?
> 

I did not want to complicate the SM data structures.  Also these structures
were part of the plugin.  This reference plugin used the compatibility lib qmap
structures to store the data.  But other plugins may use SQL or other data
stores.  I think I agree with Hal that we should keep these data structures
separate from the SM structures.

> > +
> > +/****s* OpenSM: PERFMGR/osm_perfmgr_state_t */
> > +typedef enum
> > +{
> > +  PERFMGR_STATE_DISABLE,
> > +  PERFMGR_STATE_ENABLED,
> > +  PERFMGR_STATE_NO_DB
> 
> Why PERFMGR_STATE_NO_DB is needed? Isn't is duplicated by
> (pm->db == NULL)?
> 
> As side effect of this duplication - now when DB was not found I can
> enable perfmgr with console command, but it obviously crashes during
> follow 'dump'.

Ah I did not catch that.

If we separate out the plugin to be generic with a perfmgr internal store, this
will go away.  I did add checks for NULL DB functions so that plugins could
decide to not receive some types of data, but this only makes sense with the
refactoring I did on the DB interface.
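
As a rough illustration of Sasha's point, the "no DB" condition can be
derived from the pointer itself instead of being tracked as a third state.
The types and function names below are trimmed-down stand-ins invented for
this sketch, not the ones from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the patch's types. */
typedef enum {
	PERFMGR_STATE_DISABLE,
	PERFMGR_STATE_ENABLED
} perfmgr_state_t;

struct perfmgr {
	perfmgr_state_t state;
	void *db;	/* NULL when no event database plugin is loaded */
};

/* Enabling is refused without a database, so no NO_DB state is needed. */
int perfmgr_enable(struct perfmgr *pm)
{
	assert(pm);
	if (!pm->db)
		return -1;
	pm->state = PERFMGR_STATE_ENABLED;
	return 0;
}

/* Dump only when enabled and there is a database to dump from. */
int perfmgr_dump(const struct perfmgr *pm)
{
	assert(pm);
	if (pm->state != PERFMGR_STATE_ENABLED || !pm->db)
		return -1;
	/* ... walk pm->db and print the counters ... */
	return 0;
}
```

With this shape the console "enable" followed by "dump" sequence fails
cleanly when the plugin was not found, rather than dereferencing a NULL DB.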

<snip>

> 
> > +} osm_perfmgr_state_t;
> > +
> > +/****s* OpenSM: PERFMGR/osm_perfmgr_t
> > +*  This object should be treated as opaque and should
> > +*  be manipulated only through the provided functions.
> > +*/
> > +typedef struct _osm_perfmgr
> > +{
> > +  osm_thread_state_t    thread_state;
> > +  cl_event_t            sig_sweep;
> > +  cl_thread_t           sweeper;
> > +  osm_subn_t           *subn;
> > +  osm_sm_t             *sm;
> > +  cl_plock_t           *lock;
> > +  osm_log_t            *log;
> > +  osm_mad_pool_t       *mad_pool;
> > +  atomic32_t            trans_id;
> 
> Do we need separate transaction id generator for PerfMgr? 

Probably not here but if we separate out the perfmgr we might.

<snip>

> > +
> > +/****f* OpenSM: PERFMGR/osm_perfmgr_init */
> > +ib_api_status_t
> > +osm_perfmgr_init(
> > +	osm_perfmgr_t* const perfmgr,
> > +	osm_subn_t* const subn,
> > +        osm_sm_t * const sm,
> > +	osm_log_t* const log,
> > +	osm_mad_pool_t * const mad_pool,
> > +	osm_vendor_t * const vendor,
> > +        cl_dispatcher_t* const disp,
> > +   	cl_plock_t* const lock,
> > +	const osm_subn_opt_t * const p_opt );
> 
> The indentation is not unified (tab character is preferred) here and in
> other places, also there are a lot of trailing white spaces in the patch.
> You can run 'git-diff --color' to see formatting issues.

Yes, sorry.  I have been trying to follow the new coding standard but I have
not done a great job.  Thanks for the git tip.

<snip>

> >  
> > +#ifdef ENABLE_OSM_PERF_MGR
> > +    case 1:
> > +      opt.perfmgr = TRUE;
> > +      break;
> > +    case 2:
> > +      opt.perfmgr_sweep_time_s = atoi(optarg);
> 
> In case of user error we can get opt.perfmgr_sweep_time_s = 0 (or another
> strange value), I think at least minimal verification is needed here.

Yes, good catch.  I am actually going to remove these from the command line
options.  I think one can control these better from the opensm.opts file.
There seem to be too many options which must be set for this to work correctly
right now.  Also what would you guys think of having a separate perfmgr config
file?  I am not sure about that idea.  On one hand it helps to keep the
opensm.opts file clean but on the other hand it means the user has to deal with
another config file.  :-/
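
Wherever the value ends up being read from (command line or options file),
the minimal verification could look roughly like the sketch below. It uses
strtoul instead of atoi so that garbage and zero can be told apart;
`parse_sweep_time` and its max_s limit are assumptions made for this
example, not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a sweep interval in seconds.
 * Returns 0 and stores the value on success; returns -1 on empty
 * input, trailing garbage, zero, or a value above max_s.
 */
int parse_sweep_time(const char *arg, unsigned max_s, unsigned *out)
{
	char *end;
	unsigned long v;

	assert(arg && out);
	errno = 0;
	v = strtoul(arg, &end, 10);
	if (errno || end == arg || *end != '\0')
		return -1;
	/* a negative input wraps to a huge value and is rejected here */
	if (v == 0 || v > max_s)
		return -1;
	*out = (unsigned)v;
	return 0;
}
```

On failure the caller can keep the built-in default and log a warning,
instead of silently sweeping with an interval of 0.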


<snip>

> > +
> > +/**********************************************************************
> > + * Process errors from the MAD send.
> > + **********************************************************************/
> > +static void
> > +osm_perfmgr_mad_send_err_callback(void* bind_context, osm_madw_t *p_madw)
> > +{
> > +	osm_perfmgr_t *pm = (osm_perfmgr_t *)bind_context;
> > +	osm_madw_context_t *context = &(p_madw->context);
> > +	
> > +	OSM_LOG_ENTER( pm->log, osm_pm_mad_send_err_callback );
>                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Ditto (the same for another perfmgr functions)

Yep, I already caught these, thanks.

<snip>

> > +	
> > +	OSM_LOG_ENTER( pm->log, __osm_pm_query_counters );
> > +	
> > +	memcpy(node_desc, p_node->node_desc.description,
> > +			IB_NODE_DESCRIPTION_SIZE);
> > +	node_desc[IB_NODE_DESCRIPTION_SIZE-1] = '\0';
> 
> We have null terminated 'print_desc' field in osm_node_t structure

Yea, I put it there, I should have known that ;-)  I changed this already...

<snip>

Thanks for the comments.  I will get a framework done for the general event
plugin...  I do agree that would be better for other types of events.  My
original idea was to have these events reported to the "perfmgr".  But that is
somewhat invasive on the perfmgr object.

I am not sure what the best way to do this is at the moment.  I have cleaned up
the DB interface quite a bit, including making it more generic.  So I think
this might fit in nicely.  I can reissue a patch if you would like to see it.
Or I can just submit the header file to see the interface.
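
To give a feel for the direction, here is a rough sketch of what a generic
event plugin interface with optional callbacks might look like. Every name
in it (`event_plugin`, `event_report`, the event ids) is a placeholder made
up for this mail; the actual header may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event kinds; the real set would come from OpenSM. */
enum event_id { EVENT_PORT_COUNTERS, EVENT_TRAP, EVENT_MAX };

/* A plugin fills in only the callbacks it cares about. */
struct event_plugin {
	void *ctx;
	void (*report)(void *ctx, enum event_id id, uint64_t node_guid,
		       uint8_t port, const void *data);
	void (*dump)(void *ctx);
};

/* Deliver one event; returns 1 if a plugin consumed it, 0 otherwise. */
int event_report(const struct event_plugin *p, enum event_id id,
		 uint64_t node_guid, uint8_t port, const void *data)
{
	assert(id < EVENT_MAX);
	if (!p || !p->report)
		return 0;	/* no plugin, or it opted out of events */
	p->report(p->ctx, id, node_guid, port, data);
	return 1;
}

/* Example plugin callback: just counts the events it receives. */
void count_report(void *ctx, enum event_id id, uint64_t node_guid,
		  uint8_t port, const void *data)
{
	(void)id; (void)node_guid; (void)port; (void)data;
	++*(unsigned *)ctx;
}
```

The NULL checks are what let a plugin decline some kinds of data, and the
same entry point could be called from other parts of OpenSM, not only from
the perfmgr.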

Thanks,
Ira


From vuhuong at mellanox.com  Mon May 14 09:55:43 2007
From: vuhuong at mellanox.com (Vu Pham)
Date: Mon, 14 May 2007 09:55:43 -0700
Subject: [ofa-general] [SRPT]multiple initiators supported?
In-Reply-To: <7b2fa1820705120247t1b232345w8373bb72416c5b28@mail.gmail.com>
References: <7b2fa1820705120247t1b232345w8373bb72416c5b28@mail.gmail.com>
Message-ID: <4648948F.5000802@mellanox.com>

Ian Jiang wrote:
> Does the SRP target support multiple initiators?

Yes, it does.


> I am using the SRP initiator and IB drivers in linux-2.6.20.
> The SRP target is at
> http://www.openfabrics.org/git/?p=~vu/srpt.git;a=summary
> and the IB driver is OFED-1.1 with linux-2.6.16.13-4-default of Suse-10.1.
> 
> 


From weiny2 at llnl.gov  Mon May 14 10:02:24 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 14 May 2007 10:02:24 -0700
Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager
In-Reply-To: <1179140285.1540.167239.camel@hal.voltaire.com>
References: <20070508184938.311b1c8f.weiny2@llnl.gov>
	<20070513195539.GH29746@sashak.voltaire.com>
	<1179140285.1540.167239.camel@hal.voltaire.com>
Message-ID: <20070514100224.7c399438.weiny2@llnl.gov>

On 14 May 2007 06:58:34 -0400
Hal Rosenstock <halr at voltaire.com> wrote:

> On Sun, 2007-05-13 at 15:55, Sasha Khapyorsky wrote:
> > Hi Ira,
> > 
> > Thanks for the great work!
> 
> Indeed :-)
> 
> > On 18:49 Tue 08 May     , Ira Weiny wrote:
> > > I would like to submit to the list a performance manager which I have been
> > > working on for OpenSM.
> > > 
> > > It is implemented as the first proposed architecture model set forth by Hal (As
> > > an integrated thread to OpenSM.)  As such it works fine on our small test
> > > cluster but there is some concern about its scalability.
> > > 
> > > I have extended this architecture with an idea of my own.  This idea is to have
> > > a plug-able module for the "event database".  With this interface one could
> > > write their own Data reduction, logging, and tracking methods.  Here at LLNL I
> > > propose to use this to add counter and subnet events directly to our management
> > > database which is used to show system status to our operators.  Other
> > > installations might prefer other methods of logging, SNMP for example.  This
> > > patch includes a "reference" implementation of this "event database" which
> > > stores the information internally until the user requests a "dump".
> > 
> > I like this event db idea, but not sure this should not be integral part
> > of the low level perfmgr stuff - as it is currently implemented without
> > such plugin loaded PerfMgr just doesn't work - this unconditionally tries
> > to pull all ports counters, but has nothing to do with it without plugin.
> > 
> > Instead I would propose to have a builtin PerfMgr which will be able to
> > pull and store performance related data and then to call "generic" event
> > manager which can process such data. This also will help to have simpler
> > generic API for such event db plugin so other parts of OpenSM will be
> > able to report events using same method(s). What do you think?
> 
> Sounds better to me. Ira ?

Yes, except that I am concerned with storing the data in the perfmgr as well as
the plugin.  But I like the idea of a more generic plugin for getting events
from OSM.  My mind is already full of ideas after responding to Sasha...  ;-)

<snip>

> > > +
> > > +/**
> > > + * group port counters for ports into the nodes
> > > + */
> > > +typedef struct _osm_pc_node {
> > > +	cl_map_item_t  map_item; /* must be first */
> > > +	uint64_t       node_guid;
> > > +	osm_event_pc_t   *ports;
> > > +	uint8_t        num_ports;
> > > +} osm_pc_node_t;
> > 
> > Is it really needed to keep osm_pc_node_t nodes in separate db (qmap)?
> > Why not to reuse already existed maps in osm_subn_t (we could add
> > 'void *pm_data' or so field to osm_physp_t structure)?
> 
> My one concern would be evolving the PerfMgr. This is better now but is
> this better when the PerfMgr is separated from the SM functionality ? I
> know there are other things to untangle to get there.
> 

I fully agree.  I don't think we want to intertwine the SM structures with the
PerfMgr structures.  BTW in the new code I have this is named _db_node_t.

Ira


From amitk at mellanox.co.il  Mon May 14 10:59:58 2007
From: amitk at mellanox.co.il (Amit Krig)
Date: Mon, 14 May 2007 20:59:58 +0300
Subject: [ofa-general] RE: [PATCH] ipoib/cm: make stale task actually run
	once in a while (DOES NOT HELP)
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038217F3@xmb-sjc-216.amer.cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C90BFCB3@mtlexch01.mtl.com><6C2C79E72C305246B504CBA17B5500C9076E27@mtlexch01.mtl.com><A15335FBE9BD2449AF2C9EF3D1EB8EA3037668FC@xmb-sjc-216.amer.cisco.com>
	<20070507200315.GD22341@mellanox.co.il>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3038217F3@xmb-sjc-216.amer.cisco.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C90179C4AA@mtlexch01.mtl.com>

Still failing in our test as well

Amit

-----Original Message-----
From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
Sent: Saturday, May 12, 2007 1:32 AM
To: Michael S. Tsirkin; Scott Weitzenkamp (sweitzen)
Cc: Yohad Dickman; Amit Krig; Tziporet Koren; Michael S. Tsirkin;
general at lists.openfabrics.org; Roland Dreier
Subject: RE: [PATCH] ipoib/cm: make stale task actually run once in a
while (DOES NOT HELP)
Importance: High

This patch, which is in OFED-1.2-20070511-0600, does NOT help.  I am
still seeing 105-second port failover times.  Amit, did you try it?

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il]
> Sent: Monday, May 07, 2007 1:03 PM
> To: Scott Weitzenkamp (sweitzen)
> Cc: Yohad Dickman; Amit Krig; Tziporet Koren; mst at mellanox.co.il; 
> general at lists.openfabrics.org; Roland Dreier
> Subject: [PATCH] ipoib/cm: make stale task actually run once in a 
> while
> 
> In the presence of some active passive connections, stale task would 
> never run, since each 4 RX CQEs we repeat queue_delayed_work calls 
> which delays it for some 10 minutes.  As a result, on a noisy system 
> with failing ports, we slowly run out of resources - slowing 
> connection setup down and eventually failing.
> 
> What we actually want to do is - start stale task when a first passive
> connection is added, rerun it every 10 min as long as there are
> outstanding passive connections.
> 
> As a happy side effect, this removes some code from RX data path.
> 
> Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
> 
> ---
> 
> Scott, I think this might address bugs 541 and 465: slow IPoIB CM HA 
> failover and eventual failing IPoIB HA. Could you test this please?
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> index 2b242a4..b77e8d7 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> @@ -258,10 +258,11 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
>  	cm_id->context = p;
>  	p->jiffies = jiffies;
>  	spin_lock_irqsave(&priv->lock, flags);
> +	if (list_empty(&priv->cm.passive_ids))
> +		queue_delayed_work(ipoib_workqueue,
> +				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  	list_add(&p->list, &priv->cm.passive_ids);
>  	spin_unlock_irqrestore(&priv->lock, flags);
> -	queue_delayed_work(ipoib_workqueue,
> -			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  	return 0;
>  
>  err_rep:
> @@ -380,8 +381,6 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
>  			if (!list_empty(&p->list))
>  				list_move(&p->list, &priv->cm.passive_ids);
>  			spin_unlock_irqrestore(&priv->lock, flags);
> -			queue_delayed_work(ipoib_workqueue,
> -					   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  		}
>  	}
>  
> @@ -1104,6 +1103,10 @@ static void ipoib_cm_stale_task(struct work_struct *work)
>  		kfree(p);
>  		spin_lock_irqsave(&priv->lock, flags);
>  	}
> +
> +	if (!list_empty(&priv->cm.passive_ids))
> +		queue_delayed_work(ipoib_workqueue,
> +				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  	spin_unlock_irqrestore(&priv->lock, flags);
>  }
>  
> --
> MST
> 
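
For anyone following the thread without the driver source at hand, the
scheduling policy described in the quoted patch (arm the stale task when the
first passive connection is added, re-arm it from the task itself while
connections remain) can be modelled in user space as below. This is only a
toy model with invented names; it does not call or imitate any kernel API.

```c
#include <assert.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct cm_model {
	int passive;	/* entries on the passive_ids list */
	int armed;	/* 1 while the stale task is scheduled */
};

/* New passive connection: arm only on the empty -> non-empty edge. */
void model_add_passive(struct cm_model *m)
{
	assert(m);
	if (m->passive == 0)
		m->armed = 1;
	m->passive++;
}

/* Stale task body: reap n_stale entries, then re-arm while work remains. */
void model_stale_task(struct cm_model *m, int n_stale)
{
	assert(m && n_stale >= 0 && n_stale <= m->passive);
	m->armed = 0;
	m->passive -= n_stale;
	if (m->passive > 0)
		m->armed = 1;
}
```

The invariant the model keeps is that the task is armed exactly when the
passive list is non-empty, and that RX completions never touch the timer.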


From sashak at voltaire.com  Mon May 14 11:24:00 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 14 May 2007 21:24:00 +0300
Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager
In-Reply-To: <1179140285.1540.167239.camel@hal.voltaire.com>
References: <20070508184938.311b1c8f.weiny2@llnl.gov>
	<20070513195539.GH29746@sashak.voltaire.com>
	<1179140285.1540.167239.camel@hal.voltaire.com>
Message-ID: <20070514182400.GL29746@sashak.voltaire.com>

On 06:58 Mon 14 May     , Hal Rosenstock wrote:
> > > +
> > > +/**
> > > + * group port counters for ports into the nodes
> > > + */
> > > +typedef struct _osm_pc_node {
> > > +	cl_map_item_t  map_item; /* must be first */
> > > +	uint64_t       node_guid;
> > > +	osm_event_pc_t   *ports;
> > > +	uint8_t        num_ports;
> > > +} osm_pc_node_t;
> > 
> > Is it really needed to keep osm_pc_node_t nodes in separate db (qmap)?
> > Why not to reuse already existed maps in osm_subn_t (we could add
> > 'void *pm_data' or so field to osm_physp_t structure)?
> 
> My one concern would be evolving the PerfMgr. This is better now but is
> this better when the PerfMgr is separated from the SM functionality ? I
> know there are other things to untangle to get there.

PerfMgr "sweep" is based on the discovered fabric topology structures
anyway. So what is the reason to duplicate the nodes/ports qmaps?
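
To spell out the two options being compared, here is a simplified sketch.
The structures are stand-ins invented for this mail (the real osm_physp_t
has many more fields), and the linear search merely stands in for a qmap
lookup keyed by GUID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct port_counters {
	uint64_t xmit_data;
	uint64_t rcv_data;
};

/* Hypothetical stand-in for osm_physp_t with the suggested opaque field. */
struct physp {
	uint64_t port_guid;
	void *pm_data;	/* owned by the PerfMgr, ignored by the SM */
};

/* Option 1: counters hang off the port the sweep already visits. */
struct port_counters *pc_from_port(struct physp *p)
{
	assert(p);
	return p->pm_data;
}

/* Option 2: a separate table keyed by GUID, searched on every access. */
struct pc_entry {
	uint64_t port_guid;
	struct port_counters pc;
};

struct port_counters *pc_from_table(struct pc_entry *tbl, size_t n,
				    uint64_t guid)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tbl[i].port_guid == guid)
			return &tbl[i].pc;
	return NULL;
}
```

Option 1 costs one pointer per port in the SM structure; option 2 keeps the
SM structures untouched at the price of a second lookup and a second set of
maps to keep in sync with the topology.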

Sasha


From rick.jones2 at hp.com  Mon May 14 11:52:08 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Mon, 14 May 2007 11:52:08 -0700
Subject: [ofa-general] OFED 1.2 rc3 release
In-Reply-To: <46487AE8.1020005@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com>	<6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com>
	<46487AE8.1020005@mellanox.co.il>
Message-ID: <4648AFD8.2060101@hp.com>

Tziporet Koren wrote:
> Hi,
> 
> OFED 1.2-RC3 is available on  _http://www.openfabrics.org/builds/ofed-1.2/_
> File: OFED-1.2-rc3.tgz
> To get BUILD_ID run ofed_info
> 
> Please report any issues in bugzilla _https://bugs.openfabrics.org/_
> 
> *RC4 due date is May 21*

It could be that I need new bifocals, but there does not appear to be a 1.2rc3 
version listed against which we can submit reports.

rick jones


From sashak at voltaire.com  Mon May 14 12:04:46 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 14 May 2007 22:04:46 +0300
Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager
In-Reply-To: <20070514095553.3ec3bec7.weiny2@llnl.gov>
References: <20070508184938.311b1c8f.weiny2@llnl.gov>
	<20070513195539.GH29746@sashak.voltaire.com>
	<20070514095553.3ec3bec7.weiny2@llnl.gov>
Message-ID: <20070514190446.GM29746@sashak.voltaire.com>

On 09:55 Mon 14 May     , Ira Weiny wrote:
> > 
> > Instead I would propose to have a builtin PerfMgr which will be able to
> > pull and store performance related data and then to call "generic" event
> > manager which can process such data. This also will help to have simpler
> > generic API for such event db plugin so other parts of OpenSM will be
> > able to report events using same method(s). What do you think?
> 
> This is a good idea.  I will think about how to make it work.

Thanks.

> <snip>
> 
> > > +
> > > +/**
> > > + * group port counters for ports into the nodes
> > > + */
> > > +typedef struct _osm_pc_node {
> > > +	cl_map_item_t  map_item; /* must be first */
> > > +	uint64_t       node_guid;
> > > +	osm_event_pc_t   *ports;
> > > +	uint8_t        num_ports;
> > > +} osm_pc_node_t;
> > 
> > Is it really needed to keep osm_pc_node_t nodes in separate db (qmap)?
> > Why not to reuse already existed maps in osm_subn_t (we could add
> > 'void *pm_data' or so field to osm_physp_t structure)?
> > 
> 
> I did not want to complicate the SM data structures.  Also these structures
> were part of the plugin. This reference plugin used the compatibility lib qmap
> structures to store the data.  But other plugins may use SQL or other data
> stores.

Right, but a plugin can access OpenSM data structures in the same way as
its own internal ones, and duplicating the qmaps only costs performance.

> I think I agree with Hal that we should keep these data structures
> separate from the SM structures.

[snip..]

> I think one can control these better from the opensm.opts file.
> There seems to be too many options which must be set for this to work correctly
> right now.  Also what would you guys think of having a separate perfmgr config
> file?  I am not sure about that idea.  On one hand it helps to keep the
> opensm.opts file clean but on the other hand it means the user has to deal with
> another config file.  :-/

Perhaps we should think about a /etc/*/opensm.conf instead of the cached
options file /var/*/osm/opensm.opts?

> <snip>
> 
> Thanks for the comments.  I will get a framework done for the general event
> plugin...  I do agree that would be better for other types of events.  My
> original idea was to have these events reported to the "perfmgr".  But that is
> somewhat invasive on the perfmgr object.
> 
> I am not sure what the best way to do this is at the moment.  I have cleaned up
> the DB interface quite a bit, including making it more generic.  So I think
> this might fit in nicely.  I can reissue a patch if you would like to see it.
> Or I can just submit the header file to see the interface.

The "header file" way looks fine. We may want to split the PerfMgr
and EventMgr pieces into separate patch sets. But it is up to you.

Thanks again for the great work!

Sasha


From sweitzen at cisco.com  Mon May 14 12:19:24 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 14 May 2007 12:19:24 -0700
Subject: [ofa-general] OFED 1.2 rc3 release
In-Reply-To: <4648AFD8.2060101@hp.com>
References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com>	<6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com><46487AE8.1020005@mellanox.co.il>
	<4648AFD8.2060101@hp.com>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303821C3C@xmb-sjc-216.amer.cisco.com>

I added 1.2rc3 to bugzilla.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Rick Jones
> Sent: Monday, May 14, 2007 11:52 AM
> To: Tziporet Koren
> Cc: ewg at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: Re: [ofa-general] OFED 1.2 rc3 release
> 
> Tziporet Koren wrote:
> > Hi,
> > 
> > OFED 1.2-RC3 is available on http://www.openfabrics.org/builds/ofed-1.2/
> > File: OFED-1.2-rc3.tgz
> > To get BUILD_ID run ofed_info
> > 
> > Please report any issues in bugzilla: https://bugs.openfabrics.org/
> > 
> > *RC4 due date is May 21*
> 
> It could be that I need new bifocals, but there does not appear to be a
> 1.2rc3 version listed against which we can submit reports.
> 
> rick jones


From rick.jones2 at hp.com  Mon May 14 12:29:48 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Mon, 14 May 2007 12:29:48 -0700
Subject: [ofa-general] OFED 1.2 rc3 release
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303821C3C@xmb-sjc-216.amer.cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com>	<6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com><46487AE8.1020005@mellanox.co.il>	<4648AFD8.2060101@hp.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303821C3C@xmb-sjc-216.amer.cisco.com>
Message-ID: <4648B8AC.4030306@hp.com>

Scott Weitzenkamp (sweitzen) wrote:
> I added 1.2rc3 to bugzilla.

Splendid - bug 618 added :)  I didn't have a good feel for severity and/or 
priority so left them at the defaults.

rick jones


From rdreier at cisco.com  Mon May 14 12:39:31 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 12:39:31 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070513051806.GB7402@mellanox.co.il> (Michael S. Tsirkin's
	message of "Sun, 13 May 2007 08:18:06 +0300")
References: <20070508141727.GR21591@mellanox.co.il>
	<ada4pmjz7tm.fsf@cisco.com> <20070513051806.GB7402@mellanox.co.il>
Message-ID: <adaodknw5xo.fsf@cisco.com>

 > By the way, I just re-checked and it seems that WC support first
 > appeared in Pentium II systems. So I think we should be able to
 > use sfence if WC is enabled.

That's actually doubly wrong: WC support was added in Pentium Pro, and
sfence was added in Pentium III.

 - R.


From rdreier at cisco.com  Mon May 14 12:38:35 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 12:38:35 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070513045921.GA7402@mellanox.co.il> (Michael S. Tsirkin's
	message of "Sun, 13 May 2007 07:59:38 +0300")
References: <20070508141727.GR21591@mellanox.co.il>
	<ada4pmjz7tm.fsf@cisco.com> <20070512172927.GA5908@mellanox.co.il>
	<adamz09yc19.fsf@cisco.com> <20070513045921.GA7402@mellanox.co.il>
Message-ID: <adasl9zw5z8.fsf@cisco.com>

 > So, could we use a lock instruction to fence WC writes out?

Yes, the right thing seems to be to use the same thing for wc_wmb() as
for mb() on i386, namely "lock; addl $0,0(%%esp)".  That is definitely
a serializing instruction that will flush WC buffers.
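
As a rough illustration of the suggestion above (a sketch only, not the
actual libmlx4 code): the i386 branch uses the locked add quoted in this
message. The x86-64 and generic branches, and the copy_then_fence() helper,
are assumptions added here so that the example builds on other targets.

```c
#include <stdint.h>

/* Hypothetical stand-in for the wc_wmb() discussed above. */
#if defined(__i386__)
/* Locked read-modify-write on the stack, as suggested in the message. */
#define wc_wmb() __asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory", "cc")
#elif defined(__x86_64__)
/* Assumption: sfence is always present on x86-64, so use it there. */
#define wc_wmb() __asm__ __volatile__("sfence" ::: "memory")
#else
/* Assumption: fall back to the compiler's full barrier elsewhere. */
#define wc_wmb() __sync_synchronize()
#endif

/*
 * Copy n words into dst (imagine a write-combining mapping) and then
 * issue the barrier so the stores are pushed out before we return.
 */
static void copy_then_fence(volatile uint32_t *dst, const uint32_t *src,
			    unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++)
		dst[i] = src[i];
	wc_wmb();
}
```

A caller would fill a descriptor or doorbell area with copy_then_fence()
and only afterwards notify the device.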

 - R.


From rdreier at cisco.com  Mon May 14 12:41:28 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 12:41:28 -0700
Subject: [ofa-general] Re: weird kconfig output
In-Reply-To: <20070514142233.GD7989@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 14 May 2007 17:22:33 +0300")
References: <20070514142233.GD7989@mellanox.co.il>
Message-ID: <adak5vbw5uf.fsf@cisco.com>

 > Doing make oldconfig on 2.6.22-rc1 (.config came from 2.6.21)
 > gave me this prompt, among the list of 10G/s adapters:
 >   Verbose debugging output (MLX4_DEBUG) [Y/n/?] (NEW)
 > 
 > Shouldn't I get prompted for mlx4 eth first?

mlx4 eth isn't upstream (since it doesn't do anything, FW isn't ready,
etc etc).  I'm not sure if there's a way to improve this until then.

 - R.


From swise at opengridcomputing.com  Mon May 14 12:55:39 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 14 May 2007 14:55:39 -0500
Subject: [ofa-general] [GIT PULL] ofed_1_2 iw_cxgb3 - fix for bug 611
Message-ID: <1179172539.25841.57.camel@stevo-desktop>

Vlad,

Please pull the cxgb3 driver fix for bug 611 from

git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2

Thanks,

Steve.

----

commit 1e6d99bddf75465a6c05b74074278f2691edcc37
Author: Steve Wise <swise at opengridcomputing.com>
Date:   Mon May 14 13:27:27 2007 -0500

    iw_cxgb3: Streaming -> RDMA mode transition fixes.
    
    Due to a HW issue, our current scheme to transition the
    connection from streaming to rdma mode was broken on
    the passive side.  The firmware and driver now support
    a new transition scheme for the passive side:
    
    - driver posts rdma_init_wr (now including the initial receive seqno)
    
    - driver posts last streaming message via TX_DATA message (MPA
      start response)
    
    - uP atomically sends the last streaming message and transitions
      the tcb to rdma mode.
    
    - driver waits for wr_ack indicating the last streaming message was
      ACKed.
    
    This change also bumps the required firmware version...
    
    Signed-off-by: Steve Wise <swise at opengridcomputing.com>

diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c
index ce05db5..62998d3 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.c
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c
@@ -862,7 +862,7 @@ int cxio_rdma_init(struct cxio_rdev *rde
 	wqe->ird = cpu_to_be32(attr->ird);
 	wqe->qp_dma_addr = cpu_to_be64(attr->qp_dma_addr);
 	wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size);
-	wqe->rsvd = 0;
+	wqe->irs = cpu_to_be32(attr->irs);
 	skb->priority = 0;	/* 0=>ToeQ; 1=>CtrlQ */
 	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
 }
diff --git a/drivers/infiniband/hw/cxgb3/cxio_wr.h b/drivers/infiniband/hw/cxgb3/cxio_wr.h
index e7ea455..9094147 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_wr.h
+++ b/drivers/infiniband/hw/cxgb3/cxio_wr.h
@@ -294,6 +294,7 @@ struct t3_rdma_init_attr {
 	u64 qp_dma_addr;
 	u32 qp_dma_size;
 	u32 flags;
+	u32 irs;
 };
 
 struct t3_rdma_init_wr {
@@ -314,7 +315,7 @@ struct t3_rdma_init_wr {
 	__be32 ird;
 	__be64 qp_dma_addr;	/* 7 */
 	__be32 qp_dma_size;	/* 8 */
-	u32 rsvd;
+	u32 irs;
 };
 
 struct t3_genbit {
diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 0d81e2f..ed56d55 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -516,7 +516,7 @@ static void send_mpa_req(struct iwch_ep 
 	req->len = htonl(len);
 	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
 			   V_TX_SNDBUF(snd_win>>15));
-	req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT);
+	req->flags = htonl(F_TX_INIT);
 	req->sndseq = htonl(ep->snd_seq);
 	BUG_ON(ep->mpa_skb);
 	ep->mpa_skb = skb;
@@ -567,7 +567,7 @@ static int send_mpa_reject(struct iwch_e
 	req->len = htonl(mpalen);
 	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
 			   V_TX_SNDBUF(snd_win>>15));
-	req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT);
+	req->flags = htonl(F_TX_INIT);
 	req->sndseq = htonl(ep->snd_seq);
 	BUG_ON(ep->mpa_skb);
 	ep->mpa_skb = skb;
@@ -619,7 +619,7 @@ static int send_mpa_reply(struct iwch_ep
 	req->len = htonl(len);
 	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
 			   V_TX_SNDBUF(snd_win>>15));
-	req->flags = htonl(F_TX_MORE | F_TX_IMM_ACK | F_TX_INIT);
+	req->flags = htonl(F_TX_INIT);
 	req->sndseq = htonl(ep->snd_seq);
 	ep->mpa_skb = skb;
 	state_set(&ep->com, MPA_REP_SENT);
@@ -642,6 +642,7 @@ static int act_establish(struct t3cdev *
 	cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid);
 
 	ep->snd_seq = ntohl(req->snd_isn);
+	ep->rcv_seq = ntohl(req->rcv_isn);
 
 	set_emss(ep, ntohs(req->tcp_opt));
 
@@ -1021,6 +1022,9 @@ static int rx_data(struct t3cdev *tdev, 
 
 	skb_pull(skb, sizeof(*hdr));
 	skb_trim(skb, dlen);
+	
+	ep->rcv_seq += dlen;
+	BUG_ON(ep->rcv_seq != (ntohl(hdr->seq) + dlen));
 
 	switch (state_read(&ep->com)) {
 	case MPA_REQ_SENT:
@@ -1059,7 +1063,6 @@ static int tx_ack(struct t3cdev *tdev, s
 	struct iwch_ep *ep = ctx;
 	struct cpl_wr_ack *hdr = cplhdr(skb);
 	unsigned int credits = ntohs(hdr->credits);
-	enum iwch_qp_attr_mask  mask;
 
 	PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits);
 
@@ -1071,30 +1074,6 @@ static int tx_ack(struct t3cdev *tdev, s
 	ep->mpa_skb = NULL;
 	dst_confirm(ep->dst);
 	if (state_read(&ep->com) == MPA_REP_SENT) {
-		struct iwch_qp_attributes attrs;
-
-		/* bind QP to EP and move to RTS */
-		attrs.mpa_attr = ep->mpa_attr;
-		attrs.max_ird = ep->ord;
-		attrs.max_ord = ep->ord;
-		attrs.llp_stream_handle = ep;
-		attrs.next_state = IWCH_QP_STATE_RTS;
-
-		/* bind QP and TID with INIT_WR */
-		mask = IWCH_QP_ATTR_NEXT_STATE |
-				     IWCH_QP_ATTR_LLP_STREAM_HANDLE |
-				     IWCH_QP_ATTR_MPA_ATTR |
-				     IWCH_QP_ATTR_MAX_IRD |
-				     IWCH_QP_ATTR_MAX_ORD;
-
-		ep->com.rpl_err = iwch_modify_qp(ep->com.qp->rhp,
-				     ep->com.qp, mask, &attrs, 1);
-
-		if (!ep->com.rpl_err) {
-			state_set(&ep->com, FPDU_MODE);
-			established_upcall(ep);
-		}
-
 		ep->com.rpl_done = 1;
 		PDBG("waking up ep %p\n", ep);
 		wake_up(&ep->com.waitq);
@@ -1377,6 +1356,7 @@ static int pass_establish(struct t3cdev 
 
 	PDBG("%s ep %p\n", __FUNCTION__, ep);
 	ep->snd_seq = ntohl(req->snd_isn);
+	ep->rcv_seq = ntohl(req->rcv_isn);
 
 	set_emss(ep, ntohs(req->tcp_opt));
 
@@ -1730,10 +1710,8 @@ int iwch_accept_cr(struct iw_cm_id *cm_i
 	struct iwch_qp *qp = get_qhp(h, conn_param->qpn);
 
 	PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid);
-	if (state_read(&ep->com) == DEAD) {
-		put_ep(&ep->com);
+	if (state_read(&ep->com) == DEAD)
 		return -ECONNRESET;
-	}
 
 	BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
 	BUG_ON(!qp);
@@ -1753,18 +1731,9 @@ int iwch_accept_cr(struct iw_cm_id *cm_i
 	ep->ird = conn_param->ird;
 	ep->ord = conn_param->ord;
 	PDBG("%s %d ird %d ord %d\n", __FUNCTION__, __LINE__, ep->ird, ep->ord);
+
 	get_ep(&ep->com);
-	err = send_mpa_reply(ep, conn_param->private_data,
-			     conn_param->private_data_len);
-	if (err) {
-		ep->com.cm_id = NULL;
-		ep->com.qp = NULL;
-		cm_id->rem_ref(cm_id);
-		abort_connection(ep, NULL, GFP_KERNEL);
-		put_ep(&ep->com);
-		return err;
-	}
-	
+
 	/* bind QP to EP and move to RTS */
 	attrs.mpa_attr = ep->mpa_attr;
 	attrs.max_ird = ep->ord;
@@ -1781,16 +1750,29 @@ int iwch_accept_cr(struct iw_cm_id *cm_i
 
 	err = iwch_modify_qp(ep->com.qp->rhp,
 			     ep->com.qp, mask, &attrs, 1);
+	if (err)
+		goto err;
 
-	if (err) {
-		ep->com.cm_id = NULL;
-		ep->com.qp = NULL;
-		cm_id->rem_ref(cm_id);
-		abort_connection(ep, NULL, GFP_KERNEL);
-	} else {
-		state_set(&ep->com, FPDU_MODE);
-		established_upcall(ep);
-	}
+	err = send_mpa_reply(ep, conn_param->private_data,
+			     conn_param->private_data_len);
+	if (err)
+		goto err;
+	
+	/* wait for wr_ack */
+	wait_event(ep->com.waitq, ep->com.rpl_done);
+	err = ep->com.rpl_err;
+	if (err) 
+		goto err;
+
+	state_set(&ep->com, FPDU_MODE);
+	established_upcall(ep);
+	put_ep(&ep->com);
+	return 0;
+err:
+	ep->com.cm_id = NULL;
+	ep->com.qp = NULL;
+	cm_id->rem_ref(cm_id);
+	abort_connection(ep, NULL, GFP_KERNEL);
 	put_ep(&ep->com);
 	return err;
 }
diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h
index 1d4a1a5..5462331 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.h
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h
@@ -175,6 +175,7 @@ struct iwch_ep {
 	unsigned int atid;
 	u32 hwtid;
 	u32 snd_seq;
+	u32 rcv_seq;
 	struct l2t_entry *l2t;
 	struct dst_entry *dst;
 	struct sk_buff *mpa_skb;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
index 7530dc0..162d1fa 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -734,6 +734,7 @@ #endif
 	init_attr.qp_dma_addr = qhp->wq.dma_addr;
 	init_attr.qp_dma_size = (1UL << qhp->wq.size_log2);
 	init_attr.flags = rqes_posted(qhp) ? RECVS_POSTED : 0;
+	init_attr.irs = qhp->ep->rcv_seq;
 	PDBG("%s init_attr.rq_addr 0x%x init_attr.rq_size = %d "
 	     "flags 0x%x qpcaps 0x%x\n", __FUNCTION__,
 	     init_attr.rq_addr, init_attr.rq_size,
diff --git a/drivers/net/cxgb3/version.h b/drivers/net/cxgb3/version.h
index 17b9801..7ef2193 100644
--- a/drivers/net/cxgb3/version.h
+++ b/drivers/net/cxgb3/version.h
@@ -39,6 +39,6 @@ #define DRV_VERSION "1.0-ofed"
 
 /* Firmware version */
 #define FW_VERSION_MAJOR 4
-#define FW_VERSION_MINOR 0
+#define FW_VERSION_MINOR 2
 #define FW_VERSION_MICRO 0
 #endif				/* __CHELSIO_VERSION_H */
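
The receive-sequence bookkeeping added to rx_data() above can be tried out
in isolation. This is only a sketch: struct ep_sketch and rx_account() are
hypothetical names, and the BUG_ON() is turned into a return value so the
check can be exercised in user space.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down endpoint: only the field this sketch needs. */
struct ep_sketch {
	uint32_t rcv_seq;	/* running receive sequence number */
};

/*
 * Account for dlen bytes of streaming data whose header carried sequence
 * number hdr_seq.  Returns 1 if the running count agrees with the header
 * (the condition the patch enforces with BUG_ON), 0 otherwise.
 * Unsigned arithmetic makes the comparison wrap-safe.
 */
static int rx_account(struct ep_sketch *ep, uint32_t hdr_seq, uint32_t dlen)
{
	ep->rcv_seq += dlen;
	return ep->rcv_seq == (uint32_t)(hdr_seq + dlen);
}
```

In the patch the resulting rcv_seq is what gets passed to the firmware as
init_attr.irs when the QP is moved to RTS.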


From rdreier at cisco.com  Mon May 14 13:03:13 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 13:03:13 -0700
Subject: [ofa-general] [PATCH] mlx4: fix uninitialized spinlock for 32-bit
	architectures
In-Reply-To: <200705131718.23298.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Sun, 13 May 2007 17:18:23 +0300")
References: <200705131718.23298.jackm@dev.mellanox.co.il>
Message-ID: <adafy5zw4u6.fsf@cisco.com>

Thanks, applied.


From rdreier at cisco.com  Mon May 14 13:23:37 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 13:23:37 -0700
Subject: [ofa-general] Re: [PATCH take2] IB/ipath -- shadow the gpio_mask
	register
In-Reply-To: <20070510191047.6876.80760.stgit@bauxite.internal.keyresearch.com>
	(Arthur Jones's message of "Thu, 10 May 2007 12:10:49 -0700")
References: <20070510191047.6876.80760.stgit@bauxite.internal.keyresearch.com>
Message-ID: <ada7irbw3w6.fsf@cisco.com>

Thanks, applied.  That changelog was what I dream of seeing with a
patch -- it was so perfect I choked up a little.


From rdreier at cisco.com  Mon May 14 13:42:08 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 13:42:08 -0700
Subject: [ofa-general] Re: [PATCH 6/6] IB/ehca: disable scaling code by
	default, bump version number
In-Reply-To: <200705091348.31742.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Wed, 9 May 2007 13:48:31 +0200")
References: <200705091348.31742.fenkes@de.ibm.com>
Message-ID: <adatzufuogv.fsf@cisco.com>

Thanks, applied 1-6.


From mst at dev.mellanox.co.il  Mon May 14 13:42:41 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 14 May 2007 23:42:41 +0300
Subject: [ofa-general] Re: weird kconfig output
In-Reply-To: <adak5vbw5uf.fsf@cisco.com>
References: <20070514142233.GD7989@mellanox.co.il> <adak5vbw5uf.fsf@cisco.com>
Message-ID: <20070514204241.GB12462@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: weird kconfig output
> 
>  > Doing make oldconfig on 2.6.22-rc1 (.config came from 2.6.21)
>  > gave me this prompt, among the list of 10G/s adapters:
>  >   Verbose debugging output (MLX4_DEBUG) [Y/n/?] (NEW)
>  > 
>  > Shouldn't I get prompted for mlx4 eth first?
> 
> mlx4 eth isn't upstream (since it doesn't do anything, FW isn't ready,
> etc etc).  I'm not sure if there's a way to improve this until then.

Maybe just change the help text?

-- 
MST


From rdreier at cisco.com  Mon May 14 13:46:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 13:46:07 -0700
Subject: [ofa-general] Re: weird kconfig output
In-Reply-To: <20070514204241.GB12462@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 14 May 2007 23:42:41 +0300")
References: <20070514142233.GD7989@mellanox.co.il> <adak5vbw5uf.fsf@cisco.com>
	<20070514204241.GB12462@mellanox.co.il>
Message-ID: <adaps53uoa8.fsf@cisco.com>

 > Maybe just change the help text?

Right now we have:

          This option causes debugging code to be compiled into the
          mlx4_core driver.  The output can be turned on via the
          debug_level module parameter (which can also be set after
          the driver is loaded through sysfs).

what would you suggest changing?


From sashak at voltaire.com  Mon May 14 14:05:41 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 15 May 2007 00:05:41 +0300
Subject: [ofa-general] Re: Error message in OSM log when cached op file
	doesn't exist
In-Reply-To: <46486D1E.6010408@dev.mellanox.co.il>
References: <46486D1E.6010408@dev.mellanox.co.il>
Message-ID: <20070514210541.GR29746@sashak.voltaire.com>

Hi Yevgeny,

On 17:07 Mon 14 May     , Yevgeny Kliteynik wrote:
> 
> I actually don't like this thing, because now every time you run
> OpenSM on the machine that doesn't have any cached options file
> (which is usually the case) you get an error message.
> 
> There's no point checking whether the file exists, because osm runs
> as root,

Not necessarily.

> and if it fails opening this file, it means that the file
> doesn't exist or is inaccessible (broken mount, etc).

Or the user-provided OSM_CACHE_DIR environment variable is broken, or
malloc failed, or some other error occurred (see: man 3 fopen, man 2 open,
man 3 malloc).

This alone probably solves your issue:

diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
index 855d1ab..f7ddf7d 100644
--- a/osm/opensm/osm_subnet.c
+++ b/osm/opensm/osm_subnet.c
@@ -51,6 +51,7 @@
 
 #include <string.h>
 #include <stdio.h>
+#include <errno.h>
 #include <limits.h>
 #include <complib/cl_debug.h>
 #include <complib/cl_log.h>
@@ -856,7 +857,7 @@ osm_subn_parse_conf_file(
   
   opts_file = fopen(file_name, "r");
   if (!opts_file)
-    return IB_ERROR;
+    return errno == ENOENT ? IB_SUCCESS : IB_ERROR;
 
   while (fgets(line, 1023, opts_file) != NULL)
   {

Or should we add yet another IB_* status value and a less aggressive
warning message?
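
For anyone who wants to try the errno check outside of OpenSM, here is a
minimal standalone sketch of the same idea. The names parse_status and
parse_optional_file are made up for illustration; only fopen(), fgets(),
fclose() and errno come from the C library.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical status codes standing in for IB_SUCCESS / IB_ERROR. */
enum parse_status { PARSE_OK = 0, PARSE_ERROR = -1 };

/*
 * Read an optional options file.  A missing file (ENOENT) is treated as
 * success because the file is optional; any other fopen() failure is
 * reported as an error.
 */
static enum parse_status parse_optional_file(const char *file_name)
{
	char line[1024];
	FILE *opts_file;

	opts_file = fopen(file_name, "r");
	if (!opts_file)
		return errno == ENOENT ? PARSE_OK : PARSE_ERROR;

	while (fgets(line, sizeof(line), opts_file) != NULL) {
		/* real code would parse "key value" pairs here */
	}

	fclose(opts_file);
	return PARSE_OK;
}
```

With this shape the caller only logs an error when the file exists but
cannot be opened.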

Sasha


From rdreier at cisco.com  Mon May 14 13:57:23 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 13:57:23 -0700
Subject: [ofa-general] Re: [GIT PULL] 2.6.22: please pull rdma-dev.git
In-Reply-To: <000101c79269$72c67aa0$e598070a@amr.corp.intel.com> (Sean Hefty's
	message of "Wed, 9 May 2007 11:39:59 -0700")
References: <000101c79269$72c67aa0$e598070a@amr.corp.intel.com>
Message-ID: <adalkfrunrg.fsf@cisco.com>

Thanks, applied 1-3


From mst at dev.mellanox.co.il  Mon May 14 13:58:23 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 14 May 2007 23:58:23 +0300
Subject: [ofa-general] Re: weird kconfig output
In-Reply-To: <adaps53uoa8.fsf@cisco.com>
References: <20070514142233.GD7989@mellanox.co.il> <adak5vbw5uf.fsf@cisco.com>
	<20070514204241.GB12462@mellanox.co.il> <adaps53uoa8.fsf@cisco.com>
Message-ID: <20070514205823.GC12462@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: weird kconfig output
> 
>  > Maybe just change the help text?
> 
> Right now we have:
> 
>           This option causes debugging code to be compiled into the
>           mlx4_core driver.  The output can be turned on via the
>           debug_level module parameter (which can also be set after
>           the driver is loaded through sysfs).
> 
> what would you suggest changing?

-        bool "Verbose debugging output" if (MLX4_CORE && EMBEDDED)
+        bool "Mellanox ConnectX coremodule: verbose debugging output" if (MLX4_CORE && EMBEDDED)

or, is it possible to move the menu option to mlx4_ib?

-- 
MST


From rdreier at cisco.com  Mon May 14 14:00:57 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 14:00:57 -0700
Subject: [ofa-general] Re: [PATCH] libmthca: fix cq cleanup
In-Reply-To: <20070514141514.GC7989@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 14 May 2007 17:15:14 +0300")
References: <20070514141514.GC7989@mellanox.co.il>
Message-ID: <adahcqfunli.fsf@cisco.com>

Thanks, applied both libmthca fixes.


From rdreier at cisco.com  Mon May 14 14:10:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 14:10:47 -0700
Subject: [ofa-general] Re: [PATCH] IB/mthca: fix cq cleanup
In-Reply-To: <20070514141450.GB7989@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 14 May 2007 17:14:50 +0300")
References: <20070514141450.GB7989@mellanox.co.il>
Message-ID: <adad513un54.fsf@cisco.com>

thanks, applied.


From rdreier at cisco.com  Mon May 14 14:14:04 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 14:14:04 -0700
Subject: [ofa-general] Re: [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <20070514045832.GA18615@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 14 May 2007 07:58:32 +0300")
References: <20070514045832.GA18615@mellanox.co.il>
Message-ID: <ada8xbrumzn.fsf@cisco.com>

thanks...

 > Michael S. Tsirkin (3):
 >       IB/mthca: fix posting >255 recv WRs for Tavor
 >       ipoib/cm: optimize stale connection detection

I applied this one.

 >       IB/mthca: fix RESET to ERROR transition

I will read this over more carefully -- it seems to be a rather big
patch that adds constification in various places etc.

 > Yosef Etigin (2):
 >       IB/core: add helpers for uncached gid/pkey queries
 >       IB/ipoib: handle pkey re-shuffling

I need to catch up on the discussion that I did not read while I was
traveling last week, so I'll hold these two as well.


From rdreier at cisco.com  Mon May 14 14:18:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 14:18:00 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <ada4pmfumt3.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will get the following post 2.6.22-rc1 fixes:

Arthur Jones (1):
      IB/ipath: Shadow the gpio_mask register

Hoang-Nam Nguyen (1):
      IB/ehca: Fix AQP0/1 QP number

Jack Morgenstein (1):
      IB/mlx4: Fix uninitialized spinlock for 32-bit archs

Joachim Fenkes (4):
      IB/ehca: Correctly set GRH mask bit in ehca_modify_qp()
      IB/ehca: Remove _irqsave, move #ifdef
      IB/ehca: Beautify sysfs attribute code and fix compiler warnings
      IB/ehca: Disable scaling code by default, bump version number

Michael S. Tsirkin (3):
      IB/mthca: Fix posting >255 recv WRs for Tavor
      IB/mthca: Set cleaned CQEs back to HW ownership when cleaning CQ
      IPoIB/cm: Optimize stale connection detection

Paul Mundt (1):
      net: Trivial MLX4_DEBUG dependency fix.

Roland Dreier (1):
      mlx4_core: Remove unused doorbell_lock

Sean Hefty (3):
      RDMA/cma: Simplify device removal handling code
      RDMA/cma: Fix synchronization with device removal in cma_iw_handler
      RDMA/cma: Add check to validate that cm_id is bound to a device

Stefan Roscher (1):
      IB/ehca: Serialize hypervisor calls in ehca_register_mr()

 drivers/infiniband/core/cma.c               |  106 +++++++++++++++------------
 drivers/infiniband/hw/ehca/ehca_classes.h   |    1 +
 drivers/infiniband/hw/ehca/ehca_irq.c       |    7 +-
 drivers/infiniband/hw/ehca/ehca_main.c      |   94 +++++++++++-------------
 drivers/infiniband/hw/ehca/ehca_qp.c        |   17 +++--
 drivers/infiniband/hw/ehca/hcp_if.c         |   13 +++-
 drivers/infiniband/hw/ipath/ipath_iba6120.c |    7 +-
 drivers/infiniband/hw/ipath/ipath_intr.c    |    7 +-
 drivers/infiniband/hw/ipath/ipath_kernel.h  |    2 +
 drivers/infiniband/hw/ipath/ipath_verbs.c   |   12 ++--
 drivers/infiniband/hw/mlx4/main.c           |    1 +
 drivers/infiniband/hw/mthca/mthca_cq.c      |    4 +-
 drivers/infiniband/hw/mthca/mthca_qp.c      |    1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c     |   11 ++-
 drivers/net/Kconfig                         |    1 +
 drivers/net/mlx4/main.c                     |    2 -
 drivers/net/mlx4/mlx4.h                     |    1 -
 17 files changed, 154 insertions(+), 133 deletions(-)


From halr at voltaire.com  Mon May 14 14:21:52 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 17:21:52 -0400
Subject: [ofa-general] IB/core: Enhance SMI for switch support
Message-ID: <1179177711.4531.10290.camel@hal.voltaire.com>

IB/core: Enhance SMI for switch support

SMI is extended for switch (intermediate hop) support. Care has
been taken to keep the CA (and router) code paths as close as
possible to how they were before this support was added.

Signed-off-by: Suresh Shelvapille <suri at baymicrosystems.com>
Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c
index ecd1a30..7583941 100644
--- a/drivers/infiniband/core/agent.c
+++ b/drivers/infiniband/core/agent.c
@@ -3,7 +3,7 @@
  * Copyright (c) 2004, 2005 Infinicon Corporation.  All rights reserved.
  * Copyright (c) 2004, 2005 Intel Corporation.  All rights reserved.
  * Copyright (c) 2004, 2005 Topspin Corporation.  All rights reserved.
- * Copyright (c) 2004, 2005 Voltaire Corporation.  All rights reserved.
+ * Copyright (c) 2004-2007 Voltaire Corporation.  All rights reserved.
  * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -34,7 +34,6 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: agent.c 1389 2004-12-27 22:56:47Z roland $
  */
 
 #include <linux/slab.h>
@@ -42,6 +41,7 @@
 
 #include "agent.h"
 #include "smi.h"
+#include "mad_priv.h"
 
 #define SPFX "ib_agent: "
 
@@ -87,8 +87,13 @@ int agent_send_response(struct ib_mad *m
 	struct ib_mad_send_buf *send_buf;
 	struct ib_ah *ah;
 	int ret;
+	struct ib_mad_send_wr_private *mad_send_wr;
+
+	if (device->node_type == RDMA_NODE_IB_SWITCH)
+		port_priv = ib_get_agent_port(device, 0);
+	else
+		port_priv = ib_get_agent_port(device, port_num);
 
-	port_priv = ib_get_agent_port(device, port_num);
 	if (!port_priv) {
 		printk(KERN_ERR SPFX "Unable to find port agent\n");
 		return -ENODEV;
@@ -113,6 +118,14 @@ int agent_send_response(struct ib_mad *m
 
 	memcpy(send_buf->mad, mad, sizeof *mad);
 	send_buf->ah = ah;
+	
+	if (device->node_type == RDMA_NODE_IB_SWITCH){
+		mad_send_wr = container_of(send_buf,
+				  	   struct ib_mad_send_wr_private,
+					   send_buf);
+		mad_send_wr->send_wr.wr.ud.port_num = port_num;
+	}
+	
 	if ((ret = ib_post_send_mad(send_buf, NULL))) {
 		printk(KERN_ERR SPFX "ib_post_send_mad error:%d\n", ret);
 		goto err2;
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 85ccf13..6b8faca 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -675,10 +675,16 @@ static int handle_outgoing_dr_smp(struct
 	struct ib_mad_port_private *port_priv;
 	struct ib_mad_agent_private *recv_mad_agent = NULL;
 	struct ib_device *device = mad_agent_priv->agent.device;
-	u8 port_num = mad_agent_priv->agent.port_num;
+	u8 port_num;
 	struct ib_wc mad_wc;
 	struct ib_send_wr *send_wr = &mad_send_wr->send_wr;
 
+	if (device->node_type == RDMA_NODE_IB_SWITCH &&
+	    smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
+		port_num = send_wr->wr.ud.port_num;
+	else
+		port_num = mad_agent_priv->agent.port_num;
+
 	/*
 	 * Directed route handling starts if the initial LID routed part of
 	 * a request or the ending LID routed part of a response is empty.
@@ -1839,6 +1845,7 @@ static void ib_mad_recv_done_handler(str
 	struct ib_mad_private *recv, *response;
 	struct ib_mad_list_head *mad_list;
 	struct ib_mad_agent_private *mad_agent;
+	int port_num;
 
 	response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL);
 	if (!response)
@@ -1872,25 +1879,50 @@ static void ib_mad_recv_done_handler(str
 	if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num))
 		goto out;
 
+	if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH)
+		port_num = wc->port_num;
+	else
+		port_num = port_priv->port_num;
+
 	if (recv->mad.mad.mad_hdr.mgmt_class ==
 	    IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) {
+		enum smi_forward_action retsmi;
+
 		if (smi_handle_dr_smp_recv(&recv->mad.smp,
 					   port_priv->device->node_type,
-					   port_priv->port_num,
+					   port_num,
 					   port_priv->device->phys_port_cnt) ==
 					   IB_SMI_DISCARD)
 			goto out;
 
-		if (smi_check_forward_dr_smp(&recv->mad.smp) == IB_SMI_LOCAL)
+		retsmi = smi_check_forward_dr_smp(&recv->mad.smp);
+		if (retsmi == IB_SMI_LOCAL)
 			goto local;
 
-		if (smi_handle_dr_smp_send(&recv->mad.smp,
-					   port_priv->device->node_type,
-					   port_priv->port_num) == IB_SMI_DISCARD)
-			goto out;
+		if (retsmi == IB_SMI_SEND) { /* don't forward */
+			if (smi_handle_dr_smp_send(&recv->mad.smp,
+						   port_priv->device->node_type,
+						   port_num) == IB_SMI_DISCARD)
+				goto out;
+
+			if (smi_check_local_smp(&recv->mad.smp, port_priv->device) == IB_SMI_DISCARD)
+				goto out;
+		} else if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) {
+			/* forward case for switches */
+			memcpy(response, recv, sizeof(*response));
+			response->header.recv_wc.wc = &response->header.wc;
+			response->header.recv_wc.recv_buf.mad = &response->mad.mad;
+			response->header.recv_wc.recv_buf.grh = &response->grh;
+
+			if (!agent_send_response(&response->mad.mad,
+						 &response->grh, wc,
+						 port_priv->device,
+						 smi_get_fwd_port(&recv->mad.smp),
+						 qp_info->qp->qp_num))
+				response = NULL;
 
-		if (smi_check_local_smp(&recv->mad.smp, port_priv->device) == IB_SMI_DISCARD)
 			goto out;
+		}
 	}
 
 local:
@@ -1919,7 +1951,7 @@ local:
 				agent_send_response(&response->mad.mad,
 						    &recv->grh, wc,
 						    port_priv->device,
-						    port_priv->port_num,
+						    port_num,
 						    qp_info->qp->qp_num);
 				goto out;
 			}
diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c
index 2bca753..8723675 100644
--- a/drivers/infiniband/core/smi.c
+++ b/drivers/infiniband/core/smi.c
@@ -192,7 +192,7 @@ enum smi_action smi_handle_dr_smp_recv(s
 			}
 			/* smp->hop_ptr updated when sending */
 			return (node_type == RDMA_NODE_IB_SWITCH ?
-				IB_SMI_HANDLE: IB_SMI_DISCARD);
+				IB_SMI_HANDLE : IB_SMI_DISCARD);
 		}
 
 		/* C14-13:4 -- hop_ptr = 0 -> give to SM */
@@ -211,7 +211,7 @@ enum smi_forward_action smi_check_forwar
 	if (!ib_get_smp_direction(smp)) {
 		/* C14-9:2 -- intermediate hop */
 		if (hop_ptr && hop_ptr < hop_cnt)
-			return IB_SMI_SEND;
+			return IB_SMI_FORWARD;
 
 		/* C14-9:3 -- at the end of the DR segment of path */
 		if (hop_ptr == hop_cnt)
@@ -224,7 +224,7 @@ enum smi_forward_action smi_check_forwar
 	} else {
 		/* C14-13:2  -- intermediate hop */
 		if (2 <= hop_ptr && hop_ptr <= hop_cnt)
-			return IB_SMI_SEND;
+			return IB_SMI_FORWARD;
 
 		/* C14-13:3 -- at the end of the DR segment of path */
 		if (hop_ptr == 1)
@@ -233,3 +233,13 @@ enum smi_forward_action smi_check_forwar
 	}
 	return IB_SMI_LOCAL;
 }
+
+/*
+ * Return the forwarding port number from initial_path for outgoing SMP and
+ * from return_path for returning SMP
+ */
+int smi_get_fwd_port(struct ib_smp *smp)
+{
+	return (!ib_get_smp_direction(smp) ? smp->initial_path[smp->hop_ptr+1] :
+		smp->return_path[smp->hop_ptr-1]);
+}
diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h
index 9a4b349..1cfc298 100644
--- a/drivers/infiniband/core/smi.h
+++ b/drivers/infiniband/core/smi.h
@@ -48,10 +48,12 @@ enum smi_action {
 enum smi_forward_action {
 	IB_SMI_LOCAL,	/* SMP should be completed up the stack */
 	IB_SMI_SEND,	/* received DR SMP should be forwarded to the send queue */
+	IB_SMI_FORWARD	/* SMP should be forwarded (for switches only) */
 };
 
 enum smi_action smi_handle_dr_smp_recv(struct ib_smp *smp, u8 node_type,
 				       int port_num, int phys_port_cnt);
+int smi_get_fwd_port(struct ib_smp *smp);
 extern enum smi_forward_action smi_check_forward_dr_smp(struct ib_smp *smp);
 extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp,
 					      u8 node_type, int port_num);
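
The intermediate-hop test and the forwarding-port lookup from the patch
above can be exercised on their own. This is a simplified sketch, not the
kernel code: struct dr_smp_sketch, its "returning" flag (standing in for
ib_get_smp_direction()) and the 64-entry path size are assumptions made
for the example.

```c
#include <stdint.h>

#define SKETCH_MAX_PATH_HOPS 64	/* assumed path array size */

/* Hypothetical cut-down directed-route SMP. */
struct dr_smp_sketch {
	int returning;		/* 0 = outgoing SMP, nonzero = returning SMP */
	uint8_t hop_ptr;
	uint8_t hop_cnt;
	uint8_t initial_path[SKETCH_MAX_PATH_HOPS];
	uint8_t return_path[SKETCH_MAX_PATH_HOPS];
};

/* Mirrors the two "intermediate hop" conditions (C14-9:2 and C14-13:2). */
static int is_intermediate_hop(const struct dr_smp_sketch *smp)
{
	if (!smp->returning)
		return smp->hop_ptr && smp->hop_ptr < smp->hop_cnt;
	return 2 <= smp->hop_ptr && smp->hop_ptr <= smp->hop_cnt;
}

/*
 * Port to forward on: next entry of initial_path for an outgoing SMP,
 * previous entry of return_path for a returning one.  Only meaningful
 * when is_intermediate_hop() is true.
 */
static int fwd_port(const struct dr_smp_sketch *smp)
{
	return !smp->returning ? smp->initial_path[smp->hop_ptr + 1]
			       : smp->return_path[smp->hop_ptr - 1];
}
```

For example, an outgoing SMP with hop_ptr = 1 and hop_cnt = 3 is at an
intermediate hop and leaves through the port stored in initial_path[2].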


From rdreier at cisco.com  Mon May 14 14:24:53 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 14:24:53 -0700
Subject: [ofa-general] IB/core: Enhance SMI for switch support
In-Reply-To: <1179177711.4531.10290.camel@hal.voltaire.com> (Hal Rosenstock's
	message of "14 May 2007 17:21:52 -0400")
References: <1179177711.4531.10290.camel@hal.voltaire.com>
Message-ID: <adazm47t7x6.fsf@cisco.com>

Sorry, I lost this one in my queue.

However when I was thinking about applying it, I couldn't help but
wonder whether it was a good idea or not.  Is there any prospect of an
in-tree driver that would use the code?  Current Mellanox switches do
intermediate hop SMI handling in firmware, so a Mellanox switch driver
wouldn't use this code.  And I'm not sure we can justify this change
(which after all carries some risk) just to make it easier for
baymicrosystems's proprietary driver.

 - R.


From sashak at voltaire.com  Mon May 14 14:36:24 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 15 May 2007 00:36:24 +0300
Subject: [ofa-general] suggested patch for partition membership
	definitiion in osm-partitions.conf (fix)
In-Reply-To: <46487FBF.7020300@cea.fr>
References: <46487FBF.7020300@cea.fr>
Message-ID: <20070514213624.GS29746@sashak.voltaire.com>

Hi Philippe,

On 17:26 Mon 14 May     , Philippe Gregoire wrote:
> 
> Here is a patch to allow a more compact partition membership definition.
> It allows definition of a default
> membership partition for the port guid list. The old syntax is still usable.
> old way
> G1 = 0x01 :  0x123=full, 0x124=full, 0x125=full, 0x126=full, 0x127=full ;
> G1 = 0x01 :  0x128=full, 0x129=full, 0x567, 0x569=full
> 
> new way :
> G1 = 0x01 , defmember=full : 0x123, 0x124, 0x125, 0x126, 0x127 ;
> G1 = 0x01 , defmember=full :  0x128, 0x129, 0x567=limited, 0x569

I think this can be useful. Minor comment below.

> --- opensm/osm_prtn_config.old.c	2007-04-18 11:54:29.000000000 +0200
> +++ opensm/osm_prtn_config.c	2007-05-14 17:14:42.228813361 +0200
> @@ -70,6 +70,7 @@
>  	osm_subn_t *p_subn;
>  	osm_prtn_t *p_prtn;
>  	unsigned    is_ipoib, mtu, rate, sl, scope;
> +	boolean_t   full;
>  };
>  
>  extern osm_prtn_t *osm_prtn_make_new(osm_log_t *p_log, osm_subn_t *p_subn,
> @@ -163,6 +164,14 @@
>  				" - skipped\n", lineno);
>  		else
>  			conf->sl = sl;
> +	} else if (!strncmp(flag, "defmember", len)) {
> +		if (!val || (strcmp(val, "limited") && strcmp(val, "full")))

With strncmp(val, "limited"/"full", strlen(val)) user will be able to use
"limi" and "fu" (or shorter :)) substrings.

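A minimal sketch of the prefix matching suggested above, assuming a
standalone helper (parse_defmember() is a hypothetical name, not a function
in osm_prtn_config.c). An empty value is rejected explicitly, because
strncmp() with a length of 0 would otherwise match both keywords.

```c
#include <string.h>

/*
 * Hypothetical helper: returns 1 for "full", 0 for "limited" and -1 for
 * anything else. Any non-empty prefix of a keyword is accepted.
 */
static int parse_defmember(const char *val)
{
	size_t len;

	if (!val || (len = strlen(val)) == 0)
		return -1;
	if (!strncmp(val, "limited", len))
		return 0;
	if (!strncmp(val, "full", len))
		return 1;
	return -1;
}
```

Values longer than a keyword, such as "fullx", are still rejected: the
comparison reaches the keyword's terminating NUL and fails there.
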
> +			osm_log(conf->p_log, OSM_LOG_VERBOSE,
> +				"PARSE WARN: line %d: "
> +				"flag \'defmember\' requires valid value (limited or full)"
> +				" - skipped\n", lineno);
> +		else
> +			conf->full = strcmp(val, "full") == 0;
>  	} else {
>  			osm_log(conf->p_log, OSM_LOG_VERBOSE,
>  					  "PARSE WARN: line %d: "

Sasha


From halr at voltaire.com  Mon May 14 14:32:52 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 17:32:52 -0400
Subject: [ofa-general] IB/core: Enhance SMI for switch support
In-Reply-To: <adazm47t7x6.fsf@cisco.com>
References: <1179177711.4531.10290.camel@hal.voltaire.com>
	<adazm47t7x6.fsf@cisco.com>
Message-ID: <1179178372.4531.10975.camel@hal.voltaire.com>

On Mon, 2007-05-14 at 17:24, Roland Dreier wrote:
> Sorry, I lost this one in my queue.
> 
> However when I was thinking about applying it, I couldn't help but
> wonder whether it was a good idea or not.  Is there any prospect of an
> in-tree driver that would use the code?

I'm not sure; I've heard rumors of other OpenIB based switches.

> Current Mellanox switches do
> intermediate hop SMI handling in firmware, so a Mellanox switch driver
> wouldn't use this code.  And I'm not sure we can justify this change
> (which after all carries some risk)

The risk is primarily on the switch side, rather than the CA/router
side, right ? So isn't the downside of this minimal ?

-- Hal

> just to make it easier for baymicrosystems's proprietary driver.
> 
>  - R.


From mst at dev.mellanox.co.il  Mon May 14 14:50:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 May 2007 00:50:30 +0300
Subject: [ofa-general] Re: [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <ada8xbrumzn.fsf@cisco.com>
References: <20070514045832.GA18615@mellanox.co.il> <ada8xbrumzn.fsf@cisco.com>
Message-ID: <20070514215030.GE12462@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
> 
> thanks...
> 
>  > Michael S. Tsirkin (3):
>  >       IB/mthca: fix posting >255 recv WRs for Tavor
>  >       ipoib/cm: optimize stale connection detection
> 
> I applied this one.
> 
>  >       IB/mthca: fix RESET to ERROR transition
> 
> I will read this over more carefully -- it seems to be a rather big
> patch that adds constification in various places etc.

Thanks.

Some explanations:
The only reason for const change is because there's a table of attribute structs that's
always the same so I decided it's nice to have it a global const,
and the change rippled over.

The rest is just splitting up the modify command so that
on RESET->ERROR I can perform 2 commands without code duplication.

>  > Yosef Etigin (2):
>  >       IB/core: add helpers for uncached gid/pkey queries
>  >       IB/ipoib: handle pkey re-shuffling
> 
> I need to catch up on the discussion that I did not read while I was
> traveling last week, so I'll hold these two as well.

Here's a summary:

Last time we all agreed that long term we want to get rid of ib_cache, which
will solve all kinds of coherency issues.  So the ipoib patch is a minimal step
in that direction wrt pkey, fixing the bug Voltaire is seeing with their
partitioning setup.

The core patch just adds helpers for this bit, but since query_port
can't be cached by the provider (it gives e.g. physical port state),
it seemed worthwhile to query table lengths at startup only,
rather than have each ib_find_pkey call re-do this.

Yosef also has more patches cooking to remove the rest of the cache
usage and speed up query_gid/query_pkey in providers, and also
clean the pkey polling thread in ipoib, but it seemed like a
good idea to have the bugfix out first.

-- 
MST


From pradeeps at linux.vnet.ibm.com  Mon May 14 18:21:32 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Mon, 14 May 2007 18:21:32 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <20070512200635.GB5908@mellanox.co.il>
References: <4641E99B.10706@linux.vnet.ibm.com>
	<46438DF2.3080601@linux.vnet.ibm.com>
	<20070511133639.GD30092@mellanox.co.il>
	<4644C1D2.6040103@linux.vnet.ibm.com>
	<20070512200635.GB5908@mellanox.co.il>
Message-ID: <46490B1C.20808@linux.vnet.ibm.com>

Michael S. Tsirkin wrote:
>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>> Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
>>
>> Michael S. Tsirkin wrote:
>>>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>>>> Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
>>>>
>>>> If there are no other issues than the small restructure suggestion that
>>>> Michael had, can this patch be merged into the for-2.6.22 tree?
>>> I'm not sure.
>>>
>>> I haven't the time, at the moment, to go over the patch again in depth.
>>> Have the issues from this message been addressed?
>>>
>>> http://www.mail-archive.com/general at lists.openfabrics.org/msg02056.html
>>>
>>> Just a quick review, it seems that two most important issues have
>>> apparently not been addressed yet:
>>>
>>> 1. Testing device SRQ capability twice on each RX packet is just too ugly,
>>>   and it *should* be possible to structure the code
>>>   by separating common functionality in separate
>>>   functions instead of scattering if (srq) tests around.
>> I have restructured the code as suggested. In the latest code, there are
>> only two places where SRQ capability is tested upon receipt of a packet:
>> a) ipoib_cm_handle_rx_wc()
>> b)ipoib_cm_post_receive()
>>
>> Instead of the suggested change to ipoib_cm_handle_rx_packet() it is
>> possible to change ipoib_cm_post_receive() and call the  srq and nosrq
>> versions directly, without mangling the code. However, I do not believe 
>> that this should stop the code from being merged. This can
>> be handled as a separate patch.
> 
> I actually suggested implementing separate poll routines
> for srq and non-srq code. This way we won't have *any* if(srq)
> tests on datapath.

Right, I remember you suggested that. From a maintainability perspective
I use as much common code as possible. Therefore I did not implement
separate polling routines as suggested. So, it boils down to one if(srq)
in the data path. I really do not think that should be a point of 
contention.

> 
>>> 2. Once the number of created connections exceeds
>>>   the constant that you allow, all attempts to communicate
>>>   with this node over IP over IB will fail.
>>>   A way needs to be designed to switch to the datagram mode,
>>>   and to retry going back to connected after some time.
>>>   [We actually have this theoretical issue in SRQ
>>>    as well - it is just much more severe in the nonSRQ case].
>> Firstly, this has now been changed to send a REJ message to the remote
>> side indicating that there are no more free QPs.
> 
> Since the HCA actually has free QPs - you are actually running out of buffers that
> you are ready to prepost - one might argue about whether this is spec compliant
> behaviour.  This is something that might better be checked up with at IBTA.
> 
>> It is up to the application
>> to handle the situation.
> 
> The application here being kernel IP over IB, it currently handles the
> reject by dropping outstanding packets and retrying the connection on the next
> packet to this dst.  So the specific node might be denied connectivity
> potentially forever.

When I stated application, I did not mean IPoIB. I meant the user level
app. Yes, the app will keep on retrying to establish connection to the
specified node using Connected Mode and then subsequently time out.
See more comments below.

> 
>> Previously, this was flagged as an error that
>> appeared in /var/log/messages.
>>
>> However, here are a few other things we need to consider. Let us
>> compute the amount of memory consumed when we run into this situation:
>>
>> In CM mode we use 64K packets. Assuming, the rx_ring has 256 entries and
>> the current limitation of 1024 QPs, NOSRQ only will consume 16GB of 
>> memory. All else remaining the same if we change the rx_ring size to 
>> 1024, NOSRQ will consume 64GB of memory.
>>
>> This is huge and my guess is that on most systems, the application will 
>> run out of memory before it runs out of RC QPs (with NOSRQ).
>>
>> Aside from this I would like to understand how do we switch just the 
>> "current" QP to datagram mode; we would not want to switch all the
>> existing QPs to datagram mode -that would be unacceptable. Also, we
>> should not prevent subsequent connections using RC QPs. Is there 
>> anything in the IB spec about this?
> 
> Yes, this might need a solution at the protocol level, as you indicate above.

I thought through this some more and I do not believe that this is such
a good idea (i.e. switching to datagram mode). The app (user level) is
expecting to use RC and we silently (or even with warnings) switch to
UD mode -I do not think that is appropriate.

The app should time out or be returned an error and maybe the app can
switch to using another node that has the requested resources. The onus
is on the user level app to take appropriate action.

The equivalent situation in a non IB environment would be when the
recipient node has no more memory to respond to an arp request. The
app receives a "node unreachable" message. Therefore I am inclined
to say we should leave this as is.

> 
>> I think solving this is a fairly big issue and not just specific to
>> NOSRQ. NOSRQ is just exacerbating the situation. This can be dealt with
>> all at once with SRQ and NOSRQ, if need be.
> 
> IMO, the memory scalability issue is specific to your code.
> With current code using shared RQ, each connection needs
> an order of 1KByte of memory. So we need just 10MByte
> for a typical 10000 node cluster.
> 

Right, I have always maintained that NOSRQ is indeed a memory hog. I
think we must revisit this memory computation for the srq case too -
I would say the receive buffers consumed would be 64K (packet size) *
1000 (srq_ring_size) is 64MBytes, irrespective of the number of
nodes in the cluster. However, the question that is still
unanswered (at least in my mind) is, will 1000  buffers be sufficient
to support a 10,000 or even a 1000 node cluster. On just a 2 node
cluster (using UD) we had seen previously that a receiveq_size of 256
was inadequate. I would guess even in the SRQ case that would be true.

To support large clusters one will run into memory issues even in the
SRQ case, but it will occur much sooner in the NOSRQ case.
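
The figures in this thread can be rechecked with a few lines of C. This is
only a sketch of the arithmetic; the function names are made up for the
example, and the inputs are the numbers quoted above (64K buffers, rx_ring
of 256 or 1024 entries, 1024 QPs, an SRQ ring of 1000 entries).

```c
#include <stdint.h>

/* NOSRQ: every connection (QP) owns a private receive ring. */
static uint64_t nosrq_rx_bytes(uint64_t buf_size, uint64_t ring_entries,
			       uint64_t num_qps)
{
	return buf_size * ring_entries * num_qps;
}

/* SRQ: one receive ring is shared by all connections. */
static uint64_t srq_rx_bytes(uint64_t buf_size, uint64_t ring_entries)
{
	return buf_size * ring_entries;
}
```

With 65536-byte buffers, nosrq_rx_bytes(65536, 256, 1024) is 2^34 bytes
(16GB) and nosrq_rx_bytes(65536, 1024, 1024) is 2^36 bytes (64GB), while
srq_rx_bytes(65536, 1000) is 65,536,000 bytes, the roughly 64MBytes
mentioned above. Only the NOSRQ figure grows with the number of connections.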

>> Hence, I do not see these as impediments to the merge.
> 
> In my humble opinion, we need a handle on the scalability issue
> (other than crashing or denying service) before merging this,
> otherwise IBM will be the first to object to making connected mode the default.

I will seek the opinion from folks who use applications on large
clusters within IBM. I have always stated that NOSRQ should be used
only when there are a handful or at most a few dozen clusters. I will
try and make this well known so that this does not come as a surprise.

Pradeep


From mst at dev.mellanox.co.il  Mon May 14 23:26:46 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 May 2007 09:26:46 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <46490B1C.20808@linux.vnet.ibm.com>
References: <4641E99B.10706@linux.vnet.ibm.com>
	<46438DF2.3080601@linux.vnet.ibm.com>
	<20070511133639.GD30092@mellanox.co.il>
	<4644C1D2.6040103@linux.vnet.ibm.com>
	<20070512200635.GB5908@mellanox.co.il>
	<46490B1C.20808@linux.vnet.ibm.com>
Message-ID: <20070515062646.GD5437@mellanox.co.il>

> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
> 
> Michael S. Tsirkin wrote:
> >>Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> >>Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
> >>
> >>Michael S. Tsirkin wrote:
> >>>>Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> >>>>Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
> >>>>
> >>>>If there are no other issues than the small restructure suggestion that
> >>>>Michael had, can this patch be merged into the for-2.6.22 tree?
> >>>I'm not sure.
> >>>
> >>>I haven't the time, at the moment, to go over the patch again in depth.
> >>>Have the issues from this message been addressed?
> >>>
> >>>http://www.mail-archive.com/general at lists.openfabrics.org/msg02056.html
> >>>
> >>>Just a quick review, it seems that two most important issues have
> >>>apparently not been addressed yet:
> >>>
> >>>1. Testing device SRQ capability twice on each RX packet is just too 
> >>>ugly,
> >>>  and it *should* be possible to structure the code
> >>>  by separating common functionality in separate
> >>>  functions instead of scattering if (srq) tests around.
> >>I have restructured the code as suggested. In the latest code, there are
> >>only two places where SRQ capability is tested upon receipt of a packet:
> >>a) ipoib_cm_handle_rx_wc()
> >>b)ipoib_cm_post_receive()
> >>
> >>Instead of the suggested change to ipoib_cm_handle_rx_packet() it is
> >>possible to change ipoib_cm_post_receive() and call the  srq and nosrq
> >>versions directly, without mangling the code. However, I do not believe 
> >>that this should stop the code from being merged. This can
> >>be handled as a separate patch.
> >
> >I actually suggested implementing separate poll routines
> >for srq and non-srq code. This way we won't have *any* if(srq)
> >tests on datapath.
> 
> Right, I remember you suggested that. From a maintainability perspective
> I use as much common code as possible.

Sprinkling if (srq) all over the code is not necessarily the best way to reuse code.
Moving common code to separate functions is a better way IMO.
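
A minimal userspace sketch of that structure, with hypothetical names
throughout (struct rx_ctx, rx_setup() and the handlers are not taken from
the ipoib code): the SRQ capability is tested once at setup, the shared work
sits in one helper, and the receive path just calls through a pointer.

```c
struct rx_ctx;
typedef int (*rx_handler_t)(struct rx_ctx *ctx);

/* Hypothetical per-device receive state. */
struct rx_ctx {
	int          reposted;	/* buffers handed back, for illustration */
	rx_handler_t handle_rx;	/* chosen once at setup time */
};

/* Work shared by both variants lives in one function. */
static void rx_common(struct rx_ctx *ctx)
{
	ctx->reposted++;
}

/* Return values only identify which variant ran. */
static int rx_srq(struct rx_ctx *ctx)
{
	rx_common(ctx);
	return 1;
}

static int rx_nosrq(struct rx_ctx *ctx)
{
	rx_common(ctx);
	return 2;
}

/* The only place where the SRQ capability is tested. */
static void rx_setup(struct rx_ctx *ctx, int has_srq)
{
	ctx->reposted = 0;
	ctx->handle_rx = has_srq ? rx_srq : rx_nosrq;
}
```

After rx_setup(), the datapath calls ctx->handle_rx(ctx) and never tests for
SRQ again; whether the indirect call is cheaper than a well-predicted branch
would have to be measured.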

> Therefore I did not implement
> separate polling routines as suggested. So, it boils down to one if(srq)
> in the data path.

Which patch are we discussing? Patch V4 has 3 of these on data path.
The one in alloc_rx_skb also seems to be open-coded - bad for cache usage.

> I really do not think that should be a point of contention.

True, it's not a *major* point, scalability is still a larger issue.
But IMO fixing this would make the patch less ugly.

> >>>2. Once the number of created connections exceeds
> >>>  the constant that you allow, all attempts to communicate
> >>>  with this node over IP over IB will fail.
> >>>  A way needs to be designed to switch to the datagram mode,
> >>>  and to retry going back to connected after some time.
> >>>  [We actually have this theoretical issue in SRQ
> >>>   as well - it is just much more severe in the nonSRQ case].
> >>Firstly, this has now been changed to send a REJ message to the remote
> >>side indicating that there are no more free QPs.
> >
> >Since the HCA actually has free QPs - you are actually running out of buffers
> >that you are ready to prepost - one might argue about whether this is spec
> >compliant behaviour.  This is something that might better be checked up with
> >at IBTA.
> >
> >>It is up to the application to handle the situation.
> >
> >The application here being kernel IP over IB, it currently handles the
> >reject by dropping outstanding packets and retrying the connection on the
> >next packet to this dst.  So the specific node might be denied connectivity
> >potentially forever.
> 
> When I stated application, I did not mean IPoIB. I meant the user level
> app. Yes, the app will keep on retrying to establish connection to the
> specified node using Connected Mode and then subsequently time out.

So, how would an application handle the situation?

> See more comments below.
> 
> >
> >>Previously, this was flagged as an error that
> >>appeared in /var/log/messages.
> >>
> >>However, here are a few other things we need to consider. Let us
> >>compute the amount of memory consumed when we run into this situation:
> >>
> >>In CM mode we use 64K packets. Assuming, the rx_ring has 256 entries and
> >>the current limitation of 1024 QPs, NOSRQ only will consume 16GB of 
> >>memory. All else remaining the same if we change the rx_ring size to 
> >>1024, NOSRQ will consume 64GB of memory.
> >>
> >>This is huge and my guess is that on most systems, the application will 
> >>run out of memory before it runs out of RC QPs (with NOSRQ).
> >>
> >>Aside from this I would like to understand how do we switch just the 
> >>"current" QP to datagram mode; we would not want to switch all the
> >>existing QPs to datagram mode -that would be unacceptable. Also, we
> >>should not prevent subsequent connections using RC QPs. Is there 
> >>anything in the IB spec about this?
> >
> >Yes, this might need a solution at the protocol level, as you indicate 
> >above.
> 
> I thought through this some more and I do not believe that this is such
> a good idea (i.e. switching to datagram mode). The app (user level) is
> expecting to use RC and we silently (or even with warnings) switch to
> UD mode -I do not think that is appropriate.

Which app is this?

> The app should time out or be returned an error and maybe the app can
> switch to using another node that has the requested resources. The onus
> is on the user level app to take appropriate action.

Most applications can't do this however. So your patch will break them.

> The equivalent situation in a non IB environment would be when the
> recipient node has no more memory to respond to an arp request. The
> app receives a "node unreachable" message.

I think you mean that on a TCP socket, connect will return ENETUNREACH, rather
than a message?  But since ARP is normally using multicast, if the remote won't
accept connections, this is *not* what will happen here, is it?

> Therefore I am inclined to say we should leave this as is.

The main difference is that on a LAN, arp timeouts don't really occur
too often in practice - they are sufficiently rare that lots of
applications regard TCP errors as a "node is down" indication.

> >
> >>I think solving this is a fairly big issue and not just specific to
> >>NOSRQ. NOSRQ is just exacerbating the situation. This can be dealt with
> >>all at once with SRQ and NOSRQ, if need be.
> >
> >IMO, the memory scalability issue is specific to your code.
> >With current code using shared RQ, each connection needs
> >an order of 1KByte of memory. So we need just 10MByte
> >for a typical 10000 node cluster.
> >
> 
> Right, I have always maintained that NOSRQ is indeed a memory hog. I
> think we must revisit this memory computation for the srq case too -
> I would say the receive buffers consumed would be 64K (packet size) *
> 1000 (srq_ring_size) is 64MBytes, irrespective of the number of
> nodes in the cluster. However, the question that is still
> unanswered (at least in my mind) is, will 1000  buffers be sufficient
> to support a 10,000 or even a 1000 node cluster. On just a 2 node
> cluster (using UD) we had seen previously that a receiveq_size of 256
> was inadequate.

You should distinguish between occasional packet drops due to RQ overrun,
which happens all the time on the internet, so protocols are built to handle
it, and dropping *all* packets to a specific destination, which is
a quality of implementation issue.

> I would guess even in the SRQ case that would be true.

Less likely, since each buffer is 32 times larger now.  Further, with SRQ we can
auto-tune the buffer size by using watermark events. Stay tuned.

> To support large clusters one will run into memory issues even in the
> SRQ case,

I don't really think IPoIB with SRQ will run into memory issues even with
large clusters.

> but it will occur much sooner in the NOSRQ case.
>
> >>Hence, I do not see these as impediments to the merge.
> >
> >In my humble opinion, we need a handle on the scalability issue
> >(other than crashing or denying service) before merging this,
> >otherwise IBM will be the first to object to making connected mode the 
> >default.
> 
> I will seek the opinion from folks who use applications on large
> clusters within IBM. I have always stated that NOSRQ should be used
> only when there are a handful or at most a few dozen clusters. I will
> try and make this well known so that this does not come as a surprise.

One of my targets is to make connected mode the default, eventually.
My concern is that if enabling connected mode breaks applications,
as your patch does, people will be afraid to turn it on.

-- 
MST


From ianjiang.ict at gmail.com  Tue May 15 00:12:39 2007
From: ianjiang.ict at gmail.com (Ian Jiang)
Date: Tue, 15 May 2007 15:12:39 +0800
Subject: [ofa-general] [SRPT]multiple initiators supported?
In-Reply-To: <4648948F.5000802@mellanox.com>
References: <7b2fa1820705120247t1b232345w8373bb72416c5b28@mail.gmail.com>
	<4648948F.5000802@mellanox.com>
Message-ID: <7b2fa1820705150012l743b817fn1eefeaca290789a@mail.gmail.com>

Hi Vu,
Thanks for your reply. But something goes wrong when I use two
initiators.

Two initiators and one target are at three separated nodes.  The first
initiator connected to the target correctly. However, the second one
was aborted 1 minute after its login, and then required to
*reset_host*, but it failed to send the CM Connect Request when trying
to reconnect to the target.

Here are the logs of the second initiator:

May 15 13:58:59 cluster5 kernel: ib_srp: new target: id_ext
0002c90200206bd8 ioc_guid 0002c90200206bd8 pkey ffff service_id
0002c90200206bd8 dgid fe80:0000:0000:0000:0002:c902:0020:6bd9
May 15 13:58:59 cluster5 kernel: scsi2 : SRP.T10:0002C90200206BD8
May 15 13:58:59 cluster5 kernel:   Vendor: SCST_FIO  Model: fdisk_128M
       Rev:  095
May 15 13:58:59 cluster5 kernel:   Type:   Direct-Access
       ANSI SCSI revision: 04
May 15 13:58:59 cluster5 kernel: SCSI device sdb: 262144 512-byte hdwr
sectors (134 MB)
May 15 13:58:59 cluster5 kernel: sdb: Write Protect is off
May 15 13:58:59 cluster5 kernel: sdb: Mode Sense: 6b 00 10 08
May 15 13:58:59 cluster5 kernel: SCSI device sdb: drive cache: write back w/ FUA
May 15 13:58:59 cluster5 kernel: SCSI device sdb: 262144 512-byte hdwr
sectors (134 MB)
May 15 13:58:59 cluster5 kernel: sdb: Write Protect is off
May 15 13:58:59 cluster5 kernel: sdb: Mode Sense: 6b 00 10 08
May 15 13:58:59 cluster5 kernel: SCSI device sdb: drive cache: write back w/ FUA
May 15 13:58:59 cluster5 kernel:  sdb: unknown partition table
May 15 13:58:59 cluster5 kernel: sd 2:0:0:0: Attached scsi disk sdb
May 15 13:58:59 cluster5 kernel: sd 2:0:0:0: Attached scsi generic sg1 type 0
May 15 13:59:59 cluster5 kernel: SRP abort called
May 15 13:59:59 cluster5 kernel: SRP reset_device called
May 15 14:00:29 cluster5 kernel: SRP abort called
May 15 14:00:34 cluster5 kernel: ib_srp: SRP reset_host called
May 15 14:00:36 cluster5 kernel: ib_srp: connection closed
May 15 14:02:15 cluster5 kernel: ib_srp: Sending CM REQ failed
May 15 14:02:15 cluster5 kernel: ib_srp: reconnect failed (-104),
removing target port.
May 15 14:02:15 cluster5 kernel: sd 2:0:0:0: scsi: Device offlined -
not ready after error recovery
May 15 14:02:15 cluster5 kernel: sd 2:0:0:0: rejecting I/O to offline device
May 15 14:02:15 cluster5 kernel: Buffer I/O error on device sdb,
logical block 32760
May 15 14:02:15 cluster5 kernel:  2:0:0:0: rejecting I/O to dead device
May 15 14:02:15 cluster5 kernel: Buffer I/O error on device sdb,
logical block 32760

Here are the logs of the target during the second initiator's
connection.  It seemed that it did not receive the reconnect request.

May 15 13:57:27 cluster4 kernel: ib_srpt: Host
i_port_id=0x100000000000000:0xcc6b200002c90200 login with
t_port_id=0xd86b200002c90200:0xd86b200002c90200 it_iu_len=260
May 15 13:57:27 cluster4 kernel: ib_srpt: srpt_create_ch_ib[1105]
max_cqe= 255 max_sge= 29 cm_id= da9b7200
May 15 13:57:27 cluster4 kernel: [3966]: scst_init_session:scst: Name
0x00000000000000010002c90200206bcc not found, using default group
May 15 13:57:27 cluster4 kernel: [3966]:
scst_alloc_add_tgt_dev:Virtual device SCST lun=0
May 15 13:57:27 cluster4 kernel: [3966]: tm_dbg_init_tgt_dev:LUN 0
connected from initiator ib_srpt is under TM debugging
May 15 13:57:27 cluster4 kernel: ib_srpt: Establish connection sess=
c9a677a8 name= 0x00000000000000010002c90200206bcc cm_id= da9b7200
May 15 13:57:27 cluster4 kernel: [3964]: scst_set_pending_UA:Setting
pending UA cmd dabb3ec0
May 15 13:57:27 cluster4 kernel: [3964]:
tm_dbg_delay_cmd:tm_dbg_delay_cmd: delaying timed cmd dabb3ec0 (tag
35) for 60.96 seconds (15241 HZ)

May 15 13:58:27 cluster4 kernel: ib_srpt: recv_tsk_mgmt= 1 for
task_tag= 35 using tag= 163 cm_id= da9b7200 sess= c9a677a8
May 15 13:58:27 cluster4 kernel: [0]: scst_rx_mgmt_fn_tag:sess=c9a677a8, tag=35
May 15 13:58:27 cluster4 kernel: [0]: scst_post_rx_mgmt_cmd:Adding
mgmt cmd c70486a0 to active mgmt cmd list
May 15 13:58:27 cluster4 kernel: [3965]: scst_mgmt_cmd_thread:Moving
mgmt cmd c70486a0 to mgmt cmd list
May 15 13:58:27 cluster4 kernel: [3965]: scst_mgmt_cmd_init:Cmd
dabb3ec0 for tag 35 (sn 35) found, aborting it
May 15 13:58:27 cluster4 kernel: [3965]: scst_abort_cmd:Aborting cmd
dabb3ec0 (tag 35)
May 15 13:58:27 cluster4 kernel: [3965]:
scst_call_dev_task_mgmt_fn:Calling dev handler disk_fileio
task_mgmt_fn(fn=0)
May 15 13:58:27 cluster4 kernel: [3965]:
scst_call_dev_task_mgmt_fn:Dev handler disk_fileio task_mgmt_fn()
returned 0
May 15 13:58:27 cluster4 kernel: [3965]: tm_dbg_release_cmd:Abort
request for delayed cmd dabb3ec0 (tag=35), moving it to active cmd
list (delayed_cmds_count=1)
May 15 13:58:27 cluster4 kernel: ib_srpt: srpt_tsk_mgmt_done[1972]
tsk_mgmt_done for tag= 163 status=0
May 15 13:58:27 cluster4 kernel: [3965]: scst_mgmt_cmd_send_done:Dev
handler ib_srpt task_mgmt_fn_done() returned
May 15 13:58:27 cluster4 kernel: ib_srpt: recv_tsk_mgmt= 8 for
task_tag= 35 using tag= 163 cm_id= da9b7200 sess= c9a677a8
May 15 13:58:27 cluster4 kernel: [3964]: tm_dbg_check_cmd:Processing
delayed cmd dabb3ec0 (tag 35), delayed_cmds_count=1
May 15 13:58:27 cluster4 kernel: [3964]: tm_dbg_change_state:Deleting timer
May 15 13:58:27 cluster4 kernel: ib_srpt: srpt_xmit_response[1898]
tag= 35 already get aborted
May 15 13:58:57 cluster4 kernel: ib_srpt: recv_tsk_mgmt= 1 for
task_tag= 36 using tag= 164 cm_id= da9b7200 sess= c9a677a8
May 15 13:58:57 cluster4 kernel: [0]: scst_rx_mgmt_fn_tag:sess=c9a677a8, tag=36
May 15 13:58:57 cluster4 kernel: [0]: scst_post_rx_mgmt_cmd:Adding
mgmt cmd c7048240 to active mgmt cmd list
May 15 13:58:57 cluster4 kernel: [3965]: scst_mgmt_cmd_thread:Moving
mgmt cmd c7048240 to mgmt cmd list
May 15 13:58:57 cluster4 kernel: [3965]: scst_mgmt_cmd_init:Cmd
dabb3050 for tag 36 (sn 36) found, aborting it
May 15 13:58:57 cluster4 kernel: [3965]: scst_abort_cmd:Aborting cmd
dabb3050 (tag 36)
May 15 13:58:57 cluster4 kernel: [3965]:
scst_call_dev_task_mgmt_fn:Calling dev handler disk_fileio
task_mgmt_fn(fn=0)
May 15 13:58:57 cluster4 kernel: [3965]:
scst_call_dev_task_mgmt_fn:Dev handler disk_fileio task_mgmt_fn()
returned 0
May 15 13:58:57 cluster4 kernel: [3965]: scst_abort_cmd:cmd dabb3050
(tag 36) being executed/xmitted (state 12), deferring ABORT...
May 15 13:58:57 cluster4 kernel: [3965]:
scst_set_mcmd_next_state:cmd_wait_count(1) not 0, preparing to wait
May 15 13:59:02 cluster4 kernel: ib_srpt: srpt_cm_dreq_recv[1523]
cm_id= da9b7200 ch->state= 1
May 15 13:59:04 cluster4 kernel: ib_srpt: srpt_cm_timewait_exit[1502]
cm_id= da9b7200
May 15 13:59:04 cluster4 kernel: ib_srpt: srpt_release_channel[1154]
Release channel cm_id= da9b7200
May 15 13:59:04 cluster4 kernel: ib_srpt: srpt_release_channel[1159]
Release sess= c9a677a8 sess_name= 0x00000000000000010002c90200206bcc
May 15 13:59:21 cluster4 kernel: ib_srpt: failed send status= 12
May 15 13:59:21 cluster4 kernel: [0]: scst_complete_cmd_mgmt:cmd
dabb3050 completed (tag 36, mcmd c7048240, mcmd->cmd_wait_count 1)
May 15 13:59:21 cluster4 kernel: [0]: scst_complete_cmd_mgmt:Moving
mgmt cmd c7048240 to active mgmt cmd list
May 15 13:59:21 cluster4 kernel: [3965]: scst_mgmt_cmd_thread:Moving
mgmt cmd c7048240 to mgmt cmd list
May 15 13:59:21 cluster4 kernel: ib_srpt: srpt_tsk_mgmt_done[1972]
tsk_mgmt_done for tag= 164 status=-1
May 15 13:59:21 cluster4 kernel: [3965]: scst_mgmt_cmd_send_done:Dev
handler ib_srpt task_mgmt_fn_done() returned
May 15 13:59:21 cluster4 kernel: ib_srpt:
srpt_unregister_session_done[1143]  sess= c9a677a8
May 15 13:59:21 cluster4 kernel: [3966]: scst_free_all_UA:Clearing UA
for tgt_dev lun 0
May 15 13:59:21 cluster4 kernel: ib_srpt: failed send status= 5
May 15 13:59:21 cluster4 kernel: ib_srpt: QP event 16 on cm_id=
da9b7200 sess_name= 0x00000000000000010002c90200206bcc state= 2


I have no idea why *abort* was called on the second initiator.
Could you please give some suggestions? Thanks a lot!


On 5/15/07, Vu Pham <vuhuong at mellanox.com> wrote:
> Ian Jiang wrote:
> > Does the SRP target support multiple initiators?
>
> Yes, it does.
>
>
> > I am using the SRP initiator and IB drivers in linux-2.6.20.
> > The SRP target is at
> > http://www.openfabrics.org/git/?p=~vu/srpt.git;a=summary
> > and the IB driver is OFED-1.1 with linux-2.6.16.13-4-default of Suse-10.1.
> >


-- 
Ian Jiang


From vlad at lists.openfabrics.org  Tue May 15 02:29:54 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue, 15 May 2007 02:29:54 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070515-0200 daily build status
Message-ID: <20070515092955.65D46E60821@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.15
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.15

Failed:


From bs at q-leap.de  Tue May 15 02:48:50 2007
From: bs at q-leap.de (Bernd Schubert)
Date: Tue, 15 May 2007 11:48:50 +0200
Subject: [ofa-general] possible irq lock inversion dependency detected
Message-ID: <200705151148.50607.bs@q-leap.de>

Hi,

with 2.6.20 I get this message:


[263206.999448] =========================================================
[263207.007607] [ INFO: possible irq lock inversion dependency detected ]
[263207.014153] 2.6.20.3-debug #9
[263207.017230] ---------------------------------------------------------
[263207.023775] ipoib/6662 just changed the state of lock:
[263207.029020]  (&idev->n_mcast_grps_lock){-...}, at: [_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath]
[263207.039866] but this lock was taken by another, hard-irq-safe lock in the past:
[263207.047294]  (mcast_lock){++..}
[263207.050380] 
[263207.050381] and interrupts could create inverse lock ordering between them.
[263207.050382] 
[263207.060846] 
[263207.060847] other info that might help us debug this:
[263207.067609] 1 lock held by ipoib/6662:
[263207.071468]  #0:  (&priv->mcast_mutex){--..}, at: [mutex_lock+35/39] mutex_lock+0x23/0x27
[263207.080061] 
[263207.080062] the first lock's dependencies:
[263207.085862] -> (&idev->n_mcast_grps_lock){-...} ops: 3 {
[263207.091371]    initial-use  at:
[263207.094647] 					[mark_lock+135/1127] mark_lock+0x87/0x467
[263207.104352] 					[__lock_acquire+1476/3168] __lock_acquire+0x5c4/0xc60
[263207.114571] 					[_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath]
[263207.126446] 					[lock_acquire+124/160] lock_acquire+0x7c/0xa0
[263207.136317] 					[_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath]
[263207.148192] 					[_spin_lock+34/46] _spin_lock+0x22/0x2e
[263207.157891] 					[_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath]
[263207.169766] 					[_end+122587907/2124917936] ib_attach_mcast+0x2f/0x31 [ib_core]
[263207.187046] 					[_end+123887919/2124917936] ipoib_mcast_attach+0xe3/0x123 [ib_ipoib]
[263207.198481] 					[_end+123880632/2124917936] ipoib_mcast_join_finish+0x125/0x3fa [ib_ipoib]
[263207.210441] 					[check_usage+53/661] check_usage+0x35/0x295
[263207.220328] 					[lock_timer_base+35/72] lock_timer_base+0x23/0x48
[263207.230454] 					[__lock_acquire+3080/3168] __lock_acquire+0xc08/0xc60
[263207.240674] 					[_end+123881884/2124917936] ipoib_mcast_join_complete+0xcb/0x31a [ib_ipoib]
[263207.252720] 					[_read_unlock_irqrestore+56/71] _read_unlock_irqrestore+0x38/0x47
[263207.263544] 					[_end+122585085/2124917936] ib_unpack+0xad/0x11c [ib_core]
[263207.274117] 					[_end+123844451/2124917936] ib_sa_mcmember_rec_callback+0x4c/0x57 [ib_sa]
[263207.285983] 					[_end+123844738/2124917936] recv_handler+0x3f/0x4b [ib_sa]
[263207.296547] 					[_end+123718406/2124917936] ib_mad_completion_handler+0x3c7/0x59b [ib_mad]
[263207.308511] 					[run_workqueue+134/380] run_workqueue+0x86/0x17c
[263207.318552] 					[_end+123717439/2124917936] ib_mad_completion_handler+0x0/0x59b [ib_mad]
[263207.330333] 					[run_workqueue+177/380] run_workqueue+0xb1/0x17c
[263207.340377] 					[worker_thread+0/349] worker_thread+0x0/0x15d
[263207.350336] 					[keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a
[263207.360983] 					[worker_thread+294/349] worker_thread+0x126/0x15d
[263207.371113] 					[default_wake_function+0/15] default_wake_function+0x0/0xf
[263207.381584] 					[default_wake_function+0/15] default_wake_function+0x0/0xf
[263207.392063] 					[worker_thread+0/349] worker_thread+0x0/0x15d
[263207.402020] 					[kthread+208/252] kthread+0xd0/0xfc
[263207.411459] 					[child_rip+10/18] child_rip+0xa/0x12
[263207.420985] 					[_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d
[263207.431204] 					[restore_args+0/48] restore_args+0x0/0x30
[263207.440998] 					[kthread+0/252] kthread+0x0/0xfc
[263207.450344] 					[child_rip+0/18] child_rip+0x0/0x12
[263207.459862] 					[<ffffffffffffffff>] 0xffffffffffffffff
[263207.469386]    hardirq-on-W at:
[263207.472661] 					[mark_lock+135/1127] mark_lock+0x87/0x467
[263207.482357] 					[__lock_acquire+1412/3168] __lock_acquire+0x584/0xc60
[263207.492567] 					[kfree+525/541] kfree+0x20d/0x21d
[263207.501996] 					[_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath]
[263207.513865] 					[lock_acquire+124/160] lock_acquire+0x7c/0xa0
[263207.523735] 					[_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath]
[263207.535601] 					[_spin_lock+34/46] _spin_lock+0x22/0x2e
[263207.545298] 					[_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath]
[263207.557167] 					[debug_mutex_free_waiter+88/92] debug_mutex_free_waiter+0x58/0x5c
[263207.567990] 					[__mutex_lock_slowpath+624/637] __mutex_lock_slowpath+0x270/0x27d
[263207.578814] 					[wait_for_completion+189/198] wait_for_completion+0xbd/0xc6
[263207.589292] 					[_end+122587956/2124917936] ib_detach_mcast+0x2f/0x33 [ib_core]
[263207.600298] 					[_end+123888044/2124917936] ipoib_mcast_detach+0x3d/0x6e [ib_ipoib]
[263207.611650] 					[_end+123884728/2124917936] ipoib_mcast_leave+0x12d/0x1c8 [ib_ipoib]
[263207.623091] 					[_end+123886246/2124917936] ipoib_mcast_dev_flush+0x100/0x14e [ib_ipoib]
[263207.634877] 					[_end+123886283/2124917936] ipoib_mcast_dev_flush+0x125/0x14e [ib_ipoib]
[263207.646661] 					[_end+123878948/2124917936] ipoib_ib_dev_flush+0x0/0x11f [ib_ipoib]
[263207.658013] 					[_end+123877784/2124917936] ipoib_ib_dev_down+0xa8/0xb7 [ib_ipoib]
[263207.669300] 					[_end+123879087/2124917936] ipoib_ib_dev_flush+0x8b/0x11f [ib_ipoib]
[263207.680739] 					[run_workqueue+177/380] run_workqueue+0xb1/0x17c
[263207.690781] 					[worker_thread+0/349] worker_thread+0x0/0x15d
[263207.700732] 					[keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a
[263207.711393] 					[worker_thread+294/349] worker_thread+0x126/0x15d
[263207.721523] 					[default_wake_function+0/15] default_wake_function+0x0/0xf
[263207.732000] 					[default_wake_function+0/15] default_wake_function+0x0/0xf
[263207.742478] 					[worker_thread+0/349] worker_thread+0x0/0x15d
[263207.752436] 					[kthread+208/252] kthread+0xd0/0xfc
[263207.761874] 					[child_rip+10/18] child_rip+0xa/0x12
[263207.771393] 					[_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d
[263207.781603] 					[restore_args+0/48] restore_args+0x0/0x30
[263207.791388] 					[kthread+0/252] kthread+0x0/0xfc
[263207.800730] 					[child_rip+0/18] child_rip+0x0/0x12
[263207.810251] 					[<ffffffffffffffff>] 0xffffffffffffffff
[263207.819776]  }
[263207.821548]  ... key      at: [_end+122893256/2124917936] __key.5+0x0/0xfffffffffffe369f [ib_ipath]
[263207.830154] 
[263207.830155] the second lock's dependencies:
[263207.836049] -> (mcast_lock){++..} ops: 15329 {
[263207.840692]    initial-use  at:
[263207.843966] 					[<ffffffffffffffff>] 0xffffffffffffffff
[263207.853492]    in-hardirq-W at:
[263207.856757] 					[<ffffffffffffffff>] 0xffffffffffffffff
[263207.866283]    in-softirq-W at:
[263207.869550] 					[<ffffffffffffffff>] 0xffffffffffffffff
[263207.879071]  }
[263207.880843]  ... key      at: [_end+122888936/2124917936] mcast_lock+0x18/0xfffffffffffe4797 [ib_ipath]
[263207.889798]  -> (&idev->n_mcast_grps_lock){-...} ops: 3 {
[263207.895398]     initial-use  at:
[263207.898758] 					  [mark_lock+135/1127] mark_lock+0x87/0x467
[263207.908679] 					  [__lock_acquire+1476/3168] __lock_acquire+0x5c4/0xc60
[263207.919118] 					  [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath]
[263207.931220] 					  [lock_acquire+124/160] lock_acquire+0x7c/0xa0
[263207.941313] 					  [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath]
[263207.953414] 					  [_spin_lock+34/46] _spin_lock+0x22/0x2e
[263207.963335] 					  [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath]
[263207.975429] 					  [_end+122587907/2124917936] ib_attach_mcast+0x2f/0x31 [ib_core]
[263207.986661] 					  [_end+123887919/2124917936] ipoib_mcast_attach+0xe3/0x123 [ib_ipoib]
[263207.998334] 					  [_end+123880632/2124917936] ipoib_mcast_join_finish+0x125/0x3fa [ib_ipoib]
[263208.010527] 					  [check_usage+53/661] check_usage+0x35/0x295
[263208.020622] 					  [lock_timer_base+35/72] lock_timer_base+0x23/0x48
[263208.030971] 					  [__lock_acquire+3080/3168] __lock_acquire+0xc08/0xc60
[263208.041409] 					  [_end+123881884/2124917936] ipoib_mcast_join_complete+0xcb/0x31a [ib_ipoib]
[263208.053680] 					  [_read_unlock_irqrestore+56/71] _read_unlock_irqrestore+0x38/0x47
[263208.064729] 					  [_end+122585085/2124917936] ib_unpack+0xad/0x11c [ib_core]
[263208.075526] 					  [_end+123844451/2124917936] ib_sa_mcmember_rec_callback+0x4c/0x57 [ib_sa]
[263208.087611] 					  [_end+123844738/2124917936] recv_handler+0x3f/0x4b [ib_sa]
[263208.098399] 					  [_end+123718406/2124917936] ib_mad_completion_handler+0x3c7/0x59b [ib_mad]
[263208.110585] 					  [run_workqueue+134/380] run_workqueue+0x86/0x17c
[263208.127122] 					  [_end+123717439/2124917936] ib_mad_completion_handler+0x0/0x59b [ib_mad]
[263208.139119] 					  [run_workqueue+177/380] run_workqueue+0xb1/0x17c
[263208.149389] 					  [worker_thread+0/349] worker_thread+0x0/0x15d
[263208.159571] 					  [keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a
[263208.170456] 					  [worker_thread+294/349] worker_thread+0x126/0x15d
[263208.180814] 					  [default_wake_function+0/15] default_wake_function+0x0/0xf
[263208.191516] 					  [default_wake_function+0/15] default_wake_function+0x0/0xf
[263208.202231] 					  [worker_thread+0/349] worker_thread+0x0/0x15d
[263208.212414] 					  [kthread+208/252] kthread+0xd0/0xfc
[263208.222076] 					  [child_rip+10/18] child_rip+0xa/0x12
[263208.231818] 					  [_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d
[263208.242254] 					  [restore_args+0/48] restore_args+0x0/0x30
[263208.252263] 					  [kthread+0/252] kthread+0x0/0xfc
[263208.261833] 					  [child_rip+0/18] child_rip+0x0/0x12
[263208.271575] 					  [<ffffffffffffffff>] 0xffffffffffffffff
[263208.281317]     hardirq-on-W at:
[263208.284671] 					  [mark_lock+135/1127] mark_lock+0x87/0x467
[263208.294593] 					  [__lock_acquire+1412/3168] __lock_acquire+0x584/0xc60
[263208.305037] 					  [kfree+525/541] kfree+0x20d/0x21d
[263208.314692] 					  [_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath]
[263208.326786] 					  [lock_acquire+124/160] lock_acquire+0x7c/0xa0
[263208.336880] 					  [_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath]
[263208.348974] 					  [_spin_lock+34/46] _spin_lock+0x22/0x2e
[263208.358896] 					  [_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath]
[263208.370987] 					  [debug_mutex_free_waiter+88/92] debug_mutex_free_waiter+0x58/0x5c
[263208.382037] 					  [__mutex_lock_slowpath+624/637] __mutex_lock_slowpath+0x270/0x27d
[263208.393086] 					  [wait_for_completion+189/198] wait_for_completion+0xbd/0xc6
[263208.403789] 					  [_end+122587956/2124917936] ib_detach_mcast+0x2f/0x33 [ib_core]
[263208.415020] 					  [_end+123888044/2124917936] ipoib_mcast_detach+0x3d/0x6e [ib_ipoib]
[263208.426608] 					  [_end+123884728/2124917936] ipoib_mcast_leave+0x12d/0x1c8 [ib_ipoib]
[263208.438274] 					  [_end+123886246/2124917936] ipoib_mcast_dev_flush+0x100/0x14e [ib_ipoib]
[263208.450286] 					  [_end+123886283/2124917936] ipoib_mcast_dev_flush+0x125/0x14e [ib_ipoib]
[263208.462296] 					  [_end+123878948/2124917936] ipoib_ib_dev_flush+0x0/0x11f [ib_ipoib]
[263208.473872] 					  [_end+123877784/2124917936] ipoib_ib_dev_down+0xa8/0xb7 [ib_ipoib]
[263208.485366] 					  [_end+123879087/2124917936] ipoib_ib_dev_flush+0x8b/0x11f [ib_ipoib]
[263208.497030] 					  [run_workqueue+177/380] run_workqueue+0xb1/0x17c
[263208.507298] 					  [worker_thread+0/349] worker_thread+0x0/0x15d
[263208.517475] 					  [keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a
[263208.528355] 					  [worker_thread+294/349] worker_thread+0x126/0x15d
[263208.538708] 					  [default_wake_function+0/15] default_wake_function+0x0/0xf
[263208.549412] 					  [default_wake_function+0/15] default_wake_function+0x0/0xf
[263208.560116] 					  [worker_thread+0/349] worker_thread+0x0/0x15d
[263208.570299] 					  [kthread+208/252] kthread+0xd0/0xfc
[263208.579968] 					  [child_rip+10/18] child_rip+0xa/0x12
[263208.589713] 					  [_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d
[263208.600150] 					  [restore_args+0/48] restore_args+0x0/0x30
[263208.610157] 					  [kthread+0/252] kthread+0x0/0xfc
[263208.619724] 					  [child_rip+0/18] child_rip+0x0/0x12
[263208.629469] 					  [<ffffffffffffffff>] 0xffffffffffffffff
[263208.639220]   }
[263208.641084]   ... key      at: [_end+122893256/2124917936] __key.5+0x0/0xfffffffffffe369f [ib_ipath]
[263208.649786]  ... acquired at:
[263208.652862]    [add_lock_to_list+125/169] add_lock_to_list+0x7d/0xa9
[263208.658899]    [__lock_acquire+2822/3168] __lock_acquire+0xb06/0xc60
[263208.664936]    [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath]
[263208.672640]    [lock_acquire+124/160] lock_acquire+0x7c/0xa0
[263208.678330]    [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath]
[263208.686036]    [_spin_lock+34/46] _spin_lock+0x22/0x2e
[263208.691553]    [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath]
[263208.699261]    [_end+122587907/2124917936] ib_attach_mcast+0x2f/0x31 [ib_core]
[263208.706086]    [_end+123887919/2124917936] ipoib_mcast_attach+0xe3/0x123 [ib_ipoib]
[263208.713342]    [_end+123880632/2124917936] ipoib_mcast_join_finish+0x125/0x3fa [ib_ipoib]
[263208.721134]    [check_usage+53/661] check_usage+0x35/0x295
[263208.726832]    [lock_timer_base+35/72] lock_timer_base+0x23/0x48
[263208.732782]    [__lock_acquire+3080/3168] __lock_acquire+0xc08/0xc60
[263208.738818]    [_end+123881884/2124917936] ipoib_mcast_join_complete+0xcb/0x31a [ib_ipoib]
[263208.746688]    [_read_unlock_irqrestore+56/71] _read_unlock_irqrestore+0x38/0x47
[263208.753333]    [_end+122585085/2124917936] ib_unpack+0xad/0x11c [ib_core]
[263208.759722]    [_end+123844451/2124917936] ib_sa_mcmember_rec_callback+0x4c/0x57 [ib_sa]
[263208.767435]    [_end+123844738/2124917936] recv_handler+0x3f/0x4b [ib_sa]
[263208.773822]    [_end+123718406/2124917936] ib_mad_completion_handler+0x3c7/0x59b [ib_mad]
[263208.781622]    [run_workqueue+134/380] run_workqueue+0x86/0x17c
[263208.787486]    [_end+123717439/2124917936] ib_mad_completion_handler+0x0/0x59b [ib_mad]
[263208.795106]    [run_workqueue+177/380] run_workqueue+0xb1/0x17c
[263208.800969]    [worker_thread+0/349] worker_thread+0x0/0x15d
[263208.806744]    [keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a
[263208.813215]    [worker_thread+294/349] worker_thread+0x126/0x15d
[263208.819164]    [default_wake_function+0/15] default_wake_function+0x0/0xf
[263208.825469]    [default_wake_function+0/15] default_wake_function+0x0/0xf
[263208.831764]    [worker_thread+0/349] worker_thread+0x0/0x15d
[263208.837541]    [kthread+208/252] kthread+0xd0/0xfc
[263208.842796]    [child_rip+10/18] child_rip+0xa/0x12
[263208.848137]    [_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d
[263208.854174]    [restore_args+0/48] restore_args+0x0/0x30
[263208.859778]    [kthread+0/252] kthread+0x0/0xfc
[263208.864946]    [child_rip+0/18] child_rip+0x0/0x12
[263208.870289]    [<ffffffffffffffff>] 0xffffffffffffffff
[263208.875632] 
[263208.877236] 
[263208.877236] stack backtrace:
[263208.881817] 
[263208.881818] Call Trace:
[263208.885968]  [print_irq_inversion_bug+292/307] print_irq_inversion_bug+0x124/0x133
[263208.892513]  [check_usage_backwards+65/74] check_usage_backwards+0x41/0x4a
[263208.898711]  [mark_lock+630/1127] mark_lock+0x276/0x467
[263208.904045]  [__lock_acquire+1412/3168] __lock_acquire+0x584/0xc60
[263208.909813]  [kfree+525/541] kfree+0x20d/0x21d
[263208.914808]  [_end+122750808/2124917936] :ib_ipath:ipath_multicast_detach+0x211/0x231
[263208.922153]  [lock_acquire+124/160] lock_acquire+0x7c/0xa0
[263208.927582]  [_end+122750808/2124917936] :ib_ipath:ipath_multicast_detach+0x211/0x231
[263208.934927]  [_spin_lock+34/46] _spin_lock+0x22/0x2e
[263208.940182]  [_end+122750808/2124917936] :ib_ipath:ipath_multicast_detach+0x211/0x231
[263208.947526]  [debug_mutex_free_waiter+88/92] debug_mutex_free_waiter+0x58/0x5c
[263208.953904]  [__mutex_lock_slowpath+624/637] __mutex_lock_slowpath+0x270/0x27d
[263208.960277]  [wait_for_completion+189/198] wait_for_completion+0xbd/0xc6
[263208.966307]  [_end+122587956/2124917936] :ib_core:ib_detach_mcast+0x2f/0x33
[263208.972766]  [_end+123888044/2124917936] :ib_ipoib:ipoib_mcast_detach+0x3d/0x6e
[263208.979575]  [_end+123884728/2124917936] :ib_ipoib:ipoib_mcast_leave+0x12d/0x1c8
[263208.986469]  [_end+123886246/2124917936] :ib_ipoib:ipoib_mcast_dev_flush+0x100/0x14e
[263208.993724]  [_end+123886283/2124917936] :ib_ipoib:ipoib_mcast_dev_flush+0x125/0x14e
[263209.000981]  [_end+123878948/2124917936] :ib_ipoib:ipoib_ib_dev_flush+0x0/0x11f
[263209.007790]  [_end+123877784/2124917936] :ib_ipoib:ipoib_ib_dev_down+0xa8/0xb7
[263209.014510]  [_end+123879087/2124917936] :ib_ipoib:ipoib_ib_dev_flush+0x8b/0x11f
[263209.021409]  [run_workqueue+177/380] run_workqueue+0xb1/0x17c
[263209.027001]  [worker_thread+0/349] worker_thread+0x0/0x15d
[263209.032508]  [keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a
[263209.038711]  [worker_thread+294/349] worker_thread+0x126/0x15d
[263209.050667]  [default_wake_function+0/15] default_wake_function+0x0/0xf
[263209.056697]  [default_wake_function+0/15] default_wake_function+0x0/0xf
[263209.062726]  [worker_thread+0/349] worker_thread+0x0/0x15d
[263209.068233]  [kthread+208/252] kthread+0xd0/0xfc
[263209.073219]  [child_rip+10/18] child_rip+0xa/0x12
[263209.078294]  [_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d
[263209.084062]  [restore_args+0/48] restore_args+0x0/0x30
[263209.089396]  [kthread+0/252] kthread+0x0/0xfc
[263209.094294]  [child_rip+0/18] child_rip+0x0/0x12
[263209.099362] 
[263209.101175] BUG: workqueue leaked lock or atomic: ipoib/0x00000000/6662
[263209.107904]     last function: ipoib_ib_dev_flush+0x0/0x11f [ib_ipoib]
[263209.114575] 1 lock held by ipoib/6662:
[263209.118430]  #0:  (&priv->mcast_mutex){--..}, at: [mutex_lock+35/39] mutex_lock+0x23/0x27
[263209.127012] 
[263209.127013] Call Trace:
[263209.131160]  [run_workqueue+302/380] run_workqueue+0x12e/0x17c
[263209.136841]  [worker_thread+0/349] worker_thread+0x0/0x15d
[263209.142354]  [keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a
[263209.148554]  [worker_thread+294/349] worker_thread+0x126/0x15d
[263209.154237]  [default_wake_function+0/15] default_wake_function+0x0/0xf
[263209.160264]  [default_wake_function+0/15] default_wake_function+0x0/0xf
[263209.166297]  [worker_thread+0/349] worker_thread+0x0/0x15d
[263209.171807]  [kthread+208/252] kthread+0xd0/0xfc
[263209.176801]  [child_rip+10/18] child_rip+0xa/0x12
[263209.181875]  [_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d
[263209.187646]  [restore_args+0/48] restore_args+0x0/0x30
[263209.192977]  [kthread+0/252] kthread+0x0/0xfc
[263209.197875]  [child_rip+0/18] child_rip+0x0/0x12
[263209.202951] 
[263209.204559] BUG: workqueue leaked lock or atomic: ipoib/0x00000000/6662
[263209.211282]     last function: ipoib_ib_dev_flush+0x0/0x11f [ib_ipoib]
[263209.217946] 1 lock held by ipoib/6662:
[263209.221807]  #0:  (&priv->mcast_mutex){--..}, at: [mutex_lock+35/39] mutex_lock+0x23/0x27
...
[many more repeating traces]


-- 
Bernd Schubert
Q-Leap Networks GmbH


From kliteyn at dev.mellanox.co.il  Tue May 15 04:20:09 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 15 May 2007 14:20:09 +0300
Subject: [ofa-general] Re: Error message in OSM log when cached op file
	doesn't exist
In-Reply-To: <1179152459.1540.178811.camel@hal.voltaire.com>
References: <46486D1E.6010408@dev.mellanox.co.il>
	<1179152459.1540.178811.camel@hal.voltaire.com>
Message-ID: <46499769.1070404@dev.mellanox.co.il>

Hi Hal,

Hal Rosenstock wrote:
> Hi Yevgeny,
> 
> On Mon, 2007-05-14 at 10:07, Yevgeny Kliteynik wrote:
>> Hi Hal.
>>
>> [snip]
>>> Date:   03/30/2007 12:24:12 AM
>>> OpenSM: Handle conf file open failures better
>>>     
>>> diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
>>> index 46315a5..746fbd1 100644
>>> --- a/osm/opensm/osm_subnet.c
>>> +++ b/osm/opensm/osm_subnet.c
>>> @@ -732,7 +732,7 @@ subn_dump_qos_options(
>>>  
>>>  /**********************************************************************
>>>   **********************************************************************/
>>> -void
>>> +ib_api_status_t
>>>  osm_subn_rescan_conf_file(
>>>    IN osm_subn_opt_t* const p_opts )
>>>  {
>>> @@ -751,7 +751,7 @@ osm_subn_rescan_conf_file(
>>>    
>>>    opts_file = fopen(file_name, "r");
>>>    if (!opts_file)
>>> -    return;
>>> +    return IB_ERROR;
>> [/snip]
>>
>> This patch was applied a month and a half ago (master).
>> It handles opening cached options file, and prints error messages
>> when OSM failed opening such file.
>>
>> I actually don't like this thing, because now every time you run
>> OpenSM on the machine that doesn't have any cached options file
>> (which is usually the case) you get an error message.
> 
> Perhaps error is too severe as one can run just fine without this file
> and there is no requirement to have it. Should it be some other type of
> message instead ?

I think that the message should appear when OpenSM *does* find cached
option file, and no message should appear when such file wasn't found
(which is the most common use case).

>> There's no point checking whether the file exists, because osm runs
>> as root, and if it fails opening this file, it means that the file
>> doesn't exist or is inaccessible (broken mount, etc).
> 
> That's the most common use case (running OpenSM as root, but not the
> only one).
> 
>> In any case, user gets info in stdout whether or not OpenSM is using
>> cached options file.
> 
> Is there always a message in the log as well indicating this ?

Nope.
When this file is parsed, osm_log is not yet initialized.

-- Yevgeny

> -- Hal
> 
>> Do you agree? Should I issue a patch?
>>
>> -- Yevgeny
> 
> 


From kliteyn at dev.mellanox.co.il  Tue May 15 04:23:17 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 15 May 2007 14:23:17 +0300
Subject: [ofa-general] Re: Error message in OSM log when cached op file
	doesn't exist
In-Reply-To: <20070514210541.GR29746@sashak.voltaire.com>
References: <46486D1E.6010408@dev.mellanox.co.il>
	<20070514210541.GR29746@sashak.voltaire.com>
Message-ID: <46499825.9090204@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 17:07 Mon 14 May     , Yevgeny Kliteynik wrote:
>> I actually don't like this thing, because now every time you run
>> OpenSM on the machine that doesn't have any cached options file
>> (which is usually the case) you get an error message.
>>
>> There's no point checking whether the file exists, because osm runs
>> as root,
> 
> Not necessarily.
> 
>> and if it fails opening this file, it means that the file
>> doesn't exist or is inaccessible (broken mount, etc).
> 
> or user provided OSM_CACHE_DIR environment variable is broken or malloc
> failed, or other error (see: man 3 fopen, man 2 open, man 3 malloc)
> 
> Probably just this solves your issue:
> 
> diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
> index 855d1ab..f7ddf7d 100644
> --- a/osm/opensm/osm_subnet.c
> +++ b/osm/opensm/osm_subnet.c
> @@ -51,6 +51,7 @@
>  
>  #include <string.h>
>  #include <stdio.h>
> +#include <errno.h>
>  #include <limits.h>
>  #include <complib/cl_debug.h>
>  #include <complib/cl_log.h>
> @@ -856,7 +857,7 @@ osm_subn_parse_conf_file(
>    
>    opts_file = fopen(file_name, "r");
>    if (!opts_file)
> -    return IB_ERROR;
> +    return errno == ENOENT ? IB_SUCCESS : IB_ERROR;
>  
>    while (fgets(line, 1023, opts_file) != NULL)
>    {
 
I think that this one is a good solution.

-- Yevgeny

> Or yet another IB_* status value and less aggressive warning message?
> 
> Sasha
> 


From kliteyn at dev.mellanox.co.il  Tue May 15 04:35:43 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 15 May 2007 14:35:43 +0300
Subject: [ofa-general] [PATCH] osm: error message when failed opening cached
	options file
Message-ID: <46499B0F.2090902@dev.mellanox.co.il>

Hi Hal,

As suggested by Sasha, print the error message about failing to open
the cached options file only when the file exists but osm fails to
open it. A missing file is no longer treated as an error.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/opensm/osm_subnet.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
index 855d1ab..9bba1b4 100644
--- a/osm/opensm/osm_subnet.c
+++ b/osm/opensm/osm_subnet.c
@@ -52,6 +52,7 @@
 #include <string.h>
 #include <stdio.h>
 #include <limits.h>
+#include <errno.h>
 #include <complib/cl_debug.h>
 #include <complib/cl_log.h>
 #include <opensm/osm_subnet.h>
@@ -856,7 +857,7 @@ osm_subn_parse_conf_file(
   
   opts_file = fopen(file_name, "r");
   if (!opts_file)
-    return IB_ERROR;
+    return (errno == ENOENT) ? IB_SUCCESS : IB_ERROR;
 
   while (fgets(line, 1023, opts_file) != NULL)
   {
-- 
1.4.4.1.GIT
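
The behaviour this patch aims for can be sketched outside OpenSM roughly
as follows. This is only an illustration under stated assumptions: the
names parse_optional_conf and conf_status_t are hypothetical stand-ins
(not OpenSM symbols) for osm_subn_parse_conf_file and
IB_SUCCESS/IB_ERROR, and the option parsing itself is left out. It also
reports the reason for other fopen() failures via strerror(), which the
patch itself does not do.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical status codes standing in for IB_SUCCESS / IB_ERROR. */
typedef enum { CONF_OK = 0, CONF_ERROR = 1 } conf_status_t;

/*
 * Try to read an optional options file.
 * A missing file (ENOENT) means "nothing to load" and is not an error;
 * any other fopen() failure is reported on stderr and returned as an error.
 * *found is set to 1 only when the file could be opened.
 */
static conf_status_t parse_optional_conf(const char *path, int *found)
{
    char line[1024];
    FILE *f;

    assert(path != NULL && found != NULL);
    *found = 0;

    errno = 0;
    f = fopen(path, "r");
    if (!f) {
        if (errno == ENOENT)
            return CONF_OK;      /* no cached file: stay silent */
        fprintf(stderr, "cannot open %s: %s\n", path, strerror(errno));
        return CONF_ERROR;       /* file is there but unusable, or other error */
    }

    *found = 1;
    while (fgets(line, sizeof(line), f) != NULL) {
        /* option parsing would go here */
    }
    fclose(f);
    return CONF_OK;
}
```

A caller could then print something like "Using cached options" only
when found is set, and stay quiet otherwise.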


From sashak at voltaire.com  Tue May 15 05:54:01 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 15 May 2007 15:54:01 +0300
Subject: [ofa-general] Re: Error message in OSM log when cached op file
	doesn't exist
In-Reply-To: <46499769.1070404@dev.mellanox.co.il>
References: <46486D1E.6010408@dev.mellanox.co.il>
	<1179152459.1540.178811.camel@hal.voltaire.com>
	<46499769.1070404@dev.mellanox.co.il>
Message-ID: <20070515125401.GD23240@sashak.voltaire.com>

On 14:20 Tue 15 May     , Yevgeny Kliteynik wrote:
> 
> I think that the message should appear when OpenSM *does* find cached
> option file, and no message should appear when such file wasn't found
> (which is the most common use case).

AFAIK OpenSM which is used in the labs' clusters almost always uses this
file, so I'm not sure about common case.

Sasha


From philippe.gregoire at cea.fr  Tue May 15 05:48:39 2007
From: philippe.gregoire at cea.fr (Philippe Gregoire)
Date: Tue, 15 May 2007 14:48:39 +0200
Subject: [ofa-general] git over http
Message-ID: <4649AC27.8010903@cea.fr>

I can't get git clone command working due to our firewall.
Is there any git http server configured ?
If so, how do I translate
git clone git://git.openfabrics.org/~halr/management
into a git clone http path ?
Thanks
Philippe


From kliteyn at dev.mellanox.co.il  Tue May 15 05:56:32 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 15 May 2007 15:56:32 +0300
Subject: [ofa-general] Re: Error message in OSM log when cached op file
	doesn't exist
In-Reply-To: <20070515125401.GD23240@sashak.voltaire.com>
References: <46486D1E.6010408@dev.mellanox.co.il>
	<1179152459.1540.178811.camel@hal.voltaire.com>
	<46499769.1070404@dev.mellanox.co.il>
	<20070515125401.GD23240@sashak.voltaire.com>
Message-ID: <4649AE00.8080806@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 14:20 Tue 15 May     , Yevgeny Kliteynik wrote:
>> I think that the message should appear when OpenSM *does* find cached
>> option file, and no message should appear when such file wasn't found
>> (which is the most common use case).
> 
> AFAIK OpenSM which is used in the labs' clusters almost always uses this
> file, so I'm not sure about common case.

If the file is found, user sees "Using cached bla-bla" and 
"Loading cached option bla-bla" messages.
If the file wasn't found, these messages are not printed,
so absence of these messages means that the file wasn't found.
The only thing we can do is to add a new message that will
explicitly inform the user about this, something like 
"No cached options file".
Is this necessary? IMHO, it's not. Do you think otherwise?

-- Yevgeny

> Sasha
> 


From bhartner at us.ibm.com  Tue May 15 06:26:16 2007
From: bhartner at us.ibm.com (Bill Hartner)
Date: Tue, 15 May 2007 08:26:16 -0500
Subject: [ofa-general] binary compatibility ofed 1.1 and 1.2
Message-ID: <OFEEA2D314.CE1FBFE5-ON852572DC.00471A18-862572DC.0049CD84@us.ibm.com>

Hi,

Will apps built with OFED 1.1 verbs.h run on an OFED 1.2 install ?

-Bill

From halr at voltaire.com  Tue May 15 07:05:20 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 May 2007 10:05:20 -0400
Subject: [ofa-general] [NOTICE] IB management changes
Message-ID: <1179237918.4531.74522.camel@hal.voltaire.com>

As discussed last month, the following changes have now been made for IB
management (master branch of my management git tree):

In order to better match package names, the following directory names have
been changed from->to:
	osm->opensm
	diags->infiniband-diags

Still pending are the following changes:

Since opensm is a system daemon, it is to be moved from /usr/bin to /usr/sbin.
The same move applies to the infiniband-diags.

For consistency with the package name, /var/cache/osm is to be moved to
/var/cache/opensm.

Also, for consistency with the package name, all config, log, and dump files
named osm* are to be renamed to opensm*.

To avoid confusion and possible conflicts in configuring daemon options,
only have 1 configuration file (existence of both /etc/sysconfig/opensm
and /etc/opensm.conf is problematic).  Remove the /etc/sysconfig/opensm
file and only use opensm.conf.  Move opensm.conf to /etc/rdma (as
discussed in the thread labeled "Location and naming of RDMA enablement
stack rpm" on general at lists.openfabrics.org).

-- Hal


From kliteyn at dev.mellanox.co.il  Tue May 15 07:08:52 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 15 May 2007 17:08:52 +0300
Subject: [ofa-general] Re: Error message in OSM log when cached op file
	doesn't exist
In-Reply-To: <46499769.1070404@dev.mellanox.co.il>
References: <46486D1E.6010408@dev.mellanox.co.il>
	<1179152459.1540.178811.camel@hal.voltaire.com>
	<46499769.1070404@dev.mellanox.co.il>
Message-ID: <4649BEF4.9000801@dev.mellanox.co.il>

Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> Hal Rosenstock wrote:
>> Hi Yevgeny,
>>
>> On Mon, 2007-05-14 at 10:07, Yevgeny Kliteynik wrote:
>>> Hi Hal.
>>>
>>> [snip]
>>>> Date:   03/30/2007 12:24:12 AM
>>>> OpenSM: Handle conf file open failures better
>>>>     diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
>>>> index 46315a5..746fbd1 100644
>>>> --- a/osm/opensm/osm_subnet.c
>>>> +++ b/osm/opensm/osm_subnet.c
>>>> @@ -732,7 +732,7 @@ subn_dump_qos_options(
>>>>  
>>>>  /********************************************************************** 
>>>>
>>>>   
>>>> **********************************************************************/
>>>> -void
>>>> +ib_api_status_t
>>>>  osm_subn_rescan_conf_file(
>>>>    IN osm_subn_opt_t* const p_opts )
>>>>  {
>>>> @@ -751,7 +751,7 @@ osm_subn_rescan_conf_file(
>>>>       opts_file = fopen(file_name, "r");
>>>>    if (!opts_file)
>>>> -    return;
>>>> +    return IB_ERROR;
>>> [/snip]
>>>
>>> This patch was applied a month and a half ago (master).
>>> It handles opening cached options file, and prints error messages
>>> when OSM failed opening such file.
>>>
>>> I actually don't like this thing, because now every time you run
>>> OpenSM on the machine that doesn't have any cached options file
>>> (which is usually the case) you get an error message.
>>
>> Perhaps error is too severe as one can run just fine without this file
>> and there is no requirement to have it. Should it be some other type of
>> message instead ?
> 
> I think that the message should appear when OpenSM *does* find cached
> option file, and no message should appear when such file wasn't found
> (which is the most common use case).
> 
>>> There's no point checking whether the file exists, because osm runs
>>> as root, and if it fails opening this file, it means that the file
>>> doesn't exist or is inaccessible (broken mount, etc).
>>
>> That's the most common use case (running OpenSM as root, but not the
>> only one).
>>
>>> In any case, user gets info in stdout whether or not OpenSM is using
>>> cached options file.
>>
>> Is there always a message in the log as well indicating this ?
> 
> Nope.
> When this file is parsed, osm_log is not yet initialized.

Correction:

There are two places where this file is parsed:
1. osm_subn_parse_conf_file() - called from main(), osm log
   is not yet initialized when the function is called
2. osm_subn_rescan_conf_file() - called from osm_state_mgr_process()
   before every heavy sweep (when the log is already initialized),
   and logs error message about the missing file every time.

-- Yevgeny

> -- Yevgeny
> 
>> -- Hal
>>
>>> Do you agree? Should I issue a patch?
>>>
>>> -- Yevgeny
>>
>>
> 
> 
> 


From tziporet at dev.mellanox.co.il  Tue May 15 07:15:45 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 15 May 2007 17:15:45 +0300
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
In-Reply-To: <ada4pmfumt3.fsf@cisco.com>
References: <ada4pmfumt3.fsf@cisco.com>
Message-ID: <4649C091.5050804@mellanox.co.il>

Roland Dreier wrote:
>
> Jack Morgenstein (1):
>       IB/mlx4: Fix uninitialized spinlock for 32-bit archs
>
>   
Hi Roland,
There were 2 other fixes from Eli, and I see they are not in.
Can you take them too?

Thanks,
Tziporet


From rdreier at cisco.com  Tue May 15 07:21:13 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 07:21:13 -0700
Subject: [ofa-general] Re: [PATCH] IB/core free umem when mm is destroyed
In-Reply-To: <1179131773.7405.39.camel@mtls03> (Eli Cohen's message of "Mon,
	14 May 2007 11:35:43 +0300")
References: <1179131773.7405.39.camel@mtls03>
Message-ID: <adafy5ytbfq.fsf@cisco.com>

thanks, applied.


From rdreier at cisco.com  Tue May 15 07:26:49 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 07:26:49 -0700
Subject: [ofa-general] Re: [PATCH] libmlx4: WQE shift calculation
In-Reply-To: <1179145102.25749.11.camel@mtls03> (Eli Cohen's message of "Mon,
	14 May 2007 15:17:52 +0300")
References: <1179145102.25749.11.camel@mtls03>
Message-ID: <adabqgmtb6e.fsf@cisco.com>

 > the code that calculates WQ size is quite different between kernel and
 > user. I think that writing the code in a way that will allow to copy it
 > as is between kernel and user is in place. Would like me to send such a
 > patch?

Actually I've been thinking that perhaps we should let libmlx4 tell
mlx4 in the kernel what WQE sizes it wants to use.  Otherwise it will
probably be a pain if we want to use a small BB for SQs, etc.

 >  	case IBV_QPT_RC:
 > -		size += sizeof (struct mlx4_wqe_raddr_seg);
 > +		size += sizeof (struct mlx4_wqe_raddr_seg) +
 > +			sizeof (struct mlx4_wqe_atomic_seg);
 >  		/*
 >  		 * An atomic op will require an atomic segment, a
 >  		 * remote address segment and one scatter entry.

This looks wrong.  Why do we have to allow for an atomic segment for
normal operations?  The code that starts with the context above:

		/*
		 * An atomic op will require an atomic segment, a
		 * remote address segment and one scatter entry.
		 */
		if (size < (sizeof (struct mlx4_wqe_atomic_seg) +
			    sizeof (struct mlx4_wqe_raddr_seg) +
			    sizeof (struct mlx4_wqe_data_seg)))
			size = (sizeof (struct mlx4_wqe_atomic_seg) +
				sizeof (struct mlx4_wqe_raddr_seg) +
				sizeof (struct mlx4_wqe_data_seg));

seems to take into account leaving space for atomic operations.

 - R.
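
A minimal sketch of the sizing argument above, not the actual libmlx4
code. The 16-byte segment sizes, the function names, and the choice to
ignore the control segment are all assumptions made for illustration.
It shows that clamping to the atomic minimum already leaves room for an
atomic op, so adding the atomic segment unconditionally only inflates
larger WQEs:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in segment sizes (assumed to be 16 bytes each). Real code would
 * use sizeof() on the mlx4 WQE segment structs instead.
 */
enum {
	RADDR_SEG_SIZE  = 16,
	ATOMIC_SEG_SIZE = 16,
	DATA_SEG_SIZE   = 16,
};

/*
 * Hypothetical helper: bytes an RC send WQE needs for max_sge gather
 * entries plus a remote address segment, clamped so that an atomic op
 * (atomic + raddr + one data segment) always fits.
 */
static size_t rc_send_wqe_size(unsigned max_sge)
{
	size_t size = (size_t)max_sge * DATA_SEG_SIZE + RADDR_SEG_SIZE;
	size_t atomic_min = ATOMIC_SEG_SIZE + RADDR_SEG_SIZE + DATA_SEG_SIZE;

	if (size < atomic_min)
		size = atomic_min;
	return size;
}

/*
 * Hypothetical helper: smallest shift >= min_shift such that
 * (1 << shift) bytes can hold a WQE of the given size.
 */
static unsigned wqe_shift(size_t size, unsigned min_shift)
{
	unsigned shift = min_shift;

	assert(size > 0);
	while (((size_t)1 << shift) < size)
		++shift;
	return shift;
}
```

With these assumed sizes, a QP with 1 SGE is clamped up to 48 bytes,
while a QP with 4 SGEs needs 80 bytes and is unaffected by the clamp.
If userspace and the kernel disagree on either helper, they will
disagree on the buffer size as well.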


From mst at dev.mellanox.co.il  Tue May 15 07:47:08 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 May 2007 17:47:08 +0300
Subject: [ofa-general] Re: [PATCH] libmlx4: WQE shift calculation
In-Reply-To: <adabqgmtb6e.fsf@cisco.com>
References: <1179145102.25749.11.camel@mtls03> <adabqgmtb6e.fsf@cisco.com>
Message-ID: <20070515144708.GH13649@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] libmlx4: WQE shift calculation
> 
>  > the code that calculates WQ size is quite different between kernel and
>  > user. I think that writing the code in a way that will allow to copy it
>  > as is between kernel and user is in place. Would like me to send such a
>  > patch?
> 
> Actually I've been thinking that perhaps we should let libmlx4 tell
> mlx4 in the kernel what WQE sizes it wants to use.  Otherwise it will
> probably be a pain if we want to use a small BB for SQs, etc.

I've been thinking about this, too.
But this is a separate issue from what Eli proposes - we'll still
need to have this math in both kernel and user-space.

-- 
MST


From eli at mellanox.co.il  Tue May 15 08:13:32 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Tue, 15 May 2007 18:13:32 +0300
Subject: [ofa-general] Re: [PATCH] libmlx4: WQE shift calculation
In-Reply-To: <adabqgmtb6e.fsf@cisco.com>
References: <1179145102.25749.11.camel@mtls03>  <adabqgmtb6e.fsf@cisco.com>
Message-ID: <1179242042.25749.33.camel@mtls03>

On Tue, 2007-05-15 at 07:26 -0700, Roland Dreier wrote:
> > the code that calculates WQ size is quite different between kernel and
>  > user. I think that writing the code in a way that will allow to copy it
>  > as is between kernel and user is in place. Would like me to send such a
>  > patch?
First I should add the case that triggered this patch: the userspace code
calculated a smaller buffer size than the kernel code, which caused
get_user_pages() to fail since part of the buffer did not belong to the
process's address space.

> 
> Actually I've been thinking that perhaps we should let libmlx4 tell
> mlx4 in the kernel what WQE sizes it wants to use.  Otherwise it will
> probably be a pain if we want to use a small BB for SQs, etc.
As Michael said in a subsequent post, we still need this code both in
user and in kernel.


> 
>  >  	case IBV_QPT_RC:
>  > -		size += sizeof (struct mlx4_wqe_raddr_seg);
>  > +		size += sizeof (struct mlx4_wqe_raddr_seg) +
>  > +			sizeof (struct mlx4_wqe_atomic_seg);
>  >  		/*
>  >  		 * An atomic op will require an atomic segment, a
>  >  		 * remote address segment and one scatter entry.
> 
> This looks wrong.  Why do we have to allow for an atomic segment for
> normal operations?  The code that starts with the context above:

The kernel code in send_wqe_overhead() always adds atomic headers. Maybe
the fix should have gone there. Still I think we should have the same
code for this calculation.

> 
> 		/*
> 		 * An atomic op will require an atomic segment, a
> 		 * remote address segment and one scatter entry.
> 		 */
> 		if (size < (sizeof (struct mlx4_wqe_atomic_seg) +
> 			    sizeof (struct mlx4_wqe_raddr_seg) +
> 			    sizeof (struct mlx4_wqe_data_seg)))
> 			size = (sizeof (struct mlx4_wqe_atomic_seg) +
> 				sizeof (struct mlx4_wqe_raddr_seg) +
> 				sizeof (struct mlx4_wqe_data_seg));
> 
> seems to take into account leaving space for atomic operations.
> 
>  - R.


From halr at voltaire.com  Tue May 15 08:14:52 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 May 2007 11:14:52 -0400
Subject: [ofa-general] [NOTICE] IB management changes
In-Reply-To: <1179237918.4531.74522.camel@hal.voltaire.com>
References: <1179237918.4531.74522.camel@hal.voltaire.com>
Message-ID: <1179242091.4531.78861.camel@hal.voltaire.com>

On Tue, 2007-05-15 at 10:05, Hal Rosenstock wrote:
> As discussed last month, the following changes have now been made for IB
> management (master branch of my management git tree):
> 
> In order to better match package names, the following directory names have
> been changed from->to:
> 	osm->opensm
> 	diags->infiniband-diags
> 
> Still pending are the following changes:
> 
> Since opensm is a system daemon, opensm is to be moved from /usr/bin
> to /usr/sbin

This was done.

> Similarly for the infiniband-diags.

Pending.

> For consistency with the package name, /var/cache/osm moved to
> /var/cache/opensm

Done.

> Also, for consistency with the package name, all config, log, and dump files named osm* 
> to be changed to opensm*

Done.

-- Hal

> To avoid confusion and possible conflicts in configuring daemon options,
> only have 1 configuration file (existence of both /etc/sysconfig/opensm 
> and /etc/opensm.conf is problematic).  Remove the /etc/sysconfig/opensm 
> file and only use opensm.conf.  Move opensm.conf to /etc/rdma (as 
> discussed in the thread labeled "Location and naming of RDMA enablement 
> stack rpm" on general at lists.openfabrics.org.
> 
> -- Hal
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From kliteyn at dev.mellanox.co.il  Tue May 15 08:14:39 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 15 May 2007 18:14:39 +0300
Subject: [ofa-general] [PATCHv2] osm: error message when failed opening
	cached options file
Message-ID: <4649CE5F.70102@dev.mellanox.co.il>

Hi Hal,

[V2 of the patch]

As suggested by Sasha, printing error message when failed
opening cached options file only when the file was found, but
osm failed opening it.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_subnet.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 855d1ab..c785923 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -52,6 +52,7 @@
 #include <string.h>
 #include <stdio.h>
 #include <limits.h>
+#include <errno.h>
 #include <complib/cl_debug.h>
 #include <complib/cl_log.h>
 #include <opensm/osm_subnet.h>
@@ -758,7 +759,7 @@ osm_subn_rescan_conf_file(
   
   opts_file = fopen(file_name, "r");
   if (!opts_file)
-    return IB_ERROR;
+    return (errno == ENOENT) ? IB_SUCCESS : IB_ERROR;
 
   while (fgets(line, 1023, opts_file) != NULL)
   {
@@ -856,7 +857,7 @@ osm_subn_parse_conf_file(
   
   opts_file = fopen(file_name, "r");
   if (!opts_file)
-    return IB_ERROR;
+    return (errno == ENOENT) ? IB_SUCCESS : IB_ERROR;
 
   while (fgets(line, 1023, opts_file) != NULL)
   {
-- 
1.5.1.4
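
For illustration only, the behavior this patch aims for can be sketched
as a standalone C function. The status enum and the function name are
hypothetical stand-ins (OpenSM itself uses ib_api_status_t with
IB_SUCCESS / IB_ERROR), and the parsing loop is left empty:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical status codes standing in for IB_SUCCESS / IB_ERROR. */
enum conf_status { CONF_OK = 0, CONF_ERROR = 1 };

/*
 * Read an optional cached options file.
 * A missing file (ENOENT) is the common case and is not an error;
 * any other fopen() failure (permissions, broken mount, ...) is.
 */
static enum conf_status rescan_optional_conf(const char *file_name)
{
	char line[1024];
	FILE *opts_file;

	assert(file_name != NULL);

	opts_file = fopen(file_name, "r");
	if (!opts_file)
		return (errno == ENOENT) ? CONF_OK : CONF_ERROR;

	while (fgets(line, sizeof(line), opts_file) != NULL) {
		/* Option parsing would go here. */
	}

	fclose(opts_file);
	return CONF_OK;
}
```

A caller that logs only on CONF_ERROR then stays silent on machines
that simply have no cached options file.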


From halr at voltaire.com  Tue May 15 08:29:11 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 May 2007 11:29:11 -0400
Subject: [ofa-general] Re: [PATCHv2] osm: error message when failed opening
	cached options file
In-Reply-To: <4649CE5F.70102@dev.mellanox.co.il>
References: <4649CE5F.70102@dev.mellanox.co.il>
Message-ID: <1179242950.4531.79784.camel@hal.voltaire.com>

Hi Yevgeny,

On Tue, 2007-05-15 at 11:14, Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> [V2 of the patch]
> 
> As suggested by Sasha, printing error message when failed
> opening cached options file only when the file was found, but
> osm failed opening it.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied (to master only).

-- Hal


From philippe.gregoire at cea.fr  Tue May 15 08:32:25 2007
From: philippe.gregoire at cea.fr (Philippe Gregoire)
Date: Tue, 15 May 2007 17:32:25 +0200
Subject: [ofa-general] Re: suggested patch for partition membership
 definitiion in	osm-partitions.conf (fix)
In-Reply-To: <1179157835.1540.183713.camel@hal.voltaire.com>
References: <46487FBF.7020300@cea.fr>
	<1179157835.1540.183713.camel@hal.voltaire.com>
Message-ID: <4649D289.3070301@cea.fr>

Here are the patches as you asked.
I changed the code to use strncmp as suggested by Sasha.
Philippe
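
As a rough sketch of the membership rule the patch implements (the
function names below are hypothetical, not the OpenSM ones): a port
without a flag takes the partition's defmember value, while a port with
an explicit flag is full only if the flag is a prefix of "full", and
limited otherwise. Treating an empty flag as limited is an assumption
of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * def_full: partition-level default (true for defmember=full).
 * flag:     text after '=' on a port GUID entry, or NULL when omitted.
 * Returns true for full membership, false for limited.
 */
static bool port_is_full_member(bool def_full, const char *flag)
{
	size_t len;

	if (!flag)
		return def_full;	/* no per-port flag: use the default */

	len = strlen(flag);
	/* Prefix match, as strncmp(flag, "full", strlen(flag)) does. */
	return len > 0 && strncmp(flag, "full", len) == 0;
}

/* Checks mirroring "G1 = 0x01 , defmember=full : 0x128, 0x567=limited". */
static void membership_examples(void)
{
	assert(port_is_full_member(true, NULL));	/* 0x128         */
	assert(!port_is_full_member(true, "limited"));	/* 0x567=limited */
	assert(!port_is_full_member(false, NULL));	/* old default   */
}
```

Because the match is on a prefix, an abbreviated flag such as "limi"
still resolves to limited, and an unrecognized flag also falls back to
limited rather than to the partition default.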

Hal Rosenstock a écrit :
> Hi Philippe,
>
> On Mon, 2007-05-14 at 11:26, Philippe Gregoire wrote:
>   
>> This time , with the definitive patch (sorry)
>>     
>
> Can you resubmit this with a S-O-B line ?
>
>   
>> Hi Hal,
>> the way to define in osm-partitions.conf file  partition membership for
>> port guids is quite very verbose,
>> specially when you have a lot of (full member) ports.
>>     
>
> or lots of limited members, either way. This is an improvement in the
> allowed syntax.
>
>   
>> Here is a patch to allow a more compact partition membership definition.
>> It allows definition of a default
>> membership partition for the port guid list. The old syntax is still usable.
>> old way
>> G1 = 0x01 :  0x123=full, 0x124=full, 0x125=full, 0x126=full, 0x127=full ;
>> G1 = 0x01 :  0x128=full, 0x129=full, 0x567, 0x569=full
>>
>> new way :
>> G1 = 0x01 , defmember=full : 0x123, 0x124, 0x125, 0x126, 0x127 ;
>> G1 = 0x01 , defmember=full :  0x128, 0x129, 0x567=limited, 0x569
>>
>> I also changed the opensm man page as some lines (around limited/full
>> membership) are not well formatted.
>>     
>
> Can you break this piece into 2 parts: fix formatting, and then add
> defmember ?
>
>   
>> This patch has been compiled and tested on our cluster with the
>> following osm-partitions.conf :
>> G1  = 0x0001 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9,
>> 0x0005ad00000168ad, 0x0005ad0000000cb7=limited, 0x0008f10403962eb1 ;
>> G2  = 0x0002 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ;
>> G3  = 0x0003 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9,
>> 0x0008f10403962eb1 ;
>> G5  = 0x0005 , defmember=full : 0x0008f10403962eb1 ;
>> G10 = 0x0010 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ;
>> G70 = 0x0070 , defmember=full : 0x0005ad00000165f1 ;
>> G80 = 0x0080 , defmember=full : 0x0005ad00000165f1;
>> G80 = 0x0080 : 0x0005ad00000168ad;
>> G80 = 0x0080 , defmember=full : 0x0005ad0000016cb9;
>> G80 = 0x0080 , defmember=limited : 0x0005ad0000000cb7, 0x0008f10403962eb1;
>>     
>
> Thanks.
>
> -- Hal
>
>   
>> Philippe
>>
>>
>>
>> ______________________________________________________________________
>>
>> --- opensm/osm_prtn_config.old.c	2007-04-18 11:54:29.000000000 +0200
>> +++ opensm/osm_prtn_config.c	2007-05-14 17:14:42.228813361 +0200
>> @@ -70,6 +70,7 @@
>>  	osm_subn_t *p_subn;
>>  	osm_prtn_t *p_prtn;
>>  	unsigned    is_ipoib, mtu, rate, sl, scope;
>> +	boolean_t   full;
>>  };
>>  
>>  extern osm_prtn_t *osm_prtn_make_new(osm_log_t *p_log, osm_subn_t *p_subn,
>> @@ -163,6 +164,14 @@
>>  				" - skipped\n", lineno);
>>  		else
>>  			conf->sl = sl;
>> +	} else if (!strncmp(flag, "defmember", len)) {
>> +		if (!val || (strcmp(val, "limited") && strcmp(val, "full")))
>> +			osm_log(conf->p_log, OSM_LOG_VERBOSE,
>> +				"PARSE WARN: line %d: "
>> +				"flag \'defmember\' requires valid value (limited or full)"
>> +				" - skipped\n", lineno);
>> +		else
>> +			conf->full = strcmp(val, "full") == 0;
>>  	} else {
>>  			osm_log(conf->p_log, OSM_LOG_VERBOSE,
>>  					  "PARSE WARN: line %d: "
>> @@ -177,12 +186,14 @@
>>  {
>>  	osm_prtn_t *p = conf->p_prtn;
>>  	ib_net64_t guid;
>> -	boolean_t full = FALSE;
>> +	boolean_t full = conf->full;
>>  
>>  	if (!name || !*name || !strncmp(name, "NONE", strlen(name)))
>>  		return 0;
>>  
>>  	if (flag) {
>> +		/* reset default membership to limited */
>> +		full = FALSE;
>>  		if (!strncmp(flag, "full", strlen(flag)))
>>  			full = TRUE;
>>  		else if (strncmp(flag, "limited", strlen(flag))) {
>> @@ -275,6 +286,7 @@
>>  	conf->p_prtn = NULL;
>>  	conf->is_ipoib = 0;
>>  	conf->sl = OSM_DEFAULT_SL;
>> +	conf->full = FALSE;
>>  	return conf;
>>  }
>>  
>> --- man/opensm.8.old	2007-04-18 11:54:29.000000000 +0200
>> +++ man/opensm.8	2007-05-14 16:19:11.747555126 +0200
>> @@ -291,13 +291,15 @@
>>  
>>  Partition Definition:
>>  
>> -[PartitionName][=PKey][,flag[=value]]
>> +[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
>>  
>>   PartitionName - string, will be used with logging. When omitted
>>                   empty string will be used.
>>   PKey          - P_Key value for this partition. Only low 15 bits will
>>                   be used. When omitted will be autogenerated.
>>   flag          - used to indicate IPoIB capability of this partition.
>> + defmember=full|limited - specifies default membership for port guid. 
>> +                 Default is limited.
>>  
>>  Currently recognized flags are:
>>  
>> @@ -317,10 +319,10 @@
>>  
>>  PortGUIDs list:
>>  
>> -PortGUID     - GUID of partition member EndPort. Hexadecimal numbers
>> -               should start from 0x, decimal numbers are accepted too.
>> -full or      - indicates full or limited membership for this port. When
>> -  limited      omitted (or unrecognized) limited membership is assumed.
>> + PortGUID         - GUID of partition member EndPort. Hexadecimal numbers
>> +                   should start from 0x, decimal numbers are accepted too.
>> + full or limited  - indicates full or limited membership for this port.
>> +                   When omitted (or unrecognized) default (defmember) membership is assumed.
>>  
>>  There are two useful keywords for PortGUID definition:
>>  
>> @@ -346,12 +348,20 @@
>>  
>>  Examples:
>>  
>> -Default=0x7fff : ALL, SELF=full ;
>> + Default=0x7fff : ALL, SELF=full ;
>>  
>> -NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
>> + NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
>>  
>> -YetAnotherOne = 0x300 : SELF=full ;
>> -YetAnotherOne = 0x300 : ALL=limited ;
>> + YetAnotherOne = 0x300 : SELF=full ;
>> + YetAnotherOne = 0x300 : ALL=limited ;
>> +
>> + ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
>> + # 0x123453, 0x123454 will be limited
>> + ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
>> + # 0x123456, 0x123457 will be limited
>> + ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
>> + ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
>> + ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
>>  
>>  Note:
>>  
>>     
>
>
>   

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: defmember.patch
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070515/803b4944/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: osm-man1.patch
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070515/803b4944/attachment-0001.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: osm-man2.patch
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070515/803b4944/attachment-0002.ksh>

From rdreier at cisco.com  Tue May 15 08:58:19 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 08:58:19 -0700
Subject: [ofa-general] Re: [PATCH] libmlx4: WQE shift calculation
In-Reply-To: <1179242042.25749.33.camel@mtls03> (Eli Cohen's message of "Tue,
	15 May 2007 18:13:32 +0300")
References: <1179145102.25749.11.camel@mtls03> <adabqgmtb6e.fsf@cisco.com>
	<1179242042.25749.33.camel@mtls03>
Message-ID: <ada7irat6xw.fsf@cisco.com>

 > First I should add the case that triggered his patch: the userspace code
 > calculated a smaller buffer size than the kernel code, which caused
 > get_user_pages() to fail since part of the buffer did not belong to the
 > process's address space.

OK, in this case it seems the bug is in the kernel -- since it is
overestimating the size of the WQEs needed.  So we might as well fix
it in the kernel.

 > As Mihcael said in a subsequent post, we still need this code both in
 > user and in kernel.

Yes, but I think this issue really convinces me that we should
decouple the two calculations, so the kernel code is only used for
kernel QPs.  And then change the mlx4 ABI so that userspace tells the
kernel the wqe buffer size and rq/sq wqe shift/offset.  That will
allow for different SQ BB sizes and also make things more robust
against bugs like this.

 - R.


From rdreier at cisco.com  Tue May 15 08:58:46 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 08:58:46 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
In-Reply-To: <4649C091.5050804@mellanox.co.il> (Tziporet Koren's message of
	"Tue, 15 May 2007 17:15:45 +0300")
References: <ada4pmfumt3.fsf@cisco.com> <4649C091.5050804@mellanox.co.il>
Message-ID: <ada3b1yt6x5.fsf@cisco.com>

Thanks for the reminder.  I put them in the wrong folder but I think I
found them now.


From xma at us.ibm.com  Tue May 15 10:41:23 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 15 May 2007 10:41:23 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <20070515062646.GD5437@mellanox.co.il>
Message-ID: <OF25A26510.AB741DE1-ON872572DC.0060B20C-882572DC.006111DB@us.ibm.com>


Hello Michael,

      Regarding the memory issue w/o SRQ, do you think there is a way to
use low watermark to release prepost buffer in large connections? I think
most of the prepost buffers are empty in that case because of the BW.

Thanks
Shirley Ma

From mst at dev.mellanox.co.il  Tue May 15 11:54:58 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 May 2007 21:54:58 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <OF25A26510.AB741DE1-ON872572DC.0060B20C-882572DC.006111DB@us.ibm.com>
References: <20070515062646.GD5437@mellanox.co.il>
	<OF25A26510.AB741DE1-ON872572DC.0060B20C-882572DC.006111DB@us.ibm.com>
Message-ID: <20070515185445.GD4161@mellanox.co.il>

> Quoting Shirley Ma <xma at us.ibm.com>:
> Subject: Re: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
> 
> Hello Michael,
> 
> Regarding the memory issue w/o SRQ, do you think there is a way to use low
> watermark to release prepost buffer in large connections?

Maybe with UC - with RC you'll get RNR and connection'll get closed before you
have time to handle the low watermark.  So sure, might be an interesting idea,
but isn't low watermark a SRQ feature?

> I think most of the prepost buffers are empty in that case because of the BW.

I don't really get the argument.

-- 
MST


From parks at lanl.gov  Tue May 15 11:55:32 2007
From: parks at lanl.gov (Parks Fields)
Date: Tue, 15 May 2007 12:55:32 -0600
Subject: [ofa-general] Re: [ewg] OFED 1.2 rc3 release
In-Reply-To: <6a122cc00705140844k7f65a988x8746c4ea8474b920@mail.gmail.co
 m>
References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com>
	<6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com>
	<46487AE8.1020005@mellanox.co.il>
	<6a122cc00705140844k7f65a988x8746c4ea8474b920@mail.gmail.com>
Message-ID: <7.0.1.0.2.20070515125402.02784f60@lanl.gov>


OFED 1.2 rc3 is running IPoIB slower than rc2 or earlier.

I used to get ~800MB/sec with no tuning and now I get ~650MB/sec.

Ideas??

thanks


At 09:44 AM 5/14/2007, Moni Levy wrote:
>On 5/14/07, Tziporet Koren <tziporet at dev.mellanox.co.il> wrote:
>
>Major limitations and known issues:
>567 blocker  rolandd at cisco.com  MPI does not work on RHEL5 ppc64
>420 critical monil at voltaire.com  PKey table reordering caused by SM failover stops ipoib traffic
>
>Tziporet, bug #420 was fixed and bugzilla was updated this morning
>
>Moni
>
>607 critical jsquyres at cisco.com  remove the hack to save the port number in the ia hca_address
>608 critical monis at voltaire.com  traffic fails to resume after SM failover with bonding interfaces
>611 critical swise at opengridcomputing.com  cxgb3: passive side connection transition from streaming to RDMA is broken
>577 critical rolandd at cisco.com  SRP multipath failover too slow (minutes, not seconds)
>465 critical mst at mellanox.co.il  IPoIB HA fails after several hours of failovers
>549 critical amip at dev.mellanox.co.il  SDP Policy need to be consistent
>604 critical mst at mellanox.co.il  Oops running UDP traffic with IPoIB CM
>605 major    sean.hefty at intel.com  kernel oops in rdma_cm during module unload
>614 major    halr at voltaire.com  All of the CM definitions should be removed from ib_types.h
>534 major    vlad at mellanox.co.il  SLES9 - Installer fails on declarations - OFED 1.2-20070409
>530 major    dannyz at mellanox.co.il  ibdiagnet -r fails on RHEL5 i686
>538 major    monis at voltaire.com  integrate IPoIB bonding with IPoIB CM
>541 major    mst at mellanox.co.il  slow failover with IPoIB CM bonding/ipoibtools HA
>558 major    rolandd at cisco.com  tvflash configure fails on SLES10 SP1 RC2
>

From rdreier at cisco.com  Tue May 15 11:58:54 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 11:58:54 -0700
Subject: [ofa-general] possible irq lock inversion dependency detected
In-Reply-To: <200705151148.50607.bs@q-leap.de> (Bernd Schubert's message of
	"Tue, 15 May 2007 11:48:50 +0200")
References: <200705151148.50607.bs@q-leap.de>
Message-ID: <adahcqdsykx.fsf@cisco.com>

Thanks for the report... looks like a real bug.

Can you check whether this patch makes the lockdep warnings go away?

commit 4b7eed244c032ce963be543a63e3100b96bc2d87
Author: Roland Dreier <rolandd at cisco.com>
Date:   Tue May 15 11:56:05 2007 -0700

    IB/ipath: Fix potential deadlock with multicast spinlocks
    
    Lockdep found the following potential deadlock between mcast_lock and
    n_mcast_grps_lock: mcast_lock is taken from both interrupt context and
    process context, so spin_lock_irqsave() must be used to take it.
    n_mcast_grps_lock is only taken from process context, so at first it
    seems safe to take it with plain spin_lock(); however, it also nests
    inside mcast_lock, and hence we could deadlock:
    
      cpu A                                   cpu B
        ipath_mcast_add():
          spin_lock_irq(&mcast_lock);
    
                                                ipath_mcast_detach():
                                                  spin_lock(&n_mcast_grps_lock);
    
                                                <enter interrupt>
    
                                                ipath_mcast_find():
                                                  spin_lock_irqsave(&mcast_lock);
    
          spin_lock(&n_mcast_grps_lock);
    
    Fix this by using spin_lock_irq() to take n_mcast_grps_lock.
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c
index 085e28b..dd691cf 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c
@@ -165,10 +165,9 @@ static int ipath_mcast_add(struct ipath_ibdev *dev,
 {
 	struct rb_node **n = &mcast_tree.rb_node;
 	struct rb_node *pn = NULL;
-	unsigned long flags;
 	int ret;
 
-	spin_lock_irqsave(&mcast_lock, flags);
+	spin_lock_irq(&mcast_lock);
 
 	while (*n) {
 		struct ipath_mcast *tmcast;
@@ -228,7 +227,7 @@ static int ipath_mcast_add(struct ipath_ibdev *dev,
 	ret = 0;
 
 bail:
-	spin_unlock_irqrestore(&mcast_lock, flags);
+	spin_unlock_irq(&mcast_lock);
 
 	return ret;
 }
@@ -289,17 +288,16 @@ int ipath_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 	struct ipath_mcast *mcast = NULL;
 	struct ipath_mcast_qp *p, *tmp;
 	struct rb_node *n;
-	unsigned long flags;
 	int last = 0;
 	int ret;
 
-	spin_lock_irqsave(&mcast_lock, flags);
+	spin_lock_irq(&mcast_lock);
 
 	/* Find the GID in the mcast table. */
 	n = mcast_tree.rb_node;
 	while (1) {
 		if (n == NULL) {
-			spin_unlock_irqrestore(&mcast_lock, flags);
+			spin_unlock_irq(&mcast_lock);
 			ret = -EINVAL;
 			goto bail;
 		}
@@ -334,7 +332,7 @@ int ipath_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 		break;
 	}
 
-	spin_unlock_irqrestore(&mcast_lock, flags);
+	spin_unlock_irq(&mcast_lock);
 
 	if (p) {
 		/*
@@ -348,9 +346,9 @@ int ipath_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 		atomic_dec(&mcast->refcount);
 		wait_event(mcast->wait, !atomic_read(&mcast->refcount));
 		ipath_mcast_free(mcast);
-		spin_lock(&dev->n_mcast_grps_lock);
+		spin_lock_irq(&dev->n_mcast_grps_lock);
 		dev->n_mcast_grps_allocated--;
-		spin_unlock(&dev->n_mcast_grps_lock);
+		spin_unlock_irq(&dev->n_mcast_grps_lock);
 	}
 
 	ret = 0;


From rdreier at cisco.com  Tue May 15 12:00:28 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 12:00:28 -0700
Subject: [ofa-general] New ipath MAINTAINERS entry?
Message-ID: <adad511syib.fsf@cisco.com>

Do you guys want to update the maintainers entry for ipath?  Right now
we have:

IPATH DRIVER:
P:	Bryan O'Sullivan
M:	support at pathscale.com
L:	openib-general at openib.org
S:	Supported

QLogic bought PathScale, and Bryan no longer works for QLogic, so it
seems some fresher information might be appropriate.


From rdreier at cisco.com  Tue May 15 12:41:31 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 12:41:31 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070508141727.GR21591@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 8 May 2007 17:17:27 +0300")
References: <20070508141727.GR21591@mellanox.co.il>
Message-ID: <adazm45ri1g.fsf@cisco.com>

OK, how about this for libibverbs:

diff --git a/include/infiniband/arch.h b/include/infiniband/arch.h
index 6a04287..df4c949 100644
--- a/include/infiniband/arch.h
+++ b/include/infiniband/arch.h
@@ -56,13 +56,17 @@ static inline uint64_t ntohll(uint64_t x) { return x; }
  *     macro by either the compiler or the CPU.
  * wmb() - write memory barrier.  No stores may be reordered across
  *     this macro by either the compiler or the CPU.
+ * wc_wmb() - flush write combine buffers.  No write-combined writes
+ *     will be reordered across this macro by either the compiler or
+ *     the CPU.
  */
 
 #if defined(__i386__)
 
-#define mb()	asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
-#define rmb()	mb()
-#define wmb()	asm volatile("" ::: "memory")
+#define mb()	 asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
+#define rmb()	 mb()
+#define wmb()	 asm volatile("" ::: "memory")
+#define wc_wmb() mb()
 
 #elif defined(__x86_64__)
 
@@ -70,47 +74,54 @@ static inline uint64_t ntohll(uint64_t x) { return x; }
  * Only use lfence for mb() and rmb() because we don't care about
  * ordering against non-temporal stores (for now at least).
  */
-#define mb()	asm volatile("lfence" ::: "memory")
-#define rmb()	mb()
-#define wmb()	asm volatile("" ::: "memory")
+#define mb()	 asm volatile("lfence" ::: "memory")
+#define rmb()	 mb()
+#define wmb()	 asm volatile("" ::: "memory")
+#define wc_wmb() asm volatile("sfence" ::: "memory")
 
 #elif defined(__PPC64__)
 
-#define mb()	asm volatile("sync" ::: "memory")
-#define rmb()	asm volatile("lwsync" ::: "memory")
-#define wmb()	mb()
+#define mb()	 asm volatile("sync" ::: "memory")
+#define rmb()	 asm volatile("lwsync" ::: "memory")
+#define wmb()	 mb()
+#define wc_wmb() wmb()
 
 #elif defined(__ia64__)
 
-#define mb()	asm volatile("mf" ::: "memory")
-#define rmb()	mb()
-#define wmb()	mb()
+#define mb()	 asm volatile("mf" ::: "memory")
+#define rmb()	 mb()
+#define wmb()	 mb()
+#define wc_wmb() asm volatile("fwb" ::: "memory")
 
 #elif defined(__PPC__)
 
-#define mb()	asm volatile("sync" ::: "memory")
-#define rmb()	mb()
-#define wmb()	asm volatile("eieio" ::: "memory")
+#define mb()	 asm volatile("sync" ::: "memory")
+#define rmb()	 mb()
+#define wmb()	 asm volatile("eieio" ::: "memory")
+#define wc_wmb() wmb()
 
 #elif defined(__sparc_v9__)
 
-#define mb()	asm volatile("membar #LoadLoad | #LoadStore | #StoreStore | #StoreLoad" ::: "memory")
-#define rmb()	asm volatile("membar #LoadLoad" ::: "memory")
-#define wmb()	asm volatile("membar #StoreStore" ::: "memory")
+#define mb()	 asm volatile("membar #LoadLoad | #LoadStore | #StoreStore | #StoreLoad" ::: "memory")
+#define rmb()	 asm volatile("membar #LoadLoad" ::: "memory")
+#define wmb()	 asm volatile("membar #StoreStore" ::: "memory")
+#define wc_wmb() wmb()
 
 #elif defined(__sparc__)
 
-#define mb()	asm volatile("" ::: "memory")
-#define rmb()	mb()
-#define wmb()	mb()
+#define mb()	 asm volatile("" ::: "memory")
+#define rmb()	 mb()
+#define wmb()	 mb()
+#define wc_wmb() wmb()
 
 #else
 
 #warning No architecture specific defines found.  Using generic implementation.
 
-#define mb()	asm volatile("" ::: "memory")
-#define rmb()	mb()
-#define wmb()	mb()
+#define mb()	 asm volatile("" ::: "memory")
+#define rmb()	 mb()
+#define wmb()	 mb()
+#define wc_wmb() wmb()
 
 #endif
 

From rdreier at cisco.com  Tue May 15 12:42:08 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 12:42:08 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070508141727.GR21591@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 8 May 2007 17:17:27 +0300")
References: <20070508141727.GR21591@mellanox.co.il>
Message-ID: <adaveetri0f.fsf@cisco.com>

...and this for libmlx4?

diff --git a/src/mlx4.h b/src/mlx4.h
index c4d389f..1e92b88 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -65,6 +65,20 @@
 #  define wmb() mb()
 #endif
 
+#ifndef wc_wmb
+
+#if defined(__i386__)
+#define wc_wmb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
+#elif defined(__x86_64__)
+#define wc_wmb() asm volatile("sfence" ::: "memory")
+#elif defined(__ia64__)
+#define wc_wmb() asm volatile("fwb" ::: "memory")
+#else
+#define wc_wmb() wmb()
+#endif
+
+#endif
+
 #define HIDDEN		__attribute__((visibility ("hidden")))
 
 #define PFX		"mlx4: "
diff --git a/src/qp.c b/src/qp.c
index a70e5f2..a4384f9 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -282,9 +282,12 @@ out:
 		++qp->sq.head;
 
 		pthread_spin_lock(&ctx->bf_lock);
+
 		memcpy(ctx->bf_page + ctx->bf_offset, ctrl, align(size * 16, 64));
-		/* FIXME flush wc buffers */
+		wc_wmb();
+
 		ctx->bf_offset ^= ctx->bf_buf_size;
+
 		pthread_spin_unlock(&ctx->bf_lock);
 	} else if (nreq) {
 		qp->sq.head += nreq;


From rdreier at cisco.com  Tue May 15 12:51:26 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 12:51:26 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <adaveetri0f.fsf@cisco.com> (Roland Dreier's message of "Tue,
	15 May 2007 12:42:08 -0700")
References: <20070508141727.GR21591@mellanox.co.il> <adaveetri0f.fsf@cisco.com>
Message-ID: <adar6phrhkx.fsf@cisco.com>

 >  		memcpy(ctx->bf_page + ctx->bf_offset, ctrl, align(size * 16, 64));

By the way, why are we aligning the size of the WQE we copy to 64
bytes?  I copied this from Jack's code but I don't see anything that
requires it.  We already have:

	if (nreq == 1 && inl && size > 1 && size < ctx->bf_buf_size / 16) {

so we will always have at least 32 bytes to copy.

 - R.


From mst at dev.mellanox.co.il  Tue May 15 13:06:18 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 May 2007 23:06:18 +0300
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <adaodknw5xo.fsf@cisco.com>
References: <20070508141727.GR21591@mellanox.co.il> <ada4pmjz7tm.fsf@cisco.com>
	<20070513051806.GB7402@mellanox.co.il> <adaodknw5xo.fsf@cisco.com>
Message-ID: <20070515200618.GF4161@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: libmlx4 wc flash
> 
>  > By the way, I just re-checked and it seems that WC support first
>  > appeared in Pentium II systems. So I think we should be able to
>  > use sfence if WC is enabled.
> 
> That's actually doubly wrong: WC support was added in Pentium Pro, and
> sfence was added in Pentium III.

OK, here's what I remembered, after checking the sources:

	This memory type is available in the Pentium Pro and
	Pentium II processors by programming the MTRRs or in the Pentium III,
	Pentium 4, and Intel Xeon processors by programming the MTRRs
	or by selecting it through the PAT.

so what it comes down to, is that if we assume that WC
will *only be enabled through PAT* then it's safe to use sfence
in this case. Right?

-- 
MST


From rick.jones2 at hp.com  Tue May 15 13:08:24 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Tue, 15 May 2007 13:08:24 -0700
Subject: [ofa-general] Re: [ewg] OFED 1.2 rc3 release
In-Reply-To: <7.0.1.0.2.20070515125402.02784f60@lanl.gov>
References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com>	<6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com>	<46487AE8.1020005@mellanox.co.il>	<6a122cc00705140844k7f65a988x8746c4ea8474b920@mail.gmail.com>
	<7.0.1.0.2.20070515125402.02784f60@lanl.gov>
Message-ID: <464A1338.7040405@hp.com>

Parks Fields wrote:
> 
> 
> Ofed 1.2 rc3 is running IPoIB slower than rc2 or earlier
> 
> I used to get ~800MB/sec with no tuning and now I get ~650MB/sec ??
> 
> Ideas??

Not specific to IPoIB, but whenever something like that happens to me I start 
with things like:

*) CPU utilization - did that and the netperf (assuming netperf) service demand 
increase?

*) packet losses?

*) change in MTU?

*) change in interrupt behaviour by the I/O card? (ie did netperf TCP_RR change 
much?)

rick jones


From mst at dev.mellanox.co.il  Tue May 15 13:11:05 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 May 2007 23:11:05 +0300
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <adar6phrhkx.fsf@cisco.com>
References: <20070508141727.GR21591@mellanox.co.il>
	<adaveetri0f.fsf@cisco.com> <adar6phrhkx.fsf@cisco.com>
Message-ID: <20070515201105.GG4161@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: libmlx4 wc flash
> 
>  >  		memcpy(ctx->bf_page + ctx->bf_offset, ctrl, align(size * 16, 64));
> 
> By the way, why are we aligning the size of the WQE we copy to 64
> bytes?  I copied this from Jack's code but I don't see anything that
> requires it.  We already have:
> 
> 	if (nreq == 1 && inl && size > 1 && size < ctx->bf_buf_size / 16) {
> 
> so we will always have at least 32 bytes to copy.

This is an Intel-specific optimization (for new Intel processors):

Once the processor has started to evict data from the WC buffer into system
memory, it will make a bus-transaction style decision based on how much of the
buffer contains valid data. If the buffer is full (for example, all bytes are
valid) the processor will execute a burst-write transaction on the bus that
will result in all 32 bytes (P6 family processors) or 64 bytes (Pentium 4 and
Intel Xeon processor) being transmitted on the data bus in a single burst
transaction. If one or more of the WC buffer’s bytes are invalid (for example,
have not been written by software) then the processor will transmit the data to
memory using “partial write” transactions (one chunk at a time, where a “chunk”
is 8 bytes).

In other words, it is important to fill the whole WC buffer to get good speed.

Need to check which sizes are good for AMD, PPC, ...
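
A minimal sketch of the rounding that align(size * 16, 64) performs in
the memcpy() call under discussion. The helper names align_up() and
bf_copy_len() are hypothetical and not taken from libmlx4; the only
assumptions are that size counts 16-byte units and that the WC buffer
size is a power of two.

```c
#include <stddef.h>

/* Round val up to the next multiple of to; to must be a power of two. */
static inline size_t align_up(size_t val, size_t to)
{
	return (val + to - 1) & ~(to - 1);
}

/*
 * Number of bytes to copy for a WQE of `size` 16-byte units so that
 * whole WC buffers of wc_line bytes are filled (hypothetical helper).
 */
static inline size_t bf_copy_len(size_t size, size_t wc_line)
{
	return align_up(size * 16, wc_line);
}
```

With a 64-byte WC buffer, a 2-unit (32-byte) WQE is copied as 64 bytes
and a 5-unit (80-byte) WQE as 128 bytes, so every buffer that is
touched is written in full.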

-- 
MST


From rdreier at cisco.com  Tue May 15 13:11:55 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 13:11:55 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070515200618.GF4161@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 15 May 2007 23:06:18 +0300")
References: <20070508141727.GR21591@mellanox.co.il>
	<ada4pmjz7tm.fsf@cisco.com> <20070513051806.GB7402@mellanox.co.il>
	<adaodknw5xo.fsf@cisco.com> <20070515200618.GF4161@mellanox.co.il>
Message-ID: <adaejlhrgms.fsf@cisco.com>

 > so what it comes down to, is that if we assume that WC
 > will *only be enabled through PAT* then it's safe to use sfence
 > in this case. Right?

I don't think we can really make any assumptions about what
instructions a 32-bit x86 processor has available.  Who knows what
wacky stuff VIA or someone like that will come up with?

The best thing seems to be just to stick to "lock; addl $0,0(%%esp)"
for 32-bit x86.  We now have the infrastructure to support multiple
builds of libraries and have ld.so select automatically at runtime but
I'm not sure it's really worth it.

 - R.


From mst at dev.mellanox.co.il  Tue May 15 13:16:34 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 May 2007 23:16:34 +0300
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <adaveetri0f.fsf@cisco.com>
References: <20070508141727.GR21591@mellanox.co.il> <adaveetri0f.fsf@cisco.com>
Message-ID: <20070515201634.GH4161@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: libmlx4 wc flash
> 
> ...and this for libmlx4?
> 
> diff --git a/src/mlx4.h b/src/mlx4.h
> index c4d389f..1e92b88 100644
> --- a/src/mlx4.h
> +++ b/src/mlx4.h
> @@ -65,6 +65,20 @@
>  #  define wmb() mb()
>  #endif
>  
> +#ifndef wc_wmb
> +
> +#if defined(__i386__)
> +#define wc_wmb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
> +#elif defined(__x86_64__)
> +#define wc_wmb() asm volatile("sfence" ::: "memory")
> +#elif defined(__ia64__)
> +#define wc_wmb() asm volatile("fwb" ::: "memory")
> +#else
> +#define wc_wmb() wmb()
> +#endif
> +
> +#endif
> +
>  #define HIDDEN		__attribute__((visibility ("hidden")))
>  
>  #define PFX		"mlx4: "
> diff --git a/src/qp.c b/src/qp.c
> index a70e5f2..a4384f9 100644
> --- a/src/qp.c
> +++ b/src/qp.c
> @@ -282,9 +282,12 @@ out:
>  		++qp->sq.head;
>  
>  		pthread_spin_lock(&ctx->bf_lock);
> +
>  		memcpy(ctx->bf_page + ctx->bf_offset, ctrl, align(size * 16, 64));
> -		/* FIXME flush wc buffers */
> +		wc_wmb();
> +
>  		ctx->bf_offset ^= ctx->bf_buf_size;
> +
>  		pthread_spin_unlock(&ctx->bf_lock);
>  	} else if (nreq) {
>  		qp->sq.head += nreq;

Since both the need for fencing and the size being copied are
architecture-dependent, it might be that a better API would be
memcpy_wc() that does the size alignment tricks and the flush
in one go.
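
A rough sketch of what such a memcpy_wc() might look like. This is an
illustration only: the name comes from the suggestion above, but the
signature, the WC_LINE constant and the wc_flush() macro are
assumptions, and the non-x86_64 fallback is simply a conservative full
barrier rather than a tuned choice.

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__)
/* Same instruction the proposed wc_wmb() uses on x86_64. */
#define wc_flush() __asm__ volatile("sfence" ::: "memory")
#else
/* Assumption: a full barrier builtin as a generic stand-in. */
#define wc_flush() __sync_synchronize()
#endif

#define WC_LINE 64	/* assumed WC buffer size in bytes */

/*
 * Copy len bytes rounded up to a whole number of WC buffers, then
 * flush.  The caller must make sure both buffers are at least as
 * large as the returned (padded) length.
 */
static inline size_t memcpy_wc(void *dst, const void *src, size_t len)
{
	size_t padded = (len + WC_LINE - 1) & ~(size_t)(WC_LINE - 1);

	memcpy(dst, src, padded);
	wc_flush();
	return padded;
}
```

Keeping the padding and the flush behind one call would let each
architecture choose its own buffer size and barrier without touching
the callers.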


-- 
MST


From rdreier at cisco.com  Tue May 15 13:22:49 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 13:22:49 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070515201105.GG4161@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 15 May 2007 23:11:05 +0300")
References: <20070508141727.GR21591@mellanox.co.il>
	<adaveetri0f.fsf@cisco.com> <adar6phrhkx.fsf@cisco.com>
	<20070515201105.GG4161@mellanox.co.il>
Message-ID: <adaabw5rg4m.fsf@cisco.com>

 > Once the processor has started to evict data from the WC buffer into system
 > memory, it will make a bus-transaction style decision based on how much of the
 > buffer contains valid data. If the buffer is full (for example, all bytes are
 > valid) the processor will execute a burst-write transaction on the bus that
 > will result in all 32 bytes (P6 family processors) or 64 bytes (Pentium 4 and
 > Intel Xeon processor) being transmitted on the data bus in a single burst
 > transaction. If one or more of the WC buffer’s bytes are invalid (for example,
 > have not been written by software) then the processor will transmit the data to
 > memory using “partial write” transactions (one chunk at a time, where a “chunk”
 > is 8 bytes).

OK, thanks.

Do you have any idea how WC works on ppc?  Is the lwsync instruction
necessary/sufficient to flush WC buffers?

 - R.


From rdreier at cisco.com  Tue May 15 13:24:50 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 13:24:50 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070515201105.GG4161@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 15 May 2007 23:11:05 +0300")
References: <20070508141727.GR21591@mellanox.co.il>
	<adaveetri0f.fsf@cisco.com> <adar6phrhkx.fsf@cisco.com>
	<20070515201105.GG4161@mellanox.co.il>
Message-ID: <ada646trg19.fsf@cisco.com>

>  >  		memcpy(ctx->bf_page + ctx->bf_offset, ctrl, align(size * 16, 64));

By the way, if we have an SQ with 32-byte WQEs, and we do blueflame
from a WQE at the end of the buffer, we might end up reading off the
end of the buffer.  Not very likely, I guess.

I wonder if memset(,0,) for the remaining bytes might be faster anyway?
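
For illustration, a sketch of the memset() variant: copy only the
bytes the WQE really has and zero the rest of the WC buffer, so the
source is never read past its end. The function name and parameters
are hypothetical, and whether this is faster than the padded memcpy()
is not something the sketch can answer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy len bytes, then zero-fill up to the next multiple of line
 * (a power of two).  Returns the total number of bytes written.
 */
static inline size_t bf_copy_zero_pad(void *dst, const void *src,
				      size_t len, size_t line)
{
	size_t padded = (len + line - 1) & ~(line - 1);

	memcpy(dst, src, len);
	memset((char *)dst + len, 0, padded - len);
	return padded;
}
```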

 - R.


From mst at dev.mellanox.co.il  Tue May 15 13:43:35 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 May 2007 23:43:35 +0300
Subject: [ofa-general] movnt (Was Re: libmlx4 wc flash)
In-Reply-To: <adamz09yc19.fsf@cisco.com>
References: <20070508141727.GR21591@mellanox.co.il> <ada4pmjz7tm.fsf@cisco.com>
	<20070512172927.GA5908@mellanox.co.il> <adamz09yc19.fsf@cisco.com>
Message-ID: <20070515204335.GI4161@mellanox.co.il>

>  > I don't think it works this way: if PAT is programmed to UC,
>  > I think you get UC access with movntq. No?
> 
> You're right -- I misremembered what the non-temporal stuff does, but
> I just checked and the manual says:
> 
>  "The memory type of the region being written to can override the
>   non-temporal hint, if the memory address specified for the
>   non-temporal store is in an uncacheable (UC) or write protected (WP)
>   memory region."

Actually, I think I just thought up a way to solve this, and I quote in full:

Vol. 1 10-19
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE)

These SSE and SSE2 non-temporal store instructions minimize cache pollutions by
treating the memory being accessed as the write combining (WC) type. If a program
specifies a non-temporal store with one of these instructions and the destination
region is mapped as cacheable memory (write back [WB], write through [WT] or WC
memory type), the processor will do the following:
• If the memory location being written to is present in the cache hierarchy, the data
in the caches is evicted.
• The non-temporal data is written to memory with WC semantics.
See also: Chapter 10, “Memory Cache Control,” in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 3A.
Using the WC semantics, the store transaction will be weakly ordered, meaning that
the data may not be written to memory in program order, and the store will not write
allocate (that is, the processor will not fetch the corresponding cache line into the
cache hierarchy, prior to performing the store). Also, different processor implementations
may choose to collapse and combine these stores.
The memory type of the region being written to can override the non-temporal hint,
if the memory address specified for the non-temporal store is in uncacheable
memory. Uncacheable as referred to here means that the region being written to has
been mapped with either an uncacheable (UC) or write protected (WP) memory type.

-------------

So we can map the device memory with WB or WT semantics, and movnt will enable
WC. The nice thing about this trick is that both WB and WT *are already
programmed into PAT after reset*, which means that we can use them for pages we
map for userspace, without stepping on anyone's toes or waiting for
the generic in-kernel support for WC to materialize.

Another nice thing is that all WRs are 16-byte aligned so we can
use the aligned instructions there.

Given that full WC support in the kernel is likely to take
quite a while to materialize, maybe that's the way to go for now?
What do you think?

I attach a header file that implements WC memcpy with these
instructions for lengths from 16 to 128 bytes (and one can,
naturally, just call xmm_copy64 in a loop), that I wrote for fun
at some point. Feel free to read/flame/reuse in any way you like.

As far as I remember, replacing memcpy with this hack resulted
in a marginal latency speedup on Intel, likely on account
of the loop unrolling I did there.
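
The attachment itself is not reproduced in the archive. As a stand-in,
here is a minimal sketch of a copy built on non-temporal stores using
the SSE2 intrinsics from <emmintrin.h>. The function name copy_nt() is
hypothetical, the loop is not unrolled, and on builds without SSE2 it
simply falls back to memcpy(), so it should not be read as the
contents of xmm.h.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Copy len bytes using non-temporal (movntdq) stores.  Assumes len is
 * a multiple of 16 and that both pointers are 16-byte aligned.
 */
static inline void copy_nt(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
	__m128i *d = dst;
	const __m128i *s = src;
	size_t i;

	for (i = 0; i < len / 16; i++)
		_mm_stream_si128(&d[i], _mm_load_si128(&s[i]));

	/* Order the weakly ordered stores before anything that follows. */
	_mm_sfence();
#else
	memcpy(dst, src, len);
#endif
}
```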

-- 
MST

-------------- next part --------------
A non-text attachment was scrubbed...
Name: xmm.h
Type: text/x-chdr
Size: 2123 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070515/47b4c72e/attachment.h>

From mst at dev.mellanox.co.il  Tue May 15 13:48:04 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 May 2007 23:48:04 +0300
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <ada646trg19.fsf@cisco.com>
References: <20070508141727.GR21591@mellanox.co.il>
	<adaveetri0f.fsf@cisco.com> <adar6phrhkx.fsf@cisco.com>
	<20070515201105.GG4161@mellanox.co.il> <ada646trg19.fsf@cisco.com>
Message-ID: <20070515204804.GJ4161@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: libmlx4 wc flash
> 
> >  >  		memcpy(ctx->bf_page + ctx->bf_offset, ctrl, align(size * 16, 64));
> 
> By the way, if we have an SQ with 32-byte WQEs, and we do blueflame
> from a WQE at the end of the buffer, we might end up reading off the
> end of the buffer.  Not very likely, I guess.

Hmm, is 32-byte SQ wqe shift actually possible? Which parameters give this?

> I wonder if memset(,0,) for the remaining bytes might be faster anyway?

In some early testing it seemed much slower. Try it :)

-- 
MST


From mst at dev.mellanox.co.il  Tue May 15 13:50:49 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 May 2007 23:50:49 +0300
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <adaabw5rg4m.fsf@cisco.com>
References: <20070508141727.GR21591@mellanox.co.il>
	<adaveetri0f.fsf@cisco.com> <adar6phrhkx.fsf@cisco.com>
	<20070515201105.GG4161@mellanox.co.il> <adaabw5rg4m.fsf@cisco.com>
Message-ID: <20070515205049.GK4161@mellanox.co.il>

> Do you have any idea how WC works on ppc?  Is the lwsync instruction
> necessary/sufficient to flush WC buffers?

Dunno yet. I hear Jack here plans to start looking at ppc RSN.

-- 
MST


From mst at dev.mellanox.co.il  Tue May 15 14:04:53 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 16 May 2007 00:04:53 +0300
Subject: [ofa-general] [PATCH RFC/untested v0] IPoIB/CM: fix SRQ WR leak
Message-ID: <20070515210453.GL4161@mellanox.co.il>

If the Consumer does not wait for the Affiliated Asynchronous Last WQE Reached
Event, then WQE and Data Segment leakage may occur.
This leakage has been observed with IPoIB/CM: flipping ports on and off will,
over time, leak all WRs, and then all connections will start getting RNR
NACKs. Fix this in the way suggested by the spec: create a "drain QP" in the
error state, wait for the Last WQE Reached event on a QP, and then post a send
on the drain QP. Once we observe a completion on the drain QP, it's safe to
call ib_destroy_qp.

---

Roland, all. Here's a largish, and untested, patch that fixes a design bug in
the way IPoIB/CM destroyed passive connections.

Unfortunately, doing it by the spec kind of forces us to add a "state"
for passive connections, and split the passive list per connection state.
That's why the patch grew to be so large.

I expect to post a fully tested version by the beginning of next week. The
issue addressed is very severe (the workaround is to unload the ipoib module
once in a while) and I do think we need this fixed in 2.6.22, so
given how large the patch is, I'd like to ask everyone to review and comment.

NB: this is on top of 2.6.22-rc1.
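
To make the intended life cycle of a passive connection easier to
follow, here is a small userspace model of the teardown sequence
described above. It is a simplification under stated assumptions: the
state and event names are invented for this sketch, and the patch
itself tracks "drain started" by list membership rather than by a
separate state.

```c
/* Illustrative states; only LIVE, ERROR and FLUSH exist in the patch. */
enum rx_state {
	RX_LIVE,	/* connection in use */
	RX_ERROR,	/* QP moved to the error state */
	RX_FLUSH,	/* Last WQE Reached event observed */
	RX_DRAINING,	/* send posted on the drain QP */
	RX_DRAINED	/* completion seen on the drain QP */
};

enum rx_event {
	EV_QP_TO_ERROR,
	EV_LAST_WQE_REACHED,
	EV_DRAIN_WR_POSTED,
	EV_DRAIN_WC_SEEN
};

/* Advance one step; events that arrive out of order are ignored. */
static enum rx_state rx_step(enum rx_state s, enum rx_event e)
{
	switch (s) {
	case RX_LIVE:
		return e == EV_QP_TO_ERROR ? RX_ERROR : s;
	case RX_ERROR:
		return e == EV_LAST_WQE_REACHED ? RX_FLUSH : s;
	case RX_FLUSH:
		return e == EV_DRAIN_WR_POSTED ? RX_DRAINING : s;
	case RX_DRAINING:
		return e == EV_DRAIN_WC_SEEN ? RX_DRAINED : s;
	default:
		return s;
	}
}

/* Destroying the QP is only safe once the drain completion is seen. */
static int rx_safe_to_destroy(enum rx_state s)
{
	return s == RX_DRAINED;
}
```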

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 87310ee..e300c75 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -132,12 +132,39 @@ struct ipoib_cm_data {
 	__be32 mtu;
 };
 
+/* Quoting spec:
+ *
+ * If the Consumer does not wait for the Affiliated Asynchronous Last WQE Reached
+ * Event, then WQE and Data Segment leakage may occur. Therefore, it is good
+ * programming practice to tear down a QP that is associated with an SRQ by using
+ * the following process:
+ *
+ *
+ * Put the QP in the Error State
+ * Wait for the Affiliated Asynchronous Last WQE Reached Event;
+ * either:
+ *       drain the CQ by invoking the Poll CQ verb and either wait for CQ
+ *       to be empty or the number of Poll CQ operations has exceeded
+ *       CQ capacity size;
+ * or
+ *       post another WR that completes on the same CQ and wait for this
+ *       WR to return as a WC;
+ * and then invoke a Destroy QP or Reset QP.
+ */
+
+enum ipoib_cm_state {
+	IPOIB_CM_RX_LIVE,
+	IPOIB_CM_RX_ERROR, /* Ignored by stale task */
+	IPOIB_CM_RX_FLUSH  /* Last WQE Reached event observed */
+};
+
 struct ipoib_cm_rx {
 	struct ib_cm_id     *id;
 	struct ib_qp        *qp;
 	struct list_head     list;
 	struct net_device   *dev;
 	unsigned long        jiffies;
+	enum ipoib_cm_state  state;
 };
 
 struct ipoib_cm_tx {
@@ -165,10 +192,15 @@ struct ipoib_cm_dev_priv {
 	struct ib_srq  	       *srq;
 	struct ipoib_cm_rx_buf *srq_ring;
 	struct ib_cm_id        *id;
-	struct list_head        passive_ids;
+	struct ib_qp           *rx_drain_qp;
+	struct list_head        passive_ids;   /* state: LIVE */
+	struct list_head        rx_error_list; /* state: ERROR */
+	struct list_head        rx_flush_list; /* state: FLUSH, drain not started */
+	struct list_head        rx_drain_list; /* state: FLUSH, drain started */
 	struct work_struct      start_task;
 	struct work_struct      reap_task;
 	struct work_struct      skb_task;
+	struct work_struct      rx_drain_task;
 	struct delayed_work     stale_task;
 	struct sk_buff_head     skb_queue;
 	struct list_head        start_list;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 785bc85..f6a1405 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -62,6 +62,17 @@ struct ipoib_cm_id {
 	u32 remote_mtu;
 };
 
+static struct ib_qp_attr ipoib_cm_err_attr __read_mostly = {
+	.qp_state = IB_QPS_ERR
+};
+
+#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff
+
+static struct ib_send_wr ipoib_cm_rx_drain_wr __read_mostly = {
+	.wr_id = IPOIB_CM_RX_DRAIN_WRID,
+	.opcode = IB_WR_SEND
+};
+
 static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
 			       struct ib_cm_event *event);
 
@@ -150,11 +161,42 @@ partial_error:
 	return NULL;
 }
 
+static void ipoib_cm_rx_drain(struct ipoib_dev_priv* priv)
+{
+	struct ib_send_wr *bad_send_wr;
+
+	if (list_empty(&priv->cm.rx_flush_list) ||
+	    !list_empty(&priv->cm.rx_drain_list))
+		return;
+
+	if (ib_post_send(priv->cm.rx_drain_qp, &ipoib_cm_rx_drain_wr, &bad_send_wr))
+		ipoib_warn(priv, "failed to start rx flush\n");
+
+	list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list);
+}
+
+static void ipoib_cm_rx_event_handler(struct ib_event *event, void *ctx)
+{
+	struct ipoib_cm_rx *p = ctx;
+	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
+	unsigned long flags;
+
+	if (event->event != IB_EVENT_QP_LAST_WQE_REACHED)
+		return;
+
+	spin_lock_irqsave(&priv->lock, flags);
+	list_move(&p->list, &priv->cm.rx_flush_list);
+	p->state = IPOIB_CM_RX_FLUSH;
+	ipoib_cm_rx_drain(priv);
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
 static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 					   struct ipoib_cm_rx *p)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
+		.event_handler = ipoib_cm_rx_event_handler,
 		.send_cq = priv->cq, /* does not matter, we never send anything */
 		.recv_cq = priv->cq,
 		.srq = priv->cm.srq,
@@ -256,6 +298,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 
 	cm_id->context = p;
 	p->jiffies = jiffies;
+	p->state = IPOIB_CM_RX_LIVE;
 	spin_lock_irq(&priv->lock);
 	list_add(&p->list, &priv->cm.passive_ids);
 	spin_unlock_irq(&priv->lock);
@@ -271,12 +314,12 @@ err_qp:
 	return ret;
 }
 
+
 static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id,
 			       struct ib_cm_event *event)
 {
 	struct ipoib_cm_rx *p;
 	struct ipoib_dev_priv *priv;
-	int ret;
 
 	switch (event->event) {
 	case IB_CM_REQ_RECEIVED:
@@ -288,20 +331,9 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id,
 	case IB_CM_REJ_RECEIVED:
 		p = cm_id->context;
 		priv = netdev_priv(p->dev);
-		spin_lock_irq(&priv->lock);
-		if (list_empty(&p->list))
-			ret = 0; /* Connection is going away already. */
-		else {
-			list_del_init(&p->list);
-			ret = -ECONNRESET;
-		}
-		spin_unlock_irq(&priv->lock);
-		if (ret) {
-			ib_destroy_qp(p->qp);
-			kfree(p);
-			return ret;
-		}
-		return 0;
+		if (ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE))
+			ipoib_warn(priv, "unable to move qp to error state\n");
+		/* Fall through */
 	default:
 		return 0;
 	}
@@ -353,8 +385,11 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		       wr_id, wc->status);
 
 	if (unlikely(wr_id >= ipoib_recvq_size)) {
-		ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
-			   wr_id, ipoib_recvq_size);
+		if (wr_id == IPOIB_CM_RX_DRAIN_WRID)
+			queue_work(ipoib_workqueue, &priv->cm.rx_drain_task);
+		else
+			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
+				   wr_id, ipoib_recvq_size);
 		return;
 	}
 
@@ -373,9 +408,9 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
 			spin_lock_irqsave(&priv->lock, flags);
 			p->jiffies = jiffies;
-			/* Move this entry to list head, but do
-			 * not re-add it if it has been removed. */
-			if (!list_empty(&p->list))
+			/* Move this entry to list head, but do not re-add it
+			 * if it has been moved out of list. */
+			if (p->state == IPOIB_CM_RX_LIVE)
 				list_move(&p->list, &priv->cm.passive_ids);
 			spin_unlock_irqrestore(&priv->lock, flags);
 			queue_delayed_work(ipoib_workqueue,
@@ -584,17 +619,40 @@ static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
 int ipoib_cm_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_attr qp_attr;
 	int ret;
 
 	if (!IPOIB_CM_SUPPORTED(dev->dev_addr))
 		return 0;
 
+	priv->cm.rx_drain_qp = ipoib_cm_create_rx_qp(dev, NULL);
+	if (IS_ERR(priv->cm.rx_drain_qp)) {
+		printk(KERN_WARNING "%s: failed to create drain QP\n", priv->ca->name);
+		ret = PTR_ERR(priv->cm.rx_drain_qp);
+		return ret;
+	}
+
+	qp_attr.qp_state = IB_QPS_INIT;
+	qp_attr.port_num = priv->port;
+	qp_attr.qkey = 0;
+	qp_attr.qp_access_flags = 0;
+	ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr,
+			   IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PORT | IB_QP_QKEY);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify drain QP to INIT: %d\n", ret);
+		goto err_qp;
+	}
+	qp_attr.qp_state = IB_QPS_ERR;
+	ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr, IB_QP_STATE);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret);
+		goto err_qp;
+	}
+
 	priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev);
 	if (IS_ERR(priv->cm.id)) {
 		printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name);
-		ret = PTR_ERR(priv->cm.id);
-		priv->cm.id = NULL;
-		return ret;
+		goto err_cm;
 	}
 
 	ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num),
@@ -602,35 +660,76 @@ int ipoib_cm_dev_open(struct net_device *dev)
 	if (ret) {
 		printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name,
 		       IPOIB_CM_IETF_ID | priv->qp->qp_num);
-		ib_destroy_cm_id(priv->cm.id);
-		priv->cm.id = NULL;
-		return ret;
+		goto err_cm;
 	}
+
 	return 0;
+
+err_cm:
+	ib_destroy_cm_id(priv->cm.id);
+	priv->cm.id = NULL;
+err_qp:
+	ib_destroy_qp(priv->cm.rx_drain_qp);
+	return ret;
 }
 
 void ipoib_cm_dev_stop(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_cm_rx *p;
+	struct ipoib_cm_rx *p, *n;
+	unsigned long begin;
+	LIST_HEAD(list);
+	int ret;
 
 	if (!IPOIB_CM_SUPPORTED(dev->dev_addr) || !priv->cm.id)
 		return;
 
 	ib_destroy_cm_id(priv->cm.id);
 	priv->cm.id = NULL;
+
 	spin_lock_irq(&priv->lock);
 	while (!list_empty(&priv->cm.passive_ids)) {
-		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
-		list_del_init(&p->list);
+		p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list);
+		list_move(&p->list, &priv->cm.rx_error_list);
+		p->state = IPOIB_CM_RX_ERROR;
 		spin_unlock_irq(&priv->lock);
+		ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
+		if (ret)
+			ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
+		spin_lock_irq(&priv->lock);
+	}
+
+	/* Wait for all RX to be drained */
+	begin = jiffies;
+
+	while (!list_empty(&priv->cm.rx_error_list) ||
+	       !list_empty(&priv->cm.rx_flush_list) ||
+	       !list_empty(&priv->cm.rx_drain_list)) {
+		if (time_after(jiffies, begin + 5 * HZ)) {
+			ipoib_warn(priv, "RX drain timing out\n");
+
+			/*
+			 * assume the HW is wedged and just free up everything.
+			 */
+			list_splice_init(&priv->cm.rx_flush_list, &list);
+			list_splice_init(&priv->cm.rx_error_list, &list);
+			list_splice_init(&priv->cm.rx_drain_list, &list);
+			break;
+		}
+		spin_unlock_irq(&priv->lock);
+		msleep(1);
+		spin_lock_irq(&priv->lock);
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	list_for_each_entry_safe(p, n, &list, list) {
 		ib_destroy_cm_id(p->id);
 		ib_destroy_qp(p->qp);
 		kfree(p);
-		spin_lock_irq(&priv->lock);
 	}
-	spin_unlock_irq(&priv->lock);
 
+	ib_destroy_qp(priv->cm.rx_drain_qp);
 	cancel_delayed_work(&priv->cm.stale_task);
 }
 
@@ -1080,24 +1179,45 @@ void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb,
 		queue_work(ipoib_workqueue, &priv->cm.skb_task);
 }
 
+static void ipoib_cm_rx_drain_task(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
+						   cm.rx_drain_task);
+	struct ipoib_cm_rx *p, *n;
+	LIST_HEAD(list);
+
+	spin_lock_irq(&priv->lock);
+	list_splice_init(&priv->cm.rx_drain_list, &list);
+	ipoib_cm_rx_drain(priv);
+	spin_unlock_irq(&priv->lock);
+
+	list_for_each_entry_safe(p, n, &list, list) {
+		ib_destroy_cm_id(p->id);
+		ib_destroy_qp(p->qp);
+		kfree(p);
+	}
+}
+
 static void ipoib_cm_stale_task(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
 						   cm.stale_task.work);
 	struct ipoib_cm_rx *p;
+	int ret;
 
 	spin_lock_irq(&priv->lock);
 	while (!list_empty(&priv->cm.passive_ids)) {
-		/* List if sorted by LRU, start from tail,
+		/* List is sorted by LRU, start from tail,
 		 * stop when we see a recently used entry */
 		p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list);
 		if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT))
 			break;
-		list_del_init(&p->list);
+		list_move(&p->list, &priv->cm.rx_error_list);
+		p->state = IPOIB_CM_RX_ERROR;
 		spin_unlock_irq(&priv->lock);
-		ib_destroy_cm_id(p->id);
-		ib_destroy_qp(p->qp);
-		kfree(p);
+		ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
+		if (ret)
+			ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
 		spin_lock_irq(&priv->lock);
 	}
 	spin_unlock_irq(&priv->lock);
@@ -1161,9 +1281,12 @@ int ipoib_cm_dev_init(struct net_device *dev)
 	INIT_LIST_HEAD(&priv->cm.passive_ids);
 	INIT_LIST_HEAD(&priv->cm.reap_list);
 	INIT_LIST_HEAD(&priv->cm.start_list);
+	INIT_LIST_HEAD(&priv->cm.rx_flush_list);
+	INIT_LIST_HEAD(&priv->cm.rx_error_list);
 	INIT_WORK(&priv->cm.start_task, ipoib_cm_tx_start);
 	INIT_WORK(&priv->cm.reap_task, ipoib_cm_tx_reap);
 	INIT_WORK(&priv->cm.skb_task, ipoib_cm_skb_reap);
+	INIT_WORK(&priv->cm.rx_drain_task, ipoib_cm_rx_drain_task);
 	INIT_DELAYED_WORK(&priv->cm.stale_task, ipoib_cm_stale_task);
 
 	skb_queue_head_init(&priv->cm.skb_queue);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 5c3c6a4..af8a6d4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -185,7 +185,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 	size = ipoib_sendq_size + ipoib_recvq_size + 1;
 	ret = ipoib_cm_dev_init(dev);
 	if (!ret)
-		size += ipoib_recvq_size;
+		size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */;
 
 	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
 	if (IS_ERR(priv->cq)) {


-- 
MST


From eli at dev.mellanox.co.il  Tue May 15 14:05:44 2007
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 16 May 2007 00:05:44 +0300
Subject: [ofa-general] Re: [PATCH] libmlx4: WQE shift calculation
In-Reply-To: <ada7irat6xw.fsf@cisco.com>
References: <1179145102.25749.11.camel@mtls03> <adabqgmtb6e.fsf@cisco.com>
	<1179242042.25749.33.camel@mtls03> <ada7irat6xw.fsf@cisco.com>
Message-ID: <4e6a6b3c0705151405n18afdebna815a3ee1c117f74@mail.gmail.com>

On 5/15/07, Roland Dreier <rdreier at cisco.com> wrote:
>
> > First I should add the case that triggered this patch: the userspace code
> > calculated a smaller buffer size than the kernel code, which caused
> > get_user_pages() to fail since part of the buffer did not belong to the
> > process's address space.
>
> OK, in this case it seems the bug is in the kernel -- since it is
> overestimating the size of the WQEs needed.  So we might as well fix
> it in the kernel.
>
> > As Michael said in a subsequent post, we still need this code both in
> > user and in kernel.
>
> Yes, but I think this issue really convinces me that we should
> decouple the two calculations, so the kernel code is only used for
> kernel QPs.  And then change the mlx4 ABI so that userspace tells the
> kernel the wqe buffer size and rq/sq wqe shift/offset.  That will
> allow for different SQ BB sizes and also make things more robust
> against bugs like this.
>
> - R.


So it looks like we can start by:
1. Change the user code to pass the size to kernel
2. Fix calculations in kernel.

Would you like me to send patches, or do you prefer to write the code
yourself? If you prefer to code this, can you tell me when that would be?
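
As a rough sketch of what step 1 could look like: userspace computes the WQE
strides and the total buffer size once and passes them down, so the kernel
pins exactly the buffer that userspace allocated. The names below
(wqe_layout, wqe_shift, wqe_layout_calc) and the minimum shifts are made up
for illustration; this is not the actual mlx4 ABI.

```c
#include <stdint.h>

/* Hypothetical layout that userspace would hand to the kernel. */
struct wqe_layout {
	uint32_t buf_size;   /* total WQE buffer size in bytes */
	uint32_t sq_offset;  /* byte offset of the send queue in the buffer */
	uint8_t  rq_shift;   /* log2 of the receive WQE stride */
	uint8_t  sq_shift;   /* log2 of the send WQE stride */
};

/* Smallest shift >= min_shift such that (1 << shift) >= bytes. */
static uint8_t wqe_shift(uint32_t bytes, uint8_t min_shift)
{
	uint8_t shift = min_shift;

	while (shift < 31 && (1u << shift) < bytes)
		shift++;
	return shift;
}

/* RQ is placed first, SQ right after it (an assumption of this sketch). */
static struct wqe_layout wqe_layout_calc(uint32_t rq_wqes, uint32_t rq_wqe_bytes,
					 uint32_t sq_wqes, uint32_t sq_wqe_bytes)
{
	struct wqe_layout l;

	l.rq_shift  = wqe_shift(rq_wqe_bytes, 4);  /* illustrative minimum */
	l.sq_shift  = wqe_shift(sq_wqe_bytes, 6);  /* illustrative minimum */
	l.sq_offset = rq_wqes << l.rq_shift;
	l.buf_size  = l.sq_offset + (sq_wqes << l.sq_shift);
	return l;
}
```

With this split the kernel only has to validate the values it is given
rather than recompute them, which is the robustness argument made above.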

From rdreier at cisco.com  Tue May 15 14:55:03 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 14:55:03 -0700
Subject: [ofa-general] Re: [PATCH] libmlx4: WQE shift calculation
In-Reply-To: <4e6a6b3c0705151405n18afdebna815a3ee1c117f74@mail.gmail.com> (Eli
	Cohen's message of "Wed, 16 May 2007 00:05:44 +0300")
References: <1179145102.25749.11.camel@mtls03> <adabqgmtb6e.fsf@cisco.com>
	<1179242042.25749.33.camel@mtls03> <ada7irat6xw.fsf@cisco.com>
	<4e6a6b3c0705151405n18afdebna815a3ee1c117f74@mail.gmail.com>
Message-ID: <ada4pmdpxag.fsf@cisco.com>

 > So it looks like we can start by:
 > 1. Change the user code to pass the size to kernel
 > 2. Fix calculations in kernel.
 > 
 > Would you like me to send patches, or do you prefer to write the code
 > yourself? If you prefer to code this, can you tell me when that would be?

It would be great if you implement it.  Otherwise if you don't get to
it, I will probably look at it on Thursday or Friday (your weekend).

 - R.


From xma at us.ibm.com  Tue May 15 15:31:04 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 15 May 2007 15:31:04 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <20070515185445.GD4161@mellanox.co.il>
Message-ID: <OFF9DB8C42.98B6B3B0-ON872572DC.007A1937-882572DC.007B976E@us.ibm.com>


> > Regarding the memory issue w/o SRQ, do you think there is a way to use
> > low watermark to release prepost buffer in large connections?
>
> Maybe with UC - with RC you'll get RNR and connection'll get closed
> before you have time to handle the low watermark.  So sure, might be an
> interesting idea, but isn't low watermark a SRQ feature?
>
> > I think most of the prepost buffers are empty in that case because of
> > the BW.
>
> I don't really get the argument.
>
> --
> MST

      That's just a random idea. :) Other ideas would be to share the RQ
buffer based on source-destination address, or to use a per-CPU RQ buffer,
but those might hurt performance too much?

      It might be too complicated to have UD and RC modes coexist?

      Maybe it's better to set up a small RQ size for now, and later, when
the high watermark patch is available, we can use it to address RQ overrun?

Thanks
Shirley

From guthridg at us.ibm.com  Tue May 15 15:46:36 2007
From: guthridg at us.ibm.com (Scott Guthridge)
Date: Tue, 15 May 2007 18:46:36 -0400
Subject: [ofa-general] ibv_modify_port?
Message-ID: <OF7C61155C.3474413C-ON852572DC.007C3278-852572DC.007D1E06@us.ibm.com>


I'm working on implementing a DM agent at user-level for an experimental
I/O controller. Registering the DM agent via umad_register seems to work --
I can receive DM MAD's.  But there doesn't appear to be a way to set the
IB_PORT_DEVICE_MGMT_SUP bit in the port's SA PortInfo.CapabilityMask, so
mask-match SA port queries do not find my device.

I noticed that the "ib_srpt" driver does an explicit ib_modify_port in
order to set this flag.  If there were a user-level version of this
function, I could do the same.


But this leads to another point.  Implementing the DM agent (DMA) in each
target driver isn't a particularly general approach.  The problem is that
you can't implement more than one target driver behind the same channel
adapter.  For example, I cannot register my DM agent if the ib_srpt module
happens to be loaded.

I would like to propose a better interface.  What if there were a generic
DM agent in the kernel that provided an API for target devices (kernel and
user) to register IOC's with it?  It might look something like this:

      struct ib_dm_ioc {
            ...
            u8    ioc_slot;
            ...
      };

      struct ib_dm_ioc *ib_dm_register_ioc(struct ib_device *device,
            u8 port_num, const struct ib_dm_ioc_profile *ioprof);
      void ib_dm_unregister_ioc(struct ib_dm_ioc *iocp);

      /* returns service entry slot number */
      int ib_dm_add_svcent(struct ib_dm_ioc *, const char *svc_name,
          u64 service_id);

      void ib_dm_del_svcent(struct ib_dm_ioc *, int svcent_slot);

      /* additional registration fn's for diag support could be added later
         if someone feels ambitious */


The generic DMA would set the IB_PORT_DEVICE_MGMT_SUP flag in
PortInfo.CapabilityMask whenever at least one IOC is registered.

This interface would allow the DMA functionality to be removed from target
drivers, simplifying them somewhat.  And it would make it possible to
support more than one type of IOC within the same target CA.
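
To make the bookkeeping concrete, here is a minimal userspace model of the
idea, assuming a fixed-size slot table. The names (dm_agent, dm_ioc,
dm_register_ioc, dm_unregister_ioc, DM_MAX_IOCS) are hypothetical, and the
device_mgmt_sup field only stands in for the real capability bit; nothing
here touches PortInfo.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DM_MAX_IOCS 16  /* hypothetical slot limit */

struct dm_ioc {
	bool     in_use;
	uint8_t  slot;  /* 1-based controller slot */
	uint64_t guid;
};

struct dm_agent {
	struct dm_ioc iocs[DM_MAX_IOCS];
	unsigned int  nr_iocs;
	bool          device_mgmt_sup;  /* models IB_PORT_DEVICE_MGMT_SUP */
};

/* Claim the first free slot; raise the flag when the first IOC appears. */
static struct dm_ioc *dm_register_ioc(struct dm_agent *agent, uint64_t guid)
{
	for (size_t i = 0; i < DM_MAX_IOCS; i++) {
		struct dm_ioc *ioc = &agent->iocs[i];

		if (ioc->in_use)
			continue;
		ioc->in_use = true;
		ioc->slot = (uint8_t)(i + 1);
		ioc->guid = guid;
		if (agent->nr_iocs++ == 0)
			agent->device_mgmt_sup = true;
		return ioc;
	}
	return NULL;  /* table full */
}

/* Release a slot; clear the flag when the last IOC goes away. */
static void dm_unregister_ioc(struct dm_agent *agent, struct dm_ioc *ioc)
{
	if (!ioc || !ioc->in_use)
		return;
	ioc->in_use = false;
	if (--agent->nr_iocs == 0)
		agent->device_mgmt_sup = false;
}
```

Two targets can register independently and the flag stays set until both
have unregistered, which is the property the per-driver approach lacks.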


Comments?


Scott


From mshefty at ichips.intel.com  Tue May 15 16:19:19 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 15 May 2007 16:19:19 -0700
Subject: [ofa-general] ibv_modify_port?
In-Reply-To: <OF7C61155C.3474413C-ON852572DC.007C3278-852572DC.007D1E06@us.ibm.com>
References: <OF7C61155C.3474413C-ON852572DC.007C3278-852572DC.007D1E06@us.ibm.com>
Message-ID: <464A3FF7.6090101@ichips.intel.com>

> I would like to propose a better interface.  What if there were a generic
> DM agent in the kernel that provided an API for target devices (kernel and
> user) to register IOC's with it?  It might look something like this:

A generic DM makes sense.

There are existing interfaces / implementations available in some of the 
legacy code that might be of use for a starting point.  I know there's 
some DM related code in the svn database in the gen1 branch.  There may 
be other implementations under the trunk/contrib directories as well, 
but I didn't actually check there.

- Sean


From rdreier at cisco.com  Tue May 15 19:32:15 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 19:32:15 -0700
Subject: [ofa-general] ibv_modify_port?
In-Reply-To: <OF7C61155C.3474413C-ON852572DC.007C3278-852572DC.007D1E06@us.ibm.com>
	(Scott Guthridge's message of "Tue, 15 May 2007 18:46:36 -0400")
References: <OF7C61155C.3474413C-ON852572DC.007C3278-852572DC.007D1E06@us.ibm.com>
Message-ID: <adaveeto5w0.fsf@cisco.com>

 > I noticed that the "ib_srpt" driver does an explicit ib_modify_port in
 > order to set this flag.  If there were a user-level version of this
 > function, I could do the same.

It would be pretty straightforward to add something like /dev/infiniband/isdmX
that behaves like the issmX files we already have.  Or we could even
have something that automatically sets the IsDM bit when the first
agent for DM class is created and clears it when the last agent is
destroyed.  (In fact we could do the same thing for IsCM if we wanted to)

 > I would like to propose a better interface.  What if there were a generic
 > DM agent in the kernel that provided an API for target devices (kernel and
 > user) to register IOC's with it?  It might look something like this:

I'm not sure having a DM agent in the kernel is worth it.  Why not
have a generic daemon in userspace to do all the DMA stuff?  I don't
see a strong reason to put it in the kernel, and userspace code is
quite a bit easier to write...

 - R.


From vlad at lists.openfabrics.org  Wed May 16 02:39:41 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed, 16 May 2007 02:39:41 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070516-0200 daily build status
Message-ID: <20070516093941.F0B71E6082D@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From mst at dev.mellanox.co.il  Wed May 16 03:14:57 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 16 May 2007 13:14:57 +0300
Subject: [ofa-general] [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR leak
In-Reply-To: <20070515210453.GL4161@mellanox.co.il>
References: <20070515210453.GL4161@mellanox.co.il>
Message-ID: <20070516101457.GA5091@mellanox.co.il>

SRQ WR leakage has been observed with IPoIB/CM: e.g. flipping ports on and off
will, with time, leak out all WRs, and then all connections will start getting
RNR NACKs. Fix this in the way suggested by the spec: move the QP to the Error
state, wait for the Last WQE Reached event, and then post a send on a "drain QP"
connected to the same CQ. Once we observe a completion on the drain QP, it's
safe to call ib_destroy_qp.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

Changes from v0: fixed drain WR ID, comment on the algorithm used,
cleaned up the patch description.

Roland, all. This is a largish, and still untested, patch that fixes a design bug in
the way IPoIB/CM destroyed QPs connected to SRQ.

Unfortunately, doing it by the spec kind of forces us to add a "state"
for passive connections, and split the connection list per connection state.
That's why the patch grew to be so large.

The issue addressed is very severe (the only work-around is to unload the ipoib module
once in a while), so given how large the patch is, I'd like to ask everyone to review
and comment.

NB: this is on top of 2.6.22-rc1.
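
For reviewers, the per-connection life cycle can be modelled as a small
state machine. This is a simplified userspace sketch, not code from the
patch: the enum and function names are invented, and it tracks one
connection at a time, whereas the patch drains a whole list with one WR.

```c
#include <stdbool.h>

enum rx_state {
	RX_LIVE,      /* on passive_ids */
	RX_ERROR,     /* QP moved to error, waiting for Last WQE Reached */
	RX_FLUSH,     /* event seen, drain WR not posted yet */
	RX_DRAINING,  /* drain WR posted, waiting for its completion */
	RX_DRAINED    /* safe to destroy the QP */
};

enum rx_event {
	EV_MOVE_TO_ERROR,     /* modify QP to the Error state */
	EV_LAST_WQE_REACHED,  /* asynchronous event on the QP */
	EV_DRAIN_POSTED,      /* drain WR posted on the drain QP */
	EV_DRAIN_COMPLETED    /* drain WR came back as a WC */
};

/* Apply one event; returns false if it is not valid in the current state. */
static bool rx_step(enum rx_state *state, enum rx_event ev)
{
	switch (ev) {
	case EV_MOVE_TO_ERROR:
		if (*state != RX_LIVE)
			return false;
		*state = RX_ERROR;
		return true;
	case EV_LAST_WQE_REACHED:
		/* The event may also arrive if the QP went to error by itself. */
		if (*state != RX_LIVE && *state != RX_ERROR)
			return false;
		*state = RX_FLUSH;
		return true;
	case EV_DRAIN_POSTED:
		if (*state != RX_FLUSH)
			return false;
		*state = RX_DRAINING;
		return true;
	case EV_DRAIN_COMPLETED:
		if (*state != RX_DRAINING)
			return false;
		*state = RX_DRAINED;
		return true;
	}
	return false;
}

/* Destroying the QP is only safe after the drain completion was seen. */
static bool rx_may_destroy(enum rx_state state)
{
	return state == RX_DRAINED;
}
```

The point of the model is the last function: ib_destroy_qp must not run
until the drain completion has been observed, otherwise WQEs can leak.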

 ipoib.h       |   38 ++++++++++
 ipoib_cm.c    |  208 ++++++++++++++++++++++++++++++++++++++++++++++++----------
 ipoib_verbs.c |    2
 3 files changed, 212 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 87310ee..087bbfc 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -132,12 +132,43 @@ struct ipoib_cm_data {
 	__be32 mtu;
 };
 
+/*
+ * Quoting 10.3.1 Queue Pair and EE Context States:
+ *
+ * Note, for QPs that are associated with an SRQ, the Consumer should take the
+ * QP through the Error State before invoking a Destroy QP or a Modify QP to the
+ * Reset State.  The Consumer may invoke the Destroy QP without first performing
+ * a Modify QP to the Error State and waiting for the Affiliated Asynchronous
+ * Last WQE Reached Event. However, if the Consumer does not wait for the
+ * Affiliated Asynchronous Last WQE Reached Event, then WQE and Data Segment
+ * leakage may occur. Therefore, it is good programming practice to tear down a
+ * QP that is associated with an SRQ by using the following process:
+ *
+ * - Put the QP in the Error State
+ * - Wait for the Affiliated Asynchronous Last WQE Reached Event;
+ * - either:
+ *       drain the CQ by invoking the Poll CQ verb and either wait for CQ
+ *       to be empty or the number of Poll CQ operations has exceeded
+ *       CQ capacity size;
+ * - or
+ *       post another WR that completes on the same CQ and wait for this
+ *       WR to return as a WC; (NB: this is the option that we use)
+ * and then invoke a Destroy QP or Reset QP.
+ */
+
+enum ipoib_cm_state {
+	IPOIB_CM_RX_LIVE,
+	IPOIB_CM_RX_ERROR, /* Ignored by stale task */
+	IPOIB_CM_RX_FLUSH  /* Last WQE Reached event observed */
+};
+
 struct ipoib_cm_rx {
 	struct ib_cm_id     *id;
 	struct ib_qp        *qp;
 	struct list_head     list;
 	struct net_device   *dev;
 	unsigned long        jiffies;
+	enum ipoib_cm_state  state;
 };
 
 struct ipoib_cm_tx {
@@ -165,10 +196,15 @@ struct ipoib_cm_dev_priv {
 	struct ib_srq  	       *srq;
 	struct ipoib_cm_rx_buf *srq_ring;
 	struct ib_cm_id        *id;
-	struct list_head        passive_ids;
+	struct ib_qp           *rx_drain_qp;   /* generates WR described in 10.3.1 */
+	struct list_head        passive_ids;   /* state: LIVE */
+	struct list_head        rx_error_list; /* state: ERROR */
+	struct list_head        rx_flush_list; /* state: FLUSH, drain not started */
+	struct list_head        rx_drain_list; /* state: FLUSH, drain started */
 	struct work_struct      start_task;
 	struct work_struct      reap_task;
 	struct work_struct      skb_task;
+	struct work_struct      rx_drain_task;
 	struct delayed_work     stale_task;
 	struct sk_buff_head     skb_queue;
 	struct list_head        start_list;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 785bc85..d4e4cf3 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -62,6 +62,17 @@ struct ipoib_cm_id {
 	u32 remote_mtu;
 };
 
+static struct ib_qp_attr ipoib_cm_err_attr __read_mostly = {
+	.qp_state = IB_QPS_ERR
+};
+
+#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff
+
+static struct ib_send_wr ipoib_cm_rx_drain_wr __read_mostly = {
+	.wr_id = IPOIB_CM_RX_DRAIN_WRID,
+	.opcode = IB_WR_SEND
+};
+
 static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
 			       struct ib_cm_event *event);
 
@@ -150,11 +161,44 @@ partial_error:
 	return NULL;
 }
 
+static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv* priv)
+{
+	struct ib_send_wr *bad_send_wr;
+
+	/* rx_drain_qp send queue depth is 1, so
+	 * make sure we have at most 1 outstanding WR. */
+	if (list_empty(&priv->cm.rx_flush_list) ||
+	    !list_empty(&priv->cm.rx_drain_list))
+		return;
+
+	if (ib_post_send(priv->cm.rx_drain_qp, &ipoib_cm_rx_drain_wr, &bad_send_wr))
+		ipoib_warn(priv, "failed to post rx_drain wr\n");
+
+	list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list);
+}
+
+static void ipoib_cm_rx_event_handler(struct ib_event *event, void *ctx)
+{
+	struct ipoib_cm_rx *p = ctx;
+	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
+	unsigned long flags;
+
+	if (event->event != IB_EVENT_QP_LAST_WQE_REACHED)
+		return;
+
+	spin_lock_irqsave(&priv->lock, flags);
+	list_move(&p->list, &priv->cm.rx_flush_list);
+	p->state = IPOIB_CM_RX_FLUSH;
+	ipoib_cm_start_rx_drain(priv);
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
 static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 					   struct ipoib_cm_rx *p)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
+		.event_handler = ipoib_cm_rx_event_handler,
 		.send_cq = priv->cq, /* does not matter, we never send anything */
 		.recv_cq = priv->cq,
 		.srq = priv->cm.srq,
@@ -256,6 +300,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 
 	cm_id->context = p;
 	p->jiffies = jiffies;
+	p->state = IPOIB_CM_RX_LIVE;
 	spin_lock_irq(&priv->lock);
 	list_add(&p->list, &priv->cm.passive_ids);
 	spin_unlock_irq(&priv->lock);
@@ -276,7 +321,6 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id,
 {
 	struct ipoib_cm_rx *p;
 	struct ipoib_dev_priv *priv;
-	int ret;
 
 	switch (event->event) {
 	case IB_CM_REQ_RECEIVED:
@@ -288,20 +332,9 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id,
 	case IB_CM_REJ_RECEIVED:
 		p = cm_id->context;
 		priv = netdev_priv(p->dev);
-		spin_lock_irq(&priv->lock);
-		if (list_empty(&p->list))
-			ret = 0; /* Connection is going away already. */
-		else {
-			list_del_init(&p->list);
-			ret = -ECONNRESET;
-		}
-		spin_unlock_irq(&priv->lock);
-		if (ret) {
-			ib_destroy_qp(p->qp);
-			kfree(p);
-			return ret;
-		}
-		return 0;
+		if (ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE))
+			ipoib_warn(priv, "unable to move qp to error state\n");
+		/* Fall through */
 	default:
 		return 0;
 	}
@@ -353,8 +386,11 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		       wr_id, wc->status);
 
 	if (unlikely(wr_id >= ipoib_recvq_size)) {
-		ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
-			   wr_id, ipoib_recvq_size);
+		if (wr_id == IPOIB_CM_RX_DRAIN_WRID)
+			queue_work(ipoib_workqueue, &priv->cm.rx_drain_task);
+		else
+			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
+				   wr_id, ipoib_recvq_size);
 		return;
 	}
 
@@ -373,9 +409,9 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
 			spin_lock_irqsave(&priv->lock, flags);
 			p->jiffies = jiffies;
-			/* Move this entry to list head, but do
-			 * not re-add it if it has been removed. */
-			if (!list_empty(&p->list))
+			/* Move this entry to list head, but do not re-add it
+			 * if it has been moved out of list. */
+			if (p->state == IPOIB_CM_RX_LIVE)
 				list_move(&p->list, &priv->cm.passive_ids);
 			spin_unlock_irqrestore(&priv->lock, flags);
 			queue_delayed_work(ipoib_workqueue,
@@ -584,17 +620,54 @@ static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
 int ipoib_cm_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_attr qp_attr;
+	struct ib_qp_init_attr qp_init_attr = {
+		.send_cq = priv->cq,
+		.recv_cq = priv->cq, /* does not matter, we never get anything */
+		.srq = priv->cm.srq, /* does not matter, we never get anything */
+		.cap.max_send_wr = 1,
+		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
+		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.qp_type = IB_QPT_UC,
+	};
 	int ret;
 
 	if (!IPOIB_CM_SUPPORTED(dev->dev_addr))
 		return 0;
 
+	priv->cm.rx_drain_qp = ib_create_qp(priv->pd, &qp_init_attr);
+	if (IS_ERR(priv->cm.rx_drain_qp)) {
+		printk(KERN_WARNING "%s: failed to create drain QP\n", priv->ca->name);
+		ret = PTR_ERR(priv->cm.rx_drain_qp);
+		return ret;
+	}
+
+	qp_attr.qp_state = IB_QPS_INIT;
+	qp_attr.port_num = priv->port;
+	qp_attr.qkey = 0;
+	qp_attr.qp_access_flags = 0;
+	ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr,
+			   IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PORT | IB_QP_QKEY);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify drain QP to INIT: %d\n", ret);
+		goto err_qp;
+	}
+
+	/* We put the QP in error state directly: this way, hardware
+	 * will immediately generate WC for each WR we post, without
+	 * sending anything on the wire. */
+	qp_attr.qp_state = IB_QPS_ERR;
+	ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr, IB_QP_STATE);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret);
+		goto err_qp;
+	}
+
 	priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev);
 	if (IS_ERR(priv->cm.id)) {
 		printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name);
 		ret = PTR_ERR(priv->cm.id);
-		priv->cm.id = NULL;
-		return ret;
+		goto err_cm;
 	}
 
 	ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num),
@@ -602,35 +675,77 @@ int ipoib_cm_dev_open(struct net_device *dev)
 	if (ret) {
 		printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name,
 		       IPOIB_CM_IETF_ID | priv->qp->qp_num);
-		ib_destroy_cm_id(priv->cm.id);
-		priv->cm.id = NULL;
-		return ret;
+		goto err_listen;
 	}
+
 	return 0;
+
+err_listen:
+	ib_destroy_cm_id(priv->cm.id);
+err_cm:
+	priv->cm.id = NULL;
+err_qp:
+	ib_destroy_qp(priv->cm.rx_drain_qp);
+	return ret;
 }
 
 void ipoib_cm_dev_stop(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_cm_rx *p;
+	struct ipoib_cm_rx *p, *n;
+	unsigned long begin;
+	LIST_HEAD(list);
+	int ret;
 
 	if (!IPOIB_CM_SUPPORTED(dev->dev_addr) || !priv->cm.id)
 		return;
 
 	ib_destroy_cm_id(priv->cm.id);
 	priv->cm.id = NULL;
+
 	spin_lock_irq(&priv->lock);
 	while (!list_empty(&priv->cm.passive_ids)) {
 		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
-		list_del_init(&p->list);
+		list_move(&p->list, &priv->cm.rx_error_list);
+		p->state = IPOIB_CM_RX_ERROR;
 		spin_unlock_irq(&priv->lock);
+		ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
+		if (ret)
+			ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
+		spin_lock_irq(&priv->lock);
+	}
+
+	/* Wait for all RX to be drained */
+	begin = jiffies;
+
+	while (!list_empty(&priv->cm.rx_error_list) ||
+	       !list_empty(&priv->cm.rx_flush_list) ||
+	       !list_empty(&priv->cm.rx_drain_list)) {
+		if (time_after(jiffies, begin + 5 * HZ)) {
+			ipoib_warn(priv, "RX drain timing out\n");
+
+			/*
+			 * assume the HW is wedged and just free up everything.
+			 */
+			list_splice_init(&priv->cm.rx_flush_list, &list);
+			list_splice_init(&priv->cm.rx_error_list, &list);
+			list_splice_init(&priv->cm.rx_drain_list, &list);
+			break;
+		}
+		spin_unlock_irq(&priv->lock);
+		msleep(1);
+		spin_lock_irq(&priv->lock);
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	list_for_each_entry_safe(p, n, &list, list) {
 		ib_destroy_cm_id(p->id);
 		ib_destroy_qp(p->qp);
 		kfree(p);
-		spin_lock_irq(&priv->lock);
 	}
-	spin_unlock_irq(&priv->lock);
 
+	ib_destroy_qp(priv->cm.rx_drain_qp);
 	cancel_delayed_work(&priv->cm.stale_task);
 }
 
@@ -1080,24 +1195,45 @@ void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb,
 		queue_work(ipoib_workqueue, &priv->cm.skb_task);
 }
 
+static void ipoib_cm_rx_drain_task(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
+						   cm.rx_drain_task);
+	struct ipoib_cm_rx *p, *n;
+	LIST_HEAD(list);
+
+	spin_lock_irq(&priv->lock);
+	list_splice_init(&priv->cm.rx_drain_list, &list);
+	ipoib_cm_start_rx_drain(priv);
+	spin_unlock_irq(&priv->lock);
+
+	list_for_each_entry_safe(p, n, &list, list) {
+		ib_destroy_cm_id(p->id);
+		ib_destroy_qp(p->qp);
+		kfree(p);
+	}
+}
+
 static void ipoib_cm_stale_task(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
 						   cm.stale_task.work);
 	struct ipoib_cm_rx *p;
+	int ret;
 
 	spin_lock_irq(&priv->lock);
 	while (!list_empty(&priv->cm.passive_ids)) {
-		/* List if sorted by LRU, start from tail,
+		/* List is sorted by LRU, start from tail,
 		 * stop when we see a recently used entry */
 		p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list);
 		if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT))
 			break;
-		list_del_init(&p->list);
+		list_move(&p->list, &priv->cm.rx_error_list);
+		p->state = IPOIB_CM_RX_ERROR;
 		spin_unlock_irq(&priv->lock);
-		ib_destroy_cm_id(p->id);
-		ib_destroy_qp(p->qp);
-		kfree(p);
+		ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
+		if (ret)
+			ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
 		spin_lock_irq(&priv->lock);
 	}
 	spin_unlock_irq(&priv->lock);
@@ -1161,9 +1297,13 @@ int ipoib_cm_dev_init(struct net_device *dev)
 	INIT_LIST_HEAD(&priv->cm.passive_ids);
 	INIT_LIST_HEAD(&priv->cm.reap_list);
 	INIT_LIST_HEAD(&priv->cm.start_list);
+	INIT_LIST_HEAD(&priv->cm.rx_error_list);
+	INIT_LIST_HEAD(&priv->cm.rx_flush_list);
+	INIT_LIST_HEAD(&priv->cm.rx_drain_list);
 	INIT_WORK(&priv->cm.start_task, ipoib_cm_tx_start);
 	INIT_WORK(&priv->cm.reap_task, ipoib_cm_tx_reap);
 	INIT_WORK(&priv->cm.skb_task, ipoib_cm_skb_reap);
+	INIT_WORK(&priv->cm.rx_drain_task, ipoib_cm_rx_drain_task);
 	INIT_DELAYED_WORK(&priv->cm.stale_task, ipoib_cm_stale_task);
 
 	skb_queue_head_init(&priv->cm.skb_queue);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 5c3c6a4..af8a6d4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -185,7 +185,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 	size = ipoib_sendq_size + ipoib_recvq_size + 1;
 	ret = ipoib_cm_dev_init(dev);
 	if (!ret)
-		size += ipoib_recvq_size;
+		size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */;
 
 	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
 	if (IS_ERR(priv->cq)) {
-- 
MST


From halr at voltaire.com  Wed May 16 04:12:15 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 16 May 2007 07:12:15 -0400
Subject: [ofa-general] Re: suggested patch for partition membership
	definition in	osm-partitions.conf (fix)
In-Reply-To: <4649D289.3070301@cea.fr>
References: <46487FBF.7020300@cea.fr>
	<1179157835.1540.183713.camel@hal.voltaire.com>
	<4649D289.3070301@cea.fr>
Message-ID: <1179313934.4531.155801.camel@hal.voltaire.com>

On Tue, 2007-05-15 at 11:32, Philippe Gregoire wrote:
> Here are the patches as you asked.
> I changed the code to use strncmp as suggested by Sasha.

Thanks! Applied (to master only).

-- Hal

> Philippe


From keshetti.mahesh at gmail.com  Wed May 16 05:00:07 2007
From: keshetti.mahesh at gmail.com (Keshetti Mahesh)
Date: Wed, 16 May 2007 17:30:07 +0530
Subject: [ofa-general] problem with loading IB modules in a IB node with
	OFED.
Message-ID: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com>

I am facing a problem loading any IB module (madeye.ko) on an IB node
with OFED-1.0 installed. While loading the module, many "disagrees
about symbol version" errors appear, whereas the same module loads
successfully on 2.6.9-42Elsmp (which contains OFED-1.0??).
Has this already been discussed?

-- 
Thanks and regards,
Mahesh.


From hnguyen at linux.vnet.ibm.com  Wed May 16 05:50:55 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Wed, 16 May 2007 14:50:55 +0200
Subject: [ofa-general] [PATCH 2.6.22] ehca: return proper error code if
	register_mr fails
Message-ID: <200705161450.55848.hnguyen@linux.vnet.ibm.com>

This patch sets the return code of ehca_register_mr() to ENOMEM
if the corresponding firmware call fails because it is out of resources.
Some error codes were explicitly mapped to EINVAL. They are now handled
by the default case, which already returns EINVAL anyway.


Signed-off-by: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
---


 ehca_mrmw.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)


diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 84c5bb4..add79bd 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -2050,13 +2050,10 @@ int ehca_mrmw_map_hrc_alloc(const u64 hi
 	switch (hipz_rc) {
 	case H_SUCCESS:	             /* successful completion */
 		return 0;
-	case H_ADAPTER_PARM:         /* invalid adapter handle */
-	case H_RT_PARM:              /* invalid resource type */
 	case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */
-	case H_MLENGTH_PARM:         /* invalid memory length */
-	case H_MEM_ACCESS_PARM:      /* invalid access controls */
 	case H_CONSTRAINED:          /* resource constraint */
-		return -EINVAL;
+	case H_NO_MEM:
+		return -ENOMEM;
 	case H_BUSY:                 /* long busy */
 		return -EBUSY;
 	default:
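
For readers following the thread, the post-patch mapping can be sketched
as a standalone function. This is only an illustration under stated
assumptions: the SK_H_* enumerators and sketch_map_hrc_alloc() are
hypothetical stand-ins, not the real hypervisor constants or the ehca
function, and their numeric values are arbitrary.

```c
#include <errno.h>

/* Hypothetical stand-ins for the hypervisor return codes named in the
 * patch. The real values live in the platform headers and are NOT
 * reproduced here. */
enum sketch_hrc {
	SK_H_SUCCESS,
	SK_H_BUSY,
	SK_H_NOT_ENOUGH_RESOURCES,
	SK_H_CONSTRAINED,
	SK_H_NO_MEM,
	SK_H_ADAPTER_PARM,	/* invalid adapter handle */
	SK_H_MLENGTH_PARM	/* invalid memory length */
};

/* Same shape as the switch after the patch: resource exhaustion maps
 * to -ENOMEM, long busy to -EBUSY, everything else hits the default. */
int sketch_map_hrc_alloc(enum sketch_hrc rc)
{
	switch (rc) {
	case SK_H_SUCCESS:
		return 0;
	case SK_H_NOT_ENOUGH_RESOURCES:
	case SK_H_CONSTRAINED:
	case SK_H_NO_MEM:
		return -ENOMEM;
	case SK_H_BUSY:
		return -EBUSY;
	default:
		return -EINVAL;
	}
}
```

With this shape a caller can tell resource exhaustion (-ENOMEM) apart
from an invalid parameter (-EINVAL), for example by calling
sketch_map_hrc_alloc(SK_H_NO_MEM) and checking the result.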


From hnguyen at linux.vnet.ibm.com  Wed May 16 06:05:08 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Wed, 16 May 2007 15:05:08 +0200
Subject: [ofa-general] [PATCH ofed-1.2] ehca (kernel space): return proper
	error code if register_mr fails
Message-ID: <200705161505.09214.hnguyen@linux.vnet.ibm.com>

Hello Tziporet!
Please accept the patch below for ofed-1.2, because it fixes an MR
resource limitation problem reported by Troy and Kyle on this mailing
list. Only with this patch is their application able to release MRs
that are no longer used.
Thanks!
Regards
Nam


This patch sets the return code of ehca_register_mr() to ENOMEM
if the corresponding firmware call fails because it is out of resources.
Some error codes were explicitly mapped to EINVAL. They are now handled
by the default case, which already returns EINVAL anyway.


Signed-off-by: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
---


 ehca_mrmw.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)


diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index cfb362a..b3bbe3b 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -2045,13 +2045,10 @@ int ehca_mrmw_map_hrc_alloc(const u64 hi
 	switch (hipz_rc) {
 	case H_SUCCESS:	             /* successful completion */
 		return 0;
-	case H_ADAPTER_PARM:         /* invalid adapter handle */
-	case H_RT_PARM:              /* invalid resource type */
 	case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */
-	case H_MLENGTH_PARM:         /* invalid memory length */
-	case H_MEM_ACCESS_PARM:      /* invalid access controls */
 	case H_CONSTRAINED:          /* resource constraint */
-		return -EINVAL;
+	case H_NO_MEM:
+		return -ENOMEM;
 	case H_BUSY:                 /* long busy */
 		return -EBUSY;
 	default:


From devesh28 at gmail.com  Wed May 16 06:13:50 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Wed, 16 May 2007 18:43:50 +0530
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<000101c7963a$3474ae00$49c9180a@amr.corp.intel.com>
Message-ID: <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com>

On 5/14/07, Sean Hefty <sean.hefty at intel.com> wrote:
> >This can be treated as a facility similar to what we have in ARP table
> >for TCP/IP. Secondly this will help in debugging of some new up-coming
> >partially InfiniBand-compliant hardware.
>
> But unless such a path actually exists to the remote node, I don't see that it's
> useful.  And if such a path exists, I would expect it to be returned by the SA.
But initially this will generate a packet for each path, even though the
sysadmin knows the path is there and can hard-code the entries for it.
The other question is why the admin would bother creating such a record
when the SA itself takes care of it, right?
> Can you clarify its use more wrt the subnet in general?
Again the same: in most cases the administrator knows that some path
exists between Node A and Node B, so why introduce more delay in
bringing the stack up by sending extra packets (generated by
sa_cache_module)? At later stages, if something changes, it may
generate only a few packets to update the cache.

Another point I want to know is:
when will the local_sa_cache module be inserted? After the SM comes up
or before the SM comes up?
I think it's after the SM is up, so this introduces extra effort for
the admin, who will have to wait for the SM to come up and then insert
the sa_cache module.

If it's inserted before the SM comes up (I am assuming the SM is running
on some node, not on a switch), then the first forced schedule_update()
is wasted, and for the first application the presence of the cache is
meaningless. Why not keep the cache effective right from the start?
CMIIW
>
> >yes, I want them to remain in the DB, my idea is similar to the hard
> >coding of ARP table entries in TCP/IP.
> >How do you see this can be achieved?
>
> A simple flag or setting the update counter on the added path to the maximum
> should be sufficient.
>
> - Sean
>
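
Sean's suggestion of a simple flag can be pictured with a toy cache.
This is a sketch only: the structures, field names and functions below
are all hypothetical and are not taken from the local SA cache module.
It merely shows learned entries being dropped on a refresh while
administrator-added entries survive.

```c
#include <stddef.h>

#define SK_CACHE_SIZE 8

struct sk_path_entry {
	unsigned short dlid;	/* destination LID, used here as the key */
	int in_use;
	int is_static;		/* 1 = added by the admin, kept on refresh */
};

struct sk_path_cache {
	struct sk_path_entry e[SK_CACHE_SIZE];
};

/* Add an entry; returns 0 on success, -1 if the table is full. */
int sk_cache_add(struct sk_path_cache *c, unsigned short dlid, int is_static)
{
	size_t i;

	for (i = 0; i < SK_CACHE_SIZE; i++) {
		if (!c->e[i].in_use) {
			c->e[i].dlid = dlid;
			c->e[i].in_use = 1;
			c->e[i].is_static = is_static;
			return 0;
		}
	}
	return -1;
}

/* Refresh: drop every learned entry, leave static ones untouched. */
void sk_cache_refresh(struct sk_path_cache *c)
{
	size_t i;

	for (i = 0; i < SK_CACHE_SIZE; i++)
		if (c->e[i].in_use && !c->e[i].is_static)
			c->e[i].in_use = 0;
}

/* Returns 1 if a path to dlid is cached, 0 otherwise. */
int sk_cache_has(const struct sk_path_cache *c, unsigned short dlid)
{
	size_t i;

	for (i = 0; i < SK_CACHE_SIZE; i++)
		if (c->e[i].in_use && c->e[i].dlid == dlid)
			return 1;
	return 0;
}

/* Demo: returns 1 if the static entry survives a refresh and the
 * learned entry does not. */
int sk_cache_demo(void)
{
	struct sk_path_cache c = { 0 };

	sk_cache_add(&c, 5, 1);	/* static, admin-provided */
	sk_cache_add(&c, 7, 0);	/* learned from a query */
	sk_cache_refresh(&c);
	return sk_cache_has(&c, 5) && !sk_cache_has(&c, 7);
}
```

Setting the update counter to its maximum, the other option Sean
mentions, would presumably have the same effect without an extra field.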


From tziporet at dev.mellanox.co.il  Wed May 16 06:43:06 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 16 May 2007 16:43:06 +0300
Subject: [ofa-general] Re: [PATCH ofed-1.2] ehca (kernel space): return
 proper error code if register_mr fails
In-Reply-To: <200705161505.09214.hnguyen@linux.vnet.ibm.com>
References: <200705161505.09214.hnguyen@linux.vnet.ibm.com>
Message-ID: <464B0A6A.6090303@mellanox.co.il>

Hoang-Nam Nguyen wrote:
> Hello Tziporet!
> Please accept below patch for ofed-1.2, because it fixes a mr resources
> limitation problem reported by Troy and Kyle on this mailing list. Only
> with this patch their application is able to release no longer used mrs
> properly.
> Thanks!
> Regards
> Nam
>   

approved
Tziporet


From changquing.tang at hp.com  Wed May 16 07:18:59 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Wed, 16 May 2007 14:18:59 -0000
Subject: [ofa-general] OFED HA related question
In-Reply-To: <ada7irat6xw.fsf@cisco.com>
References: <1179145102.25749.11.camel@mtls03>
	<adabqgmtb6e.fsf@cisco.com><1179242042.25749.33.camel@mtls03>
	<ada7irat6xw.fsf@cisco.com>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net>

 
Roland:

	Suppose I get an IBV_EVENT_DEVICE_FATAL async event from the first
HCA on my node. Can I continue to call ibv_poll_cq() to get back all
the work requests I posted before, or do I need to keep track of these
work requests myself? I am afraid ibv_poll_cq() itself will return an
error. Also, can I call ibv_dereg_mr() to free the memory I registered
with this HCA?

	If I continue to use the second HCA, does the failure of the first
HCA affect the operation of the second HCA (from the driver's point of
view)?

	Thanks 

--CQ


From vlad at mellanox.co.il  Wed May 16 07:33:14 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 16 May 2007 17:33:14 +0300
Subject: [ofa-general] Re: [PATCH ofed-1.2] ehca (kernel space): return
	proper error code if register_mr fails
In-Reply-To: <200705161505.09214.hnguyen@linux.vnet.ibm.com>
References: <200705161505.09214.hnguyen@linux.vnet.ibm.com>
Message-ID: <1179325994.7463.56.camel@vladsk-laptop>

On Wed, 2007-05-16 at 15:05 +0200, Hoang-Nam Nguyen wrote:
> Hello Tziporet!
> Please accept below patch for ofed-1.2, because it fixes a mr resources
> limitation problem reported by Troy and Kyle on this mailing list. Only
> with this patch their application is able to release no longer used mrs
> properly.
> Thanks!
> Regards
> Nam
> 
> 
> 
> This patch sets the return code of ehca_register_mr() to ENOMEM
> if corresponding firmware call fails due to out of resources.
> Some of error codes were mapped to EINVAL. They are now mapped
> to default case, which already returns EINVAL anyway.
> 
> 
> Signed-off-by: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

Added: kernel_patches/fixes/ehca_8_fix_mr_resources_limitation.patch


-- 
Vladimir Sokolovsky <vlad at mellanox.co.il>
Mellanox Technologies Ltd.


From bs at q-leap.de  Wed May 16 07:41:18 2007
From: bs at q-leap.de (Bernd Schubert)
Date: Wed, 16 May 2007 16:41:18 +0200
Subject: [ofa-general] possible irq lock inversion dependency detected
In-Reply-To: <adahcqdsykx.fsf@cisco.com>
References: <200705151148.50607.bs@q-leap.de> <adahcqdsykx.fsf@cisco.com>
Message-ID: <200705161641.18749.bs@q-leap.de>

On Tuesday 15 May 2007 20:58:54 Roland Dreier wrote:
> Thanks for the report... looks like a real bug.
>
> Can you check whether this patch makes the lockdep warnings go away?

The warnings usually appeared after a few hours of uptime, even though
the system was idle.
After applying your patch and a couple of hours of uptime there have
been no warnings so far, so I guess your patch fixed it.
If you don't hear anything from me by Thursday, it definitely fixed it.


Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH


From kliteyn at dev.mellanox.co.il  Wed May 16 08:03:29 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 16 May 2007 18:03:29 +0300
Subject: [ofa-general] [PATCH] osm: up/dn optimization - improved ranking
Message-ID: <464B1D41.8080905@dev.mellanox.co.il>

Hi Hal,

This patch optimizes fabric ranking, similar to the fat-tree ranking.
All the root switches are first marked with rank 0 and added to the BFS
list, and only then does ranking of the rest of the fabric begin.

Please apply to master. 

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_ucast_updn.c |   66 +++++++++++++++++----------------------
 1 files changed, 29 insertions(+), 37 deletions(-)

diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c
index 5cebd9b..9574216 100644
--- a/opensm/opensm/osm_ucast_updn.c
+++ b/opensm/opensm/osm_ucast_updn.c
@@ -408,53 +408,49 @@ Exit :
 /*        rank is a SWITCH for BFS purpose */
 static int
 updn_subn_rank(
-  IN uint64_t root_guid,
-  IN uint8_t base_rank,
+  IN uint32_t num_guids,
+  IN uint64_t* guid_list,
   IN updn_t* p_updn )
 {
   osm_switch_t *p_sw;
-  uint32_t rank = base_rank;
   osm_physp_t *p_physp, *p_remote_physp;
   cl_qlist_t list;
   cl_status_t did_cause_update;
   struct updn_node *u, *remote_u;
   uint8_t num_ports, port_num;
   osm_log_t *p_log = &p_updn->p_osm->log;
+  uint32_t idx = 0;
 
   OSM_LOG_ENTER( p_log, updn_subn_rank );
+  cl_qlist_init(&list);
 
-  p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, cl_hton64(root_guid));
-  if(!p_sw)
-  {
-    osm_log( p_log, OSM_LOG_ERROR,
-             "updn_subn_rank: ERR AA05: "
-             "Root switch GUID 0x%" PRIx64 " not found\n", root_guid );
-    OSM_LOG_EXIT( p_log );
-    return 1;
-  }
-
-  osm_log( p_log, OSM_LOG_VERBOSE,
-           "updn_subn_rank: "
-           "Ranking starts from GUID 0x%" PRIx64 "\n", root_guid );
-
-  u = p_sw->priv;
-  u->is_root = 1;
+  /* Rank all the roots and add them to list */
 
-  /* Rank the first guid chosen anyway since it's the base rank */
-  osm_log( p_log, OSM_LOG_DEBUG,
-           "updn_subn_rank: "
-           "Ranking port GUID 0x%" PRIx64 "\n", root_guid );
+  for (idx = 0; idx < num_guids; idx++)
+  {
+    /* Apply the ranking for each guid given by user - bypass illegal ones */
+    p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, cl_hton64(guid_list[idx]));
+    if(!p_sw)
+    {
+      osm_log( p_log, OSM_LOG_ERROR,
+               "updn_subn_rank: ERR AA05: "
+               "Root switch GUID 0x%" PRIx64 " not found\n", guid_list[idx] );
+      continue;
+    }
 
-  __updn_update_rank(u, rank);
+    u = p_sw->priv;
+    u->is_root = 1;
 
-  cl_qlist_init(&list);
-  cl_qlist_insert_tail(&list, &u->list);
+    osm_log( p_log, OSM_LOG_DEBUG,
+             "updn_subn_rank: "
+             "Ranking root port GUID 0x%" PRIx64 "\n", guid_list[idx] );
+    __updn_update_rank(u, 0);
+    cl_qlist_insert_tail(&list, &u->list);
+  }
 
   /* BFS the list till it's empty */
   while (!cl_is_qlist_empty(&list))
   {
-    rank++;
-
     u = (struct updn_node *)cl_qlist_remove_head(&list);
     /* Go over all remote nodes and rank them (if not already visited) */
     p_sw = u->sw;
@@ -483,7 +479,7 @@ updn_subn_rank(
       {
         remote_u = p_remote_physp->p_node->sw->priv;
         port_guid = p_remote_physp->port_guid;
-        did_cause_update = __updn_update_rank(remote_u, rank);
+        did_cause_update = __updn_update_rank(remote_u, u->rank+1);
 
         osm_log( p_log, OSM_LOG_DEBUG,
                  "updn_subn_rank: "
@@ -500,8 +496,8 @@ updn_subn_rank(
   /* Print Summary of ranking */
   osm_log( p_log, OSM_LOG_VERBOSE,
            "updn_subn_rank: "
-           "Rank Info :\n\t Root Guid = 0x%" PRIx64 "\n\t Max Node Rank = %d\n",
-           root_guid, rank );
+           "Subnet ranking completed. Max Node Rank = %d\n",
+           remote_u->rank );
   OSM_LOG_EXIT( p_log );
   return 0;
 }
@@ -566,7 +562,6 @@ __osm_subn_calc_up_down_min_hop_table(
   IN uint64_t* guid_list,
   IN updn_t* p_updn )
 {
-  uint32_t idx = 0;
   int status;
 
   OSM_LOG_ENTER( &p_updn->p_osm->log, osm_subn_calc_up_down_min_hop_table );
@@ -593,11 +588,8 @@ __osm_subn_calc_up_down_min_hop_table(
     goto _exit;
   }
 
-  for (idx = 0; idx < num_guids; idx++)
-  {
-    /* Apply the ranking for each guid given by user - bypass illegal ones */
-    updn_subn_rank(guid_list[idx], 0, p_updn);
-  }
+  /* Rank the subnet switches */
+  updn_subn_rank(num_guids, guid_list, p_updn);
 
   /* After multiple ranking need to set Min Hop Table by UpDn algorithm  */
   osm_log( &p_updn->p_osm->log, OSM_LOG_VERBOSE,
-- 
1.5.1.4
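
The idea described at the top of this patch (seed the BFS list with
every root at rank 0, then rank the rest of the fabric in one pass) can
be sketched outside OpenSM as follows. Everything here is an
assumption made for illustration: the adjacency matrix, the SK_* names
and sk_rank_subnet() are hypothetical and do not use the OpenSM data
structures. The sketch also tracks the maximum rank explicitly, which
is a choice of this example rather than something taken from the patch.

```c
#include <stddef.h>

#define SK_MAX_SW   16
#define SK_UNRANKED 0xFF

/* Rank each switch by hop count from the nearest root.
 * adj[a][b] != 0 means switches a and b are linked.
 * Returns the largest rank assigned. */
unsigned sk_rank_subnet(size_t n, unsigned char adj[SK_MAX_SW][SK_MAX_SW],
			const size_t *roots, size_t num_roots,
			unsigned char rank[SK_MAX_SW])
{
	size_t queue[SK_MAX_SW];
	size_t head = 0, tail = 0, i;
	unsigned max_rank = 0;

	if (n > SK_MAX_SW)
		n = SK_MAX_SW;
	for (i = 0; i < n; i++)
		rank[i] = SK_UNRANKED;

	/* Seed: every valid root gets rank 0 and is queued before any
	 * neighbor is visited. Unknown or duplicate roots are skipped. */
	for (i = 0; i < num_roots; i++) {
		size_t r = roots[i];

		if (r >= n || rank[r] != SK_UNRANKED)
			continue;
		rank[r] = 0;
		queue[tail++] = r;
	}

	/* BFS until the queue is empty. Each switch is queued once, and
	 * its first visit already carries its smallest rank. */
	while (head < tail) {
		size_t u = queue[head++];

		for (i = 0; i < n; i++) {
			if (adj[u][i] && rank[i] == SK_UNRANKED) {
				rank[i] = (unsigned char)(rank[u] + 1);
				if (rank[i] > max_rank)
					max_rank = rank[i];
				queue[tail++] = i;
			}
		}
	}
	return max_rank;
}

/* Demo topology: roots 0 and 1 both attach to switch 2, followed by
 * the chain 2 - 3 - 4. Returns the maximum rank found. */
unsigned sk_rank_demo(void)
{
	unsigned char adj[SK_MAX_SW][SK_MAX_SW] = { { 0 } };
	unsigned char rank[SK_MAX_SW];
	const size_t roots[] = { 0, 1 };

	adj[0][2] = adj[2][0] = 1;
	adj[1][2] = adj[2][1] = 1;
	adj[2][3] = adj[3][2] = 1;
	adj[3][4] = adj[4][3] = 1;
	return sk_rank_subnet(5, adj, roots, 2, rank);
}
```

Because all roots are queued before the loop starts, no switch is ever
ranked from a farther root first and re-ranked later, which is where
the saving over one BFS per root comes from.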


From dledford at redhat.com  Wed May 16 09:13:06 2007
From: dledford at redhat.com (Doug Ledford)
Date: Wed, 16 May 2007 12:13:06 -0400
Subject: [ofa-general] Re: [ewg] Re: Build problem with RHEL-4.5 and OFED-1.2
In-Reply-To: <200705141531.26635.ossrosch@linux.vnet.ibm.com>
References: <200705091824.54394.ossrosch@linux.vnet.ibm.com>
	<1178737535.2848.152.camel@fc6.xsintricity.com>
	<200705092357.59973.ossrosch@linux.vnet.ibm.com>
	<200705141531.26635.ossrosch@linux.vnet.ibm.com>
Message-ID: <464B2D92.3010400@redhat.com>

Stefan Roscher wrote:
> Hi Doug,

(Sorry for the late reply, I'm out of town at the moment)

> is there any news on this problem? Is it a problem with the OFED build
> or a problem with Red Hat?
> Should I open a bugzilla to track this?

Well, yes, there should be a bugzilla.  What exactly the bug is depends 
on the kernel RPM maintainer's intended wishes for the kernel-devel 
package.  If he wanted the ppc kernel-devel to be self sufficient, then 
he should have included all asm-ppc64 header files that were included by 
asm-ppc header files and not having them in the kernel-devel.ppc package 
is the bug.  However, doing things this way probably precludes 
installing both the kernel-devel.ppc and kernel-devel.ppc64 packages at 
the same time.

If, on the other hand, he wanted the ppc devel package to be pure and 
not include ppc64 header files, then he needs to make sure that A) both 
the kernel-devel.ppc and kernel-devel.ppc64 rpms can be installed at the 
same time and B) that the kernel-devel.ppc RPM has a Requires: 
kernel-devel.ppc64 item in there to avoid this dangling header include 
problem that currently exists.

> Regards Stefan
> On Wednesday 09 May 2007 23:57, Stefan Roscher wrote:
>> On Wednesday 09 May 2007 21:05, Doug Ledford wrote:
>>> On Wed, 2007-05-09 at 18:24 +0200, Stefan Roscher wrote:
>>>> Hi Doug,
>>>>
>>>> I installed RHEL-4.5 on one of our ppc64 systems and noticed that the
>>>> asm-ppc directory is missing in /usr/src/kernels/2.6.9-55.EL/include.
>>>> Normally I don't need this directory, but ibmebus.h includes
>>>> asm-ppc64/of_device.h, and asm-ppc64/of_device.h in turn includes
>>>> asm-ppc/of_device.h. Because this file is missing I cannot build
>>>> ehca and the ofed stack with today's ofed-1.2 daily build.
>>>>
>>>> Did I do something wrong during installation?
>>>>
>>>> Regards Stefan Roscher
>>> I'll look into it, but in the meantime, install the kernel src.rpm, go
>>> into /usr/src/redhat/SPECS and run rpmbuild -bp kernel-2.6.spec and it
>>> should create a complete source tree
>>> in /usr/src/redhat/BUILD/kernel-2.6.18 that you can then get the asm-ppc
>>> directory contents out of.
>>>
>>> -- 
>>> Doug Ledford <dledford at redhat.com>
>>>               GPG KeyID: CFBFF194
>>>               http://people.redhat.com/dledford
>>>
>>> Infiniband specific RPMs available at
>>>               http://people.redhat.com/dledford/Infiniband
>>>
>> To create the backport patches for rhel4.5 I did it as you describe, but
>> the OFED build scripts don't use the kernel sources in
>> /usr/src/redhat/BUILD. OFED-1.2 uses the source link within
>> /lib/modules/kernel-x.x.x, which points into /usr/src/kernel. These
>> kernel sources were created during installation of rhel-4.5. In this
>> kernel source the directory include/asm-ppc is missing.
>> This is why I did not find this problem while creating the
>> backport patches.
>>
>> regards stefan
>>
>> _______________________________________________
>> ewg mailing list
>> ewg at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>>


-- 
Doug Ledford <dledford at redhat.com>
http://people.redhat.com/dledford

Infiniband specific RPMs can be found at
http://people.redhat.com/dledford/Infiniband


From halr at voltaire.com  Wed May 16 09:38:06 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 16 May 2007 12:38:06 -0400
Subject: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move CM definitions
	from ib_types.h
Message-ID: <1179333484.4531.176519.camel@hal.voltaire.com>

OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/opensm/include/iba/ib_cm_types.h b/opensm/include/iba/ib_cm_types.h
new file mode 100644
index 0000000..f4fb139
--- /dev/null
+++ b/opensm/include/iba/ib_cm_types.h
@@ -0,0 +1,210 @@
+/*
+ * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#if !defined(__IB_CM_TYPES_H__)
+#define __IB_CM_TYPES_H__
+
+#ifndef WIN32
+
+#include <iba/ib_types.h>
+
+#ifdef __cplusplus
+#  define BEGIN_C_DECLS extern "C" {
+#  define END_C_DECLS   }
+#else /* !__cplusplus */
+#  define BEGIN_C_DECLS
+#  define END_C_DECLS
+#endif /* __cplusplus */
+
+BEGIN_C_DECLS
+
+/*
+ * Defines known Communication management class versions
+ */
+#define IB_MCLASS_CM_VER_2				2
+#define IB_MCLASS_CM_VER_1				1
+
+/*
+ *	Defines the size of user available data in communication management MADs
+ */
+#define IB_REQ_PDATA_SIZE_VER2				92
+#define IB_MRA_PDATA_SIZE_VER2				222
+#define IB_REJ_PDATA_SIZE_VER2				148
+#define IB_REP_PDATA_SIZE_VER2				196
+#define IB_RTU_PDATA_SIZE_VER2				224
+#define IB_LAP_PDATA_SIZE_VER2				168
+#define IB_APR_PDATA_SIZE_VER2				148
+#define IB_DREQ_PDATA_SIZE_VER2				220
+#define IB_DREP_PDATA_SIZE_VER2				224
+#define IB_SIDR_REQ_PDATA_SIZE_VER2			216
+#define IB_SIDR_REP_PDATA_SIZE_VER2			136
+
+#define IB_REQ_PDATA_SIZE_VER1				92
+#define IB_MRA_PDATA_SIZE_VER1				222
+#define IB_REJ_PDATA_SIZE_VER1				148
+#define IB_REP_PDATA_SIZE_VER1				204
+#define IB_RTU_PDATA_SIZE_VER1				224
+#define IB_LAP_PDATA_SIZE_VER1				168
+#define IB_APR_PDATA_SIZE_VER1				151
+#define IB_DREQ_PDATA_SIZE_VER1				220
+#define IB_DREP_PDATA_SIZE_VER1				224
+#define IB_SIDR_REQ_PDATA_SIZE_VER1			216
+#define IB_SIDR_REP_PDATA_SIZE_VER1			140
+
+#define IB_ARI_SIZE					72	// redefine
+#define IB_APR_INFO_SIZE				72
+
+/****d* Access Layer/ib_rej_status_t
+* NAME
+*	ib_rej_status_t
+*
+* DESCRIPTION
+*	Rejection reasons.
+*
+* SYNOPSIS
+*/
+typedef	ib_net16_t					ib_rej_status_t;
+/*
+* SEE ALSO
+*	ib_cm_rej, ib_cm_rej_rec_t
+*
+* SOURCE
+*/
+#define IB_REJ_INSUF_QP					CL_HTON16(1)
+#define IB_REJ_INSUF_EEC				CL_HTON16(2)
+#define IB_REJ_INSUF_RESOURCES				CL_HTON16(3)
+#define IB_REJ_TIMEOUT					CL_HTON16(4)
+#define IB_REJ_UNSUPPORTED				CL_HTON16(5)
+#define IB_REJ_INVALID_COMM_ID				CL_HTON16(6)
+#define IB_REJ_INVALID_COMM_INSTANCE			CL_HTON16(7)
+#define IB_REJ_INVALID_SID				CL_HTON16(8)
+#define IB_REJ_INVALID_XPORT				CL_HTON16(9)
+#define IB_REJ_STALE_CONN				CL_HTON16(10)
+#define IB_REJ_RDC_NOT_EXIST				CL_HTON16(11)
+#define IB_REJ_INVALID_GID				CL_HTON16(12)
+#define IB_REJ_INVALID_LID				CL_HTON16(13)
+#define IB_REJ_INVALID_SL				CL_HTON16(14)
+#define IB_REJ_INVALID_TRAFFIC_CLASS			CL_HTON16(15)
+#define IB_REJ_INVALID_HOP_LIMIT			CL_HTON16(16)
+#define IB_REJ_INVALID_PKT_RATE				CL_HTON16(17)
+#define IB_REJ_INVALID_ALT_GID				CL_HTON16(18)
+#define IB_REJ_INVALID_ALT_LID				CL_HTON16(19)
+#define IB_REJ_INVALID_ALT_SL				CL_HTON16(20)
+#define IB_REJ_INVALID_ALT_TRAFFIC_CLASS		CL_HTON16(21)
+#define IB_REJ_INVALID_ALT_HOP_LIMIT			CL_HTON16(22)
+#define IB_REJ_INVALID_ALT_PKT_RATE			CL_HTON16(23)
+#define IB_REJ_PORT_REDIRECT				CL_HTON16(24)
+#define IB_REJ_INVALID_MTU				CL_HTON16(26)
+#define IB_REJ_INSUFFICIENT_RESP_RES			CL_HTON16(27)
+#define IB_REJ_USER_DEFINED				CL_HTON16(28)
+#define IB_REJ_INVALID_RNR_RETRY			CL_HTON16(29)
+#define IB_REJ_DUPLICATE_LOCAL_COMM_ID			CL_HTON16(30)
+#define IB_REJ_INVALID_CLASS_VER			CL_HTON16(31)
+#define IB_REJ_INVALID_FLOW_LBL				CL_HTON16(32)
+#define IB_REJ_INVALID_ALT_FLOW_LBL			CL_HTON16(33)
+
+#define IB_REJ_SERVICE_HANDOFF				CL_HTON16(65535)
+/******/
+
+/****d* Access Layer/ib_apr_status_t
+* NAME
+*	ib_apr_status_t
+*
+* DESCRIPTION
+*	Automatic path migration status information.
+*
+* SYNOPSIS
+*/
+typedef uint8_t						ib_apr_status_t;
+/*
+* SEE ALSO
+*	ib_cm_apr, ib_cm_apr_rec_t
+*
+* SOURCE
+ */
+#define IB_AP_SUCCESS					0
+#define IB_AP_INVALID_COMM_ID				1
+#define IB_AP_UNSUPPORTED				2
+#define IB_AP_REJECT					3
+#define IB_AP_REDIRECT					4
+#define IB_AP_IS_CURRENT				5
+#define IB_AP_INVALID_QPN_EECN				6
+#define IB_AP_INVALID_LID				7
+#define IB_AP_INVALID_GID				8
+#define IB_AP_INVALID_FLOW_LBL				9
+#define IB_AP_INVALID_TCLASS				10
+#define IB_AP_INVALID_HOP_LIMIT				11
+#define IB_AP_INVALID_PKT_RATE				12
+#define IB_AP_INVALID_SL				13
+/******/
+
+/****d* Access Layer/ib_cm_cap_mask_t
+* NAME
+*	ib_cm_cap_mask_t
+*
+* DESCRIPTION
+*	Capability mask values in ClassPortInfo.
+*
+* SYNOPSIS
+*/
+#define IB_CM_RELIABLE_CONN_CAPABLE			CL_HTON16(9)
+#define IB_CM_RELIABLE_DGRM_CAPABLE			CL_HTON16(10)
+#define IB_CM_RDGRM_CAPABLE				CL_HTON16(11)
+#define IB_CM_UNRELIABLE_CONN_CAPABLE			CL_HTON16(12)
+#define IB_CM_SIDR_CAPABLE				CL_HTON16(13)
+/*
+* SEE ALSO
+*	ib_cm_rep, ib_class_port_info_t
+*
+* SOURCE
+*
+*******/
+
+/*
+ *	Service ID resolution status
+ */
+typedef uint16_t					ib_sidr_status_t;
+#define IB_SIDR_SUCCESS					0
+#define IB_SIDR_UNSUPPORTED				1
+#define IB_SIDR_REJECT					2
+#define IB_SIDR_NO_QP					3
+#define IB_SIDR_REDIRECT				4
+#define IB_SIDR_UNSUPPORTED_VER				5
+
+END_C_DECLS
+
+#endif /* ndef WIN32 */
+
+#endif /* __IB_CM_TYPES_H__ */
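
The IB_REJ_* values in this header are wrapped in CL_HTON16() and typed
as ib_net16_t, which suggests the rejection status is carried in network
byte order. A consumer would then either compare against the pre-swapped
constants or convert once and switch on host-order values. A minimal
sketch of the second approach, assuming POSIX htons()/ntohs() as
stand-ins for the complib macros; the SK_* names and sk_rej_reason()
are hypothetical, and only the numeric codes are copied from the list
above:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Host-order values copied from the IB_REJ_* list in the patch. */
#define SK_REJ_TIMEOUT      4
#define SK_REJ_STALE_CONN   10
#define SK_REJ_INVALID_MTU  26
#define SK_REJ_USER_DEFINED 28

/* Translate a rejection status as it appears on the wire (network
 * byte order) into a short description. */
const char *sk_rej_reason(uint16_t wire_status)
{
	switch (ntohs(wire_status)) {
	case SK_REJ_TIMEOUT:
		return "timeout";
	case SK_REJ_STALE_CONN:
		return "stale connection";
	case SK_REJ_INVALID_MTU:
		return "invalid MTU";
	case SK_REJ_USER_DEFINED:
		return "user defined";
	default:
		return "other";
	}
}

/* Demo: returns 1 if a wire-order STALE_CONN status is decoded
 * correctly. */
int sk_rej_demo(void)
{
	uint16_t wire = htons(SK_REJ_STALE_CONN);

	return strcmp(sk_rej_reason(wire), "stale connection") == 0;
}
```

Converting once at the boundary keeps the rest of the code free of
byte-order concerns. Comparing directly against pre-swapped constants,
as the header's macros allow, avoids the conversion altogether.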


From halr at voltaire.com  Wed May 16 09:38:18 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 16 May 2007 12:38:18 -0400
Subject: [ofa-general] [PATCH 2/2] OpenSM/ib_types.h: Remove CM defines
Message-ID: <1179333485.4531.176520.camel@hal.voltaire.com>

OpenSM/ib_types.h: Remove CM definitions as now in ib_cm_types.h

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/opensm/include/Makefile.am b/opensm/include/Makefile.am
index 8499d3b..3428d9a 100644
--- a/opensm/include/Makefile.am
+++ b/opensm/include/Makefile.am
@@ -1,7 +1,7 @@
 
 SUBDIRS = .
 
-nobase_pkginclude_HEADERS = iba/ib_types.h
+nobase_pkginclude_HEADERS = iba/ib_types.h iba/ib_cm_types.h
 
 EXTRA_DIST = \
 	$(srcdir)/opensm/osm_version.h \
@@ -128,6 +128,7 @@ EXTRA_DIST = \
 	$(srcdir)/complib/cl_fleximap.h \
 	$(srcdir)/complib/cl_qcomppool.h \
 	$(srcdir)/iba/ib_types.h \
+	$(srcdir)/iba/ib_cm_types.h \
 	$(srcdir)/vendor/osm_vendor_mlx_transport_anafa.h \
 	$(srcdir)/vendor/osm_vendor_mlx.h \
 	$(srcdir)/vendor/osm_vendor_mlx_sender.h \
diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index b3937cb..aee7024 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -31,7 +31,6 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id$
  */
 
 #if !defined(__IB_TYPES_H__)
@@ -7773,163 +7772,8 @@ typedef struct _ib_ioc_info
 #include <complib/cl_packoff.h>
 
 /*
- * Defines known Communication management class versions
- */
-#define IB_MCLASS_CM_VER_2				2
-#define IB_MCLASS_CM_VER_1				1
-
-/*
- *	Defines the size of user available data in communication management MADs
- */
-#define IB_REQ_PDATA_SIZE_VER2				92
-#define IB_MRA_PDATA_SIZE_VER2				222
-#define IB_REJ_PDATA_SIZE_VER2				148
-#define IB_REP_PDATA_SIZE_VER2				196
-#define IB_RTU_PDATA_SIZE_VER2				224
-#define IB_LAP_PDATA_SIZE_VER2				168
-#define IB_APR_PDATA_SIZE_VER2				148
-#define IB_DREQ_PDATA_SIZE_VER2				220
-#define IB_DREP_PDATA_SIZE_VER2				224
-#define IB_SIDR_REQ_PDATA_SIZE_VER2			216
-#define IB_SIDR_REP_PDATA_SIZE_VER2			136
-
-#define IB_REQ_PDATA_SIZE_VER1				92
-#define IB_MRA_PDATA_SIZE_VER1				222
-#define IB_REJ_PDATA_SIZE_VER1				148
-#define IB_REP_PDATA_SIZE_VER1				204
-#define IB_RTU_PDATA_SIZE_VER1				224
-#define IB_LAP_PDATA_SIZE_VER1				168
-#define IB_APR_PDATA_SIZE_VER1				151
-#define IB_DREQ_PDATA_SIZE_VER1				220
-#define IB_DREP_PDATA_SIZE_VER1				224
-#define IB_SIDR_REQ_PDATA_SIZE_VER1			216
-#define IB_SIDR_REP_PDATA_SIZE_VER1			140
-
-#define IB_ARI_SIZE					72		// redefine
-#define IB_APR_INFO_SIZE				72
-
-/****d* Access Layer/ib_rej_status_t
-* NAME
-*	ib_rej_status_t
-*
-* DESCRIPTION
-*	Rejection reasons.
-*
-* SYNOPSIS
-*/
-typedef	ib_net16_t					ib_rej_status_t;
-/*
-* SEE ALSO
-*	ib_cm_rej, ib_cm_rej_rec_t
-*
-* SOURCE
-*/
-#define IB_REJ_INSUF_QP					CL_HTON16(1)
-#define IB_REJ_INSUF_EEC				CL_HTON16(2)
-#define IB_REJ_INSUF_RESOURCES				CL_HTON16(3)
-#define IB_REJ_TIMEOUT					CL_HTON16(4)
-#define IB_REJ_UNSUPPORTED				CL_HTON16(5)
-#define IB_REJ_INVALID_COMM_ID				CL_HTON16(6)
-#define IB_REJ_INVALID_COMM_INSTANCE			CL_HTON16(7)
-#define IB_REJ_INVALID_SID				CL_HTON16(8)
-#define IB_REJ_INVALID_XPORT				CL_HTON16(9)
-#define IB_REJ_STALE_CONN				CL_HTON16(10)
-#define IB_REJ_RDC_NOT_EXIST				CL_HTON16(11)
-#define IB_REJ_INVALID_GID				CL_HTON16(12)
-#define IB_REJ_INVALID_LID				CL_HTON16(13)
-#define IB_REJ_INVALID_SL				CL_HTON16(14)
-#define IB_REJ_INVALID_TRAFFIC_CLASS			CL_HTON16(15)
-#define IB_REJ_INVALID_HOP_LIMIT			CL_HTON16(16)
-#define IB_REJ_INVALID_PKT_RATE				CL_HTON16(17)
-#define IB_REJ_INVALID_ALT_GID				CL_HTON16(18)
-#define IB_REJ_INVALID_ALT_LID				CL_HTON16(19)
-#define IB_REJ_INVALID_ALT_SL				CL_HTON16(20)
-#define IB_REJ_INVALID_ALT_TRAFFIC_CLASS		CL_HTON16(21)
-#define IB_REJ_INVALID_ALT_HOP_LIMIT			CL_HTON16(22)
-#define IB_REJ_INVALID_ALT_PKT_RATE			CL_HTON16(23)
-#define IB_REJ_PORT_REDIRECT				CL_HTON16(24)
-#define IB_REJ_INVALID_MTU				CL_HTON16(26)
-#define IB_REJ_INSUFFICIENT_RESP_RES			CL_HTON16(27)
-#define IB_REJ_USER_DEFINED				CL_HTON16(28)
-#define IB_REJ_INVALID_RNR_RETRY			CL_HTON16(29)
-#define IB_REJ_DUPLICATE_LOCAL_COMM_ID			CL_HTON16(30)
-#define IB_REJ_INVALID_CLASS_VER			CL_HTON16(31)
-#define IB_REJ_INVALID_FLOW_LBL				CL_HTON16(32)
-#define IB_REJ_INVALID_ALT_FLOW_LBL			CL_HTON16(33)
-
-#define IB_REJ_SERVICE_HANDOFF				CL_HTON16(65535)
-/******/
-
-/****d* Access Layer/ib_apr_status_t
-* NAME
-*	ib_apr_status_t
-*
-* DESCRIPTION
-*	Automatic path migration status information.
-*
-* SYNOPSIS
-*/
-typedef uint8_t						ib_apr_status_t;
-/*
-* SEE ALSO
-*	ib_cm_apr, ib_cm_apr_rec_t
-*
-* SOURCE
- */
-#define IB_AP_SUCCESS					0
-#define IB_AP_INVALID_COMM_ID				1
-#define IB_AP_UNSUPPORTED				2
-#define IB_AP_REJECT					3
-#define IB_AP_REDIRECT					4
-#define IB_AP_IS_CURRENT				5
-#define IB_AP_INVALID_QPN_EECN				6
-#define IB_AP_INVALID_LID				7
-#define IB_AP_INVALID_GID				8
-#define IB_AP_INVALID_FLOW_LBL				9
-#define IB_AP_INVALID_TCLASS				10
-#define IB_AP_INVALID_HOP_LIMIT				11
-#define IB_AP_INVALID_PKT_RATE				12
-#define IB_AP_INVALID_SL				13
-/******/
-
-/****d* Access Layer/ib_cm_cap_mask_t
-* NAME
-*	ib_cm_cap_mask_t
-*
-* DESCRIPTION
-*	Capability mask values in ClassPortInfo.
-*
-* SYNOPSIS
-*/
-#define IB_CM_RELIABLE_CONN_CAPABLE			CL_HTON16(9)
-#define IB_CM_RELIABLE_DGRM_CAPABLE			CL_HTON16(10)
-#define IB_CM_RDGRM_CAPABLE				CL_HTON16(11)
-#define IB_CM_UNRELIABLE_CONN_CAPABLE			CL_HTON16(12)
-#define IB_CM_SIDR_CAPABLE				CL_HTON16(13)
-/*
-* SEE ALSO
-*	ib_cm_rep, ib_class_port_info_t
-*
-* SOURCE
-*
-*******/
-
-/*
- *	Service ID resolution status
- */
-typedef uint16_t					ib_sidr_status_t;
-#define IB_SIDR_SUCCESS					0
-#define IB_SIDR_UNSUPPORTED				1
-#define IB_SIDR_REJECT					2
-#define IB_SIDR_NO_QP					3
-#define IB_SIDR_REDIRECT				4
-#define IB_SIDR_UNSUPPORTED_VER				5
-
-/*
  *	The following definitions are shared between the Access Layer and VPD
  */
-
-
 typedef struct _ib_ca* __ptr64			ib_ca_handle_t;
 typedef struct _ib_pd* __ptr64			ib_pd_handle_t;
 typedef struct _ib_rdd* __ptr64			ib_rdd_handle_t;
@@ -10467,7 +10311,8 @@ typedef struct _ib_ci_op
 
 END_C_DECLS
 
-#endif /* ndef WIN */
+#endif /* ndef WIN32 */
+
 #if defined( __WIN__ )
     #include <iba/ib_types_extended.h>
 #endif


From mshefty at ichips.intel.com  Wed May 16 09:45:43 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 16 May 2007 09:45:43 -0700
Subject: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move CM
	definitions from ib_types.h
In-Reply-To: <1179333484.4531.176519.camel@hal.voltaire.com>
References: <1179333484.4531.176519.camel@hal.voltaire.com>
Message-ID: <464B3537.7060405@ichips.intel.com>

Hal Rosenstock wrote:
> OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h

CM types are defined in the libibcm library.  Why not remove them 
completely from the opensm code?

- Sean


From halr at voltaire.com  Wed May 16 10:22:36 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 16 May 2007 13:22:36 -0400
Subject: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move CM
	definitions from ib_types.h
In-Reply-To: <464B3537.7060405@ichips.intel.com>
References: <1179333484.4531.176519.camel@hal.voltaire.com>
	<464B3537.7060405@ichips.intel.com>
Message-ID: <1179336155.23882.604.camel@hal.voltaire.com>

On Wed, 2007-05-16 at 12:45, Sean Hefty wrote:
> Hal Rosenstock wrote:
> > OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h
> 
> CM types are defined in the libibcm library.  Why not remove them 
> completely from the opensm code?

I think they are needed in the Windows environment. I believe Linux
userspace would never include this header.

-- Hal

> - Sean


From sashak at voltaire.com  Wed May 16 10:30:47 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 16 May 2007 20:30:47 +0300
Subject: [ofa-general] Re: Error message in OSM log when cached op file
	doesn't exist
In-Reply-To: <4649AE00.8080806@dev.mellanox.co.il>
References: <46486D1E.6010408@dev.mellanox.co.il>
	<1179152459.1540.178811.camel@hal.voltaire.com>
	<46499769.1070404@dev.mellanox.co.il>
	<20070515125401.GD23240@sashak.voltaire.com>
	<4649AE00.8080806@dev.mellanox.co.il>
Message-ID: <20070516173047.GK19271@sashak.voltaire.com>

On 15:56 Tue 15 May     , Yevgeny Kliteynik wrote:
> Sasha Khapyorsky wrote:
> >On 14:20 Tue 15 May     , Yevgeny Kliteynik wrote:
> >>I think that the message should appear when OpenSM *does* find cached
> >>option file, and no message should appear when such file wasn't found
> >>(which is the most common use case).
> >
> >AFAIK OpenSM which used in the labs' clusters almost always uses this
> >file, so I'm not sure about common case.
> 
> If the file is found, user sees "Using cached bla-bla" and 
> "Loading cached option bla-bla" messages.
> If the file wasn't found, these messages are not printed,
> so absence of these messages means that the file wasn't found.
> The only thing we can do is to add a new message that will
> explicitly inform the user about this, something like 
> "No cached options file".
> Is this necessary? IMHO, it's not. Do you think otherwise?

Yes, an explicit message would be cleaner.
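
A minimal sketch of what such explicit reporting could look like. This is
not the OpenSM code: the helper name report_cached_options() and the exact
message wording are placeholders invented for illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: say explicitly whether a cached options file is
 * in use. Returns 1 if the file could be opened, 0 otherwise.
 */
static int report_cached_options(const char *path, FILE *log)
{
    FILE *f;

    assert(path && log);

    f = fopen(path, "r");
    if (!f) {
        /* The explicit "not found" message discussed in this thread. */
        fprintf(log, "No cached options file (%s)\n", path);
        return 0;
    }

    fprintf(log, "Using cached options file %s\n", path);
    fclose(f);
    return 1;
}
```

With this shape the user sees one line in either case, so the absence of a
message no longer has to be interpreted.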

Sasha


From sashak at voltaire.com  Wed May 16 11:29:02 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 16 May 2007 21:29:02 +0300
Subject: [ofa-general] git over http
In-Reply-To: <4649AC27.8010903@cea.fr>
References: <4649AC27.8010903@cea.fr>
Message-ID: <20070516182902.GL19271@sashak.voltaire.com>

On 14:48 Tue 15 May     , Philippe Gregoire wrote:
> I can't get git clone command working due to our firewall.
> Is there any git http server configured ?
> If any, how do I translate
> git clone git://git.openfabrics.org/~halr/management 
> <git://git.openfabrics.org/%7Ehalr/management>
> in git clone http path ?

Try this:

  git clone http://git.openfabrics.org/pub/scm/~halr/management.git

You can also ask Hal to put a symbolic link to his repo under
~halr/public_html; it will then be accessible much like the git:// URL:

  git clone http://git.openfabrics.org/~halr/management.git

Sasha


From xma at us.ibm.com  Wed May 16 12:09:36 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 16 May 2007 12:09:36 -0700
Subject: [ofa-general] binary compatibility ofed 1.1 and 1.2
In-Reply-To: <OFEEA2D314.CE1FBFE5-ON852572DC.00471A18-862572DC.0049CD84@us.ibm.com>
Message-ID: <OFC91CA342.9D30939F-ON872572DD.00688999-882572DD.0069354C@us.ibm.com>


Hello Roland,

> Hi,
>
> Will apps built with OFED 1.1 verbs.h run on an OFED 1.2 install ?
>
> -Bill

      It seems that binary apps break between OFED-1.1 and OFED-1.2.
Is there any reason we can't keep structs like ibv_cq binary compatible?

Thanks
Shirley

From mshefty at ichips.intel.com  Wed May 16 12:31:19 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 16 May 2007 12:31:19 -0700
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>	<000101c7963a$3474ae00$49c9180a@amr.corp.intel.com>
	<309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com>
Message-ID: <464B5C07.8040601@ichips.intel.com>

> But initially this will generate a packet for each path, while sys
> admin knows that path is there and he can hard-code the entries for
> it. Other thing is that why Admin will care about creating such record
> while SA is itself taking care, right?

In your original message you asked about adding 'dummy entries' to the 
cache.  I agree that pre-loading the cache can be useful.  What I still 
am not understanding is the reasoning for adding 'dummy entries'.  By 
'dummy entries', I've been assuming that these are invalid path records, 
but maybe that's not what you meant.

> Another point I want to know is,
> When local_sa_cache module will be inserted? After SM comes up or
> Before SM comes up?

It can occur either way.  There is no restriction.  The cache responds 
to port up and GID in/out of service events to update itself.

> If Its inserted before SM is coming up (I am assuming SM is running on
> some node not on switch) then First Forced schedule_update() is
> waisted, and for the first application presence of cache is
> meaningless. Why not to keep cache effective right from the start?

Pre-loading the cache with path records doesn't guarantee that those 
paths are usable.  If the SM has not come up, then the path records will 
be unusable until the SM configures the subnet, plus there's no 
guarantee that the remote endpoints specified by the paths are running.

The main benefit I see to pre-loading the cache is to avoid SA storms 
when booting a large cluster.

- Sean


From sashak at voltaire.com  Wed May 16 12:49:19 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 16 May 2007 22:49:19 +0300
Subject: [ofa-general] [PATCH] osm: up/dn optimization - improved ranking
In-Reply-To: <464B1D41.8080905@dev.mellanox.co.il>
References: <464B1D41.8080905@dev.mellanox.co.il>
Message-ID: <20070516194919.GO19271@sashak.voltaire.com>

Hi Yevgeny,

On 18:03 Wed 16 May     , Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> This patch optimizes fabric ranking similar to the fat-tree ranking.
> All the root switches are marked with rank and added to the BFS list,
> and only then ranking of rest of the fabric begins.
> 
> Please apply to master. 
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---

Basically looks good. However, a couple of comments below.

> opensm/opensm/osm_ucast_updn.c |   66 
> +++++++++++++++++----------------------
> 1 files changed, 29 insertions(+), 37 deletions(-)
> 
> diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c
> index 5cebd9b..9574216 100644
> --- a/opensm/opensm/osm_ucast_updn.c
> +++ b/opensm/opensm/osm_ucast_updn.c
> @@ -408,53 +408,49 @@ Exit :
> /*        rank is a SWITCH for BFS purpose */
> static int
> updn_subn_rank(
> -  IN uint64_t root_guid,
> -  IN uint8_t base_rank,
> +  IN uint32_t num_guids,

'num_guids' does not need to be a fixed-size integer; a plain,
compiler-friendly 'unsigned' is fine.

> +  IN uint64_t* guid_list,
>   IN updn_t* p_updn )
> {
>   osm_switch_t *p_sw;
> -  uint32_t rank = base_rank;
>   osm_physp_t *p_physp, *p_remote_physp;
>   cl_qlist_t list;
>   cl_status_t did_cause_update;
>   struct updn_node *u, *remote_u;
>   uint8_t num_ports, port_num;
>   osm_log_t *p_log = &p_updn->p_osm->log;
> +  uint32_t idx = 0;

Ditto.

> 
>   OSM_LOG_ENTER( p_log, updn_subn_rank );
> +  cl_qlist_init(&list);
> 
> -  p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, 
> cl_hton64(root_guid));
> -  if(!p_sw)
> -  {
> -    osm_log( p_log, OSM_LOG_ERROR,
> -             "updn_subn_rank: ERR AA05: "
> -             "Root switch GUID 0x%" PRIx64 " not found\n", root_guid );
> -    OSM_LOG_EXIT( p_log );
> -    return 1;
> -  }
> -
> -  osm_log( p_log, OSM_LOG_VERBOSE,
> -           "updn_subn_rank: "
> -           "Ranking starts from GUID 0x%" PRIx64 "\n", root_guid );
> -
> -  u = p_sw->priv;
> -  u->is_root = 1;
> +  /* Rank all the roots and add them to list */
> 
> -  /* Rank the first guid chosen anyway since it's the base rank */
> -  osm_log( p_log, OSM_LOG_DEBUG,
> -           "updn_subn_rank: "
> -           "Ranking port GUID 0x%" PRIx64 "\n", root_guid );
> +  for (idx = 0; idx < num_guids; idx++)
> +  {
> +    /* Apply the ranking for each guid given by user - bypass illegal ones 
> */
> +    p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, 
> cl_hton64(guid_list[idx]));
> +    if(!p_sw)
> +    {
> +      osm_log( p_log, OSM_LOG_ERROR,
> +               "updn_subn_rank: ERR AA05: "
> +               "Root switch GUID 0x%" PRIx64 " not found\n", 
> guid_list[idx] );
> +      continue;
> +    }
> 
> -  __updn_update_rank(u, rank);
> +    u = p_sw->priv;
> +    u->is_root = 1;

Now that root switches are always ranked first, the 'is_root' field is
no longer needed; (!u->rank) answers this.

> 
> -  cl_qlist_init(&list);
> -  cl_qlist_insert_tail(&list, &u->list);
> +    osm_log( p_log, OSM_LOG_DEBUG,
> +             "updn_subn_rank: "
> +             "Ranking root port GUID 0x%" PRIx64 "\n", guid_list[idx] );
> +    __updn_update_rank(u, 0);
> +    cl_qlist_insert_tail(&list, &u->list);
> +  }
> 
>   /* BFS the list till it's empty */
>   while (!cl_is_qlist_empty(&list))
>   {
> -    rank++;
> -
>     u = (struct updn_node *)cl_qlist_remove_head(&list);
>     /* Go over all remote nodes and rank them (if not already visited) */
>     p_sw = u->sw;
> @@ -483,7 +479,7 @@ updn_subn_rank(
>       {
>         remote_u = p_remote_physp->p_node->sw->priv;
>         port_guid = p_remote_physp->port_guid;
> -        did_cause_update = __updn_update_rank(remote_u, rank);
> +        did_cause_update = __updn_update_rank(remote_u, u->rank+1);
> 
>         osm_log( p_log, OSM_LOG_DEBUG,
>                  "updn_subn_rank: "
> @@ -500,8 +496,8 @@ updn_subn_rank(
>   /* Print Summary of ranking */
>   osm_log( p_log, OSM_LOG_VERBOSE,
>            "updn_subn_rank: "
> -           "Rank Info :\n\t Root Guid = 0x%" PRIx64 "\n\t Max Node Rank = 
> %d\n",
> -           root_guid, rank );
> +           "Subnet ranking completed. Max Node Rank = %d\n",
> +           remote_u->rank );

'remote_u' may be uninitialized here. Another issue is that even when it
is initialized, it may point to a remote switch whose rank is lower than
the maximum (when did_cause_update = 0).

The rest is fine.

Sasha


From rdreier at cisco.com  Wed May 16 13:21:31 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 16 May 2007 13:21:31 -0700
Subject: [ofa-general] binary compatibility ofed 1.1 and 1.2
In-Reply-To: <OFC91CA342.9D30939F-ON872572DD.00688999-882572DD.0069354C@us.ibm.com>
	(Shirley Ma's message of "Wed, 16 May 2007 12:09:36 -0700")
References: <OFC91CA342.9D30939F-ON872572DD.00688999-882572DD.0069354C@us.ibm.com>
Message-ID: <adamz04msdw.fsf@cisco.com>

 > > Will apps built with OFED 1.1 verbs.h run on an OFED 1.2 install ?

Yes, unless you do something to defeat the ABI versioning.

 >       It seems the binary apps are broken between OFED-1.1 and OFED-1.2.
 > Any reason why we can't maintain struct like ibv_cq binary compatibility?

libibverbs has a versioned ABI.  So if you link your app against
libibverbs 1.0, it will be linked against IBVERBS_1.0 symbols and
still work with the libibverbs 1.1 dynamic library.

A number of changes required struct layout differences etc., so a new
IBVERBS_1.1 ABI was introduced as well.  But you will only get that by
linking against libibverbs 1.1.
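
As a rough illustration of why a layout change needs a new ABI version,
consider the sketch below. The two structs are hypothetical stand-ins, not
the real ibv_cq definitions; the point is only that inserting a member
shifts the offsets that already-compiled code has baked in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout seen by an application built against the old header. */
struct cq_v1_0 {
    void     *context;
    void     *cq_context;
    unsigned  handle;
    int       cqe;
};

/* Hypothetical newer layout: one member inserted in the middle. */
struct cq_v1_1 {
    void     *context;
    void     *channel;      /* new member */
    void     *cq_context;
    unsigned  handle;
    int       cqe;
};

/*
 * Returns nonzero if code compiled against the old layout would read
 * 'cqe' from the wrong offset when handed a new-layout object.
 */
static int cqe_offset_changed(void)
{
    return offsetof(struct cq_v1_0, cqe) != offsetof(struct cq_v1_1, cqe);
}
```

Symbol versioning sidesteps this by letting the library export both the
old and the new entry points, so an old binary keeps resolving to code that
understands the old layout.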

 - R.


From rdreier at cisco.com  Wed May 16 13:26:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 16 May 2007 13:26:00 -0700
Subject: [ofa-general] Re: OFED HA related question
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net>
	(Changqing Tang's message of "Wed, 16 May 2007 14:18:59 -0000")
References: <1179145102.25749.11.camel@mtls03> <adabqgmtb6e.fsf@cisco.com>
	<1179242042.25749.33.camel@mtls03> <ada7irat6xw.fsf@cisco.com>
	<349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net>
Message-ID: <adairasms6f.fsf@cisco.com>

    Changqing> 	Suppose I get IBV_EVENT_DEVICE_FATAL async event from
    Changqing> the first HCA on my node, can I continue to call
    Changqing> ibv_poll_cq() to get back all the work-requests I
    Changqing> posted before ?  or do I need to keep track these
    Changqing> work-requests? I am afraid ibv_poll_cq() will return
    Changqing> error by itself. Also can I call ibv_dereg_mr() to free
    Changqing> the memory I registered to this HCA ?

Once you get a catastrophic error, all bets are off.  Work request
processing is in an undetermined state, since basically the HCA
crashed in an unknown way.  Polling CQs is probably not a good idea.
I guess you do need to deregister memory regions to unpin the memory
as part of your cleanup....

    Changqing> 	If I continue to use the second HCA, does the failure
    Changqing> of first HCA affect the operation of second HCA (from
    Changqing> driver point of view) ?

No.

 - R.


From rdreier at cisco.com  Wed May 16 13:39:16 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 16 May 2007 13:39:16 -0700
Subject: [ofa-general] Re: [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR
	leak
In-Reply-To: <20070516101457.GA5091@mellanox.co.il> (Michael S. Tsirkin's
	message of "Wed, 16 May 2007 13:14:57 +0300")
References: <20070515210453.GL4161@mellanox.co.il>
	<20070516101457.GA5091@mellanox.co.il>
Message-ID: <adaejlgmrkb.fsf@cisco.com>

 > + * - Put the QP in the Error State
 > + * - Wait for the Affiliated Asynchronous Last WQE Reached Event;
 > + * - either:
 > + *       drain the CQ by invoking the Poll CQ verb and either wait for CQ
 > + *       to be empty or the number of Poll CQ operations has exceeded
 > + *       CQ capacity size;
 > + * - or
 > + *       post another WR that completes on the same CQ and wait for this
 > + *       WR to return as a WC; (NB: this is the option that we use)
 > + * and then invoke a Destroy QP or Reset QP.

I guess this last line would look better as

 * - invoke a Destroy QP or Reset QP.

 > +static struct ib_qp_attr ipoib_cm_err_attr __read_mostly = {
 > +	.qp_state = IB_QPS_ERR
 > +};
 > +
 > +#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff
 > +
 > +static struct ib_send_wr ipoib_cm_rx_drain_wr __read_mostly = {
 > +	.wr_id = IPOIB_CM_RX_DRAIN_WRID,
 > +	.opcode = IB_WR_SEND
 > +};

I don't think these are hot enough to be worth marking as __read_mostly.
(better to leave them in normal .data so that stuff that is written to
ends up getting spaced out more)

 > +	qp_attr.qp_state = IB_QPS_INIT;
 > +	qp_attr.port_num = priv->port;
 > +	qp_attr.qkey = 0;
 > +	qp_attr.qp_access_flags = 0;
 > +	ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr,
 > +			   IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PORT | IB_QP_QKEY);
 > +	if (ret) {
 > +		ipoib_warn(priv, "failed to modify drain QP to INIT: %d\n", ret);
 > +		goto err_qp;
 > +	}
 > +
 > +	/* We put the QP in error state directly: this way, hardware
 > +	 * will immediately generate WC for each WR we post, without
 > +	 * sending anything on the wire. */
 > +	qp_attr.qp_state = IB_QPS_ERR;
 > +	ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr, IB_QP_STATE);
 > +	if (ret) {
 > +		ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret);
 > +		goto err_qp;
 > +	}

This actually seems like a good motivation for the mthca RESET ->
ERROR fix.  We could avoid the transition to INIT if we fixed mthca
and mlx4, right?  (By the way, any interest in making an mlx4 patch to
fix that too?)

 - R.


From minich at ornl.gov  Wed May 16 13:46:49 2007
From: minich at ornl.gov (Makia Minich)
Date: Wed, 16 May 2007 16:46:49 -0400
Subject: [ofa-general] problem with loading IB modules in a IB node
	=?iso-8859-1?q?with=09OFED=2E?=
In-Reply-To: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com>
References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com>
Message-ID: <200705161646.49910.minich@ornl.gov>

The question is what kernel are you trying to load the module into?  Whether 
or not it's OFED-1.0 is irrelevant if one system is 2.6.9-42Elsmp and the 
other is not.  What is the result of the following (on your system where it's 
not loading):

uname -r
modinfo madeye.ko | grep vermagic

Also, you might want to check dmesg.

On Wednesday 16 May 2007 8:00:07 am Keshetti Mahesh wrote:
> I am facing problem while loading any IB module ( madeye.ko) into an
> IB node with OFED-1.0
> installed. while loading module lots of "disagrees about symbol
> version" errors appeared.
> where as the same module gets successfully loaded into 2.6.9-42Elsmp (
> which contains
> OFED-1.0??).
> Is this already discussed?

-- 
Makia Minich <minich at ornl.gov>
National Center for Computation Science
Oak Ridge National Laboratory
--*--
Imagine no possessions
I wonder if you can
- John Lennon


From rdreier at cisco.com  Wed May 16 13:56:46 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 16 May 2007 13:56:46 -0700
Subject: [ofa-general] Re: movnt
In-Reply-To: <20070515204335.GI4161@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 15 May 2007 23:43:35 +0300")
References: <20070508141727.GR21591@mellanox.co.il>
	<ada4pmjz7tm.fsf@cisco.com> <20070512172927.GA5908@mellanox.co.il>
	<adamz09yc19.fsf@cisco.com> <20070515204335.GI4161@mellanox.co.il>
Message-ID: <ada646smqr5.fsf@cisco.com>

 > So we can map the device memory with WB or WT semantics, and movnt will enable
 > WC. And the nice thing about this trick, is that both WB and WT *are already
 > programmed into PAT after reset*, which means that we can use them for pages we
 > map for userspace, without stepping on anyone's toes or waiting for
 > the generic in-kernel support for WC to materialize.

I'm not sure whether this is much of an advantage.  There's no generic
way to map memory with WB that I know of.  I don't think that setting
a PAT entry for WC is the hold-up -- the problem is more in the right
infrastructure for pgprot_xxx().  I don't think it's very nice to have
#ifdef __x86_64__ in a driver.

 > I attach a header file that implements WC memcpy with these
 > instructions for lengths from 16 to 128 bytes (and one can,
 > naturally, just call xmm_copy64 in a loop), that I wrote for fun
 > at some point. Feel free to read/flame/reuse in any way you like.

Using movntdq means we have to save off xmm's, and it's a hassle to
get a properly aligned block to be able to use movdqa to save them
(you can't rely on the stack being 16-byte aligned).  I'd be curious
to see whether it's even worth it for a 64-byte copy (which is
probably the most common case for BF), since you need 8 extra movdqa
to save/restore the xmms on top of 4 movdqa to load the WQE and 4
movntdq to write it.  Just plain movnti might be the simplest thing to
do, since 16 movnti is all you would need, and I think that comes out
to be smaller code than 12 movdqa + 4 movntdq.
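
For reference, a userspace sketch of the plain-movnti variant using the
SSE2 intrinsic _mm_stream_si64(). The function name copy64_nt() is made up
here, and this version issues eight 8-byte stores; that count is a property
of the sketch, not a restatement of the instruction counts above. On
targets other than x86-64 it simply falls back to memcpy().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__)
#include <emmintrin.h>
#endif

/*
 * Copy one 64-byte block to 'dst' with non-temporal 8-byte stores
 * (movnti) on x86-64. No xmm registers are touched, so nothing has to
 * be saved or restored. 'dst' must be 8-byte aligned.
 */
static void copy64_nt(void *dst, const void *src)
{
    assert(((uintptr_t)dst & 7) == 0);
#if defined(__x86_64__)
    long long tmp[8];
    int i;

    memcpy(tmp, src, sizeof(tmp));      /* ordinary loads */
    for (i = 0; i < 8; i++)
        _mm_stream_si64((long long *)dst + i, tmp[i]);
    _mm_sfence();                       /* order the non-temporal stores */
#else
    memcpy(dst, src, 64);
#endif
}
```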

(Optimizing the WQE copy in assembly might be worth it independent of
how we map the BF page for WC, since obviously posting BF sends is a
super-hot path.  And it's fun to write SSE code anyway)

 - R.


From kliteyn at mellanox.co.il  Wed May 16 14:24:25 2007
From: kliteyn at mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 17 May 2007 00:24:25 +0300
Subject: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move
	CMdefinitions from ib_types.h
In-Reply-To: <1179336155.23882.604.camel@hal.voltaire.com>
References: <1179333484.4531.176519.camel@hal.voltaire.com>
	<464B3537.7060405@ichips.intel.com>
	<1179336155.23882.604.camel@hal.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901825156@mtlexch01.mtl.com>


> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Subject: Re: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h
> 
> On Wed, 2007-05-16 at 12:45, Sean Hefty wrote:
> > Hal Rosenstock wrote:
> > > OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h
> > 
> > CM types are defined in the libibcm library.  Why not remove them 
> > completely from the opensm code?
> 
> I think they are needed in the Windows environment. 

I'm not sure about this. Windows has another ib_types header, and I
think that all the other applications are using this header and not the
management ib_types.

What are the rest of the CM defines that you want to remove?

-- Yevgeny

> I believe Linux userspace would never include this header.
> 
> -- Hal
> 
> > - Sean
> 
> 


From changquing.tang at hp.com  Wed May 16 14:43:30 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Wed, 16 May 2007 21:43:30 -0000
Subject: [ofa-general] RE: OFED HA related question
In-Reply-To: <adairasms6f.fsf@cisco.com>
References: <1179145102.25749.11.camel@mtls03>
	<adabqgmtb6e.fsf@cisco.com><1179242042.25749.33.camel@mtls03>
	<ada7irat6xw.fsf@cisco.com><349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net>
	<adairasms6f.fsf@cisco.com>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net>

 
> 
>     Changqing> 	Suppose I get IBV_EVENT_DEVICE_FATAL 
> async event from
>     Changqing> the first HCA on my node, can I continue to call
>     Changqing> ibv_poll_cq() to get back all the work-requests I
>     Changqing> posted before ?  or do I need to keep track these
>     Changqing> work-requests? I am afraid ibv_poll_cq() will return
>     Changqing> error by itself. Also can I call ibv_dereg_mr() to free
>     Changqing> the memory I registered to this HCA ?
> 
> Once you get a catastrophic error, all bets are off.  Work 
> request processing is in an undetermined state, since 
> basically the HCA crashed in an unknown way.  Polling CQs is 
> probably not a good idea.
> I guess you do need to deregister memory regions to unpin the 
> memory as part of your cleanup....

Thanks. However, when a catastrophic error occurs there may still be
entries in the CQ. Can I continue to read them using ibv_poll_cq()?

Also, does ibv_dereg_mr() work after a fatal error occurs?


--CQ


> 
>     Changqing> 	If I continue to use the second HCA, 
> does the failure
>     Changqing> of first HCA affect the operation of second HCA (from
>     Changqing> driver point of view) ?
> 
> No.
> 
>  - R.
> 


From halr at voltaire.com  Wed May 16 15:02:52 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 16 May 2007 18:02:52 -0400
Subject: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move
	CMdefinitions from ib_types.h
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901825156@mtlexch01.mtl.com>
References: <1179333484.4531.176519.camel@hal.voltaire.com>
	<464B3537.7060405@ichips.intel.com>
	<1179336155.23882.604.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901825156@mtlexch01.mtl.com>
Message-ID: <1179352971.23882.18793.camel@hal.voltaire.com>

On Wed, 2007-05-16 at 17:24, Yevgeny Kliteynik wrote:
> > From: Hal Rosenstock [mailto:halr at voltaire.com] 
> > Subject: Re: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move
> CMdefinitions from ib_types.h
> > 
> > On Wed, 2007-05-16 at 12:45, Sean Hefty wrote:
> > > Hal Rosenstock wrote:
> > > > OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h
> > > 
> > > CM types are defined in the libibcm library.  Why not remove them 
> > > completely from the opensm code?
> > 
> > I think they are needed in the Windows environment. 
> 
> I'm not sure about this. Windows has another ib_types header,

and those two ib_types.h headers are unrelated? If so, what are the
various WIN conditionals doing in ib_types.h? Can they be removed?

> and I think that all the
> other applications are using this header and not the management
> ib_types.

Are you referring to Windows, Linux, or both here ?

> What are the rest of the CM defines that you want to remove?

Huh ? What are you referring to here ?

-- Hal

> -- Yevgeny
> 
> > I believe Linux userspace would never include this header.
> > 
> > -- Hal
> > 
> > > - Sean
> > 
> > 


From gsadasiv7 at gmail.com  Wed May 16 16:00:54 2007
From: gsadasiv7 at gmail.com (Ganesh Sadasivan)
Date: Wed, 16 May 2007 16:00:54 -0700
Subject: [ofa-general] Running multiple SM
Message-ID: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com>

Hi,

   I have a setup with 2 HCAs connected back to back and am running opensm (
ofed1.1, running at the same priority) on both of them. Is there any utility
to see who is the master?  The smlid in ibv_devinfo, seems to be changing
whenever an SM does a sweep. Is this expected?

Thanks
Ganesh

From halr at voltaire.com  Wed May 16 16:22:00 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 16 May 2007 19:22:00 -0400
Subject: [ofa-general] Running multiple SM
In-Reply-To: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com>
References: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com>
Message-ID: <1179357718.23882.23845.camel@hal.voltaire.com>

Hi Ganesh,

On Wed, 2007-05-16 at 19:00, Ganesh Sadasivan wrote:
> Hi,
> 
>    I have a setup with 2 HCAs connected back to back and am running
> opensm (ofed1.1, running at the same priority) on both of them. Is
> there any utility to see who is the master?

sminfo will show the SM state for a LID/GUID.

>   The smlid in ibv_devinfo, seems to be changing whenever an SM does a
> sweep. Is this expected? 

Nope. If they are both at the same priority, the lower GUID should win
the SM election.
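
The election rule can be sketched as a comparison function: the higher
priority wins, and on a priority tie the lower GUID wins. The struct and
function names below are invented for illustration and are not taken from
OpenSM.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the fields that decide an SM election. */
struct sm_candidate {
    uint8_t  priority;   /* higher wins */
    uint64_t port_guid;  /* lower wins when priorities are equal */
};

/* Returns whichever of a and b should end up as the master SM. */
static const struct sm_candidate *
sm_elect(const struct sm_candidate *a, const struct sm_candidate *b)
{
    assert(a && b);

    if (a->priority != b->priority)
        return a->priority > b->priority ? a : b;

    return a->port_guid <= b->port_guid ? a : b;
}
```

Since the outcome depends only on these two values, the same pair of SMs
should always elect the same master, which is why an smlid that flips on
every sweep is unexpected.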

Not sure what is going wrong in your (back to back HCA) subnet. Do your
ports stay active?

-- Hal

> Thanks
> Ganesh
> 
> ______________________________________________________________________
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From gsadasiv7 at gmail.com  Wed May 16 18:42:19 2007
From: gsadasiv7 at gmail.com (Ganesh Sadasivan)
Date: Wed, 16 May 2007 18:42:19 -0700
Subject: [ofa-general] Running multiple SM
In-Reply-To: <1179357718.23882.23845.camel@hal.voltaire.com>
References: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com>
	<1179357718.23882.23845.camel@hal.voltaire.com>
Message-ID: <532b813a0705161842q59038b1dx69cfa642d989789@mail.gmail.com>

Hi Hal,

 Please see inline.

On 16 May 2007 19:22:00 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
>
> Hi Ganesh,
>
> On Wed, 2007-05-16 at 19:00, Ganesh Sadasivan wrote:
> > Hi,
> >
> >    I have a setup with 2 HCAs connected back to back and am running
> > opensm (ofed1.1, running at the same priority) on both of them. Is
> > there any utility to see who is the master?


Even with priority differences I am seeing the same behavior. Am I missing
an option? I am setting "opensm -s 30" and "opensm -s 60" on the respective
sides.

> sminfo will show the SM state for a LID/GUID.


Thanks.

> >   The smlid in ibv_devinfo, seems to be changing whenever an SM does a
> > sweep. Is this expected?
>
> Nope. If they are both at the same priority, the lower GUID should win
> the SM election.
>
> Not sure what is going wrong in your (back to back HCA) subnet. Do you
> ports stay active ?


Yes both ports are active.

Thanks
Ganesh

> -- Hal
>
> > Thanks
> > Ganesh
> >
>
>

From halr at voltaire.com  Wed May 16 18:57:27 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 16 May 2007 21:57:27 -0400
Subject: [ofa-general] Running multiple SM
In-Reply-To: <532b813a0705161842q59038b1dx69cfa642d989789@mail.gmail.com>
References: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com>
	<1179357718.23882.23845.camel@hal.voltaire.com>
	<532b813a0705161842q59038b1dx69cfa642d989789@mail.gmail.com>
Message-ID: <1179367045.23882.33850.camel@hal.voltaire.com>

Hi again Ganesh,

On Wed, 2007-05-16 at 21:42, Ganesh Sadasivan wrote:
> Hi Hal,
> 
>  Please see inline.
> 
> On 16 May 2007 19:22:00 -0400, Hal Rosenstock <halr at voltaire.com>
> wrote:
>         Hi Ganesh,
>         
>         On Wed, 2007-05-16 at 19:00, Ganesh Sadasivan wrote:
>         > Hi,
>         >
>         >    I have a setup with 2 HCAs connected back to back and am
>         running
>         > opensm (ofed1.1, running at the same priority) on both of
>         them. Is
>         > there any utility to see who is the master?
> 
> Even with priority difeferences I am seeing the same behavior.Am I
> missing any option. I am setting "opensm -s 30" and "opensm -s 60" on
> the respective sides.

Why not use the default (10 secs) or at least the same on both sides ?

>         sminfo will show the SM state for a LID/GUID.
> 
> 
> Thanks. 
> 
>         >   The smlid in ibv_devinfo, seems to be changing whenever an
>         SM does a
>         > sweep. Is this expected?
>         
>         Nope. If they are both at the same priority, the lower GUID
>         should win
>         the SM election.
>         
>         Not sure what is going wrong in your (back to back HCA)
>         subnet. Do you 
>         ports stay active ?
> 
> 
> Yes both ports are active. 

And they stay active (no LED color changes) ?

If not, can you run both OpenSMs in verbose mode (-V) and see if there
is anything interesting/relevant in the logs ?

-- Hal

> Thanks
> Ganesh
> 
>         -- Hal
>         
>         > Thanks
>         > Ganesh
>         >
>         >
>         
> 


From rdreier at cisco.com  Wed May 16 19:03:13 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 16 May 2007 19:03:13 -0700
Subject: [ofa-general] Re: OFED HA related question
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net>
	(Changqing Tang's message of "Wed, 16 May 2007 21:43:30 -0000")
References: <1179145102.25749.11.camel@mtls03> <adabqgmtb6e.fsf@cisco.com>
	<1179242042.25749.33.camel@mtls03> <ada7irat6xw.fsf@cisco.com>
	<349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net>
	<adairasms6f.fsf@cisco.com>
	<349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net>
Message-ID: <adak5v8kxzy.fsf@cisco.com>

 > Thanks. However, when catastrophic error occurs, there are some
 > entries in CQ, can I continue to peek them using ibv_poll_cq() ?

Not necessarily.  It's better not to do anything once a catastrophic
error is reported, because everything is in an indeterminate state.

 > Also does ibv_dereg_mr() work when fatal error occurs ?

It will probably fail but you should try to destroy all your resources
I guess.

 - R.


From gsadasiv7 at gmail.com  Wed May 16 21:18:15 2007
From: gsadasiv7 at gmail.com (Ganesh Sadasivan)
Date: Wed, 16 May 2007 21:18:15 -0700
Subject: [ofa-general] Running multiple SM
In-Reply-To: <1179367045.23882.33850.camel@hal.voltaire.com>
References: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com>
	<1179357718.23882.23845.camel@hal.voltaire.com>
	<532b813a0705161842q59038b1dx69cfa642d989789@mail.gmail.com>
	<1179367045.23882.33850.camel@hal.voltaire.com>
Message-ID: <532b813a0705162118t1ad20d8al65f940116971e50e@mail.gmail.com>

The reason is:
Jan 01 01:46:17 321555 [58F3E280] -> osm_vendor_set_sm: ERR 5431: setting
IS_SM capability mask failed; errno 2

Judging from the code, /dev/infiniband/issm<umad_port> needs to be
created, and I did that. But the SM with the higher GUID still seems to
become the master whenever it does a sweep. The logs are too detailed, so I
am sending snippets.

Local port (with a high GUID)
Jan 01 02:49:56 332142 [5873E280] -> osm_pi_rcv_process: Discovered port num
0x1 with GUID = 0x2c901097682d1 for parent node GUID = 0x2c901097682d0, TID
= 0x1236
Jan 01 02:49:56 332197 [5873E280] -> PortInfo dump:
                                port number.............0x1
                                node_guid...............0x0002c901097682d0
                                port_guid...............0x0002c901097682d1
                                m_key...................0x0000000000000000
                                subnet_prefix...........0xfe80000000000000
                                base_lid................0x1
                                master_sm_base_lid......0x2
                                capability_mask.........0x2510A68
                                diag_code...............0x0
                                m_key_lease_period......0x0
                                local_port_num..........0x1
                                link_width_enabled......0x3
                                link_width_supported....0x3
                                link_width_active.......0x2
                                link_speed_supported....0x1
                                port_state..............ACTIVE
                                state_info2.............0x52
                                m_key_protect_bits......0x0
                                lmc.....................0x0
                                link_speed..............0x11
                                mtu_smsl................0x40
                                vl_cap_init_type........0x40
                                vl_high_limit...........0x0
                                vl_arb_high_cap.........0x8
                                vl_arb_low_cap..........0x8
                                init_rep_mtu_cap........0x4
                                vl_stall_life...........0xFF
                                vl_enforce..............0x40
                                m_key_violations........0x0
                                p_key_violations........0x0
                                q_key_violations........0x0
                                guid_cap................0x20
                                client_reregister.......0x0
                                subnet_timeout..........0x12
                                resp_time_value.........0x10
                                error_threshold.........0x88
Jan 01 02:49:56 332337 [5873E280] -> Capabilities Mask:
                                IB_PORT_CAP_HAS_TRAP
                                IB_PORT_CAP_HAS_AUTO_MIG
                                IB_PORT_CAP_HAS_SL_MAP
                                IB_PORT_CAP_HAS_LED_INFO
                                IB_PORT_CAP_HAS_SYS_IMG_GUID
                                IB_PORT_CAP_HAS_COM_MGT
                                IB_PORT_CAP_HAS_VEND_CLS
                                IB_PORT_CAP_HAS_CAP_NTC
                                IB_PORT_CAP_HAS_CLIENT_REREG

Remote Port which hosts the SM:
Jan 01 02:49:56 500638 [5AF3E280] -> osm_pi_rcv_process: Discovered port num
0x1 with GUID = 0x2c90109765da1 for parent node GUID = 0x2c90109765da0, TID
= 0x123b
Jan 01 02:49:56 500690 [5AF3E280] -> PortInfo dump:
                                port number.............0x1
                                node_guid...............0x0002c90109765da0
                                port_guid...............0x0002c90109765da1
                                m_key...................0x0000000000000000
                                subnet_prefix...........0xfe80000000000000
                                base_lid................0x2
                                master_sm_base_lid......0x2
                                capability_mask.........0x2510A68
                                diag_code...............0x0
                                m_key_lease_period......0x0
                                local_port_num..........0x1
                                link_width_enabled......0x3
                                link_width_supported....0x3
                                link_width_active.......0x2
                                link_speed_supported....0x1
                                port_state..............ACTIVE
                                state_info2.............0x52
                                m_key_protect_bits......0x0
                                lmc.....................0x0
                                link_speed..............0x11
                                mtu_smsl................0x40
                                vl_cap_init_type........0x40
                                vl_high_limit...........0x0
                                vl_arb_high_cap.........0x8
                                vl_arb_low_cap..........0x8
                                init_rep_mtu_cap........0x4
                                vl_stall_life...........0xFF
                                vl_enforce..............0x40
                                m_key_violations........0x0
                                p_key_violations........0x0
                                q_key_violations........0x0
                                guid_cap................0x20
                                client_reregister.......0x0
                                subnet_timeout..........0x12
                                resp_time_value.........0x10
                                error_threshold.........0x88
Jan 01 02:49:56 500831 [5AF3E280] -> Capabilities Mask:
                                IB_PORT_CAP_HAS_TRAP
                                IB_PORT_CAP_HAS_AUTO_MIG
                                IB_PORT_CAP_HAS_SL_MAP
                                IB_PORT_CAP_HAS_LED_INFO
                                IB_PORT_CAP_HAS_SYS_IMG_GUID
                                IB_PORT_CAP_HAS_COM_MGT
                                IB_PORT_CAP_HAS_VEND_CLS
                                IB_PORT_CAP_HAS_CAP_NTC
                                IB_PORT_CAP_HAS_CLIENT_REREG
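
One thing that can be read straight from the dumps above is the IsSM bit
in capability_mask, which is what ERR 5431 is about. A minimal sketch of
the check, assuming the mask is printed in host byte order and that IsSM
is bit 1 (as with IB_PORT_SM in the kernel headers); the macro and
function names are made up for the example:

```c
#include <stdint.h>

/* Assumption: IsSM is bit 1 of the PortInfo CapabilityMask. */
#define PORT_CAP_IS_SM (1u << 1)

/* Return 1 if the port advertises a subnet manager, 0 otherwise. */
static int port_is_sm(uint32_t capability_mask)
{
	return (capability_mask & PORT_CAP_IS_SM) != 0;
}
```

For the value 0x2510A68 shown in both dumps this returns 0, which is
consistent with neither Capabilities Mask list above containing an IS_SM
entry.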

Please let me know if I should look at some specific portion.

Thanks
Ganesh


On 16 May 2007 21:57:27 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
>
> Hi again Ganesh,
>
> On Wed, 2007-05-16 at 21:42, Ganesh Sadasivan wrote:
> > Hi Hal,
> >
> >  Please see inline.
> >
> > On 16 May 2007 19:22:00 -0400, Hal Rosenstock <halr at voltaire.com>
> > wrote:
> >         Hi Ganesh,
> >
> >         On Wed, 2007-05-16 at 19:00, Ganesh Sadasivan wrote:
> >         > Hi,
> >         >
> >         >    I have a setup with 2 HCAs connected back to back and am
> >         running
> >         > opensm (ofed1.1, running at the same priority) on both of
> >         them. Is
> >         > there any utility to see who is the master?
> >
> > Even with priority difeferences I am seeing the same behavior.Am I
> > missing any option. I am setting "opensm -s 30" and "opensm -s 60" on
> > the respective sides.
>
> Why not use the default (10 secs) or at least the same on both sides ?
>
> >         sminfo will show the SM state for a LID/GUID.
> >
> >
> > Thanks.
> >
> >         >   The smlid in ibv_devinfo, seems to be changing whenever an
> >         SM does a
> >         > sweep. Is this expected?
> >
> >         Nope. If they are both at the same priority, the lower GUID
> >         should win
> >         the SM election.
> >
> >         Not sure what is going wrong in your (back to back HCA)
> >         subnet. Do you
> >         ports stay active ?
> >
> >
> > Yes both ports are active.
>
> And they stay active (no LED color changes) ?
>
> If not, can you run both OpenSMs in verbose mode (-V) and see if there
> is anything interesting/relevant in the logs ?
>
> -- Hal
>
> > Thanks
> > Ganesh
> >
> >         -- Hal
> >
> >         > Thanks
> >         > Ganesh
> >         >
> >         >
> >
> >
> >
>
>

From keshetti.mahesh at gmail.com  Wed May 16 21:57:28 2007
From: keshetti.mahesh at gmail.com (Keshetti Mahesh)
Date: Thu, 17 May 2007 10:27:28 +0530
Subject: [ofa-general] problem with loading IB modules in a IB node with
	OFED.
In-Reply-To: <200705161646.49910.minich@ornl.gov>
References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com>
	<200705161646.49910.minich@ornl.gov>
Message-ID: <829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com>

> uname -r
2.6.9-34.ELsmp

> modinfo madeye.ko | grep vermagic
vermagic:       2.6.9-34.ELsmp SMP gcc-3.4

> Also, you might want to check dmesg.

dmesg output:
madeye: disagrees about version of symbol ib_unregister_client
madeye: Unknown symbol ib_unregister_client
madeye: disagrees about version of symbol ib_register_mad_snoop
madeye: Unknown symbol ib_register_mad_snoop
madeye: disagrees about version of symbol ib_register_client
madeye: Unknown symbol ib_register_client
madeye: disagrees about version of symbol ib_unregister_mad_agent
madeye: Unknown symbol ib_unregister_mad_agent
madeye: disagrees about version of symbol ib_set_client_data
madeye: Unknown symbol ib_set_client_data
madeye: disagrees about version of symbol ib_get_client_data
madeye: Unknown symbol ib_get_client_data

I think the problem is with the IB headers the module is being compiled
against. In my case I am compiling the 'madeye' module with the IB headers
available in the 2.6.9-34.ELsmp source code, whereas the IB verbs are
compiled against the sources from OFED-1.0. But I don't know how to compile
my module with the OFED-1.0 headers (because the headers are not available
after the compilation).

-- 
Thanks and regards,
Mahesh.


From devesh28 at gmail.com  Wed May 16 22:21:53 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Thu, 17 May 2007 10:51:53 +0530
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <464B5C07.8040601@ichips.intel.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<000101c7963a$3474ae00$49c9180a@amr.corp.intel.com>
	<309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com>
	<464B5C07.8040601@ichips.intel.com>
Message-ID: <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com>

On 5/17/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > But initially this will generate a packet for each path, while sys
> > admin knows that path is there and he can hard-code the entries for
> > it. Other thing is that why Admin will care about creating such record
> > while SA is itself taking care, right?
>
> In your original message you asked about adding 'dummy entries' to the
> cache.  I agree that pre-loading the cache can be useful.  What I still
> am not understanding is the reasoning for adding 'dummy entries'.  By
> 'dummy entries', I've been assuming that these are invalid path records,
> but maybe that's not what you meant.
OK, if the phrase "dummy entries" has caused confusion then I am sorry
for that. What I mean by it is valid path records which the administrator
knows in advance; while loading the module, the admin loads this info
into the cache with a user command.
>
> > Another point I want to know is,
> > When local_sa_cache module will be inserted? After SM comes up or
> > Before SM comes up?
>
> It can occur either way.  There is no restriction.  The cache responds
> to port up and GID in/out of service events to update itself.
Do you mean the cache module will start building the cache only after the port is UP?
>
> > If Its inserted before SM is coming up (I am assuming SM is running on
> > some node not on switch) then First Forced schedule_update() is
> > waisted, and for the first application presence of cache is
> > meaningless. Why not to keep cache effective right from the start?
>
> Pre-loading the cache with path records doesn't guarantee that those
> paths are usable.  If the SM has not come up, then the path records will
> be unusable until the SM configures the subnet, plus there's no
> guarantee that the remote endpoints specified by the paths are running.
You mean that even if the SM is UP and we have some hard-coded path
record entries corresponding to some node X, there is no guarantee that
node X has actually come up? In that case path resolution should
actually fail if the node has not come up, but with the hard-coded
entries the path will still be resolved?
>
> The main benefit I see to pre-loading the cache is to avoid SA storms
> when booting a large cluster.
That's true. Also, the cache will get valid entries only if the network
has been configured by the SM; otherwise every node's SA will, possibly,
drop SA packets.
>
> - Sean
>


From glebn at voltaire.com  Wed May 16 23:26:38 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Thu, 17 May 2007 09:26:38 +0300
Subject: [ofa-general] Re: OFED HA related question
In-Reply-To: <adak5v8kxzy.fsf@cisco.com>
References: <1179145102.25749.11.camel@mtls03> <adabqgmtb6e.fsf@cisco.com>
	<1179242042.25749.33.camel@mtls03> <ada7irat6xw.fsf@cisco.com>
	<349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net>
	<adairasms6f.fsf@cisco.com>
	<349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net>
	<adak5v8kxzy.fsf@cisco.com>
Message-ID: <20070517062637.GL6273@minantech.com>

On Wed, May 16, 2007 at 07:03:13PM -0700, Roland Dreier wrote:
>  > Also does ibv_dereg_mr() work when fatal error occurs ?
> 
> It will probably fail but you should try to destroy all your resources
> I guess.
> 
This is a very good question. An application should be able to unpin memory
even if the HCA is completely dead. AFAIR you introduced (or want to introduce)
a separation of the pinning and registering APIs in libibverbs, and then
unpinning will be totally independent of HCA state, but what will happen
in the current driver from ofed1.2?

--
			Gleb.


From mst at dev.mellanox.co.il  Thu May 17 00:30:28 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 May 2007 10:30:28 +0300
Subject: [ofa-general] Re: [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR
	leak
In-Reply-To: <adaejlgmrkb.fsf@cisco.com>
References: <20070515210453.GL4161@mellanox.co.il>
	<20070516101457.GA5091@mellanox.co.il> <adaejlgmrkb.fsf@cisco.com>
Message-ID: <20070517073017.GA4205@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR leak
> 
>  > + * - Put the QP in the Error State
>  > + * - Wait for the Affiliated Asynchronous Last WQE Reached Event;
>  > + * - either:
>  > + *       drain the CQ by invoking the Poll CQ verb and either wait for CQ
>  > + *       to be empty or the number of Poll CQ operations has exceeded
>  > + *       CQ capacity size;
>  > + * - or
>  > + *       post another WR that completes on the same CQ and wait for this
>  > + *       WR to return as a WC; (NB: this is the option that we use)
>  > + * and then invoke a Destroy QP or Reset QP.
> 
> I guess this last line would look better as
> 
>  * - invoke a Destroy QP or Reset QP.

Hmm, I would like to quote the spec *literally*. Maybe
- and then invoke a Destroy QP or Reset QP.

>  > +static struct ib_qp_attr ipoib_cm_err_attr __read_mostly = {
>  > +	.qp_state = IB_QPS_ERR
>  > +};
>  > +
>  > +#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff
>  > +
>  > +static struct ib_send_wr ipoib_cm_rx_drain_wr __read_mostly = {
>  > +	.wr_id = IPOIB_CM_RX_DRAIN_WRID,
>  > +	.opcode = IB_WR_SEND
>  > +};
> 
> I don't think these are hot enough to be worth marking as __read_mostly.
> (better to leave them in normal .data so that stuff that is written to
> ends up getting spaced out more)

OK, thanks for the suggestion.

>  > +	qp_attr.qp_state = IB_QPS_INIT;
>  > +	qp_attr.port_num = priv->port;
>  > +	qp_attr.qkey = 0;
>  > +	qp_attr.qp_access_flags = 0;
>  > +	ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr,
>  > +			   IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PORT | IB_QP_QKEY);
>  > +	if (ret) {
>  > +		ipoib_warn(priv, "failed to modify drain QP to INIT: %d\n", ret);
>  > +		goto err_qp;
>  > +	}
>  > +
>  > +	/* We put the QP in error state directly: this way, hardware
>  > +	 * will immediately generate WC for each WR we post, without
>  > +	 * sending anything on the wire. */
>  > +	qp_attr.qp_state = IB_QPS_ERR;
>  > +	ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr, IB_QP_STATE);
>  > +	if (ret) {
>  > +		ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret);
>  > +		goto err_qp;
>  > +	}
> 
> This actually seems like a good motivation for the mthca RESET ->
> ERROR fix.  We could avoid the transition to INIT if we fixed mthca
> and mlx4, right?

Yes. That was the motivation.

> (By the way, any interest in making an mlx4 patch to
> fix that too?)

Easy (I also fixed reset to reset on the way).
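
The idea of the fix, sketched outside the driver: RESET -> RESET is a
no-op, and RESET -> ERROR is done as RESET -> INIT followed by
INIT -> ERROR. The enum, the callback type and the recording helper
below are stand-ins made up for illustration, not mlx4 code; the real
change is in the patch that follows.

```c
#include <stddef.h>

enum qp_state { QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS, QPS_ERR };

/* Stand-in for the hardware command that performs one transition. */
typedef int (*hw_modify_fn)(enum qp_state from, enum qp_state to, void *ctx);

static int modify_qp_state(enum qp_state cur, enum qp_state next,
			   hw_modify_fn hw_modify, void *ctx)
{
	int err;

	/* RESET -> RESET: nothing to do. */
	if (cur == QPS_RESET && next == QPS_RESET)
		return 0;

	/* RESET -> ERROR: go through INIT first. */
	if (cur == QPS_RESET && next == QPS_ERR) {
		err = hw_modify(QPS_RESET, QPS_INIT, ctx);
		if (err)
			return err;
		cur = QPS_INIT;
	}

	return hw_modify(cur, next, ctx);
}

/* Hypothetical helper: records transitions instead of touching hardware. */
struct transition_log {
	size_t count;
	enum qp_state from[4];
	enum qp_state to[4];
};

static int record_transition(enum qp_state from, enum qp_state to, void *ctx)
{
	struct transition_log *log = ctx;

	if (log->count >= 4)
		return -1;
	log->from[log->count] = from;
	log->to[log->count] = to;
	log->count++;
	return 0;
}
```

Calling modify_qp_state(QPS_RESET, QPS_ERR, record_transition, &log) on
a zeroed log records two transitions, RESET -> INIT and INIT -> ERROR.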

IB/mlx4: fix RESET -> ERROR and RESET -> RESET transitions

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 5cd7069..c93daab 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -573,7 +573,7 @@ static int to_mlx4_st(enum ib_qp_type type)
 	}
 }
 
-static __be32 to_mlx4_access_flags(struct mlx4_ib_qp *qp, struct ib_qp_attr *attr,
+static __be32 to_mlx4_access_flags(struct mlx4_ib_qp *qp, const struct ib_qp_attr *attr,
 				   int attr_mask)
 {
 	u8 dest_rd_atomic;
@@ -603,7 +603,7 @@ static __be32 to_mlx4_access_flags(struct mlx4_ib_qp *qp, struct ib_qp_attr *att
 	return cpu_to_be32(hw_access_flags);
 }
 
-static void store_sqp_attrs(struct mlx4_ib_sqp *sqp, struct ib_qp_attr *attr,
+static void store_sqp_attrs(struct mlx4_ib_sqp *sqp, const struct ib_qp_attr *attr,
 			    int attr_mask)
 {
 	if (attr_mask & IB_QP_PKEY_INDEX)
@@ -619,7 +619,7 @@ static void mlx4_set_sched(struct mlx4_qp_path *path, u8 port)
 	path->sched_queue = (path->sched_queue & 0xbf) | ((port - 1) << 6);
 }
 
-static int mlx4_set_path(struct mlx4_ib_dev *dev, struct ib_ah_attr *ah,
+static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah,
 			 struct mlx4_qp_path *path, u8 port)
 {
 	path->grh_mylmc     = ah->src_path_bits & 0x7f;
@@ -655,14 +655,14 @@ static int mlx4_set_path(struct mlx4_ib_dev *dev, struct ib_ah_attr *ah,
 	return 0;
 }
 
-int mlx4_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
-		      int attr_mask, struct ib_udata *udata)
+static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
+			       const struct ib_qp_attr *attr, int attr_mask,
+			       enum ib_qp_state cur_state, enum ib_qp_state new_state)
 {
 	struct mlx4_ib_dev *dev = to_mdev(ibqp->device);
 	struct mlx4_ib_qp *qp = to_mqp(ibqp);
 	struct mlx4_qp_context *context;
 	enum mlx4_qp_optpar optpar = 0;
-	enum ib_qp_state cur_state, new_state;
 	int sqd_event;
 	int err = -EINVAL;
 
@@ -670,34 +670,6 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 	if (!context)
 		return -ENOMEM;
 
-	mutex_lock(&qp->mutex);
-
-	cur_state = attr_mask & IB_QP_CUR_STATE ? attr->cur_qp_state : qp->state;
-	new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state;
-
-	if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask))
-		goto out;
-
-	if ((attr_mask & IB_QP_PKEY_INDEX) &&
-	     attr->pkey_index >= dev->dev->caps.pkey_table_len) {
-		goto out;
-	}
-
-	if ((attr_mask & IB_QP_PORT) &&
-	    (attr->port_num == 0 || attr->port_num > dev->dev->caps.num_ports)) {
-		goto out;
-	}
-
-	if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC &&
-	    attr->max_rd_atomic > dev->dev->caps.max_qp_init_rdma) {
-		goto out;
-	}
-
-	if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC &&
-	    attr->max_dest_rd_atomic > 1 << dev->dev->caps.max_qp_dest_rdma) {
-		goto out;
-	}
-
 	context->flags = cpu_to_be32((to_mlx4_state(new_state) << 28) |
 				     (to_mlx4_st(ibqp->qp_type) << 16));
 	context->flags     |= cpu_to_be32(1 << 8); /* DE? */
@@ -920,11 +892,83 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 	}
 
 out:
-	mutex_unlock(&qp->mutex);
 	kfree(context);
 	return err;
 }
 
+static const struct ib_qp_attr mlx4_ib_qp_attr = { .port_num = 1 };
+static const int mlx4_ib_qp_attr_mask_table[IB_QPT_UD + 1] = {
+		[IB_QPT_UD]  = (IB_QP_PKEY_INDEX		|
+				IB_QP_PORT			|
+				IB_QP_QKEY),
+		[IB_QPT_UC]  = (IB_QP_PKEY_INDEX		|
+				IB_QP_PORT			|
+				IB_QP_ACCESS_FLAGS),
+		[IB_QPT_RC]  = (IB_QP_PKEY_INDEX		|
+				IB_QP_PORT			|
+				IB_QP_ACCESS_FLAGS),
+		[IB_QPT_SMI] = (IB_QP_PKEY_INDEX		|
+				IB_QP_QKEY),
+		[IB_QPT_GSI] = (IB_QP_PKEY_INDEX		|
+				IB_QP_QKEY),
+};
+
+int mlx4_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		      int attr_mask, struct ib_udata *udata)
+{
+	struct mlx4_ib_dev *dev = to_mdev(ibqp->device);
+	struct mlx4_ib_qp *qp = to_mqp(ibqp);
+	enum ib_qp_state cur_state, new_state;
+	int err = -EINVAL;
+
+	mutex_lock(&qp->mutex);
+
+	cur_state = attr_mask & IB_QP_CUR_STATE ? attr->cur_qp_state : qp->state;
+	new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state;
+
+	if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask))
+		goto out;
+
+	if ((attr_mask & IB_QP_PKEY_INDEX) &&
+	     attr->pkey_index >= dev->dev->caps.pkey_table_len) {
+		goto out;
+	}
+
+	if ((attr_mask & IB_QP_PORT) &&
+	    (attr->port_num == 0 || attr->port_num > dev->dev->caps.num_ports)) {
+		goto out;
+	}
+
+	if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC &&
+	    attr->max_rd_atomic > dev->dev->caps.max_qp_init_rdma) {
+		goto out;
+	}
+
+	if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC &&
+	    attr->max_dest_rd_atomic > 1 << dev->dev->caps.max_qp_dest_rdma) {
+		goto out;
+	}
+
+	if (cur_state == new_state && cur_state == IB_QPS_RESET) {
+		err = 0;
+		goto out;
+	}
+
+	if (cur_state == IB_QPS_RESET && new_state == IB_QPS_ERR) {
+		err = __mlx4_ib_modify_qp(ibqp, &mlx4_ib_qp_attr,
+					  mlx4_ib_qp_attr_mask_table[ibqp->qp_type],
+					  IB_QPS_RESET, IB_QPS_INIT);
+		if (err)
+			goto out;
+		cur_state = IB_QPS_INIT;
+	}
+ 
+	err = __mlx4_ib_modify_qp(ibqp, attr, attr_mask, cur_state, new_state);
+out:
+	mutex_unlock(&qp->mutex);
+	return err;
+}
+
 static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr,
 			    void *wqe)
 {

-- 
MST


From eli at mellanox.co.il  Thu May 17 00:32:37 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 17 May 2007 10:32:37 +0300
Subject: [ofa-general] [PATCH 1/2] libmlx4: pass more data from user to
	kernel
Message-ID: <1179387187.25749.61.camel@mtls03>

Pass user-calculated values to the kernel, which will use them to
configure the QP and pin memory.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: libmlx4/src/mlx4-abi.h
===================================================================
--- libmlx4.orig/src/mlx4-abi.h	2007-05-16 16:37:01.000000000 +0300
+++ libmlx4/src/mlx4-abi.h	2007-05-17 09:46:56.000000000 +0300
@@ -35,7 +35,7 @@
 
 #include <infiniband/kern-abi.h>
 
-#define MLX4_UVERBS_ABI_VERSION	1
+#define MLX4_UVERBS_ABI_VERSION	2
 
 struct mlx4_alloc_ucontext_resp {
 	struct ibv_get_context_resp	ibv_resp;
@@ -83,6 +83,10 @@ struct mlx4_create_qp {
 	struct ibv_create_qp		ibv_cmd;
 	__u64				buf_addr;
 	__u64				db_addr;
+	__u64				rq_size;
+	__u64				sq_size;
+	__u8				rcv_wqe_shift;
+	__u8				log_wqe_bb;
 };
 
 #endif /* MLX4_ABI_H */
Index: libmlx4/src/verbs.c
===================================================================
--- libmlx4.orig/src/verbs.c	2007-05-16 16:37:01.000000000 +0300
+++ libmlx4/src/verbs.c	2007-05-17 09:37:46.000000000 +0300
@@ -385,6 +385,11 @@ struct ibv_qp *mlx4_create_qp(struct ibv
 
 	cmd.buf_addr = (uintptr_t) qp->buf.buf;
 	cmd.db_addr  = (uintptr_t) qp->db;
+	cmd.rq_size = (uintptr_t) qp->rq.max;
+	cmd.sq_size = (uintptr_t) qp->sq.max;
+	cmd.rcv_wqe_shift = qp->rq.wqe_shift;
+	cmd.log_wqe_bb = qp->sq.wqe_shift;
+	qp->max_inline_data = attr->cap.max_inline_data;
 
 	ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd,
 				&resp, sizeof resp);
@@ -395,12 +400,6 @@ struct ibv_qp *mlx4_create_qp(struct ibv
 	if (ret)
 		goto err_destroy;
 
-	qp->sq.max	    = attr->cap.max_send_wr;
-	qp->rq.max	    = attr->cap.max_recv_wr;
-	qp->sq.max_gs	    = attr->cap.max_send_sge;
-	qp->rq.max_gs	    = attr->cap.max_recv_sge;
-	qp->max_inline_data = attr->cap.max_inline_data;
-
 	qp->doorbell_qpn    = htonl(qp->ibv_qp.qp_num << 8);
 	if (attr->sq_sig_all)
 		qp->sq_signal_bits = htonl(MLX4_WQE_CTRL_CQ_UPDATE);


From eli at mellanox.co.il  Thu May 17 00:32:41 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 17 May 2007 10:32:41 +0300
Subject: [ofa-general] [PATCH 2/2] IB/mlx4: pass more data from user to
	kernel
Message-ID: <1179387217.25749.62.camel@mtls03>

For user space consumers, the kernel code makes only minimal calculations
on the WQ sizes in order to calculate the buffer size.
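
As a rough sketch of the calculation (the function name and standalone
form are made up for illustration; the actual code is
set_user_qp_size() in the patch below): both queue sizes must be powers
of two, and the buffer size is the sum of the two queues, each size
shifted left by its WQE shift.

```c
#include <stdint.h>

/*
 * Return 0 and store the buffer size in *buf_size, or -1 if a queue
 * size is not a power of two.  Note that x & (x - 1) is also zero for
 * x == 0, so a size of 0 passes this check.
 */
static int user_qp_buf_size(uint64_t rq_size, uint8_t rq_wqe_shift,
			    uint64_t sq_size, uint8_t sq_wqe_shift,
			    uint64_t *buf_size)
{
	if ((rq_size & (rq_size - 1)) || (sq_size & (sq_size - 1)))
		return -1;

	*buf_size = (rq_size << rq_wqe_shift) + (sq_size << sq_wqe_shift);
	return 0;
}
```

For example, 128 receive WQEs of 16 bytes (shift 4) and 64 send WQEs of
64 bytes (shift 6) give 128*16 + 64*64 = 6144 bytes.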

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: connectx_kernel/drivers/infiniband/hw/mlx4/qp.c
===================================================================
--- connectx_kernel.orig/drivers/infiniband/hw/mlx4/qp.c	2007-05-16 16:37:35.000000000 +0300
+++ connectx_kernel/drivers/infiniband/hw/mlx4/qp.c	2007-05-17 09:13:49.000000000 +0300
@@ -188,8 +188,8 @@ static int send_wqe_overhead(enum ib_qp_
 	}
 }
 
-static int set_qp_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
-		       enum ib_qp_type type, struct mlx4_ib_qp *qp)
+static int set_kernel_qp_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
+			      enum ib_qp_type type, struct mlx4_ib_qp *qp)
 {
 	/* Sanity check QP size before proceeding */
 	if (cap->max_send_wr	 > dev->dev->caps.max_wqes  ||
@@ -249,6 +249,23 @@ static int set_qp_size(struct mlx4_ib_de
 	return 0;
 }
 
+static int set_user_qp_size(struct mlx4_ib_qp *qp,
+			    struct mlx4_ib_create_qp *ucmd)
+{
+	if (ucmd->rq_size & ucmd->rq_size - 1 || ucmd->sq_size & ucmd->sq_size - 1)
+		return -EINVAL;
+
+	qp->rq.max = ucmd->rq_size;
+	qp->rq.wqe_shift = ucmd->rcv_wqe_shift;
+	qp->sq.wqe_shift = ucmd->log_wqe_bb;
+	qp->sq.max = ucmd->sq_size;
+
+	qp->buf_size = (qp->rq.max << qp->rq.wqe_shift) +
+		(qp->sq.max << qp->sq.wqe_shift);
+
+	return 0;
+}
+
 static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			    struct ib_qp_init_attr *init_attr,
 			    struct ib_udata *udata, int sqpn, struct mlx4_ib_qp *qp)
@@ -270,10 +287,6 @@ static int create_qp_common(struct mlx4_
 	qp->sq.head	    = 0;
 	qp->sq.tail	    = 0;
 
-	err = set_qp_size(dev, &init_attr->cap, init_attr->qp_type, qp);
-	if (err)
-		goto err;
-
 	if (pd->uobject) {
 		struct mlx4_ib_create_qp ucmd;
 
@@ -282,6 +295,10 @@ static int create_qp_common(struct mlx4_
 			goto err;
 		}
 
+		err = set_user_qp_size(qp, &ucmd);
+		if (err)
+			goto err;
+
 		qp->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr,
 				       qp->buf_size, 0);
 		if (IS_ERR(qp->umem)) {
@@ -303,6 +320,10 @@ static int create_qp_common(struct mlx4_
 		if (err)
 			goto err_mtt;
 	} else {
+		err = set_kernel_qp_size(dev, &init_attr->cap, init_attr->qp_type, qp);
+		if (err)
+			goto err;
+
 		err = mlx4_ib_db_alloc(dev, &qp->db, 0);
 		if (err)
 			goto err;
Index: connectx_kernel/drivers/infiniband/hw/mlx4/user.h
===================================================================
--- connectx_kernel.orig/drivers/infiniband/hw/mlx4/user.h	2007-05-16 16:37:35.000000000 +0300
+++ connectx_kernel/drivers/infiniband/hw/mlx4/user.h	2007-05-17 09:45:45.000000000 +0300
@@ -39,7 +39,7 @@
  * Increment this value if any changes that break userspace ABI
  * compatibility are made.
  */
-#define MLX4_IB_UVERBS_ABI_VERSION	1
+#define MLX4_IB_UVERBS_ABI_VERSION	2
 
 /*
  * Make sure that all structs defined in this file remain laid out so
@@ -87,6 +87,10 @@ struct mlx4_ib_create_srq_resp {
 struct mlx4_ib_create_qp {
 	__u64	buf_addr;
 	__u64	db_addr;
+	__u64	rq_size;
+	__u64	sq_size;
+	__u8	rcv_wqe_shift;
+	__u8	log_wqe_bb;
 };
 
 #endif /* MLX4_IB_USER_H */
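
For illustration only, the size check and buffer-size arithmetic from
set_user_qp_size() can be sketched as ordinary user-space C. The struct
and function names below are hypothetical stand-ins, not the driver's
types; only the four field names follow the patch. Note that, as in the
patch, a size of zero passes the x & (x - 1) test.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields userspace passes in the patch. */
struct user_qp_sizes {
	uint64_t rq_size;       /* number of receive WQEs */
	uint64_t sq_size;       /* number of send WQEs */
	uint8_t  rcv_wqe_shift; /* log2 of the receive WQE stride in bytes */
	uint8_t  log_wqe_bb;    /* log2 of the send WQE stride in bytes */
};

/* True for powers of two and for zero, like the patch's x & (x - 1) test. */
static int is_pow2_or_zero(uint64_t x)
{
	return (x & (x - 1)) == 0;
}

/* Returns 0 and stores the buffer size, or -EINVAL for a bad queue size. */
static int user_qp_buf_size(const struct user_qp_sizes *u, uint64_t *buf_size)
{
	assert(u != NULL && buf_size != NULL);

	if (!is_pow2_or_zero(u->rq_size) || !is_pow2_or_zero(u->sq_size))
		return -EINVAL;

	/* Receive queue bytes plus send queue bytes. */
	*buf_size = (u->rq_size << u->rcv_wqe_shift) +
		    (u->sq_size << u->log_wqe_bb);
	return 0;
}
```

With rq_size = 128, sq_size = 64 and both shifts equal to 6, the sketch
yields (128 << 6) + (64 << 6) = 12288 bytes; rq_size = 100 is rejected.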


From mst at dev.mellanox.co.il  Thu May 17 01:06:23 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 May 2007 11:06:23 +0300
Subject: [ofa-general] Re: movnt
In-Reply-To: <ada646smqr5.fsf@cisco.com>
References: <20070508141727.GR21591@mellanox.co.il> <ada4pmjz7tm.fsf@cisco.com>
	<20070512172927.GA5908@mellanox.co.il> <adamz09yc19.fsf@cisco.com>
	<20070515204335.GI4161@mellanox.co.il> <ada646smqr5.fsf@cisco.com>
Message-ID: <20070517080623.GB4205@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: movnt
> 
>  > So we can map the device memory with WB or WT semantics, and movnt will enable
>  > WC. And the nice thing about this trick, is that both WB and WT *are already
>  > programmed into PAT after reset*, which means that we can use them for pages we
>  > map for userspace, without stepping on anyone's toes or waiting for
>  > the generic in-kernel support for WC to materialize.
> 
> I'm not sure whether this is much of an advantage.  There's no generic
> way to map memory with WB that I know of.

I think we just need to avoid setting noncacheable flag -
this does not sound too bad.

>  I don't think that setting
> a PAT entry for WC is the hold-up

I think it is - editing PAT will affect all of the system, you start
hitting various errata ...

> -- the problem is more in the right
> infrastructure for pgprot_xxx().  I don't think it's very nice to have
> #ifdef __x86_64__ in a driver.

Not nice, but I think doing it in a platform-dependent way first
will be a good way for whoever implements the portable infrastructure
to see the patterns. So far, most people think about X as the main
user for the infrastructure-to-emerge - an easier to experiment with
user such as libmlx4 will be a good testcase I think.

Look at drivers/infiniband/hw/ipath/ipath_file_ops.c
for example - so why make an exception for mthca?

Isn't the following a good start?

        } else if (vma->vm_pgoff == 1 && dev->dev->caps.bf_reg_size != 0) {
-                /* FIXME want pgprot_writecombine() for BlueFlame pages */
+#if !defined(__x86_64__)
                vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+#endif
                if (io_remap_pfn_range(vma, vma->vm_start,
                                       to_mucontext(context)->uar.pfn +
                                       dev->dev->caps.num_uars,
                                       PAGE_SIZE, vma->vm_page_prot))
                        return -EAGAIN;
        } else

This does not look too bad, does it?
All we need to do is require libmlx4 to always use movnti on x86_64.
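
As a rough sketch of what "always use movnti" could look like on the
library side: the helper below is hypothetical (it is not libmlx4 code)
and copies 64-bit words with the SSE2 _mm_stream_si64 intrinsic, which
compiles to movnti on x86_64, followed by an sfence. On other
architectures it falls back to plain stores. Whether movnti to a WB or
WT mapping of device memory really yields write combining is the claim
under discussion in this thread, not something the sketch demonstrates;
here the destination is ordinary memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64 (movnti), SSE2 */
#include <xmmintrin.h>	/* _mm_sfence */
#endif

/*
 * Hypothetical helper: copy nwords 64-bit words from src to dst.
 * On x86_64 every store is a non-temporal movnti; elsewhere the
 * function degrades to ordinary stores.
 */
static void copy_words_nt(uint64_t *dst, const uint64_t *src, size_t nwords)
{
	size_t i;

	assert(dst != NULL && src != NULL);

	for (i = 0; i < nwords; i++) {
#if defined(__x86_64__)
		_mm_stream_si64((long long *)&dst[i], (long long)src[i]);
#else
		dst[i] = src[i];
#endif
	}
#if defined(__x86_64__)
	/* Make the non-temporal stores globally visible before later stores. */
	_mm_sfence();
#endif
}
```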

-- 
MST


From mst at dev.mellanox.co.il  Thu May 17 01:10:51 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 May 2007 11:10:51 +0300
Subject: [ofa-general] [PATCH] IB/mthca: fix use-after-free
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C90182527E@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C90182527E@mtlexch01.mtl.com>
Message-ID: <20070517081050.GC4205@mellanox.co.il>

From: Ali Ayoub <ali at mellanox.co.il>

Fix use-after-free on hardware restart.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

--- ./drivers/infiniband/hw/mthca/mthca_main.c.orig	2007-05-17 11:01:28.000000000 +0300
+++ ./drivers/infiniband/hw/mthca/mthca_main.c	2007-05-17 11:02:36.000000000 +0300
@@ -1250,12 +1250,14 @@
 int __mthca_restart_one(struct pci_dev *pdev)
 {
 	struct mthca_dev *mdev;
+	int hca_type;
 
 	mdev = pci_get_drvdata(pdev);
 	if (!mdev)
 		return -ENODEV;
+	hca_type = mdev->hca_type;
 	__mthca_remove_one(pdev);
-	return __mthca_init_one(pdev, mdev->hca_type);
+	return __mthca_init_one(pdev, hca_type);
 }
 
 static int __devinit mthca_init_one(struct pci_dev *pdev,
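
The pattern behind this fix, sketched in user-space C with hypothetical
stand-in names (fake_dev, fake_remove_one and so on are not driver
functions): the value needed after teardown is copied out while the
object is still valid, so nothing reads freed memory afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mthca_dev; only the field we need. */
struct fake_dev {
	int hca_type;
};

/* Stand-in for the pointer returned by pci_get_drvdata(). */
static struct fake_dev *current_dev;

/* Frees the device object, like the remove path does. */
static void fake_remove_one(void)
{
	free(current_dev);
	current_dev = NULL;
}

/* Allocates a fresh device object of the given type. */
static int fake_init_one(int hca_type)
{
	assert(current_dev == NULL);

	current_dev = malloc(sizeof(*current_dev));
	if (!current_dev)
		return -ENOMEM;
	current_dev->hca_type = hca_type;
	return 0;
}

static int fake_restart_one(void)
{
	struct fake_dev *dev = current_dev;
	int hca_type;

	if (!dev)
		return -ENODEV;

	hca_type = dev->hca_type;	/* copy while dev is still valid */
	fake_remove_one();		/* dev is dangling from here on */
	return fake_init_one(hca_type);	/* uses the saved copy only */
}
```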


-- 
Michael S. Tsirkin - Staff Engineer, Mellanox Technologies Ltd.
Eternity is a very long time, especially towards the end.


From mst at dev.mellanox.co.il  Thu May 17 01:26:03 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 May 2007 11:26:03 +0300
Subject: [ofa-general] Re: Re: OFED HA related question
In-Reply-To: <20070517062637.GL6273@minantech.com>
References: <1179145102.25749.11.camel@mtls03> <adabqgmtb6e.fsf@cisco.com>
	<1179242042.25749.33.camel@mtls03> <ada7irat6xw.fsf@cisco.com>
	<349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net>
	<adairasms6f.fsf@cisco.com>
	<349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net>
	<adak5v8kxzy.fsf@cisco.com> <20070517062637.GL6273@minantech.com>
Message-ID: <20070517082603.GE4205@mellanox.co.il>

> Quoting Gleb Natapov <glebn at voltaire.com>:
> Subject: Re: Re: OFED HA related question
> 
> On Wed, May 16, 2007 at 07:03:13PM -0700, Roland Dreier wrote:
> >  > Also does ibv_dereg_mr() work when fatal error occurs ?
> > 
> > It will probably fail but you should try to destroy all your resources
> > I guess.
> > 
> This is very good question. Application should be able to unpin memory
> even if HCA is completely dead.

This is only safe after you reset the HCA, otherwise it might be
writing over this memory.

-- 
MST


From glebn at voltaire.com  Thu May 17 01:30:44 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Thu, 17 May 2007 11:30:44 +0300
Subject: [ofa-general] Re: Re: OFED HA related question
In-Reply-To: <20070517082603.GE4205@mellanox.co.il>
References: <1179145102.25749.11.camel@mtls03> <adabqgmtb6e.fsf@cisco.com>
	<1179242042.25749.33.camel@mtls03> <ada7irat6xw.fsf@cisco.com>
	<349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net>
	<adairasms6f.fsf@cisco.com>
	<349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net>
	<adak5v8kxzy.fsf@cisco.com> <20070517062637.GL6273@minantech.com>
	<20070517082603.GE4205@mellanox.co.il>
Message-ID: <20070517083044.GM6273@minantech.com>

On Thu, May 17, 2007 at 11:26:03AM +0300, Michael S. Tsirkin wrote:
> > Quoting Gleb Natapov <glebn at voltaire.com>:
> > Subject: Re: Re: OFED HA related question
> > 
> > On Wed, May 16, 2007 at 07:03:13PM -0700, Roland Dreier wrote:
> > >  > Also does ibv_dereg_mr() work when fatal error occurs ?
> > > 
> > > It will probably fail but you should try to destroy all your resources
> > > I guess.
> > > 
> > This is very good question. Application should be able to unpin memory
> > even if HCA is completely dead.
> 
> This is only safe after you reset the HCA, otherwise it might be
> writing over this memory.
> 
Right. Good point. So to recover memory after an HCA failure event we need
to reset the HCA and only after that unpin the memory. What does the current
mthca driver do if it cannot unregister memory from the HCA? Does it proceed
to unpin it?

--
			Gleb.


From vlad at lists.openfabrics.org  Thu May 17 02:40:42 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu, 17 May 2007 02:40:42 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070517-0200 daily build status
Message-ID: <20070517094042.E9717E60835@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From halr at voltaire.com  Thu May 17 03:42:16 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 May 2007 06:42:16 -0400
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<000101c7963a$3474ae00$49c9180a@amr.corp.intel.com>
	<309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com>
	<464B5C07.8040601@ichips.intel.com>
	<309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com>
Message-ID: <1179398534.23882.67542.camel@hal.voltaire.com>

On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote:
> On 5/17/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > > But initially this will generate a packet for each path, while sys
> > > admin knows that path is there and he can hard-code the entries for
> > > it. Other thing is that why Admin will care about creating such record
> > > while SA is itself taking care, right?
> >
> > In your original message you asked about adding 'dummy entries' to the
> > cache.  I agree that pre-loading the cache can be useful.  What I still
> > am not understanding is the reasoning for adding 'dummy entries'.  By
> > 'dummy entries', I've been assuming that these are invalid path records,
> > but maybe that's not what you meant.
> OK, if the term "dummy entries" has created confusion then I am
> sorry for that. By it I mean valid path
> records which the administrator knows in advance, and while loading the
> module,

How does the admin know they are valid ? Are they somehow preconfigured
at the SM ? Doesn't each SM have its own policy for generating valid PRs
? Or are these from a live SM and just loaded "out of band" to
bypass/preclude the SA PR mechanism ?

-- Hal

>  Admin is loading this info in the cache with user command.
> >
> > > Another point I want to know is,
> > > When local_sa_cache module will be inserted? After SM comes up or
> > > Before SM comes up?
> >
> > It can occur either way.  There is no restriction.  The cache responds
> > to port up and GID in/out of service events to update itself.
> Do you mean cache module will start building cache only after Port is UP?
> >
> > > If it is inserted before the SM comes up (I am assuming the SM is running
> > > on some node, not on a switch) then the first forced schedule_update() is
> > > wasted, and for the first application the presence of the cache is
> > > meaningless. Why not keep the cache effective right from the start?
> >
> > Pre-loading the cache with path records doesn't guarantee that those
> > paths are usable.  If the SM has not come up, then the path records will
> > be unusable until the SM configures the subnet, plus there's no
> > guarantee that the remote endpoints specified by the paths are running.
> You mean that even if the SM is up and we have some hard-coded
> path record entries corresponding to some node X, there is no guarantee
> that node X has actually come up? In that case path resolution
> should fail if the node has not come up, but with the hard-coded entries
> the path will still be resolved?
> >
> > The main benefit I see to pre-loading the cache is to avoid SA storms
> > when booting a large cluster.
> that's true. Also cache will get valid entries only if network is
> configured by SM otherwise every node SA will, possibly, drop SA
> packets.
> >
> > - Sean
> >
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From halr at voltaire.com  Thu May 17 03:46:47 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 May 2007 06:46:47 -0400
Subject: [ofa-general] problem with loading IB modules in a IB node
	with OFED.
In-Reply-To: <829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com>
References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com>
	<200705161646.49910.minich@ornl.gov>
	<829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com>
Message-ID: <1179398805.23882.67884.camel@hal.voltaire.com>

On Thu, 2007-05-17 at 00:57, Keshetti Mahesh wrote:
> > uname -r
> 2.6.9-34.ELsmp
> 
> > modinfo madeye.ko | grep vermagic
> vermagic:       2.6.9-34.ELsmp SMP gcc-3.4
> 
> > Also, you might want to check dmesg.
> 
> dmesg output:
> madeye: disagrees about version of symbol ib_unregister_client
> madeye: Unknown symbol ib_unregister_client
> madeye: disagrees about version of symbol ib_register_mad_snoop
> madeye: Unknown symbol ib_register_mad_snoop
> madeye: disagrees about version of symbol ib_register_client
> madeye: Unknown symbol ib_register_client
> madeye: disagrees about version of symbol ib_unregister_mad_agent
> madeye: Unknown symbol ib_unregister_mad_agent
> madeye: disagrees about version of symbol ib_set_client_data
> madeye: Unknown symbol ib_set_client_data
> madeye: disagrees about version of symbol ib_get_client_data
> madeye: Unknown symbol ib_get_client_data
> 
> I think the problem is with the IB headers with which it is being
> compiled. In my case I am
> compiling the 'madeye' module with the IB headers available in the
> 2.6.9-34.ELsmp source
> code, whereas the IB verbs are compiled against the sources from
> OFED-1.0.

I didn't think that OFED 1.0 included madeye. Where did madeye come from
? It was first made part of OFED at 1.1.

-- Hal

>  But I don't
> know how to compile my module with the OFED-1.0 headers(because the
> headers are not
> available after the compilation).


From halr at voltaire.com  Thu May 17 03:51:15 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 May 2007 06:51:15 -0400
Subject: [ofa-general] Running multiple SM
In-Reply-To: <532b813a0705162118t1ad20d8al65f940116971e50e@mail.gmail.com>
References: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com>
	<1179357718.23882.23845.camel@hal.voltaire.com>
	<532b813a0705161842q59038b1dx69cfa642d989789@mail.gmail.com>
	<1179367045.23882.33850.camel@hal.voltaire.com>
	<532b813a0705162118t1ad20d8al65f940116971e50e@mail.gmail.com>
Message-ID: <1179399074.23882.68142.camel@hal.voltaire.com>

On Thu, 2007-05-17 at 00:18, Ganesh Sadasivan wrote:
> The reason is:
> Jan 01 01:46:17 321555 [58F3E280] -> osm_vendor_set_sm: ERR 5431:
> setting IS_SM capability mask failed; errno 2

Yes, this makes sense now and explains what you are seeing.

> From the code it looks like  /dev/infiniband/issm<umad_port> needs to
> be created and I did that.

This should be done via udev rather than manually. Do you have udev
setup ? If not, please follow the instructions on the wiki.
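
As an aside, the PortInfo dumps quoted further down are consistent with
the failed IS_SM setting: capability_mask is 0x2510A68, and the IsSM bit
(bit 1 of PortInfo:CapabilityMask) is clear in that value. A small
sketch of the check follows; the macro and function names are
hypothetical, and the mask is assumed to be in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* PortInfo:CapabilityMask bits, host byte order (names are hypothetical). */
#define CAP_IS_SM		(1u << 1)
#define CAP_HAS_TRAP		(1u << 3)
#define CAP_HAS_CLIENT_REREG	(1u << 25)

/* Nonzero if the port advertises that an SM is running on it. */
static int port_is_sm(uint32_t cap_mask)
{
	return (cap_mask & CAP_IS_SM) != 0;
}

/* Number of capability bits set in the mask. */
static unsigned int cap_count(uint32_t cap_mask)
{
	unsigned int n = 0;

	for (; cap_mask; cap_mask &= cap_mask - 1)
		n++;
	return n;
}

/* Checks the value reported in the quoted dumps. */
static void check_dump_mask(void)
{
	const uint32_t mask = 0x2510A68;	/* capability_mask from the dump */

	assert(!port_is_sm(mask));		/* IsSM is not set */
	assert(mask & CAP_HAS_TRAP);
	assert(mask & CAP_HAS_CLIENT_REREG);
	assert(cap_count(mask) == 9);		/* nine capabilities listed */
}
```

The nine set bits match the nine IB_PORT_CAP_* names printed under
"Capabilities Mask", none of which is an IS_SM entry.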

-- Hal

>  But still the SM with the higher GUID seems to become the master whenever
> it does a sweep. The logs are too detailed, so I am sending snippets. 
> 
> Local port (with a high GUID)
> Jan 01 02:49:56 332142 [5873E280] -> osm_pi_rcv_process: Discovered
> port num 0x1 with GUID = 0x2c901097682d1 for parent node GUID =
> 0x2c901097682d0, TID = 0x1236 
> Jan 01 02:49:56 332197 [5873E280] -> PortInfo dump:
>                                 port number.............0x1
>                                 node_guid...............0x0002c901097682d0
>                                 port_guid...............0x0002c901097682d1
>                                 m_key...................0x0000000000000000
>                                 subnet_prefix...........0xfe80000000000000
>                                 base_lid................0x1
>                                 master_sm_base_lid......0x2
>                                 capability_mask.........0x2510A68 
>                                 diag_code...............0x0
>                                 m_key_lease_period......0x0
>                                 local_port_num..........0x1
>                                 link_width_enabled......0x3 
>                                 link_width_supported....0x3
>                                 link_width_active.......0x2
>                                 link_speed_supported....0x1
>                                 port_state..............ACTIVE 
>                                 state_info2.............0x52
>                                 m_key_protect_bits......0x0
>                                 lmc.....................0x0
>                                 link_speed..............0x11 
>                                 mtu_smsl................0x40
>                                 vl_cap_init_type........0x40
>                                 vl_high_limit...........0x0
>                                 vl_arb_high_cap.........0x8 
>                                 vl_arb_low_cap..........0x8
>                                 init_rep_mtu_cap........0x4
>                                 vl_stall_life...........0xFF
>                                 vl_enforce..............0x40 
>                                 m_key_violations........0x0
>                                 p_key_violations........0x0
>                                 q_key_violations........0x0
>                                 guid_cap................0x20 
>                                 client_reregister.......0x0
>                                 subnet_timeout..........0x12
>                                 resp_time_value.........0x10
>                                 error_threshold.........0x88 
> Jan 01 02:49:56 332337 [5873E280] -> Capabilities Mask:
>                                 IB_PORT_CAP_HAS_TRAP
>                                 IB_PORT_CAP_HAS_AUTO_MIG
>                                 IB_PORT_CAP_HAS_SL_MAP 
>                                 IB_PORT_CAP_HAS_LED_INFO
>                                 IB_PORT_CAP_HAS_SYS_IMG_GUID
>                                 IB_PORT_CAP_HAS_COM_MGT
>                                 IB_PORT_CAP_HAS_VEND_CLS 
>                                 IB_PORT_CAP_HAS_CAP_NTC
>                                 IB_PORT_CAP_HAS_CLIENT_REREG
> 
> Remote Port which hosts the SM:
> Jan 01 02:49:56 500638 [5AF3E280] -> osm_pi_rcv_process: Discovered
> port num 0x1 with GUID = 0x2c90109765da1 for parent node GUID =
> 0x2c90109765da0, TID = 0x123b 
> Jan 01 02:49:56 500690 [5AF3E280] -> PortInfo dump:
>                                 port number.............0x1
>                                 node_guid...............0x0002c90109765da0
>                                 port_guid...............0x0002c90109765da1
>                                 m_key...................0x0000000000000000
>                                 subnet_prefix...........0xfe80000000000000
>                                 base_lid................0x2
>                                 master_sm_base_lid......0x2
>                                 capability_mask.........0x2510A68 
>                                 diag_code...............0x0
>                                 m_key_lease_period......0x0
>                                 local_port_num..........0x1
>                                 link_width_enabled......0x3 
>                                 link_width_supported....0x3
>                                 link_width_active.......0x2
>                                 link_speed_supported....0x1
>                                 port_state..............ACTIVE 
>                                 state_info2.............0x52
>                                 m_key_protect_bits......0x0
>                                 lmc.....................0x0
>                                 link_speed..............0x11 
>                                 mtu_smsl................0x40
>                                 vl_cap_init_type........0x40
>                                 vl_high_limit...........0x0
>                                 vl_arb_high_cap.........0x8 
>                                 vl_arb_low_cap..........0x8
>                                 init_rep_mtu_cap........0x4
>                                 vl_stall_life...........0xFF
>                                 vl_enforce..............0x40 
>                                 m_key_violations........0x0
>                                 p_key_violations........0x0
>                                 q_key_violations........0x0
>                                 guid_cap................0x20 
>                                 client_reregister.......0x0
>                                 subnet_timeout..........0x12
>                                 resp_time_value.........0x10
>                                 error_threshold.........0x88 
> Jan 01 02:49:56 500831 [5AF3E280] -> Capabilities Mask:
>                                 IB_PORT_CAP_HAS_TRAP
>                                 IB_PORT_CAP_HAS_AUTO_MIG
>                                 IB_PORT_CAP_HAS_SL_MAP 
>                                 IB_PORT_CAP_HAS_LED_INFO
>                                 IB_PORT_CAP_HAS_SYS_IMG_GUID
>                                 IB_PORT_CAP_HAS_COM_MGT
>                                 IB_PORT_CAP_HAS_VEND_CLS 
>                                 IB_PORT_CAP_HAS_CAP_NTC
>                                 IB_PORT_CAP_HAS_CLIENT_REREG
> 
> Please let me know if I should look at some specific portion.
> 
> Thanks
> Ganesh
> 
> 
> 
> On 16 May 2007 21:57:27 -0400, Hal Rosenstock <halr at voltaire.com>
> wrote:
>         Hi again Ganesh,
>         
>         On Wed, 2007-05-16 at 21:42, Ganesh Sadasivan wrote:
>         > Hi Hal,
>         >
>         >  Please see inline.
>         >
>         > On 16 May 2007 19:22:00 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
>         >         Hi Ganesh,
>         >
>         >         On Wed, 2007-05-16 at 19:00, Ganesh Sadasivan wrote:
>         >         > Hi,
>         >         >
>         >         >    I have a setup with 2 HCAs connected back to back and am running
>         >         > opensm (ofed1.1, running at the same priority) on both of them. Is
>         >         > there any utility to see who is the master?
>         >
>         > Even with priority differences I am seeing the same behavior. Am I
>         > missing any option? I am setting "opensm -s 30" and "opensm -s 60" on
>         > the respective sides.
>         
>         Why not use the default (10 secs) or at least the same on both
>         sides ?
>         
>         >         sminfo will show the SM state for a LID/GUID.
>         >
>         >
>         > Thanks.
>         >
>         >         >   The smlid in ibv_devinfo seems to be changing whenever an SM does a
>         >         > sweep. Is this expected?
>         >
>         >         Nope. If they are both at the same priority, the lower GUID should win
>         >         the SM election.
>         >
>         >         Not sure what is going wrong in your (back to back HCA)
>         >         subnet. Do your ports stay active ?
>         >
>         >
>         > Yes both ports are active.
>         
>         And they stay active (no LED color changes) ?
>         
>         If not, can you run both OpenSMs in verbose mode (-V) and see if there
>         is anything interesting/relevant in the logs ?
>         
>         -- Hal
>         
>         > Thanks
>         > Ganesh
>         >
>         >         -- Hal
>         >
>         >         > Thanks
>         >         > Ganesh
>         >         >
>         >         > 
>         >
>         >
>         
> 
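
For reference, the election rule quoted above (higher priority wins; at
equal priority the lower GUID wins) can be written as a two-way
comparison. This is a minimal sketch with hypothetical names, not OpenSM
code, and it ignores SM state and handover details.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one SM candidate. */
struct sm_candidate {
	uint8_t  priority;	/* SM priority, higher is preferred */
	uint64_t port_guid;	/* port GUID, lower wins a priority tie */
};

/* Returns whichever of the two candidates should become master. */
static const struct sm_candidate *
sm_elect(const struct sm_candidate *a, const struct sm_candidate *b)
{
	assert(a != NULL && b != NULL);

	if (a->priority != b->priority)
		return a->priority > b->priority ? a : b;
	return a->port_guid < b->port_guid ? a : b;
}
```

With the two port GUIDs from the quoted dumps, 0x0002c901097682d1 and
0x0002c90109765da1, and equal priorities, the second (lower) GUID is the
one this rule selects.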


From keshetti.mahesh at gmail.com  Thu May 17 04:46:03 2007
From: keshetti.mahesh at gmail.com (Keshetti Mahesh)
Date: Thu, 17 May 2007 17:16:03 +0530
Subject: [ofa-general] problem with loading IB modules in a IB node with
	OFED.
In-Reply-To: <1179400360.23882.69514.camel@hal.voltaire.com>
References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com>
	<200705161646.49910.minich@ornl.gov>
	<829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com>
	<1179398805.23882.67884.camel@hal.voltaire.com>
	<829ded920705170353n175e46d5x1dbeebd9db0006a1@mail.gmail.com>
	<1179400360.23882.69514.camel@hal.voltaire.com>
Message-ID: <829ded920705170446y22509c22m6cd678adabad77a4@mail.gmail.com>

> But...
>
> You need to find the proper header for ib_verbs.h as those mismatched
> symbols are there. I'm sure it's on your machine somewhere (under
> something like /usr/local/ofed). Also, it needs to be added into
> Kconfig. There is a backport patch for the other pieces to build this
> that went out on the list quite a while ago.
>

I have checked the /usr/local/ofed/ directory after the OFED-1.0
installation. All it has is the following entries.

'backup'     -> which has the backup of previous IB modules.
'etc'        -> configuration file
'lib'        -> user space libraries
'lib64'      -> 64 bit user space libraries
uninstall.sh -> uninstallation script

There are no IB headers in that directory as you have said.

-- 
Thanks and regards,
Mahesh.


From halr at voltaire.com  Thu May 17 05:25:58 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 May 2007 08:25:58 -0400
Subject: [ofa-general] problem with loading IB modules in a IB node
	with OFED.
In-Reply-To: <829ded920705170446y22509c22m6cd678adabad77a4@mail.gmail.com>
References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com>
	<200705161646.49910.minich@ornl.gov>
	<829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com>
	<1179398805.23882.67884.camel@hal.voltaire.com>
	<829ded920705170353n175e46d5x1dbeebd9db0006a1@mail.gmail.com>
	<1179400360.23882.69514.camel@hal.voltaire.com>
	<829ded920705170446y22509c22m6cd678adabad77a4@mail.gmail.com>
Message-ID: <1179404751.23882.74229.camel@hal.voltaire.com>

On Thu, 2007-05-17 at 07:46, Keshetti Mahesh wrote:
> > But...
> >
> > You need to find the proper header for ib_verbs.h as those mismatched
> > symbols are there. I'm sure it's on your machine somewhere (under
> > something like /usr/local/ofed). Also, it needs to be added into
> > Kconfig. There is a backport patch for the other pieces to build this
> > that went out on the list quite a while ago.
> >
> 
> I have checked the /usr/local/ofed/ directory after the OFED-1.0
> installation. All it has is the following entries.
> 
> 'backup'     -> which has the backup of previous IB modules.
> 'etc'        -> configuration file
> 'lib'        -> user space libraries
> 'lib64'      -> 64 bit user space libraries
> uninstall.sh -> uninstallation script
> 
> There are no IB headers in that directory as you have said.

Is there no include/infiniband dir there ?

-- Hal


From devesh28 at gmail.com  Thu May 17 05:28:45 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Thu, 17 May 2007 17:58:45 +0530
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <1179398534.23882.67542.camel@hal.voltaire.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<000101c7963a$3474ae00$49c9180a@amr.corp.intel.com>
	<309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com>
	<464B5C07.8040601@ichips.intel.com>
	<309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com>
	<1179398534.23882.67542.camel@hal.voltaire.com>
Message-ID: <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com>

On 17 May 2007 06:42:16 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote:
> > On 5/17/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > > > But initially this will generate a packet for each path, while the sys
> > > > admin knows that the path is there and can hard-code the entries for
> > > > it. The other thing is, why would the admin care about creating such a
> > > > record while the SA itself is taking care of it, right?
> > >
> > > In your original message you asked about adding 'dummy entries' to the
> > > cache.  I agree that pre-loading the cache can be useful.  What I still
> > > am not understanding is the reasoning for adding 'dummy entries'.  By
> > > 'dummy entries', I've been assuming that these are invalid path records,
> > > but maybe that's not what you meant.
> > > OK, if the term "dummy entries" has caused confusion then I am
> > > sorry for that. What I mean is that these are valid path
> > > records which the administrator knows in advance, and while loading the
> > > module,
>
> How does the admin know they are valid ?
Depending on the initial application runs, some trusted PRs can be generated.
> Are they somehow preconfigured at the SM ?
I am not sure whether the SM has any such provision. I am also not sure
about the role of the SM in path resolution. I mean, once a node has
initiated an SA query, does the SM have some database from which the SA
replies, or is the destination node contacted on the fly to get the
requested path record?
> Doesn't each SM have its own policy for generating valid PRs ?
Ultimately a path record is in Path_Record object format, and the SA cache
is going to store it in a fixed manner, so how does the generation policy
matter? CMIIW. Also, I am assuming a homogeneous cluster where certain
parameters can be assumed to always be the same.
> are these from a live SM and just loaded "out of band" to
> bypass/preclude the SA PR mechanism ?
Maybe.
>
> -- Hal
>
> >  Admin is loading this info in the cache with user command.
> > >
> > > > Another point I want to know is,
> > > > When local_sa_cache module will be inserted? After SM comes up or
> > > > Before SM comes up?
> > >
> > > It can occur either way.  There is no restriction.  The cache responds
> > > to port up and GID in/out of service events to update itself.
> > Do you mean cache module will start building cache only after Port is UP?
> > >
> > > > If it's inserted before the SM comes up (I am assuming the SM is running
> > > > on some node, not on a switch) then the first forced schedule_update() is
> > > > wasted, and for the first application the presence of the cache is
> > > > meaningless. Why not keep the cache effective right from the start?
> > >
> > > Pre-loading the cache with path records doesn't guarantee that those
> > > paths are usable.  If the SM has not come up, then the path records will
> > > be unusable until the SM configures the subnet, plus there's no
> > > guarantee that the remote endpoints specified by the paths are running.
> > You mean there is no guarantee that even if SM is UP and we have some
> > hard coded entries of path record corresponding to some node X, we are
> > not sure that node X has actually come up or not?  In that case
> > actually that path resolving should fail if node has not come up, but
> > with the hard coding still path will be resolved?
> > >
> > > The main benefit I see to pre-loading the cache is to avoid SA storms
> > > when booting a large cluster.
> > that's true. Also cache will get valid entries only if network is
> > configured by SM otherwise every node SA will, possibly, drop SA
> > packets.
> > >
> > > - Sean
> > >
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>
>


From keshetti.mahesh at gmail.com  Thu May 17 05:54:52 2007
From: keshetti.mahesh at gmail.com (Keshetti Mahesh)
Date: Thu, 17 May 2007 18:24:52 +0530
Subject: [ofa-general] problem with loading IB modules in a IB node with
	OFED.
In-Reply-To: <1179404751.23882.74229.camel@hal.voltaire.com>
References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com>
	<200705161646.49910.minich@ornl.gov>
	<829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com>
	<1179398805.23882.67884.camel@hal.voltaire.com>
	<829ded920705170353n175e46d5x1dbeebd9db0006a1@mail.gmail.com>
	<1179400360.23882.69514.camel@hal.voltaire.com>
	<829ded920705170446y22509c22m6cd678adabad77a4@mail.gmail.com>
	<1179404751.23882.74229.camel@hal.voltaire.com>
Message-ID: <829ded920705170554u240cb1c8x33b03cdefb79e282@mail.gmail.com>

> Is there no include/infiniband dir there ?

Yes, there is an include/infiniband directory, but it's not the kernel
headers directory.

The contents of that directory are

[root at infini00 ofed]# ls /usr/local/ofed/include/infiniband/
arch.h  cm_abi.h  cm.h  driver.h  kern-abi.h  marshall.h  opcode.h
sa.h  sa-kern-abi.h  verbs.h

They are user space headers.

-- 
Thanks and regards,
Mahesh.


From halr at voltaire.com  Thu May 17 05:59:09 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 May 2007 08:59:09 -0400
Subject: [ofa-general] problem with loading IB modules in a IB node
	with OFED.
In-Reply-To: <829ded920705170554u240cb1c8x33b03cdefb79e282@mail.gmail.com>
References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com>
	<200705161646.49910.minich@ornl.gov>
	<829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com>
	<1179398805.23882.67884.camel@hal.voltaire.com>
	<829ded920705170353n175e46d5x1dbeebd9db0006a1@mail.gmail.com>
	<1179400360.23882.69514.camel@hal.voltaire.com>
	<829ded920705170446y22509c22m6cd678adabad77a4@mail.gmail.com>
	<1179404751.23882.74229.camel@hal.voltaire.com>
	<829ded920705170554u240cb1c8x33b03cdefb79e282@mail.gmail.com>
Message-ID: <1179406748.23882.76340.camel@hal.voltaire.com>

On Thu, 2007-05-17 at 08:54, Keshetti Mahesh wrote:
> > Is there no include/infiniband dir there ?
> 
> Yes, there is an include/infiniband directory, but it's not the kernel
> headers directory.
> 
> The contents of that directory are
> 
> [root at infini00 ofed]# ls /usr/local/ofed/include/infiniband/
> arch.h  cm_abi.h  cm.h  driver.h  kern-abi.h  marshall.h  opcode.h
> sa.h  sa-kern-abi.h  verbs.h
> 
> They are user space headers.

Right, sorry.

Is there some ofed src directory containing the kernel sources ? I
forget how this part works.

-- Hal


From keshetti.mahesh at gmail.com  Thu May 17 06:11:10 2007
From: keshetti.mahesh at gmail.com (Keshetti Mahesh)
Date: Thu, 17 May 2007 18:41:10 +0530
Subject: [ofa-general] problem with loading IB modules in a IB node with
	OFED.
In-Reply-To: <1179406748.23882.76340.camel@hal.voltaire.com>
References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com>
	<200705161646.49910.minich@ornl.gov>
	<829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com>
	<1179398805.23882.67884.camel@hal.voltaire.com>
	<829ded920705170353n175e46d5x1dbeebd9db0006a1@mail.gmail.com>
	<1179400360.23882.69514.camel@hal.voltaire.com>
	<829ded920705170446y22509c22m6cd678adabad77a4@mail.gmail.com>
	<1179404751.23882.74229.camel@hal.voltaire.com>
	<829ded920705170554u240cb1c8x33b03cdefb79e282@mail.gmail.com>
	<1179406748.23882.76340.camel@hal.voltaire.com>
Message-ID: <829ded920705170611o1925f837u74b495545bbce6f3@mail.gmail.com>

> Right, sorry.
>
> Is there some ofed src directory containing the kernel sources ? I
> forget how this part works.

Yes, the OFED source directory containing the kernel sources is present
inside the OFED package. Even if I use the headers from that package, I
get the same errors.

But if I add my module to the OFED tree, it works fine.

-- 
Thanks and regards,
Mahesh.


From changquing.tang at hp.com  Thu May 17 07:50:37 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Thu, 17 May 2007 14:50:37 -0000
Subject: [ofa-general] Re: Re: OFED HA related question
In-Reply-To: <20070517083044.GM6273@minantech.com>
References: <1179145102.25749.11.camel@mtls03>
	<adabqgmtb6e.fsf@cisco.com><1179242042.25749.33.camel@mtls03>
	<ada7irat6xw.fsf@cisco.com><349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net><adairasms6f.fsf@cisco.com><349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net><adak5v8kxzy.fsf@cisco.com>
	<20070517062637.GL6273@minantech.com><20070517082603.GE4205@mellanox.co.il>
	<20070517083044.GM6273@minantech.com>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403015AD534@G3W0634.americas.hpqcorp.net>

 
> -----Original Message-----
> > > On Wed, May 16, 2007 at 07:03:13PM -0700, Roland Dreier wrote:
> > > >  > Also does ibv_dereg_mr() work when fatal error occurs ?
> > > > 
> > > > It will probably fail but you should try to destroy all your 
> > > > resources I guess.
> > > > 
> > > This is very good question. Application should be able to unpin 
> > > memory even if HCA is completely dead.
> > 
> > This is only safe after you reset the HCA, otherwise it might be 
> > writing over this memory.
> > 
> Right. Good point. So to recover memory after an HCA failure 
> event we need to reset the HCA and only after that unpin memory. 
> What does the current mthca driver do in case it cannot unregister 
> memory from the HCA? Does it proceed to unpin it?

Sorry, one more question: how does an application reset the HCA?

--CQ


> 
> --
> 			Gleb.


From mhagen at iol.unh.edu  Thu May 17 07:59:40 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Thu, 17 May 2007 10:59:40 -0400 (EDT)
Subject: [ofa-general] [PATCH] libibverbs: add userspace support for
	invalidate stag
Message-ID: <39754.132.177.125.178.1179413980.squirrel@postal.iol.unh.edu>

Modification to userspace verbs to provide support for the iWARP Verbs
SEND with INV and SEND with SE and INV.

Signed-off-by: Mikkel Hagen <mhagen at iol.unh.edu>

--- libibverbs/include/infiniband/verbs.h	2007-05-03 10:11:23.000000000 -0400
+++ libibverbs/include/infiniband/verbs.h	2007-05-03 10:12:32.000000000 -0400
@@ -492,7 +492,8 @@ enum ibv_send_flags {
 	IBV_SEND_FENCE		= 1 << 0,
 	IBV_SEND_SIGNALED	= 1 << 1,
 	IBV_SEND_SOLICITED	= 1 << 2,
-	IBV_SEND_INLINE		= 1 << 3
+	IBV_SEND_INLINE		= 1 << 3,
+	IBV_SEND_INVALIDATE	= 1 << 4
 };

 struct ibv_sge {
@@ -525,6 +526,9 @@ struct ibv_send_wr {
 			uint32_t	remote_qpn;
 			uint32_t	remote_qkey;
 		} ud;
+		struct {
+			uint32_t	rkey;
+		} invalidate;
 	} wr;
 };

--- libibverbs/include/infiniband/kern-abi.h	2007-05-03 10:36:13.000000000 -0400
+++ libibverbs/include/infiniband/kern-abi.h	2007-05-03 10:37:39.000000000 -0400
@@ -592,6 +592,10 @@ struct ibv_kern_send_wr {
 			__u32 remote_qkey;
 			__u32 reserved;
 		} ud;
+		struct {
+			__u32 rkey;
+			__u32 reserved;
+		} invalidate;
 	} wr;
 };

--- libibverbs/src/cmd.c	2007-05-02 05:00:25.000000000 -0400
+++ libibverbs/src/cmd.c	2007-05-04 15:19:36.000000000 -0400
@@ -857,6 +857,11 @@ int ibv_cmd_post_send(struct ibv_qp *ibq
 				tmp->wr.atomic.swap = i->wr.atomic.swap;
 				tmp->wr.atomic.rkey = i->wr.atomic.rkey;
 				break;
+			case IBV_WR_SEND:
+				if(tmp->send_flags & IBV_SEND_INVALIDATE) {
+					tmp->wr.invalidate.rkey =
+						i->wr.invalidate.rkey;
+				}
 			default:
 				break;
 			}


-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701  iWARP:1-603-862-5083  Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824


From steve.apo at googlemail.com  Thu May 17 09:44:43 2007
From: steve.apo at googlemail.com (Steven Wooding)
Date: Thu, 17 May 2007 17:44:43 +0100
Subject: [ofa-general] libibcm compatability problem
Message-ID: <2cfcf21e0705170944h4ccddd52udef97b59d32452be@mail.gmail.com>

Hi,

I'm using a 2.6.20.1 kernel with OFED 1.1. I get the following message when
running my application:

"libibcm: Kernel ABi version 5 doesn't match library version 4".

Could someone tell me what version of the library in terms of OFED release I
should be using?

Cheers,

Steve.

From rvm at obsidianresearch.com  Thu May 17 08:45:48 2007
From: rvm at obsidianresearch.com (Rolf Manderscheid)
Date: Thu, 17 May 2007 09:45:48 -0600
Subject: [ofa-general] [PATCH] IB/mthca: initialise GRH:HopLimit when
	building MLX headers
Message-ID: <E1HoiAm-00023A-Ia@ib1.edm.orcorp.ca>

Hi Roland,

Global CM packets used by rdma_cm were being sent with a GRH:hopLimit of zero, causing
them to be dropped by the router.  The problem was a missing initialiser in mthca_read_ah
(called by build_mlx_header).

Signed-off-by: Rolf Manderscheid <rvm at obsidianresearch.com>

---
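
To make the failure mode concrete, here is a small userspace sketch of what
I believe was going on. It is a model only: it assumes the header starts out
zeroed, the struct and function names are invented for illustration, and the
"router" check encodes nothing more than the observation above that packets
with a hop limit of zero were dropped.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; not the mthca or ib_ud_header layouts. */
struct sketch_av {
	uint8_t traffic_class;
	uint8_t hop_limit;
};

struct sketch_grh {
	uint8_t traffic_class;
	uint8_t hop_limit;
};

/*
 * Fill a GRH from an address vector.  copy_hop_limit == 0 models the code
 * before the fix: every other field is copied, hop_limit stays zero.
 */
static void sketch_build_grh(const struct sketch_av *av,
			     struct sketch_grh *grh, int copy_hop_limit)
{
	assert(av != NULL && grh != NULL);

	memset(grh, 0, sizeof *grh);
	grh->traffic_class = av->traffic_class;
	if (copy_hop_limit)
		grh->hop_limit = av->hop_limit;
}

/* Simplified router: a packet with a hop limit of zero is dropped. */
static int sketch_router_forwards(const struct sketch_grh *grh)
{
	return grh->hop_limit != 0;
}
```

With copy_hop_limit set to 0 the header carries a hop limit of zero no matter
what the address handle was created with, which matches the symptom.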

diff --git a/drivers/infiniband/hw/mthca/mthca_av.c b/drivers/infiniband/hw/mthca/mthca_av.c
index 27caf3b..d952891 100644
--- a/drivers/infiniband/hw/mthca/mthca_av.c
+++ b/drivers/infiniband/hw/mthca/mthca_av.c
@@ -279,6 +279,7 @@ int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah,
 			(be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff;
 		header->grh.flow_label    =
 			ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff);
+		header->grh.hop_limit = ah->av->hop_limit;
 		ib_get_cached_gid(&dev->ib_dev,
 				  be32_to_cpu(ah->av->port_pd) >> 24,
 				  ah->av->gid_index % dev->limits.gid_table_len,


From mshefty at ichips.intel.com  Thu May 17 09:53:29 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 17 May 2007 09:53:29 -0700
Subject: [ofa-general] libibcm compatability problem
In-Reply-To: <2cfcf21e0705170944h4ccddd52udef97b59d32452be@mail.gmail.com>
References: <2cfcf21e0705170944h4ccddd52udef97b59d32452be@mail.gmail.com>
Message-ID: <464C8889.5090403@ichips.intel.com>

> I'm using a 2.6.20.1 kernel with OFED 1.1. I get the 
> following message when running my application:
>  
> "libibcm: Kernel ABi version 5 doesn't match library version 4".
>  
> Could someone tell me what version of the library in terms of OFED 
> release I should be using?

I'm not sure if OFED 1.1 supports 2.6.20.  But you can try using the 
latest libibcm from here:

	http://www.openfabrics.org/~shefty/libibcm-1.0.tar.gz

You can also look into using the librdmacm as an alternative, which 
supports other RDMA transports.

- Sean


From rdreier at cisco.com  Thu May 17 10:19:14 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 17 May 2007 10:19:14 -0700
Subject: [ofa-general] ib_find_gid / ib_find_pkey
In-Reply-To: <20070514045832.GA18615@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 14 May 2007 07:58:32 +0300")
References: <20070514045832.GA18615@mellanox.co.il>
Message-ID: <ada8xbnl65p.fsf@cisco.com>

OK, I applied the ib_find_gid / ib_find_pkey stuff with the following
cleanup on top of it... mostly this is me being picky about
indentation etc, but I think I did fix two bugs: a memory leak if
registering with sysfs fails, and a NULL deref in ib_find_gid if index
is NULL (as the comment says it may be).
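
To illustrate the second fix, here is a rough userspace sketch of a lookup
with an optional out-parameter. Everything in it is a stand-in made up for
illustration (the table, its sizes, and the sketch_ names); only the shape of
the loop and the "if (index)" guard correspond to the patch below.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-port GID tables; not the kernel's data structures. */
#define SKETCH_PORTS	2
#define SKETCH_TBL_LEN	3

struct sketch_gid {
	uint8_t raw[16];
};

static struct sketch_gid sketch_table[SKETCH_PORTS][SKETCH_TBL_LEN];

/*
 * Search every port's table for gid.  port_num is mandatory; index is
 * optional and is only written when the caller passes a non-NULL pointer.
 */
static int sketch_find_gid(const struct sketch_gid *gid,
			   uint8_t *port_num, uint16_t *index)
{
	int port, i;

	assert(gid != NULL && port_num != NULL);

	for (port = 0; port < SKETCH_PORTS; ++port) {
		for (i = 0; i < SKETCH_TBL_LEN; ++i) {
			if (!memcmp(&sketch_table[port][i], gid, sizeof *gid)) {
				*port_num = port + 1;	/* ports count from 1 in this sketch */
				if (index)
					*index = i;
				return 0;
			}
		}
	}

	return -ENOENT;
}
```

A caller that only cares which port owns the GID can pass NULL for index and
still get a valid port_num back.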

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 3f2c619..c084495 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -149,13 +149,13 @@ static int alloc_name(char *name)
 	return 0;
 }
 
-static inline int start_port(struct ib_device *device)
+static int start_port(struct ib_device *device)
 {
 	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
 }
 
 
-static inline int end_port(struct ib_device *device)
+static int end_port(struct ib_device *device)
 {
 	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
 		0 : device->phys_port_cnt;
@@ -220,7 +220,6 @@ static int add_client_context(struct ib_device *device, struct ib_client *client
 	return 0;
 }
 
-/* read the lengths of pkey,gid tables on each port */
 static int read_port_table_lengths(struct ib_device *device)
 {
 	struct ib_port_attr *tprops = NULL;
@@ -233,42 +232,33 @@ static int read_port_table_lengths(struct ib_device *device)
 
 	num_ports = end_port(device) - start_port(device) + 1;
 
-	device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len *
-						num_ports, GFP_KERNEL);
-	if (!device->pkey_tbl_len)
-		goto out;
-
-	device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len *
-						num_ports, GFP_KERNEL);
-	if (!device->gid_tbl_len)
-		goto err1;
+	device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len * num_ports,
+				       GFP_KERNEL);
+	device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len * num_ports,
+				      GFP_KERNEL);
+	if (!device->pkey_tbl_len || !device->gid_tbl_len)
+		goto err;
 
 	for (port_index = 0; port_index < num_ports; ++port_index) {
 		ret = ib_query_port(device, port_index + start_port(device),
 					tprops);
 		if (ret)
-			goto err2;
+			goto err;
 		device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len;
-		device->gid_tbl_len[port_index] = tprops->gid_tbl_len;
+		device->gid_tbl_len[port_index]  = tprops->gid_tbl_len;
 	}
 
 	ret = 0;
 	goto out;
-err2:
+
+err:
 	kfree(device->gid_tbl_len);
-err1:
 	kfree(device->pkey_tbl_len);
 out:
 	kfree(tprops);
 	return ret;
 }
 
-static inline void free_port_table_lengths(struct ib_device *device)
-{
-	kfree(device->gid_tbl_len);
-	kfree(device->pkey_tbl_len);
-}
-
 /**
  * ib_register_device - Register an IB device with IB core
  * @device:Device to register
@@ -311,6 +301,8 @@ int ib_register_device(struct ib_device *device)
 	if (ret) {
 		printk(KERN_WARNING "Couldn't register device %s with driver model\n",
 		       device->name);
+		kfree(device->gid_tbl_len);
+		kfree(device->pkey_tbl_len);
 		goto out;
 	}
 
@@ -352,7 +344,8 @@ void ib_unregister_device(struct ib_device *device)
 
 	list_del(&device->core_list);
 
-	free_port_table_lengths(device);
+	kfree(device->gid_tbl_len);
+	kfree(device->pkey_tbl_len);
 
 	mutex_unlock(&device_mutex);
 
@@ -672,28 +665,26 @@ EXPORT_SYMBOL(ib_modify_port);
  *   parameter may be NULL.
  */
 int ib_find_gid(struct ib_device *device, union ib_gid *gid,
-			u8 *port_num, u16 *index)
+		u8 *port_num, u16 *index)
 {
 	union ib_gid tmp_gid;
-	int ret, port, i, tbl_len;
+	int ret, port, i;
 
 	for (port = start_port(device); port <= end_port(device); ++port) {
-		tbl_len = device->gid_tbl_len[port - start_port(device)];
-		for (i = 0; i < tbl_len; ++i) {
+		for (i = 0; i < device->gid_tbl_len[port - start_port(device)]; ++i) {
 			ret = ib_query_gid(device, port, i, &tmp_gid);
 			if (ret)
-				goto out;
+				return ret;
 			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
 				*port_num = port;
-				*index = i;
-				ret = 0;
-				goto out;
+				if (index)
+					*index = i;
+				return 0;
 			}
 		}
 	}
-	ret = -ENOENT;
-out:
-	return ret;
+
+	return -ENOENT;
 }
 EXPORT_SYMBOL(ib_find_gid);
 
@@ -706,27 +697,23 @@ EXPORT_SYMBOL(ib_find_gid);
  * @index: The index into the PKey table where the PKey was found.
  */
 int ib_find_pkey(struct ib_device *device,
-			u8 port_num, u16 pkey, u16 *index)
+		 u8 port_num, u16 pkey, u16 *index)
 {
-	int ret, i, tbl_len;
+	int ret, i;
 	u16 tmp_pkey;
 
-	tbl_len = device->pkey_tbl_len[port_num - start_port(device)];
-	for (i = 0; i < tbl_len; ++i) {
+	for (i = 0; i < device->pkey_tbl_len[port_num - start_port(device)]; ++i) {
 		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
 		if (ret)
-			goto out;
+			return ret;
 
 		if (pkey == tmp_pkey) {
 			*index = i;
-			ret = 0;
-			goto out;
+			return 0;
 		}
 	}
-	ret = -ENOENT;
 
-out:
-	return ret;
+	return -ENOENT;
 }
 EXPORT_SYMBOL(ib_find_pkey);
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index a4ae080..0627a6a 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -890,6 +890,8 @@ struct ib_device {
 	spinlock_t                    client_data_lock;
 
 	struct ib_cache               cache;
+	int                          *pkey_tbl_len;
+	int                          *gid_tbl_len;
 
 	u32                           flags;
 
@@ -1043,8 +1045,6 @@ struct ib_device {
 	__be64			     node_guid;
 	u8                           node_type;
 	u8                           phys_port_cnt;
-	int                          *pkey_tbl_len;
-	int                          *gid_tbl_len;
 };
 
 struct ib_client {
@@ -1120,28 +1120,11 @@ int ib_modify_port(struct ib_device *device,
 		   u8 port_num, int port_modify_mask,
 		   struct ib_port_modify *port_modify);
 
-/**
- * ib_find_gid - Returns the port number and GID table index where
- *   a specified GID value occurs.
- * @device: The device to query.
- * @gid: The GID value to search for.
- * @port_num: The port number of the device where the GID value was found.
- * @index: The index into the GID table where the GID was found.  This
- *   parameter may be NULL.
- */
 int ib_find_gid(struct ib_device *device, union ib_gid *gid,
-			u8 *port_num, u16 *index);
+		u8 *port_num, u16 *index);
 
-/**
- * ib_find_pkey - Returns the PKey table index where a specified
- *   PKey value occurs.
- * @device: The device to query.
- * @port_num: The port number of the device to search for the PKey.
- * @pkey: The PKey value to search for.
- * @index: The index into the PKey table where the PKey was found.
- */
 int ib_find_pkey(struct ib_device *device,
-			u8 port_num, u16 pkey, u16 *index);
+		 u8 port_num, u16 pkey, u16 *index);
 
 /**
  * ib_alloc_pd - Allocates an unused protection domain.


From rdreier at cisco.com  Thu May 17 10:33:51 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 17 May 2007 10:33:51 -0700
Subject: [ofa-general] ib_find_gid / ib_find_pkey
In-Reply-To: <ada8xbnl65p.fsf@cisco.com> (Roland Dreier's message of "Thu,
	17 May 2007 10:19:14 -0700")
References: <20070514045832.GA18615@mellanox.co.il> <ada8xbnl65p.fsf@cisco.com>
Message-ID: <ada4pmbl5hc.fsf@cisco.com>

Also applied P_Key reordering patch too...


From rdreier at cisco.com  Thu May 17 10:41:08 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 17 May 2007 10:41:08 -0700
Subject: [ofa-general] Re: [IPoIB][RFC] remove redundant gid query
In-Reply-To: <Pine.LNX.4.64.0705081427320.27590@jlentini-linux.nane.netapp.com>
	(James Lentini's message of "Tue,
	8 May 2007 14:38:06 -0400 (EDT)")
References: <Pine.LNX.4.64.0705081427320.27590@jlentini-linux.nane.netapp.com>
Message-ID: <adazm43jqkr.fsf@cisco.com>

 > Both ipoib_add_port() and ipoib_mcast_join_task() query the GID at 
 > index 0 to setup the ipoib_dev_priv structure's local_gid and the 
 > net_device structure's dev_addr. There does not appear to be a way for 
 > ipoib_mcast_join_task() to be executed before ipoib_add_port() 
 > completes. Therefore, the work done in ipoib_mcast_join_task() appears 
 > to be redundant.

It does look like we're doing some work we don't need to do.  However
ipoib_add_port() could run before an SM has brought up the local port,
so the GID prefix might change later.

I'm not sure what the best way to clean this up is.

 - R.


From mst at dev.mellanox.co.il  Thu May 17 10:45:19 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 May 2007 20:45:19 +0300
Subject: [ofa-general] ib_find_gid / ib_find_pkey
In-Reply-To: <ada4pmbl5hc.fsf@cisco.com>
References: <20070514045832.GA18615@mellanox.co.il>
	<ada8xbnl65p.fsf@cisco.com> <ada4pmbl5hc.fsf@cisco.com>
Message-ID: <20070517174519.GC22028@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [ofa-general] ib_find_gid / ib_find_pkey
> 
> Also applied P_Key reordering patch too...

OK. I think the next step is to get rid of ipoib_pkey_dev_check_presence and
ipoib_pkey_poll in ipoib.  This way we'll have one ULP clean of cache usage.
Yosef?

Another thing to do on this front is to make the pkey change event
less intrusive: we should not need to kill connections and AHs
because of a pkey change; just cycling the QP through reset should be
enough.

-- 
MST


From mst at dev.mellanox.co.il  Thu May 17 10:58:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 May 2007 20:58:30 +0300
Subject: [ofa-general] [PATCHv2] IB/mthca: fix use-after-free
In-Reply-To: <20070517081050.GC4205@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C90182527E@mtlexch01.mtl.com>
	<20070517081050.GC4205@mellanox.co.il>
Message-ID: <20070517175830.GE22028@mellanox.co.il>

From: Ali Ayoub <ali at mellanox.co.il>
Subject: [PATCH] IB/mthca: fix use-after-free

Fix use-after-free on hardware restart.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

Previous version would do NULL-pointer dereference
if pci_get_drvdata returns NULL. BTW, when does this happen?
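
For reference, the pattern being fixed can be sketched in plain userspace C
as below. This is only a model under stated assumptions: struct sketch_dev,
the sketch_current pointer (standing in for the drvdata pointer) and the
three functions are invented names, and malloc/free stand in for whatever the
real init and remove paths allocate and release.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's per-device state. */
struct sketch_dev {
	int hca_type;
};

/* Plays the role of the pointer returned by pci_get_drvdata(). */
static struct sketch_dev *sketch_current;

static int sketch_init_one(int hca_type)
{
	struct sketch_dev *dev;

	assert(sketch_current == NULL);	/* no device may be live already */

	dev = malloc(sizeof *dev);
	if (!dev)
		return -ENOMEM;

	dev->hca_type = hca_type;
	sketch_current = dev;
	return 0;
}

static void sketch_remove_one(void)
{
	free(sketch_current);
	sketch_current = NULL;
}

static int sketch_restart_one(void)
{
	struct sketch_dev *dev = sketch_current;
	int hca_type;

	if (!dev)
		return -ENODEV;

	hca_type = dev->hca_type;	/* copy out while dev is still valid */
	sketch_remove_one();		/* dev is dangling from here on */
	return sketch_init_one(hca_type);
}
```

Reading dev->hca_type after sketch_remove_one() would be the use-after-free;
the local copy avoids touching the freed structure at all.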

diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 773145e..aa563e6 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -1250,12 +1250,14 @@ static void __mthca_remove_one(struct pci_dev *pdev)
 int __mthca_restart_one(struct pci_dev *pdev)
 {
 	struct mthca_dev *mdev;
+	int hca_type;
 
 	mdev = pci_get_drvdata(pdev);
 	if (!mdev)
 		return -ENODEV;
+	hca_type = mdev->hca_type;
 	__mthca_remove_one(pdev);
-	return __mthca_init_one(pdev, mdev->hca_type);
+	return __mthca_init_one(pdev, hca_type);
 }
 
 static int __devinit mthca_init_one(struct pci_dev *pdev,
-- 
MST


From rdreier at cisco.com  Thu May 17 11:32:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 17 May 2007 11:32:52 -0700
Subject: [ofa-general] Re: [PATCH 2.6.22] ehca: return proper error code if
	register_mr fails
In-Reply-To: <200705161450.55848.hnguyen@linux.vnet.ibm.com> (Hoang-Nam
	Nguyen's message of "Wed, 16 May 2007 14:50:55 +0200")
References: <200705161450.55848.hnguyen@linux.vnet.ibm.com>
Message-ID: <adar6pfjo6j.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Thu May 17 11:40:22 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 17 May 2007 11:40:22 -0700
Subject: [ofa-general] Re: [PATCHv2] IB/mthca: fix use-after-free
In-Reply-To: <20070517175830.GE22028@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 17 May 2007 20:58:30 +0300")
References: <6C2C79E72C305246B504CBA17B5500C90182527E@mtlexch01.mtl.com>
	<20070517081050.GC4205@mellanox.co.il>
	<20070517175830.GE22028@mellanox.co.il>
Message-ID: <adamz03jnu1.fsf@cisco.com>

Thanks, applied.

 > Previous version would do NULL-pointer dereference
 > if pci_get_drvdata returns NULL. BTW, when does this happen?

I'm not positive why that check is there -- it dates back to Jack's
original device restart patch.

But I guess it's at least conceivable that an HCA is hot-unplugged
after a catastrophic error but before the restart task runs.

 - R.


From rdreier at cisco.com  Thu May 17 11:51:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 17 May 2007 11:51:07 -0700
Subject: [ofa-general] Re: [PATCH] IB/mthca: initialise GRH:HopLimit when
	building MLX headers
In-Reply-To: <E1HoiAm-00023A-Ia@ib1.edm.orcorp.ca> (Rolf Manderscheid's
	message of "Thu, 17 May 2007 09:45:48 -0600")
References: <E1HoiAm-00023A-Ia@ib1.edm.orcorp.ca>
Message-ID: <adad50zjnc4.fsf@cisco.com>

thanks, applied.  I also added the following patch, since I think mlx4
has the same bug.  If you happen to have any ConnectX cards available,
can you check this works too?

commit c3f9fc8d912387837c65abf59e8cd0146b17589f
Author: Roland Dreier <rolandd at cisco.com>
Date:   Thu May 17 11:49:55 2007 -0700

    IB/mlx4: Set GRH:HopLimit when sending globally routed MADs
    
    This is the same issue discovered in mthca by Rolf Manderscheid
    <rvm at obsidianresearch.com>.
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 9c362fa..0cf8b95 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -952,6 +952,7 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr,
 			(be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 20) & 0xff;
 		sqp->ud_header.grh.flow_label    =
 			ah->av.sl_tclass_flowlabel & cpu_to_be32(0xfffff);
+		sqp->ud_header.grh.hop_limit     = ah->av.hop_limit;
 		ib_get_cached_gid(ib_dev, be32_to_cpu(ah->av.port_pd) >> 24,
 				  ah->av.gid_index, &sqp->ud_header.grh.source_gid);
 		memcpy(sqp->ud_header.grh.destination_gid.raw,


From rdreier at cisco.com  Thu May 17 11:52:45 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 17 May 2007 11:52:45 -0700
Subject: [ofa-general] ib_find_gid / ib_find_pkey
In-Reply-To: <20070517174519.GC22028@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 17 May 2007 20:45:19 +0300")
References: <20070514045832.GA18615@mellanox.co.il>
	<ada8xbnl65p.fsf@cisco.com> <ada4pmbl5hc.fsf@cisco.com>
	<20070517174519.GC22028@mellanox.co.il>
Message-ID: <ada8xbnjn9e.fsf@cisco.com>

 > OK. I think the next step is to get rid of ipoib_pkey_dev_check_presence and
 > ipoib_pkey_poll in ipoib.  This way we'll have one ULP clean of cache usage.
 > Yosef?

Sounds good.  Removing cache usage from SRP should be easy now too;
I'll queue that for 2.6.23 when I get a chance.

 - R.


From rdreier at cisco.com  Thu May 17 13:03:19 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 17 May 2007 13:03:19 -0700
Subject: [ofa-general] IB/core: Enhance SMI for switch support
In-Reply-To: <1179178372.4531.10975.camel@hal.voltaire.com> (Hal Rosenstock's
	message of "14 May 2007 17:32:52 -0400")
References: <1179177711.4531.10290.camel@hal.voltaire.com>
	<adazm47t7x6.fsf@cisco.com>
	<1179178372.4531.10975.camel@hal.voltaire.com>
Message-ID: <ada4pmbjjzs.fsf@cisco.com>

 > The risk is primarily on the switch side, rather than the CA/router
 > side, right ? So isn't the downside of this minimal ?

Actually I was thinking that this disturbs the current SMI code and
potentially introduces bugs (even with plain old CAs).  And without
any switch driver it's not clear if we want to take that risk.

But OK, I'll queue this up for 2.6.23 and hope for the best...


From halr at voltaire.com  Thu May 17 13:10:32 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 May 2007 16:10:32 -0400
Subject: [ofa-general] IB/core: Enhance SMI for switch support
In-Reply-To: <ada4pmbjjzs.fsf@cisco.com>
References: <1179177711.4531.10290.camel@hal.voltaire.com>
	<adazm47t7x6.fsf@cisco.com>
	<1179178372.4531.10975.camel@hal.voltaire.com>
	<ada4pmbjjzs.fsf@cisco.com>
Message-ID: <1179432631.23882.103465.camel@hal.voltaire.com>

On Thu, 2007-05-17 at 16:03, Roland Dreier wrote:
>  > The risk is primarily on the switch side, rather than the CA/router
>  > side, right ? So isn't the downside of this minimal ?
> 
> Actually I was thinking that this disturbs the current SMI code and
> potentially introduces bugs (even with plain old CAs).

I may be wrong but the main SMI change is returning IB_SMI_FORWARD
rather than IB_SMI_SEND for two intermediate hop cases (C14-9:2 and
C14-13:2) which should not apply to CA or router ports.

> And without any switch driver it's not clear if we want to take that risk.

Understood.

> But OK, I'll queue this up for 2.6.23 and hope for the best...

Thanks.

-- Hal


From steve.apo at googlemail.com  Thu May 17 14:23:28 2007
From: steve.apo at googlemail.com (Steven Wooding)
Date: Thu, 17 May 2007 22:23:28 +0100
Subject: [ofa-general] libibcm compatability problem
In-Reply-To: <464C8889.5090403@ichips.intel.com>
References: <2cfcf21e0705170944h4ccddd52udef97b59d32452be@mail.gmail.com>
	<464C8889.5090403@ichips.intel.com>
Message-ID: <2cfcf21e0705171423m166f6d3axbae29dc70ed2a0eb@mail.gmail.com>

Hi Sean,

I tried to compile the latest libibcm library, but have got stuck on
./configure with the cpp failing its sanity check. I'm on Scientific Linux
4.3 (aka RHEL 4.3) and have gcc installed.

Could I try and use the OFED 1.1 kernel drivers in 2.6.20 instead of the
in-built kernel drivers? Or would it be better to install the latest
subversion snapshot of the userspace libraries?

It's a very strange installation we've ended up with from our supplier. They
are using the latest kernel ib drivers whilst using OFED 1.0 userspace
libraries. It seems that libibcm is the only library that has a
compatibility problem.

Finally, how much effort do you think it would take to convert a program
that uses libibcm to librdmacm?

Thanks very much for your advice

Cheers,

Steve.

On 17/05/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
>
> > I'm using a 2.6.20.1 kernel with OFED 1.1. I get the
> > following message when running my application:
> >
> > "libibcm: Kernel ABi version 5 doesn't match library version 4".
> >
> > Could someone tell me what version of the library in terms of OFED
> > release I should be using?
>
> I'm not sure if OFED 1.1 supports 2.6.20.  But you can try using the
> latest libibcm from here:
>
>         http://www.openfabrics.org/~shefty/libibcm-1.0.tar.gz
>
> You can also look into using the librdmacm as an alternative, which
> supports other RDMA transports.
>
> - Sean
>

From rdreier at cisco.com  Thu May 17 15:05:33 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 17 May 2007 15:05:33 -0700
Subject: [ofa-general] Re: [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR
	leak
In-Reply-To: <20070517073017.GA4205@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 17 May 2007 10:30:28 +0300")
References: <20070515210453.GL4161@mellanox.co.il>
	<20070516101457.GA5091@mellanox.co.il> <adaejlgmrkb.fsf@cisco.com>
	<20070517073017.GA4205@mellanox.co.il>
Message-ID: <adaveerhzrm.fsf@cisco.com>

 > Hmm, I would like to quote the spec *literally*. Maybe
 > - and then invoke a Destroy QP or Reset QP.

That's fine ... in fact the spec has a bullet for that line too (I
just looked).  The way you have it formatted now is visually kind of
strange (the last line looks odd):

 > + * - Put the QP in the Error State
 > + * - Wait for the Affiliated Asynchronous Last WQE Reached Event;
 > + * - either:
 > + *       drain the CQ by invoking the Poll CQ verb and either wait for CQ
 > + *       to be empty or the number of Poll CQ operations has exceeded
 > + *       CQ capacity size;
 > + * - or
 > + *       post another WR that completes on the same CQ and wait for this
 > + *       WR to return as a WC; (NB: this is the option that we use)
 > + * and then invoke a Destroy QP or Reset QP.

which doesn't really match the spec's formatting... I guess this is
pretty minor, but I would write the comment as:

* - Put the QP in the Error State;
* - wait for the Affiliated Asynchronous Last WQE Reached Event;
* - either:
*   - drain the CQ by invoking the Poll CQ verb and either wait for CQ
*     to be empty or the number of Poll CQ operations has exceeded
*     CQ capacity size; or
*   - post another WR that completes on the same CQ and wait for this
*     WR to return as a WC;
* - and then invoke a Destroy QP or Reset QP.
*
* For IPoIB we choose the second option of posting another WR, and we
* keep a dedicated QP in the error state for doing this.
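
The ordering in the comment above can be modelled as a small state
machine. This is only a sketch of the sequence, not the IPoIB code: the
enum names, rx_next() and rx_teardown_in_order() are hypothetical, and no
verbs calls are made. An event that arrives out of order leaves the state
unchanged, which mirrors "do not destroy before the Last WQE Reached event
and the drain completion have been seen".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-connection teardown states (illustration only). */
enum rx_state { RX_LIVE, RX_ERROR, RX_FLUSH, RX_DRAINED, RX_DESTROYED };

/* Hypothetical events, one per step of the teardown procedure. */
enum rx_event {
	EV_MODIFY_TO_ERROR,   /* QP moved to the Error State */
	EV_LAST_WQE_REACHED,  /* Last WQE Reached async event observed */
	EV_DRAIN_WC,          /* the extra WR came back as a WC on the CQ */
	EV_DESTROY_QP         /* Destroy QP (or Reset QP) invoked */
};

/* Return the next state; an event that is not legal in state s is ignored. */
static inline enum rx_state rx_next(enum rx_state s, enum rx_event e)
{
	switch (s) {
	case RX_LIVE:
		return e == EV_MODIFY_TO_ERROR ? RX_ERROR : s;
	case RX_ERROR:
		return e == EV_LAST_WQE_REACHED ? RX_FLUSH : s;
	case RX_FLUSH:
		return e == EV_DRAIN_WC ? RX_DRAINED : s;
	case RX_DRAINED:
		return e == EV_DESTROY_QP ? RX_DESTROYED : s;
	default:
		return s;
	}
}

/* Walk the four steps in order and return the final state. */
static inline enum rx_state rx_teardown_in_order(void)
{
	static const enum rx_event seq[] = {
		EV_MODIFY_TO_ERROR, EV_LAST_WQE_REACHED,
		EV_DRAIN_WC, EV_DESTROY_QP
	};
	enum rx_state s = RX_LIVE;
	size_t i;

	for (i = 0; i < sizeof(seq) / sizeof(seq[0]); i++) {
		enum rx_state n = rx_next(s, seq[i]);

		assert(n != s); /* every in-order step must make progress */
		s = n;
	}
	return s;
}
```

Calling rx_next(RX_LIVE, EV_DESTROY_QP) returns RX_LIVE, i.e. the model
refuses a destroy that skips the Error State and the drain.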

 > > This actually seems like a good motivation for the mthca RESET ->
 > > ERROR fix.  We could avoid the transition to INIT if we fixed mthca
 > > and mlx4, right?
 > 
 > Yes. That was the motivation.

OK, I just queued the mthca and mlx4 versions of the RESET->ERROR fix.
So I guess you can drop the dummy transition to INIT within IPoIB for
the final version of the WQE leakage patch.

 - R.


From rdreier at cisco.com  Thu May 17 15:15:06 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 17 May 2007 15:15:06 -0700
Subject: [ofa-general] Re: [PATCH] libibverbs/ibv_devinfo : Print the number
	of max_vl_num as a number
In-Reply-To: <1178459202.16752.1.camel@mtldesk014.lab.mtl.com> (Dotan Barak's
	message of "Sun, 06 May 2007 16:46:42 +0300")
References: <1178459202.16752.1.camel@mtldesk014.lab.mtl.com>
Message-ID: <adar6pfhzbp.fsf@cisco.com>

Thanks, applied.


From mst at dev.mellanox.co.il  Thu May 17 15:21:39 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 18 May 2007 01:21:39 +0300
Subject: [ofa-general] Re: [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR
	leak
In-Reply-To: <adaveerhzrm.fsf@cisco.com>
References: <20070515210453.GL4161@mellanox.co.il>
	<20070516101457.GA5091@mellanox.co.il> <adaejlgmrkb.fsf@cisco.com>
	<20070517073017.GA4205@mellanox.co.il> <adaveerhzrm.fsf@cisco.com>
Message-ID: <20070517222139.GC29259@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR leak
> 
>  > Hmm, I would like to quote the spec *literally*. Maybe
>  > - and then invoke a Destroy QP or Reset QP.
> 
> That's fine ... in fact the spec has a bullet for that line too (I
> just looked).  The way you have it formatted now is visually kind of
> strange (the last line looks odd):
> 
>  > + * - Put the QP in the Error State
>  > + * - Wait for the Affiliated Asynchronous Last WQE Reached Event;
>  > + * - either:
>  > + *       drain the CQ by invoking the Poll CQ verb and either wait for CQ
>  > + *       to be empty or the number of Poll CQ operations has exceeded
>  > + *       CQ capacity size;
>  > + * - or
>  > + *       post another WR that completes on the same CQ and wait for this
>  > + *       WR to return as a WC; (NB: this is the option that we use)
>  > + * and then invoke a Destroy QP or Reset QP.
> 
> which doesn't really match the spec's formatting... I guess this is
> pretty minor, but I would write the comment as:
> 
> * - Put the QP in the Error State;
> * - wait for the Affiliated Asynchronous Last WQE Reached Event;
> * - either:
> *   - drain the CQ by invoking the Poll CQ verb and either wait for CQ
> *     to be empty or the number of Poll CQ operations has exceeded
> *     CQ capacity size; or
> *   - post another WR that completes on the same CQ and wait for this
> *     WR to return as a WC;
> * - and then invoke a Destroy QP or Reset QP.
> *
> * For IPoIB we choose the second option of posting another WR, and we
> * keep a dedicated QP in the error state for doing this.

Yes, that's exactly what I did.

>  > > This actually seems like a good motivation for the mthca RESET ->
>  > > ERROR fix.  We could avoid the transition to INIT if we fixed mthca
>  > > and mlx4, right?
>  > 
>  > Yes. That was the motivation.
> 
> OK, I just queued the mthca and mlx4 versions of the RESET->ERROR fix.
> So I guess you can drop the dummy transition to INIT within IPoIB for
> the final version of the WQE leakage patch.

Yea. I'm just fixing it to work on top of the pkey change:
ipoib_cm_dev_stop will need a flush flag, and I think this means
I'll need a reap_list for RX connections like I do for TX:
so I can just go over this list and destroy all QPs safely
from inside the ipoib_wq.

BTW, it seems that for 2.6.23 the right thing to do will be
to merge TX and RX structures, and reuse some more of
the code between the two.

-- 
MST


From rdreier at cisco.com  Thu May 17 15:56:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 17 May 2007 15:56:47 -0700
Subject: [ofa-general] Re: [PATCH 2/2] IB/mlx4: pass more data from user to
	kernel
In-Reply-To: <1179387217.25749.62.camel@mtls03> (Eli Cohen's message of "Thu,
	17 May 2007 10:32:41 +0300")
References: <1179387217.25749.62.camel@mtls03>
Message-ID: <adalkfnhxe8.fsf@cisco.com>

This looks a little busted:

 >  struct mlx4_ib_create_qp {
 >  	__u64	buf_addr;
 >  	__u64	db_addr;
 > +	__u64	rq_size;
 > +	__u64	sq_size;
 > +	__u8	rcv_wqe_shift;
 > +	__u8	log_wqe_bb;
 >  };

the structure ends up not aligned to a multiple of 8 bytes, so it ends
up having a size of 36 bytes on 32-bit setups and 40 bytes on 64-bit
setups, which might mess up 32-bit userspace on 64-bit kernels.

Also I don't understand why you made rq_size and sq_size 64 bits
anyway?  It seems they could never be more than 16 bits, although to
be safe perhaps 32 bits is best.  So I'll fix this up to look like
this (with names that seem more self-documenting to me):

struct mlx4_ib_create_qp {
	__u64	buf_addr;
	__u64	db_addr;
	__u32	rq_wqe_count;
	__u32	rq_wqe_shift;
	__u32	sq_wqebb_count;
	__u32	sq_wqebb_shift;
};

am I missing some hidden reason to make those fields 64 bits?
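
To make the size difference concrete, here is a minimal sketch that
mirrors both layouts with <stdint.h> types standing in for the kernel's
__u64/__u32/__u8. The struct names create_qp_v1 and create_qp_v2 and the
helper functions are made up for illustration; they are not part of the
mlx4 ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout from the patch as posted: 34 bytes of members. */
struct create_qp_v1 {
	uint64_t buf_addr;
	uint64_t db_addr;
	uint64_t rq_size;
	uint64_t sq_size;
	uint8_t  rcv_wqe_shift;
	uint8_t  log_wqe_bb;
};

/* Proposed layout: 32 bytes of members, no implicit padding needed. */
struct create_qp_v2 {
	uint64_t buf_addr;
	uint64_t db_addr;
	uint32_t rq_wqe_count;
	uint32_t rq_wqe_shift;
	uint32_t sq_wqebb_count;
	uint32_t sq_wqebb_shift;
};

/* Sum of the member sizes of the v1 layout, ignoring padding. */
static inline size_t v1_payload_bytes(void)
{
	return 4 * sizeof(uint64_t) + 2 * sizeof(uint8_t);
}

/*
 * Nonzero when the compiler appended tail padding to v1.  How much it
 * appends depends on the ABI's alignment of 64-bit integers, which is
 * exactly why 32-bit and 64-bit builds can disagree on the size.
 */
static inline int v1_has_tail_padding(void)
{
	return sizeof(struct create_qp_v1) > v1_payload_bytes();
}

/* Nonzero when v2 has the expected padding-free 32-byte layout. */
static inline int v2_layout_is_fixed(void)
{
	return sizeof(struct create_qp_v2) == 32 &&
	       offsetof(struct create_qp_v2, rq_wqe_count) == 16 &&
	       offsetof(struct create_qp_v2, sq_wqebb_shift) == 28;
}
```

Since every member of the second layout sits at a multiple of its own
size and the total is a multiple of 8, a 32-bit userspace and a 64-bit
kernel should agree on it without any compat translation.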


From rdreier at cisco.com  Thu May 17 16:25:22 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 17 May 2007 16:25:22 -0700
Subject: [ofa-general] Re: [PATCH 2/2] IB/mlx4: pass more data from user
	to kernel
In-Reply-To: <adalkfnhxe8.fsf@cisco.com> (Roland Dreier's message of "Thu,
	17 May 2007 15:56:47 -0700")
References: <1179387217.25749.62.camel@mtls03> <adalkfnhxe8.fsf@cisco.com>
Message-ID: <adad50zhw2l.fsf@cisco.com>

 > struct mlx4_ib_create_qp {
 > 	__u64	buf_addr;
 > 	__u64	db_addr;
 > 	__u32	rq_wqe_count;
 > 	__u32	rq_wqe_shift;
 > 	__u32	sq_wqebb_count;
 > 	__u32	sq_wqebb_shift;
 > };

Actually, on second thought maybe it's cleaner just to pass the SQ
information from user->kernel?  There's not really anything that can
go wrong with RQs, and it's probably safer not to have the same info
passed in two different ways.

maybe something like

struct mlx4_ib_create_qp {
	__u64	buf_addr;
	__u64	db_addr;
	__u8	log_sq_stride;
	__u8	log_sq_bb_per_wqe;
	__u8	reserved[6];
};

and then use the RQ and SQ sizes and number of gather entries from the
normal part of the command to figure the rest out?

Any opinion on which is preferable?
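
For reference, a small sketch of the SQ-only variant with its explicit
padding, again using <stdint.h> types in place of __u64/__u8. The struct
name create_qp_sq_only and both helpers are hypothetical; zeroing the
reserved bytes is an assumption about how such a command would normally
be filled in, not something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* SQ-only layout: two 1-byte fields plus 6 explicit pad bytes = 24 bytes. */
struct create_qp_sq_only {
	uint64_t buf_addr;
	uint64_t db_addr;
	uint8_t  log_sq_stride;
	uint8_t  log_sq_bb_per_wqe;
	uint8_t  reserved[6];
};

/* Fill in a command, clearing the reserved bytes first. */
static inline void create_qp_sq_only_init(struct create_qp_sq_only *cmd,
					  uint64_t buf_addr, uint64_t db_addr,
					  uint8_t log_sq_stride,
					  uint8_t log_sq_bb_per_wqe)
{
	assert(cmd != NULL);
	memset(cmd, 0, sizeof(*cmd));
	cmd->buf_addr          = buf_addr;
	cmd->db_addr           = db_addr;
	cmd->log_sq_stride     = log_sq_stride;
	cmd->log_sq_bb_per_wqe = log_sq_bb_per_wqe;
}

/* Return 1 if every reserved byte is zero, 0 otherwise. */
static inline int create_qp_sq_only_reserved_clear(
	const struct create_qp_sq_only *cmd)
{
	size_t i;

	for (i = 0; i < sizeof(cmd->reserved); i++)
		if (cmd->reserved[i] != 0)
			return 0;
	return 1;
}
```

Because the padding is spelled out as reserved[6], the size is 24 bytes
regardless of how the ABI aligns 64-bit integers.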

 - R.


From mst at dev.mellanox.co.il  Thu May 17 20:40:28 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 18 May 2007 06:40:28 +0300
Subject: [ofa-general] Re: Re: [PATCH 2/2] IB/mlx4: pass more data from user
	to kernel
In-Reply-To: <adad50zhw2l.fsf@cisco.com>
References: <1179387217.25749.62.camel@mtls03> <adalkfnhxe8.fsf@cisco.com>
	<adad50zhw2l.fsf@cisco.com>
Message-ID: <20070518034028.GB4708@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: Re: [PATCH 2/2] IB/mlx4: pass more data from user to kernel
> 
>  > struct mlx4_ib_create_qp {
>  > 	__u64	buf_addr;
>  > 	__u64	db_addr;
>  > 	__u32	rq_wqe_count;
>  > 	__u32	rq_wqe_shift;
>  > 	__u32	sq_wqebb_count;
>  > 	__u32	sq_wqebb_shift;
>  > };
> 
> Actually, on second thought maybe it's cleaner just to pass the SQ
> information from user->kernel?  There's not really anything that can
> go wrong with RQs, and it's probably safer not to have the same info
> passed in two different ways.
> 
> maybe something like
> 
> struct mlx4_ib_create_qp {
> 	__u64	buf_addr;
> 	__u64	db_addr;
>         __u8	log_sq_stride;	
>         __u8	log_sq_bb_per_wqe;
>         __u8	reserved[6];
> };
> 
> and then use the RQ and SQ sizes and number of gather entries from the
> normal part of the command to figure the rest out?
> 
> Any opinion on which is preferable?

I'm OK with what you say about the RQ, but replacing sq_wqebb_count with
log_sq_bb_per_wqe looks like obfuscation to me: you still pass in two values,
and the kernel does not actually care about the number of SQ WRs at all; it
really only needs to look at the number of WQEBBs.

-- 
MST


From steve.apo at googlemail.com  Fri May 18 01:13:41 2007
From: steve.apo at googlemail.com (Steven Wooding)
Date: Fri, 18 May 2007 09:13:41 +0100
Subject: [ofa-general] libibcm compatability problem
In-Reply-To: <464C8889.5090403@ichips.intel.com>
References: <2cfcf21e0705170944h4ccddd52udef97b59d32452be@mail.gmail.com>
	<464C8889.5090403@ichips.intel.com>
Message-ID: <2cfcf21e0705180113u43ad5977rd5fef68793c5aaea@mail.gmail.com>

Hi Sean,

OK. I got it to compile, though when I ran it, it failed to create the
listening CM (still looking into the exact error).

Also I see that the function ib_cm_get_device has been removed. I was using
this to monitor the file descriptor of the CM device. Could this function be
put back into my local copy of libibcm or has this function been moved
somewhere else in the code?

Cheers,

Steve.

On 17/05/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
>
> > I'm using a 2.6.20.1 kernel with OFED 1.1. I get the
> > following message when running my application:
> >
> > "libibcm: Kernel ABi version 5 doesn't match library version 4".
> >
> > Could someone tell me what version of the library in terms of OFED
> > release I should be using?
>
> I'm not sure if OFED 1.1 supports 2.6.20.  But you can try using the
> latest libibcm from here:
>
>        http://www.openfabrics.org/~shefty/libibcm-1.0.tar.gz
>
> You can also look into using the librdmacm as an alternative, which
> supports other RDMA transports.
>
> - Sean
>

From vlad at lists.openfabrics.org  Fri May 18 02:41:25 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri, 18 May 2007 02:41:25 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070518-0200 daily build status
Message-ID: <20070518094126.0D1F2E6082B@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.15
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From halr at voltaire.com  Fri May 18 03:21:05 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 May 2007 06:21:05 -0400
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<000101c7963a$3474ae00$49c9180a@amr.corp.intel.com>
	<309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com>
	<464B5C07.8040601@ichips.intel.com>
	<309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com>
	<1179398534.23882.67542.camel@hal.voltaire.com>
	<309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com>
Message-ID: <1179483657.23882.158398.camel@hal.voltaire.com>

On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote:
> On 17 May 2007 06:42:16 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote:
> > > On 5/17/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > > > > But initially this will generate a packet for each path, while sys
> > > > > admin knows that path is there and he can hard-code the entries for
> > > > > it. Other thing is that why Admin will care about creating such record
> > > > > while SA is itself taking care, right?
> > > >
> > > > In your original message you asked about adding 'dummy entries' to the
> > > > cache.  I agree that pre-loading the cache can be useful.  What I still
> > > > am not understanding is the reasoning for adding 'dummy entries'.  By
> > > > 'dummy entries', I've been assuming that these are invalid path records,
> > > > but maybe that's not what you meant.
> > > Ok if "dummy entries" word as such has created confusion then I am
> > > sorry for that, But with that I mean that, those are valid path
> > > records which Administrator knows in advance and while loading the
> > > module,
> >
> > How does the admin know they are valid ?
> Depending on the initial application runs, some trusted PRs can be generated.

What do initial application runs have to do with this ?

> > Are they somehow preconfigured at the SM ?
> I am not sure whether the SM has any such provision.

Not that I'm aware of.

> Also not sure about the
> role of the SM in path resolution. I mean, once a node has initiated an SA
> query, does the SM have some database from which to answer, or is the
> destination node contacted on the fly to get the requested path record?

SMs can either calculate the SA PRs on the fly based on the routing
algorithm in use and some other things or put them in a local database.
This is up to that SM.

Destination node is not contacted in the SA PR query process.

> > Doesn't each SM have its own policy for generating valid PRs ?
> Ultimately a path record is in Path_Record object format, and the SA cache
> is going to store it in a fixed manner. How does the generation policy matter?

What if the local policy loaded does not agree with what the SM would
generate for a particular PR ? One then gets a local error which will
need to be tracked down. Not so easy IMO.

> CMIIW. Also I am assuming a homogeneous cluster where certain
> parameters can be assumed to be same always.

and always in agreement with what the SM would return ? For example,
what happens when a link goes down and the end node is no longer
reachable ?

> > are these from a live SM and just loaded "out of band" to
> > bypass/preclude the SA PR mechanism ?
> may be

Even if they are, there is still the changes in the subnet issue.

-- Hal

> > -- Hal
> >
> > >  Admin is loading this info in the cache with user command.
> > > >
> > > > > Another point I want to know is,
> > > > > When local_sa_cache module will be inserted? After SM comes up or
> > > > > Before SM comes up?
> > > >
> > > > It can occur either way.  There is no restriction.  The cache responds
> > > > to port up and GID in/out of service events to update itself.
> > > Do you mean cache module will start building cache only after Port is UP?
> > > >
> > > > > If it's inserted before the SM comes up (I am assuming the SM is running on
> > > > > some node, not on a switch) then the first forced schedule_update() is
> > > > > wasted, and for the first application the presence of the cache is
> > > > > meaningless. Why not keep the cache effective right from the start?
> > > >
> > > > Pre-loading the cache with path records doesn't guarantee that those
> > > > paths are usable.  If the SM has not come up, then the path records will
> > > > be unusable until the SM configures the subnet, plus there's no
> > > > guarantee that the remote endpoints specified by the paths are running.
> > > You mean there is no guarantee that even if SM is UP and we have some
> > > hard coded entries of path record corresponding to some node X, we are
> > > not sure that node X has actually come up or not?  In that case
> > > actually that path resolving should fail if node has not come up, but
> > > with the hard coding still path will be resolved?
> > > >
> > > > The main benefit I see to pre-loading the cache is to avoid SA storms
> > > > when booting a large cluster.
> > > that's true. Also cache will get valid entries only if network is
> > > configured by SM otherwise every node SA will, possibly, drop SA
> > > packets.
> > > >
> > > > - Sean
> > > >
> >
> >


From eli at dev.mellanox.co.il  Fri May 18 05:26:14 2007
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Fri, 18 May 2007 15:26:14 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/mlx4: pass more data from user
	to kernel
In-Reply-To: <adalkfnhxe8.fsf@cisco.com>
References: <1179387217.25749.62.camel@mtls03> <adalkfnhxe8.fsf@cisco.com>
Message-ID: <4e6a6b3c0705180526j28355392l71a7f18ea25d69c6@mail.gmail.com>

On 5/18/07, Roland Dreier <rdreier at cisco.com> wrote:
>
> This looks a little busted:
>
> >  struct mlx4_ib_create_qp {
> >      __u64   buf_addr;
> >      __u64   db_addr;
> > +    __u64   rq_size;
> > +    __u64   sq_size;
> > +    __u8    rcv_wqe_shift;
> > +    __u8    log_wqe_bb;
> >  };
>
> the structure ends up not aligned to a multiple of 8 bytes, so it ends
> up having a size of 36 bytes on 32-bit setups and 40 bytes on 64-bit
> setups, which might mess up 32-bit userspace on 64-bit kernels.
>
> Also I don't understand why you made rq_size and sq_size 64 bits
> anyway?  It seems they could never be more than 16 bits, although to
> be safe perhaps 32 bits is best.  So I'll fix this up to look like
> this (with names that seem more self-documenting to me):
>
> struct mlx4_ib_create_qp {
>        __u64   buf_addr;
>        __u64   db_addr;
>        __u32   rq_wqe_count;
>        __u32   rq_wqe_shift;
>        __u32   sq_wqebb_count;
>        __u32   sq_wqebb_shift;
> };
>
> am I missing some hidden reason to make those fields 64 bits?


Well, I was just not sure why the cast to uintptr_t was needed and
decided to use 64-bit variables, which obviously are overkill.


From mst at dev.mellanox.co.il  Fri May 18 06:12:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 18 May 2007 16:12:54 +0300
Subject: [ofa-general] [PATCH] IB/ipoib: fix error message
Message-ID: <20070518131254.GJ4708@mellanox.co.il>

Trivial error message fixup.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index e3b0937..c1aad06 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -470,7 +470,7 @@ int ipoib_ib_dev_open(struct net_device *dev)
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
-		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
+		ipoib_warn(priv, "ipoib_cm_dev_open returned %d\n", ret);
 		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
-- 
MST


From mst at dev.mellanox.co.il  Fri May 18 08:05:25 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 18 May 2007 18:05:25 +0300
Subject: [ofa-general] [PATCH RFC v1] IPoIB/CM: fix SRQ WR leak
In-Reply-To: <adaveerhzrm.fsf@cisco.com>
References: <20070515210453.GL4161@mellanox.co.il>
	<20070516101457.GA5091@mellanox.co.il> <adaejlgmrkb.fsf@cisco.com>
	<20070517073017.GA4205@mellanox.co.il> <adaveerhzrm.fsf@cisco.com>
Message-ID: <20070518150525.GL4708@mellanox.co.il>

OK, here's a new version.
I'm in the process of testing this, seems to work OK so far.
I will let it run for the weekend.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

Changes from v0.5:
- comment fix to match spec better
- added rx_reap_list and move RX from rx_drain_list to there
after drain is done, so that we can destroy flushed connections from any thread
just by going over this list. This is required since ipoib_cm_dev_stop can now
(after pkey change) be called from ipoib_wq, so we can't wait for a task to run
and clean the drained RX for us.

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 93d4a9a..69f1c25 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -132,12 +132,43 @@ struct ipoib_cm_data {
 	__be32 mtu;
 };
 
+/*
+ * Quoting 10.3.1 Queue Pair and EE Context States:
+ *
+ * Note, for QPs that are associated with an SRQ, the Consumer should take the
+ * QP through the Error State before invoking a Destroy QP or a Modify QP to the
+ * Reset State.  The Consumer may invoke the Destroy QP without first performing
+ * a Modify QP to the Error State and waiting for the Affiliated Asynchronous
+ * Last WQE Reached Event. However, if the Consumer does not wait for the
+ * Affiliated Asynchronous Last WQE Reached Event, then WQE and Data Segment
+ * leakage may occur. Therefore, it is good programming practice to tear down a
+ * QP that is associated with an SRQ by using the following process:
+ *
+ * - Put the QP in the Error State
+ * - Wait for the Affiliated Asynchronous Last WQE Reached Event;
+ * - either:
+ *       drain the CQ by invoking the Poll CQ verb and either wait for CQ
+ *       to be empty or the number of Poll CQ operations has exceeded
+ *       CQ capacity size;
+ * - or
+ *       post another WR that completes on the same CQ and wait for this
+ *       WR to return as a WC; (NB: this is the option that we use)
+ * - and then invoke a Destroy QP or Reset QP.
+ */
+
+enum ipoib_cm_state {
+	IPOIB_CM_RX_LIVE,
+	IPOIB_CM_RX_ERROR, /* Ignored by stale task */
+	IPOIB_CM_RX_FLUSH  /* Last WQE Reached event observed */
+};
+
 struct ipoib_cm_rx {
 	struct ib_cm_id     *id;
 	struct ib_qp        *qp;
 	struct list_head     list;
 	struct net_device   *dev;
 	unsigned long        jiffies;
+	enum ipoib_cm_state  state;
 };
 
 struct ipoib_cm_tx {
@@ -165,10 +196,16 @@ struct ipoib_cm_dev_priv {
 	struct ib_srq  	       *srq;
 	struct ipoib_cm_rx_buf *srq_ring;
 	struct ib_cm_id        *id;
-	struct list_head        passive_ids;
+	struct ib_qp           *rx_drain_qp;   /* generates WR described in 10.3.1 */
+	struct list_head        passive_ids;   /* state: LIVE */
+	struct list_head        rx_error_list; /* state: ERROR */
+	struct list_head        rx_flush_list; /* state: FLUSH, drain not started */
+	struct list_head        rx_drain_list; /* state: FLUSH, drain started */
+	struct list_head        rx_reap_list;  /* state: FLUSH, drain done */
 	struct work_struct      start_task;
 	struct work_struct      reap_task;
 	struct work_struct      skb_task;
+	struct work_struct      rx_reap_task;
 	struct delayed_work     stale_task;
 	struct sk_buff_head     skb_queue;
 	struct list_head        start_list;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index eec833b..46121cf 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -62,6 +62,17 @@ struct ipoib_cm_id {
 	u32 remote_mtu;
 };
 
+static struct ib_qp_attr ipoib_cm_err_attr = {
+	.qp_state = IB_QPS_ERR
+};
+
+#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff
+
+static struct ib_send_wr ipoib_cm_rx_drain_wr = {
+	.wr_id = IPOIB_CM_RX_DRAIN_WRID,
+	.opcode = IB_WR_SEND
+};
+
 static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
 			       struct ib_cm_event *event);
 
@@ -150,11 +161,44 @@ partial_error:
 	return NULL;
 }
 
+static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv *priv)
+{
+	struct ib_send_wr *bad_send_wr;
+
+	/* rx_drain_qp send queue depth is 1, so
+	 * make sure we have at most 1 outstanding WR. */
+	if (list_empty(&priv->cm.rx_flush_list) ||
+	    !list_empty(&priv->cm.rx_drain_list))
+		return;
+
+	if (ib_post_send(priv->cm.rx_drain_qp, &ipoib_cm_rx_drain_wr, &bad_send_wr))
+		ipoib_warn(priv, "failed to post rx_drain wr\n");
+
+	list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list);
+}
+
+static void ipoib_cm_rx_event_handler(struct ib_event *event, void *ctx)
+{
+	struct ipoib_cm_rx *p = ctx;
+	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
+	unsigned long flags;
+
+	if (event->event != IB_EVENT_QP_LAST_WQE_REACHED)
+		return;
+
+	spin_lock_irqsave(&priv->lock, flags);
+	list_move(&p->list, &priv->cm.rx_flush_list);
+	p->state = IPOIB_CM_RX_FLUSH;
+	ipoib_cm_start_rx_drain(priv);
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
 static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 					   struct ipoib_cm_rx *p)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
+		.event_handler = ipoib_cm_rx_event_handler,
 		.send_cq = priv->cq, /* does not matter, we never send anything */
 		.recv_cq = priv->cq,
 		.srq = priv->cm.srq,
@@ -256,6 +300,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 
 	cm_id->context = p;
 	p->jiffies = jiffies;
+	p->state = IPOIB_CM_RX_LIVE;
 	spin_lock_irq(&priv->lock);
 	if (list_empty(&priv->cm.passive_ids))
 		queue_delayed_work(ipoib_workqueue,
@@ -277,7 +322,6 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id,
 {
 	struct ipoib_cm_rx *p;
 	struct ipoib_dev_priv *priv;
-	int ret;
 
 	switch (event->event) {
 	case IB_CM_REQ_RECEIVED:
@@ -289,20 +333,9 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id,
 	case IB_CM_REJ_RECEIVED:
 		p = cm_id->context;
 		priv = netdev_priv(p->dev);
-		spin_lock_irq(&priv->lock);
-		if (list_empty(&p->list))
-			ret = 0; /* Connection is going away already. */
-		else {
-			list_del_init(&p->list);
-			ret = -ECONNRESET;
-		}
-		spin_unlock_irq(&priv->lock);
-		if (ret) {
-			ib_destroy_qp(p->qp);
-			kfree(p);
-			return ret;
-		}
-		return 0;
+		if (ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE))
+			ipoib_warn(priv, "unable to move qp to error state\n");
+		/* Fall through */
 	default:
 		return 0;
 	}
@@ -354,8 +387,15 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		       wr_id, wc->status);
 
 	if (unlikely(wr_id >= ipoib_recvq_size)) {
-		ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
-			   wr_id, ipoib_recvq_size);
+		if (wr_id == IPOIB_CM_RX_DRAIN_WRID) {
+			spin_lock_irqsave(&priv->lock, flags);
+			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
+			ipoib_cm_start_rx_drain(priv);
+			queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
+			spin_unlock_irqrestore(&priv->lock, flags);
+		} else
+			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
+				   wr_id, ipoib_recvq_size);
 		return;
 	}
 
@@ -374,9 +414,9 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
 			spin_lock_irqsave(&priv->lock, flags);
 			p->jiffies = jiffies;
-			/* Move this entry to list head, but do
-			 * not re-add it if it has been removed. */
-			if (!list_empty(&p->list))
+			/* Move this entry to list head, but do not re-add it
+			 * if it has been moved out of list. */
+			if (p->state == IPOIB_CM_RX_LIVE)
 				list_move(&p->list, &priv->cm.passive_ids);
 			spin_unlock_irqrestore(&priv->lock, flags);
 		}
@@ -583,17 +623,41 @@ static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
 int ipoib_cm_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_init_attr qp_init_attr = {
+		.send_cq = priv->cq,
+		.recv_cq = priv->cq, /* does not matter, we never get anything */
+		.srq = priv->cm.srq, /* does not matter, we never get anything */
+		.cap.max_send_wr = 1,
+		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
+		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.qp_type = IB_QPT_UC,
+	};
 	int ret;
 
 	if (!IPOIB_CM_SUPPORTED(dev->dev_addr))
 		return 0;
 
+	priv->cm.rx_drain_qp = ib_create_qp(priv->pd, &qp_init_attr);
+	if (IS_ERR(priv->cm.rx_drain_qp)) {
+		printk(KERN_WARNING "%s: failed to create drain QP\n", priv->ca->name);
+		ret = PTR_ERR(priv->cm.rx_drain_qp);
+		return ret;
+	}
+
+	/* We put the QP in error state directly: this way, hardware
+	 * will immediately generate WC for each WR we post, without
+	 * sending anything on the wire. */
+	ret = ib_modify_qp(priv->cm.rx_drain_qp, &ipoib_cm_err_attr, IB_QP_STATE);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret);
+		goto err_qp;
+	}
+
 	priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev);
 	if (IS_ERR(priv->cm.id)) {
 		printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name);
 		ret = PTR_ERR(priv->cm.id);
-		priv->cm.id = NULL;
-		return ret;
+		goto err_cm;
 	}
 
 	ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num),
@@ -601,35 +665,79 @@ int ipoib_cm_dev_open(struct net_device *dev)
 	if (ret) {
 		printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name,
 		       IPOIB_CM_IETF_ID | priv->qp->qp_num);
-		ib_destroy_cm_id(priv->cm.id);
-		priv->cm.id = NULL;
-		return ret;
+		goto err_listen;
 	}
+
 	return 0;
+
+err_listen:
+	ib_destroy_cm_id(priv->cm.id);
+err_cm:
+	priv->cm.id = NULL;
+err_qp:
+	ib_destroy_qp(priv->cm.rx_drain_qp);
+	return ret;
 }
 
 void ipoib_cm_dev_stop(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_cm_rx *p;
+	struct ipoib_cm_rx *p, *n;
+	unsigned long begin;
+	LIST_HEAD(list);
+	int ret;
 
 	if (!IPOIB_CM_SUPPORTED(dev->dev_addr) || !priv->cm.id)
 		return;
 
 	ib_destroy_cm_id(priv->cm.id);
 	priv->cm.id = NULL;
+
 	spin_lock_irq(&priv->lock);
 	while (!list_empty(&priv->cm.passive_ids)) {
 		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
-		list_del_init(&p->list);
+		list_move(&p->list, &priv->cm.rx_error_list);
+		p->state = IPOIB_CM_RX_ERROR;
+		spin_unlock_irq(&priv->lock);
+		ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
+		if (ret)
+			ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
+		spin_lock_irq(&priv->lock);
+	}
+
+	/* Wait for all RX to be drained */
+	begin = jiffies;
+
+	while (!list_empty(&priv->cm.rx_error_list) ||
+	       !list_empty(&priv->cm.rx_flush_list) ||
+	       !list_empty(&priv->cm.rx_drain_list)) {
+		if (time_after(jiffies, begin + 5 * HZ)) {
+			ipoib_warn(priv, "RX drain timing out\n");
+
+			/*
+			 * assume the HW is wedged and just free up everything.
+			 */
+			list_splice_init(&priv->cm.rx_flush_list, &list);
+			list_splice_init(&priv->cm.rx_error_list, &list);
+			list_splice_init(&priv->cm.rx_drain_list, &list);
+			break;
+		}
 		spin_unlock_irq(&priv->lock);
+		msleep(1);
+		spin_lock_irq(&priv->lock);
+	}
+
+	list_splice_init(&priv->cm.rx_reap_list, &list);
+
+	spin_unlock_irq(&priv->lock);
+
+	list_for_each_entry_safe(p, n, &list, list) {
 		ib_destroy_cm_id(p->id);
 		ib_destroy_qp(p->qp);
 		kfree(p);
-		spin_lock_irq(&priv->lock);
 	}
-	spin_unlock_irq(&priv->lock);
 
+	ib_destroy_qp(priv->cm.rx_drain_qp);
 	cancel_delayed_work(&priv->cm.stale_task);
 }
 
@@ -1079,24 +1187,44 @@ void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb,
 		queue_work(ipoib_workqueue, &priv->cm.skb_task);
 }
 
+static void ipoib_cm_rx_reap(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
+						   cm.rx_reap_task);
+	struct ipoib_cm_rx *p, *n;
+	LIST_HEAD(list);
+
+	spin_lock_irq(&priv->lock);
+	list_splice_init(&priv->cm.rx_reap_list, &list);
+	spin_unlock_irq(&priv->lock);
+
+	list_for_each_entry_safe(p, n, &list, list) {
+		ib_destroy_cm_id(p->id);
+		ib_destroy_qp(p->qp);
+		kfree(p);
+	}
+}
+
 static void ipoib_cm_stale_task(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
 						   cm.stale_task.work);
 	struct ipoib_cm_rx *p;
+	int ret;
 
 	spin_lock_irq(&priv->lock);
 	while (!list_empty(&priv->cm.passive_ids)) {
-		/* List if sorted by LRU, start from tail,
+		/* List is sorted by LRU, start from tail,
 		 * stop when we see a recently used entry */
 		p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list);
 		if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT))
 			break;
-		list_del_init(&p->list);
+		list_move(&p->list, &priv->cm.rx_error_list);
+		p->state = IPOIB_CM_RX_ERROR;
 		spin_unlock_irq(&priv->lock);
-		ib_destroy_cm_id(p->id);
-		ib_destroy_qp(p->qp);
-		kfree(p);
+		ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
+		if (ret)
+			ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
 		spin_lock_irq(&priv->lock);
 	}
 
@@ -1164,9 +1292,14 @@ int ipoib_cm_dev_init(struct net_device *dev)
 	INIT_LIST_HEAD(&priv->cm.passive_ids);
 	INIT_LIST_HEAD(&priv->cm.reap_list);
 	INIT_LIST_HEAD(&priv->cm.start_list);
+	INIT_LIST_HEAD(&priv->cm.rx_error_list);
+	INIT_LIST_HEAD(&priv->cm.rx_flush_list);
+	INIT_LIST_HEAD(&priv->cm.rx_drain_list);
+	INIT_LIST_HEAD(&priv->cm.rx_reap_list);
 	INIT_WORK(&priv->cm.start_task, ipoib_cm_tx_start);
 	INIT_WORK(&priv->cm.reap_task, ipoib_cm_tx_reap);
 	INIT_WORK(&priv->cm.skb_task, ipoib_cm_skb_reap);
+	INIT_WORK(&priv->cm.rx_reap_task, ipoib_cm_rx_reap);
 	INIT_DELAYED_WORK(&priv->cm.stale_task, ipoib_cm_stale_task);
 
 	skb_queue_head_init(&priv->cm.skb_queue);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index e3b0937..c1aad06 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -470,7 +470,7 @@ int ipoib_ib_dev_open(struct net_device *dev)
 
 	ret = ipoib_cm_dev_open(dev);
 	if (ret) {
-		ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret);
+		ipoib_warn(priv, "ipoib_cm_dev_open returned %d\n", ret);
 		ipoib_ib_dev_stop(dev, 1);
 		return -1;
 	}
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 7912526..982eb88 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -173,7 +173,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 	size = ipoib_sendq_size + ipoib_recvq_size + 1;
 	ret = ipoib_cm_dev_init(dev);
 	if (!ret)
-		size += ipoib_recvq_size;
+		size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */;
 
 	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
 	if (IS_ERR(priv->cq)) {
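
For readers following the list handling, here is a minimal userspace sketch of
the RX teardown flow the patch sets up.  It is only a model: each list is
reduced to a counter, all names (rx_model, model_*) are hypothetical, and it
covers the stale-task/dev_stop path where a connection goes through
rx_error_list first.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: each field counts the connections sitting on the
 * correspondingly named list in struct ipoib_cm_dev_priv. */
struct rx_model {
	int live;		/* passive_ids:   state LIVE */
	int error;		/* rx_error_list: state ERROR */
	int flush;		/* rx_flush_list: FLUSH, drain not started */
	int drain;		/* rx_drain_list: FLUSH, drain started */
	int reap;		/* rx_reap_list:  FLUSH, drain done */
	bool wr_outstanding;	/* drain QP send queue depth is 1 */
};

/* Post the single drain WR if there is work and none is in flight. */
static void model_start_drain(struct rx_model *m)
{
	if (m->flush == 0 || m->drain != 0)
		return;
	m->wr_outstanding = true;
	m->drain += m->flush;	/* stands in for list_splice_init() */
	m->flush = 0;
}

/* Stale task or dev_stop: the QP is moved to the error state. */
static void model_to_error(struct rx_model *m)
{
	assert(m->live > 0);
	m->live--;
	m->error++;
}

/* Last WQE Reached event for one QP that is in the ERROR state. */
static void model_last_wqe(struct rx_model *m)
{
	assert(m->error > 0);
	m->error--;
	m->flush++;
	model_start_drain(m);
}

/* Completion carrying the drain WR id was seen on the CQ. */
static void model_drain_wc(struct rx_model *m)
{
	m->wr_outstanding = false;
	m->reap += m->drain;	/* everything drained so far can be reaped */
	m->drain = 0;
	model_start_drain(m);	/* pick up QPs that flushed meanwhile */
}
```

The point of the drain/flush split is that at most one drain WR is ever
outstanding: QPs whose Last WQE event arrives while a drain is in flight wait
on the flush list until the next drain WR is posted.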
-- 
MST


From mshefty at ichips.intel.com  Fri May 18 08:55:49 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 18 May 2007 08:55:49 -0700
Subject: [ofa-general] libibcm compatability problem
In-Reply-To: <2cfcf21e0705180113u43ad5977rd5fef68793c5aaea@mail.gmail.com>
References: <2cfcf21e0705170944h4ccddd52udef97b59d32452be@mail.gmail.com>	
	<464C8889.5090403@ichips.intel.com>
	<2cfcf21e0705180113u43ad5977rd5fef68793c5aaea@mail.gmail.com>
Message-ID: <464DCC85.6090308@ichips.intel.com>

> Also I see that the function ib_cm_get_device has been removed. I was 
> using this to monitor the file descriptor of the CM device. Could this 
> function be put back into my local copy of libibcm or has this function 
> been moved somewhere else in the code?

The fd is exposed directly by walking ib_cm_id->device->fd.
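
As a rough sketch of what monitoring that fd could look like, assuming you
only need to know when the CM device has something to read: the two structs
below are hypothetical stand-ins that mirror just the fields named above, and
real code would include the libibcm header and use its types instead.

```c
#include <poll.h>

/* Hypothetical stand-ins mirroring only the fields named above;
 * real code would use the libibcm definitions instead. */
struct cm_device_stub {
	int fd;
};

struct cm_id_stub {
	struct cm_device_stub *device;
};

/* Return 1 if the CM fd is readable, 0 on timeout, -1 on error. */
static int cm_fd_readable(const struct cm_id_stub *id, int timeout_ms)
{
	struct pollfd pfd = {
		.fd = id->device->fd,
		.events = POLLIN,
	};
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0)
		return -1;
	return (n > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

The same fd can be added to an existing select/poll/epoll set next to your
other descriptors, which is probably what you were doing before.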

Btw, to respond to your other email, converting from the ib_cm to the 
rdma_cm shouldn't be overly difficult.  You get fewer CM related events, 
QP transitions done for you, use of actual path records, and the ability 
to reference nodes using an IP address.  There are some example programs 
in the librdmacm/example directory if you want to take a quick look at 
what the code would look like.  The IB device is acquired dynamically 
though, so depending on how you allocate resources in your code, you may 
need some rework in this area.

- Sean


From rvm at obsidianresearch.com  Fri May 18 09:56:36 2007
From: rvm at obsidianresearch.com (Rolf Manderscheid)
Date: Fri, 18 May 2007 10:56:36 -0600
Subject: [ofa-general] Re: [PATCH] IB/mthca: initialise GRH:HopLimit when
	building MLX headers
In-Reply-To: <adad50zjnc4.fsf@cisco.com>
References: <E1HoiAm-00023A-Ia@ib1.edm.orcorp.ca> <adad50zjnc4.fsf@cisco.com>
Message-ID: <464DDAC4.305@obsidianresearch.com>

Roland Dreier wrote:
>  If you happen to have any ConnectX cards available, can you check this works too?
>   
We don't have any, but we are getting some.  I'll report back after they 
show up.

   Rolf


From rdreier at cisco.com  Fri May 18 14:51:13 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 18 May 2007 14:51:13 -0700
Subject: [ofa-general] Re: [PATCH 2/2] IB/mlx4: pass more data from user to
	kernel
In-Reply-To: <20070518034028.GB4708@mellanox.co.il> (Michael S. Tsirkin's
	message of "Fri, 18 May 2007 06:40:28 +0300")
References: <1179387217.25749.62.camel@mtls03> <adalkfnhxe8.fsf@cisco.com>
	<adad50zhw2l.fsf@cisco.com> <20070518034028.GB4708@mellanox.co.il>
Message-ID: <adaps4xhkby.fsf@cisco.com>

 > I'm OK with what you say about RQ, but replacing sq_wqebb_count with log_sq_bb_per_wqe
 > looks like obfuscation to me: you still pass in 2 values, and the kernel does
 > not actually care about number of SQ WRs at all, it really only needs to look at
 > # of wqbbs.

Makes sense... how about:

struct mlx4_ib_create_qp {
	__u64	buf_addr;
	__u64	db_addr;
	__u8	log_sq_stride;
	__u8	log_sq_bb_count;
	__u8	reserved[6];
};
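
A quick userspace check of that layout, as a sketch only: uint64_t/uint8_t
stand in for __u64/__u8, the _sketch suffix and the helper names are made up
here, and reading both log_* fields as base-2 logarithms is an assumption
taken from the field names.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the proposed layout (hypothetical name). */
struct mlx4_ib_create_qp_sketch {
	uint64_t buf_addr;
	uint64_t db_addr;
	uint8_t  log_sq_stride;
	uint8_t  log_sq_bb_count;
	uint8_t  reserved[6];
};

/* The six reserved bytes pad the struct to a multiple of 8 bytes, so
 * 32-bit and 64-bit userspace should see the same layout. */
_Static_assert(sizeof(struct mlx4_ib_create_qp_sketch) == 24,
	       "unexpected size");
_Static_assert(offsetof(struct mlx4_ib_create_qp_sketch, log_sq_stride) == 16,
	       "unexpected offset");

/* Assumed meaning: stride in bytes is 2^log_sq_stride (0 if out of range). */
static uint32_t sq_stride_bytes(const struct mlx4_ib_create_qp_sketch *cmd)
{
	return cmd->log_sq_stride < 32 ? 1u << cmd->log_sq_stride : 0;
}

/* Assumed meaning: number of WQEBBs is 2^log_sq_bb_count (0 if out of range). */
static uint32_t sq_bb_count(const struct mlx4_ib_create_qp_sketch *cmd)
{
	return cmd->log_sq_bb_count < 32 ? 1u << cmd->log_sq_bb_count : 0;
}
```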


From sean.hefty at intel.com  Fri May 18 15:14:07 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 18 May 2007 15:14:07 -0700
Subject: [ofa-general] [RFC] [PATCH 0/3] 2.6.23: basic support for IB routers
Message-ID: <000401c79999$dad2e9d0$b4d4180a@amr.corp.intel.com>

Re-sending - typo in mailing list name...

I'd like to get feedback about incorporating the following changes to support
IB routers into 2.6.23.  The goal of the patches is to allow for IB router
development and prototyping within the current framework of IBA.  The changes
themselves are fairly minimal, but based on the following concepts:

* Routing data is maintained by the local SA.  No assumption is made regarding
  how the SA obtains routing information.  The SA is only expected to respond
  to cross subnet PR queries by providing a path to the local router.  This
  matches the behavior in opensm.

* A ULP connecting to a remote subnet provides path information about both
  subnets.  For now the implementation simply assumes that the properties of
  the remote path match those of the local path.  This allows the active side
  CM to properly format the CM REQ.

* If the SLID/DLID values in the CM REQ are set to the permissive LID, then
  the passive side CM uses the SLID/DLID/SL values from the received CM REQ
  LRH to configure the passive side QP.  This is done to meet C9-54 without
  requiring communication with the remote SA, but I should note that this
  behavior is non-compliant.

These changes were tested by establishing a connection and transferring data
between two IB subnets connected by an Obsidian router.

These patches are also available in the ib_router branch of my rdma-dev.git
tree.  The tree is based on 2.6.21, so it includes a couple of additional patches
that were already pushed for 2.6.22.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>


From sean.hefty at intel.com  Fri May 18 15:15:36 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 18 May 2007 15:15:36 -0700
Subject: [ofa-general] [RFC] [PATCH 1/3] 2.6.23: IB/CM: add support for paths
	with hop_limit > 1
In-Reply-To: <000401c79999$dad2e9d0$b4d4180a@amr.corp.intel.com>
Message-ID: <000501c7999a$0fb2d520$b4d4180a@amr.corp.intel.com>

Paths with hop_limit > 1 indicate that the connection will be routed
between IB subnets.  To support routed connections, the ib_cm requires
two paths: one from the active side to the active side router, and
a second from the passive side to the passive side router.

Modify the ib_cm interface to support multiple paths, and format the
CM REQ message with the correct passive side information.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
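
Note for reviewers (not part of the commit message): the selection rule used
in cm_format_req() can be sketched in isolation as below.  The struct is a
hypothetical, trimmed-down stand-in for struct ib_sa_path_rec, and the helper
names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct ib_sa_path_rec. */
struct path_sketch {
	uint16_t slid;
	uint16_t dlid;
	uint8_t  hop_limit;
};

/* Pick the record whose fields go into the REQ: paths[0] for a
 * subnet-local connection, paths[1] (passive side) when routed.
 * For routed connections the caller must pass a two-entry array. */
static const struct path_sketch *req_path(const struct path_sketch *paths)
{
	return paths[0].hop_limit <= 1 ? &paths[0] : &paths[1];
}

/* Value for the REQ's subnet-local bit under the same rule. */
static int req_subnet_local(const struct path_sketch *paths)
{
	return req_path(paths)->hop_limit <= 1;
}
```

The caller contract is the important part: a single record is only valid when
hop_limit <= 1, otherwise index 1 is dereferenced.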

 drivers/infiniband/core/cm.c |   50 ++++++++++++++++++++++++------------------
 include/rdma/ib_cm.h         |    5 ++++
 2 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 842cd0b..1e2010e 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -877,6 +877,8 @@ static void cm_format_req(struct cm_req_msg *req_msg,
 			  struct cm_id_private *cm_id_priv,
 			  struct ib_cm_req_param *param)
 {
+	struct ib_sa_path_rec *pri_path, *alt_path;
+
 	cm_format_mad_hdr(&req_msg->hdr, CM_REQ_ATTR_ID,
 			  cm_form_tid(cm_id_priv, CM_MSG_SEQUENCE_REQ));
 
@@ -900,33 +902,37 @@ static void cm_format_req(struct cm_req_msg *req_msg,
 	cm_req_set_max_cm_retries(req_msg, param->max_cm_retries);
 	cm_req_set_srq(req_msg, param->srq);
 
-	req_msg->primary_local_lid = param->primary_path->slid;
-	req_msg->primary_remote_lid = param->primary_path->dlid;
-	req_msg->primary_local_gid = param->primary_path->sgid;
-	req_msg->primary_remote_gid = param->primary_path->dgid;
-	cm_req_set_primary_flow_label(req_msg, param->primary_path->flow_label);
-	cm_req_set_primary_packet_rate(req_msg, param->primary_path->rate);
-	req_msg->primary_traffic_class = param->primary_path->traffic_class;
-	req_msg->primary_hop_limit = param->primary_path->hop_limit;
-	cm_req_set_primary_sl(req_msg, param->primary_path->sl);
-	cm_req_set_primary_subnet_local(req_msg, 1); /* local only... */
+	pri_path = (param->primary_path->hop_limit <= 1) ?
+		    param->primary_path : &param->primary_path[1];
+	req_msg->primary_local_lid = pri_path->slid;
+	req_msg->primary_remote_lid = pri_path->dlid;
+	req_msg->primary_local_gid = pri_path->sgid;
+	req_msg->primary_remote_gid = pri_path->dgid;
+	cm_req_set_primary_flow_label(req_msg, pri_path->flow_label);
+	cm_req_set_primary_packet_rate(req_msg, pri_path->rate);
+	req_msg->primary_traffic_class = pri_path->traffic_class;
+	req_msg->primary_hop_limit = pri_path->hop_limit;
+	cm_req_set_primary_sl(req_msg, pri_path->sl);
+	cm_req_set_primary_subnet_local(req_msg, (pri_path->hop_limit <= 1));
 	cm_req_set_primary_local_ack_timeout(req_msg,
-		min(31, param->primary_path->packet_life_time + 1));
+		min(31, pri_path->packet_life_time + 1));
 
 	if (param->alternate_path) {
-		req_msg->alt_local_lid = param->alternate_path->slid;
-		req_msg->alt_remote_lid = param->alternate_path->dlid;
-		req_msg->alt_local_gid = param->alternate_path->sgid;
-		req_msg->alt_remote_gid = param->alternate_path->dgid;
+		alt_path = (param->alternate_path->hop_limit <= 1) ?
+			    param->alternate_path : &param->alternate_path[1];
+		req_msg->alt_local_lid = alt_path->slid;
+		req_msg->alt_remote_lid = alt_path->dlid;
+		req_msg->alt_local_gid = alt_path->sgid;
+		req_msg->alt_remote_gid = alt_path->dgid;
 		cm_req_set_alt_flow_label(req_msg,
-					  param->alternate_path->flow_label);
-		cm_req_set_alt_packet_rate(req_msg, param->alternate_path->rate);
-		req_msg->alt_traffic_class = param->alternate_path->traffic_class;
-		req_msg->alt_hop_limit = param->alternate_path->hop_limit;
-		cm_req_set_alt_sl(req_msg, param->alternate_path->sl);
-		cm_req_set_alt_subnet_local(req_msg, 1); /* local only... */
+					  alt_path->flow_label);
+		cm_req_set_alt_packet_rate(req_msg, alt_path->rate);
+		req_msg->alt_traffic_class = alt_path->traffic_class;
+		req_msg->alt_hop_limit = alt_path->hop_limit;
+		cm_req_set_alt_sl(req_msg, alt_path->sl);
+		cm_req_set_alt_subnet_local(req_msg, (alt_path->hop_limit <= 1));
 		cm_req_set_alt_local_ack_timeout(req_msg,
-			min(31, param->alternate_path->packet_life_time + 1));
+			min(31, alt_path->packet_life_time + 1));
 	}
 
 	if (param->private_data && param->private_data_len)
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index 5c07017..f715ba5 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -347,6 +347,11 @@ struct ib_cm_compare_data {
 int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
 		 struct ib_cm_compare_data *compare_data);
 
+/*
+ * If hop_limit > 1 or reversible = 0, then primary/alternate path fields
+ * point to an array of paths.  The first path is relative to the active
+ * side, and the second path is relative to the passive side.
+ */
 struct ib_cm_req_param {
 	struct ib_sa_path_rec	*primary_path;
 	struct ib_sa_path_rec	*alternate_path;


From sean.hefty at intel.com  Fri May 18 15:17:18 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 18 May 2007 15:17:18 -0700
Subject: [ofa-general] [RFC] [PATCH 2/3] 2.6.23: IB/cm: Modify passive side
	to use LIDs from LRH for routed connections
In-Reply-To: <000501c7999a$0fb2d520$b4d4180a@amr.corp.intel.com>
Message-ID: <000601c7999a$4cb74140$b4d4180a@amr.corp.intel.com>

To support inter-subnet connections, the passive endpoint needs to use
its subnet local LIDs.  The LIDs carried in the REQ are currently the
LIDs from the active subnet (SLID and router LID).  Replace LIDs in the
REQ with subnet local LIDs from LRH.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
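
Note for reviewers (not part of the commit message): a host-byte-order sketch
of the override for one path.  All types and names here are hypothetical
stand-ins; the real code works on the big-endian REQ fields and takes the
remote LID from wc->dlid_path_bits, which this sketch simplifies to a plain
dlid field.

```c
#include <stdint.h>

#define LID_PERMISSIVE 0xFFFF	/* permissive LID, host byte order here */

/* Hypothetical view of the per-path REQ fields that matter here. */
struct req_lids_sketch {
	int      subnet_local;
	uint16_t local_lid;
	uint16_t remote_lid;
	uint8_t  sl;
};

/* Hypothetical view of the LRH data reported in the work completion. */
struct wc_sketch {
	uint16_t slid;
	uint16_t dlid;
	uint8_t  sl;
};

/* Replace permissive LIDs in a routed REQ with the LRH values. */
static void override_from_lrh(struct req_lids_sketch *req,
			      const struct wc_sketch *wc)
{
	if (req->subnet_local)
		return;		/* subnet-local REQs are used as-is */

	if (req->local_lid == LID_PERMISSIVE) {
		req->local_lid = wc->slid;
		req->sl = wc->sl;
	}

	if (req->remote_lid == LID_PERMISSIVE)
		req->remote_lid = wc->dlid;
}
```

A REQ that is routed but carries real LIDs is left untouched, so an active
side that does know the remote subnet's LIDs can still supply them.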

 drivers/infiniband/core/cm.c |   29 +++++++++++++++++++++++++++++
 1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 1e2010e..4d30e49 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1345,6 +1345,34 @@ out:
 	return listen_cm_id_priv;
 }
 
+/*
+ * Work-around for inter-subnet connections.  If the LIDs are permissive,
+ * we need to override the LID/SL data in the REQ with the LID information
+ * in the work completion.
+ */
+static void cm_process_routed_req(struct cm_req_msg *req_msg, struct ib_wc *wc)
+{
+	if (!cm_req_get_primary_subnet_local(req_msg)) {
+		if (req_msg->primary_local_lid == IB_LID_PERMISSIVE) {
+			req_msg->primary_local_lid = cpu_to_be16(wc->slid);
+			cm_req_set_primary_sl(req_msg, wc->sl);
+		}
+
+		if (req_msg->primary_remote_lid == IB_LID_PERMISSIVE)
+			req_msg->primary_remote_lid = cpu_to_be16(wc->dlid_path_bits);
+	}
+
+	if (!cm_req_get_alt_subnet_local(req_msg)) {
+		if (req_msg->alt_local_lid == IB_LID_PERMISSIVE) {
+			req_msg->alt_local_lid = cpu_to_be16(wc->slid);
+			cm_req_set_alt_sl(req_msg, wc->sl);
+		}
+
+		if (req_msg->alt_remote_lid == IB_LID_PERMISSIVE)
+			req_msg->alt_remote_lid = cpu_to_be16(wc->dlid_path_bits);
+	}
+}
+
 static int cm_req_handler(struct cm_work *work)
 {
 	struct ib_cm_id *cm_id;
@@ -1385,6 +1413,7 @@ static int cm_req_handler(struct cm_work *work)
 	cm_id_priv->id.service_id = req_msg->service_id;
 	cm_id_priv->id.service_mask = __constant_cpu_to_be64(~0ULL);
 
+	cm_process_routed_req(req_msg, work->mad_recv_wc->wc);
 	cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]);
 	ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av);
 	if (ret) {


From sean.hefty at intel.com  Fri May 18 15:18:37 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 18 May 2007 15:18:37 -0700
Subject: [ofa-general] [RFC] [PATCH 3/3] 2.6.23: RDMA/cma: Add support for
	routed paths
In-Reply-To: <000601c7999a$4cb74140$b4d4180a@amr.corp.intel.com>
Message-ID: <000701c7999a$7bf62250$b4d4180a@amr.corp.intel.com>

In order to support IB-to-IB routers, we need to provide path 
information about the remote subnet to the ib_cm.  For now, we
simply copy our local path information, but use permissive LIDs
in place of the actual, remote LIDs.  This indicates to the
remote ib_cm that it should use the LIDs/SL data from the LRH
received with CM REQ in place of the actual data carried in the
REQ message when configuring the remote QP.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
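
Note for reviewers (not part of the commit message): the two-entry array that
cma_connect_ib() builds for a routed path, sketched in userspace.  The struct
and helper are hypothetical, and malloc()/free() stand in for kmalloc()/kfree().

```c
#include <stdint.h>
#include <stdlib.h>

#define LID_PERMISSIVE 0xFFFF	/* permissive LID, host byte order here */

/* Hypothetical, trimmed-down stand-in for struct ib_sa_path_rec. */
struct routed_path_sketch {
	uint16_t slid;
	uint16_t dlid;
	uint8_t  hop_limit;
};

/* Return a two-entry array for a routed path, or NULL if the path is
 * subnet local or the allocation fails.  The caller frees the result. */
static struct routed_path_sketch *
make_routed_paths(const struct routed_path_sketch *local)
{
	struct routed_path_sketch *paths;

	if (local->hop_limit <= 1)
		return NULL;

	paths = malloc(2 * sizeof(*paths));
	if (!paths)
		return NULL;

	paths[0] = *local;			/* active side: as resolved */
	paths[1] = *local;			/* passive side: assumed equal... */
	paths[1].slid = LID_PERMISSIVE;		/* ...except for the LIDs, which */
	paths[1].dlid = LID_PERMISSIVE;		/* the remote CM takes from the LRH */
	return paths;
}
```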

 drivers/infiniband/core/cma.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index fde92ce..430f104 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2170,7 +2170,19 @@ static int cma_connect_ib(struct rdma_id_private *id_priv,
 		goto out;
 	req.private_data = private_data;
 
-	req.primary_path = &route->path_rec[0];
+	if (route->path_rec[0].hop_limit > 1) {
+		req.primary_path = kmalloc(sizeof *req.primary_path * 2,
+					   GFP_ATOMIC);
+		if (!req.primary_path) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		req.primary_path[0] = route->path_rec[0];
+		req.primary_path[1] = route->path_rec[0];
+		req.primary_path[1].slid = IB_LID_PERMISSIVE;
+		req.primary_path[1].dlid = IB_LID_PERMISSIVE;
+	} else
+		req.primary_path = &route->path_rec[0];
 	if (route->num_paths == 2)
 		req.alternate_path = &route->path_rec[1];
 
@@ -2190,6 +2202,9 @@ static int cma_connect_ib(struct rdma_id_private *id_priv,
 	req.srq = id_priv->srq ? 1 : 0;
 
 	ret = ib_send_cm_req(id_priv->cm_id.ib, &req);
+
+	if (route->path_rec[0].hop_limit > 1)
+		kfree(req.primary_path);
 out:
 	if (ret && !IS_ERR(id_priv->cm_id.ib)) {
 		ib_destroy_cm_id(id_priv->cm_id.ib);


From vlad at lists.openfabrics.org  Sat May 19 02:39:56 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat, 19 May 2007 02:39:56 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070519-0200 daily build status
Message-ID: <20070519093957.5001CE60826@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From yosefe at voltaire.com  Sun May 20 01:07:00 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Sun, 20 May 2007 11:07:00 +0300
Subject: [ofa-general] ib_find_gid / ib_find_pkey
In-Reply-To: <20070517174519.GC22028@mellanox.co.il>
References: <20070514045832.GA18615@mellanox.co.il>
	<ada8xbnl65p.fsf@cisco.com> <ada4pmbl5hc.fsf@cisco.com>
	<20070517174519.GC22028@mellanox.co.il>
Message-ID: <465001A4.2020502@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Roland Dreier <rdreier at cisco.com>:
>>Subject: Re: [ofa-general] ib_find_gid / ib_find_pkey
>>
>>Also applied P_Key reordering patch too...
> 
> 
> OK. I think the next step is to get rid of ipoib_pkey_dev_check_presence and
> ipoib_pkey_poll in ipoib.  This way we'll have one ULP clean of cache usage.
> Yosef?
> 
I'll rebase the patch on the last version of pkey patch and repost.

> Another thing to do on this front is to make the pkey change event
> less intrusive: we should not need to kill connections and AHs
> because of pkey change: just cycling the QP through reset should be
> enough.
> 

like modify->reset and call ipoib_init_qp?


From mst at dev.mellanox.co.il  Sun May 20 01:11:08 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 20 May 2007 11:11:08 +0300
Subject: [ofa-general] Re: ib_find_gid / ib_find_pkey
In-Reply-To: <465001A4.2020502@voltaire.com>
References: <20070514045832.GA18615@mellanox.co.il>
	<ada8xbnl65p.fsf@cisco.com> <ada4pmbl5hc.fsf@cisco.com>
	<20070517174519.GC22028@mellanox.co.il>
	<465001A4.2020502@voltaire.com>
Message-ID: <20070520081108.GB16863@mellanox.co.il>

> Quoting Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: ib_find_gid / ib_find_pkey
> 
> Michael S. Tsirkin wrote:
> >>Quoting Roland Dreier <rdreier at cisco.com>:
> >>Subject: Re: [ofa-general] ib_find_gid / ib_find_pkey
> >>
> >>Also applied P_Key reordering patch too...
> > 
> > 
> > OK. I think the next step is to get rid of ipoib_pkey_dev_check_presence and
> > ipoib_pkey_poll in ipoib.  This way we'll have one ULP clean of cache usage.
> > Yosef?
> > 
> I'll rebase the patch on the last version of pkey patch and repost.
> 
> > Another thing to do on this front is to make the pkey change event
> > less intrusive: we should not need to kill connections and AHs
> > because of pkey change: just cycling the QP through reset should be
> > enough.
> > 
> 
> like modify->reset and call ipoib_init_qp?

Careful: you must ->ERR and flush posted WRs.

-- 
MST


From vlad at lists.openfabrics.org  Sun May 20 02:41:53 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun, 20 May 2007 02:41:53 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070520-0200 daily build status
Message-ID: <20070520094154.0C836E6082A@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.17
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From kliteyn at dev.mellanox.co.il  Sun May 20 03:17:07 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 20 May 2007 13:17:07 +0300
Subject: [ofa-general] [PATCH] osm: up/dn optimization - improved	ranking
In-Reply-To: <20070516194919.GO19271@sashak.voltaire.com>
References: <464B1D41.8080905@dev.mellanox.co.il>
	<20070516194919.GO19271@sashak.voltaire.com>
Message-ID: <46502023.20703@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 18:03 Wed 16 May     , Yevgeny Kliteynik wrote:
>> Hi Hal,
>>
>> This patch optimizes fabric ranking similar to the fat-tree ranking.
>> All the root switches are marked with rank and added to the BFS list,
>> and only then ranking of rest of the fabric begins.
>>
>> Please apply to master. 
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
> 
> Basically looks good. However couple comments below.
> 
>> opensm/opensm/osm_ucast_updn.c |   66 
>> +++++++++++++++++----------------------
>> 1 files changed, 29 insertions(+), 37 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c
>> index 5cebd9b..9574216 100644
>> --- a/opensm/opensm/osm_ucast_updn.c
>> +++ b/opensm/opensm/osm_ucast_updn.c
>> @@ -408,53 +408,49 @@ Exit :
>> /*        rank is a SWITCH for BFS purpose */
>> static int
>> updn_subn_rank(
>> -  IN uint64_t root_guid,
>> -  IN uint8_t base_rank,
>> +  IN uint32_t num_guids,
> 
> 'num_guids' should not be fixed-size integer just compiler friendly
> 'unsigned' is fine.

NP.

>> +  IN uint64_t* guid_list,
>>   IN updn_t* p_updn )
>> {
>>   osm_switch_t *p_sw;
>> -  uint32_t rank = base_rank;
>>   osm_physp_t *p_physp, *p_remote_physp;
>>   cl_qlist_t list;
>>   cl_status_t did_cause_update;
>>   struct updn_node *u, *remote_u;
>>   uint8_t num_ports, port_num;
>>   osm_log_t *p_log = &p_updn->p_osm->log;
>> +  uint32_t idx = 0;
> 
> Ditto.

NP
 
>>   OSM_LOG_ENTER( p_log, updn_subn_rank );
>> +  cl_qlist_init(&list);
>>
>> -  p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, 
>> cl_hton64(root_guid));
>> -  if(!p_sw)
>> -  {
>> -    osm_log( p_log, OSM_LOG_ERROR,
>> -             "updn_subn_rank: ERR AA05: "
>> -             "Root switch GUID 0x%" PRIx64 " not found\n", root_guid );
>> -    OSM_LOG_EXIT( p_log );
>> -    return 1;
>> -  }
>> -
>> -  osm_log( p_log, OSM_LOG_VERBOSE,
>> -           "updn_subn_rank: "
>> -           "Ranking starts from GUID 0x%" PRIx64 "\n", root_guid );
>> -
>> -  u = p_sw->priv;
>> -  u->is_root = 1;
>> +  /* Rank all the roots and add them to list */
>>
>> -  /* Rank the first guid chosen anyway since it's the base rank */
>> -  osm_log( p_log, OSM_LOG_DEBUG,
>> -           "updn_subn_rank: "
>> -           "Ranking port GUID 0x%" PRIx64 "\n", root_guid );
>> +  for (idx = 0; idx < num_guids; idx++)
>> +  {
>> +    /* Apply the ranking for each guid given by user - bypass illegal ones 
>> */
>> +    p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, 
>> cl_hton64(guid_list[idx]));
>> +    if(!p_sw)
>> +    {
>> +      osm_log( p_log, OSM_LOG_ERROR,
>> +               "updn_subn_rank: ERR AA05: "
>> +               "Root switch GUID 0x%" PRIx64 " not found\n", 
>> guid_list[idx] );
>> +      continue;
>> +    }
>>
>> -  __updn_update_rank(u, rank);
>> +    u = p_sw->priv;
>> +    u->is_root = 1;
> 
> Now when root switches are always ranked first 'is_root' field is not
> needed anymore, (!u->rank) answers this.

Agree

>> -  cl_qlist_init(&list);
>> -  cl_qlist_insert_tail(&list, &u->list);
>> +    osm_log( p_log, OSM_LOG_DEBUG,
>> +             "updn_subn_rank: "
>> +             "Ranking root port GUID 0x%" PRIx64 "\n", guid_list[idx] );
>> +    __updn_update_rank(u, 0);
>> +    cl_qlist_insert_tail(&list, &u->list);
>> +  }
>>
>>   /* BFS the list till it's empty */
>>   while (!cl_is_qlist_empty(&list))
>>   {
>> -    rank++;
>> -
>>     u = (struct updn_node *)cl_qlist_remove_head(&list);
>>     /* Go over all remote nodes and rank them (if not already visited) */
>>     p_sw = u->sw;
>> @@ -483,7 +479,7 @@ updn_subn_rank(
>>       {
>>         remote_u = p_remote_physp->p_node->sw->priv;
>>         port_guid = p_remote_physp->port_guid;
>> -        did_cause_update = __updn_update_rank(remote_u, rank);
>> +        did_cause_update = __updn_update_rank(remote_u, u->rank+1);
>>
>>         osm_log( p_log, OSM_LOG_DEBUG,
>>                  "updn_subn_rank: "
>> @@ -500,8 +496,8 @@ updn_subn_rank(
>>   /* Print Summary of ranking */
>>   osm_log( p_log, OSM_LOG_VERBOSE,
>>            "updn_subn_rank: "
>> -           "Rank Info :\n\t Root Guid = 0x%" PRIx64 "\n\t Max Node Rank = 
>> %d\n",
>> -           root_guid, rank );
>> +           "Subnet ranking completed. Max Node Rank = %d\n",
>> +           remote_u->rank );
> 
> 'remote_u' can be not initialized here. Another issue is that it can be
> initialized but to remote switch which has lower than max rank (when
> did_cause_update = 0).

Right, good catch.
I'll issue a new patch.
Thanks.

-- Yevgeny

> The rest is fine.
> 
> Sasha
> 


From kliteyn at dev.mellanox.co.il  Sun May 20 04:26:28 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 20 May 2007 14:26:28 +0300
Subject: [ofa-general] [PATCHv2] osm: up/dn optimization - improved ranking
Message-ID: <46503064.7010107@dev.mellanox.co.il>

Hi Hal,

This patch optimizes fabric ranking similar to the fat-tree ranking.
All the root switches are marked with rank 0 and added to the BFS list,
and only then does ranking of the rest of the fabric begin.
This version of the patch is updated in accordance with Sasha's suggestions.

Please apply to master.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_ucast_updn.c |   80 +++++++++++++++++----------------------
 1 files changed, 35 insertions(+), 45 deletions(-)

diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c
index 5cebd9b..95a0622 100644
--- a/opensm/opensm/osm_ucast_updn.c
+++ b/opensm/opensm/osm_ucast_updn.c
@@ -93,7 +93,6 @@ struct updn_node {
   osm_switch_t *sw;
   updn_switch_dir_t dir;
   unsigned rank;
-  unsigned is_root;
   unsigned visited;
 };
 
@@ -111,15 +110,13 @@ __updn_get_dir(
   IN unsigned cur_rank,
   IN unsigned rem_rank,
   IN uint64_t cur_guid,
-  IN uint64_t rem_guid,
-  IN unsigned cur_is_root,
-  IN unsigned rem_is_root )
+  IN uint64_t rem_guid )
 {
   /* HACK: comes to solve root nodes connection, in a classic subnet root nodes do not connect
      directly, but in case they are we assign to root node an UP direction to allow UPDN to discover
      the subnet correctly (and not from the point of view of the last root node).
   */
-  if (cur_is_root && rem_is_root)
+  if (!cur_rank && !rem_rank)
     return UP;
 
   if (cur_rank < rem_rank)
@@ -215,8 +212,7 @@ __updn_bfs_by_node(
       rem_u = p_remote_sw->priv;
       /* Decide which direction to mark it (UP/DOWN) */
       next_dir = __updn_get_dir(u->rank, rem_u->rank,
-                                current_guid, remote_guid,
-                                u->is_root, rem_u->is_root);
+                                current_guid, remote_guid);
 
       /* Check if this is a legal step : the only illegal step is going
          from DOWN to UP */
@@ -408,53 +404,48 @@ Exit :
 /*        rank is a SWITCH for BFS purpose */
 static int
 updn_subn_rank(
-  IN uint64_t root_guid,
-  IN uint8_t base_rank,
+  IN unsigned num_guids,
+  IN uint64_t* guid_list,
   IN updn_t* p_updn )
 {
   osm_switch_t *p_sw;
-  uint32_t rank = base_rank;
   osm_physp_t *p_physp, *p_remote_physp;
   cl_qlist_t list;
   cl_status_t did_cause_update;
   struct updn_node *u, *remote_u;
   uint8_t num_ports, port_num;
   osm_log_t *p_log = &p_updn->p_osm->log;
+  unsigned idx = 0;
+  unsigned max_rank = 0;
 
   OSM_LOG_ENTER( p_log, updn_subn_rank );
+  cl_qlist_init(&list);
 
-  p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, cl_hton64(root_guid));
-  if(!p_sw)
-  {
-    osm_log( p_log, OSM_LOG_ERROR,
-             "updn_subn_rank: ERR AA05: "
-             "Root switch GUID 0x%" PRIx64 " not found\n", root_guid );
-    OSM_LOG_EXIT( p_log );
-    return 1;
-  }
-
-  osm_log( p_log, OSM_LOG_VERBOSE,
-           "updn_subn_rank: "
-           "Ranking starts from GUID 0x%" PRIx64 "\n", root_guid );
-
-  u = p_sw->priv;
-  u->is_root = 1;
+  /* Rank all the roots and add them to list */
 
-  /* Rank the first guid chosen anyway since it's the base rank */
-  osm_log( p_log, OSM_LOG_DEBUG,
-           "updn_subn_rank: "
-           "Ranking port GUID 0x%" PRIx64 "\n", root_guid );
-
-  __updn_update_rank(u, rank);
+  for (idx = 0; idx < num_guids; idx++)
+  {
+    /* Apply the ranking for each guid given by user - bypass illegal ones */
+    p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, cl_hton64(guid_list[idx]));
+    if(!p_sw)
+    {
+      osm_log( p_log, OSM_LOG_ERROR,
+               "updn_subn_rank: ERR AA05: "
+               "Root switch GUID 0x%" PRIx64 " not found\n", guid_list[idx] );
+      continue;
+    }
 
-  cl_qlist_init(&list);
-  cl_qlist_insert_tail(&list, &u->list);
+    u = p_sw->priv;
+    osm_log( p_log, OSM_LOG_DEBUG,
+             "updn_subn_rank: "
+             "Ranking root port GUID 0x%" PRIx64 "\n", guid_list[idx] );
+    __updn_update_rank(u, 0);
+    cl_qlist_insert_tail(&list, &u->list);
+  }
 
   /* BFS the list till it's empty */
   while (!cl_is_qlist_empty(&list))
   {
-    rank++;
-
     u = (struct updn_node *)cl_qlist_remove_head(&list);
     /* Go over all remote nodes and rank them (if not already visited) */
     p_sw = u->sw;
@@ -483,7 +474,7 @@ updn_subn_rank(
       {
         remote_u = p_remote_physp->p_node->sw->priv;
         port_guid = p_remote_physp->port_guid;
-        did_cause_update = __updn_update_rank(remote_u, rank);
+        did_cause_update = __updn_update_rank(remote_u, u->rank+1);
 
         osm_log( p_log, OSM_LOG_DEBUG,
                  "updn_subn_rank: "
@@ -492,7 +483,10 @@ updn_subn_rank(
                  remote_u->rank );
 
         if (did_cause_update)
+        {
           cl_qlist_insert_tail(&list, &remote_u->list);
+          max_rank = remote_u->rank;
+        }
       }
     }
   }
@@ -500,8 +494,8 @@ updn_subn_rank(
   /* Print Summary of ranking */
   osm_log( p_log, OSM_LOG_VERBOSE,
            "updn_subn_rank: "
-           "Rank Info :\n\t Root Guid = 0x%" PRIx64 "\n\t Max Node Rank = %d\n",
-           root_guid, rank );
+           "Subnet ranking completed. Max Node Rank = %d\n",
+           max_rank );
   OSM_LOG_EXIT( p_log );
   return 0;
 }
@@ -566,7 +560,6 @@ __osm_subn_calc_up_down_min_hop_table(
   IN uint64_t* guid_list,
   IN updn_t* p_updn )
 {
-  uint32_t idx = 0;
   int status;
 
   OSM_LOG_ENTER( &p_updn->p_osm->log, osm_subn_calc_up_down_min_hop_table );
@@ -593,11 +586,8 @@ __osm_subn_calc_up_down_min_hop_table(
     goto _exit;
   }
 
-  for (idx = 0; idx < num_guids; idx++)
-  {
-    /* Apply the ranking for each guid given by user - bypass illegal ones */
-    updn_subn_rank(guid_list[idx], 0, p_updn);
-  }
+  /* Rank the subnet switches */
+  updn_subn_rank(num_guids, guid_list, p_updn);
 
   /* After multiple ranking need to set Min Hop Table by UpDn algorithm  */
   osm_log( &p_updn->p_osm->log, OSM_LOG_VERBOSE,
-- 
1.5.1.4


From mst at dev.mellanox.co.il  Sun May 20 06:44:41 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 20 May 2007 16:44:41 +0300
Subject: [ofa-general] IB/cm: bug in stale connection detection logic?
Message-ID: <20070520134441.GI20649@mellanox.co.il>


Sean, Roland, please comment on the following 2 questions:

1. I see this in cm_match_req:

        timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
        if (!timewait_info)
                timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);

        if (timewait_info) {
                cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
                                           timewait_info->work.remote_id);
                cm_cleanup_timewait(cm_id_priv->timewait_info);
                spin_unlock_irqrestore(&cm.lock, flags);
                if (cur_cm_id_priv) {
                        cm_dup_req_handler(work, cur_cm_id_priv);
                        cm_deref_id(cur_cm_id_priv);
                } else
                        cm_issue_rej(work->port, work->mad_recv_wc,
                                     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
                                     NULL, 0);

Note that cm_get_id is passed data from timewait_info and not from the request.
Thus, if the QPN in the request matches the QPN of an existing connection, this
is mis-detected as a duplicate request even if the IDs do not match, and the
request is dropped or a "duplicate" reject is sent instead of a "stale
connection" reject.

Am I missing something?

Suggestion:
Why is an extra call to cm_get_id required to detect a duplicate?
Shouldn't we just do this instead:

        timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
        if (timewait_info) {
                /* handle duplicate */
                return;
        }

        timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
        if (timewait_info) {
                /* handle stale */
                return;
        }

        /* not a duplicate and not a stale connection */
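
The two-step check suggested above can be tried out on a toy table. This is only a sketch under simplifying assumptions: `toy_conn`, `toy_classify_req()` and the verdict names are hypothetical, connections are keyed on remote ID and remote QPN alone, and unlike the real insert functions nothing is added to the table.

```c
#include <stddef.h>
#include <stdint.h>

enum toy_req_verdict { TOY_REQ_NEW, TOY_REQ_DUPLICATE, TOY_REQ_STALE };

/* Hypothetical record of one known connection. */
struct toy_conn {
	uint32_t remote_id;   /* remote communication ID */
	uint32_t remote_qpn;  /* remote QP number */
};

static enum toy_req_verdict
toy_classify_req(const struct toy_conn *tbl, size_t n,
		 uint32_t remote_id, uint32_t remote_qpn)
{
	size_t i;

	/* Step 1: a known remote ID means this REQ was seen before. */
	for (i = 0; i < n; i++)
		if (tbl[i].remote_id == remote_id)
			return TOY_REQ_DUPLICATE;

	/* Step 2: unknown ID but a QPN still in use means a stale connection. */
	for (i = 0; i < n; i++)
		if (tbl[i].remote_qpn == remote_qpn)
			return TOY_REQ_STALE;

	return TOY_REQ_NEW;
}
```

With a table holding (id 7, QPN 0x100), a request carrying id 8 and QPN 0x100 comes out as stale rather than duplicate, which is the case the current lookup by timewait_info data gets wrong.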

2. Another question:

cm_dup_req_handler does this:

        /* Quick state check to discard duplicate REQs. */
        if (cm_id_priv->id.state == IB_CM_REQ_RCVD)
                return;

Why is this code here? IB_CM_REQ_RCVD is an ephemeral state,
going to IB_CM_REP_SENT immediately.
Why are duplicate REQs discarded? Shouldn't the REP be re-sent?
See 12.9.6 COMMUNICATION ESTABLISHMENT - PASSIVE

The spec also says:
	The general rule for handling any input message that is received while in 
	an ephemeral state is to hold that message pending until the CM protocol 
	enters a non-ephemeral state.

So this code looks wrong.
What am I missing?

-- 
MST


From sashak at voltaire.com  Sun May 20 09:10:34 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 20 May 2007 19:10:34 +0300
Subject: [ofa-general] Re: [PATCHv2] osm: up/dn optimization - improved
	ranking
In-Reply-To: <46503064.7010107@dev.mellanox.co.il>
References: <46503064.7010107@dev.mellanox.co.il>
Message-ID: <20070520161034.GY19271@sashak.voltaire.com>

Hi Yevgeny,

On 14:26 Sun 20 May     , Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> This patch optimizes fabric ranking similar to the fat-tree ranking.
> All the root switches are marked with rank and added to the BFS list,
> and only then ranking of rest of the fabric begins.
> This version of the patch is updated in accordance with Sasha's suggestions.
> 
> Please apply to master.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---

Looks fine to me. Nice optimization. Thanks.

I guess there is still an issue with the max_rank calculation (details are
below), which affects only the log message, and for me it is OK to fix it in
an incremental patch.

> opensm/opensm/osm_ucast_updn.c |   80 
> +++++++++++++++++----------------------
> 1 files changed, 35 insertions(+), 45 deletions(-)
> 
> diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c
> index 5cebd9b..95a0622 100644
> --- a/opensm/opensm/osm_ucast_updn.c
> +++ b/opensm/opensm/osm_ucast_updn.c

[snip...]

> @@ -483,7 +474,7 @@ updn_subn_rank(
>       {
>         remote_u = p_remote_physp->p_node->sw->priv;
>         port_guid = p_remote_physp->port_guid;
> -        did_cause_update = __updn_update_rank(remote_u, rank);
> +        did_cause_update = __updn_update_rank(remote_u, u->rank+1);
> 
>         osm_log( p_log, OSM_LOG_DEBUG,
>                  "updn_subn_rank: "
> @@ -492,7 +483,10 @@ updn_subn_rank(
>                  remote_u->rank );
> 
>         if (did_cause_update)
> +        {
>           cl_qlist_insert_tail(&list, &remote_u->list);
> +          max_rank = remote_u->rank;
> +        }

I think this is still not accurate. For instance, with a topology like
A <-> B <-> C <-> D <-> E, where the roots are A and E, we will get
max_rank = 1, which obviously should be 2.

Probably we need something like this instead:

	if (did_cause_update)
		cl_qlist_insert_tail(&list, &remote_u->list);
	if (remote_u->rank <= u->rank + 1)
		max_rank = remote_u->rank;

(and after such a change to the rank updating technique we may also want to
remove the __updn_update_rank() function)

And again, this nit affects only the value reported in the log message (and
just removing this log message can be an option too :)) and doesn't touch
the optimization itself - good stuff, Yevgeny!
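
For reference, a standalone sketch of ranking from several roots at once while tracking the maximum rank explicitly. It is not OpenSM code: `toy_rank_fabric()` and `TOY_MAX_NODES` are hypothetical, the fabric is a plain adjacency matrix with at most `TOY_MAX_NODES` nodes, and the maximum is kept with a simple comparison instead of relying on the order of updates.

```c
#include <limits.h>

#define TOY_MAX_NODES 16

/*
 * Rank the nodes of a small undirected graph by BFS from all roots at once
 * and return the largest rank assigned. adj[i][j] != 0 means i and j are
 * linked. Requires n <= TOY_MAX_NODES. Unreachable nodes keep UINT_MAX.
 */
static unsigned toy_rank_fabric(unsigned n,
				unsigned char adj[TOY_MAX_NODES][TOY_MAX_NODES],
				const unsigned *roots, unsigned num_roots,
				unsigned *rank)
{
	unsigned queue[TOY_MAX_NODES];
	unsigned head = 0, tail = 0, max_rank = 0;
	unsigned i, v, w;

	for (i = 0; i < n; i++)
		rank[i] = UINT_MAX;

	/* Rank all the roots and add them to the list (skip repeats). */
	for (i = 0; i < num_roots; i++) {
		if (rank[roots[i]] != UINT_MAX)
			continue;
		rank[roots[i]] = 0;
		queue[tail++] = roots[i];
	}

	/* BFS the list till it's empty. */
	while (head < tail) {
		v = queue[head++];
		for (w = 0; w < n; w++) {
			if (!adj[v][w] || rank[w] != UINT_MAX)
				continue;   /* not linked, or already ranked */
			rank[w] = rank[v] + 1;
			if (rank[w] > max_rank)
				max_rank = rank[w];
			queue[tail++] = w;
		}
	}
	return max_rank;
}
```

On the five-switch chain above with roots A and E (nodes 0 and 4), B and D get rank 1, C gets rank 2, and the function returns 2.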

Sasha


From ianjiang.ict at gmail.com  Sun May 20 17:49:18 2007
From: ianjiang.ict at gmail.com (Ian Jiang)
Date: Mon, 21 May 2007 08:49:18 +0800
Subject: [ofa-general] [SRPT]multiple initiators supported?
In-Reply-To: <7b2fa1820705150012l743b817fn1eefeaca290789a@mail.gmail.com>
References: <7b2fa1820705120247t1b232345w8373bb72416c5b28@mail.gmail.com>
	<4648948F.5000802@mellanox.com>
	<7b2fa1820705150012l743b817fn1eefeaca290789a@mail.gmail.com>
Message-ID: <7b2fa1820705201749t5cb9cee2gb336412cbc88e958@mail.gmail.com>

(1) I found where my problem was.
I should not have defined DEBUG_TM in the Makefile of the Generic SCSI target
mid-level for Linux (SCST).
From the README of scst-0.9.5.2:
 - DEBUG_TM - turns on task management functions debugging, when on
   LUN 0 in the "Default" group some of the commands will be delayed for
   about 60 sec., so making the remote initiator send TM functions, eg
   ABORT TASK and TARGET RESET. Also set TM_DBG_GO_OFFLINE symbol in the
   Makefile to 1 if you want that the device eventually become
   completely unresponsive, or to 0 otherwise to circle around ABORTs
   and RESETs code. Needs DEBUG turned on.

(2) Likely a bug in SRPT:
The CM ID should not be destroyed in srpt_release_channel(). If it is, then
after an initiator disconnects from the target, the next connection
request will fail, because no CM ID exists at that moment. So the line
  ib_destroy_cm_id(ch->cm_id);
in srpt_release_channel() of ib_srpt.c should be removed, and the CM
ID should be destroyed only when the entire SRP target is removed.


On 5/15/07, Ian Jiang <ianjiang.ict at gmail.com> wrote:
> Hi Vu,
> Thanks for your replay. But I have got something wrong when using two
> initiators.
>
> Two initiators and one target are at three separated nodes.  The first
> initiator connected to the target correctly. However, the second one
> was aborted 1 minute after its login, and then required to
> *reset_host*, but it failed to send the CM Connect Request when trying
> to reconnect to the target.
>
> Here are the logs of the second initiator:
>
> May 15 13:58:59 cluster5 kernel: ib_srp: new target: id_ext
> 0002c90200206bd8 ioc_guid 0002c90200206bd8 pkey ffff service_id
> 0002c90200206bd8 dgid fe80:0000:0000:0000:0002:c902:0020:6bd9
> May 15 13:58:59 cluster5 kernel: scsi2 : SRP.T10:0002C90200206BD8
> May 15 13:58:59 cluster5 kernel:   Vendor: SCST_FIO  Model: fdisk_128M
>        Rev:  095
> May 15 13:58:59 cluster5 kernel:   Type:   Direct-Access
>        ANSI SCSI revision: 04
> May 15 13:58:59 cluster5 kernel: SCSI device sdb: 262144 512-byte hdwr
> sectors (134 MB)
> May 15 13:58:59 cluster5 kernel: sdb: Write Protect is off
> May 15 13:58:59 cluster5 kernel: sdb: Mode Sense: 6b 00 10 08
> May 15 13:58:59 cluster5 kernel: SCSI device sdb: drive cache: write back w/ FUA
> May 15 13:58:59 cluster5 kernel: SCSI device sdb: 262144 512-byte hdwr
> sectors (134 MB)
> May 15 13:58:59 cluster5 kernel: sdb: Write Protect is off
> May 15 13:58:59 cluster5 kernel: sdb: Mode Sense: 6b 00 10 08
> May 15 13:58:59 cluster5 kernel: SCSI device sdb: drive cache: write back w/ FUA
> May 15 13:58:59 cluster5 kernel:  sdb: unknown partition table
> May 15 13:58:59 cluster5 kernel: sd 2:0:0:0: Attached scsi disk sdb
> May 15 13:58:59 cluster5 kernel: sd 2:0:0:0: Attached scsi generic sg1 type 0
> May 15 13:59:59 cluster5 kernel: SRP abort called
> May 15 13:59:59 cluster5 kernel: SRP reset_device called
> May 15 14:00:29 cluster5 kernel: SRP abort called
> May 15 14:00:34 cluster5 kernel: ib_srp: SRP reset_host called
> May 15 14:00:36 cluster5 kernel: ib_srp: connection closed
> May 15 14:02:15 cluster5 kernel: ib_srp: Sending CM REQ failed
> May 15 14:02:15 cluster5 kernel: ib_srp: reconnect failed (-104),
> removing target port.
> May 15 14:02:15 cluster5 kernel: sd 2:0:0:0: scsi: Device offlined -
> not ready after error recovery
> May 15 14:02:15 cluster5 kernel: sd 2:0:0:0: rejecting I/O to offline device
> May 15 14:02:15 cluster5 kernel: Buffer I/O error on device sdb,
> logical block 32760
> May 15 14:02:15 cluster5 kernel:  2:0:0:0: rejecting I/O to dead device
> May 15 14:02:15 cluster5 kernel: Buffer I/O error on device sdb,
> logical block 32760
>
> Here are the logs of the target during the second initiator's
> connection.  It seemed that it did not receive the reconnect request.
>
> May 15 13:57:27 cluster4 kernel: ib_srpt: Host
> i_port_id=0x100000000000000:0xcc6b200002c90200 login with
> t_port_id=0xd86b200002c90200:0xd86b200002c90200 it_iu_len=260
> May 15 13:57:27 cluster4 kernel: ib_srpt: srpt_create_ch_ib[1105]
> max_cqe= 255 max_sge= 29 cm_id= da9b7200
> May 15 13:57:27 cluster4 kernel: [3966]: scst_init_session:scst: Name
> 0x00000000000000010002c90200206bcc not found, using default group
> May 15 13:57:27 cluster4 kernel: [3966]:
> scst_alloc_add_tgt_dev:Virtual device SCST lun=0
> May 15 13:57:27 cluster4 kernel: [3966]: tm_dbg_init_tgt_dev:LUN 0
> connected from initiator ib_srpt is under TM debugging
> May 15 13:57:27 cluster4 kernel: ib_srpt: Establish connection sess=
> c9a677a8 name= 0x00000000000000010002c90200206bcc cm_id= da9b7200
> May 15 13:57:27 cluster4 kernel: [3964]: scst_set_pending_UA:Setting
> pending UA cmd dabb3ec0
> May 15 13:57:27 cluster4 kernel: [3964]:
> tm_dbg_delay_cmd:tm_dbg_delay_cmd: delaying timed cmd dabb3ec0 (tag
> 35) for 60.96 seconds (15241 HZ)
>
> May 15 13:58:27 cluster4 kernel: ib_srpt: recv_tsk_mgmt= 1 for
> task_tag= 35 using tag= 163 cm_id= da9b7200 sess= c9a677a8
> May 15 13:58:27 cluster4 kernel: [0]: scst_rx_mgmt_fn_tag:sess=c9a677a8, tag=35
> May 15 13:58:27 cluster4 kernel: [0]: scst_post_rx_mgmt_cmd:Adding
> mgmt cmd c70486a0 to active mgmt cmd list
> May 15 13:58:27 cluster4 kernel: [3965]: scst_mgmt_cmd_thread:Moving
> mgmt cmd c70486a0 to mgmt cmd list
> May 15 13:58:27 cluster4 kernel: [3965]: scst_mgmt_cmd_init:Cmd
> dabb3ec0 for tag 35 (sn 35) found, aborting it
> May 15 13:58:27 cluster4 kernel: [3965]: scst_abort_cmd:Aborting cmd
> dabb3ec0 (tag 35)
> May 15 13:58:27 cluster4 kernel: [3965]:
> scst_call_dev_task_mgmt_fn:Calling dev handler disk_fileio
> task_mgmt_fn(fn=0)
> May 15 13:58:27 cluster4 kernel: [3965]:
> scst_call_dev_task_mgmt_fn:Dev handler disk_fileio task_mgmt_fn()
> returned 0
> May 15 13:58:27 cluster4 kernel: [3965]: tm_dbg_release_cmd:Abort
> request for delayed cmd dabb3ec0 (tag=35), moving it to active cmd
> list (delayed_cmds_count=1)
> May 15 13:58:27 cluster4 kernel: ib_srpt: srpt_tsk_mgmt_done[1972]
> tsk_mgmt_done for tag= 163 status=0
> May 15 13:58:27 cluster4 kernel: [3965]: scst_mgmt_cmd_send_done:Dev
> handler ib_srpt task_mgmt_fn_done() returned
> May 15 13:58:27 cluster4 kernel: ib_srpt: recv_tsk_mgmt= 8 for
> task_tag= 35 using tag= 163 cm_id= da9b7200 sess= c9a677a8
> May 15 13:58:27 cluster4 kernel: [3964]: tm_dbg_check_cmd:Processing
> delayed cmd dabb3ec0 (tag 35), delayed_cmds_count=1
> May 15 13:58:27 cluster4 kernel: [3964]: tm_dbg_change_state:Deleting timer
> May 15 13:58:27 cluster4 kernel: ib_srpt: srpt_xmit_response[1898]
> tag= 35 already get aborted
> May 15 13:58:57 cluster4 kernel: ib_srpt: recv_tsk_mgmt= 1 for
> task_tag= 36 using tag= 164 cm_id= da9b7200 sess= c9a677a8
> May 15 13:58:57 cluster4 kernel: [0]: scst_rx_mgmt_fn_tag:sess=c9a677a8, tag=36
> May 15 13:58:57 cluster4 kernel: [0]: scst_post_rx_mgmt_cmd:Adding
> mgmt cmd c7048240 to active mgmt cmd list
> May 15 13:58:57 cluster4 kernel: [3965]: scst_mgmt_cmd_thread:Moving
> mgmt cmd c7048240 to mgmt cmd list
> May 15 13:58:57 cluster4 kernel: [3965]: scst_mgmt_cmd_init:Cmd
> dabb3050 for tag 36 (sn 36) found, aborting it
> May 15 13:58:57 cluster4 kernel: [3965]: scst_abort_cmd:Aborting cmd
> dabb3050 (tag 36)
> May 15 13:58:57 cluster4 kernel: [3965]:
> scst_call_dev_task_mgmt_fn:Calling dev handler disk_fileio
> task_mgmt_fn(fn=0)
> May 15 13:58:57 cluster4 kernel: [3965]:
> scst_call_dev_task_mgmt_fn:Dev handler disk_fileio task_mgmt_fn()
> returned 0
> May 15 13:58:57 cluster4 kernel: [3965]: scst_abort_cmd:cmd dabb3050
> (tag 36) being executed/xmitted (state 12), deferring ABORT...
> May 15 13:58:57 cluster4 kernel: [3965]:
> scst_set_mcmd_next_state:cmd_wait_count(1) not 0, preparing to wait
> May 15 13:59:02 cluster4 kernel: ib_srpt: srpt_cm_dreq_recv[1523]
> cm_id= da9b7200 ch->state= 1
> May 15 13:59:04 cluster4 kernel: ib_srpt: srpt_cm_timewait_exit[1502]
> cm_id= da9b7200
> May 15 13:59:04 cluster4 kernel: ib_srpt: srpt_release_channel[1154]
> Release channel cm_id= da9b7200
> May 15 13:59:04 cluster4 kernel: ib_srpt: srpt_release_channel[1159]
> Release sess= c9a677a8 sess_name= 0x00000000000000010002c90200206bcc
> May 15 13:59:21 cluster4 kernel: ib_srpt: failed send status= 12
> May 15 13:59:21 cluster4 kernel: [0]: scst_complete_cmd_mgmt:cmd
> dabb3050 completed (tag 36, mcmd c7048240, mcmd->cmd_wait_count 1)
> May 15 13:59:21 cluster4 kernel: [0]: scst_complete_cmd_mgmt:Moving
> mgmt cmd c7048240 to active mgmt cmd list
> May 15 13:59:21 cluster4 kernel: [3965]: scst_mgmt_cmd_thread:Moving
> mgmt cmd c7048240 to mgmt cmd list
> May 15 13:59:21 cluster4 kernel: ib_srpt: srpt_tsk_mgmt_done[1972]
> tsk_mgmt_done for tag= 164 status=-1
> May 15 13:59:21 cluster4 kernel: [3965]: scst_mgmt_cmd_send_done:Dev
> handler ib_srpt task_mgmt_fn_done() returned
> May 15 13:59:21 cluster4 kernel: ib_srpt:
> srpt_unregister_session_done[1143]  sess= c9a677a8
> May 15 13:59:21 cluster4 kernel: [3966]: scst_free_all_UA:Clearing UA
> for tgt_dev lun 0
> May 15 13:59:21 cluster4 kernel: ib_srpt: failed send status= 5
> May 15 13:59:21 cluster4 kernel: ib_srpt: QP event 16 on cm_id=
> da9b7200 sess_name= 0x00000000000000010002c90200206bcc state= 2
>
>
> I have no idea why the *abort* was called at the second initiator.
> Could you please give some suggestion? Thanks a lot!
>
>
> On 5/15/07, Vu Pham <vuhuong at mellanox.com> wrote:
> > Ian Jiang wrote:
> > > Does the SRP target support multiple initiators?
> >
> > Yes, it does.
> >
> >
> > > I am using the SRR initiator and IB drivers in linux-2.6.20.
> > > The SRP target is at
> > > http://www.openfabrics.org/git/?p=~vu/srpt.git;a=summary
> > > and the IB driver is OFED-1.1 with linux-2.6.16.13-4-default of Suse-10.1.
> > >
>
>
>
> --
> Ian Jiang
>


-- 
Ian Jiang


From devesh28 at gmail.com  Sun May 20 22:58:09 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Mon, 21 May 2007 11:28:09 +0530
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <1179483657.23882.158398.camel@hal.voltaire.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<000101c7963a$3474ae00$49c9180a@amr.corp.intel.com>
	<309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com>
	<464B5C07.8040601@ichips.intel.com>
	<309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com>
	<1179398534.23882.67542.camel@hal.voltaire.com>
	<309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com>
	<1179483657.23882.158398.camel@hal.voltaire.com>
Message-ID: <309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com>

On 18 May 2007 06:21:05 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote:
> > On 17 May 2007 06:42:16 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote:
> > > > On 5/17/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > > > > > But initially this will generate a packet for each path, while sys
> > > > > > admin knows that path is there and he can hard-code the entries for
> > > > > > it. Other thing is that why Admin will care about creating such record
> > > > > > while SA is itself taking care, right?
> > > > >
> > > > > In your original message you asked about adding 'dummy entries' to the
> > > > > cache.  I agree that pre-loading the cache can be useful.  What I still
> > > > > am not understanding is the reasoning for adding 'dummy entries'.  By
> > > > > 'dummy entries', I've been assuming that these are invalid path records,
> > > > > but maybe that's not what you meant.
> > > > Ok if "dummy entries" word as such has created confusion then I am
> > > > sorry for that, But with that I mean that, those are valid path
> > > > records which Administrator knows in advance and while loading the
> > > > module,
> > >
> > > How does the admin know they are valid ?
> > Depending on the initial application runs, some trusted PRs can be generated.
>
> What do initial application runs have to do with this ?
My understanding is that once the cluster is up, and if there is only one
path between Node A and Node B, then an SA query is always going to
return the same values in the PR. On this basis, initial application runs
will generate PRs; these PRs can be saved in some file and loaded
when the cache module comes in.
>
> > >Are they somehow preconfigured at the SM ?
> > I am not sure about SM has any such provision?
>
> Not that I'm aware of.
OK, so currently there is no such support in the SM?
>
> > Also not sure about the
> > role of SM in path resolving. I mean once node has initiated SA query,
> > whether SM has some database to reply SA or On the fly destination
> > node is contacted to get asked path record?
>
> SMs can either calculate the SA PRs on the fly based on the routing
> algorithm in use and some other things or put them in a local database.
> This is up to that SM.
Ok
>
> Destination node is not contacted in the SA PR query process.
>
> > >Doesn't each SM have its own policy for generating valid PRs ?
> > Ultimately path record is in Path_Record object format, and SA cache
> > is going to store in a fixed manner, How generation policy matters?
>
> What if the local policy loaded does not agree with what the SM would
> generate for a particular PR ? One then gets a local error which will
> need to be tracked down. Not so easy IMO.
Do the SM policies for generating PRs in a subnet change dynamically at
run time? If not, static PRs can be generated according to the local SM
policy and loaded initially.
>
> > CMIIW. Also I am assuming a homogeneous cluster where certain
> > parameters can be assumed to be same always.
>
> and always in agreement with what the SM would return ? For example,
yes
> what happens when a link goes down and the end node is no longer
> reachable ?
If the node is not reachable, that entry will be removed from the cache
after the first sa_cache timeout.
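
As a rough sketch of what I mean by timeout-based removal (how sa_cache
really ages its entries is not shown here, so the struct, the field
names and cache_expire() are hypothetical), assuming each entry records
when it was last refreshed and that now is never earlier than that
time:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache entry: only the fields needed to show ageing. */
struct cached_path {
	unsigned int dlid;		/* destination LID */
	unsigned long refreshed_at;	/* last successful refresh, in seconds */
};

/*
 * Drop every entry that was not refreshed within 'timeout' seconds of
 * 'now'. Survivors are compacted to the front of the array in their
 * original order. Returns the new entry count.
 * Assumes now >= refreshed_at for every entry.
 */
size_t cache_expire(struct cached_path *paths, size_t count,
		    unsigned long now, unsigned long timeout)
{
	size_t kept = 0;
	size_t i;

	assert(paths != NULL || count == 0);
	for (i = 0; i < count; i++) {
		if (now - paths[i].refreshed_at <= timeout)
			paths[kept++] = paths[i];
	}
	return kept;
}
```

A preloaded entry for an unreachable node would never be refreshed, so
it would fall out on the first pass after the timeout elapses.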
>
> > >are these from a live SM and just loaded "out of band" to
> > bypass/preclude the SA PR >mechanism ?
> > may be
>
> Even if they are, there is still the changes in the subnet issue.
>
> -- Hal
>
> > > -- Hal
> > >
> > > >  Admin is loading this info in the cache with user command.
> > > > >
> > > > > > Another point I want to know is,
> > > > > > When local_sa_cache module will be inserted? After SM comes up or
> > > > > > Before SM comes up?
> > > > >
> > > > > It can occur either way.  There is no restriction.  The cache responds
> > > > > to port up and GID in/out of service events to update itself.
> > > > Do you mean cache module will start building cache only after Port is UP?
> > > > >
> > > > > > If Its inserted before SM is coming up (I am assuming SM is running on
> > > > > > some node not on switch) then First Forced schedule_update() is
> > > > > > > wasted, and for the first application presence of cache is
> > > > > > meaningless. Why not to keep cache effective right from the start?
> > > > >
> > > > > Pre-loading the cache with path records doesn't guarantee that those
> > > > > paths are usable.  If the SM has not come up, then the path records will
> > > > > be unusable until the SM configures the subnet, plus there's no
> > > > > guarantee that the remote endpoints specified by the paths are running.
> > > > You mean there is no guarantee that even if SM is UP and we have some
> > > > hard coded entries of path record corresponding to some node X, we are
> > > > not sure that node X has actually come up or not?  In that case
> > > > actually that path resolving should fail if node has not come up, but
> > > > with the hard coding still path will be resolved?
> > > > >
> > > > > The main benefit I see to pre-loading the cache is to avoid SA storms
> > > > > when booting a large cluster.
> > > > that's true. Also cache will get valid entries only if network is
> > > > configured by SM otherwise every node SA will, possibly, drop SA
> > > > packets.
> > > > >
> > > > > - Sean
> > > > >
> > >
> > >
>
>


From erezz at voltaire.com  Sun May 20 23:28:35 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Mon, 21 May 2007 09:28:35 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons
 for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <20070510092925.GB13655@mellanox.co.il>
References: <4641D295.5060907@voltaire.com> <4641D38A.8040406@voltaire.com>
	<20070510092925.GB13655@mellanox.co.il>
Message-ID: <46513C13.3010100@voltaire.com>

Michael S. Tsirkin wrote:
>> Quoting Erez Zilber <erezz at voltaire.com>:
>> Subject: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
>>
>>
>> Add the required backport patches & kernel addons for open-iscsi
>> over iSER in RHAS4 up3 and up4.
>>
>> Signed-off-by: Erez Zilber <erezz at voltaire.com>
> 
> In addition to posting patches, could you pls publish a git tree to pull from,
> please? This makes it easy to test-build the patch as our build system
> knows how to do git checkout.

Added a git tree:

http://www.openfabrics.org/git/?p=~erezz/ofed_1_2_iser_rh4.git;a=summary

> 
> ---
> 
> Two comments, generally
> A: Please move code from kernel_patches to kernel_addons as much
>    as possible. There are many places where you just add new headers,
>    or add #include directives, or change the function called or
>    remove extra parameters, all this can and should be done through addons.
> 

Done

> B: Please do not add code to core unless there is more than 1 user -
>    add it to the iser module instead. This way if there is
>    compilation failure there, you do not break core for people.

Done

> 
> Some specifics below:
> ....
> 
>> 
>> diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch
>> new file mode 100644
>> index 0000000..d77c663
>> --- /dev/null
>> +++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch
>> @@ -0,0 +1,504 @@
>> +diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.c linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.c
>> +--- linux-2.6.20/drivers/scsi/iscsi_tcp.c	2007-02-04 20:44:54.000000000 +0200
>> ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.c	2007-04-01 13:11:17.000000000 +0300
> 
> ...
>> +@@ -108,7 +108,7 @@ iscsi_hdr_digest(struct iscsi_conn *conn
>> + {
>> + 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
>> + 
>> +-	crypto_hash_digest(&tcp_conn->tx_hash, &buf->sg, buf->sg.length, crc);
>> ++	crypto_digest_digest(tcp_conn->tx_tfm, &buf->sg, 1, crc);
>> + 	buf->sg.length = tcp_conn->hdr_size;
>> + }
>> + 
> 
> You could make it a macro in addons if you had named the new field tx_hash.

I fixed that and other crypto function calls whenever possible.
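
For readers who have not seen this kind of shim, the pattern is roughly
the following. This is a stand-alone user-space sketch, not the kernel
crypto API: old_tfm, old_digest(), new_hash_digest() and struct conn
are made-up stand-ins, and the "digest" is a placeholder.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the old API's transform handle. */
struct old_tfm {
	unsigned int calls;
};

/* Stand-in for the old digest call: takes the handle and an entry count. */
void old_digest(struct old_tfm *tfm, const unsigned char *buf,
		unsigned int nents, unsigned int *out)
{
	assert(tfm != NULL && nents == 1);
	tfm->calls++;
	*out = buf[0];	/* placeholder "digest": just the first byte */
}

/*
 * Compat shim: callers keep the newer call shape (pointer to a
 * descriptor plus a byte count). The macro dereferences the descriptor
 * and substitutes the entry count the old call expects; nbytes is
 * intentionally dropped.
 */
#define new_hash_digest(desc, buf, nbytes, out) \
	old_digest(*(desc), (buf), 1, (out))

struct conn {
	struct old_tfm *tx_hash;	/* same field name the newer code uses */
};

/* Caller written against the newer call shape; compiles unchanged. */
unsigned int conn_digest(struct conn *c, const unsigned char *buf, size_t len)
{
	unsigned int crc = 0;

	new_hash_digest(&c->tx_hash, buf, len, &crc);
	return crc;
}
```

Because the field keeps the name tx_hash, the call site does not need
to be touched by a patch; only the header supplying the macro differs
between kernels.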

>> + 
>> + struct iscsi_internal {
>> + 	int daemon_pid;
>> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
>> + #define cdev_to_iscsi_internal(_cdev) \
>> + 	container_of(_cdev, struct iscsi_internal, cdev)
>> + 
>> ++extern int attribute_container_init(void);
>> ++
> 
> This does not look scsi-related. Why does this belong here?

This is a hack. In 2.6.20, attribute_container_init is called from drivers/base/init.c. Since I cannot do that, I'm calling it from the init function in scsi_transport_iscsi (because scsi_transport_iscsi uses the attribute container). Do you have a better suggestion?
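
In outline, the hack amounts to the ordering below. This is a
user-space sketch with made-up names (container_init,
transport_register, transport_module_init), not the actual
scsi_transport_iscsi code:

```c
#include <assert.h>

static int container_ready;	/* set once the dependency is initialised */

/* Stand-in for the dependency's init routine, normally run by core boot code. */
int container_init(void)
{
	container_ready = 1;
	return 0;
}

/* Hypothetical registration step that only works after container_init(). */
int transport_register(void)
{
	if (!container_ready)
		return -1;
	return 0;
}

/* Module init: bring up the dependency ourselves, then register. */
int transport_module_init(void)
{
	int err = container_init();

	if (err)
		return err;
	assert(container_ready);	/* the ordering the hack exists to guarantee */
	return transport_register();
}
```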

> 
>> diff --git a/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch
>> new file mode 100644
>> index 0000000..3c2a969
>> --- /dev/null
>> +++ b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch
>> @@ -0,0 +1,13 @@
>> +--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:13:43.000000000 +0200
>> ++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:14:31.000000000 +0200
>> +@@ -70,9 +70,8 @@
>> + #include <scsi/scsi_tcq.h>
>> + #include <scsi/scsi_host.h>
>> + #include <scsi/scsi.h>
>> +-#include <scsi/scsi_transport_iscsi.h>
>> +-
>> + #include "iscsi_iser.h"
>> ++#include <scsi/scsi_transport_iscsi.h>
>> + 
>> + static unsigned int iscsi_max_lun = 512;
>> + module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO);
> 
> Looks like the right thing to do anyway.
> So put it in fixes instead, and post upstream.

It is not required in newer kernels: mutex.h is included from include/scsi/scsi_host.h. Therefore, I don't want to post a patch upstream. Maybe we can add this inclusion to kernel_addons/backport/2.6.9_U4/include/scsi/scsi_host.h. What do you think?

> 
>> diff --git a/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch b/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch
>> index e84b964..52c0136 100644
>> --- a/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch
>> +++ b/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch
>> @@ -19,6 +19,62 @@ index 0000000..58cf933
>>  +++ b/drivers/infiniband/core/kfifo.c
>>  @@ -0,0 +1 @@
>>  +#include "src/kfifo.c"
>> +diff --git a/drivers/infiniband/core/init.c b/drivers/infiniband/core/init.c
>> +new file mode 100644
>> +index 0000000..58cf933
>> +--- /dev/null
>> ++++ b/drivers/infiniband/core/init.c
>> +@@ -0,0 +1 @@
>> ++#include "src/init.c"
>> +diff --git a/drivers/infiniband/core/attribute_container.c b/drivers/infiniband/core/attribute_container.c
>> +new file mode 100644
>> +index 0000000..58cf933
>> +--- /dev/null
>> ++++ b/drivers/infiniband/core/attribute_container.c
>> +@@ -0,0 +1 @@
>> ++#include "src/attribute_container.c"
...
>> +diff --git a/drivers/infiniband/core/kref_new.c b/drivers/infiniband/core/kref_new.c
>> +new file mode 100644
>> +index 0000000..58cf933
>> +--- /dev/null
>> ++++ b/drivers/infiniband/core/kref_new.c
>> +@@ -0,0 +1 @@
>> ++#include "src/kref_new.c"
>>  diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
>>  index 50fb1cd..456bfd0 100644
>>  --- a/drivers/infiniband/core/Makefile
>> @@ -28,4 +84,4 @@ index 50fb1cd..456bfd0 100644
>>   ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_mem.o \
>>   				uverbs_marshall.o
>>  +
>> -+ib_core-y +=			genalloc.o netevent.o kfifo.o
>> ++ib_core-y +=			genalloc.o netevent.o kfifo.o scsi.o scsi_lib.o scsi_scan.o init.o attribute_container.o transport_class.o klist.o kref_new.o
> 
> Can we make these part of iser place?
> Linking scsi stuff into core does not look right.

Moved that into open-iscsi modules. This code is required for open-iscsi over any transport (not just iSER).

Here's the fixed patch (also available on my git tree):

diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/attribute_container.h b/kernel_addons/backport/2.6.9_U3/include/linux/attribute_container.h
new file mode 100644
index 0000000..93bfb0b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/linux/attribute_container.h
@@ -0,0 +1,71 @@
+/*
+ * class_container.h - a generic container for all classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ */
+
+#ifndef _ATTRIBUTE_CONTAINER_H_
+#define _ATTRIBUTE_CONTAINER_H_
+
+#include <linux/device.h>
+#include <linux/list.h>
+#include <linux/klist.h>
+#include <linux/spinlock.h>
+
+struct attribute_container {
+	struct list_head	node;
+	struct klist		containers;
+	struct class		*class;
+	struct class_device_attribute **attrs;
+	int (*match)(struct attribute_container *, struct device *);
+#define	ATTRIBUTE_CONTAINER_NO_CLASSDEVS	0x01
+	unsigned long		flags;
+};
+
+static inline int
+attribute_container_no_classdevs(struct attribute_container *atc)
+{
+	return atc->flags & ATTRIBUTE_CONTAINER_NO_CLASSDEVS;
+}
+
+static inline void
+attribute_container_set_no_classdevs(struct attribute_container *atc)
+{
+	atc->flags |= ATTRIBUTE_CONTAINER_NO_CLASSDEVS;
+}
+
+int attribute_container_register(struct attribute_container *cont);
+int attribute_container_unregister(struct attribute_container *cont);
+void attribute_container_create_device(struct device *dev,
+				       int (*fn)(struct attribute_container *,
+						 struct device *,
+						 struct class_device *));
+void attribute_container_add_device(struct device *dev,
+				    int (*fn)(struct attribute_container *,
+					      struct device *,
+					      struct class_device *));
+void attribute_container_remove_device(struct device *dev,
+				       void (*fn)(struct attribute_container *,
+						  struct device *,
+						  struct class_device *));
+void attribute_container_device_trigger(struct device *dev, 
+					int (*fn)(struct attribute_container *,
+						  struct device *,
+						  struct class_device *));
+void attribute_container_trigger(struct device *dev, 
+				 int (*fn)(struct attribute_container *,
+					   struct device *));
+int attribute_container_add_attrs(struct class_device *classdev);
+int attribute_container_add_class_device(struct class_device *classdev);
+int attribute_container_add_class_device_adapter(struct attribute_container *cont,
+						 struct device *dev,
+						 struct class_device *classdev);
+void attribute_container_remove_attrs(struct class_device *classdev);
+void attribute_container_class_device_del(struct class_device *classdev);
+struct attribute_container *attribute_container_classdev_to_container(struct class_device *);
+struct class_device *attribute_container_find_class_device(struct attribute_container *, struct device *);
+struct class_device_attribute **attribute_container_classdev_to_attrs(const struct class_device *classdev);
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/crypto.h b/kernel_addons/backport/2.6.9_U3/include/linux/crypto.h
new file mode 100644
index 0000000..aecccde
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/linux/crypto.h
@@ -0,0 +1,11 @@
+#ifndef LINUX_CRYPTO_BACKPORT_H
+#define LINUX_CRYPTO_BACKPORT_H
+
+#include_next <linux/crypto.h>
+
+#define crypto_hash_init(desc) crypto_digest_init(*(desc))
+#define crypto_hash_digest(desc, sg, nbytes, out) crypto_digest_digest(*(desc), sg, 1, out)
+#define crypto_hash_update(desc, sg, nbytes) crypto_digest_update(*(desc), sg, 1)
+#define crypto_hash_final(desc, out) crypto_digest_final(*(desc), out)
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/kernel.h b/kernel_addons/backport/2.6.9_U3/include/linux/kernel.h
index a37dcd5..02a5907 100644
--- a/kernel_addons/backport/2.6.9_U3/include/linux/kernel.h
+++ b/kernel_addons/backport/2.6.9_U3/include/linux/kernel.h
@@ -4,4 +4,7 @@ #define BACKPORT_KERNEL_H_2_6_19
 #include_next <linux/kernel.h>
 #include <linux/log2.h>
 
+#define NIP6_FMT "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x"
+#define NIPQUAD_FMT "%u.%u.%u.%u"
+
 #endif
diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/kfifo.h b/kernel_addons/backport/2.6.9_U3/include/linux/kfifo.h
index 48eccd8..2b94461 100644
--- a/kernel_addons/backport/2.6.9_U3/include/linux/kfifo.h
+++ b/kernel_addons/backport/2.6.9_U3/include/linux/kfifo.h
@@ -25,6 +25,7 @@ #ifdef __KERNEL__
 
 #include <linux/kernel.h>
 #include <linux/spinlock.h>
+#include <linux/gfp.h>
 
 struct kfifo {
 	unsigned char *buffer;	/* the buffer holding the data */
diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/klist.h b/kernel_addons/backport/2.6.9_U3/include/linux/klist.h
new file mode 100644
index 0000000..7407125
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/linux/klist.h
@@ -0,0 +1,61 @@
+/*
+ *	klist.h - Some generic list helpers, extending struct list_head a bit.
+ *
+ *	Implementations are found in lib/klist.c
+ *
+ *
+ *	Copyright (C) 2005 Patrick Mochel
+ *
+ *	This file is released under the GPL v2.
+ */
+
+#ifndef _LINUX_KLIST_H
+#define _LINUX_KLIST_H
+
+#include <linux/spinlock.h>
+#include <linux/completion.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+
+struct klist_node;
+struct klist {
+	spinlock_t		k_lock;
+	struct list_head	k_list;
+	void			(*get)(struct klist_node *);
+	void			(*put)(struct klist_node *);
+};
+
+
+extern void klist_init(struct klist * k, void (*get)(struct klist_node *),
+		       void (*put)(struct klist_node *));
+
+struct klist_node {
+	struct klist		* n_klist;
+	struct list_head	n_node;
+	struct kref		n_ref;
+	struct completion	n_removed;
+};
+
+extern void klist_add_tail(struct klist_node * n, struct klist * k);
+extern void klist_add_head(struct klist_node * n, struct klist * k);
+
+extern void klist_del(struct klist_node * n);
+extern void klist_remove(struct klist_node * n);
+
+extern int klist_node_attached(struct klist_node * n);
+
+
+struct klist_iter {
+	struct klist		* i_klist;
+	struct list_head	* i_head;
+	struct klist_node	* i_cur;
+};
+
+
+extern void klist_iter_init(struct klist * k, struct klist_iter * i);
+extern void klist_iter_init_node(struct klist * k, struct klist_iter * i, 
+				 struct klist_node * n);
+extern void klist_iter_exit(struct klist_iter * i);
+extern struct klist_node * klist_next(struct klist_iter * i);
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/memory.h b/kernel_addons/backport/2.6.9_U3/include/linux/memory.h
new file mode 100644
index 0000000..654ef55
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/linux/memory.h
@@ -0,0 +1,89 @@
+/*
+ * include/linux/memory.h - generic memory definition
+ *
+ * This is mainly for topological representation. We define the
+ * basic "struct memory_block" here, which can be embedded in per-arch
+ * definitions or NUMA information.
+ *
+ * Basic handling of the devices is done in drivers/base/memory.c
+ * and system devices are handled in drivers/base/sys.c.
+ *
+ * Memory blocks are exported via sysfs in the class/memory/devices/
+ * directory.
+ *
+ */
+#ifndef _LINUX_MEMORY_H_
+#define _LINUX_MEMORY_H_
+
+#include <linux/sysdev.h>
+#include <linux/node.h>
+#include <linux/compiler.h>
+
+#include <asm/semaphore.h>
+
+struct memory_block {
+	unsigned long phys_index;
+	unsigned long state;
+	/*
+	 * This serializes all state change requests.  It isn't
+	 * held during creation because the control files are
+	 * created long after the critical areas during
+	 * initialization.
+	 */
+	struct semaphore state_sem;
+	int phys_device;		/* to which fru does this belong? */
+	void *hw;			/* optional pointer to fw/hw data */
+	int (*phys_callback)(struct memory_block *);
+	struct sys_device sysdev;
+};
+
+/* These states are exposed to userspace as text strings in sysfs */
+#define	MEM_ONLINE		(1<<0) /* exposed to userspace */
+#define	MEM_GOING_OFFLINE	(1<<1) /* exposed to userspace */
+#define	MEM_OFFLINE		(1<<2) /* exposed to userspace */
+
+/*
+ * All of these states are currently kernel-internal for notifying
+ * kernel components and architectures.
+ *
+ * For MEM_MAPPING_INVALID, all notifier chains with priority >0
+ * are called before pfn_to_page() becomes invalid.  The priority=0
+ * entry is reserved for the function that actually makes
+ * pfn_to_page() stop working.  Any notifiers that want to be called
+ * after that should have priority <0.
+ */
+#define	MEM_MAPPING_INVALID	(1<<3)
+
+struct notifier_block;
+struct mem_section;
+
+#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE
+static inline int memory_dev_init(void)
+{
+	return 0;
+}
+static inline int register_memory_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
+static inline void unregister_memory_notifier(struct notifier_block *nb)
+{
+}
+#else
+extern int register_new_memory(struct mem_section *);
+extern int unregister_memory_section(struct mem_section *);
+extern int memory_dev_init(void);
+extern int remove_memory_block(unsigned long, struct mem_section *, int);
+
+#define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
+
+
+#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
+
+#define hotplug_memory_notifier(fn, pri) {			\
+	static struct notifier_block fn##_mem_nb =		\
+		{ .notifier_call = fn, .priority = pri };	\
+	register_memory_notifier(&fn##_mem_nb);			\
+}
+
+#endif /* _LINUX_MEMORY_H_ */
diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/netlink.h b/kernel_addons/backport/2.6.9_U3/include/linux/netlink.h
new file mode 100644
index 0000000..6d69105
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/linux/netlink.h
@@ -0,0 +1,14 @@
+#ifndef BACKPORT_LINUX_NETLINK_H
+#define BACKPORT_LINUX_NETLINK_H
+
+#include_next <linux/netlink.h>
+
+#define __nlmsg_put(skb, daemon_pid, seq, type, len, flags) \
+       __nlmsg_put(skb, daemon_pid, 0, 0, len)
+
+#define netlink_kernel_create(uint, groups, input, mod) \
+       netlink_kernel_create(uint, input)
+
+#define NETLINK_ISCSI           8
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/transport_class.h b/kernel_addons/backport/2.6.9_U3/include/linux/transport_class.h
new file mode 100644
index 0000000..1d6cc22
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/linux/transport_class.h
@@ -0,0 +1,100 @@
+/*
+ * transport_class.h - a generic container for all transport classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ */
+
+#ifndef _TRANSPORT_CLASS_H_
+#define _TRANSPORT_CLASS_H_
+
+#include <linux/device.h>
+#include <linux/attribute_container.h>
+
+struct transport_container;
+
+struct transport_class {
+	struct class class;
+	int (*setup)(struct transport_container *, struct device *,
+		     struct class_device *);
+	int (*configure)(struct transport_container *, struct device *,
+			 struct class_device *);
+	int (*remove)(struct transport_container *, struct device *,
+		      struct class_device *);
+};
+
+#define DECLARE_TRANSPORT_CLASS(cls, nm, su, rm, cfg)			\
+struct transport_class cls = {						\
+	.class = {							\
+		.name = nm,						\
+	},								\
+	.setup = su,							\
+	.remove = rm,							\
+	.configure = cfg,						\
+}
+
+
+struct anon_transport_class {
+	struct transport_class tclass;
+	struct attribute_container container;
+};
+
+#define DECLARE_ANON_TRANSPORT_CLASS(cls, mtch, cfg)		\
+struct anon_transport_class cls = {				\
+	.tclass = {						\
+		.configure = cfg,				\
+	},							\
+	. container = {						\
+		.match = mtch,					\
+	},							\
+}
+
+#define class_to_transport_class(x) \
+	container_of(x, struct transport_class, class)
+
+struct transport_container {
+	struct attribute_container ac;
+	struct attribute_group *statistics;
+};
+
+#define attribute_container_to_transport_container(x) \
+	container_of(x, struct transport_container, ac)
+
+void transport_remove_device(struct device *);
+void transport_add_device(struct device *);
+void transport_setup_device(struct device *);
+void transport_configure_device(struct device *);
+void transport_destroy_device(struct device *);
+
+static inline void
+transport_register_device(struct device *dev)
+{
+	transport_setup_device(dev);
+	transport_add_device(dev);
+}
+
+static inline void
+transport_unregister_device(struct device *dev)
+{
+	transport_remove_device(dev);
+	transport_destroy_device(dev);
+}
+
+static inline int transport_container_register(struct transport_container *tc)
+{
+	return attribute_container_register(&tc->ac);
+}
+
+static inline int transport_container_unregister(struct transport_container *tc)
+{
+	return attribute_container_unregister(&tc->ac);
+}
+
+int transport_class_register(struct transport_class *);
+int anon_transport_class_register(struct anon_transport_class *);
+void transport_class_unregister(struct transport_class *);
+void anon_transport_class_unregister(struct anon_transport_class *);
+
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U3/include/scsi/iscsi_proto.h b/kernel_addons/backport/2.6.9_U3/include/scsi/iscsi_proto.h
new file mode 100644
index 0000000..02f6e4b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/scsi/iscsi_proto.h
@@ -0,0 +1,587 @@
+/*
+ * RFC 3720 (iSCSI) protocol data types
+ *
+ * Copyright (C) 2005 Dmitry Yusupov
+ * Copyright (C) 2005 Alex Aizman
+ * maintained by open-iscsi at googlegroups.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published
+ * by the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * See the file COPYING included with this distribution for more details.
+ */
+
+#ifndef ISCSI_PROTO_H
+#define ISCSI_PROTO_H
+
+#define ISCSI_DRAFT20_VERSION	0x00
+
+/* default iSCSI listen port for incoming connections */
+#define ISCSI_LISTEN_PORT	3260
+
+/* Padding word length */
+#define PAD_WORD_LEN		4
+
+/*
+ * useful macros common to the control and data paths
+ */
+#define ntoh24(p) (((p)[0] << 16) | ((p)[1] << 8) | ((p)[2]))
+#define hton24(p, v) { \
+        p[0] = (((v) >> 16) & 0xFF); \
+        p[1] = (((v) >> 8) & 0xFF); \
+        p[2] = ((v) & 0xFF); \
+}
+#define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;}
+
+/*
+ * iSCSI Template Message Header
+ */
+struct iscsi_hdr {
+	uint8_t		opcode;
+	uint8_t		flags;		/* Final bit */
+	uint8_t		rsvd2[2];
+	uint8_t		hlength;	/* AHSs total length */
+	uint8_t		dlength[3];	/* Data length */
+	uint8_t		lun[8];
+	__be32		itt;		/* Initiator Task Tag */
+	__be32		ttt;		/* Target Task Tag */
+	__be32		statsn;
+	__be32		exp_statsn;
+	__be32		max_statsn;
+	uint8_t		other[12];
+};
+
+/************************* RFC 3720 Begin *****************************/
+
+#define ISCSI_RESERVED_TAG		0xffffffff
+
+/* Opcode encoding bits */
+#define ISCSI_OP_RETRY			0x80
+#define ISCSI_OP_IMMEDIATE		0x40
+#define ISCSI_OPCODE_MASK		0x3F
+
+/* Initiator Opcode values */
+#define ISCSI_OP_NOOP_OUT		0x00
+#define ISCSI_OP_SCSI_CMD		0x01
+#define ISCSI_OP_SCSI_TMFUNC		0x02
+#define ISCSI_OP_LOGIN			0x03
+#define ISCSI_OP_TEXT			0x04
+#define ISCSI_OP_SCSI_DATA_OUT		0x05
+#define ISCSI_OP_LOGOUT			0x06
+#define ISCSI_OP_SNACK			0x10
+
+#define ISCSI_OP_VENDOR1_CMD		0x1c
+#define ISCSI_OP_VENDOR2_CMD		0x1d
+#define ISCSI_OP_VENDOR3_CMD		0x1e
+#define ISCSI_OP_VENDOR4_CMD		0x1f
+
+/* Target Opcode values */
+#define ISCSI_OP_NOOP_IN		0x20
+#define ISCSI_OP_SCSI_CMD_RSP		0x21
+#define ISCSI_OP_SCSI_TMFUNC_RSP	0x22
+#define ISCSI_OP_LOGIN_RSP		0x23
+#define ISCSI_OP_TEXT_RSP		0x24
+#define ISCSI_OP_SCSI_DATA_IN		0x25
+#define ISCSI_OP_LOGOUT_RSP		0x26
+#define ISCSI_OP_R2T			0x31
+#define ISCSI_OP_ASYNC_EVENT		0x32
+#define ISCSI_OP_REJECT			0x3f
+
+struct iscsi_ahs_hdr {
+	__be16 ahslength;
+	uint8_t ahstype;
+	uint8_t ahspec[5];
+};
+
+#define ISCSI_AHSTYPE_CDB		1
+#define ISCSI_AHSTYPE_RLENGTH		2
+
+/* iSCSI PDU Header */
+struct iscsi_cmd {
+	uint8_t opcode;
+	uint8_t flags;
+	__be16 rsvd2;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32 itt;	/* Initiator Task Tag */
+	__be32 data_length;
+	__be32 cmdsn;
+	__be32 exp_statsn;
+	uint8_t cdb[16];	/* SCSI Command Block */
+	/* Additional Data (Command Dependent) */
+};
+
+/* Command PDU flags */
+#define ISCSI_FLAG_CMD_FINAL		0x80
+#define ISCSI_FLAG_CMD_READ		0x40
+#define ISCSI_FLAG_CMD_WRITE		0x20
+#define ISCSI_FLAG_CMD_ATTR_MASK	0x07	/* 3 bits */
+
+/* SCSI Command Attribute values */
+#define ISCSI_ATTR_UNTAGGED		0
+#define ISCSI_ATTR_SIMPLE		1
+#define ISCSI_ATTR_ORDERED		2
+#define ISCSI_ATTR_HEAD_OF_QUEUE	3
+#define ISCSI_ATTR_ACA			4
+
+struct iscsi_rlength_ahdr {
+	__be16 ahslength;
+	uint8_t ahstype;
+	uint8_t reserved;
+	__be32 read_length;
+};
+
+/* SCSI Response Header */
+struct iscsi_cmd_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t response;
+	uint8_t cmd_status;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rsvd1;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	exp_datasn;
+	__be32	bi_residual_count;
+	__be32	residual_count;
+	/* Response or Sense Data (optional) */
+};
+
+/* Command Response PDU flags */
+#define ISCSI_FLAG_CMD_BIDI_OVERFLOW	0x10
+#define ISCSI_FLAG_CMD_BIDI_UNDERFLOW	0x08
+#define ISCSI_FLAG_CMD_OVERFLOW		0x04
+#define ISCSI_FLAG_CMD_UNDERFLOW	0x02
+
+/* iSCSI Status values. Valid if Rsp Selector bit is not set */
+#define ISCSI_STATUS_CMD_COMPLETED	0
+#define ISCSI_STATUS_TARGET_FAILURE	1
+#define ISCSI_STATUS_SUBSYS_FAILURE	2
+
+/* Asynchronous Event Header */
+struct iscsi_async {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t rsvd3;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	uint8_t rsvd4[8];
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t async_event;
+	uint8_t async_vcode;
+	__be16	param1;
+	__be16	param2;
+	__be16	param3;
+	uint8_t rsvd5[4];
+};
+
+/* iSCSI Event Codes */
+#define ISCSI_ASYNC_MSG_SCSI_EVENT			0
+#define ISCSI_ASYNC_MSG_REQUEST_LOGOUT			1
+#define ISCSI_ASYNC_MSG_DROPPING_CONNECTION		2
+#define ISCSI_ASYNC_MSG_DROPPING_ALL_CONNECTIONS	3
+#define ISCSI_ASYNC_MSG_PARAM_NEGOTIATION		4
+#define ISCSI_ASYNC_MSG_VENDOR_SPECIFIC			255
+
+/* NOP-Out Message */
+struct iscsi_nopout {
+	uint8_t opcode;
+	uint8_t flags;
+	__be16	rsvd2;
+	uint8_t rsvd3;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	ttt;	/* Target Transfer Tag */
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	uint8_t rsvd4[16];
+};
+
+/* NOP-In Message */
+struct iscsi_nopin {
+	uint8_t opcode;
+	uint8_t flags;
+	__be16	rsvd2;
+	uint8_t rsvd3;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	ttt;	/* Target Transfer Tag */
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t rsvd4[12];
+};
+
+/* SCSI Task Management Message Header */
+struct iscsi_tm {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd1[2];
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rtt;	/* Reference Task Tag */
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	__be32	refcmdsn;
+	__be32	exp_datasn;
+	uint8_t rsvd2[8];
+};
+
+#define ISCSI_FLAG_TM_FUNC_MASK			0x7F
+
+/* Function values */
+#define ISCSI_TM_FUNC_ABORT_TASK		1
+#define ISCSI_TM_FUNC_ABORT_TASK_SET		2
+#define ISCSI_TM_FUNC_CLEAR_ACA			3
+#define ISCSI_TM_FUNC_CLEAR_TASK_SET		4
+#define ISCSI_TM_FUNC_LOGICAL_UNIT_RESET	5
+#define ISCSI_TM_FUNC_TARGET_WARM_RESET		6
+#define ISCSI_TM_FUNC_TARGET_COLD_RESET		7
+#define ISCSI_TM_FUNC_TASK_REASSIGN		8
+
+/* SCSI Task Management Response Header */
+struct iscsi_tm_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t response;	/* see Response values below */
+	uint8_t qualifier;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd2[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rtt;	/* Reference Task Tag */
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t rsvd3[12];
+};
+
+/* Response values */
+#define ISCSI_TMF_RSP_COMPLETE		0x00
+#define ISCSI_TMF_RSP_NO_TASK		0x01
+#define ISCSI_TMF_RSP_NO_LUN		0x02
+#define ISCSI_TMF_RSP_TASK_ALLEGIANT	0x03
+#define ISCSI_TMF_RSP_NO_FAILOVER	0x04
+#define ISCSI_TMF_RSP_NOT_SUPPORTED	0x05
+#define ISCSI_TMF_RSP_AUTH_FAILED	0x06
+#define ISCSI_TMF_RSP_REJECTED		0xff
+
+/* Ready To Transfer Header */
+struct iscsi_r2t_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t	hlength;
+	uint8_t	dlength[3];
+	uint8_t lun[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	ttt;	/* Target Transfer Tag */
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	r2tsn;
+	__be32	data_offset;
+	__be32	data_length;
+};
+
+/* SCSI Data Hdr */
+struct iscsi_data {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t rsvd3;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;
+	__be32	ttt;
+	__be32	rsvd4;
+	__be32	exp_statsn;
+	__be32	rsvd5;
+	__be32	datasn;
+	__be32	offset;
+	__be32	rsvd6;
+	/* Payload */
+};
+
+/* SCSI Data Response Hdr */
+struct iscsi_data_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2;
+	uint8_t cmd_status;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;
+	__be32	ttt;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	datasn;
+	__be32	offset;
+	__be32	residual_count;
+};
+
+/* Data Response PDU flags */
+#define ISCSI_FLAG_DATA_ACK		0x40
+#define ISCSI_FLAG_DATA_OVERFLOW	0x04
+#define ISCSI_FLAG_DATA_UNDERFLOW	0x02
+#define ISCSI_FLAG_DATA_STATUS		0x01
+
+/* Text Header */
+struct iscsi_text {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd4[8];
+	__be32	itt;
+	__be32	ttt;
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	uint8_t rsvd5[16];
+	/* Text - key=value pairs */
+};
+
+#define ISCSI_FLAG_TEXT_CONTINUE	0x40
+
+/* Text Response Header */
+struct iscsi_text_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd4[8];
+	__be32	itt;
+	__be32	ttt;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t rsvd5[12];
+	/* Text Response - key=value pairs */
+};
+
+/* Login Header */
+struct iscsi_login {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t max_version;	/* Max. version supported */
+	uint8_t min_version;	/* Min. version supported */
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t isid[6];	/* Initiator Session ID */
+	__be16	tsih;	/* Target Session Handle */
+	__be32	itt;	/* Initiator Task Tag */
+	__be16	cid;
+	__be16	rsvd3;
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	uint8_t rsvd5[16];
+};
+
+/* Login PDU flags */
+#define ISCSI_FLAG_LOGIN_TRANSIT		0x80
+#define ISCSI_FLAG_LOGIN_CONTINUE		0x40
+#define ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK	0x0C	/* 2 bits */
+#define ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK	0x03	/* 2 bits */
+
+#define ISCSI_LOGIN_CURRENT_STAGE(flags) \
+	(((flags) & ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK) >> 2)
+#define ISCSI_LOGIN_NEXT_STAGE(flags) \
+	((flags) & ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK)
+
+/* Login Response Header */
+struct iscsi_login_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t max_version;	/* Max. version supported */
+	uint8_t active_version;	/* Active version */
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t isid[6];	/* Initiator Session ID */
+	__be16	tsih;	/* Target Session Handle */
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rsvd3;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t status_class;	/* see Login RSP status classes below */
+	uint8_t status_detail;	/* see Login RSP Status details below */
+	uint8_t rsvd4[10];
+};
+
+/* Login stage (phase) codes for CSG, NSG */
+#define ISCSI_INITIAL_LOGIN_STAGE		-1
+#define ISCSI_SECURITY_NEGOTIATION_STAGE	0
+#define ISCSI_OP_PARMS_NEGOTIATION_STAGE	1
+#define ISCSI_FULL_FEATURE_PHASE		3
+
+/* Login Status response classes */
+#define ISCSI_STATUS_CLS_SUCCESS		0x00
+#define ISCSI_STATUS_CLS_REDIRECT		0x01
+#define ISCSI_STATUS_CLS_INITIATOR_ERR		0x02
+#define ISCSI_STATUS_CLS_TARGET_ERR		0x03
+
+/* Login Status response detail codes */
+/* Class-0 (Success) */
+#define ISCSI_LOGIN_STATUS_ACCEPT		0x00
+
+/* Class-1 (Redirection) */
+#define ISCSI_LOGIN_STATUS_TGT_MOVED_TEMP	0x01
+#define ISCSI_LOGIN_STATUS_TGT_MOVED_PERM	0x02
+
+/* Class-2 (Initiator Error) */
+#define ISCSI_LOGIN_STATUS_INIT_ERR		0x00
+#define ISCSI_LOGIN_STATUS_AUTH_FAILED		0x01
+#define ISCSI_LOGIN_STATUS_TGT_FORBIDDEN	0x02
+#define ISCSI_LOGIN_STATUS_TGT_NOT_FOUND	0x03
+#define ISCSI_LOGIN_STATUS_TGT_REMOVED		0x04
+#define ISCSI_LOGIN_STATUS_NO_VERSION		0x05
+#define ISCSI_LOGIN_STATUS_ISID_ERROR		0x06
+#define ISCSI_LOGIN_STATUS_MISSING_FIELDS	0x07
+#define ISCSI_LOGIN_STATUS_CONN_ADD_FAILED	0x08
+#define ISCSI_LOGIN_STATUS_NO_SESSION_TYPE	0x09
+#define ISCSI_LOGIN_STATUS_NO_SESSION		0x0a
+#define ISCSI_LOGIN_STATUS_INVALID_REQUEST	0x0b
+
+/* Class-3 (Target Error) */
+#define ISCSI_LOGIN_STATUS_TARGET_ERROR		0x00
+#define ISCSI_LOGIN_STATUS_SVC_UNAVAILABLE	0x01
+#define ISCSI_LOGIN_STATUS_NO_RESOURCES		0x02
+
+/* Logout Header */
+struct iscsi_logout {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd1[2];
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd2[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be16	cid;
+	uint8_t rsvd3[2];
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	uint8_t rsvd4[16];
+};
+
+/* Logout PDU flags */
+#define ISCSI_FLAG_LOGOUT_REASON_MASK	0x7F
+
+/* logout reason_code values */
+
+#define ISCSI_LOGOUT_REASON_CLOSE_SESSION	0
+#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION	1
+#define ISCSI_LOGOUT_REASON_RECOVERY		2
+#define ISCSI_LOGOUT_REASON_AEN_REQUEST		3
+
+/* Logout Response Header */
+struct iscsi_logout_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t response;	/* see Logout response values below */
+	uint8_t rsvd2;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd3[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rsvd4;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	rsvd5;
+	__be16	t2wait;
+	__be16	t2retain;
+	__be32	rsvd6;
+};
+
+/* logout response status values */
+
+#define ISCSI_LOGOUT_SUCCESS			0
+#define ISCSI_LOGOUT_CID_NOT_FOUND		1
+#define ISCSI_LOGOUT_RECOVERY_UNSUPPORTED	2
+#define ISCSI_LOGOUT_CLEANUP_FAILED		3
+
+/* SNACK Header */
+struct iscsi_snack {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[14];
+	__be32	itt;
+	__be32	begrun;
+	__be32	runlength;
+	__be32	exp_statsn;
+	__be32	rsvd3;
+	__be32	exp_datasn;
+	uint8_t rsvd6[8];
+};
+
+/* SNACK PDU flags */
+#define ISCSI_FLAG_SNACK_TYPE_MASK	0x0F	/* 4 bits */
+
+/* Reject Message Header */
+struct iscsi_reject {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t reason;
+	uint8_t rsvd2;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd3[8];
+	__be32  ffffffff;
+	uint8_t rsvd4[4];
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	datasn;
+	uint8_t rsvd5[8];
+	/* Text - Rejected hdr */
+};
+
+/* Reason for Reject */
+#define ISCSI_REASON_CMD_BEFORE_LOGIN	1
+#define ISCSI_REASON_DATA_DIGEST_ERROR	2
+#define ISCSI_REASON_DATA_SNACK_REJECT	3
+#define ISCSI_REASON_PROTOCOL_ERROR	4
+#define ISCSI_REASON_CMD_NOT_SUPPORTED	5
+#define ISCSI_REASON_IMM_CMD_REJECT		6
+#define ISCSI_REASON_TASK_IN_PROGRESS	7
+#define ISCSI_REASON_INVALID_SNACK		8
+#define ISCSI_REASON_BOOKMARK_INVALID	9
+#define ISCSI_REASON_BOOKMARK_NO_RESOURCES	10
+#define ISCSI_REASON_NEGOTIATION_RESET	11
+
+/* Max. number of Key=Value pairs in a text message */
+#define MAX_KEY_VALUE_PAIRS	8192
+
+/* maximum length for text keys/values */
+#define KEY_MAXLEN		64
+#define VALUE_MAXLEN		255
+#define TARGET_NAME_MAXLEN	VALUE_MAXLEN
+
+#define DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH	8192
+
+/************************* RFC 3720 End *****************************/
+
+#endif /* ISCSI_PROTO_H */
diff --git a/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_device.h b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_device.h
new file mode 100644
index 0000000..f353e0b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_device.h
@@ -0,0 +1,19 @@
+#ifndef _SCSI_SCSI_DEVICE_H_BACKPORT
+#define _SCSI_SCSI_DEVICE_H_BACKPORT
+
+#include_next <scsi/scsi_device.h>
+
+#include <linux/device.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/workqueue.h>
+#include <asm/atomic.h>
+
+struct scsi_lun;
+
+extern void int_to_scsilun(unsigned int, struct scsi_lun *);
+extern void scsi_target_block(struct device *);
+extern void scsi_target_unblock(struct device *);
+extern void starget_for_each_device(struct scsi_target *, void *,
+		     void (*fn)(struct scsi_device *, void *));
+#endif /* _SCSI_SCSI_DEVICE_H_BACKPORT */
diff --git a/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_host.h b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_host.h
new file mode 100644
index 0000000..b7e019b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_host.h
@@ -0,0 +1,8 @@
+#ifndef _SCSI_SCSI_HOST_H_BACKPORT
+#define _SCSI_SCSI_HOST_H_BACKPORT
+
+#include_next <scsi/scsi_host.h>
+
+#define scsi_queue_work(shost, work) schedule_work(work)
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_transport.h b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_transport.h
new file mode 100644
index 0000000..99c2b12
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_transport.h
@@ -0,0 +1,8 @@
+#ifndef _SCSI_SCSI_TRANSPORT_H_BACKPORT
+#define _SCSI_SCSI_TRANSPORT_H_BACKPORT
+
+#include_next <scsi/scsi_transport.h>
+
+#include <linux/transport_class.h>
+
+#endif /* _SCSI_SCSI_TRANSPORT_H_BACKPORT */
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/attribute_container.c b/kernel_addons/backport/2.6.9_U3/include/src/attribute_container.c
new file mode 100644
index 0000000..44948d1
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/attribute_container.c
@@ -0,0 +1,438 @@
+/*
+ * attribute_container.c - implementation of a simple container for classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ *
+ * The basic idea here is to enable a device to be attached to an
+ * arbitrary number of classes without having to allocate storage for them.
+ * Instead, the contained classes select the devices they need to attach
+ * to via a matching function.
+ */
+
+#include <linux/attribute_container.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/module.h>
+
+#include "base.h"
+
+/* This is a private structure used to tie the classdev and the
+ * container .. it should never be visible outside this file */
+struct internal_container {
+	struct klist_node node;
+	struct attribute_container *cont;
+	struct class_device classdev;
+};
+
+static void internal_container_klist_get(struct klist_node *n)
+{
+	struct internal_container *ic =
+		container_of(n, struct internal_container, node);
+	class_device_get(&ic->classdev);
+}
+
+static void internal_container_klist_put(struct klist_node *n)
+{
+	struct internal_container *ic =
+		container_of(n, struct internal_container, node);
+	class_device_put(&ic->classdev);
+}
+
+
+/**
+ * attribute_container_classdev_to_container - given a classdev, return the container
+ *
+ * @classdev: the class device created by attribute_container_add_device.
+ *
+ * Returns the container associated with this classdev.
+ */
+struct attribute_container *
+attribute_container_classdev_to_container(struct class_device *classdev)
+{
+	struct internal_container *ic =
+		container_of(classdev, struct internal_container, classdev);
+	return ic->cont;
+}
+EXPORT_SYMBOL_GPL(attribute_container_classdev_to_container);
+
+static struct list_head attribute_container_list;
+
+static DECLARE_MUTEX(attribute_container_mutex);
+
+/**
+ * attribute_container_register - register an attribute container
+ *
+ * @cont: The container to register.  This must be allocated by the
+ *        caller and should also be zeroed by it.
+ */
+int
+attribute_container_register(struct attribute_container *cont)
+{
+	INIT_LIST_HEAD(&cont->node);
+	klist_init(&cont->containers,internal_container_klist_get,
+		   internal_container_klist_put);
+		
+	down(&attribute_container_mutex);
+	list_add_tail(&cont->node, &attribute_container_list);
+	up(&attribute_container_mutex);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(attribute_container_register);
+
+/**
+ * attribute_container_unregister - remove a container registration
+ *
+ * @cont: previously registered container to remove
+ */
+int
+attribute_container_unregister(struct attribute_container *cont)
+{
+	int retval = -EBUSY;
+	down(&attribute_container_mutex);
+	spin_lock(&cont->containers.k_lock);
+	if (!list_empty(&cont->containers.k_list))
+		goto out;
+	retval = 0;
+	list_del(&cont->node);
+ out:
+	spin_unlock(&cont->containers.k_lock);
+	up(&attribute_container_mutex);
+	return retval;
+		
+}
+EXPORT_SYMBOL_GPL(attribute_container_unregister);
+
+/* private function used as class release */
+static void attribute_container_release(struct class_device *classdev)
+{
+	struct internal_container *ic 
+		= container_of(classdev, struct internal_container, classdev);
+	struct device *dev = classdev->dev;
+
+	kfree(ic);
+	put_device(dev);
+}
+
+/**
+ * attribute_container_add_device - see if any container is interested in dev
+ *
+ * @dev: device to add attributes to
+ * @fn:	 function to trigger addition of class device.
+ *
+ * This function allocates storage for the class device(s) to be
+ * attached to dev (one for each matching attribute_container).  If no
+ * fn is provided, the code will simply register the class device via
+ * class_device_add.  If a function is provided, it is expected to add
+ * the class device at the appropriate time.  One of the things that
+ * might be necessary is to allocate and initialise the classdev and
+ * then add it at a later time.  To do this, call this routine for
+ * allocation and initialisation and then use
+ * attribute_container_device_trigger() to call class_device_add() on
+ * it.  Note: after this, the class device contains a reference to dev
+ * which is not relinquished until the release of the classdev.
+ */
+void
+attribute_container_add_device(struct device *dev,
+			       int (*fn)(struct attribute_container *,
+					 struct device *,
+					 struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+
+		if (attribute_container_no_classdevs(cont))
+			continue;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		ic = kzalloc(sizeof(*ic), GFP_KERNEL);
+		if (!ic) {
+			dev_printk(KERN_ERR, dev, "failed to allocate class container\n");
+			continue;
+		}
+
+		ic->cont = cont;
+		class_device_initialize(&ic->classdev);
+		ic->classdev.dev = get_device(dev);
+		ic->classdev.class = cont->class;
+		cont->class->release = attribute_container_release;
+		strcpy(ic->classdev.class_id, dev->bus_id);
+		if (fn)
+			fn(cont, dev, &ic->classdev);
+		else
+			attribute_container_add_class_device(&ic->classdev);
+		klist_add_tail(&ic->node, &cont->containers);
+	}
+	up(&attribute_container_mutex);
+}
+
+/* FIXME: can't break out of this unless klist_iter_exit is also
+ * called before doing the break
+ */
+#define klist_for_each_entry(pos, head, member, iter) \
+	for (klist_iter_init(head, iter); (pos = ({ \
+		struct klist_node *n = klist_next(iter); \
+		n ? container_of(n, typeof(*pos), member) : \
+			({ klist_iter_exit(iter) ; NULL; }); \
+	}) ) != NULL; )
+			
+
+/**
+ * attribute_container_remove_device - make device eligible for removal.
+ *
+ * @dev:  The generic device
+ * @fn:	  A function to call to remove the device
+ *
+ * This routine triggers device removal.  If fn is NULL, then it is
+ * simply done via class_device_unregister (note that if something
+ * still has a reference to the classdev, then the memory occupied
+ * will not be freed until the classdev is released).  If you want a
+ * two phase release: remove from visibility and then delete the
+ * device, then you should use this routine with a fn that calls
+ * class_device_del() and then use
+ * attribute_container_device_trigger() to do the final put on the
+ * classdev.
+ */
+void
+attribute_container_remove_device(struct device *dev,
+				  void (*fn)(struct attribute_container *,
+					     struct device *,
+					     struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+		struct klist_iter iter;
+
+		if (attribute_container_no_classdevs(cont))
+			continue;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		klist_for_each_entry(ic, &cont->containers, node, &iter) {
+			if (dev != ic->classdev.dev)
+				continue;
+			klist_del(&ic->node);
+			if (fn)
+				fn(cont, dev, &ic->classdev);
+			else {
+				attribute_container_remove_attrs(&ic->classdev);
+				class_device_unregister(&ic->classdev);
+			}
+		}
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_device_trigger - execute a trigger for each matching classdev
+ *
+ * @dev:  The generic device to run the trigger for
+ * @fn:	  the function to execute for each classdev.
+ *
+ * This function is for executing a trigger when you need to know both
+ * the container and the classdev.  If you only care about the
+ * container, then use attribute_container_trigger() instead.
+ */
+void
+attribute_container_device_trigger(struct device *dev, 
+				   int (*fn)(struct attribute_container *,
+					     struct device *,
+					     struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+		struct klist_iter iter;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		if (attribute_container_no_classdevs(cont)) {
+			fn(cont, dev, NULL);
+			continue;
+		}
+
+		klist_for_each_entry(ic, &cont->containers, node, &iter) {
+			if (dev == ic->classdev.dev)
+				fn(cont, dev, &ic->classdev);
+		}
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_trigger - trigger a function for each matching container
+ *
+ * @dev:  The generic device to activate the trigger for
+ * @fn:	  the function to trigger
+ *
+ * This routine triggers a function that only needs to know the
+ * matching containers (not the classdev) associated with a device.
+ * It is more lightweight than attribute_container_device_trigger, so
+ * should be used in preference unless the triggering function
+ * actually needs to know the classdev.
+ */
+void
+attribute_container_trigger(struct device *dev,
+			    int (*fn)(struct attribute_container *,
+				      struct device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		if (cont->match(cont, dev))
+			fn(cont, dev);
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_add_attrs - add attributes
+ *
+ * @classdev: The class device
+ *
+ * This simply creates all the class device sysfs files from the
+ * attributes listed in the container
+ */
+int
+attribute_container_add_attrs(struct class_device *classdev)
+{
+	struct attribute_container *cont =
+		attribute_container_classdev_to_container(classdev);
+	struct class_device_attribute **attrs =	cont->attrs;
+	int i, error;
+
+	if (!attrs)
+		return 0;
+
+	for (i = 0; attrs[i]; i++) {
+		error = class_device_create_file(classdev, attrs[i]);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/**
+ * attribute_container_add_class_device - same function as class_device_add
+ *
+ * @classdev:	the class device to add
+ *
+ * This performs essentially the same function as class_device_add except for
+ * attribute containers, namely add the classdev to the system and then
+ * create the attribute files
+ */
+int
+attribute_container_add_class_device(struct class_device *classdev)
+{
+	int error = class_device_add(classdev);
+	if (error)
+		return error;
+	return attribute_container_add_attrs(classdev);
+}
+
+/**
+ * attribute_container_add_class_device_adapter - simple adapter for triggers
+ *
+ * This function is identical to attribute_container_add_class_device except
+ * that it is designed to be called from the triggers
+ */
+int
+attribute_container_add_class_device_adapter(struct attribute_container *cont,
+					     struct device *dev,
+					     struct class_device *classdev)
+{
+	return attribute_container_add_class_device(classdev);
+}
+
+/**
+ * attribute_container_remove_attrs - remove any attribute files
+ *
+ * @classdev: The class device to remove the files from
+ *
+ */
+void
+attribute_container_remove_attrs(struct class_device *classdev)
+{
+	struct attribute_container *cont =
+		attribute_container_classdev_to_container(classdev);
+	struct class_device_attribute **attrs =	cont->attrs;
+	int i;
+
+	if (!attrs)
+		return;
+
+	for (i = 0; attrs[i]; i++)
+		class_device_remove_file(classdev, attrs[i]);
+}
+
+/**
+ * attribute_container_class_device_del - equivalent of class_device_del
+ *
+ * @classdev: the class device
+ *
+ * This function simply removes all the attribute files and then calls
+ * class_device_del.
+ */
+void
+attribute_container_class_device_del(struct class_device *classdev)
+{
+	attribute_container_remove_attrs(classdev);
+	class_device_del(classdev);
+}
+
+/**
+ * attribute_container_find_class_device - find the corresponding class_device
+ *
+ * @cont:	the container
+ * @dev:	the generic device
+ *
+ * Looks up the device in the container's list of class devices and returns
+ * the corresponding class_device.
+ */
+struct class_device *
+attribute_container_find_class_device(struct attribute_container *cont,
+				      struct device *dev)
+{
+	struct class_device *cdev = NULL;
+	struct internal_container *ic;
+	struct klist_iter iter;
+
+	klist_for_each_entry(ic, &cont->containers, node, &iter) {
+		if (ic->classdev.dev == dev) {
+			cdev = &ic->classdev;
+			/* FIXME: must exit iterator then break */
+			klist_iter_exit(&iter);
+			break;
+		}
+	}
+
+	return cdev;
+}
+EXPORT_SYMBOL_GPL(attribute_container_find_class_device);
+
+int
+attribute_container_init(void)
+{
+	INIT_LIST_HEAD(&attribute_container_list);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(attribute_container_init);
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/base.h b/kernel_addons/backport/2.6.9_U3/include/src/base.h
new file mode 100644
index 0000000..a5f8936
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/base.h
@@ -0,0 +1 @@
+extern int attribute_container_init(void);
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/init.c b/kernel_addons/backport/2.6.9_U3/include/src/init.c
new file mode 100644
index 0000000..15f0bc6
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/init.c
@@ -0,0 +1,26 @@
+/*
+ *
+ * Copyright (c) 2002-3 Patrick Mochel
+ * Copyright (c) 2002-3 Open Source Development Labs
+ *
+ * This file is released under the GPLv2
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/memory.h>
+
+#include "base.h"
+
+/**
+ *	driver_init - initialize driver model.
+ *
+ *	Call the driver model init functions to initialize their
+ *	subsystems. Called early from init/main.c.
+ */
+
+void __init driver_init(void)
+{
+	attribute_container_init();
+}
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/klist.c b/kernel_addons/backport/2.6.9_U3/include/src/klist.c
new file mode 100644
index 0000000..3b29ebc
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/klist.c
@@ -0,0 +1,287 @@
+/*
+ *	klist.c - Routines for manipulating klists.
+ *
+ *
+ *	This klist interface provides a couple of structures that wrap around 
+ *	struct list_head to provide explicit list "head" (struct klist) and 
+ *	list "node" (struct klist_node) objects. For struct klist, a spinlock
+ *	is included that protects access to the actual list itself. struct 
+ *	klist_node provides a pointer to the klist that owns it and a kref
+ *	reference count that indicates the number of current users of that node
+ *	in the list.
+ *
+ *	The entire point is to provide an interface for iterating over a list
+ *	that is safe and allows for modification of the list during the
+ *	iteration (e.g. insertion and removal), including modification of the
+ *	current node on the list.
+ *
+ *	It works using a 3rd object type - struct klist_iter - that is declared
+ *	and initialized before an iteration. klist_next() is used to acquire the
+ *	next element in the list. It returns NULL if there are no more items.
+ *	Internally, that routine takes the klist's lock, decrements the reference
+ *	count of the previous klist_node and increments the count of the next
+ *	klist_node. It then drops the lock and returns.
+ *
+ *	There are primitives for adding and removing nodes to/from a klist. 
+ *	When deleting, klist_del() will simply decrement the reference count. 
+ *	Only when the count goes to 0 is the node removed from the list. 
+ *	klist_remove() will try to delete the node from the list and block
+ *	until it is actually removed. This is useful for objects (like devices)
+ *	that have been removed from the system and must be freed (but must wait
+ *	until all accessors have finished).
+ *
+ *	Copyright (C) 2005 Patrick Mochel
+ *
+ *	This file is released under the GPL v2.
+ */
+
+#include <linux/klist.h>
+#include <linux/module.h>
+
+
+/**
+ *	klist_init - Initialize a klist structure. 
+ *	@k:	The klist we're initializing.
+ *	@get:	The get function for the embedding object (NULL if none)
+ *	@put:	The put function for the embedding object (NULL if none)
+ *
+ * Initialises the klist structure.  If the klist_node structures are
+ * going to be embedded in refcounted objects (necessary for safe
+ * deletion) then the get/put arguments are used to initialise
+ * functions that take and release references on the embedding
+ * objects.
+ */
+
+void klist_init(struct klist * k, void (*get)(struct klist_node *),
+		void (*put)(struct klist_node *))
+{
+	INIT_LIST_HEAD(&k->k_list);
+	spin_lock_init(&k->k_lock);
+	k->get = get;
+	k->put = put;
+}
+
+EXPORT_SYMBOL_GPL(klist_init);
+
+
+static void add_head(struct klist * k, struct klist_node * n)
+{
+	spin_lock(&k->k_lock);
+	list_add(&n->n_node, &k->k_list);
+	spin_unlock(&k->k_lock);
+}
+
+static void add_tail(struct klist * k, struct klist_node * n)
+{
+	spin_lock(&k->k_lock);
+	list_add_tail(&n->n_node, &k->k_list);
+	spin_unlock(&k->k_lock);
+}
+
+
+static void klist_node_init(struct klist * k, struct klist_node * n)
+{
+	INIT_LIST_HEAD(&n->n_node);
+	init_completion(&n->n_removed);
+	kref_init(&n->n_ref);
+	n->n_klist = k;
+	if (k->get)
+		k->get(n);
+}
+
+
+/**
+ *	klist_add_head - Initialize a klist_node and add it to front.
+ *	@n:	node we're adding.
+ *	@k:	klist it's going on.
+ */
+
+void klist_add_head(struct klist_node * n, struct klist * k)
+{
+	klist_node_init(k, n);
+	add_head(k, n);
+}
+
+EXPORT_SYMBOL_GPL(klist_add_head);
+
+
+/**
+ *	klist_add_tail - Initialize a klist_node and add it to back.
+ *	@n:	node we're adding.
+ *	@k:	klist it's going on.
+ */
+
+void klist_add_tail(struct klist_node * n, struct klist * k)
+{
+	klist_node_init(k, n);
+	add_tail(k, n);
+}
+
+EXPORT_SYMBOL_GPL(klist_add_tail);
+
+
+static void klist_release(struct kref * kref)
+{
+	struct klist_node * n = container_of(kref, struct klist_node, n_ref);
+
+	list_del(&n->n_node);
+	complete(&n->n_removed);
+	n->n_klist = NULL;
+}
+
+static int klist_dec_and_del(struct klist_node * n)
+{
+	return kref_put_new(&n->n_ref, klist_release);
+}
+
+
+/**
+ *	klist_del - Decrement the reference count of node and try to remove.
+ *	@n:	node we're deleting.
+ */
+
+void klist_del(struct klist_node * n)
+{
+	struct klist * k = n->n_klist;
+	void (*put)(struct klist_node *) = k->put;
+
+	spin_lock(&k->k_lock);
+	if (!klist_dec_and_del(n))
+		put = NULL;
+	spin_unlock(&k->k_lock);
+	if (put)
+		put(n);
+}
+
+EXPORT_SYMBOL_GPL(klist_del);
+
+
+/**
+ *	klist_remove - Decrement the refcount of node and wait for it to go away.
+ *	@n:	node we're removing.
+ */
+
+void klist_remove(struct klist_node * n)
+{
+	klist_del(n);
+	wait_for_completion(&n->n_removed);
+}
+
+EXPORT_SYMBOL_GPL(klist_remove);
+
+
+/**
+ *	klist_node_attached - Say whether a node is bound to a list or not.
+ *	@n:	Node that we're testing.
+ */
+
+int klist_node_attached(struct klist_node * n)
+{
+	return (n->n_klist != NULL);
+}
+
+EXPORT_SYMBOL_GPL(klist_node_attached);
+
+
+/**
+ *	klist_iter_init_node - Initialize a klist_iter structure.
+ *	@k:	klist we're iterating.
+ *	@i:	klist_iter we're filling.
+ *	@n:	node to start with.
+ *
+ *	Similar to klist_iter_init(), but starts the action off with @n, 
+ *	instead of with the list head.
+ */
+
+void klist_iter_init_node(struct klist * k, struct klist_iter * i, struct klist_node * n)
+{
+	i->i_klist = k;
+	i->i_head = &k->k_list;
+	i->i_cur = n;
+	if (n)
+		kref_get(&n->n_ref);
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_init_node);
+
+
+/**
+ *	klist_iter_init - Initialize a klist_iter structure.
+ *	@k:	klist we're iterating.
+ *	@i:	klist_iter structure we're filling.
+ *
+ *	Similar to klist_iter_init_node(), but start with the list head.
+ */
+
+void klist_iter_init(struct klist * k, struct klist_iter * i)
+{
+	klist_iter_init_node(k, i, NULL);
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_init);
+
+
+/**
+ *	klist_iter_exit - Finish a list iteration.
+ *	@i:	Iterator structure.
+ *
+ *	Must be called when done iterating over list, as it decrements the 
+ *	refcount of the current node. Necessary in case iteration exited before
+ *	the end of the list was reached, and always good form.
+ */
+
+void klist_iter_exit(struct klist_iter * i)
+{
+	if (i->i_cur) {
+		klist_del(i->i_cur);
+		i->i_cur = NULL;
+	}
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_exit);
+
+
+static struct klist_node * to_klist_node(struct list_head * n)
+{
+	return container_of(n, struct klist_node, n_node);
+}
+
+
+/**
+ *	klist_next - Ante up next node in list.
+ *	@i:	Iterator structure.
+ *
+ *	First grab list lock. Decrement the reference count of the previous
+ *	node, if there was one. Grab the next node, increment its reference 
+ *	count, drop the lock, and return that next node.
+ */
+
+struct klist_node * klist_next(struct klist_iter * i)
+{
+	struct list_head * next;
+	struct klist_node * lnode = i->i_cur;
+	struct klist_node * knode = NULL;
+	void (*put)(struct klist_node *) = i->i_klist->put;
+
+	spin_lock(&i->i_klist->k_lock);
+	if (lnode) {
+		next = lnode->n_node.next;
+		if (!klist_dec_and_del(lnode))
+			put = NULL;
+	} else
+		next = i->i_head->next;
+
+	if (next != i->i_head) {
+		knode = to_klist_node(next);
+		kref_get(&knode->n_ref);
+	}
+	i->i_cur = knode;
+	spin_unlock(&i->i_klist->k_lock);
+	if (put && lnode)
+		put(lnode);
+	return knode;
+}
+
+EXPORT_SYMBOL_GPL(klist_next);
+
+
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/kref_new.c b/kernel_addons/backport/2.6.9_U3/include/src/kref_new.c
new file mode 100644
index 0000000..d45bb3f
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/kref_new.c
@@ -0,0 +1,29 @@
+#include <linux/kref.h>
+#include <linux/module.h>
+
+/**
+ * kref_put_new - decrement refcount for object.
+ * @kref: object.
+ * @release: pointer to the function that will clean up the object when the
+ *           last reference to the object is released.
+ *           This pointer is required, and it is not acceptable to pass kfree
+ *           in as this function.
+ *
+ * Decrement the refcount, and if 0, call release().
+ * Return 1 if the object was removed, otherwise return 0.  Beware, if this
+ * function returns 0, you still cannot count on the kref remaining in
+ * memory.  Only use the return value if you want to see if the kref is now
+ * gone, not present.
+ */
+int kref_put_new(struct kref *kref, void (*release)(struct kref *kref))
+{
+        WARN_ON(release == NULL);
+        WARN_ON(release == (void (*)(struct kref *))kfree);
+
+        if (atomic_dec_and_test(&kref->refcount)) {
+                release(kref);
+                return 1;
+        }
+        return 0;
+}
+EXPORT_SYMBOL(kref_put_new);
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/scsi.c b/kernel_addons/backport/2.6.9_U3/include/src/scsi.c
new file mode 100644
index 0000000..8c833c0
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/scsi.c
@@ -0,0 +1,50 @@
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/delay.h>
+#include <linux/init.h>
+#include <linux/completion.h>
+#include <linux/unistd.h>
+#include <linux/spinlock.h>
+#include <linux/kmod.h>
+#include <linux/interrupt.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
+#include <linux/mutex.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_dbg.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_tcq.h>
+
+/**
+ * starget_for_each_device  -  helper to walk all devices of a target
+ * @starget:	target whose devices we want to iterate over.
+ *
+ * This traverses over each device of @starget.  The devices have
+ * a reference that must be released by scsi_host_put when breaking
+ * out of the loop.
+ */
+void starget_for_each_device(struct scsi_target *starget, void * data,
+		     void (*fn)(struct scsi_device *, void *))
+{
+	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
+	struct scsi_device *sdev;
+
+	printk("%s: entry\n", __FUNCTION__);
+	shost_for_each_device(sdev, shost) {
+		if ((sdev->channel == starget->channel) &&
+		    (sdev->id == starget->id))
+			fn(sdev, data);
+	}
+	printk("%s: exit\n", __FUNCTION__);
+}
+EXPORT_SYMBOL(starget_for_each_device);
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/scsi_lib.c b/kernel_addons/backport/2.6.9_U3/include/src/scsi_lib.c
new file mode 100644
index 0000000..f53f824
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/scsi_lib.c
@@ -0,0 +1,166 @@
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/completion.h>
+#include <linux/kernel.h>
+#include <linux/mempool.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <linux/hardirq.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_dbg.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_driver.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_host.h>
+
+int scsi_is_target_device(const struct device *dev)
+{
+        const char *str = dev->bus_id;
+
+	if (strncmp(str, "target", 6) == 0) {
+		return 1;
+	}
+
+        return 0;
+}
+
+/**
+ * scsi_internal_device_block - internal function to put a device
+ *                              temporarily into the SDEV_BLOCK state
+ * @sdev:       device to block
+ *
+ * Block request made by scsi lld's to temporarily stop all
+ * scsi commands on the specified device.  Called from interrupt
+ * or normal process context.
+ *
+ * Returns zero if successful or error if not
+ *
+ * Notes:
+ *      This routine transitions the device to the SDEV_BLOCK state
+ *      (which must be a legal transition).  When the device is in this
+ *      state, all commands are deferred until the scsi lld reenables
+ *      the device with scsi_device_unblock or device_block_tmo fires.
+ *      This routine assumes the host_lock is held on entry.
+ **/
+int
+scsi_internal_device_block(struct scsi_device *sdev)
+{
+        request_queue_t *q = sdev->request_queue;
+        unsigned long flags;
+        int err = 0;
+
+        err = scsi_device_set_state(sdev, SDEV_BLOCK);
+        if (err)
+		return err;
+                
+        /*
+         * The device has transitioned to SDEV_BLOCK.  Stop the
+         * block layer from calling the midlayer with this device's
+         * request queue.
+         */
+        spin_lock_irqsave(q->queue_lock, flags);
+        blk_stop_queue(q);
+        spin_unlock_irqrestore(q->queue_lock, flags);
+
+        return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_internal_device_block);
+
+/**
+ * scsi_internal_device_unblock - resume a device after a block request
+ * @sdev:       device to resume
+ *
+ * Called by scsi lld's or the midlayer to restart the device queue
+ * for the previously suspended scsi device.  Called from interrupt or
+ * normal process context.
+ *
+ * Returns zero if successful or error if not.
+ *
+ * Notes:
+ *      This routine transitions the device to the SDEV_RUNNING state
+ *      (which must be a legal transition) allowing the midlayer to
+ *      goose the queue for this device.  This routine assumes the
+ *      host_lock is held upon entry.
+ **/
+int
+scsi_internal_device_unblock(struct scsi_device *sdev)
+{
+        request_queue_t *q = sdev->request_queue;
+        int err;
+        unsigned long flags;
+
+
+        /*
+         * Try to transition the scsi device to SDEV_RUNNING
+         * and goose the device queue if successful.
+         */
+        err = scsi_device_set_state(sdev, SDEV_RUNNING);
+        if (err)
+		return err;
+                
+        spin_lock_irqsave(q->queue_lock, flags);
+        blk_start_queue(q);
+        spin_unlock_irqrestore(q->queue_lock, flags);
+
+        return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_internal_device_unblock);
+
+static void
+device_block(struct scsi_device *sdev, void *data)
+{
+        scsi_internal_device_block(sdev);
+}
+
+static int
+target_block(struct device *dev, void *data)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_block);
+
+        return 0;
+}
+
+void
+scsi_target_block(struct device *dev)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_block);
+        else
+                device_for_each_child(dev, NULL, target_block);
+}
+EXPORT_SYMBOL_GPL(scsi_target_block);
+
+static void
+device_unblock(struct scsi_device *sdev, void *data)
+{
+        scsi_internal_device_unblock(sdev);
+}
+
+static int
+target_unblock(struct device *dev, void *data)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_unblock);
+        return 0;
+}
+
+void
+scsi_target_unblock(struct device *dev)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_unblock);
+        else
+                device_for_each_child(dev, NULL, target_unblock);
+}
+EXPORT_SYMBOL_GPL(scsi_target_unblock);
+
+MODULE_LICENSE("GPL");
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/scsi_scan.c b/kernel_addons/backport/2.6.9_U3/include/src/scsi_scan.c
new file mode 100644
index 0000000..b7b7674
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/scsi_scan.c
@@ -0,0 +1,48 @@
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/spinlock.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_driver.h>
+#include <scsi/scsi_devinfo.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_transport.h>
+#include <scsi/scsi_eh.h>
+
+/**
+ * int_to_scsilun: reverts an int into a scsi_lun
+ * @lun:        integer to be reverted
+ * @scsilun:    struct scsi_lun to be set.
+ *
+ * Description:
+ *     Reverses the functionality of scsilun_to_int(), which packed
+ *     an 8-byte lun value into an int. This routine unpacks the int
+ *     back into the lun value.
+ *     Note: the scsilun_to_int() routine does not truly handle all
+ *     8 bytes of the lun value. This function restores only as much
+ *     as was set by the routine.
+ *
+ * Notes:
+ *     Given an integer 0x0b030a04, this function fills in a
+ *     struct scsi_lun of: 0a 04 0b 03 00 00 00 00
+ *
+ **/
+void int_to_scsilun(unsigned int lun, struct scsi_lun *scsilun)
+{
+        int i;
+
+        memset(scsilun->scsi_lun, 0, sizeof(scsilun->scsi_lun));
+
+        for (i = 0; i < sizeof(lun); i += 2) {
+                scsilun->scsi_lun[i] = (lun >> 8) & 0xFF;
+                scsilun->scsi_lun[i+1] = lun & 0xFF;
+                lun = lun >> 16;
+        }
+}
+EXPORT_SYMBOL(int_to_scsilun);
diff --git a/kernel_addons/backport/2.6.9_U3/include/src/transport_class.c b/kernel_addons/backport/2.6.9_U3/include/src/transport_class.c
new file mode 100644
index 0000000..f25e7c6
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U3/include/src/transport_class.c
@@ -0,0 +1,280 @@
+/*
+ * transport_class.c - implementation of generic transport classes
+ *                     using attribute_containers
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ *
+ * The basic idea here is to allow any "device controller" (which
+ * would most often be a Host Bus Adapter) to use the services of one
+ * or more transport classes for performing transport specific
+ * services.  Transport specific services are things that the generic
+ * command layer doesn't want to know about (speed settings, line
+ * conditioning, etc), but which the user might be interested in.
+ * Thus, the HBAs use the routines exported by the transport classes
+ * to perform these functions.  The transport classes export certain
+ * values to the user via sysfs using attribute containers.
+ *
+ * Note: because not every HBA will care about every transport
+ * attribute, there's a many to one relationship that goes like this:
+ *
+ * transport class<-----attribute container<----class device
+ *
+ * Usually the attribute container is per-HBA, but the design doesn't
+ * mandate that.  Although most of the services will be specific to
+ * the actual external storage connection used by the HBA, the generic
+ * transport class is framed entirely in terms of generic devices to
+ * allow it to be used by any physical HBA in the system.
+ */
+#include <linux/attribute_container.h>
+#include <linux/transport_class.h>
+
+/**
+ * transport_class_register - register an initial transport class
+ *
+ * @tclass:	a pointer to the transport class structure to be initialised
+ *
+ * The transport class contains an embedded class which is used to
+ * identify it.  The caller should initialise this structure with
+ * zeros, and the generic class must then be initialised with the
+ * actual transport class unique name.  There's a macro
+ * DECLARE_TRANSPORT_CLASS() to do this (declared classes still must
+ * be registered).
+ *
+ * Returns 0 on success or error on failure.
+ */
+int transport_class_register(struct transport_class *tclass)
+{
+	return class_register(&tclass->class);
+}
+EXPORT_SYMBOL_GPL(transport_class_register);
+
+/**
+ * transport_class_unregister - unregister a previously registered class
+ *
+ * @tclass: The transport class to unregister
+ *
+ * Must be called prior to deallocating the memory for the transport
+ * class.
+ */
+void transport_class_unregister(struct transport_class *tclass)
+{
+	class_unregister(&tclass->class);
+}
+EXPORT_SYMBOL_GPL(transport_class_unregister);
+
+static int anon_transport_dummy_function(struct transport_container *tc,
+					 struct device *dev,
+					 struct class_device *cdev)
+{
+	/* do nothing */
+	return 0;
+}
+
+/**
+ * anon_transport_class_register - register an anonymous class
+ *
+ * @atc: The anon transport class to register
+ *
+ * The anonymous transport class contains both a transport class and a
+ * container.  The idea of an anonymous class is that it never
+ * actually has any device attributes associated with it (and thus
+ * saves on container storage).  So it can only be used for triggering
+ * events.  Zero the structure first, then use
+ * DECLARE_ANON_TRANSPORT_CLASS() to initialise its storage.
+ */
+int anon_transport_class_register(struct anon_transport_class *atc)
+{
+	int error;
+	atc->container.class = &atc->tclass.class;
+	attribute_container_set_no_classdevs(&atc->container);
+	error = attribute_container_register(&atc->container);
+	if (error)
+		return error;
+	atc->tclass.setup = anon_transport_dummy_function;
+	atc->tclass.remove = anon_transport_dummy_function;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(anon_transport_class_register);
+
+/**
+ * anon_transport_class_unregister - unregister an anon class
+ *
+ * @atc: Pointer to the anon transport class to unregister
+ *
+ * Must be called prior to deallocating the memory for the anon
+ * transport class.
+ */
+void anon_transport_class_unregister(struct anon_transport_class *atc)
+{
+	attribute_container_unregister(&atc->container);
+}
+EXPORT_SYMBOL_GPL(anon_transport_class_unregister);
+
+static int transport_setup_classdev(struct attribute_container *cont,
+				    struct device *dev,
+				    struct class_device *classdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+	struct transport_container *tcont = attribute_container_to_transport_container(cont);
+
+	if (tclass->setup)
+		tclass->setup(tcont, dev, classdev);
+
+	return 0;
+}
+
+/**
+ * transport_setup_device - declare a new dev for transport class association
+ *			    but don't make it visible yet.
+ *
+ * @dev: the generic device representing the entity being added
+ *
+ * Usually, dev represents some component in the HBA system (either
+ * the HBA itself or a device remote across the HBA bus).  This
+ * routine is simply a trigger point to see if any set of transport
+ * classes wishes to associate with the added device.  This allocates
+ * storage for the class device and initialises it, but does not yet
+ * add it to the system or add attributes to it (you do this with
+ * transport_add_device).  If you have no need for separate setup
+ * and add operations, use transport_register_device (see
+ * transport_class.h).
+ */
+
+void transport_setup_device(struct device *dev)
+{
+	attribute_container_add_device(dev, transport_setup_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_setup_device);
+
+static int transport_add_class_device(struct attribute_container *cont,
+				      struct device *dev,
+				      struct class_device *classdev)
+{
+	int error = attribute_container_add_class_device(classdev);
+	struct transport_container *tcont = 
+		attribute_container_to_transport_container(cont);
+
+	if (!error && tcont->statistics)
+		error = sysfs_create_group(&classdev->kobj, tcont->statistics);
+
+	return error;
+}
+
+
+/**
+ * transport_add_device - declare a new dev for transport class association
+ *
+ * @dev: the generic device representing the entity being added
+ *
+ * Usually, dev represents some component in the HBA system (either
+ * the HBA itself or a device remote across the HBA bus).  This
+ * routine is simply a trigger point used to add the device to the
+ * system and register attributes for it.
+ */
+
+void transport_add_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_add_class_device);
+}
+EXPORT_SYMBOL_GPL(transport_add_device);
+
+static int transport_configure(struct attribute_container *cont,
+			       struct device *dev,
+			       struct class_device *cdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+	struct transport_container *tcont = attribute_container_to_transport_container(cont);
+
+	if (tclass->configure)
+		tclass->configure(tcont, dev, cdev);
+
+	return 0;
+}
+
+/**
+ * transport_configure_device - configure an already set up device
+ *
+ * @dev: generic device representing device to be configured
+ *
+ * The idea of configure is simply to provide a point within the setup
+ * process to allow the transport class to extract information from a
+ * device after it has been set up.  This is used in SCSI because we
+ * have to have a setup device to begin using the HBA, but after we
+ * send the initial inquiry, we use configure to extract the device
+ * parameters.  The device need not have been added to be configured.
+ */
+void transport_configure_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_configure);
+}
+EXPORT_SYMBOL_GPL(transport_configure_device);
+
+static int transport_remove_classdev(struct attribute_container *cont,
+				     struct device *dev,
+				     struct class_device *classdev)
+{
+	struct transport_container *tcont = 
+		attribute_container_to_transport_container(cont);
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+
+	if (tclass->remove)
+		tclass->remove(tcont, dev, classdev);
+
+	if (tclass->remove != anon_transport_dummy_function) {
+		if (tcont->statistics)
+			sysfs_remove_group(&classdev->kobj, tcont->statistics);
+		attribute_container_class_device_del(classdev);
+	}
+
+	return 0;
+}
+
+
+/**
+ * transport_remove_device - remove the visibility of a device
+ *
+ * @dev: generic device to remove
+ *
+ * This call removes the visibility of the device (to the user from
+ * sysfs), but does not destroy it.  To eliminate a device entirely
+ * you must also call transport_destroy_device.  If you don't need to
+ * do remove and destroy as separate operations, use
+ * transport_unregister_device() (see transport_class.h) which will
+ * perform both calls for you.
+ */
+void transport_remove_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_remove_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_remove_device);
+
+static void transport_destroy_classdev(struct attribute_container *cont,
+				      struct device *dev,
+				      struct class_device *classdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+
+	if (tclass->remove != anon_transport_dummy_function)
+		class_device_put(classdev);
+}
+
+
+/**
+ * transport_destroy_device - destroy a removed device
+ *
+ * @dev: device to eliminate from the transport class.
+ *
+ * This call triggers the elimination of storage associated with the
+ * transport classdev.  Note: all it really does is relinquish a
+ * reference to the classdev.  The memory will not be freed until the
+ * last reference goes to zero.  Note also that the classdev retains a
+ * reference count on dev, so dev too will remain for as long as the
+ * transport class device remains around.
+ */
+void transport_destroy_device(struct device *dev)
+{
+	attribute_container_remove_device(dev, transport_destroy_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_destroy_device);
diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/attribute_container.h b/kernel_addons/backport/2.6.9_U4/include/linux/attribute_container.h
new file mode 100644
index 0000000..93bfb0b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/linux/attribute_container.h
@@ -0,0 +1,71 @@
+/*
+ * attribute_container.h - a generic container for all classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ */
+
+#ifndef _ATTRIBUTE_CONTAINER_H_
+#define _ATTRIBUTE_CONTAINER_H_
+
+#include <linux/device.h>
+#include <linux/list.h>
+#include <linux/klist.h>
+#include <linux/spinlock.h>
+
+struct attribute_container {
+	struct list_head	node;
+	struct klist		containers;
+	struct class		*class;
+	struct class_device_attribute **attrs;
+	int (*match)(struct attribute_container *, struct device *);
+#define	ATTRIBUTE_CONTAINER_NO_CLASSDEVS	0x01
+	unsigned long		flags;
+};
+
+static inline int
+attribute_container_no_classdevs(struct attribute_container *atc)
+{
+	return atc->flags & ATTRIBUTE_CONTAINER_NO_CLASSDEVS;
+}
+
+static inline void
+attribute_container_set_no_classdevs(struct attribute_container *atc)
+{
+	atc->flags |= ATTRIBUTE_CONTAINER_NO_CLASSDEVS;
+}
+
+int attribute_container_register(struct attribute_container *cont);
+int attribute_container_unregister(struct attribute_container *cont);
+void attribute_container_create_device(struct device *dev,
+				       int (*fn)(struct attribute_container *,
+						 struct device *,
+						 struct class_device *));
+void attribute_container_add_device(struct device *dev,
+				    int (*fn)(struct attribute_container *,
+					      struct device *,
+					      struct class_device *));
+void attribute_container_remove_device(struct device *dev,
+				       void (*fn)(struct attribute_container *,
+						  struct device *,
+						  struct class_device *));
+void attribute_container_device_trigger(struct device *dev, 
+					int (*fn)(struct attribute_container *,
+						  struct device *,
+						  struct class_device *));
+void attribute_container_trigger(struct device *dev, 
+				 int (*fn)(struct attribute_container *,
+					   struct device *));
+int attribute_container_add_attrs(struct class_device *classdev);
+int attribute_container_add_class_device(struct class_device *classdev);
+int attribute_container_add_class_device_adapter(struct attribute_container *cont,
+						 struct device *dev,
+						 struct class_device *classdev);
+void attribute_container_remove_attrs(struct class_device *classdev);
+void attribute_container_class_device_del(struct class_device *classdev);
+struct attribute_container *attribute_container_classdev_to_container(struct class_device *);
+struct class_device *attribute_container_find_class_device(struct attribute_container *, struct device *);
+struct class_device_attribute **attribute_container_classdev_to_attrs(const struct class_device *classdev);
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/crypto.h b/kernel_addons/backport/2.6.9_U4/include/linux/crypto.h
new file mode 100644
index 0000000..aecccde
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/linux/crypto.h
@@ -0,0 +1,11 @@
+#ifndef LINUX_CRYPTO_BACKPORT_H
+#define LINUX_CRYPTO_BACKPORT_H
+
+#include_next <linux/crypto.h>
+
+#define crypto_hash_init(desc) crypto_digest_init(*desc)
+#define crypto_hash_digest(desc, sg, nbytes, out) crypto_digest_digest(*desc, sg, 1, out)
+#define crypto_hash_update(desc, sg, nbytes) crypto_digest_update(*desc, sg, 1)
+#define crypto_hash_final(desc, out) crypto_digest_final(*desc, out)
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/kernel.h b/kernel_addons/backport/2.6.9_U4/include/linux/kernel.h
index a37dcd5..02a5907 100644
--- a/kernel_addons/backport/2.6.9_U4/include/linux/kernel.h
+++ b/kernel_addons/backport/2.6.9_U4/include/linux/kernel.h
@@ -4,4 +4,7 @@ #define BACKPORT_KERNEL_H_2_6_19
 #include_next <linux/kernel.h>
 #include <linux/log2.h>
 
+#define NIP6_FMT "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x"
+#define NIPQUAD_FMT "%u.%u.%u.%u"
+
 #endif
diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/kfifo.h b/kernel_addons/backport/2.6.9_U4/include/linux/kfifo.h
index 48eccd8..2b94461 100644
--- a/kernel_addons/backport/2.6.9_U4/include/linux/kfifo.h
+++ b/kernel_addons/backport/2.6.9_U4/include/linux/kfifo.h
@@ -25,6 +25,7 @@ #ifdef __KERNEL__
 
 #include <linux/kernel.h>
 #include <linux/spinlock.h>
+#include <linux/gfp.h>
 
 struct kfifo {
 	unsigned char *buffer;	/* the buffer holding the data */
diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/klist.h b/kernel_addons/backport/2.6.9_U4/include/linux/klist.h
new file mode 100644
index 0000000..7407125
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/linux/klist.h
@@ -0,0 +1,61 @@
+/*
+ *	klist.h - Some generic list helpers, extending struct list_head a bit.
+ *
+ *	Implementations are found in lib/klist.c
+ *
+ *
+ *	Copyright (C) 2005 Patrick Mochel
+ *
+ *	This file is released under the GPL v2.
+ */
+
+#ifndef _LINUX_KLIST_H
+#define _LINUX_KLIST_H
+
+#include <linux/spinlock.h>
+#include <linux/completion.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+
+struct klist_node;
+struct klist {
+	spinlock_t		k_lock;
+	struct list_head	k_list;
+	void			(*get)(struct klist_node *);
+	void			(*put)(struct klist_node *);
+};
+
+
+extern void klist_init(struct klist * k, void (*get)(struct klist_node *),
+		       void (*put)(struct klist_node *));
+
+struct klist_node {
+	struct klist		* n_klist;
+	struct list_head	n_node;
+	struct kref		n_ref;
+	struct completion	n_removed;
+};
+
+extern void klist_add_tail(struct klist_node * n, struct klist * k);
+extern void klist_add_head(struct klist_node * n, struct klist * k);
+
+extern void klist_del(struct klist_node * n);
+extern void klist_remove(struct klist_node * n);
+
+extern int klist_node_attached(struct klist_node * n);
+
+
+struct klist_iter {
+	struct klist		* i_klist;
+	struct list_head	* i_head;
+	struct klist_node	* i_cur;
+};
+
+
+extern void klist_iter_init(struct klist * k, struct klist_iter * i);
+extern void klist_iter_init_node(struct klist * k, struct klist_iter * i, 
+				 struct klist_node * n);
+extern void klist_iter_exit(struct klist_iter * i);
+extern struct klist_node * klist_next(struct klist_iter * i);
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/memory.h b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h
new file mode 100644
index 0000000..654ef55
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h
@@ -0,0 +1,89 @@
+/*
+ * include/linux/memory.h - generic memory definition
+ *
+ * This is mainly for topological representation. We define the
+ * basic "struct memory_block" here, which can be embedded in per-arch
+ * definitions or NUMA information.
+ *
+ * Basic handling of the devices is done in drivers/base/memory.c
+ * and system devices are handled in drivers/base/sys.c.
+ *
+ * Memory blocks are exported via sysfs in the class/memory/devices/
+ * directory.
+ *
+ */
+#ifndef _LINUX_MEMORY_H_
+#define _LINUX_MEMORY_H_
+
+#include <linux/sysdev.h>
+#include <linux/node.h>
+#include <linux/compiler.h>
+
+#include <asm/semaphore.h>
+
+struct memory_block {
+	unsigned long phys_index;
+	unsigned long state;
+	/*
+	 * This serializes all state change requests.  It isn't
+	 * held during creation because the control files are
+	 * created long after the critical areas during
+	 * initialization.
+	 */
+	struct semaphore state_sem;
+	int phys_device;		/* to which fru does this belong? */
+	void *hw;			/* optional pointer to fw/hw data */
+	int (*phys_callback)(struct memory_block *);
+	struct sys_device sysdev;
+};
+
+/* These states are exposed to userspace as text strings in sysfs */
+#define	MEM_ONLINE		(1<<0) /* exposed to userspace */
+#define	MEM_GOING_OFFLINE	(1<<1) /* exposed to userspace */
+#define	MEM_OFFLINE		(1<<2) /* exposed to userspace */
+
+/*
+ * All of these states are currently kernel-internal for notifying
+ * kernel components and architectures.
+ *
+ * For MEM_MAPPING_INVALID, all notifier chains with priority >0
+ * are called before pfn_to_page() becomes invalid.  The priority=0
+ * entry is reserved for the function that actually makes
+ * pfn_to_page() stop working.  Any notifiers that want to be called
+ * after that should have priority <0.
+ */
+#define	MEM_MAPPING_INVALID	(1<<3)
+
+struct notifier_block;
+struct mem_section;
+
+#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE
+static inline int memory_dev_init(void)
+{
+	return 0;
+}
+static inline int register_memory_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
+static inline void unregister_memory_notifier(struct notifier_block *nb)
+{
+}
+#else
+extern int register_new_memory(struct mem_section *);
+extern int unregister_memory_section(struct mem_section *);
+extern int memory_dev_init(void);
+extern int remove_memory_block(unsigned long, struct mem_section *, int);
+
+#define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
+
+
+#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
+
+#define hotplug_memory_notifier(fn, pri) {			\
+	static struct notifier_block fn##_mem_nb =		\
+		{ .notifier_call = fn, .priority = pri };	\
+	register_memory_notifier(&fn##_mem_nb);			\
+}
+
+#endif /* _LINUX_MEMORY_H_ */
diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/netlink.h b/kernel_addons/backport/2.6.9_U4/include/linux/netlink.h
new file mode 100644
index 0000000..6d69105
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/linux/netlink.h
@@ -0,0 +1,14 @@
+#ifndef BACKPORT_LINUX_NETLINK_H
+#define BACKPORT_LINUX_NETLINK_H
+
+#include_next <linux/netlink.h>
+
+#define __nlmsg_put(skb, daemon_pid, seq, type, len, flags) \
+       __nlmsg_put(skb, daemon_pid, 0, 0, len)
+
+#define netlink_kernel_create(uint, groups, input, mod) \
+       netlink_kernel_create(uint, input)
+
+#define NETLINK_ISCSI           8
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/transport_class.h b/kernel_addons/backport/2.6.9_U4/include/linux/transport_class.h
new file mode 100644
index 0000000..1d6cc22
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/linux/transport_class.h
@@ -0,0 +1,100 @@
+/*
+ * transport_class.h - a generic container for all transport classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ */
+
+#ifndef _TRANSPORT_CLASS_H_
+#define _TRANSPORT_CLASS_H_
+
+#include <linux/device.h>
+#include <linux/attribute_container.h>
+
+struct transport_container;
+
+struct transport_class {
+	struct class class;
+	int (*setup)(struct transport_container *, struct device *,
+		     struct class_device *);
+	int (*configure)(struct transport_container *, struct device *,
+			 struct class_device *);
+	int (*remove)(struct transport_container *, struct device *,
+		      struct class_device *);
+};
+
+#define DECLARE_TRANSPORT_CLASS(cls, nm, su, rm, cfg)			\
+struct transport_class cls = {						\
+	.class = {							\
+		.name = nm,						\
+	},								\
+	.setup = su,							\
+	.remove = rm,							\
+	.configure = cfg,						\
+}
+
+
+struct anon_transport_class {
+	struct transport_class tclass;
+	struct attribute_container container;
+};
+
+#define DECLARE_ANON_TRANSPORT_CLASS(cls, mtch, cfg)		\
+struct anon_transport_class cls = {				\
+	.tclass = {						\
+		.configure = cfg,				\
+	},							\
+	.container = {						\
+		.match = mtch,					\
+	},							\
+}
+
+#define class_to_transport_class(x) \
+	container_of(x, struct transport_class, class)
+
+struct transport_container {
+	struct attribute_container ac;
+	struct attribute_group *statistics;
+};
+
+#define attribute_container_to_transport_container(x) \
+	container_of(x, struct transport_container, ac)
+
+void transport_remove_device(struct device *);
+void transport_add_device(struct device *);
+void transport_setup_device(struct device *);
+void transport_configure_device(struct device *);
+void transport_destroy_device(struct device *);
+
+static inline void
+transport_register_device(struct device *dev)
+{
+	transport_setup_device(dev);
+	transport_add_device(dev);
+}
+
+static inline void
+transport_unregister_device(struct device *dev)
+{
+	transport_remove_device(dev);
+	transport_destroy_device(dev);
+}
+
+static inline int transport_container_register(struct transport_container *tc)
+{
+	return attribute_container_register(&tc->ac);
+}
+
+static inline int transport_container_unregister(struct transport_container *tc)
+{
+	return attribute_container_unregister(&tc->ac);
+}
+
+int transport_class_register(struct transport_class *);
+int anon_transport_class_register(struct anon_transport_class *);
+void transport_class_unregister(struct transport_class *);
+void anon_transport_class_unregister(struct anon_transport_class *);
+
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U4/include/scsi/iscsi_proto.h b/kernel_addons/backport/2.6.9_U4/include/scsi/iscsi_proto.h
new file mode 100644
index 0000000..02f6e4b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/scsi/iscsi_proto.h
@@ -0,0 +1,587 @@
+/*
+ * RFC 3720 (iSCSI) protocol data types
+ *
+ * Copyright (C) 2005 Dmitry Yusupov
+ * Copyright (C) 2005 Alex Aizman
+ * maintained by open-iscsi at googlegroups.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published
+ * by the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * See the file COPYING included with this distribution for more details.
+ */
+
+#ifndef ISCSI_PROTO_H
+#define ISCSI_PROTO_H
+
+#define ISCSI_DRAFT20_VERSION	0x00
+
+/* default iSCSI listen port for incoming connections */
+#define ISCSI_LISTEN_PORT	3260
+
+/* Padding word length */
+#define PAD_WORD_LEN		4
+
+/*
+ * useful common (control and data path) macros
+ */
+#define ntoh24(p) (((p)[0] << 16) | ((p)[1] << 8) | ((p)[2]))
+#define hton24(p, v) { \
+        p[0] = (((v) >> 16) & 0xFF); \
+        p[1] = (((v) >> 8) & 0xFF); \
+        p[2] = ((v) & 0xFF); \
+}
+#define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;}
+
+/*
+ * iSCSI Template Message Header
+ */
+struct iscsi_hdr {
+	uint8_t		opcode;
+	uint8_t		flags;		/* Final bit */
+	uint8_t		rsvd2[2];
+	uint8_t		hlength;	/* AHSs total length */
+	uint8_t		dlength[3];	/* Data length */
+	uint8_t		lun[8];
+	__be32		itt;		/* Initiator Task Tag */
+	__be32		ttt;		/* Target Transfer Tag */
+	__be32		statsn;
+	__be32		exp_statsn;
+	__be32		max_statsn;
+	uint8_t		other[12];
+};
+
+/************************* RFC 3720 Begin *****************************/
+
+#define ISCSI_RESERVED_TAG		0xffffffff
+
+/* Opcode encoding bits */
+#define ISCSI_OP_RETRY			0x80
+#define ISCSI_OP_IMMEDIATE		0x40
+#define ISCSI_OPCODE_MASK		0x3F
+
+/* Initiator Opcode values */
+#define ISCSI_OP_NOOP_OUT		0x00
+#define ISCSI_OP_SCSI_CMD		0x01
+#define ISCSI_OP_SCSI_TMFUNC		0x02
+#define ISCSI_OP_LOGIN			0x03
+#define ISCSI_OP_TEXT			0x04
+#define ISCSI_OP_SCSI_DATA_OUT		0x05
+#define ISCSI_OP_LOGOUT			0x06
+#define ISCSI_OP_SNACK			0x10
+
+#define ISCSI_OP_VENDOR1_CMD		0x1c
+#define ISCSI_OP_VENDOR2_CMD		0x1d
+#define ISCSI_OP_VENDOR3_CMD		0x1e
+#define ISCSI_OP_VENDOR4_CMD		0x1f
+
+/* Target Opcode values */
+#define ISCSI_OP_NOOP_IN		0x20
+#define ISCSI_OP_SCSI_CMD_RSP		0x21
+#define ISCSI_OP_SCSI_TMFUNC_RSP	0x22
+#define ISCSI_OP_LOGIN_RSP		0x23
+#define ISCSI_OP_TEXT_RSP		0x24
+#define ISCSI_OP_SCSI_DATA_IN		0x25
+#define ISCSI_OP_LOGOUT_RSP		0x26
+#define ISCSI_OP_R2T			0x31
+#define ISCSI_OP_ASYNC_EVENT		0x32
+#define ISCSI_OP_REJECT			0x3f
+
+struct iscsi_ahs_hdr {
+	__be16 ahslength;
+	uint8_t ahstype;
+	uint8_t ahspec[5];
+};
+
+#define ISCSI_AHSTYPE_CDB		1
+#define ISCSI_AHSTYPE_RLENGTH		2
+
+/* iSCSI PDU Header */
+struct iscsi_cmd {
+	uint8_t opcode;
+	uint8_t flags;
+	__be16 rsvd2;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32 itt;	/* Initiator Task Tag */
+	__be32 data_length;
+	__be32 cmdsn;
+	__be32 exp_statsn;
+	uint8_t cdb[16];	/* SCSI Command Block */
+	/* Additional Data (Command Dependent) */
+};
+
+/* Command PDU flags */
+#define ISCSI_FLAG_CMD_FINAL		0x80
+#define ISCSI_FLAG_CMD_READ		0x40
+#define ISCSI_FLAG_CMD_WRITE		0x20
+#define ISCSI_FLAG_CMD_ATTR_MASK	0x07	/* 3 bits */
+
+/* SCSI Command Attribute values */
+#define ISCSI_ATTR_UNTAGGED		0
+#define ISCSI_ATTR_SIMPLE		1
+#define ISCSI_ATTR_ORDERED		2
+#define ISCSI_ATTR_HEAD_OF_QUEUE	3
+#define ISCSI_ATTR_ACA			4
+
+struct iscsi_rlength_ahdr {
+	__be16 ahslength;
+	uint8_t ahstype;
+	uint8_t reserved;
+	__be32 read_length;
+};
+
+/* SCSI Response Header */
+struct iscsi_cmd_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t response;
+	uint8_t cmd_status;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rsvd1;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	exp_datasn;
+	__be32	bi_residual_count;
+	__be32	residual_count;
+	/* Response or Sense Data (optional) */
+};
+
+/* Command Response PDU flags */
+#define ISCSI_FLAG_CMD_BIDI_OVERFLOW	0x10
+#define ISCSI_FLAG_CMD_BIDI_UNDERFLOW	0x08
+#define ISCSI_FLAG_CMD_OVERFLOW		0x04
+#define ISCSI_FLAG_CMD_UNDERFLOW	0x02
+
+/* iSCSI Status values. Valid if Rsp Selector bit is not set */
+#define ISCSI_STATUS_CMD_COMPLETED	0
+#define ISCSI_STATUS_TARGET_FAILURE	1
+#define ISCSI_STATUS_SUBSYS_FAILURE	2
+
+/* Asynchronous Event Header */
+struct iscsi_async {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t rsvd3;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	uint8_t rsvd4[8];
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t async_event;
+	uint8_t async_vcode;
+	__be16	param1;
+	__be16	param2;
+	__be16	param3;
+	uint8_t rsvd5[4];
+};
+
+/* iSCSI Event Codes */
+#define ISCSI_ASYNC_MSG_SCSI_EVENT			0
+#define ISCSI_ASYNC_MSG_REQUEST_LOGOUT			1
+#define ISCSI_ASYNC_MSG_DROPPING_CONNECTION		2
+#define ISCSI_ASYNC_MSG_DROPPING_ALL_CONNECTIONS	3
+#define ISCSI_ASYNC_MSG_PARAM_NEGOTIATION		4
+#define ISCSI_ASYNC_MSG_VENDOR_SPECIFIC			255
+
+/* NOP-Out Message */
+struct iscsi_nopout {
+	uint8_t opcode;
+	uint8_t flags;
+	__be16	rsvd2;
+	uint8_t rsvd3;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	ttt;	/* Target Transfer Tag */
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	uint8_t rsvd4[16];
+};
+
+/* NOP-In Message */
+struct iscsi_nopin {
+	uint8_t opcode;
+	uint8_t flags;
+	__be16	rsvd2;
+	uint8_t rsvd3;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	ttt;	/* Target Transfer Tag */
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t rsvd4[12];
+};
+
+/* SCSI Task Management Message Header */
+struct iscsi_tm {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd1[2];
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rtt;	/* Reference Task Tag */
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	__be32	refcmdsn;
+	__be32	exp_datasn;
+	uint8_t rsvd2[8];
+};
+
+#define ISCSI_FLAG_TM_FUNC_MASK			0x7F
+
+/* Function values */
+#define ISCSI_TM_FUNC_ABORT_TASK		1
+#define ISCSI_TM_FUNC_ABORT_TASK_SET		2
+#define ISCSI_TM_FUNC_CLEAR_ACA			3
+#define ISCSI_TM_FUNC_CLEAR_TASK_SET		4
+#define ISCSI_TM_FUNC_LOGICAL_UNIT_RESET	5
+#define ISCSI_TM_FUNC_TARGET_WARM_RESET		6
+#define ISCSI_TM_FUNC_TARGET_COLD_RESET		7
+#define ISCSI_TM_FUNC_TASK_REASSIGN		8
+
+/* SCSI Task Management Response Header */
+struct iscsi_tm_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t response;	/* see Response values below */
+	uint8_t qualifier;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd2[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rtt;	/* Reference Task Tag */
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t rsvd3[12];
+};
+
+/* Response values */
+#define ISCSI_TMF_RSP_COMPLETE		0x00
+#define ISCSI_TMF_RSP_NO_TASK		0x01
+#define ISCSI_TMF_RSP_NO_LUN		0x02
+#define ISCSI_TMF_RSP_TASK_ALLEGIANT	0x03
+#define ISCSI_TMF_RSP_NO_FAILOVER	0x04
+#define ISCSI_TMF_RSP_NOT_SUPPORTED	0x05
+#define ISCSI_TMF_RSP_AUTH_FAILED	0x06
+#define ISCSI_TMF_RSP_REJECTED		0xff
+
+/* Ready To Transfer Header */
+struct iscsi_r2t_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t	hlength;
+	uint8_t	dlength[3];
+	uint8_t lun[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	ttt;	/* Target Transfer Tag */
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	r2tsn;
+	__be32	data_offset;
+	__be32	data_length;
+};
+
+/* SCSI Data Hdr */
+struct iscsi_data {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t rsvd3;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;
+	__be32	ttt;
+	__be32	rsvd4;
+	__be32	exp_statsn;
+	__be32	rsvd5;
+	__be32	datasn;
+	__be32	offset;
+	__be32	rsvd6;
+	/* Payload */
+};
+
+/* SCSI Data Response Hdr */
+struct iscsi_data_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2;
+	uint8_t cmd_status;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;
+	__be32	ttt;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	datasn;
+	__be32	offset;
+	__be32	residual_count;
+};
+
+/* Data Response PDU flags */
+#define ISCSI_FLAG_DATA_ACK		0x40
+#define ISCSI_FLAG_DATA_OVERFLOW	0x04
+#define ISCSI_FLAG_DATA_UNDERFLOW	0x02
+#define ISCSI_FLAG_DATA_STATUS		0x01
+
+/* Text Header */
+struct iscsi_text {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd4[8];
+	__be32	itt;
+	__be32	ttt;
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	uint8_t rsvd5[16];
+	/* Text - key=value pairs */
+};
+
+#define ISCSI_FLAG_TEXT_CONTINUE	0x40
+
+/* Text Response Header */
+struct iscsi_text_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd4[8];
+	__be32	itt;
+	__be32	ttt;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t rsvd5[12];
+	/* Text Response - key=value pairs */
+};
+
+/* Login Header */
+struct iscsi_login {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t max_version;	/* Max. version supported */
+	uint8_t min_version;	/* Min. version supported */
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t isid[6];	/* Initiator Session ID */
+	__be16	tsih;	/* Target Session Handle */
+	__be32	itt;	/* Initiator Task Tag */
+	__be16	cid;
+	__be16	rsvd3;
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	uint8_t rsvd5[16];
+};
+
+/* Login PDU flags */
+#define ISCSI_FLAG_LOGIN_TRANSIT		0x80
+#define ISCSI_FLAG_LOGIN_CONTINUE		0x40
+#define ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK	0x0C	/* 2 bits */
+#define ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK	0x03	/* 2 bits */
+
+#define ISCSI_LOGIN_CURRENT_STAGE(flags) \
+	((flags & ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK) >> 2)
+#define ISCSI_LOGIN_NEXT_STAGE(flags) \
+	(flags & ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK)
+
+/* Login Response Header */
+struct iscsi_login_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t max_version;	/* Max. version supported */
+	uint8_t active_version;	/* Active version */
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t isid[6];	/* Initiator Session ID */
+	__be16	tsih;	/* Target Session Handle */
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rsvd3;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t status_class;	/* see Login RSP status classes below */
+	uint8_t status_detail;	/* see Login RSP Status details below */
+	uint8_t rsvd4[10];
+};
+
+/* Login stage (phase) codes for CSG, NSG */
+#define ISCSI_INITIAL_LOGIN_STAGE		-1
+#define ISCSI_SECURITY_NEGOTIATION_STAGE	0
+#define ISCSI_OP_PARMS_NEGOTIATION_STAGE	1
+#define ISCSI_FULL_FEATURE_PHASE		3
+
+/* Login Status response classes */
+#define ISCSI_STATUS_CLS_SUCCESS		0x00
+#define ISCSI_STATUS_CLS_REDIRECT		0x01
+#define ISCSI_STATUS_CLS_INITIATOR_ERR		0x02
+#define ISCSI_STATUS_CLS_TARGET_ERR		0x03
+
+/* Login Status response detail codes */
+/* Class-0 (Success) */
+#define ISCSI_LOGIN_STATUS_ACCEPT		0x00
+
+/* Class-1 (Redirection) */
+#define ISCSI_LOGIN_STATUS_TGT_MOVED_TEMP	0x01
+#define ISCSI_LOGIN_STATUS_TGT_MOVED_PERM	0x02
+
+/* Class-2 (Initiator Error) */
+#define ISCSI_LOGIN_STATUS_INIT_ERR		0x00
+#define ISCSI_LOGIN_STATUS_AUTH_FAILED		0x01
+#define ISCSI_LOGIN_STATUS_TGT_FORBIDDEN	0x02
+#define ISCSI_LOGIN_STATUS_TGT_NOT_FOUND	0x03
+#define ISCSI_LOGIN_STATUS_TGT_REMOVED		0x04
+#define ISCSI_LOGIN_STATUS_NO_VERSION		0x05
+#define ISCSI_LOGIN_STATUS_ISID_ERROR		0x06
+#define ISCSI_LOGIN_STATUS_MISSING_FIELDS	0x07
+#define ISCSI_LOGIN_STATUS_CONN_ADD_FAILED	0x08
+#define ISCSI_LOGIN_STATUS_NO_SESSION_TYPE	0x09
+#define ISCSI_LOGIN_STATUS_NO_SESSION		0x0a
+#define ISCSI_LOGIN_STATUS_INVALID_REQUEST	0x0b
+
+/* Class-3 (Target Error) */
+#define ISCSI_LOGIN_STATUS_TARGET_ERROR		0x00
+#define ISCSI_LOGIN_STATUS_SVC_UNAVAILABLE	0x01
+#define ISCSI_LOGIN_STATUS_NO_RESOURCES		0x02
+
+/* Logout Header */
+struct iscsi_logout {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd1[2];
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd2[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be16	cid;
+	uint8_t rsvd3[2];
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	uint8_t rsvd4[16];
+};
+
+/* Logout PDU flags */
+#define ISCSI_FLAG_LOGOUT_REASON_MASK	0x7F
+
+/* logout reason_code values */
+
+#define ISCSI_LOGOUT_REASON_CLOSE_SESSION	0
+#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION	1
+#define ISCSI_LOGOUT_REASON_RECOVERY		2
+#define ISCSI_LOGOUT_REASON_AEN_REQUEST		3
+
+/* Logout Response Header */
+struct iscsi_logout_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t response;	/* see Logout response values below */
+	uint8_t rsvd2;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd3[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rsvd4;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	rsvd5;
+	__be16	t2wait;
+	__be16	t2retain;
+	__be32	rsvd6;
+};
+
+/* logout response status values */
+
+#define ISCSI_LOGOUT_SUCCESS			0
+#define ISCSI_LOGOUT_CID_NOT_FOUND		1
+#define ISCSI_LOGOUT_RECOVERY_UNSUPPORTED	2
+#define ISCSI_LOGOUT_CLEANUP_FAILED		3
+
+/* SNACK Header */
+struct iscsi_snack {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[14];
+	__be32	itt;
+	__be32	begrun;
+	__be32	runlength;
+	__be32	exp_statsn;
+	__be32	rsvd3;
+	__be32	exp_datasn;
+	uint8_t rsvd6[8];
+};
+
+/* SNACK PDU flags */
+#define ISCSI_FLAG_SNACK_TYPE_MASK	0x0F	/* 4 bits */
+
+/* Reject Message Header */
+struct iscsi_reject {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t reason;
+	uint8_t rsvd2;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd3[8];
+	__be32  ffffffff;
+	uint8_t rsvd4[4];
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	datasn;
+	uint8_t rsvd5[8];
+	/* Text - Rejected hdr */
+};
+
+/* Reason for Reject */
+#define ISCSI_REASON_CMD_BEFORE_LOGIN	1
+#define ISCSI_REASON_DATA_DIGEST_ERROR	2
+#define ISCSI_REASON_DATA_SNACK_REJECT	3
+#define ISCSI_REASON_PROTOCOL_ERROR	4
+#define ISCSI_REASON_CMD_NOT_SUPPORTED	5
+#define ISCSI_REASON_IMM_CMD_REJECT		6
+#define ISCSI_REASON_TASK_IN_PROGRESS	7
+#define ISCSI_REASON_INVALID_SNACK		8
+#define ISCSI_REASON_BOOKMARK_INVALID	9
+#define ISCSI_REASON_BOOKMARK_NO_RESOURCES	10
+#define ISCSI_REASON_NEGOTIATION_RESET	11
+
+/* Max. number of Key=Value pairs in a text message */
+#define MAX_KEY_VALUE_PAIRS	8192
+
+/* maximum length for text keys/values */
+#define KEY_MAXLEN		64
+#define VALUE_MAXLEN		255
+#define TARGET_NAME_MAXLEN	VALUE_MAXLEN
+
+#define DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH	8192
+
+/************************* RFC 3720 End *****************************/
+
+#endif /* ISCSI_PROTO_H */
diff --git a/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_device.h b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_device.h
new file mode 100644
index 0000000..f353e0b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_device.h
@@ -0,0 +1,19 @@
+#ifndef _SCSI_SCSI_DEVICE_H_BACKPORT
+#define _SCSI_SCSI_DEVICE_H_BACKPORT
+
+#include_next <scsi/scsi_device.h>
+
+#include <linux/device.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/workqueue.h>
+#include <asm/atomic.h>
+
+struct scsi_lun;
+
+extern void int_to_scsilun(unsigned int, struct scsi_lun *);
+extern void scsi_target_block(struct device *);
+extern void scsi_target_unblock(struct device *);
+extern void starget_for_each_device(struct scsi_target *, void *,
+		     void (*fn)(struct scsi_device *, void *));
+#endif /* _SCSI_SCSI_DEVICE_H_BACKPORT */
diff --git a/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_host.h b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_host.h
new file mode 100644
index 0000000..b7e019b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_host.h
@@ -0,0 +1,8 @@
+#ifndef _SCSI_SCSI_HOST_H_BACKPORT
+#define _SCSI_SCSI_HOST_H_BACKPORT
+
+#include_next <scsi/scsi_host.h>
+
+#define scsi_queue_work(shost, work) schedule_work(work)
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_transport.h b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_transport.h
new file mode 100644
index 0000000..99c2b12
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_transport.h
@@ -0,0 +1,8 @@
+#ifndef _SCSI_SCSI_TRANSPORT_H_BACKPORT
+#define _SCSI_SCSI_TRANSPORT_H_BACKPORT
+
+#include_next <scsi/scsi_transport.h>
+
+#include <linux/transport_class.h>
+
+#endif /* _SCSI_SCSI_TRANSPORT_H_BACKPORT */
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/attribute_container.c b/kernel_addons/backport/2.6.9_U4/include/src/attribute_container.c
new file mode 100644
index 0000000..44948d1
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/attribute_container.c
@@ -0,0 +1,438 @@
+/*
+ * attribute_container.c - implementation of a simple container for classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ *
+ * The basic idea here is to enable a device to be attached to an
+ * arbitrary number of classes without having to allocate storage for them.
+ * Instead, the contained classes select the devices they need to attach
+ * to via a matching function.
+ */
+
+#include <linux/attribute_container.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/module.h>
+
+#include "base.h"
+
+/* This is a private structure used to tie the classdev and the
+ * container .. it should never be visible outside this file */
+struct internal_container {
+	struct klist_node node;
+	struct attribute_container *cont;
+	struct class_device classdev;
+};
+
+static void internal_container_klist_get(struct klist_node *n)
+{
+	struct internal_container *ic =
+		container_of(n, struct internal_container, node);
+	class_device_get(&ic->classdev);
+}
+
+static void internal_container_klist_put(struct klist_node *n)
+{
+	struct internal_container *ic =
+		container_of(n, struct internal_container, node);
+	class_device_put(&ic->classdev);
+}
+
+
+/**
+ * attribute_container_classdev_to_container - given a classdev, return the container
+ *
+ * @classdev: the class device created by attribute_container_add_device.
+ *
+ * Returns the container associated with this classdev.
+ */
+struct attribute_container *
+attribute_container_classdev_to_container(struct class_device *classdev)
+{
+	struct internal_container *ic =
+		container_of(classdev, struct internal_container, classdev);
+	return ic->cont;
+}
+EXPORT_SYMBOL_GPL(attribute_container_classdev_to_container);
+
+static struct list_head attribute_container_list;
+
+static DECLARE_MUTEX(attribute_container_mutex);
+
+/**
+ * attribute_container_register - register an attribute container
+ *
+ * @cont: The container to register.  This must be allocated by the
+ *        caller and should also be zeroed by it.
+ */
+int
+attribute_container_register(struct attribute_container *cont)
+{
+	INIT_LIST_HEAD(&cont->node);
+	klist_init(&cont->containers,internal_container_klist_get,
+		   internal_container_klist_put);
+		
+	down(&attribute_container_mutex);
+	list_add_tail(&cont->node, &attribute_container_list);
+	up(&attribute_container_mutex);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(attribute_container_register);
+
+/**
+ * attribute_container_unregister - remove a container registration
+ *
+ * @cont: previously registered container to remove
+ */
+int
+attribute_container_unregister(struct attribute_container *cont)
+{
+	int retval = -EBUSY;
+	down(&attribute_container_mutex);
+	spin_lock(&cont->containers.k_lock);
+	if (!list_empty(&cont->containers.k_list))
+		goto out;
+	retval = 0;
+	list_del(&cont->node);
+ out:
+	spin_unlock(&cont->containers.k_lock);
+	up(&attribute_container_mutex);
+	return retval;
+		
+}
+EXPORT_SYMBOL_GPL(attribute_container_unregister);
+
+/* private function used as class release */
+static void attribute_container_release(struct class_device *classdev)
+{
+	struct internal_container *ic 
+		= container_of(classdev, struct internal_container, classdev);
+	struct device *dev = classdev->dev;
+
+	kfree(ic);
+	put_device(dev);
+}
+
+/**
+ * attribute_container_add_device - see if any container is interested in dev
+ *
+ * @dev: device to add attributes to
+ * @fn:	 function to trigger addition of class device.
+ *
+ * This function allocates storage for the class device(s) to be
+ * attached to dev (one for each matching attribute_container).  If no
+ * fn is provided, the code will simply register the class device via
+ * class_device_add.  If a function is provided, it is expected to add
+ * the class device at the appropriate time.  One of the things that
+ * might be necessary is to allocate and initialise the classdev and
+ * then add it at a later time.  To do this, call this routine for
+ * allocation and initialisation and then use
+ * attribute_container_device_trigger() to call class_device_add() on
+ * it.  Note: after this, the class device contains a reference to dev
+ * which is not relinquished until the release of the classdev.
+ */
+void
+attribute_container_add_device(struct device *dev,
+			       int (*fn)(struct attribute_container *,
+					 struct device *,
+					 struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+
+		if (attribute_container_no_classdevs(cont))
+			continue;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		ic = kzalloc(sizeof(*ic), GFP_KERNEL);
+		if (!ic) {
+			dev_printk(KERN_ERR, dev, "failed to allocate class container\n");
+			continue;
+		}
+
+		ic->cont = cont;
+		class_device_initialize(&ic->classdev);
+		ic->classdev.dev = get_device(dev);
+		ic->classdev.class = cont->class;
+		cont->class->release = attribute_container_release;
+		strcpy(ic->classdev.class_id, dev->bus_id);
+		if (fn)
+			fn(cont, dev, &ic->classdev);
+		else
+			attribute_container_add_class_device(&ic->classdev);
+		klist_add_tail(&ic->node, &cont->containers);
+	}
+	up(&attribute_container_mutex);
+}
+
+/* FIXME: can't break out of this unless klist_iter_exit is also
+ * called before doing the break
+ */
+#define klist_for_each_entry(pos, head, member, iter) \
+	for (klist_iter_init(head, iter); (pos = ({ \
+		struct klist_node *n = klist_next(iter); \
+		n ? container_of(n, typeof(*pos), member) : \
+			({ klist_iter_exit(iter) ; NULL; }); \
+	}) ) != NULL; )
+			
+
+/**
+ * attribute_container_remove_device - make device eligible for removal.
+ *
+ * @dev:  The generic device
+ * @fn:	  A function to call to remove the device
+ *
+ * This routine triggers device removal.  If fn is NULL, then it is
+ * simply done via class_device_unregister (note that if something
+ * still has a reference to the classdev, then the memory occupied
+ * will not be freed until the classdev is released).  If you want a
+ * two phase release: remove from visibility and then delete the
+ * device, then you should use this routine with a fn that calls
+ * class_device_del() and then use
+ * attribute_container_device_trigger() to do the final put on the
+ * classdev.
+ */
+void
+attribute_container_remove_device(struct device *dev,
+				  void (*fn)(struct attribute_container *,
+					     struct device *,
+					     struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+		struct klist_iter iter;
+
+		if (attribute_container_no_classdevs(cont))
+			continue;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		klist_for_each_entry(ic, &cont->containers, node, &iter) {
+			if (dev != ic->classdev.dev)
+				continue;
+			klist_del(&ic->node);
+			if (fn)
+				fn(cont, dev, &ic->classdev);
+			else {
+				attribute_container_remove_attrs(&ic->classdev);
+				class_device_unregister(&ic->classdev);
+			}
+		}
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_device_trigger - execute a trigger for each matching classdev
+ *
+ * @dev:  The generic device to run the trigger for
+ * @fn:	  the function to execute for each classdev.
+ *
+ * This function is for executing a trigger when you need to know both
+ * the container and the classdev.  If you only care about the
+ * container, then use attribute_container_trigger() instead.
+ */
+void
+attribute_container_device_trigger(struct device *dev, 
+				   int (*fn)(struct attribute_container *,
+					     struct device *,
+					     struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+		struct klist_iter iter;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		if (attribute_container_no_classdevs(cont)) {
+			fn(cont, dev, NULL);
+			continue;
+		}
+
+		klist_for_each_entry(ic, &cont->containers, node, &iter) {
+			if (dev == ic->classdev.dev)
+				fn(cont, dev, &ic->classdev);
+		}
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_trigger - trigger a function for each matching container
+ *
+ * @dev:  The generic device to activate the trigger for
+ * @fn:	  the function to trigger
+ *
+ * This routine triggers a function that only needs to know the
+ * matching containers (not the classdev) associated with a device.
+ * It is more lightweight than attribute_container_device_trigger, so
+ * should be used in preference unless the triggering function
+ * actually needs to know the classdev.
+ */
+void
+attribute_container_trigger(struct device *dev,
+			    int (*fn)(struct attribute_container *,
+				      struct device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		if (cont->match(cont, dev))
+			fn(cont, dev);
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_add_attrs - add attributes
+ *
+ * @classdev: The class device
+ *
+ * This simply creates all the class device sysfs files from the
+ * attributes listed in the container
+ */
+int
+attribute_container_add_attrs(struct class_device *classdev)
+{
+	struct attribute_container *cont =
+		attribute_container_classdev_to_container(classdev);
+	struct class_device_attribute **attrs =	cont->attrs;
+	int i, error;
+
+	if (!attrs)
+		return 0;
+
+	for (i = 0; attrs[i]; i++) {
+		error = class_device_create_file(classdev, attrs[i]);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/**
+ * attribute_container_add_class_device - same function as class_device_add
+ *
+ * @classdev:	the class device to add
+ *
+ * This performs essentially the same function as class_device_add except for
+ * attribute containers, namely add the classdev to the system and then
+ * create the attribute files
+ */
+int
+attribute_container_add_class_device(struct class_device *classdev)
+{
+	int error = class_device_add(classdev);
+	if (error)
+		return error;
+	return attribute_container_add_attrs(classdev);
+}
+
+/**
+ * attribute_container_add_class_device_adapter - simple adapter for triggers
+ *
+ * This function is identical to attribute_container_add_class_device except
+ * that it is designed to be called from the triggers
+ */
+int
+attribute_container_add_class_device_adapter(struct attribute_container *cont,
+					     struct device *dev,
+					     struct class_device *classdev)
+{
+	return attribute_container_add_class_device(classdev);
+}
+
+/**
+ * attribute_container_remove_attrs - remove any attribute files
+ *
+ * @classdev: The class device to remove the files from
+ *
+ */
+void
+attribute_container_remove_attrs(struct class_device *classdev)
+{
+	struct attribute_container *cont =
+		attribute_container_classdev_to_container(classdev);
+	struct class_device_attribute **attrs =	cont->attrs;
+	int i;
+
+	if (!attrs)
+		return;
+
+	for (i = 0; attrs[i]; i++)
+		class_device_remove_file(classdev, attrs[i]);
+}
+
+/**
+ * attribute_container_class_device_del - equivalent of class_device_del
+ *
+ * @classdev: the class device
+ *
+ * This function simply removes all the attribute files and then calls
+ * class_device_del.
+ */
+void
+attribute_container_class_device_del(struct class_device *classdev)
+{
+	attribute_container_remove_attrs(classdev);
+	class_device_del(classdev);
+}
+
+/**
+ * attribute_container_find_class_device - find the corresponding class_device
+ *
+ * @cont:	the container
+ * @dev:	the generic device
+ *
+ * Looks up the device in the container's list of class devices and returns
+ * the corresponding class_device.
+ */
+struct class_device *
+attribute_container_find_class_device(struct attribute_container *cont,
+				      struct device *dev)
+{
+	struct class_device *cdev = NULL;
+	struct internal_container *ic;
+	struct klist_iter iter;
+
+	klist_for_each_entry(ic, &cont->containers, node, &iter) {
+		if (ic->classdev.dev == dev) {
+			cdev = &ic->classdev;
+			/* FIXME: must exit iterator then break */
+			klist_iter_exit(&iter);
+			break;
+		}
+	}
+
+	return cdev;
+}
+EXPORT_SYMBOL_GPL(attribute_container_find_class_device);
+
+int
+attribute_container_init(void)
+{
+	INIT_LIST_HEAD(&attribute_container_list);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(attribute_container_init);
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/base.h b/kernel_addons/backport/2.6.9_U4/include/src/base.h
new file mode 100644
index 0000000..a5f8936
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/base.h
@@ -0,0 +1 @@
+extern int attribute_container_init(void);
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/init.c b/kernel_addons/backport/2.6.9_U4/include/src/init.c
new file mode 100644
index 0000000..15f0bc6
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/init.c
@@ -0,0 +1,26 @@
+/*
+ *
+ * Copyright (c) 2002-3 Patrick Mochel
+ * Copyright (c) 2002-3 Open Source Development Labs
+ *
+ * This file is released under the GPLv2
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/memory.h>
+
+#include "base.h"
+
+/**
+ *	driver_init - initialize driver model.
+ *
+ *	Call the driver model init functions to initialize their
+ *	subsystems. Called early from init/main.c.
+ */
+
+void __init driver_init(void)
+{
+	attribute_container_init();
+}
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/klist.c b/kernel_addons/backport/2.6.9_U4/include/src/klist.c
new file mode 100644
index 0000000..3b29ebc
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/klist.c
@@ -0,0 +1,287 @@
+/*
+ *	klist.c - Routines for manipulating klists.
+ *
+ *
+ *	This klist interface provides a couple of structures that wrap around 
+ *	struct list_head to provide explicit list "head" (struct klist) and 
+ *	list "node" (struct klist_node) objects. For struct klist, a spinlock
+ *	is included that protects access to the actual list itself. struct 
+ *	klist_node provides a pointer to the klist that owns it and a kref
+ *	reference count that indicates the number of current users of that node
+ *	in the list.
+ *
+ *	The entire point is to provide an interface for iterating over a list
+ *	that is safe and allows for modification of the list during the
+ *	iteration (e.g. insertion and removal), including modification of the
+ *	current node on the list.
+ *
+ *	It works using a 3rd object type - struct klist_iter - that is declared
+ *	and initialized before an iteration. klist_next() is used to acquire the
+ *	next element in the list. It returns NULL if there are no more items.
+ *	Internally, that routine takes the klist's lock, decrements the reference
+ *	count of the previous klist_node and increments the count of the next
+ *	klist_node. It then drops the lock and returns.
+ *
+ *	There are primitives for adding and removing nodes to/from a klist. 
+ *	When deleting, klist_del() will simply decrement the reference count. 
+ *	Only when the count goes to 0 is the node removed from the list. 
+ *	klist_remove() will try to delete the node from the list and block
+ *	until it is actually removed. This is useful for objects (like devices)
+ *	that have been removed from the system and must be freed (but must wait
+ *	until all accessors have finished).
+ *
+ *	Copyright (C) 2005 Patrick Mochel
+ *
+ *	This file is released under the GPL v2.
+ */
+
+#include <linux/klist.h>
+#include <linux/module.h>
+
+
+/**
+ *	klist_init - Initialize a klist structure. 
+ *	@k:	The klist we're initializing.
+ *	@get:	The get function for the embedding object (NULL if none)
+ *	@put:	The put function for the embedding object (NULL if none)
+ *
+ * Initialises the klist structure.  If the klist_node structures are
+ * going to be embedded in refcounted objects (necessary for safe
+ * deletion) then the get/put arguments are used to initialise
+ * functions that take and release references on the embedding
+ * objects.
+ */
+
+void klist_init(struct klist * k, void (*get)(struct klist_node *),
+		void (*put)(struct klist_node *))
+{
+	INIT_LIST_HEAD(&k->k_list);
+	spin_lock_init(&k->k_lock);
+	k->get = get;
+	k->put = put;
+}
+
+EXPORT_SYMBOL_GPL(klist_init);
+
+
+static void add_head(struct klist * k, struct klist_node * n)
+{
+	spin_lock(&k->k_lock);
+	list_add(&n->n_node, &k->k_list);
+	spin_unlock(&k->k_lock);
+}
+
+static void add_tail(struct klist * k, struct klist_node * n)
+{
+	spin_lock(&k->k_lock);
+	list_add_tail(&n->n_node, &k->k_list);
+	spin_unlock(&k->k_lock);
+}
+
+
+static void klist_node_init(struct klist * k, struct klist_node * n)
+{
+	INIT_LIST_HEAD(&n->n_node);
+	init_completion(&n->n_removed);
+	kref_init(&n->n_ref);
+	n->n_klist = k;
+	if (k->get)
+		k->get(n);
+}
+
+
+/**
+ *	klist_add_head - Initialize a klist_node and add it to front.
+ *	@n:	node we're adding.
+ *	@k:	klist it's going on.
+ */
+
+void klist_add_head(struct klist_node * n, struct klist * k)
+{
+	klist_node_init(k, n);
+	add_head(k, n);
+}
+
+EXPORT_SYMBOL_GPL(klist_add_head);
+
+
+/**
+ *	klist_add_tail - Initialize a klist_node and add it to back.
+ *	@n:	node we're adding.
+ *	@k:	klist it's going on.
+ */
+
+void klist_add_tail(struct klist_node * n, struct klist * k)
+{
+	klist_node_init(k, n);
+	add_tail(k, n);
+}
+
+EXPORT_SYMBOL_GPL(klist_add_tail);
+
+
+static void klist_release(struct kref * kref)
+{
+	struct klist_node * n = container_of(kref, struct klist_node, n_ref);
+
+	list_del(&n->n_node);
+	complete(&n->n_removed);
+	n->n_klist = NULL;
+}
+
+static int klist_dec_and_del(struct klist_node * n)
+{
+	return kref_put_new(&n->n_ref, klist_release);
+}
+
+
+/**
+ *	klist_del - Decrement the reference count of node and try to remove.
+ *	@n:	node we're deleting.
+ */
+
+void klist_del(struct klist_node * n)
+{
+	struct klist * k = n->n_klist;
+	void (*put)(struct klist_node *) = k->put;
+
+	spin_lock(&k->k_lock);
+	if (!klist_dec_and_del(n))
+		put = NULL;
+	spin_unlock(&k->k_lock);
+	if (put)
+		put(n);
+}
+
+EXPORT_SYMBOL_GPL(klist_del);
+
+
+/**
+ *	klist_remove - Decrement the refcount of node and wait for it to go away.
+ *	@n:	node we're removing.
+ */
+
+void klist_remove(struct klist_node * n)
+{
+	klist_del(n);
+	wait_for_completion(&n->n_removed);
+}
+
+EXPORT_SYMBOL_GPL(klist_remove);
+
+
+/**
+ *	klist_node_attached - Say whether a node is bound to a list or not.
+ *	@n:	Node that we're testing.
+ */
+
+int klist_node_attached(struct klist_node * n)
+{
+	return (n->n_klist != NULL);
+}
+
+EXPORT_SYMBOL_GPL(klist_node_attached);
+
+
+/**
+ *	klist_iter_init_node - Initialize a klist_iter structure.
+ *	@k:	klist we're iterating.
+ *	@i:	klist_iter we're filling.
+ *	@n:	node to start with.
+ *
+ *	Similar to klist_iter_init(), but starts the action off with @n, 
+ *	instead of with the list head.
+ */
+
+void klist_iter_init_node(struct klist * k, struct klist_iter * i, struct klist_node * n)
+{
+	i->i_klist = k;
+	i->i_head = &k->k_list;
+	i->i_cur = n;
+	if (n)
+		kref_get(&n->n_ref);
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_init_node);
+
+
+/**
+ *	klist_iter_init - Initialize a klist_iter structure.
+ *	@k:	klist we're iterating.
+ *	@i:	klist_iter structure we're filling.
+ *
+ *	Similar to klist_iter_init_node(), but start with the list head.
+ */
+
+void klist_iter_init(struct klist * k, struct klist_iter * i)
+{
+	klist_iter_init_node(k, i, NULL);
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_init);
+
+
+/**
+ *	klist_iter_exit - Finish a list iteration.
+ *	@i:	Iterator structure.
+ *
+ *	Must be called when done iterating over list, as it decrements the 
+ *	refcount of the current node. Necessary in case iteration exited before
+ *	the end of the list was reached, and always good form.
+ */
+
+void klist_iter_exit(struct klist_iter * i)
+{
+	if (i->i_cur) {
+		klist_del(i->i_cur);
+		i->i_cur = NULL;
+	}
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_exit);
+
+
+static struct klist_node * to_klist_node(struct list_head * n)
+{
+	return container_of(n, struct klist_node, n_node);
+}
+
+
+/**
+ *	klist_next - Ante up next node in list.
+ *	@i:	Iterator structure.
+ *
+ *	First grab list lock. Decrement the reference count of the previous
+ *	node, if there was one. Grab the next node, increment its reference 
+ *	count, drop the lock, and return that next node.
+ */
+
+struct klist_node * klist_next(struct klist_iter * i)
+{
+	struct list_head * next;
+	struct klist_node * lnode = i->i_cur;
+	struct klist_node * knode = NULL;
+	void (*put)(struct klist_node *) = i->i_klist->put;
+
+	spin_lock(&i->i_klist->k_lock);
+	if (lnode) {
+		next = lnode->n_node.next;
+		if (!klist_dec_and_del(lnode))
+			put = NULL;
+	} else
+		next = i->i_head->next;
+
+	if (next != i->i_head) {
+		knode = to_klist_node(next);
+		kref_get(&knode->n_ref);
+	}
+	i->i_cur = knode;
+	spin_unlock(&i->i_klist->k_lock);
+	if (put && lnode)
+		put(lnode);
+	return knode;
+}
+
+EXPORT_SYMBOL_GPL(klist_next);
+
+
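The deferred-removal behaviour described in the klist.c header comment can be modelled in userspace. What follows is only a sketch under stated assumptions: it is single-threaded, so there is no k_lock, a plain int replaces the kref, and every name (mlist, mnode, miter and their functions) is hypothetical rather than part of the patch. It shows why deleting the current node during an iteration is safe: the node stays linked until the iterator drops its own reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-threaded model of a klist node: no lock, int refcount. */
struct mnode {
	struct mnode *prev, *next;	/* circular doubly linked list */
	int ref;			/* 1 for list membership, +1 per iterator */
	int attached;			/* cleared once the node is really unlinked */
	int value;
};

struct mlist {
	struct mnode head;		/* sentinel, never returned to callers */
};

struct miter {
	struct mlist *list;
	struct mnode *cur;
};

void mlist_init(struct mlist *l)
{
	l->head.prev = l->head.next = &l->head;
}

/* Like klist_add_tail(): the list itself holds the initial reference. */
void mlist_add_tail(struct mlist *l, struct mnode *n, int value)
{
	n->ref = 1;
	n->attached = 1;
	n->value = value;
	n->prev = l->head.prev;
	n->next = &l->head;
	l->head.prev->next = n;
	l->head.prev = n;
}

/* Like klist_del(): drop one reference, unlink only when none remain. */
void mlist_del(struct mnode *n)
{
	assert(n->ref > 0);
	if (--n->ref == 0) {
		n->prev->next = n->next;
		n->next->prev = n->prev;
		n->attached = 0;
	}
}

void miter_init(struct mlist *l, struct miter *it)
{
	it->list = l;
	it->cur = NULL;
}

/* Like klist_next(): read the successor first, then release the previous node. */
struct mnode *miter_next(struct miter *it)
{
	struct mnode *next = it->cur ? it->cur->next : it->list->head.next;

	if (it->cur)
		mlist_del(it->cur);
	it->cur = (next != &it->list->head) ? next : NULL;
	if (it->cur)
		it->cur->ref++;
	return it->cur;
}
```

A caller would initialise the list, add nodes, and loop on miter_next() until it returns NULL. Calling mlist_del() on the node just returned only drops the list's reference, and the unlink happens on the following miter_next() call.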
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/kref_new.c b/kernel_addons/backport/2.6.9_U4/include/src/kref_new.c
new file mode 100644
index 0000000..d45bb3f
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/kref_new.c
@@ -0,0 +1,29 @@
+#include <linux/kref.h>
+#include <linux/module.h>
+
+/**
+ * kref_put_new - decrement refcount for object.
+ * @kref: object.
+ * @release: pointer to the function that will clean up the object when the
+ *           last reference to the object is released.
+ *           This pointer is required, and it is not acceptable to pass kfree
+ *           in as this function.
+ *
+ * Decrement the refcount, and if 0, call release().
+ * Return 1 if the object was removed, otherwise return 0.  Beware, if this
+ * function returns 0, you still cannot count on the kref remaining in
+ * memory.  Only use the return value to check whether the kref is now
+ * gone, never to conclude that it is still present.
+ */
+int kref_put_new(struct kref *kref, void (*release)(struct kref *kref))
+{
+        WARN_ON(release == NULL);
+        WARN_ON(release == (void (*)(struct kref *))kfree);
+
+        if (atomic_dec_and_test(&kref->refcount)) {
+                release(kref);
+                return 1;
+        }
+        return 0;
+}
+EXPORT_SYMBOL(kref_put_new);
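The return-value contract of kref_put_new() can be illustrated with a small userspace model. This is a sketch, not the kernel code: struct ref, struct widget, ref_get(), ref_put() and widget_release() are hypothetical names, and a plain int stands in for the kernel's atomic_t, so the model is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct kref (plain int, not atomic). */
struct ref {
	int count;
};

/* Example object embedding the counter, as kernel objects embed a kref. */
struct widget {
	struct ref ref;
	int released;	/* set by the release callback */
};

/* Take one more reference. */
void ref_get(struct ref *r)
{
	r->count++;
}

/* Mirrors kref_put_new(): returns 1 only when this call dropped the
 * last reference and therefore invoked release(). */
int ref_put(struct ref *r, void (*release)(struct ref *))
{
	assert(release != NULL);
	if (--r->count == 0) {
		release(r);
		return 1;
	}
	return 0;
}

/* Release callback: recover the containing object from the embedded
 * counter, the same way container_of() is used in klist_release(). */
void widget_release(struct ref *r)
{
	struct widget *w =
		(struct widget *)((char *)r - offsetof(struct widget, ref));

	w->released = 1;
}
```

With two references held, the first ref_put() returns 0 and leaves the object alone. The second returns 1 after running the callback. klist_del() relies on that distinction to decide whether to call the list's put function.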
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/scsi.c b/kernel_addons/backport/2.6.9_U4/include/src/scsi.c
new file mode 100644
index 0000000..8c833c0
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/scsi.c
@@ -0,0 +1,50 @@
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/delay.h>
+#include <linux/init.h>
+#include <linux/completion.h>
+#include <linux/unistd.h>
+#include <linux/spinlock.h>
+#include <linux/kmod.h>
+#include <linux/interrupt.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
+#include <linux/mutex.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_dbg.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_tcq.h>
+
+/**
+ * starget_for_each_device  -  helper to walk all devices of a target
+ * @starget:	target whose devices we want to iterate over.
+ *
+ * Walks every device on @starget's host and calls @fn(sdev, @data) for
+ * each one whose channel and id match @starget.  shost_for_each_device()
+ * takes and drops the device references, so @fn need not release them.
+ */
+void starget_for_each_device(struct scsi_target *starget, void * data,
+		     void (*fn)(struct scsi_device *, void *))
+{
+	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
+	struct scsi_device *sdev;
+
+	printk("%s: entry\n", __FUNCTION__);
+	shost_for_each_device(sdev, shost) {
+		if ((sdev->channel == starget->channel) &&
+		    (sdev->id == starget->id))
+			fn(sdev, data);
+	}
+	printk("%s: exit\n", __FUNCTION__);
+}
+EXPORT_SYMBOL(starget_for_each_device);
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/scsi_lib.c b/kernel_addons/backport/2.6.9_U4/include/src/scsi_lib.c
new file mode 100644
index 0000000..f53f824
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/scsi_lib.c
@@ -0,0 +1,166 @@
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/completion.h>
+#include <linux/kernel.h>
+#include <linux/mempool.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <linux/hardirq.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_dbg.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_driver.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_host.h>
+
+int scsi_is_target_device(const struct device *dev)
+{
+        const char *str = dev->bus_id;
+
+	if (strncmp(str, "target", 6) == 0) {
+		return 1;
+	}
+
+        return 0;
+}
+
+/**
+ * scsi_internal_device_block - internal function to put a device
+ *                              temporarily into the SDEV_BLOCK state
+ * @sdev:       device to block
+ *
+ * Block request made by scsi lld's to temporarily stop all
+ * scsi commands on the specified device.  Called from interrupt
+ * or normal process context.
+ *
+ * Returns zero if successful or error if not
+ *
+ * Notes:
+ *      This routine transitions the device to the SDEV_BLOCK state
+ *      (which must be a legal transition).  When the device is in this
+ *      state, all commands are deferred until the scsi lld reenables
+ *      the device with scsi_device_unblock or device_block_tmo fires.
+ *      This routine assumes the host_lock is held on entry.
+ **/
+int
+scsi_internal_device_block(struct scsi_device *sdev)
+{
+        request_queue_t *q = sdev->request_queue;
+        unsigned long flags;
+        int err = 0;
+
+        err = scsi_device_set_state(sdev, SDEV_BLOCK);
+        if (err)
+		return err;
+                
+        /*
+         * The device has transitioned to SDEV_BLOCK.  Stop the
+         * block layer from calling the midlayer with this device's
+         * request queue.
+         */
+        spin_lock_irqsave(q->queue_lock, flags);
+        blk_stop_queue(q);
+        spin_unlock_irqrestore(q->queue_lock, flags);
+
+        return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_internal_device_block);
+
+/**
+ * scsi_internal_device_unblock - resume a device after a block request
+ * @sdev:       device to resume
+ *
+ * Called by scsi lld's or the midlayer to restart the device queue
+ * for the previously suspended scsi device.  Called from interrupt or
+ * normal process context.
+ *
+ * Returns zero if successful or error if not.
+ *
+ * Notes:
+ *      This routine transitions the device to the SDEV_RUNNING state
+ *      (which must be a legal transition) allowing the midlayer to
+ *      goose the queue for this device.  This routine assumes the
+ *      host_lock is held upon entry.
+ **/
+int
+scsi_internal_device_unblock(struct scsi_device *sdev)
+{
+        request_queue_t *q = sdev->request_queue;
+        int err;
+        unsigned long flags;
+
+
+        /*
+         * Try to transition the scsi device to SDEV_RUNNING
+         * and goose the device queue if successful.
+         */
+        err = scsi_device_set_state(sdev, SDEV_RUNNING);
+        if (err)
+		return err;
+                
+        spin_lock_irqsave(q->queue_lock, flags);
+        blk_start_queue(q);
+        spin_unlock_irqrestore(q->queue_lock, flags);
+
+        return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_internal_device_unblock);
+
+static void
+device_block(struct scsi_device *sdev, void *data)
+{
+        scsi_internal_device_block(sdev);
+}
+
+static int
+target_block(struct device *dev, void *data)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_block);
+
+        return 0;
+}
+
+void
+scsi_target_block(struct device *dev)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_block);
+        else
+                device_for_each_child(dev, NULL, target_block);
+}
+EXPORT_SYMBOL_GPL(scsi_target_block);
+
+static void
+device_unblock(struct scsi_device *sdev, void *data)
+{
+        scsi_internal_device_unblock(sdev);
+}
+
+static int
+target_unblock(struct device *dev, void *data)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_unblock);
+        return 0;
+}
+
+void
+scsi_target_unblock(struct device *dev)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_unblock);
+        else
+                device_for_each_child(dev, NULL, target_unblock);
+}
+EXPORT_SYMBOL_GPL(scsi_target_unblock);
+
+MODULE_LICENSE("GPL");
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/scsi_scan.c b/kernel_addons/backport/2.6.9_U4/include/src/scsi_scan.c
new file mode 100644
index 0000000..b7b7674
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/scsi_scan.c
@@ -0,0 +1,48 @@
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/spinlock.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_driver.h>
+#include <scsi/scsi_devinfo.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_transport.h>
+#include <scsi/scsi_eh.h>
+
+/**
+ * int_to_scsilun: converts an int back into a scsi_lun
+ * @lun:        integer to be converted
+ * @scsilun:    struct scsi_lun to be set.
+ *
+ * Description:
+ *     Reverses the functionality of scsilun_to_int(), which packed
+ *     an 8-byte lun value into an int. This routine unpacks the int
+ *     back into the lun value.
+ *     Note: the scsilun_to_int() routine does not truly handle all
+ *     8 bytes of the lun value. This function restores only as much
+ *     as was set by that routine.
+ *
+ * Notes:
+ *     Given an integer : 0x0b030a04,  this function sets @scsilun
+ *     to: 0a 04 0b 03 00 00 00 00
+ *
+ **/
+void int_to_scsilun(unsigned int lun, struct scsi_lun *scsilun)
+{
+        int i;
+
+        memset(scsilun->scsi_lun, 0, sizeof(scsilun->scsi_lun));
+
+        for (i = 0; i < sizeof(lun); i += 2) {
+                scsilun->scsi_lun[i] = (lun >> 8) & 0xFF;
+                scsilun->scsi_lun[i+1] = lun & 0xFF;
+                lun = lun >> 16;
+        }
+}
+EXPORT_SYMBOL(int_to_scsilun);
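The byte layout given in the int_to_scsilun() comment can be checked with a userspace copy of the loop. This is a sketch: struct scsi_lun_model, int_to_lun_model() and lun_model_to_int() are hypothetical stand-ins for the kernel's struct scsi_lun, int_to_scsilun() and scsilun_to_int(), and the code assumes a 4-byte unsigned int.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct scsi_lun (8-byte LUN). */
struct scsi_lun_model {
	unsigned char scsi_lun[8];
};

/* Same loop as int_to_scsilun() in the patch: each 16-bit half of the
 * integer is stored high byte first, starting with the low half. */
void int_to_lun_model(unsigned int lun, struct scsi_lun_model *out)
{
	size_t i;

	assert(out != NULL);
	memset(out->scsi_lun, 0, sizeof(out->scsi_lun));
	for (i = 0; i < sizeof(lun); i += 2) {
		out->scsi_lun[i] = (lun >> 8) & 0xFF;
		out->scsi_lun[i + 1] = lun & 0xFF;
		lun >>= 16;
	}
}

/* Inverse direction, modelled on scsilun_to_int(): only the first
 * sizeof(unsigned int) bytes of the LUN take part. */
unsigned int lun_model_to_int(const struct scsi_lun_model *in)
{
	unsigned int lun = 0;
	size_t i;

	assert(in != NULL);
	for (i = 0; i < sizeof(lun); i += 2)
		lun |= (((unsigned int)in->scsi_lun[i] << 8) |
			in->scsi_lun[i + 1]) << (i * 8);
	return lun;
}
```

Feeding 0x0b030a04 through int_to_lun_model() yields the bytes 0a 04 0b 03 00 00 00 00, matching the comment, and lun_model_to_int() recovers the original integer. Only the first four LUN bytes survive the round trip, as the comment warns.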
diff --git a/kernel_addons/backport/2.6.9_U4/include/src/transport_class.c b/kernel_addons/backport/2.6.9_U4/include/src/transport_class.c
new file mode 100644
index 0000000..f25e7c6
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U4/include/src/transport_class.c
@@ -0,0 +1,280 @@
+/*
+ * transport_class.c - implementation of generic transport classes
+ *                     using attribute_containers
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ *
+ * The basic idea here is to allow any "device controller" (which
+ * would most often be a Host Bus Adapter) to use the services of one
+ * or more transport classes for performing transport specific
+ * services.  Transport specific services are things that the generic
+ * command layer doesn't want to know about (speed settings, line
+ * conditioning, etc), but which the user might be interested in.
+ * Thus, the HBA's use the routines exported by the transport classes
+ * to perform these functions.  The transport classes export certain
+ * values to the user via sysfs using attribute containers.
+ *
+ * Note: because not every HBA will care about every transport
+ * attribute, there's a many to one relationship that goes like this:
+ *
+ * transport class<-----attribute container<----class device
+ *
+ * Usually the attribute container is per-HBA, but the design doesn't
+ * mandate that.  Although most of the services will be specific to
+ * the actual external storage connection used by the HBA, the generic
+ * transport class is framed entirely in terms of generic devices to
+ * allow it to be used by any physical HBA in the system.
+ */
+#include <linux/attribute_container.h>
+#include <linux/transport_class.h>
+
+/**
+ * transport_class_register - register an initial transport class
+ *
+ * @tclass:	a pointer to the transport class structure to be initialised
+ *
+ * The transport class contains an embedded class which is used to
+ * identify it.  The caller should initialise this structure with
+ * zeros, and the embedded generic class must then be initialised with
+ * the transport class's unique name.  There's a macro
+ * DECLARE_TRANSPORT_CLASS() to do this (declared classes still must
+ * be registered).
+ *
+ * Returns 0 on success or error on failure.
+ */
+int transport_class_register(struct transport_class *tclass)
+{
+	return class_register(&tclass->class);
+}
+EXPORT_SYMBOL_GPL(transport_class_register);
+
+/**
+ * transport_class_unregister - unregister a previously registered class
+ *
+ * @tclass: The transport class to unregister
+ *
+ * Must be called prior to deallocating the memory for the transport
+ * class.
+ */
+void transport_class_unregister(struct transport_class *tclass)
+{
+	class_unregister(&tclass->class);
+}
+EXPORT_SYMBOL_GPL(transport_class_unregister);
+
+static int anon_transport_dummy_function(struct transport_container *tc,
+					 struct device *dev,
+					 struct class_device *cdev)
+{
+	/* do nothing */
+	return 0;
+}
+
+/**
+ * anon_transport_class_register - register an anonymous class
+ *
+ * @atc: The anon transport class to register
+ *
+ * The anonymous transport class contains both a transport class and a
+ * container.  The idea of an anonymous class is that it never
+ * actually has any device attributes associated with it (and thus
+ * saves on container storage).  So it can only be used for triggering
+ * events.  Pre-zero the storage and then use DECLARE_ANON_TRANSPORT_CLASS()
+ * to initialise the anon transport class storage.
+ */
+int anon_transport_class_register(struct anon_transport_class *atc)
+{
+	int error;
+	atc->container.class = &atc->tclass.class;
+	attribute_container_set_no_classdevs(&atc->container);
+	error = attribute_container_register(&atc->container);
+	if (error)
+		return error;
+	atc->tclass.setup = anon_transport_dummy_function;
+	atc->tclass.remove = anon_transport_dummy_function;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(anon_transport_class_register);
+
+/**
+ * anon_transport_class_unregister - unregister an anon class
+ *
+ * @atc: Pointer to the anon transport class to unregister
+ *
+ * Must be called prior to deallocating the memory for the anon
+ * transport class.
+ */
+void anon_transport_class_unregister(struct anon_transport_class *atc)
+{
+	attribute_container_unregister(&atc->container);
+}
+EXPORT_SYMBOL_GPL(anon_transport_class_unregister);
+
+static int transport_setup_classdev(struct attribute_container *cont,
+				    struct device *dev,
+				    struct class_device *classdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+	struct transport_container *tcont = attribute_container_to_transport_container(cont);
+
+	if (tclass->setup)
+		tclass->setup(tcont, dev, classdev);
+
+	return 0;
+}
+
+/**
+ * transport_setup_device - declare a new dev for transport class association
+ *			    but don't make it visible yet.
+ *
+ * @dev: the generic device representing the entity being added
+ *
+ * Usually, dev represents some component in the HBA system (either
+ * the HBA itself or a device remote across the HBA bus).  This
+ * routine is simply a trigger point to see if any set of transport
+ * classes wishes to associate with the added device.  This allocates
+ * storage for the class device and initialises it, but does not yet
+ * add it to the system or add attributes to it (you do this with
+ * transport_add_device).  If you have no need for separate setup
+ * and add operations, use transport_register_device (see
+ * transport_class.h).
+ */
+
+void transport_setup_device(struct device *dev)
+{
+	attribute_container_add_device(dev, transport_setup_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_setup_device);
+
+static int transport_add_class_device(struct attribute_container *cont,
+				      struct device *dev,
+				      struct class_device *classdev)
+{
+	int error = attribute_container_add_class_device(classdev);
+	struct transport_container *tcont = 
+		attribute_container_to_transport_container(cont);
+
+	if (!error && tcont->statistics)
+		error = sysfs_create_group(&classdev->kobj, tcont->statistics);
+
+	return error;
+}
+
+
+/**
+ * transport_add_device - declare a new dev for transport class association
+ *
+ * @dev: the generic device representing the entity being added
+ *
+ * Usually, dev represents some component in the HBA system (either
+ * the HBA itself or a device remote across the HBA bus).  This
+ * routine is simply a trigger point used to add the device to the
+ * system and register attributes for it.
+ */
+
+void transport_add_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_add_class_device);
+}
+EXPORT_SYMBOL_GPL(transport_add_device);
+
+static int transport_configure(struct attribute_container *cont,
+			       struct device *dev,
+			       struct class_device *cdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+	struct transport_container *tcont = attribute_container_to_transport_container(cont);
+
+	if (tclass->configure)
+		tclass->configure(tcont, dev, cdev);
+
+	return 0;
+}
+
+/**
+ * transport_configure_device - configure an already set up device
+ *
+ * @dev: generic device representing device to be configured
+ *
+ * The idea of configure is simply to provide a point within the setup
+ * process to allow the transport class to extract information from a
+ * device after it has been set up.  This is used in SCSI because we
+ * have to have a setup device to begin using the HBA, but after we
+ * send the initial inquiry, we use configure to extract the device
+ * parameters.  The device need not have been added to be configured.
+ */
+void transport_configure_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_configure);
+}
+EXPORT_SYMBOL_GPL(transport_configure_device);
+
+static int transport_remove_classdev(struct attribute_container *cont,
+				     struct device *dev,
+				     struct class_device *classdev)
+{
+	struct transport_container *tcont = 
+		attribute_container_to_transport_container(cont);
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+
+	if (tclass->remove)
+		tclass->remove(tcont, dev, classdev);
+
+	if (tclass->remove != anon_transport_dummy_function) {
+		if (tcont->statistics)
+			sysfs_remove_group(&classdev->kobj, tcont->statistics);
+		attribute_container_class_device_del(classdev);
+	}
+
+	return 0;
+}
+
+
+/**
+ * transport_remove_device - remove the visibility of a device
+ *
+ * @dev: generic device to remove
+ *
+ * This call removes the visibility of the device (to the user from
+ * sysfs), but does not destroy it.  To eliminate a device entirely
+ * you must also call transport_destroy_device.  If you don't need to
+ * do remove and destroy as separate operations, use
+ * transport_unregister_device() (see transport_class.h) which will
+ * perform both calls for you.
+ */
+void transport_remove_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_remove_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_remove_device);
+
+static void transport_destroy_classdev(struct attribute_container *cont,
+				      struct device *dev,
+				      struct class_device *classdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+
+	if (tclass->remove != anon_transport_dummy_function)
+		class_device_put(classdev);
+}
+
+
+/**
+ * transport_destroy_device - destroy a removed device
+ *
+ * @dev: device to eliminate from the transport class.
+ *
+ * This call triggers the elimination of storage associated with the
+ * transport classdev.  Note: all it really does is relinquish a
+ * reference to the classdev.  The memory will not be freed until the
+ * last reference goes to zero.  Note also that the classdev retains a
+ * reference count on dev, so dev too will remain for as long as the
+ * transport class device remains around.
+ */
+void transport_destroy_device(struct device *dev)
+{
+	attribute_container_remove_device(dev, transport_destroy_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_destroy_device);
diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/attribute_container.h b/kernel_addons/backport/2.6.9_U5/include/linux/attribute_container.h
new file mode 100644
index 0000000..93bfb0b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/linux/attribute_container.h
@@ -0,0 +1,71 @@
+/*
+ * attribute_container.h - a generic container for all classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ */
+
+#ifndef _ATTRIBUTE_CONTAINER_H_
+#define _ATTRIBUTE_CONTAINER_H_
+
+#include <linux/device.h>
+#include <linux/list.h>
+#include <linux/klist.h>
+#include <linux/spinlock.h>
+
+struct attribute_container {
+	struct list_head	node;
+	struct klist		containers;
+	struct class		*class;
+	struct class_device_attribute **attrs;
+	int (*match)(struct attribute_container *, struct device *);
+#define	ATTRIBUTE_CONTAINER_NO_CLASSDEVS	0x01
+	unsigned long		flags;
+};
+
+static inline int
+attribute_container_no_classdevs(struct attribute_container *atc)
+{
+	return atc->flags & ATTRIBUTE_CONTAINER_NO_CLASSDEVS;
+}
+
+static inline void
+attribute_container_set_no_classdevs(struct attribute_container *atc)
+{
+	atc->flags |= ATTRIBUTE_CONTAINER_NO_CLASSDEVS;
+}
+
+int attribute_container_register(struct attribute_container *cont);
+int attribute_container_unregister(struct attribute_container *cont);
+void attribute_container_create_device(struct device *dev,
+				       int (*fn)(struct attribute_container *,
+						 struct device *,
+						 struct class_device *));
+void attribute_container_add_device(struct device *dev,
+				    int (*fn)(struct attribute_container *,
+					      struct device *,
+					      struct class_device *));
+void attribute_container_remove_device(struct device *dev,
+				       void (*fn)(struct attribute_container *,
+						  struct device *,
+						  struct class_device *));
+void attribute_container_device_trigger(struct device *dev, 
+					int (*fn)(struct attribute_container *,
+						  struct device *,
+						  struct class_device *));
+void attribute_container_trigger(struct device *dev, 
+				 int (*fn)(struct attribute_container *,
+					   struct device *));
+int attribute_container_add_attrs(struct class_device *classdev);
+int attribute_container_add_class_device(struct class_device *classdev);
+int attribute_container_add_class_device_adapter(struct attribute_container *cont,
+						 struct device *dev,
+						 struct class_device *classdev);
+void attribute_container_remove_attrs(struct class_device *classdev);
+void attribute_container_class_device_del(struct class_device *classdev);
+struct attribute_container *attribute_container_classdev_to_container(struct class_device *);
+struct class_device *attribute_container_find_class_device(struct attribute_container *, struct device *);
+struct class_device_attribute **attribute_container_classdev_to_attrs(const struct class_device *classdev);
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/crypto.h b/kernel_addons/backport/2.6.9_U5/include/linux/crypto.h
new file mode 100644
index 0000000..aecccde
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/linux/crypto.h
@@ -0,0 +1,11 @@
+#ifndef LINUX_CRYPTO_BACKPORT_H
+#define LINUX_CRYPTO_BACKPORT_H
+
+#include_next <linux/crypto.h>
+
+#define crypto_hash_init(desc) crypto_digest_init(*desc)
+#define crypto_hash_digest(desc, sg, nbytes, out) crypto_digest_digest(*desc, sg, 1, out)
+#define crypto_hash_update(desc, sg, nbytes) crypto_digest_update(*desc, sg, 1)
+#define crypto_hash_final(desc, out) crypto_digest_final(*desc, out)
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/kernel.h b/kernel_addons/backport/2.6.9_U5/include/linux/kernel.h
index a37dcd5..02a5907 100644
--- a/kernel_addons/backport/2.6.9_U5/include/linux/kernel.h
+++ b/kernel_addons/backport/2.6.9_U5/include/linux/kernel.h
@@ -4,4 +4,7 @@ #define BACKPORT_KERNEL_H_2_6_19
 #include_next <linux/kernel.h>
 #include <linux/log2.h>
 
+#define NIP6_FMT "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x"
+#define NIPQUAD_FMT "%u.%u.%u.%u"
+
 #endif
diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/kfifo.h b/kernel_addons/backport/2.6.9_U5/include/linux/kfifo.h
index 48eccd8..2b94461 100644
--- a/kernel_addons/backport/2.6.9_U5/include/linux/kfifo.h
+++ b/kernel_addons/backport/2.6.9_U5/include/linux/kfifo.h
@@ -25,6 +25,7 @@ #ifdef __KERNEL__
 
 #include <linux/kernel.h>
 #include <linux/spinlock.h>
+#include <linux/gfp.h>
 
 struct kfifo {
 	unsigned char *buffer;	/* the buffer holding the data */
diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/klist.h b/kernel_addons/backport/2.6.9_U5/include/linux/klist.h
new file mode 100644
index 0000000..7407125
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/linux/klist.h
@@ -0,0 +1,61 @@
+/*
+ *	klist.h - Some generic list helpers, extending struct list_head a bit.
+ *
+ *	Implementations are found in lib/klist.c
+ *
+ *
+ *	Copyright (C) 2005 Patrick Mochel
+ *
+ *	This file is released under the GPL v2.
+ */
+
+#ifndef _LINUX_KLIST_H
+#define _LINUX_KLIST_H
+
+#include <linux/spinlock.h>
+#include <linux/completion.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+
+struct klist_node;
+struct klist {
+	spinlock_t		k_lock;
+	struct list_head	k_list;
+	void			(*get)(struct klist_node *);
+	void			(*put)(struct klist_node *);
+};
+
+
+extern void klist_init(struct klist * k, void (*get)(struct klist_node *),
+		       void (*put)(struct klist_node *));
+
+struct klist_node {
+	struct klist		* n_klist;
+	struct list_head	n_node;
+	struct kref		n_ref;
+	struct completion	n_removed;
+};
+
+extern void klist_add_tail(struct klist_node * n, struct klist * k);
+extern void klist_add_head(struct klist_node * n, struct klist * k);
+
+extern void klist_del(struct klist_node * n);
+extern void klist_remove(struct klist_node * n);
+
+extern int klist_node_attached(struct klist_node * n);
+
+
+struct klist_iter {
+	struct klist		* i_klist;
+	struct list_head	* i_head;
+	struct klist_node	* i_cur;
+};
+
+
+extern void klist_iter_init(struct klist * k, struct klist_iter * i);
+extern void klist_iter_init_node(struct klist * k, struct klist_iter * i, 
+				 struct klist_node * n);
+extern void klist_iter_exit(struct klist_iter * i);
+extern struct klist_node * klist_next(struct klist_iter * i);
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/memory.h b/kernel_addons/backport/2.6.9_U5/include/linux/memory.h
new file mode 100644
index 0000000..654ef55
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/linux/memory.h
@@ -0,0 +1,89 @@
+/*
+ * include/linux/memory.h - generic memory definition
+ *
+ * This is mainly for topological representation. We define the
+ * basic "struct memory_block" here, which can be embedded in per-arch
+ * definitions or NUMA information.
+ *
+ * Basic handling of the devices is done in drivers/base/memory.c
+ * and system devices are handled in drivers/base/sys.c.
+ *
+ * Memory blocks are exported via sysfs in the class/memory/devices/
+ * directory.
+ *
+ */
+#ifndef _LINUX_MEMORY_H_
+#define _LINUX_MEMORY_H_
+
+#include <linux/sysdev.h>
+#include <linux/node.h>
+#include <linux/compiler.h>
+
+#include <asm/semaphore.h>
+
+struct memory_block {
+	unsigned long phys_index;
+	unsigned long state;
+	/*
+	 * This serializes all state change requests.  It isn't
+	 * held during creation because the control files are
+	 * created long after the critical areas during
+	 * initialization.
+	 */
+	struct semaphore state_sem;
+	int phys_device;		/* to which fru does this belong? */
+	void *hw;			/* optional pointer to fw/hw data */
+	int (*phys_callback)(struct memory_block *);
+	struct sys_device sysdev;
+};
+
+/* These states are exposed to userspace as text strings in sysfs */
+#define	MEM_ONLINE		(1<<0) /* exposed to userspace */
+#define	MEM_GOING_OFFLINE	(1<<1) /* exposed to userspace */
+#define	MEM_OFFLINE		(1<<2) /* exposed to userspace */
+
+/*
+ * All of these states are currently kernel-internal for notifying
+ * kernel components and architectures.
+ *
+ * For MEM_MAPPING_INVALID, all notifier chains with priority >0
+ * are called before pfn_to_page() becomes invalid.  The priority=0
+ * entry is reserved for the function that actually makes
+ * pfn_to_page() stop working.  Any notifiers that want to be called
+ * after that should have priority <0.
+ */
+#define	MEM_MAPPING_INVALID	(1<<3)
+
+struct notifier_block;
+struct mem_section;
+
+#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE
+static inline int memory_dev_init(void)
+{
+	return 0;
+}
+static inline int register_memory_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
+static inline void unregister_memory_notifier(struct notifier_block *nb)
+{
+}
+#else
+extern int register_new_memory(struct mem_section *);
+extern int unregister_memory_section(struct mem_section *);
+extern int memory_dev_init(void);
+extern int remove_memory_block(unsigned long, struct mem_section *, int);
+
+#define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
+
+
+#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
+
+#define hotplug_memory_notifier(fn, pri) {			\
+	static struct notifier_block fn##_mem_nb =		\
+		{ .notifier_call = fn, .priority = pri };	\
+	register_memory_notifier(&fn##_mem_nb);			\
+}
+
+#endif /* _LINUX_MEMORY_H_ */
diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/netlink.h b/kernel_addons/backport/2.6.9_U5/include/linux/netlink.h
new file mode 100644
index 0000000..6d69105
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/linux/netlink.h
@@ -0,0 +1,14 @@
+#ifndef BACKPORT_LINUX_NETLINK_H
+#define BACKPORT_LINUX_NETLINK_H
+
+#include_next <linux/netlink.h>
+
+#define __nlmsg_put(skb, daemon_pid, seq, type, len, flags) \
+       __nlmsg_put(skb, daemon_pid, 0, 0, len)
+
+#define netlink_kernel_create(uint, groups, input, mod) \
+       netlink_kernel_create(uint, input)
+
+#define NETLINK_ISCSI           8
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/transport_class.h b/kernel_addons/backport/2.6.9_U5/include/linux/transport_class.h
new file mode 100644
index 0000000..1d6cc22
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/linux/transport_class.h
@@ -0,0 +1,100 @@
+/*
+ * transport_class.h - a generic container for all transport classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ */
+
+#ifndef _TRANSPORT_CLASS_H_
+#define _TRANSPORT_CLASS_H_
+
+#include <linux/device.h>
+#include <linux/attribute_container.h>
+
+struct transport_container;
+
+struct transport_class {
+	struct class class;
+	int (*setup)(struct transport_container *, struct device *,
+		     struct class_device *);
+	int (*configure)(struct transport_container *, struct device *,
+			 struct class_device *);
+	int (*remove)(struct transport_container *, struct device *,
+		      struct class_device *);
+};
+
+#define DECLARE_TRANSPORT_CLASS(cls, nm, su, rm, cfg)			\
+struct transport_class cls = {						\
+	.class = {							\
+		.name = nm,						\
+	},								\
+	.setup = su,							\
+	.remove = rm,							\
+	.configure = cfg,						\
+}
+
+
+struct anon_transport_class {
+	struct transport_class tclass;
+	struct attribute_container container;
+};
+
+#define DECLARE_ANON_TRANSPORT_CLASS(cls, mtch, cfg)		\
+struct anon_transport_class cls = {				\
+	.tclass = {						\
+		.configure = cfg,				\
+	},							\
+	.container = {						\
+		.match = mtch,					\
+	},							\
+}
+
+#define class_to_transport_class(x) \
+	container_of(x, struct transport_class, class)
+
+struct transport_container {
+	struct attribute_container ac;
+	struct attribute_group *statistics;
+};
+
+#define attribute_container_to_transport_container(x) \
+	container_of(x, struct transport_container, ac)
+
+void transport_remove_device(struct device *);
+void transport_add_device(struct device *);
+void transport_setup_device(struct device *);
+void transport_configure_device(struct device *);
+void transport_destroy_device(struct device *);
+
+static inline void
+transport_register_device(struct device *dev)
+{
+	transport_setup_device(dev);
+	transport_add_device(dev);
+}
+
+static inline void
+transport_unregister_device(struct device *dev)
+{
+	transport_remove_device(dev);
+	transport_destroy_device(dev);
+}
+
+static inline int transport_container_register(struct transport_container *tc)
+{
+	return attribute_container_register(&tc->ac);
+}
+
+static inline int transport_container_unregister(struct transport_container *tc)
+{
+	return attribute_container_unregister(&tc->ac);
+}
+
+int transport_class_register(struct transport_class *);
+int anon_transport_class_register(struct anon_transport_class *);
+void transport_class_unregister(struct transport_class *);
+void anon_transport_class_unregister(struct anon_transport_class *);
+
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U5/include/scsi/iscsi_proto.h b/kernel_addons/backport/2.6.9_U5/include/scsi/iscsi_proto.h
new file mode 100644
index 0000000..02f6e4b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/scsi/iscsi_proto.h
@@ -0,0 +1,587 @@
+/*
+ * RFC 3720 (iSCSI) protocol data types
+ *
+ * Copyright (C) 2005 Dmitry Yusupov
+ * Copyright (C) 2005 Alex Aizman
+ * maintained by open-iscsi at googlegroups.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published
+ * by the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * See the file COPYING included with this distribution for more details.
+ */
+
+#ifndef ISCSI_PROTO_H
+#define ISCSI_PROTO_H
+
+#define ISCSI_DRAFT20_VERSION	0x00
+
+/* default iSCSI listen port for incoming connections */
+#define ISCSI_LISTEN_PORT	3260
+
+/* Padding word length */
+#define PAD_WORD_LEN		4
+
+/*
+ * useful common (control and data paths) macros
+ */
+#define ntoh24(p) (((p)[0] << 16) | ((p)[1] << 8) | ((p)[2]))
+#define hton24(p, v) { \
+        p[0] = (((v) >> 16) & 0xFF); \
+        p[1] = (((v) >> 8) & 0xFF); \
+        p[2] = ((v) & 0xFF); \
+}
+#define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;}
+
+/*
+ * iSCSI Template Message Header
+ */
+struct iscsi_hdr {
+	uint8_t		opcode;
+	uint8_t		flags;		/* Final bit */
+	uint8_t		rsvd2[2];
+	uint8_t		hlength;	/* AHSs total length */
+	uint8_t		dlength[3];	/* Data length */
+	uint8_t		lun[8];
+	__be32		itt;		/* Initiator Task Tag */
+	__be32		ttt;		/* Target Transfer Tag */
+	__be32		statsn;
+	__be32		exp_statsn;
+	__be32		max_statsn;
+	uint8_t		other[12];
+};
+
+/************************* RFC 3720 Begin *****************************/
+
+#define ISCSI_RESERVED_TAG		0xffffffff
+
+/* Opcode encoding bits */
+#define ISCSI_OP_RETRY			0x80
+#define ISCSI_OP_IMMEDIATE		0x40
+#define ISCSI_OPCODE_MASK		0x3F
+
+/* Initiator Opcode values */
+#define ISCSI_OP_NOOP_OUT		0x00
+#define ISCSI_OP_SCSI_CMD		0x01
+#define ISCSI_OP_SCSI_TMFUNC		0x02
+#define ISCSI_OP_LOGIN			0x03
+#define ISCSI_OP_TEXT			0x04
+#define ISCSI_OP_SCSI_DATA_OUT		0x05
+#define ISCSI_OP_LOGOUT			0x06
+#define ISCSI_OP_SNACK			0x10
+
+#define ISCSI_OP_VENDOR1_CMD		0x1c
+#define ISCSI_OP_VENDOR2_CMD		0x1d
+#define ISCSI_OP_VENDOR3_CMD		0x1e
+#define ISCSI_OP_VENDOR4_CMD		0x1f
+
+/* Target Opcode values */
+#define ISCSI_OP_NOOP_IN		0x20
+#define ISCSI_OP_SCSI_CMD_RSP		0x21
+#define ISCSI_OP_SCSI_TMFUNC_RSP	0x22
+#define ISCSI_OP_LOGIN_RSP		0x23
+#define ISCSI_OP_TEXT_RSP		0x24
+#define ISCSI_OP_SCSI_DATA_IN		0x25
+#define ISCSI_OP_LOGOUT_RSP		0x26
+#define ISCSI_OP_R2T			0x31
+#define ISCSI_OP_ASYNC_EVENT		0x32
+#define ISCSI_OP_REJECT			0x3f
+
+struct iscsi_ahs_hdr {
+	__be16 ahslength;
+	uint8_t ahstype;
+	uint8_t ahspec[5];
+};
+
+#define ISCSI_AHSTYPE_CDB		1
+#define ISCSI_AHSTYPE_RLENGTH		2
+
+/* iSCSI PDU Header */
+struct iscsi_cmd {
+	uint8_t opcode;
+	uint8_t flags;
+	__be16 rsvd2;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32 itt;	/* Initiator Task Tag */
+	__be32 data_length;
+	__be32 cmdsn;
+	__be32 exp_statsn;
+	uint8_t cdb[16];	/* SCSI Command Block */
+	/* Additional Data (Command Dependent) */
+};
+
+/* Command PDU flags */
+#define ISCSI_FLAG_CMD_FINAL		0x80
+#define ISCSI_FLAG_CMD_READ		0x40
+#define ISCSI_FLAG_CMD_WRITE		0x20
+#define ISCSI_FLAG_CMD_ATTR_MASK	0x07	/* 3 bits */
+
+/* SCSI Command Attribute values */
+#define ISCSI_ATTR_UNTAGGED		0
+#define ISCSI_ATTR_SIMPLE		1
+#define ISCSI_ATTR_ORDERED		2
+#define ISCSI_ATTR_HEAD_OF_QUEUE	3
+#define ISCSI_ATTR_ACA			4
+
+struct iscsi_rlength_ahdr {
+	__be16 ahslength;
+	uint8_t ahstype;
+	uint8_t reserved;
+	__be32 read_length;
+};
+
+/* SCSI Response Header */
+struct iscsi_cmd_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t response;
+	uint8_t cmd_status;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rsvd1;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	exp_datasn;
+	__be32	bi_residual_count;
+	__be32	residual_count;
+	/* Response or Sense Data (optional) */
+};
+
+/* Command Response PDU flags */
+#define ISCSI_FLAG_CMD_BIDI_OVERFLOW	0x10
+#define ISCSI_FLAG_CMD_BIDI_UNDERFLOW	0x08
+#define ISCSI_FLAG_CMD_OVERFLOW		0x04
+#define ISCSI_FLAG_CMD_UNDERFLOW	0x02
+
+/* iSCSI Status values. Valid if Rsp Selector bit is not set */
+#define ISCSI_STATUS_CMD_COMPLETED	0
+#define ISCSI_STATUS_TARGET_FAILURE	1
+#define ISCSI_STATUS_SUBSYS_FAILURE	2
+
+/* Asynchronous Event Header */
+struct iscsi_async {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t rsvd3;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	uint8_t rsvd4[8];
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t async_event;
+	uint8_t async_vcode;
+	__be16	param1;
+	__be16	param2;
+	__be16	param3;
+	uint8_t rsvd5[4];
+};
+
+/* iSCSI Event Codes */
+#define ISCSI_ASYNC_MSG_SCSI_EVENT			0
+#define ISCSI_ASYNC_MSG_REQUEST_LOGOUT			1
+#define ISCSI_ASYNC_MSG_DROPPING_CONNECTION		2
+#define ISCSI_ASYNC_MSG_DROPPING_ALL_CONNECTIONS	3
+#define ISCSI_ASYNC_MSG_PARAM_NEGOTIATION		4
+#define ISCSI_ASYNC_MSG_VENDOR_SPECIFIC			255
+
+/* NOP-Out Message */
+struct iscsi_nopout {
+	uint8_t opcode;
+	uint8_t flags;
+	__be16	rsvd2;
+	uint8_t rsvd3;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	ttt;	/* Target Transfer Tag */
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	uint8_t rsvd4[16];
+};
+
+/* NOP-In Message */
+struct iscsi_nopin {
+	uint8_t opcode;
+	uint8_t flags;
+	__be16	rsvd2;
+	uint8_t rsvd3;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	ttt;	/* Target Transfer Tag */
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t rsvd4[12];
+};
+
+/* SCSI Task Management Message Header */
+struct iscsi_tm {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd1[2];
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rtt;	/* Reference Task Tag */
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	__be32	refcmdsn;
+	__be32	exp_datasn;
+	uint8_t rsvd2[8];
+};
+
+#define ISCSI_FLAG_TM_FUNC_MASK			0x7F
+
+/* Function values */
+#define ISCSI_TM_FUNC_ABORT_TASK		1
+#define ISCSI_TM_FUNC_ABORT_TASK_SET		2
+#define ISCSI_TM_FUNC_CLEAR_ACA			3
+#define ISCSI_TM_FUNC_CLEAR_TASK_SET		4
+#define ISCSI_TM_FUNC_LOGICAL_UNIT_RESET	5
+#define ISCSI_TM_FUNC_TARGET_WARM_RESET		6
+#define ISCSI_TM_FUNC_TARGET_COLD_RESET		7
+#define ISCSI_TM_FUNC_TASK_REASSIGN		8
+
+/* SCSI Task Management Response Header */
+struct iscsi_tm_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t response;	/* see Response values below */
+	uint8_t qualifier;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd2[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rtt;	/* Reference Task Tag */
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t rsvd3[12];
+};
+
+/* Response values */
+#define ISCSI_TMF_RSP_COMPLETE		0x00
+#define ISCSI_TMF_RSP_NO_TASK		0x01
+#define ISCSI_TMF_RSP_NO_LUN		0x02
+#define ISCSI_TMF_RSP_TASK_ALLEGIANT	0x03
+#define ISCSI_TMF_RSP_NO_FAILOVER	0x04
+#define ISCSI_TMF_RSP_NOT_SUPPORTED	0x05
+#define ISCSI_TMF_RSP_AUTH_FAILED	0x06
+#define ISCSI_TMF_RSP_REJECTED		0xff
+
+/* Ready To Transfer Header */
+struct iscsi_r2t_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t	hlength;
+	uint8_t	dlength[3];
+	uint8_t lun[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	ttt;	/* Target Transfer Tag */
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	r2tsn;
+	__be32	data_offset;
+	__be32	data_length;
+};
+
+/* SCSI Data Hdr */
+struct iscsi_data {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t rsvd3;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;
+	__be32	ttt;
+	__be32	rsvd4;
+	__be32	exp_statsn;
+	__be32	rsvd5;
+	__be32	datasn;
+	__be32	offset;
+	__be32	rsvd6;
+	/* Payload */
+};
+
+/* SCSI Data Response Hdr */
+struct iscsi_data_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2;
+	uint8_t cmd_status;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t lun[8];
+	__be32	itt;
+	__be32	ttt;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	datasn;
+	__be32	offset;
+	__be32	residual_count;
+};
+
+/* Data Response PDU flags */
+#define ISCSI_FLAG_DATA_ACK		0x40
+#define ISCSI_FLAG_DATA_OVERFLOW	0x04
+#define ISCSI_FLAG_DATA_UNDERFLOW	0x02
+#define ISCSI_FLAG_DATA_STATUS		0x01
+
+/* Text Header */
+struct iscsi_text {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd4[8];
+	__be32	itt;
+	__be32	ttt;
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	uint8_t rsvd5[16];
+	/* Text - key=value pairs */
+};
+
+#define ISCSI_FLAG_TEXT_CONTINUE	0x40
+
+/* Text Response Header */
+struct iscsi_text_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[2];
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd4[8];
+	__be32	itt;
+	__be32	ttt;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t rsvd5[12];
+	/* Text Response - key:value pairs */
+};
+
+/* Login Header */
+struct iscsi_login {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t max_version;	/* Max. version supported */
+	uint8_t min_version;	/* Min. version supported */
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t isid[6];	/* Initiator Session ID */
+	__be16	tsih;	/* Target Session Handle */
+	__be32	itt;	/* Initiator Task Tag */
+	__be16	cid;
+	__be16	rsvd3;
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	uint8_t rsvd5[16];
+};
+
+/* Login PDU flags */
+#define ISCSI_FLAG_LOGIN_TRANSIT		0x80
+#define ISCSI_FLAG_LOGIN_CONTINUE		0x40
+#define ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK	0x0C	/* 2 bits */
+#define ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK	0x03	/* 2 bits */
+
+#define ISCSI_LOGIN_CURRENT_STAGE(flags) \
+	(((flags) & ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK) >> 2)
+#define ISCSI_LOGIN_NEXT_STAGE(flags) \
+	((flags) & ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK)
+
+/* Login Response Header */
+struct iscsi_login_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t max_version;	/* Max. version supported */
+	uint8_t active_version;	/* Active version */
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t isid[6];	/* Initiator Session ID */
+	__be16	tsih;	/* Target Session Handle */
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rsvd3;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	uint8_t status_class;	/* see Login RSP status classes below */
+	uint8_t status_detail;	/* see Login RSP Status details below */
+	uint8_t rsvd4[10];
+};
+
+/* Login stage (phase) codes for CSG, NSG */
+#define ISCSI_INITIAL_LOGIN_STAGE		-1
+#define ISCSI_SECURITY_NEGOTIATION_STAGE	0
+#define ISCSI_OP_PARMS_NEGOTIATION_STAGE	1
+#define ISCSI_FULL_FEATURE_PHASE		3
+
+/* Login Status response classes */
+#define ISCSI_STATUS_CLS_SUCCESS		0x00
+#define ISCSI_STATUS_CLS_REDIRECT		0x01
+#define ISCSI_STATUS_CLS_INITIATOR_ERR		0x02
+#define ISCSI_STATUS_CLS_TARGET_ERR		0x03
+
+/* Login Status response detail codes */
+/* Class-0 (Success) */
+#define ISCSI_LOGIN_STATUS_ACCEPT		0x00
+
+/* Class-1 (Redirection) */
+#define ISCSI_LOGIN_STATUS_TGT_MOVED_TEMP	0x01
+#define ISCSI_LOGIN_STATUS_TGT_MOVED_PERM	0x02
+
+/* Class-2 (Initiator Error) */
+#define ISCSI_LOGIN_STATUS_INIT_ERR		0x00
+#define ISCSI_LOGIN_STATUS_AUTH_FAILED		0x01
+#define ISCSI_LOGIN_STATUS_TGT_FORBIDDEN	0x02
+#define ISCSI_LOGIN_STATUS_TGT_NOT_FOUND	0x03
+#define ISCSI_LOGIN_STATUS_TGT_REMOVED		0x04
+#define ISCSI_LOGIN_STATUS_NO_VERSION		0x05
+#define ISCSI_LOGIN_STATUS_ISID_ERROR		0x06
+#define ISCSI_LOGIN_STATUS_MISSING_FIELDS	0x07
+#define ISCSI_LOGIN_STATUS_CONN_ADD_FAILED	0x08
+#define ISCSI_LOGIN_STATUS_NO_SESSION_TYPE	0x09
+#define ISCSI_LOGIN_STATUS_NO_SESSION		0x0a
+#define ISCSI_LOGIN_STATUS_INVALID_REQUEST	0x0b
+
+/* Class-3 (Target Error) */
+#define ISCSI_LOGIN_STATUS_TARGET_ERROR		0x00
+#define ISCSI_LOGIN_STATUS_SVC_UNAVAILABLE	0x01
+#define ISCSI_LOGIN_STATUS_NO_RESOURCES		0x02
+
+/* Logout Header */
+struct iscsi_logout {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd1[2];
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd2[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be16	cid;
+	uint8_t rsvd3[2];
+	__be32	cmdsn;
+	__be32	exp_statsn;
+	uint8_t rsvd4[16];
+};
+
+/* Logout PDU flags */
+#define ISCSI_FLAG_LOGOUT_REASON_MASK	0x7F
+
+/* logout reason_code values */
+
+#define ISCSI_LOGOUT_REASON_CLOSE_SESSION	0
+#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION	1
+#define ISCSI_LOGOUT_REASON_RECOVERY		2
+#define ISCSI_LOGOUT_REASON_AEN_REQUEST		3
+
+/* Logout Response Header */
+struct iscsi_logout_rsp {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t response;	/* see Logout response values below */
+	uint8_t rsvd2;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd3[8];
+	__be32	itt;	/* Initiator Task Tag */
+	__be32	rsvd4;
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	rsvd5;
+	__be16	t2wait;
+	__be16	t2retain;
+	__be32	rsvd6;
+};
+
+/* logout response status values */
+
+#define ISCSI_LOGOUT_SUCCESS			0
+#define ISCSI_LOGOUT_CID_NOT_FOUND		1
+#define ISCSI_LOGOUT_RECOVERY_UNSUPPORTED	2
+#define ISCSI_LOGOUT_CLEANUP_FAILED		3
+
+/* SNACK Header */
+struct iscsi_snack {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t rsvd2[14];
+	__be32	itt;
+	__be32	begrun;
+	__be32	runlength;
+	__be32	exp_statsn;
+	__be32	rsvd3;
+	__be32	exp_datasn;
+	uint8_t rsvd6[8];
+};
+
+/* SNACK PDU flags */
+#define ISCSI_FLAG_SNACK_TYPE_MASK	0x0F	/* 4 bits */
+
+/* Reject Message Header */
+struct iscsi_reject {
+	uint8_t opcode;
+	uint8_t flags;
+	uint8_t reason;
+	uint8_t rsvd2;
+	uint8_t hlength;
+	uint8_t dlength[3];
+	uint8_t rsvd3[8];
+	__be32  ffffffff;
+	uint8_t rsvd4[4];
+	__be32	statsn;
+	__be32	exp_cmdsn;
+	__be32	max_cmdsn;
+	__be32	datasn;
+	uint8_t rsvd5[8];
+	/* Text - Rejected hdr */
+};
+
+/* Reason for Reject */
+#define ISCSI_REASON_CMD_BEFORE_LOGIN	1
+#define ISCSI_REASON_DATA_DIGEST_ERROR	2
+#define ISCSI_REASON_DATA_SNACK_REJECT	3
+#define ISCSI_REASON_PROTOCOL_ERROR	4
+#define ISCSI_REASON_CMD_NOT_SUPPORTED	5
+#define ISCSI_REASON_IMM_CMD_REJECT		6
+#define ISCSI_REASON_TASK_IN_PROGRESS	7
+#define ISCSI_REASON_INVALID_SNACK		8
+#define ISCSI_REASON_BOOKMARK_INVALID	9
+#define ISCSI_REASON_BOOKMARK_NO_RESOURCES	10
+#define ISCSI_REASON_NEGOTIATION_RESET	11
+
+/* Max. number of Key=Value pairs in a text message */
+#define MAX_KEY_VALUE_PAIRS	8192
+
+/* maximum length for text keys/values */
+#define KEY_MAXLEN		64
+#define VALUE_MAXLEN		255
+#define TARGET_NAME_MAXLEN	VALUE_MAXLEN
+
+#define DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH	8192
+
+/************************* RFC 3720 End *****************************/
+
+#endif /* ISCSI_PROTO_H */
diff --git a/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_device.h b/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_device.h
new file mode 100644
index 0000000..f353e0b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_device.h
@@ -0,0 +1,19 @@
+#ifndef _SCSI_SCSI_DEVICE_H_BACKPORT
+#define _SCSI_SCSI_DEVICE_H_BACKPORT
+
+#include_next <scsi/scsi_device.h>
+
+#include <linux/device.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/workqueue.h>
+#include <asm/atomic.h>
+
+struct scsi_lun;
+
+extern void int_to_scsilun(unsigned int, struct scsi_lun *);
+extern void scsi_target_block(struct device *);
+extern void scsi_target_unblock(struct device *);
+extern void starget_for_each_device(struct scsi_target *, void *,
+		     void (*fn)(struct scsi_device *, void *));
+#endif /* _SCSI_SCSI_DEVICE_H_BACKPORT */
diff --git a/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_host.h b/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_host.h
new file mode 100644
index 0000000..b7e019b
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_host.h
@@ -0,0 +1,8 @@
+#ifndef _SCSI_SCSI_HOST_H_BACKPORT
+#define _SCSI_SCSI_HOST_H_BACKPORT
+
+#include_next <scsi/scsi_host.h>
+
+#define scsi_queue_work(shost, work) schedule_work(work)
+
+#endif
diff --git a/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_transport.h b/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_transport.h
new file mode 100644
index 0000000..99c2b12
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_transport.h
@@ -0,0 +1,8 @@
+#ifndef _SCSI_SCSI_TRANSPORT_H_BACKPORT
+#define _SCSI_SCSI_TRANSPORT_H_BACKPORT
+
+#include_next <scsi/scsi_transport.h>
+
+#include <linux/transport_class.h>
+
+#endif /* _SCSI_SCSI_TRANSPORT_H_BACKPORT */
diff --git a/kernel_addons/backport/2.6.9_U5/include/src/attribute_container.c b/kernel_addons/backport/2.6.9_U5/include/src/attribute_container.c
new file mode 100644
index 0000000..44948d1
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/src/attribute_container.c
@@ -0,0 +1,438 @@
+/*
+ * attribute_container.c - implementation of a simple container for classes
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ *
+ * The basic idea here is to enable a device to be attached to an
+ * arbitrary number of classes without having to allocate storage for them.
+ * Instead, the contained classes select the devices they need to attach
+ * to via a matching function.
+ */
+
+#include <linux/attribute_container.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/module.h>
+
+#include "base.h"
+
+/* This is a private structure used to tie the classdev and the
+ * container .. it should never be visible outside this file */
+struct internal_container {
+	struct klist_node node;
+	struct attribute_container *cont;
+	struct class_device classdev;
+};
+
+static void internal_container_klist_get(struct klist_node *n)
+{
+	struct internal_container *ic =
+		container_of(n, struct internal_container, node);
+	class_device_get(&ic->classdev);
+}
+
+static void internal_container_klist_put(struct klist_node *n)
+{
+	struct internal_container *ic =
+		container_of(n, struct internal_container, node);
+	class_device_put(&ic->classdev);
+}
+
+
+/**
+ * attribute_container_classdev_to_container - given a classdev, return the container
+ *
+ * @classdev: the class device created by attribute_container_add_device.
+ *
+ * Returns the container associated with this classdev.
+ */
+struct attribute_container *
+attribute_container_classdev_to_container(struct class_device *classdev)
+{
+	struct internal_container *ic =
+		container_of(classdev, struct internal_container, classdev);
+	return ic->cont;
+}
+EXPORT_SYMBOL_GPL(attribute_container_classdev_to_container);
+
+static struct list_head attribute_container_list;
+
+static DECLARE_MUTEX(attribute_container_mutex);
+
+/**
+ * attribute_container_register - register an attribute container
+ *
+ * @cont: The container to register.  This must be allocated by the
+ *        caller and should also be zeroed by it.
+ */
+int
+attribute_container_register(struct attribute_container *cont)
+{
+	INIT_LIST_HEAD(&cont->node);
+	klist_init(&cont->containers,internal_container_klist_get,
+		   internal_container_klist_put);
+		
+	down(&attribute_container_mutex);
+	list_add_tail(&cont->node, &attribute_container_list);
+	up(&attribute_container_mutex);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(attribute_container_register);
+
+/**
+ * attribute_container_unregister - remove a container registration
+ *
+ * @cont: previously registered container to remove
+ */
+int
+attribute_container_unregister(struct attribute_container *cont)
+{
+	int retval = -EBUSY;
+	down(&attribute_container_mutex);
+	spin_lock(&cont->containers.k_lock);
+	if (!list_empty(&cont->containers.k_list))
+		goto out;
+	retval = 0;
+	list_del(&cont->node);
+ out:
+	spin_unlock(&cont->containers.k_lock);
+	up(&attribute_container_mutex);
+	return retval;
+		
+}
+EXPORT_SYMBOL_GPL(attribute_container_unregister);
+
+/* private function used as class release */
+static void attribute_container_release(struct class_device *classdev)
+{
+	struct internal_container *ic 
+		= container_of(classdev, struct internal_container, classdev);
+	struct device *dev = classdev->dev;
+
+	kfree(ic);
+	put_device(dev);
+}
+
+/**
+ * attribute_container_add_device - see if any container is interested in dev
+ *
+ * @dev: device to add attributes to
+ * @fn:	 function to trigger addition of class device.
+ *
+ * This function allocates storage for the class device(s) to be
+ * attached to dev (one for each matching attribute_container).  If no
+ * fn is provided, the code will simply register the class device via
+ * class_device_add.  If a function is provided, it is expected to add
+ * the class device at the appropriate time.  One of the things that
+ * might be necessary is to allocate and initialise the classdev and
+ * then add it at a later time.  To do this, call this routine for
+ * allocation and initialisation and then use
+ * attribute_container_device_trigger() to call class_device_add() on
+ * it.  Note: after this, the class device contains a reference to dev
+ * which is not relinquished until the release of the classdev.
+ */
+void
+attribute_container_add_device(struct device *dev,
+			       int (*fn)(struct attribute_container *,
+					 struct device *,
+					 struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+
+		if (attribute_container_no_classdevs(cont))
+			continue;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		ic = kzalloc(sizeof(*ic), GFP_KERNEL);
+		if (!ic) {
+			dev_printk(KERN_ERR, dev, "failed to allocate class container\n");
+			continue;
+		}
+
+		ic->cont = cont;
+		class_device_initialize(&ic->classdev);
+		ic->classdev.dev = get_device(dev);
+		ic->classdev.class = cont->class;
+		cont->class->release = attribute_container_release;
+		strcpy(ic->classdev.class_id, dev->bus_id);
+		if (fn)
+			fn(cont, dev, &ic->classdev);
+		else
+			attribute_container_add_class_device(&ic->classdev);
+		klist_add_tail(&ic->node, &cont->containers);
+	}
+	up(&attribute_container_mutex);
+}
+
+/* FIXME: can't break out of this unless klist_iter_exit is also
+ * called before doing the break
+ */
+#define klist_for_each_entry(pos, head, member, iter) \
+	for (klist_iter_init(head, iter); (pos = ({ \
+		struct klist_node *n = klist_next(iter); \
+		n ? container_of(n, typeof(*pos), member) : \
+			({ klist_iter_exit(iter) ; NULL; }); \
+	}) ) != NULL; )
+			
+
+/**
+ * attribute_container_remove_device - make device eligible for removal.
+ *
+ * @dev:  The generic device
+ * @fn:	  A function to call to remove the device
+ *
+ * This routine triggers device removal.  If fn is NULL, then it is
+ * simply done via class_device_unregister (note that if something
+ * still has a reference to the classdev, then the memory occupied
+ * will not be freed until the classdev is released).  If you want a
+ * two phase release: remove from visibility and then delete the
+ * device, then you should use this routine with a fn that calls
+ * class_device_del() and then use
+ * attribute_container_device_trigger() to do the final put on the
+ * classdev.
+ */
+void
+attribute_container_remove_device(struct device *dev,
+				  void (*fn)(struct attribute_container *,
+					     struct device *,
+					     struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+		struct klist_iter iter;
+
+		if (attribute_container_no_classdevs(cont))
+			continue;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		klist_for_each_entry(ic, &cont->containers, node, &iter) {
+			if (dev != ic->classdev.dev)
+				continue;
+			klist_del(&ic->node);
+			if (fn)
+				fn(cont, dev, &ic->classdev);
+			else {
+				attribute_container_remove_attrs(&ic->classdev);
+				class_device_unregister(&ic->classdev);
+			}
+		}
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_device_trigger - execute a trigger for each matching classdev
+ *
+ * @dev:  The generic device to run the trigger for
+ * @fn:	  the function to execute for each classdev.
+ *
+ * This function is for executing a trigger when you need to know both
+ * the container and the classdev.  If you only care about the
+ * container, then use attribute_container_trigger() instead.
+ */
+void
+attribute_container_device_trigger(struct device *dev, 
+				   int (*fn)(struct attribute_container *,
+					     struct device *,
+					     struct class_device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		struct internal_container *ic;
+		struct klist_iter iter;
+
+		if (!cont->match(cont, dev))
+			continue;
+
+		if (attribute_container_no_classdevs(cont)) {
+			fn(cont, dev, NULL);
+			continue;
+		}
+
+		klist_for_each_entry(ic, &cont->containers, node, &iter) {
+			if (dev == ic->classdev.dev)
+				fn(cont, dev, &ic->classdev);
+		}
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_trigger - trigger a function for each matching container
+ *
+ * @dev:  The generic device to activate the trigger for
+ * @fn:	  the function to trigger
+ *
+ * This routine triggers a function that only needs to know the
+ * matching containers (not the classdev) associated with a device.
+ * It is more lightweight than attribute_container_device_trigger, so
+ * should be used in preference unless the triggering function
+ * actually needs to know the classdev.
+ */
+void
+attribute_container_trigger(struct device *dev,
+			    int (*fn)(struct attribute_container *,
+				      struct device *))
+{
+	struct attribute_container *cont;
+
+	down(&attribute_container_mutex);
+	list_for_each_entry(cont, &attribute_container_list, node) {
+		if (cont->match(cont, dev))
+			fn(cont, dev);
+	}
+	up(&attribute_container_mutex);
+}
+
+/**
+ * attribute_container_add_attrs - add attributes
+ *
+ * @classdev: The class device
+ *
+ * This simply creates all the class device sysfs files from the
+ * attributes listed in the container
+ */
+int
+attribute_container_add_attrs(struct class_device *classdev)
+{
+	struct attribute_container *cont =
+		attribute_container_classdev_to_container(classdev);
+	struct class_device_attribute **attrs =	cont->attrs;
+	int i, error;
+
+	if (!attrs)
+		return 0;
+
+	for (i = 0; attrs[i]; i++) {
+		error = class_device_create_file(classdev, attrs[i]);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/**
+ * attribute_container_add_class_device - same function as class_device_add
+ *
+ * @classdev:	the class device to add
+ *
+ * This performs essentially the same function as class_device_add except for
+ * attribute containers, namely add the classdev to the system and then
+ * create the attribute files
+ */
+int
+attribute_container_add_class_device(struct class_device *classdev)
+{
+	int error = class_device_add(classdev);
+	if (error)
+		return error;
+	return attribute_container_add_attrs(classdev);
+}
+
+/**
+ * attribute_container_add_class_device_adapter - simple adapter for triggers
+ *
+ * This function is identical to attribute_container_add_class_device except
+ * that it is designed to be called from the triggers
+ */
+int
+attribute_container_add_class_device_adapter(struct attribute_container *cont,
+					     struct device *dev,
+					     struct class_device *classdev)
+{
+	return attribute_container_add_class_device(classdev);
+}
+
+/**
+ * attribute_container_remove_attrs - remove any attribute files
+ *
+ * @classdev: The class device to remove the files from
+ *
+ */
+void
+attribute_container_remove_attrs(struct class_device *classdev)
+{
+	struct attribute_container *cont =
+		attribute_container_classdev_to_container(classdev);
+	struct class_device_attribute **attrs =	cont->attrs;
+	int i;
+
+	if (!attrs)
+		return;
+
+	for (i = 0; attrs[i]; i++)
+		class_device_remove_file(classdev, attrs[i]);
+}
+
+/**
+ * attribute_container_class_device_del - equivalent of class_device_del
+ *
+ * @classdev: the class device
+ *
+ * This function simply removes all the attribute files and then calls
+ * class_device_del.
+ */
+void
+attribute_container_class_device_del(struct class_device *classdev)
+{
+	attribute_container_remove_attrs(classdev);
+	class_device_del(classdev);
+}
+
+/**
+ * attribute_container_find_class_device - find the corresponding class_device
+ *
+ * @cont:	the container
+ * @dev:	the generic device
+ *
+ * Looks up the device in the container's list of class devices and returns
+ * the corresponding class_device.
+ */
+struct class_device *
+attribute_container_find_class_device(struct attribute_container *cont,
+				      struct device *dev)
+{
+	struct class_device *cdev = NULL;
+	struct internal_container *ic;
+	struct klist_iter iter;
+
+	klist_for_each_entry(ic, &cont->containers, node, &iter) {
+		if (ic->classdev.dev == dev) {
+			cdev = &ic->classdev;
+			/* FIXME: must exit iterator then break */
+			klist_iter_exit(&iter);
+			break;
+		}
+	}
+
+	return cdev;
+}
+EXPORT_SYMBOL_GPL(attribute_container_find_class_device);
+
+int
+attribute_container_init(void)
+{
+	INIT_LIST_HEAD(&attribute_container_list);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(attribute_container_init);
diff --git a/kernel_addons/backport/2.6.9_U5/include/src/base.h b/kernel_addons/backport/2.6.9_U5/include/src/base.h
new file mode 100644
index 0000000..a5f8936
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/src/base.h
@@ -0,0 +1 @@
+extern int attribute_container_init(void);
diff --git a/kernel_addons/backport/2.6.9_U5/include/src/init.c b/kernel_addons/backport/2.6.9_U5/include/src/init.c
new file mode 100644
index 0000000..15f0bc6
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/src/init.c
@@ -0,0 +1,26 @@
+/*
+ *
+ * Copyright (c) 2002-3 Patrick Mochel
+ * Copyright (c) 2002-3 Open Source Development Labs
+ *
+ * This file is released under the GPLv2
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/memory.h>
+
+#include "base.h"
+
+/**
+ *	driver_init - initialize driver model.
+ *
+ *	Call the driver model init functions to initialize their
+ *	subsystems. Called early from init/main.c.
+ */
+
+void __init driver_init(void)
+{
+	attribute_container_init();
+}
diff --git a/kernel_addons/backport/2.6.9_U5/include/src/klist.c b/kernel_addons/backport/2.6.9_U5/include/src/klist.c
new file mode 100644
index 0000000..3b29ebc
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/src/klist.c
@@ -0,0 +1,287 @@
+/*
+ *	klist.c - Routines for manipulating klists.
+ *
+ *
+ *	This klist interface provides a couple of structures that wrap around 
+ *	struct list_head to provide explicit list "head" (struct klist) and 
+ *	list "node" (struct klist_node) objects. For struct klist, a spinlock
+ *	is included that protects access to the actual list itself. struct 
+ *	klist_node provides a pointer to the klist that owns it and a kref
+ *	reference count that indicates the number of current users of that node
+ *	in the list.
+ *
+ *	The entire point is to provide an interface for iterating over a list
+ *	that is safe and allows for modification of the list during the
+ *	iteration (e.g. insertion and removal), including modification of the
+ *	current node on the list.
+ *
+ *	It works using a 3rd object type - struct klist_iter - that is declared
+ *	and initialized before an iteration. klist_next() is used to acquire the
+ *	next element in the list. It returns NULL if there are no more items.
+ *	Internally, that routine takes the klist's lock, decrements the reference
+ *	count of the previous klist_node and increments the count of the next
+ *	klist_node. It then drops the lock and returns.
+ *
+ *	There are primitives for adding and removing nodes to/from a klist. 
+ *	When deleting, klist_del() will simply decrement the reference count. 
+ *	Only when the count goes to 0 is the node removed from the list. 
+ *	klist_remove() will try to delete the node from the list and block
+ *	until it is actually removed. This is useful for objects (like devices)
+ *	that have been removed from the system and must be freed (but must wait
+ *	until all accessors have finished).
+ *
+ *	Copyright (C) 2005 Patrick Mochel
+ *
+ *	This file is released under the GPL v2.
+ */
+
+#include <linux/klist.h>
+#include <linux/module.h>
+
+
+/**
+ *	klist_init - Initialize a klist structure. 
+ *	@k:	The klist we're initializing.
+ *	@get:	The get function for the embedding object (NULL if none)
+ *	@put:	The put function for the embedding object (NULL if none)
+ *
+ * Initialises the klist structure.  If the klist_node structures are
+ * going to be embedded in refcounted objects (necessary for safe
+ * deletion) then the get/put arguments are used to initialise
+ * functions that take and release references on the embedding
+ * objects.
+ */
+
+void klist_init(struct klist * k, void (*get)(struct klist_node *),
+		void (*put)(struct klist_node *))
+{
+	INIT_LIST_HEAD(&k->k_list);
+	spin_lock_init(&k->k_lock);
+	k->get = get;
+	k->put = put;
+}
+
+EXPORT_SYMBOL_GPL(klist_init);
+
+
+static void add_head(struct klist * k, struct klist_node * n)
+{
+	spin_lock(&k->k_lock);
+	list_add(&n->n_node, &k->k_list);
+	spin_unlock(&k->k_lock);
+}
+
+static void add_tail(struct klist * k, struct klist_node * n)
+{
+	spin_lock(&k->k_lock);
+	list_add_tail(&n->n_node, &k->k_list);
+	spin_unlock(&k->k_lock);
+}
+
+
+static void klist_node_init(struct klist * k, struct klist_node * n)
+{
+	INIT_LIST_HEAD(&n->n_node);
+	init_completion(&n->n_removed);
+	kref_init(&n->n_ref);
+	n->n_klist = k;
+	if (k->get)
+		k->get(n);
+}
+
+
+/**
+ *	klist_add_head - Initialize a klist_node and add it to front.
+ *	@n:	node we're adding.
+ *	@k:	klist it's going on.
+ */
+
+void klist_add_head(struct klist_node * n, struct klist * k)
+{
+	klist_node_init(k, n);
+	add_head(k, n);
+}
+
+EXPORT_SYMBOL_GPL(klist_add_head);
+
+
+/**
+ *	klist_add_tail - Initialize a klist_node and add it to back.
+ *	@n:	node we're adding.
+ *	@k:	klist it's going on.
+ */
+
+void klist_add_tail(struct klist_node * n, struct klist * k)
+{
+	klist_node_init(k, n);
+	add_tail(k, n);
+}
+
+EXPORT_SYMBOL_GPL(klist_add_tail);
+
+
+static void klist_release(struct kref * kref)
+{
+	struct klist_node * n = container_of(kref, struct klist_node, n_ref);
+
+	list_del(&n->n_node);
+	complete(&n->n_removed);
+	n->n_klist = NULL;
+}
+
+static int klist_dec_and_del(struct klist_node * n)
+{
+	return kref_put_new(&n->n_ref, klist_release);
+}
+
+
+/**
+ *	klist_del - Decrement the reference count of node and try to remove.
+ *	@n:	node we're deleting.
+ */
+
+void klist_del(struct klist_node * n)
+{
+	struct klist * k = n->n_klist;
+	void (*put)(struct klist_node *) = k->put;
+
+	spin_lock(&k->k_lock);
+	if (!klist_dec_and_del(n))
+		put = NULL;
+	spin_unlock(&k->k_lock);
+	if (put)
+		put(n);
+}
+
+EXPORT_SYMBOL_GPL(klist_del);
+
+
+/**
+ *	klist_remove - Decrement the refcount of node and wait for it to go away.
+ *	@n:	node we're removing.
+ */
+
+void klist_remove(struct klist_node * n)
+{
+	klist_del(n);
+	wait_for_completion(&n->n_removed);
+}
+
+EXPORT_SYMBOL_GPL(klist_remove);
+
+
+/**
+ *	klist_node_attached - Say whether a node is bound to a list or not.
+ *	@n:	Node that we're testing.
+ */
+
+int klist_node_attached(struct klist_node * n)
+{
+	return (n->n_klist != NULL);
+}
+
+EXPORT_SYMBOL_GPL(klist_node_attached);
+
+
+/**
+ *	klist_iter_init_node - Initialize a klist_iter structure.
+ *	@k:	klist we're iterating.
+ *	@i:	klist_iter we're filling.
+ *	@n:	node to start with.
+ *
+ *	Similar to klist_iter_init(), but starts the action off with @n, 
+ *	instead of with the list head.
+ */
+
+void klist_iter_init_node(struct klist * k, struct klist_iter * i, struct klist_node * n)
+{
+	i->i_klist = k;
+	i->i_head = &k->k_list;
+	i->i_cur = n;
+	if (n)
+		kref_get(&n->n_ref);
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_init_node);
+
+
+/**
+ *	klist_iter_init - Initialize a klist_iter structure.
+ *	@k:	klist we're iterating.
+ *	@i:	klist_iter structure we're filling.
+ *
+ *	Similar to klist_iter_init_node(), but start with the list head.
+ */
+
+void klist_iter_init(struct klist * k, struct klist_iter * i)
+{
+	klist_iter_init_node(k, i, NULL);
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_init);
+
+
+/**
+ *	klist_iter_exit - Finish a list iteration.
+ *	@i:	Iterator structure.
+ *
+ *	Must be called when done iterating over list, as it decrements the 
+ *	refcount of the current node. Necessary in case iteration exited before
+ *	the end of the list was reached, and always good form.
+ */
+
+void klist_iter_exit(struct klist_iter * i)
+{
+	if (i->i_cur) {
+		klist_del(i->i_cur);
+		i->i_cur = NULL;
+	}
+}
+
+EXPORT_SYMBOL_GPL(klist_iter_exit);
+
+
+static struct klist_node * to_klist_node(struct list_head * n)
+{
+	return container_of(n, struct klist_node, n_node);
+}
+
+
+/**
+ *	klist_next - Ante up next node in list.
+ *	@i:	Iterator structure.
+ *
+ *	First grab list lock. Decrement the reference count of the previous
+ *	node, if there was one. Grab the next node, increment its reference 
+ *	count, drop the lock, and return that next node.
+ */
+
+struct klist_node * klist_next(struct klist_iter * i)
+{
+	struct list_head * next;
+	struct klist_node * lnode = i->i_cur;
+	struct klist_node * knode = NULL;
+	void (*put)(struct klist_node *) = i->i_klist->put;
+
+	spin_lock(&i->i_klist->k_lock);
+	if (lnode) {
+		next = lnode->n_node.next;
+		if (!klist_dec_and_del(lnode))
+			put = NULL;
+	} else
+		next = i->i_head->next;
+
+	if (next != i->i_head) {
+		knode = to_klist_node(next);
+		kref_get(&knode->n_ref);
+	}
+	i->i_cur = knode;
+	spin_unlock(&i->i_klist->k_lock);
+	if (put && lnode)
+		put(lnode);
+	return knode;
+}
+
+EXPORT_SYMBOL_GPL(klist_next);
+
+
diff --git a/kernel_addons/backport/2.6.9_U5/include/src/kref_new.c b/kernel_addons/backport/2.6.9_U5/include/src/kref_new.c
new file mode 100644
index 0000000..d45bb3f
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/src/kref_new.c
@@ -0,0 +1,29 @@
+#include <linux/kref.h>
+#include <linux/module.h>
+
+/**
+ * kref_put_new - decrement refcount for object.
+ * @kref: object.
+ * @release: pointer to the function that will clean up the object when the
+ *           last reference to the object is released.
+ *           This pointer is required, and it is not acceptable to pass kfree
+ *           in as this function.
+ *
+ * Decrement the refcount, and if 0, call release().
+ * Return 1 if the object was removed, otherwise return 0.  Beware, if this
+ * function returns 0, you still cannot count on the kref remaining in
+ * memory.  Only use the return value if you want to see if the kref is now
+ * gone, not present.
+ */
+int kref_put_new(struct kref *kref, void (*release)(struct kref *kref))
+{
+        WARN_ON(release == NULL);
+        WARN_ON(release == (void (*)(struct kref *))kfree);
+
+        if (atomic_dec_and_test(&kref->refcount)) {
+                release(kref);
+                return 1;
+        }
+        return 0;
+}
+EXPORT_SYMBOL(kref_put_new);
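
The return-value contract of kref_put_new() (1 only for the caller that
drops the last reference, 0 otherwise) can be tried outside the kernel.
The following is a minimal user-space sketch, not part of the patch:
struct kref_demo, kref_demo_init(), kref_demo_get(), kref_demo_put() and
demo_release() are hypothetical names, and C11 atomic_fetch_sub() stands
in for the kernel's atomic_dec_and_test().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct kref. */
struct kref_demo {
	atomic_int refcount;
};

/* Counts how often the release callback ran (illustration only). */
static int released;

static void demo_release(struct kref_demo *k)
{
	(void)k;
	released++;
}

static void kref_demo_init(struct kref_demo *k)
{
	atomic_init(&k->refcount, 1);
}

static void kref_demo_get(struct kref_demo *k)
{
	atomic_fetch_add(&k->refcount, 1);
}

/*
 * Drop one reference.  atomic_fetch_sub() returns the value held
 * before the subtraction, so 1 means this call dropped the last
 * reference: run release() and report 1, otherwise report 0.
 */
static int kref_demo_put(struct kref_demo *k,
			 void (*release)(struct kref_demo *))
{
	assert(release != NULL);

	if (atomic_fetch_sub(&k->refcount, 1) == 1) {
		release(k);
		return 1;
	}
	return 0;
}
```

klist_del() and klist_next() in klist.c rely on exactly this return
value to decide whether the embedding object's put() callback should run
after the list lock is dropped.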
diff --git a/kernel_addons/backport/2.6.9_U5/include/src/scsi.c b/kernel_addons/backport/2.6.9_U5/include/src/scsi.c
new file mode 100644
index 0000000..8c833c0
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/src/scsi.c
@@ -0,0 +1,50 @@
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/delay.h>
+#include <linux/init.h>
+#include <linux/completion.h>
+#include <linux/unistd.h>
+#include <linux/spinlock.h>
+#include <linux/kmod.h>
+#include <linux/interrupt.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
+#include <linux/mutex.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_dbg.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_tcq.h>
+
+/**
+ * starget_for_each_device  -  helper to walk all devices of a target
+ * @starget:	target whose devices we want to iterate over.
+ *
+ * This traverses each device of @starget.  The iterator holds a
+ * reference on each device, which must be released with scsi_device_put
+ * when breaking out of the loop.
+ */
+void starget_for_each_device(struct scsi_target *starget, void * data,
+		     void (*fn)(struct scsi_device *, void *))
+{
+	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
+	struct scsi_device *sdev;
+
+	printk("%s: entry\n", __FUNCTION__);
+	shost_for_each_device(sdev, shost) {
+		if ((sdev->channel == starget->channel) &&
+		    (sdev->id == starget->id))
+			fn(sdev, data);
+	}
+	printk("%s: exit\n", __FUNCTION__);
+}
+EXPORT_SYMBOL(starget_for_each_device);
diff --git a/kernel_addons/backport/2.6.9_U5/include/src/scsi_lib.c b/kernel_addons/backport/2.6.9_U5/include/src/scsi_lib.c
new file mode 100644
index 0000000..f53f824
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/src/scsi_lib.c
@@ -0,0 +1,166 @@
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/completion.h>
+#include <linux/kernel.h>
+#include <linux/mempool.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <linux/hardirq.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_dbg.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_driver.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_host.h>
+
+int scsi_is_target_device(const struct device *dev)
+{
+        const char *str = dev->bus_id;
+
+	if (strncmp(str, "target", 6) == 0) {
+		return 1;
+	}
+
+        return 0;
+}
+
+/**
+ * scsi_internal_device_block - internal function to put a device
+ *                              temporarily into the SDEV_BLOCK state
+ * @sdev:       device to block
+ *
+ * Block request made by scsi lld's to temporarily stop all
+ * scsi commands on the specified device.  Called from interrupt
+ * or normal process context.
+ *
+ * Returns zero if successful or error if not
+ *
+ * Notes:
+ *      This routine transitions the device to the SDEV_BLOCK state
+ *      (which must be a legal transition).  When the device is in this
+ *      state, all commands are deferred until the scsi lld reenables
+ *      the device with scsi_device_unblock or device_block_tmo fires.
+ *      This routine assumes the host_lock is held on entry.
+ **/
+int
+scsi_internal_device_block(struct scsi_device *sdev)
+{
+        request_queue_t *q = sdev->request_queue;
+        unsigned long flags;
+        int err = 0;
+
+        err = scsi_device_set_state(sdev, SDEV_BLOCK);
+        if (err)
+		return err;
+                
+        /*
+         * The device has transitioned to SDEV_BLOCK.  Stop the
+         * block layer from calling the midlayer with this device's
+         * request queue.
+         */
+        spin_lock_irqsave(q->queue_lock, flags);
+        blk_stop_queue(q);
+        spin_unlock_irqrestore(q->queue_lock, flags);
+
+        return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_internal_device_block);
+
+/**
+ * scsi_internal_device_unblock - resume a device after a block request
+ * @sdev:       device to resume
+ *
+ * Called by scsi lld's or the midlayer to restart the device queue
+ * for the previously suspended scsi device.  Called from interrupt or
+ * normal process context.
+ *
+ * Returns zero if successful or error if not.
+ *
+ * Notes:
+ *      This routine transitions the device to the SDEV_RUNNING state
+ *      (which must be a legal transition) allowing the midlayer to
+ *      goose the queue for this device.  This routine assumes the
+ *      host_lock is held upon entry.
+ **/
+int
+scsi_internal_device_unblock(struct scsi_device *sdev)
+{
+        request_queue_t *q = sdev->request_queue;
+        int err;
+        unsigned long flags;
+
+
+        /*
+         * Try to transition the scsi device to SDEV_RUNNING
+         * and goose the device queue if successful.
+         */
+        err = scsi_device_set_state(sdev, SDEV_RUNNING);
+        if (err)
+		return err;
+                
+        spin_lock_irqsave(q->queue_lock, flags);
+        blk_start_queue(q);
+        spin_unlock_irqrestore(q->queue_lock, flags);
+
+        return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_internal_device_unblock);
+
+static void
+device_block(struct scsi_device *sdev, void *data)
+{
+        scsi_internal_device_block(sdev);
+}
+
+static int
+target_block(struct device *dev, void *data)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_block);
+
+        return 0;
+}
+
+void
+scsi_target_block(struct device *dev)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_block);
+        else
+                device_for_each_child(dev, NULL, target_block);
+}
+EXPORT_SYMBOL_GPL(scsi_target_block);
+
+static void
+device_unblock(struct scsi_device *sdev, void *data)
+{
+        scsi_internal_device_unblock(sdev);
+}
+
+static int
+target_unblock(struct device *dev, void *data)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_unblock);
+        return 0;
+}
+
+void
+scsi_target_unblock(struct device *dev)
+{
+        if (scsi_is_target_device(dev))
+                starget_for_each_device(to_scsi_target(dev), NULL,
+                                        device_unblock);
+        else
+                device_for_each_child(dev, NULL, target_unblock);
+}
+EXPORT_SYMBOL_GPL(scsi_target_unblock);
+
+MODULE_LICENSE("GPL");
diff --git a/kernel_addons/backport/2.6.9_U5/include/src/scsi_scan.c b/kernel_addons/backport/2.6.9_U5/include/src/scsi_scan.c
new file mode 100644
index 0000000..b7b7674
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/src/scsi_scan.c
@@ -0,0 +1,48 @@
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/spinlock.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_driver.h>
+#include <scsi/scsi_devinfo.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_transport.h>
+#include <scsi/scsi_eh.h>
+
+/**
+ * int_to_scsilun: reverts an int into a scsi_lun
+ * @lun:        integer to be reverted
+ * @scsilun:    struct scsi_lun to be set.
+ *
+ * Description:
+ *     Reverts the functionality of the scsilun_to_int, which packed
+ *     an 8-byte lun value into an int. This routine unpacks the int
+ *     back into the lun value.
+ *     Note: the scsilun_to_int() routine does not truly handle all
+ *     8 bytes of the lun value. This function restores only as much
+ *     as was set by the routine.
+ *
+ * Notes:
+ *     Given the integer 0x0b030a04, this function sets @scsilun to
+ *     a struct scsi_lun of: 0a 04 0b 03 00 00 00 00
+ *
+ **/
+void int_to_scsilun(unsigned int lun, struct scsi_lun *scsilun)
+{
+        int i;
+
+        memset(scsilun->scsi_lun, 0, sizeof(scsilun->scsi_lun));
+
+        for (i = 0; i < sizeof(lun); i += 2) {
+                scsilun->scsi_lun[i] = (lun >> 8) & 0xFF;
+                scsilun->scsi_lun[i+1] = lun & 0xFF;
+                lun = lun >> 16;
+        }
+}
+EXPORT_SYMBOL(int_to_scsilun);
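
The byte order produced by int_to_scsilun() can be checked outside the
kernel. The following is a minimal user-space sketch, not part of the
patch: struct scsi_lun_demo and int_to_scsilun_demo() are hypothetical
stand-ins for the kernel's struct scsi_lun and int_to_scsilun(), and a
4-byte unsigned int is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's 8-byte struct scsi_lun. */
struct scsi_lun_demo {
	unsigned char scsi_lun[8];
};

/*
 * Same loop as int_to_scsilun() in the patch: each 16-bit half of
 * the integer is stored high byte first, lowest half first.
 */
static void int_to_scsilun_demo(unsigned int lun,
				struct scsi_lun_demo *scsilun)
{
	size_t i;

	/* The loop writes sizeof(lun) bytes; they must fit in the LUN. */
	assert(sizeof(lun) <= sizeof(scsilun->scsi_lun));

	memset(scsilun->scsi_lun, 0, sizeof(scsilun->scsi_lun));

	for (i = 0; i < sizeof(lun); i += 2) {
		scsilun->scsi_lun[i] = (lun >> 8) & 0xFF;
		scsilun->scsi_lun[i + 1] = lun & 0xFF;
		lun = lun >> 16;
	}
}
```

Calling int_to_scsilun_demo(0x0b030a04, &l) fills l.scsi_lun with
0a 04 0b 03 00 00 00 00, matching the example in the comment above.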
diff --git a/kernel_addons/backport/2.6.9_U5/include/src/transport_class.c b/kernel_addons/backport/2.6.9_U5/include/src/transport_class.c
new file mode 100644
index 0000000..f25e7c6
--- /dev/null
+++ b/kernel_addons/backport/2.6.9_U5/include/src/transport_class.c
@@ -0,0 +1,280 @@
+/*
+ * transport_class.c - implementation of generic transport classes
+ *                     using attribute_containers
+ *
+ * Copyright (c) 2005 - James Bottomley <James.Bottomley at steeleye.com>
+ *
+ * This file is licensed under GPLv2
+ *
+ * The basic idea here is to allow any "device controller" (which
+ * would most often be a Host Bus Adapter) to use the services of one
+ * or more transport classes for performing transport specific
+ * services.  Transport specific services are things that the generic
+ * command layer doesn't want to know about (speed settings, line
+ * conditioning, etc), but which the user might be interested in.
+ * Thus, the HBAs use the routines exported by the transport classes
+ * to perform these functions.  The transport classes export certain
+ * values to the user via sysfs using attribute containers.
+ *
+ * Note: because not every HBA will care about every transport
+ * attribute, there's a many to one relationship that goes like this:
+ *
+ * transport class<-----attribute container<----class device
+ *
+ * Usually the attribute container is per-HBA, but the design doesn't
+ * mandate that.  Although most of the services will be specific to
+ * the actual external storage connection used by the HBA, the generic
+ * transport class is framed entirely in terms of generic devices to
+ * allow it to be used by any physical HBA in the system.
+ */
+#include <linux/attribute_container.h>
+#include <linux/transport_class.h>
+
+/**
+ * transport_class_register - register an initial transport class
+ *
+ * @tclass:	a pointer to the transport class structure to be initialised
+ *
+ * The transport class contains an embedded class which is used to
+ * identify it.  The caller should initialise this structure with
+ * zeros and then initialise the embedded generic class with the
+ * actual transport class unique name.  There's a macro
+ * DECLARE_TRANSPORT_CLASS() to do this (declared classes still must
+ * be registered).
+ *
+ * Returns 0 on success or error on failure.
+ */
+int transport_class_register(struct transport_class *tclass)
+{
+	return class_register(&tclass->class);
+}
+EXPORT_SYMBOL_GPL(transport_class_register);
+
+/**
+ * transport_class_unregister - unregister a previously registered class
+ *
+ * @tclass: The transport class to unregister
+ *
+ * Must be called prior to deallocating the memory for the transport
+ * class.
+ */
+void transport_class_unregister(struct transport_class *tclass)
+{
+	class_unregister(&tclass->class);
+}
+EXPORT_SYMBOL_GPL(transport_class_unregister);
+
+static int anon_transport_dummy_function(struct transport_container *tc,
+					 struct device *dev,
+					 struct class_device *cdev)
+{
+	/* do nothing */
+	return 0;
+}
+
+/**
+ * anon_transport_class_register - register an anonymous class
+ *
+ * @atc: The anon transport class to register
+ *
+ * The anonymous transport class contains both a transport class and a
+ * container.  The idea of an anonymous class is that it never
+ * actually has any device attributes associated with it (and thus
+ * saves on container storage).  So it can only be used for triggering
+ * events.  Zero the storage, then use DECLARE_ANON_TRANSPORT_CLASS() to
+ * initialise the anon transport class storage.
+ */
+int anon_transport_class_register(struct anon_transport_class *atc)
+{
+	int error;
+	atc->container.class = &atc->tclass.class;
+	attribute_container_set_no_classdevs(&atc->container);
+	error = attribute_container_register(&atc->container);
+	if (error)
+		return error;
+	atc->tclass.setup = anon_transport_dummy_function;
+	atc->tclass.remove = anon_transport_dummy_function;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(anon_transport_class_register);
+
+/**
+ * anon_transport_class_unregister - unregister an anon class
+ *
+ * @atc: Pointer to the anon transport class to unregister
+ *
+ * Must be called prior to deallocating the memory for the anon
+ * transport class.
+ */
+void anon_transport_class_unregister(struct anon_transport_class *atc)
+{
+	attribute_container_unregister(&atc->container);
+}
+EXPORT_SYMBOL_GPL(anon_transport_class_unregister);
+
+static int transport_setup_classdev(struct attribute_container *cont,
+				    struct device *dev,
+				    struct class_device *classdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+	struct transport_container *tcont = attribute_container_to_transport_container(cont);
+
+	if (tclass->setup)
+		tclass->setup(tcont, dev, classdev);
+
+	return 0;
+}
+
+/**
+ * transport_setup_device - declare a new dev for transport class association
+ *			    but don't make it visible yet.
+ *
+ * @dev: the generic device representing the entity being added
+ *
+ * Usually, dev represents some component in the HBA system (either
+ * the HBA itself or a device remote across the HBA bus).  This
+ * routine is simply a trigger point to see if any set of transport
+ * classes wishes to associate with the added device.  This allocates
+ * storage for the class device and initialises it, but does not yet
+ * add it to the system or add attributes to it (you do this with
+ * transport_add_device).  If you have no need for separate setup
+ * and add operations, use transport_register_device (see
+ * transport_class.h).
+ */
+
+void transport_setup_device(struct device *dev)
+{
+	attribute_container_add_device(dev, transport_setup_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_setup_device);
+
+static int transport_add_class_device(struct attribute_container *cont,
+				      struct device *dev,
+				      struct class_device *classdev)
+{
+	int error = attribute_container_add_class_device(classdev);
+	struct transport_container *tcont = 
+		attribute_container_to_transport_container(cont);
+
+	if (!error && tcont->statistics)
+		error = sysfs_create_group(&classdev->kobj, tcont->statistics);
+
+	return error;
+}
+
+
+/**
+ * transport_add_device - declare a new dev for transport class association
+ *
+ * @dev: the generic device representing the entity being added
+ *
+ * Usually, dev represents some component in the HBA system (either
+ * the HBA itself or a device remote across the HBA bus).  This
+ * routine is simply a trigger point used to add the device to the
+ * system and register attributes for it.
+ */
+
+void transport_add_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_add_class_device);
+}
+EXPORT_SYMBOL_GPL(transport_add_device);
+
+static int transport_configure(struct attribute_container *cont,
+			       struct device *dev,
+			       struct class_device *cdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+	struct transport_container *tcont = attribute_container_to_transport_container(cont);
+
+	if (tclass->configure)
+		tclass->configure(tcont, dev, cdev);
+
+	return 0;
+}
+
+/**
+ * transport_configure_device - configure an already set up device
+ *
+ * @dev: generic device representing device to be configured
+ *
+ * The idea of configure is simply to provide a point within the setup
+ * process to allow the transport class to extract information from a
+ * device after it has been set up.  This is used in SCSI because we
+ * have to have a setup device to begin using the HBA, but after we
+ * send the initial inquiry, we use configure to extract the device
+ * parameters.  The device need not have been added to be configured.
+ */
+void transport_configure_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_configure);
+}
+EXPORT_SYMBOL_GPL(transport_configure_device);
+
+static int transport_remove_classdev(struct attribute_container *cont,
+				     struct device *dev,
+				     struct class_device *classdev)
+{
+	struct transport_container *tcont = 
+		attribute_container_to_transport_container(cont);
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+
+	if (tclass->remove)
+		tclass->remove(tcont, dev, classdev);
+
+	if (tclass->remove != anon_transport_dummy_function) {
+		if (tcont->statistics)
+			sysfs_remove_group(&classdev->kobj, tcont->statistics);
+		attribute_container_class_device_del(classdev);
+	}
+
+	return 0;
+}
+
+
+/**
+ * transport_remove_device - remove the visibility of a device
+ *
+ * @dev: generic device to remove
+ *
+ * This call removes the visibility of the device (to the user from
+ * sysfs), but does not destroy it.  To eliminate a device entirely
+ * you must also call transport_destroy_device.  If you don't need to
+ * do remove and destroy as separate operations, use
+ * transport_unregister_device() (see transport_class.h) which will
+ * perform both calls for you.
+ */
+void transport_remove_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_remove_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_remove_device);
+
+static void transport_destroy_classdev(struct attribute_container *cont,
+				      struct device *dev,
+				      struct class_device *classdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+
+	if (tclass->remove != anon_transport_dummy_function)
+		class_device_put(classdev);
+}
+
+
+/**
+ * transport_destroy_device - destroy a removed device
+ *
+ * @dev: device to eliminate from the transport class.
+ *
+ * This call triggers the elimination of storage associated with the
+ * transport classdev.  Note: all it really does is relinquish a
+ * reference to the classdev.  The memory will not be freed until the
+ * last reference goes to zero.  Note also that the classdev retains a
+ * reference count on dev, so dev too will remain for as long as the
+ * transport class device remains around.
+ */
+void transport_destroy_device(struct device *dev)
+{
+	attribute_container_remove_device(dev, transport_destroy_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_destroy_device);
diff --git a/kernel_patches/backport/2.6.16_sles10/iscsi_scsi_makefile.patch b/kernel_patches/backport/2.6.16_sles10/iscsi_scsi_makefile.patch
new file mode 100644
index 0000000..30b6f0e
--- /dev/null
+++ b/kernel_patches/backport/2.6.16_sles10/iscsi_scsi_makefile.patch
@@ -0,0 +1,9 @@
+diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile
+--- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile  1970-01-01 02:00:00.000000000 +0200
++++ ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile       2006-12-28 17:01:22.000000000 +0200
+@@ -0,0 +1,5 @@
++obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
++obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
++
++scsi_transport_iscsi-y := scsi_transport_iscsi_f.o
++libiscsi-y             := libiscsi_f.o
diff --git a/kernel_patches/backport/2.6.16_sles10_sp1/iscsi_scsi_makefile.patch b/kernel_patches/backport/2.6.16_sles10_sp1/iscsi_scsi_makefile.patch
new file mode 100644
index 0000000..30b6f0e
--- /dev/null
+++ b/kernel_patches/backport/2.6.16_sles10_sp1/iscsi_scsi_makefile.patch
@@ -0,0 +1,9 @@
+diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile
+--- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile  1970-01-01 02:00:00.000000000 +0200
++++ ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile       2006-12-28 17:01:22.000000000 +0200
+@@ -0,0 +1,5 @@
++obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
++obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
++
++scsi_transport_iscsi-y := scsi_transport_iscsi_f.o
++libiscsi-y             := libiscsi_f.o
diff --git a/kernel_patches/backport/2.6.18/iscsi_scsi_makefile.patch b/kernel_patches/backport/2.6.18/iscsi_scsi_makefile.patch
new file mode 100644
index 0000000..30b6f0e
--- /dev/null
+++ b/kernel_patches/backport/2.6.18/iscsi_scsi_makefile.patch
@@ -0,0 +1,9 @@
+diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile
+--- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile  1970-01-01 02:00:00.000000000 +0200
++++ ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile       2006-12-28 17:01:22.000000000 +0200
+@@ -0,0 +1,5 @@
++obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
++obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
++
++scsi_transport_iscsi-y := scsi_transport_iscsi_f.o
++libiscsi-y             := libiscsi_f.o
diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch
new file mode 100644
index 0000000..a339163
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch
@@ -0,0 +1,270 @@
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.c linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.c
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.c	2007-05-17 16:55:43.000000000 +0300
+@@ -676,7 +676,7 @@ iscsi_tcp_copy(struct iscsi_conn *conn, 
+ }
+ 
+ static inline void
+-partial_sg_digest_update(struct hash_desc *desc, struct scatterlist *sg,
++partial_sg_digest_update(struct crypto_tfm *desc, struct scatterlist *sg,
+ 			 int offset, int length)
+ {
+ 	struct scatterlist temp;
+@@ -684,7 +684,7 @@ partial_sg_digest_update(struct hash_des
+ 	memcpy(&temp, sg, sizeof(struct scatterlist));
+ 	temp.offset = offset;
+ 	temp.length = length;
+-	crypto_hash_update(desc, &temp, length);
++	crypto_hash_update(&desc, &temp, length);
+ }
+ 
+ static void
+@@ -1774,22 +1774,18 @@ iscsi_tcp_conn_create(struct iscsi_cls_s
+ 	/* initial operational parameters */
+ 	tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
+ 
+-	tcp_conn->tx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-						  CRYPTO_ALG_ASYNC);
+-	tcp_conn->tx_hash.flags = 0;
+-	if (IS_ERR(tcp_conn->tx_hash.tfm))
++	tcp_conn->tx_hash = crypto_alloc_tfm("crc32c", 0);
++	if (!tcp_conn->tx_hash)
+ 		goto free_tcp_conn;
+ 
+-	tcp_conn->rx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-						  CRYPTO_ALG_ASYNC);
+-	tcp_conn->rx_hash.flags = 0;
+-	if (IS_ERR(tcp_conn->rx_hash.tfm))
++	tcp_conn->rx_hash = crypto_alloc_tfm("crc32c", 0);
++	if (!tcp_conn->rx_hash)
+ 		goto free_tx_tfm;
+ 
+ 	return cls_conn;
+ 
+ free_tx_tfm:
+-	crypto_free_hash(tcp_conn->tx_hash.tfm);
++	crypto_free_tfm(tcp_conn->tx_hash);
+ free_tcp_conn:
+ 	kfree(tcp_conn);
+ tcp_conn_alloc_fail:
+@@ -1823,10 +1819,10 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_
+ 	iscsi_tcp_release_conn(conn);
+ 	iscsi_conn_teardown(cls_conn);
+ 
+-	if (tcp_conn->tx_hash.tfm)
+-		crypto_free_hash(tcp_conn->tx_hash.tfm);
+-	if (tcp_conn->rx_hash.tfm)
+-		crypto_free_hash(tcp_conn->rx_hash.tfm);
++	if (tcp_conn->tx_hash)
++		crypto_free_tfm(tcp_conn->tx_hash);
++	if (tcp_conn->rx_hash)
++		crypto_free_tfm(tcp_conn->rx_hash);
+ 
+ 	kfree(tcp_conn);
+ }
+@@ -2017,7 +2013,7 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
+ {
+ 	struct iscsi_conn *conn = cls_conn->dd_data;
+ 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+-	struct inet_sock *inet;
++	struct inet_opt *inet;
+ 	struct ipv6_pinfo *np;
+ 	struct sock *sk;
+ 	int len;
+@@ -2135,7 +2131,6 @@ static void iscsi_tcp_session_destroy(st
+ static struct scsi_host_template iscsi_sht = {
+ 	.name			= "iSCSI Initiator over TCP/IP",
+ 	.queuecommand           = iscsi_queuecommand,
+-	.change_queue_depth	= iscsi_change_queue_depth,
+ 	.can_queue		= ISCSI_XMIT_CMDS_MAX - 1,
+ 	.sg_tablesize		= ISCSI_SG_TABLESIZE,
+ 	.cmd_per_lun		= ISCSI_DEF_CMD_PER_LUN,
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.h linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.h
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.h	2007-05-17 16:38:14.000000000 +0300
+@@ -49,7 +49,6 @@
+ #define ISCSI_SG_TABLESIZE		SG_ALL
+ #define ISCSI_TCP_MAX_CMD_LEN		16
+ 
+-struct crypto_hash;
+ struct socket;
+ 
+ /* Socket connection recieve helper */
+@@ -93,8 +92,8 @@ struct iscsi_tcp_conn {
+ 	void			(*old_write_space)(struct sock *);
+ 
+ 	/* data and header digests */
+-	struct hash_desc	tx_hash;	/* CRC32C (Tx) */
+-	struct hash_desc	rx_hash;	/* CRC32C (Rx) */
++	struct crypto_tfm	*tx_hash;	/* CRC32C (Tx) */
++	struct crypto_tfm	*rx_hash;	/* CRC32C (Rx) */
+ 
+ 	/* MIB custom statistics */
+ 	uint32_t		sendpage_failures_cnt;
+diff -rup linux-2.6.20/drivers/scsi/libiscsi.c linux-2.6.20-rh4-backport/drivers/scsi/libiscsi.c
+--- linux-2.6.20/drivers/scsi/libiscsi.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/libiscsi.c	2007-05-17 16:38:14.000000000 +0300
+@@ -1370,7 +1370,6 @@ iscsi_session_setup(struct iscsi_transpo
+ 	shost->max_lun = iscsit->max_lun;
+ 	shost->max_cmd_len = iscsit->max_cmd_len;
+ 	shost->transportt = scsit;
+-	shost->transportt->create_work_queue = 1;
+ 	*hostno = shost->host_no;
+ 
+ 	session = iscsi_hostdata(shost->hostdata);
+diff -rup linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c linux-2.6.20-rh4-backport/drivers/scsi/scsi_transport_iscsi.c
+--- linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/scsi_transport_iscsi.c	2007-05-17 16:38:14.000000000 +0300
+@@ -65,6 +65,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
+ #define cdev_to_iscsi_internal(_cdev) \
+ 	container_of(_cdev, struct iscsi_internal, cdev)
+ 
++extern int attribute_container_init(void);
++
+ static void iscsi_transport_release(struct class_device *cdev)
+ {
+ 	struct iscsi_internal *priv = cdev_to_iscsi_internal(cdev);
+@@ -80,6 +82,17 @@ static struct class iscsi_transport_clas
+ 	.release = iscsi_transport_release,
+ };
+ 
++static void iscsi_host_class_release(struct class_device *class_dev)
++{
++	struct Scsi_Host *shost = transport_class_to_shost(class_dev);
++	put_device(&shost->shost_gendev);
++}
++
++struct class iscsi_host_class = {
++	.name = "iscsi_host",
++	.release = iscsi_host_class_release,
++};
++
+ static ssize_t
+ show_transport_handle(struct class_device *cdev, char *buf)
+ {
+@@ -115,10 +128,8 @@ static struct attribute_group iscsi_tran
+ 	.attrs = iscsi_transport_attrs,
+ };
+ 
+-static int iscsi_setup_host(struct transport_container *tc, struct device *dev,
+-			    struct class_device *cdev)
++static int iscsi_setup_host(struct Scsi_Host *shost)
+ {
+-	struct Scsi_Host *shost = dev_to_shost(dev);
+ 	struct iscsi_host *ihost = shost->shost_data;
+ 
+ 	memset(ihost, 0, sizeof(*ihost));
+@@ -127,12 +138,6 @@ static int iscsi_setup_host(struct trans
+ 	return 0;
+ }
+ 
+-static DECLARE_TRANSPORT_CLASS(iscsi_host_class,
+-			       "iscsi_host",
+-			       iscsi_setup_host,
+-			       NULL,
+-			       NULL);
+-
+ static DECLARE_TRANSPORT_CLASS(iscsi_session_class,
+ 			       "iscsi_session",
+ 			       NULL,
+@@ -216,24 +221,6 @@ static int iscsi_is_session_dev(const st
+ 	return dev->release == iscsi_session_release;
+ }
+ 
+-static int iscsi_user_scan(struct Scsi_Host *shost, uint channel,
+-			   uint id, uint lun)
+-{
+-	struct iscsi_host *ihost = shost->shost_data;
+-	struct iscsi_cls_session *session;
+-
+-	mutex_lock(&ihost->mutex);
+-	list_for_each_entry(session, &ihost->sessions, host_list) {
+-		if ((channel == SCAN_WILD_CARD || channel == 0) &&
+-		    (id == SCAN_WILD_CARD || id == session->target_id))
+-			scsi_scan_target(&session->dev, 0,
+-					 session->target_id, lun, 1);
+-	}
+-	mutex_unlock(&ihost->mutex);
+-
+-	return 0;
+-}
+-
+ static void session_recovery_timedout(struct work_struct *work)
+ {
+ 	struct iscsi_cls_session *session =
+@@ -362,8 +349,6 @@ void iscsi_remove_session(struct iscsi_c
+ 	list_del(&session->host_list);
+ 	mutex_unlock(&ihost->mutex);
+ 
+-	scsi_remove_target(&session->dev);
+-
+ 	transport_unregister_device(&session->dev);
+ 	device_del(&session->dev);
+ }
+@@ -1269,24 +1254,6 @@ static int iscsi_conn_match(struct attri
+ 	return &priv->conn_cont.ac == cont;
+ }
+ 
+-static int iscsi_host_match(struct attribute_container *cont,
+-			    struct device *dev)
+-{
+-	struct Scsi_Host *shost;
+-	struct iscsi_internal *priv;
+-
+-	if (!scsi_is_host_device(dev))
+-		return 0;
+-
+-	shost = dev_to_shost(dev);
+-	if (!shost->transportt  ||
+-	    shost->transportt->host_attrs.ac.class != &iscsi_host_class.class)
+-		return 0;
+-
+-        priv = to_iscsi_internal(shost->transportt);
+-        return &priv->t.host_attrs.ac == cont;
+-}
+-
+ struct scsi_transport_template *
+ iscsi_register_transport(struct iscsi_transport *tt)
+ {
+@@ -1306,7 +1273,6 @@ iscsi_register_transport(struct iscsi_tr
+ 	INIT_LIST_HEAD(&priv->list);
+ 	priv->daemon_pid = -1;
+ 	priv->iscsi_transport = tt;
+-	priv->t.user_scan = iscsi_user_scan;
+ 
+ 	priv->cdev.class = &iscsi_transport_class;
+ 	snprintf(priv->cdev.class_id, BUS_ID_SIZE, "%s", tt->name);
+@@ -1319,12 +1285,11 @@ iscsi_register_transport(struct iscsi_tr
+ 		goto unregister_cdev;
+ 
+ 	/* host parameters */
+-	priv->t.host_attrs.ac.attrs = &priv->host_attrs[0];
+-	priv->t.host_attrs.ac.class = &iscsi_host_class.class;
+-	priv->t.host_attrs.ac.match = iscsi_host_match;
++
++	priv->t.host_attrs = &priv->host_attrs[0];
++	priv->t.host_class = &iscsi_host_class;
++	priv->t.host_setup = iscsi_setup_host;
+ 	priv->t.host_size = sizeof(struct iscsi_host);
+-	priv->host_attrs[0] = NULL;
+-	transport_container_register(&priv->t.host_attrs);
+ 
+ 	/* connection parameters */
+ 	priv->conn_cont.ac.attrs = &priv->conn_attrs[0];
+@@ -1402,7 +1367,6 @@ int iscsi_unregister_transport(struct is
+ 
+ 	transport_container_unregister(&priv->conn_cont);
+ 	transport_container_unregister(&priv->session_cont);
+-	transport_container_unregister(&priv->t.host_attrs);
+ 
+ 	sysfs_remove_group(&priv->cdev.kobj, &iscsi_transport_group);
+ 	class_device_unregister(&priv->cdev);
+@@ -1419,6 +1383,8 @@ static __init int iscsi_transport_init(v
+ 	printk(KERN_INFO "Loading iSCSI transport class v%s.\n",
+ 		ISCSI_TRANSPORT_VERSION);
+ 
++	attribute_container_init();
++
+ 	err = class_register(&iscsi_transport_class);
+ 	if (err)
+ 		return err;
diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch
new file mode 100644
index 0000000..21715fd
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch
@@ -0,0 +1,35 @@
+diff -rup linux-2.6.20/include/scsi/iscsi_if.h linux-2.6.20-rh4-backport/include/scsi/iscsi_if.h
+--- linux-2.6.20/include/scsi/iscsi_if.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/iscsi_if.h	2007-05-15 08:49:53.000000000 +0300
+@@ -277,7 +277,6 @@ enum iscsi_param {
+  * These flags describes reason of stop_conn() call
+  */
+ #define STOP_CONN_TERM		0x1
+-#define STOP_CONN_SUSPEND	0x2
+ #define STOP_CONN_RECOVER	0x3
+ 
+ #define ISCSI_STATS_CUSTOM_MAX		32
+diff -rup linux-2.6.20/include/scsi/libiscsi.h linux-2.6.20-rh4-backport/include/scsi/libiscsi.h
+--- linux-2.6.20/include/scsi/libiscsi.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/libiscsi.h	2007-05-15 08:54:49.000000000 +0300
+@@ -25,8 +25,6 @@
+ 
+ #include <linux/types.h>
+ #include <linux/mutex.h>
+-#include <linux/timer.h>
+-#include <linux/workqueue.h>
+ #include <scsi/iscsi_proto.h>
+ #include <scsi/iscsi_if.h>
+ 
+diff -rup linux-2.6.20/include/scsi/scsi_transport_iscsi.h linux-2.6.20-rh4-backport/include/scsi/scsi_transport_iscsi.h
+--- linux-2.6.20/include/scsi/scsi_transport_iscsi.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/scsi_transport_iscsi.h	2007-05-15 08:54:24.000000000 +0300
+@@ -24,7 +24,7 @@
+ #define SCSI_TRANSPORT_ISCSI_H
+ 
+ #include <linux/device.h>
+-#include <scsi/iscsi_if.h>
++#include "iscsi_if.h"
+ 
+ struct scsi_transport_template;
+ struct iscsi_transport;
diff --git a/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch
new file mode 100644
index 0000000..3c2a969
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch
@@ -0,0 +1,13 @@
+--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:13:43.000000000 +0200
++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:14:31.000000000 +0200
+@@ -70,9 +70,8 @@
+ #include <scsi/scsi_tcq.h>
+ #include <scsi/scsi_host.h>
+ #include <scsi/scsi.h>
+-#include <scsi/scsi_transport_iscsi.h>
+-
+ #include "iscsi_iser.h"
++#include <scsi/scsi_transport_iscsi.h>
+ 
+ static unsigned int iscsi_max_lun = 512;
+ module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO);
diff --git a/kernel_patches/backport/2.6.9_U3/iscsi_scsi_addons.patch b/kernel_patches/backport/2.6.9_U3/iscsi_scsi_addons.patch
new file mode 100644
index 0000000..ffa0598
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/iscsi_scsi_addons.patch
@@ -0,0 +1,65 @@
+diff --git a/drivers/scsi/init.c b/drivers/scsi/init.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/init.c
+@@ -0,0 +1 @@
++#include "src/init.c"
+diff --git a/drivers/scsi/attribute_container.c b/drivers/scsi/attribute_container.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/attribute_container.c
+@@ -0,0 +1 @@
++#include "src/attribute_container.c"
+diff --git a/drivers/scsi/transport_class.c b/drivers/scsi/transport_class.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/transport_class.c
+@@ -0,0 +1 @@
++#include "src/transport_class.c"
+diff --git a/drivers/scsi/klist.c b/drivers/scsi/klist.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/klist.c
+@@ -0,0 +1 @@
++#include "src/klist.c"
+diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi.c
+@@ -0,0 +1 @@
++#include "src/scsi.c"
+diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi_lib.c
+@@ -0,0 +1 @@
++#include "src/scsi_lib.c"
+diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi_scan.c
+@@ -0,0 +1 @@
++#include "src/scsi_scan.c"
+diff --git a/drivers/scsi/kref_new.c b/drivers/scsi/kref_new.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/kref_new.c
+@@ -0,0 +1 @@
++#include "src/kref_new.c"
+diff -rupN ofa_kernel-1.2/drivers/scsi/Makefile ofa_kernel-1.2-iscsi/drivers/scsi/Makefile
+--- ofa_kernel-1.2/drivers/scsi/Makefile	1970-01-01 02:00:00.000000000 +0200
++++ ofa_kernel-1.2-iscsi/drivers/scsi/Makefile	2007-05-16 14:12:22.000000000 +0300
+@@ -0,0 +1,5 @@
++obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
++obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
++
++scsi_transport_iscsi-y	:= scsi_transport_iscsi_f.o scsi.o scsi_lib.o init.o kref_new.o klist.o attribute_container.o transport_class.o
++libiscsi-y		:= libiscsi_f.o scsi_scan.o
diff --git a/kernel_patches/backport/2.6.9_U4/add_open_iscsi.patch b/kernel_patches/backport/2.6.9_U4/add_open_iscsi.patch
new file mode 100644
index 0000000..a339163
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_open_iscsi.patch
@@ -0,0 +1,270 @@
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.c linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.c
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.c	2007-05-17 16:55:43.000000000 +0300
+@@ -676,7 +676,7 @@ iscsi_tcp_copy(struct iscsi_conn *conn, 
+ }
+ 
+ static inline void
+-partial_sg_digest_update(struct hash_desc *desc, struct scatterlist *sg,
++partial_sg_digest_update(struct crypto_tfm *desc, struct scatterlist *sg,
+ 			 int offset, int length)
+ {
+ 	struct scatterlist temp;
+@@ -684,7 +684,7 @@ partial_sg_digest_update(struct hash_des
+ 	memcpy(&temp, sg, sizeof(struct scatterlist));
+ 	temp.offset = offset;
+ 	temp.length = length;
+-	crypto_hash_update(desc, &temp, length);
++	crypto_hash_update(&desc, &temp, length);
+ }
+ 
+ static void
+@@ -1774,22 +1774,18 @@ iscsi_tcp_conn_create(struct iscsi_cls_s
+ 	/* initial operational parameters */
+ 	tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
+ 
+-	tcp_conn->tx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-						  CRYPTO_ALG_ASYNC);
+-	tcp_conn->tx_hash.flags = 0;
+-	if (IS_ERR(tcp_conn->tx_hash.tfm))
++	tcp_conn->tx_hash = crypto_alloc_tfm("crc32c", 0);
++	if (!tcp_conn->tx_hash)
+ 		goto free_tcp_conn;
+ 
+-	tcp_conn->rx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-						  CRYPTO_ALG_ASYNC);
+-	tcp_conn->rx_hash.flags = 0;
+-	if (IS_ERR(tcp_conn->rx_hash.tfm))
++	tcp_conn->rx_hash = crypto_alloc_tfm("crc32c", 0);
++	if (!tcp_conn->rx_hash)
+ 		goto free_tx_tfm;
+ 
+ 	return cls_conn;
+ 
+ free_tx_tfm:
+-	crypto_free_hash(tcp_conn->tx_hash.tfm);
++	crypto_free_tfm(tcp_conn->tx_hash);
+ free_tcp_conn:
+ 	kfree(tcp_conn);
+ tcp_conn_alloc_fail:
+@@ -1823,10 +1819,10 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_
+ 	iscsi_tcp_release_conn(conn);
+ 	iscsi_conn_teardown(cls_conn);
+ 
+-	if (tcp_conn->tx_hash.tfm)
+-		crypto_free_hash(tcp_conn->tx_hash.tfm);
+-	if (tcp_conn->rx_hash.tfm)
+-		crypto_free_hash(tcp_conn->rx_hash.tfm);
++	if (tcp_conn->tx_hash)
++		crypto_free_tfm(tcp_conn->tx_hash);
++	if (tcp_conn->rx_hash)
++		crypto_free_tfm(tcp_conn->rx_hash);
+ 
+ 	kfree(tcp_conn);
+ }
+@@ -2017,7 +2013,7 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
+ {
+ 	struct iscsi_conn *conn = cls_conn->dd_data;
+ 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+-	struct inet_sock *inet;
++	struct inet_opt *inet;
+ 	struct ipv6_pinfo *np;
+ 	struct sock *sk;
+ 	int len;
+@@ -2135,7 +2131,6 @@ static void iscsi_tcp_session_destroy(st
+ static struct scsi_host_template iscsi_sht = {
+ 	.name			= "iSCSI Initiator over TCP/IP",
+ 	.queuecommand           = iscsi_queuecommand,
+-	.change_queue_depth	= iscsi_change_queue_depth,
+ 	.can_queue		= ISCSI_XMIT_CMDS_MAX - 1,
+ 	.sg_tablesize		= ISCSI_SG_TABLESIZE,
+ 	.cmd_per_lun		= ISCSI_DEF_CMD_PER_LUN,
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.h linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.h
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.h	2007-05-17 16:38:14.000000000 +0300
+@@ -49,7 +49,6 @@
+ #define ISCSI_SG_TABLESIZE		SG_ALL
+ #define ISCSI_TCP_MAX_CMD_LEN		16
+ 
+-struct crypto_hash;
+ struct socket;
+ 
+ /* Socket connection recieve helper */
+@@ -93,8 +92,8 @@ struct iscsi_tcp_conn {
+ 	void			(*old_write_space)(struct sock *);
+ 
+ 	/* data and header digests */
+-	struct hash_desc	tx_hash;	/* CRC32C (Tx) */
+-	struct hash_desc	rx_hash;	/* CRC32C (Rx) */
++	struct crypto_tfm	*tx_hash;	/* CRC32C (Tx) */
++	struct crypto_tfm	*rx_hash;	/* CRC32C (Rx) */
+ 
+ 	/* MIB custom statistics */
+ 	uint32_t		sendpage_failures_cnt;
+diff -rup linux-2.6.20/drivers/scsi/libiscsi.c linux-2.6.20-rh4-backport/drivers/scsi/libiscsi.c
+--- linux-2.6.20/drivers/scsi/libiscsi.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/libiscsi.c	2007-05-17 16:38:14.000000000 +0300
+@@ -1370,7 +1370,6 @@ iscsi_session_setup(struct iscsi_transpo
+ 	shost->max_lun = iscsit->max_lun;
+ 	shost->max_cmd_len = iscsit->max_cmd_len;
+ 	shost->transportt = scsit;
+-	shost->transportt->create_work_queue = 1;
+ 	*hostno = shost->host_no;
+ 
+ 	session = iscsi_hostdata(shost->hostdata);
+diff -rup linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c linux-2.6.20-rh4-backport/drivers/scsi/scsi_transport_iscsi.c
+--- linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/scsi_transport_iscsi.c	2007-05-17 16:38:14.000000000 +0300
+@@ -65,6 +65,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
+ #define cdev_to_iscsi_internal(_cdev) \
+ 	container_of(_cdev, struct iscsi_internal, cdev)
+ 
++extern int attribute_container_init(void);
++
+ static void iscsi_transport_release(struct class_device *cdev)
+ {
+ 	struct iscsi_internal *priv = cdev_to_iscsi_internal(cdev);
+@@ -80,6 +82,17 @@ static struct class iscsi_transport_clas
+ 	.release = iscsi_transport_release,
+ };
+ 
++static void iscsi_host_class_release(struct class_device *class_dev)
++{
++	struct Scsi_Host *shost = transport_class_to_shost(class_dev);
++	put_device(&shost->shost_gendev);
++}
++
++struct class iscsi_host_class = {
++	.name = "iscsi_host",
++	.release = iscsi_host_class_release,
++};
++
+ static ssize_t
+ show_transport_handle(struct class_device *cdev, char *buf)
+ {
+@@ -115,10 +128,8 @@ static struct attribute_group iscsi_tran
+ 	.attrs = iscsi_transport_attrs,
+ };
+ 
+-static int iscsi_setup_host(struct transport_container *tc, struct device *dev,
+-			    struct class_device *cdev)
++static int iscsi_setup_host(struct Scsi_Host *shost)
+ {
+-	struct Scsi_Host *shost = dev_to_shost(dev);
+ 	struct iscsi_host *ihost = shost->shost_data;
+ 
+ 	memset(ihost, 0, sizeof(*ihost));
+@@ -127,12 +138,6 @@ static int iscsi_setup_host(struct trans
+ 	return 0;
+ }
+ 
+-static DECLARE_TRANSPORT_CLASS(iscsi_host_class,
+-			       "iscsi_host",
+-			       iscsi_setup_host,
+-			       NULL,
+-			       NULL);
+-
+ static DECLARE_TRANSPORT_CLASS(iscsi_session_class,
+ 			       "iscsi_session",
+ 			       NULL,
+@@ -216,24 +221,6 @@ static int iscsi_is_session_dev(const st
+ 	return dev->release == iscsi_session_release;
+ }
+ 
+-static int iscsi_user_scan(struct Scsi_Host *shost, uint channel,
+-			   uint id, uint lun)
+-{
+-	struct iscsi_host *ihost = shost->shost_data;
+-	struct iscsi_cls_session *session;
+-
+-	mutex_lock(&ihost->mutex);
+-	list_for_each_entry(session, &ihost->sessions, host_list) {
+-		if ((channel == SCAN_WILD_CARD || channel == 0) &&
+-		    (id == SCAN_WILD_CARD || id == session->target_id))
+-			scsi_scan_target(&session->dev, 0,
+-					 session->target_id, lun, 1);
+-	}
+-	mutex_unlock(&ihost->mutex);
+-
+-	return 0;
+-}
+-
+ static void session_recovery_timedout(struct work_struct *work)
+ {
+ 	struct iscsi_cls_session *session =
+@@ -362,8 +349,6 @@ void iscsi_remove_session(struct iscsi_c
+ 	list_del(&session->host_list);
+ 	mutex_unlock(&ihost->mutex);
+ 
+-	scsi_remove_target(&session->dev);
+-
+ 	transport_unregister_device(&session->dev);
+ 	device_del(&session->dev);
+ }
+@@ -1269,24 +1254,6 @@ static int iscsi_conn_match(struct attri
+ 	return &priv->conn_cont.ac == cont;
+ }
+ 
+-static int iscsi_host_match(struct attribute_container *cont,
+-			    struct device *dev)
+-{
+-	struct Scsi_Host *shost;
+-	struct iscsi_internal *priv;
+-
+-	if (!scsi_is_host_device(dev))
+-		return 0;
+-
+-	shost = dev_to_shost(dev);
+-	if (!shost->transportt  ||
+-	    shost->transportt->host_attrs.ac.class != &iscsi_host_class.class)
+-		return 0;
+-
+-        priv = to_iscsi_internal(shost->transportt);
+-        return &priv->t.host_attrs.ac == cont;
+-}
+-
+ struct scsi_transport_template *
+ iscsi_register_transport(struct iscsi_transport *tt)
+ {
+@@ -1306,7 +1273,6 @@ iscsi_register_transport(struct iscsi_tr
+ 	INIT_LIST_HEAD(&priv->list);
+ 	priv->daemon_pid = -1;
+ 	priv->iscsi_transport = tt;
+-	priv->t.user_scan = iscsi_user_scan;
+ 
+ 	priv->cdev.class = &iscsi_transport_class;
+ 	snprintf(priv->cdev.class_id, BUS_ID_SIZE, "%s", tt->name);
+@@ -1319,12 +1285,11 @@ iscsi_register_transport(struct iscsi_tr
+ 		goto unregister_cdev;
+ 
+ 	/* host parameters */
+-	priv->t.host_attrs.ac.attrs = &priv->host_attrs[0];
+-	priv->t.host_attrs.ac.class = &iscsi_host_class.class;
+-	priv->t.host_attrs.ac.match = iscsi_host_match;
++
++	priv->t.host_attrs = &priv->host_attrs[0];
++	priv->t.host_class = &iscsi_host_class;
++	priv->t.host_setup = iscsi_setup_host;
+ 	priv->t.host_size = sizeof(struct iscsi_host);
+-	priv->host_attrs[0] = NULL;
+-	transport_container_register(&priv->t.host_attrs);
+ 
+ 	/* connection parameters */
+ 	priv->conn_cont.ac.attrs = &priv->conn_attrs[0];
+@@ -1402,7 +1367,6 @@ int iscsi_unregister_transport(struct is
+ 
+ 	transport_container_unregister(&priv->conn_cont);
+ 	transport_container_unregister(&priv->session_cont);
+-	transport_container_unregister(&priv->t.host_attrs);
+ 
+ 	sysfs_remove_group(&priv->cdev.kobj, &iscsi_transport_group);
+ 	class_device_unregister(&priv->cdev);
+@@ -1419,6 +1383,8 @@ static __init int iscsi_transport_init(v
+ 	printk(KERN_INFO "Loading iSCSI transport class v%s.\n",
+ 		ISCSI_TRANSPORT_VERSION);
+ 
++	attribute_container_init();
++
+ 	err = class_register(&iscsi_transport_class);
+ 	if (err)
+ 		return err;
diff --git a/kernel_patches/backport/2.6.9_U4/add_open_iscsi_h.patch b/kernel_patches/backport/2.6.9_U4/add_open_iscsi_h.patch
new file mode 100644
index 0000000..21715fd
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_open_iscsi_h.patch
@@ -0,0 +1,35 @@
+diff -rup linux-2.6.20/include/scsi/iscsi_if.h linux-2.6.20-rh4-backport/include/scsi/iscsi_if.h
+--- linux-2.6.20/include/scsi/iscsi_if.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/iscsi_if.h	2007-05-15 08:49:53.000000000 +0300
+@@ -277,7 +277,6 @@ enum iscsi_param {
+  * These flags describes reason of stop_conn() call
+  */
+ #define STOP_CONN_TERM		0x1
+-#define STOP_CONN_SUSPEND	0x2
+ #define STOP_CONN_RECOVER	0x3
+ 
+ #define ISCSI_STATS_CUSTOM_MAX		32
+diff -rup linux-2.6.20/include/scsi/libiscsi.h linux-2.6.20-rh4-backport/include/scsi/libiscsi.h
+--- linux-2.6.20/include/scsi/libiscsi.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/libiscsi.h	2007-05-15 08:54:49.000000000 +0300
+@@ -25,8 +25,6 @@
+ 
+ #include <linux/types.h>
+ #include <linux/mutex.h>
+-#include <linux/timer.h>
+-#include <linux/workqueue.h>
+ #include <scsi/iscsi_proto.h>
+ #include <scsi/iscsi_if.h>
+ 
+diff -rup linux-2.6.20/include/scsi/scsi_transport_iscsi.h linux-2.6.20-rh4-backport/include/scsi/scsi_transport_iscsi.h
+--- linux-2.6.20/include/scsi/scsi_transport_iscsi.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/scsi_transport_iscsi.h	2007-05-15 08:54:24.000000000 +0300
+@@ -24,7 +24,7 @@
+ #define SCSI_TRANSPORT_ISCSI_H
+ 
+ #include <linux/device.h>
+-#include <scsi/iscsi_if.h>
++#include "iscsi_if.h"
+ 
+ struct scsi_transport_template;
+ struct iscsi_transport;
diff --git a/kernel_patches/backport/2.6.9_U4/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U4/fix_inclusion_order_iscsi_iser.patch
new file mode 100644
index 0000000..3c2a969
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/fix_inclusion_order_iscsi_iser.patch
@@ -0,0 +1,13 @@
+--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:13:43.000000000 +0200
++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:14:31.000000000 +0200
+@@ -70,9 +70,8 @@
+ #include <scsi/scsi_tcq.h>
+ #include <scsi/scsi_host.h>
+ #include <scsi/scsi.h>
+-#include <scsi/scsi_transport_iscsi.h>
+-
+ #include "iscsi_iser.h"
++#include <scsi/scsi_transport_iscsi.h>
+ 
+ static unsigned int iscsi_max_lun = 512;
+ module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO);
diff --git a/kernel_patches/backport/2.6.9_U4/iscsi_scsi_addons.patch b/kernel_patches/backport/2.6.9_U4/iscsi_scsi_addons.patch
new file mode 100644
index 0000000..ffa0598
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/iscsi_scsi_addons.patch
@@ -0,0 +1,65 @@
+diff --git a/drivers/scsi/init.c b/drivers/scsi/init.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/init.c
+@@ -0,0 +1 @@
++#include "src/init.c"
+diff --git a/drivers/scsi/attribute_container.c b/drivers/scsi/attribute_container.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/attribute_container.c
+@@ -0,0 +1 @@
++#include "src/attribute_container.c"
+diff --git a/drivers/scsi/transport_class.c b/drivers/scsi/transport_class.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/transport_class.c
+@@ -0,0 +1 @@
++#include "src/transport_class.c"
+diff --git a/drivers/scsi/klist.c b/drivers/scsi/klist.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/klist.c
+@@ -0,0 +1 @@
++#include "src/klist.c"
+diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi.c
+@@ -0,0 +1 @@
++#include "src/scsi.c"
+diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi_lib.c
+@@ -0,0 +1 @@
++#include "src/scsi_lib.c"
+diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi_scan.c
+@@ -0,0 +1 @@
++#include "src/scsi_scan.c"
+diff --git a/drivers/scsi/kref_new.c b/drivers/scsi/kref_new.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/kref_new.c
+@@ -0,0 +1 @@
++#include "src/kref_new.c"
+diff -rupN ofa_kernel-1.2/drivers/scsi/Makefile ofa_kernel-1.2-iscsi/drivers/scsi/Makefile
+--- ofa_kernel-1.2/drivers/scsi/Makefile	1970-01-01 02:00:00.000000000 +0200
++++ ofa_kernel-1.2-iscsi/drivers/scsi/Makefile	2007-05-16 14:12:22.000000000 +0300
+@@ -0,0 +1,5 @@
++obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
++obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
++
++scsi_transport_iscsi-y	:= scsi_transport_iscsi_f.o scsi.o scsi_lib.o init.o kref_new.o klist.o attribute_container.o transport_class.o
++libiscsi-y		:= libiscsi_f.o scsi_scan.o
diff --git a/kernel_patches/backport/2.6.9_U5/add_open_iscsi.patch b/kernel_patches/backport/2.6.9_U5/add_open_iscsi.patch
new file mode 100644
index 0000000..a339163
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U5/add_open_iscsi.patch
@@ -0,0 +1,270 @@
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.c linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.c
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.c	2007-05-17 16:55:43.000000000 +0300
+@@ -676,7 +676,7 @@ iscsi_tcp_copy(struct iscsi_conn *conn, 
+ }
+ 
+ static inline void
+-partial_sg_digest_update(struct hash_desc *desc, struct scatterlist *sg,
++partial_sg_digest_update(struct crypto_tfm *desc, struct scatterlist *sg,
+ 			 int offset, int length)
+ {
+ 	struct scatterlist temp;
+@@ -684,7 +684,7 @@ partial_sg_digest_update(struct hash_des
+ 	memcpy(&temp, sg, sizeof(struct scatterlist));
+ 	temp.offset = offset;
+ 	temp.length = length;
+-	crypto_hash_update(desc, &temp, length);
++	crypto_hash_update(&desc, &temp, length);
+ }
+ 
+ static void
+@@ -1774,22 +1774,18 @@ iscsi_tcp_conn_create(struct iscsi_cls_s
+ 	/* initial operational parameters */
+ 	tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
+ 
+-	tcp_conn->tx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-						  CRYPTO_ALG_ASYNC);
+-	tcp_conn->tx_hash.flags = 0;
+-	if (IS_ERR(tcp_conn->tx_hash.tfm))
++	tcp_conn->tx_hash = crypto_alloc_tfm("crc32c", 0);
++	if (!tcp_conn->tx_hash)
+ 		goto free_tcp_conn;
+ 
+-	tcp_conn->rx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-						  CRYPTO_ALG_ASYNC);
+-	tcp_conn->rx_hash.flags = 0;
+-	if (IS_ERR(tcp_conn->rx_hash.tfm))
++	tcp_conn->rx_hash = crypto_alloc_tfm("crc32c", 0);
++	if (!tcp_conn->rx_hash)
+ 		goto free_tx_tfm;
+ 
+ 	return cls_conn;
+ 
+ free_tx_tfm:
+-	crypto_free_hash(tcp_conn->tx_hash.tfm);
++	crypto_free_tfm(tcp_conn->tx_hash);
+ free_tcp_conn:
+ 	kfree(tcp_conn);
+ tcp_conn_alloc_fail:
+@@ -1823,10 +1819,10 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_
+ 	iscsi_tcp_release_conn(conn);
+ 	iscsi_conn_teardown(cls_conn);
+ 
+-	if (tcp_conn->tx_hash.tfm)
+-		crypto_free_hash(tcp_conn->tx_hash.tfm);
+-	if (tcp_conn->rx_hash.tfm)
+-		crypto_free_hash(tcp_conn->rx_hash.tfm);
++	if (tcp_conn->tx_hash)
++		crypto_free_tfm(tcp_conn->tx_hash);
++	if (tcp_conn->rx_hash)
++		crypto_free_tfm(tcp_conn->rx_hash);
+ 
+ 	kfree(tcp_conn);
+ }
+@@ -2017,7 +2013,7 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
+ {
+ 	struct iscsi_conn *conn = cls_conn->dd_data;
+ 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+-	struct inet_sock *inet;
++	struct inet_opt *inet;
+ 	struct ipv6_pinfo *np;
+ 	struct sock *sk;
+ 	int len;
+@@ -2135,7 +2131,6 @@ static void iscsi_tcp_session_destroy(st
+ static struct scsi_host_template iscsi_sht = {
+ 	.name			= "iSCSI Initiator over TCP/IP",
+ 	.queuecommand           = iscsi_queuecommand,
+-	.change_queue_depth	= iscsi_change_queue_depth,
+ 	.can_queue		= ISCSI_XMIT_CMDS_MAX - 1,
+ 	.sg_tablesize		= ISCSI_SG_TABLESIZE,
+ 	.cmd_per_lun		= ISCSI_DEF_CMD_PER_LUN,
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.h linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.h
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.h	2007-05-17 16:38:14.000000000 +0300
+@@ -49,7 +49,6 @@
+ #define ISCSI_SG_TABLESIZE		SG_ALL
+ #define ISCSI_TCP_MAX_CMD_LEN		16
+ 
+-struct crypto_hash;
+ struct socket;
+ 
+ /* Socket connection recieve helper */
+@@ -93,8 +92,8 @@ struct iscsi_tcp_conn {
+ 	void			(*old_write_space)(struct sock *);
+ 
+ 	/* data and header digests */
+-	struct hash_desc	tx_hash;	/* CRC32C (Tx) */
+-	struct hash_desc	rx_hash;	/* CRC32C (Rx) */
++	struct crypto_tfm	*tx_hash;	/* CRC32C (Tx) */
++	struct crypto_tfm	*rx_hash;	/* CRC32C (Rx) */
+ 
+ 	/* MIB custom statistics */
+ 	uint32_t		sendpage_failures_cnt;
+diff -rup linux-2.6.20/drivers/scsi/libiscsi.c linux-2.6.20-rh4-backport/drivers/scsi/libiscsi.c
+--- linux-2.6.20/drivers/scsi/libiscsi.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/libiscsi.c	2007-05-17 16:38:14.000000000 +0300
+@@ -1370,7 +1370,6 @@ iscsi_session_setup(struct iscsi_transpo
+ 	shost->max_lun = iscsit->max_lun;
+ 	shost->max_cmd_len = iscsit->max_cmd_len;
+ 	shost->transportt = scsit;
+-	shost->transportt->create_work_queue = 1;
+ 	*hostno = shost->host_no;
+ 
+ 	session = iscsi_hostdata(shost->hostdata);
+diff -rup linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c linux-2.6.20-rh4-backport/drivers/scsi/scsi_transport_iscsi.c
+--- linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/scsi_transport_iscsi.c	2007-05-17 16:38:14.000000000 +0300
+@@ -65,6 +65,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
+ #define cdev_to_iscsi_internal(_cdev) \
+ 	container_of(_cdev, struct iscsi_internal, cdev)
+ 
++extern int attribute_container_init(void);
++
+ static void iscsi_transport_release(struct class_device *cdev)
+ {
+ 	struct iscsi_internal *priv = cdev_to_iscsi_internal(cdev);
+@@ -80,6 +82,17 @@ static struct class iscsi_transport_clas
+ 	.release = iscsi_transport_release,
+ };
+ 
++static void iscsi_host_class_release(struct class_device *class_dev)
++{
++	struct Scsi_Host *shost = transport_class_to_shost(class_dev);
++	put_device(&shost->shost_gendev);
++}
++
++struct class iscsi_host_class = {
++	.name = "iscsi_host",
++	.release = iscsi_host_class_release,
++};
++
+ static ssize_t
+ show_transport_handle(struct class_device *cdev, char *buf)
+ {
+@@ -115,10 +128,8 @@ static struct attribute_group iscsi_tran
+ 	.attrs = iscsi_transport_attrs,
+ };
+ 
+-static int iscsi_setup_host(struct transport_container *tc, struct device *dev,
+-			    struct class_device *cdev)
++static int iscsi_setup_host(struct Scsi_Host *shost)
+ {
+-	struct Scsi_Host *shost = dev_to_shost(dev);
+ 	struct iscsi_host *ihost = shost->shost_data;
+ 
+ 	memset(ihost, 0, sizeof(*ihost));
+@@ -127,12 +138,6 @@ static int iscsi_setup_host(struct trans
+ 	return 0;
+ }
+ 
+-static DECLARE_TRANSPORT_CLASS(iscsi_host_class,
+-			       "iscsi_host",
+-			       iscsi_setup_host,
+-			       NULL,
+-			       NULL);
+-
+ static DECLARE_TRANSPORT_CLASS(iscsi_session_class,
+ 			       "iscsi_session",
+ 			       NULL,
+@@ -216,24 +221,6 @@ static int iscsi_is_session_dev(const st
+ 	return dev->release == iscsi_session_release;
+ }
+ 
+-static int iscsi_user_scan(struct Scsi_Host *shost, uint channel,
+-			   uint id, uint lun)
+-{
+-	struct iscsi_host *ihost = shost->shost_data;
+-	struct iscsi_cls_session *session;
+-
+-	mutex_lock(&ihost->mutex);
+-	list_for_each_entry(session, &ihost->sessions, host_list) {
+-		if ((channel == SCAN_WILD_CARD || channel == 0) &&
+-		    (id == SCAN_WILD_CARD || id == session->target_id))
+-			scsi_scan_target(&session->dev, 0,
+-					 session->target_id, lun, 1);
+-	}
+-	mutex_unlock(&ihost->mutex);
+-
+-	return 0;
+-}
+-
+ static void session_recovery_timedout(struct work_struct *work)
+ {
+ 	struct iscsi_cls_session *session =
+@@ -362,8 +349,6 @@ void iscsi_remove_session(struct iscsi_c
+ 	list_del(&session->host_list);
+ 	mutex_unlock(&ihost->mutex);
+ 
+-	scsi_remove_target(&session->dev);
+-
+ 	transport_unregister_device(&session->dev);
+ 	device_del(&session->dev);
+ }
+@@ -1269,24 +1254,6 @@ static int iscsi_conn_match(struct attri
+ 	return &priv->conn_cont.ac == cont;
+ }
+ 
+-static int iscsi_host_match(struct attribute_container *cont,
+-			    struct device *dev)
+-{
+-	struct Scsi_Host *shost;
+-	struct iscsi_internal *priv;
+-
+-	if (!scsi_is_host_device(dev))
+-		return 0;
+-
+-	shost = dev_to_shost(dev);
+-	if (!shost->transportt  ||
+-	    shost->transportt->host_attrs.ac.class != &iscsi_host_class.class)
+-		return 0;
+-
+-        priv = to_iscsi_internal(shost->transportt);
+-        return &priv->t.host_attrs.ac == cont;
+-}
+-
+ struct scsi_transport_template *
+ iscsi_register_transport(struct iscsi_transport *tt)
+ {
+@@ -1306,7 +1273,6 @@ iscsi_register_transport(struct iscsi_tr
+ 	INIT_LIST_HEAD(&priv->list);
+ 	priv->daemon_pid = -1;
+ 	priv->iscsi_transport = tt;
+-	priv->t.user_scan = iscsi_user_scan;
+ 
+ 	priv->cdev.class = &iscsi_transport_class;
+ 	snprintf(priv->cdev.class_id, BUS_ID_SIZE, "%s", tt->name);
+@@ -1319,12 +1285,11 @@ iscsi_register_transport(struct iscsi_tr
+ 		goto unregister_cdev;
+ 
+ 	/* host parameters */
+-	priv->t.host_attrs.ac.attrs = &priv->host_attrs[0];
+-	priv->t.host_attrs.ac.class = &iscsi_host_class.class;
+-	priv->t.host_attrs.ac.match = iscsi_host_match;
++
++	priv->t.host_attrs = &priv->host_attrs[0];
++	priv->t.host_class = &iscsi_host_class;
++	priv->t.host_setup = iscsi_setup_host;
+ 	priv->t.host_size = sizeof(struct iscsi_host);
+-	priv->host_attrs[0] = NULL;
+-	transport_container_register(&priv->t.host_attrs);
+ 
+ 	/* connection parameters */
+ 	priv->conn_cont.ac.attrs = &priv->conn_attrs[0];
+@@ -1402,7 +1367,6 @@ int iscsi_unregister_transport(struct is
+ 
+ 	transport_container_unregister(&priv->conn_cont);
+ 	transport_container_unregister(&priv->session_cont);
+-	transport_container_unregister(&priv->t.host_attrs);
+ 
+ 	sysfs_remove_group(&priv->cdev.kobj, &iscsi_transport_group);
+ 	class_device_unregister(&priv->cdev);
+@@ -1419,6 +1383,8 @@ static __init int iscsi_transport_init(v
+ 	printk(KERN_INFO "Loading iSCSI transport class v%s.\n",
+ 		ISCSI_TRANSPORT_VERSION);
+ 
++	attribute_container_init();
++
+ 	err = class_register(&iscsi_transport_class);
+ 	if (err)
+ 		return err;
diff --git a/kernel_patches/backport/2.6.9_U5/add_open_iscsi_h.patch b/kernel_patches/backport/2.6.9_U5/add_open_iscsi_h.patch
new file mode 100644
index 0000000..21715fd
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U5/add_open_iscsi_h.patch
@@ -0,0 +1,35 @@
+diff -rup linux-2.6.20/include/scsi/iscsi_if.h linux-2.6.20-rh4-backport/include/scsi/iscsi_if.h
+--- linux-2.6.20/include/scsi/iscsi_if.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/iscsi_if.h	2007-05-15 08:49:53.000000000 +0300
+@@ -277,7 +277,6 @@ enum iscsi_param {
+  * These flags describes reason of stop_conn() call
+  */
+ #define STOP_CONN_TERM		0x1
+-#define STOP_CONN_SUSPEND	0x2
+ #define STOP_CONN_RECOVER	0x3
+ 
+ #define ISCSI_STATS_CUSTOM_MAX		32
+diff -rup linux-2.6.20/include/scsi/libiscsi.h linux-2.6.20-rh4-backport/include/scsi/libiscsi.h
+--- linux-2.6.20/include/scsi/libiscsi.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/libiscsi.h	2007-05-15 08:54:49.000000000 +0300
+@@ -25,8 +25,6 @@
+ 
+ #include <linux/types.h>
+ #include <linux/mutex.h>
+-#include <linux/timer.h>
+-#include <linux/workqueue.h>
+ #include <scsi/iscsi_proto.h>
+ #include <scsi/iscsi_if.h>
+ 
+diff -rup linux-2.6.20/include/scsi/scsi_transport_iscsi.h linux-2.6.20-rh4-backport/include/scsi/scsi_transport_iscsi.h
+--- linux-2.6.20/include/scsi/scsi_transport_iscsi.h	2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/scsi_transport_iscsi.h	2007-05-15 08:54:24.000000000 +0300
+@@ -24,7 +24,7 @@
+ #define SCSI_TRANSPORT_ISCSI_H
+ 
+ #include <linux/device.h>
+-#include <scsi/iscsi_if.h>
++#include "iscsi_if.h"
+ 
+ struct scsi_transport_template;
+ struct iscsi_transport;
diff --git a/kernel_patches/backport/2.6.9_U5/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U5/fix_inclusion_order_iscsi_iser.patch
new file mode 100644
index 0000000..3c2a969
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U5/fix_inclusion_order_iscsi_iser.patch
@@ -0,0 +1,13 @@
+--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:13:43.000000000 +0200
++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-02-08 09:14:31.000000000 +0200
+@@ -70,9 +70,8 @@
+ #include <scsi/scsi_tcq.h>
+ #include <scsi/scsi_host.h>
+ #include <scsi/scsi.h>
+-#include <scsi/scsi_transport_iscsi.h>
+-
+ #include "iscsi_iser.h"
++#include <scsi/scsi_transport_iscsi.h>
+ 
+ static unsigned int iscsi_max_lun = 512;
+ module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO);
diff --git a/kernel_patches/backport/2.6.9_U5/iscsi_scsi_addons.patch b/kernel_patches/backport/2.6.9_U5/iscsi_scsi_addons.patch
new file mode 100644
index 0000000..ffa0598
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U5/iscsi_scsi_addons.patch
@@ -0,0 +1,65 @@
+diff --git a/drivers/scsi/init.c b/drivers/scsi/init.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/init.c
+@@ -0,0 +1 @@
++#include "src/init.c"
+diff --git a/drivers/scsi/attribute_container.c b/drivers/scsi/attribute_container.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/attribute_container.c
+@@ -0,0 +1 @@
++#include "src/attribute_container.c"
+diff --git a/drivers/scsi/transport_class.c b/drivers/scsi/transport_class.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/transport_class.c
+@@ -0,0 +1 @@
++#include "src/transport_class.c"
+diff --git a/drivers/scsi/klist.c b/drivers/scsi/klist.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/klist.c
+@@ -0,0 +1 @@
++#include "src/klist.c"
+diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi.c
+@@ -0,0 +1 @@
++#include "src/scsi.c"
+diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi_lib.c
+@@ -0,0 +1 @@
++#include "src/scsi_lib.c"
+diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi_scan.c
+@@ -0,0 +1 @@
++#include "src/scsi_scan.c"
+diff --git a/drivers/scsi/kref_new.c b/drivers/scsi/kref_new.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/kref_new.c
+@@ -0,0 +1 @@
++#include "src/kref_new.c"
+diff -rupN ofa_kernel-1.2/drivers/scsi/Makefile ofa_kernel-1.2-iscsi/drivers/scsi/Makefile
+--- ofa_kernel-1.2/drivers/scsi/Makefile	1970-01-01 02:00:00.000000000 +0200
++++ ofa_kernel-1.2-iscsi/drivers/scsi/Makefile	2007-05-16 14:12:22.000000000 +0300
+@@ -0,0 +1,5 @@
++obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
++obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
++
++scsi_transport_iscsi-y	:= scsi_transport_iscsi_f.o scsi.o scsi_lib.o init.o kref_new.o klist.o attribute_container.o transport_class.o
++libiscsi-y		:= libiscsi_f.o scsi_scan.o
diff --git a/kernel_patches/fixes/iscsi_scsi_makefile.patch b/kernel_patches/fixes/iscsi_scsi_makefile.patch
deleted file mode 100644
index 9c4fd01..0000000
--- a/kernel_patches/fixes/iscsi_scsi_makefile.patch
+++ /dev/null
@@ -1,10 +0,0 @@
-Add a Makefile based on the kernel's drivers/scsi/Makefile in order to build open-iscsi.
-
-Signed-off-by: Erez Zilber <erezz at voltaire.com>
-
-diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile
---- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile  1970-01-01 02:00:00.000000000 +0200
-+++ ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile       2006-12-28 17:01:22.000000000 +0200
-@@ -0,0 +1,2 @@
-+obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
-+obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
diff --git a/ofed_scripts/makefile b/ofed_scripts/makefile
index 34a8996..62abe2c 100644
--- a/ofed_scripts/makefile
+++ b/ofed_scripts/makefile
@@ -60,6 +60,12 @@ kernel:
 	@echo "Kernel version: $(KVERSION)"
 	@echo "Modules directory: $(DESTDIR)/$(MODULES_DIR)"
 	@echo "Kernel sources: $(KSRC)"
+	if [ -e $(CWD)/drivers/scsi/libiscsi.c ]; then \
+		mv $(CWD)/drivers/scsi/libiscsi.c $(CWD)/drivers/scsi/libiscsi_f.c; \
+	fi
+	if [ -e $(CWD)/drivers/scsi/scsi_transport_iscsi.c ]; then \
+		mv $(CWD)/drivers/scsi/scsi_transport_iscsi.c $(CWD)/drivers/scsi/scsi_transport_iscsi_f.c; \
+	fi
 	env EXTRA_CFLAGS="$(OPENIB_KERNEL_EXTRA_CFLAGS) $(KERNEL_MEMTRACK_CFLAGS) -I$(CWD)/include -I$(CWD)/drivers/infiniband/include \
 		-I$(CWD)/drivers/infiniband/ulp/ipoib \
 		-I$(CWD)/drivers/infiniband/debug \


From mst at dev.mellanox.co.il  Mon May 21 01:16:25 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 11:16:25 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons
	for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <46513C13.3010100@voltaire.com>
References: <4641D295.5060907@voltaire.com> <4641D38A.8040406@voltaire.com>
	<20070510092925.GB13655@mellanox.co.il>
	<46513C13.3010100@voltaire.com>
Message-ID: <20070521081625.GA20400@mellanox.co.il>

> Quoting Erez Zilber <erezz at voltaire.com>:
> Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
> 
> Michael S. Tsirkin wrote:
> >> Quoting Erez Zilber <erezz at voltaire.com>:
> >> Subject: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
> >>
> >>
> >> Add the required backport patches & kernel addons for open-iscsi
> >> over iSER in RHAS4 up3 and up4.
> >>
> >> Signed-off-by: Erez Zilber <erezz at voltaire.com>
> > 
> > In addition to posting patches, could you pls publish a git tree to pull from,
> > please? This makes it easy to test-build the patch as our build system
> > knows how to do git checkout.
> 
> Added a git tree:
> 
> http://www.openfabrics.org/git/?p=~erezz/ofed_1_2_iser_rh4.git;a=summary

Looks reasonable.

However, you are copying a ton of files from the upstream kernel.
Sticking extra files in include/ might interfere with newer
kernels, but I don't have a better idea for this for 1.2
(for 1.3 I am hoping we'll use the submodule support in git,
so we'll be able to re-use headers as well).

But, for files *not* in "include/", I suggest that, instead of sticking our
own version in addons, we should check out the files from upstream and tweak
the makefiles to pick them up: maintaining these in the OFED tree long-term
will be a problem.

> >> + 
> >> + struct iscsi_internal {
> >> + 	int daemon_pid;
> >> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
> >> + #define cdev_to_iscsi_internal(_cdev) \
> >> + 	container_of(_cdev, struct iscsi_internal, cdev)
> >> + 
> >> ++extern int attribute_container_init(void);
> >> ++
> > 
> > This does not look scsi-related. Why does this belong here?
> 
> This is a hack. In 2.6.20, attribute_container_init is called from drivers/base/init.c. Since I cannot do that, I'm calling it from the init function in scsi_transport_iscsi (because scsi_transport_iscsi uses the attribute container). Do you have a better suggestion?

Aha. No better ideas for the header, let it be for now.
But the code in drivers/base/init.c can be checked out rather than
copied over.

> diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/memory.h b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h
> new file mode 100644
> index 0000000..654ef55
> --- /dev/null
> +++ b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h
> @@ -0,0 +1,89 @@
> +/*
> + * include/linux/memory.h - generic memory definition
> + *
> + * This is mainly for topological representation. We define the
> + * basic "struct memory_block" here, which can be embedded in per-arch
> + * definitions or NUMA information.
> + *
> + * Basic handling of the devices is done in drivers/base/memory.c
> + * and system devices are handled in drivers/base/sys.c.
> + *
> + * Memory block are exported via sysfs in the class/memory/devices/
> + * directory.
> + *
> + */


Sorry, why are we copying this here?
Are you actually trying to work with hotplug memory?


> --- a/kernel_patches/fixes/iscsi_scsi_makefile.patch
> +++ /dev/null
> @@ -1,10 +0,0 @@
> -Add a Makefile based on the kernel's drivers/scsi/Makefile in order to build open-iscsi.
> -
> -Signed-off-by: Erez Zilber <erezz at voltaire.com>
> -
> -diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile
> ---- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile  1970-01-01 02:00:00.000000000 +0200
> -+++ ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile       2006-12-28 17:01:22.000000000 +0200
> -@@ -0,0 +1,2 @@
> -+obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
> -+obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
> diff --git a/ofed_scripts/makefile b/ofed_scripts/makefile
> index 34a8996..62abe2c 100644
> --- a/ofed_scripts/makefile
> +++ b/ofed_scripts/makefile
> @@ -60,6 +60,12 @@ kernel:
>  	@echo "Kernel version: $(KVERSION)"
>  	@echo "Modules directory: $(DESTDIR)/$(MODULES_DIR)"
>  	@echo "Kernel sources: $(KSRC)"
> +	if [ -e $(CWD)/drivers/scsi/libiscsi.c ]; then \
> +		mv $(CWD)/drivers/scsi/libiscsi.c $(CWD)/drivers/scsi/libiscsi_f.c; \
> +	fi
> +	if [ -e $(CWD)/drivers/scsi/scsi_transport_iscsi.c ]; then \
> +		mv $(CWD)/drivers/scsi/scsi_transport_iscsi.c $(CWD)/drivers/scsi/scsi_transport_iscsi_f.c; \
> +	fi
>  	env EXTRA_CFLAGS="$(OPENIB_KERNEL_EXTRA_CFLAGS) $(KERNEL_MEMTRACK_CFLAGS) -I$(CWD)/include -I$(CWD)/drivers/infiniband/include \
>  		-I$(CWD)/drivers/infiniband/ulp/ipoib \
>  		-I$(CWD)/drivers/infiniband/debug \

This looks pretty hacky. Moving files around during make will
interfere with people trying to e.g. create a patch.
What is this doing? Can't the makefile just point to the right files?

-- 
MST


From kliteyn at dev.mellanox.co.il  Mon May 21 01:17:02 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 21 May 2007 11:17:02 +0300
Subject: [ofa-general] Re: [PATCHv2] osm: up/dn optimization - improved
	ranking
In-Reply-To: <20070520161034.GY19271@sashak.voltaire.com>
References: <46503064.7010107@dev.mellanox.co.il>
	<20070520161034.GY19271@sashak.voltaire.com>
Message-ID: <4651557E.2080400@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 14:26 Sun 20 May     , Yevgeny Kliteynik wrote:
>> Hi Hal,
>>
>> This patch optimizes fabric ranking similar to the fat-tree ranking.
>> All the root switches are marked with rank and added to the BFS list,
>> and only then ranking of rest of the fabric begins.
>> This version of the patch is updated in accordance with Sasha's suggestions.
>>
>> Please apply to master.
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
> 
> Looks fine for me. Nice optimization. Thanks.
> 
> I guess there still be issue with max_rank calculation (details are
> below), which affects only log message and for me it is ok to fix it in
> the incremental patch.
> 
>> opensm/opensm/osm_ucast_updn.c |   80 +++++++++++++++++----------------------
>> 1 files changed, 35 insertions(+), 45 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c
>> index 5cebd9b..95a0622 100644
>> --- a/opensm/opensm/osm_ucast_updn.c
>> +++ b/opensm/opensm/osm_ucast_updn.c
> 
> [snip...]
> 
>> @@ -483,7 +474,7 @@ updn_subn_rank(
>>       {
>>         remote_u = p_remote_physp->p_node->sw->priv;
>>         port_guid = p_remote_physp->port_guid;
>> -        did_cause_update = __updn_update_rank(remote_u, rank);
>> +        did_cause_update = __updn_update_rank(remote_u, u->rank+1);
>>
>>         osm_log( p_log, OSM_LOG_DEBUG,
>>                  "updn_subn_rank: "
>> @@ -492,7 +483,10 @@ updn_subn_rank(
>>                  remote_u->rank );
>>
>>         if (did_cause_update)
>> +        {
>>           cl_qlist_insert_tail(&list, &remote_u->list);
>> +          max_rank = remote_u->rank;
>> +        }
> 
> I think this still be not accurate. For instance with topology like:
> A <-> B <-> C <-> D <-> E , where roots are A and E we will get
> max_rank= 1, which obviously should be 2.

Not exactly. What you're describing would happen if the scan were DFS-like
rather than BFS. In your example there are two roots: A and E. Both get rank 0
and are added to the BFS list. Now the BFS scan starts:
 - Removing head of the list - A 
 - Discovering B
 - Assigning B with rank 1      -------> updating max_rank 
 - Adding B to the end of the list
 - Removing head of the list - E
 - Discovering D
 - Assigning D with rank 1      -------> updating max_rank
 - Adding D to the end of the list
 - Removing head of the list - B
 - Discovering C
 - Assigning C with rank 2      -------> updating max_rank
 - Adding C to the end of the list
 - Removing head of the list - D
 - Nothing to discover (C has already been discovered)
 - Removing head of the list - C
 - Nothing to discover (B and D have already been discovered)
 - BFS list is empty
As you can see, the last rank assigned was 2.
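
The scan above can be reproduced with a small standalone program. This is only a sketch, not the OpenSM code: the chain A..E is modelled as node indices 0..4, and rank_chain(), update_rank(), MAX_NODES and UNRANKED are names made up for the illustration. update_rank() is assumed to behave the way this thread implies __updn_update_rank() does (lower the rank only when the new value is smaller, and report whether it did).

```c
#include <assert.h>
#include <limits.h>

#define MAX_NODES 16
#define UNRANKED  INT_MAX	/* hypothetical "not ranked yet" marker */

/* Assumed stand-in for __updn_update_rank(): lower *rank to new_rank
 * only if that is an improvement; return 1 if the rank changed. */
static int update_rank(int *rank, int new_rank)
{
	if (*rank > new_rank) {
		*rank = new_rank;
		return 1;
	}
	return 0;
}

/* Rank the chain 0 <-> 1 <-> ... <-> n-1 by BFS from all roots at once.
 * Returns max_rank as the patch tracks it: the rank of the last update. */
int rank_chain(int n, const int *roots, int nroots, int *rank)
{
	int queue[MAX_NODES];
	int head = 0, tail = 0, max_rank = 0;
	int i, d;

	assert(n > 0 && n <= MAX_NODES);

	for (i = 0; i < n; i++)
		rank[i] = UNRANKED;

	/* Every root gets rank 0 and is queued before the scan begins. */
	for (i = 0; i < nroots; i++) {
		assert(roots[i] >= 0 && roots[i] < n);
		if (update_rank(&rank[roots[i]], 0))
			queue[tail++] = roots[i];
	}

	while (head < tail) {
		int u = queue[head++];

		/* In a chain the only neighbours of u are u-1 and u+1. */
		for (d = -1; d <= 1; d += 2) {
			int v = u + d;

			if (v < 0 || v >= n)
				continue;
			if (update_rank(&rank[v], rank[u] + 1)) {
				queue[tail++] = v;
				max_rank = rank[v];
			}
		}
	}
	return max_rank;
}
```

Called with n = 5 and roots {0, 4} (A and E), the function dequeues the nodes in the order A, E, B, D, C, assigns the ranks 0, 1, 2, 1, 0 and returns 2. Because all roots are queued before the scan starts, nodes leave the list in non-decreasing rank order, so the last update is also the largest rank.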

I actually was expecting this mail, because I thought of something like this initially :)
 
> Probably we need something like this instead:
> 
> 	if (did_cause_update)
> 		cl_qlist_insert_tail(&list, &remote_u->list);
> 	if (remote_u->rank <= u->rank + 1)
> 		max_rank = remote_u->rank;
> 
> (and after such intervention into rank updating technique we may want to
> remove also __updn_update_rank() function)

Although I can't think of any scenario that would prove me wrong, I do think that
to make the code more "intuitive" we might want to remove __updn_update_rank()
and do something like this:

    if (remote_u->rank > u->rank + 1)
    {
        remote_u->rank = u->rank + 1;
        max_rank = remote_u->rank; 
        cl_qlist_insert_tail(&list, &remote_u->list);
    }
 
> And again, this nit affects only reported value in the log message (and
> just this log message removing can be option too :)) and doesn't touch
> the optimization itself - good stuff, Yevgeny!

True, all this is for the log message only :)
We also might want to remove the message :)
I'm OK with either of the two options.

-- Yevgeny

> Sasha
> 


From vlad at lists.openfabrics.org  Mon May 21 02:40:34 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 21 May 2007 02:40:34 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070521-0200 daily build status
Message-ID: <20070521094034.B6A33E6082C@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-42.ELsmp

Failed:


From erezz at voltaire.com  Mon May 21 04:16:08 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Mon, 21 May 2007 14:16:08 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons
 for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <20070521081625.GA20400@mellanox.co.il>
References: <4641D295.5060907@voltaire.com>
	<4641D38A.8040406@voltaire.com><20070510092925.GB13655@mellanox.co.il><46513C13.3010100@voltaire.com>
	<20070521081625.GA20400@mellanox.co.il>
Message-ID: <46517F78.8000805@voltaire.com>

>>
>> Added a git tree:
>>
>> http://www.openfabrics.org/git/?p=~erezz/ofed_1_2_iser_rh4.git;a=summary
> 
> Looks reasonable.
> 
> However, you are copying a ton of files from upstream kernel.
> Sticking extra files in include might interfere with newer
> kernels, so I don't have better ideas for this for 1.2
> (for 1.3 I am hoping we'll use the submodule support in git,
> so we'll be able to re-use headers as well).
> 
> But, for files *not* in "include/", I suggest that, instead of sticking our
> own version in addons, we should check out the files from upstream and tweak
> makefiles to pick them up: maintaining these in OFED tree long-term will be a
> problem.

Are you suggesting that we add a new mechanism to OFED to do that?

> 
>> >> +
>> >> + struct iscsi_internal {
>> >> +  int daemon_pid;
>> >> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
>> >> + #define cdev_to_iscsi_internal(_cdev) \
>> >> +  container_of(_cdev, struct iscsi_internal, cdev)
>> >> +
>> >> ++extern int attribute_container_init(void);
>> >> ++
>> >
>> > This does not look scsi-related. Why does this belong here?
>>
>> This is a hack. In 2.6.20, attribute_container_init is called from
>> drivers/base/init.c. Since I cannot do that, I'm calling it from the
>> init function in scsi_transport_iscsi (because scsi_transport_iscsi uses
>> the attribute container). Do you have a better suggestion?
> 
> Aha. No better ideas for the header, let it be for now.
> But the code in drivers/base/init.c can be checked out rather than
> copied over.

I'm using only a very small part of init.c. I'm not sure that we should copy it.

> 
>> diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/memory.h b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h
>> new file mode 100644
>> index 0000000..654ef55
>> --- /dev/null
>> +++ b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h
>> @@ -0,0 +1,89 @@
>> +/*
>> + * include/linux/memory.h - generic memory definition
>> + *
>> + * This is mainly for topological representation. We define the
>> + * basic "struct memory_block" here, which can be embedded in per-arch
>> + * definitions or NUMA information.
>> + *
>> + * Basic handling of the devices is done in drivers/base/memory.c
>> + * and system devices are handled in drivers/base/sys.c.
>> + *
>> + * Memory block are exported via sysfs in the class/memory/devices/
>> + * directory.
>> + *
>> + */
> 
> 
> Sorry, why are we copying this here?
> Are you actually trying to work with hotplug memory?

Sorry, it seems that I don't really need memory.h. It was included from init.c, but it is not necessary. I made the fix on ofed_1_2_iser_rh4.git.

> 
> 
>> --- a/kernel_patches/fixes/iscsi_scsi_makefile.patch
>> +++ /dev/null
>> @@ -1,10 +0,0 @@
>> -Add a Makefile based on the kernel's drivers/scsi/Makefile in order to build open-iscsi.
>> -
>> -Signed-off-by: Erez Zilber <erezz at voltaire.com>
>> -
>> -diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile
>> ---- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile  1970-01-01 02:00:00.000000000 +0200
>> -+++ ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile       2006-12-28 17:01:22.000000000 +0200
>> -@@ -0,0 +1,2 @@
>> -+obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
>> -+obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
>> diff --git a/ofed_scripts/makefile b/ofed_scripts/makefile
>> index 34a8996..62abe2c 100644
>> --- a/ofed_scripts/makefile
>> +++ b/ofed_scripts/makefile
>> @@ -60,6 +60,12 @@ kernel:
>>       @echo "Kernel version: $(KVERSION)"
>>       @echo "Modules directory: $(DESTDIR)/$(MODULES_DIR)"
>>       @echo "Kernel sources: $(KSRC)"
>> +     if [ -e $(CWD)/drivers/scsi/libiscsi.c ]; then \
>> +             mv $(CWD)/drivers/scsi/libiscsi.c
> $(CWD)/drivers/scsi/libiscsi_f.c; \
>> +     fi
>> +     if [ -e $(CWD)/drivers/scsi/scsi_transport_iscsi.c ]; then \
>> +             mv $(CWD)/drivers/scsi/scsi_transport_iscsi.c
> $(CWD)/drivers/scsi/scsi_transport_iscsi_f.c; \
>> +     fi
>>       env EXTRA_CFLAGS="$(OPENIB_KERNEL_EXTRA_CFLAGS)
> $(KERNEL_MEMTRACK_CFLAGS) -I$(CWD)/include
> -I$(CWD)/drivers/infiniband/include \
>>               -I$(CWD)/drivers/infiniband/ulp/ipoib \
>>               -I$(CWD)/drivers/infiniband/debug \
> 
> This looks pretty hacky. Moving files around during make will
> interfere with people trying to e.g. create a patch.
> What is this doing? Can't makefile just point to the right files?
> 

Here's the problem:

I'm trying to build a module that contains multiple object files (e.g. libiscsi). libiscsi contains libiscsi.o & scsi_scan.o. Something like:

libiscsi-y             := libiscsi_f.o scsi_scan.o

The problem is that if I'm doing something like:

libiscsi-y             := libiscsi.o scsi_scan.o

then, libiscsi.ko doesn't contain the symbols from libiscsi.o (only symbols from scsi_scan.o). We found 2 solutions for this problem:

1. Change the module name - this is problematic because open-iscsi startup script uses the original module name.
2. Change the file name (libiscsi.c -> libiscsi_f.c) - this is what I did.

I don't really like this hack, but I wasn't able to come up with something better. Do you know how to overcome this problem?

Thanks,
Erez


From mst at dev.mellanox.co.il  Mon May 21 04:44:10 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 14:44:10 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons
	for open-iscsiover iSER support for RHAS4 up3 and up4
In-Reply-To: <46517F78.8000805@voltaire.com>
References: <4641D295.5060907@voltaire.com>
	<20070521081625.GA20400@mellanox.co.il>
	<46517F78.8000805@voltaire.com>
Message-ID: <20070521114410.GG20400@mellanox.co.il>

> Quoting Erez Zilber <erezz at voltaire.com>:
> Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsiover iSER support for RHAS4 up3 and up4
> 
> >>
> >> Added a git tree:
> >>
> >> http://www.openfabrics.org/git/?p=~erezz/ofed_1_2_iser_rh4.git;a=summary
> > 
> > Looks reasonable.
> > 
> > However, you are copying a ton of files from upstream kernel.
> > Sticking extra files in include might interfere with newer
> > kernels, so I don't have better ideas for this for 1.2
> > (for 1.3 I am hoping we'll use the submodule support in git,
> > so we'll be able to re-use headers as well).
> > 
> > But, for files *not* in "include/", I suggest that, instead of sticking our
> > own version in addons, we should check out the files from upstream and tweak
> > makefiles to pick them up: maintaining these in OFED tree long-term will
> > be a
> > problem.
> 
> Do you suggest to add a new mechanism to OFED that will do that?

No, this is the same mechanism that we use for the rest of the files:
check them out of the kernel tree.
Look at file ofed_scripts/ofed_checkout.sh

But I stress that we cannot do this for files under
include/ *unless* they only contain packet structure definitions.
Otherwise we'll get weird data corruption on newer kernels.
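
To illustrate the corruption risk, here is a minimal userspace sketch. The
struct names and fields are hypothetical and not taken from any kernel header;
the only assumption is that a newer kernel inserted a field into a structure
whose old definition we copied. Code compiled against the old header then
writes at the old offset:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout from the older, copied header. */
struct widget_old {
	int id;
	int flags;
};

/* Hypothetical layout in the newer kernel: a field was inserted. */
struct widget_new {
	int id;
	long refcount;
	int flags;
};

/*
 * Models code built against the old header: it stores 'flags' at the
 * old offset, although the object is really laid out as widget_new.
 * The store lands wherever the new layout puts that offset (padding
 * or another field), not in the real 'flags' member.
 */
static void set_flags_with_old_header(void *obj, int flags)
{
	assert(obj != NULL);
	memcpy((char *)obj + offsetof(struct widget_old, flags),
	       &flags, sizeof(flags));
}
```

Calling set_flags_with_old_header() on a zeroed struct widget_new leaves its
real flags member untouched. Packet structure definitions do not have this
problem because their layout is fixed by the wire format.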

> > 
> >> >> +
> >> >> + struct iscsi_internal {
> >> >> +  int daemon_pid;
> >> >> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
> >> >> + #define cdev_to_iscsi_internal(_cdev) \
> >> >> +  container_of(_cdev, struct iscsi_internal, cdev)
> >> >> +
> >> >> ++extern int attribute_container_init(void);
> >> >> ++
> >> >
> >> > This does not look scsi-related. Why does this belong here?
> >>
> >> This is a hack. In 2.6.20, attribute_container_init is called from
> > drivers/base/init.c. Since I cannot do that, I'm calling it from the
> > init function in scsi_transport_iscsi (because scsi_transport_iscsi uses
> > the attribute container). Do you have a better suggestion?
> > 
> > Aha. No better ideas for the header, let it be for now.
> > But the code in drivers/base/init.c can be checked out rather than
> > copied over.
> 
> I'm using only a very small part of init.c. I'm not sure that we should copy it.

OK then.
What about the stuff like scsi.c?

> >> diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/memory.h
> > b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h
> >> new file mode 100644
> >> index 0000000..654ef55
> >> --- /dev/null
> >> +++ b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h
> >> @@ -0,0 +1,89 @@
> >> +/*
> >> + * include/linux/memory.h - generic memory definition
> >> + *
> >> + * This is mainly for topological representation. We define the
> >> + * basic "struct memory_block" here, which can be embedded in per-arch
> >> + * definitions or NUMA information.
> >> + *
> >> + * Basic handling of the devices is done in drivers/base/memory.c
> >> + * and system devices are handled in drivers/base/sys.c.
> >> + *
> >> + * Memory block are exported via sysfs in the class/memory/devices/
> >> + * directory.
> >> + *
> >> + */
> > 
> > 
> > Sorry, why are we copying this here?
> > Are you actually trying to work with hotplug memory?
> 
> Sorry, it seems that I don't really need memory.h. It was included from init.c, but it is not necessary. I made the fix on ofed_1_2_iser_rh4.git.

Pls check other headers you pull in - is there something you can skip?

> > 
> >> --- a/kernel_patches/fixes/iscsi_scsi_makefile.patch
> >> +++ /dev/null
> >> @@ -1,10 +0,0 @@
> >> -Add a Makefile based on the kernel's drivers/scsi/Makefile in order
> > to build open-iscsi.
> >> -
> >> -Signed-off-by: Erez Zilber <erezz at voltaire.com>
> >> -
> >> -diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile
> > ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile
> >> ---- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile  1970-01-01
> > 02:00:00.000000000 +0200
> >> -+++
> > ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile      
> > 2006-12-28 17:01:22.000000000 +0200
> >> -@@ -0,0 +1,2 @@
> >> -+obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
> >> -+obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
> >> diff --git a/ofed_scripts/makefile b/ofed_scripts/makefile
> >> index 34a8996..62abe2c 100644
> >> --- a/ofed_scripts/makefile
> >> +++ b/ofed_scripts/makefile
> >> @@ -60,6 +60,12 @@ kernel:
> >>       @echo "Kernel version: $(KVERSION)"
> >>       @echo "Modules directory: $(DESTDIR)/$(MODULES_DIR)"
> >>       @echo "Kernel sources: $(KSRC)"
> >> +     if [ -e $(CWD)/drivers/scsi/libiscsi.c ]; then \
> >> +             mv $(CWD)/drivers/scsi/libiscsi.c
> > $(CWD)/drivers/scsi/libiscsi_f.c; \
> >> +     fi
> >> +     if [ -e $(CWD)/drivers/scsi/scsi_transport_iscsi.c ]; then \
> >> +             mv $(CWD)/drivers/scsi/scsi_transport_iscsi.c
> > $(CWD)/drivers/scsi/scsi_transport_iscsi_f.c; \
> >> +     fi
> >>       env EXTRA_CFLAGS="$(OPENIB_KERNEL_EXTRA_CFLAGS)
> > $(KERNEL_MEMTRACK_CFLAGS) -I$(CWD)/include
> > -I$(CWD)/drivers/infiniband/include \
> >>               -I$(CWD)/drivers/infiniband/ulp/ipoib \
> >>               -I$(CWD)/drivers/infiniband/debug \
> > 
> > This looks pretty hacky. Moving files around during make will
> > interfere with people trying to e.g. create a patch.
> > What is this doing? Can't makefile just point to the right files?
> > 
> 
> Here's the problem:
> 
> I'm trying to build a module that contains multiple object files (e.g. libiscsi). libiscsi contains libiscsi.o & scsi_scan.o. Something like:
> 
> libiscsi-y             := libiscsi_f.o scsi_scan.o
> 
> The problem is that if I'm doing something like:
> 
> libiscsi-y             := libiscsi.o scsi_scan.o
> 
> then, libiscsi.ko doesn't contain the symbols from libiscsi.o (only symbols from scsi_scan.o). We found 2 solutions for this problem:
> 
> 1. Change the module name - this is problematic because open-iscsi startup script uses the original module name.
> 2. Change the file name (libiscsi.c -> libiscsi_f.c) - this is what I did.
> 
> I don't really like this hack, but I wasn't able to come up with something better. Do you know how to overcome this problem?

I do not have the time to look into this in depth.
But it seems that you can just add a file libiscsi_f.c containing

#include "libiscsi.c"

Would this work?

-- 
MST


From kliteyn at dev.mellanox.co.il  Mon May 21 04:53:59 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 21 May 2007 14:53:59 +0300
Subject: [ofa-general] [PATCH] osm: fixing coredump in drop manager
Message-ID: <46518857.2060308@dev.mellanox.co.il>

Hi Hal.

This patch fixes a coredump in the drop manager when trying to clear
uninitialized physical ports.
It happens only in master (the code in this area is a bit different in ofed_1_2).

Please apply to master.
Thanks.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_drop_mgr.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c
index 97a95c2..7ec185c 100644
--- a/opensm/opensm/osm_drop_mgr.c
+++ b/opensm/opensm/osm_drop_mgr.c
@@ -242,7 +242,7 @@ __osm_drop_mgr_remove_port(
   {
     p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)port_num );
 
-    if( p_physp )
+    if( p_physp && osm_physp_is_valid(p_physp) )
     {
       p_remote_physp = osm_physp_get_remote( p_physp );
       if( p_remote_physp && osm_physp_is_valid( p_remote_physp ) )
-- 
1.5.1.4


From mst at dev.mellanox.co.il  Mon May 21 05:04:59 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 15:04:59 +0300
Subject: [ofa-general] [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak
Message-ID: <20070521120459.GI20400@mellanox.co.il>

SRQ WR leakage has been observed with IPoIB/CM: e.g. flipping ports on and
off will, with time, leak out all WRs, and then all connections will start
getting RNR NACKs. Fix this in the way suggested by the spec: move the QP to
error, wait for the Last WQE Reached event, and then post a WR on a "drain QP"
connected to the same CQ.  Once we observe a completion on the drain QP, it's
safe to call ib_destroy_qp.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

The following has been working well for me. Please consider for 2.6.22.
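
For readers following the list handling in the patch below, the teardown order
can be modelled as a small state machine. This is a userspace sketch only: the
enum and function names are illustrative, and it flattens the patch's three
states plus four lists into five states.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model; these are not the driver's identifiers. */
enum rx_state { RX_LIVE, RX_ERROR, RX_FLUSH, RX_DRAIN, RX_REAP };

enum rx_event {
	EV_MODIFY_TO_ERROR,  /* QP was moved to the error state */
	EV_LAST_WQE_REACHED, /* async Last WQE Reached event seen */
	EV_DRAIN_POSTED,     /* WR posted on the drain QP */
	EV_DRAIN_COMPLETED   /* drain WR seen as a completion on the CQ */
};

/* Returns 0 and advances *s if ev is legal in state *s, else -1. */
static int rx_advance(enum rx_state *s, enum rx_event ev)
{
	assert(s != NULL);

	switch (*s) {
	case RX_LIVE:
		if (ev != EV_MODIFY_TO_ERROR)
			return -1;
		*s = RX_ERROR;
		return 0;
	case RX_ERROR:
		if (ev != EV_LAST_WQE_REACHED)
			return -1;
		*s = RX_FLUSH;
		return 0;
	case RX_FLUSH:
		if (ev != EV_DRAIN_POSTED)
			return -1;
		*s = RX_DRAIN;
		return 0;
	case RX_DRAIN:
		if (ev != EV_DRAIN_COMPLETED)
			return -1;
		*s = RX_REAP;
		return 0;
	default:
		return -1;
	}
}

/* Destroying the QP is only safe once the drain WR has completed. */
static int rx_may_destroy(enum rx_state s)
{
	return s == RX_REAP;
}
```

In this model rx_may_destroy() is true only after all four events have been
seen in order, matching the sequence quoted from 10.3.1 in ipoib.h.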

ipoib.h       |   39 ++++++++++-
ipoib_cm.c    |  201 ++++++++++++++++++++++++++++++++++++++++++++++++----------
ipoib_verbs.c |    2
3 files changed, 206 insertions(+), 36 deletions(-)

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-18 15:13:21.000000000 +0300
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-20 16:22:00.000000000 +0300
@@ -132,12 +135,43 @@ struct ipoib_cm_data {
 	__be32 mtu;
 };
 
+/*
+ * Quoting 10.3.1 Queue Pair and EE Context States:
+ *
+ * Note, for QPs that are associated with an SRQ, the Consumer should take the
+ * QP through the Error State before invoking a Destroy QP or a Modify QP to the
+ * Reset State.  The Consumer may invoke the Destroy QP without first performing
+ * a Modify QP to the Error State and waiting for the Affiliated Asynchronous
+ * Last WQE Reached Event. However, if the Consumer does not wait for the
+ * Affiliated Asynchronous Last WQE Reached Event, then WQE and Data Segment
+ * leakage may occur. Therefore, it is good programming practice to tear down a
+ * QP that is associated with an SRQ by using the following process:
+ *
+ * - Put the QP in the Error State
+ * - Wait for the Affiliated Asynchronous Last WQE Reached Event;
+ * - either:
+ *       drain the CQ by invoking the Poll CQ verb and either wait for CQ
+ *       to be empty or the number of Poll CQ operations has exceeded
+ *       CQ capacity size;
+ * - or
+ *       post another WR that completes on the same CQ and wait for this
+ *       WR to return as a WC; (NB: this is the option that we use)
+ * - and then invoke a Destroy QP or Reset QP.
+ */
+
+enum ipoib_cm_state {
+	IPOIB_CM_RX_LIVE,
+	IPOIB_CM_RX_ERROR, /* Ignored by stale task */
+	IPOIB_CM_RX_FLUSH  /* Last WQE Reached event observed */
+};
+
 struct ipoib_cm_rx {
 	struct ib_cm_id     *id;
 	struct ib_qp        *qp;
 	struct list_head     list;
 	struct net_device   *dev;
 	unsigned long        jiffies;
+	enum ipoib_cm_state  state;
 };
 
 struct ipoib_cm_tx {
@@ -165,10 +199,16 @@ struct ipoib_cm_dev_priv {
 	struct ib_srq  	       *srq;
 	struct ipoib_cm_rx_buf *srq_ring;
 	struct ib_cm_id        *id;
-	struct list_head        passive_ids;
+	struct ib_qp           *rx_drain_qp;   /* generates WR described in 10.3.1 */
+	struct list_head        passive_ids;   /* state: LIVE */
+	struct list_head        rx_error_list; /* state: ERROR */
+	struct list_head        rx_flush_list; /* state: FLUSH, drain not started */
+	struct list_head        rx_drain_list; /* state: FLUSH, drain started */
+	struct list_head        rx_reap_list;  /* state: FLUSH, drain done */
 	struct work_struct      start_task;
 	struct work_struct      reap_task;
 	struct work_struct      skb_task;
+	struct work_struct      rx_reap_task;
 	struct delayed_work     stale_task;
 	struct sk_buff_head     skb_queue;
 	struct list_head        start_list;
Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-18 15:13:21.000000000 +0300
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-21 07:44:50.000000000 +0300
@@ -37,6 +37,7 @@
 #include <net/dst.h>
 #include <net/icmp.h>
 #include <linux/icmpv6.h>
+#include <linux/delay.h>
 
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
 static int data_debug_level;
@@ -62,6 +63,16 @@ struct ipoib_cm_id {
 	u32 remote_mtu;
 };
 
+static struct ib_qp_attr ipoib_cm_err_attr = {
+	.qp_state = IB_QPS_ERR
+};
+
+#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff
+
+static struct ib_recv_wr ipoib_cm_rx_drain_wr = {
+	.wr_id = IPOIB_CM_RX_DRAIN_WRID
+};
+
 static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
 			       struct ib_cm_event *event);
 
@@ -150,11 +161,44 @@ partial_error:
 	return NULL;
 }
 
+static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv* priv)
+{
+	struct ib_recv_wr *bad_wr;
+
+	/* rx_drain_qp receive queue depth is 1, so
+	 * make sure we have at most 1 outstanding WR. */
+	if (list_empty(&priv->cm.rx_flush_list) ||
+	    !list_empty(&priv->cm.rx_drain_list))
+		return;
+
+	if (ib_post_recv(priv->cm.rx_drain_qp, &ipoib_cm_rx_drain_wr, &bad_wr))
+		ipoib_warn(priv, "failed to post rx_drain wr\n");
+
+	list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list);
+}
+
+static void ipoib_cm_rx_event_handler(struct ib_event *event, void *ctx)
+{
+	struct ipoib_cm_rx *p = ctx;
+	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
+	unsigned long flags;
+
+	if (event->event != IB_EVENT_QP_LAST_WQE_REACHED)
+		return;
+
+	spin_lock_irqsave(&priv->lock, flags);
+	list_move(&p->list, &priv->cm.rx_flush_list);
+	p->state = IPOIB_CM_RX_FLUSH;
+	ipoib_cm_start_rx_drain(priv);
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
 static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 					   struct ipoib_cm_rx *p)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
+		.event_handler = ipoib_cm_rx_event_handler,
 		.send_cq = priv->cq, /* does not matter, we never send anything */
 		.recv_cq = priv->cq,
 		.srq = priv->cm.srq,
@@ -256,6 +300,7 @@ static int ipoib_cm_req_handler(struct i
 
 	cm_id->context = p;
 	p->jiffies = jiffies;
+	p->state = IPOIB_CM_RX_LIVE;
 	spin_lock_irq(&priv->lock);
 	if (list_empty(&priv->cm.passive_ids))
 		queue_delayed_work(ipoib_workqueue,
@@ -277,7 +322,6 @@ static int ipoib_cm_rx_handler(struct ib
 {
 	struct ipoib_cm_rx *p;
 	struct ipoib_dev_priv *priv;
-	int ret;
 
 	switch (event->event) {
 	case IB_CM_REQ_RECEIVED:
@@ -289,20 +333,9 @@ static int ipoib_cm_rx_handler(struct ib
 	case IB_CM_REJ_RECEIVED:
 		p = cm_id->context;
 		priv = netdev_priv(p->dev);
-		spin_lock_irq(&priv->lock);
-		if (list_empty(&p->list))
-			ret = 0; /* Connection is going away already. */
-		else {
-			list_del_init(&p->list);
-			ret = -ECONNRESET;
-		}
-		spin_unlock_irq(&priv->lock);
-		if (ret) {
-			ib_destroy_qp(p->qp);
-			kfree(p);
-			return ret;
-		}
-		return 0;
+		if (ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE))
+			ipoib_warn(priv, "unable to move qp to error state\n");
+		/* Fall through */
 	default:
 		return 0;
 	}
@@ -354,8 +387,15 @@ void ipoib_cm_handle_rx_wc(struct net_de
 		       wr_id, wc->status);
 
 	if (unlikely(wr_id >= ipoib_recvq_size)) {
-		ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
-			   wr_id, ipoib_recvq_size);
+		if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) {
+			spin_lock_irqsave(&priv->lock, flags);
+			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
+			ipoib_cm_start_rx_drain(priv);
+			queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
+			spin_unlock_irqrestore(&priv->lock, flags);
+		} else
+			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
+				   wr_id, ipoib_recvq_size);
 		return;
 	}
 
@@ -374,9 +414,9 @@ void ipoib_cm_handle_rx_wc(struct net_de
 		if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
 			spin_lock_irqsave(&priv->lock, flags);
 			p->jiffies = jiffies;
-			/* Move this entry to list head, but do
-			 * not re-add it if it has been removed. */
-			if (!list_empty(&p->list))
+			/* Move this entry to list head, but do not re-add it
+			 * if it has been moved out of list. */
+			if (p->state == IPOIB_CM_RX_LIVE)
 				list_move(&p->list, &priv->cm.passive_ids);
 			spin_unlock_irqrestore(&priv->lock, flags);
 		}
@@ -583,17 +623,41 @@ static void ipoib_cm_tx_completion(struc
 int ipoib_cm_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_init_attr qp_init_attr = {
+		.send_cq = priv->cq,   /* does not matter, we never send anything */
+		.recv_cq = priv->cq,
+		.cap.max_send_wr = 1,  /* FIXME: 0 Seems not to work */
+		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
+		.cap.max_recv_wr = 1,
+		.cap.max_recv_sge = 1, /* FIXME: 0 Seems not to work */
+		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.qp_type = IB_QPT_UC,
+	};
 	int ret;
 
 	if (!IPOIB_CM_SUPPORTED(dev->dev_addr))
 		return 0;
 
+	priv->cm.rx_drain_qp = ib_create_qp(priv->pd, &qp_init_attr);
+	if (IS_ERR(priv->cm.rx_drain_qp)) {
+		printk(KERN_WARNING "%s: failed to create drain QP\n", priv->ca->name);
+		ret = PTR_ERR(priv->cm.rx_drain_qp);
+		return ret;
+	}
+
+	/* We put the QP in error state directly: this way, hardware
+	 * will immediately generate WC for each WR we post */
+	ret = ib_modify_qp(priv->cm.rx_drain_qp, &ipoib_cm_err_attr, IB_QP_STATE);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret);
+		goto err_qp;
+	}
+
 	priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev);
 	if (IS_ERR(priv->cm.id)) {
 		printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name);
 		ret = PTR_ERR(priv->cm.id);
-		priv->cm.id = NULL;
-		return ret;
+		goto err_cm;
 	}
 
 	ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num),
@@ -601,35 +665,79 @@ int ipoib_cm_dev_open(struct net_device 
 	if (ret) {
 		printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name,
 		       IPOIB_CM_IETF_ID | priv->qp->qp_num);
-		ib_destroy_cm_id(priv->cm.id);
-		priv->cm.id = NULL;
-		return ret;
+		goto err_listen;
 	}
+
 	return 0;
+
+err_listen:
+	ib_destroy_cm_id(priv->cm.id);
+err_cm:
+	priv->cm.id = NULL;
+err_qp:
+	ib_destroy_qp(priv->cm.rx_drain_qp);
+	return ret;
 }
 
 void ipoib_cm_dev_stop(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_cm_rx *p;
+	struct ipoib_cm_rx *p, *n;
+	unsigned long begin;
+	LIST_HEAD(list);
+	int ret;
 
 	if (!IPOIB_CM_SUPPORTED(dev->dev_addr) || !priv->cm.id)
 		return;
 
 	ib_destroy_cm_id(priv->cm.id);
 	priv->cm.id = NULL;
+
 	spin_lock_irq(&priv->lock);
 	while (!list_empty(&priv->cm.passive_ids)) {
 		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
-		list_del_init(&p->list);
+		list_move(&p->list, &priv->cm.rx_error_list);
+		p->state = IPOIB_CM_RX_ERROR;
 		spin_unlock_irq(&priv->lock);
+		ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
+		if (ret)
+			ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
+		spin_lock_irq(&priv->lock);
+	}
+
+	/* Wait for all RX to be drained */
+	begin = jiffies;
+
+	while (!list_empty(&priv->cm.rx_error_list) ||
+	       !list_empty(&priv->cm.rx_flush_list) ||
+	       !list_empty(&priv->cm.rx_drain_list)) {
+		if (time_after(jiffies, begin + 5 * HZ)) {
+			ipoib_warn(priv, "RX drain timing out\n");
+
+			/*
+			 * assume the HW is wedged and just free up everything.
+			 */
+			list_splice_init(&priv->cm.rx_flush_list, &list);
+			list_splice_init(&priv->cm.rx_error_list, &list);
+			list_splice_init(&priv->cm.rx_drain_list, &list);
+			break;
+		}
+		spin_unlock_irq(&priv->lock);
+		msleep(1);
+		spin_lock_irq(&priv->lock);
+	}
+
+	list_splice_init(&priv->cm.rx_reap_list, &list);
+
+	spin_unlock_irq(&priv->lock);
+
+	list_for_each_entry_safe(p, n, &list, list) {
 		ib_destroy_cm_id(p->id);
 		ib_destroy_qp(p->qp);
 		kfree(p);
-		spin_lock_irq(&priv->lock);
 	}
-	spin_unlock_irq(&priv->lock);
 
+	ib_destroy_qp(priv->cm.rx_drain_qp);
 	cancel_delayed_work(&priv->cm.stale_task);
 }
 
@@ -1079,24 +1187,44 @@ void ipoib_cm_skb_too_long(struct net_de
 		queue_work(ipoib_workqueue, &priv->cm.skb_task);
 }
 
+static void ipoib_cm_rx_reap(struct work_struct *work)
+{
+	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
+						   cm.rx_reap_task);
+	struct ipoib_cm_rx *p, *n;
+	LIST_HEAD(list);
+
+	spin_lock_irq(&priv->lock);
+	list_splice_init(&priv->cm.rx_reap_list, &list);
+	spin_unlock_irq(&priv->lock);
+
+	list_for_each_entry_safe(p, n, &list, list) {
+		ib_destroy_cm_id(p->id);
+		ib_destroy_qp(p->qp);
+		kfree(p);
+	}
+}
+
 static void ipoib_cm_stale_task(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
 						   cm.stale_task.work);
 	struct ipoib_cm_rx *p;
+	int ret;
 
 	spin_lock_irq(&priv->lock);
 	while (!list_empty(&priv->cm.passive_ids)) {
-		/* List if sorted by LRU, start from tail,
+		/* List is sorted by LRU, start from tail,
 		 * stop when we see a recently used entry */
 		p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list);
 		if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT))
 			break;
-		list_del_init(&p->list);
+		list_move(&p->list, &priv->cm.rx_error_list);
+		p->state = IPOIB_CM_RX_ERROR;
 		spin_unlock_irq(&priv->lock);
-		ib_destroy_cm_id(p->id);
-		ib_destroy_qp(p->qp);
-		kfree(p);
+		ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
+		if (ret)
+			ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
 		spin_lock_irq(&priv->lock);
 	}
 
@@ -1164,9 +1292,14 @@ int ipoib_cm_dev_init(struct net_device 
 	INIT_LIST_HEAD(&priv->cm.passive_ids);
 	INIT_LIST_HEAD(&priv->cm.reap_list);
 	INIT_LIST_HEAD(&priv->cm.start_list);
+	INIT_LIST_HEAD(&priv->cm.rx_error_list);
+	INIT_LIST_HEAD(&priv->cm.rx_flush_list);
+	INIT_LIST_HEAD(&priv->cm.rx_drain_list);
+	INIT_LIST_HEAD(&priv->cm.rx_reap_list);
 	INIT_WORK(&priv->cm.start_task, ipoib_cm_tx_start);
 	INIT_WORK(&priv->cm.reap_task, ipoib_cm_tx_reap);
 	INIT_WORK(&priv->cm.skb_task, ipoib_cm_skb_reap);
+	INIT_WORK(&priv->cm.rx_reap_task, ipoib_cm_rx_reap);
 	INIT_DELAYED_WORK(&priv->cm.stale_task, ipoib_cm_stale_task);
 
 	skb_queue_head_init(&priv->cm.skb_queue);
Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-18 15:13:21.000000000 +0300
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-18 15:13:39.000000000 +0300
@@ -173,7 +173,7 @@ int ipoib_transport_dev_init(struct net_
 	size = ipoib_sendq_size + ipoib_recvq_size + 1;
 	ret = ipoib_cm_dev_init(dev);
 	if (!ret)
-		size += ipoib_recvq_size;
+		size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */;
 
 	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
 	if (IS_ERR(priv->cq)) {


-- 
MST


From mst at dev.mellanox.co.il  Mon May 21 05:06:33 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 15:06:33 +0300
Subject: [ofa-general] [PATCH for-2.6.22] IB/cm: improve local id allocation
Message-ID: <20070521120633.GJ20400@mellanox.co.il>

IB/cm uses idr for local id allocations, with a running counter
as start_id. This fails to generate distinct ids in the scenario where
1. An id is constantly created and destroyed
2. A chunk of ids just beyond the current next_id value is occupied

This in turn leads to an increased chance of a connection request being
misdetected as a duplicate, sometimes for several retries, until next_id gets
past the block of allocated ids. This has been observed in practice.

As a fix, remember the last id allocated and start immediately above it.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

BTW, the cast to unsigned here avoids signed integer overflow (undefined
behavior in C) and keeps language lawyers happy.

Sean, can you ack this pls?
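
To make the failure scenario concrete, here is a toy userspace model. It is a
sketch under stated assumptions: get_new_above() is a hypothetical stand-in
that returns the lowest free id at or above the start value, which is the
behaviour the description above relies on; it is not the kernel idr code.

```c
#include <assert.h>
#include <stdbool.h>

#define TABLE_SIZE 64

static bool in_use[TABLE_SIZE];

/* Toy stand-in: lowest free id >= start, or -1 if the table is full. */
static int get_new_above(int start)
{
	if (start < 0)
		start = 0;
	for (int id = start; id < TABLE_SIZE; id++) {
		if (!in_use[id]) {
			in_use[id] = true;
			return id;
		}
	}
	return -1;
}

static void release_id(int id)
{
	assert(id >= 0 && id < TABLE_SIZE);
	in_use[id] = false;
}

/* Old policy: bump the hint by one per call, whatever id came back. */
static int alloc_old(int *next_id)
{
	return get_new_above((*next_id)++);
}

/* Patched policy: restart just above the id that was actually returned. */
static int alloc_new(int *next_id)
{
	int id = get_new_above(*next_id);

	if (id >= 0)
		*next_id = (int)((unsigned)id + 1);
	return id;
}
```

With ids 1..3 occupied and one id repeatedly allocated and released, the old
policy hands out 0, 4, 4 while the patched policy hands out 0, 4, 5.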

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index d446998..9032cd3 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -308,7 +308,9 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 	do {
 		spin_lock_irqsave(&cm.lock, flags);
 		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
-					next_id++, &id);
+					next_id, &id);
+		if (!ret)
+			next_id = (unsigned)id + 1;
 		spin_unlock_irqrestore(&cm.lock, flags);
 	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );
 

-- 
MST


From ogerlitz at voltaire.com  Mon May 21 05:28:37 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 21 May 2007 15:28:37 +0300
Subject: [ofa-general] [PATCH for-2.6.22] IB/cm: improve local id
	allocation
In-Reply-To: <20070521120633.GJ20400@mellanox.co.il>
References: <20070521120633.GJ20400@mellanox.co.il>
Message-ID: <46519075.9030303@voltaire.com>

Michael S. Tsirkin wrote:
> IB/cm uses idr for local id allocations, with a running counter
> as start_id. This fails to generate distinct ids 

> static int cm_alloc_id(struct cm_id_private *cm_id_priv)
> {
>         unsigned long flags;
>         int ret, id;
>         static int next_id;
> 
>         do {
>                 spin_lock_irqsave(&cm.lock, flags);
>                 ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
>                                         next_id++, &id);
>                 spin_unlock_irqrestore(&cm.lock, flags);
>         } while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );
> 
>         cm_id_priv->id.local_id = (__force __be32) (id ^ cm.random_id_operand);

Doesn't this XOR of the resulting ID with a random value, done after the
idr allocation, cause the cm to --always-- produce distinct ids?

Or.


From mst at dev.mellanox.co.il  Mon May 21 05:34:22 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 15:34:22 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id
	allocation
In-Reply-To: <46519075.9030303@voltaire.com>
References: <20070521120633.GJ20400@mellanox.co.il>
	<46519075.9030303@voltaire.com>
Message-ID: <20070521123422.GK20400@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: [PATCH for-2.6.22] IB/cm: improve local id allocation
> 
> Michael S. Tsirkin wrote:
> >IB/cm uses idr for local id allocations, with a running counter
> >as start_id. This fails to generate distinct ids 
> 
> >static int cm_alloc_id(struct cm_id_private *cm_id_priv)
> >{
> >        unsigned long flags;
> >        int ret, id;
> >        static int next_id;
> >
> >        do {
> >                spin_lock_irqsave(&cm.lock, flags);
> >                ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
> >                                        next_id++, &id);
> >                spin_unlock_irqrestore(&cm.lock, flags);
> >        } while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, 
> >        GFP_KERNEL) );
> >
> >        cm_id_priv->id.local_id = (__force __be32) (id ^ 
> >        cm.random_id_operand);
> 
> Doesn't this XOR of the resulting ID with a random value, done after the
> idr allocation, cause the cm to --always-- produce distinct ids?
> 
> Or.

No - "cm.random_id_operand" is initialized once, at module load time, so
XORing with it maps equal idr ids to equal local ids.
That's why we have the static next_id iterator.
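
A minimal sketch of the point, assuming a fixed operand (the constant below is
made up; the real value is chosen at load time): XOR with a constant is a
one-to-one mapping, so the local id repeats exactly when the idr id repeats.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for cm.random_id_operand: chosen once, then fixed. */
static const uint32_t random_id_operand = 0x5ca1ab1eu;

/* Same idr id in, same local id out; different ids in, different ids out. */
static uint32_t local_id_from_idr(uint32_t id)
{
	return id ^ random_id_operand;
}
```

So if the idr hands out id 4 twice in a row, both connections get the same
local id, whatever the operand is.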

-- 
MST


From Koen.SEGERS at VRT.BE  Mon May 21 06:04:08 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Mon, 21 May 2007 15:04:08 +0200
Subject: [ofa-general] GPFS node loses IB-connection 
Message-ID: <D63C0BE2D613C543B6F3305502E9784C030AA260@OCBEXS01001.rto.be>

Hi,
 
We are running GPFS with SDP. For this we use OFED 1.2-rc1. The machines are IBM x3755s and x3655s. The IB switch is an SFS-7000P. The HCAs are all "Mellanox Technologies MT25208 InfiniHost III Ex (rev a0)".
 
Under heavy load, we sometimes lose a node from our GPFS cluster.
The machine that lost connection (=10.224.158.104 or gpfswhbe1s1) gave this error:
May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901:   Reason code 668 Failure Reason Lost membership in cluster enterprise.universe. Unmounting file systems.
May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901:   
 
After this, we got the following message on some of the nodes that are part of the cluster (including the failing node):
GPFS Deadman Switch timer [0] has expired; IOs in progress: 0
Badness in do_exit at kernel/exit.c:807
Call Trace: <ffffffff80133370>{do_exit+80} <ffffffff80133c17>{sys_exit_group+0}
       <ffffffff8010a7be>{system_call+126}
Badness in do_exit at kernel/exit.c:807
Call Trace: <ffffffff80133370>{do_exit+80} <ffffffff80133c17>{sys_exit_group+0}
       <ffffffff8010a7be>{system_call+126}
idr_remove called for id=0 which is not allocated.
Call Trace: <ffffffff801e5ac0>{idr_remove+228} <ffffffff80180904>{kill_anon_super+41}
       <ffffffff8018099a>{deactivate_super+111} <ffffffff8019418b>{sys_umount+624}
       <ffffffff8018303c>{sys_newstat+25} <ffffffff8017b7c8>{__fput+348}
       <ffffffff80193b3d>{mntput_no_expire+25} <ffffffff80178eb3>{filp_close+89}
       <ffffffff8010a7be>{system_call+126}

Not all of them give the tracelog at the end.
 
GPFS then gives the following errors:
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.1 cic-gpfswhbe1n1
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.1 cic-gpfswhbe1n1
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.15 cic-gpfswhbe1s2
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.15 cic-gpfswhbe1s2
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.16 cic-gpfswhbe1s3
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.16 cic-gpfswhbe1s3
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.17 cic-gpfswhbe1s4
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.17 cic-gpfswhbe1s4
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.2 cic-gpfswhbe1n2
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.2 cic-gpfswhbe1n2
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.3 cic-gpfswhbe1n3
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.3 cic-gpfswhbe1n3
10.224.158.104 Fri May 18 13:02:51 2007: Lost membership in cluster enterprise.universe. Unmounting file systems.
10.224.158.104 Fri May 18 13:02:51 2007: Lost membership in cluster enterprise.universe. Unmounting file systems.
10.224.158.106 Fri May 18 13:02:53 2007: Close connection to 192.168.1.14 cic-gpfswhbe1s1
10.224.158.106 Fri May 18 13:02:53 2007: Close connection to 192.168.1.14 cic-gpfswhbe1s1
etc.
 
 
We found this in the logs of the switch:
May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1
May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:98:d1
May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering removed ports
May 18 11:02:51 topspin-120sc ib_sm.x[628]: %IB-6-INFO: Program switch port state to down, node=00:05:ad:00:00:0b:a2:cc, port= 1, due to non-responding CA
May 18 11:02:51 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port down - port=1/1, type=ib4xTXP
May 18 11:02:51 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: in portTblFindEntry() - IfIndex=65(1/1)
May 18 11:02:51 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: cannot find entry - IfIndex=65(1/1)
May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering new ports
May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1
May 18 11:02:53 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port up - port=1/1, type=ib4xTXP
May 18 11:02:54 topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:98:d1
May 18 11:02:54 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

We are sure this is gpfswhbe1s1, because the GUID in the traps above (00:05:ad:00:00:08:98:d1) is that node's node_guid + 1:
gpfswhbe1s1:~ # ibv_devinfo
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
hca_id:       mthca0
        fw_ver:                         5.1.0
        node_guid:                      0005:ad00:0008:98d0
        sys_image_guid:                 0005:ad00:0008:98d3
        vendor_id:                      0x05ad
        vendor_part_id:                 25218
        hw_ver:                         0xA0
        board_id:                       HCA.LionMini.A0
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               6
                        port_lmc:               0x00
                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               4
                        port_lmc:               0x00
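
The GUID comparison described above can be sketched in a few lines of userspace C. This is only an illustration: the helper names are hypothetical, and the rule "port GUID = node_guid + port number" is an assumption taken from the observation in this post, not something guaranteed for every HCA.

```c
#include <stdint.h>

/* Return the low 64 bits (the interface ID) of a 16-byte GID,
 * reading the bytes in network (big-endian) order. */
static uint64_t gid_interface_id(const uint8_t gid[16])
{
	uint64_t id = 0;

	for (int i = 8; i < 16; i++)
		id = (id << 8) | gid[i];
	return id;
}

/* Hypothetical check: does this GID belong to port `port` (1-based)
 * of the HCA with the given node_guid?  Assumes the port GUID is
 * node_guid + port, as observed on the node in this report. */
static int gid_matches_port(const uint8_t gid[16], uint64_t node_guid,
			    unsigned int port)
{
	return gid_interface_id(gid) == node_guid + port;
}
```

With the GID from the OUT_OF_SERVICE trap and the node_guid printed by ibv_devinfo, gid_matches_port() is true for port 1 and false for port 2.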

 
Does anyone have a clue what happened?
The error does not come up very often. So we can't reproduce it easily.
 
We believe the HCA on gpfswhbe1s1 caused the problem, but we can't really see it.
All help is appreciated!
 
Regards,
 
Koen
*** Disclaimer ***

Vlaamse Radio- en Televisieomroep
Auguste Reyerslaan 52, 1043 Brussel

nv van publiek recht
BTW BE 0244.142.664
RPR Brussel
http://www.vrt.be/disclaimer
 


From rdreier at cisco.com  Mon May 21 06:54:24 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 06:54:24 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id
	allocation
In-Reply-To: <20070521120633.GJ20400@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 21 May 2007 15:06:33 +0300")
References: <20070521120633.GJ20400@mellanox.co.il>
Message-ID: <ada4pm6ffjj.fsf@cisco.com>

 > IB/cm uses idr for local id allocations, with a running counter
 > as start_id. This fails to generate distinct ids in the scenario where
 > 1. An id is constantly created and destroyed
 > 2. A chunk of ids just beyond the current next_id value is occupied
 > 
 > This in turn leads to an increased chance of connection request being mis-detected
 > as a duplicate, sometimes for several retries, until next_id gets past
 > the block of allocated ids. This has been observed in practice.
 > 
 > As a fix, remember the last id allocated and start immediately above it.

OK I guess but this needs some explanation about why the impact is so
severe we want to merge it after rc2 is already out.

 > +			next_id = (unsigned)id + 1;

what happens when this wraps and becomes negative?

in fact the idr stuff all works with plain signed ints -- could
idr_get_new() ever give a negative id?  (too lazy to look at the
source right now)

 - R.


From jsquyres at cisco.com  Mon May 21 07:37:18 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 21 May 2007 07:37:18 -0700
Subject: [ofa-general] Today's OFED teleconf
Message-ID: <D320D1E2-A75D-4451-864F-78F0441D69B3@cisco.com>

Greetings all.

Today's OFED teleconference starts in approximately 90 minutes (9am  
US Pacific, 11am US central, noon US Eastern, 7pm Israel).

Code: 2102061
Dial in numbers:

US/Canada:  +1.866.432.9903
India:      +91.80.4103.3979
Israel:     +972.9.892.7026
Others:      http://cisco.com/en/US/about/doing_business/conferencing/index.html

-- 
Jeff Squyres
Cisco Systems


From jim.ryan at intel.com  Mon May 21 07:45:59 2007
From: jim.ryan at intel.com (Ryan, Jim)
Date: Mon, 21 May 2007 07:45:59 -0700
Subject: [ofa-general] RE: [ewg] Today's OFED teleconf
In-Reply-To: <D320D1E2-A75D-4451-864F-78F0441D69B3@cisco.com>
Message-ID: <55CE0347B98FCA468923E5FBC25CB4DCF6D55B@orsmsx413.amr.corp.intel.com>

Jeff, I think we have a pretty light agenda. I'll try to wrap the board
meeting up early so there's no conflict

Jim

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org
[mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Jeff Squyres
Sent: Monday, May 21, 2007 7:37 AM
To: OpenFabrics EWG; OpenFabrics General
Subject: [ewg] Today's OFED teleconf

Greetings all.

Today's OFED teleconference starts in approximately 90 minutes (9am  
US Pacific, 11am US central, noon US Eastern, 7pm Israel).

Code: 2102061
Dial in numbers:

US/Canada:  +1.866.432.9903
India:      +91.80.4103.3979
Israel:     +972.9.892.7026
Others:      http://cisco.com/en/US/about/doing_business/conferencing/index.html

-- 
Jeff Squyres
Cisco Systems

_______________________________________________
ewg mailing list
ewg at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


From tziporet at dev.mellanox.co.il  Mon May 21 07:49:34 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 21 May 2007 17:49:34 +0300
Subject: [ofa-general] Reminder: OFED 1.2 meeting today at 9am PST
Message-ID: <4651B17E.40200@mellanox.co.il>

Hi All,

Note: Woody will run the meeting today since I will not be able to attend.
Vlad will represent Mellanox in the meeting
I suggest we set a new target date for RC4: May 30

Tziporet

These are the bugs that should be reviewed:

567 	blocker 	jsquyres at cisco.com 	MPI does not work on RHEL5 ppc64
611 	critical 	swise at opengridcomputing.com 	cxgb3: passive side connection transition from streaming to RDMA is broken
577 	critical 	ishai at mellanox.co.il 	SRP multipath failover too slow (minutes, not seconds)
465 	critical 	mst at mellanox.co.il 	IPoIB HA fails after several hours of failovers
604 	critical 	mst at mellanox.co.il 	Oops running UDP traffic with IPoIB CM
608 	major 	monis at voltaire.com 	traffic fails to resume after SM failover with bonding interfaces
626 	major 	monis at voltaire.com 	wrong network /broadcast address set by ib-bond script
629 	major 	monis at voltaire.com 	ib-bonding: sometimes slow failover is noticed
632 	major 	mee at pathscale.com 	Intel MPI Benchmark fails on InfiniPath with 4 or more PPN



From tziporet at dev.mellanox.co.il  Mon May 21 07:52:03 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 21 May 2007 17:52:03 +0300
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C030AA260@OCBEXS01001.rto.be>
References: <D63C0BE2D613C543B6F3305502E9784C030AA260@OCBEXS01001.rto.be>
Message-ID: <4651B213.8040803@mellanox.co.il>

SEGERS Koen wrote:
> Hi,
>  
> We are running GPFS with SDP. For this we use OFED 1.2-rc1. The 
> machines are IBM x3755's and x3655's. The IB-switch is a SFS-7000P. 
> The HCA's are all "Mellanox Technologies MT25208 InfiniHost III Ex 
> (rev a0)".
>  
Can you try OFED 1.2-rc3?

Tziporet


From jim.ryan at intel.com  Mon May 21 07:51:01 2007
From: jim.ryan at intel.com (Ryan, Jim)
Date: Mon, 21 May 2007 07:51:01 -0700
Subject: [ofa-general] RE: [ewg] Today's OFED teleconf -- board bridge
	reminder added
In-Reply-To: <55CE0347B98FCA468923E5FBC25CB4DCF6D55B@orsmsx413.amr.corp.intel.com>
Message-ID: <55CE0347B98FCA468923E5FBC25CB4DCF6D561@orsmsx413.amr.corp.intel.com>

Monday, May 21, 2007, 08:00 AM US Pacific Time 
916-356-2663, Bridge: 1, Passcode: 3094290

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org
[mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Ryan, Jim
Sent: Monday, May 21, 2007 7:46 AM
To: Jeff Squyres; OpenFabrics EWG; OpenFabrics General
Subject: RE: [ewg] Today's OFED teleconf

Jeff, I think we have a pretty light agenda. I'll try to wrap the board
meeting up early so there's no conflict

Jim

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org
[mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Jeff Squyres
Sent: Monday, May 21, 2007 7:37 AM
To: OpenFabrics EWG; OpenFabrics General
Subject: [ewg] Today's OFED teleconf

Greetings all.

Today's OFED teleconference starts in approximately 90 minutes (9am  
US Pacific, 11am US central, noon US Eastern, 7pm Israel).

Code: 2102061
Dial in numbers:

US/Canada:  +1.866.432.9903
India:      +91.80.4103.3979
Israel:     +972.9.892.7026
Others:      http://cisco.com/en/US/about/doing_business/conferencing/index.html

-- 
Jeff Squyres
Cisco Systems

_______________________________________________
ewg mailing list
ewg at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


From mst at dev.mellanox.co.il  Mon May 21 07:54:36 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 17:54:36 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id
	allocation
In-Reply-To: <ada4pm6ffjj.fsf@cisco.com>
References: <20070521120633.GJ20400@mellanox.co.il> <ada4pm6ffjj.fsf@cisco.com>
Message-ID: <20070521145436.GA31097@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH for-2.6.22] IB/cm: improve local id allocation
> 
>  > IB/cm uses idr for local id allocations, with a running counter
>  > as start_id. This fails to generate distinct ids in the scenario where
>  > 1. An id is constantly created and destroyed
>  > 2. A chunk of ids just beyond the current next_id value is occupied
>  > 
>  > This in turn leads to an increased chance of connection request being mis-detected
>  > as a duplicate, sometimes for several retries, until next_id gets past
>  > the block of allocated ids. This has been observed in practice.
>  > 
>  > As a fix, remember the last id allocated and start immediately above it.
> 
> OK I guess but this needs some explanation about why the impact is so
> severe we want to merge it after rc2 is already out.

Well, it's a one-line change, so it seemed safe.
The impact currently is that the CM times out and we re-create
the connection; then either the application aborts, or this process repeats
until we get a good id, which can take a couple of minutes.
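
The failure mode can be reproduced with a toy allocator in userspace C. This is a sketch under stated assumptions, not the kernel code: toy_alloc_above() is a hypothetical stand-in that returns the lowest free id at or above the requested start (which is how idr_get_new_above() is assumed to behave here), and the occupied block 5..100 is made up for the demonstration.

```c
#include <stdbool.h>

#define TOY_MAX_IDS 256

/* Toy stand-in for an idr: used[i] is true while id i is allocated. */
struct toy_idr {
	bool used[TOY_MAX_IDS];
};

/* Allocate the lowest free id >= start, or return -1 if none is left. */
static int toy_alloc_above(struct toy_idr *t, int start)
{
	for (int id = start; id < TOY_MAX_IDS; id++) {
		if (!t->used[id]) {
			t->used[id] = true;
			return id;
		}
	}
	return -1;
}

static void toy_free(struct toy_idr *t, int id)
{
	t->used[id] = false;
}

/* Create and immediately destroy `rounds` ids, and count how often a
 * round hands out the same id as the round before it.
 * remember_last == false: old scheme, next_id++ on every attempt.
 * remember_last == true:  proposed scheme, next_id = id + 1. */
static int count_repeats(bool remember_last, int rounds)
{
	struct toy_idr t = { { false } };
	int next_id = 3, prev = -1, repeats = 0;

	/* A block of long-lived ids just beyond next_id. */
	for (int id = 5; id <= 100; id++)
		t.used[id] = true;

	for (int i = 0; i < rounds; i++) {
		int id = toy_alloc_above(&t, next_id);

		if (id < 0)
			break;
		if (id == prev)
			repeats++;
		prev = id;
		next_id = remember_last ? id + 1 : next_id + 1;
		toy_free(&t, id);
	}
	return repeats;
}
```

In this toy setup the old scheme keeps handing out id 101 until next_id has crawled past the occupied block, while remembering the last id gives a fresh id on every round.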

>  > +			next_id = (unsigned)id + 1;
> 
> what happens when this wraps and becomes negative?
> 
> in fact the idr stuff all works with plain signed ints -- could
> idr_get_new() ever give a negative id?  (too lazy to look at the
> source right now)

Good point, I'll check.

-- 
MST


From Koen.SEGERS at VRT.BE  Mon May 21 07:55:58 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Mon, 21 May 2007 16:55:58 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <4651B213.8040803@mellanox.co.il>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C03157D58@OCBEXS01001.rto.be>

Is this a common problem with RC1?

I can change it, but it will take a while...

I'll start building rpms anyway.

Greetz

Koen

-----Original Message-----
From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il]
Sent: Monday, May 21, 2007 16:52
To: SEGERS Koen
CC: general at lists.openfabrics.org
Subject: Re: [ofa-general] GPFS node loses IB-connection

SEGERS Koen wrote:
> Hi,
>  
> We are running GPFS with SDP. For this we use OFED 1.2-rc1. The 
> machines are IBM x3755's and x3655's. The IB-switch is a SFS-7000P. 
> The HCA's are all "Mellanox Technologies MT25208 InfiniHost III Ex 
> (rev a0)".
>  
Can you try OFED 1.2-rc3?

Tziporet

From xma at us.ibm.com  Mon May 21 08:41:10 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Mon, 21 May 2007 08:41:10 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D58@OCBEXS01001.rto.be>
Message-ID: <OF7447A28F.A7D1FDCB-ON872572E2.00561B4C-882572E2.005B9877@us.ibm.com>


Hello,

      What's the output of /var/log/messages when you hit this problem?

Shirley Ma

From rdreier at cisco.com  Mon May 21 09:01:28 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 09:01:28 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id
	allocation
In-Reply-To: <20070521120633.GJ20400@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 21 May 2007 15:06:33 +0300")
References: <20070521120633.GJ20400@mellanox.co.il>
Message-ID: <adazm3ydv3b.fsf@cisco.com>

lib/idr.c says it returns positive IDs always (actually the comments
say "in the range 0 ... 0x7fffffff").  So I guess we would want
something like:

	if (!ret)
		next_id = id == INT_MAX ? 0 : id + 1;

(current code has a similar bug, plus exposes undefined behavior of
signed overflow).
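
As a standalone illustration of the wrap rule above (a userspace sketch; the function name next_start_id is hypothetical, and the allocated id is assumed to lie in 0 ... INT_MAX as the lib/idr.c comment says):

```c
#include <limits.h>

/* Next search start after a successful allocation of `id`.
 * Wraps to 0 explicitly instead of computing INT_MAX + 1, which
 * would be signed overflow and therefore undefined behavior. */
static int next_start_id(int id)
{
	return id == INT_MAX ? 0 : id + 1;
}
```

The explicit comparison means the addition is never evaluated for INT_MAX, so the result always stays in the non-negative range.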

 - R.


From mst at dev.mellanox.co.il  Mon May 21 09:06:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 19:06:54 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id
	allocation
In-Reply-To: <ada4pm6ffjj.fsf@cisco.com>
References: <20070521120633.GJ20400@mellanox.co.il> <ada4pm6ffjj.fsf@cisco.com>
Message-ID: <20070521160654.GD31097@mellanox.co.il>

>  > +			next_id = (unsigned)id + 1;
> 
> what happens when this wraps and becomes negative?
> 
> in fact the idr stuff all works with plain signed ints -- could
> idr_get_new() ever give a negative id?  (too lazy to look at the
> source right now)

A quick look suggests the idr code is *really* not designed to
get a negative input; note that the old code has the wrap-around problem, too.
So, I think the following would be a better fix:

Hmm?

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>


diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index eff591d..5e77b01 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -306,7 +306,9 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 	do {
 		spin_lock_irqsave(&cm.lock, flags);
 		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
-					next_id++, &id);
+					next_id, &id);
+		if (!ret)
+			next_id = id == 0x7ffffff ? 0 : id + 1;
 		spin_unlock_irqrestore(&cm.lock, flags);
 	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );
 

-- 
MST


From mst at dev.mellanox.co.il  Mon May 21 09:11:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 19:11:30 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id
	allocation
In-Reply-To: <adazm3ydv3b.fsf@cisco.com>
References: <20070521120633.GJ20400@mellanox.co.il> <adazm3ydv3b.fsf@cisco.com>
Message-ID: <20070521161130.GE31097@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH for-2.6.22] IB/cm: improve local id allocation
> 
> lib/idr.c says it returns positive IDs always (actually the comments
> say "in the range 0 ... 0x7fffffff").  So I guess we would want
> something like:
> 
> 	if (!ret)
> 		next_id = id == INT_MAX ? 0 : id + 1;

True, except INT_MAX isn't defined in kernel headers I think,
so I just put 0x7fffffff there.

> (current code has a similar bug, plus exposes undefined behavior of
> signed overflow).


-- 
MST


From rdreier at cisco.com  Mon May 21 09:13:14 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 09:13:14 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id
	allocation
In-Reply-To: <20070521160654.GD31097@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 21 May 2007 19:06:54 +0300")
References: <20070521120633.GJ20400@mellanox.co.il>
	<ada4pm6ffjj.fsf@cisco.com> <20070521160654.GD31097@mellanox.co.il>
Message-ID: <adaveemdujp.fsf@cisco.com>

 > A quick look suggests the idr code is *really* not designed to
 > get a negative input; note that the old code has the wrap-around problem, too.
 > So, I think the following would be a better fix:

Yes, that's basically what I just proposed (although see below).  It
all looks pretty safe to me...  Sean, what do you think about this for
2.6.22?

 > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
 > index eff591d..5e77b01 100644
 > --- a/drivers/infiniband/core/cm.c
 > +++ b/drivers/infiniband/core/cm.c
 > @@ -306,7 +306,9 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 >  	do {
 >  		spin_lock_irqsave(&cm.lock, flags);
 >  		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
 > -					next_id++, &id);
 > +					next_id, &id);
 > +		if (!ret)
 > +			next_id = id == 0x7ffffff ? 0 : id + 1;

...except I used INT_MAX here, and indeed your patch only has 6 'f's
in that constant.  Actually, digging a little, I see that we should use
MAX_ID_MASK to be really correct.
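
To make the effect of the missing 'f' concrete, here is a small userspace sketch. The macro and function names are hypothetical; only the two constants come from the thread, and a 32-bit int is assumed.

```c
#include <limits.h>

/* The constant as it appears in the posted patch. */
#define PATCH_CONSTANT    0x7ffffff	/* 2^27 - 1 = 134217727 */
/* The constant that was intended. */
#define INTENDED_CONSTANT 0x7fffffff	/* 2^31 - 1 = 2147483647 */

/* The patch's wrap rule with the limit made a parameter. */
static int wrap_after(int id, int limit)
{
	return id == limit ? 0 : id + 1;
}
```

With the shorter constant the counter would restart at 0 after id 134217727, and an id of INT_MAX would still reach the id + 1 branch, so the overflow the patch is meant to avoid would remain possible.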


From rdreier at cisco.com  Mon May 21 09:14:03 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 09:14:03 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id
	allocation
In-Reply-To: <20070521161130.GE31097@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 21 May 2007 19:11:30 +0300")
References: <20070521120633.GJ20400@mellanox.co.il>
	<adazm3ydv3b.fsf@cisco.com> <20070521161130.GE31097@mellanox.co.il>
Message-ID: <adar6paduic.fsf@cisco.com>

 > True, except INT_MAX isn't defined in kernel headers I think,
 > so I just put 0x7fffffff there.

It doesn't really matter (see my other reply) but actually INT_MAX and
others are in <linux/kernel.h>

 - R.


From swise at opengridcomputing.com  Mon May 21 09:14:14 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 21 May 2007 09:14:14 -0700
Subject: [ofa-general] Reminder: OFED 1.2 meeting today at 9am PST
In-Reply-To: <4651B17E.40200@mellanox.co.il>
References: <4651B17E.40200@mellanox.co.il>
Message-ID: <4651C556.4020904@opengridcomputing.com>

I cannot attend this meeting.  Bug 611 has been closed today.  The fix 
went in last week.

Thanks,

Steve.


Tziporet Koren wrote:
> Hi All,
>
> Note: Woody will run the meeting today since I will not be able to attend.
> Vlad will represent Mellanox in the meeting
> I suggest we set a new target date for RC4: May 30
>
> Tziporet
>
> These are the bugs that should be reviewed:
>
> 567 	blocker 	jsquyres at cisco.com 	MPI does not work on RHEL5 ppc64
> 611 	critical 	swise at opengridcomputing.com 	cxgb3: passive side 
> connection transition from streaming to RDMA is broken
> 577 	critical 	ishai at mellanox.co.il 	SRP multipath failover too slow 
> (minutes, not seconds)
> 465 	critical 	mst at mellanox.co.il 	IPoIB HA fails after several hours 
> of failovers
> 604 	critical 	mst at mellanox.co.il 	Oops running UDP traffic with IPoIB CM
> 608 	major 	monis at voltaire.com 	traffic fails to resume after SM 
> failover with bonding interfaces
> 626 	major 	monis at voltaire.com 	wrong network /broadcast address set 
> by ib-bond script
> 629 	major 	monis at voltaire.com 	ib-bonding: sometimes slow failover is 
> noticed
> 632 	major 	mee at pathscale.com 	Intel MPI Benchmark fails on InfiniPath 
> with 4 or more PPN
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From parks at lanl.gov  Mon May 21 09:16:16 2007
From: parks at lanl.gov (parks)
Date: Mon, 21 May 2007 10:16:16 -0600
Subject: [ofa-general] Today's OFED teleconf
In-Reply-To: <D320D1E2-A75D-4451-864F-78F0441D69B3@cisco.com>
References: <D320D1E2-A75D-4451-864F-78F0441D69B3@cisco.com>
Message-ID: <7.0.1.0.2.20070521101557.02a49e50@lanl.gov>

On travel; will not be able to make it.

parks


At 08:37 AM 5/21/2007, Jeff Squyres wrote:
>Greetings all.
>
>Today's OFED teleconference starts in approximately 90 minutes (9am
>US Pacific, 11am US central, noon US Eastern, 7pm Israel).
>
>Code: 2102061
>Dial in numbers:
>
>US/Canada:  +1.866.432.9903
>India:      +91.80.4103.3979
>Israel:     +972.9.892.7026
>Others: 
>http://cisco.com/en/US/about/doing_business/conferencing/index.html
>
>--
>Jeff Squyres
>Cisco Systems
>
>_______________________________________________
>general mailing list
>general at lists.openfabrics.org
>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From mst at dev.mellanox.co.il  Mon May 21 09:23:37 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 19:23:37 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id
	allocation
In-Reply-To: <adaveemdujp.fsf@cisco.com>
References: <20070521120633.GJ20400@mellanox.co.il> <ada4pm6ffjj.fsf@cisco.com>
	<20070521160654.GD31097@mellanox.co.il> <adaveemdujp.fsf@cisco.com>
Message-ID: <20070521162336.GF31097@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH for-2.6.22] IB/cm: improve local id allocation
> 
>  > A quick look suggests the idr code is *really* not designed to
>  > get a negative input; note that the old code has the wrap-around problem, too.
>  > So, I think the following would be a better fix:
> 
> Yes, that's basically what I just proposed (although see below).  It
> all looks pretty safe to me...  Sean, what do you think about this for
> 2.6.22?
> 
>  > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
>  > index eff591d..5e77b01 100644
>  > --- a/drivers/infiniband/core/cm.c
>  > +++ b/drivers/infiniband/core/cm.c
>  > @@ -306,7 +306,9 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv)
>  >  	do {
>  >  		spin_lock_irqsave(&cm.lock, flags);
>  >  		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
>  > -					next_id++, &id);
>  > +					next_id, &id);
>  > +		if (!ret)
>  > +			next_id = id == 0x7ffffff ? 0 : id + 1;
> 
> ...except I used INT_MAX here, and indeed your patch only has 6 'f's
> in that constant.  Actually, digging a little, I see that we should use
> MAX_ID_MASK to be really correct.

And since it's a *mask*, we can do it this way if you like:

>  > +		if (!ret)
>  > +			next_id = ((unsigned)id + 1) & MAX_ID_MASK;

which might generate a bit less code.
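
A quick userspace check that the masked form and the ternary form agree, under one loud assumption: the mask is taken to be 0x7fffffff (all 31 non-sign bits of a 32-bit int). TOY_ID_MASK is a hypothetical stand-in; the kernel's MAX_ID_MASK should be checked in the idr header before relying on this equivalence.

```c
#include <limits.h>

/* Assumed mask value for this sketch only. */
#define TOY_ID_MASK 0x7fffffffu

/* Ternary form: explicit wrap at INT_MAX. */
static int next_id_ternary(int id)
{
	return id == INT_MAX ? 0 : id + 1;
}

/* Masked form: the addition is done in unsigned arithmetic, where
 * wrap-around is well defined, then the sign bit is cleared. */
static int next_id_masked(int id)
{
	return (int)(((unsigned int)id + 1u) & TOY_ID_MASK);
}
```

For any id in 0 ... INT_MAX both forms give the same result under that assumption; the masked form avoids the comparison and branch.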

-- 
MST


From mshefty at ichips.intel.com  Mon May 21 09:34:45 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 21 May 2007 09:34:45 -0700
Subject: [ofa-general] IB/cm: bug in stale connection detection logic?
In-Reply-To: <20070520134441.GI20649@mellanox.co.il>
References: <20070520134441.GI20649@mellanox.co.il>
Message-ID: <4651CA25.9050309@ichips.intel.com>


> 1. I see this in cm_match_req:
> 
>         timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
>         if (!timewait_info)
>                 timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
> 
>         if (timewait_info) {
>                 cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
>                                            timewait_info->work.remote_id);
>                 cm_cleanup_timewait(cm_id_priv->timewait_info);
>                 spin_unlock_irqrestore(&cm.lock, flags);
>                 if (cur_cm_id_priv) {
>                         cm_dup_req_handler(work, cur_cm_id_priv);
>                         cm_deref_id(cur_cm_id_priv);
>                 } else
>                         cm_issue_rej(work->port, work->mad_recv_wc,
>                                      IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
>                                      NULL, 0);
> 
> Note that cm_get_id is passed data from timewait_info and not from the request:
> thus, if the QPN in request matches QPN in an existing connection, this is
> mis-detected as a duplicate request even if the IDs do not match;
> thus, the request is dropped or "duplicate" reject is sent instead of
> a "stale connection" reject.
> 
> Am I missing something?

I think you may be right on the QPN check, so I'll look into it more. 
Note that a REQ doesn't carry the local ID, which is why cm_get_id 
doesn't use the IDs from the REQ.

> Suggestion:
> Why is an extra call to cm_get_id required to detect a duplicate?
> Shouldn't we just
> 
>         timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
> 	if (timewait_info) {
> 		/* handle duplicate */

This isn't necessarily a duplicate.  We need to check the state of the 
local connection endpoint (hence the extra call to cm_get_id).  Also if 
the remote CM lost its state information, it could re-use the remote ID 
for a new connection.

> 		return;
> 	}
> 
> 	timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
> 	if (timewait_info) {
> 		/* handle stale */

If the previous check fails, it does look like this should be stale.

> 		return;
> 	}
> 
> 	not a duplicate and not a stale connection
> 
> 2. Another question:
> 
> 	cm_dup_req_handler does this:
>        /* Quick state check to discard duplicate REQs. */
>         if (cm_id_priv->id.state == IB_CM_REQ_RCVD)
>                 return;
> 
> Why is this code here? IB_CM_REQ_RCVD is an ephemeral state,
> going to IB_CM_REP_SENT immediately.

The transition from REQ_RCVD to REP_SENT requires user intervention.  It 
is not immediate and can take several seconds to up to a minute 
depending on how quickly the user responds to connection requests.  For 
userspace apps, the timing depends on how quickly the user retrieves and 
processes CM events, which can take longer than the retry timeout.

> Why are duplicate REQs discarded? Should not REP be re-sent?
> See 12.9.6 COMMUNICATION ESTABLISHMENT - PASSIVE

A REP can only be re-sent if we're in the REP_SENT state, which is not 
the state being checked.  If the remote side has sent 3 REQs in the time 
that it takes to respond to the first REQ, it's inefficient to generate 
2 duplicate REPs when finally sending the first REP.
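
The policy described in the two answers above can be written down as a tiny decision function. This is a sketch of the reasoning only: the enum and function names are hypothetical, only the two states discussed here are modeled, and it is not the actual cm_dup_req_handler() code.

```c
/* Hypothetical subset of passive-side connection states. */
enum toy_cm_state {
	TOY_REQ_RCVD,	/* REQ delivered, user has not replied yet */
	TOY_REP_SENT,	/* user replied, REP is on the wire */
	TOY_OTHER	/* every other state; not modeled here */
};

enum toy_dup_action {
	TOY_DISCARD,	/* drop the duplicate REQ silently */
	TOY_RESEND_REP,	/* a REP exists, so it can be sent again */
	TOY_NOT_MODELED
};

/* What to do with a duplicate REQ, per the discussion above. */
static enum toy_dup_action on_duplicate_req(enum toy_cm_state state)
{
	switch (state) {
	case TOY_REQ_RCVD:
		/* No REP has been generated yet; answering each retry
		 * would only queue up redundant REPs for later. */
		return TOY_DISCARD;
	case TOY_REP_SENT:
		return TOY_RESEND_REP;
	default:
		return TOY_NOT_MODELED;
	}
}
```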

- Sean


From robert.j.woodruff at intel.com  Mon May 21 09:40:37 2007
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 21 May 2007 09:40:37 -0700
Subject: [ofa-general] OFED 1.2 meeting today at 9am PST - Meeting Minutes
In-Reply-To: <4651B17E.40200@mellanox.co.il>
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C025B6B95@orsmsx418.amr.corp.intel.com>

We discussed the May 30 date for RC4 and people were OK with that date.

We reviewed the outstanding bugs,

Status                        Bug  Severity  Owner                           Summary
Still open                    567  blocker   jsquyres at cisco.com             MPI does not work on RHEL5 ppc64
Fixed                         611  critical  swise at opengridcomputing.com    cxgb3: passive side
Still open                    577  critical  ishai at mellanox.co.il           SRP multipath failover too slow (minutes, not seconds)
Fixed                         465  critical  mst at mellanox.co.il             IPoIB HA fails after several hours of failovers
Still open                    604  critical  mst at mellanox.co.il             Oops running UDP traffic with IPoIB CM
Cannot reproduce              608  major     monis at voltaire.com             traffic fails to resume after SM failover with bonding
Still open                    626  major     monis at voltaire.com             wrong network /broadcast address set by ib-bond script
Still open                    629  major     monis at voltaire.com             ib-bonding: sometimes slow failover is noticed
May not fix for this release  632  major     mee at pathscale.com              Intel MPI Benchmark fails on InfiniPath with 4 or more PPN



From mshefty at ichips.intel.com  Mon May 21 09:48:37 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 21 May 2007 09:48:37 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id
	allocation
In-Reply-To: <adaveemdujp.fsf@cisco.com>
References: <20070521120633.GJ20400@mellanox.co.il>	<ada4pm6ffjj.fsf@cisco.com>
	<20070521160654.GD31097@mellanox.co.il> <adaveemdujp.fsf@cisco.com>
Message-ID: <4651CD65.7070303@ichips.intel.com>

> Yes, that's basically what I just proposed (although see below).  It
> all looks pretty safe to me...  Sean, what do you think about this for
> 2.6.22?
> 
>  > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
>  > index eff591d..5e77b01 100644
>  > --- a/drivers/infiniband/core/cm.c
>  > +++ b/drivers/infiniband/core/cm.c
>  > @@ -306,7 +306,9 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv)
>  >  	do {
>  >  		spin_lock_irqsave(&cm.lock, flags);
>  >  		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
>  > -					next_id++, &id);
>  > +					next_id, &id);
>  > +		if (!ret)
>  > +			next_id = id == 0x7ffffff ? 0 : id + 1;
> 
> ...except I used MAX_INT here, and indeed your patch only has 6 'f's
> in that constant.  Actually digging a little I see that we should use
> MAX_ID_MASK to be really correct.

Looks good.  Thanks

Acked-by: Sean Hefty <sean.hefty at intel.com>
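
For illustration, the wraparound arithmetic discussed above can be tried in
user space.  This is a sketch, not the kernel code: the MAX_ID_* macros are
assumed to follow the idr header of that time (a 31-bit id space when int is
32 bits wide), and both helper names are made up for this example.

```c
#include <assert.h>
#include <limits.h>

/* Assumption: modelled on the idr definitions of that era, where ids are
 * non-negative ints, i.e. a 31-bit space when int is 32 bits wide. */
#define MAX_ID_SHIFT (sizeof(int) * CHAR_BIT - 1)
#define MAX_ID_BIT   (1U << MAX_ID_SHIFT)
#define MAX_ID_MASK  (MAX_ID_BIT - 1)

/* Wrap as in the first version of the patch: the constant has only
 * seven hex digits (0x7ffffff), so the counter wraps at 2^27 - 1. */
static int next_id_short_constant(int id)
{
	return id == 0x7ffffff ? 0 : id + 1;
}

/* Wrap with the mask: the addition is done in unsigned arithmetic, so
 * INT_MAX + 1 is well defined and the mask folds it back to 0. */
static int next_id_masked(int id)
{
	return (int)(((unsigned)id + 1) & MAX_ID_MASK);
}
```

With the short constant the counter returns to 0 long before the id space is
exhausted, while the masked form only wraps after INT_MAX.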


From halr at voltaire.com  Mon May 21 10:26:33 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 May 2007 13:26:33 -0400
Subject: [ofa-general] [RFC] [PATCH 0/3] 2.6.23: basic support for IB
	routers
In-Reply-To: <000401c79999$dad2e9d0$b4d4180a@amr.corp.intel.com>
References: <000401c79999$dad2e9d0$b4d4180a@amr.corp.intel.com>
Message-ID: <1179768393.15940.8104.camel@hal.voltaire.com>

On Fri, 2007-05-18 at 18:14, Sean Hefty wrote:
> Re-sending - typo in mailing list name...
> 
> I'd like to get feedback about incorporating the following changes to support
> IB routers into 2.6.23.  The goal of the patches is to allow for IB router
> development and prototyping within the current framework of IBA.  The changes
> themselves are fairly minimal, but based on the following concepts:
> 
> * Routing data is maintained by the local SA.  No assumption is made regarding
>   how the SA obtains routing information.  The SA is only expected to respond
>   to cross subnet PR queries by providing a path to the local router.  This
>   matches the behavior in opensm.
> 
> * A ULP connecting to a remote subnet provides path information about both
>   subnets.  For now the implementation simply assumes that the properties of
>   the remote path match that of the local path.  This allows the active side
>   CM to properly format the CM REQ.
> 
> * If the SLID/DLID values in the CM REQ are set to the permissive LID, then
>   the passive side CM uses the SLID/DLID/SL values from the received CM REQ
>   LRH to configure the passive side QP.  This is done to meet C9-54 without
>   requiring communication with the remote SA, but I should note that this
>   behavior is non-compliant.

Should there be some conditionalization of any non IBA compliant code so
it is only turned on if someone really wants this ?

I presume this is to be replaced by the real code some time in the
future once the IBA spec for these router issues is decided.

-- Hal

> These changes were tested by establishing a connection and transferring data
> between two IB subnets connected by an Obsidian router.
> 
> These patches are also available in the ib_router branch of my rdma-dev.git
> tree.  The tree is based on 2.6.21, so include a couple of additional patches
> that were already pushed for 2.6.22.
> 
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>


From sean.hefty at intel.com  Mon May 21 10:50:05 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 21 May 2007 10:50:05 -0700
Subject: [ofa-general] [RFC] [PATCH 0/3] 2.6.23: basic support for
	IBrouters
In-Reply-To: <1179768393.15940.8104.camel@hal.voltaire.com>
Message-ID: <000201c79bd0$7751f070$14d0180a@amr.corp.intel.com>

>Should there be some conditionalization of any non IBA compliant code so
>it is only turned on if someone really wants this ?

That is doable.  AFAIK, the only non-compliance is using the permissive LIDs in
the CM REQ.  I don't believe this causes any interoperability issues with fully
compliant code.  It should just result in a rejected connection request.
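
As a rough sketch of the passive side behavior being discussed (not the code
from the patches): the structures and the function name below are
hypothetical, and requiring both REQ LIDs to be permissive is an assumption
of this example.  Only the permissive LID value 0xFFFF comes from the IBA.

```c
#include <assert.h>
#include <stdint.h>

#define LID_PERMISSIVE 0xFFFFu	/* IBA permissive LID */

/* Hypothetical, simplified views of the data involved. */
struct req_lids { uint16_t local_lid, remote_lid; };	/* as sent by the active side */
struct lrh_info { uint16_t slid, dlid; uint8_t sl; };	/* from the received REQ packet */
struct qp_path  { uint16_t slid, dlid; uint8_t sl; };	/* passive side QP setup */

/* Fill in the passive side path.  req_sl is the SL carried in the REQ. */
static struct qp_path passive_path(struct req_lids req, uint8_t req_sl,
				   struct lrh_info lrh)
{
	struct qp_path p;

	if (req.local_lid == LID_PERMISSIVE &&
	    req.remote_lid == LID_PERMISSIVE) {
		/* Reply along the reverse of the hop the REQ arrived on. */
		p.slid = lrh.dlid;
		p.dlid = lrh.slid;
		p.sl = lrh.sl;
	} else {
		/* Usual case: the sender's local LID is our destination. */
		p.slid = req.remote_lid;
		p.dlid = req.local_lid;
		p.sl = req_sl;
	}
	return p;
}
```

The conditionalization Hal asks about would amount to guarding the first
branch with an option of some kind.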

>I presume this is to be replaced by the real code some time in the
>future once the IBA spec for these router issues is decided.

Yes.  The changes to the RDMA CM and passive side IB CM will likely need to be
replaced (patches 2 & 3).  The active side IB CM changes (patch 1) may be okay
unless wording to the 1.2 spec changes.  (Patch 1 could also be used to support
non-reversible paths at some point, but more work may be needed.)

- Sean


From halr at voltaire.com  Mon May 21 10:52:11 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 May 2007 13:52:11 -0400
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<000101c7963a$3474ae00$49c9180a@amr.corp.intel.com>
	<309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com>
	<464B5C07.8040601@ichips.intel.com>
	<309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com>
	<1179398534.23882.67542.camel@hal.voltaire.com>
	<309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com>
	<1179483657.23882.158398.camel@hal.voltaire.com>
	<309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com>
Message-ID: <1179769930.15940.9823.camel@hal.voltaire.com>

On Mon, 2007-05-21 at 01:58, Devesh Sharma wrote:
> On 18 May 2007 06:21:05 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote:
> > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote:
> > > > > On 5/17/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > > > > > > But initially this will generate a packet for each path, while sys
> > > > > > > admin knows that path is there and he can hard-code the entries for
> > > > > > > it. Other thing is that why Admin will care about creating such record
> > > > > > > while SA is itself taking care, right?
> > > > > >
> > > > > > In your original message you asked about adding 'dummy entries' to the
> > > > > > cache.  I agree that pre-loading the cache can be useful.  What I still
> > > > > > am not understanding is the reasoning for adding 'dummy entries'.  By
> > > > > > 'dummy entries', I've been assuming that these are invalid path records,
> > > > > > but maybe that's not what you meant.
> > > > > Ok if "dummy entries" word as such has created confusion then I am
> > > > > sorry for that, But with that I mean that, those are valid path
> > > > > records which Administrator knows in advance and while loading the
> > > > > module,
> > > >
> > > > How does the admin know they are valid ?
> > > Depending on the initial application runs, some trusted PRs can be generated.
> >
> > What do initial application runs have to do with this ?
> My understanding is that, once the cluster is UP, and if between Node
> A and Node B there is only one path,

So this is a feature for such one path subnets. I wonder what percentage
of deployed subnets fits this case.

> then, SA query always going to return same values in PR.

If subnet topology is changed, these PRs might change. There are other
cases where they change too.

>  On this basis Initial application runs will generate PRs,

That's what confused me before (Applications don't generate PRs but
rather request them.) but I think I see what you mean now.

> these PRs can be saved in some file, and can be loaded
> when cache_module comes in.
> >
> > > >Are they somehow preconfigured at the SM ?
> > > I am not sure about SM has any such provision?
> >
> > Not that I'm aware of.
> Ok, So, currently no such support is there in SM?

I can speak definitively for OpenSM and there is no such support. As to
the vendor SMs, I don't think so but don't know for absolute certainty.
Someone can correct me if I'm wrong but I wouldn't assume no response
means correctness as some may not be listening nor want to respond as to
"value added" vendor specific features.

> > > Also not sure about the
> > > role of SM in path resolving. I mean once node has initiated SA query,
> > > whether SM has some database to reply SA or On the fly destination
> > > node is contacted to get the requested path record?
> >
> > SMs can either calculate the SA PRs on the fly based on the routing
> > algorithm in use and some other things or put them in a local database.
> > This is up to that SM.
> Ok
> >
> > Destination node is not contacted in the SA PR query process.
> >
> > > >Doesn't each SM have its own policy for generating valid PRs ?
> > > Ultimately path record is in Path_Record object format, and SA cache
> > > is going to store in a fixed manner, How generation policy matters?
> >
> > What if the local policy loaded does not agree with what the SM would
> > generate for a particular PR ? One then gets a local error which will
> > need to be tracked down. Not so easy IMO.
> SM policies in a subnet to generate PRs, changes dynamically? at run time?

The policy doesn't change dynamically but the data to be returned in the
SA PR response might.

> if Not then depending on the local SM policy static PR can be
> generated to load initially.

Just as one question related to this, how would link failures be handled
? There are others.

> > > CMIIW. Also I am assuming a homogeneous cluster where certain
> > > parameters can be assumed to be same always.
> >
> > and always in agreement with what the SM would return ? For example,
> yes
> > what happens when a link goes down and the end node is no longer
> > reachable ?
> If node is not reachable then, after first timeout of sa_cache, that
> entry will be removed from cache.

OK; that's another aspect to add into this feature. I don't think that
is currently done. I think there would need to be an API added to do
this.

-- Hal

> > > >are these from a live SM and just loaded "out of band" to
> > > bypass/preclude the SA PR >mechanism ?
> > > may be
> >
> > Even if they are, there is still the changes in the subnet issue.
> >
> > -- Hal
> >
> > > > -- Hal
> > > >
> > > > >  Admin is loading this info in the cache with user command.
> > > > > >
> > > > > > > Another point I want to know is,
> > > > > > > When local_sa_cache module will be inserted? After SM comes up or
> > > > > > > Before SM comes up?
> > > > > >
> > > > > > It can occur either way.  There is no restriction.  The cache responds
> > > > > > to port up and GID in/out of service events to update itself.
> > > > > Do you mean cache module will start building cache only after Port is UP?
> > > > > >
> > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on
> > > > > > > some node not on switch) then First Forced schedule_update() is
> > > > > > > wasted, and for the first application presence of cache is
> > > > > > > meaningless. Why not to keep cache effective right from the start?
> > > > > >
> > > > > > Pre-loading the cache with path records doesn't guarantee that those
> > > > > > paths are usable.  If the SM has not come up, then the path records will
> > > > > > be unusable until the SM configures the subnet, plus there's no
> > > > > > guarantee that the remote endpoints specified by the paths are running.
> > > > > You mean there is no guarantee that even if SM is UP and we have some
> > > > > hard coded entries of path record corresponding to some node X, we are
> > > > > not sure that node X has actually come up or not?  In that case
> > > > > actually that path resolving should fail if node has not come up, but
> > > > > with the hard coding still path will be resolved?
> > > > > >
> > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms
> > > > > > when booting a large cluster.
> > > > > that's true. Also cache will get valid entries only if network is
> > > > > configured by SM otherwise every node SA will, possibly, drop SA
> > > > > packets.
> > > > > >
> > > > > > - Sean
> > > > > >
> > > >
> > > >
> >
> >


From halr at voltaire.com  Mon May 21 11:00:11 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 May 2007 14:00:11 -0400
Subject: [ofa-general] Re: [PATCH] osm: fixing coredump in drop manager
In-Reply-To: <46518857.2060308@dev.mellanox.co.il>
References: <46518857.2060308@dev.mellanox.co.il>
Message-ID: <1179770406.15940.10338.camel@hal.voltaire.com>

Hi Yevgeny,

On Mon, 2007-05-21 at 07:53, Yevgeny Kliteynik wrote:
> Hi Hal.
> 
> This patch fixes a coredump in a drop manager when trying to clear
> uninitialized physical ports.
> It happens only in master (the code in this area is a bit different in ofed_1_2).
> 
> Please apply to master.
> Thanks.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---

Thanks. Applied (to master only).

-- Hal


From halr at voltaire.com  Mon May 21 11:17:13 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 May 2007 14:17:13 -0400
Subject: [ofa-general] Re: [PATCHv2] osm: up/dn optimization - improved
	ranking
In-Reply-To: <46503064.7010107@dev.mellanox.co.il>
References: <46503064.7010107@dev.mellanox.co.il>
Message-ID: <1179771431.15940.11448.camel@hal.voltaire.com>

Hi Yevgeny,

On Sun, 2007-05-20 at 07:26, Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> This patch optimizes fabric ranking similar to the fat-tree ranking.
> All the root switches are marked with rank and added to the BFS list,
> and only then ranking of rest of the fabric begins.
> This version of the patch is updated in accordance with Sasha's suggestions.
> 
> Please apply to master.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---

Nice work.

Thanks! Applied (to master only).
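
The ranking order described in the patch (rank all roots first, then walk the
rest of the fabric) is a breadth-first search with several starting points.
The following is a minimal sketch of that idea only, not the OpenSM
implementation; the adjacency matrix and all names are hypothetical.

```c
#include <assert.h>

#define MAX_SW   8
#define UNRANKED (-1)

/*
 * Rank switches by hop distance from the nearest root.
 * adj[i][j] != 0 means switches i and j are linked.
 */
static void rank_fabric(int n, int adj[MAX_SW][MAX_SW],
			const int *roots, int n_roots, int rank[MAX_SW])
{
	int queue[MAX_SW];	/* each switch is queued at most once */
	int head = 0, tail = 0;
	int i;

	for (i = 0; i < n; i++)
		rank[i] = UNRANKED;

	/* Step 1: mark every root with rank 0 and put it on the BFS list. */
	for (i = 0; i < n_roots; i++) {
		if (rank[roots[i]] == UNRANKED) {
			rank[roots[i]] = 0;
			queue[tail++] = roots[i];
		}
	}

	/* Step 2: only now rank the rest of the fabric. */
	while (head < tail) {
		int sw = queue[head++];

		for (i = 0; i < n; i++) {
			if (adj[sw][i] && rank[i] == UNRANKED) {
				rank[i] = rank[sw] + 1;
				queue[tail++] = i;
			}
		}
	}
}
```

Because every root is queued before any neighbor is visited, a switch that is
reachable from two roots gets the rank of the closer one in a single pass.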

-- Hal


From jlentini at netapp.com  Mon May 21 11:50:32 2007
From: jlentini at netapp.com (James Lentini)
Date: Mon, 21 May 2007 14:50:32 -0400 (EDT)
Subject: [ofa-general] Re: [IPoIB][RFC] remove redundant gid query
In-Reply-To: <adazm43jqkr.fsf@cisco.com>
References: <Pine.LNX.4.64.0705081427320.27590@jlentini-linux.nane.netapp.com>
	<adazm43jqkr.fsf@cisco.com>
Message-ID: <Pine.LNX.4.64.0705181357160.27590@jlentini-linux.nane.netapp.com>


On Thu, 17 May 2007, Roland Dreier wrote:

>  > Both ipoib_add_port() and ipoib_mcast_join_task() query the GID at 
>  > index 0 to setup the ipoib_dev_priv structure's local_gid and the 
>  > net_device structure's dev_addr. There does not appear to be a way for 
>  > ipoib_mcast_join_task() to be executed before ipoib_add_port() 
>  > completes. Therefore, the work done in ipoib_mcast_join_task() appears 
>  > to be redundant.
> 
> It does look like we're doing some work we don't need to do.  However
> ipoib_add_port() could run before an SM has brought up the local port,

The same could be true for ipoib_mcast_join_task()

These are both instances of the general problem that if the GID at 
index 0 changes, the IPoIB code is not automatically notified. Agree?

> so the GID prefix might change later.
> 
> I'm not sure what the best way to clean this up is.

As an aside: Why does ipoib_add_port() treat an error return from 
ib_query_gid() as fatal while ipoib_mcast_join_task() only emits a 
warning?

james


From Koen.SEGERS at VRT.BE  Mon May 21 11:50:31 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Mon, 21 May 2007 20:50:31 +0200
Subject: [ofa-general] GPFS node loses IB-connection
References: <OF7447A28F.A7D1FDCB-ON872572E2.00561B4C-882572E2.005B9877@us.ibm.com>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C030AA261@OCBEXS01001.rto.be>

The same as in dmesg.
 
The output for the failing node:
May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901:   Reason code 668 Failure Reason Lost membership in cluster enterprise.universe. Unmounting file systems.
May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901:   
May 18 13:03:36 gpfswhbe1s1 kernel: GPFS Deadman Switch timer [0] has expired; IOs in progress: 0
May 18 13:04:11 gpfswhbe1s1 kernel: Badness in do_exit at kernel/exit.c:807
May 18 13:04:11 gpfswhbe1s1 kernel: 
May 18 13:04:11 gpfswhbe1s1 kernel: Call Trace: <ffffffff80133370>{do_exit+80} <ffffffff80133c17>{sys_exit_group+0}
May 18 13:04:11 gpfswhbe1s1 kernel:        <ffffffff8010a7be>{system_call+126}
May 18 13:04:11 gpfswhbe1s1 kernel: Badness in do_exit at kernel/exit.c:807
May 18 13:04:11 gpfswhbe1s1 kernel: 
May 18 13:04:11 gpfswhbe1s1 kernel: Call Trace: <ffffffff80133370>{do_exit+80} <ffffffff80133c17>{sys_exit_group+0}
May 18 13:04:11 gpfswhbe1s1 kernel:        <ffffffff8010a7be>{system_call+126}
May 18 13:18:57 gpfswhbe1s1 sshd[15090]: Accepted publickey for root from 192.168.1.1 port 52281 ssh2
May 18 13:25:12 gpfswhbe1s1 syslog-ng[3705]: STATS: dropped 0
 
Today we also did some tests with iperf using sdp. The tests worked fine, as long as we didn't use the parallel option (-P <number>). This option starts multiple client threads to connect to the server. As soon as we started the command, the interface died.
 
I found it very strange. Didn't anyone get this problem? Is it still a problem in RC3?
 
Tomorrow we will do more tests to pinpoint the problem even further.
We will also build RPMS for the RC3. Hopefully this helps.
 
Regards,
 
Koen

________________________________

From: Shirley Ma [mailto:xma at us.ibm.com]
Sent: Mon 21/05/2007 17:41
To: SEGERS Koen
CC: general at lists.openfabrics.org; general-bounces at lists.openfabrics.org; Tziporet Koren
Subject: RE: [ofa-general] GPFS node loses IB-connection


Hello,

What's the output of /var/log/messages when you hit this problem?

Shirley Ma


From mshefty at ichips.intel.com  Mon May 21 12:37:34 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 21 May 2007 12:37:34 -0700
Subject: [ofa-general] IB/cm: bug in stale connection detection logic?
In-Reply-To: <20070520134441.GI20649@mellanox.co.il>
References: <20070520134441.GI20649@mellanox.co.il>
Message-ID: <4651F4FE.3090307@ichips.intel.com>

> Why is an extra call to cm_get_id required to detect a duplicate?
> Shouldn't we just
> 
>         timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
> 	if (timewait_info) {
> 		/* handle duplicate */
> 		return;
> 	}
> 
> 	timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
> 	if (timewait_info) {
> 		/* handle stale */
> 		return;
> 	}
> 
> 	not a duplicate and not a stale connection

After looking at this more, I think we want something structured closer 
to what's listed above, with the duplicate handling enhanced to check 
that the QPN in the potential duplicate REQ matches what's already 
associated with the remote ID.
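
A user-space sketch of that structure, for discussion only.  The table is a
hypothetical stand-in for the CM's timewait_info tracking (the real lookups
also involve the remote CA GUID), and treating "same remote ID, different
QPN" as stale is an assumption of this example that the thread does not
spell out.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CONN 16

enum req_verdict { REQ_NEW, REQ_DUPLICATE, REQ_STALE };

/* Hypothetical stand-in for the entries the CM keeps per connection. */
struct conn { uint32_t remote_id, remote_qpn; };

static struct conn table[MAX_CONN];
static int n_conn;

static enum req_verdict classify_req(uint32_t remote_id, uint32_t remote_qpn)
{
	int i;

	/* First lookup: is the remote communication ID already known? */
	for (i = 0; i < n_conn; i++) {
		if (table[i].remote_id == remote_id)
			/* Same ID and same QPN: a retransmitted REQ.
			 * Same ID, different QPN: assumed stale here. */
			return table[i].remote_qpn == remote_qpn ?
				REQ_DUPLICATE : REQ_STALE;
	}

	/* Second lookup: a new ID that reuses a QPN we still track. */
	for (i = 0; i < n_conn; i++) {
		if (table[i].remote_qpn == remote_qpn)
			return REQ_STALE;
	}

	/* Neither a duplicate nor stale: remember it if there is room. */
	if (n_conn < MAX_CONN) {
		table[n_conn].remote_id = remote_id;
		table[n_conn].remote_qpn = remote_qpn;
		n_conn++;
	}
	return REQ_NEW;
}
```

The point of the ordering is that a stale REQ is recognized on arrival and
can be rejected at once instead of being left to time out.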

Did you hit an actual problem with the current code?  It seems like 
the only issue is that a possible stale request would time out, rather 
than be immediately rejected.  If so, I will queue up a patch for 2.6.23.

- Sean


From rdreier at cisco.com  Mon May 21 13:29:20 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 13:29:20 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipoib: fix error message
In-Reply-To: <20070518131254.GJ4708@mellanox.co.il> (Michael S. Tsirkin's
	message of "Fri, 18 May 2007 16:12:54 +0300")
References: <20070518131254.GJ4708@mellanox.co.il>
Message-ID: <adaabvxex9b.fsf@cisco.com>

thanks applied


From rdreier at cisco.com  Mon May 21 13:30:39 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 13:30:39 -0700
Subject: [ofa-general] Re: [PATCH 1/2] libmlx4: pass more data from user to
	kernel
In-Reply-To: <1179387187.25749.61.camel@mtls03> (Eli Cohen's message of "Thu,
	17 May 2007 10:32:37 +0300")
References: <1179387187.25749.61.camel@mtls03>
Message-ID: <ada646lex74.fsf@cisco.com>

thanks, I applied a new version of this with my changes to the ABI,
and also I added code to libmlx4 so it calculates max_inline_data etc
correctly.

 - R.


From rdreier at cisco.com  Mon May 21 13:35:57 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 13:35:57 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak
In-Reply-To: <20070521120459.GI20400@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 21 May 2007 15:04:59 +0300")
References: <20070521120459.GI20400@mellanox.co.il>
Message-ID: <ada1wh9ewya.fsf@cisco.com>

OK, I crossed my fingers and merged this for 2.6.22


From mst at dev.mellanox.co.il  Mon May 21 13:40:41 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 23:40:41 +0300
Subject: [ofa-general] IB/cm: bug in stale connection detection logic?
In-Reply-To: <4651F4FE.3090307@ichips.intel.com>
References: <20070520134441.GI20649@mellanox.co.il>
	<4651F4FE.3090307@ichips.intel.com>
Message-ID: <20070521204041.GG31097@mellanox.co.il>

> Quoting Sean Hefty <mshefty at ichips.intel.com>:
> Subject: Re: [ofa-general] IB/cm: bug in stale connection detection logic?
> 
> >Why is an extra call to cm_get_id required to detect a duplicate?
> >Shouldn't we just
> >
> >     timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
> >	if (timewait_info) {
> >		/* handle duplicate */
> >		return;
> >	}
> >
> >	timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
> >	if (timewait_info) {
> >		/* handle stale */
> >		return;
> >	}
> >
> >	not a duplicate and not a stale connection
> 
> After looking at this more, I think we want something structured closer 
> to what's listed above, with the duplicate handling enhanced to check 
> that the QPN in the potential duplicate REQ matches what's already 
> associated with the remote ID.

Yes, that's what I thought too.

> Did you hit an actual problem with the current code?  It seems like 
> the only issue is that a possible stale request would time out, rather 
> than be immediately rejected.  If so, I will queue up a patch for 2.6.23.

Exactly. This is a serious problem for IPoIB CM since packet drop rates
and recovery times go up radically: sockets get closed, etc.
With a reject we would just retry connecting on the next packet.

Could you please post a patch? Let's discuss whether it's appropriate
for 2.6.22 separately.

-- 
MST


From rdreier at cisco.com  Mon May 21 13:45:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 13:45:12 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id
	allocation
In-Reply-To: <20070521162336.GF31097@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 21 May 2007 19:23:37 +0300")
References: <20070521120633.GJ20400@mellanox.co.il>
	<ada4pm6ffjj.fsf@cisco.com> <20070521160654.GD31097@mellanox.co.il>
	<adaveemdujp.fsf@cisco.com> <20070521162336.GF31097@mellanox.co.il>
Message-ID: <adawsz1dhyf.fsf@cisco.com>

 > And since it's a *mask*, we can do it this way if you like:
 > 
 > >  > +		if (!ret)
 > >  > +			next_id = ((unsigned)id + 1) & MAX_ID_MASK;
 > 
 > which might generate a bit less code.

Good point.  In fact it is 8 bytes smaller for x86-64 at least, so
this is what I just merged:

commit 9f81036c54ed1f860d2807c5a6aa4f2b30c21204
Author: Michael S. Tsirkin <mst at dev.mellanox.co.il>
Date:   Mon May 21 19:06:54 2007 +0300

    IB/cm: Improve local id allocation
    
    The IB CM uses an idr for local id allocations, with a running counter
    as start_id.  This fails to generate distinct ids if
    
    1. An id is constantly created and destroyed
    2. A chunk of ids just beyond the current next_id value is occupied
    
    This in turn leads to an increased chance of connection request being
    mis-detected as a duplicate, sometimes for several retries, until
    next_id gets past the block of allocated ids. This has been observed
    in practice.
    
    As a fix, remember the last id allocated and start immediately above it.
    This also fixes a problem with the old code, where next_id might
    overflow and become negative.
    
    Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
    Acked-by: Sean Hefty <sean.hefty at intel.com>
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index eff591d..e840434 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -306,7 +306,9 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 	do {
 		spin_lock_irqsave(&cm.lock, flags);
 		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
-					next_id++, &id);
+					next_id, &id);
+		if (!ret)
+			next_id = ((unsigned) id + 1) & MAX_ID_MASK;
 		spin_unlock_irqrestore(&cm.lock, flags);
 	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );
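
The failure mode described in the commit message can be reproduced with a toy
model.  This is a sketch under assumptions, not the kernel's idr:
get_new_above() is a hypothetical stand-in that returns the lowest free id at
or above its argument, and the id space is shrunk to 8 entries so the effect
is easy to see.

```c
#include <assert.h>
#include <stdbool.h>

#define ID_SPACE 8		/* tiny id space for the demonstration */
#define ID_MASK  (ID_SPACE - 1)

static bool used[ID_SPACE];

/* Toy model of "lowest free id at or above start"; -1 if none. */
static int get_new_above(int start)
{
	int id;

	for (id = start; id < ID_SPACE; id++) {
		if (!used[id]) {
			used[id] = true;
			return id;
		}
	}
	return -1;
}

static void release_id(int id)
{
	used[id] = false;
}

/* Old scheme: a running counter that advances by one per call. */
static int old_next;
static int alloc_old(void)
{
	return get_new_above(old_next++);
}

/* New scheme: start right above the id that was handed out last. */
static int new_next;
static int alloc_new(void)
{
	int id = get_new_above(new_next);

	if (id >= 0)
		new_next = (id + 1) & ID_MASK;
	return id;
}
```

If ids 0 to 3 are occupied and a single id is created and destroyed in a
loop, alloc_old() keeps returning 4 until its counter has crawled past the
occupied block, while alloc_new() moves on to 5 on the second call.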
 

From rdreier at cisco.com  Mon May 21 13:48:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 13:48:12 -0700
Subject: [ofa-general] Re: [IPoIB][RFC] remove redundant gid query
In-Reply-To: <Pine.LNX.4.64.0705181357160.27590@jlentini-linux.nane.netapp.com>
	(James Lentini's message of "Mon,
	21 May 2007 14:50:32 -0400 (EDT)")
References: <Pine.LNX.4.64.0705081427320.27590@jlentini-linux.nane.netapp.com>
	<adazm43jqkr.fsf@cisco.com>
	<Pine.LNX.4.64.0705181357160.27590@jlentini-linux.nane.netapp.com>
Message-ID: <adasl9pdhtf.fsf@cisco.com>

 > > It does look like we're doing some work we don't need to do.  However
 > > ipoib_add_port() could run before an SM has brought up the local port,
 > 
 > The same could be true for ipoib_mcast_join_task()
 > 
 > These are both instances of the general problem that if the GID at 
 > index 0 changes, the IPoIB code is not automatically notified. Agree?

Yes, although what is there now should be semi-OK: a multicast join
can't succeed until the port is up, so ipoib should eventually get the
right GID.  And I would argue that an SM that changes a port's GID
prefix without at least generating a client reregister event is broken.

 > > so the GID prefix might change later.
 > > 
 > > I'm not sure what the best way to clean this up is.
 > 
 > As an aside: Why does ipoib_add_port() treat an error return from 
 > ib_query_gid() as fatal while ipoib_mcast_join_task() only emits a 
 > warning?

I guess because it's much easier to bail out of ipoib_add_port() than
it is to do something intelligent in ipoib_mcast_join_task().


From rdreier at cisco.com  Mon May 21 13:51:43 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 13:51:43 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <adaodkddhnk.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will get the following post 2.6.22-rc2 fixes.  (This batch is
bigger than I would like, but I think it's all legitimately post-rc2
material: we've had some fixes for fairly serious problems cooking for
a while, and those fixes involve largish patches.  The rest is either
trivial stuff or fixes for the just-merged mlx4 driver)

Ali Ayoub (1):
      IB/mthca: Fix use-after-free on device restart

Eli Cohen (3):
      IB/core: Free umem when mm is already gone
      IB/mlx4: Fix check of max_qp_dest_rdma in modify QP
      IB/mlx4: Pass send queue sizes from userspace to kernel

Hoang-Nam Nguyen (1):
      IB/ehca: Return proper error code if register_mr fails

Michael S. Tsirkin (5):
      IB/mthca: Fix RESET to ERROR transition
      IB/mlx4: Fix RESET to RESET and RESET to ERROR transitions
      IB/ipoib: Fix typos in error messages
      IPoIB/cm: Fix SRQ WR leak
      IB/cm: Improve local id allocation

Roland Dreier (6):
      IB/ipath: Fix potential deadlock with multicast spinlocks
      IB/core: Use start_port() and end_port()
      IB/mlx4: Set GRH:HopLimit when sending globally routed MADs
      mlx4_core: Fix array overrun in dump_dev_cap_flags()
      IB/mlx4: Fix check of opcode in mlx4_ib_post_send()
      IB/mlx4: Check if SRQ is full when posting receive

Rolf Manderscheid (1):
      IB/mthca: Set GRH:HopLimit when building MLX headers

Yosef Etigin (2):
      IB/core: Add helpers for uncached GID and P_Key searches
      IPoIB: Handle P_Key table reordering

 drivers/infiniband/core/cm.c                    |    4 +-
 drivers/infiniband/core/device.c                |  135 ++++++++++++++-
 drivers/infiniband/core/umem.c                  |    4 +-
 drivers/infiniband/hw/ehca/ehca_mrmw.c          |    7 +-
 drivers/infiniband/hw/ipath/ipath_verbs_mcast.c |   16 +-
 drivers/infiniband/hw/mlx4/qp.c                 |  181 ++++++++++++++------
 drivers/infiniband/hw/mlx4/srq.c                |    6 +
 drivers/infiniband/hw/mlx4/user.h               |    5 +-
 drivers/infiniband/hw/mthca/mthca_av.c          |    1 +
 drivers/infiniband/hw/mthca/mthca_main.c        |    4 +-
 drivers/infiniband/hw/mthca/mthca_qp.c          |  158 +++++++++++-------
 drivers/infiniband/ulp/ipoib/ipoib.h            |   49 +++++-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c         |  203 +++++++++++++++++++----
 drivers/infiniband/ulp/ipoib/ipoib_ib.c         |   87 ++++++++--
 drivers/infiniband/ulp/ipoib/ipoib_main.c       |    7 +-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c  |    2 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c      |   40 ++---
 drivers/net/mlx4/fw.c                           |    2 +-
 include/rdma/ib_verbs.h                         |    8 +
 19 files changed, 697 insertions(+), 222 deletions(-)


From mshefty at ichips.intel.com  Mon May 21 14:20:29 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 21 May 2007 14:20:29 -0700
Subject: [ofa-general] IB/cm: bug in stale connection detection logic?
In-Reply-To: <20070521204041.GG31097@mellanox.co.il>
References: <20070520134441.GI20649@mellanox.co.il>	<4651F4FE.3090307@ichips.intel.com>
	<20070521204041.GG31097@mellanox.co.il>
Message-ID: <46520D1D.3000001@ichips.intel.com>

> Could you please post a patch? Let's discuss whether it's appropriate
> for 2.6.22 separately.

I mentioned 2.6.23 because it affects when I have to generate the patch.  :)

I will try to get to this tomorrow then.

- Sean


From venkatesh.babu at 3leafnetworks.com  Mon May 21 15:00:31 2007
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Mon, 21 May 2007 15:00:31 -0700
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them
	claims master
Message-ID: <4652167F.9040709@3leafnetworks.com>


Configuration:
 - 4 nodes in the IB network with two nodes running OpenSMs.
 - Each node has MT25218 InfiniHostEx Mellanox with two IB ports and 
with firmware version 5.2.0
 - All node's IB port 1 is connected to IB Switch1, say subnet1
 - All node's IB port 2 is connected to IB Switch2, say subnet2
 - vortex3l-83 has two opensm's for each subnet with priority 0
 - vortex3l-84 has two opensm's for each subnet with priority 13

Problem:

The problem is that the opensm instances on both machines are in the Standby state and
neither of them claims mastership, even though they have different priorities (0 and 13).
Most of the time this configuration works fine, but occasionally it runs into this problem.
It is hard to reproduce.
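
For reference, the election rule that should have picked a master here can
be written down in a few lines.  This sketch assumes the IBA rule that the
higher priority wins and that equal priorities are broken by the lower GUID;
the struct and function names are hypothetical and this is not OpenSM code.

```c
#include <assert.h>
#include <stdint.h>

struct sm_info {
	uint64_t guid;
	uint8_t priority;
};

/* Returns nonzero if SM a should be master over SM b: the higher
 * priority wins, and on equal priority the lower GUID wins. */
static int sm_wins(struct sm_info a, struct sm_info b)
{
	if (a.priority != b.priority)
		return a.priority > b.priority;
	return a.guid < b.guid;
}
```

With the port GUIDs and priorities from the ps output in this report, the
instance on vortex3l-84 (priority 13) would be expected to win on each subnet.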

I tried to set the mastership of the opensm but it didn't work.
[root at vortex3l-83 ~]# sminfo -s3
 sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5043 priority 0
state 2 SMINFO_STANDBY

After a couple of minutes 
[root at vortex3l-83 ~]# sminfo
sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5938 priority 0 state 3 SMINFO_MASTER


Data:

[root at vortex3l-83 ~]# sminfo
sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4937 priority 0 state 2 SMINFO_STANDBY
[root at vortex3l-83 ~]# ps -aux | grep open
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
root      5237  0.0  0.0 92848 1692 ?        Sl   00:39   0:00 /usr/bin/opensm -g 0x005045014a3a0001 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm1.log
root      5250  0.0  0.0 92848 1700 ?        Sl   00:39   0:00 /usr/bin/opensm -g 0x005045014a3a0002 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm2.log
root      8356  0.0  0.0 51064  708 pts/0    S+   13:40   0:00 grep open


[root at vortex3l-84 ~]# sminfo
sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4939 priority 0 state 2 SMINFO_STANDBY
[root at vortex3l-84 ~]# ps -aux | grep open
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
root      5871  0.0  0.0 92848 1560 ?        Sl   00:40   0:00 /usr/bin/opensm -g 0x005045014a2e0001 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm1.log
root      5884  0.0  0.0 92848 1568 ?        Sl   00:40   0:00 /usr/bin/opensm -g 0x005045014a2e0002 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm2.log
root      8845  0.0  0.0 51084  668 pts/0    S+   13:40   0:00 grep open


But ibv_devinfo on vortex3l-83 shows that both ports are active and that sm_lid and
port_lid are the same, which would indicate it is the master. This looks like stale information.

[root at vortex3l-83 ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.2.0
        node_guid:                      0050:4501:4a3a:0000
        sys_image_guid:                 0050:4501:4a3a:0003
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0xA0
        board_id:                       ARM0020000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 6
                        port_lid:               6
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00

[root at vortex3l-84 ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.2.0
        node_guid:                      0050:4501:4a2e:0000
        sys_image_guid:                 0050:4501:4a2e:0003
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0xA0
        board_id:                       ARM0020000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 6
                        port_lid:               7
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               2
                        port_lmc:               0x00
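
The sm_lid versus port_lid comparison made above can be automated. This is a Python sketch that assumes the `ibv_devinfo` text layout shown in this message; `ports_hosting_sm` is a hypothetical helper name. As noted, a match only shows what the port was last told, so it can be stale.

```python
import re

FIELD_RE = re.compile(r"\s*(\w+):\s+(\S+)")


def ports_hosting_sm(devinfo_text):
    """Return the port numbers whose port_lid equals their sm_lid.

    A match suggests the master SM the port knows about is on that
    same port. It reflects the last PortInfo written to the port and
    may therefore be out of date.
    """
    per_port = {}
    current = None
    for line in devinfo_text.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "port":
            current = int(value)
            per_port[current] = {}
        elif current is not None and key in ("sm_lid", "port_lid"):
            per_port[current][key] = int(value)
    return [
        port
        for port, f in per_port.items()
        if "sm_lid" in f and f["sm_lid"] == f.get("port_lid")
    ]


if __name__ == "__main__":
    import subprocess

    # Requires ibv_devinfo to be installed on the node being checked.
    text = subprocess.run(
        ["ibv_devinfo"], capture_output=True, text=True, check=True
    ).stdout
    print(ports_hosting_sm(text))
```

Applied to the two listings above, this would report ports 1 and 2 on vortex3l-83 and no ports on vortex3l-84.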


And also in /var/log/opensm[1/2].log I see the following error messages -

May 21 00:40:28 250119 [95A9A160] -> OpenSM Rev:openib-2.0.5 OpenIB svn 4954M
May 21 00:40:28 484648 [95A9A160] -> osm_vendor_bind: Binding to port 0x5045014a2e0001
May 21 00:40:28 487418 [95A9A160] -> osm_vendor_bind: Binding to port 0x5045014a2e0001
May 21 00:40:29 292689 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x100000125f) -- dropping
May 21 00:40:29 292728 [45007960] -> umad_receiver: ERR 5411: DR SMP
May 21 00:40:29 292741 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
May 21 00:40:29 292818 [45007960] -> SMP dump:
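
To compare the errors across the four opensm logs, the ERR lines can be pulled out mechanically. The sketch below is Python and assumes only the log line layout visible in the excerpt above; `extract_errors` is an illustrative name, and the error codes are kept as strings because their numeric base is not stated here.

```python
import re

# Layout assumed from the excerpt: "... [thread] -> function: ERR code: message"
ERR_RE = re.compile(
    r"\[(?P<thread>[0-9A-Fa-f]+)\] -> (?P<func>\w+): "
    r"ERR (?P<code>[0-9A-Fa-f]+): (?P<msg>.*)"
)


def extract_errors(log_text):
    """Return one dict per ERR line found in an OpenSM log excerpt."""
    errors = []
    for line in log_text.splitlines():
        match = ERR_RE.search(line)
        if match:
            errors.append(match.groupdict())
    return errors


if __name__ == "__main__":
    import sys

    # Usage: python extract_errors.py /var/log/opensm1.log
    with open(sys.argv[1]) as log_file:
        for err in extract_errors(log_file.read()):
            print(err["code"], err["func"], err["msg"])
```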


I found that the port_xmit_discards counter is 1 on both ports of both vortex
boxes. The other error counters seem to be zero. It looks like some packets
have been transmitted and received on both machines.

[root at vortex3l-83 ~]# cat /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards
1
[root at vortex3l-83 ~]# cat /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards
1

[root at vortex3l-84 ~]# cat /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards
1
[root at vortex3l-84 ~]# cat /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards
1

[root at sqasmd ~]# cat /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards
1
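
Reading the same counter on every port can be scripted instead of running `cat` per file. This Python sketch walks the sysfs layout used in the commands above; the function name `read_port_counter` and its `root` parameter (there so it can be pointed at a copy of the tree) are assumptions made for illustration.

```python
from pathlib import Path


def read_port_counter(counter="port_xmit_discards", root="/sys/class/infiniband"):
    """Return {(device, port): value} for one counter on every local port."""
    values = {}
    pattern = f"*/ports/*/counters/{counter}"
    for path in sorted(Path(root).glob(pattern)):
        # Path ends in <device>/ports/<port>/counters/<counter>.
        device = path.parts[-5]
        port = int(path.parts[-3])
        values[(device, port)] = int(path.read_text().strip())
    return values


if __name__ == "__main__":
    for (device, port), value in read_port_counter().items():
        print(f"{device} port {port}: port_xmit_discards={value}")
```

Sampling this periodically with a timestamp would show whether the counter moves from 0 to 1 around the time the mastership problem appears.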


Link speed seems to be set to 10 Gb/sec (4X) on all machines.


I have the opensm logs and gdb output for all the opensm instances. If you want, I
can send them to you.
I am attaching one sample gdb output with stack traces of all threads.

[root at vortex3l-83 ~]# gdb /usr/bin/opensm 5237
GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Attaching to program: /usr/bin/opensm, process 5237
Reading symbols from /usr/lib/libopensm.so.1...done.
Loaded symbols for /usr/lib/libopensm.so.1
Reading symbols from /usr/lib/libosmcomp.so.1...done.
Loaded symbols for /usr/lib/libosmcomp.so.1
Reading symbols from /lib64/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 182904123744 (LWP 5237)]
[New Thread 1157658976 (LWP 5267)]
[New Thread 1147169120 (LWP 5266)]
[New Thread 1136679264 (LWP 5265)]
[New Thread 1126189408 (LWP 5264)]
[New Thread 1115699552 (LWP 5263)]
[New Thread 1105209696 (LWP 5262)]
[New Thread 1094719840 (LWP 5261)]
[New Thread 1084229984 (LWP 5253)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /usr/lib/libosmvendor.so.2...done.
Loaded symbols for /usr/lib/libosmvendor.so.2
Reading symbols from /usr/lib/libibcommon.so.1...done.
Loaded symbols for /usr/lib/libibcommon.so.1
Reading symbols from /usr/lib/libibumad.so.1...done.
Loaded symbols for /usr/lib/libibumad.so.1
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
(gdb) info threads
  9 Thread 1084229984 (LWP 5253)  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  8 Thread 1094719840 (LWP 5261)  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  7 Thread 1105209696 (LWP 5262)  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  6 Thread 1115699552 (LWP 5263)  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  5 Thread 1126189408 (LWP 5264)  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  4 Thread 1136679264 (LWP 5265)  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  3 Thread 1147169120 (LWP 5266)  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  2 Thread 1157658976 (LWP 5267)  0x0000002a95d7fd22 in poll ()
   from /lib64/tls/libc.so.6
  1 Thread 182904123744 (LWP 5237)  0x0000002a95d51d65 in __nanosleep_nocancel
    () from /lib64/tls/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 182904123744 (LWP 5237))]#0  0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1  0x0000002a95d82368 in usleep () from /lib64/tls/libc.so.6
#2  0x0000002a9578a05e in cl_thread_suspend (pause_ms=10000) at cl_thread.c:125
#3  0x0000000000405ba1 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 1157658976 (LWP 5267))]#0  0x0000002a95d7fd22 in poll () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x0000002a95d7fd22 in poll () from /lib64/tls/libc.so.6
#1  0x0000002a95bbb94d in dev_poll (fd=Variable "fd" is not available.
) at src/umad.c:775
#2  0x0000002a95bbba6d in umad_recv (portid=Variable "portid" is not available.
) at src/umad.c:805
#3  0x0000002a959ae68b in umad_receiver (p_ptr=0x5c3000)
    at osm_vendor_ibumad.c:266
#4  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x5c3070) at cl_thread.c:61
#5  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
#6  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread 1147169120 (LWP 5266))]#0  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a95783b4b in cl_event_wait_on (p_event=0x5887b8,
    wait_us=10000000, interruptible=1) at cl_event.c:181
#2  0x000000000043630c in __osm_sm_sweeper ()
#3  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x588898) at cl_thread.c:61
#4  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 4
[Switching to thread 4 (Thread 1136679264 (LWP 5265))]#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a278,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x000000000044d7a1 in __osm_vl15_poller ()
#3  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58a2e8) at cl_thread.c:61
#4  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 5
[Switching to thread 5 (Thread 1126189408 (LWP 5264))]#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488)
    at cl_threadpool.c:71
#3  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x5900e0) at cl_thread.c:61
#4  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 6
[Switching to thread 6 (Thread 1115699552 (LWP 5263))]#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488)
    at cl_threadpool.c:71
#3  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x590010) at cl_thread.c:61
#4  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 7
[Switching to thread 7 (Thread 1105209696 (LWP 5262))]#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488)
    at cl_threadpool.c:71
#3  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58ff40) at cl_thread.c:61
#4  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 8
[Switching to thread 8 (Thread 1094719840 (LWP 5261))]#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488)
    at cl_threadpool.c:71
#3  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58b760) at cl_thread.c:61
#4  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 9
[Switching to thread 9 (Thread 1084229984 (LWP 5253))]#0  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9578a9dd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168
#2  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
#3  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
#4  0x0000000000000000 in ?? ()
(gdb) thread 10
Thread ID 10 not known.
(gdb)
Thread ID 10 not known.
(gdb) q
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /usr/bin/opensm, process 5237
[root at vortex3l-83 ~]#
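
A quick way to compare traces like this one across all four opensm processes is to tally what each thread is blocked in. The Python sketch below assumes the `info threads` line format of this gdb version; `blocking_calls` is a hypothetical helper and `SAMPLE` is a three-line excerpt of the listing above.

```python
import re
from collections import Counter

# Matches "<n> Thread <id> (LWP <pid>)  0x<addr> in <function>".
THREAD_RE = re.compile(
    r"^\s*(\d+) Thread \d+ \(LWP (\d+)\)\s+0x[0-9a-f]+ in ([\w@.]+)",
    re.MULTILINE,
)

SAMPLE = """\
  9 Thread 1084229984 (LWP 5253)  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  2 Thread 1157658976 (LWP 5267)  0x0000002a95d7fd22 in poll ()
   from /lib64/tls/libc.so.6
  1 Thread 182904123744 (LWP 5237)  0x0000002a95d51d65 in __nanosleep_nocancel
    () from /lib64/tls/libc.so.6
"""


def blocking_calls(info_threads_text):
    """Count threads by the function at the top of their stack."""
    return Counter(
        # Drop the symbol version suffix such as "@@GLIBC_2.3.2".
        match.group(3).split("@@")[0]
        for match in THREAD_RE.finditer(info_threads_text)
    )


if __name__ == "__main__":
    for name, count in sorted(blocking_calls(SAMPLE).items()):
        print(f"{count} x {name}")
```

For the full listing above this gives five threads in pthread_cond_wait, two in pthread_cond_timedwait, one in poll and one in __nanosleep_nocancel, so every thread is idle or waiting rather than spinning.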


From halr at voltaire.com  Mon May 21 15:16:38 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 May 2007 18:16:38 -0400
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of
	them claims master
In-Reply-To: <4652167F.9040709@3leafnetworks.com>
References: <4652167F.9040709@3leafnetworks.com>
Message-ID: <1179785796.15940.27092.camel@hal.voltaire.com>

On Mon, 2007-05-21 at 18:00, Venkatesh Babu wrote:
> Configuration:
>  - 4 nodes in the IB network with two nodes running OpenSMs.
>  - Each node has MT25218 InfiniHostEx Mellanox with two IB ports and 
> with firmware version 5.2.0
>  - All node's IB port 1 is connected to IB Switch1, say subnet1
>  - All node's IB port 2 is connected to IB Switch2, say subnet2

So there is no link between the 2 switches, right ?

>  - vortex3l-83 has two opensm's for each subnet with priority 0
>  - vortex3l-84 has two opensm's for each subnet with priority 13
> 
> Problem:
> 
> The problem is that the opensm instances on both machines are in the Standby state and none of them
> claims mastership, even though they have different priorities (0 and 13). Most of the time
> this configuration works fine, but occasionally it gets into this state. The problem is hard
> to reproduce.

Is there anything being done ? Cables pulled and reinserted ? Is
anything changing or is this a "stable" configuration in terms of the
topology ?

Is this the only thing going on on the subnet ?

> I tried to set the opensm to master, but it didn't work.

> [root at vortex3l-83 ~]# sminfo -s3
>  sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5043 priority 0
> state 2 SMINFO_STANDBY
> 
> After a couple of minutes:
> [root at vortex3l-83 ~]# sminfo
> sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5938 priority 0 state 3 SMINFO_MASTER

So it did finally become master ?

I take it LID 6 is local (vortex3l-83).

> Data:
> 
> [root at vortex3l-83 ~]# sminfo
> sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4937 priority 0 state
> 2 SMINFO_STANDBY
> [root at vortex3l-83 ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
> root      5237  0.0  0.0 92848 1692 ?        Sl   00:39   0:00 /usr/bin/opensm
> -g 0x005045014a3a0001 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm1.log
> root      5250  0.0  0.0 92848 1700 ?        Sl   00:39   0:00 /usr/bin/opensm
> -g 0x005045014a3a0002 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm2.log
> root      8356  0.0  0.0 51064  708 pts/0    S+   13:40   0:00 grep open
> 
> 
> [root at vortex3l-84 ~]# sminfo
> sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4939 priority 0 state
> 2 SMINFO_STANDBY
> [root at vortex3l-84 ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
> root      5871  0.0  0.0 92848 1560 ?        Sl   00:40   0:00 /usr/bin/opensm
> -g 0x005045014a2e0001 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm1.log
> root      5884  0.0  0.0 92848 1568 ?        Sl   00:40   0:00 /usr/bin/opensm
> -g 0x005045014a2e0002 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm2.log
> root      8845  0.0  0.0 51084  668 pts/0    S+   13:40   0:00 grep open
> 
> 
> But ibv_devinfo on vortex3l-83 shows that both ports are active and that sm_lid and
> port_lid are the same, which would indicate it is the master. This looks like stale information.
> 
> [root at vortex3l-83 ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.2.0
>         node_guid:                      0050:4501:4a3a:0000
>         sys_image_guid:                 0050:4501:4a3a:0003
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       ARM0020000001
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 6
>                         port_lid:               6
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 1
>                         port_lid:               1
>                         port_lmc:               0x00
> 
> [root at vortex3l-84 ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.2.0
>         node_guid:                      0050:4501:4a2e:0000
>         sys_image_guid:                 0050:4501:4a2e:0003
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       ARM0020000001
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 6
>                         port_lid:               7
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 1
>                         port_lid:               2
>                         port_lmc:               0x00
> 
> 
> 
> And also in /var/log/opensm[1/2].log I see the following error messages -
> 
> May 21 00:40:28 250119 [95A9A160] -> OpenSM Rev:openib-2.0.5 OpenIB svn 4954M

This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you
try OFED 1.2 ?

What kernel is being used ? What distro ? What processor architecture ?

> May 21 00:40:28 484648 [95A9A160] -> osm_vendor_bind: Binding to port 0x5045014a2e0001
> May 21 00:40:28 487418 [95A9A160] -> osm_vendor_bind: Binding to port 0x5045014a2e0001
> May 21 00:40:29 292689 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x100000125f) -- dropping
> May 21 00:40:29 292728 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 00:40:29 292741 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
> May 21 00:40:29 292818 [45007960] -> SMP dump:

Is this around the time of the error or just an error in the OpenSM log
? 

> I found that the port_xmit_discards counter is 1 on both ports of both vortex
> boxes.

Did this change from 0 to 1 around the time of the issue with the SM
mastership ?

Also, what are the port counters for the switch ports in use ?

>  The other error counters seem to be zero. It looks like some packets
> have been transmitted and received on both machines.
> 
> [root at vortex3l-83 ~]# cat
> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards
> 1
> [root at vortex3l-83 ~]# cat
> /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards
> 1
> 
> [root at vortex3l-84 ~]# cat
> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards
> 1
> [root at vortex3l-84 ~]# cat
> /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards
> 1
> 
> [root at sqasmd ~]# cat
> /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards
> 1
> 
> 
> Link speed seems to be set to 10 Gb/sec (4X) on all machines.
> 
> 
> I have the opensm logs and gdb output for all the opensm instances. If you want, I
> can send them to you.

Perhaps later; not just yet.

> I am attaching one sample gdb output with stack traces of all threads.

Are they all the same ?

-- Hal



From xma at us.ibm.com  Mon May 21 15:34:54 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Mon, 21 May 2007 15:34:54 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C030AA261@OCBEXS01001.rto.be>
Message-ID: <OF76849DAB.9374EE37-ON872572E2.007BE57F-882572E2.007BEED4@us.ibm.com>


Thanks. There is no info to show why the connection got lost. Let's wait to
see whether you can reproduce this problem in rc3.

Thanks
Shirley Ma

From sean.hefty at intel.com  Mon May 21 17:38:02 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 21 May 2007 17:38:02 -0700
Subject: [ofa-general] [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <20070520134441.GI20649@mellanox.co.il>
Message-ID: <000101c79c09$74f15440$54c8180a@amr.corp.intel.com>

The ib_cm can incorrectly detect a stale connection (a new connection
request for a QPN that is already connected) as a duplicate connection
request.  Separate the handling of potential duplicate REQs from stale
connections.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Can you let me know if this fixes the issues for you?  I reworked the
code only to detect the stale connection properly.  More work is needed
to force the local QP into timewait if that is needed.  This would
likely require adding a new CM event to report that a stale connection
was detected on the QP.  Also, I left the duplicate request handling
as it was, since that should go in as a separate patch.


 drivers/infiniband/core/cm.c |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index eff591d..c53d486 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1295,26 +1295,29 @@ static struct cm_id_private * cm_match_req(struct cm_work *work,
 
 	req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
 
-	/* Check for duplicate REQ and stale connections. */
+	/* Check for possible duplicate REQ. */
 	spin_lock_irqsave(&cm.lock, flags);
 	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
-	if (!timewait_info)
-		timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
-
 	if (timewait_info) {
 		cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
 					   timewait_info->work.remote_id);
-		cm_cleanup_timewait(cm_id_priv->timewait_info);
 		spin_unlock_irqrestore(&cm.lock, flags);
 		if (cur_cm_id_priv) {
 			cm_dup_req_handler(work, cur_cm_id_priv);
 			cm_deref_id(cur_cm_id_priv);
-		} else
-			cm_issue_rej(work->port, work->mad_recv_wc,
-				     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
-				     NULL, 0);
-		listen_cm_id_priv = NULL;
-		goto out;
+		}
+		return NULL;
+	}
+
+	/* Check for stale connections. */
+	timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
+	if (timewait_info) {
+		cm_cleanup_timewait(cm_id_priv->timewait_info);
+		spin_unlock_irqrestore(&cm.lock, flags);
+		cm_issue_rej(work->port, work->mad_recv_wc,
+			     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
+			     NULL, 0);
+		return NULL;
 	}
 
 	/* Find matching listen request. */
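
As a rough user-space illustration of the ordering this patch establishes, the sketch below classifies an incoming REQ by checking the remote communication ID first and the remote QPN second. The struct, enum and function names are hypothetical and are not part of ib_cm, and the model ignores every other field the real lookup uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one connection the CM already knows about. */
struct conn {
	uint32_t remote_id;   /* remote communication ID */
	uint32_t remote_qpn;  /* remote QP number */
};

enum req_verdict {
	REQ_NEW,        /* no conflict: go on to find a listener */
	REQ_DUPLICATE,  /* remote ID already known: retransmitted REQ */
	REQ_STALE       /* new remote ID, but remote QPN already connected */
};

/*
 * Classify an incoming REQ. The remote ID is checked first, so a
 * retransmitted REQ is never reported as a stale connection.
 */
static enum req_verdict classify_req(const struct conn *known, size_t n,
				     uint32_t remote_id, uint32_t remote_qpn)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (known[i].remote_id == remote_id)
			return REQ_DUPLICATE;

	for (i = 0; i < n; i++)
		if (known[i].remote_qpn == remote_qpn)
			return REQ_STALE;

	return REQ_NEW;
}
```

Only the REQ_STALE case corresponds to sending a reject with IB_CM_REJ_STALE_CONN; a duplicate is handed to the existing duplicate handling or silently dropped.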


From mst at dev.mellanox.co.il  Mon May 21 17:48:53 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 22 May 2007 03:48:53 +0300
Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
References: <20070520134441.GI20649@mellanox.co.il>
	<000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
Message-ID: <20070522004853.GA13311@mellanox.co.il>

Thanks, Sean!
The patch looks OK to me.
I hope to be able to test it Sunday/Monday.

Quoting Sean Hefty <sean.hefty at intel.com>:
Subject: [PATCH] ib/cm: fix stale connection detection

The ib_cm can incorrectly detect a stale connection (a new connection
request for a QPN that is already connected) as a duplicate connection
request.  Separate the handling of potential duplicate REQs from stale
connections.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Can you let me know if this fixes the issues for you?  I reworked the
code only to detect the stale connection properly.  More work is needed
to force the local QP into timewait if that is needed.  This would
likely require adding a new CM event to report that a stale connection
was detected on the QP.  Also, I left the duplicate request handling
as it was, since that should go in as a separate patch.


 drivers/infiniband/core/cm.c |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index eff591d..c53d486 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1295,26 +1295,29 @@ static struct cm_id_private * cm_match_req(struct cm_work *work,
 
 	req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
 
-	/* Check for duplicate REQ and stale connections. */
+	/* Check for possible duplicate REQ. */
 	spin_lock_irqsave(&cm.lock, flags);
 	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
-	if (!timewait_info)
-		timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
-
 	if (timewait_info) {
 		cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
 					   timewait_info->work.remote_id);
-		cm_cleanup_timewait(cm_id_priv->timewait_info);
 		spin_unlock_irqrestore(&cm.lock, flags);
 		if (cur_cm_id_priv) {
 			cm_dup_req_handler(work, cur_cm_id_priv);
 			cm_deref_id(cur_cm_id_priv);
-		} else
-			cm_issue_rej(work->port, work->mad_recv_wc,
-				     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
-				     NULL, 0);
-		listen_cm_id_priv = NULL;
-		goto out;
+		}
+		return NULL;
+	}
+
+	/* Check for stale connections. */
+	timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
+	if (timewait_info) {
+		cm_cleanup_timewait(cm_id_priv->timewait_info);
+		spin_unlock_irqrestore(&cm.lock, flags);
+		cm_issue_rej(work->port, work->mad_recv_wc,
+			     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
+			     NULL, 0);
+		return NULL;
 	}
 
 	/* Find matching listen request. */

-- 
MST


From mst at dev.mellanox.co.il  Mon May 21 17:59:18 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 22 May 2007 03:59:18 +0300
Subject: [ofa-general] [PATCH RFC] IB/ipoib: fix to_ipoib_neigh access race
Message-ID: <20070522005918.GB13311@mellanox.co.il>

hard_start_xmit dereferences to_ipoib_neigh when only tx_lock is taken.  This
would only be safe if all calls that modify *to_ipoib_neigh take tx_lock too.
Currently this is not always true for ipoib_neigh_free and path_rec_completion,
which results in memory corruption.  Fix this race, making sure
path_rec_completion and ipoib_neigh_free are always called under
tx_lock.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

I'm looking at
https://bugs.openfabrics.org/show_bug.cgi?id=604
and I think this could explain the crashes.
In any case, Roland, is there a race or am I imagining things?

NB: The patch is untested (I'm not at the lab now).

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 0a428f2..ef9845a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -364,9 +364,9 @@ void ipoib_flush_paths(struct net_device *dev)
 		spin_unlock(&priv->lock);
 		spin_unlock_irq(&priv->tx_lock);
 		wait_for_completion(&path->done);
-		path_free(dev, path);
 		spin_lock_irq(&priv->tx_lock);
 		spin_lock(&priv->lock);
+		path_free(dev, path);
 	}
 	spin_unlock(&priv->lock);
 	spin_unlock_irq(&priv->tx_lock);
@@ -401,7 +401,8 @@ static void path_rec_completion(int status,
 			ah = ipoib_create_ah(dev, priv->pd, &av);
 	}
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	path->ah = ah;
 
@@ -442,7 +443,8 @@ static void path_rec_completion(int status,
 	path->query = NULL;
 	complete(&path->done);
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	while ((skb = __skb_dequeue(&skqueue))) {
 		skb->dev = dev;
@@ -822,7 +824,8 @@ static void ipoib_neigh_cleanup(struct neighbour *n)
 		  IPOIB_QPN(n->ha),
 		  IPOIB_GID_RAW_ARG(n->ha + 4));
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	neigh = *to_ipoib_neigh(n);
 	if (neigh) {
@@ -832,7 +835,8 @@ static void ipoib_neigh_cleanup(struct neighbour *n)
 		ipoib_neigh_free(n->dev, neigh);
 	}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (ah)
 		ipoib_put_ah(ah);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 54fbead..d2e6a1a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -100,7 +100,8 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 			"deleting multicast group " IPOIB_GID_FMT "\n",
 			IPOIB_GID_ARG(mcast->mcmember.mgid));
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) {
 		/*
@@ -114,7 +115,8 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 		ipoib_neigh_free(dev, neigh);
 	}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (mcast->ah)
 		ipoib_put_ah(mcast->ah);
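
The locking rule the patch aims for can be modelled in user space as follows. This is only a sketch with pthread mutexes standing in for the two spinlocks; struct model_priv and both functions are hypothetical names, not IPoIB code. The reader holds just the outer lock, so every writer has to take the outer lock too, and always before the inner one.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the IPoIB private data and its two locks. */
struct model_priv {
	pthread_mutex_t tx_lock;  /* outer lock, held by the transmit path */
	pthread_mutex_t lock;     /* inner lock, protects path/neigh lists */
	int *neigh;               /* stands in for *to_ipoib_neigh(n) */
};

/* Transmit path: dereferences the pointer while holding tx_lock only. */
static int model_xmit(struct model_priv *priv)
{
	int value = -1;

	pthread_mutex_lock(&priv->tx_lock);
	if (priv->neigh)
		value = *priv->neigh;
	pthread_mutex_unlock(&priv->tx_lock);
	return value;
}

/* Writer: clears the pointer under both locks, tx_lock always first. */
static void model_neigh_clear(struct model_priv *priv)
{
	pthread_mutex_lock(&priv->tx_lock);
	pthread_mutex_lock(&priv->lock);
	priv->neigh = NULL;
	pthread_mutex_unlock(&priv->lock);
	pthread_mutex_unlock(&priv->tx_lock);
}
```

Taking the locks in the same order everywhere (tx_lock, then lock) is what keeps the extra lock from introducing a deadlock.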

-- 
MST


From venkatesh.babu at 3leafnetworks.com  Mon May 21 19:23:40 2007
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Mon, 21 May 2007 19:23:40 -0700
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of
	them claims master
In-Reply-To: <1179785796.15940.27092.camel@hal.voltaire.com>
References: <4652167F.9040709@3leafnetworks.com>
	<1179785796.15940.27092.camel@hal.voltaire.com>
Message-ID: <4652542C.3010400@3leafnetworks.com>


Hal Rosenstock wrote:

>So there is no link between the 2 switches, right ?
>  
>
 That is right.

>
>Is there anything being done ? Cables pulled and reinserted ? Is
>anything changing or is this a "stable" configuration in terms of the
>topology ?
>  
>
 There were no configuration changes from the cable or switch
perspective, but nodes were being rebooted.

>Is this the only thing going on on the subnet ?
>  
>
 There was ipoib but no other ulp modules. There was also a proprietary ulp
module which creates a udqp, joins the broadcast
group, discovers nodes and sets up rcqps. There was no traffic being run.

>So it did finally become master ?
>  
>
 Yes, from the /var/log/opensm1.log it looks like it became master. But 
it was not responding to
link local broadcast join operations. It was failing with -110, 
Connection timed out.

>I take it LID 6 is local (vortex31-83).
>
>This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you
>try OFED 1.2 ?
>  
>
  It is the OFED 1.1 released stack. I have seen this problem with OFED 1.0
also.
Trying with OFED 1.2 may take much longer, since we need to port
our stuff.

>What kernel is being used ? What distro ? What processor architecture ?
>  
>
 2.6.9-22.EL     RHEL 4.2           Dual Core AMD Opteron(tm) Processor 
270 HE

>
>Is this around the time of the error or just an error in the OpenSM log
>? 
>  
>
  The logs were frozen after these error messages. No new entries were
being written to the log files.
After doing "sminfo -s3" I saw some messages indicating that it
moved to MASTER state, and other messages.

May 21 00:40:28 013290 [41401960] -> __osm_trap_rcv_process_request: 
Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0007 
TID:0x0000000000000003
May 21 00:40:28 013431 [41401960] -> osm_report_notice: Reporting 
Generic Notice type:4 num:144 from LID:0x0007 
GID:0xfe80000000000000,0x005045014a2e0001
May 21 00:40:28 818202 [45007960] -> umad_receiver: ERR 5409: send 
completed with error (method=0x1 attr=0x11 trans_id=0x100000135b) -- 
dropping
May 21 00:40:28 819089 [45007960] -> umad_receiver: ERR 5411: DR SMP
May 21 00:40:28 819110 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 
3113: MAD completed in error (IB_TIMEOUT)
May 21 00:40:28 819145 [45007960] -> SMP dump:
...
May 21 00:40:28 819247 [41E02960] -> Entering STANDBY state
May 21 14:04:17 204871 [45007960] -> umad_receiver: ERR 5404: recv error 
on MAD sized umad (Interrupted system call)
May 21 14:06:08 022096 [45007960] -> umad_receiver: ERR 5409: send 
completed with error (method=0x1 attr=0x20 trans_id=0x100000264f) -- 
dropping
May 21 14:06:08 022132 [45007960] -> umad_receiver: ERR 5411: DR SMP
May 21 14:06:08 022145 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 
3113: MAD completed in error (IB_TIMEOUT)
May 21 14:06:08 022182 [45007960] -> SMP dump:
...
May 21 14:06:38 035957 [41401960] -> Entering MASTER state
May 21 14:06:38 038818 [42803960] -> osm_subn_set_up_down_min_hop_table: 
BFS through all port guids in the subnet ]
May 21 14:06:38 038886 [42803960] -> osm_ucast_mgr_process: Min Hop 
Tables configured on all switches
May 21 14:06:38 046438 [41401960] -> __osm_trap_rcv_process_request: 
Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000C 
TID:0x0000000000000ec4
May 21 14:06:38 046565 [41401960] -> osm_report_notice: Reporting 
Generic Notice type:1 num:128 from LID:0x000C 
GID:0xfe80000000000000,0x000b8cffff0024f9
May 21 14:06:38 108660 [42803960] -> SUBNET UP
May 21 14:06:38 402900 [41401960] -> __osm_trap_rcv_process_request: 
Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 
TID:0x0000000000000000
May 21 14:06:38 403007 [41401960] -> osm_report_notice: Reporting 
Generic Notice type:4 num:144 from LID:0x0001 
GID:0xfe80000000000000,0x0002c9020020f5c5
May 21 14:06:38 914806 [45007960] -> umad_receiver: ERR 5409: send 
completed with error (method=0x1 attr=0x20 trans_id=0x10000026f0) -- 
dropping
May 21 14:06:38 914823 [45007960] -> umad_receiver: ERR 5411: DR SMP
May 21 14:06:38 914864 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 
3113: MAD completed in error (IB_TIMEOUT)
May 21 14:06:38 914899 [45007960] -> SMP dump:

>Did this change from 0 to 1 around the time of the issue with the SM
>mastership ?
>  
>
  Not sure, I just got the snapshot when I saw this problem.

>Also, what are the port counters for the switch ports in use ?
>  
>
[root at vortex3l-83 ~]# ibnetdiscover
ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed,
skipping port
#
# Topology file: generated on Mon May 21 02:11:34 2007
#
# Max of 2 hops discovered
# Initiated from node 005045014a3a0000 port 005045014a3a0001

vendid=0x2c9
devid=0xb924
sysimgguid=0xb8cffff0024f9
switchguid=0xb8cffff0024f9
Switch  24 "S-000b8cffff0024f9"         # MT47396 Infiniscale-III Mellanox
Technologies base port 0 lid 12 lmc 0
[18]    "H-005045014a2e0000"[1]
[11]    "H-0002c902002048b0"[1]
[10]    "H-0002c9020020f584"[1]
[19]    "H-005045014a3a0000"[1]

vendid=0x2c9
devid=0x6282
sysimgguid=0x5045014a2e0003
caguid=0x5045014a2e0000
Ca      2 "H-005045014a2e0000"          # vortex3l-84 HCA-1
[1]     "S-000b8cffff0024f9"[18]                # lid 7 lmc 0

vendid=0x2c9
devid=0x6282
sysimgguid=0x2c902002048b3
caguid=0x2c902002048b0
Ca      2 "H-0002c902002048b0"          # MT25218 InfiniHostEx Mellanox
Technologies
[1]     "S-000b8cffff0024f9"[11]                # lid 5 lmc 0

vendid=0x2c9
devid=0x6282
sysimgguid=0x2c9020020f587
caguid=0x2c9020020f584
Ca      2 "H-0002c9020020f584"          # MT25218 InfiniHostEx Mellanox
Technologies
[1]     "S-000b8cffff0024f9"[10]                # lid 8 lmc 0

vendid=0x2c9
devid=0x6282
sysimgguid=0x5045014a3a0003
caguid=0x5045014a3a0000
Ca      2 "H-005045014a3a0000"          # vortex3l-83 HCA-1
[1]     "S-000b8cffff0024f9"[19]                # lid 6 lmc 0
[root at vortex3l-83 ~]#

>Perhaps later; not just yet.
>  
>
>Are they all the same ?
>  
>
  They are more or less the same. All of them have 9 threads and each thread
is blocking on some event.

 VBabu


From halr at voltaire.com  Mon May 21 20:45:57 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 May 2007 23:45:57 -0400
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of
	them claims master
In-Reply-To: <4652542C.3010400@3leafnetworks.com>
References: <4652167F.9040709@3leafnetworks.com>
	<1179785796.15940.27092.camel@hal.voltaire.com>
	<4652542C.3010400@3leafnetworks.com>
Message-ID: <1179805556.15940.47640.camel@hal.voltaire.com>

On Mon, 2007-05-21 at 22:23, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
> 
> >So there is no link between the 2 switches, right ?
> >  
> >
>  That is right.
> 
> >
> >Is there anything being done ? Cables pulled and reinserted ? Is
> >anything changing or is this a "stable" configuration in terms of the
> >topology ?
> >  
> >
>  There were no configuration changes from the cable or switch
> perspective, but nodes were being rebooted.
> 
> >Is this the only thing going on on the subnet ?
> >  
> >
>  There was ipoib but no other ulp modules. There was also a proprietary ulp
> module which creates a udqp, joins the broadcast
> group, discovers nodes and sets up rcqps. There was no traffic being run.
> 
> >So it did finally become master ?
> >  
> >
>  Yes, from the /var/log/opensm1.log it looks like it became master. But 
> it was not responding to
> link local broadcast join operations. It was failing with -110, 
> Connection timed out.
> 
> >I take it LID 6 is local (vortex31-83).
> >
> >This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you
> >try OFED 1.2 ?
> >  
> >
>   It is the OFED 1.1 released stack. I have seen this problem with OFED 1.0
> also.
> Trying with OFED 1.2 may take much longer, since we need to port
> our stuff.

Can you at least use OFED 1.2 management (OpenSM and management
libraries) with the rest being OFED 1.1 ?

There are a number of bugs which have been fixed which might affect
this. The one I can think of off the top of my head is a fix to atomics
in OpenSM's complib. I think that was found and fixed post OFED 1.1.
I'll confirm this tomorrow.

There may also be some important kernel differences (in user_mad.c or
mad.c) which might be relevant.

> >What kernel is being used ? What distro ? What processor architecture ?
> >  
> >
>  2.6.9-22.EL     RHEL 4.2           Dual Core AMD Opteron(tm) Processor 
> 270 HE
> 
> >
> >Is this around the time of the error or just an error in the OpenSM log
> >? 
> >  
> >
>   The logs were frozen after these error messages. No new entries were
> being written to the log files.
> After doing "sminfo -s3" I saw some messages indicating that it
> moved to MASTER state, and other messages.
> 
> May 21 00:40:28 013290 [41401960] -> __osm_trap_rcv_process_request: 
> Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0007 
> TID:0x0000000000000003
> May 21 00:40:28 013431 [41401960] -> osm_report_notice: Reporting 
> Generic Notice type:4 num:144 from LID:0x0007 
> GID:0xfe80000000000000,0x005045014a2e0001
> May 21 00:40:28 818202 [45007960] -> umad_receiver: ERR 5409: send 
> completed with error (method=0x1 attr=0x11 trans_id=0x100000135b) -- 
> dropping
> May 21 00:40:28 819089 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 00:40:28 819110 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 
> 3113: MAD completed in error (IB_TIMEOUT)
> May 21 00:40:28 819145 [45007960] -> SMP dump:
> ...
> May 21 00:40:28 819247 [41E02960] -> Entering STANDBY state
> May 21 14:04:17 204871 [45007960] -> umad_receiver: ERR 5404: recv error 
> on MAD sized umad (Interrupted system call)
> May 21 14:06:08 022096 [45007960] -> umad_receiver: ERR 5409: send 
> completed with error (method=0x1 attr=0x20 trans_id=0x100000264f) -- 
> dropping
> May 21 14:06:08 022132 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 14:06:08 022145 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 
> 3113: MAD completed in error (IB_TIMEOUT)
> May 21 14:06:08 022182 [45007960] -> SMP dump:
> ...
> May 21 14:06:38 035957 [41401960] -> Entering MASTER state
> May 21 14:06:38 038818 [42803960] -> osm_subn_set_up_down_min_hop_table: 
> BFS through all port guids in the subnet ]
> May 21 14:06:38 038886 [42803960] -> osm_ucast_mgr_process: Min Hop 
> Tables configured on all switches
> May 21 14:06:38 046438 [41401960] -> __osm_trap_rcv_process_request: 
> Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000C 
> TID:0x0000000000000ec4
> May 21 14:06:38 046565 [41401960] -> osm_report_notice: Reporting 
> Generic Notice type:1 num:128 from LID:0x000C 
> GID:0xfe80000000000000,0x000b8cffff0024f9
> May 21 14:06:38 108660 [42803960] -> SUBNET UP
> May 21 14:06:38 402900 [41401960] -> __osm_trap_rcv_process_request: 
> Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 
> TID:0x0000000000000000
> May 21 14:06:38 403007 [41401960] -> osm_report_notice: Reporting 
> Generic Notice type:4 num:144 from LID:0x0001 
> GID:0xfe80000000000000,0x0002c9020020f5c5
> May 21 14:06:38 914806 [45007960] -> umad_receiver: ERR 5409: send 
> completed with error (method=0x1 attr=0x20 trans_id=0x10000026f0) -- 
> dropping
> May 21 14:06:38 914823 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 14:06:38 914864 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 
> 3113: MAD completed in error (IB_TIMEOUT)
> May 21 14:06:38 914899 [45007960] -> SMP dump:
> 
> >Did this change from 0 to 1 around the time of the issue with the SM
> >mastership ?
> >  
> >
>   Not sure, I just got the snapshot when I saw this problem.
> 
> >Also, what are the port counters for the switch ports in use ?
> >  
> >
> [root at vortex3l-83 ~]# ibnetdiscover

I was referring to using perfquery, not ibnetdiscover.

> ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed,
> skipping port

Was this node rebooting while you did this or is there some other issue
?

> #
> # Topology file: generated on Mon May 21 02:11:34 2007
> #
> # Max of 2 hops discovered
> # Initiated from node 005045014a3a0000 port 005045014a3a0001
> 
> vendid=0x2c9
> devid=0xb924
> sysimgguid=0xb8cffff0024f9
> switchguid=0xb8cffff0024f9
> Switch  24 "S-000b8cffff0024f9"         # MT47396 Infiniscale-III Mellanox
> Technologies base port 0 lid 12 lmc 0
> [18]    "H-005045014a2e0000"[1]
> [11]    "H-0002c902002048b0"[1]
> [10]    "H-0002c9020020f584"[1]
> [19]    "H-005045014a3a0000"[1]

So run these (before and after):
perfquery 12 18
perfquery 12 11
perfquery 12 10
perfquery 12 19

and

perfquery 12 9

-- Hal

> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x5045014a2e0003
> caguid=0x5045014a2e0000
> Ca      2 "H-005045014a2e0000"          # vortex3l-84 HCA-1
> [1]     "S-000b8cffff0024f9"[18]                # lid 7 lmc 0
> 
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x2c902002048b3
> caguid=0x2c902002048b0
> Ca      2 "H-0002c902002048b0"          # MT25218 InfiniHostEx Mellanox
> Technologies
> [1]     "S-000b8cffff0024f9"[11]                # lid 5 lmc 0
> 
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x2c9020020f587
> caguid=0x2c9020020f584
> Ca      2 "H-0002c9020020f584"          # MT25218 InfiniHostEx Mellanox
> Technologies
> [1]     "S-000b8cffff0024f9"[10]                # lid 8 lmc 0
> 
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x5045014a3a0003
> caguid=0x5045014a3a0000
> Ca      2 "H-005045014a3a0000"          # vortex3l-83 HCA-1
> [1]     "S-000b8cffff0024f9"[19]                # lid 6 lmc 0
> [root at vortex3l-83 ~]#
> 
> >Perhaps later; not just yet.
> >  
> >
> >Are they all the same ?
> >  
> >
>   They are more or less the same. All of them have 9 threads and each thread
> is blocking on some event.
> 
>  VBabu


From mst at dev.mellanox.co.il  Mon May 21 22:16:40 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 22 May 2007 08:16:40 +0300
Subject: [ofa-general] cm.c and irqsave (was Re: [PATCH] ib/cm: fix stale
	connection detection)
In-Reply-To: <000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
References: <20070520134441.GI20649@mellanox.co.il>
	<000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
Message-ID: <20070522051640.GA23066@mellanox.co.il>

> @@ -1295,26 +1295,29 @@ static struct cm_id_private * cm_match_req(struct cm_work *work,
>  
>  	req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
>  
> -	/* Check for duplicate REQ and stale connections. */
> +	/* Check for possible duplicate REQ. */
>  	spin_lock_irqsave(&cm.lock, flags);
>  	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);

On an unrelated note, it looks like cm.c would benefit from an irqsave
diet: it seems to perform work almost exclusively from thread
context, so just spin_lock_irq is sure to be enough.

And if *everything* is done from a thread context, I think we can
go one step further and avoid disabling interrupts as well.
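
To make the difference concrete, here is a small user-space model in which a single flag stands in for the CPU interrupt-enable state. All names are hypothetical and nothing here is kernel API; it only shows that the irqsave variant restores whatever state the caller had, while the plain irq variant always re-enables on unlock, which is why the latter is only appropriate where interrupts are known to be enabled on entry.

```c
#include <stdbool.h>

/* Hypothetical model: one flag stands in for the CPU interrupt state. */
static bool irqs_enabled = true;

/* Like spin_lock_irqsave(): remember the previous state, then disable. */
static bool model_lock_irqsave(void)
{
	bool flags = irqs_enabled;

	irqs_enabled = false;
	return flags;
}

/* Like spin_unlock_irqrestore(): put back whatever state was saved. */
static void model_unlock_irqrestore(bool flags)
{
	irqs_enabled = flags;
}

/* Like spin_lock_irq(): disable without remembering anything. */
static void model_lock_irq(void)
{
	irqs_enabled = false;
}

/* Like spin_unlock_irq(): unconditionally enable. */
static void model_unlock_irq(void)
{
	irqs_enabled = true;
}
```

If a caller entered with interrupts already disabled, the irq pair would leave them enabled afterwards, so the conversion depends on every caller running with interrupts on.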

-- 
MST


From venkatesh.babu at 3leafnetworks.com  Mon May 21 23:31:24 2007
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Mon, 21 May 2007 23:31:24 -0700
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of
	them claims master
In-Reply-To: <1179805556.15940.47640.camel@hal.voltaire.com>
References: <4652167F.9040709@3leafnetworks.com>	
	<1179785796.15940.27092.camel@hal.voltaire.com>	
	<4652542C.3010400@3leafnetworks.com>
	<1179805556.15940.47640.camel@hal.voltaire.com>
Message-ID: <46528E3C.8090305@3leafnetworks.com>


Hal Rosenstock wrote:

>
>Can you at least use OFED 1.2 management (OpenSM and management
>libraries) with the rest being OFED 1.1 ?
>  
>
 Are these backward compatible ?

>There are a number of bugs which have been fixed which might affect
>this. The one I can think of off the top of my head is a fix to atomics
>in OpenSM's complib. I think that was found and fixed post OFED 1.1.
>I'll confirm this tomorrow.
>
>There may also be some important kernel differences (in user_mad.c or
>mad.c) which might be relevant.
>  
>
  It would be great if you could find these particular patches; we could
then apply them onto OFED 1.1
instead of migrating to OFED 1.2.

  By the way, when is a production quality OFED 1.2 supposed to be
released ?

>I was referring to using perfquery, not ibnetdiscover.
>  
>
 I don't have that output right now. But I found that all other error 
counters were zero except port_xmit_discards.

>  
>
>>ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed,
>>skipping port
>>    
>>
>
>Was this node rebooting while you did this or is there some other issue
>?
>  
>
  Yes, it is quite possible that the node was being rebooted.

>
>So run these (before and after):
>perfquery 12 18
>perfquery 12 11
>perfquery 12 10
>perfquery 12 19
>
>and
>
>perfquery 12 9
>  
>
  Unfortunately the systems got rebooted and the issue is lost. I was able
to collect the perfquery output, though. It looks like it is now seeing some errors.
[root at vortex3l-83 ~]# perfquery 12 9
# Port counters: Lid 12 port 9
PortSelect:......................9
CounterSelect:...................0x0100
SymbolErrors:....................65535
LinkRecovers:....................2
LinkDowned:......................255
RcvErrors:.......................1
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................41484
XmtDiscards:.....................4918
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................1
XmtBytes:........................2050081143
RcvBytes:........................4294967295
XmtPkts:.........................14539343
RcvPkts:.........................37028545
[root at vortex3l-83 ~]# perfquery 12 10
# Port counters: Lid 12 port 10
PortSelect:......................10
CounterSelect:...................0x0100
SymbolErrors:....................65535
LinkRecovers:....................27
LinkDowned:......................255
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................19936
XmtDiscards:.....................5192
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtBytes:........................4294967295
RcvBytes:........................4294967295
XmtPkts:.........................1739931538
RcvPkts:.........................1794380558
[root at vortex3l-83 ~]# perfquery 12 11
# Port counters: Lid 12 port 11
PortSelect:......................11
CounterSelect:...................0x0100
SymbolErrors:....................65535
LinkRecovers:....................0
LinkDowned:......................255
RcvErrors:.......................1
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................8963
XmtDiscards:.....................5636
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtBytes:........................4294967295
RcvBytes:........................4294967295
XmtPkts:.........................2375935494
RcvPkts:.........................2714377528
[root at vortex3l-83 ~]# perfquery 12 18
# Port counters: Lid 12 port 18
PortSelect:......................18
CounterSelect:...................0x0100
SymbolErrors:....................65535
LinkRecovers:....................24
LinkDowned:......................220
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................65535
XmtDiscards:.....................23628
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtBytes:........................4294967295
RcvBytes:........................4294967295
XmtPkts:.........................604709394
RcvPkts:.........................448409077
[root at vortex3l-83 ~]# perfquery 12 19
# Port counters: Lid 12 port 19
PortSelect:......................19
CounterSelect:...................0x0100
SymbolErrors:....................65535
LinkRecovers:....................21
LinkDowned:......................247
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................65535
XmtDiscards:.....................37754
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtBytes:........................4294967295
RcvBytes:........................4294967295
XmtPkts:.........................3958092428
RcvPkts:.........................3679343076
[root at vortex3l-83 ~]#
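
Several values in the output above sit exactly at 65535, 255 or 4294967295, which are the maxima of 16-, 8- and 32-bit fields. Assuming these counters stick at their maximum instead of wrapping, a before/after comparison says nothing once a counter is pegged. The helper below is a hypothetical sketch for flagging that case; the bit widths passed in are the caller's assumption about each field and are not read from perfquery.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a counter of the given bit width has reached its maximum
 * value and has therefore stopped counting.
 */
static bool counter_saturated(uint32_t value, unsigned int width_bits)
{
	uint32_t max = (width_bits >= 32) ? UINT32_MAX
					  : ((UINT32_C(1) << width_bits) - 1);

	return value >= max;
}

/*
 * Delta between two samples. *valid is cleared when either sample is
 * saturated or the counter went backwards (for example after a reset),
 * because the real increase is then unknown.
 */
static uint32_t counter_delta(uint32_t before, uint32_t after,
			      unsigned int width_bits, bool *valid)
{
	*valid = !counter_saturated(before, width_bits) &&
		 !counter_saturated(after, width_bits) &&
		 after >= before;
	return *valid ? after - before : 0;
}
```

For the "before and after" runs to be useful, the pegged counters would have to be cleared first so that they can move again.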

  -VBabu


From mst at dev.mellanox.co.il  Mon May 21 23:36:34 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 22 May 2007 09:36:34 +0300
Subject: [ofa-general] skb queue management in ipoib
Message-ID: <20070522063634.GB3331@mellanox.co.il>

Roland, all,

	Currently, IPoIB keeps skb queues while an SA query or connection
	request is outstanding.  These queues have a length limit, and once
	the limit is reached, new packets are dropped. Example:

                        if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE)
                                __skb_queue_tail(&neigh->queue, skb);
                        else {
                                ipoib_warn(priv, "queue length limit %d. Packet drop.\n",
                                           skb_queue_len(&neigh->queue));
                                goto err_drop;
                        }


	I think it would be better to manage this queue in a FIFO manner,
	dropping the oldest packet and queuing the new one instead: an
	older packet is more likely to have timed out already.
	So we would do something along the lines of:

                       __skb_queue_tail(&neigh->queue, skb);
                       if (skb_queue_len(&neigh->queue) > IPOIB_MAX_PATH_REC_QUEUE) {
                               /* over the limit: take the oldest packet off the head */
                               skb = __skb_dequeue(&neigh->queue);
                               ipoib_warn(priv, "queue length limit %d. Packet drop.\n",
                                          skb_queue_len(&neigh->queue));
                               goto err_drop;
                       }

Does this make sense?
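
To make the difference between the two policies concrete, here is a
small userspace sketch. It is not IPoIB code: struct pkt_queue,
QUEUE_LIMIT and both enqueue helpers are hypothetical names, and a ring
of integer packet ids stands in for the skb queue. It only illustrates
which packet is lost when the limit is hit under each policy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for IPOIB_MAX_PATH_REC_QUEUE. */
#define QUEUE_LIMIT 3

/* Hypothetical stand-in for an skb queue: a ring of packet ids. */
struct pkt_queue {
	int slot[QUEUE_LIMIT];
	size_t head;	/* index of the oldest queued packet */
	size_t len;	/* number of queued packets */
};

/*
 * Current policy: when the queue is full, the new packet is dropped.
 * Returns the id of the dropped packet, or -1 if nothing was dropped.
 */
static int enqueue_drop_new(struct pkt_queue *q, int id)
{
	assert(id >= 0);
	if (q->len == QUEUE_LIMIT)
		return id;
	q->slot[(q->head + q->len) % QUEUE_LIMIT] = id;
	q->len++;
	return -1;
}

/*
 * Proposed policy: when the queue is full, the oldest packet is evicted
 * to make room and the new packet is always queued.
 * Returns the id of the dropped packet, or -1 if nothing was dropped.
 */
static int enqueue_drop_old(struct pkt_queue *q, int id)
{
	int dropped = -1;

	assert(id >= 0);
	if (q->len == QUEUE_LIMIT) {
		dropped = q->slot[q->head];
		q->head = (q->head + 1) % QUEUE_LIMIT;
		q->len--;
	}
	q->slot[(q->head + q->len) % QUEUE_LIMIT] = id;
	q->len++;
	return dropped;
}
```

With a limit of 3 and packets 1..4 arriving in order, the first helper
drops packet 4 and keeps 1, 2, 3, while the second drops packet 1 and
keeps 2, 3, 4.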

-- 
MST


From mst at dev.mellanox.co.il  Tue May 22 00:59:52 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 22 May 2007 10:59:52 +0300
Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
References: <20070520134441.GI20649@mellanox.co.il>
	<000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
Message-ID: <20070522075952.GC3331@mellanox.co.il>

The ib_cm can incorrectly detect a stale connection (a new connection
request for a QPN that is already connected) as a duplicate connection
request.  Separate the handling of potential duplicate REQs from stale
connections.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
Acked-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

> ---
>
> Can you let me know if this fixes the issues for you?  I reworked the
> code only to detect the stale connection properly.

Yes, this has fixed the issue for me.
I have not seen any timeouts yet: netperf seems to recover in at most 15 sec,
where previously it needed up to 2 minutes.

The patch looks obvious enough for 2.6.22, safe enough in that it replaces a
timeout with a reject, and it addresses a real problem.  Sean? Roland? What do
you think?

> More work is needed
> to force the local QP into timewait if that is needed.

Yes, this is needed. IPoIB has its own stale connection detection
logic, so after several minutes of inactivity it will clean out the
connection. However, if the number of QPs is increased, this timeout
might become too long, and handling this only at the ULP level is
wrong anyway.

In practice I don't think we have seen this yet,
but the spec is quite explicit about this point:
	When a CM receives such a REQ/REP it shall abort the connection establishment by
	issuing REJ to the REQ/REP. It shall then issue DREQ, with “DREQ:remote QPN”
	set to the remote QPN from the REQ/REP, until DREP is received or Max Retries
	is exceeded, and place the local QP in the timeWait state.

I agree this is 2.6.23 material, however.

Something that I think would be very useful for 2.6.22 already: could you please
document which portions of chapter 12 are not currently implemented in cm.c, and
put this in a file in the kernel tree?  This way people will be able to figure
out whether something that they need is missing, and contribute.


> This would
> likely require adding a new CM event to report that a stale connection
> was detected on the QP.

Yes, this looks like a reasonable way to do this.

> Also, I left the duplicate request handling
> as it was, since that should go in as a separate patch.

Could you please describe what is missing currently?
Is the missing handling likely to cause timeouts?

I hope we have reduced the chance of duplicate request misdetections
with the local id patch sufficiently, and fixing this can
wait till 2.6.23.

>  drivers/infiniband/core/cm.c |   25 ++++++++++++++-----------
>  1 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index eff591d..c53d486 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1295,26 +1295,29 @@ static struct cm_id_private * cm_match_req(struct cm_work *work,
 
 	req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
 
-	/* Check for duplicate REQ and stale connections. */
+	/* Check for possible duplicate REQ. */
 	spin_lock_irqsave(&cm.lock, flags);
 	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
-	if (!timewait_info)
-		timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
-
 	if (timewait_info) {
 		cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
 					   timewait_info->work.remote_id);
-		cm_cleanup_timewait(cm_id_priv->timewait_info);
 		spin_unlock_irqrestore(&cm.lock, flags);
 		if (cur_cm_id_priv) {
 			cm_dup_req_handler(work, cur_cm_id_priv);
 			cm_deref_id(cur_cm_id_priv);
-		} else
-			cm_issue_rej(work->port, work->mad_recv_wc,
-				     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
-				     NULL, 0);
-		listen_cm_id_priv = NULL;
-		goto out;
+		}
+		return NULL;
+	}
+
+	/* Check for stale connections. */
+	timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
+	if (timewait_info) {
+		cm_cleanup_timewait(cm_id_priv->timewait_info);
+		spin_unlock_irqrestore(&cm.lock, flags);
+		cm_issue_rej(work->port, work->mad_recv_wc,
+			     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
+			     NULL, 0);
+		return NULL;
 	}
 
 	/* Find matching listen request. */
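
The ordering the patch establishes can be sketched outside the kernel
as follows. This is a simplified model, not the cm.c logic: struct
conn_key, enum req_verdict and classify_req() are hypothetical names,
a flat array replaces the CM's lookup trees, and keying both lookups on
the remote CA GUID is an assumption made for the illustration. The
point is only that the remote communication ID is checked first
(possible duplicate REQ), and the remote QPN is checked separately
afterwards (stale connection, answered with a REJ).

```c
#include <stddef.h>
#include <stdint.h>

enum req_verdict { REQ_NEW, REQ_DUPLICATE, REQ_STALE };

/* Hypothetical record of a connection the CM already knows about. */
struct conn_key {
	uint64_t remote_ca_guid;
	uint32_t remote_id;	/* remote communication ID */
	uint32_t remote_qpn;	/* remote QP number */
};

static enum req_verdict classify_req(const struct conn_key *known, size_t n,
				     const struct conn_key *req)
{
	size_t i;

	/* Step 1: same remote comm ID seen before, so a possible duplicate REQ. */
	for (i = 0; i < n; i++)
		if (known[i].remote_ca_guid == req->remote_ca_guid &&
		    known[i].remote_id == req->remote_id)
			return REQ_DUPLICATE;

	/* Step 2: new comm ID but the QPN is already in use: stale connection. */
	for (i = 0; i < n; i++)
		if (known[i].remote_ca_guid == req->remote_ca_guid &&
		    known[i].remote_qpn == req->remote_qpn)
			return REQ_STALE;

	return REQ_NEW;
}
```

In this model a REQ that reuses QPN 7 with a fresh communication ID is
classified as stale rather than being mistaken for a retransmission.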

-- 
MST


From amip at dev.mellanox.co.il  Tue May 22 01:33:12 2007
From: amip at dev.mellanox.co.il (Ami Perlmutter)
Date: Tue, 22 May 2007 11:33:12 +0300
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <OF76849DAB.9374EE37-ON872572E2.007BE57F-882572E2.007BEED4@us.ibm.com>
References: <D63C0BE2D613C543B6F3305502E9784C030AA261@OCBEXS01001.rto.be>
	<OF76849DAB.9374EE37-ON872572E2.007BE57F-882572E2.007BEED4@us.ibm.com>
Message-ID: <cf054deb0705220133v164f8943q6ccdf985b741cc36@mail.gmail.com>

does the application constantly open and close connections?

From Koen.SEGERS at VRT.BE  Tue May 22 01:54:41 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 22 May 2007 10:54:41 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <cf054deb0705220133v164f8943q6ccdf985b741cc36@mail.gmail.com>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C03157D5A@OCBEXS01001.rto.be>

GPFS keeps its connection constantly open.

 
We did some more tests with iperf:

If we don't run bidirectional tests, all connections keep running
smoothly. If we add bidirectional tests, it becomes unstable, especially
if this is done on multiple nodes. Is this normal?

 
The failed iperf tests give the same error in the switch log:

May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71
May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71
May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering removed ports
May 22 08:15:00 topspin-120sc ib_sm.x[621]: %IB-6-INFO: Program switch port state to down, node=00:05:ad:00:00:0b:a2:cc, port= 6, due to non-responding CA
May 22 08:15:00 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port down - port=1/6, type=ib4xTXP
May 22 08:15:00 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: in portTblFindEntry() - IfIndex=70(1/6)
May 22 08:15:00 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: cannot find entry - IfIndex=70(1/6)
May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering new ports
May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71
May 22 08:15:05 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port up - port=1/6, type=ib4xTXP
May 22 08:15:07 topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71
May 22 08:15:08 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

 
RC3 has just been installed. Results will follow soon.

 
Regards,

 
Koen

 
________________________________

From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
Sent: Tuesday, 22 May 2007 10:33
To: Shirley Ma
CC: SEGERS Koen; general-bounces at lists.openfabrics.org;
general at lists.openfabrics.org
Subject: Re: [ofa-general] GPFS node loses IB-connection

 
does the application constantly open and close connections? 

*** Disclaimer ***

Vlaamse Radio- en Televisieomroep
Auguste Reyerslaan 52, 1043 Brussel

nv van publiek recht
BTW BE 0244.142.664
RPR Brussel
http://www.vrt.be/disclaimer
 


From vlad at lists.openfabrics.org  Tue May 22 02:41:37 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue, 22 May 2007 02:41:37 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070522-0200 daily build status
Message-ID: <20070522094137.660F4E6081F@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From halr at voltaire.com  Tue May 22 03:53:02 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 22 May 2007 06:53:02 -0400
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of
	them claims master
In-Reply-To: <46528E3C.8090305@3leafnetworks.com>
References: <4652167F.9040709@3leafnetworks.com>
	<1179785796.15940.27092.camel@hal.voltaire.com>
	<4652542C.3010400@3leafnetworks.com>
	<1179805556.15940.47640.camel@hal.voltaire.com>
	<46528E3C.8090305@3leafnetworks.com>
Message-ID: <1179831181.15940.74121.camel@hal.voltaire.com>

On Tue, 2007-05-22 at 02:31, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
> 
> >
> >Can you at least use OFED 1.2 management (OpenSM and management
> >libraries) with the rest being OFED 1.1 ?
> >  
> >
>  Are these backward compatible ?

Yes, user_mad kernel module has been at ABI version 5 for quite some
time now.

> >There are a number of bugs which have been fixed which might affect
> >this. The one I can think of off the top of my head is a fix to atomics
> >in OpenSM's complib. I think that was found and fixed post OFED 1.1.
> >I'll confirm this tomorrow.

The atomic fix was in OpenSM 2.0.5 but there are numerous other fixes
(see OpenSM release notes for OFED 1.2).

> >There may also be some important kernel differences (in user_mad.c or
> >mad.c) which might be relevant.
> >  
> >
>   It would be great if you can find these particular patches, we could 
> apply these onto OFED 1.1
> instead of migrating to OFED 1.2.

The one I see that might be related is the following:

commit 39798695b4bcc7b145f8910ca56195808d3a7637
Author: Roland Dreier <rolandd at cisco.com>
Date:   Mon Nov 13 09:38:07 2006 -0800

    IB/mad: Fix race between cancel and receive completion
    
    When ib_cancel_mad() is called, it puts the canceled send on a list
    and schedules a "flushed" callback from process context.  However,
    this leaves a window where a receive completion could be processed
    before the send is fully flushed.
    
    This is fine, except that ib_find_send_mad() will find the MAD and
    return it to the receive processing, which results in the sender
    getting both a successful receive and a "flushed" send completion for
    the same request.  Understandably, this confuses the sender, which is
    expecting only one of these two callbacks, and leads to grief such as
    a use-after-free in IPoIB.
    
    Fix this by changing ib_find_send_mad() to return a send struct only
    if the status is still successful (and not "flushed").  The search of
    the send_list already had this check, so this patch just adds the same
    check to the search of the wait_list.
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

My search was not exhaustive.
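
The idea behind that fix can be sketched in plain C as below. This is a
simplified model of the behaviour described in the commit message, not
the mad.c source: struct pending_send, find_send() and the array-based
lists are hypothetical, and only the rule "a cancelled (flushed) send
must never be matched to an incoming response, on either list" is
taken from the commit text.

```c
#include <stddef.h>
#include <stdint.h>

enum send_status { SEND_OK, SEND_FLUSHED };

/* Hypothetical model of an outstanding MAD send request. */
struct pending_send {
	uint64_t tid;			/* transaction ID */
	enum send_status status;	/* SEND_FLUSHED once cancelled */
};

/* Search one list; a matching entry is returned only if still SEND_OK. */
static struct pending_send *find_send(struct pending_send *list, size_t n,
				      uint64_t tid)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (list[i].tid == tid)
			return list[i].status == SEND_OK ? &list[i] : NULL;
	return NULL;
}

/*
 * Apply the same status check to both lists, so a response that races
 * with a cancel is not delivered for a send that will also be flushed.
 */
static struct pending_send *find_send_mad(struct pending_send *wait_list,
					  size_t n_wait,
					  struct pending_send *send_list,
					  size_t n_send, uint64_t tid)
{
	struct pending_send *p = find_send(wait_list, n_wait, tid);

	return p ? p : find_send(send_list, n_send, tid);
}
```

In this model the sender sees exactly one outcome per request: either
the matched response or the flushed completion, never both.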

>   By the way, when is production quality OFED 1.2 is supposed to be 
> released ?

It was supposed to be released already but we are closing in on rc4 (May
30) with the release to follow shortly thereafter (1-2 weeks).

> >I was referring to using perfquery, not ibnetdiscover.
> >  
> >
>  I don't have that output right now. But I found that all other error 
> counters were zero except port_xmit_discards.

It would be useful to collect these after the problem occurs, to be sure.

> >>ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed,
> >>skipping port
> >>    
> >>
> >
> >Was this node rebooting while you did this or is there some other issue
> >?
> >  
> >
>   Yes, it is quite possible that node was being rebooted.
> 
> >
> >So run these (before and after):
> >perfquery 12 18
> >perfquery 12 11
> >perfquery 12 10
> >perfquery 12 19
> >
> >and
> >
> >perfquery 12 9
> >  
> >
>   Unfortunately the systems got rebooted and issue is lost. I was able 
> to collect the perfquery output. It looks like now it is seeing some errors.

Are they incrementing ? Which node is this ? I think some of them would
increment on node reboot.
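
One caveat when answering that from two perfquery samples: several of
the values below sit at the maximum of their field (65535 for the
16-bit error counters, 255 for LinkDowned, 4294967295 for the 32-bit
data counters). PortCounters fields stop at their maximum rather than
wrapping, so a pegged counter cannot show further increments until it
is reset. A minimal sketch of that check follows; struct
counter_sample and the helper names are hypothetical, and the field
widths are passed in by the caller rather than looked up.

```c
#include <assert.h>
#include <stdint.h>

enum counter_state { COUNTER_IDLE, COUNTER_INCREMENTING, COUNTER_SATURATED };

/* Hypothetical pair of readings of one counter, taken some time apart. */
struct counter_sample {
	uint64_t before;
	uint64_t after;
	unsigned width_bits;	/* field width of the counter, 1..64 */
};

/* Largest value a counter of the given width can hold. */
static uint64_t counter_max(unsigned width_bits)
{
	assert(width_bits >= 1 && width_bits <= 64);
	return width_bits == 64 ? UINT64_MAX
				: (((uint64_t)1 << width_bits) - 1);
}

/*
 * A counter stuck at its maximum says nothing about the current error
 * rate; it has to be reset before two samples can be compared.
 */
static enum counter_state counter_classify(const struct counter_sample *s)
{
	if (s->after == counter_max(s->width_bits))
		return COUNTER_SATURATED;
	if (s->after > s->before)
		return COUNTER_INCREMENTING;
	return COUNTER_IDLE;
}
```

For example, SymbolErrors at 65535 classifies as saturated, while a
LinkDowned reading that goes from 220 to 247 classifies as
incrementing.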

-- Hal

> [root at vortex3l-83 ~]# perfquery 12 9
> # Port counters: Lid 12 port 9
> PortSelect:......................9
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................2
> LinkDowned:......................255
> RcvErrors:.......................1
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................41484
> XmtDiscards:.....................4918
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................1
> XmtBytes:........................2050081143
> RcvBytes:........................4294967295
> XmtPkts:.........................14539343
> RcvPkts:.........................37028545
> [root at vortex3l-83 ~]# perfquery 12 10
> # Port counters: Lid 12 port 10
> PortSelect:......................10
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................27
> LinkDowned:......................255
> RcvErrors:.......................0
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................19936
> XmtDiscards:.....................5192
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtBytes:........................4294967295
> RcvBytes:........................4294967295
> XmtPkts:.........................1739931538
> RcvPkts:.........................1794380558
> [root at vortex3l-83 ~]# perfquery 12 11
> # Port counters: Lid 12 port 11
> PortSelect:......................11
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................0
> LinkDowned:......................255
> RcvErrors:.......................1
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................8963
> XmtDiscards:.....................5636
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtBytes:........................4294967295
> RcvBytes:........................4294967295
> XmtPkts:.........................2375935494
> RcvPkts:.........................2714377528
> [root at vortex3l-83 ~]# perfquery 12 18
> # Port counters: Lid 12 port 18
> PortSelect:......................18
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................24
> LinkDowned:......................220
> RcvErrors:.......................0
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................65535
> XmtDiscards:.....................23628
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtBytes:........................4294967295
> RcvBytes:........................4294967295
> XmtPkts:.........................604709394
> RcvPkts:.........................448409077
> [root at vortex3l-83 ~]# perfquery 12 19
> # Port counters: Lid 12 port 19
> PortSelect:......................19
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................21
> LinkDowned:......................247
> RcvErrors:.......................0
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................65535
> XmtDiscards:.....................37754
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtBytes:........................4294967295
> RcvBytes:........................4294967295
> XmtPkts:.........................3958092428
> RcvPkts:.........................3679343076
> [root at vortex3l-83 ~]#
> 
>   -VBabu


From Koen.SEGERS at VRT.BE  Tue May 22 06:43:59 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 22 May 2007 15:43:59 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D5A@OCBEXS01001.rto.be>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C03157D5D@OCBEXS01001.rto.be>

I did the iperf tests on servers with OFED-1.2-RC3.

 
It also gives the same result. Actually, it is even worse: when the
interface dies, it enters the PORT_INIT state but does not return to
PORT_ACTIVE, at least not within 10 minutes.

 
I'll give you the test script I ran:

 
ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 5001 &
ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 5002 &
ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 5003 &
ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 6001 &
ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 6002 &
ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 6003 &
ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 7001 &
ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 7002 &
ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 7003 &
ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 8001 &
ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 8002 &
ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 8003 &

sleep 5

for i in 14 15 16 17
do
        ssh 10.224.158.111 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))001 -t 120 -d -P 5 &
        ssh 10.224.158.112 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))002 -t 120 -d -P 5 &
        ssh 10.224.158.113 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))003 -t 120 -d -P 5 &
done

 
Any ideas?

 
Regards,

 
Koen


From sweitzen at cisco.com  Tue May 22 08:34:24 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 22 May 2007 08:34:24 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D5D@OCBEXS01001.rto.be>
References: <D63C0BE2D613C543B6F3305502E9784C03157D5A@OCBEXS01001.rto.be>
	<D63C0BE2D613C543B6F3305502E9784C03157D5D@OCBEXS01001.rto.be>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E1323@xmb-sjc-216.amer.cisco.com>

What server model and CPU model do you have?
 
This could be https://bugs.openfabrics.org//show_bug.cgi?id=229.  Try
setting RENICE_IB_MAD=yes in /etc/infiniband/openibd.conf, then reboot
or run /etc/init.d/openibd restart, and see if that helps.
 
Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 


From jlentini at netapp.com  Tue May 22 08:43:44 2007
From: jlentini at netapp.com (James Lentini)
Date: Tue, 22 May 2007 11:43:44 -0400 (EDT)
Subject: [ofa-general] Re: [IPoIB][RFC] remove redundant gid query
In-Reply-To: <adasl9pdhtf.fsf@cisco.com>
References: <Pine.LNX.4.64.0705081427320.27590@jlentini-linux.nane.netapp.com>
	<adazm43jqkr.fsf@cisco.com>
	<Pine.LNX.4.64.0705181357160.27590@jlentini-linux.nane.netapp.com>
	<adasl9pdhtf.fsf@cisco.com>
Message-ID: <Pine.LNX.4.64.0705221121220.27590@jlentini-linux.nane.netapp.com>


On Mon, 21 May 2007, Roland Dreier wrote:

>  > > It does look like we're doing some work we don't need to do.  However
>  > > ipoib_add_port() could run before an SM has brought up the local port,
>  > 
>  > The same could be true for ipoib_mcast_join_task()
>  > 
>  > These are both instances of the general problem that if the GID at 
>  > index 0 changes, the IPoIB code is not automatically notified. Agree?
> 
> Yes, although what is there now should be semi-OK: a multicast join
> can't succeed until the port is up, so ipoib should eventually get the
> right GID.  And I would argue that an SM that changes a port's GID
> prefix without at least generating a client reregister event is broken.

Expecting the SM to request a client reregister is reasonable.

Everything from IPoIB down seems OK.

I'm wondering about the layers above IPoIB.

When ipoib_add_port() calls register_netdev(), there is at least one 
place where the networking stack examines the dev_addr value (see 
rtnl_fill_ifinfo(), a netlink message is created with the device and 
broadcast hardware addresses).

If the GID changes, the IPoIB net_device's dev_addr changes. IPoIB 
doesn't inform the upper layers when this happens.
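
To make the concern concrete, here is a small user-space sketch (not 
kernel code) of the 20-byte IPoIB hardware address: 4 bytes of flags 
and QPN followed by the 16-byte port GID, which is why the GID lives at 
dev_addr + 4. The names gid_changed() and update_gid() are hypothetical 
and exist only for illustration.

```c
#include <stdint.h>
#include <string.h>

#define IPOIB_HW_ADDR_LEN 20  /* 4 bytes of flags + QPN, then the 16-byte GID */
#define IPOIB_GID_OFFSET   4  /* the GID starts right after flags + QPN */
#define IB_GID_LEN        16

/* Returns 1 if the GID embedded in hw_addr differs from new_gid, else 0. */
int gid_changed(const uint8_t hw_addr[IPOIB_HW_ADDR_LEN],
                const uint8_t new_gid[IB_GID_LEN])
{
    return memcmp(hw_addr + IPOIB_GID_OFFSET, new_gid, IB_GID_LEN) != 0;
}

/*
 * Copies new_gid into hw_addr and returns 1 if that changed the address.
 * A caller could use the return value to decide whether anything that
 * cached the old address needs to be told about the new one.
 */
int update_gid(uint8_t hw_addr[IPOIB_HW_ADDR_LEN],
               const uint8_t new_gid[IB_GID_LEN])
{
    int changed = gid_changed(hw_addr, new_gid);

    memcpy(hw_addr + IPOIB_GID_OFFSET, new_gid, IB_GID_LEN);
    return changed;
}
```

The first 4 bytes are left untouched by the update; only the GID portion 
of the address is compared and rewritten.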

>  > > so the GID prefix might change later.
>  > > 
>  > > I'm not sure what the best way to clean this up is.
>  > 
>  > As an aside: Why does ipoib_add_port() treat an error return from 
>  > ib_query_gid() as fatal while ipoib_mcast_join_task() only emits a 
>  > warning?
> 
> I guess because it's much easier to bail out of ipoib_add_port() than
> it is to do something intelligent in ipoib_mcast_join_task().

Would adding a warning if the GID changes be of use?

Signed-off-by: James Lentini [jlentini at netapp.com]

--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-04-25 23:08:32.000000000 -0400
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-22 11:17:52.000000000 -0400
@@ -563,11 +563,18 @@
 		if (mcast_debug_level > 0)		\
 			ipoib_printk(KERN_DEBUG, priv, format , ## arg); \
 	} while (0)
+
+#define ipoib_dbg_chkgid(priv, a, b) 				\
+	do {							\
+		if (memcmp((a), (b), sizeof (union ib_gid)))	\
+			ipoib_warn(priv, "gid changed\n");	\
+	} while (0)
 #else /* CONFIG_INFINIBAND_IPOIB_DEBUG */
 #define ipoib_dbg(priv, format, arg...)			\
 	do { (void) (priv); } while (0)
 #define ipoib_dbg_mcast(priv, format, arg...)		\
 	do { (void) (priv); } while (0)
+#define ipoib_dbg_chkgid(priv, a, b)
 #endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */
 
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-22 11:07:30.000000000 -0400
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-22 11:17:22.000000000 -0400
@@ -525,8 +525,10 @@
 
 	if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid))
 		ipoib_warn(priv, "ib_gid_entry_get() failed\n");
-	else
+	else {
+		ipoib_dbg_chkgid(priv, priv->dev->dev_addr + 4, priv->local_gid.raw);
 		memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid));
+	}
 
 	{
 		struct ib_port_attr attr;


From mshefty at ichips.intel.com  Tue May 22 09:38:18 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 22 May 2007 09:38:18 -0700
Subject: [ofa-general] cm.c and irqsave (was Re: [PATCH] ib/cm: fix stale
	connection detection)
In-Reply-To: <20070522051640.GA23066@mellanox.co.il>
References: <20070520134441.GI20649@mellanox.co.il>	<000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
	<20070522051640.GA23066@mellanox.co.il>
Message-ID: <46531C7A.3060201@ichips.intel.com>

> On an unrelated note, it looks like cm.c would benefit from an irqsave
> diet: it seems to perform work almost exclusively from thread
> context, so just spin_lock_irq is sure to be enough.

I don't think everything is done in thread context; that depends on the 
ULP. But we could definitely replace irqsave with plain irq in any of 
the message processing code.

- Sean


From xma at us.ibm.com  Tue May 22 09:45:28 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 22 May 2007 09:45:28 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D5A@OCBEXS01001.rto.be>
Message-ID: <OFD734C73D.218C5E16-ON872572E3.005BE9F6-882572E3.00617A52@us.ibm.com>


Hello Koen,

      From the switch log, it looks like an SM issue to me. The node was
kicked out of the membership. Which SM are you using in your fabric?

Thanks
Shirley Ma

From mshefty at ichips.intel.com  Tue May 22 10:12:42 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 22 May 2007 10:12:42 -0700
Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <20070522075952.GC3331@mellanox.co.il>
References: <20070520134441.GI20649@mellanox.co.il>	<000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
	<20070522075952.GC3331@mellanox.co.il>
Message-ID: <4653248A.1040108@ichips.intel.com>

> The patch looks obvious enough for 2.6.22, safe enough in that it replaces a
> timeout with a reject, and it addresses a real problem.  Sean? Roland? What do
> you think?

To make it easier, I've added the patch to:

	git://git.openfabrics.org/~shefty/rdma-dev.git for-roland

commit 2fbe169db0c6bddcc7b28d03eb51d057277ffd6a

I'm comfortable with this merging into 2.6.22 myself.

> Something that I think would be very useful for 2.6.22 already: could you please
> document which portions of chapter 12 are not currently implemented in cm.c, and
> put this in some file in kernel tree?  This way people will be able to figure
> out whether something that they need is missing, and contribute.

This isn't something that I know without comparing the code against the 
spec.

>> Also, I left the duplicate request handling
>> as it was, since that should go in as a separate patch.
> 
> Could you please describe what is missing currently?
> Is the missing handling likely to cause timeouts?

If two REQs are received with matching local IDs, but the REQs 
themselves differ on one or more fields, such as the QPN, the second REQ 
is dropped as a duplicate.  This causes timeouts, so I need to figure 
out what the correct behavior should be here.

- Sean


From weiny2 at llnl.gov  Tue May 22 10:23:27 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 22 May 2007 10:23:27 -0700
Subject: [ofa-general] [PATCH] ib_types.h: Change macros to convert from
 "host" byte order to "network"
Message-ID: <20070522102327.0cea4153.weiny2@llnl.gov>

>From 7e53267d5bc9389f5f1a4dae3a2d290c69c6e1d4 Mon Sep 17 00:00:00 2001
From: Ira K. Weiny <weiny2 at llnl.gov>
Date: Tue, 24 Apr 2007 16:07:19 -0700
Subject: [PATCH] Change macros to convert from "host" byte order to "network"

   Although the macros CL_HTON* and CL_NTOH* are defined to be the same
   operation it is technically incorrect to convert a constant from network
   byte order.  The constant should be converted from host byte order to
   network byte order.

Signed-off-by: Ira K. Weiny <weiny2 at llnl.gov>
---
 opensm/include/iba/ib_types.h |  180 ++++++++++++++++++++--------------------
 1 files changed, 90 insertions(+), 90 deletions(-)

diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index aee7024..f6e85a4 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -157,13 +157,13 @@ BEGIN_C_DECLS
 *
 * SOURCE
 */
-#define IB_QP1_WELL_KNOWN_Q_KEY				CL_NTOH32(0x80010000)
+#define IB_QP1_WELL_KNOWN_Q_KEY				CL_HTON32(0x80010000)
 /*********/
 
 #define IB_QP0								0
-#define IB_QP1								CL_NTOH32(1)
+#define IB_QP1								CL_HTON32(1)
 
-#define IB_QP_PRIVILEGED_Q_KEY				CL_NTOH32(0x80000000)
+#define IB_QP_PRIVILEGED_Q_KEY				CL_HTON32(0x80000000)
 
 /****d* IBA Base: Constants/IB_LID_UCAST_START
 * NAME
@@ -405,7 +405,7 @@ BEGIN_C_DECLS
 *
 * SOURCE
 */
-#define IB_PKEY_TYPE_MASK					(CL_NTOH16(0x8000))
+#define IB_PKEY_TYPE_MASK					(CL_HTON16(0x8000))
 /*********/
 
 /****d* IBA Base: Constants/IB_DEFAULT_PARTIAL_PKEY
@@ -967,7 +967,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_CLASS_PORT_INFO			(CL_NTOH16(0x0001))
+#define IB_MAD_ATTR_CLASS_PORT_INFO			(CL_HTON16(0x0001))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_NOTICE
@@ -979,7 +979,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_NOTICE					(CL_NTOH16(0x0002))
+#define IB_MAD_ATTR_NOTICE					(CL_HTON16(0x0002))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_INFORM_INFO
@@ -991,7 +991,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_INFORM_INFO				(CL_NTOH16(0x0003))
+#define IB_MAD_ATTR_INFORM_INFO				(CL_HTON16(0x0003))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_NODE_DESC
@@ -1003,7 +1003,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_NODE_DESC				(CL_NTOH16(0x0010))
+#define IB_MAD_ATTR_NODE_DESC				(CL_HTON16(0x0010))
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_PORT_SMPL_CTRL
 * NAME
@@ -1014,7 +1014,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_PORT_SMPL_CTRL			(CL_NTOH16(0x0010))
+#define IB_MAD_ATTR_PORT_SMPL_CTRL			(CL_HTON16(0x0010))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_NODE_INFO
@@ -1026,7 +1026,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_NODE_INFO				(CL_NTOH16(0x0011))
+#define IB_MAD_ATTR_NODE_INFO				(CL_HTON16(0x0011))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_PORT_SMPL_RSLT
@@ -1038,7 +1038,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_PORT_SMPL_RSLT			(CL_NTOH16(0x0011))
+#define IB_MAD_ATTR_PORT_SMPL_RSLT			(CL_HTON16(0x0011))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_SWITCH_INFO
@@ -1050,7 +1050,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_SWITCH_INFO				(CL_NTOH16(0x0012))
+#define IB_MAD_ATTR_SWITCH_INFO				(CL_HTON16(0x0012))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_PORT_CNTRS
@@ -1062,7 +1062,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_PORT_CNTRS				(CL_NTOH16(0x0012))
+#define IB_MAD_ATTR_PORT_CNTRS				(CL_HTON16(0x0012))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_GUID_INFO
@@ -1074,7 +1074,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_GUID_INFO				(CL_NTOH16(0x0014))
+#define IB_MAD_ATTR_GUID_INFO				(CL_HTON16(0x0014))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_PORT_INFO
@@ -1086,7 +1086,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_PORT_INFO				(CL_NTOH16(0x0015))
+#define IB_MAD_ATTR_PORT_INFO				(CL_HTON16(0x0015))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_P_KEY_TABLE
@@ -1098,7 +1098,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_P_KEY_TABLE				(CL_NTOH16(0x0016))
+#define IB_MAD_ATTR_P_KEY_TABLE				(CL_HTON16(0x0016))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_SLVL_TABLE
@@ -1110,7 +1110,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_SLVL_TABLE				(CL_NTOH16(0x0017))
+#define IB_MAD_ATTR_SLVL_TABLE				(CL_HTON16(0x0017))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_VL_ARBITRATION
@@ -1122,7 +1122,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_VL_ARBITRATION			(CL_NTOH16(0x0018))
+#define IB_MAD_ATTR_VL_ARBITRATION			(CL_HTON16(0x0018))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_LIN_FWD_TBL
@@ -1134,7 +1134,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_LIN_FWD_TBL				(CL_NTOH16(0x0019))
+#define IB_MAD_ATTR_LIN_FWD_TBL				(CL_HTON16(0x0019))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_RND_FWD_TBL
@@ -1146,7 +1146,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_RND_FWD_TBL				(CL_NTOH16(0x001A))
+#define IB_MAD_ATTR_RND_FWD_TBL				(CL_HTON16(0x001A))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_MCAST_FWD_TBL
@@ -1158,7 +1158,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_MCAST_FWD_TBL			(CL_NTOH16(0x001B))
+#define IB_MAD_ATTR_MCAST_FWD_TBL			(CL_HTON16(0x001B))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_NODE_RECORD
@@ -1170,7 +1170,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_NODE_RECORD				(CL_NTOH16(0x0011))
+#define IB_MAD_ATTR_NODE_RECORD				(CL_HTON16(0x0011))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_PORTINFO_RECORD
@@ -1182,7 +1182,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_PORTINFO_RECORD			(CL_NTOH16(0x0012))
+#define IB_MAD_ATTR_PORTINFO_RECORD			(CL_HTON16(0x0012))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_SWITCH_INFO_RECORD
@@ -1194,7 +1194,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_SWITCH_INFO_RECORD			(CL_NTOH16(0x0014))
+#define IB_MAD_ATTR_SWITCH_INFO_RECORD			(CL_HTON16(0x0014))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_LINK_RECORD
@@ -1206,7 +1206,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_LINK_RECORD				(CL_NTOH16(0x0020))
+#define IB_MAD_ATTR_LINK_RECORD				(CL_HTON16(0x0020))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_SM_INFO
@@ -1218,7 +1218,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_SM_INFO				(CL_NTOH16(0x0020))
+#define IB_MAD_ATTR_SM_INFO				(CL_HTON16(0x0020))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_SMINFO_RECORD
@@ -1230,7 +1230,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_SMINFO_RECORD			(CL_NTOH16(0x0018))
+#define IB_MAD_ATTR_SMINFO_RECORD			(CL_HTON16(0x0018))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_GUIDINFO_RECORD
@@ -1242,7 +1242,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_GUIDINFO_RECORD			(CL_NTOH16(0x0030))
+#define IB_MAD_ATTR_GUIDINFO_RECORD			(CL_HTON16(0x0030))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_VENDOR_DIAG
@@ -1254,7 +1254,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_VENDOR_DIAG				(CL_NTOH16(0x0030))
+#define IB_MAD_ATTR_VENDOR_DIAG				(CL_HTON16(0x0030))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_LED_INFO
@@ -1266,7 +1266,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_LED_INFO				(CL_NTOH16(0x0031))
+#define IB_MAD_ATTR_LED_INFO				(CL_HTON16(0x0031))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_SERVICE_RECORD
@@ -1278,7 +1278,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_SERVICE_RECORD			(CL_NTOH16(0x0031))
+#define IB_MAD_ATTR_SERVICE_RECORD			(CL_HTON16(0x0031))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_LFT_RECORD
@@ -1290,7 +1290,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_LFT_RECORD				(CL_NTOH16(0x0015))
+#define IB_MAD_ATTR_LFT_RECORD				(CL_HTON16(0x0015))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_MFT_RECORD
@@ -1302,7 +1302,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_MFT_RECORD				(CL_NTOH16(0x0017))
+#define IB_MAD_ATTR_MFT_RECORD				(CL_HTON16(0x0017))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_PKEYTBL_RECORD
@@ -1314,7 +1314,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_PKEY_TBL_RECORD			(CL_NTOH16(0x0033))
+#define IB_MAD_ATTR_PKEY_TBL_RECORD			(CL_HTON16(0x0033))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_PATH_RECORD
@@ -1326,7 +1326,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_PATH_RECORD				(CL_NTOH16(0x0035))
+#define IB_MAD_ATTR_PATH_RECORD				(CL_HTON16(0x0035))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_VLARB_RECORD
@@ -1338,7 +1338,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_VLARB_RECORD			(CL_NTOH16(0x0036))
+#define IB_MAD_ATTR_VLARB_RECORD			(CL_HTON16(0x0036))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_SLVL_RECORD
@@ -1350,7 +1350,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_SLVL_RECORD				(CL_NTOH16(0x0013))
+#define IB_MAD_ATTR_SLVL_RECORD				(CL_HTON16(0x0013))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_MCMEMBER_RECORD
@@ -1362,7 +1362,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_MCMEMBER_RECORD			(CL_NTOH16(0x0038))
+#define IB_MAD_ATTR_MCMEMBER_RECORD			(CL_HTON16(0x0038))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_TRACE_RECORD
@@ -1374,7 +1374,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_TRACE_RECORD			(CL_NTOH16(0x0039))
+#define IB_MAD_ATTR_TRACE_RECORD			(CL_HTON16(0x0039))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_MULTIPATH_RECORD
@@ -1386,7 +1386,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_MULTIPATH_RECORD			(CL_NTOH16(0x003A))
+#define IB_MAD_ATTR_MULTIPATH_RECORD			(CL_HTON16(0x003A))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_SVC_ASSOCIATION_RECORD
@@ -1398,7 +1398,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_SVC_ASSOCIATION_RECORD		(CL_NTOH16(0x003B))
+#define IB_MAD_ATTR_SVC_ASSOCIATION_RECORD		(CL_HTON16(0x003B))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_INFORM_INFO_RECORD
@@ -1410,7 +1410,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_INFORM_INFO_RECORD			(CL_NTOH16(0x00F3))
+#define IB_MAD_ATTR_INFORM_INFO_RECORD			(CL_HTON16(0x00F3))
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_IO_UNIT_INFO
 * NAME
@@ -1421,7 +1421,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_IO_UNIT_INFO			(CL_NTOH16(0x0010))
+#define IB_MAD_ATTR_IO_UNIT_INFO			(CL_HTON16(0x0010))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_IO_CONTROLLER_PROFILE
@@ -1433,7 +1433,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_IO_CONTROLLER_PROFILE	(CL_NTOH16(0x0011))
+#define IB_MAD_ATTR_IO_CONTROLLER_PROFILE	(CL_HTON16(0x0011))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_SERVICE_ENTRIES
@@ -1445,7 +1445,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_SERVICE_ENTRIES			(CL_NTOH16(0x0012))
+#define IB_MAD_ATTR_SERVICE_ENTRIES			(CL_HTON16(0x0012))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_DIAGNOSTIC_TIMEOUT
@@ -1457,7 +1457,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_DIAGNOSTIC_TIMEOUT		(CL_NTOH16(0x0020))
+#define IB_MAD_ATTR_DIAGNOSTIC_TIMEOUT		(CL_HTON16(0x0020))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_PREPARE_TO_TEST
@@ -1469,7 +1469,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_PREPARE_TO_TEST			(CL_NTOH16(0x0021))
+#define IB_MAD_ATTR_PREPARE_TO_TEST			(CL_HTON16(0x0021))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_TEST_DEVICE_ONCE
@@ -1481,7 +1481,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_TEST_DEVICE_ONCE		(CL_NTOH16(0x0022))
+#define IB_MAD_ATTR_TEST_DEVICE_ONCE		(CL_HTON16(0x0022))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_TEST_DEVICE_LOOP
@@ -1493,7 +1493,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_TEST_DEVICE_LOOP		(CL_NTOH16(0x0023))
+#define IB_MAD_ATTR_TEST_DEVICE_LOOP		(CL_HTON16(0x0023))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_DIAG_CODE
@@ -1505,7 +1505,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_DIAG_CODE				(CL_NTOH16(0x0024))
+#define IB_MAD_ATTR_DIAG_CODE				(CL_HTON16(0x0024))
 /**********/
 
 /****d* IBA Base: Constants/IB_MAD_ATTR_SVC_ASSOCIATION_RECORD
@@ -1517,7 +1517,7 @@ ib_class_is_rmpp(
 *
 * SOURCE
 */
-#define IB_MAD_ATTR_SVC_ASSOCIATION_RECORD	(CL_NTOH16(0x003B))
+#define IB_MAD_ATTR_SVC_ASSOCIATION_RECORD	(CL_HTON16(0x003B))
 /**********/
 
 /****d* IBA Base: Constants/IB_NODE_TYPE_CA
@@ -4084,8 +4084,8 @@ ib_sa_mad_get_payload_ptr(
 *	ib_mad_t
 *********/
 
-#define IB_NODE_INFO_PORT_NUM_MASK		(CL_NTOH32(0xFF000000))
-#define IB_NODE_INFO_VEND_ID_MASK		(CL_NTOH32(0x00FFFFFF))
+#define IB_NODE_INFO_PORT_NUM_MASK		(CL_HTON32(0xFF000000))
+#define IB_NODE_INFO_VEND_ID_MASK		(CL_HTON32(0x00FFFFFF))
 #if CPU_LE
 	#define IB_NODE_INFO_PORT_NUM_SHIFT 0
 #else
@@ -4246,38 +4246,38 @@ typedef struct _ib_port_info
 #define IB_PORT_PHYS_STATE_PHYTEST	        7
 #define IB_PORT_LNKDWNDFTSTATE_MASK		0x0F
 
-#define IB_PORT_CAP_RESV0         (CL_NTOH32(0x00000001))
-#define IB_PORT_CAP_IS_SM         (CL_NTOH32(0x00000002))
-#define IB_PORT_CAP_HAS_NOTICE    (CL_NTOH32(0x00000004))
-#define IB_PORT_CAP_HAS_TRAP      (CL_NTOH32(0x00000008))
-#define IB_PORT_CAP_HAS_IPD       (CL_NTOH32(0x00000010))
-#define IB_PORT_CAP_HAS_AUTO_MIG  (CL_NTOH32(0x00000020))
-#define IB_PORT_CAP_HAS_SL_MAP    (CL_NTOH32(0x00000040))
-#define IB_PORT_CAP_HAS_NV_MKEY   (CL_NTOH32(0x00000080))
-#define IB_PORT_CAP_HAS_NV_PKEY   (CL_NTOH32(0x00000100))
-#define IB_PORT_CAP_HAS_LED_INFO  (CL_NTOH32(0x00000200))
-#define IB_PORT_CAP_SM_DISAB      (CL_NTOH32(0x00000400))
-#define IB_PORT_CAP_HAS_SYS_IMG_GUID  (CL_NTOH32(0x00000800))
-#define IB_PORT_CAP_HAS_PKEY_SW_EXT_PORT_TRAP (CL_NTOH32(0x00001000))
-#define IB_PORT_CAP_RESV13        (CL_NTOH32(0x00002000))
-#define IB_PORT_CAP_RESV14        (CL_NTOH32(0x00004000))
-#define IB_PORT_CAP_RESV15        (CL_NTOH32(0x00008000))
-#define IB_PORT_CAP_HAS_COM_MGT   (CL_NTOH32(0x00010000))
-#define IB_PORT_CAP_HAS_SNMP      (CL_NTOH32(0x00020000))
-#define IB_PORT_CAP_REINIT        (CL_NTOH32(0x00040000))
-#define IB_PORT_CAP_HAS_DEV_MGT   (CL_NTOH32(0x00080000))
-#define IB_PORT_CAP_HAS_VEND_CLS  (CL_NTOH32(0x00100000))
-#define IB_PORT_CAP_HAS_DR_NTC    (CL_NTOH32(0x00200000))
-#define IB_PORT_CAP_HAS_CAP_NTC   (CL_NTOH32(0x00400000))
-#define IB_PORT_CAP_HAS_BM        (CL_NTOH32(0x00800000))
-#define IB_PORT_CAP_HAS_LINK_RT_LATENCY (CL_NTOH32(0x01000000))
-#define IB_PORT_CAP_HAS_CLIENT_REREG (CL_NTOH32(0x02000000))
-#define IB_PORT_CAP_RESV26        (CL_NTOH32(0x04000000))
-#define IB_PORT_CAP_RESV27        (CL_NTOH32(0x08000000))
-#define IB_PORT_CAP_RESV28        (CL_NTOH32(0x10000000))
-#define IB_PORT_CAP_RESV29        (CL_NTOH32(0x20000000))
-#define IB_PORT_CAP_RESV30        (CL_NTOH32(0x40000000))
-#define IB_PORT_CAP_RESV31        (CL_NTOH32(0x80000000))
+#define IB_PORT_CAP_RESV0         (CL_HTON32(0x00000001))
+#define IB_PORT_CAP_IS_SM         (CL_HTON32(0x00000002))
+#define IB_PORT_CAP_HAS_NOTICE    (CL_HTON32(0x00000004))
+#define IB_PORT_CAP_HAS_TRAP      (CL_HTON32(0x00000008))
+#define IB_PORT_CAP_HAS_IPD       (CL_HTON32(0x00000010))
+#define IB_PORT_CAP_HAS_AUTO_MIG  (CL_HTON32(0x00000020))
+#define IB_PORT_CAP_HAS_SL_MAP    (CL_HTON32(0x00000040))
+#define IB_PORT_CAP_HAS_NV_MKEY   (CL_HTON32(0x00000080))
+#define IB_PORT_CAP_HAS_NV_PKEY   (CL_HTON32(0x00000100))
+#define IB_PORT_CAP_HAS_LED_INFO  (CL_HTON32(0x00000200))
+#define IB_PORT_CAP_SM_DISAB      (CL_HTON32(0x00000400))
+#define IB_PORT_CAP_HAS_SYS_IMG_GUID  (CL_HTON32(0x00000800))
+#define IB_PORT_CAP_HAS_PKEY_SW_EXT_PORT_TRAP (CL_HTON32(0x00001000))
+#define IB_PORT_CAP_RESV13        (CL_HTON32(0x00002000))
+#define IB_PORT_CAP_RESV14        (CL_HTON32(0x00004000))
+#define IB_PORT_CAP_RESV15        (CL_HTON32(0x00008000))
+#define IB_PORT_CAP_HAS_COM_MGT   (CL_HTON32(0x00010000))
+#define IB_PORT_CAP_HAS_SNMP      (CL_HTON32(0x00020000))
+#define IB_PORT_CAP_REINIT        (CL_HTON32(0x00040000))
+#define IB_PORT_CAP_HAS_DEV_MGT   (CL_HTON32(0x00080000))
+#define IB_PORT_CAP_HAS_VEND_CLS  (CL_HTON32(0x00100000))
+#define IB_PORT_CAP_HAS_DR_NTC    (CL_HTON32(0x00200000))
+#define IB_PORT_CAP_HAS_CAP_NTC   (CL_HTON32(0x00400000))
+#define IB_PORT_CAP_HAS_BM        (CL_HTON32(0x00800000))
+#define IB_PORT_CAP_HAS_LINK_RT_LATENCY (CL_HTON32(0x01000000))
+#define IB_PORT_CAP_HAS_CLIENT_REREG (CL_HTON32(0x02000000))
+#define IB_PORT_CAP_RESV26        (CL_HTON32(0x04000000))
+#define IB_PORT_CAP_RESV27        (CL_HTON32(0x08000000))
+#define IB_PORT_CAP_RESV28        (CL_HTON32(0x10000000))
+#define IB_PORT_CAP_RESV29        (CL_HTON32(0x20000000))
+#define IB_PORT_CAP_RESV30        (CL_HTON32(0x40000000))
+#define IB_PORT_CAP_RESV31        (CL_HTON32(0x80000000))
 
 /****f* IBA Base: Types/ib_port_info_get_port_state
 * NAME
@@ -10208,7 +10208,7 @@ typedef uint32_t						ib_mr_mod_t;
 *
 * SOURCE
 */
-#define IB_SMINFO_ATTR_MOD_HANDOVER		(CL_NTOH32(0x000001))
+#define IB_SMINFO_ATTR_MOD_HANDOVER		(CL_HTON32(0x000001))
 /**********/
 
 /****d* IBA Base: Constants/IB_SMINFO_ATTR_MOD_ACKNOWLEDGE
@@ -10220,7 +10220,7 @@ typedef uint32_t						ib_mr_mod_t;
 *
 * SOURCE
 */
-#define IB_SMINFO_ATTR_MOD_ACKNOWLEDGE		(CL_NTOH32(0x000002))
+#define IB_SMINFO_ATTR_MOD_ACKNOWLEDGE		(CL_HTON32(0x000002))
 /**********/
 
 /****d* IBA Base: Constants/IB_SMINFO_ATTR_MOD_DISABLE
@@ -10232,7 +10232,7 @@ typedef uint32_t						ib_mr_mod_t;
 *
 * SOURCE
 */
-#define IB_SMINFO_ATTR_MOD_DISABLE			(CL_NTOH32(0x000003))
+#define IB_SMINFO_ATTR_MOD_DISABLE			(CL_HTON32(0x000003))
 /**********/
 
 /****d* IBA Base: Constants/IB_SMINFO_ATTR_MOD_STANDBY
@@ -10244,7 +10244,7 @@ typedef uint32_t						ib_mr_mod_t;
 *
 * SOURCE
 */
-#define IB_SMINFO_ATTR_MOD_STANDBY			(CL_NTOH32(0x000004))
+#define IB_SMINFO_ATTR_MOD_STANDBY			(CL_HTON32(0x000004))
 /**********/
 
 /****d* IBA Base: Constants/IB_SMINFO_ATTR_MOD_DISCOVER
@@ -10256,7 +10256,7 @@ typedef uint32_t						ib_mr_mod_t;
 *
 * SOURCE
 */
-#define IB_SMINFO_ATTR_MOD_DISCOVER			(CL_NTOH32(0x000005))
+#define IB_SMINFO_ATTR_MOD_DISCOVER			(CL_HTON32(0x000005))
 /**********/
 
 /****s* Access Layer/ib_ci_op_t
-- 
1.4.4
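
For readers unfamiliar with the distinction the commit message draws, 
here is a small user-space sketch. It assumes the POSIX htonl()/ntohl() 
functions as stand-ins for the CL_HTON32/CL_NTOH32 macros, and the 
helper names are hypothetical. Both directions perform the same 
operation, so the choice of name only documents intent: a constant 
written in source code is in host order and is converted *to* network 
order.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Host-order constant, exactly as it is written in the source. */
#define QP1_WELL_KNOWN_Q_KEY_HOST 0x80010000u

/* Convert the host-order constant to wire (network) byte order. */
uint32_t qkey_to_wire(void)
{
    return htonl(QP1_WELL_KNOWN_Q_KEY_HOST);
}

/* Convert a value read off the wire back to host byte order. */
uint32_t qkey_from_wire(uint32_t wire)
{
    return ntohl(wire);
}
```

On any platform the bytes of qkey_to_wire() in memory are 80 01 00 00, 
and qkey_from_wire(qkey_to_wire()) gives back the original constant.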



From rdreier at cisco.com  Tue May 22 11:09:46 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 22 May 2007 11:09:46 -0700
Subject: [ofa-general] Re: skb queue management in ipoib
In-Reply-To: <20070522063634.GB3331@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 22 May 2007 09:36:34 +0300")
References: <20070522063634.GB3331@mellanox.co.il>
Message-ID: <adafy5obuhh.fsf@cisco.com>

 > 	I think that managing this queue in a FIFO manner, dropping
 > 	old packets and inserting new ones instead would be better:
 > 	an older packet has more chance to have been timed out.

Yes, that probably makes sense.

 > 	So we would do something along the lines of:
 > 
 >                        __skb_queue_tail(&neigh->queue, skb);
 >                        if (skb_queue_len(&neigh->queue) > IPOIB_MAX_PATH_REC_QUEUE) {
 >                                 skb = __skb_dequeue_tail(&neigh->queue);

this should just be __skb_dequeue though...

 >                                 ipoib_warn(priv, "queue length limit %d. Packet drop.\n",
 >                                            skb_queue_len(&neigh->queue));
 >                                 goto err_drop;
 >                        }
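
To spell out the head-versus-tail point, here is a user-space sketch of 
a bounded FIFO that drops the oldest entry when it overflows. It is not 
the ipoib code: struct pkt, the helper names and the limit of 3 are all 
hypothetical. The new packet goes on the tail, and when the limit is 
exceeded the victim comes off the head, which is what __skb_dequeue 
does. Taking it off the tail would just remove the packet that was 
queued a moment ago.

```c
#include <stddef.h>

#define MAX_QUEUE 3  /* hypothetical limit, stands in for IPOIB_MAX_PATH_REC_QUEUE */

struct pkt {
    int id;
    struct pkt *next;
};

struct pkt_queue {
    struct pkt *head;  /* oldest packet */
    struct pkt *tail;  /* newest packet */
    size_t len;
};

/* Append p at the tail of the queue. */
void queue_tail(struct pkt_queue *q, struct pkt *p)
{
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    q->len++;
}

/* Remove and return the packet at the head, or NULL if the queue is empty. */
struct pkt *dequeue_head(struct pkt_queue *q)
{
    struct pkt *p = q->head;

    if (!p)
        return NULL;
    q->head = p->next;
    if (!q->head)
        q->tail = NULL;
    p->next = NULL;
    q->len--;
    return p;
}

/*
 * Queue p. If the queue is now over the limit, return the evicted oldest
 * packet so the caller can drop it; otherwise return NULL.
 */
struct pkt *enqueue_drop_oldest(struct pkt_queue *q, struct pkt *p)
{
    queue_tail(q, p);
    if (q->len > MAX_QUEUE)
        return dequeue_head(q);
    return NULL;
}
```

With this shape the queue never holds more than MAX_QUEUE packets, and 
the packet most likely to have timed out is the one that gets dropped.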


From koen.segers at vrt.be  Tue May 22 11:17:25 2007
From: koen.segers at vrt.be (Koen Segers)
Date: Tue, 22 May 2007 20:17:25 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E1323@xmb-sjc-216.amer.cisco.com>
References: <D63C0BE2D613C543B6F3305502E9784C03157D5A@OCBEXS01001.rto.be>
	<D63C0BE2D613C543B6F3305502E9784C03157D5D@OCBEXS01001.rto.be>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3038E1323@xmb-sjc-216.amer.cisco.com>
Message-ID: <1179857845.9528.6.camel@KOEN>

On Tue, 2007-05-22 at 08:34 -0700, Scott Weitzenkamp (sweitzen) wrote:
> What server model and CPU model do you have?

cat /proc/cpuinfo
processor       : 7
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 8218
stepping        : 2
cpu MHz         : 2600.202
cache size      : 1024 KB
physical id     : 3
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips        : 5200.54
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

>  
> This could be https://bugs.openfabrics.org//show_bug.cgi?id=229.  Try
> setting RENICE_IB_MAD=yes in /etc/infiniband/openibd.conf, then reboot
> or run /etc/init.d/openibd restart, and see if that helps.

AHA, this is interesting. I'll do it tomorrow!

>  
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>  
> 
>         
>         ______________________________________________________________
>         From: general-bounces at lists.openfabrics.org
>         [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
>         SEGERS Koen
>         Sent: Tuesday, May 22, 2007 6:44 AM
>         To: Ami Perlmutter; Shirley Ma
>         Cc: general-bounces at lists.openfabrics.org;
>         general at lists.openfabrics.org
>         Subject: RE: [ofa-general] GPFS node loses IB-connection
>         
>         
>         
>         I did the iperf tests on servers with OFED-1.2-RC3.
>         
>          
>         
>         It also gives the same result. Actually, it is even worse:
>         when the interface dies, it goes into the PORT_INIT state, but
>         it doesn't return to PORT_ACTIVE. At least not within 10
>         minutes.
>         
>          
>         
>         I'll give you the test script I ran:
>         
>          
>         
>         ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf
>         -s -p 5001 &
>         
>         ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf
>         -s -p 5002 &
>         
>         ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf
>         -s -p 5003 &
>         
>         ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf
>         -s -p 6001 &
>         
>         ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf
>         -s -p 6002 &
>         
>         ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf
>         -s -p 6003 &
>         
>         ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf
>         -s -p 7001 &
>         
>         ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf
>         -s -p 7002 &
>         
>         ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf
>         -s -p 7003 &
>         
>         ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf
>         -s -p 8001 &
>         
>         ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf
>         -s -p 8002 &
>         
>         ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf
>         -s -p 8003 &
>         
>          
>         
>         sleep 5
>         
>          
>         
>         for i in 14 15 16 17
>         
>         do
>         
>                 ssh 10.224.158.111 LD_PRELOAD=libsdp.so
>         SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))001 -t 120
>         -d -P 5 &
>         
>                 ssh 10.224.158.112 LD_PRELOAD=libsdp.so
>         SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))002 -t 120
>         -d -P 5 &
>         
>                 ssh 10.224.158.113 LD_PRELOAD=libsdp.so
>         SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))003 -t 120
>         -d -P 5 &
>         
>         done
>         
>          
>         
>         Any ideas?
>         
>          
>         
>         Regards,
>         
>          
>         
>         Koen
>         
>                                        
>         ______________________________________________________________
>         From: general-bounces at lists.openfabrics.org
>         [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
>         SEGERS Koen
>         Sent: Tuesday, May 22, 2007 10:55
>         To: Ami Perlmutter; Shirley Ma
>         Cc: general-bounces at lists.openfabrics.org;
>         general at lists.openfabrics.org
>         Subject: RE: [ofa-general] GPFS node loses IB-connection
>         
>         
>          
>         
>         GPFS keeps its connection constantly open.
>         
>          
>         
>         We did some more tests with iperf:
>         
>         If we don't run bidirectional tests, all connections keep
>         running smoothly. If we add bidirectional tests, they become
>         unstable, especially when this is done on multiple nodes. Is
>         this normal?
>         
>          
>         
>         The failed iperf tests give the same error in the switch log:
>         
>         May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO:
>         Generate SM OUT_OF_SERVICE trap for
>         GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71
>         
>         May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO:
>         Generate SM DELETE_MC_GROUP trap for
>         GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71
>         
>         May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO:
>         Configuration caused by discovering removed ports
>         
>         May 22 08:15:00 topspin-120sc ib_sm.x[621]: %IB-6-INFO:
>         Program switch port state to down,
>         node=00:05:ad:00:00:0b:a2:cc, port= 6, due to non-responding
>         CA
>         
>         May 22 08:15:00 topspin-120sc port_mgr.x[497]: %PORT-6-INFO:
>         port down - port=1/6, type=ib4xTXP
>         
>         May 22 08:15:00 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO:
>         in portTblFindEntry() - IfIndex=70(1/6)
>         
>         May 22 08:15:00 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO:
>         cannot find entry - IfIndex=70(1/6)
>         
>         May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO:
>         Configuration caused by discovering new ports
>         
>         May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO:
>         Configuration caused by multicast membership change
>         
>         May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO:
>         Generate SM IN_SERVICE trap for
>         GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71
>         
>         May 22 08:15:05 topspin-120sc port_mgr.x[497]: %PORT-6-INFO:
>         port up - port=1/6, type=ib4xTXP
>         
>         May 22 08:15:07 topspin-120sc ib_sm.x[632]: %IB-6-INFO:
>         Generate SM CREATE_MC_GROUP trap for
>         GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71
>         
>         May 22 08:15:08 topspin-120sc ib_sm.x[618]: %IB-6-INFO:
>         Configuration caused by multicast membership change
>         
>          
>         
>         RC3 is just installed. Results will follow soon.
>         
>          
>         
>         Regards,
>         
>          
>         
>         Koen
>         
>          
>         
>                                        
>         ______________________________________________________________
>         From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
>         Sent: Tuesday, May 22, 2007 10:33
>         To: Shirley Ma
>         CC: SEGERS Koen; general-bounces at lists.openfabrics.org;
>         general at lists.openfabrics.org
>         Subject: Re: [ofa-general] GPFS node loses IB-connection
>         
>         
>          
>         
>         does the application constantly open and close connections? 
>         
>         *** Disclaimer ***
>         
>         Vlaamse Radio- en Televisieomroep
>         Auguste Reyerslaan 52, 1043 Brussel
>         
>         nv van publiek recht
>         BTW BE 0244.142.664
>         RPR Brussel
>         http://www.vrt.be/disclaimer
>         
>         

From koen.segers at vrt.be  Tue May 22 11:14:46 2007
From: koen.segers at vrt.be (Koen Segers)
Date: Tue, 22 May 2007 20:14:46 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <OFD734C73D.218C5E16-ON872572E3.005BE9F6-882572E3.00617A52@us.ibm.com>
References: <OFD734C73D.218C5E16-ON872572E3.005BE9F6-882572E3.00617A52@us.ibm.com>
Message-ID: <1179857686.9528.3.camel@KOEN>

Hi,

It is the Cisco SM. 

SFS-7000P> show version


================================================================================
                           System Version Information
================================================================================
           system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
                  contact : tac at cisco.com
                     name : SFS-7000P
                 location : 170 West Tasman Drive, San Jose, CA 95134
                  up-time : 11(d):7(h):49(m):3(s)
              last-change : none
         last-config-save : none
                   action : none
                   result : none
                oper-mode : normal

There is also a command that gives the SM version, but I can't find it
right now. 

On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> Hello Koen,
> 
> From the switch log, it looks like an SM issue to me. The node was kicked
> out of the membership. Which SM are you using in your fabric? 
> 
> Thanks
> Shirley Ma
> 

From xma at us.ibm.com  Tue May 22 11:29:43 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 22 May 2007 11:29:43 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1179857686.9528.3.camel@KOEN>
Message-ID: <OF0C5ACB99.4E82D6F6-ON872572E3.00650C9C-882572E3.006AFED0@us.ibm.com>


Koen,

      So most likely you hit the same bug as 229 (which Scott pointed out
earlier). The same workaround, renicing ib_mad as Scott suggested, might
work for you.

      I think this should be a tunable SM query timeout value in the Cisco SM.
Am I right, Scott?

Thanks
Shirley Ma



From halr at voltaire.com  Tue May 22 11:30:07 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 22 May 2007 14:30:07 -0400
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <OFD734C73D.218C5E16-ON872572E3.005BE9F6-882572E3.00617A52@us.ibm.com>
References: <OFD734C73D.218C5E16-ON872572E3.005BE9F6-882572E3.00617A52@us.ibm.com>
Message-ID: <1179858607.16831.20544.camel@hal.voltaire.com>

On Tue, 2007-05-22 at 12:45, Shirley Ma wrote:
> Hello Koen,
> 
> From the switch log, it looks like an SM issue to me. The node was kicked
> out of the membership.

Are you referring to the following messages:

May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71

May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71

followed later by:

May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71

May 22 08:15:05 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port up - port=1/6, type=ib4xTXP

May 22 08:15:07 topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71

This is the IPv6 SNM (solicited-node multicast) group for that node, which
is coming and going as that node comes and goes (see the port up and down
events in the log as well).

-- Hal

>  Which SM are you using in your fabric? 

> Thanks
> Shirley Ma
> 
> ______________________________________________________________________
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From umaxx at oleco.net  Tue May 22 12:04:30 2007
From: umaxx at oleco.net (Joerg Zinke)
Date: Tue, 22 May 2007 21:04:30 +0200
Subject: [ofa-general] mmap() and ibv_reg_mr() and RDMA
Message-ID: <20070522210430.5df75050@marvin.local>

> resend, first try did not arrive

Hi,

I want to do RDMA-write on mmap'ed memory,
but it fails to register the memory region.

Is there something special needed to use ibv_reg_mr() on memory which I got
from mmap()?
It works fine with plain allocated memory (with memalign()).
The memory is page-aligned in both cases.

Regards,

Joerg


From sweitzen at cisco.com  Tue May 22 12:59:27 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 22 May 2007 12:59:27 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <OF0C5ACB99.4E82D6F6-ON872572E3.00650C9C-882572E3.006AFED0@us.ibm.com>
References: <1179857686.9528.3.camel@KOEN>
	<OF0C5ACB99.4E82D6F6-ON872572E3.00650C9C-882572E3.006AFED0@us.ibm.com>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E14DD@xmb-sjc-216.amer.cisco.com>

Yes, you can tune it.  Here's an example via the switch CLI:
 
SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>

The default is 10 seconds; it can be configured up to 2000 seconds.  If
an HCA is completely unresponsive for longer than the node-timeout value,
then we consider that HCA out of service.
 
Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 


From sweitzen at cisco.com  Tue May 22 13:23:16 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 22 May 2007 13:23:16 -0700
Subject: [ofa-general] What causes "ib0: packet len 65520 (> 2048) too long
	to send, dropping" messages?
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E14FD@xmb-sjc-216.amer.cisco.com>

I see a small number of these types of messages, when I send large
messages via IP multicast.
 
Why do I only see a few of the messages?
 
Scott

From rdreier at cisco.com  Tue May 22 13:43:58 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 22 May 2007 13:43:58 -0700
Subject: [ofa-general] Re: [PATCH RFC] IB/ipoib: fix to_ipoib_neigh access
	race
In-Reply-To: <20070522005918.GB13311@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 22 May 2007 03:59:18 +0300")
References: <20070522005918.GB13311@mellanox.co.il>
Message-ID: <adatzu4d1wx.fsf@cisco.com>

 > hard_start_xmit dereferences to_ipoib_neigh when only tx_lock is taken.  This
 > would only be safe if all calls that modify *to_ipoib_neigh take tx_lock too.
 > Currently this is not always true for ipoib_neigh_free and path_rec_completion,
 > which results in memory corruption.  Fix this race, making sure
 > path_rec_completion and ipoib_neigh_free are always called under
 > tx_lock.
 > 
 > Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
 > 
 > ---
 > 
 > I'm looking at
 > https://bugs.openfabrics.org/show_bug.cgi?id=604
 > and I think this could explain the crashes.
 > In any case, Roland, is there a race or am I imagining things?
 > 
 > NB: The patch is untested (I'm not at the lab now).

Yes, it does seem that there is a problem here.  However, I think the first
part of this needs to be handled another way -- for example:

 > -		path_free(dev, path);
 >  		spin_lock_irq(&priv->tx_lock);
 >  		spin_lock(&priv->lock);
 > +		path_free(dev, path);

path_free already takes priv->lock internally, and also calls
ipoib_put_ah(), which may end up in ipoib_free_ah(), which also might
take priv->lock.  So calling it with priv->lock already held would
deadlock.

It's not immediately obvious what the right fix is...

 - R.


From rdreier at cisco.com  Tue May 22 13:46:04 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 22 May 2007 13:46:04 -0700
Subject: [ofa-general] mmap() and ibv_reg_mr() and RDMA
In-Reply-To: <20070522210430.5df75050@marvin.local> (Joerg Zinke's message of
	"Tue, 22 May 2007 21:04:30 +0200")
References: <20070522210430.5df75050@marvin.local>
Message-ID: <adaps4sd1tf.fsf@cisco.com>

 > I want to do RDMA-write on mmap'ed memory.
 > but it fails to register the memory region.
 > 
 > is there something special, to use ibv_reg_mr() on memory which I got
 > from mmap()?
 > it works fine with plain allocated memory (with memalign()).
 > memory is page-aligned in both cases.

How exactly are you mmap()ing the memory?  memalign(), malloc() etc
are implemented with mmap() internally, so obviously memory
registration of some mmap()ed memory is fine.

 - R.


From umaxx at oleco.net  Tue May 22 14:25:59 2007
From: umaxx at oleco.net (Joerg Zinke)
Date: Tue, 22 May 2007 23:25:59 +0200
Subject: [ofa-general] mmap() and ibv_reg_mr() and RDMA
In-Reply-To: <adaps4sd1tf.fsf@cisco.com>
References: <20070522210430.5df75050@marvin.local> <adaps4sd1tf.fsf@cisco.com>
Message-ID: <20070522232559.7785a331@marvin.local>

On Tue, 22 May 2007 13:46:04 -0700
Roland Dreier <rdreier at cisco.com> wrote:

>  > I want to do RDMA-write on mmap'ed memory.
>  > but it fails to register the memory region.
>  > 
>  > is there something special, to use ibv_reg_mr() on memory which I
>  > got from mmap()?
>  > it works fine with plain allocated memory (with memalign()).
>  > memory is page-aligned in both cases.
> 
> How exactly are you mmap()ing the memory?  memalign(), malloc() etc
> are implemented with mmap() internally, so obviously memory
> registration of some mmap()ed memory is fine.

i created a character device in the kernel and allocated memory with
kzalloc():

kmalloc_ptr = kzalloc((NPAGES + 2) * PAGE_SIZE, GFP_KERNEL | __GFP_DMA);
if (kmalloc_ptr == NULL)
        return -ENOMEM;

rounded to page boundary:

kmalloc_area = (struct serverinfo *)((((unsigned long)kmalloc_ptr) +
PAGE_SIZE - 1) & PAGE_MASK);
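
For readers following the rounding step above, here is a minimal userspace
sketch of the same round-up arithmetic. SKETCH_PAGE_SIZE, SKETCH_PAGE_MASK
and page_align_up() are made-up names for illustration only, and the
4096-byte page size is an assumption (the kernel code uses PAGE_SIZE and
PAGE_MASK for the running architecture).

```c
#include <stdint.h>

/* Assumed page size for this sketch; the kernel uses PAGE_SIZE instead. */
#define SKETCH_PAGE_SIZE 4096UL
/* Mask that clears the offset-within-page bits, like the kernel's PAGE_MASK. */
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/*
 * Round an address up to the next page boundary. An address that is
 * already page-aligned is returned unchanged.
 */
uintptr_t page_align_up(uintptr_t addr)
{
    return (addr + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK;
}
```

Rounding up moves the start by less than one page, so an allocation of
(NPAGES + 2) pages still leaves room for NPAGES whole pages after alignment.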

this area is mapped via the character device and with
the help of remap_pfn_range() into userspace... this works fine i can
access it from userspace and write/read from it:

#define MMAP_AREA_LEN (NPAGES*getpagesize())

...

mmap_area = (struct serverinfo*)mmap(0, MMAP_AREA_LEN,
PROT_READ|PROT_WRITE, MAP_SHARED| MAP_LOCKED, fd, MMAP_AREA_LEN);

but when i try to register the mmap_area with ibv_reg_mr() it
fails. 

regards,

joerg


From sean.hefty at intel.com  Tue May 22 14:47:45 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 22 May 2007 14:47:45 -0700
Subject: [ofa-general] locating the index of the default PKey - possible
	sa_query bug?
Message-ID: <001101c79cba$d5409ca0$ff0da8c0@amr.corp.intel.com>

I've been asked to verify partition support in the IB stack.  Everything I've checked so far
appears fine, with one possible exception.

The sa_query module always sends MADs using pkey index 0.  According to section 15.4.2 of the spec,
SA MADs should be sent using the index of the default pkey.  Is there any requirement that the
default pkey be located at index 0?  If not, are we fine placing this requirement on the SM?  (I'm
not aware of any actual problems occurring with the existing code.)

- Sean


From rdreier at cisco.com  Tue May 22 15:19:15 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 22 May 2007 15:19:15 -0700
Subject: [ofa-general] mmap() and ibv_reg_mr() and RDMA
In-Reply-To: <20070522232559.7785a331@marvin.local> (Joerg Zinke's message of
	"Tue, 22 May 2007 23:25:59 +0200")
References: <20070522210430.5df75050@marvin.local> <adaps4sd1tf.fsf@cisco.com>
	<20070522232559.7785a331@marvin.local>
Message-ID: <adalkfgcxi4.fsf@cisco.com>

 > this area is mapped via the character device and with
 > the help of remap_pfn_range() into userspace... this works fine i can
 > access it from userspace and write/read from it:

I think that's the problem.  remap_pfn_range() sets VM_PFNMAP on the
vma used to map the pfns.  When ibv_reg_mr() calls into the kernel to
do the actual mapping, it ends up doing get_user_pages() which fails
in vm_normal_page() for such a vma.

I don't immediately see a good way to handle this.

 - R.


From koen.segers at vrt.be  Tue May 22 15:34:32 2007
From: koen.segers at vrt.be (Koen Segers)
Date: Wed, 23 May 2007 00:34:32 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E14DD@xmb-sjc-216.amer.cisco.com>
References: <1179857686.9528.3.camel@KOEN>
	<OF0C5ACB99.4E82D6F6-ON872572E3.00650C9C-882572E3.006AFED0@us.ibm.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3038E14DD@xmb-sjc-216.amer.cisco.com>
Message-ID: <1179873272.9528.27.camel@KOEN>

If I understand it right, the switch is actually polling (=pinging) the
interfaces every 10s. This means that when the interface is handling
other traffic, the poll can fail and the port could be considered out of
service. My question is then: "How can the timeout be reached while
packets are being sent/received to/from the interface?"

Anyway, what timeout-value would you recommend for us? And why?

To recapitulate: these are the actions I'll take tomorrow
1) change the MAD niceness of the servers
2) change the timeout on the switches

Are these changes sufficient for the HCAs to keep their ports in
PORT_ACTIVE state?

Regards,

Koen

On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> Yes, you can tune it.  Here's an example via the switch CLI:
>  
> SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00
> node-timeout <value>
> 
> The default is 10 seconds, it can be configured up to 2000 seconds.
> If a HCA is completely unresponsive for longer than the node-timeout
> value, then we consider that HCA out of service.
>  
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>  

From rdreier at cisco.com  Tue May 22 15:35:39 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 22 May 2007 15:35:39 -0700
Subject: [ofa-general] locating the index of the default PKey - possible
	sa_query bug?
In-Reply-To: <001101c79cba$d5409ca0$ff0da8c0@amr.corp.intel.com> (Sean Hefty's
	message of "Tue, 22 May 2007 14:47:45 -0700")
References: <001101c79cba$d5409ca0$ff0da8c0@amr.corp.intel.com>
Message-ID: <adafy5ocwqs.fsf@cisco.com>

 > The sa_query module always sends MADs using pkey index 0.
 > According to section 15.4.2 of the spec, SA MADs should be sent
 > using the index of the default pkey.  Is there any requirement that
 > the default pkey be located at index 0?  If not, are we fine
 > placing this requirement on the SM?  (I'm not aware of any actual
 > problems occurring with the existing code.)

It does look like a (minor) bug.  I don't know of any compliance
statement that says the default P_Key has to be at index 0.  We
probably should fix sa_query to do the right thing, but I don't see it
as a high priority.
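
To illustrate what "the right thing" could look like, here is a rough
standalone sketch of looking up the default P_Key's index instead of
assuming index 0. It is not the sa_query code and does not use any kernel
API: find_default_pkey_index() and the plain array standing in for a port's
P_Key table are hypothetical. The only facts it relies on are that the
default P_Key has the 15-bit base value 0x7fff and that the top bit is the
membership bit (0xffff full member, 0x7fff limited member).

```c
#include <stddef.h>
#include <stdint.h>

/* 15-bit base value of the default P_Key; bit 15 is the membership bit. */
#define DEFAULT_PKEY_BASE 0x7fffu
#define PKEY_FULL_MEMBER  0x8000u

/*
 * Hypothetical helper: scan a P_Key table and return the index of the
 * default P_Key. A full-member entry (0xffff) is preferred; if only a
 * limited-member entry (0x7fff) exists, its index is returned.
 * Returns -1 when the table holds no default P_Key at all.
 */
int find_default_pkey_index(const uint16_t *table, size_t len)
{
    int limited = -1;

    for (size_t i = 0; i < len; i++) {
        if ((table[i] & 0x7fffu) != DEFAULT_PKEY_BASE)
            continue;               /* not the default partition */
        if (table[i] & PKEY_FULL_MEMBER)
            return (int)i;          /* full member: best match */
        if (limited < 0)
            limited = (int)i;       /* remember first limited entry */
    }
    return limited;
}
```

A caller would use the returned index when posting the MAD, and fall back
to index 0 (today's behaviour) or fail the query if -1 comes back.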

 - R.


From pradeeps at linux.vnet.ibm.com  Tue May 22 15:36:49 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Tue, 22 May 2007 15:36:49 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
Message-ID: <46537081.30906@linux.vnet.ibm.com>

Here are my thoughts about limiting the memory footprint for IPOIB CM
(NOSRQ) patch:

By default, cap the NOSRQ memory usage to 1GB. The default recvq_size
is set to 128. Therefore for 64KB packets this would imply a maximum of
128 endpoints.
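
The arithmetic behind that figure can be sketched as follows. This is only
a back-of-the-envelope model, assuming each endpoint pins recvq_size
receive buffers of one fixed size; the function names are made up for the
example and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Receive-buffer memory used by one NOSRQ endpoint (assumed model). */
uint64_t nosrq_bytes_per_endpoint(uint64_t recvq_size, uint64_t buf_bytes)
{
    return recvq_size * buf_bytes;
}

/* Number of endpoints that fit under a given memory cap. */
uint64_t nosrq_max_endpoints(uint64_t cap_bytes, uint64_t recvq_size,
                             uint64_t buf_bytes)
{
    uint64_t per_endpoint = nosrq_bytes_per_endpoint(recvq_size, buf_bytes);

    assert(per_endpoint > 0);   /* avoid dividing by zero */
    return cap_bytes / per_endpoint;
}
```

With recvq_size = 128 and 64KB buffers, each endpoint takes 128 * 64KB =
8MB, so a 1GB cap allows 1GB / 8MB = 128 endpoints. Under the same model,
1024 endpoints would need 8GB.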

-Make the maximum number of endpoints a module parameter with a default
value of 128.

-The NOSRQ limit of 1GB is also made a module parameter. 1GB is the
default limit, and the administrator could change it as needed
depending on the system configuration, application needs and so on. The
server would return a "REJ" message upon receiving a "REQ" whenever one
of these limits (i.e. max number of endpoints or the max NOSRQ memory
usage) is reached. Currently, we only check for the maximum number of
endpoints, which is hard coded to 1024.

-The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that
support SRQ, like the Topspin HCA, and such HCAs should not be
impacted at all.

-Currently we allocate a default of 64KB for the ring buffer elements,
and this buffer size is not linked to the mtu. In the future, we could
allocate buffers based on the mtu and link that into the computation of
the memory cap. This way customers who might want to use a smaller mtu
could use a larger number of endpoints, or a larger recvq_size without
exceeding the memory cap.


Would this approach address the issues of scalability and enable IPOIB
CM to be turned on by default?


Pradeep


From sweitzen at cisco.com  Tue May 22 15:38:48 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 22 May 2007 15:38:48 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1179873272.9528.27.camel@KOEN>
References: <1179857686.9528.3.camel@KOEN>
	<OF0C5ACB99.4E82D6F6-ON872572E3.00650C9C-882572E3.006AFED0@us.ibm.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3038E14DD@xmb-sjc-216.amer.cisco.com>
	<1179873272.9528.27.camel@KOEN>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E15EC@xmb-sjc-216.amer.cisco.com>

It's not so much pinging every 10 seconds as expecting a response within
10 seconds (Clive, correct me if I'm wrong).

You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
binary RPMs we release at
http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
the host be more responsive.


Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: Koen Segers [mailto:koen.segers at VRT.BE] 
> Sent: Tuesday, May 22, 2007 3:35 PM
> To: Scott Weitzenkamp (sweitzen)
> Cc: Shirley Ma; Ami Perlmutter; 
> general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> If I understand it right, the switch is actually polling (=pinging) the
> interfaces every 10s. This means that when the interface is handling
> other traffic, the poll can fail and the port could be considered out of
> service. My question is then: "How can the timeout be reached while
> packets are being sent/received to/from the interface?"
> Anyway, what timeout-value would you recommend for us? And why?
> 
> To recapitulate: these are the actions I'll take tomorrow
> 1) change the MAD niceness of the servers
> 2) change the timeout on the switches
> 
> Are these changes sufficient for the HCAs to keep their ports in
> PORT_ACTIVE state?
> 
> Regards,
> 
> Koen
> 
> On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > Yes, you can tune it.  Here's an example via the switch CLI:
> >  
> > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00
> > node-timeout <value>
> > 
> > The default is 10 seconds, it can be configured up to 2000 seconds.
> > If a HCA is completely unresponsive for longer than the node-timeout
> > value, then we consider that HCA out of service.
> >  
> > Scott Weitzenkamp
> > SQA and Release Manager
> > Server Virtualization Business Unit
> > Cisco Systems
> >  
> > 
> >         
> >         
> ______________________________________________________________
> >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> >         Sent: Tuesday, May 22, 2007 11:30 AM
> >         To: koen.segers at VRT.BE
> >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp
> >         (sweitzen)
> >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> >         
> >         
> >         
> >         Koen,
> >         
> >         So it is most likely you hit the same bug as 229 (Scott
> >         pointed out earlier). The same workaround might work for you
> >         by renicing ib_mad as Scott suggested.
> >         
> >         I think this should be a SM query timeout tunable value in
> >         Cisco SM. Am I right, Scott?
> >         
> >         Thanks
> >         Shirley Ma
> >         
> >         
> >         From: Koen Segers <koen.segers at VRT.BE>
> >         Sent: 05/22/07 11:14 AM
> >         Please respond to: koen.segers at VRT.BE
> >         To: Shirley Ma/Beaverton/IBM at IBMUS
> >         Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> >         general at lists.openfabrics.org,
> >         general-bounces at lists.openfabrics.org
> >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> >         
> >         Hi,
> >         
> >         It is the Cisco SM. 
> >         
> >         SFS-7000P> show version
> >         
> >         ================================================================================
> >                                   System Version Information
> >         ================================================================================
> >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> >                          contact : tac at cisco.com
> >                             name : SFS-7000P
> >                         location : 170 West Tasman Drive, San Jose, CA 95134
> >                          up-time : 11(d):7(h):49(m):3(s)
> >                      last-change : none
> >                 last-config-save : none
> >                           action : none
> >                           result : none
> >                        oper-mode : normal
> >         
> >         There is also a command that gives the SM version, but I can't
> >         find it right now. 
> >         
> >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> >         > Hello Koen,
> >         > 
> >         > From the switch log, it looks like an SM issue to me. The
> >         > node was kicked out of the membership. Which SM are you
> >         > using in your fabric? 
> >         > 
> >         > Thanks
> >         > Shirley Ma
> >         > 
> >         *** Disclaimer ***
> >         
> >         Vlaamse Radio- en Televisieomroep
> >         Auguste Reyerslaan 52, 1043 Brussel
> >         
> >         nv van publiek recht
> >         BTW BE 0244.142.664
> >         RPR Brussel
> >         http://www.vrt.be/disclaimer
> >         
> >         
> >         
> >         
> >         


From venkatesh.babu at 3leafnetworks.com  Tue May 22 17:01:32 2007
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Tue, 22 May 2007 17:01:32 -0700
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of
	them claims master
In-Reply-To: <1179831181.15940.74121.camel@hal.voltaire.com>
References: <4652167F.9040709@3leafnetworks.com>	
	<1179785796.15940.27092.camel@hal.voltaire.com>	
	<4652542C.3010400@3leafnetworks.com>	
	<1179805556.15940.47640.camel@hal.voltaire.com>	
	<46528E3C.8090305@3leafnetworks.com>
	<1179831181.15940.74121.camel@hal.voltaire.com>
Message-ID: <4653845C.1090507@3leafnetworks.com>


Hal Rosenstock wrote:

>The one I see that might be related is the following:
>
>commit 39798695b4bcc7b145f8910ca56195808d3a7637
>Author: Roland Dreier <rolandd at cisco.com>
>Date:   Mon Nov 13 09:38:07 2006 -0800
>
>    IB/mad: Fix race between cancel and receive completion
>    
>    When ib_cancel_mad() is called, it puts the canceled send on a list
>    and schedules a "flushed" callback from process context.  However,
>    this leaves a window where a receive completion could be processed
>    before the send is fully flushed.
>    
>    This is fine, except that ib_find_send_mad() will find the MAD and
>    return it to the receive processing, which results in the sender
>    getting both a successful receive and a "flushed" send completion for
>    the same request.  Understandably, this confuses the sender, which is
>    expecting only one of these two callbacks, and leads to grief such as
>    a use-after-free in IPoIB.
>    
>    Fix this by changing ib_find_send_mad() to return a send struct only
>    if the status is still successful (and not "flushed").  The search of
>    the send_list already had this check, so this patch just adds the same
>    check to the search of the wait_list.
>    
>    Signed-off-by: Roland Dreier <rolandd at cisco.com>
>
>My search was not exhaustive.
>  
>
  It looks like this may be the fix for the MAD send errors. Do you 
think this is also why opensm is not taking over mastership from the 
other one?
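
For anyone following along, the lookup change described in the commit message can be sketched as a toy model. This is not the kernel code: the class, the list handling, and the status strings below are illustrative assumptions, and only the idea (do not hand a canceled send back to receive processing) comes from the commit text.

```python
from dataclasses import dataclass

SUCCESS = "success"   # stand-in for a send that is still live
FLUSHED = "flushed"   # stand-in for a send that has been canceled


@dataclass
class SendRequest:
    tid: int              # transaction ID used to match a response
    status: str = SUCCESS


def cancel(wr):
    """Mark a send as canceled; the 'flushed' callback would run later."""
    wr.status = FLUSHED


def find_send(wait_list, send_list, tid):
    """Return the matching send only if it has not been canceled."""
    for wr in wait_list:
        if wr.tid == tid:
            # The check the commit adds to the wait_list search.
            return wr if wr.status == SUCCESS else None
    for wr in send_list:
        # Per the commit message, the send_list search already had it.
        if wr.tid == tid and wr.status == SUCCESS:
            return wr
    return None
```

In this model, once cancel() has been called on a waiting request, a late response with the same tid finds nothing, so the sender sees only the "flushed" completion and not a successful receive as well.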

>
>Are they incrementing ? Which node is this ? I think some of them would
>increment on node reboot.
>  
>
  It looks like some counters (symbol errors, link downed) have reached 
their maximum values.
This output was captured on node vortex3l-83, the one that runs opensm.
Do you want the perfquery output before and after some time interval?

 VBabu


From xma at us.ibm.com  Tue May 22 17:01:05 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 22 May 2007 17:01:05 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1179858607.16831.20544.camel@hal.voltaire.com>
Message-ID: <OF6711CD75.EA9B409D-ON872572E3.0083CECD-882572E3.0083D2ED@us.ibm.com>


Thanks Hal.

      Thanks for the clarification. I meant to say that the port going
down and up kicked the node out of the GPFS membership. The port state
changes were managed by the SM.

Thanks
Shirley Ma

From halr at voltaire.com  Tue May 22 17:01:10 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 22 May 2007 20:01:10 -0400
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of
	them claims master
In-Reply-To: <4653845C.1090507@3leafnetworks.com>
References: <4652167F.9040709@3leafnetworks.com>
	<1179785796.15940.27092.camel@hal.voltaire.com>
	<4652542C.3010400@3leafnetworks.com>
	<1179805556.15940.47640.camel@hal.voltaire.com>
	<46528E3C.8090305@3leafnetworks.com>
	<1179831181.15940.74121.camel@hal.voltaire.com>
	<4653845C.1090507@3leafnetworks.com>
Message-ID: <1179878469.16831.42580.camel@hal.voltaire.com>

On Tue, 2007-05-22 at 20:01, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
> 
> >The one I see that might be related is the following:
> >
> >commit 39798695b4bcc7b145f8910ca56195808d3a7637
> >Author: Roland Dreier <rolandd at cisco.com>
> >Date:   Mon Nov 13 09:38:07 2006 -0800
> >
> >    IB/mad: Fix race between cancel and receive completion
> >    
> >    When ib_cancel_mad() is called, it puts the canceled send on a list
> >    and schedules a "flushed" callback from process context.  However,
> >    this leaves a window where a receive completion could be processed
> >    before the send is fully flushed.
> >    
> >    This is fine, except that ib_find_send_mad() will find the MAD and
> >    return it to the receive processing, which results in the sender
> >    getting both a successful receive and a "flushed" send completion for
> >    the same request.  Understandably, this confuses the sender, which is
> >    expecting only one of these two callbacks, and leads to grief such as
> >    a use-after-free in IPoIB.
> >    
> >    Fix this by changing ib_find_send_mad() to return a send struct only
> >    if the status is still successful (and not "flushed").  The search of
> >    the send_list already had this check, so this patch just adds the same
> >    check to the search of the wait_list.
> >    
> >    Signed-off-by: Roland Dreier <rolandd at cisco.com>
> >
> >My search was not exhaustive.
> >  
> >
>   It looks like this may be the fix for the MAD send errors.

Perhaps.

>  Do you 
> think this is the cause of opensm not grabbing the mastership from the 
> other ?

Unlikely but don't know for sure.

> >Are they incrementing ? Which node is this ? I think some of them would
> >increment on node reboot.
> >  
> >
>   It looks like some counters (symbol errors, link downed) have reached 
> their maximum values.

You should replace the cable and see if the symbol errors improve. You
may need to clear these counters with perfquery -R.

I think Link downed will increment when the node reboots.

> This output was captured on node vortex3l-83, the one who runs opensm.
> Do you want the perfquery output before and after some time interval ?

I'm interested in VL15 drops to make sure that is not going on.
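
A before/after comparison could be scripted roughly as below. This is a sketch under assumptions: it expects perfquery-style "Name:.....value" lines, the counter names (SymbolErrors, LinkDowned, VL15Dropped) may be spelled differently depending on the perfquery version, and the ceilings assume the classic PortCounters widths (16-bit symbol error and VL15 dropped counters, 8-bit link downed counter).

```python
import re

# Assumed ceilings for the classic PortCounters fields.
SATURATION = {
    "SymbolErrors": 0xFFFF,
    "LinkDowned": 0xFF,
    "VL15Dropped": 0xFFFF,
}

# Matches lines such as "SymbolErrors:....................65535".
LINE = re.compile(r"^\s*(\w+):\.*\s*(\d+)\s*$")


def parse_counters(text):
    """Parse 'Name:....value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def compare(before, after):
    """Map each counter to (delta, saturated) between two snapshots."""
    report = {}
    for name, new in after.items():
        if name not in before:
            continue
        ceiling = SATURATION.get(name, float("inf"))
        report[name] = (new - before[name], new >= ceiling)
    return report
```

A counter flagged as saturated will show a delta of zero no matter how many new errors occur, which is why the counters need to be cleared before a comparison says anything useful.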

-- Hal

>  VBabu


From halr at voltaire.com  Tue May 22 17:06:10 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 22 May 2007 20:06:10 -0400
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <OF6711CD75.EA9B409D-ON872572E3.0083CECD-882572E3.0083D2ED@us.ibm.com>
References: <OF6711CD75.EA9B409D-ON872572E3.0083CECD-882572E3.0083D2ED@us.ibm.com>
Message-ID: <1179878769.16831.42940.camel@hal.voltaire.com>

On Tue, 2007-05-22 at 20:01, Shirley Ma wrote:
> Thanks Hal.
> 
> Thanks for the clarification. I meant to say the port up and down
> kicked the node from GPFS membership. The port up and down was managed
> by SM.

Got it and that port down/up appears to be caused by MAD starvation at
the host.

-- Hal

> Thanks
> Shirley Ma


From vlad at lists.openfabrics.org  Wed May 23 02:40:24 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed, 23 May 2007 02:40:24 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070523-0200 daily build status
Message-ID: <20070523094024.C6340E60814@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.19
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From Koen.SEGERS at VRT.BE  Wed May 23 06:48:19 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Wed, 23 May 2007 15:48:19 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E15EC@xmb-sjc-216.amer.cisco.com>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C03157D67@OCBEXS01001.rto.be>

So far, all tests seem to work.

Thanks for the help!

Scott,
Are there other bug fixes that Cisco includes in its RPMs?

Greetz

Koen

-----Original Message-----
From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
Sent: Wednesday, May 23, 2007 0:39
To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
general-bounces at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

It's not so much pinging every 10 seconds as expecting a response within
10 seconds (Clive, correct me if I'm wrong).

You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
binary RPMs we release at
http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
the host be more responsive.


Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 


From sweitzen at cisco.com  Wed May 23 06:51:55 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 23 May 2007 06:51:55 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D67@OCBEXS01001.rto.be>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E15EC@xmb-sjc-216.amer.cisco.com>
	<D63C0BE2D613C543B6F3305502E9784C03157D67@OCBEXS01001.rto.be>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E17BF@xmb-sjc-216.amer.cisco.com>

No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
openib.conf, memlock in /etc/security/limits.conf, fix /etc/hosts on
SLES10 for bug 267, etc.).
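
For anyone who wants to verify the first of those settings on their own hosts, here is a small sketch that checks whether RENICE_IB_MAD is set to yes in the text of an openib.conf file. The parsing rules (shell-style KEY=value lines, '#' comments, last assignment wins) are assumptions about the file format, and the helper name is made up for illustration.

```python
def renice_ib_mad_enabled(conf_text):
    """Return True if the last RENICE_IB_MAD assignment is 'yes'."""
    enabled = False
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "RENICE_IB_MAD":
            enabled = value.strip().strip("\"'").lower() == "yes"
    return enabled
```

Typical use would be renice_ib_mad_enabled(open(path).read()), where path is wherever your installation keeps openib.conf (commonly /etc/infiniband/openib.conf, but check your own system).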

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> Sent: Wednesday, May 23, 2007 6:48 AM
> To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> Cc: Shirley Ma; Ami Perlmutter; 
> general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> So far, all tests seem to work.
> 
> Thanks for the help!
> 
> Scott,
> Are there other bug fixes that Cisco includes in its RPMs?
> 
> Greetz
> 
> Koen
> 
> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> Sent: Wednesday, May 23, 2007 0:39
> To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> general-bounces at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> It's not so much pinging every 10 seconds as expecting a response
> within 10 seconds (Clive, correct me if I'm wrong).
> 
> You only need to do 1) or 2), not both.  Cisco configures 1) in the
> OFED binary RPMs we release at
> http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to
> have the host be more responsive.
> 
> 
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>  
> 


From halr at voltaire.com  Wed May 23 07:11:38 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 23 May 2007 10:11:38 -0400
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E17BF@xmb-sjc-216.amer.cisco.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E15EC@xmb-sjc-216.amer.cisco.com>
	<D63C0BE2D613C543B6F3305502E9784C03157D67@OCBEXS01001.rto.be>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3038E17BF@xmb-sjc-216.amer.cisco.com>
Message-ID: <1179929493.16831.98786.camel@hal.voltaire.com>

On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> openib.conf,

Does the host really not respond to MAD requests for over 10 seconds in
some cases ?

-- Hal

>  memlock in /etc/security/limits.conf, fix /etc/hosts on
> SLES10 for bug 267, etc.).
> 
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>  
> 
> > -----Original Message-----
> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > Sent: Wednesday, May 23, 2007 6:48 AM
> > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > Cc: Shirley Ma; Ami Perlmutter; 
> > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > This far, all tests seem to work.
> > 
> > Thanks for the help!
> > 
> > Scott,
> > Are there more bugfixes that cisco does in its rpms?
> > 
> > Greetz
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > Sent: Wednesday, May 23, 2007 0:39
> > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > general-bounces at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > It's not so much pinging every 10 seconds as expecting a response
> > within 10 seconds (Clive, correct me if I'm wrong).
> > 
> > You only need to do 1) or 2), not both.  Cisco configures 1) in the
> > OFED binary RPMs we release at
> > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to
> > have the host be more responsive.
> > 
> > 
> > Scott Weitzenkamp
> > SQA and Release Manager
> > Server Virtualization Business Unit
> > Cisco Systems
> >  
> > 
> > > -----Original Message-----
> > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > To: Scott Weitzenkamp (sweitzen)
> > > Cc: Shirley Ma; Ami Perlmutter; 
> > > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > If I understand it right, the switch is actually polling
> > > (=pinging) the interfaces every 10s. This means that when the
> > > interface is handling other traffic, the poll can fail and the
> > > port could be considered out of service. My question is then:
> > > "How can the timeout be reached while packets are being
> > > sent/received to/from the interface?"
> > > 
> > > Anyway, what timeout-value would you recommend for us? And why?
> > > 
> > > To recapitulate: these are the actions I'll take tomorrow
> > > 1) change the MAD niceness of the servers
> > > 2) change the timeout on the switches
> > > 
> > > Are these changes sufficient for the HCA's to keep their ports in
> > > PORT_ACTIVE state?
> > > 
> > > Regards,
> > > 
> > > Koen
> > > 
> > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > >  
> > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > 
> > > > The default is 10 seconds, it can be configured up to 2000 seconds.
> > > > If a HCA is completely unresponsive for longer than the node-timeout
> > > > value, then we consider that HCA out of service.
> > > >  
> > > > Scott Weitzenkamp
> > > > SQA and Release Manager
> > > > Server Virtualization Business Unit
> > > > Cisco Systems
> > > >  
> > > > 
> > > >         
> > > >         
> > > ______________________________________________________________
> > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > >         To: koen.segers at VRT.BE
> > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp
> > > >         (sweitzen)
> > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >         
> > > >         
> > > >         
> > > >         Koen,
> > > >         
> > > >         So it is most likely you hit the same bug as 229 (Scott
> > > >         pointed out earlier). The same workaround might work for you
> > > >         by renicing ib_mad as Scott suggested.
> > > >         
> > > >         I think this should be a SM query timeout tunable value in
> > > >         Cisco SM. Am I right, Scott?
> > > >         
> > > >         Thanks
> > > >         Shirley Ma
> > > >         
> > > >         
> > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > >         Date: 05/22/07 11:14 AM
> > > >         Reply-To: koen.segers at VRT.BE
> > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > >         Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > >         general at lists.openfabrics.org,
> > > >         general-bounces at lists.openfabrics.org
> > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >         
> > > >         
> > > >         
> > > >         Hi,
> > > >         
> > > >         It is the Cisco SM. 
> > > >         
> > > >         SFS-7000P> show version
> > > >         
> > > >         
> > > >         
> > > >         ================================================================================
> > > >                                   System Version Information
> > > >         ================================================================================
> > > >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > >                          contact : tac at cisco.com
> > > >                             name : SFS-7000P
> > > >                         location : 170 West Tasman Drive, San Jose, CA 95134
> > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > >                      last-change : none
> > > >                 last-config-save : none
> > > >                           action : none
> > > >                           result : none
> > > >                        oper-mode : normal
> > > >         
> > > >         There is also a command that gives the SM version, but I can't
> > > >         find it right now. 
> > > >         
> > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > >         > Hello Koen,
> > > >         > 
> > > >         > From the switch log, it looks like an SM issue to me. The node was
> > > >         > kicked out of the membership. Which SM are you using in your
> > > >         > fabric? 
> > > >         > 
> > > >         > Thanks
> > > >         > Shirley Ma
> > > >         > 
> > > >         *** Disclaimer ***
> > > >         
> > > >         Vlaamse Radio- en Televisieomroep
> > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > >         
> > > >         nv van publiek recht
> > > >         BTW BE 0244.142.664
> > > >         RPR Brussel
> > > >         http://www.vrt.be/disclaimer
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From devesh28 at gmail.com  Wed May 23 07:27:55 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Wed, 23 May 2007 19:57:55 +0530
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <1179769930.15940.9823.camel@hal.voltaire.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<000101c7963a$3474ae00$49c9180a@amr.corp.intel.com>
	<309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com>
	<464B5C07.8040601@ichips.intel.com>
	<309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com>
	<1179398534.23882.67542.camel@hal.voltaire.com>
	<309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com>
	<1179483657.23882.158398.camel@hal.voltaire.com>
	<309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com>
	<1179769930.15940.9823.camel@hal.voltaire.com>
Message-ID: <309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com>

On 21 May 2007 13:52:11 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> On Mon, 2007-05-21 at 01:58, Devesh Sharma wrote:
> > On 18 May 2007 06:21:05 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote:
> > > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote:
> > > > > > On 5/17/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > > > > > > > But initially this will generate a packet for each path, while sys
> > > > > > > > admin knows that path is there and he can hard-code the entries for
> > > > > > > > it. Other thing is that why Admin will care about creating such record
> > > > > > > > while SA is itself taking care, right?
> > > > > > >
> > > > > > > In your original message you asked about adding 'dummy entries' to the
> > > > > > > cache.  I agree that pre-loading the cache can be useful.  What I still
> > > > > > > am not understanding is the reasoning for adding 'dummy entries'.  By
> > > > > > > 'dummy entries', I've been assuming that these are invalid path records,
> > > > > > > but maybe that's not what you meant.
> > > > > > Ok if "dummy entries" word as such has created confusion then I am
> > > > > > sorry for that, But with that I mean that, those are valid path
> > > > > > records which Administrator knows in advance and while loading the
> > > > > > module,
> > > > >
> > > > > How does the admin know they are valid ?
> > > > Depending on the initial application runs, some trusted PRs can be generated.
> > >
> > > What do initial application runs have to do with this ?
> > My understanding is that, once the cluster is UP, and if between Node
> > A and Node B there is only one path,
>
> So this is a feature for such one path subnets. I wonder what percentage
> of deployed subnets fits this case.
You never know; it may also be useful for debugging.
>
> > then, SA query always going to return same values in PR.
>
> If subnet topology is changed, these PRs might change. There are other
> cases where they change too.
I am not sure about that. Any suggestions?
>
> >  On this basis Initial application runs will generate PRs,
>
> That's what confused me before (Applications don't generate PRs but
> rather request them.) but I think I see what you mean now.
Ok
>
> > these PRs can be saved in some file, and can be loaded
> > when cache_module comes in.
> > >
> > > > >Are they somehow preconfigured at the SM ?
> > > > I am not sure about SM has any such provision?
> > >
> > > Not that I'm aware of.
> > Ok, So, currently no such support is there in SM?
>
> I can speak definitively for OpenSM and there is no such support. As to
> the vendor SMs, I don't think so but don't know for absolute certainty.
> Someone can correct me if I'm wrong but I wouldn't assume no response
> means correctness as some may not be listening nor want to respond as to
> "value added" vendor specific features.
Would there be any problem if OpenSM provided this?
>
> > > > Also not sure about the
> > > > role of SM in path resolving. I mean once node has initiated SA query,
> > > > whether SM has some database to reply SA or On the fly destination
> > > > node is contacted to get asked path record?
> > >
> > > SMs can either calculate the SA PRs on the fly based on the routing
> > > algorithm in use and some other things or put them in a local database.
> > > This is up to that SM.
> > Ok
> > >
> > > Destination node is not contacted in the SA PR query process.
> > >
> > > > >Doesn't each SM have its own policy for generating valid PRs ?
> > > > Ultimately path record is in Path_Record object format, and SA cache
> > > > is going to store in a fixed manner, How generation policy matters?
> > >
> > > What if the local policy loaded does not agree with what the SM would
> > > generate for a particular PR ? One then gets a local error which will
> > > need to be tracked down. Not so easy IMO.
> > SM policies in a subnet to generate PRs, changes dynamically? at run time?
>
> The policy doesn't change dynamically but the data to be returned in the
> SA PR response might.
>
> > if Not then depending on the local SM policy static PR can be
> > generated to load initially.
>
> Just as one question related to this, how would link failures be handled
> ? There are others.
It's just a matter of avoiding the initial PR query packets by loading the
cache with static PRs. After that, the cache module will function in the
normal fashion. I expect that, initially, everything will come up in a
trusted cluster.
>
> > > > CMIIW. Also I am assuming a homogeneous cluster where certain
> > > > parameters can be assumed to be same always.
> > >
> > > and always in agreement with what the SM would return ? For example,
> > yes
> > > what happens when a link goes down and the end node is no longer
> > > reachable ?
> > If node is not reachable then, after first timeout of sa_cache, that
> > entry will be removed from cache.
>
> OK; that's another aspect to add into this feature. I don't think that
> is currently done. I think there would need to be an API added to do
> this.
Yes, this has been discussed with Sean. We can add a char_dev
interface to the existing sa_cache module implementation. Its write entry
point will generate an SA_PR_response packet and pass this packet to the
update_cache() function.

We also need to remove the initial schedule_update() call in the
add_one() function.
A user command is also required to read the records from a user file and
write them to this device.
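
A minimal sketch of such a user command, written in Python for brevity.
The function name and the file format (one hex-encoded path record per
line, '#' for comments) are assumptions made for illustration only; nothing
here is an existing sa_cache interface, and the device node path is
whatever the char_dev would eventually be registered as.

```python
import os


def load_path_records(record_file, device_path):
    """Write each saved path record to the cache device.

    record_file: text file with one hex-encoded record per line
                 (hypothetical format, chosen for this sketch).
    device_path: path of the cache character device (hypothetical).
    Returns the number of records written.
    """
    written = 0
    with open(record_file, "r") as src:
        fd = os.open(device_path, os.O_WRONLY)
        try:
            for line in src:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue  # skip blank lines and comments
                record = bytes.fromhex(line)
                # One write() call per record.
                os.write(fd, record)
                written += 1
        finally:
            os.close(fd)
    return written
```

Each record is handed over in its own write() call, so the write entry
point of the device would see one record at a time.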
>
> -- Hal
>
> > > > >are these from a live SM and just loaded "out of band" to
> > > > bypass/preclude the SA PR >mechanism ?
> > > > may be
> > >
> > > Even if they are, there is still the changes in the subnet issue.
> > >
> > > -- Hal
> > >
> > > > > -- Hal
> > > > >
> > > > > >  Admin is loading this info in the cache with user command.
> > > > > > >
> > > > > > > > Another point I want to know is,
> > > > > > > > When local_sa_cache module will be inserted? After SM comes up or
> > > > > > > > Before SM comes up?
> > > > > > >
> > > > > > > It can occur either way.  There is no restriction.  The cache responds
> > > > > > > to port up and GID in/out of service events to update itself.
> > > > > > Do you mean cache module will start building cache only after Port is UP?
> > > > > > >
> > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on
> > > > > > > > some node not on switch) then First Forced schedule_update() is
> > > > > > > > wasted, and for the first application presence of cache is
> > > > > > > > meaningless. Why not to keep cache effective right from the start?
> > > > > > >
> > > > > > > Pre-loading the cache with path records doesn't guarantee that those
> > > > > > > paths are usable.  If the SM has not come up, then the path records will
> > > > > > > be unusable until the SM configures the subnet, plus there's no
> > > > > > > guarantee that the remote endpoints specified by the paths are running.
> > > > > > You mean there is no guarantee that even if SM is UP and we have some
> > > > > > hard coded entries of path record corresponding to some node X, we are
> > > > > > not sure that node X has actually come up or not?  In that case
> > > > > > actually that path resolving should fail if node has not come up, but
> > > > > > with the hard coding still path will be resolved?
> > > > > > >
> > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms
> > > > > > > when booting a large cluster.
> > > > > > that's true. Also cache will get valid entries only if network is
> > > > > > configured by SM otherwise every node SA will, possibly, drop SA
> > > > > > packets.
> > > > > > >
> > > > > > > - Sean
> > > > > > >
> > > > >
> > > > >
> > >
> > >
>
>


From halr at voltaire.com  Wed May 23 07:38:25 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 23 May 2007 10:38:25 -0400
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<000101c7963a$3474ae00$49c9180a@amr.corp.intel.com>
	<309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com>
	<464B5C07.8040601@ichips.intel.com>
	<309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com>
	<1179398534.23882.67542.camel@hal.voltaire.com>
	<309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com>
	<1179483657.23882.158398.camel@hal.voltaire.com>
	<309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com>
	<1179769930.15940.9823.camel@hal.voltaire.com>
	<309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com>
Message-ID: <1179931104.16831.100554.camel@hal.voltaire.com>

On Wed, 2007-05-23 at 10:27, Devesh Sharma wrote:
> On 21 May 2007 13:52:11 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > On Mon, 2007-05-21 at 01:58, Devesh Sharma wrote:
> > > On 18 May 2007 06:21:05 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote:
> > > > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote:
> > > > > > > On 5/17/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > > > > > > > > But initially this will generate a packet for each path, while sys
> > > > > > > > > admin knows that path is there and he can hard-code the entries for
> > > > > > > > > it. Other thing is that why Admin will care about creating such record
> > > > > > > > > while SA is itself taking care, right?
> > > > > > > >
> > > > > > > > In your original message you asked about adding 'dummy entries' to the
> > > > > > > > cache.  I agree that pre-loading the cache can be useful.  What I still
> > > > > > > > am not understanding is the reasoning for adding 'dummy entries'.  By
> > > > > > > > 'dummy entries', I've been assuming that these are invalid path records,
> > > > > > > > but maybe that's not what you meant.
> > > > > > > Ok if "dummy entries" word as such has created confusion then I am
> > > > > > > sorry for that, But with that I mean that, those are valid path
> > > > > > > records which Administrator knows in advance and while loading the
> > > > > > > module,
> > > > > >
> > > > > > How does the admin know they are valid ?
> > > > > Depending on the initial application runs, some trusted PRs can be generated.
> > > >
> > > > What do initial application runs have to do with this ?
> > > My understanding is that, once the cluster is UP, and if between Node
> > > A and Node B there is only one path,
> >
> > So this is a feature for such one path subnets. I wonder what percentage
> > of deployed subnets fits this case.
> You never know; it may also be useful for debugging.

I still don't have a good feel for how common/generally useful this will
really be.

> > > then, SA query always going to return same values in PR.
> >
> > If subnet topology is changed, these PRs might change. There are other
> > cases where they change too.
> I am not sure about that. Any suggestions?
> >
> > >  On this basis Initial application runs will generate PRs,
> >
> > That's what confused me before (Applications don't generate PRs but
> > rather request them.) but I think I see what you mean now.
> Ok
> >
> > > these PRs can be saved in some file, and can be loaded
> > > when cache_module comes in.
> > > >
> > > > > >Are they somehow preconfigured at the SM ?
> > > > > I am not sure about SM has any such provision?
> > > >
> > > > Not that I'm aware of.
> > > Ok, So, currently no such support is there in SM?
> >
> > I can speak definitively for OpenSM and there is no such support. As to
> > the vendor SMs, I don't think so but don't know for absolute certainty.
> > Someone can correct me if I'm wrong but I wouldn't assume no response
> > means correctness as some may not be listening nor want to respond as to
> > "value added" vendor specific features.
> Would there be any problem if OpenSM provided this?

I'm not following you. What does/should OpenSM provide ? OpenIB works in
configurations with other SMs.

> >
> > > > > Also not sure about the
> > > > > role of SM in path resolving. I mean once node has initiated SA query,
> > > > > whether SM has some database to reply SA or On the fly destination
> > > > > node is contacted to get asked path record?
> > > >
> > > > SMs can either calculate the SA PRs on the fly based on the routing
> > > > algorithm in use and some other things or put them in a local database.
> > > > This is up to that SM.
> > > Ok
> > > >
> > > > Destination node is not contacted in the SA PR query process.
> > > >
> > > > > >Doesn't each SM have its own policy for generating valid PRs ?
> > > > > Ultimately path record is in Path_Record object format, and SA cache
> > > > > is going to store in a fixed manner, How generation policy matters?
> > > >
> > > > What if the local policy loaded does not agree with what the SM would
> > > > generate for a particular PR ? One then gets a local error which will
> > > > need to be tracked down. Not so easy IMO.
> > > SM policies in a subnet to generate PRs, changes dynamically? at run time?
> >
> > The policy doesn't change dynamically but the data to be returned in the
> > SA PR response might.
> >
> > > if Not then depending on the local SM policy static PR can be
> > > generated to load initially.
> >
> > Just as one question related to this, how would link failures be handled
> > ? There are others.
> It's just a matter of avoiding the initial PR query packets by loading the
> cache with static PRs. After that, the cache module will function in the
> normal fashion. I expect that, initially, everything will come up in a
> trusted cluster.

So you're saying the cache would still react to GIDs out and in service,
right ?

If the cache is loaded from a file, does it bypass querying the SA
initially for PRs ? If that is the case, then the file is required to be
the full set of PRs for this node otherwise there would be incomplete
connectivity.
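
To make the concern concrete, here is a minimal sketch (in Python) of the
check a preload tool would need if the file replaces the initial SA query.
The record layout (a dict with a "dgid" key) and the function name are
hypothetical; the real cache stores SA path records, not dicts.

```python
def missing_destinations(preloaded_records, expected_dgids):
    """Return the expected destination GIDs with no preloaded path record.

    preloaded_records: iterable of dicts, each with a "dgid" key
                       (hypothetical layout for this sketch).
    expected_dgids:    iterable of GIDs this node needs to reach.
    """
    covered = {rec["dgid"] for rec in preloaded_records}
    return sorted(set(expected_dgids) - covered)
```

A non-empty result means the file is not the full set of PRs, and those
destinations would have no path unless the SA is queried for them.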

-- Hal

> > > > > CMIIW. Also I am assuming a homogeneous cluster where certain
> > > > > parameters can be assumed to be same always.
> > > >
> > > > and always in agreement with what the SM would return ? For example,
> > > yes
> > > > what happens when a link goes down and the end node is no longer
> > > > reachable ?
> > > If node is not reachable then, after first timeout of sa_cache, that
> > > entry will be removed from cache.
> >
> > OK; that's another aspect to add into this feature. I don't think that
> > is currently done. I think there would need to be an API added to do
> > this.
> Yes, this has been discussed with Sean. We can add a char_dev
> interface to the existing sa_cache module implementation. Its write entry
> point will generate an SA_PR_response packet and pass this packet to the
> update_cache() function.
> 
> We also need to remove the initial schedule_update() call in the
> add_one() function.
> A user command is also required to read the records from a user file and
> write them to this device.
> >
> > -- Hal
> >
> > > > > >are these from a live SM and just loaded "out of band" to
> > > > > bypass/preclude the SA PR >mechanism ?
> > > > > may be
> > > >
> > > > Even if they are, there is still the changes in the subnet issue.
> > > >
> > > > -- Hal
> > > >
> > > > > > -- Hal
> > > > > >
> > > > > > >  Admin is loading this info in the cache with user command.
> > > > > > > >
> > > > > > > > > Another point I want to know is,
> > > > > > > > > When local_sa_cache module will be inserted? After SM comes up or
> > > > > > > > > Before SM comes up?
> > > > > > > >
> > > > > > > > It can occur either way.  There is no restriction.  The cache responds
> > > > > > > > to port up and GID in/out of service events to update itself.
> > > > > > > Do you mean cache module will start building cache only after Port is UP?
> > > > > > > >
> > > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on
> > > > > > > > > some node not on switch) then First Forced schedule_update() is
> > > > > > > > > wasted, and for the first application presence of cache is
> > > > > > > > > meaningless. Why not to keep cache effective right from the start?
> > > > > > > >
> > > > > > > > Pre-loading the cache with path records doesn't guarantee that those
> > > > > > > > paths are usable.  If the SM has not come up, then the path records will
> > > > > > > > be unusable until the SM configures the subnet, plus there's no
> > > > > > > > guarantee that the remote endpoints specified by the paths are running.
> > > > > > > You mean there is no guarantee that even if SM is UP and we have some
> > > > > > > hard coded entries of path record corresponding to some node X, we are
> > > > > > > not sure that node X has actually come up or not?  In that case
> > > > > > > actually that path resolving should fail if node has not come up, but
> > > > > > > with the hard coding still path will be resolved?
> > > > > > > >
> > > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms
> > > > > > > > when booting a large cluster.
> > > > > > > that's true. Also cache will get valid entries only if network is
> > > > > > > configured by SM otherwise every node SA will, possibly, drop SA
> > > > > > > packets.
> > > > > > > >
> > > > > > > > - Sean
> > > > > > > >
> > > > > >
> > > > > >
> > > >
> > > >
> >
> >


From umaxx at oleco.net  Wed May 23 08:03:36 2007
From: umaxx at oleco.net (Joerg Zinke)
Date: Wed, 23 May 2007 17:03:36 +0200
Subject: [ofa-general] mmap() and ibv_reg_mr() and RDMA
In-Reply-To: <adalkfgcxi4.fsf@cisco.com>
References: <20070522210430.5df75050@marvin.local> <adaps4sd1tf.fsf@cisco.com>
	<20070522232559.7785a331@marvin.local> <adalkfgcxi4.fsf@cisco.com>
Message-ID: <20070523170336.2df4e755@marvin.local>

On Tue, 22 May 2007 15:19:15 -0700
Roland Dreier <rdreier at cisco.com> wrote:

>  > this area is mapped via the character device and with
>  > the help of remap_pfn_range() into userspace... this works fine i
>  > can access it from userspace and write/read from it:
> 
> I think that's the problem.  remap_pfn_range() sets VM_PFNMAP on the
> vma used to map the pfns.  When ibv_reg_mr() calls into the kernel to
> do the actual mapping, it ends up doing get_user_pages() which fails
> in vm_normal_page() for such a vma.
> 
> I don't immediately see a good way to handle this.
> 

Many thanks for your fast answer. I will try it the other way around:
instead of mmap'ing the memory into userspace, I will access it via
get_user_pages() too. That should be no problem.

regards,

joerg


From Koen.SEGERS at VRT.BE  Wed May 23 08:20:20 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Wed, 23 May 2007 17:20:20 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1179929493.16831.98786.camel@hal.voltaire.com>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C03157D68@OCBEXS01001.rto.be>

After a whole day of stress testing with MAD renicing turned on, we
got the error once. So I think I should also raise the timeout on the
switch.

It takes about 2 minutes to boot the system. Do you agree that this is a
good value for the timeout?
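
As a rough check of the numbers: 2 minutes is 120 seconds, which is inside
the range Scott gave (default 10 seconds, configurable up to 2000 seconds).
The Python sketch below only formats the switch CLI line from Scott's
example. The helper name and the lower bound of 1 second are assumptions,
not documented Cisco limits.

```python
MAX_NODE_TIMEOUT = 2000  # seconds, upper limit as stated by Scott


def node_timeout_command(seconds, subnet_prefix="fe:80:00:00:00:00:00:00"):
    """Format the 'ib sm ... node-timeout' CLI line for a given timeout.

    The lower bound of 1 second is an assumption for this sketch.
    """
    if not 1 <= seconds <= MAX_NODE_TIMEOUT:
        raise ValueError("node-timeout must be between 1 and 2000 seconds")
    return f"ib sm subnet-prefix {subnet_prefix} node-timeout {seconds}"


# 2 minutes of boot time expressed in seconds
print(node_timeout_command(2 * 60))
```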

Scott,
Can you explain the memlock problem to me?

I saw that the SLES10 bug is only an issue with MVAPICH. Since we didn't
install it, the bug does not affect us. Is that correct?

Greetz

Koen

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com] 
Sent: Wednesday, May 23, 2007 16:12
To: Scott Weitzenkamp (sweitzen)
Cc: SEGERS Koen; Clive Hall (clivhall);
general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> openib.conf,

Does the host really not respond to MAD requests for over 10 seconds in
some cases ?

-- Hal

>  memlock in /etc/security/limits.conf, fix /etc/hosts on
> SLES10 for bug 267, etc.).
> 
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>  
> 
> > -----Original Message-----
> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > Sent: Wednesday, May 23, 2007 6:48 AM
> > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > Cc: Shirley Ma; Ami Perlmutter; 
> > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > So far, all tests seem to work.
> > 
> > Thanks for the help!
> > 
> > Scott,
> > Does Cisco include any other bug fixes in its RPMs?
> > 
> > Greetz
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > Sent: Wednesday, May 23, 2007 0:39
> > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > general-bounces at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > It's not so much pinging every 10 seconds as expecting a response within
> > 10 seconds (Clive, correct me if I'm wrong).
> > 
> > You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
> > binary RPMs we release at
> > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
> > the host be more responsive.
> > 
> > 
> > Scott Weitzenkamp
> > SQA and Release Manager
> > Server Virtualization Business Unit
> > Cisco Systems
> >  
> > 
> > > -----Original Message-----
> > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > To: Scott Weitzenkamp (sweitzen)
> > > Cc: Shirley Ma; Ami Perlmutter; 
> > > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > If I understand it right, the switch is actually polling (=pinging) the
> > > interfaces every 10s. This means that when the interface is handling
> > > other traffic, the poll can fail and the port could be considered out of
> > > service. My question is then: "How can the timeout be reached while
> > > packets are being sent/received to/from the interface?"
> > > 
> > > Anyway, what timeout-value would you recommend for us? And why?
> > > 
> > > To recapitulate: these are the actions I'll take tomorrow
> > > 1) change the MAD niceness of the servers
> > > 2) change the timeout on the switches
> > > 
> > > Are these changes sufficient for the HCA's to keep their ports in
> > > PORT_ACTIVE state?
> > > 
> > > Regards,
> > > 
> > > Koen
> > > 
> > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > >  
> > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > 
> > > > The default is 10 seconds, it can be configured up to 2000 seconds.
> > > > If a HCA is completely unresponsive for longer than the node-timeout
> > > > value, then we consider that HCA out of service.
> > > >  
> > > > Scott Weitzenkamp
> > > > SQA and Release Manager
> > > > Server Virtualization Business Unit
> > > > Cisco Systems
> > > >  
> > > > 
> > > >         
> > > >         
> > > ______________________________________________________________
> > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > >         To: koen.segers at VRT.BE
> > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp
> > > >         (sweitzen)
> > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >         
> > > >         
> > > >         
> > > >         Koen,
> > > >         
> > > >         So it is most likely you hit the same bug as 229 (Scott
> > > >         pointed out earlier). The same workaround might work for you
> > > >         by renicing ib_mad as Scott suggested.
> > > >         
> > > >         I think this should be a SM query timeout tunable value in
> > > >         Cisco SM. Am I right, Scott?
> > > >         
> > > >         Thanks
> > > >         Shirley Ma
> > > >         
> > > >         
> > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > >         Date: 05/22/07 11:14 AM
> > > >         Reply-To: koen.segers at VRT.BE
> > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > >         Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > >         general at lists.openfabrics.org,
> > > >         general-bounces at lists.openfabrics.org
> > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >         
> > > >         
> > > >         
> > > >         Hi,
> > > >         
> > > >         It is the Cisco SM. 
> > > >         
> > > >         SFS-7000P> show version
> > > >         
> > > >         ================================================================================
> > > >                                   System Version Information
> > > >         ================================================================================
> > > >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > >                          contact : tac at cisco.com
> > > >                             name : SFS-7000P
> > > >                         location : 170 West Tasman Drive, San Jose, CA 95134
> > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > >                      last-change : none
> > > >                 last-config-save : none
> > > >                           action : none
> > > >                           result : none
> > > >                        oper-mode : normal
> > > >         
> > > >         There is also a command that gives the SM version, but I can't
> > > >         find it right now.
> > > >         
> > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > >         > Hello Koen,
> > > >         > 
> > > >         > From the switch log, it looks like an SM issue to me. The node
> > > >         > was kicked out of the membership. Which SM are you using in
> > > >         > your fabric?
> > > >         > 
> > > >         > Thanks
> > > >         > Shirley Ma
> > > >         > 
> > > >         *** Disclaimer ***
> > > >         
> > > >         Vlaamse Radio- en Televisieomroep
> > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > >         
> > > >         nv van publiek recht
> > > >         BTW BE 0244.142.664
> > > >         RPR Brussel
> > > >         http://www.vrt.be/disclaimer
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         

From sweitzen at cisco.com  Wed May 23 08:37:15 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 23 May 2007 08:37:15 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1179929493.16831.98786.camel@hal.voltaire.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E15EC@xmb-sjc-216.amer.cisco.com>
	<D63C0BE2D613C543B6F3305502E9784C03157D67@OCBEXS01001.rto.be>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3038E17BF@xmb-sjc-216.amer.cisco.com>
	<1179929493.16831.98786.camel@hal.voltaire.com>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E182E@xmb-sjc-216.amer.cisco.com>

> Does the host really not respond to MAD requests for over 10 seconds in
> some cases?

With recent Xeon, Opteron, and Power processors, yes.

Scott


From sweitzen at cisco.com  Wed May 23 08:37:54 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 23 May 2007 08:37:54 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D68@OCBEXS01001.rto.be>
References: <1179929493.16831.98786.camel@hal.voltaire.com>
	<D63C0BE2D613C543B6F3305502E9784C03157D68@OCBEXS01001.rto.be>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E1830@xmb-sjc-216.amer.cisco.com>

The boot time of the host doesn't matter for this timeout.  While the
host is booting, the IB link is down anyway.

Scott 

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> Sent: Wednesday, May 23, 2007 8:20 AM
> To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> Cc: Clive Hall (clivhall); 
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> After a whole day of stress testing with the MAD renicing turned on, we
> got the error once. So I think I should also raise the timeout on the
> switch.
> 
> It takes about 2 minutes to boot the system. Do you agree that this is a
> good value for the timeout?
> 
> Scott,
> Can you explain the memlock problem to me?
> 
> I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> install this, the bug is not related to us. This is correct, isn't it?
> 
> Greetz
> 
> Koen
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Sent: Wednesday, May 23, 2007 16:12
> To: Scott Weitzenkamp (sweitzen)
> CC: SEGERS Koen; Clive Hall (clivhall);
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > openib.conf,
> 
> Does the host really not respond to MAD requests for over 10 seconds in
> some cases?
> 
> -- Hal
> 
> >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > SLES10 for bug 267, etc.).
> > 
> > Scott Weitzenkamp
> > SQA and Release Manager
> > Server Virtualization Business Unit
> > Cisco Systems
> >  
> > 
> > > -----Original Message-----
> > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > general-bounces at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > Thus far, all tests seem to work.
> > > 
> > > Thanks for the help!
> > > 
> > > Scott,
> > > Are there more bug fixes that Cisco includes in its RPMs?
> > > 
> > > Greetz
> > > 
> > > Koen
> > > 
> > > -----Original Message-----
> > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > Sent: Wednesday, May 23, 2007 0:39
> > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > general-bounces at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > It's not so much pinging every 10 seconds as expecting a response
> > > within 10 seconds (Clive, correct me if I'm wrong).
> > > 
> > > You only need to do 1) or 2), not both.  Cisco configures 1) in the
> > > OFED binary RPMs we release at
> > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to
> > > have the host be more responsive.
> > > 
> > > 
> > > Scott Weitzenkamp
> > > SQA and Release Manager
> > > Server Virtualization Business Unit
> > > Cisco Systems
> > >  
> > > 
> > > > -----Original Message-----
> > > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > To: Scott Weitzenkamp (sweitzen)
> > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > If I understand it right, the switch is actually polling (=pinging)
> > > > the interfaces every 10s. This means that when the interface is
> > > > handling other traffic, the poll can fail and the port could be
> > > > considered out of service. My question is then: "How can the timeout
> > > > be reached while packets are being sent/received to/from the
> > > > interface?"
> > > > 
> > > > Anyway, what timeout value would you recommend for us? And why?
> > > > 
> > > > To recapitulate: these are the actions I'll take tomorrow
> > > > 1) change the MAD niceness of the servers
> > > > 2) change the timeout on the switches
> > > > 
> > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > PORT_ACTIVE state?
> > > > 
> > > > Regards,
> > > > 
> > > > Koen
> > > > 
> > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > >  
> > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00
> > > > > node-timeout <value>
> > > > > 
> > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > If an HCA is completely unresponsive for longer than the node-timeout
> > > > > value, then we consider that HCA out of service.
> > > > >  
> > > > > Scott Weitzenkamp
> > > > > SQA and Release Manager
> > > > > Server Virtualization Business Unit
> > > > > Cisco Systems
> > > > >  
> > > > > 
> > > > >         
> > > > >         
> > > > ______________________________________________________________
> > > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > >         To: koen.segers at VRT.BE
> > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > >         general-bounces at lists.openfabrics.org; Scott 
> Weitzenkamp
> > > > >         (sweitzen)
> > > > >         Subject: RE: [ofa-general] GPFS node loses 
> IB-connection
> > > > >         
> > > > >         
> > > > >         
> > > > >         Koen,
> > > > >         
> > > > > >         So it is most likely you hit the same bug as 229 (Scott
> > > > > >         pointed out earlier). The same workaround might work for you
> > > > > >         by renicing ib_mad as Scott suggested.
> > > > > >         
> > > > > >         I think this should be a SM query timeout tunable value in
> > > > > >         Cisco SM. Am I right, Scott?
> > > > >         
> > > > >         Thanks
> > > > >         Shirley Ma
> > > > >         
> > > > >         
> > > > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > > > >         Date: 05/22/07 11:14 AM
> > > > > >         Reply-To: koen.segers at VRT.BE
> > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > >         Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > >         general at lists.openfabrics.org,
> > > > > >         general-bounces at lists.openfabrics.org
> > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >         
> > > > >         
> > > > >         
> > > > >         Hi,
> > > > >         
> > > > >         It is the Cisco SM. 
> > > > >         
> > > > >         SFS-7000P> show version
> > > > >         
> > > > >         
> > > > >         
> > > > ==============================================================
> > > > ==================
> > > > >                                   System Version Information
> > > > >         
> > > > ==============================================================
> > > > ==================
> > > > >                   system-version : SFS-7000P TopspinOS 
> > > 2.9.0 releng
> > > > >         #147
> > > > >         10/25/2006 02:01:32
> > > > >                          contact : tac at cisco.com
> > > > >                             name : SFS-7000P
> > > > >                         location : 170 West Tasman Drive, 
> > > > San Jose, CA
> > > > >         95134
> > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > >                      last-change : none
> > > > >                 last-config-save : none
> > > > >                           action : none
> > > > >                           result : none
> > > > >                        oper-mode : normal
> > > > >         
> > > > > >         There is also a command that gives the SM version, but I can't
> > > > > >         find it right now.
> > > > >         
> > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > >         > Hello Koen,
> > > > >         > 
> > > > > >         > From the switch log, it looks like an SM issue to me. The node
> > > > > >         > was kicked out of the membership. Which SM are you using in
> > > > > >         > your fabric?
> > > > >         > 
> > > > >         > Thanks
> > > > >         > Shirley Ma
> > > > >         > 


From Koen.SEGERS at VRT.BE  Wed May 23 08:39:06 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Wed, 23 May 2007 17:39:06 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E1830@xmb-sjc-216.amer.cisco.com>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C03157D6A@OCBEXS01001.rto.be>

What value would you recommend then?

Koen

-----Original Message-----
From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
Sent: Wednesday, May 23, 2007 17:38
To: SEGERS Koen; Hal Rosenstock
CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

The boot time of the host doesn't matter for this timeout.  While the
host is booting, the IB link is down anyway.

Scott 


From sweitzen at cisco.com  Wed May 23 08:40:54 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 23 May 2007 08:40:54 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D6A@OCBEXS01001.rto.be>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E1830@xmb-sjc-216.amer.cisco.com>
	<D63C0BE2D613C543B6F3305502E9784C03157D6A@OCBEXS01001.rto.be>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E1834@xmb-sjc-216.amer.cisco.com>

Try 20 seconds; I'm curious whether you are barely crossing the 10-second
threshold.
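
To make that reasoning concrete, here is a minimal sketch (not from the
thread: the delay values are invented for illustration, and
`late_responses` is a hypothetical helper) of how comparing a 10-second
and a 20-second node-timeout would show whether the host is only just
missing the window:

```python
# Illustrative sketch only. The delays below are made-up numbers, not
# measurements from this thread, and late_responses is a hypothetical helper.

def late_responses(delays_s, node_timeout_s):
    """Return the response delays (in seconds) that exceed the node-timeout."""
    return [d for d in delays_s if d > node_timeout_s]

# Hypothetical worst-case MAD response delays seen during a stress test.
sample_delays_s = [0.2, 1.5, 10.8, 3.0, 11.4]

for timeout_s in (10, 20):
    late = late_responses(sample_delays_s, timeout_s)
    print(f"node-timeout {timeout_s}s: {len(late)} late response(s) {late}")
```

If every late response falls between 10 and 20 seconds, the 20-second
setting should make the error go away; if the error still shows up, the
host is stalling for much longer and the timeout is probably not the
whole story.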

Scott 

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> Sent: Wednesday, May 23, 2007 8:39 AM
> To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> Cc: Clive Hall (clivhall); 
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> What value would you recommend then?
> 
> Koen
> 
> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> Sent: Wednesday, May 23, 2007 17:38
> To: SEGERS Koen; Hal Rosenstock
> CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> The boot time of the host doesn't matter for this timeout.  While the
> host is booting, the IB link is down anyway.
> 
> Scott 
> 
> > -----Original Message-----
> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > Sent: Wednesday, May 23, 2007 8:20 AM
> > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > Cc: Clive Hall (clivhall); 
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > After a whole day of stress testing with the MAD renicing turned on, we
> > got the error once. So I think I should also raise the timeout on the
> > switch.
> > 
> > It takes about 2 minutes to boot the system. Do you agree that this is a
> > good value for the timeout?
> > 
> > Scott,
> > Can you explain the memlock problem to me?
> > 
> > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > install this, the bug is not related to us. This is correct, isn't it?
> > 
> > Greetz
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com] 
> > Sent: Wednesday, May 23, 2007 16:12
> > To: Scott Weitzenkamp (sweitzen)
> > CC: SEGERS Koen; Clive Hall (clivhall);
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > > openib.conf,
> > 
> > Does the host really not respond to MAD requests for over 10 seconds in
> > some cases?
> > 
> > -- Hal
> > 
> > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > SLES10 for bug 267, etc.).
> > > 
> > > Scott Weitzenkamp
> > > SQA and Release Manager
> > > Server Virtualization Business Unit
> > > Cisco Systems
> > >  
> > > 
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > Thus far, all tests seem to work.
> > > > 
> > > > Thanks for the help!
> > > > 
> > > > Scott,
> > > > Are there more bug fixes that Cisco includes in its RPMs?
> > > > 
> > > > Greetz
> > > > 
> > > > Koen
> > > > 
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > Sent: Wednesday, May 23, 2007 0:39
> > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > It's not so much pinging every 10 seconds as expecting a response
> > > > within 10 seconds (Clive, correct me if I'm wrong).
> > > > 
> > > > You only need to do 1) or 2), not both.  Cisco configures 1) in the
> > > > OFED binary RPMs we release at
> > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to
> > > > have the host be more responsive.
> > > > 
> > > > 
> > > > Scott Weitzenkamp
> > > > SQA and Release Manager
> > > > Server Virtualization Business Unit
> > > > Cisco Systems
> > > >  
> > > > 
> > > > > -----Original Message-----
> > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > > general at lists.openfabrics.org;
> > general-bounces at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > If I understand it right, the switch is actually polling (=pinging)
> > > > > the interfaces every 10s. This means that when the interface is
> > > > > handling other traffic, the poll can fail and the port could be
> > > > > considered out of service. My question is then: "How can the timeout
> > > > > be reached while packets are being sent/received to/from the
> > > > > interface?"
> > > > > 
> > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > > 
> > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > 1) change the MAD niceness of the servers
> > > > > 2) change the timeout on the switches
> > > > > 
> > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > PORT_ACTIVE state?
> > > > > 
> > > > > Regards,
> > > > > 
> > > > > Koen
> > > > > 
> > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > >  
> > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00
> > > > > > node-timeout <value>
> > > > > > 
> > > > > > The default is 10 seconds, it can be configured up to 2000 seconds.
> > > > > > If a HCA is completely unresponsive for longer than the node-timeout
> > > > > > value, then we consider that HCA out of service.
> > > > > >  
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems
> > > > > >  
> > > > > > 
> > > > > >         
> > > > > >         
> > > > > >         ______________________________________________________________
> > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > >         To: koen.segers at VRT.BE
> > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp
> > > > > >         (sweitzen)
> > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >         
> > > > > >         Koen,
> > > > > >         
> > > > > >         So it is most likely you hit the same bug as 229 (Scott
> > > > > >         pointed out earlier). The same workaround might work for you
> > > > > >         by renicing ib_mad as Scott suggested.
> > > > > >         
> > > > > >         I think this should be a SM query timeout tunable value in
> > > > > >         Cisco SM. Am I right, Scott?
> > > > > >         
> > > > > >         Thanks
> > > > > >         Shirley Ma
> > > > > >         
> > > > > >         
> > > > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > > > >         Date: 05/22/07 11:14 AM
> > > > > >         Please respond to: koen.segers at VRT.BE
> > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > >         cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > >         general at lists.openfabrics.org,
> > > > > >         general-bounces at lists.openfabrics.org
> > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >         
> > > > > >         Hi,
> > > > > >         
> > > > > >         It is the Cisco SM. 
> > > > > >         
> > > > > >         SFS-7000P> show version
> > > > > >         
> > > > > >         
> > > > > >         
> > > > > >         ================================================================================
> > > > > >                                   System Version Information
> > > > > >         ================================================================================
> > > > > >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > >                          contact : tac at cisco.com
> > > > > >                             name : SFS-7000P
> > > > > >                         location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > > >                      last-change : none
> > > > > >                 last-config-save : none
> > > > > >                           action : none
> > > > > >                           result : none
> > > > > >                        oper-mode : normal
> > > > > >         
> > > > > >         There is also a command that gives the SM version, but I can't
> > > > > >         find it right now. 
> > > > > >         
> > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > >         > Hello Koen,
> > > > > >         > 
> > > > > >         > From the switch log, it looks like an SM issue to me. The node
> > > > > >         > was kicked out of the membership. Which SM are you using in
> > > > > >         > your fabric? 
> > > > > >         > 
> > > > > >         > Thanks
> > > > > >         > Shirley Ma
> > > > > >         > 
> > > > > >         *** Disclaimer ***
> > > > > >         
> > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > >         
> > > > > >         nv van publiek recht
> > > > > >         BTW BE 0244.142.664
> > > > > >         RPR Brussel
> > > > > >         http://www.vrt.be/disclaimer
> > > > > >         
> > > > > >         
> > > > > >         
> > > > > >         
> > > > > >         
> > > _______________________________________________
> > > general mailing list
> > > general at lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > 
> > > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
> > 


From pradeeps at linux.vnet.ibm.com  Wed May 23 09:17:19 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Wed, 23 May 2007 09:17:19 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <46537081.30906@linux.vnet.ibm.com>
References: <46537081.30906@linux.vnet.ibm.com>
Message-ID: <4654690F.1040305@linux.vnet.ibm.com>


If this proposal is acceptable, would you want me to generate a patch
against Roland's for-2.6.22 git tree, or would for-2.6.23 tree be
better?

Pradeep

Pradeep Satyanarayana wrote:
> Here are my thoughts about limiting the memory footprint for IPOIB CM
> (NOSRQ) patch:
> 
> By default, cap the NOSRQ memory usage to 1GB. The default recvq_size
> is set to 128. Therefore for 64KB packets this would imply a maximum of
> 128 endpoints.
> 
> -Make the maximum number of endpoints a module parameter with a default
> value of 128.
> 
> -The NOSRQ limit of 1GB is also made a module parameter. However, 1GB is
> the default limit and could be changed as needed (by the administrator)
> depending on the system configuration, application needs and so on. The
> server would return a "REJ" message upon receiving a "REQ", whenever one
> of these limits (i.e. max number of endpoints or the max NOSRQ memory
> usage) is reached. Currently, we only check for the maximum number of
> endpoints -hard coded to 1024.
> 
> -The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that
> support SRQ, like the Topspin HCA, and such HCAs should not be
> impacted at all.
> 
> -Currently we allocate a default of 64KB for the ring buffer elements,
> and this buffer size is not linked to the mtu. In the future, we could
> allocate buffers based on the mtu and link that into the computation of
> the memory cap. This way customers who might want to use a smaller mtu
> could use a larger number of endpoints, or a larger recvq_size without
> exceeding the memory cap.
> 
> 
> Would this approach address the issues of scalability and enable IPOIB
> CM to be turned on as the default?
> 
> 
> Pradeep
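
The arithmetic behind the proposed defaults can be sketched in C as below. This is a minimal sketch under stated assumptions, not code from the patch: the helper names nosrq_max_endpoints() and nosrq_may_accept() are hypothetical, and the only inputs taken from the proposal are the 1GB cap, the recvq_size of 128 and the 64KB buffer size.

```c
#include <stdint.h>

/* Hypothetical helper (not from the patch): number of NOSRQ endpoints
 * that fit under a memory cap when each endpoint owns recvq_size receive
 * buffers of buf_bytes each. Returns 0 if either size is 0. */
static uint64_t nosrq_max_endpoints(uint64_t cap_bytes,
				    uint64_t recvq_size,
				    uint64_t buf_bytes)
{
	if (recvq_size == 0 || buf_bytes == 0)
		return 0;
	return cap_bytes / (recvq_size * buf_bytes);
}

/* Hypothetical admission check: a new REQ is accepted only while both
 * the endpoint limit and the memory cap would still be respected;
 * otherwise the caller would answer with a REJ. */
static int nosrq_may_accept(uint64_t cur_endpoints, uint64_t max_endpoints,
			    uint64_t cap_bytes, uint64_t recvq_size,
			    uint64_t buf_bytes)
{
	uint64_t per_endpoint = recvq_size * buf_bytes;

	if (cur_endpoints >= max_endpoints)
		return 0;
	return (cur_endpoints + 1) * per_endpoint <= cap_bytes;
}
```

With a 1GB cap, recvq_size 128 and 64KB buffers, each endpoint pins 8MB, which gives the 128 endpoints quoted above. Sizing buffers from a smaller mtu (2KB is used in the check below purely as an illustration) raises the count under the same cap.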


From narravul at cse.ohio-state.edu  Wed May 23 10:27:01 2007
From: narravul at cse.ohio-state.edu (Sundeep Narravula)
Date: Wed, 23 May 2007 13:27:01 -0400 (EDT)
Subject: [ofa-general] Problem with using two interfaces with rdma-cm
Message-ID: <Pine.GSO.4.40.0705231243540.18218-100000@nu.cse.ohio-state.edu>


Hi Sean,

  I currently have a setup with two nodes connected with two hcas. Both
the hcas have ip addresses in different subnets.

Rail 1 (ib0): 192.168.1.*
Rail 2 (ib2): 192.168.3.*

When I try to connect two QPs over these rails (one on each), the address
resolutions for both QPs often return the context of just one of the
rails, i.e. I am not able to use both rails.

Is there anything I am missing here?

We are using OFED-1.1 on this cluster.

Thanks,
  --Sundeep.


From sean.hefty at intel.com  Wed May 23 10:37:53 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 23 May 2007 10:37:53 -0700
Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm
In-Reply-To: <Pine.GSO.4.40.0705231243540.18218-100000@nu.cse.ohio-state.edu>
Message-ID: <000301c79d61$17db1210$ff0da8c0@amr.corp.intel.com>

>Rail 1 (ib0): 192.168.1.*
>Rail 2 (ib2): 192.168.3.*
>
>When I try to connect two qps over these rails (one on each), many times
>the address resolutions for both the qps return me the context of just
>one of the rails. i.e. I am not able to use both the rails.
>
>Is there any thing I am missing here?

Can you provide more details on how you are establishing your connections?

Are you calling rdma_resolve_addr() with 192.168.1.x in one case, and
192.168.3.x in the second case, and both of those resolve back to the same local
IP address?  Can you tell if ping routes the same way?

You can try binding to a specific local address, but based on your setup, I
would expect this to work.  So, I'd like to understand better what the issue
could be.

- Sean


From mst at dev.mellanox.co.il  Wed May 23 10:50:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 23 May 2007 20:50:30 +0300
Subject: [ofa-general] Re: skb queue management in ipoib
In-Reply-To: <adafy5obuhh.fsf@cisco.com>
References: <20070522063634.GB3331@mellanox.co.il> <adafy5obuhh.fsf@cisco.com>
Message-ID: <20070523175030.GA6019@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: skb queue management in ipoib
> 
>  > 	I think that managing this queue in a FIFO manner, dropping
>  > 	old packets and inserting new ones instead would be better:
>  > 	an older packet is more likely to have timed out.
> 
> Yes, that probably makes sense.
> 
>  > 	So we would do something along the lines of:
>  > 
>  >                        __skb_queue_tail(&neigh->queue, skb);
>  >                        if (skb_queue_len(&neigh->queue) > IPOIB_MAX_PATH_REC_QUEUE) {
>  >                                 skb = __skb_dequeue_tail(&neigh->queue);
> 
> this should just be __skb_dequeue though...

Ugh, sure. I'll post something like this for 2.6.23 then?
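
For readers following along, the drop-oldest policy agreed on above can be modelled in plain userspace C. This is only a sketch of the queueing idea, not the ipoib code: struct pkt_queue, MAX_QUEUE and both function names are hypothetical stand-ins, with integers in place of skbs.

```c
#include <stddef.h>

/* Stands in for IPOIB_MAX_PATH_REC_QUEUE; the value 3 is illustrative. */
#define MAX_QUEUE 3

struct pkt_queue {
	int items[MAX_QUEUE + 1];	/* one spare slot: enqueue, then trim */
	size_t head;			/* index of the oldest element */
	size_t len;			/* number of queued elements */
};

/* Remove and return the oldest element; caller ensures len > 0.
 * This plays the role of __skb_dequeue (dequeue from the head). */
static int queue_pop_head(struct pkt_queue *q)
{
	int v = q->items[q->head];

	q->head = (q->head + 1) % (MAX_QUEUE + 1);
	q->len--;
	return v;
}

/* Append at the tail; if the queue is now over the limit, evict the
 * oldest packet. Returns 1 and stores the evicted id in *dropped when
 * an eviction happened, 0 otherwise. */
static int queue_push_drop_oldest(struct pkt_queue *q, int pkt, int *dropped)
{
	size_t tail = (q->head + q->len) % (MAX_QUEUE + 1);

	q->items[tail] = pkt;
	q->len++;
	if (q->len > MAX_QUEUE) {
		*dropped = queue_pop_head(q);
		return 1;
	}
	return 0;
}
```

Dequeuing from the tail after the push, as in the first draft quoted above, would throw away the packet that was just queued; dequeuing from the head is what makes the oldest packet the one that is dropped.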


-- 
MST


From mst at dev.mellanox.co.il  Wed May 23 11:36:50 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 23 May 2007 21:36:50 +0300
Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <4653248A.1040108@ichips.intel.com>
References: <20070520134441.GI20649@mellanox.co.il>
	<000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
	<20070522075952.GC3331@mellanox.co.il>
	<4653248A.1040108@ichips.intel.com>
Message-ID: <20070523183650.GB6019@mellanox.co.il>

> >>Also, I left the duplicate request handling
> >>as it was, since that should go in as a separate patch.
> >
> >Could you please describe what is missing currently?
> >Is the missing handling likely to cause timeouts?
> 
> If two REQs are received with matching local IDs, but the REQs 
> themselves differ on one or more fields, such as the QPN, the second REQ 
> is dropped as a duplicate.

Why do you speak about dropping duplicates as a valid response?
As far as I can tell, the 2 legal responses to a duplicate REQ
are resending a REP and rejecting with code 30.

> This causes timeouts, so I need to figure 
> out what the correct behavior should be here.

I agree that it seems that we could use this as a hint that remote has
rebooted.

-- 
MST


From sean.hefty at intel.com  Wed May 23 12:39:33 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 23 May 2007 12:39:33 -0700
Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <20070523183650.GB6019@mellanox.co.il>
Message-ID: <000501c79d72$17260a30$ff0da8c0@amr.corp.intel.com>

>> If two REQs are received with matching local IDs, but the REQs
>> themselves differ on one or more fields, such as the QPN, the second REQ
>> is dropped as a duplicate.
>
>Why do you speak about dropping duplicates as a valid response?

I was only mentioning the current behavior.

>As far as I can tell, the 2 legal responses to a duplicate REQ
>are resending a REP and rejecting with code 30.

It's possible to receive a duplicate REQ before processing has completed and a
REP generated to the first REQ.  In this case, it makes sense simply to discard
the duplicate REQ.

When processing completes on the first REQ, the CM will generate either a REP or
a REJ, so I believe that the behavior is compliant when handling actual
duplicate REQs.
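
The cases discussed in this thread can be written down as a small decision function. This is a sketch for discussion only, not the ib_cm implementation: all type and function names are hypothetical, and rejecting a REQ that reuses a local ID with different fields is just one of the options raised here (queueing it is another), not settled behavior.

```c
/* Hypothetical model of the passive side's state for one REQ. */
enum conn_state { STATE_REQ_RCVD, STATE_REP_SENT };

enum req_action {
	ACT_NEW_REQ,	/* different local ID: not a duplicate at all */
	ACT_DISCARD,	/* true duplicate, first REQ still being processed */
	ACT_RESEND_REP,	/* true duplicate, REP already generated */
	ACT_REJECT	/* same local ID, different fields (assumed: REJ, code 30) */
};

/* Only two fields are modelled; a real REQ carries many more. */
struct req_fields {
	unsigned int local_comm_id;
	unsigned int qpn;
};

static enum req_action classify_req(const struct req_fields *first,
				    enum conn_state state,
				    const struct req_fields *incoming)
{
	if (incoming->local_comm_id != first->local_comm_id)
		return ACT_NEW_REQ;
	if (incoming->qpn != first->qpn)
		return ACT_REJECT;
	return state == STATE_REQ_RCVD ? ACT_DISCARD : ACT_RESEND_REP;
}
```

The point of separating ACT_REJECT from ACT_DISCARD is that a silently dropped REQ leaves the sender waiting for its timeout, whereas an explicit answer does not.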

- Sean


From mst at dev.mellanox.co.il  Wed May 23 12:50:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 23 May 2007 22:50:11 +0300
Subject: [ofa-general] Re: [PATCH RFC] IB/ipoib: fix to_ipoib_neigh access
	race
In-Reply-To: <adatzu4d1wx.fsf@cisco.com>
References: <20070522005918.GB13311@mellanox.co.il> <adatzu4d1wx.fsf@cisco.com>
Message-ID: <20070523195011.GC6019@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH RFC] IB/ipoib: fix to_ipoib_neigh access race
> 
>  > hard_start_xmit dereferences to_ipoib_neigh when only tx_lock is taken.  This
>  > would only be safe if all calls that modify *to_ipoib_neigh take tx_lock too.
>  > Currently this is not always true for ipoib_neigh_free and path_rec_completion,
>  > which results in memory corruption.  Fix this race, making sure
>  > path_rec_completion and ipoib_neigh_free are always called under
>  > tx_lock.
>  > 
>  > Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
>  > 
>  > ---
>  > 
>  > I'm looking at
>  > https://bugs.openfabrics.org/show_bug.cgi?id=604
>  > and I think this could explain the crashes.
>  > In any case, Roland, is there a race or am I imagining things?
>  > 
>  > NB: The patch is untested (I'm not at the lab now).
> 
> Yes, it does seem that there is a problem here.  However, I think the first
> part of this needs to be handled another way -- for example:
> 
>  > -		path_free(dev, path);
>  >  		spin_lock_irq(&priv->tx_lock);
>  >  		spin_lock(&priv->lock);
>  > +		path_free(dev, path);
> 
> path_free already takes priv->lock internally, and also calls
> ipoib_put_ah(), which may end up in ipoib_free_ah(), which also might
> take priv->lock.

Interesting point: note how unicast_arp_send is called under tx_lock,
and calls path_free from there.

It seems to be safe simply because we never have an AH
or any neighbours there, but it does look a bit ugly,
and there's a bit of code duplication in that function.

> It's not immediately obvious what the right fix is...

Maybe 1. avoid doing path_free in unicast_arp_send: just
do __path_add unconditionally like we do for regular packets,
and 2. make path_free take both tx_lock and priv->lock?

Something along the following lines (NB: untested):

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-22 01:46:54.000000000 +0300
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-23 22:45:18.000000000 +0300
@@ -262,7 +262,8 @@ static void path_free(struct net_device 
 	while ((skb = __skb_dequeue(&path->queue)))
 		dev_kfree_skb_irq(skb);
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
 		/*
@@ -277,7 +278,8 @@ static void path_free(struct net_device 
 		ipoib_neigh_free(dev, neigh);
 	}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (path->ah)
 		ipoib_put_ah(path->ah);
@@ -401,7 +403,8 @@ static void path_rec_completion(int stat
 			ah = ipoib_create_ah(dev, priv->pd, &av);
 	}
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	path->ah = ah;
 
@@ -442,7 +445,8 @@ static void path_rec_completion(int stat
 	path->query = NULL;
 	complete(&path->done);
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	while ((skb = __skb_dequeue(&skqueue))) {
 		skb->dev = dev;
@@ -614,32 +618,16 @@ static void unicast_arp_send(struct sk_b
 	path = __path_find(dev, phdr->hwaddr + 4);
 	if (!path) {
 		path = path_rec_create(dev, phdr->hwaddr + 4);
-		if (path) {
-			/* put pseudoheader back on for next time */
-			skb_push(skb, sizeof *phdr);
-			__skb_queue_tail(&path->queue, skb);
-
-			if (path_rec_start(dev, path)) {
-				spin_unlock(&priv->lock);
-				path_free(dev, path);
-				return;
-			} else
-				__path_add(dev, path);
-		} else {
-			++priv->stats.tx_dropped;
-			dev_kfree_skb_any(skb);
-		}
-
-		spin_unlock(&priv->lock);
-		return;
+		if (path)
+			__path_add(dev, path);
 	}
 
-	if (path->ah) {
+	if (path && path->ah) {
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
 			  be16_to_cpu(path->pathrec.dlid));
 
 		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
-	} else if ((path->query || !path_rec_start(dev, path)) &&
+	} else if (path && (path->query || !path_rec_start(dev, path)) &&
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		/* put pseudoheader back on for next time */
 		skb_push(skb, sizeof *phdr);
@@ -822,7 +810,8 @@ static void ipoib_neigh_cleanup(struct n
 		  IPOIB_QPN(n->ha),
 		  IPOIB_GID_RAW_ARG(n->ha + 4));
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	neigh = *to_ipoib_neigh(n);
 	if (neigh) {
@@ -832,7 +821,8 @@ static void ipoib_neigh_cleanup(struct n
 		ipoib_neigh_free(n->dev, neigh);
 	}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (ah)
 		ipoib_put_ah(ah);
Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-22 01:46:54.000000000 +0300
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-23 21:38:28.000000000 +0300
@@ -100,7 +100,8 @@ static void ipoib_mcast_free(struct ipoi
 			"deleting multicast group " IPOIB_GID_FMT "\n",
 			IPOIB_GID_ARG(mcast->mcmember.mgid));
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) {
 		/*
@@ -114,7 +115,8 @@ static void ipoib_mcast_free(struct ipoi
 		ipoib_neigh_free(dev, neigh);
 	}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (mcast->ah)
 		ipoib_put_ah(mcast->ah);

-- 
MST
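
The rule the patch above relies on is that tx_lock is always taken before priv->lock, never the other way around, and that a function which takes either lock is not called with that lock already held. A toy checker for that discipline, written as a userspace sketch, is shown below. It uses no real locks, and every name in it (lock_rank, lock_tracker, tracker_acquire, tracker_release) is hypothetical; it only illustrates the ordering rule and is not a substitute for lockdep.

```c
/* Ranks encode the intended order: lower rank must be taken first. */
enum lock_rank { RANK_TX_LOCK = 1, RANK_PRIV_LOCK = 2 };

struct lock_tracker {
	int held[3];	/* indexed by rank; slot 0 is unused */
};

/* Record an acquisition. Returns -1 if the lock is already held
 * (recursive acquisition, which would deadlock a spinlock) or if a
 * higher-ranked lock is already held (ordering violation). */
static int tracker_acquire(struct lock_tracker *t, enum lock_rank r)
{
	int i;

	if (t->held[r])
		return -1;
	for (i = r + 1; i <= RANK_PRIV_LOCK; i++)
		if (t->held[i])
			return -1;
	t->held[r] = 1;
	return 0;
}

/* Record a release. */
static void tracker_release(struct lock_tracker *t, enum lock_rank r)
{
	t->held[r] = 0;
}
```

In this model, the problem Roland pointed out (path_free taking priv->lock while the caller already holds it) shows up as the recursive-acquisition error, and taking tx_lock while priv->lock is held shows up as the ordering error.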


From mst at dev.mellanox.co.il  Wed May 23 12:57:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 23 May 2007 22:57:00 +0300
Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <000501c79d72$17260a30$ff0da8c0@amr.corp.intel.com>
References: <20070523183650.GB6019@mellanox.co.il>
	<000501c79d72$17260a30$ff0da8c0@amr.corp.intel.com>
Message-ID: <20070523195700.GD6019@mellanox.co.il>

> Quoting Sean Hefty <sean.hefty at intel.com>:
> Subject: RE: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
> 
> >> If two REQs are received with matching local IDs, but the REQs
> >> themselves differ on one or more fields, such as the QPN, the second REQ
> >> is dropped as a duplicate.
> >
> >Why do you speak about dropping duplicates as a valid response?
> 
> I was only mentioning the current behavior.
> 
> >As far as I can tell, the 2 legal responses to a duplicate REQ
> >are resending a REP and rejecting with code 30.
> 
> It's possible to receive a duplicate REQ before processing has completed and a
> REP generated to the first REQ.  In this case, it makes sense simply to discard
> the duplicate REQ.
> 
> When processing completes on the first REQ, the CM will generate either a REP or
> a REJ, so I believe that the behavior is compliant when handling actual
> duplicate REQs.

Well, in case the second REQ differs from the first one, discarding
it might not be the best option: I think we might want to queue it
and process it when you exit the ephemeral state.

-- 
MST


From narravul at cse.ohio-state.edu  Wed May 23 13:07:33 2007
From: narravul at cse.ohio-state.edu (Sundeep Narravula)
Date: Wed, 23 May 2007 16:07:33 -0400 (EDT)
Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm
In-Reply-To: <000301c79d61$17db1210$ff0da8c0@amr.corp.intel.com>
Message-ID: <Pine.GSO.4.40.0705231447570.18218-100000@nu.cse.ohio-state.edu>


On Wed, 23 May 2007, Sean Hefty wrote:

> >Rail 1 (ib0): 192.168.1.*
> >Rail 2 (ib2): 192.168.3.*
> >
> >When I try to connect two qps over these rails (one on each), many times
> >the address resolutions for both the qps return me the context of just
> >one of the rails. i.e. I am not able to use both the rails.
> >
> >Is there any thing I am missing here?
>
> Can you provide more details on how you are establishing your connections?
>
> Are you calling rdma_resolve_addr() with 192.168.1.x in one case, and
> 192.168.3.x in the second case, and both of those resolve back to the same local
> IP address?  Can you tell if ping routes the same way?

Basically I have the following sequence of steps.

Process 1:

rdma_bind_addr (someport, 0.0.0.0)
rdma_listen ()
....
  rdma_get_cm_event()
	if (RDMA_CM_EVENT_CONNECT_REQUEST)
		rdma_accept()
....

Process 2:

rdma_connect (192.168.1.x)
rdma_connect (192.168.3.x)
....
wait for connections.


I am able to ping on both the interfaces. The ping messages go over both
the interfaces. In fact, after pinging the interfaces from each other a few
times I am able to connect properly over both the rails for some time.
After a few minutes it falls back to just one interface.

> You can try binding to a specific local address, but based on your setup, I
> would expect this to work.  So, I'd like to understand better what the issue
> could be.

Hmm, I can try this, but only as a last resort. Ideally I would like to use
just one listen cm_id bound to 0.0.0.0.

  --Sundeep.

>
> - Sean
>


From sean.hefty at intel.com  Wed May 23 13:25:28 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 23 May 2007 13:25:28 -0700
Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm
In-Reply-To: <Pine.GSO.4.40.0705231447570.18218-100000@nu.cse.ohio-state.edu>
Message-ID: <000601c79d78$810037e0$ff0da8c0@amr.corp.intel.com>

>I am able to ping on both the interfaces. The ping messages go over both
>the interfaces. In fact, after pinging the interfaces from each other a few
>times I am able to connect properly over both the rails for some time.
>After a few minutes it falls back to just one interface.

Odd - I will see if I can reproduce this.

Are the HCAs sharing the same IB subnet?

>Hmm, I can try this, but only as a last resort. Ideally I would like to use
>just one listen cm_id bound to 0.0.0.0.

I was thinking about binding on the active side, before calling connect.  But I
still want to look into this more.
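
As an illustration of what binding on the active side involves, the sketch below builds the source and destination addresses with standard socket calls only. It is a sketch under stated assumptions, not a tested fix for the problem: the helper names are hypothetical, the addresses used with it would be placeholders for the elided 192.168.1.x and 192.168.3.x hosts, and a /24 netmask per rail is assumed from the address ranges given earlier.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill a sockaddr_in from a dotted-quad string.
 * Returns 0 on success, -1 if the string is not a valid IPv4 address. */
static int make_ipv4_addr(const char *ip, struct sockaddr_in *out)
{
	memset(out, 0, sizeof *out);
	out->sin_family = AF_INET;
	out->sin_port = 0;	/* let the stack choose the source port */
	return inet_pton(AF_INET, ip, &out->sin_addr) == 1 ? 0 : -1;
}

/* Hypothetical helper: true if both addresses share the same /24, i.e.
 * sit on the same rail under the assumed netmask. */
static int same_slash24(const struct sockaddr_in *a,
			const struct sockaddr_in *b)
{
	return (ntohl(a->sin_addr.s_addr) & 0xffffff00u) ==
	       (ntohl(b->sin_addr.s_addr) & 0xffffff00u);
}

/*
 * The chosen local address would then be passed as the src_addr
 * argument of librdmacm's
 *   rdma_resolve_addr(id, (struct sockaddr *)&src,
 *                     (struct sockaddr *)&dst, timeout_ms);
 * so that each cm_id is tied to the rail matching its destination.
 */
```

Whether this avoids the fallback to a single rail is exactly the open question in this thread; the sketch only shows how the source address would be selected and supplied.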

- Sean


From narravul at cse.ohio-state.edu  Wed May 23 13:30:33 2007
From: narravul at cse.ohio-state.edu (Sundeep Narravula)
Date: Wed, 23 May 2007 16:30:33 -0400 (EDT)
Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm
In-Reply-To: <000601c79d78$810037e0$ff0da8c0@amr.corp.intel.com>
Message-ID: <Pine.GSO.4.40.0705231629060.26771-100000@nu.cse.ohio-state.edu>

> Odd - I will see if I can reproduce this.
>
> Are the HCAs sharing the same IB subnet?

Yes. They are in the same IB subnet.

>
> >Hmm, I can try this, but only as a last resort. Ideally I would like to use
> >just one listen cm_id bound to 0.0.0.0.
>
> I was thinking about binding on the active side, before calling connect.  But I
> still want to look into this more.

I can try this one out.

  --Sundeep.

>
> - Sean
>


From pradeeps at linux.vnet.ibm.com  Wed May 23 14:12:19 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Wed, 23 May 2007 14:12:19 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <4654690F.1040305@linux.vnet.ibm.com>
References: <46537081.30906@linux.vnet.ibm.com>
	<4654690F.1040305@linux.vnet.ibm.com>
Message-ID: <4654AE33.20506@linux.vnet.ibm.com>

Roland,

Is it too late to get this into 2.6.22? If so, I will try for 2.6.23
-please let me know.

Pradeep

Pradeep Satyanarayana wrote:
> 
> If this proposal is acceptable, would you want me to generate a patch
> against Roland's for-2.6.22 git tree, or would for-2.6.23 tree be
> better?
> 
> Pradeep
> 
> Pradeep Satyanarayana wrote:
>> Here are my thoughts about limiting the memory footprint for IPOIB CM
>> (NOSRQ) patch:
>>
>> By default, cap the NOSRQ memory usage to 1GB. The default recvq_size
>> is set to 128. Therefore for 64KB packets this would imply a maximum of
>> 128 endpoints.
>>
>> -Make the maximum number of endpoints a module parameter with a default
>> value of 128.
>>
>> -The NOSRQ limit of 1GB is also made a module parameter. However, 1GB is
>> the default limit and could be changed as needed (by the administrator)
>> depending on the system configuration, application needs and so on. The
>> server would return a "REJ" message upon receiving a "REQ", whenever one
>> of these limits (i.e. max number of endpoints or the max NOSRQ memory
>> usage) is reached. Currently, we only check for the maximum number of
>> endpoints -hard coded to 1024.
>>
>> -The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that
>> support SRQ like the Topspin HCA and, such HCAs should not be
>> impacted at all.
>>
>> -Currently we allocate a default of 64KB for the ring buffer elements,
>> and this buffer size is not linked to the mtu. In the future, we could
>> allocate buffers based on the mtu and link that into the computation of
>> the memory cap. This way customers who might want to use a smaller mtu
>> could use a larger number of endpoints, or a larger recvq_size without
>> exceeding the memory cap.
>>
>>
>> Would this approach address the issues of scalability and enable IPOIB
>> CM to be turned on as the default?
>>
>>
>> Pradeep


From sean.hefty at intel.com  Wed May 23 14:14:40 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 23 May 2007 14:14:40 -0700
Subject: [ofa-general] [PATCH] for-2.6.23 ib/sa: use correct index for
	default pkey
In-Reply-To: <adafy5ocwqs.fsf@cisco.com>
Message-ID: <000701c79d7f$60460f00$ff0da8c0@amr.corp.intel.com>

MADs sent to the SA should use the index for the default pkey.  There's
no requirement that the default pkey be stored at index 0.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Patch requires the latest changes to the pkey cache.  This fix is not a
priority, but it appears to be the only issue in the stack with
supporting multiple partitions, which is a requirement for the labs.
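
The lookup order used in update_sm_ah() can be illustrated outside the kernel with a plain array standing in for a port's pkey table. This is a simplified sketch, not the kernel code: find_pkey() and default_pkey_index() are hypothetical helpers, find_pkey() does an exact 16-bit comparison (the kernel's ib_find_pkey may match differently), and the two constants are assumed to correspond to IB_DEFAULT_PKEY_FULL and IB_DEFAULT_PKEY_PARTIAL.

```c
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_PKEY_FULL	0xFFFF	/* default pkey, full membership */
#define DEFAULT_PKEY_PARTIAL	0x7FFF	/* default pkey, limited membership */

/* Hypothetical stand-in for a pkey table search: returns 0 and stores
 * the index on an exact match, nonzero if the pkey is not present. */
static int find_pkey(const uint16_t *table, size_t len, uint16_t pkey,
		     uint16_t *index)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (table[i] == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -1;
}

/* Mirrors the patch's order: fall back to index 0, try the full-member
 * default pkey first, then the partial-member one. Returns -1 only if
 * neither is present, in which case *index stays 0. */
static int default_pkey_index(const uint16_t *table, size_t len,
			      uint16_t *index)
{
	*index = 0;
	if (find_pkey(table, len, DEFAULT_PKEY_FULL, index) &&
	    find_pkey(table, len, DEFAULT_PKEY_PARTIAL, index))
		return -1;
	return 0;
}
```

A table such as { 0x8001, 0xFFFF } yields index 1 here, which is the case the patch addresses: the default pkey is present but not stored at index 0.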


 drivers/infiniband/core/sa_query.c |   85 +++++++++++++++++++++---------------
 include/rdma/ib_mad.h              |    3 +
 2 files changed, 53 insertions(+), 35 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 6469406..4791d01 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -56,6 +56,7 @@ MODULE_LICENSE("Dual BSD/GPL");
 struct ib_sa_sm_ah {
 	struct ib_ah        *ah;
 	struct kref          ref;
+	u16		     pkey_index;
 	u8		     src_path_mask;
 };
 
@@ -382,6 +383,13 @@ static void update_sm_ah(struct work_struct *work)
 	kref_init(&new_ah->ref);
 	new_ah->src_path_mask = (1 << port_attr.lmc) - 1;
 
+	new_ah->pkey_index = 0;
+	if (ib_find_pkey(port->agent->device, port->port_num,
+			 IB_DEFAULT_PKEY_FULL, &new_ah->pkey_index) &&
+	    ib_find_pkey(port->agent->device, port->port_num,
+			 IB_DEFAULT_PKEY_PARTIAL, &new_ah->pkey_index))
+		printk(KERN_ERR "Couldn't find index for default PKey\n");
+
 	memset(&ah_attr, 0, sizeof ah_attr);
 	ah_attr.dlid     = port_attr.sm_lid;
 	ah_attr.sl       = port_attr.sm_sl;
@@ -512,6 +520,35 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num,
 }
 EXPORT_SYMBOL(ib_init_ah_from_path);
 
+static int alloc_mad(struct ib_sa_query *query, gfp_t gfp_mask)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&query->port->ah_lock, flags);
+	kref_get(&query->port->sm_ah->ref);
+	query->sm_ah = query->port->sm_ah;
+	spin_unlock_irqrestore(&query->port->ah_lock, flags);
+
+	query->mad_buf = ib_create_send_mad(query->port->agent, 1,
+					    query->sm_ah->pkey_index,
+					    0, IB_MGMT_SA_HDR, IB_MGMT_SA_DATA,
+					    gfp_mask);
+	if (!query->mad_buf) {
+		kref_put(&query->sm_ah->ref, free_sm_ah);
+		return -ENOMEM;
+	}
+
+	query->mad_buf->ah = query->sm_ah->ah;
+
+	return 0;
+}
+
+static void free_mad(struct ib_sa_query *query)
+{
+	ib_free_send_mad(query->mad_buf);
+	kref_put(&query->sm_ah->ref, free_sm_ah);
+}
+
 static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
 {
 	unsigned long flags;
@@ -548,20 +585,11 @@ retry:
 	query->mad_buf->context[0] = query;
 	query->id = id;
 
-	spin_lock_irqsave(&query->port->ah_lock, flags);
-	kref_get(&query->port->sm_ah->ref);
-	query->sm_ah = query->port->sm_ah;
-	spin_unlock_irqrestore(&query->port->ah_lock, flags);
-
-	query->mad_buf->ah = query->sm_ah->ah;
-
 	ret = ib_post_send_mad(query->mad_buf, NULL);
 	if (ret) {
 		spin_lock_irqsave(&idr_lock, flags);
 		idr_remove(&query_idr, id);
 		spin_unlock_irqrestore(&idr_lock, flags);
-
-		kref_put(&query->sm_ah->ref, free_sm_ah);
 	}
 
 	/*
@@ -647,13 +675,10 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
 	if (!query)
 		return -ENOMEM;
 
-	query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0,
-						     0, IB_MGMT_SA_HDR,
-						     IB_MGMT_SA_DATA, gfp_mask);
-	if (!query->sa_query.mad_buf) {
-		ret = -ENOMEM;
+	query->sa_query.port     = port;
+	ret = alloc_mad(&query->sa_query, gfp_mask);
+	if (ret)
 		goto err1;
-	}
 
 	ib_sa_client_get(client);
 	query->sa_query.client = client;
@@ -665,7 +690,6 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
 
 	query->sa_query.callback = callback ? ib_sa_path_rec_callback : NULL;
 	query->sa_query.release  = ib_sa_path_rec_release;
-	query->sa_query.port     = port;
 	mad->mad_hdr.method	 = IB_MGMT_METHOD_GET;
 	mad->mad_hdr.attr_id	 = cpu_to_be16(IB_SA_ATTR_PATH_REC);
 	mad->sa_hdr.comp_mask	 = comp_mask;
@@ -683,7 +707,7 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
 err2:
 	*sa_query = NULL;
 	ib_sa_client_put(query->sa_query.client);
-	ib_free_send_mad(query->sa_query.mad_buf);
+	free_mad(&query->sa_query);
 
 err1:
 	kfree(query);
@@ -773,13 +797,10 @@ int ib_sa_service_rec_query(struct ib_sa_client *client,
 	if (!query)
 		return -ENOMEM;
 
-	query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0,
-						     0, IB_MGMT_SA_HDR,
-						     IB_MGMT_SA_DATA, gfp_mask);
-	if (!query->sa_query.mad_buf) {
-		ret = -ENOMEM;
+	query->sa_query.port     = port;
+	ret = alloc_mad(&query->sa_query, gfp_mask);
+	if (ret)
 		goto err1;
-	}
 
 	ib_sa_client_get(client);
 	query->sa_query.client = client;
@@ -791,7 +812,6 @@ int ib_sa_service_rec_query(struct ib_sa_client *client,
 
 	query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL;
 	query->sa_query.release  = ib_sa_service_rec_release;
-	query->sa_query.port     = port;
 	mad->mad_hdr.method	 = method;
 	mad->mad_hdr.attr_id	 = cpu_to_be16(IB_SA_ATTR_SERVICE_REC);
 	mad->sa_hdr.comp_mask	 = comp_mask;
@@ -810,7 +830,7 @@ int ib_sa_service_rec_query(struct ib_sa_client *client,
 err2:
 	*sa_query = NULL;
 	ib_sa_client_put(query->sa_query.client);
-	ib_free_send_mad(query->sa_query.mad_buf);
+	free_mad(&query->sa_query);
 
 err1:
 	kfree(query);
@@ -869,13 +889,10 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
 	if (!query)
 		return -ENOMEM;
 
-	query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0,
-						     0, IB_MGMT_SA_HDR,
-						     IB_MGMT_SA_DATA, gfp_mask);
-	if (!query->sa_query.mad_buf) {
-		ret = -ENOMEM;
+	query->sa_query.port     = port;
+	ret = alloc_mad(&query->sa_query, gfp_mask);
+	if (ret)
 		goto err1;
-	}
 
 	ib_sa_client_get(client);
 	query->sa_query.client = client;
@@ -887,7 +904,6 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
 
 	query->sa_query.callback = callback ? ib_sa_mcmember_rec_callback : NULL;
 	query->sa_query.release  = ib_sa_mcmember_rec_release;
-	query->sa_query.port     = port;
 	mad->mad_hdr.method	 = method;
 	mad->mad_hdr.attr_id	 = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC);
 	mad->sa_hdr.comp_mask	 = comp_mask;
@@ -906,7 +922,7 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
 err2:
 	*sa_query = NULL;
 	ib_sa_client_put(query->sa_query.client);
-	ib_free_send_mad(query->sa_query.mad_buf);
+	free_mad(&query->sa_query);
 
 err1:
 	kfree(query);
@@ -939,8 +955,7 @@ static void send_handler(struct ib_mad_agent *agent,
 	idr_remove(&query_idr, query->id);
 	spin_unlock_irqrestore(&idr_lock, flags);
 
-	ib_free_send_mad(mad_send_wc->send_buf);
-	kref_put(&query->sm_ah->ref, free_sm_ah);
+	free_mad(query);
 	ib_sa_client_put(query->client);
 	query->release(query);
 }
diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
index 739fa4d..30712dd 100644
--- a/include/rdma/ib_mad.h
+++ b/include/rdma/ib_mad.h
@@ -111,6 +111,9 @@
 #define IB_QP1_QKEY	0x80010000
 #define IB_QP_SET_QKEY	0x80000000
 
+#define IB_DEFAULT_PKEY_PARTIAL 0x7FFF
+#define IB_DEFAULT_PKEY_FULL	0xFFFF
+
 enum {
 	IB_MGMT_MAD_HDR = 24,
 	IB_MGMT_MAD_DATA = 232,


From swise at opengridcomputing.com  Wed May 23 14:25:31 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 23 May 2007 14:25:31 -0700
Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm
In-Reply-To: <Pine.GSO.4.40.0705231629060.26771-100000@nu.cse.ohio-state.edu>
References: <Pine.GSO.4.40.0705231629060.26771-100000@nu.cse.ohio-state.edu>
Message-ID: <4654B14B.4000208@opengridcomputing.com>

Guys, this reminds me of an issue we have with rnics and regular nics on 
the same physical network.  By default linux responds to arp queries on 
all ports it receives the query on.  This leads to very bad results when 
you're trying to do offloaded connections.  When resolving the 
address/route, the rdma client can end up getting the mac address of the 
dumb nic instead of the rnic.   I don't know if route resolution in the 
ib cm has this issue, but it might since they use ipoib for some part of 
the resolution, no?

You might try this (snippet from the cxgb3 release notes file to be included 
in -rc4):

2) If you have a multi-homed host and the physical ethernet networks are
bridged, then you need to configure arp to only send replies on the
interface with the target ip address:

        sysctl -w net.ipv4.conf.all.arp_ignore=2
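
To confirm the setting took effect, a program can read the value back from
procfs.  The following is only a sketch: the helper names are made up for
illustration, and it assumes the usual mapping of the sysctl name to
/proc/sys/net/ipv4/conf/all/arp_ignore.

```c
#include <stdio.h>

/*
 * Read a single integer from a sysctl-style file.  Returns 0 on success,
 * -1 if the file cannot be opened or does not start with an integer.
 * (Hypothetical helper, not part of any library.)
 */
static int read_sysctl_int(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * arp_ignore values 1 and 2 both restrict replies to target addresses
 * configured on the interface that received the query; 2 additionally
 * requires the sender's address to be in the same subnet.
 */
static int arp_reply_limited_to_incoming_if(int arp_ignore)
{
	return arp_ignore == 1 || arp_ignore == 2;
}
```

A caller would pass "/proc/sys/net/ipv4/conf/all/arp_ignore" as the path and
feed the result to arp_reply_limited_to_incoming_if().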

Steve.

Sundeep Narravula wrote:
>> Odd - I will see if I can reproduce this.
>>
>> Are the HCAs sharing the same IB subnet?
>>     
>
> Yes. They are in the same IB subnet.
>
>   
>>> hmm.. I can try this but as a last resort. Ideally I would like to use
>>> just one listen cm_id bound to 0.0.0.0.
>>>       
>> I was thinking about binding on the active side, before calling connect.  But I
>> still want to look into this more.
>>     
>
> I can try this one out.
>
>   --Sundeep.
>
>   
>> - Sean
>>
>>     
>


From rdreier at cisco.com  Wed May 23 15:24:37 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 23 May 2007 15:24:37 -0700
Subject: [ofa-general] [PATCH] IB/mlx4: Don't allocate RQ doorbell if using
	SRQ
Message-ID: <aday7jf89ga.fsf@cisco.com>

Mellanox people, does this look good to you?


If a QP is attached to a shared receive queue (SRQ), then it doesn't
have a receive queue (RQ).  So don't allocate an RQ doorbell (or map a
doorbell from userspace for userspace QPs) for that QP.
    
Signed-off-by: Roland Dreier <rolandd at cisco.com>
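
The rule the patch enforces is that every path touching the RQ doorbell
(allocate, reset, free) must be guarded by the same "no SRQ" test.  A toy
userspace model of that invariant, with made-up names (toy_qp and friends)
and malloc() standing in for the doorbell allocator, might look like this:

```c
#include <stdbool.h>
#include <stdlib.h>

struct toy_qp {
	bool      uses_srq;	/* true when attached to a shared receive queue */
	unsigned *rq_db;	/* RQ doorbell record; stays NULL when uses_srq */
};

/* Allocate the doorbell only for QPs that own a receive queue. */
static int toy_qp_create(struct toy_qp *qp, bool uses_srq)
{
	qp->uses_srq = uses_srq;
	qp->rq_db = NULL;
	if (!uses_srq) {
		qp->rq_db = malloc(sizeof(*qp->rq_db));
		if (!qp->rq_db)
			return -1;
		*qp->rq_db = 0;
	}
	return 0;
}

/* Resetting must not dereference a doorbell that was never allocated. */
static void toy_qp_reset(struct toy_qp *qp)
{
	if (!qp->uses_srq)
		*qp->rq_db = 0;
}

/* Free with the same guard that was used for allocation. */
static void toy_qp_destroy(struct toy_qp *qp)
{
	if (!qp->uses_srq)
		free(qp->rq_db);
	qp->rq_db = NULL;
}
```

Missing the guard in any one of the three places is what would turn the
saved allocation into a NULL dereference or a bogus free.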

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index a824bc5..88a994d 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -319,20 +319,24 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 		if (err)
 			goto err_mtt;
 
-		err = mlx4_ib_db_map_user(to_mucontext(pd->uobject->context),
-					  ucmd.db_addr, &qp->db);
-		if (err)
-			goto err_mtt;
+		if (!init_attr->srq) {
+			err = mlx4_ib_db_map_user(to_mucontext(pd->uobject->context),
+						  ucmd.db_addr, &qp->db);
+			if (err)
+				goto err_mtt;
+		}
 	} else {
 		err = set_kernel_sq_size(dev, &init_attr->cap, init_attr->qp_type, qp);
 		if (err)
 			goto err;
 
-		err = mlx4_ib_db_alloc(dev, &qp->db, 0);
-		if (err)
-			goto err;
+		if (!init_attr->srq) {
+			err = mlx4_ib_db_alloc(dev, &qp->db, 0);
+			if (err)
+				goto err;
 
-		*qp->db.db = 0;
+			*qp->db.db = 0;
+		}
 
 		if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf)) {
 			err = -ENOMEM;
@@ -386,7 +390,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 	return 0;
 
 err_wrid:
-	if (pd->uobject)
+	if (pd->uobject && !init_attr->srq)
 		mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), &qp->db);
 	else {
 		kfree(qp->sq.wrid);
@@ -403,7 +407,7 @@ err_buf:
 		mlx4_buf_free(dev->dev, qp->buf_size, &qp->buf);
 
 err_db:
-	if (!pd->uobject)
+	if (!pd->uobject && !init_attr->srq)
 		mlx4_ib_db_free(dev, &qp->db);
 
 err:
@@ -481,14 +485,16 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
 	mlx4_mtt_cleanup(dev->dev, &qp->mtt);
 
 	if (is_user) {
-		mlx4_ib_db_unmap_user(to_mucontext(qp->ibqp.uobject->context),
-				      &qp->db);
+		if (!qp->ibqp.srq)
+			mlx4_ib_db_unmap_user(to_mucontext(qp->ibqp.uobject->context),
+					      &qp->db);
 		ib_umem_release(qp->umem);
 	} else {
 		kfree(qp->sq.wrid);
 		kfree(qp->rq.wrid);
 		mlx4_buf_free(dev->dev, qp->buf_size, &qp->buf);
-		mlx4_ib_db_free(dev, &qp->db);
+		if (!qp->ibqp.srq)
+			mlx4_ib_db_free(dev, &qp->db);
 	}
 }
 
@@ -852,7 +858,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 	if (ibqp->srq)
 		context->srqn = cpu_to_be32(1 << 24 | to_msrq(ibqp->srq)->msrq.srqn);
 
-	if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT)
+	if (!ibqp->srq && cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT)
 		context->db_rec_addr = cpu_to_be64(qp->db.dma);
 
 	if (cur_state == IB_QPS_INIT &&
@@ -919,7 +925,8 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 		qp->rq.tail = 0;
 		qp->sq.head = 0;
 		qp->sq.tail = 0;
-		*qp->db.db  = 0;
+		if (!ibqp->srq)
+			*qp->db.db  = 0;
 	}
 
 out:


From mshefty at ichips.intel.com  Wed May 23 15:38:48 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 23 May 2007 15:38:48 -0700
Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm
In-Reply-To: <4654B14B.4000208@opengridcomputing.com>
References: <Pine.GSO.4.40.0705231629060.26771-100000@nu.cse.ohio-state.edu>
	<4654B14B.4000208@opengridcomputing.com>
Message-ID: <4654C278.6030703@ichips.intel.com>

Steve Wise wrote:
> Guys, this reminds me of an issue we have with rnics and regular nics on 
> the same physical network.  By default linux responds to arp queries on 
> all ports it receives the query on.  This leads to very bad results when 
> you're trying to do offloaded connections.  When resolving the 
> address/route, the rdma client can end up getting the mac address of the 
> dumb nic instead of the rnic.   I don't know if route resolution in the 
> ib cm has this issue, but it might since they use ipoib for some part of 
> the resolution, no?

I think this could be the problem.  (And it could have taken me a long time 
to figure out if it is, so thanks!)

> 2) If you have a multi-homed host and the physical ethernet networks are
> bridged, then you need to configure arp to only send replies on the
> interface with the target ip address:
> 
>        sysctl -w net.ipv4.conf.all.arp_ignore=2

Sundeep, can you try this and see if it fixes the problem for you?

- Sean


From xma at us.ibm.com  Wed May 23 16:28:57 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 23 May 2007 16:28:57 -0700
Subject: [ofa-general] IPoIB NETIF_F_SG flag for GSO
In-Reply-To: <20070523195700.GD6019@mellanox.co.il>
Message-ID: <OFEAC48BFD.D0D5C2B3-ON872572E4.008069AC-882572E5.00028DE8@us.ibm.com>


Hello Roland, Michael,

      I tried GSO for IPoIB last year, but I didn't see much BW improvement
for UD, only some decrease in CPU utilization. I looked at the GSO patch
carefully and found that there are additional skb copies during skb
segmentation. If the device supports SG, then the copies can be avoided.
IPoIB does support SG, so I am planning to enable it and test GSO again.
Have you tried this before? Can you think of any possible issues?

Thanks
Shirley Ma

From sean.hefty at intel.com  Wed May 23 17:00:35 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 23 May 2007 17:00:35 -0700
Subject: [ofa-general] [PATCH] ib/cm: optimize locking
In-Reply-To: <46531C7A.3060201@ichips.intel.com>
Message-ID: <003101c79d96$8ea19d80$5bd4180a@amr.corp.intel.com>

The ib_cm is a little overzealous about using spin_lock_irqsave,
when spin_lock_irq would do.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
This patch applies on top of "ib/cm: fix stale connection detection".
It has only been lightly tested using the librdmacm.  Additional testing
with ipoib cm is still needed.  (I will try to get to that tomorrow.)

I will request that this be pulled for 2.6.23 if there are no objections.
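
For anyone reviewing the conversion, the difference between the two lock
flavours can be modelled in a few lines of userspace C.  This is a toy
sketch only: toy_cpu and the toy_* helpers are invented names, and a bool
stands in for the CPU interrupt-enable flag.  No real locking is done.

```c
#include <stdbool.h>

struct toy_cpu {
	bool irqs_enabled;	/* stand-in for the CPU interrupt-enable flag */
};

/* Model of spin_lock_irqsave(): remember the old state, then disable. */
static void toy_lock_irqsave(struct toy_cpu *cpu, bool *flags)
{
	*flags = cpu->irqs_enabled;
	cpu->irqs_enabled = false;
}

/* Model of spin_unlock_irqrestore(): put back whatever was saved. */
static void toy_unlock_irqrestore(struct toy_cpu *cpu, bool flags)
{
	cpu->irqs_enabled = flags;
}

/* Model of spin_lock_irq(): disable without saving anything. */
static void toy_lock_irq(struct toy_cpu *cpu)
{
	cpu->irqs_enabled = false;
}

/* Model of spin_unlock_irq(): unconditionally re-enable. */
static void toy_unlock_irq(struct toy_cpu *cpu)
{
	cpu->irqs_enabled = true;
}
```

The _irq pair is cheaper because nothing is saved, but it leaves interrupts
enabled on unlock no matter what the caller had before.  The conversion is
therefore only correct for functions that are always entered with interrupts
enabled, which is what needs checking for each hunk below.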

 drivers/infiniband/core/cm.c |  171 ++++++++++++++++++------------------------
 1 files changed, 75 insertions(+), 96 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 40c004a..16181d6 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -318,12 +318,10 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 
 static void cm_free_id(__be32 local_id)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&cm.lock, flags);
+	spin_lock_irq(&cm.lock);
 	idr_remove(&cm.local_id_table,
 		   (__force int) (local_id ^ cm.random_id_operand));
-	spin_unlock_irqrestore(&cm.lock, flags);
+	spin_unlock_irq(&cm.lock);
 }
 
 static struct cm_id_private * cm_get_id(__be32 local_id, __be32 remote_id)
@@ -345,11 +343,10 @@ static struct cm_id_private * cm_get_id(__be32 local_id, __be32 remote_id)
 static struct cm_id_private * cm_acquire_id(__be32 local_id, __be32 remote_id)
 {
 	struct cm_id_private *cm_id_priv;
-	unsigned long flags;
 
-	spin_lock_irqsave(&cm.lock, flags);
+	spin_lock_irq(&cm.lock);
 	cm_id_priv = cm_get_id(local_id, remote_id);
-	spin_unlock_irqrestore(&cm.lock, flags);
+	spin_unlock_irq(&cm.lock);
 
 	return cm_id_priv;
 }
@@ -713,31 +710,30 @@ static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
 {
 	struct cm_id_private *cm_id_priv;
 	struct cm_work *work;
-	unsigned long flags;
 
 	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
 retest:
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	switch (cm_id->state) {
 	case IB_CM_LISTEN:
 		cm_id->state = IB_CM_IDLE;
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
-		spin_lock_irqsave(&cm.lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
+		spin_lock_irq(&cm.lock);
 		rb_erase(&cm_id_priv->service_node, &cm.listen_service_table);
-		spin_unlock_irqrestore(&cm.lock, flags);
+		spin_unlock_irq(&cm.lock);
 		break;
 	case IB_CM_SIDR_REQ_SENT:
 		cm_id->state = IB_CM_IDLE;
 		ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg);
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		break;
 	case IB_CM_SIDR_REQ_RCVD:
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		cm_reject_sidr_req(cm_id_priv, IB_SIDR_REJECT);
 		break;
 	case IB_CM_REQ_SENT:
 		ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg);
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		ib_send_cm_rej(cm_id, IB_CM_REJ_TIMEOUT,
 			       &cm_id_priv->id.device->node_guid,
 			       sizeof cm_id_priv->id.device->node_guid,
@@ -747,9 +743,9 @@ retest:
 		if (err == -ENOMEM) {
 			/* Do not reject to allow future retries. */
 			cm_reset_to_idle(cm_id_priv);
-			spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+			spin_unlock_irq(&cm_id_priv->lock);
 		} else {
-			spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+			spin_unlock_irq(&cm_id_priv->lock);
 			ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED,
 				       NULL, 0, NULL, 0);
 		}
@@ -762,25 +758,25 @@ retest:
 	case IB_CM_MRA_REQ_SENT:
 	case IB_CM_REP_RCVD:
 	case IB_CM_MRA_REP_SENT:
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED,
 			       NULL, 0, NULL, 0);
 		break;
 	case IB_CM_ESTABLISHED:
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		ib_send_cm_dreq(cm_id, NULL, 0);
 		goto retest;
 	case IB_CM_DREQ_SENT:
 		ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg);
 		cm_enter_timewait(cm_id_priv);
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		break;
 	case IB_CM_DREQ_RCVD:
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		ib_send_cm_drep(cm_id, NULL, 0);
 		break;
 	default:
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		break;
 	}
 
@@ -1169,7 +1165,6 @@ static void cm_format_req_event(struct cm_work *work,
 static void cm_process_work(struct cm_id_private *cm_id_priv,
 			    struct cm_work *work)
 {
-	unsigned long flags;
 	int ret;
 
 	/* We will typically only have the current event to report. */
@@ -1177,9 +1172,9 @@ static void cm_process_work(struct cm_id_private *cm_id_priv,
 	cm_free_work(work);
 
 	while (!ret && !atomic_add_negative(-1, &cm_id_priv->work_count)) {
-		spin_lock_irqsave(&cm_id_priv->lock, flags);
+		spin_lock_irq(&cm_id_priv->lock);
 		work = cm_dequeue_work(cm_id_priv);
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		BUG_ON(!work);
 		ret = cm_id_priv->id.cm_handler(&cm_id_priv->id,
 						&work->cm_event);
@@ -1250,7 +1245,6 @@ static void cm_dup_req_handler(struct cm_work *work,
 			       struct cm_id_private *cm_id_priv)
 {
 	struct ib_mad_send_buf *msg = NULL;
-	unsigned long flags;
 	int ret;
 
 	/* Quick state check to discard duplicate REQs. */
@@ -1261,7 +1255,7 @@ static void cm_dup_req_handler(struct cm_work *work,
 	if (ret)
 		return;
 
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	switch (cm_id_priv->id.state) {
 	case IB_CM_MRA_REQ_SENT:
 		cm_format_mra((struct cm_mra_msg *) msg->mad, cm_id_priv,
@@ -1276,14 +1270,14 @@ static void cm_dup_req_handler(struct cm_work *work,
 	default:
 		goto unlock;
 	}
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 
 	ret = ib_post_send_mad(msg, NULL);
 	if (ret)
 		goto free;
 	return;
 
-unlock:	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+unlock:	spin_unlock_irq(&cm_id_priv->lock);
 free:	cm_free_msg(msg);
 }
 
@@ -1293,17 +1287,16 @@ static struct cm_id_private * cm_match_req(struct cm_work *work,
 	struct cm_id_private *listen_cm_id_priv, *cur_cm_id_priv;
 	struct cm_timewait_info *timewait_info;
 	struct cm_req_msg *req_msg;
-	unsigned long flags;
 
 	req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
 
 	/* Check for possible duplicate REQ. */
-	spin_lock_irqsave(&cm.lock, flags);
+	spin_lock_irq(&cm.lock);
 	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
 	if (timewait_info) {
 		cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
 					   timewait_info->work.remote_id);
-		spin_unlock_irqrestore(&cm.lock, flags);
+		spin_unlock_irq(&cm.lock);
 		if (cur_cm_id_priv) {
 			cm_dup_req_handler(work, cur_cm_id_priv);
 			cm_deref_id(cur_cm_id_priv);
@@ -1315,7 +1308,7 @@ static struct cm_id_private * cm_match_req(struct cm_work *work,
 	timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
 	if (timewait_info) {
 		cm_cleanup_timewait(cm_id_priv->timewait_info);
-		spin_unlock_irqrestore(&cm.lock, flags);
+		spin_unlock_irq(&cm.lock);
 		cm_issue_rej(work->port, work->mad_recv_wc,
 			     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
 			     NULL, 0);
@@ -1328,7 +1321,7 @@ static struct cm_id_private * cm_match_req(struct cm_work *work,
 					   req_msg->private_data);
 	if (!listen_cm_id_priv) {
 		cm_cleanup_timewait(cm_id_priv->timewait_info);
-		spin_unlock_irqrestore(&cm.lock, flags);
+		spin_unlock_irq(&cm.lock);
 		cm_issue_rej(work->port, work->mad_recv_wc,
 			     IB_CM_REJ_INVALID_SERVICE_ID, CM_MSG_RESPONSE_REQ,
 			     NULL, 0);
@@ -1338,7 +1331,7 @@ static struct cm_id_private * cm_match_req(struct cm_work *work,
 	atomic_inc(&cm_id_priv->refcount);
 	cm_id_priv->id.state = IB_CM_REQ_RCVD;
 	atomic_inc(&cm_id_priv->work_count);
-	spin_unlock_irqrestore(&cm.lock, flags);
+	spin_unlock_irq(&cm.lock);
 out:
 	return listen_cm_id_priv;
 }
@@ -1591,7 +1584,6 @@ static void cm_dup_rep_handler(struct cm_work *work)
 	struct cm_id_private *cm_id_priv;
 	struct cm_rep_msg *rep_msg;
 	struct ib_mad_send_buf *msg = NULL;
-	unsigned long flags;
 	int ret;
 
 	rep_msg = (struct cm_rep_msg *) work->mad_recv_wc->recv_buf.mad;
@@ -1604,7 +1596,7 @@ static void cm_dup_rep_handler(struct cm_work *work)
 	if (ret)
 		goto deref;
 
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	if (cm_id_priv->id.state == IB_CM_ESTABLISHED)
 		cm_format_rtu((struct cm_rtu_msg *) msg->mad, cm_id_priv,
 			      cm_id_priv->private_data,
@@ -1616,14 +1608,14 @@ static void cm_dup_rep_handler(struct cm_work *work)
 			      cm_id_priv->private_data_len);
 	else
 		goto unlock;
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 
 	ret = ib_post_send_mad(msg, NULL);
 	if (ret)
 		goto free;
 	goto deref;
 
-unlock:	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+unlock:	spin_unlock_irq(&cm_id_priv->lock);
 free:	cm_free_msg(msg);
 deref:	cm_deref_id(cm_id_priv);
 }
@@ -1632,7 +1624,6 @@ static int cm_rep_handler(struct cm_work *work)
 {
 	struct cm_id_private *cm_id_priv;
 	struct cm_rep_msg *rep_msg;
-	unsigned long flags;
 	int ret;
 
 	rep_msg = (struct cm_rep_msg *)work->mad_recv_wc->recv_buf.mad;
@@ -1644,13 +1635,13 @@ static int cm_rep_handler(struct cm_work *work)
 
 	cm_format_rep_event(work);
 
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	switch (cm_id_priv->id.state) {
 	case IB_CM_REQ_SENT:
 	case IB_CM_MRA_REQ_RCVD:
 		break;
 	default:
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		ret = -EINVAL;
 		goto error;
 	}
@@ -1663,7 +1654,7 @@ static int cm_rep_handler(struct cm_work *work)
 	/* Check for duplicate REP. */
 	if (cm_insert_remote_id(cm_id_priv->timewait_info)) {
 		spin_unlock(&cm.lock);
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		ret = -EINVAL;
 		goto error;
 	}
@@ -1673,7 +1664,7 @@ static int cm_rep_handler(struct cm_work *work)
 			 &cm.remote_id_table);
 		cm_id_priv->timewait_info->inserted_remote_id = 0;
 		spin_unlock(&cm.lock);
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		cm_issue_rej(work->port, work->mad_recv_wc,
 			     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REP,
 			     NULL, 0);
@@ -1696,7 +1687,7 @@ static int cm_rep_handler(struct cm_work *work)
 	ret = atomic_inc_and_test(&cm_id_priv->work_count);
 	if (!ret)
 		list_add_tail(&work->list, &cm_id_priv->work_list);
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 
 	if (ret)
 		cm_process_work(cm_id_priv, work);
@@ -1712,7 +1703,6 @@ error:
 static int cm_establish_handler(struct cm_work *work)
 {
 	struct cm_id_private *cm_id_priv;
-	unsigned long flags;
 	int ret;
 
 	/* See comment in cm_establish about lookup. */
@@ -1720,9 +1710,9 @@ static int cm_establish_handler(struct cm_work *work)
 	if (!cm_id_priv)
 		return -EINVAL;
 
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	if (cm_id_priv->id.state != IB_CM_ESTABLISHED) {
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		goto out;
 	}
 
@@ -1730,7 +1720,7 @@ static int cm_establish_handler(struct cm_work *work)
 	ret = atomic_inc_and_test(&cm_id_priv->work_count);
 	if (!ret)
 		list_add_tail(&work->list, &cm_id_priv->work_list);
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 
 	if (ret)
 		cm_process_work(cm_id_priv, work);
@@ -1746,7 +1736,6 @@ static int cm_rtu_handler(struct cm_work *work)
 {
 	struct cm_id_private *cm_id_priv;
 	struct cm_rtu_msg *rtu_msg;
-	unsigned long flags;
 	int ret;
 
 	rtu_msg = (struct cm_rtu_msg *)work->mad_recv_wc->recv_buf.mad;
@@ -1757,10 +1746,10 @@ static int cm_rtu_handler(struct cm_work *work)
 
 	work->cm_event.private_data = &rtu_msg->private_data;
 
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	if (cm_id_priv->id.state != IB_CM_REP_SENT &&
 	    cm_id_priv->id.state != IB_CM_MRA_REP_RCVD) {
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		goto out;
 	}
 	cm_id_priv->id.state = IB_CM_ESTABLISHED;
@@ -1769,7 +1758,7 @@ static int cm_rtu_handler(struct cm_work *work)
 	ret = atomic_inc_and_test(&cm_id_priv->work_count);
 	if (!ret)
 		list_add_tail(&work->list, &cm_id_priv->work_list);
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 
 	if (ret)
 		cm_process_work(cm_id_priv, work);
@@ -1932,7 +1921,6 @@ static int cm_dreq_handler(struct cm_work *work)
 	struct cm_id_private *cm_id_priv;
 	struct cm_dreq_msg *dreq_msg;
 	struct ib_mad_send_buf *msg = NULL;
-	unsigned long flags;
 	int ret;
 
 	dreq_msg = (struct cm_dreq_msg *)work->mad_recv_wc->recv_buf.mad;
@@ -1945,7 +1933,7 @@ static int cm_dreq_handler(struct cm_work *work)
 
 	work->cm_event.private_data = &dreq_msg->private_data;
 
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	if (cm_id_priv->local_qpn != cm_dreq_get_remote_qpn(dreq_msg))
 		goto unlock;
 
@@ -1964,7 +1952,7 @@ static int cm_dreq_handler(struct cm_work *work)
 		cm_format_drep((struct cm_drep_msg *) msg->mad, cm_id_priv,
 			       cm_id_priv->private_data,
 			       cm_id_priv->private_data_len);
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 
 		if (ib_post_send_mad(msg, NULL))
 			cm_free_msg(msg);
@@ -1977,7 +1965,7 @@ static int cm_dreq_handler(struct cm_work *work)
 	ret = atomic_inc_and_test(&cm_id_priv->work_count);
 	if (!ret)
 		list_add_tail(&work->list, &cm_id_priv->work_list);
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 
 	if (ret)
 		cm_process_work(cm_id_priv, work);
@@ -1985,7 +1973,7 @@ static int cm_dreq_handler(struct cm_work *work)
 		cm_deref_id(cm_id_priv);
 	return 0;
 
-unlock:	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+unlock:	spin_unlock_irq(&cm_id_priv->lock);
 deref:	cm_deref_id(cm_id_priv);
 	return -EINVAL;
 }
@@ -1994,7 +1982,6 @@ static int cm_drep_handler(struct cm_work *work)
 {
 	struct cm_id_private *cm_id_priv;
 	struct cm_drep_msg *drep_msg;
-	unsigned long flags;
 	int ret;
 
 	drep_msg = (struct cm_drep_msg *)work->mad_recv_wc->recv_buf.mad;
@@ -2005,10 +1992,10 @@ static int cm_drep_handler(struct cm_work *work)
 
 	work->cm_event.private_data = &drep_msg->private_data;
 
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	if (cm_id_priv->id.state != IB_CM_DREQ_SENT &&
 	    cm_id_priv->id.state != IB_CM_DREQ_RCVD) {
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		goto out;
 	}
 	cm_enter_timewait(cm_id_priv);
@@ -2017,7 +2004,7 @@ static int cm_drep_handler(struct cm_work *work)
 	ret = atomic_inc_and_test(&cm_id_priv->work_count);
 	if (!ret)
 		list_add_tail(&work->list, &cm_id_priv->work_list);
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 
 	if (ret)
 		cm_process_work(cm_id_priv, work);
@@ -2107,17 +2094,16 @@ static struct cm_id_private * cm_acquire_rejected_id(struct cm_rej_msg *rej_msg)
 {
 	struct cm_timewait_info *timewait_info;
 	struct cm_id_private *cm_id_priv;
-	unsigned long flags;
 	__be32 remote_id;
 
 	remote_id = rej_msg->local_comm_id;
 
 	if (__be16_to_cpu(rej_msg->reason) == IB_CM_REJ_TIMEOUT) {
-		spin_lock_irqsave(&cm.lock, flags);
+		spin_lock_irq(&cm.lock);
 		timewait_info = cm_find_remote_id( *((__be64 *) rej_msg->ari),
 						  remote_id);
 		if (!timewait_info) {
-			spin_unlock_irqrestore(&cm.lock, flags);
+			spin_unlock_irq(&cm.lock);
 			return NULL;
 		}
 		cm_id_priv = idr_find(&cm.local_id_table, (__force int)
@@ -2129,7 +2115,7 @@ static struct cm_id_private * cm_acquire_rejected_id(struct cm_rej_msg *rej_msg)
 			else
 				cm_id_priv = NULL;
 		}
-		spin_unlock_irqrestore(&cm.lock, flags);
+		spin_unlock_irq(&cm.lock);
 	} else if (cm_rej_get_msg_rejected(rej_msg) == CM_MSG_RESPONSE_REQ)
 		cm_id_priv = cm_acquire_id(rej_msg->remote_comm_id, 0);
 	else
@@ -2142,7 +2128,6 @@ static int cm_rej_handler(struct cm_work *work)
 {
 	struct cm_id_private *cm_id_priv;
 	struct cm_rej_msg *rej_msg;
-	unsigned long flags;
 	int ret;
 
 	rej_msg = (struct cm_rej_msg *)work->mad_recv_wc->recv_buf.mad;
@@ -2152,7 +2137,7 @@ static int cm_rej_handler(struct cm_work *work)
 
 	cm_format_rej_event(work);
 
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	switch (cm_id_priv->id.state) {
 	case IB_CM_REQ_SENT:
 	case IB_CM_MRA_REQ_RCVD:
@@ -2176,7 +2161,7 @@ static int cm_rej_handler(struct cm_work *work)
 		cm_enter_timewait(cm_id_priv);
 		break;
 	default:
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		ret = -EINVAL;
 		goto out;
 	}
@@ -2184,7 +2169,7 @@ static int cm_rej_handler(struct cm_work *work)
 	ret = atomic_inc_and_test(&cm_id_priv->work_count);
 	if (!ret)
 		list_add_tail(&work->list, &cm_id_priv->work_list);
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 
 	if (ret)
 		cm_process_work(cm_id_priv, work);
@@ -2295,7 +2280,6 @@ static int cm_mra_handler(struct cm_work *work)
 {
 	struct cm_id_private *cm_id_priv;
 	struct cm_mra_msg *mra_msg;
-	unsigned long flags;
 	int timeout, ret;
 
 	mra_msg = (struct cm_mra_msg *)work->mad_recv_wc->recv_buf.mad;
@@ -2309,7 +2293,7 @@ static int cm_mra_handler(struct cm_work *work)
 	timeout = cm_convert_to_ms(cm_mra_get_service_timeout(mra_msg)) +
 		  cm_convert_to_ms(cm_id_priv->av.packet_life_time);
 
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	switch (cm_id_priv->id.state) {
 	case IB_CM_REQ_SENT:
 		if (cm_mra_get_msg_mraed(mra_msg) != CM_MSG_RESPONSE_REQ ||
@@ -2342,7 +2326,7 @@ static int cm_mra_handler(struct cm_work *work)
 	ret = atomic_inc_and_test(&cm_id_priv->work_count);
 	if (!ret)
 		list_add_tail(&work->list, &cm_id_priv->work_list);
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 
 	if (ret)
 		cm_process_work(cm_id_priv, work);
@@ -2350,7 +2334,7 @@ static int cm_mra_handler(struct cm_work *work)
 		cm_deref_id(cm_id_priv);
 	return 0;
 out:
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 	cm_deref_id(cm_id_priv);
 	return -EINVAL;
 }
@@ -2465,7 +2449,6 @@ static int cm_lap_handler(struct cm_work *work)
 	struct cm_lap_msg *lap_msg;
 	struct ib_cm_lap_event_param *param;
 	struct ib_mad_send_buf *msg = NULL;
-	unsigned long flags;
 	int ret;
 
 	/* todo: verify LAP request and send reject APR if invalid. */
@@ -2480,7 +2463,7 @@ static int cm_lap_handler(struct cm_work *work)
 	cm_format_path_from_lap(cm_id_priv, param->alternate_path, lap_msg);
 	work->cm_event.private_data = &lap_msg->private_data;
 
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	if (cm_id_priv->id.state != IB_CM_ESTABLISHED)
 		goto unlock;
 
@@ -2497,7 +2480,7 @@ static int cm_lap_handler(struct cm_work *work)
 			      cm_id_priv->service_timeout,
 			      cm_id_priv->private_data,
 			      cm_id_priv->private_data_len);
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 
 		if (ib_post_send_mad(msg, NULL))
 			cm_free_msg(msg);
@@ -2515,7 +2498,7 @@ static int cm_lap_handler(struct cm_work *work)
 	ret = atomic_inc_and_test(&cm_id_priv->work_count);
 	if (!ret)
 		list_add_tail(&work->list, &cm_id_priv->work_list);
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 
 	if (ret)
 		cm_process_work(cm_id_priv, work);
@@ -2523,7 +2506,7 @@ static int cm_lap_handler(struct cm_work *work)
 		cm_deref_id(cm_id_priv);
 	return 0;
 
-unlock:	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+unlock:	spin_unlock_irq(&cm_id_priv->lock);
 deref:	cm_deref_id(cm_id_priv);
 	return -EINVAL;
 }
@@ -2598,7 +2581,6 @@ static int cm_apr_handler(struct cm_work *work)
 {
 	struct cm_id_private *cm_id_priv;
 	struct cm_apr_msg *apr_msg;
-	unsigned long flags;
 	int ret;
 
 	apr_msg = (struct cm_apr_msg *)work->mad_recv_wc->recv_buf.mad;
@@ -2612,11 +2594,11 @@ static int cm_apr_handler(struct cm_work *work)
 	work->cm_event.param.apr_rcvd.info_len = apr_msg->info_length;
 	work->cm_event.private_data = &apr_msg->private_data;
 
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	if (cm_id_priv->id.state != IB_CM_ESTABLISHED ||
 	    (cm_id_priv->id.lap_state != IB_CM_LAP_SENT &&
 	     cm_id_priv->id.lap_state != IB_CM_MRA_LAP_RCVD)) {
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		goto out;
 	}
 	cm_id_priv->id.lap_state = IB_CM_LAP_IDLE;
@@ -2626,7 +2608,7 @@ static int cm_apr_handler(struct cm_work *work)
 	ret = atomic_inc_and_test(&cm_id_priv->work_count);
 	if (!ret)
 		list_add_tail(&work->list, &cm_id_priv->work_list);
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 
 	if (ret)
 		cm_process_work(cm_id_priv, work);
@@ -2761,7 +2743,6 @@ static int cm_sidr_req_handler(struct cm_work *work)
 	struct cm_id_private *cm_id_priv, *cur_cm_id_priv;
 	struct cm_sidr_req_msg *sidr_req_msg;
 	struct ib_wc *wc;
-	unsigned long flags;
 
 	cm_id = ib_create_cm_id(work->port->cm_dev->device, NULL, NULL);
 	if (IS_ERR(cm_id))
@@ -2782,10 +2763,10 @@ static int cm_sidr_req_handler(struct cm_work *work)
 	cm_id_priv->tid = sidr_req_msg->hdr.tid;
 	atomic_inc(&cm_id_priv->work_count);
 
-	spin_lock_irqsave(&cm.lock, flags);
+	spin_lock_irq(&cm.lock);
 	cur_cm_id_priv = cm_insert_remote_sidr(cm_id_priv);
 	if (cur_cm_id_priv) {
-		spin_unlock_irqrestore(&cm.lock, flags);
+		spin_unlock_irq(&cm.lock);
 		goto out; /* Duplicate message. */
 	}
 	cur_cm_id_priv = cm_find_listen(cm_id->device,
@@ -2793,12 +2774,12 @@ static int cm_sidr_req_handler(struct cm_work *work)
 					sidr_req_msg->private_data);
 	if (!cur_cm_id_priv) {
 		rb_erase(&cm_id_priv->sidr_id_node, &cm.remote_sidr_table);
-		spin_unlock_irqrestore(&cm.lock, flags);
+		spin_unlock_irq(&cm.lock);
 		/* todo: reply with no match */
 		goto out; /* No match. */
 	}
 	atomic_inc(&cur_cm_id_priv->refcount);
-	spin_unlock_irqrestore(&cm.lock, flags);
+	spin_unlock_irq(&cm.lock);
 
 	cm_id_priv->id.cm_handler = cur_cm_id_priv->id.cm_handler;
 	cm_id_priv->id.context = cur_cm_id_priv->id.context;
@@ -2899,7 +2880,6 @@ static int cm_sidr_rep_handler(struct cm_work *work)
 {
 	struct cm_sidr_rep_msg *sidr_rep_msg;
 	struct cm_id_private *cm_id_priv;
-	unsigned long flags;
 
 	sidr_rep_msg = (struct cm_sidr_rep_msg *)
 				work->mad_recv_wc->recv_buf.mad;
@@ -2907,14 +2887,14 @@ static int cm_sidr_rep_handler(struct cm_work *work)
 	if (!cm_id_priv)
 		return -EINVAL; /* Unmatched reply. */
 
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	if (cm_id_priv->id.state != IB_CM_SIDR_REQ_SENT) {
-		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+		spin_unlock_irq(&cm_id_priv->lock);
 		goto out;
 	}
 	cm_id_priv->id.state = IB_CM_IDLE;
 	ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg);
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 
 	cm_format_sidr_rep_event(work);
 	cm_process_work(cm_id_priv, work);
@@ -2930,14 +2910,13 @@ static void cm_process_send_error(struct ib_mad_send_buf *msg,
 	struct cm_id_private *cm_id_priv;
 	struct ib_cm_event cm_event;
 	enum ib_cm_state state;
-	unsigned long flags;
 	int ret;
 
 	memset(&cm_event, 0, sizeof cm_event);
 	cm_id_priv = msg->context[0];
 
 	/* Discard old sends or ones without a response. */
-	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	spin_lock_irq(&cm_id_priv->lock);
 	state = (enum ib_cm_state) (unsigned long) msg->context[1];
 	if (msg != cm_id_priv->msg || state != cm_id_priv->id.state)
 		goto discard;
@@ -2964,7 +2943,7 @@ static void cm_process_send_error(struct ib_mad_send_buf *msg,
 	default:
 		goto discard;
 	}
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 	cm_event.param.send_status = wc_status;
 
 	/* No other events can occur on the cm_id at this point. */
@@ -2974,7 +2953,7 @@ static void cm_process_send_error(struct ib_mad_send_buf *msg,
 		ib_destroy_cm_id(&cm_id_priv->id);
 	return;
 discard:
-	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+	spin_unlock_irq(&cm_id_priv->lock);
 	cm_free_msg(msg);
 }
 

From narravul at cse.ohio-state.edu  Wed May 23 18:40:10 2007
From: narravul at cse.ohio-state.edu (Sundeep Narravula)
Date: Wed, 23 May 2007 21:40:10 -0400 (EDT)
Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm
In-Reply-To: <4654B14B.4000208@opengridcomputing.com>
Message-ID: <Pine.GSO.4.40.0705232132390.3949-100000@kappa.cse.ohio-state.edu>


  This suggestion seems to be working. I have been able to run the test
successfully several times so far without any of the earlier problems.

Thanks,
  --Sundeep.

On Wed, 23 May 2007, Steve Wise wrote:

> Guys, this reminds me of an issue we have with rnics and regular nics on
> the same physical network.  By default linux responds to arp queries on
> all ports it receives the query on.  This leads to very bad results when
> you're trying to do offloaded connections.  When resolving the
> address/route, the rdma client can end up getting the mac address of the
> dumb nic instead of the rnic.   I don't know if route resolution in the
> ib cm has this issue, but it might since they use ipoib for some part of
> the resolution, no?
>
> You might try this (snippet from the cxgb3 release notes file to be included
> in -rc4):
>
> 2) If you have a multi-homed host and the physical ethernet networks are
> bridged, then you need to configure arp to only send replies on the
> interface with the target ip address:
>
>         sysctl -w net.ipv4.conf.all.arp_ignore=2
>
> Steve.
>
> Sundeep Narravula wrote:
> >> Odd - I will see if I can reproduce this.
> >>
> >> Are the HCAs sharing the same IB subnet?
> >>
> >
> > Yes. They are in the same IB subnet.
> >
> >
> >>> hmm.. I can try this but as a last resort. Ideally I would like to use
> >>> just one listen cm_id bound to 0.0.0.0.
> >>>
> >> I was thinking about binding on the active side, before calling connect.  But I
> >> still want to look into this more.
> >>
> >
> > I can try this one out.
> >
> >   --Sundeep.
> >
> >
> >> - Sean
> >>
> >>
> >
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >
>


From swise at opengridcomputing.com  Wed May 23 19:15:43 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 23 May 2007 19:15:43 -0700
Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm
In-Reply-To: <Pine.GSO.4.40.0705232132390.3949-100000@kappa.cse.ohio-state.edu>
References: <Pine.GSO.4.40.0705232132390.3949-100000@kappa.cse.ohio-state.edu>
Message-ID: <4654F54F.9060501@opengridcomputing.com>

BTW: I _think_ the ipoib module can set this option on its specific 
interfaces programmatically.
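
For a single interface, the same knob can also be set from user space by
writing to that interface's procfs entry instead of the "all" entry used in
the sysctl quoted below. A minimal sketch in C, assuming the usual
/proc/sys/net/ipv4/conf/<ifname>/arp_ignore layout; the helper names
arp_ignore_path and set_arp_ignore are made up for illustration, and this is
not the in-kernel mechanism the ipoib module itself would use:

```c
#include <stdio.h>
#include <string.h>

/* Build the procfs path of the per-interface arp_ignore knob.
 * Returns 0 on success, -1 if the buffer is too small. */
int arp_ignore_path(char *buf, size_t len, const char *ifname)
{
	int n = snprintf(buf, len, "/proc/sys/net/ipv4/conf/%s/arp_ignore",
			 ifname);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write `value` to the interface's arp_ignore knob (needs root).
 * Returns 0 on success, -1 on any error. */
int set_arp_ignore(const char *ifname, int value)
{
	char path[128];
	FILE *f;
	int ok;

	if (arp_ignore_path(path, sizeof path, ifname))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Calling set_arp_ignore("ib0", 2) would then be the per-interface counterpart
of the sysctl command, with "ib0" standing in for whatever the interface is
called on your host.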


Sundeep Narravula wrote:
>   This suggestion seems to be working. I have been able to run the test
> successfully several times so far without any of the earlier problems.
>
> Thanks,
>   --Sundeep.
>
> On Wed, 23 May 2007, Steve Wise wrote:
>
>   
>> Guys, this reminds me of an issue we have with rnics and regular nics on
>> the same physical network.  By default linux responds to arp queries on
>> all ports it receives the query on.  This leads to very bad results when
>> you're trying to do offloaded connections.  When resolving the
>> address/route, the rdma client can end up getting the mac address of the
>> dumb nic instead of the rnic.   I don't know if route resolution in the
>> ib cm has this issue, but it might since they use ipoib for some part of
>> the resolution, no?
>>
>> You might try this (snippet from the cxgb3 release notes file to be included
>> in -rc4):
>>
>> 2) If you have a multi-homed host and the physical ethernet networks are
>> bridged, then you need to configure arp to only send replies on the
>> interface with the target ip address:
>>
>>         sysctl -w net.ipv4.conf.all.arp_ignore=2
>>
>> Steve.
>>
>> Sundeep Narravula wrote:
>>     
>>>> Odd - I will see if I can reproduce this.
>>>>
>>>> Are the HCAs sharing the same IB subnet?
>>>>
>>>>         
>>> Yes. They are in the same IB subnet.
>>>
>>>
>>>       
>>>>> hmm.. I can try this but as a last resort. Ideally I would like to use
>>>>> just one listen cm_id bound to 0.0.0.0.
>>>>>
>>>>>           
>>>> I was thinking about binding on the active side, before calling connect.  But I
>>>> still want to look into this more.
>>>>
>>>>         
>>> I can try this one out.
>>>
>>>   --Sundeep.
>>>
>>>
>>>       
>>>> - Sean
>>>>
>>>>
>>>>         
>>> _______________________________________________
>>> general mailing list
>>> general at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>
>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>>>
>>>       
>
>   


From mst at dev.mellanox.co.il  Wed May 23 20:36:15 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 06:36:15 +0300
Subject: [ofa-general] Re: IPoIB NETIF_F_SG flag for GSO
In-Reply-To: <OFEAC48BFD.D0D5C2B3-ON872572E4.008069AC-882572E5.00028DE8@us.ibm.com>
References: <20070523195700.GD6019@mellanox.co.il>
	<OFEAC48BFD.D0D5C2B3-ON872572E4.008069AC-882572E5.00028DE8@us.ibm.com>
Message-ID: <20070524033615.GE6019@mellanox.co.il>

> Quoting Shirley Ma <xma at us.ibm.com>:
> Subject: IPoIB NETIF_F_SG flag for GSO
> 
> Hello Roland, Michael,
> 
> I tried GSO for IPoIB last year. I didn't see much BW improvement for UD, only
> some decrease in CPU utilization. I looked at the GSO patch carefully and found
> that there are additional skb copies in skb segment. If the device supports SG,
> the copies can be avoided. IPoIB does support SG, so I am planning to enable it
> and test GSO again. Have you tried this before? Are there any possible issues
> you can think of?
> 
> Thanks
> Shirley Ma

Yes. SG currently needs csum offloading, and IPoIB does not
support that.

-- 
MST


From xma at us.ibm.com  Wed May 23 21:56:33 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 23 May 2007 21:56:33 -0700
Subject: [ofa-general] Re: IPoIB NETIF_F_SG flag for GSO
In-Reply-To: <20070524033615.GE6019@mellanox.co.il>
Message-ID: <OF9E0B17A2.A381D23D-ON872572E5.001A7495-882572E5.00208B2B@us.ibm.com>


Hello Michael,

> Yes. SG currently needs csum offloading, and IPoIB does not
> support that.
> --
> MST

SG should have nothing to do with CSUM. They are two different features:
one is for scatter/gather IO, the other means HW can checksum all the packets.
If you look at the net device features, NETIF_F_SG & NETIF_F_HW_CSUM are
different flags.

What did you get when enabling IPoIB SG? I remember there was a
discussion in net-dev before.

Thanks
Shirley Ma

From jackieict at gmail.com  Wed May 23 22:01:44 2007
From: jackieict at gmail.com (zhang Jackie)
Date: Thu, 24 May 2007 13:01:44 +0800
Subject: [ofa-general] how to write a IB user level multicast application
Message-ID: <13432ab00705232201s5f7d5a5h5ecaaddf57ead11b@mail.gmail.com>

Hi everyone!
  I want to write an IB user-level multicast application, but I can find only
two functions related to multicast: *ibv_attach_mcast* and *ibv_detach_mcast*.
I can't find any function, or any information in the work request, for sending
packets to a multicast group. For example, in struct ibv_send_wr, the ud struct
must have a *remote_qpn*, not a QP group.
  Does anyone know how to write a user-level multicast application? If anyone
knows, please let me know. Thanks.

From mst at dev.mellanox.co.il  Wed May 23 22:38:19 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 08:38:19 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <46537081.30906@linux.vnet.ibm.com>
References: <46537081.30906@linux.vnet.ibm.com>
Message-ID: <20070524053819.GF6019@mellanox.co.il>

> Here are my thoughts about limiting the memory footprint for IPOIB CM
> (NOSRQ) patch:
> 
> By default, cap the NOSRQ memory usage to 1GB.

The ppc systems I have start crashing if you map as much as 300MB for DMA.

> The default recvq_size
> is set to 128. Therefore for 64KB packets this would imply a maximum of
> 128 endpoints.
> 
> -Make the maximum number of endpoints a module parameter with a default
> value of 128.
> 
> -The NOSRQ limit of 1GB is also made a module parameter. However, 1GB is
> the default limit and could be changed as needed (by the administrator)
> depending on the system configuration, application needs and so on.
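
The arithmetic behind the quoted defaults (128 ring entries of 64KB each,
under a 1GB cap) can be checked with a few lines of C. This is only a sketch
of the calculation; the function names are hypothetical and nothing here is
taken from the patch itself:

```c
#include <stdint.h>

/* Receive-buffer memory held by one connection:
 * ring entries times buffer size. */
uint64_t nosrq_bytes_per_endpoint(uint64_t recvq_size, uint64_t buf_bytes)
{
	return recvq_size * buf_bytes;
}

/* Number of endpoints that fit under the memory cap
 * (0 if the per-endpoint size is zero). */
uint64_t nosrq_max_endpoints(uint64_t cap_bytes, uint64_t recvq_size,
			     uint64_t buf_bytes)
{
	uint64_t per_ep = nosrq_bytes_per_endpoint(recvq_size, buf_bytes);

	return per_ep ? cap_bytes / per_ep : 0;
}
```

With recvq_size = 128 and 64KB buffers each endpoint holds 8MB, so a 1GB cap
gives 128 endpoints, while a 300MB budget would cover only 37.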

All this need for manual tuning is really going in the wrong direction:
we should be looking for ways to get rid of existing module
parameters, for example by using the low watermark event to dynamically
tune the RQ depth.

> The
> server would return a "REJ" message upon receiving a "REQ", whenever one
> of these limits (i.e. max number of endpoints or the max NOSRQ memory
> usage) is reached. Currently, we only check for the maximum number of
> endpoints -hard coded to 1024.

So with a sufficiently low limit, we hopefully will avoid crashing the server.
That's progress, but what happens to the client when it gets this reject?

> -The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that
> support SRQ like the Topspin HCA and, such HCAs should not be
> impacted at all.

I don't think it's that clean yet.

Here's an idea: implement "fake SRQ" for ehca in software: make post recv on srq
queue the WR, spread them evenly between QPs as they connect.  Once # of QPs
goes above some limit, create QP command will fail.  This would contain the mess
nicely inside ehca (I think you'll want to add a flag that lets software
figure out that SRQ is fake).
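
To make the bookkeeping concrete, here is a toy user-space model of that
idea in C. Everything in it is hypothetical (the struct, the function names,
the limit of 4 QPs), and it simplifies heavily: it just recomputes an even
split of the queued WRs whenever a WR is posted or a QP attaches, which real
hardware queues would not allow once WRs are posted to a QP.

```c
#define FAKE_SRQ_MAX_QPS 4	/* hypothetical limit, for illustration only */

struct fake_srq {
	unsigned posted;			/* WRs queued on the fake SRQ */
	unsigned nqps;				/* QPs currently attached */
	unsigned share[FAKE_SRQ_MAX_QPS];	/* WRs assigned to each QP */
};

/* Split the queued WRs evenly; any remainder goes to the first QPs. */
static void fake_srq_rebalance(struct fake_srq *srq)
{
	unsigned i, base, rem;

	if (!srq->nqps)
		return;
	base = srq->posted / srq->nqps;
	rem = srq->posted % srq->nqps;
	for (i = 0; i < srq->nqps; i++)
		srq->share[i] = base + (i < rem ? 1 : 0);
}

/* "post recv on srq": only queue the WR, then rebalance. */
void fake_srq_post_recv(struct fake_srq *srq)
{
	srq->posted++;
	fake_srq_rebalance(srq);
}

/* Attach a connecting QP. Returns its index, or -1 once the limit
 * is reached (the point where "create QP" would fail). */
int fake_srq_attach_qp(struct fake_srq *srq)
{
	if (srq->nqps == FAKE_SRQ_MAX_QPS)
		return -1;
	srq->nqps++;
	fake_srq_rebalance(srq);
	return (int)srq->nqps - 1;
}
```

The limit is what bounds memory use here: each additional QP shrinks every
other QP's share, so past some count the per-QP share becomes too small to
be useful and refusing the new QP is the simpler outcome.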

We will still be left with the basic problem of what to do
at the active side upon the reject, though.

> -Currently we allocate a default of 64KB for the ring buffer elements,
> and this buffer size is not linked to the mtu. In the future, we could
> allocate buffers based on the mtu and link that into the computation of
> the memory cap. This way customers who might want to use a smaller mtu
> could use a larger number of endpoints, or a larger recvq_size without
> exceeding the memory cap.

I think that conceptually, global MTU config is intended for outgoing packets,
not for the RX buffers. For example, how would we handle MTU changes?

> Would this approach address the issues of scalability and enable IPOIB
> CM to be turned as the default?

For IPoIB CM to be the default, it needs to work as well as datagram mode for
most usage scenarios. Unfortunately, your proposal above seems to fail to
satisfy this requirement: it will improve speed in some scenarios,
but will either increase the need for manual configuration drastically
or cause denial of service or use up a huge amount of memory,
in others.

I think that to be able to use connected mode on ehca, what you need is

1. Find a way to make IPoIB fall back on datagram mode when you run out of
   resources.  This might need to be addressed at the protocol level.
2. Separate the noSRQ hacks more cleanly. I suggested some ways to do this
   earlier. Maybe, "fake srq" above will be a good way to solve it.

-- 
MST


From mst at dev.mellanox.co.il  Wed May 23 22:51:08 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 08:51:08 +0300
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <4654690F.1040305@linux.vnet.ibm.com>
References: <46537081.30906@linux.vnet.ibm.com>
	<4654690F.1040305@linux.vnet.ibm.com>
Message-ID: <20070524055108.GG6019@mellanox.co.il>

> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> Subject: Re: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
> 
> 
> If this proposal is acceptable, would you want me to generate a patch
> against Roland's for-2.6.22 git tree, or would for-2.6.23 tree be
> better?

I've just answered in another thread. Summary:
I think that to enable connected mode on ehca, what we need is

1. A way to make IPoIB fall back on datagram mode when you run out of
   resources.  This might need to be addressed at the protocol level.
2. A way to separate the noSRQ hacks more cleanly. This is not just me
   being a micro-optimization freak. I suggested some ways to do this better.


-- 
MST


From mst at dev.mellanox.co.il  Wed May 23 23:22:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 09:22:54 +0300
Subject: [ofa-general] Re: IPoIB NETIF_F_SG flag for GSO
In-Reply-To: <OF9E0B17A2.A381D23D-ON872572E5.001A7495-882572E5.00208B2B@us.ibm.com>
References: <20070524033615.GE6019@mellanox.co.il>
	<OF9E0B17A2.A381D23D-ON872572E5.001A7495-882572E5.00208B2B@us.ibm.com>
Message-ID: <20070524061736.GI6019@mellanox.co.il>

> Quoting Shirley Ma <xma at us.ibm.com>:
> Subject: Re: IPoIB NETIF_F_SG flag for GSO
> 
> Hello Michael,
> 
> > Yes. SG currently needs csum offloading, and IPoIB does not
> > support that.
> 
> SG should have nothing to do with CSUM. They are two different features: one is
> for scatter/gather IO, the other means HW can checksum all the packets. If you
> look at the net device features, NETIF_F_SG & NETIF_F_HW_CSUM are different flags.

Look at register_netdevice:
       /* Fix illegal SG+CSUM combinations. */
        if ((dev->features & NETIF_F_SG) &&
            !(dev->features & NETIF_F_ALL_CSUM)) {
                printk(KERN_NOTICE "%s: Dropping NETIF_F_SG since no checksum feature.\n",
                       dev->name);
                dev->features &= ~NETIF_F_SG;
        }
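
The effect of that check can be reproduced outside the kernel with a small
stand-alone function. The flag values below are placeholders chosen for this
sketch (the real ones are defined in include/linux/netdevice.h), and
fixup_features is a made-up name, not a kernel function:

```c
/* Placeholder flag values for this sketch only. */
#define X_F_SG		0x1u
#define X_F_IP_CSUM	0x2u
#define X_F_HW_CSUM	0x8u
#define X_F_ALL_CSUM	(X_F_IP_CSUM | X_F_HW_CSUM)

/* Mirror of the register_netdevice() fixup quoted above:
 * SG is dropped unless at least one checksum feature is set. */
unsigned fixup_features(unsigned features)
{
	if ((features & X_F_SG) && !(features & X_F_ALL_CSUM))
		features &= ~X_F_SG;
	return features;
}
```

So a device that advertises only SG ends up with no features at all, which
is why setting NETIF_F_SG alone has no effect.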

> What did you get when enabling IPoIB SG? I remember there was a discussion in
> net-dev before.
> 
> Thanks
> Shirley Ma

Google it.

-- 
MST


From dotanb at dev.mellanox.co.il  Wed May 23 23:44:24 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Thu, 24 May 2007 09:44:24 +0300
Subject: [ofa-general] how to write a IB user level multicast application
In-Reply-To: <13432ab00705232201s5f7d5a5h5ecaaddf57ead11b@mail.gmail.com>
References: <13432ab00705232201s5f7d5a5h5ecaaddf57ead11b@mail.gmail.com>
Message-ID: <46553448.6020508@dev.mellanox.co.il>

Hi.

zhang Jackie wrote:
> Hi everyone!
>   I want to write an IB user-level multicast application, but I can find 
> only two functions related to multicast: *ibv_attach_mcast* and 
> *ibv_detach_mcast*. I can't find any function, or any information in the 
> work request, for sending packets to a multicast group. For example, in 
> struct ibv_send_wr, the ud struct must have a *remote_qpn*, not a QP group.
>   Does anyone know how to write a user-level multicast application? If 
> anyone knows, please let me know. Thanks.
In the following URL you can find a very simple example on how to use 
multicast:
https://svn.openfabrics.org/svn/openib/trunk/contrib/mellanox/ibtp/gen2/userspace/useraccess/multicast_test/multicast_test.c


If you want to use multicast in IB, you need to do the following:

receiver side:
--------------
create a UD QP
attach this QP to a multicast group
post RR to the RQ of this QP


sender side:
--------------
post the message with a remote QP number of 0xffffff, a DLID equal to the 
multicast LID, and the multicast GID as destination (in the GRH of the AH).


This test doesn't send an SA query (to get the multicast properties) or an 
SA multicast join (which makes the SM configure the subnet so that the port 
this QP is attached to receives the multicast messages).

This example will work on a back-to-back topology.

I hope this helped you
Dotan


From ogerlitz at voltaire.com  Thu May 24 01:18:34 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 24 May 2007 11:18:34 +0300
Subject: [ofa-general] RE: two interfaces with ipoib
In-Reply-To: <4654C278.6030703@ichips.intel.com>
References: <Pine.GSO.4.40.0705231629060.26771-100000@nu.cse.ohio-state.edu>	<4654B14B.4000208@opengridcomputing.com>
	<4654C278.6030703@ichips.intel.com>
Message-ID: <46554A5A.3050607@voltaire.com>

Sean Hefty wrote:
> Steve Wise wrote:
>> Guys, this reminds me of an issue we have with rnics and regular nics 
>> on the same physical network.  By default linux responds to arp 
>> queries on all ports it receives the query on.  This leads to very bad 
>> results when you're trying to do offloaded connections.  When 
>> resolving the address/route, the rdma client can end up getting the 
>> mac address of the dumb nic instead of the rnic.   I don't know if 
>> route resolution in the ib cm has this issue, but it might since they 
>> use ipoib for some part of the resolution, no?
> 
> I think this could be the problem.  (And could have taken me a long time 
> to figure it if it is, so thanks!)
> 
>> 2) If you have a multi-homed host and the physical ethernet networks are
>> bridged, then you need to configure arp to only send replies on the
>> interface with the target ip address:
>>
>>        sysctl -w net.ipv4.conf.all.arp_ignore=2

OK, Sean, sorry for not mentioning this to you: we resolved this with a 
customer some time ago, and I communicated it to Mellanox at Sonoma 
so that it will be added to the OFED 1.2 documentation.

Generally speaking, it's a bad habit (something of a dead-on-arrival test 
for a system administrator) to have two IP (=L3) subnets sharing the same L2 
(specifically broadcast) domain. In InfiniBand (IPoIB) it means having two 
IP subnets over the same Partition, and in Ethernet it means having two IP 
subnets over the same VLAN.

I understand the default setting of arp_ignore = 0 is a religious 
argument held once in a while on the netdev mailing list. If people from 
here want to try it, I am crossing my fingers for them, but again, it 
has nothing special to do with ipoib.

Or.


From ogerlitz at voltaire.com  Thu May 24 02:31:11 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 24 May 2007 12:31:11 +0300
Subject: [ofa-general] What causes "ib0: packet len 65520 (> 2048) too
	long	to send, dropping" messages?
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E14FD@xmb-sjc-216.amer.cisco.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E14FD@xmb-sjc-216.amer.cisco.com>
Message-ID: <46555B5F.6000302@voltaire.com>

Scott Weitzenkamp (sweitzen) wrote:
> I see a small number of these types of messages, when I send large 
> messages via IP multicast.
>  
> Why do I only see a few of the messages?

Because IPoIB CM makes the stack learn that for this neighbour the MTU is 
2K (2044) and not the 64K (65520) device MTU published to the stack.

The IPoIB code uses the update_pmtu callback of the neighbour for that 
purpose, emulating the case where a "path mtu" ICMP packet has been 
received from a router.

Or.


From vlad at lists.openfabrics.org  Thu May 24 02:42:29 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu, 24 May 2007 02:42:29 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070524-0200 daily build status
Message-ID: <20070524094229.67E89E60830@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From bs at q-leap.de  Thu May 24 03:35:07 2007
From: bs at q-leap.de (Bernd Schubert)
Date: Thu, 24 May 2007 12:35:07 +0200
Subject: [ofa-general] abbreviations
Message-ID: <200705241235.07520.bs@q-leap.de>

Hi,

I need some help with abbreviations, in rdma_cm.h we have 

struct rdma_conn_param
struct rdma_ud_param
struct rdma_cm_event

So _conn_ means connection, but what do _ud_ and _cm_ mean?


Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH


From halr at voltaire.com  Thu May 24 03:49:28 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 24 May 2007 06:49:28 -0400
Subject: [ofa-general] abbreviations
In-Reply-To: <200705241235.07520.bs@q-leap.de>
References: <200705241235.07520.bs@q-leap.de>
Message-ID: <1180003767.16831.179665.camel@hal.voltaire.com>

On Thu, 2007-05-24 at 06:35, Bernd Schubert wrote:
> Hi,
> 
> I need some help with abbreviations, in rdma_cm.h we have 
> 
> struct rdma_conn_param
> struct rdma_ud_param
> struct rdma_cm_event
> 
> So _conn_ means connection, but what do _ud_ and _cm_ mean?

unreliable datagram

communication (some say connection) manager

-- Hal

> Thanks,
> Bernd
> 
> 


From dotanb at dev.mellanox.co.il  Thu May 24 03:58:38 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Thu, 24 May 2007 13:58:38 +0300
Subject: [ofa-general] abbreviations
In-Reply-To: <200705241235.07520.bs@q-leap.de>
References: <200705241235.07520.bs@q-leap.de>
Message-ID: <46556FDE.3000701@dev.mellanox.co.il>

Bernd Schubert wrote:
> Hi,
>
> I need some help with abbreviations, in rdma_cm.h we have 
>
> struct rdma_conn_param
> struct rdma_ud_param
> struct rdma_cm_event
>
> So _conn_ means connection, but what do _ud_ and _cm_ mean?
>   
UD: Unreliable Datagram
CM: Communication Manager

Is this what you meant?

thanks
Dotan


From mst at dev.mellanox.co.il  Thu May 24 04:47:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 14:47:11 +0300
Subject: [ofa-general] wmb missing in libmthca?
Message-ID: <20070524114711.GB4585@mellanox.co.il>

Roland, I see this in kernel:

                ((struct mthca_next_seg *) prev_wqe)->nda_op =
                        cpu_to_be32((ind << qp->rq.wqe_shift) | 1);
                wmb();
                ((struct mthca_next_seg *) prev_wqe)->ee_nds =
                        cpu_to_be32(MTHCA_NEXT_DBD | size);

but userspace does not have wmb here.
Is it needed?

-- 
MST
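
The ordering in the kernel snippet above can be sketched in portable
userspace C. This is only a minimal sketch under stated assumptions, not
libmthca code: struct next_seg, link_wqe(), to_be32() and NEXT_DBD are
hypothetical stand-ins, and a C11 release fence takes the place of wmb().
A C11 fence orders stores as observed by other CPU threads; whether that
is enough for a device that reads the WQE over DMA depends on the
platform, so a real driver would use its own barrier macro here.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the two fields of struct mthca_next_seg
 * that the kernel snippet touches. */
struct next_seg {
	uint32_t nda_op;	/* next descriptor address + opcode, big-endian */
	uint32_t ee_nds;	/* flags + next descriptor size, big-endian */
};

#define NEXT_DBD (1u << 7)	/* assumed flag value, for illustration only */

/* Portable host-to-big-endian conversion (stand-in for cpu_to_be32). */
static uint32_t to_be32(uint32_t v)
{
	uint8_t b[4] = {
		(uint8_t)(v >> 24), (uint8_t)(v >> 16),
		(uint8_t)(v >> 8),  (uint8_t)v
	};
	uint32_t out;

	memcpy(&out, b, sizeof out);
	return out;
}

/* Link the previous WQE to the new one: write the address word first,
 * then the size word, in the same order as the kernel code. */
static void link_wqe(struct next_seg *prev, uint32_t ind,
		     unsigned int wqe_shift, uint32_t size)
{
	prev->nda_op = to_be32((ind << wqe_shift) | 1);
	/* Stand-in for wmb(): keep the nda_op store ahead of ee_nds. */
	atomic_thread_fence(memory_order_release);
	prev->ee_nds = to_be32(NEXT_DBD | size);
}
```

Without the fence, the compiler or the CPU would be free to make the
ee_nds store visible before the nda_op store, which is the case the
question is about.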


From erezz at voltaire.com  Thu May 24 04:49:31 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Thu, 24 May 2007 14:49:31 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons
 for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <20070521114410.GG20400@mellanox.co.il>
References: <4641D295.5060907@voltaire.com><20070521081625.GA20400@mellanox.co.il><46517F78.8000805@voltaire.com>
	<20070521114410.GG20400@mellanox.co.il>
Message-ID: <46557BCB.7030102@voltaire.com>

>> >
>> > However, you are copying a ton of files from upstream kernel.
>> > Sticking extra files in include might interfere with newer
>> > kernels, so I don't have better ideas for this for 1.2
>> > (for 1.3 I am hoping we'll use the submodule support in git,
>> > so we'll be able to re-use headers as well).
>> >
>> > But, for files *not* in "include/", I suggest that, instead of
> sticking our
>> > own version in addons, we should check out the files from upstream
> and tweak
>> > makefiles to pick them up: maintaining these in OFED tree long-term will
>> > be a
>> > problem.
>>
>> Do you suggest to add a new mechanism to OFED that will do that?
> 
> No, this is the same mechanism that we use for the rest of the files:
> check them out of the kernel tree.
> Look at file ofed_scripts/ofed_checkout.sh
> 
> But I stress that we can not do this for files under
> include/ *unless* they only include packet structure definitions.
> Otherwise we'll get weird data corruption on newer kernels.

See below.

> 
>> >
>> >> >> +
>> >> >> + struct iscsi_internal {
>> >> >> +  int daemon_pid;
>> >> >> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
>> >> >> + #define cdev_to_iscsi_internal(_cdev) \
>> >> >> +  container_of(_cdev, struct iscsi_internal, cdev)
>> >> >> +
>> >> >> ++extern int attribute_container_init(void);
>> >> >> ++
>> >> >
>> >> > This does not look scsi-related. Why does this belong here?
>> >>
>> >> This is a hack. In 2.6.20, attribute_container_init is called from
>> > drivers/base/init.c. Since I cannot do that, I'm calling it from the
>> > init function in scsi_transport_iscsi (because scsi_transport_iscsi uses
>> > the attribute container). Do you have a better suggestion?
>> >
>> > Aha. No better ideas for the header, let it be for now.
>> > But the code in drivers/base/init.c can be checked out rather than
>> > copied over.
>>
>> I'm using only a very small part of init.c. I'm not sure that we
> should copy it.
> 
> OK then.
> What about the stuff like scsi.c?

I have the following files in backport/2.6.9_UX/include/src/:

attribute_container.c - almost identical to the file on 2.6.20. I had to change one line in it.

init.c - only a small part of the original file in 2.6.20

klist.c - almost identical to the file on 2.6.20. I had to change one line in it.

kref_new.c - based on kref.c

scsi.c - only a small part of the original file in 2.6.20

scsi_lib.c - only a small part of the original file in 2.6.20

scsi_scan.c - only a small part of the original file in 2.6.20

transport_class.c - identical to 2.6.20

So, the only file identical to 2.6.20 is transport_class.c. We can copy it from 2.6.20, but since it's only a single file, I'm not sure if it will make a real difference.

> 
>> >> diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/memory.h
>> > b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h
>> >> new file mode 100644
>> >> index 0000000..654ef55
>> >> --- /dev/null
>> >> +++ b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h
>> >> @@ -0,0 +1,89 @@
>> >> +/*
>> >> + * include/linux/memory.h - generic memory definition
>> >> + *
>> >> + * This is mainly for topological representation. We define the
>> >> + * basic "struct memory_block" here, which can be embedded in per-arch
>> >> + * definitions or NUMA information.
>> >> + *
>> >> + * Basic handling of the devices is done in drivers/base/memory.c
>> >> + * and system devices are handled in drivers/base/sys.c.
>> >> + *
>> >> + * Memory block are exported via sysfs in the class/memory/devices/
>> >> + * directory.
>> >> + *
>> >> + */
>> >
>> >
>> > Sorry, why are we copying this here?
>> > Are you actually trying to work with hotplug memory?
>>
>> Sorry, it seems that I don't really need memory.h. It was included
> from init.c, but it is not necessary. I made the fix on
> ofed_1_2_iser_rh4.git.
> 
> Pls check other headers you pull in - is there something you can skip?

No.


>> >
>> > This looks pretty hacky. Moving files around during make will
>> > interfere with people trying to e.g. create a patch.
>> > What is this doing? Can't makefile just point to the right files?
>> >
>>
>> Here's the problem:
>>
>> I'm trying to build a module that contains multiple object files (e.g.
> libiscsi). libiscsi contains libiscsi.o & scsi_scan.o. Something like:
>>
>> libiscsi-y             := libiscsi_f.o scsi_scan.o
>>
>> The problem is that if I'm doing something like:
>>
>> libiscsi-y             := libiscsi.o scsi_scan.o
>>
>> then, libiscsi.ko doesn't contain the symbols from libiscsi.o (only
> symbols from scsi_scan.o). We found 2 solutions for this problem:
>>
>> 1. Change the module name - this is problematic because open-iscsi
> startup script uses the original module name.
>> 2. Change the file name (libiscsi.c -> libiscsi_f.c) - this is what I did.
>>
>> I don't really like this hack, but I wasn't able to come up with
> something better. Do you know how to overcome this problem?
> 
> I do not have the time to look into this in a deep way.
> But it seems that you can just add a file libiscsi_f.c with
> 
> #include "libiscsi.c"
> 
> would this work?
> 
> --
> MST
> 

Yes, it works ok. I've updated my git tree. 

Do you think that there are other fixes to be made? Else, I'll be glad to have it in the next OFED rc.

Thanks,
Erez


From mst at dev.mellanox.co.il  Thu May 24 04:57:15 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 14:57:15 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons
	for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <46557BCB.7030102@voltaire.com>
References: <20070521114410.GG20400@mellanox.co.il>
	<46557BCB.7030102@voltaire.com>
Message-ID: <20070524115715.GC4585@mellanox.co.il>

Quoting Erez Zilber <erezz at voltaire.com>:
Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4

>> >
>> > However, you are copying a ton of files from upstream kernel.
>> > Sticking extra files in include might interfere with newer
>> > kernels, so I don't have better ideas for this for 1.2
>> > (for 1.3 I am hoping we'll use the submodule support in git,
>> > so we'll be able to re-use headers as well).
>> >
>> > But, for files *not* in "include/", I suggest that, instead of
> sticking our
>> > own version in addons, we should check out the files from upstream
> and tweak
>> > makefiles to pick them up: maintaining these in OFED tree long-term will
>> > be a
>> > problem.
>>
>> Do you suggest to add a new mechanism to OFED that will do that?
> 
> No, this is the same mechanism that we use for the rest of the files:
> check them out of the kernel tree.
> Look at file ofed_scripts/ofed_checkout.sh
> 
> But I stress that we can not do this for files under
> include/ *unless* they only include packet structure definitions.
> Otherwise we'll get weird data corruption on newer kernels.

See below.

> > 
> >> >
> >> >> >> +
> >> >> >> + struct iscsi_internal {
> >> >> >> +  int daemon_pid;
> >> >> >> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
> >> >> >> + #define cdev_to_iscsi_internal(_cdev) \
> >> >> >> +  container_of(_cdev, struct iscsi_internal, cdev)
> >> >> >> +
> >> >> >> ++extern int attribute_container_init(void);
> >> >> >> ++
> >> >> >
> >> >> > This does not look scsi-related. Why does this belong here?
> >> >>
> >> >> This is a hack. In 2.6.20, attribute_container_init is called from
> >> > drivers/base/init.c. Since I cannot do that, I'm calling it from the
> >> > init function in scsi_transport_iscsi (because scsi_transport_iscsi uses
> >> > the attribute container). Do you have a better suggestion?
> >> >
> >> > Aha. No better ideas for the header, let it be for now.
> >> > But the code in drivers/base/init.c can be checked out rather than
> >> > copied over.
> >>
> >> I'm using only a very small part of init.c. I'm not sure that we
> > should copy it.
> > 
> > OK then.
> > What about the stuff like scsi.c?
> 
> I have the following files in backport/2.6.9_UX/include/src/:
> 
> attribute_container.c - almost identical to the file on 2.6.20. I had to change one line in it.

could be a patch ...
which line?

> init.c - only a small part of the original file in 2.6.20
> 
> klist.c - almost identical to the file on 2.6.20. I had to change one line in it.

which line?

> kref_new.c - based on kref.c

Sounds scary ... how different is it?

> scsi.c - only a small part of the original file in 2.6.20
> 
> scsi_lib.c - only a small part of the original file in 2.6.20
> 
> scsi_scan.c - only a small part of the original file in 2.6.20
> 
> transport_class.c - identical to 2.6.20
> 
> So, the only file identical to 2.6.20 is transport_class.c. We can copy it from 2.6.20, but since it's only a single file, I'm not sure if it will make a real difference.

transport_class.c, attribute_container.c and klist.c are quite big together:
more than 1000 lines. So by all means, let's check them out from kernel tree.

-- 
MST


From jsquyres at cisco.com  Thu May 24 04:58:50 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 24 May 2007 07:58:50 -0400
Subject: [ofa-general] No OFED teleconf Monday, May 28
Message-ID: <5DF88A18-1DFE-4A0D-AE86-CE01A5930B16@cisco.com>

Due to the US Memorial Day holiday this upcoming Monday (May 28th),  
there will be no EWG/OFED teleconference (you'll receive an Outlook  
cancellation shortly).

Moving the teleconf to a day later (Tuesday, May 29th) is problematic  
for some.  Do we want a teleconference on Wednesday, May 30th?  Let  
me know and I can set up a phone bridge, if desired.

-- 
Jeff Squyres
Cisco Systems


From devesh28 at gmail.com  Thu May 24 05:22:16 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Thu, 24 May 2007 17:52:16 +0530
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <1179930909.16831.100286.camel@hal.voltaire.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<464B5C07.8040601@ichips.intel.com>
	<309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com>
	<1179398534.23882.67542.camel@hal.voltaire.com>
	<309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com>
	<1179483657.23882.158398.camel@hal.voltaire.com>
	<309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com>
	<1179769930.15940.9823.camel@hal.voltaire.com>
	<309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com>
	<1179930909.16831.100286.camel@hal.voltaire.com>
Message-ID: <309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com>

On 23 May 2007 10:35:13 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> On Wed, 2007-05-23 at 10:27, Devesh Sharma wrote:
> > On 21 May 2007 13:52:11 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > On Mon, 2007-05-21 at 01:58, Devesh Sharma wrote:
> > > > On 18 May 2007 06:21:05 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > > > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote:
> > > > > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > > > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote:
> > > > > > > > On 5/17/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > > > > > > > > > But initially this will generate a packet for each path, while sys
> > > > > > > > > > admin knows that path is there and he can hard-code the entries for
> > > > > > > > > > it. Other thing is that why Admin will care about creating such record
> > > > > > > > > > while SA is itself taking care, right?
> > > > > > > > >
> > > > > > > > > In your original message you asked about adding 'dummy entries' to the
> > > > > > > > > cache.  I agree that pre-loading the cache can be useful.  What I still
> > > > > > > > > am not understanding is the reasoning for adding 'dummy entries'.  By
> > > > > > > > > 'dummy entries', I've been assuming that these are invalid path records,
> > > > > > > > > but maybe that's not what you meant.
> > > > > > > > Ok if "dummy entries" word as such has created confusion then I am
> > > > > > > > sorry for that, But with that I mean that, those are valid path
> > > > > > > > records which Administrator knows in advance and while loading the
> > > > > > > > module,
> > > > > > >
> > > > > > > How does the admin know they are valid ?
> > > > > > Depending on the initial application runs, some trusted PRs can be generated.
> > > > >
> > > > > What do initial application runs have to do with this ?
> > > > My understanding is that, once the cluster is UP, and if between Node
> > > > A and Node B there is only one path,
> > >
> > > So this is a feature for such one path subnets. I wonder what percentage
> > > of deployed subnets fits this case.
> > You never know, It may be used for debugging also.
>
> I still don't have a good feel for how common/generally useful this will
> really be.
>
> > > > then, SA query always going to return same values in PR.
> > >
> > > If subnet topology is changed, these PRs might change. There are other
> > > cases where they change too.
> > Not sure about it...some suggestion?
> > >
> > > >  On this basis Initial application runs will generate PRs,
> > >
> > > That's what confused me before (Applications don't generate PRs but
> > > rather request them.) but I think I see what you mean now.
> > Ok
> > >
> > > > these PRs can be saved in some file, and can be loaded
> > > > when cache_module comes in.
> > > > >
> > > > > > >Are they somehow preconfigured at the SM ?
> > > > > > I am not sure about SM has any such provision?
> > > > >
> > > > > Not that I'm aware of.
> > > > Ok, So, currently no such support is there in SM?
> > >
> > > I can speak definitively for OpenSM and there is no such support. As to
> > > the vendor SMs, I don't think so but don't know for absolute certainty.
> > > Someone can correct me if I'm wrong but I wouldn't assume no response
> > > means correctness as some may not be listening nor want to respond as to
> > > "value added" vendor specific features.
> > What is the issue if OpenSM provides this?
>
> I'm not following you. What does/should OpenSM provide ? OpenIB works in
> configurations with other SMs.
I am talking about pre-configuring PRs in OpenSM DB.
>
> > >
> > > > > > Also not sure about the
> > > > > > role of SM in path resolving. I mean once node has initiated SA query,
> > > > > > whether SM has some database to reply SA or On the fly destination
> > > > > > > node is contacted to get the requested path record?
> > > > >
> > > > > SMs can either calculate the SA PRs on the fly based on the routing
> > > > > algorithm in use and some other things or put them in a local database.
> > > > > This is up to that SM.
> > > > Ok
> > > > >
> > > > > Destination node is not contacted in the SA PR query process.
> > > > >
> > > > > > >Doesn't each SM have its own policy for generating valid PRs ?
> > > > > > Ultimately path record is in Path_Record object format, and SA cache
> > > > > > is going to store in a fixed manner, How generation policy matters?
> > > > >
> > > > > What if the local policy loaded does not agree with what the SM would
> > > > > generate for a particular PR ? One then gets a local error which will
> > > > > need to be tracked down. Not so easy IMO.
> > > > SM policies in a subnet to generate PRs, changes dynamically? at run time?
> > >
> > > The policy doesn't change dynamically but the data to be returned in the
> > > SA PR response might.
> > >
> > > > if Not then depending on the local SM policy static PR can be
> > > > generated to load initially.
> > >
> > > Just as one question related to this, how would link failures be handled
> > > ? There are others.
> > Its just a matter of avoiding initial PR query packets by loading the
> > cache with static PRs.....Later on cache module will function in
> > normal fashion. I expect, initially every thing will come up in a
> > trusted cluster.
>
> So you're saying the cache would still react to GIDs out and in service,
> right ?
I am not sure what GIDs going in and out of service means here, but what
I mean to say is: once sa_cache is programmed with some static PRs, it
will skip the initial cache_update step, and after the first timeout the
normal update_cache() will be initiated using SA MADs.
>
> If the cache is loaded from a file, does it bypass querying the SA
> initially for PRs ?
Yes, it will, and hence it reduces the initial SA traffic generated on a
big cluster. Just imagine: the cluster is quite big and every node is
trying to build its cache initially. That creates a large burst of SA
packets.
>If that is the case, then the file is required to be
> the full set of PRs for this node otherwise there would be incomplete
> connectivity.
Yes, correct. Generating these PRs is the next issue I want to
discuss. Maybe this can be done by the admin on every node using the
read() entry point provided by the char_dev interface of the sa_cache
module. The read entry point will simply extract PRs from the cache itself.

Connectivity will be incomplete only until the first PR is requested for
that destination, because on a cache miss the application is going to
initiate an ib_sa_get_path_rec() anyhow, and the resolved PR will be added
to the cache for future reference.
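
The preload-then-fall-back behaviour described above can be sketched as
follows. This is a minimal userspace sketch, assuming a heavily reduced
path record and a fixed-size table; struct path_rec, struct pr_cache,
cache_preload(), cache_resolve() and fake_sa_query() are all hypothetical
names and not part of the sa_cache module. fake_sa_query() stands in for
a real SA query such as ib_sa_get_path_rec(), and cache timeouts are not
modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SIZE 8

struct path_rec {		/* hypothetical, heavily reduced path record */
	uint64_t dgid;		/* destination identifier (stand-in for a GID) */
	uint16_t dlid;
};

struct pr_cache {
	struct path_rec e[CACHE_SIZE];
	size_t n;
	unsigned int sa_queries;	/* how many lookups had to go to the SA */
};

typedef int (*sa_query_fn)(uint64_t dgid, struct path_rec *out);

/* Stand-in for a real SA query; always succeeds with a made-up DLID. */
static int fake_sa_query(uint64_t dgid, struct path_rec *out)
{
	out->dgid = dgid;
	out->dlid = (uint16_t)(dgid & 0xffff);
	return 0;
}

static int cache_add(struct pr_cache *c, const struct path_rec *pr)
{
	if (c->n == CACHE_SIZE)
		return -1;
	c->e[c->n++] = *pr;
	return 0;
}

/* Load static PRs (e.g. read from a file by the admin); returns the
 * number of records actually stored. */
static size_t cache_preload(struct pr_cache *c,
			    const struct path_rec *recs, size_t count)
{
	size_t i;

	for (i = 0; i < count; ++i)
		if (cache_add(c, &recs[i]))
			break;
	return i;
}

/* Hit: answer from the cache, no SA traffic.  Miss: query the SA once
 * and remember the result for later lookups. */
static int cache_resolve(struct pr_cache *c, uint64_t dgid,
			 sa_query_fn query, struct path_rec *out)
{
	size_t i;

	for (i = 0; i < c->n; ++i) {
		if (c->e[i].dgid == dgid) {
			*out = c->e[i];
			return 0;
		}
	}
	c->sa_queries++;
	if (query(dgid, out))
		return -1;
	cache_add(c, out);	/* best effort; a full cache just skips this */
	return 0;
}
```

With a complete preload file no lookup reaches the SA at startup; with an
incomplete one, only the missing destinations cost one query each.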
>
> -- Hal
>
> > > > > > CMIIW. Also I am assuming a homogeneous cluster where certain
> > > > > > parameters can be assumed to be same always.
> > > > >
> > > > > and always in agreement with what the SM would return ? For example,
> > > > yes
> > > > > what happens when a link goes down and the end node is no longer
> > > > > reachable ?
> > > > If node is not reachable then, after first timeout of sa_cache, that
> > > > entry will be removed from cache.
> > >
> > > OK; that's another aspect to add into this feature. I don't think that
> > > is currently done. I think there would need to be an API added to do
> > > this.
> > Yes, this has been discussed with Sean, we can add one char_dev
> > interface to the existing  sa_cache module implementation, Write entry
> > point will generate a SA_PR_response packet and this packet will be
> > passed to update_cache() function.
> >
> > Also we need to remove the initial schedule_update() call in the
> > add_one() function.
> > One user command is also required to read from user file and write
> > onto this device.
> > >
> > > -- Hal
> > >
> > > > > > >are these from a live SM and just loaded "out of band" to
> > > > > > bypass/preclude the SA PR >mechanism ?
> > > > > > may be
> > > > >
> > > > > Even if they are, there is still the changes in the subnet issue.
> > > > >
> > > > > -- Hal
> > > > >
> > > > > > > -- Hal
> > > > > > >
> > > > > > > >  Admin is loading this info in the cache with user command.
> > > > > > > > >
> > > > > > > > > > Another point I want to know is,
> > > > > > > > > > When local_sa_cache module will be inserted? After SM comes up or
> > > > > > > > > > Before SM comes up?
> > > > > > > > >
> > > > > > > > > It can occur either way.  There is no restriction.  The cache responds
> > > > > > > > > to port up and GID in/out of service events to update itself.
> > > > > > > > Do you mean cache module will start building cache only after Port is UP?
> > > > > > > > >
> > > > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on
> > > > > > > > > > some node not on switch) then First Forced schedule_update() is
> > > > > > > > > > > wasted, and for the first application presence of cache is
> > > > > > > > > > meaningless. Why not to keep cache effective right from the start?
> > > > > > > > >
> > > > > > > > > Pre-loading the cache with path records doesn't guarantee that those
> > > > > > > > > paths are usable.  If the SM has not come up, then the path records will
> > > > > > > > > be unusable until the SM configures the subnet, plus there's no
> > > > > > > > > guarantee that the remote endpoints specified by the paths are running.
> > > > > > > > You mean there is no guarantee that even if SM is UP and we have some
> > > > > > > > hard coded entries of path record corresponding to some node X, we are
> > > > > > > > not sure that node X has actually come up or not?  In that case
> > > > > > > > actually that path resolving should fail if node has not come up, but
> > > > > > > > with the hard coding still path will be resolved?
> > > > > > > > >
> > > > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms
> > > > > > > > > when booting a large cluster.
> > > > > > > > that's true. Also cache will get valid entries only if network is
> > > > > > > > configured by SM otherwise every node SA will, possibly, drop SA
> > > > > > > > packets.
> > > > > > > > >
> > > > > > > > > - Sean
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > >
> > >
>
>


From eli at mellanox.co.il  Thu May 24 05:49:49 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 24 May 2007 15:49:49 +0300
Subject: [ofa-general] [PATCH] IB/mlx4_ib initialize work queue
Message-ID: <1180011019.11166.39.camel@mtls03>

Initialize send work queue when modified from reset to init

Need to initialize owner bit of the send queue to software ownership
whenever the QP is modified from reset to init. This is required for
the cases that the QP is moved to reset but not destroyed and then
modified to init again.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: connectx_kernel/drivers/infiniband/hw/mlx4/qp.c
===================================================================
--- connectx_kernel.orig/drivers/infiniband/hw/mlx4/qp.c	2007-05-21 09:40:41.000000000 +0300
+++ connectx_kernel/drivers/infiniband/hw/mlx4/qp.c	2007-05-24 15:14:22.000000000 +0300
@@ -253,9 +253,7 @@ static int create_qp_common(struct mlx4_
 			    struct ib_qp_init_attr *init_attr,
 			    struct ib_udata *udata, int sqpn, struct mlx4_ib_qp *qp)
 {
-	struct mlx4_wqe_ctrl_seg *ctrl;
 	int err;
-	int i;
 
 	mutex_init(&qp->mutex);
 	spin_lock_init(&qp->sq.lock);
@@ -323,11 +321,6 @@ static int create_qp_common(struct mlx4_
 		if (err)
 			goto err_mtt;
 
-		for (i = 0; i < qp->sq.max; ++i) {
-			ctrl = get_send_wqe(qp, i);
-			ctrl->owner_opcode = cpu_to_be32(1 << 31);
-		}
-
 		qp->sq.wrid  = kmalloc(qp->sq.max * sizeof (u64), GFP_KERNEL);
 		qp->rq.wrid  = kmalloc(qp->rq.max * sizeof (u64), GFP_KERNEL);
 
@@ -670,8 +663,10 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp
 	struct mlx4_qp_context *context;
 	enum mlx4_qp_optpar optpar = 0;
 	enum ib_qp_state cur_state, new_state;
+	struct mlx4_wqe_ctrl_seg *ctrl;
 	int sqd_event;
 	int err = -EINVAL;
+	int i;
 
 	context = kzalloc(sizeof *context, GFP_KERNEL);
 	if (!context)
@@ -856,8 +851,13 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp
 	if (ibqp->srq)
 		context->srqn = cpu_to_be32(1 << 24 | to_msrq(ibqp->srq)->msrq.srqn);
 
-	if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT)
+	if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) {
 		context->db_rec_addr = cpu_to_be64(qp->db.dma);
+		for (i = 0; i < qp->sq.max; ++i) {
+			ctrl = get_send_wqe(qp, i);
+			ctrl->owner_opcode = cpu_to_be32(1 << 31);
+		}
+	}
 
 	if (cur_state == IB_QPS_INIT &&
 	    new_state == IB_QPS_RTR  &&


From eli at mellanox.co.il  Thu May 24 06:05:01 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 24 May 2007 16:05:01 +0300
Subject: [ofa-general] [PATCH] IB/mlx4_ib initialize work - resending fix
	description
Message-ID: <1180011931.11166.47.camel@mtls03>

Initialize send work queue when modified from reset to init

Need to initialize owner bit of the send queue to hardware ownership
whenever the QP is modified from reset to init. This is required for
the cases that the QP is moved to reset but not destroyed and then
modified to init again.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: connectx_kernel/drivers/infiniband/hw/mlx4/qp.c
===================================================================
--- connectx_kernel.orig/drivers/infiniband/hw/mlx4/qp.c	2007-05-21 09:40:41.000000000 +0300
+++ connectx_kernel/drivers/infiniband/hw/mlx4/qp.c	2007-05-24 15:14:22.000000000 +0300
@@ -253,9 +253,7 @@ static int create_qp_common(struct mlx4_
 			    struct ib_qp_init_attr *init_attr,
 			    struct ib_udata *udata, int sqpn, struct mlx4_ib_qp *qp)
 {
-	struct mlx4_wqe_ctrl_seg *ctrl;
 	int err;
-	int i;
 
 	mutex_init(&qp->mutex);
 	spin_lock_init(&qp->sq.lock);
@@ -323,11 +321,6 @@ static int create_qp_common(struct mlx4_
 		if (err)
 			goto err_mtt;
 
-		for (i = 0; i < qp->sq.max; ++i) {
-			ctrl = get_send_wqe(qp, i);
-			ctrl->owner_opcode = cpu_to_be32(1 << 31);
-		}
-
 		qp->sq.wrid  = kmalloc(qp->sq.max * sizeof (u64), GFP_KERNEL);
 		qp->rq.wrid  = kmalloc(qp->rq.max * sizeof (u64), GFP_KERNEL);
 
@@ -670,8 +663,10 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp
 	struct mlx4_qp_context *context;
 	enum mlx4_qp_optpar optpar = 0;
 	enum ib_qp_state cur_state, new_state;
+	struct mlx4_wqe_ctrl_seg *ctrl;
 	int sqd_event;
 	int err = -EINVAL;
+	int i;
 
 	context = kzalloc(sizeof *context, GFP_KERNEL);
 	if (!context)
@@ -856,8 +851,13 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp
 	if (ibqp->srq)
 		context->srqn = cpu_to_be32(1 << 24 | to_msrq(ibqp->srq)->msrq.srqn);
 
-	if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT)
+	if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) {
 		context->db_rec_addr = cpu_to_be64(qp->db.dma);
+		for (i = 0; i < qp->sq.max; ++i) {
+			ctrl = get_send_wqe(qp, i);
+			ctrl->owner_opcode = cpu_to_be32(1 << 31);
+		}
+	}
 
 	if (cur_state == IB_QPS_INIT &&
 	    new_state == IB_QPS_RTR  &&
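
The effect of moving the loop can be sketched in userspace C. This is a
minimal sketch under stated assumptions, not the mlx4 driver: struct qp,
struct wqe_ctrl, modify_qp(), stamp_send_queue(), to_be32() and the fixed
SQ_MAX are hypothetical stand-ins. It only shows that stamping on the
reset to init transition also covers a QP that is reset and reused
without being destroyed.

```c
#include <stdint.h>
#include <string.h>

enum qp_state { QPS_RESET, QPS_INIT };	/* reduced, hypothetical state set */

struct wqe_ctrl {
	uint32_t owner_opcode;	/* big-endian; bit 31 is the owner bit */
};

#define SQ_MAX 4		/* assumed send queue depth for the sketch */

struct qp {
	enum qp_state state;
	struct wqe_ctrl sq[SQ_MAX];
};

/* Portable host-to-big-endian conversion (stand-in for cpu_to_be32). */
static uint32_t to_be32(uint32_t v)
{
	uint8_t b[4] = {
		(uint8_t)(v >> 24), (uint8_t)(v >> 16),
		(uint8_t)(v >> 8),  (uint8_t)v
	};
	uint32_t out;

	memcpy(&out, b, sizeof out);
	return out;
}

/* Set the owner bit in every send WQE, as the loop in the patch does. */
static void stamp_send_queue(struct qp *qp)
{
	int i;

	for (i = 0; i < SQ_MAX; ++i)
		qp->sq[i].owner_opcode = to_be32(1u << 31);
}

/* Stamp on every reset -> init transition, not only at creation time. */
static int modify_qp(struct qp *qp, enum qp_state new_state)
{
	if (qp->state == QPS_RESET && new_state == QPS_INIT)
		stamp_send_queue(qp);
	qp->state = new_state;
	return 0;
}
```

The sketch uses 1u << 31 rather than 1 << 31 so that the shift is done on
an unsigned value.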


From mst at dev.mellanox.co.il  Thu May 24 06:11:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 16:11:54 +0300
Subject: [ofa-general] [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access
	race
In-Reply-To: <adatzu4d1wx.fsf@cisco.com>
References: <20070522005918.GB13311@mellanox.co.il> <adatzu4d1wx.fsf@cisco.com>
Message-ID: <20070524131154.GA7940@mellanox.co.il>

hard_start_xmit dereferences to_ipoib_neigh when only tx_lock is taken.  This
would only be safe if all calls that modify *to_ipoib_neigh take tx_lock too.
Currently this is not always true for ipoib_neigh_free and path_rec_completion,
which results in memory corruption.  Fix this race, making sure
path_rec_completion and ipoib_neigh_free are always called under
tx_lock.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

The following works fine for me here. Pls consider for 2.6.22.

ipoib_main.c      |   42 ++++++++++++++++--------------------------
ipoib_multicast.c |    6 ++++--
2 files changed, 20 insertions(+), 28 deletions(-)

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-22 01:46:54.000000000 +0300
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-23 22:45:18.000000000 +0300
@@ -262,7 +262,8 @@ static void path_free(struct net_device 
 	while ((skb = __skb_dequeue(&path->queue)))
 		dev_kfree_skb_irq(skb);
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
 		/*
@@ -277,7 +278,8 @@ static void path_free(struct net_device 
 		ipoib_neigh_free(dev, neigh);
 	}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (path->ah)
 		ipoib_put_ah(path->ah);
@@ -401,7 +403,8 @@ static void path_rec_completion(int stat
 			ah = ipoib_create_ah(dev, priv->pd, &av);
 	}
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	path->ah = ah;
 
@@ -442,7 +445,8 @@ static void path_rec_completion(int stat
 	path->query = NULL;
 	complete(&path->done);
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	while ((skb = __skb_dequeue(&skqueue))) {
 		skb->dev = dev;
@@ -614,32 +618,16 @@ static void unicast_arp_send(struct sk_b
 	path = __path_find(dev, phdr->hwaddr + 4);
 	if (!path) {
 		path = path_rec_create(dev, phdr->hwaddr + 4);
-		if (path) {
-			/* put pseudoheader back on for next time */
-			skb_push(skb, sizeof *phdr);
-			__skb_queue_tail(&path->queue, skb);
-
-			if (path_rec_start(dev, path)) {
-				spin_unlock(&priv->lock);
-				path_free(dev, path);
-				return;
-			} else
-				__path_add(dev, path);
-		} else {
-			++priv->stats.tx_dropped;
-			dev_kfree_skb_any(skb);
-		}
-
-		spin_unlock(&priv->lock);
-		return;
+		if (path)
+			__path_add(dev, path);
 	}
 
-	if (path->ah) {
+	if (path && path->ah) {
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
 			  be16_to_cpu(path->pathrec.dlid));
 
 		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
-	} else if ((path->query || !path_rec_start(dev, path)) &&
+	} else if (path && (path->query || !path_rec_start(dev, path)) &&
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		/* put pseudoheader back on for next time */
 		skb_push(skb, sizeof *phdr);
@@ -822,7 +810,8 @@ static void ipoib_neigh_cleanup(struct n
 		  IPOIB_QPN(n->ha),
 		  IPOIB_GID_RAW_ARG(n->ha + 4));
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	neigh = *to_ipoib_neigh(n);
 	if (neigh) {
@@ -832,7 +821,8 @@ static void ipoib_neigh_cleanup(struct n
 		ipoib_neigh_free(n->dev, neigh);
 	}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (ah)
 		ipoib_put_ah(ah);
Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-22 01:46:54.000000000 +0300
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-23 21:38:28.000000000 +0300
@@ -100,7 +100,8 @@ static void ipoib_mcast_free(struct ipoi
 			"deleting multicast group " IPOIB_GID_FMT "\n",
 			IPOIB_GID_ARG(mcast->mcmember.mgid));
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) {
 		/*
@@ -114,7 +115,8 @@ static void ipoib_mcast_free(struct ipoi
 		ipoib_neigh_free(dev, neigh);
 	}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (mcast->ah)
 		ipoib_put_ah(mcast->ah);

-- 
MST


From mst at dev.mellanox.co.il  Thu May 24 07:04:10 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 17:04:10 +0300
Subject: [ofa-general] question on netpoll
Message-ID: <20070524140410.GB7940@mellanox.co.il>

Roland, there's something I don't understand in ipoib:
ipoib_ib_dev_stop currently moves QP to error, and then
waits for it to be drained of WRs.

However, if this is done during dev_stop,
netpoll is turned off (bit __LINK_STATE_START is cleared) -
so what is draining the CQ?


-- 
MST


From glebn at voltaire.com  Thu May 24 07:19:28 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Thu, 24 May 2007 17:19:28 +0300
Subject: [ofa-general] RDMA write completion question
Message-ID: <20070524141928.GI20691@minantech.com>

Hi,

  Does a local RDMA write completion guarantee that the data that was RDMAed is
already accessible in the destination host's _memory_?

--
			Gleb.


From fenkes at de.ibm.com  Thu May 24 07:51:08 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 24 May 2007 16:51:08 +0200
Subject: [ofa-general] [PATCH] IB/ehca: fix wrong number of send WRs returned
Message-ID: <200705241651.09411.fenkes@de.ibm.com>

From: Stefan Roscher <stefan.roscher at de.ibm.com>

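Due to a typo, the actual number of outstanding send WRs was stored in
act_nr_send_sges instead of act_nr_send_wqes, so the driver was reporting
the wrong number of "actual send WRs" after ehca_create_qp(). Fixed.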

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/hcp_if.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index 7f0beec..5766ae3 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -331,7 +331,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
 				0);
 	qp->ipz_qp_handle.handle = outs[0];
 	qp->real_qp_num = (u32)outs[1];
-	parms->act_nr_send_sges =
+	parms->act_nr_send_wqes =
 		(u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_SEND_WR, outs[2]);
 	parms->act_nr_recv_wqes =
 		(u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_RECV_WR, outs[2]);
-- 
1.5.2


From fenkes at de.ibm.com  Thu May 24 07:51:05 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 24 May 2007 16:51:05 +0200
Subject: [ofa-general] [PATCH] IB/ehca: Refactor "maybe missed event" code
Message-ID: <200705241651.05860.fenkes@de.ibm.com>

Refactored Roland's patch so the queue arithmetic takes slightly fewer
lines. Also moved the spinlock flags variable (spl_flags) into the block
where it is used.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_reqs.c |    2 +-
 drivers/infiniband/hw/ehca/ipz_pt_fn.h |   28 ++++++++++------------------
 2 files changed, 11 insertions(+), 19 deletions(-)
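
For readers unfamiliar with the toggle-bit scheme behind the validity check,
here is a minimal user-space sketch. It is not the driver code: every
sketch_* name, the ring size, and the initial toggle value are illustrative,
and flipping the toggle on wrap-around is an assumption about what
ipz_qeit_get_inc() does, since that body is not part of this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ENTRIES 4	/* illustrative ring size */

struct sketch_cqe {
	uint8_t cqe_flags;	/* bit 7 is the valid/toggle bit */
};

struct sketch_queue {
	struct sketch_cqe entries[SKETCH_ENTRIES];
	size_t current;		/* index of the current queue entry */
	uint32_t toggle_state;	/* low bit = valid bit expected on this pass */
};

/* Start with all flags clear and toggle_state 1, so nothing looks valid yet. */
void sketch_init(struct sketch_queue *q)
{
	assert(q != NULL);
	*q = (struct sketch_queue){ .toggle_state = 1 };
}

/* Same comparison as ipz_qeit_is_valid(): bit 7 against the low toggle bit. */
int sketch_is_valid(const struct sketch_queue *q)
{
	uint32_t valid_bit = (uint32_t)(q->entries[q->current].cqe_flags >> 7);

	return valid_bit == (q->toggle_state & 1u);
}

/* Return the current entry and step forward; flip the toggle on wrap (assumed). */
struct sketch_cqe *sketch_get_inc(struct sketch_queue *q)
{
	struct sketch_cqe *cqe = &q->entries[q->current];

	if (++q->current == SKETCH_ENTRIES) {
		q->current = 0;
		q->toggle_state = ~q->toggle_state;
	}
	return cqe;
}

/* Same shape as the refactored ipz_qeit_get_inc_valid(). */
struct sketch_cqe *sketch_get_inc_valid(struct sketch_queue *q)
{
	return sketch_is_valid(q) ? sketch_get_inc(q) : NULL;
}
```

In this sketch an entry left over from the previous pass still carries the old
toggle value, so it reads as invalid until the producer overwrites it.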

diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index caec9de..56c4527 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -637,7 +637,6 @@ poll_cq_exit0:
 int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags notify_flags)
 {
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
-	unsigned long spl_flags;
 	int ret = 0;
 
 	switch (notify_flags & IB_CQ_SOLICITED_MASK) {
@@ -652,6 +651,7 @@ int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags notify_flags)
 	}
 
 	if (notify_flags & IB_CQ_REPORT_MISSED_EVENTS) {
+		unsigned long spl_flags;
 		spin_lock_irqsave(&my_cq->spinlock, spl_flags);
 		ret = ipz_qeit_is_valid(&my_cq->ipz_queue);
 		spin_unlock_irqrestore(&my_cq->spinlock, spl_flags);
diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h
index 57f141a..007f088 100644
--- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h
+++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h
@@ -105,7 +105,6 @@ void *ipz_qpageit_get_inc(struct ipz_queue *queue);
  * step in struct ipz_queue, will wrap in ringbuffer
  * returns address (kv) of Queue Entry BEFORE increment
  * warning don't use in parallel with ipz_qpageit_get_inc()
- * warning unpredictable results may occur if steps>act_nr_of_queue_entries
  */
 static inline void *ipz_qeit_get_inc(struct ipz_queue *queue)
 {
@@ -121,31 +120,24 @@ static inline void *ipz_qeit_get_inc(struct ipz_queue *queue)
 }
 
 /*
+ * return a bool indicating whether current Queue Entry is valid
+ */
+static inline int ipz_qeit_is_valid(struct ipz_queue *queue)
+{
+	struct ehca_cqe *cqe = ipz_qeit_get(queue);
+	return ((cqe->cqe_flags >> 7) == (queue->toggle_state & 1));
+}
+
+/*
  * return current Queue Entry, increment Queue Entry iterator by one
  * step in struct ipz_queue, will wrap in ringbuffer
  * returns address (kv) of Queue Entry BEFORE increment
  * returns 0 and does not increment, if wrong valid state
  * warning don't use in parallel with ipz_qpageit_get_inc()
- * warning unpredictable results may occur if steps>act_nr_of_queue_entries
  */
 static inline void *ipz_qeit_get_inc_valid(struct ipz_queue *queue)
 {
-	struct ehca_cqe *cqe = ipz_qeit_get(queue);
-	u32 cqe_flags = cqe->cqe_flags;
-
-	if ((cqe_flags >> 7) != (queue->toggle_state & 1))
-		return NULL;
-
-	ipz_qeit_get_inc(queue);
-	return cqe;
-}
-
-static inline int ipz_qeit_is_valid(struct ipz_queue *queue)
-{
-	struct ehca_cqe *cqe = ipz_qeit_get(queue);
-	u32 cqe_flags = cqe->cqe_flags;
-
-	return cqe_flags >> 7 == (queue->toggle_state & 1);
+	return ipz_qeit_is_valid(queue) ? ipz_qeit_get_inc(queue) : NULL;
 }
 
 /*
-- 
1.5.2


From devesh28 at gmail.com  Thu May 24 08:08:00 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Thu, 24 May 2007 20:38:00 +0530
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <20070524141928.GI20691@minantech.com>
References: <20070524141928.GI20691@minantech.com>
Message-ID: <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>

On 5/24/07, Gleb Natapov <glebn at voltaire.com> wrote:
> Hi,
>
>  Does a local RDMA write completion guarantee that the data that was RDMAed is
> already accessible in the destination host's _memory_?
Local RDMA write completion guarantees that the data you have RDMAed
has been copied into the remote buffer, without any data corruption.
>
> --
>                        Gleb.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From mst at dev.mellanox.co.il  Thu May 24 08:12:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 18:12:54 +0300
Subject: [ofa-general] Re: question on netpoll
In-Reply-To: <20070524140410.GB7940@mellanox.co.il>
References: <20070524140410.GB7940@mellanox.co.il>
Message-ID: <20070524151254.GC7940@mellanox.co.il>

> Quoting Michael S. Tsirkin <mst at dev.mellanox.co.il>:
> Subject: question on netpoll
> 
> Roland, there's something I don't understand in ipoib:
> ipoib_ib_dev_stop currently moves QP to error, and then
> waits for it to be drained of WRs.
> 
> However, if this is done during dev_stop,
> netpoll is turned off (bit __LINK_STATE_START is cleared) -
> so what is draining the CQ?

OK, I noticed the following at the end of dev stop:

                do {
                        n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
                        for (i = 0; i < n; ++i) {
                                if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
                                        ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
                                else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
                                        ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
                                else
                                        ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
                        }
                } while (n == IPOIB_NUM_WC);

However: this could call netif_receive_skb - would that be a problem?
For example, what if we don't have any quota left?


-- 
MST


From changquing.tang at hp.com  Thu May 24 08:13:33 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Thu, 24 May 2007 15:13:33 -0000
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
References: <20070524141928.GI20691@minantech.com>
	<309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301696F8A@G3W0634.americas.hpqcorp.net>


But I learned a while back that a local RDMA completion only means that
the data has been received by the remote HCA and that an ACK has been
returned; the remote HCA may or may not have delivered the data to host
memory yet.

Is this still true ?

--CQ

 
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of 
> Devesh Sharma
> Sent: Thursday, May 24, 2007 10:08 AM
> To: Gleb Natapov
> Cc: general at lists.openfabrics.org
> Subject: Re: [ofa-general] RDMA write completion question
> 
> On 5/24/07, Gleb Natapov <glebn at voltaire.com> wrote:
> > Hi,
> >
> >  Does a local RDMA write completion guarantee that the data that was
> > RDMAed is already accessible in the destination host's _memory_?
> Local RDMA write completion guarantees that the data you have 
> RDMAed has been copied into the remote buffer, without any 
> data corruption.
> >
> > --
> >                        Gleb.


From afriedle at open-mpi.org  Thu May 24 08:18:26 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Thu, 24 May 2007 08:18:26 -0700
Subject: [ofa-general] how to write a IB user level multicast application
In-Reply-To: <46553448.6020508@dev.mellanox.co.il>
References: <13432ab00705232201s5f7d5a5h5ecaaddf57ead11b@mail.gmail.com>
	<46553448.6020508@dev.mellanox.co.il>
Message-ID: <4655ACC2.9030401@open-mpi.org>

Dotan Barak wrote:
> In the following URL you can find a very simple example on how to use 
> multicast:
> https://svn.openfabrics.org/svn/openib/trunk/contrib/mellanox/ibtp/gen2/userspace/useraccess/multicast_test/multicast_test.c 

I seem to be missing v1.h on my OFED v1.2 nightly install. Where can I 
find it?

> this test doesn't send an SA query (to get the multicast props) or an SA 
> multicast join (to make the SM configure the subnet to make the port 
> that this QP is attached to) to get the multicast messages.
> 
> This example will work on a back-to-back topology.

An alternative that I've had pretty good success with is to use the RDMA 
CM.  It uses an IP(v6) abstraction, does the SA queries/joins for you, 
and also supports selection of an unused multicast address if you join 
the '0' address (and port? not sure if it's required; I always zero it). 
Check out the 'mckey.c' example included with the RDMA CM source; it 
will likely answer your questions.

Andrew


From sweitzen at cisco.com  Thu May 24 08:21:27 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Thu, 24 May 2007 08:21:27 -0700
Subject: [ofa-general] RE: two interfaces with ipoib
In-Reply-To: <46554A5A.3050607@voltaire.com>
References: <Pine.GSO.4.40.0705231629060.26771-100000@nu.cse.ohio-state.edu>	<4654B14B.4000208@opengridcomputing.com><4654C278.6030703@ichips.intel.com>
	<46554A5A.3050607@voltaire.com>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303950994@xmb-sjc-216.amer.cisco.com>

How is:

  sysctl -w net.ipv4.conf.all.arp_ignore=2 

different from:

  for i in /proc/sys/net/ipv4/conf/ib*/arp_filter; do echo 1 > $i; done

I have been using the latter successfully regarding this issue.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Or Gerlitz
> Sent: Thursday, May 24, 2007 1:19 AM
> To: Sean Hefty
> Cc: Roland Dreier (rdreier); general at lists.openfabrics.org
> Subject: [ofa-general] RE: two interfaces with ipoib
> 
> Sean Hefty wrote:
> > Steve Wise wrote:
> >> Guys, this reminds me of an issue we have with rnics and regular nics
> >> on the same physical network.  By default linux responds to arp
> >> queries on all ports it receives the query on.  This leads to very bad
> >> results when you're trying to do offloaded connections.  When
> >> resolving the address/route, the rdma client can end up getting the
> >> mac address of the dumb nic instead of the rnic.   I don't know if
> >> route resolution in the ib cm has this issue, but it might since they
> >> use ipoib for some part of the resolution, no?
> > 
> > I think this could be the problem.  (And could have taken me a long time
> > to figure it out if it is, so thanks!)
> > 
> >> 2) If you have a multi-homed host and the physical ethernet networks are
> >> bridged, then you need to configure arp to only send replies on the
> >> interface with the target ip address:
> >>
> >>        sysctl -w net.ipv4.conf.all.arp_ignore=2
> 
> OK, Sean, sorry for not mentioning this to you. We resolved this with a
> customer some time ago, and I communicated it to Mellanox at Sonoma so
> that it will be added to the OFED 1.2 documentation.
> 
> Generally speaking, it's a bad habit (somewhat of a dead-on-arrival test
> for a system administrator) to have two IP (=L3) subnets sharing the same
> L2 (specifically broadcast) domain. In InfiniBand (IPoIB) it means having
> two IP subnets over the same partition, and in Ethernet it means having
> two IP subnets over the same VLAN.
> 
> I understand the default setting of arp_ignore = 0 is a religious
> argument held once in a while on the netdev mailing list. If people from
> here want to try it, I am crossing my fingers for them, but again, it
> has nothing special to do with ipoib.
> 
> Or.
> 


From mst at dev.mellanox.co.il  Thu May 24 08:32:46 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 18:32:46 +0300
Subject: [ofa-general] [PATCH] IB/ipoib: drain cq in dev_stop
Message-ID: <20070524153246.GD7940@mellanox.co.il>

Fix 2 bugs in RX draining code:
1. The logic in time_after is reversed, so it was timing out immediately
2. Since netpoll is disabled while ipoib_cm_dev_stop is running,
   ipoib_cm_dev_stop must poll the CQ in order to see the draining packet.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

Please review the above.

I'm still uncomfortable with the fact that ipoib_ib_dev_stop could cause
packets to be passed up without poll being called. Is this OK?

It is possible we never saw problems in practice because the race window is
small, but it seems that we should pass a flag to handle_rx_wc routines to have
it drop all packets.  Roland, what do you think?
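
For reference on bug 1, here is a minimal user-space sketch of the
wraparound-safe comparison that time_after() relies on. The names
sketch_time_after, drain_timed_out and SKETCH_HZ are illustrative stand-ins
and a 32-bit tick counter is assumed; the kernel's own macro is in
<linux/jiffies.h> and works on unsigned long.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_HZ 1000u	/* assumed tick rate, for illustration only */

/* True when tick value a is later than b, even across counter wraparound. */
bool sketch_time_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Corrected test: give up only once more than 5 seconds have elapsed. */
bool drain_timed_out(uint32_t now, uint32_t begin)
{
	return sketch_time_after(now, begin + 5 * SKETCH_HZ);
}

/* Small usage example: no timeout at the start, timeout after 5 s. */
void sketch_demo(void)
{
	assert(!drain_timed_out(100, 100));
	assert(drain_timed_out(100 + 5 * SKETCH_HZ + 1, 100));
}
```

With the reversed test, !time_after(jiffies, begin + 5 * HZ), the condition is
already true on the first pass through the loop, which matches the immediate
timeout described above.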

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index a0b3782..158759e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -429,6 +429,7 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
 
 void ipoib_pkey_poll(struct work_struct *work);
 int ipoib_pkey_dev_delay_open(struct net_device *dev);
+void ipoib_drain_cq(struct net_device *dev);
 
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index ffec794..dc299db 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -713,7 +713,7 @@ void ipoib_cm_dev_stop(struct net_device *dev)
 	while (!list_empty(&priv->cm.rx_error_list) ||
 	       !list_empty(&priv->cm.rx_flush_list) ||
 	       !list_empty(&priv->cm.rx_drain_list)) {
-		if (!time_after(jiffies, begin + 5 * HZ)) {
+		if (time_after(jiffies, begin + 5 * HZ)) {
 			ipoib_warn(priv, "RX drain timing out\n");
 
 			/*
@@ -725,6 +725,7 @@ void ipoib_cm_dev_stop(struct net_device *dev)
 			break;
 		}
 		spin_unlock_irq(&priv->lock);
+		ipoib_drain_cq(dev);
 		msleep(1);
 		spin_lock_irq(&priv->lock);
 	}
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index c1aad06..8404f05 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -550,13 +550,30 @@ static int recvs_pending(struct net_device *dev)
 	return pending;
 }
 
+void ipoib_drain_cq(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i, n;
+	do {
+		n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
+		for (i = 0; i < n; ++i) {
+			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
+				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
+			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
+				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
+			else
+				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
+		}
+	} while (n == IPOIB_NUM_WC);
+}
+
 int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
 	unsigned long begin;
 	struct ipoib_tx_buf *tx_req;
-	int i, n;
+	int i;
 
 	clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags);
 	netif_poll_disable(dev);
@@ -611,17 +628,7 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 			goto timeout;
 		}
 
-		do {
-			n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
-			for (i = 0; i < n; ++i) {
-				if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
-					ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
-				else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
-					ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
-				else
-					ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
-			}
-		} while (n == IPOIB_NUM_WC);
+		ipoib_drain_cq(dev);
 
 		msleep(1);
 	}

-- 
MST


From halr at voltaire.com  Thu May 24 08:30:24 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 24 May 2007 11:30:24 -0400
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<464B5C07.8040601@ichips.intel.com>
	<309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com>
	<1179398534.23882.67542.camel@hal.voltaire.com>
	<309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com>
	<1179483657.23882.158398.camel@hal.voltaire.com>
	<309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com>
	<1179769930.15940.9823.camel@hal.voltaire.com>
	<309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com>
	<1179930909.16831.100286.camel@hal.voltaire.com>
	<309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com>
Message-ID: <1180020620.16831.198071.camel@hal.voltaire.com>

On Thu, 2007-05-24 at 08:22, Devesh Sharma wrote:
> On 23 May 2007 10:35:13 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > On Wed, 2007-05-23 at 10:27, Devesh Sharma wrote:
> > > On 21 May 2007 13:52:11 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > > On Mon, 2007-05-21 at 01:58, Devesh Sharma wrote:
> > > > > On 18 May 2007 06:21:05 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > > > > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote:
> > > > > > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > > > > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote:
> > > > > > > > > On 5/17/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > > > > > > > > > > But initially this will generate a packet for each path, while sys
> > > > > > > > > > > admin knows that path is there and he can hard-code the entries for
> > > > > > > > > > > it. Other thing is that why Admin will care about creating such record
> > > > > > > > > > > while SA is itself taking care, right?
> > > > > > > > > >
> > > > > > > > > > In your original message you asked about adding 'dummy entries' to the
> > > > > > > > > > cache.  I agree that pre-loading the cache can be useful.  What I still
> > > > > > > > > > am not understanding is the reasoning for adding 'dummy entries'.  By
> > > > > > > > > > 'dummy entries', I've been assuming that these are invalid path records,
> > > > > > > > > > but maybe that's not what you meant.
> > > > > > > > > Ok if "dummy entries" word as such has created confusion then I am
> > > > > > > > > sorry for that, But with that I mean that, those are valid path
> > > > > > > > > records which Administrator knows in advance and while loading the
> > > > > > > > > module,
> > > > > > > >
> > > > > > > > How does the admin know they are valid ?
> > > > > > > Depending on the initial application runs, some trusted PRs can be generated.
> > > > > >
> > > > > > What do initial application runs have to do with this ?
> > > > > My understanding is that, once the cluster is UP, and if between Node
> > > > > A and Node B there is only one path,
> > > >
> > > > So this is a feature for such one path subnets. I wonder what percentage
> > > > of deployed subnets fits this case.
> > > You never know, It may be used for debugging also.
> >
> > I still don't have a good feel for how common/generally useful this will
> > really be.
> >
> > > > > then, SA query always going to return same values in PR.
> > > >
> > > > If subnet topology is changed, these PRs might change. There are other
> > > > cases where they change too.
> > > Not sure about it...some suggestion?
> > > >
> > > > >  On this basis Initial application runs will generate PRs,
> > > >
> > > > That's what confused me before (Applications don't generate PRs but
> > > > rather request them.) but I think I see what you mean now.
> > > Ok
> > > >
> > > > > these PRs can be saved in some file, and can be loaded
> > > > > when cache_module comes in.
> > > > > >
> > > > > > > >Are they somehow preconfigured at the SM ?
> > > > > > > I am not sure about SM has any such provision?
> > > > > >
> > > > > > Not that I'm aware of.
> > > > > Ok, So, currently no such support is there in SM?
> > > >
> > > > I can speak definitively for OpenSM and there is no such support. As to
> > > > the vendor SMs, I don't think so but don't know for absolute certainty.
> > > > Someone can correct me if I'm wrong but I wouldn't assume no response
> > > > means correctness as some may not be listening nor want to respond as to
> > > > "value added" vendor specific features.
> > > What is the issue if OpenSM provides this?
> >
> > I'm not following you. What does/should OpenSM provide ? OpenIB works in
> > configurations with other SMs.
> I am talking about pre-configuring PRs in OpenSM DB.

How does that help ? Why would PRs need to be preconfigured at the SM ?
Do you mean preconfigure the routing tables (and generate the PRs from
that) ? What problem is being solved on the SM side ?

> > > > > > > Also not sure about the
> > > > > > > role of SM in path resolving. I mean once node has initiated SA query,
> > > > > > > whether SM has some database to reply SA or On the fly destination
> > > > > > > node is contacted to get asked path recored?
> > > > > >
> > > > > > SMs can either calculate the SA PRs on the fly based on the routing
> > > > > > algorithm in use and some other things or put them in a local database.
> > > > > > This is up to that SM.
> > > > > Ok
> > > > > >
> > > > > > Destination node is not contacted in the SA PR query process.
> > > > > >
> > > > > > > >Doesn't each SM have its own policy for generating valid PRs ?
> > > > > > > Ultimately path record is in Path_Record object format, and SA cache
> > > > > > > is going to store in a fixed manner, How generation policy matters?
> > > > > >
> > > > > > What if the local policy loaded does not agree with what the SM would
> > > > > > generate for a particular PR ? One then gets a local error which will
> > > > > > need to be tracked down. Not so easy IMO.
> > > > > SM policies in a subnet to generate PRs, changes dynamically? at run time?
> > > >
> > > > The policy doesn't change dynamically but the data to be returned in the
> > > > SA PR response might.
> > > >
> > > > > if Not then depending on the local SM policy static PR can be
> > > > > generated to load initially.
> > > >
> > > > Just as one question related to this, how would link failures be handled
> > > > ? There are others.
> > > Its just a matter of avoiding initial PR query packets by loading the
> > > cache with static PRs.....Later on cache module will function in
> > > normal fashion. I expect, initially every thing will come up in a
> > > trusted cluster.
> >
> > So you're saying the cache would still react to GIDs out and in service,
> > right ?
> I am not sure about GIDs in/out of service....

Why not ?

> but what I mean to say is,
> Once sa_cache is programmed with some static PRs....it will avoid
> initial cache_update step and after first time out normal
> update_cache() will be initiated using SA MADs.

How would the client know what PRs to request when that timeout first
occurs ? There are no "get all except these" semantics. If it is all PRs,
what does that save ?

> > If the cache is loaded from a file, does it bypass querying the SA
> > initially for PRs ?
> Yes, it will, and hence it reduces the initial SA traffic generated on a
> big cluster. Just imagine: the cluster is quite big and every node is
> trying to build its cache initially. That will create a large burst of SA
> packets.
> > If that is the case, then the file is required to be
> > the full set of PRs for this node otherwise there would be incomplete
> > connectivity.
> Yes, correct. Generating these PRs is the next issue I want to
> discuss. Maybe this can be done by the admin on every node using the
> read() entry point provided by the char_dev interface of the sa_cache
> module. The read entry point will simply extract PRs from the cache itself.
> 
> Connectivity will be incomplete only until the first PR is requested for
> that destination: on a cache miss, the application is going to call
> ib_sa_get_path_rec() anyway, and the resolved PR will be added to the
> cache for future reference.

OK then this becomes an on demand model for those destinations (at least
initially).
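
To make the preload-plus-on-demand idea concrete, here is a rough user-space
sketch. It is only an illustration under stated assumptions: all sketch_*
names are hypothetical, the path record is reduced to a 64-bit destination id
and a DLID (real GIDs are 128 bits), and sketch_fake_sa_query stands in for an
actual SA query such as the one issued via ib_sa_get_path_rec().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_SLOTS 8

/* Heavily simplified stand-in for a path record. */
struct sketch_path_rec {
	uint64_t dgid;
	uint16_t dlid;
};

struct sketch_cache {
	struct sketch_path_rec slots[SKETCH_CACHE_SLOTS];
	size_t used;
	unsigned int sa_queries;	/* how often we had to ask the "SA" */
};

typedef bool (*sketch_sa_query_fn)(uint64_t dgid, struct sketch_path_rec *out);

/* Hypothetical resolver standing in for an SA query; always succeeds here. */
bool sketch_fake_sa_query(uint64_t dgid, struct sketch_path_rec *out)
{
	out->dgid = dgid;
	out->dlid = (uint16_t)(dgid & 0xffff);
	return true;
}

/* Preload records (e.g. read from a file); returns how many were stored. */
size_t sketch_cache_preload(struct sketch_cache *c,
			    const struct sketch_path_rec *recs, size_t n)
{
	size_t stored = 0;

	assert(c != NULL);
	while (stored < n && c->used < SKETCH_CACHE_SLOTS)
		c->slots[c->used++] = recs[stored++];
	return stored;
}

/* Hit: answer from the cache. Miss: query on demand and remember the result. */
bool sketch_cache_get(struct sketch_cache *c, uint64_t dgid,
		      sketch_sa_query_fn query, struct sketch_path_rec *out)
{
	size_t i;

	for (i = 0; i < c->used; i++) {
		if (c->slots[i].dgid == dgid) {
			*out = c->slots[i];
			return true;
		}
	}
	c->sa_queries++;
	if (!query(dgid, out))
		return false;
	if (c->used < SKETCH_CACHE_SLOTS)
		c->slots[c->used++] = *out;
	return true;
}
```

The sketch deliberately ignores invalidation (link failures, GID out-of-service
events), which is the open problem raised earlier in this thread.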

-- Hal

> > -- Hal
> >
> > > > > > > CMIIW. Also I am assuming a homogeneous cluster where certain
> > > > > > > parameters can be assumed to be same always.
> > > > > >
> > > > > > and always in agreement with what the SM would return ? For example,
> > > > > yes
> > > > > > what happens when a link goes down and the end node is no longer
> > > > > > reachable ?
> > > > > If node is not reachable then, after first timeout of sa_cache, that
> > > > > entry will be removed from cache.
> > > >
> > > > OK; that's another aspect to add into this feature. I don't think that
> > > > is currently done. I think there would need to be an API added to do
> > > > this.
> > > Yes, this has been discussed with Sean, we can add one char_dev
> > > interface to the existing  sa_cache module implementation, Write entry
> > > point will generate a SA_PR_response packet and this packet will be
> > > passed to update_cache() function.
> > >
> > > Also we need to remove the initial schedule_update() call in the
> > > add_one() function.
> > > One user command is also required to read from user file and write
> > > onto this device.
> > > >
> > > > -- Hal
> > > >
> > > > > > > >are these from a live SM and just loaded "out of band" to
> > > > > > > bypass/preclude the SA PR >mechanism ?
> > > > > > > may be
> > > > > >
> > > > > > Even if they are, there is still the changes in the subnet issue.
> > > > > >
> > > > > > -- Hal
> > > > > >
> > > > > > > > -- Hal
> > > > > > > >
> > > > > > > > >  Admin is loading this info in the cache with user command.
> > > > > > > > > >
> > > > > > > > > > > Another point I want to know is,
> > > > > > > > > > > When local_sa_cache module will be inserted? After SM comes up or
> > > > > > > > > > > Before SM comes up?
> > > > > > > > > >
> > > > > > > > > > It can occur either way.  There is no restriction.  The cache responds
> > > > > > > > > > to port up and GID in/out of service events to update itself.
> > > > > > > > > Do you mean cache module will start building cache only after Port is UP?
> > > > > > > > > >
> > > > > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on
> > > > > > > > > > > some node not on switch) then First Forced schedule_update() is
> > > > > > > > > > > waisted, and for the first application presence of cache is
> > > > > > > > > > > meaningless. Why not to keep cache effective right from the start?
> > > > > > > > > >
> > > > > > > > > > Pre-loading the cache with path records doesn't guarantee that those
> > > > > > > > > > paths are usable.  If the SM has not come up, then the path records will
> > > > > > > > > > be unusable until the SM configures the subnet, plus there's no
> > > > > > > > > > guarantee that the remote endpoints specified by the paths are running.
> > > > > > > > > You mean there is no guarantee that even if SM is UP and we have some
> > > > > > > > > hard coded entries of path record corresponding to some node X, we are
> > > > > > > > > not sure that node X has actually come up or not?  In that case
> > > > > > > > > actually that path resolving should fail if node has not come up, but
> > > > > > > > > with the hard coding still path will be resolved?
> > > > > > > > > >
> > > > > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms
> > > > > > > > > > when booting a large cluster.
> > > > > > > > > that's true. Also cache will get valid entries only if network is
> > > > > > > > > configured by SM otherwise every node SA will, possibly, drop SA
> > > > > > > > > packets.
> > > > > > > > > >
> > > > > > > > > > - Sean
> > > > > > > > > >


From tziporet at dev.mellanox.co.il  Thu May 24 08:36:11 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 24 May 2007 18:36:11 +0300
Subject: [ofa-general] Re: [ewg] Something is wrong in gitweb
In-Reply-To: <46558296.2090308@voltaire.com>
References: <46558296.2090308@voltaire.com>
Message-ID: <4655B0EB.5030407@mellanox.co.il>

Erez Zilber wrote:
> The links on http://www.openfabrics.org/git/ don't work. For example,
> the link to ofed_1_2 tree leads to:
>
>
> http://git/?p=~vlad/ofed_1_2/.git;a=summary
>
>
> It seems that "www.openfabrics.org/" is missing in all links.
>
>
>   
I have the same issue.

Jeff - are you the owner of this too?

Tziporet


From jsquyres at cisco.com  Thu May 24 08:43:59 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 24 May 2007 11:43:59 -0400
Subject: [ofa-general] Re: [ewg] Something is wrong in gitweb
In-Reply-To: <4655B0EB.5030407@mellanox.co.il>
References: <46558296.2090308@voltaire.com> <4655B0EB.5030407@mellanox.co.il>
Message-ID: <310F0799-71F3-4E26-A965-4F8E4B6BF496@cisco.com>

Jeff Becker is the OFA system administrator.


On May 24, 2007, at 11:36 AM, Tziporet Koren wrote:

> Erez Zilber wrote:
>> The links on http://www.openfabrics.org/git/ don't work. For example,
>> the link to ofed_1_2 tree leads to:
>>
>>
>> http://git/?p=~vlad/ofed_1_2/.git;a=summary
>>
>>
>> It seems that "www.openfabrics.org/" is missing in all links.
>>
>>
>>
> I have the same issue
>
> Jeff - are you the owner of this too?
>
> Tziporet


-- 
Jeff Squyres
Cisco Systems


From mst at dev.mellanox.co.il  Thu May 24 08:50:08 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 18:50:08 +0300
Subject: [ofa-general] Re: [PATCH] IB/ipoib: drain cq in dev_stop
In-Reply-To: <20070524153246.GD7940@mellanox.co.il>
References: <20070524153246.GD7940@mellanox.co.il>
Message-ID: <20070524155008.GB23535@mellanox.co.il>

> I'm still uncomfortable with the fact that ipoib_ib_dev_stop could cause
> packets to be passed up without poll being called. Is this OK?
> 
> It is possible we never saw problems in practice because the race window is
> small, but it seems that we should pass a flag to handle_rx_wc routines to have
> it drop all packets.  Roland, what do you think?

Maybe the following is needed on top of this patch?
Roland, what do you think?

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 8404f05..92a2655 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -165,7 +165,7 @@ static int ipoib_ib_post_receives(struct net_device *dev)
 	return 0;
 }
 
-static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
@@ -184,7 +184,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	skb  = priv->rx_ring[wr_id].skb;
 	addr = priv->rx_ring[wr_id].mapping;
 
-	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+	if (unlikely(wc->status != IB_WC_SUCCESS || flush)) {
 		if (wc->status != IB_WC_WR_FLUSH_ERR)
 			ipoib_warn(priv, "failed recv event "
 				   "(status=%d, wrid=%d vend_err %x)\n",
@@ -302,11 +302,11 @@ int ipoib_poll(struct net_device *dev, int *budget)
 			if (wc->wr_id & IPOIB_CM_OP_SRQ) {
 				++done;
 				--max;
-				ipoib_cm_handle_rx_wc(dev, wc);
+				ipoib_cm_handle_rx_wc(dev, wc, 0);
 			} else if (wc->wr_id & IPOIB_OP_RECV) {
 				++done;
 				--max;
-				ipoib_ib_handle_rx_wc(dev, wc);
+				ipoib_ib_handle_rx_wc(dev, wc, 0);
 			} else
 				ipoib_ib_handle_tx_wc(dev, wc);
 		}
@@ -558,9 +558,9 @@ void ipoib_drain_cq(struct net_device *dev)
 		n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
 		for (i = 0; i < n; ++i) {
 			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
-				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
+				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i, 1);
 			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
-				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
+				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i, 1);
 			else
 				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
 		}
 

-- 
MST
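
To make the effect of the new flush argument easier to see, here is a small
user-space model of the branch condition in ipoib_ib_handle_rx_wc() as the
hunk above reads. This is only a sketch under stated assumptions: the
model_* names and enum values are invented for illustration and are not the
kernel's, and nothing here touches real skbs, rings or DMA mappings.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for completion statuses; NOT the kernel's values. */
enum model_wc_status {
	MODEL_WC_SUCCESS,
	MODEL_WC_WR_FLUSH_ERR,
	MODEL_WC_OTHER_ERR
};

struct model_rx_decision {
	bool drop;	/* packet is not passed up the stack */
	bool warn;	/* the "failed recv event" warning would be printed */
};

/* Mirrors the patched test: (status != SUCCESS || flush), with the inner
 * warning skipped only for the flush-error status. */
static struct model_rx_decision model_rx_decide(enum model_wc_status status,
						bool flush)
{
	struct model_rx_decision d;

	d.drop = (status != MODEL_WC_SUCCESS) || flush;
	d.warn = d.drop && (status != MODEL_WC_WR_FLUSH_ERR);
	return d;
}

/* Walks through the four interesting combinations. */
static void model_rx_selfcheck(void)
{
	struct model_rx_decision d;

	d = model_rx_decide(MODEL_WC_SUCCESS, false);	/* normal poll path */
	assert(!d.drop && !d.warn);

	d = model_rx_decide(MODEL_WC_SUCCESS, true);	/* drain path */
	assert(d.drop && d.warn);

	d = model_rx_decide(MODEL_WC_WR_FLUSH_ERR, true);
	assert(d.drop && !d.warn);

	d = model_rx_decide(MODEL_WC_OTHER_ERR, false);
	assert(d.drop && d.warn);
}
```

One thing the model makes visible: with flush set, a completion whose status
is success also takes the warning branch, because the inner test only
exempts the flush-error status. Whether that is intended is a question for
the patch author.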


From glebn at voltaire.com  Thu May 24 09:09:06 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Thu, 24 May 2007 19:09:06 +0300
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
References: <20070524141928.GI20691@minantech.com>
	<309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
Message-ID: <20070524160905.GA29313@minantech.com>

On Thu, May 24, 2007 at 08:38:00PM +0530, Devesh Sharma wrote:
> On 5/24/07, Gleb Natapov <glebn at voltaire.com> wrote:
> >Hi,
> >
> > Does local RDMA write completion guaranties that a data that was RDMAed is
> >already accessible in a destination's host _memory_?
> Local RDMA write completion guarantees that the data you have RDMAed
> has been copied into the remote buffer, without any data corruption.
Is this required by the IB spec? How can the HCA guarantee that the data is
actually committed to memory and is not still travelling through a twisty
maze of PCI buffers, all alike?

--
			Gleb.


From krause at cup.hp.com  Thu May 24 09:15:33 2007
From: krause at cup.hp.com (Michael Krause)
Date: Thu, 24 May 2007 09:15:33 -0700
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301696F8A@G3W0634.americas.
	hpqcorp.net>
References: <20070524141928.GI20691@minantech.com>
	<309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
	<349DCDA352EACF42A0C49FA6DCEA840301696F8A@G3W0634.americas.hpqcorp.net>
Message-ID: <6.2.0.14.2.20070524091327.066ad300@esmail.cup.hp.com>

At 08:13 AM 5/24/2007, Tang, Changqing wrote:

>But I learned a while back that local RDMA completion only means that
>the data has been received by the remote HCA and an ACK has been returned;
>the remote HCA may or may not have delivered the data to host memory.
>
>Is this still true?

Yes.  Unless an RDMA Read is done to flush all prior operations to host
memory, acknowledgement of an RDMA Write via the IB protocol only indicates
that the packet arrived at the CA with a valid CRC.  There is no guarantee
that anything has reached host memory, or that data corruption has been
prevented: a CRC only guarantees that the packet traversed the fabric
without a CRC-detectable error occurring.
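
The distinction above can be pictured with a toy model. This is a sketch
only, written for illustration: it does not use the verbs API, every toy_*
name is hypothetical, and the "flush on read" behaviour is simply the
behaviour described in the paragraph above, not something the model proves.
A write is acknowledged as soon as the adapter-side buffer has it; only a
later read on the same connection pushes it into the array standing in for
host memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MEM_SIZE 64

/* ca_buf: data the adapter has accepted and acknowledged.
 * host_mem: memory the host CPU can actually observe. */
struct toy_responder {
	unsigned char ca_buf[TOY_MEM_SIZE];
	bool ca_pending[TOY_MEM_SIZE];
	unsigned char host_mem[TOY_MEM_SIZE];
};

static void toy_init(struct toy_responder *r)
{
	assert(r != NULL);
	memset(r, 0, sizeof(*r));
}

static bool toy_range_ok(size_t off, size_t len)
{
	return off <= TOY_MEM_SIZE && len <= TOY_MEM_SIZE - off;
}

/* "RDMA Write": returns true (the ACK) once the adapter holds the data.
 * Host memory is deliberately left untouched. */
static bool toy_rdma_write(struct toy_responder *r, size_t off,
			   const unsigned char *src, size_t len)
{
	size_t i;

	if (!toy_range_ok(off, len))
		return false;
	for (i = 0; i < len; i++) {
		r->ca_buf[off + i] = src[i];
		r->ca_pending[off + i] = true;
	}
	return true;
}

/* Move everything the adapter is holding into host memory. */
static void toy_flush(struct toy_responder *r)
{
	size_t i;

	for (i = 0; i < TOY_MEM_SIZE; i++) {
		if (r->ca_pending[i]) {
			r->host_mem[i] = r->ca_buf[i];
			r->ca_pending[i] = false;
		}
	}
}

/* "RDMA Read": earlier writes are flushed before the read is served. */
static bool toy_rdma_read(struct toy_responder *r, size_t off,
			  unsigned char *dst, size_t len)
{
	if (!toy_range_ok(off, len))
		return false;
	toy_flush(r);
	memcpy(dst, r->host_mem + off, len);
	return true;
}
```

In this model, a caller that stops after toy_rdma_write() returns true still
sees zeros in host_mem; only after a toy_rdma_read() on the same responder
has completed is the written data visible there.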

Mike


>--CQ
>
>
>
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> > Devesh Sharma
> > Sent: Thursday, May 24, 2007 10:08 AM
> > To: Gleb Natapov
> > Cc: general at lists.openfabrics.org
> > Subject: Re: [ofa-general] RDMA write completion question
> >
> > On 5/24/07, Gleb Natapov <glebn at voltaire.com> wrote:
> > > Hi,
> > >
> > >  Does local RDMA write completion guaranties that a data that was
> > > RDMAed is already accessible in a destination's host _memory_?
> > Local RDMA write completion guarantees that the data you have
> > RDMAed has been copied into the remote buffer, without any
> > data corruption.
> > >
> > > --
> > >                        Gleb.


From krause at cup.hp.com  Thu May 24 09:19:19 2007
From: krause at cup.hp.com (Michael Krause)
Date: Thu, 24 May 2007 09:19:19 -0700
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <20070524160905.GA29313@minantech.com>
References: <20070524141928.GI20691@minantech.com>
	<309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
	<20070524160905.GA29313@minantech.com>
Message-ID: <6.2.0.14.2.20070524091541.066ad070@esmail.cup.hp.com>

At 09:09 AM 5/24/2007, Gleb Natapov wrote:
>On Thu, May 24, 2007 at 08:38:00PM +0530, Devesh Sharma wrote:
> > On 5/24/07, Gleb Natapov <glebn at voltaire.com> wrote:
> > >Hi,
> > >
> > > Does local RDMA write completion guaranties that a data that was RDMAed is
> > >already accessible in a destination's host _memory_?
> > Local RDMA write completion guarantees that the data you have RDMAed
> > has been copied into the remote buffer, without any data corruption.
>Is this required by the IB spec? How can the HCA guarantee that the data is
>actually committed to memory and is not still travelling through a twisty
>maze of PCI buffers, all alike?

There are no guarantees.  In fact, data corruption can occur within the CA
as well as in the PCI fabric, etc.  IHVs and chipsets do take steps to
prevent corruption or, at a minimum, to detect when it occurs.  However,
detection and prevention are not always possible in every design, and there
is a cost to be paid for either or both.  Technology such as IB does a
reasonable job at the fabric level but has no impact on the rest of the
end-to-end data integrity requirements.

Mike


>--
>                         Gleb.


From becker at nas.nasa.gov  Thu May 24 09:33:14 2007
From: becker at nas.nasa.gov (Jeff Becker)
Date: Thu, 24 May 2007 09:33:14 -0700
Subject: [ofa-general] Re: [ewg] Something is wrong in gitweb
In-Reply-To: <4655B0EB.5030407@mellanox.co.il>
References: <46558296.2090308@voltaire.com> <4655B0EB.5030407@mellanox.co.il>
Message-ID: <795c49870705240933w276e41fbu941e822047ab5e25@mail.gmail.com>

Hi Tziporet. I just tried getting to the git tree from my web browser
and this seems to work, including the link you tried below. Does it
work for you now? Thanks.

-jeff

On 5/24/07, Tziporet Koren <tziporet at dev.mellanox.co.il> wrote:
> Erez Zilber wrote:
> > The links on http://www.openfabrics.org/git/ don't work. For example,
> > the link to ofed_1_2 tree leads to:
> >
> >
> > http://git/?p=~vlad/ofed_1_2/.git;a=summary
> >
> >
> > It seems that "www.openfabrics.org/" is missing in all links.
> >
> >
> >
> I have the same issue
>
> Jeff - are you the owner of this too?
>
> Tziporet


From mshefty at ichips.intel.com  Thu May 24 09:34:23 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 24 May 2007 09:34:23 -0700
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>	
	<464B5C07.8040601@ichips.intel.com>	
	<309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com>	
	<1179398534.23882.67542.camel@hal.voltaire.com>	
	<309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com>	
	<1179483657.23882.158398.camel@hal.voltaire.com>	
	<309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com>	
	<1179769930.15940.9823.camel@hal.voltaire.com>	
	<309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com>	
	<1179930909.16831.100286.camel@hal.voltaire.com>
	<309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com>
Message-ID: <4655BE8F.2080102@ichips.intel.com>

> Yes, it will, and hence reduce the initial SA traffic generated on a
> big cluster. Just imagine: the cluster is quite big and every node is
> trying to build its cache initially. That will create a large burst of SA
> packets.

In general I agree with the notion of enhancing the cache to allow it to
load locally from a file.  But I'd really like to get a framework merged
upstream first before trying to add in these sorts of enhancements.

Initial loading of the caches on a large fabric can be limited to a single
GetTable PR query per node, and by enabling the caches across the fabric
at different times, a single burst to the SA can be avoided.

> Connectivity will be incomplete only until the first PR is requested for
> that destination, because on a cache miss the application is going to
> call ib_sa_get_path_rec() anyway, and the resolved PR will be added to
> the cache for future reference.

ib_sa_get_path_rec() only returns a single path.  If multiple paths 
exist, adding a single path to the cache may cause all applications to 
make use of that single path.  Updating the cache on demand isn't as 
simple as it seems.
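
One way to picture the problem is a cache entry that records whether it
holds the full set of paths or only what happened to be resolved on demand.
This is a hypothetical sketch, not a design from this thread or from the
cache code: all toy_* names are invented, and a path is reduced to a bare
identifier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PATHS 4

struct toy_path {
	unsigned int id;	/* stand-in for a real path record */
};

struct toy_entry {
	struct toy_path paths[TOY_MAX_PATHS];
	size_t npaths;
	bool complete;		/* set only by a full-table load */
};

enum toy_lookup {
	TOY_MISS,	/* nothing cached for this destination */
	TOY_PARTIAL,	/* some paths cached, others may exist */
	TOY_HIT		/* full path set cached */
};

/* Add the single path returned by an on-demand query.  The entry stays
 * marked incomplete, so callers can tell it is not the whole story. */
static bool toy_add_on_demand(struct toy_entry *e, struct toy_path p)
{
	assert(e != NULL);
	if (e->npaths == TOY_MAX_PATHS)
		return false;
	e->paths[e->npaths++] = p;
	return true;
}

/* Replace the entry with the result of a full-table query. */
static bool toy_load_table(struct toy_entry *e, const struct toy_path *tbl,
			   size_t n)
{
	size_t i;

	assert(e != NULL && tbl != NULL);
	if (n > TOY_MAX_PATHS)
		return false;
	for (i = 0; i < n; i++)
		e->paths[i] = tbl[i];
	e->npaths = n;
	e->complete = true;
	return true;
}

static enum toy_lookup toy_lookup_paths(const struct toy_entry *e)
{
	assert(e != NULL);
	if (e->npaths == 0)
		return TOY_MISS;
	return e->complete ? TOY_HIT : TOY_PARTIAL;
}
```

Without something like the complete flag, an entry populated by one
on-demand query looks identical to a fully loaded one, and every
application that hits it ends up on the same single path.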

- Sean


From rdreier at cisco.com  Thu May 24 10:38:08 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 10:38:08 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <4654AE33.20506@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Wed, 23 May 2007 14:12:19 -0700")
References: <46537081.30906@linux.vnet.ibm.com>
	<4654690F.1040305@linux.vnet.ibm.com>
	<4654AE33.20506@linux.vnet.ibm.com>
Message-ID: <adatzu26s1r.fsf@cisco.com>

 > Is it too late to get this into 2.6.22? If so, I will try for 2.6.23
 > -please let me know.

Yes, it is too late.  This is not a fix, and I think by the time we
have all the issues ironed out it will be even later.


From rdreier at cisco.com  Thu May 24 10:40:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 10:40:00 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <46537081.30906@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Tue, 22 May 2007 15:36:49 -0700")
References: <46537081.30906@linux.vnet.ibm.com>
Message-ID: <adaps4q6ryn.fsf@cisco.com>

 > By default, cap the NOSRQ memory usage to 1GB. The default recvq_size
 > is set to 128. Therefore for 64KB packets this would imply a maximum of
 > 128 endpoints.

1 GB is a pretty eye-popping amount to tie up in receive buffers.
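
For reference, the arithmetic behind the quoted figure can be written out as
follows. This is a sketch only: nosrq_max_endpoints() is a hypothetical
helper, not a function from the patch, and it assumes "1GB" and "64KB" mean
binary units (2^30 and 2^16 bytes), which is what makes the result come out
to exactly 128.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many endpoints fit under a receive-buffer
 * memory cap when each endpoint posts recvq_size buffers of buf_bytes
 * each.  Overflow of the product is not handled in this sketch. */
static uint64_t nosrq_max_endpoints(uint64_t cap_bytes, uint64_t recvq_size,
				    uint64_t buf_bytes)
{
	uint64_t per_endpoint = recvq_size * buf_bytes;

	assert(per_endpoint > 0);
	return cap_bytes / per_endpoint;
}
```

With a cap of 1 << 30 bytes, recvq_size 128 and 64 * 1024 byte buffers, each
endpoint pins 8 MB and the helper returns 128. As a purely illustrative
comparison, 2048-byte buffers under the same cap would give 4096.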

 > -The NOSRQ limit of 1GB is also made a module parameter.

There are too many module parameters already I think...

 > -Currently we allocate a default of 64KB for the ring buffer elements,
 > and this buffer size is not linked to the mtu. In the future, we could
 > allocate buffers based on the mtu and link that into the computation of
 > the memory cap. This way customers who might want to use a smaller mtu
 > could use a larger number of endpoints, or a larger recvq_size without
 > exceeding the memory cap.

It would be nice, but to handle increasing the MTU, you need some way
to handle the receives you already posted (which would be too small
all of a sudden).

 - R.


From rdreier at cisco.com  Thu May 24 10:40:39 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 10:40:39 -0700
Subject: [ofa-general] Re: [PATCH] IB/ehca: Refactor "maybe missed event"
	code
In-Reply-To: <200705241651.05860.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Thu, 24 May 2007 16:51:05 +0200")
References: <200705241651.05860.fenkes@de.ibm.com>
Message-ID: <adalkfe6rxk.fsf@cisco.com>

This isn't fixing anything is it?  I think it's 2.6.23 material;
correct me if I'm wrong.

 - R.


From xma at us.ibm.com  Thu May 24 10:56:32 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 24 May 2007 10:56:32 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <20070524055108.GG6019@mellanox.co.il>
Message-ID: <OF36F79D3E.0DB32C00-ON872572E5.0061CFC3-882572E5.0067F40E@us.ibm.com>


Hello Michael,

> I've just answered in another thread. Summary:
> I think that to enable connected mode on ehca, what we need is
>
> 1. A way to make IPoIB fall back on datagram mode when you run out of
>    resources.  This might need to be addressed at the protocol level.

My point of view is this: if we run out of resources, there is nothing we
can do. There won't be any new connections, just as in native RC mode. I
think this is a generic RC issue; without SRQ we will just hit it sooner. In
reality, our PPC cluster won't hit this because of the cluster size and
memory configuration in this cluster.

Anyway, no matter what, this is an issue we should address in the future.
Can we delay the work of finding a solution for RC, with or without SRQ,
until later? We do want IPoIB-CM mode for the performance gain and for
interoperability between our xCluster (Mellanox) and pCluster (eHCA) in the
coming OFED-1.3. So let's keep it as it is, without any parameter tuning.
Does this make sense?

> 2. A way to separate the noSRQ hacks more cleanly. This is not just me
>    being a micro-optimization freak. I suggested some ways to do this
>    better.
>
>
> --
> MST

Yes, that needs to be fixed.

Thanks
Shirley Ma
IBM Linux Technology Center

From Koen.SEGERS at VRT.BE  Thu May 24 11:03:22 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Thu, 24 May 2007 20:03:22 +0200
Subject: [ofa-general] GPFS node loses IB-connection
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E1830@xmb-sjc-216.amer.cisco.com>
	<D63C0BE2D613C543B6F3305502E9784C03157D6A@OCBEXS01001.rto.be>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3038E1834@xmb-sjc-216.amer.cisco.com>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C030AA269@OCBEXS01001.rto.be>

After changing the switch timeout value, we never got the error again. Yesterday we started a 24-hour stress test, and it was successful. I think we can conclude that the problem is fixed now.
 
But there is a strange message in the logs of the switch:

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx

Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

 
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy

Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

 
Here xx and yy are GIDs such as ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9; they change to different GIDs in each subsequent group of log entries, each belonging to the IB ports of the server HCAs.

These entries appear every few minutes (not at a regular interval). Is there a Cisco manual available anywhere that describes or explains these messages? Or can anyone explain what is happening, and whether this can harm a setup that doesn't use multicast?

Greetz

Koen


________________________________

From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
Sent: Wed 23/05/2007 17:40
To: SEGERS Koen; Hal Rosenstock
CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection


Try 20 seconds; I'm curious whether you are barely crossing the 10-second
threshold.

Scott

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> Sent: Wednesday, May 23, 2007 8:39 AM
> To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> Cc: Clive Hall (clivhall);
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> What value would you recommend then?
>
> Koen
>
> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> Sent: Wednesday, 23 May 2007 17:38
> To: SEGERS Koen; Hal Rosenstock
> CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> The boot time of the host doesn't matter for this timeout.  While the
> host is booting, the IB link is down anyway.
>
> Scott
>
> > -----Original Message-----
> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > Sent: Wednesday, May 23, 2007 8:20 AM
> > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > Cc: Clive Hall (clivhall);
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > After a whole day of stress testing with the MAD renicing turned on, we
> > got the error once. So I think I should raise the timeout on the switch
> > also.
> >
> > It takes about 2 minutes to boot the system. Do you agree that this is a
> > good value for the timeout?
> >
> > Scott,
> > Can you explain the memlock problem to me?
> >
> > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > install this, the bug is not related to us. This is correct, isn't it?
> >
> > Greetz
> >
> > Koen
> >
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Wednesday, 23 May 2007 16:12
> > To: Scott Weitzenkamp (sweitzen)
> > CC: SEGERS Koen; Clive Hall (clivhall);
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > > openib.conf,
> >
> > Does the host really not respond to MAD requests for over 10 seconds in
> > some cases?
> >
> > -- Hal
> >
> > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > SLES10 for bug 267, etc.).
> > >
> > > Scott Weitzenkamp
> > > SQA and Release Manager
> > > Server Virtualization Business Unit
> > > Cisco Systems
> > > 
> > >
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > general at lists.openfabrics.org;
> > general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > This far, all tests seem to work.
> > > >
> > > > Thanks for the help!
> > > >
> > > > Scott,
> > > > Are there more bugfixes that cisco does in its rpms?
> > > >
> > > > Greetz
> > > >
> > > > Koen
> > > >
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > Sent: Wednesday, 23 May 2007 0:39
> > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > 10 seconds (Clive, correct me if I'm wrong).
> > > >
> > > > You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
> > > > binary RPMs we release at
> > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
> > > > the host be more responsive.
> > > >
> > > >
> > > > Scott Weitzenkamp
> > > > SQA and Release Manager
> > > > Server Virtualization Business Unit
> > > > Cisco Systems
> > > > 
> > > >
> > > > > -----Original Message-----
> > > > > From: Koen Segers [mailto:koen.segers at VRT.BE]
> > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > > general at lists.openfabrics.org;
> > general-bounces at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > > interfaces every 10s. This means that when the interface is handling
> > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > service. My question is then: "How can the timeout be reached while
> > > > > packets are being sent/received to/from the interface?"
> > > > >
> > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > >
> > > > > To recapitulate, these are the actions I'll take tomorrow:
> > > > > 1) change the MAD niceness of the servers
> > > > > 2) change the timeout on the switches
> > > > >
> > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > PORT_ACTIVE state?
> > > > >
> > > > > Regards,
> > > > >
> > > > > Koen
> > > > >
> > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > >
> > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > > >
> > > > > > The default is 10 seconds, it can be configured up to 2000 seconds.
> > > > > > If a HCA is completely unresponsive for longer than the node-timeout
> > > > > > value, then we consider that HCA out of service.
> > > > > > 
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems
> > > > > > 
> > > > > >
> > > > > >        
> > > > > >        
> > > > > ______________________________________________________________
> > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com]
> > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > >         To: koen.segers at VRT.BE
> > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > >         general-bounces at lists.openfabrics.org; Scott
> > Weitzenkamp
> > > > > >         (sweitzen)
> > > > > >         Subject: RE: [ofa-general] GPFS node loses
> > IB-connection
> > > > > >        
> > > > > >        
> > > > > >        
> > > > > >         Koen,
> > > > > >        
> > > > > >         So it is most likely you hit the same bug as
> > 229 (Scott
> > > > > >         pointed out earlier). The same workaround might
> > > > work for you
> > > > > >         by renicing ib_mad as Scott suggested.
> > > > > >        
> > > > > >         I think this should be a SM query timeout
> > tunable value
> > in
> > > > > >         Cisco SM. Am I right, Scott?
> > > > > >        
> > > > > >         Thanks
> > > > > >         Shirley Ma
> > > > > >        
> > > > > >        
> > > > > >         Inactive hide details for Koen Segers
> > > > > <koen.segers at VRT.BE>Koen
> > > > > >         Segers <koen.segers at VRT.BE>
> > > > > >        
> > > > > >        
> > > > > >                                         Koen Segers
> > > > > <koen.segers at VRT.BE>
> > > > > >                                        
> > > > > >                                         05/22/07 11:14 AM
> > > > > >                                         Please respond to
> > > > > >                                         koen.segers at VRT.BE
> > > > > >                                        
> > > > > >        
> > > > > >                      To
> > > > > >        
> > > > > >         Shirley
> > > > > >         Ma/Beaverton/IBM at IBMUS
> > > > > >        
> > > > > >                      cc
> > > > > >        
> > > > > >         Ami Perlmutter
> > > > > >         <amip at dev.mellanox.co.il>,
> > > > > general at lists.openfabrics.org,
> > general-bounces at lists.openfabrics.org
> > > > > >        
> > > > > >                 Subject
> > > > > >        
> > > > > >         RE:
> > > > > >         [ofa-general]
> > > > > >         GPFS node loses
> > > > > >         IB-connection
> > > > > >        
> > > > > >        
> > > > > >        
> > > > > >         Hi,
> > > > > >        
> > > > > >         It is the Cisco SM.
> > > > > >        
> > > > > >         SFS-7000P> show version
> > > > > >        
> > > > > >        
> > > > > >        
> > > > > ==============================================================
> > > > > ==================
> > > > > >                                   System Version Information
> > > > > >        
> > > > > ==============================================================
> > > > > ==================
> > > > > >                   system-version : SFS-7000P TopspinOS
> > > > 2.9.0 releng
> > > > > >         #147
> > > > > >         10/25/2006 02:01:32
> > > > > >                          contact : tac at cisco.com
> > > > > >                             name : SFS-7000P
> > > > > >                         location : 170 West Tasman Drive,
> > > > > San Jose, CA
> > > > > >         95134
> > > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > > >                      last-change : none
> > > > > >                 last-config-save : none
> > > > > >                           action : none
> > > > > >                           result : none
> > > > > >                        oper-mode : normal
> > > > > >        
> > > > > >         There is also a command that gives the SM version,
> > > > > but I can't
> > > > > >         find it
> > > > > >         right now.
> > > > > >        
> > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > >         > Hello Koen,
> > > > > >         >
> > > > > >         > From the switch log, it looks a SM issue to me.
> > > > > The node was
> > > > > >         kicked
> > > > > >         > out of the membership. Which SM you are
> > using in your
> > > > > >         fabric?
> > > > > >         >
> > > > > >         > Thanks
> > > > > >         > Shirley Ma
> > > > > >         >
> > > > > >         *** Disclaimer ***
> > > > > >        
> > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > >        
> > > > > >         nv van publiek recht
> > > > > >         BTW BE 0244.142.664
> > > > > >         RPR Brussel
> > > > > >         http://www.vrt.be/disclaimer
> > > > > >        
> > > > > >        
> > > > > >        
> > > > > >        
> > > > > >        
> > > _______________________________________________
> > > general mailing list
> > > general at lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > >
> > > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
> >



From rdreier at cisco.com  Thu May 24 11:03:54 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 11:03:54 -0700
Subject: [ofa-general] Re: [PATCH] IB/mlx4_ib initialize work - resending fix
	description
In-Reply-To: <1180011931.11166.47.camel@mtls03> (Eli Cohen's message of "Thu,
	24 May 2007 16:05:01 +0300")
References: <1180011931.11166.47.camel@mtls03>
Message-ID: <adad50q6qut.fsf@cisco.com>

Thanks, good catch.  I think this will fix some weird bugs I was
seeing (with IPoIB CM in my case).

I'll also push out the same fix for libmlx4.


From rdreier at cisco.com  Thu May 24 11:05:10 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 11:05:10 -0700
Subject: [ofa-general] Re: [PATCH] IB/ehca: fix wrong number of send WRs
	returned
In-Reply-To: <200705241651.09411.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Thu, 24 May 2007 16:51:08 +0200")
References: <200705241651.09411.fenkes@de.ibm.com>
Message-ID: <ada8xbe6qsp.fsf@cisco.com>

thanks, applied.


From rdreier at cisco.com  Thu May 24 11:09:26 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 11:09:26 -0700
Subject: [ofa-general] Re: question on netpoll
In-Reply-To: <20070524151254.GC7940@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 24 May 2007 18:12:54 +0300")
References: <20070524140410.GB7940@mellanox.co.il>
	<20070524151254.GC7940@mellanox.co.il>
Message-ID: <ada4pm26qll.fsf@cisco.com>

 > However: this could call netif_receive_skb - would that be a problem?
 > For example, what if we don't have any quota left?

I never thought of it before.  I don't think the quota is an issue per
se, since the quota accounting is done elsewhere.  The main issue I
think would be that netif_receive_skb() expects to be called from a
certain context (a poll routine only), but looking at the code, that
doesn't appear to be the case.


From xma at us.ibm.com  Thu May 24 11:15:52 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 24 May 2007 11:15:52 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C030AA269@OCBEXS01001.rto.be>
Message-ID: <OF9C4E5DE6.9FC2CFBA-ON872572E5.00642B57-882572E5.0069B94A@us.ibm.com>


Koen,

      Are you using IPv6? If not, then this is harmless. If you don't use
it, you can simply disable loading the IPv6 module on your nodes when
rebooting.
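
To show why the log entries point at IPv6, here is a minimal sketch that splits the example GID from Koen's message into fields. It assumes the IPoIB multicast GID layout as I recall it from RFC 4391 (flags/scope byte, 16-bit signature, 16-bit P_Key, 80-bit group ID, with signature 0x401B for IPv4 and 0x601B for IPv6). The function name parse_ipoib_mgid is made up for illustration and is not part of any tool.

```python
def parse_ipoib_mgid(gid: str) -> dict:
    """Split a colon-separated 16-byte IPoIB multicast GID into its fields.

    Layout assumed here (per RFC 4391 as recalled, not taken from the thread):
    byte 0 = 0xff, byte 1 = flags/scope, bytes 2-3 = signature,
    bytes 4-5 = P_Key, bytes 6-15 = group ID.
    """
    raw = bytes(int(part, 16) for part in gid.split(":"))
    if len(raw) != 16 or raw[0] != 0xFF:
        raise ValueError("not a 16-byte multicast GID")
    signature = int.from_bytes(raw[2:4], "big")
    protocol = {0x401B: "IPv4", 0x601B: "IPv6"}.get(signature, "unknown")
    return {
        "flags": raw[1] >> 4,
        "scope": raw[1] & 0x0F,
        "protocol": protocol,
        "pkey": int.from_bytes(raw[4:6], "big"),
        "group_id": raw[6:].hex(),
    }


info = parse_ipoib_mgid("ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9")
print(info["protocol"], hex(info["pkey"]), info["group_id"])
```

Under that assumed layout the signature decodes as IPv6, and the tail of the group ID (01:ff:05:87:d9) looks like an IPv6 solicited-node multicast address, which would fit the groups being joined and left by IPv6 neighbor discovery rather than by an application.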

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638


"SEGERS Koen" <Koen.SEGERS at VRT.BE>
Sent by: general-bounces at lists.openfabrics.org
05/24/07 11:03 AM

To: "Scott Weitzenkamp (sweitzen)" <sweitzen at cisco.com>, "Hal Rosenstock" <halr at voltaire.com>
cc: general-bounces at lists.openfabrics.org, general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

After changing the switch timeout value, we never got the error again.
Yesterday, we started a 24h stress test. This test was successful. I think
we can conclude that the problem is fixed now.

But there is a strange message in the logs of the switch:


Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap
for GID=xx


Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap
for GID=xx


Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap
for GID=xx


Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap
for GID=xx


Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast
membership change


Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast
membership change


Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap
for GID=yy


Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap
for GID=yy


Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap
for GID=yy


Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap
for GID=yy


Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast
membership change


Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast
membership change


With xx,yy = (e.g.) ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9, but
changing to different GIDs in the next group of log entries, each belonging
to the IB ports of the servers' HCAs.


This logging occurs every few minutes (not at a regular interval). Is there
somewhere a Cisco manual available that describes or explains these
messages? Or can anyone explain what is happening? And whether this can
harm a setup that doesn't use multicast?


Greetz


Koen


From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
Sent: Wed 23/05/2007 17:40
To: SEGERS Koen; Hal Rosenstock
CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection


Try 20 seconds, I'm curious if you are barely crossing the 10-second
threshold.

Scott

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> Sent: Wednesday, May 23, 2007 8:39 AM
> To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> Cc: Clive Hall (clivhall);
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> What value would you recommend then?
>
> Koen
>
> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> Sent: Wednesday, May 23, 2007 17:38
> To: SEGERS Koen; Hal Rosenstock
> CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> The boot time of the host doesn't matter for this timeout.  While the
> host is booting, the IB link is down anyway.
>
> Scott
>
> > -----Original Message-----
> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > Sent: Wednesday, May 23, 2007 8:20 AM
> > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > Cc: Clive Hall (clivhall);
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > After a whole day of stresstesting with the MAD renicing
> turned on, we
> > got the error once. So I think I should raise the timeout on
> > the switch
> > also.
> >
> > It takes about 2 minutes to boot the system. Do you agree
> > that this is a
> > good value for the timeout?
> >
> > Scott,
> > Can you explain me the problem of the memlock?
> >
> > I saw that the SLES10 bug is only an issue in MVAPICH.
> Since we didn't
> > install this, the bug is not related to us. This is
> correct, isn't it?
> >
> > Greetz
> >
> > Koen
> >
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Wednesday, May 23, 2007 16:12
> > To: Scott Weitzenkamp (sweitzen)
> > CC: SEGERS Koen; Clive Hall (clivhall);
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > No C code changes, just a few config file changes
> (RENICE_IB_MAD=yes
> > in
> > > openib.conf,
> >
> > Does the host really not respond to MAD requests for over 10
> > seconds in
> > some cases ?
> >
> > -- Hal
> >
> > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > SLES10 for bug 267, etc.).
> > >
> > > Scott Weitzenkamp
> > > SQA and Release Manager
> > > Server Virtualization Business Unit
> > > Cisco Systems
> > >
> > >
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > general at lists.openfabrics.org;
> > general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > This far, all tests seem to work.
> > > >
> > > > Thanks for the help!
> > > >
> > > > Scott,
> > > > Are there more bugfixes that cisco does in its rpms?
> > > >
> > > > Greetz
> > > >
> > > > Koen
> > > >
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > Sent: Wednesday, May 23, 2007 0:39
> > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall
> > (clivhall)
> > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > It's not so much pinging every 10 seconds as expecting a
> > > > response within
> > > > 10 seconds (Clive, correct me if I'm wrong).
> > > >
> > > > You only need to do 1) or 2), not both.  Cisco configures 1)
> > > > in the OFED
> > > > binary RPMs we release at
> > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I
> > > > prefer to have
> > > > the host be more responsive.
> > > >
> > > >
> > > > Scott Weitzenkamp
> > > > SQA and Release Manager
> > > > Server Virtualization Business Unit
> > > > Cisco Systems
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Koen Segers [mailto:koen.segers at VRT.BE]
> > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > > general at lists.openfabrics.org;
> > general-bounces at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > > If I understand it right, the switch is actually polling
> > > > > (=pinging) the
> > > > > interfaces every 10s. This means that when the interface is
> > handling
> > > > > other traffic, the poll can fail and the port could be
> > > > > considered out of
> > > > > service. My question is then: "How can the timeout be reached
> > while
> > > > > packets are being sent/received to/from the interface?"
> > > > >
> > > > > Anyway, what timeout-value would you recommend for
> us? And why?
> > > > >
> > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > 1) change the MAD niceness of the servers
> > > > > 2) change the timeout on the switches
> > > > >
> > > > > Are these changes sufficient for the HCA's to keep
> > their ports in
> > > > > PORT_ACTIVE state?
> > > > >
> > > > > Regards,
> > > > >
> > > > > Koen
> > > > >
> > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp
> > > > (sweitzen) wrote:
> > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > >
> > > > > > SFS-7000D(config)# ib sm subnet-prefix
> fe:80:00:00:00:00:00:00
> > > > > > node-timeout <value>
> > > > > >
> > > > > > The default is 10 seconds, it can be configured up to
> > > > 2000 seconds.
> > > > > > If a HCA is completely unresponsive for longer than the
> > > > node-timeout
> > > > > > value, then we consider that HCA out of service.
> > > > > >
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > ______________________________________________________________
> > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com]
> > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > >         To: koen.segers at VRT.BE
> > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > >         general-bounces at lists.openfabrics.org; Scott
> > Weitzenkamp
> > > > > >         (sweitzen)
> > > > > >         Subject: RE: [ofa-general] GPFS node loses
> > IB-connection
> > > > > >
> > > > > >
> > > > > >
> > > > > >         Koen,
> > > > > >
> > > > > >         So it is most likely you hit the same bug as
> > 229 (Scott
> > > > > >         pointed out earlier). The same workaround might
> > > > work for you
> > > > > >         by renicing ib_mad as Scott suggested.
> > > > > >
> > > > > >         I think this should be a SM query timeout
> > tunable value
> > in
> > > > > >         Cisco SM. Am I right, Scott?
> > > > > >
> > > > > >         Thanks
> > > > > >         Shirley Ma
> > > > > >
> > > > > >
> > > > > > >         Koen Segers <koen.segers at VRT.BE>
> > > > > > >         05/22/07 11:14 AM
> > > > > > >         Please respond to koen.segers at VRT.BE
> > > > > > >
> > > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > >         cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > > >         general at lists.openfabrics.org,
> > > > > > >         general-bounces at lists.openfabrics.org
> > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > >
> > > > > >
> > > > > >
> > > > > >         Hi,
> > > > > >
> > > > > >         It is the Cisco SM.
> > > > > >
> > > > > >         SFS-7000P> show version
> > > > > >
> > > > > >
> > > > > >
> > > > > ==============================================================
> > > > > ==================
> > > > > >                                   System Version Information
> > > > > >
> > > > > ==============================================================
> > > > > ==================
> > > > > >                   system-version : SFS-7000P TopspinOS
> > > > 2.9.0 releng
> > > > > >         #147
> > > > > >         10/25/2006 02:01:32
> > > > > >                          contact : tac at cisco.com
> > > > > >                             name : SFS-7000P
> > > > > >                         location : 170 West Tasman Drive,
> > > > > San Jose, CA
> > > > > >         95134
> > > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > > >                      last-change : none
> > > > > >                 last-config-save : none
> > > > > >                           action : none
> > > > > >                           result : none
> > > > > >                        oper-mode : normal
> > > > > >
> > > > > >         There is also a command that gives the SM version,
> > > > > but I can't
> > > > > >         find it
> > > > > >         right now.
> > > > > >
> > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > >         > Hello Koen,
> > > > > >         >
> > > > > >         > From the switch log, it looks a SM issue to me.
> > > > > The node was
> > > > > >         kicked
> > > > > >         > out of the membership. Which SM you are
> > using in your
> > > > > >         fabric?
> > > > > >         >
> > > > > >         > Thanks
> > > > > >         > Shirley Ma
> > > > > >         >


From rdreier at cisco.com  Thu May 24 11:38:05 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 11:38:05 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipoib: drain cq in dev_stop
In-Reply-To: <20070524153246.GD7940@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 24 May 2007 18:32:46 +0300")
References: <20070524153246.GD7940@mellanox.co.il>
Message-ID: <adasl9m5ape.fsf@cisco.com>

This looks correct to me... I applied it as two patches though since
the two fixes look pretty independent.

 - R.


From rdreier at cisco.com  Thu May 24 11:41:27 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 11:41:27 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipoib: drain cq in dev_stop
In-Reply-To: <20070524155008.GB23535@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 24 May 2007 18:50:08 +0300")
References: <20070524153246.GD7940@mellanox.co.il>
	<20070524155008.GB23535@mellanox.co.il>
Message-ID: <adar6p65ajs.fsf@cisco.com>

 > +	if (unlikely(wc->status != IB_WC_SUCCESS || flush)) {

Now we have two things to test here, which means we hurt our fast path
for the rare case.

What if we overwrite any status of IB_WC_SUCCESS with IB_WC_WR_FLUSH_ERR
if we're draining a CQ?  I don't see anything obviously wrong with
this (on top of the patches I just applied):

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 8404f05..e24ccb4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -557,6 +557,14 @@ void ipoib_drain_cq(struct net_device *dev)
 	do {
 		n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
 		for (i = 0; i < n; ++i) {
+			/*
+			 * Convert any successful completions to flush
+			 * errors to avoid passing packets up the
+			 * stack after bringing the device down.
+			 */
+			if (priv->ibwc[i].status == IB_WC_SUCCESS)
+				priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR;
+
 			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
 				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
 			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
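
The idea in the hunk above can be modeled outside the kernel. The following is only a toy sketch in Python, not the IPoIB code: WcStatus, handle_rx and drain are hypothetical names, and the numeric enum values are placeholders. It shows why rewriting the status during drain lets the normal handler keep a single status test on its fast path.

```python
from enum import Enum


class WcStatus(Enum):
    # Placeholder values; only the distinction between the two matters here.
    SUCCESS = 0
    WR_FLUSH_ERR = 1


def handle_rx(wc: dict, delivered: list) -> None:
    """Normal completion handler: one status test, no extra 'flush' flag."""
    if wc["status"] is not WcStatus.SUCCESS:
        return  # error or flushed completion: drop, nothing goes up the stack
    delivered.append(wc["wr_id"])


def drain(completions: list, delivered: list) -> None:
    """Drain path: turn successes into flush errors before dispatching."""
    for wc in completions:
        if wc["status"] is WcStatus.SUCCESS:
            wc["status"] = WcStatus.WR_FLUSH_ERR
        handle_rx(wc, delivered)


normal, drained = [], []
handle_rx({"wr_id": 1, "status": WcStatus.SUCCESS}, normal)
drain([{"wr_id": 2, "status": WcStatus.SUCCESS},
       {"wr_id": 3, "status": WcStatus.WR_FLUSH_ERR}], drained)
print(normal, drained)
```

In this model the handler never needs to know whether the device is going down; the rare drain path pays the cost of the extra comparison instead.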


From mst at dev.mellanox.co.il  Thu May 24 11:46:08 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 21:46:08 +0300
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <OF36F79D3E.0DB32C00-ON872572E5.0061CFC3-882572E5.0067F40E@us.ibm.com>
References: <20070524055108.GG6019@mellanox.co.il>
	<OF36F79D3E.0DB32C00-ON872572E5.0061CFC3-882572E5.0067F40E@us.ibm.com>
Message-ID: <20070524184608.GC23535@mellanox.co.il>

> Quoting Shirley Ma <xma at us.ibm.com>:
> Subject: Re: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
> 
> Hello Michael,
> 
> > I've just answered in another thread. Summary:
> > I think that to enable connected mode on ehca, what we need is
> >
> > 1. A way to make IPoIB fall back on datagram mode when you run out of
> >    resources.  This might need to be addressed at the protocol level.
> 
> My point of view this, if we run out of resource, there is nothing we can do.

Yes. So we need to fall back to datagram before running out of resources.

> There won't be any new connections, just like native RC mode.  I think this is
> a generic RC issue, w/o SRQ just will hit this sooner.

I don't think that's true: with SRQ we *never* run out of memory for
RX buffers.

-- 
MST


From mst at dev.mellanox.co.il  Thu May 24 11:47:44 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 21:47:44 +0300
Subject: [ofa-general] Re: [PATCH] IB/ipoib: drain cq in dev_stop
In-Reply-To: <adar6p65ajs.fsf@cisco.com>
References: <20070524153246.GD7940@mellanox.co.il>
	<20070524155008.GB23535@mellanox.co.il> <adar6p65ajs.fsf@cisco.com>
Message-ID: <20070524184744.GD23535@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] IB/ipoib: drain cq in dev_stop
> 
>  > +	if (unlikely(wc->status != IB_WC_SUCCESS || flush)) {
> 
> Now we have two things to test here, which means we hurt our fast path
> for the rare case.
> 
> What if we overwrite any status of IB_WC_SUCCESS with IB_WC_WR_FLUSH_ERR
> if we're draining a CQ?  I don't see anything obviously wrong with
> this (on top of the patches I just applied):
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index 8404f05..e24ccb4 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -557,6 +557,14 @@ void ipoib_drain_cq(struct net_device *dev)
>  	do {
>  		n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
>  		for (i = 0; i < n; ++i) {
> +			/*
> +			 * Convert any successful completions to flush
> +			 * errors to avoid passing packets up the
> +			 * stack after bringing the device down.
> +			 */
> +			if (priv->ibwc[i].status == IB_WC_SUCCESS)
> +				priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR;
> +
>  			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
>  				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
>  			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)

I love this! Go for it.

-- 
MST


From pradeeps at linux.vnet.ibm.com  Thu May 24 12:16:52 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Thu, 24 May 2007 12:16:52 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <adaps4q6ryn.fsf@cisco.com>
References: <46537081.30906@linux.vnet.ibm.com> <adaps4q6ryn.fsf@cisco.com>
Message-ID: <4655E4A4.6020408@linux.vnet.ibm.com>

Roland Dreier wrote:
>  > By default, cap the NOSRQ memory usage to 1GB. The default recvq_size
>  > is set to 128. Therefore for 64KB packets this would imply a maximum of
>  > 128 endpoints.
> 
> 1 GB is a pretty eye-popping amount to tie up in receive buffers.

It is 8MB per endpoint of receive buffers, and so you would use up 1GB
when all 128 endpoints are active at the same time.
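
The arithmetic behind these numbers, as a small sketch. The function name max_endpoints is made up for illustration, and the 4 KiB case at the end is a hypothetical smaller buffer size, not a value proposed in the patch.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def max_endpoints(cap_bytes: int, recvq_size: int, buf_bytes: int):
    """Return (receive-buffer bytes per endpoint, endpoints that fit under the cap)."""
    per_endpoint = recvq_size * buf_bytes
    return per_endpoint, cap_bytes // per_endpoint


# Defaults from the proposal: 1GB cap, recvq_size of 128, 64KB buffers.
per_ep, endpoints = max_endpoints(1 * GIB, 128, 64 * KIB)
print(per_ep // MIB, "MiB per endpoint,", endpoints, "endpoints")

# Hypothetical: buffers sized to a smaller MTU leave room for more endpoints.
small_per_ep, small_endpoints = max_endpoints(1 * GIB, 128, 4 * KIB)
print(small_per_ep // KIB, "KiB per endpoint,", small_endpoints, "endpoints")
```

With the defaults this gives 8 MiB per endpoint and 128 endpoints, matching the figures above.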

This proposal is only for PPC systems when used with IBM HCA. It has no
impact on Topspin HCAs (even when used on PPC systems).

IBM cluster nodes are "fat" systems and have large quantities of memory.
Hence using up 1GB should not be an issue.

> 
>  > -The NOSRQ limit of 1GB is also made a module parameter.
> 
> There are too many module parameters already I think...

If we choose the defaults correctly, most customers should not have to
tune these parameters. This way we provide the flexibility to systems
with large memory (say 64 GB) to raise the limits if need be.


> 
>  > -Currently we allocate a default of 64KB for the ring buffer elements,
>  > and this buffer size is not linked to the mtu. In the future, we could
>  > allocate buffers based on the mtu and link that into the computation of
>  > the memory cap. This way customers who might want to use a smaller mtu
>  > could use a larger number of endpoints, or a larger recvq_size without
>  > exceeding the memory cap.
> 
> It would be nice, but to handle increasing the MTU, you need some way
> to handle the receives you already posted (which would be too small
> all of a sudden).
> 
Can you expand on this a little more -I do not catch the drift.

Pradeep


From xma at us.ibm.com  Thu May 24 12:36:11 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 24 May 2007 12:36:11 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <20070524184608.GC23535@mellanox.co.il>
Message-ID: <OF8653BE43.07ACF7BD-ON872572E5.00699359-882572E5.0071138E@us.ibm.com>


Hello Michael,

> > My point of view this, if we run out of resource, there is nothing we can do.
>
> Yes. So we need to fall back to datagram before running out of resources.

Only high-end PPC systems support eHCA. In a high-end PPC cluster, each node
will be configured with enough memory to handle the connections within the
cluster, so this patch will work OK. We would like to have IPoIB-CM w/o SRQ
available to allow IPoIB-CM mode to be turned on as default soon.

Handling running out of resources would be good, but it is not a
simple effort. Let's have an independent patch handle running out of
resources later, without blocking this patch from going upstream. I hope you
can agree with this.

Thanks
Shirley Ma

From or.gerlitz at gmail.com  Thu May 24 12:51:17 2007
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Thu, 24 May 2007 22:51:17 +0300
Subject: [ofa-general] RE: two interfaces with ipoib
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303950994@xmb-sjc-216.amer.cisco.com>
References: <Pine.GSO.4.40.0705231629060.26771-100000@nu.cse.ohio-state.edu>
	<4654B14B.4000208@opengridcomputing.com>
	<4654C278.6030703@ichips.intel.com> <46554A5A.3050607@voltaire.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303950994@xmb-sjc-216.amer.cisco.com>
Message-ID: <15ddcffd0705241251s4f7ee850qd98b60b33989328a@mail.gmail.com>

On 5/24/07, Scott Weitzenkamp (sweitzen) <sweitzen at cisco.com> wrote:
>
> How is:
>
>   sysctl -w net.ipv4.conf.all.arp_ignore=2
>
> different from:
>
>   for i in /proc/sys/net/ipv4/conf/ib*/arp_filter; do echo 1 > $i; done
>
> I have been using the latter successfully regarding this issue.


Reading Documentation/networking/ip-sysctl.txt, my understanding is that
setting arp_ignore to 1 or 2 gives more or less the same behavior as setting
arp_filter to 1. The arp_ignore param is a somewhat more refined version of
arp_filter.
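
To compare the two approaches on a given host, it can help to look at what is actually set per interface. The following is a minimal sketch, not part of any OFED tooling: the function name read_arp_settings and its parameters are made up for illustration, and it only reads the standard /proc/sys/net/ipv4/conf/<interface>/ files.

```python
import glob
import os


def read_arp_settings(root="/proc/sys/net/ipv4/conf", pattern="ib*"):
    """Return {interface: {"arp_filter": int, "arp_ignore": int}}.

    A value of None means the file was missing or unreadable.
    """
    settings = {}
    for conf_dir in sorted(glob.glob(os.path.join(root, pattern))):
        values = {}
        for name in ("arp_filter", "arp_ignore"):
            try:
                with open(os.path.join(conf_dir, name)) as f:
                    values[name] = int(f.read().strip())
            except (OSError, ValueError):
                values[name] = None
        settings[os.path.basename(conf_dir)] = values
    return settings
```

Calling read_arp_settings() on a host with ib0 and ib1 returns one entry per interface, so the effect of either the sysctl or the echo loop above can be checked directly.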

Or.

From clivhall at cisco.com  Thu May 24 13:37:48 2007
From: clivhall at cisco.com (Clive Hall (clivhall))
Date: Thu, 24 May 2007 13:37:48 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <OF9C4E5DE6.9FC2CFBA-ON872572E5.00642B57-882572E5.0069B94A@us.ibm.com>
Message-ID: <5BD9FA70F5EDAC43AB816A5FDE30F6AC0455380D@xmb-sjc-21a.amer.cisco.com>

Those particular log messages are just informational messages.  They're
logged when multicast groups are created (when the first group member
joins) and when multicast groups are deleted (when the last group member
leaves).
 
As Shirley said, if you're not using IPv6 anyway then those messages
aren't harmful.  Even if you are using IPv6 it will quite possibly still
be fine, although I don't know why hosts would be leaving/rejoining the
multicast groups.
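
As a toy model of the rule described above (created on first join, deleted on last leave), the sketch below records the same two event names that appear in the switch log. It is only an illustration; the class McGroupTable is hypothetical and says nothing about how the SM is actually implemented.

```python
class McGroupTable:
    """Toy multicast group table that logs group creation and deletion."""

    def __init__(self):
        self.members = {}  # group GID -> set of member port GIDs
        self.log = []      # list of (event, group GID) tuples

    def join(self, mgid, port):
        group = self.members.setdefault(mgid, set())
        if not group:
            # First member: the group comes into existence.
            self.log.append(("CREATE_MC_GROUP", mgid))
        group.add(port)

    def leave(self, mgid, port):
        group = self.members.get(mgid)
        if group is None or port not in group:
            return
        group.remove(port)
        if not group:
            # Last member left: the group is deleted.
            del self.members[mgid]
            self.log.append(("DELETE_MC_GROUP", mgid))
```

In this model, a port that is the only member of a group and repeatedly leaves and rejoins it produces alternating DELETE_MC_GROUP and CREATE_MC_GROUP entries, which matches the pattern in the log quoted below.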
 
Clive.
 

________________________________

	From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Shirley Ma
	Sent: Thursday, May 24, 2007 11:16 AM
	To: SEGERS Koen
	Cc: general-bounces at lists.openfabrics.org;
general at lists.openfabrics.org
	Subject: RE: [ofa-general] GPFS node loses IB-connection
	
	
	Koen,
	
	Are you using IPv6? If not, then this is not harmful. If you
don't use it, you can simply disable loading the IPv6 module on your nodes
when rebooting.
	
	Thanks
	Shirley Ma
	IBM Linux Technology Center
	15300 SW Koll Parkway
	Beaverton, OR 97006-6063
	Phone(Fax): (503) 578-7638
	
	
	"SEGERS Koen" <Koen.SEGERS at VRT.BE>
	Sent by: general-bounces at lists.openfabrics.org
	05/24/07 11:03 AM

	To: "Scott Weitzenkamp (sweitzen)" <sweitzen at cisco.com>, "Hal Rosenstock" <halr at voltaire.com>
	cc: general-bounces at lists.openfabrics.org, general at lists.openfabrics.org
	Subject: RE: [ofa-general] GPFS node loses IB-connection

	After changing the switch timeout value, we never got the error
again. Yesterday, we started a 24h stress test. This test was successful.
I think we can conclude that the problem is fixed now.
	
	But there is a strange message in the logs of the switch: 

	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx
	Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
	Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy
	Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
	Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

	With xx, yy = (e.g.) ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9,
but changing to different GIDs in the next group of log entries, each
belonging to the IB ports of the server HCAs.

	This logging occurs every few minutes (not at a regular interval).
Is there a Cisco manual available somewhere that describes or explains
these messages? Or can anyone explain what is happening, and whether
this can harm a setup that doesn't use multicast?

	Greetz 

	Koen 

	
________________________________

	From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
	Sent: Wed 23/05/2007 17:40
	To: SEGERS Koen; Hal Rosenstock
	CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
	Subject: RE: [ofa-general] GPFS node loses IB-connection
	

	Try 20 seconds; I'm curious whether you are barely crossing the
	10-second threshold.
	
	Scott
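
The timeout rule described further down in this thread (a response is expected within node-timeout, rather than a ping being sent every node-timeout seconds) can be sketched roughly as follows. This is only a toy model under that reading; the function name is_out_of_service is made up, and the real SM logic is not shown anywhere in this thread.

```python
def is_out_of_service(last_response_time, now, node_timeout=10.0):
    """True if the node has been silent for longer than node_timeout seconds."""
    return (now - last_response_time) > node_timeout


# A host that stalls for 12 seconds crosses a 10-second threshold
# but stays in service with a 20-second one.
stall = 12.0
print(is_out_of_service(0.0, stall, node_timeout=10.0))
print(is_out_of_service(0.0, stall, node_timeout=20.0))
```

If the stalls are only slightly longer than 10 seconds, raising the timeout to 20 seconds should be enough to tell.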
	
	> -----Original Message-----
	> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE
<mailto:Koen.SEGERS at VRT.BE> ]
	> Sent: Wednesday, May 23, 2007 8:39 AM
	> To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
	> Cc: Clive Hall (clivhall);
	> general-bounces at lists.openfabrics.org;
general at lists.openfabrics.org
	> Subject: RE: [ofa-general] GPFS node loses IB-connection
	>
	> What value would you recommend then?
	>
	> Koen
	>
	> -----Original Message-----
	> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
	> Sent: Wednesday, 23 May 2007 17:38
	> To: SEGERS Koen; Hal Rosenstock
	> CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
	> general at lists.openfabrics.org
	> Subject: RE: [ofa-general] GPFS node loses IB-connection
	>
	> The boot time of the host doesn't matter for this timeout. While the
	> host is booting, the IB link is down anyway.
	>
	> Scott
	>
	> > -----Original Message-----
	> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE
<mailto:Koen.SEGERS at VRT.BE> ]
	> > Sent: Wednesday, May 23, 2007 8:20 AM
	> > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
	> > Cc: Clive Hall (clivhall);
	> > general-bounces at lists.openfabrics.org;
general at lists.openfabrics.org
	> > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> >
	> > After a whole day of stress testing with the MAD renicing turned on,
	> > we got the error once. So I think I should raise the timeout on the
	> > switch also.
	> >
	> > It takes about 2 minutes to boot the system. Do you agree that this
	> > is a good value for the timeout?
	> >
	> > Scott,
	> > Can you explain the memlock problem to me?
	> >
	> > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
	> > install this, the bug is not related to us. This is correct, isn't it?
	> >
	> > Greetz
	> >
	> > Koen
	> >
	> > -----Original Message-----
	> > From: Hal Rosenstock [mailto:halr at voltaire.com]
	> > Sent: Wednesday, 23 May 2007 16:12
	> > To: Scott Weitzenkamp (sweitzen)
	> > CC: SEGERS Koen; Clive Hall (clivhall);
	> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
	> > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> >
	> > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
	> > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes
	> > > in openib.conf,
	> >
	> > Does the host really not respond to MAD requests for over 10 seconds
	> > in some cases?
	> >
	> > -- Hal
	> >
	> > > memlock in /etc/security/limits.conf, fix /etc/hosts on
	> > > SLES10 for bug 267, etc.).
	> > >
	> > > Scott Weitzenkamp
	> > > SQA and Release Manager
	> > > Server Virtualization Business Unit
	> > > Cisco Systems
	> > > 
	> > >
	> > > > -----Original Message-----
	> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE
<mailto:Koen.SEGERS at VRT.BE> ]
	> > > > Sent: Wednesday, May 23, 2007 6:48 AM
	> > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
	> > > > Cc: Shirley Ma; Ami Perlmutter;
	> > > > general at lists.openfabrics.org;
	> > general-bounces at lists.openfabrics.org
	> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > >
	> > > > Thus far, all tests seem to work.
	> > > >
	> > > > Thanks for the help!
	> > > >
	> > > > Scott,
	> > > > Are there more bug fixes that Cisco applies in its RPMs?
	> > > >
	> > > > Greetz
	> > > >
	> > > > Koen
	> > > >
	> > > > -----Original Message-----
	> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
	> > > > Sent: Wednesday, 23 May 2007 0:39
	> > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
	> > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
	> > > > general-bounces at lists.openfabrics.org
	> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > >
	> > > > It's not so much pinging every 10 seconds as expecting a response
	> > > > within 10 seconds (Clive, correct me if I'm wrong).
	> > > >
	> > > > You only need to do 1) or 2), not both. Cisco configures 1) in the
	> > > > OFED binary RPMs we release at
	> > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux . I prefer to
	> > > > have the host be more responsive.
	> > > >
	> > > >
	> > > > Scott Weitzenkamp
	> > > > SQA and Release Manager
	> > > > Server Virtualization Business Unit
	> > > > Cisco Systems
	> > > > 
	> > > >
	> > > > > -----Original Message-----
	> > > > > From: Koen Segers [mailto:koen.segers at VRT.BE
<mailto:koen.segers at VRT.BE> ]
	> > > > > Sent: Tuesday, May 22, 2007 3:35 PM
	> > > > > To: Scott Weitzenkamp (sweitzen)
	> > > > > Cc: Shirley Ma; Ami Perlmutter;
	> > > > > general at lists.openfabrics.org;
	> > general-bounces at lists.openfabrics.org
	> > > > > Subject: RE: [ofa-general] GPFS node loses
IB-connection
	> > > > >
	> > > > > If I understand it right, the switch is actually polling
	> > > > > (=pinging) the interfaces every 10s. This means that when the
	> > > > > interface is handling other traffic, the poll can fail and the
	> > > > > port could be considered out of service. My question is then:
	> > > > > "How can the timeout be reached while packets are being
	> > > > > sent/received to/from the interface?"
	> > > > >
	> > > > > Anyway, what timeout value would you recommend for us? And why?
	> > > > >
	> > > > > To recapitulate: these are the actions I'll take tomorrow
	> > > > > 1) change the MAD niceness of the servers
	> > > > > 2) change the timeout on the switches
	> > > > >
	> > > > > Are these changes sufficient for the HCAs to keep their ports in
	> > > > > PORT_ACTIVE state?
	> > > > >
	> > > > > Regards,
	> > > > >
	> > > > > Koen
	> > > > >
	> > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
	> > > > > > Yes, you can tune it. Here's an example via the switch CLI:
	> > > > > > 
	> > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
	> > > > > >
	> > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
	> > > > > > If an HCA is completely unresponsive for longer than the
	> > > > > > node-timeout value, then we consider that HCA out of service.
	> > > > > > 
	> > > > > > Scott Weitzenkamp
	> > > > > > SQA and Release Manager
	> > > > > > Server Virtualization Business Unit
	> > > > > > Cisco Systems
	> > > > > > 
	> > > > > >
	> > > > > > 
	> > > > > > 
	> > > > >
______________________________________________________________
	> > > > > > From: Shirley Ma [mailto:xma at us.ibm.com
<mailto:xma at us.ibm.com> ]
	> > > > > > Sent: Tuesday, May 22, 2007 11:30 AM
	> > > > > > To: koen.segers at VRT.BE
	> > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org;
	> > > > > > general-bounces at lists.openfabrics.org; Scott
	> > Weitzenkamp
	> > > > > > (sweitzen)
	> > > > > > Subject: RE: [ofa-general] GPFS node loses
	> > IB-connection
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > > > > Koen,
	> > > > > > 
	> > > > > > So it is most likely you hit the same bug as 229 (Scott
	> > > > > > pointed out earlier). The same workaround might work for you
	> > > > > > by renicing ib_mad as Scott suggested.
	> > > > > > 
	> > > > > > I think this should be a tunable SM query timeout value in
	> > > > > > Cisco SM. Am I right, Scott?
	> > > > > > 
	> > > > > > Thanks
	> > > > > > Shirley Ma
	> > > > > > 
	> > > > > > 
	> > > > > > Koen Segers <koen.segers at VRT.BE>
	> > > > > > 05/22/07 11:14 AM
	> > > > > > Please respond to koen.segers at VRT.BE
	> > > > > > 
	> > > > > > To: Shirley Ma/Beaverton/IBM at IBMUS
	> > > > > > cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
	> > > > > > general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
	> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > > > > 
	> > > > > > 
	> > > > > > Hi,
	> > > > > > 
	> > > > > > It is the Cisco SM.
	> > > > > > 
	> > > > > > SFS-7000P> show version
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > > > > ================================================================================
	> > > > > > System Version Information
	> > > > > > ================================================================================
	> > > > > > system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
	> > > > > > contact : tac at cisco.com
	> > > > > > name : SFS-7000P
	> > > > > > location : 170 West Tasman Drive, San Jose, CA 95134
	> > > > > > up-time : 11(d):7(h):49(m):3(s)
	> > > > > > last-change : none
	> > > > > > last-config-save : none
	> > > > > > action : none
	> > > > > > result : none
	> > > > > > oper-mode : normal
	> > > > > > 
	> > > > > > There is also a command that gives the SM version, but I can't
	> > > > > > find it right now.
	> > > > > > 
	> > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
	> > > > > > > Hello Koen,
	> > > > > > >
	> > > > > > > From the switch log, it looks like an SM issue to me. The
	> > > > > > > node was kicked out of the membership. Which SM are you
	> > > > > > > using in your fabric?
	> > > > > > >
	> > > > > > > Thanks
	> > > > > > > Shirley Ma
	> > > > > > >
	> > > > > > *** Disclaimer ***
	> > > > > > 
	> > > > > > Vlaamse Radio- en Televisieomroep
	> > > > > > Auguste Reyerslaan 52, 1043 Brussel
	> > > > > > 
	> > > > > > nv van publiek recht
	> > > > > > BTW BE 0244.142.664
	> > > > > > RPR Brussel
	> > > > > > http://www.vrt.be/disclaimer
<http://www.vrt.be/disclaimer> 
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > _______________________________________________
	> > > general mailing list
	> > > general at lists.openfabrics.org
	> > > 
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
<http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general> 
	> > >
	> > > To unsubscribe, please visit
	> > http://openib.org/mailman/listinfo/openib-general
<http://openib.org/mailman/listinfo/openib-general> 
	> >


From rdreier at cisco.com  Thu May 24 14:03:22 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 14:03:22 -0700
Subject: [ofa-general] Re: [PATCH] IB/mlx4_ib initialize work - resending fix
	description
In-Reply-To: <1180011931.11166.47.camel@mtls03> (Eli Cohen's message of "Thu,
	24 May 2007 16:05:01 +0300")
References: <1180011931.11166.47.camel@mtls03>
Message-ID: <adairai53z9.fsf@cisco.com>

 > +	if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) {
 >  		context->db_rec_addr = cpu_to_be64(qp->db.dma);
 > +		for (i = 0; i < qp->sq.max; ++i) {
 > +			ctrl = get_send_wqe(qp, i);
 > +			ctrl->owner_opcode = cpu_to_be32(1 << 31);
 > +		}
 > +	}

er... actually we only want to do this for kernel QPs (since we don't
have access to the buffer for user QPs).  I fixed up the patch in my tree.
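
For readers who do not follow the driver code, here is a small Python model of what the loop in the patch does and of the kernel-only restriction. It is not the driver code and not the fix as committed: the function stamp_send_queue, the is_kernel_qp flag, and the dictionary layout are all made up for illustration. struct.pack(">I", ...) stands in for cpu_to_be32().

```python
import struct

OWNER_BIT = 1 << 31  # the value the patch writes into owner_opcode


def stamp_send_queue(wqes, is_kernel_qp):
    """Set the owner bit on every send WQE, but only for kernel QPs.

    For a user QP the buffer is not accessible here, so it is left untouched.
    """
    if not is_kernel_qp:
        return wqes
    for wqe in wqes:
        # Big-endian 32-bit value, like cpu_to_be32(1 << 31) in the patch.
        wqe["owner_opcode"] = struct.pack(">I", OWNER_BIT)
    return wqes


kernel_sq = stamp_send_queue([{"owner_opcode": bytes(4)} for _ in range(4)], True)
user_sq = stamp_send_queue([{"owner_opcode": bytes(4)} for _ in range(4)], False)
print(kernel_sq[0]["owner_opcode"].hex(), user_sq[0]["owner_opcode"].hex())
```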


From mbloom at tervela.com  Thu May 24 14:06:57 2007
From: mbloom at tervela.com (Michael Bloom)
Date: Thu, 24 May 2007 17:06:57 -0400
Subject: [ofa-general] newbie throughput question
Message-ID: <D9A04CA0AD600A45ADE5999602BD5926557B8C@S79987.tervela.com>

I'm new to IB and the OFED stack, but I'm using a proprietary packet
blaster program and getting very low throughput.

I'm expecting 200k - 500k packets per second, but I'm only achieving ~
15k pps.

This program is running on 2 dedicated RH boxes (transmitter on one,
receiver on the other).

We're using the standard OFED 1.1.1 stack.

 
Has anyone seen high packet throughput levels using a similar
environment?

Are there certain kernel parameters that I need to tweak here?

Any help would be appreciated.

 
Thanks,

Mike

 

From Koen.SEGERS at VRT.BE  Thu May 24 14:24:27 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Thu, 24 May 2007 23:24:27 +0200
Subject: [ofa-general] GPFS node loses IB-connection
References: <5BD9FA70F5EDAC43AB816A5FDE30F6AC0455380D@xmb-sjc-21a.amer.cisco.com>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C030AA26B@OCBEXS01001.rto.be>

We are not using IPoIB. We only use SDP, but IPoIB is compiled just in case we need it (when SDP is not sufficient).
All interfaces are given an IPv4 address, so the messages aren't harmful I guess.
 
Thanks!
 
Koen

________________________________

From: Clive Hall (clivhall) [mailto:clivhall at cisco.com]
Sent: Thu 24/05/2007 22:37
To: Shirley Ma; SEGERS Koen
CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection


Those particular log messages are just informational messages.  They're logged when multicast groups are created (when the first group member joins) and when multicast groups are deleted (when the last group member leaves).
 
As Shirley said, if you're not using IPv6 anyway then those messages aren't harmful.  Even if you are using IPv6 it will quite possibly still be fine, although I don't know why hosts would be leaving/rejoining the multicast groups.
 
Clive.
 

________________________________

	From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Shirley Ma
	Sent: Thursday, May 24, 2007 11:16 AM
	To: SEGERS Koen
	Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
	Subject: RE: [ofa-general] GPFS node loses IB-connection
	
	
	Koen,
	
	Are you using IPv6? If not, then this is not harmful. If you don't use it, you can simply disable loading the IPv6 module on your nodes when rebooting.
	
	Thanks
	Shirley Ma
	IBM Linux Technology Center
	15300 SW Koll Parkway
	Beaverton, OR 97006-6063
	Phone(Fax): (503) 578-7638
	
	
	"SEGERS Koen" <Koen.SEGERS at VRT.BE>
	Sent by: general-bounces at lists.openfabrics.org
	05/24/07 11:03 AM

	To: "Scott Weitzenkamp (sweitzen)" <sweitzen at cisco.com>, "Hal Rosenstock" <halr at voltaire.com>
	cc: general-bounces at lists.openfabrics.org, general at lists.openfabrics.org
	Subject: RE: [ofa-general] GPFS node loses IB-connection

	After changing the switch timeout value, we never got the error again. Yesterday, we started a 24h stress test. This test was successful. I think we can conclude that the problem is fixed now.
	
	But there is a strange message in the logs of the switch: 

	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx 

	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx 

	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx 

	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx 

	Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change 

	Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change 

	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy 

	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy 

	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy 

	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy 

	Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change 

	Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change 

	With xx, yy = (e.g.) ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9, but changing to different GIDs in the next group of log entries, each belonging to the IB ports of the server HCAs. 

	This logging occurs every few minutes (not at a regular interval). Is there a Cisco manual available somewhere that describes or explains these messages? Or can anyone explain what is happening, and whether this can harm a setup that doesn't use multicast? 

	Greetz 

	Koen 

	
________________________________

	From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
	Sent: Wed 23/05/2007 17:40
	To: SEGERS Koen; Hal Rosenstock
	CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
	Subject: RE: [ofa-general] GPFS node loses IB-connection
	

	Try 20 seconds; I'm curious whether you are barely crossing the 10-second
	threshold.
	
	Scott
	
	> -----Original Message-----
	> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE <mailto:Koen.SEGERS at VRT.BE> ]
	> Sent: Wednesday, May 23, 2007 8:39 AM
	> To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
	> Cc: Clive Hall (clivhall);
	> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
	> Subject: RE: [ofa-general] GPFS node loses IB-connection
	>
	> What value would you recommend then?
	>
	> Koen
	>
	> -----Original Message-----
	> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
	> Sent: Wednesday, 23 May 2007 17:38
	> To: SEGERS Koen; Hal Rosenstock
	> CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
	> general at lists.openfabrics.org
	> Subject: RE: [ofa-general] GPFS node loses IB-connection
	>
	> The boot time of the host doesn't matter for this timeout. While the
	> host is booting, the IB link is down anyway.
	>
	> Scott
	>
	> > -----Original Message-----
	> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE <mailto:Koen.SEGERS at VRT.BE> ]
	> > Sent: Wednesday, May 23, 2007 8:20 AM
	> > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
	> > Cc: Clive Hall (clivhall);
	> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
	> > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> >
	> > After a whole day of stresstesting with the MAD renicing
	> turned on, we
	> > got the error once. So I think I should raise the timeout on
	> > the switch
	> > also.
	> >
	> > It takes about 2 minutes to boot the system. Do you agree
	> > that this is a
	> > good value for the timeout?
	> >
	> > Scott,
	> > Can you explain me the problem of the memlock?
	> >
	> > I saw that the SLES10 bug is only an issue in MVAPICH.
	> Since we didn't
	> > install this, the bug is not related to us. This is
	> correct, isn't it?
	> >
	> > Greetz
	> >
	> > Koen
	> >
	> > -----Original Message-----
	> > From: Hal Rosenstock [mailto:halr at voltaire.com]
	> > Sent: Wednesday, 23 May 2007 16:12
	> > To: Scott Weitzenkamp (sweitzen)
	> > CC: SEGERS Koen; Clive Hall (clivhall);
	> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
	> > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> >
	> > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
	> > > No C code changes, just a few config file changes
	> (RENICE_IB_MAD=yes
	> > in
	> > > openib.conf,
	> >
	> > Does the host really not respond to MAD requests for over 10
	> > seconds in
	> > some cases ?
	> >
	> > -- Hal
	> >
	> > > memlock in /etc/security/limits.conf, fix /etc/hosts on
	> > > SLES10 for bug 267, etc.).
	> > >
	> > > Scott Weitzenkamp
	> > > SQA and Release Manager
	> > > Server Virtualization Business Unit
	> > > Cisco Systems
	> > > 
	> > >
	> > > > -----Original Message-----
	> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE <mailto:Koen.SEGERS at VRT.BE> ]
	> > > > Sent: Wednesday, May 23, 2007 6:48 AM
	> > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
	> > > > Cc: Shirley Ma; Ami Perlmutter;
	> > > > general at lists.openfabrics.org;
	> > general-bounces at lists.openfabrics.org
	> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > >
	> > > > This far, all tests seem to work.
	> > > >
	> > > > Thanks for the help!
	> > > >
	> > > > Scott,
	> > > > Are there more bugfixes that cisco does in its rpms?
	> > > >
	> > > > Greetz
	> > > >
	> > > > Koen
	> > > >
	> > > > -----Original Message-----
	> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
	> > > > Sent: Wednesday, 23 May 2007 0:39
	> > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
	> > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
	> > > > general-bounces at lists.openfabrics.org
	> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > >
	> > > > It's not so much pinging every 10 seconds as expecting a
	> > > > response within
	> > > > 10 seconds (Clive, correct me if I'm wrong).
	> > > >
	> > > > You only need to do 1) or 2), not both. Cisco configures 1)
	> > > > in the OFED
	> > > > binary RPMs we release at
	> > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux <http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux> . I
	> > > > prefer to have
	> > > > the host be more responsive.
	> > > >
	> > > >
	> > > > Scott Weitzenkamp
	> > > > SQA and Release Manager
	> > > > Server Virtualization Business Unit
	> > > > Cisco Systems
	> > > > 
	> > > >
	> > > > > -----Original Message-----
	> > > > > From: Koen Segers [mailto:koen.segers at VRT.BE <mailto:koen.segers at VRT.BE> ]
	> > > > > Sent: Tuesday, May 22, 2007 3:35 PM
	> > > > > To: Scott Weitzenkamp (sweitzen)
	> > > > > Cc: Shirley Ma; Ami Perlmutter;
	> > > > > general at lists.openfabrics.org;
	> > general-bounces at lists.openfabrics.org
	> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > > >
	> > > > > If I understand it right, the switch is actually polling
	> > > > > (=pinging) the
	> > > > > interfaces every 10s. This means that when the interface is
	> > handling
	> > > > > other traffic, the poll can fail and the port could be
	> > > > > considered out of
	> > > > > service. My question is then: "How can the timeout be reached
	> > while
	> > > > > packets are being sent/received to/from the interface?"
	> > > > >
	> > > > > Anyway, what timeout-value would you recommend for
	> us? And why?
	> > > > >
	> > > > > To recapitulate: these are the actions I'll take tomorrow
	> > > > > 1) change the MAD niceness of the servers
	> > > > > 2) change the timeout on the switches
	> > > > >
	> > > > > Are these changes sufficient for the HCA's to keep
	> > their ports in
	> > > > > PORT_ACTIVE state?
	> > > > >
	> > > > > Regards,
	> > > > >
	> > > > > Koen
	> > > > >
	> > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
	> > > > > > Yes, you can tune it. Here's an example via the switch CLI:
	> > > > > > 
	> > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
	> > > > > >
	> > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
	> > > > > > If an HCA is completely unresponsive for longer than the node-timeout
	> > > > > > value, then we consider that HCA out of service.
	> > > > > > 
	> > > > > > Scott Weitzenkamp
	> > > > > > SQA and Release Manager
	> > > > > > Server Virtualization Business Unit
	> > > > > > Cisco Systems
	> > > > > > 
	> > > > > >
	> > > > > > 
	> > > > > > 
	> > > > > > ______________________________________________________________
	> > > > > > From: Shirley Ma [mailto:xma at us.ibm.com]
	> > > > > > Sent: Tuesday, May 22, 2007 11:30 AM
	> > > > > > To: koen.segers at VRT.BE
	> > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org;
	> > > > > > general-bounces at lists.openfabrics.org; Scott Weitzenkamp (sweitzen)
	> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > > > > Koen,
	> > > > > > 
	> > > > > > So it is most likely that you hit the same bug as 229 (as Scott
	> > > > > > pointed out earlier). The same workaround might work for you:
	> > > > > > renicing ib_mad, as Scott suggested.
	> > > > > > 
	> > > > > > I think this should be an SM query timeout tunable value in the
	> > > > > > Cisco SM. Am I right, Scott?
	> > > > > > 
	> > > > > > Thanks
	> > > > > > Shirley Ma
	> > > > > > 
	> > > > > > 
	> > > > > > Koen Segers <koen.segers at VRT.BE>
	> > > > > > 05/22/07 11:14 AM
	> > > > > > Please respond to koen.segers at VRT.BE
	> > > > > > 
	> > > > > > To: Shirley Ma/Beaverton/IBM at IBMUS
	> > > > > > cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
	> > > > > > general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
	> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > > > > Hi,
	> > > > > > 
	> > > > > > It is the Cisco SM.
	> > > > > > 
	> > > > > > SFS-7000P> show version
	> > > > > > 
	> > > > > > ================================================================================
	> > > > > > System Version Information
	> > > > > > ================================================================================
	> > > > > > system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
	> > > > > > contact : tac at cisco.com
	> > > > > > name : SFS-7000P
	> > > > > > location : 170 West Tasman Drive, San Jose, CA 95134
	> > > > > > up-time : 11(d):7(h):49(m):3(s)
	> > > > > > last-change : none
	> > > > > > last-config-save : none
	> > > > > > action : none
	> > > > > > result : none
	> > > > > > oper-mode : normal
	> > > > > > 
	> > > > > > There is also a command that gives the SM version, but I can't
	> > > > > > find it right now.
	> > > > > > 
	> > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
	> > > > > > > Hello Koen,
	> > > > > > >
	> > > > > > > From the switch log, it looks like an SM issue to me. The node
	> > > > > > > was kicked out of the membership. Which SM are you using in
	> > > > > > > your fabric?
	> > > > > > >
	> > > > > > > Thanks
	> > > > > > > Shirley Ma
	> > > > > > >
	> > > > > > *** Disclaimer ***
	> > > > > > 
	> > > > > > Vlaamse Radio- en Televisieomroep
	> > > > > > Auguste Reyerslaan 52, 1043 Brussel
	> > > > > > 
	> > > > > > nv van publiek recht
	> > > > > > BTW BE 0244.142.664
	> > > > > > RPR Brussel
	> > > > > > http://www.vrt.be/disclaimer
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > _______________________________________________
	> > > general mailing list
	> > > general at lists.openfabrics.org
	> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
	> > >
	> > > To unsubscribe, please visit
	> > > http://openib.org/mailman/listinfo/openib-general
	> >


	


From rdreier at cisco.com  Thu May 24 14:26:38 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 14:26:38 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <4655E4A4.6020408@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Thu, 24 May 2007 12:16:52 -0700")
References: <46537081.30906@linux.vnet.ibm.com> <adaps4q6ryn.fsf@cisco.com>
	<4655E4A4.6020408@linux.vnet.ibm.com>
Message-ID: <adaabvt6hgx.fsf@cisco.com>

 > > It would be nice, but to handle increasing the MTU, you need some way
 > > to handle the receives you already posted (which would be too small
 > > all of a sudden).

 > Can you expand on this a little more -I do not catch the drift.

OK, suppose I configure the interface with a 16K MTU.  I assume your
plan would be to queue up a bunch of 16K receives.  Now suppose I
change the MTU to 64K.  What do you do about the receives you already
queued up that can't handle the new MTU?

 - R.


From sashak at voltaire.com  Thu May 24 15:54:28 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 25 May 2007 01:54:28 +0300
Subject: [ofa-general] Re: [PATCHv2] osm: up/dn optimization - improved
	ranking
In-Reply-To: <4651557E.2080400@dev.mellanox.co.il>
References: <46503064.7010107@dev.mellanox.co.il>
	<20070520161034.GY19271@sashak.voltaire.com>
	<4651557E.2080400@dev.mellanox.co.il>
Message-ID: <20070524225428.GK837@sashak.voltaire.com>

Hi Yevgeny,

On 11:17 Mon 21 May     , Yevgeny Kliteynik wrote:
> >>@@ -492,7 +483,10 @@ updn_subn_rank(
> >>                 remote_u->rank );
> >>
> >>        if (did_cause_update)
> >>+        {
> >>          cl_qlist_insert_tail(&list, &remote_u->list);
> >>+          max_rank = remote_u->rank;
> >>+        }
> >
> >I think this is still not accurate. For instance, with a topology like:
> >A <-> B <-> C <-> D <-> E , where the roots are A and E, we will get
> >max_rank = 1, which obviously should be 2.
> 
> Not exactly. What you're describing would happen if the scan were
> DFS-like, not BFS.

You are right, I used broken logic :(

> I do think that to make the code more "intuitive" we might want to
> remove __updn_update_rank() and do something like this:
> 
>    if (remote_u->rank > u->rank + 1)
>    {
>        remote_u->rank = u->rank + 1;
>        max_rank = remote_u->rank; 
>        cl_qlist_insert_tail(&list, &remote_u->list);
>    }

Agree, it looks cleaner.
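
A minimal sketch of the simplified BFS ranking discussed above, assuming a
small adjacency-matrix graph instead of the OpenSM switch list (struct graph,
rank_from_roots and MAX_NODES are hypothetical names, not OpenSM code).
Because BFS visits nodes in non-decreasing rank order, the last rank assigned
is also the largest, so max_rank can be updated in the same branch that lowers
the rank:

```c
#include <limits.h>
#include <stddef.h>

#define MAX_NODES 16
#define UNRANKED  UINT_MAX

/* Hypothetical adjacency-matrix graph; not the OpenSM data structures. */
struct graph {
    size_t n;                                 /* number of nodes in use */
    unsigned char adj[MAX_NODES][MAX_NODES];  /* adj[a][b] != 0 if linked */
};

/*
 * Rank every node by its BFS distance from the nearest root.
 * rank[] receives the per-node ranks; the return value is the largest
 * rank that was assigned (max_rank).
 */
unsigned rank_from_roots(const struct graph *g, const size_t *roots,
                         size_t n_roots, unsigned rank[MAX_NODES])
{
    size_t queue[MAX_NODES];  /* each node is enqueued at most once */
    size_t head = 0, tail = 0;
    unsigned max_rank = 0;

    for (size_t i = 0; i < g->n; i++)
        rank[i] = UNRANKED;

    /* All roots start at rank 0 and seed the BFS queue. */
    for (size_t i = 0; i < n_roots; i++) {
        if (rank[roots[i]] == UNRANKED) {
            rank[roots[i]] = 0;
            queue[tail++] = roots[i];
        }
    }

    while (head < tail) {
        size_t u = queue[head++];

        for (size_t v = 0; v < g->n; v++) {
            if (!g->adj[u][v])
                continue;
            /* Same shape as the snippet quoted above. */
            if (rank[v] > rank[u] + 1) {
                rank[v] = rank[u] + 1;
                max_rank = rank[v];
                queue[tail++] = v;
            }
        }
    }
    return max_rank;
}
```

For the A <-> B <-> C <-> D <-> E chain with roots A and E, this yields ranks
0, 1, 2, 1, 0 and a max_rank of 2.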

Sasha


From rdreier at cisco.com  Thu May 24 16:51:55 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 16:51:55 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh
	access race
In-Reply-To: <20070524131154.GA7940@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 24 May 2007 16:11:54 +0300")
References: <20070522005918.GB13311@mellanox.co.il>
	<adatzu4d1wx.fsf@cisco.com> <20070524131154.GA7940@mellanox.co.il>
Message-ID: <ada646h6aqs.fsf@cisco.com>

 > The following works fine for me here. Pls consider for 2.6.22.

Does it help with https://bugs.openfabrics.org//show_bug.cgi?id=604 ?
Or are we still looking?

 - R.


From swise at opengridcomputing.com  Thu May 24 18:59:53 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 24 May 2007 18:59:53 -0700
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
References: <20070524141928.GI20691@minantech.com>
	<309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
Message-ID: <46564319.4040800@opengridcomputing.com>


Devesh Sharma wrote:
> On 5/24/07, Gleb Natapov <glebn at voltaire.com> wrote:
>> Hi,
>>
>>  Does a local RDMA write completion guarantee that the data that was
>> RDMAed is already accessible in the destination host's _memory_?
> Local RDMA write completion guarantees that the data you have RDMAed
> has been copied into the remote buffer, without any data corruption.
>>
For iWARP, the local write completion simply means you can reuse the 
local buffer and the transport will deliver it or kill the 
connection.  The data _could_ be queued in the local RNIC and anywhere 
else in the TCP cloud.


From vlad at lists.openfabrics.org  Fri May 25 02:41:31 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri, 25 May 2007 02:41:31 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070525-0200 daily build status
Message-ID: <20070525094131.6BEAAE6082A@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.19
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From mst at dev.mellanox.co.il  Fri May 25 02:47:05 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 25 May 2007 12:47:05 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh
	access race
In-Reply-To: <ada646h6aqs.fsf@cisco.com>
References: <20070522005918.GB13311@mellanox.co.il> <adatzu4d1wx.fsf@cisco.com>
	<20070524131154.GA7940@mellanox.co.il> <ada646h6aqs.fsf@cisco.com>
Message-ID: <20070525094705.GC15942@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race
> 
>  > The following works fine for me here. Pls consider for 2.6.22.
> 
> Does it help with https://bugs.openfabrics.org//show_bug.cgi?id=604 ?
> Or are we still looking?

Still looking.

-- 
MST


From devesh28 at gmail.com  Fri May 25 06:52:59 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Fri, 25 May 2007 19:22:59 +0530
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <1180020620.16831.198071.camel@hal.voltaire.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<1179398534.23882.67542.camel@hal.voltaire.com>
	<309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com>
	<1179483657.23882.158398.camel@hal.voltaire.com>
	<309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com>
	<1179769930.15940.9823.camel@hal.voltaire.com>
	<309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com>
	<1179930909.16831.100286.camel@hal.voltaire.com>
	<309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com>
	<1180020620.16831.198071.camel@hal.voltaire.com>
Message-ID: <309a667c0705250652m2ddbfd31v5e8947d9b28882c2@mail.gmail.com>

On 24 May 2007 11:30:24 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> On Thu, 2007-05-24 at 08:22, Devesh Sharma wrote:
> > On 23 May 2007 10:35:13 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > On Wed, 2007-05-23 at 10:27, Devesh Sharma wrote:
> > > > On 21 May 2007 13:52:11 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > > > On Mon, 2007-05-21 at 01:58, Devesh Sharma wrote:
> > > > > > On 18 May 2007 06:21:05 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > > > > > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote:
> > > > > > > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> > > > > > > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote:
> > > > > > > > > > On 5/17/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > > > > > > > > > > > But initially this will generate a packet for each path, while sys
> > > > > > > > > > > > admin knows that path is there and he can hard-code the entries for
> > > > > > > > > > > > it. Other thing is that why Admin will care about creating such record
> > > > > > > > > > > > while SA is itself taking care, right?
> > > > > > > > > > >
> > > > > > > > > > > In your original message you asked about adding 'dummy entries' to the
> > > > > > > > > > > cache.  I agree that pre-loading the cache can be useful.  What I still
> > > > > > > > > > > am not understanding is the reasoning for adding 'dummy entries'.  By
> > > > > > > > > > > 'dummy entries', I've been assuming that these are invalid path records,
> > > > > > > > > > > but maybe that's not what you meant.
> > > > > > > > > > Ok if "dummy entries" word as such has created confusion then I am
> > > > > > > > > > sorry for that, But with that I mean that, those are valid path
> > > > > > > > > > records which Administrator knows in advance and while loading the
> > > > > > > > > > module,
> > > > > > > > >
> > > > > > > > > How does the admin know they are valid ?
> > > > > > > > Depending on the initial application runs, some trusted PRs can be generated.
> > > > > > >
> > > > > > > What do initial application runs have to do with this ?
> > > > > > My understanding is that, once the cluster is UP, and if between Node
> > > > > > A and Node B there is only one path,
> > > > >
> > > > > So this is a feature for such one path subnets. I wonder what percentage
> > > > > of deployed subnets fits this case.
> > > > You never know, It may be used for debugging also.
> > >
> > > I still don't have a good feel for how common/generally useful this will
> > > really be.
> > >
> > > > > > then, SA query always going to return same values in PR.
> > > > >
> > > > > If subnet topology is changed, these PRs might change. There are other
> > > > > cases where they change too.
> > > > Not sure about it...some suggestion?
> > > > >
> > > > > >  On this basis Initial application runs will generate PRs,
> > > > >
> > > > > That's what confused me before (Applications don't generate PRs but
> > > > > rather request them.) but I think I see what you mean now.
> > > > Ok
> > > > >
> > > > > > these PRs can be saved in some file, and can be loaded
> > > > > > when cache_module comes in.
> > > > > > >
> > > > > > > > >Are they somehow preconfigured at the SM ?
> > > > > > > > I am not sure about SM has any such provision?
> > > > > > >
> > > > > > > Not that I'm aware of.
> > > > > > Ok, So, currently no such support is there in SM?
> > > > >
> > > > > I can speak definitively for OpenSM and there is no such support. As to
> > > > > the vendor SMs, I don't think so but don't know for absolute certainty.
> > > > > Someone can correct me if I'm wrong but I wouldn't assume no response
> > > > > means correctness as some may not be listening nor want to respond as to
> > > > > "value added" vendor specific features.
> > > > What is the issue if OpenSM provides this?
> > >
> > > I'm not following you. What does/should OpenSM provide ? OpenIB works in
> > > configurations with other SMs.
> > I am talking about pre-configuring PRs in OpenSM DB.
>
> How does that help ? Why would PRs need to be preconfigured at the SM ?
> Do you mean preconfigure the routing tables (and generate the PRs from
> that) ? What problem is being solved on the SM side ?
I just asked out of curiosity... nothing special. :)
>
> > > > > > > > Also not sure about the
> > > > > > > > role of the SM in path resolution. I mean, once a node has initiated an
> > > > > > > > SA query, does the SM have some database to answer from, or is the
> > > > > > > > destination node contacted on the fly to get the requested path record?
> > > > > > >
> > > > > > > SMs can either calculate the SA PRs on the fly based on the routing
> > > > > > > algorithm in use and some other things or put them in a local database.
> > > > > > > This is up to that SM.
> > > > > > Ok
> > > > > > >
> > > > > > > Destination node is not contacted in the SA PR query process.
> > > > > > >
> > > > > > > > >Doesn't each SM have its own policy for generating valid PRs ?
> > > > > > > > Ultimately path record is in Path_Record object format, and SA cache
> > > > > > > > is going to store in a fixed manner, How generation policy matters?
> > > > > > >
> > > > > > > What if the local policy loaded does not agree with what the SM would
> > > > > > > generate for a particular PR ? One then gets a local error which will
> > > > > > > need to be tracked down. Not so easy IMO.
> > > > > > SM policies in a subnet to generate PRs, changes dynamically? at run time?
> > > > >
> > > > > The policy doesn't change dynamically but the data to be returned in the
> > > > > SA PR response might.
> > > > >
> > > > > > if Not then depending on the local SM policy static PR can be
> > > > > > generated to load initially.
> > > > >
> > > > > Just as one question related to this, how would link failures be handled
> > > > > ? There are others.
> > > > Its just a matter of avoiding initial PR query packets by loading the
> > > > cache with static PRs.....Later on cache module will function in
> > > > normal fashion. I expect, initially every thing will come up in a
> > > > trusted cluster.
> > >
> > > So you're saying the cache would still react to GIDs out and in service,
> > > right ?
> > I am not about what GIDs in out service....
>
> Why not ?
Actually, that was a typing mistake... I was trying to say that I am not
sure what GID out of and in service means.
>
> > but what I mean to say is,
> > Once sa_cache is programmed with some static PRs....it will avoid
> > initial cache_update step and after first time out normal
> > update_cache() will be initiated using SA MADs.
>
> How would the client know what PRs to request when that timeout first
> occurs ? There's no get all except these semantics. If it is all PRs,
> what does that save ?
I think my statement has confused you again... sorry, my fault. By "after
first time out normal update_cache() will be initiated using SA MADs" I
mean that after the first timeout only the requested PR will be
resolved, not all of them.
>
> > > If the cache is loaded from a file, does it bypass querying the SA
> > > initially for PRs ?
> > Yes, it will, and hence it reduces the initial SA traffic generated on a
> > big cluster. Just imagine: the cluster is quite big and every node is
> > trying to build its cache initially. That will create a large burst of SA
> > packets.
> > >If that is the case, then the file is required to be
> > > the full set of PRs for this node otherwise there would be incomplete
> > > connectivity.
> > Yes, correct. Generating these PRs is the next issue I want to
> > discuss. Maybe this can be done by the admin on every node using the
> > read() entry point provided by the char_dev interface of the sa_cache module.
> > The read entry point will simply extract PRs from the cache itself.
> >
> > Incomplete connectivity will last until the first PR is requested for that
> > destination, because on a cache miss the application is anyhow going
> > to initiate an ib_sa_get_path_rec() and the resolved PR will be added to the
> > cache for future reference.
>
> OK then this becomes an on demand model for those destinations (at least
> initially).
By "on demand" do you mean a normal cluster without a cache? If yes,
then it will be an on-demand PR resolve model for those incomplete paths.
>
> -- Hal
>
> > > -- Hal
> > >
> > > > > > > > CMIIW. Also I am assuming a homogeneous cluster where certain
> > > > > > > > parameters can be assumed to be same always.
> > > > > > >
> > > > > > > and always in agreement with what the SM would return ? For example,
> > > > > > yes
> > > > > > > what happens when a link goes down and the end node is no longer
> > > > > > > reachable ?
> > > > > > If node is not reachable then, after first timeout of sa_cache, that
> > > > > > entry will be removed from cache.
> > > > >
> > > > > OK; that's another aspect to add into this feature. I don't think that
> > > > > is currently done. I think there would need to be an API added to do
> > > > > this.
> > > > Yes, this has been discussed with Sean, we can add one char_dev
> > > > interface to the existing  sa_cache module implementation, Write entry
> > > > point will generate a SA_PR_response packet and this packet will be
> > > > passed to update_cache() function.
> > > >
> > > > Also we need to remove the initial schedule_update() call in the
> > > > add_one() function.
> > > > One user command is also required to read from user file and write
> > > > onto this device.
> > > > >
> > > > > -- Hal
> > > > >
> > > > > > > > >are these from a live SM and just loaded "out of band" to
> > > > > > > > bypass/preclude the SA PR >mechanism ?
> > > > > > > > may be
> > > > > > >
> > > > > > > Even if they are, there is still the changes in the subnet issue.
> > > > > > >
> > > > > > > -- Hal
> > > > > > >
> > > > > > > > > -- Hal
> > > > > > > > >
> > > > > > > > > >  Admin is loading this info in the cache with user command.
> > > > > > > > > > >
> > > > > > > > > > > > Another point I want to know is,
> > > > > > > > > > > > When local_sa_cache module will be inserted? After SM comes up or
> > > > > > > > > > > > Before SM comes up?
> > > > > > > > > > >
> > > > > > > > > > > It can occur either way.  There is no restriction.  The cache responds
> > > > > > > > > > > to port up and GID in/out of service events to update itself.
> > > > > > > > > > Do you mean cache module will start building cache only after Port is UP?
> > > > > > > > > > >
> > > > > > > > > > > > > If it is inserted before the SM comes up (I am assuming the SM is running on
> > > > > > > > > > > > > some node, not on a switch), then the first forced schedule_update() is
> > > > > > > > > > > > > wasted, and for the first application the presence of the cache is
> > > > > > > > > > > > > meaningless. Why not keep the cache effective right from the start?
> > > > > > > > > > >
> > > > > > > > > > > Pre-loading the cache with path records doesn't guarantee that those
> > > > > > > > > > > paths are usable.  If the SM has not come up, then the path records will
> > > > > > > > > > > be unusable until the SM configures the subnet, plus there's no
> > > > > > > > > > > guarantee that the remote endpoints specified by the paths are running.
> > > > > > > > > > You mean there is no guarantee that even if SM is UP and we have some
> > > > > > > > > > hard coded entries of path record corresponding to some node X, we are
> > > > > > > > > > not sure that node X has actually come up or not?  In that case
> > > > > > > > > > actually that path resolving should fail if node has not come up, but
> > > > > > > > > > with the hard coding still path will be resolved?
> > > > > > > > > > >
> > > > > > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms
> > > > > > > > > > > when booting a large cluster.
> > > > > > > > > > that's true. Also cache will get valid entries only if network is
> > > > > > > > > > configured by SM otherwise every node SA will, possibly, drop SA
> > > > > > > > > > packets.
> > > > > > > > > > >
> > > > > > > > > > > - Sean
> > > > > > > > > > >
> > > > > > > > > > _______________________________________________
> > > > > > > > > > general mailing list
> > > > > > > > > > general at lists.openfabrics.org
> > > > > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > > > > > > >
> > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > >
> > >
>
>
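
A minimal sketch of the preloaded-cache behaviour described in this thread,
under stated assumptions: struct path_rec is a simplified, hypothetical subset
of a path record, and pr_cache, cache_preload, cache_resolve, cache_timeout
and fake_sa_query are illustrative names, not the sa_cache module's API. It
shows static entries answering lookups without SA traffic, a cache miss
falling back to an on-demand query, and a timeout dropping the static entries
so later lookups are resolved again on demand:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 8

/* Simplified, hypothetical subset of a path record. */
struct path_rec {
    uint16_t dlid;  /* destination LID, used as the lookup key */
    uint8_t  sl;
    uint8_t  mtu;
};

struct pr_cache {
    struct path_rec slot[CACHE_SLOTS];
    int             valid[CACHE_SLOTS];
    unsigned        sa_queries;  /* lookups that had to go to the SA */
};

/* Callback standing in for a real SA path record query. */
typedef int (*sa_query_fn)(uint16_t dlid, struct path_rec *out);

static void cache_insert(struct pr_cache *c, const struct path_rec *pr)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!c->valid[i]) {
            c->slot[i] = *pr;
            c->valid[i] = 1;
            return;
        }
    }
    /* Cache full: the record is simply not cached. */
}

/* Start with an empty cache, then load the records the admin supplied. */
void cache_preload(struct pr_cache *c, const struct path_rec *recs, size_t n)
{
    memset(c, 0, sizeof(*c));
    for (size_t i = 0; i < n; i++)
        cache_insert(c, &recs[i]);
}

/* Timeout: drop every entry so stale static records do not linger. */
void cache_timeout(struct pr_cache *c)
{
    memset(c->valid, 0, sizeof(c->valid));
}

/* Hit: answer locally. Miss: query the SA on demand and keep the result. */
int cache_resolve(struct pr_cache *c, uint16_t dlid, sa_query_fn query,
                  struct path_rec *out)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->valid[i] && c->slot[i].dlid == dlid) {
            *out = c->slot[i];
            return 0;
        }
    }
    c->sa_queries++;
    if (query(dlid, out) != 0)
        return -1;
    cache_insert(c, out);
    return 0;
}

/* Stand-in SA used only for illustration: resolves every non-zero LID. */
int fake_sa_query(uint16_t dlid, struct path_rec *out)
{
    if (dlid == 0)
        return -1;
    out->dlid = dlid;
    out->sl = 0;
    out->mtu = 4;
    return 0;
}
```

The sketch does not model the concerns raised above about preloaded records
disagreeing with what the SM would return, or about link failures before the
first timeout.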


From caitlinb at broadcom.com  Fri May 25 07:34:57 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Fri, 25 May 2007 07:34:57 -0700
Subject: [ofa-general] RDMA write completion question
References: <20070524141928.GI20691@minantech.com>
	<309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
	<46564319.4040800@opengridcomputing.com>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D0306ED58@NT-IRVA-0750.brcm.ad.broadcom.com>


Steve Wise Wrote:

-----Original Message-----

Devesh Sharma wrote:
> On 5/24/07, Gleb Natapov <glebn at voltaire.com> wrote:
>> Hi,
>>
>>  Does a local RDMA write completion guarantee that the data that was
>> RDMAed is already accessible in the destination host's _memory_?
> Local RDMA write completion guarantees that the data you have RDMAed
> has been copied into the remote buffer, without any data corruption.
>>
For iWARP, the local write completion simply means you can reuse the 
local buffer and the transport will deliver it or kill the 
connection.  The data _could_ be queued in the local RNIC and anywhere 
else in the TCP cloud.


_______________________________________________


And the only real difference with InfiniBand is that the uncertainty
cloud is limited to the gap between the HCA and the application.
Protocol designers can debate the tradeoffs InfiniBand makes to
achieve that, but the important thing for application designers is
that "smaller" is not "zero".

Generally, once a Send has been posted, all other interactions
with the remote peer over the same connection can assume that the prior
actions have completed, but if your application needs an absolute
guarantee that something has happened (for checkpointing or other
purposes) then you really can only rely on a peer-to-peer message.
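
A minimal sketch of that distinction, assuming a sender that tracks two
separate facts per transfer (the event names, struct xfer_state and the
helper functions are hypothetical and do not correspond to any verbs API).
The local completion only releases the source buffer; only an explicit
message from the peer is treated as proof that the data arrived:

```c
#include <stdbool.h>

/* Hypothetical events seen by the sender; names are illustrative only. */
enum xfer_event {
    EV_LOCAL_WRITE_COMPLETION,  /* work completion for the RDMA write */
    EV_PEER_ACK_MESSAGE         /* application-level message from the peer */
};

struct xfer_state {
    bool local_buffer_reusable;  /* safe to overwrite the source buffer */
    bool remote_data_confirmed;  /* the peer has said the data arrived */
};

void xfer_init(struct xfer_state *s)
{
    s->local_buffer_reusable = false;
    s->remote_data_confirmed = false;
}

void xfer_on_event(struct xfer_state *s, enum xfer_event ev)
{
    switch (ev) {
    case EV_LOCAL_WRITE_COMPLETION:
        /* Says nothing about the remote host's memory. */
        s->local_buffer_reusable = true;
        break;
    case EV_PEER_ACK_MESSAGE:
        /* Only the peer can confirm that the data is in place. */
        s->remote_data_confirmed = true;
        break;
    }
}

/* A checkpoint may rely on the transfer only after the peer confirmed it. */
bool xfer_safe_to_checkpoint(const struct xfer_state *s)
{
    return s->remote_data_confirmed;
}
```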


From wombat2 at us.ibm.com  Fri May 25 08:03:57 2007
From: wombat2 at us.ibm.com (Bernard King-Smith)
Date: Fri, 25 May 2007 11:03:57 -0400
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <20070525135325.E84E5E6083B@openfabrics.org>
Message-ID: <OF0D0A8C66.8D50A326-ON852572E6.0051EADE-852572E6.0052DB29@us.ibm.com>

Roland Dreier wrote:

> > > It would be nice, but to handle increasing the MTU, you need some way
> > > to handle the receives you already posted (which would be too small
> > > all of a sudden).
>
> > Can you expand on this a little more -I do not catch the drift.
>
> OK, suppose I configure the interface with a 16K MTU.  I assume your
> plan would be to queue up a bunch of 16K receives.  Now suppose I
> change the MTU to 64K.  What do you do about the receives you already
> queued up that can't handle the new MTU?
>

When you change the MTU, you have to build a new set of receive buffers at
the new MTU before advertising the new MTU, and associate the new
buffers and their structures with the current interfaces. This leaves
the old buffers and structures to be handled appropriately by separate
threads. When all the older buffers are released/returned, you tear everything
down and terminate the threads associated with the old MTU. If you find that a
set of receive buffers is empty when starting to change the MTU, you can
go immediately to the new size. This minimizes the memory needed during
the change in MTU.
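
A minimal sketch of the hand-over described above, in plain C with no
verbs calls. All names here (rx_pool, pool_create, pool_put,
pool_change_mtu) are hypothetical and only illustrate the bookkeeping;
a real driver would post and repost the buffers to a QP instead of just
counting them.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: one pool of receive buffers per MTU generation. */
struct rx_pool {
	size_t buf_size;    /* size of each receive buffer (the MTU) */
	int    outstanding; /* buffers still posted or held by the stack */
	int    retired;     /* set once a newer pool has replaced this one */
};

static struct rx_pool *pool_create(size_t mtu, int nbufs)
{
	struct rx_pool *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->buf_size = mtu;
	p->outstanding = nbufs;
	p->retired = 0;
	return p;
}

/*
 * Return one buffer to its pool.  A retired pool is freed when its last
 * buffer comes back.  Returns 1 if the pool was torn down, 0 otherwise.
 */
static int pool_put(struct rx_pool *p)
{
	assert(p->outstanding > 0);
	if (--p->outstanding == 0 && p->retired) {
		free(p);
		return 1;
	}
	return 0;
}

/*
 * Build the new pool first, then retire the old one.  An old pool that
 * is already empty is freed immediately (*old_freed is set to 1).
 * Returns the new current pool, or the old one if allocation failed.
 */
static struct rx_pool *pool_change_mtu(struct rx_pool *old, size_t new_mtu,
				       int nbufs, int *old_freed)
{
	struct rx_pool *new_pool = pool_create(new_mtu, nbufs);

	*old_freed = 0;
	if (!new_pool)
		return old;
	old->retired = 1;
	if (old->outstanding == 0) {
		free(old);
		*old_freed = 1;
	}
	return new_pool;
}
```

The new pool is allocated before the old one is retired, so there is
never a moment without usable receive buffers; the cost is that both
generations coexist until the old buffers drain.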

As soon as you change the local interface to a different MTU, it will be a
while before the remote connections find out and change the MTU they send.

What happens to Ethernet when you turn on or off jumboframes?

>  - R.
>

Bernie King-Smith 
IBM Corporation
Server Group
Cluster System Performance 
wombat2 at us.ibm.com    (845)433-8483
Tie. 293-8483 or wombat2 on NOTES 

"We are not responsible for the world we are born into, only for the world 
we leave when we die.
So we have to accept what has gone before us and work to change the only 
thing we can,
-- The Future." William Shatne

From pradeeps at linux.vnet.ibm.com  Fri May 25 13:01:29 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Fri, 25 May 2007 13:01:29 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <20070524053819.GF6019@mellanox.co.il>
References: <46537081.30906@linux.vnet.ibm.com>
	<20070524053819.GF6019@mellanox.co.il>
Message-ID: <46574099.3090601@linux.vnet.ibm.com>

Michael S. Tsirkin wrote:
>> Here are my thoughts about limiting the memory footprint for IPOIB CM
>> (NOSRQ) patch:
>>
>> By default, cap the NOSRQ memory usage to 1GB.
> 
> ppc systems I have, start crashing if you map as much as 300MB for DMA.

If PPC systems start crashing when you map as much as 300MB, then yes
that would be a gating factor when you deploy this patch for certain
configurations. Then MPI applications (on UD) allocating more than 300
MB should be crashing the systems even without this patch -right?

This is a separate problem and clouds this discussion. Please post the
problem on the ppc/ppc64 mailing list.

> 
>> The default recvq_size
>> is set to 128. Therefore for 64KB packets this would imply a maximum of
>> 128 endpoints.
>>
>> -Make the maximum number of endpoints a module parameter with a default
>> value of 128.
>>
>> -The NOSRQ limit of 1GB is also made a module parameter. However, 1GB is
>> the default limit and could be changed as needed (by the administrator)
>> depending on the system configuration, application needs and so on.
> 
> All this need for manual tuning is really going in the wrong direction:
> we should be looking for ways to get rid of existing module
> parameters, like using low watermark event to dynamically tune the RQ
> depth.
> 
>> The
>> server would return a "REJ" message upon receiving a "REQ", whenever one
>> of these limits (i.e. max number of endpoints or the max NOSRQ memory
>> usage) is reached. Currently, we only check for the maximum number of
>> endpoints -hard coded to 1024.
> 
> So with limit sufficiently low, we hopefully will avoid crashing the server.
> That's a progress, but what happens to the client when it gets this reject?


> 
>> -The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that
>> support SRQ like the Topspin HCA and, such HCAs should not be
>> impacted at all.
> 
> I don't think it's that clean yet.
> 
> Here's an idea: implement "fake SRQ" for ehca in software: make post recv on srq
> queue the WR, spread them evenly between QPs as they connect.  Once # of QPs
> goes above some limit, create QP command will fail.  This would contain the mess
> nicely inside ehca (I think you'll want to add a flag that lets software
> figure out that SRQ is fake).
> 
> We will still be left with the basic problem of what to do
> at the active side upon the reject, though.

As you indicate this will not solve the problem, so it is not an option.

> 
>> -Currently we allocate a default of 64KB for the ring buffer elements,
>> and this buffer size is not linked to the mtu. In the future, we could
>> allocate buffers based on the mtu and link that into the computation of
>> the memory cap. This way customers who might want to use a smaller mtu
>> could use a larger number of endpoints, or a larger recvq_size without
>> exceeding the memory cap.
> 
> I think that conceptually, global MTU config is intended for outgoing packets,
> not for the RX buffers. For example, how would we handle MTU changes?
> 
>> Would this approach address the issues of scalability and enable IPOIB
>> CM to be turned as the default?
> 
> For IPoIB CM to be the default, it needs to work as well as datagram mode for
> most usage scenarios. Unfortunately, your proposal above seems to fail to
> satisfy this requirement: it will improve speed in some scenarios,
> but will either increase the need for manual configuration drastically
> or cause denial of service or use up a huge amount of memory,
> in others.

My viewpoint is that this is akin to a QoS issue. If the request cannot
be satisfied then return an error to the user level application and let
the application decide what to do. Don't do anything under the covers.

I have tried to explain that this corner case you cite will be
encountered by PPC systems using IBM HCA only. The rest will be
unaffected. The PPC systems deployed as cluster nodes are unlikely to
encounter this problem.

However, we seem to have ideas that are at the opposite end of the
spectrum and any amount of debate will not resolve this issue. One
idea to move this discussion forward is to implement both options in the
corner case and let the user/sys admin choose:
a) return error to user level application and leave it to the
   application when there are no more RC QPs
b) switch the active side to using datagram mode when there are no
   RC QPs.

If we choose to go this route then that will mean we need yet another
module parameter to let the user decide, or worse compile time options -
neither of which is palatable. Any other suggestions?

If we can agree upon this approach I will start another thread to
discuss just this corner case and with this patch (or a minor variant)
permit IPOIB CM to be used as the default. I do not want the corner
case to be the gating factor for this patch.
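
For reference, the memory-cap arithmetic from the proposal quoted above
(recvq_size of 128 entries, 64KB per entry, 1GB cap, hence 128
endpoints) can be sketched as a stand-alone check. This is only an
illustration under those stated numbers; the struct and function names
are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical limits mirroring the proposed module parameters. */
struct nosrq_limits {
	uint64_t mem_cap;       /* total bytes allowed for receive rings */
	unsigned max_endpoints; /* maximum number of RC connections */
	unsigned recvq_size;    /* receive ring entries per connection */
	uint64_t buf_size;      /* bytes per ring entry */
};

/* Bytes of receive buffers that one connection pins. */
static uint64_t nosrq_conn_bytes(const struct nosrq_limits *l)
{
	return (uint64_t)l->recvq_size * l->buf_size;
}

/*
 * Decide how to answer an incoming REQ when 'active' connections
 * already exist: 1 means accept, 0 means answer with REJ.
 */
static int nosrq_admit(const struct nosrq_limits *l, unsigned active)
{
	if (active >= l->max_endpoints)
		return 0;
	return (uint64_t)(active + 1) * nosrq_conn_bytes(l) <= l->mem_cap;
}
```

With the defaults, each connection pins 128 * 64KB = 8MB, so the 1GB
cap is reached at exactly 128 connections. If the buffer size were tied
to a smaller MTU, the same cap would admit proportionally more
endpoints, which is the trade-off described in the proposal.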

Pradeep


From jwong at datallegro.com  Fri May 25 13:12:02 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Fri, 25 May 2007 16:12:02 -0400
Subject: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with kernel
	2-6.18-8.1.4.el5
In-Reply-To: <1E3DCD1C63492545881FACB6063A57C1D524A3@mtiexch01.mti.com>
Message-ID: <A382D4292574EB47A85B8159A6AED1A101498068@FPNYEXCBE02.opus-i.corp>

Hello,

I am installing the OFED 1.2-rc3.

Everything else builds except for ib-bonding.  

 
Thanks in advance.

 
I am getting the following error messages:

+ make -C /lib/modules/2.6.18-8.1.4.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding
make: Entering directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64'
  CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o
In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:78:
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_inactive_flags':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: (Each undeclared identifier is reported only once
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: for each function it appears in.)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_active_flags':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_compute_features':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1233: warning: comparison of distinct pointer types lacks a cast
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_enslave':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1449: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_release':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1848: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_arp_rcv':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:2548: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_netdev_event':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:3390: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_init':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4374: warning: assignment discards qualifiers from pointer target type
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4386: error: 'IFF_BONDING' undeclared (first use in this function)
make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o] Error 1
make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding] Error 2
make: Leaving directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64'
+ echo ' Building  IB bonding driver failed'
 Building  IB bonding driver failed
+ exit 1
error: Bad exit status from /var/tmp/rpm-tmp.23876 (%build)

 
Jeff Wong

 

From krause at cup.hp.com  Fri May 25 14:03:08 2007
From: krause at cup.hp.com (Michael Krause)
Date: Fri, 25 May 2007 14:03:08 -0700
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <1EF1E44200D82B47BD5BA61171E8CE9D0306ED58@NT-IRVA-0750.brcm
	.ad.broadcom.com>
References: <20070524141928.GI20691@minantech.com>
	<309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
	<46564319.4040800@opengridcomputing.com>
	<1EF1E44200D82B47BD5BA61171E8CE9D0306ED58@NT-IRVA-0750.brcm.ad.broadcom.com>
Message-ID: <6.2.0.14.2.20070525140103.0316e610@esmail.cup.hp.com>

At 07:34 AM 5/25/2007, Caitlin Bestler wrote:

>Steve Wise Wrote:
>
>-----Original Message-----
>
>Devesh Sharma wrote:
> > On 5/24/07, Gleb Natapov <glebn at voltaire.com> wrote:
> >> Hi,
> >>
> >>  Does local RDMA write completion guarantee that data that was
> >> RDMAed is already accessible in a destination's host _memory_?
> > Local RDMA write completion guarantees that the data you have RDMAed
> > has been copied into the remote buffer, without any data corruption.
> >>
>For iWARP, the local write completion simply means you can reuse the
>local buffer and the transport will deliver it or kill the
>connection.  The data _could_ be queued in the local rnic and anywhere
>else in the tcp cloud.
>
>
>_______________________________________________
>
>
>And the only real difference with InfiniBand is that the uncertainty
>cloud is limited to the gap between the HCA and the application.
>Protocol designers can debate the tradeoffs InfiniBand takes to
>achieve that, but the important thing to Application Designers is
>that "smaller" is not "zero".
>
>Generally, once a Send has been posted, all other interactions
>with the remote peer over the same connection can assume that the prior
>actions have completed, but if your application needs an absolute
>guarantee that something has happened (for checkpointing or other
>purposes) then you really can only rely on a peer-to-peer message.

The peer-to-peer message may be either an RDMA Read, which is not visible
to the application / ULP, or an application / ULP interaction.  As a
general guideline, application developers should not assume that
end-to-end data integrity or delivery is guaranteed by the hardware,
and should take appropriate steps to design their communication pattern to
validate that the data was correctly exchanged.  This is not difficult and
is often built into many ULPs or applications already.
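
One possible shape of such a validation step, sketched in plain C with
no verbs calls: the writer follows its RDMA Writes with a small notice
carrying the length and a checksum, and the peer checks the bytes it
actually sees before acknowledging. The notice layout and the function
names are hypothetical, and Adler-32 merely stands in for whatever
check a given ULP already uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adler-32 checksum over 'len' bytes of 'buf'. */
static uint32_t payload_sum(const uint8_t *buf, size_t len)
{
	uint32_t a = 1, b = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		a = (a + buf[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}

/* Hypothetical notice the writer sends after its RDMA Writes. */
struct xfer_notice {
	uint32_t len; /* number of bytes written */
	uint32_t sum; /* payload_sum() of those bytes at the source */
};

/*
 * Receiver side: returns 1 only if the bytes now visible in 'dst'
 * match the notice.  The receiver would send its own reply message
 * (the peer-to-peer acknowledgement) only after this succeeds.
 */
static int xfer_validate(const uint8_t *dst, const struct xfer_notice *n)
{
	return payload_sum(dst, n->len) == n->sum;
}
```

The writer treats the transfer as done only when the reply arrives,
rather than when its local write completion is reported.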

Mike




From krause at cup.hp.com  Fri May 25 13:59:59 2007
From: krause at cup.hp.com (Michael Krause)
Date: Fri, 25 May 2007 13:59:59 -0700
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <46564319.4040800@opengridcomputing.com>
References: <20070524141928.GI20691@minantech.com>
	<309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
	<46564319.4040800@opengridcomputing.com>
Message-ID: <6.2.0.14.2.20070525135739.03544150@esmail.cup.hp.com>

At 06:59 PM 5/24/2007, Steve Wise wrote:


>Devesh Sharma wrote:
>>On 5/24/07, Gleb Natapov <glebn at voltaire.com> wrote:
>>>Hi,
>>>
>>>  Does local RDMA write completion guarantee that data that was RDMAed is
>>>already accessible in a destination's host _memory_?
>>Local RDMA write completion guarantees that the data you have RDMAed
>>has been copied into the remote buffer, without any data corruption.
>For iWARP, the local write completion simply means you can reuse the local
>buffer and the transport will deliver it or kill the connection.  The
>data _could_ be queued in the local rnic and anywhere else in the tcp cloud.

This is where IB and iWARP do differ slightly.  With iWARP, nothing may
have been transmitted to the remote RNIC yet, while in IB a completion
should equate to the data being received by the remote CA.

As a general point though, the source RNIC is unlikely to issue an RDMA
Write completion if it has only injected the packet into the Ethernet
fabric.  It may issue the completion prior to injection, or in response
to a TCP ACK indicating the RDMA Write was received by the remote RNIC,
but I doubt any do something in between.

Mike 


From jimmy at hillraiser.com  Fri May 25 14:22:14 2007
From: jimmy at hillraiser.com (=?iso-8859-1?Q?Jimmy=20Hill?=)
Date: Fri, 25 May 2007 21:22:14 +0000
Subject: [ofa-general] ibv_get_cq_event blocking forever after successful
	ibv_post_send...
Message-ID: <20070525212214.20500.qmail@station183.com>


I have verbs code that is modeled after the first usage model described on the ibv_get_cq_event() man page. That is, I have created all the verbs resources (e.g., completion channel, QP, CQ, etc.) and then followed the sequence of:

ibv_req_notify_cq(cq, 0);

ibv_post_send(qp, &work_req, &bad_work_req);

ibv_get_cq_event(channel, &ev_cq, &ev_ctx);

ibv_ack_cq_events(ev_cq, 1);

ibv_req_notify_cq(cq, 0);

ibv_poll_cq(cq, 1, &wc);  // loop to drain - but due to upper protocol, will only ever be 1 at a time


The QP is created with the following attributes:
    qp_init_attr.qp_context          = &this->conn_ref;
    qp_init_attr.send_cq             = this->send_cq;
    qp_init_attr.recv_cq             = this->recv_cq;
    qp_init_attr.srq                 = NULL;
    qp_init_attr.cap.max_send_wr     = 128;
    qp_init_attr.cap.max_recv_wr     = 4;
    qp_init_attr.cap.max_send_sge    = 16;
    qp_init_attr.cap.max_recv_sge    = 4;
    qp_init_attr.cap.max_inline_data = 0;
    qp_init_attr.qp_type             = IBV_QPT_RC;
    qp_init_attr.sq_sig_all          = 0;
// I have also used sq_sig_all set to 1 and then removed the SIGNALED flag in the send request

The Send request (RDMA Write) is formatted as:
    sge.lkey   = response_mr->lkey;
    sge.addr   = response;
    sge.length = 256;

    send_work_req.opcode              = IBV_WR_RDMA_WRITE;
    send_work_req.next                = NULL;
    send_work_req.sg_list             = &sge;
    send_work_req.num_sge             = 1;
    send_work_req.wr_id               = 0;
    send_work_req.imm_data            = 0;
    send_work_req.wr.rdma.remote_addr = client_rmr->addr;
    send_work_req.wr.rdma.rkey        = client_rmr->rkey;
    send_work_req.send_flags          = IBV_SEND_SIGNALED;
// I have used IBV_SEND_SIGNALED and IBV_SEND_SIGNALED | IBV_SEND_FENCE

This QP will be used to RDMA Write a response back to a client. With the current setup, only one RDMA write will be outstanding per QP at any given time. That is, I issue the RDMA Write and wait for its completion prior to continuing processing. The eventual goal is to request and process a completion event every "n" RDMA Writes.
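
For the "every n writes" goal, the bookkeeping might look like the
sketch below. It is plain C with no verbs calls, and it assumes the
usual behaviour that send queue slots used by unsignaled work requests
are reclaimed only when a later signaled completion on the same QP is
polled. The struct and function names are hypothetical; "depth" stands
for max_send_wr and "interval" for n.

```c
#include <assert.h>

/* Hypothetical tracker for signaling every n-th send work request. */
struct sq_tracker {
	unsigned depth;     /* send queue depth (max_send_wr) */
	unsigned interval;  /* request a completion every n-th WR */
	unsigned in_flight; /* WRs posted but not yet reclaimed */
	unsigned since_sig; /* unsignaled WRs since the last signaled one */
};

/*
 * Call before posting a WR.  Returns -1 if the send queue is full,
 * 1 if this WR should carry the signaled flag, 0 otherwise.
 */
static int sq_next_post(struct sq_tracker *t)
{
	assert(t->interval >= 1 && t->interval <= t->depth);
	if (t->in_flight == t->depth)
		return -1;
	t->in_flight++;
	if (++t->since_sig == t->interval) {
		t->since_sig = 0;
		return 1;
	}
	return 0;
}

/*
 * Call when the completion of a signaled WR has been polled: that WR
 * and the (interval - 1) unsignaled WRs before it are finished.
 */
static void sq_on_completion(struct sq_tracker *t)
{
	assert(t->in_flight >= t->interval);
	t->in_flight -= t->interval;
}
```

Note that n must not exceed the queue depth: otherwise the queue fills
with unsignaled requests before any completion is ever requested, and
nothing can free the slots.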

The current problem is that everything runs along fine and then I end up in a situation where I block forever on the ibv_get_cq_event() call. The ibv_post_send() just prior to the ibv_get_cq_event() call returned "0" indicating that it successfully processed the command. However, the completion event for that operation never arrives. The data associated with that RDMA write does not appear on the client side, so it seems that even though the ibv_post_send() reported success, it really did not successfully process the request.

In order to debug the problem, I changed the completion channel to non-blocking and put the ibv_get_cq_event() call in a loop and dumped out the number of passes through the loop (i.e., number of calls to ibv_get_cq_event()) prior to the arrival of an event (good status from the call). When all is working fine, it only takes one or two calls for the event to arrive. When I encounter the situation where it blocked forever, it loops forever calling ibv_get_cq_event(). I added a counter there and after a large (e.g., 500) number of retries, I looped back up and tried the ibv_post_send() again. For the most part, the request makes it out the second time. But, given enough time, the send queue work request entries are consumed. That is, if I lower the max_send_wr attribute to 10, after 10 failed event collection attempts and ibv_post_send() retries, the 11th ibv_post_send() will fail with a -1 status code. So, the work request entries are not leaving the send queue.

Any ideas on why the ibv_get_cq_event() would never see an event after a "successful" send requesting a completion event?

thanks,
jimmy


From rdreier at cisco.com  Fri May 25 15:06:06 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 25 May 2007 15:06:06 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <adahcq036ep.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will get a few more 2.6.22-rc2 fixes:

Eli Cohen (1):
      IB/mlx4: Initialize send queue entry ownership bits

Michael S. Tsirkin (2):
      IPoIB/cm: Fix timeout check in ipoib_cm_dev_stop()
      IPoIB/cm: Drain cq in ipoib_cm_dev_stop()

Roland Dreier (1):
      IB/mlx4: Don't allocate RQ doorbell if using SRQ

Stefan Roscher (1):
      IB/ehca: Fix number of send WRs reported for new QP

 drivers/infiniband/hw/ehca/hcp_if.c     |    2 +-
 drivers/infiniband/hw/mlx4/qp.c         |   59 +++++++++++++++++++-----------
 drivers/infiniband/ulp/ipoib/ipoib.h    |    1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |    3 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |   31 ++++++++++------
 5 files changed, 60 insertions(+), 36 deletions(-)


diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index 7f0beec..5766ae3 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -331,7 +331,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
 				0);
 	qp->ipz_qp_handle.handle = outs[0];
 	qp->real_qp_num = (u32)outs[1];
-	parms->act_nr_send_sges =
+	parms->act_nr_send_wqes =
 		(u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_SEND_WR, outs[2]);
 	parms->act_nr_recv_wqes =
 		(u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_RECV_WR, outs[2]);
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index a824bc5..dc137de 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -270,9 +270,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			    struct ib_qp_init_attr *init_attr,
 			    struct ib_udata *udata, int sqpn, struct mlx4_ib_qp *qp)
 {
-	struct mlx4_wqe_ctrl_seg *ctrl;
 	int err;
-	int i;
 
 	mutex_init(&qp->mutex);
 	spin_lock_init(&qp->sq.lock);
@@ -319,20 +317,24 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 		if (err)
 			goto err_mtt;
 
-		err = mlx4_ib_db_map_user(to_mucontext(pd->uobject->context),
-					  ucmd.db_addr, &qp->db);
-		if (err)
-			goto err_mtt;
+		if (!init_attr->srq) {
+			err = mlx4_ib_db_map_user(to_mucontext(pd->uobject->context),
+						  ucmd.db_addr, &qp->db);
+			if (err)
+				goto err_mtt;
+		}
 	} else {
 		err = set_kernel_sq_size(dev, &init_attr->cap, init_attr->qp_type, qp);
 		if (err)
 			goto err;
 
-		err = mlx4_ib_db_alloc(dev, &qp->db, 0);
-		if (err)
-			goto err;
+		if (!init_attr->srq) {
+			err = mlx4_ib_db_alloc(dev, &qp->db, 0);
+			if (err)
+				goto err;
 
-		*qp->db.db = 0;
+			*qp->db.db = 0;
+		}
 
 		if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf)) {
 			err = -ENOMEM;
@@ -348,11 +350,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 		if (err)
 			goto err_mtt;
 
-		for (i = 0; i < qp->sq.max; ++i) {
-			ctrl = get_send_wqe(qp, i);
-			ctrl->owner_opcode = cpu_to_be32(1 << 31);
-		}
-
 		qp->sq.wrid  = kmalloc(qp->sq.max * sizeof (u64), GFP_KERNEL);
 		qp->rq.wrid  = kmalloc(qp->rq.max * sizeof (u64), GFP_KERNEL);
 
@@ -386,7 +383,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 	return 0;
 
 err_wrid:
-	if (pd->uobject)
+	if (pd->uobject && !init_attr->srq)
 		mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), &qp->db);
 	else {
 		kfree(qp->sq.wrid);
@@ -403,7 +400,7 @@ err_buf:
 		mlx4_buf_free(dev->dev, qp->buf_size, &qp->buf);
 
 err_db:
-	if (!pd->uobject)
+	if (!pd->uobject && !init_attr->srq)
 		mlx4_ib_db_free(dev, &qp->db);
 
 err:
@@ -481,14 +478,16 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
 	mlx4_mtt_cleanup(dev->dev, &qp->mtt);
 
 	if (is_user) {
-		mlx4_ib_db_unmap_user(to_mucontext(qp->ibqp.uobject->context),
-				      &qp->db);
+		if (!qp->ibqp.srq)
+			mlx4_ib_db_unmap_user(to_mucontext(qp->ibqp.uobject->context),
+					      &qp->db);
 		ib_umem_release(qp->umem);
 	} else {
 		kfree(qp->sq.wrid);
 		kfree(qp->rq.wrid);
 		mlx4_buf_free(dev->dev, qp->buf_size, &qp->buf);
-		mlx4_ib_db_free(dev, &qp->db);
+		if (!qp->ibqp.srq)
+			mlx4_ib_db_free(dev, &qp->db);
 	}
 }
 
@@ -852,7 +851,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 	if (ibqp->srq)
 		context->srqn = cpu_to_be32(1 << 24 | to_msrq(ibqp->srq)->msrq.srqn);
 
-	if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT)
+	if (!ibqp->srq && cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT)
 		context->db_rec_addr = cpu_to_be64(qp->db.dma);
 
 	if (cur_state == IB_QPS_INIT &&
@@ -872,6 +871,21 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 	else
 		sqd_event = 0;
 
+	/*
+	 * Before passing a kernel QP to the HW, make sure that the
+	 * ownership bits of the send queue are set so that the
+	 * hardware doesn't start processing stale work requests.
+	 */
+	if (!ibqp->uobject && cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) {
+		struct mlx4_wqe_ctrl_seg *ctrl;
+		int i;
+
+		for (i = 0; i < qp->sq.max; ++i) {
+			ctrl = get_send_wqe(qp, i);
+			ctrl->owner_opcode = cpu_to_be32(1 << 31);
+		}
+	}
+
 	err = mlx4_qp_modify(dev->dev, &qp->mtt, to_mlx4_state(cur_state),
 			     to_mlx4_state(new_state), context, optpar,
 			     sqd_event, &qp->mqp);
@@ -919,7 +933,8 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 		qp->rq.tail = 0;
 		qp->sq.head = 0;
 		qp->sq.tail = 0;
-		*qp->db.db  = 0;
+		if (!ibqp->srq)
+			*qp->db.db  = 0;
 	}
 
 out:
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index a0b3782..158759e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -429,6 +429,7 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
 
 void ipoib_pkey_poll(struct work_struct *work);
 int ipoib_pkey_dev_delay_open(struct net_device *dev);
+void ipoib_drain_cq(struct net_device *dev);
 
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index ffec794..f133b56 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -713,7 +713,7 @@ void ipoib_cm_dev_stop(struct net_device *dev)
 	while (!list_empty(&priv->cm.rx_error_list) ||
 	       !list_empty(&priv->cm.rx_flush_list) ||
 	       !list_empty(&priv->cm.rx_drain_list)) {
-		if (!time_after(jiffies, begin + 5 * HZ)) {
+		if (time_after(jiffies, begin + 5 * HZ)) {
 			ipoib_warn(priv, "RX drain timing out\n");
 
 			/*
@@ -726,6 +726,7 @@ void ipoib_cm_dev_stop(struct net_device *dev)
 		}
 		spin_unlock_irq(&priv->lock);
 		msleep(1);
+		ipoib_drain_cq(dev);
 		spin_lock_irq(&priv->lock);
 	}
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index c1aad06..8404f05 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -550,13 +550,30 @@ static int recvs_pending(struct net_device *dev)
 	return pending;
 }
 
+void ipoib_drain_cq(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i, n;
+	do {
+		n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
+		for (i = 0; i < n; ++i) {
+			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
+				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
+			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
+				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
+			else
+				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
+		}
+	} while (n == IPOIB_NUM_WC);
+}
+
 int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
 	unsigned long begin;
 	struct ipoib_tx_buf *tx_req;
-	int i, n;
+	int i;
 
 	clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags);
 	netif_poll_disable(dev);
@@ -611,17 +628,7 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 			goto timeout;
 		}
 
-		do {
-			n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
-			for (i = 0; i < n; ++i) {
-				if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
-					ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
-				else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
-					ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
-				else
-					ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
-			}
-		} while (n == IPOIB_NUM_WC);
+		ipoib_drain_cq(dev);
 
 		msleep(1);
 	}


From sean.hefty at intel.com  Fri May 25 16:26:11 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 25 May 2007 16:26:11 -0700
Subject: [ofa-general] [PATCH] 2.6.23 ib/cm: include HCA ACK delay in local
	ACK timeout
Message-ID: <003801c79f24$14857c80$ff0da8c0@amr.corp.intel.com>

The ib_cm should include the HCA ACK delay when calculating the
local ACK timeout value.  If the HCA ACK delay is large enough
relative to the packet life time, then the calculated timeout
value is too small, which can result in connections timing out
or excessive retries.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
If there are no issues, I will queue this up along with my other
patches for 2.6.23.  This patch applies on top of the fix for
detecting stale connections, and the changes to the CM locking.

The local CA ACK delay is moved internally to the CM, removing
it from the external API.  If someone could perform a sanity
check on the ACK delay, I'd appreciate it.


 drivers/infiniband/core/cm.c            |   71 +++++++++++++++++++++++++------
 drivers/infiniband/core/cma.c           |    1 
 drivers/infiniband/core/ucm.c           |    1 
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |    1 
 include/rdma/ib_cm.h                    |    1 
 5 files changed, 57 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 16181d6..c7007c4 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -87,6 +87,7 @@ struct cm_port {
 struct cm_device {
 	struct list_head list;
 	struct ib_device *device;
+	u8 ack_delay;
 	struct cm_port port[0];
 };
 
@@ -95,7 +96,7 @@ struct cm_av {
 	union ib_gid dgid;
 	struct ib_ah_attr ah_attr;
 	u16 pkey_index;
-	u8 packet_life_time;
+	u8 timeout;
 };
 
 struct cm_work {
@@ -154,6 +155,7 @@ struct cm_id_private {
 	u8 retry_count;
 	u8 rnr_retry_count;
 	u8 service_timeout;
+	u8 target_ack_delay;
 
 	struct list_head work_list;
 	atomic_t work_count;
@@ -293,7 +295,7 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av)
 	av->port = port;
 	ib_init_ah_from_path(cm_dev->device, port->port_num, path,
 			     &av->ah_attr);
-	av->packet_life_time = path->packet_life_time;
+	av->timeout = path->packet_life_time + 1;
 	return 0;
 }
 
@@ -643,6 +645,25 @@ static inline int cm_convert_to_ms(int iba_time)
 	return 1 << max(iba_time - 8, 0);
 }
 
+/*
+ * calculate: 4.096x2^ack_timeout = 4.096x2^ack_delay + 2x4.096x2^life_time
+ * Because of how ack_timeout is stored, adding one doubles the timeout.
+ * To avoid large timeouts, select the max(ack_delay, life_time + 1), and
+ * increment it (round up) only if the other is within 50%.
+ */
+static u8 cm_ack_timeout(u8 ca_ack_delay, u8 packet_life_time)
+{
+	int ack_timeout = packet_life_time + 1;
+
+	if (ack_timeout >= ca_ack_delay)
+		ack_timeout += (ca_ack_delay >= (ack_timeout - 1));
+	else
+		ack_timeout = ca_ack_delay +
+			      (ack_timeout >= (ca_ack_delay - 1));
+
+	return min(31, ack_timeout);
+}
+
 static void cm_cleanup_timewait(struct cm_timewait_info *timewait_info)
 {
 	if (timewait_info->inserted_remote_id) {
@@ -686,7 +707,7 @@ static void cm_enter_timewait(struct cm_id_private *cm_id_priv)
 	 * timewait before notifying the user that we've exited timewait.
 	 */
 	cm_id_priv->id.state = IB_CM_TIMEWAIT;
-	wait_time = cm_convert_to_ms(cm_id_priv->av.packet_life_time + 1);
+	wait_time = cm_convert_to_ms(cm_id_priv->av.timeout);
 	queue_delayed_work(cm.wq, &cm_id_priv->timewait_info->work.work,
 			   msecs_to_jiffies(wait_time));
 	cm_id_priv->timewait_info = NULL;
@@ -908,7 +929,8 @@ static void cm_format_req(struct cm_req_msg *req_msg,
 	cm_req_set_primary_sl(req_msg, param->primary_path->sl);
 	cm_req_set_primary_subnet_local(req_msg, 1); /* local only... */
 	cm_req_set_primary_local_ack_timeout(req_msg,
-		min(31, param->primary_path->packet_life_time + 1));
+		cm_ack_timeout(cm_id_priv->av.port->cm_dev->ack_delay,
+			       param->primary_path->packet_life_time));
 
 	if (param->alternate_path) {
 		req_msg->alt_local_lid = param->alternate_path->slid;
@@ -923,7 +945,8 @@ static void cm_format_req(struct cm_req_msg *req_msg,
 		cm_req_set_alt_sl(req_msg, param->alternate_path->sl);
 		cm_req_set_alt_subnet_local(req_msg, 1); /* local only... */
 		cm_req_set_alt_local_ack_timeout(req_msg,
-			min(31, param->alternate_path->packet_life_time + 1));
+			cm_ack_timeout(cm_id_priv->av.port->cm_dev->ack_delay,
+				       param->alternate_path->packet_life_time));
 	}
 
 	if (param->private_data && param->private_data_len)
@@ -1433,7 +1456,8 @@ static void cm_format_rep(struct cm_rep_msg *rep_msg,
 	cm_rep_set_starting_psn(rep_msg, cpu_to_be32(param->starting_psn));
 	rep_msg->resp_resources = param->responder_resources;
 	rep_msg->initiator_depth = param->initiator_depth;
-	cm_rep_set_target_ack_delay(rep_msg, param->target_ack_delay);
+	cm_rep_set_target_ack_delay(rep_msg,
+				    cm_id_priv->av.port->cm_dev->ack_delay);
 	cm_rep_set_failover(rep_msg, param->failover_accepted);
 	cm_rep_set_flow_ctrl(rep_msg, param->flow_control);
 	cm_rep_set_rnr_retry_count(rep_msg, param->rnr_retry_count);
@@ -1680,6 +1704,13 @@ static int cm_rep_handler(struct cm_work *work)
 	cm_id_priv->responder_resources = rep_msg->initiator_depth;
 	cm_id_priv->sq_psn = cm_rep_get_starting_psn(rep_msg);
 	cm_id_priv->rnr_retry_count = cm_rep_get_rnr_retry_count(rep_msg);
+	cm_id_priv->target_ack_delay = cm_rep_get_target_ack_delay(rep_msg);
+	cm_id_priv->av.timeout =
+			cm_ack_timeout(cm_id_priv->target_ack_delay,
+				       cm_id_priv->av.timeout - 1);
+	cm_id_priv->alt_av.timeout =
+			cm_ack_timeout(cm_id_priv->target_ack_delay,
+				       cm_id_priv->alt_av.timeout - 1);
 
 	/* todo: handle peer_to_peer */
 
@@ -2291,7 +2322,7 @@ static int cm_mra_handler(struct cm_work *work)
 	work->cm_event.param.mra_rcvd.service_timeout =
 					cm_mra_get_service_timeout(mra_msg);
 	timeout = cm_convert_to_ms(cm_mra_get_service_timeout(mra_msg)) +
-		  cm_convert_to_ms(cm_id_priv->av.packet_life_time);
+		  cm_convert_to_ms(cm_id_priv->av.timeout);
 
 	spin_lock_irq(&cm_id_priv->lock);
 	switch (cm_id_priv->id.state) {
@@ -2363,7 +2394,8 @@ static void cm_format_lap(struct cm_lap_msg *lap_msg,
 	cm_lap_set_sl(lap_msg, alternate_path->sl);
 	cm_lap_set_subnet_local(lap_msg, 1); /* local only... */
 	cm_lap_set_local_ack_timeout(lap_msg,
-		min(31, alternate_path->packet_life_time + 1));
+		cm_ack_timeout(cm_id_priv->av.port->cm_dev->ack_delay,
+			       alternate_path->packet_life_time));
 
 	if (private_data && private_data_len)
 		memcpy(lap_msg->private_data, private_data, private_data_len);
@@ -2394,6 +2426,9 @@ int ib_send_cm_lap(struct ib_cm_id *cm_id,
 	ret = cm_init_av_by_path(alternate_path, &cm_id_priv->alt_av);
 	if (ret)
 		goto out;
+	cm_id_priv->alt_av.timeout =
+			cm_ack_timeout(cm_id_priv->target_ack_delay,
+				       cm_id_priv->alt_av.timeout - 1);
 
 	ret = cm_alloc_msg(cm_id_priv, &msg);
 	if (ret)
@@ -3248,8 +3283,7 @@ static int cm_init_qp_rtr_attr(struct cm_id_private *cm_id_priv,
 			*qp_attr_mask |= IB_QP_ALT_PATH;
 			qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num;
 			qp_attr->alt_pkey_index = cm_id_priv->alt_av.pkey_index;
-			qp_attr->alt_timeout =
-					cm_id_priv->alt_av.packet_life_time + 1;
+			qp_attr->alt_timeout = cm_id_priv->alt_av.timeout;
 			qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr;
 		}
 		ret = 0;
@@ -3287,8 +3321,7 @@ static int cm_init_qp_rts_attr(struct cm_id_private *cm_id_priv,
 				*qp_attr_mask |= IB_QP_TIMEOUT | IB_QP_RETRY_CNT |
 						 IB_QP_RNR_RETRY |
 						 IB_QP_MAX_QP_RD_ATOMIC;
-				qp_attr->timeout =
-					cm_id_priv->av.packet_life_time + 1;
+				qp_attr->timeout = cm_id_priv->av.timeout;
 				qp_attr->retry_cnt = cm_id_priv->retry_count;
 				qp_attr->rnr_retry = cm_id_priv->rnr_retry_count;
 				qp_attr->max_rd_atomic =
@@ -3302,8 +3335,7 @@ static int cm_init_qp_rts_attr(struct cm_id_private *cm_id_priv,
 			*qp_attr_mask = IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE;
 			qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num;
 			qp_attr->alt_pkey_index = cm_id_priv->alt_av.pkey_index;
-			qp_attr->alt_timeout =
-				cm_id_priv->alt_av.packet_life_time + 1;
+			qp_attr->alt_timeout = cm_id_priv->alt_av.timeout;
 			qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr;
 			qp_attr->path_mig_state = IB_MIG_REARM;
 		}
@@ -3343,6 +3375,16 @@ int ib_cm_init_qp_attr(struct ib_cm_id *cm_id,
 }
 EXPORT_SYMBOL(ib_cm_init_qp_attr);
 
+void cm_get_ack_delay(struct cm_device *cm_dev)
+{
+	struct ib_device_attr attr;
+
+	if (ib_query_device(cm_dev->device, &attr))
+		cm_dev->ack_delay = 0; /* acks will rely on packet life time */
+	else
+		cm_dev->ack_delay = attr.local_ca_ack_delay;
+}
+
 static void cm_add_one(struct ib_device *device)
 {
 	struct cm_device *cm_dev;
@@ -3367,6 +3409,7 @@ static void cm_add_one(struct ib_device *device)
 		return;
 
 	cm_dev->device = device;
+	cm_get_ack_delay(cm_dev);
 
 	set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask);
 	for (i = 1; i <= device->phys_port_cnt; i++) {
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 2eb52b7..eb15119 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2326,7 +2326,6 @@ static int cma_accept_ib(struct rdma_id_private *id_priv,
 	rep.private_data_len = conn_param->private_data_len;
 	rep.responder_resources = conn_param->responder_resources;
 	rep.initiator_depth = conn_param->initiator_depth;
-	rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT;
 	rep.failover_accepted = 0;
 	rep.flow_control = conn_param->flow_control;
 	rep.rnr_retry_count = conn_param->rnr_retry_count;
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 2586a3e..424983f 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -823,7 +823,6 @@ static ssize_t ib_ucm_send_rep(struct ib_ucm_file *file,
 	param.private_data_len    = cmd.len;
 	param.responder_resources = cmd.responder_resources;
 	param.initiator_depth     = cmd.initiator_depth;
-	param.target_ack_delay    = cmd.target_ack_delay;
 	param.failover_accepted   = cmd.failover_accepted;
 	param.flow_control        = cmd.flow_control;
 	param.rnr_retry_count     = cmd.rnr_retry_count;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index ffec794..4a8117f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -260,7 +260,6 @@ static int ipoib_cm_send_rep(struct net_device *dev, struct ib_cm_id *cm_id,
 	rep.private_data_len = sizeof data;
 	rep.flow_control = 0;
 	rep.rnr_retry_count = req->rnr_retry_count;
-	rep.target_ack_delay = 20; /* FIXME */
 	rep.srq = 1;
 	rep.qp_num = qp->qp_num;
 	rep.starting_psn = psn;
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index 5c07017..12243e8 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -385,7 +385,6 @@ struct ib_cm_rep_param {
 	u8		private_data_len;
 	u8		responder_resources;
 	u8		initiator_depth;
-	u8		target_ack_delay;
 	u8		failover_accepted;
 	u8		flow_control;
 	u8		rnr_retry_count;

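For anyone who wants to sanity-check the ACK delay arithmetic outside the kernel, the following is a minimal userspace sketch. The body of cm_ack_timeout() is copied from the patch above, with uint8_t standing in for u8 and a conditional replacing the kernel's min(). The helpers iba_time_us() and exact_ack_timeout_us() are hypothetical names added only for comparison; they are not part of the patch or of the ib_cm API.

```c
#include <stdint.h>

/* Userspace copy of the patch's cm_ack_timeout(); uint8_t stands in for u8. */
static uint8_t cm_ack_timeout(uint8_t ca_ack_delay, uint8_t packet_life_time)
{
	int ack_timeout = packet_life_time + 1;

	if (ack_timeout >= ca_ack_delay)
		ack_timeout += (ca_ack_delay >= (ack_timeout - 1));
	else
		ack_timeout = ca_ack_delay +
			      (ack_timeout >= (ca_ack_delay - 1));

	return ack_timeout < 31 ? ack_timeout : 31;
}

/* Hypothetical helper: IBA time exponent to microseconds (4.096 us * 2^exp). */
static double iba_time_us(unsigned int exp)
{
	return 4.096 * (double)(1ULL << exp);
}

/* Hypothetical helper: the exact sum described in the patch comment. */
static double exact_ack_timeout_us(unsigned int ack_delay,
				   unsigned int life_time)
{
	return iba_time_us(ack_delay) + 2.0 * iba_time_us(life_time);
}
```

With a packet life time of 14, an ACK delay of 0 leaves the encoded timeout at 15, while delays of 14 or 15 round it up to 16 and a delay of 16 gives 17. A delay of 20 dominates and is used unchanged, which slightly undershoots the exact sum; that is the trade-off the comment describes for avoiding large timeouts.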


From ralph.campbell at qlogic.com  Fri May 25 17:33:43 2007
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Fri, 25 May 2007 17:33:43 -0700
Subject: [ofa-general] [PATCH] ofed_1_2/sdp - SDP can lose receive data
Message-ID: <1180139623.3407.373.camel@brick.pathscale.com>

Can this fix be considered for OFED 1.2?
Thanks.


If a receive work completion is processed but there is no room
in a previously queued skb, the data is dropped.
This patch fixes the problem by queuing the skb.

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>

diff -r 074340d1892d drivers/infiniband/ulp/sdp/sdp_bcopy.c
--- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c	Fri May 25 17:04:51 2007 -0700
+++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c	Fri May 25 17:07:02 2007 -0700
@@ -314,7 +314,8 @@ static inline struct sk_buff *sdp_sock_q
 			skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len);
 			__kfree_skb(skb);
 			skb = tail;
-		}
+		} else
+			skb_queue_tail(&sk->sk_receive_queue, skb);
 	} else
 		skb_queue_tail(&sk->sk_receive_queue, skb);
 

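The failure mode described above can be illustrated with a toy model: merge incoming data into the tail buffer when it fits, otherwise queue it as a new entry instead of dropping it. This is only a sketch. The types toy_buf and toy_queue, the capacities, and the "fits" test are all assumptions made for illustration; the real conditions in sdp_sock_queue_rcv_skb() are not visible in the hunk.

```c
#include <stddef.h>
#include <string.h>

#define BUF_CAP   8	/* hypothetical per-buffer capacity */
#define QUEUE_MAX 4	/* hypothetical receive-queue depth */

struct toy_buf {
	char data[BUF_CAP];
	size_t len;		/* assumed to be <= BUF_CAP */
};

struct toy_queue {
	struct toy_buf bufs[QUEUE_MAX];
	size_t count;
};

/*
 * Accept a received buffer: merge it into the tail if it fits,
 * otherwise queue it as a new entry.
 * Returns 0 on success, -1 if the queue is full.
 */
static int toy_queue_rx(struct toy_queue *q, const struct toy_buf *in)
{
	if (q->count > 0) {
		struct toy_buf *tail = &q->bufs[q->count - 1];

		if (in->len <= BUF_CAP - tail->len) {
			memcpy(tail->data + tail->len, in->data, in->len);
			tail->len += in->len;
			return 0;
		}
		/* No room in the tail: fall through and queue the buffer. */
	}
	if (q->count == QUEUE_MAX)
		return -1;
	q->bufs[q->count++] = *in;
	return 0;
}

/* Total number of bytes currently held in the queue. */
static size_t toy_queue_bytes(const struct toy_queue *q)
{
	size_t i, total = 0;

	for (i = 0; i < q->count; i++)
		total += q->bufs[i].len;
	return total;
}
```

Without the fall-through in the "no room" case, a buffer that does not fit in the tail would be neither merged nor queued, which corresponds to the lost receive data reported above.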
From vlad at lists.openfabrics.org  Sat May 26 02:41:02 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat, 26 May 2007 02:41:02 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070526-0200 daily build status
Message-ID: <20070526094102.B0D85E60845@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From mst at dev.mellanox.co.il  Sat May 26 12:40:49 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 26 May 2007 22:40:49 +0300
Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git
In-Reply-To: <adahcq036ep.fsf@cisco.com>
References: <adahcq036ep.fsf@cisco.com>
Message-ID: <20070526194049.GD15942@mellanox.co.il>

> Michael S. Tsirkin (2):
>       IPoIB/cm: Fix timeout check in ipoib_cm_dev_stop()
>       IPoIB/cm: Drain cq in ipoib_cm_dev_stop()

Don't we want the patch that sets status to flushed with error?

-- 
MST


From dotanb at dev.mellanox.co.il  Sat May 26 23:26:26 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 27 May 2007 09:26:26 +0300
Subject: [ofa-general] ibv_get_cq_event blocking forever after successful
	ibv_post_send...
In-Reply-To: <20070525212214.20500.qmail@station183.com>
References: <20070525212214.20500.qmail@station183.com>
Message-ID: <46592492.6000404@dev.mellanox.co.il>

Jimmy Hill wrote:
> I have verbs code that is modeled after the first usage model described on the ibv_get_cq_event() man page. That is, I have created all the verbs resources (e.g., completion channel, QP, CQ, etc.) and then followed the sequence of:
>
> ibv_req_notify_cq(cq, 0);
>
> ibv_post_send(qp, &work_req, &bad_work_req);
>
> ibv_get_cq_event(channel, &ev_cq, &ev_ctx);
>
> ibv_ack_cq_events(ev_cq, 1);
>
> ibv_req_notify_cq(cq, 0);
>
> ibv_poll_cq(cq, 1, &wc);  // loop to drain - but due to upper protocol, will only ever be 1 at a time
>
>
> The QP is created with the following attributes:
>     qp_init_attr.qp_context              = &this->conn_ref;
>     qp_init_attr.send_cq                  = this->send_cq;
>     qp_init_attr.recv_cq                   = this->recv_cq;
>     qp_init_attr.srq                          = NULL;
>     qp_init_attr.cap.max_send_wr     = 128;
>     qp_init_attr.cap.max_recv_wr      = 4; 
>     qp_init_attr.cap.max_send_sge    = 16; 
>     qp_init_attr.cap.max_recv_sge     = 4; 
>     qp_init_attr.cap.max_inline_data   = 0;
>     qp_init_attr.qp_type                   = IBV_QPT_RC;
>     qp_init_attr.sq_sig_all                  = 0;
> // I have also used sq_sig_all set to 1 and then removed the SIGNALED flag in the send request
>
> The Send request (RDMA Write) is formatted as:
>     sge.lkey     = response_mr->lkey;
>     sge.addr    = response;
>     sge.length = 256;
>
>     send_work_req.opcode                     = IBV_WR_RDMA_WRITE;
>     send_work_req.next                         = NULL;
>     send_work_req.sg_list                       = &sge;
>     send_work_req.num_sge                   = 1;
>     send_work_req.wr_id                        = 0;
>     send_work_req.imm_data                  = 0;
>     send_work_req.wr.rdma.remote_addr = client_rmr->addr;
>     send_work_req.wr.rdma.rkey             = client_rmr->rkey;
>     send_work_req.send_flags                 = IBV_SEND_SIGNALED; 
> // I have used IBV_SEND_SIGNALED and IBV_SEND_SIGNALED | IBV_SEND_FENCE
>
> This QP will be used to RDMA Write a response back to a client. With the current setup, only one RDMA write will be outstanding per QP at any given time. That is, I issue the RDMA Write and wait for its completion prior to continuing processing. The eventual goal is to request and process a completion event every "n" RDMA Writes.
>
> The current problem is that everything runs along fine and then I end up in a situation where I block forever on the ibv_get_cq_event() call. The ibv_post_send() just prior to the ibv_get_cq_event() call returned "0" indicating that it successfully processed the command. However, the completion event for that operation never arrives. The data associated with that RDMA write does not appear on the client side, so it seems that even though the ibv_post_send() reported success, it really did not successfully process the request.
>
> In order to debug the problem, I changed the completion channel to non-blocking and put the ibv_get_cq_event() call in a loop and dumped out the number of passes through the loop (i.e., number of calls to ibv_get_cq_event()) prior to the arrival of an event (good status from the call). When all is working fine, it only takes one or two calls for the event to arrive. When I encounter the situation where it blocked forever, it loops forever calling ibv_get_cq_event(). I added a counter there and after a large (e.g., 500) number of retries, I looped back up and tried the ibv_post_send() again. For the most part, the request makes it out the second time. But, given enough time, the send queue work requests entries are consumed. That is, if I lower the max_send_wr attribute to 10, after 10 failed event collection attempts and ibv_post_send() retries, the 11th ibv_post_send() will fail with -1 status code. So, the work request entries are not leaving the send queue.
>
> Any ideas on why the ibv_get_cq_event() would never see an event after a "successful" send requesting a completion 
Try the following sequence:


ibv_req_notify_cq(cq, 0);

ibv_post_send(qp, &work_req, &bad_work_req);

ibv_get_cq_event(channel, &ev_cq, &ev_ctx);

ibv_ack_cq_events(ev_cq, 1);

ibv_req_notify_cq(cq, 0);

in a loop until the CQ is empty:
	ibv_poll_cq(cq, 1, &wc);  // loop to drain - but due to upper protocol, will only ever be 1 at a time
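
Why the order matters (take the event, re-arm, then drain until empty) can be seen in a toy model of the notification semantics. This is a sketch under stated assumptions, not libibverbs code: struct toy_cq and the toy_* functions are hypothetical stand-ins that model only the rule that a CQ raises an event for the next completion added after it was armed.

```c
#include <stdbool.h>

/* Toy stand-in for a CQ plus its completion channel; not the verbs types. */
struct toy_cq {
	int completions;	/* entries waiting to be polled */
	int events;		/* notifications waiting on the channel */
	bool armed;		/* set by toy_req_notify(), cleared when an event fires */
};

/* Models ibv_req_notify_cq(): only later completions raise an event. */
static void toy_req_notify(struct toy_cq *cq)
{
	cq->armed = true;
}

/* Models the HCA adding one completion to the CQ. */
static void toy_hw_complete(struct toy_cq *cq)
{
	cq->completions++;
	if (cq->armed) {
		cq->events++;
		cq->armed = false;
	}
}

/* Models ibv_poll_cq(cq, 1, &wc): 1 if an entry was removed, else 0. */
static int toy_poll(struct toy_cq *cq)
{
	if (cq->completions == 0)
		return 0;
	cq->completions--;
	return 1;
}

/* Models a non-blocking get + ack of one CQ event: 1 if one was consumed. */
static int toy_get_event(struct toy_cq *cq)
{
	if (cq->events == 0)
		return 0;
	cq->events--;
	return 1;
}

/* Handler order from the sequence above. Returns the number of entries polled. */
static int toy_handle_event(struct toy_cq *cq)
{
	int polled = 0;

	if (!toy_get_event(cq))
		return 0;
	toy_req_notify(cq);		/* re-arm first ... */
	while (toy_poll(cq))		/* ... then drain until empty */
		polled++;
	return polled;
}
```

In this model a completion that arrives while the CQ is unarmed produces no event of its own, so it is only picked up if the handler keeps polling until the CQ is empty after re-arming.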


Dotan


From tziporet at dev.mellanox.co.il  Sat May 26 23:58:19 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 27 May 2007 09:58:19 +0300
Subject: [ofa-general] Re: [ewg] [PATCH] ofed_1_2/sdp - SDP can lose receive
	data
In-Reply-To: <1180139623.3407.373.camel@brick.pathscale.com>
References: <1180139623.3407.373.camel@brick.pathscale.com>
Message-ID: <46592C0B.3070709@mellanox.co.il>

Michael

Please review

Tziporet

Ralph Campbell wrote:
> Can this fix be considered for OFED 1.2?
> Thanks.
>
>
> If a receive work completion is processed but there is no room
> in a previously queued skb, the data is dropped.
> This patch fixes the problem by queuing the skb.
>
> Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>
>
> diff -r 074340d1892d drivers/infiniband/ulp/sdp/sdp_bcopy.c
> --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c	Fri May 25 17:04:51 2007 -0700
> +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c	Fri May 25 17:07:02 2007 -0700
> @@ -314,7 +314,8 @@ static inline struct sk_buff *sdp_sock_q
>  			skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len);
>  			__kfree_skb(skb);
>  			skb = tail;
> -		}
> +		} else
> +			skb_queue_tail(&sk->sk_receive_queue, skb);
>  	} else
>  		skb_queue_tail(&sk->sk_receive_queue, skb);
>  


From tziporet at dev.mellanox.co.il  Sun May 27 00:08:56 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 27 May 2007 10:08:56 +0300
Subject: [ofa-general] Re: [ewg] Something is wrong in gitweb
In-Reply-To: <795c49870705240933w276e41fbu941e822047ab5e25@mail.gmail.com>
References: <46558296.2090308@voltaire.com> <4655B0EB.5030407@mellanox.co.il>
	<795c49870705240933w276e41fbu941e822047ab5e25@mail.gmail.com>
Message-ID: <46592E88.9010709@mellanox.co.il>

Jeff Becker wrote:
> Hi Tziporet. I just tried getting to the git tree from my web browser
> and this seems to work, including the link you tried below. Does it
> work for you now? Thanks.
>

Working for me now.

Thanks,
Tziporet


From mst at dev.mellanox.co.il  Sun May 27 01:39:32 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 27 May 2007 11:39:32 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <46574099.3090601@linux.vnet.ibm.com>
References: <46537081.30906@linux.vnet.ibm.com>
	<20070524053819.GF6019@mellanox.co.il>
	<46574099.3090601@linux.vnet.ibm.com>
Message-ID: <20070527083932.GC8342@mellanox.co.il>

> >>-The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that
> >>support SRQ like the Topspin HCA and, such HCAs should not be
> >>impacted at all.
> >
> >I don't think it's that clean yet.
> >
> >Here's an idea: implement "fake SRQ" for ehca in software: make post recv on
> >srq queue the WR, spread them evenly between QPs as they connect.  Once # of
> >QPs goes above some limit, create QP command will fail.  This would contain
> >the mess nicely inside ehca (I think you'll want to add a flag that lets
> >software figure out that SRQ is fake).
> >
> >We will still be left with the basic problem of what to do at the active side
> >upon the reject, though.
> 
> As you indicate this will not solve the problem, so it is not an option.

Above, I have outlined how it can be done, so it certainly *is* an option.

In this thread, you basically keep saying that ehca will always be the only HCA without SRQ
support, so you can make a lot of assumptions about how IPoIB is used.

Fine, but if you follow this logic, it makes sense to hide the mess under the ehca
provider interface.


-- 
MST


From amip at dev.mellanox.co.il  Sun May 27 02:07:00 2007
From: amip at dev.mellanox.co.il (Ami Perlmutter)
Date: Sun, 27 May 2007 12:07:00 +0300
Subject: [ofa-general] Re: [ewg] [PATCH] ofed_1_2/sdp - SDP can lose receive
	data
In-Reply-To: <1180139623.3407.373.camel@brick.pathscale.com>
References: <1180139623.3407.373.camel@brick.pathscale.com>
Message-ID: <1180256850.15464.1.camel@localhost>

Ralph,
This is how the code is now.
Where are you getting this code from?

On Fri, 2007-05-25 at 17:33 -0700, Ralph Campbell wrote:
> Can this fix be considered for OFED 1.2?
> Thanks.
> 
> 
> If a receive work completion is processed but there is no room
> in a previously queued skb, the data is dropped.
> This patch fixes the problem by queuing the skb.
> 
> Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>
> 
> diff -r 074340d1892d drivers/infiniband/ulp/sdp/sdp_bcopy.c
> --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c	Fri May 25 17:04:51 2007 -0700
> +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c	Fri May 25 17:07:02 2007 -0700
> @@ -314,7 +314,8 @@ static inline struct sk_buff *sdp_sock_q
>  			skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len);
>  			__kfree_skb(skb);
>  			skb = tail;
> -		}
> +		} else
> +			skb_queue_tail(&sk->sk_receive_queue, skb);
>  	} else
>  		skb_queue_tail(&sk->sk_receive_queue, skb);
>  


From vlad at lists.openfabrics.org  Sun May 27 02:41:04 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun, 27 May 2007 02:41:04 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070527-0200 daily build status
Message-ID: <20070527094104.BC09CE60852@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From ogerlitz at voltaire.com  Sun May 27 03:11:28 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 27 May 2007 13:11:28 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak
In-Reply-To: <ada1wh9ewya.fsf@cisco.com>
References: <20070521120459.GI20400@mellanox.co.il> <ada1wh9ewya.fsf@cisco.com>
Message-ID: <46595950.6080106@voltaire.com>

Roland Dreier wrote:
> OK, I crossed my fingers and merged this for 2.6.22

Somehow it seems when applying this patch to OFED something goes wrong, 
please see https://bugs.openfabrics.org/show_bug.cgi?id=636

Or.


From sobebike at gmail.com  Sun May 27 05:45:43 2007
From: sobebike at gmail.com (SoBeBike)
Date: Sun, 27 May 2007 07:45:43 -0500
Subject: [ofa-general] ibv_get_cq_event blocking forever after successful
	ibv_post_send...
In-Reply-To: <46592492.6000404@dev.mellanox.co.il>
References: <20070525212214.20500.qmail@station183.com>
	<46592492.6000404@dev.mellanox.co.il>
Message-ID: <dedddf10705270545sfc352c4qe299cdd767da66f0@mail.gmail.com>

That is what I am already doing (note the comment, "// loop to
drain..."). I loop calling ibv_poll_cq until it is empty. I just noted
that due to the usage model, I only see it pull one CQE and then on
the 2nd pass through the loop the CQ is empty.


On 5/27/07, Dotan Barak <dotanb at dev.mellanox.co.il> wrote:

> Try to do the following scenario:
>
>
> ibv_req_notify_cq(cq, 0);
>
> ibv_post_send(qp, &work_req, &bad_work_req);
>
> ibv_get_cq_event(channel, &ev_cq, &ev_ctx);
>
> ibv_ack_cq_events(ev_cq, 1);
>
> ibv_req_notify_cq(cq, 0);
>
> in a loop until the CQ is empty:
>         ibv_poll_cq(cq, 1, &wc);  // loop to drain - but due to upper protocol, will only ever be 1 at a time
>


From mst at dev.mellanox.co.il  Sun May 27 05:53:37 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 27 May 2007 15:53:37 +0300
Subject: [ofa-general] Re: Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak
In-Reply-To: <46595950.6080106@voltaire.com>
References: <20070521120459.GI20400@mellanox.co.il>
	<ada1wh9ewya.fsf@cisco.com> <46595950.6080106@voltaire.com>
Message-ID: <20070527125337.GF8342@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak
> 
> Roland Dreier wrote:
> >OK, I crossed my fingers and merged this for 2.6.22
> 
> Somehow it seems when applying this patch to OFED something goes wrong, 
> please see https://bugs.openfabrics.org/show_bug.cgi?id=636

Yes, it seems that we shouldn't keep a QP in error state
for extended periods of time, since that moves hardware
to the slow path.

It seems that the right approach might be to create
a loopback QP in RTS, and perform post sends there.

-- 
MST


From dotanb at dev.mellanox.co.il  Sun May 27 06:06:41 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 27 May 2007 16:06:41 +0300
Subject: [ofa-general] how to write a IB user level multicast application
In-Reply-To: <4655ACC2.9030401@open-mpi.org>
References: <13432ab00705232201s5f7d5a5h5ecaaddf57ead11b@mail.gmail.com>
	<46553448.6020508@dev.mellanox.co.il>
	<4655ACC2.9030401@open-mpi.org>
Message-ID: <46598261.4030005@dev.mellanox.co.il>

Andrew Friedley wrote:
> Dotan Barak wrote:
>> In the following URL you can find a very simple example on how to use 
>> multicast:
>> https://svn.openfabrics.org/svn/openib/trunk/contrib/mellanox/ibtp/gen2/userspace/useraccess/multicast_test/multicast_test.c 
>
>
> I seem to be missing v1.h on my OFED v1.2 nightly install, where can I 
> find it?
VL can be found in:
https://svn.openfabrics.org/svn/openib/trunk/contrib/mellanox/ibtp/common/tools/vl/
>
>> This test doesn't send an SA query (to get the multicast properties) or 
>> an SA multicast join (which makes the SM configure the subnet so that the 
>> port this QP is attached to receives the multicast messages).
>>
>> This example will work on a back-to-back topology.
>
> An alternative that I've had pretty good success with is to use the 
> RDMA CM.  It uses an IP(v6) abstraction, does the SA queries/joins for 
> you, and also supports selection of an unused multicast address if you 
> join the '0' address (and port? not sure if its required, I always 
> zero it).  Check out the 'mckey.c' example included with the RDMA CM 
> source, it will likely answer your questions.
This test was written in order to check the verbs layer without any 
dependency on any ULP.

thanks
Dotan


From dotanb at dev.mellanox.co.il  Sun May 27 06:34:43 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 27 May 2007 16:34:43 +0300
Subject: [ofa-general] ibv_get_cq_event blocking forever after successful
	ibv_post_send...
In-Reply-To: <dedddf10705270545sfc352c4qe299cdd767da66f0@mail.gmail.com>
References: <20070525212214.20500.qmail@station183.com>	
	<46592492.6000404@dev.mellanox.co.il>
	<dedddf10705270545sfc352c4qe299cdd767da66f0@mail.gmail.com>
Message-ID: <465988F3.4000100@dev.mellanox.co.il>

SoBeBike wrote:
> That is what I am already doing (note the comment, "// loop to
> drain..."). I loop calling ibv_poll_cq until it is empty. I just noted
> that due to the usage model, I only see it pull one CQE and then on
> the 2nd pass through the loop the CQ is empty.
>

When you hit this scenario (the cases where you never get the CQ event), 
did you try to poll the CQ and check whether there is a completion in it?
(Maybe the problem is that this WR simply didn't generate a completion 
when it finished.)


thanks
Dotan


From or.gerlitz at gmail.com  Sun May 27 07:13:07 2007
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Sun, 27 May 2007 17:13:07 +0300
Subject: [ofa-general] Re: Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak
In-Reply-To: <20070527125337.GF8342@mellanox.co.il>
References: <20070521120459.GI20400@mellanox.co.il> <ada1wh9ewya.fsf@cisco.com>
	<46595950.6080106@voltaire.com> <20070527125337.GF8342@mellanox.co.il>
Message-ID: <15ddcffd0705270713h52449106x7b5654d558cbbda2@mail.gmail.com>

On 5/27/07, Michael S. Tsirkin <mst at dev.mellanox.co.il> wrote:
>
> > Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> > Subject: Re: Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak
> >
> > Roland Dreier wrote:
> > >OK, I crossed my fingers and merged this for 2.6.22
> >
> > Somehow it seems when applying this patch to OFED something goes wrong,
> > please see https://bugs.openfabrics.org/show_bug.cgi?id=636
>
> Yes, it seems that we shouldn't keep a QP in error state
> for extended periods of time, since that moves hardware
> to the slow path.
>

what actually is the "hardware slow path" ?

Or.

From mst at dev.mellanox.co.il  Sun May 27 07:57:04 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 27 May 2007 17:57:04 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh
	access race
In-Reply-To: <ada646h6aqs.fsf@cisco.com>
References: <20070522005918.GB13311@mellanox.co.il> <adatzu4d1wx.fsf@cisco.com>
	<20070524131154.GA7940@mellanox.co.il> <ada646h6aqs.fsf@cisco.com>
Message-ID: <20070527145704.GB26933@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race
> 
>  > The following works fine for me here. Pls consider for 2.6.22.
> 
> Does it help with https://bugs.openfabrics.org//show_bug.cgi?id=604 ?
> Or are we still looking?

604 turns out to be a bug in mthca. I'll post a patch RSN.
Still, I think it's a good idea to apply this. Do you agree?
I also have put this patch in OFED.

-- 
MST


From jwong at datallegro.com  Sun May 27 07:59:06 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Sun, 27 May 2007 10:59:06 -0400
Subject: [ofa-general] Trouble install OFED 1.2-rc3 - ib-bonding
Message-ID: <A382D4292574EB47A85B8159A6AED1A18305BB@FPNYEXCBE02.opus-i.corp>

Hello,

I am installing the OFED 1.2-rc3.

Everything else builds except for ib-bonding.  

 
Thanks in advance.

 
I am getting the following error messages:

+ make -C /lib/modules/2.6.18-8.1.4.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding
make: Entering directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64'
  CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o
In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:78:
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_inactive_flags':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: (Each undeclared identifier is reported only once
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: for each function it appears in.)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_active_flags':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_compute_features':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1233: warning: comparison of distinct pointer types lacks a cast
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_enslave':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1449: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_release':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1848: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_arp_rcv':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:2548: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_netdev_event':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:3390: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_init':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4374: warning: assignment discards qualifiers from pointer target type
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4386: error: 'IFF_BONDING' undeclared (first use in this function)
make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o] Error 1
make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding] Error 2
make: Leaving directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64'
+ echo ' Building  IB bonding driver failed'
 Building  IB bonding driver failed
+ exit 1
error: Bad exit status from /var/tmp/rpm-tmp.23876 (%build)

 
Jeff Wong

 

From mst at dev.mellanox.co.il  Sun May 27 08:06:42 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 27 May 2007 18:06:42 +0300
Subject: [ofa-general] [PATCH for-2.6.22] IB/mthca: fix send CQE with error
	for QP connected to SRQ
Message-ID: <20070527150642.GC26933@mellanox.co.il>

mthca_free_err_wqe currently treats both send and receive CQEs identically in
case of SRQ.  But for tavor mode hardware, send CQEs with error can be chained
together even if the RQ is part of SRQ, so we miss some CQEs.
This, in turn, triggers crashes in IPoIB CM:
https://bugs.openfabrics.org//show_bug.cgi?id=604.
Fix by following the WQE chain for all send CQEs, same as non-SRQ QP.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

This is a fix for bug 604 in ofa bugzilla. Pls consider for 2.6.22.
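
For readers skimming the one-line change below, the condition can be restated as a small stand-alone predicate. This is only an illustrative sketch: the names wqe_is_chain_end_old and wqe_is_chain_end_new are hypothetical and do not exist in mthca; the real check lives inside mthca_free_err_wqe.

```c
#include <stdbool.h>

/*
 * Hypothetical restatement of the early-return test in mthca_free_err_wqe().
 * "true" means: treat this WQE as the end of the doorbell chain and do not
 * follow the chain any further.
 */

/* Before the patch: any QP attached to an SRQ stopped here, sends included. */
static bool wqe_is_chain_end_old(bool qp_uses_srq, bool is_send)
{
    (void)is_send; /* the send/receive distinction was ignored */
    return qp_uses_srq;
}

/* After the patch: only receive CQEs of an SRQ-attached QP stop here. */
static bool wqe_is_chain_end_new(bool qp_uses_srq, bool is_send)
{
    return qp_uses_srq && !is_send;
}
```

The two versions differ only for a send CQE on a QP attached to an SRQ, which is the case described in the changelog above.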

diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index 0276649..7474646 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -2287,7 +2287,7 @@ void mthca_free_err_wqe(struct mthca_dev *dev, struct mthca_qp *qp, int is_send,
 	 * For SRQs, all WQEs generate a CQE, so we're always at the
 	 * end of the doorbell chain.
 	 */
-	if (qp->ibqp.srq) {
+	if (qp->ibqp.srq && !is_send) {
 		*new_wqe = 0;
 		return;
 	}

-- 
MST


From jimmy at hillraiser.com  Sun May 27 15:45:08 2007
From: jimmy at hillraiser.com (Jimmy Hill)
Date: Sun, 27 May 2007 17:45:08 -0500
Subject: [ofa-general] ibv_get_cq_event blocking forever after successful
	ibv_post_send...
In-Reply-To: <465988F3.4000100@dev.mellanox.co.il>
Message-ID: <HFEPKIFILMMCLHAOBMMOGEADGMAA.jimmy@hillraiser.com>


> -----Original Message-----
>
> SoBeBike wrote:
> > That is what I am already doing (note the comment, "// loop to
> > drain..."). I loop calling ibv_poll_cq until it is empty. I just noted
> > that due to the usage model, I only see it pull one CQE and then on
> > the 2nd pass through the loop the CQ is empty.
> >
>
> When you get to this scenario (for many times you don't get the CQ
> event) did you try to poll the CQ and check if there is any completion
> in it?
> (maybe the problem is that this WR just didn't create any completion
> when it ended).
>

My code currently blocks (either a blocking call, or looping with a
non-blocking FD) waiting for an event (ibv_get_cq_event) before attempting
to dequeue (ibv_poll_cq) any entries from the CQ. I assumed that if I did
not get a completion event, there would not be a CQ entry waiting. So, no, I
have not tried that. I can try that, but it may be a week or more before I
am able to get back to my machines.

Can I not rely on getting an event when I request signalled completions?

Thanks.


From rdreier at cisco.com  Sun May 27 18:18:34 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 27 May 2007 18:18:34 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak
In-Reply-To: <20070527125337.GF8342@mellanox.co.il> (Michael S. Tsirkin's
	message of "Sun, 27 May 2007 15:53:37 +0300")
References: <20070521120459.GI20400@mellanox.co.il>
	<ada1wh9ewya.fsf@cisco.com> <46595950.6080106@voltaire.com>
	<20070527125337.GF8342@mellanox.co.il>
Message-ID: <adalkf921at.fsf@cisco.com>

 > Yes, it seems that we shouldn't keep a QP in error state
 > for extended periods of time, since that moves hardware
 > to the slow path.

Ugh, so I can do a local DoS by just creating a QP and moving it to
error and then going to sleep for a long time?  What hardware is
susceptible to this?

 > It seems that the right approach might be to create
 > a loopback QP in RTS, and perform post sends there.

How about using the send queue of the QP we're trying to flush?
I'll try to code this up tomorrow if no one beats me to the fix.

 - R.


From mst at dev.mellanox.co.il  Sun May 27 20:41:03 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 28 May 2007 06:41:03 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak
In-Reply-To: <adalkf921at.fsf@cisco.com>
References: <20070521120459.GI20400@mellanox.co.il>
	<ada1wh9ewya.fsf@cisco.com> <46595950.6080106@voltaire.com>
	<20070527125337.GF8342@mellanox.co.il> <adalkf921at.fsf@cisco.com>
Message-ID: <20070528034103.GB2945@mellanox.co.il>

>  > It seems that the right approach might be to create
>  > a loopback QP in RTS, and perform post sends there.
> 
> How about using the send queue of the QP we're trying to flush?
> I'll try to code this up tomorrow if no one beats me to the fix.

Great idea - since we got last WQE reached, that QP will already be in error.

Another alternative I thought about is to keep the drain QP in reset state,
and move it to error only when we have some work to do.
But this looks like a better way to do it.

-- 
MST


From dotanb at dev.mellanox.co.il  Sun May 27 22:25:25 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Mon, 28 May 2007 08:25:25 +0300
Subject: [ofa-general] ibv_get_cq_event blocking forever after successful
	ibv_post_send...
In-Reply-To: <HFEPKIFILMMCLHAOBMMOGEADGMAA.jimmy@hillraiser.com>
References: <HFEPKIFILMMCLHAOBMMOGEADGMAA.jimmy@hillraiser.com>
Message-ID: <465A67C5.2060101@dev.mellanox.co.il>

Jimmy Hill wrote:
> My code currently blocks (either a blocking call, or looping with a
> non-blocking FD) waiting for an event (ibv_get_cq_event) before attempting
> to dequeue (ibv_poll_cq) any entries from the CQ. I assumed that if I did
> not get a completion event, there would not be a CQ entry waiting. So, no, I
> have not tried that. I can try that, but it may be a week or more before I
> am able to get back to my machines.
>
> Can I not rely on getting an event when I request signalled completions?
>   
This is the tricky part in IB: when you request a completion event, the 
event is generated only for the NEXT completion that is added to the CQ 
after the request. Completions already in the CQ do not trigger it.

But the answer is yes: if you requested a completion notification and a 
completion is produced afterwards, you will get an event.
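
The "next completion only" rule can be illustrated with a toy model. This is a sketch under stated assumptions, not libibverbs code: struct toy_cq and the toy_* functions are hypothetical stand-ins for the behaviour of ibv_req_notify_cq, ibv_poll_cq and the completion channel, and they model only the arming rule described above.

```c
#include <stdbool.h>

/* Toy model of a completion queue; NOT the libibverbs API. */
struct toy_cq {
    int  pending; /* completions waiting to be polled */
    bool armed;   /* a notification has been requested */
    int  events;  /* events delivered so far */
};

/* Stands in for ibv_req_notify_cq(): arm for the NEXT completion only. */
static void toy_req_notify(struct toy_cq *cq)
{
    cq->armed = true;
}

/* The hardware adds a completion; an event fires only if the CQ is armed,
 * and firing disarms the CQ again. */
static void toy_complete(struct toy_cq *cq)
{
    cq->pending++;
    if (cq->armed) {
        cq->armed = false;
        cq->events++;
    }
}

/* Stands in for ibv_poll_cq(cq, 1, &wc): returns 1 if an entry was dequeued. */
static int toy_poll(struct toy_cq *cq)
{
    if (cq->pending == 0)
        return 0;
    cq->pending--;
    return 1;
}

/* Re-arm first, then drain, so a completion that arrived before the re-arm
 * is still picked up. Returns the number of entries drained. */
static int toy_rearm_and_drain(struct toy_cq *cq)
{
    int drained = 0;

    toy_req_notify(cq);
    while (toy_poll(cq))
        drained++;
    return drained;
}
```

In this model a completion that lands before toy_req_notify() never produces an event, which is why the CQ is polled after every re-arm rather than only after an event arrives.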


I have a question: if you set the SIGNAL bit in every SR (Send Request) 
that you post, why don't you just enable sq_sig_all when creating the QP?

Dotan


From devesh28 at gmail.com  Sun May 27 22:50:24 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Mon, 28 May 2007 11:20:24 +0530
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <4655BE8F.2080102@ichips.intel.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com>
	<1179398534.23882.67542.camel@hal.voltaire.com>
	<309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com>
	<1179483657.23882.158398.camel@hal.voltaire.com>
	<309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com>
	<1179769930.15940.9823.camel@hal.voltaire.com>
	<309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com>
	<1179930909.16831.100286.camel@hal.voltaire.com>
	<309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com>
	<4655BE8F.2080102@ichips.intel.com>
Message-ID: <309a667c0705272250q68aa4064l40454db5b266a967@mail.gmail.com>

On 5/24/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > Yes It will, and hence reduce the initial SA traffic generated on a
> > big cluster...just imagin, the cluster is quite big and every node is
> > trying to build its cache initially. It will create large burst of SA
> > packets.
>
> In general I agree with the notion of enhancing the cache to allow it to
> load locally from a file.  But I'd really like to get a framework merged
> upstream first before trying to add in these sort of enhancements.
Ok, but, by that time we can keep the framework ready?
>
> Initially loading of caches on a large fabric can be limited to a single
> GetTable PR query per node, and by enabling the caches across the fabric
> at different times, the single burst to the SA can be avoided.
How will this be managed? It will add extra startup time to the
cluster, because the cluster will be usable only after the last cache
has been enabled. Am I right?
>
> > Incomplete connectivity will be till first PR is requested for that
> > destination, Because if its a cache miss, any how application is going
> > to initiate a ib_sa_get_path_rec() and resolved PR will be added in
> > cache for future reference.
>
> ib_sa_get_path_rec() only returns a single path.  If multiple paths
> exist, adding a single path to the cache may cause all applications to
How is multi-pathing handled in the current cache_module?
> make use of that single path. Updating the cache on demand isn't as
> simple as it seems.
>
> - Sean
>
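
One possible way to enable the caches at different times, sketched here purely as an illustration: each node derives its own start delay from its position in the cluster. The function stagger_delay_ms and its parameters are hypothetical; nothing like it is claimed to exist in the cache module discussed in this thread.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of any OFED module): spread cache start-up
 * evenly over window_ms so that the per-node GetTable queries do not all
 * reach the SA at the same moment.
 */
static uint32_t stagger_delay_ms(uint32_t node_index, uint32_t node_count,
                                 uint32_t window_ms)
{
    if (node_count == 0)
        return 0; /* nothing to spread over */

    /* The 64-bit intermediate keeps the multiplication from overflowing. */
    return (uint32_t)(((uint64_t)(node_index % node_count) * window_ms)
                      / node_count);
}
```

With this scheme the last node starts less than window_ms after the first one, so the extra startup time is bounded by the chosen window.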


From mst at dev.mellanox.co.il  Sun May 27 23:45:18 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 28 May 2007 09:45:18 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak
In-Reply-To: <adalkf921at.fsf@cisco.com>
References: <20070521120459.GI20400@mellanox.co.il>
	<ada1wh9ewya.fsf@cisco.com> <46595950.6080106@voltaire.com>
	<20070527125337.GF8342@mellanox.co.il> <adalkf921at.fsf@cisco.com>
Message-ID: <20070528064518.GF2945@mellanox.co.il>

>  > It seems that the right approach might be to create
>  > a loopback QP in RTS, and perform post sends there.
> 
> How about using the send queue of the QP we're trying to flush?
> I'll try to code this up tomorrow if no one beats me to the fix.

Unfortunately, this won't work, as it hits another firmware problem:
it won't generate CQE with error for send WR unless the QP was
in RTS at some point.


-- 
MST


From vlad at lists.openfabrics.org  Mon May 28 02:49:22 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 28 May 2007 02:49:22 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070528-0200 daily build status
Message-ID: <20070528094922.99289E60856@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From jimmy at hillraiser.com  Mon May 28 04:32:25 2007
From: jimmy at hillraiser.com (Jimmy Hill)
Date: Mon, 28 May 2007 06:32:25 -0500
Subject: [ofa-general] ibv_get_cq_event blocking forever after successful
	ibv_post_send...
In-Reply-To: <465A67C5.2060101@dev.mellanox.co.il>
Message-ID: <HFEPKIFILMMCLHAOBMMOMEAFGMAA.jimmy@hillraiser.com>

> 
> I have a question: if you enable in all of the SRs (Send Requests) that 
> you are posting the SIGNAL bit, why don't you just
> enable the sq_sig_all when creating the QP?
> 

I have. That was one of the things I tried.


From mst at dev.mellanox.co.il  Mon May 28 04:37:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 28 May 2007 14:37:27 +0300
Subject: [ofa-general] [PATCH for-2.6.22] IB/ipoib: fix performance
	regression on Mellanox
In-Reply-To: <adalkf921at.fsf@cisco.com>
References: <20070521120459.GI20400@mellanox.co.il>
	<ada1wh9ewya.fsf@cisco.com> <46595950.6080106@voltaire.com>
	<20070527125337.GF8342@mellanox.co.il> <adalkf921at.fsf@cisco.com>
Message-ID: <20070528113727.GP2945@mellanox.co.il>

commit 518b1646f8a31904ca637b8df0c1e31c34a7a3c2: IPoIB/cm: Fix SRQ WR leak
introduced performance regression on Mellanox cards: keeping a QP in error
state for extended periods of time, moves hardware to the slow path (until
QP is destroyed). Fix this by posting a send WR on one of the QPs that are
being flushed, instead of using a separate drain QP.

This fixes bug <https://bugs.openfabrics.org/show_bug.cgi?id=636>
Reported and bisected by Scott Weitzenkamp at Cisco.
Debugged by Sasha Mikheev at Voltaire.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

> How about using the send queue of the QP we're trying to flush?
> I'll try to code this up tomorrow if no one beats me to the fix.
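
The "at most 1 outstanding drain WR" bookkeeping in ipoib_cm_start_rx_drain can be modelled with plain counters. This is a simplified sketch, not kernel code: struct toy_rx_lists and both toy_* functions are hypothetical, list lengths replace the real list_head lists, and the completion step is an assumption about what happens once the drain WR comes back.

```c
/* Toy model of the flush/drain lists; counters replace the kernel lists. */
struct toy_rx_lists {
    int flush_count; /* QPs waiting to be flushed */
    int drain_count; /* QPs covered by the outstanding drain WR */
    int posted;      /* drain WRs posted so far */
};

/* Mirrors the guard in ipoib_cm_start_rx_drain(): post a drain WR only when
 * there is work to flush and no drain WR is outstanding. Returns 1 if a WR
 * was posted. */
static int toy_start_rx_drain(struct toy_rx_lists *s)
{
    if (s->flush_count == 0 || s->drain_count != 0)
        return 0;

    s->posted++;                       /* stands in for ib_post_send() */
    s->drain_count += s->flush_count;  /* stands in for list_splice_init() */
    s->flush_count = 0;
    return 1;
}

/* Assumed completion step: when the drain WR completes, every QP on the
 * drain list can be reaped. Returns how many were released. */
static int toy_drain_completed(struct toy_rx_lists *s)
{
    int reaped = s->drain_count;

    s->drain_count = 0;
    return reaped;
}
```

QPs that join the flush list while a drain WR is outstanding simply wait; they are picked up by the next call after the outstanding WR has completed.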

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index f133b56..253ece1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -69,8 +69,9 @@ static struct ib_qp_attr ipoib_cm_err_attr = {
 
 #define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff
 
-static struct ib_recv_wr ipoib_cm_rx_drain_wr = {
-	.wr_id = IPOIB_CM_RX_DRAIN_WRID
+static struct ib_send_wr ipoib_cm_rx_drain_wr = {
+	.wr_id = IPOIB_CM_RX_DRAIN_WRID,
+	.opcode = IB_WR_SEND,
 };
 
 static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
@@ -163,16 +164,22 @@ partial_error:
 
 static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv* priv)
 {
-	struct ib_recv_wr *bad_wr;
+	struct ib_send_wr *bad_wr;
+	struct ipoib_cm_rx *p;
 
-	/* rx_drain_qp send queue depth is 1, so
+	/* We only reserved 1 extra slot in CQ for drain WRs, so
 	 * make sure we have at most 1 outstanding WR. */
 	if (list_empty(&priv->cm.rx_flush_list) ||
 	    !list_empty(&priv->cm.rx_drain_list))
 		return;
 
-	if (ib_post_recv(priv->cm.rx_drain_qp, &ipoib_cm_rx_drain_wr, &bad_wr))
-		ipoib_warn(priv, "failed to post rx_drain wr\n");
+	/*
+	 * QPs on flush list are error state.  This way, a "flush
+	 * error" WC will be immediately generated for each WR we post.
+	 */
+	p = list_entry(priv->cm.rx_flush_list.next, typeof(*p), list);
+	if (ib_post_send(p->qp, &ipoib_cm_rx_drain_wr, &bad_wr))
+		ipoib_warn(priv, "failed to post drain wr\n");
 
 	list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list);
 }
@@ -199,10 +206,10 @@ static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
 		.event_handler = ipoib_cm_rx_event_handler,
-		.send_cq = priv->cq, /* does not matter, we never send anything */
+		.send_cq = priv->cq, /* For drain WR */
 		.recv_cq = priv->cq,
 		.srq = priv->cm.srq,
-		.cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */
+		.cap.max_send_wr = 1, /* For drain WR */
 		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
 		.sq_sig_type = IB_SIGNAL_ALL_WR,
 		.qp_type = IB_QPT_RC,
@@ -242,6 +249,24 @@ static int ipoib_cm_modify_rx_qp(struct net_device *dev,
 		ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret);
 		return ret;
 	}
+
+	/* Mellanox firmware won't generate completions with error for drain WRs
+	 * unless the QP has been moved to RTS first. This work-around leaves a
+	 * window where a QP has moved to error asynchronously, but this will
+	 * eventually get fixed in firmware, so let's not error out if modify QP
+	 * fails. */
+	qp_attr.qp_state = IB_QPS_RTS;
+	ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to init QP attr for RTS: %d\n", ret);
+		return 0;
+	}
+	ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify QP to RTS: %d\n", ret);
+		return 0;
+	}
+
 	return 0;
 }
 
@@ -623,38 +648,11 @@ static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
 int ipoib_cm_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ib_qp_init_attr qp_init_attr = {
-		.send_cq = priv->cq,   /* does not matter, we never send anything */
-		.recv_cq = priv->cq,
-		.cap.max_send_wr = 1,  /* FIXME: 0 Seems not to work */
-		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
-		.cap.max_recv_wr = 1,
-		.cap.max_recv_sge = 1, /* FIXME: 0 Seems not to work */
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
-		.qp_type = IB_QPT_UC,
-	};
 	int ret;
 
 	if (!IPOIB_CM_SUPPORTED(dev->dev_addr))
 		return 0;
 
-	priv->cm.rx_drain_qp = ib_create_qp(priv->pd, &qp_init_attr);
-	if (IS_ERR(priv->cm.rx_drain_qp)) {
-		printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name);
-		ret = PTR_ERR(priv->cm.rx_drain_qp);
-		return ret;
-	}
-
-	/*
-	 * We put the QP in error state directly.  This way, a "flush
-	 * error" WC will be immediately generated for each WR we post.
-	 */
-	ret = ib_modify_qp(priv->cm.rx_drain_qp, &ipoib_cm_err_attr, IB_QP_STATE);
-	if (ret) {
-		ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret);
-		goto err_qp;
-	}
-
 	priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev);
 	if (IS_ERR(priv->cm.id)) {
 		printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name);
@@ -676,8 +674,6 @@ err_listen:
 	ib_destroy_cm_id(priv->cm.id);
 err_cm:
 	priv->cm.id = NULL;
-err_qp:
-	ib_destroy_qp(priv->cm.rx_drain_qp);
 	return ret;
 }
 
@@ -740,7 +736,6 @@ void ipoib_cm_dev_stop(struct net_device *dev)
 		kfree(p);
 	}
 
-	ib_destroy_qp(priv->cm.rx_drain_qp);
 	cancel_delayed_work(&priv->cm.stale_task);
 }
 

-- 
MST


From mst at dev.mellanox.co.il  Mon May 28 05:12:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 28 May 2007 15:12:06 +0300
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <20070514045832.GA18615@mellanox.co.il>
References: <20070514045832.GA18615@mellanox.co.il>
Message-ID: <20070528121206.GA1847@mellanox.co.il>

Roland, please pick up the patches from:

	git://git.openfabrics.org/~mst/linux-2.6/.git master

This will pull in the following outstanding patches intended for 2.6.22. All of
them have been posted previously (let me know if you'd like me to re-post the
patches):

Michael S. Tsirkin (3):
      IB/ipoib: fix to_ipoib_neigh access race
      IB/mthca: fix send CQE with error for QP connected to SRQ
      IB/ipoib: fix performance regression on Mellanox

Sean Hefty (1):
      ib/cm: fix stale connection detection

-- 
MST


From erezz at voltaire.com  Mon May 28 06:02:09 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Mon, 28 May 2007 16:02:09 +0300
Subject: [ofa-general] OFED 1.x (Gen 2) based SRP target code released!
In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F6F91AB@mtiexch01.mti.com>
References: <9FA59C95FFCBB34EA5E42C1A8573784F6F91AB@mtiexch01.mti.com>
Message-ID: <465AD2D1.2070100@voltaire.com>

Sujal Das wrote:
> Hello all,
> 
>  
> 
> Mellanox is pleased to release the OFED 1.x (Gen 2) - based SRP Target
> source code to the OpenFabrics community, OEMs and end users.  
> 
>  
> 
> This release is an upgrade to the previously released SRP Target source
> code that was based on the Mellanox IBGold driver and Gen 1 software
> interface.  The code has been tested to work with Mellanox InfiniBand
> adapters and is available under Open Fabrics open source license terms.
> 
>  

I'm trying to build srpt according to the instructions, but it does not get built at all. Here's what I did:

tar xzf OFED-1.2-rc3.tgz
cd OFED-1.2-rc3/SRPMS
rpm2cpio ofa_kernel-1.2-rc3.src.rpm |cpio -i
tar xzf ofa_kernel-1.2.tgz
cd ofa_kernel-1.2
patch -p1 < ~/srpt_inc/add_srpt_01.patch
patch -p1 < ~/srpt_inc/add_srpt_03.patch
cp -r ~/srpt drivers/infiniband/ulp/srpt
./configure --with-core-mod --with-ipoib-mod --with-srp-target-mod --with-mthca-mod

Here's the autoconf.h file that was generated:

#undef CONFIG_INFINIBAND
#undef CONFIG_INFINIBAND_IPOIB
#undef CONFIG_INFINIBAND_IPOIB_CM
#undef CONFIG_INFINIBAND_SDP
#undef CONFIG_INFINIBAND_SRP
#undef CONFIG_INFINIBAND_SRPT

#undef CONFIG_INFINIBAND_USER_MAD
#undef CONFIG_INFINIBAND_USER_ACCESS
#undef CONFIG_INFINIBAND_ADDR_TRANS
#undef CONFIG_INFINIBAND_MTHCA

#undef CONFIG_INFINIBAND_IPOIB_DEBUG
#undef CONFIG_INFINIBAND_ISER
#undef CONFIG_INFINIBAND_EHCA
#undef CONFIG_INFINIBAND_EHCA_SCALING
#undef CONFIG_RDS
#undef CONFIG_RDS_IB
#undef CONFIG_RDS_TCP
#undef CONFIG_RDS_DEBUG
#undef CONFIG_INFINIBAND_MADEYE
#undef CONFIG_INFINIBAND_VNIC
#undef CONFIG_INFINIBAND_VNIC_DEBUG
#undef CONFIG_INFINIBAND_VNIC_STATS
#undef CONFIG_INFINIBAND_CXGB3
#undef CONFIG_INFINIBAND_CXGB3_DEBUG
#undef CONFIG_CHELSIO_T3

#undef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
#undef CONFIG_INFINIBAND_SDP_SEND_ZCOPY
#undef CONFIG_INFINIBAND_SDP_RECV_ZCOPY
#undef CONFIG_INFINIBAND_SDP_DEBUG
#undef CONFIG_INFINIBAND_SDP_DEBUG_DATA
#undef CONFIG_INFINIBAND_IPATH
#undef CONFIG_INFINIBAND_MTHCA_DEBUG

#define CONFIG_INFINIBAND 1
#define CONFIG_INFINIBAND_IPOIB 1
#define CONFIG_INFINIBAND_IPOIB_CM 1
#undef CONFIG_INFINIBAND_SDP
#undef CONFIG_INFINIBAND_SRP
#define CONFIG_INFINIBAND_SRPT 1

#undef CONFIG_INFINIBAND_USER_MAD
#undef CONFIG_INFINIBAND_USER_ACCESS
#undef CONFIG_INFINIBAND_ADDR_TRANS
#define CONFIG_INFINIBAND_MTHCA 1
#undef CONFIG_INFINIBAND_VNIC
#undef CONFIG_INFINIBAND_CXGB3
#undef CONFIG_CHELSIO_T3

#define CONFIG_INFINIBAND_IPOIB_DEBUG 1
#undef CONFIG_INFINIBAND_ISER
#undef CONFIG_SCSI_ISCSI_ATTRS
#undef CONFIG_ISCSI_TCP
#undef CONFIG_INFINIBAND_EHCA
#undef CONFIG_RDS
#undef CONFIG_RDS_IB
#undef CONFIG_RDS_TCP
#undef CONFIG_RDS_DEBUG
#undef CONFIG_INFINIBAND_VNIC_DEBUG
#undef CONFIG_INFINIBAND_VNIC_STATS
#undef CONFIG_INFINIBAND_CXGB3_DEBUG

#undef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
#undef CONFIG_INFINIBAND_SDP_SEND_ZCOPY
#undef CONFIG_INFINIBAND_SDP_RECV_ZCOPY
#undef CONFIG_INFINIBAND_SDP_DEBUG
#undef CONFIG_INFINIBAND_SDP_DEBUG_DATA
#undef CONFIG_INFINIBAND_IPATH
#define CONFIG_INFINIBAND_MTHCA_DEBUG 1
#undef CONFIG_INFINIBAND_MADEYE
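
A cross-check that may help with a problem like this is to compare the symbols that are #define'd in the generated header against the CONFIG_ variables that the top-level make actually hands to kbuild. The following is only a sketch: the function name missing_symbols and the idea of saving the make output to a log file are assumptions for illustration, not part of the OFED build scripts. It relies on grep -o, which GNU and BSD grep both provide.

```shell
#!/bin/sh
# Print CONFIG_* symbols that are #define'd in a header but are never
# passed with a non-empty value (=y or =m) in a saved make log.
# Usage: missing_symbols <autoconf.h> <build.log>   (hypothetical helper)
missing_symbols() {
    header=$1
    log=$2
    defined=$(mktemp)
    passed=$(mktemp)

    # Second field of every "#define CONFIG_FOO ..." line.
    awk '$1 == "#define" && $2 ~ /^CONFIG_/ { print $2 }' "$header" \
        | sort -u > "$defined"

    # Names from CONFIG_FOO=y / CONFIG_FOO=m; empty assignments are skipped.
    grep -oE 'CONFIG_[A-Z0-9_]+=[ym]' "$log" | cut -d= -f1 \
        | sort -u > "$passed"

    # Lines that appear only in the first (header) list.
    comm -23 "$defined" "$passed"
    rm -f "$defined" "$passed"
}
```

For example, after running something like "make V=1 > build.log 2>&1" (the log name is arbitrary), calling "missing_symbols include/linux/autoconf.h build.log" would list any symbol, such as CONFIG_INFINIBAND_SRPT, that is set in the header but is not passed on the make command line.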

Now, I ran "make" and srpt wasn't built:

Building kernel modules
Kernel version: 2.6.16.21-0.8-smp
Modules directory: //lib/modules/2.6.16.21-0.8-smp/updates
Kernel sources: /lib/modules/2.6.16.21-0.8-smp/build
env EXTRA_CFLAGS="  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include \
        -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib \
        -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug \
        -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core \
        -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 \
        -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds " \
        make -C /lib/modules/2.6.16.21-0.8-smp/build SUBDIRS="/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2" KERNELRELEASE=2.6.16.21-0.8-smp \
        EXTRAVERSION=.21-0.8-smp V=1  \
        CONFIG_INFINIBAND=m \
        CONFIG_INFINIBAND_IPOIB=m \
        CONFIG_INFINIBAND_IPOIB_CM=y \
        CONFIG_INFINIBAND_SDP= \
        CONFIG_INFINIBAND_SRP= \
        CONFIG_INFINIBAND_USER_MAD= \
        CONFIG_INFINIBAND_USER_ACCESS= \
        CONFIG_INFINIBAND_ADDR_TRANS= \
        CONFIG_INFINIBAND_MTHCA=m \
        CONFIG_INFINIBAND_IPOIB_DEBUG=y \
        CONFIG_INFINIBAND_ISER= \
                CONFIG_SCSI_ISCSI_ATTRS= \
                CONFIG_ISCSI_TCP= \
        CONFIG_INFINIBAND_EHCA= \
        CONFIG_INFINIBAND_EHCA_SCALING= \
        CONFIG_RDS= \
        CONFIG_RDS_IB= \
        CONFIG_RDS_TCP= \
        CONFIG_RDS_DEBUG= \
        CONFIG_INFINIBAND_IPOIB_DEBUG_DATA= \
        CONFIG_INFINIBAND_SDP_SEND_ZCOPY= \
        CONFIG_INFINIBAND_SDP_RECV_ZCOPY= \
        CONFIG_INFINIBAND_SDP_DEBUG= \
        CONFIG_INFINIBAND_SDP_DEBUG_DATA= \
        CONFIG_INFINIBAND_IPATH= \
        CONFIG_INFINIBAND_MTHCA_DEBUG=y \
                CONFIG_INFINIBAND_MADEYE= \
                CONFIG_INFINIBAND_VNIC= \
                CONFIG_INFINIBAND_VNIC_DEBUG= \
                CONFIG_INFINIBAND_VNIC_STATS= \
                CONFIG_CHELSIO_T3= \
                CONFIG_INFINIBAND_CXGB3= \
                CONFIG_INFINIBAND_CXGB3_DEBUG= \
        LINUXINCLUDE=' \
        -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ \
        -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include \
        -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include \
        -Iinclude \
        $(if $(KBUILD_SRC),-Iinclude2 -I$(srctree)/include) \
        -include include/linux/autoconf.h \
        -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h \
        ' \
        modules
make[1]: Entering directory `/usr/src/linux-2.6.16.21-0.8-obj/x86_64/smp'
make -C ../../../linux-2.6.16.21-0.8 O=../linux-2.6.16.21-0.8-obj/x86_64/smp modules
make -C /usr/src/linux-2.6.16.21-0.8-obj/x86_64/smp \
KBUILD_SRC=/usr/src/linux-2.6.16.21-0.8 \
KBUILD_EXTMOD="/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2" -f /usr/src/linux-2.6.16.21-0.8/Makefile modules
rm -rf /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/.tmp_versions
mkdir -p /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/.tmp_versions
make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build obj=/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2
make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build obj=/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband
make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build obj=/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.cm.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(cm)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_cm)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/cm.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.packer.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(packer)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_packer.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/packer.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.ud_header.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ud_header)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_ud_header.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ud_header.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.verbs.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(verbs)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_verbs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/verbs.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.sysfs.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(sysfs)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_sysfs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/sysfs.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.device.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(device)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_device.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/device.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.fmr_pool.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(fmr_pool)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_fmr_pool.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/fmr_pool.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.cache.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(cache)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_cache.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/cache.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.genalloc.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(genalloc)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_genalloc.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/genalloc.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.netevent.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(netevent)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_netevent.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/netevent.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.local_sa.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(local_sa)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_local_sa)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_local_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/local_sa.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.mad.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mad)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mad)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/mad.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.smi.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-
pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(smi)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mad)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_smi.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/smi.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.agent.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wn
o-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(agent)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mad)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_agent.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/agent.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.mad_rmpp.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement 
-Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mad_rmpp)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mad)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_mad_rmpp.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/mad_rmpp.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.sa_query.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement 
-Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(sa_query)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_sa)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_sa_query.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/sa_query.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.multicast.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement
 -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(multicast)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_sa)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_multicast.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/multicast.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.iwcm.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno
-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(iwcm)"  -D"KBUILD_MODNAME=KBUILD_STR(iw_cm)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_iwcm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iwcm.c
  ld -m elf_x86_64  -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/packer.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ud_header.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/verbs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/sysfs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/device.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/fmr_pool.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/cache.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/genalloc.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/netevent.o
  ld -m elf_x86_64  -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/smi.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/agent.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/mad_rmpp.o
  ld -m elf_x86_64  -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/sa_query.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/multicast.o
  ld -m elf_x86_64  -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/local_sa.o
  ld -m elf_x86_64  -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/cm.o
  ld -m elf_x86_64  -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iwcm.o
make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build obj=/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_main.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-
statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_main)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_main.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_main.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_cmd.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-s
tatement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_cmd)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_cmd.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cmd.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_profile.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-aft
er-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_profile)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_profile.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_profile.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_reset.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after
-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_reset)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_reset.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_reset.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_allocator.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-a
fter-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_allocator)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_allocator.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_allocator.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_eq.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-st
atement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_eq)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_eq.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_eq.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_pd.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-st
atement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_pd)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_pd.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_pd.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_cq.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-st
atement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_cq)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_cq.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cq.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_mr.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-st
atement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_mr)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_mr.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_mr.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_qp.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-st
atement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_qp)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_qp.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c: In function ‘mthca_tavor_post_send’:
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c:1587: warning: ‘f0’ may be used uninitialized in this function
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c: In function ‘mthca_arbel_post_send’:
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c:1941: warning: ‘f0’ may be used uninitialized in this function
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_av.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-st
atement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_av)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_av.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_av.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_mcg.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-s
tatement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_mcg)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_mcg.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_mcg.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_mad.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-s
tatement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_mad)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_mad.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_provider.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-af
ter-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_provider)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_provider.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_provider.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_memfree.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-aft
er-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_memfree)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_memfree.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_memfree.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_uar.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-s
tatement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_uar)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_uar.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_uar.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_srq.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-s
tatement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_srq)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_srq.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_srq.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_catas.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after
-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_catas)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_catas.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_catas.c
  ld -m elf_x86_64  -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_main.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cmd.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_profile.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_reset.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_allocator.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_eq.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_pd.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cq.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_mr.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_av.o /tmp/OFED
-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_mcg.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_provider.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_memfree.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_uar.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_srq.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_catas.o
make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build obj=/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_main.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-afte
r-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_main)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_main.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_main.c
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_main.c: In function ‘ipoib_neigh_destructor’:
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_main.c:867: warning: ISO C90 forbids mixed declarations and code
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_ib.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-
statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_ib)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_ib.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_ib.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_multicast.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration
-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_multicast)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_multicast.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_verbs.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-aft
er-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_verbs)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_verbs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_vlan.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-afte
r-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_vlan)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_vlan.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_cm.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-
statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_cm)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_cm.c
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_fs.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-
statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds  -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_fs)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_fs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_fs.c
  ld -m elf_x86_64  -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_main.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_ib.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_multicast.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_verbs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_vlan.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_fs.o
  Building modules, stage 2.
make -rR -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.modpost
  scripts/mod/modpost -m -a -i /usr/src/linux-2.6.16.21-0.8-obj/x86_64/smp/Module.symvers -I /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/Modules.symvers -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/Modules.symvers -s /dev/null /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.o
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.ib_cm.mod.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-
1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds   -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ib_cm)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_cm)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.mod.c
  ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.mod.o
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.ib_core.mod.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFE
D-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds   -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ib_core)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.mod.c
  ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.mod.o
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.ib_local_sa.mod.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp
/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds   -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ib_local_sa)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_local_sa)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.mod.c
  ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.mod.o
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.ib_mad.mod.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED
-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds   -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ib_mad)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mad)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.mod.c
  ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.mod.o
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.ib_sa.mod.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-
1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds   -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ib_sa)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_sa)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.mod.c
  ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.mod.o
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.iw_cm.mod.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-
1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds   -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(iw_cm)"  -D"KBUILD_MODNAME=KBUILD_STR(iw_cm)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.mod.c
  ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.mod.o
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.ib_mthca.mod.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tm
p/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds   -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ib_mthca)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.mod.c
  ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.mod.o
  gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ib_ipoib.mod.o.d  -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -Iinclude  -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include  -include include/linux/autoconf.h  -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h    -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/t
mp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3  -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds   -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ib_ipoib)"  -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.mod.c
  ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.mod.o
make[1]: Leaving directory `/usr/src/linux-2.6.16.21-0.8-obj/x86_64/smp'


From tziporet at dev.mellanox.co.il  Mon May 28 07:30:41 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 28 May 2007 17:30:41 +0300
Subject: [ofa-general] OFED 1.2 status & RC4
Message-ID: <465AE791.5040003@mellanox.co.il>

Hi All,

Most of the critical and major bugs are fixed, so we plan to release RC4 this
Wednesday (or Thursday if another important fix becomes available).

567  blocker   rolandd at cisco.com     RHEL5 ppc64 UD verbs failures
577  critical  ishai at mellanox.co.il  SRP multipath failover too slow (minutes, not seconds)
626  major     monis at voltaire.com    wrong network/broadcast address set by ib-bond script


All - if you have any fix that must go into RC4, please send it by
end of Tuesday (US time).

Roland, Ishai and Moni - please update me regarding status of your bugs

Thanks,
Tziporet



From dotanb at dev.mellanox.co.il  Mon May 28 08:22:59 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Mon, 28 May 2007 18:22:59 +0300
Subject: [ofa-general] there is a warning message in every use of the library
	libibverbs
Message-ID: <465AF3D3.10205@dev.mellanox.co.il>

Hi Roland.

In every test/application that uses libibverbs (I think when the
libibverbs init function is called)
I see the following warning:

<-start->
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
<-end->

Why did you add this warning message?


Even when I ran the test as root I got this warning.

Can you add an environment variable that suppresses this warning?
(Or I can send you a patch if you agree.)


thanks
Dotan


From rdreier at cisco.com  Mon May 28 10:02:37 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 28 May 2007 10:02:37 -0700
Subject: [ofa-general] Re: there is a warning message in every use of the
	library libibverbs
In-Reply-To: <465AF3D3.10205@dev.mellanox.co.il> (Dotan Barak's message of
	"Mon, 28 May 2007 18:22:59 +0300")
References: <465AF3D3.10205@dev.mellanox.co.il>
Message-ID: <adad50k285u.fsf@cisco.com>

 > In every test/application that uses the libibverbs (i think when the
 > libibverbs init function is being called)
 > i see the following warning:
 > 
 > <-start->
 > libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
 >    This will severely limit memory registrations.
 > <-end->
 > 
 > Why did you add this warning message?

To avoid the FAQ of "memory registration / CQ creation fails and I
don't know why".

 - R.
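
For anyone who wants to check the limit from their own program before
registering memory, here is a minimal sketch using the standard getrlimit()
call. The function name warn_if_memlock_low and the min_bytes threshold are
assumptions made for illustration; this is not the code inside libibverbs.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not part of libibverbs): warn when the soft
 * RLIMIT_MEMLOCK limit is below min_bytes.
 * Returns 1 if a warning was printed, 0 if the limit is sufficient
 * (or unlimited), and -1 if getrlimit() itself failed.
 */
int warn_if_memlock_low(rlim_t min_bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;

	if (rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur < min_bytes) {
		fprintf(stderr,
			"warning: RLIMIT_MEMLOCK is %llu bytes; "
			"memory registration may fail\n",
			(unsigned long long)rl.rlim_cur);
		return 1;
	}
	return 0;
}
```

A caller would invoke warn_if_memlock_low() once at startup with whatever
amount it expects to pin. The limit itself is usually raised with
"ulimit -l" or in /etc/security/limits.conf.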


From sweitzen at cisco.com  Mon May 28 10:09:00 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 28 May 2007 10:09:00 -0700
Subject: [ofa-general] RE: [ewg] OFED 1.2 status & RC4
In-Reply-To: <465AE791.5040003@mellanox.co.il>
References: <465AE791.5040003@mellanox.co.il>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303951123@xmb-sjc-216.amer.cisco.com>

Several IPoIB bugs were marked fixed today. Are all the IPoIB
fixes in OFED-1.2-20070528-0600.tgz, or do I need to wait another day?
 
Scott


________________________________

	From: ewg-bounces at lists.openfabrics.org
[mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren
	Sent: Monday, May 28, 2007 7:31 AM
	To: EWG
	Cc: Moni Levy; Roland Dreier (rdreier); Ishai Rabinovitz;
OpenFabrics General
	Subject: [ewg] OFED 1.2 status & RC4
	
	
	Hi All,
	
	Most of critical and major bugs are fixed thus we plan to have
RC4 this Wed (or Thursday if some other important fix will be available)
	
	
567	 blocker	 rolandd at cisco.com	 RHEL5 ppc64 UD verbs
failures	
577	 critical	 ishai at mellanox.co.il	 SRP multipath failover
too slow (minutes, not seconds)	
626	 major	 monis at voltaire.com	 wrong network /broadcast
address set by ib-bond script	


	All - if you have any fix that must be applied to RC4 please
send this till end of Tuesday (US time)
	
	Roland, Ishai and Moni - please update me regarding status of
your bugs
	
	Thanks,
	Tziporet
	
	

From sashak at voltaire.com  Mon May 28 13:07:42 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 28 May 2007 23:07:42 +0300
Subject: [ofa-general] [PATCH] opensm/console: portstatus command for only
	initialized ports
Message-ID: <20070528200742.GA13193@sashak.voltaire.com>


Run the portstatus command only for initialized ports, plus minor
indentation fixes.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_console.c |   18 ++++++++++--------
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index 2802c38..3415262 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -598,15 +598,17 @@ __get_stats(cl_map_item_t * const p_map_item, void *context)
 	fs->total_nodes++;
 
 	for (port = 1; port < num_ports; port++) {
-		osm_physp_t    *phys = osm_node_get_physp_ptr(node, port);
+		osm_physp_t *phys = osm_node_get_physp_ptr(node, port);
 		ib_port_info_t *pi = &(phys->port_info);
-
-		uint8_t         active_speed = ib_port_info_get_link_speed_active(pi);
-		uint8_t         enabled_speed = ib_port_info_get_link_speed_enabled(pi);
-		uint8_t         active_width = pi->link_width_active;
-		uint8_t         enabled_width = pi->link_width_enabled;
-		uint8_t         port_state = ib_port_info_get_port_state(pi);
-		uint8_t         port_phys_state = ib_port_info_get_port_phys_state(pi);
+		uint8_t active_speed = ib_port_info_get_link_speed_active(pi);
+		uint8_t enabled_speed = ib_port_info_get_link_speed_enabled(pi);
+		uint8_t active_width = pi->link_width_active;
+		uint8_t enabled_width = pi->link_width_enabled;
+		uint8_t port_state = ib_port_info_get_port_state(pi);
+		uint8_t port_phys_state = ib_port_info_get_port_phys_state(pi);
+
+		if (!osm_physp_is_valid(phys))
+			continue;
 
 		if ((enabled_width ^ active_width) > active_width) {
 			__tag_port_report(&(fs->reduced_width_ports),
-- 
1.5.2.109.g802f
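
The idea of the change, skipping ports that were never initialized so that
stale data is not counted, can be sketched outside OpenSM as below. The
struct port_sample type and count_reduced_width_ports() are hypothetical
names for illustration only; the width comparison is the same expression
used in __get_stats(). Unlike the patch, the sketch tests validity before
reading any field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-port data; not an OpenSM type. */
struct port_sample {
	int valid;             /* nonzero once the port has been initialized */
	uint8_t active_width;  /* width currently in use (bit-encoded) */
	uint8_t enabled_width; /* mask of widths the port may use */
};

/*
 * Count ports running at less than their enabled width.
 * Index 0 is skipped, mirroring the "port = 1" loop start in the patch.
 */
size_t count_reduced_width_ports(const struct port_sample *ports,
				 size_t num_ports)
{
	size_t reduced = 0;
	size_t i;

	for (i = 1; i < num_ports; i++) {
		const struct port_sample *p = &ports[i];

		/* Ignore ports that were never initialized. */
		if (!p->valid)
			continue;

		if ((p->enabled_width ^ p->active_width) > p->active_width)
			reduced++;
	}
	return reduced;
}
```

Without the validity check, an uninitialized entry whose fields happen to be
zero or garbage could be reported as a reduced-width port.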


From mst at dev.mellanox.co.il  Mon May 28 21:27:41 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 07:27:41 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh
	access race
In-Reply-To: <20070524131154.GA7940@mellanox.co.il>
References: <20070522005918.GB13311@mellanox.co.il> <adatzu4d1wx.fsf@cisco.com>
	<20070524131154.GA7940@mellanox.co.il>
Message-ID: <20070529042741.GB13866@mellanox.co.il>

> Quoting Michael S. Tsirkin <mst at dev.mellanox.co.il>:
> Subject: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race
> 
> hard_start_xmit dereferences to_ipoib_neigh when only tx_lock is taken.  This
> would only be safe if all calls that modify *to_ipoib_neigh take tx_lock too.
> Currently this is not always true for ipoib_neigh_free and path_rec_completion,
> which results in memory corruption.  Fix this race, making sure
> path_rec_completion and ipoib_neigh_free are always called under
> tx_lock.
> 
> Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
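
The locking rule in the description above (a pointer read under tx_lock is
only safe if every writer also takes tx_lock) can be illustrated with a small
userspace sketch. It uses a pthread mutex in place of the kernel lock, and
neigh_entry, neigh_slot, neigh_set, neigh_free and xmit_lookup_dlid are
hypothetical names, not the IPoIB code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the neighbour entry; not the IPoIB struct. */
struct neigh_entry {
	int dlid;
};

static pthread_mutex_t tx_lock = PTHREAD_MUTEX_INITIALIZER;
static struct neigh_entry *neigh_slot; /* plays the role of *to_ipoib_neigh */

/* Writer: install a new entry, freeing any old one, under tx_lock. */
int neigh_set(int dlid)
{
	struct neigh_entry *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	n->dlid = dlid;

	pthread_mutex_lock(&tx_lock);
	free(neigh_slot);
	neigh_slot = n;
	pthread_mutex_unlock(&tx_lock);
	return 0;
}

/* Writer: free the entry under tx_lock so no reader sees a stale pointer. */
void neigh_free(void)
{
	pthread_mutex_lock(&tx_lock);
	free(neigh_slot);
	neigh_slot = NULL;
	pthread_mutex_unlock(&tx_lock);
}

/*
 * Reader (the transmit path): dereference only while holding tx_lock.
 * Returns the destination LID, or -1 if there is no entry.
 */
int xmit_lookup_dlid(void)
{
	int dlid = -1;

	pthread_mutex_lock(&tx_lock);
	if (neigh_slot)
		dlid = neigh_slot->dlid;
	pthread_mutex_unlock(&tx_lock);
	return dlid;
}
```

If neigh_free() skipped the lock, the reader could load the pointer just
before the free and then dereference freed memory, which is the kind of
corruption the patch description refers to.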

Could you on this patch please?

-- 
MST


From rdreier at cisco.com  Mon May 28 21:28:32 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 28 May 2007 21:28:32 -0700
Subject: [ofa-general] ibv_get_cq_event blocking forever after successful
	ibv_post_send...
In-Reply-To: <20070525212214.20500.qmail@station183.com> (Jimmy Hill's message
	of "Fri, 25 May 2007 21:22:14 +0000")
References: <20070525212214.20500.qmail@station183.com>
Message-ID: <adalkf8z21b.fsf@cisco.com>

 > Any ideas on why the ibv_get_cq_event() would never see an event
 > after a "successful" send requesting a completion event?

It's either a bug in your code or a bug in the stack below your code.
The best way to debug this would be for you to post your actual code
(in a form that someone else can run), so that we can either point out
what's wrong with your code, or have a test case for the real bug.

 - R.


From rdreier at cisco.com  Mon May 28 21:33:17 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 28 May 2007 21:33:17 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh
	access race
In-Reply-To: <20070529042741.GB13866@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 29 May 2007 07:27:41 +0300")
References: <20070522005918.GB13311@mellanox.co.il>
	<adatzu4d1wx.fsf@cisco.com> <20070524131154.GA7940@mellanox.co.il>
	<20070529042741.GB13866@mellanox.co.il>
Message-ID: <adaejl0z1te.fsf@cisco.com>

 > Could you on this patch please?

??


From rdreier at cisco.com  Mon May 28 21:40:27 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 28 May 2007 21:40:27 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix performance
	regression on Mellanox
In-Reply-To: <20070528113727.GP2945@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 28 May 2007 14:37:27 +0300")
References: <20070521120459.GI20400@mellanox.co.il>
	<ada1wh9ewya.fsf@cisco.com> <46595950.6080106@voltaire.com>
	<20070527125337.GF8342@mellanox.co.il> <adalkf921at.fsf@cisco.com>
	<20070528113727.GP2945@mellanox.co.il>
Message-ID: <adaabvoz1hg.fsf@cisco.com>

seems like this leaves rx_drain_qp in the data structure and also in
the comment in ipoib.h... not sure if there are any other remnants of
the previous approach that should be cleaned up.


From mst at dev.mellanox.co.il  Mon May 28 21:44:42 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 07:44:42 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh
	access race
In-Reply-To: <adaejl0z1te.fsf@cisco.com>
References: <20070522005918.GB13311@mellanox.co.il> <adatzu4d1wx.fsf@cisco.com>
	<20070524131154.GA7940@mellanox.co.il>
	<20070529042741.GB13866@mellanox.co.il> <adaejl0z1te.fsf@cisco.com>
Message-ID: <20070529044442.GC13866@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race
> 
>  > Could you on this patch please?
> 
> ??

Could you comment on this patch please?

-- 
MST


From rdreier at cisco.com  Mon May 28 21:45:19 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 28 May 2007 21:45:19 -0700
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <20070528121206.GA1847@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 28 May 2007 15:12:06 +0300")
References: <20070514045832.GA18615@mellanox.co.il>
	<20070528121206.GA1847@mellanox.co.il>
Message-ID: <ada1wh0z19c.fsf@cisco.com>

 >       IB/ipoib: fix to_ipoib_neigh access race

I'm not convinced this is 2.6.22 material at this point -- it doesn't
fix any observed problem that I know of.  (And the SRQ drain patch
shows how even safe-looking patches can cause big problems)

 - R.


From rdreier at cisco.com  Mon May 28 21:46:26 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 28 May 2007 21:46:26 -0700
Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git
In-Reply-To: <20070526194049.GD15942@mellanox.co.il> (Michael S. Tsirkin's
	message of "Sat, 26 May 2007 22:40:49 +0300")
References: <adahcq036ep.fsf@cisco.com> <20070526194049.GD15942@mellanox.co.il>
Message-ID: <adawsysxmn1.fsf@cisco.com>

 > don't we want the patch that sets status to flushed with error?

I figured I would test it a little and queue it for 2.6.23.  I don't
see a justification for putting in 2.6.22 since it's just paranoia not
driven by any observed issue.


From mst at dev.mellanox.co.il  Mon May 28 21:48:15 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 07:48:15 +0300
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <ada1wh0z19c.fsf@cisco.com>
References: <20070514045832.GA18615@mellanox.co.il>
	<20070528121206.GA1847@mellanox.co.il> <ada1wh0z19c.fsf@cisco.com>
Message-ID: <20070529044815.GD13866@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
> 
>  >       IB/ipoib: fix to_ipoib_neigh access race
> 
> I'm not convinced this is 2.6.22 material at this point -- it doesn't
> fix any observed problem that I know of.  (And the SRQ drain patch
> shows how even safe-looking patches can cause big problems)

Fine, but we do have it in OFED - could you spare some cycles to review it?

-- 
MST


From mst at dev.mellanox.co.il  Mon May 28 21:51:34 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 07:51:34 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix performance
	regression on Mellanox
In-Reply-To: <adaabvoz1hg.fsf@cisco.com>
References: <20070521120459.GI20400@mellanox.co.il>
	<ada1wh9ewya.fsf@cisco.com> <46595950.6080106@voltaire.com>
	<20070527125337.GF8342@mellanox.co.il> <adalkf921at.fsf@cisco.com>
	<20070528113727.GP2945@mellanox.co.il> <adaabvoz1hg.fsf@cisco.com>
Message-ID: <20070529045134.GE13866@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH for-2.6.22] IB/ipoib: fix performance regression on Mellanox
> 
> seems like this leaves rx_drain_qp in the data structure and also in
> the comment in ipoib.h...

Right, add this on top of it.

> not sure if there are any other remnants of
> the previous approach that should be cleaned up.

Hopefully not - the compiler would notice any remaining uses of
rx_drain_qp, and that really is the only change.

----->

Remove unused rx_drain_qp.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 158759e..285c143 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -156,7 +156,7 @@ struct ipoib_cm_data {
  * - and then invoke a Destroy QP or Reset QP.
  *
  * We use the second option and wait for a completion on the
- * rx_drain_qp before destroying QPs attached to our SRQ.
+ * same CQ before destroying QPs attached to our SRQ.
  */
 
 enum ipoib_cm_state {
@@ -199,7 +199,6 @@ struct ipoib_cm_dev_priv {
 	struct ib_srq  	       *srq;
 	struct ipoib_cm_rx_buf *srq_ring;
 	struct ib_cm_id        *id;
-	struct ib_qp           *rx_drain_qp;   /* generates WR described in 10.3.1 */
 	struct list_head        passive_ids;   /* state: LIVE */
 	struct list_head        rx_error_list; /* state: ERROR */
 	struct list_head        rx_flush_list; /* state: FLUSH, drain not started */

-- 
MST


From mst at dev.mellanox.co.il  Mon May 28 23:06:50 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 09:06:50 +0300
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <ada1wh0z19c.fsf@cisco.com>
References: <20070514045832.GA18615@mellanox.co.il>
	<20070528121206.GA1847@mellanox.co.il> <ada1wh0z19c.fsf@cisco.com>
Message-ID: <20070529060626.GB6032@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
> 
>  >       IB/ipoib: fix to_ipoib_neigh access race
> 
> I'm not convinced this is 2.6.22 material at this point -- it doesn't
> fix any observed problem that I know of.

OK, I removed it for now, and cleaned the unused rx_drain_qp field
in the performance fix patch. What's left is:

Michael S. Tsirkin (2):
      IB/mthca: fix send CQE with error for QP connected to SRQ
      IB/ipoib: fix performance regression on Mellanox

Sean Hefty (1):
      ib/cm: fix stale connection detection

These are all fixes for observed problems.

> (And the SRQ drain patch
> shows how even safe-looking patches can cause big problems)

Yes. We knew it was a big, risky change - it just seemed we
had to fix it for IPoIB CM to be useful.

-- 
MST


From monisonlists at gmail.com  Mon May 28 23:41:02 2007
From: monisonlists at gmail.com (Moni Shoua)
Date: Tue, 29 May 2007 09:41:02 +0300
Subject: [ofa-general] Trouble install OFED 1.2-rc3 - ib-bonding
In-Reply-To: <A382D4292574EB47A85B8159A6AED1A18305BC@FPNYEXCBE02.opus-i.corp>
References: <A382D4292574EB47A85B8159A6AED1A18305BC@FPNYEXCBE02.opus-i.corp>
Message-ID: <465BCAFE.2030001@gmail.com>

Jeffrey Wong wrote:
> Hello,
> 
> I am installing the OFED 1.2-rc3.
> 
> Everything else builds except for ib-bonding.
I see you have kernel 2.6.18-8.1.4.el5 which is not supported by ib-bonding.
It seems like a beta of RHEL5. Am I right?
> 
> 
>  =
> 
> 
> Thanks in advance.
> 
>  =
> 
> 
>  =
> 
> 
> I am getting the following error messages:
> 
> + make -C /lib/modules/2.6.18-8.1.4.el5/build modules
> M=3D/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding
> 
> make: Entering directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64'
> 
>   CC [M]
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m
> ain.o
> 
> In file included from
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m
> ain.c:78:
> 
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin
> g.h: In function 'bond_set_slave_inactive_flags':
> 
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin
> g.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this
> 
>  function)
> 
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin
> g.h:262: error: (Each undeclared identifier is reported only once
> 
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin
> g.h:262: error: for each function it appears in.)
> 
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin
> g.h: In function 'bond_set_slave_active_flags':
> 
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_compute_features':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1233: warning: comparison of distinct pointer types lacks a cast
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_enslave':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1449: error: 'IFF_BONDING' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_release':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1848: error: 'IFF_BONDING' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_arp_rcv':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:2548: error: 'IFF_BONDING' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_netdev_event':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:3390: error: 'IFF_BONDING' undeclared (first use in this function)
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_init':
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4374: warning: assignment discards qualifiers from pointer target type
> /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4386: error: 'IFF_BONDING' undeclared (first use in this function)
> make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o] Error 1
> make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding] Error 2
> make: Leaving directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64'
> + echo ' Building  IB bonding driver failed'
>  Building  IB bonding driver failed
> + exit 1
> error: Bad exit status from /var/tmp/rpm-tmp.23876 (%build)
> 
> Jeff Wong


From mst at dev.mellanox.co.il  Tue May 29 00:17:01 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 10:17:01 +0300
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <ada1wh0z19c.fsf@cisco.com>
References: <20070514045832.GA18615@mellanox.co.il>
	<20070528121206.GA1847@mellanox.co.il> <ada1wh0z19c.fsf@cisco.com>
Message-ID: <20070529071701.GA8159@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
> 
>  >       IB/ipoib: fix to_ipoib_neigh access race
> 
> I'm not convinced this is 2.6.22 material at this point -- it doesn't
> fix any observed problem that I know of.  (And the SRQ drain patch
> shows how even safe-looking patches can cause big problems)

for-2.6.23 for now?

-- 
MST


From tziporet at dev.mellanox.co.il  Tue May 29 00:26:44 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 29 May 2007 10:26:44 +0300
Subject: [ofa-general] Re: [ewg] OFED 1.2 status & RC4
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303951123@xmb-sjc-216.amer.cisco.com>
References: <465AE791.5040003@mellanox.co.il>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303951123@xmb-sjc-216.amer.cisco.com>
Message-ID: <465BD5B4.50003@mellanox.co.il>

Scott Weitzenkamp (sweitzen) wrote:
> There were several IPoIB bugs marked fixed today, are all the IPoIB 
> fixes in OFED-1.2-20070528-0600.tgz or do I need to wait another day?
>  
> Scott
IPoIB fixes are in OFED-1.2-20070528-0600.tgz.
The SRP fix will be in the next build.


Tziporet


From cap at nsc.liu.se  Tue May 29 00:38:30 2007
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Tue, 29 May 2007 09:38:30 +0200
Subject: [ofa-general] Trouble install OFED 1.2-rc3 - ib-bonding
In-Reply-To: <465BCAFE.2030001@gmail.com>
References: <A382D4292574EB47A85B8159A6AED1A18305BC@FPNYEXCBE02.opus-i.corp>
	<465BCAFE.2030001@gmail.com>
Message-ID: <200705290938.35022.cap@nsc.liu.se>

On Tuesday 29 May 2007, Moni Shoua wrote:
> Jeffrey Wong wrote:
> > Hello,
> >
> > I am installing the OFED 1.2-rc3.
> >
> > Everything else builds except for ib-bonding.
>
> I see you have kernel 2.6.18-8.1.4.el5 which is not supported by
> ib-bonding. It seems like a beta of RHEL5. Am I right?

No, that is _the_ current RHEL5 kernel (release + security updates).

/Peter

From ogerlitz at voltaire.com  Tue May 29 00:56:00 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 29 May 2007 10:56:00 +0300
Subject: [ofa-general] Re: ipoib / bonding and OFED
In-Reply-To: <4657373E.2030903@hp.com>
References: <3857BB049D83424D9DB82753D37CEA55459C41@taurus.voltaire.com>
	<4657373E.2030903@hp.com>
Message-ID: <465BDC90.5080305@voltaire.com>

Bob Kossey wrote:
> I copied OR since I think this is related to his OFED HA work, and
> he might have some insights.  A few more questions for Or:
> I was trying to use ipoib bonding with OFED 1.2 rc2 and a 2.6.9 kernel,
> but was not able to get it to work so far.  I saw your Sonoma bonding
> slides, and you mention kernel bonding driver changes were needed.
> 2. Is there a minimum kernel version, with the kernel bonding driver
> changes, that is required to use bonding with OFED ipoib?

Just to have a baseline here: to get bonding to work with IPoIB, you 
should use the bonding driver provided with OFED 1.2. This driver is the 
upstream one (from 2.6.20), patched to support IPoIB and backported 
to RH5, SLES10 and RH4 U3/4/5; other kernels are not supported.

If you were using the OFED bonding on a system that matches the support 
matrix, it should work. If you do have problems under this config, please 
either open a bug at the OFED bugzilla
@ bugs.openfabrics.org assigned to monis at voltaire.com (Moni Shoua) or 
first send a report/question to Moni and CC ewg at lists.openfabrics.org

Please note that between RC2 and RC4 (to be released today) some 
bugs were fixed; you can search the bugzilla to see which.

> 3. The bonding driver uses the HWADDR from the underlying ipoib
> devices, how does it obtain the HWADDR?  Does it use the full 20 bytes,
> or some subset?

When enslaving IPoIB devices, the bonding driver uses the full hw 
address of the active slave; it simply looks at the dev_addr field of 
the slave's struct net_device (see include/linux/netdevice.h).
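
To make the "full 20 bytes" concrete, here is a minimal userspace sketch 
(not code from the bonding driver) that splits an IPoIB hardware address 
into its parts, assuming the usual IPoIB layout of 1 flags byte, a 3-byte 
QPN and a 16-byte GID. The names ipoib_hwaddr and ipoib_parse_hwaddr are 
made up for illustration, and the input is assumed to be the 
colon-separated hex text that e.g. /sys/class/net/ib0/address would show.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IPOIB_HWADDR_LEN 20	/* 1 byte flags + 3 bytes QPN + 16 bytes GID */

/* Hypothetical helper type: the three parts of an IPoIB hardware address. */
struct ipoib_hwaddr {
	uint8_t  flags;
	uint32_t qpn;		/* 24-bit queue pair number */
	uint8_t  gid[16];	/* port GID */
};

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse "xx:xx:...:xx" (exactly 20 bytes, optional trailing newline).
 * Returns 0 on success, -1 on malformed or wrong-length input.
 */
static int ipoib_parse_hwaddr(const char *text, struct ipoib_hwaddr *out)
{
	uint8_t raw[IPOIB_HWADDR_LEN];
	int i;

	assert(text != NULL && out != NULL);

	for (i = 0; i < IPOIB_HWADDR_LEN; i++) {
		int hi = hex_val(text[0]);
		int lo = (hi < 0) ? -1 : hex_val(text[1]);

		if (hi < 0 || lo < 0)
			return -1;
		raw[i] = (uint8_t)((hi << 4) | lo);
		text += 2;

		/* every byte except the last is followed by ':' */
		if (i < IPOIB_HWADDR_LEN - 1) {
			if (*text != ':')
				return -1;
			text++;
		}
	}
	if (*text != '\0' && *text != '\n')
		return -1;

	out->flags = raw[0];
	out->qpn = ((uint32_t)raw[1] << 16) | ((uint32_t)raw[2] << 8) | raw[3];
	memcpy(out->gid, raw + 4, sizeof(out->gid));
	return 0;
}
```

A 6-byte Ethernet-style address is rejected by this parser, which is the 
point: anything that assumes a 6-byte MAC cannot carry an IPoIB address.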

> 4. What use_carrier options for link status detection does OFED ipoib 
> support,
> MII, ETHTOOL or netif_carrier_ok?

The MII/ethtool etc. local link detection methods of the bonding driver 
are somewhat deprecated, since nowadays almost every network device 
supports the netif_carrier_ok call. The --default-- of the upstream 
bonding driver (e.g. the one we use in OFED and the 2.6.21 one listed below) 
is to set the use_carrier module parameter to 1, that is, MII is not used anymore.

> author:         Thomas Davis, tadavis at lbl.gov and many others
> description:    Ethernet Channel Bonding Driver, v3.1.2
> version:        3.1.2
> parm:           use_carrier:Use netif_carrier_ok (vs MII ioctls) in miimon; 0 for off, 1 for on (default) (int)
> parm:           miimon:Link check interval in milliseconds (int)

> If you have any good examples of bonding configuration settings that work
> with OFED, I'd appreciate that also.

The bonding RPM provided with OFED is made of a driver, script and some 
help text containing usage examples, please take a look there and let me 
know if you have further questions.

> $ rpm -ql ib-bonding-0.9.0-2.6.9_42.ELsmp
> /lib/modules/2.6.9-42.ELsmp/updates/kernel/drivers/net/bonding/bonding.ko
> /usr/bin/ib-bond
> /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt

The OFED service (/etc/init.d/openibd) was enhanced to allow for 
--persistent-- bonding configuration; please see the bonding section in
docs/ipoib_release_notes.txt to see how to do it.

Or.


From mst at dev.mellanox.co.il  Tue May 29 02:12:46 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 12:12:46 +0300
Subject: [ofa-general] [PATCH] suppress RLIMIT warning for root user (was Re:
	there is a warning message in every use of the library libibverbs)
In-Reply-To: <adad50k285u.fsf@cisco.com>
References: <465AF3D3.10205@dev.mellanox.co.il> <adad50k285u.fsf@cisco.com>
Message-ID: <20070529091246.GF8159@mellanox.co.il>

root can register as much memory as he likes, so the rlimit
value shouldn't matter in this case. Do not print a warning
about RLIMIT being too low in this case.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: there is a warning message in every use of the library libibverbs
> 
>  > In every test/application that uses the libibverbs (i think when the
>  > libibverbs init function is being called)
>  > i see the following warning:
>  > 
>  > <-start->
>  > libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>  >    This will severely limit memory registrations.
>  > <-end->
>  > 
>  > Why did you add this warning message?
> 
> To avoid the FAQ of "memory registration / CQ creation fails and I
> don't know why".

OK, but kernel side actually ignores the rlimit value for the root user,
so let's not print a warning in this case?

diff --git a/src/init.c b/src/init.c
index a17ae16..de485cb 100644
--- a/src/init.c
+++ b/src/init.c
@@ -417,10 +417,15 @@ static void check_memlock_limit(void)
 		return;
 	}
 
-	if (rlim.rlim_cur <= 32768)
-		fprintf(stderr, PFX "Warning: RLIMIT_MEMLOCK is %lu bytes.\n"
-			"    This will severely limit memory registrations.\n",
-			rlim.rlim_cur);
+	if (rlim.rlim_cur > 32768)
+		return;
+
+	if (!getuid())
+		return;
+
+	fprintf(stderr, PFX "Warning: RLIMIT_MEMLOCK is %lu bytes.\n"
+		"    This will severely limit memory registrations.\n",
+		rlim.rlim_cur);
 }
 
 static void add_device(struct ibv_device *dev,
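
For anyone who wants to try the logic outside libibverbs, here is a 
standalone sketch of the same decision. It is not the library code: 
should_warn_memlock and report_memlock_limit are names made up for this 
example, the PFX prefix is dropped, and the 32768 threshold and the uid 
test are simply copied from the patch above.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Threshold used by the warning in the patch above. */
#define MEMLOCK_WARN_BYTES 32768

/* Warn only for a non-root caller whose soft limit is at or below the threshold. */
static int should_warn_memlock(rlim_t soft_limit, uid_t uid)
{
	if (soft_limit > MEMLOCK_WARN_BYTES)
		return 0;
	if (uid == 0)		/* same test as !getuid() in the patch */
		return 0;
	return 1;
}

/*
 * Query RLIMIT_MEMLOCK and print the warning to `out` if needed.
 * Returns 1 if a warning was printed, 0 if not, -1 if getrlimit() failed.
 */
static int report_memlock_limit(FILE *out)
{
	struct rlimit rlim;

	assert(out != NULL);

	if (getrlimit(RLIMIT_MEMLOCK, &rlim) != 0)
		return -1;
	if (!should_warn_memlock(rlim.rlim_cur, getuid()))
		return 0;

	/* rlim_t has no portable printf format, so cast explicitly */
	fprintf(out, "Warning: RLIMIT_MEMLOCK is %lu bytes.\n"
		"    This will severely limit memory registrations.\n",
		(unsigned long) rlim.rlim_cur);
	return 1;
}
```

Calling report_memlock_limit(stderr) once at startup mirrors what 
check_memlock_limit() does; keeping the decision in a separate function 
just makes the root and non-root cases easy to check by hand.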

-- 
MST


From mst at dev.mellanox.co.il  Tue May 29 02:15:43 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 12:15:43 +0300
Subject: [ofa-general] libibverbs autogen failures in ubuntu dapper
Message-ID: <20070529091543.GG8159@mellanox.co.il>

Attempt to run autogen.sh on an ubuntu dapper laptop gave me this:

aclocal -I config
+ libtoolize --force --copy
Putting files in AC_CONFIG_AUX_DIR, `config'.
+ autoheader
+ automake --foreign --add-missing --copy
automake: Makefile.am: `src/libibverbs.la' is not a standard libtool library name
automake: Makefile.am: not supported: source file `src/cmd.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/compat-1_0.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/device.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/init.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/marshall.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/memory.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/sysfs.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/verbs.c' is in subdirectory
Bareword found where operator expected at (eval 336) line 1, near
"s/\@LTLIBRARY\@/src/libibverbs"

.........

and it fails to produce a working build:
./configure
...
make
mst at mst-lt:~/scm/libibverbs$ make
Makefile:343: warning: overriding commands for target `@PROGRAM@'
Makefile:339: warning: ignoring old commands for target `@PROGRAM@'
Makefile:347: warning: overriding commands for target `@PROGRAM@'
Makefile:343: warning: ignoring old commands for target `@PROGRAM@'
Makefile:351: warning: overriding commands for target `@PROGRAM@'
Makefile:347: warning: ignoring old commands for target `@PROGRAM@'
Makefile:355: warning: overriding commands for target `@PROGRAM@'
Makefile:351: warning: ignoring old commands for target `@PROGRAM@'
Makefile:359: warning: overriding commands for target `@PROGRAM@'
Makefile:355: warning: ignoring old commands for target `@PROGRAM@'
Makefile:363: warning: overriding commands for target `@PROGRAM@'
Makefile:359: warning: ignoring old commands for target `@PROGRAM@'
make: *** No rule to make target `src/libibverbs.la', needed by `all-am'.  Stop.

I think this worked at some point - any idea what's wrong now?

-- 
MST


From vlad at lists.openfabrics.org  Tue May 29 02:44:13 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue, 29 May 2007 02:44:13 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070529-0200 daily build status
Message-ID: <20070529094414.392E3E6089D@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:


From Koen.SEGERS at VRT.BE  Tue May 29 03:03:27 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 29 May 2007 12:03:27 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA3038E1834@xmb-sjc-216.amer.cisco.com>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C03157D71@OCBEXS01001.rto.be>

Hi,

Saturday we did a different stress test.
This is what we see in /var/log/messages:

May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11

There were errors from that time on. Can someone explain to me what this
message means?

Koen

-----Original Message-----
From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
Sent: Wednesday, 23 May 2007 17:41
To: SEGERS Koen; Hal Rosenstock
CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

Try 20 seconds; I'm curious whether you are barely crossing the 10-second
threshold.

Scott 

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> Sent: Wednesday, May 23, 2007 8:39 AM
> To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> Cc: Clive Hall (clivhall); 
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> What value would you recommend then?
> 
> Koen
> 
> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> Sent: Wednesday, 23 May 2007 17:38
> To: SEGERS Koen; Hal Rosenstock
> CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> The boot time of the host doesn't matter for this timeout.  While the
> host is booting, the IB link is down anyway.
> 
> Scott 
> 
> > -----Original Message-----
> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > Sent: Wednesday, May 23, 2007 8:20 AM
> > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > Cc: Clive Hall (clivhall); 
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > After a whole day of stresstesting with the MAD renicing 
> turned on, we
> > got the error once. So I think I should raise the timeout on 
> > the switch
> > also.
> > 
> > It takes about 2 minutes to boot the system. Do you agree 
> > that this is a
> > good value for the timeout?
> > 
> > Scott,
> > Can you explain me the problem of the memlock?
> > 
> > I saw that the SLES10 bug is only an issue in MVAPICH. 
> Since we didn't
> > install this, the bug is not related to us. This is 
> correct, isn't it?
> > 
> > Greetz
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com] 
> > Sent: Wednesday, 23 May 2007 16:12
> > To: Scott Weitzenkamp (sweitzen)
> > CC: SEGERS Koen; Clive Hall (clivhall);
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > No C code changes, just a few config file changes 
> (RENICE_IB_MAD=yes
> > in
> > > openib.conf,
> > 
> > Does the host really not respond to MAD requests for over 10 
> > seconds in
> > some cases ?
> > 
> > -- Hal
> > 
> > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > SLES10 for bug 267, etc.).
> > > 
> > > Scott Weitzenkamp
> > > SQA and Release Manager
> > > Server Virtualization Business Unit
> > > Cisco Systems
> > >  
> > > 
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > general at lists.openfabrics.org; 
> > general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > This far, all tests seem to work.
> > > > 
> > > > Thanks for the help!
> > > > 
> > > > Scott,
> > > > Are there more bugfixes that cisco does in its rpms?
> > > > 
> > > > Greetz
> > > > 
> > > > Koen
> > > > 
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > Sent: Wednesday, 23 May 2007 0:39
> > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > It's not so much pinging every 10 seconds as expecting a 
> > > > response within
> > > > 10 seconds (Clive, correct me if I'm wrong).
> > > > 
> > > > You only need to do 1) or 2), not both.  Cisco configures 1) 
> > > > in the OFED
> > > > binary RPMs we release at
> > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I 
> > > > prefer to have
> > > > the host be more responsive.
> > > > 
> > > > 
> > > > Scott Weitzenkamp
> > > > SQA and Release Manager
> > > > Server Virtualization Business Unit
> > > > Cisco Systems
> > > >  
> > > > 
> > > > > -----Original Message-----
> > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > > general at lists.openfabrics.org;
> > general-bounces at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > > If I understand it right, the switch is actually polling 
> > > > > (=pinging) the
> > > > > interfaces every 10s. This means that when the interface is
> > handling
> > > > > other traffic, the poll can fail and the port could be 
> > > > > considered out of
> > > > > service. My question is then: "How can the timeout be reached
> > while
> > > > > packets are being sent/received to/from the interface?"
> > > > > 
> > > > > Anyway, what timeout-value would you recommend for 
> us? And why?
> > > > > 
> > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > 1) change the MAD niceness of the servers
> > > > > 2) change the timeout on the switches
> > > > > 
> > > > > Are these changes sufficient for the HCA's to keep 
> > their ports in
> > > > > PORT_ACTIVE state?
> > > > > 
> > > > > Regards,
> > > > > 
> > > > > Koen
> > > > > 
> > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp 
> > > > (sweitzen) wrote:
> > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > >  
> > > > > > SFS-7000D(config)# ib sm subnet-prefix 
> fe:80:00:00:00:00:00:00
> > > > > > node-timeout <value>
> > > > > > 
> > > > > > The default is 10 seconds, it can be configured up to 
> > > > 2000 seconds.
> > > > > > If a HCA is completely unresponsive for longer than the 
> > > > node-timeout
> > > > > > value, then we consider that HCA out of service.
> > > > > >  
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems
> > > > > >  
> > > > > > 
> > > > > >         
> > > > > >         
> > > > > ______________________________________________________________
> > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > >         To: koen.segers at VRT.BE
> > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > >         general-bounces at lists.openfabrics.org; Scott 
> > Weitzenkamp
> > > > > >         (sweitzen)
> > > > > >         Subject: RE: [ofa-general] GPFS node loses 
> > IB-connection
> > > > > >         
> > > > > >         
> > > > > >         
> > > > > >         Koen,
> > > > > >         
> > > > > > >         So it is most likely you hit the same bug as 229 (Scott
> > > > > > >         pointed out earlier). The same workaround might work for you
> > > > > > >         by renicing ib_mad as Scott suggested.
> > > > > > >         
> > > > > > >         I think this should be a SM query timeout tunable value in
> > > > > > >         Cisco SM. Am I right, Scott?
> > > > > >         
> > > > > >         Thanks
> > > > > >         Shirley Ma
> > > > > >         
> > > > > >         
> > > > > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > > > > >         Date: 05/22/07 11:14 AM
> > > > > > >         Please respond to: koen.segers at VRT.BE
> > > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > >         cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > > >             general at lists.openfabrics.org,
> > > > > > >             general-bounces at lists.openfabrics.org
> > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >         
> > > > > >         
> > > > > >         Hi,
> > > > > >         
> > > > > >         It is the Cisco SM. 
> > > > > >         
> > > > > >         SFS-7000P> show version
> > > > > >         
> > > > > >         
> > > > > >         
> > > > > ==============================================================
> > > > > ==================
> > > > > >                                   System Version Information
> > > > > >         
> > > > > ==============================================================
> > > > > ==================
> > > > > >                   system-version : SFS-7000P TopspinOS 
> > > > 2.9.0 releng
> > > > > >         #147
> > > > > >         10/25/2006 02:01:32
> > > > > >                          contact : tac at cisco.com
> > > > > >                             name : SFS-7000P
> > > > > >                         location : 170 West Tasman Drive, 
> > > > > San Jose, CA
> > > > > >         95134
> > > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > > >                      last-change : none
> > > > > >                 last-config-save : none
> > > > > >                           action : none
> > > > > >                           result : none
> > > > > >                        oper-mode : normal
> > > > > >         
> > > > > > >         There is also a command that gives the SM version, but I can't
> > > > > > >         find it right now. 
> > > > > >         
> > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > >         > Hello Koen,
> > > > > >         > 
> > > > > > >         > From the switch log, it looks like an SM issue to me. The node was
> > > > > > >         > kicked out of the membership. Which SM are you using in your
> > > > > > >         > fabric? 
> > > > > >         > 
> > > > > >         > Thanks
> > > > > >         > Shirley Ma
> > > > > >         > 
> > > > > >         *** Disclaimer ***
> > > > > >         
> > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > >         
> > > > > >         nv van publiek recht
> > > > > >         BTW BE 0244.142.664
> > > > > >         RPR Brussel
> > > > > >         http://www.vrt.be/disclaimer
> > > > > >         
> > > > > >         
> > > > > >         
> > > > > >         
> > > > > >         

From amip at dev.mellanox.co.il  Tue May 29 04:35:07 2007
From: amip at dev.mellanox.co.il (Ami Perlmutter)
Date: Tue, 29 May 2007 14:35:07 +0300
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D71@OCBEXS01001.rto.be>
References: <D63C0BE2D613C543B6F3305502E9784C03157D71@OCBEXS01001.rto.be>
Message-ID: <1180438537.12048.5.camel@localhost>

This means you are getting a message your SDP does not recognize.
Message 11 is a resize request, which was added to SDP a few days ago.
Could it be that you are running two different versions of OFED?
Anyway, this doesn't pose any problem, so you can ignore it.
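
To illustrate why a version mismatch produces that log line and nothing 
worse, here is a toy model of the receive path. It is not the SDP source: 
the sketch_* names are made up for this example, the value 0xFF for data 
messages is only a placeholder, and the only fact taken from above is that 
message 11 is the resize request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_MID_RESIZE_REQ	11	/* resize request, as described above */
#define SKETCH_MID_DATA		0xFF	/* placeholder for "a message every version knows" */

enum sketch_result { SKETCH_HANDLED, SKETCH_IGNORED_UNKNOWN };

/* Hypothetical receiver state: does this build know the resize request? */
struct sketch_peer {
	int knows_resize;	/* 1 for the newer version, 0 for the older one */
};

/*
 * Handle one incoming message ID. An unrecognized ID is reported in `msg`
 * and then skipped; the connection itself is left alone.
 */
static enum sketch_result sketch_receive(const struct sketch_peer *peer,
					 uint8_t mid, char *msg, size_t msg_len)
{
	assert(peer != NULL && msg != NULL && msg_len > 0);

	msg[0] = '\0';
	if (mid == SKETCH_MID_DATA)
		return SKETCH_HANDLED;
	if (mid == SKETCH_MID_RESIZE_REQ && peer->knows_resize)
		return SKETCH_HANDLED;

	/* same shape as the line seen in /var/log/messages */
	snprintf(msg, msg_len, "SDP: FIXME MID %d", mid);
	return SKETCH_IGNORED_UNKNOWN;
}
```

With knows_resize set to 0 on one side and 1 on the other, only the older 
side ever produces the message, which matches the mixed-version guess.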

On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> Hi,
> 
> Saturday we did a different stresstest.
> This is what we see in the /var/log/messages:
> 
> May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> 
> There were errors from that time on. Can someone explain me what this
> message does?
> 
> Koen
> 
> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> Sent: Wednesday, 23 May 2007 17:41
> To: SEGERS Koen; Hal Rosenstock
> CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> Try 20 seconds; I'm curious whether you are barely crossing the 10-second
> threshold.
> 
> Scott 
> 
> > -----Original Message-----
> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > Sent: Wednesday, May 23, 2007 8:39 AM
> > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > Cc: Clive Hall (clivhall); 
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > What value would you recommend then?
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > Sent: Wednesday, 23 May 2007 17:38
> > To: SEGERS Koen; Hal Rosenstock
> > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > The boot time of the host doesn't matter for this timeout.  While the
> > host is booting, the IB link is down anyway.
> > 
> > Scott 
> > 
> > > -----Original Message-----
> > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > Cc: Clive Hall (clivhall); 
> > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > After a whole day of stresstesting with the MAD renicing 
> > turned on, we
> > > got the error once. So I think I should raise the timeout on 
> > > the switch
> > > also.
> > > 
> > > It takes about 2 minutes to boot the system. Do you agree 
> > > that this is a
> > > good value for the timeout?
> > > 
> > > Scott,
> > > Can you explain me the problem of the memlock?
> > > 
> > > I saw that the SLES10 bug is only an issue in MVAPICH. 
> > Since we didn't
> > > install this, the bug is not related to us. This is 
> > correct, isn't it?
> > > 
> > > Greetz
> > > 
> > > Koen
> > > 
> > > -----Original Message-----
> > > From: Hal Rosenstock [mailto:halr at voltaire.com] 
> > > Sent: Wednesday, 23 May 2007 16:12
> > > To: Scott Weitzenkamp (sweitzen)
> > > CC: SEGERS Koen; Clive Hall (clivhall);
> > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes
> > > > in openib.conf,
> > > 
> > > Does the host really not respond to MAD requests for over 10 seconds
> > > in some cases?
> > > 
> > > -- Hal
> > > 
> > > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > SLES10 for bug 267, etc.).
> > > > 
> > > > Scott Weitzenkamp
> > > > SQA and Release Manager
> > > > Server Virtualization Business Unit
> > > > Cisco Systems
> > > >  
> > > > 
> > > > > -----Original Message-----
> > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > general-bounces at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > So far, all tests seem to work.
> > > > > 
> > > > > Thanks for the help!
> > > > > 
> > > > > Scott,
> > > > > Are there any other bug fixes that Cisco makes in its RPMs?
> > > > > 
> > > > > Greetz
> > > > > 
> > > > > Koen
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > > Sent: Wednesday, May 23, 2007 0:39
> > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > general-bounces at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > It's not so much pinging every 10 seconds as expecting a response
> > > > > within 10 seconds (Clive, correct me if I'm wrong).
> > > > > 
> > > > > You only need to do 1) or 2), not both.  Cisco configures 1) in the
> > > > > OFED binary RPMs we release at
> > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to
> > > > > have the host be more responsive.
> > > > > 
> > > > > 
> > > > > Scott Weitzenkamp
> > > > > SQA and Release Manager
> > > > > Server Virtualization Business Unit
> > > > > Cisco Systems
> > > > >  
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > 
> > > > > > If I understand it right, the switch is actually polling
> > > > > > (=pinging) the interfaces every 10s. This means that when the
> > > > > > interface is handling other traffic, the poll can fail and the
> > > > > > port could be considered out of service. My question is then:
> > > > > > "How can the timeout be reached while packets are being
> > > > > > sent/received to/from the interface?"
> > > > > > 
> > > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > > > 
> > > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > > 1) change the MAD niceness of the servers
> > > > > > 2) change the timeout on the switches
> > > > > > 
> > > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > > PORT_ACTIVE state?
> > > > > > 
> > > > > > Regards,
> > > > > > 
> > > > > > Koen
> > > > > > 
> > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > > >  
> > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > > > > 
> > > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > > > If an HCA is completely unresponsive for longer than the
> > > > > > > node-timeout value, then we consider that HCA out of service.
> > > > > > >  
> > > > > > > Scott Weitzenkamp
> > > > > > > SQA and Release Manager
> > > > > > > Server Virtualization Business Unit
> > > > > > > Cisco Systems
> > > > > > >  
> > > > > > > 
> > > > > > >         
> > > > > > >         
> > > > > > >         ______________________________________________________________
> > > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > >         To: koen.segers at VRT.BE
> > > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > >         general-bounces at lists.openfabrics.org;
> > > > > > >         Scott Weitzenkamp (sweitzen)
> > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > >         
> > > > > > >         
> > > > > > >         
> > > > > > >         Koen,
> > > > > > >         
> > > > > > >         So it is most likely that you hit the same bug as 229
> > > > > > >         (Scott pointed it out earlier). The same workaround,
> > > > > > >         renicing ib_mad as Scott suggested, might work for you.
> > > > > > >         
> > > > > > >         I think this should be an SM query timeout value that is
> > > > > > >         tunable in the Cisco SM. Am I right, Scott?
> > > > > > >         
> > > > > > >         Thanks
> > > > > > >         Shirley Ma
> > > > > > >         
> > > > > > >         
> > > > > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > > > > >         Date: 05/22/07 11:14 AM
> > > > > > >         Reply-To: koen.segers at VRT.BE
> > > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > >         Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > > >         general at lists.openfabrics.org,
> > > > > > >         general-bounces at lists.openfabrics.org
> > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > >         
> > > > > > >         
> > > > > > >         
> > > > > > >         Hi,
> > > > > > >         
> > > > > > >         It is the Cisco SM. 
> > > > > > >         
> > > > > > >         SFS-7000P> show version
> > > > > > >         
> > > > > > >         
> > > > > > >         
> > > > > > >         ================================================================================
> > > > > > >                                   System Version Information
> > > > > > >         ================================================================================
> > > > > > >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > >                          contact : tac at cisco.com
> > > > > > >                             name : SFS-7000P
> > > > > > >                         location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > > > >                      last-change : none
> > > > > > >                 last-config-save : none
> > > > > > >                           action : none
> > > > > > >                           result : none
> > > > > > >                        oper-mode : normal
> > > > > > >         
> > > > > > >         There is also a command that gives the SM version, but I
> > > > > > >         can't find it right now. 
> > > > > > >         
> > > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > >         > Hello Koen,
> > > > > > >         > 
> > > > > > >         > From the switch log, it looks like an SM issue to me.
> > > > > > >         > The node was kicked out of the membership. Which SM are
> > > > > > >         > you using in your fabric? 
> > > > > > >         > 
> > > > > > >         > Thanks
> > > > > > >         > Shirley Ma
> > > > > > >         > 
> > > > > > >         *** Disclaimer ***
> > > > > > >         
> > > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > > >         
> > > > > > >         nv van publiek recht
> > > > > > >         BTW BE 0244.142.664
> > > > > > >         RPR Brussel
> > > > > > >         http://www.vrt.be/disclaimer
> > > > > > >         
> > > > > > >         
> > > > > > >         
> > > > > > >         
> > > > > > >         
> > > > _______________________________________________
> > > > general mailing list
> > > > general at lists.openfabrics.org
> > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > 
> > > > To unsubscribe, please visit
> > > > http://openib.org/mailman/listinfo/openib-general
> > > 


From Koen.SEGERS at VRT.BE  Tue May 29 04:37:18 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 29 May 2007 13:37:18 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1180438537.12048.5.camel@localhost>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C03157D72@OCBEXS01001.rto.be>

We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
this message was added only a few days ago.

How can you be so sure that this doesn't pose any problems?

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
Sent: Tuesday, May 29, 2007 13:35
To: SEGERS Koen
Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

This means you are receiving a message that your SDP does not recognize.
Message 11 is a resize request, which was added to SDP a few days ago.
Could it be that you are running two different versions of OFED?
Anyway, this doesn't pose any problem, so you can ignore it.

On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> Hi,
> 
> On Saturday we ran a different stress test.
> This is what we see in /var/log/messages:
> 
> May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> 
> There were errors from that time on. Can someone explain to me what this
> message means?
> 
> Koen
> 
> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> Sent: Wednesday, May 23, 2007 17:41
> To: SEGERS Koen; Hal Rosenstock
> Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> Try 20 seconds. I'm curious whether you are barely crossing the
> 10-second threshold.
> 
> Scott 
> 


From amip at dev.mellanox.co.il  Tue May 29 05:02:42 2007
From: amip at dev.mellanox.co.il (Ami Perlmutter)
Date: Tue, 29 May 2007 15:02:42 +0300
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D72@OCBEXS01001.rto.be>
References: <D63C0BE2D613C543B6F3305502E9784C03157D72@OCBEXS01001.rto.be>
Message-ID: <1180440193.12048.9.camel@localhost>

If this is an actual resize request, then there is no problem when it is
dropped.
But since you are running RC1, no resize requests should be sent, so this
means there is a problem, since data could be dropped. Do you notice any
lost data?
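
As a rough illustration of the situation described above, the sketch below
shows how a receiver might dispatch on the SDP message ID (MID) and fall
through to a log line for IDs it does not know. The function names, the set
of known IDs and the value used for ordinary data messages are assumptions
made for illustration; this is not the OFED SDP code. The only facts taken
from this thread are that MID 11 is the resize request and that the kernel
logged "SDP: FIXME MID 11".

```python
# Hypothetical sketch: not the OFED SDP implementation.
MID_DATA = 0xFF          # assumed ID for ordinary data messages
MID_RESIZE_REQUEST = 11  # from this thread: MID 11 is the resize request


def make_receiver(known_mids):
    """Return a (receive, log) pair that accepts only the IDs in known_mids."""
    log = []

    def receive(mid, payload=b""):
        if mid in known_mids:
            return ("handled", mid, len(payload))
        # A stack without a handler for this ID logs the message and drops it.
        log.append("SDP: FIXME MID %d" % mid)
        return ("dropped", mid, len(payload))

    return receive, log


# An older stack that predates the resize request, and a newer one that has it.
old_receive, old_log = make_receiver({MID_DATA})
new_receive, new_log = make_receiver({MID_DATA, MID_RESIZE_REQUEST})

print(old_receive(MID_RESIZE_REQUEST))  # dropped by the older stack
print(new_receive(MID_RESIZE_REQUEST))  # handled by the newer stack
print(old_log)
```

Dropping a genuine resize request would be harmless in this model. If every
node runs the same RC1 build and none of them sends resize requests, a logged
MID 11 may instead mean that some other message was misread, which is why
lost data is the thing to check for.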

On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> this message was added only a few days ago.
> 
> How can you be so sure that this doesn't pose any problems?
> 
> Koen
> 
> -----Oorspronkelijk bericht-----
> Van: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> Verzonden: dinsdag 29 mei 2007 13:35
> Aan: SEGERS Koen
> CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Onderwerp: RE: [ofa-general] GPFS node loses IB-connection
> 
> this means you are getting a message your SDP does not recognize.
> message 11 is resize request which was added to sdp a few days ago.
> can it be that you are running 2 different versions of OFED?
> anywas, this doesn't pose any problem so you can ignore it.
> 
> On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > Hi,
> > 
> > Saturday we did a different stresstest.
> > This is what we see in the /var/log/messages:
> > 
> > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > 
> > There were errors from that time on. Can someone explain me what this
> > message does?
> > 
> > Koen
> > 
> > -----Oorspronkelijk bericht-----
> > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > Verzonden: woensdag 23 mei 2007 17:41
> > Aan: SEGERS Koen; Hal Rosenstock
> > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > general at lists.openfabrics.org
> > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > Try 20 seconds, I'm curious if if you are barely crossing the
> 10-second
> > threshold.
> > 
> > Scott 
> > 
> > > -----Original Message-----
> > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > Cc: Clive Hall (clivhall); 
> > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > What value would you recommend then?
> > > 
> > > Koen
> > > 
> > > -----Oorspronkelijk bericht-----
> > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > Verzonden: woensdag 23 mei 2007 17:38
> > > Aan: SEGERS Koen; Hal Rosenstock
> > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > The boot time of the host doesn't matter for this timeout.  While
> the
> > > host is booting, the IB link is down anyway.
> > > 
> > > Scott 
> > > 
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > Cc: Clive Hall (clivhall); 
> > > > general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > After a whole day of stresstesting with the MAD renicing 
> > > turned on, we
> > > > got the error once. So I think I should raise the timeout on 
> > > > the switch
> > > > also.
> > > > 
> > > > It takes about 2 minutes to boot the system. Do you agree 
> > > > that this is a
> > > > good value for the timeout?
> > > > 
> > > > Scott,
> > > > Can you explain me the problem of the memlock?
> > > > 
> > > > I saw that the SLES10 bug is only an issue in MVAPICH. 
> > > Since we didn't
> > > > install this, the bug is not related to us. This is 
> > > correct, isn't it?
> > > > 
> > > > Greetz
> > > > 
> > > > Koen
> > > > 
> > > > -----Oorspronkelijk bericht-----
> > > > Van: Hal Rosenstock [mailto:halr at voltaire.com] 
> > > > Verzonden: woensdag 23 mei 2007 16:12
> > > > Aan: Scott "Weitzenkamp (sweitzen)
> > > > CC: SEGERS Koen; Clive Hall (clivhall);
> > > > general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > No C code changes, just a few config file changes 
> > > (RENICE_IB_MAD=yes
> > > > in
> > > > > openib.conf,
> > > > 
> > > > Does the host really not respond to MAD requests for over 10 
> > > > seconds in
> > > > some cases ?
> > > > 
> > > > -- Hal
> > > > 
> > > > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > SLES10 for bug 267, etc.).
> > > > > 
> > > > > Scott Weitzenkamp
> > > > > SQA and Release Manager
> > > > > Server Virtualization Business Unit
> > > > > Cisco Systems
> > > > >  
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > 
> > > > > > Thus far, all tests seem to work.
> > > > > > 
> > > > > > Thanks for the help!
> > > > > > 
> > > > > > Scott,
> > > > > > Are there more bug fixes that Cisco applies in its RPMs?
> > > > > > 
> > > > > > Greetz
> > > > > > 
> > > > > > Koen
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > > > Sent: Wednesday, May 23, 2007 0:39
> > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > 
> > > > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > > > 10 seconds (Clive, correct me if I'm wrong).
> > > > > > 
> > > > > > You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
> > > > > > binary RPMs we release at
> > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
> > > > > > the host be more responsive.
> > > > > > 
> > > > > > 
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems
> > > > > >  
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > 
> > > > > > > If I understand it right, the switch is actually polling (= pinging) the
> > > > > > > interfaces every 10 s. This means that when the interface is handling
> > > > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > > > service. My question is then: "How can the timeout be reached while
> > > > > > > packets are being sent/received to/from the interface?"
> > > > > > > 
> > > > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > > > > 
> > > > > > > To recapitulate, these are the actions I'll take tomorrow:
> > > > > > > 1) change the MAD niceness of the servers
> > > > > > > 2) change the timeout on the switches
> > > > > > > 
> > > > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > > > PORT_ACTIVE state?
> > > > > > > 
> > > > > > > Regards,
> > > > > > > 
> > > > > > > Koen
> > > > > > > 
> > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > > > >  
> > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00
> > > > > > > > node-timeout <value>
> > > > > > > > 
> > > > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > > > > If an HCA is completely unresponsive for longer than the node-timeout
> > > > > > > > value, then we consider that HCA out of service.
> > > > > > > >  
> > > > > > > > Scott Weitzenkamp
> > > > > > > > SQA and Release Manager
> > > > > > > > Server Virtualization Business Unit
> > > > > > > > Cisco Systems
> > > > > > > >  
> > > > > > > > 
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         ______________________________________________________________
> > > > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > >         To: koen.segers at VRT.BE
> > > > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp
> > > > > > > >         (sweitzen)
> > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         Koen,
> > > > > > > >         
> > > > > > > >         So it is most likely you hit the same bug as 229 (Scott
> > > > > > > >         pointed out earlier). The same workaround might work for you
> > > > > > > >         by renicing ib_mad as Scott suggested.
> > > > > > > >         
> > > > > > > >         I think this should be a SM query timeout tunable value in
> > > > > > > >         Cisco SM. Am I right, Scott?
> > > > > > > >         
> > > > > > > >         Thanks
> > > > > > > >         Shirley Ma
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > > > > > >         Sent: 05/22/07 11:14 AM
> > > > > > > >         Please respond to: koen.segers at VRT.BE
> > > > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > >         Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > > > >         general at lists.openfabrics.org,
> > > > > > > >         general-bounces at lists.openfabrics.org
> > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         Hi,
> > > > > > > >         
> > > > > > > >         It is the Cisco SM. 
> > > > > > > >         
> > > > > > > >         SFS-7000P> show version
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         ================================================================================
> > > > > > > >                                   System Version Information
> > > > > > > >         ================================================================================
> > > > > > > >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > >                          contact : tac at cisco.com
> > > > > > > >                             name : SFS-7000P
> > > > > > > >                         location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > > > > >                      last-change : none
> > > > > > > >                 last-config-save : none
> > > > > > > >                           action : none
> > > > > > > >                           result : none
> > > > > > > >                        oper-mode : normal
> > > > > > > >         
> > > > > > > >         There is also a command that gives the SM version, but I can't
> > > > > > > >         find it right now. 
> > > > > > > >         
> > > > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > >         > Hello Koen,
> > > > > > > >         > 
> > > > > > > >         > From the switch log, it looks like an SM issue to me. The node was
> > > > > > > >         > kicked out of the membership. Which SM are you using in your
> > > > > > > >         > fabric? 
> > > > > > > >         > 
> > > > > > > >         > Thanks
> > > > > > > >         > Shirley Ma
> > > > > > > >         > 
> > > > > > > >         *** Disclaimer ***
> > > > > > > >         
> > > > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > >         
> > > > > > > >         nv van publiek recht
> > > > > > > >         BTW BE 0244.142.664
> > > > > > > >         RPR Brussel
> > > > > > > >         http://www.vrt.be/disclaimer
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         
> > > > > _______________________________________________
> > > > > general mailing list
> > > > > general at lists.openfabrics.org
> > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > > 
> > > > > To unsubscribe, please visit
> > > > http://openib.org/mailman/listinfo/openib-general
> > > > 


From Koen.SEGERS at VRT.BE  Tue May 29 05:28:57 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 29 May 2007 14:28:57 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1180440193.12048.9.camel@localhost>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C03157D75@OCBEXS01001.rto.be>

One of the machines has 2 dropped packets:

gpfswhbe2n1:~ # ifconfig ib0
ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
          inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
          TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)

Can this be related?

Does anyone know how this is possible with SDP? I thought SDP used RC.
I'm also curious how GPFS reacts to this. Do you know where I can find
the timestamp of these dropped packets?

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
Sent: Tuesday, May 29, 2007 14:03
To: SEGERS Koen
Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

If this is an actual resize request, then there is no problem when it is
dropped. But since you are running RC1, no resize requests should be sent,
so this means there is a problem, since data could be dropped. Do you notice
lost data?

On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> this message was added only a few days ago.
> 
> How can you be so sure that this doesn't pose any problems?
> 
> Koen
> 
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> Sent: Tuesday, May 29, 2007 13:35
> To: SEGERS Koen
> Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> This means you are getting a message your SDP does not recognize.
> Message 11 is the resize request, which was added to SDP a few days ago.
> Could it be that you are running 2 different versions of OFED?
> Anyway, this doesn't pose any problem, so you can ignore it.
> 
> On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > Hi,
> > 
> > Saturday we did a different stresstest.
> > This is what we see in the /var/log/messages:
> > 
> > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > 
> > There were errors from that time on. Can someone explain to me what this
> > message means?
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > Sent: Wednesday, May 23, 2007 17:41
> > To: SEGERS Koen; Hal Rosenstock
> > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > Try 20 seconds; I'm curious whether you are barely crossing the 10-second
> > threshold.
> > 
> > Scott 
> > 
> > > -----Original Message-----
> > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > What value would you recommend then?
> > > 
> > > Koen
> > > 
> > > -----Original Message-----
> > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > Sent: Wednesday, May 23, 2007 17:38
> > > To: SEGERS Koen; Hal Rosenstock
> > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > The boot time of the host doesn't matter for this timeout.  While the
> > > host is booting, the IB link is down anyway.
> > > 
> > > Scott 
> > > 

*** Disclaimer ***

Vlaamse Radio- en Televisieomroep
Auguste Reyerslaan 52, 1043 Brussel

nv van publiek recht
BTW BE 0244.142.664
RPR Brussel
http://www.vrt.be/disclaimer
 

From FENKES at de.ibm.com  Tue May 29 05:44:32 2007
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Tue, 29 May 2007 14:44:32 +0200
Subject: [ofa-general] Re: [PATCH] IB/ehca: Refactor "maybe missed event"
	code
In-Reply-To: <adalkfe6rxk.fsf@cisco.com>
Message-ID: <OFA7B951E3.8AFF22C4-ONC12572EA.0045F4E8-C12572EA.00461997@de.ibm.com>

Roland Dreier <rdreier at cisco.com> wrote on 24.05.2007 19:40:39:

> This isn't fixing anything is it?  I think it's 2.6.23 material;
> correct me if I'm wrong.

Right, it doesn't fix things, just coalesces some code.

Joachim


From eli at mellanox.co.il  Tue May 29 06:00:46 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Tue, 29 May 2007 16:00:46 +0300
Subject: [ofa-general] [PATCH] libmlx4: fix qp capabilities
Message-ID: <1180443676.6825.8.camel@mtls03>

Fix calculation of max inline returned to the user. Without this fix,
the size of inline may increase every time create qp is called with
the previous values returned.

For example, here is a quote from the output of the test showing the
problem:

request: cap.max_send_sge = 1,   cap.max_inline_data = 0
got:     cap.max_send_sge = 5,   cap.max_inline_data = 76

request: cap.max_send_sge = 5,   cap.max_inline_data = 76
got:     cap.max_send_sge = 13,  cap.max_inline_data = 204

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: libmlx4/src/qp.c
===================================================================
--- libmlx4.orig/src/qp.c	2007-05-29 13:13:57.000000000 +0300
+++ libmlx4/src/qp.c	2007-05-29 14:41:33.000000000 +0300
@@ -396,12 +396,13 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd,
 		cap->max_send_sge = 1;
 
 	qp->rq.max_gs	 = cap->max_recv_sge;
-	qp->sq.max_gs	 = cap->max_send_sge;
 	max_sq_sge	 = align(cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg),
 				 sizeof (struct mlx4_wqe_data_seg)) / sizeof (struct mlx4_wqe_data_seg);
 	if (max_sq_sge < cap->max_send_sge)
 		max_sq_sge = cap->max_send_sge;
 
+	qp->sq.max_gs = max_sq_sge;
+
 	qp->sq.wrid = malloc(qp->sq.max * sizeof (uint64_t));
 	if (!qp->sq.wrid)
 		return -1;
@@ -419,6 +420,7 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd,
 		; /* nothing */
 
 	size = max_sq_sge * sizeof (struct mlx4_wqe_data_seg);
+	qp->max_inline_data  = size - sizeof (struct mlx4_wqe_inline_seg);
 	switch (type) {
 	case IBV_QPT_UD:
 		size += sizeof (struct mlx4_wqe_datagram_seg);
@@ -482,26 +484,7 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd,
 void mlx4_set_sq_sizes(struct mlx4_qp *qp, struct ibv_qp_cap *cap,
 		       enum ibv_qp_type type)
 {
-	int wqe_size;
-
-	wqe_size = 1 << qp->sq.wqe_shift;
-	switch (type) {
-	case IBV_QPT_UD:
-		wqe_size -= sizeof (struct mlx4_wqe_datagram_seg);
-		break;
-
-	case IBV_QPT_UC:
-	case IBV_QPT_RC:
-		wqe_size -= sizeof (struct mlx4_wqe_raddr_seg);
-		break;
-
-	default:
-		break;
-	}
-
-	qp->sq.max_gs        = wqe_size / sizeof (struct mlx4_wqe_data_seg);
 	cap->max_send_sge    = qp->sq.max_gs;
-	qp->max_inline_data  = wqe_size - sizeof (struct mlx4_wqe_inline_seg);
 	cap->max_inline_data = qp->max_inline_data;
 }
 

From jsquyres at cisco.com  Tue May 29 06:06:17 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 29 May 2007 09:06:17 -0400
Subject: [ofa-general] Updating OFED teleconferences
Message-ID: <4D7D1F84-28E1-4CF4-8340-1645E0FBF4B6@cisco.com>

You're about to get some Outlook invites for upcoming OFED  
teleconferences.

I'll send a summary after the invites are sent (I need to get the  
teleconference codes before I can send the summary).

-- 
Jeff Squyres
Cisco Systems


From jsquyres at cisco.com  Tue May 29 06:19:03 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 29 May 2007 09:19:03 -0400
Subject: [ofa-general] Upcoming OFED teleconferences
Message-ID: <338EEB24-0F0E-443D-94AA-61E512611F8B@cisco.com>

Short version:
--------------

Upcoming OFED teleconferences, all at noon US eastern / 9am US  
Pacific / 7pm Israel.

1. Wednesday, May 30 (*TOMORROW*), code 210262040
2. Monday, June 4, code 2102061
3. Monday, June 11, code 210213621
4. Monday, June 18, code 2102061
5. Monday, June 25, code 210213621

US/Canada:  +1.866.432.9903
India:      +91.80.4103.3979
Israel:     +972.9.892.7026
Others:     http://cisco.com/en/US/about/doing_business/conferencing/

Longer version:
---------------

You just got 2 Outlook invites for upcoming OFED teleconferences:

1. Due to the US holiday, there was no OFED teleconference  
yesterday.  Today is also not good for several OF members, so there  
will be an OFED teleconference tomorrow (Wednesday, 30 May 2007) at  
the normal time.

2. There will also be weekly teleconferences throughout June 2007.   
We already had teleconferences scheduled for the 4th and 18th; I just  
added teleconferences for the 11th and 25th.

Once OFED v1.2 is released, we'll be returning to bi-weekly OFED  
teleconferences.

-- 
Jeff Squyres
Cisco Systems


From Koen.SEGERS at VRT.BE  Tue May 29 06:29:48 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 29 May 2007 15:29:48 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D75@OCBEXS01001.rto.be>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C03157D77@OCBEXS01001.rto.be>

I just remembered that, with SDP, these values aren't related anymore.
SDP doesn't give this kind of information to the OS.

Koen

-----Original Message-----
From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
Sent: Tuesday, May 29, 2007 14:29
To: amip at dev.mellanox.co.il
Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

One of the machines has 2 dropped packets:

gpfswhbe2n1:~ # ifconfig ib0
ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
          inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
          TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)

Can this be related?

Does anyone know how this is possible with SDP? I thought SDP ran over RC.
I'm also curious how GPFS reacts to this. Do you know where I can find
the timestamps of these dropped packets?

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
Sent: Tuesday, May 29, 2007 14:03
To: SEGERS Koen
Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

If this is an actual resize request, then there is no problem when it is
dropped.
Since you are running RC1, no resize requests should be sent, so this
means there is a problem, since data could be dropped. Do you notice lost
data?

On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> this message was added only a few days ago.
> 
> How can you be so sure that this doesn't pose any problems?
> 
> Koen
> 
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> Sent: Tuesday, May 29, 2007 13:35
> To: SEGERS Koen
> Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> This means you are getting a message your SDP does not recognize.
> Message 11 is a resize request, which was added to SDP a few days ago.
> Can it be that you are running 2 different versions of OFED?
> Anyway, this doesn't pose any problem, so you can ignore it.
> 
> On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > Hi,
> > 
> > Saturday we did a different stresstest.
> > This is what we see in the /var/log/messages:
> > 
> > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > 
> > There were errors from that time on. Can someone explain to me what this
> > message means?
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > Sent: Wednesday, May 23, 2007 17:41
> > To: SEGERS Koen; Hal Rosenstock
> > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > Try 20 seconds; I'm curious whether you are barely crossing the
> > 10-second threshold.
> > 
> > Scott 
> > 
> > > -----Original Message-----
> > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > Cc: Clive Hall (clivhall); 
> > > general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > What value would you recommend then?
> > > 
> > > Koen
> > > 
> > > -----Original Message-----
> > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > Sent: Wednesday, May 23, 2007 17:38
> > > To: SEGERS Koen; Hal Rosenstock
> > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > The boot time of the host doesn't matter for this timeout.  While the
> > > host is booting, the IB link is down anyway.
> > > 
> > > Scott 
> > > 
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > Cc: Clive Hall (clivhall); 
> > > > general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > After a whole day of stress testing with the MAD renicing turned on, we
> > > > got the error once. So I think I should also raise the timeout on the
> > > > switch.
> > > > 
> > > > It takes about 2 minutes to boot the system. Do you agree that this is a
> > > > good value for the timeout?
> > > > 
> > > > Scott,
> > > > Can you explain the memlock problem to me?
> > > > 
> > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > > > install this, the bug is not related to us. This is correct, isn't it?
> > > > 
> > > > Greetz
> > > > 
> > > > Koen
> > > > 
> > > > -----Original Message-----
> > > > From: Hal Rosenstock [mailto:halr at voltaire.com] 
> > > > Sent: Wednesday, May 23, 2007 16:12
> > > > To: Scott Weitzenkamp (sweitzen)
> > > > Cc: SEGERS Koen; Clive Hall (clivhall);
> > > > general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > > > > openib.conf,
> > > > 
> > > > Does the host really not respond to MAD requests for over 10 seconds in
> > > > some cases?
> > > > 
> > > > -- Hal
> > > > 
> > > > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > SLES10 for bug 267, etc.).
> > > > > 
> > > > > Scott Weitzenkamp
> > > > > SQA and Release Manager
> > > > > Server Virtualization Business Unit
> > > > > Cisco Systems
> > > > >  
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > > > > general at lists.openfabrics.org; 
> > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > 
> > > > > > > So far, all tests seem to work.
> > > > > > > 
> > > > > > > Thanks for the help!
> > > > > > > 
> > > > > > > Scott,
> > > > > > > Are there more bug fixes that Cisco applies in its RPMs?
> > > > > > 
> > > > > > Greetz
> > > > > > 
> > > > > > Koen
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > > > > Sent: Wednesday, May 23, 2007 0:39
> > > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > 
> > > > > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > > > > 10 seconds (Clive, correct me if I'm wrong).
> > > > > > > 
> > > > > > > You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
> > > > > > > binary RPMs we release at
> > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
> > > > > > > the host be more responsive.
> > > > > > 
> > > > > > 
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems
> > > > > >  
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > > > > > general at lists.openfabrics.org;
> > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > 
> > > > > > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > > > > > interfaces every 10s. This means that when the interface is handling
> > > > > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > > > > service. My question is then: "How can the timeout be reached while
> > > > > > > > packets are being sent/received to/from the interface?"
> > > > > > > > 
> > > > > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > > > > > 
> > > > > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > > > > 1) change the MAD niceness of the servers
> > > > > > > > 2) change the timeout on the switches
> > > > > > > > 
> > > > > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > > > > PORT_ACTIVE state?
> > > > > > > 
> > > > > > > Regards,
> > > > > > > 
> > > > > > > Koen
> > > > > > > 
> > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > > > > >  
> > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > > > > > > 
> > > > > > > > > The default is 10 seconds, it can be configured up to 2000 seconds.
> > > > > > > > > If a HCA is completely unresponsive for longer than the node-timeout
> > > > > > > > > value, then we consider that HCA out of service.
> > > > > > > >  
> > > > > > > > Scott Weitzenkamp
> > > > > > > > SQA and Release Manager
> > > > > > > > Server Virtualization Business Unit
> > > > > > > > Cisco Systems
> > > > > > > >  
> > > > > > > > 
> > > > > > > >         
> > > > > > > >         
> > > > > > >
> ______________________________________________________________
> > > > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > >         To: koen.segers at VRT.BE
> > > > > > > >         Cc: Ami Perlmutter;
general at lists.openfabrics.org;
> > > > > > > >         general-bounces at lists.openfabrics.org; Scott 
> > > > Weitzenkamp
> > > > > > > >         (sweitzen)
> > > > > > > >         Subject: RE: [ofa-general] GPFS node loses 
> > > > IB-connection
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         Koen,
> > > > > > > >         
> > > > > > > > >         So it is most likely you hit the same bug as 229 (Scott
> > > > > > > > >         pointed out earlier). The same workaround might work for you
> > > > > > > > >         by renicing ib_mad as Scott suggested.
> > > > > > > > >         
> > > > > > > > >         I think this should be a SM query timeout tunable value in
> > > > > > > > >         Cisco SM. Am I right, Scott?
> > > > > > > >         
> > > > > > > >         Thanks
> > > > > > > >         Shirley Ma
> > > > > > > >         
> > > > > > > >         
> > > > > > > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > > > > > > >         Date: 05/22/07 11:14 AM
> > > > > > > > >         Reply-To: koen.segers at VRT.BE
> > > > > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > >         Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > > > > >             general at lists.openfabrics.org,
> > > > > > > > >             general-bounces at lists.openfabrics.org
> > > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > >         
> > > > > > > >         
> > > > > > > >         Hi,
> > > > > > > >         
> > > > > > > >         It is the Cisco SM. 
> > > > > > > >         
> > > > > > > >         SFS-7000P> show version
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         
> > > > > > >
> ==============================================================
> > > > > > > ==================
> > > > > > > >                                   System Version
> Information
> > > > > > > >         
> > > > > > >
> ==============================================================
> > > > > > > ==================
> > > > > > > >                   system-version : SFS-7000P TopspinOS 
> > > > > > 2.9.0 releng
> > > > > > > >         #147
> > > > > > > >         10/25/2006 02:01:32
> > > > > > > >                          contact : tac at cisco.com
> > > > > > > >                             name : SFS-7000P
> > > > > > > >                         location : 170 West Tasman
Drive, 
> > > > > > > San Jose, CA
> > > > > > > >         95134
> > > > > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > > > > >                      last-change : none
> > > > > > > >                 last-config-save : none
> > > > > > > >                           action : none
> > > > > > > >                           result : none
> > > > > > > >                        oper-mode : normal
> > > > > > > >         
> > > > > > > >         There is also a command that gives the SM
version,
> 
> > > > > > > but I can't
> > > > > > > >         find it
> > > > > > > >         right now. 
> > > > > > > >         
> > > > > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > >         > Hello Koen,
> > > > > > > > >         > 
> > > > > > > > >         > From the switch log, it looks like an SM issue to me. The node was
> > > > > > > > >         > kicked out of the membership. Which SM are you using in your
> > > > > > > > >         > fabric? 
> > > > > > > >         > 
> > > > > > > >         > Thanks
> > > > > > > >         > Shirley Ma
> > > > > > > >         > 
> > > > > > > >         *** Disclaimer ***
> > > > > > > >         
> > > > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > >         
> > > > > > > >         nv van publiek recht
> > > > > > > >         BTW BE 0244.142.664
> > > > > > > >         RPR Brussel
> > > > > > > >         http://www.vrt.be/disclaimer
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         
> > > > > > > >         
> > > > > > 
> > > > > _______________________________________________
> > > > > general mailing list
> > > > > general at lists.openfabrics.org
> > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > > 
> > > > > To unsubscribe, please visit
> > > > http://openib.org/mailman/listinfo/openib-general
> > > > 
> 


From erezz at voltaire.com  Tue May 29 06:41:12 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Tue, 29 May 2007 16:41:12 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel
 addons	foropen-iscsiover iSER support for RHAS4 up3 and up4
In-Reply-To: <20070524115715.GC4585@mellanox.co.il>
References: <20070521114410.GG20400@mellanox.co.il>	<46557BCB.7030102@voltaire.com>
	<20070524115715.GC4585@mellanox.co.il>
Message-ID: <465C2D78.30100@voltaire.com>


>> I have the following files in backport/2.6.9_UX/include/src/:
>>
>> attribute_container.c - almost identical to the file on 2.6.20. I had to change one line in it.
>>     
>
> could be a patch ...
> which line?
>
>   
Now, attribute_container.c, klist.c & transport_class.c are copied from
the kernel tree. I've committed the required changes in
~erezz/ofabuild_iser_rh4.git & ~erezz/ofed_1_2_iser_rh4.git. The main
change is a new dir called "kernel_addons_patches". It contains patches
for kernel tree files in order to create the required addons from them.

The rest of the files that I added to include/src are very small. I hope
it's ok now.

>> init.c - only a small part of the original file in 2.6.20
>>
>> klist.c - almost identical to the file on 2.6.20. I had to change one line in it.
>>     
>
> which line?
>   
See above.
>   
>> kref_new.c - based on kref.c
>>     
>
> Sounds scary ... how different is it?
>   
This file was removed.
>   
>> scsi.c - only a small part of the original file in 2.6.20
>>
>> scsi_lib.c - only a small part of the original file in 2.6.20
>>
>> scsi_scan.c - only a small part of the original file in 2.6.20
>>
>> transport_class.c - identical to 2.6.20
>>
>> So, the only file identical to 2.6.20 is transport_class.c. We can copy it from 2.6.20, but since it's only a single file, I'm not sure if it will make a real difference.
>>     
>
> transport_class.c, attribute_container.c and klist.c are quite big together:
> more than 1000 lines. So by all means, let's check them out from kernel tree.
>   
Done.

Erez


From mst at dev.mellanox.co.il  Tue May 29 07:11:43 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 17:11:43 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel
	addons	foropen-iscsiover iSER support for RHAS4 up3 and up4
In-Reply-To: <465C2D78.30100@voltaire.com>
References: <20070521114410.GG20400@mellanox.co.il>
	<46557BCB.7030102@voltaire.com>
	<20070524115715.GC4585@mellanox.co.il>
	<465C2D78.30100@voltaire.com>
Message-ID: <20070529141143.GD27671@mellanox.co.il>

> Quoting Erez Zilber <erezz at voltaire.com>:
> Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons?foropen-iscsiover iSER support for RHAS4 up3 and up4
> 
> 
> >> I have the following files in backport/2.6.9_UX/include/src/:
> >>
> >> attribute_container.c - almost identical to the file on 2.6.20. I had to change one line in it.
> >>     
> >
> > could be a patch ...
> > which line?
> >
> >   
> Now, attribute_container.c, klist.c & transport_class.c are copied from
> the kernel tree. I've committed the required changes in
> ~erezz/ofabuild_iser_rh4.git & ~erezz/ofed_1_2_iser_rh4.git.

git fetch git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git.
fatal: The remote end hung up unexpectedly
Cannot get the repository state from
git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git.


> The main
> change is a new dir called "kernel_addons_patches". It contains patches
> for kernel tree files in order to create the required addons from them.

sorry, but I really don't think we can touch build scripts at this point.
Doing cp in build scripts is also a problem since it interferes with
development (there are 2 places to edit each file).
And adding kernel version dependency there is also really messy.

Suggestion: why can't these patches be part of the regular backport directory?

You copy stuff to include/src and then include it, but this just looks
like an unnecessary extra step. Can't we include the source file from
its original directory, like this:
#include "../drivers/base/attribute_container.c"

> The rest of the files that I added to include/src are very small. I hope
> it's ok now.

Yes, the rest looks OK to me.

-- 
MST


From mst at dev.mellanox.co.il  Tue May 29 07:24:22 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 17:24:22 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel
	addons	foropen-iscsiover iSER support for RHAS4 up3 and up4
In-Reply-To: <20070529141143.GD27671@mellanox.co.il>
References: <20070521114410.GG20400@mellanox.co.il>
	<46557BCB.7030102@voltaire.com>
	<20070524115715.GC4585@mellanox.co.il>
	<465C2D78.30100@voltaire.com>
	<20070529141143.GD27671@mellanox.co.il>
Message-ID: <20070529142422.GE27671@mellanox.co.il>

> Quoting Michael S. Tsirkin <mst at dev.mellanox.co.il>:
> Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons?foropen-iscsiover iSER support for RHAS4 up3 and up4
> 
> > Quoting Erez Zilber <erezz at voltaire.com>:
> > Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons?foropen-iscsiover iSER support for RHAS4 up3 and up4
> > 
> > 
> > >> I have the following files in backport/2.6.9_UX/include/src/:
> > >>
> > >> attribute_container.c - almost identical to the file on 2.6.20. I had to change one line in it.
> > >>     
> > >
> > > could be a patch ...
> > > which line?
> > >
> > >   
> > Now, attribute_container.c, klist.c & transport_class.c are copied from
> > the kernel tree. I've committed the required changes in
> > ~erezz/ofabuild_iser_rh4.git & ~erezz/ofed_1_2_iser_rh4.git.
> 
> git fetch git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git.
> fatal: The remote end hung up unexpectedly
> Cannot get the repository state from
> git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git.
> 
> 
> > The main
> > change is a new dir called "kernel_addons_patches". It contains patches
> > for kernel tree files in order to create the required addons from them.
> 
> sorry, but I really don't think we can touch build scripts at this point.
> Doing cp in build scripts is also a problem since it interferes with
> development (there are 2 places to edit each file).
> And adding kernel version dependency there is also really messy.

BTW, one important principle is that *all* information about
the kernel build process must be contained inside kernel tree.
Keeping lists of files in an external tree is *evil*.

> Suggestion: why can't these patches be part of the regular backport directory?
> 
> You copy stuff to include/src and then include it, but this just looks
> like an unnecessary extra step. Can't we include the source file from
> its original directory, like this:
> #include "../drivers/base/attribute_container.c"


-- 
MST


From tziporet at dev.mellanox.co.il  Tue May 29 07:34:54 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 29 May 2007 17:34:54 +0300
Subject: [ofa-general] Re: [ewg] Upcoming OFED teleconferences
In-Reply-To: <338EEB24-0F0E-443D-94AA-61E512611F8B@cisco.com>
References: <338EEB24-0F0E-443D-94AA-61E512611F8B@cisco.com>
Message-ID: <465C3A0E.6070602@mellanox.co.il>

Jeff Squyres wrote:
> Short version:
> --------------
>
> Upcoming OFED teleconferences, all at noon US eastern / 9am US Pacific 
> / 7pm Israel.
>
> 1. Wednesday, May 30 (*TOMORROW*), code 210262040
I cannot make it on Wed at 9am PST.
Can you change it to 11:30am PST?

Thanks,
Tziporet
> 2. Monday, June 4, code 2102061
> 3. Monday, June 11, code 210213621
> 4. Monday, June 18, code 2102061
> 5. Monday, June 25, code 210213621
>
> US/Canada:  +1.866.432.9903
> India:      +91.80.4103.3979
> Israel:     +972.9.892.7026
> Others:     http://cisco.com/en/US/about/doing_business/conferencing/


From amip at dev.mellanox.co.il  Tue May 29 07:39:54 2007
From: amip at dev.mellanox.co.il (Ami Perlmutter)
Date: Tue, 29 May 2007 17:39:54 +0300
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D77@OCBEXS01001.rto.be>
References: <D63C0BE2D613C543B6F3305502E9784C03157D77@OCBEXS01001.rto.be>
Message-ID: <1180449624.12048.13.camel@localhost>

can you describe the scenario in which you see data lost?
does the "SDP: FIXME MID 11" message correlate with the data loss?

On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> I just remembered that, with SDP, these values aren't related anymore.
> SDP doesn't give this kind of information to the OS.
> 
> Koen
> 
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
> Sent: Tuesday, May 29, 2007 14:29
> To: amip at dev.mellanox.co.il
> Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> One of the machines has 2 dropped packets:
> 
> gpfswhbe2n1:~ # ifconfig ib0
> ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
>           inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
>           inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>           RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
>           collisions:0 txqueuelen:128
>           RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)
> 
> Can this be related?
> 
> Does anyone know how this is possible with SDP? I thought SDP ran over RC.
> I'm also curious how GPFS reacts to this. Do you know where I can find
> the timestamps of these dropped packets?
> 
> Koen
> 
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> Sent: Tuesday, May 29, 2007 14:03
> To: SEGERS Koen
> Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> If this is an actual resize request, then there is no problem when it is
> dropped.
> Since you are running RC1, no resize requests should be sent, so this
> means there is a problem, since data could be dropped. Do you notice lost
> data?
> 
> On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> > this message was added only a few days ago.
> > 
> > How can you be so sure that this doesn't pose any problems?
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> > Sent: Tuesday, May 29, 2007 13:35
> > To: SEGERS Koen
> > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > this means you are getting a message your SDP does not recognize.
> > message 11 is a resize request, which was added to sdp a few days ago.
> > can it be that you are running 2 different versions of OFED?
> > anyway, this doesn't pose any problem so you can ignore it.
> > 
> > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > Hi,
> > > 
> > > Saturday we did a different stresstest.
> > > This is what we see in the /var/log/messages:
> > > 
> > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > > 
> > > There were errors from that time on. Can someone explain to me what this
> > > message means?
> > > 
> > > Koen
> > > 
> > > -----Original Message-----
> > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > Sent: Wednesday, May 23, 2007 17:41
> > > To: SEGERS Koen; Hal Rosenstock
> > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > Try 20 seconds, I'm curious if you are barely crossing the 10-second
> > > threshold.
> > > 
> > > Scott 
> > > 
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > > Cc: Clive Hall (clivhall); 
> > > > general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > What value would you recommend then?
> > > > 
> > > > Koen
> > > > 
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > Sent: Wednesday, May 23, 2007 17:38
> > > > To: SEGERS Koen; Hal Rosenstock
> > > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > The boot time of the host doesn't matter for this timeout.  While the
> > > > host is booting, the IB link is down anyway.
> > > > 
> > > > Scott 
> > > > 
> > > > > -----Original Message-----
> > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > > Cc: Clive Hall (clivhall); 
> > > > > general-bounces at lists.openfabrics.org;
> > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > After a whole day of stress testing with the MAD renicing turned on, we
> > > > > got the error once. So I think I should also raise the timeout on the
> > > > > switch.
> > > > > 
> > > > > It takes about 2 minutes to boot the system. Do you agree that this is a
> > > > > good value for the timeout?
> > > > > 
> > > > > Scott,
> > > > > Can you explain the memlock problem to me?
> > > > > 
> > > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > > > > install this, the bug doesn't affect us. This is correct, isn't it?
> > > > > 
> > > > > Greetz
> > > > > 
> > > > > Koen
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] 
> > > > > Sent: Wednesday, May 23, 2007 16:12
> > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > CC: SEGERS Koen; Clive Hall (clivhall);
> > > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > > > > > openib.conf,
> > > > > 
> > > > > Does the host really not respond to MAD requests for over 10 seconds in
> > > > > some cases?
> > > > > 
> > > > > -- Hal
> > > > > 
> > > > > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > > SLES10 for bug 267, etc.).
> > > > > > 
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems
> > > > > >  
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > > > > general at lists.openfabrics.org; 
> > > > > general-bounces at lists.openfabrics.org
> > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > 
> > > > > > > Thus far, all tests seem to work.
> > > > > > > 
> > > > > > > Thanks for the help!
> > > > > > > 
> > > > > > > Scott,
> > > > > > > Are there more bug fixes that Cisco includes in its RPMs?
> > > > > > > 
> > > > > > > Greetz
> > > > > > > 
> > > > > > > Koen
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > > > > Sent: Wednesday, May 23, 2007 0:39
> > > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > 
> > > > > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > > > > 10 seconds (Clive, correct me if I'm wrong).
> > > > > > > 
> > > > > > > You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
> > > > > > > binary RPMs we release at
> > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
> > > > > > > the host be more responsive.
> > > > > > > 
> > > > > > > 
> > > > > > > Scott Weitzenkamp
> > > > > > > SQA and Release Manager
> > > > > > > Server Virtualization Business Unit
> > > > > > > Cisco Systems
> > > > > > >  
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > > > > > general at lists.openfabrics.org;
> > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > 
> > > > > > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > > > > > interfaces every 10s. This means that when the interface is handling
> > > > > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > > > > service. My question is then: "How can the timeout be reached while
> > > > > > > > packets are being sent/received to/from the interface?"
> > > > > > > > 
> > > > > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > > > > > 
> > > > > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > > > > 1) change the MAD niceness of the servers
> > > > > > > > 2) change the timeout on the switches
> > > > > > > > 
> > > > > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > > > > PORT_ACTIVE state?
> > > > > > > > 
> > > > > > > > Regards,
> > > > > > > > 
> > > > > > > > Koen
> > > > > > > > 
> > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > > > > >  
> > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > > > > > > 
> > > > > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > > > > > If an HCA is completely unresponsive for longer than the node-timeout
> > > > > > > > > value, then we consider that HCA out of service.
> > > > > > > > >  
> > > > > > > > > Scott Weitzenkamp
> > > > > > > > > SQA and Release Manager
> > > > > > > > > Server Virtualization Business Unit
> > > > > > > > > Cisco Systems
> > > > > > > > >  
> > > > > > > > > 
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         ______________________________________________________________
> > > > > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > > > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > > >         To: koen.segers at VRT.BE
> > > > > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp
> > > > > > > > >         (sweitzen)
> > > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         Koen,
> > > > > > > > >         
> > > > > > > > >         So it is most likely you hit the same bug as 229 (Scott
> > > > > > > > >         pointed out earlier). The same workaround might work for you
> > > > > > > > >         by renicing ib_mad as Scott suggested.
> > > > > > > > >         
> > > > > > > > >         I think this should be a tunable SM query timeout value in
> > > > > > > > >         Cisco SM. Am I right, Scott?
> > > > > > > > >         
> > > > > > > > >         Thanks
> > > > > > > > >         Shirley Ma
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > > > > > > >         Date: 05/22/07 11:14 AM
> > > > > > > > >         Please respond to: koen.segers at VRT.BE
> > > > > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > >         cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > > > > >         general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
> > > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         Hi,
> > > > > > > > >         
> > > > > > > > >         It is the Cisco SM. 
> > > > > > > > >         
> > > > > > > > >         SFS-7000P> show version
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         ================================================================================
> > > > > > > > >                                   System Version Information
> > > > > > > > >         ================================================================================
> > > > > > > > >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > > >                          contact : tac at cisco.com
> > > > > > > > >                             name : SFS-7000P
> > > > > > > > >                         location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > > > > > >                      last-change : none
> > > > > > > > >                 last-config-save : none
> > > > > > > > >                           action : none
> > > > > > > > >                           result : none
> > > > > > > > >                        oper-mode : normal
> > > > > > > > >         
> > > > > > > > >         There is also a command that gives the SM version,
> > > > > > > > >         but I can't find it right now. 
> > > > > > > > >         
> > > > > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > >         > Hello Koen,
> > > > > > > > >         > 
> > > > > > > > >         > From the switch log, it looks like an SM issue to me.
> > > > > > > > >         > The node was kicked out of the membership. Which SM
> > > > > > > > >         > are you using in your fabric? 
> > > > > > > > >         > 
> > > > > > > > >         > Thanks
> > > > > > > > >         > Shirley Ma
> > > > > > > > >         > 
> > > > > > > > >         *** Disclaimer ***
> > > > > > > > >         
> > > > > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > > >         
> > > > > > > > >         nv van publiek recht
> > > > > > > > >         BTW BE 0244.142.664
> > > > > > > > >         RPR Brussel
> > > > > > > > >         http://www.vrt.be/disclaimer
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         
> > > > > > _______________________________________________
> > > > > > general mailing list
> > > > > > general at lists.openfabrics.org
> > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > > > 
> > > > > > To unsubscribe, please visit
> > > > > http://openib.org/mailman/listinfo/openib-general
> > > > > 
> 


From Koen.SEGERS at VRT.BE  Tue May 29 07:56:58 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 29 May 2007 16:56:58 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1180449624.12048.13.camel@localhost>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C03157D79@OCBEXS01001.rto.be>

We don't really see data getting lost, and we don't get an error in the
GPFS log files. We only had a system that was no longer able to read its
filesystem. This happened exactly at the time the FIXME error occurred.

Therefore I think there must be some kind of correlation. But I don't
really know what ... :(
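
One way to check that correlation is to pull the timestamps of the SDP
messages out of /var/log/messages and compare them with the time the
filesystem became unreadable. A minimal sketch in Python follows. The
function names, the 60-second window and the year argument are
assumptions for illustration (syslog lines like the one quoted in this
thread carry no year), and month names are parsed assuming an English/C
locale.

```python
import re
from datetime import datetime, timedelta

# Matches the kernel message quoted in this thread, e.g.
#   May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
FIXME_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kernel: SDP: FIXME MID (?P<mid>\d+)"
)


def find_fixme_events(lines, year):
    """Return (timestamp, host, message id) for every SDP FIXME line.

    `year` must be supplied by the caller because the syslog
    timestamp format shown above does not include one.
    """
    events = []
    for line in lines:
        m = FIXME_RE.match(line)
        if m:
            when = datetime.strptime(
                f"{year} {m.group('stamp')}", "%Y %b %d %H:%M:%S"
            )
            events.append((when, m.group("host"), int(m.group("mid"))))
    return events


def events_near(events, failure_time, window_seconds=60):
    """Keep only the events logged within the window around failure_time."""
    window = timedelta(seconds=window_seconds)
    return [e for e in events if abs(e[0] - failure_time) <= window]
```

For example, passing the lines of /var/log/messages and the year 2007 to
find_fixme_events(), and then handing the result to events_near()
together with the time GPFS lost the filesystem, lists the FIXME
messages logged within a minute of the failure. An empty list would
argue against a direct link, though a match alone does not prove one.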

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
Sent: Tuesday, May 29, 2007 16:40
To: SEGERS Koen
CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

can you describe the scenario in which you see data lost?
does the "SDP: FIXME MID 11" message correlate with the data loss?

On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> I just remembered that, with SDP, these values aren't related anymore.
> SDP doesn't give this kind of information to the OS.
> 
> Koen
> 
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
> Sent: Tuesday, May 29, 2007 14:29
> To: amip at dev.mellanox.co.il
> CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> One of the machines has 2 dropped packets:
> 
> gpfswhbe2n1:~ # ifconfig ib0
> ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
>           inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
>           inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>           RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
>           collisions:0 txqueuelen:128
>           RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)
> 
> Can this be related?
> 
> Does anyone know how this is possible with SDP? I thought SDP was RC-based.
> I'm also curious how gpfs reacts to this. Do you know where I can find
> the timestamp of these dropped packets?
> 
> Koen
> 
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> Sent: Tuesday, May 29, 2007 14:03
> To: SEGERS Koen
> CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> if this is an actual resize request then there is no problem when it is
> dropped.
> since you are running rc1, no resize requests should be sent, so this
> means there is a problem, since data could be dropped. do you notice lost
> data?
> 
> On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> > this message was added only a few days ago.
> > 
> > How can you be so sure that this doesn't pose any problems?
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> > Sent: Tuesday, May 29, 2007 13:35
> > To: SEGERS Koen
> > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > this means you are getting a message your SDP does not recognize.
> > message 11 is a resize request, which was added to sdp a few days ago.
> > can it be that you are running 2 different versions of OFED?
> > anyway, this doesn't pose any problem so you can ignore it.
> > 
> > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > Hi,
> > > 
> > > Saturday we did a different stresstest.
> > > This is what we see in the /var/log/messages:
> > > 
> > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > > 
> > > There were errors from that time on. Can someone explain to me what this
> > > message means?
> > > 
> > > Koen
> > > 
> > > -----Original Message-----
> > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > Sent: Wednesday, May 23, 2007 17:41
> > > To: SEGERS Koen; Hal Rosenstock
> > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > Try 20 seconds, I'm curious if you are barely crossing the 10-second
> > > threshold.
> > > 
> > > Scott 
> > > 
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > > Cc: Clive Hall (clivhall); 
> > > > general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > What value would you recommend then?
> > > > 
> > > > Koen
> > > > 
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > Sent: Wednesday, May 23, 2007 17:38
> > > > To: SEGERS Koen; Hal Rosenstock
> > > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > The boot time of the host doesn't matter for this timeout.  While the
> > > > host is booting, the IB link is down anyway.
> > > > 
> > > > Scott 
> > > > 
> > > > > -----Original Message-----
> > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > > Cc: Clive Hall (clivhall); 
> > > > > general-bounces at lists.openfabrics.org;
> > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > After a whole day of stress testing with the MAD renicing turned on, we
> > > > > got the error once. So I think I should also raise the timeout on the
> > > > > switch.
> > > > > 
> > > > > It takes about 2 minutes to boot the system. Do you agree that this is a
> > > > > good value for the timeout?
> > > > > 
> > > > > Scott,
> > > > > Can you explain the memlock problem to me?
> > > > > 
> > > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > > > > install this, the bug doesn't affect us. This is correct, isn't it?
> > > > > 
> > > > > Greetz
> > > > > 
> > > > > Koen
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] 
> > > > > Sent: Wednesday, May 23, 2007 16:12
> > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > CC: SEGERS Koen; Clive Hall (clivhall);
> > > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > > > > > openib.conf,
> > > > > 
> > > > > Does the host really not respond to MAD requests for over 10 seconds in
> > > > > some cases?
> > > > > 
> > > > > -- Hal
> > > > > 
> > > > > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > > SLES10 for bug 267, etc.).
> > > > > > 
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems
> > > > > >  
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > > > > general at lists.openfabrics.org; 
> > > > > general-bounces at lists.openfabrics.org
> > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > 
> > > > > > > Thus far, all tests seem to work.
> > > > > > > 
> > > > > > > Thanks for the help!
> > > > > > > 
> > > > > > > Scott,
> > > > > > > Are there more bug fixes that Cisco includes in its RPMs?
> > > > > > > 
> > > > > > > Greetz
> > > > > > > 
> > > > > > > Koen
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > > > > Sent: Wednesday, May 23, 2007 0:39
> > > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > 
> > > > > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > > > > 10 seconds (Clive, correct me if I'm wrong).
> > > > > > > 
> > > > > > > You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
> > > > > > > binary RPMs we release at
> > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
> > > > > > > the host be more responsive.
> > > > > > > 
> > > > > > > 
> > > > > > > Scott Weitzenkamp
> > > > > > > SQA and Release Manager
> > > > > > > Server Virtualization Business Unit
> > > > > > > Cisco Systems
> > > > > > >  
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > > > > > general at lists.openfabrics.org;
> > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > 
> > > > > > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > > > > > interfaces every 10s. This means that when the interface is handling
> > > > > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > > > > service. My question is then: "How can the timeout be reached while
> > > > > > > > packets are being sent/received to/from the interface?"
> > > > > > > > 
> > > > > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > > > > > 
> > > > > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > > > > 1) change the MAD niceness of the servers
> > > > > > > > 2) change the timeout on the switches
> > > > > > > > 
> > > > > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > > > > PORT_ACTIVE state?
> > > > > > > > 
> > > > > > > > Regards,
> > > > > > > > 
> > > > > > > > Koen
> > > > > > > > 
> > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > > > > >  
> > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > > > > > > 
> > > > > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > > > > > If an HCA is completely unresponsive for longer than the node-timeout
> > > > > > > > > value, then we consider that HCA out of service.
> > > > > > > > >  
> > > > > > > > > Scott Weitzenkamp
> > > > > > > > > SQA and Release Manager
> > > > > > > > > Server Virtualization Business Unit
> > > > > > > > > Cisco Systems
> > > > > > > > >  
> > > > > > > > > 
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         ______________________________________________________________
> > > > > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > > > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > > >         To: koen.segers at VRT.BE
> > > > > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp
> > > > > > > > >         (sweitzen)
> > > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         Koen,
> > > > > > > > >         
> > > > > > > > >         So it is most likely you hit the same bug as 229 (Scott
> > > > > > > > >         pointed out earlier). The same workaround might work for you
> > > > > > > > >         by renicing ib_mad as Scott suggested.
> > > > > > > > >         
> > > > > > > > >         I think this should be a tunable SM query timeout value in
> > > > > > > > >         Cisco SM. Am I right, Scott?
> > > > > > > > >         
> > > > > > > > >         Thanks
> > > > > > > > >         Shirley Ma
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > > > > > > >         Date: 05/22/07 11:14 AM
> > > > > > > > >         Please respond to: koen.segers at VRT.BE
> > > > > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > >         cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > > > > >         general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
> > > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         Hi,
> > > > > > > > >         
> > > > > > > > >         It is the Cisco SM. 
> > > > > > > > >         
> > > > > > > > >         SFS-7000P> show version
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > >
> > ==============================================================
> > > > > > > > ==================
> > > > > > > > >                                   System Version
> > Information
> > > > > > > > >         
> > > > > > > >
> > ==============================================================
> > > > > > > > ==================
> > > > > > > > >                   system-version : SFS-7000P TopspinOS

> > > > > > > 2.9.0 releng
> > > > > > > > >         #147
> > > > > > > > >         10/25/2006 02:01:32
> > > > > > > > >                          contact : tac at cisco.com
> > > > > > > > >                             name : SFS-7000P
> > > > > > > > >                         location : 170 West Tasman
> Drive, 
> > > > > > > > San Jose, CA
> > > > > > > > >         95134
> > > > > > > > >                          up-time :
11(d):7(h):49(m):3(s)
> > > > > > > > >                      last-change : none
> > > > > > > > >                 last-config-save : none
> > > > > > > > >                           action : none
> > > > > > > > >                           result : none
> > > > > > > > >                        oper-mode : normal
> > > > > > > > >         
> > > > > > > > > >         There is also a command that gives the SM version, but I can't
> > > > > > > > > >         find it right now. 
> > > > > > > > > >         
> > > > > > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > > >         > Hello Koen,
> > > > > > > > > >         > 
> > > > > > > > > >         > From the switch log, it looks like an SM issue to me. The node was
> > > > > > > > > >         > kicked out of the membership. Which SM are you using in your
> > > > > > > > > >         > fabric? 
> > > > > > > > >         > 
> > > > > > > > >         > Thanks
> > > > > > > > >         > Shirley Ma
> > > > > > > > >         > 
> > > > > > > > >         *** Disclaimer ***
> > > > > > > > >         
> > > > > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > > >         
> > > > > > > > >         nv van publiek recht
> > > > > > > > >         BTW BE 0244.142.664
> > > > > > > > >         RPR Brussel
> > > > > > > > >         http://www.vrt.be/disclaimer
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         
> > > > > > > > >         

From amip at dev.mellanox.co.il  Tue May 29 08:05:24 2007
From: amip at dev.mellanox.co.il (Ami Perlmutter)
Date: Tue, 29 May 2007 18:05:24 +0300
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D79@OCBEXS01001.rto.be>
References: <D63C0BE2D613C543B6F3305502E9784C03157D79@OCBEXS01001.rto.be>
Message-ID: <1180451154.12048.15.camel@localhost>

Any chance of moving to RC3 (or waiting for RC4)?

On Tue, 2007-05-29 at 16:56 +0200, SEGERS Koen wrote:
> We don't really see data getting lost. We don't get an error in the log
> files of gpfs. We only got a system that was not able to read its
> filesystem anymore. It was exactly at the time this FIXME error
> occurred.
> 
> Therefore I think there must be some kind of correlation. But I don't
> really know what ... :(
> 
> Koen
> 
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> Sent: Tuesday, May 29, 2007 16:40
> To: SEGERS Koen
> CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> can you describe the scenario in which you see data lost?
> does the "SDP: FIXME MID 11" message correlate with the data loss?
> 
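One way to check the correlation asked about above is to pull the
timestamps of the SDP message out of syslog and compare them with the
time the filesystem stopped responding. A minimal sketch in Python,
assuming the log lines look like the one quoted further down in this
thread ("May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11"); the
function name and the regular expression are illustrative only and are
not part of OFED:

```python
import re

# Assumed line shape, taken from the /var/log/messages excerpt in this thread:
#   May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
FIXME_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+"
    r"(?P<host>\S+)\s+kernel:\s+SDP: FIXME MID (?P<mid>\d+)"
)


def find_sdp_fixme(lines):
    """Return (timestamp, host, mid) for every 'SDP: FIXME MID' line."""
    hits = []
    for line in lines:
        m = FIXME_RE.match(line)
        if m:
            hits.append((m.group("stamp"), m.group("host"), int(m.group("mid"))))
    return hits


# Typical use (path as mentioned in the thread):
#   with open("/var/log/messages") as f:
#       for stamp, host, mid in find_sdp_fixme(f):
#           print(stamp, host, mid)
```

The resulting timestamps can then be lined up by hand against the time
at which GPFS lost access to its filesystem.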
> On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> > I just remembered that, with SDP, these values aren't related anymore.
> > SDP doesn't give this kind of information to the OS.
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
> > Sent: Tuesday, May 29, 2007 14:29
> > To: amip at dev.mellanox.co.il
> > CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > One of the machines has 2 dropped packets:
> > 
> > gpfswhbe2n1:~ # ifconfig ib0
> > ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> >           inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
> >           inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
> >           RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
> >           collisions:0 txqueuelen:128
> >           RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)
> > 
> > Can this be related?
> > 
> > Does anyone know how this is possible with SDP? I thought SDP used RC.
> > I'm also curious how GPFS reacts to this. Do you know where I can find
> > the timestamp of these dropped packets?
> > 
> > Koen
> > 
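On the timestamp question: ifconfig only shows cumulative counters, so
the time of a drop has to be captured by polling the counter and noting
when it changes. A rough sketch, assuming a Linux host that exposes the
counter at /sys/class/net/ib0/statistics/tx_dropped (the usual sysfs
location for interface statistics); the helper names and the one-second
interval are made up for illustration:

```python
import time


def read_tx_dropped(iface="ib0"):
    """Read the cumulative TX dropped counter for one interface from sysfs."""
    with open(f"/sys/class/net/{iface}/statistics/tx_dropped") as f:
        return int(f.read())


def watch(read=read_tx_dropped, interval=1.0, rounds=None,
          sleep=time.sleep, clock=time.time):
    """Poll the counter and record (time, old, new) whenever it changes.

    rounds=None polls forever; a number limits the polls (handy for tests).
    """
    last = read()
    events = []
    done = 0
    while rounds is None or done < rounds:
        sleep(interval)
        current = read()
        if current != last:
            events.append((clock(), last, current))
            print(f"{time.ctime(events[-1][0])}: tx_dropped {last} -> {current}")
            last = current
        done += 1
    return events


# Typical use on the affected node: watch()   (stop with Ctrl-C)
```

The timestamps are only as precise as the polling interval, so this
narrows down when a drop happened rather than recording it exactly.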
> > -----Original Message-----
> > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> > Sent: Tuesday, May 29, 2007 14:03
> > To: SEGERS Koen
> > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > If this is an actual resize request, then there is no problem when it is
> > dropped.
> > Since you are running RC1, no resize requests should be sent, so this
> > means there is a problem, since data could be dropped. Do you notice lost
> > data?
> > 
> > On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > > We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> > > this message was added only a few days ago.
> > > 
> > > How can you be so sure that this doesn't pose any problems?
> > > 
> > > Koen
> > > 
> > > -----Original Message-----
> > > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> > > Sent: Tuesday, May 29, 2007 13:35
> > > To: SEGERS Koen
> > > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > This means you are getting a message your SDP does not recognize.
> > > Message 11 is a resize request, which was added to SDP a few days ago.
> > > Can it be that you are running 2 different versions of OFED?
> > > Anyway, this doesn't pose any problem, so you can ignore it.
> > > 
> > > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > > Hi,
> > > > 
> > > > Saturday we did a different stresstest.
> > > > This is what we see in the /var/log/messages:
> > > > 
> > > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > > > 
> > > > There were errors from that time on. Can someone explain to me what this
> > > > message means?
> > > > 
> > > > Koen
> > > > 
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > Sent: Wednesday, May 23, 2007 17:41
> > > > To: SEGERS Koen; Hal Rosenstock
> > > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > Try 20 seconds, I'm curious if you are barely crossing the 10-second
> > > > threshold.
> > > > 
> > > > Scott 
> > > > 
> > > > > -----Original Message-----
> > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall); 
> > > > > general-bounces at lists.openfabrics.org;
> > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > What value would you recommend then?
> > > > > 
> > > > > Koen
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > > Sent: Wednesday, May 23, 2007 17:38
> > > > > To: SEGERS Koen; Hal Rosenstock
> > > > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > The boot time of the host doesn't matter for this timeout. While the
> > > > > host is booting, the IB link is down anyway.
> > > > > 
> > > > > Scott 
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > > > Cc: Clive Hall (clivhall); 
> > > > > > general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > 
> > > > > > After a whole day of stress testing with the MAD renicing turned on, we
> > > > > > got the error once. So I think I should raise the timeout on the switch
> > > > > > as well.
> > > > > > 
> > > > > > It takes about 2 minutes to boot the system. Do you agree that this is a
> > > > > > good value for the timeout?
> > > > > > 
> > > > > > Scott,
> > > > > > Can you explain the memlock problem to me?
> > > > > > 
> > > > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > > > > > install this, the bug is not related to us. This is correct, isn't it?
> > > > > > 
> > > > > > Greetz
> > > > > > 
> > > > > > Koen
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] 
> > > > > > Sent: Wednesday, May 23, 2007 16:12
> > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > CC: SEGERS Koen; Clive Hall (clivhall);
> > > > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > 
> > > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > > > > > > openib.conf,
> > > > > > 
> > > > > > Does the host really not respond to MAD requests for over 10 
> > > > > > seconds in
> > > > > > some cases ?
> > > > > > 
> > > > > > -- Hal
> > > > > > 
> > > > > > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > > > SLES10 for bug 267, etc.).
> > > > > > > 
> > > > > > > Scott Weitzenkamp
> > > > > > > SQA and Release Manager
> > > > > > > Server Virtualization Business Unit
> > > > > > > Cisco Systems
> > > > > > >  
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > > > > > general at lists.openfabrics.org; 
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > 
> > > > > > > > This far, all tests seem to work.
> > > > > > > > 
> > > > > > > > Thanks for the help!
> > > > > > > > 
> > > > > > > > Scott,
> > > > > > > > Are there more bugfixes that cisco does in its rpms?
> > > > > > > > 
> > > > > > > > Greetz
> > > > > > > > 
> > > > > > > > Koen
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > > > > > Sent: Wednesday, May 23, 2007 0:39
> > > > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > 
> > > > > > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > > > > > 10 seconds (Clive, correct me if I'm wrong).
> > > > > > > > 
> > > > > > > > You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
> > > > > > > > binary RPMs we release at
> > > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
> > > > > > > > the host be more responsive.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Scott Weitzenkamp
> > > > > > > > SQA and Release Manager
> > > > > > > > Server Virtualization Business Unit
> > > > > > > > Cisco Systems
> > > > > > > >  
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > > > > > > general at lists.openfabrics.org;
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > 
> > > > > > > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > > > > > > interfaces every 10s. This means that when the interface is handling
> > > > > > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > > > > > service. My question is then: "How can the timeout be reached while
> > > > > > > > > packets are being sent/received to/from the interface?"
> > > > > > > > > 
> > > > > > > > > Anyway, what timeout-value would you recommend for us? And why?
> > > > > > > > > 
> > > > > > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > > > > > 1) change the MAD niceness of the servers
> > > > > > > > > 2) change the timeout on the switches
> > > > > > > > > 
> > > > > > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > > > > > PORT_ACTIVE state?
> > > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > 
> > > > > > > > > Koen
> > > > > > > > > 
> > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > > > > > >  
> > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > > > > > > > 
> > > > > > > > > > The default is 10 seconds, it can be configured up to 2000 seconds.
> > > > > > > > > > If a HCA is completely unresponsive for longer than the node-timeout
> > > > > > > > > > value, then we consider that HCA out of service.
> > > > > > > > > >  
> > > > > > > > > > Scott Weitzenkamp
> > > > > > > > > > SQA and Release Manager
> > > > > > > > > > Server Virtualization Business Unit
> > > > > > > > > > Cisco Systems
> > > > > > > > > >  
> > > > > > > > > > 
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > >
> > > ______________________________________________________________
> > > > > > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > > > > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > > > >         To: koen.segers at VRT.BE
> > > > > > > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > > >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp
> > > > > > > > > > >         (sweitzen)
> > > > > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         Koen,
> > > > > > > > > >         
> > > > > > > > > > >         So it is most likely you hit the same bug as 229 (Scott
> > > > > > > > > > >         pointed out earlier). The same workaround might work for you
> > > > > > > > > > >         by renicing ib_mad as Scott suggested.
> > > > > > > > > > >         
> > > > > > > > > > >         I think this should be a SM query timeout tunable value in
> > > > > > > > > > >         Cisco SM. Am I right, Scott?
> > > > > > > > > >         
> > > > > > > > > >         Thanks
> > > > > > > > > >         Shirley Ma
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         Inactive hide details for Koen Segers 
> > > > > > > > > <koen.segers at VRT.BE>Koen
> > > > > > > > > >         Segers <koen.segers at VRT.BE>
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >                                         Koen Segers 
> > > > > > > > > <koen.segers at VRT.BE> 
> > > > > > > > > >                                         
> > > > > > > > > >                                         05/22/07 11:14
> > AM 
> > > > > > > > > >                                         Please respond
> > to
> > > > > > > > > >
> > koen.segers at VRT.BE
> > > > > > > > > >                                         
> > > > > > > > > >         
> > > > > > > > > >                      To
> > > > > > > > > >         
> > > > > > > > > >         Shirley
> > > > > > > > > >         Ma/Beaverton/IBM at IBMUS
> > > > > > > > > >         
> > > > > > > > > >                      cc
> > > > > > > > > >         
> > > > > > > > > >         Ami Perlmutter
> > > > > > > > > >         <amip at dev.mellanox.co.il>, 
> > > > > > > > > general at lists.openfabrics.org,
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > > >         
> > > > > > > > > >                 Subject
> > > > > > > > > >         
> > > > > > > > > >         RE:
> > > > > > > > > >         [ofa-general]
> > > > > > > > > >         GPFS node loses
> > > > > > > > > >         IB-connection
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         Hi,
> > > > > > > > > >         
> > > > > > > > > >         It is the Cisco SM. 
> > > > > > > > > >         
> > > > > > > > > >         SFS-7000P> show version
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > >
> > > ==============================================================
> > > > > > > > > ==================
> > > > > > > > > >                                   System Version
> > > Information
> > > > > > > > > >         
> > > > > > > > >
> > > ==============================================================
> > > > > > > > > ==================
> > > > > > > > > >                   system-version : SFS-7000P TopspinOS
> 
> > > > > > > > 2.9.0 releng
> > > > > > > > > >         #147
> > > > > > > > > >         10/25/2006 02:01:32
> > > > > > > > > >                          contact : tac at cisco.com
> > > > > > > > > >                             name : SFS-7000P
> > > > > > > > > >                         location : 170 West Tasman
> > Drive, 
> > > > > > > > > San Jose, CA
> > > > > > > > > >         95134
> > > > > > > > > >                          up-time :
> 11(d):7(h):49(m):3(s)
> > > > > > > > > >                      last-change : none
> > > > > > > > > >                 last-config-save : none
> > > > > > > > > >                           action : none
> > > > > > > > > >                           result : none
> > > > > > > > > >                        oper-mode : normal
> > > > > > > > > >         
> > > > > > > > > >         There is also a command that gives the SM
> > version,
> > > 
> > > > > > > > > but I can't
> > > > > > > > > >         find it
> > > > > > > > > >         right now. 
> > > > > > > > > >         
> > > > > > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma
> > > wrote:
> > > > > > > > > >         > Hello Koen,
> > > > > > > > > >         > 
> > > > > > > > > >         > From the switch log, it looks a SM issue to
> > me. 
> > > > > > > > > The node was
> > > > > > > > > >         kicked
> > > > > > > > > >         > out of the membership. Which SM you are 
> > > > > > using in your
> > > > > > > > > >         fabric? 
> > > > > > > > > >         > 
> > > > > > > > > >         > Thanks
> > > > > > > > > >         > Shirley Ma
> > > > > > > > > >         > 
> > > > > > > > > >         *** Disclaimer ***
> > > > > > > > > >         
> > > > > > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > > > >         
> > > > > > > > > >         nv van publiek recht
> > > > > > > > > >         BTW BE 0244.142.664
> > > > > > > > > >         RPR Brussel
> > > > > > > > > >         http://www.vrt.be/disclaimer
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         


From Koen.SEGERS at VRT.BE  Tue May 29 08:12:07 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 29 May 2007 17:12:07 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1180451154.12048.15.camel@localhost>
Message-ID: <D63C0BE2D613C543B6F3305502E9784C03157D7A@OCBEXS01001.rto.be>

That is very difficult. This system is supposed to go into production
within a few weeks. Changing the OFED drivers requires rebuilding a lot
of other programs. If it isn't really necessary, I prefer not to do
this...

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
Sent: Tuesday, May 29, 2007 17:05
To: SEGERS Koen
CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

Any chance of moving to RC3 (or waiting for RC4)?

On Tue, 2007-05-29 at 16:56 +0200, SEGERS Koen wrote:
> We don't really see data getting lost. We don't get an error in the log
> files of gpfs. We only got a system that was not able to read its
> filesystem anymore. It was exactly at the time this FIXME error
> occurred.
> 
> Therefore I think there must be some kind of correlation. But I don't
> really know what ... :(
> 
> Koen
> 
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> Sent: Tuesday, May 29, 2007 16:40
> To: SEGERS Koen
> CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> can you describe the scenario in which you see data lost?
> does the "SDP: FIXME MID 11" message correlate with the data loss?
> 
> On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> > I just remembered that, with SDP, these values aren't related anymore.
> > SDP doesn't give this kind of information to the OS.
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
> > Sent: Tuesday, May 29, 2007 14:29
> > To: amip at dev.mellanox.co.il
> > CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > One of the machines has 2 dropped packets:
> > 
> > gpfswhbe2n1:~ # ifconfig ib0
> > ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> >           inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
> >           inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
> >           RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
> >           collisions:0 txqueuelen:128
> >           RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)
> > 
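
A side note on the output above: for address 192.168.2.1 with mask
255.255.255.0 the broadcast address would normally be 192.168.2.255, not
the 192.168.4.255 shown. This may well be unrelated to the dropped
packets, but it is cheap to check. A minimal sketch using only the
Python standard library; the function name expected_broadcast is made up
for illustration:

```python
import ipaddress


def expected_broadcast(addr: str, mask: str) -> str:
    """Return the broadcast address implied by an IPv4 address and netmask."""
    # IPv4Interface accepts "address/netmask" as well as "address/prefixlen".
    iface = ipaddress.IPv4Interface(f"{addr}/{mask}")
    return str(iface.network.broadcast_address)


if __name__ == "__main__":
    configured = "192.168.4.255"  # value taken from the ifconfig output above
    expected = expected_broadcast("192.168.2.1", "255.255.255.0")
    if configured != expected:
        print(f"broadcast mismatch: configured {configured}, expected {expected}")
```
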
> > Can this be related?
> > 
> > Does anyone know how this is possible with SDP? I thought SDP was
> > RC (reliable connection).
> > I'm also curious how gpfs reacts to this. Do you know where I can
> > find the timestamp of these dropped packets?
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> > Sent: Tuesday, May 29, 2007 14:03
> > To: SEGERS Koen
> > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > if this is an actual resize request then there is no problem when it
> > is dropped.
> > since you are running rc1, no resize requests should be sent so this
> > means there is a problem since data could be dropped. do you notice
> > lost data?
> > 
> > On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > > We are running ofed-1.2.RC1 on all machines. Hence it is impossible
> > > that this message was added only a few days ago.
> > > 
> > > How can you be so sure that this doesn't pose any problems?
> > > 
> > > Koen
> > > 
> > > -----Original Message-----
> > > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> > > Sent: Tuesday, May 29, 2007 13:35
> > > To: SEGERS Koen
> > > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > > general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > this means you are getting a message your SDP does not recognize.
> > > message 11 is resize request which was added to sdp a few days ago.
> > > can it be that you are running 2 different versions of OFED?
> > > anyway, this doesn't pose any problem so you can ignore it.
> > > 
> > > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > > Hi,
> > > > 
> > > > Saturday we did a different stresstest.
> > > > This is what we see in the /var/log/messages:
> > > > 
> > > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > > > 
> > > > There were errors from that time on. Can someone explain to me
> > > > what this message means?
> > > > 
> > > > Koen
> > > > 
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > Sent: Wednesday, May 23, 2007 17:41
> > > > To: SEGERS Koen; Hal Rosenstock
> > > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > Try 20 seconds, I'm curious if you are barely crossing the
> > > > 10-second threshold.
> > > > 
> > > > Scott 
> > > > 
> > > > > -----Original Message-----
> > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall); 
> > > > > general-bounces at lists.openfabrics.org;
> > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > What value would you recommend then?
> > > > > 
> > > > > Koen
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > > Sent: Wednesday, May 23, 2007 17:38
> > > > > To: SEGERS Koen; Hal Rosenstock
> > > > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > The boot time of the host doesn't matter for this timeout.
> > > > > While the host is booting, the IB link is down anyway.
> > > > > 
> > > > > Scott 
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > > > Cc: Clive Hall (clivhall); 
> > > > > > general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > 
> > > > > > After a whole day of stresstesting with the MAD renicing turned
> > > > > > on, we got the error once. So I think I should raise the timeout
> > > > > > on the switch also.
> > > > > > 
> > > > > > It takes about 2 minutes to boot the system. Do you agree that
> > > > > > this is a good value for the timeout?
> > > > > > 
> > > > > > Scott,
> > > > > > Can you explain the memlock problem to me?
> > > > > > 
> > > > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we
> > > > > > didn't install this, the bug is not related to us. This is
> > > > > > correct, isn't it?
> > > > > > 
> > > > > > Greetz
> > > > > > 
> > > > > > Koen
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] 
> > > > > > Sent: Wednesday, May 23, 2007 16:12
> > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > CC: SEGERS Koen; Clive Hall (clivhall);
> > > > > > general-bounces at lists.openfabrics.org;
> > > > > > general at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > 
> > > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > No C code changes, just a few config file changes
> > > > > > > (RENICE_IB_MAD=yes in openib.conf,
> > > > > > 
> > > > > > Does the host really not respond to MAD requests for over 10
> > > > > > seconds in some cases ?
> > > > > > 
> > > > > > -- Hal
> > > > > > 
> > > > > > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > > > SLES10 for bug 267, etc.).
> > > > > > > 
> > > > > > > Scott Weitzenkamp
> > > > > > > SQA and Release Manager
> > > > > > > Server Virtualization Business Unit
> > > > > > > Cisco Systems
> > > > > > >  
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > > > > > general at lists.openfabrics.org; 
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > 
> > > > > > > > Thus far, all tests seem to work.
> > > > > > > > 
> > > > > > > > Thanks for the help!
> > > > > > > > 
> > > > > > > > Scott,
> > > > > > > > Are there more bugfixes that cisco does in its rpms?
> > > > > > > > 
> > > > > > > > Greetz
> > > > > > > > 
> > > > > > > > Koen
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> > > > > > > > Sent: Wednesday, May 23, 2007 0:39
> > > > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > 
> > > > > > > > It's not so much pinging every 10 seconds as expecting a
> > > > > > > > response within 10 seconds (Clive, correct me if I'm wrong).
> > > > > > > > 
> > > > > > > > You only need to do 1) or 2), not both.  Cisco configures 1)
> > > > > > > > in the OFED binary RPMs we release at
> > > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I
> > > > > > > > prefer to have the host be more responsive.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Scott Weitzenkamp
> > > > > > > > SQA and Release Manager
> > > > > > > > Server Virtualization Business Unit
> > > > > > > > Cisco Systems
> > > > > > > >  
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > > > Cc: Shirley Ma; Ami Perlmutter; 
> > > > > > > > > general at lists.openfabrics.org;
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > > Subject: RE: [ofa-general] GPFS node loses
IB-connection
> > > > > > > > > 
> > > > > > > > > If I understand it right, the switch is actually polling
> > > > > > > > > (=pinging) the interfaces every 10s. This means that when the
> > > > > > > > > interface is handling other traffic, the poll can fail and the
> > > > > > > > > port could be considered out of service. My question is then:
> > > > > > > > > "How can the timeout be reached while packets are being
> > > > > > > > > sent/received to/from the interface?"
> > > > > > > > > 
> > > > > > > > > Anyway, what timeout-value would you recommend for us? And why?
> > > > > > > > > 
> > > > > > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > > > > > 1) change the MAD niceness of the servers
> > > > > > > > > 2) change the timeout on the switches
> > > > > > > > > 
> > > > > > > > > Are these changes sufficient for the HCA's to keep their ports
> > > > > > > > > in PORT_ACTIVE state?
> > > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > 
> > > > > > > > > Koen
> > > > > > > > > 
> > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > > > > > >  
> > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > > > > > > > 
> > > > > > > > > > The default is 10 seconds, it can be configured up to 2000 seconds.
> > > > > > > > > > If a HCA is completely unresponsive for longer than the
> > > > > > > > > > node-timeout value, then we consider that HCA out of service.
> > > > > > > > > >  
> > > > > > > > > > Scott Weitzenkamp
> > > > > > > > > > SQA and Release Manager
> > > > > > > > > > Server Virtualization Business Unit
> > > > > > > > > > Cisco Systems
> > > > > > > > > >  
> > > > > > > > > > 
> > > > > > > > > >         
> > > > > > > > > >         ______________________________________________________________
> > > > > > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > > > > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > > > >         To: koen.segers at VRT.BE
> > > > > > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > >         general-bounces at lists.openfabrics.org;
> > > > > > > > > >         Scott Weitzenkamp (sweitzen)
> > > > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         Koen,
> > > > > > > > > >         
> > > > > > > > > >         So it is most likely you hit the same bug as 229 (Scott
> > > > > > > > > >         pointed out earlier). The same workaround might work for you
> > > > > > > > > >         by renicing ib_mad as Scott suggested.
> > > > > > > > > >         
> > > > > > > > > >         I think this should be a SM query timeout tunable value in
> > > > > > > > > >         Cisco SM. Am I right, Scott?
> > > > > > > > > >         
> > > > > > > > > >         Thanks
> > > > > > > > > >         Shirley Ma
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         Koen Segers <koen.segers at VRT.BE>
> > > > > > > > > >         05/22/07 11:14 AM
> > > > > > > > > >         Please respond to koen.segers at VRT.BE
> > > > > > > > > >         
> > > > > > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > > >         cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > > > > > >             general at lists.openfabrics.org,
> > > > > > > > > >             general-bounces at lists.openfabrics.org
> > > > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >         
> > > > > > > > > >         Hi,
> > > > > > > > > >         
> > > > > > > > > >         It is the Cisco SM. 
> > > > > > > > > >         
> > > > > > > > > >         SFS-7000P> show version
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         ================================================================================
> > > > > > > > > >                                   System Version Information
> > > > > > > > > >         ================================================================================
> > > > > > > > > >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > > > >                          contact : tac at cisco.com
> > > > > > > > > >                             name : SFS-7000P
> > > > > > > > > >                         location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > > > > > > >                      last-change : none
> > > > > > > > > >                 last-config-save : none
> > > > > > > > > >                           action : none
> > > > > > > > > >                           result : none
> > > > > > > > > >                        oper-mode : normal
> > > > > > > > > >         
> > > > > > > > > >         There is also a command that gives the SM version, but I
> > > > > > > > > >         can't find it right now. 
> > > > > > > > > >         
> > > > > > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > > >         > Hello Koen,
> > > > > > > > > >         > 
> > > > > > > > > >         > From the switch log, it looks like an SM issue to me. The
> > > > > > > > > >         > node was kicked out of the membership. Which SM are you
> > > > > > > > > >         > using in your fabric? 
> > > > > > > > > >         > 
> > > > > > > > > >         > Thanks
> > > > > > > > > >         > Shirley Ma
> > > > > > > > > >         > 
> > > > > > > > > >         *** Disclaimer ***
> > > > > > > > > >         
> > > > > > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > > > >         
> > > > > > > > > >         nv van publiek recht
> > > > > > > > > >         BTW BE 0244.142.664
> > > > > > > > > >         RPR Brussel
> > > > > > > > > >         http://www.vrt.be/disclaimer
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > _______________________________________________
> > > > > > > general mailing list
> > > > > > > general at lists.openfabrics.org
> > > > > > >
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > > > > 
> > > > > > > To unsubscribe, please visit
> > > > > > http://openib.org/mailman/listinfo/openib-general

From sweitzen at cisco.com  Tue May 29 08:30:16 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 29 May 2007 08:30:16 -0700
Subject: [ofa-general] OFED-1.2-20070529-0600 won't build due to srptools
	changes
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303951244@xmb-sjc-216.amer.cisco.com>

I have reopened https://bugs.openfabrics.org/show_bug.cgi?id=533, Ishai
please fix ASAP. This bug is now a P1 blocker.
 
RPM build errors:
    user vlad does not exist - using root
    group vlad does not exist - using root
    user vlad does not exist - using root
    group vlad does not exist - using root
    File not found: /var/tmp/OFED/usr/sbin/execute_multipath_or_kpartx.sh
    File not found: /var/tmp/OFED/usr/sbin/srp_dm_multipath_daemon
    File not found: /var/tmp/OFED/usr/sbin/srp_post_multipath
ERROR: Failed executing "rpmbuild --rebuild
 --define '_topdir /var/tmp/OFEDRPM'
 --define '_prefix /usr'
 --define 'build_root /var/tmp/OFED'
 --define 'configure_options --with-dapl --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-sdpnetstat --with-srptools --with-mstflint --with-perftest --with-tvflash --sysconfdir=/etc --mandir=/usr/share/man'
 --define 'configure_options32 --with-dapl --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-sdpnetstat --with-srptools --sysconfdir=/etc --mandir=/usr/share/man'
 --define 'build_32bit 1'
 --define '_mandir /usr/share/man'
 /tmp/OFED-1.2-20070529-0600/SRPMS/ofa_user-1.2-rc2.src.rpm"
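
For anyone triaging a build log like this, the missing paths can be
pulled out mechanically. A minimal sketch, assuming the log contains
"File not found:" lines as in the errors above and that a path may have
been wrapped onto the following line; the function name missing_files is
made up for illustration and is not part of the OFED build scripts:

```python
import re


def missing_files(log: str) -> list:
    """Return the paths that rpmbuild reported as 'File not found'."""
    # Re-join a path that was wrapped onto the line after the message.
    joined = re.sub(r"File not found:\s*\n\s*", "File not found: ", log)
    return re.findall(r"File not found: (\S+)", joined)


if __name__ == "__main__":
    sample = (
        "    File not found:\n"
        "/var/tmp/OFED/usr/sbin/execute_multipath_or_kpartx.sh\n"
        "    File not found: /var/tmp/OFED/usr/sbin/srp_dm_multipath_daemon\n"
    )
    for path in missing_files(sample):
        print(path)
```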

 

From sweitzen at cisco.com  Tue May 29 08:44:37 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 29 May 2007 08:44:37 -0700
Subject: [ofa-general] Re: ipoib / bonding and OFED
In-Reply-To: <465BDC90.5080305@voltaire.com>
References: <3857BB049D83424D9DB82753D37CEA55459C41@taurus.voltaire.com><4657373E.2030903@hp.com>
	<465BDC90.5080305@voltaire.com>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303951259@xmb-sjc-216.amer.cisco.com>

Bob, it is now possible to configure IPoIB bonding in
/etc/infiniband/openib.conf. This configuration file includes the
following boilerplate:

# Enable the bonding driver on startup
IPOIBBOND_ENABLE=no
# Set bond interface names
#IPOIB_BONDS=bond0,bond1
# Set specific bond params; address and slaves
#bond0_IP=10.10.10.1
#bond0_SLAVES=ib0,ib1
#bond1_IP=20.10.10.1
#bond1_SLAVES=ib2,ib3,ib4
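
To sanity-check an edited copy of that section before restarting the
service, something like the following could be used. This is only a
sketch: it assumes, going by the comments in the boilerplate, that
bonding is switched on with IPOIBBOND_ENABLE=yes and that the bond lines
are uncommented. The function parse_bonds is hypothetical and not part
of OFED:

```python
def parse_bonds(text: str) -> dict:
    """Map each configured bond name to its IP and slave interfaces."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()

    if settings.get("IPOIBBOND_ENABLE") != "yes":
        return {}

    bonds = {}
    for name in settings.get("IPOIB_BONDS", "").split(","):
        name = name.strip()
        if not name:
            continue
        slaves = [s.strip() for s in settings.get(f"{name}_SLAVES", "").split(",")]
        bonds[name] = {
            "ip": settings.get(f"{name}_IP"),
            "slaves": [s for s in slaves if s],
        }
    return bonds


if __name__ == "__main__":
    with open("/etc/infiniband/openib.conf") as conf:
        for bond, info in parse_bonds(conf.read()).items():
            print(bond, info["ip"], ",".join(info["slaves"]))
```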

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Or Gerlitz
> Sent: Tuesday, May 29, 2007 12:56 AM
> To: Bob Kossey
> Cc: OpenFabrics General
> Subject: [ofa-general] Re: ipoib / bonding and OFED
> 
> Bob Kossey wrote:
> > I copied Or since I think this is related to his OFED HA work, and
> > he might have some insights.  A few more questions for Or:
> > I was trying to use ipoib bonding with OFED 1.2 rc2 and a 2.6.9
> > kernel, but was not able to get it to work so far.  I saw your Sonoma
> > bonding slides, and you mention kernel bonding driver changes were
> > needed.
> > 2. Is there a minimum kernel version, with the kernel bonding driver
> > changes, that is required to use bonding with OFED ipoib?
> 
> Just to have a baseline here: to get bonding to work with IPoIB, you
> should use the bonding driver provided with OFED 1.2. This driver is
> the upstream one (of 2.6.20), patched to support IPoIB and backported
> to RH5, SLES10 and RH4 U3/4/5; other kernels are not supported.
> 
> If you were using the ofed bonding on a system that matches the
> support matrix, it should work. If you do have problems under this
> config, please either open a bug at the ofed bugzilla
> @ bugs.openfabrics.org assigned to monis at voltaire.com (Moni Shoua) or
> first send a report/question to Moni and CC ewg at lists.openfabrics.org
> 
> Please note that between RC2 and RC4 (to be released today etc) some
> bugs were fixed, you can search in the bugzilla to see what.
> 
> > 3. The bonding driver uses the HWADDR from the underlying ipoib
> > devices, how does it obtain the HWADDR?  Does it use the full 20
> > bytes, or some subset?
> 
> when enslaving IPoIB devices, the bonding driver uses the full hw
> address of the active slave; it simply looks at the dev_addr field of
> the slave's struct net_device (see include/linux/netdevice.h)
> 
> > 4. What use_carrier options for link status detection does OFED
> > ipoib support, MII, ETHTOOL or netif_carrier_ok?
> 
> the mii/ethtool etc local link detection methods of the bonding driver
> are somewhat deprecated, since nowadays almost any network device
> supports the netif_carrier_ok call. The --default-- of the upstream
> bonding driver (e.g. the one we use in OFED and the 2.6.21 listed
> below) is to set the use_carrier mod param to 1, that is, mii is not
> used anymore.
> 
> > author:         Thomas Davis, tadavis at lbl.gov and many others
> > description:    Ethernet Channel Bonding Driver, v3.1.2
> > version:        3.1.2
> > parm:           use_carrier:Use netif_carrier_ok (vs MII ioctls) in miimon; 0 for off, 1 for on (default) (int)
> > parm:           miimon:Link check interval in milliseconds (int)
> 
> > If you have any good examples of bonding configuration settings that
> > work with OFED, I'd appreciate that also.
> 
> The bonding RPM provided with OFED is made of a driver, script and some
> help text containing usage examples, please take a look there and let
> me know if you have further questions.
> 
> > $ rpm -ql ib-bonding-0.9.0-2.6.9_42.ELsmp
> > /lib/modules/2.6.9-42.ELsmp/updates/kernel/drivers/net/bonding/bonding.ko
> > /usr/bin/ib-bond
> > /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt
> 
> The ofed service (/etc/init.d/openibd) was enhanced to allow for
> --persistent-- bonding configuration, please see the bonding section at
> docs/ipoib_release_notes.txt to see how to do it.
> 
> Or.
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit 
> http://openib.org/mailman/listinfo/openib-general
> 


From mst at dev.mellanox.co.il  Tue May 29 08:44:58 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 18:44:58 +0300
Subject: [ofa-general] Re: GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D7A@OCBEXS01001.rto.be>
References: <1180451154.12048.15.camel@localhost>
	<D63C0BE2D613C543B6F3305502E9784C03157D7A@OCBEXS01001.rto.be>
Message-ID: <20070529154458.GA7101@mellanox.co.il>

> Changing the OFED drivers requires rebuilding a lot of other programs.

It does? Why does it?

Quoting SEGERS Koen <Koen.SEGERS at VRT.BE>:
Subject: RE: GPFS node loses IB-connection

That is very difficult. This system is supposed to go in production
within a few weeks. Changing the OFED drivers requires rebuilding a lot
of other programs. If it isn't really necessary, I prefer not to do
this...

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
Sent: Tuesday, May 29, 2007 17:05
To: SEGERS Koen
CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

any chance of moving to rc3 (or wait till rc4)?

On Tue, 2007-05-29 at 16:56 +0200, SEGERS Koen wrote:
> We don't really see data getting lost. We don't get an error in the
> log files of gpfs. We only got a system that was not able to read its
> filesystem anymore. It was exactly at the time this FIXME error
> occurred.
> 
> Therefore I think there must be some kind of correlation. But I don't
> really know what ... :(
> 
> Koen
> 
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> Sent: Tuesday, May 29, 2007 16:40
> To: SEGERS Koen
> CC: general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> can you describe the scenario in which you see data lost?
> does the "SDP: FIXME MID 11" message correlate with the data loss?
> 
> On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> > I just remembered that, with SDP, these values aren't related
> > anymore. SDP doesn't give this kind of information to the OS.
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
> > Sent: Tuesday, May 29, 2007 14:29
> > To: amip at dev.mellanox.co.il
> > CC: general-bounces at lists.openfabrics.org;
> > general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > One of the machines has 2 dropped packets:
> > 
> > gpfswhbe2n1:~ # ifconfig ib0
> > ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> >           inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
> >           inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
> >           RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
> >           collisions:0 txqueuelen:128
> >           RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)
> > 
> > Can this be related?
> > 
> > Does anyone know how this is possible with SDP? I thought SDP was
> > RC (reliable connection).
> > I'm also curious how gpfs reacts to this. Do you know where I can
> > find the timestamp of these dropped packets?
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> > Sent: Tuesday, May 29, 2007 14:03
> > To: SEGERS Koen
> > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > if this is an actual resize request then there is no problem when it
> > is dropped.
> > since you are running rc1, no resize requests should be sent so this
> > means there is a problem since data could be dropped. do you notice
> > lost data?
> > 
> > On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > > We are running ofed-1.2.RC1 on all machines. Hence it is impossible
> > > that this message was added only a few days ago.
> > > 
> > > How can you be so sure that this doesn't pose any problems?
> > > 
> > > Koen
> > > 
> > > -----Original Message-----
> > > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] 
> > > Sent: Tuesday, May 29, 2007 13:35
> > > To: SEGERS Koen
> > > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > > general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > this means you are getting a message your SDP does not recognize.
> > > message 11 is resize request which was added to sdp a few days ago.
> > > can it be that you are running 2 different versions of OFED?
> > > anyway, this doesn't pose any problem so you can ignore it.
> > > 
> > > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > > Hi,
> > > > 
> > > > Saturday we did a different stresstest.
> > > > This is what we see in the /var/log/messages:
> > > > 
> > > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > > > 
> > > > There were errors from that time on. Can someone explain to me what this
> > > > message means?
> > > > 
> > > > Koen
> > > > 
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > Sent: Wednesday, May 23, 2007 17:41
> > > > To: SEGERS Koen; Hal Rosenstock
> > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > Try 20 seconds, I'm curious if you are barely crossing the 10-second
> > > > threshold.
> > > > 
> > > > Scott 
> > > > 
> > > > > -----Original Message-----
> > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > What value would you recommend then?
> > > > > 
> > > > > Koen
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > Sent: Wednesday, May 23, 2007 17:38
> > > > > To: SEGERS Koen; Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > The boot time of the host doesn't matter for this timeout. While the
> > > > > host is booting, the IB link is down anyway.
> > > > > 
> > > > > Scott 
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > > general at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > 
> > > > > > After a whole day of stresstesting with the MAD renicing turned on, we
> > > > > > got the error once. So I think I should raise the timeout on the switch
> > > > > > also.
> > > > > > 
> > > > > > It takes about 2 minutes to boot the system. Do you agree that this is a
> > > > > > good value for the timeout?
> > > > > > 
> > > > > > Scott,
> > > > > > Can you explain the memlock problem to me?
> > > > > > 
> > > > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > > > > > install this, the bug is not related to us. This is correct, isn't it?
> > > > > > 
> > > > > > Greetz
> > > > > > 
> > > > > > Koen
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > > > > > Sent: Wednesday, May 23, 2007 16:12
> > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > Cc: SEGERS Koen; Clive Hall (clivhall);
> > > > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > 
> > > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > > > > > > openib.conf,
> > > > > > 
> > > > > > Does the host really not respond to MAD requests for over 10 seconds in
> > > > > > some cases?
> > > > > > 
> > > > > > -- Hal
> > > > > > 
> > > > > > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > > > SLES10 for bug 267, etc.).
> > > > > > > 
> > > > > > > Scott Weitzenkamp
> > > > > > > SQA and Release Manager
> > > > > > > Server Virtualization Business Unit
> > > > > > > Cisco Systems
> > > > > > >  
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > 
> > > > > > > > Thus far, all tests seem to work.
> > > > > > > > 
> > > > > > > > Thanks for the help!
> > > > > > > > 
> > > > > > > > Scott,
> > > > > > > > Are there more bugfixes that cisco does in its rpms?
> > > > > > > > 
> > > > > > > > Greetz
> > > > > > > > 
> > > > > > > > Koen
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > > > > Sent: Wednesday, May 23, 2007 0:39
> > > > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > 
> > > > > > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > > > > > 10 seconds (Clive, correct me if I'm wrong).
> > > > > > > > 
> > > > > > > > You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
> > > > > > > > binary RPMs we release at
> > > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
> > > > > > > > the host be more responsive.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Scott Weitzenkamp
> > > > > > > > SQA and Release Manager
> > > > > > > > Server Virtualization Business Unit
> > > > > > > > Cisco Systems
> > > > > > > >  
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > 
> > > > > > > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > > > > > > interfaces every 10s. This means that when the interface is handling
> > > > > > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > > > > > service. My question is then: "How can the timeout be reached while
> > > > > > > > > packets are being sent/received to/from the interface?"
> > > > > > > > > 
> > > > > > > > > Anyway, what timeout-value would you recommend for us? And why?
> > > > > > > > > 
> > > > > > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > > > > > 1) change the MAD niceness of the servers
> > > > > > > > > 2) change the timeout on the switches
> > > > > > > > > 
> > > > > > > > > Are these changes sufficient for the HCA's to keep their ports in
> > > > > > > > > PORT_ACTIVE state?
> > > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > 
> > > > > > > > > Koen
> > > > > > > > > 
> > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > > > > > >  
> > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > > > > > > > 
> > > > > > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > > > > > > If an HCA is completely unresponsive for longer than the node-timeout
> > > > > > > > > > value, then we consider that HCA out of service.
> > > > > > > > > >  
> > > > > > > > > > Scott Weitzenkamp
> > > > > > > > > > SQA and Release Manager
> > > > > > > > > > Server Virtualization Business Unit
> > > > > > > > > > Cisco Systems
> > > > > > > > > >  
> > > > > > > > > > 
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > > ______________________________________________________________
> > > > > > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com]
> > > > > > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > > > >         To: koen.segers at VRT.BE
> > > > > > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp (sweitzen)
> > > > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         Koen,
> > > > > > > > > >         
> > > > > > > > > >         So it is most likely you hit the same bug as 229 (Scott
> > > > > > > > > >         pointed out earlier). The same workaround might work for you
> > > > > > > > > >         by renicing ib_mad as Scott suggested.
> > > > > > > > > >         
> > > > > > > > > >         I think this should be a SM query timeout tunable value in
> > > > > > > > > >         Cisco SM. Am I right, Scott?
> > > > > > > > > >         
> > > > > > > > > >         Thanks
> > > > > > > > > >         Shirley Ma
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > > > > > > > >         Sent: 05/22/07 11:14 AM
> > > > > > > > > >         Please respond to: koen.segers at VRT.BE
> > > > > > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > > >         Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > > > > > >         general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
> > > > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         Hi,
> > > > > > > > > >         
> > > > > > > > > >         It is the Cisco SM. 
> > > > > > > > > >         
> > > > > > > > > >         SFS-7000P> show version
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         ================================================================================
> > > > > > > > > >                                    System Version Information
> > > > > > > > > >         ================================================================================
> > > > > > > > > >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > > > >                          contact : tac at cisco.com
> > > > > > > > > >                             name : SFS-7000P
> > > > > > > > > >                         location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > > > > > > >                      last-change : none
> > > > > > > > > >                 last-config-save : none
> > > > > > > > > >                           action : none
> > > > > > > > > >                           result : none
> > > > > > > > > >                        oper-mode : normal
> > > > > > > > > >         
> > > > > > > > > >         There is also a command that gives the SM version, but I can't
> > > > > > > > > >         find it right now.
> > > > > > > > > >         
> > > > > > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > > >         > Hello Koen,
> > > > > > > > > >         > 
> > > > > > > > > >         > From the switch log, it looks like an SM issue to me. The node was
> > > > > > > > > >         > kicked out of the membership. Which SM are you using in your
> > > > > > > > > >         > fabric? 
> > > > > > > > > >         > 
> > > > > > > > > >         > Thanks
> > > > > > > > > >         > Shirley Ma
> > > > > > > > > >         > 
> > > > > > > > > >         *** Disclaimer ***
> > > > > > > > > >         
> > > > > > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > > > >         
> > > > > > > > > >         nv van publiek recht
> > > > > > > > > >         BTW BE 0244.142.664
> > > > > > > > > >         RPR Brussel
> > > > > > > > > >         http://www.vrt.be/disclaimer
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > _______________________________________________
> > > > > > > general mailing list
> > > > > > > general at lists.openfabrics.org
> > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > > > > 
> > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> > > > > > 

-- 
MST


From mst at dev.mellanox.co.il  Tue May 29 08:47:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 18:47:00 +0300
Subject: [ofa-general] Re: GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D7A@OCBEXS01001.rto.be>
References: <1180451154.12048.15.camel@localhost>
	<D63C0BE2D613C543B6F3305502E9784C03157D7A@OCBEXS01001.rto.be>
Message-ID: <20070529154700.GA8321@mellanox.co.il>

What Ami is asking you to do is to try to reproduce the problem with -RC3, or
with -RC4 when it's out.  If the problem goes away, we'll know it was one of the
bugs fixed since -RC1; if not, it'll be easier to debug on a recent RC.

Quoting SEGERS Koen <Koen.SEGERS at VRT.BE>:
Subject: RE: GPFS node loses IB-connection

That is very difficult. This system is supposed to go in production
within a few weeks. Changing the OFED drivers requires rebuilding a lot
of other programs. If it isn't really necessary, I prefer not to do
this...

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
Sent: Tuesday, May 29, 2007 17:05
To: SEGERS Koen
Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

any chance of moving to rc3 (or wait till rc4)?

On Tue, 2007-05-29 at 16:56 +0200, SEGERS Koen wrote:
> We don't really see data getting lost. We don't get an error in the log
> files of gpfs. We only got a system that was not able to read its
> filesystem anymore. It was exactly at the time this FIXME error
> occurred.
> 
> Therefore I think there must be some kind of correlation. But I don't
> really know what ... :(
> 
> Koen
> 
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> Sent: Tuesday, May 29, 2007 16:40
> To: SEGERS Koen
> Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> can you describe the scenario in which you see data lost?
> does the "SDP: FIXME MID 11" message correlate with the data loss?
> 
> On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> > I just remembered that, with SDP, these values aren't related anymore.
> > SDP doesn't give this kind of information to the OS.
> > 
> > Koen
> > 
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
> > Sent: Tuesday, May 29, 2007 14:29
> > To: amip at dev.mellanox.co.il
> > Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > One of the machines has 2 dropped packets:
> > 
> > gpfswhbe2n1:~ # ifconfig ib0
> > ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> >           inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
> >           inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
> >           RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
> >           collisions:0 txqueuelen:128
> >           RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)
> > 
> > Can this be related?
> > 
> > Does anyone know how this is possible with SDP? I thought SDP ran over RC.
> > I'm also curious how gpfs reacts to this. Do you know where I can find
> > the timestamp of these dropped packets?
> > 
> > Koen
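
On the timestamp question: the counters that ifconfig prints are cumulative
and carry no per-packet timestamps. One possible approach is to poll the
counter and note the time whenever it grows. This is only a minimal sketch and
was not proposed by anyone in the thread; it assumes a Linux host that exposes
the counter at /sys/class/net/ib0/statistics/tx_dropped, and the function
names are made up for illustration.

```python
import time
from pathlib import Path


def read_counter(path):
    """Return the integer stored in a sysfs-style counter file."""
    return int(Path(path).read_text().strip())


def watch_drops(path, interval=1.0, max_polls=None,
                sleep=time.sleep, clock=time.time):
    """Yield (timestamp, old, new) each time the counter at `path` grows.

    `sleep` and `clock` are injectable so the loop can be exercised
    without waiting; `max_polls=None` means poll forever.
    """
    last = read_counter(path)
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(interval)
        polls += 1
        current = read_counter(path)
        if current > last:
            yield (clock(), last, current)
        last = current
```

Iterating over watch_drops("/sys/class/net/ib0/statistics/tx_dropped") and
printing each tuple gives drop times accurate to the polling interval, which
can then be compared with the GPFS and kernel logs.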
> > 
> > -----Original Message-----
> > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> > Sent: Tuesday, May 29, 2007 14:03
> > To: SEGERS Koen
> > Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > if this is an actual resize request then there is no problem when it is
> > dropped.
> > since you are running rc1, no resize requests should be sent, so this
> > means there is a problem, since data could be dropped. do you notice lost
> > data?
> > 
> > On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > > We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> > > this message was added only a few days ago.
> > > 
> > > How can you be so sure that this doesn't pose any problems?
> > > 
> > > Koen
> > > 
> > > -----Original Message-----
> > > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> > > Sent: Tuesday, May 29, 2007 13:35
> > > To: SEGERS Koen
> > > Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > 
> > > this means you are getting a message your SDP does not recognize.
> > > message 11 is a resize request, which was added to sdp a few days ago.
> > > can it be that you are running 2 different versions of OFED?
> > > anyway, this doesn't pose any problem so you can ignore it.
> > > 
> > > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > > Hi,
> > > > 
> > > > Saturday we did a different stresstest.
> > > > This is what we see in the /var/log/messages:
> > > > 
> > > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > > > 
> > > > There were errors from that time on. Can someone explain to me what this
> > > > message means?
> > > > 
> > > > Koen
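
To check whether the FIXME message lines up in time with the GPFS failure, the
matching syslog entries can be pulled out together with their timestamps. A
minimal sketch, assuming the line format of the log excerpt quoted above; the
function name and the regular expression are illustrative and are not part of
SDP or OFED.

```python
import re

# Matches lines such as:
#   May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
SDP_FIXME = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+SDP: FIXME MID (?P<mid>\d+)"
)


def find_sdp_fixme(lines):
    """Return (timestamp, host, mid) for every SDP FIXME line in `lines`."""
    hits = []
    for line in lines:
        match = SDP_FIXME.match(line)
        if match:
            hits.append((match.group("stamp"),
                         match.group("host"),
                         int(match.group("mid"))))
    return hits
```

Passing an open handle on /var/log/messages to find_sdp_fixme() lists every
occurrence, so the times can be set against the moment the filesystem became
unreadable.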
> > > > 
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > Sent: Wednesday, May 23, 2007 17:41
> > > > To: SEGERS Koen; Hal Rosenstock
> > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > 
> > > > Try 20 seconds, I'm curious if you are barely crossing the 10-second
> > > > threshold.
> > > > 
> > > > Scott 
> > > > 
> > > > > -----Original Message-----
> > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > What value would you recommend then?
> > > > > 
> > > > > Koen
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > Sent: Wednesday, May 23, 2007 17:38
> > > > > To: SEGERS Koen; Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > 
> > > > > The boot time of the host doesn't matter for this timeout. While the
> > > > > host is booting, the IB link is down anyway.
> > > > > 
> > > > > Scott 
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > > general at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > 
> > > > > > After a whole day of stresstesting with the MAD renicing turned on, we
> > > > > > got the error once. So I think I should raise the timeout on the switch
> > > > > > also.
> > > > > > 
> > > > > > It takes about 2 minutes to boot the system. Do you agree that this is a
> > > > > > good value for the timeout?
> > > > > > 
> > > > > > Scott,
> > > > > > Can you explain the memlock problem to me?
> > > > > > 
> > > > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > > > > > install this, the bug is not related to us. This is correct, isn't it?
> > > > > > 
> > > > > > Greetz
> > > > > > 
> > > > > > Koen
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > > > > > Sent: Wednesday, May 23, 2007 16:12
> > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > Cc: SEGERS Koen; Clive Hall (clivhall);
> > > > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > 
> > > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > > > > > > openib.conf,
> > > > > > 
> > > > > > Does the host really not respond to MAD requests for over 10 seconds in
> > > > > > some cases?
> > > > > > 
> > > > > > -- Hal
> > > > > > 
> > > > > > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > > > SLES10 for bug 267, etc.).
> > > > > > > 
> > > > > > > Scott Weitzenkamp
> > > > > > > SQA and Release Manager
> > > > > > > Server Virtualization Business Unit
> > > > > > > Cisco Systems
> > > > > > >  
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> > > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > 
> > > > > > > > Thus far, all tests seem to work.
> > > > > > > > 
> > > > > > > > Thanks for the help!
> > > > > > > > 
> > > > > > > > Scott,
> > > > > > > > Are there more bugfixes that cisco does in its rpms?
> > > > > > > > 
> > > > > > > > Greetz
> > > > > > > > 
> > > > > > > > Koen
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > > > > Sent: Wednesday, May 23, 2007 0:39
> > > > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > 
> > > > > > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > > > > > 10 seconds (Clive, correct me if I'm wrong).
> > > > > > > > 
> > > > > > > > You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
> > > > > > > > binary RPMs we release at
> > > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
> > > > > > > > the host be more responsive.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Scott Weitzenkamp
> > > > > > > > SQA and Release Manager
> > > > > > > > Server Virtualization Business Unit
> > > > > > > > Cisco Systems
> > > > > > > >  
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > > > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > 
> > > > > > > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > > > > > > interfaces every 10s. This means that when the interface is handling
> > > > > > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > > > > > service. My question is then: "How can the timeout be reached while
> > > > > > > > > packets are being sent/received to/from the interface?"
> > > > > > > > > 
> > > > > > > > > Anyway, what timeout-value would you recommend for us? And why?
> > > > > > > > > 
> > > > > > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > > > > > 1) change the MAD niceness of the servers
> > > > > > > > > 2) change the timeout on the switches
> > > > > > > > > 
> > > > > > > > > Are these changes sufficient for the HCA's to keep their ports in
> > > > > > > > > PORT_ACTIVE state?
> > > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > 
> > > > > > > > > Koen
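
For action 1), one way to confirm that the renice actually took effect is to
read the nice value of the MAD kernel threads from /proc. A minimal sketch,
assuming a Linux host where those threads appear with a name starting with
"ib_mad" (the name used in this thread; the exact thread names depend on the
kernel and are an assumption here). The helper names are hypothetical.

```python
import os


def parse_stat(stat_text):
    """Return (comm, nice) from the contents of /proc/<pid>/stat."""
    # comm sits between the first "(" and the last ")" and may contain spaces.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    comm = stat_text[open_paren + 1:close_paren]
    rest = stat_text[close_paren + 1:].split()
    # rest[0] is field 3 (state), so field 19 (nice) is rest[16].
    return comm, int(rest[16])


def mad_thread_niceness(prefix="ib_mad", proc="/proc"):
    """Map thread name -> nice value for tasks whose name starts with prefix."""
    result = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as handle:
                comm, nice = parse_stat(handle.read())
        except OSError:
            continue  # task exited while we were scanning
        if comm.startswith(prefix):
            result[comm] = nice
    return result
```

An empty result means no matching thread was found; a nice value of 0 means
the thread still runs at default priority.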
> > > > > > > > > 
> > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > > > > > >  
> > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > > > > > > > 
> > > > > > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > > > > > > If an HCA is completely unresponsive for longer than the node-timeout
> > > > > > > > > > value, then we consider that HCA out of service.
> > > > > > > > > >  
> > > > > > > > > > Scott Weitzenkamp
> > > > > > > > > > SQA and Release Manager
> > > > > > > > > > Server Virtualization Business Unit
> > > > > > > > > > Cisco Systems
> > > > > > > > > >  
> > > > > > > > > > 
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > > ______________________________________________________________
> > > > > > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > > > > > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > > > >         To: koen.segers at VRT.BE
> > > > > > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp
> > > > > > > > > >         (sweitzen)
> > > > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         Koen,
> > > > > > > > > >         
> > > > > > > > > >         So it is most likely you hit the same bug as 229 (Scott
> > > > > > > > > >         pointed out earlier). The same workaround might work for you
> > > > > > > > > >         by renicing ib_mad as Scott suggested.
> > > > > > > > > >         
> > > > > > > > > >         I think this should be a SM query timeout tunable value in
> > > > > > > > > >         Cisco SM. Am I right, Scott?
> > > > > > > > > >         
> > > > > > > > > >         Thanks
> > > > > > > > > >         Shirley Ma
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > > > > > > > >         Date: 05/22/07 11:14 AM
> > > > > > > > > >         Please respond to: koen.segers at VRT.BE
> > > > > > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > > >         cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > > > > > >             general at lists.openfabrics.org,
> > > > > > > > > >             general-bounces at lists.openfabrics.org
> > > > > > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         Hi,
> > > > > > > > > >         
> > > > > > > > > >         It is the Cisco SM. 
> > > > > > > > > >         
> > > > > > > > > >         SFS-7000P> show version
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > >         
> > > > > > > > > > ================================================================================
> > > > > > > > > >                                   System Version Information
> > > > > > > > > > ================================================================================
> > > > > > > > > >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > > > >                          contact : tac at cisco.com
> > > > > > > > > >                             name : SFS-7000P
> > > > > > > > > >                         location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > > > > > > >                      last-change : none
> > > > > > > > > >                 last-config-save : none
> > > > > > > > > >                           action : none
> > > > > > > > > >                           result : none
> > > > > > > > > >                        oper-mode : normal
> > > > > > > > > >         
> > > > > > > > > >         There is also a command that gives the SM version, but I can't
> > > > > > > > > >         find it right now.
> > > > > > > > > >         
> > > > > > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > > >         > Hello Koen,
> > > > > > > > > >         > 
> > > > > > > > > >         > From the switch log, it looks like an SM issue to me. The node
> > > > > > > > > >         > was kicked out of the membership. Which SM are you using in
> > > > > > > > > >         > your fabric?
> > > > > > > > > >         > 
> > > > > > > > > >         > Thanks
> > > > > > > > > >         > Shirley Ma
> > > > > > > > > >         > 
> > > > > > > > > >         *** Disclaimer ***
> > > > > > > > > >         
> > > > > > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > > > >         
> > > > > > > > > >         nv van publiek recht
> > > > > > > > > >         BTW BE 0244.142.664
> > > > > > > > > >         RPR Brussel
> > > > > > > > > >         http://www.vrt.be/disclaimer
> > > > > > > _______________________________________________
> > > > > > > general mailing list
> > > > > > > general at lists.openfabrics.org
> > > > > > >
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > > > > 
> > > > > > > To unsubscribe, please visit
> > > > > > http://openib.org/mailman/listinfo/openib-general

-- 
MST


From sobebike at gmail.com  Tue May 29 09:19:16 2007
From: sobebike at gmail.com (SoBeBike)
Date: Tue, 29 May 2007 11:19:16 -0500
Subject: [ofa-general] ibv_get_cq_event blocking forever after successful
	ibv_post_send...
In-Reply-To: <adalkf8z21b.fsf@cisco.com>
References: <20070525212214.20500.qmail@station183.com>
	<adalkf8z21b.fsf@cisco.com>
Message-ID: <dedddf10705290919y7e2dd25cj126a1d24fdeb2c7c@mail.gmail.com>

OK. I'll try to create a simple test case which exhibits the problem.
It'll be a bit - I'm on vacation and probably won't mess with this
much until I return.


On 5/28/07, Roland Dreier <rdreier at cisco.com> wrote:
>  > Any ideas on why the ibv_get_cq_event() would never see an event
>  > after a "successful" send requesting a completion event?
>
> It's either a bug in your code or a bug in the stack below your code.
> The best way to debug this would be for you to post your actual code
> (in a form that someone else can run), so that we can either point out
> what's wrong with your code, or have a test case for the real bug.
>
>  - R.


From sean.hefty at intel.com  Tue May 29 10:14:54 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 29 May 2007 10:14:54 -0700
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <309a667c0705272250q68aa4064l40454db5b266a967@mail.gmail.com>
Message-ID: <000001c7a214$e0d30580$86cc180a@amr.corp.intel.com>

>Ok, but, by that time we can keep the framework ready?

I plan on re-submitting the cache for 2.6.23.  Beyond that I won't have the time
to work on enhancements for a few weeks.  I will happily review any patch
submissions though.

>How this will be managed? This will add extra startup time in the
>cluster, because cluster will be usable only after last cache has been
>enabled. Am I right?

I would word this differently: we can improve the time required to load the
cache, versus stating that the cache adds extra startup time. 

The cache is not necessary to use the cluster, so doesn't force extra startup
time.  Cache misses would simply be forwarded directly to the SA.  If the first
application to run on the cluster isn't establishing all-to-all communication
between the nodes then there may not be any reason to delay starting the app.

Even if the first app does establish all-to-all communication, waiting for the
caches to load can delay the start of the app, but cache use may decrease the
overall execution time of the app by more than this delay.  (Loading the cache
is likely to be more efficient than applications obtaining the path records
themselves.)

>How multi-pathing is handled in current cache_module?

A kernel ULP can request all paths, then select the one it wants.  Beyond that,
the cache can return paths to the user either round robin or randomly, based on
the cache settings.
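
The two selection policies mentioned above can be illustrated with a small
sketch. This is a toy model, not the cache module's code: the class name
PathCache, its methods, and the string path records are all hypothetical, and
a cache miss is modelled by returning None where a real consumer would fall
back to querying the SA.

```python
import itertools
import random


class PathCache:
    """Toy per-destination path record cache (all names are hypothetical)."""

    def __init__(self, policy="round_robin", seed=None):
        if policy not in ("round_robin", "random"):
            raise ValueError("unknown policy: %s" % policy)
        self.policy = policy
        self._paths = {}    # destination -> list of cached path records
        self._cursors = {}  # destination -> endless round-robin iterator
        self._rng = random.Random(seed)

    def load(self, dest, paths):
        """Store every known path to dest and reset its round-robin cursor."""
        self._paths[dest] = list(paths)
        self._cursors[dest] = itertools.cycle(self._paths[dest])

    def get_all(self, dest):
        """Return all cached paths so the caller can pick one itself."""
        return list(self._paths.get(dest, []))

    def get_one(self, dest):
        """Return one path chosen by the configured policy, or None on a miss."""
        paths = self._paths.get(dest)
        if not paths:
            return None  # miss: the request would go to the SA instead
        if self.policy == "round_robin":
            return next(self._cursors[dest])
        return self._rng.choice(paths)
```

With two paths loaded for a destination, the round-robin policy alternates
between them on successive get_one() calls, while get_all() corresponds to the
"request all paths, then select" case.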

- Sean


From ralph.campbell at qlogic.com  Tue May 29 10:24:48 2007
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 29 May 2007 10:24:48 -0700
Subject: [ofa-general] Re: [ewg] [PATCH] ofed_1_2/sdp - SDP can lose
	receive data
In-Reply-To: <1180256850.15464.1.camel@localhost>
References: <1180139623.3407.373.camel@brick.pathscale.com>
	<1180256850.15464.1.camel@localhost>
Message-ID: <1180459488.3407.376.camel@brick.pathscale.com>

It is from git://git.openfabrics.org/~vlad/ofed_1_2
commit 726c6827ac31c0b2f40acd804dc53362289bd21f

On Sun, 2007-05-27 at 12:07 +0300, Ami Perlmutter wrote:
> Ralph,
> this is how the code is now.
> Where are you getting this code from?
> 
> On Fri, 2007-05-25 at 17:33 -0700, Ralph Campbell wrote:
> > Can this fix be considered for OFED 1.2?
> > Thanks.
> > 
> > 
> > If a receive work completion is processed but there is no room
> > in a previously queued skb, the data is dropped.
> > This patch fixes the problem by queuing the skb.
> > 
> > Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>
> > 
> > diff -r 074340d1892d drivers/infiniband/ulp/sdp/sdp_bcopy.c
> > --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c	Fri May 25 17:04:51 2007 -0700
> > +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c	Fri May 25 17:07:02 2007 -0700
> > @@ -314,7 +314,8 @@ static inline struct sk_buff *sdp_sock_q
> >  			skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len);
> >  			__kfree_skb(skb);
> >  			skb = tail;
> > -		}
> > +		} else
> > +			skb_queue_tail(&sk->sk_receive_queue, skb);
> >  	} else
> >  		skb_queue_tail(&sk->sk_receive_queue, skb);
> >  
> > 
> > 
> > _______________________________________________
> > ewg mailing list
> > ewg at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> 


From bob.kossey at hp.com  Tue May 29 10:35:30 2007
From: bob.kossey at hp.com (Bob Kossey)
Date: Tue, 29 May 2007 13:35:30 -0400
Subject: [ofa-general] Re: ipoib / bonding and OFED
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303951259@xmb-sjc-216.amer.cisco.com>
References: <3857BB049D83424D9DB82753D37CEA55459C41@taurus.voltaire.com><4657373E.2030903@hp.com>
	<465BDC90.5080305@voltaire.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303951259@xmb-sjc-216.amer.cisco.com>
Message-ID: <465C6462.6000904@hp.com>

Thanks guys, I'll have to update my bits and try again. 

Another related question.  Does OFED 1.2 now support multiple independent
IB fabrics (multiple SMs, etc.) connected to multiple HCAs on the same node?
Are there any qualifications about which dimensions are supported with this,
such as ipoib HA, SRP HA, other types of failover, etc.?

Thanks,
Bob

Scott Weitzenkamp (sweitzen) wrote:
> Bob, it is now possible to configure IPoIB bonding in
> /etc/infiniband/openib.conf, this configuration file includes the
> following boilerplate.
>
> # Enable the bonding driver on startup
> IPOIBBOND_ENABLE=no
> # Set bond interface names
> #IPOIB_BONDS=bond0,bond1
> # Set specific bond params; address and slaves
> #bond0_IP=10.10.10.1
> #bond0_SLAVES=ib0,ib1
> #bond1_IP=20.10.10.1
> #bond1_SLAVES=ib2,ib3,ib4
>
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>  
>
>   
>> -----Original Message-----
>> From: general-bounces at lists.openfabrics.org 
>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Or Gerlitz
>> Sent: Tuesday, May 29, 2007 12:56 AM
>> To: Bob Kossey
>> Cc: OpenFabrics General
>> Subject: [ofa-general] Re: ipoib / bonding and OFED
>>
>> Bob Kossey wrote:
>>> I copied OR since I think this is related to his OFED HA work, and
>>> he might have some insights.  A few more questions for Or:
>>> I was trying to use ipoib bonding with OFED 1.2 rc2 and a 2.6.9 kernel,
>>> but was not able to get it to work so far.  I saw your Sonoma bonding
>>> slides, and you mention kernel bonding driver changes were needed.
>>> 2. Is there a minimum kernel version, with the kernel bonding driver
>>> changes, that is required to use bonding with OFED ipoib?
>>
>> Just to have a baseline here: to get bonding to work with IPoIB, you
>> should use the bonding driver provided with OFED 1.2. This driver is the
>> upstream one (of 2.6.20), patched to support IPoIB and backported
>> to RH5, SLES10 and RH4 U3/4/5; other kernels are not supported.
>>
>> If you were using the OFED bonding on a system that matches the support
>> matrix it should work. If you do have problems under this config, please
>> either open a bug at the OFED bugzilla
>> @ bugs.openfabrics.org assigned to monis at voltaire.com (Moni Shoua) or
>> send a first report/question to Moni and CC ewg at lists.openfabrics.org
>>
>> Please note that between RC2 and RC4 (to be released today etc) some
>> bugs were fixed; you can search in the bugzilla to see what.
>>
>>> 3. The bonding driver uses the HWADDR from the underlying ipoib
>>> devices, how does it obtain the HWADDR?  Does it use the full 20 bytes,
>>> or some subset?
>>
>> When enslaving IPoIB devices, the bonding driver uses the full hw
>> address of the active slave; it simply looks at the dev_addr field of
>> the slave's struct net_device (see include/linux/netdevice.h)
>>
>>> 4. What use_carrier options for link status detection does OFED ipoib
>>> support,
>>> MII, ETHTOOL or netif_carrier_ok?
>>
>> The mii/ethtool etc local link detection methods of the bonding driver
>> are somewhat deprecated, since nowadays almost any network device
>> supports the netif_carrier_ok call. The --default-- of the upstream
>> bonding driver (e.g. the one we use in OFED and the 2.6.21 listed below)
>> is to set the use_carrier mod param to 1, that is, mii is not used anymore.
>>
>>> author:         Thomas Davis, tadavis at lbl.gov and many others
>>> description:    Ethernet Channel Bonding Driver, v3.1.2
>>> version:        3.1.2
>>> parm:           use_carrier:Use netif_carrier_ok (vs MII ioctls) in miimon; 0 for off, 1 for on (default) (int)
>>> parm:           miimon:Link check interval in milliseconds (int)
>>>
>>> If you have any good examples of bonding configuration settings that work
>>> with OFED, I'd appreciate that also.
>>
>> The bonding RPM provided with OFED is made of a driver, script and some
>> help text containing usage examples, please take a look there and let me
>> know if you have further questions.
>>
>>> $ rpm -ql ib-bonding-0.9.0-2.6.9_42.ELsmp
>>> /lib/modules/2.6.9-42.ELsmp/updates/kernel/drivers/net/bonding/bonding.ko
>>> /usr/bin/ib-bond
>>> /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt
>>
>> The ofed service (/etc/init.d/openibd) was enhanced to allow for
>> --persistent-- bonding configuration, please see the bonding section at
>> docs/ipoib_release_notes.txt to see how to do it.
>>
>> Or.
>>


From rdreier at cisco.com  Tue May 29 11:27:46 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 11:27:46 -0700
Subject: [ofa-general] Re: [PATCH] libmlx4: fix qp capabilities
In-Reply-To: <1180443676.6825.8.camel@mtls03> (Eli Cohen's message of "Tue,
	29 May 2007 16:00:46 +0300")
References: <1180443676.6825.8.camel@mtls03>
Message-ID: <ada1wgzv61p.fsf@cisco.com>

thanks, that bug looks familiar from libmthca.  I prefer to fix it
as below, though, since that gives the true capabilities of the
QP being created.

Also, how did you create your patch?

 > --- libmlx4.orig/src/qp.c	2007-05-29 13:13:57.000000000 +0300
 > +++ libmlx4/src/qp.c	2007-05-29 14:41:33.000000000 +0300
 > @@ -396,12 +396,13 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd,
 >  		cap->max_send_sge = 1;

I couldn't find that context line in any version of src/qp.c that I had.


diff --git a/src/qp.c b/src/qp.c
index fa20dfa..8e2a3d3 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -390,7 +390,6 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap,
 	int max_sq_sge;
 
 	qp->rq.max_gs	 = cap->max_recv_sge;
-	qp->sq.max_gs	 = cap->max_send_sge;
 	max_sq_sge	 = align(cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg),
 				 sizeof (struct mlx4_wqe_data_seg)) / sizeof (struct mlx4_wqe_data_seg);
 	if (max_sq_sge < cap->max_send_sge)
@@ -478,7 +477,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *qp, struct ibv_qp_cap *cap,
 {
 	int wqe_size;
 
-	wqe_size = 1 << qp->sq.wqe_shift;
+	wqe_size = (1 << qp->sq.wqe_shift) - sizeof (struct mlx4_wqe_ctrl_seg);
 	switch (type) {
 	case IBV_QPT_UD:
 		wqe_size -= sizeof (struct mlx4_wqe_datagram_seg);
@@ -493,7 +492,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *qp, struct ibv_qp_cap *cap,
 		break;
 	}
 
-	qp->sq.max_gs        = wqe_size / sizeof (struct mlx4_wqe_data_seg);
+	qp->sq.max_gs	     = wqe_size / sizeof (struct mlx4_wqe_data_seg);
 	cap->max_send_sge    = qp->sq.max_gs;
 	qp->max_inline_data  = wqe_size - sizeof (struct mlx4_wqe_inline_seg);
 	cap->max_inline_data = qp->max_inline_data;
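
The effect of the second hunk can be checked with a little arithmetic. The
sketch below mirrors the calculation in mlx4_set_sq_sizes() as shown in the
diff; the segment sizes (16-byte control and data segments, 48-byte datagram
segment, 4-byte inline header) are assumptions made for illustration and
should be checked against the libmlx4 headers. The function name sq_caps()
and its parameters are hypothetical.

```python
# Segment sizes in bytes. ASSUMED values for illustration only; verify them
# against the struct definitions in the libmlx4 headers.
CTRL_SEG = 16      # struct mlx4_wqe_ctrl_seg
DATA_SEG = 16      # struct mlx4_wqe_data_seg
INLINE_SEG = 4     # struct mlx4_wqe_inline_seg
DATAGRAM_SEG = 48  # struct mlx4_wqe_datagram_seg (subtracted for UD QPs)


def sq_caps(wqe_shift, transport_seg=0, subtract_ctrl=True):
    """Return (max_send_sge, max_inline_data) for a send WQE of 2**wqe_shift bytes.

    transport_seg is the size of the per-transport segment taken out of the
    WQE (for example DATAGRAM_SEG for UD). subtract_ctrl=False reproduces the
    pre-patch calculation that ignored the control segment.
    """
    wqe_size = 1 << wqe_shift
    if subtract_ctrl:
        wqe_size -= CTRL_SEG
    wqe_size -= transport_seg
    max_gs = wqe_size // DATA_SEG
    max_inline = wqe_size - INLINE_SEG
    return max_gs, max_inline


if __name__ == "__main__":
    # 128-byte UD WQE, before and after accounting for the control segment.
    print(sq_caps(7, DATAGRAM_SEG, subtract_ctrl=False))
    print(sq_caps(7, DATAGRAM_SEG))
```

Under these assumed sizes, a 128-byte UD WQE would be reported as holding
five gather entries before the change and four after it, which is the kind of
overstatement the patch is meant to remove.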


From xma at us.ibm.com  Tue May 29 13:12:28 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 29 May 2007 13:12:28 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <D63C0BE2D613C543B6F3305502E9784C03157D7A@OCBEXS01001.rto.be>
Message-ID: <OF606E9B75.D55AEBDA-ON872572EA.006EC3C1-882572EA.0074613A@us.ibm.com>


Hello Koen,

>That is very difficult. This system is supposed to go in production
>within a few weeks. Changing the OFED drivers requires rebuilding a lot
>of other programs. If it isn't really necessary, I prefer not to do
>this...

I don't think there are any major changes in RC3 or RC4 that would prevent you
from running programs built against RC2. Please point it out if I'm wrong. You
can try one node first.

Thanks
Shirley Ma

From xma at us.ibm.com  Tue May 29 13:13:55 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 29 May 2007 13:13:55 -0700
Subject: [ofa-general] Re: GPFS node loses IB-connection
In-Reply-To: <20070529154700.GA8321@mellanox.co.il>
Message-ID: <OFBC757AEE.F82A79F7-ON872572EA.006F0EEA-882572EA.00748369@us.ibm.com>


Hello Michael,

>What Ami is asking you to do is to try to reproduce the problem with -RC3 or
>-RC4 when it's out.

Is there a known bug fix related to this issue in RC3?

Thanks
Shirley Ma

From rdreier at cisco.com  Tue May 29 13:32:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 13:32:47 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/mthca: fix send CQE with
	error for QP connected to SRQ
In-Reply-To: <20070527150642.GC26933@mellanox.co.il> (Michael S. Tsirkin's
	message of "Sun, 27 May 2007 18:06:42 +0300")
References: <20070527150642.GC26933@mellanox.co.il>
Message-ID: <adasl9ftlow.fsf@cisco.com>

thanks, applied for 2.6.22 (and also fixed the same bug in libmthca)


From sweitzen at cisco.com  Tue May 29 13:39:54 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 29 May 2007 13:39:54 -0700
Subject: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with
	kernel2-6.18-8.1.4.el5
In-Reply-To: <A382D4292574EB47A85B8159A6AED1A101498068@FPNYEXCBE02.opus-i.corp>
References: <1E3DCD1C63492545881FACB6063A57C1D524A3@mtiexch01.mti.com>
	<A382D4292574EB47A85B8159A6AED1A101498068@FPNYEXCBE02.opus-i.corp>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA3039514E2@xmb-sjc-216.amer.cisco.com>

Moni S,
 
The ib-bonding configuration process seems too picky, should we just
apply RHEL5 patches if we see a *el5* kernel?  In other words, change:
 
$ fgrep 2.6.18 linux/configure
        2.6.18-1.2747.el5*|2.6.18-8.el5*|2.6.18-*.*.fc6)
 
to:

        2.6.18-*el5*|2.6.18-*.*.fc6)
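
A quick way to see why the current pattern misses the 2.6.18-8.1.4.el5 kernel
is to test the globs directly. The sketch below uses Python's fnmatch as a
stand-in for the shell `case` matching in linux/configure (an approximation,
though the two agree for plain `*` patterns like these); the helper name
matches() is hypothetical.

```python
from fnmatch import fnmatchcase

# Alternatives from the existing `case` line, split on "|".
OLD_PATTERNS = ["2.6.18-1.2747.el5*", "2.6.18-8.el5*", "2.6.18-*.*.fc6"]
# Alternatives from the proposed replacement.
NEW_PATTERNS = ["2.6.18-*el5*", "2.6.18-*.*.fc6"]


def matches(kernel, patterns):
    """Return True if the kernel release string matches any glob pattern."""
    return any(fnmatchcase(kernel, pattern) for pattern in patterns)


if __name__ == "__main__":
    for kernel in ("2.6.18-8.el5", "2.6.18-8.1.4.el5"):
        print(kernel, matches(kernel, OLD_PATTERNS), matches(kernel, NEW_PATTERNS))
```

The errata kernel has "1.4" between "8." and "el5", so "2.6.18-8.el5*" cannot
match it, whereas the looser "2.6.18-*el5*" does.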


________________________________

	From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jeffrey Wong
	Sent: Friday, May 25, 2007 1:12 PM
	To: general at lists.openfabrics.org
	Subject: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with kernel2-6.18-8.1.4.el5
	
	
	Hello,

	I am installing the OFED 1.2-rc3.

	Everything else builds except for ib-bonding.  

	 
	Thanks in advance.

	 
	I am getting the following error messages:

	+ make -C /lib/modules/2.6.18-8.1.4.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding
	make: Entering directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64'
	  CC [M]  /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o
	In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:78:
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_inactive_flags':
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: (Each undeclared identifier is reported only once
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: for each function it appears in.)
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_active_flags':
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_compute_features':
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1233: warning: comparison of distinct pointer types lacks a cast
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_enslave':
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1449: error: 'IFF_BONDING' undeclared (first use in this function)
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_release':
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1848: error: 'IFF_BONDING' undeclared (first use in this function)
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_arp_rcv':
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:2548: error: 'IFF_BONDING' undeclared (first use in this function)
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_netdev_event':
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:3390: error: 'IFF_BONDING' undeclared (first use in this function)
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_init':
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4374: warning: assignment discards qualifiers from pointer target type
	/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4386: error: 'IFF_BONDING' undeclared (first use in this function)
	make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o] Error 1
	make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding] Error 2
	make: Leaving directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64'
	+ echo ' Building  IB bonding driver failed'
	 Building  IB bonding driver failed
	+ exit 1
	error: Bad exit status from /var/tmp/rpm-tmp.23876 (%build)

	 
	Jeff Wong

	 

From rdreier at cisco.com  Tue May 29 13:48:09 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 13:48:09 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix performance
	regression on Mellanox
In-Reply-To: <20070528113727.GP2945@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 28 May 2007 14:37:27 +0300")
References: <20070521120459.GI20400@mellanox.co.il>
	<ada1wh9ewya.fsf@cisco.com> <46595950.6080106@voltaire.com>
	<20070527125337.GF8342@mellanox.co.il> <adalkf921at.fsf@cisco.com>
	<20070528113727.GP2945@mellanox.co.il>
Message-ID: <adairabtkza.fsf@cisco.com>

thanks, I queued this


From rdreier at cisco.com  Tue May 29 13:49:22 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 13:49:22 -0700
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <20070529044815.GD13866@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 29 May 2007 07:48:15 +0300")
References: <20070514045832.GA18615@mellanox.co.il>
	<20070528121206.GA1847@mellanox.co.il> <ada1wh0z19c.fsf@cisco.com>
	<20070529044815.GD13866@mellanox.co.il>
Message-ID: <adaejkztkx9.fsf@cisco.com>

 > >  >       IB/ipoib: fix to_ipoib_neigh access race
 > > 
 > > I'm not convinced this is 2.6.22 material at this point -- it doesn't
 > > fix any observed problem that I know of.  (And the SRQ drain patch
 > > shows how even safe-looking patches can cause big problems)
 > 
 > Fine, but we do have it in OFED - could you spare some cycles to review it?

I plan to review it, but I question the decision to put it in OFED.  I
would have thought that OFED 1.2 was even more frozen than 2.6.22, and
I'm not sure why you would want to stick a patch like this in when you
don't know of anything that it fixes.


From rdreier at cisco.com  Tue May 29 13:49:33 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 13:49:33 -0700
Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <000101c79c09$74f15440$54c8180a@amr.corp.intel.com> (Sean Hefty's
	message of "Mon, 21 May 2007 17:38:02 -0700")
References: <000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
Message-ID: <adaabvntkwy.fsf@cisco.com>

thanks, applied for 2.6.22


From rdreier at cisco.com  Tue May 29 13:50:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 13:50:12 -0700
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <20070529071701.GA8159@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 29 May 2007 10:17:01 +0300")
References: <20070514045832.GA18615@mellanox.co.il>
	<20070528121206.GA1847@mellanox.co.il> <ada1wh0z19c.fsf@cisco.com>
	<20070529071701.GA8159@mellanox.co.il>
Message-ID: <ada646btkvv.fsf@cisco.com>

 >  >       IB/ipoib: fix to_ipoib_neigh access race

 > for-2.6.23 for now?

Yes, I plan to review it more carefully and queue it for 2.6.23.


From rdreier at cisco.com  Tue May 29 13:57:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 13:57:12 -0700
Subject: [ofa-general] Re: libibverbs autogen failures in ubuntu dapper
In-Reply-To: <20070529091543.GG8159@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 29 May 2007 12:15:43 +0300")
References: <20070529091543.GG8159@mellanox.co.il>
Message-ID: <ada1wgztkk7.fsf@cisco.com>

 > automake: Makefile.am: `src/libibverbs.la' is not a standard libtool library name

I think you need a newer automake.

BTW...

 > Attempt to run autogen.sh on an ubuntu dapper laptop gave me this:

dapper??  isn't it already time to update to gutsy?

 - R.


From rdreier at cisco.com  Tue May 29 13:59:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 13:59:47 -0700
Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user
In-Reply-To: <20070529091246.GF8159@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 29 May 2007 12:12:46 +0300")
References: <465AF3D3.10205@dev.mellanox.co.il> <adad50k285u.fsf@cisco.com>
	<20070529091246.GF8159@mellanox.co.il>
Message-ID: <adawsyrs5vg.fsf@cisco.com>

makes sense but:

 > -	if (rlim.rlim_cur <= 32768)
 > -		fprintf(stderr, PFX "Warning: RLIMIT_MEMLOCK is %lu bytes.\n"
 > -			"    This will severely limit memory registrations.\n",
 > -			rlim.rlim_cur);
 > +	if (rlim.rlim_cur > 32768)
 > +		return;
 > +
 > +	if (!getuid())
 > +		return;

I think it would be more natural to check the UID before getting the
rlimit.  And shouldn't this be geteuid() to handle processes that have
dropped their privileges?
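
The two suggestions above (test the UID first, and use the effective UID) could look roughly like the sketch below. The helper names `should_warn_memlock` and `check_memlock_limit` are made up for illustration and are not the libibverbs function names; the 32768-byte threshold and the message text are taken from the quoted hunk.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/resource.h>
#include <unistd.h>

#define MEMLOCK_WARN_THRESHOLD 32768	/* bytes, same cut-off as the quoted hunk */

/* Pure decision helper (hypothetical): 1 means "print the warning". */
static int should_warn_memlock(uid_t euid, rlim_t cur)
{
	if (euid == 0)		/* effective root: no warning wanted */
		return 0;
	return cur <= MEMLOCK_WARN_THRESHOLD;
}

/* Hypothetical wrapper showing the suggested ordering of the checks. */
static void check_memlock_limit(void)
{
	struct rlimit rlim;

	/* Check the effective UID first, so root never reaches getrlimit(). */
	if (geteuid() == 0)
		return;

	if (getrlimit(RLIMIT_MEMLOCK, &rlim) != 0)
		return;		/* limit unknown, stay quiet */

	if (should_warn_memlock(geteuid(), rlim.rlim_cur))
		fprintf(stderr, "Warning: RLIMIT_MEMLOCK is %lu bytes.\n"
			"    This will severely limit memory registrations.\n",
			(unsigned long) rlim.rlim_cur);
}
```

With geteuid(), a setuid-root program that has dropped back to an ordinary user is treated as that user and still gets the warning, which getuid() alone would not tell apart.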


From weiny2 at llnl.gov  Tue May 29 15:07:27 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 29 May 2007 15:07:27 -0700
Subject: [ofa-general] Re: [PATCH] opensm/console: portstatus command for
 only initialized ports
In-Reply-To: <20070528200742.GA13193@sashak.voltaire.com>
References: <20070528200742.GA13193@sashak.voltaire.com>
Message-ID: <20070529150727.7d6be9c1.weiny2@llnl.gov>

Looks fine to me.

I guess you don't like this formatting?  I like things to line up.  It is
easier to read.  No big deal.

Thanks,
Ira

On Mon, 28 May 2007 23:07:42 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> 
> Run portstatus command for only initialized ports + minor indentation
> fixes.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
>  opensm/opensm/osm_console.c |   18 ++++++++++--------
>  1 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
> index 2802c38..3415262 100644
> --- a/opensm/opensm/osm_console.c
> +++ b/opensm/opensm/osm_console.c
> @@ -598,15 +598,17 @@ __get_stats(cl_map_item_t * const p_map_item, void *context)
>  	fs->total_nodes++;
>  
>  	for (port = 1; port < num_ports; port++) {
> -		osm_physp_t    *phys = osm_node_get_physp_ptr(node, port);
> +		osm_physp_t *phys = osm_node_get_physp_ptr(node, port);
>  		ib_port_info_t *pi = &(phys->port_info);
> -
> -		uint8_t         active_speed = ib_port_info_get_link_speed_active(pi);
> -		uint8_t         enabled_speed = ib_port_info_get_link_speed_enabled(pi);
> -		uint8_t         active_width = pi->link_width_active;
> -		uint8_t         enabled_width = pi->link_width_enabled;
> -		uint8_t         port_state = ib_port_info_get_port_state(pi);
> -		uint8_t         port_phys_state = ib_port_info_get_port_phys_state(pi);
> +		uint8_t active_speed = ib_port_info_get_link_speed_active(pi);
> +		uint8_t enabled_speed = ib_port_info_get_link_speed_enabled(pi);
> +		uint8_t active_width = pi->link_width_active;
> +		uint8_t enabled_width = pi->link_width_enabled;
> +		uint8_t port_state = ib_port_info_get_port_state(pi);
> +		uint8_t port_phys_state = ib_port_info_get_port_phys_state(pi);
> +
> +		if (!osm_physp_is_valid(phys))
> +			continue;
>  
>  		if ((enabled_width ^ active_width) > active_width) {
>  			__tag_port_report(&(fs->reduced_width_ports),
> -- 
> 1.5.2.109.g802f


From hanafim.ctr at asc.hpc.mil  Tue May 29 15:57:53 2007
From: hanafim.ctr at asc.hpc.mil (MAHMOUD HANAFI)
Date: Tue, 29 May 2007 18:57:53 -0400
Subject: [ofa-general] Need OFED1.1 ib_srp  max_hw_sectors_kb help!
Message-ID: <465CAFF1.9000603@asc.hpc.mil>

All,

I am using OFED1.1 with a CISCO HCA/switch and DDN storage. I am able to load the srp driver and
perform IO to the DDN through it, but max_hw_sectors_kb for the device is getting set to 64KB. Has
anyone else seen this issue? The same host and storage over Fibre Channel doesn't have this
problem: there max_hw_sectors_kb is set correctly to 4096KB.

Thanks,
-- 
Mahmoud Hanafi
Senior System Administrator
ASC/MSRC
www.asc.hpc.mil
2435 5th Street
WPAFB, OHIO 45433
(937) 255-1536


From rdreier at cisco.com  Tue May 29 16:20:39 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 16:20:39 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <adaodk3rzco.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will get a few more nicely balanced ("55 insertions(+), 55
deletions(-)") 2.6.22-rc3 fixes, mostly for IPoIB connected mode:

Michael S. Tsirkin (2):
      IB/mthca: Fix handling of send CQE with error for QPs connected to SRQ
      IPoIB/cm: Fix performance regression on Mellanox

Roland Dreier (1):
      IB/mlx4: Fix last allocated object tracking in bitmap allocator

Sean Hefty (1):
      IB/cm: Fix stale connection detection

 drivers/infiniband/core/cm.c            |   25 ++++++-----
 drivers/infiniband/hw/mthca/mthca_qp.c  |    6 +-
 drivers/infiniband/ulp/ipoib/ipoib.h    |    3 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |   74 +++++++++++++++----------------
 drivers/net/mlx4/alloc.c                |    2 +-
 5 files changed, 55 insertions(+), 55 deletions(-)


diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index e840434..40c004a 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1297,26 +1297,29 @@ static struct cm_id_private * cm_match_req(struct cm_work *work,
 
 	req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
 
-	/* Check for duplicate REQ and stale connections. */
+	/* Check for possible duplicate REQ. */
 	spin_lock_irqsave(&cm.lock, flags);
 	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
-	if (!timewait_info)
-		timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
-
 	if (timewait_info) {
 		cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
 					   timewait_info->work.remote_id);
-		cm_cleanup_timewait(cm_id_priv->timewait_info);
 		spin_unlock_irqrestore(&cm.lock, flags);
 		if (cur_cm_id_priv) {
 			cm_dup_req_handler(work, cur_cm_id_priv);
 			cm_deref_id(cur_cm_id_priv);
-		} else
-			cm_issue_rej(work->port, work->mad_recv_wc,
-				     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
-				     NULL, 0);
-		listen_cm_id_priv = NULL;
-		goto out;
+		}
+		return NULL;
+	}
+
+	/* Check for stale connections. */
+	timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
+	if (timewait_info) {
+		cm_cleanup_timewait(cm_id_priv->timewait_info);
+		spin_unlock_irqrestore(&cm.lock, flags);
+		cm_issue_rej(work->port, work->mad_recv_wc,
+			     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
+			     NULL, 0);
+		return NULL;
 	}
 
 	/* Find matching listen request. */
diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index 0276649..eef415b 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -2284,10 +2284,10 @@ void mthca_free_err_wqe(struct mthca_dev *dev, struct mthca_qp *qp, int is_send,
 	struct mthca_next_seg *next;
 
 	/*
-	 * For SRQs, all WQEs generate a CQE, so we're always at the
-	 * end of the doorbell chain.
+	 * For SRQs, all receive WQEs generate a CQE, so we're always
+	 * at the end of the doorbell chain.
 	 */
-	if (qp->ibqp.srq) {
+	if (qp->ibqp.srq && !is_send) {
 		*new_wqe = 0;
 		return;
 	}
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 158759e..285c143 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -156,7 +156,7 @@ struct ipoib_cm_data {
  * - and then invoke a Destroy QP or Reset QP.
  *
  * We use the second option and wait for a completion on the
- * rx_drain_qp before destroying QPs attached to our SRQ.
+ * same CQ before destroying QPs attached to our SRQ.
  */
 
 enum ipoib_cm_state {
@@ -199,7 +199,6 @@ struct ipoib_cm_dev_priv {
 	struct ib_srq  	       *srq;
 	struct ipoib_cm_rx_buf *srq_ring;
 	struct ib_cm_id        *id;
-	struct ib_qp           *rx_drain_qp;   /* generates WR described in 10.3.1 */
 	struct list_head        passive_ids;   /* state: LIVE */
 	struct list_head        rx_error_list; /* state: ERROR */
 	struct list_head        rx_flush_list; /* state: FLUSH, drain not started */
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index f133b56..076a0bb 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -69,8 +69,9 @@ static struct ib_qp_attr ipoib_cm_err_attr = {
 
 #define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff
 
-static struct ib_recv_wr ipoib_cm_rx_drain_wr = {
-	.wr_id = IPOIB_CM_RX_DRAIN_WRID
+static struct ib_send_wr ipoib_cm_rx_drain_wr = {
+	.wr_id = IPOIB_CM_RX_DRAIN_WRID,
+	.opcode = IB_WR_SEND,
 };
 
 static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
@@ -163,16 +164,22 @@ partial_error:
 
 static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv* priv)
 {
-	struct ib_recv_wr *bad_wr;
+	struct ib_send_wr *bad_wr;
+	struct ipoib_cm_rx *p;
 
-	/* rx_drain_qp send queue depth is 1, so
+	/* We only reserved 1 extra slot in CQ for drain WRs, so
 	 * make sure we have at most 1 outstanding WR. */
 	if (list_empty(&priv->cm.rx_flush_list) ||
 	    !list_empty(&priv->cm.rx_drain_list))
 		return;
 
-	if (ib_post_recv(priv->cm.rx_drain_qp, &ipoib_cm_rx_drain_wr, &bad_wr))
-		ipoib_warn(priv, "failed to post rx_drain wr\n");
+	/*
+	 * QPs on flush list are error state.  This way, a "flush
+	 * error" WC will be immediately generated for each WR we post.
+	 */
+	p = list_entry(priv->cm.rx_flush_list.next, typeof(*p), list);
+	if (ib_post_send(p->qp, &ipoib_cm_rx_drain_wr, &bad_wr))
+		ipoib_warn(priv, "failed to post drain wr\n");
 
 	list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list);
 }
@@ -199,10 +206,10 @@ static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
 		.event_handler = ipoib_cm_rx_event_handler,
-		.send_cq = priv->cq, /* does not matter, we never send anything */
+		.send_cq = priv->cq, /* For drain WR */
 		.recv_cq = priv->cq,
 		.srq = priv->cm.srq,
-		.cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */
+		.cap.max_send_wr = 1, /* For drain WR */
 		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
 		.sq_sig_type = IB_SIGNAL_ALL_WR,
 		.qp_type = IB_QPT_RC,
@@ -242,6 +249,27 @@ static int ipoib_cm_modify_rx_qp(struct net_device *dev,
 		ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret);
 		return ret;
 	}
+
+	/*
+	 * Current Mellanox HCA firmware won't generate completions
+	 * with error for drain WRs unless the QP has been moved to
+	 * RTS first. This work-around leaves a window where a QP has
+	 * moved to error asynchronously, but this will eventually get
+	 * fixed in firmware, so let's not error out if modify QP
+	 * fails.
+	 */
+	qp_attr.qp_state = IB_QPS_RTS;
+	ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to init QP attr for RTS: %d\n", ret);
+		return 0;
+	}
+	ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify QP to RTS: %d\n", ret);
+		return 0;
+	}
+
 	return 0;
 }
 
@@ -623,38 +651,11 @@ static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
 int ipoib_cm_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ib_qp_init_attr qp_init_attr = {
-		.send_cq = priv->cq,   /* does not matter, we never send anything */
-		.recv_cq = priv->cq,
-		.cap.max_send_wr = 1,  /* FIXME: 0 Seems not to work */
-		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
-		.cap.max_recv_wr = 1,
-		.cap.max_recv_sge = 1, /* FIXME: 0 Seems not to work */
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
-		.qp_type = IB_QPT_UC,
-	};
 	int ret;
 
 	if (!IPOIB_CM_SUPPORTED(dev->dev_addr))
 		return 0;
 
-	priv->cm.rx_drain_qp = ib_create_qp(priv->pd, &qp_init_attr);
-	if (IS_ERR(priv->cm.rx_drain_qp)) {
-		printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name);
-		ret = PTR_ERR(priv->cm.rx_drain_qp);
-		return ret;
-	}
-
-	/*
-	 * We put the QP in error state directly.  This way, a "flush
-	 * error" WC will be immediately generated for each WR we post.
-	 */
-	ret = ib_modify_qp(priv->cm.rx_drain_qp, &ipoib_cm_err_attr, IB_QP_STATE);
-	if (ret) {
-		ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret);
-		goto err_qp;
-	}
-
 	priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev);
 	if (IS_ERR(priv->cm.id)) {
 		printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name);
@@ -676,8 +677,6 @@ err_listen:
 	ib_destroy_cm_id(priv->cm.id);
 err_cm:
 	priv->cm.id = NULL;
-err_qp:
-	ib_destroy_qp(priv->cm.rx_drain_qp);
 	return ret;
 }
 
@@ -740,7 +739,6 @@ void ipoib_cm_dev_stop(struct net_device *dev)
 		kfree(p);
 	}
 
-	ib_destroy_qp(priv->cm.rx_drain_qp);
 	cancel_delayed_work(&priv->cm.stale_task);
 }
 
diff --git a/drivers/net/mlx4/alloc.c b/drivers/net/mlx4/alloc.c
index dfbd580..f8d63d3 100644
--- a/drivers/net/mlx4/alloc.c
+++ b/drivers/net/mlx4/alloc.c
@@ -51,8 +51,8 @@ u32 mlx4_bitmap_alloc(struct mlx4_bitmap *bitmap)
 
 	if (obj < bitmap->max) {
 		set_bit(obj, bitmap->table);
+		bitmap->last = (obj + 1) & (bitmap->max - 1);
 		obj |= bitmap->top;
-		bitmap->last = obj + 1;
 	} else
 		obj = -1;
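
Read on its own, the mlx4 hunk above moves the update of the search hint ahead of the OR with the top bits, and wraps it with a power-of-two mask. A toy model of that ordering is sketched below; `toy_bitmap`, `toy_alloc` and `TOY_MAX` are invented for illustration and this is not the mlx4 allocator itself.

```c
#include <stdint.h>
#include <string.h>

#define TOY_MAX 8u	/* must be a power of two for the mask to work */

struct toy_bitmap {
	uint8_t  used[TOY_MAX];	/* one byte per slot, for clarity */
	uint32_t last;		/* index where the next search starts */
	uint32_t top;		/* high bits OR-ed into returned handles */
};

static void toy_init(struct toy_bitmap *b, uint32_t top)
{
	memset(b, 0, sizeof(*b));
	b->top = top;
}

/* Returns a handle, or UINT32_MAX if every slot is taken. */
static uint32_t toy_alloc(struct toy_bitmap *b)
{
	uint32_t i, obj;

	for (i = 0; i < TOY_MAX; i++) {
		obj = (b->last + i) & (TOY_MAX - 1);
		if (!b->used[obj]) {
			b->used[obj] = 1;
			/*
			 * Record the hint from the raw index, wrapped into
			 * range, before the top bits are mixed into the handle.
			 */
			b->last = (obj + 1) & (TOY_MAX - 1);
			return obj | b->top;
		}
	}
	return UINT32_MAX;
}
```

If the hint were instead computed after the OR, it would carry the top bits and could point past the end of the table, so the next search would not start where the last one left off.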
 

From pradeeps at linux.vnet.ibm.com  Tue May 29 18:01:21 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Tue, 29 May 2007 18:01:21 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <20070527083932.GC8342@mellanox.co.il>
References: <46537081.30906@linux.vnet.ibm.com>
	<20070524053819.GF6019@mellanox.co.il>
	<46574099.3090601@linux.vnet.ibm.com>
	<20070527083932.GC8342@mellanox.co.il>
Message-ID: <465CCCE1.1020106@linux.vnet.ibm.com>

Michael S. Tsirkin wrote:
>>>> -The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that
>>>> support SRQ like the Topspin HCA and, such HCAs should not be
>>>> impacted at all.
>>> I don't think it's that clean yet.
>>>
>>> Here's an idea: implement "fake SRQ" for ehca in software: make post recv on
>>> srq queue the WR, spread them evenly between QPs as they connect.  Once # of
>>> QPs goes above some limit, create QP command will fail. 

This already exists in the last couple of versions of the patch. We send
a REJ command when a predetermined threshold is crossed. What we have
been debating is what to do on the active side when the REJ command is
received.


>>> This would contain
>>> the mess nicely inside ehca (I think you'll want to add a flag that lets
>>> software figure out that SRQ is fake).
>>>
>>> We will still be left with the basic problem of what to do at the active side
>>> upon the reject, though.
>> As you indicate this will not solve the problem, so it is not an option.
> 
> Above, I have outlined how it can be done, so it certainly *is* an option.

In the previous mail I proposed a method to address both viewpoints:
a) let the active side return an error to the user level app and leave
the onus to the application b) switch to datagram mode when the QP
threshold is crossed.

There has been no response to that proposal.


> 
> In this thread, you basically keep saying that ehca will ever be the only HCA without SRQ
> support, so you can make a lot of assumptions about how IPoIB is used.
> 
> Fine, but if you follow this logic, it makes sense to hide the mess under the ehca
> provider interface.
> 
> 

Every time I address the issues you have raised previously, it appears
that something else crops up. I have said that I can provide a patch
that addresses both alternatives a) and b) above. Let us just stick to
that and limit our discussions and proceed to close the issues out and
not diverge any further.

Pradeep


From jsquyres at cisco.com  Tue May 29 18:47:11 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 29 May 2007 21:47:11 -0400
Subject: [ofa-general] Re: [ewg] Upcoming OFED teleconferences
In-Reply-To: <465C3A0E.6070602@mellanox.co.il>
References: <338EEB24-0F0E-443D-94AA-61E512611F8B@cisco.com>
	<465C3A0E.6070602@mellanox.co.il>
Message-ID: <D085C965-A339-44B9-A0E2-206C803C5D54@cisco.com>

Since no one objected, I moved tomorrow's teleconference to meet
Tziporet's schedule (an OFED teleconference without the RM would be
kinda meaningless!).

2:30pm US Eastern, 11:30am US Pacific, 9:30pm Israel
1. Wednesday, May 30 (*TOMORROW*), code 210262040

All others: noon US eastern / 9am US Pacific / 7pm Israel
2. Monday, June 4, code 2102061
3. Monday, June 11, code 210213621
4. Monday, June 18, code 2102061
5. Monday, June 25, code 210213621

US/Canada:  +1.866.432.9903
India:      +91.80.4103.3979
Israel:     +972.9.892.7026
Others:     http://cisco.com/en/US/about/doing_business/conferencing/


On May 29, 2007, at 10:34 AM, Tziporet Koren wrote:

> Jeff Squyres wrote:
>> Short version:
>> --------------
>>
>> Upcoming OFED teleconferences, all at noon US eastern / 9am US  
>> Pacific / 7pm Israel.
>>
>> 1. Wednesday, May 30 (*TOMORROW*), code 210262040
> I cannot make it at Wed 9am PST
> Can you change to 11:30am PST
>
> Thanks,
> Tziporet
>> 2. Monday, June 4, code 2102061
>> 3. Monday, June 11, code 210213621
>> 4. Monday, June 18, code 2102061
>> 5. Monday, June 25, code 210213621
>>
>> US/Canada:  +1.866.432.9903
>> India:      +91.80.4103.3979
>> Israel:     +972.9.892.7026
>> Others:     http://cisco.com/en/US/about/doing_business/conferencing/


-- 
Jeff Squyres
Cisco Systems


From mst at dev.mellanox.co.il  Tue May 29 20:37:43 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 30 May 2007 06:37:43 +0300
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <adaejkztkx9.fsf@cisco.com>
References: <20070514045832.GA18615@mellanox.co.il>
	<20070528121206.GA1847@mellanox.co.il> <ada1wh0z19c.fsf@cisco.com>
	<20070529044815.GD13866@mellanox.co.il> <adaejkztkx9.fsf@cisco.com>
Message-ID: <20070530033743.GC9036@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
> 
>  > >  >       IB/ipoib: fix to_ipoib_neigh access race
>  > > 
>  > > I'm not convinced this is 2.6.22 material at this point -- it doesn't
>  > > fix any observed problem that I know of.  (And the SRQ drain patch
>  > > shows how even safe-looking patches can cause big problems)
>  > 
>  > Fine, but we do have it in OFED - could you spare some cycles to review it?
> 
> I plan to review it, but I question the decision to put it in OFED.  I
> would have thought that OFED 1.2 was even more frozen than 2.6.22, and
> I'm not sure why you would want to stick a patch like this in when you
> don't know of anything that it fixes.

Point taken - I took this out.

-- 
MST


From mst at dev.mellanox.co.il  Tue May 29 20:39:45 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 30 May 2007 06:39:45 +0300
Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user
In-Reply-To: <adawsyrs5vg.fsf@cisco.com>
References: <465AF3D3.10205@dev.mellanox.co.il> <adad50k285u.fsf@cisco.com>
	<20070529091246.GF8159@mellanox.co.il> <adawsyrs5vg.fsf@cisco.com>
Message-ID: <20070530033945.GD9036@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] suppress RLIMIT warning for root user
> 
> makes sense but:
> 
>  > -	if (rlim.rlim_cur <= 32768)
>  > -		fprintf(stderr, PFX "Warning: RLIMIT_MEMLOCK is %lu bytes.\n"
>  > -			"    This will severely limit memory registrations.\n",
>  > -			rlim.rlim_cur);
>  > +	if (rlim.rlim_cur > 32768)
>  > +		return;
>  > +
>  > +	if (!getuid())
>  > +		return;
> 
> I think it would be more natural to check the UID before getting the
> rlimit.  And shouldn't this be geteuid() to handle processes that have
> dropped their privileges?

Agree on both points.

-- 
MST


From devesh28 at gmail.com  Tue May 29 21:44:21 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Wed, 30 May 2007 10:14:21 +0530
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <000001c7a214$e0d30580$86cc180a@amr.corp.intel.com>
References: <309a667c0705272250q68aa4064l40454db5b266a967@mail.gmail.com>
	<000001c7a214$e0d30580$86cc180a@amr.corp.intel.com>
Message-ID: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com>

On 5/29/07, Sean Hefty <sean.hefty at intel.com> wrote:
> >Ok, but, by that time we can keep the framework ready?
>
> I plan on re-submitting the cache for 2.6.23.  Beyond that I won't have the time
> to work on enhancements for a few weeks.  I will happily review any patch
> submissions though.
OK, I will post a patch related to this soon.
How will the static PR file be generated? That needs to be discussed.
>
> >How this will be managed? This will add extra startup time in the
> >cluster, because cluster will be usable only after last cache has been
> >enabled. Am I right?
>
> I would word this differently: we can improve the time required to load the
> cache, versus stating that the cache adds extra startup time.
>
> The cache is not necessary to use the cluster, so doesn't force extra startup
> time.  Cache misses would simply be forwarded directly to the SA.  If the first
> application to run on the cluster isn't establishing all-to-all communication
> between the nodes then there may not be any reason to delay starting the app.
>
> Even if the first app does establish all-to-all communication, waiting for the
> caches to load can delay the start of the app, but cache use may decrease the
> overall execution time of the app by more than this delay.  (Loading the cache
> is likely to be more efficient than applications obtaining the path records
> themselves.)
>
> >How multi-pathing is handled in current cache_module?
>
> A kernel ULP can request all paths, then select the one they want.  Beyond that,
> the cache can either return paths to the user round robin or randomly, based on
> the cache settings.
>
> - Sean
>


From eli at mellanox.co.il  Tue May 29 23:23:38 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 30 May 2007 09:23:38 +0300
Subject: [ofa-general] Re: [PATCH] libmlx4: fix qp capabilities
In-Reply-To: <ada1wgzv61p.fsf@cisco.com>
References: <1180443676.6825.8.camel@mtls03>  <ada1wgzv61p.fsf@cisco.com>
Message-ID: <1180506248.6825.19.camel@mtls03>

On Tue, 2007-05-29 at 11:27 -0700, Roland Dreier wrote:

> Also, how did you create your patch?
> 
>  > --- libmlx4.orig/src/qp.c	2007-05-29 13:13:57.000000000 +0300
>  > +++ libmlx4/src/qp.c	2007-05-29 14:41:33.000000000 +0300
>  > @@ -396,12 +396,13 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd,
>  >  		cap->max_send_sge = 1;
> 
> I couldn't find that context line in any version of src/qp.c that I had.
> 

This is part of a patch that we apply on OFED builds; it is required
to ensure that the send queue size is greater than zero.


From cap at nsc.liu.se  Tue May 29 23:43:09 2007
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Wed, 30 May 2007 08:43:09 +0200
Subject: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with
	kernel2-6.18-8.1.4.el5
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA3039514E2@xmb-sjc-216.amer.cisco.com>
References: <1E3DCD1C63492545881FACB6063A57C1D524A3@mtiexch01.mti.com>
	<A382D4292574EB47A85B8159A6AED1A101498068@FPNYEXCBE02.opus-i.corp>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA3039514E2@xmb-sjc-216.amer.cisco.com>
Message-ID: <200705300843.13575.cap@nsc.liu.se>

On Tuesday 29 May 2007, Scott Weitzenkamp (sweitzen) wrote:
> Moni S,
>
> The ib-bonding configuration process seems too picky, should we just
> apply RHEL5 patches if we see a *el5* kernel?  In other words, change:
>
> $ fgrep 2.6.18 linux/configure
>         2.6.18-1.2747.el5*|2.6.18-8.el5*|2.6.18-*.*.fc6)
>
> to:
>
>         2.6.18-*el5*|2.6.18-*.*.fc6)

Why mix in non-el5 kernels? 2.6.18-2747 and similar are beta kernels (fc 
naming left-over) and stuff with fc6 in it is clearly not el5. Update kernels 
for el5 (before el5u1) should be named 2.6.18-8.x.y.el5, so 
maybe: "2.6.18-8.*el5"?
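
One way to try the candidate patterns outside of configure is POSIX fnmatch(), which uses the same pattern-matching notation as a shell `case` statement. This is only a sketch: `kver_matches` is a made-up helper, and the release strings in the usage note are examples rather than an exhaustive list of el5 kernels.

```c
#include <fnmatch.h>

/* Returns 1 if the kernel release string matches the glob pattern. */
static int kver_matches(const char *pattern, const char *release)
{
	return fnmatch(pattern, release, 0) == 0;
}
```

Under these rules "2.6.18-8.*el5" accepts both 2.6.18-8.el5 and 2.6.18-8.1.4.el5 and rejects 2.6.18-1.2747.el5, while the broader "2.6.18-*el5*" also accepts the beta-style name.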

/Peter

From amip at dev.mellanox.co.il  Tue May 29 23:43:06 2007
From: amip at dev.mellanox.co.il (Ami Perlmutter)
Date: Wed, 30 May 2007 09:43:06 +0300
Subject: [ofa-general] Re: [ewg] [PATCH] ofed_1_2/sdp - SDP can lose
	receive data
In-Reply-To: <1180459488.3407.376.camel@brick.pathscale.com>
References: <1180139623.3407.373.camel@brick.pathscale.com>
	<1180256850.15464.1.camel@localhost>
	<1180459488.3407.376.camel@brick.pathscale.com>
Message-ID: <1180507416.12048.19.camel@localhost>

this is how the code looks now:

if (likely(skb_len && (tail = skb_peek_tail(&sk->sk_receive_queue))) &&
    unlikely(skb_tailroom(tail) >= skb_len)) {
	skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len);
	__kfree_skb(skb);
	skb = tail;
} else
	skb_queue_tail(&sk->sk_receive_queue, skb);

could you point out the problem?
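
For readers following the thread, the behaviour being compared can be reduced to a toy model: incoming data is either merged into the tail buffer when it has enough tailroom, or queued as a new buffer in every other case. The sketch below is not the SDP code; `toy_buf`, `toy_queue` and `toy_receive` are invented names and no real skb API is used.

```c
#include <stddef.h>

#define TOY_QLEN 16

struct toy_buf {
	size_t len;	/* bytes of data held */
	size_t cap;	/* total capacity; tailroom is cap - len */
};

struct toy_queue {
	struct toy_buf bufs[TOY_QLEN];
	size_t count;
};

/*
 * Returns 1 if the data was merged into the tail buffer, 0 if it was
 * queued as a new buffer, and -1 if the toy queue is full.
 */
static int toy_receive(struct toy_queue *q, size_t len, size_t cap)
{
	struct toy_buf *tail = q->count ? &q->bufs[q->count - 1] : NULL;

	if (len && tail && tail->cap - tail->len >= len) {
		tail->len += len;	/* coalesce into existing tailroom */
		return 1;
	}

	/* Single fall-through path: empty queue, zero length, or no tailroom. */
	if (q->count == TOY_QLEN)
		return -1;
	q->bufs[q->count].len = len;
	q->bufs[q->count].cap = cap;
	q->count++;
	return 0;
}

/* Total bytes held, used to check that nothing was dropped. */
static size_t toy_total(const struct toy_queue *q)
{
	size_t i, sum = 0;

	for (i = 0; i < q->count; i++)
		sum += q->bufs[i].len;
	return sum;
}
```

With one combined condition and one else branch, every arrival lands somewhere. A nested inner test without its own else is the shape in which the "tail exists but has no room" case can fall through unqueued.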

On Tue, 2007-05-29 at 10:24 -0700, Ralph Campbell wrote:
> It is from git://git.openfabrics.org/~vlad/ofed_1_2
> commit 726c6827ac31c0b2f40acd804dc53362289bd21f
> 
> > On Sun, 2007-05-27 at 12:07 +0300, Ami Perlmutter wrote:
> > > Ralph,
> > > this is how the code is now.
> > > Where are you getting this code from?
> > 
> > On Fri, 2007-05-25 at 17:33 -0700, Ralph Campbell wrote:
> > > Can this fix be considered for OFED 1.2?
> > > Thanks.
> > > 
> > > 
> > > If a receive work completion is processed but there is no room
> > > in a previously queued skb, the data is dropped.
> > > This patch fixes the problem by queuing the skb.
> > > 
> > > Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>
> > > 
> > > diff -r 074340d1892d drivers/infiniband/ulp/sdp/sdp_bcopy.c
> > > --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c	Fri May 25 17:04:51 2007 -0700
> > > +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c	Fri May 25 17:07:02 2007 -0700
> > > @@ -314,7 +314,8 @@ static inline struct sk_buff *sdp_sock_q
> > >  			skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len);
> > >  			__kfree_skb(skb);
> > >  			skb = tail;
> > > -		}
> > > +		} else
> > > +			skb_queue_tail(&sk->sk_receive_queue, skb);
> > >  	} else
> > >  		skb_queue_tail(&sk->sk_receive_queue, skb);
> > >  
> > > 
> > > 
> 


From notting at redhat.com  Wed May 30 01:05:18 2007
From: notting at redhat.com (Bill Nottingham)
Date: Wed, 30 May 2007 04:05:18 -0400
Subject: [ofa-general] [PATCH] drivers/infiniband: fix comparsion between
	unsigned and negative
Message-ID: <20070530080518.GA29195@nostromo.devel.redhat.com>

Recent gcc versions emit warnings when unsigned variables are compared < 0 or >= 0.

Signed-off-by: Bill Nottingham <notting at redhat.com>

---
 core/sysfs.c       |    2 +-
 core/ucm.c         |    2 +-
 core/ucma.c        |    2 +-
 core/user_mad.c    |    5 ++---
 core/uverbs_main.c |    3 +--
 core/verbs.c       |    3 +--
 hw/mlx4/qp.c       |    2 +-
 7 files changed, 8 insertions(+), 11 deletions(-)

diff -ru linux-2.6.21-old/drivers/infiniband/core/sysfs.c linux-2.6.21/drivers/infiniband/core/sysfs.c
--- linux-2.6.21-old/drivers/infiniband/core/sysfs.c	2007-05-30 02:52:52.000000000 -0400
+++ linux-2.6.21/drivers/infiniband/core/sysfs.c	2007-05-30 02:07:31.000000000 -0400
@@ -112,7 +112,7 @@
 		return ret;
 
 	return sprintf(buf, "%d: %s\n", attr.state,
-		       attr.state >= 0 && attr.state < ARRAY_SIZE(state_name) ?
+		       attr.state < ARRAY_SIZE(state_name) ?
 		       state_name[attr.state] : "UNKNOWN");
 }
 
diff -ru linux-2.6.21-old/drivers/infiniband/core/ucma.c linux-2.6.21/drivers/infiniband/core/ucma.c
--- linux-2.6.21-old/drivers/infiniband/core/ucma.c	2007-05-30 02:52:52.000000000 -0400
+++ linux-2.6.21/drivers/infiniband/core/ucma.c	2007-05-30 02:09:34.000000000 -0400
@@ -955,7 +955,7 @@
 	if (copy_from_user(&hdr, buf, sizeof(hdr)))
 		return -EFAULT;
 
-	if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
+	if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
 		return -EINVAL;
 
 	if (hdr.in + sizeof(hdr) > len)
diff -ru linux-2.6.21-old/drivers/infiniband/core/ucm.c linux-2.6.21/drivers/infiniband/core/ucm.c
--- linux-2.6.21-old/drivers/infiniband/core/ucm.c	2007-05-30 02:52:52.000000000 -0400
+++ linux-2.6.21/drivers/infiniband/core/ucm.c	2007-05-30 02:08:01.000000000 -0400
@@ -1125,7 +1125,7 @@
 	if (copy_from_user(&hdr, buf, sizeof(hdr)))
 		return -EFAULT;
 
-	if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucm_cmd_table))
+	if (hdr.cmd >= ARRAY_SIZE(ucm_cmd_table))
 		return -EINVAL;
 
 	if (hdr.in + sizeof(hdr) > len)
diff -ru linux-2.6.21-old/drivers/infiniband/core/user_mad.c linux-2.6.21/drivers/infiniband/core/user_mad.c
--- linux-2.6.21-old/drivers/infiniband/core/user_mad.c	2007-05-30 02:52:52.000000000 -0400
+++ linux-2.6.21/drivers/infiniband/core/user_mad.c	2007-05-30 02:08:32.000000000 -0400
@@ -455,8 +455,7 @@
 		goto err;
 	}
 
-	if (packet->mad.hdr.id < 0 ||
-	    packet->mad.hdr.id >= IB_UMAD_MAX_AGENTS) {
+	if (packet->mad.hdr.id >= IB_UMAD_MAX_AGENTS) {
 		ret = -EINVAL;
 		goto err;
 	}
@@ -665,7 +664,7 @@
 
 	down_write(&file->port->mutex);
 
-	if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !__get_agent(file, id)) {
+	if (id >= IB_UMAD_MAX_AGENTS || !__get_agent(file, id)) {
 		ret = -EINVAL;
 		goto out;
 	}
diff -ru linux-2.6.21-old/drivers/infiniband/core/uverbs_main.c linux-2.6.21/drivers/infiniband/core/uverbs_main.c
--- linux-2.6.21-old/drivers/infiniband/core/uverbs_main.c	2007-05-30 02:52:52.000000000 -0400
+++ linux-2.6.21/drivers/infiniband/core/uverbs_main.c	2007-05-30 02:09:07.000000000 -0400
@@ -592,8 +592,7 @@
 	if (hdr.in_words * 4 != count)
 		return -EINVAL;
 
-	if (hdr.command < 0				||
-	    hdr.command >= ARRAY_SIZE(uverbs_cmd_table) ||
+	if (hdr.command >= ARRAY_SIZE(uverbs_cmd_table) ||
 	    !uverbs_cmd_table[hdr.command]		||
 	    !(file->device->ib_dev->uverbs_cmd_mask & (1ull << hdr.command)))
 		return -EINVAL;
diff -ru linux-2.6.21-old/drivers/infiniband/core/verbs.c linux-2.6.21/drivers/infiniband/core/verbs.c
--- linux-2.6.21-old/drivers/infiniband/core/verbs.c	2007-05-30 02:52:52.000000000 -0400
+++ linux-2.6.21/drivers/infiniband/core/verbs.c	2007-05-30 02:07:06.000000000 -0400
@@ -535,8 +535,7 @@
 {
 	enum ib_qp_attr_mask req_param, opt_param;
 
-	if (cur_state  < 0 || cur_state  > IB_QPS_ERR ||
-	    next_state < 0 || next_state > IB_QPS_ERR)
+	if (cur_state  > IB_QPS_ERR || next_state > IB_QPS_ERR)
 		return 0;
 
 	if (mask & IB_QP_CUR_STATE  &&
diff -ru linux-2.6.21-old/drivers/infiniband/hw/mlx4/qp.c linux-2.6.21/drivers/infiniband/hw/mlx4/qp.c
--- linux-2.6.21-old/drivers/infiniband/hw/mlx4/qp.c	2007-05-30 02:52:52.000000000 -0400
+++ linux-2.6.21/drivers/infiniband/hw/mlx4/qp.c	2007-05-30 02:10:18.000000000 -0400
@@ -1284,7 +1284,7 @@
 		 */
 		wmb();
 
-		if (wr->opcode < 0 || wr->opcode >= ARRAY_SIZE(mlx4_ib_opcode)) {
+		if (wr->opcode >= ARRAY_SIZE(mlx4_ib_opcode)) {
 			err = -EINVAL;
 			goto out;
 		}


From vlad at lists.openfabrics.org  Wed May 30 02:41:53 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed, 30 May 2007 02:41:53 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070530-0200 daily build status
Message-ID: <20070530094153.EE7F2E607FA@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From eli at mellanox.co.il  Wed May 30 03:14:31 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 30 May 2007 13:14:31 +0300
Subject: [ofa-general] [PATCH] mlx4_core: fix CQ mailbox layout
Message-ID: <1180520101.6825.26.camel@mtls03>

Fix CQ inbox layout

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: connectx_kernel/drivers/net/mlx4/cq.c
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/cq.c	2007-05-29 16:20:17.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/cq.c	2007-05-30 12:50:51.000000000 +0300
@@ -61,7 +61,7 @@
 	__be32			solicit_producer_index;
 	__be32			consumer_index;
 	__be32			producer_index;
-	u8			reserved6[2];
+	u32			reserved6[2];
 	__be64			db_rec_addr;
 };
 
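
One way to see why the element type of the reserved field matters is to
compare field offsets. This is a minimal sketch, assuming trimmed-down
stand-ins for the tail of the context structure: the demo_* names and the
fixed-width types are hypothetical, and only the field names come from the
patch. Widening reserved6 from two bytes to two 32-bit words moves
db_rec_addr further from the start of the structure, which is presumably
the layout change the fix is after.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: tail of the structure before the patch. */
struct demo_cq_tail_old {
	uint32_t solicit_producer_index;
	uint32_t consumer_index;
	uint32_t producer_index;
	uint8_t  reserved6[2];	/* 2 bytes, plus any compiler padding */
	uint64_t db_rec_addr;
};

/* Hypothetical stand-in: tail of the structure after the patch. */
struct demo_cq_tail_new {
	uint32_t solicit_producer_index;
	uint32_t consumer_index;
	uint32_t producer_index;
	uint32_t reserved6[2];	/* 8 bytes */
	uint64_t db_rec_addr;
};

/* Byte offset of db_rec_addr in the old layout. */
size_t demo_old_db_offset(void)
{
	return offsetof(struct demo_cq_tail_old, db_rec_addr);
}

/* Byte offset of db_rec_addr in the new layout. */
size_t demo_new_db_offset(void)
{
	return offsetof(struct demo_cq_tail_new, db_rec_addr);
}

void demo_layout_check(void)
{
	/* The wider reserved field pushes db_rec_addr to a larger offset. */
	assert(demo_new_db_offset() > demo_old_db_offset());
}
```

The exact offsets depend on the ABI's alignment rules, so the check only
compares the two layouts against each other.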

From halr at voltaire.com  Wed May 30 03:38:50 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 May 2007 06:38:50 -0400
Subject: [ofa-general] Re: [PATCH] opensm/console: portstatus command for
	only initialized ports
In-Reply-To: <20070528200742.GA13193@sashak.voltaire.com>
References: <20070528200742.GA13193@sashak.voltaire.com>
Message-ID: <1180521528.7116.53237.camel@hal.voltaire.com>

On Mon, 2007-05-28 at 16:07, Sasha Khapyorsky wrote:
> Run portstatus command for only initialized ports + minor indentation
> fixes.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From tziporet at mellanox.co.il  Wed May 30 03:43:26 2007
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 30 May 2007 13:43:26 +0300
Subject: [ofa-general] RE: [ewg] OFED-1.2-20070529-0600 won't build due to
	srptools changes
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303951244@xmb-sjc-216.amer.cisco.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA303951244@xmb-sjc-216.amer.cisco.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901563477@mtlexch01.mtl.com>

We noticed this too, and it was already fixed yesterday in the 6am
build.
 
Tziporet

________________________________

From: ewg-bounces at lists.openfabrics.org
[mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Scott
Weitzenkamp (sweitzen)
Sent: Tuesday, May 29, 2007 6:30 PM
To: OpenFabrics EWG
Cc: OpenFabrics General
Subject: [ewg] OFED-1.2-20070529-0600 won't build due to srptools
changes
Importance: High


I have reopened https://bugs.openfabrics.org/show_bug.cgi?id=533. Ishai,
please fix ASAP. This bug is now a P1 blocker.
 
RPM build errors:
    user vlad does not exist - using root
    group vlad does not exist - using root
    user vlad does not exist - using root
    group vlad does not exist - using root
    File not found: /var/tmp/OFED/usr/sbin/execute_multipath_or_kpartx.sh
    File not found: /var/tmp/OFED/usr/sbin/srp_dm_multipath_daemon
    File not found: /var/tmp/OFED/usr/sbin/srp_post_multipath
ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-sdpnetstat --with-srptools --with-mstflint --with-perftest --with-tvflash --sysconfdir=/etc --mandir=/usr/share/man' --define 'configure_options32 --with-dapl --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-sdpnetstat --with-srptools --sysconfdir=/etc --mandir=/usr/share/man' --define 'build_32bit 1' --define '_mandir /usr/share/man' /tmp/OFED-1.2-20070529-0600/SRPMS/ofa_user-1.2-rc2.src.rpm"

 

From erezz at voltaire.com  Wed May 30 05:38:30 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 30 May 2007 15:38:30 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel
 addons for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <20070529141143.GD27671@mellanox.co.il>
References: <20070521114410.GG20400@mellanox.co.il>
	<46557BCB.7030102@voltaire.com>
	<20070524115715.GC4585@mellanox.co.il>
	<465C2D78.30100@voltaire.com>
	<20070529141143.GD27671@mellanox.co.il>
Message-ID: <465D7046.3080109@voltaire.com>

Michael S. Tsirkin wrote:
>> Quoting Erez Zilber <erezz at voltaire.com>:
>> Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
>>
>>
>>     
>>>> I have the following files in backport/2.6.9_UX/include/src/:
>>>>
>>>> attribute_container.c - almost identical to the file on 2.6.20. I had to change one line in it.
>>>>     
>>>>         
>>> could be a patch ...
>>> which line?
>>>
>>>   
>>>       
>> Now, attribute_container.c, klist.c & transport_class.c are copied from
>> the kernel tree. I've committed the required changes in
>> ~erezz/ofabuild_iser_rh4.git & ~erezz/ofed_1_2_iser_rh4.git.
>>     
>
> git fetch git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git.
> fatal: The remote end hung up unexpectedly
> Cannot get the repository state from
> git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git.
>
>
>   

Here's what I was able to do:

[root at hydrus t]# git-clone git://git.openfabrics.org/~vlad/ofed_1_2/.git
remote: Generating pack...
remote: Done counting 418996 objects.
remote: Deltifying 418996 objects...
remote:  100% (418996/418996) done
remote: Total 418996 (delta 333751), reused 399530 (delta 318605)
Checking files out...)
 100% (22588/22588) done
[root at hydrus t]# cd ofed_1_2
[root at hydrus ofed_1_2]# git fetch
git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git
remote: Generating pack...
remote: Done counting 156 objects.
Result has 133 objects.
remote: Deltifying 133 objects...
remote:  100% (133/133) done
Unpacking 133 objects
remote: Total 133 (delta 74), reused 24 (delta 5)
 100% (133/133) done

>> The main
>> change is a new dir called "kernel_addons_patches". It contains patches
>> for kernel tree files in order to create the required addons from them.
>>     
>
> sorry, but I really don't think we can touch build scripts at this point.
> Doing cp in build scripts is also a problem since it interferes with
> development (there are 2 places to edit each file).
> And adding kernel version dependency there is also really messy.
>
> Suggestion: why can't these patches be part of the regular backport directory?
>
> you copy stuff to include/src and then include it, but this just looks
> like an unnecessary extra step. Can't we include the source file from
> its original directory, like this:
> #include "../drivers/base/attribute_container.c"
>   

I can use attribute_container.c from drivers/base. However, having some
of the addons in drivers/base while most of the addons are in
kernel_addons is confusing, isn't it? It will also require ugly
adjustments like:

kernel_patches/backport/2.6.9_U3/iscsi_scsi_addons.patch:

diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index e212608..3bf2015 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -1,2 +1,7 @@
 obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
 obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
+
+CFLAGS_attribute_container.o =  
-I$(PWD)/kernel_addons/backport/2.6.9_U3/include/src/
+
+scsi_transport_iscsi-y := scsi_transport_iscsi_f.o scsi.o scsi_lib.o
init.o klist.o attribute_container.o transport_class.o
+libiscsi-y             := libiscsi_f.o scsi_scan.o

(because base.h is in kernel_addons/backport/2.6.9_U3/include/src)


From mst at dev.mellanox.co.il  Wed May 30 05:54:56 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 30 May 2007 15:54:56 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel
	addons for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <465D7046.3080109@voltaire.com>
References: <20070521114410.GG20400@mellanox.co.il>
	<46557BCB.7030102@voltaire.com>
	<20070524115715.GC4585@mellanox.co.il>
	<465C2D78.30100@voltaire.com>
	<20070529141143.GD27671@mellanox.co.il>
	<465D7046.3080109@voltaire.com>
Message-ID: <20070530125456.GF9036@mellanox.co.il>

> >> The main
> >> change is a new dir called "kernel_addons_patches". It contains patches
> >> for kernel tree files in order to create the required addons from them.
> >>     
> >
> > sorry, but I really don't think we can touch build scripts at this point.
> > Doing cp in build scripts is also a problem since it interferes with
> > development (there are 2 places to edit each file).
> > And adding kernel version dependency there is also really messy.
> >
> > Suggestion: why can't these patches be part of the regular backport directory?
> >
> > you copy stuff to include/src and then include it, but this just looks
> > like an unnecessary extra step. Can't we include the source file from
> > its original directory, like this:
> > #include "../drivers/base/attribute_container.c"
> >   
> 
> I can use attribute_container.c from drivers/base. However, having some
> of the addons in drivers/base while most of the addons are in
> kernel_addons is confusing, isn't it?

No, because it's not an addon - this is actual upstream kernel code.
Addons are stuff that we *add* to the kernel.

> It will also require ugly
> adjustments like:
> 
> kernel_patches/backport/2.6.9_U3/iscsi_scsi_addons.patch:
> 
> diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
> index e212608..3bf2015 100644
> --- a/drivers/scsi/Makefile
> +++ b/drivers/scsi/Makefile
> @@ -1,2 +1,7 @@
>  obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
>  obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
> +
> +CFLAGS_attribute_container.o =  
> -I$(PWD)/kernel_addons/backport/2.6.9_U3/include/src/
> +
> +scsi_transport_iscsi-y := scsi_transport_iscsi_f.o scsi.o scsi_lib.o
> init.o klist.o attribute_container.o transport_class.o
> +libiscsi-y             := libiscsi_f.o scsi_scan.o
> 
> (because base.h is in kernel_addons/backport/2.6.9_U3/include/src)

This is one approach, and I think it's not too bad.
The alternative is to use a relative path in the include directive:
#include "../drivers/base/attribute_container.c"

Wouldn't this work?

-- 
MST


From rdreier at cisco.com  Wed May 30 08:23:35 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 30 May 2007 08:23:35 -0700
Subject: [ofa-general] Re: [PATCH] mlx4_core: fix CQ mailbox layout
In-Reply-To: <1180520101.6825.26.camel@mtls03> (Eli Cohen's message of "Wed,
	30 May 2007 13:14:31 +0300")
References: <1180520101.6825.26.camel@mtls03>
Message-ID: <ada4plus5c8.fsf@cisco.com>

yikes... applied, thanks.


From rdreier at cisco.com  Wed May 30 08:30:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 30 May 2007 08:30:00 -0700
Subject: [ofa-general] dealing with gcc 'comparison is always false' warnings
	(was: [PATCH] drivers/infiniband: fix comparsion between
	unsigned and negative)
In-Reply-To: <20070530080518.GA29195@nostromo.devel.redhat.com> (Bill
	Nottingham's message of "Wed, 30 May 2007 04:05:18 -0400")
References: <20070530080518.GA29195@nostromo.devel.redhat.com>
Message-ID: <adazm3mqqh3.fsf@cisco.com>

thanks... I'm wondering if there's a consensus among kernel hackers
about changes like:

 > -	if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
 > +	if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
 >  		return -EINVAL;

I understand that new gcc sees that hdr.cmd is unsigned and hence
can't be < 0, and generates a warning for that, and having a build
cluttered with warnings hides bugs and so on.  However the code here
looks quite sensible to me -- otherwise we end up with missing range
checking if hdr.cmd ever changes to a signed type.  This seems like a
good way to introduce bugs: delete valid range checking code to shut
up a silly gcc warning, and then change the type of a variable.

Can't we just make gcc shut up about the comparison and generate no
code for it because it knows it can't be true?

 - R.
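
One possible middle ground, sketched here as an assumption rather than
anything settled in this thread: convert the index to a wide unsigned type
before comparing it against the table size. A negative signed value then
becomes a very large number and fails the upper-bound test, so a single
comparison keeps rejecting both directions even if the field later changes
to a signed type, and no unsigned value is ever compared against zero. The
demo_* names and INDEX_IN_RANGE are hypothetical, and demo_cmd_table only
stands in for a table such as ucma_cmd_table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-in for a command table such as ucma_cmd_table. */
static const char *const demo_cmd_table[] = { "create", "destroy", "bind" };

/*
 * Nonzero when idx is a valid index into a table of n entries.
 * Converting to uintmax_t maps negative signed values to very large
 * numbers, so the one comparison rejects them as well.
 */
#define INDEX_IN_RANGE(idx, n) ((uintmax_t)(idx) < (uintmax_t)(n))

/* Range check when the command field is unsigned. */
int demo_cmd_valid_u32(uint32_t cmd)
{
	return INDEX_IN_RANGE(cmd, ARRAY_SIZE(demo_cmd_table));
}

/* Same check if the field were ever changed to a signed type. */
int demo_cmd_valid_int(int cmd)
{
	return INDEX_IN_RANGE(cmd, ARRAY_SIZE(demo_cmd_table));
}

void demo_range_check(void)
{
	assert(demo_cmd_valid_u32(2));		/* last valid index */
	assert(!demo_cmd_valid_u32(3));		/* one past the end */
	assert(!demo_cmd_valid_int(-1));	/* negative is rejected too */
}
```

The cost is that the intent ("not negative and not too large") is no longer
spelled out in the condition, so a comment like the one on the macro is
doing real work.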


From satyam.sharma at gmail.com  Wed May 30 08:41:29 2007
From: satyam.sharma at gmail.com (Satyam Sharma)
Date: Wed, 30 May 2007 21:11:29 +0530
Subject: [ofa-general] Re: dealing with gcc 'comparison is always false'
	warnings (was: [PATCH] drivers/infiniband: fix comparsion
	between unsigned and negative)
In-Reply-To: <adazm3mqqh3.fsf@cisco.com>
References: <20070530080518.GA29195@nostromo.devel.redhat.com>
	<adazm3mqqh3.fsf@cisco.com>
Message-ID: <a781481a0705300841k2a02335eg66985129eaed28f4@mail.gmail.com>

On 5/30/07, Roland Dreier <rdreier at cisco.com> wrote:
> thanks... I'm wondering if there's a consensus among kernel hackers
> about changes like:
>
>  > -    if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
>  > +    if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
>  >              return -EINVAL;
>
> I understand that new gcc sees that hdr.cmd is unsigned and hence
> can't be < 0, and generates a warning for that, and having a build
> cluttered with warnings hides bugs and so on.  However the code here
> looks quite sensible to me -- otherwise we end up with missing range
> checking if hdr.cmd ever changes to a signed type.  This seems like a
> good way to introduce bugs: delete valid range checking code to shut
> up a silly gcc warning, and then change the type of a variable.

You're *absolutely* correct that these "fixes" that remove such
conditions end up removing range checking, making the code more
flaky / less readable.

However, gcc is _just as correct_. It is only crying about seeing a condition
that the programmer could have written with some purpose in mind but which
is being completely compiled away by it when generating the code because
of it being a tautology / contradiction ...

> Can't we just make gcc shut up about the comparison and generate no
> code for it because it knows it can't be true?

No, shutting gcc up wouldn't be the right thing, IMHO. These warnings are
a good reminder to the programmer to go and see if there is a real bug
somewhere and if something really needs to be done with the code (could
be simply to change the type of a variable to signed that was mistakenly
declared unsigned, f.e.).

But yes, the kind of "fixes" you pointed out that _remove_ these conditions
are definitely *not* what we would want to do.

Satyam


From erezz at voltaire.com  Wed May 30 08:43:49 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 30 May 2007 18:43:49 +0300
Subject: [ofa-general] RE: [PATCH 2/2] IB/iser: add backport &
	kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
References: <20070521114410.GG20400@mellanox.co.il><46557BCB.7030102@voltaire.com><20070524115715.GC4585@mellanox.co.il><465C2D78.30100@voltaire.com><20070529141143.GD27671@mellanox.co.il><465D7046.3080109@voltaire.com>
	<20070530125456.GF9036@mellanox.co.il>
Message-ID: <39C75744D164D948A170E9792AF8E7CA1109F5@exil.voltaire.com>

>> It will also require ugly
>> adjustments like:
>>
>> kernel_patches/backport/2.6.9_U3/iscsi_scsi_addons.patch:
>>
>> diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
>> index e212608..3bf2015 100644
>> --- a/drivers/scsi/Makefile
>> +++ b/drivers/scsi/Makefile
>> @@ -1,2 +1,7 @@
>>  obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
>>  obj-$(CONFIG_ISCSI_TCP)        += libiscsi.o   iscsi_tcp.o
>> +
>> +CFLAGS_attribute_container.o = 
>> -I$(PWD)/kernel_addons/backport/2.6.9_U3/include/src/
>> +
>> +scsi_transport_iscsi-y := scsi_transport_iscsi_f.o scsi.o scsi_lib.o
>> init.o klist.o attribute_container.o transport_class.o
>> +libiscsi-y             := libiscsi_f.o scsi_scan.o
>>
>> (because base.h is in kernel_addons/backport/2.6.9_U3/include/src)
>
> This is one approach, and I think it's not too bad.
> The alternative is to use a relative path in the include directive:
> #include "../drivers/base/attribute_container.c"
> 
> Wouldn't this work?

I am doing that. However, attribute_container.c includes base.h, which is in the kernel_addons dir. Since attribute_container.c is no longer there, I need to add the following line:
 
-I$(PWD)/kernel_addons/backport/2.6.9_U3/include/src/
 
It is not too ugly, so I think we can do that. I will make the required fixes according to this approach.
 
Erez


From rdreier at cisco.com  Wed May 30 08:56:37 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 30 May 2007 08:56:37 -0700
Subject: [ofa-general] Re: dealing with gcc 'comparison is always false'
	warnings
In-Reply-To: <a781481a0705300841k2a02335eg66985129eaed28f4@mail.gmail.com>
	(Satyam Sharma's message of "Wed, 30 May 2007 21:11:29 +0530")
References: <20070530080518.GA29195@nostromo.devel.redhat.com>
	<adazm3mqqh3.fsf@cisco.com>
	<a781481a0705300841k2a02335eg66985129eaed28f4@mail.gmail.com>
Message-ID: <adaveeaqp8q.fsf@cisco.com>

 > However, gcc is _just as correct_. It is only crying about seeing a condition
 > that the programmer could have written with some purpose in mind but which
 > is being completely compiled away by it when generating the code because
 > of it being a tautology / contradiction ...

Well, OK, but there are lots of things gcc could warn about.  How about

	while (1) { ...

By your argument gcc should warn that '1' always evaluates to true.
Or how about

#if 0

why shouldn't the preprocessor warn that the conditional is always false?

 > No, shutting gcc up wouldn't be the right thing, IMHO. These warnings are
 > a good reminder to the programmer to go and see if there is a real bug
 > somewhere and if something really needs to be done with the code (could
 > be simply to change the type of a variable to signed that was mistakenly
 > declared unsigned, f.e.).

OK, but suppose I looked at it and there's no bug.  Leaving the
warning has a cost too: it hides useful warnings (that might be
showing real bugs) in all the clutter.

 - R.


From satyam.sharma at gmail.com  Wed May 30 09:06:05 2007
From: satyam.sharma at gmail.com (Satyam Sharma)
Date: Wed, 30 May 2007 21:36:05 +0530
Subject: [ofa-general] Re: dealing with gcc 'comparison is always false'
	warnings (was: [PATCH] drivers/infiniband: fix comparsion
	between unsigned and negative)
In-Reply-To: <a781481a0705300841k2a02335eg66985129eaed28f4@mail.gmail.com>
References: <20070530080518.GA29195@nostromo.devel.redhat.com>
	<adazm3mqqh3.fsf@cisco.com>
	<a781481a0705300841k2a02335eg66985129eaed28f4@mail.gmail.com>
Message-ID: <a781481a0705300906w69c43925xdfa95682f62c3087@mail.gmail.com>

On 5/30/07, Satyam Sharma <satyam.sharma at gmail.com> wrote:
> On 5/30/07, Roland Dreier <rdreier at cisco.com> wrote:
> > thanks... I'm wondering if there's a consensus among kernel hackers
> > about changes like:
> >
> >  > -    if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
> >  > +    if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
> >  >              return -EINVAL;
> >
> > I understand that new gcc sees that hdr.cmd is unsigned and hence
> > can't be < 0, and generates a warning for that, and having a build
> > cluttered with warnings hides bugs and so on.  However the code here
> > looks quite sensible to me -- otherwise we end up with missing range
> > checking if hdr.cmd ever changes to a signed type.  This seems like a
> > good way to introduce bugs: delete valid range checking code to shut
> > up a silly gcc warning, and then change the type of a variable.
>
> You're *absolutely* correct that these "fixes" that remove such
> conditions end up removing range checking, making the code more
> flaky / less readable.
>
> However, gcc is _just as correct_. It is only crying about seeing a condition
> that the programmer could have written with some purpose in mind but which
> is being completely compiled away by it when generating the code because
> of it being a tautology / contradiction ...
>
> > Can't we just make gcc shut up about the comparison and generate no
> > code for it because it knows it can't be true?

[ BTW gcc already generates no code in such cases, neither for the
condition whose truth value is already known nor for the codepath that
will never be executed as a result. ]

> No, shutting gcc up wouldn't be the right thing, IMHO. These warnings are
> a good reminder to the programmer to go and see if there is a real bug
> somewhere and if something really needs to be done with the code (could
> be simply to change the type of a variable to signed that was mistakenly
> declared unsigned, f.e.).

A common scenario I can imagine for the above is a typo that makes
someone declare a var as size_t when it should've been ssize_t.
This is clearly a real bug that this gcc warning would catch
(though the warning is not enabled by -Wall).

> But yes, the kind of "fixes" you pointed out that _remove_ these conditions
> are definitely *not* what we would want to do.

Erm, to qualify my rather strong opinion above: there could perhaps be
exceptions where the condition being removed could be truly redundant,
of course :-)

Satyam
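
The size_t / ssize_t scenario described above can be sketched as follows.
This is a minimal illustration under stated assumptions: demo_read_fail is
a hypothetical function that always reports failure the way read(2) does,
and the other demo_* names are made up for the example. In the buggy
variant the error check can never fire, which is the condition gcc reports
as always false when the warning is enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t (POSIX) */

/* Hypothetical reader that always fails, standing in for read(2). */
ssize_t demo_read_fail(void)
{
	return -1;
}

/* Buggy: the result lands in a size_t, so (n < 0) can never be true. */
int demo_failure_seen_buggy(void)
{
	size_t n = demo_read_fail();	/* -1 wraps to a huge positive value */

	if (n < 0)			/* always false for an unsigned type */
		return 1;
	return 0;
}

/* Fixed: ssize_t keeps the sign, so the failure is detected. */
int demo_failure_seen_fixed(void)
{
	ssize_t n = demo_read_fail();

	if (n < 0)
		return 1;
	return 0;
}

void demo_typo_check(void)
{
	assert(demo_failure_seen_buggy() == 0);	/* the error goes unnoticed */
	assert(demo_failure_seen_fixed() == 1);	/* the error is caught */
}
```

Here deleting the "n < 0" test would silence the warning but leave the bug
in place; changing the declared type is the actual fix.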


From jwong at datallegro.com  Wed May 30 09:22:07 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Wed, 30 May 2007 12:22:07 -0400
Subject: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with
	kernel2-6.18-8.1.4.el5
In-Reply-To: <200705300843.13575.cap@nsc.liu.se>
Message-ID: <A382D4292574EB47A85B8159A6AED1A101498F4F@FPNYEXCBE02.opus-i.corp>

Is there a workaround for this problem?  Will there be a patch to the
current release?  Will this be fixed in the next release?
Thanks,

Jeff


-----Original Message-----
From: Peter Kjellstrom [mailto:cap at nsc.liu.se] 
Sent: Tuesday, May 29, 2007 11:43 PM
To: general at lists.openfabrics.org
Cc: Scott Weitzenkamp (sweitzen); Jeffrey Wong; Moni Shoua; Moni Levy
Subject: Re: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding
with kernel2-6.18-8.1.4.el5

On Tuesday 29 May 2007, Scott Weitzenkamp (sweitzen) wrote:
> Moni S,
>
> The ib-bonding configuration process seems too picky, should we just
> apply RHEL5 patches if we see a *el5* kernel?  In other words, change:
>
> $ fgrep 2.6.18 linux/configure
>         2.6.18-1.2747.el5*|2.6.18-8.el5*|2.6.18-*.*.fc6)
>
> to:
>
>         2.6.18-*el5*|2.6.18-*.*.fc6)

Why mix in non-el5 kernels? 2.6.18-2747 and similar are beta kernels (fc
naming left-over) and stuff with fc6 in it is clearly not el5. Update
kernels for el5 (before el5u1) should be named 2.6.18-8.x.y.el5, so
maybe: "2.6.18-8.*el5"?

/Peter


From sweitzen at cisco.com  Wed May 30 09:25:12 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 30 May 2007 09:25:12 -0700
Subject: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with
	kernel2-6.18-8.1.4.el5
In-Reply-To: <A382D4292574EB47A85B8159A6AED1A101498F4F@FPNYEXCBE02.opus-i.corp>
References: <200705300843.13575.cap@nsc.liu.se>
	<A382D4292574EB47A85B8159A6AED1A101498F4F@FPNYEXCBE02.opus-i.corp>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA3039BE41D@xmb-sjc-216.amer.cisco.com>

Looks fixed in
http://www.openfabrics.org/builds/ofed-1.2/OFED-1.2-20070530-0809.tgz.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: Jeffrey Wong [mailto:jwong at datallegro.com] 
> Sent: Wednesday, May 30, 2007 9:22 AM
> To: Peter Kjellstrom; general at lists.openfabrics.org
> Cc: Scott Weitzenkamp (sweitzen); Moni Shoua; Moni Levy
> Subject: RE: [ofa-general] Trouble installing OFED1.2-rc3 
> ib-bonding with kernel2-6.18-8.1.4.el5
> 
> Is there a workaround for this problem?  Will there be a patch to the
> current release?  Will this be fixed in the next release?
> Thanks,
> 
> Jeff
> 
> 
> 
> -----Original Message-----
> From: Peter Kjellstrom [mailto:cap at nsc.liu.se] 
> Sent: Tuesday, May 29, 2007 11:43 PM
> To: general at lists.openfabrics.org
> Cc: Scott Weitzenkamp (sweitzen); Jeffrey Wong; Moni Shoua; Moni Levy
> Subject: Re: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding
> with kernel2-6.18-8.1.4.el5
> 
> On Tuesday 29 May 2007, Scott Weitzenkamp (sweitzen) wrote:
> > Moni S,
> >
> > The ib-bonding configuration process seems too picky, should we just
> > apply RHEL5 patches if we see a *el5* kernel?  In other words, change:
> >
> > $ fgrep 2.6.18 linux/configure
> >         2.6.18-1.2747.el5*|2.6.18-8.el5*|2.6.18-*.*.fc6)
> >
> > to:
> >
> >         2.6.18-*el5*|2.6.18-*.*.fc6)
> 
> Why mix in non-el5 kernels? 2.6.18-2747 and similar are beta kernels (fc
> naming left-over) and stuff with fc6 in it is clearly not el5. Update
> kernels for el5 (before el5u1) should be named 2.6.18-8.x.y.el5, so
> maybe: "2.6.18-8.*el5"?
> 
> /Peter
> 


From satyam.sharma at gmail.com  Wed May 30 10:00:14 2007
From: satyam.sharma at gmail.com (Satyam Sharma)
Date: Wed, 30 May 2007 22:30:14 +0530
Subject: [ofa-general] Re: dealing with gcc 'comparison is always false'
	warnings
In-Reply-To: <adaveeaqp8q.fsf@cisco.com>
References: <20070530080518.GA29195@nostromo.devel.redhat.com>
	<adazm3mqqh3.fsf@cisco.com>
	<a781481a0705300841k2a02335eg66985129eaed28f4@mail.gmail.com>
	<adaveeaqp8q.fsf@cisco.com>
Message-ID: <a781481a0705301000r35e2dde7x8c90ba7b0497e4ef@mail.gmail.com>

[ Sorry, the threading broke because the subject changed,
so I missed seeing this mail earlier. ]

On 5/30/07, Roland Dreier <rdreier at cisco.com> wrote:
>  > However, gcc is _just as correct_. It is only crying about seeing a condition
>  > that the programmer could have written with some purpose in mind but which
>  > is being completely compiled away by it when generating the code because
>  > of it being a tautology / contradiction ...
>
> Well, OK, but there are lots of things gcc could warn about.  How about
>
>         while (1) { ...

Umm ... perhaps because gcc does not compile away any code for
such cases, but only the condition? Or because gcc knows this is
a common idiom in a *lot* of C code? I don't know (or care!) ... the
precise cases for which the warning is emitted would be known only
by reading gcc sources (which I have no intention of doing :-)

> By your argument gcc should warn that '1' always evaluates to true.

Note that my "argument" was about conditions that weren't as
simplistic as #if 0 or while (1) and that involved not merely 1 or 0,
but variables whose values might not be available at compile-time ...

> Or how about
>
> #if 0
>
> why shouldn't the preprocessor warn that the conditional is always false?

Perhaps because gcc knows programmers often use this common
method to disable some code?

I can't answer all these questions, of course (better ask the gcc folks),
but I don't care either. Clearly, none of the above are any reason why
gcc should *not* complain when it sees a _seemingly_ meaningful
condition conceivably written by the programmer with something in
mind but being completely optimized away by it.

[ BTW, perhaps the reason why the gcc folks did *not* put a warning
for while (1) or #if 0 is also because they know that programmers often
write such conditions with something meaningful in mind. ]

>  > No, shutting gcc up wouldn't be the right thing, IMHO. These warnings are
>  > a good reminder to the programmer to go and see if there is a real bug
>  > somewhere and if something really needs to be done with the code (could
>  > be simply to change the type of a variable to signed that was mistakenly
>  > declared unsigned, f.e.).
>
> OK, but suppose I looked at it and there's no bug.  Leaving the
> warning has a cost too: it hides useful warnings (that might be
> showing real bugs) in all the clutter.

Agreed, this warning emits a lot of false positives. But this warning isn't
enabled with -Wall either, or is it (now)? I remember the only way to
enable this was with -Wextra, and last I heard the top-level Makefile
did not specify that ... (?)

Satyam


From ralph.campbell at qlogic.com  Wed May 30 10:09:25 2007
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 30 May 2007 10:09:25 -0700
Subject: [ofa-general] Re: [ewg] [PATCH] ofed_1_2/sdp - SDP can lose
	receive data
In-Reply-To: <1180507416.12048.19.camel@localhost>
References: <1180139623.3407.373.camel@brick.pathscale.com>
	<1180256850.15464.1.camel@localhost>
	<1180459488.3407.376.camel@brick.pathscale.com>
	<1180507416.12048.19.camel@localhost>
Message-ID: <1180544965.3407.424.camel@brick.pathscale.com>

I guess I'm still getting used to git.
Somehow, I was looking at an earlier version.
The current code looks OK to me.
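
For anyone following along, here is a rough userspace sketch of the
append-or-queue logic discussed below. It is not the SDP code: struct buf,
struct queue and queue_or_merge() are made-up stand-ins for sk_buff, the
socket receive queue and the function being patched. The point it shows is
that when the tail buffer has no room, the new buffer must be queued rather
than dropped.

```c
#include <stddef.h>
#include <string.h>

#define BUF_CAP   16
#define QUEUE_MAX 8

/* Toy stand-in for an sk_buff: fixed capacity, 'len' bytes in use. */
struct buf {
	char   data[BUF_CAP];
	size_t len;
};

/* Toy stand-in for the socket receive queue. */
struct queue {
	struct buf *slot[QUEUE_MAX];
	size_t      count;
};

static size_t tailroom(const struct buf *b)
{
	return BUF_CAP - b->len;
}

/*
 * Append b's payload to the last queued buffer when it fits; otherwise
 * queue b itself so no data is lost.  Returns the buffer that now holds
 * the data, or NULL if the toy queue is full.
 */
struct buf *queue_or_merge(struct queue *q, struct buf *b)
{
	struct buf *tail = q->count ? q->slot[q->count - 1] : NULL;

	if (b->len && tail && tailroom(tail) >= b->len) {
		memcpy(tail->data + tail->len, b->data, b->len);
		tail->len += b->len;
		b->len = 0;              /* payload moved; caller may free b */
		return tail;
	}
	if (q->count == QUEUE_MAX)
		return NULL;
	q->slot[q->count++] = b;         /* no room in tail: queue b instead */
	return b;
}
```

The single if/else mirrors the shape of the current code quoted below: every
path that does not merge ends in a queue operation.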

On Wed, 2007-05-30 at 09:43 +0300, Ami Perlmutter wrote:
> this is how the code looks now:
> 
> if (likely(skb_len && (tail = skb_peek_tail(&sk->sk_receive_queue))) &&
>     unlikely(skb_tailroom(tail) >= skb_len)) {
> 	skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len);
> 	__kfree_skb(skb);
> 	skb = tail;
> } else
> 	skb_queue_tail(&sk->sk_receive_queue, skb);
> 
> could you point out the problem?
> 
> On Tue, 2007-05-29 at 10:24 -0700, Ralph Campbell wrote:
> > It is from git://git.openfabrics.org/~vlad/ofed_1_2
> > commit 726c6827ac31c0b2f40acd804dc53362289bd21f
> > 
> > On Sun, 2007-05-27 at 12:07 +0300, Ami Perlmutter wrote:
> > > Ralph,
> > > this is how the code is now.
> > > > Where are you getting this code from?
> > > 
> > > On Fri, 2007-05-25 at 17:33 -0700, Ralph Campbell wrote:
> > > > Can this fix be considered for OFED 1.2?
> > > > Thanks.
> > > > 
> > > > 
> > > > If a receive work completion is processed but there is no room
> > > > in a previously queued skb, the data is dropped.
> > > > This patch fixes the problem by queuing the skb.
> > > > 
> > > > Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>
> > > > 
> > > > diff -r 074340d1892d drivers/infiniband/ulp/sdp/sdp_bcopy.c
> > > > --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c	Fri May 25 17:04:51 2007 -0700
> > > > +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c	Fri May 25 17:07:02 2007 -0700
> > > > @@ -314,7 +314,8 @@ static inline struct sk_buff *sdp_sock_q
> > > >  			skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len);
> > > >  			__kfree_skb(skb);
> > > >  			skb = tail;
> > > > -		}
> > > > +		} else
> > > > +			skb_queue_tail(&sk->sk_receive_queue, skb);
> > > >  	} else
> > > >  		skb_queue_tail(&sk->sk_receive_queue, skb);
> > > >  
> > > > 
> > > > 


From halr at voltaire.com  Wed May 30 10:58:02 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 May 2007 13:58:02 -0400
Subject: [ofa-general] [PATCH] ib_types.h: Change macros to convert
	from "host" byte order to "network"
In-Reply-To: <20070522102327.0cea4153.weiny2@llnl.gov>
References: <20070522102327.0cea4153.weiny2@llnl.gov>
Message-ID: <1180547880.7116.81717.camel@hal.voltaire.com>

On Tue, 2007-05-22 at 13:23, Ira Weiny wrote:
> >From 7e53267d5bc9389f5f1a4dae3a2d290c69c6e1d4 Mon Sep 17 00:00:00 2001
> From: Ira K. Weiny <weiny2 at llnl.gov>
> Date: Tue, 24 Apr 2007 16:07:19 -0700
> Subject: [PATCH] Change macros to convert from "host" byte order to "network"
> 
>    Although the macros CL_HTON* and CL_NTOH* are defined to be the same
>    operation it is technically incorrect to convert a constant from network
>    byte order.  The constant should be converted from host byte order to
>    network byte order.
> 
> Signed-off-by: Ira K. Weiny <weiny2 at llnl.gov>

Thanks. Applied.

-- Hal
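
To illustrate the direction argument in the patch description, a small
sketch using the POSIX htonl()/ntohl() pair instead of the complib
CL_HTON*/CL_NTOH* macros (that substitution is mine). EXAMPLE_ATTR_ID and
both function names are hypothetical. A C literal is always in host order,
so the conversion that reads correctly is host-to-network, even though both
calls perform the same byte operation.

```c
#include <arpa/inet.h>   /* htonl, ntohl (POSIX) */
#include <stdint.h>

/* Hypothetical constant; like any C literal it is in host byte order. */
#define EXAMPLE_ATTR_ID 0x00000011u

/* Value as it should be stored in a wire-format (network order) field. */
uint32_t attr_id_wire(void)
{
	return htonl(EXAMPLE_ATTR_ID);   /* host -> network */
}

/* Reading a wire-format field back converts in the other direction. */
uint32_t attr_id_from_wire(uint32_t wire)
{
	return ntohl(wire);              /* network -> host */
}
```

Both functions compile to the same swap (or to nothing on a big-endian
host); the choice of name only documents which side of the wire a value is
on.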


From rdreier at cisco.com  Wed May 30 10:58:49 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 30 May 2007 10:58:49 -0700
Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user
In-Reply-To: <20070529091246.GF8159@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 29 May 2007 12:12:46 +0300")
References: <465AF3D3.10205@dev.mellanox.co.il> <adad50k285u.fsf@cisco.com>
	<20070529091246.GF8159@mellanox.co.il>
Message-ID: <adamyzmqjl2.fsf@cisco.com>

ok, I applied the patch with changes as discussed


From sashak at voltaire.com  Wed May 30 11:23:47 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 30 May 2007 21:23:47 +0300
Subject: [ofa-general] [PATCH] opensm: drop_mgr: clean only associated with
	port physical obj
Message-ID: <20070530182347.GF13193@sashak.voltaire.com>


When removing an osm_port_t, clean up only the associated osm_physp_t object,
not all of the node's osm_physp_t objects. If all osm_physp_t objects should
be removed, do it in the node removal routine.

This fix prevents random crashes in post drop manager flows, when a CA node
had two ports connected and one was disconnected.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_drop_mgr.c |  171 +++++++++++++++++++-----------------------
 1 files changed, 78 insertions(+), 93 deletions(-)

diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c
index 7ec185c..1890696 100644
--- a/opensm/opensm/osm_drop_mgr.c
+++ b/opensm/opensm/osm_drop_mgr.c
@@ -137,6 +137,78 @@ __osm_drop_mgr_remove_router(
   }
 }
 
+
+/**********************************************************************
+ **********************************************************************/
+static void
+drop_mgr_clean_physp(
+  IN const osm_drop_mgr_t* const p_mgr,
+  IN osm_physp_t *p_physp)
+{
+  cl_qmap_t *p_port_guid_tbl = &p_mgr->p_subn->port_guid_tbl;
+  osm_physp_t *p_remote_physp;
+  osm_port_t* p_remote_port;
+
+  p_remote_physp = osm_physp_get_remote( p_physp );
+  if( p_remote_physp && osm_physp_is_valid( p_remote_physp ) )
+  {
+    p_remote_port = (osm_port_t*)cl_qmap_get( p_port_guid_tbl,
+                                              p_remote_physp->port_guid );
+
+    if ( p_remote_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl ) )
+    {
+      /* Let's check if this is a case of link that is lost (both ports
+         weren't recognized), or a "hiccup" in the subnet - in which case
+         the remote port was recognized, and its state is ACTIVE.
+         If this is just a "hiccup" - force a heavy sweep in the next sweep.
+         We don't want to lose that part of the subnet. */
+      if (osm_port_discovery_count_get( p_remote_port ) &&
+          osm_physp_get_port_state( p_remote_physp ) == IB_LINK_ACTIVE )
+      {
+        osm_log( p_mgr->p_log, OSM_LOG_VERBOSE,
+                 "drop_mgr_clean_physp: "
+                 "Forcing delayed heavy sweep. Remote "
+                 "port 0x%016" PRIx64 " port num: 0x%X "
+                 "was recognized in ACTIVE state\n",
+                 cl_ntoh64( p_remote_physp->port_guid ),
+                 p_remote_physp->port_num );
+        p_mgr->p_subn->force_delayed_heavy_sweep = TRUE;
+      }
+
+      /* If the remote node is ca or router - need to remove the remote port,
+         since it is no longer reachable. This can be done if we reset the
+         discovery count of the remote port. */
+      if ( !p_remote_physp->p_node->sw )
+      {
+        osm_port_discovery_count_reset( p_remote_port );
+        osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
+                 "drop_mgr_clean_physp: Resetting discovery count of node: "
+                 "0x%016" PRIx64 " port num:0x%X\n",
+                 cl_ntoh64( osm_node_get_node_guid( p_remote_physp->p_node ) ),
+                 p_remote_physp->port_num );
+      }
+    }
+
+    osm_log( p_mgr->p_log, OSM_LOG_VERBOSE,
+             "drop_mgr_clean_physp: "
+             "Unlinking local node 0x%016" PRIx64 ", port 0x%X"
+             "\n\t\t\t\tand remote node 0x%016" PRIx64 ", port 0x%X\n",
+             cl_ntoh64( osm_node_get_node_guid( p_physp->p_node ) ),
+             p_physp->port_num,
+             cl_ntoh64( osm_node_get_node_guid( p_remote_physp->p_node ) ),
+             p_remote_physp->port_num );
+
+    osm_physp_unlink( p_physp, p_remote_physp );
+
+  }
+
+  osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
+           "drop_mgr_clean_physp: Clearing physical port number 0x%X\n",
+           p_physp->port_num );
+
+  osm_physp_destroy( p_physp );
+}
+
 /**********************************************************************
  **********************************************************************/
 static void
@@ -156,17 +228,11 @@ __osm_drop_mgr_remove_port(
   uint16_t min_lid_ho;
   uint16_t max_lid_ho;
   uint16_t lid_ho;
-  uint32_t port_num;
-  uint32_t remote_port_num;
-  uint32_t num_physp;
   osm_node_t *p_node;
-  osm_node_t *p_remote_node;
-  osm_physp_t *p_physp;
-  osm_physp_t *p_remote_physp;
   osm_remote_sm_t *p_sm;
   ib_gid_t port_gid;
-  ib_mad_notice_attr_t    notice;
-  ib_api_status_t         status;
+  ib_mad_notice_attr_t notice;
+  ib_api_status_t status;
 
   OSM_LOG_ENTER( p_mgr->p_log, __osm_drop_mgr_remove_port );
 
@@ -231,89 +297,7 @@ __osm_drop_mgr_remove_port(
   for( lid_ho = min_lid_ho; lid_ho <= max_lid_ho; lid_ho++ )
     cl_ptr_vector_set( p_port_lid_tbl, lid_ho, NULL );
 
-  /*
-    For each Physical Port associated with this port:
-    Unlink the remote Physical Port, if any
-    Re-initialize each Physical Port.
-  */
-
-  num_physp = osm_node_get_num_physp( p_port->p_node );
-  for( port_num = 0; port_num < num_physp; port_num++ )
-  {
-    p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)port_num );
-
-    if( p_physp && osm_physp_is_valid(p_physp) )
-    {
-      p_remote_physp = osm_physp_get_remote( p_physp );
-      if( p_remote_physp && osm_physp_is_valid( p_remote_physp ) )
-      {
-        osm_port_t* p_remote_port;
-
-        p_node = osm_physp_get_node_ptr( p_physp );
-        p_remote_node = osm_physp_get_node_ptr( p_remote_physp );
-        remote_port_num = osm_physp_get_port_num( p_remote_physp );
-        p_remote_port = (osm_port_t*)cl_qmap_get( p_port_guid_tbl, p_remote_physp->port_guid );
-
-        if ( p_remote_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl ) )
-        {
-          /* Let's check if this is a case of link that is lost (both ports
-             weren't recognized), or a "hiccup" in the subnet - in which case
-             the remote port was recognized, and its state is ACTIVE. 
-             If this is just a "hiccup" - force a heavy sweep in the next sweep.
-             We don't want to lose that part of the subnet. */
-          if (osm_port_discovery_count_get( p_remote_port ) &&
-              osm_physp_get_port_state( p_remote_physp ) == IB_LINK_ACTIVE )
-          {
-            osm_log( p_mgr->p_log, OSM_LOG_VERBOSE,
-                     "__osm_drop_mgr_remove_port: "
-                     "Forcing delayed heavy sweep. Remote "
-                     "port 0x%016" PRIx64 " port num: 0x%X "
-                     "was recognized in ACTIVE state\n",
-                     cl_ntoh64( p_remote_physp->port_guid ),
-                     remote_port_num );
-            p_mgr->p_subn->force_delayed_heavy_sweep = TRUE;
-          }
-        }
-
-        osm_log( p_mgr->p_log, OSM_LOG_VERBOSE,
-                 "__osm_drop_mgr_remove_port: "
-                 "Unlinking local node 0x%016" PRIx64 ", port 0x%X"
-                 "\n\t\t\t\tand remote node 0x%016" PRIx64
-                 ", port 0x%X\n",
-                 cl_ntoh64( osm_node_get_node_guid( p_node ) ),
-                 port_num,
-                 cl_ntoh64( osm_node_get_node_guid( p_remote_node ) ),
-                 remote_port_num );
-
-        osm_node_unlink( p_node, (uint8_t)port_num,
-                         p_remote_node, (uint8_t)remote_port_num );
-
-        /* If the remote node is ca or router - need to remove the remote port,
-           since it is no longer reachable. This can be done if we reset the
-           discovery count of the remote port. */
-        if ( osm_node_get_type( p_remote_node ) != IB_NODE_TYPE_SWITCH )
-        {
-          if ( p_remote_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl ) )
-          {
-            osm_port_discovery_count_reset( p_remote_port );
-            osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
-                     "__osm_drop_mgr_remove_port: "
-                     "Resetting discovery count of node: "
-                     "0x%016" PRIx64 " port num:0x%X\n",
-                     cl_ntoh64( osm_node_get_node_guid( p_remote_node ) ),
-                     remote_port_num );
-          }
-        }
-      }
-
-      osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
-               "__osm_drop_mgr_remove_port: "
-               "Clearing physical port number 0x%X\n",
-               port_num );
-
-      osm_physp_destroy( p_physp );
-    }
-  }
+  drop_mgr_clean_physp(p_mgr, p_port->p_physp);
 
   p_mcm = (osm_mcm_info_t*)cl_qlist_remove_head( &p_port->mcm_list );
   while( p_mcm != (osm_mcm_info_t *)cl_qlist_end( &p_port->mcm_list ) )
@@ -454,6 +438,8 @@ __osm_drop_mgr_process_node(
 
       if( p_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl ) )
         __osm_drop_mgr_remove_port( p_mgr, p_port );
+      else
+        drop_mgr_clean_physp( p_mgr, p_physp );
     }
   }
 
@@ -535,8 +521,7 @@ __osm_drop_mgr_check_node(
    
   port_guid = osm_physp_get_port_guid( p_physp );
 
-  p_port = (osm_port_t*)cl_qmap_get(
-    p_port_guid_tbl, port_guid );
+  p_port = (osm_port_t*)cl_qmap_get( p_port_guid_tbl, port_guid );
 
   if( p_port == (osm_port_t*)cl_qmap_end( p_port_guid_tbl ) )
   {
-- 
1.5.2.109.g802f
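
As a rough illustration of the idea behind the patch above (not OpenSM
code), here is a toy model. struct physp, struct node, struct port and the
helper names are simplified, hypothetical stand-ins for osm_physp_t,
osm_node_t and osm_port_t. It assumes a node with several physical ports and
shows a port removal that touches only the physp belonging to that port.

```c
#include <stddef.h>

#define MAX_PHYSP 4

/* Toy stand-ins; the real OpenSM objects carry far more state. */
struct physp {
	int           valid;
	struct physp *remote;    /* peer on the other end of the link */
};

struct node {
	struct physp physp[MAX_PHYSP];
};

struct port {
	struct node  *node;
	struct physp *physp;     /* the one physp this port owns */
};

/* Unlink one physical port from its peer and mark it destroyed. */
static void clean_physp(struct physp *p)
{
	if (p->remote) {
		p->remote->remote = NULL;
		p->remote = NULL;
	}
	p->valid = 0;
}

/* Dropping a port cleans only its own physp, not every physp of the node. */
void remove_port(struct port *port)
{
	clean_physp(port->physp);
}

/* Count the physical ports of a node that are still valid. */
size_t count_valid(const struct node *n)
{
	size_t i, cnt = 0;

	for (i = 0; i < MAX_PHYSP; i++)
		if (n->physp[i].valid)
			cnt++;
	return cnt;
}
```

With a two-port CA where one port is dropped, the other port's physp and its
link stay intact, which is the situation the commit message describes.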


From tilman at imap.cc  Wed May 30 12:00:57 2007
From: tilman at imap.cc (Tilman Schmidt)
Date: Wed, 30 May 2007 21:00:57 +0200
Subject: [ofa-general] Re: dealing with gcc 'comparison is always false'
	warnings
In-Reply-To: <a781481a0705300841k2a02335eg66985129eaed28f4@mail.gmail.com>
References: <20070530080518.GA29195@nostromo.devel.redhat.com>	
	<adazm3mqqh3.fsf@cisco.com>
	<a781481a0705300841k2a02335eg66985129eaed28f4@mail.gmail.com>
Message-ID: <465DC9E9.3040904@imap.cc>

On 30.05.2007 at 17:41, Satyam Sharma wrote:
> On 5/30/07, Roland Dreier <rdreier at cisco.com> wrote:
>> thanks... I'm wondering if there's a consensus among kernel hackers
>> about changes like:
>>
>>  > -    if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
>>  > +    if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
>>  >              return -EINVAL;
>>
>> I understand that new gcc sees that hdr.cmd is unsigned and hence
>> can't be < 0, and generates a warning for that, and having a build
>> cluttered with warnings hides bugs and so on.  However the code here
>> looks quite sensible to me -- otherwise we end up with missing range
>> checking if hdr.cmd ever changes to a signed type.  This seems like a
>> good way to introduce bugs: delete valid range checking code to shut
>> up a silly gcc warning, and then change the type of a variable.
> 
> You're *absolutely* correct about the issue that these "fixes" that remove
> such conditions end up removing range checking, making the code more
> flaky / less readable.

I disagree. Changing the type of a variable is a significant
modification. If someone does that, he or she *must* check every
use of that variable, at which point he or she will also modify
any range checks accordingly. Having checks that don't fit with
the previous type *distracts* from that job. "Oh, did I modify
that part already? Guess I can skip checking the rest of that
function then." Oops.

Nor is readability a suitable argument. Checking if hdr.cmd is
less than zero gives the misleading impression that it *could*
be less than zero, thus *impairing* readability.
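
A small sketch of what I mean, with a made-up table and function names
(nothing here is taken from the ucma code). For an unsigned index a single
upper-bound comparison already rejects values that started out negative,
because they wrap to very large numbers. If the type were ever changed to
signed, the lower bound would have to be added as part of that change.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical lookup table, for illustration only. */
static const int table[4] = { 10, 20, 30, 40 };

/* Unsigned index: one comparison covers both ends of the range. */
int lookup_unsigned(unsigned int idx)
{
	if (idx >= ARRAY_SIZE(table))
		return -1;
	return table[idx];
}

/* Signed index: the lower bound now needs its own test. */
int lookup_signed(int idx)
{
	if (idx < 0 || (size_t)idx >= ARRAY_SIZE(table))
		return -1;
	return table[idx];
}
```

Passing (unsigned int)-1 to lookup_unsigned() is rejected by the upper-bound
test alone.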

jm2c
T.

-- 
Tilman Schmidt                          E-Mail: tilman at imap.cc
Bonn, Germany
This message consists of 100% recycled bits.
Unopened, best before: (see reverse side)


From notting at redhat.com  Wed May 30 12:09:37 2007
From: notting at redhat.com (Bill Nottingham)
Date: Wed, 30 May 2007 15:09:37 -0400
Subject: [ofa-general] Re: dealing with gcc 'comparison is always false'
	warnings (was:
	[PATCH] drivers/infiniband: fix comparsion between unsigned and
	negative)
In-Reply-To: <a781481a0705300841k2a02335eg66985129eaed28f4@mail.gmail.com>
References: <20070530080518.GA29195@nostromo.devel.redhat.com>
	<adazm3mqqh3.fsf@cisco.com>
	<a781481a0705300841k2a02335eg66985129eaed28f4@mail.gmail.com>
Message-ID: <20070530190937.GA5444@nostromo.devel.redhat.com>

Satyam Sharma (satyam.sharma at gmail.com) said: 
> But yes, the kind of "fixes" you pointed out that _remove_ these conditions
> are definitely *not* what we would want to do.

I can see that - but I think it should at least be brought up for each
warning, to determine either:

1) if it should be ignored
2) if a signed type is actually intended
3) if the code should be elided

While not necessarily in the IB instances, there are cases where there
are entire blocks of code (with debugging output, error returns, etc)
that can never get run, and it may make sense to remove those.

Bill


From satyam.sharma at gmail.com  Wed May 30 12:14:30 2007
From: satyam.sharma at gmail.com (Satyam Sharma)
Date: Thu, 31 May 2007 00:44:30 +0530
Subject: [ofa-general] Re: dealing with gcc 'comparison is always false'
	warnings
In-Reply-To: <465DC9E9.3040904@imap.cc>
References: <20070530080518.GA29195@nostromo.devel.redhat.com>
	<adazm3mqqh3.fsf@cisco.com>
	<a781481a0705300841k2a02335eg66985129eaed28f4@mail.gmail.com>
	<465DC9E9.3040904@imap.cc>
Message-ID: <a781481a0705301214h3295315esa31c40933ae4a539@mail.gmail.com>

Hi,

On 5/31/07, Tilman Schmidt <tilman at imap.cc> wrote:
> On 30.05.2007 at 17:41, Satyam Sharma wrote:
> > On 5/30/07, Roland Dreier <rdreier at cisco.com> wrote:
> >> thanks... I'm wondering if there's a consensus among kernel hackers
> >> about changes like:
> >>
> >>  > -    if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
> >>  > +    if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
> >>  >              return -EINVAL;
> >>
> >> I understand that new gcc sees that hdr.cmd is unsigned and hence
> >> can't be < 0, and generates a warning for that, and having a build
> >> cluttered with warnings hides bugs and so on.  However the code here
> >> looks quite sensible to me -- otherwise we end up with missing range
> >> checking if hdr.cmd ever changes to a signed type.  This seems like a
> >> good way to introduce bugs: delete valid range checking code to shut
> >> up a silly gcc warning, and then change the type of a variable.
> >
> > You're *absolutely* correct about the issue that these "fixes" that remove
> > such conditions end up removing range checking, making the code more
> > flaky / less readable.
>
> I disagree. Changing the type of a variable is a significant
> modification. If someone does that, he or she *must* check every
> use of that variable, at which point he or she will also modify
> any range checks accordingly. Having checks that don't fit with
> the previous type *distracts* from that job. "Oh, did I modify
> that part already? Guess I can skip checking the rest of that
> function then." Oops.

I did not suggest the change-variable-type-from-unsigned-to-signed
thing as a "general" solution to such cases! ... in fact what I said
was that such cases do _not_ have a general solution at all, and
that shutting gcc up might not be a good idea, because a lot of
times such warnings do un-hide bugs. [ BTW when I gave the
change-type-from-unsigned-to-signed example, I had the size_t vs
ssize_t typo/bug in mind, for which changing the type is the proper
fix; and note that similar bugs can occur for non-size_t cases too. ]
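
To make the size_t vs ssize_t case concrete, a minimal sketch.
failing_read() is a hypothetical stand-in for a read(2)-style call that
returns -1 on error; the two check functions are made up as well. In the
buggy variant the error test is exactly the "comparison is always false"
that gcc warns about, and here the warning points at a real bug.

```c
#include <stddef.h>
#include <sys/types.h>   /* ssize_t (POSIX) */

/* Hypothetical stand-in for a read(2)-style call that fails. */
static ssize_t failing_read(void)
{
	return -1;
}

/* Buggy: -1 stored in a size_t becomes SIZE_MAX, so 'n < 0' never fires. */
int buggy_check(void)
{
	size_t n = failing_read();

	if (n < 0)
		return -1;
	return 0;        /* the error goes unnoticed */
}

/* Fixed: keep the signed type the call actually returns. */
int fixed_check(void)
{
	ssize_t n = failing_read();

	if (n < 0)
		return -1;
	return 0;
}
```

Deleting the 'n < 0' test in buggy_check() would silence the warning and
keep the bug; changing the type is the fix.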

> Nor is readability a suitable argument. Checking if hdr.cmd is
> less than zero gives the misleading impression that it *could*
> be less than zero, thus *impairing* readability.

Hmmm, but I tend to agree with the sentiment expressed in:
http://lkml.org/lkml/2006/11/28/206

Satyam


From satyam.sharma at gmail.com  Wed May 30 12:23:20 2007
From: satyam.sharma at gmail.com (Satyam Sharma)
Date: Thu, 31 May 2007 00:53:20 +0530
Subject: [ofa-general] Re: dealing with gcc 'comparison is always false'
	warnings (was: [PATCH] drivers/infiniband: fix comparsion
	between unsigned and negative)
In-Reply-To: <20070530190937.GA5444@nostromo.devel.redhat.com>
References: <20070530080518.GA29195@nostromo.devel.redhat.com>
	<adazm3mqqh3.fsf@cisco.com>
	<a781481a0705300841k2a02335eg66985129eaed28f4@mail.gmail.com>
	<20070530190937.GA5444@nostromo.devel.redhat.com>
Message-ID: <a781481a0705301223k1b52049eo33606cc158b00222@mail.gmail.com>

Hi Bill,

On 5/31/07, Bill Nottingham <notting at redhat.com> wrote:
> Satyam Sharma (satyam.sharma at gmail.com) said:
> > But yes, the kind of "fixes" you pointed out that _remove_ these conditions
> > are definitely *not* what we would want to do.
>
> I can see that - but I think it should at least be brought up for each
> warning, to determine either:
>
> 1) if it should be ignored
> 2) if a signed type is actually intended
> 3) if the code should be elided

Agreed. The extract you quoted above was unnecessarily strongly worded
and a wrong generalization; I corrected it later.

> While not necessarily in the IB instances, there are cases where there
> are entire blocks of code (with debugging output, error returns, etc)
> that can never get run, and it may make sense to remove those.

Agreed, again.

Satyam


From halr at voltaire.com  Wed May 30 12:51:45 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 May 2007 15:51:45 -0400
Subject: [ofa-general] Re: [PATCH] opensm: drop_mgr: clean only associated
	with port physical obj
In-Reply-To: <20070530182347.GF13193@sashak.voltaire.com>
References: <20070530182347.GF13193@sashak.voltaire.com>
Message-ID: <1180554704.7116.88960.camel@hal.voltaire.com>

On Wed, 2007-05-30 at 14:23, Sasha Khapyorsky wrote:
> When removing an osm_port_t, clean up only the associated osm_physp_t object,
> not all of the node's osm_physp_t objects. If all osm_physp_t objects should
> be removed, do it in the node removal routine.
> 
> This fix prevents random crashes in post drop manager flows, when a CA node
> had two ports connected and one was disconnected.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From sean.hefty at intel.com  Wed May 30 13:23:06 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 30 May 2007 13:23:06 -0700
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com>
Message-ID: <000801c7a2f8$55749000$3c98070a@amr.corp.intel.com>

>OK, I will post a patch related to this soon.
>How will the static PR file be generated? This needs to be discussed.

Please look at my latest changes to the local SA when generating the patches.

git://git.openfabrics.org/~shefty/rdma-dev.git sa_cache

I'm not sure about the best way to communicate PRs to the cache.  I haven't
given it more than about 2 minutes of thought, but as an idea, we could look at
trying to make use of the userspace MAD interface.  For example, we could send
MADs to the local SA with the PRs to load.  More details would obviously need to
be worked out, but this could provide an extensible solution.

- Sean


From sean.hefty at intel.com  Wed May 30 13:34:13 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 30 May 2007 13:34:13 -0700
Subject: [ofa-general] [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local path
	record caching
Message-ID: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com>

I've updated the local SA patches based on previous feedback.
The most significant change is to integrate the local SA with
the ib_sa module.  This allows all apps to make use of the local
SA without changes.

The use of a device file was also replaced with simple module
parameters.

I've also pushed these changes to:

	git://git.openfabrics.org/~shefty/rdma-dev.git sa_cache

I would like to close any open issues with this approach in time
to pull it into 2.6.23.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>


From sean.hefty at intel.com  Wed May 30 13:36:37 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 30 May 2007 13:36:37 -0700
Subject: [ofa-general] [RFC] [PATCH 1/2] for 2.6.23: ib/sa - add
	InformInfo/Notice support
In-Reply-To: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com>
Message-ID: <000a01c7a2fa$383b42c0$3c98070a@amr.corp.intel.com>

Add SA client support for notice/trap registration using InformInfo.
Clients can use the ib_sa interface to register for SA events based
on trap numbers, and receive SA event notification.  This allows
clients to receive notification, such as GID in/out of service.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 drivers/infiniband/core/Makefile   |    2 
 drivers/infiniband/core/notice.c   |  749 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/sa.h       |   16 +
 drivers/infiniband/core/sa_query.c |  316 +++++++++++++++
 include/rdma/ib_sa.h               |  171 ++++++++
 5 files changed, 1251 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index cb1ab3e..7c5b5ed 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -13,7 +13,7 @@ ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 
 ib_mad-y :=			mad.o smi.o agent.o mad_rmpp.o
 
-ib_sa-y :=			sa_query.o multicast.o
+ib_sa-y :=			sa_query.o multicast.o notice.o
 
 ib_cm-y :=			cm.o
 
diff --git a/drivers/infiniband/core/notice.c b/drivers/infiniband/core/notice.c
new file mode 100644
index 0000000..e4c73c8
--- /dev/null
+++ b/drivers/infiniband/core/notice.c
@@ -0,0 +1,749 @@
+/*
+ * Copyright (c) 2006 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/completion.h>
+#include <linux/dma-mapping.h>
+#include <linux/err.h>
+#include <linux/interrupt.h>
+#include <linux/pci.h>
+#include <linux/bitops.h>
+#include <linux/random.h>
+
+#include "sa.h"
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("InfiniBand InformInfo & Notice event handling");
+MODULE_LICENSE("Dual BSD/GPL");
+
+static void inform_add_one(struct ib_device *device);
+static void inform_remove_one(struct ib_device *device);
+
+static struct ib_client inform_client = {
+	.name   = "ib_notice",
+	.add    = inform_add_one,
+	.remove = inform_remove_one
+};
+
+static struct ib_sa_client	sa_client;
+static struct workqueue_struct	*inform_wq;
+
+struct inform_device;
+
+struct inform_port {
+	struct inform_device	*dev;
+	spinlock_t		lock;
+	struct rb_root		table;
+	atomic_t		refcount;
+	struct completion	comp;
+	u8			port_num;
+};
+
+struct inform_device {
+	struct ib_device	*device;
+	struct ib_event_handler	event_handler;
+	int			start_port;
+	int			end_port;
+	struct inform_port	port[0];
+};
+
+enum inform_state {
+	INFORM_IDLE,
+	INFORM_REGISTERING,
+	INFORM_MEMBER,
+	INFORM_BUSY,
+	INFORM_ERROR
+};
+
+struct inform_member;
+
+struct inform_group {
+	u16			trap_number;
+	struct rb_node		node;
+	struct inform_port	*port;
+	spinlock_t		lock;
+	struct work_struct	work;
+	struct list_head	pending_list;
+	struct list_head	active_list;
+	struct list_head	notice_list;
+	struct inform_member	*last_join;
+	int			members;
+	enum inform_state	join_state; /* State relative to SA */
+	atomic_t		refcount;
+	enum inform_state	state;
+	struct ib_sa_query	*query;
+	int			query_id;
+};
+
+struct inform_member {
+	struct ib_inform_info	info;
+	struct ib_sa_client	*client;
+	struct inform_group	*group;
+	struct list_head	list;
+	enum inform_state	state;
+	atomic_t		refcount;
+	struct completion	comp;
+};
+
+struct inform_notice {
+	struct list_head	list;
+	struct ib_sa_notice	notice;
+};
+
+static void reg_handler(int status, struct ib_sa_inform *inform,
+			 void *context);
+static void unreg_handler(int status, struct ib_sa_inform *inform,
+			  void *context);
+
+static struct inform_group *inform_find(struct inform_port *port,
+					u16 trap_number)
+{
+	struct rb_node *node = port->table.rb_node;
+	struct inform_group *group;
+
+	while (node) {
+		group = rb_entry(node, struct inform_group, node);
+		if (trap_number < group->trap_number)
+			node = node->rb_left;
+		else if (trap_number > group->trap_number)
+			node = node->rb_right;
+		else
+			return group;
+	}
+	return NULL;
+}
+
+static struct inform_group *inform_insert(struct inform_port *port,
+					  struct inform_group *group)
+{
+	struct rb_node **link = &port->table.rb_node;
+	struct rb_node *parent = NULL;
+	struct inform_group *cur_group;
+
+	while (*link) {
+		parent = *link;
+		cur_group = rb_entry(parent, struct inform_group, node);
+		if (group->trap_number < cur_group->trap_number)
+			link = &(*link)->rb_left;
+		else if (group->trap_number > cur_group->trap_number)
+			link = &(*link)->rb_right;
+		else
+			return cur_group;
+	}
+	rb_link_node(&group->node, parent, link);
+	rb_insert_color(&group->node, &port->table);
+	return NULL;
+}
+
+static void deref_port(struct inform_port *port)
+{
+	if (atomic_dec_and_test(&port->refcount))
+		complete(&port->comp);
+}
+
+static void release_group(struct inform_group *group)
+{
+	struct inform_port *port = group->port;
+	unsigned long flags;
+
+	spin_lock_irqsave(&port->lock, flags);
+	if (atomic_dec_and_test(&group->refcount)) {
+		rb_erase(&group->node, &port->table);
+		spin_unlock_irqrestore(&port->lock, flags);
+		kfree(group);
+		deref_port(port);
+	} else
+		spin_unlock_irqrestore(&port->lock, flags);
+}
+
+static void deref_member(struct inform_member *member)
+{
+	if (atomic_dec_and_test(&member->refcount))
+		complete(&member->comp);
+}
+
+static void queue_reg(struct inform_member *member)
+{
+	struct inform_group *group = member->group;
+	unsigned long flags;
+
+	spin_lock_irqsave(&group->lock, flags);
+	list_add(&member->list, &group->pending_list);
+	if (group->state == INFORM_IDLE) {
+		group->state = INFORM_BUSY;
+		atomic_inc(&group->refcount);
+		queue_work(inform_wq, &group->work);
+	}
+	spin_unlock_irqrestore(&group->lock, flags);
+}
+
+static int send_reg(struct inform_group *group, struct inform_member *member)
+{
+	struct inform_port *port = group->port;
+	struct ib_sa_inform inform;
+	int ret;
+
+	memset(&inform, 0, sizeof inform);
+	inform.lid_range_begin = cpu_to_be16(0xFFFF);
+	inform.is_generic = 1;
+	inform.subscribe = 1;
+	inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL);
+	inform.trap.generic.trap_num = cpu_to_be16(member->info.trap_number);
+	inform.trap.generic.resp_time = 19;
+	inform.trap.generic.producer_type =
+				cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL);
+
+	group->last_join = member;
+	ret = ib_sa_informinfo_query(&sa_client, port->dev->device,
+				     port->port_num, &inform, 3000, GFP_KERNEL,
+				     reg_handler, group, &group->query);
+	if (ret >= 0) {
+		group->query_id = ret;
+		ret = 0;
+	}
+	return ret;
+}
+
+static int send_unreg(struct inform_group *group)
+{
+	struct inform_port *port = group->port;
+	struct ib_sa_inform inform;
+	int ret;
+
+	memset(&inform, 0, sizeof inform);
+	inform.lid_range_begin = cpu_to_be16(0xFFFF);
+	inform.is_generic = 1;
+	inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL);
+	inform.trap.generic.trap_num = cpu_to_be16(group->trap_number);
+	inform.trap.generic.qpn = IB_QP1;
+	inform.trap.generic.resp_time = 19;
+	inform.trap.generic.producer_type =
+				cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL);
+
+	ret = ib_sa_informinfo_query(&sa_client, port->dev->device,
+				     port->port_num, &inform, 3000, GFP_KERNEL,
+				     unreg_handler, group, &group->query);
+	if (ret >= 0) {
+		group->query_id = ret;
+		ret = 0;
+	}
+	return ret;
+}
+
+static void join_group(struct inform_group *group, struct inform_member *member)
+{
+	member->state = INFORM_MEMBER;
+	group->members++;
+	list_move(&member->list, &group->active_list);
+}
+
+static int fail_join(struct inform_group *group, struct inform_member *member,
+		     int status)
+{
+	spin_lock_irq(&group->lock);
+	list_del_init(&member->list);
+	spin_unlock_irq(&group->lock);
+	return member->info.callback(status, &member->info, NULL);
+}
+
+static void process_group_error(struct inform_group *group)
+{
+	struct inform_member *member;
+	int ret;
+
+	spin_lock_irq(&group->lock);
+	while (!list_empty(&group->active_list)) {
+		member = list_entry(group->active_list.next,
+				    struct inform_member, list);
+		atomic_inc(&member->refcount);
+		list_del_init(&member->list);
+		group->members--;
+		member->state = INFORM_ERROR;
+		spin_unlock_irq(&group->lock);
+
+		ret = member->info.callback(-ENETRESET, &member->info, NULL);
+		deref_member(member);
+		if (ret)
+			ib_sa_unregister_inform_info(&member->info);
+		spin_lock_irq(&group->lock);
+	}
+
+	group->join_state = INFORM_IDLE;
+	group->state = INFORM_BUSY;
+	spin_unlock_irq(&group->lock);
+}
+
+/*
+ * Report a notice to all active subscribers.  We use a temporary list to
+ * handle unsubscription requests while the notice is being reported, which
+ * avoids holding the group lock while in the user's callback.
+ */
+static void process_notice(struct inform_group *group,
+			   struct inform_notice *info_notice)
+{
+	struct inform_member *member;
+	struct list_head list;
+	int ret;
+
+	INIT_LIST_HEAD(&list);
+
+	spin_lock_irq(&group->lock);
+	list_splice_init(&group->active_list, &list);
+	while (!list_empty(&list)) {
+
+		member = list_entry(list.next, struct inform_member, list);
+		atomic_inc(&member->refcount);
+		list_move(&member->list, &group->active_list);
+		spin_unlock_irq(&group->lock);
+
+		ret = member->info.callback(0, &member->info,
+					    &info_notice->notice);
+		deref_member(member);
+		if (ret)
+			ib_sa_unregister_inform_info(&member->info);
+		spin_lock_irq(&group->lock);
+	}
+	spin_unlock_irq(&group->lock);
+}
+
+static void inform_work_handler(struct work_struct *work)
+{
+	struct inform_group *group;
+	struct inform_member *member;
+	struct ib_inform_info *info;
+	struct inform_notice *info_notice;
+	int status, ret;
+
+	group = container_of(work, typeof(*group), work);
+retest:
+	spin_lock_irq(&group->lock);
+	while (!list_empty(&group->pending_list) ||
+	       !list_empty(&group->notice_list) ||
+	       (group->state == INFORM_ERROR)) {
+
+		if (group->state == INFORM_ERROR) {
+			spin_unlock_irq(&group->lock);
+			process_group_error(group);
+			goto retest;
+		}
+
+		if (!list_empty(&group->notice_list)) {
+			info_notice = list_entry(group->notice_list.next,
+						 struct inform_notice, list);
+			list_del(&info_notice->list);
+			spin_unlock_irq(&group->lock);
+			process_notice(group, info_notice);
+			kfree(info_notice);
+			goto retest;
+		}
+
+		member = list_entry(group->pending_list.next,
+				    struct inform_member, list);
+		info = &member->info;
+		atomic_inc(&member->refcount);
+
+		if (group->join_state == INFORM_MEMBER) {
+			join_group(group, member);
+			spin_unlock_irq(&group->lock);
+			ret = info->callback(0, info, NULL);
+		} else {
+			spin_unlock_irq(&group->lock);
+			status = send_reg(group, member);
+			if (!status) {
+				deref_member(member);
+				return;
+			}
+			ret = fail_join(group, member, status);
+		}
+
+		deref_member(member);
+		if (ret)
+			ib_sa_unregister_inform_info(&member->info);
+		spin_lock_irq(&group->lock);
+	}
+
+	if (!group->members && (group->join_state == INFORM_MEMBER)) {
+		group->join_state = INFORM_IDLE;
+		spin_unlock_irq(&group->lock);
+		if (send_unreg(group))
+			goto retest;
+	} else {
+		group->state = INFORM_IDLE;
+		spin_unlock_irq(&group->lock);
+		release_group(group);
+	}
+}
+
+/*
+ * Fail a join request if it is still active - at the head of the pending queue.
+ */
+static void process_join_error(struct inform_group *group, int status)
+{
+	struct inform_member *member;
+	int ret;
+
+	spin_lock_irq(&group->lock);
+	member = list_entry(group->pending_list.next,
+			    struct inform_member, list);
+	if (group->last_join == member) {
+		atomic_inc(&member->refcount);
+		list_del_init(&member->list);
+		spin_unlock_irq(&group->lock);
+		ret = member->info.callback(status, &member->info, NULL);
+		deref_member(member);
+		if (ret)
+			ib_sa_unregister_inform_info(&member->info);
+	} else
+		spin_unlock_irq(&group->lock);
+}
+
+static void reg_handler(int status, struct ib_sa_inform *inform, void *context)
+{
+	struct inform_group *group = context;
+
+	if (status)
+		process_join_error(group, status);
+	else
+		group->join_state = INFORM_MEMBER;
+
+	inform_work_handler(&group->work);
+}
+
+static void unreg_handler(int status, struct ib_sa_inform *rec, void *context)
+{
+	struct inform_group *group = context;
+
+	inform_work_handler(&group->work);
+}
+
+int notice_dispatch(struct ib_device *device, u8 port_num,
+		    struct ib_sa_notice *notice)
+{
+	struct inform_device *dev;
+	struct inform_port *port;
+	struct inform_group *group;
+	struct inform_notice *info_notice;
+
+	dev = ib_get_client_data(device, &inform_client);
+	if (!dev)
+		return 0; /* No one to give notice to. */
+
+	port = &dev->port[port_num - dev->start_port];
+	spin_lock_irq(&port->lock);
+	group = inform_find(port, __be16_to_cpu(notice->trap.
+						generic.trap_num));
+	if (!group) {
+		spin_unlock_irq(&port->lock);
+		return 0;
+	}
+
+	atomic_inc(&group->refcount);
+	spin_unlock_irq(&port->lock);
+
+	info_notice = kmalloc(sizeof *info_notice, GFP_KERNEL);
+	if (!info_notice) {
+		release_group(group);
+		return -ENOMEM;
+	}
+
+	info_notice->notice = *notice;
+
+	spin_lock_irq(&group->lock);
+	list_add(&info_notice->list, &group->notice_list);
+	if (group->state == INFORM_IDLE) {
+		group->state = INFORM_BUSY;
+		spin_unlock_irq(&group->lock);
+		inform_work_handler(&group->work);
+	} else {
+		spin_unlock_irq(&group->lock);
+		release_group(group);
+	}
+
+	return 0;
+}
+
+static struct inform_group *acquire_group(struct inform_port *port,
+					  u16 trap_number, gfp_t gfp_mask)
+{
+	struct inform_group *group, *cur_group;
+	unsigned long flags;
+
+	spin_lock_irqsave(&port->lock, flags);
+	group = inform_find(port, trap_number);
+	if (group)
+		goto found;
+	spin_unlock_irqrestore(&port->lock, flags);
+
+	group = kzalloc(sizeof *group, gfp_mask);
+	if (!group)
+		return NULL;
+
+	group->port = port;
+	group->trap_number = trap_number;
+	INIT_LIST_HEAD(&group->pending_list);
+	INIT_LIST_HEAD(&group->active_list);
+	INIT_LIST_HEAD(&group->notice_list);
+	INIT_WORK(&group->work, inform_work_handler);
+	spin_lock_init(&group->lock);
+
+	spin_lock_irqsave(&port->lock, flags);
+	cur_group = inform_insert(port, group);
+	if (cur_group) {
+		kfree(group);
+		group = cur_group;
+	} else
+		atomic_inc(&port->refcount);
+found:
+	atomic_inc(&group->refcount);
+	spin_unlock_irqrestore(&port->lock, flags);
+	return group;
+}
+
+/*
+ * We serialize all join requests to a single group to make our lives much
+ * easier.  Otherwise, two users could try to join the same group
+ * simultaneously, with different configurations, one could leave while the
+ * join is in progress, etc., which makes locking around error recovery
+ * difficult.
+ */
+struct ib_inform_info *
+ib_sa_register_inform_info(struct ib_sa_client *client,
+			   struct ib_device *device, u8 port_num,
+			   u16 trap_number, gfp_t gfp_mask,
+			   int (*callback)(int status,
+					   struct ib_inform_info *info,
+					   struct ib_sa_notice *notice),
+			   void *context)
+{
+	struct inform_device *dev;
+	struct inform_member *member;
+	struct ib_inform_info *info;
+	int ret;
+
+	dev = ib_get_client_data(device, &inform_client);
+	if (!dev)
+		return ERR_PTR(-ENODEV);
+
+	member = kzalloc(sizeof *member, gfp_mask);
+	if (!member)
+		return ERR_PTR(-ENOMEM);
+
+	ib_sa_client_get(client);
+	member->client = client;
+	member->info.trap_number = trap_number;
+	member->info.callback = callback;
+	member->info.context = context;
+	init_completion(&member->comp);
+	atomic_set(&member->refcount, 1);
+	member->state = INFORM_REGISTERING;
+
+	member->group = acquire_group(&dev->port[port_num - dev->start_port],
+				      trap_number, gfp_mask);
+	if (!member->group) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	/*
+	 * The user will get the info structure in their callback.  They
+	 * could then free the info structure before we can return from
+	 * this routine.  So we save the pointer to return before queuing
+	 * any callback.
+	 */
+	info = &member->info;
+	queue_reg(member);
+	return info;
+
+err:
+	ib_sa_client_put(member->client);
+	kfree(member);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(ib_sa_register_inform_info);
+
+void ib_sa_unregister_inform_info(struct ib_inform_info *info)
+{
+	struct inform_member *member;
+	struct inform_group *group;
+
+	member = container_of(info, struct inform_member, info);
+	group = member->group;
+
+	spin_lock_irq(&group->lock);
+	if (member->state == INFORM_MEMBER)
+		group->members--;
+
+	list_del_init(&member->list);
+
+	if (group->state == INFORM_IDLE) {
+		group->state = INFORM_BUSY;
+		spin_unlock_irq(&group->lock);
+		/* Continue to hold reference on group until callback */
+		queue_work(inform_wq, &group->work);
+	} else {
+		spin_unlock_irq(&group->lock);
+		release_group(group);
+	}
+
+	deref_member(member);
+	wait_for_completion(&member->comp);
+	ib_sa_client_put(member->client);
+	kfree(member);
+}
+EXPORT_SYMBOL(ib_sa_unregister_inform_info);
+
+static void inform_groups_lost(struct inform_port *port)
+{
+	struct inform_group *group;
+	struct rb_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&port->lock, flags);
+	for (node = rb_first(&port->table); node; node = rb_next(node)) {
+		group = rb_entry(node, struct inform_group, node);
+		spin_lock(&group->lock);
+		if (group->state == INFORM_IDLE) {
+			atomic_inc(&group->refcount);
+			queue_work(inform_wq, &group->work);
+		}
+		group->state = INFORM_ERROR;
+		spin_unlock(&group->lock);
+	}
+	spin_unlock_irqrestore(&port->lock, flags);
+}
+
+static void inform_event_handler(struct ib_event_handler *handler,
+				struct ib_event *event)
+{
+	struct inform_device *dev;
+
+	dev = container_of(handler, struct inform_device, event_handler);
+
+	switch (event->event) {
+	case IB_EVENT_PORT_ERR:
+	case IB_EVENT_LID_CHANGE:
+	case IB_EVENT_SM_CHANGE:
+	case IB_EVENT_CLIENT_REREGISTER:
+		inform_groups_lost(&dev->port[event->element.port_num -
+					      dev->start_port]);
+		break;
+	default:
+		break;
+	}
+}
+
+static void inform_add_one(struct ib_device *device)
+{
+	struct inform_device *dev;
+	struct inform_port *port;
+	int i;
+
+	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+		return;
+
+	dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port,
+		      GFP_KERNEL);
+	if (!dev)
+		return;
+
+	if (device->node_type == RDMA_NODE_IB_SWITCH)
+		dev->start_port = dev->end_port = 0;
+	else {
+		dev->start_port = 1;
+		dev->end_port = device->phys_port_cnt;
+	}
+
+	for (i = 0; i <= dev->end_port - dev->start_port; i++) {
+		port = &dev->port[i];
+		port->dev = dev;
+		port->port_num = dev->start_port + i;
+		spin_lock_init(&port->lock);
+		port->table = RB_ROOT;
+		init_completion(&port->comp);
+		atomic_set(&port->refcount, 1);
+	}
+
+	dev->device = device;
+	ib_set_client_data(device, &inform_client, dev);
+
+	INIT_IB_EVENT_HANDLER(&dev->event_handler, device, inform_event_handler);
+	ib_register_event_handler(&dev->event_handler);
+}
+
+static void inform_remove_one(struct ib_device *device)
+{
+	struct inform_device *dev;
+	struct inform_port *port;
+	int i;
+
+	dev = ib_get_client_data(device, &inform_client);
+	if (!dev)
+		return;
+
+	ib_unregister_event_handler(&dev->event_handler);
+	flush_workqueue(inform_wq);
+
+	for (i = 0; i <= dev->end_port - dev->start_port; i++) {
+		port = &dev->port[i];
+		deref_port(port);
+		wait_for_completion(&port->comp);
+	}
+
+	kfree(dev);
+}
+
+int notice_init(void)
+{
+	int ret;
+
+	inform_wq = create_singlethread_workqueue("ib_inform");
+	if (!inform_wq)
+		return -ENOMEM;
+
+	ib_sa_register_client(&sa_client);
+
+	ret = ib_register_client(&inform_client);
+	if (ret)
+		goto err;
+	return 0;
+
+err:
+	ib_sa_unregister_client(&sa_client);
+	destroy_workqueue(inform_wq);
+	return ret;
+}
+
+void notice_cleanup(void)
+{
+	ib_unregister_client(&inform_client);
+	ib_sa_unregister_client(&sa_client);
+	destroy_workqueue(inform_wq);
+}
diff --git a/drivers/infiniband/core/sa.h b/drivers/infiniband/core/sa.h
index 24c93fd..b8eac66 100644
--- a/drivers/infiniband/core/sa.h
+++ b/drivers/infiniband/core/sa.h
@@ -63,4 +63,20 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
 int mcast_init(void);
 void mcast_cleanup(void);
 
+int ib_sa_informinfo_query(struct ib_sa_client *client,
+			   struct ib_device *device, u8 port_num,
+			   struct ib_sa_inform *rec,
+			   int timeout_ms, gfp_t gfp_mask,
+			   void (*callback)(int status,
+					    struct ib_sa_inform *resp,
+					    void *context),
+			   void *context,
+			   struct ib_sa_query **sa_query);
+
+int notice_dispatch(struct ib_device *device, u8 port_num,
+		    struct ib_sa_notice *notice);
+
+int notice_init(void);
+void notice_cleanup(void);
+
 #endif /* SA_H */
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 6469406..369fe60 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -61,10 +61,12 @@ struct ib_sa_sm_ah {
 
 struct ib_sa_port {
 	struct ib_mad_agent *agent;
+	struct ib_mad_agent *notice_agent;
 	struct ib_sa_sm_ah  *sm_ah;
 	struct work_struct   update_task;
 	spinlock_t           ah_lock;
 	u8                   port_num;
+	struct ib_device    *device;
 };
 
 struct ib_sa_device {
@@ -101,6 +103,12 @@ struct ib_sa_mcmember_query {
 	struct ib_sa_query sa_query;
 };
 
+struct ib_sa_inform_query {
+	void (*callback)(int, struct ib_sa_inform *, void *);
+	void *context;
+	struct ib_sa_query sa_query;
+};
+
 static void ib_sa_add_one(struct ib_device *device);
 static void ib_sa_remove_one(struct ib_device *device);
 
@@ -352,6 +360,110 @@ static const struct ib_field service_rec_table[] = {
 	  .size_bits    = 2*64 },
 };
 
+#define INFORM_FIELD(field) \
+	.struct_offset_bytes = offsetof(struct ib_sa_inform, field), \
+	.struct_size_bytes   = sizeof ((struct ib_sa_inform *) 0)->field, \
+	.field_name          = "sa_inform:" #field
+
+static const struct ib_field inform_table[] = {
+	{ INFORM_FIELD(gid),
+	  .offset_words = 0,
+	  .offset_bits  = 0,
+	  .size_bits    = 128 },
+	{ INFORM_FIELD(lid_range_begin),
+	  .offset_words = 4,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ INFORM_FIELD(lid_range_end),
+	  .offset_words = 4,
+	  .offset_bits  = 16,
+	  .size_bits    = 16 },
+	{ RESERVED,
+	  .offset_words = 5,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ INFORM_FIELD(is_generic),
+	  .offset_words = 5,
+	  .offset_bits  = 16,
+	  .size_bits    = 8 },
+	{ INFORM_FIELD(subscribe),
+	  .offset_words = 5,
+	  .offset_bits  = 24,
+	  .size_bits    = 8 },
+	{ INFORM_FIELD(type),
+	  .offset_words = 6,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ INFORM_FIELD(trap.generic.trap_num),
+	  .offset_words = 6,
+	  .offset_bits  = 16,
+	  .size_bits    = 16 },
+	{ INFORM_FIELD(trap.generic.qpn),
+	  .offset_words = 7,
+	  .offset_bits  = 0,
+	  .size_bits    = 24 },
+	{ RESERVED,
+	  .offset_words = 7,
+	  .offset_bits  = 24,
+	  .size_bits    = 3 },
+	{ INFORM_FIELD(trap.generic.resp_time),
+	  .offset_words = 7,
+	  .offset_bits  = 27,
+	  .size_bits    = 5 },
+	{ RESERVED,
+	  .offset_words = 8,
+	  .offset_bits  = 0,
+	  .size_bits    = 8 },
+	{ INFORM_FIELD(trap.generic.producer_type),
+	  .offset_words = 8,
+	  .offset_bits  = 8,
+	  .size_bits    = 24 },
+};
+
+#define NOTICE_FIELD(field) \
+	.struct_offset_bytes = offsetof(struct ib_sa_notice, field), \
+	.struct_size_bytes   = sizeof ((struct ib_sa_notice *) 0)->field, \
+	.field_name          = "sa_notice:" #field
+
+static const struct ib_field notice_table[] = {
+	{ NOTICE_FIELD(is_generic),
+	  .offset_words = 0,
+	  .offset_bits  = 0,
+	  .size_bits    = 1 },
+	{ NOTICE_FIELD(type),
+	  .offset_words = 0,
+	  .offset_bits  = 1,
+	  .size_bits    = 7 },
+	{ NOTICE_FIELD(trap.generic.producer_type),
+	  .offset_words = 0,
+	  .offset_bits  = 8,
+	  .size_bits    = 24 },
+	{ NOTICE_FIELD(trap.generic.trap_num),
+	  .offset_words = 1,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ NOTICE_FIELD(issuer_lid),
+	  .offset_words = 1,
+	  .offset_bits  = 16,
+	  .size_bits    = 16 },
+	{ NOTICE_FIELD(notice_toggle),
+	  .offset_words = 2,
+	  .offset_bits  = 0,
+	  .size_bits    = 1 },
+	{ NOTICE_FIELD(notice_count),
+	  .offset_words = 2,
+	  .offset_bits  = 1,
+	  .size_bits    = 15 },
+	{ NOTICE_FIELD(data_details),
+	  .offset_words = 2,
+	  .offset_bits  = 16,
+	  .size_bits    = 432 },
+	{ NOTICE_FIELD(issuer_gid),
+	  .offset_words = 16,
+	  .offset_bits  = 0,
+	  .size_bits    = 128 },
+};
+
 static void free_sm_ah(struct kref *kref)
 {
 	struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref);
@@ -913,6 +1025,153 @@ err1:
 	return ret;
 }
 
+static void ib_sa_inform_callback(struct ib_sa_query *sa_query,
+				  int status,
+				  struct ib_sa_mad *mad)
+{
+	struct ib_sa_inform_query *query =
+		container_of(sa_query, struct ib_sa_inform_query, sa_query);
+
+	if (mad) {
+		struct ib_sa_inform rec;
+
+		ib_unpack(inform_table, ARRAY_SIZE(inform_table),
+			  mad->data, &rec);
+		query->callback(status, &rec, query->context);
+	} else
+		query->callback(status, NULL, query->context);
+}
+
+static void ib_sa_inform_release(struct ib_sa_query *sa_query)
+{
+	kfree(container_of(sa_query, struct ib_sa_inform_query, sa_query));
+}
+
+/**
+ * ib_sa_informinfo_query - Start an InformInfo registration.
+ * @client:SA client
+ * @device:device to send query on
+ * @port_num: port number to send query on
+ * @rec:Inform record to send in query
+ * @timeout_ms:time to wait for response
+ * @gfp_mask:GFP mask to use for internal allocations
+ * @callback:function called when notice handler registration completes,
+ * times out or is canceled
+ * @context:opaque user context passed to callback
+ * @sa_query:query context, used to cancel query
+ *
+ * This function sends an InformInfo record to register with the SA to
+ * receive in-service notices.
+ * The callback function will be called when the query completes (or
+ * fails); status is 0 for a successful response, -EINTR if the query
+ * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error
+ * occurred sending the query.  The resp parameter of the callback is
+ * only valid if status is 0.
+ *
+ * If the return value of ib_sa_informinfo_query() is negative, it is an
+ * error code.  Otherwise it is a query ID that can be used to cancel
+ * the query.
+ */
+int ib_sa_informinfo_query(struct ib_sa_client *client,
+			   struct ib_device *device, u8 port_num,
+			   struct ib_sa_inform *rec,
+			   int timeout_ms, gfp_t gfp_mask,
+			   void (*callback)(int status,
+					   struct ib_sa_inform *resp,
+					   void *context),
+			   void *context,
+			   struct ib_sa_query **sa_query)
+{
+	struct ib_sa_inform_query *query;
+	struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client);
+	struct ib_sa_port   *port;
+	struct ib_mad_agent *agent;
+	struct ib_sa_mad *mad;
+	int ret;
+
+	if (!sa_dev)
+		return -ENODEV;
+
+	port  = &sa_dev->port[port_num - sa_dev->start_port];
+	agent = port->agent;
+
+	query = kmalloc(sizeof *query, gfp_mask);
+	if (!query)
+		return -ENOMEM;
+
+	query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0,
+						     0, IB_MGMT_SA_HDR,
+						     IB_MGMT_SA_DATA, gfp_mask);
+	if (!query->sa_query.mad_buf) {
+		ret = -ENOMEM;
+		goto err1;
+	}
+
+	ib_sa_client_get(client);
+	query->sa_query.client = client;
+	query->callback = callback;
+	query->context  = context;
+
+	mad = query->sa_query.mad_buf->mad;
+	init_mad(mad, agent);
+
+	query->sa_query.callback = callback ? ib_sa_inform_callback : NULL;
+	query->sa_query.release  = ib_sa_inform_release;
+	query->sa_query.port     = port;
+	mad->mad_hdr.method	 = IB_MGMT_METHOD_SET;
+	mad->mad_hdr.attr_id	 = cpu_to_be16(IB_SA_ATTR_INFORM_INFO);
+
+	ib_pack(inform_table, ARRAY_SIZE(inform_table), rec, mad->data);
+
+	*sa_query = &query->sa_query;
+	ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
+	if (ret < 0)
+		goto err2;
+
+	return ret;
+
+err2:
+	*sa_query = NULL;
+	ib_sa_client_put(query->sa_query.client);
+	ib_free_send_mad(query->sa_query.mad_buf);
+err1:
+	kfree(query);
+	return ret;
+}
+
+static void ib_sa_notice_resp(struct ib_sa_port *port,
+			      struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct ib_mad_send_buf *mad_buf;
+	struct ib_sa_mad *mad;
+	int ret;
+
+	mad_buf = ib_create_send_mad(port->notice_agent, 1, 0, 0,
+				     IB_MGMT_SA_HDR, IB_MGMT_SA_DATA,
+				     GFP_KERNEL);
+	if (IS_ERR(mad_buf))
+		return;
+
+	mad = mad_buf->mad;
+	memcpy(mad, mad_recv_wc->recv_buf.mad, sizeof *mad);
+	mad->mad_hdr.method = IB_MGMT_METHOD_REPORT_RESP;
+
+	spin_lock_irq(&port->ah_lock);
+	kref_get(&port->sm_ah->ref);
+	mad_buf->context[0] = &port->sm_ah->ref;
+	mad_buf->ah = port->sm_ah->ah;
+	spin_unlock_irq(&port->ah_lock);
+
+	ret = ib_post_send_mad(mad_buf, NULL);
+	if (ret)
+		goto err;
+
+	return;
+err:
+	kref_put(mad_buf->context[0], free_sm_ah);
+	ib_free_send_mad(mad_buf);
+}
+
 static void send_handler(struct ib_mad_agent *agent,
 			 struct ib_mad_send_wc *mad_send_wc)
 {
@@ -967,9 +1226,36 @@ static void recv_handler(struct ib_mad_agent *mad_agent,
 	ib_free_recv_mad(mad_recv_wc);
 }
 
+static void notice_resp_handler(struct ib_mad_agent *agent,
+				struct ib_mad_send_wc *mad_send_wc)
+{
+	kref_put(mad_send_wc->send_buf->context[0], free_sm_ah);
+	ib_free_send_mad(mad_send_wc->send_buf);
+}
+
+static void notice_handler(struct ib_mad_agent *mad_agent,
+			   struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct ib_sa_port *port;
+	struct ib_sa_mad *mad;
+	struct ib_sa_notice notice;
+
+	port = mad_agent->context;
+	mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad;
+	ib_unpack(notice_table, ARRAY_SIZE(notice_table), mad->data, &notice);
+
+	if (!notice_dispatch(port->device, port->port_num, &notice))
+		ib_sa_notice_resp(port, mad_recv_wc);
+	ib_free_recv_mad(mad_recv_wc);
+}
+
 static void ib_sa_add_one(struct ib_device *device)
 {
 	struct ib_sa_device *sa_dev;
+	struct ib_mad_reg_req reg_req = {
+		.mgmt_class = IB_MGMT_CLASS_SUBN_ADM,
+		.mgmt_class_version = 2
+	};
 	int s, e, i;
 
 	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
@@ -1003,6 +1289,16 @@ static void ib_sa_add_one(struct ib_device *device)
 		if (IS_ERR(sa_dev->port[i].agent))
 			goto err;
 
+		sa_dev->port[i].device = device;
+		set_bit(IB_MGMT_METHOD_REPORT, reg_req.method_mask);
+		sa_dev->port[i].notice_agent =
+			ib_register_mad_agent(device, i + s, IB_QPT_GSI,
+					      &reg_req, 0, notice_resp_handler,
+					      notice_handler, &sa_dev->port[i]);
+
+		if (IS_ERR(sa_dev->port[i].notice_agent))
+			goto err;
+
 		INIT_WORK(&sa_dev->port[i].update_task, update_sm_ah);
 	}
 
@@ -1025,8 +1321,14 @@ static void ib_sa_add_one(struct ib_device *device)
 	return;
 
 err:
-	while (--i >= 0)
-		ib_unregister_mad_agent(sa_dev->port[i].agent);
+	while (--i >= 0) {
+		if (!IS_ERR(sa_dev->port[i].notice_agent)) {
+			ib_unregister_mad_agent(sa_dev->port[i].notice_agent);
+		}
+		if (!IS_ERR(sa_dev->port[i].agent)) {
+			ib_unregister_mad_agent(sa_dev->port[i].agent);
+		}
+	}
 
 	kfree(sa_dev);
 
@@ -1046,6 +1348,7 @@ static void ib_sa_remove_one(struct ib_device *device)
 	flush_scheduled_work();
 
 	for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) {
+		ib_unregister_mad_agent(sa_dev->port[i].notice_agent);
 		ib_unregister_mad_agent(sa_dev->port[i].agent);
 		kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah);
 	}
@@ -1074,7 +1377,15 @@ static int __init ib_sa_init(void)
 		goto err2;
 	}
 
+	ret = notice_init();
+	if (ret) {
+		printk(KERN_ERR "Couldn't initialize notice handling\n");
+		goto err3;
+	}
+
 	return 0;
+err3:
+	mcast_cleanup();
 err2:
 	ib_unregister_client(&sa_client);
 err1:
@@ -1084,6 +1395,7 @@ err1:
 static void __exit ib_sa_cleanup(void)
 {
 	mcast_cleanup();
+	notice_cleanup();
 	ib_unregister_client(&sa_client);
 	idr_destroy(&query_idr);
 }
diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h
index 5e26b2f..83d8157 100644
--- a/include/rdma/ib_sa.h
+++ b/include/rdma/ib_sa.h
@@ -254,6 +254,127 @@ struct ib_sa_service_rec {
 	u64		data64[2];
 };
 
+enum {
+	IB_SA_EVENT_TYPE_FATAL		= 0x0,
+	IB_SA_EVENT_TYPE_URGENT		= 0x1,
+	IB_SA_EVENT_TYPE_SECURITY	= 0x2,
+	IB_SA_EVENT_TYPE_SM		= 0x3,
+	IB_SA_EVENT_TYPE_INFO		= 0x4,
+	IB_SA_EVENT_TYPE_EMPTY		= 0x7F,
+	IB_SA_EVENT_TYPE_ALL		= 0xFFFF
+};
+
+enum {
+	IB_SA_EVENT_PRODUCER_TYPE_CA		= 0x1,
+	IB_SA_EVENT_PRODUCER_TYPE_SWITCH	= 0x2,
+	IB_SA_EVENT_PRODUCER_TYPE_ROUTER	= 0x3,
+	IB_SA_EVENT_PRODUCER_TYPE_CLASS_MANAGER	= 0x4,
+	IB_SA_EVENT_PRODUCER_TYPE_ALL		= 0xFFFFFF
+};
+
+enum {
+	IB_SA_SM_TRAP_GID_IN_SERVICE			= 64,
+	IB_SA_SM_TRAP_GID_OUT_OF_SERVICE		= 65,
+	IB_SA_SM_TRAP_CREATE_MC_GROUP			= 66,
+	IB_SA_SM_TRAP_DELETE_MC_GROUP			= 67,
+	IB_SA_SM_TRAP_PORT_CHANGE_STATE			= 128,
+	IB_SA_SM_TRAP_LINK_INTEGRITY			= 129,
+	IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN		= 130,
+	IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED	= 131,
+	IB_SA_SM_TRAP_BAD_M_KEY				= 256,
+	IB_SA_SM_TRAP_BAD_P_KEY				= 257,
+	IB_SA_SM_TRAP_BAD_Q_KEY				= 258,
+	IB_SA_SM_TRAP_SWITCH_BAD_P_KEY			= 259,
+	IB_SA_SM_TRAP_ALL				= 0xFFFF
+};
+
+struct ib_sa_inform {
+	union ib_gid	gid;
+	__be16		lid_range_begin;
+	__be16		lid_range_end;
+	u8		is_generic;
+	u8		subscribe;
+	__be16		type;
+	union {
+		struct {
+			__be16	trap_num;
+			__be32	qpn;
+			u8	resp_time;
+			__be32	producer_type;
+		} generic;
+		struct {
+			__be16	device_id;
+			__be32	qpn;
+			u8	resp_time;
+			__be32	vendor_id;
+		} vendor;
+	} trap;
+};
+
+struct ib_sa_notice {
+	u8		is_generic;
+	u8		type;
+	union {
+		struct {
+			__be32	producer_type;
+			__be16	trap_num;
+		} generic;
+		struct {
+			__be32	vendor_id;
+			__be16	device_id;
+		} vendor;
+	} trap;
+	__be16		issuer_lid;
+	__be16		notice_count;
+	u8		notice_toggle;
+	/*
+	 * Align data 16 bits off 64 bit field to match InformInfo definition.
+	 * Data contained within this field will then align properly.
+	 * See IB spec 1.2, sections 13.4.8.2 and 14.2.5.1.
+	 */
+	u8		reserved[5];
+	u8		data_details[54];
+	union ib_gid	issuer_gid;
+};
+
+/*
+ * SM notice data details for:
+ *
+ * IB_SA_SM_TRAP_GID_IN_SERVICE		= 64
+ * IB_SA_SM_TRAP_GID_OUT_OF_SERVICE	= 65
+ * IB_SA_SM_TRAP_CREATE_MC_GROUP	= 66
+ * IB_SA_SM_TRAP_DELETE_MC_GROUP	= 67
+ */
+struct ib_sa_notice_data_gid {
+	u8	reserved[6];
+	u8	gid[16];
+	u8	padding[32];
+};
+
+/*
+ * SM notice data details for:
+ *
+ * IB_SA_SM_TRAP_PORT_CHANGE_STATE	= 128
+ */
+struct ib_sa_notice_data_port_change {
+	__be16	lid;
+	u8	padding[52];
+};
+
+/*
+ * SM notice data details for:
+ *
+ * IB_SA_SM_TRAP_LINK_INTEGRITY			= 129
+ * IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN	= 130
+ * IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED	= 131
+ */
+struct ib_sa_notice_data_port_error {
+	u8	reserved[2];
+	__be16	lid;
+	u8	port_num;
+	u8	padding[49];
+};
+
 struct ib_sa_client {
 	atomic_t users;
 	struct completion comp;
@@ -382,4 +503,54 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num,
 			 struct ib_sa_path_rec *rec,
 			 struct ib_ah_attr *ah_attr);
 
+struct ib_inform_info {
+	void		*context;
+	int		(*callback)(int status,
+				    struct ib_inform_info *info,
+				    struct ib_sa_notice *notice);
+	u16		trap_number;
+};
+
+/**
+ * ib_sa_register_inform_info - Registers to receive notice events.
+ * @device: Device associated with the registration.
+ * @port_num: Port on the specified device to associate with the registration.
+ * @trap_number: InformInfo trap number to register for.
+ * @gfp_mask: GFP mask for memory allocations.
+ * @callback: User callback invoked once the registration completes and to
+ *   report noticed events.
+ * @context: User specified context stored with the ib_inform_info structure.
+ *
+ * This call initiates a registration request with the SA for the specified
+ * trap number.  If the operation is started successfully, it returns
+ * an ib_inform_info structure that is used to track the registration operation.
+ * Users must free this structure by calling ib_sa_unregister_inform_info,
+ * even if the operation later fails.  (The callback status is non-zero.)
+ *
+ * If the registration fails, the callback status will be non-zero.  If it
+ * succeeds, the callback status will be zero, but the notice parameter will
+ * be NULL.  If the notice parameter is not NULL, a trap or notice is being
+ * reported to the user.
+ *
+ * A status of -ENETRESET indicates that an error occurred which requires
+ * reregistration.
+ */
+struct ib_inform_info *
+ib_sa_register_inform_info(struct ib_sa_client *client,
+			   struct ib_device *device, u8 port_num,
+			   u16 trap_number, gfp_t gfp_mask,
+			   int (*callback)(int status,
+					   struct ib_inform_info *info,
+					   struct ib_sa_notice *notice),
+			   void *context);
+
+/**
+ * ib_sa_unregister_inform_info - Releases an InformInfo registration.
+ * @info: InformInfo registration tracking structure.
+ *
+ * This call blocks until the registration request is destroyed.  It may
+ * not be called from within the registration callback.
+ */
+void ib_sa_unregister_inform_info(struct ib_inform_info *info);
+
 #endif /* IB_SA_H */


From sean.hefty at intel.com  Wed May 30 13:45:18 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 30 May 2007 13:45:18 -0700
Subject: [ofa-general] [RFC] [PATCH 2/2] for 2.6.23: ib/sa - add local path
	record caching
In-Reply-To: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com>
Message-ID: <000b01c7a2fb$6ebb05f0$3c98070a@amr.corp.intel.com>

Query and store path records locally to decrease path record query time
and to avoid flooding the SA during the start-up of large clustered jobs.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 drivers/infiniband/core/Makefile    |    2 
 drivers/infiniband/core/local_sa.c  | 1273 +++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/multicast.c |   50 -
 drivers/infiniband/core/sa.h        |   23 +
 drivers/infiniband/core/sa_query.c  |  107 ++-
 5 files changed, 1379 insertions(+), 76 deletions(-)

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 7c5b5ed..f646040 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -13,7 +13,7 @@ ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 
 ib_mad-y :=			mad.o smi.o agent.o mad_rmpp.o
 
-ib_sa-y :=			sa_query.o multicast.o notice.o
+ib_sa-y :=			sa_query.o multicast.o notice.o local_sa.o
 
 ib_cm-y :=			cm.o
 
diff --git a/drivers/infiniband/core/local_sa.c b/drivers/infiniband/core/local_sa.c
new file mode 100644
index 0000000..3b5bb8f
--- /dev/null
+++ b/drivers/infiniband/core/local_sa.c
@@ -0,0 +1,1273 @@
+/*
+ * Copyright (c) 2006 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/dma-mapping.h>
+#include <linux/err.h>
+#include <linux/interrupt.h>
+#include <linux/rbtree.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+#include <linux/miscdevice.h>
+#include <linux/random.h>
+
+#include <rdma/ib_cache.h>
+#include <rdma/ib_sa.h>
+#include "sa.h"
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("InfiniBand subnet administration caching");
+MODULE_LICENSE("Dual BSD/GPL");
+
+enum {
+	SA_DB_MAX_PATHS_PER_DEST = 0x7F,
+	SA_DB_MIN_RETRY_TIMER	 = 4000,  /*   4 sec */
+	SA_DB_MAX_RETRY_TIMER	 = 256000 /* 256 sec */
+};
+
+static int set_paths_per_dest(const char *val, struct kernel_param *kp);
+static unsigned long paths_per_dest = SA_DB_MAX_PATHS_PER_DEST;
+module_param_call(paths_per_dest, set_paths_per_dest, param_get_ulong,
+		  &paths_per_dest, 0644);
+MODULE_PARM_DESC(paths_per_dest, "Maximum number of paths to retrieve "
+				 "to each destination (DGID).  Set to 0 "
+				 "to disable cache.");
+
+static int set_subscribe_inform_info(const char *val, struct kernel_param *kp);
+static char subscribe_inform_info = 1;
+module_param_call(subscribe_inform_info, set_subscribe_inform_info,
+		  param_get_bool, &subscribe_inform_info, 0644);
+MODULE_PARM_DESC(subscribe_inform_info,
+		 "Subscribe for SA InformInfo/Notice events.");
+
+static int do_refresh(const char *val, struct kernel_param *kp);
+module_param_call(refresh, do_refresh, NULL, NULL, 0200);
+
+static unsigned long retry_timer = SA_DB_MIN_RETRY_TIMER;
+
+enum sa_db_lookup_method {
+	SA_DB_LOOKUP_LEAST_USED,
+	SA_DB_LOOKUP_RANDOM
+};
+
+static int set_lookup_method(const char *val, struct kernel_param *kp);
+static int get_lookup_method(char *buf, struct kernel_param *kp);
+static unsigned long lookup_method;
+module_param_call(lookup_method, set_lookup_method, get_lookup_method,
+		  &lookup_method, 0644);
+MODULE_PARM_DESC(lookup_method, "Method used to return path records when "
+				"multiple paths exist to a given destination.");
+
+static void sa_db_add_dev(struct ib_device *device);
+static void sa_db_remove_dev(struct ib_device *device);
+
+static struct ib_client sa_db_client = {
+	.name   = "local_sa",
+	.add    = sa_db_add_dev,
+	.remove = sa_db_remove_dev
+};
+
+static LIST_HEAD(dev_list);
+static DEFINE_MUTEX(lock);
+static rwlock_t rwlock;
+static struct workqueue_struct *sa_wq;
+static struct ib_sa_client sa_client;
+
+enum sa_db_state {
+	SA_DB_IDLE,
+	SA_DB_REFRESH,
+	SA_DB_DESTROY
+};
+
+struct sa_db_port {
+	struct sa_db_device	*dev;
+	struct ib_mad_agent	*agent;
+	/* Limit number of outstanding MADs to SA to reduce SA flooding */
+	struct ib_mad_send_buf	*msg;
+	u16			sm_lid;
+	u8			sm_sl;
+	struct ib_inform_info	*in_info;
+	struct ib_inform_info	*out_info;
+	struct rb_root		paths;
+	struct list_head	update_list;
+	unsigned long		update_id;
+	enum sa_db_state	state;
+	struct work_struct	work;
+	union ib_gid		gid;
+	int			port_num;
+};
+
+struct sa_db_device {
+	struct list_head	list;
+	struct ib_device	*device;
+	struct ib_event_handler event_handler;
+	int			start_port;
+	int			port_count;
+	struct sa_db_port	port[0];
+};
+
+struct ib_sa_iterator {
+	struct ib_sa_iterator	*next;
+};
+
+struct ib_sa_attr_iter {
+	struct ib_sa_iterator	*iter;
+	unsigned long		flags;
+};
+
+struct ib_sa_attr_list {
+	struct ib_sa_iterator	iter;
+	struct ib_sa_iterator	*tail;
+	int			update_id;
+	union ib_gid		gid;
+	struct rb_node		node;
+};
+
+struct ib_path_rec_info {
+	struct ib_sa_iterator	iter; /* keep first */
+	struct ib_sa_path_rec	rec;
+	unsigned long		lookups;
+};
+
+struct ib_sa_mad_iter {
+	struct ib_mad_recv_wc	*recv_wc;
+	struct ib_mad_recv_buf	*recv_buf;
+	int			attr_size;
+	int			attr_offset;
+	int			data_offset;
+	int			data_left;
+	void			*attr;
+	u8			attr_data[0];
+};
+
+enum sa_update_type {
+	SA_UPDATE_FULL,
+	SA_UPDATE_ADD,
+	SA_UPDATE_REMOVE
+};
+
+struct update_info {
+	struct list_head	list;
+	union ib_gid		gid;
+	enum sa_update_type	type;
+};
+
+struct sa_path_request {
+	struct work_struct	work;
+	struct ib_sa_client	*client;
+	void			(*callback)(int, struct ib_sa_path_rec *, void *);
+	void			*context;
+	struct ib_sa_path_rec	path_rec;
+};
+
+static void process_updates(struct sa_db_port *port);
+
+static void free_attr_list(struct ib_sa_attr_list *attr_list)
+{
+	struct ib_sa_iterator *cur;
+
+	for (cur = attr_list->iter.next; cur; cur = attr_list->iter.next) {
+		attr_list->iter.next = cur->next;
+		kfree(cur);
+	}
+	attr_list->tail = &attr_list->iter;
+}
+
+static void remove_attr(struct rb_root *root, struct ib_sa_attr_list *attr_list)
+{
+	rb_erase(&attr_list->node, root);
+	free_attr_list(attr_list);
+	kfree(attr_list);
+}
+
+static void remove_all_attrs(struct rb_root *root)
+{
+	struct rb_node *node, *next_node;
+	struct ib_sa_attr_list *attr_list;
+
+	write_lock_irq(&rwlock);
+	for (node = rb_first(root); node; node = next_node) {
+		next_node = rb_next(node);
+		attr_list = rb_entry(node, struct ib_sa_attr_list, node);
+		remove_attr(root, attr_list);
+	}
+	write_unlock_irq(&rwlock);
+}
+
+static void remove_old_attrs(struct rb_root *root, unsigned long update_id)
+{
+	struct rb_node *node, *next_node;
+	struct ib_sa_attr_list *attr_list;
+
+	write_lock_irq(&rwlock);
+	for (node = rb_first(root); node; node = next_node) {
+		next_node = rb_next(node);
+		attr_list = rb_entry(node, struct ib_sa_attr_list, node);
+		if (attr_list->update_id != update_id)
+			remove_attr(root, attr_list);
+	}
+	write_unlock_irq(&rwlock);
+}
+
+static struct ib_sa_attr_list *insert_attr_list(struct rb_root *root,
+						struct ib_sa_attr_list *attr_list)
+{
+	struct rb_node **link = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct ib_sa_attr_list *cur_attr_list;
+	int cmp;
+
+	while (*link) {
+		parent = *link;
+		cur_attr_list = rb_entry(parent, struct ib_sa_attr_list, node);
+		cmp = memcmp(&cur_attr_list->gid, &attr_list->gid,
+			     sizeof attr_list->gid);
+		if (cmp < 0)
+			link = &(*link)->rb_left;
+		else if (cmp > 0)
+			link = &(*link)->rb_right;
+		else
+			return cur_attr_list;
+	}
+	rb_link_node(&attr_list->node, parent, link);
+	rb_insert_color(&attr_list->node, root);
+	return NULL;
+}
+
+static struct ib_sa_attr_list *find_attr_list(struct rb_root *root, u8 *gid)
+{
+	struct rb_node *node = root->rb_node;
+	struct ib_sa_attr_list *attr_list;
+	int cmp;
+
+	while (node) {
+		attr_list = rb_entry(node, struct ib_sa_attr_list, node);
+		cmp = memcmp(&attr_list->gid, gid, sizeof attr_list->gid);
+		if (cmp < 0)
+			node = node->rb_left;
+		else if (cmp > 0)
+			node = node->rb_right;
+		else
+			return attr_list;
+	}
+	return NULL;
+}
+
+static int insert_attr(struct rb_root *root, unsigned long update_id, void *key,
+		       struct ib_sa_iterator *iter)
+{
+	struct ib_sa_attr_list *attr_list;
+	void *err;
+
+	write_lock_irq(&rwlock);
+	attr_list = find_attr_list(root, key);
+	if (!attr_list) {
+		write_unlock_irq(&rwlock);
+		attr_list = kmalloc(sizeof *attr_list, GFP_KERNEL);
+		if (!attr_list)
+			return -ENOMEM;
+
+		attr_list->iter.next = NULL;
+		attr_list->tail = &attr_list->iter;
+		attr_list->update_id = update_id;
+		memcpy(attr_list->gid.raw, key, sizeof attr_list->gid);
+
+		write_lock_irq(&rwlock);
+		err = insert_attr_list(root, attr_list);
+		if (err) {
+			write_unlock_irq(&rwlock);
+			kfree(attr_list);
+			return -EEXIST;
+		}
+	} else if (attr_list->update_id != update_id) {
+		free_attr_list(attr_list);
+		attr_list->update_id = update_id;
+	}
+
+	attr_list->tail->next = iter;
+	iter->next = NULL;
+	attr_list->tail = iter;
+	write_unlock_irq(&rwlock);
+	return 0;
+}
+
+static struct ib_sa_mad_iter *ib_sa_iter_create(struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct ib_sa_mad_iter *iter;
+	struct ib_sa_mad *mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad;
+	int attr_size, attr_offset;
+
+	attr_offset = be16_to_cpu(mad->sa_hdr.attr_offset) * 8;
+	attr_size = 64;		/* path record length */
+	if (attr_offset < attr_size)
+		return ERR_PTR(-EINVAL);
+
+	iter = kzalloc(sizeof *iter + attr_size, GFP_KERNEL);
+	if (!iter)
+		return ERR_PTR(-ENOMEM);
+
+	iter->data_left = mad_recv_wc->mad_len - IB_MGMT_SA_HDR;
+	iter->recv_wc = mad_recv_wc;
+	iter->recv_buf = &mad_recv_wc->recv_buf;
+	iter->attr_offset = attr_offset;
+	iter->attr_size = attr_size;
+	return iter;
+}
+
+static void ib_sa_iter_free(struct ib_sa_mad_iter *iter)
+{
+	kfree(iter);
+}
+
+static void *ib_sa_iter_next(struct ib_sa_mad_iter *iter)
+{
+	struct ib_sa_mad *mad;
+	int left, offset = 0;
+
+	while (iter->data_left >= iter->attr_offset) {
+		while (iter->data_offset < IB_MGMT_SA_DATA) {
+			mad = (struct ib_sa_mad *) iter->recv_buf->mad;
+
+			left = IB_MGMT_SA_DATA - iter->data_offset;
+			if (left < iter->attr_size) {
+				/* copy first piece of the attribute */
+				iter->attr = &iter->attr_data;
+				memcpy(iter->attr,
+				       &mad->data[iter->data_offset], left);
+				offset = left;
+				break;
+			} else if (offset) {
+				/* copy the second piece of the attribute */
+				memcpy(iter->attr + offset, &mad->data[0],
+				       iter->attr_size - offset);
+				iter->data_offset = iter->attr_size - offset;
+				offset = 0;
+			} else {
+				iter->attr = &mad->data[iter->data_offset];
+				iter->data_offset += iter->attr_size;
+			}
+
+			iter->data_left -= iter->attr_offset;
+			goto out;
+		}
+		iter->data_offset = 0;
+		iter->recv_buf = list_entry(iter->recv_buf->list.next,
+					    struct ib_mad_recv_buf, list);
+	}
+	iter->attr = NULL;
+out:
+	return iter->attr;
+}
+
+/*
+ * Copy path records from a received response and insert them into our cache.
+ * Path records in the MADs are in network order, packed, and may
+ * span multiple MAD buffers, just to make our life hard.
+ */
+static void update_path_db(struct sa_db_port *port,
+			   struct ib_mad_recv_wc *mad_recv_wc,
+			   enum sa_update_type type)
+{
+	struct ib_sa_mad_iter *iter;
+	struct ib_path_rec_info *path_info;
+	void *attr;
+	int ret;
+
+	iter = ib_sa_iter_create(mad_recv_wc);
+	if (IS_ERR(iter))
+		return;
+
+	port->update_id += (type == SA_UPDATE_FULL);
+
+	while ((attr = ib_sa_iter_next(iter)) &&
+	       (path_info = kmalloc(sizeof *path_info, GFP_KERNEL))) {
+
+		ib_sa_unpack_attr(&path_info->rec, attr, IB_SA_ATTR_PATH_REC);
+
+		ret = insert_attr(&port->paths, port->update_id,
+				  path_info->rec.dgid.raw, &path_info->iter);
+		if (ret) {
+			kfree(path_info);
+			break;
+		}
+	}
+	ib_sa_iter_free(iter);
+
+	if (type == SA_UPDATE_FULL)
+		remove_old_attrs(&port->paths, port->update_id);
+}
+
+static struct ib_mad_send_buf *get_sa_msg(struct sa_db_port *port,
+					  struct update_info *update)
+{
+	struct ib_ah_attr ah_attr;
+	struct ib_mad_send_buf *msg;
+
+	msg = ib_create_send_mad(port->agent, 1, 0, 0, IB_MGMT_SA_HDR,
+				 IB_MGMT_SA_DATA, GFP_KERNEL);
+	if (IS_ERR(msg))
+		return NULL;
+
+	memset(&ah_attr, 0, sizeof ah_attr);
+	ah_attr.dlid = port->sm_lid;
+	ah_attr.sl = port->sm_sl;
+	ah_attr.port_num = port->port_num;
+
+	msg->ah = ib_create_ah(port->agent->qp->pd, &ah_attr);
+	if (IS_ERR(msg->ah)) {
+		ib_free_send_mad(msg);
+		return NULL;
+	}
+
+	msg->timeout_ms = retry_timer;
+	msg->retries = 0;
+	msg->context[0] = port;
+	msg->context[1] = update;
+	return msg;
+}
+
+static __be64 form_tid(u32 hi_tid)
+{
+	static atomic_t tid;
+	return cpu_to_be64((((u64) hi_tid) << 32) |
+			   ((u32) atomic_inc_return(&tid)));
+}
+
+static void format_path_req(struct sa_db_port *port,
+			    struct update_info *update,
+			    struct ib_mad_send_buf *msg)
+{
+	struct ib_sa_mad *mad = msg->mad;
+	struct ib_sa_path_rec path_rec;
+
+	mad->mad_hdr.base_version  = IB_MGMT_BASE_VERSION;
+	mad->mad_hdr.mgmt_class	   = IB_MGMT_CLASS_SUBN_ADM;
+	mad->mad_hdr.class_version = IB_SA_CLASS_VERSION;
+	mad->mad_hdr.method	   = IB_SA_METHOD_GET_TABLE;
+	mad->mad_hdr.attr_id	   = cpu_to_be16(IB_SA_ATTR_PATH_REC);
+	mad->mad_hdr.tid	   = form_tid(msg->mad_agent->hi_tid);
+
+	mad->sa_hdr.comp_mask = IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH;
+
+	path_rec.sgid = port->gid;
+	path_rec.numb_path = (u8) paths_per_dest;
+
+	if (update->type == SA_UPDATE_ADD) {
+		mad->sa_hdr.comp_mask |= IB_SA_PATH_REC_DGID;
+		memcpy(&path_rec.dgid, &update->gid, sizeof path_rec.dgid);
+	}
+
+	ib_sa_pack_attr(mad->data, &path_rec, IB_SA_ATTR_PATH_REC);
+}
+
+static int send_query(struct sa_db_port *port,
+		      struct update_info *update)
+{
+	int ret;
+
+	port->msg = get_sa_msg(port, update);
+	if (!port->msg)
+		return -ENOMEM;
+
+	format_path_req(port, update, port->msg);
+
+	ret = ib_post_send_mad(port->msg, NULL);
+	if (ret)
+		goto err;
+
+	return 0;
+
+err:
+	ib_destroy_ah(port->msg->ah);
+	ib_free_send_mad(port->msg);
+	return ret;
+}
+
+static void add_update(struct sa_db_port *port, u8 *gid,
+		       enum sa_update_type type)
+{
+	struct update_info *update;
+
+	update = kmalloc(sizeof *update, GFP_KERNEL);
+	if (update) {
+		if (gid)
+			memcpy(&update->gid, gid, sizeof update->gid);
+		update->type = type;
+		list_add(&update->list, &port->update_list);
+	}
+
+	if (port->state == SA_DB_IDLE) {
+		port->state = SA_DB_REFRESH;
+		process_updates(port);
+	}
+}
+
+static void clean_update_list(struct sa_db_port *port)
+{
+	struct update_info *update;
+
+	while (!list_empty(&port->update_list)) {
+		update = list_entry(port->update_list.next,
+				    struct update_info, list);
+		list_del(&update->list);
+		kfree(update);
+	}
+}
+
+static int notice_handler(int status, struct ib_inform_info *info,
+			  struct ib_sa_notice *notice)
+{
+	struct sa_db_port *port = info->context;
+	struct ib_sa_notice_data_gid *gid_data;
+	struct ib_inform_info **pinfo;
+	enum sa_update_type type;
+
+	if (info->trap_number == IB_SA_SM_TRAP_GID_IN_SERVICE) {
+		pinfo = &port->in_info;
+		type = SA_UPDATE_ADD;
+	} else {
+		pinfo = &port->out_info;
+		type = SA_UPDATE_REMOVE;
+	}
+
+	mutex_lock(&lock);
+	if (port->state == SA_DB_DESTROY || !*pinfo) {
+		mutex_unlock(&lock);
+		return 0;
+	}
+
+	if (notice) {
+		gid_data = (struct ib_sa_notice_data_gid *)
+			   &notice->data_details;
+		add_update(port, gid_data->gid, type);
+		mutex_unlock(&lock);
+	} else if (status == -ENETRESET) {
+		*pinfo = NULL;
+		mutex_unlock(&lock);
+	} else {
+		if (status)
+			*pinfo = ERR_PTR(-EINVAL);
+		port->state = SA_DB_IDLE;
+		clean_update_list(port);
+		mutex_unlock(&lock);
+		queue_work(sa_wq, &port->work);
+	}
+
+	return status;
+}
+
+static int reg_in_info(struct sa_db_port *port)
+{
+	int ret = 0;
+
+	port->in_info = ib_sa_register_inform_info(&sa_client,
+						   port->dev->device,
+						   port->port_num,
+						   IB_SA_SM_TRAP_GID_IN_SERVICE,
+						   GFP_KERNEL, notice_handler,
+						   port);
+	if (IS_ERR(port->in_info))
+		ret = PTR_ERR(port->in_info);
+
+	return ret;
+}
+
+static int reg_out_info(struct sa_db_port *port)
+{
+	int ret = 0;
+
+	port->out_info = ib_sa_register_inform_info(&sa_client,
+						    port->dev->device,
+						    port->port_num,
+						    IB_SA_SM_TRAP_GID_OUT_OF_SERVICE,
+						    GFP_KERNEL, notice_handler,
+						    port);
+	if (IS_ERR(port->out_info))
+		ret = PTR_ERR(port->out_info);
+
+	return ret;
+}
+
+static void unsubscribe_port(struct sa_db_port *port)
+{
+	if (port->in_info && !IS_ERR(port->in_info))
+		ib_sa_unregister_inform_info(port->in_info);
+
+	if (port->out_info && !IS_ERR(port->out_info))
+		ib_sa_unregister_inform_info(port->out_info);
+
+	port->out_info = NULL;
+	port->in_info = NULL;
+
+}
+
+static void cleanup_port(struct sa_db_port *port)
+{
+	unsubscribe_port(port);
+	flush_workqueue(sa_wq);
+
+	clean_update_list(port);
+	remove_all_attrs(&port->paths);
+}
+
+static int update_port_info(struct sa_db_port *port)
+{
+	struct ib_port_attr port_attr;
+	int ret;
+
+	ret = ib_query_port(port->dev->device, port->port_num, &port_attr);
+	if (ret)
+		return ret;
+
+	if (port_attr.state != IB_PORT_ACTIVE)
+		return -ENODATA;
+
+	port->sm_lid = port_attr.sm_lid;
+	port->sm_sl = port_attr.sm_sl;
+	return 0;
+}
+
+static void process_updates(struct sa_db_port *port)
+{
+	struct update_info *update;
+	struct ib_sa_attr_list *attr_list;
+	int ret;
+
+	if (!paths_per_dest || update_port_info(port)) {
+		cleanup_port(port);
+		goto out;
+	}
+
+	/* Event registration is an optimization, so ignore failures. */
+	if (subscribe_inform_info) {
+		if (!port->out_info) {
+			ret = reg_out_info(port);
+			if (!ret)
+				return;
+		}
+
+		if (!port->in_info) {
+			ret = reg_in_info(port);
+			if (!ret)
+				return;
+		}
+	} else
+		unsubscribe_port(port);
+
+	while (!list_empty(&port->update_list)) {
+		update = list_entry(port->update_list.next,
+				    struct update_info, list);
+
+		if (update->type == SA_UPDATE_REMOVE) {
+			write_lock_irq(&rwlock);
+			attr_list = find_attr_list(&port->paths,
+						   update->gid.raw);
+			if (attr_list)
+				remove_attr(&port->paths, attr_list);
+			write_unlock_irq(&rwlock);
+		} else {
+			ret = send_query(port, update);
+			if (!ret)
+				return;
+
+		}
+		list_del(&update->list);
+		kfree(update);
+	}
+out:
+	port->state = SA_DB_IDLE;
+}
+
+static void refresh_port_db(struct sa_db_port *port)
+{
+	if (port->state == SA_DB_DESTROY)
+		return;
+
+	if (port->state == SA_DB_REFRESH) {
+		clean_update_list(port);
+		ib_cancel_mad(port->agent, port->msg);
+	}
+
+	add_update(port, NULL, SA_UPDATE_FULL);
+}
+
+static void refresh_dev_db(struct sa_db_device *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->port_count; i++)
+		refresh_port_db(&dev->port[i]);
+}
+
+static void refresh_db(void)
+{
+	struct sa_db_device *dev;
+
+	list_for_each_entry(dev, &dev_list, list)
+		refresh_dev_db(dev);
+}
+
+static int do_refresh(const char *val, struct kernel_param *kp)
+{
+	mutex_lock(&lock);
+	refresh_db();
+	mutex_unlock(&lock);
+	return 0;
+}
+
+static int get_lookup_method(char *buf, struct kernel_param *kp)
+{
+	return sprintf(buf,
+		       "%c %d round robin\n"
+		       "%c %d random",
+		       (lookup_method == SA_DB_LOOKUP_LEAST_USED) ? '*' : ' ',
+		       SA_DB_LOOKUP_LEAST_USED,
+		       (lookup_method == SA_DB_LOOKUP_RANDOM) ? '*' : ' ',
+		       SA_DB_LOOKUP_RANDOM);
+}
+
+static int set_lookup_method(const char *val, struct kernel_param *kp)
+{
+	unsigned long method;
+	int ret = 0;
+
+	method = simple_strtoul(val, NULL, 0);
+
+	switch (method) {
+	case SA_DB_LOOKUP_LEAST_USED:
+	case SA_DB_LOOKUP_RANDOM:
+		lookup_method = method;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static int set_paths_per_dest(const char *val, struct kernel_param *kp)
+{
+	int ret;
+
+	mutex_lock(&lock);
+	ret = param_set_ulong(val, kp);
+	if (ret)
+		goto out;
+
+	if (paths_per_dest > SA_DB_MAX_PATHS_PER_DEST)
+		paths_per_dest = SA_DB_MAX_PATHS_PER_DEST;
+	refresh_db();
+out:
+	mutex_unlock(&lock);
+	return ret;
+}
+
+static int set_subscribe_inform_info(const char *val, struct kernel_param *kp)
+{
+	int ret;
+
+	ret = param_set_bool(val, kp);
+	if (ret)
+		return ret;
+
+	return do_refresh(val, kp);
+}
+
+static void port_work_handler(struct work_struct *work)
+{
+	struct sa_db_port *port;
+
+	port = container_of(work, typeof(*port), work);
+	mutex_lock(&lock);
+	refresh_port_db(port);
+	mutex_unlock(&lock);
+}
+
+static void handle_event(struct ib_event_handler *event_handler,
+			 struct ib_event *event)
+{
+	struct sa_db_device *dev;
+	struct sa_db_port *port;
+
+	dev = container_of(event_handler, typeof(*dev), event_handler);
+	port = &dev->port[event->element.port_num - dev->start_port];
+
+	switch (event->event) {
+	case IB_EVENT_PORT_ERR:
+	case IB_EVENT_LID_CHANGE:
+	case IB_EVENT_SM_CHANGE:
+	case IB_EVENT_CLIENT_REREGISTER:
+	case IB_EVENT_PKEY_CHANGE:
+	case IB_EVENT_PORT_ACTIVE:
+		queue_work(sa_wq, &port->work);
+		break;
+	default:
+		break;
+	}
+}
+
+static void ib_free_path_iter(struct ib_sa_attr_iter *iter)
+{
+	read_unlock_irqrestore(&rwlock, iter->flags);
+}
+
+static int ib_create_path_iter(struct ib_device *device, u8 port_num,
+			       union ib_gid *dgid, struct ib_sa_attr_iter *iter)
+{
+	struct sa_db_device *dev;
+	struct sa_db_port *port;
+	struct ib_sa_attr_list *list;
+
+	dev = ib_get_client_data(device, &sa_db_client);
+	if (!dev)
+		return -ENODEV;
+
+	port = &dev->port[port_num - dev->start_port];
+
+	read_lock_irqsave(&rwlock, iter->flags);
+	list = find_attr_list(&port->paths, dgid->raw);
+	if (!list) {
+		ib_free_path_iter(iter);
+		return -ENODATA;
+	}
+
+	iter->iter = &list->iter;
+	return 0;
+}
+
+static struct ib_sa_path_rec *ib_get_next_path(struct ib_sa_attr_iter *iter)
+{
+	struct ib_path_rec_info *next_path;
+
+	iter->iter = iter->iter->next;
+	if (iter->iter) {
+		next_path = container_of(iter->iter, struct ib_path_rec_info, iter);
+		return &next_path->rec;
+	} else
+		return NULL;
+}
+
+static int cmp_rec(struct ib_sa_path_rec *src,
+		   struct ib_sa_path_rec *dst, ib_sa_comp_mask comp_mask)
+{
+	/* DGID check already done */
+	if (comp_mask & IB_SA_PATH_REC_SGID &&
+	    memcmp(&src->sgid, &dst->sgid, sizeof src->sgid))
+		return -EINVAL;
+	if (comp_mask & IB_SA_PATH_REC_DLID && src->dlid != dst->dlid)
+		return -EINVAL;
+	if (comp_mask & IB_SA_PATH_REC_SLID && src->slid != dst->slid)
+		return -EINVAL;
+	if (comp_mask & IB_SA_PATH_REC_RAW_TRAFFIC &&
+	    src->raw_traffic != dst->raw_traffic)
+		return -EINVAL;
+
+	if (comp_mask & IB_SA_PATH_REC_FLOW_LABEL &&
+	    src->flow_label != dst->flow_label)
+		return -EINVAL;
+	if (comp_mask & IB_SA_PATH_REC_HOP_LIMIT &&
+	    src->hop_limit != dst->hop_limit)
+		return -EINVAL;
+	if (comp_mask & IB_SA_PATH_REC_TRAFFIC_CLASS &&
+	    src->traffic_class != dst->traffic_class)
+		return -EINVAL;
+	if (comp_mask & IB_SA_PATH_REC_REVERSIBLE &&
+	    dst->reversible && !src->reversible)
+		return -EINVAL;
+	/* Numb path check already done */
+	if (comp_mask & IB_SA_PATH_REC_PKEY && src->pkey != dst->pkey)
+		return -EINVAL;
+
+	if (comp_mask & IB_SA_PATH_REC_SL && src->sl != dst->sl)
+		return -EINVAL;
+
+	if (ib_sa_check_selector(comp_mask, IB_SA_PATH_REC_MTU_SELECTOR,
+				 IB_SA_PATH_REC_MTU, dst->mtu_selector,
+				 src->mtu, dst->mtu))
+		return -EINVAL;
+	if (ib_sa_check_selector(comp_mask, IB_SA_PATH_REC_RATE_SELECTOR,
+				 IB_SA_PATH_REC_RATE, dst->rate_selector,
+				 src->rate, dst->rate))
+		return -EINVAL;
+	if (ib_sa_check_selector(comp_mask,
+				 IB_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR,
+				 IB_SA_PATH_REC_PACKET_LIFE_TIME,
+				 dst->packet_life_time_selector,
+				 src->packet_life_time, dst->packet_life_time))
+		return -EINVAL;
+
+	return 0;
+}
+
+static struct ib_sa_path_rec *get_random_path(struct ib_sa_attr_iter *iter,
+					      struct ib_sa_path_rec *req_path,
+					      ib_sa_comp_mask comp_mask)
+{
+	struct ib_sa_path_rec *path, *rand_path = NULL;
+	int num, count = 0;
+
+	for (path = ib_get_next_path(iter); path;
+	     path = ib_get_next_path(iter)) {
+		if (!cmp_rec(path, req_path, comp_mask)) {
+			get_random_bytes(&num, sizeof num);
+			if ((num % ++count) == 0)
+				rand_path = path;
+		}
+	}
+
+	return rand_path;
+}
+
+static struct ib_sa_path_rec *get_next_path(struct ib_sa_attr_iter *iter,
+					    struct ib_sa_path_rec *req_path,
+					    ib_sa_comp_mask comp_mask)
+{
+	struct ib_path_rec_info *cur_path, *next_path = NULL;
+	struct ib_sa_path_rec *path;
+	unsigned long lookups = ~0;
+
+	for (path = ib_get_next_path(iter); path;
+	     path = ib_get_next_path(iter)) {
+		if (!cmp_rec(path, req_path, comp_mask)) {
+
+			cur_path = container_of(iter->iter, struct ib_path_rec_info,
+						iter);
+			if (cur_path->lookups < lookups) {
+				lookups = cur_path->lookups;
+				next_path = cur_path;
+			}
+		}
+	}
+
+	if (next_path) {
+		next_path->lookups++;
+		return &next_path->rec;
+	} else
+		return NULL;
+}
+
+static void report_path(struct work_struct *work)
+{
+	struct sa_path_request *req;
+
+	req = container_of(work, struct sa_path_request, work);
+	req->callback(0, &req->path_rec, req->context);
+	ib_sa_client_put(req->client);
+	kfree(req);
+}
+
+/**
+ * ib_sa_path_rec_get - Start a Path get query
+ * @client: SA client
+ * @device: device to send query on
+ * @port_num: port number to send query on
+ * @rec: Path Record to send in query
+ * @comp_mask: component mask to send in query
+ * @timeout_ms: time to wait for response
+ * @gfp_mask: GFP mask to use for internal allocations
+ * @callback: function called when query completes, times out or is
+ * canceled
+ * @context: opaque user context passed to callback
+ * @sa_query: query context, used to cancel query
+ *
+ * Send a Path Record Get query to the SA to look up a path.  The
+ * callback function will be called when the query completes (or
+ * fails); status is 0 for a successful response, -EINTR if the query
+ * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error
+ * occurred sending the query.  The resp parameter of the callback is
+ * only valid if status is 0.
+ *
+ * If the return value of ib_sa_path_rec_get() is negative, it is an
+ * error code.  Otherwise it is a query ID that can be used to cancel
+ * the query.
+ */
+int ib_sa_path_rec_get(struct ib_sa_client *client,
+		       struct ib_device *device, u8 port_num,
+		       struct ib_sa_path_rec *rec,
+		       ib_sa_comp_mask comp_mask,
+		       int timeout_ms, gfp_t gfp_mask,
+		       void (*callback)(int status,
+					struct ib_sa_path_rec *resp,
+					void *context),
+		       void *context,
+		       struct ib_sa_query **sa_query)
+{
+	struct sa_path_request *req;
+	struct ib_sa_attr_iter iter;
+	struct ib_sa_path_rec *path_rec;
+	int ret;
+
+	if (!paths_per_dest)
+		goto query_sa;
+
+	if (!(comp_mask & IB_SA_PATH_REC_DGID) ||
+	    !(comp_mask & IB_SA_PATH_REC_NUMB_PATH) || rec->numb_path != 1)
+		goto query_sa;
+
+	req = kmalloc(sizeof *req, gfp_mask);
+	if (!req)
+		goto query_sa;
+
+	ret = ib_create_path_iter(device, port_num, &rec->dgid, &iter);
+	if (ret)
+		goto free_req;
+
+	if (lookup_method == SA_DB_LOOKUP_RANDOM)
+		path_rec = get_random_path(&iter, rec, comp_mask);
+	else
+		path_rec = get_next_path(&iter, rec, comp_mask);
+
+	if (!path_rec)
+		goto free_iter;
+
+	memcpy(&req->path_rec, path_rec, sizeof *path_rec);
+	ib_free_path_iter(&iter);
+
+	INIT_WORK(&req->work, report_path);
+	req->client = client;
+	req->callback = callback;
+	req->context = context;
+
+	ib_sa_client_get(client);
+	queue_work(sa_wq, &req->work);
+	*sa_query = ERR_PTR(-EEXIST);
+	return 0;
+
+free_iter:
+	ib_free_path_iter(&iter);
+free_req:
+	kfree(req);
+query_sa:
+	return ib_sa_path_rec_query(client, device, port_num, rec, comp_mask,
+				    timeout_ms, gfp_mask, callback, context,
+				    sa_query);
+}
+EXPORT_SYMBOL(ib_sa_path_rec_get);
+
+static void recv_handler(struct ib_mad_agent *mad_agent,
+			 struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct sa_db_port *port;
+	struct update_info *update;
+	struct ib_mad_send_buf *msg;
+	enum sa_update_type type;
+
+	msg = (struct ib_mad_send_buf *) (unsigned long) mad_recv_wc->wc->wr_id;
+	port = msg->context[0];
+	update = msg->context[1];
+
+	mutex_lock(&lock);
+	if (port->state == SA_DB_DESTROY ||
+	    update != list_entry(port->update_list.next,
+				 struct update_info, list)) {
+		mutex_unlock(&lock);
+	} else {
+		type = update->type;
+		mutex_unlock(&lock);
+		update_path_db(mad_agent->context, mad_recv_wc, type);
+	}
+
+	ib_free_recv_mad(mad_recv_wc);
+}
+
+static void send_handler(struct ib_mad_agent *agent,
+			 struct ib_mad_send_wc *mad_send_wc)
+{
+	struct ib_mad_send_buf *msg;
+	struct sa_db_port *port;
+	struct update_info *update;
+	int ret;
+
+	msg = mad_send_wc->send_buf;
+	port = msg->context[0];
+	update = msg->context[1];
+
+	mutex_lock(&lock);
+	if (port->state == SA_DB_DESTROY)
+		goto unlock;
+
+	if (update == list_entry(port->update_list.next,
+				 struct update_info, list)) {
+
+		if (mad_send_wc->status == IB_WC_RESP_TIMEOUT_ERR &&
+		    msg->timeout_ms < SA_DB_MAX_RETRY_TIMER) {
+
+			msg->timeout_ms <<= 1;
+			ret = ib_post_send_mad(msg, NULL);
+			if (!ret) {
+				mutex_unlock(&lock);
+				return;
+			}
+		}
+		list_del(&update->list);
+		kfree(update);
+	}
+	process_updates(port);
+unlock:
+	mutex_unlock(&lock);
+
+	ib_destroy_ah(msg->ah);
+	ib_free_send_mad(msg);
+}
+
+static int init_port(struct sa_db_device *dev, int port_num)
+{
+	struct sa_db_port *port;
+	int ret;
+
+	port = &dev->port[port_num - dev->start_port];
+	port->dev = dev;
+	port->port_num = port_num;
+	INIT_WORK(&port->work, port_work_handler);
+	port->paths = RB_ROOT;
+	INIT_LIST_HEAD(&port->update_list);
+
+	ret = ib_get_cached_gid(dev->device, port_num, 0, &port->gid);
+	if (ret)
+		return ret;
+
+	port->agent = ib_register_mad_agent(dev->device, port_num, IB_QPT_GSI,
+					    NULL, IB_MGMT_RMPP_VERSION,
+					    send_handler, recv_handler, port);
+	if (IS_ERR(port->agent))
+		ret = PTR_ERR(port->agent);
+
+	return ret;
+}
+
+static void destroy_port(struct sa_db_port *port)
+{
+	mutex_lock(&lock);
+	port->state = SA_DB_DESTROY;
+	mutex_unlock(&lock);
+
+	ib_unregister_mad_agent(port->agent);
+	cleanup_port(port);
+}
+
+static void sa_db_add_dev(struct ib_device *device)
+{
+	struct sa_db_device *dev;
+	struct sa_db_port *port;
+	int s, e, i, ret;
+
+	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+		return;
+
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
+		s = e = 0;
+	} else {
+		s = 1;
+		e = device->phys_port_cnt;
+	}
+
+	dev = kzalloc(sizeof *dev + (e - s + 1) * sizeof *port, GFP_KERNEL);
+	if (!dev)
+		return;
+
+	dev->start_port = s;
+	dev->port_count = e - s + 1;
+	dev->device = device;
+	for (i = 0; i < dev->port_count; i++) {
+		ret = init_port(dev, s + i);
+		if (ret)
+			goto err;
+	}
+
+	ib_set_client_data(device, &sa_db_client, dev);
+
+	INIT_IB_EVENT_HANDLER(&dev->event_handler, device, handle_event);
+
+	mutex_lock(&lock);
+	list_add_tail(&dev->list, &dev_list);
+	refresh_dev_db(dev);
+	mutex_unlock(&lock);
+
+	ib_register_event_handler(&dev->event_handler);
+	return;
+err:
+	while (i--)
+		destroy_port(&dev->port[i]);
+	kfree(dev);
+}
+
+static void sa_db_remove_dev(struct ib_device *device)
+{
+	struct sa_db_device *dev;
+	int i;
+
+	dev = ib_get_client_data(device, &sa_db_client);
+	if (!dev)
+		return;
+
+	ib_unregister_event_handler(&dev->event_handler);
+	flush_workqueue(sa_wq);
+
+	for (i = 0; i < dev->port_count; i++)
+		destroy_port(&dev->port[i]);
+
+	mutex_lock(&lock);
+	list_del(&dev->list);
+	mutex_unlock(&lock);
+
+	kfree(dev);
+}
+
+int sa_db_init(void)
+{
+	int ret;
+
+	rwlock_init(&rwlock);
+	sa_wq = create_singlethread_workqueue("local_sa");
+	if (!sa_wq)
+		return -ENOMEM;
+
+	ib_sa_register_client(&sa_client);
+	ret = ib_register_client(&sa_db_client);
+	if (ret)
+		goto err;
+
+	return 0;
+
+err:
+	ib_sa_unregister_client(&sa_client);
+	destroy_workqueue(sa_wq);
+	return ret;
+}
+
+void sa_db_cleanup(void)
+{
+	ib_unregister_client(&sa_db_client);
+	ib_sa_unregister_client(&sa_client);
+	destroy_workqueue(sa_wq);
+}
diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index 1e13ab4..f49eb75 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -238,34 +238,6 @@ static u8 get_leave_state(struct mcast_group *group)
 	return leave_state & group->rec.join_state;
 }
 
-static int check_selector(ib_sa_comp_mask comp_mask,
-			  ib_sa_comp_mask selector_mask,
-			  ib_sa_comp_mask value_mask,
-			  u8 selector, u8 src_value, u8 dst_value)
-{
-	int err;
-
-	if (!(comp_mask & selector_mask) || !(comp_mask & value_mask))
-		return 0;
-
-	switch (selector) {
-	case IB_SA_GT:
-		err = (src_value <= dst_value);
-		break;
-	case IB_SA_LT:
-		err = (src_value >= dst_value);
-		break;
-	case IB_SA_EQ:
-		err = (src_value != dst_value);
-		break;
-	default:
-		err = 0;
-		break;
-	}
-
-	return err;
-}
-
 static int cmp_rec(struct ib_sa_mcmember_rec *src,
 		   struct ib_sa_mcmember_rec *dst, ib_sa_comp_mask comp_mask)
 {
@@ -278,24 +250,24 @@ static int cmp_rec(struct ib_sa_mcmember_rec *src,
 		return -EINVAL;
 	if (comp_mask & IB_SA_MCMEMBER_REC_MLID && src->mlid != dst->mlid)
 		return -EINVAL;
-	if (check_selector(comp_mask, IB_SA_MCMEMBER_REC_MTU_SELECTOR,
-			   IB_SA_MCMEMBER_REC_MTU, dst->mtu_selector,
-			   src->mtu, dst->mtu))
+	if (ib_sa_check_selector(comp_mask, IB_SA_MCMEMBER_REC_MTU_SELECTOR,
+				 IB_SA_MCMEMBER_REC_MTU, dst->mtu_selector,
+				 src->mtu, dst->mtu))
 		return -EINVAL;
 	if (comp_mask & IB_SA_MCMEMBER_REC_TRAFFIC_CLASS &&
 	    src->traffic_class != dst->traffic_class)
 		return -EINVAL;
 	if (comp_mask & IB_SA_MCMEMBER_REC_PKEY && src->pkey != dst->pkey)
 		return -EINVAL;
-	if (check_selector(comp_mask, IB_SA_MCMEMBER_REC_RATE_SELECTOR,
-			   IB_SA_MCMEMBER_REC_RATE, dst->rate_selector,
-			   src->rate, dst->rate))
+	if (ib_sa_check_selector(comp_mask, IB_SA_MCMEMBER_REC_RATE_SELECTOR,
+				 IB_SA_MCMEMBER_REC_RATE, dst->rate_selector,
+				 src->rate, dst->rate))
 		return -EINVAL;
-	if (check_selector(comp_mask,
-			   IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR,
-			   IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME,
-			   dst->packet_life_time_selector,
-			   src->packet_life_time, dst->packet_life_time))
+	if (ib_sa_check_selector(comp_mask,
+				 IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR,
+				 IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME,
+				 dst->packet_life_time_selector,
+				 src->packet_life_time, dst->packet_life_time))
 		return -EINVAL;
 	if (comp_mask & IB_SA_MCMEMBER_REC_SL && src->sl != dst->sl)
 		return -EINVAL;
diff --git a/drivers/infiniband/core/sa.h b/drivers/infiniband/core/sa.h
index b8eac66..0f19dde 100644
--- a/drivers/infiniband/core/sa.h
+++ b/drivers/infiniband/core/sa.h
@@ -48,6 +48,29 @@ static inline void ib_sa_client_put(struct ib_sa_client *client)
 		complete(&client->comp);
 }
 
+int ib_sa_check_selector(ib_sa_comp_mask comp_mask,
+			 ib_sa_comp_mask selector_mask,
+			 ib_sa_comp_mask value_mask,
+			 u8 selector, u8 src_value, u8 dst_value);
+
+int ib_sa_pack_attr(void *dst, void *src, int attr_id);
+
+int ib_sa_unpack_attr(void *dst, void *src, int attr_id);
+
+int ib_sa_path_rec_query(struct ib_sa_client *client,
+			 struct ib_device *device, u8 port_num,
+			 struct ib_sa_path_rec *rec,
+			 ib_sa_comp_mask comp_mask,
+			 int timeout_ms, gfp_t gfp_mask,
+			 void (*callback)(int status,
+					  struct ib_sa_path_rec *resp,
+					  void *context),
+			 void *context,
+			 struct ib_sa_query **sa_query);
+
+int sa_db_init(void);
+void sa_db_cleanup(void);
+
 int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
 			     struct ib_device *device, u8 port_num,
 			     u8 method,
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 369fe60..cb7a503 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -464,6 +464,58 @@ static const struct ib_field notice_table[] = {
 	  .size_bits    = 128 },
 };
 
+int ib_sa_check_selector(ib_sa_comp_mask comp_mask,
+			 ib_sa_comp_mask selector_mask,
+			 ib_sa_comp_mask value_mask,
+			 u8 selector, u8 src_value, u8 dst_value)
+{
+	int err;
+
+	if (!(comp_mask & selector_mask) || !(comp_mask & value_mask))
+		return 0;
+
+	switch (selector) {
+	case IB_SA_GT:
+		err = (src_value <= dst_value);
+		break;
+	case IB_SA_LT:
+		err = (src_value >= dst_value);
+		break;
+	case IB_SA_EQ:
+		err = (src_value != dst_value);
+		break;
+	default:
+		err = 0;
+		break;
+	}
+
+	return err;
+}
+
+int ib_sa_pack_attr(void *dst, void *src, int attr_id)
+{
+	switch (attr_id) {
+	case IB_SA_ATTR_PATH_REC:
+		ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+int ib_sa_unpack_attr(void *dst, void *src, int attr_id)
+{
+	switch (attr_id) {
+	case IB_SA_ATTR_PATH_REC:
+		ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
 static void free_sm_ah(struct kref *kref)
 {
 	struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref);
@@ -706,41 +758,16 @@ static void ib_sa_path_rec_release(struct ib_sa_query *sa_query)
 	kfree(container_of(sa_query, struct ib_sa_path_query, sa_query));
 }
 
-/**
- * ib_sa_path_rec_get - Start a Path get query
- * @client:SA client
- * @device:device to send query on
- * @port_num: port number to send query on
- * @rec:Path Record to send in query
- * @comp_mask:component mask to send in query
- * @timeout_ms:time to wait for response
- * @gfp_mask:GFP mask to use for internal allocations
- * @callback:function called when query completes, times out or is
- * canceled
- * @context:opaque user context passed to callback
- * @sa_query:query context, used to cancel query
- *
- * Send a Path Record Get query to the SA to look up a path.  The
- * callback function will be called when the query completes (or
- * fails); status is 0 for a successful response, -EINTR if the query
- * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error
- * occurred sending the query.  The resp parameter of the callback is
- * only valid if status is 0.
- *
- * If the return value of ib_sa_path_rec_get() is negative, it is an
- * error code.  Otherwise it is a query ID that can be used to cancel
- * the query.
- */
-int ib_sa_path_rec_get(struct ib_sa_client *client,
-		       struct ib_device *device, u8 port_num,
-		       struct ib_sa_path_rec *rec,
-		       ib_sa_comp_mask comp_mask,
-		       int timeout_ms, gfp_t gfp_mask,
-		       void (*callback)(int status,
-					struct ib_sa_path_rec *resp,
-					void *context),
-		       void *context,
-		       struct ib_sa_query **sa_query)
+int ib_sa_path_rec_query(struct ib_sa_client *client,
+			 struct ib_device *device, u8 port_num,
+			 struct ib_sa_path_rec *rec,
+			 ib_sa_comp_mask comp_mask,
+			 int timeout_ms, gfp_t gfp_mask,
+			 void (*callback)(int status,
+					  struct ib_sa_path_rec *resp,
+					  void *context),
+			 void *context,
+			 struct ib_sa_query **sa_query)
 {
 	struct ib_sa_path_query *query;
 	struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client);
@@ -801,7 +828,6 @@ err1:
 	kfree(query);
 	return ret;
 }
-EXPORT_SYMBOL(ib_sa_path_rec_get);
 
 static void ib_sa_service_rec_callback(struct ib_sa_query *sa_query,
 				    int status,
@@ -1383,7 +1409,15 @@ static int __init ib_sa_init(void)
 		goto err3;
 	}
 
+	ret = sa_db_init();
+	if (ret) {
+		printk(KERN_ERR "Couldn't initialize local SA\n");
+		goto err4;
+	}
+
 	return 0;
+err4:
+	notice_cleanup();
 err3:
 	mcast_cleanup();
 err2:
@@ -1394,6 +1428,7 @@ err1:
 
 static void __exit ib_sa_cleanup(void)
 {
+	sa_db_cleanup();
 	mcast_cleanup();
 	notice_cleanup();
 	ib_unregister_client(&sa_client);


From rdreier at cisco.com  Wed May 30 13:48:29 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 30 May 2007 13:48:29 -0700
Subject: [ofa-general] [PATCH] drivers/infiniband: fix comparsion between
	unsigned and negative
In-Reply-To: <20070530080518.GA29195@nostromo.devel.redhat.com> (Bill
	Nottingham's message of "Wed, 30 May 2007 04:05:18 -0400")
References: <20070530080518.GA29195@nostromo.devel.redhat.com>
Message-ID: <adairaaqbqa.fsf@cisco.com>

I just went through this patch, and all the changes are of the form of
removing the < 0 test from code like

	if (x < 0 || x > MAX) 
		return -ERROR;

which Linus said we should not change, in the email
<http://lkml.org/lkml/2006/11/28/206> that Satyam just pointed out.
So I'll drop this patch.

 - R.
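
For readers following the thread, here is a minimal sketch of the idiom being discussed. The names MAX_INDEX, check_signed() and check_unsigned() are hypothetical and do not come from the patch. With an unsigned argument the "x < 0" half can never be true, which is what the compiler warning is about, but leaving it in place keeps the range check correct if the type is later changed to a signed one.

```c
#include <assert.h>
#include <errno.h>

#define MAX_INDEX 15	/* hypothetical upper bound, for illustration only */

/* Signed argument: both halves of the range check do real work. */
int check_signed(int x)
{
	if (x < 0 || x > MAX_INDEX)
		return -EINVAL;
	return 0;
}

/*
 * Unsigned argument: "x < 0" is always false here, so a compiler may warn
 * about it. The check still documents the intended range, and a negative
 * value converted to unsigned becomes a large number that "x > MAX_INDEX"
 * rejects.
 */
int check_unsigned(unsigned int x)
{
	if (x < 0 || x > MAX_INDEX)
		return -EINVAL;
	return 0;
}

/* Usage: exercise both helpers at the edges of the range. */
void range_check_demo(void)
{
	assert(check_signed(-1) == -EINVAL);
	assert(check_signed(0) == 0);
	assert(check_signed(MAX_INDEX) == 0);
	assert(check_signed(MAX_INDEX + 1) == -EINVAL);
	assert(check_unsigned(0) == 0);
	assert(check_unsigned((unsigned int)-1) == -EINVAL);
}
```

Both versions reject the same out-of-range inputs, so dropping the "< 0" test changes nothing at run time; the argument in the linked email is about readability and robustness of the source.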


From rdreier at cisco.com  Wed May 30 13:56:03 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 30 May 2007 13:56:03 -0700
Subject: [ofa-general] Re: wmb missing in libmthca?
In-Reply-To: <20070524114711.GB4585@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 24 May 2007 14:47:11 +0300")
References: <20070524114711.GB4585@mellanox.co.il>
Message-ID: <adaejkyqbdo.fsf@cisco.com>

 > Roland, I see this in kernel:
 > 
 >                 ((struct mthca_next_seg *) prev_wqe)->nda_op =
 >                         cpu_to_be32((ind << qp->rq.wqe_shift) | 1);
 >                 wmb();
 >                 ((struct mthca_next_seg *) prev_wqe)->ee_nds =
 >                         cpu_to_be32(MTHCA_NEXT_DBD | size);
 > 
 > but userspace does not have wmb here.
 > Is it needed?

It does seem that way -- otherwise the hardware might read prev_wqe
and see the ee_nds field set before the nda_op field has the right
value.  Does this look right to you as a libmthca fix?

diff --git a/src/qp.c b/src/qp.c
index 2d03d49..2ea9dc0 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -292,7 +292,10 @@ int mthca_tavor_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 			htonl(((ind << qp->sq.wqe_shift) +
 			       qp->send_wqe_offset) |
 			      mthca_opcode[wr->opcode]);
-
+		/*
+		 * Make sure that nda_op is written before setting ee_nds.
+		 */
+		wmb();
 		((struct mthca_next_seg *) prev_wqe)->ee_nds =
 			htonl((size0 ? 0 : MTHCA_NEXT_DBD) | size |
 			((wr->send_flags & IBV_SEND_FENCE) ?
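
The ordering requirement above can be sketched in portable C, under some loud assumptions. struct next_seg and link_descriptor() below are hypothetical stand-ins, not libmthca code, and the C11 atomic_thread_fence() call is used only as an analogy for wmb(). The real wmb() in libmthca is an architecture-specific macro, and strictly speaking a C11 fence gives guarantees only for atomic accesses and says nothing about a device reading the memory by DMA. Byte-order conversion (htonl) is left out to keep the sketch short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical two-word descriptor, loosely modelled on mthca_next_seg. */
struct next_seg {
	uint32_t nda_op;	/* address and opcode of the next descriptor */
	uint32_t ee_nds;	/* size word; nonzero marks the link as usable */
};

/*
 * Link "prev" to the next descriptor. The fence keeps the nda_op store
 * from being reordered after the ee_nds store, which is the ordering the
 * wmb() in the patch is meant to enforce.
 */
void link_descriptor(volatile struct next_seg *prev,
		     uint32_t next_addr_op, uint32_t size)
{
	prev->nda_op = next_addr_op;
	atomic_thread_fence(memory_order_release);	/* stand-in for wmb() */
	prev->ee_nds = size;
}

/* Usage: after linking, both words hold the values that were passed in. */
void link_descriptor_demo(void)
{
	struct next_seg seg = { 0, 0 };

	link_descriptor(&seg, 0x1000u | 1u, 4u);
	assert(seg.nda_op == 0x1001u);
	assert(seg.ee_nds == 4u);
}
```

A single-threaded run like the demo cannot show the reordering itself; it only checks that the helper writes both fields. The point of the sketch is where the barrier sits: between the store that fills in the link and the store that makes the link visible as valid.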


From sashak at voltaire.com  Wed May 30 15:01:27 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 31 May 2007 01:01:27 +0300
Subject: [ofa-general] [PATCH] opensm: osm_node_get_physp_ptr() usage fixes
Message-ID: <20070530220127.GP13193@sashak.voltaire.com>


Function osm_node_get_physp_ptr() cannot return NULL, but can return a
pointer to a non-initialized object. This patch fixes cases where the
resulting pointer was not verified properly.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_link_mgr.c           |    2 +-
 opensm/opensm/osm_mcast_mgr.c          |    1 -
 opensm/opensm/osm_node.c               |   28 ++++++---------------
 opensm/opensm/osm_node_info_rcv.c      |   41 +++++++++++--------------------
 opensm/opensm/osm_pkey_rcv.c           |    2 -
 opensm/opensm/osm_port.c               |    5 +--
 opensm/opensm/osm_port_info_rcv.c      |    7 ++---
 opensm/opensm/osm_qos.c                |    2 +-
 opensm/opensm/osm_sa_link_record.c     |   19 +++++++-------
 opensm/opensm/osm_sa_pkey_record.c     |    5 +---
 opensm/opensm/osm_sa_portinfo_record.c |    5 +---
 opensm/opensm/osm_sa_slvl_record.c     |    6 ----
 opensm/opensm/osm_sa_vlarb_record.c    |    5 +---
 opensm/opensm/osm_state_mgr.c          |    8 +----
 opensm/opensm/osm_switch.c             |    5 ++-
 opensm/opensm/osm_trap_rcv.c           |    5 +++-
 opensm/opensm/osm_ucast_lash.c         |   12 ++++----
 17 files changed, 58 insertions(+), 100 deletions(-)
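
A minimal sketch of why a NULL test on the returned pointer is not enough, assuming simplified stand-in types. struct physp, struct node and the helper names below are hypothetical. Validity is modelled here by a nonzero port_guid, which is an assumption; how osm_physp_is_valid() decides is not shown in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 4	/* hypothetical table size */

/* Hypothetical stand-ins for osm_physp_t and osm_node_t. */
struct physp {
	uint64_t port_guid;	/* 0 until the port has been initialized */
	uint8_t port_num;
};

struct node {
	struct physp physp_table[MAX_PORTS];
};

/* Returns the address of a table slot, so the result is never NULL. */
struct physp *node_get_physp_ptr(struct node *n, uint8_t port_num)
{
	assert(port_num < MAX_PORTS);
	return &n->physp_table[port_num];
}

/* A slot counts as valid only once it has been initialized. */
int physp_is_valid(const struct physp *p)
{
	return p->port_guid != 0;
}

void physp_init(struct physp *p, uint64_t port_guid, uint8_t port_num)
{
	p->port_guid = port_guid;
	p->port_num = port_num;
}

/* Usage: slot 2 is non-NULL yet invalid; slot 1 is valid after init. */
void physp_demo(void)
{
	struct node n = { { { 0, 0 } } };

	physp_init(node_get_physp_ptr(&n, 1), 0x1234u, 1);
	assert(node_get_physp_ptr(&n, 2) != NULL);
	assert(!physp_is_valid(node_get_physp_ptr(&n, 2)));
	assert(physp_is_valid(node_get_physp_ptr(&n, 1)));
}
```

In this model a test such as "if (p_physp)" always passes, so callers that relied on it were in effect not checking anything; the validity test is the one that distinguishes an initialized slot from an empty one.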

diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c
index a38d179..73bebce 100644
--- a/opensm/opensm/osm_link_mgr.c
+++ b/opensm/opensm/osm_link_mgr.c
@@ -435,7 +435,7 @@ __osm_link_mgr_process_port(
       specified state.
     */
     p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)i );
-    if( p_physp && osm_physp_is_valid( p_physp ) )
+    if( osm_physp_is_valid( p_physp ) )
     {
       current_state = osm_physp_get_port_state( p_physp );
 
diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c
index da787b4..2ecb34e 100644
--- a/opensm/opensm/osm_mcast_mgr.c
+++ b/opensm/opensm/osm_mcast_mgr.c
@@ -818,7 +818,6 @@ __osm_mcast_mgr_branch(
       CL_ASSERT( p_remote_node->sw );
 
       p_physp = osm_node_get_physp_ptr( p_node, i );
-      CL_ASSERT( p_physp );
       CL_ASSERT( osm_physp_is_valid( p_physp ) );
 
       p_remote_physp = osm_physp_get_remote( p_physp );
diff --git a/opensm/opensm/osm_node.c b/opensm/opensm/osm_node.c
index cd4ccfa..8d2c3f5 100644
--- a/opensm/opensm/osm_node.c
+++ b/opensm/opensm/osm_node.c
@@ -61,7 +61,6 @@ osm_node_init_physp(
   IN osm_node_t* const p_node,
   IN const osm_madw_t* const p_madw )
 {
-  osm_physp_t        *p_physp;
   ib_net64_t         port_guid;
   ib_smp_t           *p_smp;
   ib_node_info_t     *p_ni;
@@ -80,9 +79,8 @@ osm_node_init_physp(
 
   CL_ASSERT( port_num < p_node->physp_tbl_size );
 
-  p_physp = osm_node_get_physp_ptr( p_node, port_num );
-
-  osm_physp_init( p_physp, port_guid, port_num, p_node,
+  osm_physp_init( &p_node->physp_table[port_num],
+                  port_guid, port_num, p_node,
                   osm_madw_get_bind_handle( p_madw ),
                   p_smp->hop_count, p_smp->initial_path );
 }
@@ -133,7 +131,7 @@ osm_node_new(
       Get(NodeInfo).
     */
     for( i = 0; i < p_node->physp_tbl_size; i++ )
-      osm_physp_construct( osm_node_get_physp_ptr( p_node, i ) );
+      osm_physp_construct( &p_node->physp_table[i] );
 
     osm_node_init_physp( p_node, p_madw );
   }
@@ -147,18 +145,13 @@ static void
 osm_node_destroy(
   IN osm_node_t *p_node )
 {
-  osm_physp_t *p_physp;
   uint16_t i;
 
   /*
     Cleanup all physports 
   */
   for( i = 0; i < p_node->physp_tbl_size; i++ )
-  {
-    p_physp = osm_node_get_physp_ptr( p_node, i );
-    if (p_physp) 
-      osm_physp_destroy( p_physp );
-  }
+    osm_physp_destroy( &p_node->physp_table[i] );
 }
 
 /**********************************************************************
@@ -189,8 +182,7 @@ osm_node_link(
   CL_ASSERT( remote_port_num < p_remote_node->physp_tbl_size );
 
   p_physp = osm_node_get_physp_ptr( p_node, port_num );
-  p_remote_physp =  osm_node_get_physp_ptr( p_remote_node,
-                                            remote_port_num );
+  p_remote_physp =  osm_node_get_physp_ptr( p_remote_node, remote_port_num );
 
   if (p_physp->p_remote_physp)
     p_physp->p_remote_physp->p_remote_physp = NULL;
@@ -220,8 +212,7 @@ osm_node_unlink(
   {
 
     p_physp = osm_node_get_physp_ptr( p_node, port_num );
-    p_remote_physp =  osm_node_get_physp_ptr( p_remote_node,
-                                              remote_port_num );
+    p_remote_physp =  osm_node_get_physp_ptr( p_remote_node, remote_port_num );
 
     osm_physp_unlink( p_physp, p_remote_physp );
   }
@@ -243,8 +234,7 @@ osm_node_link_exists(
   CL_ASSERT( remote_port_num < p_remote_node->physp_tbl_size );
 
   p_physp = osm_node_get_physp_ptr( p_node, port_num );
-  p_remote_physp =  osm_node_get_physp_ptr( p_remote_node,
-                                            remote_port_num );
+  p_remote_physp =  osm_node_get_physp_ptr( p_remote_node, remote_port_num );
 
   return( osm_physp_link_exists( p_physp, p_remote_physp ) );
 }
@@ -265,8 +255,7 @@ osm_node_link_has_valid_ports(
   CL_ASSERT( remote_port_num < p_remote_node->physp_tbl_size );
 
   p_physp = osm_node_get_physp_ptr( p_node, port_num );
-  p_remote_physp =  osm_node_get_physp_ptr( p_remote_node,
-                                            remote_port_num );
+  p_remote_physp =  osm_node_get_physp_ptr( p_remote_node, remote_port_num );
 
   return( osm_physp_is_valid( p_physp ) &&
           osm_physp_is_valid( p_remote_physp ) );
@@ -329,4 +318,3 @@ osm_node_get_remote_base_lid(
 
   return( 0 );
 }
-
diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index 2c79056..2486ffb 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -144,17 +144,14 @@ __osm_ni_rcv_set_links(
 
             p_physp = osm_node_get_physp_ptr( p_node, port_num );
             sprintf( dr_new_path, "no_path_available" );
-            if (p_physp)
+            p_path = osm_physp_get_dr_path_ptr( p_physp );
+            if ( p_path )
             {
-              p_path = osm_physp_get_dr_path_ptr( p_physp );
-              if ( p_path )
+              sprintf( dr_new_path, "new path:" );
+              for (i = 0; i <= p_path->hop_count; i++ )
               {
-                sprintf( dr_new_path, "new path:" );
-                for (i = 0; i <= p_path->hop_count; i++ )
-                {
-                  sprintf( line, "[%X]", p_path->path[i] );
-                  strcat( dr_new_path, line );
-                }
+                sprintf( line, "[%X]", p_path->path[i] );
+                strcat( dr_new_path, line );
               }
             }
 
@@ -164,17 +161,14 @@ __osm_ni_rcv_set_links(
               p_old_neighbor_node,
               old_neighbor_port_num);
             sprintf( dr_old_path, "no_path_available" );
-            if (p_old_physp)
+            p_old_path = osm_physp_get_dr_path_ptr( p_old_physp );
+            if ( p_old_path )
             {
-              p_old_path = osm_physp_get_dr_path_ptr( p_old_physp );
-              if ( p_old_path )
+              sprintf( dr_old_path, "old_path:" );
+              for (i = 0; i <= p_old_path->hop_count; i++ )
               {
-                sprintf( dr_old_path, "old_path:" );
-                for (i = 0; i <= p_old_path->hop_count; i++ )
-                {
-                  sprintf( line, "[%X]", p_old_path->path[i] );
-                  strcat( dr_old_path, line );
-                }
+                sprintf( line, "[%X]", p_old_path->path[i] );
+                strcat( dr_old_path, line );
               }
             }
 
@@ -226,10 +220,9 @@ __osm_ni_rcv_set_links(
                      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
                      port_num );
             p_physp = osm_node_get_physp_ptr( p_node, port_num );
-            if (p_physp)
-              osm_dump_dr_path(p_rcv->p_log,
-                               osm_physp_get_dr_path_ptr(p_physp),
-                               OSM_LOG_ERROR);
+            osm_dump_dr_path(p_rcv->p_log,
+                             osm_physp_get_dr_path_ptr(p_physp),
+                             OSM_LOG_ERROR);
 
             osm_log( p_rcv->p_log, OSM_LOG_SYS,
                      "Errors on subnet. Duplicate GUID found "
@@ -313,7 +306,6 @@ __osm_ni_rcv_process_new_node(
   */
   p_physp = osm_node_get_physp_ptr( p_node, port_num );
 
-  CL_ASSERT( p_physp );
   CL_ASSERT( osm_physp_is_valid( p_physp ) );
   CL_ASSERT( osm_madw_get_bind_handle( p_madw ) ==
              osm_dr_path_get_bind_handle(
@@ -379,7 +371,6 @@ __osm_ni_rcv_get_node_desc(
   */
   p_physp = osm_node_get_physp_ptr( p_node, port_num );
 
-  CL_ASSERT( p_physp );
   CL_ASSERT( osm_physp_is_valid( p_physp ) );
   CL_ASSERT( osm_madw_get_bind_handle( p_madw ) ==
              osm_dr_path_get_bind_handle(
@@ -539,8 +530,6 @@ __osm_ni_rcv_process_existing_ca_or_router(
   {
     p_physp = osm_node_get_physp_ptr( p_node, port_num );
 
-    CL_ASSERT( p_physp );
-
     if ( !osm_physp_is_valid( p_physp ) )
     {
         osm_log( p_rcv->p_log, OSM_LOG_ERROR,
diff --git a/opensm/opensm/osm_pkey_rcv.c b/opensm/opensm/osm_pkey_rcv.c
index 7c87d7e..67fe067 100644
--- a/opensm/opensm/osm_pkey_rcv.c
+++ b/opensm/opensm/osm_pkey_rcv.c
@@ -174,8 +174,6 @@ osm_pkey_rcv_process(
     port_num = p_physp->port_num;
   }
 
-  CL_ASSERT( p_physp );
-
   /*
     We do not mind if this is a result of a set or get - all we want is to
     update the subnet.
diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index eab86e1..9e86ca5 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -589,7 +589,7 @@ __osm_physp_get_dr_physp_set(
              p_path->path[hop]);
 
     /* make sure we got a valid port and it has a remote port */
-    if (!(p_physp && osm_physp_is_valid( p_physp )))
+    if (!osm_physp_is_valid( p_physp ))
     {
       osm_log( p_log, OSM_LOG_ERROR,
                "__osm_physp_get_dr_nodes_set: ERR 4104: "
@@ -770,8 +770,7 @@ osm_physp_replace_dr_path_with_alternate_dr_path(
            4. The port is not in the physp_map
            5. This port haven't been visited before
         */
-        if ( p_remote_physp &&
-             osm_physp_is_valid ( p_remote_physp ) &&
+        if ( osm_physp_is_valid ( p_remote_physp ) &&
              p_remote_physp != p_physp &&
              cl_map_get( &physp_map, __osm_ptr_to_key(p_remote_physp)) == NULL &&
              cl_map_get( &visited_map, __osm_ptr_to_key(p_remote_physp)) == NULL )
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index 0076b00..a53044f 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -559,7 +559,7 @@ osm_pi_rcv_process_set(
   CL_ASSERT( p_node );
 
   p_physp = osm_node_get_physp_ptr( p_node, port_num );
-  CL_ASSERT( p_physp && osm_physp_is_valid( p_physp ) );
+  CL_ASSERT( osm_physp_is_valid( p_physp ) );
 
   port_guid = osm_physp_get_port_guid( p_physp );
 
@@ -744,10 +744,9 @@ osm_pi_rcv_process(
     }
 
     p_node = p_port->p_node;
-    p_physp = osm_node_get_physp_ptr( p_node, port_num );
-
     CL_ASSERT( p_node );
-    CL_ASSERT( p_physp );
+
+    p_physp = osm_node_get_physp_ptr( p_node, port_num );
 
     /*
       Determine if we encountered a new Physical Port.
diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
index f426241..bbb1608 100644
--- a/opensm/opensm/osm_qos.c
+++ b/opensm/opensm/osm_qos.c
@@ -337,7 +337,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm)
 			num_physp = osm_node_get_num_physp(p_node);
 			for (i = 1; i < num_physp; i++) {
 				p_physp = osm_node_get_physp_ptr(p_node, i);
-				if (!p_physp || !osm_physp_is_valid(p_physp))
+				if (!osm_physp_is_valid(p_physp))
 					continue;
 				status =
 				    qos_physp_setup(&p_osm->log, &p_osm->sm.req,
diff --git a/opensm/opensm/osm_sa_link_record.c b/opensm/opensm/osm_sa_link_record.c
index 5e4e35e..81d3877 100644
--- a/opensm/opensm/osm_sa_link_record.c
+++ b/opensm/opensm/osm_sa_link_record.c
@@ -357,7 +357,8 @@ __osm_lr_rcv_get_port_links(
           p_dest_physp = osm_node_get_physp_ptr( p_dest_port->p_node,
                                                  dest_port_num );
           /* both physical ports should be with data */
-          if (p_src_physp && p_dest_physp)
+          if (osm_physp_is_valid(p_src_physp) &&
+              osm_physp_is_valid(p_dest_physp))
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp,
                                          p_dest_physp, comp_mask,
                                          p_list, p_req_physp );
@@ -377,7 +378,7 @@ __osm_lr_rcv_get_port_links(
         if (port_num < p_src_port->p_node->physp_tbl_size)
         {          
           p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num );
-          if (p_src_physp)
+          if (osm_physp_is_valid(p_src_physp))
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp,
                                          NULL, comp_mask, p_list,
                                          p_req_physp );
@@ -389,7 +390,7 @@ __osm_lr_rcv_get_port_links(
         for( port_num = 1; port_num < num_ports; port_num++ )
         {
           p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num );
-          if (p_src_physp)
+          if (osm_physp_is_valid(p_src_physp))
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp,
                                          NULL, comp_mask, p_list,
                                          p_req_physp );
@@ -411,9 +412,9 @@ __osm_lr_rcv_get_port_links(
            this couldn't be a relevant record. */
         if (port_num < p_dest_port->p_node->physp_tbl_size )
         {
-          p_dest_physp = osm_node_get_physp_ptr(
-            p_dest_port->p_node, port_num );
-          if (p_dest_physp)
+          p_dest_physp = osm_node_get_physp_ptr( p_dest_port->p_node,
+                                                 port_num );
+          if (osm_physp_is_valid(p_dest_physp))
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL,
                                          p_dest_physp, comp_mask,
                                          p_list, p_req_physp );
@@ -424,9 +425,9 @@ __osm_lr_rcv_get_port_links(
         num_ports = osm_node_get_num_physp( p_dest_port->p_node );
         for( port_num = 1; port_num < num_ports; port_num++ )
         {
-          p_dest_physp = osm_node_get_physp_ptr(
-            p_dest_port->p_node, port_num );
-          if (p_dest_physp)
+          p_dest_physp = osm_node_get_physp_ptr( p_dest_port->p_node,
+                                                 port_num );
+          if (osm_physp_is_valid(p_dest_physp))
             __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL,
                                          p_dest_physp, comp_mask,
                                          p_list, p_req_physp );
diff --git a/opensm/opensm/osm_sa_pkey_record.c b/opensm/opensm/osm_sa_pkey_record.c
index 8a71314..49606bb 100644
--- a/opensm/opensm/osm_sa_pkey_record.c
+++ b/opensm/opensm/osm_sa_pkey_record.c
@@ -254,7 +254,7 @@ __osm_sa_pkey_by_comp_mask(
       p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
       /* Check that the p_physp is valid, and that is shares a pkey
          with the p_req_physp. */
-      if( p_physp && osm_physp_is_valid( p_physp ) &&
+      if( osm_physp_is_valid( p_physp ) &&
           (osm_physp_share_pkey(p_rcv->p_log, p_req_physp, p_physp)) )
         __osm_sa_pkey_check_physp( p_rcv, p_physp, p_ctxt );
     }
@@ -273,9 +273,6 @@ __osm_sa_pkey_by_comp_mask(
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
       p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
-      if( p_physp == NULL )
-        continue;
-
       if( !osm_physp_is_valid( p_physp ) )
         continue;
 
diff --git a/opensm/opensm/osm_sa_portinfo_record.c b/opensm/opensm/osm_sa_portinfo_record.c
index 74f53d6..a1f3fcb 100644
--- a/opensm/opensm/osm_sa_portinfo_record.c
+++ b/opensm/opensm/osm_sa_portinfo_record.c
@@ -547,7 +547,7 @@ __osm_sa_pir_by_comp_mask(
       p_physp = osm_node_get_physp_ptr( p_port->p_node, p_rcvd_rec->port_num );
       /* Check that the p_physp is valid, and that the p_physp and the
          p_req_physp share a pkey. */
-      if( p_physp && osm_physp_is_valid( p_physp ) &&
+      if( osm_physp_is_valid( p_physp ) &&
           osm_physp_share_pkey(p_rcv->p_log, p_req_physp, p_physp))
         __osm_sa_pir_check_physp( p_rcv, p_physp, p_ctxt );
     }
@@ -557,9 +557,6 @@ __osm_sa_pir_by_comp_mask(
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
       p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
-      if( p_physp == NULL )
-        continue;
-
       if( !osm_physp_is_valid( p_physp ) )
         continue;
 
diff --git a/opensm/opensm/osm_sa_slvl_record.c b/opensm/opensm/osm_sa_slvl_record.c
index e40ad61..010f23e 100644
--- a/opensm/opensm/osm_sa_slvl_record.c
+++ b/opensm/opensm/osm_sa_slvl_record.c
@@ -244,9 +244,6 @@ __osm_sa_slvl_by_comp_mask(
 
     for( out_port_num = out_port_start; out_port_num <= out_port_end; out_port_num++ ) {
       p_out_physp = osm_node_get_physp_ptr( p_port->p_node, out_port_num );
-      if( p_out_physp == NULL )
-        continue;
-
       if( !osm_physp_is_valid( p_out_physp ) )
         continue;
 
@@ -257,9 +254,6 @@ __osm_sa_slvl_by_comp_mask(
 #endif
 
         p_in_physp = osm_node_get_physp_ptr( p_port->p_node, in_port_num );
-        if( p_in_physp == NULL )
-          continue;
-
         if( !osm_physp_is_valid( p_in_physp ) )
           continue;
 
diff --git a/opensm/opensm/osm_sa_vlarb_record.c b/opensm/opensm/osm_sa_vlarb_record.c
index a462ee9..8f60d8d 100644
--- a/opensm/opensm/osm_sa_vlarb_record.c
+++ b/opensm/opensm/osm_sa_vlarb_record.c
@@ -258,7 +258,7 @@ __osm_sa_vl_arb_by_comp_mask(
       p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
       /* check that the p_physp is valid, and that the requester
          and the p_physp share a pkey. */
-      if( p_physp && osm_physp_is_valid( p_physp ) &&
+      if( osm_physp_is_valid( p_physp ) &&
           osm_physp_share_pkey(p_rcv->p_log, p_req_physp, p_physp) )
         __osm_sa_vl_arb_check_physp( p_rcv, p_physp, p_ctxt );
     }
@@ -277,9 +277,6 @@ __osm_sa_vl_arb_by_comp_mask(
     for( port_num = 0; port_num < num_ports; port_num++ )
     {
       p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num );
-      if( p_physp == NULL )
-        continue;
-
       if( !osm_physp_is_valid( p_physp ) )
         continue;
 
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 46c1cd0..73980b8 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -925,7 +925,6 @@ __osm_state_mgr_sweep_hop_1(
 
    p_physp = osm_node_get_physp_ptr( p_node, port_num );
 
-   CL_ASSERT( p_physp );
    CL_ASSERT( osm_physp_is_valid( p_physp ) );
 
    p_dr_path = osm_physp_get_dr_path_ptr( p_physp );
@@ -972,9 +971,6 @@ __osm_state_mgr_sweep_hop_1(
       {
          /* go through the port only if the port is not DOWN */
          p_ext_physp = osm_node_get_physp_ptr( p_node, port_num );
-         /* Make sure the physp object exists */
-         if( !p_ext_physp )
-            continue;
          if( ib_port_info_get_port_state( &( p_ext_physp->port_info ) ) >
              IB_LINK_DOWN )
          {
@@ -1119,7 +1115,7 @@ __osm_topology_file_create(
 
             p_physp = osm_node_get_physp_ptr( p_node, cPort );
 
-            if( ( p_physp == NULL ) || ( !osm_physp_is_valid( p_physp ) ) )
+            if( !osm_physp_is_valid( p_physp ) )
                continue;
 
             p_rphysp = p_physp->p_remote_physp;
@@ -1288,7 +1284,7 @@ __osm_state_mgr_report(
       for( port_num = start_port; port_num < num_ports; port_num++ )
       {
          p_physp = osm_node_get_physp_ptr( p_node, port_num );
-         if( ( p_physp == NULL ) || ( !osm_physp_is_valid( p_physp ) ) )
+         if( !osm_physp_is_valid( p_physp ) )
             continue;
 
          osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, "%s : %s : %02X :",
diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
index a79f5cd..2a8d1c2 100644
--- a/opensm/opensm/osm_switch.c
+++ b/opensm/opensm/osm_switch.c
@@ -321,14 +321,15 @@ osm_switch_recommend_path(
 
     if (port_num != OSM_NO_PATH)
     {
+      CL_ASSERT(port_num < num_ports);
+
       p_physp = osm_node_get_physp_ptr(p_sw->p_node, port_num);
       /*
         Don't be too trusting of the current forwarding table!
         Verify that the port number is legal and that the
         LID is reachable through this port.
       */
-      if( (port_num < num_ports )  &&
-          osm_physp_is_valid(p_physp) &&
+      if( osm_physp_is_valid(p_physp) &&
           osm_physp_is_healthy(p_physp) &&
           osm_physp_get_remote(p_physp) )
       {
diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index 309cdd5..c0cab76 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -100,6 +100,7 @@ __get_physp_by_lid_and_num(
 {
   cl_ptr_vector_t *p_vec = &(p_rcv->p_subn->port_lid_tbl);
   osm_port_t *p_port;
+  osm_physp_t *p_physp;
 
   if (lid > cl_ptr_vector_get_size(p_vec))
     return NULL;
@@ -111,7 +112,9 @@ __get_physp_by_lid_and_num(
   if (osm_node_get_num_physp(p_port->p_node) < num)
     return NULL;
 
-  return( osm_node_get_physp_ptr(p_port->p_node, num) );
+  p_physp = osm_node_get_physp_ptr(p_port->p_node, num);
+
+  return osm_physp_is_valid(p_physp) ? p_physp : NULL;
 }
 
 /**********************************************************************
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 5d32e89..04f32d5 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -162,7 +162,7 @@ static uint64_t osm_lash_get_switch_guid(IN const osm_switch_t *p_sw)
   uint64_t switch_guid = -1;
   osm_physp_t* p_physp = osm_node_get_physp_ptr(p_sw->p_node, 0);
 
-  if (p_physp && osm_physp_is_valid (p_physp))
+  if (osm_physp_is_valid(p_physp))
     switch_guid = osm_physp_get_port_guid(p_physp);
 
   return switch_guid;
@@ -215,7 +215,7 @@ static uint8_t find_port_from_lid(IN const ib_net16_t lid_no,
 
     p_current_physp = osm_node_get_physp_ptr(p_sw->p_node, i);
 
-    if (p_current_physp && osm_physp_is_valid (p_current_physp)) {
+    if (osm_physp_is_valid(p_current_physp)) {
 
 	p_remote_physp = p_current_physp->p_remote_physp;
 
@@ -1251,10 +1251,10 @@ static void osm_lash_process_switch(lash_t *p_lash, osm_switch_t *p_sw)
 
     p_current_physp = osm_node_get_physp_ptr(p_sw->p_node, i);
 
-    if (osm_physp_is_valid (p_current_physp)) {
+    if (osm_physp_is_valid(p_current_physp)) {
       p_remote_physp = p_current_physp->p_remote_physp;
 
-      if (p_remote_physp && osm_physp_is_valid ( p_remote_physp ) &&
+      if (p_remote_physp && osm_physp_is_valid(p_remote_physp) &&
           p_remote_physp->p_node->sw) {
 	int physical_port_a_num = osm_physp_get_port_num(p_current_physp);
 	int physical_port_b_num = osm_physp_get_port_num(p_remote_physp);
@@ -1342,8 +1342,8 @@ static int discover_network_properties(lash_t *p_lash)
       for (i=1; i<port_count; i++) {
 	  osm_physp_t *p_current_physp = osm_node_get_physp_ptr(p_sw->p_node, i);
 
-	  if (p_current_physp && osm_physp_is_valid (p_current_physp) &&
-	      p_current_physp->p_remote_physp) {
+	  if (osm_physp_is_valid(p_current_physp) &&
+              p_current_physp->p_remote_physp) {
 
 	    ib_port_info_t *p_port_info = &p_current_physp->port_info;
 	    uint8_t port_vl_min = ib_port_info_get_op_vls(p_port_info);
-- 
1.5.2.160.g10a94


From halr at voltaire.com  Wed May 30 20:59:32 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 May 2007 23:59:32 -0400
Subject: [ofa-general] Re: [PATCH] opensm: osm_node_get_physp_ptr() usage
	fixes
In-Reply-To: <20070530220127.GP13193@sashak.voltaire.com>
References: <20070530220127.GP13193@sashak.voltaire.com>
Message-ID: <1180583970.7116.120376.camel@hal.voltaire.com>

On Wed, 2007-05-30 at 18:01, Sasha Khapyorsky wrote:
> Function osm_node_get_physp_ptr() cannot return NULL, but can return
> pointer to non-initialized object. This patch fixes cases where resulted
> pointer was not verified properly.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.
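
For readers following the thread, here is a minimal sketch of the pattern
the patch relies on: a lookup that always returns a pointer into a
preallocated table, so callers have to test a validity flag instead of
comparing against NULL. The types and names (toy_physp_t,
toy_node_get_physp_ptr, toy_physp_is_valid) are hypothetical stand-ins and
not the OpenSM definitions; treating a zero GUID as "uninitialized" is an
assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PORTS 4

/* Hypothetical stand-in for a physical port object. */
typedef struct {
    uint64_t port_guid;  /* 0 means "slot never initialized" (assumption) */
    uint8_t  port_num;
} toy_physp_t;

/* Hypothetical node that embeds its port table. */
typedef struct {
    toy_physp_t physp_table[TOY_MAX_PORTS];
} toy_node_t;

/* Returns a pointer into the table; never NULL for an in-range index. */
static toy_physp_t *toy_node_get_physp_ptr(toy_node_t *node, uint8_t port_num)
{
    assert(port_num < TOY_MAX_PORTS);
    return &node->physp_table[port_num];
}

/* The only meaningful check: has this slot been initialized? */
static int toy_physp_is_valid(const toy_physp_t *physp)
{
    return physp->port_guid != 0;
}

/* Count initialized ports, skipping slots that exist but are not valid. */
static unsigned toy_count_valid_ports(toy_node_t *node)
{
    unsigned count = 0;

    for (uint8_t i = 0; i < TOY_MAX_PORTS; i++) {
        toy_physp_t *physp = toy_node_get_physp_ptr(node, i);

        if (!toy_physp_is_valid(physp))
            continue;
        count++;
    }
    return count;
}
```

In this model a "physp == NULL" test can never fire, which is why it is
dead code, while dereferencing fields of a slot without the validity test
reads an uninitialized object.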

-- Hal


From mst at dev.mellanox.co.il  Wed May 30 21:34:47 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 31 May 2007 07:34:47 +0300
Subject: [ofa-general] Re: wmb missing in libmthca?
In-Reply-To: <adaejkyqbdo.fsf@cisco.com>
References: <20070524114711.GB4585@mellanox.co.il> <adaejkyqbdo.fsf@cisco.com>
Message-ID: <20070531043447.GA11669@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: wmb missing in libmthca?
> 
>  > Roland, I see this in kernel:
>  > 
>  >                 ((struct mthca_next_seg *) prev_wqe)->nda_op =
>  >                         cpu_to_be32((ind << qp->rq.wqe_shift) | 1);
>  >                 wmb();
>  >                 ((struct mthca_next_seg *) prev_wqe)->ee_nds =
>  >                         cpu_to_be32(MTHCA_NEXT_DBD | size);
>  > 
>  > but userspace does not have wmb here.
>  > Is it needed?
> 
> It does seem that way -- otherwise the hardware might read prev_wqe
> and see the ee_nds field as set before the nda_op field has the right
> value.  Does this look right to you as a libmthca fix?

Looks ok.
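
A minimal userspace sketch of the ordering concern, using a C11 release
fence as a stand-in for the kernel's wmb(). The struct layout and the names
toy_next_seg and toy_link_wqe are illustrative assumptions, not libmthca
code, and the byte swapping (cpu_to_be32) is left out. For memory the HCA
actually reads, the library's own barrier macro would be used; the C11
fence here only approximates it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical two-field segment; only the store order matters here. */
struct toy_next_seg {
    uint32_t nda_op;  /* address/opcode of the next WQE */
    uint32_t ee_nds;  /* size and doorbell flag; makes the link "live" */
};

/*
 * Link a new WQE behind prev.  nda_op must be visible before ee_nds,
 * otherwise a reader that sees ee_nds set could follow a stale nda_op.
 */
static void toy_link_wqe(volatile struct toy_next_seg *prev,
                         uint32_t nda_op, uint32_t ee_nds)
{
    assert(prev);

    prev->nda_op = nda_op;

    /* Stand-in for wmb(): keep the store above ahead of the store below. */
    atomic_thread_fence(memory_order_release);

    prev->ee_nds = ee_nds;
}
```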

-- 
MST


From mst at dev.mellanox.co.il  Wed May 30 21:54:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 31 May 2007 07:54:00 +0300
Subject: [ofa-general] Re: [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local
	path record caching
In-Reply-To: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com>
References: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com>
Message-ID: <20070531045400.GB11669@mellanox.co.il>

> Quoting Sean Hefty <sean.hefty at intel.com>:
> Subject: [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local path record caching
> 
> I've updated the local SA patches based on previous feedback.
> The most significant change is to integrate the local SA with
> the ib_sa module.  This allows all apps to make use of the local
> SA without changes.
> 
> The use of a device file was also replaced with simple module
> parameters.
> 
> I've also pushed these changes to:
> 
> 	git://git.openfabrics.org/~shefty/rdma-dev.git sa_cache
> 
> I would like to close any open issues with this approach in time
> to pull it into 2.6.23.

This code seems to be significantly different from what
OFED 1.2 has (and on the flip side, it does look better),
but this means that it's all new code in a core component.
Might it be prudent to have this sit in -mm for a while?
Another approach might be to have it disabled by default in 2.6.23 -
that would make it a low-risk change, and give people time to experiment
with it.

-- 
MST


From sean.hefty at intel.com  Wed May 30 22:40:10 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 30 May 2007 22:40:10 -0700
Subject: [ofa-general] RE: [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local
	path recordcaching
In-Reply-To: <20070531045400.GB11669@mellanox.co.il>
Message-ID: <000001c7a346$28212ed0$cec9180a@amr.corp.intel.com>

>This code seems to be significantly different from what
>OFED 1.2 has (and on the flip side, it does look better),
>but this means that it's all new code in a core component.
>Might it be prudent to have this sit in -mm for a while?
>Another approach might be to have it disabled by default in 2.6.23 -
>that would make it a low-risk change, and give people time to experiment
>with it.

I would rather disable it by default, since I'm not sure how many IB people run
-mm.

I go back and forth on whether the cache should be enabled by default and
figured we could decide on the list.

- Sean


From mst at dev.mellanox.co.il  Wed May 30 22:46:22 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 31 May 2007 08:46:22 +0300
Subject: [ofa-general] Re: [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local
	path recordcaching
In-Reply-To: <000001c7a346$28212ed0$cec9180a@amr.corp.intel.com>
References: <20070531045400.GB11669@mellanox.co.il>
	<000001c7a346$28212ed0$cec9180a@amr.corp.intel.com>
Message-ID: <20070531054622.GC11669@mellanox.co.il>

> Quoting Sean Hefty <sean.hefty at intel.com>:
> Subject: RE: [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local path recordcaching
> 
> >This code seems to be significantly different from what
> >OFED 1.2 has (and on the flip side, it does look better),
> >but this means that it's all new code in a core component.
> >Might it be prudent to have this sit in -mm for a while?
> >Another approach might be to have it disabled by default in 2.6.23 -
> >that would make it a low-risk change, and give people time to experiment
> >with it.
> 
> I would rather disable it by default, since I'm not sure how many IB people run
> -mm.

Yes, seems to be true.

> I go back and forth on whether the cache should be enabled by default and
> figured we could decide on the list.


-- 
MST


From mst at dev.mellanox.co.il  Wed May 30 22:52:51 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 31 May 2007 08:52:51 +0300
Subject: [ofa-general] Re: [RFC] [PATCH 2/2] for 2.6.23: ib/sa - add local
	path record caching
In-Reply-To: <000b01c7a2fb$6ebb05f0$3c98070a@amr.corp.intel.com>
References: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com>
	<000b01c7a2fb$6ebb05f0$3c98070a@amr.corp.intel.com>
Message-ID: <20070531055251.GD11669@mellanox.co.il>

It seems that below you try to get 0x7F paths to each dest:

+enum {
+	SA_DB_MAX_PATHS_PER_DEST = 0x7F,
+	SA_DB_MIN_RETRY_TIMER	 = 4000,  /*   4 sec */
+	SA_DB_MAX_RETRY_TIMER	 = 256000 /* 256 sec */
+};
+
+static int set_paths_per_dest(const char *val, struct kernel_param *kp);
+static unsigned long paths_per_dest = SA_DB_MAX_PATHS_PER_DEST;
+module_param_call(paths_per_dest, set_paths_per_dest, param_get_ulong,
+		  &paths_per_dest, 0644);
+MODULE_PARM_DESC(paths_per_dest, "Maximum number of paths to retrieve "
+				 "to each destination (DGID).  Set to 0 "
+				 "to disable cache.");

But here you seem to bypass the cache for multi-path queries:

+int ib_sa_path_rec_get(struct ib_sa_client *client,
+		       struct ib_device *device, u8 port_num,
+		       struct ib_sa_path_rec *rec,
+		       ib_sa_comp_mask comp_mask,
+		       int timeout_ms, gfp_t gfp_mask,
+		       void (*callback)(int status,
+					struct ib_sa_path_rec *resp,
+					void *context),
+		       void *context,
+		       struct ib_sa_query **sa_query)
+{
+	struct sa_path_request *req;
+	struct ib_sa_attr_iter iter;
+	struct ib_sa_path_rec *path_rec;
+	int ret;
+
+	if (!paths_per_dest)
+		goto query_sa;
+
+	if (!(comp_mask & IB_SA_PATH_REC_DGID) ||
+	    !(comp_mask & IB_SA_PATH_REC_NUMB_PATH) || rec->numb_path != 1)
+		goto query_sa;

How are multiple paths used?
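
To make the question concrete, the bypass condition quoted above reduces to
the predicate below. This is a sketch only: the mask bit values and the
toy_* names are placeholders, not the kernel's ib_sa definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits; the real IB_SA_PATH_REC_* values differ. */
#define TOY_PATH_REC_DGID      (1u << 0)
#define TOY_PATH_REC_NUMB_PATH (1u << 1)

/*
 * Mirrors the quoted checks: the cache is consulted only when it is
 * enabled, the DGID and NumbPath components are both specified, and
 * exactly one path is requested.  Everything else goes to the SA.
 */
static bool toy_query_uses_cache(unsigned long paths_per_dest,
                                 uint32_t comp_mask, uint8_t numb_path)
{
    if (!paths_per_dest)
        return false;

    if (!(comp_mask & TOY_PATH_REC_DGID) ||
        !(comp_mask & TOY_PATH_REC_NUMB_PATH) || numb_path != 1)
        return false;

    return true;
}

/* Small self-check of the cases discussed in this message. */
static void toy_demo(void)
{
    uint32_t both = TOY_PATH_REC_DGID | TOY_PATH_REC_NUMB_PATH;

    assert(toy_query_uses_cache(0x7F, both, 1));   /* single path: cached */
    assert(!toy_query_uses_cache(0x7F, both, 4));  /* multi-path: bypass */
    assert(!toy_query_uses_cache(0, both, 1));     /* cache disabled */
}
```

So with paths_per_dest left at 0x7F the cache may hold up to 0x7F paths per
DGID, yet a request for more than one path never reads them.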

-- 
MST


From monil at voltaire.com  Thu May 31 00:35:19 2007
From: monil at voltaire.com (Moni Levy)
Date: Thu, 31 May 2007 10:35:19 +0300
Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user
In-Reply-To: <adamyzmqjl2.fsf@cisco.com>
References: <465AF3D3.10205@dev.mellanox.co.il> <adad50k285u.fsf@cisco.com>
	<20070529091246.GF8159@mellanox.co.il> <adamyzmqjl2.fsf@cisco.com>
Message-ID: <6a122cc00705310035o400580d9q890a5a21b40d2170@mail.gmail.com>

Michael, can you please add this to OFED 1.2.RC4? Tziporet, please approve.

Moni

On 5/30/07, Roland Dreier <rdreier at cisco.com> wrote:
> ok, I applied the patch with changes as discussed
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From mst at dev.mellanox.co.il  Thu May 31 00:39:45 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 31 May 2007 10:39:45 +0300
Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user
In-Reply-To: <6a122cc00705310035o400580d9q890a5a21b40d2170@mail.gmail.com>
References: <465AF3D3.10205@dev.mellanox.co.il> <adad50k285u.fsf@cisco.com>
	<20070529091246.GF8159@mellanox.co.il> <adamyzmqjl2.fsf@cisco.com>
	<6a122cc00705310035o400580d9q890a5a21b40d2170@mail.gmail.com>
Message-ID: <20070531073945.GB26309@mellanox.co.il>

Vlad does this stuff.

Quoting Moni Levy <monil at voltaire.com>:
Subject: Re: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user

Michael, can you please add this to OFED 1.2.RC4? Tziporet, please approve.

Moni

On 5/30/07, Roland Dreier <rdreier at cisco.com> wrote:
>ok, I applied the patch with changes as discussed

-- 
MST


From mst at dev.mellanox.co.il  Thu May 31 00:46:35 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 31 May 2007 10:46:35 +0300
Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user
In-Reply-To: <adamyzmqjl2.fsf@cisco.com>
References: <465AF3D3.10205@dev.mellanox.co.il> <adad50k285u.fsf@cisco.com>
	<20070529091246.GF8159@mellanox.co.il> <adamyzmqjl2.fsf@cisco.com>
Message-ID: <20070531074635.GC26309@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] suppress RLIMIT warning for root user
> 
> ok, I applied the patch with changes as discussed

Thanks. I saw you put this in master - can this go into the stable branch?
This warning is very annoying to people ...

Alternatively, how about rolling a libibverbs release so that OFED can use
that?

-- 
MST


From tziporet at dev.mellanox.co.il  Thu May 31 01:52:48 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 31 May 2007 11:52:48 +0300
Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user
In-Reply-To: <6a122cc00705310035o400580d9q890a5a21b40d2170@mail.gmail.com>
References: <465AF3D3.10205@dev.mellanox.co.il>
	<adad50k285u.fsf@cisco.com>	<20070529091246.GF8159@mellanox.co.il>
	<adamyzmqjl2.fsf@cisco.com>
	<6a122cc00705310035o400580d9q890a5a21b40d2170@mail.gmail.com>
Message-ID: <465E8CE0.5010303@mellanox.co.il>

Moni Levy wrote:
> Michael, can you please add this to OFED 1.2.RC4? Tziporet, please
> approve.
>
> Moni
>

Approved - Vlad please take this patch
Thanks,
Tziporet


From vlad at lists.openfabrics.org  Thu May 31 02:44:11 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu, 31 May 2007 02:44:11 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070531-0200 daily build status
Message-ID: <20070531094412.25E54E6083A@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:  --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.19
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From erezz at voltaire.com  Thu May 31 03:32:40 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Thu, 31 May 2007 13:32:40 +0300
Subject: [ofa-general] Re: [ewg] RE: [PATCH 2/2] IB/iser: add backport
 & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <39C75744D164D948A170E9792AF8E7CA1109F5@exil.voltaire.com>
References: <20070521114410.GG20400@mellanox.co.il><46557BCB.7030102@voltaire.com><20070524115715.GC4585@mellanox.co.il><465C2D78.30100@voltaire.com><20070529141143.GD27671@mellanox.co.il><465D7046.3080109@voltaire.com>	<20070530125456.GF9036@mellanox.co.il>
	<39C75744D164D948A170E9792AF8E7CA1109F5@exil.voltaire.com>
Message-ID: <465EA448.6090907@voltaire.com>


> I am doing that. However, attribute_container.c includes base.h which is in the kernel_addons dir. Since attribute_container.c is no longer there, I need to add the following line:
>  
> -I$(PWD)/kernel_addons/backport/2.6.9_U3/include/src/
>  
> It is not very very ugly, so I think that we can do that. I will make the required fixes according to this approach.
>   

Done. I removed kernel_addons_patches and made no changes to
the build scripts.

Can you please review it and let me know if you have more comments?

Thanks,
Erez


From devesh28 at gmail.com  Thu May 31 03:36:58 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Thu, 31 May 2007 16:06:58 +0530
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <000801c7a2f8$55749000$3c98070a@amr.corp.intel.com>
References: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com>
	<000801c7a2f8$55749000$3c98070a@amr.corp.intel.com>
Message-ID: <309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com>

On 5/31/07, Sean Hefty <sean.hefty at intel.com> wrote:
> >Ok, Soon I will post a patch related to this.
> >How static PR file will be generated? Needs to be discussed.
>
> Please look at my latest changes to the local SA when generating the patches.
>
> git://git.openfabrics.org/~shefty/rdma-dev.git sa_cache
>
Do you have a pointer or doc on the design of the current SA_CACHE
module? It would make things faster to understand. If not, I will
need your help to understand it, though I have a top-level view.
Thanks
> I'm not sure about the best way to communicate PRs to the cache.  I haven't
> given it more than about 2 minutes of thought, but as an idea, we could look at
> trying to make use of the userspace MAD interface.  For example, we could send
> MADs to the local SA with the PRs to load.  More details would obviously need to
> be worked out, but this could provide an extensible solution.
OK, you mean it's not required to create a separate device interface in
the cache module as such. I think this is a good idea. Just to
confirm: is /dev/umad active on each node other than the SM node?
>
> - Sean
>


From halr at voltaire.com  Thu May 31 03:42:12 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 06:42:12 -0400
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com>
References: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com>
	<000801c7a2f8$55749000$3c98070a@amr.corp.intel.com>
	<309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com>
Message-ID: <1180608131.7116.145947.camel@hal.voltaire.com>

On Thu, 2007-05-31 at 06:36, Devesh Sharma wrote:
> On 5/31/07, Sean Hefty <sean.hefty at intel.com> wrote:
> > >Ok, Soon I will post a patch related to this.
> > >How static PR file will be generated? Needs to be discussed.
> >
> > Please look at my latest changes to the local SA when generating the patches.
> >
> > git://git.openfabrics.org/~shefty/rdma-dev.git sa_cache
> >
> Do you have a pointer or doc on the design of the current SA_CACHE
> module? It would make things faster to understand. If not, I will
> need your help to understand it, though I have a top-level view.
> Thanks
> > I'm not sure about the best way to communicate PRs to the cache.  I haven't
> > given it more than about 2 minutes of thought, but as an idea, we could look at
> > trying to make use of the userspace MAD interface.  For example, we could send
> > MADs to the local SA with the PRs to load.  More details would obviously need to
> > be worked out, but this could provide an extensible solution.
> OK, you mean it's not required to create a separate device interface in
> the cache module as such. I think this is a good idea. Just to
> confirm: is /dev/umad active on each node other than the SM node?

It's not required to be, but it is for things that require userspace SA
client access (like RDMA CM or local cache).

-- Hal

> > - Sean
> >


From monil at voltaire.com  Thu May 31 04:24:55 2007
From: monil at voltaire.com (Moni Levy)
Date: Thu, 31 May 2007 14:24:55 +0300
Subject: [ofa-general] RE: [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local
	path recordcaching
In-Reply-To: <000001c7a346$28212ed0$cec9180a@amr.corp.intel.com>
References: <20070531045400.GB11669@mellanox.co.il>
	<000001c7a346$28212ed0$cec9180a@amr.corp.intel.com>
Message-ID: <6a122cc00705310424j3594c944i74f8f454ba928c96@mail.gmail.com>

On 5/31/07, Sean Hefty <sean.hefty at intel.com> wrote:
>
> I go back and forth on whether the cache should be enabled by default and
> figured we could decide on the list.

I vote for disabling it by default in OFED as well until we have an
agreed solution. It's very easy to enable it, after all. What do you
think?

--Moni

>
> - Sean


From mst at dev.mellanox.co.il  Thu May 31 05:12:39 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 31 May 2007 15:12:39 +0300
Subject: [ofa-general] Re: Re: [Query] ib add path record cache
In-Reply-To: <1180608131.7116.145947.camel@hal.voltaire.com>
References: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com>
	<000801c7a2f8$55749000$3c98070a@amr.corp.intel.com>
	<309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com>
	<1180608131.7116.145947.camel@hal.voltaire.com>
Message-ID: <20070531121239.GG26309@mellanox.co.il>

> > OK, you mean it's not required to create a separate device interface in
> > the cache module as such. I think this is a good idea. Just to
> > confirm: is /dev/umad active on each node other than the SM node?
> 
> It's not required to be, but it is for things that require userspace SA
> client access (like RDMA CM or local cache).

AFAIK /dev/umad is not required to use "rdma cm" - this module
has its own device.

-- 
MST


From halr at voltaire.com  Thu May 31 05:23:38 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 08:23:38 -0400
Subject: [ofa-general] Re: Re: [Query] ib add path record cache
In-Reply-To: <20070531121239.GG26309@mellanox.co.il>
References: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com>
	<000801c7a2f8$55749000$3c98070a@amr.corp.intel.com>
	<309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com>
	<1180608131.7116.145947.camel@hal.voltaire.com>
	<20070531121239.GG26309@mellanox.co.il>
Message-ID: <1180614217.7116.152409.camel@hal.voltaire.com>

On Thu, 2007-05-31 at 08:12, Michael S. Tsirkin wrote:
> > > OK, you mean it's not required to create a separate device interface in
> > > the cache module as such. I think this is a good idea. Just to
> > > confirm: is /dev/umad active on each node other than the SM node?
> > 
> > It's not required to be, but it is for things that require userspace SA
> > client access (like RDMA CM or local cache).
> 
> AFAIK /dev/umad is not required to use "rdma cm" - this module
> has its own device.

My mistake. 

It's also used by infiniband-diags and ibutils.

-- Hal


From eli at mellanox.co.il  Thu May 31 05:29:54 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 31 May 2007 15:29:54 +0300
Subject: [ofa-general] [PATCH] libibverbs/examples: free invalid pointer
Message-ID: <1180614624.7053.14.camel@mtls03>

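The dev_list pointer is allocated, incremented, and then freed, so
ibv_free_device_list() is handed a pointer that ibv_get_device_list() did
not return. Keep a copy of the original pointer and free that instead.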

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: libibverbs/examples/srq_pingpong.c
===================================================================
--- libibverbs.orig/examples/srq_pingpong.c	2007-05-31 15:18:10.000000000 +0300
+++ libibverbs/examples/srq_pingpong.c	2007-05-31 15:21:47.000000000 +0300
@@ -549,7 +549,7 @@ static void usage(const char *argv0)
 
 int main(int argc, char *argv[])
 {
-	struct ibv_device      **dev_list;
+	struct ibv_device      **dev_list, **__dev_list;
 	struct ibv_device 	*ib_dev;
 	struct ibv_wc		*wc;
 	struct pingpong_context *ctx;
@@ -668,7 +668,7 @@ int main(int argc, char *argv[])
 
 	page_size = sysconf(_SC_PAGESIZE);
 
-	dev_list = ibv_get_device_list(NULL);
+	dev_list = __dev_list = ibv_get_device_list(NULL);
 	if (!dev_list) {
 		fprintf(stderr, "No IB devices found\n");
 		return 1;
@@ -863,7 +863,7 @@ int main(int argc, char *argv[])
 	if (pp_close_ctx(ctx, num_qp))
 		return 1;
 
-	ibv_free_device_list(dev_list);
+	ibv_free_device_list(__dev_list);
 	free(rem_dest);
 
 	return 0;
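
The same bug class in a self-contained form, as a sketch: walk a
NULL-terminated list with a cursor and keep the base pointer for the free.
The toy_* functions are hypothetical stand-ins for
ibv_get_device_list()/ibv_free_device_list() and do not touch libibverbs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical allocator: returns a NULL-terminated array of n entries. */
static int **toy_get_list(int n)
{
    int **list;

    assert(n >= 0);
    list = calloc((size_t)n + 1, sizeof(*list));
    if (!list)
        return NULL;

    for (int i = 0; i < n; i++) {
        list[i] = malloc(sizeof(int));
        if (!list[i]) {
            while (i--)
                free(list[i]);
            free(list);
            return NULL;
        }
        *list[i] = i;
    }
    return list;
}

/* Must receive the pointer toy_get_list() returned, not an advanced copy. */
static void toy_free_list(int **list)
{
    for (int **p = list; *p; p++)
        free(*p);
    free(list);
}

/* Search with a cursor; returns the index of the match or -1. */
static int toy_find(int wanted)
{
    int **base = toy_get_list(4);
    int found = -1;

    if (!base)
        return -1;

    for (int **cur = base; *cur; ++cur) {
        if (**cur == wanted) {
            found = (int)(cur - base);
            break;
        }
    }

    toy_free_list(base);  /* free the original pointer, never the cursor */
    return found;
}
```

Passing the advanced cursor to the free routine instead of base would hand
the allocator an address it never returned, which is the invalid free the
patch removes.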


From halr at voltaire.com  Thu May 31 08:39:13 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 11:39:13 -0400
Subject: [ofa-general] [PATCH 1/2] management/*.spec.in: Change source
Message-ID: <1180625949.7116.164563.camel@hal.voltaire.com>

management/*.spec.in: Change source

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/infiniband-diags/infiniband-diags.spec.in b/infiniband-diags/infiniband-diags.spec.in
index cc9de3b..0a9c7bc 100644
--- a/infiniband-diags/infiniband-diags.spec.in
+++ b/infiniband-diags/infiniband-diags.spec.in
@@ -9,7 +9,7 @@ Release: %rel%{?dist}
 License: GPL/BSD
 Group: System Environment/Libraries
 BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
-Source: git://git.openfabrics.org/~halr/management/@TARBALL@
+Source: git://git.openfabrics.org/~halr/management/infiniband-diags-git.tgz
 Url: http://openfabrics.org/
 BuildRequires: libibmad-devel, opensm-devel, autoconf, automake
 Provides: perl(IBswcountlimits)
diff --git a/libibcommon/libibcommon.spec.in b/libibcommon/libibcommon.spec.in
index 73542fa..6ab806f 100644
--- a/libibcommon/libibcommon.spec.in
+++ b/libibcommon/libibcommon.spec.in
@@ -9,7 +9,7 @@ Release: %rel%{?dist}
 License: GPL/BSD
 Group: System Environment/Libraries
 BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
-Source: git://git.openfabrics.org/~halr/management/@TARBALL@
+Source: git://git.openfabrics.org/~halr/management/libibcommon-git.tgz
 Url: http://openfabrics.org/
 Requires(post): /sbin/ldconfig
 Requires(postun): /sbin/ldconfig
diff --git a/libibmad/libibmad.spec.in b/libibmad/libibmad.spec.in
index 0ca9ac3..8d2b10a 100644
--- a/libibmad/libibmad.spec.in
+++ b/libibmad/libibmad.spec.in
@@ -9,7 +9,7 @@ Release: %rel%{?dist}
 License: GPL/BSD
 Group: System Environment/Libraries
 BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
-Source: git://git.openfabrics.org/~halr/management/@TARBALL@
+Source: git://git.openfabrics.org/~halr/management/libibmad-git.tgz
 Url: http://openfabrics.org/
 BuildRequires: libibumad-devel, autoconf, libtool, automake
 Requires(post): /sbin/ldconfig
diff --git a/libibumad/libibumad.spec.in b/libibumad/libibumad.spec.in
index f2641e2..e5890d7 100644
--- a/libibumad/libibumad.spec.in
+++ b/libibumad/libibumad.spec.in
@@ -9,7 +9,7 @@ Release: %rel%{?dist}
 License: GPL/BSD
 Group: System Environment/Libraries
 BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
-Source: git://git.openfabrics.org/~halr/management/@TARBALL@
+Source: git://git.openfabrics.org/~halr/management/libibumad-git.tgz
 Url: http://openfabrics.org
 Requires(post): /sbin/ldconfig
 Requires(postun): /sbin/ldconfig
diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in
index ea86756..f72c5b9 100644
--- a/opensm/opensm.spec.in
+++ b/opensm/opensm.spec.in
@@ -21,7 +21,7 @@ Release: %rel%{?dist}
 License: GPL/BSD
 Group: System Environment/Daemons
 URL: http://openfabrics.org/
-Source: git://git.openfabrics.org/~halr/management/@TARBALL@
+Source: git://git.openfabrics.org/~halr/management/opensm-git.tgz
 BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
 BuildRequires: libibumad-devel, autoconf, libtool, automake
 Requires: %{name}-libs = %{version}-%{release}, logrotate


From halr at voltaire.com  Thu May 31 08:41:38 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 11:41:38 -0400
Subject: [ofa-general] [PATCH 2/2] management/make.dist: Handle .spec files
	differently
Message-ID: <1180625959.7116.164565.camel@hal.voltaire.com>

management/make.dist: Handle .spec files differently

No longer check in .spec files; remove them after the tarball is created.

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/make.dist b/make.dist
index 99481de..f65c7e2 100755
--- a/make.dist
+++ b/make.dist
@@ -24,10 +24,10 @@ echo "code as released before even if yo
 echo "around."
 echo
 echo "	As part of this process, the script will parse the <target>.spec.in"
-echo "file and output a <target>.spec file and check that into the git repo"
-echo "so it is included in the tag.  Since this script isn't smart enough"
-echo "to deal with other random changes that should have their own checkin,"
-echo "the script will refuse to run if the current repo state is not clean."
+echo "file and output a <target>.spec file.  Since this script isn't smart"
+echo "enough to deal with other random changes that should have their own" 
+echo "checkin the script will refuse to run if the current repo state is not"
+echo "clean."
 echo
 echo "	NOTE: the script has no clue if you are tagging on the right branch,"
 echo "it will however show you the git branch output so you can confirm it"
@@ -107,7 +107,7 @@ for target in $TARGETS; do
 	else
 		TARBALL=$target-$VERSION.tgz
 	fi
-	sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $target/$target.spec.in > $target/$target.spec
+	sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/' < $target/$target.spec.in > $target/$target.spec
 	cp -a $target $target-$VERSION
 	echo "Creating $TMPDIR/$TARBALL"
 	tar -czf $TMPDIR/$TARBALL --exclude=.git $target-$VERSION
@@ -115,11 +115,10 @@ for target in $TARGETS; do
 done
 
 if [ $1 = release ]; then
-	echo "Checking in modified spec files and tagging release."
+	echo "Removing modified spec files and tagging release."
 	for target in $TARGETS; do
-		git add $target/$target.spec
+		rm -rf $target/$target.spec
 	done
-	git commit -m "Automatic check-in of target spec files after version processing"
 	for target in $TARGETS; do
 		VERSION=`grep "AC_INIT.*$target" $target/configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`
 		if [ ! -z "$2" ]; then
@@ -134,4 +133,3 @@ if [ $1 = release ]; then
 	done
 fi
 
-


From dledford at redhat.com  Thu May 31 08:47:11 2007
From: dledford at redhat.com (Doug Ledford)
Date: Thu, 31 May 2007 11:47:11 -0400
Subject: [ofa-general] Re: [PATCH 1/2] management/*.spec.in: Change source
In-Reply-To: <1180625949.7116.164563.camel@hal.voltaire.com>
References: <1180625949.7116.164563.camel@hal.voltaire.com>
Message-ID: <1180626431.4120.44.camel@firewall.xsintricity.com>

On Thu, 2007-05-31 at 11:39 -0400, Hal Rosenstock wrote:
> management/*.spec.in: Change source
>
> Signed-off-by: Hal Rosenstock <halr at voltaire.com>

Nak.  If you check this in, then the automatic version update of the
file for the different RPMs won't work.  You need to leave the @TARBALL@
placeholder in place in the spec.in file, and keep the sed substitution
in the make.dist script.  Otherwise your release tarballs will be named
e.g. opensm-3.3.0.tgz while the spec file will say opensm-git.tgz.

> diff --git a/infiniband-diags/infiniband-diags.spec.in b/infiniband-diags/infiniband-diags.spec.in
> index cc9de3b..0a9c7bc 100644
> --- a/infiniband-diags/infiniband-diags.spec.in
> +++ b/infiniband-diags/infiniband-diags.spec.in
> @@ -9,7 +9,7 @@ Release: %rel%{?dist}
>  License: GPL/BSD
>  Group: System Environment/Libraries
>  BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
> -Source: git://git.openfabrics.org/~halr/management/@TARBALL@
> +Source: git://git.openfabrics.org/~halr/management/infiniband-diags-git.tgz
>  Url: http://openfabrics.org/
>  BuildRequires: libibmad-devel, opensm-devel, autoconf, automake
>  Provides: perl(IBswcountlimits)
> diff --git a/libibcommon/libibcommon.spec.in b/libibcommon/libibcommon.spec.in
> index 73542fa..6ab806f 100644
> --- a/libibcommon/libibcommon.spec.in
> +++ b/libibcommon/libibcommon.spec.in
> @@ -9,7 +9,7 @@ Release: %rel%{?dist}
>  License: GPL/BSD
>  Group: System Environment/Libraries
>  BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
> -Source: git://git.openfabrics.org/~halr/management/@TARBALL@
> +Source: git://git.openfabrics.org/~halr/management/libibcommon-git.tgz
>  Url: http://openfabrics.org/
>  Requires(post): /sbin/ldconfig
>  Requires(postun): /sbin/ldconfig
> diff --git a/libibmad/libibmad.spec.in b/libibmad/libibmad.spec.in
> index 0ca9ac3..8d2b10a 100644
> --- a/libibmad/libibmad.spec.in
> +++ b/libibmad/libibmad.spec.in
> @@ -9,7 +9,7 @@ Release: %rel%{?dist}
>  License: GPL/BSD
>  Group: System Environment/Libraries
>  BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
> -Source: git://git.openfabrics.org/~halr/management/@TARBALL@
> +Source: git://git.openfabrics.org/~halr/management/libibmad-git.tgz
>  Url: http://openfabrics.org/
>  BuildRequires: libibumad-devel, autoconf, libtool, automake
>  Requires(post): /sbin/ldconfig
> diff --git a/libibumad/libibumad.spec.in b/libibumad/libibumad.spec.in
> index f2641e2..e5890d7 100644
> --- a/libibumad/libibumad.spec.in
> +++ b/libibumad/libibumad.spec.in
> @@ -9,7 +9,7 @@ Release: %rel%{?dist}
>  License: GPL/BSD
>  Group: System Environment/Libraries
>  BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
> -Source: git://git.openfabrics.org/~halr/management/@TARBALL@
> +Source: git://git.openfabrics.org/~halr/management/libibumad-git.tgz
>  Url: http://openfabrics.org
>  Requires(post): /sbin/ldconfig
>  Requires(postun): /sbin/ldconfig
> diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in
> index ea86756..f72c5b9 100644
> --- a/opensm/opensm.spec.in
> +++ b/opensm/opensm.spec.in
> @@ -21,7 +21,7 @@ Release: %rel%{?dist}
>  License: GPL/BSD
>  Group: System Environment/Daemons
>  URL: http://openfabrics.org/
> -Source: git://git.openfabrics.org/~halr/management/@TARBALL@
> +Source: git://git.openfabrics.org/~halr/management/opensm-git.tgz
>  BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
>  BuildRequires: libibumad-devel, autoconf, libtool, automake
>  Requires: %{name}-libs = %{version}-%{release}, logrotate
> 
> 
-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From halr at voltaire.com  Thu May 31 09:05:30 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 12:05:30 -0400
Subject: [ofa-general] Re: [PATCH 1/2] management/*.spec.in: Change source
In-Reply-To: <1180626431.4120.44.camel@firewall.xsintricity.com>
References: <1180625949.7116.164563.camel@hal.voltaire.com>
	<1180626431.4120.44.camel@firewall.xsintricity.com>
Message-ID: <1180627527.7116.166220.camel@hal.voltaire.com>

On Thu, 2007-05-31 at 11:47, Doug Ledford wrote:
> On Thu, 2007-05-31 at 11:39 -0400, Hal Rosenstock wrote:
> > management/*.spec.in: Change source
> >
> > Signed-off-by: Hal Rosenstock <halr at voltaire.com>
> 
> Nak.  If you check this in, then the automatic version update of the
> file for the different RPMs won't work.  You need to leave the @TARBALL@
> placeholder in place in the spec.in file, and keep the sed substitution
> in the make.dist script.  Otherwise your release tarballs will be named
> e.g. opensm-3.3.0.tgz while the spec file will say opensm-git.tgz.

OK; I'm dropping this part. I'll reissue the second part (make.dist)
shortly.

-- Hal


From halr at voltaire.com  Thu May 31 09:05:59 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 12:05:59 -0400
Subject: [ofa-general] [PATCH v2] management/make.dist: Handle .spec files
	differently
Message-ID: <1180627536.7116.166222.camel@hal.voltaire.com>

management/make.dist: Handle .spec files differently

No longer commit .spec files and remove them after tarball is created.

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/make.dist b/make.dist
index 99481de..20d9ca4 100755
--- a/make.dist
+++ b/make.dist
@@ -24,10 +24,10 @@ echo "code as released before even if yo
 echo "around."
 echo
 echo "	As part of this process, the script will parse the <target>.spec.in"
-echo "file and output a <target>.spec file and check that into the git repo"
-echo "so it is included in the tag.  Since this script isn't smart enough"
-echo "to deal with other random changes that should have their own checkin,"
-echo "the script will refuse to run if the current repo state is not clean."
+echo "file and output a <target>.spec file.  Since this script isn't smart"
+echo "enough to deal with other random changes that should have their own" 
+echo "checkin the script will refuse to run if the current repo state is not"
+echo "clean."
 echo
 echo "	NOTE: the script has no clue if you are tagging on the right branch,"
 echo "it will however show you the git branch output so you can confirm it"
@@ -115,11 +115,10 @@ for target in $TARGETS; do
 done
 
 if [ $1 = release ]; then
-	echo "Checking in modified spec files and tagging release."
+	echo "Removing modified spec files and tagging release."
 	for target in $TARGETS; do
-		git add $target/$target.spec
+		rm -rf $target/$target.spec
 	done
-	git commit -m "Automatic check-in of target spec files after version processing"
 	for target in $TARGETS; do
 		VERSION=`grep "AC_INIT.*$target" $target/configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`
 		if [ ! -z "$2" ]; then


From dledford at redhat.com  Thu May 31 09:12:00 2007
From: dledford at redhat.com (Doug Ledford)
Date: Thu, 31 May 2007 12:12:00 -0400
Subject: [ofa-general] Re: [PATCH v2] management/make.dist: Handle .spec
	files differently
In-Reply-To: <1180627536.7116.166222.camel@hal.voltaire.com>
References: <1180627536.7116.166222.camel@hal.voltaire.com>
Message-ID: <1180627920.4120.48.camel@firewall.xsintricity.com>

On Thu, 2007-05-31 at 12:05 -0400, Hal Rosenstock wrote:
> management/make.dist: Handle .spec files differently
> 
> No longer commit .spec files and remove them after tarball is created.
> 
> Signed-off-by: Hal Rosenstock <halr at voltaire.com>

Ack-by: Doug Ledford <dledford at redhat.com>

> diff --git a/make.dist b/make.dist
> index 99481de..20d9ca4 100755
> --- a/make.dist
> +++ b/make.dist
> @@ -24,10 +24,10 @@ echo "code as released before even if yo
>  echo "around."
>  echo
>  echo "	As part of this process, the script will parse the <target>.spec.in"
> -echo "file and output a <target>.spec file and check that into the git repo"
> -echo "so it is included in the tag.  Since this script isn't smart enough"
> -echo "to deal with other random changes that should have their own checkin,"
> -echo "the script will refuse to run if the current repo state is not clean."
> +echo "file and output a <target>.spec file.  Since this script isn't smart"
> +echo "enough to deal with other random changes that should have their own" 
> +echo "checkin the script will refuse to run if the current repo state is not"
> +echo "clean."
>  echo
>  echo "	NOTE: the script has no clue if you are tagging on the right branch,"
>  echo "it will however show you the git branch output so you can confirm it"
> @@ -115,11 +115,10 @@ for target in $TARGETS; do
>  done
>  
>  if [ $1 = release ]; then
> -	echo "Checking in modified spec files and tagging release."
> +	echo "Removing modified spec files and tagging release."
>  	for target in $TARGETS; do
> -		git add $target/$target.spec
> +		rm -rf $target/$target.spec
>  	done
> -	git commit -m "Automatic check-in of target spec files after version processing"
>  	for target in $TARGETS; do
>  		VERSION=`grep "AC_INIT.*$target" $target/configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`
>  		if [ ! -z "$2" ]; then
> 
> 
-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From mshefty at ichips.intel.com  Thu May 31 09:28:24 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 31 May 2007 09:28:24 -0700
Subject: [ofa-general] Re: [RFC] [PATCH 2/2] for 2.6.23: ib/sa - add local
	path record caching
In-Reply-To: <20070531055251.GD11669@mellanox.co.il>
References: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com>	<000b01c7a2fb$6ebb05f0$3c98070a@amr.corp.intel.com>
	<20070531055251.GD11669@mellanox.co.il>
Message-ID: <465EF7A8.2010909@ichips.intel.com>

Michael S. Tsirkin wrote:
> It seems that below you try to get 0x7F paths to each dest:

This is the maximum number that a PR can request.  Note that you only 
get that many if that many exist.  I would expect most subnets to only 
have a couple of paths between each destination.

> But here you seem to bypass cache for multi-path queries:
> 
> +int ib_sa_path_rec_get(struct ib_sa_client *client,
> +		       struct ib_device *device, u8 port_num,
> +		       struct ib_sa_path_rec *rec,
> +		       ib_sa_comp_mask comp_mask,
> +		       int timeout_ms, gfp_t gfp_mask,
> +		       void (*callback)(int status,
> +					struct ib_sa_path_rec *resp,
> +					void *context),
> +		       void *context,
> +		       struct ib_sa_query **sa_query)


This is the existing ib_sa API, which only returns one path.  You could 
change the behavior to return an array of paths, but I did not do that 
at this time.

> +{
> +	struct sa_path_request *req;
> +	struct ib_sa_attr_iter iter;
> +	struct ib_sa_path_rec *path_rec;
> +	int ret;
> +
> +	if (!paths_per_dest)
> +		goto query_sa;
> +
> +	if (!(comp_mask & IB_SA_PATH_REC_DGID) ||
> +	    !(comp_mask & IB_SA_PATH_REC_NUMB_PATH) || rec->numb_path != 1)
> +		goto query_sa;
> 
> how are multiple paths used?

The cache returns paths using one of two algorithms.  Paths are either 
returned in a round robin fashion or randomly.  See further down in this 
same function:

+	if (lookup_method == SA_DB_LOOKUP_RANDOM)
+		path_rec = get_random_path(&iter, rec, comp_mask);
+	else
+		path_rec = get_next_path(&iter, rec, comp_mask);


The check for rec->numb_path != 1 should probably return a failure, 
since neither the API nor the underlying sa_query code supports it.
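
As an illustration of the two lookup methods, here is a minimal,
self-contained sketch in userspace C.  It is not the code from the
patch: struct dest_paths, pick_path() and the LOOKUP_* names are made up
for the example, and rand() is only a stand-in for whatever random
source the kernel code uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the structures from the patch. */
struct path_entry {
	unsigned short dlid;		/* destination LID of this cached path */
};

struct dest_paths {
	const struct path_entry *paths;	/* cached paths to one destination */
	size_t num_paths;		/* number of paths actually cached */
	size_t next;			/* round-robin cursor */
};

enum lookup_method { LOOKUP_ROUND_ROBIN, LOOKUP_RANDOM };

/* Return one cached path, or NULL if nothing is cached for this destination. */
const struct path_entry *pick_path(struct dest_paths *dest,
				   enum lookup_method method)
{
	size_t idx;

	if (dest->num_paths == 0)
		return NULL;

	if (method == LOOKUP_RANDOM) {
		/* Any cached path may be chosen on any call. */
		idx = (size_t)rand() % dest->num_paths;
	} else {
		/* Hand out paths in order and wrap around at the end. */
		idx = dest->next;
		dest->next = (dest->next + 1) % dest->num_paths;
	}
	assert(idx < dest->num_paths);
	return &dest->paths[idx];
}
```

Either way the caller still gets exactly one path per request, which
matches the single-path ib_sa API described above.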

- Sean


From mshefty at ichips.intel.com  Thu May 31 10:16:47 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 31 May 2007 10:16:47 -0700
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com>
References: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com>	
	<000801c7a2f8$55749000$3c98070a@amr.corp.intel.com>
	<309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com>
Message-ID: <465F02FF.60401@ichips.intel.com>

> Do you have some pointer/doc related to the design of current SA_CACHE
> module....It will make things faster to understand........if not then
> I will require your support to understand the things, Though I have
> some top level view.

I don't have any design docs.  But I will happily answer any questions.

> Ok you mean Its not required to create a separate device interface in
> cache module as such. I think this is a good idea......Just for
> confirmation...whether /dev/umad is active on each node other than SM
> node?

After giving this more thought, I like this approach.  If we use a 
vendor/application specific MAD class that used the SA class as a 
template, we can begin creating a distributed SA.

I haven't worked through details, but as an example, to load a PR into 
the local SA, you could send it a 'Set' MAD with a PR in the data 
portion.  To load multiple paths, we could add a 'SetTable' method.  To 
remove a path, we would send a 'Delete' MAD.

Whether or not the MADs are sent from the local node, or some other node 
wouldn't matter.  We can use this mechanism to pre-load the cache or 
simply push updates to it.
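
To make the idea a bit more concrete, here is a rough userspace sketch
of how a local cache might apply such MADs.  Everything in it is an
assumption for illustration: the CACHE_* method codes, the reduced
struct path_rec and the fixed-size cache are invented here and do not
come from the spec or from the existing code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 8

/* Hypothetical method codes for the new class; not values from any spec. */
enum cache_method { CACHE_SET, CACHE_SET_TABLE, CACHE_DELETE };

/* Reduced path record: just enough fields to identify a path. */
struct path_rec {
	unsigned char dgid[16];
	unsigned short dlid;
};

struct path_cache {
	struct path_rec slot[CACHE_SLOTS];
	int used[CACHE_SLOTS];
};

/* Return the slot index holding 'pr', or -1 if it is not cached. */
int cache_find(const struct path_cache *c, const struct path_rec *pr)
{
	for (int i = 0; i < CACHE_SLOTS; i++)
		if (c->used[i] &&
		    !memcmp(c->slot[i].dgid, pr->dgid, sizeof pr->dgid) &&
		    c->slot[i].dlid == pr->dlid)
			return i;
	return -1;
}

/* Store 'pr' in the first free slot; 0 on success, -1 if the cache is full. */
int cache_set(struct path_cache *c, const struct path_rec *pr)
{
	if (cache_find(c, pr) >= 0)
		return 0;		/* already cached */
	for (int i = 0; i < CACHE_SLOTS; i++) {
		if (!c->used[i]) {
			c->slot[i] = *pr;
			c->used[i] = 1;
			return 0;
		}
	}
	return -1;
}

/* Apply one received MAD: 'recs' is its data portion, 'n' the record count. */
int cache_handle(struct path_cache *c, enum cache_method method,
		 const struct path_rec *recs, size_t n)
{
	int idx;

	assert(c != NULL && recs != NULL);

	switch (method) {
	case CACHE_SET:			/* load a single path */
		return n == 1 ? cache_set(c, &recs[0]) : -1;
	case CACHE_SET_TABLE:		/* load several paths at once */
		for (size_t i = 0; i < n; i++)
			if (cache_set(c, &recs[i]))
				return -1;
		return 0;
	case CACHE_DELETE:		/* remove a single path */
		if (n != 1)
			return -1;
		idx = cache_find(c, &recs[0]);
		if (idx < 0)
			return -1;
		c->used[idx] = 0;
		return 0;
	}
	return -1;
}
```

Note that the handler never looks at where the MAD came from, so the
same code path would serve both pre-loading and pushed updates.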

- Sean


From halr at voltaire.com  Thu May 31 10:22:15 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 13:22:15 -0400
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <465F02FF.60401@ichips.intel.com>
References: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com>
	<000801c7a2f8$55749000$3c98070a@amr.corp.intel.com>
	<309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com>
	<465F02FF.60401@ichips.intel.com>
Message-ID: <1180632135.7116.170924.camel@hal.voltaire.com>

On Thu, 2007-05-31 at 13:16, Sean Hefty wrote:
> > Do you have some pointer/doc related to the design of current SA_CACHE
> > module....It will make things faster to understand........if not then
> > I will require your support to understand the things, Though I have
> > some top level view.
> 
> I don't have any design docs.  But I will happily answer any questions.
> 
> > Ok you mean Its not required to create a separate device interface in
> > cache module as such. I think this is a good idea......Just for
> > confirmation...whether /dev/umad is active on each node other than SM
> > node?
> 
> After giving this more thought, I like this approach.  If we use a 
> vendor/application specific MAD class that used the SA class as a 
> template, we can begin creating a distributed SA.
> 
> I haven't worked through details, but as an example, to load a PR into 
> the local SA, you could send it a 'Set' MAD with a PR in the data 
> portion.  To load multiple paths, we could add a 'SetTable' method.  To 
> remove a path, we would send a 'Delete' MAD.
> 
> Whether or not the MADs are sent from the local node, or some other node 
> wouldn't matter.

Would there be some sort of weak authorization needed to do this (like
some key of some sort) ?

-- Hal

>   We can use this mechanism to pre-load the cache or 
> simply push updates to it.
> 
> - Sean


From sean.hefty at intel.com  Thu May 31 10:31:31 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 31 May 2007 10:31:31 -0700
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <1180632135.7116.170924.camel@hal.voltaire.com>
Message-ID: <000101c7a3a9$876d8290$ff0da8c0@amr.corp.intel.com>

>Would there be some sort of weak authorization needed to do this (like
>some key of some sort) ?

I was thinking of matching the SA class MAD format, which includes the SM_Key
field.  I wouldn't use the SA class, since we could be defining a new method,
and a different attribute / method map than what's in the spec.  But I would
re-use as much of the SA class design as possible, to avoid re-inventing things.

- Sean


From rdreier at cisco.com  Thu May 31 10:35:08 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 31 May 2007 10:35:08 -0700
Subject: [ofa-general] Re: [PATCH] libibverbs/examples: free invalid pointer
In-Reply-To: <1180614624.7053.14.camel@mtls03> (Eli Cohen's message of "Thu,
	31 May 2007 15:29:54 +0300")
References: <1180614624.7053.14.camel@mtls03>
Message-ID: <adafy5cq4kz.fsf@cisco.com>

Thanks, but I think I fixed this bug in all the pingpong examples (not
just srq_pingpong) at the beginning of May.

 - R.


From halr at voltaire.com  Thu May 31 10:42:15 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 13:42:15 -0400
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <000101c7a3a9$876d8290$ff0da8c0@amr.corp.intel.com>
References: <000101c7a3a9$876d8290$ff0da8c0@amr.corp.intel.com>
Message-ID: <1180633333.7116.172147.camel@hal.voltaire.com>

On Thu, 2007-05-31 at 13:31, Sean Hefty wrote:
> >Would there be some sort of weak authorization needed to do this (like
> >some key of some sort) ?
> 
> I was thinking of matching the SA class MAD format, which includes the SM_Key
> field.  I wouldn't use the SA class, since we could be defining a new method,
> and a different attribute / method map than what's in the spec.  But I would
> re-use as much of the SA class design as possible, to avoid re-inventing things.

You'd need to use a vendor class 2 if you wanted to use RMPP as the SA
does. However, there is some rearranging you would need to do if you
compare the relevant MAD formats.

-- Hal

> - Sean


From mshefty at ichips.intel.com  Thu May 31 10:54:56 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 31 May 2007 10:54:56 -0700
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <1180633333.7116.172147.camel@hal.voltaire.com>
References: <000101c7a3a9$876d8290$ff0da8c0@amr.corp.intel.com>
	<1180633333.7116.172147.camel@hal.voltaire.com>
Message-ID: <465F0BF0.3040002@ichips.intel.com>

> You'd need to use a vendor class 2 if you wanted to use RMPP as the SA
> does. However, there is some rearranging you would need to do if you
> compare the relevant MAD formats.

Vendor class 2 just adds the OUI, correct?  I guess you could either 
move the SA-specific header by 4 bytes, or use a 32-bit key.

It's not clear to me when to use vendor-specific versus 
application-specific, but I think that we could use application-specific 
as well.  The MAD layer just wouldn't know to enable RMPP as easily as 
it does with vendor class 2.

- Sean


From rdreier at cisco.com  Thu May 31 11:00:09 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 31 May 2007 11:00:09 -0700
Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user
In-Reply-To: <20070531074635.GC26309@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 31 May 2007 10:46:35 +0300")
References: <465AF3D3.10205@dev.mellanox.co.il> <adad50k285u.fsf@cisco.com>
	<20070529091246.GF8159@mellanox.co.il> <adamyzmqjl2.fsf@cisco.com>
	<20070531074635.GC26309@mellanox.co.il>
Message-ID: <adabqg0q3fa.fsf@cisco.com>

 > Thanks. I saw you put this in master - can this go into stable branch?
 > This warning is very annoying to people ...

Good point.  I haven't been very good about getting fixes onto the
stable branch.  I just pulled everything from libibverbs master branch
onto the stable branch, since all the fixes looked appropriate.

 > Alternatively, how about rolling libibverbs release so that OFED can use
 > that?

It's a good idea... I'll put out libibverbs 1.1.1 soon.

 - R.


From rdreier at cisco.com  Thu May 31 11:01:56 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 31 May 2007 11:01:56 -0700
Subject: [ofa-general] Re: wmb missing in libmthca?
In-Reply-To: <20070531043447.GA11669@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 31 May 2007 07:34:47 +0300")
References: <20070524114711.GB4585@mellanox.co.il> <adaejkyqbdo.fsf@cisco.com>
	<20070531043447.GA11669@mellanox.co.il>
Message-ID: <ada7iqoq3cb.fsf@cisco.com>

OK, I committed my patch.


From halr at voltaire.com  Thu May 31 11:04:22 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 14:04:22 -0400
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <465F0BF0.3040002@ichips.intel.com>
References: <000101c7a3a9$876d8290$ff0da8c0@amr.corp.intel.com>
	<1180633333.7116.172147.camel@hal.voltaire.com>
	<465F0BF0.3040002@ichips.intel.com>
Message-ID: <1180634660.7116.173529.camel@hal.voltaire.com>

On Thu, 2007-05-31 at 13:54, Sean Hefty wrote:
> > You'd need to use a vendor class 2 if you wanted to use RMPP as the SA
> > does. However, there is some rearranging you would need to do if you
> > compare the relevant MAD formats.
> 
> Vendor class 2 just adds the OUI, correct?

Yes, so there are 4 less bytes available.

> I guess you could either move the SA specific header by 4-bytes,

Yes, everything starting with SM_Key.

>  or use a 32-bit key.

I think that was deemed too weak and why 64 bits was chosen to begin
with.

> It's not clear to me when to use vendor-specific versus application 
> specific, but I think that we could use application specific as well.

Yes, but there would be more work involved for RMPP. There is no way to
know if such a class uses RMPP.

> The mad layer just wouldn't know to enable RMPP as easily as it does 
> with vendor class 2.

Ugh... That's an understatement.

-- Hal

> - Sean


From venkatesh.babu at 3leafnetworks.com  Thu May 31 11:35:06 2007
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Thu, 31 May 2007 11:35:06 -0700
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of
	them claims master
In-Reply-To: <1179878469.16831.42580.camel@hal.voltaire.com>
References: <4652167F.9040709@3leafnetworks.com>	
	<1179785796.15940.27092.camel@hal.voltaire.com>	
	<4652542C.3010400@3leafnetworks.com>	
	<1179805556.15940.47640.camel@hal.voltaire.com>	
	<46528E3C.8090305@3leafnetworks.com>	
	<1179831181.15940.74121.camel@hal.voltaire.com>	
	<4653845C.1090507@3leafnetworks.com>
	<1179878469.16831.42580.camel@hal.voltaire.com>
Message-ID: <465F155A.5030508@3leafnetworks.com>

Hal Rosenstock wrote:

>>This output was captured on node vortex3l-83, the one who runs opensm.
>>Do you want the perfquery output before and after some time interval ?
>>    
>>
>
>I'm interested in VL15 drops to make sure that is not going on.
>  
>

  I am seeing non-zero (0 - 10) VL15 drop counters. What is the 
significance and cause of these errors?
How can I get rid of or correct them?

 VBabu


From halr at voltaire.com  Thu May 31 11:21:40 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 14:21:40 -0400
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of
	them claims master
In-Reply-To: <465F155A.5030508@3leafnetworks.com>
References: <4652167F.9040709@3leafnetworks.com>
	<1179785796.15940.27092.camel@hal.voltaire.com>
	<4652542C.3010400@3leafnetworks.com>
	<1179805556.15940.47640.camel@hal.voltaire.com>
	<46528E3C.8090305@3leafnetworks.com>
	<1179831181.15940.74121.camel@hal.voltaire.com>
	<4653845C.1090507@3leafnetworks.com>
	<1179878469.16831.42580.camel@hal.voltaire.com>
	<465F155A.5030508@3leafnetworks.com>
Message-ID: <1180635698.7116.174586.camel@hal.voltaire.com>

On Thu, 2007-05-31 at 14:35, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
> 
> >>This output was captured on node vortex3l-83, the one who runs opensm.
> >>Do you want the perfquery output before and after some time interval ?
> >>    
> >>
> >
> >I'm interested in VL15 drops to make sure that is not going on.
> >  
> >
> 
>   I am seeing non zero (0 - 10) VL15 drops counter. What is the 
> significance and cause of these errors ?

This means that some VL15 packets arrive at the switch with no available
VL15 buffers so they are dropped. These could be any SM packets (SMInfo
is just one possibility).

> How can I get rid or correct them ?

You would need to contact your switch vendor to see if the VL15
buffering can be reconfigured.

I'm not sure whether this is related to your standby issue.

Are you seeing any other errors on any of the ports ?

-- Hal

>  VBabu


From halr at voltaire.com  Thu May 31 11:31:59 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 14:31:59 -0400
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of
	them claims master
Message-ID: <1180636318.7116.175237.camel@hal.voltaire.com>

On Thu, 2007-05-31 at 14:35, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
> 
> >>This output was captured on node vortex3l-83, the one who runs opensm.
> >>Do you want the perfquery output before and after some time interval ?
> >>    
> >>
> >
> >I'm interested in VL15 drops to make sure that is not going on.
> >  
> >
> 
>   I am seeing non zero (0 - 10) VL15 drops counter. What is the 
> significance and cause of these errors ?

This means that some VL15 packets arrive at the switch with no available
VL15 buffers so they are dropped. These could be any SM packets (SMInfo
is just one possibility).

> How can I get rid or correct them ?

You would need to contact your switch vendor to see if the VL15
buffering can be reconfigured.

I'm not sure whether this is related to your standby issue.

Are you seeing any other errors on any of the ports ?

-- Hal

>  VBabu


From mshefty at ichips.intel.com  Thu May 31 11:41:51 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 31 May 2007 11:41:51 -0700
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <1180634660.7116.173529.camel@hal.voltaire.com>
References: <000101c7a3a9$876d8290$ff0da8c0@amr.corp.intel.com>	
	<1180633333.7116.172147.camel@hal.voltaire.com>	
	<465F0BF0.3040002@ichips.intel.com>
	<1180634660.7116.173529.camel@hal.voltaire.com>
Message-ID: <465F16EF.2060605@ichips.intel.com>

> Yes, so there are 4 less bytes available.

..and we may want to shift everything down another 4 bytes for 
alignment, if that's needed.  It could be very convenient to have the 
exact same layout as the SA MADs.  This would let an app issue a normal 
SA GetTable query, store away the response, then later forward the 
response to the local SA only changing a couple of fields without having 
to re-pack everything.
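
Here is a small C sketch of the byte offsets involved, as I understand
the formats: a 24-byte common MAD header, a 12-byte RMPP header, then
either the SA header (SM_Key, attribute offset, reserved, component
mask) or the vendor class 2 reserved byte plus 3-byte OUI.  The two
vendor2_* layouts are only proposals for discussion, not anything
defined by the spec, and the struct names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAD_SIZE 256

/* Byte arrays only, so the compiler inserts no padding of its own. */
struct sa_mad_layout {
	uint8_t mad_hdr[24];	/* common MAD header */
	uint8_t rmpp_hdr[12];	/* RMPP header */
	uint8_t sm_key[8];
	uint8_t attr_offset[2];
	uint8_t reserved[2];
	uint8_t comp_mask[8];
	uint8_t data[200];
};

/* Hypothetical: SA-specific header placed directly after the OUI. */
struct vendor2_sa_layout {
	uint8_t mad_hdr[24];
	uint8_t rmpp_hdr[12];
	uint8_t reserved_oui;	/* reserved byte preceding the OUI */
	uint8_t oui[3];
	uint8_t sm_key[8];
	uint8_t attr_offset[2];
	uint8_t reserved[2];
	uint8_t comp_mask[8];
	uint8_t data[196];	/* 4 bytes fewer than the SA MAD */
};

/* Hypothetical: 4 more bytes of padding, so every field moves by 8. */
struct vendor2_sa_padded_layout {
	uint8_t mad_hdr[24];
	uint8_t rmpp_hdr[12];
	uint8_t reserved_oui;
	uint8_t oui[3];
	uint8_t pad[4];
	uint8_t sm_key[8];
	uint8_t attr_offset[2];
	uint8_t reserved[2];
	uint8_t comp_mask[8];
	uint8_t data[192];	/* 8 bytes fewer than the SA MAD */
};

/* All three layouts must still fill exactly one 256-byte MAD. */
void check_layouts(void)
{
	assert(sizeof(struct sa_mad_layout) == MAD_SIZE);
	assert(sizeof(struct vendor2_sa_layout) == MAD_SIZE);
	assert(sizeof(struct vendor2_sa_padded_layout) == MAD_SIZE);
}
```

With only the OUI in front, the component mask lands at offset 52, which
is no longer 8-byte aligned; with the extra 4 bytes it lands at 56 and
the data starts at 64.  In both cases the SA header and records keep
their internal layout, so only the start offset changes.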

>> The mad layer just wouldn't know to enable RMPP as easily as it does 
>> with vendor class 2.
> 
> Ugh... That's an understatement.

We could add a check for the local SA class, but I'm hoping there's a 
better way.  What would be nice are vendor/application specific 
extensions to the existing classes... or even a standardized framework 
for supporting a distributed SA.  (Neither of these seem likely anytime 
soon though.)

- Sean


From venkatesh.babu at 3leafnetworks.com  Thu May 31 12:12:22 2007
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Thu, 31 May 2007 12:12:22 -0700
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of
	them claims master
In-Reply-To: <1180636318.7116.175237.camel@hal.voltaire.com>
References: <1180636318.7116.175237.camel@hal.voltaire.com>
Message-ID: <465F1E16.2080309@3leafnetworks.com>


Hal Rosenstock wrote:

>>  I am seeing non zero (0 - 10) VL15 drops counter. What is the 
>>significance and cause of these errors ?
>>    
>>
>
>This means that some VL15 packets arrive at the switch with no available
>VL15 buffers so they are dropped. These could be any SM packets (SMInfo
>is just one possibility).
>
>  
>
>>How can I get rid or correct them ?
>>    
>>
>
>You would need to contact your switch vendor to see if the VL15
>buffering can be reconfigured.
>
>I'm not sure whether or not this is related to your standby issue or
>not.
>  
>
  At least opensm is not working correctly. Even though ibv_devinfo shows 
it as master, it is not responding to the broadcast join operations 
and it doesn't assign LIDs to other nodes.

>Are you seeing any other errors on any of the ports ?
>  
>
  I do see non zero port_xmit_discards error counters on some ports.
 
  Could these errors be caused by bad cables or ports?

 VBabu

>-- Hal
>  
>


From tziporet at mellanox.co.il  Thu May 31 11:54:34 2007
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 31 May 2007 21:54:34 +0300
Subject: [ofa-general] OFED 1.2 rc4 release
In-Reply-To: <43AA3CB3C1BF5A499F5AAD31CA5023AC06624A26@mtlexch01.mtl.com>
References: <43AA3CB3C1BF5A499F5AAD31CA5023AC06624A26@mtlexch01.mtl.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C9015634B7@mtlexch01.mtl.com>

 
Hi, 

OFED 1.2-RC4 is available on
http://www.openfabrics.org/builds/ofed-1.2/ 
File: OFED-1.2-rc4.tgz 
To get BUILD_ID run ofed_info 

Please report any issues in bugzilla https://bugs.openfabrics.org/

Whether the next step is another RC or the official release will be
decided at the coming Monday coordination meeting.

Tziporet & Vlad 

========================================================================
Release information: 

OS support: 
Novell: 
    - SLES 9.0 SP3 
    - SLES10 (and SP1 RC2 partially tested) 
Redhat: 
    - Redhat EL4 up3, up4 and up5 
    - Redhat EL5 
kernel.org: 
    - 2.6.20 
    - 2.6.19 

Note: Fedora C6 and SuSE Pro 10 are not part of the official list. 
We keep the backport patches for these OSes and make sure OFED compiles
and loads properly, but we will not do a full QA cycle.

Systems: 
    * x86_64 
    * x86 
    * ia64 
    * ppc64 

Main changes from OFED-1.2-rc3: 

1. Fixed 23 bugs (see attachment for all bugs fixed)
2. Updated documents - all owners, please review to make sure the docs
for your component are up to date.

Major limitations and known issues: 
-----------------------------------
Bug  Severity  Owner                    Summary
567  blocker   rolandd at cisco.com     RHEL5 ppc64 UD verbs failures
577  critical  ishai at mellanox.co.il  SRP multipath failover too slow (minutes, not seconds)
626  major     monis at voltaire.com    wrong network/broadcast address set by ib-bond script
629  major     monis at voltaire.com    ib-bonding: sometimes slow failover is noticed
541  major     mst at mellanox.co.il    slow failover with IPoIB CM bonding/ipoibtools HA
650  major     pasha at mellanox.co.il  error on install rc3 openmpi with pathscale compiler


See bugzilla for all open issues. 

Tasks that should be completed: 
1. Fix all blocker, critical and major bugs 
2. Complete all documentation (release notes, README, etc.) 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fixed_rc4_bugs.csv
Type: application/octet-stream
Size: 2999 bytes
Desc: fixed_rc4_bugs.csv
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070531/96416320/attachment.obj>

From halr at voltaire.com  Thu May 31 11:57:47 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 14:57:47 -0400
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of
	them claims master
In-Reply-To: <465F1E16.2080309@3leafnetworks.com>
References: <1180636318.7116.175237.camel@hal.voltaire.com>
	<465F1E16.2080309@3leafnetworks.com>
Message-ID: <1180637866.7116.176862.camel@hal.voltaire.com>

On Thu, 2007-05-31 at 15:12, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
> 
> >>  I am seeing non zero (0 - 10) VL15 drops counter. What is the 
> >>significance and cause of these errors ?
> >>    
> >>
> >
> >This means that some VL15 packets arrive at the switch with no available
> >VL15 buffers so they are dropped. These could be any SM packets (SMInfo
> >is just one possibility).
> >
> >  
> >
> >>How can I get rid or correct them ?
> >>    
> >>
> >
> >You would need to contact your switch vendor to see if the VL15
> >buffering can be reconfigured.
> >
> >I'm not sure whether or not this is related to your standby issue or
> >not.
> >  
> >
>   At least opensm is not working correctly. Eventhough ibv_devinfo shows 
> it as master and it is not responding to the broadcast join operations 
> or it doesn't assign LIDs to other nodes.

ibv_devinfo only indicates the SMLID of the last master which claimed
this node. So if there is no real current master...

In this state, there is no master, so no SA queries will be responded to.
Only an SM which was master would respond. So if some local node thinks
the SM is foo, and foo's SM is not in the master state, it will not respond.

This may be an OpenSM issue or might be some lower level issue which
OpenSM is not handling well. I'm not sure which as I cannot recreate
this and am not sure what is going on in your environment.

> >Are you seeing any other errors on any of the ports ?
> >  
> >
>   I do see non zero port_xmit_discards error counters on some ports.
>  
>   Are these errors could be because of the bad cables or ports ?

I would try swapping in known good cables and see what happens.

-- Hal

>  VBabu
> 
> >-- Hal
> >  
> >


From sashak at voltaire.com  Thu May 31 13:45:24 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 31 May 2007 23:45:24 +0300
Subject: [ofa-general] [PATCH] opensm: sminfo self query check
Message-ID: <20070531204524.GX13193@sashak.voltaire.com>


OpenSM can query itself for SMInfo, either because doing so is legal or,
occasionally, because a port moves during the subnet discovery process.
Don't create a remote SM entry in this case, in order to prevent
deadlocks.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_sminfo_rcv.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_sminfo_rcv.c b/opensm/opensm/osm_sminfo_rcv.c
index 776c70b..99a716e 100644
--- a/opensm/opensm/osm_sminfo_rcv.c
+++ b/opensm/opensm/osm_sminfo_rcv.c
@@ -632,6 +632,15 @@ __osm_sminfo_rcv_process_get_response(
     goto Exit;
   }
 
+  if( port_guid == p_rcv->p_subn->sm_port_guid )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_VERBOSE,
+             "__osm_sminfo_rcv_process_get_response: "
+             "Self query response received - SM port 0x%016" PRIx64 "\n",
+             cl_ntoh64( port_guid ) );
+    goto Exit;
+  }
+
   p_sm = (osm_remote_sm_t*)cl_qmap_get( p_sm_tbl, port_guid );
   if( p_sm == (osm_remote_sm_t*)cl_qmap_end( p_sm_tbl ) )
   {
-- 
1.5.2.171.gf509
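
For readers following along without the OpenSM tree, here is a minimal
standalone sketch of the guard the patch above adds: a response whose
port GUID equals the SM's own port GUID is ignored instead of being
recorded as a remote SM. Everything in it (struct remote_sm_table,
record_sminfo_response(), the fixed-size array) is a hypothetical
stand-in for illustration; the real code uses cl_qmap and
osm_remote_sm_t as shown in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for OpenSM's remote SM table. */
#define MAX_REMOTE_SMS 8

struct remote_sm_table {
	uint64_t guids[MAX_REMOTE_SMS];
	size_t count;
};

/*
 * Record the SM that answered an SMInfo query.
 * Returns 1 if an entry was added, 0 if the response was ignored
 * (self query, already known, or table full).
 */
static int record_sminfo_response(struct remote_sm_table *tbl,
				  uint64_t own_port_guid,
				  uint64_t responder_guid)
{
	size_t i;

	assert(tbl != NULL);

	/* Same idea as the patch: our own response is not a remote SM. */
	if (responder_guid == own_port_guid)
		return 0;

	for (i = 0; i < tbl->count; i++)
		if (tbl->guids[i] == responder_guid)
			return 0;	/* already known */

	if (tbl->count == MAX_REMOTE_SMS)
		return 0;

	tbl->guids[tbl->count++] = responder_guid;
	return 1;
}
```

With this shape, a self query leaves the table untouched, while the
first response from any other port GUID adds exactly one entry.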


From sashak at voltaire.com  Thu May 31 13:47:05 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 31 May 2007 23:47:05 +0300
Subject: [ofa-general] [PATCH] opensm: cleanup discovery count functions
Message-ID: <20070531204705.GY13193@sashak.voltaire.com>


This removes discovery count functions for osm_port_t and osm_node_t
and makes discovery_count handling similar to osm_switch_t.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_node.h  |   89 ------------------------------------
 opensm/include/opensm/osm_port.h  |   90 -------------------------------------
 opensm/opensm/osm_drop_mgr.c      |   10 ++--
 opensm/opensm/osm_node_info_rcv.c |    8 ++--
 opensm/opensm/osm_port_info_rcv.c |    2 +-
 opensm/opensm/osm_state_mgr.c     |    4 +-
 6 files changed, 12 insertions(+), 191 deletions(-)

diff --git a/opensm/include/opensm/osm_node.h b/opensm/include/opensm/osm_node.h
index a841de7..b2d03a2 100644
--- a/opensm/include/opensm/osm_node.h
+++ b/opensm/include/opensm/osm_node.h
@@ -591,95 +591,6 @@ osm_node_init_physp(
 *	Node object, Physical Port object.
 *********/
 
-/****f* OpenSM: Node/osm_node_discovery_count_get
-* NAME
-*	osm_node_discovery_count_get
-*
-* DESCRIPTION
-*	Returns a pointer to the physical port object at the
-*	specified local port number.
-*
-* SYNOPSIS
-*/
-static inline uint32_t
-osm_node_discovery_count_get(
-	IN const osm_node_t* const p_node )
-{
-	return( p_node->discovery_count );
-}
-/*
-* PARAMETERS
-*	p_node
-*		[in] Pointer to an osm_node_t object.
-*
-* RETURN VALUES
-*	Returns the discovery count for this node.
-*
-* NOTES
-*
-* SEE ALSO
-*	Node object
-*********/
-
-/****f* OpenSM: Node/osm_node_discovery_count_reset
-* NAME
-*	osm_node_discovery_count_reset
-*
-* DESCRIPTION
-*	Resets the discovery count for this node to zero.
-*	This operation should be performed at the start of a sweep.
-*
-* SYNOPSIS
-*/
-static inline void
-osm_node_discovery_count_reset(
-	IN osm_node_t* const p_node )
-{
-	p_node->discovery_count = 0;
-}
-/*
-* PARAMETERS
-*	p_node
-*		[in] Pointer to an osm_node_t object.
-*
-* RETURN VALUES
-*	None.
-*
-* NOTES
-*
-* SEE ALSO
-*	Node object
-*********/
-
-/****f* OpenSM: Node/osm_node_discovery_count_inc
-* NAME
-*	osm_node_discovery_count_inc
-*
-* DESCRIPTION
-*	Increments the discovery count for this node.
-*
-* SYNOPSIS
-*/
-static inline void
-osm_node_discovery_count_inc(
-	IN osm_node_t* const p_node )
-{
-	p_node->discovery_count++;
-}
-/*
-* PARAMETERS
-*	p_node
-*		[in] Pointer to an osm_node_t object.
-*
-* RETURN VALUES
-*	None.
-*
-* NOTES
-*
-* SEE ALSO
-*	Node object
-*********/
-
 /****f* OpenSM: Node/osm_node_get_node_guid
 * NAME
 *	osm_node_get_node_guid
diff --git a/opensm/include/opensm/osm_port.h b/opensm/include/opensm/osm_port.h
index df9065e..54ebcfc 100644
--- a/opensm/include/opensm/osm_port.h
+++ b/opensm/include/opensm/osm_port.h
@@ -1556,96 +1556,6 @@ osm_port_add_new_physp(
 *	Port
 *********/
 
-/****f* OpenSM: Port/osm_port_discovery_count_reset
-* NAME
-*	osm_port_discovery_count_reset
-*
-* DESCRIPTION
-*	Resets the discovery count for this Port to zero.
-*	This operation should be performed at the start of a sweep.
-*
-* SYNOPSIS
-*/
-static inline void
-osm_port_discovery_count_reset(
-	IN osm_port_t* const p_port )
-{
-	p_port->discovery_count = 0;
-}
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to an osm_port_t object.
-*
-* RETURN VALUES
-*	None.
-*
-* NOTES
-*
-* SEE ALSO
-*	Port object
-*********/
-
-/****f* OpenSM: Port/osm_port_discovery_count_get
-* NAME
-*	osm_port_discovery_count_get
-*
-* DESCRIPTION
-*	Returns the number of times this port has been discovered
-*	since the last time the discovery count was reset.
-*
-* SYNOPSIS
-*/
-static inline uint32_t
-osm_port_discovery_count_get(
-	IN const osm_port_t* const p_port )
-{
-	return( p_port->discovery_count );
-}
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to an osm_port_t object.
-*
-* RETURN VALUES
-*	Returns the number of times this port has been discovered
-*	since the last time the discovery count was reset.
-*
-* NOTES
-*
-* SEE ALSO
-*	Port object
-*********/
-
-/****f* OpenSM: Port/osm_port_discovery_count_inc
-* NAME
-*	osm_port_discovery_count_inc
-*
-* DESCRIPTION
-*	Increments the discovery count for this Port.
-*
-* SYNOPSIS
-*/
-static inline void
-osm_port_discovery_count_inc(
-	IN osm_port_t* const p_port )
-{
-	p_port->discovery_count++;
-}
-/*
-* PARAMETERS
-*	p_port
-*		[in] Pointer to an osm_port_t object.
-*
-* RETURN VALUES
-*	None.
-*
-* NOTES
-*
-* SEE ALSO
-*	Port object
-*********/
-
 /****f* OpenSM: Port/osm_port_add_mgrp
 * NAME
 *	osm_port_add_mgrp
diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c
index 7689728..9d91b6b 100644
--- a/opensm/opensm/osm_drop_mgr.c
+++ b/opensm/opensm/osm_drop_mgr.c
@@ -161,7 +161,7 @@ drop_mgr_clean_physp(
          the remote port was recognized, and its state is ACTIVE.
          If this is just a "hiccup" - force a heavy sweep in the next sweep.
          We don't want to lose that part of the subnet. */
-      if (osm_port_discovery_count_get( p_remote_port ) &&
+      if (p_remote_port->discovery_count &&
           osm_physp_get_port_state( p_remote_physp ) == IB_LINK_ACTIVE )
       {
         osm_log( p_mgr->p_log, OSM_LOG_VERBOSE,
@@ -179,7 +179,7 @@ drop_mgr_clean_physp(
          discovery count of the remote port. */
       if ( !p_remote_physp->p_node->sw )
       {
-        osm_port_discovery_count_reset( p_remote_port );
+        p_remote_port->discovery_count = 0;
         osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
                  "drop_mgr_clean_physp: Resetting discovery count of node: "
                  "0x%016" PRIx64 " port num:0x%X\n",
@@ -534,7 +534,7 @@ __osm_drop_mgr_check_node(
     goto Exit;
   }
 
-  if ( osm_port_discovery_count_get( p_port ) == 0 )
+  if ( p_port->discovery_count == 0 )
   {
     osm_log( p_mgr->p_log, OSM_LOG_VERBOSE,
              "__osm_drop_mgr_check_node: "
@@ -601,7 +601,7 @@ osm_drop_mgr_process(
       If not, it is unreachable in the current subnet, and
       should therefore be removed from the subnet object.
     */
-    if( osm_node_discovery_count_get( p_node ) == 0 )
+    if( p_node->discovery_count == 0 )
       __osm_drop_mgr_process_node( p_mgr, p_node );
   }
 
@@ -655,7 +655,7 @@ osm_drop_mgr_process(
     /*
       If the port is unreachable, remove it from the guid table.
     */
-    if( osm_port_discovery_count_get( p_port ) == 0 )
+    if( p_port->discovery_count == 0 )
       __osm_drop_mgr_remove_port( p_mgr, p_port );
   }
 
diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index 2486ffb..1eca625 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -641,7 +641,7 @@ __osm_ni_rcv_process_existing_switch(
     if the SwitchInfo mad didn't reach the SM) then we want
     to retry to probe the switch.
   */
-  if( osm_node_discovery_count_get( p_node ) == 1 )
+  if( p_node->discovery_count == 1 )
     __osm_ni_rcv_process_switch( p_rcv, p_node, p_madw );
   else
   {
@@ -862,7 +862,7 @@ __osm_ni_rcv_process_new(
   else
     __osm_ni_rcv_set_links( p_rcv, p_node, port_num, p_ni_context );
 
-  osm_node_discovery_count_inc( p_node );
+  p_node->discovery_count++;
   __osm_ni_rcv_get_node_desc( p_rcv, p_node, p_madw );
 
   switch( p_ni->node_type )
@@ -916,14 +916,14 @@ __osm_ni_rcv_process_existing(
              ib_get_node_type_str(p_ni->node_type),
              cl_ntoh64( p_ni->node_guid ),
              cl_ntoh64( p_smp->trans_id ),
-             osm_node_discovery_count_get( p_node ) );
+             p_node->discovery_count );
   }
 
   /*
     If we haven't already encountered this existing node
     on this particular sweep, then process further.
   */
-  osm_node_discovery_count_inc( p_node );
+  p_node->discovery_count++;
 
   switch( p_ni->node_type )
   {
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index a53044f..7b241d6 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -724,7 +724,7 @@ osm_pi_rcv_process(
   }
   else
   {
-    osm_port_discovery_count_inc( p_port );
+    p_port->discovery_count++;
 
     /*
       This PortInfo arrived because we did a Get() method,
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 73980b8..a9aef36 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -524,7 +524,7 @@ __osm_state_mgr_reset_node_count(
                cl_ntoh64( osm_node_get_node_guid( p_node ) ) );
    }
 
-   osm_node_discovery_count_reset( p_node );
+   p_node->discovery_count = 0;
 }
 
 /**********************************************************************
@@ -545,7 +545,7 @@ __osm_state_mgr_reset_port_count(
                cl_ntoh64( osm_port_get_guid( p_port ) ) );
    }
 
-   osm_port_discovery_count_reset( p_port );
+   p_port->discovery_count = 0;
 }
 
 /**********************************************************************
-- 
1.5.2.171.gf509
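
As a rough illustration of how discovery_count is used once the
accessors are gone, the sketch below walks through one sweep with direct
field access: reset at the start, increment when a node is seen, drop
whatever still has a count of zero at the end. The struct node layout
and the sweep_*() helpers are assumptions made for this example only;
they are not OpenSM functions, and only the discovery_count field name
comes from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node record; only discovery_count mirrors the patch. */
struct node {
	uint64_t guid;
	uint32_t discovery_count;
	int present;	/* 1 while the node is kept in the subnet table */
};

/* Start of a sweep: forget what the previous sweep saw. */
static void sweep_begin(struct node *nodes, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		nodes[i].discovery_count = 0;
}

/* A response for this node arrived during the sweep. */
static void sweep_saw(struct node *node)
{
	assert(node != NULL);
	node->discovery_count++;
}

/* End of a sweep: drop nodes never seen. Returns how many were dropped. */
static size_t sweep_drop_unseen(struct node *nodes, size_t n)
{
	size_t i, dropped = 0;

	for (i = 0; i < n; i++) {
		if (nodes[i].present && nodes[i].discovery_count == 0) {
			nodes[i].present = 0;
			dropped++;
		}
	}
	return dropped;
}
```

A node seen twice in a sweep ends with a count of 2 and is kept; a node
that never answers keeps a count of 0 and is dropped.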


From sashak at voltaire.com  Thu May 31 14:39:34 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 Jun 2007 00:39:34 +0300
Subject: [ofa-general] [PATCH] opensm: less iterations in
	osm_link_mgr_process() loop
Message-ID: <20070531213933.GZ13193@sashak.voltaire.com>


Instead of looping over endports in order to get node's physical
ports list (and to repeat scanning), just use nodes. IOW - replace
__osm_link_mgr_process_port() by __osm_link_mgr_process_node().

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_link_mgr.c |   34 +++++++++++++++++-----------------
 1 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c
index 73bebce..640ed38 100644
--- a/opensm/opensm/osm_link_mgr.c
+++ b/opensm/opensm/osm_link_mgr.c
@@ -399,9 +399,9 @@ __osm_link_mgr_set_physp_pi(
 /**********************************************************************
  **********************************************************************/
 static osm_signal_t
-__osm_link_mgr_process_port(
+__osm_link_mgr_process_node(
   IN osm_link_mgr_t* const p_mgr,
-  IN osm_port_t* const p_port,
+  IN osm_node_t* const p_node,
   IN const uint8_t link_state )
 {
   uint32_t i;
@@ -410,14 +410,14 @@ __osm_link_mgr_process_port(
   uint8_t current_state;
   osm_signal_t signal = OSM_SIGNAL_DONE;
 
-  OSM_LOG_ENTER( p_mgr->p_log, __osm_link_mgr_process_port );
+  OSM_LOG_ENTER( p_mgr->p_log, __osm_link_mgr_process_node );
 
   if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
   {
     osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
-             "__osm_link_mgr_process_port: "
-             "Port 0x%" PRIx64 " going to %s\n",
-             cl_ntoh64( osm_port_get_guid( p_port ) ),
+             "__osm_link_mgr_process_node: "
+             "Node 0x%" PRIx64 " going to %s\n",
+             cl_ntoh64( osm_node_get_node_guid( p_node ) ),
              ib_get_port_state_str( link_state ) );
   }
 
@@ -426,7 +426,7 @@ __osm_link_mgr_process_port(
     with this Port.  Start iterating with port 1, since the linkstate
     is not applicable to the management port on switches.
   */
-  num_physp = osm_node_get_num_physp( p_port->p_node );
+  num_physp = osm_node_get_num_physp( p_node );
   for( i = 0; i < num_physp; i ++ )
   {
     /*
@@ -434,7 +434,7 @@ __osm_link_mgr_process_port(
       or if the state of the port is already better then the
       specified state.
     */
-    p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)i );
+    p_physp = osm_node_get_physp_ptr( p_node, (uint8_t)i );
     if( osm_physp_is_valid( p_physp ) )
     {
       current_state = osm_physp_get_port_state( p_physp );
@@ -464,9 +464,9 @@ __osm_link_mgr_process_port(
         if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
         {
           osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
-                   "__osm_link_mgr_process_port: "
+                   "__osm_link_mgr_process_node: "
                    "Physical port 0x%X already %s. Skipping\n",
-                   osm_physp_get_port_num( p_physp ),
+                   p_physp->port_num,
                    ib_get_port_state_str( current_state ) );
         }
       }
@@ -484,21 +484,21 @@ osm_link_mgr_process(
   IN osm_link_mgr_t* const p_mgr,
   IN const uint8_t link_state )
 {
-  cl_qmap_t *p_port_guid_tbl;
-  osm_port_t *p_port;
+  cl_qmap_t *p_node_guid_tbl;
+  osm_node_t *p_node;
   osm_signal_t signal = OSM_SIGNAL_DONE;
 
   OSM_LOG_ENTER( p_mgr->p_log, osm_link_mgr_process );
 
-  p_port_guid_tbl = &p_mgr->p_subn->port_guid_tbl;
+  p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl;
 
   CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock );
 
-  for( p_port = (osm_port_t*)cl_qmap_head( p_port_guid_tbl );
-       p_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl );
-       p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ) )
+  for( p_node = (osm_node_t*)cl_qmap_head( p_node_guid_tbl );
+       p_node != (osm_node_t*)cl_qmap_end( p_node_guid_tbl );
+       p_node = (osm_node_t*)cl_qmap_next( &p_node->map_item ) )
   {
-    if( __osm_link_mgr_process_port( p_mgr, p_port, link_state ) ==
+    if( __osm_link_mgr_process_node( p_mgr, p_node, link_state ) ==
         OSM_SIGNAL_DONE_PENDING )
       signal = OSM_SIGNAL_DONE_PENDING;
   }
-- 
1.5.2.171.gf509
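
To make the saving concrete, here is a small standalone sketch that
counts physical-port visits under both loop shapes. It assumes, as the
commit message suggests, that the per-port version scans all physical
ports of the owning node for every end port. The struct definitions and
the visits_by_*() helpers are hypothetical and exist only for this
example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down node and end-port records. */
struct node { unsigned num_physp; };
struct port { const struct node *node; };

/* Old shape: for each end port, scan all physical ports of its node. */
static unsigned visits_by_port(const struct port *ports, size_t n_ports)
{
	unsigned visits = 0;
	size_t i;

	for (i = 0; i < n_ports; i++) {
		assert(ports[i].node != NULL);
		visits += ports[i].node->num_physp;
	}
	return visits;
}

/* New shape: scan each node's physical ports exactly once. */
static unsigned visits_by_node(const struct node *nodes, size_t n_nodes)
{
	unsigned visits = 0;
	size_t i;

	for (i = 0; i < n_nodes; i++)
		visits += nodes[i].num_physp;
	return visits;
}
```

For one node with two physical ports and two end ports plus one
single-port node, the per-port loop makes 5 visits where the per-node
loop makes 3.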


From sashak at voltaire.com  Thu May 31 15:33:41 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 Jun 2007 01:33:41 +0300
Subject: [ofa-general] [PATCH] opensm/sminfo: mutex cleanup fix
In-Reply-To: <20070531204524.GX13193@sashak.voltaire.com>
References: <20070531204524.GX13193@sashak.voltaire.com>
Message-ID: <20070531223341.GA23029@sashak.voltaire.com>


This fixes mutex cleanups in SMInfo processor.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_sminfo_rcv.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/opensm/opensm/osm_sminfo_rcv.c b/opensm/opensm/osm_sminfo_rcv.c
index 99a716e..b26b6bf 100644
--- a/opensm/opensm/osm_sminfo_rcv.c
+++ b/opensm/opensm/osm_sminfo_rcv.c
@@ -617,7 +617,7 @@ __osm_sminfo_rcv_process_get_response(
     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
              "__osm_sminfo_rcv_process_get_response: ERR 2F12: "
              "No port object for this SM\n" );
-    goto Exit;
+    goto _unlock_and_exit;
   }
 
   if( osm_port_get_guid( p_port ) != p_smi->guid )
@@ -629,7 +629,7 @@ __osm_sminfo_rcv_process_get_response(
              ", Received 0x%016" PRIx64 "\n",
              cl_ntoh64( osm_port_get_guid( p_port ) ),
              cl_ntoh64( p_smi->guid ) );
-    goto Exit;
+    goto _unlock_and_exit;
   }
 
   if( port_guid == p_rcv->p_subn->sm_port_guid )
@@ -638,7 +638,7 @@ __osm_sminfo_rcv_process_get_response(
              "__osm_sminfo_rcv_process_get_response: "
              "Self query response received - SM port 0x%016" PRIx64 "\n",
              cl_ntoh64( port_guid ) );
-    goto Exit;
+    goto _unlock_and_exit;
   }
 
   p_sm = (osm_remote_sm_t*)cl_qmap_get( p_sm_tbl, port_guid );
@@ -650,7 +650,7 @@ __osm_sminfo_rcv_process_get_response(
       osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                "__osm_sminfo_rcv_process_get_response: ERR 2F14: "
                "Unable to allocate SM object\n" );
-      goto Exit;
+      goto _unlock_and_exit;
     }
 
     osm_remote_sm_init( p_sm, p_port, p_smi );
@@ -668,7 +668,7 @@ __osm_sminfo_rcv_process_get_response(
 
   process_get_sm_ret_val = __osm_sminfo_rcv_process_get_sm( p_rcv, p_sm );
 
- Exit:
+ _unlock_and_exit:
   CL_PLOCK_RELEASE( p_rcv->p_lock );
   
   /* If process_get_sm_ret_val != OSM_SIGNAL_NONE then we have to signal
@@ -676,6 +676,8 @@ __osm_sminfo_rcv_process_get_response(
   if (process_get_sm_ret_val != OSM_SIGNAL_NONE)
     osm_state_mgr_process( p_rcv->p_state_mgr,
                            process_get_sm_ret_val );
+
+ Exit:
   OSM_LOG_EXIT( p_rcv->p_log );
 }
 
-- 
1.5.2.171.gf509
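
The diff above appears to split the exit path into two labels so that
the lock is released only on paths that run after it was acquired. A
minimal standalone sketch of that two-label pattern follows. The
toy_lock type, table_update() and the label names are hypothetical and
stand in for cl_plock and the SMInfo receiver; the toy lock only tracks
state and asserts on unbalanced use, so a release without a prior
acquire would trip immediately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy lock standing in for cl_plock_t; it only tracks state. */
struct toy_lock { int held; };

static void toy_acquire(struct toy_lock *l) { assert(!l->held); l->held = 1; }
static void toy_release(struct toy_lock *l) { assert(l->held); l->held = 0; }

struct table {
	struct toy_lock lock;
	int value;
};

/* Returns 0 on success, -1 on error; the lock is never left held. */
static int table_update(struct table *tbl, const int *input)
{
	int ret = -1;

	/* Failure before the lock is taken: skip the release. */
	if (input == NULL)
		goto done;

	toy_acquire(&tbl->lock);

	/* Failure while the lock is held: the release must still run. */
	if (*input < 0)
		goto unlock_and_exit;

	tbl->value = *input;
	ret = 0;

unlock_and_exit:
	toy_release(&tbl->lock);
done:
	return ret;
}
```

Early failures jump past the release, while every path that took the
lock falls through unlock_and_exit, so acquire and release stay
balanced on all three outcomes.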


From halr at voltaire.com  Thu May 31 15:25:27 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 18:25:27 -0400
Subject: [ofa-general] Re: [PATCH] opensm: cleanup discovery count functions
In-Reply-To: <20070531204705.GY13193@sashak.voltaire.com>
References: <20070531204705.GY13193@sashak.voltaire.com>
Message-ID: <1180650323.7116.189578.camel@hal.voltaire.com>

On Thu, 2007-05-31 at 16:47, Sasha Khapyorsky wrote:
> This removes discovery count functions for osm_port_t and osm_node_t
> and makes discovery_count handling similar to osm_switch_t.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From halr at voltaire.com  Thu May 31 15:26:04 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 18:26:04 -0400
Subject: [ofa-general] Re: [PATCH] opensm: less iterations in
	osm_link_mgr_process() loop
In-Reply-To: <20070531213933.GZ13193@sashak.voltaire.com>
References: <20070531213933.GZ13193@sashak.voltaire.com>
Message-ID: <1180650363.7116.189663.camel@hal.voltaire.com>

On Thu, 2007-05-31 at 17:39, Sasha Khapyorsky wrote:
> Instead of looping over endports in order to get node's physical
> ports list (and to repeat scanning), just use nodes. IOW - replace
> __osm_link_mgr_process_port() by __osm_link_mgr_process_node().
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Nice optimization. Thanks. Applied.

-- Hal


From halr at voltaire.com  Thu May 31 15:26:22 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 May 2007 18:26:22 -0400
Subject: [ofa-general] Re: [PATCH] opensm: sminfo self query check
In-Reply-To: <20070531204524.GX13193@sashak.voltaire.com>
References: <20070531204524.GX13193@sashak.voltaire.com>
Message-ID: <1180650368.7116.189665.camel@hal.voltaire.com>

On Thu, 2007-05-31 at 16:45, Sasha Khapyorsky wrote:
> OpenSM can query itself for SMInfo because it is just legal, or
> occasionally due to port moving during subnet discovery process.
> Don't create remote SM entry in this case in order to prevent
> deadlocks.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Good catch.

Thanks. Applied.

-- Hal


From pradeeps at linux.vnet.ibm.com  Thu May 31 16:37:23 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Thu, 31 May 2007 16:37:23 -0700
Subject: [ofa-general] ipoib drain cq question
Message-ID: <465F5C33.1050202@linux.vnet.ibm.com>

ipoib_cm_start_rx_drain() posts a DRAIN_WRID to be sent.

So, why does ipoib_cm_handle_rx_wc() have to call
ipoib_cm_start_rx_drain() upon receipt of the WC?

Pradeep

