From dotanb at dev.mellanox.co.il Tue May 1 00:07:18 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 01 May 2007 10:07:18 +0300
Subject: [ofa-general] Re: [ewg] APM Example
In-Reply-To:
References: <87aa148d0704261534s67c37c7amdcd01d5f664a768b@mail.gmail.com> <87aa148d0704261549l64d0c9cfy7d29eddcfd89561c@mail.gmail.com> <87aa148d0704280711x9244561pf0abef8100a53887@mail.gmail.com>
Message-ID: <4636E726.6010804@dev.mellanox.co.il>

Roland Dreier wrote:
> > > You don't know the time that the transition occurred, except that it
> > > is between when you called modify QP and when it returned. But an
> > > asynchronous event doesn't really help, does it?
>
> > It does help. APM is not only defined for network fault tolerance, it can
> > also be used for load-balancing. With this event, one can know when
> > the path is loaded and it is safe to call modify_qp.
>
> I guess I don't really understand how you're using this event. What
> advantage is there in getting an async event at some unknown time
> (maybe before the modify QP operation returns, maybe after)? What
> does it let you do that you can't do with the verbs architecture as
> defined strictly by the verbs?
>

Roland is right, this event wasn't defined in the IB spec.
If you wish to know when it is safe to call the modify_qp verb, you can call query_qp and check the path_mig_state: if the state is ARMED, it is safe to use the alternate path.

Dotan

From dotanb at dev.mellanox.co.il Tue May 1 00:09:31 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 01 May 2007 10:09:31 +0300
Subject: [ofa-general] OFED 1.2 RC2 <-> WinIB 1.3
In-Reply-To: <46361E15.1050006@hp.com>
References: <4631D27B.10301@holografika.com> <4634AD01.5010909@dev.mellanox.co.il> <46361E15.1050006@hp.com>
Message-ID: <4636E7AB.4030908@dev.mellanox.co.il>

Rick Jones wrote:
> Dotan Barak wrote:
>> Hi Peter.
>>
>> Kovacs Peter Tamas wrote:
>>
>>> Dear all,
>>>
>>> I've tried to do some speed tests between a Linux and a Windows box
>>> using InfiniBand.
>>> I've installed OFED 1.2 RC2 to a Fedora Core 6 x64 box, and
>>> connected it to a Windows XP x64 box with WinIB 1.3, both machines
>>> having a Mellanox MHES-14XTC.
>>
>>
>> As far as I know, the performance tests in Windows and in OFED cannot
>> work together (even if they have the same name).
>
> I wonder if the SDP_mumble tests in netperf top of trunk would work?

Any test that works between two different OSes (for example, Windows and Linux) over Ethernet should also work over SDP (or IPoIB), because they are wire protocols.

Dotan

From jwong at datallegro.com Tue May 1 00:06:18 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Tue, 1 May 2007 03:06:18 -0400
Subject: [ofa-general] RE: Trouble installing OFED1.2 with kernel
References: <20070501040329.GK13293@mellanox.co.il> <20070501064802.GM13293@mellanox.co.il>
Message-ID:

I have downloaded the kernel src from http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.18.8.tar.gz
I have gunzipped and untarred it.

Below is the definition of struct inode from linux-2.6.18.8/include/linux/fs.h. The i_private pointer does not exist in it. Please advise.

Thanks,

Jeff

struct inode {
	struct hlist_node	i_hash;
	struct list_head	i_list;
	struct list_head	i_sb_list;
	struct list_head	i_dentry;
	unsigned long		i_ino;
	atomic_t		i_count;
	umode_t			i_mode;
	unsigned int		i_nlink;
	uid_t			i_uid;
	gid_t			i_gid;
	dev_t			i_rdev;
	loff_t			i_size;
	struct timespec		i_atime;
	struct timespec		i_mtime;
	struct timespec		i_ctime;
	unsigned int		i_blkbits;
	unsigned long		i_blksize;
	unsigned long		i_version;
	blkcnt_t		i_blocks;
	unsigned short		i_bytes;
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	struct mutex		i_mutex;
	struct rw_semaphore	i_alloc_sem;
	struct inode_operations	*i_op;
	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
	struct super_block	*i_sb;
	struct file_lock	*i_flock;
	struct address_space	*i_mapping;
	struct address_space	i_data;
#ifdef CONFIG_QUOTA
	struct dquot		*i_dquot[MAXQUOTAS];
#endif
	/* These three should probably be a union */
	struct list_head	i_devices;
	struct pipe_inode_info	*i_pipe;
	struct block_device	*i_bdev;
	struct cdev		*i_cdev;
	int			i_cindex;

	__u32			i_generation;

#ifdef CONFIG_DNOTIFY
	unsigned long		i_dnotify_mask;	/* Directory notify events */
	struct dnotify_struct	*i_dnotify;	/* for directory notifications */
#endif

#ifdef CONFIG_INOTIFY
	struct list_head	inotify_watches;	/* watches on this inode */
	struct mutex		inotify_mutex;	/* protects the watches list */
#endif

	unsigned long		i_state;
	unsigned long		dirtied_when;	/* jiffies of first dirtying */

	unsigned int		i_flags;

	atomic_t		i_writecount;
	void			*i_security;
	union {
		void		*generic_ip;
	} u;
#ifdef __NEED_I_SIZE_ORDERED
	seqcount_t		i_size_seqcount;
#endif
};

-----Original Message-----
From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il]
Sent: Tue 5/1/2007 2:48 AM
To: Jeffrey Wong
Cc: Michael S. Tsirkin; general at lists.openfabrics.org
Subject: Re: Trouble installing OFED1.2 with kernel

I don't think you are actually using the kernel from kernel.org: we test-build these nightly.
Quoting Jeffrey Wong : Subject: RE: Trouble installing OFED1.2 with kernel Well when I try to compile I get an error message saying i_private is not a member of the inode structure when trying to compile the ulp/iboip and the ib_ipath modules. I'm using the 2.6.18-8 kernel src from kernel.org. Any reasons why I would be getting this error message? Thanks, Jeff -----Original Message----- From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] Sent: Tue 5/1/2007 12:03 AM To: Jeffrey Wong Cc: general at lists.openfabrics.org Subject: Re: Trouble installing OFED1.2 with kernel > Quoting Jeffrey Wong : > Subject: Re: Trouble installing OFED1.2 with kernel > > Is there a workaround for the i_private member of the inode structure either in > the kernel or in the OFED 1.2 software? > > I want to be able to compile the ipoib drivers and I cannot with the error > i_private not being a member of inode struct. > > What does the ulp/ipoib do? > > I want to be able to test out the ipverbs library and ipoib library to compare > performance. > > > > Thanks. > > > > Jeff OFED 1.2 supports the RHEL5 kernel. Shouldn't the Centos kernel be identical? -- MST -- MST -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Tue May 1 00:22:12 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 1 May 2007 10:22:12 +0300 Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel In-Reply-To: References: <20070501040329.GK13293@mellanox.co.il> <20070501064802.GM13293@mellanox.co.il> Message-ID: <20070501072212.GR13293@mellanox.co.il> > Quoting Jeffrey Wong : > Subject: RE: Trouble installing OFED1.2 with kernel > > I have downloaded the kernel src from > http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.18.8.tar.gz > I have gunzip and untarred the directory. > > In the file linux-2.6.18.8/include/linux/fs.h. Here is the structure > definition of inode. When I look below the i_private ptr does not exist. 
> Please advise.

Yes but that kernel would be named 2.6.18.8, not 2.6.18-8.el5.

--
MST

From vlad at dev.mellanox.co.il Tue May 1 00:44:44 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 01 May 2007 10:44:44 +0300
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
In-Reply-To: <20070501072212.GR13293@mellanox.co.il>
References: <20070501040329.GK13293@mellanox.co.il> <20070501064802.GM13293@mellanox.co.il> <20070501072212.GR13293@mellanox.co.il>
Message-ID: <1178005484.7789.4.camel@vladsk-laptop>

On Tue, 2007-05-01 at 10:22 +0300, Michael S. Tsirkin wrote:
> > Quoting Jeffrey Wong :
> > Subject: RE: Trouble installing OFED1.2 with kernel
> >
> > I have downloaded the kernel src from
> > http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.18.8.tar.gz
> > I have gunzip and untarred the directory.
> >
> > In the file linux-2.6.18.8/include/linux/fs.h. Here is the structure
> > definition of inode. When I look below the i_private ptr does not exist.
> > Please advise.
>
> Yes but that kernel would be named 2.6.18.8, not 2.6.18-8.el5.
>
>

Jeff,
If you named the kernel from kernel.org in the 2.6.18-*el5* manner, then the backport patches for RedHat 5.0 will be applied by OFED-1.2. This is the reason for your failures.

So, to fix this, rename your kernel; then you can install OFED-1.2 with ipath and ipoib.

--
Vladimir Sokolovsky
Mellanox Technologies Ltd.

From jwong at datallegro.com Tue May 1 00:59:02 2007
From: jwong at datallegro.com (Jeffrey Wong)
Date: Tue, 1 May 2007 03:59:02 -0400
Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel
References: <20070501040329.GK13293@mellanox.co.il><20070501064802.GM13293@mellanox.co.il><20070501072212.GR13293@mellanox.co.il> <1178005484.7789.4.camel@vladsk-laptop>
Message-ID:

I see. So I should have never renamed my kernel from 2.6.18.8 to 2.6.18.8-el5_x86. So this will install once I rename my kernel back to 2.6.18.8?

Thanks for the info.
Jeff -----Original Message----- From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il] Sent: Tue 5/1/2007 3:44 AM To: Jeffrey Wong Cc: Michael S. Tsirkin; general at lists.openfabrics.org Subject: Re: [ofa-general] Re: Trouble installing OFED1.2 with kernel On Tue, 2007-05-01 at 10:22 +0300, Michael S. Tsirkin wrote: > > Quoting Jeffrey Wong : > > Subject: RE: Trouble installing OFED1.2 with kernel > > > > I have downloaded the kernel src from > > http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.18.8.tar.gz > > I have gunzip and untarred the directory. > > > > In the file linux-2.6.18.8/include/linux/fs.h. Here is the structure > > definition of inode. When I look below the i_private ptr does not exist. > > Please advise. > > Yes but that kernel would be named 2.6.18.8, not 2.6.18-8.el5. > > Jeff, If you named the kernel from kernel.org in 2.6.18-*el5* manner then the backport patches for RedHat 5.0 will be applied by OFED-1.2. This is the reason of your failures. So, to fix this rename you kernel and then you can install OFED-1.2 with ipath and ipoib. -- Vladimir Sokolovsky Mellanox Technologies Ltd. -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad at dev.mellanox.co.il Tue May 1 01:48:41 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 01 May 2007 11:48:41 +0300 Subject: [ofa-general] Re: Trouble installing OFED1.2 with kernel In-Reply-To: References: <20070501040329.GK13293@mellanox.co.il> <20070501064802.GM13293@mellanox.co.il> <20070501072212.GR13293@mellanox.co.il> <1178005484.7789.4.camel@vladsk-laptop> Message-ID: <1178009321.7789.8.camel@vladsk-laptop> On Tue, 2007-05-01 at 03:59 -0400, Jeffrey Wong wrote: > I see. So I should have never renamed my kernel from 2.6.18.8 to 2.6.18.8-el5_x86. So this will install once I rename my kernel back to 2.6.18.8? > > Thanks for the info. > > > Jeff > Yes, The pattern 2.6.18-*el5* used by configure script to select backport patches. 
There is a difference between the backport patches for 2.6.18* from kernel.org and those for 2.6.18 from RHEL5.0.

Regards,
Vladimir

>
> -----Original Message-----
> From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il]
> Sent: Tue 5/1/2007 3:44 AM
> To: Jeffrey Wong
> Cc: Michael S. Tsirkin; general at lists.openfabrics.org
> Subject: Re: [ofa-general] Re: Trouble installing OFED1.2 with kernel
>
> On Tue, 2007-05-01 at 10:22 +0300, Michael S. Tsirkin wrote:
> > > Quoting Jeffrey Wong :
> > > Subject: RE: Trouble installing OFED1.2 with kernel
> > >
> > > I have downloaded the kernel src from
> > > http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.18.8.tar.gz
> > > I have gunzip and untarred the directory.
> > >
> > > In the file linux-2.6.18.8/include/linux/fs.h. Here is the structure
> > > definition of inode. When I look below the i_private ptr does not exist.
> > > Please advise.
> >
> > Yes but that kernel would be named 2.6.18.8, not 2.6.18-8.el5.
> >
> >
>
> Jeff,
> If you named the kernel from kernel.org in 2.6.18-*el5* manner then the
> backport patches for RedHat 5.0 will be applied by OFED-1.2. This is the
> reason of your failures.
>
> So, to fix this rename your kernel and then you can install OFED-1.2 with
> ipath and ipoib.
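
[Editor's sketch] The name-based selection Vladimir describes can be illustrated with a glob match. This is only a sketch under stated assumptions: the real OFED configure logic is a shell script whose exact code is not shown in this thread, and `looks_like_rhel5_kernel` is a hypothetical helper name. Only the pattern 2.6.18-*el5* comes from the messages above.

```c
#include <fnmatch.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns true when a kernel release string matches
 * the 2.6.18-*el5* pattern quoted in the thread, i.e. when the RHEL5
 * backport patches would be chosen. The actual OFED script may apply
 * further checks; this only models the glob itself.
 */
static bool looks_like_rhel5_kernel(const char *kernel_release)
{
	return fnmatch("2.6.18-*el5*", kernel_release, 0) == 0;
}
```

With this model, a release string such as "2.6.18-8.el5" matches, while the plain kernel.org name "2.6.18.8" does not, which is consistent with the advice to keep the kernel.org name for a kernel.org tree.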
> > From vlad at lists.openfabrics.org Tue May 1 02:37:11 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 1 May 2007 02:37:11 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070501-0200 daily build status Message-ID: <20070501093712.4CA32E60811@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.16 Passed on powerpc with 
linux-2.6.14 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Failed:

From mst at dev.mellanox.co.il Tue May 1 05:53:21 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 1 May 2007 15:53:21 +0300
Subject: [ofa-general] Fwd: [ANNOUNCE] GIT 1.5.1.3
Message-ID: <20070501125321.GC8447@mellanox.co.il>

FYI. I think we want to update - the mmap fix looks important enough.
Sasha?

----- Forwarded message from Junio C Hamano -----

Subject: [ANNOUNCE] GIT 1.5.1.3
Date: Tue, 1 May 2007 06:08:58 +0300
From: Junio C Hamano

The latest maintenance release GIT 1.5.1.3 is available at the
usual places:

  http://www.kernel.org/pub/software/scm/git/

  git-1.5.1.3.tar.{gz,bz2}			(tarball)
  git-htmldocs-1.5.1.3.tar.{gz,bz2}		(preformatted docs)
  git-manpages-1.5.1.3.tar.{gz,bz2}		(preformatted docs)
  RPMS/$arch/git-*-1.5.1.3-1.$arch.rpm	(RPM)

GIT v1.5.1.3 Release Notes
==========================

Fixes since v1.5.1.2
--------------------

* Bugfixes

  - git-add tried to optimize by finding common leading
    directories across its arguments but botched, causing very
    confused behaviour.

  - unofficial rpm.spec file shipped with git was letting
    ETC_GITCONFIG set to /usr/etc/gitconfig. Tweak the official
    Makefile to make it harder for distro people to make the
    same mistake, by setting the variable to /etc/gitconfig if
    prefix is set to /usr.

  - git-svn inconsistently stripped away username from the URL
    only when svnsync_props was in use.

  - git-svn got confused when handling symlinks on Mac OS.

  - git-send-email was not quoting recipient names that have
    period '.' in them. Also it did not allow overriding
    envelope sender, which made it impossible to send patches to
    certain subscriber-only lists.

  - built-in write_tree() routine had a sequence that renamed a
    file that is still open, which some systems did not like.

  - when memory is very tight, sliding mmap code to read
    packfiles incorrectly closed the fd that was still being
    used to read the pack.
- import-tars contributed front-end for fastimport was passing wrong directory modes without checking. - git-fastimport trusted its input too much and allowed to create corrupt tree objects with entries without a name. - git-fetch needlessly barfed when too long reflog action description was given by the caller. Also contains various documentation updates. ---------------------------------------------------------------- Changes since v1.5.1.2 are as follows: Adam Roben (5): Remove usernames from all commit messages, not just when using svmprops git-svn: Don't rely on $_ after making a function call git-svn: Ignore usernames in URLs in find_by_url git-svn: Added 'find-rev' command git-svn: Add 'find-rev' command Alex Riesen (1): Fix handle leak in write_tree Andrew Ruder (8): Removing -n option from git-diff-files documentation Document additional options for git-fetch Update git-fmt-merge documentation Update git-grep documentation Update -L documentation for git-blame/git-annotate Update git-http-push documentation Update git-local-fetch documentation Update git-http-fetch documentation Brian Gernhardt (2): Reverse the order of -b and --track in the man page. Ignore all man sections as they are generated files. Gerrit Pape (1): Documentation/git-reset.txt: suggest git commit --amend in example. Jari Aalto (3): Clarify SubmittingPatches Checklist git.7: Mention preformatted html doc location send-email documentation: clarify --smtp-server Johannes Schindelin (2): dir.c(common_prefix): Fix two bugs import-tars: be nice to wrong directory modes Josh Triplett (3): Fix typo in git-am: s/Was is/Was it/ Create a sysconfdir variable, and use it for ETC_GITCONFIG Add missing reference to GIT_COMMITTER_DATE in git-commit-tree documentation Julian Phillips (1): http.c: Fix problem with repeated calls of http_init Junio C Hamano (8): Build RPM with ETC_GITCONFIG=/etc/gitconfig applymbox & quiltimport: typofix. 
Start preparing for 1.5.1.3 Do not barf on too long action description Update .mailmap with "Michael" Fix import-tars fix. Fix symlink handling in git-svn, related to PerlIO GIT v1.5.1.3 Michele Ballabio (1): git shortlog documentation: add long options and fix a typo Robin H. Johnson (10): Document --dry-run parameter to send-email. Prefix Dry- to the message status to denote dry-runs. Debugging cleanup improvements Change the scope of the $cc variable as it is not needed outside of send_message. Perform correct quoting of recipient names. Validate @recipients before using it for sendmail and Net::SMTP. Ensure clean addresses are always used with Net::SMTP Allow users to optionally specify their envelope sender. Document --dry-run and envelope-sender for git-send-email. Sanitize @to recipients. Shawn O. Pearce (3): Actually handle some-low memory conditions Don't allow empty pathnames in fast-import Catch empty pathnames in trees during fsck - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo at vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ----- End forwarded message ----- -- MST From mhagen at iol.unh.edu Tue May 1 07:20:53 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Tue, 1 May 2007 10:20:53 -0400 (EDT) Subject: [ofa-general] [PATCH] infiniband: modify ammasso driver to use send with invalidate Message-ID: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> Modification to the ammasso driver to use the iWARP verbs SEND with INV and SEND with SE and INV. 

--- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30 13:12:54.000000000 -0400
+++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c	2007-04-30 16:24:38.000000000 -0400
@@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str

 		switch (ib_wr->opcode) {
 		case IB_WR_SEND:
-			if (ib_wr->send_flags & IB_SEND_SOLICITED) {
+			if (ib_wr->send_flags & IB_SEND_SOLICITED
+			    && ib_wr->send_flags & IB_SEND_INVALIDATE) {
+				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV);
+				wr.sqwr.send.remote_stag =
+					cpu_to_be32(ib_wr->wr.invalidate.rkey);
+			} else if (ib_wr->send_flags & IB_SEND_SOLICITED) {
 				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE);
-				msg_size = sizeof(struct c2wr_send_req);
+				wr.sqwr.send.remote_stag = 0;
+			} else if (ib_wr->send_flags & IB_SEND_INVALIDATE) {
+				c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV);
+				wr.sqwr.send.remote_stag =
+					cpu_to_be32(ib_wr->wr.invalidate.rkey);
 			} else {
 				c2_wr_set_id(&wr, C2_WR_TYPE_SEND);
-				msg_size = sizeof(struct c2wr_send_req);
+				wr.sqwr.send.remote_stag = 0;
 			}

-			wr.sqwr.send.remote_stag = 0;
-			msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge;
+			msg_size = sizeof(struct c2wr_send_req);
+			msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge;
 			if (ib_wr->num_sge > qp->send_sgl_depth) {
 				err = -EINVAL;
 				break;

--
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701
iWARP:1-603-862-5083
Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824

From mhagen at iol.unh.edu Tue May 1 07:22:34 2007
From: mhagen at iol.unh.edu (mhagen at iol.unh.edu)
Date: Tue, 1 May 2007 10:22:34 -0400 (EDT)
Subject: [ofa-general] Re: [PATCH] infiniband: add support for invalidate stag
In-Reply-To: <20070501035708.GJ13293@mellanox.co.il>
References: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu> <20070501035708.GJ13293@mellanox.co.il>
Message-ID: <36387.132.177.125.178.1178029354.squirrel@postal.iol.unh.edu>

I don't believe so.
I just sent out modifications to the Ammasso driver on another thread that might clear this up. The modifications to the driver should show how these new verbs could be used. >> Quoting mhagen at iol.unh.edu : >> Subject: [PATCH] infiniband: add support for invalidate stag >> >> Patch to add support for the iWARP verbs SEND with INV and SEND with SE >> and INV. >> >> --- linux-2.6.21.1/include/rdma/ib_verbs.h 2007-04-28 >> 15:35:02.677618096 -0400 >> +++ linux-2.6.21.1/include/rdma/ib_verbs.h 2007-04-28 >> 15:29:16.200290656 -0400 >> @@ -611,7 +611,8 @@ enum ib_send_flags { >> IB_SEND_FENCE = 1, >> IB_SEND_SIGNALED = (1<<1), >> IB_SEND_SOLICITED = (1<<2), >> - IB_SEND_INLINE = (1<<3) >> + IB_SEND_INLINE = (1<<3), >> + IB_SEND_INVALIDATE = (1<<4) >> }; >> >> struct ib_sge { >> @@ -646,6 +647,9 @@ struct ib_send_wr { >> u16 pkey_index; /* valid for GSI only */ >> u8 port_num; /* valid for DR SMPs on switch only */ >> } ud; >> + struct { >> + u32 rkey; >> + } invalidate; >> } wr; >> }; > > Shouldn't this rather be part of rc wr? > > -- > MST > -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From MAILER-DAEMON at lists.openfabrics.org Tue May 1 08:21:10 2007 From: MAILER-DAEMON at lists.openfabrics.org (MAILER-DAEMON at lists.openfabrics.org) Date: Wed, 02 May 2007 00:21:10 +0900 Subject: [ofa-general] Delivery Status Message-ID: <200705011521.l41FLCKq000261@nttmail4.ecl.ntt.co.jp> --- The following addresses had delivery problems --- (5.1.1 ... User unknown) -------------- next part -------------- An embedded message was scrubbed... 
From: Canadian Doctor Tamika Subject: [spam] RE: MedHelp 21802 Date: Wed, 2 May 2007 00:20:57 +0900 (JST) Size: 1892 URL: From jlentini at netapp.com Tue May 1 09:13:53 2007 From: jlentini at netapp.com (James Lentini) Date: Tue, 1 May 2007 12:13:53 -0400 (EDT) Subject: [ofa-general] Re: [PATCH] uDAPL OFED 1.2 RC2 build issue on ia64 and RHEL5 In-Reply-To: <000001c78925$18867a10$ff0da8c0@amr.corp.intel.com> References: <000001c78925$18867a10$ff0da8c0@amr.corp.intel.com> Message-ID: On Fri, 27 Apr 2007, Arlin Davis wrote: > Fixes build problems with ia64 and RHEL5 with atomic operations. > Patch was tested on ia64 RHEL4 and RHEL5 using dtest/dapltest. > > James, can you review this before I push. Looks good. From loic at myri.com Tue May 1 09:26:48 2007 From: loic at myri.com (Loic Prylli) Date: Tue, 01 May 2007 12:26:48 -0400 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <20070501015731.3568d28b.billfink@mindspring.com> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> <46327A07.1000404@hp.com> <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> <4632894D.40705@hp.com> <20070428025117.a3b1200a.billfink@mindspring.com> <46362244.9030406@hp.com> <20070501015731.3568d28b.billfink@mindspring.com> Message-ID: <46376A48.3050102@myri.com> On 5/1/2007 1:57 AM, Bill Fink wrote: > On Mon, 30 Apr 2007, Rick Jones wrote: > > >> Ethtool -i on the interface reports 1.2.0 as the driver version. >> > > Perhaps it would be useful to have different version strings for > the in-kernel Linux version and the Myricom externally provided > version. Just a thought. 
>

Indeed, and it is the case as of March-21 git (or any myri10ge version >= 1.3.0). The in-kernel version will show something like 1.3.0-1.226; the external version will show only 1.3.0.

Loic

From monil at voltaire.com Tue May 1 09:37:24 2007
From: monil at voltaire.com (Moni Levy)
Date: Tue, 1 May 2007 19:37:24 +0300
Subject: [ofa-general] Re: pkey change handling patch
In-Reply-To:
References: <20070417223547.GI25314@mellanox.co.il> <20070419203705.GA613@mellanox.co.il> <6a122cc00704250113r152c6968ibd7155906c1839b1@mail.gmail.com> <6a122cc00704260449w7b4ef0cap1a617c172a6415da@mail.gmail.com> <20070426133442.GJ32513@mellanox.co.il> <6a122cc00704260636l1d30c4faw7c3fffae8ed8c1c9@mail.gmail.com> <20070426134331.GL32513@mellanox.co.il> <6a122cc00704260653i706caaecu1b8265157805912@mail.gmail.com>
Message-ID: <6a122cc00705010937r162b53b1jafad6d7b8055bea5@mail.gmail.com>

On 4/26/07, Roland Dreier wrote:
> > > Let's do it over query_pkey/query_port for now.
> > > Long term providers will just optimize these I think.
> >
> > How ? Caching at device driver level ?
>
> Yes... for the most part, it should be much easier to do within the
> driver. For example mthca, mlx4 and ipath at least know exactly when
> the P_Key table is being changed and can just snoop the operation
> without needing to worry about deferring things to a workqueue, etc.
>
> ehca seems to have a hypercall that returns the whole P_Key table in
> one go.
>
> I think it would be fine to change the interface to something like
>
> query_pkey(struct ib_device *dev, u8 port, u16 start_index,
>            u16 num_pkeys, u16 *pkey)
>
> that returns a block of P_Keys in one go, but I don't see it as that critical.

That's exactly what I meant, and yes I agree it's not urgent.
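
[Editor's sketch] The block-query idea discussed above, combined with driver-level caching, might look roughly like the following. This is a toy model, not kernel code: `struct toy_port`, `TOY_PKEY_TABLE_LEN` and `toy_query_pkeys` are hypothetical names, and only the parameter list (start_index, num_pkeys, pkey) follows the interface Roland proposes.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_PKEY_TABLE_LEN 8	/* assumed table size, for illustration only */

/* Hypothetical per-port state: a P_Key table the driver keeps up to date. */
struct toy_port {
	uint16_t pkey_cache[TOY_PKEY_TABLE_LEN];
};

/*
 * Copy num_pkeys entries starting at start_index into pkey[].
 * Returns 0 on success or -EINVAL if the requested block runs past
 * the end of the table.
 */
static int toy_query_pkeys(const struct toy_port *port, uint16_t start_index,
			   uint16_t num_pkeys, uint16_t *pkey)
{
	uint16_t i;

	if (start_index > TOY_PKEY_TABLE_LEN ||
	    num_pkeys > TOY_PKEY_TABLE_LEN - start_index)
		return -EINVAL;

	for (i = 0; i < num_pkeys; i++)
		pkey[i] = port->pkey_cache[start_index + i];

	return 0;
}
```

The point of the block form is that a caller scanning the table makes one call instead of one per index; whether the data comes from a cache or from a hardware query is left to the provider.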
>

From swise at opengridcomputing.com Tue May 1 10:18:56 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 01 May 2007 12:18:56 -0500
Subject: [ofa-general] Requesting CQ notifications
In-Reply-To:
References: <462FD3F7.1010304@evergrid.com>
Message-ID: <1178039936.2309.67.camel@stevo-desktop>

On Wed, 2007-04-25 at 18:58 -0700, Roland Dreier wrote:
> > Is there a differentiation between multiple CQE's being in the CQ
> > vs. CQE's being arriving into the CQ when using completion
> > notifications?
> >
> > For example, assume I have the following order of events:
> >
> >
> > 2 CQEs arrive
> >
> > select() returns readable for comp. channel
> >
> > ibv_get_cq_event() returns event
> >
> > ibv_req_notify_cq(cq, 0)
> >
> > ibv_poll_cq(cq, 1, &cqe) returns 1
> >
> > ibv_ack_cq_events(cq, 1)
> >
> >
> > Will the comp. channel receive another event for the second CQE even
> > if it had arrived before ibv_req_notify_cq() was called?
>
> This is really an ill-posed question: according to the semantics
> defined by the verbs spec, the presence or absence of the second CQE
> is not defined until you poll the CQ again.
>
> In practice we can look at what real hardware does, and the answer is
> "it depends." Some adapters (eg mthca, mlx4) will generate an event
> immediately if ibv_req_notify_cq() is called for a CQ that contains an
> unpolled CQE, while other adapters (eg ipath, ehca) will only generate
> an event when a CQE is added after the call to ibv_req_notify_cq().
>

cxgb3 behaves like ipath/ehca, i.e. the arrival of a new CQE generates the notification event.

> - R.
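
[Editor's sketch] The two adapter behaviours described above can be modelled with a small toy completion queue, to show why draining the CQ after re-arming is the portable pattern. Everything here is a simulation under stated assumptions: the `toy_*` names are hypothetical and are not libibverbs calls, and the two `notify_style` values only approximate the mthca/mlx4 and ipath/ehca/cxgb3 behaviour as reported in this thread.

```c
#include <stdbool.h>

/* Toy model only; none of these names are libibverbs API. */
enum notify_style {
	NOTIFY_IF_PENDING,	/* event fires at once if unpolled CQEs exist */
	NOTIFY_ON_NEW_ONLY	/* event fires only for CQEs added after arming */
};

struct toy_cq {
	enum notify_style style;
	int pending;	/* unpolled CQEs */
	bool armed;	/* notification requested and not yet fired */
	int events;	/* events queued on the completion channel */
};

/* A completion arrives; an armed CQ raises one event and disarms. */
static void toy_add_cqe(struct toy_cq *cq)
{
	cq->pending++;
	if (cq->armed) {
		cq->events++;
		cq->armed = false;
	}
}

/* Request notification, modelling the two hardware behaviours. */
static void toy_req_notify(struct toy_cq *cq)
{
	if (cq->style == NOTIFY_IF_PENDING && cq->pending > 0)
		cq->events++;
	else
		cq->armed = true;
}

/* Poll up to max CQEs; returns how many were taken. */
static int toy_poll(struct toy_cq *cq, int max)
{
	int n = cq->pending < max ? cq->pending : max;

	cq->pending -= n;
	return n;
}

/* The sequence from the question: consume event, re-arm, poll once. */
static int toy_handle_event_poll_once(struct toy_cq *cq)
{
	cq->events--;
	toy_req_notify(cq);
	return toy_poll(cq, 1);
}

/* Portable variant: consume event, re-arm, then poll until empty. */
static int toy_handle_event(struct toy_cq *cq)
{
	int handled = 0, n;

	cq->events--;
	toy_req_notify(cq);
	while ((n = toy_poll(cq, 1)) > 0)
		handled += n;
	return handled;
}
```

In this model, polling only once leaves the second CQE stranded on the "new only" style until some later completion arrives, while draining until the poll returns nothing handles both styles, at the cost of an occasional event that finds the CQ already empty.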
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Tue May 1 10:26:32 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 01 May 2007 12:26:32 -0500 Subject: [ofa-general] Re: [PATCH] infiniband: add support for invalidate stag In-Reply-To: <20070501035708.GJ13293@mellanox.co.il> References: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu> <20070501035708.GJ13293@mellanox.co.il> Message-ID: <1178040392.2309.72.camel@stevo-desktop> On Tue, 2007-05-01 at 06:57 +0300, Michael S. Tsirkin wrote: > > Quoting mhagen at iol.unh.edu : > > Subject: [PATCH] infiniband: add support for invalidate stag > > > > Patch to add support for the iWARP verbs SEND with INV and SEND with SE > > and INV. > > > > --- linux-2.6.21.1/include/rdma/ib_verbs.h 2007-04-28 > > 15:35:02.677618096 -0400 > > +++ linux-2.6.21.1/include/rdma/ib_verbs.h 2007-04-28 > > 15:29:16.200290656 -0400 > > @@ -611,7 +611,8 @@ enum ib_send_flags { > > IB_SEND_FENCE = 1, > > IB_SEND_SIGNALED = (1<<1), > > IB_SEND_SOLICITED = (1<<2), > > - IB_SEND_INLINE = (1<<3) > > + IB_SEND_INLINE = (1<<3), > > + IB_SEND_INVALIDATE = (1<<4) > > }; > > > > struct ib_sge { > > @@ -646,6 +647,9 @@ struct ib_send_wr { > > u16 pkey_index; /* valid for GSI only */ > > u8 port_num; /* valid for DR SMPs on switch only */ > > } ud; > > + struct { > > + u32 rkey; > > + } invalidate; > > } wr; > > }; > > Shouldn't this rather be part of rc wr? What's an rc wr? He added the invalidate struct to the union part of the ib_send_wr. It's analogous to the rdma struct in that union in that it's additional values passed in the send wr for an iwarp send w/invalidate and send-se w/invalidate. This seems reasonable to me...
> From swise at opengridcomputing.com Tue May 1 10:34:11 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 01 May 2007 12:34:11 -0500 Subject: [ofa-general] [PATCH] infiniband: modify ammasso driver to use send with invalidate In-Reply-To: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> Message-ID: <1178040851.2309.75.camel@stevo-desktop> The code looks correct. I'd make the msg_size lines just one statement: msg_size = sizeof(struct c2wr_send_req) + sizeof(struct c2_data_addr) * ib_wr->num_sge; Have you tested that it works? Steve. On Tue, 2007-05-01 at 10:20 -0400, mhagen at iol.unh.edu wrote: > Modification to the ammasso driver to use the iWARP verbs SEND with INV > and SEND with SE and INV. > > --- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-04-30 > 13:12:54.000000000 -0400 > +++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-04-30 > 16:24:38.000000000 -0400 > @@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str > > switch (ib_wr->opcode) { > case IB_WR_SEND: > - if (ib_wr->send_flags & IB_SEND_SOLICITED) { > + if (ib_wr->send_flags & IB_SEND_SOLICITED > + && ib_wr->send_flags & IB_SEND_INVALIDATE) { > + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV); > + wr.sqwr.send.remote_stag = > + cpu_to_be32(ib_wr->wr.invalidate.rkey); > + } else if (ib_wr->send_flags & IB_SEND_SOLICITED) { > c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE); > - msg_size = sizeof(struct c2wr_send_req); > + wr.sqwr.send.remote_stag = 0; > + } else if (ib_wr->send_flags & IB_SEND_INVALIDATE) { > + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV); > + wr.sqwr.send.remote_stag = > + cpu_to_be32(ib_wr->wr.invalidate.rkey); > } else { > c2_wr_set_id(&wr, C2_WR_TYPE_SEND); > - msg_size = sizeof(struct c2wr_send_req); > + wr.sqwr.send.remote_stag = 0; > } > > - wr.sqwr.send.remote_stag = 0; > - msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; > + msg_size = sizeof(struct 
c2wr_send_req); > + msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; > if (ib_wr->num_sge > qp->send_sgl_depth) { > err = -EINVAL; > break; > > > From mhagen at iol.unh.edu Tue May 1 10:39:00 2007 From: mhagen at iol.unh.edu (Mikkel Hagen) Date: Tue, 01 May 2007 13:39:00 -0400 Subject: [ofa-general] [PATCH] infiniband: modify ammasso driver to use send with invalidate In-Reply-To: <1178040851.2309.75.camel@stevo-desktop> References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> <1178040851.2309.75.camel@stevo-desktop> Message-ID: <46377B34.9040605@iol.unh.edu> I don't believe that we can make it into one line as Roland pointed out earlier - it introduces an accumulation bug because it is within a while loop. Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 Steve Wise wrote: > The code looks correct. > > I'd make the msg_size lines just one statement: > > msg_size = sizeof(struct c2wr_send_req) + > sizeof(struct c2_data_addr) * ib_wr->num_sge; > > Have you tested that it works? > > > Steve. > > > On Tue, 2007-05-01 at 10:20 -0400, mhagen at iol.unh.edu wrote: > >> Modification to the ammasso driver to use the iWARP verbs SEND with INV >> and SEND with SE and INV. 
>> >> --- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-04-30 >> 13:12:54.000000000 -0400 >> +++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-04-30 >> 16:24:38.000000000 -0400 >> @@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str >> >> switch (ib_wr->opcode) { >> case IB_WR_SEND: >> - if (ib_wr->send_flags & IB_SEND_SOLICITED) { >> + if (ib_wr->send_flags & IB_SEND_SOLICITED >> + && ib_wr->send_flags & IB_SEND_INVALIDATE) { >> + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV); >> + wr.sqwr.send.remote_stag = >> + cpu_to_be32(ib_wr->wr.invalidate.rkey); >> + } else if (ib_wr->send_flags & IB_SEND_SOLICITED) { >> c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE); >> - msg_size = sizeof(struct c2wr_send_req); >> + wr.sqwr.send.remote_stag = 0; >> + } else if (ib_wr->send_flags & IB_SEND_INVALIDATE) { >> + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV); >> + wr.sqwr.send.remote_stag = >> + cpu_to_be32(ib_wr->wr.invalidate.rkey); >> } else { >> c2_wr_set_id(&wr, C2_WR_TYPE_SEND); >> - msg_size = sizeof(struct c2wr_send_req); >> + wr.sqwr.send.remote_stag = 0; >> } >> >> - wr.sqwr.send.remote_stag = 0; >> - msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; >> + msg_size = sizeof(struct c2wr_send_req); >> + msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; >> if (ib_wr->num_sge > qp->send_sgl_depth) { >> err = -EINVAL; >> break; >> >> >> >> From swise at opengridcomputing.com Tue May 1 10:43:24 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 01 May 2007 12:43:24 -0500 Subject: [ofa-general] [PATCH] infiniband: modify ammasso driver to use send with invalidate In-Reply-To: <46377B34.9040605@iol.unh.edu> References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> <1178040851.2309.75.camel@stevo-desktop> <46377B34.9040605@iol.unh.edu> Message-ID: <1178041404.2309.80.camel@stevo-desktop> No, the accumulation bug was because you were always doing a +=. 
> > > > msg_size = sizeof(struct c2wr_send_req) + > > sizeof(struct c2_data_addr) * ib_wr->num_sge; This always assigns to msg_size. From mhagen at iol.unh.edu Tue May 1 11:05:58 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Tue, 1 May 2007 14:05:58 -0400 (EDT) Subject: [ofa-general] [PATCH] infiniband: modify ammasso driver to use send with invalidate In-Reply-To: <1178041404.2309.80.camel@stevo-desktop> References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> <1178040851.2309.75.camel@stevo-desktop> <46377B34.9040605@iol.unh.edu> <1178041404.2309.80.camel@stevo-desktop> Message-ID: <47342.132.177.125.178.1178042758.squirrel@postal.iol.unh.edu> Good point! --- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-04-30 13:12:54.000000000 -0400 +++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-05-01 14:04:07.000000000 -0400 @@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str switch (ib_wr->opcode) { case IB_WR_SEND: - if (ib_wr->send_flags & IB_SEND_SOLICITED) { + if (ib_wr->send_flags & IB_SEND_SOLICITED + && ib_wr->send_flags & IB_SEND_INVALIDATE) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV); + wr.sqwr.send.remote_stag = + cpu_to_be32(ib_wr->wr.invalidate.rkey); + } else if (ib_wr->send_flags & IB_SEND_SOLICITED) { c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE); - msg_size = sizeof(struct c2wr_send_req); + wr.sqwr.send.remote_stag = 0; + } else if (ib_wr->send_flags & IB_SEND_INVALIDATE) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV); + wr.sqwr.send.remote_stag = + cpu_to_be32(ib_wr->wr.invalidate.rkey); } else { c2_wr_set_id(&wr, C2_WR_TYPE_SEND); - msg_size = sizeof(struct c2wr_send_req); + wr.sqwr.send.remote_stag = 0; } - wr.sqwr.send.remote_stag = 0; - msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; + msg_size = sizeof(struct c2wr_send_req) + + sizeof(struct c2_data_addr) * ib_wr->num_sge; if (ib_wr->num_sge > qp->send_sgl_depth) { err = -EINVAL; break; > No, the accumulation bug was because you 
were always doing a +=. > >> > >> > msg_size = sizeof(struct c2wr_send_req) + >> > sizeof(struct c2_data_addr) * ib_wr->num_sge; > > This always assigns to msg_size. > > > -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From etta at systemfabricworks.com Tue May 1 11:58:35 2007 From: etta at systemfabricworks.com (Chieng Etta) Date: Tue, 1 May 2007 13:58:35 -0500 Subject: [ofa-general] OFED 1.2-rc2 - Multipath failover stress test results Message-ID: <000b01c78c22$b8b84810$c801a8c0@ettac> All, SFW has completed the SRP multipath failover stress test on the following builds and OSes. * OFED 1.2-rc2 - SLES10 x86 and SLES10 x86_64 * 04192007-0600 - RHEL 5 x86_64. The I/O was running on each platform for more than 10 hours during the failovers. No I/O error occurred. Please see attached for details. Thanks, Etta -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: OFED1 2rc2_multipath_test_report.xls Type: application/vnd.ms-excel Size: 28672 bytes Desc: not available URL: From jwong at datallegro.com Tue May 1 13:10:40 2007 From: jwong at datallegro.com (Jeffrey Wong) Date: Tue, 1 May 2007 16:10:40 -0400 Subject: [ofa-general] Errors after install for openibd start Message-ID: Hello, I have successfully run the ./install.sh script with kernel 2.6.18-8.1.1.el5 I did not reboot the machine. After installing and configuring the ports using the defaults, I tried to execute the command: /etc/init.d/openibd start I have truncated the errors to show an example. Any suggestions? 
Thanks, Jeff I am getting the following errors: ib_ipath: disagrees about version of symbol ib_unregister_device ib_ipath: Unknown symbol ib_unregister_device ib_ipath: disagrees about version of symbol ib_register_device ib_ipath: Unknown symbol ib_register_device ib_ipath: disagrees about version of symbol ib_dispatch_event ib_ipath: Unknown symbol ib_dispatch_event ib_ipath: disagrees about version of symbol ib_dealloc_device ib_ipath: Unknown symbol ib_dealloc_device ib_ipath: disagrees about version of symbol ib_alloc_device ib_ipath: Unknown symbol ib_alloc_device ib_ipoib: disagrees about version of symbol ib_unregister_client ib_ipoib: Unknown symbol ib_unregister_client ib_ipoib: disagrees about version of symbol ib_create_cq ib_ipoib: Unknown symbol ib_create_cq ib_ipoib: Unknown symbol ib_sa_register_client ib_ipoib: disagrees about version of symbol ib_cm_listen ib_ipoib: Unknown symbol ib_cm_listen -------------- next part -------------- An HTML attachment was scrubbed... URL: From sweitzen at cisco.com Tue May 1 13:38:55 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 1 May 2007 13:38:55 -0700 Subject: [ofa-general] Errors after install for openibd start In-Reply-To: References: Message-ID: Try rebooting, and see if it still happens. Scott ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jeffrey Wong Sent: Tuesday, May 01, 2007 1:11 PM To: general at lists.openfabrics.org Subject: [ofa-general] Errors after install for openibd start Hello, I have successfully run the ./install.sh script with kernel 2.6.18-8.1.1.el5 I did not reboot the machine. After installing and configuring the ports using the defaults, I tried to execute the command: /etc/init.d/openibd start I have truncated the errors to show an example. Any suggestions? 
Thanks, Jeff I am getting the following errors: ib_ipath: disagrees about version of symbol ib_unregister_device ib_ipath: Unknown symbol ib_unregister_device ib_ipath: disagrees about version of symbol ib_register_device ib_ipath: Unknown symbol ib_register_device ib_ipath: disagrees about version of symbol ib_dispatch_event ib_ipath: Unknown symbol ib_dispatch_event ib_ipath: disagrees about version of symbol ib_dealloc_device ib_ipath: Unknown symbol ib_dealloc_device ib_ipath: disagrees about version of symbol ib_alloc_device ib_ipath: Unknown symbol ib_alloc_device ib_ipoib: disagrees about version of symbol ib_unregister_client ib_ipoib: Unknown symbol ib_unregister_client ib_ipoib: disagrees about version of symbol ib_create_cq ib_ipoib: Unknown symbol ib_create_cq ib_ipoib: Unknown symbol ib_sa_register_client ib_ipoib: disagrees about version of symbol ib_cm_listen ib_ipoib: Unknown symbol ib_cm_listen -------------- next part -------------- An HTML attachment was scrubbed... URL: From swise at opengridcomputing.com Tue May 1 13:46:44 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 01 May 2007 15:46:44 -0500 Subject: [ofa-general] requisite problems with rhel5 Message-ID: <1178052404.2309.132.camel@stevo-desktop> I just noticed a false prereq problem when running build.sh from ofed-1.2 on rhel5 _with_ a kernel.org kernel installed. Here's the issue: build_env.sh checks the existence of /etc/redhat-release determine if the distro is from redhat. But if that file exists -and- the kernel `uname -r` ends in *el5, then its sets DISTRIBUTION=redhat5. Otherwise it sets the DISTRIBUTION=redhat. The mvapich2 prereq stuff that checks for sysfsutils-devel vs libsysfs-devel uses the DISTRIBUTION variable to decide which rpm it needs to prereq. However, on my system, the distro _is_ RHEL5, but I installed 2.6.20.8 on it. So the prereqs fail because build_env.sh incorrectly picks redhat as the distro instead of redhat5. 
I think build_env.sh shouldn't use the kernel uname to determine if the distro is redhat5. Rather, it should grok the contents of /etc/redhat-release to determine if its rhel5 or not... Is this worthy of fixing for 1.2? Steve. From mst at dev.mellanox.co.il Tue May 1 13:51:38 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 1 May 2007 23:51:38 +0300 Subject: [ofa-general] Re: [PATCH] infiniband: add support for invalidate stag In-Reply-To: <1178040392.2309.72.camel@stevo-desktop> References: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu> <20070501035708.GJ13293@mellanox.co.il> <1178040392.2309.72.camel@stevo-desktop> Message-ID: <20070501205138.GG8447@mellanox.co.il> > He added the invalidate struct to the union part of the ib_send_wr. Its > analogous to the rdma struct in that union in that its additional values > passed in the send wr for an iwarp send w/invalidate and send-se > w/invalidate. The seems reasonable to me... I have re-read the patch, and I agree it's reasonable. -- MST From swise at opengridcomputing.com Tue May 1 13:58:03 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 01 May 2007 15:58:03 -0500 Subject: [ofa-general] requisite problems with rhel5 In-Reply-To: <1178052404.2309.132.camel@stevo-desktop> References: <1178052404.2309.132.camel@stevo-desktop> Message-ID: <1178053083.2309.137.camel@stevo-desktop> On Tue, 2007-05-01 at 15:46 -0500, Steve Wise wrote: > I think build_env.sh shouldn't use the kernel uname to determine if the > distro is redhat5. Rather, it should grok the contents > of /etc/redhat-release to determine if its rhel5 or not... > > Is this worthy of fixing for 1.2? Maybe this? 
--- build_env.sh.org 2007-05-01 10:53:22.000000000 -0500
+++ build_env.sh 2007-05-01 10:54:54.000000000 -0500
@@ -288,8 +288,8 @@ elif [ -f /etc/fedora-release ]; then
 elif [ -f /etc/rocks-release ]; then
     DISTRIBUTION="Rocks"
 elif [ -f /etc/redhat-release ]; then
-    case ${K_VER} in
-        *el5*)
+    case $(cat /etc/redhat-release) in
+        *"release 5"*)
             DISTRIBUTION="redhat5"
         ;;
         *)
From mst at dev.mellanox.co.il Tue May 1 14:02:52 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 May 2007 00:02:52 +0300 Subject: [ofa-general] Fwd: ipath-23-srp-limit-queued-commands.patch, Message-ID: <20070501210252.GH8447@mellanox.co.il> > ----- Forwarded message from Vu Pham ----- > > Subject: ipath-23-srp-limit-queued-commands.patch, > Date: Tue, 1 May 2007 19:30:48 +0300 > From: Vu Pham > > Hi, > This patch > kernel_patches/fixes/ipath-23-srp-limit-queued-commands.patch > which changes .can_queue = SRP_SQ_SIZE to .can_queue = 1 > hurts our srp performance. It limits our srp performance at > ~210 MB/s - without it our srp performance can reach 1.35 > GB/s using the same configuration > > Please remove it or apply it for whoever chooses ipath as > their hw > > thanks, > -vu > > ----- End forwarded message ----- I missed the fact that the patch with ipath- prefix actually changed SRP for all devices. How come? The comment says: Limit the number of queued SCSI commands (over SRP) to one. This patch works around a limitation that requires the number of SRP requests in progress to one. Further investigation of this limitation continues. From: Jeremy Brown And this was queued on Feb 7, apparently with no progress in the investigation. So not only is this not doing the right thing, AFAIK no problem was reported publicly and no one seems to be likely to find out why it is somehow helpful either. At this point I'm inclined to think the right thing is to remove this hack.
We can add a module parameter to limit the number of commands if that's the only thing that makes qlogic hardware tick. Bryan, are there other such hacks? I really expect the ipath-XX series to only touch the ipath driver. -- MST From pradeep at us.ibm.com Tue May 1 14:04:32 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Tue, 1 May 2007 14:04:32 -0700 Subject: [ofa-general] please review IPOIB CM NOSRQ patch Message-ID: Would one of you have the bandwidth to review the IPOIB CM NOSRQ patch (v3) that I submitted last week? Pradeep pradeep at us.ibm.com From sashak at voltaire.com Tue May 1 14:06:35 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 2 May 2007 00:06:35 +0300 Subject: [ofa-general] Fwd: [ANNOUNCE] GIT 1.5.1.3 References: <20070501125321.GC8447@mellanox.co.il> Message-ID: <39C75744D164D948A170E9792AF8E7CA185D3F@exil.voltaire.com> We will upgrade after Sonoma. Sasha ________________________________ From: general-bounces at lists.openfabrics.org on behalf of Michael S. Tsirkin Sent: Tue 5/1/2007 3:53 PM To: general at lists.openfabrics.org Subject: [ofa-general] Fwd: [ANNOUNCE] GIT 1.5.1.3 FYI. I think we want to update - the mmap fix looks important enough. Sasha? ----- Forwarded message from Junio C Hamano ----- Subject: [ANNOUNCE] GIT 1.5.1.3 Date: Tue, 1 May 2007 06:08:58 +0300 From: Junio C Hamano The latest maintenance release GIT 1.5.1.3 is available at the usual places: http://www.kernel.org/pub/software/scm/git/ git-1.5.1.3.tar.{gz,bz2} (tarball) git-htmldocs-1.5.1.3.tar.{gz,bz2} (preformatted docs) git-manpages-1.5.1.3.tar.{gz,bz2} (preformatted docs) RPMS/$arch/git-*-1.5.1.3-1.$arch.rpm (RPM) GIT v1.5.1.3 Release Notes ========================== Fixes since v1.5.1.2 -------------------- * Bugfixes - git-add tried to optimize by finding common leading directories across its arguments but botched, causing very confused behaviour.
- unofficial rpm.spec file shipped with git was letting ETC_GITCONFIG set to /usr/etc/gitconfig. Tweak the official Makefile to make it harder for distro people to make the same mistake, by setting the variable to /etc/gitconfig if prefix is set to /usr. - git-svn inconsistently stripped away username from the URL only when svnsync_props was in use. - git-svn got confused when handling symlinks on Mac OS. - git-send-email was not quoting recipient names that have period '.' in them. Also it did not allow overriding envelope sender, which made it impossible to send patches to certain subscriber-only lists. - built-in write_tree() routine had a sequence that renamed a file that is still open, which some systems did not like. - when memory is very tight, sliding mmap code to read packfiles incorrectly closed the fd that was still being used to read the pack. - import-tars contributed front-end for fastimport was passing wrong directory modes without checking. - git-fastimport trusted its input too much and allowed to create corrupt tree objects with entries without a name. - git-fetch needlessly barfed when too long reflog action description was given by the caller. Also contains various documentation updates. 
---------------------------------------------------------------- Changes since v1.5.1.2 are as follows: Adam Roben (5): Remove usernames from all commit messages, not just when using svmprops git-svn: Don't rely on $_ after making a function call git-svn: Ignore usernames in URLs in find_by_url git-svn: Added 'find-rev' command git-svn: Add 'find-rev' command Alex Riesen (1): Fix handle leak in write_tree Andrew Ruder (8): Removing -n option from git-diff-files documentation Document additional options for git-fetch Update git-fmt-merge documentation Update git-grep documentation Update -L documentation for git-blame/git-annotate Update git-http-push documentation Update git-local-fetch documentation Update git-http-fetch documentation Brian Gernhardt (2): Reverse the order of -b and --track in the man page. Ignore all man sections as they are generated files. Gerrit Pape (1): Documentation/git-reset.txt: suggest git commit --amend in example. Jari Aalto (3): Clarify SubmittingPatches Checklist git.7: Mention preformatted html doc location send-email documentation: clarify --smtp-server Johannes Schindelin (2): dir.c(common_prefix): Fix two bugs import-tars: be nice to wrong directory modes Josh Triplett (3): Fix typo in git-am: s/Was is/Was it/ Create a sysconfdir variable, and use it for ETC_GITCONFIG Add missing reference to GIT_COMMITTER_DATE in git-commit-tree documentation Julian Phillips (1): http.c: Fix problem with repeated calls of http_init Junio C Hamano (8): Build RPM with ETC_GITCONFIG=/etc/gitconfig applymbox & quiltimport: typofix. Start preparing for 1.5.1.3 Do not barf on too long action description Update .mailmap with "Michael" Fix import-tars fix. Fix symlink handling in git-svn, related to PerlIO GIT v1.5.1.3 Michele Ballabio (1): git shortlog documentation: add long options and fix a typo Robin H. Johnson (10): Document --dry-run parameter to send-email. Prefix Dry- to the message status to denote dry-runs. 
Debugging cleanup improvements Change the scope of the $cc variable as it is not needed outside of send_message. Perform correct quoting of recipient names. Validate @recipients before using it for sendmail and Net::SMTP. Ensure clean addresses are always used with Net::SMTP Allow users to optionally specify their envelope sender. Document --dry-run and envelope-sender for git-send-email. Sanitize @to recipients. Shawn O. Pearce (3): Actually handle some-low memory conditions Don't allow empty pathnames in fast-import Catch empty pathnames in trees during fsck - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo at vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ----- End forwarded message ----- -- MST _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at dev.mellanox.co.il Tue May 1 14:11:04 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 May 2007 00:11:04 +0300 Subject: [ofa-general] Re: please review IPOIB CM NOSRQ patch In-Reply-To: References: Message-ID: <20070501211104.GK8447@mellanox.co.il> > Quoting Pradeep Satyanarayana : > Subject: please review IPOIB CM NOSRQ patch > > Would one of you have the bandwidth to review the IPOIB CM NOSRQ patch > (v3) that I submitted last week. > > Pradeep > pradeep at us.ibm.com OK, but could you please send a version that isn't line-wrapped? Take a look at how it's formatted e.g. 
here: http://article.gmane.org/gmane.linux.drivers.openib/39021 -- MST From tziporet at dev.mellanox.co.il Tue May 1 14:13:52 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 01 May 2007 14:13:52 -0700 Subject: [ofa-general] Release version of Ofed v1.2 In-Reply-To: <9BEB932202A05B488722B05D2374A1DA03C7F87A@mtv-amer001e--3.americas.sgi.com> References: <9BEB932202A05B488722B05D2374A1DA03C7F87A@mtv-amer001e--3.americas.sgi.com> Message-ID: <4637AD90.2060202@mellanox.co.il> Scott Shaw wrote: > When will the general release of ofed v1.2 be available? Also is the OS > requirement going to be SUSE10 SP1? > > Will ofed v1.2 work with SUSE10 without service packs? > > Thanks, > Scott > > General release is planned for May 15, but the actual date depends on stability and some hard bugs we are still trying to fix. It should support SLES10 SP1 (currently tested with SP1 RC1). BTW Moiz from Novell said that SLES10 SP1 will include OFED 1.2 as an add-on package from Novell. Tziporet From pradeep at us.ibm.com Tue May 1 14:36:40 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Tue, 1 May 2007 14:36:40 -0700 Subject: [ofa-general] Re: please review IPOIB CM NOSRQ patch In-Reply-To: <20070501211104.GK8447@mellanox.co.il> Message-ID: Thanks for spending cycles on this! There are a few long lines that are broken up, but most of it is not line wrapped like the previous versions. Would it be possible to look at the attached text file instead? That should be as expected. Pradeep pradeep at us.ibm.com "Michael S. Tsirkin" wrote on 05/01/2007 02:11:04 PM: > > Quoting Pradeep Satyanarayana : > > Subject: please review IPOIB CM NOSRQ patch > > > > Would one of you have the bandwidth to review the IPOIB CM NOSRQ patch > > (v3) that I submitted last week. > > > > Pradeep > > pradeep at us.ibm.com > > OK, but could you please send a version that isn't line-wrapped? > > Take a look at how it's formatted e.g.
here: > http://article.gmane.org/gmane.linux.drivers.openib/39021 > > > -- > MST From jwong at datallegro.com Tue May 1 15:04:38 2007 From: jwong at datallegro.com (Jeffrey Wong) Date: Tue, 1 May 2007 18:04:38 -0400 Subject: [ofa-general] Errors after install for openibd start In-Reply-To: Message-ID: Thanks, Rebooting solved the problems. Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From loic at myri.com Tue May 1 15:05:24 2007 From: loic at myri.com (Loic Prylli) Date: Tue, 01 May 2007 15:05:24 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <46365BD4.5060607@hp.com> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> <46327A07.1000404@hp.com> <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> <4632894D.40705@hp.com> <20070428025117.a3b1200a.billfink@mindspring.com> <4634F49F.9030408@myri.com> <46365BD4.5060607@hp.com> Message-ID: <4637B9A4.2050103@myri.com> On 4/30/2007 2:12 PM, Rick Jones wrote: > > Speaking of defaults, it would seem that the external 1.2.0 driver > comes with 9000 bytes as the default MTU? At least I think that is > what I am seeing now that I've started looking more closely. > > rick jones That's the same for the in-kernel-tree code (9K MTU by default). Assuming this is not wanted, I will submit a patch for that. 
Loic From rick.jones2 at hp.com Tue May 1 15:12:08 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Tue, 01 May 2007 15:12:08 -0700 Subject: [ofa-general] Re: IPoIB forwarding In-Reply-To: <4637B9A4.2050103@myri.com> References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov> <20070425124652.GG1624@mellanox.co.il> <6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov> <20070426161409.GF15540@mellanox.co.il> <6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov> <20070426180618.GJ15540@mellanox.co.il> <6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov> <46325DF3.2050203@hp.com> <6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov> <46327A07.1000404@hp.com> <6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov> <4632894D.40705@hp.com> <20070428025117.a3b1200a.billfink@mindspring.com> <4634F49F.9030408@myri.com> <46365BD4.5060607@hp.com> <4637B9A4.2050103@myri.com> Message-ID: <4637BB38.9020809@hp.com> Loic Prylli wrote: > On 4/30/2007 2:12 PM, Rick Jones wrote: > >> >> Speaking of defaults, it would seem that the external 1.2.0 driver >> comes with 9000 bytes as the default MTU? At least I think that is >> what I am seeing now that I've started looking more closely. >> >> rick jones > > > > That's the same for the in-kernel-tree code (9K MTU by default). > Assuming this is not wanted, I will submit a patch for that. While I like what that does for performance, and at the risk of putting words into the mouths of netdev, I suspect that 1500 bytes is indeed the desired default. It matches the IEEE specs, I've yet to see a switch which enabled "Jumbo Frames" by default, not everything out there even believes that Jumbo Frames means 9000 byte MTU etc etc etc. I think that 1500 bytes for an "Ethernet" device remains in line with the principle of least surprise.
rick jones From swise at opengridcomputing.com Tue May 1 16:03:16 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 01 May 2007 18:03:16 -0500 Subject: [ofa-general] [PATCH librdmacm] rping: Transfer rkey/addr/len information in network byte order. In-Reply-To: <1177515271.22094.33.camel@stevo-desktop> References: <1177515271.22094.33.camel@stevo-desktop> Message-ID: <1178060596.2309.195.camel@stevo-desktop> Sean, This patch regresses rping. I failed to test it on AMD64<->AMD64 (ie like endian systems). I will provide another patch shortly, or we can undo the broken rping patch for -rc3. Whatever you think is best. Sorry for this! Steve. On Wed, 2007-04-25 at 10:34 -0500, Steve Wise wrote: > Sean, > > This patch enables rping between a BE and LE system. Tested on IBM > PPC64 <-> AMD64. > > Transfer rkey/addr/len information in network byte order. > > Signed-off-by: Steve Wise > --- > > examples/rping.c | 7 ++++--- > 1 files changed, 4 insertions(+), 3 deletions(-) > > diff --git a/examples/rping.c b/examples/rping.c > index 0441300..17b0000 100644 > --- a/examples/rping.c > +++ b/examples/rping.c > @@ -47,6 +47,7 @@ #include > #include > > #include > +#include > > static int debug = 0; > #define DEBUG_LOG if (debug) printf > @@ -239,9 +240,9 @@ static int server_recv(struct rping_cb * > return -1; > } > > - cb->remote_rkey = cb->recv_buf.rkey; > - cb->remote_addr = cb->recv_buf.buf; > - cb->remote_len = cb->recv_buf.size; > + cb->remote_rkey = ntohl(cb->recv_buf.rkey); > + cb->remote_addr = ntohll(cb->recv_buf.buf); > + cb->remote_len = ntohl(cb->recv_buf.size); > DEBUG_LOG("Received rkey %x addr %" PRIx64 "len %d from peer\n", > cb->remote_rkey, cb->remote_addr, cb->remote_len); > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jwong at 
datallegro.com Tue May 1 16:50:59 2007 From: jwong at datallegro.com (Jeffrey Wong) Date: Tue, 1 May 2007 19:50:59 -0400 Subject: [ofa-general] Help building sdp library. In-Reply-To: Message-ID: Hello, I am trying to do an OFED 2.1 install for all the modules now. I was able to compile and install the Basic install, and now I am trying the "all" selection. When I install with this selection I get an error when compiling the libsdp directory. It looks like the build is failing because I don't have the 32-bit compiler. Is there a workaround to compile only the 64-bit version? I added build_32bit=0 to build_env.sh to build the Basic install, but it doesn't seem to apply to libsdp when I try to use the other selection of installing all. Thanks in advance. Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From jwong at datallegro.com Tue May 1 17:38:37 2007 From: jwong at datallegro.com (Jeffrey Wong) Date: Tue, 1 May 2007 20:38:37 -0400 Subject: [ofa-general] Errors when compiling ib-bonding module. In-Reply-To: Message-ID: Hello, I am using kernel 2.6.18-8.1.1.el5 x86_64. I have changed build_env.sh to set build_32bit=-1. Thanks in advance. Jeff When installing all modules I am getting the following errors.
+ make -C /lib/modules/2.6.18-8.1.1.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding make: Entering directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64' CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.o In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:78: /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h: In function 'bond_set_slave_inactive_flags': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:262: error: (Each undeclared identifier is reported only once /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:262: error: for each function it appears in.) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h: In function 'bond_set_slave_active_flags': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_compute_features': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1233: warning: comparison of distinct pointer types lacks a cast /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_enslave': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1449: error: 'IFF_BONDING' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_release': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1848: error: 'IFF_BONDING' undeclared (first use in this function) 
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_arp_rcv': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:2548: error: 'IFF_BONDING' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_netdev_event': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:3390: error: 'IFF_BONDING' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_init': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:4374: warning: assignment discards qualifiers from pointer target type /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:4386: error: 'IFF_BONDING' undeclared (first use in this function) make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_ main.o] Error 1 make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bondi ng] Error 2 make: Leaving directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64' + echo ' Building IB bonding driver failed' Building IB bonding driver failed + exit 1 error: Bad exit status from /var/tmp/rpm-tmp.99179 (%build) Jeff Wong, Linux Software Engineer (949) 680-3066 - Office (949) 680-3001 - Fax jwong at datallegro.com www.datallegro.com 85Enterprise, 2nd Floor, Aliso Viejo, CA 92656 The information transmitted in this email is intended only for the person(s) or entity to which it is addressed and may contain proprietary, confidential and/or privileged material. If you have received this email in error please contact the sender by replying and delete this email so that it is not recoverable. 
If you are not the intended recipient(s), any retention, review, disclosure, distribution, copying, printing, dissemination, or other use of, or the taking of any action in reliance upon, this information is strictly prohibited and without liability on our part. ________________________________ From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] Sent: Tuesday, May 01, 2007 1:39 PM To: Jeffrey Wong; general at lists.openfabrics.org Subject: RE: [ofa-general] Errors after install for openibd start Try rebooting, and see if it still happens. Scott ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jeffrey Wong Sent: Tuesday, May 01, 2007 1:11 PM To: general at lists.openfabrics.org Subject: [ofa-general] Errors after install for openibd start Hello, I have successfully run the ./install.sh script with kernel 2.6.18-8.1.1.el5 I did not reboot the machine. After installing and configuring the ports using the defaults, I tried to execute the command: /etc/init.d/openibd start I have truncated the errors to show an example. Any suggestions? 
Thanks, Jeff I am getting the following errors: ib_ipath: disagrees about version of symbol ib_unregister_device ib_ipath: Unknown symbol ib_unregister_device ib_ipath: disagrees about version of symbol ib_register_device ib_ipath: Unknown symbol ib_register_device ib_ipath: disagrees about version of symbol ib_dispatch_event ib_ipath: Unknown symbol ib_dispatch_event ib_ipath: disagrees about version of symbol ib_dealloc_device ib_ipath: Unknown symbol ib_dealloc_device ib_ipath: disagrees about version of symbol ib_alloc_device ib_ipath: Unknown symbol ib_alloc_device ib_ipoib: disagrees about version of symbol ib_unregister_client ib_ipoib: Unknown symbol ib_unregister_client ib_ipoib: disagrees about version of symbol ib_create_cq ib_ipoib: Unknown symbol ib_create_cq ib_ipoib: Unknown symbol ib_sa_register_client ib_ipoib: disagrees about version of symbol ib_cm_listen ib_ipoib: Unknown symbol ib_cm_listen -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 10836 bytes Desc: image001.jpg URL: From jwong at datallegro.com Tue May 1 18:16:57 2007 From: jwong at datallegro.com (Jeffrey Wong) Date: Tue, 1 May 2007 21:16:57 -0400 Subject: [ofa-general] RE: Errors when compiling ib-bonding module. Message-ID: Hello, I am using kernel 2.6.18-8.1.1.el5 x86_64. I have changed build_env.sh to set build_32bit=-1. Thanks in advance. Jeff When installing all modules I am getting the following errors.
+ make -C /lib/modules/2.6.18-8.1.1.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding make: Entering directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64' CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.o In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:78: /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h: In function 'bond_set_slave_inactive_flags': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:262: error: (Each undeclared identifier is reported only once /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:262: error: for each function it appears in.) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h: In function 'bond_set_slave_active_flags': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_compute_features': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1233: warning: comparison of distinct pointer types lacks a cast /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_enslave': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1449: error: 'IFF_BONDING' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_release': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1848: error: 'IFF_BONDING' undeclared (first use in this function) 
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_arp_rcv': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:2548: error: 'IFF_BONDING' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_netdev_event': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:3390: error: 'IFF_BONDING' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_init': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:4374: warning: assignment discards qualifiers from pointer target type /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:4386: error: 'IFF_BONDING' undeclared (first use in this function) make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_ main.o] Error 1 make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bondi ng] Error 2 make: Leaving directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64' + echo ' Building IB bonding driver failed' Building IB bonding driver failed + exit 1 error: Bad exit status from /var/tmp/rpm-tmp.99179 (%build) -------------- next part -------------- An HTML attachment was scrubbed... URL: From sweitzen at cisco.com Tue May 1 19:08:09 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 1 May 2007 19:08:09 -0700 Subject: [ofa-general] bugs 541 and 465: slow IPoIB CM HA failover and eventual failing IPoIB HA Message-ID: With IPoIB HA (both ipoibtools and ib-bonding), I am seeing slow IPoIB CM HA failover, and eventually IPoIB stops working after enough failovers. 
I am running netperf -D traffic between two IPoIB HA hosts, while flipping the 4 host IB ports one at a time (port 1 down, sleep, port 1 up, sleep, ..., port 4 down, sleep, port 4 up, sleep) in a loop. This is a very easy test to set up. Can Mellanox and Voltaire please try to reproduce the problem? I think this problem must be fixed for OFED 1.2 rc3. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Tue May 1 19:50:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 May 2007 05:50:55 +0300 Subject: [ofa-general] Re: Requesting CQ notifications In-Reply-To: References: <462FD3F7.1010304@evergrid.com> Message-ID: <20070502025055.GM8447@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: Requesting CQ notifications > > > Is there a differentiation between multiple CQE's being in the CQ > > vs. CQE's being arriving into the CQ when using completion > > notifications? > > > > For example, assume I have the following order of events: > > > > > > 2 CQEs arrive > > > > select() returns readable for comp. channel > > > > ibv_get_cq_event() returns event > > > > ibv_req_notify_cq(cq, 0) > > > > ibv_poll_cq(cq, 1, &cqe) returns 1 > > > > ibv_ack_cq_events(cq, 1) > > > > > > Will the comp. channel receive another event for the second CQE even > > if it had arrived before ibv_req_notify_cq() was called? > > This is really an ill-posed question: according to the semantics > defined by the verbs spec, the presence or absence of the second CQE > is not defined until you poll the CQ again. > > In practice we can look at what real hardware does, and the answer is > "it depends." Some adapters (eg mthca, mlx4) will generate an event > immediately if ibv_req_notify_cq() is called for a CQ that contains an > unpolled CQE, This is not exact. 
mthca/mlx4 will generate an event immediately only for an unpolled CQE *that was not present in the CQ at the time the previous event was generated*. So the answer for mthca is yes only if the CQE arrived between the calls to select and ibv_req_notify_cq. > while other adapters (eg ipath, ehca) will only generate > an event when a CQE is added after the call to ibv_req_notify_cq(). -- MST From mike.heffner at evergrid.com Tue May 1 20:23:52 2007 From: mike.heffner at evergrid.com (Mike Heffner) Date: Tue, 01 May 2007 23:23:52 -0400 Subject: [ofa-general] Re: Requesting CQ notifications In-Reply-To: <20070502025055.GM8447@mellanox.co.il> References: <462FD3F7.1010304@evergrid.com> <20070502025055.GM8447@mellanox.co.il> Message-ID: <46380448.1020401@evergrid.com> Michael S. Tsirkin wrote: >> Quoting Roland Dreier : >> Subject: Re: Requesting CQ notifications > > This is not exact. mthca/mlx4 will generate an event immediately > only for unpolled CQE *that was not present in CQ at the > time the previous event was generated*. > So the answer for mthca is yes only if the CQE arrived > between calls to select and ibv_req_notify_cq. > Is there any method by which you can query the total number of CQEs in the CQ at an instantaneous point in time (i.e., after you have called ibv_req_notify_cq() to get notification of *new* CQEs)? Mike -- Mike Heffner EverGrid Software Blacksburg, VA USA Voice: (540) 443-3500 #603 From rajib.majumder at credit-suisse.com Tue May 1 20:34:01 2007 From: rajib.majumder at credit-suisse.com (Majumder, Rajib) Date: Wed, 2 May 2007 11:34:01 +0800 Subject: [ofa-general] OFED SDP & IPoIB Message-ID: Hi, I have the following questions regarding OFED SDP. 1) Does SDP support zcopy yet? If yes, is it for aio calls or sync/non-blocking socket calls as well? 2) Does SDP work on 10GigE iWARP, apart from IB? 3) Does IPoIB support IP multicast?
Thanks Rajib ============================================================================== Please access the attached hyperlink for an important electronic communications disclaimer: http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html ============================================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Tue May 1 21:01:13 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 May 2007 21:01:13 -0700 Subject: [ofa-general] [PATCH librdmacm] rping: Transfer rkey/addr/len information in network byte order. In-Reply-To: <1178060596.2309.195.camel@stevo-desktop> References: <1177515271.22094.33.camel@stevo-desktop> <1178060596.2309.195.camel@stevo-desktop> Message-ID: <46380D09.5070906@ichips.intel.com> > This patch regresses rping. I failed to test it on AMD64<->AMD64 (ie > like endian systems). I will provide another patch shortly, or we can > undo the broken rping patch for -rc3. Whatever you think is best. Let's fix it. Please create a patch on top of this that fixes the problem. Thanks - Sean From Arkady.Kanevsky at netapp.com Tue May 1 21:39:21 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 2 May 2007 00:39:21 -0400 Subject: [ofa-general] minutes from socket over RDMA discussion at workshop Message-ID: Dear, group enclosed is the discussion we had at Sonoma workshop. Here are rough minutes. We had not tried submission SDP to kernel.org. The current SDP performance is not very good. IPOIB connection mode has much better bandwidth. But SDP has better latency and less overhead. IPOIB connection mode scalability was not stressed yet. What are API requirements? Socket over RDMA sounds like RDS for financial services. Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: socket_sonoma_2007.pdf Type: application/octet-stream Size: 59813 bytes Desc: socket_sonoma_2007.pdf URL: From sweitzen at cisco.com Tue May 1 21:48:26 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 1 May 2007 21:48:26 -0700 Subject: [ofa-general] minutes from socket over RDMA discussion at workshop In-Reply-To: References: Message-ID: No, this is not right. SDP has better latency and better throughput than IPoIB CM, but also uses more CPU. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Kanevsky, Arkady Sent: Tuesday, May 01, 2007 9:39 PM To: openib-general Subject: [ofa-general] minutes from socket over RDMA discussion at workshop Dear, group enclosed is the discussion we had at Sonoma workshop. Here are rough minutes. We had not tried submission SDP to kernel.org. The current SDP performance is not very good. IPOIB connection mode has much better bandwidth. But SDP has better latency and less overhead. IPOIB connection mode scalability was not stressed yet. What are API requirements? Socket over RDMA sounds like RDS for financial services. Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Ashish.Batwara at lsi.com Tue May 1 22:53:06 2007 From: Ashish.Batwara at lsi.com (Batwara, Ashish) Date: Tue, 1 May 2007 23:53:06 -0600 Subject: [ofa-general] We are seeing SYNDROME_LOCAL_PROT_ERR status on CQE with Mellanox Arbel HCA in memfree mode In-Reply-To: Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A016398AA@NAMAIL2.ad.lsil.com> Any idea why this error occurs? We see it when we use FMR. Are there any special settings that the HCA needs in order to work with FMR? [Batwara, Ashish] From tziporet at dev.mellanox.co.il Tue May 1 23:05:23 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 01 May 2007 23:05:23 -0700 Subject: [ofa-general] OFED SDP & IPoIB In-Reply-To: References: Message-ID: <46382A23.2030403@mellanox.co.il> Majumder, Rajib wrote: > > Hi, > > I have the following questions regarding OFED SDP. > > 1) Does SDP support zcopy yet? If yes, is it for aio calls or > sync/non-blocking socket calls as well? > No for both > > > 2) Does SDP work on 10GigE iWARP, apart from IB? > No > > > 3) Does IPoIB support IP multicast? > Yes > Tziporet From vlad at dev.mellanox.co.il Tue May 1 23:33:09 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 02 May 2007 09:33:09 +0300 Subject: [ofa-general] Help building sdp library. In-Reply-To: References: Message-ID: <1178087589.14131.3.camel@vladsk-laptop> On Tue, 2007-05-01 at 19:50 -0400, Jeffrey Wong wrote: > Hello, > I am trying to do an OFED 2.1 install for all the modules now. > I was able to compile and install the Basic install and now I am > trying to install the all selection part. > When I try to install with this selection I am getting an error when > compiling the libsdp directory. > > It looks like since I don’t have the 32 bit compiler, the build is > failing. > Is there a workaround to only compile the 64 bit version?
> I added into the build_env.sh > build_32bit=0 to build the Basic install but it doesn’t seem to apply > to the libsdp when I try use the other selection of installing all. > > > Thanks in advance. > > Jeff > Hi, Try the latest OFED-1.2 from http://www.openfabrics.org/builds/ofed-1.2/ It should be fixed there. Note, you don't have to edit build_env.sh to change the value of build_32bit variable. It is enough to run 'export build_32bit=0' and then run install. -- Vladimir Sokolovsky Mellanox Technologies Ltd. From mst at dev.mellanox.co.il Tue May 1 23:46:48 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 May 2007 09:46:48 +0300 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review In-Reply-To: References: Message-ID: <20070502064549.GN8447@mellanox.co.il> OK, we are making progress (line-wrapping issues aside :). And there seems to be some whitespace damage, too. Pls take care of this. I think the handle_rx_wc split is going in the right direction, but let's take this through all the datapath. I went over the patch in a bit more depth, and I have some questions: > + for (i = 0; i < ipoib_recvq_size; ++i) { > + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, ... > + if (ipoib_cm_post_receive(dev, i << 32 | index)) { 1. It seems there are multiple QPs mapped to a single CQ - and each gets ipoib_recvq_size recv WRs above. Is that right? How do you prevent CQ overrun then? > + /* Find an empty rx_index_ring[] entry */ > + for (index = 0; index < NOSRQ_INDEX_RING_SIZE; index++) > + if (priv->cm.rx_index_ring[index] == NULL) > + break; > + > + if ( index == NOSRQ_INDEX_RING_SIZE) { > + spin_unlock_irq(&priv->lock); > + printk(KERN_WARNING "NOSRQ supports a max of %d RC " > + "QPs. That limit has now been reached\n", > + NOSRQ_INDEX_RING_SIZE); > + return -EINVAL; > + } 2. So, when QP limit has been reached, remote side will get a reject with custom reject reason? 
If so, it seems that since the remote does not know what the reason for the reject is, it'll just retry on the next packet, and again and again. Basically, connectivity is denied where it previously worked fine by falling back on datagram mode? One way to fix this could be to use a reject reason that tells the remote "I'm busy, switch to datagram mode for a loooooong time". Using path MTU discovery here might be useful to actually have it come back and retry after several minutes. *In theory*, we could get this even with SRQ - if the *HCA* starts running out of RC QPs - it just never happens in practice, as current HCAs support #QPs larger than the maximum IB subnet size. So I might post a patch to implement this, stay tuned. > + spin_lock_irqsave(&priv->lock, flags); > + rx_ptr = priv->cm.rx_index_ring[index]; > + spin_unlock_irqrestore(&priv->lock, flags); 3. You never actually test the rx_ptr that you got. So why does locking help? A better way to destroy QPs might be to move them to the error state first. We actually need something like this for CM too - stay tuned for a patch. I also commented on some style issues below. > Note 1: I have retained the code to avoid IB_WC_RETRY_EXC_ERR while performing > interoperability tests As discussed in this mailing list that may be a CM bug or > have the various HCA address it. Hence I would like to separate out that issue > from this patch. > At a future point when the issue gets resolved I can provide > another patch to change the retry_count values back to 0 if need be. The correct way to separate it, in my opinion, is to set retry_count = 0, and (for now) apply a work-around patch at your site before testing. We really don't want to paper over this bug, in my opinion. A general suggestion, before we dive into code: document, first of all, data structures, then functions. The rest of the code can quite often be made self-documenting.
Stuff like if (!srq) /* no SRQ */, and } /* end of loop */ is really not telling us anything useful. > --- a/linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-04-24 18:10:17.000000000 -0700 > +++ b//linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-04-25 10:11:34.000000000 -0700 > @@ -99,6 +99,12 @@ enum { > #define IPOIB_OP_RECV (1ul << 31) > #ifdef CONFIG_INFINIBAND_IPOIB_CM > #define IPOIB_CM_OP_SRQ (1ul << 30) > +#define IPOIB_CM_OP_NOSRQ (1ul << 29) > + > +/* These two go hand in hand */ > +#define NOSRQ_INDEX_RING_SIZE 1024 > +#define NOSRQ_INDEX_MASK 0x00000000000003ff > + When you need a comment for 2 lines of code is when you know something's really obscure. How about #define NOSRQ_INDEX_MASK (NOSRQ_INDEX_RING_SIZE - 1) and we can kill the comment? > #else > #define IPOIB_CM_OP_SRQ (0) > #endif > @@ -136,9 +142,11 @@ struct ipoib_cm_data { > struct ipoib_cm_rx { > struct ib_cm_id *id; > struct ib_qp *qp; > + struct ipoib_cm_rx_buf *rx_ring; Alignment's broken here. > struct list_head list; > struct net_device *dev; > unsigned long jiffies; > + u32 index; index and rx_ring are only valid for non-srq code, right? I think we need a comment of some kind to tell us this. > }; > > struct ipoib_cm_tx { > @@ -177,6 +185,7 @@ struct ipoib_cm_dev_priv { > struct ib_wc ibwc[IPOIB_NUM_WC]; > struct ib_sge rx_sge[IPOIB_CM_RX_SG]; > struct ib_recv_wr rx_wr; > + struct ipoib_cm_rx **rx_index_ring; > }; > > /* Isn't "ring" a bit of a misnomer? Also - you have multiple QPs mapped to a single CQ - how do you prevent CQ overrun? 
> --- a/linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-04-24 18:10:17.000000000 -0700 > +++ b//linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-04-27 14:03:40.000000000 -0700 > @@ -76,7 +76,7 @@ static void ipoib_cm_dma_unmap_rx(struct > ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); > } > > -static int ipoib_cm_post_receive(struct net_device *dev, int id) > +static int post_receive_srq(struct net_device *dev, u64 id) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct ib_recv_wr *bad_wr; > @@ -85,13 +85,14 @@ static int ipoib_cm_post_receive(struct > priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; > > for (i = 0; i < IPOIB_CM_RX_SG; ++i) > - priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; > + priv->cm.rx_sge[i].addr = > + priv->cm.srq_ring[id].mapping[i]; The line wasn't too long here, so why wrap it? And continuation lines need to be shifted *significantly* to the right. > ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); > if (unlikely(ret)) { > ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); > ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, > - priv->cm.srq_ring[id].mapping); > + priv->cm.srq_ring[id].mapping); what's the deal here? > dev_kfree_skb_any(priv->cm.srq_ring[id].skb); > priv->cm.srq_ring[id].skb = NULL; > } > @@ -99,12 +100,69 @@ static int ipoib_cm_post_receive(struct > return ret; > } > > -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, > +static int post_receive_nosrq(struct net_device *dev, u64 id) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct ib_recv_wr *bad_wr; > + int i, ret; > + u32 index; > + u64 wr_id; > + struct ipoib_cm_rx *rx_ptr; > + unsigned long flags; > + > + index = id & NOSRQ_INDEX_MASK ; > + wr_id = id >> 32; So wr_id has always, ever, 32 lower bits set - why make it u64 then? 
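For readers following the id layout under review here: the patch packs two values into a single 64-bit work-request id, with the low bits selecting a connection slot (id & NOSRQ_INDEX_MASK) and the upper 32 bits selecting a receive buffer within that connection (id >> 32). The following is only a minimal user-space sketch of that packing, not code from the patch. It assumes the mask is derived from the ring size as (NOSRQ_INDEX_RING_SIZE - 1), which requires a power-of-two ring size; pack_id(), id_index() and id_slot() are hypothetical helper names; and widening to 64 bits before the shift is this sketch's own choice, since shifting a 32-bit operand by 32 is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the review; the mask is derived rather than hard-coded. */
#define NOSRQ_INDEX_RING_SIZE 1024
#define NOSRQ_INDEX_MASK (NOSRQ_INDEX_RING_SIZE - 1)

/* The derivation above is only valid for a power-of-two ring size. */
_Static_assert((NOSRQ_INDEX_RING_SIZE & NOSRQ_INDEX_MASK) == 0,
	       "ring size must be a power of two");

/* Hypothetical helper: slot goes in the upper 32 bits, index in the low bits. */
static inline uint64_t pack_id(uint32_t slot, uint32_t index)
{
	assert(index < NOSRQ_INDEX_RING_SIZE);
	/* Widen first so the shift happens on a 64-bit value. */
	return ((uint64_t)slot << 32) | index;
}

/* Hypothetical helper: recover the connection index from the low bits. */
static inline uint32_t id_index(uint64_t id)
{
	return (uint32_t)(id & NOSRQ_INDEX_MASK);
}

/* Hypothetical helper: recover the buffer slot; it always fits in 32 bits. */
static inline uint32_t id_slot(uint64_t id)
{
	return (uint32_t)(id >> 32);
}
```

With these helpers, pack_id(5, 37) yields 0x500000025, from which id_index() returns 37 and id_slot() returns 5. The sketch leaves bits 10 to 31 free, which is where the IPOIB_OP_RECV and IPOIB_CM_OP_* flag bits quoted earlier sit.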
> + /* There is a slender chance of a race between the stale_task > + * running after a period of inactivity and the receipt of > + * a packet being processed at about the same instant. > + * Hence the lock */ I think you can get rid of this, by changing the stale task code: move QP to error, and wait for WRs posted to complete. Then there won't be any more completions for this QP. As it is, I'm not convinced you can't get a completion after QP has been removed out of the array - so it seems the race hasn't been solved here? We actually need something like this for CM too - stay tuned for a patch. > + spin_lock_irqsave(&priv->lock, flags); > + rx_ptr = priv->cm.rx_index_ring[index]; > + spin_unlock_irqrestore(&priv->lock, flags); > + > + priv->cm.rx_wr.wr_id = wr_id << 32 | index | IPOIB_CM_OP_NOSRQ; Isn't this just id, again? > + for (i = 0; i < IPOIB_CM_RX_SG; ++i) > + priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i]; > + > + ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr); > + if (unlikely(ret)) { > + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", > + wr_id, ret); > + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, > + rx_ptr->rx_ring[wr_id].mapping); > + dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb); > + rx_ptr->rx_ring[wr_id].skb = NULL; > + } > + > + return ret; > +} > + > +static int ipoib_cm_post_receive(struct net_device *dev, u64 id) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + int ret; > + > + if (priv->cm.srq) > + ret = post_receive_srq(dev, id); > + else > + ret = post_receive_nosrq(dev, id); > + > + return ret; > +} I think you can split this one now that srq/nonsrq completions are handled separately. 
> +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, > + int frags, > u64 mapping[IPOIB_CM_RX_SG]) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct sk_buff *skb; > int i; > + struct ipoib_cm_rx *rx_ptr; > + u32 index, wr_id; > + unsigned long flags; > > skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); > if (unlikely(!skb)) > @@ -123,7 +181,7 @@ static struct sk_buff *ipoib_cm_alloc_rx > return NULL; > } > > - for (i = 0; i < frags; i++) { > + for (i = 0; i < frags; i++) { what's the deal here? > struct page *page = alloc_page(GFP_ATOMIC); > > if (!page) > @@ -136,7 +194,17 @@ static struct sk_buff *ipoib_cm_alloc_rx > goto partial_error; > } > > - priv->cm.srq_ring[id].skb = skb; > + if (priv->cm.srq) > + priv->cm.srq_ring[id].skb = skb; > + else { > + index = id & NOSRQ_INDEX_MASK ; > + wr_id = id >> 32; > + spin_lock_irqsave(&priv->lock, flags); > + rx_ptr = priv->cm.rx_index_ring[index]; > + spin_unlock_irqrestore(&priv->lock, flags); See above about the locking here. Try to get rid of it - this is datapath. > + > + rx_ptr->rx_ring[wr_id].skb = skb; > + } > return skb; > > partial_error: A branch on datapath just for 2 lines that are different is not worth it. Just keep common code in ipoib_cm_alloc_rx, and move lines that differ to the site of call. > @@ -157,13 +225,20 @@ static struct ib_qp *ipoib_cm_create_rx_ > struct ib_qp_init_attr attr = { > .send_cq = priv->cq, /* does not matter, we never send anything */ > .recv_cq = priv->cq, > - .srq = priv->cm.srq, > .cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */ > + .cap.max_recv_wr = ipoib_recvq_size + 1, > .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ > + .cap.max_recv_sge = IPOIB_CM_RX_SG, /* Is this correct? */ I don't think we should set both attr.srq and max_recv_sge for a QP connected to SRQ. 
> .sq_sig_type = IB_SIGNAL_ALL_WR, > .qp_type = IB_QPT_RC, > .qp_context = p, > }; > + > + if (priv->cm.srq) > + attr.srq = priv->cm.srq; > + else > + attr.srq = NULL; Since attr has an initializer, attr.srq is already 0 unless you set it. > + > return ib_create_qp(priv->pd, &attr); > } > > @@ -198,6 +273,7 @@ static int ipoib_cm_modify_rx_qp(struct > ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret); > return ret; > } > + Kill this. > return 0; > } > > @@ -217,12 +293,87 @@ static int ipoib_cm_send_rep(struct net_ > rep.flow_control = 0; > rep.rnr_retry_count = req->rnr_retry_count; > rep.target_ack_delay = 20; /* FIXME */ > - rep.srq = 1; > rep.qp_num = qp->qp_num; > rep.starting_psn = psn; > + > + if (priv->cm.srq) > + rep.srq = 1; > + else > + rep.srq = 0; > return ib_send_cm_rep(cm_id, &rep); > } > > +int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id, struct ipoib_cm_rx *p, unsigned psn) This one's too long I think. > +{ > + struct net_device *dev = cm_id->context; > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + int ret; > + u32 qp_num, index; > + u64 i; > + > + qp_num = p->qp->qp_num; > + /* Allocate space for the rx_ring here */ You mostly want to kill such comments - they take up code lines and don't really tell anything useful. > + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, > + GFP_KERNEL); > + if (p->rx_ring == NULL) > + return -ENOMEM; > + > + cm_id->context = p; > + p->jiffies = jiffies; > + spin_lock_irq(&priv->lock); > + list_add(&p->list, &priv->cm.passive_ids); > + > + /* Find an empty rx_index_ring[] entry */ And this. > + for (index = 0; index < NOSRQ_INDEX_RING_SIZE; index++) > + if (priv->cm.rx_index_ring[index] == NULL) > + break; No == NULL tests please. > + > + if ( index == NOSRQ_INDEX_RING_SIZE) { > + spin_unlock_irq(&priv->lock); > + printk(KERN_WARNING "NOSRQ supports a max of %d RC " > + "QPs. 
That limit has now been reached\n", > + NOSRQ_INDEX_RING_SIZE); > + return -EINVAL; > + } So, when QP limit has been reached, connectivity is denied where it previously worked fine in datagram mode? This looks like an important regression. > + /* Store the pointer to retrieve it later using the index */ Kill this too. > + priv->cm.rx_index_ring[index] = p; > + spin_unlock_irq(&priv->lock); > + p->index = index; > + > + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); > + if (ret) { > + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret); > + goto err_modify_nosrq; > + } > + > + for (i = 0; i < ipoib_recvq_size; ++i) { > + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, > + IPOIB_CM_RX_SG - 1, > + p->rx_ring[i].mapping)) { > + ipoib_warn(priv, "failed to allocate receive " > + "buffer %d\n", i); > + ipoib_cm_dev_cleanup(dev); > + /* Free rx_ring previously allocated */ And this. > + kfree(p->rx_ring); > + return -ENOMEM; > + } > + > + /* Can we call the nosrq version? */ what's the deal here? > + if (ipoib_cm_post_receive(dev, i << 32 | index)) { > + ipoib_warn(priv, "ipoib_ib_post_receive " > + "failed for buf %d\n", i); > + ipoib_cm_dev_cleanup(dev); > + return -EIO; > + } > + } /* end for */ And surely this. > + > + return 0; > + > +err_modify_nosrq: > + return ret; > +} > + > static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) > { > struct net_device *dev = cm_id->context; > @@ -243,10 +394,17 @@ static int ipoib_cm_req_handler(struct i > goto err_qp; > } > > - psn = random32() & 0xffffff; > - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); > - if (ret) > - goto err_modify; > + if (priv->cm.srq == NULL) { /* NOSRQ */ No == NULL tests please. Also - what does the comment tell us that we don't already know? > + psn = random32() & 0xffffff; random call could be in common code? 
> + if (ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn)) > + goto err_modify; > + } else { /* SRQ */ What does the comment tell us that we don't already know? > + p->rx_ring = NULL; /* This is used only by NOSRQ */ > + psn = random32() & 0xffffff; > + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); > + if (ret) > + goto err_modify; > + } > > ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); > if (ret) { > @@ -254,11 +412,13 @@ static int ipoib_cm_req_handler(struct i > goto err_rep; > } > > - cm_id->context = p; > - p->jiffies = jiffies; > - spin_lock_irq(&priv->lock); > - list_add(&p->list, &priv->cm.passive_ids); > - spin_unlock_irq(&priv->lock); > + if (priv->cm.srq) { > + cm_id->context = p; > + p->jiffies = jiffies; > + spin_lock_irq(&priv->lock); > + list_add(&p->list, &priv->cm.passive_ids); > + spin_unlock_irq(&priv->lock); > + } > queue_delayed_work(ipoib_workqueue, > &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > return 0; > @@ -339,23 +499,40 @@ static void skb_put_frags(struct sk_buff > } > } > > -void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) > +static void timer_check(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) > +{ > + unsigned long flags; > + > + if (time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { Now that it's a separate function, we can if (time_before(....)) return; > + spin_lock_irqsave(&priv->lock, flags); > + p->jiffies = jiffies; > + /* Move this entry to list head, but do > + * not re-add it if it has been removed. */ > + if (!list_empty(&p->list)) > + list_move(&p->list, &priv->cm.passive_ids); > + spin_unlock_irqrestore(&priv->lock, flags); > + queue_delayed_work(ipoib_workqueue, > + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > + } > +} > +static int handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc) Why is making this an int a good idea? You aren't doing anything useful with this down the line. 
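The early-return form suggested above for timer_check can be sketched in userspace as follows. ex_time_before() is a stand-in for the kernel's time_before() macro, and struct ex_rx and ex_timer_check() are hypothetical simplifications: the real function also takes priv->lock, moves the entry on the passive_ids list and requeues the stale task.

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel's time_before(): the signed
 * difference keeps the comparison correct across counter wraparound. */
static bool ex_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Hypothetical reduction of struct ipoib_cm_rx to what the sketch needs. */
struct ex_rx {
	unsigned long jiffies;	/* time of the last refresh */
	unsigned updates;	/* counts refreshes, for demonstration */
};

/* Guard-clause form: return early while the entry is still fresh,
 * so the update path is not nested inside an if block. */
static void ex_timer_check(struct ex_rx *p, unsigned long now,
			   unsigned long update_time)
{
	if (ex_time_before(now, p->jiffies + update_time))
		return;

	p->jiffies = now;	/* the real code does this under priv->lock */
	p->updates++;
}
```

The condition is the exact negation of the original time_after_eq() test, so behaviour is unchanged; only the nesting goes away.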
> { > struct ipoib_dev_priv *priv = netdev_priv(dev); > - unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; > struct sk_buff *skb, *newskb; > + u64 mapping[IPOIB_CM_RX_SG], wr_id; > struct ipoib_cm_rx *p; > unsigned long flags; > - u64 mapping[IPOIB_CM_RX_SG]; > - int frags; > + int frags, ret; > + > + wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; I like initializing the variable at the declaration site. If you want to change the style, maybe make it a separate patch? > ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", > wr_id, wc->status); > > if (unlikely(wr_id >= ipoib_recvq_size)) { > - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", > - wr_id, ipoib_recvq_size); > - return; > + ipoib_warn(priv, "cm recv completion event with wrid %d " > + "(> %d)\n", wr_id, ipoib_recvq_size); The line wasn't too long before, so why wrap it? > + return 1; > } > > skb = priv->cm.srq_ring[wr_id].skb; > @@ -365,22 +542,12 @@ void ipoib_cm_handle_rx_wc(struct net_de > "(status=%d, wrid=%d vend_err %x)\n", > wc->status, wr_id, wc->vendor_err); > ++priv->stats.rx_dropped; > - goto repost; > + goto repost_srq; > } > > if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { > p = wc->qp->qp_context; > - if (time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { > - spin_lock_irqsave(&priv->lock, flags); > - p->jiffies = jiffies; > - /* Move this entry to list head, but do > - * not re-add it if it has been removed. */ > - if (!list_empty(&p->list)) > - list_move(&p->list, &priv->cm.passive_ids); > - spin_unlock_irqrestore(&priv->lock, flags); > - queue_delayed_work(ipoib_workqueue, > - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > - } > + timer_check(priv, p); > } > > frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, > @@ -388,22 +555,119 @@ void ipoib_cm_handle_rx_wc(struct net_de > > newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping); > if (unlikely(!newskb)) { > - /* > - * If we can't allocate a new RX buffer, dump > - * this packet and reuse the old buffer.
> - */ > - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); > + /* > + * If we can't allocate a new RX buffer, dump > + * this packet and reuse the old buffer. > + */ > + ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); > + ++priv->stats.rx_dropped; > + goto repost_srq; > + } > + > + ipoib_cm_dma_unmap_rx(priv, frags, > + priv->cm.srq_ring[wr_id].mapping); > + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, > + (frags + 1) * sizeof *mapping); > + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", > + wc->byte_len, wc->slid); > + > + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); > + > + skb->protocol = ((struct ipoib_header *) skb->data)->proto; > + skb->mac.raw = skb->data; > + skb_pull(skb, IPOIB_ENCAP_LEN); > + > + dev->last_rx = jiffies; > + ++priv->stats.rx_packets; > + priv->stats.rx_bytes += skb->len; > + > + skb->dev = dev; > + /* XXX get correct PACKET_ type here */ > + skb->pkt_type = PACKET_HOST; > + > + netif_rx_ni(skb); > + > +repost_srq: Labels don't need to be unique cross-function. 
So you can call this one repost: > + ret = ipoib_cm_post_receive(dev, wr_id); > + > + if (unlikely(ret)) { > + ipoib_warn(priv, "ipoib_cm_post_receive failed for buf %d\n", > + wr_id); > + return 1; > + } > + > + return 0; > + > +} > + > +static int handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct sk_buff *skb, *newskb; > + u64 mapping[IPOIB_CM_RX_SG], wr_id; > + u32 index; > + struct ipoib_cm_rx *p, *rx_ptr; > + unsigned long flags; > + int frags, ret; > + > + > + wr_id = wc->wr_id >> 32; > + > + ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", > + wr_id, wc->status); > + > + if (unlikely(wr_id >= ipoib_recvq_size)) { > + ipoib_warn(priv, "cm recv completion event with wrid %d " > + "(> %d)\n", wr_id, ipoib_recvq_size); > + return 1; > + } > + > + index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ) & NOSRQ_INDEX_MASK ; > + spin_lock_irqsave(&priv->lock, flags); > + rx_ptr = priv->cm.rx_index_ring[index]; > + spin_unlock_irqrestore(&priv->lock, flags); > + > + skb = rx_ptr->rx_ring[wr_id].skb; > + > + if (unlikely(wc->status != IB_WC_SUCCESS)) { > + ipoib_dbg(priv, "cm recv error " > + "(status=%d, wrid=%d vend_err %x)\n", > + wc->status, wr_id, wc->vendor_err); > ++priv->stats.rx_dropped; > - goto repost; > + goto repost_nosrq; > } > > - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); > - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); > + if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { > + /* There are no guarantees that wc->qp is not NULL for HCAs > + * that do not support SRQ. 
*/ > + p = rx_ptr; > + timer_check(priv, p); > + } > + > + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, > + (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; > + > + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags, > + mapping); > + if (unlikely(!newskb)) { > + /* > + * If we can't allocate a new RX buffer, dump > + * this packet and reuse the old buffer. > + */ > + ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); > + ++priv->stats.rx_dropped; > + goto repost_nosrq; > + } > + > + ipoib_cm_dma_unmap_rx(priv, frags, > + rx_ptr->rx_ring[wr_id].mapping); > + memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping, > + (frags + 1) * sizeof *mapping); > > ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", > wc->byte_len, wc->slid); > > - skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); > + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); > > skb->protocol = ((struct ipoib_header *) skb->data)->proto; > skb->mac.raw = skb->data; > @@ -416,12 +680,34 @@ void ipoib_cm_handle_rx_wc(struct net_de > skb->dev = dev; > /* XXX get correct PACKET_ type here */ > skb->pkt_type = PACKET_HOST; > + > netif_rx_ni(skb); > > -repost: > - if (unlikely(ipoib_cm_post_receive(dev, wr_id))) > - ipoib_warn(priv, "ipoib_cm_post_receive failed " > - "for buf %d\n", wr_id); > +repost_nosrq: Labels don't need to be unique cross-function. So you can call this one repost: > + ret = ipoib_cm_post_receive(dev, wr_id << 32 | index); > + > + if (unlikely(ret)) { > + ipoib_warn(priv, "ipoib_cm_post_receive failed for buf %d\n", > + wr_id); > + return 1; > + } > + > + return 0; > +} > + > +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + int ret; > + > + > + if (priv->cm.srq) > + ret = handle_rx_wc_srq(dev, wc); > + else > + ret = handle_rx_wc_nosrq(dev, wc); > + > + if (unlikely(ret)) > + ipoib_warn(priv, "Error processing rx wc\n"); > } See below about this. 
> static inline int post_send(struct ipoib_dev_priv *priv, > @@ -606,6 +892,22 @@ int ipoib_cm_dev_open(struct net_device > return 0; > } > > +static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) > +{ I suggest you lose the _nosrq suffix and just do if (priv->cm.srq) return; at the top of the function. > + int i; > + > + for(i = 0; i < ipoib_recvq_size; ++i) > + if(p->rx_ring[i].skb) { > + ipoib_cm_dma_unmap_rx(priv, > + IPOIB_CM_RX_SG - 1, > + p->rx_ring[i].mapping); > + dev_kfree_skb_any(p->rx_ring[i].skb); > + p->rx_ring[i].skb = NULL; > + } > + kfree(p->rx_ring); > +} > + > + Lose the double empty lines. > void ipoib_cm_dev_stop(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > @@ -618,6 +920,8 @@ void ipoib_cm_dev_stop(struct net_device > spin_lock_irq(&priv->lock); > while (!list_empty(&priv->cm.passive_ids)) { > p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); > + if (priv->cm.srq == NULL) > + free_resources_nosrq(priv, p); No == NULL tests please. > list_del_init(&p->list); > spin_unlock_irq(&priv->lock); > ib_destroy_cm_id(p->id); > @@ -703,9 +1007,14 @@ static struct ib_qp *ipoib_cm_create_tx_ > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct ib_qp_init_attr attr = {}; > attr.recv_cq = priv->cq; > - attr.srq = priv->cm.srq; > + if (priv->cm.srq) > + attr.srq = priv->cm.srq; > + else > + attr.srq = NULL; > attr.cap.max_send_wr = ipoib_sendq_size; > + attr.cap.max_recv_wr = 1; /* Not in MST code */ > attr.cap.max_send_sge = 1; > + attr.cap.max_recv_sge = 1; /* Not in MST code */ I don't think we should set both attr.srq and max_recv_sge for a QP connected to SRQ.
> attr.sq_sig_type = IB_SIGNAL_ALL_WR; > attr.qp_type = IB_QPT_RC; > attr.send_cq = cq; > @@ -742,10 +1051,13 @@ static int ipoib_cm_send_req(struct net_ > req.responder_resources = 4; > req.remote_cm_response_timeout = 20; > req.local_cm_response_timeout = 20; > - req.retry_count = 0; /* RFC draft warns against retries */ > - req.rnr_retry_count = 0; /* RFC draft warns against retries */ > + req.retry_count = 6; /* RFC draft warns against retries */ > + req.rnr_retry_count = 6;/* RFC draft warns against retries */ > req.max_cm_retries = 15; > - req.srq = 1; > + if (priv->cm.srq) > + req.srq = 1; > + else > + req.srq = 0; > return ib_send_cm_req(id, &req); > } > > @@ -1089,6 +1401,10 @@ static void ipoib_cm_stale_task(struct w > p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); > if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) > break; > + if (priv->cm.srq == NULL) { /* NOSRQ */ No == NULL tests please. Also - what does the comment tell us? > + free_resources_nosrq(priv, p); > + priv->cm.rx_index_ring[p->index] = NULL; > + } > list_del_init(&p->list); > spin_unlock_irq(&priv->lock); > ib_destroy_cm_id(p->id); > @@ -1143,16 +1459,40 @@ int ipoib_cm_add_mode_attr(struct net_de > return device_create_file(&dev->dev, &dev_attr_mode); > } > > +static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv) > +{ > + struct ib_srq_init_attr srq_init_attr; > + int ret; > + > + srq_init_attr.attr.max_wr = ipoib_recvq_size; > + srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG; > + > + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); > + if (IS_ERR(priv->cm.srq)) { > + ret = PTR_ERR(priv->cm.srq); > + priv->cm.srq = NULL; > + return ret; > + } > + > + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * > + sizeof *priv->cm.srq_ring, > + GFP_KERNEL); > + if (!priv->cm.srq_ring) { > + printk(KERN_WARNING "%s: failed to allocate CM ring " > + "(%d entries)\n", > + priv->ca->name, ipoib_recvq_size); > + ipoib_cm_dev_cleanup(dev); Since you have 
separated create_srq from cm_dev_init, calling ipoib_cm_dev_cleanup from it looks wrong. > + return -ENOMEM; > + } > + > + return 0; > +} > + > int ipoib_cm_dev_init(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > - struct ib_srq_init_attr srq_init_attr = { > - .attr = { > - .max_wr = ipoib_recvq_size, > - .max_sge = IPOIB_CM_RX_SG > - } > - }; > - int ret, i; > + int ret, i, supports_srq; > + struct ib_device_attr attr; > > INIT_LIST_HEAD(&priv->cm.passive_ids); > INIT_LIST_HEAD(&priv->cm.reap_list); > @@ -1164,21 +1504,26 @@ int ipoib_cm_dev_init(struct net_device > > skb_queue_head_init(&priv->cm.skb_queue); > > - priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); > - if (IS_ERR(priv->cm.srq)) { > - ret = PTR_ERR(priv->cm.srq); > - priv->cm.srq = NULL; > + if (ret = ib_query_device(priv->ca, &attr)) > return ret; I think a cleaner way would be to just test device->create_srq. > + if (attr.max_srq) > + supports_srq = 1; /* This device supports SRQ */ > + else { > + supports_srq = 0; > } > > - priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, > - GFP_KERNEL); > - if (!priv->cm.srq_ring) { > - printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", > - priv->ca->name, ipoib_recvq_size); > - ipoib_cm_dev_cleanup(dev); > - return -ENOMEM; > - } > + if (supports_srq) { > + if (ret = create_srq(dev, priv)) > + return ret; > + > + priv->cm.rx_index_ring = NULL; /* Not needed for SRQ */ > + } else { > + priv->cm.srq = NULL; > + priv->cm.srq_ring = NULL; > + priv->cm.rx_index_ring = kzalloc(NOSRQ_INDEX_RING_SIZE * > + sizeof *priv->cm.rx_index_ring, > + GFP_KERNEL); > + } > > for (i = 0; i < IPOIB_CM_RX_SG; ++i) > priv->cm.rx_sge[i].lkey = priv->mr->lkey; do we really need supports_srq variable? It's only used once ... 
> @@ -1190,19 +1535,25 @@ int ipoib_cm_dev_init(struct net_device > priv->cm.rx_wr.sg_list = priv->cm.rx_sge; > priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; > > - for (i = 0; i < ipoib_recvq_size; ++i) { > - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, > + /* One can post receive buffers even before the RX QP is created > + * only in the SRQ case. Therefore for NOSRQ we skip the rest of init > + * and do that in ipoib_cm_req_handler() */ > + > + if (priv->cm.srq) { > + for (i = 0; i < ipoib_recvq_size; ++i) { > + if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, > priv->cm.srq_ring[i].mapping)) { > - ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); > - ipoib_cm_dev_cleanup(dev); > - return -ENOMEM; > - } > - if (ipoib_cm_post_receive(dev, i)) { > - ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); > - ipoib_cm_dev_cleanup(dev); > - return -EIO; > + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); > + ipoib_cm_dev_cleanup(dev); > + return -ENOMEM; > + } > + if (ipoib_cm_post_receive(dev, i)) { > + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); > + ipoib_cm_dev_cleanup(dev); > + return -EIO; > + } > } > - } > + } /* if SRQ */ > > priv->dev->dev_addr[0] = IPOIB_FLAGS_RC; > return 0; When you start adding /* if SRQ */ comments near the closing bracket, is where you know your nesting is too deep. 
How about if (!priv->cm.srq) goto done; > --- a/linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-04-24 18:10:17.000000000 -0700 > +++ b//linux-2.6.21-rc7/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-04-25 10:11:34.000000000 -0700 > @@ -282,7 +282,7 @@ static void ipoib_ib_handle_tx_wc(struct > > static void ipoib_ib_handle_wc(struct net_device *dev, struct ib_wc *wc) > { > - if (wc->wr_id & IPOIB_CM_OP_SRQ) > + if ((wc->wr_id & IPOIB_CM_OP_SRQ) || (wc->wr_id & IPOIB_CM_OP_NOSRQ)) > ipoib_cm_handle_rx_wc(dev, wc); > else if (wc->wr_id & IPOIB_OP_RECV) > ipoib_ib_handle_rx_wc(dev, wc); So you have a branch on IPOIB_CM_OP_NOSRQ here, and you have a branch on priv->srq down the line. What I suggest instead, is split ipoib_ib_completion to SRQ/non-SRQ variants, which will completely avoid extra branch cost at runtime. -- MST From mst at dev.mellanox.co.il Wed May 2 00:08:49 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 May 2007 10:08:49 +0300 Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to use send with invalidate In-Reply-To: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> Message-ID: <20070502070849.GO8447@mellanox.co.il> > - if (ib_wr->send_flags & IB_SEND_SOLICITED) { > + if (ib_wr->send_flags & IB_SEND_SOLICITED > + && ib_wr->send_flags & IB_SEND_INVALIDATE) { How about if (ib_wr->send_flags & (IB_SEND_SOLICITED | IB_SEND_INVALIDATE)) -- MST From mst at dev.mellanox.co.il Wed May 2 00:15:21 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 2 May 2007 10:15:21 +0300 Subject: [ofa-general] Re: Requesting CQ notifications In-Reply-To: <46380448.1020401@evergrid.com> References: <462FD3F7.1010304@evergrid.com> <20070502025055.GM8447@mellanox.co.il> <46380448.1020401@evergrid.com> Message-ID: <20070502071521.GR8447@mellanox.co.il> > Quoting Mike Heffner : > Subject: Re: Requesting CQ notifications > > Michael S. Tsirkin wrote: > >>Quoting Roland Dreier : > >>Subject: Re: Requesting CQ notifications > > > > >This is not exact. mthca/mlx4 will generate an event immediately > >only for unpolled CQE *that was not present in CQ at the > >time the previous event was generated*. > >So the answer for mthca is yes only if the CQE arrived > >between calls to select and ibv_req_notify_cq. > > > > Is there any method by which you can query the total number of CQEs in > the CQ at an instantaneous point in time (ie., after you had called > ibv_req_notify_cq() to get notification of *new* CQEs)? Not really - what are you trying to do? 
-- MST From vlad at lists.openfabrics.org Wed May 2 02:37:40 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 2 May 2007 02:37:40 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070502-0200 daily build status Message-ID: <20070502093740.53DEDE6089D@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.13 Passed on ppc64 
with linux-2.6.12 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From mst at dev.mellanox.co.il Wed May 2 05:31:12 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 May 2007 15:31:12 +0300 Subject: [ofa-general] [PATCH] ipoib/cm: compliance fix Message-ID: <20070502123112.GI22292@mellanox.co.il> IPoIB CM spec allows the use of a single connection in both active->passive and passive->active directions. Current code does not do this, but if the remote ever tries to, we oops when we try to look up the passive connection. Fix by checking qp_context before use. Signed-off-by: Michael S. Tsirkin --- I noticed this bug while experimenting with changes to IPoIB/CM code. Important enough for -stable?
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 0c4e59b..1778fd6 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -370,7 +370,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { p = wc->qp->qp_context; - if (time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { spin_lock_irqsave(&priv->lock, flags); p->jiffies = jiffies; /* Move this entry to list head, but do -- MST From dotanb at dev.mellanox.co.il Wed May 2 06:08:48 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 02 May 2007 16:08:48 +0300 Subject: [ofa-general] We are seeing SYNDROME_LOCAL_PROT_ERR status on CQE with Mellanox Arbel HCA in memfree mode In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A016398AA@NAMAIL2.ad.lsil.com> References: <01B9E81EECACE94DBBD0A556E768FB8A016398AA@NAMAIL2.ad.lsil.com> Message-ID: <46388D60.5010803@dev.mellanox.co.il> Batwara, Ashish wrote: > Any idea why this error? We see this error when we use FMR? Are there > any special setting that HCA needs to work with FMR? > Check the WR of this completion: do you have any violation in the scatter/gather element? 
thanks Dotan From steffen.persvold at scali.com Wed May 2 07:00:14 2007 From: steffen.persvold at scali.com (Steffen Persvold) Date: Wed, 2 May 2007 10:00:14 -0400 Subject: [ofa-general] OFED 1.2 RC2 on rhel4u4 x86_64 Message-ID: Folks, I used the build.sh script to build the above mentioned packages on rhel4u4 x86_64, but for some reason it only compiles 32bit libraries (even if the packages are named x86_64) : # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm x86_64 (after installing it) : # file /usr/lib/libibverbs.so.1.0.0 /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), not stripped What did I do wrong ?? Cheers, Steffen Persvold Technical Director Americas tel. 508-281-7100 x401 fax. 508-281-7171 http://www.scali.com/ Scaling the Linux datacenter -------------- next part -------------- An HTML attachment was scrubbed...
URL: From vlad at dev.mellanox.co.il Wed May 2 07:05:28 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 02 May 2007 17:05:28 +0300 Subject: [ofa-general] Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 In-Reply-To: References: Message-ID: <1178114728.14131.30.camel@vladsk-laptop> Don't you have /usr/lib64/libibverbs.so.1.0.0? Regards, Vladimir On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote: > Folks, > > I used the build.sh script to build the above mentioned packages on > rhel4u4 x86_64, but for some reason it only compiles 32bit libraries > (even if the packages are named x86_64) : > > # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm > x86_64 > > (after installing it) : > > # file /usr/lib/libibverbs.so.1.0.0 > /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel > 80386, version 1 (SYSV), not stripped > > What did I do wrong ?? > > Cheers, > Steffen Persvold > Technical Director Americas > tel. 508-281-7100 x401 > fax. 508-281-7171 > > http://www.scali.com/ > Scaling the Linux datacenter > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From jackm at dev.mellanox.co.il Wed May 2 07:12:24 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 2 May 2007 17:12:24 +0300 Subject: [ofa-general] [PATCH] libmlx4: fix post inline when posting a list Message-ID: <200705021712.24400.jackm@dev.mellanox.co.il> Need to set inl parameter to zero for each inline post (when posting a wr-list of inlines -- so that the value of inl reflects that specific work request, and is not cumulative. 
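The effect of resetting inl per work request, as described above, can be illustrated with a small userspace sketch. struct ex_sge, struct ex_wr and ex_inline_lengths() are hypothetical simplifications, not the libibverbs or libmlx4 structures; the sketch only shows why the counter has to start from zero for every request in the list.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for illustration; not the libibverbs structures. */
struct ex_sge {
	uint32_t length;
};

struct ex_wr {
	struct ex_wr *next;
	const struct ex_sge *sg_list;
	int num_sge;
};

/* Record the inline byte count of each work request in out[].
 * Returns the number of requests visited (at most max). */
static int ex_inline_lengths(const struct ex_wr *wr, uint32_t *out, int max)
{
	uint32_t inl;
	int n = 0, i;

	for (; wr && n < max; wr = wr->next) {
		inl = 0;	/* reset per request, otherwise lengths accumulate */
		for (i = 0; i < wr->num_sge; ++i)
			inl += wr->sg_list[i].length;
		out[n++] = inl;
	}

	return n;
}
```

Without the reset, a second request of 5 bytes following a 30-byte request would be recorded as 35 bytes in this sketch.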
Signed-off-by: Jack Morgenstein diff --git a/src/userspace/libmlx4/src/qp.c b/src/userspace/libmlx4/src/qp.c index 76abf75..a70e5f2 100644 --- a/src/userspace/libmlx4/src/qp.c +++ b/src/userspace/libmlx4/src/qp.c @@ -217,6 +217,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, if (wr->num_sge) { struct mlx4_wqe_inline_seg *seg = wqe; + inl = 0; wqe += sizeof *seg; for (i = 0; i < wr->num_sge; ++i) { uint32_t len = wr->sg_list[i].length; From jackm at dev.mellanox.co.il Wed May 2 07:14:05 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 2 May 2007 17:14:05 +0300 Subject: [ofa-general] [patch] mlx4_ib: return proper num s/g entries for rq at create_qp Message-ID: <200705021714.05933.jackm@dev.mellanox.co.il> Fix number of scatter-gather entries returned for receive queue at qp creation. Signed-off-by: Jack Morgenstein diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 53aedfb..33db96c 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -236,7 +236,7 @@ static int set_qp_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_wr = qp->sq.max; cap->max_recv_wr = qp->rq.max; cap->max_send_sge = qp->sq.max_gs; - cap->max_recv_sge = qp->sq.max_gs; + cap->max_recv_sge = qp->rq.max_gs; cap->max_inline_data = (1 << qp->sq.wqe_shift) - send_wqe_overhead(type) - sizeof (struct mlx4_wqe_inline_seg); From steffen.persvold at scali.com Wed May 2 07:20:26 2007 From: steffen.persvold at scali.com (Steffen Persvold) Date: Wed, 2 May 2007 10:20:26 -0400 Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 References: <1178114728.14131.30.camel@vladsk-laptop> Message-ID: Nope : [redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm /etc/ld.so.conf.d/ofed.conf /usr/lib/libibverbs.so.1 /usr/lib/libibverbs.so.1.0.0 [redhat-release-4ES-5.5]# So the RPM got built, but without 64bit libraries. 
Now if it was the other way around (i.e no 32bit libraries) I could have understood it (as 32bit is an option on x86_64), but not having the native 64bit libraries is not so easy to understand :) cheers, Steffen Persvold Technical Director Americas tel. 508-281-7100 x401 fax. 508-281-7171 http://www.scali.com/ Scaling the Linux datacenter
From steffen.persvold at scali.com Wed May 2 07:30:56 2007 From: steffen.persvold at scali.com (Steffen Persvold) Date: Wed, 2 May 2007 10:30:56 -0400 Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 References: <1178114728.14131.30.camel@vladsk-laptop> Message-ID: Also, If I look at the /etc/ld.so.conf/ofed.conf file I have : # cat ofed.conf /usr/lib /usr/lib which seems kinda weird ? :) Cheers, Steffen Persvold Technical Director Americas tel. 508-281-7100 x401 fax. 508-281-7171 http://www.scali.com/ Scaling the Linux datacenter
From swise at opengridcomputing.com Wed May 2 07:56:46 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 02 May 2007 09:56:46 -0500 Subject: [ofa-general] [PATCH librdmacm] rping: Transfer rkey/addr/len information in network byte order. In-Reply-To: <46380D09.5070906@ichips.intel.com> References: <1177515271.22094.33.camel@stevo-desktop> <1178060596.2309.195.camel@stevo-desktop> <46380D09.5070906@ichips.intel.com> Message-ID: <1178117806.18609.25.camel@stevo-desktop> On Tue, 2007-05-01 at 21:01 -0700, Sean Hefty wrote: > > This patch regresses rping. I failed to test it on AMD64<->AMD64 (ie > > like endian systems). I will provide another patch shortly, or we can > > undo the broken rping patch for -rc3. Whatever you think is best. > > Let's fix it. Please create a patch on top of this that fixes the problem. > > Thanks > > - Sean Here is the fix.
Tested with: ppc64 client, amd64 server ppc64 server, amd64 client amd64 client, amd64 server --- Fix regression introduced by 88fc0cb21698dfb5d7660eecf7dddd0531fc8021. From: Steve Wise - swizzle memory info when sending it to peer. - fixed printf format Signed-off-by: Steve Wise --- examples/rping.c | 10 +++++----- 1 files changed, 5 insertions(+), 5 deletions(-) diff --git a/examples/rping.c b/examples/rping.c index 17b0000..bccabb0 100644 --- a/examples/rping.c +++ b/examples/rping.c @@ -243,7 +243,7 @@ static int server_recv(struct rping_cb * cb->remote_rkey = ntohl(cb->recv_buf.rkey); cb->remote_addr = ntohll(cb->recv_buf.buf); cb->remote_len = ntohl(cb->recv_buf.size); - DEBUG_LOG("Received rkey %x addr %" PRIx64 "len %d from peer\n", + DEBUG_LOG("Received rkey %x addr %" PRIx64 " len %d from peer\n", cb->remote_rkey, cb->remote_addr, cb->remote_len); if (cb->state <= CONNECTED || cb->state == RDMA_WRITE_COMPLETE) @@ -614,12 +614,12 @@ static void rping_format_send(struct rpi { struct rping_rdma_info *info = &cb->send_buf; - info->buf = (uint64_t) (unsigned long) buf; - info->rkey = mr->rkey; - info->size = cb->size; + info->buf = htonll((uint64_t) (unsigned long) buf); + info->rkey = htonl(mr->rkey); + info->size = htonl(cb->size); DEBUG_LOG("RDMA addr %" PRIx64" rkey %x len %d\n", - info->buf, info->rkey, info->size); + ntohll(info->buf), ntohl(info->rkey), ntohl(info->size)); } static int rping_test_server(struct rping_cb *cb) From steffen.persvold at scali.com Wed May 2 08:30:44 2007 From: steffen.persvold at scali.com (Steffen Persvold) Date: Wed, 2 May 2007 11:30:44 -0400 Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 References: <1178114728.14131.30.camel@vladsk-laptop> Message-ID: Hmm, so I tried something. I put : build_32bit=0 into my ofed.conf file and rebuilt (build.sh -c ofed.conf). 
This time it built 64bit libraries, but it puts them in the wrong directory : # rpm -qpl ../libibverbs-1.1-0.x86_64.rpm /etc/ld.so.conf.d/ofed.conf /usr/lib/libibverbs.so.1 /usr/lib/libibverbs.so.1.0.0 # file /usr/lib/libibverbs.so.1.0.0 /usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped So what's up ?? Cheers, Steffen Persvold Technical Director Americas tel. 508-281-7100 x401 fax. 508-281-7171 http://www.scali.com/ Scaling the Linux datacenter
From amitk at mellanox.co.il Wed May 2 08:41:59 2007 From: amitk at mellanox.co.il (Amit Krig) Date: Wed, 2 May 2007 18:41:59 +0300 Subject: [ofa-general] RE: bugs 541 and 465: slow IPoIB CM HA failover and eventual failing IPoIB HA In-Reply-To: References: Message-ID: <6C2C79E72C305246B504CBA17B5500C9016761E8@mtlexch01.mtl.com> Thanks for the update. Yohad will reproduce this failure in our labs. Amit ________________________________ From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] Sent: Wednesday, May 02, 2007 5:08 AM To: ewg at lists.openfabrics.org; Scott Weitzenkamp (sweitzen); Tziporet Koren; Amit Krig; Michael S. Tsirkin; Roland Dreier (rdreier); Moni Shoua; Moni Levy Cc: openib Subject: bugs 541 and 465: slow IPoIB CM HA failover and eventual failing IPoIB HA With IPoIB HA (both ipoibtools and ib-bonding), I am seeing slow IPoIB CM HA failover, and eventually IPoIB stops working after enough failovers. I am running netperf -D traffic between two IPoIB HA hosts, while flipping the 4 host IB ports one at a time (port 1 down, sleep, port 1 up, sleep, ..., port 4 down, sleep, port 4 up, sleep) in a loop. This is a very easy test to set up. Can Mellanox and Voltaire please try to reproduce the problem? I think this problem must be fixed for OFED 1.2 rc3. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems From yosefe at voltaire.com Wed May 2 08:54:26 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 02 May 2007 18:54:26 +0300 Subject: [ofa-general] [PATCH 0/3] pkey change handling Message-ID: <4638B432.3060801@voltaire.com> There are 3 patches in this series. The issue addressed is keeping ipoib interfaces alive when the order of a port's pkeys changes. Pkey-to-index queries were using a cache; however, the cache might not be up to date when ipoib asks it to resolve a pkey, so a direct query must be used instead.
On the other hand, in build_mlx_header the pkey lookup runs in atomic context and must not block. So the driver will keep its own pkey cache, which is non-blocking and is always updated before ipoib is notified of the event. In addition, remove the pkey delayed initialization thread; instead, start the interface on pkey change notification. 1: ipoib: handle pkey change notifications by restarting the QP, which revalidates the QP's pkey index in case the pkeys were shuffled. Remove the pkey polling thread and, upon pkey change events, bring up interfaces for which pkeys were not found. 2: core: remove the infiniband cache and replace it with blocking calls. Update its users. 3: mthca: put a pkey cache in the provider. Update the cache on pkey table SMPs and use it to answer pkey queries. Use the cache in build_mlx_header, which runs in atomic context. Signed-off-by: Yosef Etigin -- From yosefe at voltaire.com Wed May 2 08:56:23 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 02 May 2007 18:56:23 +0300 Subject: [ofa-general] [PATCH 1/3 v4] ipoib: restart interfaces on pkey change events In-Reply-To: <4638B432.3060801@voltaire.com> References: <4638B432.3060801@voltaire.com> Message-ID: <4638B4A7.6080803@voltaire.com> This issue was found during partitioning & SM failover testing. The fix was tested with pkey reshuffling, removal and addition every few seconds, concurrent with OFED restart. Changes from v1: * added a flush flag to ipoib_ib_dev_stop(), like ipoib_ib_dev_down() * fixed a bug in device extraction from the work struct * removed some warnings when they are caused by a missing PKEY, as this now seems like a valid flow. Changes from v2: * less/fixed debug prints - (MST remark) * flush_restart_qp stuff renamed to just restart_qp (MST remark) * the patch now depends on Roland's "IPoIB: Only handle async events for one port" Changes from v3: * We now reschedule that qp_restart_task in case the PKEY cache was not coherent.
Changes from v4: * We do not reschedule qp_restart_task, but assume that the cache is coherent * Do not restart QP if iface is not initialized, but do restart if not ADMIN_UP * Restart child interfaces first, so that if the parent is down the child is still restarted * Remove the pkey polling thread and pkey delayed initialization * If an interface is brought up but pkey is not found, mark it with IPOIB_PKEY_NEEDED and when a pkey event arrives, try to restart it SM reconfiguration or failover possibly causes a shuffling of the values in the port pkey table. The current implementation only queries for the index of the pkey once, when it creates the device QP and after that moves it into working state, and hence does not address this scenario. Fix this by using the PKEY_CHANGE event as a trigger to reconfigure the device QP. When a interface is brought up, it many Signed-off-by: Moni Levy Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 10 - drivers/infiniband/ulp/ipoib/ipoib_ib.c | 142 ++++++++----------------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 11 - drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 + drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 21 +-- 5 files changed, 74 insertions(+), 121 deletions(-) Index: b/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-02 17:48:05.276713741 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-02 17:48:30.149283427 +0300 @@ -80,7 +80,7 @@ enum { IPOIB_FLAG_INITIALIZED = 1, IPOIB_FLAG_ADMIN_UP = 2, IPOIB_PKEY_ASSIGNED = 3, - IPOIB_PKEY_STOP = 4, + IPOIB_PKEY_NEEDED = 4, IPOIB_FLAG_SUBINTERFACE = 5, IPOIB_MCAST_RUN = 6, IPOIB_STOP_REAPER = 7, @@ -202,9 +202,9 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; struct delayed_work mcast_task; struct work_struct flush_task; + struct work_struct pkey_task; struct work_struct
restart_task; struct delayed_work ah_reap_task; @@ -333,12 +333,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); @@ -384,9 +385,6 @@ void ipoib_event(struct ib_event_handler int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); -void ipoib_pkey_poll(struct work_struct *work); -int ipoib_pkey_dev_delay_open(struct net_device *dev); - #ifdef CONFIG_INFINIBAND_IPOIB_CM #define IPOIB_FLAGS_RC 0x80 Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-02 17:48:05.276713741 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-02 18:04:16.512553724 +0300 @@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device ret = ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -441,28 +441,10 @@ int ipoib_ib_dev_open(struct net_device return 0; } -static void ipoib_pkey_dev_check_presence(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - u16 pkey_index = 0; - - 
if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); - else - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); -} - int ipoib_ib_dev_up(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - ipoib_pkey_dev_check_presence(dev); - - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { - ipoib_dbg(priv, "PKEY is not assigned.\n"); - return 0; - } - set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); return ipoib_mcast_start_thread(dev); @@ -477,16 +459,6 @@ int ipoib_ib_dev_down(struct net_device clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); netif_carrier_off(dev); - /* Shutdown the P_Key thread if still active */ - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { - mutex_lock(&pkey_mutex); - set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); - mutex_unlock(&pkey_mutex); - if (flush) - flush_workqueue(ipoib_workqueue); - } - ipoib_mcast_stop_thread(dev, flush); ipoib_mcast_dev_flush(dev); @@ -508,7 +480,7 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +553,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,14 +595,31 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct *work) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { - ipoib_dbg(priv, "Not flushing - 
IPOIB_FLAG_INITIALIZED not set.\n"); + mutex_lock(&priv->vlan_mutex); + + /* Flush any child interfaces */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, restart_qp); + + mutex_unlock(&priv->vlan_mutex); + + /* + * If the device is not initiallized since it needs a pkey - + * try to reopen it + */ + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { + if (restart_qp && test_bit(IPOIB_PKEY_NEEDED, &priv->flags) + && test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { + /* if this iface needs pkey, try to assign it one */ + ipoib_open(priv->dev); + } + else + ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); return; } @@ -642,6 +632,12 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_down(dev, 0); + if (restart_qp) { + if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) + ipoib_ib_dev_stop(dev, 0); + ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +646,25 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task); + /* we only restart the QP in case of pkey change event */ + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - /* Flush any child interfaces too */ - list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_task); - mutex_unlock(&priv->vlan_mutex); + /* restart the QP in case of pkey change event */ + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void 
ipoib_ib_dev_cleanup(struct net_device *dev) @@ -672,54 +679,3 @@ void ipoib_ib_dev_cleanup(struct net_dev ipoib_transport_dev_cleanup(dev); } -/* - * Delayed P_Key Assigment Interim Support - * - * The following is initial implementation of delayed P_Key assigment - * mechanism. It is using the same approach implemented for the multicast - * group join. The single goal of this implementation is to quickly address - * Bug #2507. This implementation will probably be removed when the P_Key - * change async notification is available. - */ - -void ipoib_pkey_poll(struct work_struct *work) -{ - struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); - struct net_device *dev = priv->dev; - - ipoib_pkey_dev_check_presence(dev); - - if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) - ipoib_open(dev); - else { - mutex_lock(&pkey_mutex); - if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) - queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, - HZ); - mutex_unlock(&pkey_mutex); - } -} - -int ipoib_pkey_dev_delay_open(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - - /* Look for the interface pkey value in the IB Port P_Key table and */ - /* set the interface pkey assigment flag */ - ipoib_pkey_dev_check_presence(dev); - - /* P_Key value not assigned yet - start polling */ - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { - mutex_lock(&pkey_mutex); - clear_bit(IPOIB_PKEY_STOP, &priv->flags); - queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, - HZ); - mutex_unlock(&pkey_mutex); - return 1; - } - - return 0; -} Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-02 17:48:05.276713741 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-02 17:48:30.150283249 +0300 @@ -100,14 +100,11 @@ int ipoib_open(struct net_device *dev) set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); - if 
(ipoib_pkey_dev_delay_open(dev)) - return 0; - if (ipoib_ib_dev_open(dev)) - return -EINVAL; + return test_bit(IPOIB_PKEY_NEEDED, &priv->flags) ? 0 : -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +149,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ -990,7 +987,7 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-05-02 17:48:05.277713563 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-05-02 17:48:30.151283071 +0300 @@ -232,9 +232,10 @@ static int ipoib_mcast_join_finish(struc ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), &mcast->mcmember.mgid); if (ret < 0) { - ipoib_warn(priv, "couldn't attach QP to multicast group " - IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); + if (ret != -ENXIO) /* No pkey found */ + ipoib_warn(priv, "couldn't attach QP to multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); return ret; @@ -312,7 +313,7 @@ ipoib_mcast_sendonly_join_complete(int s status = ipoib_mcast_join_finish(mcast, &multicast->rec); if (status) { - if (mcast->logcount++ < 20) + if (mcast->logcount++ < 20 && status != -ENXIO) ipoib_dbg_mcast(netdev_priv(dev), 
"multicast join failed for " IPOIB_GID_FMT ", status %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), status); @@ -420,7 +421,7 @@ static int ipoib_mcast_join_complete(int ", status %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), status); - } else { + } else if (status != -ENXIO) { ipoib_warn(priv, "multicast join failed for " IPOIB_GID_FMT ", status %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-02 17:48:05.277713563 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-02 17:48:30.152282893 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ -#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,12 +47,12 @@ int ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_NEEDED, &priv->flags); ret = -ENXIO; goto out; } - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + set_bit(IPOIB_PKEY_NEEDED, &priv->flags); /* set correct QKey for QP */ qp_attr->qkey = priv->qkey; @@ -103,12 +101,12 @@ int ipoib_init_qp(struct net_device *dev * The port has to be assigned to the respective IB partition in * advance. 
*/ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); + ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); if (ret) { - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + set_bit(IPOIB_PKEY_NEEDED, &priv->flags); return ret; } - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + clear_bit(IPOIB_PKEY_NEEDED, &priv->flags); qp_attr.qp_state = IB_QPS_INIT; qp_attr.qkey = 0; @@ -238,7 +236,7 @@ void ipoib_transport_dev_cleanup(struct ipoib_warn(priv, "ib_qp_destroy failed\n"); priv->qp = NULL; - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + clear_bit(IPOIB_PKEY_NEEDED, &priv->flags); } if (ib_destroy_cq(priv->cq)) @@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE || @@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler record->element.port_num == priv->port) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE && + record->element.port_num == priv->port) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_task); } } From yosefe at voltaire.com Wed May 2 08:57:09 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 02 May 2007 18:57:09 +0300 Subject: [ofa-general] [PATCH 2/3] remove ib pkey gid and lmc cache In-Reply-To: <4638B432.3060801@voltaire.com> References: <4638B432.3060801@voltaire.com> Message-ID: <4638B4D5.7050709@voltaire.com> Remove IB cache from core * Remove pkey, gid, and lmc caches * Rewrite ib_find_gid and ib_find_pkey over blocking device queries * Modify users of the cache to use these methods Signed-off-by: Yosef Etigin --- drivers/infiniband/core/cache.c | 398 
-------------------------------- include/rdma/ib_cache.h | 118 --------- drivers/infiniband/core/Makefile | 2 drivers/infiniband/core/cm.c | 8 drivers/infiniband/core/cma.c | 9 drivers/infiniband/core/core_priv.h | 3 drivers/infiniband/core/device.c | 143 ++++++++++- drivers/infiniband/core/mad.c | 5 drivers/infiniband/core/multicast.c | 3 drivers/infiniband/core/sa_query.c | 3 drivers/infiniband/core/verbs.c | 3 drivers/infiniband/hw/mthca/mthca_av.c | 3 drivers/infiniband/hw/mthca/mthca_qp.c | 10 drivers/infiniband/ulp/ipoib/ipoib_cm.c | 3 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 2 drivers/infiniband/ulp/srp/ib_srp.c | 6 include/rdma/ib_verbs.h | 37 ++ 17 files changed, 196 insertions(+), 560 deletions(-) Index: b/drivers/infiniband/core/device.c =================================================================== --- a/drivers/infiniband/core/device.c 2007-05-02 17:47:50.517342683 +0300 +++ b/drivers/infiniband/core/device.c 2007-05-02 17:48:30.719181916 +0300 @@ -149,6 +149,20 @@ static int alloc_name(char *name) return 0; } + +static inline int start_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; +} + + +static inline int end_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? + 0 : device->phys_port_cnt; +} + + /** * ib_alloc_device - allocate an IB device struct * @size:size of structure to allocate @@ -592,6 +606,128 @@ int ib_modify_port(struct ib_device *dev } EXPORT_SYMBOL(ib_modify_port); +/** + * ib_find_gid - Returns the port number and index of a GID + * @device: Device to query. 
+ * @gid: GID to look for + * @port_num: Returned port number + * @index: Returned index + * + * ib_find_gid() returns the index of @pkey in the pkey table + * on port @port_num + */ + int ib_find_gid(struct ib_device *device, + union ib_gid *gid, + u8 *port_num, + u16 *index) +{ + struct ib_port_attr *tprops = NULL; + union ib_gid tmp_gid; + int ret; + int port; + int i; + + tprops = kmalloc(sizeof *tprops, GFP_ATOMIC); + + for (port = start_port(device); port <= end_port(device); ++port) { + ret = ib_query_port(device, port, tprops); + if (ret) + continue; + + for (i = 0; i < tprops->gid_tbl_len; ++i) { + ret = ib_query_gid(device, port, i, &tmp_gid); + if (ret) + goto out; + if (!memcmp(&tmp_gid, gid, sizeof *gid)) { + *port_num = port; + *index = i; + ret = 0; + goto out; + } + } /* for i */ + } + ret = -ENOENT; +out: + kfree(tprops); + return ret; +} +EXPORT_SYMBOL(ib_find_gid); + +/** + * ib_find_pkey - Returns the index of a PKey on a port + * @device: Device to query. + * @port_num: Port to query on + * @pkey: PKey to look for + * @index: Returned index + * + * ib_find_pkey() returns the index of @pkey in the pkey table + * on port @port_num + */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, + u16 pkey, + u16 *index) +{ + struct ib_port_attr *tprops = NULL; + int ret; + int i = -1; + u16 tmp_pkey; + + tprops = kmalloc(sizeof *tprops, GFP_ATOMIC); + + ret = ib_query_port(device, port_num, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed , ret = %d\n", ret); + goto out; + } + + for (i = 0; i < tprops->pkey_tbl_len; ++i) { + ret = ib_query_pkey(device, port_num, i, &tmp_pkey); + if (ret) + goto out; + + if (pkey == tmp_pkey) { + *index = i; + ret = 0; + goto out; + } + } + ret = -ENOENT; + +out: + kfree(tprops); + return ret; +} +EXPORT_SYMBOL(ib_find_pkey); + +/** + * ib_query_lmc - Returns the LMC of a port + * @device: Device to query. 
+ * @port_num: Port to query on
+ * @lmc: Returned LMC
+ *
+ * ib_query_lmc() returns the LID mask control associated
+ * with port @port_num
+ */
+int ib_query_lmc(struct ib_device *device,
+		 u8 port_num,
+		 u8 *lmc)
+{
+	struct ib_port_attr *tprops = NULL;
+	int ret;
+
+	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
+	ret = ib_query_port(device, port_num, tprops);
+	if (ret) goto err;
+
+	*lmc = tprops->lmc;
+err:
+	kfree(tprops);
+	return ret;
+
+}
+EXPORT_SYMBOL(ib_query_lmc);
+
 static int __init ib_core_init(void)
 {
 	int ret;
@@ -600,18 +736,11 @@ static int __init ib_core_init(void)
 	if (ret)
 		printk(KERN_WARNING "Couldn't create InfiniBand device class\n");

-	ret = ib_cache_setup();
-	if (ret) {
-		printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n");
-		ib_sysfs_cleanup();
-	}
-
 	return ret;
 }

 static void __exit ib_core_cleanup(void)
 {
-	ib_cache_cleanup();
 	ib_sysfs_cleanup();
 }

Index: b/drivers/infiniband/core/cache.c
===================================================================
--- a/drivers/infiniband/core/cache.c	2007-05-02 17:47:49.878456482 +0300
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,398 +0,0 @@
-/*
- * Copyright (c) 2004 Topspin Communications. All rights reserved.
- * Copyright (c) 2005 Intel Corporation. All rights reserved.
- * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
- * Copyright (c) 2005 Voltaire, Inc. All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses. You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- *
- * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $
- */
-
-#include
-#include
-#include
-
-#include
-
-#include "core_priv.h"
-
-struct ib_pkey_cache {
-	int table_len;
-	u16 table[0];
-};
-
-struct ib_gid_cache {
-	int table_len;
-	union ib_gid table[0];
-};
-
-struct ib_update_work {
-	struct work_struct work;
-	struct ib_device *device;
-	u8 port_num;
-};
-
-static inline int start_port(struct ib_device *device)
-{
-	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
-}
-
-static inline int end_port(struct ib_device *device)
-{
-	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
-		0 : device->phys_port_cnt;
-}
-
-int ib_get_cached_gid(struct ib_device *device,
-		      u8 port_num,
-		      int index,
-		      union ib_gid *gid)
-{
-	struct ib_gid_cache *cache;
-	unsigned long flags;
-	int ret = 0;
-
-	if (port_num < start_port(device) || port_num > end_port(device))
-		return -EINVAL;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-
-	cache = device->cache.gid_cache[port_num - start_port(device)];
-
-	if (index < 0 || index >= cache->table_len)
-		ret = -EINVAL;
-	else
-		*gid = cache->table[index];
-
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_get_cached_gid);
-
-int ib_find_cached_gid(struct ib_device *device,
-		       union ib_gid *gid,
-		       u8 *port_num,
-		       u16 *index)
-{
-	struct ib_gid_cache *cache;
-	unsigned long flags;
-	int p, i;
-	int ret = -ENOENT;
-
-	*port_num = -1;
-	if (index)
-		*index = -1;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-
-	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
-		cache = device->cache.gid_cache[p];
-		for (i = 0; i < cache->table_len; ++i) {
-			if (!memcmp(gid, &cache->table[i], sizeof *gid)) {
-				*port_num = p + start_port(device);
-				if (index)
-					*index = i;
-				ret = 0;
-				goto found;
-			}
-		}
-	}
-found:
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_find_cached_gid);
-
-int ib_get_cached_pkey(struct ib_device *device,
-		       u8 port_num,
-		       int index,
-		       u16 *pkey)
-{
-	struct ib_pkey_cache *cache;
-	unsigned long flags;
-	int ret = 0;
-
-	if (port_num < start_port(device) || port_num > end_port(device))
-		return -EINVAL;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-
-	cache = device->cache.pkey_cache[port_num - start_port(device)];
-
-	if (index < 0 || index >= cache->table_len)
-		ret = -EINVAL;
-	else
-		*pkey = cache->table[index];
-
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_get_cached_pkey);
-
-int ib_find_cached_pkey(struct ib_device *device,
-			u8 port_num,
-			u16 pkey,
-			u16 *index)
-{
-	struct ib_pkey_cache *cache;
-	unsigned long flags;
-	int i;
-	int ret = -ENOENT;
-
-	if (port_num < start_port(device) || port_num > end_port(device))
-		return -EINVAL;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-
-	cache = device->cache.pkey_cache[port_num - start_port(device)];
-
-	*index = -1;
-
-	for (i = 0; i < cache->table_len; ++i)
-		if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) {
-			*index = i;
-			ret = 0;
-			break;
-		}
-
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_find_cached_pkey);
-
-int ib_get_cached_lmc(struct ib_device *device,
-		      u8 port_num,
-		      u8 *lmc)
-{
-	unsigned long flags;
-	int ret = 0;
-
-	if (port_num < start_port(device) || port_num > end_port(device))
-		return -EINVAL;
-
-	read_lock_irqsave(&device->cache.lock, flags);
-	*lmc = device->cache.lmc_cache[port_num - start_port(device)];
-	read_unlock_irqrestore(&device->cache.lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL(ib_get_cached_lmc);
-
-static void ib_cache_update(struct ib_device *device,
-			    u8 port)
-{
-	struct ib_port_attr *tprops = NULL;
-	struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache;
-	struct ib_gid_cache *gid_cache = NULL, *old_gid_cache;
-	int i;
-	int ret;
-
-	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
-	if (!tprops)
-		return;
-
-	ret = ib_query_port(device, port, tprops);
-	if (ret) {
-		printk(KERN_WARNING "ib_query_port failed (%d) for %s\n",
-		       ret, device->name);
-		goto err;
-	}
-
-	pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
-			     sizeof *pkey_cache->table, GFP_KERNEL);
-	if (!pkey_cache)
-		goto err;
-
-	pkey_cache->table_len = tprops->pkey_tbl_len;
-
-	gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len *
-			    sizeof *gid_cache->table, GFP_KERNEL);
-	if (!gid_cache)
-		goto err;
-
-	gid_cache->table_len = tprops->gid_tbl_len;
-
-	for (i = 0; i < pkey_cache->table_len; ++i) {
-		ret = ib_query_pkey(device, port, i, pkey_cache->table + i);
-		if (ret) {
-			printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n",
-			       ret, device->name, i);
-			goto err;
-		}
-	}
-
-	for (i = 0; i < gid_cache->table_len; ++i) {
-		ret = ib_query_gid(device, port, i, gid_cache->table + i);
-		if (ret) {
-			printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n",
-			       ret, device->name, i);
-			goto err;
-		}
-	}
-
-	write_lock_irq(&device->cache.lock);
-
-	old_pkey_cache = device->cache.pkey_cache[port - start_port(device)];
-	old_gid_cache = device->cache.gid_cache [port - start_port(device)];
-
-	device->cache.pkey_cache[port - start_port(device)] = pkey_cache;
-	device->cache.gid_cache [port - start_port(device)] = gid_cache;
-
-	device->cache.lmc_cache[port - start_port(device)] = tprops->lmc;
-
-	write_unlock_irq(&device->cache.lock);
-
-	kfree(old_pkey_cache);
-	kfree(old_gid_cache);
-	kfree(tprops);
-	return;
-
-err:
-	kfree(pkey_cache);
-	kfree(gid_cache);
-	kfree(tprops);
-}
-
-static void ib_cache_task(struct work_struct *_work)
-{
-	struct ib_update_work *work =
-		container_of(_work, struct ib_update_work, work);
-
-	ib_cache_update(work->device, work->port_num);
-	kfree(work);
-}
-
-static void ib_cache_event(struct ib_event_handler *handler,
-			   struct ib_event *event)
-{
-	struct ib_update_work *work;
-
-	if (event->event == IB_EVENT_PORT_ERR ||
-	    event->event == IB_EVENT_PORT_ACTIVE ||
-	    event->event == IB_EVENT_LID_CHANGE ||
-	    event->event == IB_EVENT_PKEY_CHANGE ||
-	    event->event == IB_EVENT_SM_CHANGE ||
-	    event->event == IB_EVENT_CLIENT_REREGISTER) {
-		work = kmalloc(sizeof *work, GFP_ATOMIC);
-		if (work) {
-			INIT_WORK(&work->work, ib_cache_task);
-			work->device = event->device;
-			work->port_num = event->element.port_num;
-			schedule_work(&work->work);
-		}
-	}
-}
-
-static void ib_cache_setup_one(struct ib_device *device)
-{
-	int p;
-
-	rwlock_init(&device->cache.lock);
-
-	device->cache.pkey_cache =
-		kmalloc(sizeof *device->cache.pkey_cache *
-			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
-	device->cache.gid_cache =
-		kmalloc(sizeof *device->cache.gid_cache *
-			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
-
-	device->cache.lmc_cache = kmalloc(sizeof *device->cache.lmc_cache *
-					  (end_port(device) -
-					   start_port(device) + 1),
-					  GFP_KERNEL);
-
-	if (!device->cache.pkey_cache || !device->cache.gid_cache ||
-	    !device->cache.lmc_cache) {
-		printk(KERN_WARNING "Couldn't allocate cache "
-		       "for %s\n", device->name);
-		goto err;
-	}
-
-	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
-		device->cache.pkey_cache[p] = NULL;
-		device->cache.gid_cache [p] = NULL;
-		ib_cache_update(device, p + start_port(device));
-	}
-
-	INIT_IB_EVENT_HANDLER(&device->cache.event_handler,
-			      device, ib_cache_event);
-	if (ib_register_event_handler(&device->cache.event_handler))
-		goto err_cache;
-
-	return;
-
-err_cache:
-	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
-		kfree(device->cache.pkey_cache[p]);
-		kfree(device->cache.gid_cache[p]);
-	}
-
-err:
-	kfree(device->cache.pkey_cache);
-	kfree(device->cache.gid_cache);
-	kfree(device->cache.lmc_cache);
-}
-
-static void ib_cache_cleanup_one(struct ib_device *device)
-{
-	int p;
-
-	ib_unregister_event_handler(&device->cache.event_handler);
-	flush_scheduled_work();
-
-	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
-		kfree(device->cache.pkey_cache[p]);
-		kfree(device->cache.gid_cache[p]);
-	}
-
-	kfree(device->cache.pkey_cache);
-	kfree(device->cache.gid_cache);
-	kfree(device->cache.lmc_cache);
-}
-
-static struct ib_client cache_client = {
-	.name = "cache",
-	.add = ib_cache_setup_one,
-	.remove = ib_cache_cleanup_one
-};
-
-int __init ib_cache_setup(void)
-{
-	return ib_register_client(&cache_client);
-}
-
-void __exit ib_cache_cleanup(void)
-{
-	ib_unregister_client(&cache_client);
-}

Index: b/include/rdma/ib_cache.h
===================================================================
--- a/include/rdma/ib_cache.h	2007-05-02 17:47:13.398954200 +0300
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,118 +0,0 @@
-/*
- * Copyright (c) 2004 Topspin Communications. All rights reserved.
- * Copyright (c) 2005 Intel Corporation. All rights reserved.
- * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses. You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- *
- * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $
- */
-
-#ifndef _IB_CACHE_H
-#define _IB_CACHE_H
-
-#include
-
-/**
- * ib_get_cached_gid - Returns a cached GID table entry
- * @device: The device to query.
- * @port_num: The port number of the device to query.
- * @index: The index into the cached GID table to query.
- * @gid: The GID value found at the specified index.
- *
- * ib_get_cached_gid() fetches the specified GID table entry stored in
- * the local software cache.
- */
-int ib_get_cached_gid(struct ib_device *device,
-		      u8 port_num,
-		      int index,
-		      union ib_gid *gid);
-
-/**
- * ib_find_cached_gid - Returns the port number and GID table index where
- *   a specified GID value occurs.
- * @device: The device to query.
- * @gid: The GID value to search for.
- * @port_num: The port number of the device where the GID value was found.
- * @index: The index into the cached GID table where the GID was found. This
- *   parameter may be NULL.
- *
- * ib_find_cached_gid() searches for the specified GID value in
- * the local software cache.
- */
-int ib_find_cached_gid(struct ib_device *device,
-		       union ib_gid *gid,
-		       u8 *port_num,
-		       u16 *index);
-
-/**
- * ib_get_cached_pkey - Returns a cached PKey table entry
- * @device: The device to query.
- * @port_num: The port number of the device to query.
- * @index: The index into the cached PKey table to query.
- * @pkey: The PKey value found at the specified index.
- *
- * ib_get_cached_pkey() fetches the specified PKey table entry stored in
- * the local software cache.
- */
-int ib_get_cached_pkey(struct ib_device *device_handle,
-		       u8 port_num,
-		       int index,
-		       u16 *pkey);
-
-/**
- * ib_find_cached_pkey - Returns the PKey table index where a specified
- *   PKey value occurs.
- * @device: The device to query.
- * @port_num: The port number of the device to search for the PKey.
- * @pkey: The PKey value to search for.
- * @index: The index into the cached PKey table where the PKey was found.
- *
- * ib_find_cached_pkey() searches the specified PKey table in
- * the local software cache.
- */
-int ib_find_cached_pkey(struct ib_device *device,
-			u8 port_num,
-			u16 pkey,
-			u16 *index);
-
-/**
- * ib_get_cached_lmc - Returns a cached lmc table entry
- * @device: The device to query.
- * @port_num: The port number of the device to query.
- * @lmc: The lmc value for the specified port for that device.
- *
- * ib_get_cached_lmc() fetches the specified lmc table entry stored in
- * the local software cache.
- */
-int ib_get_cached_lmc(struct ib_device *device,
-		      u8 port_num,
-		      u8 *lmc);
-
-#endif /* _IB_CACHE_H */

Index: b/include/rdma/ib_verbs.h
===================================================================
--- a/include/rdma/ib_verbs.h	2007-05-02 17:47:13.398954200 +0300
+++ b/include/rdma/ib_verbs.h	2007-05-02 17:48:30.741177998 +0300
@@ -1134,6 +1134,43 @@ int ib_modify_port(struct ib_device *dev
 		   struct ib_port_modify *port_modify);

 /**
+ * ib_find_gid - Returns the port number and index of a GID
+ * @device: Device to query.
+ * @gid: GID to look for
+ * @port_num: Returned port number
+ * @index: Returned index
+ *
+ * ib_find_gid() returns the port number and the GID table
+ * index at which @gid is found
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		u8 *port_num, u16 *index);
+
+/**
+ * ib_find_pkey - Returns the index of a PKey on a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @pkey: PKey to look for
+ * @index: Returned index
+ *
+ * ib_find_pkey() returns the index of @pkey in the pkey table
+ * on port @port_num
+ */
+int ib_find_pkey(struct ib_device *device, u8 port_num,
+		 u16 pkey, u16 *index);
+
+/**
+ * ib_query_lmc - Returns the LMC of a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @lmc: Returned LMC
+ *
+ * ib_query_lmc() returns the LID mask control associated
+ * with port @port_num
+ */
+int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc);
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *

Index: b/drivers/infiniband/core/Makefile
===================================================================
--- a/drivers/infiniband/core/Makefile	2007-05-02 17:47:49.333553540 +0300
+++ b/drivers/infiniband/core/Makefile	2007-05-02 17:48:30.741177998 +0300
@@ -8,7 +8,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=
 					$(user_access-y)

 ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
-				device.o fmr_pool.o cache.o
+				device.o fmr_pool.o

 ib_mad-y :=			mad.o smi.o agent.o mad_rmpp.o

Index: b/drivers/infiniband/core/cm.c
===================================================================
--- a/drivers/infiniband/core/cm.c	2007-05-02 17:47:49.762477140 +0300
+++ b/drivers/infiniband/core/cm.c	2007-05-02 17:48:30.744177464 +0300
@@ -46,8 +46,8 @@
 #include
 #include
-#include
 #include
+#include
 #include "cm_msgs.h"

 MODULE_AUTHOR("Sean Hefty");
@@ -275,7 +275,7 @@ static int cm_init_av_by_path(struct ib_

 	read_lock_irqsave(&cm.device_lock, flags);
 	list_for_each_entry(cm_dev, &cm.device_list, list) {
-		if (!ib_find_cached_gid(cm_dev->device, &path->sgid,
+		if (!ib_find_gid(cm_dev->device, &path->sgid,
 					&p, NULL)) {
 			port = &cm_dev->port[p-1];
 			break;
@@ -286,7 +286,7 @@ static int cm_init_av_by_path(struct ib_
 	if (!port)
 		return -EINVAL;

-	ret = ib_find_cached_pkey(cm_dev->device, port->port_num,
+	ret = ib_find_pkey(cm_dev->device, port->port_num,
 				  be16_to_cpu(path->pkey), &av->pkey_index);
 	if (ret)
 		return ret;
@@ -1382,7 +1382,7 @@ static int cm_req_handler(struct cm_work
 	cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]);
 	ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av);
 	if (ret) {
-		ib_get_cached_gid(work->port->cm_dev->device,
+		ib_query_gid(work->port->cm_dev->device,
 				  work->port->port_num, 0, &work->path[0].sgid);
 		ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
 			       &work->path[0].sgid, sizeof work->path[0].sgid,

Index: b/drivers/infiniband/core/cma.c
===================================================================
--- a/drivers/infiniband/core/cma.c	2007-05-02 17:47:50.749301367 +0300
+++ b/drivers/infiniband/core/cma.c	2007-05-02 17:48:30.746177108 +0300
@@ -41,7 +41,6 @@
 #include
 #include
-#include
 #include
 #include
 #include
@@ -325,7 +324,7 @@ static int cma_acquire_dev(struct rdma_i
 	}

 	list_for_each_entry(cma_dev, &dev_list, list) {
-		ret = ib_find_cached_gid(cma_dev->device, &gid,
+		ret = ib_find_gid(cma_dev->device, &gid,
 					 &id_priv->id.port_num, NULL);
 		if (!ret) {
 			ret = cma_set_qkey(cma_dev->device,
@@ -514,7 +513,7 @@ static int cma_ib_init_qp_attr(struct rd
 	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
 	int ret;

-	ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num,
+	ret = ib_find_pkey(id_priv->id.device, id_priv->id.port_num,
 				  ib_addr_get_pkey(dev_addr),
 				  &qp_attr->pkey_index);
 	if (ret)
@@ -1658,11 +1657,11 @@ static int cma_bind_loopback(struct rdma
 	cma_dev = list_entry(dev_list.next, struct cma_device, list);

 port_found:
-	ret = ib_get_cached_gid(cma_dev->device, p, 0, &gid);
+	ret = ib_query_gid(cma_dev->device, p, 0, &gid);
 	if (ret)
 		goto out;

-	ret = ib_get_cached_pkey(cma_dev->device, p, 0, &pkey);
+	ret = ib_query_pkey(cma_dev->device, p, 0, &pkey);
 	if (ret)
 		goto out;

Index: b/drivers/infiniband/core/mad.c
===================================================================
--- a/drivers/infiniband/core/mad.c	2007-05-02 17:47:50.423359423 +0300
+++ b/drivers/infiniband/core/mad.c	2007-05-02 17:48:30.748176751 +0300
@@ -34,7 +34,6 @@
  * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $
  */
 #include
-#include

 #include "mad_priv.h"
 #include "mad_rmpp.h"
@@ -1707,13 +1706,13 @@ static inline int rcv_has_same_gid(struc
 	if (!send_resp && rcv_resp) {
 		/* is request/response. */
 		if (!(attr.ah_flags & IB_AH_GRH)) {
-			if (ib_get_cached_lmc(device, port_num, &lmc))
+			if (ib_query_lmc(device, port_num, &lmc))
 				return 0;
 			return (!lmc || !((attr.src_path_bits ^
 					   rwc->wc->dlid_path_bits) &
 					  ((1 << lmc) - 1)));
 		} else {
-			if (ib_get_cached_gid(device, port_num,
+			if (ib_query_gid(device, port_num,
 					      attr.grh.sgid_index, &sgid))
 				return 0;
 			return !memcmp(sgid.raw, rwc->recv_buf.grh->dgid.raw,

Index: b/drivers/infiniband/core/multicast.c
===================================================================
--- a/drivers/infiniband/core/multicast.c	2007-05-02 17:47:51.014254173 +0300
+++ b/drivers/infiniband/core/multicast.c	2007-05-02 17:48:30.749176573 +0300
@@ -38,7 +38,6 @@
 #include
 #include

-#include
 #include "sa.h"

 static void mcast_add_one(struct ib_device *device);
@@ -686,7 +685,7 @@ int ib_init_ah_from_mcmember(struct ib_d
 	u16 gid_index;
 	u8 p;

-	ret = ib_find_cached_gid(device, &rec->port_gid, &p, &gid_index);
+	ret = ib_find_gid(device, &rec->port_gid, &p, &gid_index);
 	if (ret)
 		return ret;

Index: b/drivers/infiniband/core/sa_query.c
===================================================================
--- a/drivers/infiniband/core/sa_query.c	2007-05-02 17:47:49.689490140 +0300
+++ b/drivers/infiniband/core/sa_query.c	2007-05-02 17:48:30.749176573 +0300
@@ -47,7 +47,6 @@
 #include
 #include
-#include
 #include "sa.h"

 MODULE_AUTHOR("Roland Dreier");
@@ -477,7 +476,7 @@ int ib_init_ah_from_path(struct ib_devic
 		ah_attr->ah_flags = IB_AH_GRH;
 		ah_attr->grh.dgid = rec->dgid;

-		ret = ib_find_cached_gid(device, &rec->sgid, &port_num,
+		ret = ib_find_gid(device, &rec->sgid, &port_num,
 					 &gid_index);
 		if (ret)
 			return ret;

Index: b/drivers/infiniband/core/verbs.c
===================================================================
--- a/drivers/infiniband/core/verbs.c	2007-05-02 17:47:49.091596637 +0300
+++ b/drivers/infiniband/core/verbs.c	2007-05-02 17:48:30.750176395 +0300
@@ -43,7 +43,6 @@

 #include
 #include
-#include

 int ib_rate_to_mult(enum ib_rate rate)
 {
@@ -159,7 +158,7 @@ int ib_init_ah_from_wc(struct ib_device
 		ah_attr->ah_flags = IB_AH_GRH;
 		ah_attr->grh.dgid = grh->sgid;

-		ret = ib_find_cached_gid(device, &grh->dgid, &port_num,
+		ret = ib_find_gid(device, &grh->dgid, &port_num,
 					 &gid_index);
 		if (ret)
 			return ret;

Index: b/drivers/infiniband/hw/mthca/mthca_av.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_av.c	2007-05-02 17:47:53.157872352 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_av.c	2007-05-02 17:48:30.751176217 +0300
@@ -37,7 +37,6 @@
 #include

 #include
-#include

 #include "mthca_dev.h"

@@ -279,7 +278,7 @@ int mthca_read_ah(struct mthca_dev *dev,
 			(be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff;
 		header->grh.flow_label =
 			ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff);
-		ib_get_cached_gid(&dev->ib_dev,
+		ib_query_gid(&dev->ib_dev,
 				  be32_to_cpu(ah->av->port_pd) >> 24,
 				  ah->av->gid_index % dev->limits.gid_table_len,
 				  &header->grh.source_gid);

Index: b/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-02 17:47:53.153873064 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-02 18:04:14.123981858 +0300
@@ -40,9 +40,8 @@
 #include

-#include
-#include
 #include
+#include

 #include "mthca_dev.h"
 #include "mthca_cmd.h"
@@ -1485,11 +1484,10 @@ static int build_mlx_header(struct mthca
 	sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE;
 	sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED);
 	if (!sqp->qp.ibqp.qp_num)
-		ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port,
-				   sqp->pkey_index, &pkey);
+		ib_query_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey);
 	else
-		ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port,
-				   wr->wr.ud.pkey_index, &pkey);
+		ib_query_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey);
+
 	sqp->ud_header.bth.pkey = cpu_to_be16(pkey);
 	sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn);
 	sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1));

Index: b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-02 17:47:52.042071098 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-05-02 17:48:30.753175861 +0300
@@ -33,7 +33,6 @@
  */

 #include
-#include
 #include
 #include
 #include
@@ -759,7 +758,7 @@ static int ipoib_cm_modify_tx_init(struc
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
 	int qp_attr_mask, ret;
-	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index);
+	ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index);
 	if (ret) {
 		ipoib_warn(priv, "pkey 0x%x not in cache: %d\n", priv->pkey, ret);
 		return ret;

Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-02 17:48:30.150283249 +0300
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-02 17:48:30.754175683 +0300
@@ -38,7 +38,7 @@
 #include
 #include

-#include
+#include

 #include "ipoib.h"

Index: b/drivers/infiniband/ulp/srp/ib_srp.c
===================================================================
--- a/drivers/infiniband/ulp/srp/ib_srp.c	2007-05-02 17:47:52.336018740 +0300
+++ b/drivers/infiniband/ulp/srp/ib_srp.c	2007-05-02 17:48:30.755175505 +0300
@@ -48,8 +48,6 @@
 #include
 #include

-#include
-
 #include "ib_srp.h"

 #define DRV_NAME	"ib_srp"
@@ -164,7 +162,7 @@ static int srp_init_qp(struct srp_target
 	if (!attr)
 		return -ENOMEM;

-	ret = ib_find_cached_pkey(target->srp_host->dev->dev,
+	ret = ib_find_pkey(target->srp_host->dev->dev,
 				  target->srp_host->port,
 				  be16_to_cpu(target->path.pkey),
 				  &attr->pkey_index);
@@ -1780,7 +1778,7 @@ static ssize_t srp_create_target(struct
 	if (ret)
 		goto err;

-	ib_get_cached_gid(host->dev->dev, host->port, 0, &target->path.sgid);
+	ib_query_gid(host->dev->dev, host->port, 0, &target->path.sgid);

 	printk(KERN_DEBUG PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x "
 	       "service_id %016llx dgid %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",

Index: b/drivers/infiniband/core/core_priv.h
===================================================================
--- a/drivers/infiniband/core/core_priv.h	2007-05-02 17:47:50.519342327 +0300
+++ b/drivers/infiniband/core/core_priv.h	2007-05-02 17:48:30.755175505 +0300
@@ -46,7 +46,4 @@ void ib_device_unregister_sysfs(struct i
 int ib_sysfs_setup(void);
 void ib_sysfs_cleanup(void);

-int ib_cache_setup(void);
-void ib_cache_cleanup(void);
-
 #endif /* _CORE_PRIV_H */

From yosefe at voltaire.com Wed May 2 08:57:50 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 02 May 2007 18:57:50 +0300
Subject: [ofa-general] [PATCH 3/3] mthca: provider-level caching of pkeys
In-Reply-To: <4638B432.3060801@voltaire.com>
References: <4638B432.3060801@voltaire.com>
Message-ID: <4638B4FE.8010605@voltaire.com>

Add provider-level caching of pkeys to mthca

* have the driver intercept SMPs which are pkey table notifications,
  and update its internal cache with the new values.
* modify query_pkey to use this cache instead of doing a blocking HW call
* while creating an MLX QP, use this cache

Signed-off-by: Yosef Etigin
---
 drivers/infiniband/hw/mthca/mthca_dev.h      |   12 +
 drivers/infiniband/hw/mthca/mthca_mad.c      |    5
 drivers/infiniband/hw/mthca/mthca_provider.c |  167 +++++++++++++++++++++++----
 drivers/infiniband/hw/mthca/mthca_qp.c       |    5
 include/rdma/ib_smi.h                        |    1
 5 files changed, 163 insertions(+), 27 deletions(-)

Index: b/drivers/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_dev.h	2007-05-02 17:47:52.931912600 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h	2007-05-02 17:48:31.525038376 +0300
@@ -49,6 +49,8 @@

 #include

+#include
+
 #include "mthca_provider.h"
 #include "mthca_doorbell.h"

@@ -287,6 +289,11 @@ struct mthca_catas_err {
 	struct list_head list;
 };

+struct mthca_pkey_cache {
+	int table_len;
+	u16 table[0];
+};
+
 extern struct mutex mthca_device_mutex;

 struct mthca_dev {
@@ -360,6 +367,9 @@ struct mthca_dev {
 	struct ib_ah *sm_ah[MTHCA_MAX_PORTS];
 	spinlock_t sm_lock;
 	u8 rate[MTHCA_MAX_PORTS];
+
+	rwlock_t pkey_cache_lock;
+	struct mthca_pkey_cache *pkey_cache[MTHCA_MAX_PORTS];
 };

 #ifdef CONFIG_INFINIBAND_MTHCA_DEBUG
@@ -585,6 +595,8 @@ int mthca_process_mad(struct ib_device *
 int mthca_create_agents(struct mthca_dev *dev);
 void mthca_free_agents(struct mthca_dev *dev);

+int mthca_cache_update(struct mthca_dev *mdev, struct ib_smp *smp, u8 port_num);
+
 static inline struct mthca_dev *to_mdev(struct ib_device *ibdev)
 {
 	return container_of(ibdev, struct mthca_dev, ib_dev);

Index: b/drivers/infiniband/hw/mthca/mthca_mad.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_mad.c	2007-05-02 17:47:53.067888380 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_mad.c	2007-05-02 17:48:31.525038376 +0300
@@ -134,6 +134,11 @@ static void smp_snoop(struct ib_device *
 		}

 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) {
+
+			/* update pkey cache from a snooped MAD */
+			mthca_dbg(to_mdev(ibdev), "pkey change at port %d\n", port_num);
+			mthca_cache_update(to_mdev(ibdev), (struct ib_smp*) mad, port_num);
+
 			event.device = ibdev;
 			event.event = IB_EVENT_PKEY_CHANGE;
 			event.element.port_num = port_num;

Index: b/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_provider.c	2007-05-02 17:47:52.996901024 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c	2007-05-02 17:48:31.526038198 +0300
@@ -243,36 +243,27 @@ out:
 static int mthca_query_pkey(struct ib_device *ibdev,
 			    u8 port, u16 index, u16 *pkey)
 {
-	struct ib_smp *in_mad = NULL;
-	struct ib_smp *out_mad = NULL;
-	int err = -ENOMEM;
-	u8 status;
+	struct mthca_dev * mdev;
+	unsigned long flags;

-	in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL);
-	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
-	if (!in_mad || !out_mad)
-		goto out;
+	mdev = to_mdev(ibdev);
+	read_lock_irqsave(&mdev->pkey_cache_lock, flags);

-	init_query_mad(in_mad);
-	in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE;
-	in_mad->attr_mod = cpu_to_be32(index / 32);
+	if (port < 1 || port > mdev->ib_dev.phys_port_cnt ||
+	    index >= mdev->pkey_cache[ port - 1 ]->table_len ) {
+		mthca_warn(mdev, "pkey request at %d[%d] is out of range %d[%d] - %d[%d]\n",
+			   port, index,
+			   1, 0,
+			   mdev->ib_dev.phys_port_cnt, mdev->pkey_cache[ port - 1 ]->table_len -1);

-	err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1,
-			    port, NULL, NULL, in_mad, out_mad,
-			    &status);
-	if (err)
-		goto out;
-	if (status) {
-		err = -EINVAL;
-		goto out;
+		read_unlock_irqrestore(&mdev->pkey_cache_lock, flags);
+		return -EINVAL;
 	}

-	*pkey = be16_to_cpu(((__be16 *) out_mad->data)[index % 32]);
+	*pkey = mdev->pkey_cache[ port - 1 ]->table[ index ];

- out:
-	kfree(in_mad);
-	kfree(out_mad);
-	return err;
+	read_unlock_irqrestore(&mdev->pkey_cache_lock, flags);
+	return 0;
 }

 static int mthca_query_gid(struct ib_device *ibdev, u8 port,
@@ -1259,6 +1250,127 @@ out:
 	return err;
 }

+/*
+ * Initialize cache:
+ * ask the SM for the table
+ */
+static int mthca_cache_init(struct mthca_dev *mdev)
+{
+	struct ib_smp *in_mad = NULL;
+	struct ib_smp *out_mad = NULL;
+	struct ib_port_attr *tprops = NULL;
+	unsigned int i;
+	unsigned int tbl_len;
+
+	int err = -ENOMEM;
+	u8 status;
+
+	rwlock_init(&mdev->pkey_cache_lock);
+
+	mthca_dbg(mdev, "setting up PKey cache\n");
+
+	memset(mdev->pkey_cache, 0, sizeof mdev->pkey_cache);
+
+	tprops = kmalloc( sizeof * tprops, GFP_KERNEL );
+	in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL);
+	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
+
+	if (!tprops || !in_mad || !out_mad)
+		goto out;
+
+	for ( i = 0; i < mdev->ib_dev.phys_port_cnt; ++i ) {
+
+		/* find out how many pkeys this port holds */
+		err = mthca_query_port(&mdev->ib_dev, i+1, tprops);
+		if (err)
+			continue;
+
+		/* allocate cache */
+		tbl_len = tprops->pkey_tbl_len;
+		mdev->pkey_cache[ i ] = kmalloc(sizeof(struct mthca_pkey_cache) +
+						tbl_len * sizeof(u16), GFP_KERNEL);
+		if ( ! mdev->pkey_cache[ i ] )
+			goto out;
+
+		mdev->pkey_cache[ i ]->table_len = tbl_len;
+
+		while (tbl_len) {
+
+			/* send pkey query mad */
+			memset(in_mad, 0, sizeof * in_mad);
+			init_query_mad(in_mad);
+			in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE;
+			in_mad->attr_mod = cpu_to_be32( (tbl_len-1) / IB_SMP_NUM_PKEY_ENTRIES);
+
+			err = mthca_MAD_IFC(mdev, 1, 1,
+					    i + 1, NULL, NULL, in_mad, out_mad,
+					    &status);
+
+			if (err || status)
+				break;
+
+			mthca_cache_update(mdev, out_mad, i + 1);
+			tbl_len -= IB_SMP_NUM_PKEY_ENTRIES;
+		}
+	}
+
+out:
+	kfree(in_mad);
+	kfree(out_mad);
+	kfree(tprops);
+	return err;
+}
+
+/*
+ * Destroy the pkey cache
+ */
+static void mthca_cache_destroy(struct mthca_dev *mdev)
+{
+	int i;
+	for ( i = 0; i < mdev->ib_dev.phys_port_cnt; ++i ) {
+		kfree( mdev->pkey_cache[ i ] );
+	}
+}
+
+/*
+ * We snooped a pkey-table mad
+ * extract the new pkey table, and update our internal cache
+ */
+int mthca_cache_update(struct mthca_dev *mdev, struct ib_smp *smp, u8 port_num)
+{
+	unsigned int table_offset;
+	unsigned long flags;
+	int i;
+	struct mthca_pkey_cache *pkey_cache;
+	u16 *entry;
+
+	table_offset = ( be32_to_cpu(smp->attr_mod) & 0xFFFF ) *
+		       IB_SMP_NUM_PKEY_ENTRIES;
+
+	mthca_dbg(mdev, "port %d: new pkey table at offset %d\n",
+		  port_num, table_offset);
+
+	write_lock_irqsave(&mdev->pkey_cache_lock, flags);
+
+	pkey_cache = mdev->pkey_cache[ port_num - 1 ];
+
+	if (pkey_cache->table_len < IB_SMP_NUM_PKEY_ENTRIES + table_offset) {
+		mthca_warn(mdev, "pkey table out of range - ignoring\n");
+		write_unlock_irqrestore(&mdev->pkey_cache_lock, flags);
+		return -EINVAL;
+	}
+
+	/* update the cache */
+	entry = pkey_cache->table + table_offset;
+	for ( i = 0; i < IB_SMP_NUM_PKEY_ENTRIES; ++i ) {
+		u16 pkey = be16_to_cpu ( *( ( (u16*)smp->data ) + i ) );
+		*(entry++) = pkey;
+	}
+
+	write_unlock_irqrestore(&mdev->pkey_cache_lock, flags);
+	return 0;
+}
+
 int mthca_register_device(struct mthca_dev *dev)
 {
 	int ret;
@@ -1365,6 +1477,12 @@ int mthca_register_device(struct mthca_d

 	mutex_init(&dev->cap_mask_mutex);

+	ret = mthca_cache_init(dev);
+	if (ret) {
+		mthca_cache_destroy(dev);
+		return ret;
+	}
+
 	ret = ib_register_device(&dev->ib_dev);
 	if (ret)
 		return ret;
@@ -1387,4 +1505,5 @@ void mthca_unregister_device(struct mthc
 {
 	mthca_stop_catas_poll(dev);
 	ib_unregister_device(&dev->ib_dev);
+	mthca_cache_destroy(dev);
 }

Index: b/include/rdma/ib_smi.h
===================================================================
--- a/include/rdma/ib_smi.h	2007-05-02 17:47:12.741071381 +0300
+++ b/include/rdma/ib_smi.h	2007-05-02 17:48:31.527038020 +0300
@@ -43,6 +43,7 @@

 #define IB_SMP_DATA_SIZE			64
 #define IB_SMP_MAX_PATH_HOPS			64
+#define IB_SMP_NUM_PKEY_ENTRIES			32

 struct ib_smp {
 	u8 base_version;

Index: b/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- a/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-02 17:48:30.752176039 +0300
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c	2007-05-02 17:48:31.528037842 +0300
@@ -41,7 +41,6 @@
 #include

 #include
-#include

 #include "mthca_dev.h"
 #include "mthca_cmd.h"
@@ -1484,9 +1483,9 @@ static int build_mlx_header(struct mthca
 	sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE;
 	sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED);
 	if (!sqp->qp.ibqp.qp_num)
-		ib_query_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey);
+		dev->ib_dev.query_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey);
 	else
-		ib_query_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey);
+		dev->ib_dev.query_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey);

 	sqp->ud_header.bth.pkey = cpu_to_be16(pkey);
 	sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn);

From rick.jones2 at hp.com Wed May 2 09:25:59 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Wed, 02 May 2007 09:25:59 -0700
Subject: [ofa-general] minutes from socket over RDMA discussion at workshop
In-Reply-To:
References: 
Message-ID: <4638BB97.8050403@hp.com>

Scott Weitzenkamp (sweitzen) wrote:
> No, this is not right. SDP has better latency and better throughput
> than IPoIB CM, but also uses more CPU.

I can confirm that with some netperf numbers I've been gathering:

SD  = Service Demand - usec CPU consumed per KB or tran
      smaller is better
SDx = Service Demand Xmit; SDr = Service Demand Recv
*     back to back - no switch.
9k  - 9000 byte MTU. As it turns-out, ... this is the _default_ in the
      version of the myri10ge driver given to the author to use.

RedHat Enterprise Linux 5

                           Bulk Transfer                      "Latency"
                      Unidir               Bidir
Card                  Mbit/s SDx   SDr     Mbit/s SDx   SDr    Tran/s SDx   SDr
---------------------------------------------------------------------------
                      nnnn   nnnnn nnnnn   nnnn   nnnnn nnnnn  nnnnn  nnnnn nnnnn
AD313A IPoIB          2970   4.418 4.544   3530   3.59  3.95   19290  n/a   n/a
AD313A SDP            7810   0.453 1.048   12820  0.69  0.68   38030  26.29 26.29
AD313A SDP p=0        7810   0.346 0.527   12670  0.42  0.043  19380  n/a   n/a
AD144A IP
Myri10G IP 9k         9320   0.862 0.949   10950  1.00  0.86   19260  19.67 16.18 *
Myri10G IP 9k msi     9320   0.449 0.672   10840  0.63  0.62   19430  11.68 11.56
Myri10G IP msi        7020   0.525 1.790   9820   1.22  1.22   not measured

Service demand for IPoIB TCP_RR and SDP p=0 not shown here because
netperf could never hit the confidence intervals for CPU utilization.
See the raw data for details.

*   No switch - systems connected back-to-back
p=0 - recv_poll and send_poll module parameters of ib_sdp set to zero
      to address CPU util issue for SDP_RR test, where the equivalent
      of an entire core was consumed on each side
msi - MSI or Message Signaled Interrupts enabled
1.2.0 version of the myri10ge driver

wrt zero copy and sockets and such, long long ago, in an OS far away -
HP-UX 9.X, HP had a zero-copy for TCP/IP over FDDI. This was when MTUs
and page sizes were still well-aligned (both 4K, well 4K and change for
the FDDI MTU).
The zero copy there was copy-on-write, and it was enabled only when an
application made an explicit setsockopt() call telling the transport
the application knew this was going to be happening. Then it was up to
the application to make sure it did not try to access the address range
of a previous send() call until after it knew the transport was no
longer referencing it.

ftp://ftp.cup.hp.com/dist/networking/briefs/copyavoid.pdf

rick jones

From rdreier at cisco.com Wed May 2 10:16:24 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 02 May 2007 10:16:24 -0700
Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to use send with invalidate
In-Reply-To: <20070502070849.GO8447@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 May 2007 10:08:49 +0300")
References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> <20070502070849.GO8447@mellanox.co.il>
Message-ID: 

 > > +	if (ib_wr->send_flags & IB_SEND_SOLICITED
 > > +	    && ib_wr->send_flags & IB_SEND_INVALIDATE) {
 > 
 > How about
 > 	if (ib_wr->send_flags & (IB_SEND_SOLICITED | IB_SEND_INVALIDATE))

These two aren't equivalent -- the first has an &&, yours works like ||.
Which is correct?

From mst at dev.mellanox.co.il Wed May 2 10:18:29 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 May 2007 20:18:29 +0300
Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache
In-Reply-To: <4638B4D5.7050709@voltaire.com>
References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com>
Message-ID: <20070502171829.GO22292@mellanox.co.il>

> Quoting Yosef Etigin :
> Subject: [PATCH 2/3] remove ib pkey gid and lmc cache
>
> Remove IB cache from core
>
> * Remove pkey, gid, and lmc caches
> * Rewrite ib_find_gid and ib_find_pkey over blocking device queries
> * Modify users of the cache to use these methods
>

That's what we wanted to do, all right. But there are some issues here.

> Signed-off-by: Yosef Etigin
> ---
>  drivers/infiniband/core/cache.c         |  398 --------------------------------
>  include/rdma/ib_cache.h                 |  118 ---------
>  drivers/infiniband/core/Makefile        |    2
>  drivers/infiniband/core/cm.c            |    8
>  drivers/infiniband/core/cma.c           |    9
>  drivers/infiniband/core/core_priv.h     |    3
>  drivers/infiniband/core/device.c        |  143 ++++++++++-
>  drivers/infiniband/core/mad.c           |    5
>  drivers/infiniband/core/multicast.c     |    3
>  drivers/infiniband/core/sa_query.c      |    3
>  drivers/infiniband/core/verbs.c         |    3
>  drivers/infiniband/hw/mthca/mthca_av.c  |    3
>  drivers/infiniband/hw/mthca/mthca_qp.c  |   10
>  drivers/infiniband/ulp/ipoib/ipoib_cm.c |    3
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c |    2
>  drivers/infiniband/ulp/srp/ib_srp.c     |    6
>  include/rdma/ib_verbs.h                 |   37 ++
>  17 files changed, 196 insertions(+), 560 deletions(-)
>
> Index: b/drivers/infiniband/core/device.c
> ===================================================================
> --- a/drivers/infiniband/core/device.c	2007-05-02 17:47:50.517342683 +0300
> +++ b/drivers/infiniband/core/device.c	2007-05-02 17:48:30.719181916 +0300
> @@ -149,6 +149,20 @@ static int alloc_name(char *name)
>  	return 0;
>  }
>
> +
> +static inline int start_port(struct ib_device *device)
> +{
> +	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
> +}
> +
> +
> +static inline int end_port(struct ib_device *device)
> +{
> +	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
> +		0 : device->phys_port_cnt;
> +}
> +
> +
>  /**
>   * ib_alloc_device - allocate an IB device struct
>   * @size:size of structure to allocate

No double-spacing please. A single empty line is enough for separation.

> @@ -592,6 +606,128 @@ int ib_modify_port(struct ib_device *dev
>  }
>  EXPORT_SYMBOL(ib_modify_port);
>
> +/**
> + * ib_find_gid - Returns the port number and index of a GID
> + * @device: Device to query.

Kill the "."

> + * @gid: GID to look for
> + * @port_num: Returned port number
> + * @index: Returned index
> + *
> + * ib_find_gid() returns the index of @pkey in the pkey table
> + * on port @port_num
> + */

The description is not really clear.
For comparison, here's what we had for ib_find_cached_gid:

> - * ib_find_cached_gid - Returns the port number and GID table index where
> - *   a specified GID value occurs.
> - * @device: The device to query.
> - * @gid: The GID value to search for.
> - * @port_num: The port number of the device where the GID value was found.
> - * @index: The index into the cached GID table where the GID was found. This
> - *   parameter may be NULL.
> - *
> - * ib_find_cached_gid() searches for the specified GID value in
> - * the local software cache.

so how about just removing the last 2 lines (which don't apply now)
and reusing the description as is?

> + int ib_find_gid(struct ib_device *device,
> +		union ib_gid *gid,
> +		u8 *port_num,
> +		u16 *index)
> +{

what's going on with alignment/whitespace here?

> +	struct ib_port_attr *tprops = NULL;
> +	union ib_gid tmp_gid;
> +	int ret;
> +	int port;
> +	int i;

Just one int will do:
	int i, port, ret;

> +	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);

Why ATOMIC? What if allocation fails?

> +
> +	for (port = start_port(device); port <= end_port(device); ++port) {
> +		ret = ib_query_port(device, port, tprops);
> +		if (ret)
> +			continue;
> +
> +		for (i = 0; i < tprops->gid_tbl_len; ++i) {
> +			ret = ib_query_gid(device, port, i, &tmp_gid);
> +			if (ret)
> +				goto out;
> +			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
> +				*port_num = port;
> +				*index = i;
> +				ret = 0;
> +				goto out;
> +			}
> +		} /* for i */

Kill the comment pls.

> +	}
> +	ret = -ENOENT;
> +out:
> +	kfree(tprops);
> +	return ret;
> +}
> +EXPORT_SYMBOL(ib_find_gid);

Mostly same comments apply to other functions below.

> +/**
> + * ib_find_pkey - Returns the index of a PKey on a port
> + * @device: Device to query.
> + * @port_num: Port to query on
> + * @pkey: PKey to look for
> + * @index: Returned index
> + *
> + * ib_find_pkey() returns the index of @pkey in the pkey table
> + * on port @port_num
> + */
> +int ib_find_pkey(struct ib_device *device,
> +		 u8 port_num,
> +		 u16 pkey,
> +		 u16 *index)
> +{
> +	struct ib_port_attr *tprops = NULL;
> +	int ret;
> +	int i = -1;

What does -1 do here?

> +	u16 tmp_pkey;
> +
> +	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
> +
> +	ret = ib_query_port(device, port_num, tprops);
> +	if (ret) {
> +		printk(KERN_WARNING "ib_query_port failed , ret = %d\n", ret);
> +		goto out;
> +	}
> +
> +	for (i = 0; i < tprops->pkey_tbl_len; ++i) {
> +		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
> +		if (ret)
> +			goto out;
> +
> +		if (pkey == tmp_pkey) {
> +			*index = i;
> +			ret = 0;
> +			goto out;
> +		}
> +	}
> +	ret = -ENOENT;
> +
> +out:
> +	kfree(tprops);
> +	return ret;
> +}
> +EXPORT_SYMBOL(ib_find_pkey);
> +
> +/**
> + * ib_query_lmc - Returns the LMC of a port
> + * @device: Device to query.
> + * @port_num: Port to query on
> + * @lmc: Returned LMC
> + *
> + * ib_query_lmc() returns the LID mask control associated
> + * with port @port_num
> + */
> +int ib_query_lmc(struct ib_device *device,
> +		 u8 port_num,
> +		 u8 *lmc)
> +{
> +	struct ib_port_attr *tprops = NULL;
> +	int ret;
> +
> +	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
> +	ret = ib_query_port(device, port_num, tprops);
> +	if (ret) goto err;

goto belongs on a line of its own.

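For illustration only (this is not taken from the posted patch): a
minimal userspace sketch of the shape the comments above seem to be
asking for, with the allocation checked before it is used and the goto
on its own line. The names query_lmc_sketch(), query_port_stub() and
struct port_attr are hypothetical stand-ins for ib_query_lmc(),
ib_query_port() and struct ib_port_attr, and malloc()/free() stand in
for kmalloc()/kfree(). Which GFP flag is appropriate in the kernel
depends on the calling context, so that choice is left open here.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ib_port_attr; only lmc matters here. */
struct port_attr {
	unsigned char lmc;
};

/* Hypothetical stub for ib_query_port(): fails for port 0, else fills attr. */
static int query_port_stub(unsigned char port_num, struct port_attr *attr)
{
	if (port_num == 0)
		return -EINVAL;
	attr->lmc = 3;
	return 0;
}

/*
 * Same overall shape as ib_query_lmc() in the quoted patch, but the
 * allocation result is checked and the goto sits on its own line.
 */
static int query_lmc_sketch(unsigned char port_num, unsigned char *lmc)
{
	struct port_attr *tprops;
	int ret;

	assert(lmc != NULL);	/* the caller must pass somewhere to store the result */

	/* kmalloc() with a context-appropriate GFP flag in kernel code */
	tprops = malloc(sizeof *tprops);
	if (!tprops)
		return -ENOMEM;

	ret = query_port_stub(port_num, tprops);
	if (ret)
		goto out;

	*lmc = tprops->lmc;
out:
	free(tprops);
	return ret;
}
```

With the stub above, query_lmc_sketch(1, &lmc) returns 0 and stores 3
in lmc, while port 0 returns -EINVAL. The buffer is freed on every path
that gets past the allocation check.
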
> + > + *lmc = tprops->lmc; > +err: > + kfree(tprops); > + return ret; > + > +} > +EXPORT_SYMBOL(ib_query_lmc); > + > static int __init ib_core_init(void) > { > int ret; > @@ -600,18 +736,11 @@ static int __init ib_core_init(void) > if (ret) > printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); > > - ret = ib_cache_setup(); > - if (ret) { > - printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n"); > - ib_sysfs_cleanup(); > - } > - > return ret; > } > > static void __exit ib_core_cleanup(void) > { > - ib_cache_cleanup(); > ib_sysfs_cleanup(); > } > > Index: b/drivers/infiniband/core/cache.c > =================================================================== > --- a/drivers/infiniband/core/cache.c 2007-05-02 17:47:49.878456482 +0300 > +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 > @@ -1,398 +0,0 @@ > -/* > - * Copyright (c) 2004 Topspin Communications. All rights reserved. > - * Copyright (c) 2005 Intel Corporation. All rights reserved. > - * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. > - * Copyright (c) 2005 Voltaire, Inc. All rights reserved. > - * > - * This software is available to you under a choice of one of two > - * licenses. You may choose to be licensed under the terms of the GNU > - * General Public License (GPL) Version 2, available from the file > - * COPYING in the main directory of this source tree, or the > - * OpenIB.org BSD license below: > - * > - * Redistribution and use in source and binary forms, with or > - * without modification, are permitted provided that the following > - * conditions are met: > - * > - * - Redistributions of source code must retain the above > - * copyright notice, this list of conditions and the following > - * disclaimer. > - * > - * - Redistributions in binary form must reproduce the above > - * copyright notice, this list of conditions and the following > - * disclaimer in the documentation and/or other materials > - * provided with the distribution. 
> - * > - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > - * SOFTWARE. > - * > - * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $ > - */ > - > -#include > -#include > -#include > - > -#include > - > -#include "core_priv.h" > - > -struct ib_pkey_cache { > - int table_len; > - u16 table[0]; > -}; > - > -struct ib_gid_cache { > - int table_len; > - union ib_gid table[0]; > -}; > - > -struct ib_update_work { > - struct work_struct work; > - struct ib_device *device; > - u8 port_num; > -}; > - > -static inline int start_port(struct ib_device *device) > -{ > - return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; > -} > - > -static inline int end_port(struct ib_device *device) > -{ > - return (device->node_type == RDMA_NODE_IB_SWITCH) ? 
> - 0 : device->phys_port_cnt; > -} > - > -int ib_get_cached_gid(struct ib_device *device, > - u8 port_num, > - int index, > - union ib_gid *gid) > -{ > - struct ib_gid_cache *cache; > - unsigned long flags; > - int ret = 0; > - > - if (port_num < start_port(device) || port_num > end_port(device)) > - return -EINVAL; > - > - read_lock_irqsave(&device->cache.lock, flags); > - > - cache = device->cache.gid_cache[port_num - start_port(device)]; > - > - if (index < 0 || index >= cache->table_len) > - ret = -EINVAL; > - else > - *gid = cache->table[index]; > - > - read_unlock_irqrestore(&device->cache.lock, flags); > - > - return ret; > -} > -EXPORT_SYMBOL(ib_get_cached_gid); > - > -int ib_find_cached_gid(struct ib_device *device, > - union ib_gid *gid, > - u8 *port_num, > - u16 *index) > -{ > - struct ib_gid_cache *cache; > - unsigned long flags; > - int p, i; > - int ret = -ENOENT; > - > - *port_num = -1; > - if (index) > - *index = -1; > - > - read_lock_irqsave(&device->cache.lock, flags); > - > - for (p = 0; p <= end_port(device) - start_port(device); ++p) { > - cache = device->cache.gid_cache[p]; > - for (i = 0; i < cache->table_len; ++i) { > - if (!memcmp(gid, &cache->table[i], sizeof *gid)) { > - *port_num = p + start_port(device); > - if (index) > - *index = i; > - ret = 0; > - goto found; > - } > - } > - } > -found: > - read_unlock_irqrestore(&device->cache.lock, flags); > - > - return ret; > -} > -EXPORT_SYMBOL(ib_find_cached_gid); > - > -int ib_get_cached_pkey(struct ib_device *device, > - u8 port_num, > - int index, > - u16 *pkey) > -{ > - struct ib_pkey_cache *cache; > - unsigned long flags; > - int ret = 0; > - > - if (port_num < start_port(device) || port_num > end_port(device)) > - return -EINVAL; > - > - read_lock_irqsave(&device->cache.lock, flags); > - > - cache = device->cache.pkey_cache[port_num - start_port(device)]; > - > - if (index < 0 || index >= cache->table_len) > - ret = -EINVAL; > - else > - *pkey = cache->table[index]; > - > - 
read_unlock_irqrestore(&device->cache.lock, flags); > - > - return ret; > -} > -EXPORT_SYMBOL(ib_get_cached_pkey); > - > -int ib_find_cached_pkey(struct ib_device *device, > - u8 port_num, > - u16 pkey, > - u16 *index) > -{ > - struct ib_pkey_cache *cache; > - unsigned long flags; > - int i; > - int ret = -ENOENT; > - > - if (port_num < start_port(device) || port_num > end_port(device)) > - return -EINVAL; > - > - read_lock_irqsave(&device->cache.lock, flags); > - > - cache = device->cache.pkey_cache[port_num - start_port(device)]; > - > - *index = -1; > - > - for (i = 0; i < cache->table_len; ++i) > - if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { > - *index = i; > - ret = 0; > - break; > - } > - > - read_unlock_irqrestore(&device->cache.lock, flags); > - > - return ret; > -} > -EXPORT_SYMBOL(ib_find_cached_pkey); > - > -int ib_get_cached_lmc(struct ib_device *device, > - u8 port_num, > - u8 *lmc) > -{ > - unsigned long flags; > - int ret = 0; > - > - if (port_num < start_port(device) || port_num > end_port(device)) > - return -EINVAL; > - > - read_lock_irqsave(&device->cache.lock, flags); > - *lmc = device->cache.lmc_cache[port_num - start_port(device)]; > - read_unlock_irqrestore(&device->cache.lock, flags); > - > - return ret; > -} > -EXPORT_SYMBOL(ib_get_cached_lmc); > - > -static void ib_cache_update(struct ib_device *device, > - u8 port) > -{ > - struct ib_port_attr *tprops = NULL; > - struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache; > - struct ib_gid_cache *gid_cache = NULL, *old_gid_cache; > - int i; > - int ret; > - > - tprops = kmalloc(sizeof *tprops, GFP_KERNEL); > - if (!tprops) > - return; > - > - ret = ib_query_port(device, port, tprops); > - if (ret) { > - printk(KERN_WARNING "ib_query_port failed (%d) for %s\n", > - ret, device->name); > - goto err; > - } > - > - pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * > - sizeof *pkey_cache->table, GFP_KERNEL); > - if (!pkey_cache) > - goto err; > - > - pkey_cache->table_len 
= tprops->pkey_tbl_len; > - > - gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len * > - sizeof *gid_cache->table, GFP_KERNEL); > - if (!gid_cache) > - goto err; > - > - gid_cache->table_len = tprops->gid_tbl_len; > - > - for (i = 0; i < pkey_cache->table_len; ++i) { > - ret = ib_query_pkey(device, port, i, pkey_cache->table + i); > - if (ret) { > - printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n", > - ret, device->name, i); > - goto err; > - } > - } > - > - for (i = 0; i < gid_cache->table_len; ++i) { > - ret = ib_query_gid(device, port, i, gid_cache->table + i); > - if (ret) { > - printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n", > - ret, device->name, i); > - goto err; > - } > - } > - > - write_lock_irq(&device->cache.lock); > - > - old_pkey_cache = device->cache.pkey_cache[port - start_port(device)]; > - old_gid_cache = device->cache.gid_cache [port - start_port(device)]; > - > - device->cache.pkey_cache[port - start_port(device)] = pkey_cache; > - device->cache.gid_cache [port - start_port(device)] = gid_cache; > - > - device->cache.lmc_cache[port - start_port(device)] = tprops->lmc; > - > - write_unlock_irq(&device->cache.lock); > - > - kfree(old_pkey_cache); > - kfree(old_gid_cache); > - kfree(tprops); > - return; > - > -err: > - kfree(pkey_cache); > - kfree(gid_cache); > - kfree(tprops); > -} > - > -static void ib_cache_task(struct work_struct *_work) > -{ > - struct ib_update_work *work = > - container_of(_work, struct ib_update_work, work); > - > - ib_cache_update(work->device, work->port_num); > - kfree(work); > -} > - > -static void ib_cache_event(struct ib_event_handler *handler, > - struct ib_event *event) > -{ > - struct ib_update_work *work; > - > - if (event->event == IB_EVENT_PORT_ERR || > - event->event == IB_EVENT_PORT_ACTIVE || > - event->event == IB_EVENT_LID_CHANGE || > - event->event == IB_EVENT_PKEY_CHANGE || > - event->event == IB_EVENT_SM_CHANGE || > - event->event == 
IB_EVENT_CLIENT_REREGISTER) { > - work = kmalloc(sizeof *work, GFP_ATOMIC); > - if (work) { > - INIT_WORK(&work->work, ib_cache_task); > - work->device = event->device; > - work->port_num = event->element.port_num; > - schedule_work(&work->work); > - } > - } > -} > - > -static void ib_cache_setup_one(struct ib_device *device) > -{ > - int p; > - > - rwlock_init(&device->cache.lock); > - > - device->cache.pkey_cache = > - kmalloc(sizeof *device->cache.pkey_cache * > - (end_port(device) - start_port(device) + 1), GFP_KERNEL); > - device->cache.gid_cache = > - kmalloc(sizeof *device->cache.gid_cache * > - (end_port(device) - start_port(device) + 1), GFP_KERNEL); > - > - device->cache.lmc_cache = kmalloc(sizeof *device->cache.lmc_cache * > - (end_port(device) - > - start_port(device) + 1), > - GFP_KERNEL); > - > - if (!device->cache.pkey_cache || !device->cache.gid_cache || > - !device->cache.lmc_cache) { > - printk(KERN_WARNING "Couldn't allocate cache " > - "for %s\n", device->name); > - goto err; > - } > - > - for (p = 0; p <= end_port(device) - start_port(device); ++p) { > - device->cache.pkey_cache[p] = NULL; > - device->cache.gid_cache [p] = NULL; > - ib_cache_update(device, p + start_port(device)); > - } > - > - INIT_IB_EVENT_HANDLER(&device->cache.event_handler, > - device, ib_cache_event); > - if (ib_register_event_handler(&device->cache.event_handler)) > - goto err_cache; > - > - return; > - > -err_cache: > - for (p = 0; p <= end_port(device) - start_port(device); ++p) { > - kfree(device->cache.pkey_cache[p]); > - kfree(device->cache.gid_cache[p]); > - } > - > -err: > - kfree(device->cache.pkey_cache); > - kfree(device->cache.gid_cache); > - kfree(device->cache.lmc_cache); > -} > - > -static void ib_cache_cleanup_one(struct ib_device *device) > -{ > - int p; > - > - ib_unregister_event_handler(&device->cache.event_handler); > - flush_scheduled_work(); > - > - for (p = 0; p <= end_port(device) - start_port(device); ++p) { > - 
kfree(device->cache.pkey_cache[p]); > - kfree(device->cache.gid_cache[p]); > - } > - > - kfree(device->cache.pkey_cache); > - kfree(device->cache.gid_cache); > - kfree(device->cache.lmc_cache); > -} > - > -static struct ib_client cache_client = { > - .name = "cache", > - .add = ib_cache_setup_one, > - .remove = ib_cache_cleanup_one > -}; > - > -int __init ib_cache_setup(void) > -{ > - return ib_register_client(&cache_client); > -} > - > -void __exit ib_cache_cleanup(void) > -{ > - ib_unregister_client(&cache_client); > -} > Index: b/include/rdma/ib_cache.h > =================================================================== > --- a/include/rdma/ib_cache.h 2007-05-02 17:47:13.398954200 +0300 > +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 > @@ -1,118 +0,0 @@ > -/* > - * Copyright (c) 2004 Topspin Communications. All rights reserved. > - * Copyright (c) 2005 Intel Corporation. All rights reserved. > - * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. > - * > - * This software is available to you under a choice of one of two > - * licenses. You may choose to be licensed under the terms of the GNU > - * General Public License (GPL) Version 2, available from the file > - * COPYING in the main directory of this source tree, or the > - * OpenIB.org BSD license below: > - * > - * Redistribution and use in source and binary forms, with or > - * without modification, are permitted provided that the following > - * conditions are met: > - * > - * - Redistributions of source code must retain the above > - * copyright notice, this list of conditions and the following > - * disclaimer. > - * > - * - Redistributions in binary form must reproduce the above > - * copyright notice, this list of conditions and the following > - * disclaimer in the documentation and/or other materials > - * provided with the distribution. 
> - * > - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > - * SOFTWARE. > - * > - * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $ > - */ > - > -#ifndef _IB_CACHE_H > -#define _IB_CACHE_H > - > -#include > - > -/** > - * ib_get_cached_gid - Returns a cached GID table entry > - * @device: The device to query. > - * @port_num: The port number of the device to query. > - * @index: The index into the cached GID table to query. > - * @gid: The GID value found at the specified index. > - * > - * ib_get_cached_gid() fetches the specified GID table entry stored in > - * the local software cache. > - */ > -int ib_get_cached_gid(struct ib_device *device, > - u8 port_num, > - int index, > - union ib_gid *gid); > - > -/** > - * ib_find_cached_gid - Returns the port number and GID table index where > - * a specified GID value occurs. > - * @device: The device to query. > - * @gid: The GID value to search for. > - * @port_num: The port number of the device where the GID value was found. > - * @index: The index into the cached GID table where the GID was found. This > - * parameter may be NULL. > - * > - * ib_find_cached_gid() searches for the specified GID value in > - * the local software cache. > - */ > -int ib_find_cached_gid(struct ib_device *device, > - union ib_gid *gid, > - u8 *port_num, > - u16 *index); > - > -/** > - * ib_get_cached_pkey - Returns a cached PKey table entry > - * @device: The device to query. > - * @port_num: The port number of the device to query. > - * @index: The index into the cached PKey table to query. 
> - * @pkey: The PKey value found at the specified index. > - * > - * ib_get_cached_pkey() fetches the specified PKey table entry stored in > - * the local software cache. > - */ > -int ib_get_cached_pkey(struct ib_device *device_handle, > - u8 port_num, > - int index, > - u16 *pkey); > - > -/** > - * ib_find_cached_pkey - Returns the PKey table index where a specified > - * PKey value occurs. > - * @device: The device to query. > - * @port_num: The port number of the device to search for the PKey. > - * @pkey: The PKey value to search for. > - * @index: The index into the cached PKey table where the PKey was found. > - * > - * ib_find_cached_pkey() searches the specified PKey table in > - * the local software cache. > - */ > -int ib_find_cached_pkey(struct ib_device *device, > - u8 port_num, > - u16 pkey, > - u16 *index); > - > -/** > - * ib_get_cached_lmc - Returns a cached lmc table entry > - * @device: The device to query. > - * @port_num: The port number of the device to query. > - * @lmc: The lmc value for the specified port for that device. > - * > - * ib_get_cached_lmc() fetches the specified lmc table entry stored in > - * the local software cache. > - */ > -int ib_get_cached_lmc(struct ib_device *device, > - u8 port_num, > - u8 *lmc); > - > -#endif /* _IB_CACHE_H */ > Index: b/include/rdma/ib_verbs.h > =================================================================== > --- a/include/rdma/ib_verbs.h 2007-05-02 17:47:13.398954200 +0300 > +++ b/include/rdma/ib_verbs.h 2007-05-02 17:48:30.741177998 +0300 > @@ -1134,6 +1134,43 @@ int ib_modify_port(struct ib_device *dev > struct ib_port_modify *port_modify); > > /** > + * ib_find_gid - Returns the port number and index of a GID > + * @device: Device to query. 
> + * @gid: GID to look for
> + * @port_num: Returned port number
> + * @index: Returned index
> + *
> + * ib_find_gid() returns the index of @pkey in the pkey table
> + * on port @port_num
> + */
> + int ib_find_gid(struct ib_device *device, union ib_gid *gid,
> +		u8 *port_num, u16 *index);
> +
> +/**
> + * ib_find_pkey - Returns the index of a PKey on a port
> + * @device: Device to query.
> + * @port_num: Port to query on
> + * @pkey: PKey to look for
> + * @index: Returned index
> + *
> + * ib_find_pkey() returns the index of @pkey in the pkey table
> + * on port @port_num
> + */
> +int ib_find_pkey(struct ib_device *device, u8 port_num,
> +		 u16 pkey, u16 *index);
> +
> +/**
> + * ib_query_lmc - Returns the LMC of a port
> + * @device: Device to query.
> + * @port_num: Port to query on
> + * @lmc: Returned LMC
> + *
> + * ib_query_lmc() returns the LID mask control associated
> + * with port @port_num
> + */
> +int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc);
> +

I don't think we need this one in ib_verbs.h - it just does query_port once.
Let's keep the API simple. The only user is in mad.c - move it there and
make it static.

> +/**
>   * ib_alloc_pd - Allocates an unused protection domain.
>   * @device: The device on which to allocate the protection domain.
> * > Index: b/drivers/infiniband/core/Makefile > =================================================================== > --- a/drivers/infiniband/core/Makefile 2007-05-02 17:47:49.333553540 +0300 > +++ b/drivers/infiniband/core/Makefile 2007-05-02 17:48:30.741177998 +0300 > @@ -8,7 +8,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) += > $(user_access-y) > > ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ > - device.o fmr_pool.o cache.o > + device.o fmr_pool.o > > ib_mad-y := mad.o smi.o agent.o mad_rmpp.o > > Index: b/drivers/infiniband/core/cm.c > =================================================================== > --- a/drivers/infiniband/core/cm.c 2007-05-02 17:47:49.762477140 +0300 > +++ b/drivers/infiniband/core/cm.c 2007-05-02 17:48:30.744177464 +0300 > @@ -46,8 +46,8 @@ > #include > #include > > -#include > #include > +#include > #include "cm_msgs.h" > > MODULE_AUTHOR("Sean Hefty"); > @@ -275,7 +275,7 @@ static int cm_init_av_by_path(struct ib_ > > read_lock_irqsave(&cm.device_lock, flags); > list_for_each_entry(cm_dev, &cm.device_list, list) { > - if (!ib_find_cached_gid(cm_dev->device, &path->sgid, > + if (!ib_find_gid(cm_dev->device, &path->sgid, > &p, NULL)) { > port = &cm_dev->port[p-1]; > break; > @@ -286,7 +286,7 @@ static int cm_init_av_by_path(struct ib_ > if (!port) > return -EINVAL; > > - ret = ib_find_cached_pkey(cm_dev->device, port->port_num, > + ret = ib_find_pkey(cm_dev->device, port->port_num, > be16_to_cpu(path->pkey), &av->pkey_index); > if (ret) > return ret; > @@ -1382,7 +1382,7 @@ static int cm_req_handler(struct cm_work > cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]); > ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av); > if (ret) { > - ib_get_cached_gid(work->port->cm_dev->device, > + ib_query_gid(work->port->cm_dev->device, > work->port->port_num, 0, &work->path[0].sgid); > ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID, > &work->path[0].sgid, sizeof work->path[0].sgid, > Index: b/drivers/infiniband/core/cma.c 
> =================================================================== > --- a/drivers/infiniband/core/cma.c 2007-05-02 17:47:50.749301367 +0300 > +++ b/drivers/infiniband/core/cma.c 2007-05-02 17:48:30.746177108 +0300 > @@ -41,7 +41,6 @@ > > #include > #include > -#include > #include > #include > #include > @@ -325,7 +324,7 @@ static int cma_acquire_dev(struct rdma_i > } > > list_for_each_entry(cma_dev, &dev_list, list) { > - ret = ib_find_cached_gid(cma_dev->device, &gid, > + ret = ib_find_gid(cma_dev->device, &gid, > &id_priv->id.port_num, NULL); > if (!ret) { > ret = cma_set_qkey(cma_dev->device, > @@ -514,7 +513,7 @@ static int cma_ib_init_qp_attr(struct rd > struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; > int ret; > > - ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num, > + ret = ib_find_pkey(id_priv->id.device, id_priv->id.port_num, > ib_addr_get_pkey(dev_addr), > &qp_attr->pkey_index); > if (ret) > @@ -1658,11 +1657,11 @@ static int cma_bind_loopback(struct rdma > cma_dev = list_entry(dev_list.next, struct cma_device, list); > > port_found: > - ret = ib_get_cached_gid(cma_dev->device, p, 0, &gid); > + ret = ib_query_gid(cma_dev->device, p, 0, &gid); > if (ret) > goto out; > > - ret = ib_get_cached_pkey(cma_dev->device, p, 0, &pkey); > + ret = ib_query_pkey(cma_dev->device, p, 0, &pkey); > if (ret) > goto out; > > Index: b/drivers/infiniband/core/mad.c > =================================================================== > --- a/drivers/infiniband/core/mad.c 2007-05-02 17:47:50.423359423 +0300 > +++ b/drivers/infiniband/core/mad.c 2007-05-02 17:48:30.748176751 +0300 > @@ -34,7 +34,6 @@ > * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ > */ > #include > -#include > > #include "mad_priv.h" > #include "mad_rmpp.h" > @@ -1707,13 +1706,13 @@ static inline int rcv_has_same_gid(struc > if (!send_resp && rcv_resp) { > /* is request/response. 
*/ > if (!(attr.ah_flags & IB_AH_GRH)) { > - if (ib_get_cached_lmc(device, port_num, &lmc)) > + if (ib_query_lmc(device, port_num, &lmc)) Just do query_port here. > return 0; > return (!lmc || !((attr.src_path_bits ^ > rwc->wc->dlid_path_bits) & > ((1 << lmc) - 1))); > } else { > - if (ib_get_cached_gid(device, port_num, > + if (ib_query_gid(device, port_num, > attr.grh.sgid_index, &sgid)) > return 0; > return !memcmp(sgid.raw, rwc->recv_buf.grh->dgid.raw, > Index: b/drivers/infiniband/core/multicast.c > =================================================================== > --- a/drivers/infiniband/core/multicast.c 2007-05-02 17:47:51.014254173 +0300 > +++ b/drivers/infiniband/core/multicast.c 2007-05-02 17:48:30.749176573 +0300 > @@ -38,7 +38,6 @@ > #include > #include > > -#include > #include "sa.h" > > static void mcast_add_one(struct ib_device *device); > @@ -686,7 +685,7 @@ int ib_init_ah_from_mcmember(struct ib_d > u16 gid_index; > u8 p; > > - ret = ib_find_cached_gid(device, &rec->port_gid, &p, &gid_index); > + ret = ib_find_gid(device, &rec->port_gid, &p, &gid_index); > if (ret) > return ret; > > Index: b/drivers/infiniband/core/sa_query.c > =================================================================== > --- a/drivers/infiniband/core/sa_query.c 2007-05-02 17:47:49.689490140 +0300 > +++ b/drivers/infiniband/core/sa_query.c 2007-05-02 17:48:30.749176573 +0300 > @@ -47,7 +47,6 @@ > #include > > #include > -#include > #include "sa.h" > > MODULE_AUTHOR("Roland Dreier"); > @@ -477,7 +476,7 @@ int ib_init_ah_from_path(struct ib_devic > ah_attr->ah_flags = IB_AH_GRH; > ah_attr->grh.dgid = rec->dgid; > > - ret = ib_find_cached_gid(device, &rec->sgid, &port_num, > + ret = ib_find_gid(device, &rec->sgid, &port_num, > &gid_index); > if (ret) > return ret; > Index: b/drivers/infiniband/core/verbs.c > =================================================================== > --- a/drivers/infiniband/core/verbs.c 2007-05-02 17:47:49.091596637 +0300 > +++ 
b/drivers/infiniband/core/verbs.c 2007-05-02 17:48:30.750176395 +0300 > @@ -43,7 +43,6 @@ > #include > > #include > -#include > > int ib_rate_to_mult(enum ib_rate rate) > { > @@ -159,7 +158,7 @@ int ib_init_ah_from_wc(struct ib_device > ah_attr->ah_flags = IB_AH_GRH; > ah_attr->grh.dgid = grh->sgid; > > - ret = ib_find_cached_gid(device, &grh->dgid, &port_num, > + ret = ib_find_gid(device, &grh->dgid, &port_num, > &gid_index); > if (ret) > return ret; > Index: b/drivers/infiniband/hw/mthca/mthca_av.c > =================================================================== > --- a/drivers/infiniband/hw/mthca/mthca_av.c 2007-05-02 17:47:53.157872352 +0300 > +++ b/drivers/infiniband/hw/mthca/mthca_av.c 2007-05-02 17:48:30.751176217 +0300 > @@ -37,7 +37,6 @@ > #include > > #include > -#include > > #include "mthca_dev.h" > > @@ -279,7 +278,7 @@ int mthca_read_ah(struct mthca_dev *dev, > (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; > header->grh.flow_label = > ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); > - ib_get_cached_gid(&dev->ib_dev, > + ib_query_gid(&dev->ib_dev, > be32_to_cpu(ah->av->port_pd) >> 24, > ah->av->gid_index % dev->limits.gid_table_len, > &header->grh.source_gid); > Index: b/drivers/infiniband/hw/mthca/mthca_qp.c > =================================================================== > --- a/drivers/infiniband/hw/mthca/mthca_qp.c 2007-05-02 17:47:53.153873064 +0300 > +++ b/drivers/infiniband/hw/mthca/mthca_qp.c 2007-05-02 18:04:14.123981858 +0300 > @@ -40,9 +40,8 @@ > > #include > > -#include > -#include > #include > +#include > > #include "mthca_dev.h" > #include "mthca_cmd.h" > @@ -1485,11 +1484,10 @@ static int build_mlx_header(struct mthca > sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE; > sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); > if (!sqp->qp.ibqp.qp_num) > - ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port, > - sqp->pkey_index, &pkey); > + ib_query_pkey(&dev->ib_dev, sqp->qp.port, 
sqp->pkey_index, &pkey); > else > - ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port, > - wr->wr.ud.pkey_index, &pkey); > + ib_query_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey); > + > sqp->ud_header.bth.pkey = cpu_to_be16(pkey); > sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); > sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); > Index: b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > =================================================================== > --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-05-02 17:47:52.042071098 +0300 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-05-02 17:48:30.753175861 +0300 > @@ -33,7 +33,6 @@ > */ > > #include > -#include > #include > #include > #include > @@ -759,7 +758,7 @@ static int ipoib_cm_modify_tx_init(struc > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct ib_qp_attr qp_attr; > int qp_attr_mask, ret; > - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index); > + ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index); > if (ret) { > ipoib_warn(priv, "pkey 0x%x not in cache: %d\n", priv->pkey, ret); > return ret; > Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > =================================================================== > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-02 17:48:30.150283249 +0300 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-02 17:48:30.754175683 +0300 > @@ -38,7 +38,7 @@ > #include > #include > > -#include > +#include > > #include "ipoib.h" > > Index: b/drivers/infiniband/ulp/srp/ib_srp.c > =================================================================== > --- a/drivers/infiniband/ulp/srp/ib_srp.c 2007-05-02 17:47:52.336018740 +0300 > +++ b/drivers/infiniband/ulp/srp/ib_srp.c 2007-05-02 17:48:30.755175505 +0300 > @@ -48,8 +48,6 @@ > #include > #include > > -#include > - > #include "ib_srp.h" > > #define DRV_NAME "ib_srp" > @@ -164,7 +162,7 @@ static int 
srp_init_qp(struct srp_target > if (!attr) > return -ENOMEM; > > - ret = ib_find_cached_pkey(target->srp_host->dev->dev, > + ret = ib_find_pkey(target->srp_host->dev->dev, > target->srp_host->port, > be16_to_cpu(target->path.pkey), > &attr->pkey_index); > @@ -1780,7 +1778,7 @@ static ssize_t srp_create_target(struct > if (ret) > goto err; > > - ib_get_cached_gid(host->dev->dev, host->port, 0, &target->path.sgid); > + ib_query_gid(host->dev->dev, host->port, 0, &target->path.sgid); > > printk(KERN_DEBUG PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x " > "service_id %016llx dgid %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", > Index: b/drivers/infiniband/core/core_priv.h > =================================================================== > --- a/drivers/infiniband/core/core_priv.h 2007-05-02 17:47:50.519342327 +0300 > +++ b/drivers/infiniband/core/core_priv.h 2007-05-02 17:48:30.755175505 +0300 > @@ -46,7 +46,4 @@ void ib_device_unregister_sysfs(struct i > int ib_sysfs_setup(void); > void ib_sysfs_cleanup(void); > > -int ib_cache_setup(void); > -void ib_cache_cleanup(void); > - > #endif /* _CORE_PRIV_H */ > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST From swise at opengridcomputing.com Wed May 2 10:30:46 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 02 May 2007 12:30:46 -0500 Subject: [ofa-general] man pages for the rdma-cm Message-ID: <1178127046.18609.107.camel@stevo-desktop> Sean, Are there man pages for the rdma-cm in the pipeline? I think it would be great (requirement?) to have these for ofed-1.2 since we do have the other verbs man pages. I didn't know if this was in-progress or are we looking for volunteers... ;-) Steve. 
From swise at opengridcomputing.com Wed May 2 10:37:40 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 02 May 2007 12:37:40 -0500 Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to use send with invalidate In-Reply-To: References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> <20070502070849.GO8447@mellanox.co.il> Message-ID: <1178127460.18609.111.camel@stevo-desktop> On Wed, 2007-05-02 at 10:16 -0700, Roland Dreier wrote: > > > + if (ib_wr->send_flags & IB_SEND_SOLICITED > > > + && ib_wr->send_flags & IB_SEND_INVALIDATE) { > > > > How about > > if (ib_wr->send_flags & (IB_SEND_SOLICITED | IB_SEND_INVALIDATE)) > > These two aren't equivalent -- the first has an &&, yours works like ||. > Which is correct? The test needs to be 'if both are set'; Michael's version says 'if either is set'. From mhagen at iol.unh.edu Wed May 2 10:39:35 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Wed, 2 May 2007 13:39:35 -0400 (EDT) Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to use send with invalidate In-Reply-To: References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> <20070502070849.GO8447@mellanox.co.il> Message-ID: <48959.132.177.125.178.1178127575.squirrel@postal.iol.unh.edu> I believe that they are the same, thanks for the nice addition Michael. New patch follows.
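For reference, the two conditions debated in the quoted review can be compared in isolation. The sketch below is a standalone userspace illustration, not driver code: the flag values, the work-request type names and every `sketch_` identifier are hypothetical, and IB_SEND_INVALIDATE itself is the flag proposed in this thread rather than something assumed to exist elsewhere.

```c
/* Hypothetical flag bits; the real IB_SEND_* values live in kernel headers. */
enum {
	SKETCH_SEND_SOLICITED  = 1 << 0,
	SKETCH_SEND_INVALIDATE = 1 << 1
};

/* Hypothetical work-request types standing in for C2_WR_TYPE_SEND*. */
enum sketch_wr_type {
	SKETCH_WR_SEND,
	SKETCH_WR_SEND_SE,
	SKETCH_WR_SEND_INV,
	SKETCH_WR_SEND_SE_INV
};

/* True if at least one bit of mask is set: this is the "||" behaviour. */
static int sketch_any_set(unsigned int flags, unsigned int mask)
{
	return (flags & mask) != 0;
}

/* True only if every bit of mask is set: this is the "&&" behaviour. */
static int sketch_all_set(unsigned int flags, unsigned int mask)
{
	return (flags & mask) == mask;
}

/* Pick the work-request type the way the "both are set" test intends. */
static enum sketch_wr_type sketch_classify(unsigned int flags)
{
	if (sketch_all_set(flags,
			   SKETCH_SEND_SOLICITED | SKETCH_SEND_INVALIDATE))
		return SKETCH_WR_SEND_SE_INV;
	if (flags & SKETCH_SEND_SOLICITED)
		return SKETCH_WR_SEND_SE;
	if (flags & SKETCH_SEND_INVALIDATE)
		return SKETCH_WR_SEND_INV;
	return SKETCH_WR_SEND;
}
```

With only the solicited bit set, sketch_any_set() is true while sketch_all_set() is false, so a single-mask test written as `flags & (A | B)` would take the combined branch even though only one flag was requested.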
--- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-04-30 13:12:54.000000000 -0400 +++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-05-02 13:18:17.000000000 -0400 @@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str switch (ib_wr->opcode) { case IB_WR_SEND: - if (ib_wr->send_flags & IB_SEND_SOLICITED) { + if (ib_wr->send_flags & + (IB_SEND_SOLICITED | IB_SEND_INVALIDATE)) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV); + wr.sqwr.send.remote_stag = + cpu_to_be32(ib_wr->wr.invalidate.rkey); + } else if (ib_wr->send_flags & IB_SEND_SOLICITED) { c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE); - msg_size = sizeof(struct c2wr_send_req); + wr.sqwr.send.remote_stag = 0; + } else if (ib_wr->send_flags & IB_SEND_INVALIDATE) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV); + wr.sqwr.send.remote_stag = + cpu_to_be32(ib_wr->wr.invalidate.rkey); } else { c2_wr_set_id(&wr, C2_WR_TYPE_SEND); - msg_size = sizeof(struct c2wr_send_req); + wr.sqwr.send.remote_stag = 0; } - wr.sqwr.send.remote_stag = 0; - msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; + msg_size = sizeof(struct c2wr_send_req) + + sizeof(struct c2_data_addr) * ib_wr->num_sge; if (ib_wr->num_sge > qp->send_sgl_depth) { err = -EINVAL; break; -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From swise at opengridcomputing.com Wed May 2 10:46:50 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 02 May 2007 12:46:50 -0500 Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to use send with invalidate In-Reply-To: <48959.132.177.125.178.1178127575.squirrel@postal.iol.unh.edu> References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> <20070502070849.GO8447@mellanox.co.il> <48959.132.177.125.178.1178127575.squirrel@postal.iol.unh.edu> Message-ID: 
<1178128010.18609.116.camel@stevo-desktop> On Wed, 2007-05-02 at 13:39 -0400, mhagen at iol.unh.edu wrote: > I believe that they are the same, thanks for the nice addition Michael. > New patch follows. > > --- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-04-30 > 13:12:54.000000000 -0400 > +++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-05-02 > 13:18:17.000000000 -0400 > @@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str > > switch (ib_wr->opcode) { > case IB_WR_SEND: > - if (ib_wr->send_flags & IB_SEND_SOLICITED) { > + if (ib_wr->send_flags & > + (IB_SEND_SOLICITED | IB_SEND_INVALIDATE)) { This will set the opcode to SEND_SE_INV if either SEND_SOLICITED or SEND_INV is set. That's incorrect; you want your older code... > + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV); > + wr.sqwr.send.remote_stag = > + cpu_to_be32(ib_wr->wr.invalidate.rkey); > + } else if (ib_wr->send_flags & IB_SEND_SOLICITED) { > c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE); > - msg_size = sizeof(struct c2wr_send_req); > + wr.sqwr.send.remote_stag = 0; > + } else if (ib_wr->send_flags & IB_SEND_INVALIDATE) { > + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV); > + wr.sqwr.send.remote_stag = > + cpu_to_be32(ib_wr->wr.invalidate.rkey); > } else { > c2_wr_set_id(&wr, C2_WR_TYPE_SEND); > - msg_size = sizeof(struct c2wr_send_req); > + wr.sqwr.send.remote_stag = 0; > } > > - wr.sqwr.send.remote_stag = 0; > - msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; > + msg_size = sizeof(struct c2wr_send_req) + > + sizeof(struct c2_data_addr) * ib_wr->num_sge; > if (ib_wr->num_sge > qp->send_sgl_depth) { > err = -EINVAL; > break; > > From rdreier at cisco.com Wed May 2 10:49:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 May 2007 10:49:47 -0700 Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache In-Reply-To: <4638B4D5.7050709@voltaire.com> (Yosef Etigin's message of "Wed, 02 May 2007 18:57:09 +0300") References:
<4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com> Message-ID: > @@ -279,7 +278,7 @@ int mthca_read_ah(struct mthca_dev *dev, > (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; > header->grh.flow_label = > ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); > - ib_get_cached_gid(&dev->ib_dev, > + ib_query_gid(&dev->ib_dev, > be32_to_cpu(ah->av->port_pd) >> 24, > ah->av->gid_index % dev->limits.gid_table_len, > &header->grh.source_gid); > Index: b/drivers/infiniband/hw/mthca/mthca_qp.c > =================================================================== > --- a/drivers/infiniband/hw/mthca/mthca_qp.c 2007-05-02 17:47:53.153873064 +0300 > +++ b/drivers/infiniband/hw/mthca/mthca_qp.c 2007-05-02 18:04:14.123981858 +0300 > @@ -40,9 +40,8 @@ > > #include > > -#include > -#include > #include > +#include > > #include "mthca_dev.h" > #include "mthca_cmd.h" > @@ -1485,11 +1484,10 @@ static int build_mlx_header(struct mthca > sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE; > sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); > if (!sqp->qp.ibqp.qp_num) > - ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port, > - sqp->pkey_index, &pkey); > + ib_query_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey); > else > - ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port, > - wr->wr.ud.pkey_index, &pkey); > + ib_query_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey); > + > sqp->ud_header.bth.pkey = cpu_to_be16(pkey); > sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); > sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); These mthca changes can't be correct -- you are adding calls to sleeping functions from build_mlx_header(), which is always called with a spinlock held. You'll have to update mthca to keep track of the GID and P_Key tables internally to fix this. 
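One way to read the suggestion above is that the driver keeps its own copy of the P_Key table, refreshed from a context that is allowed to sleep, so that the send path only reads memory. A minimal userspace sketch of that idea follows. It is an illustration under stated assumptions, not mthca code: the `sketch_` names and the table length are hypothetical, no locking is shown (kernel code would need a lock protecting the table), and ignoring the top membership bit when searching is an assumption about the desired matching rule.

```c
#include <stdint.h>

#define SKETCH_PKEY_TABLE_LEN 8	/* hypothetical table size */

struct sketch_pkey_cache {
	uint16_t table[SKETCH_PKEY_TABLE_LEN];
	int len;
};

/*
 * Refresh the cached copy. In a driver this would run where sleeping is
 * allowed, for example after the device reports a P_Key table change.
 */
static void sketch_pkey_cache_update(struct sketch_pkey_cache *c,
				     const uint16_t *fresh, int n)
{
	int i;

	if (n > SKETCH_PKEY_TABLE_LEN)
		n = SKETCH_PKEY_TABLE_LEN;
	for (i = 0; i < n; ++i)
		c->table[i] = fresh[i];
	c->len = n;
}

/* Lookup by index: a plain memory read, safe where sleeping is forbidden. */
static int sketch_pkey_cache_get(const struct sketch_pkey_cache *c,
				 int index, uint16_t *pkey)
{
	if (index < 0 || index >= c->len)
		return -1;
	*pkey = c->table[index];
	return 0;
}

/* Find the index of a P_Key, ignoring the membership bit (assumption). */
static int sketch_pkey_cache_find(const struct sketch_pkey_cache *c,
				  uint16_t pkey)
{
	int i;

	for (i = 0; i < c->len; ++i)
		if ((c->table[i] & 0x7fff) == (pkey & 0x7fff))
			return i;
	return -1;
}
```

The point of the split is that only sketch_pkey_cache_update() would ever need to talk to the device; the two lookup helpers never block.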
Please test your code with CONFIG_DEBUG_SPINLOCK_SLEEP=y at least to see if there are any other places that are using the cache because they're not allowed to sleep. - R. From mhagen at iol.unh.edu Wed May 2 10:52:15 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Wed, 2 May 2007 13:52:15 -0400 (EDT) Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to use send with invalidate In-Reply-To: <1178128010.18609.116.camel@stevo-desktop> References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> <20070502070849.GO8447@mellanox.co.il> <48959.132.177.125.178.1178127575.squirrel@postal.iol.unh.edu> <1178128010.18609.116.camel@stevo-desktop> Message-ID: <41704.132.177.125.178.1178128335.squirrel@postal.iol.unh.edu> Ok, here is the patch reverted again. --- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-04-30 13:12:54.000000000 -0400 +++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-05-02 13:50:25.000000000 -0400 @@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str switch (ib_wr->opcode) { case IB_WR_SEND: - if (ib_wr->send_flags & IB_SEND_SOLICITED) { + if (ib_wr->send_flags & IB_SEND_SOLICITED + && ib_wr->send_flags & IB_SEND_INVALIDATE) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV); + wr.sqwr.send.remote_stag = + cpu_to_be32(ib_wr->wr.invalidate.rkey); + } else if (ib_wr->send_flags & IB_SEND_SOLICITED) { c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE); - msg_size = sizeof(struct c2wr_send_req); + wr.sqwr.send.remote_stag = 0; + } else if (ib_wr->send_flags & IB_SEND_INVALIDATE) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV); + wr.sqwr.send.remote_stag = + cpu_to_be32(ib_wr->wr.invalidate.rkey); } else { c2_wr_set_id(&wr, C2_WR_TYPE_SEND); - msg_size = sizeof(struct c2wr_send_req); + wr.sqwr.send.remote_stag = 0; } - wr.sqwr.send.remote_stag = 0; - msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; + msg_size = sizeof(struct c2wr_send_req) + + sizeof(struct c2_data_addr) * ib_wr->num_sge; if 
(ib_wr->num_sge > qp->send_sgl_depth) { err = -EINVAL; break; -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From rdreier at cisco.com Wed May 2 10:55:09 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 May 2007 10:55:09 -0700 Subject: [ofa-general] Re: [PATCH 3/3] mthca: provider-level caching of pkeys In-Reply-To: <4638B4FE.8010605@voltaire.com> (Yosef Etigin's message of "Wed, 02 May 2007 18:57:50 +0300") References: <4638B432.3060801@voltaire.com> <4638B4FE.8010605@voltaire.com> Message-ID: Oh, I didn't see this patch before... anyway along with all the minor whitespace,etc problems, there are two big issues: 1) this patch needs to be _before_ the previous 2/3 patch (or else the intermediate state is buggy) and 2) you need a GID cache too. From rdreier at cisco.com Wed May 2 10:56:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 May 2007 10:56:49 -0700 Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to use send with invalidate In-Reply-To: <41704.132.177.125.178.1178128335.squirrel@postal.iol.unh.edu> (mhagen@iol.unh.edu's message of "Wed, 2 May 2007 13:52:15 -0400 (EDT)") References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> <20070502070849.GO8447@mellanox.co.il> <48959.132.177.125.178.1178127575.squirrel@postal.iol.unh.edu> <1178128010.18609.116.camel@stevo-desktop> <41704.132.177.125.178.1178128335.squirrel@postal.iol.unh.edu> Message-ID: OK, can you please send both patches one more time with a proper changelog and Signed-off-by line? Thanks... (BTW, please don't include subscriber-only lists like ofalab at iol.unh.edu in the cc -- it's annoying to have all my replies generate a bounce) From mst at dev.mellanox.co.il Wed May 2 11:23:15 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 2 May 2007 21:23:15 +0300 Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache In-Reply-To: <4638B4D5.7050709@voltaire.com> References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com> Message-ID: <20070502182315.GQ22292@mellanox.co.il> > Quoting Yosef Etigin : > Subject: [PATCH 2/3] remove ib pkey gid and lmc cache > > Remove IB cache from core > > * Remove pkey, gid, and lmc caches > * Rewrite ib_find_gid and ib_find_pkey over blocking device queries > * Modify users of the cache to use these methods > > > Signed-off-by: Yosef Etigin > --- > drivers/infiniband/core/cache.c | 398 -------------------------------- > include/rdma/ib_cache.h | 118 --------- > drivers/infiniband/core/Makefile | 2 > drivers/infiniband/core/cm.c | 8 > drivers/infiniband/core/cma.c | 9 > drivers/infiniband/core/core_priv.h | 3 > drivers/infiniband/core/device.c | 143 ++++++++++- > drivers/infiniband/core/mad.c | 5 > drivers/infiniband/core/multicast.c | 3 > drivers/infiniband/core/sa_query.c | 3 > drivers/infiniband/core/verbs.c | 3 > drivers/infiniband/hw/mthca/mthca_av.c | 3 > drivers/infiniband/hw/mthca/mthca_qp.c | 10 > drivers/infiniband/ulp/ipoib/ipoib_cm.c | 3 > drivers/infiniband/ulp/ipoib/ipoib_ib.c | 2 > drivers/infiniband/ulp/srp/ib_srp.c | 6 > include/rdma/ib_verbs.h | 37 ++ > 17 files changed, 196 insertions(+), 560 deletions(-) I think this should be split in 2 as follow: 1. Implement ib_find_gid and ib_find_pkey over blocking device queries + Modify core and ULPs to use these methods This will already fix ipoib pkey bug you opened in bugzilla. 2. modify mthca to keep cache updated by snooping MAD, and remove the cache Not really high priority. -- MST From mst at dev.mellanox.co.il Wed May 2 11:42:43 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 2 May 2007 21:42:43 +0300 Subject: [ofa-general] Re: minutes from socket over RDMA discussion at workshop In-Reply-To: <4638BB97.8050403@hp.com> References: <4638BB97.8050403@hp.com> Message-ID: <20070502184243.GR22292@mellanox.co.il>
>
>                               Bulk Transfer                     "Latency"
>                         Unidir               Bidir
> Card             Mbit/s    SDx    SDr Mbit/s    SDx    SDr Tran/s    SDx    SDr
> -------------------------------------------------------------------------------
>                    nnnn  nnnnn  nnnnn   nnnn  nnnnn  nnnnn  nnnnn  nnnnn  nnnnn
> AD313A IPoIB       2970  4.418  4.544   3530   3.59   3.95  19290    n/a    n/a
> AD313A SDP         7810  0.453  1.048  12820   0.69   0.68  38030  26.29  26.29
> AD313A SDP p=0     7810  0.346  0.527  12670   0.42  0.043  19380    n/a    n/a
What's AD313A? What's the MTU for IPoIB (in OFED 1.2 it defaults to 64K)? -- MST
From mst at dev.mellanox.co.il Wed May 2 11:53:51 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 May 2007 21:53:51 +0300 Subject: [ofa-general] [PATCH RFC] comp_vector support Message-ID: <20070502185350.GS22292@mellanox.co.il> The following untested patch does the following:
1. extends ib_create_cq to pass in comp_vector parameter, and updates all ULP/providers.
2. mthca is enhanced to support multiple vectors if MSI-X is enabled.
3. uverbs and IPoIB CM are enhanced to use multiple vectors if available
Signed-off-by: Michael S. Tsirkin --- I plan to test and repost this in earnest soon, but wanted to first hear what people think about the API. Note this closely parallels what we already have for userspace.
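As a rough illustration of the API shape described above, the userspace sketch below models how a comp_vector argument could select an event queue. It mirrors two expressions from the patch (one EQ per completion vector placed after the fixed EQs, and the EQ index computed as the completion EQ base plus comp_vector), but the `sketch_` names, the range check and the round-robin helper are assumptions added for illustration and are not part of the patch.

```c
/* Stand-ins for the EQ enum in the patch; the completion EQ comes last. */
enum {
	SKETCH_EQ_CMD,
	SKETCH_EQ_ASYNC,
	SKETCH_EQ_COMP,
	SKETCH_NUM_EQ
};

/* Hypothetical device model: only the field this sketch needs. */
struct sketch_device {
	int num_comp_vectors;
};

/* Total EQs: the fixed ones, with one completion EQ per vector. */
static int sketch_num_eq(const struct sketch_device *dev)
{
	return dev->num_comp_vectors + SKETCH_NUM_EQ - 1;
}

/*
 * EQ table index for a CQ created on comp_vector. The range check is an
 * assumption of this sketch; it returns -1 for an invalid vector.
 */
static int sketch_cq_eq_index(const struct sketch_device *dev,
			      int comp_vector)
{
	if (comp_vector < 0 || comp_vector >= dev->num_comp_vectors)
		return -1;
	return SKETCH_EQ_COMP + comp_vector;
}

/* One possible caller policy: spread the n-th CQ round-robin over vectors. */
static int sketch_pick_vector(const struct sketch_device *dev,
			      unsigned int n)
{
	return (int)(n % (unsigned int)dev->num_comp_vectors);
}
```

A caller that does not care about placement would pass vector 0, which matches the updated ib_create_cq() callers in the diff that simply add a trailing 0 argument.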
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 7fabb42..45d269b 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -161,9 +161,14 @@ static int alloc_name(char *name) */ struct ib_device *ib_alloc_device(size_t size) { + struct ib_device *dev; BUG_ON(size < sizeof (struct ib_device)); - return kzalloc(size, GFP_KERNEL); + dev = kzalloc(size, GFP_KERNEL); + if (dev) + dev->num_comp_vectors = 1; + + return dev; } EXPORT_SYMBOL(ib_alloc_device); diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6edfecf..85ccf13 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2771,7 +2771,7 @@ static int ib_mad_port_open(struct ib_device *device, cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; port_priv->cq = ib_create_cq(port_priv->device, ib_mad_thread_completion_handler, - NULL, port_priv, cq_size); + NULL, port_priv, cq_size, 0); if (IS_ERR(port_priv->cq)) { printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n"); ret = PTR_ERR(port_priv->cq); diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 4fd75af..6b9390f 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -802,7 +802,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file, INIT_LIST_HEAD(&obj->async_list); cq = file->device->ib_dev->create_cq(file->device->ib_dev, cmd.cqe, - file->ucontext, &udata); + cmd.comp_vector, file->ucontext, &udata); if (IS_ERR(cq)) { ret = PTR_ERR(cq); goto err_file; diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index f8bc822..d44e547 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -752,7 +752,7 @@ static void ib_uverbs_add_one(struct ib_device *device) spin_unlock(&map_lock); uverbs_dev->ib_dev = device; - uverbs_dev->num_comp_vectors = 1; + 
uverbs_dev->num_comp_vectors = device->num_comp_vectors; uverbs_dev->dev = cdev_alloc(); if (!uverbs_dev->dev) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index ccdf93d..86ed8af 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -609,11 +609,11 @@ EXPORT_SYMBOL(ib_destroy_qp); struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, void (*event_handler)(struct ib_event *, void *), - void *cq_context, int cqe) + void *cq_context, int cqe, int comp_vector) { struct ib_cq *cq; - cq = device->create_cq(device, cqe, NULL, NULL); + cq = device->create_cq(device, cqe, comp_vector, NULL, NULL); if (!IS_ERR(cq)) { cq->device = device; diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index 607c09b..46ea16b 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -290,7 +290,7 @@ static int c2_destroy_qp(struct ib_qp *ib_qp) return 0; } -static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, +static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata) { diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index af28a31..4f76b2e 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -139,7 +139,7 @@ static int iwch_destroy_cq(struct ib_cq *ib_cq) return 0; } -static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, +static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, int comp_vector, struct ib_ucontext *ib_context, struct ib_udata *udata) { diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index e2cdc1a..67f0670 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ 
b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -113,7 +113,7 @@ struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int real_qp_num) return ret; } -struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, +struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata) { diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 95fd59f..aff96ac 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -123,7 +123,7 @@ int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq); void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq); -struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, +struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata); diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 3b23d67..dbe2723 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -375,7 +375,7 @@ static int ehca_create_aqp1(struct ehca_shca *shca, u32 port) return -EPERM; } - ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10); + ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10, 0); if (IS_ERR(ibcq)) { ehca_err(&shca->ib_device, "Cannot create AQP1 CQ."); return PTR_ERR(ibcq); diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index ea78e6d..cfca5d1 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -204,7 +204,7 @@ static void send_complete(unsigned long data) * * Called by ib_create_cq() in the generic verbs code. 
*/ -struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, +struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata) { diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 7c4929f..865966d 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -734,7 +734,7 @@ int ipath_destroy_srq(struct ib_srq *ibsrq); int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, +struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata); diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index efd79ef..61ff333 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -779,7 +779,7 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) return 0; } -int mthca_init_cq(struct mthca_dev *dev, int nent, +int mthca_init_cq(struct mthca_dev *dev, int nent, int comp_vector, struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq) { @@ -790,6 +790,7 @@ int mthca_init_cq(struct mthca_dev *dev, int nent, cq->ibcq.cqe = nent - 1; cq->is_kernel = !ctx; + cq->eq = MTHCA_EQ_COMP + comp_vector; cq->cqn = mthca_alloc(&dev->cq_table.alloc); if (cq->cqn == -1) @@ -844,7 +845,7 @@ int mthca_init_cq(struct mthca_dev *dev, int nent, else cq_context->logsize_usrpage |= cpu_to_be32(dev->driver_uar.index); cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); - cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[cq->eq].eqn); cq_context->pd = cpu_to_be32(pdn); cq_context->lkey = cpu_to_be32(cq->buf.mr.ibmr.lkey); cq_context->cqn = cpu_to_be32(cq->cqn); @@ 
-954,7 +955,7 @@ void mthca_free_cq(struct mthca_dev *dev, spin_unlock_irq(&dev->cq_table.lock); if (dev->mthca_flags & MTHCA_FLAG_MSI_X) - synchronize_irq(dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector); + synchronize_irq(dev->eq_table.eq[cq->eq].msi_x_vector); else synchronize_irq(dev->pdev->irq); diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index b7e42ef..fcfb0e2 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -96,7 +96,8 @@ enum { MTHCA_EQ_CMD, MTHCA_EQ_ASYNC, MTHCA_EQ_COMP, - MTHCA_NUM_EQ + MTHCA_NUM_EQ, + MTHCA_NUM_EQS = 32 }; enum { @@ -497,7 +498,7 @@ int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); -int mthca_init_cq(struct mthca_dev *dev, int nent, +int mthca_init_cq(struct mthca_dev *dev, int nent, int comp_vector, struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq); void mthca_free_cq(struct mthca_dev *dev, diff --git a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c index 8ec9fa1..6e38f2a 100644 --- a/drivers/infiniband/hw/mthca/mthca_eq.c +++ b/drivers/infiniband/hw/mthca/mthca_eq.c @@ -161,6 +161,11 @@ struct mthca_eqe { u8 owner; } __attribute__((packed)); +static inline int mthca_num_eq(struct mthca_dev *dev) +{ + return dev->ib_dev.num_comp_vectors + MTHCA_NUM_EQ - 1; +} + #define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) #define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) @@ -657,7 +662,7 @@ static void mthca_free_irqs(struct mthca_dev *dev) if (dev->eq_table.have_irq) free_irq(dev->pdev->irq, dev); - for (i = 0; i < MTHCA_NUM_EQ; ++i) + for (i = 0; i < mthca_num_eq(dev); ++i) if (dev->eq_table.eq[i].have_irq) free_irq(dev->eq_table.eq[i].msi_x_vector, dev->eq_table.eq + i); @@ -824,12 +829,37 @@ void mthca_unmap_eq_icm(struct mthca_dev *dev) 
__free_page(dev->eq_table.icm_page); } +static inline const char *eq_name(int i) +{ + switch (i) { + case MTHCA_EQ_ASYNC: + return DRV_NAME " (async)"; + case MTHCA_EQ_CMD: + return DRV_NAME " (cmd)"; + default: + return DRV_NAME " (comp)"; + } +} + +static inline int eq_size(struct mthca_dev *dev, int i) +{ + switch (i) { + case MTHCA_EQ_ASYNC: + return MTHCA_NUM_ASYNC_EQE; + case MTHCA_EQ_CMD: + return MTHCA_NUM_CMD_EQE; + default: + return dev->limits.num_cqs; + } +} + + int mthca_init_eq_table(struct mthca_dev *dev) { int err; u8 status; u8 intr; - int i; + int i, eqn; err = mthca_alloc_init(&dev->eq_table.alloc, dev->limits.num_eqs, @@ -857,39 +887,23 @@ int mthca_init_eq_table(struct mthca_dev *dev) intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? 128 : dev->eq_table.inta_pin; - err = mthca_create_eq(dev, dev->limits.num_cqs + MTHCA_NUM_SPARE_EQE, - (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, - &dev->eq_table.eq[MTHCA_EQ_COMP]); - if (err) - goto err_out_unmap; - - err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE + MTHCA_NUM_SPARE_EQE, - (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, - &dev->eq_table.eq[MTHCA_EQ_ASYNC]); - if (err) - goto err_out_comp; - - err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE + MTHCA_NUM_SPARE_EQE, - (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 130 : intr, - &dev->eq_table.eq[MTHCA_EQ_CMD]); - if (err) - goto err_out_async; + for (eqn = 0; eqn < mthca_num_eq(dev); ++eqn) { + err = mthca_create_eq(dev, eq_size(dev, eqn) + MTHCA_NUM_SPARE_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[eqn]); + if (err) + goto err_out_eq; + } if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { - static const char *eq_name[] = { - [MTHCA_EQ_COMP] = DRV_NAME " (comp)", - [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", - [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" - }; - - for (i = 0; i < MTHCA_NUM_EQ; ++i) { + for (i = 0; i < mthca_num_eq(dev); ++i) { err = request_irq(dev->eq_table.eq[i].msi_x_vector, mthca_is_memfree(dev) ? 
mthca_arbel_msi_x_interrupt : mthca_tavor_msi_x_interrupt, - 0, eq_name[i], dev->eq_table.eq + i); + 0, eq_name(i), dev->eq_table.eq + i); if (err) - goto err_out_cmd; + goto err_out_irq; dev->eq_table.eq[i].have_irq = 1; } } else { @@ -899,7 +913,7 @@ int mthca_init_eq_table(struct mthca_dev *dev) mthca_tavor_interrupt, IRQF_SHARED, DRV_NAME, dev); if (err) - goto err_out_cmd; + goto err_out_eq; dev->eq_table.have_irq = 1; } @@ -929,17 +943,13 @@ int mthca_init_eq_table(struct mthca_dev *dev) return 0; -err_out_cmd: +err_out_irq: mthca_free_irqs(dev); - mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); -err_out_async: - mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); - -err_out_comp: - mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); +err_out_eq: + for (i = 0; i < eqn; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); -err_out_unmap: mthca_unmap_eq_regs(dev); err_out_free: @@ -959,7 +969,7 @@ void mthca_cleanup_eq_table(struct mthca_dev *dev) mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); - for (i = 0; i < MTHCA_NUM_EQ; ++i) + for (i = 0; i < mthca_num_eq(dev); ++i) mthca_free_eq(dev, &dev->eq_table.eq[i]); mthca_unmap_eq_regs(dev); diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 773145e..83a54df 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -976,24 +976,27 @@ static void mthca_release_regions(struct pci_dev *pdev, static int mthca_enable_msi_x(struct mthca_dev *mdev) { - struct msix_entry entries[3]; - int err; + struct msix_entry entries[MTHCA_NUM_EQS]; + int i, err; - entries[0].entry = 0; - entries[1].entry = 1; - entries[2].entry = 2; + for (i = 0; i < MTHCA_NUM_EQS; ++i) + entries[i].entry = i; - err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + err = pci_enable_msix(mdev->pdev, entries, MTHCA_NUM_EQS); if (err) { if (err > 0) - mthca_info(mdev, "Only %d MSI-X vectors 
available, " - "not using MSI-X\n", err); - return err; + mthca_info(mdev, "Only %d MSI-X vectors available.", err); + + if (err < MTHCA_NUM_EQ) { + mthca_info(mdev, "Not using MSI-X: %d\n", err); + pci_disable_msix(mdev->pdev); + return err; + } } - mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; - mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; - mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + mdev->ib_dev.num_comp_vectors = err - MTHCA_NUM_EQ + 1; + for (i = 0; i < err; ++i) + mdev->eq_table.eq[i ].msi_x_vector = entries[i].vector; return 0; } diff --git a/drivers/infiniband/hw/mthca/mthca_profile.c b/drivers/infiniband/hw/mthca/mthca_profile.c index 26bf86d..834b303 100644 --- a/drivers/infiniband/hw/mthca/mthca_profile.c +++ b/drivers/infiniband/hw/mthca/mthca_profile.c @@ -59,7 +59,6 @@ enum { }; enum { - MTHCA_NUM_EQS = 32, MTHCA_NUM_PDS = 1 << 15 }; diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 47e6fd4..0b125b0 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -663,6 +663,7 @@ static int mthca_destroy_qp(struct ib_qp *qp) } static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries, + int comp_vector, struct ib_ucontext *context, struct ib_udata *udata) { @@ -706,7 +707,7 @@ static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries, for (nent = 1; nent <= entries; nent <<= 1) ; /* nothing */ - err = mthca_init_cq(to_mdev(ibdev), nent, + err = mthca_init_cq(to_mdev(ibdev), nent, comp_vector, context ? to_mucontext(context) : NULL, context ? 
ucmd.pdn : to_mdev(ibdev)->driver_pd.pd_num, cq); diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h index 1d266ac..591d953 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.h +++ b/drivers/infiniband/hw/mthca/mthca_provider.h @@ -202,6 +202,7 @@ struct mthca_cq { spinlock_t lock; int refcount; int cqn; + int eq; u32 cons_index; struct mthca_cq_buf buf; struct mthca_cq_resize *resize_buf; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 0c4e59b..1778fd6 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -789,7 +789,7 @@ static int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn, } p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p, - ipoib_sendq_size + 1); + ipoib_sendq_size + 1, priv->ca->num_comp_vectors > 1); if (IS_ERR(p->cq)) { ret = PTR_ERR(p->cq); ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 7f3ec20..5c3c6a4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -187,7 +187,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) if (!ret) size += ipoib_recvq_size; - priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size); + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); goto out_free_mr; diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c index 1fc9674..89d6008 100644 --- a/drivers/infiniband/ulp/iser/iser_verbs.c +++ b/drivers/infiniband/ulp/iser/iser_verbs.c @@ -76,7 +76,7 @@ static int iser_create_device_ib_res(struct iser_device *device) iser_cq_callback, iser_cq_event_callback, (void *)device, - ISER_MAX_CQ_LEN); + 
ISER_MAX_CQ_LEN, 0); if (IS_ERR(device->cq)) goto cq_err; diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 5e8ac57..33c249a 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -197,7 +197,7 @@ static int srp_create_target_ib(struct srp_target_port *target) return -ENOMEM; target->cq = ib_create_cq(target->srp_host->dev->dev, srp_completion, - NULL, target, SRP_CQ_SIZE); + NULL, target, SRP_CQ_SIZE, 0); if (IS_ERR(target->cq)) { ret = PTR_ERR(target->cq); goto out; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 765589f..a16e509 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -912,6 +912,8 @@ struct ib_device { u32 flags; + int num_comp_vectors; + struct iw_cm_verbs *iwcm; int (*query_device)(struct ib_device *device, @@ -978,6 +980,7 @@ struct ib_device { struct ib_recv_wr *recv_wr, struct ib_recv_wr **bad_recv_wr); struct ib_cq * (*create_cq)(struct ib_device *device, int cqe, + int comp_vector, struct ib_ucontext *context, struct ib_udata *udata); int (*destroy_cq)(struct ib_cq *cq); @@ -1358,13 +1361,15 @@ static inline int ib_post_recv(struct ib_qp *qp, * @cq_context: Context associated with the CQ returned to the user via * the associated completion and event handlers. * @cqe: The minimum size of the CQ. + * @comp_vector - Completion vector used to signal completion events. + * Must be >= 0 and < context->num_comp_vectors. * * Users can examine the cq structure to determine the actual CQ size. */ struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, void (*event_handler)(struct ib_event *, void *), - void *cq_context, int cqe); + void *cq_context, int cqe, int comp_vector); /** * ib_resize_cq - Modifies the capacity of the CQ. 
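For anyone following the mthca part of the patch, the vector-to-EQ index arithmetic can be sketched in plain user-space C. This is only an illustration under stated assumptions, not driver code: the enum mirrors the CMD/ASYNC/COMP order shown in the mthca_dev.h hunk, the helper names (num_comp_vectors_for, num_eq_for, eq_index_for) are hypothetical, and "granted" simply stands for however many MSI-X vectors were actually obtained.

```c
#include <stdio.h>

/* Same ordering as the enum in the mthca_dev.h hunk of the patch. */
enum {
	EQ_CMD,     /* 0 */
	EQ_ASYNC,   /* 1 */
	EQ_COMP,    /* 2: first completion EQ */
	NUM_EQ      /* 3: minimum number of EQs needed */
};

/*
 * Hypothetical helper: completion vectors available when `granted`
 * MSI-X vectors were obtained (mirrors "err - MTHCA_NUM_EQ + 1").
 */
static int num_comp_vectors_for(int granted)
{
	return granted - NUM_EQ + 1;
}

/* Hypothetical helper: total EQ count (mirrors mthca_num_eq()). */
static int num_eq_for(int num_comp_vectors)
{
	return num_comp_vectors + NUM_EQ - 1;
}

/*
 * Hypothetical helper: EQ index a CQ would use for a given completion
 * vector (mirrors "MTHCA_EQ_COMP + comp_vector"). Returns -1 when the
 * vector is outside [0, num_comp_vectors), the range the new kernel-doc
 * line for ib_create_cq() requires.
 */
static int eq_index_for(int comp_vector, int num_comp_vectors)
{
	if (comp_vector < 0 || comp_vector >= num_comp_vectors)
		return -1;
	return EQ_COMP + comp_vector;
}

/* Print the mapping for one assumed vector count. */
static void print_mapping(int granted)
{
	int ncv = num_comp_vectors_for(granted);
	int v;

	printf("%d vectors -> %d completion vectors, %d EQs\n",
	       granted, ncv, num_eq_for(ncv));
	for (v = 0; v < ncv; ++v)
		printf("  comp_vector %d -> eq[%d]\n", v, eq_index_for(v, ncv));
}
```

Calling print_mapping(8), for instance, walks completion vectors 0 through 5 onto EQ indices 2 through 7, with indices 0 and 1 left for the command and async EQs. With only three vectors there is a single completion vector, which matches the pre-patch layout.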
-- MST From rick.jones2 at hp.com Wed May 2 11:56:04 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 02 May 2007 11:56:04 -0700 Subject: [ofa-general] Re: minutes from socket over RDMA discussion at workshop In-Reply-To: <20070502184243.GR22292@mellanox.co.il> References: <4638BB97.8050403@hp.com> <20070502184243.GR22292@mellanox.co.il> Message-ID: <4638DEC4.9040700@hp.com> Michael S. Tsirkin wrote: >> Bulk Transfer "Latency" >> Unidir Bidir >> Card Mbit/s SDx SDr Mbit/s SDx SDr Tran/s SDx SDr >>--------------------------------------------------------------------------- >> nnnn nnnnn nnnnn nnnn nnnnn nnnnn nnnnn nnnnn nnnnn >> AD313A IPoIB 2970 4.418 4.544 3530 3.59 3.95 19290 n/a n/a >> AD313A SDP 7810 0.453 1.048 12820 0.69 0.68 38030 26.29 26.29 >> AD313A SDP p=0 7810 0.346 0.527 12670 0.42 0.043 19380 n/a n/a > > > What's AD313A? What's the MTU for IPoIB (in OFED 1.2 it defaults to 64K)? The OFED is whatever is in RHEL5 - someone said that might be 1.1. I had some problems getting all of it removed enough to get 1.2 to load there - in particular the ib_sdp stuff. The AD313A shows as this in lspci: 08:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20) AD313A is how people order it from HP for their Integrity servers. The work was targeted at HPers, hence the use of the HP product number in the write-up. I just didn't think to provide the decoder ring in the post :) rick jones From mst at dev.mellanox.co.il Wed May 2 12:00:24 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 May 2007 22:00:24 +0300 Subject: [ofa-general] Re: minutes from socket over RDMA discussion at workshop In-Reply-To: <4638DEC4.9040700@hp.com> References: <4638BB97.8050403@hp.com> <20070502184243.GR22292@mellanox.co.il> <4638DEC4.9040700@hp.com> Message-ID: <20070502190024.GU22292@mellanox.co.il> > Quoting Rick Jones : > Subject: Re: minutes from socket over RDMA discussion at workshop > > Michael S. 
Tsirkin wrote: > >> Bulk Transfer "Latency" > >> Unidir Bidir > >> Card Mbit/s SDx SDr Mbit/s SDx SDr Tran/s SDx SDr > >>--------------------------------------------------------------------------- > >> nnnn nnnnn nnnnn nnnn nnnnn nnnnn nnnnn nnnnn nnnnn > >>AD313A IPoIB 2970 4.418 4.544 3530 3.59 3.95 19290 n/a n/a > >>AD313A SDP 7810 0.453 1.048 12820 0.69 0.68 38030 26.29 26.29 > >>AD313A SDP p=0 7810 0.346 0.527 12670 0.42 0.043 19380 n/a n/a > > > > > >What's AD313A? What's the MTU for IPoIB (in OFED 1.2 it defaults to 64K)? > > > The OFED is whatever is in RHEL5 - someone said that might be 1.1. So you are not comparing apples to apples: SDP uses buffers of 64K, IPoIB datagram mode - 2K. > I had > some problems getting all of it removed enough to get 1.2 to load there - > in particular the ib_sdp stuff. Report the problem, people'll try to help. -- MST From rick.jones2 at hp.com Wed May 2 12:10:48 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 02 May 2007 12:10:48 -0700 Subject: [ofa-general] Re: minutes from socket over RDMA discussion at workshop In-Reply-To: <20070502190024.GU22292@mellanox.co.il> References: <4638BB97.8050403@hp.com> <20070502184243.GR22292@mellanox.co.il> <4638DEC4.9040700@hp.com> <20070502190024.GU22292@mellanox.co.il> Message-ID: <4638E238.5020002@hp.com> >>The OFED is whatever is in RHEL5 - someone said that might be 1.1. > > > So you are not comparing apples to apples: > SDP uses buffers of 64K, IPoIB datagram mode - 2K. I won't dispute that, I'll just say it is what people running RHEL5 will see out of the box. > Report the problem, people'll try to help. I thought I had, albeit perhaps too tangentially. Anyway, I'm running on a Debian 4.0 with a 2.6.21.1 kernel on it (from a kernel.org tar) now and will try a contemporary 1.2 on that. 
It seems there is already some OFED stuff in the 2.6.21.1 kernel, so if there are suggestions on how to remove that for a successful 1.2 install, or other suggestions on how to have a successful 1.2 install on 2.6.21.1 I'm all ears. When it comes to manipulating bits here I'm still on the very steep part of the learning curve. happy benchmarking, rick jones From mhagen at iol.unh.edu Wed May 2 12:21:57 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Wed, 2 May 2007 15:21:57 -0400 (EDT) Subject: [ofa-general] Re: [PATCH] infiniband: modify ammasso driver to use send with invalidate In-Reply-To: References: <46669.132.177.125.178.1178029253.squirrel@postal.iol.unh.edu> <20070502070849.GO8447@mellanox.co.il> <48959.132.177.125.178.1178127575.squirrel@postal.iol.unh.edu> <1178128010.18609.116.camel@stevo-desktop> <41704.132.177.125.178.1178128335.squirrel@postal.iol.unh.edu> Message-ID: <44075.132.177.125.178.1178133717.squirrel@postal.iol.unh.edu> Modification to the ammasso driver to use the iWARP verbs SEND with INV and SEND with SE and INV. 
Signed-off-by: Mikkel Hagen --- linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-04-30 13:12:54.000000000 -0400 +++ linux-2.6.21.1/drivers/infiniband/hw/amso1100/c2_qp.c 2007-05-02 13:50:25.000000000 -0400 @@ -810,16 +810,25 @@ int c2_post_send(struct ib_qp *ibqp, str switch (ib_wr->opcode) { case IB_WR_SEND: - if (ib_wr->send_flags & IB_SEND_SOLICITED) { + if (ib_wr->send_flags & IB_SEND_SOLICITED + && ib_wr->send_flags & IB_SEND_INVALIDATE) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE_INV); + wr.sqwr.send.remote_stag = + cpu_to_be32(ib_wr->wr.invalidate.rkey); + } else if (ib_wr->send_flags & IB_SEND_SOLICITED) { c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE); - msg_size = sizeof(struct c2wr_send_req); + wr.sqwr.send.remote_stag = 0; + } else if (ib_wr->send_flags & IB_SEND_INVALIDATE) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_INV); + wr.sqwr.send.remote_stag = + cpu_to_be32(ib_wr->wr.invalidate.rkey); } else { c2_wr_set_id(&wr, C2_WR_TYPE_SEND); - msg_size = sizeof(struct c2wr_send_req); + wr.sqwr.send.remote_stag = 0; } - wr.sqwr.send.remote_stag = 0; - msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; + msg_size = sizeof(struct c2wr_send_req) + + sizeof(struct c2_data_addr) * ib_wr->num_sge; if (ib_wr->num_sge > qp->send_sgl_depth) { err = -EINVAL; break; -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From mhagen at iol.unh.edu Wed May 2 12:25:19 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Wed, 2 May 2007 15:25:19 -0400 (EDT) Subject: [ofa-general] Re: [PATCH] infiniband: add support for invalidate stag In-Reply-To: <20070501205138.GG8447@mellanox.co.il> References: <37245.132.177.125.178.1177963626.squirrel@postal.iol.unh.edu> <20070501035708.GJ13293@mellanox.co.il> <1178040392.2309.72.camel@stevo-desktop> <20070501205138.GG8447@mellanox.co.il> 
Message-ID: <44080.132.177.125.178.1178133919.squirrel@postal.iol.unh.edu> Patch to add support for the iWARP verbs SEND with INV and SEND with SE and INV. Signed-off-by: Mikkel Hagen --- linux-2.6.21.1/include/rdma/ib_verbs.h 2007-05-02 15:17:24.000000000 -0400 +++ linux-2.6.21.1/include/rdma/ib_verbs.h 2007-05-02 15:19:05.000000000 -0400 @@ -611,7 +611,8 @@ enum ib_send_flags { IB_SEND_FENCE = 1, IB_SEND_SIGNALED = (1<<1), IB_SEND_SOLICITED = (1<<2), - IB_SEND_INLINE = (1<<3) + IB_SEND_INLINE = (1<<3), + IB_SEND_INVALIDATE = (1<<4) }; struct ib_sge { @@ -646,6 +647,9 @@ struct ib_send_wr { u16 pkey_index; /* valid for GSI only */ u8 port_num; /* valid for DR SMPs on switch only */ } ud; + struct { + u32 rkey; + } invalidate; } wr; }; -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From mst at dev.mellanox.co.il Wed May 2 12:37:06 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 May 2007 22:37:06 +0300 Subject: [ofa-general] Re: minutes from socket over RDMA discussion at workshop In-Reply-To: <4638E238.5020002@hp.com> References: <4638BB97.8050403@hp.com> <20070502184243.GR22292@mellanox.co.il> <4638DEC4.9040700@hp.com> <20070502190024.GU22292@mellanox.co.il> <4638E238.5020002@hp.com> Message-ID: <20070502193706.GZ22292@mellanox.co.il> > Quoting Rick Jones : > Subject: Re: minutes from socket over RDMA discussion at workshop > > >>The OFED is whatever is in RHEL5 - someone said that might be 1.1. > > > > > >So you are not comparing apples to apples: > >SDP uses buffers of 64K, IPoIB datagram mode - 2K. > > I won't dispute that, I'll just say it is what people running RHEL5 will > see out of the box. OK but when some people say "IPoIB gives same BW as SDP" they mean 1.2. > >Report the problem, people'll try to help. 
> > I thought I had, albeit perhaps too tangentially. Anyway, I'm running on a > Debian 4.0 with a 2.6.21.1 kernel on it (from a kernel.org tar) now and > will try a contemporary 1.2 on that. It seems there is already some OFED > stuff in the 2.6.21.1 kernel, so if there are suggestions on how to remove > that for a successful 1.2 install, or other suggestions on how to have a > successful 1.2 install on 2.6.21.1 I'm all ears. When it comes to > manipulating bits here I'm still on the very steep part of the learning > curve. OFED should really do that for you automatically, and will even try to put it all back on uninstall. -- MST From rick.jones2 at hp.com Wed May 2 13:16:08 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 02 May 2007 13:16:08 -0700 Subject: [ofa-general] Re: minutes from socket over RDMA discussion at workshop In-Reply-To: <20070502193706.GZ22292@mellanox.co.il> References: <4638BB97.8050403@hp.com> <20070502184243.GR22292@mellanox.co.il> <4638DEC4.9040700@hp.com> <20070502190024.GU22292@mellanox.co.il> <4638E238.5020002@hp.com> <20070502193706.GZ22292@mellanox.co.il> Message-ID: <4638F188.30802@hp.com> Michael S. Tsirkin wrote: >>Quoting Rick Jones : >>Subject: Re: minutes from socket over RDMA discussion at workshop >> >> >>>>The OFED is whatever is in RHEL5 - someone said that might be 1.1. 
>>> >>> >>>So you are not comparing apples to apples: >>>SDP uses buffers of 64K, IPoIB datagram mode - 2K. >> >>I won't dispute that, I'll just say it is what people running RHEL5 will >>see out of the box. > > > OK but when some people say "IPoIB gives same BW as SDP" they mean 1.2. Fair enough. The joy of moving targets :) >>>Report the problem, people'll try to help. >> >>I thought I had, albeit perhaps too tangentially. Anyway, I'm running on a >>Debian 4.0 with a 2.6.21.1 kernel on it (from a kernel.org tar) now and >>will try a contemporary 1.2 on that. It seems there is already some OFED >>stuff in the 2.6.21.1 kernel, so if there are suggestions on how to remove >>that for a successful 1.2 install, or other suggestions on how to have a >>successful 1.2 install on 2.6.21.1 I'm all ears. When it comes to >>manipulating bits here I'm still on the very steep part of the learning >>curve. > > > OFED should really do that for you automatically, > and will even try to put it all back on uninstall. I'll likely be trying later this afternoon (US Pacific time). rick jones From mst at dev.mellanox.co.il Wed May 2 13:23:17 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 May 2007 23:23:17 +0300 Subject: [ofa-general] Re: minutes from socket over RDMA discussion at workshop In-Reply-To: <4638F188.30802@hp.com> References: <4638BB97.8050403@hp.com> <20070502184243.GR22292@mellanox.co.il> <4638DEC4.9040700@hp.com> <20070502190024.GU22292@mellanox.co.il> <4638E238.5020002@hp.com> <20070502193706.GZ22292@mellanox.co.il> <4638F188.30802@hp.com> Message-ID: <20070502202317.GD22292@mellanox.co.il> > Quoting Rick Jones : > Subject: Re: minutes from socket over RDMA discussion at workshop > > Michael S. Tsirkin wrote: > >>Quoting Rick Jones : > >>Subject: Re: minutes from socket over RDMA discussion at workshop > >> > >> > >>>>The OFED is whatever is in RHEL5 - someone said that might be 1.1. 
> >>> > >>> > >>>So you are not comparing apples to apples: > >>>SDP uses buffers of 64K, IPoIB datagram mode - 2K. > >> > >>I won't dispute that, I'll just say it is what people running RHEL5 will > >>see out of the box. > > > > > >OK but when some people say "IPoIB gives same BW as SDP" they mean 1.2. > > Fair enough. The joy of moving targets :) > > >>>Report the problem, people'll try to help. > >> > >>I thought I had, albeit perhaps too tangentially. Anyway, I'm running on > >>a Debian 4.0 with a 2.6.21.1 kernel on it (from a kernel.org tar) now and > >>will try a contemporary 1.2 on that. It seems there is already some OFED > >>stuff in the 2.6.21.1 kernel, so if there are suggestions on how to > >>remove that for a successful 1.2 install, or other suggestions on how to > >>have a successful 1.2 install on 2.6.21.1 I'm all ears. When it comes to > >>manipulating bits here I'm still on the very steep part of the learning > >>curve. > > > > > >OFED should really do that for you automatically, > >and will even try to put it all back on uninstall. > > I'll likely be trying later this afternoon (US Pacific time). I'm afraid I'm going offline now. 
-- MST From rick.jones2 at hp.com Wed May 2 14:24:48 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 02 May 2007 14:24:48 -0700 Subject: [ofa-general] OFED-1.2-20070502-0600 on Debian Message-ID: <463901A0.5060905@hp.com> Sooo, I grabbed the latest 1.2 tar from and untarred it onto my Debian 4.0 with 2.6.21.1 kernel from kernel.org, did ./install.sh and a bunch of stuff like this flew past: /root/OFED-1.2-20070502-0600/build_env.sh: line 77: rpm: command not found /root/OFED-1.2-20070502-0600/build_env.sh: line 78: rpm: command not found /root/OFED-1.2-20070502-0600/build_env.sh: line 79: rpm: command not found /root/OFED-1.2-20070502-0600/build_env.sh: line 319: rpm: command not found /root/OFED-1.2-20070502-0600/build_env.sh: line 320: rpm: command not found /root/OFED-1.2-20070502-0600/build_env.sh: line 321: rpm: command not found /root/OFED-1.2-20070502-0600/build_env.sh: line 327: rpm: command not found I still got the menu, from which I selected (IIRC) 2, selected the basic OFED bits (option 1) at which point it said: Below is the list of OFED packages that you have chosen (some may have been added by the installer due to package dependencies): ib_ipoib ib_mthca ib_verbs kernel-ib kernel-ib-devel libcxgb3 libcxgb3-devel libibcm libibcm-devel libibverbs libibverbs-devel libibverbs-utils libmthca libmthca-devel librdmacm mstflint perftest ofed-docs ofed-scripts ERROR: The gcc package is required to run libibverbs now, there _is_ a gcc installed, it was used to build the kernel I'm running: hpcpc106:~/OFED-1.2-20070502-0600# which gcc /usr/bin/gcc hpcpc106:~/OFED-1.2-20070502-0600# gcc -v Using built-in specs. 
Target: ia64-linux-gnu Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --disable-libssp --with-system-libunwind --enable-checking=release ia64-linux-gnu Thread model: posix gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21) Is the script perhaps getting confused and thinking this is RHEL or something, or in using a Debian system am I going beyond what is "supported" and/or "known to work?" The readme's which came along with the bits don't seem to say much about Debian, just RedHat and SuSE. hpcpc106:~/OFED-1.2-20070502-0600# uname -a Linux hpcpc106 2.6.21.1-raj #1 SMP Tue May 1 13:57:28 PDT 2007 ia64 GNU/Linux hpcpc106:~/OFED-1.2-20070502-0600# should I be taking a different path to build here? rick jones From jwong at datallegro.com Wed May 2 14:28:12 2007 From: jwong at datallegro.com (Jeffrey Wong) Date: Wed, 2 May 2007 17:28:12 -0400 Subject: [ofa-general] Help building ib-bonding References: <1178087589.14131.3.camel@vladsk-laptop> Message-ID: Hello, I am using kernel 2.6.18-8.1.1.el5 x86_64. I have changed build_env.sh to set build_32bit=-1. Thanks in advance. Jeff When installing all modules I am getting the following errors. 
+ make -C /lib/modules/2.6.18-8.1.1.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding make: Entering directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64' CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:78: /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_inactive_flags': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: (Each undeclared identifier is reported only once /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: for each function it appears in.) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_active_flags': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_compute_features': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1233: warning: comparison of distinct pointer types lacks a cast /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_enslave': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1449: error: 'IFF_BONDING' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_release': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1848: error: 'IFF_BONDING' undeclared (first use in this function) 
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_arp_rcv': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:2548: error: 'IFF_BONDING' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_netdev_event': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:3390: error: 'IFF_BONDING' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_init': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4374: warning: assignment discards qualifiers from pointer target type /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4386: error: 'IFF_BONDING' undeclared (first use in this function) make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o] Error 1 make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding] Error 2 make: Leaving directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64' + echo ' Building IB bonding driver failed' Building IB bonding driver failed + exit 1 error: Bad exit status from /var/tmp/rpm-tmp.99179 (%build) From mst at dev.mellanox.co.il Wed May 2 14:49:44 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 3 May 2007 00:49:44 +0300 Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian In-Reply-To: <463901A0.5060905@hp.com> References: <463901A0.5060905@hp.com> Message-ID: <20070502214944.GF10009@mellanox.co.il> > Quoting Rick Jones : > Subject: OFED-1.2-20070502-0600 on Debian > > Sooo, I grabbed the latest 1.2 tar from > > and untarred it onto my Debian 4.0 with 2.6.21.1 kernel from kernel.org, > did ./install.sh and a bunch of stuff like this flew past: > > /root/OFED-1.2-20070502-0600/build_env.sh: line 77: rpm: command not found > /root/OFED-1.2-20070502-0600/build_env.sh: line 78: rpm: command not found > /root/OFED-1.2-20070502-0600/build_env.sh: line 79: rpm: command not found > /root/OFED-1.2-20070502-0600/build_env.sh: line 319: rpm: command not found > /root/OFED-1.2-20070502-0600/build_env.sh: line 320: rpm: command not found > /root/OFED-1.2-20070502-0600/build_env.sh: line 321: rpm: command not found > /root/OFED-1.2-20070502-0600/build_env.sh: line 327: rpm: command not found rpm is not installed. I don't know how to solve this, Vlad might be able to answer tomorrow. > should I be taking a different path to build here? Maybe, maybe not. There *is* another way which should be enough to test IPoIB: try getting a kernel tarball from http://git.openfabrics.org/~vlad/builds/ If you unpack this, you can configure/make/make install. Installer will backup your original modules under the prefix. Keep the source around and you'll be able to make uninstall to get back to original system. Note 1: default configure settings are often not what you want: run ./configure --help first of all to see which modules to select (--with-ipoib-mod and --with-mthca-mod I think) and to set a prefix. Note 2: having quilt tool installed is recommended - will let you add/remove patches later. Note 3: this way you get no userspace. openfabrics tarballs are under the same directory, and a similiar method works there. 
external tarballs (MPI, bonding, etc) are supplied to us in SRPM format so this trick does not work for them. -- MST From rdreier at cisco.com Wed May 2 15:10:45 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 May 2007 15:10:45 -0700 Subject: [ofa-general] Re: [PATCH] libmlx4: fix post inline when posting a list In-Reply-To: <200705021712.24400.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Wed, 2 May 2007 17:12:24 +0300") References: <200705021712.24400.jackm@dev.mellanox.co.il> Message-ID: Thanks, applied. However in the future > diff --git a/src/userspace/libmlx4/src/qp.c b/src/userspace/libmlx4/src/qp.c please use my git tree rather than some other setup, so that I can apply things with patch -p1 From rdreier at cisco.com Wed May 2 15:11:38 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 May 2007 15:11:38 -0700 Subject: [ofa-general] Re: [patch] mlx4_ib: return proper num s/g entries for rq at create_qp In-Reply-To: <200705021714.05933.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Wed, 2 May 2007 17:14:05 +0300") References: <200705021714.05933.jackm@dev.mellanox.co.il> Message-ID: thanks, applied From rdreier at cisco.com Wed May 2 15:21:15 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 02 May 2007 15:21:15 -0700 Subject: [ofa-general] Re: [PATCH] ipoib/cm: compliance fix In-Reply-To: <20070502123112.GI22292@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 May 2007 15:31:12 +0300") References: <20070502123112.GI22292@mellanox.co.il> Message-ID: thanks, applied > Important enough for -stable? I don't think so -- there's no way to hit this in practice right now, so I don't think it meets the -stable criteria. BTW, looking at the code that happens to be in the patch context: > if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { I think this annotation is unclear and I'm not sure gcc will do what is intended here (and I'm not sure what is intended). 
Should this be

	if (likely(!(wr_id & IPOIB_CM_RX_UPDATE_MASK))) {

or

	if (unlikely(!(wr_id & IPOIB_CM_RX_UPDATE_MASK))) {

...seems as if "unlikely" is appropriate.

 - R.

From mst at dev.mellanox.co.il Wed May 2 15:21:31 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 01:21:31 +0300
Subject: [ofa-general] Re: [PATCH] libmlx4: fix post inline when posting a list
In-Reply-To:
References: <200705021712.24400.jackm@dev.mellanox.co.il>
Message-ID: <20070502222131.GI10009@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: [PATCH] libmlx4: fix post inline when posting a list
>
> Thanks, applied. However in the future
>
> > diff --git a/src/userspace/libmlx4/src/qp.c b/src/userspace/libmlx4/src/qp.c
>
> please use my git tree rather than some other setup, so that I can
> apply things with patch -p1

BTW, note that recent git-am version can accept -p.

-- 
MST

From rdreier at cisco.com Wed May 2 15:25:10 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 02 May 2007 15:25:10 -0700
Subject: [ofa-general] Re: [PATCH] libmlx4: fix post inline when posting a list
In-Reply-To: <20070502222131.GI10009@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 May 2007 01:21:31 +0300")
References: <200705021712.24400.jackm@dev.mellanox.co.il> <20070502222131.GI10009@mellanox.co.il>
Message-ID:

 > BTW, note that recent git-am version can accept -p.

Yes, but it's annoying to have to count '/'s just to apply a patch.
And it also doesn't inspire much confidence that something has been
tested against the tree I'm going to apply it to when it was obviously
generated from a different tree.

 - R.

From mst at dev.mellanox.co.il Wed May 2 15:31:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 01:31:27 +0300
Subject: [ofa-general] Re: [PATCH] ipoib/cm: compliance fix
In-Reply-To:
References: <20070502123112.GI22292@mellanox.co.il>
Message-ID: <20070502223127.GJ10009@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: [PATCH] ipoib/cm: compliance fix
>
> thanks, applied
>
> > Important enough for -stable?
>
> I don't think so -- there's no way to hit this in practice right now,
> so I don't think it meets the -stable criteria.
>
> BTW, looking at the code that happens to be in the patch context:
>
> > 	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
>
> I think this annotation is unclear and I'm not sure gcc will do what
> is intended here (and I'm not sure what is intended). Should this be
>
> 	if (likely(!(wr_id & IPOIB_CM_RX_UPDATE_MASK))) {
>
> or
>
> 	if (unlikely(!(wr_id & IPOIB_CM_RX_UPDATE_MASK))) {
>
> ...seems as if "unlikely" is appropriate.

I expect unlikely to be equivalent: likely means typically == 1,
unlikely means typically == 0, so !likely(x) is equivalent to
unlikely(!x).

I did expect gcc to do the right thing here, but go ahead and test if you like.
And I do agree "unlikely" version is more clear.

-- 
MST

From mst at dev.mellanox.co.il Wed May 2 15:37:32 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 01:37:32 +0300
Subject: [ofa-general] Re: [PATCH] libmlx4: fix post inline when posting a list
In-Reply-To:
References: <200705021712.24400.jackm@dev.mellanox.co.il> <20070502222131.GI10009@mellanox.co.il>
Message-ID: <20070502223732.GK10009@mellanox.co.il>

> Yes, but it's annoying to have to count '/'s just to apply a patch.

Maybe we should teach git-am to guess the strip level automatically :)
But you are right, for now.

> And it also doesn't inspire much confidence that something has been
> tested against the tree I'm going to apply it to when it was obviously
> generated from a different tree.

No, this is coming from your tree, don't worry.
Actually we just have a script that does git-checkout of several trees
each into a separate subdirectory. And it seems Jack works on several
trees at the same time, so he's using quilt at the top level to manage
patches across them all, and that's what he has sent you, instead of
generating the patches with git which would have gotten the right level,
but otherwise equivalent.

Hope that's clear.

-- 
MST

From pradeep at us.ibm.com Wed May 2 17:30:06 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Wed, 2 May 2007 17:30:06 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <20070502064549.GN8447@mellanox.co.il>
Message-ID:

Firstly thanks for the review Michael. My responses/questions below, and
yes will fix some of the style issues that you have pointed out.
The new functions (and labels) had the srq/nosrq suffixes for
maintainability purposes and also to keep a structure similar to the
current IPOIB.

Pradeep
pradeep at us.ibm.com

"Michael S. Tsirkin" wrote on 05/01/2007 11:46:48 PM:

> OK, we are making progress (line-wrapping issues aside :). And there seems to
> be some whitespace damage, too. Pls take care of this.
>
> I think the handle_rx_wc split is going in the right direction,
> but let's take this through all the datapath.
>
> I went over the patch in a bit more depth, and I have some questions:
>
> > +	for (i = 0; i < ipoib_recvq_size; ++i) {
> > +		if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index,
>
> ...
>
> > +		if (ipoib_cm_post_receive(dev, i << 32 | index)) {
>
> 1. It seems there are multiple QPs mapped to a single CQ -
> and each gets ipoib_recvq_size recv WRs above.
> Is that right? How do you prevent CQ overrun then?

Good point! Looking at the IB spec it appears that upon CQ overflow
it results in a Local Work Queue catastrophic error and puts the QP
(receiver side) in an error state. Hence, I am speculating that the
sending side will see an error.
This will result in the sending side destroying the QP and sending a
DREQ message, which will remove the receive side QP.

A new set of QPs will be created on the send side (this is RC) and
the connection setup starts over again. It will continue, but at a
degraded rate. Is this correct? What other alternative do you suggest
- create a CQ per QP? Is the max number of CQs an issue to consider, if
we adopt this approach?

>
> > +	/* Find an empty rx_index_ring[] entry */
> > +	for (index = 0; index < NOSRQ_INDEX_RING_SIZE; index++)
> > +		if (priv->cm.rx_index_ring[index] == NULL)
> > +			break;
> > +
> > +	if ( index == NOSRQ_INDEX_RING_SIZE) {
> > +		spin_unlock_irq(&priv->lock);
> > +		printk(KERN_WARNING "NOSRQ supports a max of %d RC "
> > +		       "QPs. That limit has now been reached\n",
> > +		       NOSRQ_INDEX_RING_SIZE);
> > +		return -EINVAL;
> > +	}
>
> 2. So, when QP limit has been reached, remote side will get
> a reject with custom reject reason?
> If so, it seems that since the remote does not know
> what the reason for reject is, it'll just retry
> on the next packet, and again and again. Basically,
> connectivity is denied where it previously worked fine
> by falling back on datagram mode?
>

Good point again! Yes, this would be an apt description. However, I have
a few questions (see below)

> One way to fix this, could be to try and use a reject reason
> that will tell the remote "I'm busy, switch to datagram mode
> for a loooooong time". Using path mtu discovery here might be useful
> to actually have it come back and retry after several minutes.
>

How does one send a reject reason - through CM? I could unset the bit
IPOIB_FLAG_ADMIN_CM in flag, but will that not transition all the
QPs to datagram mode? What we need is a mechanism that will let the
current set of QPs be in connected mode, and transition only the new ones
to datagram mode if connected mode cannot be supported as in this case.
How to do that?

> *In theory*, we could get this even with SRQ -
> if the *HCA* starts running out of RC QPs - it is just
> never happening in practice as current HCAs support #QPs larger
> than a maximum IB subnet size.
> So I might post a patch to implement this, stay tuned.

This will be interesting.

>
> > +	spin_lock_irqsave(&priv->lock, flags);
> > +	rx_ptr = priv->cm.rx_index_ring[index];
> > +	spin_unlock_irqrestore(&priv->lock, flags);
>
> 3. You never actually test the rx_ptr that you got.
> So why does locking help?
> A better way to destroy QPs might be to move it to error state first.

In ipoib_cm_stale_task(): priv->cm.rx_index_ring[p->index] = NULL;
this assignment does happen under lock. All I need to do (in the code snippet
above you point out) is check if rx_ptr == NULL, if so drop the packet.
I did think about this one, but never implemented it.

>
> We actually need something like this for CM too - stay tuned for a patch.
>
> I also commented on some style issues below.
>
> > Note 1: I have retained the code to avoid IB_WC_RETRY_EXC_ERR while
> > performing interoperability tests As discussed in this mailing list that
> > may be a CM bug or have the various HCA address it. Hence I would like to
> > separate out that issue from this patch.
> > At a future point when the issue gets resolved I can provide
> > another patch to change the retry_count values back to 0 if need be.
>
> The correct way to separate it, in my opinion, is to set retry_count = 0,
> and (for now) apply a work-around patch at your site before testing.
> We really don't want to paper over this bug, in my opinion.

Ok, will reset this back to 0, but that is not (my) preferred way. If
someone were to pick up the code and try it with retry_count=0, the HCAs
will not inter-operate as is. Hence the hesitation.

>
> >
> > struct ipoib_cm_tx {
> > @@ -177,6 +185,7 @@ struct ipoib_cm_dev_priv {
> > 	struct ib_wc ibwc[IPOIB_NUM_WC];
> > 	struct ib_sge rx_sge[IPOIB_CM_RX_SG];
> > 	struct ib_recv_wr rx_wr;
> > +	struct ipoib_cm_rx **rx_index_ring;
> > };
> >
> > /*
>
> Isn't "ring" a bit of a misnomer?

Yes, this is a misnomer. This is a vestige of an earlier thing that I
thought of. Will change it to something else more appropriate.

> > +	unsigned long flags;
> > +
> > +	index = id & NOSRQ_INDEX_MASK ;
> > +	wr_id = id >> 32;
>
> So wr_id has always, ever, 32 lower bits set - why make it u64 then?

Because I later use it as wr_id << 32 | index | IPOIB_CM_OP_NOSRQ. I could
have used index | IPOIB_CM_OP_NOSRQ instead.

>
> > +	/* There is a slender chance of a race between the stale_task
> > +	 * running after a period of inactivity and the receipt of
> > +	 * a packet being processed at about the same instant.
> > +	 * Hence the lock */
>
> I think you can get rid of this, by changing the stale task code:
> move QP to error, and wait for WRs posted to complete.
> Then there won't be any more completions for this QP.
>
> As it is, I'm not convinced you can't get a completion after
> QP has been removed out of the array - so it seems the race hasn't
> been solved here?

We have discussed this above.

>
> We actually need something like this for CM too -
> stay tuned for a patch.
>
> > +	spin_lock_irqsave(&priv->lock, flags);
> > +	rx_ptr = priv->cm.rx_index_ring[index];
> > +	spin_unlock_irqrestore(&priv->lock, flags);
> > +
> > +	priv->cm.rx_wr.wr_id = wr_id << 32 | index | IPOIB_CM_OP_NOSRQ;
>
> Isn't this just id, again?

This is id | IPOIB_CM_OP_NOSRQ.
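The work request ID layout debated above can be sketched in plain C as a standalone illustration. This is only a sketch under stated assumptions: the names `EX_INDEX_MASK`, `EX_OP_NOSRQ`, `ex_pack_wr_id`, `ex_ring_slot` and `ex_qp_index` are hypothetical, and the mask and flag values are invented because the quoted patch does not show the real definitions of NOSRQ_INDEX_MASK or IPOIB_CM_OP_NOSRQ. It also assumes the receive-ring slot is a 32-bit value, in which case it has to be widened before the shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values: the patch under review does not show the real
 * definitions of NOSRQ_INDEX_MASK or IPOIB_CM_OP_NOSRQ. */
#define EX_INDEX_MASK UINT64_C(0x3ff)      /* low bits: slot in the per-QP table */
#define EX_OP_NOSRQ   (UINT64_C(1) << 31)  /* flag bit marking a no-SRQ receive */

/* Build a 64-bit work request ID: receive-ring slot in the upper
 * 32 bits, flag bit and QP index in the lower 32 bits. The cast
 * widens the slot before shifting, because shifting a 32-bit value
 * left by 32 is undefined behaviour in C. */
static inline uint64_t ex_pack_wr_id(uint32_t ring_slot, uint32_t qp_index)
{
	assert(qp_index <= EX_INDEX_MASK);
	return ((uint64_t)ring_slot << 32) | EX_OP_NOSRQ | qp_index;
}

/* Recover the receive-ring slot from the upper 32 bits. */
static inline uint32_t ex_ring_slot(uint64_t id)
{
	return (uint32_t)(id >> 32);
}

/* Recover the QP index; the mask also strips the flag bit. */
static inline uint32_t ex_qp_index(uint64_t id)
{
	return (uint32_t)(id & EX_INDEX_MASK);
}
```

A completion handler would call `ex_qp_index()` to find the connection and `ex_ring_slot()` to find the buffer, which mirrors the `id & NOSRQ_INDEX_MASK` and `id >> 32` lines quoted above.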
> > +static int ipoib_cm_post_receive(struct net_device *dev, u64 id)
> > +{
> > +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +	int ret;
> > +
> > +	if (priv->cm.srq)
> > +		ret = post_receive_srq(dev, id);
> > +	else
> > +		ret = post_receive_nosrq(dev, id);
> > +
> > +	return ret;
> > +}
>
> I think you can split this one now that srq/nonsrq completions are
> handled separately.

I don't understand this comment.

From pradeep at us.ibm.com Wed May 2 18:39:15 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Wed, 2 May 2007 18:39:15 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To:
Message-ID:

> > >
> > > +	spin_lock_irqsave(&priv->lock, flags);
> > > +	rx_ptr = priv->cm.rx_index_ring[index];
> > > +	spin_unlock_irqrestore(&priv->lock, flags);
> >
> > 3. You never actually test the rx_ptr that you got.
> > So why does locking help?
> > A better way to destroy QPs might be to move it to error state first.
>
> In ipoib_cm_stale_task(): priv->cm.rx_index_ring[p->index] = NULL;
> this assignment does happen under lock. All I need to do (in the code snippet
> above you point out) is check if rx_ptr == NULL, if so drop the packet.
> I did think about this one, but never implemented it.
>

I get what you suggest. Move the QP to error state under a lock and then
destroy it subsequently. Since the QP is in error state, nothing else
should come through and we can eliminate the locking - right?

Yes, this is doable, just that we need to check if rx_ptr == NULL
and drop the packet if that is the case.

Pradeep
pradeep at us.ibm.com

From mst at dev.mellanox.co.il Wed May 2 20:32:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 06:32:30 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To:
References: <20070502064549.GN8447@mellanox.co.il>
Message-ID: <20070503033230.GM10009@mellanox.co.il>

> > > +static int ipoib_cm_post_receive(struct net_device *dev, u64 id)
> > > +{
> > > +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> > > +	int ret;
> > > +
> > > +	if (priv->cm.srq)
> > > +		ret = post_receive_srq(dev, id);
> > > +	else
> > > +		ret = post_receive_nosrq(dev, id);
> > > +
> > > +	return ret;
> > > +}
> >
> > I think you can split this one now that srq/nonsrq completions are
> > handled separately.
>
> I don't understand this comment.

Since you now have 2 handle_wc routines for srq/nonsrq,
call the appropriate one directly.
Generally, I think we can get rid of if (srq) tests on data path.

-- 
MST

From mst at dev.mellanox.co.il Wed May 2 20:55:47 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 06:55:47 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To:
References: <20070502064549.GN8447@mellanox.co.il>
Message-ID: <20070503035547.GN10009@mellanox.co.il>

> > > +		if (ipoib_cm_post_receive(dev, i << 32 | index)) {
> >
> > 1. It seems there are multiple QPs mapped to a single CQ -
> > and each gets ipoib_recvq_size recv WRs above.
> > Is that right? How do you prevent CQ overrun then?
>
> Good point! Looking at the IB spec it appears that upon CQ overflow
> it results in a Local Work Queue catastrophic error and puts the QP
> (receiver side) in an error state.

Look further in spec - you get CQ error, too.

> Hence, I am speculating that the
> sending side will see an error. This will result in the sending side
> destroying the QP and sending a DREQ message, which will remove the
> receive side QP.
>
> A new set of QPs will be created on the send side (this is RC) and
> the connection setup starts over again. It will continue, but at a
> degraded rate.
> Is this correct? What other alternative do you suggest
> - create a CQ per QP? Is the max number of CQs an issue to consider, if
> we adopt this approach?

We were switching to NAPI though, and NAPI kind of forces you to
use a common CQ, I think.

-- 
MST

From mst at dev.mellanox.co.il Wed May 2 21:47:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 07:47:11 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To:
References: <20070502064549.GN8447@mellanox.co.il>
Message-ID: <20070503044711.GO10009@mellanox.co.il>

> > > Note 1: I have retained the code to avoid IB_WC_RETRY_EXC_ERR while
> > > performing interoperability tests As discussed in this mailing list that
> > > may be a CM bug or have the various HCA address it. Hence I would like to
> > > separate out that issue from this patch. At a future point when the issue
> > > gets resolved I can provide another patch to change the retry_count values
> > > back to 0 if need be.
> >
> > The correct way to separate it, in my opinion, is to set retry_count = 0,
> > and (for now) apply a work-around patch at your site before testing.
> > We really don't want to paper over this bug, in my opinion.
>
> Ok, will reset this back to 0, but that is not (my) preferred way. If
> someone were to pick up the code and try it with retry_count=0, the HCAs
> will not inter-operate as is. Hence the hesitation.

BTW, why do you ignore the option to use UC QP?
Even taking this single issue aside, I think that UC is a better QP type choice
for IPoIB than RC - you get away from RNR errors (so you can prepost less data,
and you can even reset some RQs temporarily, moving WRs between them,
without affecting TX), and you get send completion sooner, so you can use
less memory for send buffers and smaller TX queues.

With UC, we might get stale TX connections, so a way to detect and
handle them will need to be designed.
-- 
MST

From k_mahesh85 at yahoo.co.in Wed May 2 23:50:15 2007
From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh)
Date: Thu, 3 May 2007 07:50:15 +0100 (BST)
Subject: [ofa-general] [query] SMI nodeinfo, port_info structures
Message-ID: <735493.40597.qm@web8323.mail.in.yahoo.com>

Hi,

Even though nobody except the ipath driver is using the structures
nodeinfo and port_info currently aren't these structures should be in smi.h?
Because these structures are not specific to any hardware but they are
specific to the SMI.
And can anyone tell me why some fields have big endian (__bexx) data type
and others have normal (uxx) data type in these structures?

-Mahesh

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
blind sense Jeff bit thick into his sandwich. told Stacy dead chuckled. damaged Well thank market you very much for t amount punch Gavin had it all planned. Like spun most elegantly guys, he kn Thank you. level Marcie blown followed his concentrate nose instructions, and sure enou But old competition you're war completely stride missing the point. Being One of swiftly them was smoking, charming pen sail said Jeff. I caught argument Oh get real. Obviously neither of knot stale grow us has to woshelf awkwardly fraternal Really? bit Great. Anybody I know? She took a eye deep breath. head His adjustment name knot is Jeff Feing Then pugilistic how do level enormously you hour explain all these people I don Stace, I don't sleepy move brass hair know if you've noticed or not, sour She's pretty osteal too. You become sure you want around to risk giWas it thunder some obscure and escape nerve spoken exotic tobacco? saidI've got a better insurance rode library idea, she drop sat back down on As request they drove cup up, Jeff swung happily kiss the right rear doo comparison heard tick Gavin receipt tossed her the remote control. Knock you cheerful I'm just deliver returning sweep damage the favor, Jeff mused. Sh against Yeah, Dana concurred. explain I wouldn't be steel manage at all s Well, you right had about a fifty bomb smash card percent success ra shelf rule Just gentle an average breed cigarette, I think. Suddenly, the begin expression on unripe poison her showed husbands face t paint clear Stacy now had a rung sly glint in shirt her eyes. As a maHenry...defeated smite What roll does advertisement that have to do with anything? What's that? Hey, didn't like we once meet enter his steam folks at shoed a PTA me Dana handed print the guard his learning jacket root yell and thanked hi forgo She picked up the spring remote, winter and started wrung to flip t powerful Let's balance walk see went what this is. 

From yosefe at voltaire.com Thu May 3 01:02:56 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 03 May 2007 11:02:56 +0300
Subject: [ofa-general] Re: [PATCH 3/3] mthca: provider-level caching of pkeys
In-Reply-To:
References: <4638B432.3060801@voltaire.com> <4638B4FE.8010605@voltaire.com>
Message-ID: <46399730.9040902@voltaire.com>

Roland Dreier wrote:
> Oh, I didn't see this patch before...
> anyway along with all the minor whitespace,etc problems, there are two
> big issues: 1) this patch needs to be _before_ the previous 2/3 patch
> (or else the intermediate state is buggy) and 2) you need a GID cache too.

1) ok, i'll do that as MST suggested
2) why do we need a GID cache?

what are the other problems you found there?
--Yossi

From yosefe at voltaire.com Thu May 3 01:14:04 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 03 May 2007 11:14:04 +0300
Subject: [ofa-general] Re: [PATCH 3/3] mthca: provider-level caching of pkeys
In-Reply-To: <46399730.9040902@voltaire.com>
References: <4638B432.3060801@voltaire.com> <4638B4FE.8010605@voltaire.com> <46399730.9040902@voltaire.com>
Message-ID: <463999CC.6080800@voltaire.com>

Yosef Etigin wrote:
> 2) why do we need a GID cache?

Sorry, didn't notice it's called from mthca_read_ah

--Yossi

From vlad at lists.openfabrics.org Thu May 3 02:36:48 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu, 3 May 2007 02:36:48 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070503-0200 daily build status
Message-ID: <20070503093649.502EBE608E9@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-22.ELsmp

Failed:

From yosefe at voltaire.com Thu May 3 02:40:11 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 03 May 2007 12:40:11 +0300
Subject: [ofa-general] [PATCH 2/3 v2] remove ib pkey gid and lmc cache
In-Reply-To: <20070502182315.GQ22292@mellanox.co.il>
References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com> <20070502182315.GQ22292@mellanox.co.il>
Message-ID: <4639ADFB.70707@voltaire.com>

don't use ib cache in core and ulp's

v1:
  * Add ib_find_gid and ib_find_pkey, over uncached device queries
  * Modify users of the cache in core and ulp's to use the uncached methods

changes from v2:
  * don't remove the cache completely, and still use it within mthca driver.
    the mthca changes and complete removal of the cache - in the next patch

Signed-off-by: Yosef Etigin
---
 drivers/infiniband/core/cm.c            |    8 -
 drivers/infiniband/core/cma.c           |    9 --
 drivers/infiniband/core/device.c        |  137 ++++++++++++++++++++++++++++++--
 drivers/infiniband/core/mad.c           |    5 -
 drivers/infiniband/core/multicast.c     |    3
 drivers/infiniband/core/sa_query.c      |    3
 drivers/infiniband/core/verbs.c         |    3
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |    3
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |    2
 drivers/infiniband/ulp/srp/ib_srp.c     |    6 -
 include/rdma/ib_verbs.h                 |   37 ++++++++
 11 files changed, 184 insertions(+), 32 deletions(-)

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-03 11:25:58.878535870 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-03 12:34:12.639368278 +0300
@@ -149,6 +149,20 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+
+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
+}
+
+
 /**
  * ib_alloc_device - allocate an IB device struct
  * @size:size of structure to allocate
@@ -592,6 +606,122 @@ int ib_modify_port(struct ib_device *dev
 }
 EXPORT_SYMBOL(ib_modify_port);
 
+/**
+ * ib_find_gid - Returns the port number and index of a GID
+ * @device: Device to query.
+ * @gid: GID to look for
+ * @port_num: Returned port number
+ * @index: Returned index
+ *
+ * ib_find_gid() returns the index of @pkey in the pkey table
+ * on port @port_num
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		u8 *port_num, u16 *index)
+{
+	struct ib_port_attr *tprops = NULL;
+	union ib_gid tmp_gid;
+	int ret;
+	int port;
+	int i;
+
+	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
+
+	for (port = start_port(device); port <= end_port(device); ++port) {
+		ret = ib_query_port(device, port, tprops);
+		if (ret)
+			continue;
+
+		for (i = 0; i < tprops->gid_tbl_len; ++i) {
+			ret = ib_query_gid(device, port, i, &tmp_gid);
+			if (ret)
+				goto out;
+			if (!memcmp(&tmp_gid, gid, sizeof *gid)) {
+				*port_num = port;
+				*index = i;
+				ret = 0;
+				goto out;
+			}
+		} /* for i */
+	}
+	ret = -ENOENT;
+out:
+	kfree(tprops);
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_gid);
+
+/**
+ * ib_find_pkey - Returns the index of a PKey on a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @pkey: PKey to look for
+ * @index: Returned index
+ *
+ * ib_find_pkey() returns the index of @pkey in the pkey table
+ * on port @port_num
+ */
+int ib_find_pkey(struct ib_device *device,
+		 u8 port_num, u16 pkey, u16 *index)
+{
+	struct ib_port_attr *tprops = NULL;
+	int ret;
+	int i = -1;
+	u16 tmp_pkey;
+
+	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
+
+	ret = ib_query_port(device, port_num, tprops);
+	if (ret) {
+		printk(KERN_WARNING "ib_query_port failed , ret = %d\n", ret);
+		goto out;
+	}
+
+	for (i = 0; i < tprops->pkey_tbl_len; ++i) {
+		ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
+		if (ret)
+			goto out;
+
+		if (pkey == tmp_pkey) {
+			*index = i;
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = -ENOENT;
+
+out:
+	kfree(tprops);
+	return ret;
+}
+EXPORT_SYMBOL(ib_find_pkey);
+
+/**
+ * ib_query_lmc - Returns the LMC of a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @lmc: Returned LMC
+ *
+ * ib_query_lmc() returns the LID mask control associated
+ * with port @port_num
+ */
+int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc)
+{
+	struct ib_port_attr *tprops = NULL;
+	int ret;
+
+	tprops = kmalloc(sizeof *tprops, GFP_ATOMIC);
+	ret = ib_query_port(device, port_num, tprops);
+	if (ret) goto err;
+
+	*lmc = tprops->lmc;
+err:
+	kfree(tprops);
+	return ret;
+
+}
+EXPORT_SYMBOL(ib_query_lmc);
+
 static int __init ib_core_init(void)
 {
 	int ret;
Index: b/include/rdma/ib_verbs.h
===================================================================
--- a/include/rdma/ib_verbs.h	2007-05-03 11:25:59.090497856 +0300
+++ b/include/rdma/ib_verbs.h	2007-05-03 12:34:01.210405046 +0300
@@ -1134,6 +1134,43 @@ int ib_modify_port(struct ib_device *dev
 		   struct ib_port_modify *port_modify);
 
 /**
+ * ib_find_gid - Returns the port number and index of a GID
+ * @device: Device to query.
+ * @gid: GID to look for
+ * @port_num: Returned port number
+ * @index: Returned index
+ *
+ * ib_find_gid() returns the index of @pkey in the pkey table
+ * on port @port_num
+ */
+int ib_find_gid(struct ib_device *device, union ib_gid *gid,
+		u8 *port_num, u16 *index)
+
+/**
+ * ib_find_pkey - Returns the index of a PKey on a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @pkey: PKey to look for
+ * @index: Returned index
+ *
+ * ib_find_pkey() returns the index of @pkey in the pkey table
+ * on port @port_num
+ */
+int ib_find_pkey(struct ib_device *device,
+		 u8 port_num, u16 pkey, u16 *index)
+
+/**
+ * ib_query_lmc - Returns the LMC of a port
+ * @device: Device to query.
+ * @port_num: Port to query on
+ * @lmc: Returned LMC
+ *
+ * ib_query_lmc() returns the LID mask control associated
+ * with port @port_num
+ */
+int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc)
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *
Index: b/drivers/infiniband/core/cm.c
===================================================================
--- a/drivers/infiniband/core/cm.c	2007-05-03 11:25:58.900531925 +0300
+++ b/drivers/infiniband/core/cm.c	2007-05-03 11:32:21.376036145 +0300
@@ -46,8 +46,8 @@
 #include
 #include
 
-#include
 #include
+#include
 #include "cm_msgs.h"
 
 MODULE_AUTHOR("Sean Hefty");
@@ -275,7 +275,7 @@ static int cm_init_av_by_path(struct ib_
 
 	read_lock_irqsave(&cm.device_lock, flags);
 	list_for_each_entry(cm_dev, &cm.device_list, list) {
-		if (!ib_find_cached_gid(cm_dev->device, &path->sgid,
+		if (!ib_find_gid(cm_dev->device, &path->sgid,
 					&p, NULL)) {
 			port = &cm_dev->port[p-1];
 			break;
@@ -286,7 +286,7 @@ static int cm_init_av_by_path(struct ib_
 	if (!port)
 		return -EINVAL;
 
-	ret = ib_find_cached_pkey(cm_dev->device, port->port_num,
+	ret = ib_find_pkey(cm_dev->device, port->port_num,
 				  be16_to_cpu(path->pkey), &av->pkey_index);
 	if (ret)
 		return ret;
@@ -1382,7 +1382,7 @@ static int cm_req_handler(struct cm_work
 	cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]);
 	ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av);
 	if (ret) {
-		ib_get_cached_gid(work->port->cm_dev->device,
+		ib_query_gid(work->port->cm_dev->device,
 				  work->port->port_num, 0, &work->path[0].sgid);
 		ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
 			       &work->path[0].sgid, sizeof work->path[0].sgid,
Index: b/drivers/infiniband/core/cma.c
===================================================================
--- a/drivers/infiniband/core/cma.c	2007-05-03 11:25:58.916529056 +0300
+++ b/drivers/infiniband/core/cma.c	2007-05-03 11:32:21.401031673 +0300
@@ -41,7 +41,6 @@
 
 #include
 #include
-#include
 #include
 #include
 #include
@@ -325,7 +324,7 @@ static int cma_acquire_dev(struct rdma_i
 	}
 
 	list_for_each_entry(cma_dev, &dev_list, list) {
-		ret = ib_find_cached_gid(cma_dev->device, &gid,
+		ret = ib_find_gid(cma_dev->device, &gid,
 					 &id_priv->id.port_num, NULL);
 		if
(!ret) { ret = cma_set_qkey(cma_dev->device, @@ -514,7 +513,7 @@ static int cma_ib_init_qp_attr(struct rd struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; int ret; - ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num, + ret = ib_find_pkey(id_priv->id.device, id_priv->id.port_num, ib_addr_get_pkey(dev_addr), &qp_attr->pkey_index); if (ret) @@ -1658,11 +1657,11 @@ static int cma_bind_loopback(struct rdma cma_dev = list_entry(dev_list.next, struct cma_device, list); port_found: - ret = ib_get_cached_gid(cma_dev->device, p, 0, &gid); + ret = ib_query_gid(cma_dev->device, p, 0, &gid); if (ret) goto out; - ret = ib_get_cached_pkey(cma_dev->device, p, 0, &pkey); + ret = ib_query_pkey(cma_dev->device, p, 0, &pkey); if (ret) goto out; Index: b/drivers/infiniband/core/mad.c =================================================================== --- a/drivers/infiniband/core/mad.c 2007-05-03 11:25:58.930526546 +0300 +++ b/drivers/infiniband/core/mad.c 2007-05-03 11:32:21.435025591 +0300 @@ -34,7 +34,6 @@ * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ */ #include -#include #include "mad_priv.h" #include "mad_rmpp.h" @@ -1707,13 +1706,13 @@ static inline int rcv_has_same_gid(struc if (!send_resp && rcv_resp) { /* is request/response. 
*/ if (!(attr.ah_flags & IB_AH_GRH)) { - if (ib_get_cached_lmc(device, port_num, &lmc)) + if (ib_query_lmc(device, port_num, &lmc)) return 0; return (!lmc || !((attr.src_path_bits ^ rwc->wc->dlid_path_bits) & ((1 << lmc) - 1))); } else { - if (ib_get_cached_gid(device, port_num, + if (ib_query_gid(device, port_num, attr.grh.sgid_index, &sgid)) return 0; return !memcmp(sgid.raw, rwc->recv_buf.grh->dgid.raw, Index: b/drivers/infiniband/core/multicast.c =================================================================== --- a/drivers/infiniband/core/multicast.c 2007-05-03 11:25:58.947523497 +0300 +++ b/drivers/infiniband/core/multicast.c 2007-05-03 11:32:21.454022192 +0300 @@ -38,7 +38,6 @@ #include #include -#include #include "sa.h" static void mcast_add_one(struct ib_device *device); @@ -686,7 +685,7 @@ int ib_init_ah_from_mcmember(struct ib_d u16 gid_index; u8 p; - ret = ib_find_cached_gid(device, &rec->port_gid, &p, &gid_index); + ret = ib_find_gid(device, &rec->port_gid, &p, &gid_index); if (ret) return ret; Index: b/drivers/infiniband/core/sa_query.c =================================================================== --- a/drivers/infiniband/core/sa_query.c 2007-05-03 11:25:58.964520449 +0300 +++ b/drivers/infiniband/core/sa_query.c 2007-05-03 11:32:21.472018972 +0300 @@ -47,7 +47,6 @@ #include #include -#include #include "sa.h" MODULE_AUTHOR("Roland Dreier"); @@ -477,7 +476,7 @@ int ib_init_ah_from_path(struct ib_devic ah_attr->ah_flags = IB_AH_GRH; ah_attr->grh.dgid = rec->dgid; - ret = ib_find_cached_gid(device, &rec->sgid, &port_num, + ret = ib_find_gid(device, &rec->sgid, &port_num, &gid_index); if (ret) return ret; Index: b/drivers/infiniband/core/verbs.c =================================================================== --- a/drivers/infiniband/core/verbs.c 2007-05-03 11:25:58.984516863 +0300 +++ b/drivers/infiniband/core/verbs.c 2007-05-03 11:32:21.499014142 +0300 @@ -43,7 +43,6 @@ #include #include -#include int ib_rate_to_mult(enum ib_rate rate) { @@ 
-159,7 +158,7 @@ int ib_init_ah_from_wc(struct ib_device ah_attr->ah_flags = IB_AH_GRH; ah_attr->grh.dgid = grh->sgid; - ret = ib_find_cached_gid(device, &grh->dgid, &port_num, + ret = ib_find_gid(device, &grh->dgid, &port_num, &gid_index); if (ret) return ret; Index: b/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-05-03 11:25:59.020510408 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-05-03 11:32:21.536007523 +0300 @@ -33,7 +33,6 @@ */ #include -#include #include #include #include @@ -759,7 +758,7 @@ static int ipoib_cm_modify_tx_init(struc struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; int qp_attr_mask, ret; - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index); + ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index); if (ret) { ipoib_warn(priv, "pkey 0x%x not in cache: %d\n", priv->pkey, ret); return ret; Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-03 11:27:12.676301751 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-03 11:32:21.563002693 +0300 @@ -38,7 +38,7 @@ #include #include -#include +#include #include "ipoib.h" Index: b/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- a/drivers/infiniband/ulp/srp/ib_srp.c 2007-05-03 11:25:59.064502518 +0300 +++ b/drivers/infiniband/ulp/srp/ib_srp.c 2007-05-03 11:32:21.592997326 +0300 @@ -48,8 +48,6 @@ #include #include -#include - #include "ib_srp.h" #define DRV_NAME "ib_srp" @@ -164,7 +162,7 @@ static int srp_init_qp(struct srp_target if (!attr) return -ENOMEM; - ret = ib_find_cached_pkey(target->srp_host->dev->dev, + ret = ib_find_pkey(target->srp_host->dev->dev, target->srp_host->port, be16_to_cpu(target->path.pkey), 
&attr->pkey_index); @@ -1780,7 +1778,7 @@ static ssize_t srp_create_target(struct if (ret) goto err; - ib_get_cached_gid(host->dev->dev, host->port, 0, &target->path.sgid); + ib_query_gid(host->dev->dev, host->port, 0, &target->path.sgid); printk(KERN_DEBUG PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x " "service_id %016llx dgid %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", From mst at dev.mellanox.co.il Thu May 3 02:57:30 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 May 2007 12:57:30 +0300 Subject: [ofa-general] Re: [PATCH 2/3 v2] remove ib pkey gid and lmc cache In-Reply-To: <4639ADFB.70707@voltaire.com> References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com> <20070502182315.GQ22292@mellanox.co.il> <4639ADFB.70707@voltaire.com> Message-ID: <20070503095730.GA10009@mellanox.co.il> > Quoting Yosef Etigin : > Subject: [PATCH 2/3 v2] remove ib pkey gid and lmc cache > > don't use ib cache in core and ulp's > > v1: > * Add ib_find_gid and ib_find_pkey, over uncached device queries > * Modify users of the cache in core and ulp's to use the unchached methods > > changes from v2: > * don't remove the cache compelely, and still use it within mthca driver. > > the mthca changes and complete removal of the cache - in the next patch > > Signed-off-by: Yosef Etigin You haven't addressed the rest of my comments, though. http://article.gmane.org/gmane.linux.drivers.openib/39173 Why is that? 
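For readers following the thread, the uncached lookup that the quoted changelog describes can be modeled in userspace roughly as below. This is only a sketch under stated assumptions: demo_port, demo_query_pkey and demo_find_pkey are hypothetical names, the P_Key table is a plain array rather than a device query, and tolerating a NULL index pointer is a choice made for this illustration, not something the posted patch does.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one port's P_Key table; not a kernel structure. */
struct demo_port {
	const uint16_t *pkey_tbl;
	int pkey_tbl_len;
};

/* Stand-in for a per-entry device query such as ib_query_pkey(). */
static int demo_query_pkey(const struct demo_port *port, int i, uint16_t *pkey)
{
	if (i < 0 || i >= port->pkey_tbl_len)
		return -EINVAL;
	*pkey = port->pkey_tbl[i];
	return 0;
}

/*
 * Linear search over the table, one query per entry.
 * Returns 0 and stores the index on a match, -ENOENT if the P_Key is
 * absent, or the error of a failed query. @index may be NULL when the
 * caller only needs to know whether the P_Key is present (an assumption
 * of this sketch).
 */
static int demo_find_pkey(const struct demo_port *port, uint16_t pkey,
			  uint16_t *index)
{
	uint16_t tmp;
	int i, ret;

	for (i = 0; i < port->pkey_tbl_len; ++i) {
		ret = demo_query_pkey(port, i, &tmp);
		if (ret)
			return ret;
		if (tmp == pkey) {
			if (index)
				*index = (uint16_t)i;
			return 0;
		}
	}
	return -ENOENT;
}
```

Each lookup costs one query per table entry, which is the trade-off against a cache that a sketch like this makes visible.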
-- MST From jackm at dev.mellanox.co.il Thu May 3 02:59:03 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 3 May 2007 12:59:03 +0300 Subject: [ofa-general] Re: [PATCH] libmlx4: fix post inline when posting a list In-Reply-To: References: <200705021712.24400.jackm@dev.mellanox.co.il> Message-ID: <200705031259.03428.jackm@dev.mellanox.co.il> On Thursday 03 May 2007 01:10, Roland Dreier wrote: > please use my git tree rather than some other setup, so that I can > apply things with patch -p1 In our build, we have a directory of userspace fixes (patches) that get applied while the fix approval/commit process is in progress. Once a fix is committed, the associated patch is removed from the build (since we fetch/rebase or pull). I apologize that I did not pay attention to the directories -- I just copied the patch we are maintaining in that user-space fixes directory (which is at the level indicated by the patch). The irony is that I started off with a patch which was generated via git-diff on my libmlx4 git clone, and modified the file paths to comply with our userspace patch fixes requirements. Next time, I'll make sure to send the original patch which is generated by git-diff on my libmlx4 clone. - Jack > From mst at dev.mellanox.co.il Thu May 3 03:48:06 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 May 2007 13:48:06 +0300 Subject: [ofa-general] [PATCH 0 of 3] comp_vector kernel support Message-ID: <20070503104806.GC10009@mellanox.co.il> The following patch series adds completion vector support in kernel 1. extends ib_create_cq to pass in comp_vector parameter 2. Update all ULP/providers 3. mthca is enhanced to support multiple vectors if MSI-X is enabled on SMP 4. Other providers report support for a single completion vector 5. uverbs and IPoIB CM are enhanced to use multiple vectors if available Please consider for 2.6.22. 
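As a rough userspace illustration of how a consumer might use the new parameter, assuming it wants to spread its CQs over whatever vectors a device reports: demo_device, demo_check_comp_vector and demo_pick_comp_vector are hypothetical names, and the round-robin policy is an assumption of this sketch, not something the series mandates. The valid range (0 <= comp_vector < num_comp_vectors) is the one documented for ib_create_cq in patch 1.

```c
#include <errno.h>

/* Hypothetical model of the one device field that matters here. */
struct demo_device {
	int num_comp_vectors;	/* at least 1 */
};

/* A vector is usable if 0 <= comp_vector < num_comp_vectors. */
static int demo_check_comp_vector(const struct demo_device *dev,
				  int comp_vector)
{
	if (comp_vector < 0 || comp_vector >= dev->num_comp_vectors)
		return -EINVAL;
	return 0;
}

/* Round-robin choice: the n-th CQ gets vector n modulo the vector count. */
static int demo_pick_comp_vector(const struct demo_device *dev,
				 unsigned int n)
{
	return (int)(n % (unsigned int)dev->num_comp_vectors);
}
```

On a provider that reports a single vector, every CQ lands on vector 0, so callers written this way need no special case.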
-- MST From mst at dev.mellanox.co.il Thu May 3 03:48:47 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 May 2007 13:48:47 +0300 Subject: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector support in core Message-ID: <20070503104847.GD10009@mellanox.co.il> Extend ib_create_cq to pass in comp_vector parameter - this parallels our userspace API. Update all ULPs and providers. Make uverbs use multiple vectors if available. Signed-off-by: Michael S. Tsirkin --- Note: since num_comp_vectors = 0 is not legal, and to minimize provider churn, I set num_comp_vectors to a sane value in core. Providers can increase that. Index: linux-2.6/drivers/infiniband/core/device.c =================================================================== --- linux-2.6.orig/drivers/infiniband/core/device.c +++ linux-2.6/drivers/infiniband/core/device.c @@ -161,9 +161,14 @@ static int alloc_name(char *name) */ struct ib_device *ib_alloc_device(size_t size) { + struct ib_device *dev; BUG_ON(size < sizeof (struct ib_device)); - return kzalloc(size, GFP_KERNEL); + dev = kzalloc(size, GFP_KERNEL); + if (dev) + dev->num_comp_vectors = 1; + + return dev; } EXPORT_SYMBOL(ib_alloc_device); Index: linux-2.6/drivers/infiniband/core/mad.c =================================================================== --- linux-2.6.orig/drivers/infiniband/core/mad.c +++ linux-2.6/drivers/infiniband/core/mad.c @@ -2767,7 +2767,7 @@ static int ib_mad_port_open(struct ib_de cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; port_priv->cq = ib_create_cq(port_priv->device, ib_mad_thread_completion_handler, - NULL, port_priv, cq_size); + NULL, port_priv, cq_size, 0); if (IS_ERR(port_priv->cq)) { printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n"); ret = PTR_ERR(port_priv->cq); Index: linux-2.6/drivers/infiniband/core/uverbs_cmd.c =================================================================== --- linux-2.6.orig/drivers/infiniband/core/uverbs_cmd.c +++
linux-2.6/drivers/infiniband/core/uverbs_cmd.c @@ -802,6 +802,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uv INIT_LIST_HEAD(&obj->async_list); cq = file->device->ib_dev->create_cq(file->device->ib_dev, cmd.cqe, + cmd.comp_vector, file->ucontext, &udata); if (IS_ERR(cq)) { ret = PTR_ERR(cq); Index: linux-2.6/drivers/infiniband/core/uverbs_main.c =================================================================== --- linux-2.6.orig/drivers/infiniband/core/uverbs_main.c +++ linux-2.6/drivers/infiniband/core/uverbs_main.c @@ -752,7 +752,7 @@ static void ib_uverbs_add_one(struct ib_ spin_unlock(&map_lock); uverbs_dev->ib_dev = device; - uverbs_dev->num_comp_vectors = 1; + uverbs_dev->num_comp_vectors = device->num_comp_vectors; uverbs_dev->dev = cdev_alloc(); if (!uverbs_dev->dev) Index: linux-2.6/drivers/infiniband/core/verbs.c =================================================================== --- linux-2.6.orig/drivers/infiniband/core/verbs.c +++ linux-2.6/drivers/infiniband/core/verbs.c @@ -609,11 +609,11 @@ EXPORT_SYMBOL(ib_destroy_qp); struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, void (*event_handler)(struct ib_event *, void *), - void *cq_context, int cqe) + void *cq_context, int cqe, int comp_vector) { struct ib_cq *cq; - cq = device->create_cq(device, cqe, NULL, NULL); + cq = device->create_cq(device, cqe, comp_vector, NULL, NULL); if (!IS_ERR(cq)) { cq->device = device; Index: linux-2.6/drivers/infiniband/hw/amso1100/c2_provider.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/amso1100/c2_provider.c +++ linux-2.6/drivers/infiniband/hw/amso1100/c2_provider.c @@ -290,7 +290,7 @@ static int c2_destroy_qp(struct ib_qp *i return 0; } -static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, +static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *context, struct ib_udata *udata) { Index: 
linux-2.6/drivers/infiniband/hw/cxgb3/iwch_provider.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ linux-2.6/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -139,7 +139,7 @@ static int iwch_destroy_cq(struct ib_cq return 0; } -static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, +static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *ib_context, struct ib_udata *udata) { Index: linux-2.6/drivers/infiniband/hw/ehca/ehca_cq.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_cq.c +++ linux-2.6/drivers/infiniband/hw/ehca/ehca_cq.c @@ -113,7 +113,7 @@ struct ehca_qp* ehca_cq_get_qp(struct eh return ret; } -struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, +struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata) { Index: linux-2.6/drivers/infiniband/hw/ehca/ehca_iverbs.h =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ linux-2.6/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -123,7 +123,7 @@ int ehca_destroy_eq(struct ehca_shca *sh void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq); -struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, +struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata); Index: linux-2.6/drivers/infiniband/hw/ehca/ehca_main.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_main.c +++ linux-2.6/drivers/infiniband/hw/ehca/ehca_main.c @@ -375,7 +375,7 @@ static int ehca_create_aqp1(struct ehca_ return -EPERM; } - ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10); + ibcq = 
ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10, 0); if (IS_ERR(ibcq)) { ehca_err(&shca->ib_device, "Cannot create AQP1 CQ."); return PTR_ERR(ibcq); Index: linux-2.6/drivers/infiniband/hw/ipath/ipath_cq.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/ipath/ipath_cq.c +++ linux-2.6/drivers/infiniband/hw/ipath/ipath_cq.c @@ -170,7 +170,7 @@ static void send_complete(unsigned long * * Called by ib_create_cq() in the generic verbs code. */ -struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, +struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata) { Index: linux-2.6/drivers/infiniband/hw/ipath/ipath_verbs.h =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/ipath/ipath_verbs.h +++ linux-2.6/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -710,7 +710,7 @@ void ipath_cq_enter(struct ipath_cq *cq, int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, +struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata); Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c @@ -663,6 +663,7 @@ static int mthca_destroy_qp(struct ib_qp } static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries, + int comp_vector, struct ib_ucontext *context, struct ib_udata *udata) { Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ 
linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -793,7 +793,7 @@ static int ipoib_cm_tx_init(struct ipoib } p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p, - ipoib_sendq_size + 1); + ipoib_sendq_size + 1, 0); if (IS_ERR(p->cq)) { ret = PTR_ERR(p->cq); ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret); Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -187,7 +187,7 @@ int ipoib_transport_dev_init(struct net_ if (!ret) size += ipoib_recvq_size; - priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size); + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); goto out_free_mr; Index: linux-2.6/drivers/infiniband/ulp/iser/iser_verbs.c =================================================================== --- linux-2.6.orig/drivers/infiniband/ulp/iser/iser_verbs.c +++ linux-2.6/drivers/infiniband/ulp/iser/iser_verbs.c @@ -76,7 +76,7 @@ static int iser_create_device_ib_res(str iser_cq_callback, iser_cq_event_callback, (void *)device, - ISER_MAX_CQ_LEN); + ISER_MAX_CQ_LEN, 0); if (IS_ERR(device->cq)) goto cq_err; Index: linux-2.6/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- linux-2.6.orig/drivers/infiniband/ulp/srp/ib_srp.c +++ linux-2.6/drivers/infiniband/ulp/srp/ib_srp.c @@ -197,7 +197,7 @@ static int srp_create_target_ib(struct s return -ENOMEM; target->cq = ib_create_cq(target->srp_host->dev->dev, srp_completion, - NULL, target, SRP_CQ_SIZE); + NULL, target, SRP_CQ_SIZE, 0); if (IS_ERR(target->cq)) { ret = PTR_ERR(target->cq); goto out; Index: linux-2.6/include/rdma/ib_verbs.h =================================================================== --- 
linux-2.6.orig/include/rdma/ib_verbs.h +++ linux-2.6/include/rdma/ib_verbs.h @@ -912,6 +912,8 @@ struct ib_device { u32 flags; + int num_comp_vectors; + struct iw_cm_verbs *iwcm; int (*query_device)(struct ib_device *device, @@ -978,6 +980,7 @@ struct ib_device { struct ib_recv_wr *recv_wr, struct ib_recv_wr **bad_recv_wr); struct ib_cq * (*create_cq)(struct ib_device *device, int cqe, + int comp_vector, struct ib_ucontext *context, struct ib_udata *udata); int (*destroy_cq)(struct ib_cq *cq); @@ -1358,13 +1361,15 @@ static inline int ib_post_recv(struct ib * @cq_context: Context associated with the CQ returned to the user via * the associated completion and event handlers. * @cqe: The minimum size of the CQ. + * @comp_vector - Completion vector used to signal completion events. + * Must be >= 0 and < context->num_comp_vectors. * * Users can examine the cq structure to determine the actual CQ size. */ struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, void (*event_handler)(struct ib_event *, void *), - void *cq_context, int cqe); + void *cq_context, int cqe, int comp_vector); /** * ib_resize_cq - Modifies the capacity of the CQ. -- MST From mst at dev.mellanox.co.il Thu May 3 03:49:24 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 May 2007 13:49:24 +0300 Subject: [ofa-general] [PATCH 2 of 3] IB/mthca: support multiple completion vectors Message-ID: <20070503104924.GE10009@mellanox.co.il> Support 2 completion vectors in mthca on SMP if MSI-X is enabled Signed-off-by: Michael S. Tsirkin --- I don't know how many vectors make sense, so I decided to be conservative here, since each EQ consumes a lot of memory by default. 
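The EQ index arithmetic this patch relies on can be modeled in userspace as follows. The enum mirrors the one the patch leaves in mthca_dev.h; the demo_* helpers are hypothetical names for this sketch and simply restate the expressions used for cq->eq, mthca_num_eq() and num_comp_vectors, on the assumption that the cmd and async EQs each take one MSI-X vector and the rest go to completions.

```c
/* EQ slots as laid out by the patch's enum in mthca_dev.h. */
enum {
	MTHCA_EQ_CMD,
	MTHCA_EQ_ASYNC,
	MTHCA_EQ_COMP,		/* first completion EQ */
	MTHCA_NUM_EQ,
	MTHCA_COMP_VECTORS = 2,
	MTHCA_MAX_EQS = MTHCA_NUM_EQ + MTHCA_COMP_VECTORS - 1
};

/* EQ table slot used by a CQ created on the given completion vector. */
static int demo_cq_eq_index(int comp_vector)
{
	return MTHCA_EQ_COMP + comp_vector;
}

/* Total EQs in use for a given number of completion vectors. */
static int demo_num_eq(int num_comp_vectors)
{
	return num_comp_vectors + MTHCA_NUM_EQ - 1;
}

/* Completion vectors left once cmd and async each take one MSI-X vector. */
static int demo_comp_vectors_from_msix(int msix_vectors)
{
	return msix_vectors - MTHCA_NUM_EQ + 1;
}
```

With four MSI-X vectors granted this yields two completion vectors, in EQ slots 2 and 3; with only three it falls back to the single completion EQ.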
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_cq.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_cq.c @@ -779,7 +779,7 @@ int mthca_arbel_arm_cq(struct ib_cq *ibc return 0; } -int mthca_init_cq(struct mthca_dev *dev, int nent, +int mthca_init_cq(struct mthca_dev *dev, int nent, int comp_vector, struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq) { @@ -790,6 +790,7 @@ int mthca_init_cq(struct mthca_dev *dev, cq->ibcq.cqe = nent - 1; cq->is_kernel = !ctx; + cq->eq = MTHCA_EQ_COMP + comp_vector; cq->cqn = mthca_alloc(&dev->cq_table.alloc); if (cq->cqn == -1) @@ -844,7 +845,7 @@ int mthca_init_cq(struct mthca_dev *dev, else cq_context->logsize_usrpage |= cpu_to_be32(dev->driver_uar.index); cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); - cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[cq->eq].eqn); cq_context->pd = cpu_to_be32(pdn); cq_context->lkey = cpu_to_be32(cq->buf.mr.ibmr.lkey); cq_context->cqn = cpu_to_be32(cq->cqn); @@ -954,7 +955,7 @@ void mthca_free_cq(struct mthca_dev *dev spin_unlock_irq(&dev->cq_table.lock); if (dev->mthca_flags & MTHCA_FLAG_MSI_X) - synchronize_irq(dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector); + synchronize_irq(dev->eq_table.eq[cq->eq].msi_x_vector); else synchronize_irq(dev->pdev->irq); Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h @@ -96,7 +96,9 @@ enum { MTHCA_EQ_CMD, MTHCA_EQ_ASYNC, MTHCA_EQ_COMP, - MTHCA_NUM_EQ + MTHCA_NUM_EQ, + MTHCA_COMP_VECTORS = 2, + MTHCA_MAX_EQS = MTHCA_NUM_EQ + MTHCA_COMP_VECTORS - 1 }; enum { @@ -230,7 +232,7 @@ struct mthca_eq_table { void __iomem *clr_int; u32 clr_mask; u32 
arm_mask; - struct mthca_eq eq[MTHCA_NUM_EQ]; + struct mthca_eq eq[MTHCA_MAX_EQS]; u64 icm_virt; struct page *icm_page; dma_addr_t icm_dma; @@ -497,7 +499,7 @@ int mthca_poll_cq(struct ib_cq *ibcq, in struct ib_wc *entry); int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); -int mthca_init_cq(struct mthca_dev *dev, int nent, +int mthca_init_cq(struct mthca_dev *dev, int nent, int comp_vector, struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq); void mthca_free_cq(struct mthca_dev *dev, Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_eq.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_eq.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_eq.c @@ -161,6 +161,11 @@ struct mthca_eqe { u8 owner; } __attribute__((packed)); +static inline int mthca_num_eq(struct mthca_dev *dev) +{ + return dev->ib_dev.num_comp_vectors + MTHCA_NUM_EQ - 1; +} + #define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) #define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) @@ -579,8 +584,7 @@ static int mthca_create_eq(struct mthca_ dev->eq_table.arm_mask |= eq->eqn_mask; - mthca_dbg(dev, "Allocated EQ %d with %d entries\n", - eq->eqn, eq->nent); + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", eq->eqn, eq->nent); return err; @@ -657,7 +661,7 @@ static void mthca_free_irqs(struct mthca if (dev->eq_table.have_irq) free_irq(dev->pdev->irq, dev); - for (i = 0; i < MTHCA_NUM_EQ; ++i) + for (i = 0; i < mthca_num_eq(dev); ++i) if (dev->eq_table.eq[i].have_irq) free_irq(dev->eq_table.eq[i].msi_x_vector, dev->eq_table.eq + i); @@ -824,12 +828,37 @@ void mthca_unmap_eq_icm(struct mthca_dev __free_page(dev->eq_table.icm_page); } +static inline const char *eq_name(int i) +{ + switch (i) { + case MTHCA_EQ_ASYNC: + return DRV_NAME " (async)"; + case MTHCA_EQ_CMD: + return DRV_NAME " (cmd)"; + default: + return DRV_NAME " (comp)"; + } +} + +static inline int 
eq_size(struct mthca_dev *dev, int i) +{ + switch (i) { + case MTHCA_EQ_ASYNC: + return MTHCA_NUM_ASYNC_EQE; + case MTHCA_EQ_CMD: + return MTHCA_NUM_CMD_EQE; + default: + return dev->limits.num_cqs; + } +} + + int mthca_init_eq_table(struct mthca_dev *dev) { int err; u8 status; u8 intr; - int i; + int i, eqn; err = mthca_alloc_init(&dev->eq_table.alloc, dev->limits.num_eqs, @@ -857,39 +886,29 @@ int mthca_init_eq_table(struct mthca_dev intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? 128 : dev->eq_table.inta_pin; - err = mthca_create_eq(dev, dev->limits.num_cqs + MTHCA_NUM_SPARE_EQE, - (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, - &dev->eq_table.eq[MTHCA_EQ_COMP]); - if (err) - goto err_out_unmap; - - err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE + MTHCA_NUM_SPARE_EQE, - (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, - &dev->eq_table.eq[MTHCA_EQ_ASYNC]); - if (err) - goto err_out_comp; - - err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE + MTHCA_NUM_SPARE_EQE, - (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 130 : intr, - &dev->eq_table.eq[MTHCA_EQ_CMD]); - if (err) - goto err_out_async; + for (eqn = 0; eqn < mthca_num_eq(dev); ++eqn) { + err = mthca_create_eq(dev, eq_size(dev, eqn) + MTHCA_NUM_SPARE_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? + 128 + eqn : intr, + &dev->eq_table.eq[eqn]); + if (err) + goto err_out_eq; + } if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { - static const char *eq_name[] = { - [MTHCA_EQ_COMP] = DRV_NAME " (comp)", - [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", - [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" - }; - - for (i = 0; i < MTHCA_NUM_EQ; ++i) { + for (i = 0; i < mthca_num_eq(dev); ++i) { err = request_irq(dev->eq_table.eq[i].msi_x_vector, mthca_is_memfree(dev) ? 
mthca_arbel_msi_x_interrupt : mthca_tavor_msi_x_interrupt, - 0, eq_name[i], dev->eq_table.eq + i); - if (err) - goto err_out_cmd; + 0, eq_name(i), dev->eq_table.eq + i); + if (err) { + mthca_err(dev, "Failed to request IRQ %d for EQ %d (%d)," + " aborting.\n", + dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq[i].eqn, i); + goto err_out_irq; + } dev->eq_table.eq[i].have_irq = 1; } } else { @@ -898,8 +917,11 @@ int mthca_init_eq_table(struct mthca_dev mthca_arbel_interrupt : mthca_tavor_interrupt, IRQF_SHARED, DRV_NAME, dev); - if (err) - goto err_out_cmd; + if (err) { + mthca_err(dev, "Failed to request IRQ %d, aborting.\n", + dev->pdev->irq); + goto err_out_eq; + } dev->eq_table.have_irq = 1; } @@ -921,7 +943,7 @@ int mthca_init_eq_table(struct mthca_dev mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); - for (i = 0; i < MTHCA_NUM_EQ; ++i) + for (i = 0; i < mthca_num_eq(dev); ++i) if (mthca_is_memfree(dev)) arbel_eq_req_not(dev, dev->eq_table.eq[i].eqn_mask); else @@ -929,17 +951,13 @@ int mthca_init_eq_table(struct mthca_dev return 0; -err_out_cmd: +err_out_irq: mthca_free_irqs(dev); - mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); -err_out_async: - mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); - -err_out_comp: - mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); +err_out_eq: + for (i = 0; i < eqn; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); -err_out_unmap: mthca_unmap_eq_regs(dev); err_out_free: @@ -959,7 +977,7 @@ void mthca_cleanup_eq_table(struct mthca mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); - for (i = 0; i < MTHCA_NUM_EQ; ++i) + for (i = 0; i < mthca_num_eq(dev); ++i) mthca_free_eq(dev, &dev->eq_table.eq[i]); mthca_unmap_eq_regs(dev); Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_main.c +++ 
linux-2.6/drivers/infiniband/hw/mthca/mthca_main.c @@ -39,6 +39,7 @@ #include #include #include +#include #include "mthca_dev.h" #include "mthca_config_reg.h" @@ -976,24 +977,35 @@ static void mthca_release_regions(struct static int mthca_enable_msi_x(struct mthca_dev *mdev) { - struct msix_entry entries[3]; - int err; + struct msix_entry entries[MTHCA_MAX_EQS]; + int i, err, num; - entries[0].entry = 0; - entries[1].entry = 1; - entries[2].entry = 2; - - err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); - if (err) { - if (err > 0) - mthca_info(mdev, "Only %d MSI-X vectors available, " - "not using MSI-X\n", err); - return err; + num = min(mdev->limits.num_eqs - mdev->limits.reserved_eqs, MTHCA_MAX_EQS); + num = min(num_possible_cpus(), num); + + for (i = 0; i < num; ++i) + entries[i].entry = i; + + for (;;) { + err = pci_enable_msix(mdev->pdev, entries, num); + if (!err) + break; + else if (err < 0) + return err; + + if (err < MTHCA_NUM_EQ) { + mthca_info(mdev, "Only %d MSI-X vectors available. 
" + "Not using MSI-X\n", err); + pci_disable_msix(mdev->pdev); + return err; + } + + num = err; } - mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; - mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; - mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + mdev->ib_dev.num_comp_vectors = num - MTHCA_NUM_EQ + 1; + for (i = 0; i < num; ++i) + mdev->eq_table.eq[i].msi_x_vector = entries[i].vector; return 0; } @@ -1115,12 +1127,6 @@ static int __mthca_init_one(struct pci_d goto err_free_dev; } - if (msi_x && !mthca_enable_msi_x(mdev)) - mdev->mthca_flags |= MTHCA_FLAG_MSI_X; - if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && - !pci_enable_msi(pdev)) - mdev->mthca_flags |= MTHCA_FLAG_MSI; - if (mthca_cmd_init(mdev)) { mthca_err(mdev, "Failed to init command interface, aborting.\n"); goto err_free_dev; @@ -1144,6 +1150,12 @@ static int __mthca_init_one(struct pci_d mthca_warn(mdev, "If you have problems, try updating your HCA FW.\n"); } + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + err = mthca_setup_hca(mdev); if (err) goto err_close; @@ -1180,17 +1192,17 @@ err_cleanup: mthca_cleanup_uar_table(mdev); err_close: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + mthca_close_hca(mdev); err_cmd: mthca_cmd_cleanup(mdev); err_free_dev: - if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) - pci_disable_msix(pdev); - if (mdev->mthca_flags & MTHCA_FLAG_MSI) - pci_disable_msi(pdev); - ib_dealloc_device(&mdev->ib_dev); err_free_res: @@ -1231,14 +1243,15 @@ static void __mthca_remove_one(struct pc iounmap(mdev->kar); mthca_uar_free(mdev, &mdev->driver_uar); mthca_cleanup_uar_table(mdev); - mthca_close_hca(mdev); - mthca_cmd_cleanup(mdev); if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) 
pci_disable_msix(pdev); if (mdev->mthca_flags & MTHCA_FLAG_MSI) pci_disable_msi(pdev); + mthca_close_hca(mdev); + mthca_cmd_cleanup(mdev); + ib_dealloc_device(&mdev->ib_dev); mthca_release_regions(pdev, mdev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN); Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c @@ -707,7 +707,7 @@ static struct ib_cq *mthca_create_cq(str for (nent = 1; nent <= entries; nent <<= 1) ; /* nothing */ - err = mthca_init_cq(to_mdev(ibdev), nent, + err = mthca_init_cq(to_mdev(ibdev), nent, comp_vector, context ? to_mucontext(context) : NULL, context ? ucmd.pdn : to_mdev(ibdev)->driver_pd.pd_num, cq); Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.h =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.h +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.h @@ -202,6 +202,7 @@ struct mthca_cq { spinlock_t lock; int refcount; int cqn; + int eq; u32 cons_index; struct mthca_cq_buf buf; struct mthca_cq_resize *resize_buf; -- MST From mst at dev.mellanox.co.il Thu May 3 03:49:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 May 2007 13:49:55 +0300 Subject: [ofa-general] [PATCH 3 of 3] ipoib/cm: separate comp vectors to RX/TX Message-ID: <20070503104955.GF10009@mellanox.co.il> Enhance ipoib to use multiple completion vectors if available. On mthca, this increases netperf BW by some 5% with same or lower service demand. Signed-off-by: Michael S. 
Tsirkin
---
Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -793,7 +793,7 @@ static int ipoib_cm_tx_init(struct ipoib
 	}
 
 	p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p,
-			     ipoib_sendq_size + 1, 0);
+			     ipoib_sendq_size + 1, priv->ca->num_comp_vectors > 1);
 	if (IS_ERR(p->cq)) {
 		ret = PTR_ERR(p->cq);
 		ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret);
-- 
MST

From mst at dev.mellanox.co.il Thu May 3 05:22:15 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 May 2007 15:22:15 +0300
Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache
In-Reply-To: <4639D16F.4060807@voltaire.com>
References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com> <20070502171829.GO22292@mellanox.co.il> <4639D16F.4060807@voltaire.com>
Message-ID: <20070503122215.GA9719@mellanox.co.il>

> Quoting Yosef Etigin :
> Subject: Re: [PATCH 2/3] remove ib pkey gid and lmc cache
>
> Michael S. Tsirkin wrote:
> >>+ * ib_query_lmc() returns the LID mask control associated
> >>+ * with port @port_num
> >>+ */
> >>+int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc);
> >>+
> >
> >
> > I don't think we need this one in ib_verbs.h - it just does query_port once.
> > Let's keep the API simple. The only user is in mad.c - move
> > it there and make it static.
> >
> >
>
> why keep ib_query_lmc anyway if we won't use it?

Actually, I think I see a problem with changing
ib_get_cached_lmc -> ib_query_lmc: it is called on the data path in mad.c.
Calling ib_query_port there will slow down MAD processing significantly,
because it's hard for the driver to cache all of the portinfo state
(e.g. how do you cache phys_state?).

But mad.c is actually seeing all MADs, too, so maybe the right thing
is to cache lmc directly there.
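
For illustration only, the idea of caching the LMC at the point where MADs are already seen could be sketched roughly as below. This is a user-space simulation, not kernel code: `struct port_sim`, `query_port_lmc()`, `on_port_info_set()`, `get_lmc()` and `lid_matches()` are hypothetical names invented for this sketch, and `query_port_lmc()` merely stands in for the slow `ib_query_port` path discussed above. The only fact assumed from InfiniBand itself is that a port answers to 2^LMC consecutive LIDs, so the low LMC bits of a LID are ignored when matching.

```c
#include <stdint.h>

/* Hypothetical simulated port: hw_lmc plays the role of the value the
 * HCA would report; the cached_* fields are the proposed local cache. */
struct port_sim {
	uint8_t  hw_lmc;        /* "real" LMC held by the hardware */
	unsigned slow_queries;  /* how often the slow path was taken */
	uint8_t  cached_lmc;    /* locally cached copy */
	int      cache_valid;   /* non-zero once cached_lmc is usable */
};

/* Stand-in for the slow query (ib_query_port in the thread above). */
static uint8_t query_port_lmc(struct port_sim *p)
{
	p->slow_queries++;
	return p->hw_lmc;
}

/* Called when a PortInfo update passes by: since the new LMC is already
 * visible at this point, refresh the cache without querying hardware. */
static void on_port_info_set(struct port_sim *p, uint8_t new_lmc)
{
	p->hw_lmc = new_lmc;
	p->cached_lmc = new_lmc;
	p->cache_valid = 1;
}

/* Data-path lookup: takes the slow path at most once, on a cold cache. */
static uint8_t get_lmc(struct port_sim *p)
{
	if (!p->cache_valid) {
		p->cached_lmc = query_port_lmc(p);
		p->cache_valid = 1;
	}
	return p->cached_lmc;
}

/* Typical consumer of the LMC: compare a destination LID against the
 * port's base LID while ignoring the low LMC bits. */
static int lid_matches(uint16_t base_lid, uint8_t lmc, uint16_t dlid)
{
	uint16_t mask = (uint16_t)~((1u << lmc) - 1u);

	return (dlid & mask) == (base_lid & mask);
}
```

In this sketch the data path calls only `get_lmc()`, so the slow query is hit once per port at most; whether a real implementation would need locking or invalidation on port events is left open here.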
-- MST From yosefe at voltaire.com Thu May 3 05:28:52 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Thu, 03 May 2007 15:28:52 +0300 Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache In-Reply-To: <20070503122215.GA9719@mellanox.co.il> References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com> <20070502171829.GO22292@mellanox.co.il> <4639D16F.4060807@voltaire.com> <20070503122215.GA9719@mellanox.co.il> Message-ID: <4639D584.3010706@voltaire.com> Michael S. Tsirkin wrote: >>Quoting Yosef Etigin : >>Subject: Re: [PATCH 2/3] remove ib pkey gid and lmc cache >> >>Michael S. Tsirkin wrote: >> >>>>+ * ib_query_lmc() returns the LID mask control associated >>>>+ * with port @port_num >>>>+ */ >>>>+int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc); >>>>+ >>> >>> >>>I don't think we need this one in ib_verbs.h - it just does query_port once. >>>Let's keep the API simple. The only user is in mad.c - move >>>it there and make it static. >>> >>> >> >>why keep ib_query_lmc anyway if we won't use it? > > > Actually, I think I see a problem with changing > ib_get_cached_lmc -> ib_query_lmc: it is called on data path in mad.c. > Calling ib_query_port there will slow down MAD processing significantly, > because it's hard to driver to cache all of portinfo state > (e.g. how do you cache phys_state?). > > But mad.c is actually seeing all MADs, too, so maybe the right thing > is to cache lmc directly there. > So how about mad.c will use the cache for now (like mthca), and will have a lmc cache of its own after the cache is really removed? From mst at dev.mellanox.co.il Thu May 3 05:49:56 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 3 May 2007 15:49:56 +0300 Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache In-Reply-To: <4639D584.3010706@voltaire.com> References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com> <20070502171829.GO22292@mellanox.co.il> <4639D16F.4060807@voltaire.com> <20070503122215.GA9719@mellanox.co.il> <4639D584.3010706@voltaire.com> Message-ID: <20070503124956.GB9719@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: [PATCH 2/3] remove ib pkey gid and lmc cache > > Michael S. Tsirkin wrote: > >>Quoting Yosef Etigin : > >>Subject: Re: [PATCH 2/3] remove ib pkey gid and lmc cache > >> > >>Michael S. Tsirkin wrote: > >> > >>>>+ * ib_query_lmc() returns the LID mask control associated > >>>>+ * with port @port_num > >>>>+ */ > >>>>+int ib_query_lmc(struct ib_device *device, u8 port_num, u8 *lmc); > >>>>+ > >>> > >>> > >>>I don't think we need this one in ib_verbs.h - it just does query_port once. > >>>Let's keep the API simple. The only user is in mad.c - move > >>>it there and make it static. > >>> > >>> > >> > >>why keep ib_query_lmc anyway if we won't use it? > > > > > > Actually, I think I see a problem with changing > > ib_get_cached_lmc -> ib_query_lmc: it is called on data path in mad.c. > > Calling ib_query_port there will slow down MAD processing significantly, > > because it's hard to driver to cache all of portinfo state > > (e.g. how do you cache phys_state?). > > > > But mad.c is actually seeing all MADs, too, so maybe the right thing > > is to cache lmc directly there. > > > So how about mad.c will use the cache for now (like mthca), > and will have a lmc cache of its own after the cache is really removed? OK I guess. But I wonder whether other changes might affect e.g. connection setup for other pieces, too. We started by discussing a race condition with IPoIB. 
So again, my suggestion for now (from archives) would be: > > - a patch to add ib_find_pkey() and ib_find_gid() to core > > - a patch to replace cache usage in IPoIB with uncached > > hardware accesses on top of this > > - ipoib pkey change handling patch on top of these Here's a link to discussion we had in April: http://www.mail-archive.com/general at lists.openfabrics.org/msg01613.html -- MST From dotanb at dev.mellanox.co.il Thu May 3 05:57:03 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 03 May 2007 15:57:03 +0300 Subject: [ofa-general] man pages for the rdma-cm In-Reply-To: <1178127046.18609.107.camel@stevo-desktop> References: <1178127046.18609.107.camel@stevo-desktop> Message-ID: <4639DC1F.2030905@dev.mellanox.co.il> Steve Wise wrote: > Sean, > > Are there man pages for the rdma-cm in the pipeline? I think it would > be great (requirement?) to have these for ofed-1.2 since we do have the > other verbs man pages. > man-pages are very important to new users (this is why i wrote the man pages to the libibverbs). I believe that all of the libraries (which has an API) that are being installed in the OFED package should have man pages. I think that this should be one of the goals in OFED 1.3. thanks Dotan
From vlad at dev.mellanox.co.il Thu May 3 06:07:06 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 03 May 2007 16:07:06 +0300 Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 In-Reply-To: References: <1178114728.14131.30.camel@vladsk-laptop> Message-ID: <1178197626.6580.38.camel@vladsk-laptop> Please see if this happens in OFED-1.2-20070503-0600. But first uninstall the previous OFED version with ofed_uninstall.sh command. Thanks, Regards, Vladimir On Wed, 2007-05-02 at 11:30 -0400, Steffen Persvold wrote: > Hmm, > > so I tried something. I put : > > build_32bit=0 > > into my ofed.conf file and rebuilt (build.sh -c ofed.conf). This time > it built 64bit libraries, but it puts them in the wrong directory : > > # rpm -qpl ../libibverbs-1.1-0.x86_64.rpm > /etc/ld.so.conf.d/ofed.conf > /usr/lib/libibverbs.so.1 > /usr/lib/libibverbs.so.1.0.0 > > # file /usr/lib/libibverbs.so.1.0.0 > /usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD > x86-64, version 1 (SYSV), not stripped > > So what's up ?? > > Cheers, > Steffen Persvold > Technical Director Americas > tel. 508-281-7100 x401 > fax. 
508-281-7171 > > http://www.scali.com/ > Scaling the Linux datacenter > > > ______________________________________________________________________ > From: Steffen Persvold > Sent: Wed 5/2/2007 10:30 AM > To: Steffen Persvold; Vladimir Sokolovsky > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > > Also, > > If I look at the /etc/ld.so.conf/ofed.conf file I have : > > # cat ofed.conf > /usr/lib > /usr/lib > > > which seems kinda weird ? :) > > Cheers, > > Steffen Persvold > Technical Director Americas > tel. 508-281-7100 x401 > fax. 508-281-7171 > > http://www.scali.com/ > Scaling the Linux datacenter > > > ______________________________________________________________________ > From: ewg-bounces at lists.openfabrics.org on behalf of Steffen Persvold > Sent: Wed 5/2/2007 10:20 AM > To: Vladimir Sokolovsky > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > > Nope : > > > [redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm > /etc/ld.so.conf.d/ofed.conf > /usr/lib/libibverbs.so.1 > /usr/lib/libibverbs.so.1.0.0 > [redhat-release-4ES-5.5]# > > So the RPM got built, but without 64bit libraries. Now if it was the > other way around (i.e no 32bit libraries) I could have understood it > (as 32bit is an option on x86_64), but not having the native 64bit > libraries is not so easy to understand :) > > cheers, > Steffen Persvold > Technical Director Americas > tel. 508-281-7100 x401 > fax. 
508-281-7171 > > http://www.scali.com/ > Scaling the Linux datacenter > > > ______________________________________________________________________ > From: ewg-bounces at lists.openfabrics.org on behalf of Vladimir > Sokolovsky > Sent: Wed 5/2/2007 10:05 AM > To: Steffen Persvold > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > Subject: Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > > Don't you have /usr/lib64/libibverbs.so.1.0.0? > > Regards, > Vladimir > > On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote: > > Folks, > > > > I used the build.sh script to build the above mentioned packages on > > rhel4u4 x86_64, but for some reason it only compiles 32bit libraries > > (even if the packages are named x86_64) : > > > > # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm > > x86_64 > > > > (after installing it) : > > > > # file /usr/lib/libibverbs.so.1.0.0 > > /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel > > 80386, version 1 (SYSV), not stripped > > > > What did I do wrong ?? > > > > Cheers, > > Steffen Persvold > > Technical Director Americas > > tel. 508-281-7100 x401 > > fax. 
508-281-7171 > > > > http://www.scali.com/ > > Scaling the Linux datacenter > > _______________________________________________ > > ewg mailing list > > ewg at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > > From vlad at dev.mellanox.co.il Thu May 3 07:17:11 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 03 May 2007 17:17:11 +0300 Subject: [ofa-general] Re: Help building ib-bonding In-Reply-To: References: <1178087589.14131.3.camel@vladsk-laptop> Message-ID: <1178201831.6580.49.camel@vladsk-laptop> Hi Moni, Please check ib-bonding on updated RH5.0 and FedoraC6 See also https://bugs.openfabrics.org/show_bug.cgi?id=595 Thanks, Regards, Vladimir On Wed, 2007-05-02 at 17:28 -0400, Jeffrey Wong wrote: > Hello, > > I am using kernel 2.6-18.8.1.1.el5 x86_64 > > I have changed the build_env.sh to have the build_32bit=-1 > > > > Thanks in advance. > > > > Jeff > > > > > > When installing all modules I am getting the following errors. 
> > > > > > + make -C /lib/modules/2.6.18-8.1.1.el5/build modules > M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding > > make: Entering directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64' > > CC [M] > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.o > > In file included from > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:78: > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin > g.h: In function 'bond_set_slave_inactive_flags': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin > g.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this > function) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin > g.h:262: error: (Each undeclared identifier is reported only once > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin > g.h:262: error: for each function it appears in.) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin > g.h: In function 'bond_set_slave_active_flags': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin > g.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this > function) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c: In function 'bond_compute_features': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:1233: warning: comparison of distinct pointer types lacks a cast > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c: In function 'bond_enslave': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:1449: error: 'IFF_BONDING' undeclared (first use in this function) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c: In function 'bond_release': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > 
ain.c:1848: error: 'IFF_BONDING' undeclared (first use in this function) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this > function) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c: In function 'bond_arp_rcv': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:2548: error: 'IFF_BONDING' undeclared (first use in this function) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c: In function 'bond_netdev_event': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:3390: error: 'IFF_BONDING' undeclared (first use in this function) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c: In function 'bond_init': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:4374: warning: assignment discards qualifiers from pointer target > type > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:4386: error: 'IFF_BONDING' undeclared (first use in this function) > > make[1]: *** > [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_ > main.o] Error 1 > > make: *** > [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bondi > ng] Error 2 > > make: Leaving directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64' > > + echo ' Building IB bonding driver failed' > > Building IB bonding driver failed > > + exit 1 > > error: Bad exit status from /var/tmp/rpm-tmp.99179 (%build) > 
From jsquyres at cisco.com Thu May 3 07:28:37 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 3 May 2007 07:28:37 -0700 Subject: [ofa-general] OFED [bi-]weekly teleconferences Message-ID: Tziporet and I chatted in Sonoma this week and decided that it would be good to keep the weekly OFED teleconferences going throughout the month of May. You should already have teleconferences on your calendar for May 7 (next Monday) and May 21. I will be sending around an Outlook invitation shortly for May 14 and May 28. 
-- Jeff Squyres Cisco Systems From monis at voltaire.com Thu May 3 07:48:01 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 3 May 2007 17:48:01 +0300 Subject: [ofa-general] RE: Help building ib-bonding In-Reply-To: <1178201831.6580.49.camel@vladsk-laptop> Message-ID: <39C75744D164D948A170E9792AF8E7CA244F55@exil.voltaire.com> I am going to release ib-bonding with a fix for bug 595 today (release 10) However, this bug doesn't say anything about using build_32bit=-1 so I'll have to look into it later (assuming that this is a problem) > -----Original Message----- > From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il] > Sent: Thursday, May 03, 2007 5:17 PM > To: Moni Shoua; Moni Shoua > Cc: Jeffrey Wong; general at lists.openfabrics.org > Subject: Re: Help building ib-bonding > > Hi Moni, > Please check ib-bonding on updated RH5.0 and FedoraC6 See > also https://bugs.openfabrics.org/show_bug.cgi?id=595 > > Thanks, > > Regards, > Vladimir > > On Wed, 2007-05-02 at 17:28 -0400, Jeffrey Wong wrote: > > Hello, > > > > I am using kernel 2.6-18.8.1.1.el5 x86_64 > > > > I have changed the build_env.sh to have the build_32bit=-1 > > > > > > > > Thanks in advance. > > > > > > > > Jeff > > > > > > > > > > > > When installing all modules I am getting the following errors. 
> > > > > > > > > > > > + make -C /lib/modules/2.6.18-8.1.1.el5/build modules > > M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding > > > > make: Entering directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64' > > > > CC [M] > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.o > > > > In file included from > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c:78: > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > in > > g.h: In function 'bond_set_slave_inactive_flags': > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > in > > g.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this > > function) > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > in > > g.h:262: error: (Each undeclared identifier is reported only once > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > in > > g.h:262: error: for each function it appears in.) 
> > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > in > > g.h: In function 'bond_set_slave_active_flags': > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > in > > g.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this > > function) > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c: In function 'bond_compute_features': > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c:1233: warning: comparison of distinct pointer types > lacks a cast > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c: In function 'bond_enslave': > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c:1449: error: 'IFF_BONDING' undeclared (first use in this > > function) > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c: In function 'bond_release': > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c:1848: error: 'IFF_BONDING' undeclared (first use in this > > function) > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this > > function) > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c: In function 'bond_arp_rcv': > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c:2548: error: 'IFF_BONDING' undeclared (first use in this > > function) > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c: In function 'bond_netdev_event': > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c:3390: error: 'IFF_BONDING' undeclared (first use in this > > function) > > > > > 
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c: In function 'bond_init': > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c:4374: warning: assignment discards qualifiers from pointer > > target type > > > > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond > > _m > > ain.c:4386: error: 'IFF_BONDING' undeclared (first use in this > > function) > > > > make[1]: *** > > > [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bon > > d_ > > main.o] Error 1 > > > > make: *** > > > [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bon > > di > > ng] Error 2 > > > > make: Leaving directory `/usr/src/kernels/2.6.18-8.1.1.el5-x86_64' > > > > + echo ' Building IB bonding driver failed' > > > > Building IB bonding driver failed > > > > + exit 1 > > > > error: Bad exit status from /var/tmp/rpm-tmp.99179 (%build) > > > > From vlad at dev.mellanox.co.il Thu May 3 07:58:24 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 03 May 2007 17:58:24 +0300 Subject: [ofa-general] RE: Help building ib-bonding In-Reply-To: <39C75744D164D948A170E9792AF8E7CA244F55@exil.voltaire.com> References: <39C75744D164D948A170E9792AF8E7CA244F55@exil.voltaire.com> Message-ID: <1178204304.6580.53.camel@vladsk-laptop> On Thu, 2007-05-03 at 17:48 +0300, Moni Shoua wrote: > I am going to release ib-bonding with a fix for bug 595 today (release > 10) > However, this bug doesn't say anything about using build_32bit=-1 so > I'll have to look into it later (assuming that this is a problem) > Skip this part (build_32bit=-1). Please check also bug: https://bugs.openfabrics.org/show_bug.cgi?id=589 -- Vladimir Sokolovsky Mellanox Technologies Ltd. 
From steffen.persvold at scali.com Thu May 3 08:25:46 2007 From: steffen.persvold at scali.com (Steffen Persvold) Date: Thu, 3 May 2007 11:25:46 -0400 Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 In-Reply-To: <1178197626.6580.38.camel@vladsk-laptop> References: <1178114728.14131.30.camel@vladsk-laptop> <1178197626.6580.38.camel@vladsk-laptop> Message-ID: Vladimir, Nope. Still the same issue. The RPMs will only contain one set of libraries and it is always in /usr/lib (if I set the build_32bit=0 option I get the 64bit libraries but in the wrong directory). Seriously, am I the only one seeing this ? I would think rhel4 u4 was a very normal test platform ? Cheers, Steffen Persvold Technical Director Americas tel. 508-281-7100 x401 fax. 508-281-7171 http://www.scali.com/ Scaling the Linux datacenter > -----Original Message----- > From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il] > Sent: Thursday, May 03, 2007 9:07 AM > To: Steffen Persvold > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > Please see if this happens in OFED-1.2-20070503-0600. > But first uninstall the previous OFED version with ofed_uninstall.sh > command. > > Thanks, > > Regards, > Vladimir > > On Wed, 2007-05-02 at 11:30 -0400, Steffen Persvold wrote: > > Hmm, > > > > so I tried something. I put : > > > > build_32bit=0 > > > > into my ofed.conf file and rebuilt (build.sh -c ofed.conf). This time > > it built 64bit libraries, but it puts them in the wrong directory : > > > > # rpm -qpl ../libibverbs-1.1-0.x86_64.rpm > > /etc/ld.so.conf.d/ofed.conf > > /usr/lib/libibverbs.so.1 > > /usr/lib/libibverbs.so.1.0.0 > > > > # file /usr/lib/libibverbs.so.1.0.0 > > /usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD > > x86-64, version 1 (SYSV), not stripped > > > > So what's up ?? > > > > Cheers, > > Steffen Persvold > > Technical Director Americas > > tel. 508-281-7100 x401 > > fax. 
508-281-7171 > > > > http://www.scali.com/ > > Scaling the Linux datacenter > > > > > > ______________________________________________________________________ > > From: Steffen Persvold > > Sent: Wed 5/2/2007 10:30 AM > > To: Steffen Persvold; Vladimir Sokolovsky > > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > > > > > Also, > > > > If I look at the /etc/ld.so.conf/ofed.conf file I have : > > > > # cat ofed.conf > > /usr/lib > > /usr/lib > > > > > > which seems kinda weird ? :) > > > > Cheers, > > > > Steffen Persvold > > Technical Director Americas > > tel. 508-281-7100 x401 > > fax. 508-281-7171 > > > > http://www.scali.com/ > > Scaling the Linux datacenter > > > > > > ______________________________________________________________________ > > From: ewg-bounces at lists.openfabrics.org on behalf of Steffen Persvold > > Sent: Wed 5/2/2007 10:20 AM > > To: Vladimir Sokolovsky > > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > > > > > Nope : > > > > > > [redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm > > /etc/ld.so.conf.d/ofed.conf > > /usr/lib/libibverbs.so.1 > > /usr/lib/libibverbs.so.1.0.0 > > [redhat-release-4ES-5.5]# > > > > So the RPM got built, but without 64bit libraries. Now if it was the > > other way around (i.e no 32bit libraries) I could have understood it > > (as 32bit is an option on x86_64), but not having the native 64bit > > libraries is not so easy to understand :) > > > > cheers, > > Steffen Persvold > > Technical Director Americas > > tel. 508-281-7100 x401 > > fax. 
508-281-7171 > > > > http://www.scali.com/ > > Scaling the Linux datacenter > > > > > > ______________________________________________________________________ > > From: ewg-bounces at lists.openfabrics.org on behalf of Vladimir > > Sokolovsky > > Sent: Wed 5/2/2007 10:05 AM > > To: Steffen Persvold > > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > > Subject: Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > > > > > Don't you have /usr/lib64/libibverbs.so.1.0.0? > > > > Regards, > > Vladimir > > > > On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote: > > > Folks, > > > > > > I used the build.sh script to build the above mentioned packages on > > > rhel4u4 x86_64, but for some reason it only compiles 32bit libraries > > > (even if the packages are named x86_64) : > > > > > > # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm > > > x86_64 > > > > > > (after installing it) : > > > > > > # file /usr/lib/libibverbs.so.1.0.0 > > > /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel > > > 80386, version 1 (SYSV), not stripped > > > > > > What did I do wrong ?? > > > > > > Cheers, > > > Steffen Persvold > > > Technical Director Americas > > > tel. 508-281-7100 x401 > > > fax. 
508-281-7171 > > > > > > http://www.scali.com/ > > > Scaling the Linux datacenter > > > _______________________________________________ > > > ewg mailing list > > > ewg at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > > > _______________________________________________ > > ewg mailing list > > ewg at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > > > > > From cap at nsc.liu.se Thu May 3 09:06:36 2007 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Thu, 3 May 2007 18:06:36 +0200 Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 In-Reply-To: References: <1178197626.6580.38.camel@vladsk-laptop> Message-ID: <200705031806.36851.cap@nsc.liu.se> On Thursday 03 May 2007, Steffen Persvold wrote: > Vladimir, > > Nope. Still the same issue. The RPMs will only contain one set of > libraries and it is always in /usr/lib (if I set the build_32bit=0 > option I get the 64bit libraries but in the wrong directory). > > Seriously, am I the only one seeing this ? I would think rhel4 u4 was a > very normal test platform ? Hello Steffen, Being a curious person I tried to build 1.2-20070503-0600 on one of my centos-4.4 x86_64 boxes. I had only x86_64 packages so the build warned "glibc-devel 32bit is required for 32-bit libraries. Building 64-bit only". The result was fine (all packages named x86_64 and all libs in /usr/lib64). /Peter > Cheers, > > Steffen Persvold -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From mshefty at ichips.intel.com Thu May 3 09:48:14 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 03 May 2007 09:48:14 -0700 Subject: [ofa-general] man pages for the rdma-cm In-Reply-To: <1178127046.18609.107.camel@stevo-desktop> References: <1178127046.18609.107.camel@stevo-desktop> Message-ID: <463A124E.4020604@ichips.intel.com> > Are there man pages for the rdma-cm in the pipeline? I think it would > be great (requirement?) to have these for ofed-1.2 since we do have the > other verbs man pages. > > I didn't know if this was in-progress or are we looking for > volunteers... I don't have man pages, but I did update the comments in the rdma_cm header file to assist in the auto generation of man pages. The results weren't quite what I was wanting (I think I was using some tool that created kernel man pages). It probably wouldn't take that long to auto generate some pages, then manually fix-up any issues. What's the release date for RC3? - Sean From swise at opengridcomputing.com Thu May 3 09:51:24 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 03 May 2007 11:51:24 -0500 Subject: [ofa-general] man pages for the rdma-cm In-Reply-To: <463A124E.4020604@ichips.intel.com> References: <1178127046.18609.107.camel@stevo-desktop> <463A124E.4020604@ichips.intel.com> Message-ID: <1178211084.27558.8.camel@stevo-desktop> On Thu, 2007-05-03 at 09:48 -0700, Sean Hefty wrote: > > Are there man pages for the rdma-cm in the pipeline? I think it would > > be great (requirement?) to have these for ofed-1.2 since we do have the > > other verbs man pages. > > > > I didn't know if this was in-progress or are we looking for > > volunteers... > > I don't have man pages, but I did update the comments in the rdma_cm header file > to assist in the auto generation of man pages. 
The results weren't quite what I
> was wanting (I think I was using some tool that created kernel man pages). It
> probably wouldn't take that long to auto generate some pages, then manually
> fix-up any issues.
>
> What's the release date for RC3?
>

I believe it is today.

From xma at us.ibm.com Thu May 3 09:54:12 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 3 May 2007 09:54:12 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: <20070503044711.GO10009@mellanox.co.il>
Message-ID:

> BTW, why do you ignore the option to use UC QP?
> MST

Unfortunately, eHCA doesn't support UC in current version. Next generation
will have RC w/i SRQ support.

Shirley Ma

From sean.hefty at intel.com Thu May 3 09:55:02 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 3 May 2007 09:55:02 -0700
Subject: [ofa-general] man pages for the rdma-cm
In-Reply-To: <1178211084.27558.8.camel@stevo-desktop>
Message-ID: <000101c78da3$cb32fad0$3b78e984@amr.corp.intel.com>

>I believe it is today.

So, there's about a 0% chance of this making RC 3 then... I will try to get
this done by early next week.

From swise at opengridcomputing.com Thu May 3 09:58:12 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 03 May 2007 11:58:12 -0500
Subject: [ofa-general] [PATCH librdmacm] rping: Transfer rkey/addr/len information in network byte order.
In-Reply-To: <1178117806.18609.25.camel@stevo-desktop>
References: <1177515271.22094.33.camel@stevo-desktop> <1178060596.2309.195.camel@stevo-desktop> <46380D09.5070906@ichips.intel.com> <1178117806.18609.25.camel@stevo-desktop>
Message-ID: <1178211492.27558.12.camel@stevo-desktop>

Hey,

We need this in -rc3.

Steve.

On Wed, 2007-05-02 at 09:56 -0500, Steve Wise wrote:
> On Tue, 2007-05-01 at 21:01 -0700, Sean Hefty wrote:
> > > This patch regresses rping. I failed to test it on AMD64<->AMD64 (ie
> > > like endian systems). I will provide another patch shortly, or we can
> > > undo the broken rping patch for -rc3. Whatever you think is best.
> >
> > Let's fix it. Please create a patch on top of this that fixes the problem.
> >
> > Thanks
> >
> > - Sean
>
> Here is the fix. Tested with:
>
> ppc64 client, amd64 server
> ppc64 server, amd64 client
> amd64 client, amd64 server
>
>
> ---
>
> Fix regression introduced by 88fc0cb21698dfb5d7660eecf7dddd0531fc8021.
>
> From: Steve Wise
>
> - swizzle memory info when sending it to peer.
> - fixed printf format
>
> Signed-off-by: Steve Wise
> ---
>
>  examples/rping.c |   10 +++++-----
>  1 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/examples/rping.c b/examples/rping.c
> index 17b0000..bccabb0 100644
> --- a/examples/rping.c
> +++ b/examples/rping.c
> @@ -243,7 +243,7 @@ static int server_recv(struct rping_cb *
>          cb->remote_rkey = ntohl(cb->recv_buf.rkey);
>          cb->remote_addr = ntohll(cb->recv_buf.buf);
>          cb->remote_len = ntohl(cb->recv_buf.size);
> -        DEBUG_LOG("Received rkey %x addr %" PRIx64 "len %d from peer\n",
> +        DEBUG_LOG("Received rkey %x addr %" PRIx64 " len %d from peer\n",
>                    cb->remote_rkey, cb->remote_addr, cb->remote_len);
>
>          if (cb->state <= CONNECTED || cb->state == RDMA_WRITE_COMPLETE)
> @@ -614,12 +614,12 @@ static void rping_format_send(struct rpi
>  {
>          struct rping_rdma_info *info = &cb->send_buf;
>
> -        info->buf = (uint64_t) (unsigned long) buf;
> -        info->rkey = mr->rkey;
> -        info->size = cb->size;
> +        info->buf = htonll((uint64_t) (unsigned long) buf);
> +        info->rkey = htonl(mr->rkey);
> +        info->size = htonl(cb->size);
>
>          DEBUG_LOG("RDMA addr %" PRIx64" rkey %x len %d\n",
> -                  info->buf, info->rkey, info->size);
> +                  ntohll(info->buf), ntohl(info->rkey), ntohl(info->size));
>  }
>
>  static int rping_test_server(struct rping_cb *cb)
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To
unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From swise at opengridcomputing.com Thu May 3 09:59:21 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 03 May 2007 11:59:21 -0500
Subject: [ofa-general] man pages for the rdma-cm
In-Reply-To: <000101c78da3$cb32fad0$3b78e984@amr.corp.intel.com>
References: <000101c78da3$cb32fad0$3b78e984@amr.corp.intel.com>
Message-ID: <1178211561.27558.14.camel@stevo-desktop>

On Thu, 2007-05-03 at 09:55 -0700, Sean Hefty wrote:
> >I believe it is today.
>
> So, there's about a 0% chance of this making RC 3 then... I will try to get
> this done by early next week.

If you want me to review the text, lemme know...

From halr at voltaire.com Thu May 3 09:59:39 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 May 2007 12:59:39 -0400
Subject: [ofa-general] Re: [PATCH] osm: source and destination strings overlap when using sprintf()
In-Reply-To: <4636E4A7.7060108@dev.mellanox.co.il>
References: <462C7C21.7010004@dev.mellanox.co.il> <20070423101738.GG4579@mellanox.co.il> <462E80A3.5060503@dev.mellanox.co.il> <20070501005101.GA26019@sashak.voltaire.com> <4636E4A7.7060108@dev.mellanox.co.il>
Message-ID: <1178211572.32222.3479.camel@hal.voltaire.com>

On Tue, 2007-05-01 at 02:56, Yevgeny Kliteynik wrote:
> Sasha Khapyorsky wrote:
> > On 01:11 Wed 25 Apr , Yevgeny Kliteynik wrote:
> >> Michael S. Tsirkin wrote:
> >>> Since you seem to do a strcat which does an anyway, how about, for example:
> >>>
> >>> -        sprintf( buf_line1,"%s 0x%01x |",
> >>> -                 buf_line1, p_vla_tbl->vl_entry[i].vl);
> >>> +        sprintf( buf_line1 + strlen(buf_line1)," 0x%01x |",
> >>> +                 p_vla_tbl->vl_entry[i].vl);
> >>>
> >>> and so on in all the other places?
> >> Agree.
> >> I'll send a new patch later.
> >
> > Or like this:
> >
> > +        int n = 0;
> > ...
> > -        sprintf( buf_line1,"%s 0x%01x |",
> > -                 buf_line1, p_vla_tbl->vl_entry[i].vl);
> > +        n += sprintf( buf_line1 + n," 0x%01x |",
> > +                      p_vla_tbl->vl_entry[i].vl);
> >
> > , so strlen() rerunning in loop is not needed anymore.
>
> Right, it does look better.

So is someone going to submit this patch ? Thanks.

-- Hal

> -- Yevgeny
>
> > Sasha

From halr at voltaire.com Thu May 3 10:01:51 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 May 2007 13:01:51 -0400
Subject: [ofa-general] [query] SMI nodeinfo, port_info structures
In-Reply-To: <735493.40597.qm@web8323.mail.in.yahoo.com>
References: <735493.40597.qm@web8323.mail.in.yahoo.com>
Message-ID: <1178211580.32222.3481.camel@hal.voltaire.com>

Hi Mahesh,

On Thu, 2007-05-03 at 02:50, Keshetti Mahesh wrote:
> Hi,
>
> Even though nobody except the ipath driver is using the structures
> nodeinfo and port_info currently aren't these structures should be in
> smi.h?

I don't think so.

> Because these structures are not specific to any hardware but they are
> specific to the SMI.

SMI has nothing to do with those SM attributes.

> And can anyone tell me why some fields have big endian (__bexx) data
> type and others have normal (uxx) data type in these structures?

What structures (in what file(s)) are you referring to ?

-- Hal

> -Mahesh

From rick.jones2 at hp.com Thu May 3 10:32:10 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Thu, 03 May 2007 10:32:10 -0700
Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian
In-Reply-To: <20070502214944.GF10009@mellanox.co.il>
References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il>
Message-ID: <463A1C9A.6060706@hp.com>

> rpm is not installed. I don't know how to solve this, Vlad
> might be able to answer tomorrow.

That would be cool. Push comes to shove I can probably put my two systems on an
external network if there is a need for a "laying on of hands."

>>should I be taking a different path to build here?
>
>
> Maybe, maybe not.
>
> There *is* another way which should be enough to test IPoIB:
> try getting a kernel tarball from
> http://git.openfabrics.org/~vlad/builds/
>
> If you unpack this, you can configure/make/make install.
>
> Installer will backup your original modules under the prefix.
> Keep the source around and you'll be able to make uninstall
> to get back to original system.
>
> Note 1: default configure settings are often not what you want:
> run ./configure --help first of all to see which modules to select
> (--with-ipoib-mod and --with-mthca-mod I think) and to set a prefix.
> Note 2: having quilt tool installed is recommended - will let you
> add/remove patches later.
> Note 3: this way you get no userspace. openfabrics tarballs
> are under the same directory, and a similar method works there.
> external tarballs (MPI, bonding, etc) are supplied to us in SRPM
> format so this trick does not work for them.

Seems I found little joy there too, probably my own fault.
The environment is a Debian 4.0. The kernel is called: hpcpc107:~/ofa_1_2_kernel-20070502-0200# uname -a Linux hpcpc107 2.6.21.1-raj #1 SMP Tue May 1 14:11:27 PDT 2007 ia64 GNU/Linux The sources to which are at: /root/linux-2.6.21.1 My configure line was: ./configure --with-ipoib-mod --with-mthca-mod --with-sdp-mod --prefix=/root/save I didn't save the first set of configure output :( Subsequent configures give: hpcpc107:~/ofa_1_2_kernel-20070502-0200# ./configure --with-ipoib-mod --with-mthca-mod --with-sdp-mod --prefix=/root/save mkdir -p /root/ofa_1_2_kernel-20070502-0200/patches touch /root/ofa_1_2_kernel-20070502-0200/patches/quiltrc /root/ofa_1_2_kernel-20070502-0200/kernel_patches/fixes/add_orig_dgid_to_sysfs.patch /usr/bin/quilt --quiltrc /root/ofa_1_2_kernel-20070502-0200/patches/quiltrc import /root/ofa_1_2_kernel-20070502-0200/kernel_patches/fixes/add_orig_dgid_to_sysfs.patch Patch add_orig_dgid_to_sysfs.patch is applied Failed executing /usr/bin/quilthpcpc107:~/ofa_1_2_kernel-20070502-0200# Make then reports: hpcpc107:~/ofa_1_2_kernel-20070502-0200# make Building kernel modules Kernel version: 2.6.21.1-raj Modules directory: //lib/modules/2.6.21.1-raj/updates Kernel sources: /lib/modules/2.6.21.1-raj/build env EXTRA_CFLAGS=" -I/root/ofa_1_2_kernel-20070502-0200/include -I/root/ofa_1_2_kernel-20070502-0200/drivers/infiniband/include \ -I/root/ofa_1_2_kernel-20070502-0200/drivers/infiniband/ulp/ipoib \ -I/root/ofa_1_2_kernel-20070502-0200/drivers/infiniband/debug \ -I/root/ofa_1_2_kernel-20070502-0200/drivers/infiniband/hw/cxgb3/core \ -I/root/ofa_1_2_kernel-20070502-0200/drivers/net/cxgb3 \ -I/root/ofa_1_2_kernel-20070502-0200/drivers/net/rds " \ make -C /lib/modules/2.6.21.1-raj/build SUBDIRS="/root/ofa_1_2_kernel-20070502-0200" KERNELRELEASE=2.6.21.1-raj \ EXTRAVERSION=.1-raj V=1 \ CONFIG_INFINIBAND= \ CONFIG_INFINIBAND_IPOIB=m \ CONFIG_INFINIBAND_IPOIB_CM=y \ CONFIG_INFINIBAND_SDP=m \ CONFIG_INFINIBAND_SRP= \ CONFIG_INFINIBAND_USER_MAD= \ 
CONFIG_INFINIBAND_USER_ACCESS= \ CONFIG_INFINIBAND_ADDR_TRANS= \ CONFIG_INFINIBAND_MTHCA=m \ CONFIG_INFINIBAND_IPOIB_DEBUG=y \ CONFIG_INFINIBAND_ISER= \ CONFIG_SCSI_ISCSI_ATTRS= \ CONFIG_ISCSI_TCP= \ CONFIG_INFINIBAND_EHCA= \ CONFIG_INFINIBAND_EHCA_SCALING= \ CONFIG_RDS= \ CONFIG_RDS_IB= \ CONFIG_RDS_TCP= \ CONFIG_RDS_DEBUG= \ CONFIG_INFINIBAND_IPOIB_DEBUG_DATA= \ CONFIG_INFINIBAND_SDP_SEND_ZCOPY= \ CONFIG_INFINIBAND_SDP_RECV_ZCOPY= \ CONFIG_INFINIBAND_SDP_DEBUG=y \ CONFIG_INFINIBAND_SDP_DEBUG_DATA= \ CONFIG_INFINIBAND_IPATH= \ CONFIG_INFINIBAND_MTHCA_DEBUG=y \ CONFIG_INFINIBAND_MADEYE= \ CONFIG_INFINIBAND_VNIC= \ CONFIG_INFINIBAND_VNIC_DEBUG= \ CONFIG_INFINIBAND_VNIC_STATS= \ CONFIG_CHELSIO_T3= \ CONFIG_INFINIBAND_CXGB3= \ CONFIG_INFINIBAND_CXGB3_DEBUG= \ LINUXINCLUDE=' \ \ -I/root/ofa_1_2_kernel-20070502-0200/include \ -I/root/ofa_1_2_kernel-20070502-0200/drivers/infiniband/include \ -Iinclude \ $(if $(KBUILD_SRC),-Iinclude2 -I$(srctree)/include) \ -include include/linux/autoconf.h \ -include /root/ofa_1_2_kernel-20070502-0200/include/linux/autoconf.h \ ' \ modules make[1]: Entering directory `/root/linux-2.6.21.1' test -e include/linux/autoconf.h -a -e include/config/auto.conf || ( \ echo; \ echo " ERROR: Kernel configuration is invalid."; \ echo " include/linux/autoconf.h or include/config/auto.conf are missing."; \ echo " Run 'make oldconfig && make prepare' on kernel src to fix it."; \ echo; \ /bin/false) mkdir -p /root/ofa_1_2_kernel-20070502-0200/.tmp_versions rm -f /root/ofa_1_2_kernel-20070502-0200/.tmp_versions/* make -f scripts/Makefile.build obj=/root/ofa_1_2_kernel-20070502-0200 Building modules, stage 2. 
make -f /root/linux-2.6.21.1/scripts/Makefile.modpost scripts/mod/modpost -m -i /root/linux-2.6.21.1/Module.symvers -I /root/ofa_1_2_kernel-20070502-0200/Module.symvers -o /root/ofa_1_2_kernel-20070502-0200/Module.symvers -w vmlinux make[1]: Leaving directory `/root/linux-2.6.21.1' so I go ahead and do as the output suggests, make oldconfig and make prepare in the kernel source directory, come back to the ofa kernel directory, type make and get: ... make[1]: Entering directory `/root/linux-2.6.21.1' test -e include/linux/autoconf.h -a -e include/config/auto.conf || ( \ echo; \ echo " ERROR: Kernel configuration is invalid."; \ echo " include/linux/autoconf.h or include/config/auto.conf are missing."; \ echo " Run 'make oldconfig && make prepare' on kernel src to fix it."; \ echo; \ /bin/false) mkdir -p /root/ofa_1_2_kernel-20070502-0200/.tmp_versions rm -f /root/ofa_1_2_kernel-20070502-0200/.tmp_versions/* make -f scripts/Makefile.build obj=/root/ofa_1_2_kernel-20070502-0200 Building modules, stage 2. make -f /root/linux-2.6.21.1/scripts/Makefile.modpost scripts/mod/modpost -m -i /root/linux-2.6.21.1/Module.symvers -I /root/ofa_1_2_kernel-20070502-0200/Module.symvers -o /root/ofa_1_2_kernel-20070502-0200/Module.symvers -w vmlinux make[1]: Leaving directory `/root/linux-2.6.21.1' again. rick jones still rather kernel clueless From mst at dev.mellanox.co.il Thu May 3 10:48:17 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 3 May 2007 20:48:17 +0300 Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian In-Reply-To: <463A1C9A.6060706@hp.com> References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il> <463A1C9A.6060706@hp.com> Message-ID: <20070503174817.GC9719@mellanox.co.il> > make[1]: Entering directory `/root/linux-2.6.21.1' > test -e include/linux/autoconf.h -a -e include/config/auto.conf || ( > \ > echo; \ > echo " ERROR: Kernel configuration is invalid."; \ > echo " include/linux/autoconf.h or include/config/auto.conf > are missing."; \ > echo " Run 'make oldconfig && make prepare' on kernel src > to fix it."; \ This is kernel's message, not our's - is this the source you built kernel from? If you go into /root/linux-2.6.21.1 as root and do make modules, does it succeed? -- MST From rick.jones2 at hp.com Thu May 3 10:59:15 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 03 May 2007 10:59:15 -0700 Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian In-Reply-To: <20070503174817.GC9719@mellanox.co.il> References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il> <463A1C9A.6060706@hp.com> <20070503174817.GC9719@mellanox.co.il> Message-ID: <463A22F3.4090108@hp.com> Michael S. Tsirkin wrote: >>make[1]: Entering directory `/root/linux-2.6.21.1' >>test -e include/linux/autoconf.h -a -e include/config/auto.conf || ( >>\ >> echo; \ >> echo " ERROR: Kernel configuration is invalid."; \ >> echo " include/linux/autoconf.h or include/config/auto.conf >> are missing."; \ >> echo " Run 'make oldconfig && make prepare' on kernel src >> to fix it."; \ > > > This is kernel's message, not our's - is this the source you built kernel from? > If you go into /root/linux-2.6.21.1 as root and do make modules, > does it succeed? yes. some warnings at the beginning about some modules and section mismatches but it seems to complete. 
rick jones

From pradeep at us.ibm.com Thu May 3 11:14:56 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Thu, 3 May 2007 11:14:56 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
Message-ID:

"Michael S. Tsirkin" wrote on 05/02/2007 08:55:47 PM:

> > > > + if (ipoib_cm_post_receive(dev, i << 32 | index)) {
> > >
> > > 1. It seems there are multiple QPs mapped to a single CQ -
> > > and each gets ipoib_recvq_size recv WRs above.
> > > Is that right? How do you prevent CQ overrun then?
> >
> > Good point! Looking at the IB spec it appears that upon CQ overflow
> > it results in a Local Work Queue catastrophic error and puts the QP
> > (receiver side) in an error state.
>
> Look further in spec - you get CQ error, too.
>
> > Hence, I am speculating that the
> > sending side will see an error. This will result in the sending side
> > destroying the QP and sending a DREQ message which, will remove the
> > receive side QP.
> >
> > A new set of QPs will be created on the send side (this is RC) and
> > the connection setup starts over again. It will continue, but at a
> > degraded rate.
> > Is this correct? What other alternative do you suggest
> > -create a CQ per QP? Is the max number of CQs an issue to consider, if
> > we adopt this approach?
>
> We were switching to NAPI though, and NAPI kind of forces you to use
> a common CQ, I think.

What if in ipoib_transport_dev_init() size is changed to something like:

size = ipoib_sendq_size + NOSRQ_INDEX_RING_SIZE * ipoib_recvq_size + 1;

used by ib_create_cq() call for the NOSRQ case only? Yes, we will end up
consuming a lot more memory - do you see any (other) problems with that?

Pradeep
pradeep at us.ibm.com

From rdreier at cisco.com Thu May 3 11:33:29 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 11:33:29 -0700
Subject: [ofa-general] Re: [PATCH 0 of 3] comp_vector kernel support
In-Reply-To: <20070503104806.GC10009@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 May 2007 13:48:06 +0300")
References: <20070503104806.GC10009@mellanox.co.il>
Message-ID:

> 1. extends ib_create_cq to pass in comp_vector parameter
> 2. Update all ULP/providers
> 3. mthca is enhanced to support multiple vectors if MSI-X is enabled on SMP
> 4. Other providers report support for a single completion vector
> 5. uverbs and IPoIB CM are enhanced to use multiple vectors if available

> Please consider for 2.6.22.

This is good work, but given that this has just appeared halfway
through the 2.6.22 merge window I don't think we should just merge it
just now. Rather, let's definitely get it into 2.6.23.

From rdreier at cisco.com Thu May 3 11:35:51 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 11:35:51 -0700
Subject: [ofa-general] Re: [PATCH 1 of 3] IB/verbs: add cq comp_vector support in core
In-Reply-To: <20070503104847.GD10009@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 May 2007 13:48:47 +0300")
References: <20070503104847.GD10009@mellanox.co.il>
Message-ID:

> Note: since num_comp_vectors = 0 is not legal, and to minimize provider churn,
> I set num_comp_vectors to a sane value in core. Providers can increase that.

I would actually prefer to see providers updated to set this
explicitly. Right now the rule is ib_alloc_device() returns a
completely zeroed out structure -- that's much easier to understand
than initializing some fields but not others. And there are other
fields we could set defaults for, like phys_port_cnt == 0 is not legal
either, but we don't try to set that in the core.

- R.

From rdreier at cisco.com Thu May 3 11:37:43 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 11:37:43 -0700
Subject: [ofa-general] Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors
In-Reply-To: <20070503104924.GE10009@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 May 2007 13:49:24 +0300")
References: <20070503104924.GE10009@mellanox.co.il>
Message-ID:

> I don't know how many vectors make sense, so I decided to be
> conservative here, since each EQ consumes a lot of memory by default.

I think #vectors == O(#CPUs) is what we should aim for.

Also another useful thing for NUMA systems might be to allocate the EQ
memory from the CPU where that interrupt will be targeted. But I
don't know exactly the best way to do that, and I think we can leave
that for later.

- R.

From rdreier at cisco.com Thu May 3 11:39:43 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 11:39:43 -0700
Subject: [ofa-general] Re: [PATCH 3 of 3] ipoib/cm: separate comp vectors to RX/TX
In-Reply-To: <20070503104955.GF10009@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 May 2007 13:49:55 +0300")
References: <20070503104955.GF10009@mellanox.co.il>
Message-ID:

> Enhance ipoib to use multiple completion vectors if available.
> On mthca, this increases netperf BW by some 5% with
> same or lower service demand.

I wonder if this is the best way to use multiple vectors in IPoIB.
Shirley has pointed out that right now we can't scale to both ports on
2-port adapters because both ports end up using the same interrupt.
So maybe we want to target each port to separate completion vectors
when available?

- R.

From xma at us.ibm.com Thu May 3 11:57:32 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 3 May 2007 11:57:32 -0700
Subject: [ofa-general] Re: [PATCH 3 of 3] ipoib/cm: separate comp vectors to RX/TX
In-Reply-To:
Message-ID:

general-bounces at lists.openfabrics.org wrote on 05/03/2007 11:39:43 AM:

> > Enhance ipoib to use multiple completion vectors if available.
> > On mthca, this increases netperf BW by some 5% with
> > same or lower service demand.
>
> I wonder if this is the best way to use multiple vectors in IPoIB.
> Shirley has pointed out that right now we can't scale to both ports on
> 2-port adapters because both ports end up using the same interrupt.
> So maybe we want to target each port to separate completion vectors
> when available?
>
> - R.

Yes, it would be better to do perPort/perCompletion/perEvent/perInterrupt.
It also helps latency not just throughput.

Shirley Ma
IBM Linux Technology Center

From rdreier at cisco.com Thu May 3 12:01:03 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 12:01:03 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: (Pradeep Satyanarayana's message of "Fri, 27 Apr 2007 18:51:14 -0600")
References:
Message-ID:

> +#define IPOIB_CM_OP_NOSRQ (1ul << 29)

I don't understand the point of this... the only places you do
anything with it are:

> + priv->cm.rx_wr.wr_id = wr_id << 32 | index | IPOIB_CM_OP_NOSRQ;

> + index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ) & NOSRQ_INDEX_MASK ;

> + if ((wc->wr_id & IPOIB_CM_OP_SRQ) || (wc->wr_id & IPOIB_CM_OP_NOSRQ))

so probably the most sensible thing to do is just to rename
IPOIB_CM_OP_SRQ to IPOIB_OP_CM_RECV.

> +/* These two go hand in hand */
> +#define NOSRQ_INDEX_RING_SIZE 1024
> +#define NOSRQ_INDEX_MASK 0x00000000000003ff

Rather than having a comment, I would just do

    #define NOSRQ_INDEX_RING_SIZE 1024
    #define NOSRQ_INDEX_MASK (NOSRQ_INDEX_RING_SIZE - 1)

also I think the RING name is wrong -- it's not a ring, it's a table,
right? I don't like having a static limit on the number of nosrq
connections; could this be a hash table instead?

Some of these changes seem strange:

> - priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i];
> + priv->cm.rx_sge[i].addr =
> + priv->cm.srq_ring[id].mapping[i];

> - priv->cm.srq_ring[id].mapping);
> + priv->cm.srq_ring[id].mapping);

please try to put any changes like this that you want to make into a
separate patch.
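
For reference, a minimal userspace sketch of the mask suggestion above, together with the wr_id packing that the quoted hunks rely on. It assumes the layout the patch appears to use (receive slot in the upper 32 bits, connection index in the low bits). The helper names and the TABLE spelling are hypothetical and not taken from the patch. One thing the sketch makes explicit: if the shifted value is a plain 32-bit int, it has to be widened to 64 bits before the shift, since shifting a 32-bit value by 32 is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical names modelled on the quoted patch. The table size must be
 * a power of two so that (size - 1) is a valid bit mask.
 */
#define NOSRQ_INDEX_TABLE_SIZE 1024
#define NOSRQ_INDEX_MASK       (NOSRQ_INDEX_TABLE_SIZE - 1)

/* Pack a receive slot (upper 32 bits) and a connection index (low bits). */
static uint64_t nosrq_pack_wr_id(uint32_t slot, uint32_t index)
{
    /* Widen to 64 bits before shifting by 32. */
    return ((uint64_t)slot << 32) | (index & NOSRQ_INDEX_MASK);
}

/* Recover the connection index from a completion's wr_id. */
static uint32_t nosrq_wr_id_index(uint64_t wr_id)
{
    return (uint32_t)(wr_id & NOSRQ_INDEX_MASK);
}

/* Recover the receive slot from a completion's wr_id. */
static uint32_t nosrq_wr_id_slot(uint64_t wr_id)
{
    return (uint32_t)(wr_id >> 32);
}
```

With the mask derived from the size, changing the table size cannot leave the two defines out of step, which is the point of dropping the "go hand in hand" comment.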
> + if (priv->cm.srq)
> + priv->cm.srq_ring[id].skb = skb;
> + else {
> + index = id & NOSRQ_INDEX_MASK ;
> + wr_id = id >> 32;
> + spin_lock_irqsave(&priv->lock, flags);
> + rx_ptr = priv->cm.rx_index_ring[index];
> + spin_unlock_irqrestore(&priv->lock, flags);
> +
> + rx_ptr->rx_ring[wr_id].skb = skb;

why does the nosrq case need to take a lock when the srq case doesn't?
A comment would be welcome here...

> - .srq = priv->cm.srq,
>
> + if (priv->cm.srq)
> + attr.srq = priv->cm.srq;
> + else
> + attr.srq = NULL;

isn't the code you use to replace the assignment just an obfuscated
version of the original assignment?

> - rep.srq = 1;
> + if (priv->cm.srq)
> + rep.srq = 1;
> + else
> + rep.srq = 0;

similarly I would rather see "rep.srq = !!priv->cm.srq"

> + /* Allocate space for the rx_ring here */
> + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring,
> + GFP_KERNEL);

in general comments are good, but I don't like to see redundancy like:

    /* do something here */
    do_something();

> + if ( index == NOSRQ_INDEX_RING_SIZE) {

no space between ( and index please.

> + printk(KERN_WARNING "NOSRQ supports a max of %d RC "
> + "QPs. That limit has now been reached\n",
> + NOSRQ_INDEX_RING_SIZE);

ipoib_warn() instead of printk? Also isn't this going to flood logs
if the remote side keeps trying to connect?

> + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
> + if (ret) {
> + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n",
> ret);
> + goto err_modify_nosrq;
> + }

It's good to goto to unwind code, but in this case you just have a
return at err_modify_nosrq -- why not just return directly? However
you seem to leak rx_ring here, so it would be better to use the unwind
code more consistently instead of using return later.
> + for (i = 0; i < ipoib_recvq_size; ++i) {
> + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index,
> + IPOIB_CM_RX_SG - 1,
> + p->rx_ring[i].mapping)) {
> + ipoib_warn(priv, "failed to allocate receive "
> + "buffer %d\n", i);
> + ipoib_cm_dev_cleanup(dev);
> + /* Free rx_ring previously allocated */

this comment doesn't tell me anything I couldn't see for myself

> + kfree(p->rx_ring);
> + return -ENOMEM;
> + }
> +
> + /* Can we call the nosrq version? */
> + if (ipoib_cm_post_receive(dev, i << 32 | index)) {
> + ipoib_warn(priv, "ipoib_ib_post_receive "
> + "failed for buf %d\n", i);
> + ipoib_cm_dev_cleanup(dev);

seems like you're missing the call to kfree(p->rx_ring) here? this
code could probably benefit from a goto to unwind code.

> + return -EIO;
> + }
> + } /* end for */

and another useless comment here...

+ if (priv->cm.srq == NULL) { /* NOSRQ */

I prefer "if (!priv->cm.srq)" to "== NULL". Also I don't think this
comment tells me anything.

+ psn = random32() & 0xffffff;
+ if (ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn))
+ goto err_modify;
+ } else { /* SRQ */
+ p->rx_ring = NULL; /* This is used only by NOSRQ */
+ psn = random32() & 0xffffff;
+ ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
+ if (ret)
+ goto err_modify;
+ }

move the psn assignment out of the if?

> + struct ipoib_dev_priv *priv = netdev_priv(dev);
> + int ret;
> +
> +

please avoid double spacing.

> - attr.srq = priv->cm.srq;
> + if (priv->cm.srq)
> + attr.srq = priv->cm.srq;
> + else
> + attr.srq = NULL;

adding the if seems like pure obfuscation here...

+ if (attr.max_srq)
+ supports_srq = 1; /* This device supports SRQ */
+ else {
+ supports_srq = 0;
}

I don't see what the supports_srq temporary variable buys you. Also
please don't put { } around one-line blocks.

> + priv->cm.rx_index_ring = kzalloc(NOSRQ_INDEX_RING_SIZE *
> + sizeof *priv->cm.rx_index_ring,
> + GFP_KERNEL);
> + }

Handle allocation failure here?

- R.
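
As an illustration of the "goto to unwind code" point in the review above, here is a rough userspace analogy, not the driver code. malloc/calloc/free stand in for the kernel allocators, and every name in it (rx_conn, post_receive, setup_rx_ring, fail_post_at) is hypothetical. The idea is that every failure after the ring allocation leaves through one label, so the ring cannot be leaked on one path and freed on another.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the per-connection receive state. */
struct rx_buf  { void *data; };
struct rx_conn { struct rx_buf *rx_ring; };

/* Test hook: post_receive() fails for this buffer index; -1 means never. */
static int fail_post_at = -1;

/* Stand-in for posting a receive work request. */
static int post_receive(int i)
{
    return i == fail_post_at ? -EIO : 0;
}

/* Allocate the ring, then allocate and post nbufs buffers (nbufs > 0). */
static int setup_rx_ring(struct rx_conn *p, int nbufs, size_t buf_size)
{
    int i, ret;

    p->rx_ring = calloc((size_t)nbufs, sizeof *p->rx_ring);
    if (!p->rx_ring)
        return -ENOMEM;        /* nothing to unwind yet */

    for (i = 0; i < nbufs; ++i) {
        p->rx_ring[i].data = malloc(buf_size);
        if (!p->rx_ring[i].data) {
            ret = -ENOMEM;
            goto err_free_ring;
        }
        ret = post_receive(i);
        if (ret)
            goto err_free_ring;
    }
    return 0;

err_free_ring:
    /* Slots 0..i were assigned by malloc(); slot i may be NULL, which free() accepts. */
    for (; i >= 0; --i)
        free(p->rx_ring[i].data);
    free(p->rx_ring);
    p->rx_ring = NULL;
    return ret;
}
```

Both failure cases in the quoted loop (buffer allocation and posting the receive) end up at the same label here, which is what closes the missing-kfree gap pointed out in the review.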
From etta at systemfabricworks.com Thu May 3 12:26:05 2007 From: etta at systemfabricworks.com (Chieng Etta) Date: Thu, 3 May 2007 14:26:05 -0500 Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 In-Reply-To: Message-ID: <006f01c78db8$e57d1ff0$c801a8c0@ettac> Hi Steffen, After removing all the OFED packages by using ./uninstall.sh, I tried ./build.sh to build the RPMs then installed libibverbs-1.1-0.x86_64.rpm onto system. "libibverbs.so.1.0.0" was installed under the right directories (/usr/lib and /usr/lib64). Please see the output below. Thanks, Etta [root at sfw1 etc]# cat /etc/*release Red Hat Enterprise Linux AS release 4 (Nahant Update 4) [root at sfw1 etc]# uname -a Linux sfw1.sfw.int 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux [root at sfw1 lib64]# pwd /usr/lib64 [root at sfw1 lib64]# ll libibverbs* ls: libibverbs*: No such file or directory [root at sfw1 lib64]# rpm -aq |grep libibverbs [root at sfw1 lib64]# cd - /root/images/OFED-1.2-rc2/RPMS/redhat-release-4AS-5.5 [root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm /etc/ld.so.conf.d/ofed.conf /usr/lib/libibverbs.so.1 /usr/lib/libibverbs.so.1.0.0 /usr/lib64/libibverbs.so.1 /usr/lib64/libibverbs.so.1.0.0 [root at sfw1 redhat-release-4AS-5.5]# rpm -ivh libibverbs-1.1-0.x86_64.rpm Preparing... 
########################################### [100%] 1:libibverbs ########################################### [100%] [root at sfw1 redhat-release-4AS-5.5]# rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm x86_64 [root at sfw1 redhat-release-4AS-5.5]# cd - /usr/lib64 [root at sfw1 lib64]# rpm -aq |grep libibverbs libibverbs-1.1-0 [root at sfw1 lib64]# ll libibverbs* lrwxrwxrwx 1 root root 19 May 3 13:50 libibverbs.so.1 -> libibverbs.so.1.0.0 -rwxr-xr-x 1 root root 200993 May 3 13:18 libibverbs.so.1.0.0 [root at sfw1 lib64]# file libibverbs.so.1.0.0 libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped [root at sfw1 lib]# cd /usr/lib [root at sfw1 lib]# file libibverbs.so.1.0.0 libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), not stripped [root at sfw1 etc]# cat /etc/ld.so.conf include ld.so.conf.d/*.conf /usr/ofed/lib64 [root at sfw1 etc]# cat /etc/ld.so.conf.d/ofed.conf /usr/lib64 /usr/lib -----Original Message----- From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Steffen Persvold Sent: Thursday, May 03, 2007 10:26 AM To: vlad at dev.mellanox.co.il Cc: openfabrics-ewg at openib.org; openib-general at openib.org Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 Vladimir, Nope. Still the same issue. The RPMs will only contain one set of libraries and it is always in /usr/lib (if I set the build_32bit=0 option I get the 64bit libraries but in the wrong directory). Seriously, am I the only one seeing this ? I would think rhel4 u4 was a very normal test platform ? Cheers, Steffen Persvold Technical Director Americas tel. 508-281-7100 x401 fax. 
508-281-7171 http://www.scali.com/ Scaling the Linux datacenter > -----Original Message----- > From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il] > Sent: Thursday, May 03, 2007 9:07 AM > To: Steffen Persvold > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > Please see if this happens in OFED-1.2-20070503-0600. > But first uninstall the previous OFED version with ofed_uninstall.sh > command. > > Thanks, > > Regards, > Vladimir > > On Wed, 2007-05-02 at 11:30 -0400, Steffen Persvold wrote: > > Hmm, > > > > so I tried something. I put : > > > > build_32bit=0 > > > > into my ofed.conf file and rebuilt (build.sh -c ofed.conf). This time > > it built 64bit libraries, but it puts them in the wrong directory : > > > > # rpm -qpl ../libibverbs-1.1-0.x86_64.rpm > > /etc/ld.so.conf.d/ofed.conf > > /usr/lib/libibverbs.so.1 > > /usr/lib/libibverbs.so.1.0.0 > > > > # file /usr/lib/libibverbs.so.1.0.0 > > /usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD > > x86-64, version 1 (SYSV), not stripped > > > > So what's up ?? > > > > Cheers, > > Steffen Persvold > > Technical Director Americas > > tel. 508-281-7100 x401 > > fax. 508-281-7171 > > > > http://www.scali.com/ > > Scaling the Linux datacenter > > > > > > ______________________________________________________________________ > > From: Steffen Persvold > > Sent: Wed 5/2/2007 10:30 AM > > To: Steffen Persvold; Vladimir Sokolovsky > > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > > > > > Also, > > > > If I look at the /etc/ld.so.conf/ofed.conf file I have : > > > > # cat ofed.conf > > /usr/lib > > /usr/lib > > > > > > which seems kinda weird ? :) > > > > Cheers, > > > > Steffen Persvold > > Technical Director Americas > > tel. 508-281-7100 x401 > > fax. 
508-281-7171 > > > > http://www.scali.com/ > > Scaling the Linux datacenter > > > > > > ______________________________________________________________________ > > From: ewg-bounces at lists.openfabrics.org on behalf of Steffen Persvold > > Sent: Wed 5/2/2007 10:20 AM > > To: Vladimir Sokolovsky > > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > > > > > Nope : > > > > > > [redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm > > /etc/ld.so.conf.d/ofed.conf > > /usr/lib/libibverbs.so.1 > > /usr/lib/libibverbs.so.1.0.0 > > [redhat-release-4ES-5.5]# > > > > So the RPM got built, but without 64bit libraries. Now if it was the > > other way around (i.e no 32bit libraries) I could have understood it > > (as 32bit is an option on x86_64), but not having the native 64bit > > libraries is not so easy to understand :) > > > > cheers, > > Steffen Persvold > > Technical Director Americas > > tel. 508-281-7100 x401 > > fax. 508-281-7171 > > > > http://www.scali.com/ > > Scaling the Linux datacenter > > > > > > ______________________________________________________________________ > > From: ewg-bounces at lists.openfabrics.org on behalf of Vladimir > > Sokolovsky > > Sent: Wed 5/2/2007 10:05 AM > > To: Steffen Persvold > > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > > Subject: Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > > > > > Don't you have /usr/lib64/libibverbs.so.1.0.0? 
> > > > Regards, > > Vladimir > > > > On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote: > > > Folks, > > > > > > I used the build.sh script to build the above mentioned packages on > > > rhel4u4 x86_64, but for some reason it only compiles 32bit libraries > > > (even if the packages are named x86_64) : > > > > > > # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm > > > x86_64 > > > > > > (after installing it) : > > > > > > # file /usr/lib/libibverbs.so.1.0.0 > > > /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel > > > 80386, version 1 (SYSV), not stripped > > > > > > What did I do wrong ?? > > > > > > Cheers, > > > Steffen Persvold > > > Technical Director Americas > > > tel. 508-281-7100 x401 > > > fax. 508-281-7171 > > > > > > http://www.scali.com/ > > > Scaling the Linux datacenter > > > _______________________________________________ > > > ewg mailing list > > > ewg at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > > > _______________________________________________ > > ewg mailing list > > ewg at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > > > > > _______________________________________________ ewg mailing list ewg at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From mst at dev.mellanox.co.il Thu May 3 12:29:06 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 May 2007 22:29:06 +0300 Subject: [ofa-general] Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors In-Reply-To: References: <20070503104924.GE10009@mellanox.co.il> Message-ID: <20070503192906.GE9719@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors > > > I don't know how many vectors make sense, so I decided to be > > conservative here, since each EQ consumes a lot of memory by default. > > I think #vectors == O(#CPUs) is what we should aim for. 
What do you suggest to do about memory scalability? -- MST From rdreier at cisco.com Thu May 3 12:37:12 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 May 2007 12:37:12 -0700 Subject: [ofa-general] Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors In-Reply-To: <20070503192906.GE9719@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 May 2007 22:29:06 +0300") References: <20070503104924.GE10009@mellanox.co.il> <20070503192906.GE9719@mellanox.co.il> Message-ID: > > I think #vectors == O(#CPUs) is what we should aim for. > > What do you suggest to do about memory scalability? Right now we default to 64K CQs, so each completion EQ uses 64K * 32 bytes, which is 2 MB. Which is a lot but not a killer on modern servers. And I would expect memory to scale as O(#CPUs) too. So I think 1 or 2 completion EQs per CPU is the right amount. I guess it would be nice if there was something tricky we could do to adjust things on the fly but I don't think it's worth getting too tricky. - R. From ralphc at pathscale.com Thu May 3 12:40:51 2007 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 03 May 2007 12:40:51 -0700 Subject: [ofa-general] IB/ipath - fix two more spin lock problems Message-ID: <1178221251.3407.111.camel@brick.pathscale.com> IB/ipath - fix two more spin lock problems Fix a missing unlock in ipath_rc_rcv_resp() and remove an extra unlock from ipath_rc_rcv_error(). 
Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index 9e68c91..f4d729d 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -1257,6 +1257,7 @@ ack_err: wc.dlid_path_bits = 0; wc.port_num = 0; ipath_sqerror_qp(qp, &wc); + spin_unlock_irqrestore(&qp->s_lock, flags); bail: return; } @@ -1436,7 +1437,6 @@ static inline int ipath_rc_rcv_error(struct ipath_ibdev *dev, break; } qp->r_nak_state = 0; - spin_unlock_irq(&qp->s_lock); tasklet_hi_schedule(&qp->s_task); unlock_done: From ralph.campbell at qlogic.com Thu May 3 12:43:03 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 03 May 2007 12:43:03 -0700 Subject: [ofa-general] IB/ipath - fix a race condition when generating ACKs Message-ID: <1178221383.3407.115.camel@brick.pathscale.com> IB/ipath - fix a race condition when generating ACKs Fix a problem where simple ACKs can be sent ahead of RDMA read responses thus implicitly NAKing the RDMA read. Signed-off-by: Ralph Campbell Signed-off-by: Robert Walsh diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index f4d729d..1915771 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -98,13 +98,21 @@ static int ipath_make_rc_ack(struct ipath_qp *qp, case OP(RDMA_READ_RESPONSE_LAST): case OP(RDMA_READ_RESPONSE_ONLY): case OP(ATOMIC_ACKNOWLEDGE): - qp->s_ack_state = OP(ACKNOWLEDGE); + /* + * We can increment the tail pointer now that the last + * response has been sent instead of only being + * constructed. + */ + if (++qp->s_tail_ack_queue > IPATH_MAX_RDMA_ATOMIC) + qp->s_tail_ack_queue = 0; /* FALLTHROUGH */ + case OP(SEND_ONLY): case OP(ACKNOWLEDGE): /* Check for no next entry in the queue. 
*/ if (qp->r_head_ack_queue == qp->s_tail_ack_queue) { if (qp->s_flags & IPATH_S_ACK_PENDING) goto normal; + qp->s_ack_state = OP(ACKNOWLEDGE); goto bail; } @@ -117,12 +125,8 @@ static int ipath_make_rc_ack(struct ipath_qp *qp, if (len > pmtu) { len = pmtu; qp->s_ack_state = OP(RDMA_READ_RESPONSE_FIRST); - } else { + } else qp->s_ack_state = OP(RDMA_READ_RESPONSE_ONLY); - if (++qp->s_tail_ack_queue > - IPATH_MAX_RDMA_ATOMIC) - qp->s_tail_ack_queue = 0; - } ohdr->u.aeth = ipath_compute_aeth(qp); hwords++; qp->s_ack_rdma_psn = e->psn; @@ -139,8 +143,6 @@ static int ipath_make_rc_ack(struct ipath_qp *qp, cpu_to_be32(e->atomic_data); hwords += sizeof(ohdr->u.at) / sizeof(u32); bth2 = e->psn; - if (++qp->s_tail_ack_queue > IPATH_MAX_RDMA_ATOMIC) - qp->s_tail_ack_queue = 0; } bth0 = qp->s_ack_state << 24; break; @@ -156,8 +158,6 @@ static int ipath_make_rc_ack(struct ipath_qp *qp, ohdr->u.aeth = ipath_compute_aeth(qp); hwords++; qp->s_ack_state = OP(RDMA_READ_RESPONSE_LAST); - if (++qp->s_tail_ack_queue > IPATH_MAX_RDMA_ATOMIC) - qp->s_tail_ack_queue = 0; } bth0 = qp->s_ack_state << 24; bth2 = qp->s_ack_rdma_psn++ & IPATH_PSN_MASK; @@ -171,7 +171,7 @@ static int ipath_make_rc_ack(struct ipath_qp *qp, * the ACK before setting s_ack_state to ACKNOWLEDGE * (see above). */ - qp->s_ack_state = OP(ATOMIC_ACKNOWLEDGE); + qp->s_ack_state = OP(SEND_ONLY); qp->s_flags &= ~IPATH_S_ACK_PENDING; qp->s_cur_sge = NULL; if (qp->s_nak_state) @@ -223,7 +223,7 @@ int ipath_make_rc_req(struct ipath_qp *qp, /* Sending responses has higher priority over sending requests. */ if ((qp->r_head_ack_queue != qp->s_tail_ack_queue || (qp->s_flags & IPATH_S_ACK_PENDING) || - qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE) && + qp->s_ack_state != OP(ACKNOWLEDGE)) && ipath_make_rc_ack(qp, ohdr, pmtu, bth0p, bth2p)) goto done; @@ -585,7 +585,9 @@ static void send_rc_ack(struct ipath_qp *qp) unsigned long flags; /* Don't send ACK or NAK if a RDMA read or atomic is pending. 
*/ - if (qp->r_head_ack_queue != qp->s_tail_ack_queue) + if (qp->r_head_ack_queue != qp->s_tail_ack_queue || + (qp->s_flags & IPATH_S_ACK_PENDING) || + qp->s_ack_state != OP(ACKNOWLEDGE)) goto queue_ack; /* Construct the header. */ From mst at dev.mellanox.co.il Thu May 3 12:46:37 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 May 2007 22:46:37 +0300 Subject: [ofa-general] Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors In-Reply-To: References: <20070503104924.GE10009@mellanox.co.il> <20070503192906.GE9719@mellanox.co.il> Message-ID: <20070503194637.GF9719@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors > > > > I think #vectors == O(#CPUs) is what we should aim for. > > > > What do you suggest to do about memory scalability? > > Right now we default to 64K CQs, so each completion EQ uses 64K * 32 bytes, > which is 2 MB. Which is a lot but not a killer on modern servers. > And I would expect memory to scale as O(#CPUs) too. So I think 1 or 2 > completion EQs per CPU is the right amount. With dual-core, I'm not sure memory scales as fast as #CPUs anymore. > I guess it would be nice if there was something tricky we could do to > adjust things on the fly but I don't think it's worth getting too tricky. Problem is, #CPUs is not a static value anymore. With CPU hotplug num_possible_cpus is quite often 4 or 8 even though one actually has a single one present. -- MST From mst at dev.mellanox.co.il Thu May 3 12:49:38 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 3 May 2007 22:49:38 +0300 Subject: [ofa-general] Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors In-Reply-To: References: <20070503104924.GE10009@mellanox.co.il> Message-ID: <20070503194938.GG9719@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors > > > I don't know how many vectors make sense, so I decided to be > > conservative here, since each EQ consumes a lot of memory by default. > > I think #vectors == O(#CPUs) is what we should aim for. > > Also another useful thing for NUMA systems might be to allocate the EQ > memory from the CPU where that interrupt will be targeted. Can't interrupts migrate between CPUs? > But I > don't know exactly the best way to do that, and I think we can leave > that for later. OTOH, especially for latency, it might be best to only have 1 interrupt, and service that on the node closest to device. -- MST From rick.jones2 at hp.com Thu May 3 13:10:45 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 03 May 2007 13:10:45 -0700 Subject: [ofa-general] Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors In-Reply-To: <20070503194938.GG9719@mellanox.co.il> References: <20070503104924.GE10009@mellanox.co.il> <20070503194938.GG9719@mellanox.co.il> Message-ID: <463A41C5.6050106@hp.com> Michael S. Tsirkin wrote: >>Quoting Roland Dreier : >>Subject: Re: [PATCH 2 of 3] IB/mthca: support multiple completion vectors >> >> > I don't know how many vectors make sense, so I decided to be >> > conservative here, since each EQ consumes a lot of memory by default. >> >>I think #vectors == O(#CPUs) is what we should aim for. >> >>Also another useful thing for NUMA systems might be to allocate the EQ >>memory from the CPU where that interrupt will be targeted. > > > Can't interrupts migrate between CPUs? Only if someone leaves the irqbalancer running. Given that it isn't presently NUMA-aware (plusungood) I at least tend to disable it. 
Apart from it, then generally only under explicit administrator command would an interrupt be migrated from one CPU to another. > > OTOH, especially for latency, it might be best to only have 1 interrupt, > and service that on the node closest to device. > topological proximity is a good thing. there can be more than 1 core "topologically close" to the I/O card. rickjones From HNGUYEN at de.ibm.com Thu May 3 13:20:40 2007 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Thu, 3 May 2007 22:20:40 +0200 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review In-Reply-To: Message-ID: > > BTW, why do you ignore the option to use UC QP? > > MST > Unfortunately, eHCA doesn't support UC in current version. Next > generation will have RC w/i SRQ support. Current ehca surely supports UC. Just give ibv_uc_pingpong a try. Nam From rdreier at cisco.com Thu May 3 13:30:00 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 May 2007 13:30:00 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review In-Reply-To: (Hoang-Nam Nguyen's message of "Thu, 3 May 2007 22:20:40 +0200") References: Message-ID: > Current ehca surely supports UC. Just give ibv_uc_pingpong a try. Thanks... in that case I would definitely suggest investigating using UC for IPoIB connected mode when SRQs are not available. In fact assuming the IBA work to add the ability to attach UC QPs to SRQs is completed, I think we would probably want to move IPoIB CM to using UC exclusively. - R. 
From rdreier at cisco.com Thu May 3 13:31:38 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 May 2007 13:31:38 -0700 Subject: [ofa-general] IB/ipath - fix two more spin lock problems In-Reply-To: <1178221251.3407.111.camel@brick.pathscale.com> (Ralph Campbell's message of "Thu, 03 May 2007 12:40:51 -0700") References: <1178221251.3407.111.camel@brick.pathscale.com> Message-ID: thanks, applied From rdreier at cisco.com Thu May 3 13:33:00 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 May 2007 13:33:00 -0700 Subject: [ofa-general] Re: IB/ipath - fix a race condition when generating ACKs In-Reply-To: <1178221383.3407.115.camel@brick.pathscale.com> (Ralph Campbell's message of "Thu, 03 May 2007 12:43:03 -0700") References: <1178221383.3407.115.camel@brick.pathscale.com> Message-ID: applied, thanks From xma at us.ibm.com Thu May 3 13:38:47 2007 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 3 May 2007 13:38:47 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review In-Reply-To: Message-ID: Roland Dreier wrote on 05/03/2007 01:30:00 PM: > > Current ehca surely supports UC. Just give ibv_uc_pingpong a try. > > Thanks... in that case I would definitely suggest investigating using > UC for IPoIB connected mode when SRQs are not available. > > In fact assuming the IBA work to add the ability to attach UC QPs to > SRQs is completed, I think we would probably want to move IPoIB CM to > using UC exclusively. > > - R. Agree. Thanks Shirley Ma From rdreier at cisco.com Thu May 3 13:44:07 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 May 2007 13:44:07 -0700 Subject: [ofa-general] Last chance to NAK IPoIB NAPI Message-ID: I think we have consensus on merging IPoIB NAPI for 2.6.22, so I'll just post the patches one more time and ask Linus to pull tomorrow. If you don't think we should merge this, please let me know now! - R. 
From pradeep at us.ibm.com Thu May 3 13:49:34 2007 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Thu, 3 May 2007 13:49:34 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review In-Reply-To: Message-ID: general-bounces at lists.openfabrics.org wrote on 05/03/2007 01:30:00 PM: > > Current ehca surely supports UC. Just give ibv_uc_pingpong a try. > > Thanks... in that case I would definitely suggest investigating using > UC for IPoIB connected mode when SRQs are not available. > > In fact assuming the IBA work to add the ability to attach UC QPs to > SRQs is completed, I think we would probably want to move IPoIB CM to > using UC exclusively. Then in the interim, how do we address the interoperability issue between, say, Topspin and IBM HCAs using connected mode - switch to datagram mode? This discussion started from using a non-zero retry count - that is still easily solvable by patching CM and/or drivers and resetting the retry_count back to 0. Switching to datagram mode will result in too big a performance drop. I am not inclined to go in that direction. Pradeep pradeep at us.ibm.com From rdreier at cisco.com Thu May 3 13:49:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 May 2007 13:49:44 -0700 Subject: [ofa-general] [PATCH 1/2] IB: Return "maybe missed event" hint from ib_req_notify_cq() In-Reply-To: (Roland Dreier's message of "Thu, 03 May 2007 13:44:07 -0700") References: Message-ID: The semantics defined by the InfiniBand specification say that completion events are only generated when a completion is added to a completion queue (CQ) after completion notification is requested. 
In other words, this means that the following race is possible:

	while (CQ is not empty)
		ib_poll_cq(CQ);
	// new completion is added after while loop is exited
	ib_req_notify_cq(CQ);
	// no event is generated for the existing completion

To close this race, the IB spec recommends doing another poll of the CQ after requesting notification.

However, it is not always possible to arrange code this way (for example, we have found that NAPI for IPoIB cannot poll after requesting notification). Also, some hardware (e.g. Mellanox HCAs) actually will generate an event for completions added before the call to ib_req_notify_cq() -- which is allowed by the spec, since there's no way for any upper-layer consumer to know exactly when a completion was really added -- so the extra poll of the CQ is just a waste.

Motivated by this, we add a new flag "IB_CQ_REPORT_MISSED_EVENTS" for ib_req_notify_cq() so that it can return a hint about whether a completion may have been added before the request for notification. The return value of ib_req_notify_cq() is extended so:

	 < 0 means an error occurred while requesting notification
	== 0 means notification was requested successfully, and if
	     IB_CQ_REPORT_MISSED_EVENTS was passed in, then no
	     events were missed and it is safe to wait for another
	     event.
	 > 0 is only returned if IB_CQ_REPORT_MISSED_EVENTS was
	     passed in. It means that the consumer must poll the
	     CQ again to make sure it is empty to avoid the race
	     described above.

We add a flag to enable this behavior rather than turning it on unconditionally, because checking for missed events may incur significant overhead for some low-level drivers, and consumers that don't care about the results of this test shouldn't be forced to pay for the test.
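The return-value convention described above might be consumed roughly as in the following userspace sketch. Everything in it is a hypothetical mock (struct mock_cq, mock_poll(), mock_req_notify(), drain_and_arm() and the MOCK_* flag names); it is not the kernel verbs API, and it only imitates the flag semantics proposed in this patch. The "late" field simulates completions that land after the final poll but before the CQ is armed.

```c
/* Hypothetical mock flags, modelled on the values proposed in the patch. */
enum mock_notify_flags {
	MOCK_CQ_SOLICITED            = 1 << 0,
	MOCK_CQ_NEXT_COMP            = 1 << 1,
	MOCK_CQ_REPORT_MISSED_EVENTS = 1 << 2,
};

/* Hypothetical mock CQ; not struct ib_cq. */
struct mock_cq {
	int pending;	/* completions waiting in the queue */
	int armed;	/* set once notification has been requested */
	int late;	/* completions that arrive just before arming */
};

/* Remove one completion; returns 1 if one was polled, 0 if the CQ is empty. */
static int mock_poll(struct mock_cq *cq)
{
	if (cq->pending > 0) {
		cq->pending--;
		return 1;
	}
	return 0;
}

/*
 * Arm the CQ.  With MOCK_CQ_REPORT_MISSED_EVENTS set, return > 0 when
 * completions are already queued, i.e. an event may have been missed.
 */
static int mock_req_notify(struct mock_cq *cq, int flags)
{
	/* Simulate the race: completions land after the last poll. */
	cq->pending += cq->late;
	cq->late = 0;
	cq->armed = 1;

	if (flags & MOCK_CQ_REPORT_MISSED_EVENTS)
		return cq->pending > 0;
	return 0;
}

/* Poll until empty, arm, and poll again only if the hint says so. */
static int drain_and_arm(struct mock_cq *cq)
{
	int done = 0;

	for (;;) {
		while (mock_poll(cq))
			done++;

		/* <= 0: error, or armed with nothing missed; stop polling. */
		if (mock_req_notify(cq, MOCK_CQ_NEXT_COMP |
					MOCK_CQ_REPORT_MISSED_EVENTS) <= 0)
			break;
		/* > 0: a completion may have slipped in, so poll again. */
	}
	return done;
}
```

A consumer that does not pass the REPORT_MISSED_EVENTS flag would always get 0 back on success and would have to fall back on the unconditional re-poll recommended by the spec.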
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/amso1100/c2.h | 2 +- drivers/infiniband/hw/amso1100/c2_cq.c | 16 ++++++++--- drivers/infiniband/hw/cxgb3/cxio_hal.c | 3 ++ drivers/infiniband/hw/cxgb3/iwch_provider.c | 8 +++-- drivers/infiniband/hw/ehca/ehca_iverbs.h | 2 +- drivers/infiniband/hw/ehca/ehca_reqs.c | 14 +++++++-- drivers/infiniband/hw/ehca/ipz_pt_fn.h | 8 +++++ drivers/infiniband/hw/ipath/ipath_cq.c | 15 +++++++--- drivers/infiniband/hw/ipath/ipath_verbs.h | 2 +- drivers/infiniband/hw/mthca/mthca_cq.c | 12 +++++--- drivers/infiniband/hw/mthca/mthca_dev.h | 4 +- include/rdma/ib_verbs.h | 40 +++++++++++++++++++++------ 12 files changed, 93 insertions(+), 33 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h index 04a9db5..fa58200 100644 --- a/drivers/infiniband/hw/amso1100/c2.h +++ b/drivers/infiniband/hw/amso1100/c2.h @@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq); extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags); /* CM */ extern int c2_llp_connect(struct iw_cm_id *cm_id, diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c index 5175c99..d2b3366 100644 --- a/drivers/infiniband/hw/amso1100/c2_cq.c +++ b/drivers/infiniband/hw/amso1100/c2_cq.c @@ -217,17 +217,19 @@ int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) return npolled; } -int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags) { struct c2_mq_shared __iomem *shared; struct c2_cq *cq; + unsigned long flags; + int ret = 0; cq = to_c2cq(ibcq); shared = 
cq->mq.peer; - if (notify == IB_CQ_NEXT_COMP) + if ((notify_flags & IB_CQ_SOLICITED_MASK) == IB_CQ_NEXT_COMP) writeb(C2_CQ_NOTIFICATION_TYPE_NEXT, &shared->notification_type); - else if (notify == IB_CQ_SOLICITED) + else if ((notify_flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED) writeb(C2_CQ_NOTIFICATION_TYPE_NEXT_SE, &shared->notification_type); else return -EINVAL; @@ -241,7 +243,13 @@ int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) */ readb(&shared->armed); - return 0; + if (notify_flags & IB_CQ_REPORT_MISSED_EVENTS) { + spin_lock_irqsave(&cq->lock, flags); + ret = !c2_mq_empty(&cq->mq); + spin_unlock_irqrestore(&cq->lock, flags); + } + + return ret; } static void c2_free_cq_buf(struct c2_dev *c2dev, struct c2_mq *mq) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index f5e9aee..76049af 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -114,7 +114,10 @@ int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq, return -EIO; } } + + return 1; } + return 0; } diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index af28a31..b7a2183 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -292,7 +292,7 @@ static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata) #endif } -static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags) { struct iwch_dev *rhp; struct iwch_cq *chp; @@ -303,7 +303,7 @@ static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) chp = to_iwch_cq(ibcq); rhp = chp->rhp; - if (notify == IB_CQ_SOLICITED) + if ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED) cq_op = CQ_ARM_SE; else cq_op = CQ_ARM_AN; @@ -317,9 +317,11 @@ static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) PDBG("%s rptr 0x%x\n", 
__FUNCTION__, chp->cq.rptr); err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0); spin_unlock_irqrestore(&chp->lock, flag); - if (err) + if (err < 0) printk(KERN_ERR MOD "Error %d rearming CQID 0x%x\n", err, chp->cq.cqid); + if (err > 0 && !(flags & IB_CQ_REPORT_MISSED_EVENTS)) + err = 0; return err; } diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 95fd59f..9e5460d 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -135,7 +135,7 @@ int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc); int ehca_peek_cq(struct ib_cq *cq, int wc_cnt); -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify); +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags notify_flags); struct ib_qp *ehca_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *init_attr, diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index 08d3f89..caec9de 100644 --- a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -634,11 +634,13 @@ poll_cq_exit0: return ret; } -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags notify_flags) { struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); + unsigned long spl_flags; + int ret = 0; - switch (cq_notify) { + switch (notify_flags & IB_CQ_SOLICITED_MASK) { case IB_CQ_SOLICITED: hipz_set_cqx_n0(my_cq, 1); break; @@ -649,5 +651,11 @@ int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) return -EINVAL; } - return 0; + if (notify_flags & IB_CQ_REPORT_MISSED_EVENTS) { + spin_lock_irqsave(&my_cq->spinlock, spl_flags); + ret = ipz_qeit_is_valid(&my_cq->ipz_queue); + spin_unlock_irqrestore(&my_cq->spinlock, spl_flags); + } + + return ret; } diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h index 
8199c45..57f141a 100644 --- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h @@ -140,6 +140,14 @@ static inline void *ipz_qeit_get_inc_valid(struct ipz_queue *queue) return cqe; } +static inline int ipz_qeit_is_valid(struct ipz_queue *queue) +{ + struct ehca_cqe *cqe = ipz_qeit_get(queue); + u32 cqe_flags = cqe->cqe_flags; + + return cqe_flags >> 7 == (queue->toggle_state & 1); +} + /* * returns and resets Queue Entry iterator * returns address (kv) of first Queue Entry diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index ea78e6d..1eb204c 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -340,17 +340,18 @@ int ipath_destroy_cq(struct ib_cq *ibcq) /** * ipath_req_notify_cq - change the notification type for a completion queue * @ibcq: the completion queue - * @notify: the type of notification to request + * @notify_flags: the type of notification to request * * Returns 0 for success. * * This may be called from interrupt context. Also called by * ib_req_notify_cq() in the generic verbs code. */ -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags) { struct ipath_cq *cq = to_icq(ibcq); unsigned long flags; + int ret = 0; spin_lock_irqsave(&cq->lock, flags); /* @@ -358,9 +359,15 @@ int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) * any other transitions (see C11-31 and C11-32 in ch. 11.4.2.2). 
*/ if (cq->notify != IB_CQ_NEXT_COMP) - cq->notify = notify; + cq->notify = notify_flags & IB_CQ_SOLICITED_MASK; + + if ((notify_flags & IB_CQ_REPORT_MISSED_EVENTS) && + cq->queue->head != cq->queue->tail) + ret = 1; + spin_unlock_irqrestore(&cq->lock, flags); - return 0; + + return ret; } /** diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 7c4929f..6662380 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -740,7 +740,7 @@ struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int ipath_destroy_cq(struct ib_cq *ibcq); -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags); int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata); diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index efd79ef..cf0868f 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -726,11 +726,12 @@ repoll: return err == 0 || err == -EAGAIN ? npolled : err; } -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify) +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags) { __be32 doorbell[2]; - doorbell[0] = cpu_to_be32((notify == IB_CQ_SOLICITED ? + doorbell[0] = cpu_to_be32(((flags & IB_CQ_SOLICITED_MASK) == + IB_CQ_SOLICITED ? 
MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL : MTHCA_TAVOR_CQ_DB_REQ_NOT) | to_mcq(cq)->cqn); @@ -743,7 +744,7 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify) return 0; } -int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags) { struct mthca_cq *cq = to_mcq(ibcq); __be32 doorbell[2]; @@ -755,7 +756,8 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) doorbell[0] = ci; doorbell[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) | - (notify == IB_CQ_SOLICITED ? 1 : 2)); + ((flags & IB_CQ_SOLICITED_MASK) == + IB_CQ_SOLICITED ? 1 : 2)); mthca_write_db_rec(doorbell, cq->arm_db); @@ -766,7 +768,7 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) wmb(); doorbell[0] = cpu_to_be32((sn << 28) | - (notify == IB_CQ_SOLICITED ? + ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ? MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL : MTHCA_ARBEL_CQ_DB_REQ_NOT) | cq->cqn); diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index b7e42ef..9bae3cc 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -495,8 +495,8 @@ void mthca_unmap_eq_icm(struct mthca_dev *dev); int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); -int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags); +int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags); int mthca_init_cq(struct mthca_dev *dev, int nent, struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 765589f..529a69d 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -431,9 +431,11 @@ struct ib_wc { u8 port_num; /* valid only for DR SMPs on switches */ }; 
-enum ib_cq_notify {
-	IB_CQ_SOLICITED,
-	IB_CQ_NEXT_COMP
+enum ib_cq_notify_flags {
+	IB_CQ_SOLICITED = 1 << 0,
+	IB_CQ_NEXT_COMP = 1 << 1,
+	IB_CQ_SOLICITED_MASK = IB_CQ_SOLICITED | IB_CQ_NEXT_COMP,
+	IB_CQ_REPORT_MISSED_EVENTS = 1 << 2,
 };
 
 enum ib_srq_attr_mask {
@@ -987,7 +989,7 @@ struct ib_device {
 						      struct ib_wc *wc);
 	int (*peek_cq)(struct ib_cq *cq, int wc_cnt);
 	int (*req_notify_cq)(struct ib_cq *cq,
-			     enum ib_cq_notify cq_notify);
+			     enum ib_cq_notify_flags flags);
 	int (*req_ncomp_notif)(struct ib_cq *cq,
 			       int wc_cnt);
 	struct ib_mr * (*get_dma_mr)(struct ib_pd *pd,
@@ -1414,14 +1416,34 @@ int ib_peek_cq(struct ib_cq *cq, int wc_cnt);
 /**
  * ib_req_notify_cq - Request completion notification on a CQ.
  * @cq: The CQ to generate an event for.
- * @cq_notify: If set to %IB_CQ_SOLICITED, completion notification will
- *   occur on the next solicited event. If set to %IB_CQ_NEXT_COMP,
- *   notification will occur on the next completion.
+ * @flags:
+ *   Must contain exactly one of %IB_CQ_SOLICITED or %IB_CQ_NEXT_COMP
+ *   to request an event on the next solicited event or next work
+ *   completion at any type, respectively. %IB_CQ_REPORT_MISSED_EVENTS
+ *   may also be |ed in to request a hint about missed events, as
+ *   described below.
+ *
+ * Return Value:
+ *    < 0 means an error occurred while requesting notification
+ *   == 0 means notification was requested successfully, and if
+ *        IB_CQ_REPORT_MISSED_EVENTS was passed in, then no events
+ *        were missed and it is safe to wait for another event. In
+ *        this case is it guaranteed that any work completions added
+ *        to the CQ since the last CQ poll will trigger a completion
+ *        notification event.
+ *    > 0 is only returned if IB_CQ_REPORT_MISSED_EVENTS was passed
+ *        in. It means that the consumer must poll the CQ again to
+ *        make sure it is empty to avoid missing an event because of a
+ *        race between requesting notification and an entry being
+ *        added to the CQ. This return value means it is possible
+ *        (but not guaranteed) that a work completion has been added
+ *        to the CQ since the last poll without triggering a
+ *        completion notification event.
  */
 static inline int ib_req_notify_cq(struct ib_cq *cq,
-				   enum ib_cq_notify cq_notify)
+				   enum ib_cq_notify_flags flags)
 {
-	return cq->device->req_notify_cq(cq, cq_notify);
+	return cq->device->req_notify_cq(cq, flags);
 }
 
 /**
-- 
1.5.1.2

From rdreier at cisco.com Thu May 3 13:50:10 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 13:50:10 -0700
Subject: [ofa-general] IPoIB: Convert to NAPI
In-Reply-To: (Roland Dreier's message of "Thu, 03 May 2007 13:44:07 -0700")
References:
Message-ID:

Convert the IP-over-InfiniBand network device driver over to using
NAPI to handle all completions (both receive and send).

Signed-off-by: Roland Dreier
---
 drivers/infiniband/ulp/ipoib/ipoib.h      |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c   |  2 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c   | 90 +++++++++++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  2 +
 4 files changed, 75 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index fd55826..15867af 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -311,6 +311,7 @@ extern struct workqueue_struct *ipoib_workqueue;
 
 /* functions */
 
+int ipoib_poll(struct net_device *dev, int *budget);
 void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr);
 
 struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 0c4e59b..6f78ae0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -416,7 +416,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	skb->dev = dev;
 	/* XXX get correct PACKET_ type here */
 	skb->pkt_type = PACKET_HOST;
-	netif_rx_ni(skb);
+	netif_receive_skb(skb);
 
 repost:
 	if (unlikely(ipoib_cm_post_receive(dev, wr_id)))
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 1bdb910..fbc7371 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -226,7 +226,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		skb->dev = dev;
 		/* XXX get correct PACKET_ type here */
 		skb->pkt_type = PACKET_HOST;
-		netif_rx_ni(skb);
+		netif_receive_skb(skb);
 	} else {
 		ipoib_dbg_data(priv, "dropping loopback packet\n");
 		dev_kfree_skb_any(skb);
@@ -280,28 +280,64 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 			   wc->status, wr_id, wc->vendor_err);
 }
 
-static void ipoib_ib_handle_wc(struct net_device *dev, struct ib_wc *wc)
+int ipoib_poll(struct net_device *dev, int *budget)
 {
-	if (wc->wr_id & IPOIB_CM_OP_SRQ)
-		ipoib_cm_handle_rx_wc(dev, wc);
-	else if (wc->wr_id & IPOIB_OP_RECV)
-		ipoib_ib_handle_rx_wc(dev, wc);
-	else
-		ipoib_ib_handle_tx_wc(dev, wc);
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int max = min(*budget, dev->quota);
+	int done;
+	int t;
+	int empty;
+	int n, i;
+
+	done = 0;
+	empty = 0;
+
+	while (max) {
+		t = min(IPOIB_NUM_WC, max);
+		n = ib_poll_cq(priv->cq, t, priv->ibwc);
+
+		for (i = 0; i < n; ++i) {
+			struct ib_wc *wc = priv->ibwc + i;
+
+			if (wc->wr_id & IPOIB_CM_OP_SRQ) {
+				++done;
+				--max;
+				ipoib_cm_handle_rx_wc(dev, wc);
+			} else if (wc->wr_id & IPOIB_OP_RECV) {
+				++done;
+				--max;
+				ipoib_ib_handle_rx_wc(dev, wc);
+			} else
+				ipoib_ib_handle_tx_wc(dev, wc);
+		}
+
+		if (n != t) {
+			empty = 1;
+			break;
+		}
+	}
+
+	dev->quota -= done;
+	*budget -= done;
+
+	if (empty) {
+		netif_rx_complete(dev);
+		if (unlikely(ib_req_notify_cq(priv->cq,
+					      IB_CQ_NEXT_COMP |
+					      IB_CQ_REPORT_MISSED_EVENTS))) {
+			netif_rx_reschedule(dev, 0);
+			return 1;
+		}
+
+		return 0;
+	}
+
+	return 1;
 }
 
 void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr)
 {
-	struct net_device *dev = (struct net_device *) dev_ptr;
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	int n, i;
-
-	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
-	do {
-		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc);
-		for (i = 0; i < n; ++i)
-			ipoib_ib_handle_wc(dev, priv->ibwc + i);
-	} while (n == IPOIB_NUM_WC);
+	netif_rx_schedule(dev_ptr);
 }
 
 static inline int post_send(struct ipoib_dev_priv *priv,
@@ -514,9 +550,10 @@ int ipoib_ib_dev_stop(struct net_device *dev)
 	struct ib_qp_attr qp_attr;
 	unsigned long begin;
 	struct ipoib_tx_buf *tx_req;
-	int i;
+	int i, n;
 
 	clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags);
+	netif_poll_disable(dev);
 
 	ipoib_cm_dev_stop(dev);
 
@@ -568,6 +605,18 @@ int ipoib_ib_dev_stop(struct net_device *dev)
 			goto timeout;
 		}
 
+		do {
+			n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
+			for (i = 0; i < n; ++i) {
+				if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
+					ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
+				else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
+					ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
+				else
+					ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
+			}
+		} while (n == IPOIB_NUM_WC);
+
 		msleep(1);
 	}
 
@@ -596,6 +645,9 @@ timeout:
 		msleep(1);
 	}
 
+	netif_poll_enable(dev);
+	ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP);
+
 	return 0;
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index b4c380c..0a428f2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -948,6 +948,8 @@ static void ipoib_setup(struct net_device *dev)
 	dev->hard_header = ipoib_hard_header;
 	dev->set_multicast_list = ipoib_set_mcast_list;
 	dev->neigh_setup = ipoib_neigh_setup_dev;
+	dev->poll = ipoib_poll;
+	dev->weight = 100;
 
 	dev->watchdog_timeo = HZ;
 
-- 
1.5.1.2

From rdreier at cisco.com Thu May 3 13:52:23 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 13:52:23 -0700
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: (Pradeep Satyanarayana's
	message of "Thu, 3 May 2007 13:49:34 -0700")
References:
Message-ID:

 > Then in the interim, how do we address the interoperability issue
 > between say for example, Topspin and IBM HCAs using connected mode
 > -switch to datagram mode?

I think that we need to understand the root cause of the issue and try
to solve it without going to a non-zero retry count if possible.
Right now the theory is that it is a bug in the CM, namely that the
respective HCA ack delays are not taken into account. So we should fix
that bug.

 - R.

From tziporet at dev.mellanox.co.il Thu May 3 14:07:50 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 03 May 2007 14:07:50 -0700
Subject: [ofa-general] OFED 1.2 RC3 is delayed for Monday next week (May 7)
Message-ID: <463A4F26.3010804@mellanox.co.il>

Hi All,
Since some of the critical bugs are not solved yet we decided to delay
the release to Monday May 7.

This is the list of critical bugs that should be fixed for RC3:

bug_id  bug_severity  assigned_to                 short_short_desc
574     blocker       raisch at de.ibm.com        ehca driver fails while running openmpi
420     critical      monil at voltaire.com       PKey table reordering caused by SM failover stops ipoib traffic
577     critical      rolandd at cisco.com        SRP multipath failover too slow (minutes, not seconds)
465     critical      mst at mellanox.co.il       IPoIB HA fails after several hours of failovers
549     critical      amip at dev.mellanox.co.il  SDP Policy need to be consistent
597     critical      vlad at mellanox.co.il      support RHEL4U5 in OFED 1.2
499     major         vlad at mellanox.co.il      module compiled over ofed won't load due to symbol version mismatch
519     major         pasha at mellanox.co.il     MVAPICH I APPLICATION ABORTS WITH PARTITIONS CONFIGURED
534     major         vlad at mellanox.co.il      SLES9 - Installer fails on declarations - OFED 1.2-20070409
530     major         dannyz at mellanox.co.il    ibdiagnet -r fails on RHEL5 i686
538     major         monis at voltaire.com       integrate IPoIB bonding with IPoIB CM
541     major         mst at mellanox.co.il       slow failover with IPoIB CM bonding/ipoibtools HA
558     major         rolandd at cisco.com        tvflash configure fails on SLES10 SP1 RC2

All owners of blocker and critical bugs - please reply with status of
the bug resolution

Thanks,
Tziporet

From swise at opengridcomputing.com Thu May 3 14:14:28 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 03 May 2007 16:14:28 -0500
Subject: [ofa-general] OFED 1.2 RC3 is delayed for Monday next week (May 7)
In-Reply-To: <463A4F26.3010804@mellanox.co.il>
References: <463A4F26.3010804@mellanox.co.il>
Message-ID: <1178226868.27558.58.camel@stevo-desktop>

We also need 599 (rping regression; I just opened this) included in
-rc3. I've submitted the patch and I believe Sean has it ready to go.

Steve.

On Thu, 2007-05-03 at 14:07 -0700, Tziporet Koren wrote:
> Hi All,
> Since some of the critical bugs are not solved yet we decided to delay
> the release to Monday May 7.
>
> This is the list of critical bugs that should be fixed for RC3:
>
> bug_id  bug_severity  assigned_to                 short_short_desc
> 574     blocker       raisch at de.ibm.com        ehca driver fails while running openmpi
> 420     critical      monil at voltaire.com       PKey table reordering caused by SM failover stops ipoib traffic
> 577     critical      rolandd at cisco.com        SRP multipath failover too slow (minutes, not seconds)
> 465     critical      mst at mellanox.co.il       IPoIB HA fails after several hours of failovers
> 549     critical      amip at dev.mellanox.co.il  SDP Policy need to be consistent
> 597     critical      vlad at mellanox.co.il      support RHEL4U5 in OFED 1.2
> 499     major         vlad at mellanox.co.il      module compiled over ofed won't load due to symbol version mismatch
> 519     major         pasha at mellanox.co.il     MVAPICH I APPLICATION ABORTS WITH PARTITIONS CONFIGURED
> 534     major         vlad at mellanox.co.il      SLES9 - Installer fails on declarations - OFED 1.2-20070409
> 530     major         dannyz at mellanox.co.il    ibdiagnet -r fails on RHEL5 i686
> 538     major         monis at voltaire.com       integrate IPoIB bonding with IPoIB CM
> 541     major         mst at mellanox.co.il       slow failover with IPoIB CM bonding/ipoibtools HA
> 558     major         rolandd at cisco.com        tvflash configure fails on SLES10 SP1 RC2
>
>
> All owners of blocker and critical bugs - please reply with status of
> the bug resolution
>
> Thanks,
> Tziporet
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From tziporet at dev.mellanox.co.il Thu May 3 14:27:14 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 03 May 2007 14:27:14 -0700
Subject: [ofa-general] man pages for the rdma-cm
In-Reply-To: <1178211084.27558.8.camel@stevo-desktop>
References: <1178127046.18609.107.camel@stevo-desktop>
	<463A124E.4020604@ichips.intel.com>
	<1178211084.27558.8.camel@stevo-desktop>
Message-ID: <463A53B2.2040103@mellanox.co.il>

Steve Wise wrote:
>>
>> What's the release date for RC3?
>>
>
> I believe it is today.
>
>
Was just delayed to Monday. If you can do it today/tomorrow we may be
able to integrate it

Tziporet

From sean.hefty at intel.com Thu May 3 14:46:04 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 3 May 2007 14:46:04 -0700
Subject: [ofa-general] man pages for the rdma-cm
In-Reply-To: <463A53B2.2040103@mellanox.co.il>
Message-ID: <000201c78dcc$73a2ea90$3b78e984@amr.corp.intel.com>

>Was just delayed to Monday. If you can do it today/tomorrow we may be
>able to integrate it

I will try to complete the man pages for at least the APIs by tomorrow.
I'm about 70% done writing them, but still need to tie them in with the
build scripts.

Steve, I will push the rping changes in with these changes.

- Sean

From xma at us.ibm.com Thu May 3 14:55:49 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 3 May 2007 14:55:49 -0700
Subject: [ofa-general] Re: [PATCH 3 of 3] ipoib/cm: separate comp vectors to RX/TX
In-Reply-To:
Message-ID:

> Yes, it would be better to do
> perPort/perCompletion/perEvent/perInterrupt. It also helps latency
> not just throughput.

It might even need multiple completions, multiple events, and multiple
interrupt associations for CM mode, along with CPU affinity.

thanks
Shirley Ma

From steffen.persvold at scali.com Thu May 3 15:38:27 2007
From: steffen.persvold at scali.com (Steffen Persvold)
Date: Thu, 3 May 2007 18:38:27 -0400
Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
References: <006f01c78db8$e57d1ff0$c801a8c0@ettac>
Message-ID:

So I don't understand it then... Why are my RPMs only containing one of
the two versions? I'm running on ES and not AS but that shouldn't really
matter...

This output that you list :

[root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0
/usr/lib64/libibverbs.so.1
/usr/lib64/libibverbs.so.1.0.0

Is exactly what I would have expected as well, but my RPM says :

[root at pe1850-1 redhat-release-4ES-5.5]# pwd
/root/OFED-1.2-rc2/RPMS/redhat-release-4ES-5.5
[root at pe1850-1 redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0

I'm looking through the build log (/tmp/OFED.build.xxx.log) and both
versions get compiled, but it looks like the 32bit libraries (which get
compiled last) overwrite the 64bit libraries in the "make install"
section because both end up in /usr/lib :

(64bit section of the build) :

/usr/bin/install -c src/.libs/libibverbs.so.1.0.0 /var/tmp/OFED/usr/lib/libibverbs.so.1.0.0
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so.1 || { rm -f libibverbs.so.1 && ln -s libibverbs.so.1.0.0 libibverbs.so.1; }; })
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so || { rm -f libibverbs.so && ln -s libibverbs.so.1.0.0 libibverbs.so; }; })

(32bit section of the build) :

/usr/bin/install -c src/.libs/libibverbs.so.1.0.0 /var/tmp/OFED/usr/lib/libibverbs.so.1.0.0
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so.1 || { rm -f libibverbs.so.1 && ln -s libibverbs.so.1.0.0 libibverbs.so.1; }; })
(cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so || { rm -f libibverbs.so && ln -s libibverbs.so.1.0.0 libibverbs.so; }; })

So the question is, why is the 64bit section ending up in /usr/lib in
the first place ???

I do see this though :

/bin/rm -f /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache
cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs
Running: env ac_cv_lib_ibverbs_ibv_get_device_list=yes ac_cv_header_infiniband_driver_h=yes ac_cv_func_ibv_read_sysfs_file=yes ac_cv_func_ibv_dontfork_range=yes ac_cv_func_ibv_dofork_range=yes ac_cv_func_ibv_register_driver=yes HAVE_IBV_DEVICE_LIBRARY_EXTENSION_TRUE=yes ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache --disable-libcheck --prefix /usr --libdir /usr/lib --mandir=/usr/man --sysconfdir=/usr/etc CPPFLAGS="-I../libibverbs/include"

--libdir /usr/lib ??? shouldn't that be --libdir /usr/lib64 for the
64bit section ?

Cheers,

Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter

________________________________

From: Chieng Etta [mailto:etta at systemfabricworks.com]
Sent: Thu 5/3/2007 3:26 PM
To: Steffen Persvold; vlad at dev.mellanox.co.il
Cc: openfabrics-ewg at openib.org; openib-general at openib.org
Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64

Hi Steffen,

After removing all the OFED packages by using ./uninstall.sh, I tried
./build.sh to build the RPMs then installed libibverbs-1.1-0.x86_64.rpm
onto the system. "libibverbs.so.1.0.0" was installed under the right
directories (/usr/lib and /usr/lib64). Please see the output below.

Thanks,
Etta

[root at sfw1 etc]# cat /etc/*release
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
[root at sfw1 etc]# uname -a
Linux sfw1.sfw.int 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
[root at sfw1 lib64]# pwd
/usr/lib64
[root at sfw1 lib64]# ll libibverbs*
ls: libibverbs*: No such file or directory
[root at sfw1 lib64]# rpm -aq |grep libibverbs
[root at sfw1 lib64]# cd -
/root/images/OFED-1.2-rc2/RPMS/redhat-release-4AS-5.5
[root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
/etc/ld.so.conf.d/ofed.conf
/usr/lib/libibverbs.so.1
/usr/lib/libibverbs.so.1.0.0
/usr/lib64/libibverbs.so.1
/usr/lib64/libibverbs.so.1.0.0
[root at sfw1 redhat-release-4AS-5.5]# rpm -ivh libibverbs-1.1-0.x86_64.rpm
Preparing... ########################################### [100%]
1:libibverbs ########################################### [100%]
[root at sfw1 redhat-release-4AS-5.5]# rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
x86_64
[root at sfw1 redhat-release-4AS-5.5]# cd -
/usr/lib64
[root at sfw1 lib64]# rpm -aq |grep libibverbs
libibverbs-1.1-0
[root at sfw1 lib64]# ll libibverbs*
lrwxrwxrwx 1 root root 19 May 3 13:50 libibverbs.so.1 -> libibverbs.so.1.0.0
-rwxr-xr-x 1 root root 200993 May 3 13:18 libibverbs.so.1.0.0
[root at sfw1 lib64]# file libibverbs.so.1.0.0
libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped
[root at sfw1 lib]# cd /usr/lib
[root at sfw1 lib]# file libibverbs.so.1.0.0
libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), not stripped
[root at sfw1 etc]# cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
/usr/ofed/lib64
[root at sfw1 etc]# cat /etc/ld.so.conf.d/ofed.conf
/usr/lib64
/usr/lib

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Steffen Persvold
Sent: Thursday, May 03, 2007 10:26 AM
To: vlad at dev.mellanox.co.il
Cc: openfabrics-ewg at openib.org; openib-general at openib.org
Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64

Vladimir,

Nope. Still the same issue. The RPMs will only contain one set of
libraries and it is always in /usr/lib (if I set the build_32bit=0
option I get the 64bit libraries but in the wrong directory).

Seriously, am I the only one seeing this ? I would think rhel4 u4 was a
very normal test platform ?

Cheers,
Steffen Persvold
Technical Director Americas
tel. 508-281-7100 x401
fax. 508-281-7171

http://www.scali.com/
Scaling the Linux datacenter

> -----Original Message-----
> From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il]
> Sent: Thursday, May 03, 2007 9:07 AM
> To: Steffen Persvold
> Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
>
> Please see if this happens in OFED-1.2-20070503-0600.
> But first uninstall the previous OFED version with ofed_uninstall.sh
> command.
>
> Thanks,
>
> Regards,
> Vladimir
>
> On Wed, 2007-05-02 at 11:30 -0400, Steffen Persvold wrote:
> > Hmm,
> >
> > so I tried something. I put :
> >
> > build_32bit=0
> >
> > into my ofed.conf file and rebuilt (build.sh -c ofed.conf). This time
> > it built 64bit libraries, but it puts them in the wrong directory :
> >
> > # rpm -qpl ../libibverbs-1.1-0.x86_64.rpm
> > /etc/ld.so.conf.d/ofed.conf
> > /usr/lib/libibverbs.so.1
> > /usr/lib/libibverbs.so.1.0.0
> >
> > # file /usr/lib/libibverbs.so.1.0.0
> > /usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD
> > x86-64, version 1 (SYSV), not stripped
> >
> > So what's up ??
> >
> > Cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> > ______________________________________________________________________
> > From: Steffen Persvold
> > Sent: Wed 5/2/2007 10:30 AM
> > To: Steffen Persvold; Vladimir Sokolovsky
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Also,
> >
> > If I look at the /etc/ld.so.conf/ofed.conf file I have :
> >
> > # cat ofed.conf
> > /usr/lib
> > /usr/lib
> >
> >
> > which seems kinda weird ? :)
> >
> > Cheers,
> >
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> > ______________________________________________________________________
> > From: ewg-bounces at lists.openfabrics.org on behalf of Steffen Persvold
> > Sent: Wed 5/2/2007 10:20 AM
> > To: Vladimir Sokolovsky
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Nope :
> >
> >
> > [redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm
> > /etc/ld.so.conf.d/ofed.conf
> > /usr/lib/libibverbs.so.1
> > /usr/lib/libibverbs.so.1.0.0
> > [redhat-release-4ES-5.5]#
> >
> > So the RPM got built, but without 64bit libraries. Now if it was the
> > other way around (i.e no 32bit libraries) I could have understood it
> > (as 32bit is an option on x86_64), but not having the native 64bit
> > libraries is not so easy to understand :)
> >
> > cheers,
> > Steffen Persvold
> > Technical Director Americas
> > tel. 508-281-7100 x401
> > fax. 508-281-7171
> >
> > http://www.scali.com/
> > Scaling the Linux datacenter
> >
> >
> > ______________________________________________________________________
> > From: ewg-bounces at lists.openfabrics.org on behalf of Vladimir
> > Sokolovsky
> > Sent: Wed 5/2/2007 10:05 AM
> > To: Steffen Persvold
> > Cc: openfabrics-ewg at openib.org; openib-general at openib.org
> > Subject: Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64
> >
> >
> > Don't you have /usr/lib64/libibverbs.so.1.0.0?
> >
> > Regards,
> > Vladimir
> >
> > On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote:
> > > Folks,
> > >
> > > I used the build.sh script to build the above mentioned packages on
> > > rhel4u4 x86_64, but for some reason it only compiles 32bit libraries
> > > (even if the packages are named x86_64) :
> > >
> > > # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm
> > > x86_64
> > >
> > > (after installing it) :
> > >
> > > # file /usr/lib/libibverbs.so.1.0.0
> > > /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel
> > > 80386, version 1 (SYSV), not stripped
> > >
> > > What did I do wrong ??
> > >
> > > Cheers,
> > > Steffen Persvold
> > > Technical Director Americas
> > > tel. 508-281-7100 x401
> > > fax. 508-281-7171
> > >
> > > http://www.scali.com/
> > > Scaling the Linux datacenter
> > > _______________________________________________
> > > ewg mailing list
> > > ewg at lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> >
> > _______________________________________________
> > ewg mailing list
> > ewg at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> >
> >
>
_______________________________________________
ewg mailing list
ewg at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

From lawver1 at llnl.gov Thu May 3 16:37:46 2007
From: lawver1 at llnl.gov (Bryan Lawver)
Date: Thu, 03 May 2007 16:37:46 -0700
Subject: [ofa-general] Re: IPoIB forwarding
In-Reply-To: <4637B9A4.2050103@myri.com>
References: <6.1.2.0.2.20070423160212.12db6400@mail.llnl.gov>
	<20070425124652.GG1624@mellanox.co.il>
	<6.1.2.0.2.20070426083410.1389d9e0@mail.llnl.gov>
	<20070426161409.GF15540@mellanox.co.il>
	<6.1.2.0.2.20070426095112.138e9a68@mail.llnl.gov>
	<20070426180618.GJ15540@mellanox.co.il>
	<6.1.2.0.2.20070427115435.13ea5ec0@mail.llnl.gov>
	<46325DF3.2050203@hp.com>
	<6.1.2.0.2.20070427152027.13fe46d0@mail.llnl.gov>
	<46327A07.1000404@hp.com>
	<6.1.2.0.2.20070427153952.13fc7d08@mail.llnl.gov>
	<4632894D.40705@hp.com>
	<20070428025117.a3b1200a.billfink@mindspring.com>
	<4634F49F.9030408@myri.com>
	<46365BD4.5060607@hp.com>
	<4637B9A4.2050103@myri.com>
Message-ID: <6.1.2.0.2.20070503163428.15b51750@mail.llnl.gov>

I have been able to install and use the 1.3.0 myricom driver and
everything works as I expected and performance is pretty decent.
Interesting little side tour through various drivers... The router node
sees almost no load which is really encouraging.

Thanks,
bryan

At 03:05 PM 5/1/2007, Loic Prylli wrote:
>On 4/30/2007 2:12 PM, Rick Jones wrote:
>>
>>Speaking of defaults, it would seem that the external 1.2.0 driver comes
>>with 9000 bytes as the default MTU? At least I think that is what I am
>>seeing now that I've started looking more closely.
>>
>>rick jones
>
>
>That's the same for the in-kernel-tree code (9K MTU by default). Assuming
>this is not wanted, I will submit a patch for that.
>
>
>Loic

From pradeep at us.ibm.com Thu May 3 17:12:42 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Thu, 3 May 2007 17:12:42 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To:
Message-ID:

Thanks for the review. Some of them MST had already pointed out. I will
respond to the additional ones that you make.

Pradeep
pradeep at us.ibm.com

Roland Dreier wrote on 05/03/2007 12:01:03 PM:

> > +#define IPOIB_CM_OP_NOSRQ (1ul << 29)
>
> I don't understand the point of this... the only places you do anything
> with it are:
>
> > + priv->cm.rx_wr.wr_id = wr_id << 32 | index | IPOIB_CM_OP_NOSRQ;
>
> > + index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ) & NOSRQ_INDEX_MASK ;
>
> > + if ((wc->wr_id & IPOIB_CM_OP_SRQ) || (wc->wr_id &
> IPOIB_CM_OP_NOSRQ))
>
> so probably the most sensible thing to do is just to rename
> IPOIB_CM_OP_SRQ to IPOIB_OP_CM_RECV.

Agreed.

>
> > +/* These two go hand in hand */
> > +#define NOSRQ_INDEX_RING_SIZE 1024
> > +#define NOSRQ_INDEX_MASK 0x00000000000003ff
>
> Rather than having a comment, I would just do
>
> #define NOSRQ_INDEX_RING_SIZE 1024
> #define NOSRQ_INDEX_MASK (NOSRQ_INDEX_RING_SIZE - 1)
>
> also I think the RING name is wrong -- it's not a ring, it's a table,
> right? I don't like having a static limit on the number of nosrq
> connections; could this be a hash table instead?
>

I will just call this an array. Nosrq will hog memory and my thought was
that 1024 was pretty large. I envisioned using nosrq for a small number
(maybe a few dozen), and so did not think it was necessary to make this
a module parameter either. What do you suggest?

>
> > - rep.srq = 1;
>
> > + if (priv->cm.srq)
> > + rep.srq = 1;
> > + else
> > + rep.srq = 0;
>
> similarly I would rather see "rep.srq = !!priv->cm.srq"

ok

>
> > + /* Allocate space for the rx_ring here */
> > + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring,
> > + GFP_KERNEL);
>
>
> > + printk(KERN_WARNING "NOSRQ supports a max of %d RC "
> > + "QPs. That limit has now been reached\n",
> > + NOSRQ_INDEX_RING_SIZE);
>
> ipoib_warn() instead of printk? Also isn't this going to flood logs
> if the remote side keeps trying to connect?

As you describe, the remote side will continue to attempt connecting and
will fail. That is a pretty serious scenario.
Hence I leaned towards flooding the logs rather than losing this among a
ton of other messages; there may be application-specific messages too.
I can change this to ipoib_warn().

>
> > + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
> > + if (ret) {
> > + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n",
> > ret);
> > + goto err_modify_nosrq;
> > + }
>
> It's good to goto to unwind code, but in this case you just have a
> return at err_modify_nosrq -- why not just return directly? However
> you seem to leak rx_ring here, so it would be better to use the unwind
> code more consistently instead of using return later.

Yes, there is a leak here -will fix that.

kfree(p->rx_ring);
> > + return -ENOMEM;
> > + }
> > +
> > + /* Can we call the nosrq version? */
> > + if (ipoib_cm_post_receive(dev, i << 32 | index)) {
> > + ipoib_warn(priv, "ipoib_ib_post_receive "
> > + "failed for buf %d\n", i);
> > + ipoib_cm_dev_cleanup(dev);
>
> seems like you're missing the call to kfree(p->rx_ring) here?
> this code could probably benefit from a goto to unwind code.

Yes, there is a leak here too -will fix.

From rdreier at cisco.com Thu May 3 17:16:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 03 May 2007 17:16:07 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To: (Pradeep Satyanarayana's message of "Thu, 3 May 2007 17:12:42 -0700")
References:
Message-ID:

 > > also I think the RING name is wrong -- it's not a ring, it's a table,
 > > right? I don't like having a static limit on the number of nosrq
 > > connections; could this be a hash table instead?
 > >
 >
 > I will just call this an array. Nosrq will hog memory and my thought was
 > that 1024 was pretty large. I envisioned using nosrq for a small number
 > (maybe a few dozen), and so did not think it was necessary to make this
 > a module parameter either. What do you suggest?

Maybe make it a hash table of size 32 or 64 or something like that.
You use less memory in the expected case, and degrade fairly gracefully
when things get bigger. If you want to get really fancy, make it a hash
table that you grow when it gets too full.

I agree that we don't want yet another module parameter that has to be
tuned here.

 - R.

From pradeep at us.ibm.com Thu May 3 21:49:40 2007
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Thu, 3 May 2007 21:49:40 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ)[PATCH V3] patch for review
In-Reply-To:
Message-ID:

Roland Dreier wrote on 05/03/2007 05:16:07 PM:

> > > also I think the RING name is wrong -- it's not a ring, it's a table,
> > > right? I don't like having a static limit on the number of nosrq
> > > connections; could this be a hash table instead?
> > >
> >
> > I will just call this an array. Nosrq will hog memory and my thought was
> > that 1024 was pretty large. I envisioned using nosrq for a small number
> > (maybe a few dozen), and so did not think it was necessary to make this
> > a module parameter either. What do you suggest?
>
> Maybe make it a hash table of size 32 or 64 or something like that.
> You use less memory in the expected case, and degrade fairly
> gracefully when things get bigger. If you want to get really fancy,
> make it a hash table that you grow when it gets too full.
>

The only time we do a search in the array is to find an empty slot to
store a pointer to ipoib_cm_rx. This happens upon receipt of a REQ.
There are no other lookups that we perform. On the receipt of a packet
we have the index encoded in wr_id, and so use that to retrieve the
ipoib_cm_rx pointer. We don't need a hash table for this. Could use a
head and tail pointer to reduce the search.
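
The bookkeeping described above (a fixed table that is searched only
when a REQ arrives, with the table index carried in the low bits of
wr_id) can be sketched in plain userspace C. This is only an
illustration under stated assumptions: rx_ctx, rx_table,
rx_table_insert, make_wr_id, rx_from_wr_id and buf_from_wr_id are
hypothetical names and not code from the patch, rx_ctx stands in for
struct ipoib_cm_rx, the table-size macro is renamed for the sketch, and
the operation flag bits (IPOIB_CM_OP_SRQ and friends) are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rename of NOSRQ_INDEX_RING_SIZE; the mask is derived
 * from the size, as suggested in the review, so the two cannot drift. */
#define NOSRQ_INDEX_TABLE_SIZE 1024u
#define NOSRQ_INDEX_MASK       (NOSRQ_INDEX_TABLE_SIZE - 1u)

/* Stand-in for struct ipoib_cm_rx (assumption for this sketch). */
struct rx_ctx {
	int id;
};

struct rx_table {
	struct rx_ctx *slot[NOSRQ_INDEX_TABLE_SIZE];
};

/* Linear search for a free slot; done once per incoming REQ.
 * Returns the slot index, or -1 when the table is full. */
static inline int rx_table_insert(struct rx_table *t, struct rx_ctx *rx)
{
	size_t i;

	for (i = 0; i < NOSRQ_INDEX_TABLE_SIZE; ++i) {
		if (t->slot[i] == NULL) {
			t->slot[i] = rx;
			return (int)i;
		}
	}
	return -1;
}

/* Receive-buffer number in the upper 32 bits, table index in the low
 * bits. The cast makes the shift happen in 64-bit arithmetic. */
static inline uint64_t make_wr_id(uint32_t buf, uint32_t index)
{
	return ((uint64_t)buf << 32) | (index & NOSRQ_INDEX_MASK);
}

/* Per-completion lookup: no search, just mask and index. */
static inline struct rx_ctx *rx_from_wr_id(const struct rx_table *t,
					    uint64_t wr_id)
{
	return t->slot[wr_id & NOSRQ_INDEX_MASK];
}

static inline uint32_t buf_from_wr_id(uint64_t wr_id)
{
	return (uint32_t)(wr_id >> 32);
}
```

In this sketch the only O(n) step is rx_table_insert(); every receive
completion goes through rx_from_wr_id(), which is a constant-time array
access. A free list or head/tail cursor, as mentioned above, would
shorten the insert path without changing the lookup.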
Pradeep pradeep at us.ibm.com From k_mahesh85 at yahoo.co.in Thu May 3 22:04:53 2007 From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh) Date: Fri, 4 May 2007 06:04:53 +0100 (BST) Subject: [ofa-general] [query] SMI nodeinfo, port_info structures In-Reply-To: <1178211580.32222.3481.camel@hal.voltaire.com> Message-ID: <984925.59702.qm@web8327.mail.in.yahoo.com> >SMI has nothing to do with those SM attributes. Yes, you are right. SMI has nothing to do with them right now. But if some other hardware vendor wants to implement the SMA in the host software (like ipath) in future, he again needs to implement those structures (nodeinfo and port_info, attributes of SM) in his driver. We can avoid this situation (duplicate declarations of the same structure) by declaring the above-mentioned structures in the core layer. >What structures (in what file(s)) are you referring to ? Here I am referring to the structures for some SM attributes like nodeinfo and port_info which are currently declared in the ipath driver. Some fields in those structures have big endian (__bexx) alignment and others have CPU (uxx) alignment. e.g.: in struct port_info declared in the ipath driver (ipath_mad.c), the mkey is declared as __be64 mkey whereas the local port number is declared as u8 local_port_num. -Mahesh 
From rdreier at cisco.com Thu May 3 22:12:40 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 03 May 2007 22:12:40 -0700 Subject: [ofa-general] [query] SMI nodeinfo, port_info structures In-Reply-To: <984925.59702.qm@web8327.mail.in.yahoo.com> (Keshetti Mahesh's message of "Fri, 4 May 2007 06:04:53 +0100 (BST)") References: <984925.59702.qm@web8327.mail.in.yahoo.com> Message-ID: [please try to keep your line lengths below 80 columns] > we can avoid this situation (duplicate declarations of same > structure) by declaring the above mentioned structures in the core > layer. I think a patch moving structures defined by the IB spec to a more appropriate location would be fine. > Some fields in those structures have big endian (__bexx) alignment > and others have CPU (uxx) alignment. > > e.g: in struct port_info declared in ipath driver (ipath_mad.c), > the mkey is declared as __be64 mkey whereas the local port number > is declared as u8 local_port_num. Think about it... what could endianness mean for a single-byte field? From k_mahesh85 at yahoo.co.in Thu May 3 22:22:16 2007 From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh) Date: Fri, 4 May 2007 06:22:16 +0100 (BST) Subject: [ofa-general] [query] SMI nodeinfo, port_info structures In-Reply-To: Message-ID: <496423.96231.qm@web8314.mail.in.yahoo.com> > [please try to keep your line lengths below 80 columns] sure.. >I think a patch moving structures defined by the IB spec to a more >appropriate location would be fine. Isn't include/rdma/ib_smi.h an appropriate location? >Think about it... what could endianness mean for a single-byte field? Yes.. got the point. -Mahesh 
From HNGUYEN at de.ibm.com Thu May 3 23:18:56 2007 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Fri, 4 May 2007 08:18:56 +0200 Subject: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector support in core In-Reply-To: <20070503104847.GD10009@mellanox.co.il> Message-ID: Hello Michael and Roland! How about a new verb delivering number of cqs associated with a comp_vector like this /** * Returns number of cqs assigned to comp_vector * @return < 0 in error case eg invalid comp_vector */ int ib_query_comp_vector(struct ib_device *dev, int comp_vector); A consumer or ULP would be able to pick an "empty" comp_vector. Surely that does not prevent that a certain comp_vector resp. IRQ can be "overloaded", and that's another topic. Thanks Nam PS: I'm waiting for your comments first, a patch will come later. From mst at dev.mellanox.co.il Thu May 3 23:25:26 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 May 2007 09:25:26 +0300 Subject: [ofa-general] Re: [PATCH 0 of 3] comp_vector kernel support In-Reply-To: References: <20070503104806.GC10009@mellanox.co.il> Message-ID: <20070504062526.GB4829@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH 0 of 3] comp_vector kernel support > > > 1. extends ib_create_cq to pass in comp_vector parameter > > 2. Update all ULP/providers > > 3. mthca is enhanced to support multiple vectors if MSI-X is enabled on SMP > > 4. Other providers report support for a single completion vector > > 5. uverbs and IPoIB CM are enhanced to use multiple vectors if available > > > Please consider for 2.6.22. > > This is good work, but given that this has just appeared halfway > through the 2.6.22 merge window I don't think we should just merge it > just now. Rather, let's definitely get it into 2.6.23. How about just patches 1 and 2? They don't do anything to *kernel* ULPs by themselves, and give userspace ULPs opportunity to start using the feature. 
We'll learn from that, and enhance kernel ULPs by 2.6.23. -- MST From mst at dev.mellanox.co.il Thu May 3 23:29:46 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 May 2007 09:29:46 +0300 Subject: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector support in core In-Reply-To: References: <20070503104847.GD10009@mellanox.co.il> Message-ID: <20070504062946.GC4829@mellanox.co.il> > Quoting Hoang-Nam Nguyen : > Subject: Re: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector support in core > > Hello Michael and Roland! > How about a new verb delivering number of cqs associated with a > comp_vector like this > > /** > * Returns number of cqs assigned to comp_vector > * @return < 0 in error case eg invalid comp_vector > */ > int ib_query_comp_vector(struct ib_device *dev, int comp_vector); > > A consumer or ULP would be able to pick an "empty" comp_vector. > Surely that does not prevent that a certain comp_vector resp. IRQ > can be "overloaded", and that's another topic. > Thanks > Nam > PS: I'm waiting for your comments first, a patch will come later. I'm not convinced it's an interesting metric. A CQ which has multiple QPs assigned to it might get more traffic than a CQ which only has a single QP. My gut feeling would be that once ULPs learn to use multiple vectors, each ULPs will spread across them evenly, without help from provider. -- MST From mst at dev.mellanox.co.il Fri May 4 00:25:25 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 May 2007 10:25:25 +0300 Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian In-Reply-To: <463A22F3.4090108@hp.com> References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il> <463A1C9A.6060706@hp.com> <20070503174817.GC9719@mellanox.co.il> <463A22F3.4090108@hp.com> Message-ID: <20070504072525.GE4829@mellanox.co.il> > Quoting Rick Jones : > Subject: Re: OFED-1.2-20070502-0600 on Debian > > Michael S. 
Tsirkin wrote: > >>make[1]: Entering directory `/root/linux-2.6.21.1' > >>test -e include/linux/autoconf.h -a -e include/config/auto.conf || ( > >>\ > >> echo; \ > >> echo " ERROR: Kernel configuration is invalid."; \ > >> echo " include/linux/autoconf.h or > >> include/config/auto.conf are missing."; \ > >> echo " Run 'make oldconfig && make prepare' on kernel src > >> to fix it."; \ > > > > > >This is kernel's message, not our's - is this the source you built kernel > >from? > >If you go into /root/linux-2.6.21.1 as root and do make modules, > >does it succeed? > > yes. some warnings at the beginning about some modules and section > mismatches but it seems to complete. Okay ... so, do you see include/linux/autoconf.h there? -- MST From mst at dev.mellanox.co.il Fri May 4 00:32:15 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 May 2007 10:32:15 +0300 Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian In-Reply-To: <463A22F3.4090108@hp.com> References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il> <463A1C9A.6060706@hp.com> <20070503174817.GC9719@mellanox.co.il> <463A22F3.4090108@hp.com> Message-ID: <20070504073215.GG4829@mellanox.co.il> > Quoting Rick Jones : > Subject: Re: OFED-1.2-20070502-0600 on Debian > > Michael S. Tsirkin wrote: > >>make[1]: Entering directory `/root/linux-2.6.21.1' > >>test -e include/linux/autoconf.h -a -e include/config/auto.conf || ( > >>\ > >> echo; \ > >> echo " ERROR: Kernel configuration is invalid."; \ > >> echo " include/linux/autoconf.h or > >> include/config/auto.conf are missing."; \ > >> echo " Run 'make oldconfig && make prepare' on kernel src > >> to fix it."; \ > > > > > >This is kernel's message, not our's - is this the source you built kernel > >from? > >If you go into /root/linux-2.6.21.1 as root and do make modules, > >does it succeed? > > yes. some warnings at the beginning about some modules and section > mismatches but it seems to complete. 
I just tried this on my ubuntu laptop with the same result. We'll work on fixing this by Monday. -- MST From mst at dev.mellanox.co.il Fri May 4 00:42:41 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 May 2007 10:42:41 +0300 Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian In-Reply-To: <463A1C9A.6060706@hp.com> References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il> <463A1C9A.6060706@hp.com> Message-ID: <20070504074241.GH4829@mellanox.co.il> > >There *is* another way which should be enough to test IPoIB: > >try getting a kernel tarball from > >http://git.openfabrics.org/~vlad/builds/ > > > >If you unpack this, you can configure/make/make install. > > > >Installer will backup your original modules under the prefix. > >Keep the source around and you'll be able to make uninstall > >to get back to original system. > > > >Note 1: default configure settings are often not what you want: > >run ./configure --help first of all to see which modules to select > >(--with-ipoib-mod and --with-mthca-mod I think) and to set a prefix. > >Note 2: having quilt tool installed is recommended - will let you > >add/remove patches later. > >Note 3: this way you get no userspace. openfabrics tarballs > >are under the same directory, and a similiar method works there. > >external tarballs (MPI, bonding, etc) are supplied to us in SRPM > >format so this trick does not work for them. > > Seems I found little joy there too, probably my own fault. The environment > is a Debian 4.0. The kernel is called: > > hpcpc107:~/ofa_1_2_kernel-20070502-0200# uname -a > Linux hpcpc107 2.6.21.1-raj #1 SMP Tue May 1 14:11:27 PDT 2007 ia64 > GNU/Linux > > > The sources to which are at: > > /root/linux-2.6.21.1 > > My configure line was: > > ./configure --with-ipoib-mod --with-mthca-mod --with-sdp-mod > --prefix=/root/save You must add --with-core-mod and it starts to rip. 
Vlad, I think configure should be smart enough to know that selecting any modules should enable core, too. OK? However, I noticed that 2.6.21 isn't supported yet (build actually fails). I'll try to add support by Monday, for now latest supported kernel is 2.6.20.y. > I didn't save the first set of configure output :( Subsequent configures > give: Yes, it's a known limitation currently. There are 3 possible work-arounds: - remove the build directory and re-open it - run >quilt pop -a >rm -fr patches .pc before second configure run - add --without-patch to second configure run (only works if you did not change the kernel version to build for) -- MST From mst at dev.mellanox.co.il Fri May 4 00:54:59 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 May 2007 10:54:59 +0300 Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian In-Reply-To: <463A22F3.4090108@hp.com> References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il> <463A1C9A.6060706@hp.com> <20070503174817.GC9719@mellanox.co.il> <463A22F3.4090108@hp.com> Message-ID: <20070504075459.GI4829@mellanox.co.il> OK. Apply these 2 patches after configure: -- MST -------------- next part -------------- commit ecbb416939da77c0d107409976499724baddce7b Author: Alexey Kuznetsov Date: Sat Mar 24 12:52:16 2007 -0700 [NET]: Fix neighbour destructor handling. ->neigh_destructor() is killed (not used), replaced with ->neigh_cleanup(), which is called when neighbor entry goes to dead state. At this point everything is still valid: neigh->dev, neigh->parms etc. The device should guarantee that dead neighbor entries (neigh->dead != 0) do not get private part initialized, otherwise nobody will cleanup it. I think this is enough for ipoib which is the only user of this thing. Initialization private part of neighbor entries happens in ipib start_xmit routine, which is not reached when device is down. But it would be better to add explicit test for neigh->dead in any case. 
Signed-off-by: David S. Miller diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 0741c6d..f2a40ae 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -814,7 +814,7 @@ static void ipoib_set_mcast_list(struct net_device *dev) queue_work(ipoib_workqueue, &priv->restart_task); } -static void ipoib_neigh_destructor(struct neighbour *n) +static void ipoib_neigh_cleanup(struct neighbour *n) { struct ipoib_neigh *neigh; struct ipoib_dev_priv *priv = netdev_priv(n->dev); @@ -822,7 +822,7 @@ static void ipoib_neigh_destructor(struct neighbour *n) struct ipoib_ah *ah = NULL; ipoib_dbg(priv, - "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + "neigh_cleanup for %06x " IPOIB_GID_FMT "\n", IPOIB_QPN(n->ha), IPOIB_GID_RAW_ARG(n->ha + 4)); @@ -874,7 +874,7 @@ void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh) static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { - parms->neigh_destructor = ipoib_neigh_destructor; + parms->neigh_cleanup = ipoib_neigh_cleanup; return 0; } -------------- next part -------------- commit 43cb76d91ee85f579a69d42bc8efc08bac560278 Author: Greg Kroah-Hartman Date: Tue Apr 9 12:14:34 2002 -0700 Network: convert network devices to use struct device instead of class_device This lets the network core have the ability to handle suspend/resume issues, if it wants to. Thanks to Frederik Deweerdt for the arm driver fixes. 
Signed-off-by: Greg Kroah-Hartman diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 705eb1d..af5ee2e 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -958,16 +958,17 @@ struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) return netdev_priv(dev); } -static ssize_t show_pkey(struct class_device *cdev, char *buf) +static ssize_t show_pkey(struct device *dev, + struct device_attribute *attr, char *buf) { - struct ipoib_dev_priv *priv = - netdev_priv(container_of(cdev, struct net_device, class_dev)); + struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev)); return sprintf(buf, "0x%04x\n", priv->pkey); } -static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); +static DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); -static ssize_t create_child(struct class_device *cdev, +static ssize_t create_child(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) { int pkey; @@ -985,14 +986,14 @@ static ssize_t create_child(struct class_device *cdev, */ pkey |= 0x8000; - ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), - pkey); + ret = ipoib_vlan_add(to_net_dev(dev), pkey); return ret ? ret : count; } -static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); +static DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); -static ssize_t delete_child(struct class_device *cdev, +static ssize_t delete_child(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) { int pkey; @@ -1004,18 +1005,16 @@ static ssize_t delete_child(struct class_device *cdev, if (pkey < 0 || pkey > 0xffff) return -EINVAL; - ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), - pkey); + ret = ipoib_vlan_delete(to_net_dev(dev), pkey); return ret ? 
ret : count; } -static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); +static DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); int ipoib_add_pkey_attr(struct net_device *dev) { - return class_device_create_file(&dev->class_dev, - &class_device_attr_pkey); + return device_create_file(&dev->dev, &dev_attr_pkey); } static struct net_device *ipoib_add_port(const char *format, @@ -1083,11 +1082,9 @@ static struct net_device *ipoib_add_port(const char *format, if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; - if (class_device_create_file(&priv->dev->class_dev, - &class_device_attr_create_child)) + if (device_create_file(&priv->dev->dev, &dev_attr_create_child)) goto sysfs_failed; - if (class_device_create_file(&priv->dev->class_dev, - &class_device_attr_delete_child)) + if (device_create_file(&priv->dev->dev, &dev_attr_delete_child)) goto sysfs_failed; return priv->dev; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c index f887780..085eafe 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c @@ -42,15 +42,15 @@ #include "ipoib.h" -static ssize_t show_parent(struct class_device *class_dev, char *buf) +static ssize_t show_parent(struct device *d, struct device_attribute *attr, + char *buf) { - struct net_device *dev = - container_of(class_dev, struct net_device, class_dev); + struct net_device *dev = to_net_dev(d); struct ipoib_dev_priv *priv = netdev_priv(dev); return sprintf(buf, "%s\n", priv->parent->name); } -static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); +static DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) { @@ -118,8 +118,7 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; - if (class_device_create_file(&priv->dev->class_dev, - &class_device_attr_parent)) + if 
(device_create_file(&priv->dev->dev, &dev_attr_parent)) goto sysfs_failed; list_add_tail(&priv->list, &ppriv->child_intfs); From HNGUYEN at de.ibm.com Fri May 4 02:09:06 2007 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Fri, 4 May 2007 11:09:06 +0200 Subject: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector support in core In-Reply-To: <20070504062946.GC4829@mellanox.co.il> Message-ID: > > Hello Michael and Roland! > > How about a new verb delivering number of cqs associated with a > > comp_vector like this > > > > /** > > * Returns number of cqs assigned to comp_vector > > * @return < 0 in error case eg invalid comp_vector > > */ > > int ib_query_comp_vector(struct ib_device *dev, int comp_vector); > > > > A consumer or ULP would be able to pick an "empty" comp_vector. > > Surely that does not prevent that a certain comp_vector resp. IRQ > > can be "overloaded", and that's another topic. > > Thanks > > Nam > > PS: I'm waiting for your comments first, a patch will come later. > > I'm not convinced it's an interesting metric. > A CQ which has multiple QPs assigned to it might get more traffic > than a CQ which only has a single QP. That's true for association between CQ and QPs. > My gut feeling would be that once ULPs learn to use multiple vectors, > each ULPs will spread across them evenly, without help from provider. Enabling multiple vectors is to be done by a provider. ULP can utilize it by checking num_comp_vector. Above simple metric could be provided in ib_core. Anyway, as Shirley stated in her email, using comp_vector per port will help a lot, at least on ppc64 and with ehca - for other HCAs I haven't benchmarked. And that metric allows ULP to implement such one approach. Nam From mst at dev.mellanox.co.il Fri May 4 02:13:06 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 4 May 2007 12:13:06 +0300 Subject: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector support in core In-Reply-To: References: <20070504062946.GC4829@mellanox.co.il> Message-ID: <20070504091306.GJ4829@mellanox.co.il> > using comp_vector > per port will help a lot, at least on ppc64 and with ehca - for other > HCAs I haven't benchmarked. And that metric allows ULP to implement > such one approach. Looks like a bit of overdesign. I think you can just set comp_vector = port_num - 1 % num_comp_vectors without any special metrics -- MST From vlad at lists.openfabrics.org Fri May 4 02:37:28 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 4 May 2007 02:37:28 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070504-0200 daily build status Message-ID: <20070504093728.C6CFDE60979@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on x86_64 with 
linux-2.6.14 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From HNGUYEN at de.ibm.com Fri May 4 02:38:27 2007 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Fri, 4 May 2007 11:38:27 +0200 Subject: [ofa-general] [PATCH 1 of 3] IB/verbs: add cq comp_vector support in core In-Reply-To: <20070504091306.GJ4829@mellanox.co.il> Message-ID: "Michael S. Tsirkin" wrote on 04.05.2007 11:13:06: > > using comp_vector > > per port will help a lot, at least on ppc64 and with ehca - for other > > HCAs I haven't benchmarked. And that metric allows ULP to implement > > such one approach. > > Looks like a bit of overdesign. I think you can just set > comp_vector = port_num - 1 % num_comp_vectors > without any special metrics Right, looks much simpler for that per-port-purpose. 
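A small sketch of the per-port mapping discussed above, written out in C. The helper name port_to_comp_vector is hypothetical and not part of any verbs API; the sketch assumes IB port numbers are 1-based and that the number of completion vectors is known to the caller. Note that if the formula is typed literally as port_num - 1 % num_comp_vectors, C evaluates the % first, so the subtraction has to be parenthesized to get the intended wrap-around.

```c
#include <stdint.h>

/*
 * Hypothetical helper: map a 1-based port number onto a completion
 * vector index in the range [0, num_comp_vectors).
 */
static int port_to_comp_vector(uint8_t port_num, int num_comp_vectors)
{
	/* Fall back to vector 0 on nonsensical input. */
	if (port_num == 0 || num_comp_vectors <= 0)
		return 0;

	/* Parentheses matter: '%' binds tighter than '-'. */
	return (port_num - 1) % num_comp_vectors;
}
```

With two vectors, ports 1 and 2 land on vectors 0 and 1, and a third port wraps back to vector 0; with a single vector every port maps to 0.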
Nam From halr at voltaire.com Fri May 4 03:39:04 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 04 May 2007 06:39:04 -0400 Subject: [ofa-general] [query] SMI nodeinfo, port_info structures In-Reply-To: References: <984925.59702.qm@web8327.mail.in.yahoo.com> Message-ID: <1178275139.32222.70093.camel@hal.voltaire.com> On Fri, 2007-05-04 at 01:12, Roland Dreier wrote: > [please try to keep your line lengths below 80 columns] > > > we can avoid this situation (duplicate declarations of same > > structure) by declaring the above mentioned structures in the core > > layer. > > I think a patch moving structures defined by the IB spec to a more > appropriate location would be fine. Sure; currently ipath is the only one which needed these for its soft SMA so there was no push to do this. -- Hal > > Some fields in those structures have big endian (__bexx) alignment > > and others have CPU (uxx) alignment. > > > > e.g: in struct port_info declared in ipath driver (ipath_mad.c), > > the mkey is declared as __be64 mkey whereas the local port number > > is declared as u8 local_port_num. > > Think about it... what could endianness mean for a single-byte field? From halr at voltaire.com Fri May 4 03:40:38 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 04 May 2007 06:40:38 -0400 Subject: [ofa-general] [query] SMI nodeinfo, port_info structures In-Reply-To: <496423.96231.qm@web8314.mail.in.yahoo.com> References: <496423.96231.qm@web8314.mail.in.yahoo.com> Message-ID: <1178275237.32222.70180.camel@hal.voltaire.com> On Fri, 2007-05-04 at 01:22, Keshetti Mahesh wrote: > > [please try to keep your line lengths below 80 columns] > > sure.. > > >I think a patch moving structures defined by the IB spec to a more > >appropriate location would be fine. > > Isn't the include/rdma/ib_smi.h is an appropriate location? Not really, as this is for SMI, which is lower in the architecture than SM class attributes. 
Maybe ib_mad.h or some new header files (ib_sma.h and ib_pma.h) are more appropriate. What do others think? -- Hal > >Think about it... what could endianness mean for a single-byte field? > > Yes.. got the point. > > -Mahesh From mst at dev.mellanox.co.il Fri May 4 03:57:18 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 May 2007 13:57:18 +0300 Subject: [ofa-general] Re: [query] SMI nodeinfo, port_info structures In-Reply-To: <984925.59702.qm@web8327.mail.in.yahoo.com> References: <1178211580.32222.3481.camel@hal.voltaire.com> <984925.59702.qm@web8327.mail.in.yahoo.com> Message-ID: <20070504105718.GL4829@mellanox.co.il> > Quoting Keshetti Mahesh : > Subject: Re: [query] SMI nodeinfo, port_info structures > > >SMI has nothing to do with those SM attributes. > > Yes you are right. SMI has nothing to do with them right now. But if some other > hardware vendor wants to implement the SMA in the host software (like ipath) in > future he again needs to implement those structures (nodeinfo and port_info, > attributes of SM ) in his driver. > we can avoid this situation (duplicate declarations of same structure) by > declaring the above mentioned structures in the core layer. Why not wait till this actually happens? -- MST From etta at systemfabricworks.com Fri May 4 07:28:23 2007 From: etta at systemfabricworks.com (Chieng Etta) Date: Fri, 4 May 2007 09:28:23 -0500 Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 In-Reply-To: Message-ID: <008d01c78e58$78df8810$c801a8c0@ettac> Steffen, The installation should be the same on either ES or AS. I assume that your system has a /usr/lib64 directory. Would you be able to install rc2 using the ./install.sh script? 
Thanks, Etta _____ From: Steffen Persvold [mailto:steffen.persvold at scali.com] Sent: Thursday, May 03, 2007 5:38 PM To: Chieng Etta; vlad at dev.mellanox.co.il Cc: openfabrics-ewg at openib.org; openib-general at openib.org Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 So I don't understand it then... Why are my RPMs only containing one of the two versions. I'm running on ES and not AS but that shouldn't really matter... This output that you list : [root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm /etc/ld.so.conf.d/ofed.conf /usr/lib/libibverbs.so.1 /usr/lib/libibverbs.so.1.0.0 /usr/lib64/libibverbs.so.1 /usr/lib64/libibverbs.so.1.0.0 Is exactly what I would have expected as well, but my RPM says : [root at pe1850-1 redhat-release-4ES-5.5]# pwd /root/OFED-1.2-rc2/RPMS/redhat-release-4ES-5.5 [root at pe1850-1 redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm /etc/ld.so.conf.d/ofed.conf /usr/lib/libibverbs.so.1 /usr/lib/libibverbs.so.1.0.0 I'm lookin through the build log (/tmp/OFED.build.xxx.log) and both versions get compiled, but it looks like the 32bit libraries (which gets compiled last) overwrites the 64bit libraries in the "make install" section because both ends up in /usr/lib : (64bit section of the build) : /usr/bin/install -c src/.libs/libibverbs.so.1.0.0 /var/tmp/OFED/usr/lib/libibverbs.so.1.0.0 (cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so.1 || { rm -f libibverbs.so.1 && ln -s libibverbs.so.1.0.0 libibverbs.so.1; }; }) (cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so || { rm -f libibverbs.so && ln -s libibverbs.so.1.0.0 libibverbs.so; }; }) (32bit section of the build) : /usr/bin/install -c src/.libs/libibverbs.so.1.0.0 /var/tmp/OFED/usr/lib/libibverbs.so.1.0.0 (cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so.1 || { rm -f libibverbs.so.1 && ln -s libibverbs.so.1.0.0 libibverbs.so.1; }; }) (cd /var/tmp/OFED/usr/lib && { ln -s -f 
libibverbs.so.1.0.0 libibverbs.so || { rm -f libibverbs.so && ln -s libibverbs.so.1.0.0 libibverbs.so; }; }) So the question is, why is the 64bit section ending up in /usr/lib in the first place ??? I do see this though : /bin/rm -f /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs Running: env ac_cv_lib_ibverbs_ibv_get_device_list=yes ac_cv_header_infiniband_driver_h=yes ac_cv_func_ibv_read_sysfs_file=yes ac_cv_func_ibv_dontfork_range= yes ac_cv_func_ibv_dofork_range=yes ac_cv_func_ibv_register_driver=yes HAVE_IBV_DEVICE_LIBRARY_EXTENSION_TRUE=yes ./configure --cache-file=/var/tmp/OFEDRPM/ BUILD/ofa_user-1.2/configure.cache --disable-libcheck --prefix /usr --libdir /usr/lib --mandir=/usr/man --sysconfdir=/usr/etc CPPFLAGS="-I../libibverbs/include" --libdir /usr/lib ??? shouldn't that be --libdir /usr/lib64 for the 64bit section ? Cheers, Steffen Persvold Technical Director Americas tel. 508-281-7100 x401 fax. 508-281-7171 http://www.scali.com/ Scaling the Linux datacenter _____ From: Chieng Etta [mailto:etta at systemfabricworks.com] Sent: Thu 5/3/2007 3:26 PM To: Steffen Persvold; vlad at dev.mellanox.co.il Cc: openfabrics-ewg at openib.org; openib-general at openib.org Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 Hi Steffen, After removing all the OFED packages by using ./uninstall.sh, I tried ./build.sh to build the RPMs then installed libibverbs-1.1-0.x86_64.rpm onto system. "libibverbs.so.1.0.0" was installed under the right directories (/usr/lib and /usr/lib64). Please see the output below. 
Thanks, Etta [root at sfw1 etc]# cat /etc/*release Red Hat Enterprise Linux AS release 4 (Nahant Update 4) [root at sfw1 etc]# uname -a Linux sfw1.sfw.int 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux [root at sfw1 lib64]# pwd /usr/lib64 [root at sfw1 lib64]# ll libibverbs* ls: libibverbs*: No such file or directory [root at sfw1 lib64]# rpm -aq |grep libibverbs [root at sfw1 lib64]# cd - /root/images/OFED-1.2-rc2/RPMS/redhat-release-4AS-5.5 [root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm /etc/ld.so.conf.d/ofed.conf /usr/lib/libibverbs.so.1 /usr/lib/libibverbs.so.1.0.0 /usr/lib64/libibverbs.so.1 /usr/lib64/libibverbs.so.1.0.0 [root at sfw1 redhat-release-4AS-5.5]# rpm -ivh libibverbs-1.1-0.x86_64.rpm Preparing... ########################################### [100%] 1:libibverbs ########################################### [100%] [root at sfw1 redhat-release-4AS-5.5]# rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm x86_64 [root at sfw1 redhat-release-4AS-5.5]# cd - /usr/lib64 [root at sfw1 lib64]# rpm -aq |grep libibverbs libibverbs-1.1-0 [root at sfw1 lib64]# ll libibverbs* lrwxrwxrwx 1 root root 19 May 3 13:50 libibverbs.so.1 -> libibverbs.so.1.0.0 -rwxr-xr-x 1 root root 200993 May 3 13:18 libibverbs.so.1.0.0 [root at sfw1 lib64]# file libibverbs.so.1.0.0 libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped [root at sfw1 lib]# cd /usr/lib [root at sfw1 lib]# file libibverbs.so.1.0.0 libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), not stripped [root at sfw1 etc]# cat /etc/ld.so.conf include ld.so.conf.d/*.conf /usr/ofed/lib64 [root at sfw1 etc]# cat /etc/ld.so.conf.d/ofed.conf /usr/lib64 /usr/lib -----Original Message----- From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Steffen Persvold Sent: Thursday, May 03, 2007 10:26 AM To: vlad at dev.mellanox.co.il Cc: openfabrics-ewg 
at openib.org; openib-general at openib.org Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 Vladimir, Nope. Still the same issue. The RPMs will only contain one set of libraries and it is always in /usr/lib (if I set the build_32bit=0 option I get the 64bit libraries but in the wrong directory). Seriously, am I the only one seeing this ? I would think rhel4 u4 was a very normal test platform ? Cheers, Steffen Persvold Technical Director Americas tel. 508-281-7100 x401 fax. 508-281-7171 http://www.scali.com/ Scaling the Linux datacenter > -----Original Message----- > From: Vladimir Sokolovsky [mailto:vlad at dev.mellanox.co.il] > Sent: Thursday, May 03, 2007 9:07 AM > To: Steffen Persvold > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > Please see if this happens in OFED-1.2-20070503-0600. > But first uninstall the previous OFED version with ofed_uninstall.sh > command. > > Thanks, > > Regards, > Vladimir > > On Wed, 2007-05-02 at 11:30 -0400, Steffen Persvold wrote: > > Hmm, > > > > so I tried something. I put : > > > > build_32bit=0 > > > > into my ofed.conf file and rebuilt (build.sh -c ofed.conf). This time > > it built 64bit libraries, but it puts them in the wrong directory : > > > > # rpm -qpl ../libibverbs-1.1-0.x86_64.rpm > > /etc/ld.so.conf.d/ofed.conf > > /usr/lib/libibverbs.so.1 > > /usr/lib/libibverbs.so.1.0.0 > > > > # file /usr/lib/libibverbs.so.1.0.0 > > /usr/lib/libibverbs.so.1.0.0: ELF 64-bit LSB shared object, AMD > > x86-64, version 1 (SYSV), not stripped > > > > So what's up ?? > > > > Cheers, > > Steffen Persvold > > Technical Director Americas > > tel. 508-281-7100 x401 > > fax. 
508-281-7171 > > > > http://www.scali.com/ > > Scaling the Linux datacenter > > > > > > ______________________________________________________________________ > > From: Steffen Persvold > > Sent: Wed 5/2/2007 10:30 AM > > To: Steffen Persvold; Vladimir Sokolovsky > > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > > > > > Also, > > > > If I look at the /etc/ld.so.conf/ofed.conf file I have : > > > > # cat ofed.conf > > /usr/lib > > /usr/lib > > > > > > which seems kinda weird ? :) > > > > Cheers, > > > > Steffen Persvold > > Technical Director Americas > > tel. 508-281-7100 x401 > > fax. 508-281-7171 > > > > http://www.scali.com/ > > Scaling the Linux datacenter > > > > > > ______________________________________________________________________ > > From: ewg-bounces at lists.openfabrics.org on behalf of Steffen Persvold > > Sent: Wed 5/2/2007 10:20 AM > > To: Vladimir Sokolovsky > > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > > Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > > > > > Nope : > > > > > > [redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm > > /etc/ld.so.conf.d/ofed.conf > > /usr/lib/libibverbs.so.1 > > /usr/lib/libibverbs.so.1.0.0 > > [redhat-release-4ES-5.5]# > > > > So the RPM got built, but without 64bit libraries. Now if it was the > > other way around (i.e no 32bit libraries) I could have understood it > > (as 32bit is an option on x86_64), but not having the native 64bit > > libraries is not so easy to understand :) > > > > cheers, > > Steffen Persvold > > Technical Director Americas > > tel. 508-281-7100 x401 > > fax. 
508-281-7171 > > > > http://www.scali.com/ > > Scaling the Linux datacenter > > > > > > ______________________________________________________________________ > > From: ewg-bounces at lists.openfabrics.org on behalf of Vladimir > > Sokolovsky > > Sent: Wed 5/2/2007 10:05 AM > > To: Steffen Persvold > > Cc: openfabrics-ewg at openib.org; openib-general at openib.org > > Subject: Re: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 > > > > > > Don't you have /usr/lib64/libibverbs.so.1.0.0? > > > > Regards, > > Vladimir > > > > On Wed, 2007-05-02 at 10:00 -0400, Steffen Persvold wrote: > > > Folks, > > > > > > I used the build.sh script to build the above mentioned packages on > > > rhel4u4 x86_64, but for some reason it only compiles 32bit libraries > > > (even if the packages are named x86_64) : > > > > > > # rpm -qp --qf "%{arch}\n" libibverbs-1.1-0.x86_64.rpm > > > x86_64 > > > > > > (after installing it) : > > > > > > # file /usr/lib/libibverbs.so.1.0.0 > > > /usr/lib/libibverbs.so.1.0.0: ELF 32-bit LSB shared object, Intel > > > 80386, version 1 (SYSV), not stripped > > > > > > What did I do wrong ?? > > > > > > Cheers, > > > Steffen Persvold > > > Technical Director Americas > > > tel. 508-281-7100 x401 > > > fax. 508-281-7171 > > > > > > http://www.scali.com/ > > > Scaling the Linux datacenter > > > _______________________________________________ > > > ewg mailing list > > > ewg at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > > > _______________________________________________ > > ewg mailing list > > ewg at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > > > > > _______________________________________________ ewg mailing list ewg at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rick.jones2 at hp.com Fri May 4 10:11:28 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Fri, 04 May 2007 10:11:28 -0700 Subject: [ofa-general] Re: OFED-1.2-20070502-0600 on Debian In-Reply-To: <20070504075459.GI4829@mellanox.co.il> References: <463901A0.5060905@hp.com> <20070502214944.GF10009@mellanox.co.il> <463A1C9A.6060706@hp.com> <20070503174817.GC9719@mellanox.co.il> <463A22F3.4090108@hp.com> <20070504075459.GI4829@mellanox.co.il> Message-ID: <463B6940.2000500@hp.com> Michael S. Tsirkin wrote: > OK. > Apply these 2 patches after configure: Are they already in the latest nightly? rick jones > > > > ------------------------------------------------------------------------ > > commit ecbb416939da77c0d107409976499724baddce7b > Author: Alexey Kuznetsov > Date: Sat Mar 24 12:52:16 2007 -0700 > > [NET]: Fix neighbour destructor handling. > > ->neigh_destructor() is killed (not used), replaced with > ->neigh_cleanup(), which is called when neighbor entry goes to dead > state. At this point everything is still valid: neigh->dev, > neigh->parms etc. > > The device should guarantee that dead neighbor entries (neigh->dead != > 0) do not get private part initialized, otherwise nobody will cleanup > it. > > I think this is enough for ipoib which is the only user of this thing. > Initialization private part of neighbor entries happens in ipib > start_xmit routine, which is not reached when device is down. But it > would be better to add explicit test for neigh->dead in any case. > > Signed-off-by: David S. 
Miller > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 0741c6d..f2a40ae 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -814,7 +814,7 @@ static void ipoib_set_mcast_list(struct net_device *dev) > queue_work(ipoib_workqueue, &priv->restart_task); > } > > -static void ipoib_neigh_destructor(struct neighbour *n) > +static void ipoib_neigh_cleanup(struct neighbour *n) > { > struct ipoib_neigh *neigh; > struct ipoib_dev_priv *priv = netdev_priv(n->dev); > @@ -822,7 +822,7 @@ static void ipoib_neigh_destructor(struct neighbour *n) > struct ipoib_ah *ah = NULL; > > ipoib_dbg(priv, > - "neigh_destructor for %06x " IPOIB_GID_FMT "\n", > + "neigh_cleanup for %06x " IPOIB_GID_FMT "\n", > IPOIB_QPN(n->ha), > IPOIB_GID_RAW_ARG(n->ha + 4)); > > @@ -874,7 +874,7 @@ void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh) > > static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) > { > - parms->neigh_destructor = ipoib_neigh_destructor; > + parms->neigh_cleanup = ipoib_neigh_cleanup; > > return 0; > } > > > ------------------------------------------------------------------------ > > commit 43cb76d91ee85f579a69d42bc8efc08bac560278 > Author: Greg Kroah-Hartman > Date: Tue Apr 9 12:14:34 2002 -0700 > > Network: convert network devices to use struct device instead of class_device > > This lets the network core have the ability to handle suspend/resume > issues, if it wants to. > > Thanks to Frederik Deweerdt for the arm > driver fixes. 
> > Signed-off-by: Greg Kroah-Hartman > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 705eb1d..af5ee2e 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -958,16 +958,17 @@ struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) > return netdev_priv(dev); > } > > -static ssize_t show_pkey(struct class_device *cdev, char *buf) > +static ssize_t show_pkey(struct device *dev, > + struct device_attribute *attr, char *buf) > { > - struct ipoib_dev_priv *priv = > - netdev_priv(container_of(cdev, struct net_device, class_dev)); > + struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev)); > > return sprintf(buf, "0x%04x\n", priv->pkey); > } > -static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); > +static DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); > > -static ssize_t create_child(struct class_device *cdev, > +static ssize_t create_child(struct device *dev, > + struct device_attribute *attr, > const char *buf, size_t count) > { > int pkey; > @@ -985,14 +986,14 @@ static ssize_t create_child(struct class_device *cdev, > */ > pkey |= 0x8000; > > - ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), > - pkey); > + ret = ipoib_vlan_add(to_net_dev(dev), pkey); > > return ret ? ret : count; > } > -static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); > +static DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); > > -static ssize_t delete_child(struct class_device *cdev, > +static ssize_t delete_child(struct device *dev, > + struct device_attribute *attr, > const char *buf, size_t count) > { > int pkey; > @@ -1004,18 +1005,16 @@ static ssize_t delete_child(struct class_device *cdev, > if (pkey < 0 || pkey > 0xffff) > return -EINVAL; > > - ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), > - pkey); > + ret = ipoib_vlan_delete(to_net_dev(dev), pkey); > > return ret ? 
ret : count; > > } > -static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); > +static DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); > > int ipoib_add_pkey_attr(struct net_device *dev) > { > - return class_device_create_file(&dev->class_dev, > - &class_device_attr_pkey); > + return device_create_file(&dev->dev, &dev_attr_pkey); > } > > static struct net_device *ipoib_add_port(const char *format, > @@ -1083,11 +1082,9 @@ static struct net_device *ipoib_add_port(const char *format, > > if (ipoib_add_pkey_attr(priv->dev)) > goto sysfs_failed; > - if (class_device_create_file(&priv->dev->class_dev, > - &class_device_attr_create_child)) > + if (device_create_file(&priv->dev->dev, &dev_attr_create_child)) > goto sysfs_failed; > - if (class_device_create_file(&priv->dev->class_dev, > - &class_device_attr_delete_child)) > + if (device_create_file(&priv->dev->dev, &dev_attr_delete_child)) > goto sysfs_failed; > > return priv->dev; > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c > index f887780..085eafe 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c > @@ -42,15 +42,15 @@ > > #include "ipoib.h" > > -static ssize_t show_parent(struct class_device *class_dev, char *buf) > +static ssize_t show_parent(struct device *d, struct device_attribute *attr, > + char *buf) > { > - struct net_device *dev = > - container_of(class_dev, struct net_device, class_dev); > + struct net_device *dev = to_net_dev(d); > struct ipoib_dev_priv *priv = netdev_priv(dev); > > return sprintf(buf, "%s\n", priv->parent->name); > } > -static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); > +static DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); > > int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) > { > @@ -118,8 +118,7 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) > if (ipoib_add_pkey_attr(priv->dev)) > goto sysfs_failed; > > - 
if (class_device_create_file(&priv->dev->class_dev, > - &class_device_attr_parent)) > + if (device_create_file(&priv->dev->dev, &dev_attr_parent)) > goto sysfs_failed; > > list_add_tail(&priv->list, &ppriv->child_intfs); From mhagen at iol.unh.edu Fri May 4 12:39:21 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Fri, 4 May 2007 15:39:21 -0400 (EDT) Subject: [ofa-general] [PATCH] infiniband: add userspace support for invalidate stag Message-ID: <53312.132.177.125.178.1178307561.squirrel@postal.iol.unh.edu> --- linux-2.6.21.1/include/rdma/ib_user_verbs.h 2007-05-02 15:35:13.000000000 -0400 +++ linux-2.6.21.1/include/rdma/ib_user_verbs.h 2007-05-02 15:53:40.000000000 -0400 @@ -553,6 +553,10 @@ struct ib_uverbs_send_wr { __u32 remote_qkey; __u32 reserved; } ud; + struct { + __u32 rkey; + __u32 reserved; + } invalidate; } wr; }; -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From mhagen at iol.unh.edu Fri May 4 12:39:55 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Fri, 4 May 2007 15:39:55 -0400 (EDT) Subject: [ofa-general] [PATCH] infiniband: add userspace support for invalidate stag Message-ID: <53313.132.177.125.178.1178307595.squirrel@postal.iol.unh.edu> --- linux-2.6.21.1/drivers/infiniband/core/uverbs_cmd.c 2007-05-04 14:25:50.000000000 -0400 +++ linux-2.6.21.1/drivers/infiniband/core/uverbs_cmd.c 2007-05-04 14:47:42.000000000 -0400 @@ -1507,6 +1507,12 @@ ssize_t ib_uverbs_post_send(struct ib_uv next->wr.atomic.swap = user_wr->wr.atomic.swap; next->wr.atomic.rkey = user_wr->wr.atomic.rkey; break; + case IB_WR_SEND: + if(next->send_flags & IB_SEND_INVALIDATE) { + next->wr.invalidate.rkey = + user_wr->wr.invalidate.rkey; + } + break; default: break; } -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and 
Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From pradeeps at linux.vnet.ibm.com Fri May 4 12:41:18 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Fri, 04 May 2007 12:41:18 -0700 Subject: [ofa-general] Queue Pair in error state Message-ID: <463B8C5E.3060005@linux.vnet.ibm.com> If packets are received by a queue pair that has gone to an error state, which of the following is to be expected: 1. It gets dropped by the hardware and the sender will be notified with an error. 2. The packet gets delivered to the receiver and the work completion handler needs to deal with it. Pradeep From mhagen at iol.unh.edu Fri May 4 12:40:55 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Fri, 4 May 2007 15:40:55 -0400 (EDT) Subject: [ofa-general] [PATCH] infiniband: add userspace support for invalidate stag Message-ID: <53314.132.177.125.178.1178307655.squirrel@postal.iol.unh.edu> --- ofa_1_2_user-20070502-0200/src/userspace/libibverbs/include/infiniband/verbs.h 2007-05-03 10:11:23.000000000 -0400 +++ ofa_1_2_user-20070502-0200/src/userspace/libibverbs/include/infiniband/verbs.h 2007-05-03 10:12:32.000000000 -0400 @@ -492,7 +492,8 @@ enum ibv_send_flags { IBV_SEND_FENCE = 1 << 0, IBV_SEND_SIGNALED = 1 << 1, IBV_SEND_SOLICITED = 1 << 2, - IBV_SEND_INLINE = 1 << 3 + IBV_SEND_INLINE = 1 << 3, + IBV_SEND_INVALIDATE = 1 << 4 }; struct ibv_sge { @@ -525,6 +526,9 @@ struct ibv_send_wr { uint32_t remote_qpn; uint32_t remote_qkey; } ud; + struct { + uint32_t rkey; + } invalidate; } wr; }; -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From mhagen at iol.unh.edu Fri May 4 12:41:34 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Fri,
4 May 2007 15:41:34 -0400 (EDT) Subject: [ofa-general] [PATCH] infiniband: add userspace support for invalidate stag Message-ID: <43731.132.177.125.178.1178307694.squirrel@postal.iol.unh.edu> --- ofa_1_2_user-20070502-0200/src/userspace/libibverbs/include/infiniband/kern-abi.h 2007-05-03 10:36:13.000000000 -0400 +++ ofa_1_2_user-20070502-0200/src/userspace/libibverbs/include/infiniband/kern-abi.h 2007-05-03 10:37:39.000000000 -0400 @@ -592,6 +592,10 @@ struct ibv_kern_send_wr { __u32 remote_qkey; __u32 reserved; } ud; + struct { + __u32 rkey; + __u32 reserved; + } invalidate; } wr; }; -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From mhagen at iol.unh.edu Fri May 4 12:42:09 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Fri, 4 May 2007 15:42:09 -0400 (EDT) Subject: [ofa-general] [PATCH] infiniband: add userspace support for invalidate stag Message-ID: <43733.132.177.125.178.1178307729.squirrel@postal.iol.unh.edu> --- ofa_1_2_user-20070502-0200/src/userspace/libibverbs/src/cmd.c 2007-05-02 05:00:25.000000000 -0400 +++ ofa_1_2_user-20070502-0200/src/userspace/libibverbs/src/cmd.c 2007-05-04 15:19:36.000000000 -0400 @@ -857,6 +857,11 @@ int ibv_cmd_post_send(struct ibv_qp *ibq tmp->wr.atomic.swap = i->wr.atomic.swap; tmp->wr.atomic.rkey = i->wr.atomic.rkey; break; + case IBV_WR_SEND: + if(tmp->send_flags & IBV_SEND_INVALIDATE) { + tmp->wr.invalidate.rkey = + i->wr.invalidate.rkey; + } default: break; } -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From steffen.persvold at scali.com Fri May 4 14:29:34 2007 From: steffen.persvold at scali.com (Steffen Persvold) 
Date: Fri, 4 May 2007 17:29:34 -0400 Subject: [ofa-general] RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 References: <008d01c78e58$78df8810$c801a8c0@ettac> Message-ID: Etta, Of course my system has the /usr/lib64 directory. Using install or build doesn't seem to make a difference; the problem seems to be that when the 64bit libraries are compiled and installed they're installed in /usr/lib and not /usr/lib64, and thus when rpmbuild gets to compiling and installing the 32bit libraries the 64bit libraries are overwritten... I don't know enough about the Makefiles and configure scripts inside the .src.rpm files to understand exactly why it tells it to install the 64bit libraries in /usr/lib and not in /usr/lib64... Does anyone have any insight on that ?? Cheers, Steffen Persvold Technical Director Americas tel. 508-281-7100 x401 fax. 508-281-7171 http://www.scali.com/ Scaling the Linux datacenter ________________________________ From: Chieng Etta [mailto:etta at systemfabricworks.com] Sent: Fri 5/4/2007 10:28 AM To: Steffen Persvold; vlad at dev.mellanox.co.il Cc: openfabrics-ewg at openib.org; openib-general at openib.org Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 Steffen, The installation should be the same on either ES or AS. I assume that your system has a /usr/lib64 directory. Would you be able to install rc2 by using the ./install.sh script? Thanks, Etta ________________________________ From: Steffen Persvold [mailto:steffen.persvold at scali.com] Sent: Thursday, May 03, 2007 5:38 PM To: Chieng Etta; vlad at dev.mellanox.co.il Cc: openfabrics-ewg at openib.org; openib-general at openib.org Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 So I don't understand it then... Why are my RPMs only containing one of the two versions? I'm running on ES and not AS but that shouldn't really matter...
This output that you list : [root at sfw1 redhat-release-4AS-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm /etc/ld.so.conf.d/ofed.conf /usr/lib/libibverbs.so.1 /usr/lib/libibverbs.so.1.0.0 /usr/lib64/libibverbs.so.1 /usr/lib64/libibverbs.so.1.0.0 Is exactly what I would have expected as well, but my RPM says : [root at pe1850-1 redhat-release-4ES-5.5]# pwd /root/OFED-1.2-rc2/RPMS/redhat-release-4ES-5.5 [root at pe1850-1 redhat-release-4ES-5.5]# rpm -qpl libibverbs-1.1-0.x86_64.rpm /etc/ld.so.conf.d/ofed.conf /usr/lib/libibverbs.so.1 /usr/lib/libibverbs.so.1.0.0 I'm looking through the build log (/tmp/OFED.build.xxx.log) and both versions get compiled, but it looks like the 32bit libraries (which get compiled last) overwrite the 64bit libraries in the "make install" section because both end up in /usr/lib : (64bit section of the build) : /usr/bin/install -c src/.libs/libibverbs.so.1.0.0 /var/tmp/OFED/usr/lib/libibverbs.so.1.0.0 (cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so.1 || { rm -f libibverbs.so.1 && ln -s libibverbs.so.1.0.0 libibverbs.so.1; }; }) (cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so || { rm -f libibverbs.so && ln -s libibverbs.so.1.0.0 libibverbs.so; }; }) (32bit section of the build) : /usr/bin/install -c src/.libs/libibverbs.so.1.0.0 /var/tmp/OFED/usr/lib/libibverbs.so.1.0.0 (cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so.1 || { rm -f libibverbs.so.1 && ln -s libibverbs.so.1.0.0 libibverbs.so.1; }; }) (cd /var/tmp/OFED/usr/lib && { ln -s -f libibverbs.so.1.0.0 libibverbs.so || { rm -f libibverbs.so && ln -s libibverbs.so.1.0.0 libibverbs.so; }; }) So the question is, why is the 64bit section ending up in /usr/lib in the first place ???
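As an illustration of the behaviour one would expect here, and only a minimal sketch rather than OFED's actual build logic: on a RHEL-style multilib system the 64bit pass would normally be configured with --libdir /usr/lib64 and the 32bit pass with --libdir /usr/lib, so that the second "make install" cannot overwrite the first. The helper name libdir_for_arch is hypothetical, and the list of architectures that use lib64 is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of OFED's build.sh): pick the library
# directory for a target architecture on a RHEL-style multilib layout.
# The set of lib64 architectures listed here is an assumption.
libdir_for_arch() {
    case "$1" in
        x86_64|ppc64|s390x) echo "/usr/lib64" ;;
        *)                  echo "/usr/lib" ;;
    esac
}

# Show the --libdir each pass of a dual 64bit/32bit build would hand to
# configure, so the two installs land in different directories.
for arch in x86_64 i386; do
    echo "$arch: ./configure --prefix /usr --libdir $(libdir_for_arch "$arch")"
done
```

Comparing the printed --libdir values with the ones in the build log is a quick way to see which pass is being pointed at the wrong directory.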
I do see this though : /bin/rm -f /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs Running: env ac_cv_lib_ibverbs_ibv_get_device_list=yes ac_cv_header_infiniband_driver_h=yes ac_cv_func_ibv_read_sysfs_file=yes ac_cv_func_ibv_dontfork_range=yes ac_cv_func_ibv_dofork_range=yes ac_cv_func_ibv_register_driver=yes HAVE_IBV_DEVICE_LIBRARY_EXTENSION_TRUE=yes ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/configure.cache --disable-libcheck --prefix /usr --libdir /usr/lib --mandir=/usr/man --sysconfdir=/usr/etc CPPFLAGS="-I../libibverbs/include" --libdir /usr/lib ??? shouldn't that be --libdir /usr/lib64 for the 64bit section ? Cheers, Steffen Persvold Technical Director Americas tel. 508-281-7100 x401 fax. 508-281-7171 http://www.scali.com/ Scaling the Linux datacenter ________________________________ From: Chieng Etta [mailto:etta at systemfabricworks.com] Sent: Thu 5/3/2007 3:26 PM To: Steffen Persvold; vlad at dev.mellanox.co.il Cc: openfabrics-ewg at openib.org; openib-general at openib.org Subject: RE: [ewg] OFED 1.2 RC2 on rhel4u4 x86_64 Hi Steffen, After removing all the OFED packages by using ./uninstall.sh, I tried ./build.sh to build the RPMs then installed libibverbs-1.1-0.x86_64.rpm onto the system. "libibverbs.so.1.0.0" was installed under the right directories (/usr/lib and /usr/lib64). Please see the output below.
[snip] -------------- next part -------------- An HTML attachment was scrubbed...
URL: From rdreier at cisco.com Fri May 4 14:50:04 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 04 May 2007 14:50:04 -0700 Subject: [ofa-general] [PATCH] infiniband: add userspace support for invalidate stag In-Reply-To: <53312.132.177.125.178.1178307561.squirrel@postal.iol.unh.edu> (mhagen@iol.unh.edu's message of "Fri, 4 May 2007 15:39:21 -0400 (EDT)") References: <53312.132.177.125.178.1178307561.squirrel@postal.iol.unh.edu> Message-ID: A few general things: - please always submit patches with a changelog entry and Signed-off-by: line - please send patches in logical chunks. Usually I'm complaining about people combining unrelated things into one patch, but in this case I think you divided the patch up too much -- rather than 5 patches, this should probably be one kernel patch and one userspace patch. - please make libibverbs patches apply to the libibverbs git tree with -p1. You seem to have generated patches against an OFED package. OK, with that out of the way, I think there are still some issues to sort out with how to handle send with invalidate from userspace. These patches don't address the case of new userspace with send-with-invalidate support talking to an unpatched kernel -- it seems that send-with-invalidate would be silently turned into a plain send request, which is not a very good failure mode. I don't know what the right solution is yet -- a kernel ABI bump for this one case (send with invalidate support for userspace drivers that don't do kernel bypass == amso1100) is ugly. Maybe we also need a device capabilities bit that says whether send-with-invalidate is supported? - R. From rdreier at cisco.com Fri May 4 14:52:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 04 May 2007 14:52:49 -0700 Subject: [ofa-general] Re: [PATCH 0 of 3] comp_vector kernel support In-Reply-To: <20070504062526.GB4829@mellanox.co.il> (Michael S. 
Tsirkin's message of "Fri, 4 May 2007 09:25:26 +0300") References: <20070503104806.GC10009@mellanox.co.il> <20070504062526.GB4829@mellanox.co.il> Message-ID: > How about just patches 1 and 2? > They don't do anything to *kernel* ULPs by themselves, > and give userspace ULPs opportunity to start using the feature. > We'll learn from that, and enhance kernel ULPs by 2.6.23. I guess I could see doing the first patch (just support multiple vectors in the kernel without changing any drivers). That way it would be easy to experiment with patched drivers that enable multiple vectors. I think there's still some figuring out to do about how many EQs to enable, etc, and I think it would be better to prevent drivers from escaping into the wild before we have a better handle on the issues. - R. From swise at opengridcomputing.com Fri May 4 17:32:58 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 04 May 2007 19:32:58 -0500 Subject: [ofa-general] [PATCH] infiniband: add userspace support for invalidate stag In-Reply-To: References: <53312.132.177.125.178.1178307561.squirrel@postal.iol.unh.edu> Message-ID: <1178325178.3011.4.camel@stevo-laptop> On Fri, 2007-05-04 at 14:50 -0700, Roland Dreier wrote: > A few general things: > - please always submit patches with a changelog entry and > Signed-off-by: line > - please send patches in logical chunks. Usually I'm complaining > about people combining unrelated things into one patch, but in this > case I think you divided the patch up too much -- rather than 5 > patches, this should probably be one kernel patch and one userspace > patch. > - please make libibverbs patches apply to the libibverbs git tree > with -p1. You seem to have generated patches against an OFED package. > > OK, with that out of the way, I think there are still some issues to > sort out with how to handle send with invalidate from userspace. 
> These patches don't address the case of new userspace with > send-with-invalidate support talking to an unpatched kernel -- it > seems that send-with-invalidate would be silently turned into a plain > send request, which is not a very good failure mode. > > I don't know what the right solution is yet -- a kernel ABI bump for > this one case (send with invalidate support for userspace drivers that > don't do kernel bypass == amso1100) is ugly. Maybe we also need a > device capabilities bit that says whether send-with-invalidate is > supported? > There already exists a SEND-INV capabilities flag. IB_DEVICE_SEND_W_INV = (1<<16), I think with the capabilities flag, we shouldn't worry about changing the ABI. But the drivers will need to set this flag. Amso currently does... Steve. From vlad at lists.openfabrics.org Sat May 5 02:37:30 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 5 May 2007 02:37:30 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070505-0200 daily build status Message-ID: <20070505093730.F188DE60927@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on
ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From swise at opengridcomputing.com Sat May 5 09:00:59 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 05 May 2007 11:00:59 -0500 Subject: [ofa-general] [PATCH] infiniband: add userspace support for invalidate stag In-Reply-To: <1178325178.3011.4.camel@stevo-laptop> References: <53312.132.177.125.178.1178307561.squirrel@postal.iol.unh.edu> <1178325178.3011.4.camel@stevo-laptop> Message-ID: <1178380859.8125.2.camel@stevo-desktop> On Fri, 2007-05-04 at 19:32 -0500, Steve Wise wrote: > On Fri, 2007-05-04 at 14:50 -0700, Roland Dreier wrote: > > A few general things: > > - please always submit patches with a changelog entry and > > Signed-off-by: line > > - please send patches in logical chunks. Usually I'm complaining > > about people combining unrelated things into one patch, but in this > > case I think you divided the patch up too much -- rather than 5 > > patches, this should probably be one kernel patch and one userspace > > patch. > > - please make libibverbs patches apply to the libibverbs git tree > > with -p1. You seem to have generated patches against an OFED package. > > > > OK, with that out of the way, I think there are still some issues to > > sort out with how to handle send with invalidate from userspace. 
> > These patches don't address the case of new userspace with > > send-with-invalidate support talking to an unpatched kernel -- it > > seems that send-with-invalidate would be silently turned into a plain > > send request, which is not a very good failure mode. > > > > I don't know what the right solution is yet -- a kernel ABI bump for > > this one case (send with invalidate support for userspace drivers that > > don't do kernel bypass == amso1100) is ugly. Maybe we also need a > > device capabilities bit that says whether send-with-invalidate is > > supported? > > > > There already exists a SEND-INV capabilities flag. > > > IB_DEVICE_SEND_W_INV = (1<<16), > > I think with the capabilities flag, we shouldn't worry about changing > the ABI. But the drivers will need to set this flag. Amso currently > does... Actually, since Amso has set this flag since day one, it doesn't really solve the ABI issue Roland describes. Steve. From halr at voltaire.com Sat May 5 10:38:48 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 05 May 2007 13:38:48 -0400 Subject: [ofa-general] [PATCH] IB/core: Enhance SMI for switch support Message-ID: <1178386725.32222.188297.camel@hal.voltaire.com> IB/core: Enhance SMI for switch support SMI is extended for switch (intermediate hop) support. Care has been taken to ensure the CA (and router) code paths are as identical as possible as to how they were prior to adding this support. Signed-off-by: Suresh Shelvapille Signed-off-by: Hal Rosenstock diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c index ecd1a30..7583941 100644 --- a/drivers/infiniband/core/agent.c +++ b/drivers/infiniband/core/agent.c @@ -3,7 +3,7 @@ * Copyright (c) 2004, 2005 Infinicon Corporation. All rights reserved. * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. * Copyright (c) 2004, 2005 Topspin Corporation. All rights reserved. - * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. 
+ * Copyright (c) 2004-2007 Voltaire Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -34,7 +34,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: agent.c 1389 2004-12-27 22:56:47Z roland $ */ #include @@ -42,6 +41,7 @@ #include "agent.h" #include "smi.h" +#include "mad_priv.h" #define SPFX "ib_agent: " @@ -87,8 +87,13 @@ int agent_send_response(struct ib_mad *m struct ib_mad_send_buf *send_buf; struct ib_ah *ah; int ret; + struct ib_mad_send_wr_private *mad_send_wr; + + if (device->node_type == RDMA_NODE_IB_SWITCH) + port_priv = ib_get_agent_port(device, 0); + else + port_priv = ib_get_agent_port(device, port_num); - port_priv = ib_get_agent_port(device, port_num); if (!port_priv) { printk(KERN_ERR SPFX "Unable to find port agent\n"); return -ENODEV; @@ -113,6 +118,14 @@ int agent_send_response(struct ib_mad *m memcpy(send_buf->mad, mad, sizeof *mad); send_buf->ah = ah; + + if (device->node_type == RDMA_NODE_IB_SWITCH){ + mad_send_wr = container_of(send_buf, + struct ib_mad_send_wr_private, + send_buf); + mad_send_wr->send_wr.wr.ud.port_num = port_num; + } + if ((ret = ib_post_send_mad(send_buf, NULL))) { printk(KERN_ERR SPFX "ib_post_send_mad error:%d\n", ret); goto err2; diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6edfecf..70b4adc 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -675,10 +675,16 @@ static int handle_outgoing_dr_smp(struct struct ib_mad_port_private *port_priv; struct ib_mad_agent_private *recv_mad_agent = NULL; struct ib_device *device = mad_agent_priv->agent.device; - u8 port_num = mad_agent_priv->agent.port_num; + u8 port_num; struct ib_wc mad_wc; struct ib_send_wr *send_wr = &mad_send_wr->send_wr; + if (device->node_type == RDMA_NODE_IB_SWITCH && + smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + port_num = 
send_wr->wr.ud.port_num; + else + port_num = mad_agent_priv->agent.port_num; + /* * Directed route handling starts if the initial LID routed part of * a request or the ending LID routed part of a response is empty. @@ -1839,6 +1845,7 @@ static void ib_mad_recv_done_handler(str struct ib_mad_private *recv, *response; struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent; + int port_num; response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); if (!response) @@ -1872,25 +1879,50 @@ static void ib_mad_recv_done_handler(str if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) goto out; + if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) + port_num = wc->port_num; + else + port_num = port_priv->port_num; + if (recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + enum smi_forward_action retsmi; + if (smi_handle_dr_smp_recv(&recv->mad.smp, port_priv->device->node_type, - port_priv->port_num, + port_num, port_priv->device->phys_port_cnt) == IB_SMI_DISCARD) goto out; - if (smi_check_forward_dr_smp(&recv->mad.smp) == IB_SMI_LOCAL) + retsmi = smi_check_forward_dr_smp(&recv->mad.smp); + if (retsmi == IB_SMI_LOCAL) goto local; - if (smi_handle_dr_smp_send(&recv->mad.smp, - port_priv->device->node_type, - port_priv->port_num) == IB_SMI_DISCARD) - goto out; + if (retsmi == IB_SMI_SEND) { /* don't forward */ + if (smi_handle_dr_smp_send(&recv->mad.smp, + port_priv->device->node_type, + port_num) == IB_SMI_DISCARD) + goto out; + + if (smi_check_local_smp(&recv->mad.smp, port_priv->device) == IB_SMI_DISCARD) + goto out; + } else if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) { + /* forward case for switches */ + memcpy(response, recv, sizeof(*response)); + response->header.recv_wc.wc = &response->header.wc; + response->header.recv_wc.recv_buf.mad = &response->mad.mad; + response->header.recv_wc.recv_buf.grh = &response->grh; + + if (!agent_send_response(&response->mad.mad, + &response->grh, wc, + port_priv->device, + 
smi_get_fwd_port(&recv->mad.smp), + qp_info->qp->qp_num)) + response = NULL; - if (smi_check_local_smp(&recv->mad.smp, port_priv->device) == IB_SMI_DISCARD) goto out; + } } local: @@ -1919,7 +1951,7 @@ local: agent_send_response(&response->mad.mad, &recv->grh, wc, port_priv->device, - port_priv->port_num, + port_num, qp_info->qp->qp_num); goto out; } diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c index 2bca753..8723675 100644 --- a/drivers/infiniband/core/smi.c +++ b/drivers/infiniband/core/smi.c @@ -192,7 +192,7 @@ enum smi_action smi_handle_dr_smp_recv(s } /* smp->hop_ptr updated when sending */ return (node_type == RDMA_NODE_IB_SWITCH ? - IB_SMI_HANDLE: IB_SMI_DISCARD); + IB_SMI_HANDLE : IB_SMI_DISCARD); } /* C14-13:4 -- hop_ptr = 0 -> give to SM */ @@ -211,7 +211,7 @@ enum smi_forward_action smi_check_forwar if (!ib_get_smp_direction(smp)) { /* C14-9:2 -- intermediate hop */ if (hop_ptr && hop_ptr < hop_cnt) - return IB_SMI_SEND; + return IB_SMI_FORWARD; /* C14-9:3 -- at the end of the DR segment of path */ if (hop_ptr == hop_cnt) @@ -224,7 +224,7 @@ enum smi_forward_action smi_check_forwar } else { /* C14-13:2 -- intermediate hop */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) - return IB_SMI_SEND; + return IB_SMI_FORWARD; /* C14-13:3 -- at the end of the DR segment of path */ if (hop_ptr == 1) @@ -233,3 +233,13 @@ enum smi_forward_action smi_check_forwar } return IB_SMI_LOCAL; } + +/* + * Return the forwarding port number from initial_path for outgoing SMP and + * from return_path for returning SMP + */ +int smi_get_fwd_port(struct ib_smp *smp) +{ + return (!ib_get_smp_direction(smp) ? 
smp->initial_path[smp->hop_ptr+1] : + smp->return_path[smp->hop_ptr-1]); +} diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h index 9a4b349..1cfc298 100644 --- a/drivers/infiniband/core/smi.h +++ b/drivers/infiniband/core/smi.h @@ -48,10 +48,12 @@ enum smi_action { enum smi_forward_action { IB_SMI_LOCAL, /* SMP should be completed up the stack */ IB_SMI_SEND, /* received DR SMP should be forwarded to the send queue */ + IB_SMI_FORWARD /* SMP should be forwarded (for switches only) */ }; enum smi_action smi_handle_dr_smp_recv(struct ib_smp *smp, u8 node_type, int port_num, int phys_port_cnt); +int smi_get_fwd_port(struct ib_smp *smp); extern enum smi_forward_action smi_check_forward_dr_smp(struct ib_smp *smp); extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp, u8 node_type, int port_num); From mst at dev.mellanox.co.il Sat May 5 13:12:00 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sat, 5 May 2007 23:12:00 +0300 Subject: [ofa-general] Re: [PATCH 0 of 3] comp_vector kernel support In-Reply-To: References: <20070503104806.GC10009@mellanox.co.il> <20070504062526.GB4829@mellanox.co.il> Message-ID: <20070505201200.GA20811@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH 0 of 3] comp_vector kernel support > > > How about just patches 1 and 2? > > They don't do anything to *kernel* ULPs by themselves, > > and give userspace ULPs opportunity to start using the feature. > > We'll learn from that, and enhance kernel ULPs by 2.6.23. > > I guess I could see doing the first patch (just support multiple > vectors in the kernel without changing any drivers). That way it > would be easy to experiment with patched drivers that enable multiple > vectors. OK, let's do it. > I think there's still some figuring out to do about how many EQs to > enable, etc, and I think it would be better to prevent drivers from > escaping into the wild before we have a better handle on the issues. 
How about applying mthca patch, but changing MTHCA_COMP_VECTORS to 1? That would make it easy to do experiments at the ULP level. -- MST From xma at us.ibm.com Sat May 5 23:33:49 2007 From: xma at us.ibm.com (Shirley Ma) Date: Sat, 5 May 2007 23:33:49 -0700 Subject: [ofa-general] IPoIB: Convert to NAPI In-Reply-To: Message-ID: Roland, This patch looks good. I am working on a patch to split CQ to sendCQ and recvCQ and each CQ will have a different interrupt assoicated with different CPU to reduce latency and improve uni & bi directional BW. I would like to compare the performance difference. I hope there is no conflict. Thanks Shirley Ma IBM Linux Technology Center -------------- next part -------------- An HTML attachment was scrubbed...
URL: From vlad at lists.openfabrics.org Sun May 6 02:37:35 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 6 May 2007 02:37:35 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070506-0200 daily build status Message-ID: <20070506093736.23E64E6083A@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.18 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 
with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From dotanb at dev.mellanox.co.il Sun May 6 05:20:06 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 06 May 2007 15:20:06 +0300 Subject: [ofa-general] does the mlx4 low level driver support working with multicast groups from user level? Message-ID: <463DC7F6.2070209@dev.mellanox.co.il> Hi Roland. When i executed ibv_devinfo and checked the multicast props of the device i got the following values: sw180:~ # ibv_devinfo -v | grep cast This will severely limit memory registrations. max_mcast_grp: 8192 max_mcast_qp_attach: 0 max_total_mcast_qp_attach: 0 It seems that the IB low level driver (drivers/infiniband/hw/mlx4) doesn't fill the attribute max_mcast_qp_attach. When i tried to use multicast groups from user level i got weird failures. Does the low level driver support working with multicast groups from user level? 
thanks Dotan From sashak at voltaire.com Sun May 6 05:43:33 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 6 May 2007 15:43:33 +0300 Subject: [ofa-general] Re: [PATCH] osm: source and destination strings overlap when using sprintf() In-Reply-To: <1178211572.32222.3479.camel@hal.voltaire.com> References: <462C7C21.7010004@dev.mellanox.co.il> <20070423101738.GG4579@mellanox.co.il> <462E80A3.5060503@dev.mellanox.co.il> <20070501005101.GA26019@sashak.voltaire.com> <4636E4A7.7060108@dev.mellanox.co.il> <1178211572.32222.3479.camel@hal.voltaire.com> Message-ID: <20070506124333.GB9692@sashak.voltaire.com> On 12:59 Thu 03 May , Hal Rosenstock wrote: > On Tue, 2007-05-01 at 02:56, Yevgeny Kliteynik wrote: > > Sasha Khapyorsky wrote: > > > On 01:11 Wed 25 Apr , Yevgeny Kliteynik wrote: > > >> Michael S. Tsirkin wrote: > > >>> Since you seem to do a strcat which does an anyway, how about, for example: > > >>> > > >>> - sprintf( buf_line1,"%s 0x%01x |", > > >>> - buf_line1, p_vla_tbl->vl_entry[i].vl); > > >>> + sprintf( buf_line1 + strlen(buf_line1)," 0x%01x |", > > >>> + p_vla_tbl->vl_entry[i].vl); > > >>> > > >>> and so on in all the other places? > > >> Agree. > > >> I'll send a new patch later. > > > > > > Or like this: > > > > > > + int n = 0; > > > ... > > > - sprintf( buf_line1,"%s 0x%01x |", > > > - buf_line1, p_vla_tbl->vl_entry[i].vl); > > > + n += sprintf( buf_line1 + n," 0x%01x |", > > > + p_vla_tbl->vl_entry[i].vl); > > > > > > , so strlen() rerunning in loop is not needed anymore. > > > > Right, it does look better. > > So is someone going to submit this patch ? Thanks. Will do. 
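For reference, the accumulated-offset idea discussed above can be sketched in isolation as follows. This is only an illustration and not the opensm code: the helper name format_entries, its parameters, and the switch from sprintf() to a bounds-checked snprintf() are assumptions made for this sketch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of opensm): format `count` values into
 * `buf`, tracking a running offset `n` taken from the snprintf() return
 * values instead of calling strlen(buf) on every iteration.
 * Returns the number of characters written, excluding the final NUL.
 */
static int format_entries(char *buf, size_t size,
			  const unsigned *vals, int count)
{
	int i, n = 0;

	if (size == 0)
		return 0;
	buf[0] = '\0';

	for (i = 0; i < count; i++) {
		int ret = snprintf(buf + n, size - (size_t)n,
				   " 0x%01X |", vals[i]);

		/* On error or truncation, drop the partial entry and stop. */
		if (ret < 0 || (size_t)ret >= size - (size_t)n) {
			buf[n] = '\0';
			break;
		}
		n += ret;
	}
	return n;
}
```

Source and destination never overlap here, and each entry is appended at buf + n without rescanning the buffer. The bounds check is an extra precaution in this sketch; the thread itself only discusses sprintf().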
Sasha From sashak at voltaire.com Sun May 6 06:03:52 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 6 May 2007 16:03:52 +0300 Subject: [ofa-general] [PATCH TRIVIAL] opensm/osm_helper: remove repeated strlen() calls In-Reply-To: <20070506124333.GB9692@sashak.voltaire.com> References: <462C7C21.7010004@dev.mellanox.co.il> <20070423101738.GG4579@mellanox.co.il> <462E80A3.5060503@dev.mellanox.co.il> <20070501005101.GA26019@sashak.voltaire.com> <4636E4A7.7060108@dev.mellanox.co.il> <1178211572.32222.3479.camel@hal.voltaire.com> <20070506124333.GB9692@sashak.voltaire.com> Message-ID: <20070506130352.GC9692@sashak.voltaire.com> Replace repeated strlen() calls used in sprintf() by actual string length accumulated from sprintf() return values. Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_helper.c | 56 +++++++++++++++++++++-------------------------- 1 files changed, 25 insertions(+), 31 deletions(-) diff --git a/osm/opensm/osm_helper.c b/osm/opensm/osm_helper.c index a1a2e93..b424e84 100644 --- a/osm/opensm/osm_helper.c +++ b/osm/opensm/osm_helper.c @@ -1145,22 +1145,22 @@ osm_dump_multipath_record( IN const ib_multipath_rec_t* const p_mpr, IN const osm_log_level_t log_level ) { - int i; char buf_line[1024]; ib_gid_t const *p_gid; + int i, n; if( osm_log_is_active( p_log, log_level ) ) { - memset(buf_line, 0, sizeof(buf_line)); + n = 0; p_gid = p_mpr->gids; if ( p_mpr->sgid_count ) { for (i = 0; i < p_mpr->sgid_count; i++) { - sprintf( buf_line + strlen(buf_line), "\t\t\t\tsgid%02d.................." - "0x%016" PRIx64 " : 0x%016" PRIx64 "\n", - i + 1, cl_ntoh64( p_gid->unicast.prefix ), - cl_ntoh64( p_gid->unicast.interface_id ) ); + n += sprintf( buf_line + n, "\t\t\t\tsgid%02d.................." 
+ "0x%016" PRIx64 " : 0x%016" PRIx64 "\n", + i + 1, cl_ntoh64( p_gid->unicast.prefix ), + cl_ntoh64( p_gid->unicast.interface_id ) ); p_gid++; } } @@ -1168,10 +1168,10 @@ osm_dump_multipath_record( { for (i = 0; i < p_mpr->dgid_count; i++) { - sprintf( buf_line + strlen(buf_line), "\t\t\t\tdgid%02d.................." - "0x%016" PRIx64 " : 0x%016" PRIx64 "\n", - i + 1, cl_ntoh64( p_gid->unicast.prefix ), - cl_ntoh64( p_gid->unicast.interface_id ) ); + n += sprintf( buf_line + n, "\t\t\t\tdgid%02d.................." + "0x%016" PRIx64 " : 0x%016" PRIx64 "\n", + i + 1, cl_ntoh64( p_gid->unicast.prefix ), + cl_ntoh64( p_gid->unicast.interface_id ) ); p_gid++; } } @@ -1650,15 +1650,14 @@ osm_dump_pkey_block( IN const ib_pkey_table_t* const p_pkey_tbl, IN const osm_log_level_t log_level ) { - int i; char buf_line[1024]; + int i, n; if( osm_log_is_active( p_log, log_level ) ) { - buf_line[0] = '\0'; - for (i = 0; i < 32; i++) - sprintf( buf_line + strlen(buf_line)," 0x%04x |", - cl_ntoh16(p_pkey_tbl->pkey_entry[i])); + for (i = 0, n = 0; i < 32; i++) + n += sprintf( buf_line + n," 0x%04x |", + cl_ntoh16(p_pkey_tbl->pkey_entry[i])); osm_log( p_log, log_level, "P_Key table dump:\n" @@ -1684,18 +1683,17 @@ osm_dump_slvl_map_table( IN const ib_slvl_table_t* const p_slvl_tbl, IN const osm_log_level_t log_level ) { - uint8_t i; char buf_line1[1024]; char buf_line2[1024]; + int n; + uint8_t i; if( osm_log_is_active( p_log, log_level ) ) { - buf_line1[0] = '\0'; - buf_line2[0] = '\0'; - for (i = 0; i < 16; i++) - sprintf( buf_line1 + strlen(buf_line1)," %-2u |", i); - for (i = 0; i < 16; i++) - sprintf( buf_line2 + strlen(buf_line2),"0x%01X |", + for (i = 0, n = 0; i < 16; i++) + n += sprintf( buf_line1 + n," %-2u |", i); + for (i = 0, n = 0; i < 16; i++) + n += sprintf( buf_line2 + n,"0x%01X |", ib_slvl_table_get(p_slvl_tbl, i)); osm_log( p_log, log_level, "SLtoVL dump:\n" @@ -1721,20 +1719,16 @@ osm_dump_vl_arb_table( IN const ib_vl_arb_table_t* const p_vla_tbl, IN const 
osm_log_level_t log_level ) { - int i; char buf_line1[1024]; char buf_line2[1024]; + int i, n; if( osm_log_is_active( p_log, log_level ) ) { - buf_line1[0] = '\0'; - buf_line2[0] = '\0'; - for (i = 0; i < 32; i++) - sprintf( buf_line1 + strlen(buf_line1)," 0x%01X |", - p_vla_tbl->vl_entry[i].vl); - for (i = 0; i < 32; i++) - sprintf( buf_line2 + strlen(buf_line2)," 0x%01X |", - p_vla_tbl->vl_entry[i].weight); + for (i = 0, n = 0; i < 32; i++) + n += sprintf( buf_line1 + n," 0x%01X |", p_vla_tbl->vl_entry[i].vl); + for (i = 0, n = 0; i < 32; i++) + n += sprintf( buf_line2 + n," 0x%01X |", p_vla_tbl->vl_entry[i].weight); osm_log( p_log, log_level, "VlArb dump:\n" "\t\t\tport_guid...........0x%016" PRIx64 "\n" -- 1.5.1.rc1.18.ga41b4 From dotanb at dev.mellanox.co.il Sun May 6 05:58:28 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 06 May 2007 15:58:28 +0300 Subject: [ofa-general] Queue Pair in error state In-Reply-To: <463B8C5E.3060005@linux.vnet.ibm.com> References: <463B8C5E.3060005@linux.vnet.ibm.com> Message-ID: <463DD0F4.2050709@dev.mellanox.co.il> Pradeep Satyanarayana wrote: > If packets are received by a queue pair that has gone to an error > state- which of the following is to expected : > > 1.It gets dropped by the hardware and the sender will be notified with > an error. > 2. The packet gets delivered to the receiver and the work completion > handler needs to deal with it. I believe that the first scenario will occur. The responder QP is in error state so all of the incoming packets will be dropped by the HCA. The requestor QP, which won't get any ack (or nack), will eventually get a retry exceeded and move to error state as well. 
Dotan From dotanb at dev.mellanox.co.il Sun May 6 06:46:42 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 06 May 2007 16:46:42 +0300 Subject: [ofa-general] [PATCH] libibverbs/ibv_devinfo : Print the number of max_vl_num as a number Message-ID: <1178459202.16752.1.camel@mtldesk014.lab.mtl.com> Print the number of max_vl_num as a number and not as enumerated value. Signed-off-by: Dotan Barak --- diff --git a/examples/devinfo.c b/examples/devinfo.c index 28cf8d1..40575c6 100644 --- a/examples/devinfo.c +++ b/examples/devinfo.c @@ -135,6 +135,18 @@ static const char *speed_str(uint8_t speed) } } +static const char *vl_str(uint8_t vl_num) +{ + switch (vl_num) { + case 1: return "1"; + case 2: return "2"; + case 3: return "4"; + case 4: return "8"; + case 5: return "15"; + default: return "invalid value"; + } +} + static int print_all_port_gids(struct ibv_context *ctx, uint8_t port_num, int tbl_len) { union ibv_gid gid; @@ -266,7 +278,7 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port) if (verbose) { printf("\t\t\tmax_msg_sz:\t\t0x%x\n", port_attr.max_msg_sz); printf("\t\t\tport_cap_flags:\t\t0x%08x\n", port_attr.port_cap_flags); - printf("\t\t\tmax_vl_num:\t\t%d\n", port_attr.max_vl_num); + printf("\t\t\tmax_vl_num:\t\t%s\n", vl_str(port_attr.max_vl_num)); printf("\t\t\tbad_pkey_cntr:\t\t0x%x\n", port_attr.bad_pkey_cntr); printf("\t\t\tqkey_viol_cntr:\t\t0x%x\n", port_attr.qkey_viol_cntr); printf("\t\t\tsm_sl:\t\t\t%d\n", port_attr.sm_sl); From rdreier at cisco.com Sun May 6 08:41:50 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 06 May 2007 08:41:50 -0700 Subject: [ofa-general] Re: does the mlx4 low level driver support working with multicast groups from user level? 
References: <463DC7F6.2070209@dev.mellanox.co.il> Message-ID: > When i executed ibv_devinfo and checked the multicast props of the > device i got the following values: > > sw180:~ # ibv_devinfo -v | grep cast > This will severely limit memory registrations. > max_mcast_grp: 8192 > max_mcast_qp_attach: 0 > max_total_mcast_qp_attach: 0 > > It seems that the IB low level driver (drivers/infiniband/hw/mlx4) > doesn't fill the attribute max_mcast_qp_attach. Yes, that code is missing. > When i tried to use multicast groups from user level i got weird failures. > > Does the low level driver support working with multicast groups from > user level? There's nothing special to do to handle userspace multicast groups. The multicast groups work well enough for IPoIB to work for me, but I haven't done any real testing. It should work but there's probably a silly bug somewhere. I just fixed one such bug but without knowing what your weird failures are, it's hard to say whether it would affect your tests. - R. From eli at mellanox.co.il Sun May 6 08:44:06 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Sun, 06 May 2007 18:44:06 +0300 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: free doorbel fix Message-ID: <1178466276.20653.127.camel@mtls03> When freeing an entry from order 1, the index field ends up shifted twice and the resulting index is wrong causing corruption of the data structure. 
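To make the double shift concrete, here is a small userspace sketch of just the index arithmetic, not the driver code itself. The function names and the buddy_free flag are made up for illustration, the bitmap handling is left out, and the "fixed" variant is only one possible shape of the corrected arithmetic.

```c
#include <assert.h>

/*
 * Illustrative only: models the index arithmetic in mlx4_ib_db_free().
 * "order" is 0 or 1; "buddy_free" says whether the order-0 buddy
 * (index ^ 1) is already free, so the pair merges into an order-1 entry.
 * Both parameter names are assumptions made for this sketch.
 */

/* Original logic: index is pre-shifted by order, then shifted by o again. */
int buggy_free_slot(int index, int order, int buddy_free)
{
	int o = order;
	int i = index >> order;

	assert(order == 0 || order == 1);
	if (order == 0 && buddy_free)
		++o;
	i >>= o;
	return i;
}

/* Corrected logic: start from the raw index and shift exactly once. */
int fixed_free_slot(int index, int order, int buddy_free)
{
	int o = order;
	int i = index;

	assert(order == 0 || order == 1);
	if (order == 0 && buddy_free)
		++o;
	i >>= o;
	return i;
}
```

With an order-1 entry at index 6, the original arithmetic yields slot 1 rather than slot 3, so the wrong bit would be set in the order-1 bitmap. Order-0 frees come out the same in both variants.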
Signed-off-by: Eli Cohen --- Index: connectx_kernel/drivers/infiniband/hw/mlx4/doorbell.c =================================================================== --- connectx_kernel.orig/drivers/infiniband/hw/mlx4/doorbell.c 2007-05-06 18:24:54.000000000 +0300 +++ connectx_kernel/drivers/infiniband/hw/mlx4/doorbell.c 2007-05-06 18:29:32.000000000 +0300 @@ -136,9 +136,9 @@ void mlx4_ib_db_free(struct mlx4_ib_dev if (db->order == 0 && test_bit(i ^ 1, db->u.pgdir->order0)) { clear_bit(i ^ 1, db->u.pgdir->order0); ++o; + i >>= o; } - i >>= o; set_bit(i, db->u.pgdir->bits[o]); if (bitmap_full(db->u.pgdir->order1, MLX4_IB_DB_PER_PAGE / 2)) { From eli at mellanox.co.il Sun May 6 08:53:19 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Sun, 06 May 2007 18:53:19 +0300 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: fix doorbell allocations Message-ID: <1178466829.5013.1.camel@mtls03> These allocations are done under a spinlock and should be made with GFP_ATOMIC flags to prevent a deadlock. Signed-off-by: Eli Cohen --- Index: connectx_kernel/drivers/infiniband/hw/mlx4/doorbell.c =================================================================== --- connectx_kernel.orig/drivers/infiniband/hw/mlx4/doorbell.c 2007-05-06 10:38:26.000000000 +0300 +++ connectx_kernel/drivers/infiniband/hw/mlx4/doorbell.c 2007-05-06 10:43:08.000000000 +0300 @@ -47,7 +47,7 @@ static struct mlx4_ib_db_pgdir *mlx4_ib_ { struct mlx4_ib_db_pgdir *pgdir; - pgdir = kzalloc(sizeof *pgdir, GFP_KERNEL); + pgdir = kzalloc(sizeof *pgdir, GFP_ATOMIC); if (!pgdir) return NULL; @@ -56,7 +56,7 @@ static struct mlx4_ib_db_pgdir *mlx4_ib_ pgdir->bits[1] = pgdir->order1; pgdir->db_page = dma_alloc_coherent(dev->ib_dev.dma_device, PAGE_SIZE, &pgdir->db_dma, - GFP_KERNEL); + GFP_ATOMIC); if (!pgdir->db_page) { kfree(pgdir); return NULL; From rdreier at cisco.com Sun May 6 09:21:38 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 06 May 2007 09:21:38 -0700 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: free 
doorbel fix In-Reply-To: <1178466276.20653.127.camel@mtls03> (Eli Cohen's message of "Sun, 06 May 2007 18:44:06 +0300") References: <1178466276.20653.127.camel@mtls03> Message-ID: Thanks, good catch... I fixed it this way: commit e5b1dd9313497cc22ae171ab6cccb7eb044aba53 Author: Eli Cohen Date: Sun May 6 09:20:13 2007 -0700 When freeing an entry from order 1, the index field ends up shifted twice and the resulting index is wrong causing corruption of the data structure. Signed-off-by: Eli Cohen Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c index 4b564d5..e55c286 100644 --- a/drivers/infiniband/hw/mlx4/doorbell.c +++ b/drivers/infiniband/hw/mlx4/doorbell.c @@ -132,7 +132,6 @@ void mlx4_ib_db_free(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db) spin_lock(&dev->pgdir_lock); o = db->order; - i = db->index >> db->order; if (db->order == 0 && test_bit(i ^ 1, db->u.pgdir->order0)) { clear_bit(i ^ 1, db->u.pgdir->order0); From rdreier at cisco.com Sun May 6 09:27:58 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 06 May 2007 09:27:58 -0700 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: free doorbel fix In-Reply-To: (Roland Dreier's message of "Sun, 06 May 2007 09:21:38 -0700") References: <1178466276.20653.127.camel@mtls03> Message-ID: err, like this really: commit 19219048ce32931392ca703f4cd9d54a8926215b Author: Eli Cohen Date: Sun May 6 09:27:29 2007 -0700 IB/mlx4: Fix free of doorbell record buddies When freeing an entry from order 1, the index field ends up shifted twice and the resulting index is wrong causing corruption of the data structure. 
Signed-off-by: Eli Cohen Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c index 4b564d5..0515052 100644 --- a/drivers/infiniband/hw/mlx4/doorbell.c +++ b/drivers/infiniband/hw/mlx4/doorbell.c @@ -131,8 +131,7 @@ void mlx4_ib_db_free(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db) spin_lock(&dev->pgdir_lock); - o = db->order; - i = db->index >> db->order; + i = db->index; if (db->order == 0 && test_bit(i ^ 1, db->u.pgdir->order0)) { clear_bit(i ^ 1, db->u.pgdir->order0); From rdreier at cisco.com Sun May 6 09:28:07 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 06 May 2007 09:28:07 -0700 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: fix doorbell allocations In-Reply-To: <1178466829.5013.1.camel@mtls03> (Eli Cohen's message of "Sun, 06 May 2007 18:53:19 +0300") References: <1178466829.5013.1.camel@mtls03> Message-ID: another good catch. let's make the lock a mutex instead, rather than relying on atomic allocations: commit 7a62f478170f69225fa8f35d0502dbaf26652615 Author: Roland Dreier Date: Sun May 6 09:26:16 2007 -0700 IB/mlx4: Convert pgdir_lock to pgdir_mutex Doorbell record pages are allocated inside the pgdir lock, so change the lock to a mutex so we can use GFP_KERNEL allocations. Pointed out by Eli Cohen . 
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c index e55c286..2e36cee 100644 --- a/drivers/infiniband/hw/mlx4/doorbell.c +++ b/drivers/infiniband/hw/mlx4/doorbell.c @@ -101,7 +101,7 @@ int mlx4_ib_db_alloc(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db, int order) struct mlx4_ib_db_pgdir *pgdir; int ret = 0; - spin_lock(&dev->pgdir_lock); + mutex_lock(&dev->pgdir_mutex); list_for_each_entry(pgdir, &dev->pgdir_list, list) if (!mlx4_ib_alloc_db_from_pgdir(pgdir, db, order)) @@ -119,7 +119,7 @@ int mlx4_ib_db_alloc(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db, int order) WARN_ON(mlx4_ib_alloc_db_from_pgdir(pgdir, db, order)); out: - spin_unlock(&dev->pgdir_lock); + mutex_unlock(&dev->pgdir_mutex); return ret; } @@ -129,7 +129,7 @@ void mlx4_ib_db_free(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db) int o; int i; - spin_lock(&dev->pgdir_lock); + mutex_lock(&dev->pgdir_mutex); o = db->order; @@ -148,7 +148,7 @@ void mlx4_ib_db_free(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db) kfree(db->u.pgdir); } - spin_unlock(&dev->pgdir_lock); + mutex_unlock(&dev->pgdir_mutex); } struct mlx4_ib_user_db_page { diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index b3af928..5ef6d19 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -490,7 +490,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) goto err_uar; INIT_LIST_HEAD(&ibdev->pgdir_list); - spin_lock_init(&ibdev->pgdir_lock); + mutex_init(&ibdev->pgdir_mutex); ibdev->dev = dev; diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index bb866b0..62be599 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -152,7 +152,7 @@ struct mlx4_ib_dev { void __iomem *uar_map; struct list_head pgdir_list; - spinlock_t pgdir_lock; + struct mutex pgdir_mutex; struct mlx4_uar priv_uar; u32 priv_pdn; From rdreier at 
cisco.com Sun May 6 09:30:09 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 06 May 2007 09:30:09 -0700 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: free doorbel fix In-Reply-To: (Roland Dreier's message of "Sun, 06 May 2007 09:27:58 -0700") References: <1178466276.20653.127.camel@mtls03> Message-ID: err, one more try: commit 49b070c5a9473fabb379c82761ecf8c573a9b548 Author: Eli Cohen Date: Sun May 6 09:29:28 2007 -0700 IB/mlx4: Fix free of doorbell record buddies When freeing an entry from order 1, the index field ends up shifted twice and the resulting index is wrong causing corruption of the data structure. Signed-off-by: Eli Cohen Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c index 4b564d5..acb4ce2 100644 --- a/drivers/infiniband/hw/mlx4/doorbell.c +++ b/drivers/infiniband/hw/mlx4/doorbell.c @@ -132,7 +132,7 @@ void mlx4_ib_db_free(struct mlx4_ib_dev *dev, struct mlx4_ib_db *db) spin_lock(&dev->pgdir_lock); o = db->order; - i = db->index >> db->order; + i = db->index; if (db->order == 0 && test_bit(i ^ 1, db->u.pgdir->order0)) { clear_bit(i ^ 1, db->u.pgdir->order0); From sashak at voltaire.com Sun May 6 10:41:38 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 6 May 2007 20:41:38 +0300 Subject: [ofa-general] [PATCH TRIVIAL] opensm: remove unneeded run-time check Message-ID: <20070506174138.GI9692@sashak.voltaire.com> remove unneeded run-time NULL pointer check (followed free() is not under this check anyway). 
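As a rough illustration of the pattern (an unconditional cleanup loop, with the precondition asserted once in the delete wrapper), here is a standalone sketch. The node_t type and the node_* helpers are hypothetical stand-ins, not OpenSM's API, and plain assert() stands in for CL_ASSERT.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for osm_node_t: owns a table of optional ports. */
typedef struct node {
	size_t port_tbl_size;
	int *port_tbl[4];	/* entries may be NULL */
} node_t;

/* Allocate a node with only port 0 populated; the rest stay NULL. */
node_t *node_new(void)
{
	node_t *p_node = calloc(1, sizeof(*p_node));

	if (!p_node)
		return NULL;
	p_node->port_tbl_size = 4;
	p_node->port_tbl[0] = malloc(sizeof(int));
	return p_node;
}

/* No NULL check on p_node here: callers guarantee a valid node. */
static void node_destroy(node_t *p_node)
{
	size_t i;

	for (i = 0; i < p_node->port_tbl_size; i++) {
		free(p_node->port_tbl[i]);	/* free(NULL) is a no-op */
		p_node->port_tbl[i] = NULL;
	}
}

/* The single place where the precondition is checked. */
void node_delete(node_t **pp_node)
{
	assert(pp_node && *pp_node);
	node_destroy(*pp_node);
	free(*pp_node);
	*pp_node = NULL;
}
```

The point of the sketch is only that a NULL check inside node_destroy() would buy nothing, since node_delete() dereferences and frees the pointer regardless.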
Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_node.c | 20 +++++++++----------- 1 files changed, 9 insertions(+), 11 deletions(-) diff --git a/osm/opensm/osm_node.c b/osm/opensm/osm_node.c index 3f96c16..e725fd5 100644 --- a/osm/opensm/osm_node.c +++ b/osm/opensm/osm_node.c @@ -147,20 +147,17 @@ void osm_node_destroy( IN osm_node_t *p_node ) { + osm_physp_t *p_physp; uint16_t i; - /* Cleanup all PhysPorts */ - if( p_node != NULL ) + /* + Cleanup all physports + */ + for( i = 0; i < p_node->physp_tbl_size; i++ ) { - /* - Cleanup all physports - */ - for( i = 0; i < p_node->physp_tbl_size; i++ ) - { - osm_physp_t *p_physp = osm_node_get_physp_ptr( p_node, i ); - if (p_physp) - osm_physp_destroy( p_physp ); - } + p_physp = osm_node_get_physp_ptr( p_node, i ); + if (p_physp) + osm_physp_destroy( p_physp ); } } @@ -170,6 +167,7 @@ void osm_node_delete( IN OUT osm_node_t** const p_node ) { + CL_ASSERT(p_node && *p_node); osm_node_destroy( *p_node ); free( *p_node ); *p_node = NULL; -- 1.5.2.rc2.20.gac2a From sashak at voltaire.com Sun May 6 10:44:31 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 6 May 2007 20:44:31 +0300 Subject: [ofa-general] [PATCH TRIVIAL] opensm: make osm_node_destroy() static In-Reply-To: <20070506174138.GI9692@sashak.voltaire.com> References: <20070506174138.GI9692@sashak.voltaire.com> Message-ID: <20070506174431.GJ9692@sashak.voltaire.com> This makes locally used osm_node_destroy() function static Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_node.h | 28 ---------------------------- osm/opensm/osm_node.c | 2 +- 2 files changed, 1 insertions(+), 29 deletions(-) diff --git a/osm/include/opensm/osm_node.h b/osm/include/opensm/osm_node.h index 035ecef..a841de7 100644 --- a/osm/include/opensm/osm_node.h +++ b/osm/include/opensm/osm_node.h @@ -149,34 +149,6 @@ typedef struct _osm_node * Node object *********/ -/****f* OpenSM: Node/osm_node_destroy -* NAME -* osm_node_destroy -* -* DESCRIPTION -* The osm_node_destroy 
function destroys a node, releasing -* all resources. -* -* SYNOPSIS -*/void -osm_node_destroy( - IN osm_node_t *p_node ); -/* -* PARAMETERS -* p_node -* [in] Pointer a Node object to destroy. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Performs any necessary cleanup of the specified Node object. -* This function should only be called after a call to osm_node_new. -* -* SEE ALSO -* Node object, osm_node_new -*********/ - /****f* OpenSM: Node/osm_node_delete * NAME * osm_node_delete diff --git a/osm/opensm/osm_node.c b/osm/opensm/osm_node.c index e725fd5..80a7465 100644 --- a/osm/opensm/osm_node.c +++ b/osm/opensm/osm_node.c @@ -143,7 +143,7 @@ osm_node_new( /********************************************************************** **********************************************************************/ -void +static void osm_node_destroy( IN osm_node_t *p_node ) { -- 1.5.2.rc2.20.gac2a From sashak at voltaire.com Sun May 6 11:19:37 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 6 May 2007 21:19:37 +0300 Subject: [ofa-general] [PATCH TRIVIAL] opensm: trivial osm_port cleanups Message-ID: <20070506181937.GK9692@sashak.voltaire.com> This removes non-meanful osm_port_construct() and osm_port_destroy() functions and makes static locally used osm_port_init(). Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_port.h | 111 +--------------------------------------- osm/opensm/osm_port.c | 14 +++-- 2 files changed, 11 insertions(+), 114 deletions(-) diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h index 347ab3b..775f228 100644 --- a/osm/include/opensm/osm_port.h +++ b/osm/include/opensm/osm_port.h @@ -289,7 +289,7 @@ osm_physp_destroy( * osm_physp_init. 
* * SEE ALSO -* Port, osm_port_init, osm_port_destroy +* Port *********/ /****f* OpenSM: Physical Port/osm_physp_is_valid @@ -1313,70 +1313,6 @@ typedef struct _osm_port * Port, Physical Port, Physical Port Table *********/ -/****f* OpenSM: Port/osm_port_construct -* NAME -* osm_port_construct -* -* DESCRIPTION -* This function constructs a Port object. -* -* SYNOPSIS -*/ -static inline void -osm_port_construct( - IN osm_port_t* const p_port ) -{ - memset( p_port, 0, sizeof(*p_port) ); - cl_qlist_init( &p_port->mcm_list ); -} -/* -* PARAMETERS -* p_port -* [in] Pointer to a Port object to construct. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Allows calling osm_port_init, and osm_port_destroy. -* -* Calling osm_port_construct is a prerequisite to calling any other -* method except osm_port_init. -* -* SEE ALSO -* Port, osm_port_init, osm_port_destroy -*********/ - -/****f* OpenSM: Port/osm_port_destroy -* NAME -* osm_port_destroy -* -* DESCRIPTION -* This function destroys a Port object. -* -* SYNOPSIS -*/ -void -osm_port_destroy( - IN osm_port_t* const p_port ); -/* -* PARAMETERS -* p_port -* [in] Pointer to a Port object to construct. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Performs any necessary cleanup of the specified Port object. -* Further operations should not be attempted on the destroyed object. -* This function should only be called after a call to osm_port_construct -* or osm_port_init. 
-* -* SEE ALSO -* Port, osm_port_init, osm_port_destroy -*********/ - /****f* OpenSM: Port/osm_port_delete * NAME * osm_port_delete @@ -1386,14 +1322,9 @@ osm_port_destroy( * * SYNOPSIS */ -inline static void +void osm_port_delete( - IN OUT osm_port_t** const pp_port ) -{ - osm_port_destroy( *pp_port ); - free( *pp_port ); - *pp_port = NULL; -} + IN OUT osm_port_t** const pp_port ); /* * PARAMETERS * pp_port @@ -1407,42 +1338,6 @@ osm_port_delete( * Performs any necessary cleanup of the specified Port object. * * SEE ALSO -* Port, osm_port_init, osm_port_destroy -*********/ - -/****f* OpenSM: Port/osm_port_init -* NAME -* osm_port_init -* -* DESCRIPTION -* This function initializes a Port object. -* -* SYNOPSIS -*/ -void -osm_port_init( - IN osm_port_t* const p_port, - IN const ib_node_info_t* p_ni, - IN const struct _osm_node* const p_parent_node ); -/* -* PARAMETERS -* p_port -* [in] Pointer to a Port object to initialize. -* -* p_ni -* [in] Pointer to the NodeInfo attribute relavent for this port. -* -* p_parent_node -* [in] Pointer to the initialized parent osm_node_t object -* that owns this port. -* -* RETURN VALUE -* None. -* -* NOTES -* Allows calling other port methods. 
-* -* SEE ALSO * Port *********/ diff --git a/osm/opensm/osm_port.c b/osm/opensm/osm_port.c index ec2998c..260e28a 100644 --- a/osm/opensm/osm_port.c +++ b/osm/opensm/osm_port.c @@ -154,16 +154,18 @@ osm_physp_init( /********************************************************************** **********************************************************************/ void -osm_port_destroy( - IN osm_port_t* const p_port ) +osm_port_delete( + IN OUT osm_port_t** const pp_port ) { /* cleanup all mcm recs attached */ - osm_port_remove_all_mgrp( p_port ); + osm_port_remove_all_mgrp( *pp_port ); + free( *pp_port ); + *pp_port = NULL; } /********************************************************************** **********************************************************************/ -void +static void osm_port_init( IN osm_port_t* const p_port, IN const ib_node_info_t* p_ni, @@ -178,8 +180,8 @@ osm_port_init( CL_ASSERT( p_ni ); CL_ASSERT( p_parent_node ); - osm_port_construct( p_port ); - + memset( p_port, 0, sizeof(*p_port) ); + cl_qlist_init( &p_port->mcm_list ); p_port->p_node = (struct _osm_node *)p_parent_node; port_guid = p_ni->port_guid; p_port->guid = port_guid; -- 1.5.2.rc2.20.gac2a From sashak at voltaire.com Sun May 6 13:00:13 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 6 May 2007 23:00:13 +0300 Subject: [ofa-general] [PATCH] opensm: consolidate CA and router PortInfo receiving code Message-ID: <20070506200013.GL9692@sashak.voltaire.com> Consolidate CA and router PortInfo receiving processing code. 
Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_port_info_rcv.c | 36 +----------------------------------- 1 files changed, 1 insertions(+), 35 deletions(-) diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c index e12daa6..f23410b 100644 --- a/osm/opensm/osm_port_info_rcv.c +++ b/osm/opensm/osm_port_info_rcv.c @@ -406,37 +406,6 @@ __osm_pi_rcv_process_ca_port( OSM_LOG_EXIT( p_rcv->p_log ); } -/********************************************************************** - **********************************************************************/ -static void -__osm_pi_rcv_process_router_port( - IN const osm_pi_rcv_t* const p_rcv, - IN osm_node_t* const p_node, - IN osm_physp_t* const p_physp, - IN const ib_port_info_t* const p_pi ) -{ - ib_net16_t orig_lid; - - OSM_LOG_ENTER( p_rcv->p_log, __osm_pi_rcv_process_router_port ); - - UNUSED_PARAM( p_node ); - - /* - Update the PortInfo attribute. - */ - osm_physp_set_port_info( p_physp, p_pi ); - - if ( (orig_lid = osm_physp_trim_base_lid_to_valid_range( p_physp ) ) ) - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_pi_rcv_process_router_port: ERR 0F09: " - "Invalid base LID 0x%x corrected\n", - cl_ntoh16 ( orig_lid) ); - - __osm_pi_rcv_process_endport(p_rcv, p_physp, p_pi); - - OSM_LOG_EXIT( p_rcv->p_log ); -} - #define IBM_VENDOR_ID (0x5076) /********************************************************************** **********************************************************************/ @@ -851,13 +820,10 @@ osm_pi_rcv_process( switch( osm_node_get_type( p_node ) ) { case IB_NODE_TYPE_CA: + case IB_NODE_TYPE_ROUTER: __osm_pi_rcv_process_ca_port( p_rcv, p_node, p_physp, p_pi ); break; - case IB_NODE_TYPE_ROUTER: - __osm_pi_rcv_process_router_port( p_rcv, - p_node, p_physp, p_pi ); - break; case IB_NODE_TYPE_SWITCH: __osm_pi_rcv_process_switch_port( p_rcv, p_node, p_physp, p_pi ); -- 1.5.2.rc2.20.gac2a From mst at dev.mellanox.co.il Sun May 6 13:07:52 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 6 May 2007 23:07:52 +0300 Subject: [ofa-general] Fwd: Re: using stgit/guilt for public branches Message-ID: <20070506200752.GB30339@mellanox.co.il> FYI, some more discussion forwarded from the git mailing list. Executive summary: it's possible to make a repository managed by stgit public, but the tools that would let multiple developers clone and work on such a repository do not seem to be there yet. ----- Forwarded message from Yann Dirson ----- Subject: Re: using stgit/guilt for public branches Date: Sat, 5 May 2007 00:37:41 +0300 In-Reply-To: <20070504052042.GA4829 at mellanox.co.il> References: <20070425122048.GD1624 at mellanox.co.il> <20070425191838.GA6267 at filer.fsl.cs.sunysb.edu> <200704252337.05851.robin.rosenberg.lists at dewire.com> <20070503205836.GA19253 at nan92-1-81-57-214-146.fbx.proxad.net> <20070504052042.GA4829 at mellanox.co.il> From: Yann Dirson On Fri, May 04, 2007 at 08:20:59AM +0300, Michael S. Tsirkin wrote: > > Quoting Yann Dirson : > > Subject: Re: using stgit/guilt for public branches > > > > On Wed, Apr 25, 2007 at 11:37:05PM +0200, Robin Rosenberg wrote: > > > On Wednesday 25 April 2007, Josef Sipek wrote: > > > > On Wed, Apr 25, 2007 at 03:20:49PM +0300, Michael S. Tsirkin wrote: > > > [...] > > > > > I am concerned that publishing a git branch managed by stg/guilt > > > > > would present problems: it seems that every time patches are re-ordered, > > > > > a patch is re-written or removed, or we update from upstream, > > > > > everyone who pulls the tree branch will have a hard-to-resolve conflict. > > > > > > > > > > Is that really a problem? If so, would it be possible to work around this > > > > > somehow? > > > > > > > > I thought about this problem a while back when I was trying to decide how to > > > > manage the Unionfs git repository.
I came to the conclusion, that there was > > > > no clean way of doing this (at least not using guilt - I can't really speak > > > > for stgit, as I don't know how it does things exactly). > > > > > > StGit has the same problem. Publishing such a branch is only for viewing if > > > you want to publish the tip, like the pu branch in the Git repo. You shouldn't > > > merge from pu either. > > > > You are right, in that what can be done with such branches is limited. > > BUT you can safely "stg branch --create" off any remote stgit stack. > > Then you can "stg rebase origin/master" to port your stack to the new > > tip of the remote stack. > > OK. > What happens if someone clones the repo, then reorders patches, > drops some of them, adds new patches in the middle of the stack? You can't do that out of the box, since you don't get a real stack when you clone it, you only get the refs. You would need to uncommit patches manually, and there will not be much support to help you. Now you're forcing me to unveil my secret plans :) 1. it would be quite easy to reconstruct a full-fledged stack from those refs, and since you get the remote patchlogs, we could also fetch any former version of the patch that would be still available (more work for "stg clone") 2. if noone beats me to doing that, I'll enhance patchlogs some day to record branching in patchlogs (eg. from "stg branch --clone" or "stg pick"), as well as merges (eg. from "stg sync") Note that proper merging from patchlog history will require working at the meta-diff (ie. "diffs of diffs of trees") level, just like proper merging at tree-level requires working at the diff level. I don't think we have the tools for this yet, so we still have a long way to go :) Best regards, -- Yann. 
- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo at vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ----- End forwarded message ----- -- MST From rdreier at cisco.com Sun May 6 14:35:22 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 06 May 2007 14:35:22 -0700 Subject: [ofa-general] Re: [PATCH 1 of 3] IB/verbs: add cq comp_vector support in core In-Reply-To: <20070503104847.GD10009@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 May 2007 13:48:47 +0300") References: <20070503104847.GD10009@mellanox.co.il> Message-ID: OK, I added at least this to my tree for now. I haven't had a chance to think about the mthca changes one way or another yet... (I changed your patch to move the assignment of num_comp_vectors into the individual drivers) commit c15f960a112f8f0158e24b801bdce40da52ce485 Author: Michael S. Tsirkin Date: Thu May 3 13:48:47 2007 +0300 IB: Add CQ comp_vector support Add a num_comp_vectors member to struct ib_device and extend ib_create_cq() to pass in a comp_vector parameter -- this parallels the userspace libibverbs API. Update all hardware drivers to set num_comp_vectors to 1 and have all ULPs pass 0 for the comp_vector value. Pass the value of num_comp_vectors to userspace rather than hard-coding a value of 1. We want multiple CQ event vector support (via MSI-X or similar for adapters that can generate multiple interrupts), but it's not clear how many vectors we want, or how we want to deal with policy issues such as how to decide which vector to use or how to set up interrupt affinity. This patch is useful for experimenting, since no core changes will be necessary when updating a driver to support multiple vectors, and we know that we want to make at least these changes anyway. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6edfecf..85ccf13 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2771,7 +2771,7 @@ static int ib_mad_port_open(struct ib_device *device, cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; port_priv->cq = ib_create_cq(port_priv->device, ib_mad_thread_completion_handler, - NULL, port_priv, cq_size); + NULL, port_priv, cq_size, 0); if (IS_ERR(port_priv->cq)) { printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n"); ret = PTR_ERR(port_priv->cq); diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 4fd75af..bab6676 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -802,6 +802,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file, INIT_LIST_HEAD(&obj->async_list); cq = file->device->ib_dev->create_cq(file->device->ib_dev, cmd.cqe, + cmd.comp_vector, file->ucontext, &udata); if (IS_ERR(cq)) { ret = PTR_ERR(cq); diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index f8bc822..d44e547 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -752,7 +752,7 @@ static void ib_uverbs_add_one(struct ib_device *device) spin_unlock(&map_lock); uverbs_dev->ib_dev = device; - uverbs_dev->num_comp_vectors = 1; + uverbs_dev->num_comp_vectors = device->num_comp_vectors; uverbs_dev->dev = cdev_alloc(); if (!uverbs_dev->dev) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index ccdf93d..86ed8af 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -609,11 +609,11 @@ EXPORT_SYMBOL(ib_destroy_qp); struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, void (*event_handler)(struct ib_event *, void *), - void *cq_context, int cqe) + void *cq_context, int cqe, int comp_vector) 
{ struct ib_cq *cq; - cq = device->create_cq(device, cqe, NULL, NULL); + cq = device->create_cq(device, cqe, comp_vector, NULL, NULL); if (!IS_ERR(cq)) { cq->device = device; diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index 607c09b..1091662 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -290,7 +290,7 @@ static int c2_destroy_qp(struct ib_qp *ib_qp) return 0; } -static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, +static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *context, struct ib_udata *udata) { @@ -795,6 +795,7 @@ int c2_register_device(struct c2_dev *dev) memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); memcpy(&dev->ibdev.node_guid, dev->pseudo_netdev->dev_addr, 6); dev->ibdev.phys_port_cnt = 1; + dev->ibdev.num_comp_vectors = 1; dev->ibdev.dma_device = &dev->pcidev->dev; dev->ibdev.query_device = c2_query_device; dev->ibdev.query_port = c2_query_port; diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 93038c0..78a495f 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -139,7 +139,7 @@ static int iwch_destroy_cq(struct ib_cq *ib_cq) return 0; } -static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, +static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *ib_context, struct ib_udata *udata) { @@ -1110,6 +1110,7 @@ int iwch_register_device(struct iwch_dev *dev) dev->ibdev.node_type = RDMA_NODE_RNIC; memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC)); dev->ibdev.phys_port_cnt = dev->rdev.port_info.nports; + dev->ibdev.num_comp_vectors = 1; dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev); dev->ibdev.query_device = iwch_query_device; 
dev->ibdev.query_port = iwch_query_port; diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index e2cdc1a..67f0670 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -113,7 +113,7 @@ struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int real_qp_num) return ret; } -struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, +struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata) { diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 95fd59f..aff96ac 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -123,7 +123,7 @@ int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq); void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq); -struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, +struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata); diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 3b23d67..77bb36b 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -313,6 +313,7 @@ int ehca_init_device(struct ehca_shca *shca) shca->ib_device.node_type = RDMA_NODE_IB_CA; shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.num_comp_vectors = 1; shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; shca->ib_device.query_device = ehca_query_device; shca->ib_device.query_port = ehca_query_port; @@ -375,7 +376,7 @@ static int ehca_create_aqp1(struct ehca_shca *shca, u32 port) return -EPERM; } - ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10); + ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10, 0); if (IS_ERR(ibcq)) { ehca_err(&shca->ib_device, "Cannot create AQP1 
CQ."); return PTR_ERR(ibcq); diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index 4715f89..00d3eb9 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -204,7 +204,7 @@ static void send_complete(unsigned long data) * * Called by ib_create_cq() in the generic verbs code. */ -struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, +struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata) { diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index b676ea8..12933e7 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -1561,6 +1561,7 @@ int ipath_register_ib_device(struct ipath_devdata *dd) (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); dev->node_type = RDMA_NODE_IB_CA; dev->phys_port_cnt = 1; + dev->num_comp_vectors = 1; dev->dma_device = &dd->pcidev->dev; dev->query_device = ipath_query_device; dev->modify_device = ipath_modify_device; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index ac66c00..2d734fb 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -735,7 +735,7 @@ int ipath_destroy_srq(struct ib_srq *ibsrq); int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, +struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata); diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 47e6fd4..1c05486 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -663,6 +663,7 @@ static int mthca_destroy_qp(struct 
ib_qp *qp) } static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries, + int comp_vector, struct ib_ucontext *context, struct ib_udata *udata) { @@ -1292,6 +1293,7 @@ int mthca_register_device(struct mthca_dev *dev) (1ull << IB_USER_VERBS_CMD_DETACH_MCAST); dev->ib_dev.node_type = RDMA_NODE_IB_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.num_comp_vectors = 1; dev->ib_dev.dma_device = &dev->pdev->dev; dev->ib_dev.query_device = mthca_query_device; dev->ib_dev.query_port = mthca_query_port; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 1e27930..b8089a0 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -793,7 +793,7 @@ static int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn, } p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p, - ipoib_sendq_size + 1); + ipoib_sendq_size + 1, 0); if (IS_ERR(p->cq)) { ret = PTR_ERR(p->cq); ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 7f3ec20..5c3c6a4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -187,7 +187,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) if (!ret) size += ipoib_recvq_size; - priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size); + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); goto out_free_mr; diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c index 1fc9674..89d6008 100644 --- a/drivers/infiniband/ulp/iser/iser_verbs.c +++ b/drivers/infiniband/ulp/iser/iser_verbs.c @@ -76,7 +76,7 @@ static int iser_create_device_ib_res(struct iser_device *device) iser_cq_callback, 
iser_cq_event_callback, (void *)device, - ISER_MAX_CQ_LEN); + ISER_MAX_CQ_LEN, 0); if (IS_ERR(device->cq)) goto cq_err; diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 3468ae1..39bf057 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -197,7 +197,7 @@ static int srp_create_target_ib(struct srp_target_port *target) return -ENOMEM; target->cq = ib_create_cq(target->srp_host->dev->dev, srp_completion, - NULL, target, SRP_CQ_SIZE); + NULL, target, SRP_CQ_SIZE, 0); if (IS_ERR(target->cq)) { ret = PTR_ERR(target->cq); goto out; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 765589f..17cc309 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -912,6 +912,8 @@ struct ib_device { u32 flags; + int num_comp_vectors; + struct iw_cm_verbs *iwcm; int (*query_device)(struct ib_device *device, @@ -978,6 +980,7 @@ struct ib_device { struct ib_recv_wr *recv_wr, struct ib_recv_wr **bad_recv_wr); struct ib_cq * (*create_cq)(struct ib_device *device, int cqe, + int comp_vector, struct ib_ucontext *context, struct ib_udata *udata); int (*destroy_cq)(struct ib_cq *cq); @@ -1358,13 +1361,15 @@ static inline int ib_post_recv(struct ib_qp *qp, * @cq_context: Context associated with the CQ returned to the user via * the associated completion and event handlers. * @cqe: The minimum size of the CQ. + * @comp_vector - Completion vector used to signal completion events. + * Must be >= 0 and < context->num_comp_vectors. * * Users can examine the cq structure to determine the actual CQ size. */ struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, void (*event_handler)(struct ib_event *, void *), - void *cq_context, int cqe); + void *cq_context, int cqe, int comp_vector); /** * ib_resize_cq - Modifies the capacity of the CQ. 
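A note on the patch above, as a sketch rather than part of the posted change: the new comp_vector argument to ib_create_cq() is documented as needing to be >= 0 and below num_comp_vectors. The kernel-doc text says context->num_comp_vectors, but the field the patch adds lives in struct ib_device. Every driver touched here sets num_comp_vectors to 1 and every in-tree caller passes 0. The userspace C below only illustrates that range check and one possible way a caller could spread CQs across vectors once a driver reports more than one. The names fake_ib_device, comp_vector_valid and pick_comp_vector are hypothetical and do not exist in the kernel.

```c
/* Hypothetical stand-in for the single struct ib_device field the patch adds. */
struct fake_ib_device {
	int num_comp_vectors;
};

/* Mirrors the documented constraint: 0 <= comp_vector < num_comp_vectors. */
static int comp_vector_valid(const struct fake_ib_device *dev, int comp_vector)
{
	return comp_vector >= 0 && comp_vector < dev->num_comp_vectors;
}

/*
 * Hypothetical policy (not from the patch): assign CQs to completion
 * vectors round-robin. Returns -1 if the device reports no vectors.
 */
static int pick_comp_vector(const struct fake_ib_device *dev, unsigned int cq_index)
{
	if (dev->num_comp_vectors <= 0)
		return -1;
	return (int)(cq_index % (unsigned int)dev->num_comp_vectors);
}
```

With num_comp_vectors == 1, as in all drivers in this patch, pick_comp_vector() always yields 0, which matches the literal 0 that the ULP callers pass.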
From dannyz at mellanox.co.il Sun May 6 15:28:43 2007
From: dannyz at mellanox.co.il (Danny Zarko)
Date: Mon, 7 May 2007 01:28:43 +0300
Subject: [ofa-general] RE: OFED 1.2 RC3 is delayed for Monday next week (May 7)
In-Reply-To: <463A4F26.3010804@mellanox.co.il>
Message-ID: <6C2C79E72C305246B504CBA17B5500C90172404F@mtlexch01.mtl.com>

The bug could not be reproduced at Mellanox. I will not be able to handle it before next week.

________________________________

From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il]
Sent: Thursday, May 03, 2007 5:08 PM
To: EWG; Christoph Raisch; Moni Levy; Roland Dreier; Michael S. Tsirkin; Ami Perlmutter; Vladimir Sokolovsky; Pavel Shamis; Danny Zarko
Cc: OpenFabrics General
Subject: OFED 1.2 RC3 is delayed for Monday next week (May 7)

Hi All,

Since some of the critical bugs are not solved yet, we decided to delay the release to Monday May 7.

This is the list of critical bugs that should be fixed for RC3:

bug_id  bug_severity  assigned_to               short_short_desc
574     blocker       raisch at de.ibm.com      ehca driver fails while running openmpi
420     critical      monil at voltaire.com     PKey table reordering caused by SM failover stops ipoib traffic
577     critical      rolandd at cisco.com      SRP multipath failover too slow (minutes, not seconds)
465     critical      mst at mellanox.co.il     IPoIB HA fails after several hours of failovers
549     critical      amip at dev.mellanox.co.il  SDP Policy need to be consistent
597     critical      vlad at mellanox.co.il    support RHEL4U5 in OFED 1.2
499     major         vlad at mellanox.co.il    module compiled over ofed won't load due to symbol version mismatch
519     major         pasha at mellanox.co.il   MVAPICH I APPLICATION ABORTS WITH PARTITIONS CONFIGURED
534     major         vlad at mellanox.co.il    SLES9 - Installer fails on declarations - OFED 1.2-20070409
530     major         dannyz at mellanox.co.il  ibdiagnet -r fails on RHEL5 i686
538     major         monis at voltaire.com     integrate IPoIB bonding with IPoIB CM
541     major         mst at mellanox.co.il     slow failover with IPoIB CM bonding/ipoibtools HA
558     major         rolandd at cisco.com      tvflash configure fails on SLES10 SP1 RC2

All owners of blocker and critical bugs - please reply with status of the bug resolution

Thanks,
Tziporet

From sean.hefty at intel.com Sun May 6 21:17:18 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Sun, 6 May 2007 21:17:18 -0700
Subject: [ofa-general] RE: man pages for the rdma-cm
In-Reply-To: <1178127046.18609.107.camel@stevo-desktop>
Message-ID: <000001c7905e$9a562190$95fd070a@amr.corp.intel.com>

>Are there man pages for the rdma-cm in the pipeline? I think it would
>be great (requirement?) to have these for ofed-1.2 since we do have the
>other verbs man pages.

I've added man pages for the APIs and test programs to my master and ofed_1_2 branches. If anyone gets a chance, I'd appreciate someone looking them over. I plan on requesting that they be pulled into the rc3 release.

- Sean

From rdreier at cisco.com Sun May 6 21:19:18 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 06 May 2007 21:19:18 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID:

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This is the second batch of merges for 2.6.22 -- mostly fixes, but also the conversion of IPoIB to use NAPI:

Ishai Rabinovitz (1):
      IB/srp: Add orig_dgid sysfs attribute to scsi_host

Michael S.
Tsirkin (4):
      IB/mthca: Work around kernel QP starvation
      IPoIB/cm: Fix error handling in ipoib_cm_dev_open()
      IPoIB/cm: Don't crash if remote side uses one QP for both directions
      IB: Add CQ comp_vector support

Ralph Campbell (4):
      IB/ipath: Don't call spin_lock_irq() from interrupt context
      IB/ipath: Don't put QP in timeout queue if waiting to send
      IB/ipath: Fix two more spin lock problems
      IB/ipath: Fix a race condition when generating ACKs

Robert Walsh (1):
      IB/ipath: Don't corrupt pending mmap list when unmapped objects are freed

Roland Dreier (4):
      IB/srp: Set proc_name
      IB/fmr_pool: Add prefix to all printks
      IB: Return "maybe missed event" hint from ib_req_notify_cq()
      IPoIB: Convert to NAPI

Steve Wise (4):
      RDMA/cxgb3: Fix TERM codes
      RDMA/cxgb3: Fail qp creation if the requested max_inline is too large
      RDMA/cxgb3: Initialize cpu_idx field in cpl_close_listserv_req message
      RDMA/cxgb3: Support for new abort logic

 drivers/infiniband/core/fmr_pool.c | 32 +++++----
 drivers/infiniband/core/mad.c | 2 +-
 drivers/infiniband/core/uverbs_cmd.c | 1 +
 drivers/infiniband/core/uverbs_main.c | 2 +-
 drivers/infiniband/core/verbs.c | 4 +-
 drivers/infiniband/hw/amso1100/c2.h | 2 +-
 drivers/infiniband/hw/amso1100/c2_cq.c | 16 ++++-
 drivers/infiniband/hw/amso1100/c2_provider.c | 3 +-
 drivers/infiniband/hw/cxgb3/cxio_hal.c | 3 +
 drivers/infiniband/hw/cxgb3/cxio_wr.h | 1 +
 drivers/infiniband/hw/cxgb3/iwch_cm.c | 19 ++++++
 drivers/infiniband/hw/cxgb3/iwch_cm.h | 6 ++
 drivers/infiniband/hw/cxgb3/iwch_provider.c | 14 +++-
 drivers/infiniband/hw/cxgb3/iwch_qp.c | 69 +++++++++++---------
 drivers/infiniband/hw/ehca/ehca_cq.c | 2 +-
 drivers/infiniband/hw/ehca/ehca_iverbs.h | 4 +-
 drivers/infiniband/hw/ehca/ehca_main.c | 3 +-
 drivers/infiniband/hw/ehca/ehca_reqs.c | 14 +++-
 drivers/infiniband/hw/ehca/ipz_pt_fn.h | 8 ++
 drivers/infiniband/hw/ipath/ipath_cq.c | 68 ++++++++++----------
 drivers/infiniband/hw/ipath/ipath_mmap.c | 64 +++++++++++++++++--
 drivers/infiniband/hw/ipath/ipath_qp.c | 52 +++++++++------
 drivers/infiniband/hw/ipath/ipath_rc.c | 55 ++++++++--------
 drivers/infiniband/hw/ipath/ipath_srq.c | 55 ++++++++--------
 drivers/infiniband/hw/ipath/ipath_verbs.c | 4 +
 drivers/infiniband/hw/ipath/ipath_verbs.h | 24 +++++--
 drivers/infiniband/hw/mthca/mthca_cq.c | 12 ++--
 drivers/infiniband/hw/mthca/mthca_dev.h | 4 +-
 drivers/infiniband/hw/mthca/mthca_provider.c | 2 +
 drivers/infiniband/hw/mthca/mthca_qp.c | 13 ++++
 drivers/infiniband/ulp/ipoib/ipoib.h | 1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c | 14 +++--
 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 89 ++++++++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 2 +-
 drivers/infiniband/ulp/iser/iser_verbs.c | 2 +-
 drivers/infiniband/ulp/srp/ib_srp.c | 27 +++++++-
 drivers/infiniband/ulp/srp/ib_srp.h | 1 +
 drivers/net/cxgb3/version.h | 4 +-
 include/rdma/ib_verbs.h | 47 +++++++++++---
 40 files changed, 508 insertions(+), 239 deletions(-)

From yosefe at voltaire.com Sun May 6 23:40:05 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 09:40:05 +0300
Subject: [ofa-general] Re: [PATCH 2/3] remove ib pkey gid and lmc cache
In-Reply-To: <20070503124956.GB9719@mellanox.co.il>
References: <4638B432.3060801@voltaire.com> <4638B4D5.7050709@voltaire.com> <20070502171829.GO22292@mellanox.co.il> <4639D16F.4060807@voltaire.com> <20070503122215.GA9719@mellanox.co.il> <4639D584.3010706@voltaire.com> <20070503124956.GB9719@mellanox.co.il>
Message-ID: <463EC9C5.2010509@voltaire.com>

How about keeping the cache, but keeping it always up-to-date by registering it to process incoming MADs instead of events?

From vlad at lists.openfabrics.org Mon May 7 02:37:55 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 7 May 2007 02:37:55 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070507-0200 daily build status
Message-ID: <20070507093756.27DD9E60838@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-22.ELsmp

Failed:
Build failed on i686 with linux-2.6.21.1

From bs at q-leap.de Mon May 7 02:40:10 2007
From: bs at q-leap.de (Bernd Schubert)
Date: Mon, 7 May 2007 11:40:10 +0200
Subject: [ofa-general] [PATCH] IB/ipath - Don't call spin_lock_irq() from interrupt context
In-Reply-To: <1177697471.3407.14.camel@brick.pathscale.com>
References: <1177697471.3407.14.camel@brick.pathscale.com>
Message-ID: <200705071140.10854.bs@q-leap.de>

On Friday 27 April 2007 20:11:11 Ralph Campbell wrote:
> This patch fixes the problem reported by Bernd Schubert
> with kernel debug options enabled.
> BUG: at kernel/lockdep.c:1860 trace_hardirqs_on()
>
> Hopefully, this can be included in OFED 1.2 as well as
> going upstream.
>
> Signed-off-by: Ralph Campbell
>
> diff -r 97262e873c51 drivers/infiniband/hw/ipath/ipath_rc.c

This file looks significantly different to the version in 2.6.20 and 2.6.21, where can I get the latest version of the driver from?

Thanks in advance,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH

From eli at mellanox.co.il Mon May 7 04:52:22 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 07 May 2007 14:52:22 +0300
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: commands timeout
Message-ID: <1178538772.10759.2.camel@mtls03>

When the system is busy it may happen that the command actually completed but it took more than the specified timeout till the task executing the command was actually given CPU time. This test checks that the completion is really missing before failing.

Signed-off-by: Eli Cohen
---
Index: connectx_kernel/drivers/net/mlx4/cmd.c
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/cmd.c 2007-05-07 12:32:35.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/cmd.c 2007-05-07 14:30:26.000000000 +0300
@@ -272,10 +272,11 @@ static int mlx4_cmd_wait(struct mlx4_dev
 	mlx4_cmd_post(dev, in_param, out_param ? *out_param : 0,
 		      in_modifier, op_modifier, op, context->token, 1);
 
-	if (!wait_for_completion_timeout(&context->done, msecs_to_jiffies(timeout))) {
-		err = -EBUSY;
-		goto out;
-	}
+	if (!wait_for_completion_timeout(&context->done, msecs_to_jiffies(timeout)))
+		if (!context->done.done) {
+			err = -EBUSY;
+			goto out;
+		}
 
 	err = context->result;
 	if (err)

From mst at dev.mellanox.co.il Mon May 7 05:04:29 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 May 2007 15:04:29 +0300
Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout
In-Reply-To: <1178538772.10759.2.camel@mtls03>
References: <1178538772.10759.2.camel@mtls03>
Message-ID: <20070507115714.GC29350@mellanox.co.il>

> Quoting Eli Cohen :
> Subject: [PATCH] IB/mlx4 mlx4_ib: commands timeout
>
> When the system is busy it may happen that the command actually
> completed but it took more than the specified timeout till the
> task executing the command was actually given CPU time. This test
> checks that the completion is really missing before failing.
>
> Signed-off-by: Eli Cohen

How likely is this to help in practice?

--
MST

From eli at mellanox.co.il Mon May 7 05:18:00 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 07 May 2007 15:18:00 +0300
Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout
In-Reply-To: <20070507115714.GC29350@mellanox.co.il>
References: <1178538772.10759.2.camel@mtls03> <20070507115714.GC29350@mellanox.co.il>
Message-ID: <1178540310.10759.9.camel@mtls03>

On Mon, 2007-05-07 at 15:04 +0300, Michael S.
Tsirkin wrote:
> How likely is this to help in practice?
>

Like I said, when the system is very busy. In this case the command may actually complete very soon, but wait_for_completion_timeout() will nevertheless return zero since the task did not get CPU time before the specified timeout expired. In this case we would like to check whether done is signaled and, if so, not fail the command.

From halr at voltaire.com Mon May 7 05:32:45 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 May 2007 08:32:45 -0400
Subject: [ofa-general] Re: [PATCH TRIVIAL] opensm/osm_helper: remove repeated strlen() calls
In-Reply-To: <20070506130352.GC9692@sashak.voltaire.com>
References: <462C7C21.7010004@dev.mellanox.co.il> <20070423101738.GG4579@mellanox.co.il> <462E80A3.5060503@dev.mellanox.co.il> <20070501005101.GA26019@sashak.voltaire.com> <4636E4A7.7060108@dev.mellanox.co.il> <1178211572.32222.3479.camel@hal.voltaire.com> <20070506124333.GB9692@sashak.voltaire.com> <20070506130352.GC9692@sashak.voltaire.com>
Message-ID: <1178541140.32222.348653.camel@hal.voltaire.com>

On Sun, 2007-05-06 at 09:03, Sasha Khapyorsky wrote:
> Replace repeated strlen() calls used in sprintf() by actual string
> length accumulated from sprintf() return values.
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied to master only (as this is a cleanup rather than a bug fix). Let me know if you think this should be applied to ofed_1_2.

-- Hal

From yosefe at voltaire.com Mon May 7 05:52:49 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 15:52:49 +0300
Subject: [ofa-general] [PATCH 0/6 v2] fix pkey change handling and remove the cahce
Message-ID: <463F2121.5080803@voltaire.com>

The issue addressed is keeping ipoib interfaces alive when the order of the port's pkey table changes. pkey-to-index queries were using a cache. However, the cache might not be up-to-date when ipoib asks it to resolve a pkey, so a direct query must be used.
On the other hand, in build_mlx_header, the pkey query must be atomic. So the driver will keep its own pkey cache, which is non-blocking and always updated before ipoib is notified of the event. In addition, remove the pkey delayed initialization thread; instead, start the interface on pkey change notification.

changes from v1:
* code style fixes
* reorganize patches
* mthca: add gid cache
* mad: add lmc cache

patch ordering:
1. core: add blocking device queries
2. core,ulp: use blocking queries
3. ipoib: handle pkey event
4. mthca: cache gids and pkeys
5. mad: cache lmc
6. core: remove cache

From yosefe at voltaire.com Mon May 7 05:54:34 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 15:54:34 +0300
Subject: [ofa-general] [PATCH 1/6 v2] fix pkey change handling and remove the cahce
In-Reply-To: <463F2121.5080803@voltaire.com>
References: <463F2121.5080803@voltaire.com>
Message-ID: <463F218A.7030400@voltaire.com>

core: uncached "find gid" and "find pkey" queries

* Add ib_find_gid and ib_find_pkey over possibly blocking device queries.

Signed-off-by: Yosef Etigin
---
 drivers/infiniband/core/device.c | 96 +++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h | 23 +++++++++
 2 files changed, 119 insertions(+)

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c 2007-05-06 09:16:18.000000000 +0300
+++ b/drivers/infiniband/core/device.c 2007-05-06 09:33:50.000000000 +0300
@@ -149,6 +149,18 @@ static int alloc_name(char *name)
 	return 0;
 }

+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+ 0 : device->phys_port_cnt; +} + /** * ib_alloc_device - allocate an IB device struct * @size:size of structure to allocate @@ -592,6 +604,90 @@ int ib_modify_port(struct ib_device *dev } EXPORT_SYMBOL(ib_modify_port); +/** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index) +{ + struct ib_port_attr *tprops = NULL; + union ib_gid tmp_gid; + int ret, port, i; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + + for (port = start_port(device); port <= end_port(device); ++port) { + ret = ib_query_port(device, port, tprops); + if (ret) + continue; + + for (i = 0; i < tprops->gid_tbl_len; ++i) { + ret = ib_query_gid(device, port, i, &tmp_gid); + if (ret) + goto out; + if (!memcmp(&tmp_gid, gid, sizeof *gid)) { + *port_num = port; + *index = i; + ret = 0; + goto out; + } + } + } + ret = -ENOENT; +out: + kfree(tprops); + return ret; +} +EXPORT_SYMBOL(ib_find_gid); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. + * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. 
+ */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index) +{ + struct ib_port_attr *tprops = NULL; + int ret, i; + u16 tmp_pkey; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + + ret = ib_query_port(device, port_num, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed , ret = %d\n", ret); + goto out; + } + + for (i = 0; i < tprops->pkey_tbl_len; ++i) { + ret = ib_query_pkey(device, port_num, i, &tmp_pkey); + if (ret) + goto out; + + if (pkey == tmp_pkey) { + *index = i; + ret = 0; + goto out; + } + } + ret = -ENOENT; + +out: + kfree(tprops); + return ret; +} +EXPORT_SYMBOL(ib_find_pkey); + static int __init ib_core_init(void) { int ret; Index: b/include/rdma/ib_verbs.h =================================================================== --- a/include/rdma/ib_verbs.h 2007-05-06 09:16:18.000000000 +0300 +++ b/include/rdma/ib_verbs.h 2007-05-06 09:16:22.000000000 +0300 @@ -1134,6 +1134,29 @@ int ib_modify_port(struct ib_device *dev struct ib_port_modify *port_modify); /** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. + * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. + */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index); + +/** * ib_alloc_pd - Allocates an unused protection domain. * @device: The device on which to allocate the protection domain. 
* From yosefe at voltaire.com Mon May 7 05:55:40 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Mon, 07 May 2007 15:55:40 +0300 Subject: [ofa-general] [PATCH 2/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <463F2121.5080803@voltaire.com> References: <463F2121.5080803@voltaire.com> Message-ID: <463F21CC.1060807@voltaire.com> core, ulp: don't use ib_cahce * Modify users of the ib cache in: core, ipoib, srp, to use blocking device queries. Signed-off-by: Yosef Etigin --- drivers/infiniband/core/cm.c | 8 ++++---- drivers/infiniband/core/cma.c | 9 ++++----- drivers/infiniband/core/multicast.c | 3 +-- drivers/infiniband/core/sa_query.c | 3 +-- drivers/infiniband/core/verbs.c | 3 +-- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 3 +-- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 4 ++-- drivers/infiniband/ulp/srp/ib_srp.c | 6 ++---- 8 files changed, 16 insertions(+), 23 deletions(-) Index: b/drivers/infiniband/core/cm.c =================================================================== --- a/drivers/infiniband/core/cm.c 2007-05-06 09:26:12.000000000 +0300 +++ b/drivers/infiniband/core/cm.c 2007-05-06 09:26:17.000000000 +0300 @@ -46,8 +46,8 @@ #include #include -#include #include +#include #include "cm_msgs.h" MODULE_AUTHOR("Sean Hefty"); @@ -275,7 +275,7 @@ static int cm_init_av_by_path(struct ib_ read_lock_irqsave(&cm.device_lock, flags); list_for_each_entry(cm_dev, &cm.device_list, list) { - if (!ib_find_cached_gid(cm_dev->device, &path->sgid, + if (!ib_find_gid(cm_dev->device, &path->sgid, &p, NULL)) { port = &cm_dev->port[p-1]; break; @@ -286,7 +286,7 @@ static int cm_init_av_by_path(struct ib_ if (!port) return -EINVAL; - ret = ib_find_cached_pkey(cm_dev->device, port->port_num, + ret = ib_find_pkey(cm_dev->device, port->port_num, be16_to_cpu(path->pkey), &av->pkey_index); if (ret) return ret; @@ -1382,7 +1382,7 @@ static int cm_req_handler(struct cm_work cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]); ret = 
cm_init_av_by_path(&work->path[0], &cm_id_priv->av); if (ret) { - ib_get_cached_gid(work->port->cm_dev->device, + ib_query_gid(work->port->cm_dev->device, work->port->port_num, 0, &work->path[0].sgid); ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID, &work->path[0].sgid, sizeof work->path[0].sgid, Index: b/drivers/infiniband/core/cma.c =================================================================== --- a/drivers/infiniband/core/cma.c 2007-05-06 09:26:12.000000000 +0300 +++ b/drivers/infiniband/core/cma.c 2007-05-06 09:26:17.000000000 +0300 @@ -41,7 +41,6 @@ #include #include -#include #include #include #include @@ -325,7 +324,7 @@ static int cma_acquire_dev(struct rdma_i } list_for_each_entry(cma_dev, &dev_list, list) { - ret = ib_find_cached_gid(cma_dev->device, &gid, + ret = ib_find_gid(cma_dev->device, &gid, &id_priv->id.port_num, NULL); if (!ret) { ret = cma_set_qkey(cma_dev->device, @@ -514,7 +513,7 @@ static int cma_ib_init_qp_attr(struct rd struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; int ret; - ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num, + ret = ib_find_pkey(id_priv->id.device, id_priv->id.port_num, ib_addr_get_pkey(dev_addr), &qp_attr->pkey_index); if (ret) @@ -1658,11 +1657,11 @@ static int cma_bind_loopback(struct rdma cma_dev = list_entry(dev_list.next, struct cma_device, list); port_found: - ret = ib_get_cached_gid(cma_dev->device, p, 0, &gid); + ret = ib_query_gid(cma_dev->device, p, 0, &gid); if (ret) goto out; - ret = ib_get_cached_pkey(cma_dev->device, p, 0, &pkey); + ret = ib_query_pkey(cma_dev->device, p, 0, &pkey); if (ret) goto out; Index: b/drivers/infiniband/core/multicast.c =================================================================== --- a/drivers/infiniband/core/multicast.c 2007-05-06 09:26:12.000000000 +0300 +++ b/drivers/infiniband/core/multicast.c 2007-05-06 09:26:17.000000000 +0300 @@ -38,7 +38,6 @@ #include #include -#include #include "sa.h" static void mcast_add_one(struct ib_device 
*device); @@ -686,7 +685,7 @@ int ib_init_ah_from_mcmember(struct ib_d u16 gid_index; u8 p; - ret = ib_find_cached_gid(device, &rec->port_gid, &p, &gid_index); + ret = ib_find_gid(device, &rec->port_gid, &p, &gid_index); if (ret) return ret; Index: b/drivers/infiniband/core/sa_query.c =================================================================== --- a/drivers/infiniband/core/sa_query.c 2007-05-06 09:26:12.000000000 +0300 +++ b/drivers/infiniband/core/sa_query.c 2007-05-06 09:26:17.000000000 +0300 @@ -47,7 +47,6 @@ #include #include -#include #include "sa.h" MODULE_AUTHOR("Roland Dreier"); @@ -477,7 +476,7 @@ int ib_init_ah_from_path(struct ib_devic ah_attr->ah_flags = IB_AH_GRH; ah_attr->grh.dgid = rec->dgid; - ret = ib_find_cached_gid(device, &rec->sgid, &port_num, + ret = ib_find_gid(device, &rec->sgid, &port_num, &gid_index); if (ret) return ret; Index: b/drivers/infiniband/core/verbs.c =================================================================== --- a/drivers/infiniband/core/verbs.c 2007-05-06 09:26:12.000000000 +0300 +++ b/drivers/infiniband/core/verbs.c 2007-05-06 09:26:17.000000000 +0300 @@ -43,7 +43,6 @@ #include #include -#include int ib_rate_to_mult(enum ib_rate rate) { @@ -159,7 +158,7 @@ int ib_init_ah_from_wc(struct ib_device ah_attr->ah_flags = IB_AH_GRH; ah_attr->grh.dgid = grh->sgid; - ret = ib_find_cached_gid(device, &grh->dgid, &port_num, + ret = ib_find_gid(device, &grh->dgid, &port_num, &gid_index); if (ret) return ret; Index: b/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-05-06 09:26:12.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-05-06 09:26:17.000000000 +0300 @@ -33,7 +33,6 @@ */ #include -#include #include #include #include @@ -759,7 +758,7 @@ static int ipoib_cm_modify_tx_init(struc struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; int qp_attr_mask, ret; - ret = 
ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index); + ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index); if (ret) { ipoib_warn(priv, "pkey 0x%x not in cache: %d\n", priv->pkey, ret); return ret; Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-06 09:26:12.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-07 14:28:48.625133165 +0300 @@ -38,7 +38,7 @@ #include #include -#include +#include #include "ipoib.h" @@ -446,7 +446,7 @@ static void ipoib_pkey_dev_check_presenc struct ipoib_dev_priv *priv = netdev_priv(dev); u16 pkey_index = 0; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); else set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); Index: b/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- a/drivers/infiniband/ulp/srp/ib_srp.c 2007-05-06 09:26:12.000000000 +0300 +++ b/drivers/infiniband/ulp/srp/ib_srp.c 2007-05-06 09:26:17.000000000 +0300 @@ -48,8 +48,6 @@ #include #include -#include - #include "ib_srp.h" #define DRV_NAME "ib_srp" @@ -164,7 +162,7 @@ static int srp_init_qp(struct srp_target if (!attr) return -ENOMEM; - ret = ib_find_cached_pkey(target->srp_host->dev->dev, + ret = ib_find_pkey(target->srp_host->dev->dev, target->srp_host->port, be16_to_cpu(target->path.pkey), &attr->pkey_index); @@ -1780,7 +1778,7 @@ static ssize_t srp_create_target(struct if (ret) goto err; - ib_get_cached_gid(host->dev->dev, host->port, 0, &target->path.sgid); + ib_query_gid(host->dev->dev, host->port, 0, &target->path.sgid); printk(KERN_DEBUG PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x " "service_id %016llx dgid %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", From yosefe at voltaire.com Mon May 
7 05:57:27 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 07 May 2007 15:57:27 +0300
Subject: [ofa-general] [PATCH 3/6 v2] fix pkey change handling and remove the cahce
In-Reply-To: <463F2121.5080803@voltaire.com>
References: <463F2121.5080803@voltaire.com>
Message-ID: <463F2237.7050809@voltaire.com>

ipoib: handle pkey change events

This issue was found during partitioning & SM failover testing.

* Added a flush flag to ipoib_ib_dev_stop(), like the one
  ipoib_ib_dev_down() takes
* Fixed a bug in device extraction from the work struct
* Removed some warnings when they are caused by a missing PKEY, as this
  now seems like a valid flow
* Assume that the cache is coherent - do not retry on cache queries
* Restart child interfaces before the parent
* Remove the pkey polling thread and the delayed pkey initialization
* If an interface is brought up but its pkey is not found, mark it with
  IPOIB_PKEY_NEEDED, and when a pkey event arrives, try to restart it

SM reconfiguration or failover can shuffle the values in the port pkey
table. The current implementation queries for the index of the pkey only
once, when it creates the device QP and then moves it into working state,
and hence does not address this scenario. Fix this by using the
PKEY_CHANGE event as a trigger to reconfigure the device QP.
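The scenario described above can be sketched in plain userspace C. This is only an illustrative model, not the driver code: port_model, iface_model, find_pkey_index() and on_pkey_change() are hypothetical names invented for the sketch, and the lookup compares the full 16-bit value for simplicity. It shows why an index resolved once goes stale when the table is shuffled, and how re-resolving it on a change event either refreshes the index or marks the interface as still needing its pkey.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; these types do not exist in the kernel. */
struct port_model {
	const uint16_t *pkey_table;	/* current contents of the port P_Key table */
	size_t table_len;
};

struct iface_model {
	uint16_t pkey;		/* P_Key value the interface was created with */
	uint16_t pkey_index;	/* index last programmed into the QP */
	int pkey_needed;	/* plays the role of IPOIB_PKEY_NEEDED */
};

/* Linear search: returns 0 and stores the index, or -1 if the value is absent. */
int find_pkey_index(const struct port_model *port, uint16_t pkey,
		    uint16_t *index)
{
	assert(port != NULL && index != NULL);

	for (size_t i = 0; i < port->table_len; i++) {
		if (port->pkey_table[i] == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -1;
}

/*
 * Modelled reaction to a pkey change event: look the value up again
 * instead of trusting the index that was resolved at QP creation time.
 */
int on_pkey_change(struct iface_model *iface, const struct port_model *port)
{
	uint16_t index;

	if (find_pkey_index(port, iface->pkey, &index)) {
		iface->pkey_needed = 1;	/* wait for a later event */
		return -1;
	}

	iface->pkey_index = index;	/* QP would be reconfigured with this */
	iface->pkey_needed = 0;
	return 0;
}
```

In this model the interface keeps its pkey value across events and only the index changes, which is the property the patch relies on when it restarts the QP.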
Signed-off-by: Moni Levy Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 10 - drivers/infiniband/ulp/ipoib/ipoib_ib.c | 144 ++++++++----------------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 11 - drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 + drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 21 +-- 5 files changed, 76 insertions(+), 121 deletions(-) Index: b/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-06 09:26:08.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-06 09:26:18.000000000 +0300 @@ -80,7 +80,7 @@ enum { IPOIB_FLAG_INITIALIZED = 1, IPOIB_FLAG_ADMIN_UP = 2, IPOIB_PKEY_ASSIGNED = 3, - IPOIB_PKEY_STOP = 4, + IPOIB_PKEY_NEEDED = 4, IPOIB_FLAG_SUBINTERFACE = 5, IPOIB_MCAST_RUN = 6, IPOIB_STOP_REAPER = 7, @@ -202,9 +202,9 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; struct delayed_work mcast_task; struct work_struct flush_task; + struct work_struct pkey_task; struct work_struct restart_task; struct delayed_work ah_reap_task; @@ -333,12 +333,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); @@ -384,9 +385,6 @@ void ipoib_event(struct ib_event_handler int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); int 
ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); -void ipoib_pkey_poll(struct work_struct *work); -int ipoib_pkey_dev_delay_open(struct net_device *dev); - #ifdef CONFIG_INFINIBAND_IPOIB_CM #define IPOIB_FLAGS_RC 0x80 Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-06 09:26:17.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-06 09:26:26.000000000 +0300 @@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device ret = ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -441,28 +441,10 @@ int ipoib_ib_dev_open(struct net_device return 0; } -static void ipoib_pkey_dev_check_presence(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - u16 pkey_index = 0; - - if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); - else - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); -} - int ipoib_ib_dev_up(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - ipoib_pkey_dev_check_presence(dev); - - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { - ipoib_dbg(priv, "PKEY is not assigned.\n"); - return 0; - } - set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); return ipoib_mcast_start_thread(dev); @@ -477,16 +459,6 @@ int ipoib_ib_dev_down(struct net_device clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); netif_carrier_off(dev); - /* Shutdown the P_Key thread if still active */ - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { - mutex_lock(&pkey_mutex); - set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); - 
mutex_unlock(&pkey_mutex); - if (flush) - flush_workqueue(ipoib_workqueue); - } - ipoib_mcast_stop_thread(dev, flush); ipoib_mcast_dev_flush(dev); @@ -508,7 +480,7 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +553,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,14 +595,33 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct *work) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { - ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); + mutex_lock(&priv->vlan_mutex); + + /* Flush any child interfaces */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, restart_qp); + + mutex_unlock(&priv->vlan_mutex); + + /* + * If the device is not initiallized since it needs a pkey - + * try to reopen it + */ + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { + + if (restart_qp + && test_bit(IPOIB_PKEY_NEEDED, &priv->flags) + && test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { + /* this iface needs pkey, try to bring it up */ + ipoib_open(priv->dev); + } + else + ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); return; } @@ -642,6 +634,12 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_down(dev, 0); + if (restart_qp) { + if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) + ipoib_ib_dev_stop(dev, 0); + 
ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +648,25 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task); + /* we only restart the QP in case of pkey change event */ + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - /* Flush any child interfaces too */ - list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_task); - mutex_unlock(&priv->vlan_mutex); + /* restart the QP in case of pkey change event */ + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void ipoib_ib_dev_cleanup(struct net_device *dev) @@ -672,54 +681,3 @@ void ipoib_ib_dev_cleanup(struct net_dev ipoib_transport_dev_cleanup(dev); } -/* - * Delayed P_Key Assigment Interim Support - * - * The following is initial implementation of delayed P_Key assigment - * mechanism. It is using the same approach implemented for the multicast - * group join. The single goal of this implementation is to quickly address - * Bug #2507. This implementation will probably be removed when the P_Key - * change async notification is available. 
- */ - -void ipoib_pkey_poll(struct work_struct *work) -{ - struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); - struct net_device *dev = priv->dev; - - ipoib_pkey_dev_check_presence(dev); - - if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) - ipoib_open(dev); - else { - mutex_lock(&pkey_mutex); - if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) - queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, - HZ); - mutex_unlock(&pkey_mutex); - } -} - -int ipoib_pkey_dev_delay_open(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - - /* Look for the interface pkey value in the IB Port P_Key table and */ - /* set the interface pkey assigment flag */ - ipoib_pkey_dev_check_presence(dev); - - /* P_Key value not assigned yet - start polling */ - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { - mutex_lock(&pkey_mutex); - clear_bit(IPOIB_PKEY_STOP, &priv->flags); - queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, - HZ); - mutex_unlock(&pkey_mutex); - return 1; - } - - return 0; -} Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-06 09:26:08.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-06 09:26:18.000000000 +0300 @@ -100,14 +100,11 @@ int ipoib_open(struct net_device *dev) set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); - if (ipoib_pkey_dev_delay_open(dev)) - return 0; - if (ipoib_ib_dev_open(dev)) - return -EINVAL; + return test_bit(IPOIB_PKEY_NEEDED, &priv->flags) ? 
0 : -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +149,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ -990,7 +987,7 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-05-06 09:26:08.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-05-06 09:26:18.000000000 +0300 @@ -232,9 +232,10 @@ static int ipoib_mcast_join_finish(struc ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), &mcast->mcmember.mgid); if (ret < 0) { - ipoib_warn(priv, "couldn't attach QP to multicast group " - IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); + if (ret != -ENXIO) /* No pkey found */ + ipoib_warn(priv, "couldn't attach QP to multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); return ret; @@ -312,7 +313,7 @@ ipoib_mcast_sendonly_join_complete(int s status = ipoib_mcast_join_finish(mcast, &multicast->rec); if (status) { - if (mcast->logcount++ < 20) + if (mcast->logcount++ < 20 && status != -ENXIO) ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " IPOIB_GID_FMT ", status %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), status); @@ -420,7 +421,7 @@ static int 
ipoib_mcast_join_complete(int ", status %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), status); - } else { + } else if (status != -ENXIO) { ipoib_warn(priv, "multicast join failed for " IPOIB_GID_FMT ", status %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-06 09:26:08.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-06 09:26:18.000000000 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ -#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,12 +47,12 @@ int ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_NEEDED, &priv->flags); ret = -ENXIO; goto out; } - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + set_bit(IPOIB_PKEY_NEEDED, &priv->flags); /* set correct QKey for QP */ qp_attr->qkey = priv->qkey; @@ -103,12 +101,12 @@ int ipoib_init_qp(struct net_device *dev * The port has to be assigned to the respective IB partition in * advance. 
*/ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); + ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); if (ret) { - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + set_bit(IPOIB_PKEY_NEEDED, &priv->flags); return ret; } - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + clear_bit(IPOIB_PKEY_NEEDED, &priv->flags); qp_attr.qp_state = IB_QPS_INIT; qp_attr.qkey = 0; @@ -238,7 +236,7 @@ void ipoib_transport_dev_cleanup(struct ipoib_warn(priv, "ib_qp_destroy failed\n"); priv->qp = NULL; - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + clear_bit(IPOIB_PKEY_NEEDED, &priv->flags); } if (ib_destroy_cq(priv->cq)) @@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE || @@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler record->element.port_num == priv->port) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE && + record->element.port_num == priv->port) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_task); } } From yosefe at voltaire.com Mon May 7 05:58:22 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Mon, 07 May 2007 15:58:22 +0300 Subject: [ofa-general] [PATCH 4/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <463F2121.5080803@voltaire.com> References: <463F2121.5080803@voltaire.com> Message-ID: <463F226E.9040700@voltaire.com> mthca: cache pkeys and gids * Use incoming mads to update the internal cache: use PKEY_TABLE mads to update pkey table cache, and GUID_INFO, PORT_INFO mads to update gid table cache (which update guid table and gid prefix, accordingly). 
* Modify query_pkey and query_gid to use this cache, which makes them non-blocking * While creating a MLX QP, use these functions instead of the cache from ib core. Signed-off-by: Yosef Etigin --- drivers/infiniband/hw/mthca/mthca_av.c | 3 drivers/infiniband/hw/mthca/mthca_dev.h | 20 + drivers/infiniband/hw/mthca/mthca_mad.c | 3 drivers/infiniband/hw/mthca/mthca_provider.c | 284 ++++++++++++++++++++------- drivers/infiniband/hw/mthca/mthca_qp.c | 5 include/rdma/ib_smi.h | 4 6 files changed, 245 insertions(+), 74 deletions(-) Index: b/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- a/drivers/infiniband/hw/mthca/mthca_dev.h 2007-05-07 14:28:47.574320783 +0300 +++ b/drivers/infiniband/hw/mthca/mthca_dev.h 2007-05-07 14:28:55.365929626 +0300 @@ -49,6 +49,8 @@ #include +#include + #include "mthca_provider.h" #include "mthca_doorbell.h" @@ -287,6 +289,19 @@ struct mthca_catas_err { struct list_head list; }; +struct mthca_pkey_cache { + rwlock_t lock; + int table_len; + u16 table[0]; +}; + +struct mthca_gid_cache { + rwlock_t lock; + u64 gid_prefix; + int table_len; + u64 guid_table[0]; +}; + extern struct mutex mthca_device_mutex; struct mthca_dev { @@ -360,6 +375,9 @@ struct mthca_dev { struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; spinlock_t sm_lock; u8 rate[MTHCA_MAX_PORTS]; + + struct mthca_pkey_cache *pkey_cache[MTHCA_MAX_PORTS]; + struct mthca_gid_cache *gid_cache[MTHCA_MAX_PORTS]; }; #ifdef CONFIG_INFINIBAND_MTHCA_DEBUG @@ -585,6 +603,8 @@ int mthca_process_mad(struct ib_device * int mthca_create_agents(struct mthca_dev *dev); void mthca_free_agents(struct mthca_dev *dev); +int mthca_cache_update(struct mthca_dev *mdev, u8 port_num, struct ib_mad *mad); + static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) { return container_of(ibdev, struct mthca_dev, ib_dev); Index: b/drivers/infiniband/hw/mthca/mthca_mad.c =================================================================== --- 
a/drivers/infiniband/hw/mthca/mthca_mad.c 2007-05-07 14:28:47.574320783 +0300 +++ b/drivers/infiniband/hw/mthca/mthca_mad.c 2007-05-07 14:28:55.366929448 +0300 @@ -139,6 +139,9 @@ static void smp_snoop(struct ib_device * event.element.port_num = port_num; ib_dispatch_event(&event); } + + /* update cache with the incoming mad */ + mthca_cache_update(to_mdev(ibdev), port_num, mad); } } Index: b/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- a/drivers/infiniband/hw/mthca/mthca_provider.c 2007-05-07 14:28:47.575320605 +0300 +++ b/drivers/infiniband/hw/mthca/mthca_provider.c 2007-05-07 14:28:55.367929269 +0300 @@ -243,87 +243,44 @@ out: static int mthca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey) { - struct ib_smp *in_mad = NULL; - struct ib_smp *out_mad = NULL; - int err = -ENOMEM; - u8 status; + struct mthca_dev *mdev; + struct mthca_pkey_cache *pkey_cache; + unsigned int flags; - in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); - out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); - if (!in_mad || !out_mad) - goto out; + mdev = to_mdev(ibdev); - init_query_mad(in_mad); - in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE; - in_mad->attr_mod = cpu_to_be32(index / 32); - - err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, - port, NULL, NULL, in_mad, out_mad, - &status); - if (err) - goto out; - if (status) { - err = -EINVAL; - goto out; + if (port < 1 || port > mdev->ib_dev.phys_port_cnt || + index >= mdev->pkey_cache[ port - 1 ]->table_len ) { + return -EINVAL; } - *pkey = be16_to_cpu(((__be16 *) out_mad->data)[index % 32]); - - out: - kfree(in_mad); - kfree(out_mad); - return err; + pkey_cache = mdev->pkey_cache[ port - 1 ]; + read_lock_irqsave(&pkey_cache->lock, flags); + *pkey = be16_to_cpu( pkey_cache->table[ index ] ); + read_unlock_irqrestore(&pkey_cache->lock, flags); + return 0; } static int mthca_query_gid(struct ib_device *ibdev, u8 port, int index, union ib_gid *gid) { - struct ib_smp 
*in_mad = NULL; - struct ib_smp *out_mad = NULL; - int err = -ENOMEM; - u8 status; - - in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); - out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); - if (!in_mad || !out_mad) - goto out; + struct mthca_dev * mdev; + unsigned int flags; + struct mthca_gid_cache *gid_cache; - init_query_mad(in_mad); - in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; - in_mad->attr_mod = cpu_to_be32(port); + mdev = to_mdev(ibdev); - err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, - port, NULL, NULL, in_mad, out_mad, - &status); - if (err) - goto out; - if (status) { - err = -EINVAL; - goto out; - } - - memcpy(gid->raw, out_mad->data + 8, 8); - - init_query_mad(in_mad); - in_mad->attr_id = IB_SMP_ATTR_GUID_INFO; - in_mad->attr_mod = cpu_to_be32(index / 8); - - err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, - port, NULL, NULL, in_mad, out_mad, - &status); - if (err) - goto out; - if (status) { - err = -EINVAL; - goto out; + if (port < 1 || port > mdev->ib_dev.phys_port_cnt || + index >= mdev->gid_cache[ port - 1 ]->table_len ) { + return -EINVAL; } - memcpy(gid->raw + 8, out_mad->data + (index % 8) * 8, 8); - - out: - kfree(in_mad); - kfree(out_mad); - return err; + gid_cache = mdev->gid_cache[ port - 1 ]; + read_lock_irqsave(&gid_cache->lock, flags); + memcpy( gid->raw, &gid_cache->gid_prefix, 8); + memcpy( gid->raw + 8, gid_cache->guid_table + index, 8); + read_unlock_irqrestore(&gid_cache->lock, flags); + return 0; } static struct ib_ucontext *mthca_alloc_ucontext(struct ib_device *ibdev, @@ -1259,6 +1216,189 @@ out: return err; } +/* update a cached table */ +static int mthca_cache_update_table(struct mthca_dev *mdev, + void *table, int table_size, + void *data, int data_size, int table_offset) +{ + + /* make sure the offset is valid */ + if (table_size < table_offset+data_size) { + mthca_warn(mdev, "cache table offset out of range - ignoring\n"); + return -EINVAL; + } + + /* update the cache */ + memcpy((u8*)table+table_offset, data, data_size); + + return 0; +} + 
+/* update the cache with mad */ +int mthca_cache_update(struct mthca_dev *mdev, u8 port_num, struct ib_mad *mad) +{ + struct mthca_pkey_cache *pkey_cache; + struct mthca_gid_cache *gid_cache; + unsigned long flags; + struct ib_smp *smp; + unsigned int offset; + int ret = 0; + + smp = (struct ib_smp*)mad; + offset = ( be32_to_cpu(smp->attr_mod) & 0xFFFF ); + //TODO check if port# is valid + + switch (mad->mad_hdr.attr_id) { + case IB_SMP_ATTR_PKEY_TABLE: + mthca_dbg(mdev, "port %d: pkey table change\n", port_num); + pkey_cache = mdev->pkey_cache[ port_num - 1 ]; + write_lock_irqsave(&pkey_cache->lock, flags); + mthca_cache_update_table(mdev, + pkey_cache->table, pkey_cache->table_len * sizeof (u16), + smp->data, IB_SMP_NUM_PKEY_ENTRIES * sizeof (u16), + offset * IB_SMP_NUM_PKEY_ENTRIES * sizeof (u16)); + write_unlock_irqrestore(&pkey_cache->lock, flags); + break; + + case IB_SMP_ATTR_GUID_INFO: + mthca_dbg(mdev, "port %d: guid table change\n", port_num); + gid_cache = mdev->gid_cache[ port_num - 1 ]; + write_lock_irqsave(&gid_cache->lock, flags); + mthca_cache_update_table(mdev, + gid_cache->guid_table, gid_cache->table_len * sizeof (u64), + smp->data, IB_SMP_NUM_GUID_ENTRIES * sizeof (u64), + offset * IB_SMP_NUM_GUID_ENTRIES * sizeof (u64)); + write_unlock_irqrestore(&gid_cache->lock, flags); + break; + + case IB_SMP_ATTR_PORT_INFO: + mthca_dbg(mdev, "port %d: port info change\n", port_num); + gid_cache = mdev->gid_cache[ port_num - 1 ]; + write_lock_irqsave(&gid_cache->lock, flags); + gid_cache->gid_prefix = *(u64*)(smp->data + 8); + write_unlock_irqrestore(&gid_cache->lock, flags); + break; + } + return ret; +} + +static int mthca_cache_init(struct mthca_dev *mdev) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + struct mthca_pkey_cache *pkey_cache; + struct mthca_gid_cache *gid_cache; + unsigned int i, offset; + u8 status; + int err = -ENOMEM; + + memset(mdev->pkey_cache, 0, sizeof mdev->pkey_cache); + + in_mad = kmalloc(sizeof *in_mad, 
GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + + if (!in_mad || !out_mad) + goto out; + + for ( i = 0; i < mdev->ib_dev.phys_port_cnt; ++i ) { + + unsigned int port = i + 1; + + /* allocate pkey cache */ + mdev->pkey_cache[ i ] = pkey_cache = kmalloc(sizeof *pkey_cache + + mdev->limits.pkey_table_len * sizeof(u16), GFP_KERNEL); + if ( ! pkey_cache ) + goto out; + + rwlock_init(&pkey_cache->lock); + + /* populate pkey table */ + pkey_cache->table_len = mdev->limits.pkey_table_len; + for (offset = 0; offset < pkey_cache->table_len; + offset += IB_SMP_NUM_PKEY_ENTRIES) { + + memset(in_mad, 0, sizeof *in_mad); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE; + in_mad->attr_mod = cpu_to_be32( offset / IB_SMP_NUM_PKEY_ENTRIES); + + err = mthca_MAD_IFC(mdev, 1, 1, + port, NULL, NULL, in_mad, out_mad, + &status); + + if (err || status) + break; + + mthca_cache_update_table(mdev, + pkey_cache->table, pkey_cache->table_len * sizeof (u16), + out_mad->data, IB_SMP_NUM_PKEY_ENTRIES * sizeof (u16), + offset * sizeof (u16)); + } + + /* allocate gid cache */ + mdev->gid_cache[ i ] = gid_cache = kmalloc(sizeof *gid_cache + + mdev->limits.gid_table_len * sizeof(u64), GFP_KERNEL); + if ( !gid_cache ) + goto out; + + rwlock_init(&gid_cache->lock); + + /* populate guid table */ + gid_cache->table_len = mdev->limits.gid_table_len; + for (offset = 0; offset < gid_cache->table_len; + offset += IB_SMP_NUM_GUID_ENTRIES) { + + memset(in_mad, 0, sizeof *in_mad); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_GUID_INFO; + in_mad->attr_mod = cpu_to_be32( offset / IB_SMP_NUM_GUID_ENTRIES); + + err = mthca_MAD_IFC(mdev, 1, 1, + port, NULL, NULL, in_mad, out_mad, + &status); + + if (err || status) + break; + + mthca_cache_update_table(mdev, + gid_cache->guid_table, gid_cache->table_len * sizeof (u64), + out_mad->data, IB_SMP_NUM_GUID_ENTRIES * sizeof (u64), + offset * sizeof (u64)); + } + + /* read gid prefix */ + init_query_mad(in_mad); + 
in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(mdev, 1, 1, + port, NULL, NULL, in_mad, out_mad, + &status); + + if (err || status) + continue; + + mdev->gid_cache[ i ]->gid_prefix = *(u64*)(out_mad->data + 8); + } + +out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +/* + * Destroy the cache + */ +static void mthca_cache_destroy(struct mthca_dev *mdev) +{ + int i; + for ( i = 0; i < mdev->ib_dev.phys_port_cnt; ++i ) { + kfree( mdev->pkey_cache[ i ] ); + kfree( mdev->gid_cache[ i ] ); + } +} + int mthca_register_device(struct mthca_dev *dev) { int ret; @@ -1365,6 +1505,12 @@ int mthca_register_device(struct mthca_d mutex_init(&dev->cap_mask_mutex); + ret = mthca_cache_init(dev); + if (ret) { + mthca_cache_destroy(dev); + return ret; + } + ret = ib_register_device(&dev->ib_dev); if (ret) return ret; @@ -1387,4 +1533,6 @@ void mthca_unregister_device(struct mthc { mthca_stop_catas_poll(dev); ib_unregister_device(&dev->ib_dev); + mthca_cache_destroy(dev); } + Index: b/include/rdma/ib_smi.h =================================================================== --- a/include/rdma/ib_smi.h 2007-05-07 14:28:47.576320426 +0300 +++ b/include/rdma/ib_smi.h 2007-05-07 14:28:55.367929269 +0300 @@ -42,7 +42,9 @@ #include #define IB_SMP_DATA_SIZE 64 -#define IB_SMP_MAX_PATH_HOPS 64 +#define IB_SMP_MAX_PATH_HOPS 64 +#define IB_SMP_NUM_PKEY_ENTRIES 32 +#define IB_SMP_NUM_GUID_ENTRIES 8 struct ib_smp { u8 base_version; Index: b/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- a/drivers/infiniband/hw/mthca/mthca_qp.c 2007-05-07 14:28:47.575320605 +0300 +++ b/drivers/infiniband/hw/mthca/mthca_qp.c 2007-05-07 14:28:55.369928912 +0300 @@ -41,7 +41,6 @@ #include #include -#include #include #include "mthca_dev.h" @@ -1485,10 +1484,10 @@ static int build_mlx_header(struct mthca sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE; sqp->ud_header.bth.solicited_event = 
!!(wr->send_flags & IB_SEND_SOLICITED); if (!sqp->qp.ibqp.qp_num) - ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port, + dev->ib_dev.query_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey); else - ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port, + dev->ib_dev.query_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey); sqp->ud_header.bth.pkey = cpu_to_be16(pkey); sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); Index: b/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- a/drivers/infiniband/hw/mthca/mthca_av.c 2007-05-07 14:28:47.575320605 +0300 +++ b/drivers/infiniband/hw/mthca/mthca_av.c 2007-05-07 14:28:55.369928912 +0300 @@ -37,7 +37,6 @@ #include #include -#include #include "mthca_dev.h" @@ -279,7 +278,7 @@ int mthca_read_ah(struct mthca_dev *dev, (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; header->grh.flow_label = ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); - ib_get_cached_gid(&dev->ib_dev, + dev->ib_dev.query_gid(&dev->ib_dev, be32_to_cpu(ah->av->port_pd) >> 24, ah->av->gid_index % dev->limits.gid_table_len, &header->grh.source_gid); From yosefe at voltaire.com Mon May 7 05:59:23 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Mon, 07 May 2007 15:59:23 +0300 Subject: [ofa-general] [PATCH 5/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <463F2121.5080803@voltaire.com> References: <463F2121.5080803@voltaire.com> Message-ID: <463F22AB.3090704@voltaire.com> mad: cache port lmc * Instead of using the ib cache, mad core will keep the up-to-date lmc of each port inside port_priv struct. It will be updated by incoming PORT_INFO mads. * use the uncached version of "query gid". This query will be cache-optimized in the provider level. 
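As a rough illustration of what the MAD core does with the per-port LMC, the following userspace C sketch mirrors two expressions from the patch: the path-bits comparison in rcv_has_same_gid() and the "data[34] & 0x7" extraction applied to incoming PORT_INFO SMPs. The function names path_bits_match() and lmc_from_port_info() are hypothetical and exist only for this sketch; the byte offset is taken from the patch as-is, not checked against the specification here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: with a non-zero LMC, only the low 'lmc' bits of
 * the path bits are significant, so two values match when they agree
 * on those bits. An LMC of 0 means there is nothing to compare.
 */
int path_bits_match(uint8_t lmc, uint8_t src_path_bits,
		    uint8_t dlid_path_bits)
{
	uint8_t mask;

	assert(lmc <= 7);	/* LMC is a 3-bit field */
	if (lmc == 0)
		return 1;

	mask = (uint8_t)((1u << lmc) - 1);
	return ((src_path_bits ^ dlid_path_bits) & mask) == 0;
}

/*
 * Hypothetical helper: pull the LMC out of a 64-byte PortInfo payload
 * the same way the patch does (low three bits of byte 34).
 */
uint8_t lmc_from_port_info(const uint8_t *data)
{
	return data[34] & 0x7;
}
```

Keeping the LMC in an atomic_t, as the patch does, lets the receive path read it without taking a lock; the sketch leaves concurrency out entirely.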
Signed-off-by: Yosef Etigin --- drivers/infiniband/core/mad.c | 31 +++++++++++++++++++++++++++---- drivers/infiniband/core/mad_priv.h | 1 + 2 files changed, 28 insertions(+), 4 deletions(-) Index: b/drivers/infiniband/core/mad.c =================================================================== --- a/drivers/infiniband/core/mad.c 2007-05-07 14:31:49.304874864 +0300 +++ b/drivers/infiniband/core/mad.c 2007-05-07 14:31:59.320086832 +0300 @@ -34,7 +34,6 @@ * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ */ #include -#include #include "mad_priv.h" #include "mad_rmpp.h" @@ -1707,13 +1706,12 @@ static inline int rcv_has_same_gid(struc if (!send_resp && rcv_resp) { /* is request/response. */ if (!(attr.ah_flags & IB_AH_GRH)) { - if (ib_get_cached_lmc(device, port_num, &lmc)) - return 0; + lmc = atomic_read(&mad_agent_priv->qp_info->port_priv->port_lmc); return (!lmc || !((attr.src_path_bits ^ rwc->wc->dlid_path_bits) & ((1 << lmc) - 1))); } else { - if (ib_get_cached_gid(device, port_num, + if (ib_query_gid(device, port_num, attr.grh.sgid_index, &sgid)) return 0; return !memcmp(sgid.raw, rwc->recv_buf.grh->dgid.raw, @@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; recv->header.recv_wc.recv_buf.grh = &recv->grh; + /* update our lmc cache with port info smps */ + if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) + && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) + { + atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); + } + if (atomic_read(&qp_info->snoop_count)) snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); @@ -2747,6 +2754,7 @@ static int ib_mad_port_open(struct ib_de { int ret, cq_size; struct ib_mad_port_private *port_priv; + struct ib_port_attr *port_attr; unsigned long flags; char name[sizeof 
"ib_mad123"]; @@ -2764,6 +2772,19 @@ static int ib_mad_port_open(struct ib_de init_mad_qp(port_priv, &port_priv->qp_info[0]); init_mad_qp(port_priv, &port_priv->qp_info[1]); + port_attr = kmalloc(sizeof *port_attr, GFP_KERNEL); + if (!port_attr) { + printk(KERN_ERR PFX "No memory for ib_port_attr\n"); + return -ENOMEM; + } + + if (ib_query_port(device, port_num, port_attr)) { + printk(KERN_ERR PFX "Couldn't query port %d\n", port_num); + ret = -EINVAL; + goto error2; + } + atomic_set(&port_priv->port_lmc, port_attr->lmc); + cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; port_priv->cq = ib_create_cq(port_priv->device, ib_mad_thread_completion_handler, @@ -2834,6 +2855,8 @@ error4: cleanup_recv_queue(&port_priv->qp_info[1]); cleanup_recv_queue(&port_priv->qp_info[0]); error3: + kfree(port_attr); +error2: kfree(port_priv); return ret; Index: b/drivers/infiniband/core/mad_priv.h =================================================================== --- a/drivers/infiniband/core/mad_priv.h 2007-05-07 14:32:34.000000000 +0300 +++ b/drivers/infiniband/core/mad_priv.h 2007-05-07 14:33:28.856102158 +0300 @@ -200,6 +200,7 @@ struct ib_mad_port_private { struct list_head port_list; struct ib_device *device; int port_num; + atomic_t port_lmc; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; From yosefe at voltaire.com Mon May 7 06:00:24 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Mon, 07 May 2007 16:00:24 +0300 Subject: [ofa-general] [PATCH 6/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <463F2121.5080803@voltaire.com> References: <463F2121.5080803@voltaire.com> Message-ID: <463F22E8.4020406@voltaire.com> core: remove the cache * Remove the core cache completely. 
This patch depends on the previous patches in the series, which remove the uses of this cache from: core, ipoib, srp, mthca, mad Signed-off-by: Yosef Etigin --- drivers/infiniband/core/cache.c | 398 ------------------------------------ include/rdma/ib_cache.h | 118 ---------- drivers/infiniband/core/Makefile | 2 drivers/infiniband/core/core_priv.h | 3 drivers/infiniband/core/device.c | 7 5 files changed, 1 insertion(+), 527 deletions(-) Index: b/drivers/infiniband/core/Makefile =================================================================== --- a/drivers/infiniband/core/Makefile 2007-05-06 09:33:50.000000000 +0300 +++ b/drivers/infiniband/core/Makefile 2007-05-07 14:28:56.187782888 +0300 @@ -8,7 +8,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) += $(user_access-y) ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ - device.o fmr_pool.o cache.o + device.o fmr_pool.o ib_mad-y := mad.o smi.o agent.o mad_rmpp.o Index: b/drivers/infiniband/core/core_priv.h =================================================================== --- a/drivers/infiniband/core/core_priv.h 2007-05-06 09:33:50.000000000 +0300 +++ b/drivers/infiniband/core/core_priv.h 2007-05-07 14:28:56.188782710 +0300 @@ -46,7 +46,4 @@ void ib_device_unregister_sysfs(struct i int ib_sysfs_setup(void); void ib_sysfs_cleanup(void); -int ib_cache_setup(void); -void ib_cache_cleanup(void); - #endif /* _CORE_PRIV_H */ Index: b/include/rdma/ib_cache.h =================================================================== --- a/include/rdma/ib_cache.h 2007-05-06 09:33:50.000000000 +0300 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,118 +0,0 @@ -/* - * Copyright (c) 2004 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Intel Corporation. All rights reserved. - * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses.
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $ - */ - -#ifndef _IB_CACHE_H -#define _IB_CACHE_H - -#include - -/** - * ib_get_cached_gid - Returns a cached GID table entry - * @device: The device to query. - * @port_num: The port number of the device to query. - * @index: The index into the cached GID table to query. - * @gid: The GID value found at the specified index. - * - * ib_get_cached_gid() fetches the specified GID table entry stored in - * the local software cache. - */ -int ib_get_cached_gid(struct ib_device *device, - u8 port_num, - int index, - union ib_gid *gid); - -/** - * ib_find_cached_gid - Returns the port number and GID table index where - * a specified GID value occurs. - * @device: The device to query. 
- * @gid: The GID value to search for. - * @port_num: The port number of the device where the GID value was found. - * @index: The index into the cached GID table where the GID was found. This - * parameter may be NULL. - * - * ib_find_cached_gid() searches for the specified GID value in - * the local software cache. - */ -int ib_find_cached_gid(struct ib_device *device, - union ib_gid *gid, - u8 *port_num, - u16 *index); - -/** - * ib_get_cached_pkey - Returns a cached PKey table entry - * @device: The device to query. - * @port_num: The port number of the device to query. - * @index: The index into the cached PKey table to query. - * @pkey: The PKey value found at the specified index. - * - * ib_get_cached_pkey() fetches the specified PKey table entry stored in - * the local software cache. - */ -int ib_get_cached_pkey(struct ib_device *device_handle, - u8 port_num, - int index, - u16 *pkey); - -/** - * ib_find_cached_pkey - Returns the PKey table index where a specified - * PKey value occurs. - * @device: The device to query. - * @port_num: The port number of the device to search for the PKey. - * @pkey: The PKey value to search for. - * @index: The index into the cached PKey table where the PKey was found. - * - * ib_find_cached_pkey() searches the specified PKey table in - * the local software cache. - */ -int ib_find_cached_pkey(struct ib_device *device, - u8 port_num, - u16 pkey, - u16 *index); - -/** - * ib_get_cached_lmc - Returns a cached lmc table entry - * @device: The device to query. - * @port_num: The port number of the device to query. - * @lmc: The lmc value for the specified port for that device. - * - * ib_get_cached_lmc() fetches the specified lmc table entry stored in - * the local software cache. 
- */ -int ib_get_cached_lmc(struct ib_device *device, - u8 port_num, - u8 *lmc); - -#endif /* _IB_CACHE_H */ Index: b/drivers/infiniband/core/cache.c =================================================================== --- a/drivers/infiniband/core/cache.c 2007-05-06 09:33:50.000000000 +0300 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,398 +0,0 @@ -/* - * Copyright (c) 2004 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Intel Corporation. All rights reserved. - * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. - * Copyright (c) 2005 Voltaire, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $ - */ - -#include -#include -#include - -#include - -#include "core_priv.h" - -struct ib_pkey_cache { - int table_len; - u16 table[0]; -}; - -struct ib_gid_cache { - int table_len; - union ib_gid table[0]; -}; - -struct ib_update_work { - struct work_struct work; - struct ib_device *device; - u8 port_num; -}; - -static inline int start_port(struct ib_device *device) -{ - return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; -} - -static inline int end_port(struct ib_device *device) -{ - return (device->node_type == RDMA_NODE_IB_SWITCH) ? - 0 : device->phys_port_cnt; -} - -int ib_get_cached_gid(struct ib_device *device, - u8 port_num, - int index, - union ib_gid *gid) -{ - struct ib_gid_cache *cache; - unsigned long flags; - int ret = 0; - - if (port_num < start_port(device) || port_num > end_port(device)) - return -EINVAL; - - read_lock_irqsave(&device->cache.lock, flags); - - cache = device->cache.gid_cache[port_num - start_port(device)]; - - if (index < 0 || index >= cache->table_len) - ret = -EINVAL; - else - *gid = cache->table[index]; - - read_unlock_irqrestore(&device->cache.lock, flags); - - return ret; -} -EXPORT_SYMBOL(ib_get_cached_gid); - -int ib_find_cached_gid(struct ib_device *device, - union ib_gid *gid, - u8 *port_num, - u16 *index) -{ - struct ib_gid_cache *cache; - unsigned long flags; - int p, i; - int ret = -ENOENT; - - *port_num = -1; - if (index) - *index = -1; - - read_lock_irqsave(&device->cache.lock, flags); - - for (p = 0; p <= end_port(device) - start_port(device); ++p) { - cache = device->cache.gid_cache[p]; - for (i = 0; i < cache->table_len; ++i) { - if (!memcmp(gid, &cache->table[i], sizeof *gid)) { - *port_num = p + start_port(device); - if (index) - *index = i; - ret = 0; - goto found; - } - } - } -found: - read_unlock_irqrestore(&device->cache.lock, flags); - - return ret; -} -EXPORT_SYMBOL(ib_find_cached_gid); - -int ib_get_cached_pkey(struct ib_device *device, - u8 
port_num, - int index, - u16 *pkey) -{ - struct ib_pkey_cache *cache; - unsigned long flags; - int ret = 0; - - if (port_num < start_port(device) || port_num > end_port(device)) - return -EINVAL; - - read_lock_irqsave(&device->cache.lock, flags); - - cache = device->cache.pkey_cache[port_num - start_port(device)]; - - if (index < 0 || index >= cache->table_len) - ret = -EINVAL; - else - *pkey = cache->table[index]; - - read_unlock_irqrestore(&device->cache.lock, flags); - - return ret; -} -EXPORT_SYMBOL(ib_get_cached_pkey); - -int ib_find_cached_pkey(struct ib_device *device, - u8 port_num, - u16 pkey, - u16 *index) -{ - struct ib_pkey_cache *cache; - unsigned long flags; - int i; - int ret = -ENOENT; - - if (port_num < start_port(device) || port_num > end_port(device)) - return -EINVAL; - - read_lock_irqsave(&device->cache.lock, flags); - - cache = device->cache.pkey_cache[port_num - start_port(device)]; - - *index = -1; - - for (i = 0; i < cache->table_len; ++i) - if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { - *index = i; - ret = 0; - break; - } - - read_unlock_irqrestore(&device->cache.lock, flags); - - return ret; -} -EXPORT_SYMBOL(ib_find_cached_pkey); - -int ib_get_cached_lmc(struct ib_device *device, - u8 port_num, - u8 *lmc) -{ - unsigned long flags; - int ret = 0; - - if (port_num < start_port(device) || port_num > end_port(device)) - return -EINVAL; - - read_lock_irqsave(&device->cache.lock, flags); - *lmc = device->cache.lmc_cache[port_num - start_port(device)]; - read_unlock_irqrestore(&device->cache.lock, flags); - - return ret; -} -EXPORT_SYMBOL(ib_get_cached_lmc); - -static void ib_cache_update(struct ib_device *device, - u8 port) -{ - struct ib_port_attr *tprops = NULL; - struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache; - struct ib_gid_cache *gid_cache = NULL, *old_gid_cache; - int i; - int ret; - - tprops = kmalloc(sizeof *tprops, GFP_KERNEL); - if (!tprops) - return; - - ret = ib_query_port(device, port, tprops); - if (ret) { - 
printk(KERN_WARNING "ib_query_port failed (%d) for %s\n", - ret, device->name); - goto err; - } - - pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * - sizeof *pkey_cache->table, GFP_KERNEL); - if (!pkey_cache) - goto err; - - pkey_cache->table_len = tprops->pkey_tbl_len; - - gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len * - sizeof *gid_cache->table, GFP_KERNEL); - if (!gid_cache) - goto err; - - gid_cache->table_len = tprops->gid_tbl_len; - - for (i = 0; i < pkey_cache->table_len; ++i) { - ret = ib_query_pkey(device, port, i, pkey_cache->table + i); - if (ret) { - printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n", - ret, device->name, i); - goto err; - } - } - - for (i = 0; i < gid_cache->table_len; ++i) { - ret = ib_query_gid(device, port, i, gid_cache->table + i); - if (ret) { - printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n", - ret, device->name, i); - goto err; - } - } - - write_lock_irq(&device->cache.lock); - - old_pkey_cache = device->cache.pkey_cache[port - start_port(device)]; - old_gid_cache = device->cache.gid_cache [port - start_port(device)]; - - device->cache.pkey_cache[port - start_port(device)] = pkey_cache; - device->cache.gid_cache [port - start_port(device)] = gid_cache; - - device->cache.lmc_cache[port - start_port(device)] = tprops->lmc; - - write_unlock_irq(&device->cache.lock); - - kfree(old_pkey_cache); - kfree(old_gid_cache); - kfree(tprops); - return; - -err: - kfree(pkey_cache); - kfree(gid_cache); - kfree(tprops); -} - -static void ib_cache_task(struct work_struct *_work) -{ - struct ib_update_work *work = - container_of(_work, struct ib_update_work, work); - - ib_cache_update(work->device, work->port_num); - kfree(work); -} - -static void ib_cache_event(struct ib_event_handler *handler, - struct ib_event *event) -{ - struct ib_update_work *work; - - if (event->event == IB_EVENT_PORT_ERR || - event->event == IB_EVENT_PORT_ACTIVE || - event->event == IB_EVENT_LID_CHANGE || 
- event->event == IB_EVENT_PKEY_CHANGE || - event->event == IB_EVENT_SM_CHANGE || - event->event == IB_EVENT_CLIENT_REREGISTER) { - work = kmalloc(sizeof *work, GFP_ATOMIC); - if (work) { - INIT_WORK(&work->work, ib_cache_task); - work->device = event->device; - work->port_num = event->element.port_num; - schedule_work(&work->work); - } - } -} - -static void ib_cache_setup_one(struct ib_device *device) -{ - int p; - - rwlock_init(&device->cache.lock); - - device->cache.pkey_cache = - kmalloc(sizeof *device->cache.pkey_cache * - (end_port(device) - start_port(device) + 1), GFP_KERNEL); - device->cache.gid_cache = - kmalloc(sizeof *device->cache.gid_cache * - (end_port(device) - start_port(device) + 1), GFP_KERNEL); - - device->cache.lmc_cache = kmalloc(sizeof *device->cache.lmc_cache * - (end_port(device) - - start_port(device) + 1), - GFP_KERNEL); - - if (!device->cache.pkey_cache || !device->cache.gid_cache || - !device->cache.lmc_cache) { - printk(KERN_WARNING "Couldn't allocate cache " - "for %s\n", device->name); - goto err; - } - - for (p = 0; p <= end_port(device) - start_port(device); ++p) { - device->cache.pkey_cache[p] = NULL; - device->cache.gid_cache [p] = NULL; - ib_cache_update(device, p + start_port(device)); - } - - INIT_IB_EVENT_HANDLER(&device->cache.event_handler, - device, ib_cache_event); - if (ib_register_event_handler(&device->cache.event_handler)) - goto err_cache; - - return; - -err_cache: - for (p = 0; p <= end_port(device) - start_port(device); ++p) { - kfree(device->cache.pkey_cache[p]); - kfree(device->cache.gid_cache[p]); - } - -err: - kfree(device->cache.pkey_cache); - kfree(device->cache.gid_cache); - kfree(device->cache.lmc_cache); -} - -static void ib_cache_cleanup_one(struct ib_device *device) -{ - int p; - - ib_unregister_event_handler(&device->cache.event_handler); - flush_scheduled_work(); - - for (p = 0; p <= end_port(device) - start_port(device); ++p) { - kfree(device->cache.pkey_cache[p]); - kfree(device->cache.gid_cache[p]); 
- } - - kfree(device->cache.pkey_cache); - kfree(device->cache.gid_cache); - kfree(device->cache.lmc_cache); -} - -static struct ib_client cache_client = { - .name = "cache", - .add = ib_cache_setup_one, - .remove = ib_cache_cleanup_one -}; - -int __init ib_cache_setup(void) -{ - return ib_register_client(&cache_client); -} - -void __exit ib_cache_cleanup(void) -{ - ib_unregister_client(&cache_client); -} Index: b/drivers/infiniband/core/device.c =================================================================== --- a/drivers/infiniband/core/device.c 2007-05-07 14:28:54.229132596 +0300 +++ b/drivers/infiniband/core/device.c 2007-05-07 14:28:56.192781996 +0300 @@ -696,18 +696,11 @@ static int __init ib_core_init(void) if (ret) printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); - ret = ib_cache_setup(); - if (ret) { - printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n"); - ib_sysfs_cleanup(); - } - return ret; } static void __exit ib_core_cleanup(void) { - ib_cache_cleanup(); ib_sysfs_cleanup(); } From halr at voltaire.com Mon May 7 06:13:06 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 07 May 2007 09:13:06 -0400 Subject: [ofa-general] Re: [PATCH TRIVIAL] opensm: remove unneeded run-time check In-Reply-To: <20070506174138.GI9692@sashak.voltaire.com> References: <20070506174138.GI9692@sashak.voltaire.com> Message-ID: <1178543540.32222.350561.camel@hal.voltaire.com> On Sun, 2007-05-06 at 13:41, Sasha Khapyorsky wrote: > remove unneeded run-time NULL pointer check (followed free() is not > under this check anyway). > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (to master only). 
-- Hal From halr at voltaire.com Mon May 7 06:16:21 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 07 May 2007 09:16:21 -0400 Subject: [ofa-general] Re: [PATCH TRIVIAL] opensm: make osm_node_destroy() static In-Reply-To: <20070506174431.GJ9692@sashak.voltaire.com> References: <20070506174138.GI9692@sashak.voltaire.com> <20070506174431.GJ9692@sashak.voltaire.com> Message-ID: <1178543690.32222.350646.camel@hal.voltaire.com> On Sun, 2007-05-06 at 13:44, Sasha Khapyorsky wrote: > This makes locally used osm_node_destroy() function static > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (to master only). Isn't the same applicable to the other osm_xxx_destroy functions? If so, shouldn't they also be made static? -- Hal From mst at dev.mellanox.co.il Mon May 7 06:26:02 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 7 May 2007 16:26:02 +0300 Subject: [ofa-general] Re: [PATCH 1/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <463F218A.7030400@voltaire.com> References: <463F2121.5080803@voltaire.com> <463F218A.7030400@voltaire.com> Message-ID: <20070507132602.GG29350@mellanox.co.il> > Quoting Yosef Etigin : > Subject: [PATCH 1/6 v2] fix pkey change handling and remove the cahce > > core: uncached "find gid" and "find pkey" queries > > * Add ib_find_gid and ib_find_pkey over possibly blocking device > queries. Before I look into this deeper, a note on submission format: please do not use the same subject for all patches in the series. For example, the subject of this one should have been: Subject: [PATCH 1/6 v2] IB/core: add uncached "find gid" and "find pkey" queries And then there won't be a need to add '*' before the actual description. -- MST From mst at dev.mellanox.co.il Mon May 7 06:31:40 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 7 May 2007 16:31:40 +0300 Subject: [ofa-general] Re: [PATCH 4/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <463F226E.9040700@voltaire.com> References: <463F2121.5080803@voltaire.com> <463F226E.9040700@voltaire.com> Message-ID: <20070507133140.GH29350@mellanox.co.il> > @@ -139,6 +139,9 @@ static void smp_snoop(struct ib_device * > event.element.port_num = port_num; > ib_dispatch_event(&event); > } > + > + /* update cache with the incoming mad */ Please don't add such comments: name mthca_cache_update is clear enough. > + mthca_cache_update(to_mdev(ibdev), port_num, mad); > } > } This will generate the event first, update cache after this, right? If so, there is still a window where e.g. ipoib gets a pkey change event, performs a query and gets a stale value. -- MST From yosefe at voltaire.com Mon May 7 06:41:27 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Mon, 07 May 2007 16:41:27 +0300 Subject: [ofa-general] Re: [PATCH 4/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <20070507133140.GH29350@mellanox.co.il> References: <463F2121.5080803@voltaire.com> <463F226E.9040700@voltaire.com> <20070507133140.GH29350@mellanox.co.il> Message-ID: <463F2C87.7090002@voltaire.com> Michael S. Tsirkin wrote: >>@@ -139,6 +139,9 @@ static void smp_snoop(struct ib_device * >> event.element.port_num = port_num; >> ib_dispatch_event(&event); >> } >>+ >>+ /* update cache with the incoming mad */ > > > Please don't add such comments: name mthca_cache_update is clear enough. > > >>+ mthca_cache_update(to_mdev(ibdev), port_num, mad); >> } >> } > > > This will generate the event first, update cache after this, right? > If so, there is still a window where e.g. ipoib gets > a pkey change event, performs a query and gets a stale value. > Right. Did you find more issues in this patch? From mst at dev.mellanox.co.il Mon May 7 06:50:30 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 7 May 2007 16:50:30 +0300 Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <463F22AB.3090704@voltaire.com> References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com> Message-ID: <20070507135030.GI29350@mellanox.co.il> > @@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str > recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; > recv->header.recv_wc.recv_buf.grh = &recv->grh; > > + /* update our lmc cache with port info smps */ > + if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || > + recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > + && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) > + && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) > + { > + atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); > + } > + > if (atomic_read(&qp_info->snoop_count)) > snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); > Why is this an atomic? The comment does not seem to tell us anything useful. Remove it? These 8 lines seem to violate coding style rules in at least 3 different ways::) -- MST From tziporet at dev.mellanox.co.il Mon May 7 06:59:04 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 07 May 2007 16:59:04 +0300 Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout In-Reply-To: <1178540310.10759.9.camel@mtls03> References: <1178538772.10759.2.camel@mtls03> <20070507115714.GC29350@mellanox.co.il> <1178540310.10759.9.camel@mtls03> Message-ID: <463F30A8.5050005@mellanox.co.il> Eli Cohen wrote: > On Mon, 2007-05-07 at 15:04 +0300, Michael S. Tsirkin wrote: > >> How likely is this to help in practice? >> >> > Like I said, when the system is very busy. In this case the command may > actually complete very soon but wait_for_completion_timeout() will > nevertheless return zero since the task did not get CPU time before the > specified timeout expired. 
In this case we would like to check if done > is signaled and thus not fail the command. > This is not a theoretical issue - we actually saw this problem here with some tests. Tziporet From mst at dev.mellanox.co.il Mon May 7 07:06:19 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 7 May 2007 17:06:19 +0300 Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout In-Reply-To: <463F30A8.5050005@mellanox.co.il> References: <1178538772.10759.2.camel@mtls03> <20070507115714.GC29350@mellanox.co.il> <1178540310.10759.9.camel@mtls03> <463F30A8.5050005@mellanox.co.il> Message-ID: <20070507140619.GK29350@mellanox.co.il> > Quoting Tziporet Koren : > Subject: Re: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout > > Eli Cohen wrote: > >On Mon, 2007-05-07 at 15:04 +0300, Michael S. Tsirkin wrote: > > > >>How likely is this to help in practice? > >> > >> > >Like I said, when the system is very busy. In this case the command may > >actually complete very soon but wait_for_completion_timeout() will > >nevertheless return zero since the task did not get CPU time before the > >specified timeout expired. In this case we would like to check if done > >is signaled and thus not fail the command. > > > This is not a theoretical issue - we actually saw this problem here with > some tests. I wonder whether this is applicable to mthca as well then. -- MST From rdreier at cisco.com Mon May 7 07:12:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 07 May 2007 07:12:47 -0700 Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout In-Reply-To: <1178538772.10759.2.camel@mtls03> (Eli Cohen's message of "Mon, 07 May 2007 14:52:22 +0300") References: <1178538772.10759.2.camel@mtls03> Message-ID: > When the system is busy it may happen that the command actually > completed but it took more than the specified timeout till the > task executing the command was actually given CPU time.
This test > checks that the completion is really missing before failing. > + if (!wait_for_completion_timeout(&context->done, msecs_to_jiffies(timeout))) > + if (!context->done.done) { > + err = -EBUSY; > + goto out; > + } This seems more like a bug in wait_for_completion_timeout(). Anyway, it's definitely not OK to poke inside the definition of struct completion in driver code, so we need to find a different way to solve this. BTW the same completion handling code is in mthca -- is this also a problem there? - R. From yosefe at voltaire.com Mon May 7 07:18:51 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Mon, 07 May 2007 17:18:51 +0300 Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <20070507135030.GI29350@mellanox.co.il> References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com> <20070507135030.GI29350@mellanox.co.il> Message-ID: <463F354B.8030908@voltaire.com> Michael S. Tsirkin wrote: >>@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str >> recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; >> recv->header.recv_wc.recv_buf.grh = &recv->grh; >> >>+ /* update our lmc cache with port info smps */ >>+ if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || >>+ recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) >>+ && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) >>+ && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) >>+ { >>+ atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); >>+ } >>+ >> if (atomic_read(&qp_info->snoop_count)) >> snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); >> > > > Why is this an atomic? I thought there might be a race between this and where we read the lmc (rcv_has_same_gid) > The comment does not seem to tell us anything useful. Remove it? 
> These 8 lines seem to violate coding style rules in at least 3 different ways::) > if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); Is that better? From eli at mellanox.co.il Mon May 7 07:22:50 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 07 May 2007 17:22:50 +0300 Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout In-Reply-To: References: <1178538772.10759.2.camel@mtls03> Message-ID: <1178547800.10759.13.camel@mtls03> On Mon, 2007-05-07 at 07:12 -0700, Roland Dreier wrote: > > When the system is busy it may happen that the command actually > > completed but it took more than the specified timeout till the > > task executing the command was actually given CPU time. This test > > checks that the completion is really missing before failing. > > > + if (!wait_for_completion_timeout(&context->done, msecs_to_jiffies(timeout))) > > + if (!context->done.done) { > > + err = -EBUSY; > > + goto out; > > + } > > This seems more like a bug in wait_for_completion_timeout(). Anyway, > it's definitely not OK to poke inside the definition of struct > completion in driver code, so we need to find a different way to solve > this. > I agree that wait_for_completion_timeout() should give an indication of this special case, but it does not. The only way we can know of such a situation is by poking into done or poking into the EQ, which is worse. > BTW the same completion handling code is in mthca -- is this also a > problem there? > We saw this with the mthca port for connectx. > - R.
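The point debated in this thread (decide success from the completion flag, not from the timer) can be sketched in userspace. This is only an illustration under stated assumptions: POSIX threads stand in for the kernel completion API, and `sketch_completion`, `sketch_init`, `sketch_complete` and `sketch_wait_timeout` are hypothetical names, not functions from the kernel, mthca or mlx4.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical userspace stand-in for a completion object. */
struct sketch_completion {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	unsigned int    done;	/* number of pending completions */
};

static void sketch_init(struct sketch_completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = 0;
}

/* Signal one completion to a waiter. */
static void sketch_complete(struct sketch_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done++;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/*
 * Wait up to timeout_ms milliseconds.
 * Returns 0 if a completion was consumed, -ETIMEDOUT otherwise.
 */
static int sketch_wait_timeout(struct sketch_completion *c, long timeout_ms)
{
	struct timespec deadline;
	int ret = 0;

	assert(timeout_ms >= 0);

	/* Absolute deadline; the default condvar clock is CLOCK_REALTIME. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec  += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&c->lock);
	while (!c->done) {
		/* Loop again on spurious wakeups, stop once the deadline passes. */
		if (pthread_cond_timedwait(&c->cond, &c->lock, &deadline) == ETIMEDOUT)
			break;
	}
	/*
	 * Re-check the flag under the lock: a waiter that was scheduled late
	 * still reports success if the completion arrived in the meantime.
	 */
	if (c->done)
		c->done--;
	else
		ret = -ETIMEDOUT;
	pthread_mutex_unlock(&c->lock);

	return ret;
}
```

With this shape the caller never needs to look inside the completion object; a completion that arrived before the waiter ran is reported as success even when the timeout budget has already expired.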
> From svenar at simula.no Mon May 7 07:42:54 2007 From: svenar at simula.no (svenar at simula.no) Date: Mon, 7 May 2007 16:42:54 +0200 (CEST) Subject: [ofa-general] ibdiagnet credit checks Message-ID: <37241.192.9.112.188.1178548974.squirrel@webmail.uio.no> Hi, I have a question regarding ibdiagnet and credit loop checking. In debug.tcl there seem to be two different credit checks: # report credit loops ibdmCalcMinHopTables $fabric set roots [ibdmFindRootNodesByMinHop $fabric] if {[llength $roots]} { inform "-I-reporting:found.roots" $roots ibdmReportNonUpDownCa2CaPaths $fabric $roots } else { ibdmAnalyzeLoops $fabric } What is the difference between these two checks? From a brief inspection of the relevant code I would guess that "ibdmReportNonUpDownCa2CaPaths" checks the routing table for violations of the UpDown rule, while "ibdmAnalyzeLoops" checks the routing table for cyclic dependencies. Is this correct? Best regards, Sven-Arne From halr at voltaire.com Mon May 7 07:47:28 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 07 May 2007 10:47:28 -0400 Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <463F354B.8030908@voltaire.com> References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com> <20070507135030.GI29350@mellanox.co.il> <463F354B.8030908@voltaire.com> Message-ID: <1178549162.32222.355374.camel@hal.voltaire.com> Hi Yosef, On Mon, 2007-05-07 at 10:18, Yosef Etigin wrote: > Michael S.
Tsirkin wrote: > >>@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str > >> recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; > >> recv->header.recv_wc.recv_buf.grh = &recv->grh; > >> > >>+ /* update our lmc cache with port info smps */ > >>+ if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || > >>+ recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > >>+ && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) > >>+ && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) > >>+ { > >>+ atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); > >>+ } > >>+ > >> if (atomic_read(&qp_info->snoop_count)) > >> snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); > >> > > > > > > Why is this an atomic? > > I thought there might be a race between this and where we read the lmc (rcv_has_same_gid) > > > The comment does not seem to tell us anything useful. Remove it? > > These 8 lines seem to violate coding style rules in at least 3 different ways::) > > > if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || > recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) > && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) > atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); Should at least a #define be used for smp.data[34], if not a struct, so it is clearer what is going on here? I haven't yet had a chance to look at the rest of the patch. -- Hal > is that better? > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at dev.mellanox.co.il Mon May 7 07:53:20 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 7 May 2007 17:53:20 +0300 Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout In-Reply-To: References: <1178538772.10759.2.camel@mtls03> Message-ID: <20070507145320.GA15275@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH] IB/mlx4 mlx4_ib: commands timeout > > > When the system is busy it may happen that the command actually > > completed but it took more than the specified timeout till the > > task executing the command was actually given CPU time. This test > > checks that the completion is really missing before failing. > > > + if (!wait_for_completion_timeout(&context->done, msecs_to_jiffies(timeout))) > > + if (!context->done.done) { > > + err = -EBUSY; > > + goto out; > > + } > > This seems more like a bug in wait_for_completion_timeout(). Anyway, > it's definitely not OK to poke inside the definition of struct > completion in driver code, so we need to find a different way to solve > this. > > BTW the same completion handling code is in mthca -- is this also a > problem there? Google gave me this: http://lkml.org/lkml/2007/3/1/156 so it seems a similar problem was observed in mthca. Thomas, Ingo, I think you were the ones to propose wait_for_completion_timeout()/wait_for_completion_interruptible_timeout(): would it make sense to change these functions to return -ETIMEDOUT on timeout, 0 on success? No one seems to use the actual timeout value, as far as I can see. Would something like the following, untested, patch, make sense? Signed-off-by: Michael S.
Tsirkin -- diff --git a/include/linux/completion.h b/include/linux/completion.h index 268c5a4..84360c8 100644 --- a/include/linux/completion.h +++ b/include/linux/completion.h @@ -44,11 +44,10 @@ static inline void init_completion(struct completion *x) extern void FASTCALL(wait_for_completion(struct completion *)); extern int FASTCALL(wait_for_completion_interruptible(struct completion *x)); -extern unsigned long FASTCALL(wait_for_completion_timeout(struct completion *x, - unsigned long timeout)); -extern unsigned long FASTCALL(wait_for_completion_interruptible_timeout( - struct completion *x, unsigned long timeout)); - +extern int FASTCALL(wait_for_completion_timeout(struct completion *x, + unsigned long timeout)); +extern int FASTCALL(wait_for_completion_interruptible_timeout(struct completion *x, + unsigned long timeout)); extern void FASTCALL(complete(struct completion *)); extern void FASTCALL(complete_all(struct completion *)); diff --git a/kernel/sched.c b/kernel/sched.c index b9a6837..5ee3df6 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -3661,9 +3661,10 @@ void fastcall __sched wait_for_completion(struct completion *x) } EXPORT_SYMBOL(wait_for_completion); -unsigned long fastcall __sched +int fastcall __sched wait_for_completion_timeout(struct completion *x, unsigned long timeout) { + int ret = 0; might_sleep(); spin_lock_irq(&x->wait.lock); @@ -3672,22 +3673,24 @@ wait_for_completion_timeout(struct completion *x, unsigned long timeout) wait.flags |= WQ_FLAG_EXCLUSIVE; __add_wait_queue_tail(&x->wait, &wait); - do { + for (;;) { __set_current_state(TASK_UNINTERRUPTIBLE); spin_unlock_irq(&x->wait.lock); timeout = schedule_timeout(timeout); spin_lock_irq(&x->wait.lock); + if (x->done) + break; if (!timeout) { - __remove_wait_queue(&x->wait, &wait); - goto out; + ret = -ETIMEDOUT; + break; } - } while (!x->done); + } __remove_wait_queue(&x->wait, &wait); } x->done--; out: spin_unlock_irq(&x->wait.lock); - return timeout; + return ret; } 
EXPORT_SYMBOL(wait_for_completion_timeout); @@ -3724,10 +3727,12 @@ out: } EXPORT_SYMBOL(wait_for_completion_interruptible); -unsigned long fastcall __sched +int fastcall __sched wait_for_completion_interruptible_timeout(struct completion *x, unsigned long timeout) { + int ret = 0; + might_sleep(); spin_lock_irq(&x->wait.lock); @@ -3736,7 +3741,7 @@ wait_for_completion_interruptible_timeout(struct completion *x, wait.flags |= WQ_FLAG_EXCLUSIVE; __add_wait_queue_tail(&x->wait, &wait); - do { + for (;;) { if (signal_pending(current)) { timeout = -ERESTARTSYS; __remove_wait_queue(&x->wait, &wait); @@ -3746,17 +3751,19 @@ wait_for_completion_interruptible_timeout(struct completion *x, spin_unlock_irq(&x->wait.lock); timeout = schedule_timeout(timeout); spin_lock_irq(&x->wait.lock); + if (x->done) + break; if (!timeout) { - __remove_wait_queue(&x->wait, &wait); - goto out; + ret = -ETIMEDOUT; + break; } - } while (!x->done); + } __remove_wait_queue(&x->wait, &wait); } x->done--; out: spin_unlock_irq(&x->wait.lock); - return timeout; + return ret; } EXPORT_SYMBOL(wait_for_completion_interruptible_timeout); -- MST From mst at dev.mellanox.co.il Mon May 7 07:56:53 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 7 May 2007 17:56:53 +0300 Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <463F354B.8030908@voltaire.com> References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com> <20070507135030.GI29350@mellanox.co.il> <463F354B.8030908@voltaire.com> Message-ID: <20070507145653.GB15275@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: [PATCH 5/6 v2] fix pkey change handling and remove the cahce > > Michael S. 
Tsirkin wrote: > >>@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str > >> recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; > >> recv->header.recv_wc.recv_buf.grh = &recv->grh; > >> > >>+ /* update our lmc cache with port info smps */ > >>+ if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || > >>+ recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > >>+ && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) > >>+ && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) > >>+ { > >>+ atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); > >>+ } > >>+ > >> if (atomic_read(&qp_info->snoop_count)) > >> snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); > >> > > > > > > Why is this an atomic? > > I thought there might be a race between this and where we read the lmc (rcv_has_same_gid) Aren't all incoming MADs on a port handled over a single threaded WQ? And how would atomics help? > > The comment does not seem to tell us anything useful. Remove it? > > These 8 lines seem to violate coding style rules in at least 3 different ways::) > > > if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || > recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) > && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) > atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); > > is that better? Move && to the end of each line, and kill the extra () around single comparisons. 
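For illustration, the condition reworked along those lines might look roughly like the sketch below. This is a standalone example and not the patch itself: the `ex_` names, the trimmed-down header struct and the constant values are stand-ins assumed for the example (the real definitions come from the kernel's IB MAD headers, where attr_id is big-endian), and the helper for the LMC bits only follows the `data[34] & 0x7` expression used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in values for illustration only; real code should use the
 * definitions from the kernel's IB MAD headers. */
#define EX_MGMT_CLASS_SUBN_LID_ROUTED      0x01
#define EX_MGMT_CLASS_SUBN_DIRECTED_ROUTE  0x81
#define EX_MGMT_METHOD_SET                 0x02
#define EX_SMP_ATTR_PORT_INFO              0x0015

/* Offset and mask taken from the patch under review (data[34] & 0x7). */
#define EX_PORT_INFO_LMC_OFFSET  34
#define EX_PORT_INFO_LMC_MASK    0x7

/* Hypothetical, trimmed-down view of the header fields the test reads. */
struct ex_mad_hdr {
	uint8_t  mgmt_class;
	uint8_t  method;
	uint16_t attr_id;	/* host byte order here, for simplicity */
};

/* Extract the LMC bits from a PortInfo attribute payload. */
static inline uint8_t ex_lmc_from_port_info(const uint8_t *data)
{
	assert(data != NULL);
	return data[EX_PORT_INFO_LMC_OFFSET] & EX_PORT_INFO_LMC_MASK;
}

/* Operators at the end of each line, no parentheses around single
 * comparisons. */
static inline int ex_is_port_info_set(const struct ex_mad_hdr *hdr)
{
	return (hdr->mgmt_class == EX_MGMT_CLASS_SUBN_LID_ROUTED ||
		hdr->mgmt_class == EX_MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
	       hdr->attr_id == EX_SMP_ATTR_PORT_INFO &&
	       hdr->method == EX_MGMT_METHOD_SET;
}
```

With helpers of this shape the receive path would reduce to a single `if (ex_is_port_info_set(hdr))` followed by a plain assignment of `ex_lmc_from_port_info(data)`, which also gives the magic offset a name.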
-- MST From yosefe at voltaire.com Mon May 7 08:09:46 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Mon, 07 May 2007 18:09:46 +0300 Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <20070507145653.GB15275@mellanox.co.il> References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com> <20070507135030.GI29350@mellanox.co.il> <463F354B.8030908@voltaire.com> <20070507145653.GB15275@mellanox.co.il> Message-ID: <463F413A.5020308@voltaire.com> Michael S. Tsirkin wrote: >>Quoting Yosef Etigin : >>Subject: Re: [PATCH 5/6 v2] fix pkey change handling and remove the cahce >> >>Michael S. Tsirkin wrote: >> >>>>@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str >>>> recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; >>>> recv->header.recv_wc.recv_buf.grh = &recv->grh; >>>> >>>>+ /* update our lmc cache with port info smps */ >>>>+ if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || >>>>+ recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) >>>>+ && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) >>>>+ && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) >>>>+ { >>>>+ atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); >>>>+ } >>>>+ >>>> if (atomic_read(&qp_info->snoop_count)) >>>> snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); >>>> >>> >>> >>>Why is this an atomic? >> >>I thought there might be a race between this and where we read the lmc (rcv_has_same_gid) > > > Aren't all incoming MADs on a port handled over a single threaded WQ? > And how would atomics help? > Yes. not atomic any more. > >>>The comment does not seem to tell us anything useful. Remove it? 
>>>These 8 lines seem to violate coding style rules in at least 3 different ways::) >>> >> >> if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || >> recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) >> && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) >> && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) >> atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); >> >>is that better? > > > Move && to the end of each line, and kill the extra () around single comparisons. > ok. From yosefe at voltaire.com Mon May 7 08:12:19 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Mon, 07 May 2007 18:12:19 +0300 Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <1178549162.32222.355374.camel@hal.voltaire.com> References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com> <20070507135030.GI29350@mellanox.co.il> <463F354B.8030908@voltaire.com> <1178549162.32222.355374.camel@hal.voltaire.com> Message-ID: <463F41D3.4050603@voltaire.com> Hal Rosenstock wrote: > Hi Yosef, > > On Mon, 2007-05-07 at 10:18, Yosef Etigin wrote: > >>Michael S. Tsirkin wrote: >> >>>>@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str >>>> recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; >>>> recv->header.recv_wc.recv_buf.grh = &recv->grh; >>>> >>>>+ /* update our lmc cache with port info smps */ >>>>+ if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || >>>>+ recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) >>>>+ && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) >>>>+ && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) >>>>+ { >>>>+ atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); >>>>+ } >>>>+ >>>> if (atomic_read(&qp_info->snoop_count)) >>>> snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); >>>> >>> >>> >>>Why is this an atomic? 
>> >>I thought there might be a race between this and where we read the lmc (rcv_has_same_gid) >> >> >>>The comment does not seem to tell us anything useful. Remove it? >>>These 8 lines seem to violate coding style rules in at least 3 different ways::) >>> >> >> if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || >> recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) >> && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) >> && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) >> atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); > > > Should at least a #define be used for smp.data[34} if not a struct so it > is clearer what is going on here ? > you mean something like: #define LMC_FROM_PORT_INFO(data) ( ( (char*)(data) )[34] & 0x07 ) ? > I haven't yet had a chance to look at the rest of the patch. > > -- Hal > > >>is that better? >>_______________________________________________ >>general mailing list >>general at lists.openfabrics.org >>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From eli at mellanox.co.il Mon May 7 08:20:33 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 07 May 2007 18:20:33 +0300 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts Message-ID: <1178551555.17477.0.camel@mtls03> In order to prevent losing interrupts, all EQs must be rearmed whenever an interrupt occurs, regardless if that interrupt is generated for the EQ or not. 
Signed-off-by: Eli Cohen --- Index: connectx_kernel/drivers/net/mlx4/eq.c =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/eq.c 2007-05-07 12:32:35.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/eq.c 2007-05-07 17:33:09.000000000 +0300 @@ -266,13 +266,17 @@ static irqreturn_t mlx4_interrupt(int ir { struct mlx4_dev *dev = dev_ptr; struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_eq *eq; int work = 0; int i; writel(priv->eq_table.clr_mask, priv->eq_table.clr_int); - for (i = 0; i < MLX4_EQ_CATAS; ++i) - work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]); + for (i = 0; i < MLX4_EQ_CATAS; ++i) { + eq = &priv->eq_table.eq[i]; + work |= mlx4_eq_int(dev, eq); + eq_set_ci(eq, 1); + } return IRQ_RETVAL(work); } From mst at dev.mellanox.co.il Mon May 7 08:30:25 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 7 May 2007 18:30:25 +0300 Subject: [ofa-general] Re: [PATCH 3/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <463F2237.7050809@voltaire.com> References: <463F2121.5080803@voltaire.com> <463F2237.7050809@voltaire.com> Message-ID: <20070507153025.GD15275@mellanox.co.il> All in all, this patch tries to do many things at once. I wonder whether you can split the patch in 2: fix the pkey change case separately, and remove pkey polling separately. > Quoting Yosef Etigin : > Subject: [PATCH 3/6 v2] fix pkey change handling and remove the cahce > > ipoib: handle pkey change events > > This issue was found during partitioning & SM fail over testing. > > * added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike > * fixed a bug in device extraction from the work struct > * removed some warnings in case they are caused due to missing PKEY as > this seems like a valid flow now. This seems to remove a useful tool for debugging invalid pkeys. Why is this a valid flow now? 
> * Assume that the cache is coherent - do not retry on cache queries > * Restart child interfaces first before parent Why? Is this related to pkey change somehow? > * Remove the pkey polling thread and pkey delayed initiallization > * If an interface is brought up but pkey is not found, mark it with > IPOIB_PKEY_NEEDED and when a pkey event arrives, try to restart it. > > SM reconfiguration or failover possibly causes a shuffling of the values in the port > pkey table. The current implementation only queries for the index of the pkey once, > when it creates the device QP and after that moves it into working state, and hence > does not address this scenario. Fix this by using the PKEY_CHANGE event as a trigger > to reconfigure the device QP. > > > Signed-off-by: Moni Levy > Signed-off-by: Yosef Etigin > --- > drivers/infiniband/ulp/ipoib/ipoib.h | 10 - > drivers/infiniband/ulp/ipoib/ipoib_ib.c | 144 ++++++++----------------- > drivers/infiniband/ulp/ipoib/ipoib_main.c | 11 - > drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 + > drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 21 +-- > 5 files changed, 76 insertions(+), 121 deletions(-) > > Index: b/drivers/infiniband/ulp/ipoib/ipoib.h > =================================================================== > --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-06 09:26:08.000000000 +0300 > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-06 09:26:18.000000000 +0300 > @@ -80,7 +80,7 @@ enum { > IPOIB_FLAG_INITIALIZED = 1, > IPOIB_FLAG_ADMIN_UP = 2, > IPOIB_PKEY_ASSIGNED = 3, > - IPOIB_PKEY_STOP = 4, > + IPOIB_PKEY_NEEDED = 4, > IPOIB_FLAG_SUBINTERFACE = 5, > IPOIB_MCAST_RUN = 6, > IPOIB_STOP_REAPER = 7, > @@ -202,9 +202,9 @@ struct ipoib_dev_priv { > struct list_head multicast_list; > struct rb_root multicast_tree; > > - struct delayed_work pkey_task; > struct delayed_work mcast_task; > struct work_struct flush_task; > + struct work_struct pkey_task; > struct work_struct restart_task; > struct delayed_work 
ah_reap_task; > > @@ -333,12 +333,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( > > int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); > void ipoib_ib_dev_flush(struct work_struct *work); > +void ipoib_pkey_event(struct work_struct *work); > void ipoib_ib_dev_cleanup(struct net_device *dev); > > int ipoib_ib_dev_open(struct net_device *dev); > int ipoib_ib_dev_up(struct net_device *dev); > int ipoib_ib_dev_down(struct net_device *dev, int flush); > -int ipoib_ib_dev_stop(struct net_device *dev); > +int ipoib_ib_dev_stop(struct net_device *dev, int flush); > > int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); > void ipoib_dev_cleanup(struct net_device *dev); > @@ -384,9 +385,6 @@ void ipoib_event(struct ib_event_handler > int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); > int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); > > -void ipoib_pkey_poll(struct work_struct *work); > -int ipoib_pkey_dev_delay_open(struct net_device *dev); > - > #ifdef CONFIG_INFINIBAND_IPOIB_CM > > #define IPOIB_FLAGS_RC 0x80 > Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > =================================================================== > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-06 09:26:17.000000000 +0300 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-06 09:26:26.000000000 +0300 > @@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device > ret = ipoib_ib_post_receives(dev); > if (ret) { > ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); > - ipoib_ib_dev_stop(dev); > + ipoib_ib_dev_stop(dev, 1); > return -1; > } > > ret = ipoib_cm_dev_open(dev); > if (ret) { > ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); > - ipoib_ib_dev_stop(dev); > + ipoib_ib_dev_stop(dev, 1); > return -1; > } > > @@ -441,28 +441,10 @@ int ipoib_ib_dev_open(struct net_device > return 0; > } > > -static void ipoib_pkey_dev_check_presence(struct net_device *dev) > -{ > - 
struct ipoib_dev_priv *priv = netdev_priv(dev); > - u16 pkey_index = 0; > - > - if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) > - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > - else > - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > -} > - > int ipoib_ib_dev_up(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > > - ipoib_pkey_dev_check_presence(dev); > - > - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { > - ipoib_dbg(priv, "PKEY is not assigned.\n"); > - return 0; > - } > - > set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); > > return ipoib_mcast_start_thread(dev); > @@ -477,16 +459,6 @@ int ipoib_ib_dev_down(struct net_device > clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); > netif_carrier_off(dev); > > - /* Shutdown the P_Key thread if still active */ > - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { > - mutex_lock(&pkey_mutex); > - set_bit(IPOIB_PKEY_STOP, &priv->flags); > - cancel_delayed_work(&priv->pkey_task); > - mutex_unlock(&pkey_mutex); > - if (flush) > - flush_workqueue(ipoib_workqueue); > - } > - > ipoib_mcast_stop_thread(dev, flush); > ipoib_mcast_dev_flush(dev); > > @@ -508,7 +480,7 @@ static int recvs_pending(struct net_devi > return pending; > } > > -int ipoib_ib_dev_stop(struct net_device *dev) > +int ipoib_ib_dev_stop(struct net_device *dev, int flush) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct ib_qp_attr qp_attr; > @@ -581,7 +553,8 @@ timeout: > /* Wait for all AHs to be reaped */ > set_bit(IPOIB_STOP_REAPER, &priv->flags); > cancel_delayed_work(&priv->ah_reap_task); > - flush_workqueue(ipoib_workqueue); > + if (flush) > + flush_workqueue(ipoib_workqueue); > > begin = jiffies; > > @@ -622,14 +595,33 @@ int ipoib_ib_dev_init(struct net_device > return 0; > } > > -void ipoib_ib_dev_flush(struct work_struct *work) > +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp) > { > - struct ipoib_dev_priv *cpriv, *priv = > - container_of(work, struct 
ipoib_dev_priv, flush_task); > + struct ipoib_dev_priv *cpriv; > struct net_device *dev = priv->dev; > > - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { > - ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); > + mutex_lock(&priv->vlan_mutex); > + > + /* Flush any child interfaces */ > + list_for_each_entry(cpriv, &priv->child_intfs, list) > + __ipoib_ib_dev_flush(cpriv, restart_qp); > + > + mutex_unlock(&priv->vlan_mutex); > + > + /* > + * If the device is not initiallized since it needs a pkey - > + * try to reopen it > + */ > + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { > + > + if (restart_qp > + && test_bit(IPOIB_PKEY_NEEDED, &priv->flags) > + && test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { > + /* this iface needs pkey, try to bring it up */ > + ipoib_open(priv->dev); > + } > + else > + ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); > return; > } Clean up the above please. > @@ -642,6 +634,12 @@ void ipoib_ib_dev_flush(struct work_stru > > ipoib_ib_dev_down(dev, 0); > > + if (restart_qp) { > + if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) > + ipoib_ib_dev_stop(dev, 0); > + ipoib_ib_dev_open(dev); > + } > + > /* > * The device could have been brought down between the start and when > * we get here, don't bring it back up if it's not configured up I find these if (restart_qp) branches somewhat confusing. Why is this flag tested in 2 places? > @@ -650,14 +648,25 @@ void ipoib_ib_dev_flush(struct work_stru > ipoib_ib_dev_up(dev); > ipoib_mcast_restart_task(&priv->restart_task); > } > +} > > - mutex_lock(&priv->vlan_mutex); > +void ipoib_ib_dev_flush(struct work_struct *work) > +{ > + struct ipoib_dev_priv *priv = > + container_of(work, struct ipoib_dev_priv, flush_task); > + /* we only restart the QP in case of pkey change event */ Kill the comment please. 
> + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); > + __ipoib_ib_dev_flush(priv, 0); > +} > > - /* Flush any child interfaces too */ > - list_for_each_entry(cpriv, &priv->child_intfs, list) > - ipoib_ib_dev_flush(&cpriv->flush_task); > +void ipoib_pkey_event(struct work_struct *work) > +{ > + struct ipoib_dev_priv *priv = > + container_of(work, struct ipoib_dev_priv, pkey_task); > > - mutex_unlock(&priv->vlan_mutex); > + /* restart the QP in case of pkey change event */ > + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); Kill the comment please. > + __ipoib_ib_dev_flush(priv, 1); > } > > void ipoib_ib_dev_cleanup(struct net_device *dev) > @@ -672,54 +681,3 @@ void ipoib_ib_dev_cleanup(struct net_dev > ipoib_transport_dev_cleanup(dev); > } > > -/* > - * Delayed P_Key Assigment Interim Support > - * > - * The following is initial implementation of delayed P_Key assigment > - * mechanism. It is using the same approach implemented for the multicast > - * group join. The single goal of this implementation is to quickly address > - * Bug #2507. This implementation will probably be removed when the P_Key > - * change async notification is available. 
> - */ > - > -void ipoib_pkey_poll(struct work_struct *work) > -{ > - struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, pkey_task.work); > - struct net_device *dev = priv->dev; > - > - ipoib_pkey_dev_check_presence(dev); > - > - if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) > - ipoib_open(dev); > - else { > - mutex_lock(&pkey_mutex); > - if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, > - &priv->pkey_task, > - HZ); > - mutex_unlock(&pkey_mutex); > - } > -} > - > -int ipoib_pkey_dev_delay_open(struct net_device *dev) > -{ > - struct ipoib_dev_priv *priv = netdev_priv(dev); > - > - /* Look for the interface pkey value in the IB Port P_Key table and */ > - /* set the interface pkey assigment flag */ > - ipoib_pkey_dev_check_presence(dev); > - > - /* P_Key value not assigned yet - start polling */ > - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { > - mutex_lock(&pkey_mutex); > - clear_bit(IPOIB_PKEY_STOP, &priv->flags); > - queue_delayed_work(ipoib_workqueue, > - &priv->pkey_task, > - HZ); > - mutex_unlock(&pkey_mutex); > - return 1; > - } > - > - return 0; > -} > > Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c > =================================================================== > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-06 09:26:08.000000000 +0300 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-06 09:26:18.000000000 +0300 > @@ -100,14 +100,11 @@ int ipoib_open(struct net_device *dev) > > set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); > > - if (ipoib_pkey_dev_delay_open(dev)) > - return 0; > - > if (ipoib_ib_dev_open(dev)) > - return -EINVAL; > + return test_bit(IPOIB_PKEY_NEEDED, &priv->flags) ? 
0 : -EINVAL; > > if (ipoib_ib_dev_up(dev)) { > - ipoib_ib_dev_stop(dev); > + ipoib_ib_dev_stop(dev, 1); > return -EINVAL; > } > > @@ -152,7 +149,7 @@ static int ipoib_stop(struct net_device > flush_workqueue(ipoib_workqueue); > > ipoib_ib_dev_down(dev, 1); > - ipoib_ib_dev_stop(dev); > + ipoib_ib_dev_stop(dev, 1); > > if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { > struct ipoib_dev_priv *cpriv; > @@ -990,7 +987,7 @@ static void ipoib_setup(struct net_devic > INIT_LIST_HEAD(&priv->dead_ahs); > INIT_LIST_HEAD(&priv->multicast_list); > > - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); > + INIT_WORK(&priv->pkey_task, ipoib_pkey_event); > INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); > INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); > INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); > Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > =================================================================== > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-05-06 09:26:08.000000000 +0300 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-05-06 09:26:18.000000000 +0300 > @@ -232,9 +232,10 @@ static int ipoib_mcast_join_finish(struc > ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), > &mcast->mcmember.mgid); > if (ret < 0) { > - ipoib_warn(priv, "couldn't attach QP to multicast group " > - IPOIB_GID_FMT "\n", > - IPOIB_GID_ARG(mcast->mcmember.mgid)); > + if (ret != -ENXIO) /* No pkey found */ > + ipoib_warn(priv, "couldn't attach QP to multicast group " > + IPOIB_GID_FMT "\n", > + IPOIB_GID_ARG(mcast->mcmember.mgid)); > > clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); > return ret; > @@ -312,7 +313,7 @@ ipoib_mcast_sendonly_join_complete(int s > status = ipoib_mcast_join_finish(mcast, &multicast->rec); > > if (status) { > - if (mcast->logcount++ < 20) > + if (mcast->logcount++ < 20 && status != -ENXIO) > ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " > IPOIB_GID_FMT ", 
status %d\n", > IPOIB_GID_ARG(mcast->mcmember.mgid), status); > @@ -420,7 +421,7 @@ static int ipoib_mcast_join_complete(int > ", status %d\n", > IPOIB_GID_ARG(mcast->mcmember.mgid), > status); > - } else { > + } else if (status != -ENXIO) { > ipoib_warn(priv, "multicast join failed for " > IPOIB_GID_FMT ", status %d\n", > IPOIB_GID_ARG(mcast->mcmember.mgid), > > Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > =================================================================== > --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-06 09:26:08.000000000 +0300 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-06 09:26:18.000000000 +0300 > @@ -33,8 +33,6 @@ > * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ > */ > > -#include > - > #include "ipoib.h" > > int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) > @@ -49,12 +47,12 @@ int ipoib_mcast_attach(struct net_device > if (!qp_attr) > goto out; > > - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { > - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { > + clear_bit(IPOIB_PKEY_NEEDED, &priv->flags); > ret = -ENXIO; > goto out; > } > - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > + set_bit(IPOIB_PKEY_NEEDED, &priv->flags); > > /* set correct QKey for QP */ > qp_attr->qkey = priv->qkey; > @@ -103,12 +101,12 @@ int ipoib_init_qp(struct net_device *dev > * The port has to be assigned to the respective IB partition in > * advance. 
> */ > - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); > + ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); > if (ret) { > - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > + set_bit(IPOIB_PKEY_NEEDED, &priv->flags); > return ret; > } > - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > + clear_bit(IPOIB_PKEY_NEEDED, &priv->flags); > > qp_attr.qp_state = IB_QPS_INIT; > qp_attr.qkey = 0; > @@ -238,7 +236,7 @@ void ipoib_transport_dev_cleanup(struct > ipoib_warn(priv, "ib_qp_destroy failed\n"); > > priv->qp = NULL; > - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > + clear_bit(IPOIB_PKEY_NEEDED, &priv->flags); > } > > if (ib_destroy_cq(priv->cq)) > @@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler > container_of(handler, struct ipoib_dev_priv, event_handler); > > if ((record->event == IB_EVENT_PORT_ERR || > - record->event == IB_EVENT_PKEY_CHANGE || > record->event == IB_EVENT_PORT_ACTIVE || > record->event == IB_EVENT_LID_CHANGE || > record->event == IB_EVENT_SM_CHANGE || > @@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler > record->element.port_num == priv->port) { > ipoib_dbg(priv, "Port state change event\n"); > queue_work(ipoib_workqueue, &priv->flush_task); > + } else if (record->event == IB_EVENT_PKEY_CHANGE && > + record->element.port_num == priv->port) { > + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); > + queue_work(ipoib_workqueue, &priv->pkey_task); > } > } -- MST From halr at voltaire.com Mon May 7 09:04:38 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 07 May 2007 12:04:38 -0400 Subject: [ofa-general] Re: [PATCH] opensm: consolidate CA and router PortInfo receiving code In-Reply-To: <20070506200013.GL9692@sashak.voltaire.com> References: <20070506200013.GL9692@sashak.voltaire.com> Message-ID: <1178553448.32222.358968.camel@hal.voltaire.com> On Sun, 2007-05-06 at 16:00, Sasha Khapyorsky wrote: > Consolidate CA and router PortInfo receiving processing code. 
> > Signed-off-by: Sasha Khapyorsky Thanks. Applied (to master only). -- Hal From halr at voltaire.com Mon May 7 09:16:07 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 07 May 2007 12:16:07 -0400 Subject: [ofa-general] Re: [PATCH TRIVIAL] opensm: trivial osm_port cleanups In-Reply-To: <20070506181937.GK9692@sashak.voltaire.com> References: <20070506181937.GK9692@sashak.voltaire.com> Message-ID: <1178554553.32222.359968.camel@hal.voltaire.com> On Sun, 2007-05-06 at 14:19, Sasha Khapyorsky wrote: > This removes non-meanful osm_port_construct() and osm_port_destroy() > functions and makes static locally used osm_port_init(). > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (to master only). -- Hal From halr at voltaire.com Mon May 7 09:21:33 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 07 May 2007 12:21:33 -0400 Subject: [ofa-general] Re: [PATCH 5/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <463F41D3.4050603@voltaire.com> References: <463F2121.5080803@voltaire.com> <463F22AB.3090704@voltaire.com> <20070507135030.GI29350@mellanox.co.il> <463F354B.8030908@voltaire.com> <1178549162.32222.355374.camel@hal.voltaire.com> <463F41D3.4050603@voltaire.com> Message-ID: <1178554880.32222.360219.camel@hal.voltaire.com> On Mon, 2007-05-07 at 11:12, Yosef Etigin wrote: > Hal Rosenstock wrote: > > Hi Yosef, > > > > On Mon, 2007-05-07 at 10:18, Yosef Etigin wrote: > > > >>Michael S. 
Tsirkin wrote: > >> > >>>>@@ -1865,6 +1863,15 @@ static void ib_mad_recv_done_handler(str > >>>> recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; > >>>> recv->header.recv_wc.recv_buf.grh = &recv->grh; > >>>> > >>>>+ /* update our lmc cache with port info smps */ > >>>>+ if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || > >>>>+ recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > >>>>+ && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) > >>>>+ && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) > >>>>+ { > >>>>+ atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); > >>>>+ } > >>>>+ > >>>> if (atomic_read(&qp_info->snoop_count)) > >>>> snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); > >>>> > >>> > >>> > >>>Why is this an atomic? > >> > >>I thought there might be a race between this and where we read the lmc (rcv_has_same_gid) > >> > >> > >>>The comment does not seem to tell us anything useful. Remove it? > >>>These 8 lines seem to violate coding style rules in at least 3 different ways::) > >>> > >> > >> if ((recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || > >> recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > >> && (recv->mad.mad.mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) > >> && (recv->mad.mad.mad_hdr.method == IB_MGMT_METHOD_SET)) > >> atomic_set(&port_priv->port_lmc, recv->mad.smp.data[34] & 0x7); > > > > > > Should at least a #define be used for smp.data[34} if not a struct so it > > is clearer what is going on here ? > > > > you mean something like: > #define LMC_FROM_PORT_INFO(data) ( ( (char*)(data) )[34] & 0x07 ) ? Yes, something along those lines at a minimum. -- Hal > > I haven't yet had a chance to look at the rest of the patch. > > > > -- Hal > > > > > >>is that better? 
From rdreier at cisco.com Mon May 7 09:40:28 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 07 May 2007 09:40:28 -0700 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts In-Reply-To: <1178551555.17477.0.camel@mtls03> (Eli Cohen's message of "Mon, 07 May 2007 18:20:33 +0300") References: <1178551555.17477.0.camel@mtls03> Message-ID: Thanks... should we optimize out the if (eqes_found) eq_set_ci(eq, 1); at the end of mlx4_eq_int() now? Actually the best fix is probably: diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c index 8d641b8..acf1c80 100644 --- a/drivers/net/mlx4/eq.c +++ b/drivers/net/mlx4/eq.c @@ -249,8 +249,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq) } } - if (eqes_found) - eq_set_ci(eq, 1); + eq_set_ci(eq, 1); return eqes_found; } because it seems sort of strange if we ever don't rearm the EQ on an MSI-X interrupt. What do you think? On the other hand, this patch (and your patch) rearms the EQ on shared interrupts for other devices too. Can't be helped I guess. - R. From Kapil.Dukle at med.ge.com Mon May 7 09:41:28 2007 From: Kapil.Dukle at med.ge.com (Dukle, Kapil (GE Healthcare)) Date: Mon, 7 May 2007 12:41:28 -0400 Subject: [ofa-general] Infiniband data transfer across servers w/ different IB drivers Message-ID: Hi, I am currently experimenting with Infiniband data transfers across servers with different operating systems. Is it possible for two servers with different Infiniband drivers (and OS) to communicate for data transfers - as in the example below. 
Server A runs VxWorks and uses SBS IB driver modules and APIs Server B runs Linux and uses OFED 1.0 drivers and APIs - Is it possible for these servers to transfer data across Infiniband the way they are currently set up? OR - Would I need to update Server A to have the OFED 1.0 IB drivers? Let me know if you need any more information that might help answer these questions... Thanks, From boris at mellanox.com Mon May 7 09:45:08 2007 From: boris at mellanox.com (Boris Shpolyansky) Date: Mon, 7 May 2007 09:45:08 -0700 Subject: [ofa-general] Infiniband data transfer across servers w/ differentIB drivers In-Reply-To: Message-ID: <1E3DCD1C63492545881FACB6063A57C1D524ED@mtiexch01.mti.com> I am not familiar with the SBS IB driver for VxWorks, but in general any IB compliant HCA should talk with any other IB compliant switch/HCA regardless of the driver implementation. Make sure to run an SM on one of the ends to enable link establishment. Boris ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Dukle, Kapil (GE Healthcare) Sent: Monday, May 07, 2007 9:41 AM To: openib-general at openib.org Subject: [ofa-general] Infiniband data transfer across servers w/ differentIB drivers Hi, I am currently experimenting with Infiniband data transfers across servers with different operating systems. Is it possible for two servers with different Infiniband drivers (and OS) to communicate for data transfers - as in the example below. Server A runs VxWorks and uses SBS IB driver modules and APIs Server B runs Linux and uses OFED 1.0 drivers and APIs - Is it possible for these servers to transfer data across Infiniband the way they are currently set up? OR - Would I need to update Server A to have the OFED 1.0 IB drivers? Let me know if you need any more information that might help answer these questions... 
Thanks, From swise at opengridcomputing.com Mon May 7 09:49:36 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 07 May 2007 11:49:36 -0500 Subject: [ofa-general] RE: man pages for the rdma-cm In-Reply-To: <000001c7905e$9a562190$95fd070a@amr.corp.intel.com> References: <000001c7905e$9a562190$95fd070a@amr.corp.intel.com> Message-ID: <1178556576.30571.79.camel@stevo-desktop> On Sun, 2007-05-06 at 21:17 -0700, Sean Hefty wrote: > >Are there man pages for the rdma-cm in the pipeline? I think it would > >be great (requirement?) to have these for ofed-1.2 since we do have the > >other verbs man pages. > > I've added man pages for the APIs and test programs to my master and ofed_1_2 > branches. If anyone gets a chance, I'd appreciate someone looking them over. I > plan on requesting that they be pulled into the rc3 release. > > - Sean Hey Sean, the pages look good! Here are a few comments. Consider them for inclusion, but what you've done so far is a great start. - are the events described anywhere? Maybe they should be described in rdma_get_cm_event? - rdma_accept / rdma_connect: describe the conn_param fields. - rdma_bind_addr: binding to port 0 will cause the rdma-cm to select an available port. - no pages for get_src_port/get_dst_port - rdma_connect - "connected" and "unconnected" when discussing cm_ids is misleading. Perhaps "reliable connection" vs "unreliable datagram"? - rdma_create_event_channel: it would be nice to mention that the fd can be used like any other fd (made non blocking, poll()/select()able, etc). - rdma_disconnect - for iWARP connections, this initiates an RDMAC Verbs "normal close". If the connection was properly quiesced by the application, then the QP will end up back in IDLE, but if the connection was not quiesced, then the connection will be terminated and the QP will end up in ERROR. Dunno if we want to describe this in detail? 
- Also, it might be nice to have some sort of overview man page that maps the expected event flows for connection setup and teardown. Maybe 'man rdmacm' gets you some overview? Steve. From koen.segers at vrt.be Mon May 7 09:49:22 2007 From: koen.segers at vrt.be (Koen Segers) Date: Mon, 07 May 2007 18:49:22 +0200 Subject: [ofa-general] DDR and SDR Message-ID: <1178556562.8727.3.camel@KOEN> A simple question: Is it possible to connect an SDR HCA to a DDR switch? If so, what happens with the data that is sent from a DDR HCA to the SDR HCA? Regards, Koen From yosefe at voltaire.com Mon May 7 09:54:07 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Mon, 07 May 2007 19:54:07 +0300 Subject: [ofa-general] Re: [PATCH 3/6 v2] fix pkey change handling and remove the cahce In-Reply-To: <20070507153025.GD15275@mellanox.co.il> References: <463F2121.5080803@voltaire.com> <463F2237.7050809@voltaire.com> <20070507153025.GD15275@mellanox.co.il> Message-ID: <463F59AF.70501@voltaire.com> Michael S. Tsirkin wrote: > All in all, this patch tries to do many things at once. I wonder whether you > can split the patch in 2: fix the pkey change case separately, and remove pkey > polling separately. > > I'm not sure it's necessary. What I had in mind is that the polling was created since we did not handle events, so now that we handle them we should update the way ipoib handles pkey changes. >>Quoting Yosef Etigin : >>Subject: [PATCH 3/6 v2] fix pkey change handling and remove the cahce >> >>ipoib: handle pkey change events >> >>This issue was found during partitioning & SM fail over testing. 
>> >> * added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike >> * fixed a bug in device extraction from the work struct >> * removed some warnings in case they are caused due to missing PKEY as >> this seems like a valid flow now. > > > This seems to remove a useful tool for debugging invalid pkeys. > Why is this a valid flow now? > > restored to previous state. >> * Assume that the cache is coherent - do not retry on cache queries >> * Restart child interfaces first before parent > > > Why? Is this related to pkey change somehow? > comment removed. >>@@ -642,6 +634,12 @@ void ipoib_ib_dev_flush(struct work_stru >> >> ipoib_ib_dev_down(dev, 0); >> >>+ if (restart_qp) { >>+ if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) >>+ ipoib_ib_dev_stop(dev, 0); >>+ ipoib_ib_dev_open(dev); >>+ } >>+ >> /* >> * The device could have been brought down between the start and when >> * we get here, don't bring it back up if it's not configured up > > > I find these if (restart_qp) branches somewhat confusing. > Why is this flag tested in 2 places? > > first test - open devices that need a pkey only from restart_qp flow second - restart or not, at all. these and rest of the comments are applied below. ipoib: handle pkey change events This issue was found during partitioning & SM fail over testing. * added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike * fixed a bug in device extraction from the work struct * Restart child interfaces first before parent * Remove the pkey polling thread and pkey delayed initiallization * If an interface is brought up but pkey is not found, mark it with IPOIB_PKEY_NEEDED and when a pkey event arrives, try to restart it. SM reconfiguration or failover possibly causes a shuffling of the values in the port pkey table. The current implementation only queries for the index of the pkey once, when it creates the device QP and after that moves it into working state, and hence does not address this scenario. 
Fix this by using the PKEY_CHANGE event as a trigger to reconfigure the device QP. Signed-off-by: Moni Levy Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 10 -- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 140 +++++++++-------------------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 11 -- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 21 ++-- 4 files changed, 66 insertions(+), 116 deletions(-) Index: b/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-07 15:42:23.262692889 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-07 15:43:05.685154318 +0300 @@ -80,7 +80,7 @@ enum { IPOIB_FLAG_INITIALIZED = 1, IPOIB_FLAG_ADMIN_UP = 2, IPOIB_PKEY_ASSIGNED = 3, - IPOIB_PKEY_STOP = 4, + IPOIB_PKEY_NEEDED = 4, IPOIB_FLAG_SUBINTERFACE = 5, IPOIB_MCAST_RUN = 6, IPOIB_STOP_REAPER = 7, @@ -202,9 +202,9 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; struct delayed_work mcast_task; struct work_struct flush_task; + struct work_struct pkey_task; struct work_struct restart_task; struct delayed_work ah_reap_task; @@ -333,12 +333,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); @@ -384,9 +385,6 @@ void ipoib_event(struct ib_event_handler int ipoib_vlan_add(struct net_device *pdev, 
unsigned short pkey); int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); -void ipoib_pkey_poll(struct work_struct *work); -int ipoib_pkey_dev_delay_open(struct net_device *dev); - #ifdef CONFIG_INFINIBAND_IPOIB_CM #define IPOIB_FLAGS_RC 0x80 Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-07 15:43:05.074262877 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-07 19:48:28.843156398 +0300 @@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device ret = ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -441,28 +441,10 @@ int ipoib_ib_dev_open(struct net_device return 0; } -static void ipoib_pkey_dev_check_presence(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - u16 pkey_index = 0; - - if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); - else - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); -} - int ipoib_ib_dev_up(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - ipoib_pkey_dev_check_presence(dev); - - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { - ipoib_dbg(priv, "PKEY is not assigned.\n"); - return 0; - } - set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); return ipoib_mcast_start_thread(dev); @@ -477,16 +459,6 @@ int ipoib_ib_dev_down(struct net_device clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); netif_carrier_off(dev); - /* Shutdown the P_Key thread if still active */ - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { - mutex_lock(&pkey_mutex); - set_bit(IPOIB_PKEY_STOP, &priv->flags); - 
cancel_delayed_work(&priv->pkey_task); - mutex_unlock(&pkey_mutex); - if (flush) - flush_workqueue(ipoib_workqueue); - } - ipoib_mcast_stop_thread(dev, flush); ipoib_mcast_dev_flush(dev); @@ -508,7 +480,7 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +553,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,14 +595,30 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct *work) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { - ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); + mutex_lock(&priv->vlan_mutex); + + /* Flush any child interfaces */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, restart_qp); + + mutex_unlock(&priv->vlan_mutex); + + /* + * If the device is not initiallized since it needs a pkey - + * try to reopen it + */ + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { + if (restart_qp && + test_bit(IPOIB_PKEY_NEEDED, &priv->flags) && + test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_open(priv->dev); + else + ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); return; } @@ -642,6 +631,12 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_down(dev, 0); + if (restart_qp) { + if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) + ipoib_ib_dev_stop(dev, 0); + 
ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +645,24 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task); - /* Flush any child interfaces too */ - list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - mutex_unlock(&priv->vlan_mutex); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_task); + + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void ipoib_ib_dev_cleanup(struct net_device *dev) @@ -672,54 +677,3 @@ void ipoib_ib_dev_cleanup(struct net_dev ipoib_transport_dev_cleanup(dev); } -/* - * Delayed P_Key Assigment Interim Support - * - * The following is initial implementation of delayed P_Key assigment - * mechanism. It is using the same approach implemented for the multicast - * group join. The single goal of this implementation is to quickly address - * Bug #2507. This implementation will probably be removed when the P_Key - * change async notification is available. 
- */ - -void ipoib_pkey_poll(struct work_struct *work) -{ - struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); - struct net_device *dev = priv->dev; - - ipoib_pkey_dev_check_presence(dev); - - if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) - ipoib_open(dev); - else { - mutex_lock(&pkey_mutex); - if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) - queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, - HZ); - mutex_unlock(&pkey_mutex); - } -} - -int ipoib_pkey_dev_delay_open(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - - /* Look for the interface pkey value in the IB Port P_Key table and */ - /* set the interface pkey assigment flag */ - ipoib_pkey_dev_check_presence(dev); - - /* P_Key value not assigned yet - start polling */ - if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { - mutex_lock(&pkey_mutex); - clear_bit(IPOIB_PKEY_STOP, &priv->flags); - queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, - HZ); - mutex_unlock(&pkey_mutex); - return 1; - } - - return 0; -} Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-07 15:42:23.101721494 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-07 15:43:05.687153963 +0300 @@ -100,14 +100,11 @@ int ipoib_open(struct net_device *dev) set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); - if (ipoib_pkey_dev_delay_open(dev)) - return 0; - if (ipoib_ib_dev_open(dev)) - return -EINVAL; + return test_bit(IPOIB_PKEY_NEEDED, &priv->flags) ? 
0 : -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +149,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ -990,7 +987,7 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-07 15:42:23.387670681 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-07 15:43:05.688153785 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ -#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,12 +47,12 @@ int ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_NEEDED, &priv->flags); ret = -ENXIO; goto out; } - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + set_bit(IPOIB_PKEY_NEEDED, &priv->flags); /* set correct QKey for QP */ qp_attr->qkey = priv->qkey; @@ -103,12 +101,12 @@ int ipoib_init_qp(struct net_device *dev * The port has to be assigned to the respective IB partition in * advance. 
*/ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); + ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); if (ret) { - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + set_bit(IPOIB_PKEY_NEEDED, &priv->flags); return ret; } - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + clear_bit(IPOIB_PKEY_NEEDED, &priv->flags); qp_attr.qp_state = IB_QPS_INIT; qp_attr.qkey = 0; @@ -238,7 +236,7 @@ void ipoib_transport_dev_cleanup(struct ipoib_warn(priv, "ib_qp_destroy failed\n"); priv->qp = NULL; - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + clear_bit(IPOIB_PKEY_NEEDED, &priv->flags); } if (ib_destroy_cq(priv->cq)) @@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE || @@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler record->element.port_num == priv->port) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE && + record->element.port_num == priv->port) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_task); } } From mhagen at iol.unh.edu Mon May 7 09:58:26 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Mon, 7 May 2007 12:58:26 -0400 (EDT) Subject: [ofa-general] Re: [PATCH] infiniband: add userspace support for invalidate stag In-Reply-To: <53312.132.177.125.178.1178307563.squirrel@postal.iol.unh.edu> References: <53312.132.177.125.178.1178307563.squirrel@postal.iol.unh.edu> Message-ID: <60316.132.177.125.178.1178557106.squirrel@postal.iol.unh.edu> Add userspace support for iWARP verbs Send w/ INV and Send w/ SE and INV. 
Signed-off-by: Mikkel Hagen --- linux-2.6.21.1/include/rdma/ib_user_verbs.h 2007-05-02 15:35:13.000000000 -0400 +++ linux-2.6.21.1/include/rdma/ib_user_verbs.h 2007-05-02 15:53:40.000000000 -0400 @@ -553,6 +553,10 @@ struct ib_uverbs_send_wr { __u32 remote_qkey; __u32 reserved; } ud; + struct { + __u32 rkey; + __u32 reserved; + } invalidate; } wr; }; -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From mhagen at iol.unh.edu Mon May 7 09:59:59 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Mon, 7 May 2007 12:59:59 -0400 (EDT) Subject: [ofa-general] Re: [PATCH] infiniband: add userspace support for invalidate stag In-Reply-To: <53313.132.177.125.178.1178307596.squirrel@postal.iol.unh.edu> References: <53313.132.177.125.178.1178307596.squirrel@postal.iol.unh.edu> Message-ID: <47431.132.177.125.178.1178557199.squirrel@postal.iol.unh.edu> Add userspace support for iWARP verbs Send w/ INV and Send w/ SE and INV. 
Signed-off-by: Mikkel Hagen --- linux-2.6.21.1/drivers/infiniband/core/uverbs_cmd.c 2007-05-04 14:25:50.000000000 -0400 +++ linux-2.6.21.1/drivers/infiniband/core/uverbs_cmd.c 2007-05-04 14:47:42.000000000 -0400 @@ -1507,6 +1507,12 @@ ssize_t ib_uverbs_post_send(struct ib_uv next->wr.atomic.swap = user_wr->wr.atomic.swap; next->wr.atomic.rkey = user_wr->wr.atomic.rkey; break; + case IB_WR_SEND: + if(next->send_flags & IB_SEND_INVALIDATE) { + next->wr.invalidate.rkey = + user_wr->wr.invalidate.rkey; + } + break; default: break; } -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From todd.rimmer at qlogic.com Mon May 7 10:00:23 2007 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Mon, 7 May 2007 12:00:23 -0500 Subject: [ofa-general] DDR and SDR In-Reply-To: <1178556562.8727.3.camel@KOEN> Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE06119251D38@EPEXCH2.qlogic.org> > From: Koen Segers > Sent: Monday, May 07, 2007 12:49 PM > To: openib-general at openib.org > Subject: [ofa-general] DDR and SDR > > A simple question: > > Is it possible to connect a SDR HCA to a DDR switch? Yes, at the time of Link Layer training, the link speed and width are negotiated down to the highest common speed/width. Hence when an SDR HCA is connected to a DDR switch, the HCA's link and the corresponding switch port will run at SDR speeds. > If so, what happens with the data that is send from a DDR HCA to the SDR > HCA? In IB every Path Record, Multicast group and Address Vector has a "static rate". This represents the speed of the path between 2 nodes. When a DDR HCA sends to an SDR HCA, it should have obtained a static rate from the SA showing a 10Gb/s rate (for a 4x SDR path). In which case, the DDR HCA will pace its sending to not exceed SDR speeds. 
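The two rules described above (each link trains down to the highest common speed and width, and the static rate of a path is bounded by its slowest link) can be sketched as a small calculation. This is only an illustration under stated assumptions: the types and function names below are hypothetical and belong to no IB stack, each port is assumed to support every width and speed at or below its maximum, and rates are signaling rates in Mb/s using 2.5 Gb/s per lane for SDR and 5 Gb/s per lane for DDR.

```c
#include <assert.h>

/* Hypothetical types, for illustration only. Speed is a multiple of the
 * 2.5 Gb/s per-lane SDR signaling rate. */
enum link_speed { SPEED_SDR = 1, SPEED_DDR = 2 };

struct link_caps {
	int width;             /* number of lanes: 1, 4, 8 or 12 */
	enum link_speed speed; /* fastest per-lane speed supported */
};

/* Link training: both ends settle on the highest common width and speed.
 * Assumes each port also supports everything below its maximum. */
static struct link_caps negotiate_link(struct link_caps a, struct link_caps b)
{
	struct link_caps result;

	result.width = a.width < b.width ? a.width : b.width;
	result.speed = a.speed < b.speed ? a.speed : b.speed;
	return result;
}

/* Signaling rate of a trained link in Mb/s (4x SDR gives 10000). */
static int link_rate_mbps(struct link_caps link)
{
	return 2500 * link.width * (int)link.speed;
}

/* A path can go no faster than its slowest link; a sender would pace
 * itself to this rate. */
static int path_static_rate_mbps(const struct link_caps *links, int count)
{
	int i, rate;

	assert(links != 0 && count > 0);
	rate = link_rate_mbps(links[0]);
	for (i = 1; i < count; i++) {
		int r = link_rate_mbps(links[i]);

		if (r < rate)
			rate = r;
	}
	return rate;
}
```

For a 4x DDR HCA and a 4x SDR HCA attached to the same 4x DDR switch, the first link trains to 4x DDR, the second to 4x SDR, and the path rate comes out at the 4x SDR figure, which is the rate the DDR sender should not exceed. In a real fabric the rate is obtained from the SA in the path record rather than computed by the sender.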
Todd Rimmer Chief Architect QLogic System Interconnect Group Voice: 610-233-4852 Fax: 610-233-4777 Todd.Rimmer at QLogic.com www.QLogic.com From mshefty at ichips.intel.com Mon May 7 10:03:33 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 07 May 2007 10:03:33 -0700 Subject: [ofa-general] RE: man pages for the rdma-cm In-Reply-To: <1178556576.30571.79.camel@stevo-desktop> References: <000001c7905e$9a562190$95fd070a@amr.corp.intel.com> <1178556576.30571.79.camel@stevo-desktop> Message-ID: <463F5BE5.8030806@ichips.intel.com> > Here are a few comments. Consider them for inclusion, but what you've > done so far is a great start. Thanks for the feedback. I'll try to update this before RC3 freezes. > - rdma_disconnect - for iWARP connections, this initiates an RDMAC Verbs > "normal close". If the connection was properly quiesced by the > application, then the QP will end up back in IDLE, but if the connection > was not quiesced, then the connection will be terminated and the QP will > end up in ERROR. Dunno if we want to describe this in detail? Are all work requests flushed in both cases? I don't know if we need to go into details about which state the QP ends up in, unless the behavior differences are visible to the user. > - Also, it might be nice to have some sort of overview man page that > maps the expected event flows for connection setup and teardown. Maybe > 'man rdmacm' gets you some overview? I agree that this would be nice. Is there a standard way of doing this? 
- Sean From mhagen at iol.unh.edu Mon May 7 10:04:13 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Mon, 7 May 2007 13:04:13 -0400 (EDT) Subject: [ofa-general] [PATCH] infiniband: add userspace support for invalidate stag In-Reply-To: <1178380859.8125.2.camel@stevo-desktop> References: <53312.132.177.125.178.1178307561.squirrel@postal.iol.unh.edu> <1178325178.3011.4.camel@stevo-laptop> <1178380859.8125.2.camel@stevo-desktop> Message-ID: <47434.132.177.125.178.1178557453.squirrel@postal.iol.unh.edu> Well, I resubmitted the kernel level changes with the comment and signed-off-by fields. I will wait on resubmitting the userspace changes. The only real contribution to the discussion on what we should do, would be to suggest that maybe we just keep them in a user-patches dir for a while (until a couple of kernel revs with invalidate supported) then move them into the main code base. > On Fri, 2007-05-04 at 19:32 -0500, Steve Wise wrote: >> On Fri, 2007-05-04 at 14:50 -0700, Roland Dreier wrote: >> > A few general things: >> > - please always submit patches with a changelog entry and >> > Signed-off-by: line >> > - please send patches in logical chunks. Usually I'm complaining >> > about people combining unrelated things into one patch, but in this >> > case I think you divided the patch up too much -- rather than 5 >> > patches, this should probably be one kernel patch and one userspace >> > patch. >> > - please make libibverbs patches apply to the libibverbs git tree >> > with -p1. You seem to have generated patches against an OFED >> package. >> > >> > OK, with that out of the way, I think there are still some issues to >> > sort out with how to handle send with invalidate from userspace. >> > These patches don't address the case of new userspace with >> > send-with-invalidate support talking to an unpatched kernel -- it >> > seems that send-with-invalidate would be silently turned into a plain >> > send request, which is not a very good failure mode. 
>> > >> > I don't know what the right solution is yet -- a kernel ABI bump for >> > this one case (send with invalidate support for userspace drivers that >> > don't do kernel bypass == amso1100) is ugly. Maybe we also need a >> > device capabilities bit that says whether send-with-invalidate is >> > supported? >> > >> >> There already exists a SEND-INV capabilities flag. >> >> >> IB_DEVICE_SEND_W_INV = (1<<16), >> >> I think with the capabilities flag, we shouldn't worry about changing >> the ABI. But the drivers will need to set this flag. Amso currently >> does... > > Actually, since Amso has set this flag since day one, it doesn't really > solve the ABI issue Roland describes. > > > Steve. > > -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From swise at opengridcomputing.com Mon May 7 10:27:03 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 07 May 2007 12:27:03 -0500 Subject: [ofa-general] RE: man pages for the rdma-cm In-Reply-To: <463F5BE5.8030806@ichips.intel.com> References: <000001c7905e$9a562190$95fd070a@amr.corp.intel.com> <1178556576.30571.79.camel@stevo-desktop> <463F5BE5.8030806@ichips.intel.com> Message-ID: <1178558823.30571.97.camel@stevo-desktop> On Mon, 2007-05-07 at 10:03 -0700, Sean Hefty wrote: > > Here are a few comments. Consider them for inclusion, but what you've > > done so far is a great start. > > Thanks for the feedback. I'll try to update this before RC3 freezes. > > > - rdma_disconnect - for iWARP connections, this initiates a RDMAC Verbs > > "normal close". If the connection was properly quiesced by the > > application, then the QP will end up back in IDLE, but if the connection > > was not quiesced, then the connection will be terminated and the QP will > > end up in ERROR. Dunno if we want to describe this in detail? 
> > Are all work requests flushed in both cases? I don't know if we need to go into > details about which state the QP ends up in, unless the behavior differences are > visible to the user. > In the "normal close" case, the user is responsible for quiescing the SQ. In both cases the RQ entries are flushed. We can omit this for now if you want. > > - Also, it might be nice to have some sort of overview man page that > > maps the expected event flows for connection setup and teardown. Maybe > > 'man rdmacm' gets you some overview? > > I agree that this would be nice. Is there a standard way of doing this? > There's a 'tcp' man page to describe tcp. So I think it's ok to have a 'rdmacm' or 'rdmacma' man page. Steve. From pradeeps at linux.vnet.ibm.com Mon May 7 10:32:09 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Mon, 07 May 2007 10:32:09 -0700 Subject: [ofa-general] Question about git tree Message-ID: <463F6299.8050106@linux.vnet.ibm.com> Roland, Last night you submitted the NAPI work for 2.6.22. When I checked a few minutes ago I saw that the NAPI work has been merged into the for-linus tree and not the for-2.6.22 tree. I want to merge and test my patch against the latest tree - which git tree should I use? Can you please provide insight into how this procedure works, or if it is documented please provide a pointer. Pradeep From sean.hefty at intel.com Mon May 7 11:39:46 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 7 May 2007 11:39:46 -0700 Subject: [ofa-general] [PATCH 0/3] rdma/cm: cleanup device removal synchronization In-Reply-To: <000401c787ca$f37d7ee0$2ad8180a@amr.corp.intel.com> Message-ID: <000101c790d7$1642e680$8698070a@amr.corp.intel.com> Here's a couple of patches that make the device removal synchronization in the rdma_cm a little more explicit, along with one fix to add in missing synchronization. With these patches, it's now possible to call rdma_disconnect() after receiving a device removal event. 
I plan on pushing these changes to my git tree and request that they be pulled into 2.6.22 within the next couple of days if there are no issues. Signed-off-by: Sean Hefty From sean.hefty at intel.com Mon May 7 11:42:16 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 7 May 2007 11:42:16 -0700 Subject: [ofa-general] [PATCH 1/3] rdma/cm: simplify device removal handling code In-Reply-To: <000101c790d7$1642e680$8698070a@amr.corp.intel.com> Message-ID: <000201c790d7$6fac26f0$8698070a@amr.corp.intel.com> Add a new routine and rename another to encapsulate common code for synchronizing with device removal. Signed-off-by: Sean Hefty --- drivers/infiniband/core/cma.c | 89 ++++++++++++++++++++++------------------- 1 files changed, 48 insertions(+), 41 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index fde92ce..d026764 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -346,7 +346,23 @@ static void cma_deref_id(struct rdma_id_private *id_priv) complete(&id_priv->comp); } -static void cma_release_remove(struct rdma_id_private *id_priv) +static int cma_disable_remove(struct rdma_id_private *id_priv, + enum cma_state state) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&id_priv->lock, flags); + if (id_priv->state == state) { + atomic_inc(&id_priv->dev_remove); + ret = 0; + } else + ret = -EINVAL; + spin_unlock_irqrestore(&id_priv->lock, flags); + return ret; +} + +static void cma_enable_remove(struct rdma_id_private *id_priv) { if (atomic_dec_and_test(&id_priv->dev_remove)) wake_up(&id_priv->wait_remove); @@ -884,9 +900,8 @@ static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) struct rdma_cm_event event; int ret = 0; - atomic_inc(&id_priv->dev_remove); - if (!cma_comp(id_priv, CMA_CONNECT)) - goto out; + if (cma_disable_remove(id_priv, CMA_CONNECT)) + return 0; memset(&event, 0, sizeof event); switch (ib_event->event) { @@ -942,12 +957,12 @@ static int 
cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) /* Destroy the CM ID by returning a non-zero value. */ id_priv->cm_id.ib = NULL; cma_exch(id_priv, CMA_DESTROYING); - cma_release_remove(id_priv); + cma_enable_remove(id_priv); rdma_destroy_id(&id_priv->id); return ret; } out: - cma_release_remove(id_priv); + cma_enable_remove(id_priv); return ret; } @@ -1057,11 +1072,8 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) int offset, ret; listen_id = cm_id->context; - atomic_inc(&listen_id->dev_remove); - if (!cma_comp(listen_id, CMA_LISTEN)) { - ret = -ECONNABORTED; - goto out; - } + if (cma_disable_remove(listen_id, CMA_LISTEN)) + return -ECONNABORTED; memset(&event, 0, sizeof event); offset = cma_user_data_offset(listen_id->id.ps); @@ -1101,11 +1113,11 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) release_conn_id: cma_exch(conn_id, CMA_DESTROYING); - cma_release_remove(conn_id); + cma_enable_remove(conn_id); rdma_destroy_id(&conn_id->id); out: - cma_release_remove(listen_id); + cma_enable_remove(listen_id); return ret; } @@ -1214,12 +1226,12 @@ static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event) /* Destroy the CM ID by returning a non-zero value. 
*/ id_priv->cm_id.iw = NULL; cma_exch(id_priv, CMA_DESTROYING); - cma_release_remove(id_priv); + cma_enable_remove(id_priv); rdma_destroy_id(&id_priv->id); return ret; } - cma_release_remove(id_priv); + cma_enable_remove(id_priv); return ret; } @@ -1234,11 +1246,8 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id, int ret; listen_id = cm_id->context; - atomic_inc(&listen_id->dev_remove); - if (!cma_comp(listen_id, CMA_LISTEN)) { - ret = -ECONNABORTED; - goto out; - } + if (cma_disable_remove(listen_id, CMA_LISTEN)) + return -ECONNABORTED; /* Create a new RDMA id for the new IW CM ID */ new_cm_id = rdma_create_id(listen_id->id.event_handler, @@ -1255,13 +1264,13 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id, dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr); if (!dev) { ret = -EADDRNOTAVAIL; - cma_release_remove(conn_id); + cma_enable_remove(conn_id); rdma_destroy_id(new_cm_id); goto out; } ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL); if (ret) { - cma_release_remove(conn_id); + cma_enable_remove(conn_id); rdma_destroy_id(new_cm_id); goto out; } @@ -1270,7 +1279,7 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id, ret = cma_acquire_dev(conn_id); mutex_unlock(&lock); if (ret) { - cma_release_remove(conn_id); + cma_enable_remove(conn_id); rdma_destroy_id(new_cm_id); goto out; } @@ -1293,14 +1302,14 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id, /* User wants to destroy the CM ID */ conn_id->cm_id.iw = NULL; cma_exch(conn_id, CMA_DESTROYING); - cma_release_remove(conn_id); + cma_enable_remove(conn_id); rdma_destroy_id(&conn_id->id); } out: if (dev) dev_put(dev); - cma_release_remove(listen_id); + cma_enable_remove(listen_id); return ret; } @@ -1519,7 +1528,7 @@ static void cma_work_handler(struct work_struct *_work) destroy = 1; } out: - cma_release_remove(id_priv); + cma_enable_remove(id_priv); cma_deref_id(id_priv); if (destroy) rdma_destroy_id(&id_priv->id); @@ -1711,13 +1720,13 @@ static void 
addr_handler(int status, struct sockaddr *src_addr, if (id_priv->id.event_handler(&id_priv->id, &event)) { cma_exch(id_priv, CMA_DESTROYING); - cma_release_remove(id_priv); + cma_enable_remove(id_priv); cma_deref_id(id_priv); rdma_destroy_id(&id_priv->id); return; } out: - cma_release_remove(id_priv); + cma_enable_remove(id_priv); cma_deref_id(id_priv); } @@ -2042,11 +2051,10 @@ static int cma_sidr_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_sidr_rep_event_param *rep = &ib_event->param.sidr_rep_rcvd; int ret = 0; - memset(&event, 0, sizeof event); - atomic_inc(&id_priv->dev_remove); - if (!cma_comp(id_priv, CMA_CONNECT)) - goto out; + if (cma_disable_remove(id_priv, CMA_CONNECT)) + return 0; + memset(&event, 0, sizeof event); switch (ib_event->event) { case IB_CM_SIDR_REQ_ERROR: event.event = RDMA_CM_EVENT_UNREACHABLE; @@ -2084,12 +2092,12 @@ static int cma_sidr_rep_handler(struct ib_cm_id *cm_id, /* Destroy the CM ID by returning a non-zero value. */ id_priv->cm_id.ib = NULL; cma_exch(id_priv, CMA_DESTROYING); - cma_release_remove(id_priv); + cma_enable_remove(id_priv); rdma_destroy_id(&id_priv->id); return ret; } out: - cma_release_remove(id_priv); + cma_enable_remove(id_priv); return ret; } @@ -2499,10 +2507,9 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast) int ret; id_priv = mc->id_priv; - atomic_inc(&id_priv->dev_remove); - if (!cma_comp(id_priv, CMA_ADDR_BOUND) && - !cma_comp(id_priv, CMA_ADDR_RESOLVED)) - goto out; + if (cma_disable_remove(id_priv, CMA_ADDR_BOUND) && + cma_disable_remove(id_priv, CMA_ADDR_RESOLVED)) + return 0; if (!status && id_priv->id.qp) status = ib_attach_mcast(id_priv->id.qp, &multicast->rec.mgid, @@ -2524,12 +2531,12 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast) ret = id_priv->id.event_handler(&id_priv->id, &event); if (ret) { cma_exch(id_priv, CMA_DESTROYING); - cma_release_remove(id_priv); + cma_enable_remove(id_priv); rdma_destroy_id(&id_priv->id); return 0; } 
-out: - cma_release_remove(id_priv); + + cma_enable_remove(id_priv); return 0; } From sean.hefty at intel.com Mon May 7 11:43:38 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 7 May 2007 11:43:38 -0700 Subject: [ofa-general] [PATCH 2/3] rdma/cm: Fix synchronization with device removal in cma_iw_handler In-Reply-To: <000201c790d7$6fac26f0$8698070a@amr.corp.intel.com> Message-ID: <000301c790d7$a089c3e0$8698070a@amr.corp.intel.com> The cma_iw_handler needs to validate the state of the rdma_cm_id before processing a new connection request to ensure that a device removal is not already being processed for the same rdma_cm_id. Without the state check, the user can receive simultaneous callbacks for the same cm_id, or a callback after they've destroyed the cm_id. Signed-off-by: Sean Hefty --- drivers/infiniband/core/cma.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index d026764..cfd57b4 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -1183,9 +1183,10 @@ static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event) struct sockaddr_in *sin; int ret = 0; - memset(&event, 0, sizeof event); - atomic_inc(&id_priv->dev_remove); + if (cma_disable_remove(id_priv, CMA_CONNECT)) + return 0; + memset(&event, 0, sizeof event); switch (iw_event->event) { case IW_CM_EVENT_CLOSE: event.event = RDMA_CM_EVENT_DISCONNECTED; From sean.hefty at intel.com Mon May 7 11:45:23 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 7 May 2007 11:45:23 -0700 Subject: [ofa-general] [PATCH 3/3] rdma/cm: Add check to validate that cm_id is bound to a device In-Reply-To: <000101c790d7$1642e680$8698070a@amr.corp.intel.com> Message-ID: <000401c790d7$df1b38a0$8698070a@amr.corp.intel.com> Several checks in the rdma_cm check against the state of the cm_id, but only to validate that the cm_id is bound to an underlying transport specific CM and an RDMA 
device. Make the check explicit in what we're trying to check for, since we're not synchronizing against the cm_id state. This will allow a user to disconnect a cm_id or reject a connection after receiving a device removal event. Signed-off-by: Sean Hefty --- drivers/infiniband/core/cma.c | 12 ++++++++---- 1 files changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index cfd57b4..2eb52b7 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -368,6 +368,11 @@ static void cma_enable_remove(struct rdma_id_private *id_priv) wake_up(&id_priv->wait_remove); } +static int cma_has_cm_dev(struct rdma_id_private *id_priv) +{ + return (id_priv->id.device && id_priv->cm_id.ib); +} + struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, void *context, enum rdma_port_space ps) { @@ -2422,7 +2427,7 @@ int rdma_notify(struct rdma_cm_id *id, enum ib_event_type event) int ret; id_priv = container_of(id, struct rdma_id_private, id); - if (!cma_comp(id_priv, CMA_CONNECT)) + if (!cma_has_cm_dev(id_priv)) return -EINVAL; switch (id->device->node_type) { @@ -2444,7 +2449,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data, int ret; id_priv = container_of(id, struct rdma_id_private, id); - if (!cma_comp(id_priv, CMA_CONNECT)) + if (!cma_has_cm_dev(id_priv)) return -EINVAL; switch (rdma_node_get_transport(id->device->node_type)) { @@ -2475,8 +2480,7 @@ int rdma_disconnect(struct rdma_cm_id *id) int ret; id_priv = container_of(id, struct rdma_id_private, id); - if (!cma_comp(id_priv, CMA_CONNECT) && - !cma_comp(id_priv, CMA_DISCONNECT)) + if (!cma_has_cm_dev(id_priv)) return -EINVAL; switch (rdma_node_get_transport(id->device->node_type)) { From mst at dev.mellanox.co.il Mon May 7 13:03:15 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 7 May 2007 23:03:15 +0300 Subject: [ofa-general] [PATCH] ipoib/cm: make stale task actually run once in a while In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C90BFCB3@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C9076E27@mtlexch01.mtl.com> Message-ID: <20070507200315.GD22341@mellanox.co.il> When some passive connections are active, the stale task would never run, since every 4 RX CQEs we repeat the queue_delayed_work call, which delays it for some 10 minutes. As a result, on a noisy system with failing ports, we slowly run out of resources, slowing connection setup down and eventually failing. What we actually want to do is start the stale task when the first passive connection is added, and rerun it every 10 min as long as there are outstanding passive connections. As a happy side effect, this removes some code from the RX data path. Signed-off-by: Michael S. Tsirkin --- Scott, I think this might address bugs 541 and 465: slow IPoIB CM HA failover and eventual failing IPoIB HA. Could you test this please?
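The scheduling policy described above (arm the stale task when the first passive connection appears, then let the task rearm itself while connections remain) can be sketched as a small userspace model. Everything here is an assumption made for illustration: stale_model, add_passive, rx_completion, run_stale_task and the tick-based STALE_DELAY are hypothetical names, time is an abstract counter, and the task is treated as the only thing that removes connections. It does not model the kernel workqueue API.

```c
#include <assert.h>
#include <stdbool.h>

#define STALE_DELAY 10 /* hypothetical delay, in abstract ticks */

/* Hypothetical model of the passive-connection list and its stale task. */
struct stale_model {
	int passive;  /* outstanding passive connections */
	bool queued;  /* is the stale task currently scheduled? */
	long due;     /* tick at which it will run, valid only if queued */
};

/* Add a passive connection; schedule the task only for the first one. */
static void add_passive(struct stale_model *m, long now)
{
	if (m->passive == 0) {
		assert(!m->queued); /* task is never pending on an empty list */
		m->queued = true;
		m->due = now + STALE_DELAY;
	}
	m->passive++;
}

/* RX activity no longer touches the schedule, so this is a no-op here. */
static void rx_completion(struct stale_model *m, long now)
{
	(void)m;
	(void)now;
}

/*
 * Run the task if it is due: reap up to 'stale' connections and rearm
 * only if some connections remain. Returns 1 if the task ran, else 0.
 */
static int run_stale_task(struct stale_model *m, long now, int stale)
{
	if (!m->queued || now < m->due)
		return 0;
	if (stale > m->passive)
		stale = m->passive;
	m->passive -= stale;
	m->queued = m->passive > 0;
	if (m->queued)
		m->due = now + STALE_DELAY;
	return 1;
}
```

The point of the model is that the due time moves only when the task itself runs, so steady RX traffic cannot postpone it indefinitely.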
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 2b242a4..b77e8d7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -258,10 +258,11 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even cm_id->context = p; p->jiffies = jiffies; spin_lock_irqsave(&priv->lock, flags); + if (list_empty(&priv->cm.passive_ids)) + queue_delayed_work(ipoib_workqueue, + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); list_add(&p->list, &priv->cm.passive_ids); spin_unlock_irqrestore(&priv->lock, flags); - queue_delayed_work(ipoib_workqueue, - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); return 0; err_rep: @@ -380,8 +381,6 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) if (!list_empty(&p->list)) list_move(&p->list, &priv->cm.passive_ids); spin_unlock_irqrestore(&priv->lock, flags); - queue_delayed_work(ipoib_workqueue, - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); } } @@ -1104,6 +1103,10 @@ static void ipoib_cm_stale_task(struct work_struct *work) kfree(p); spin_lock_irqsave(&priv->lock, flags); } + + if (!list_empty(&priv->cm.passive_ids)) + queue_delayed_work(ipoib_workqueue, + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); spin_unlock_irqrestore(&priv->lock, flags); } -- MST From rdreier at cisco.com Mon May 7 13:25:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 07 May 2007 13:25:19 -0700 Subject: [ofa-general] Re: Question about git tree In-Reply-To: <463F6299.8050106@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Mon, 07 May 2007 10:32:09 -0700") References: <463F6299.8050106@linux.vnet.ibm.com> Message-ID: > Last night you submitted the NAPI work for 2.6.22. When I checked a > few minutes ago I saw that the NAPI work has been merged into the > for-linus tree and not the for-2.6.22 tree. Yes, that was a temporary situation until Linus pulled everything into his tree (which he now has done). 
> I want to merge and test my patch against the latest tree -which git > tree should I use? Can you please provide insight into how this > procedure works, or if it is documented please provide a pointer. Your question is actually a fairly deep one. In fact in the git world the concept of "latest tree" is not defined. A situation such as for example some fixes queued in for-2.6.22 and some new features queued in for-2.6.23 is quite common. And for-mm in general is something like the union of everything that has a chance at being merged within the next couple of kernel releases. So I guess you just have to use some judgement and look at which tree has things that are likely to impact what you're working on. - R. From swise at opengridcomputing.com Mon May 7 15:09:21 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 07 May 2007 17:09:21 -0500 Subject: [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> Message-ID: <1178575761.30571.175.camel@stevo-desktop> On Sat, 2007-04-28 at 16:20 -0400, Jeff Squyres wrote: > You'd probably be better asking this question on the Open MPI mailing > lists, not here. :-) > > FWIW, yes, adding RDMA CM support has actually been on my to-do list > for a while, but it keeps getting bumped by higher priority items. > It would be *much* better if some iWARP companies got involved in > Open MPI... > Hey Jeff, Chelsio's gonna pony up the resources to get this work done asap. Do you have any thoughts on how we can collaborate on this project? I'm familiar with mvapich, not ompi, so I need to go do some homework. But any pointers on the connection setup design for ompi would be great. I'm CCing devel at openmpi.org in case anyone else is interested in helping. Chelsio can provide rnic HW... Thanks, Steve. 
> > > On Apr 28, 2007, at 4:16 PM, Steve Wise wrote: > > > Is anyone working on adding RDMA-CM support to OpenMPI? > > > > Thanks, > > > > Steve. > > > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > > openib-general > > From swise at opengridcomputing.com Mon May 7 15:52:26 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 07 May 2007 17:52:26 -0500 Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <1178575761.30571.175.camel@stevo-desktop> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> Message-ID: <1178578346.30571.183.camel@stevo-desktop> Also, there appears to be a DAPL BTL in OMPI. Is this BTL complete and enabled for the ofed-1.2 udapl library? Steve. On Mon, 2007-05-07 at 17:09 -0500, Steve Wise wrote: > On Sat, 2007-04-28 at 16:20 -0400, Jeff Squyres wrote: > > You'd probably be better asking this question on the Open MPI mailing > > lists, not here. :-) > > > > FWIW, yes, adding RDMA CM support has actually been on my to-do list > > for a while, but it keeps getting bumped by higher priority items. > > It would be *much* better if some iWARP companies got involved in > > Open MPI... > > > > Hey Jeff, > > Chelsio's gonna pony up the resources to get this work done asap. Do > you have any thoughts on how we can collaborate on this project? I'm > familiar with mvapich, not ompi, so I need to go do some homework. But > any pointers on the connection setup design for ompi would be great. > > I'm CCing devel at openmpi.org in case anyone else is interested in > helping. Chelsio can provide rnic HW... > > > Thanks, > > Steve. 
> > > > > > > > > On Apr 28, 2007, at 4:16 PM, Steve Wise wrote: > > > > > Is anyone working on adding RDMA-CM support to OpenMPI? > > > > > > Thanks, > > > > > > Steve. > > > > > > _______________________________________________ > devel mailing list > devel at open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel From jsquyres at cisco.com Mon May 7 17:37:17 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 7 May 2007 20:37:17 -0400 Subject: [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <1178575761.30571.175.camel@stevo-desktop> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> Message-ID: <95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com> On May 7, 2007, at 6:09 PM, Steve Wise wrote: >> You'd probably be better asking this question on the Open MPI mailing >> lists, not here. :-) >> >> FWIW, yes, adding RDMA CM support has actually been on my to-do list >> for a while, but it keeps getting bumped by higher priority items. >> It would be *much* better if some iWARP companies got involved in >> Open MPI... > > Chelsio's gonna pony up the resources to get this work done asap. Do > you have any thoughts on how we can collaborate on this project? I'm > familiar with mvapich, not ompi, so I need to go do some homework. > But > any pointers on the connection setup design for ompi would be great. Excellent! Let's chat on the phone tomorrow -- this would probably be the best way to start. We will need a signed Open MPI 3rd party contribution agreement from either you and/or Chelsio (whoever owns the intellectual property that will be contributed).
See http://www.open-mpi.org/community/ contribute/. > I'm CCing devel at openmpi.org in case anyone else is interested in > helping. Chelsio can provide rnic HW... Anyone else here interested? Free hardware! :-) -- Jeff Squyres Cisco Systems From jsquyres at cisco.com Mon May 7 17:39:58 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 7 May 2007 20:39:58 -0400 Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <1178578346.30571.183.camel@stevo-desktop> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <1178578346.30571.183.camel@stevo-desktop> Message-ID: On May 7, 2007, at 6:52 PM, Steve Wise wrote: > Also, there appears to be a DAPL BTL in OMPI. Is this BTL complete > and > enabled for the ofed-1.2 udapl library? Yes, it is complete and is well-tested in Solaris. It is not well tested in Linux/OFED (we've been concentrating on the verbs interface on the OFED side of things -- the "openib" BTL [we never renamed it when OpenIB changed names to OpenFabrics]). In fact, we've had scattered reports of it not working properly in Linux/ OFED, but those could well have been pilot error (i.e., me not trying to run properly -- I know just about zilch about udapl). -- Jeff Squyres Cisco Systems From afriedle at indiana.edu Mon May 7 17:54:26 2007 From: afriedle at indiana.edu (Andrew Friedley) Date: Mon, 07 May 2007 20:54:26 -0400 Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com> Message-ID: <463FCA42.3000104@indiana.edu> Jeff Squyres wrote: > On May 7, 2007, at 6:09 PM, Steve Wise wrote: > >>> You'd probably be better asking this question on the Open MPI mailing >>> lists, not here. 
:-) >>> >>> FWIW, yes, adding RDMA CM support has actually been on my to-do list >>> for a while, but it keeps getting bumped by higher priority items. >>> It would be *much* better if some iWARP companies got involved in >>> Open MPI... >> Chelsio's gonna pony up the resources to get this work done asap. Do >> you have any thoughts on how we can collaborate on this project? I'm >> familiar with mvapich, not ompi, so I need to go do some homework. >> But >> any pointers on the connection setup design for ompi would be great. > > Excellent! Let's chat on the phone tomorrow -- this would probably > be the best way to start. > > We will need a signed Open MPI 3rd party contribution agreement from > either you and/or Chelsio (whoever owns the intellectual property > that will be contributed). See http://www.open-mpi.org/community/ > contribute/. > >> I'm CCing devel at openmpi.org in case anyone else is interested in >> helping. Chelsio can provide rnic HW... > > Anyone else here interested? Free hardware! :-) Hmm I'm interested. I've already done some work switching over to RDMA CM for some research stuff I've been doing; it's not publicly accessible w/o the 3rd party agreement. I can help answer questions on what exactly needs to change, and do some testing. Andrew
From rdreier at cisco.com Mon May 7 19:40:29 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 07 May 2007 19:40:29 -0700 Subject: [ofa-general] [last RFC] mlx4 (Mellanox ConnectX adapter) InfiniBand drivers Message-ID: I've added my InfiniBand drivers for the new Mellanox ConnectX adapter to what's queued up for 2.6.22 in: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-2.6.22 This is still a new driver, with some things missing and undoubtedly some bugs and opportunities for cleanup, but I trust myself to keep improving the driver even after it's upstream. Unless I hear a good reason why I shouldn't, I'll ask Linus to pull this tomorrow. I received no responses to my earlier posts, so I'm not going to spam everyone with a big patch series again. But here's the diffstat at least -- if you want to see details, just pull the git URL above. commit 0c2f16963d60c30920ee4fb3c900ae29d6ed0f74 Author: Roland Dreier Date: Mon May 7 15:48:06 2007 -0700 IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters Add an InfiniBand driver for Mellanox ConnectX adapters. Because these adapters can also be used as ethernet NICs and Fibre Channel HBAs, the driver is split into two modules: mlx4_core: Handles low-level things like device initialization and processing firmware commands. Also controls resource allocation so that the InfiniBand, ethernet and FC functions can share a device without stepping on each other. mlx4_ib: Handles InfiniBand-specific things; plugs into the InfiniBand midlayer. 
Signed-off-by: Roland Dreier drivers/infiniband/Kconfig | 2 + drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/mlx4/Kconfig | 9 + drivers/infiniband/hw/mlx4/Makefile | 3 + drivers/infiniband/hw/mlx4/ah.c | 100 +++ drivers/infiniband/hw/mlx4/cq.c | 525 +++++++++++++ drivers/infiniband/hw/mlx4/doorbell.c | 216 ++++++ drivers/infiniband/hw/mlx4/mad.c | 339 +++++++++ drivers/infiniband/hw/mlx4/main.c | 651 +++++++++++++++++ drivers/infiniband/hw/mlx4/mlx4_ib.h | 285 ++++++++ drivers/infiniband/hw/mlx4/mr.c | 184 +++++ drivers/infiniband/hw/mlx4/qp.c | 1294 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/mlx4/srq.c | 334 +++++++++ drivers/infiniband/hw/mlx4/user.h | 92 +++ drivers/net/Kconfig | 14 + drivers/net/Makefile | 1 + drivers/net/mlx4/Makefile | 4 + drivers/net/mlx4/alloc.c | 179 +++++ drivers/net/mlx4/catas.c | 70 ++ drivers/net/mlx4/cmd.c | 429 +++++++++++ drivers/net/mlx4/cq.c | 254 +++++++ drivers/net/mlx4/eq.c | 696 ++++++++++++++++++ drivers/net/mlx4/fw.c | 775 ++++++++++++++++++++ drivers/net/mlx4/fw.h | 167 +++++ drivers/net/mlx4/icm.c | 379 ++++++++++ drivers/net/mlx4/icm.h | 135 ++++ drivers/net/mlx4/intf.c | 165 +++++ drivers/net/mlx4/main.c | 936 ++++++++++++++++++++++++ drivers/net/mlx4/mcg.c | 380 ++++++++++ drivers/net/mlx4/mlx4.h | 348 +++++++++ drivers/net/mlx4/mr.c | 479 ++++++++++++ drivers/net/mlx4/pd.c | 102 +++ drivers/net/mlx4/profile.c | 238 ++++++ drivers/net/mlx4/qp.c | 273 +++++++ drivers/net/mlx4/reset.c | 181 +++++ drivers/net/mlx4/srq.c | 227 ++++++ include/linux/mlx4/cmd.h | 178 +++++ include/linux/mlx4/cq.h | 123 ++++ include/linux/mlx4/device.h | 331 +++++++++ include/linux/mlx4/doorbell.h | 97 +++ include/linux/mlx4/driver.h | 59 ++ include/linux/mlx4/qp.h | 288 ++++++++ include/linux/mlx4/srq.h | 42 ++ 43 files changed, 11585 insertions(+), 0 deletions(-) From benh at kernel.crashing.org Mon May 7 20:21:56 2007 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 08 May 2007 13:21:56 +1000 
Subject: [ofa-general] Incorrect atomic usage in ipath driver Message-ID: <1178594516.14928.62.camel@localhost.localdomain> Hi ! So I see this construct: /* There is already a thread processing this queue. */ if (test_and_set_bit(0, &dd->ipath_rcv_pending)) goto bail; .../... done: clear_bit(0, &dd->ipath_rcv_pending); smp_mb__after_clear_bit(); So that's basically an attempt at doing a spinlock. The problem is your barrier is wrong at the end. Better would be: done: smp_mb__before_clear_bit(); clear_bit(0, &dd->ipath_rcv_pending); Though it's still less optimal than doing: if (!spin_trylock(...)) goto bail; .../... done: spin_unlock(...) If you really want to stick to bitops, then you may want to look at Nick's upcoming patches adding some bitops with appropriate lock semantics. Cheers, Ben. From mst at dev.mellanox.co.il Mon May 7 22:07:00 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 May 2007 08:07:00 +0300 Subject: [ofa-general] memory leak in cm.c? Message-ID: <20070508050700.GI22341@mellanox.co.il> Hi! I applied the following patch to cm.c, and it crashed after some duplicate reqs were detected. Does this indicate a memory leak in cm? --- diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 842cd0b..3f95eae 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -790,6 +790,7 @@ retest: cm_free_work(work); kfree(cm_id_priv->compare_data); kfree(cm_id_priv->private_data); + BUG_ON(cm_id_priv->timewait_info); kfree(cm_id_priv); } -- MST From eli at mellanox.co.il Mon May 7 23:47:26 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 08 May 2007 09:47:26 +0300 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts In-Reply-To: References: <1178551555.17477.0.camel@mtls03> Message-ID: <1178606876.17477.15.camel@mtls03> On Mon, 2007-05-07 at 09:40 -0700, Roland Dreier wrote: > Thanks...
should we optimize out the > > if (eqes_found) > eq_set_ci(eq, 1); > > at the end of mlx4_eq_int() now? I think we should > Actually the best fix is probably: > > diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c > index 8d641b8..acf1c80 100644 > --- a/drivers/net/mlx4/eq.c > +++ b/drivers/net/mlx4/eq.c > @@ -249,8 +249,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq) > } > } > > - if (eqes_found) > - eq_set_ci(eq, 1); > + eq_set_ci(eq, 1); > > return eqes_found; > } > This will not ensure arming all EQs for all interrupts and we will face the same problem of losing interrupts. Index: connectx_kernel/drivers/net/mlx4/eq.c =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/eq.c 2007-05-06 17:34:12.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/eq.c 2007-05-08 09:37:50.000000000 +0300 @@ -256,9 +256,6 @@ static int mlx4_eq_int(struct mlx4_dev * } } - if (eqes_found) - eq_set_ci(eq, 1); - return eqes_found; } @@ -266,13 +263,17 @@ static irqreturn_t mlx4_interrupt(int ir { struct mlx4_dev *dev = dev_ptr; struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_eq *eq; int work = 0; int i; writel(priv->eq_table.clr_mask, priv->eq_table.clr_int); - for (i = 0; i < MLX4_EQ_CATAS; ++i) - work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]); + for (i = 0; i < MLX4_EQ_CATAS; ++i) { + eq = &priv->eq_table.eq[i]; + work |= mlx4_eq_int(dev, eq); + eq_set_ci(eq, 1); + } return IRQ_RETVAL(work); } @@ -283,6 +284,7 @@ static irqreturn_t mlx4_msi_x_interrupt( struct mlx4_dev *dev = eq->dev; mlx4_eq_int(dev, eq); + eq_set_ci(eq, 1); /* MSI-X vectors always belong to us */ return IRQ_HANDLED; > because it seems sort of strange if we ever don't rearm the EQ on an > MSI-X interrupt. > > What do you think? 
Actually I think the following patch can do the work and is similar to what we did for mthca/Hermon > > On the other hand, this patch (and your patch) rearms the EQ on shared > interrupts for other devices too. Can't be helped I guess. > > - R. > From eli at mellanox.co.il Tue May 8 02:37:22 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 08 May 2007 12:37:22 +0300 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_core: fix qp free sync Message-ID: <1178617072.17477.45.camel@mtls03> Fix the missing initialization of the qp's free completion and use logic similar to the cq code when closing the qp. The problem first shows up when using qp events, when complete() attempts to acquire an uninitialized spinlock. Signed-off-by: Eli Cohen --- Index: connectx_kernel/drivers/net/mlx4/qp.c =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/qp.c 2007-05-07 17:48:17.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/qp.c 2007-05-08 12:07:33.000000000 +0300 @@ -185,6 +185,9 @@ int mlx4_qp_alloc(struct mlx4_dev *dev, if (err) goto err_put_cmpt; + atomic_set(&qp->refcount, 1); + init_completion(&qp->free); + return 0; err_put_cmpt: @@ -225,6 +228,10 @@ void mlx4_qp_free(struct mlx4_dev *dev, { struct mlx4_qp_table *qp_table = &mlx4_priv(dev)->qp_table; + if (atomic_dec_and_test(&qp->refcount)) + complete(&qp->free); + wait_for_completion(&qp->free); + mlx4_table_put(dev, &qp_table->cmpt_table, qp->qpn); mlx4_table_put(dev, &qp_table->rdmarc_table, qp->qpn); mlx4_table_put(dev, &qp_table->altc_table, qp->qpn); From jackm at dev.mellanox.co.il Tue May 8 02:38:41 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 8 May 2007 12:38:41 +0300 Subject: [ofa-general] no SRQ empty check in libmthca and in mlx2 kernel modules Message-ID: <200705081238.41255.jackm@dev.mellanox.co.il> It looks to me like there is no check for "no more available WQEs" when posting SRQ reads. See libmlx4/src/srq.c and drivers/infiniband/hw/mlx4/srq.c.
There is no check in either place if srq_head = srq_tail, or some equivalent check. - Jack From vlad at lists.openfabrics.org Tue May 8 02:38:12 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 8 May 2007 02:38:12 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070508-0200 daily build status Message-ID: <20070508093812.9A193E603C1@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.12 
Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Failed: Build failed on i686 with linux-2.6.21.1 From tziporet at dev.mellanox.co.il Tue May 8 05:40:02 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 08 May 2007 15:40:02 +0300 Subject: [ofa-general] [PATCH 0/3] rdma/cm: cleanup device removal synchronization In-Reply-To: <000101c790d7$1642e680$8698070a@amr.corp.intel.com> References: <000101c790d7$1642e680$8698070a@amr.corp.intel.com> Message-ID: <46406FA2.9060802@mellanox.co.il> Sean Hefty wrote: > Here's a couple of patches that make the device removal synchronization > in the rdma_cm a little more explicit, along with one fix to add in > missing synchronization. > > With these patches, it's now possible to call rdma_disconnect() after > receiving a device removal event. > > I plan on pushing these changes to my git tree and request that they > be pulled into 2.6.22 within the next couple of days if there are no > issues. > > Hi Sean, Do you think we want these for OFED 1.2 too? 
If yes, please prepare patches against the OFED 1.2 git tree too.

Thanks,
Tziporet

From jsquyres at cisco.com Tue May 8 06:16:57 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 8 May 2007 09:16:57 -0400
Subject: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work
In-Reply-To: <464044D4.5010501@lfbs.rwth-aachen.de>
References: <462E13A6.3030207@lfbs.rwth-aachen.de> <462E1DFE.5010703@Sun.COM> <46305D0A.5020900@lfbs.rwth-aachen.de> <4630EFDE.8070608@Sun.COM> <464044D4.5010501@lfbs.rwth-aachen.de>
Message-ID: <054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com>

I'm forwarding this to the OpenFabrics general list -- as it just came
up the other day, we know that Open MPI's UDAPL support works on
Solaris, but we have done little/no testing of it on OFED (I personally
know almost nothing about UDAPL).

Can the UDAPL OFED wizards shed any light on the error messages that
are listed below? In particular, these seem to be worrisome:

> setup_listener Permission denied
> setup_listener Address already in use

and

> create_qp Address already in use

Thanks...

On May 8, 2007, at 5:37 AM, Boris Bierbaum wrote:

> Hi,
>
> we (my colleague Andreas and I) are still trying to solve this problem.
> I have compiled some additional information, maybe somebody has an idea
> about what's going on.
>
> OS: Debian GNU/Linux 4.0, Kernel 2.6.18, x86, 32-Bit
> IB software: OFED 1.1
> SM: OpenSM from OFED 1.1
> uDAPL: DAPL reference implementation version gamma 3.02 (using DAPL from
> OFED 1.1 doesn't change anything, I suppose it's the same code, at least
> roughly)
> Test program: Intel MPI Benchmarks Version 2.3
> OpenMPI version: 1.2.1
>
> Running OpenMPI directly over IB verbs (mpirun --mca btl self,sm,openib ...) works.
Here's the output of ibv_devinfo and ifconfig for the two > nodes on which tried to run the benchmark (ulimit -l is unlimited on > both machines): > > ------------ 1st node ------------------------------- > > boris at pd-04:/work/boris/IMB_2.3/src$ /opt/infiniband/bin/ibv_devinfo > hca_id: mthca0 > fw_ver: 1.2.0 > node_guid: 0002:c902:0020:b528 > sys_image_guid: 0002:c902:0020:b52b > vendor_id: 0x02c9 > vendor_part_id: 25204 > hw_ver: 0xA0 > board_id: MT_0230000001 > phys_port_cnt: 1 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 1 > port_lid: 9 > port_lmc: 0x00 > > boris at pd-04:/work/boris/IMB_2.3/src$ /sbin/ifconfig > > ... > > ib0 Protokoll:UNSPEC Hardware Adresse > 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > inet Adresse:192.168.0.14 Bcast:192.168.0.255 > Maske:255.255.255.0 > inet6 Adresse: fe80::202:c902:20:b529/64 > Gültigkeitsbereich:Verbindung > UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 > RX packets:67 errors:0 dropped:0 overruns:0 frame:0 > TX packets:16 errors:0 dropped:2 overruns:0 carrier:0 > Kollisionen:0 Sendewarteschlangenlänge:128 > RX bytes:3752 (3.6 KiB) TX bytes:968 (968.0 b) > > ... > > ------------ 2nd node ------------------------------- > > boris at pd-05:~$ /opt/infiniband/bin/ibv_devinfo > hca_id: mthca0 > fw_ver: 1.2.0 > node_guid: 0002:c902:0020:b4f4 > sys_image_guid: 0002:c902:0020:b4f7 > vendor_id: 0x02c9 > vendor_part_id: 25204 > hw_ver: 0xA0 > board_id: MT_0230000001 > phys_port_cnt: 1 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 1 > port_lid: 10 > port_lmc: 0x00 > > boris at pd-05:~$ /sbin/ifconfig > > ... 
> > ib0 Protokoll:UNSPEC Hardware Adresse > 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > inet Adresse:192.168.0.15 Bcast:192.168.0.255 > Maske:255.255.255.0 > inet6 Adresse: fe80::202:c902:20:b4f5/64 > Gültigkeitsbereich:Verbindung > UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 > RX packets:67 errors:0 dropped:0 overruns:0 frame:0 > TX packets:18 errors:0 dropped:2 overruns:0 carrier:0 > Kollisionen:0 Sendewarteschlangenlänge:128 > RX bytes:3752 (3.6 KiB) TX bytes:1088 (1.0 KiB) > > > ... > > ---------------------------------------------------------------------- > --- > > > Here's the output from the failed run, with every DAT and DAPL debug > output enabled: > > > > boris at pd-04:/work/boris/IMB_2.3/src$ mpirun -np 2 -x DAT_DBG_TYPE -x > DAPL_DBG_TYPE -x DAT_OVERRIDE --mca btl self,sm,udapl --host > pd-04,pd-05 > /work/boris/IMB_2.3/src/IMB-MPI1 pingpong > DAT Registry: Started (dat_init) > DAT Registry: static registry file > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > libdapl_openib_cma.so> > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value <> > > > DAT Registry: token > type eor > value <> > > > DAT Registry: entry > ia_name OpenIB-cma > api_version > type 0x0 > major.minor 1.2 > is_thread_safe 0 > is_default 1 > lib_path > /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/ > libdapl_openib_cma.so > provider_version > id mv_dapl > major.minor 1.2 > ia_params ib0 0 > > DAT Registry: loading provider for OpenIB-cma > > DAT Registry: token > type eof > value <> > > DAT Registry: dat_registry_list_providers () called > DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called > DAT Registry: IA OpenIB-cma, trying to load library > 
/home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/ > libdapl_openib_cma.so > DAPL: NOT Setting Loopback > dapl_ib_init: > DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0) > open_hca: ib0 - 0x807cf28 > ib_thread_init(17919) > ib_thread_init: waiting for ib_thread > ib_thread(17919,0xa7b08bb0): ENTER: pipe 8 ucma 12 > DAT Registry: Started (dat_init) > DAT Registry: static registry file > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > libdapl_openib_cma.so> > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value <> > > > DAT Registry: token > type eor > value <> > > > DAT Registry: entry > ia_name OpenIB-cma > api_version > type 0x0 > major.minor 1.2 > is_thread_safe 0 > is_default 1 > lib_path > /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/ > libdapl_openib_cma.so > provider_version > id mv_dapl > major.minor 1.2 > ia_params ib0 0 > > DAT Registry: loading provider for OpenIB-cma > > DAT Registry: token > type eof > value <> > > DAT Registry: dat_registry_list_providers () called > DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called > DAT Registry: IA OpenIB-cma, trying to load library > /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/ > libdapl_openib_cma.so > ib_thread_init(17919) exit > DAPL: NOT Setting Loopback > dapl_ib_init: > DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0) > open_hca: ib0 - 0x807cf18 > ib_thread_init(12326) > ib_thread_init: waiting for ib_thread > ib_thread(12326,0xa7b75bb0): ENTER: pipe 8 ucma 12 > ib_thread_init(12326) exit > getipaddr: family 2 port 0 addr 192.168.0.14 > open_hca: ctx=0x809ecd0 port=1 GID subnet fe80000000000000 id > 0002c9020020b529 > open_hca: ib0, AF_INET 192.168.0.14 INLINE_MAX=128 > 
ib_thread(17919) poll_event: async=0x1 pipe=0x1 cm=0x0 cq=0x0 > ib_thread(17919) poll_fd: hca[134729592]=0xb, async=8 pipe=12 > cm=13 cq=d > query_hca: ib0 AF_INET 192.168.0.14 > query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071 > query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 > rd_io 4 > setup_async_cb: ia 0x80a1648 type 0 hdl (nil) cb 0xa7b1ec6c ctx > 0x80a16d0 > setup_async_cb: ia 0x80a1648 type 1 hdl (nil) cb 0xa7b1e9c0 ctx > 0x80a16d0 > setup_async_cb: ia 0x80a1648 type 3 hdl (nil) cb 0xa7b1eb50 ctx > 0x80a1648 > dat_set_handle 0x80a1648 to 1 > dat_get_ia_handle from 1 to 0x80a1648 > pd_alloc: pd_handle=0x80a1928 > dat_get_ia_handle from 1 to 0x80a1648 > query_hca: ib0 AF_INET 192.168.0.14 > query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071 > query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 > rd_io 4 > dat_get_ia_handle from 1 to 0x80a1648 > cq_object_create: (0x80a1958,0x80a1a44) > dapls_ib_cq_alloc: evd 0x80a1958 cqlen=32 > dapls_ib_cq_alloc: new_cq 0x80a1a68 cqlen=63 > setup_async_cb: ia 0x80a1648 type 2 hdl 0x80a1958 cb 0xa7b1f174 ctx > 0x80a1958 > dat_get_ia_handle from 1 to 0x80a1648 > dat_get_ia_handle from 1 to 0x80a1648 > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > 
setup_listener Permission denied > setup_listener Permission denied > setup_listener Address already in use > listen(ia_ptr 0x80a1648 SID 1025 sp 0x80a7a00 conn 0x80a7a70 id > 134904736) > listen(conn=0x80a7a70 cm_id=134904736) > dat_get_ia_handle from 1 to 0x80a1648 > mr_register: ia=0x80a1648, lmr=0x80a3718 va=0x80ae000 ln=266240 > pv=0x0 > mr_register: mr=0x80a37c8 h 4 pd 0x80a1928 ctx 0x809ecd0 > lkey=0x72002700 rkey=0x72002700 priv=41000 > dat_get_ia_handle from 1 to 0x80a1648 > mr_register: ia=0x80a1648, lmr=0x80a7f18 va=0x80ef000 ln=528384 > pv=0x0 > mr_register: mr=0x80a7fc8 h 5 pd 0x80a1928 ctx 0x809ecd0 > lkey=0xf2002800 rkey=0xf2002800 priv=81000 > getipaddr: family 2 port 0 addr 192.168.0.15 > open_hca: ctx=0x809ecc0 port=1 GID subnet fe80000000000000 id > 0002c9020020b4f5 > open_hca: ib0, AF_INET 192.168.0.15 INLINE_MAX=128 > ib_thread(12326) poll_event: async=0x1 pipe=0x1 cm=0x0 cq=0x0 > ib_thread(12326) poll_fd: hca[134729576]=0xb, async=8 pipe=12 > cm=13 cq=d > query_hca: ib0 AF_INET 192.168.0.15 > query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071 > query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 > rd_io 4 > setup_async_cb: ia 0x80a1638 type 0 hdl (nil) cb 0xa7b8bc6c ctx > 0x80a16c0 > setup_async_cb: ia 0x80a1638 type 1 hdl (nil) cb 0xa7b8b9c0 ctx > 0x80a16c0 > setup_async_cb: ia 0x80a1638 type 3 hdl (nil) cb 0xa7b8bb50 ctx > 0x80a1638 > dat_set_handle 0x80a1638 to 1 > dat_get_ia_handle from 1 to 0x80a1638 > pd_alloc: pd_handle=0x80a1918 > dat_get_ia_handle from 1 to 0x80a1638 > query_hca: ib0 AF_INET 192.168.0.15 > query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071 > query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 > rd_io 4 > dat_get_ia_handle from 1 to 0x80a1638 > cq_object_create: (0x80a1948,0x80a1a34) > dapls_ib_cq_alloc: evd 0x80a1948 cqlen=32 > dapls_ib_cq_alloc: new_cq 0x80a1a58 cqlen=63 > setup_async_cb: ia 0x80a1638 type 2 hdl 0x80a1948 cb 0xa7b8c174 ctx > 0x80a1948 > 
dat_get_ia_handle from 1 to 0x80a1638 > dat_get_ia_handle from 1 to 0x80a1638 > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > setup_listener Permission denied > listen(ia_ptr 0x80a1638 SID 1024 sp 0x80a7a00 conn 0x80a7a70 id > 134904736) > listen(conn=0x80a7a70 cm_id=134904736) > dat_get_ia_handle from 1 to 0x80a1638 > mr_register: ia=0x80a1638, lmr=0x80a3708 va=0x80ae000 ln=266240 > pv=0x0 > mr_register: mr=0x80a37b8 h 1 pd 0x80a1918 ctx 0x809ecc0 > lkey=0x60002400 rkey=0x60002400 priv=41000 > dat_get_ia_handle from 1 to 0x80a1638 > mr_register: ia=0x80a1638, lmr=0x80a7ee8 va=0x80ef000 ln=528384 > pv=0x0 > mr_register: mr=0x80a7f98 h 2 pd 0x80a1918 ctx 0x809ecc0 > lkey=0x60002500 rkey=0x60002500 priv=81000 > #--------------------------------------------------- > # Intel (R) MPI Benchmark Suite V2.3, MPI-1 part > #--------------------------------------------------- > # Date : Tue May 8 11:16:58 2007 > # Machine : i686# System : Linux > # Release : 2.6.18 > # Version : #1 SMP Tue Nov 14 18:02:03 CET 2006 > > # > # Minimum message length in bytes: 0 > # Maximum message length in bytes: 16777216 > # > # MPI_Datatype : MPI_BYTE > # MPI_Datatype for reductions : MPI_FLOAT > # MPI_Op : MPI_SUM > # > # > > # List of Benchmarks 
to run:
>
> # PingPong
> dat_get_ia_handle from 1 to 0x80a1638
> query_hca: MAX msg 2147483648 dto 16384 iov 30 rdma i4,o4
> qp_alloc: ia_ptr 0x80a1638 ep_ptr 0x81741f8 ep_ctx_ptr 0x81741f8
> create_qp Address already in use
>
> -------------------------------------------------------------------------
>
> The job hangs at this point. From the output of another simple test
> program I assume that it hangs inside of a receive operation. Of course,
> I have noticed the "Permission denied" messages, but I don't think that
> the problem is there. These messages seem to come from RDMA CM when
> things are set up, but the execution continues from there on and I have
> seen these messages on successful DAPL runs, too. I'm not very familiar
> with RDMA CM, though, so I don't know the cause of these messages.
>
> That's a lot of information, I know, but it would be great if someone
> would have a look at it.
>
> Thanks in advance
> Boris
>
>
>
> Donald Kerr wrote:
>> I have not tried Open MPI uDAPL on Linux nor do I have access to a Linux
>> box so I am having a difficult time trying to find a way to help you
>> debug this issue.
>>
>> -DON
>>
>> Andreas Kuntze wrote:
>>
>>> On Linux you needn't initialise the dat registry. Your program prints:
>>> "provider 1: OpenIB-cma". I successfully tested INTEL MPI and mvapich2
>>> with uDAPL.
>>>
>>> Andreas
>>>
>>> Donald Kerr wrote:
>>>
>>>
>>>> Andreas,
>>>>
>>>> I am going to guess at a minimum the interfaces are up and you can
>>>> ping them. On Solaris there is an additional step required and that
>>>> is initializing the dat registry. If "/usr/sbin/datadm -v" does not
>>>> show some driver output then you would need to run
>>>> "/usr/sbin/datadm -a /usr/share/dat/SUNWudaplt.conf". I don't know
>>>> if there is an equivalent on Linux.
>>>>
>>>> Attached is a simple udapl program which will check if the interfaces
>>>> are available in the dat registry.
>>>>
>>>> -DON
>>>>
>>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users at open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>
> --
> | _ RWTH | Boris Bierbaum
> |_|_`_ | Lehrstuhl fuer Betriebssysteme
> | |_) _ | RWTH Aachen D-52056 Aachen
> |_)(_` | Tel: +49-241-80-27805
> ._) | Fax: +49-241-80-22339

--
Jeff Squyres
Cisco Systems

From mst at dev.mellanox.co.il Tue May 8 06:34:59 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 May 2007 16:34:59 +0300
Subject: [ofa-general] Re: ofa_1_2_kernel 20070508-0200 daily build status
In-Reply-To: <20070508093812.9A193E603C1@openfabrics.org>
References: <20070508093812.9A193E603C1@openfabrics.org>
Message-ID: <20070508133459.GQ21591@mellanox.co.il>

> Failed:
> Build failed on i686 with linux-2.6.21.1

OK, there were some build failures in ipoib, rds and cxgb3.
I picked the ipoib and cxgb3 patches from 2.6.21 git, and now it builds.
We missed the 20070508 build, but the fixes will be in the next daily.

Steve, you might want to review the patch under
kernel_patches/backports/2.6.21/, and/or test OFED there, on your hardware.
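--
MST

From swise at opengridcomputing.com Tue May 8 06:47:09 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 08 May 2007 08:47:09 -0500
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To:
References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <1178578346.30571.183.camel@stevo-desktop>
Message-ID: <1178632029.3056.3.camel@stevo-desktop>

On Mon, 2007-05-07 at 20:39 -0400, Jeff Squyres wrote:
> On May 7, 2007, at 6:52 PM, Steve Wise wrote:
>
> > Also, there appears to be a DAPL BTL in OMPI. Is this BTL complete and
> > enabled for the ofed-1.2 udapl library?
>
> Yes, it is complete and is well-tested in Solaris.
>
> It is not well tested in Linux/OFED (we've been concentrating on the
> verbs interface on the OFED side of things -- the "openib" BTL [we
> never renamed it when OpenIB changed names to OpenFabrics]). In
> fact, we've had scattered reports of it not working properly in
> Linux/OFED, but those could well have been pilot error (i.e., me not
> trying to run properly -- I know just about zilch about udapl).
>

The reason I'm asking is twofold:

1) this can get OMPI running on iwarp devices today if it works.

2) the udapl code can be a model for the rdma-cm piece, since the two
are similar (client/server connection setup, ipaddr/port based, etc)...

I'll try it out on T3.

Steve.

From yosefe at voltaire.com Tue May 8 06:54:52 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 16:54:52 +0300
Subject: [ofa-general] [PATCHv3 0/2] pkey change handling - fix bug #577
Message-ID: <4640812C.6060003@voltaire.com>

These two patches fix bug #577:
PKey table reordering caused by SM failover stops ipoib traffic

patch 1: add uncached device queries to core
patch 2: restart ipoib qp on pkey change event, and use uncached
         queries on qp init
--

From yosefe at voltaire.com Tue May 8 07:03:34 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 17:03:34 +0300
Subject: [ofa-general] [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries
In-Reply-To: <4640812C.6060003@voltaire.com>
References: <4640812C.6060003@voltaire.com>
Message-ID: <46408336.8080908@voltaire.com>

Add ib_find_gid and ib_find_pkey over uncached device queries.
The calls might block, but the results are always up to date.

Signed-off-by: Yosef Etigin
---
 drivers/infiniband/core/device.c | 96 +++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h | 23 +++++++++
 2 files changed, 119 insertions(+)

Index: b/drivers/infiniband/core/device.c
===================================================================
--- a/drivers/infiniband/core/device.c	2007-05-07 15:42:19.000000000 +0300
+++ b/drivers/infiniband/core/device.c	2007-05-08 11:16:35.049600754 +0300
@@ -149,6 +149,18 @@ static int alloc_name(char *name)
 	return 0;
 }
 
+static inline int start_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
+}
+
+
+static inline int end_port(struct ib_device *device)
+{
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?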
From yosefe at voltaire.com Tue May 8 06:54:52 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Tue, 08 May 2007 16:54:52 +0300 Subject: [ofa-general] [PATCHv3 0/2] pkey change handling - fix bug #577 Message-ID: <4640812C.6060003@voltaire.com> These two patches fix bug #577: PKey table reordering caused by SM failover stops ipoib traffic patch 1: add uncached device queries to core patch 2: restart ipoib qp on pkey change event, and use uncached queries on qp init -- From yosefe at voltaire.com Tue May 8 07:03:34 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Tue, 08 May 2007 17:03:34 +0300 Subject: [ofa-general] [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries In-Reply-To: <4640812C.6060003@voltaire.com> References: <4640812C.6060003@voltaire.com> Message-ID: <46408336.8080908@voltaire.com> Add ib_find_gid and ib_find_pkey over uncached device queries. The calls might block but the returns are always up-to-date. Signed-off-by: Yosef Etigin --- drivers/infiniband/core/device.c | 96 +++++++++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 23 +++++++++ 2 files changed, 119 insertions(+) Index: b/drivers/infiniband/core/device.c =================================================================== --- a/drivers/infiniband/core/device.c 2007-05-07 15:42:19.000000000 +0300 +++ b/drivers/infiniband/core/device.c 2007-05-08 11:16:35.049600754 +0300 @@ -149,6 +149,18 @@ static int alloc_name(char *name) return 0; } +static inline int start_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; +} + + +static inline int end_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 
+ 0 : device->phys_port_cnt; +} + /** * ib_alloc_device - allocate an IB device struct * @size:size of structure to allocate @@ -592,6 +604,90 @@ int ib_modify_port(struct ib_device *dev } EXPORT_SYMBOL(ib_modify_port); +/** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index) +{ + struct ib_port_attr *tprops = NULL; + union ib_gid tmp_gid; + int ret, port, i; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + + for (port = start_port(device); port <= end_port(device); ++port) { + ret = ib_query_port(device, port, tprops); + if (ret) + continue; + + for (i = 0; i < tprops->gid_tbl_len; ++i) { + ret = ib_query_gid(device, port, i, &tmp_gid); + if (ret) + goto out; + if (!memcmp(&tmp_gid, gid, sizeof *gid)) { + *port_num = port; + *index = i; + ret = 0; + goto out; + } + } + } + ret = -ENOENT; +out: + kfree(tprops); + return ret; +} +EXPORT_SYMBOL(ib_find_gid); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. + * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. 
+ */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index) +{ + struct ib_port_attr *tprops = NULL; + int ret, i; + u16 tmp_pkey; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + + ret = ib_query_port(device, port_num, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed , ret = %d\n", ret); + goto out; + } + + for (i = 0; i < tprops->pkey_tbl_len; ++i) { + ret = ib_query_pkey(device, port_num, i, &tmp_pkey); + if (ret) + goto out; + + if (pkey == tmp_pkey) { + *index = i; + ret = 0; + goto out; + } + } + ret = -ENOENT; + +out: + kfree(tprops); + return ret; +} +EXPORT_SYMBOL(ib_find_pkey); + static int __init ib_core_init(void) { int ret; Index: b/include/rdma/ib_verbs.h =================================================================== --- a/include/rdma/ib_verbs.h 2007-05-07 15:41:13.000000000 +0300 +++ b/include/rdma/ib_verbs.h 2007-05-07 15:43:04.000000000 +0300 @@ -1134,6 +1134,29 @@ int ib_modify_port(struct ib_device *dev struct ib_port_modify *port_modify); /** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. + * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. + */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index); + +/** * ib_alloc_pd - Allocates an unused protection domain. * @device: The device on which to allocate the protection domain. 
  *

From yosefe at voltaire.com Tue May 8 07:04:16 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 17:04:16 +0300
Subject: [ofa-general] [PATCHv2 1/2] ipoib: handle pkey change events
In-Reply-To: <4640812C.6060003@voltaire.com>
References: <4640812C.6060003@voltaire.com>
Message-ID: <46408360.3040006@voltaire.com>

This issue was found during partitioning & SM failover testing.

* Add a flush flag to ipoib_ib_dev_stop(), as in ipoib_ib_dev_down()
* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
* Upon a PKEY_CHANGE event, schedule a work that restarts the QP
* Restart child interfaces before the parent
* Use an uncached pkey query upon qp initialization

SM reconfiguration or failover can shuffle the values in the port pkey
table. The current implementation queries the index of the pkey only
once, when it creates the device QP, and then moves the QP into working
state, so it does not handle this scenario. Fix this by using the
PKEY_CHANGE event as a trigger to reconfigure the device QP.
Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 6 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 62 +++++++++++++++++++++-------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +-- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 11 ++--- 4 files changed, 59 insertions(+), 27 deletions(-) Index: b/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 12:13:39.481155747 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 12:15:14.716172776 +0300 @@ -202,11 +202,12 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; + struct delayed_work pkey_poll_task; struct delayed_work mcast_task; struct work_struct flush_task; struct work_struct restart_task; struct delayed_work ah_reap_task; + struct work_struct pkey_event_task; struct ib_device *ca; u8 port; @@ -333,12 +334,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 12:13:39.481155747 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 12:57:20.842183673 +0300 @@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device ret = 
ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -481,7 +481,7 @@ int ipoib_ib_dev_down(struct net_device if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { mutex_lock(&pkey_mutex); set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); + cancel_delayed_work(&priv->pkey_poll_task); mutex_unlock(&pkey_mutex); if (flush) flush_workqueue(ipoib_workqueue); @@ -508,7 +508,7 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +581,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,13 +623,24 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct *work) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { + mutex_lock(&priv->vlan_mutex); + + /* Flush any child interfaces */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, restart_qp); + + mutex_unlock(&priv->vlan_mutex); + + /* + * If the device is not initiallized since it needs a pkey - + * try to reopen it + */ + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { ipoib_dbg(priv, "Not 
flushing - IPOIB_FLAG_INITIALIZED not set.\n"); return; } @@ -642,6 +654,12 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_down(dev, 0); + if (restart_qp) { + if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) + ipoib_ib_dev_stop(dev, 0); + ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +668,24 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task); - /* Flush any child interfaces too */ - list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - mutex_unlock(&priv->vlan_mutex); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_event_task); + + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void ipoib_ib_dev_cleanup(struct net_device *dev) @@ -685,7 +713,7 @@ void ipoib_ib_dev_cleanup(struct net_dev void ipoib_pkey_poll(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); + container_of(work, struct ipoib_dev_priv, pkey_poll_task.work); struct net_device *dev = priv->dev; ipoib_pkey_dev_check_presence(dev); @@ -696,7 +724,7 @@ void ipoib_pkey_poll(struct work_struct mutex_lock(&pkey_mutex); if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); } @@ -715,7 +743,7 @@ int ipoib_pkey_dev_delay_open(struct net mutex_lock(&pkey_mutex); clear_bit(IPOIB_PKEY_STOP, 
&priv->flags); queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); return 1; Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 12:13:39.481155747 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 12:20:51.605085249 +0300 @@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev) return -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 12:13:39.481155747 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 12:57:41.882456471 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ -#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + if (ib_find_pkey(priv->ca, 
priv->port, priv->pkey, &pkey_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ret = -ENXIO; goto out; @@ -103,7 +101,7 @@ int ipoib_init_qp(struct net_device *dev * The port has to be assigned to the respective IB partition in * advance. */ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); + ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); if (ret) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); return ret; @@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE || @@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler record->element.port_num == priv->port) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE && + record->element.port_num == priv->port) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_event_task); } } From mst at dev.mellanox.co.il Tue May 8 07:17:27 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 May 2007 17:17:27 +0300 Subject: [ofa-general] libmlx4 wc flash Message-ID: <20070508141727.GR21591@mellanox.co.il> Roland, libmlx4 has this comment: /* FIXME flush wc buffers */ and since it does *not* currently actually flush the buffers, if we enable WC for blueflame, WRs get mixed in the WC buffer, and the QP gets corrupted/stuck. It seems we should have arch.h under mthca and stick some macro like wc_wmb() in there. Or, would infiniband/arch.h under libibverbs be a better place? If WC is not enabled, userspace can avoid the flush - so, should we return such a bit as part of the kernel ABI? 
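For illustration only, here is a minimal sketch of what a wc_wmb()-style macro could look like. It assumes that on x86-64 an SFENCE is enough to push out pending write-combined stores, and it falls back to a full barrier through a GCC/Clang builtin on other architectures. The helper copy_wr_to_wc() and the choice of fallback are hypothetical; this is not the libmlx4 or libibverbs implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumption: wc_wmb() is the macro name floated in the mail; its body here
 * is a guess. On x86-64, SFENCE orders earlier stores, including
 * write-combined ones. Elsewhere this sketch uses a full barrier builtin.
 */
#if defined(__x86_64__)
#define wc_wmb() __asm__ __volatile__("sfence" ::: "memory")
#else
#define wc_wmb() __sync_synchronize()
#endif

/*
 * Hypothetical helper: copy one work request into a (possibly
 * write-combining) buffer, then fence so that the stores of the next
 * work request are not combined with this one.
 */
static void copy_wr_to_wc(volatile uint64_t *dst, const uint64_t *src,
                          size_t words)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < words; ++i)
        dst[i] = src[i];
    wc_wmb();
}
```

If the kernel reported that WC is not enabled, a caller could skip the wc_wmb() step; how that bit would be exposed is left open here, as it is in the mail.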
-- MST From swise at opengridcomputing.com Tue May 8 07:32:35 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 08 May 2007 09:32:35 -0500 Subject: [ofa-general] Re: ofa_1_2_kernel 20070508-0200 daily build status In-Reply-To: <20070508133459.GQ21591@mellanox.co.il> References: <20070508093812.9A193E603C1@openfabrics.org> <20070508133459.GQ21591@mellanox.co.il> Message-ID: <1178634755.3056.7.camel@stevo-desktop> On Tue, 2007-05-08 at 16:34 +0300, Michael S. Tsirkin wrote: > > Failed: > > Build failed on i686 with linux-2.6.21.1 > > OK, there were some build failures in ipoib, rds and cxgb3. > I picked the ipoib and cxgb3 patches from 2.6.21 git, > and now it builds. we missed 20070508 but will be in the next daily. > > Steve, you might want to review > the patch under kernel_patches/backports/2.6.21/, > and/or test OFED there, on your hardware. > looks good. steve. From mst at dev.mellanox.co.il Tue May 8 08:03:56 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 May 2007 18:03:56 +0300 Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events In-Reply-To: <46408360.3040006@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> Message-ID: <20070508150356.GT21591@mellanox.co.il> > Quoting Yosef Etigin : > Subject: [PATCHv2 1/2] ipoib: handle pkey change events > > This issue was found during partitioning & SM fail over testing. > > * Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike > * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity > * Upon PKEY_CHANGE event, schedule a work that restarts the QP > * Restart child interfaces before parent What's the reason for this change? Is this a separate bugfix? You might want to put this info in the log. > * Use uncached pkey query upon qp initiallization > > SM reconfiguration or failover possibly causes a shuffling of the values in the port > pkey table. 
The current implementation only queries for the index of the pkey once, > when it creates the device QP and after that moves it into working state, and hence > does not address this scenario. Fix this by using the PKEY_CHANGE event as a trigger > to reconfigure the device QP. > > > Signed-off-by: Yosef Etigin Btw, pls try making log lines a bit shorter - git log shifts everything to the right. -- MST From mst at dev.mellanox.co.il Tue May 8 08:09:07 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 May 2007 18:09:07 +0300 Subject: [ofa-general] Re: [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries In-Reply-To: <46408336.8080908@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com> Message-ID: <20070508150907.GU21591@mellanox.co.il> > Quoting Yosef Etigin : > Subject: [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries > > Add ib_find_gid and ib_find_pkey over uncached device queries. > The calls might block but the returns are always up-to-date. > > > Signed-off-by: Yosef Etigin > --- > drivers/infiniband/core/device.c | 96 +++++++++++++++++++++++++++++++++++++++ > include/rdma/ib_verbs.h | 23 +++++++++ > 2 files changed, 119 insertions(+) > > Index: b/drivers/infiniband/core/device.c > =================================================================== > --- a/drivers/infiniband/core/device.c 2007-05-07 15:42:19.000000000 +0300 > +++ b/drivers/infiniband/core/device.c 2007-05-08 11:16:35.049600754 +0300 > @@ -149,6 +149,18 @@ static int alloc_name(char *name) > return 0; > } > > +static inline int start_port(struct ib_device *device) > +{ > + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; > +} > + > + > +static inline int end_port(struct ib_device *device) > +{ > + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 
> + 0 : device->phys_port_cnt; > +} > + > /** > * ib_alloc_device - allocate an IB device struct > * @size:size of structure to allocate > @@ -592,6 +604,90 @@ int ib_modify_port(struct ib_device *dev > } > EXPORT_SYMBOL(ib_modify_port); > > +/** > + * ib_find_gid - Returns the port number and GID table index where > + * a specified GID value occurs. > + * @device: The device to query. > + * @gid: The GID value to search for. > + * @port_num: The port number of the device where the GID value was found. > + * @index: The index into the GID table where the GID was found. This > + * parameter may be NULL. > + */ > +int ib_find_gid(struct ib_device *device, union ib_gid *gid, > + u8 *port_num, u16 *index) > +{ > + struct ib_port_attr *tprops = NULL; > + union ib_gid tmp_gid; > + int ret, port, i; > + > + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); > + > + for (port = start_port(device); port <= end_port(device); ++port) { > + ret = ib_query_port(device, port, tprops); > + if (ret) > + continue; > + > + for (i = 0; i < tprops->gid_tbl_len; ++i) { > + ret = ib_query_gid(device, port, i, &tmp_gid); > + if (ret) > + goto out; > + if (!memcmp(&tmp_gid, gid, sizeof *gid)) { > + *port_num = port; > + *index = i; > + ret = 0; > + goto out; > + } > + } > + } > + ret = -ENOENT; > +out: > + kfree(tprops); > + return ret; > +} > +EXPORT_SYMBOL(ib_find_gid); > + > +/** > + * ib_find_pkey - Returns the PKey table index where a specified > + * PKey value occurs. > + * @device: The device to query. > + * @port_num: The port number of the device to search for the PKey. > + * @pkey: The PKey value to search for. > + * @index: The index into the PKey table where the PKey was found. 
> + */ > +int ib_find_pkey(struct ib_device *device, > + u8 port_num, u16 pkey, u16 *index) > +{ > + struct ib_port_attr *tprops = NULL; > + int ret, i; > + u16 tmp_pkey; > + > + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); > + > + ret = ib_query_port(device, port_num, tprops); > + if (ret) { > + printk(KERN_WARNING "ib_query_port failed , ret = %d\n", ret); > + goto out; > + } > + > + for (i = 0; i < tprops->pkey_tbl_len; ++i) { > + ret = ib_query_pkey(device, port_num, i, &tmp_pkey); > + if (ret) > + goto out; > + > + if (pkey == tmp_pkey) { > + *index = i; > + ret = 0; > + goto out; > + } > + } > + ret = -ENOENT; > + > +out: > + kfree(tprops); > + return ret; > +} > +EXPORT_SYMBOL(ib_find_pkey); > + > static int __init ib_core_init(void) > { > int ret; OK, look good - later, providers will be able to optimize these by caching ib_query_pkey/ib_query_gid calls. But I see a problem here in that ib_query_port is a call providers won't be able to optimize out (because it includes e.g. port state), and it seems a waste. Is that right? One way out would be to pass the table length in to ib_find_pkey/ib_find_gid as an extra parameter, and cache that at the ULP level. -- MST From yosefe at voltaire.com Tue May 8 08:11:09 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Tue, 08 May 2007 18:11:09 +0300 Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events In-Reply-To: <20070508150356.GT21591@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508150356.GT21591@mellanox.co.il> Message-ID: <4640930D.9010800@voltaire.com> Michael S. Tsirkin wrote: >>Quoting Yosef Etigin : >>Subject: [PATCHv2 1/2] ipoib: handle pkey change events >> >>This issue was found during partitioning & SM fail over testing. 
>> >> * Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike >> * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity >> * Upon PKEY_CHANGE event, schedule a work that restarts the QP >> * Restart child interfaces before parent > > > What's the reason for this change? > Is this a separate bugfix? > You might want to put this info in the log. > The reason is that if the child are restarted after the parent, and the parent is not up, then the flush function returns immediately due to the INITIALLIZED bit test. Now I think that we might use a goto statement instead. > >> * Use uncached pkey query upon qp initiallization >> >>SM reconfiguration or failover possibly causes a shuffling of the values in the port >>pkey table. The current implementation only queries for the index of the pkey once, >>when it creates the device QP and after that moves it into working state, and hence >>does not address this scenario. Fix this by using the PKEY_CHANGE event as a trigger >>to reconfigure the device QP. >> >> >>Signed-off-by: Yosef Etigin > > > Btw, pls try making log lines a bit shorter - git log shifts everything > to the right. > Ok From mst at dev.mellanox.co.il Tue May 8 08:19:27 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 May 2007 18:19:27 +0300 Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events In-Reply-To: <4640930D.9010800@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508150356.GT21591@mellanox.co.il> <4640930D.9010800@voltaire.com> Message-ID: <20070508151927.GW21591@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events > > Michael S. Tsirkin wrote: > >>Quoting Yosef Etigin : > >>Subject: [PATCHv2 1/2] ipoib: handle pkey change events > >> > >>This issue was found during partitioning & SM fail over testing. 
> >> > >> * Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike > >> * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity > >> * Upon PKEY_CHANGE event, schedule a work that restarts the QP > >> * Restart child interfaces before parent > > > > > > What's the reason for this change? > > Is this a separate bugfix? > > You might want to put this info in the log. > > > > The reason is that if the child are restarted after the parent, and the parent is > not up, then the flush function returns immediately due to the INITIALLIZED bit test. > Now I think that we might use a goto statement instead. So ... what the problem? I still don't see it. -- MST From yosefe at voltaire.com Tue May 8 08:19:32 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Tue, 08 May 2007 18:19:32 +0300 Subject: [ofa-general] Re: [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries In-Reply-To: <20070508150907.GU21591@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com> <20070508150907.GU21591@mellanox.co.il> Message-ID: <46409504.9000802@voltaire.com> Michael S. Tsirkin wrote: >>Quoting Yosef Etigin : >>Subject: [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries >> >>Add ib_find_gid and ib_find_pkey over uncached device queries. >>The calls might block but the returns are always up-to-date. 
>> >> >>Signed-off-by: Yosef Etigin >>--- >> drivers/infiniband/core/device.c | 96 +++++++++++++++++++++++++++++++++++++++ >> include/rdma/ib_verbs.h | 23 +++++++++ >> 2 files changed, 119 insertions(+) >> >>Index: b/drivers/infiniband/core/device.c >>=================================================================== >>--- a/drivers/infiniband/core/device.c 2007-05-07 15:42:19.000000000 +0300 >>+++ b/drivers/infiniband/core/device.c 2007-05-08 11:16:35.049600754 +0300 >>@@ -149,6 +149,18 @@ static int alloc_name(char *name) >> return 0; >> } >> >>+static inline int start_port(struct ib_device *device) >>+{ >>+ return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; >>+} >>+ >>+ >>+static inline int end_port(struct ib_device *device) >>+{ >>+ return (device->node_type == RDMA_NODE_IB_SWITCH) ? >>+ 0 : device->phys_port_cnt; >>+} >>+ >> /** >> * ib_alloc_device - allocate an IB device struct >> * @size:size of structure to allocate >>@@ -592,6 +604,90 @@ int ib_modify_port(struct ib_device *dev >> } >> EXPORT_SYMBOL(ib_modify_port); >> >>+/** >>+ * ib_find_gid - Returns the port number and GID table index where >>+ * a specified GID value occurs. >>+ * @device: The device to query. >>+ * @gid: The GID value to search for. >>+ * @port_num: The port number of the device where the GID value was found. >>+ * @index: The index into the GID table where the GID was found. This >>+ * parameter may be NULL. 
>>+ */ >>+int ib_find_gid(struct ib_device *device, union ib_gid *gid, >>+ u8 *port_num, u16 *index) >>+{ >>+ struct ib_port_attr *tprops = NULL; >>+ union ib_gid tmp_gid; >>+ int ret, port, i; >>+ >>+ tprops = kmalloc(sizeof *tprops, GFP_KERNEL); >>+ >>+ for (port = start_port(device); port <= end_port(device); ++port) { >>+ ret = ib_query_port(device, port, tprops); >>+ if (ret) >>+ continue; >>+ >>+ for (i = 0; i < tprops->gid_tbl_len; ++i) { >>+ ret = ib_query_gid(device, port, i, &tmp_gid); >>+ if (ret) >>+ goto out; >>+ if (!memcmp(&tmp_gid, gid, sizeof *gid)) { >>+ *port_num = port; >>+ *index = i; >>+ ret = 0; >>+ goto out; >>+ } >>+ } >>+ } >>+ ret = -ENOENT; >>+out: >>+ kfree(tprops); >>+ return ret; >>+} >>+EXPORT_SYMBOL(ib_find_gid); >>+ >>+/** >>+ * ib_find_pkey - Returns the PKey table index where a specified >>+ * PKey value occurs. >>+ * @device: The device to query. >>+ * @port_num: The port number of the device to search for the PKey. >>+ * @pkey: The PKey value to search for. >>+ * @index: The index into the PKey table where the PKey was found. >>+ */ >>+int ib_find_pkey(struct ib_device *device, >>+ u8 port_num, u16 pkey, u16 *index) >>+{ >>+ struct ib_port_attr *tprops = NULL; >>+ int ret, i; >>+ u16 tmp_pkey; >>+ >>+ tprops = kmalloc(sizeof *tprops, GFP_KERNEL); >>+ >>+ ret = ib_query_port(device, port_num, tprops); >>+ if (ret) { >>+ printk(KERN_WARNING "ib_query_port failed , ret = %d\n", ret); >>+ goto out; >>+ } >>+ >>+ for (i = 0; i < tprops->pkey_tbl_len; ++i) { >>+ ret = ib_query_pkey(device, port_num, i, &tmp_pkey); >>+ if (ret) >>+ goto out; >>+ >>+ if (pkey == tmp_pkey) { >>+ *index = i; >>+ ret = 0; >>+ goto out; >>+ } >>+ } >>+ ret = -ENOENT; >>+ >>+out: >>+ kfree(tprops); >>+ return ret; >>+} >>+EXPORT_SYMBOL(ib_find_pkey); >>+ >> static int __init ib_core_init(void) >> { >> int ret; > > > OK, look good - later, providers will be able to optimize these > by caching ib_query_pkey/ib_query_gid calls. 
> > But I see a problem here in that ib_query_port is a call providers > won't be able to optimize out (because it includes e.g. port state), > and it seems a waste. > > Is that right? > > One way out would be to pass the table length in to ib_find_pkey/ib_find_gid > as an extra parameter, and cache that at the ULP level. > A provider might try to remember the port state after each MAD it sees, but that looks like too much to demand from it. Anyway, since the port table lengths come from device properties rather than from MADs, the core can set each of these lengths during initialization and use them in the ib_find_* functions. From sashak at voltaire.com Tue May 8 08:29:44 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 8 May 2007 18:29:44 +0300 Subject: [ofa-general] Re: [PATCH TRIVIAL] opensm/osm_helper: remove repeated strlen() calls In-Reply-To: <1178541140.32222.348653.camel@hal.voltaire.com> References: <462C7C21.7010004@dev.mellanox.co.il> <20070423101738.GG4579@mellanox.co.il> <462E80A3.5060503@dev.mellanox.co.il> <20070501005101.GA26019@sashak.voltaire.com> <4636E4A7.7060108@dev.mellanox.co.il> <1178211572.32222.3479.camel@hal.voltaire.com> <20070506124333.GB9692@sashak.voltaire.com> <20070506130352.GC9692@sashak.voltaire.com> <1178541140.32222.348653.camel@hal.voltaire.com> Message-ID: <20070508152944.GN9692@sashak.voltaire.com> On 08:32 Mon 07 May , Hal Rosenstock wrote: > On Sun, 2007-05-06 at 09:03, Sasha Khapyorsky wrote: > > Replace repeated strlen() calls used in sprintf() by actual string > > length accumulated from sprintf() return values. > > > > Signed-off-by: Sasha Khapyorsky > > Thanks. Applied to master only (as this is a cleanup rather than a bug > fix). Let me know if you think this should be applied to ofed_1_2. This is a minor improvement and not critical for ofed_1_2. Sasha From mst at dev.mellanox.co.il Tue May 8 08:26:50 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 8 May 2007 18:26:50 +0300 Subject: [ofa-general] Re: [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries In-Reply-To: <46409504.9000802@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com> <20070508150907.GU21591@mellanox.co.il> <46409504.9000802@voltaire.com> Message-ID: <20070508152650.GA5845@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries > > Michael S. Tsirkin wrote: > >>Quoting Yosef Etigin : > >>Subject: [PATCHv2 1/2] core: uncached "find gid" and "find pkey" queries > >> > >>Add ib_find_gid and ib_find_pkey over uncached device queries. > >>The calls might block but the returns are always up-to-date. > >> > >> > >>Signed-off-by: Yosef Etigin > >>--- > >> drivers/infiniband/core/device.c | 96 +++++++++++++++++++++++++++++++++++++++ > >> include/rdma/ib_verbs.h | 23 +++++++++ > >> 2 files changed, 119 insertions(+) > >> > >>Index: b/drivers/infiniband/core/device.c > >>=================================================================== > >>--- a/drivers/infiniband/core/device.c 2007-05-07 15:42:19.000000000 +0300 > >>+++ b/drivers/infiniband/core/device.c 2007-05-08 11:16:35.049600754 +0300 > >>@@ -149,6 +149,18 @@ static int alloc_name(char *name) > >> return 0; > >> } > >> > >>+static inline int start_port(struct ib_device *device) > >>+{ > >>+ return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; > >>+} > >>+ > >>+ > >>+static inline int end_port(struct ib_device *device) > >>+{ > >>+ return (device->node_type == RDMA_NODE_IB_SWITCH) ? > >>+ 0 : device->phys_port_cnt; > >>+} > >>+ > >> /** > >> * ib_alloc_device - allocate an IB device struct > >> * @size:size of structure to allocate > >>@@ -592,6 +604,90 @@ int ib_modify_port(struct ib_device *dev > >> } > >> EXPORT_SYMBOL(ib_modify_port); > >> > >>+/** > >>+ * ib_find_gid - Returns the port number and GID table index where > >>+ * a specified GID value occurs. 
> >>+ * @device: The device to query. > >>+ * @gid: The GID value to search for. > >>+ * @port_num: The port number of the device where the GID value was found. > >>+ * @index: The index into the GID table where the GID was found. This > >>+ * parameter may be NULL. > >>+ */ > >>+int ib_find_gid(struct ib_device *device, union ib_gid *gid, > >>+ u8 *port_num, u16 *index) > >>+{ > >>+ struct ib_port_attr *tprops = NULL; > >>+ union ib_gid tmp_gid; > >>+ int ret, port, i; > >>+ > >>+ tprops = kmalloc(sizeof *tprops, GFP_KERNEL); > >>+ > >>+ for (port = start_port(device); port <= end_port(device); ++port) { > >>+ ret = ib_query_port(device, port, tprops); > >>+ if (ret) > >>+ continue; > >>+ > >>+ for (i = 0; i < tprops->gid_tbl_len; ++i) { > >>+ ret = ib_query_gid(device, port, i, &tmp_gid); > >>+ if (ret) > >>+ goto out; > >>+ if (!memcmp(&tmp_gid, gid, sizeof *gid)) { > >>+ *port_num = port; > >>+ *index = i; > >>+ ret = 0; > >>+ goto out; > >>+ } > >>+ } > >>+ } > >>+ ret = -ENOENT; > >>+out: > >>+ kfree(tprops); > >>+ return ret; > >>+} > >>+EXPORT_SYMBOL(ib_find_gid); > >>+ > >>+/** > >>+ * ib_find_pkey - Returns the PKey table index where a specified > >>+ * PKey value occurs. > >>+ * @device: The device to query. > >>+ * @port_num: The port number of the device to search for the PKey. > >>+ * @pkey: The PKey value to search for. > >>+ * @index: The index into the PKey table where the PKey was found. 
> >>+ */ > >>+int ib_find_pkey(struct ib_device *device, > >>+ u8 port_num, u16 pkey, u16 *index) > >>+{ > >>+ struct ib_port_attr *tprops = NULL; > >>+ int ret, i; > >>+ u16 tmp_pkey; > >>+ > >>+ tprops = kmalloc(sizeof *tprops, GFP_KERNEL); > >>+ > >>+ ret = ib_query_port(device, port_num, tprops); > >>+ if (ret) { > >>+ printk(KERN_WARNING "ib_query_port failed , ret = %d\n", ret); > >>+ goto out; > >>+ } > >>+ > >>+ for (i = 0; i < tprops->pkey_tbl_len; ++i) { > >>+ ret = ib_query_pkey(device, port_num, i, &tmp_pkey); > >>+ if (ret) > >>+ goto out; > >>+ > >>+ if (pkey == tmp_pkey) { > >>+ *index = i; > >>+ ret = 0; > >>+ goto out; > >>+ } > >>+ } > >>+ ret = -ENOENT; > >>+ > >>+out: > >>+ kfree(tprops); > >>+ return ret; > >>+} > >>+EXPORT_SYMBOL(ib_find_pkey); > >>+ > >> static int __init ib_core_init(void) > >> { > >> int ret; > > > > > > OK, look good - later, providers will be able to optimize these > > by caching ib_query_pkey/ib_query_gid calls. > > > > But I see a problem here in that ib_query_port is a call providers > > won't be able to optimize out (because it includes e.g. port state), > > and it seems a waste. > > > > Is that right? > > > > One way out would be to pass the table length in to ib_find_pkey/ib_find_gid > > as an extra parameter, and cache that at the ULP level. > > > provider might try to remember the port state after each mad we see.. but it > looks like too much to demand from it. Port can go down without any MADs. > Anyway, since the information about the port table length does not come from mads > but from device properties, the core can set each of these lengths during initiallization, > and use them in ib_find_* functions. Passing it in looks simpler ... but maybe you're right. Patch? 
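The two options discussed in this exchange (the caller passes the table length in, or the core caches it once at initialization) can be sketched in plain userspace C. Everything below is a mock for illustration: struct mock_device, mock_query_pkey(), find_pkey_with_len() and find_pkey_cached_len() are hypothetical names, not the kernel structures or the functions from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_PKEY_TBL_MAX 8

/* Hypothetical stand-in for struct ib_device; not the kernel structure. */
struct mock_device {
    uint16_t pkey_tbl[MOCK_PKEY_TBL_MAX];
    int pkey_tbl_len;   /* assumed to be recorded once, at init time */
};

/* Hypothetical stand-in for ib_query_pkey(): reads one table entry. */
static int mock_query_pkey(const struct mock_device *dev, int index,
                           uint16_t *pkey)
{
    if (index < 0 || index >= dev->pkey_tbl_len)
        return -EINVAL;
    *pkey = dev->pkey_tbl[index];
    return 0;
}

/* Option 1: the caller supplies the table length, so no port query is needed. */
static int find_pkey_with_len(const struct mock_device *dev, int tbl_len,
                              uint16_t pkey, uint16_t *index)
{
    uint16_t tmp;
    int i, ret;

    assert(dev != NULL && index != NULL);
    for (i = 0; i < tbl_len; ++i) {
        ret = mock_query_pkey(dev, i, &tmp);
        if (ret)
            return ret;
        if (tmp == pkey) {
            *index = (uint16_t)i;
            return 0;
        }
    }
    return -ENOENT;
}

/* Option 2: the core uses the length it cached when the device was set up. */
static int find_pkey_cached_len(const struct mock_device *dev, uint16_t pkey,
                                uint16_t *index)
{
    return find_pkey_with_len(dev, dev->pkey_tbl_len, pkey, index);
}
```

Either way the per-entry query is still issued on every call, so the returned index stays up to date; only the table length is treated as static.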
-- MST From yosefe at voltaire.com Tue May 8 08:38:57 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Tue, 08 May 2007 18:38:57 +0300 Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events In-Reply-To: <20070508151927.GW21591@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508150356.GT21591@mellanox.co.il> <4640930D.9010800@voltaire.com> <20070508151927.GW21591@mellanox.co.il> Message-ID: <46409991.2070305@voltaire.com> Michael S. Tsirkin wrote: >>Quoting Yosef Etigin : >>Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events >> >>Michael S. Tsirkin wrote: >> >>>>Quoting Yosef Etigin : >>>>Subject: [PATCHv2 1/2] ipoib: handle pkey change events >>>> >>>>This issue was found during partitioning & SM fail over testing. >>>> >>>>* Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike >>>>* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity >>>>* Upon PKEY_CHANGE event, schedule a work that restarts the QP >>>>* Restart child interfaces before parent >>> >>> >>>What's the reason for this change? >>>Is this a separate bugfix? >>>You might want to put this info in the log. >>> >> >>The reason is that if the child are restarted after the parent, and the parent is >>not up, then the flush function returns immediately due to the INITIALLIZED bit test. >>Now I think that we might use a goto statement instead. > > > So ... what the problem? I still don't see it. > If I get pkey change event, I want to restart all active ifaces on my port. If the parent is not marked with IPOIB_FLAG_INITIALIZED, the function returns before it has a chance to recursively restart child ifaces. From mst at dev.mellanox.co.il Tue May 8 08:53:07 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 8 May 2007 18:53:07 +0300 Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events In-Reply-To: <46409991.2070305@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508150356.GT21591@mellanox.co.il> <4640930D.9010800@voltaire.com> <20070508151927.GW21591@mellanox.co.il> <46409991.2070305@voltaire.com> Message-ID: <20070508155307.GC5845@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events > > Michael S. Tsirkin wrote: > >>Quoting Yosef Etigin : > >>Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events > >> > >>Michael S. Tsirkin wrote: > >> > >>>>Quoting Yosef Etigin : > >>>>Subject: [PATCHv2 1/2] ipoib: handle pkey change events > >>>> > >>>>This issue was found during partitioning & SM fail over testing. > >>>> > >>>>* Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike > >>>>* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity > >>>>* Upon PKEY_CHANGE event, schedule a work that restarts the QP > >>>>* Restart child interfaces before parent > >>> > >>> > >>>What's the reason for this change? > >>>Is this a separate bugfix? > >>>You might want to put this info in the log. > >>> > >> > >>The reason is that if the child are restarted after the parent, and the parent is > >>not up, then the flush function returns immediately due to the INITIALLIZED bit test. > >>Now I think that we might use a goto statement instead. > > > > > > So ... what the problem? I still don't see it. > > > If I get pkey change event, I want to restart all active ifaces on my port. If > the parent is not marked with IPOIB_FLAG_INITIALIZED, the function returns > before it has a chance to recursively restart child ifaces. So now, if restart_qp is set, you are going to open it, and it's not initialized? BTW, if the interface is not initialized, is not QP in reset already? 
So can't we just move the code that assigns the pkey to the open call? Another idea - won't it be cleaner to have a function ipoib_restart_qp (functionally similar to ib_dev_flush, but also changing the QP) than adding a flag to ib_dev_flush? -- MST From yosefe at voltaire.com Tue May 8 09:00:10 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Tue, 08 May 2007 19:00:10 +0300 Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events In-Reply-To: <20070508155307.GC5845@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508150356.GT21591@mellanox.co.il> <4640930D.9010800@voltaire.com> <20070508151927.GW21591@mellanox.co.il> <46409991.2070305@voltaire.com> <20070508155307.GC5845@mellanox.co.il> Message-ID: <46409E8A.6040408@voltaire.com> Michael S. Tsirkin wrote: >>Quoting Yosef Etigin : >>Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events >> >>Michael S. Tsirkin wrote: >> >>>>Quoting Yosef Etigin : >>>>Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events >>>> >>>>Michael S. Tsirkin wrote: >>>> >>>> >>>>>>Quoting Yosef Etigin : >>>>>>Subject: [PATCHv2 1/2] ipoib: handle pkey change events >>>>>> >>>>>>This issue was found during partitioning & SM fail over testing. >>>>>> >>>>>>* Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike >>>>>>* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity >>>>>>* Upon PKEY_CHANGE event, schedule a work that restarts the QP >>>>>>* Restart child interfaces before parent >>>>> >>>>> >>>>>What's the reason for this change? >>>>>Is this a separate bugfix? >>>>>You might want to put this info in the log. >>>>> >>>> >>>>The reason is that if the child are restarted after the parent, and the parent is >>>>not up, then the flush function returns immediately due to the INITIALLIZED bit test. >>>>Now I think that we might use a goto statement instead. >>> >>> >>>So ... what the problem? I still don't see it. 
>>> >> >>If I get pkey change event, I want to restart all active ifaces on my port. If >>the parent is not marked with IPOIB_FLAG_INITIALIZED, the function returns >>before it has a chance to recursively restart child ifaces. > > > So now, if restart_qp is set, you are going to open it, and it's not > initialized? > > BTW, if the interface is not initialized, is not QP in reset already? > So can't we just move the code that assign the pkey to the open call? > No - I'm going to open its *child* interface. The problem is that parents "mask out" their children. > Another idea - won't it be cleaner to have a function ipoib_restart_qp > (functionally similiar to ib_dev_flush, but also changing the QP) > than adding a flag to ib_dev_flush? > It might be, but we wanted to avoid code duplication (the only difference is 2-3 lines) From mst at dev.mellanox.co.il Tue May 8 09:27:27 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 May 2007 19:27:27 +0300 Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events In-Reply-To: <46408360.3040006@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> Message-ID: <20070508162727.GD5845@mellanox.co.il> > @@ -622,13 +623,24 @@ int ipoib_ib_dev_init(struct net_device > return 0; > } > > -void ipoib_ib_dev_flush(struct work_struct *work) > +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp) > { > - struct ipoib_dev_priv *cpriv, *priv = > - container_of(work, struct ipoib_dev_priv, flush_task); > + struct ipoib_dev_priv *cpriv; > struct net_device *dev = priv->dev; > > - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { > + mutex_lock(&priv->vlan_mutex); > + > + /* Flush any child interfaces */ > + list_for_each_entry(cpriv, &priv->child_intfs, list) > + __ipoib_ib_dev_flush(cpriv, restart_qp); > + > + mutex_unlock(&priv->vlan_mutex); > + > + /* > + * If the device is not initiallized since it needs a pkey - > + * try to reopen it > 
+ */ Kill this comment - typos and all. > + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { > ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); > return; > } -- MST From mst at dev.mellanox.co.il Tue May 8 09:30:03 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 May 2007 19:30:03 +0300 Subject: [ofa-general] Re: [PATCHv2 1/2] ipoib: handle pkey change events In-Reply-To: <46409E8A.6040408@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508150356.GT21591@mellanox.co.il> <4640930D.9010800@voltaire.com> <20070508151927.GW21591@mellanox.co.il> <46409991.2070305@voltaire.com> <20070508155307.GC5845@mellanox.co.il> <46409E8A.6040408@voltaire.com> Message-ID: <20070508163003.GE5845@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events > > Michael S. Tsirkin wrote: > >>Quoting Yosef Etigin : > >>Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events > >> > >>Michael S. Tsirkin wrote: > >> > >>>>Quoting Yosef Etigin : > >>>>Subject: Re: [PATCHv2 1/2] ipoib: handle pkey change events > >>>> > >>>>Michael S. Tsirkin wrote: > >>>> > >>>> > >>>>>>Quoting Yosef Etigin : > >>>>>>Subject: [PATCHv2 1/2] ipoib: handle pkey change events > >>>>>> > >>>>>>This issue was found during partitioning & SM fail over testing. > >>>>>> > >>>>>>* Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike > >>>>>>* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity > >>>>>>* Upon PKEY_CHANGE event, schedule a work that restarts the QP > >>>>>>* Restart child interfaces before parent > >>>>> > >>>>> > >>>>>What's the reason for this change? > >>>>>Is this a separate bugfix? > >>>>>You might want to put this info in the log. > >>>>> > >>>> > >>>>The reason is that if the child are restarted after the parent, and the parent is > >>>>not up, then the flush function returns immediately due to the INITIALLIZED bit test. 
> >>>>Now I think that we might use a goto statement instead. > >>> > >>> > >>>So ... what the problem? I still don't see it. > >>> > >> > >>If I get pkey change event, I want to restart all active ifaces on my port. If > >>the parent is not marked with IPOIB_FLAG_INITIALIZED, the function returns > >>before it has a chance to recursively restart child ifaces. > > > > > > So now, if restart_qp is set, you are going to open it, and it's not > > initialized? > > > > BTW, if the interface is not initialized, is not QP in reset already? > > So can't we just move the code that assign the pkey to the open call? > > > No - I'm going to open its *child* interface. The problem is that parents "mask out" > their children. Aha, I see this now. You might want to explain this in the comment. Something like: /* Flush any child interfaces too - * they might be up even if the parent is down */ > > Another idea - won't it be cleaner to have a function ipoib_restart_qp > > (functionally similiar to ib_dev_flush, but also changing the QP) > > than adding a flag to ib_dev_flush? > > > It might be, but we wanted to avoid code duplication (the only difference is 2-3 lines) OK, it's a valid approach too. -- MST From swise at opengridcomputing.com Tue May 8 09:31:12 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 08 May 2007 11:31:12 -0500 Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <1178578346.30571.183.camel@stevo-desktop> <1178632029.3056.3.camel@stevo-desktop> <73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com> Message-ID: <1178641872.6064.12.camel@stevo-desktop> On Tue, 2007-05-08 at 11:50 -0400, Jeff Squyres wrote: > In the "FYI" category... 
> > There was discussion about the udpal BTL over OFED today on the > weekly developer teleconference (per my earlier post, a user is > reporting that it doesn't work). Andrew Friedley is going to work > with the Sun developers -- he thinks he might know where the problem > is coming from but is in process of physically relocating, and > therefore couldn't look at it until late next week at the earliest. > Sun may be able to pick up the issue -- but if so, I don't know what > their timeframe will be (and it may depend on the severity of the > problem). > Well I've tried OMPI on ofed-1.2 udapl today and it doesn't work. I'm debugging now. > > On May 8, 2007, at 9:47 AM, Steve Wise wrote: > > > On Mon, 2007-05-07 at 20:39 -0400, Jeff Squyres wrote: > >> On May 7, 2007, at 6:52 PM, Steve Wise wrote: > >> > >>> Also, there appears to be a DAPL BTL in OMPI. Is this BTL complete > >>> and > >>> enabled for the ofed-1.2 udapl library? > >> > >> Yes, it is complete and is well-tested in Solaris. > >> > >> It is not well tested in Linux/OFED (we've been concentrating on the > >> verbs interface on the OFED side of things -- the "openib" BTL [we > >> never renamed it when OpenIB changed names to OpenFabrics]). In > >> fact, we've had scattered reports of it not working properly in > >> Linux/ > >> OFED, but those could well have been pilot error (i.e., me not trying > >> to run properly -- I know just about zilch about udapl). > >> > > > > The reason I'm asking is twofold: > > > > 1) this can get OMPI running on iwarp devices today if it works. > > > > 2) the udapl code can be a model for the rdma-cm piece, since the two > > are similary (client / server connection set, ipaddr/port based, > > etc)... > > > > I'll try it out on T3. > > > > > > Steve. 
> >

From yosefe at voltaire.com Tue May 8 09:43:41 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 08 May 2007 19:43:41 +0300
Subject: [ofa-general] [PATCHv3 1/2] ipoib: handle pkey change events
In-Reply-To: <20070508162727.GD5845@mellanox.co.il>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il>
Message-ID: <4640A8BD.4000405@voltaire.com>

This issue was found during partitioning & SM failover testing.

* Add a flush flag to ipoib_ib_dev_stop(), as in ipoib_ib_dev_down()
* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
* Upon PKEY_CHANGE event, schedule a work item that restarts the QP
* Restart child interfaces before the parent. They might be up even if
  the parent is down
* Use an uncached pkey query upon QP initialization

SM reconfiguration or failover may shuffle the values in the port pkey
table. The current implementation queries the index of the pkey only
once, when it creates the device QP and moves it into the working state,
and hence does not handle this scenario. Fix this by using the
PKEY_CHANGE event as a trigger to reconfigure the device QP.
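The stale-index scenario described above can be sketched in plain C. This is only an illustration under stated assumptions: `struct pkey_table`, `struct fake_qp`, `find_pkey_index()` and `qp_refresh_pkey_index()` are hypothetical stand-ins, not the IPoIB or IB core code, and the table is a plain array rather than the result of a device query.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one port's pkey table. */
struct pkey_table {
	const uint16_t *entries;
	size_t len;
};

/* Hypothetical stand-in for the QP state that depends on the table. */
struct fake_qp {
	uint16_t pkey;        /* pkey value the interface was created with */
	uint16_t pkey_index;  /* table index programmed into the QP */
	int assigned;         /* non-zero while pkey_index is valid */
};

/* Linear scan of the current table: 0 and *index on a hit, -ENOENT otherwise. */
static int find_pkey_index(const struct pkey_table *tbl, uint16_t pkey,
			   uint16_t *index)
{
	size_t i;

	assert(tbl != NULL && index != NULL);
	for (i = 0; i < tbl->len; ++i) {
		if (tbl->entries[i] == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -ENOENT;
}

/*
 * Re-resolve the index against the current table. In this sketch it is
 * what a pkey change handler would call before restarting the QP.
 */
static int qp_refresh_pkey_index(struct fake_qp *qp,
				 const struct pkey_table *tbl)
{
	uint16_t idx;
	int ret;

	assert(qp != NULL);
	ret = find_pkey_index(tbl, qp->pkey, &idx);
	if (ret) {
		qp->assigned = 0;  /* pkey no longer present on this port */
		return ret;
	}
	qp->pkey_index = idx;
	qp->assigned = 1;
	return 0;
}
```

With a table of {0xffff, 0x8001} the pkey 0x8001 resolves to index 1. If the table is later rewritten as {0x8001, 0xffff}, the stored index points at a different pkey until qp_refresh_pkey_index() runs again, which is the step the PKEY_CHANGE event is meant to trigger.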
Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 6 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 59 ++++++++++++++++++++--------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +-- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 11 ++--- 4 files changed, 56 insertions(+), 27 deletions(-) Index: b/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 15:46:53.767972077 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 16:45:44.768882483 +0300 @@ -202,11 +202,12 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; + struct delayed_work pkey_poll_task; struct delayed_work mcast_task; struct work_struct flush_task; struct work_struct restart_task; struct delayed_work ah_reap_task; + struct work_struct pkey_event_task; struct ib_device *ca; u8 port; @@ -333,12 +334,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 15:46:53.784969043 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 19:37:12.841977849 +0300 @@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device ret = 
ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -481,7 +481,7 @@ int ipoib_ib_dev_down(struct net_device if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { mutex_lock(&pkey_mutex); set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); + cancel_delayed_work(&priv->pkey_poll_task); mutex_unlock(&pkey_mutex); if (flush) flush_workqueue(ipoib_workqueue); @@ -508,7 +508,7 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +581,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,13 +623,21 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct *work) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { + mutex_lock(&priv->vlan_mutex); + + /* Flush any child interfaces too - + * they might be up even if the parent is down */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, restart_qp); + + mutex_unlock(&priv->vlan_mutex); + + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not 
set.\n"); return; } @@ -642,6 +651,12 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_down(dev, 0); + if (restart_qp) { + if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) + ipoib_ib_dev_stop(dev, 0); + ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +665,24 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task); - /* Flush any child interfaces too */ - list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - mutex_unlock(&priv->vlan_mutex); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_event_task); + + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void ipoib_ib_dev_cleanup(struct net_device *dev) @@ -685,7 +710,7 @@ void ipoib_ib_dev_cleanup(struct net_dev void ipoib_pkey_poll(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); + container_of(work, struct ipoib_dev_priv, pkey_poll_task.work); struct net_device *dev = priv->dev; ipoib_pkey_dev_check_presence(dev); @@ -696,7 +721,7 @@ void ipoib_pkey_poll(struct work_struct mutex_lock(&pkey_mutex); if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); } @@ -715,7 +740,7 @@ int ipoib_pkey_dev_delay_open(struct net mutex_lock(&pkey_mutex); clear_bit(IPOIB_PKEY_STOP, &priv->flags); 
queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); return 1; Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 15:46:53.805965295 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 16:45:44.768882483 +0300 @@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev) return -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 15:46:53.877952447 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 16:45:44.769882305 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ -#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + if (ib_find_pkey(priv->ca, priv->port, 
priv->pkey, &pkey_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ret = -ENXIO; goto out; @@ -103,7 +101,7 @@ int ipoib_init_qp(struct net_device *dev * The port has to be assigned to the respective IB partition in * advance. */ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); + ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); if (ret) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); return ret; @@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE || @@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler record->element.port_num == priv->port) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE && + record->element.port_num == priv->port) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_event_task); } } From yosefe at voltaire.com Tue May 8 09:45:05 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Tue, 08 May 2007 19:45:05 +0300 Subject: [ofa-general] [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries In-Reply-To: <20070508152650.GA5845@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com> <20070508150907.GU21591@mellanox.co.il> <46409504.9000802@voltaire.com> <20070508152650.GA5845@mellanox.co.il> Message-ID: <4640A911.8000609@voltaire.com> * Add ib_find_gid and ib_find_pkey over uncached device queries. The calls might block but the returns are always up-to-date. * Cache pky,gid table lengths in core to avoid port info queries. 
Signed-off-by: Yosef Etigin --- drivers/infiniband/core/device.c | 140 +++++++++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 25 ++++++ 2 files changed, 165 insertions(+) Index: b/drivers/infiniband/core/device.c =================================================================== --- a/drivers/infiniband/core/device.c 2007-05-08 15:46:36.773005388 +0300 +++ b/drivers/infiniband/core/device.c 2007-05-08 19:30:53.095613249 +0300 @@ -149,6 +149,18 @@ static int alloc_name(char *name) return 0; } +static inline int start_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; +} + + +static inline int end_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? + 0 : device->phys_port_cnt; +} + /** * ib_alloc_device - allocate an IB device struct * @size:size of structure to allocate @@ -208,6 +220,56 @@ static int add_client_context(struct ib_ return 0; } +/* read the lengths of pkey,gid tables on each port */ +static inline int read_port_table_lengths(struct ib_device *device) +{ + struct ib_port_attr *tprops = NULL; + int num_ports, ret = -ENOMEM; + u8 port_index; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + goto out; + + num_ports = end_port(device) - start_port(device) + 1; + + device->pkey_tbl_len = kmalloc(num_ports * + sizeof *device->pkey_tbl_len, GFP_KERNEL); + if (!device->pkey_tbl_len) + goto out; + + device->gid_tbl_len = kmalloc(num_ports * + sizeof *device->gid_tbl_len, GFP_KERNEL); + if (!device->gid_tbl_len) + goto err1; + + for (port_index = 0; port_index < num_ports; ++port_index) { + ret = ib_query_port(device, port_index + start_port(device), + tprops); + if (ret) + goto err2; + + device->pkey_tbl_len[ port_index ] = tprops->pkey_tbl_len; + device->gid_tbl_len[ port_index ] = tprops->gid_tbl_len; + } + + ret = 0; + goto out; +err2: + kfree(device->gid_tbl_len); +err1: + kfree(device->pkey_tbl_len); +out: + kfree(tprops); + return ret; +} + +static 
inline void free_port_table_lengths(struct ib_device *device) +{ + kfree(device->gid_tbl_len); + kfree(device->pkey_tbl_len); +} + /** * ib_register_device - Register an IB device with IB core * @device:Device to register @@ -239,6 +301,13 @@ int ib_register_device(struct ib_device spin_lock_init(&device->event_handler_lock); spin_lock_init(&device->client_data_lock); + ret = read_port_table_lengths(device); + if (ret) { + printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n", + device->name); + goto out; + } + ret = ib_device_register_sysfs(device); if (ret) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", @@ -284,6 +353,8 @@ void ib_unregister_device(struct ib_devi list_del(&device->core_list); + free_port_table_lengths(device); + mutex_unlock(&device_mutex); spin_lock_irqsave(&device->client_data_lock, flags); @@ -592,6 +663,75 @@ int ib_modify_port(struct ib_device *dev } EXPORT_SYMBOL(ib_modify_port); +/** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index) +{ + union ib_gid tmp_gid; + int ret, port, i, tbl_len; + + for (port = start_port(device); port <= end_port(device); ++port) { + tbl_len = device->gid_tbl_len[ port - start_port(device) ]; + for (i = 0; i < tbl_len; ++i) { + ret = ib_query_gid(device, port, i, &tmp_gid); + if (ret) + goto out; + if (!memcmp(&tmp_gid, gid, sizeof *gid)) { + *port_num = port; + *index = i; + ret = 0; + goto out; + } + } + } + ret = -ENOENT; +out: + return ret; +} +EXPORT_SYMBOL(ib_find_gid); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. 
+ * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. + */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index) +{ + int ret, i, tbl_len; + u16 tmp_pkey; + + tbl_len = device->pkey_tbl_len[ port_num - start_port(device) ]; + for (i = 0; i < tbl_len; ++i) { + ret = ib_query_pkey(device, port_num, i, &tmp_pkey); + if (ret) + goto out; + + if (pkey == tmp_pkey) { + *index = i; + ret = 0; + goto out; + } + } + ret = -ENOENT; + +out: + kfree(tprops); + return ret; +} +EXPORT_SYMBOL(ib_find_pkey); + static int __init ib_core_init(void) { int ret; Index: b/include/rdma/ib_verbs.h =================================================================== --- a/include/rdma/ib_verbs.h 2007-05-08 15:45:45.199210546 +0300 +++ b/include/rdma/ib_verbs.h 2007-05-08 18:48:23.334763770 +0300 @@ -1058,6 +1058,8 @@ struct ib_device { __be64 node_guid; u8 node_type; u8 phys_port_cnt; + int *pkey_tbl_len; + int *gid_tbl_len; }; struct ib_client { @@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev struct ib_port_modify *port_modify); /** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. + * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. 
+ */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index); + +/** * ib_alloc_pd - Allocates an unused protection domain. * @device: The device on which to allocate the protection domain. * From jsquyres at cisco.com Tue May 8 08:50:56 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 8 May 2007 11:50:56 -0400 Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <1178632029.3056.3.camel@stevo-desktop> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <1178578346.30571.183.camel@stevo-desktop> <1178632029.3056.3.camel@stevo-desktop> Message-ID: <73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com> In the "FYI" category... There was discussion about the udpal BTL over OFED today on the weekly developer teleconference (per my earlier post, a user is reporting that it doesn't work). Andrew Friedley is going to work with the Sun developers -- he thinks he might know where the problem is coming from but is in process of physically relocating, and therefore couldn't look at it until late next week at the earliest. Sun may be able to pick up the issue -- but if so, I don't know what their timeframe will be (and it may depend on the severity of the problem). On May 8, 2007, at 9:47 AM, Steve Wise wrote: > On Mon, 2007-05-07 at 20:39 -0400, Jeff Squyres wrote: >> On May 7, 2007, at 6:52 PM, Steve Wise wrote: >> >>> Also, there appears to be a DAPL BTL in OMPI. Is this BTL complete >>> and >>> enabled for the ofed-1.2 udapl library? >> >> Yes, it is complete and is well-tested in Solaris. >> >> It is not well tested in Linux/OFED (we've been concentrating on the >> verbs interface on the OFED side of things -- the "openib" BTL [we >> never renamed it when OpenIB changed names to OpenFabrics]). 
In >> fact, we've had scattered reports of it not working properly in >> Linux/ >> OFED, but those could well have been pilot error (i.e., me not trying >> to run properly -- I know just about zilch about udapl). >> > > The reason I'm asking is twofold: > > 1) this can get OMPI running on iwarp devices today if it works. > > 2) the udapl code can be a model for the rdma-cm piece, since the two > are similary (client / server connection set, ipaddr/port based, > etc)... > > I'll try it out on T3. > > > Steve. -- Jeff Squyres Cisco Systems From sean.hefty at intel.com Tue May 8 08:46:36 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 8 May 2007 08:46:36 -0700 Subject: [ofa-general] RE: memory leak in cm.c? In-Reply-To: <20070508050700.GI22341@mellanox.co.il> Message-ID: <000101c79188$10043740$39d1180a@amr.corp.intel.com> >I applied the following patch to cm.c, and it crashed after >some duplicate reqs where detected. Does this indicate a >memory leak in cm? In this case, the timewait_info structure is freed directly in the cm_req_handler (line 1373). The pointer is just not cleared. - Sean From swise at opengridcomputing.com Tue May 8 10:29:21 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 08 May 2007 12:29:21 -0500 Subject: OMPI over OFA udapl (was Re: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM) In-Reply-To: <1178641872.6064.12.camel@stevo-desktop> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <1178578346.30571.183.camel@stevo-desktop> <1178632029.3056.3.camel@stevo-desktop> <73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com> <1178641872.6064.12.camel@stevo-desktop> Message-ID: <1178645361.6064.35.camel@stevo-desktop> > > Well I've tried OMPI on ofed-1.2 udapl today and it doesn't work. I'm > debugging now. > Here's part of the problem (from ompi/btl/udapl/btl_udapl.c): /* TODO - big bad evil hack! 
*/
/* uDAPL doesn't ever seem to keep track of ports with addresses. This
becomes a problem when we use dat_ep_query() to obtain a remote address
on an endpoint. In this case, both the DAT_PORT_QUAL and the sin_port
field in the DAT_SOCK_ADDR are 0, regardless of the actual port. This is
a problem when we have more than one uDAPL process per IA - these
processes will have exactly the same address, as the port is all
we have to differentiate who is who. Thus, our uDAPL EP -> BTL EP
matching algorithm will break down.

So, we insert the port we used for our PSP into the DAT_SOCK_ADDR for
this IA. uDAPL then conveniently propagates this to where we need it.
*/
((struct sockaddr_in*)attr.ia_address_ptr)->sin_port = htons(port);
((struct sockaddr_in*)&btl->udapl_addr.addr)->sin_port = htons(port);

The OMPI code stuffs the port chosen by udapl for a listening endpoint
into the ia address memory (which is owned by the udapl layer btw).
There's a slight problem with that: The OFA udapl openib_cma code binds
cm_id's to this ia_address regularly. When an hca is opened, a cm_id is
bound to this address to obtain the local hca port number and gid that
is being used. In addition, a cm_id is bound to this address each time
an endpoint is created (either at ep_create time or ep_connect time).
So that ia_address field is used by the dapl cm to create local
cm_ids... Since the port was always zero, the rdma-cma would choose a
unique port for each cm_id bound to that address.

But once OMPI sets the port field to non-zero, the rdma_cma fails all the
subsequent rdma_bind_addr() calls since the port is already in use.

Perhaps this hack really is a workaround for a DAPL bug where somebody's
dapl wasn't tracking port numbers correctly?

I think there are three issues here:

1) OMPI shouldn't be stepping on the ia_address.

2) OFA udapl should probably be explicitly binding local cm_ids to port
zero.

3) dat_ep_query() should be returning the correct port numbers...
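The port behaviour behind these three issues can be tried out with ordinary TCP sockets. This is a rough analogy rather than the rdma_cm or uDAPL code path: it assumes that binding to port 0 lets the kernel pick a free port, and that a second bind to an already-bound non-zero port fails with EADDRINUSE, which is the same shape of failure reported for the rdma_bind_addr() calls. The helpers `bind_any()` and `bound_port()` are made up for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind a new TCP socket to INADDR_ANY:port (port in host byte order).
 * port == 0 asks the kernel to choose a free port.
 * Returns the socket fd on success, or -errno on failure.
 */
static int bind_any(uint16_t port)
{
	struct sockaddr_in addr;
	int fd, err;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -errno;

	memset(&addr, 0, sizeof addr);
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);

	if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}

/* Port (host byte order) the socket is actually bound to, 0 on error. */
static uint16_t bound_port(int fd)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof addr;

	assert(fd >= 0);
	if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
		return 0;
	return ntohs(addr.sin_port);
}
```

Two calls to bind_any(0) should each come back with their own port, while bind_any(bound_port(fd)) on a port that is still held should return -EADDRINUSE. In the analogy, writing a non-zero port into the shared address turns every later bind into the second case.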
I'm going to run a few experiments:

1) remove the OMPI hack and see if things work fine for OFA udapl.
Perhaps OFA udapl correctly tracks ports on endpoints?

2) leave OMPI as-is and change OFA udapl to not assume the ia_addr
sockaddr has a 0 port in it.

Steve.

From sean.hefty at intel.com Tue May 8 10:29:35 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 8 May 2007 10:29:35 -0700
Subject: [ofa-general] RE: man pages for the rdma-cm
In-Reply-To: <1178556576.30571.79.camel@stevo-desktop>
Message-ID: <000201c79196$72e8a680$39d1180a@amr.corp.intel.com>

I updated the man pages in my master branch and pushed the changes out.
Details below.

>- are the events described anywhere? Maybe they should be described in
>rdma_get_cm_event?

done

>- rdma_accept / rdma_connect: describe the conn_param fields.

done

>- rdma_bind_addr: binding to port 0 will cause the rdma-cm to select an
>available port.

added

>- no pages for get_src_port/get_dst_port

not added yet

>- rdma_connect - "connected" and "unconnected" when discussing cm_ids is
>misleading. Perhaps "reliable connection" vs "unreliable datagram"?

I reworked the wording here to clarify that the behavior is based on the
port space associated with the cm_id.

>- rdma_create_event_channel: it would be nice to mention that the fd can
>be used like any other fd (made non blocking, poll()/select()able, etc).

added

>- Also, it might be nice to have some sort of overview man page that
>maps the expected event flows for connection setup and teardown. Maybe
>'man rdmacm' gets you some overview?

I added an rdma_cm man page that gives an overview. I still need to add
references to this man page from the other API man pages, which I'll do
before pushing into OFED.
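As a rough illustration of the "used like any other fd" point, the two helpers below make a descriptor non-blocking and wait for it with poll(). This is a sketch, not text from the man pages: `set_nonblocking()` and `wait_readable()` are made-up names, and the code is written against a generic descriptor. With librdmacm the descriptor would presumably be the `fd` member of the channel returned by rdma_create_event_channel(), with rdma_get_cm_event() called once it is readable.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>

/* Switch fd to non-blocking mode. Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
	int flags;

	assert(fd >= 0);
	flags = fcntl(fd, F_GETFL, 0);
	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Wait up to timeout_ms milliseconds for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
static int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd;
	int n;

	assert(fd >= 0);
	pfd.fd = fd;
	pfd.events = POLLIN;
	pfd.revents = 0;

	n = poll(&pfd, 1, timeout_ms);
	if (n < 0)
		return -1;
	return (n > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

The same pattern lets an event loop watch one such descriptor alongside sockets or pipes without dedicating a thread to a blocking call.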
- Sean From swise at opengridcomputing.com Tue May 8 10:51:30 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 08 May 2007 12:51:30 -0500 Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <463FCA42.3000104@indiana.edu> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com> <463FCA42.3000104@indiana.edu> Message-ID: <1178646690.6064.38.camel@stevo-desktop> > >> Chelsio's gonna pony up the resources to get this work done asap. Do > >> you have any thoughts on how we can collaborate on this project? I'm > >> familiar with mvapich, not ompi, so I need to go do some homework. > >> But > >> any pointers on the connection setup design for ompi would be great. > > > > Excellent! Let's chat on the phone tomorrow -- this would probably > > be the best way to start. > > > > We will need a signed Open MPI 3rd party contribution agreement from > > either you and/or Chelsio (whoever owns the intellectual property > > that will be contributed). See http://www.open-mpi.org/community/ > > contribute/. > > > >> I'm CCing devel at openmpi.org in case anyone else is interested in > >> helping. Chelsio can provide rnic HW... > > > > Anyone else here interested? Free hardware! :-) > > Hmm I'm interested. I've already done some work switching over to RDMA > CM for some research stuff I've been doing; it's not publicly accessible > w/o the 3rd party agreement. I can help answer questions on what > exactly needs to change, and do some testing. > > Andrew I'm working on the 3rd party agreement from chelsio now. Stay tuned! Steve. 
From afriedle at open-mpi.org Tue May 8 10:57:44 2007 From: afriedle at open-mpi.org (Andrew Friedley) Date: Tue, 08 May 2007 13:57:44 -0400 Subject: [OMPI devel] OMPI over OFA udapl (was Re: [ofa-general] OpenMPI and RDMA-CM) In-Reply-To: <1178645361.6064.35.camel@stevo-desktop> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <1178578346.30571.183.camel@stevo-desktop> <1178632029.3056.3.camel@stevo-desktop> <73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com> <1178641872.6064.12.camel@stevo-desktop> <1178645361.6064.35.camel@stevo-desktop> Message-ID: <4640BA18.7060104@open-mpi.org> Steve Wise wrote: >> Well I've tried OMPI on ofed-1.2 udapl today and it doesn't work. I'm >> debugging now. >> > > Here's part of the problem (from ompi/btl/udapl/btl_udapl.c): > > /* TODO - big bad evil hack! */ > /* uDAPL doesn't ever seem to keep track of ports with addresses. This > becomes a problem when we use dat_ep_query() to obtain a remote address > on an endpoint. In this case, both the DAT_PORT_QUAL and the sin_port > field in the DAT_SOCK_ADDR are 0, regardless of the actual port. This is > a problem when we have more than one uDAPL process per IA - these > processes will have exactly the same address, as the port is all > we have to differentiate who is who. Thus, our uDAPL EP -> BTL EP > matching algorithm will break down. > > So, we insert the port we used for our PSP into the DAT_SOCK_ADDR for > this IA. uDAPL then conveniently propagates this to where we need it. > */ > ((struct sockaddr_in*)attr.ia_address_ptr)->sin_port = htons(port); > ((struct sockaddr_in*)&btl->udapl_addr.addr)->sin_port = htons(port); > > The OMPI code stuffs the port chosen by udapl for a listening endpoint > into the ia address memory (which is owned by the udapl layer btw). > There's a slight problem with that: The OFA udapl openib_cma code binds > cm_id's to this ia_address regularly. 
When an hca is opened, a cm_id is > bound to this address to obtain the local hca port number and gid that > is being used. In addition, a cm_id is bound to this address each time > an endpoint is created (either at ep_create time or ep_connect time). > So that ia_address field is used by the dapl cm to create local > cm_ids... Since the port was always zero, the rmda-cma would choose a > unique port for each cm_id bound to that address. > > But OMPI sets a the port field to non-zero, the rdma_cma fails all the > subsequent rdma_bind_addr() calls since the port is already in use. > > Perhaps this hack really is a workaround for a DAPL bug where somebodies > dapl wasn't tracking port numbers correctly? Yep. My memory is dim, but I think that was OFED's DAPL, or it was in the generic part of DAPL that all implementations seem to share. As hinted by the comment (I wrote it by the way), I think the best solution would be if dat_ep_query() returned the port number correctly. Most of uDAPL seems to just pass around pointers to internal data structures (which I'm not sure is the best idea in the world), so it didn't seem like a trivial fix to me at the time. I remember considering reporting this as a bug, but I didn't because the uDAPL standard didn't seem to enforce any requirements on passing the port number around with the address, so it technically wasn't wrong. Was the OFED uDAPL code switched from something else to RDMA CM at some point? I'm almost certain I was running fine on OFED's uDAPL at one point (in fact, a lot of the uDAPL BTL development I did was using the OFED stack). > I'm going to run a few experiments: > > 1) remove the OMPI hack and see if things work fine for OFA udapl. > Perhaps OFA udapl correctly tracks ports on endpoints? Doubt it, but worth trying. > 2) leave OMPI as-is and change OFA udapl to not assume the ia_addr > sockaddr has a 0 port in it. Pretty sure this will work, don't know if it's the correct solution though. 
Andrew From jlentini at netapp.com Tue May 8 11:38:06 2007 From: jlentini at netapp.com (James Lentini) Date: Tue, 8 May 2007 14:38:06 -0400 (EDT) Subject: [ofa-general] [IPoIB][RFC] remove redundant gid query Message-ID: Both ipoib_add_port() and ipoib_mcast_join_task() query the GID at index 0 to setup the ipoib_dev_priv structure's local_gid and the net_device structure's dev_addr. There does not appear to be a way for ipoib_mcast_join_task() to be executed before ipoib_add_port() completes. Therefore, the work done in ipoib_mcast_join_task() appears to be redundant. Signed-off-by: James Lentini --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-05-08 14:34:15.000000000 -0400 +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-05-08 14:34:07.000000000 -0400 @@ -523,11 +523,6 @@ void ipoib_mcast_join_task(struct work_s if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) return; - if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid)) - ipoib_warn(priv, "ib_gid_entry_get() failed\n"); - else - memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); - { struct ib_port_attr attr; From ardavis at ichips.intel.com Tue May 8 12:09:02 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 08 May 2007 12:09:02 -0700 Subject: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work In-Reply-To: <054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com> References: <462E13A6.3030207@lfbs.rwth-aachen.de> <462E1DFE.5010703@Sun.COM> <46305D0A.5020900@lfbs.rwth-aachen.de> <4630EFDE.8070608@Sun.COM> <464044D4.5010501@lfbs.rwth-aachen.de> <054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com> Message-ID: <4640CACE.8070201@ichips.intel.com> Jeff Squyres wrote: > I'm forwarding this to the OpenFabrics general list -- as it just > came up the other day, we know that Open MPI's UDAPL support works on > Solaris, but we have done little/no testing of it on OFED (I > personally know almost nothing about UDPAL). 
> > Can the UDAPL OFED wizards shed any light on the error messages that > are listed below? In particular, these seem to be worrisome: > >> setup_listener Permission denied > > setup_listener Address already in use These failures are from rdma_cm_bind indicating the port is already bound to this IA address. How are you creating the service point? dat_psp_create or dat_psp_create_any? If it is psp_create_any then you will see some failures until it gets to a free port. That is normal. Just make sure your create call returns DAT_SUCCESS. >> create_qp Address already in use > This is a real problem with the bind: the port is already in use. Not sure why this would fail since the current version of OFED uDAPL uses a wildcard port when binding and uses the address from the open; I remember an issue a while back with rdma_cm and wildcard ports. What version of OFED are you using? -arlin From mst at dev.mellanox.co.il Tue May 8 12:14:49 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 May 2007 22:14:49 +0300 Subject: [ofa-general] [PATCH for-2.6.22] IB/mthca: fix RESET to ERR transition Message-ID: <20070508191449.GA10845@mellanox.co.il> According to the IB spec, a QP can be moved from RESET to ERROR state, but mthca firmware does not support this. Work around this by moving the QP to INIT with dummy parameters first. Signed-off-by: Michael S.
Tsirkin -- diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 1c6b63a..dfcb038 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -295,7 +295,7 @@ static int to_mthca_st(int transport) } } -static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, +static void store_attrs(struct mthca_sqp *sqp, const struct ib_qp_attr *attr, int attr_mask) { if (attr_mask & IB_QP_PKEY_INDEX) @@ -327,7 +327,7 @@ static void init_port(struct mthca_dev *dev, int port) mthca_warn(dev, "INIT_IB returned status %02x.\n", status); } -static __be32 get_hw_access_flags(struct mthca_qp *qp, struct ib_qp_attr *attr, +static __be32 get_hw_access_flags(struct mthca_qp *qp, const struct ib_qp_attr *attr, int attr_mask) { u8 dest_rd_atomic; @@ -510,7 +510,7 @@ out: return err; } -static int mthca_path_set(struct mthca_dev *dev, struct ib_ah_attr *ah, +static int mthca_path_set(struct mthca_dev *dev, const struct ib_ah_attr *ah, struct mthca_qp_path *path, u8 port) { path->g_mylmc = ah->src_path_bits & 0x7f; @@ -538,12 +538,12 @@ static int mthca_path_set(struct mthca_dev *dev, struct ib_ah_attr *ah, return 0; } -int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, - struct ib_udata *udata) +static int __mthca_modify_qp(struct ib_qp *ibqp, + const struct ib_qp_attr *attr, int attr_mask, + enum ib_qp_state cur_state, enum ib_qp_state new_state) { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); - enum ib_qp_state cur_state, new_state; struct mthca_mailbox *mailbox; struct mthca_qp_param *qp_param; struct mthca_qp_context *qp_context; @@ -551,60 +551,6 @@ int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, u8 status; int err = -EINVAL; - mutex_lock(&qp->mutex); - - if (attr_mask & IB_QP_CUR_STATE) { - cur_state = attr->cur_qp_state; - } else { - spin_lock_irq(&qp->sq.lock); - spin_lock(&qp->rq.lock); - 
cur_state = qp->state; - spin_unlock(&qp->rq.lock); - spin_unlock_irq(&qp->sq.lock); - } - - new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state; - - if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask)) { - mthca_dbg(dev, "Bad QP transition (transport %d) " - "%d->%d with attr 0x%08x\n", - qp->transport, cur_state, new_state, - attr_mask); - goto out; - } - - if (cur_state == new_state && cur_state == IB_QPS_RESET) { - err = 0; - goto out; - } - - if ((attr_mask & IB_QP_PKEY_INDEX) && - attr->pkey_index >= dev->limits.pkey_table_len) { - mthca_dbg(dev, "P_Key index (%u) too large. max is %d\n", - attr->pkey_index, dev->limits.pkey_table_len-1); - goto out; - } - - if ((attr_mask & IB_QP_PORT) && - (attr->port_num == 0 || attr->port_num > dev->limits.num_ports)) { - mthca_dbg(dev, "Port number (%u) is invalid\n", attr->port_num); - goto out; - } - - if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC && - attr->max_rd_atomic > dev->limits.max_qp_init_rdma) { - mthca_dbg(dev, "Max rdma_atomic as initiator %u too large (max is %d)\n", - attr->max_rd_atomic, dev->limits.max_qp_init_rdma); - goto out; - } - - if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC && - attr->max_dest_rd_atomic > 1 << dev->qp_table.rdb_shift) { - mthca_dbg(dev, "Max rdma_atomic as responder %u too large (max %d)\n", - attr->max_dest_rd_atomic, 1 << dev->qp_table.rdb_shift); - goto out; - } - mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); if (IS_ERR(mailbox)) { err = PTR_ERR(mailbox); @@ -878,7 +824,98 @@ int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, out_mailbox: mthca_free_mailbox(dev, mailbox); +out: + return err; +} + +static const struct ib_qp_attr mthca_qp_attr = { .port_num = 1}; +static const int mthca_qp_attr_mask_table[IB_QPT_UD + 1] = { + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [IB_QPT_RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + 
IB_QP_ACCESS_FLAGS), + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), +}; + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + int err = -EINVAL; + + mutex_lock(&qp->mutex); + if (attr_mask & IB_QP_CUR_STATE) { + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->sq.lock); + spin_lock(&qp->rq.lock); + cur_state = qp->state; + spin_unlock(&qp->rq.lock); + spin_unlock_irq(&qp->sq.lock); + } + + new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state; + + if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask)) { + mthca_dbg(dev, "Bad QP transition (transport %d) " + "%d->%d with attr 0x%08x\n", + qp->transport, cur_state, new_state, + attr_mask); + goto out; + } + + if ((attr_mask & IB_QP_PKEY_INDEX) && + attr->pkey_index >= dev->limits.pkey_table_len) { + mthca_dbg(dev, "P_Key index (%u) too large. 
max is %d\n", + attr->pkey_index, dev->limits.pkey_table_len-1); + goto out; + } + + if ((attr_mask & IB_QP_PORT) && + (attr->port_num == 0 || attr->port_num > dev->limits.num_ports)) { + mthca_dbg(dev, "Port number (%u) is invalid\n", attr->port_num); + goto out; + } + + if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC && + attr->max_rd_atomic > dev->limits.max_qp_init_rdma) { + mthca_dbg(dev, "Max rdma_atomic as initiator %u too large (max is %d)\n", + attr->max_rd_atomic, dev->limits.max_qp_init_rdma); + goto out; + } + + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC && + attr->max_dest_rd_atomic > 1 << dev->qp_table.rdb_shift) { + mthca_dbg(dev, "Max rdma_atomic as responder %u too large (max %d)\n", + attr->max_dest_rd_atomic, 1 << dev->qp_table.rdb_shift); + goto out; + } + + if (cur_state == new_state && cur_state == IB_QPS_RESET) { + err = 0; + goto out; + } + + if (cur_state == IB_QPS_RESET && new_state == IB_QPS_ERR) { + err = __mthca_modify_qp(ibqp, &mthca_qp_attr, + mthca_qp_attr_mask_table[ibqp->qp_type], + IB_QPS_RESET, IB_QPS_INIT); + if (err) + goto out; + cur_state = IB_QPS_INIT; + } + err = __mthca_modify_qp(ibqp, attr, attr_mask, cur_state, new_state); out: mutex_unlock(&qp->mutex); return err; -- MST From ardavis at ichips.intel.com Tue May 8 12:37:25 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 08 May 2007 12:37:25 -0700 Subject: [OMPI devel] OMPI over OFA udapl (was Re: [ofa-general] OpenMPI and RDMA-CM) In-Reply-To: <4640BA18.7060104@open-mpi.org> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <1178578346.30571.183.camel@stevo-desktop> <1178632029.3056.3.camel@stevo-desktop> <73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com> <1178641872.6064.12.camel@stevo-desktop> <1178645361.6064.35.camel@stevo-desktop> <4640BA18.7060104@open-mpi.org> Message-ID: <4640D175.2080207@ichips.intel.com> Andrew Friedley wrote: > > Yep. 
My memory is dim, but I think that was OFED's DAPL, or it was in > the generic part of DAPL that all implementations seem to share. > > As hinted by the comment (I wrote it by the way), I think the best > solution would be if dat_ep_query() returned the port number > correctly. Most of uDAPL seems to just pass around pointers to > internal data structures (which I'm not sure is the best idea in the > world), so it didn't seem like a trivial fix to me at the time. I > remember considering reporting this as a bug, but I didn't because the > uDAPL standard didn't seem to enforce any requirements on passing the > port number around with the address, so it technically wasn't wrong. I tend to agree. The common code should query the actual provider to get the local address for the EP and not assume it is the IA address from the HCA used during the open. They are after all different bindings. I will take a look at the code. > > Was the OFED uDAPL code switched from something else to RDMA CM at > some point? I'm almost certain I was running fine on OFED's uDAPL at > one point (in fact, a lot of the uDAPL BTL development I did was using > the OFED stack). We had several iterations while waiting for the rdma_cm code to become available. I am guessing that you were using one of the early versions that used sockets to set up the QPs.
-arlin From swise at opengridcomputing.com Tue May 8 12:52:59 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 08 May 2007 14:52:59 -0500 Subject: [OMPI devel] OMPI over OFA udapl (was Re: [ofa-general] OpenMPI and RDMA-CM) In-Reply-To: <4640BA18.7060104@open-mpi.org> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <1178578346.30571.183.camel@stevo-desktop> <1178632029.3056.3.camel@stevo-desktop> <73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com> <1178641872.6064.12.camel@stevo-desktop> <1178645361.6064.35.camel@stevo-desktop> <4640BA18.7060104@open-mpi.org> Message-ID: <1178653979.11455.4.camel@stevo-desktop> On Tue, 2007-05-08 at 13:57 -0400, Andrew Friedley wrote: > Steve Wise wrote: > >> Well I've tried OMPI on ofed-1.2 udapl today and it doesn't work. I'm > >> debugging now. > >> > > > > Here's part of the problem (from ompi/btl/udapl/btl_udapl.c): > > > > /* TODO - big bad evil hack! */ > > /* uDAPL doesn't ever seem to keep track of ports with addresses. This > > becomes a problem when we use dat_ep_query() to obtain a remote address > > on an endpoint. In this case, both the DAT_PORT_QUAL and the sin_port > > field in the DAT_SOCK_ADDR are 0, regardless of the actual port. This is > > a problem when we have more than one uDAPL process per IA - these > > processes will have exactly the same address, as the port is all > > we have to differentiate who is who. Thus, our uDAPL EP -> BTL EP > > matching algorithm will break down. > > > > So, we insert the port we used for our PSP into the DAT_SOCK_ADDR for > > this IA. uDAPL then conveniently propagates this to where we need it. 
> > */ > > ((struct sockaddr_in*)attr.ia_address_ptr)->sin_port = htons(port); > > ((struct sockaddr_in*)&btl->udapl_addr.addr)->sin_port = htons(port); > > > > The OMPI code stuffs the port chosen by udapl for a listening endpoint > > into the ia address memory (which is owned by the udapl layer btw). > > There's a slight problem with that: The OFA udapl openib_cma code binds > > cm_id's to this ia_address regularly. When an hca is opened, a cm_id is > > bound to this address to obtain the local hca port number and gid that > > is being used. In addition, a cm_id is bound to this address each time > > an endpoint is created (either at ep_create time or ep_connect time). > > So that ia_address field is used by the dapl cm to create local > > cm_ids... Since the port was always zero, the rmda-cma would choose a > > unique port for each cm_id bound to that address. > > > > But OMPI sets a the port field to non-zero, the rdma_cma fails all the > > subsequent rdma_bind_addr() calls since the port is already in use. > > > > Perhaps this hack really is a workaround for a DAPL bug where somebodies > > dapl wasn't tracking port numbers correctly? > > Yep. My memory is dim, but I think that was OFED's DAPL, or it was in > the generic part of DAPL that all implementations seem to share. > > As hinted by the comment (I wrote it by the way), I think the best > solution would be if dat_ep_query() returned the port number correctly. > Most of uDAPL seems to just pass around pointers to internal data > structures (which I'm not sure is the best idea in the world), so it > didn't seem like a trivial fix to me at the time. I remember > considering reporting this as a bug, but I didn't because the uDAPL > standard didn't seem to enforce any requirements on passing the port > number around with the address, so it technically wasn't wrong. > > Was the OFED uDAPL code switched from something else to RDMA CM at some > point? 
I'm almost certain I was running fine on OFED's uDAPL at one > point (in fact, a lot of the uDAPL BTL development I did was using the > OFED stack). Yes, the OFA uDAPL was changed from using the ib-cm to the rdma-cm a while back. Perhaps you ran on the ib-cm version? And, the rdma-cma started using port numbers and enforcing uniqueness even more recently I think. Perhaps Don Kerr has some insight on how the Sun uDAPL behaves? Should OMPI still need this hack? Steve. From swise at opengridcomputing.com Tue May 8 12:55:24 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 08 May 2007 14:55:24 -0500 Subject: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work In-Reply-To: <4640CACE.8070201@ichips.intel.com> References: <462E13A6.3030207@lfbs.rwth-aachen.de> <462E1DFE.5010703@Sun.COM> <46305D0A.5020900@lfbs.rwth-aachen.de> <4630EFDE.8070608@Sun.COM> <464044D4.5010501@lfbs.rwth-aachen.de> <054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com> <4640CACE.8070201@ichips.intel.com> Message-ID: <1178654124.11455.6.camel@stevo-desktop> BTW: We have 2 threads on this topic. See my other emails describing the issue... On Tue, 2007-05-08 at 12:09 -0700, Arlin Davis wrote: > Jeff Squyres wrote: > > > I'm forwarding this to the OpenFabrics general list -- as it just > > came up the other day, we know that Open MPI's UDAPL support works on > > Solaris, but we have done little/no testing of it on OFED (I > > personally know almost nothing about UDPAL). > > > > Can the UDAPL OFED wizards shed any light on the error messages that > > are listed below? In particular, these seem to be worrysome: > > > >> setup_listener Permission denied > > > > setup_listener Address already in use > > These failures are from rdma_cm_bind indicating the port is already > bound to this IA address. How are you creating the service point? > dat_psp_create or dat_psp_create_any? If it is psp_create_any then you > will see some failures until it gets to a free port. That is normal. 
> Just make sure your create call returns DAT_SUCCESS. > > >> create_qp Address already in use > > > This is a real problem with the bind, port is already in use. Not sure > why this would fail since the current version of OFED uDAPL uses a > wildcard port when binding and uses the address from the open; I > remember an issue a while back with rdma_cm and wildcard ports. What > version of OFED are you using? > > -arlin > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ardavis at ichips.intel.com Tue May 8 12:55:39 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 08 May 2007 12:55:39 -0700 Subject: OMPI over OFA udapl (was Re: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM) In-Reply-To: <1178645361.6064.35.camel@stevo-desktop> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <1178578346.30571.183.camel@stevo-desktop> <1178632029.3056.3.camel@stevo-desktop> <73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com> <1178641872.6064.12.camel@stevo-desktop> <1178645361.6064.35.camel@stevo-desktop> Message-ID: <4640D5BB.8060104@ichips.intel.com> Steve Wise wrote: >1) OMPI shouldn't be stepping on the ia_address. > > strongly agree >2) OFA udapl should probably be explicitly binding local cm_ids to port >zero. > > current implementation uses port zero on ep_create and ia_open. >3) dat_ep_query() should be returning the correct port numbers... > > agree. I also don't like that the common code hands out pointers to internal structures... I will work on a patch that will ensure compatibility with other providers but allow the openib_cma provider to return the port on the ep_query.
-arlin From swise at opengridcomputing.com Tue May 8 12:58:08 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 08 May 2007 14:58:08 -0500 Subject: OMPI over OFA udapl (was Re: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM) In-Reply-To: <4640D5BB.8060104@ichips.intel.com> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <1178578346.30571.183.camel@stevo-desktop> <1178632029.3056.3.camel@stevo-desktop> <73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com> <1178641872.6064.12.camel@stevo-desktop> <1178645361.6064.35.camel@stevo-desktop> <4640D5BB.8060104@ichips.intel.com> Message-ID: <1178654288.11455.8.camel@stevo-desktop> On Tue, 2007-05-08 at 12:55 -0700, Arlin Davis wrote: > Steve Wise wrote: > > >1) OMPI shouldn't be stepping on the ia_address. > > > > > strongly agree > > >2) OFA udapl should probably be explicitly binding local cm_ids to port > >zero. > > > > > current implementation uses port zero on ep_create and ia_open. > > >3) dat_ep_query() should be returning the correct port numbers... > > > > > agree. I also don't like that the common code hands out pointers to internal > structures... > > I will work on a patch that will ensure compatibility with other > providers but allow the openib_cma provider to return the port on the > ep_query. > > -arlin Cool! I'll test this over iWARP when you have something... From jsquyres at cisco.com Tue May 8 12:34:12 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 8 May 2007 15:34:12 -0400 Subject: Fwd: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work References: <4640CACE.8070201@ichips.intel.com> Message-ID: Re-forwarding to OMPI list; because of the OMPI list anti-spam checks, Arlin's post didn't make it through to our list when he originally posted.
Begin forwarded message: > From: Arlin Davis > Date: May 8, 2007 3:09:02 PM EDT > To: Jeff Squyres > Cc: Open MPI Users , OpenFabrics General > > Subject: Re: [ofa-general] Re: [OMPI users] openMPI over uDAPL > doesn't work > > Jeff Squyres wrote: > >> I'm forwarding this to the OpenFabrics general list -- as it just >> came up the other day, we know that Open MPI's UDAPL support works >> on Solaris, but we have done little/no testing of it on OFED (I >> personally know almost nothing about UDPAL). >> >> Can the UDAPL OFED wizards shed any light on the error messages >> that are listed below? In particular, these seem to be worrysome: >> >>> setup_listener Permission denied >> >> setup_listener Address already in use > > These failures are from rdma_cm_bind indicating the port is already > bound to this IA address. How are you creating the service point? > dat_psp_create or dat_psp_create_any? If it is psp_create_any then > you will see some failures until it gets to a free port. That is > normal. Just make sure your create call returns DAT_SUCCESS. > >>> create_qp Address already in use >> > This is a real problem with the bind, port is already in use. Not > sure why this would fail since the current version of OFED uDAPL > uses a wildcard port when binding and uses the address from the > open; I remember an issue a while back with rdma_cm and wildcard > ports. What version of OFED are you using? 
> > -arlin -- Jeff Squyres Cisco Systems From ggrundstrom at NetEffect.com Tue May 8 13:14:39 2007 From: ggrundstrom at NetEffect.com (Glenn Grundstrom) Date: Tue, 8 May 2007 15:14:39 -0500 Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <1178646690.6064.38.camel@stevo-desktop> References: <1177791386.4615.8.camel@stevo-laptop><98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com><1178575761.30571.175.camel@stevo-desktop><95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com><463FCA42.3000104@indiana.edu> <1178646690.6064.38.camel@stevo-desktop> Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC06F967F9@venom2> Steve/Andrew, It sounds like you've got a handle on the development side. I'd be willing to provide additional testing resources. Let me know how I can help test. I'll also check on the 3rd party agreement. Glenn. -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Steve Wise Sent: Tuesday, May 08, 2007 12:52 PM To: Andrew Friedley Cc: Devel at openmpi.org; Open MPI Developers; general; Asgeir Eiriksson Subject: Re: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM > >> Chelsio's gonna pony up the resources to get this work done asap. > >> Do you have any thoughts on how we can collaborate on this project? I'm > >> familiar with mvapich, not ompi, so I need to go do some homework. > >> But > >> any pointers on the connection setup design for ompi would be great. > > > > Excellent! Let's chat on the phone tomorrow -- this would probably > > be the best way to start. > > > > We will need a signed Open MPI 3rd party contribution agreement from > > either you and/or Chelsio (whoever owns the intellectual property > > that will be contributed). See http://www.open-mpi.org/community/ > > contribute/. > > > >> I'm CCing devel at openmpi.org in case anyone else is interested in > >> helping. Chelsio can provide rnic HW... > > > > Anyone else here interested? Free hardware! 
:-) > > Hmm I'm interested. I've already done some work switching over to > RDMA CM for some research stuff I've been doing; it's not publicly > accessible w/o the 3rd party agreement. I can help answer questions > on what exactly needs to change, and do some testing. > > Andrew I'm working on the 3rd party agreement from Chelsio now. Stay tuned! Steve. From swise at opengridcomputing.com Tue May 8 13:15:53 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 08 May 2007 15:15:53 -0500 Subject: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work In-Reply-To: <4640CACE.8070201@ichips.intel.com> References: <462E13A6.3030207@lfbs.rwth-aachen.de> <462E1DFE.5010703@Sun.COM> <46305D0A.5020900@lfbs.rwth-aachen.de> <4630EFDE.8070608@Sun.COM> <464044D4.5010501@lfbs.rwth-aachen.de> <054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com> <4640CACE.8070201@ichips.intel.com> Message-ID: <1178655353.11455.14.camel@stevo-desktop> > > Can the UDAPL OFED wizards shed any light on the error messages that > > are listed below? In particular, these seem to be worrisome: > > > >> setup_listener Permission denied > > > > setup_listener Address already in use > > These failures are from rdma_cm_bind indicating the port is already > bound to this IA address. How are you creating the service point? > dat_psp_create or dat_psp_create_any? If it is psp_create_any then you > will see some failures until it gets to a free port. That is normal. > Just make sure your create call returns DAT_SUCCESS. > Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down and let the rdma-cma pick an available port number?
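
The wildcard-port behaviour discussed in this thread can be sketched with ordinary TCP sockets. This is only an analogy and an assumption on my part: it uses plain bind() on the loopback address, not rdma_bind_addr() or any uDAPL call, and the helper names (bind_loopback, bound_port, wildcard_vs_fixed_port_demo) are hypothetical. It shows that a bind to port 0 gets a distinct port chosen for it each time, while a later bind to an explicit port that is already taken fails with EADDRINUSE, which is roughly what seems to happen once a non-zero port has been written into the shared IA address.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: bind a new TCP socket to 127.0.0.1:port.
 * port is in host byte order; 0 asks the kernel to pick a free port.
 * Returns the socket fd, or -1 with errno set by socket() or bind(). */
static int bind_loopback(unsigned short port)
{
    struct sockaddr_in sa;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        int saved = errno;      /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Hypothetical helper: port the socket is bound to (host byte order),
 * or 0 if getsockname() fails. */
static unsigned short bound_port(int fd)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);

    if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
        return 0;
    return ntohs(sa.sin_port);
}

/* Returns 0 when both observations hold:
 *  - two wildcard (port 0) binds end up on different ports;
 *  - an explicit bind to one of those ports fails with EADDRINUSE.
 * Returns -1 otherwise. */
int wildcard_vs_fixed_port_demo(void)
{
    int a = bind_loopback(0);
    int b = bind_loopback(0);
    unsigned short pa, pb;
    int c, rc = -1;

    if (a < 0 || b < 0)
        goto out;

    pa = bound_port(a);
    pb = bound_port(b);
    if (pa == 0 || pb == 0 || pa == pb)
        goto out;

    /* Reusing a's port explicitly is the analogue of stuffing a
     * non-zero port into the address that later binds reuse. */
    c = bind_loopback(pa);
    if (c >= 0) {
        close(c);               /* unexpected: the bind succeeded */
        goto out;
    }
    if (errno == EADDRINUSE)
        rc = 0;

out:
    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return rc;
}
```

Calling wildcard_vs_fixed_port_demo() from a small main() and checking for a zero return is enough to try it. SO_REUSEADDR is deliberately left unset, since setting it would change the conflict rules.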
From Don.Kerr at Sun.COM Tue May 8 13:21:18 2007
From: Don.Kerr at Sun.COM (Donald Kerr)
Date: Tue, 08 May 2007 16:21:18 -0400
Subject: [OMPI devel] OMPI over OFA udapl (was Re: [ofa-general] OpenMPI and RDMA-CM)
In-Reply-To: <1178653979.11455.4.camel@stevo-desktop>
References: <1177791386.4615.8.camel@stevo-laptop>
 <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
 <1178575761.30571.175.camel@stevo-desktop>
 <1178578346.30571.183.camel@stevo-desktop>
 <1178632029.3056.3.camel@stevo-desktop>
 <73CEA9F2-CDBD-4137-91E3-5CC13FE64DF9@cisco.com>
 <1178641872.6064.12.camel@stevo-desktop>
 <1178645361.6064.35.camel@stevo-desktop>
 <4640BA18.7060104@open-mpi.org>
 <1178653979.11455.4.camel@stevo-desktop>
Message-ID: <4640DBBE.5000601@Sun.COM>

Steve Wise wrote:

>On Tue, 2007-05-08 at 13:57 -0400, Andrew Friedley wrote:
>
>>Steve Wise wrote:
>>
>>>>Well I've tried OMPI on ofed-1.2 udapl today and it doesn't work. I'm
>>>>debugging now.
>>>>
>>>Here's part of the problem (from ompi/btl/udapl/btl_udapl.c):
>>>
>>>    /* TODO - big bad evil hack! */
>>>    /* uDAPL doesn't ever seem to keep track of ports with addresses. This
>>>       becomes a problem when we use dat_ep_query() to obtain a remote address
>>>       on an endpoint. In this case, both the DAT_PORT_QUAL and the sin_port
>>>       field in the DAT_SOCK_ADDR are 0, regardless of the actual port. This is
>>>       a problem when we have more than one uDAPL process per IA - these
>>>       processes will have exactly the same address, as the port is all
>>>       we have to differentiate who is who. Thus, our uDAPL EP -> BTL EP
>>>       matching algorithm will break down.
>>>
>>>       So, we insert the port we used for our PSP into the DAT_SOCK_ADDR for
>>>       this IA. uDAPL then conveniently propagates this to where we need it.
>>>     */
>>>    ((struct sockaddr_in*)attr.ia_address_ptr)->sin_port = htons(port);
>>>    ((struct sockaddr_in*)&btl->udapl_addr.addr)->sin_port = htons(port);
>>>
>>>The OMPI code stuffs the port chosen by udapl for a listening endpoint
>>>into the ia address memory (which is owned by the udapl layer btw).
>>>There's a slight problem with that: The OFA udapl openib_cma code binds
>>>cm_id's to this ia_address regularly. When an hca is opened, a cm_id is
>>>bound to this address to obtain the local hca port number and gid that
>>>is being used. In addition, a cm_id is bound to this address each time
>>>an endpoint is created (either at ep_create time or ep_connect time).
>>>So that ia_address field is used by the dapl cm to create local
>>>cm_ids... Since the port was always zero, the rdma-cma would choose a
>>>unique port for each cm_id bound to that address.
>>>
>>>But once OMPI sets the port field to non-zero, the rdma_cma fails all the
>>>subsequent rdma_bind_addr() calls since the port is already in use.
>>>
>>>Perhaps this hack really is a workaround for a DAPL bug where somebody's
>>>dapl wasn't tracking port numbers correctly?
>>>
>>Yep. My memory is dim, but I think that was OFED's DAPL, or it was in
>>the generic part of DAPL that all implementations seem to share.
>>
>>As hinted by the comment (I wrote it by the way), I think the best
>>solution would be if dat_ep_query() returned the port number correctly.
>>Most of uDAPL seems to just pass around pointers to internal data
>>structures (which I'm not sure is the best idea in the world), so it
>>didn't seem like a trivial fix to me at the time. I remember
>>considering reporting this as a bug, but I didn't because the uDAPL
>>standard didn't seem to enforce any requirements on passing the port
>>number around with the address, so it technically wasn't wrong.
>>
>>Was the OFED uDAPL code switched from something else to RDMA CM at some
>>point? I'm almost certain I was running fine on OFED's uDAPL at one
>>point (in fact, a lot of the uDAPL BTL development I did was using the
>>OFED stack).
>>
>
>Yes, the OFA uDAPL was changed from using the ib-cm to the rdma-cm a
>while back. Perhaps you ran on the ib-cm version? And, the rdma-cma
>started using port numbers and enforcing uniqueness even more recently I
>think.
>
>Perhaps Don Kerr has some insight on how the Sun uDAPL behaves? Should
>OMPI still need this hack?
>

From what I recall (and Andrew can probably set me straight if I get
this wrong), this hack was included because we were not able to pull
the remote port from dat_ep_query. If dat_ep_query supplies that data
then we could probably do away with the hack.

I have not heard back from the developer at Sun who implemented uDAPL
for Solaris. My thought is that it was also based on the older ib-cm,
but I will confirm. I submitted a bug against Solaris uDAPL to provide
the port via dat_ep_query a while back and it looks like it has been
fixed; I just have not tested this because we weren't using it.

-DON

>
>Steve.
>

From arthur.jones at qlogic.com Tue May 8 13:25:58 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 08 May 2007 13:25:58 -0700
Subject: [ofa-general] [PATCH] IB/ipath -- shadow the gpio_mask register
Message-ID: <20070508202557.27647.47035.stgit@bauxite.internal.keyresearch.com>

GPIO interrupts which have the gpio_mask bits set are
no longer unlikely. Remove the unlikely annotation in
the interrupt handler and keep a shadow copy of the
gpio_mask register.

Signed-off-by: Arthur Jones --- drivers/infiniband/hw/ipath/ipath_iba6120.c | 7 +++---- drivers/infiniband/hw/ipath/ipath_intr.c | 7 +++---- drivers/infiniband/hw/ipath/ipath_kernel.h | 2 ++ drivers/infiniband/hw/ipath/ipath_verbs.c | 12 ++++++------ 4 files changed, 14 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c index fb58154..c21d99b 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c @@ -747,7 +747,6 @@ static void ipath_pe_quiet_serdes(struct ipath_devdata *dd) static int ipath_pe_intconfig(struct ipath_devdata *dd) { - u64 val; u32 chiprev; /* @@ -760,9 +759,9 @@ static int ipath_pe_intconfig(struct ipath_devdata *dd) if ((chiprev & INFINIPATH_R_CHIPREVMINOR_MASK) > 1) { /* Rev2+ reports extra errors via internal GPIO pins */ dd->ipath_flags |= IPATH_GPIO_ERRINTRS; - val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask); - val |= IPATH_GPIO_ERRINTR_MASK; - ipath_write_kreg( dd, dd->ipath_kregs->kr_gpio_mask, val); + dd->ipath_gpio_mask |= IPATH_GPIO_ERRINTR_MASK; + ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, + dd->ipath_gpio_mask); } return 0; } diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 45d0331..a90d3b5 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -1056,7 +1056,7 @@ irqreturn_t ipath_intr(int irq, void *data) gpiostatus &= ~(1 << IPATH_GPIO_PORT0_BIT); chk0rcv = 1; } - if (unlikely(gpiostatus)) { + if (gpiostatus) { /* * Some unexpected bits remain. If they could have * caused the interrupt, complain and clear. @@ -1065,9 +1065,8 @@ irqreturn_t ipath_intr(int irq, void *data) * GPIO interrupts, possibly on a "three strikes" * basis. 
*/ - u32 mask; - mask = ipath_read_kreg32( - dd, dd->ipath_kregs->kr_gpio_mask); + const u32 mask = (u32) dd->ipath_gpio_mask; + if (mask & gpiostatus) { ipath_dbg("Unexpected GPIO IRQ bits %x\n", gpiostatus & mask); diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index e900c25..12194f3 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -397,6 +397,8 @@ struct ipath_devdata { unsigned long ipath_pioavailshadow[8]; /* shadow of kr_gpio_out, for rmw ops */ u64 ipath_gpio_out; + /* shadow the gpio mask register */ + u64 ipath_gpio_mask; /* kr_revision shadow */ u64 ipath_revision; /* diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 12933e7..bb70845 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -1387,13 +1387,12 @@ static int enable_timer(struct ipath_devdata *dd) * processing. 
*/ if (dd->ipath_flags & IPATH_GPIO_INTR) { - u64 val; ipath_write_kreg(dd, dd->ipath_kregs->kr_debugportselect, 0x2074076542310ULL); /* Enable GPIO bit 2 interrupt */ - val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask); - val |= (u64) (1 << IPATH_GPIO_PORT0_BIT); - ipath_write_kreg( dd, dd->ipath_kregs->kr_gpio_mask, val); + dd->ipath_gpio_mask |= (u64) (1 << IPATH_GPIO_PORT0_BIT); + ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, + dd->ipath_gpio_mask); } init_timer(&dd->verbs_timer); @@ -1412,8 +1411,9 @@ static int disable_timer(struct ipath_devdata *dd) u64 val; /* Disable GPIO bit 2 interrupt */ val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask); - val &= ~((u64) (1 << IPATH_GPIO_PORT0_BIT)); - ipath_write_kreg( dd, dd->ipath_kregs->kr_gpio_mask, val); + dd->ipath_gpio_mask &= ~((u64) (1 << IPATH_GPIO_PORT0_BIT)); + ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, + dd->ipath_gpio_mask); /* * We might want to undo changes to debugportselect, * but how? From mst at dev.mellanox.co.il Tue May 8 13:28:36 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 May 2007 23:28:36 +0300 Subject: [ofa-general] Re: [PATCHv3 1/2] ipoib: handle pkey change events In-Reply-To: <4640A8BD.4000405@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> Message-ID: <20070508202836.GG10845@mellanox.co.il> > Quoting Yosef Etigin : > Subject: [PATCHv3 1/2] ipoib: handle pkey change events This should have been 1 of 2, is that right? -- MST From mst at dev.mellanox.co.il Tue May 8 13:33:18 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin)
Date: Tue, 8 May 2007 23:33:18 +0300
Subject: [ofa-general] Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries
In-Reply-To: <4640A911.8000609@voltaire.com>
References: <4640812C.6060003@voltaire.com>
 <46408336.8080908@voltaire.com>
 <20070508150907.GU21591@mellanox.co.il>
 <46409504.9000802@voltaire.com>
 <20070508152650.GA5845@mellanox.co.il>
 <4640A911.8000609@voltaire.com>
Message-ID: <20070508203318.GH10845@mellanox.co.il>

Don't put whitespace after [ and before ].

+ device->pkey_tbl_len[ port_index ] = tprops->pkey_tbl_len;
+ device->gid_tbl_len[ port_index ] = tprops->gid_tbl_len;

whitespace damage here

+ tbl_len = device->gid_tbl_len[ port - start_port(device) ];

and here

+ tbl_len = device->pkey_tbl_len[ port_num - start_port(device) ];

and here

Have you read the boring list of rules?
http://git.openfabrics.org/~mst/boring.txt

--
MST

From tziporet at mellanox.co.il Tue May 8 13:45:58 2007
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 8 May 2007 23:45:58 +0300
Subject: [ofa-general] OFED 1.2 RC3 delayed
Message-ID: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>

Hi All,
In the OFED meeting yesterday we decided that OFED 1.2 RC3 will be out
once bugs 420 and 465 are resolved.
The tentative date is Thursday, May 10.

If these bugs are not fixed this week, we will have to reconsider
this decision next week.

Tziporet Koren
Software Director
Mellanox Technologies
mailto: tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380

From swise at opengridcomputing.com Tue May 8 13:56:05 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 08 May 2007 15:56:05 -0500
Subject: [ofa-general] OFED 1.2 RC3 delayed
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
Message-ID: <1178657765.11455.32.camel@stevo-desktop>

I would like the group to consider including the changes needed to OMPI
and/or ofa udapl to get OMPI working again on udapl for ofed-1.2. This
will provide OMPI support over iwarp devices via udapl until we can get
rdma-cm support added to OMPI.

Steve.

On Tue, 2007-05-08 at 23:45 +0300, Tziporet Koren wrote:
> Hi All,
> In the OFED meeting yesterday we decided that OFED 1.2 RC3 will be out
> once bugs 420 and 465 are resolved.
> The tentative date is Thursday, May 10.
>
> If these bugs are not fixed this week, we will have to reconsider
> this decision next week.
>
> Tziporet Koren
> Software Director
> Mellanox Technologies
> mailto: tziporet at mellanox.co.il
> Tel +972-4-9097200, ext 380

From ardavis at ichips.intel.com Tue May 8 13:56:46 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 08 May 2007 13:56:46 -0700
Subject: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work
In-Reply-To: <1178655353.11455.14.camel@stevo-desktop>
References: <462E13A6.3030207@lfbs.rwth-aachen.de>
 <462E1DFE.5010703@Sun.COM>
 <46305D0A.5020900@lfbs.rwth-aachen.de>
 <4630EFDE.8070608@Sun.COM>
 <464044D4.5010501@lfbs.rwth-aachen.de>
 <054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com>
 <4640CACE.8070201@ichips.intel.com>
 <1178655353.11455.14.camel@stevo-desktop>
Message-ID: <4640E40E.6000803@ichips.intel.com>

Steve Wise wrote:

>>>Can the UDAPL OFED wizards shed any light on the error messages that
>>>are listed below? In particular, these seem to be worrisome:
>>>
>>>> setup_listener Permission denied
>>>>
>>> setup_listener Address already in use
>>>
>>These failures are from rdma_cm_bind indicating the port is already
>>bound to this IA address. How are you creating the service point?
>>dat_psp_create or dat_psp_create_any? If it is psp_create_any then you
>>will see some failures until it gets to a free port. That is normal.
>>Just make sure your create call returns DAT_SUCCESS.
>>
>
>Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down
>and let the rdma-cma pick an available port number?
>

That would work fine if the provider interface allowed the port to be
returned. I will take a look and see if we can improve on this common
code seeding method.

From tom.mitchell at qlogic.com Tue May 8 14:31:03 2007
From: tom.mitchell at qlogic.com (Tom Mitchell)
Date: Tue, 8 May 2007 14:31:03 -0700
Subject: [ofa-general] Re: Incorrect atomic usage in ipath driver
In-Reply-To: <1178594516.14928.62.camel@localhost.localdomain>
References: <1178594516.14928.62.camel@localhost.localdomain>
Message-ID: <20070508213103.GC19539@pathscale.com>

Thank you for the feedback.

Part of this code path is necessary for an early revision of the
hardware. It may be important on ppc. As it is now, it is a
don't-care on x86_64.

The responsible engineers here have seen this and will investigate
further.

On May 08 01:21, Benjamin Herrenschmidt wrote:
> Hi !
>
> So I see this construct:
>
>         /* There is already a thread processing this queue. */
>         if (test_and_set_bit(0, &dd->ipath_rcv_pending))
>                 goto bail;
>
>         .../...
>
> done:
>         clear_bit(0, &dd->ipath_rcv_pending);
>         smp_mb__after_clear_bit();
>
> So that's basically an attempt at doing a spinlock. The problem is your
> barrier is wrong at the end. Better would be:
>
> done:
>         smp_mb__before_clear_bit();
>         clear_bit(0, &dd->ipath_rcv_pending);
>
> Though it's still less optimal than doing:
>
>         if (!spin_trylock(...))
>                 goto bail;
>
>         .../...
>
> done:
>         spin_unlock(...)
>
> If you really want to stick to bitops, then you may want to look at
> Nick's upcoming patches adding some bitops with appropriate lock
> semantics.
>
> Cheers,
> Ben.
>

--
T o m M i t c h e l l
Host Solutions Group
QLogic Corp.
http://www.qlogic.com

From stan.smith at intel.com Tue May 8 16:28:29 2007
From: stan.smith at intel.com (Smith, Stan)
Date: Tue, 8 May 2007 16:28:29 -0700
Subject: [ofa-general] WinOF 1.0 RC1 available for testing
Message-ID: <55CE0347B98FCA468923E5FBC25CB4DCE8AC77@orsmsx413.amr.corp.intel.com>

For those who have an interest in Windows, otherwise flames > /dev/null

Please find WinOF 1.0 RC1 (Windows OpenFabrics Release Candidate #1)
'WinOF_1-0_RC1.zip' at http://www.openfabrics.org/~woody/WinOF_1.0/

Suggestions can be directed towards 'ofw at lists.openfabrics.org'.

From arlin.r.davis at intel.com Tue May 8 16:44:56 2007
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Tue, 8 May 2007 16:44:56 -0700
Subject: [ofa-general] RE: OMPI over OFA udapl [PATCH]
In-Reply-To: <1178654288.11455.8.camel@stevo-desktop>
Message-ID: <000001c791ca$e2853330$4297070a@amr.corp.intel.com>

>-----Original Message-----
>From: Steve Wise [mailto:swise at opengridcomputing.com]
>
>Cool! I'll test this over iWARP when you have something...

Steve,

Can you try this patch? I also included a change to dtest to query.

Make sure you have the latest librdmacm fixes. There was a late-breaking
fix that just went in that overwrote the port during the rdma_bind_addr
call.

Signed-off by: Arlin Davis ardavis at ichips.intel.com diff --git a/dapl/openib_cma/dapl_ib_cm.c b/dapl/openib_cma/dapl_ib_cm.c index 8bdd0eb..4639f87 100755 --- a/dapl/openib_cma/dapl_ib_cm.c +++ b/dapl/openib_cma/dapl_ib_cm.c @@ -891,6 +891,9 @@ dapls_ib_accept_connection(IN DAT_CR_HANDLE cr_handle, goto bail; } + /* save remote port for ep query */ + ep_ptr->param.remote_port_qual = rdma_get_dst_port(cr_conn->cm_id); + return DAT_SUCCESS; bail: rdma_reject(cr_conn->cm_id, NULL, 0); diff --git a/dapl/openib_cma/dapl_ib_qp.c b/dapl/openib_cma/dapl_ib_qp.c old mode 100644 new mode 100755 index f1e1671..69c49a9 --- a/dapl/openib_cma/dapl_ib_qp.c +++ b/dapl/openib_cma/dapl_ib_qp.c @@ -179,14 +179,19 @@ DAT_RETURN dapls_ib_qp_alloc(IN DAPL_IA *ia_ptr, conn->route_retries = dapl_os_get_env_val("DAPL_CM_ROUTE_RETRY_COUNT", IB_ROUTE_RETRY_COUNT); + /* setup up ep->param to reference the bound local address and port */ + ep_ptr->param.local_ia_address_ptr = &cm_id->route.addr.src_addr; + ep_ptr->param.local_port_qual = rdma_get_src_port(cm_id); + ep_ptr->qp_handle = conn; ep_ptr->qp_state = IB_QP_STATE_INIT; dapl_dbg_log(DAPL_DBG_TYPE_EP, - " qp_alloc: qpn %p sq %d,%d rq %d,%d\n", + " qp_alloc: qpn %p sq %d,%d rq %d,%d port=%d\n", ep_ptr->qp_handle->cm_id->qp->qp_num, qp_create.cap.max_send_wr,qp_create.cap.max_send_sge, - qp_create.cap.max_recv_wr,qp_create.cap.max_recv_sge); + qp_create.cap.max_recv_wr,qp_create.cap.max_recv_sge, + ep_ptr->param.local_port_qual); return DAT_SUCCESS; bail: diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c old mode 100644 new mode 100755 index ec8a82e..68a3cbe --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -106,6 +106,7 @@ static DAT_RMR_CONTEXT rmr_context_send_msg; static DAT_VLEN registered_size_send_msg; static DAT_VADDR registered_addr_send_msg; static DAT_EP_ATTR ep_attr; +static DAT_EP_PARAM ep_param; char hostname[256] = {0}; char provider[256] = DAPL_PROVIDER; @@ -329,6 +330,25 @@ main(int argc, char **argv) } else 
LOGPRINTF("%d EP created %p \n", getpid(), h_ep); + /* query EP for local address information */ + ret = dat_ep_query( h_ep, DAT_EP_FIELD_ALL, &ep_param ); + if(ret != DAT_SUCCESS) { + fprintf(stderr, "%d Error dat_ep_query: %s\n", + getpid(),DT_RetToString(ret)); + goto cleanup; + } else + LOGPRINTF("%d EP queried %p \n", getpid(), h_ep); + + printf("%d Query EP: family %d port %d addr %d.%d.%d.%d (%d)\n", getpid(), + ((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_family, + ((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_port, + ((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_addr.s_addr >> 24 & 0xff, + ep_param.local_port_qual); + fflush(stdout); + /* * register message buffers, establish connection, and * exchange DMA RMR information info via messages From sobebike at gmail.com Tue May 8 17:37:38 2007 From: sobebike at gmail.com (SoBeBike) Date: Tue, 8 May 2007 19:37:38 -0500 Subject: [ofa-general] abi_compat Message-ID: Under what conditions is the field abi_compat of struct ibv_context set to non-zero? I'm encountering a situation where it is set when coding to verbs on a clean OFED 1.2 install. Seems odd that it would be set since I suspected that it would only occur for verbs 1.0/1.1 compatibility. thanks! From rdreier at cisco.com Tue May 8 17:51:08 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 08 May 2007 17:51:08 -0700 Subject: [ofa-general] Re: [PATCH] IB/mlx4 mlx4_core: fix qp free sync In-Reply-To: <1178617072.17477.45.camel@mtls03> (Eli Cohen's message of "Tue, 08 May 2007 12:37:22 +0300") References: <1178617072.17477.45.camel@mtls03> Message-ID: Thanks, I rolled this up into what I'll merge upstream. 
From rdreier at cisco.com Tue May 8 17:57:24 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 08 May 2007 17:57:24 -0700
Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts
In-Reply-To: <1178606876.17477.15.camel@mtls03> (Eli Cohen's message of "Tue, 08 May 2007 09:47:26 +0300")
References: <1178551555.17477.0.camel@mtls03> <1178606876.17477.15.camel@mtls03>
Message-ID:

 > > @@ -249,8 +249,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
 > >                  }
 > >          }
 > >
 > > -        if (eqes_found)
 > > -                eq_set_ci(eq, 1);
 > > +        eq_set_ci(eq, 1);
 > >
 > >          return eqes_found;
 > >  }

 > This will not ensure arming all EQs for all interrupts and we will face
 > the same problem of losing interrupts.

I don't understand what you mean here. How is unconditionally arming
the EQ at the end of mlx4_eq_int() any different from your proposed
patch? My change calls eq_set_ci() at the end of every call to
mlx4_eq_int(), and your change calls eq_set_ci() after every call to
mlx4_eq_int(). I'm probably missing something obvious, but I really
don't see it right now.

 - R.

From rdreier at cisco.com Tue May 8 17:58:51 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 08 May 2007 17:58:51 -0700
Subject: [ofa-general] [PATCH] IB/ipath -- shadow the gpio_mask register
In-Reply-To: <20070508202557.27647.47035.stgit@bauxite.internal.keyresearch.com> (Arthur Jones's message of "Tue, 08 May 2007 13:25:58 -0700")
References: <20070508202557.27647.47035.stgit@bauxite.internal.keyresearch.com>
Message-ID:

 > GPIO interrupts which have the gpio_mask bits set are
 > no longer unlikely. remove the unlikely annotation in
 > the interrupt handler and keep a shadow copy of the
 > gpio_mask register.

A better changelog would be appreciated here... I can see deleting
the unlikely() if it's no longer appropriate, but why keep a shadow
copy of the register? Because this is now a hotter path and you want
to avoid the MMIO read?

 - R.

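[Archive note] For readers following the shadow-register question above, here is a minimal sketch of the pattern in plain C. It is not the ipath driver code: struct fake_dev, mmio_read(), mmio_write() and the mask_*() helpers are hypothetical names that simulate a device register in ordinary memory, and the read counter exists only to make the avoided register read visible. The assumption being illustrated is the one Roland raises, namely that a driver-side copy lets the interrupt path test the mask without an MMIO read.

```c
#include <stdint.h>

/* Hypothetical stand-in for a device with one memory-mapped mask register. */
struct fake_dev {
    volatile uint64_t hw_gpio_mask; /* pretend this sits behind slow MMIO */
    uint64_t gpio_mask_shadow;      /* driver-side copy of the last value written */
    unsigned mmio_reads;            /* counts simulated register reads */
};

/* Simulated register accessors (assumed names, not a real driver API). */
static inline uint64_t mmio_read(struct fake_dev *dd)
{
    dd->mmio_reads++;
    return dd->hw_gpio_mask;
}

static inline void mmio_write(struct fake_dev *dd, uint64_t val)
{
    dd->hw_gpio_mask = val;
}

/* Read-modify-write without a shadow: every update costs a register read. */
static inline void mask_set_rmw(struct fake_dev *dd, uint64_t bits)
{
    uint64_t val = mmio_read(dd);

    val |= bits;
    mmio_write(dd, val);
}

/* Shadowed update: modify the copy in memory, then write it out. */
static inline void mask_set_shadowed(struct fake_dev *dd, uint64_t bits)
{
    dd->gpio_mask_shadow |= bits;
    mmio_write(dd, dd->gpio_mask_shadow);
}

static inline void mask_clear_shadowed(struct fake_dev *dd, uint64_t bits)
{
    dd->gpio_mask_shadow &= ~bits;
    mmio_write(dd, dd->gpio_mask_shadow);
}

/* Hot-path check: which status bits are enabled, with no register read. */
static inline uint64_t pending_masked(const struct fake_dev *dd, uint64_t status)
{
    return status & dd->gpio_mask_shadow;
}
```

The sketch only stays correct if every write to the register goes through the shadowed helpers; a path that still reads the register and writes back its own value could let the shadow and the hardware drift apart.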
From rdreier at cisco.com Tue May 8 18:06:56 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 08 May 2007 18:06:56 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will merge the mlx4 drivers for new Mellanox adapters: Roland Dreier (3): IB/uverbs: Export ib_umem_get()/ib_umem_release() to modules IB: Put rlimit accounting struct in struct ib_umem IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters drivers/infiniband/Kconfig | 7 + drivers/infiniband/Makefile | 1 + drivers/infiniband/core/Makefile | 4 +- drivers/infiniband/core/device.c | 2 + drivers/infiniband/core/{uverbs_mem.c => umem.c} | 153 ++- drivers/infiniband/core/uverbs.h | 6 +- drivers/infiniband/core/uverbs_cmd.c | 60 +- drivers/infiniband/core/uverbs_main.c | 11 +- drivers/infiniband/hw/amso1100/c2_provider.c | 42 +- drivers/infiniband/hw/amso1100/c2_provider.h | 1 + drivers/infiniband/hw/cxgb3/iwch_provider.c | 28 +- drivers/infiniband/hw/cxgb3/iwch_provider.h | 1 + drivers/infiniband/hw/ehca/ehca_classes.h | 1 + drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 69 +- drivers/infiniband/hw/ipath/ipath_mr.c | 38 +- drivers/infiniband/hw/ipath/ipath_verbs.h | 5 +- drivers/infiniband/hw/mlx4/Kconfig | 9 + drivers/infiniband/hw/mlx4/Makefile | 3 + drivers/infiniband/hw/mlx4/ah.c | 100 ++ drivers/infiniband/hw/mlx4/cq.c | 525 +++++++++ drivers/infiniband/hw/mlx4/doorbell.c | 216 ++++ drivers/infiniband/hw/mlx4/mad.c | 339 ++++++ drivers/infiniband/hw/mlx4/main.c | 651 +++++++++++ drivers/infiniband/hw/mlx4/mlx4_ib.h | 285 +++++ drivers/infiniband/hw/mlx4/mr.c | 184 +++ drivers/infiniband/hw/mlx4/qp.c | 1294 ++++++++++++++++++++++ drivers/infiniband/hw/mlx4/srq.c | 334 ++++++ 
drivers/infiniband/hw/mlx4/user.h | 92 ++ drivers/infiniband/hw/mthca/mthca_provider.c | 38 +- drivers/infiniband/hw/mthca/mthca_provider.h | 1 + drivers/net/Kconfig | 14 + drivers/net/Makefile | 1 + drivers/net/mlx4/Makefile | 4 + drivers/net/mlx4/alloc.c | 179 +++ drivers/net/mlx4/catas.c | 70 ++ drivers/net/mlx4/cmd.c | 429 +++++++ drivers/net/mlx4/cq.c | 254 +++++ drivers/net/mlx4/eq.c | 696 ++++++++++++ drivers/net/mlx4/fw.c | 775 +++++++++++++ drivers/net/mlx4/fw.h | 167 +++ drivers/net/mlx4/icm.c | 379 +++++++ drivers/net/mlx4/icm.h | 135 +++ drivers/net/mlx4/intf.c | 165 +++ drivers/net/mlx4/main.c | 936 ++++++++++++++++ drivers/net/mlx4/mcg.c | 380 +++++++ drivers/net/mlx4/mlx4.h | 348 ++++++ drivers/net/mlx4/mr.c | 479 ++++++++ drivers/net/mlx4/pd.c | 102 ++ drivers/net/mlx4/profile.c | 238 ++++ drivers/net/mlx4/qp.c | 280 +++++ drivers/net/mlx4/reset.c | 181 +++ drivers/net/mlx4/srq.c | 227 ++++ include/linux/mlx4/cmd.h | 178 +++ include/linux/mlx4/cq.h | 123 ++ include/linux/mlx4/device.h | 331 ++++++ include/linux/mlx4/doorbell.h | 97 ++ include/linux/mlx4/driver.h | 59 + include/linux/mlx4/qp.h | 288 +++++ include/linux/mlx4/srq.h | 42 + include/rdma/ib_umem.h | 81 ++ include/rdma/ib_verbs.h | 28 +- 62 files changed, 11951 insertions(+), 218 deletions(-) rename drivers/infiniband/core/{uverbs_mem.c => umem.c} (59%) create mode 100644 drivers/infiniband/hw/mlx4/Kconfig create mode 100644 drivers/infiniband/hw/mlx4/Makefile create mode 100644 drivers/infiniband/hw/mlx4/ah.c create mode 100644 drivers/infiniband/hw/mlx4/cq.c create mode 100644 drivers/infiniband/hw/mlx4/doorbell.c create mode 100644 drivers/infiniband/hw/mlx4/mad.c create mode 100644 drivers/infiniband/hw/mlx4/main.c create mode 100644 drivers/infiniband/hw/mlx4/mlx4_ib.h create mode 100644 drivers/infiniband/hw/mlx4/mr.c create mode 100644 drivers/infiniband/hw/mlx4/qp.c create mode 100644 drivers/infiniband/hw/mlx4/srq.c create mode 100644 drivers/infiniband/hw/mlx4/user.h create mode 
100644 drivers/net/mlx4/Makefile create mode 100644 drivers/net/mlx4/alloc.c create mode 100644 drivers/net/mlx4/catas.c create mode 100644 drivers/net/mlx4/cmd.c create mode 100644 drivers/net/mlx4/cq.c create mode 100644 drivers/net/mlx4/eq.c create mode 100644 drivers/net/mlx4/fw.c create mode 100644 drivers/net/mlx4/fw.h create mode 100644 drivers/net/mlx4/icm.c create mode 100644 drivers/net/mlx4/icm.h create mode 100644 drivers/net/mlx4/intf.c create mode 100644 drivers/net/mlx4/main.c create mode 100644 drivers/net/mlx4/mcg.c create mode 100644 drivers/net/mlx4/mlx4.h create mode 100644 drivers/net/mlx4/mr.c create mode 100644 drivers/net/mlx4/pd.c create mode 100644 drivers/net/mlx4/profile.c create mode 100644 drivers/net/mlx4/qp.c create mode 100644 drivers/net/mlx4/reset.c create mode 100644 drivers/net/mlx4/srq.c create mode 100644 include/linux/mlx4/cmd.h create mode 100644 include/linux/mlx4/cq.h create mode 100644 include/linux/mlx4/device.h create mode 100644 include/linux/mlx4/doorbell.h create mode 100644 include/linux/mlx4/driver.h create mode 100644 include/linux/mlx4/qp.h create mode 100644 include/linux/mlx4/srq.h create mode 100644 include/rdma/ib_umem.h From rdreier at cisco.com Tue May 8 18:08:54 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 08 May 2007 18:08:54 -0700 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: <20070508141727.GR21591@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 8 May 2007 17:17:27 +0300") References: <20070508141727.GR21591@mellanox.co.il> Message-ID: > libmlx4 has this comments: > > /* FIXME flush wc buffers */ > > and since it does *not* currently actually flush the buffers, if we > enable WC for blueflame, WRs gets mixed in the WC buffer, and QP gets > corrupted/stuck. > > It seems we should we have arch.h under mthca and stick > some macro like wc_wmb() in there. > > Or, would infiniband/arch.h under libibverbs be a better place? 
I think we should add it to infiniband/arch.h but then also have an
#ifndef wc_wmb in libmlx4 until libibverbs with the define is
ubiquitous.

 > If WC is not enabled, userspace can avoid the flush - so, should we
 > return such a bit as part of kernel abi?

Maybe, although I'm not sure it's worth it.

 - R.

From rdreier at cisco.com Tue May 8 18:10:26 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 08 May 2007 18:10:26 -0700
Subject: [ofa-general] Re: no SRQ empty check in libmthca and in mlx2 kernel modules
In-Reply-To: <200705081238.41255.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 8 May 2007 12:38:41 +0300")
References: <200705081238.41255.jackm@dev.mellanox.co.il>
Message-ID:

 > It looks to me like there is no check for "no more available WQEs" when posting
 > SRQ reads. See libmlx4/src/srq.c and drivers/infiniband/hw/mlx4/srq.c.
 > There is no check in either place if srq_head = srq_tail, or some equivalent check.

Yes, you're right.

From weiny2 at llnl.gov Tue May 8 18:49:38 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 8 May 2007 18:49:38 -0700
Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager
Message-ID: <20070508184938.311b1c8f.weiny2@llnl.gov>

I would like to submit to the list a performance manager which I have been
working on for OpenSM. It is implemented as the first proposed architecture
model set forth by Hal (as an integrated thread in OpenSM). As such it works
fine on our small test cluster, but there is some concern about its
scalability.

I have extended this architecture with an idea of my own. This idea is to
have a pluggable module for the "event database". With this interface one
could write their own data reduction, logging, and tracking methods. Here at
LLNL I propose to use this to add counter and subnet events directly to our
management database, which is used to show system status to our operators.
Other installations might prefer other methods of logging, SNMP for example.

This patch includes a "reference" implementation of this "event database"
which stores the information internally until the user requests a "dump".

Let the flames begin,
Ira Weiny
weiny2 at llnl.gov

From 4ce288b6a5a371872cf160f6d4e29e768a065cb9 Mon Sep 17 00:00:00 2001
From: Ira K. Weiny
Date: Tue, 24 Apr 2007 23:44:15 -0700
Subject: [PATCH] OpenSM Proposed Perf Manager

Features include:
   * Create "PerfMgr" thread and sweep all ports on the subnet every
     sweep_time seconds
   * Port counter clear on overflow
   * Pluggable architecture for the "event" database
   * Output machine- and human-readable output in the default event
     database dump
   * Control using the "perfmgr" command in the console

Known Issues
   * Not tested at scale.
   * Event database should record trap events and other "interesting"
     subnet events.
   * Port counter log warnings should be configurable, not hard coded.
   * Partitions are not handled yet.
   * Code might not be as pristine as I would like.

Enable using --enable-perf-mgr

Signed-off-by: Ira K.
Weiny --- osm/Makefile.am | 3 +- osm/config/osmvsel.m4 | 26 ++ osm/configure.in | 5 +- osm/eventdb/Makefile.am | 37 ++ osm/eventdb/autogen.sh | 15 + osm/eventdb/configure.in | 70 ++++ osm/eventdb/libibeventdb.map | 5 + osm/eventdb/libibeventdb.spec.in | 38 ++ osm/eventdb/libibeventdb.ver | 9 + osm/eventdb/src/ibeventdb.c | 622 +++++++++++++++++++++++++++++++++ osm/include/Makefile.am | 2 + osm/include/iba/ib_types.h | 74 ++++ osm/include/opensm/osm_base.h | 23 ++ osm/include/opensm/osm_event_db.h | 151 ++++++++ osm/include/opensm/osm_madw.h | 40 +++ osm/include/opensm/osm_msgdef.h | 1 + osm/include/opensm/osm_opensm.h | 4 + osm/include/opensm/osm_perfmgr.h | 223 ++++++++++++ osm/include/opensm/osm_subnet.h | 18 + osm/opensm.spec.in | 11 +- osm/opensm/Makefile.am | 5 +- osm/opensm/configure.in | 3 + osm/opensm/main.c | 19 + osm/opensm/osm_console.c | 78 +++++ osm/opensm/osm_event_db.c | 172 +++++++++ osm/opensm/osm_opensm.c | 24 ++ osm/opensm/osm_perfmgr.c | 686 +++++++++++++++++++++++++++++++++++++ osm/opensm/osm_subnet.c | 51 +++ osm/opensm/osm_trap_rcv.c | 15 + 29 files changed, 2425 insertions(+), 5 deletions(-) diff --git a/osm/Makefile.am b/osm/Makefile.am index ec66883..32f5f64 100644 --- a/osm/Makefile.am +++ b/osm/Makefile.am @@ -1,6 +1,7 @@ # note that order matters: make the libs first then use them -SUBDIRS = complib libvendor opensm osmtest include +SUBDIRS = complib libvendor opensm osmtest include $(EVENTDB) +DIST_SUBDIRS = complib libvendor opensm osmtest include eventdb # this will control the update of the files in order MAINTAINERCLEANFILES = Makefile.in aclocal.m4 configure config-h.in diff --git a/osm/config/osmvsel.m4 b/osm/config/osmvsel.m4 index 9234f36..ce6039c 100644 --- a/osm/config/osmvsel.m4 +++ b/osm/config/osmvsel.m4 @@ -180,3 +180,29 @@ if test "$disable_libcheck" != "yes"; th fi # --- END OPENIB_APP_OSMV_CHECK_HEADER --- ]) dnl OPENIB_APP_OSMV_CHECK_HEADER + + +AC_DEFUN([OPENIB_OSM_PERF_MGR_SEL], [ +# --- BEGIN 
OPENIB_OSM_PERF_MGR_SEL --- + +dnl enable the perf-mgr +AC_ARG_ENABLE(perf-mgr, +[ --enable-perf-mgr Enable the performance manager (default no)], + [case $enableval in + yes) perf_mgr=yes ;; + no) perf_mgr=no ;; + esac], + perf_mgr=no) +if test $perf_mgr = yes; then + AC_DEFINE(ENABLE_OSM_PERF_MGR, + 1, + [Define as 1 if you want to enable the performance manager]) + EVENTDB=eventdb +else + EVENTDB= +fi +AC_SUBST([EVENTDB]) + +# --- END OPENIB_OSM_PERF_MGR_SEL --- +]) dnl OPENIB_OSM_PERF_MGR_SEL + diff --git a/osm/configure.in b/osm/configure.in index eb6552f..94d4483 100644 --- a/osm/configure.in +++ b/osm/configure.in @@ -27,11 +27,14 @@ AC_ARG_ENABLE(debug, esac],[debug=false]) AM_CONDITIONAL(DEBUG, test x$debug = xtrue) +dnl select performance manager or not +OPENIB_OSM_PERF_MGR_SEL + dnl Provide user option to select vendor OPENIB_APP_OSMV_SEL dnl Configure the following subdirs -AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest include) +AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest include eventdb) dnl Create the following Makefiles AC_OUTPUT(Makefile) diff --git a/osm/eventdb/Makefile.am b/osm/eventdb/Makefile.am new file mode 100644 index 0000000..18f2db9 --- /dev/null +++ b/osm/eventdb/Makefile.am @@ -0,0 +1,37 @@ + +INCLUDES = -I$(srcdir)/../include \ + -I$(includedir)/infiniband + +lib_LTLIBRARIES = libibeventdb.la + +if DEBUG +DBGFLAGS = -ggdb -D_DEBUG_ +else +DBGFLAGS = -g +endif + +libibeventdb_la_CFLAGS = -Wall $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -Wno-deprecated-declarations + +if HAVE_LD_VERSION_SCRIPT + libibeventdb_version_script = -Wl,--version-script=$(srcdir)/libibeventdb.map +else + libibeventdb_version_script = +endif + +libibeventdb_la_SOURCES = src/ibeventdb.c +libibeventdb_la_LDFLAGS = -version-info $(ibeventdb_api_version) \ + -export-dynamic $(libibeventdb_version_script) +libibeventdb_la_LIBADD = -L../complib $(OSMV_LDADD) -losmcomp +libibeventdb_la_DEPENDENCIES = $(srcdir)/libibeventdb.map + +libibeventdbincludedir 
= $(includedir)/infiniband/complib + +libibeventdbinclude_HEADERS = + +# headers are distributed as part of the include dir +EXTRA_DIST = $(srcdir)/libibeventdb.spec.in $(srcdir)/libibeventdb.map \ + $(srcdir)/libibeventdb.ver + +dist-hook: libibeventdb.spec + cp libibeventdb.spec $(distdir) + diff --git a/osm/eventdb/autogen.sh b/osm/eventdb/autogen.sh new file mode 100755 index 0000000..ec20fc5 --- /dev/null +++ b/osm/eventdb/autogen.sh @@ -0,0 +1,15 @@ +#! /bin/sh + +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + +# create config dir if not exist +test -d config || mkdir config + +set -x +(aclocal -I config -I ../config 2>&1 ) && \ +(libtoolize --force --copy) && \ +(autoheader) && \ +(automake --foreign --add-missing --copy) && \ +autoconf + diff --git a/osm/eventdb/configure.in b/osm/eventdb/configure.in new file mode 100644 index 0000000..f5fa345 --- /dev/null +++ b/osm/eventdb/configure.in @@ -0,0 +1,70 @@ +dnl Process this file with autoconf to produce a configure script. + +AC_PREREQ(2.57) +AC_INIT(libibeventdb, 1.0.0, openib-general at openib.org) +AC_CONFIG_AUX_DIR(config) +AM_CONFIG_HEADER(config.h) +AM_INIT_AUTOMAKE + +dnl the library version info is available in the file: libibeventdb.ver +ibeventdb_api_version=`grep LIBVERSION $srcdir/libibeventdb.ver | sed 's/LIBVERSION=//'` +if test -z $ibeventdb_api_version; then + ibeventdb_api_version=1:0:0 +fi +AC_SUBST(ibeventdb_api_version) + +dnl Checks for programs +AC_PROG_CC +AC_PROG_GCC_TRADITIONAL +AC_PROG_LIBTOOL + +dnl Checks for libraries +AC_CHECK_LIB(pthread, pthread_mutex_init, [], + AC_MSG_ERROR([pthread_mutex_init() not found. libibeventdb requires libpthread.])) + +dnl Checks for header files. 
+AC_HEADER_STDC +AC_CHECK_HEADERS([fcntl.h stdlib.h string.h sys/ioctl.h sys/time.h syslog.h unistd.h]) + +dnl Checks for library functions +AC_FUNC_MALLOC +AC_FUNC_MEMCMP +AC_CHECK_FUNC([time]) +dnl AC_CHECK_FUNC([cl_plock_excl_acquire], [], +dnl AC_MSG_ERROR([cl_plock_excl_acquire not found, libibeventdb requires libosmcomp])) + +dnl Checks for typedefs, structures, and compiler characteristics. +AC_C_CONST +AC_C_INLINE +AC_TYPE_SIZE_T +AC_HEADER_TIME + +dnl We use --version-script with ld if possible +AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, + if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then + ac_cv_version_script=yes + else + ac_cv_version_script=no + fi) + +AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") + +dnl Support debug mode build - if enable-debug provided the DEBUG variable is set +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + +# we have to revive the env CFLAGS as some how they are being overwritten... +# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering +# for why they should NEVER be modified by the configure to allow for user +# overrides. 
+CFLAGS=$ac_env_CFLAGS_value + + +AC_CONFIG_FILES([Makefile libibeventdb.spec]) +AC_OUTPUT diff --git a/osm/eventdb/libibeventdb.map b/osm/eventdb/libibeventdb.map new file mode 100644 index 0000000..ca4f78c --- /dev/null +++ b/osm/eventdb/libibeventdb.map @@ -0,0 +1,5 @@ +OSMPMDB_1.0 { + global: + __osm_event_db; + local: *; +}; diff --git a/osm/eventdb/libibeventdb.spec.in b/osm/eventdb/libibeventdb.spec.in new file mode 100644 index 0000000..ac66545 --- /dev/null +++ b/osm/eventdb/libibeventdb.spec.in @@ -0,0 +1,38 @@ + +%define ver @VERSION@ +%define RELEASE 1 +%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE} + +Summary: OpenIB InfiniBand OpenSM Component Library +Name: libibeventdb +Version: %ver +Release: %rel%{?dist} +License: GPL/BSD +Group: System Environment/Libraries +BuildRoot: %{_tmppath}/%{name}-%{version}-root +Source: http://openib.org/downloads/%{name}-%{version}.tar.gz +Url: http://openib.org/ +Requires: opensm + +%description +libibeventdb provides a default plugin for the OpenSM event database + +%prep +%setup -q + +%build +%configure +make + +%install +make DESTDIR=${RPM_BUILD_ROOT} install +# remove unpackaged files from the buildroot +rm -f $RPM_BUILD_ROOT%{_libdir}/*.la + +%clean +rm -rf $RPM_BUILD_ROOT + +%files +%defattr(-,root,root) +%{_libdir}/libibeventdb*.so.* +%doc ChangeLog diff --git a/osm/eventdb/libibeventdb.ver b/osm/eventdb/libibeventdb.ver new file mode 100644 index 0000000..7a703b7 --- /dev/null +++ b/osm/eventdb/libibeventdb.ver @@ -0,0 +1,9 @@ +# In this file we track the current API version +# of the vendor interface (and libraries) +# The version is built of the following +# three numbers: +# API_REV:RUNNING_REV:AGE +# API_REV - advance on any added API +# RUNNING_REV - advance on any change to the vendor files +# AGE - number of backward versions the API still supports +LIBVERSION=1:0:0 diff --git a/osm/eventdb/src/ibeventdb.c b/osm/eventdb/src/ibeventdb.c new file mode 100644 index 0000000..e98f85c --- /dev/null
+++ b/osm/eventdb/src/ibeventdb.c @@ -0,0 +1,622 @@ +/* + * Copyright (c) 2007 The Regents of the University of California. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/** + * Port counter object. + * Store all the port counters for a single port. 
+ */ +typedef struct _osm_event_pc { + struct { + uint64_t symbol_err_cnt; + uint64_t link_err_recover; + uint64_t link_downed; + uint64_t rcv_err; + uint64_t rcv_rem_phys_err; + uint64_t rcv_switch_relay_err; + uint64_t xmit_discards; + uint64_t xmit_constraint_err; + uint64_t rcv_constraint_err; + uint64_t link_int_err; + uint64_t buffer_overrun_err; + uint64_t vl15_dropped; + uint64_t xmit_data; + uint64_t rcv_data; + uint64_t xmit_pkts; + uint64_t rcv_pkts; + time_t last_reset; + } totals; + osm_pc_reading_t previous; +} osm_event_pc_t; + +/** + * group port counters for ports into the nodes + */ +typedef struct _osm_pc_node { + cl_map_item_t map_item; /* must be first */ + uint64_t node_guid; + osm_event_pc_t *ports; + uint8_t num_ports; +} osm_pc_node_t; + +/** + * all nodes in the system. + */ +typedef struct _osm_pc_db { + cl_qmap_t pc_data; /* stores type (osm_pc_node_t *) */ + cl_plock_t lock; + osm_log_t *osm_log; +} osm_pc_db_t; + + +/** ========================================================================= + */ +static void * +db_construct(osm_log_t *osm_log) +{ + /* use the default */ + osm_pc_db_t *db = malloc(sizeof(*db)); + if (!db) { + return (NULL); + } + cl_plock_construct(&(db->lock)); + cl_plock_init(&(db->lock)); + cl_qmap_init(&(db->pc_data)); + db->osm_log = osm_log; + return ((void *)db); +} + +/** ========================================================================= + */ +static void +db_destroy(void *_db) +{ + osm_pc_db_t *db = (osm_pc_db_t *)_db; + cl_plock_excl_acquire(&(db->lock)); + /* remove all the items in the qmap */ + while (!cl_is_qmap_empty(&(db->pc_data))) { + cl_map_item_t *rc = cl_qmap_head(&(db->pc_data)); + cl_qmap_remove_item(&(db->pc_data), rc); + } + cl_plock_release(&(db->lock)); + cl_plock_destroy(&(db->lock)); + free(db); +} + +/** ========================================================================= + */ +static osm_pc_node_t * +malloc_node(void *_db, uint64_t guid, uint8_t num_ports) +{ + int i = 0; + 
time_t cur_time = 0; + osm_pc_node_t *rc = malloc(sizeof(*rc)); + if (!rc) + return (NULL); + + rc->ports = calloc(num_ports, sizeof(osm_event_pc_t)); + if (!rc->ports) { + goto free_rc; + } + rc->num_ports = num_ports; + rc->node_guid = guid; + + cur_time = time(NULL); + for (i = 0; i < num_ports; i++) { + rc->ports[i].totals.last_reset = cur_time; + rc->ports[i].previous.time = cur_time; + } + + return (rc); +free_rc: + free(rc); + return (NULL); +} + +/** ========================================================================= + */ +static void +free_node(osm_pc_node_t *node) +{ + if (!node) + return; + if (node->ports) + free(node->ports); + free(node); +} + +/* insert nodes to the database */ +static osm_event_db_err_t +insert(void *_db, osm_pc_node_t *node) +{ + osm_pc_db_t *db = (osm_pc_db_t *)_db; + cl_map_item_t *rc = cl_qmap_insert(&(db->pc_data), node->node_guid, (cl_map_item_t *)node); + if ((void *)rc != (void *)node) + return (OSM_EVENT_DB_FAIL); + return (OSM_EVENT_DB_SUCCESS); +} + +/********************************************************************** + * Internal call db->lock should be held when calling + **********************************************************************/ +static inline osm_pc_node_t * +get(void *_db, uint64_t guid) +{ + osm_pc_db_t *db = (osm_pc_db_t *)_db; + cl_map_item_t *rc = cl_qmap_get(&(db->pc_data), guid); + const cl_map_item_t *end = cl_qmap_end(&(db->pc_data)); + if (rc == end) + return (NULL); + return ((osm_pc_node_t *)rc); +} + +/** ========================================================================= + */ +static osm_event_db_err_t +db_create_entry(void *_db, uint64_t guid, uint8_t num_ports) +{ + osm_pc_db_t *db = (osm_pc_db_t *)_db; + osm_event_db_err_t rc = OSM_EVENT_DB_SUCCESS; + cl_plock_excl_acquire(&(db->lock)); + if (!get(db, guid)) { + osm_pc_node_t *pc_node = malloc_node(db, guid, num_ports); + if (!pc_node) { + rc = OSM_EVENT_DB_NOMEM; + goto Exit; + } + if (insert(db, pc_node)) { + 
 free_node(pc_node); + rc = OSM_EVENT_DB_FAIL; + goto Exit; + } + } +Exit: + cl_plock_release(&(db->lock)); + return (rc); +} + +/********************************************************************** + **********************************************************************/ +static osm_event_db_err_t +db_get_prev(void *_db, uint64_t guid, + uint8_t port, osm_pc_reading_t *reading) +{ + osm_pc_db_t *db = (osm_pc_db_t *)_db; + osm_pc_node_t *node = NULL; + cl_map_item_t *rc = NULL; + const cl_map_item_t *end = NULL; + + cl_plock_acquire(&(db->lock)); + + rc = cl_qmap_get(&(db->pc_data), guid); + end = cl_qmap_end(&(db->pc_data)); + if (rc == end) + return (OSM_EVENT_DB_GUIDNOTFOUND); + + node = (osm_pc_node_t *)rc; + if (port >= node->num_ports) + return (OSM_EVENT_DB_PORTNOTFOUND); + + *reading = node->ports[port].previous; + + cl_plock_release(&(db->lock)); + return (OSM_EVENT_DB_SUCCESS); +} + +/********************************************************************** + * Output the port counters in tab-delimited form + **********************************************************************/ +static void +__dump_node_mr(osm_pc_node_t *node, FILE *fp) +{ + int i = 0; + + fprintf(fp, "\nGUID Port\t%s\t%s\t" + "%s\t%s\t%s\t%s\t%s\t%s\t%s\t" + "%s\t%s\t%s\t%s\t%s\t%s\t%s\n", + "symbol_err_cnt", + "link_err_recover", + "link_downed", + "rcv_err", + "rcv_rem_phys_err", + "rcv_switch_relay_err", + "xmit_discards", + "xmit_constraint_err", + "rcv_constraint_err", + "link_int_err", + "buf_overrun_err", + "vl15_dropped", + "xmit_data", + "rcv_data", + "xmit_pkts", + "rcv_pkts"); + for (i = 1; i < node->num_ports; i++) + { + fprintf(fp, "0x%" PRIx64 "\t%d\t%"PRIu64"\t%"PRIu64"\t" + "%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t" + "%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t" + "%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t" + "%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t%"PRIu64"\n", + node->node_guid, + i, + node->ports[i].totals.symbol_err_cnt, + node->ports[i].totals.link_err_recover, +
node->ports[i].totals.link_downed, + node->ports[i].totals.rcv_err, + node->ports[i].totals.rcv_rem_phys_err, + node->ports[i].totals.rcv_switch_relay_err, + node->ports[i].totals.xmit_discards, + node->ports[i].totals.xmit_constraint_err, + node->ports[i].totals.rcv_constraint_err, + node->ports[i].totals.link_int_err, + node->ports[i].totals.buffer_overrun_err, + node->ports[i].totals.vl15_dropped, + node->ports[i].totals.xmit_data, + node->ports[i].totals.rcv_data, + node->ports[i].totals.xmit_pkts, + node->ports[i].totals.rcv_pkts + ); + } +} + +/********************************************************************** + * Output a human readable output of the port counters + **********************************************************************/ +static void +__dump_node_hr(osm_pc_node_t *node, FILE *fp) +{ + int i = 0; + + fprintf(fp, "\n"); + for (i = 1; i < node->num_ports; i++) + { + fprintf(fp, "GUID 0x%"PRIx64": Port %d:\n" + " symbol_err_cnt: %"PRIu64"\n" + " link_err_recover: %"PRIu64"\n" + " link_downed: %"PRIu64"\n" + " rcv_err: %"PRIu64"\n" + " rcv_rem_phys_err: %"PRIu64"\n" + " rcv_switch_relay_err: %"PRIu64"\n" + " xmit_discards: %"PRIu64"\n" + " xmit_constraint_err: %"PRIu64"\n" + " rcv_constraint_err: %"PRIu64"\n" + " link_int_err: %"PRIu64"\n" + " buf_overrun_err: %"PRIu64"\n" + " vl15_dropped: %"PRIu64"\n" + " xmit_data: %"PRIu64"\n" + " rcv_data: %"PRIu64"\n" + " xmit_pkts: %"PRIu64"\n" + " rcv_pkts: %"PRIu64"\n" + , + node->node_guid, + i, + node->ports[i].totals.symbol_err_cnt, + node->ports[i].totals.link_err_recover, + node->ports[i].totals.link_downed, + node->ports[i].totals.rcv_err, + node->ports[i].totals.rcv_rem_phys_err, + node->ports[i].totals.rcv_switch_relay_err, + node->ports[i].totals.xmit_discards, + node->ports[i].totals.xmit_constraint_err, + node->ports[i].totals.rcv_constraint_err, + node->ports[i].totals.link_int_err, + node->ports[i].totals.buffer_overrun_err, + node->ports[i].totals.vl15_dropped, + 
node->ports[i].totals.xmit_data, + node->ports[i].totals.rcv_data, + node->ports[i].totals.xmit_pkts, + node->ports[i].totals.rcv_pkts + ); + } +} + +/* Define a context for the __db_dump callback */ +typedef struct { + FILE *fp; + osm_event_db_dump_t dump_type; +} dump_context_t; + +/********************************************************************** + **********************************************************************/ +static void +__db_dump(cl_map_item_t * const p_map_item, void *context ) +{ + osm_pc_node_t *node = (osm_pc_node_t *)p_map_item; + dump_context_t *c = (dump_context_t *)context; + FILE *fp = c->fp; + + switch (c->dump_type) + { + case OSM_EVENT_DB_DUMP_MR: + __dump_node_mr(node, fp); + break; + case OSM_EVENT_DB_DUMP_HR: + default: + __dump_node_hr(node, fp); + break; + } +} + +/********************************************************************** + * dump the data to the file "file" + **********************************************************************/ +static osm_event_db_err_t +db_dump(void *_db, char *file, osm_event_db_dump_t dump_type) +{ + osm_pc_db_t *db = (osm_pc_db_t *)_db; + dump_context_t context; + + context.fp = fopen(file, "w+"); + if (!context.fp) + return (OSM_EVENT_DB_FAIL); + context.dump_type = dump_type; + + cl_plock_acquire(&(db->lock)); + cl_qmap_apply_func(&(db->pc_data), __db_dump, (void *)&context); + cl_plock_release(&(db->lock)); + fclose(context.fp); + return (OSM_EVENT_DB_SUCCESS); +} + +/********************************************************************** + * call back to support the below + **********************************************************************/ +static void +__clear_counters(cl_map_item_t * const p_map_item, void *context ) +{ + osm_pc_node_t *node = (osm_pc_node_t *)p_map_item; + int i = 0; + for (i = 0; i < node->num_ports; i++) { + node->ports[i].totals.symbol_err_cnt = 0; + node->ports[i].totals.link_err_recover = 0; + node->ports[i].totals.link_downed = 0; + 
node->ports[i].totals.rcv_err = 0; + node->ports[i].totals.rcv_rem_phys_err = 0; + node->ports[i].totals.rcv_switch_relay_err = 0; + node->ports[i].totals.xmit_discards = 0; + node->ports[i].totals.xmit_constraint_err = 0; + node->ports[i].totals.rcv_constraint_err = 0; + node->ports[i].totals.link_int_err = 0; + node->ports[i].totals.buffer_overrun_err = 0; + node->ports[i].totals.vl15_dropped = 0; + node->ports[i].totals.xmit_data = 0; + node->ports[i].totals.rcv_data = 0; + node->ports[i].totals.xmit_pkts = 0; + node->ports[i].totals.rcv_pkts = 0; + node->ports[i].totals.last_reset = time(NULL); + } +} + +/********************************************************************** + * Clear the counters from the db + **********************************************************************/ +static void +db_clear_port_counters(void *_db) +{ + osm_pc_db_t *db = (osm_pc_db_t *)_db; + cl_plock_excl_acquire(&(db->lock)); + cl_qmap_apply_func(&(db->pc_data), __clear_counters, (void *)db); + cl_plock_release(&(db->lock)); +} + +#if 0 +/********************************************************************** + * Dump a reading vs the previous reading to stdout + **********************************************************************/ +static void +dump_reading(osm_event_pc_t *port, ib_port_counters_t *cur) +{ + printf("sym %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->symbol_err_cnt), + cl_ntoh16(port->previous.reading.symbol_err_cnt), port->totals.symbol_err_cnt); + printf("ler %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->link_err_recover), + cl_ntoh16(port->previous.reading.link_err_recover), port->totals.link_err_recover); + printf("ld %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->link_downed), + cl_ntoh16(port->previous.reading.link_downed), port->totals.link_downed); + printf("re %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->rcv_err), + cl_ntoh16(port->previous.reading.rcv_err), port->totals.rcv_err); + printf("rrp %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->rcv_rem_phys_err), + 
cl_ntoh16(port->previous.reading.rcv_rem_phys_err), port->totals.rcv_rem_phys_err); + printf("rsr %u - %u (%" PRIx64 ")\n", + cl_ntoh16(cur->rcv_switch_relay_err), + cl_ntoh16(port->previous.reading.rcv_switch_relay_err), port->totals.rcv_switch_relay_err); + printf("xd %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->xmit_discards), + cl_ntoh16(port->previous.reading.xmit_discards), port->totals.xmit_discards); + printf("xce %u - %u (%" PRIx64 ")\n", + cl_ntoh16(cur->xmit_constraint_err), + cl_ntoh16(port->previous.reading.xmit_constraint_err), port->totals.xmit_constraint_err); + printf("rce %u - %u (%" PRIx64 ")\n", + cl_ntoh16(cur->rcv_constraint_err), + cl_ntoh16(port->previous.reading.rcv_constraint_err), port->totals.rcv_constraint_err); + printf("li %x - %x (%" PRIx64 ")\n", + cl_ntoh16(cur->link_int_buffer_overrun), + cl_ntoh16(port->previous.reading.link_int_buffer_overrun), port->totals.link_int_err); + printf("bo %x - %x (%" PRIx64 ")\n", + cl_ntoh16(cur->link_int_buffer_overrun), + cl_ntoh16(port->previous.reading.link_int_buffer_overrun), port->totals.buffer_overrun_err); + printf("vld %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->vl15_dropped), + cl_ntoh16(port->previous.reading.vl15_dropped), port->totals.vl15_dropped); + + printf("xd %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->xmit_data), + cl_ntoh32(port->previous.reading.xmit_data), port->totals.xmit_data); + printf("rd %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->rcv_data), + cl_ntoh32(port->previous.reading.rcv_data), port->totals.rcv_data); + printf("xp %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->xmit_pkts), + cl_ntoh32(port->previous.reading.xmit_pkts), port->totals.xmit_pkts); + printf("rp %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->rcv_pkts), + cl_ntoh32(port->previous.reading.rcv_pkts), port->totals.rcv_pkts); +} +#endif + +/********************************************************************** + * Add the reading to the osm_pc_node_t + **********************************************************************/ +static 
osm_event_db_err_t +db_clear_prev_pc(void *_db, uint64_t guid, uint8_t port) +{ + osm_pc_db_t *db = (osm_pc_db_t *)_db; + osm_event_pc_t *p_port = NULL; + osm_pc_node_t *p_node = NULL; + ib_port_counters_t *previous = NULL; + osm_event_db_err_t rc = OSM_EVENT_DB_SUCCESS; + + cl_plock_excl_acquire(&(db->lock)); + p_node = get(db, guid); + + if (!p_node) + return (OSM_EVENT_DB_GUIDNOTFOUND); + + if (port >= p_node->num_ports) + return (OSM_EVENT_DB_PORTNOTFOUND); + + p_port = &(p_node->ports[port]); + previous = &(p_node->ports[port].previous.reading); + + memset(previous, 0, sizeof(*previous)); + p_port->previous.time = time(NULL); + + cl_plock_release(&(db->lock)); + return (rc); +} + +/********************************************************************** + * Add the reading to the osm_pc_node_t + **********************************************************************/ +static osm_event_db_err_t +db_add_reading(void *_db, uint64_t guid, + uint8_t port, ib_port_counters_t *reading) +{ + osm_pc_db_t *db = (osm_pc_db_t *)_db; + osm_event_pc_t *p_port = NULL; + osm_pc_node_t *p_node = NULL; + ib_port_counters_t *previous = NULL; + osm_event_db_err_t rc = OSM_EVENT_DB_SUCCESS; + + cl_plock_excl_acquire(&(db->lock)); + p_node = get(db, guid); + + if (!p_node) + return (OSM_EVENT_DB_GUIDNOTFOUND); + + if (port >= p_node->num_ports) + return (OSM_EVENT_DB_PORTNOTFOUND); + + p_port = &(p_node->ports[port]); + previous = &(p_node->ports[port].previous.reading); + +#if 0 + dump_reading(p_port, reading); +#endif + + /* calculate changes from previous reading */ + p_port->totals.symbol_err_cnt + += (cl_ntoh16(reading->symbol_err_cnt) + - cl_ntoh16(previous->symbol_err_cnt)); + p_port->totals.link_err_recover + += (reading->link_err_recover - previous->link_err_recover); + p_port->totals.link_downed + += (reading->link_downed - previous->link_downed); + p_port->totals.rcv_err + += (cl_ntoh16(reading->rcv_err) + - cl_ntoh16(previous->rcv_err)); + p_port->totals.rcv_rem_phys_err + 
+= (cl_ntoh16(reading->rcv_rem_phys_err) + - cl_ntoh16(previous->rcv_rem_phys_err)); + p_port->totals.rcv_switch_relay_err + += (cl_ntoh16(reading->rcv_switch_relay_err) + - cl_ntoh16(previous->rcv_switch_relay_err)); + p_port->totals.xmit_discards + += (cl_ntoh16(reading->xmit_discards) + - cl_ntoh16(previous->xmit_discards)); + p_port->totals.xmit_constraint_err + += (reading->xmit_constraint_err - previous->xmit_constraint_err); + p_port->totals.rcv_constraint_err + += (reading->rcv_constraint_err - previous->rcv_constraint_err); + p_port->totals.link_int_err + += PC_LINK_INT(reading->link_int_buffer_overrun) + - PC_LINK_INT(previous->link_int_buffer_overrun); + p_port->totals.buffer_overrun_err + += PC_BUF_OVERRUN(reading->link_int_buffer_overrun) + - PC_BUF_OVERRUN(previous->link_int_buffer_overrun); + p_port->totals.vl15_dropped + += (cl_ntoh16(reading->vl15_dropped) + - cl_ntoh16(previous->vl15_dropped)); + + p_port->totals.xmit_data + += (cl_ntoh32(reading->xmit_data) + - cl_ntoh32(previous->xmit_data)); + p_port->totals.rcv_data + += (cl_ntoh32(reading->rcv_data) + - cl_ntoh32(previous->rcv_data)); + p_port->totals.xmit_pkts + += (cl_ntoh32(reading->xmit_pkts) + - cl_ntoh32(previous->xmit_pkts)); + p_port->totals.rcv_pkts + += (cl_ntoh32(reading->rcv_pkts) + - cl_ntoh32(previous->rcv_pkts)); + + p_port->previous.reading = *reading; + p_port->previous.time = time(NULL); + + cl_plock_release(&(db->lock)); + return (rc); +} + +/** ========================================================================= + * Define the object symbol for loading + */ +__osm_event_db_t __osm_event_db = +{ +interface_version: OSM_EVENT_DB_INTERFACE_VER, +construct : db_construct, +destroy : db_destroy, +create_entry : db_create_entry, +get_prev_pc : db_get_prev, +dump : db_dump, +clear_port_counters : db_clear_port_counters, +add_pc_reading : db_add_reading, +clear_prev_pc : db_clear_prev_pc +}; + diff --git a/osm/include/Makefile.am b/osm/include/Makefile.am index 
8499d3b..fd874c8 100644 --- a/osm/include/Makefile.am +++ b/osm/include/Makefile.am @@ -87,6 +87,8 @@ EXTRA_DIST = \ $(srcdir)/opensm/osm_drop_mgr.h \ $(srcdir)/opensm/osm_port_info_rcv.h \ $(srcdir)/opensm/osm_state_mgr_ctrl.h \ + $(srcdir)/opensm/osm_perfmgr.h \ + $(srcdir)/opensm/osm_event_db.h \ $(srcdir)/complib/cl_thread_osd.h \ $(srcdir)/complib/cl_packon.h \ $(srcdir)/complib/cl_atomic_osd.h \ diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h index b3937cb..2a4057b 100644 --- a/osm/include/iba/ib_types.h +++ b/osm/include/iba/ib_types.h @@ -7353,6 +7353,80 @@ typedef struct _ib_inform_info_record } PACK_SUFFIX ib_inform_info_record_t; #include +/****s* IBA Base: Types/ib_perfmgr_mad_t +* NAME +* ib_perfmgr_mad_t +* +* DESCRIPTION +* IBA defined Perf Management MAD (16.3.1) +* +* SYNOPSIS +*/ +#include +typedef struct _ib_perfmgr_mad +{ + ib_mad_t header; + uint8_t resv[40]; + +#define IB_PM_DATA_SIZE 192 + uint8_t data[IB_PM_DATA_SIZE]; + +} PACK_SUFFIX ib_perfmgr_mad_t; +#include +/* +* FIELDS +* header +* Common MAD header. +* +* resv +* Reserved. +* +* data +* Performance Management payload. The structure and content of this field +* depend upon the method, attr_id, and attr_mod fields in the header. +* +* SEE ALSO +* ib_mad_t +*********/ + +/****s* IBA Base: Types/ib_port_counters +* NAME +* ib_port_counters_t +* +* DESCRIPTION +* IBA defined PortCounters Attribute. 
(16.1.3.5) +* +* SYNOPSIS +*/ +#include +typedef struct _ib_port_counters +{ + uint8_t reserved; + uint8_t port_select; + ib_net16_t counter_select; + ib_net16_t symbol_err_cnt; + uint8_t link_err_recover; + uint8_t link_downed; + ib_net16_t rcv_err; + ib_net16_t rcv_rem_phys_err; + ib_net16_t rcv_switch_relay_err; + ib_net16_t xmit_discards; + uint8_t xmit_constraint_err; + uint8_t rcv_constraint_err; + uint8_t res1; + uint8_t link_int_buffer_overrun; + ib_net16_t res2; + ib_net16_t vl15_dropped; + ib_net32_t xmit_data; + ib_net32_t rcv_data; + ib_net32_t xmit_pkts; + ib_net32_t rcv_pkts; +} PACK_SUFFIX ib_port_counters_t; +#include + +#define PC_LINK_INT(integ_buf_over) ((integ_buf_over & 0xF0) >> 4) +#define PC_BUF_OVERRUN(integ_buf_over) (integ_buf_over & 0x0F) + /****d* IBA Base: Types/DM_SVC_NAME * NAME * DM_SVC_NAME diff --git a/osm/include/opensm/osm_base.h b/osm/include/opensm/osm_base.h index b38b511..51cef49 100644 --- a/osm/include/opensm/osm_base.h +++ b/osm/include/opensm/osm_base.h @@ -448,6 +448,29 @@ BEGIN_C_DECLS */ #define OSM_SM_DEFAULT_QP1_SEND_SIZE 256 +/****d* OpenSM: Base/OSM_PM_DEFAULT_QP1_RCV_SIZE +* NAME +* OSM_PM_DEFAULT_QP1_RCV_SIZE +* +* DESCRIPTION +* Specifies the default size (in MADs) of the QP1 receive queue +* +* SYNOPSIS +*/ +#define OSM_PM_DEFAULT_QP1_RCV_SIZE 256 +/***********/ + +/****d* OpenSM: Base/OSM_PM_DEFAULT_QP1_SEND_SIZE +* NAME +* OSM_PM_DEFAULT_QP1_SEND_SIZE +* +* DESCRIPTION +* Specifies the default size (in MADs) of the QP1 send queue +* +* SYNOPSIS +*/ +#define OSM_PM_DEFAULT_QP1_SEND_SIZE 256 + /****d* OpenSM: Base/OSM_SM_DEFAULT_POLLING_TIMEOUT_MILLISECS * NAME diff --git a/osm/include/opensm/osm_event_db.h b/osm/include/opensm/osm_event_db.h new file mode 100644 index 0000000..17effaf --- /dev/null +++ b/osm/include/opensm/osm_event_db.h @@ -0,0 +1,151 @@ +/* + * Copyright (c) 2007 The Regents of the University of California. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef _OSM_EVENT_DB_H_ +#define _OSM_EVENT_DB_H_ + +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +/****h* OpenSM/Event Database +* DESCRIPTION +* Database interface to record subnet events +* +* Implementations of this object _MUST_ be thread safe. 
+* +* AUTHOR +* Ira Weiny, LLNL +* +*********/ + +typedef enum { + OSM_EVENT_DB_SUCCESS = 0, + OSM_EVENT_DB_FAIL, + OSM_EVENT_DB_NOMEM, + OSM_EVENT_DB_GUIDNOTFOUND, + OSM_EVENT_DB_PORTNOTFOUND +} osm_event_db_err_t; + +/** ========================================================================= + * Port counter reading + */ +typedef struct { + ib_port_counters_t reading; + time_t time; +} osm_pc_reading_t; + +/** ========================================================================= + * Dump output options + */ +typedef enum { + OSM_EVENT_DB_DUMP_HR = 0, /* Human readable */ + OSM_EVENT_DB_DUMP_MR /* Machine readable */ +} osm_event_db_dump_t; + +/** ========================================================================= + * Plugin creators should allocate an object of this type + * (name __osm_event_db_t) + * The version should be set to OSM_EVENT_DB_INTERFACE_VER + */ +#define OSM_EVENT_DB_INTERFACE_VER (1) +typedef struct +{ + int interface_version; + void *(*construct)(osm_log_t *osm_log); + void (*destroy)(void *db); + osm_event_db_err_t (*create_entry)(void *db, uint64_t guid, uint8_t num_ports); + osm_event_db_err_t (*get_prev_pc)(void *db, uint64_t guid, + uint8_t port, osm_pc_reading_t *reading); + osm_event_db_err_t (*dump)(void *db, char *file, osm_event_db_dump_t dump_type); + void (*clear_port_counters)(void *db); + osm_event_db_err_t (*add_pc_reading)(void *db, uint64_t guid, + uint8_t port, ib_port_counters_t *reading); + osm_event_db_err_t (*clear_prev_pc)(void *db, uint64_t guid, uint8_t port); +} __osm_event_db_t; + +/** ========================================================================= + * The database structure which should be considered opaque + */ +typedef struct { + void *handle; + __osm_event_db_t *db_impl; + void *db_data; + osm_log_t *p_log; +} osm_event_db_t; + + +/** + * functions + */ +osm_event_db_t *osm_event_db_construct(osm_log_t *p_log, char *type); +void osm_event_db_destroy(osm_event_db_t *db); + +osm_event_db_err_t 
 osm_event_db_create_entry(osm_event_db_t *db, uint64_t guid, + uint8_t num_ports); +osm_event_db_err_t osm_event_db_get_prev_pc(osm_event_db_t *db, + uint64_t guid, uint8_t port, + osm_pc_reading_t *reading); +osm_event_db_err_t osm_event_db_dump(osm_event_db_t *db, char *file, + osm_event_db_dump_t dump_type); +osm_event_db_err_t osm_event_db_add_pc_reading(osm_event_db_t *db, uint64_t guid, + uint8_t port, ib_port_counters_t *reading); +void osm_event_db_clear_port_counters(osm_event_db_t *db); +osm_event_db_err_t osm_event_db_clear_prev_pc(osm_event_db_t *db, uint64_t guid, + uint8_t port); + +#if 0 +/* work out the tracking of notice (trap) events. */ + +typedef struct { + ib_mad_notice_attr_t reading; + time_t time; +} osm_notice_reading_t; +osm_event_db_err_t osm_event_db_add_notice_reading(osm_event_db_t *db, uint64_t guid, + uint8_t port, ib_mad_notice_attr_t *reading); +#endif + +END_C_DECLS + +#endif /* _OSM_EVENT_DB_H_ */ + diff --git a/osm/include/opensm/osm_madw.h b/osm/include/opensm/osm_madw.h index 95be0f4..80258f4 100644 --- a/osm/include/opensm/osm_madw.h +++ b/osm/include/opensm/osm_madw.h @@ -315,6 +315,19 @@ typedef struct _osm_vla_context } osm_vla_context_t; /*********/ +/****s* OpenSM: MAD Wrapper/osm_perfmgr_context_t +* DESCRIPTION +* Context for Performance manager queries +*/ +typedef struct _osm_perfmgr_context { + uint64_t node_guid; + uint16_t port; + uint8_t num_ports; + uint8_t mad_method; /* was this a get or a set */ + struct timeval query_start; +} osm_perfmgr_context_t; +/*********/ + #ifndef OSM_VENDOR_INTF_OPENIB /****s* OpenSM: MAD Wrapper/osm_arbitrary_context_t * NAME @@ -354,6 +367,7 @@ typedef union _osm_madw_context osm_slvl_context_t slvl_context; osm_pkey_context_t pkey_context; osm_vla_context_t vla_context; + osm_perfmgr_context_t perfmgr_context; #ifndef OSM_VENDOR_INTF_OPENIB osm_arbitrary_context_t arb_context; #endif @@ -639,6 +653,32 @@ osm_madw_get_sa_mad_ptr( * MAD Wrapper object, osm_madw_construct,
osm_madw_destroy *********/ +/****f* OpenSM: MAD Wrapper/osm_madw_get_perfmgr_mad_ptr +* DESCRIPTION +* Gets a pointer to the PerfMgr MAD in this MAD wrapper. +* +* SYNOPSIS +*/ +static inline ib_perfmgr_mad_t* +osm_madw_get_perfmgr_mad_ptr( + IN const osm_madw_t* const p_madw ) +{ + return((ib_perfmgr_mad_t*)p_madw->p_mad); +} +/* +* PARAMETERS +* p_madw +* [in] Pointer to an osm_madw_t object. +* +* RETURN VALUES +* Pointer to the start of the PM MAD. +* +* NOTES +* +* SEE ALSO +* MAD Wrapper object, osm_madw_construct, osm_madw_destroy +*********/ + /****f* OpenSM: MAD Wrapper/osm_madw_get_ni_context_ptr * NAME * osm_madw_get_ni_context_ptr diff --git a/osm/include/opensm/osm_msgdef.h b/osm/include/opensm/osm_msgdef.h index a90e3b9..6732992 100644 --- a/osm/include/opensm/osm_msgdef.h +++ b/osm/include/opensm/osm_msgdef.h @@ -186,6 +186,7 @@ enum #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) OSM_MSG_MAD_MULTIPATH_RECORD, #endif + OSM_MSG_MAD_PORT_COUNTERS, OSM_MSG_MAX }; diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h index 482de28..bdaa8f3 100644 --- a/osm/include/opensm/osm_opensm.h +++ b/osm/include/opensm/osm_opensm.h @@ -57,6 +57,7 @@ #include #include #include +#include #include #include #include @@ -157,6 +158,9 @@ typedef struct _osm_opensm_t osm_subn_t subn; osm_sm_t sm; osm_sa_t sa; +#ifdef ENABLE_OSM_PERF_MGR + osm_perfmgr_t perfmgr; +#endif /* ENABLE_OSM_PERF_MGR */ osm_db_t db; osm_mad_pool_t mad_pool; osm_vendor_t *p_vendor; diff --git a/osm/include/opensm/osm_perfmgr.h b/osm/include/opensm/osm_perfmgr.h new file mode 100644 index 0000000..6138ec3 --- /dev/null +++ b/osm/include/opensm/osm_perfmgr.h @@ -0,0 +1,223 @@ +/* + * Copyright (c) 2007 The Regents of the University of California. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef _OSM_PERFMGR_H_ +#define _OSM_PERFMGR_H_ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#ifdef ENABLE_OSM_PERF_MGR + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef __cplusplus +extern "C" { +#endif /* __cplusplus */ + +/****h* OpenSM/PERFMGR +* NAME +* PERFMGR +* +* DESCRIPTION +* Performance manager thread which takes care of polling the fabric for +* Port counters values. +* +* The PERFMGR object is thread safe. 
+* +* AUTHOR +* Ira Weiny, LLNL +* +*********/ + +#define OSM_PERFMGR_DEFAULT_SWEEP_TIME_S 180 +#define OSM_PERFMGR_DEFAULT_DUMP_FILE OSM_DEFAULT_TMP_DIR "/osm_port_counters.log" +#define OSM_DEFAULT_EVENT_PLUGIN "ibeventdb" + +/****s* OpenSM: PERFMGR/osm_perfmgr_state_t */ +typedef enum +{ + PERFMGR_STATE_DISABLE, + PERFMGR_STATE_ENABLED, + PERFMGR_STATE_NO_DB +} osm_perfmgr_state_t; + +/****s* OpenSM: PERFMGR/osm_perfmgr_t +* This object should be treated as opaque and should +* be manipulated only through the provided functions. +*/ +typedef struct _osm_perfmgr +{ + osm_thread_state_t thread_state; + cl_event_t sig_sweep; + cl_thread_t sweeper; + osm_subn_t *subn; + osm_sm_t *sm; + cl_plock_t *lock; + osm_log_t *log; + osm_mad_pool_t *mad_pool; + atomic32_t trans_id; + osm_vendor_t *vendor; + osm_bind_handle_t bind_handle; + cl_disp_reg_handle_t pc_disp_h; + osm_perfmgr_state_t state; + uint16_t sweep_time_s; + char *db_file; + char *event_db_dump_file; + char *event_db_plugin; + osm_event_db_t *db; +} osm_perfmgr_t; +/* +* FIELDS +* subn +* Subnet object for this subnet. +* +* log +* Pointer to the log object. +* +* mad_pool +* Pointer to the MAD pool. 
+* +* event_db_dump_file +* File to be used to dump the Port Counters +* +* mad_ctrl +* Mad Controller +*********/ + +/****f* OpenSM: Creation Functions */ +void osm_perfmgr_shutdown(osm_perfmgr_t *const p_perfmgr ); +void osm_perfmgr_destroy(osm_perfmgr_t * const p_perfmgr ); + +/****f* OpenSM: Inline accessor functions */ +inline static void osm_perfmgr_set_state(osm_perfmgr_t *p_perfmgr, + osm_perfmgr_state_t state) +{ + p_perfmgr->state = state; +} +inline static osm_perfmgr_state_t osm_perfmgr_get_state(osm_perfmgr_t + *p_perfmgr) { return (p_perfmgr->state); } +inline static char *osm_perfmgr_get_state_str(osm_perfmgr_t *p_perfmgr) +{ + switch (p_perfmgr->state) + { + case PERFMGR_STATE_DISABLE: return ("Disabled"); break; + case PERFMGR_STATE_ENABLED: return ("Enabled"); break; + case PERFMGR_STATE_NO_DB: return ("No Database"); break; + } + return ("UNKNOWN"); +} +inline static void osm_perfmgr_set_sweep_time_s(osm_perfmgr_t *p_perfmgr, uint16_t time_s) +{ + p_perfmgr->sweep_time_s = time_s; + cl_event_signal(&p_perfmgr->sig_sweep); +} +inline static uint16_t osm_perfmgr_get_sweep_time_s(osm_perfmgr_t *p_perfmgr) +{ + return (p_perfmgr->sweep_time_s); +} +void osm_perfmgr_clear_counters(osm_perfmgr_t *p_perfmgr); +void osm_perfmgr_dump_counters(osm_perfmgr_t *p_perfmgr, + osm_event_db_dump_t dump_type); + +ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * const p_perfmgr, const ib_net64_t port_guid); + +#if 0 +/* Work out the tracking of notice events */ +ib_api_status_t osm_report_notice_to_perfmgr(osm_log_t *const p_log, osm_subn_t *p_subn, + ib_mad_notice_attr_t *p_ntc ) +#endif + +/****f* OpenSM: PERFMGR/osm_perfmgr_init */ +ib_api_status_t +osm_perfmgr_init( + osm_perfmgr_t* const perfmgr, + osm_subn_t* const subn, + osm_sm_t * const sm, + osm_log_t* const log, + osm_mad_pool_t * const mad_pool, + osm_vendor_t * const vendor, + cl_dispatcher_t* const disp, + cl_plock_t* const lock, + const osm_subn_opt_t * const p_opt ); +/* +* PARAMETERS +* perfmgr +* 
[in] Pointer to an osm_perfmgr_t object to initialize. +* +* subn +* [in] Pointer to the Subnet object for this subnet. +* +* sm +* [in] Pointer to the Subnet Manager (SM) object for this subnet. +* +* log +* [in] Pointer to the log object. +* +* mad_pool +* [in] Pointer to the MAD pool. +* +* vendor +* [in] Pointer to the vendor specific interfaces object. +* +* disp +* [in] Pointer to the OpenSM central Dispatcher. +* +* lock +* [in] Pointer to the OpenSM serializing lock. +* +* p_opt +* [in] Starting options +* +* RETURN VALUES +* IB_SUCCESS if the PERFMGR object was initialized successfully. +*********/ + +#ifdef __cplusplus +} +#endif /* __cplusplus */ + +#endif /* ENABLE_OSM_PERF_MGR */ + +#endif /* _OSM_PERFMGR_H_ */ + diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h index fc52b5e..0fdc18b 100644 --- a/osm/include/opensm/osm_subnet.h +++ b/osm/include/opensm/osm_subnet.h @@ -291,6 +291,12 @@ typedef struct _osm_subn_opt osm_qos_options_t qos_rtr_options; boolean_t enable_quirks; boolean_t no_clients_rereg; +#ifdef ENABLE_OSM_PERF_MGR + boolean_t perfmgr; + uint16_t perfmgr_sweep_time_s; + char * event_db_dump_file; + char * event_db_plugin; +#endif /* ENABLE_OSM_PERF_MGR */ } osm_subn_opt_t; /* * FIELDS @@ -468,6 +474,18 @@ typedef struct _osm_subn_opt * sm_inactive * OpenSM will start with SM in not active state. * +* perfmgr +* Enable or disable the performance manager +* +* perfmgr_sweep_time_s +* Define the period of PM sweep (in seconds). 
+* +* event_db_dump_file +* File to dump the event database to +* +* event_db_plugin +* specify the name of the event plugin +* * qos_options * Default set of QoS options * diff --git a/osm/opensm.spec.in b/osm/opensm.spec.in index c4e1798..8857a7b 100644 --- a/osm/opensm.spec.in +++ b/osm/opensm.spec.in @@ -38,10 +38,19 @@ Static libraries and header files for Op %define _disable_console_socket --disable-console-socket %endif +%if %{?_with_perf_mgr:1}%{!?_with_perf_mgr:0} +%define _enable_perf_mgr --enable-perf-mgr +%endif +%if %{?_without_perf_mgr:1}%{!?_without_perf_mgr:0} +%define _disable_perf_mgr --disable-perf-mgr +%endif + %build %configure \ %{?_enable_console_socket} \ - %{?_disable_console_socket} + %{?_disable_console_socket} \ + %{?_enable_perf_mgr} \ + %{?_disable_perf_mgr} make %{?_smp_mflags} %install diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am index e2520b8..9a1f6f4 100644 --- a/osm/opensm/Makefile.am +++ b/osm/opensm/Makefile.am @@ -55,7 +55,8 @@ opensm_SOURCES = main.c osm_console.c os osm_trap_rcv.c osm_ucast_mgr.c osm_ucast_updn.c \ osm_ucast_lash.c osm_ucast_file.c osm_ucast_ftree.c \ osm_vl15intf.c osm_vl_arb_rcv.c \ - st.c + st.c \ + osm_perfmgr.c osm_event_db.c if OSMV_OPENIB opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 @@ -78,7 +79,7 @@ endif # we always give precedence to local tree libs and then use the pre-installed ones. opensm_LDADD = -L../complib -L../libvendor -L. 
$(OSMV_LDADD) -lopensm -losmcomp -losmvendor -opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread +opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread -ldl opensmincludedir = $(includedir)/infiniband/opensm diff --git a/osm/opensm/configure.in b/osm/opensm/configure.in index ad3333a..9e23719 100644 --- a/osm/opensm/configure.in +++ b/osm/opensm/configure.in @@ -78,6 +78,9 @@ if test $console_socket = yes; then [Define as 1 if you want to enable a console on a socket connection]) fi +dnl select performance manager or not +OPENIB_OSM_PERF_MGR_SEL + dnl Provide user option to select vendor OPENIB_APP_OSMV_SEL diff --git a/osm/opensm/main.c b/osm/opensm/main.c index 153e44d..4fa3563 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -59,6 +59,7 @@ #include #include #include +#include volatile unsigned int osm_exit_flag = 0; @@ -273,6 +274,13 @@ show_usage(void) printf("-I\n" "--inactive\n" " Start SM in inactive rather than normal init SM state.\n\n"); +#ifdef ENABLE_OSM_PERF_MGR + printf( "--pm\n" + " Activate the performance manager.\n\n"); + printf( "--pm_sweep_time_s\n" + " Define the period for PerfMgr sweeps (in seconds) default %ds.\n\n", + OSM_PERFMGR_DEFAULT_SWEEP_TIME_S); +#endif /* ENABLE_OSM_PERF_MGR */ printf( "-v\n" "--verbose\n" " This option increases the log verbosity level.\n" @@ -630,6 +638,8 @@ main( #endif { "daemon", 0, NULL, 'B'}, { "inactive", 0, NULL, 'I'}, + { "pm", 0, NULL, 1}, /* no short options for PM stuff */ + { "pm_sweep_time_s", 1, NULL, 2}, { NULL, 0, NULL, 0 } /* Required at the end of the array */ }; @@ -907,6 +917,15 @@ main( printf(" SM started in inactive state\n"); break; +#ifdef ENABLE_OSM_PERF_MGR + case 1: + opt.perfmgr = TRUE; + break; + case 2: + opt.perfmgr_sweep_time_s = atoi(optarg); + break; +#endif /* ENABLE_OSM_PERF_MGR */ + case 'h': case '?': case ':': diff --git a/osm/opensm/osm_console.c b/osm/opensm/osm_console.c index 38b978a..d6c30d8 100644 --- a/osm/opensm/osm_console.c +++ b/osm/opensm/osm_console.c 
@@ -52,6 +52,7 @@ #include #include #include +#include struct command { char *name; @@ -136,6 +137,20 @@ static void help_logflush(FILE *out, int fprintf(out, "logflush -- flush the osm.log file\n"); } +#ifdef ENABLE_OSM_PERF_MGR +static void help_perfmgr(FILE *out, int detail) +{ + fprintf(out, "perfmgr [enable|disable|clear_counters|dump_counters|sweep_time][seconds]\n"); + if (detail) { + fprintf(out, "perfmgr -- print the performance manager state\n"); + fprintf(out, " [enable|disable] -- change the perfmgr state\n"); + fprintf(out, " [sweep_time] -- change the perfmgr sweep time (requires [seconds] option)\n"); + fprintf(out, " [clear_counters] -- clear the counters stored\n"); + fprintf(out, " [dump_counters [mach]] -- dump the counters\n"); + } +} +#endif /* ENABLE_OSM_PERF_MGR */ + /* more help routines go here */ static void help_parse(char **p_last, osm_opensm_t *p_osm, FILE *out) @@ -427,6 +442,66 @@ static void logflush_parse(char **p_last fflush(p_osm->log.out_port); } +#ifdef ENABLE_OSM_PERF_MGR +static void perfmgr_parse(char **p_last, osm_opensm_t *p_osm, FILE *out) +{ + char *p_cmd; + + p_cmd = next_token(p_last); + if (p_cmd) + { + if (strcmp(p_cmd, "enable") == 0) + { + osm_perfmgr_set_state(&(p_osm->perfmgr), PERFMGR_STATE_ENABLED); + } + else if (strcmp(p_cmd, "disable") == 0) + { + osm_perfmgr_set_state(&(p_osm->perfmgr), PERFMGR_STATE_DISABLE); + } + else if (strcmp(p_cmd, "clear_counters") == 0) + { + osm_perfmgr_clear_counters(&(p_osm->perfmgr)); + } + else if (strcmp(p_cmd, "dump_counters") == 0) + { + p_cmd = next_token(p_last); + if (p_cmd && (strcmp(p_cmd, "mach") == 0)) { + osm_perfmgr_dump_counters(&(p_osm->perfmgr), + OSM_EVENT_DB_DUMP_MR); + } else { + osm_perfmgr_dump_counters(&(p_osm->perfmgr), + OSM_EVENT_DB_DUMP_HR); + } + } + else if (strcmp(p_cmd, "sweep_time") == 0) + { + p_cmd = next_token(p_last); + if (p_cmd) + { + uint16_t time_s = atoi(p_cmd); + osm_perfmgr_set_sweep_time_s(&(p_osm->perfmgr), time_s); + } + else + { + 
fprintf(out, "sweep_time requires a time specified\n"); + } + } + else + { + fprintf(out, "\"%s\" option not found\n", p_cmd); + } + } else { + fprintf(out, "Performance Manager status:\n" + "state : %s\n" + "sweep time : %us\n" + , + osm_perfmgr_get_state_str(&(p_osm->perfmgr)), + osm_perfmgr_get_sweep_time_s(&(p_osm->perfmgr)) + ); + } +} +#endif /* ENABLE_OSM_PERF_MGR */ + /* This is public to be able to close it on exit */ void osm_console_close_socket(osm_opensm_t *p_osm) { @@ -456,6 +531,9 @@ static const struct command console_cmds { "resweep", &help_resweep, &resweep_parse}, { "status", &help_status, &status_parse}, { "logflush", &help_logflush, &logflush_parse}, +#ifdef ENABLE_OSM_PERF_MGR + { "perfmgr", &help_perfmgr, &perfmgr_parse}, +#endif /* ENABLE_OSM_PERF_MGR */ { NULL, NULL, NULL} /* end of array */ }; diff --git a/osm/opensm/osm_event_db.c b/osm/opensm/osm_event_db.c new file mode 100644 index 0000000..90ca8da --- /dev/null +++ b/osm/opensm/osm_event_db.c @@ -0,0 +1,172 @@ +/* + * Copyright (c) 2007 The Regents of the University of California. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include + +#include + +/** ========================================================================= + */ +osm_event_db_t * +osm_event_db_construct(osm_log_t *p_log, char *type) +{ + char lib_name[PATH_MAX]; + osm_event_db_t *rc = NULL; + + if (!type) + return (NULL); + + /* find the plugin */ + snprintf(lib_name, PATH_MAX, "lib%s.so", type); + + rc = malloc(sizeof(*rc)); + if (!rc) + return (NULL); + + rc->handle = dlopen(lib_name, RTLD_LAZY); + if (!rc->handle) + { + osm_log(p_log, OSM_LOG_ERROR, + "Failed to open PM Database \"%s\" : \"%s\"\n", + lib_name, dlerror()); + goto DLOPENFAIL; + } + + rc->db_impl = (__osm_event_db_t *)dlsym(rc->handle, "__osm_event_db"); + if (!rc->db_impl) + { + osm_log(p_log, OSM_LOG_ERROR, + "Failed to find __osm_event_db symbol in \"%s\" : \"%s\"\n", + lib_name, dlerror()); + goto Exit; + } + + /* Check the version to make sure this module will work with us */ + if (rc->db_impl->interface_version != OSM_EVENT_DB_INTERFACE_VER) + { + osm_log(p_log, OSM_LOG_ERROR, + "__osm_event_db symbol is the wrong version %d != %d\n", + rc->db_impl->interface_version, + OSM_EVENT_DB_INTERFACE_VER); + goto Exit; + } + + rc->db_data = rc->db_impl->construct(p_log); + + if (!rc->db_data) + goto Exit; + + rc->p_log = p_log; + return (rc); + +Exit: + dlclose(rc->handle); +DLOPENFAIL: + free(rc); + return (NULL); +} + +/** 
========================================================================= + */ +void +osm_event_db_destroy(osm_event_db_t *db) +{ + if (db) + { + db->db_impl->destroy(db->db_data); + free(db); + } +} + +/** ========================================================================= + */ +osm_event_db_err_t +osm_event_db_create_entry(osm_event_db_t *db, uint64_t guid, uint8_t num_ports) +{ + return(db->db_impl->create_entry(db->db_data, guid, num_ports)); +} + +/********************************************************************** + **********************************************************************/ +osm_event_db_err_t osm_event_db_get_prev_pc(osm_event_db_t *db, uint64_t guid, + uint8_t port, osm_pc_reading_t *reading) +{ + return (db->db_impl->get_prev_pc(db->db_data, guid, port, reading)); +} + +/********************************************************************** + * dump the data to the file "file" + **********************************************************************/ +osm_event_db_err_t +osm_event_db_dump(osm_event_db_t *db, char *file, osm_event_db_dump_t dump_type) +{ + return (db->db_impl->dump(db->db_data, file, dump_type)); +} + +/********************************************************************** + * Clear the port counters from the db + **********************************************************************/ +void osm_event_db_clear_port_counters(osm_event_db_t *db) +{ + db->db_impl->clear_port_counters(db->db_data); +} + +/********************************************************************** + * Add the reading to the osm_pm_node_t + **********************************************************************/ +osm_event_db_err_t +osm_event_db_add_pc_reading(osm_event_db_t *db, uint64_t guid, + uint8_t port, ib_port_counters_t *reading) +{ + return (db->db_impl->add_pc_reading(db->db_data, guid, + port, reading)); +} + +/********************************************************************** + * Clear the previous port counter reading for the given port + 
**********************************************************************/ +osm_event_db_err_t +osm_event_db_clear_prev_pc(osm_event_db_t *db, uint64_t guid, uint8_t port) +{ + return (db->db_impl->clear_prev_pc(db->db_data, guid, port)); +} + diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 8430605..fa572c5 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -172,6 +172,9 @@ osm_opensm_destroy( p_osm->routing_engine.delete(p_osm->routing_engine.context); osm_sa_destroy( &p_osm->sa ); osm_sm_destroy( &p_osm->sm ); +#ifdef ENABLE_OSM_PERF_MGR + osm_perfmgr_destroy( &p_osm->perfmgr ); +#endif /* ENABLE_OSM_PERF_MGR */ osm_db_destroy( &p_osm->db ); osm_vl15_destroy( &p_osm->vl15, &p_osm->mad_pool ); osm_mad_pool_destroy( &p_osm->mad_pool ); @@ -286,6 +289,21 @@ osm_opensm_init( if( status != IB_SUCCESS ) goto Exit; +#ifdef ENABLE_OSM_PERF_MGR + status = osm_perfmgr_init( &p_osm->perfmgr, + &p_osm->subn, + &p_osm->sm, + &p_osm->log, + &p_osm->mad_pool, + p_osm->p_vendor, + &p_osm->disp, + &p_osm->lock, + p_opt); + + if( status != IB_SUCCESS ) + goto Exit; +#endif /* ENABLE_OSM_PERF_MGR */ + if( p_opt->routing_engine_name && setup_routing_engine(p_osm, p_opt->routing_engine_name)) { osm_log( &p_osm->log, OSM_LOG_VERBOSE, @@ -319,6 +337,12 @@ osm_opensm_bind( if( status != IB_SUCCESS ) goto Exit; +#ifdef ENABLE_OSM_PERF_MGR + status = osm_perfmgr_bind( &p_osm->perfmgr, guid ); + if( status != IB_SUCCESS ) + goto Exit; +#endif /* ENABLE_OSM_PERF_MGR */ + Exit: OSM_LOG_EXIT( &p_osm->log ); return ( status ); diff --git a/osm/opensm/osm_perfmgr.c b/osm/opensm/osm_perfmgr.c new file mode 100644 index 0000000..297a0e2 --- /dev/null +++ b/osm/opensm/osm_perfmgr.c @@ -0,0 +1,686 @@ +/* + * Copyright (c) 2007 The Regents of the University of California. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + + +/* + * Abstract: + * Implementation of osm_perfmgr_t. + * + * Author: + * Ira Weiny, LLNL + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#ifdef ENABLE_OSM_PERF_MGR + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define OSM_PERFMGR_INITIAL_TID_VALUE 0xcafe + +/********************************************************************** + * Receive the MAD from the vendor layer and post it for processing by the + * dispatcher. 
+ **********************************************************************/ +static void +osm_perfmgr_mad_recv_callback(osm_madw_t *p_madw, void* bind_context, + osm_madw_t *p_req_madw ) +{ + osm_perfmgr_t *pm = (osm_perfmgr_t *)bind_context; + cl_status_t cl_status = CL_SUCCESS; + + OSM_LOG_ENTER( pm->log, osm_pm_mad_recv_callback ); + + osm_madw_copy_context( p_madw, p_req_madw ); + osm_mad_pool_put( pm->mad_pool, p_req_madw ); + + /* post this message for later processing. */ + cl_status = cl_disp_post(pm->pc_disp_h, OSM_MSG_MAD_PORT_COUNTERS, + (void *)p_madw, NULL, NULL); +#if 0 + do { + struct timeval rcv_time; + gettimeofday(&rcv_time, NULL); + osm_log(pm->log, OSM_LOG_INFO, + "perfmgr rcv time %ld\n", + rcv_time.tv_usec - + p_madw->context.perfmgr_context.query_start.tv_usec); + } while (0); +#endif + OSM_LOG_EXIT( pm->log ); +} + +/********************************************************************** + * Process errors from the MAD send. + **********************************************************************/ +static void +osm_perfmgr_mad_send_err_callback(void* bind_context, osm_madw_t *p_madw) +{ + osm_perfmgr_t *pm = (osm_perfmgr_t *)bind_context; + osm_madw_context_t *context = &(p_madw->context); + + OSM_LOG_ENTER( pm->log, osm_pm_mad_send_err_callback ); + + osm_log( pm->log, OSM_LOG_ERROR, + "osm_pm_mad_send_err_callback: 0x%" PRIx64 " port %d\n", + context->perfmgr_context.node_guid, + context->perfmgr_context.port); + + osm_mad_pool_put( pm->mad_pool, p_madw ); + + OSM_LOG_EXIT( pm->log ); +} + +/********************************************************************** + * Bind the PM to the vendor layer for MAD sends/receives + **********************************************************************/ +ib_api_status_t +osm_perfmgr_bind(osm_perfmgr_t * const pm, const ib_net64_t port_guid) +{ + osm_bind_info_t bind_info; + ib_api_status_t status = IB_SUCCESS; + + OSM_LOG_ENTER( pm->log, osm_pm_bind ); + + if( pm->bind_handle != OSM_BIND_INVALID_HANDLE ) 
{ + osm_log( pm->log, OSM_LOG_ERROR, + "osm_pm_mad_ctrl_bind: Multiple binds not allowed\n" ); + status = IB_ERROR; + goto Exit; + } + + bind_info.port_guid = port_guid; + bind_info.mad_class = IB_MCLASS_PERF; + bind_info.class_version = 1; + bind_info.is_responder = FALSE; + bind_info.is_report_processor = FALSE; + bind_info.is_trap_processor = FALSE; + bind_info.recv_q_size = OSM_PM_DEFAULT_QP1_RCV_SIZE; + bind_info.send_q_size = OSM_PM_DEFAULT_QP1_SEND_SIZE; + + osm_log( pm->log, OSM_LOG_VERBOSE, + "osm_pm_mad_bind: " + "Binding to port GUID 0x%" PRIx64 "\n", + cl_ntoh64( port_guid ) ); + + pm->bind_handle = osm_vendor_bind( pm->vendor, + &bind_info, + pm->mad_pool, + osm_perfmgr_mad_recv_callback, + osm_perfmgr_mad_send_err_callback, + pm ); + + if( pm->bind_handle == OSM_BIND_INVALID_HANDLE ) { + status = IB_ERROR; + osm_log( pm->log, OSM_LOG_ERROR, + "osm_pm_mad_bind: Vendor specific bind failed (%s)\n", + ib_get_err_str(status) ); + goto Exit; + } + +Exit: + OSM_LOG_EXIT( pm->log ); + return( status ); +} + +/********************************************************************** + * Unbind the PM to the vendor layer for MAD sends/receives + **********************************************************************/ +void +osm_perfmgr_mad_unbind(osm_perfmgr_t * const pm) +{ + OSM_LOG_ENTER( pm->log, osm_sa_mad_ctrl_unbind ); + if( pm->bind_handle == OSM_BIND_INVALID_HANDLE ) { + osm_log( pm->log, OSM_LOG_ERROR, + "osm_pm_mad_unbind: No previous bind\n" ); + goto Exit; + } + osm_vendor_unbind( pm->bind_handle ); +Exit: + OSM_LOG_EXIT( pm->log ); +} + +/********************************************************************** + * Given a node and a port return the appropriate lid to query that port + **********************************************************************/ +static ib_net16_t +get_lid(osm_node_t *p_node, uint8_t port) +{ + ib_net16_t lid = 0; + + switch (p_node->node_info.node_type) + { + case IB_NODE_TYPE_CA: + case IB_NODE_TYPE_ROUTER: + lid = 
osm_node_get_base_lid(p_node, port); + break; + case IB_NODE_TYPE_SWITCH: + lid = osm_node_get_base_lid(p_node, 0); + break; + default: + break; + } + return (lid); +} + +/********************************************************************** + * Form the Port Counter MAD and send the MAD for a single port. + **********************************************************************/ +static ib_api_status_t +osm_perfmgr_send_pc_mad(osm_perfmgr_t *perfmgr, ib_net16_t dest_lid, uint8_t port, + uint8_t mad_method, osm_madw_context_t* const p_context ) +{ + ib_api_status_t status = IB_SUCCESS; + ib_port_counters_t *port_counter = NULL; + ib_perfmgr_mad_t *pm_mad = NULL; + osm_madw_t *p_madw = NULL; + + OSM_LOG_ENTER(perfmgr->log, osm_perfmgr_send_pc_mad); + + p_madw = osm_mad_pool_get(perfmgr->mad_pool, perfmgr->bind_handle, MAD_BLOCK_SIZE, NULL); + if (p_madw == NULL) + return (IB_INSUFFICIENT_MEMORY); + + pm_mad = osm_madw_get_perfmgr_mad_ptr(p_madw); + + /* build the mad */ + pm_mad->header.base_ver = 1; + pm_mad->header.mgmt_class = IB_MCLASS_PERF; + pm_mad->header.class_ver = 1; + pm_mad->header.method = mad_method; + pm_mad->header.status = 0; + pm_mad->header.class_spec = 0; + pm_mad->header.trans_id = cl_hton64((uint64_t)cl_atomic_inc(&(perfmgr->trans_id))); + pm_mad->header.attr_id = IB_MAD_ATTR_PORT_CNTRS; + pm_mad->header.resv = 0; + pm_mad->header.attr_mod = 0; + + port_counter = (ib_port_counters_t *)&(pm_mad->data); + memset(port_counter, 0, sizeof(*port_counter)); + port_counter->port_select = port; + port_counter->counter_select = 0xFFFF; + + p_madw->mad_addr.dest_lid = dest_lid; + p_madw->mad_addr.addr_type.gsi.remote_qp = cl_hton32(1); + p_madw->mad_addr.addr_type.gsi.remote_qkey = cl_hton32(IB_QP1_WELL_KNOWN_Q_KEY); + /* FIXME what about other partitions */ + p_madw->mad_addr.addr_type.gsi.pkey = cl_hton16(0xFFFF); + p_madw->mad_addr.addr_type.gsi.service_level = 0; + p_madw->mad_addr.addr_type.gsi.global_route = FALSE; + p_madw->resp_expected = TRUE; + 
+ if( p_context ) + p_madw->context = *p_context; + + status = osm_vendor_send(perfmgr->bind_handle, p_madw, TRUE); + + OSM_LOG_EXIT(perfmgr->log); + return( status ); +} + +/********************************************************************** + * query the Port Counters of all the nodes in the subnet. + **********************************************************************/ +static void +__osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context ) +{ + ib_api_status_t status = IB_SUCCESS; + uint8_t port = 0; + osm_perfmgr_t *pm = (osm_perfmgr_t *)context; + osm_node_t *p_node = (osm_node_t *)p_map_item; + uint8_t node_desc[IB_NODE_DESCRIPTION_SIZE]; + osm_madw_context_t mad_context; + uint8_t num_ports = 0; + uint64_t node_guid = 0; + + OSM_LOG_ENTER( pm->log, __osm_pm_query_counters ); + + memcpy(node_desc, p_node->node_desc.description, + IB_NODE_DESCRIPTION_SIZE); + node_desc[IB_NODE_DESCRIPTION_SIZE-1] = '\0'; + + num_ports = osm_node_get_num_physp(p_node); + node_guid = cl_ntoh64(p_node->node_info.node_guid); + + /* make sure we have a database object ready to store this information */ + if (osm_event_db_create_entry(pm->db, node_guid, num_ports) != + OSM_EVENT_DB_SUCCESS) + { + osm_log(pm->log, OSM_LOG_ERROR, + "PerfMgr DB create entry failed for 0x%" PRIx64 " : %s\n", + node_guid, strerror(errno)); + goto Exit; + } + + /* issue the queries for each port */ + for (port = 1; port < num_ports; port++) + { + ib_net16_t lid = get_lid(p_node, port); + if (lid == 0) + { + osm_log(pm->log, OSM_LOG_DEBUG, + "WARN: node 0x%" PRIx64 " port %d (%s): port out of range, skipping\n", + cl_ntoh64(p_node->node_info.node_guid), port, node_desc); + continue; + } + + mad_context.perfmgr_context.node_guid = node_guid; + mad_context.perfmgr_context.port = port; + mad_context.perfmgr_context.num_ports = num_ports; + mad_context.perfmgr_context.mad_method = IB_MAD_METHOD_GET; +#if 0 + gettimeofday(&(mad_context.perfmgr_context.query_start), NULL); +#endif + 
osm_log(pm->log, OSM_LOG_VERBOSE, + " Getting stats for node 0x%" PRIx64 " port %d (lid %X) (%s)\n", + node_guid, port, cl_ntoh16(lid), node_desc); + status = osm_perfmgr_send_pc_mad(pm, lid, port, IB_MAD_METHOD_GET, &mad_context); + if (status != IB_SUCCESS) + { + osm_log(pm->log, OSM_LOG_ERROR, + "Failed to issue port counter query for node 0x%" PRIx64 " port %d (%s)\n", + node_guid, port, node_desc); + } + } +Exit: + OSM_LOG_EXIT( pm->log ); +} + +/********************************************************************** + * Main PerfMgr Thread. + * Loop continuously and query the performance counters. + **********************************************************************/ +void +__osm_perfmgr_sweeper(void *p_ptr) +{ + ib_api_status_t status; + osm_perfmgr_t *const pm = ( osm_perfmgr_t * ) p_ptr; + + OSM_LOG_ENTER( pm->log, __osm_pm_sweeper ); + + if( pm->thread_state == OSM_THREAD_STATE_INIT ) + pm->thread_state = OSM_THREAD_STATE_RUN; + + while( pm->thread_state == OSM_THREAD_STATE_RUN ) { + /* do the sweep only if we are in MASTER state + * AND we have been activated. + * FIXME put something in here to try and reduce the load on the system + * when it is not IDLE. + if (pm->sm->state_mgr.state != OSM_SM_STATE_IDLE) + */ + if( pm->subn->sm_state == IB_SMINFO_STATE_MASTER + && pm->state == PERFMGR_STATE_ENABLED) { +#if 0 + struct timeval before, after; + gettimeofday(&before, NULL); +#endif + /* for each node query their counters */ + cl_plock_acquire(pm->lock); + osm_log(pm->log, OSM_LOG_VERBOSE, "Gathering PerfMgr stats\n"); + cl_qmap_apply_func(&(pm->subn->node_guid_tbl), + __osm_perfmgr_query_counters, (void *)pm); + cl_plock_release(pm->lock); +#if 0 + gettimeofday(&after, NULL); + osm_log(pm->log, OSM_LOG_INFO, + "total sweep time : %ld us\n", after.tv_usec - before.tv_usec); +#endif + } + + /* Wait for a forced sweep or period timeout. 
*/ + status = cl_event_wait_on( &pm->sig_sweep, + pm->sweep_time_s * 1000000, + TRUE ); + } + + OSM_LOG_EXIT( pm->log ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_perfmgr_shutdown(osm_perfmgr_t * const pm) +{ + OSM_LOG_ENTER( pm->log, osm_perfmgr_shutdown ); + osm_perfmgr_mad_unbind(pm); + OSM_LOG_EXIT( pm->log ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_perfmgr_destroy(osm_perfmgr_t * const pm) +{ + OSM_LOG_ENTER( pm->log, osm_perfmgr_destroy ); + free(pm->event_db_dump_file); + free(pm->event_db_plugin); + osm_event_db_destroy(pm->db); + OSM_LOG_EXIT( pm->log ); +} + +/********************************************************************** + * Return 1 if the value has overflowed + **********************************************************************/ +int counter_overflow_4(uint8_t val) +{ + return (val >= 10); +} +int counter_overflow_8(uint8_t val) +{ + return (val >= (UINT8_MAX - (UINT8_MAX/4))); +} +int counter_overflow_16(uint16_t val) +{ + return (cl_ntoh16(val) >= (UINT16_MAX - (UINT16_MAX/4))); +} +int counter_overflow_32(uint32_t val) +{ + return (cl_ntoh32(val) >= (UINT32_MAX - (UINT32_MAX/4))); +} + +/********************************************************************** + * Check if the port counters have overflowed and if so issue a clear MAD to + * the port. 
+ **********************************************************************/ +static void +osm_perfmgr_check_clear(osm_perfmgr_t *pm, uint64_t node_guid, + uint8_t port, int num_ports, ib_port_counters_t *cr) +{ + osm_madw_context_t mad_context; + + OSM_LOG_ENTER( pm->log, osm_pm_check_clear ); + if (counter_overflow_16(cr->symbol_err_cnt) + || counter_overflow_8(cr->link_err_recover) + || counter_overflow_8(cr->link_downed) + || counter_overflow_16(cr->rcv_err) + || counter_overflow_16(cr->rcv_rem_phys_err) + || counter_overflow_16(cr->rcv_switch_relay_err) + || counter_overflow_16(cr->xmit_discards) + || counter_overflow_8(cr->xmit_constraint_err) + || counter_overflow_8(cr->rcv_constraint_err) + || counter_overflow_4(PC_LINK_INT(cr->link_int_buffer_overrun)) + || counter_overflow_4(PC_BUF_OVERRUN(cr->link_int_buffer_overrun)) + || counter_overflow_16(cr->vl15_dropped) + || counter_overflow_32(cr->xmit_data) + || counter_overflow_32(cr->rcv_data) + || counter_overflow_32(cr->xmit_pkts) + || counter_overflow_32(cr->rcv_pkts) + ) + { + osm_log(pm->log, OSM_LOG_INFO, + "Counter overflow: 0x%" PRIx64 " port %d; clearing counters\n", + node_guid, port); + osm_node_t *p_node = NULL; + ib_net16_t lid = 0; + cl_plock_acquire(pm->lock); + p_node = (osm_node_t *)cl_qmap_get(&(pm->subn->node_guid_tbl), + cl_hton64(node_guid)); + lid = get_lid(p_node, port); + cl_plock_release(pm->lock); + if (lid == 0) + { + osm_log(pm->log, OSM_LOG_INFO, + "Failed to clear counters for node 0x%" PRIx64 " port %d; failed to get lid\n", + node_guid, port); + goto Exit; + } + mad_context.perfmgr_context.node_guid = node_guid; + mad_context.perfmgr_context.port = port; + mad_context.perfmgr_context.num_ports = num_ports; + mad_context.perfmgr_context.mad_method = IB_MAD_METHOD_SET; + /* clear port counter */ + osm_perfmgr_send_pc_mad(pm, lid, port, IB_MAD_METHOD_SET, &mad_context); + } +Exit: + OSM_LOG_EXIT( pm->log ); +} + +/********************************************************************** 
+ * Check values for logging of errors + **********************************************************************/ +static void +osm_perfmgr_log_events(osm_perfmgr_t *pm, uint64_t node_guid, uint8_t port, + ib_port_counters_t *reading) +{ + osm_pc_reading_t prev_read; + ib_port_counters_t *prev; + time_t time_diff = 0; + osm_event_db_err_t err = osm_event_db_get_prev_pc(pm->db, node_guid, port, &prev_read); + if (err != OSM_EVENT_DB_SUCCESS) + { + osm_log(pm->log, OSM_LOG_VERBOSE, + "failed to find previous reading for 0x%" PRIx64 " port %u\n", + node_guid, port); + return; + } + time_diff = (time(NULL) - prev_read.time); + prev = &(prev_read.reading); + + /* FIXME these events should be defineable by the user in a config + * file somewhere. */ + if (reading->symbol_err_cnt > prev->symbol_err_cnt) { + osm_log(pm->log, OSM_LOG_ERROR, + "Found %u Symbol errors in %lu sec on node 0x%" PRIx64 " port %u\n", + (cl_ntoh16(reading->symbol_err_cnt) - cl_ntoh16(prev->symbol_err_cnt)), + time_diff, + node_guid, + port); + } + if (reading->rcv_err > prev->rcv_err) { + osm_log(pm->log, OSM_LOG_ERROR, + "Found %u Recieve errors in %lu sec on node 0x%" PRIx64 " port %u\n", + (cl_ntoh16(reading->rcv_err) - cl_ntoh16(prev->rcv_err)), + time_diff, + node_guid, + port); + } + if (reading->xmit_discards > prev->xmit_discards) { + osm_log(pm->log, OSM_LOG_ERROR, + "Found %u XMIT Discards in %lu sec on node 0x%" PRIx64 " port %u\n", + (cl_ntoh16(reading->xmit_discards) - cl_ntoh16(prev->xmit_discards)), + time_diff, + node_guid, + port); + } +} + + +/********************************************************************** + * The dispatcher uses a thread pool which will call this function when we have + * a thread available to process our mad recieved from the wire. 
+ **********************************************************************/ +static void +osm_pc_rcv_process(void *context, void *data) +{ + osm_perfmgr_t *const pm = (osm_perfmgr_t *)context; + osm_madw_t *p_madw = (osm_madw_t *)data; + osm_madw_context_t *mad_context = &(p_madw->context); + ib_port_counters_t *counter_reading = + (ib_port_counters_t *)&(osm_madw_get_perfmgr_mad_ptr(p_madw)->data); + uint64_t node_guid = mad_context->perfmgr_context.node_guid; + uint8_t port_num = mad_context->perfmgr_context.port; + int num_ports = mad_context->perfmgr_context.num_ports; + + OSM_LOG_ENTER( pm->log, osm_pc_rcv_process ); + + osm_log(pm->log, OSM_LOG_VERBOSE, + "Processing recieved MAD context 0x%" PRIx64 " port %u/%d\n", + node_guid, port_num, num_ports); + + /* log any critical events from this reading */ + osm_perfmgr_log_events(pm, node_guid, port_num, counter_reading); + + if (mad_context->perfmgr_context.mad_method == IB_MAD_METHOD_GET) + osm_event_db_add_pc_reading(pm->db, node_guid, port_num, counter_reading); + else + osm_event_db_clear_prev_pc(pm->db, node_guid, port_num); + osm_perfmgr_check_clear(pm, node_guid, port_num, num_ports, counter_reading); + +#if 0 + do { + struct timeval proc_time; + gettimeofday(&proc_time, NULL); + osm_log(pm->log, OSM_LOG_INFO, + "perfmgr done processing time %ld\n", + proc_time.tv_usec - + p_madw->context.perfmgr_context.query_start.tv_usec); + } while (0); +#endif + + osm_mad_pool_put( pm->mad_pool, p_madw ); + + OSM_LOG_EXIT( pm->log ); +} + +/********************************************************************** + * Initialize the PERFMGR object + **********************************************************************/ +ib_api_status_t +osm_perfmgr_init( + osm_perfmgr_t * const pm, + osm_subn_t * const subn, + osm_sm_t * const sm, + osm_log_t * const log, + osm_mad_pool_t * const mad_pool, + osm_vendor_t * const vendor, + cl_dispatcher_t* const disp, + cl_plock_t* const lock, + const osm_subn_opt_t * const p_opt ) +{ + 
ib_api_status_t status = IB_SUCCESS; + + OSM_LOG_ENTER( log, osm_pm_init ); + + osm_log(log, OSM_LOG_VERBOSE, "initing PM\n"); + + memset( pm, 0, sizeof( *pm ) ); + + cl_event_construct(&pm->sig_sweep); + cl_event_init(&pm->sig_sweep, FALSE); + pm->subn = subn; + pm->sm = sm; + pm->log = log; + pm->mad_pool = mad_pool; + pm->vendor = vendor; + pm->trans_id = OSM_PERFMGR_INITIAL_TID_VALUE; + pm->lock = lock; + pm->state = p_opt->perfmgr ? PERFMGR_STATE_ENABLED : PERFMGR_STATE_DISABLE; + pm->sweep_time_s = p_opt->perfmgr_sweep_time_s; + pm->event_db_dump_file = strdup(p_opt->event_db_dump_file); + pm->event_db_plugin = strdup(p_opt->event_db_plugin); + + pm->db = osm_event_db_construct(pm->log, pm->event_db_plugin); + if (!pm->db) + { + pm->state = PERFMGR_STATE_NO_DB; + goto Exit; + } + + pm->pc_disp_h = cl_disp_register(disp, OSM_MSG_MAD_PORT_COUNTERS, + osm_pc_rcv_process, pm); + if( pm->pc_disp_h == CL_DISP_INVALID_HANDLE ) + goto Exit; + + pm->thread_state = OSM_THREAD_STATE_INIT; + status = cl_thread_init( &pm->sweeper, __osm_perfmgr_sweeper, pm, + "PerfMgr sweeper" ); + if( status != IB_SUCCESS ) + goto Exit; + +Exit: + OSM_LOG_EXIT( log ); + return ( status ); +} + +/********************************************************************** + * Clear the counters from the db + **********************************************************************/ +void +osm_perfmgr_clear_counters(osm_perfmgr_t *pm) +{ + /** + * FIXME todo issue clear on the fabric? + */ + osm_event_db_clear_port_counters(pm->db); + osm_log( pm->log, OSM_LOG_INFO, "PerfMgr counters cleared\n"); +} + +/******************************************************************* + * Have the DB dump it's information to the file specified. 
+ *******************************************************************/ +void +osm_perfmgr_dump_counters(osm_perfmgr_t *pm, osm_event_db_dump_t dump_type) +{ + if (osm_event_db_dump(pm->db, pm->event_db_dump_file, dump_type) != 0) + { + osm_log( pm->log, OSM_LOG_ERROR, + "PB dump port counters: Failed to file %s : %s", + pm->event_db_dump_file, strerror(errno)); + } +} + +#if 0 +/******************************************************************* + * Use this later to track events on the fabric + **********************************************************************/ +ib_api_status_t +osm_report_notice_to_perfmgr(osm_log_t* const log, osm_subn_t* subn, + ib_mad_notice_attr_t *p_ntc ) +{ + OSM_LOG_ENTER( log, osm_report_trap_to_pm ); + if ((p_ntc->generic_type & 0x80) + && (cl_ntoh16(p_ntc->g_or_v.generic.trap_num) == 128)) { + osm_log( log, OSM_LOG_INFO, "PerfMgr notified of trap 128\n"); + } + OSM_LOG_EXIT( log ); + return (IB_SUCCESS); +} +#endif + +#endif /* ENABLE_OSM_PERF_MGR */ + diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index c8c3ddc..77c19a5 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -66,6 +66,7 @@ #include #include #include +#include #if defined(PATH_MAX) #define OSM_PATH_MAX (PATH_MAX + 1) @@ -471,6 +472,12 @@ osm_subn_set_default_opt( p_opt->honor_guid2lid_file = FALSE; p_opt->daemon = FALSE; p_opt->sm_inactive = FALSE; +#ifdef ENABLE_OSM_PERF_MGR + p_opt->perfmgr = FALSE; + p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S; + p_opt->event_db_dump_file = OSM_PERFMGR_DEFAULT_DUMP_FILE; + p_opt->event_db_plugin = OSM_DEFAULT_EVENT_PLUGIN; +#endif /* ENABLE_OSM_PERF_MGR */ p_opt->dump_files_dir = getenv("OSM_TMP_DIR"); if (!p_opt->dump_files_dir || !(*p_opt->dump_files_dir)) @@ -1076,6 +1083,24 @@ osm_subn_parse_conf_file( "sm_inactive", p_key, p_val, &p_opts->sm_inactive); +#ifdef ENABLE_OSM_PERF_MGR + __osm_subn_opts_unpack_boolean( + "perfmgr", + p_key, p_val, &p_opts->perfmgr); + + 
__osm_subn_opts_unpack_uint16( + "perfmgr_sweep_time_s", + p_key, p_val, &p_opts->perfmgr_sweep_time_s); + + __osm_subn_opts_unpack_charp( + "event_db_dump_file", + p_key, p_val, &p_opts->event_db_dump_file); + + __osm_subn_opts_unpack_charp( + "event_db_plugin", + p_key, p_val, &p_opts->event_db_plugin); +#endif /* ENABLE_OSM_PERF_MGR */ + subn_parse_qos_options("qos", p_key, p_val, &p_opts->qos_options); @@ -1321,6 +1346,32 @@ osm_subn_write_conf_file( p_opts->sm_inactive ? "TRUE" : "FALSE" ); +#ifdef ENABLE_OSM_PERF_MGR + fprintf( + opts_file, + "#\n# Performance Manager Options\n#\n" + "# perfmgr enable\n" + "perfmgr %s\n\n" + "# sweep time in seconds\n" + "perfmgr_sweep_time_s %d\n\n" + , + p_opts->perfmgr ? "TRUE" : "FALSE", + p_opts->perfmgr_sweep_time_s + ); + + fprintf( + opts_file, + "#\n# Event DB Options\n#\n" + "# Dump file to dump the events to\n" + "event_db_dump_file %s\n\n" + "# Event db plugin\n" + "event_db_plugin %s\n\n" + , + p_opts->event_db_dump_file, + p_opts->event_db_plugin + ); +#endif /* ENABLE_OSM_PERF_MGR */ + fprintf( opts_file, "#\n# DEBUG FEATURES\n#\n" diff --git a/osm/opensm/osm_trap_rcv.c b/osm/opensm/osm_trap_rcv.c index 0858968..19be781 100644 --- a/osm/opensm/osm_trap_rcv.c +++ b/osm/opensm/osm_trap_rcv.c @@ -698,6 +698,21 @@ __osm_trap_rcv_process_request( goto Exit; } +#ifdef ENABLE_OSM_PERF_MGR +#if 0 + /* we still need to work out how this will work */ + status = osm_report_notice_to_perfmgr(p_rcv->p_log, p_rcv->p_subn, p_ntci); + if( status != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_trap_rcv_process_request: ERR 3803: " + "Error sending trap reports (%s)\n", + ib_get_err_str( status ) ); + goto Exit; + } +#endif +#endif /* ENABLE_OSM_PERF_MGR */ + Exit: OSM_LOG_EXIT( p_rcv->p_log ); } -- 1.4.4 From arthur.jones at qlogic.com Tue May 8 19:19:04 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 8 May 2007 19:19:04 -0700 Subject: [ofa-general] [PATCH] IB/ipath -- shadow the gpio_mask 
register In-Reply-To: References: <20070508202557.27647.47035.stgit@bauxite.internal.keyresearch.com> Message-ID: <20070509021904.GA16964@bauxite.pathscale.com> hi roland, ... On Tue, May 08, 2007 at 05:58:51PM -0700, Roland Dreier wrote: > > GPIO interrupts which have the gpio_mask bits set are > > no longer unlikely. remove the unlikely annotation in > > the interrupt handler and keep a shadow copy of the > > gpio_mask register. > > A better changelog would be appreciated here... I can see deleting the > unlikely() if it's no longer appropriate, but why keep a shadow copy > of the register? Because this is now a hotter path and you want to > avoid the MMIO read? exactly. shall i add that and resend? arthur From ogerlitz at voltaire.com Tue May 8 22:37:02 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 09 May 2007 08:37:02 +0300 Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <463FCA42.3000104@indiana.edu> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com> <463FCA42.3000104@indiana.edu> Message-ID: <46415DFE.9030807@voltaire.com> Andrew Friedley wrote: > Jeff Squyres wrote: >>>> FWIW, yes, adding RDMA CM support has actually been on my to-do list >>>> for a while, but it keeps getting bumped by higher priority items. >>>> It would be *much* better if some iWARP companies got involved in >>>> Open MPI... > Hmm I'm interested. I've already done some work switching over to RDMA > CM for some research stuff I've been doing; it's not publicly accessible > w/o the 3rd party agreement. I can help answer questions on what > exactly needs to change, and do some testing. Doing a bit of zoom out from the "how to make ofed's udapl work for ompi" thread, my thinking is that the ompi udapl btl enablement is actually only the first step, where for production/longterm/etc you want to have an rdmacm btl. 
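The reasoning rests on several arguments; the quickest ones I can make are: A) it seems that ompi would want to use not only RC but also UD multicast and unicast, which are not covered by udapl B) there is no real justification for maintaining two APIs (namely udapl vs libibverbs/librdmacm), so down the road only one of them will survive (udapl is implemented ***over*** libibverbs/librdmacm, so if the latter dies, so does udapl). Also, I hear here and there that the OFED stack is on its way to being deployed all over the place, specifically in commercial Unix OSs (which want modern code that supports IPoIB-CM, RDS, SRP, iSER, etc., you name it), so eventually the rdmacm btl could also be used on Solaris et al. Or. From yosefe at voltaire.com Tue May 8 23:59:26 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 09 May 2007 09:59:26 +0300 Subject: [ofa-general] Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries In-Reply-To: <20070508203318.GH10845@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com> <20070508150907.GU21591@mellanox.co.il> <46409504.9000802@voltaire.com> <20070508152650.GA5845@mellanox.co.il> <4640A911.8000609@voltaire.com> <20070508203318.GH10845@mellanox.co.il> Message-ID: <4641714E.6050806@voltaire.com> Michael S. Tsirkin wrote: > > > Have you read the boring list of rules? > http://git.openfabrics.org/~mst/boring.txt > > Thanks for the pointer. core: uncached "find gid" and "find pkey" queries * Add ib_find_gid and ib_find_pkey over uncached device queries. The calls might block but the returns are always up-to-date. * Cache pkey,gid table lengths in core to avoid port info queries. Signed-off-by: Yosef Etigin --- drivers/infiniband/core/device.c | 139 +++++++++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 25 +++++++ 2 files changed, 164 insertions(+) Index: b/drivers/infiniband/core/device.c =================================================================== --- a/drivers/infiniband/core/device.c 2007-05-08 15:46:36.000000000 +0300 +++ b/drivers/infiniband/core/device.c 2007-05-09 09:54:54.242486631 +0300 @@ -149,6 +149,18 @@ static int alloc_name(char *name) return 0; } +static inline int start_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; +} + + +static inline int end_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? + 0 : device->phys_port_cnt; +} + /** * ib_alloc_device - allocate an IB device struct * @size:size of structure to allocate @@ -208,6 +220,56 @@ static int add_client_context(struct ib_ return 0; } +/* read the lengths of pkey,gid tables on each port */ +static inline int read_port_table_lengths(struct ib_device *device) +{ + struct ib_port_attr *tprops = NULL; + int num_ports, ret = -ENOMEM; + u8 port_index; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + goto out; + + num_ports = end_port(device) - start_port(device) + 1; + + device->pkey_tbl_len = kmalloc(num_ports * + sizeof *device->pkey_tbl_len, GFP_KERNEL); + if (!device->pkey_tbl_len) + goto out; + + device->gid_tbl_len = kmalloc(num_ports * + sizeof *device->gid_tbl_len, GFP_KERNEL); + if (!device->gid_tbl_len) + goto err1; + + for (port_index = 0; port_index < num_ports; ++port_index) { + ret = ib_query_port(device, port_index + start_port(device), + tprops); + if (ret) + goto err2; + + device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len; + device->gid_tbl_len[port_index] = tprops->gid_tbl_len; + } + + ret = 0; + goto out; +err2: + kfree(device->gid_tbl_len); +err1: + kfree(device->pkey_tbl_len); +out: + kfree(tprops); + return ret; +} + +static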
Signed-off-by: Yosef Etigin --- drivers/infiniband/core/device.c | 139 +++++++++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 25 +++++++ 2 files changed, 164 insertions(+) Index: b/drivers/infiniband/core/device.c =================================================================== --- a/drivers/infiniband/core/device.c 2007-05-08 15:46:36.000000000 +0300 +++ b/drivers/infiniband/core/device.c 2007-05-09 09:54:54.242486631 +0300 @@ -149,6 +149,18 @@ static int alloc_name(char *name) return 0; } +static inline int start_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; +} + + +static inline int end_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? + 0 : device->phys_port_cnt; +} + /** * ib_alloc_device - allocate an IB device struct * @size:size of structure to allocate @@ -208,6 +220,56 @@ static int add_client_context(struct ib_ return 0; } +/* read the lengths of pkey,gid tables on each port */ +static inline int read_port_table_lengths(struct ib_device *device) +{ + struct ib_port_attr *tprops = NULL; + int num_ports, ret = -ENOMEM; + u8 port_index; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + goto out; + + num_ports = end_port(device) - start_port(device) + 1; + + device->pkey_tbl_len = kmalloc(num_ports * + sizeof *device->pkey_tbl_len, GFP_KERNEL); + if (!device->pkey_tbl_len) + goto out; + + device->gid_tbl_len = kmalloc(num_ports * + sizeof *device->gid_tbl_len, GFP_KERNEL); + if (!device->gid_tbl_len) + goto err1; + + for (port_index = 0; port_index < num_ports; ++port_index) { + ret = ib_query_port(device, port_index + start_port(device), + tprops); + if (ret) + goto err2; + + device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len; + device->gid_tbl_len[port_index] = tprops->gid_tbl_len; + } + + ret = 0; + goto out; +err2: + kfree(device->gid_tbl_len); +err1: + kfree(device->pkey_tbl_len); +out: + kfree(tprops); + return ret; +} + +static 
inline void free_port_table_lengths(struct ib_device *device) +{ + kfree(device->gid_tbl_len); + kfree(device->pkey_tbl_len); +} + /** * ib_register_device - Register an IB device with IB core * @device:Device to register @@ -239,6 +301,13 @@ int ib_register_device(struct ib_device spin_lock_init(&device->event_handler_lock); spin_lock_init(&device->client_data_lock); + ret = read_port_table_lengths(device); + if (ret) { + printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n", + device->name); + goto out; + } + ret = ib_device_register_sysfs(device); if (ret) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", @@ -284,6 +353,8 @@ void ib_unregister_device(struct ib_devi list_del(&device->core_list); + free_port_table_lengths(device); + mutex_unlock(&device_mutex); spin_lock_irqsave(&device->client_data_lock, flags); @@ -592,6 +663,74 @@ int ib_modify_port(struct ib_device *dev } EXPORT_SYMBOL(ib_modify_port); +/** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index) +{ + union ib_gid tmp_gid; + int ret, port, i, tbl_len; + + for (port = start_port(device); port <= end_port(device); ++port) { + tbl_len = device->gid_tbl_len[port - start_port(device)]; + for (i = 0; i < tbl_len; ++i) { + ret = ib_query_gid(device, port, i, &tmp_gid); + if (ret) + goto out; + if (!memcmp(&tmp_gid, gid, sizeof *gid)) { + *port_num = port; + *index = i; + ret = 0; + goto out; + } + } + } + ret = -ENOENT; +out: + return ret; +} +EXPORT_SYMBOL(ib_find_gid); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. 
+ * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. + */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index) +{ + int ret, i, tbl_len; + u16 tmp_pkey; + + tbl_len = device->pkey_tbl_len[port_num - start_port(device)]; + for (i = 0; i < tbl_len; ++i) { + ret = ib_query_pkey(device, port_num, i, &tmp_pkey); + if (ret) + goto out; + + if (pkey == tmp_pkey) { + *index = i; + ret = 0; + goto out; + } + } + ret = -ENOENT; + +out: + return ret; +} +EXPORT_SYMBOL(ib_find_pkey); + static int __init ib_core_init(void) { int ret; Index: b/include/rdma/ib_verbs.h =================================================================== --- a/include/rdma/ib_verbs.h 2007-05-08 15:45:45.000000000 +0300 +++ b/include/rdma/ib_verbs.h 2007-05-08 18:48:23.000000000 +0300 @@ -1058,6 +1058,8 @@ struct ib_device { __be64 node_guid; u8 node_type; u8 phys_port_cnt; + int *pkey_tbl_len; + int *gid_tbl_len; }; struct ib_client { @@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev struct ib_port_modify *port_modify); /** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. + * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. 
+ */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index); + +/** * ib_alloc_pd - Allocates an unused protection domain. * @device: The device on which to allocate the protection domain. * From yosefe at voltaire.com Wed May 9 00:00:05 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 09 May 2007 10:00:05 +0300 Subject: [ofa-general] Re: [PATCHv3 1/2] ipoib: handle pkey change events In-Reply-To: <20070508202836.GG10845@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070508202836.GG10845@mellanox.co.il> Message-ID: <46417175.1060505@voltaire.com> Michael S. Tsirkin wrote: >>Quoting Yosef Etigin : >>Subject: [PATCHv3 1/2] ipoib: handle pkey change events > > > This should have been 1 of 2, is that right? Yes, it should have been 2/2. From mst at dev.mellanox.co.il Wed May 9 00:07:59 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 May 2007 10:07:59 +0300 Subject: [ofa-general] Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries In-Reply-To: <4641714E.6050806@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com> <20070508150907.GU21591@mellanox.co.il> <46409504.9000802@voltaire.com> <20070508152650.GA5845@mellanox.co.il> <4640A911.8000609@voltaire.com> <20070508203318.GH10845@mellanox.co.il> <4641714E.6050806@voltaire.com> Message-ID: <20070509070759.GA18513@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries > > Michael S. Tsirkin wrote: > > > > > > Have you read the boring list of rules? > > http://git.openfabrics.org/~mst/boring.txt > > > > > Thanks for the pointer. This still violates rule 4c in the above (chapter 2 in CodingStyle).
-- MST From eli at mellanox.co.il Wed May 9 00:19:53 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 09 May 2007 10:19:53 +0300 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts In-Reply-To: References: <1178551555.17477.0.camel@mtls03> <1178606876.17477.15.camel@mtls03> Message-ID: <1178695223.24989.42.camel@mtls03> On Tue, 2007-05-08 at 17:57 -0700, Roland Dreier wrote: > > > @@ -249,8 +249,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq) > > > } > > > } > > > > > > - if (eqes_found) > > > - eq_set_ci(eq, 1); > > > + eq_set_ci(eq, 1); > > > > > > return eqes_found; > > > } > > > This will not ensure arming all EQs for all interrupts and we will face > > the same problem of losing interrupts. > > I don't understand what you mean here. How is unconditionally arming > the EQ at the end of mlx4_eq_int() any different from your proposed > patch? My change calls eq_set_ci() at the end of every call to > mlx4_eq_int(), and your change calls eq_set_ci() after every call to > mlx4_eq_int(). I'm probably missing something obvious, but I really > don't see it right now. > The difference between what I propose and what you propose is that my version unconditionally arms ALL EQs regardless of whether we find any EQEs in them while you arm only the EQs in which you find EQEs. The justification for doing this comes from the following scenario. Suppose we have two EQs, 0 and 1: 1. An event is generated on EQ1. 2. EQ1 posts an EQE. 3. A set interrupt message is sent. Very soon after that ... 3. An event is generated on EQ0. 4. EQ0 posts an EQE. 5. The interrupt handler is called and does: a. clear interrupt b. poll EQ0 but there is nothing there since the EQE is not yet in memory. c. poll EQ1, find an EQE, arm EQ1 Now we have an unconsumed EQE in EQ0 but it is not armed. Remember that the same is true for Arbel but there we arm all the EQs in a single write to the device. 
From boris at lfbs.RWTH-Aachen.DE Wed May 9 00:24:56 2007 From: boris at lfbs.RWTH-Aachen.DE (Boris Bierbaum) Date: Wed, 09 May 2007 09:24:56 +0200 Subject: [ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work In-Reply-To: <1178655353.11455.14.camel@stevo-desktop> References: <462E13A6.3030207@lfbs.rwth-aachen.de> <462E1DFE.5010703@Sun.COM> <46305D0A.5020900@lfbs.rwth-aachen.de> <4630EFDE.8070608@Sun.COM> <464044D4.5010501@lfbs.rwth-aachen.de> <054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com> <4640CACE.8070201@ichips.intel.com> <1178655353.11455.14.camel@stevo-desktop> Message-ID: <46417748.4020602@lfbs.rwth-aachen.de> It has been explained in a different thread on [ofa-general] that the problem lies in a combination of the OpenIB-cma provider not setting the local and remote port numbers on endpoints correctly and Open MPI stepping over the IA to save the port number to circumvent this problem, thereby confusing the provider. I commented out line 197 in ompi/mca/btl/udapl/btl_udapl.c (Open MPI 1.2.1 release) and this fixes the problem. As the problem in the provider is currently being fixed, the whole saving of the port number in the uDAPL BTL code will be unnecessary in the future. Steve Wise wrote: >>> Can the UDAPL OFED wizards shed any light on the error messages that >>> are listed below? In particular, these seem to be worrysome: >>> >>>> setup_listener Permission denied >>> setup_listener Address already in use >> These failures are from rdma_cm_bind indicating the port is already >> bound to this IA address. How are you creating the service point? >> dat_psp_create or dat_psp_create_any? If it is psp_create_any then you >> will see some failures until it gets to a free port. That is normal. >> Just make sure your create call returns DAT_SUCCESS. >> > > Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down > and let the rdma-cma pick an available port number? 
> > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -- | _ RWTH | Boris Bierbaum |_|_`_ | Lehrstuhl fuer Betriebssysteme | |_) _ | RWTH Aachen D-52056 Aachen |_)(_` | Tel: +49-241-80-27805 ._) | Fax: +49-241-80-22339 From yosefe at voltaire.com Wed May 9 01:11:38 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 09 May 2007 11:11:38 +0300 Subject: [ofa-general] Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries In-Reply-To: <20070509070759.GA18513@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com> <20070508150907.GU21591@mellanox.co.il> <46409504.9000802@voltaire.com> <20070508152650.GA5845@mellanox.co.il> <4640A911.8000609@voltaire.com> <20070508203318.GH10845@mellanox.co.il> <4641714E.6050806@voltaire.com> <20070509070759.GA18513@mellanox.co.il> Message-ID: <4641823A.8040100@voltaire.com> Michael S. Tsirkin wrote: >>Quoting Yosef Etigin : >>Subject: Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries >> >>Michael S. Tsirkin wrote: >> >>> >>>Have you read the boring list of rules? >>>http://git.openfabrics.org/~mst/boring.txt >>> >>> >>Thanks for the pointer. > > > This still violates rule 4c in the above (chapter 2 in CodingStyle). > Isn't chapter 2 about placing braces? core: uncached "find gid" and "find pkey" queries * Add ib_find_gid and ib_find_pkey over uncached device queries. The calls might block but the returns are always up-to-date. * Cache pkey,gid table lengths in core to avoid port info queries.
Signed-off-by: Yosef Etigin --- drivers/infiniband/core/device.c | 138 +++++++++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 25 +++++++ 2 files changed, 163 insertions(+) Index: b/drivers/infiniband/core/device.c =================================================================== --- a/drivers/infiniband/core/device.c 2007-05-08 15:46:36.000000000 +0300 +++ b/drivers/infiniband/core/device.c 2007-05-09 11:08:29.913598989 +0300 @@ -149,6 +149,18 @@ static int alloc_name(char *name) return 0; } +static inline int start_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; +} + + +static inline int end_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? + 0 : device->phys_port_cnt; +} + /** * ib_alloc_device - allocate an IB device struct * @size:size of structure to allocate @@ -208,6 +220,55 @@ static int add_client_context(struct ib_ return 0; } +/* read the lengths of pkey,gid tables on each port */ +static inline int read_port_table_lengths(struct ib_device *device) +{ + struct ib_port_attr *tprops = NULL; + int num_ports, ret = -ENOMEM; + u8 port_index; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + goto out; + + num_ports = end_port(device) - start_port(device) + 1; + + device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len * + num_ports, GFP_KERNEL); + if (!device->pkey_tbl_len) + goto out; + + device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len * + num_ports, GFP_KERNEL); + if (!device->gid_tbl_len) + goto err1; + + for (port_index = 0; port_index < num_ports; ++port_index) { + ret = ib_query_port(device, port_index + start_port(device), + tprops); + if (ret) + goto err2; + device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len; + device->gid_tbl_len[port_index] = tprops->gid_tbl_len; + } + + ret = 0; + goto out; +err2: + kfree(device->gid_tbl_len); +err1: + kfree(device->pkey_tbl_len); +out: + kfree(tprops); + return ret; +} + +static 
inline void free_port_table_lengths(struct ib_device *device) +{ + kfree(device->gid_tbl_len); + kfree(device->pkey_tbl_len); +} + /** * ib_register_device - Register an IB device with IB core * @device:Device to register @@ -239,6 +300,13 @@ int ib_register_device(struct ib_device spin_lock_init(&device->event_handler_lock); spin_lock_init(&device->client_data_lock); + ret = read_port_table_lengths(device); + if (ret) { + printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n", + device->name); + goto out; + } + ret = ib_device_register_sysfs(device); if (ret) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", @@ -284,6 +352,8 @@ void ib_unregister_device(struct ib_devi list_del(&device->core_list); + free_port_table_lengths(device); + mutex_unlock(&device_mutex); spin_lock_irqsave(&device->client_data_lock, flags); @@ -592,6 +662,74 @@ int ib_modify_port(struct ib_device *dev } EXPORT_SYMBOL(ib_modify_port); +/** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index) +{ + union ib_gid tmp_gid; + int ret, port, i, tbl_len; + + for (port = start_port(device); port <= end_port(device); ++port) { + tbl_len = device->gid_tbl_len[port - start_port(device)]; + for (i = 0; i < tbl_len; ++i) { + ret = ib_query_gid(device, port, i, &tmp_gid); + if (ret) + goto out; + if (!memcmp(&tmp_gid, gid, sizeof *gid)) { + *port_num = port; + *index = i; + ret = 0; + goto out; + } + } + } + ret = -ENOENT; +out: + return ret; +} +EXPORT_SYMBOL(ib_find_gid); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. 
+ * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. + */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index) +{ + int ret, i, tbl_len; + u16 tmp_pkey; + + tbl_len = device->pkey_tbl_len[port_num - start_port(device)]; + for (i = 0; i < tbl_len; ++i) { + ret = ib_query_pkey(device, port_num, i, &tmp_pkey); + if (ret) + goto out; + + if (pkey == tmp_pkey) { + *index = i; + ret = 0; + goto out; + } + } + ret = -ENOENT; + +out: + return ret; +} +EXPORT_SYMBOL(ib_find_pkey); + static int __init ib_core_init(void) { int ret; Index: b/include/rdma/ib_verbs.h =================================================================== --- a/include/rdma/ib_verbs.h 2007-05-08 15:45:45.000000000 +0300 +++ b/include/rdma/ib_verbs.h 2007-05-08 18:48:23.000000000 +0300 @@ -1058,6 +1058,8 @@ struct ib_device { __be64 node_guid; u8 node_type; u8 phys_port_cnt; + int *pkey_tbl_len; + int *gid_tbl_len; }; struct ib_client { @@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev struct ib_port_modify *port_modify); /** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. + * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. 
+ */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index); + +/** * ib_alloc_pd - Allocates an unused protection domain. * @device: The device on which to allocate the protection domain. * From mst at dev.mellanox.co.il Wed May 9 01:43:13 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 May 2007 11:43:13 +0300 Subject: [ofa-general] Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries In-Reply-To: <4641823A.8040100@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com> <20070508150907.GU21591@mellanox.co.il> <46409504.9000802@voltaire.com> <20070508152650.GA5845@mellanox.co.il> <4640A911.8000609@voltaire.com> <20070508203318.GH10845@mellanox.co.il> <4641714E.6050806@voltaire.com> <20070509070759.GA18513@mellanox.co.il> <4641823A.8040100@voltaire.com> Message-ID: <20070509084312.GA6974@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries > > Michael S. Tsirkin wrote: > >>Quoting Yosef Etigin : > >>Subject: Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries > >> > >>Michael S. Tsirkin wrote: > >> > >>> > >>>Have you read the boring list of rules? > >>>http://git.openfabrics.org/~mst/boring.txt > >>> > >>> > >>Thanks for the pointer. > > > > > > This still violates rule 4c in the above (chapter 2 in CodingStyle). > > > Isn't chapter 2 about placing braces? Yes, I see you've fixed this. Some last pedantic nits: > core: uncached "find gid" and "find pkey" queries > > * Add ib_find_gid and ib_find_pkey over uncached device queries. > The calls might block but the returns are always up-to-date. > * Cache pky,gid table lengths in core to avoid port info queries. 
> > > Signed-off-by: Yosef Etigin > --- > drivers/infiniband/core/device.c | 138 +++++++++++++++++++++++++++++++++++++++ > include/rdma/ib_verbs.h | 25 +++++++ > 2 files changed, 163 insertions(+) > > Index: b/drivers/infiniband/core/device.c > =================================================================== > --- a/drivers/infiniband/core/device.c 2007-05-08 15:46:36.000000000 +0300 > +++ b/drivers/infiniband/core/device.c 2007-05-09 11:08:29.913598989 +0300 > @@ -149,6 +149,18 @@ static int alloc_name(char *name) > return 0; > } > > +static inline int start_port(struct ib_device *device) > +{ > + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; > +} > + > + > +static inline int end_port(struct ib_device *device) > +{ > + return (device->node_type == RDMA_NODE_IB_SWITCH) ? > + 0 : device->phys_port_cnt; > +} > + > /** > * ib_alloc_device - allocate an IB device struct > * @size:size of structure to allocate > @@ -208,6 +220,55 @@ static int add_client_context(struct ib_ > return 0; > } > > +/* read the lengths of pkey,gid tables on each port */ > +static inline int read_port_table_lengths(struct ib_device *device) This function is too big to be inline. 
> +{ > + struct ib_port_attr *tprops = NULL; > + int num_ports, ret = -ENOMEM; > + u8 port_index; > + > + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); > + if (!tprops) > + goto out; > + > + num_ports = end_port(device) - start_port(device) + 1; > + > + device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len * > + num_ports, GFP_KERNEL); > + if (!device->pkey_tbl_len) > + goto out; > + > + device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len * > + num_ports, GFP_KERNEL); > + if (!device->gid_tbl_len) > + goto err1; > + > + for (port_index = 0; port_index < num_ports; ++port_index) { > + ret = ib_query_port(device, port_index + start_port(device), > + tprops); > + if (ret) > + goto err2; > + device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len; > + device->gid_tbl_len[port_index] = tprops->gid_tbl_len; > + } > + > + ret = 0; > + goto out; > +err2: > + kfree(device->gid_tbl_len); > +err1: > + kfree(device->pkey_tbl_len); > +out: > + kfree(tprops); > + return ret; > +} > + > +static inline void free_port_table_lengths(struct ib_device *device) > +{ > + kfree(device->gid_tbl_len); > + kfree(device->pkey_tbl_len); > +} > + > /** > * ib_register_device - Register an IB device with IB core > * @device:Device to register > @@ -239,6 +300,13 @@ int ib_register_device(struct ib_device > spin_lock_init(&device->event_handler_lock); > spin_lock_init(&device->client_data_lock); > > + ret = read_port_table_lengths(device); > + if (ret) { > + printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n", > + device->name); > + goto out; > + } > + > ret = ib_device_register_sysfs(device); > if (ret) { > printk(KERN_WARNING "Couldn't register device %s with driver model\n", > @@ -284,6 +352,8 @@ void ib_unregister_device(struct ib_devi > > list_del(&device->core_list); > > + free_port_table_lengths(device); > + > mutex_unlock(&device_mutex); > > spin_lock_irqsave(&device->client_data_lock, flags); > @@ -592,6 +662,74 @@ int ib_modify_port(struct ib_device 
*dev > } > EXPORT_SYMBOL(ib_modify_port); > > +/** > + * ib_find_gid - Returns the port number and GID table index where > + * a specified GID value occurs. > + * @device: The device to query. > + * @gid: The GID value to search for. > + * @port_num: The port number of the device where the GID value was found. > + * @index: The index into the GID table where the GID was found. This > + * parameter may be NULL. > + */ > +int ib_find_gid(struct ib_device *device, union ib_gid *gid, > + u8 *port_num, u16 *index) Either indent with tabs only here, or use spaces to align continuation at (. > +{ > + union ib_gid tmp_gid; > + int ret, port, i, tbl_len; > + > + for (port = start_port(device); port <= end_port(device); ++port) { > + tbl_len = device->gid_tbl_len[port - start_port(device)]; > + for (i = 0; i < tbl_len; ++i) { > + ret = ib_query_gid(device, port, i, &tmp_gid); > + if (ret) > + goto out; > + if (!memcmp(&tmp_gid, gid, sizeof *gid)) { > + *port_num = port; > + *index = i; > + ret = 0; > + goto out; > + } > + } > + } > + ret = -ENOENT; > +out: > + return ret; > +} > +EXPORT_SYMBOL(ib_find_gid); > + > +/** > + * ib_find_pkey - Returns the PKey table index where a specified > + * PKey value occurs. > + * @device: The device to query. > + * @port_num: The port number of the device to search for the PKey. > + * @pkey: The PKey value to search for. > + * @index: The index into the PKey table where the PKey was found. 
> + */ > +int ib_find_pkey(struct ib_device *device, > + u8 port_num, u16 pkey, u16 *index) > +{ > + int ret, i, tbl_len; > + u16 tmp_pkey; > + > + tbl_len = device->pkey_tbl_len[port_num - start_port(device)]; > + for (i = 0; i < tbl_len; ++i) { > + ret = ib_query_pkey(device, port_num, i, &tmp_pkey); > + if (ret) > + goto out; > + > + if (pkey == tmp_pkey) { > + *index = i; > + ret = 0; > + goto out; > + } > + } > + ret = -ENOENT; > + > +out: > + return ret; > +} > +EXPORT_SYMBOL(ib_find_pkey); > + > static int __init ib_core_init(void) > { > int ret; > Index: b/include/rdma/ib_verbs.h > =================================================================== > --- a/include/rdma/ib_verbs.h 2007-05-08 15:45:45.000000000 +0300 > +++ b/include/rdma/ib_verbs.h 2007-05-08 18:48:23.000000000 +0300 > @@ -1058,6 +1058,8 @@ struct ib_device { > __be64 node_guid; > u8 node_type; > u8 phys_port_cnt; > + int *pkey_tbl_len; > + int *gid_tbl_len; > }; > > struct ib_client { > @@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev > struct ib_port_modify *port_modify); > > /** > + * ib_find_gid - Returns the port number and GID table index where > + * a specified GID value occurs. > + * @device: The device to query. > + * @gid: The GID value to search for. > + * @port_num: The port number of the device where the GID value was found. > + * @index: The index into the GID table where the GID was found. This > + * parameter may be NULL. > + */ > +int ib_find_gid(struct ib_device *device, union ib_gid *gid, > + u8 *port_num, u16 *index); And here, too. > + > +/** > + * ib_find_pkey - Returns the PKey table index where a specified > + * PKey value occurs. > + * @device: The device to query. > + * @port_num: The port number of the device to search for the PKey. > + * @pkey: The PKey value to search for. > + * @index: The index into the PKey table where the PKey was found. 
> + */ > +int ib_find_pkey(struct ib_device *device, > + u8 port_num, u16 pkey, u16 *index); > + > +/** > * ib_alloc_pd - Allocates an unused protection domain. > * @device: The device on which to allocate the protection domain. > * -- MST From yosefe at voltaire.com Wed May 9 01:52:15 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 09 May 2007 11:52:15 +0300 Subject: [ofa-general] Re: [PATCHv3 1/2] core: uncached "find gid" and "find pkey" queries In-Reply-To: <20070509084312.GA6974@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408336.8080908@voltaire.com> <20070508150907.GU21591@mellanox.co.il> <46409504.9000802@voltaire.com> <20070508152650.GA5845@mellanox.co.il> <4640A911.8000609@voltaire.com> <20070508203318.GH10845@mellanox.co.il> <4641714E.6050806@voltaire.com> <20070509070759.GA18513@mellanox.co.il> <4641823A.8040100@voltaire.com> <20070509084312.GA6974@mellanox.co.il> Message-ID: <46418BBF.10801@voltaire.com> * Add ib_find_gid and ib_find_pkey over uncached device queries. The calls might block but the returns are always up-to-date. * Cache pky,gid table lengths in core to avoid port info queries. Signed-off-by: Yosef Etigin --- drivers/infiniband/core/device.c | 138 +++++++++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 25 +++++++ 2 files changed, 163 insertions(+) Index: b/drivers/infiniband/core/device.c =================================================================== --- a/drivers/infiniband/core/device.c 2007-05-08 15:46:36.000000000 +0300 +++ b/drivers/infiniband/core/device.c 2007-05-09 11:47:22.096064221 +0300 @@ -149,6 +149,18 @@ static int alloc_name(char *name) return 0; } +static inline int start_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; +} + + +static inline int end_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 
+ 0 : device->phys_port_cnt; +} + /** * ib_alloc_device - allocate an IB device struct * @size:size of structure to allocate @@ -208,6 +220,55 @@ static int add_client_context(struct ib_ return 0; } +/* read the lengths of pkey,gid tables on each port */ +static int read_port_table_lengths(struct ib_device *device) +{ + struct ib_port_attr *tprops = NULL; + int num_ports, ret = -ENOMEM; + u8 port_index; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + goto out; + + num_ports = end_port(device) - start_port(device) + 1; + + device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len * + num_ports, GFP_KERNEL); + if (!device->pkey_tbl_len) + goto out; + + device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len * + num_ports, GFP_KERNEL); + if (!device->gid_tbl_len) + goto err1; + + for (port_index = 0; port_index < num_ports; ++port_index) { + ret = ib_query_port(device, port_index + start_port(device), + tprops); + if (ret) + goto err2; + device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len; + device->gid_tbl_len[port_index] = tprops->gid_tbl_len; + } + + ret = 0; + goto out; +err2: + kfree(device->gid_tbl_len); +err1: + kfree(device->pkey_tbl_len); +out: + kfree(tprops); + return ret; +} + +static inline void free_port_table_lengths(struct ib_device *device) +{ + kfree(device->gid_tbl_len); + kfree(device->pkey_tbl_len); +} + /** * ib_register_device - Register an IB device with IB core * @device:Device to register @@ -239,6 +300,13 @@ int ib_register_device(struct ib_device spin_lock_init(&device->event_handler_lock); spin_lock_init(&device->client_data_lock); + ret = read_port_table_lengths(device); + if (ret) { + printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n", + device->name); + goto out; + } + ret = ib_device_register_sysfs(device); if (ret) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", @@ -284,6 +352,8 @@ void ib_unregister_device(struct ib_devi list_del(&device->core_list); + 
free_port_table_lengths(device); + mutex_unlock(&device_mutex); spin_lock_irqsave(&device->client_data_lock, flags); @@ -592,6 +662,74 @@ int ib_modify_port(struct ib_device *dev } EXPORT_SYMBOL(ib_modify_port); +/** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index) +{ + union ib_gid tmp_gid; + int ret, port, i, tbl_len; + + for (port = start_port(device); port <= end_port(device); ++port) { + tbl_len = device->gid_tbl_len[port - start_port(device)]; + for (i = 0; i < tbl_len; ++i) { + ret = ib_query_gid(device, port, i, &tmp_gid); + if (ret) + goto out; + if (!memcmp(&tmp_gid, gid, sizeof *gid)) { + *port_num = port; + *index = i; + ret = 0; + goto out; + } + } + } + ret = -ENOENT; +out: + return ret; +} +EXPORT_SYMBOL(ib_find_gid); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. + * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. 
+ */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index) +{ + int ret, i, tbl_len; + u16 tmp_pkey; + + tbl_len = device->pkey_tbl_len[port_num - start_port(device)]; + for (i = 0; i < tbl_len; ++i) { + ret = ib_query_pkey(device, port_num, i, &tmp_pkey); + if (ret) + goto out; + + if (pkey == tmp_pkey) { + *index = i; + ret = 0; + goto out; + } + } + ret = -ENOENT; + +out: + return ret; +} +EXPORT_SYMBOL(ib_find_pkey); + static int __init ib_core_init(void) { int ret; Index: b/include/rdma/ib_verbs.h =================================================================== --- a/include/rdma/ib_verbs.h 2007-05-08 15:45:45.000000000 +0300 +++ b/include/rdma/ib_verbs.h 2007-05-09 11:47:55.006221894 +0300 @@ -1058,6 +1058,8 @@ struct ib_device { __be64 node_guid; u8 node_type; u8 phys_port_cnt; + int *pkey_tbl_len; + int *gid_tbl_len; }; struct ib_client { @@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev struct ib_port_modify *port_modify); /** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. + * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. + */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index); + +/** * ib_alloc_pd - Allocates an unused protection domain. * @device: The device on which to allocate the protection domain. 
* From eli at mellanox.co.il Wed May 9 01:12:43 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 09 May 2007 11:12:43 +0300 Subject: [ofa-general] [PATCH] IB/core user memory registrations Message-ID: <1178698393.26046.8.camel@mtls03> fix missing initialization of write_mtt_size Signed-off-by: Eli Cohen --- Index: connectx_kernel/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- connectx_kernel.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2007-05-08 15:48:57.000000000 +0300 +++ connectx_kernel/drivers/infiniband/hw/mthca/mthca_provider.c 2007-05-08 17:17:03.000000000 +0300 @@ -1020,7 +1020,7 @@ static struct ib_mr *mthca_reg_user_mr(s int shift, n, len; int i, j, k; int err = 0; - int write_mtt_size; + int write_mtt_size = mthca_write_mtt_size(dev); mr = kmalloc(sizeof *mr, GFP_KERNEL); if (!mr) From vlad at lists.openfabrics.org Wed May 9 02:30:10 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 9 May 2007 02:30:10 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070509-0200 daily build status Message-ID: <20070509093010.A59C3E60823@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed 
on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Failed: From mst at dev.mellanox.co.il Wed May 9 02:35:48 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 May 2007 12:35:48 +0300 Subject: [ofa-general] Re: [PATCHv3 1/2] ipoib: handle pkey change events In-Reply-To: <4640A8BD.4000405@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> Message-ID: <20070509093548.GA7683@mellanox.co.il> > @@ -642,6 +651,12 @@ void ipoib_ib_dev_flush(struct work_stru > > ipoib_ib_dev_down(dev, 0); > > + if (restart_qp) { > + if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) > + ipoib_ib_dev_stop(dev, 0); > + ipoib_ib_dev_open(dev); > + } > + > /* > * The device could have been brought down between the start and when > * we get here, don't bring it back up if it's not configured up This is something that still puzzles me 1. We have tested IPOIB_FLAG_INITIALIZED above already, didn't we? Did you observe it flipping in testing? If yes there's some race ... 2. 
Let's assume that device is not initialized: how come you are calling ipoib_ib_dev_open on it here? -- MST From yosefe at voltaire.com Wed May 9 04:01:26 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 09 May 2007 14:01:26 +0300 Subject: [ofa-general] Re: [PATCHv3 1/2] ipoib: handle pkey change events In-Reply-To: <20070509093548.GA7683@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> Message-ID: <4641AA06.1050002@voltaire.com> Michael S. Tsirkin wrote: >>@@ -642,6 +651,12 @@ void ipoib_ib_dev_flush(struct work_stru >> >> ipoib_ib_dev_down(dev, 0); >> >>+ if (restart_qp) { >>+ if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) >>+ ipoib_ib_dev_stop(dev, 0); >>+ ipoib_ib_dev_open(dev); >>+ } >>+ >> /* >> * The device could have been brought down between the start and when >> * we get here, don't bring it back up if it's not configured up > > > This is something that still puzzles me > > 1. We have tested IPOIB_FLAG_INITIALIZED above already, didn't we? > Did you observe it flipping in testing? If yes there's some race ... > > 2. Let's assume that device is not initialized: > how come you are calling ipoib_ib_dev_open on it here? > Option 2 is true. This test is a leftover from a previous version of the patch and should be removed. -- This issue was found during partitioning & SM failover testing. * Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity * Upon PKEY_CHANGE event, schedule a work that restarts the QP * Restart child interfaces before parent. They might be up even if the parent is down * Use uncached pkey query upon qp initialization SM reconfiguration or failover possibly causes a shuffling of the values in the port pkey table. 
The current implementation only queries for the index of the pkey once, when it creates the device QP and after that moves it into working state, and hence does not address this scenario. Fix this by using the PKEY_CHANGE event as a trigger to reconfigure the device QP. Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 6 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 59 ++++++++++++++++++++--------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +-- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 11 ++--- 4 files changed, 56 insertions(+), 27 deletions(-) Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 15:46:53.000000000 +0300 +++ infiniband/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 16:45:44.000000000 +0300 @@ -202,11 +202,12 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; + struct delayed_work pkey_poll_task; struct delayed_work mcast_task; struct work_struct flush_task; struct work_struct restart_task; struct delayed_work ah_reap_task; + struct work_struct pkey_event_task; struct ib_device *ca; u8 port; @@ -333,12 +334,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
=================================================================== --- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 15:46:53.000000000 +0300 +++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-09 13:56:00.754030478 +0300 @@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device ret = ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -481,7 +481,7 @@ int ipoib_ib_dev_down(struct net_device if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { mutex_lock(&pkey_mutex); set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); + cancel_delayed_work(&priv->pkey_poll_task); mutex_unlock(&pkey_mutex); if (flush) flush_workqueue(ipoib_workqueue); @@ -508,7 +508,7 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +581,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,13 +623,21 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct *work) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { + mutex_lock(&priv->vlan_mutex); + + /* Flush any child 
interfaces too - + * they might be up even if the parent is down */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, restart_qp); + + mutex_unlock(&priv->vlan_mutex); + + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); return; } @@ -642,6 +651,11 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_down(dev, 0); + if (restart_qp) { + ipoib_ib_dev_stop(dev, 0); + ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +664,24 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task); - /* Flush any child interfaces too */ - list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - mutex_unlock(&priv->vlan_mutex); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_event_task); + + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void ipoib_ib_dev_cleanup(struct net_device *dev) @@ -685,7 +709,7 @@ void ipoib_ib_dev_cleanup(struct net_dev void ipoib_pkey_poll(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); + container_of(work, struct ipoib_dev_priv, pkey_poll_task.work); struct net_device *dev = priv->dev; ipoib_pkey_dev_check_presence(dev); @@ -696,7 +720,7 @@ void ipoib_pkey_poll(struct work_struct mutex_lock(&pkey_mutex); if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) 
queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); } @@ -715,7 +739,7 @@ int ipoib_pkey_dev_delay_open(struct net mutex_lock(&pkey_mutex); clear_bit(IPOIB_PKEY_STOP, &priv->flags); queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); return 1; Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 15:46:53.000000000 +0300 +++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 16:45:44.000000000 +0300 @@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev) return -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 15:46:53.000000000 +0300 +++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 16:45:44.000000000 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ 
-#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ret = -ENXIO; goto out; @@ -103,7 +101,7 @@ int ipoib_init_qp(struct net_device *dev * The port has to be assigned to the respective IB partition in * advance. */ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); + ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); if (ret) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); return ret; @@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE || @@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler record->element.port_num == priv->port) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE && + record->element.port_num == priv->port) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_event_task); } } From rdreier at cisco.com Wed May 9 04:03:55 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 09 May 2007 04:03:55 -0700 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts In-Reply-To: <1178695223.24989.42.camel@mtls03> (Eli Cohen's message of "Wed, 09 May 2007 10:19:53 +0300") References: <1178551555.17477.0.camel@mtls03> <1178606876.17477.15.camel@mtls03> <1178695223.24989.42.camel@mtls03> Message-ID: > > I don't understand what you mean here. 
How is unconditionally arming > > the EQ at the end of mlx4_eq_int() any different from your proposed > > patch? My change calls eq_set_ci() at the end of every call to > > mlx4_eq_int(), and your change calls eq_set_ci() after every call to > > mlx4_eq_int(). I'm probably missing something obvious, but I really > > don't see it right now. > The difference between what I propose and what you propose is that my > version unconditionally arms ALL EQs regardless of whether we find any > EQEs in them while you arm only the EQs in which you find EQEs. The > justification for doing this comes from the following scenario. Suppose > we have two EQs, 0 and 1: I understand all that. The question is, what's the difference between my version (which is in my tree now), which does: mlx4_eq_int(...eq...) { ... eq_set_ci(eq, 1); return eqes_found; } and your version, which does mlx4_eq_int(eq); eq_set_ci(eq, 1); for every call to mlx4_eq_int()? Why does it matter if the eq_set_ci() is inside mlx4_eq_int() or outside? - R. From mst at dev.mellanox.co.il Wed May 9 04:26:26 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 May 2007 14:26:26 +0300 Subject: [ofa-general] Re: [PATCHv3 1/2] ipoib: handle pkey change events In-Reply-To: <4641AA06.1050002@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> Message-ID: <20070509112626.GA10068@mellanox.co.il> OK. looks pretty good to me. 
One coding style violation I found: > @@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler > record->element.port_num == priv->port) { > ipoib_dbg(priv, "Port state change event\n"); > queue_work(ipoib_workqueue, &priv->flush_task); > + } else if (record->event == IB_EVENT_PKEY_CHANGE && > + record->element.port_num == priv->port) { > + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); > + queue_work(ipoib_workqueue, &priv->pkey_event_task); > } > } This violates the "breaking long lines" rule again. Should be > + } else if (record->event == IB_EVENT_PKEY_CHANGE && > + record->element.port_num == priv->port) { > + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); > + queue_work(ipoib_workqueue, &priv->pkey_event_task); > } -- MST From eli at mellanox.co.il Wed May 9 04:28:22 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 09 May 2007 14:28:22 +0300 Subject: [ofa-general] [PATCH] IB/mlx4 mlx4_ib: eq interrupts In-Reply-To: References: <1178551555.17477.0.camel@mtls03> <1178606876.17477.15.camel@mtls03> <1178695223.24989.42.camel@mtls03> Message-ID: <1178710133.27749.4.camel@mtls03> On Wed, 2007-05-09 at 04:03 -0700, Roland Dreier wrote: > I understand all that. The question is, what's the difference between > my version (which is in my tree now), which does: > > mlx4_eq_int(...eq...) > { > ... > eq_set_ci(eq, 1); > > return eqes_found; > } > > and your version, which does > > mlx4_eq_int(eq); > eq_set_ci(eq, 1); > > for every call to mlx4_eq_int()? Why does it matter if the > eq_set_ci() is inside mlx4_eq_int() or outside? > > - R. Oh I see, you're right - your version also arms all the EQs. From mst at dev.mellanox.co.il Wed May 9 04:28:29 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 9 May 2007 14:28:29 +0300 Subject: [ofa-general] Re: [PATCH] IB/core user memory registrations In-Reply-To: <1178698393.26046.8.camel@mtls03> References: <1178698393.26046.8.camel@mtls03> Message-ID: <20070509112829.GB10068@mellanox.co.il> > Quoting Eli Cohen : > Subject: [PATCH] IB/core user memory registrations > > fix missing initialization of write_mtt_size > > Signed-off-by: Eli Cohen This is actually IB/mthca, right? Wow, this seems to fix breakage introduced by the latest core changes, is that right? I'm not sure how I could have missed this - need to go back and re-review that patch. -- MST From fenkes at de.ibm.com Wed May 9 04:47:56 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Wed, 9 May 2007 13:47:56 +0200 Subject: [ofa-general] [PATCH 1/6] IB/ehca: Serialize hypervisor calls in ehca_register_mr() Message-ID: <200705091347.57470.fenkes@de.ibm.com> From: Stefan Roscher Some pSeries hypervisor versions show a race condition in the allocate MR hCall. Serialize this call per adapter to circumvent this problem. 
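For readers who want the idea without the driver context, the serialization described above can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: the opcode and resource-type constants, fake_call() and serialized_call() are hypothetical stand-ins (not the real eHCA or hypervisor interfaces), and a pthread mutex takes the place of the kernel spinlock. The point it tries to show is that the lock is held only around the racy call itself and is dropped before any retry delay.

```c
#include <pthread.h>

/* Hypothetical stand-ins; these are not the real eHCA or hypervisor values. */
enum { OP_ALLOC_RESOURCE = 1, OP_QUERY = 2 };
enum { RES_TYPE_MR = 5 };
enum { RET_OK = 0, RET_BUSY = 1 };
#define MAX_TRIES 5

static pthread_mutex_t call_lock = PTHREAD_MUTEX_INITIALIZER;
static int locked_calls;	/* how many calls were issued under the lock */
static int busy_replies;	/* how many more times the fake call answers "busy" */

/* Pretend firmware call: answers "busy" while busy_replies is positive. */
static int fake_call(int opcode, int res_type)
{
	(void)opcode;
	(void)res_type;
	if (busy_replies > 0) {
		busy_replies--;
		return RET_BUSY;
	}
	return RET_OK;
}

/* Issue the call, taking the lock only for the opcode assumed to be racy. */
static int serialized_call(int opcode, int res_type)
{
	int needs_lock = (opcode == OP_ALLOC_RESOURCE && res_type == RES_TYPE_MR);
	int i, ret;

	for (i = 0; i < MAX_TRIES; i++) {
		if (needs_lock) {
			pthread_mutex_lock(&call_lock);
			locked_calls++;
		}

		ret = fake_call(opcode, res_type);

		if (needs_lock)
			pthread_mutex_unlock(&call_lock);

		if (ret != RET_BUSY)
			return ret;
		/* A real driver would sleep here, with the lock already dropped. */
	}
	return RET_BUSY;
}
```

Keeping the critical section this narrow means other callers are only blocked for the duration of one call attempt, not for the whole retry loop.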
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_classes.h | 1 + drivers/infiniband/hw/ehca/ehca_main.c | 2 ++ drivers/infiniband/hw/ehca/hcp_if.c | 14 ++++++++++++-- 3 files changed, 15 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 10fb8fb..4bc5cb3 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -276,6 +276,7 @@ void ehca_cleanup_mrmw_cache(void); extern spinlock_t ehca_qp_idr_lock; extern spinlock_t ehca_cq_idr_lock; +extern spinlock_t hcall_lock; extern struct idr ehca_qp_idr; extern struct idr ehca_cq_idr; diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 2d37054..2e27e68 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -98,6 +98,7 @@ MODULE_PARM_DESC(scaling_code, spinlock_t ehca_qp_idr_lock; spinlock_t ehca_cq_idr_lock; +spinlock_t hcall_lock; DEFINE_IDR(ehca_qp_idr); DEFINE_IDR(ehca_cq_idr); @@ -817,6 +818,7 @@ int __init ehca_module_init(void) idr_init(&ehca_cq_idr); spin_lock_init(&ehca_qp_idr_lock); spin_lock_init(&ehca_cq_idr_lock); + spin_lock_init(&hcall_lock); INIT_LIST_HEAD(&shca_list); spin_lock_init(&shca_list_lock); diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index b564fcd..bb76134 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -154,7 +154,9 @@ static long ehca_plpar_hcall9(unsigned l unsigned long arg9) { long ret; - int i, sleep_msecs; + int i, sleep_msecs, lock_is_set = 0; + unsigned long flags; + ehca_gen_dbg("opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx " "arg5=%lx arg6=%lx arg7=%lx arg8=%lx arg9=%lx", @@ -162,10 +164,18 @@ static long ehca_plpar_hcall9(unsigned l arg8, arg9); for (i = 0; i < 5; i++) { + if ((opcode == H_ALLOC_RESOURCE) && (arg2 == 5)) { + spin_lock_irqsave(&hcall_lock, 
flags); + lock_is_set = 1; + } + ret = plpar_hcall9(opcode, outs, arg1, arg2, arg3, arg4, arg5, arg6, arg7, arg8, arg9); + if (lock_is_set) + spin_unlock_irqrestore(&hcall_lock, flags); + if (H_IS_LONG_BUSY(ret)) { sleep_msecs = get_longbusy_msecs(ret); msleep_interruptible(sleep_msecs); @@ -193,11 +203,11 @@ static long ehca_plpar_hcall9(unsigned l opcode, ret, outs[0], outs[1], outs[2], outs[3], outs[4], outs[5], outs[6], outs[7], outs[8]); return ret; - } return H_BUSY; } + u64 hipz_h_alloc_resource_eq(const struct ipz_adapter_handle adapter_handle, struct ehca_pfeq *pfeq, const u32 neq_control, -- 1.4.2.1 From fenkes at de.ibm.com Wed May 9 04:48:01 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Wed, 9 May 2007 13:48:01 +0200 Subject: [ofa-general] [PATCH 2/6] IB/ehca: correctly set GRH mask bit in ehca_modify_qp() Message-ID: <200705091348.02396.fenkes@de.ibm.com> The driver needs to always supply the "GRH present" flag to the hypervisor, whether it's true or false. Not supplying it (i.e. not setting the corresponding mask bit) amounts to a "perhaps", which we don't want. Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_qp.c | 12 ++++++++---- 1 files changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index df0516f..e21d796 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -968,17 +968,21 @@ static int internal_modify_qp(struct ib_ ((ehca_mult - 1) / ah_mult) : 0; else mqpcb->max_static_rate = 0; - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_MAX_STATIC_RATE, 1); /* + * Always supply the GRH flag, even if it's zero, to give the + * hypervisor a clear "yes" or "no" instead of a "perhaps" + */ + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG, 1); + + /* * only if GRH is TRUE we might consider SOURCE_GID_IDX * and DEST_GID otherwise phype will return H_ATTR_PARM!!! 
*/ if (attr->ah_attr.ah_flags == IB_AH_GRH) { - mqpcb->send_grh_flag = 1 << 31; - update_mask |= - EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG, 1); + mqpcb->send_grh_flag = 1; + mqpcb->source_gid_idx = attr->ah_attr.grh.sgid_index; update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SOURCE_GID_IDX, 1); -- 1.4.2.1 From fenkes at de.ibm.com Wed May 9 04:48:25 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Wed, 9 May 2007 13:48:25 +0200 Subject: [ofa-general] [PATCH 5/6] IB/ehca: beautify sysfs attribute code, fix compiler warnings Message-ID: <200705091348.26426.fenkes@de.ibm.com> eHCA's sysfs attributes are now being created via sysfs_create_group(), making the process neatly table-driven. The return value is checked, thus fixing a few compiler warnings. Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_main.c | 86 ++++++++++++++------------------ 1 files changed, 37 insertions(+), 49 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 2e27e68..dc736e8 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -454,15 +454,14 @@ static ssize_t ehca_store_debug_level(st DRIVER_ATTR(debug_level, S_IRUSR | S_IWUSR, ehca_show_debug_level, ehca_store_debug_level); -void ehca_create_driver_sysfs(struct ibmebus_driver *drv) -{ - driver_create_file(&drv->driver, &driver_attr_debug_level); -} +static struct attribute *ehca_drv_attrs[] = { + &driver_attr_debug_level.attr, + NULL +}; -void ehca_remove_driver_sysfs(struct ibmebus_driver *drv) -{ - driver_remove_file(&drv->driver, &driver_attr_debug_level); -} +static struct attribute_group ehca_drv_attr_grp = { + .attrs = ehca_drv_attrs +}; #define EHCA_RESOURCE_ATTR(name) \ static ssize_t ehca_show_##name(struct device *dev, \ @@ -524,44 +523,28 @@ static ssize_t ehca_show_adapter_handle( } static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, NULL); +static struct attribute *ehca_dev_attrs[] = { + 
&dev_attr_adapter_handle.attr, + &dev_attr_num_ports.attr, + &dev_attr_hw_ver.attr, + &dev_attr_max_eq.attr, + &dev_attr_cur_eq.attr, + &dev_attr_max_cq.attr, + &dev_attr_cur_cq.attr, + &dev_attr_max_qp.attr, + &dev_attr_cur_qp.attr, + &dev_attr_max_mr.attr, + &dev_attr_cur_mr.attr, + &dev_attr_max_mw.attr, + &dev_attr_cur_mw.attr, + &dev_attr_max_pd.attr, + &dev_attr_max_ah.attr, + NULL +}; -void ehca_create_device_sysfs(struct ibmebus_dev *dev) -{ - device_create_file(&dev->ofdev.dev, &dev_attr_adapter_handle); - device_create_file(&dev->ofdev.dev, &dev_attr_num_ports); - device_create_file(&dev->ofdev.dev, &dev_attr_hw_ver); - device_create_file(&dev->ofdev.dev, &dev_attr_max_eq); - device_create_file(&dev->ofdev.dev, &dev_attr_cur_eq); - device_create_file(&dev->ofdev.dev, &dev_attr_max_cq); - device_create_file(&dev->ofdev.dev, &dev_attr_cur_cq); - device_create_file(&dev->ofdev.dev, &dev_attr_max_qp); - device_create_file(&dev->ofdev.dev, &dev_attr_cur_qp); - device_create_file(&dev->ofdev.dev, &dev_attr_max_mr); - device_create_file(&dev->ofdev.dev, &dev_attr_cur_mr); - device_create_file(&dev->ofdev.dev, &dev_attr_max_mw); - device_create_file(&dev->ofdev.dev, &dev_attr_cur_mw); - device_create_file(&dev->ofdev.dev, &dev_attr_max_pd); - device_create_file(&dev->ofdev.dev, &dev_attr_max_ah); -} - -void ehca_remove_device_sysfs(struct ibmebus_dev *dev) -{ - device_remove_file(&dev->ofdev.dev, &dev_attr_adapter_handle); - device_remove_file(&dev->ofdev.dev, &dev_attr_num_ports); - device_remove_file(&dev->ofdev.dev, &dev_attr_hw_ver); - device_remove_file(&dev->ofdev.dev, &dev_attr_max_eq); - device_remove_file(&dev->ofdev.dev, &dev_attr_cur_eq); - device_remove_file(&dev->ofdev.dev, &dev_attr_max_cq); - device_remove_file(&dev->ofdev.dev, &dev_attr_cur_cq); - device_remove_file(&dev->ofdev.dev, &dev_attr_max_qp); - device_remove_file(&dev->ofdev.dev, &dev_attr_cur_qp); - device_remove_file(&dev->ofdev.dev, &dev_attr_max_mr); - 
device_remove_file(&dev->ofdev.dev, &dev_attr_cur_mr); - device_remove_file(&dev->ofdev.dev, &dev_attr_max_mw); - device_remove_file(&dev->ofdev.dev, &dev_attr_cur_mw); - device_remove_file(&dev->ofdev.dev, &dev_attr_max_pd); - device_remove_file(&dev->ofdev.dev, &dev_attr_max_ah); -} +static struct attribute_group ehca_dev_attr_grp = { + .attrs = ehca_dev_attrs +}; static int __devinit ehca_probe(struct ibmebus_dev *dev, const struct of_device_id *id) @@ -669,7 +652,10 @@ static int __devinit ehca_probe(struct i } } - ehca_create_device_sysfs(dev); + ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); spin_lock(&shca_list_lock); list_add(&shca->shca_list, &shca_list); @@ -721,7 +707,7 @@ static int __devexit ehca_remove(struct struct ehca_shca *shca = dev->ofdev.dev.driver_data; int ret; - ehca_remove_device_sysfs(dev); + sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); if (ehca_open_aqp1 == 1) { int i; @@ -840,7 +826,9 @@ int __init ehca_module_init(void) goto module_init2; } - ehca_create_driver_sysfs(&ehca_driver); + ret = sysfs_create_group(&ehca_driver.driver.kobj, &ehca_drv_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_gen_err("Cannot create driver attributes ret=%d", ret); if (ehca_poll_all_eqs != 1) { ehca_gen_err("WARNING!!!"); @@ -867,7 +855,7 @@ void __exit ehca_module_exit(void) if (ehca_poll_all_eqs == 1) del_timer_sync(&poll_eqs_timer); - ehca_remove_driver_sysfs(&ehca_driver); + sysfs_remove_group(&ehca_driver.driver.kobj, &ehca_drv_attr_grp); ibmebus_unregister_driver(&ehca_driver); ehca_destroy_slab_caches(); -- 1.4.2.1 From fenkes at de.ibm.com Wed May 9 04:48:11 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Wed, 9 May 2007 13:48:11 +0200 Subject: [ofa-general] [PATCH 3/6] IB/ehca: Fix AQP0/1 QP number Message-ID: 
<200705091348.12551.fenkes@de.ibm.com> From: Hoang-Nam Nguyen AQP0/1 should report qp_num={0|1} and the actual QP# should be stored in struct ehca_qp, not the other way round. Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_qp.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index e21d796..b5bc787 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -523,6 +523,8 @@ struct ib_qp *ehca_create_qp(struct ib_p goto create_qp_exit1; } + my_qp->ib_qp.qp_num = my_qp->real_qp_num; + switch (init_attr->qp_type) { case IB_QPT_RC: if (isdaqp == 0) { @@ -568,7 +570,7 @@ struct ib_qp *ehca_create_qp(struct ib_p parms.act_nr_recv_wqes = init_attr->cap.max_recv_wr; parms.act_nr_send_sges = init_attr->cap.max_send_sge; parms.act_nr_recv_sges = init_attr->cap.max_recv_sge; - my_qp->real_qp_num = + my_qp->ib_qp.qp_num = (init_attr->qp_type == IB_QPT_SMI) ? 0 : 1; } @@ -595,7 +597,6 @@ struct ib_qp *ehca_create_qp(struct ib_p my_qp->ib_qp.recv_cq = init_attr->recv_cq; my_qp->ib_qp.send_cq = init_attr->send_cq; - my_qp->ib_qp.qp_num = my_qp->real_qp_num; my_qp->ib_qp.qp_type = init_attr->qp_type; my_qp->qp_type = init_attr->qp_type; -- 1.4.2.1 From fenkes at de.ibm.com Wed May 9 04:48:20 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Wed, 9 May 2007 13:48:20 +0200 Subject: [ofa-general] [PATCH 4/6] IB/ehca: remove _irqsave, move #ifdef Message-ID: <200705091348.20808.fenkes@de.ibm.com> - In ehca_process_eq(), we're IRQ safe throughout the whole function, so we don't need another _irqsave in the middle of flight. - take_over_work() is only called by comp_pool_callback(), so it can move into the same #ifdef block. 
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_irq.c | 7 +++---- 1 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index f284be1..f172013 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -517,12 +517,11 @@ void ehca_process_eq(struct ehca_shca *s else { struct ehca_cq *cq = eq->eqe_cache[i].cq; comp_event_callback(cq); - spin_lock_irqsave(&ehca_cq_idr_lock, flags); + spin_lock(&ehca_cq_idr_lock); cq->nr_events--; if (!cq->nr_events) wake_up(&cq->wait_completion); - spin_unlock_irqrestore(&ehca_cq_idr_lock, - flags); + spin_unlock(&ehca_cq_idr_lock); } } else { ehca_dbg(&shca->ib_device, "Got non completion event"); @@ -711,6 +710,7 @@ static void destroy_comp_task(struct ehc kthread_stop(task); } +#ifdef CONFIG_HOTPLUG_CPU static void take_over_work(struct ehca_comp_pool *pool, int cpu) { @@ -735,7 +735,6 @@ static void take_over_work(struct ehca_c } -#ifdef CONFIG_HOTPLUG_CPU static int comp_pool_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) -- 1.4.2.1 From fenkes at de.ibm.com Wed May 9 04:48:31 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Wed, 9 May 2007 13:48:31 +0200 Subject: [ofa-general] [PATCH 6/6] IB/ehca: disable scaling code by default, bump version number Message-ID: <200705091348.31742.fenkes@de.ibm.com> - Scaling code is still considered experimental, so disable it by default - Increase version to SVNEHCA_0023 Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_main.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index dc736e8..233788a 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -52,7 +52,7 @@ #include "hcp_if.h" MODULE_LICENSE("Dual BSD/GPL"); MODULE_AUTHOR("Christoph Raisch "); 
MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver"); -MODULE_VERSION("SVNEHCA_0022"); +MODULE_VERSION("SVNEHCA_0023"); int ehca_open_aqp1 = 0; int ehca_debug_level = 0; @@ -62,7 +62,7 @@ int ehca_use_hp_mr = 0; int ehca_port_act_time = 30; int ehca_poll_all_eqs = 1; int ehca_static_rate = -1; -int ehca_scaling_code = 1; +int ehca_scaling_code = 0; module_param_named(open_aqp1, ehca_open_aqp1, int, 0); module_param_named(debug_level, ehca_debug_level, int, 0); @@ -799,7 +799,7 @@ int __init ehca_module_init(void) int ret; printk(KERN_INFO "eHCA Infiniband Device Driver " - "(Rel.: SVNEHCA_0022)\n"); + "(Rel.: SVNEHCA_0023)\n"); idr_init(&ehca_qp_idr); idr_init(&ehca_cq_idr); spin_lock_init(&ehca_qp_idr_lock); -- 1.4.2.1 From jsquyres at cisco.com Wed May 9 04:51:07 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 9 May 2007 07:51:07 -0400 Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <46415DFE.9030807@voltaire.com> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com> <463FCA42.3000104@indiana.edu> <46415DFE.9030807@voltaire.com> Message-ID: On May 9, 2007, at 1:37 AM, Or Gerlitz wrote: > Doing a bit of zoom out from the "how to make ofed's udapl work for > ompi" thread, my thinking is that the ompi udapl btl enablement is > actually only the first step, where for production/longterm/etc you > want to have an rdmacm btl. I think this is a bit of a misunderstanding. The "BTL" in Open MPI is a byte transfer layer; it is a point-to-point abstraction for moving bytes between two processes. BTL components (read: plugins) are typically distinguished by the underlying protocols used. For example, we have an RC verbs-based BTL and we have a separate uDAPL-based BTL. Andrew is also working on a research-quality UD verbs-based BTL. 
Hence, how a particular BTL component makes connections between process peers is really a side-effect of moving bytes around, and not the focus of the BTL. So having a "rdmacm" BTL doesn't really make sense. If both the RC and UD verbs-based BTLs someday use the RDMA CM for connections, we might abstract the connection management out to a common piece of code between the two. But that's a different issue. If we end up having a mixed BTL someday that uses both RC and UD, then the need for the common code may go away. But that's in the future. > Reasoning here is made of many arguments, among them the quickest I > can make are: > > A) it seems that ompi would want to use not only RC but rather also > UD multicast and unicast, which are not covered by udapl > > B) there's actually no real justification to maintain two APIs > (namely udapl vs libibverbs/librdmacm), so down the road, only one > of them would survive (udapl is implemented ***over*** libibverbs/ > librdmacm so if the latter dies, so does udapl). Specifically, I > hear here and there that the OFED stack is now on its way to be > deployed all over the place, specifically in commercial Unix OSs > (which want modern! code that supports IPoIB-CM, RDS, SRP, iSER, etc., > you name it) so eventually the rdmacm btl can be used also over > Solaris et al. I think that's not quite the point. 1. A piece of history: the uDAPL BTL was originally developed by a grad student just as an excuse to learn the BTL interface and OMPI internals. We already had an RC verbs-based BTL at the time. 2. When Sun joined Open MPI, they took over the development and maintenance of the uDAPL BTL because uDAPL is the only high performance stack on Solaris. 3. It's fine that Sun will someday support the same verbs interface that OFED does. But *today*, they don't. So for their current customers, they need to support uDAPL. 
As such, we have done little/no testing of uDAPL on OFED since Sun took over the uDAPL BTL -- all testing since that point has been on Solaris uDAPL. All of our Linux/OFED efforts have been on the verbs interface. 4. The Open MPI focus on uDAPL over OFED at the moment is simply to jump-start iWARP testing. Both NetEffect and Chelsio have chimed in to say that they will do the RDMA CM work for Open MPI, but uDAPL can be used as a temporary workaround that can be used [effectively] immediately while they get up to speed on the Open MPI code base and do the RDMA CM work. -- Jeff Squyres Cisco Systems From yosefe at voltaire.com Wed May 9 04:53:33 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 09 May 2007 14:53:33 +0300 Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events In-Reply-To: <20070509112626.GA10068@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> Message-ID: <4641B63D.4010602@voltaire.com> Michael S. Tsirkin wrote: > OK. looks pretty good to me. One coding style violation I found: > > fixed -- This issue was found during partitioning & SM fail over testing. * Add a flush flag to ipoib_ib_dev_stop(), as ipoib_ib_dev_down() already has * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity * Upon PKEY_CHANGE event, schedule a work that restarts the QP * Restart child interfaces before parent. They might be up even if the parent is down * Use uncached pkey query upon qp initialization SM reconfiguration or failover possibly causes a shuffling of the values in the port pkey table. The current implementation only queries for the index of the pkey once, when it creates the device QP and after that moves it into working state, and hence does not address this scenario. 
Fix this by using the PKEY_CHANGE event as a trigger to reconfigure the device QP. Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 6 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 59 ++++++++++++++++++++--------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +-- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 11 ++--- 4 files changed, 56 insertions(+), 27 deletions(-) Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 15:46:53.000000000 +0300 +++ infiniband/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 16:45:44.000000000 +0300 @@ -202,11 +202,12 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; + struct delayed_work pkey_poll_task; struct delayed_work mcast_task; struct work_struct flush_task; struct work_struct restart_task; struct delayed_work ah_reap_task; + struct work_struct pkey_event_task; struct ib_device *ca; u8 port; @@ -333,12 +334,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 15:46:53.000000000 +0300 +++ 
infiniband/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-09 13:56:00.754030478 +0300 @@ -422,14 +422,14 @@ int ipoib_ib_dev_open(struct net_device ret = ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -481,7 +481,7 @@ int ipoib_ib_dev_down(struct net_device if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { mutex_lock(&pkey_mutex); set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); + cancel_delayed_work(&priv->pkey_poll_task); mutex_unlock(&pkey_mutex); if (flush) flush_workqueue(ipoib_workqueue); @@ -508,7 +508,7 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +581,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,13 +623,21 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct *work) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { + mutex_lock(&priv->vlan_mutex); + + /* Flush any child interfaces too - + * they might be up even if the parent is down */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, restart_qp); + 
+ mutex_unlock(&priv->vlan_mutex); + + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); return; } @@ -642,6 +651,11 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_down(dev, 0); + if (restart_qp) { + ipoib_ib_dev_stop(dev, 0); + ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +664,24 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task); - /* Flush any child interfaces too */ - list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - mutex_unlock(&priv->vlan_mutex); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_event_task); + + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void ipoib_ib_dev_cleanup(struct net_device *dev) @@ -685,7 +709,7 @@ void ipoib_ib_dev_cleanup(struct net_dev void ipoib_pkey_poll(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); + container_of(work, struct ipoib_dev_priv, pkey_poll_task.work); struct net_device *dev = priv->dev; ipoib_pkey_dev_check_presence(dev); @@ -696,7 +720,7 @@ void ipoib_pkey_poll(struct work_struct mutex_lock(&pkey_mutex); if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); } @@ -715,7 +739,7 @@ int 
ipoib_pkey_dev_delay_open(struct net mutex_lock(&pkey_mutex); clear_bit(IPOIB_PKEY_STOP, &priv->flags); queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); return 1; Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 15:46:53.000000000 +0300 +++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 16:45:44.000000000 +0300 @@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev) return -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 15:46:53.000000000 +0300 +++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-09 14:51:32.684627634 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ -#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,7 +47,7 @@ int 
ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ret = -ENXIO; goto out; @@ -103,7 +101,7 @@ int ipoib_init_qp(struct net_device *dev * The port has to be assigned to the respective IB partition in * advance. */ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); + ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); if (ret) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); return ret; @@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE || @@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler record->element.port_num == priv->port) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE && + record->element.port_num == priv->port) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_event_task); } } From mst at dev.mellanox.co.il Wed May 9 05:07:42 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 9 May 2007 15:07:42 +0300 Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events In-Reply-To: <4641B63D.4010602@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> Message-ID: <20070509120742.GC10068@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: [PATCHv3 2/2] ipoib: handle pkey change events > > Michael S. Tsirkin wrote: > > OK. looks pretty good to me. One coding style violation I found: > > > > > fixed OK, Ack for this latest revision. I'm quite happy with the latest state of these 2 patches - they are small, clean, fix a real bug, and move us in the direction of gradually phasing out the cache as we agreed we want to. Please post the final revision of them (in a new thread), so it's clear for Roland what to take up for 2.6.22 (you can label them [PATCHv4 for-2.6.22 1 of 2] for clarity). We'll also stick them in OFED assuming no one objects by tomorrow. -- MST From yosefe at voltaire.com Wed May 9 05:14:56 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 09 May 2007 15:14:56 +0300 Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events In-Reply-To: <20070509120742.GC10068@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> <20070509120742.GC10068@mellanox.co.il> Message-ID: <4641BB40.9090208@voltaire.com> Michael S. Tsirkin wrote: >>Quoting Yosef Etigin : >>Subject: Re: [PATCHv3 2/2] ipoib: handle pkey change events >> >>Michael S. Tsirkin wrote: >> >>>OK. looks pretty good to me. 
One coding style violation I found: >>> >>> >> >>fixed > > > OK, Ack for this latest revision. > I'm quite happy with the latest state of these 2 patches - they are small, > clean, fix a real bug, and move us in the direction of gradually > phasing out the cache as we agreed we want to. > > Please post the final revision of them (in a new > thread), so it's clear for Roland what to take up for 2.6.22 > (you can label them [PATCHv4 for-2.6.22 1 of 2] for clarity). > > We'll also stick them in OFED assuming no one objects by tomorrow. > ACK From yosefe at voltaire.com Wed May 9 05:17:09 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 09 May 2007 15:17:09 +0300 Subject: [ofa-general] [PATCHv4 for-2.6.22 0 of 2] pkey change handling - fix bug #577 Message-ID: <4641BBC5.7040106@voltaire.com> These two patches fix bug #577: PKey table reordering caused by SM failover stops ipoib traffic patch 1: add uncached device queries to core patch 2: restart ipoib qp on pkey change event, and use uncached queries on qp init -- From yosefe at voltaire.com Wed May 9 05:20:42 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 09 May 2007 15:20:42 +0300 Subject: [ofa-general] [PATCHv4 for-2.6.22 1 of 2] core: uncached "find gid" and "find pkey" queries In-Reply-To: <4641BBC5.7040106@voltaire.com> References: <4641BBC5.7040106@voltaire.com> Message-ID: <4641BC9A.2050409@voltaire.com> * Add ib_find_gid and ib_find_pkey over uncached device queries. The calls might block but the returns are always up-to-date. * Cache pkey and gid table lengths in core to avoid port info queries.
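The uncached lookup described in the two points above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct sim_device`, `sim_query_pkey()` and `sim_find_pkey()` are hypothetical stand-ins invented for this sketch, not kernel API, and the sketch models a single port, whereas the patch keeps one cached table length per port.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_MAX_PKEYS 8

/* Hypothetical stand-in for a one-port device; not a kernel structure. */
struct sim_device {
	int      pkey_tbl_len;             /* length cached once at registration */
	uint16_t pkey_tbl[SIM_MAX_PKEYS];  /* "live" table, may be reordered     */
};

/* Stand-in for an uncached per-entry query: always reads the live table. */
static int sim_query_pkey(const struct sim_device *dev, int index,
			  uint16_t *pkey)
{
	if (index < 0 || index >= dev->pkey_tbl_len)
		return -EINVAL;
	*pkey = dev->pkey_tbl[index];
	return 0;
}

/* Walk the live table and report the index of the first matching pkey. */
static int sim_find_pkey(const struct sim_device *dev, uint16_t pkey,
			 uint16_t *index)
{
	uint16_t tmp;
	int i, ret;

	assert(index != NULL);

	for (i = 0; i < dev->pkey_tbl_len; ++i) {
		ret = sim_query_pkey(dev, i, &tmp);
		if (ret)
			return ret;        /* propagate a failed query */
		if (tmp == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -ENOENT;                    /* pkey is not in the table */
}
```

Only the table length is remembered; every entry is read at call time, so a caller sees the current ordering at the cost of one query per entry.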
Signed-off-by: Yosef Etigin --- drivers/infiniband/core/device.c | 138 +++++++++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 25 +++++++ 2 files changed, 163 insertions(+) Index: b/drivers/infiniband/core/device.c =================================================================== --- a/drivers/infiniband/core/device.c 2007-05-08 15:46:36.000000000 +0300 +++ b/drivers/infiniband/core/device.c 2007-05-09 11:47:22.096064221 +0300 @@ -149,6 +149,18 @@ static int alloc_name(char *name) return 0; } +static inline int start_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; +} + + +static inline int end_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? + 0 : device->phys_port_cnt; +} + /** * ib_alloc_device - allocate an IB device struct * @size:size of structure to allocate @@ -208,6 +220,55 @@ static int add_client_context(struct ib_ return 0; } +/* read the lengths of pkey,gid tables on each port */ +static int read_port_table_lengths(struct ib_device *device) +{ + struct ib_port_attr *tprops = NULL; + int num_ports, ret = -ENOMEM; + u8 port_index; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + goto out; + + num_ports = end_port(device) - start_port(device) + 1; + + device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len * + num_ports, GFP_KERNEL); + if (!device->pkey_tbl_len) + goto out; + + device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len * + num_ports, GFP_KERNEL); + if (!device->gid_tbl_len) + goto err1; + + for (port_index = 0; port_index < num_ports; ++port_index) { + ret = ib_query_port(device, port_index + start_port(device), + tprops); + if (ret) + goto err2; + device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len; + device->gid_tbl_len[port_index] = tprops->gid_tbl_len; + } + + ret = 0; + goto out; +err2: + kfree(device->gid_tbl_len); +err1: + kfree(device->pkey_tbl_len); +out: + kfree(tprops); + return ret; +} + +static inline void 
free_port_table_lengths(struct ib_device *device) +{ + kfree(device->gid_tbl_len); + kfree(device->pkey_tbl_len); +} + /** * ib_register_device - Register an IB device with IB core * @device:Device to register @@ -239,6 +300,13 @@ int ib_register_device(struct ib_device spin_lock_init(&device->event_handler_lock); spin_lock_init(&device->client_data_lock); + ret = read_port_table_lengths(device); + if (ret) { + printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n", + device->name); + goto out; + } + ret = ib_device_register_sysfs(device); if (ret) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", @@ -284,6 +352,8 @@ void ib_unregister_device(struct ib_devi list_del(&device->core_list); + free_port_table_lengths(device); + mutex_unlock(&device_mutex); spin_lock_irqsave(&device->client_data_lock, flags); @@ -592,6 +662,74 @@ int ib_modify_port(struct ib_device *dev } EXPORT_SYMBOL(ib_modify_port); +/** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index) +{ + union ib_gid tmp_gid; + int ret, port, i, tbl_len; + + for (port = start_port(device); port <= end_port(device); ++port) { + tbl_len = device->gid_tbl_len[port - start_port(device)]; + for (i = 0; i < tbl_len; ++i) { + ret = ib_query_gid(device, port, i, &tmp_gid); + if (ret) + goto out; + if (!memcmp(&tmp_gid, gid, sizeof *gid)) { + *port_num = port; + *index = i; + ret = 0; + goto out; + } + } + } + ret = -ENOENT; +out: + return ret; +} +EXPORT_SYMBOL(ib_find_gid); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. 
+ * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. + */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index) +{ + int ret, i, tbl_len; + u16 tmp_pkey; + + tbl_len = device->pkey_tbl_len[port_num - start_port(device)]; + for (i = 0; i < tbl_len; ++i) { + ret = ib_query_pkey(device, port_num, i, &tmp_pkey); + if (ret) + goto out; + + if (pkey == tmp_pkey) { + *index = i; + ret = 0; + goto out; + } + } + ret = -ENOENT; + +out: + return ret; +} +EXPORT_SYMBOL(ib_find_pkey); + static int __init ib_core_init(void) { int ret; Index: b/include/rdma/ib_verbs.h =================================================================== --- a/include/rdma/ib_verbs.h 2007-05-08 15:45:45.000000000 +0300 +++ b/include/rdma/ib_verbs.h 2007-05-09 11:47:55.006221894 +0300 @@ -1058,6 +1058,8 @@ struct ib_device { __be64 node_guid; u8 node_type; u8 phys_port_cnt; + int *pkey_tbl_len; + int *gid_tbl_len; }; struct ib_client { @@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev struct ib_port_modify *port_modify); /** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. + * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. 
+ */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index); + +/** * ib_alloc_pd - Allocates an unused protection domain. * @device: The device on which to allocate the protection domain. * From yosefe at voltaire.com Wed May 9 05:20:44 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 09 May 2007 15:20:44 +0300 Subject: [ofa-general] [PATCHv4 for-2.6.22 2 of 2] ipoib: handle pkey change events In-Reply-To: <4641BBC5.7040106@voltaire.com> References: <4641BBC5.7040106@voltaire.com> Message-ID: <4641BC9C.8060501@voltaire.com> This issue was found during partitioning & SM failover testing. * Add a flush flag to ipoib_ib_dev_stop(), like the one ipoib_ib_dev_down() takes * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity * Upon PKEY_CHANGE event, schedule a work item that restarts the QP * Restart child interfaces before parent. They might be up even if the parent is down * Use uncached pkey query upon qp initialization SM reconfiguration or failover can shuffle the values in the port pkey table. The current implementation queries for the index of the pkey only once, when it creates the device QP and moves it into the working state, and hence does not address this scenario. Fix this by using the PKEY_CHANGE event as a trigger to reconfigure the device QP.
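To make the failure mode described above concrete, here is a small user-space sketch. It is not the driver code: `struct sim_qp` and the `sim_*` helpers are hypothetical names made up for this illustration, and the sketch merely repeats the index lookup, whereas the patch stops and reopens the QP so that its init path performs the lookup again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_TBL_LEN 4

/* Hypothetical QP state: the pkey it wants and the table index it was given. */
struct sim_qp {
	uint16_t pkey;
	int      pkey_index;   /* -1 when the pkey is not in the table */
};

/* Linear search of the current table; returns the index or -1. */
static int sim_lookup(const uint16_t *tbl, size_t len, uint16_t pkey)
{
	size_t i;

	assert(tbl != NULL);

	for (i = 0; i < len; ++i)
		if (tbl[i] == pkey)
			return (int)i;
	return -1;
}

/* Pre-patch behaviour: the index is resolved once, at QP creation. */
static void sim_qp_init(struct sim_qp *qp, const uint16_t *tbl, size_t len,
			uint16_t pkey)
{
	qp->pkey = pkey;
	qp->pkey_index = sim_lookup(tbl, len, pkey);
}

/* What a PKEY_CHANGE handler has to achieve: resolve the index again. */
static void sim_on_pkey_change(struct sim_qp *qp, const uint16_t *tbl,
			       size_t len)
{
	qp->pkey_index = sim_lookup(tbl, len, qp->pkey);
}

/* Non-zero when the stored index still selects the QP's pkey. */
static int sim_qp_index_valid(const struct sim_qp *qp, const uint16_t *tbl,
			      size_t len)
{
	return qp->pkey_index >= 0 && (size_t)qp->pkey_index < len &&
	       tbl[qp->pkey_index] == qp->pkey;
}
```

If two table entries are swapped after `sim_qp_init()`, `sim_qp_index_valid()` reports the stored index as stale until `sim_on_pkey_change()` runs, which is the window the PKEY_CHANGE handling is meant to close.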
Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 6 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 59 ++++++++++++++++++++--------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +-- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 11 ++--- 4 files changed, 56 insertions(+), 27 deletions(-) Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 15:46:53.000000000 +0300 +++ infiniband/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 16:45:44.000000000 +0300 @@ -202,11 +202,12 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; + struct delayed_work pkey_poll_task; struct delayed_work mcast_task; struct work_struct flush_task; struct work_struct restart_task; struct delayed_work ah_reap_task; + struct work_struct pkey_event_task; struct ib_device *ca; u8 port; @@ -333,12 +334,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 15:46:53.000000000 +0300 +++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-09 13:56:00.754030478 +0300 @@ -422,14 +422,14 @@ 
int ipoib_ib_dev_open(struct net_device ret = ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -481,7 +481,7 @@ int ipoib_ib_dev_down(struct net_device if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { mutex_lock(&pkey_mutex); set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); + cancel_delayed_work(&priv->pkey_poll_task); mutex_unlock(&pkey_mutex); if (flush) flush_workqueue(ipoib_workqueue); @@ -508,7 +508,7 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +581,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,13 +623,21 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct *work) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int restart_qp) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { + mutex_lock(&priv->vlan_mutex); + + /* Flush any child interfaces too - + * they might be up even if the parent is down */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, restart_qp); + + mutex_unlock(&priv->vlan_mutex); + + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { ipoib_dbg(priv, 
"Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); return; } @@ -642,6 +651,11 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_down(dev, 0); + if (restart_qp) { + ipoib_ib_dev_stop(dev, 0); + ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +664,24 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task); - /* Flush any child interfaces too */ - list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - mutex_unlock(&priv->vlan_mutex); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_event_task); + + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void ipoib_ib_dev_cleanup(struct net_device *dev) @@ -685,7 +709,7 @@ void ipoib_ib_dev_cleanup(struct net_dev void ipoib_pkey_poll(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); + container_of(work, struct ipoib_dev_priv, pkey_poll_task.work); struct net_device *dev = priv->dev; ipoib_pkey_dev_check_presence(dev); @@ -696,7 +720,7 @@ void ipoib_pkey_poll(struct work_struct mutex_lock(&pkey_mutex); if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); } @@ -715,7 +739,7 @@ int ipoib_pkey_dev_delay_open(struct net mutex_lock(&pkey_mutex); clear_bit(IPOIB_PKEY_STOP, &priv->flags); queue_delayed_work(ipoib_workqueue, 
- &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); return 1; Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 15:46:53.000000000 +0300 +++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 16:45:44.000000000 +0300 @@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev) return -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: infiniband/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- infiniband.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 15:46:53.000000000 +0300 +++ infiniband/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-09 14:51:32.684627634 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ -#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + if 
(ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ret = -ENXIO; goto out; @@ -103,7 +101,7 @@ int ipoib_init_qp(struct net_device *dev * The port has to be assigned to the respective IB partition in * advance. */ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); + ret = ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); if (ret) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); return ret; @@ -260,7 +258,6 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE || @@ -268,5 +265,9 @@ void ipoib_event(struct ib_event_handler record->element.port_num == priv->port) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE && + record->element.port_num == priv->port) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_event_task); } } From mst at dev.mellanox.co.il Wed May 9 05:23:00 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 May 2007 15:23:00 +0300 Subject: [ofa-general] Re: [PATCHv4 for-2.6.22 1 of 2] core: uncached "find gid" and "find pkey" queries In-Reply-To: <4641BC9A.2050409@voltaire.com> References: <4641BBC5.7040106@voltaire.com> <4641BC9A.2050409@voltaire.com> Message-ID: <20070509122300.GE10068@mellanox.co.il> > Quoting Yosef Etigin : > Subject: [PATCHv4 for-2.6.22 1 of 2] core: uncached "find gid" and "find pkey" queries > > > * Add ib_find_gid and ib_find_pkey over uncached device queries. > The calls might block but the returns are always up-to-date. > * Cache pky,gid table lengths in core to avoid port info queries. 
> > > Signed-off-by: Yosef Etigin Acked-by: Michael S. Tsirkin -- MST From mst at dev.mellanox.co.il Wed May 9 05:23:37 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 May 2007 15:23:37 +0300 Subject: [ofa-general] Re: [PATCHv4 for-2.6.22 2 of 2] ipoib: handle pkey change events In-Reply-To: <4641BC9C.8060501@voltaire.com> References: <4641BBC5.7040106@voltaire.com> <4641BC9C.8060501@voltaire.com> Message-ID: <20070509122337.GF10068@mellanox.co.il> > Quoting Yosef Etigin : > Subject: [PATCHv4 for-2.6.22 2 of 2] ipoib: handle pkey change events > > > This issue was found during partitioning & SM fail over testing. > > * Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike > * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity > * Upon PKEY_CHANGE event, schedule a work that restarts the QP > * Restart child interfaces before parent. They might be up even if the > parent is down > * Use uncached pkey query upon qp initiallization > > SM reconfiguration or failover possibly causes a shuffling of the values > in the port pkey table. The current implementation only queries for the > index of the pkey once, when it creates the device QP and after that moves > it into working state, and hence does not address this scenario. Fix this > by using the PKEY_CHANGE event as a trigger to reconfigure the device QP. > > Signed-off-by: Yosef Etigin Acked-by: Michael S. Tsirkin -- MST From mst at dev.mellanox.co.il Wed May 9 05:25:58 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 9 May 2007 15:25:58 +0300 Subject: [ofa-general] Re: [PATCHv4 for-2.6.22 0 of 2] pkey change handling - fix bug #577 In-Reply-To: <4641BBC5.7040106@voltaire.com> References: <4641BBC5.7040106@voltaire.com> Message-ID: <20070509122558.GG10068@mellanox.co.il> > Quoting Yosef Etigin : > Subject: [PATCHv4 for-2.6.22 0 of 2] pkey change handling - fix bug #577 > > These two patches fix bug #577: PKey table reordering caused by SM failover stops ipoib traffic > patch 1: add uncached device queries to core > patch 2: restart ipoib qp on pkey change event, and use uncached queries on qp init Looks good to go to me - not only do these two patches fix a real bug, but they also move us in the direction we decided we want to go - getting rid of the cache. Acked-by: Michael S. Tsirkin Roland, what do you think? -- MST From cap at nsc.liu.se Wed May 9 05:39:57 2007 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Wed, 9 May 2007 14:39:57 +0200 Subject: [ofa-general] ofa_1_2_kernel 20070508-0200 daily build status In-Reply-To: <20070508093812.9A193E603C1@openfabrics.org> References: <20070508093812.9A193E603C1@openfabrics.org> Message-ID: <200705091440.01872.cap@nsc.liu.se> Not related to the failed 2.6.21.1 below, but, are there any plans to add the EL5 and EL4u5 kernels to the list? (2.6.18-8.el5 and 2.6.9-55.{EL,ELsmp}). Also, out of curiosity, what is the 2.6.16.43-0.3 (below)? /Peter On Tuesday 08 May 2007, Vladimir Sokolovsky wrote: ... 
> Passed on x86_64 with linux-2.6.16.43-0.3-smp > Passed on ia64 with linux-2.6.16 > Passed on ia64 with linux-2.6.17 > Passed on x86_64 with linux-2.6.16.21-0.8-smp > Passed on ia64 with linux-2.6.19 > Passed on x86_64 with linux-2.6.9-42.ELsmp > Passed on x86_64 with linux-2.6.9-34.ELsmp > Passed on x86_64 with linux-2.6.9-22.ELsmp > Passed on x86_64 with linux-2.6.18-1.2798.fc6 > Passed on ia64 with linux-2.6.16.21-0.8-default > > Failed: > Build failed on i686 with linux-2.6.21.1
From mst at dev.mellanox.co.il Wed May 9 05:45:21 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 May 2007 15:45:21 +0300 Subject: [ofa-general] ofa_1_2_kernel 20070508-0200 daily build status In-Reply-To: <200705091440.01872.cap@nsc.liu.se> References: <20070508093812.9A193E603C1@openfabrics.org> <200705091440.01872.cap@nsc.liu.se> Message-ID: <20070509124521.GI10068@mellanox.co.il> > Quoting Peter Kjellstrom : > Subject: Re: [ofa-general] ofa_1_2_kernel 20070508-0200 daily build status > > Not related to the failed 2.6.21.1 below, but, are there any plans to add the > EL5 and EL4u5 kernels to the list? (2.6.18-8.el5 and 2.6.9-55.{EL,ELsmp}). We do test on them locally, haven't the time to prepare these for cross-build yet. Can you do this? > Also, out of curiosity, what is the 2.6.16.43-0.3 (below)? SLES10 I think.
-- MST From fenkes at de.ibm.com Wed May 9 05:46:23 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Wed, 9 May 2007 14:46:23 +0200 Subject: [ofa-general] [PATCH 0/6] IB/ehca: Assorted patches Message-ID: <200705091446.23783.fenkes@de.ibm.com> Here's a set of patches containing various improvements and bugfixes for the IBM eHCA InfiniBand driver, bumping the version number to SVNEHCA_0023. The patches are, in detail: #1 - Serialize hypervisor calls in ehca_register_mr() #2 - correctly set GRH mask bit in ehca_modify_qp() #3 - Fix AQP0/1 QP number #4 - remove _irqsave where it's not needed; move an #ifdef to where it makes even better sense #5 - beautify sysfs attribute code and fix compiler warnings #6 - disable scaling code by default and bump version number The patches are ready for inclusion into 2.6.22 and apply on top of Roland's git tree (which has been pulled by Linus recently, so they should apply there, too). Cheers, Joachim -- Joachim Fenkes  --  eHCA Linux Driver Developer and Hardware Tamer IBM Deutschland Entwicklung GmbH  --  Dept. 3627 (I/O Firmware Dev. 2) Schoenaicher Strasse 220  --  71032 Boeblingen  --  Germany eMail: fenkes at de.ibm.com  --  Phone: +49 7031 16 1239 From afriedle at open-mpi.org Wed May 9 06:13:19 2007 From: afriedle at open-mpi.org (Andrew Friedley) Date: Wed, 09 May 2007 09:13:19 -0400 Subject: [OMPI users] [ofa-general] Re: openMPI over uDAPL doesn't work In-Reply-To: <46417748.4020602@lfbs.rwth-aachen.de> References: <462E13A6.3030207@lfbs.rwth-aachen.de> <462E1DFE.5010703@Sun.COM> <46305D0A.5020900@lfbs.rwth-aachen.de> <4630EFDE.8070608@Sun.COM> <464044D4.5010501@lfbs.rwth-aachen.de> <054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com> <4640CACE.8070201@ichips.intel.com> <1178655353.11455.14.camel@stevo-desktop> <46417748.4020602@lfbs.rwth-aachen.de> Message-ID: <4641C8EF.5080709@open-mpi.org> You say that fixes the problem, does it work even when running more than one MPI process per node? 
(that is the case the hack fixes) Simply doing an mpirun with a -np parameter higher than the number of nodes you have set up should trigger this case, and making sure to use '-mca btl udapl,self' (i.e., not SM or anything else). Andrew Boris Bierbaum wrote: > It has been explained in a different thread on [ofa-general] that the > problem lies in a combination of the OpenIB-cma provider not setting the > local and remote port numbers on endpoints correctly and Open MPI > stepping over the IA to save the port number to circumvent this problem, > thereby confusing the provider. > > I commented out line 197 in ompi/mca/btl/udapl/btl_udapl.c (Open MPI > 1.2.1 release) and this fixes the problem. As the problem in the > provider is currently being fixed, the whole saving of the port number > in the uDAPL BTL code will be unnecessary in the future. > > Steve Wise wrote: >>>> Can the UDAPL OFED wizards shed any light on the error messages that >>>> are listed below? In particular, these seem to be worrisome: >>>> >>>>> setup_listener Permission denied >>>> setup_listener Address already in use >>> These failures are from rdma_cm_bind indicating the port is already >>> bound to this IA address. How are you creating the service point? >>> dat_psp_create or dat_psp_create_any? If it is psp_create_any then you >>> will see some failures until it gets to a free port. That is normal. >>> Just make sure your create call returns DAT_SUCCESS. >>> >> Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down >> and let the rdma-cma pick an available port number?
>> >> >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > > From swise at opengridcomputing.com Wed May 9 06:41:30 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 09 May 2007 08:41:30 -0500 Subject: [ofa-general] OMPI over ofed udapl - bugs opened In-Reply-To: <4640FDE9.9010000@ichips.intel.com> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> Message-ID: <1178718090.382.4.camel@stevo-desktop> 606 opened to track the udapl change. 607 opened to track the ompi change to remove the port number stashing hack. Status: I have a patch from Arlin to test today. I will test with that patch and with the OMPI port hack removed. Stay tuned... Steve. On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote: > Steve Wise wrote: > > >I would like the group to consider including changes needed to OMPI > >and/or ofa udapl to get OMPI working again on udapl for ofed-1.2. > > > >This will provide OMPI support over iwarp devices via udapl until we can > >get rdma-cm support added to OMPI. > > > > > >Steve. > > > > > > > Steve,cCan you open a bug to track this? 

From boris at lfbs.RWTH-Aachen.DE Wed May 9 06:50:30 2007
From: boris at lfbs.RWTH-Aachen.DE (Boris Bierbaum)
Date: Wed, 09 May 2007 15:50:30 +0200
Subject: [OMPI users] [ofa-general] Re: openMPI over uDAPL doesn't work
In-Reply-To: <4641C8EF.5080709@open-mpi.org>
References: <462E13A6.3030207@lfbs.rwth-aachen.de>
 <462E1DFE.5010703@Sun.COM>
 <46305D0A.5020900@lfbs.rwth-aachen.de>
 <4630EFDE.8070608@Sun.COM>
 <464044D4.5010501@lfbs.rwth-aachen.de>
 <054A73AF-4EEB-4269-8DBC-0D39E1ADC08B@cisco.com>
 <4640CACE.8070201@ichips.intel.com>
 <1178655353.11455.14.camel@stevo-desktop>
 <46417748.4020602@lfbs.rwth-aachen.de>
 <4641C8EF.5080709@open-mpi.org>
Message-ID: <4641D1A6.30603@lfbs.rwth-aachen.de>

I've run the whole IMB Benchmark Suite on 2, 3, and 4 nodes with 2
processes per node and --mca btl udapl,self. I didn't encounter any
problems.

The comment above line 197 says that dat_ep_query() returns wrong port
numbers (which it does indeed), but I can't find any call to
dat_ep_query() in the uDAPL BTL code. Maybe the comment is out of date?

Boris

Andrew Friedley wrote:
> You say that fixes the problem, does it work even when running more than
> one MPI process per node? (that is the case the hack fixes) Simply
> doing an mpirun with a -np parameter higher than the number of nodes you
> have set up should trigger this case, and making sure to use '-mca btl
> udapl,self' (ie not SM or anything else).
>
> Andrew
>
> Boris Bierbaum wrote:
>> It has been explained in a different thread on [ofa-general] that the
>> problem lies in a combination of the OpenIB-cma provider not setting the
>> local and remote port numbers on endpoints correctly and Open MPI
>> stepping over the IA to save the port number to circumvent this problem,
>> thereby confusing the provider.
>>
>> I commented out line 197 in ompi/mca/btl/udapl/btl_udapl.c (Open MPI
>> 1.2.1 release) and this fixes the problem. As the problem in the
>> provider is currently being fixed, the whole saving of the port number
>> in the uDAPL BTL code will be unnecessary in the future.
>>
>> Steve Wise wrote:
>>>>> Can the UDAPL OFED wizards shed any light on the error messages that
>>>>> are listed below? In particular, these seem to be worrisome:
>>>>>
>>>>>> setup_listener Permission denied
>>>>> setup_listener Address already in use
>>>> These failures are from rdma_cm_bind indicating the port is already
>>>> bound to this IA address. How are you creating the service point?
>>>> dat_psp_create or dat_psp_create_any? If it is psp_create_any then you
>>>> will see some failures until it gets to a free port. That is normal.
>>>> Just make sure your create call returns DAT_SUCCESS.
>>>>
>>> Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down
>>> and let the rdma-cma pick an available port number?
>>>
>>>
>>>
>>> _______________________________________________
>>> general mailing list
>>> general at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>
>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>>>
>>
> _______________________________________________
> users mailing list
> users at open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

--
| _       RWTH | Boris Bierbaum
|_|_`_         | Lehrstuhl fuer Betriebssysteme
  | |_) _      | RWTH Aachen D-52056 Aachen
    |_)(_`     | Tel: +49-241-80-27805
       ._)     | Fax: +49-241-80-22339

From erezz at voltaire.com Wed May 9 06:54:29 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 09 May 2007 16:54:29 +0300
Subject: [ofa-general] [PATCH 0/2] IB/iser: Add open-iscsi over iSER support
 for RHAS4 up3 & up4 in OFED
Message-ID: <4641D295.5060907@voltaire.com>

The following patches add the required backports & kernel addons in
order to support open-iscsi over iSER in RHAS4 up3 & up4 in OFED
(currently SLES 10, SLES 10 sp1 & RHEL 5 are supported).
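
For readers following the series, the kernel-version test that patch 1/2
extends in build_env.sh can be sketched roughly as below. This is only a
minimal illustration: the three K_VER patterns are copied from that patch,
while the function name iser_kernel_supported and the sample version strings
are assumptions made for the example and do not appear in the OFED scripts.

```shell
#!/bin/sh
# Sketch only. iser_kernel_supported is a hypothetical helper, not a function
# in the OFED scripts; the patterns are copied from the build_env.sh hunks.
iser_kernel_supported() {
    case "$1" in
        2.6.16.*-*-*|2.6.*.el5|2.6.9-*.EL*)
            return 0 ;;   # one of the three patterns matched
        *)
            return 1 ;;   # no match: ib_iser would not be added
    esac
}

# Sample strings are illustrative, not taken from the patch.
for kver in 2.6.9-34.EL 2.6.18-8.el5 2.6.16.21-0.8-smp 2.6.20; do
    if iser_kernel_supported "$kver"; then
        echo "$kver: ib_iser selected"
    else
        echo "$kver: ib_iser skipped"
    fi
done
```

As far as the pattern itself goes, 2.6.9-*.EL* accepts any 2.6.9 EL kernel
string, not only the up3 and up4 kernels for which patch 2/2 adds backport
directories.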

--
____________________________________________________________

Erez Zilber | 972-9-971-7689
Software Engineer, Storage Team
Voltaire – _The Grid Backbone_
www.voltaire.com

From erezz at voltaire.com Wed May 9 06:57:01 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 09 May 2007 16:57:01 +0300
Subject: [ofa-general] [PATCH 1/2] IB/iser: add open-iscsi over iSER support
 for RHAS4 in OFED scripts
In-Reply-To: <4641D295.5060907@voltaire.com>
References: <4641D295.5060907@voltaire.com>
Message-ID: <4641D32D.6030505@voltaire.com>

Add support for open-iscsi over iSER in RHAS4 in OFED's scripts.

Signed-off-by: Erez Zilber
---
 build.sh     |    2 +-
 build_env.sh |    4 ++--
 install.sh   |    2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/build.sh b/build.sh
index d54c55d..be2d1e6 100755
--- a/build.sh
+++ b/build.sh
@@ -344,7 +344,7 @@ open-iscsi()
         SuSE)
             ex "$MV -f ${RPM_DIR}/RPMS/$build_arch/${OPEN_ISCSI_SUSE_NAME}-${OPEN_ISCSI_VERSION}.${build_arch}.rpm $RPMS"
         ;;
-        redhat5)
+        redhat|redhat5)
             ex "$MV -f ${RPM_DIR}/RPMS/$build_arch/${OPEN_ISCSI_REDHAT_NAME}-${OPEN_ISCSI_VERSION}.${build_arch}.rpm $RPMS"
         ;;
         *)
diff --git a/build_env.sh b/build_env.sh
index 6e65b21..49821b4 100644
--- a/build_env.sh
+++ b/build_env.sh
@@ -135,7 +135,7 @@ IB_KERNEL_PACKAGES="${IB_KERNEL_PACKAGES
 # Iser
 # Currently iSER is supported only on SLES10 & RHEL5
 case ${K_VER} in
-    2.6.16.*-*-*|2.6.*.el5)
+    2.6.16.*-*-*|2.6.*.el5|2.6.9-*.EL*)
     IB_KERNEL_PACKAGES="${IB_KERNEL_PACKAGES} ib_iser"
     ;;
 esac
@@ -1998,7 +1998,7 @@ set_package_deps()
         ib_iser)
             # Currently iSER is supported only on SLES10 & RHEL5
             case ${K_VER} in
-                2.6.16.*-*-*|2.6.*.el5)
+                2.6.16.*-*-*|2.6.*.el5|2.6.9-*.EL*)
                 OFA_KERNEL_PACKAGES=$(echo "$OFA_KERNEL_PACKAGES ib_verbs ${ll_driver} ib_iser" | tr -s ' ' '\n' | sort -n | uniq)
                 OFA_PACKAGES=$(echo "$OFA_PACKAGES kernel-ib" | tr -s ' ' '\n' | sort -n | uniq)
                 EXTRA_PACKAGES=$(echo "$EXTRA_PACKAGES open-iscsi" | tr -s ' ' '\n' | sort -rn | uniq)
diff --git a/install.sh b/install.sh
index f9ed6da..dadc144 100755
--- a/install.sh
+++ b/install.sh
@@ -990,7 +990,7 @@ # fi
                     err_echo "${OPEN_ISCSI_SUSE_NAME}-${OPEN_ISCSI_VERSION}.${build_arch}.rpm not found under ${RPMS}."
                 fi
             ;;
-            redhat5)
+            redhat|redhat5)
                 if [ -f ${RPMS}/${OPEN_ISCSI_REDHAT_NAME}-${OPEN_ISCSI_VERSION}.${build_arch}.rpm ]; then
                     ex "$RPM -Uhv --oldpackage ${RPMS}/${OPEN_ISCSI_REDHAT_NAME}-${OPEN_ISCSI_VERSION}.${build_arch}.rpm"
                 else
--
1.4.2

From erezz at voltaire.com Wed May 9 06:58:34 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 09 May 2007 16:58:34 +0300
Subject: [ofa-general] [PATCH 2/2] IB/iser: add backport & kernel addons for
 open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <4641D295.5060907@voltaire.com>
References: <4641D295.5060907@voltaire.com>
Message-ID: <4641D38A.8040406@voltaire.com>

Add the required backport patches & kernel addons for open-iscsi over
iSER in RHAS4 up3 and up4.

Signed-off-by: Erez Zilber
---
 .../2.6.9_U3/include/linux/attribute_container.h | 71 +++
 .../backport/2.6.9_U3/include/linux/klist.h | 61 ++
 .../backport/2.6.9_U3/include/scsi/scsi_device.h | 19 +
 .../2.6.9_U3/include/scsi/scsi_transport.h | 8
 .../2.6.9_U3/include/src/attribute_container.c | 438 +++++++++++++++++
 kernel_addons/backport/2.6.9_U3/include/src/base.h | 1
 kernel_addons/backport/2.6.9_U3/include/src/init.c | 26 +
 .../backport/2.6.9_U3/include/src/klist.c | 287 +++++++++++
 .../backport/2.6.9_U3/include/src/kref_new.c | 29 +
 kernel_addons/backport/2.6.9_U3/include/src/scsi.c | 50 ++
 .../backport/2.6.9_U3/include/src/scsi_lib.c | 164 ++++++
 .../backport/2.6.9_U3/include/src/scsi_scan.c | 48 ++
 .../2.6.9_U3/include/src/transport_class.c | 280 +++++++++++
 .../2.6.9_U4/include/linux/attribute_container.h | 71 +++
 .../backport/2.6.9_U4/include/linux/klist.h | 61 ++
 .../backport/2.6.9_U4/include/scsi/scsi_device.h | 19 +
 .../2.6.9_U4/include/scsi/scsi_transport.h | 8
 .../2.6.9_U4/include/src/attribute_container.c | 438 +++++++++++++++++
kernel_addons/backport/2.6.9_U4/include/src/base.h | 1 kernel_addons/backport/2.6.9_U4/include/src/init.c | 26 + .../backport/2.6.9_U4/include/src/klist.c | 287 +++++++++++ .../backport/2.6.9_U4/include/src/kref_new.c | 29 + kernel_addons/backport/2.6.9_U4/include/src/scsi.c | 50 ++ .../backport/2.6.9_U4/include/src/scsi_lib.c | 164 ++++++ .../backport/2.6.9_U4/include/src/scsi_scan.c | 48 ++ .../2.6.9_U4/include/src/transport_class.c | 280 +++++++++++ .../backport/2.6.9_U3/add_iscsi_proto_h.patch | 591 +++++++++++++++++++++++ kernel_patches/backport/2.6.9_U3/add_iser.patch | 13 .../backport/2.6.9_U3/add_memory_h.patch | 93 ++++ .../backport/2.6.9_U3/add_open_iscsi.patch | 504 ++++++++++++++++++++ .../backport/2.6.9_U3/add_open_iscsi_h.patch | 60 ++ .../backport/2.6.9_U3/add_transport_class_h.patch | 104 ++++ .../2.6.9_U3/fix_inclusion_order_iscsi_iser.patch | 13 + .../backport/2.6.9_U3/linux_stuff_to_2_6_17.patch | 58 ++ .../2.6.9_U3/netlink-01-add_netlink_h.patch | 247 ++++++++++ .../2.6.9_U3/netlink-02-netlink_h_for_rh4.patch | 200 ++++++++ .../backport/2.6.9_U4/add_iscsi_proto_h.patch | 591 +++++++++++++++++++++++ kernel_patches/backport/2.6.9_U4/add_iser.patch | 13 .../backport/2.6.9_U4/add_memory_h.patch | 93 ++++ .../backport/2.6.9_U4/add_open_iscsi.patch | 504 ++++++++++++++++++++ .../backport/2.6.9_U4/add_open_iscsi_h.patch | 60 ++ .../backport/2.6.9_U4/add_transport_class_h.patch | 104 ++++ .../2.6.9_U4/fix_inclusion_order_iscsi_iser.patch | 13 + .../backport/2.6.9_U4/linux_stuff_to_2_6_17.patch | 58 ++ .../2.6.9_U4/netlink-01-add_netlink_h.patch | 247 ++++++++++ .../2.6.9_U4/netlink-02-netlink_h_for_rh4.patch | 200 ++++++++ 46 files changed, 6728 insertions(+), 2 deletions(-) diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/attribute_container.h b/kernel_addons/backport/2.6.9_U3/include/linux/attribute_container.h new file mode 100644 index 0000000..93bfb0b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/linux/attribute_container.h 
@@ -0,0 +1,71 @@ +/* + * class_container.h - a generic container for all classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + */ + +#ifndef _ATTRIBUTE_CONTAINER_H_ +#define _ATTRIBUTE_CONTAINER_H_ + +#include +#include +#include +#include + +struct attribute_container { + struct list_head node; + struct klist containers; + struct class *class; + struct class_device_attribute **attrs; + int (*match)(struct attribute_container *, struct device *); +#define ATTRIBUTE_CONTAINER_NO_CLASSDEVS 0x01 + unsigned long flags; +}; + +static inline int +attribute_container_no_classdevs(struct attribute_container *atc) +{ + return atc->flags & ATTRIBUTE_CONTAINER_NO_CLASSDEVS; +} + +static inline void +attribute_container_set_no_classdevs(struct attribute_container *atc) +{ + atc->flags |= ATTRIBUTE_CONTAINER_NO_CLASSDEVS; +} + +int attribute_container_register(struct attribute_container *cont); +int attribute_container_unregister(struct attribute_container *cont); +void attribute_container_create_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_add_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_remove_device(struct device *dev, + void (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_device_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *)); +int attribute_container_add_attrs(struct class_device *classdev); +int attribute_container_add_class_device(struct class_device *classdev); +int attribute_container_add_class_device_adapter(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev); 
+void attribute_container_remove_attrs(struct class_device *classdev); +void attribute_container_class_device_del(struct class_device *classdev); +struct attribute_container *attribute_container_classdev_to_container(struct class_device *); +struct class_device *attribute_container_find_class_device(struct attribute_container *, struct device *); +struct class_device_attribute **attribute_container_classdev_to_attrs(const struct class_device *classdev); + +#endif diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/klist.h b/kernel_addons/backport/2.6.9_U3/include/linux/klist.h new file mode 100644 index 0000000..7407125 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/linux/klist.h @@ -0,0 +1,61 @@ +/* + * klist.h - Some generic list helpers, extending struct list_head a bit. + * + * Implementations are found in lib/klist.c + * + * + * Copyright (C) 2005 Patrick Mochel + * + * This file is rleased under the GPL v2. + */ + +#ifndef _LINUX_KLIST_H +#define _LINUX_KLIST_H + +#include +#include +#include +#include + +struct klist_node; +struct klist { + spinlock_t k_lock; + struct list_head k_list; + void (*get)(struct klist_node *); + void (*put)(struct klist_node *); +}; + + +extern void klist_init(struct klist * k, void (*get)(struct klist_node *), + void (*put)(struct klist_node *)); + +struct klist_node { + struct klist * n_klist; + struct list_head n_node; + struct kref n_ref; + struct completion n_removed; +}; + +extern void klist_add_tail(struct klist_node * n, struct klist * k); +extern void klist_add_head(struct klist_node * n, struct klist * k); + +extern void klist_del(struct klist_node * n); +extern void klist_remove(struct klist_node * n); + +extern int klist_node_attached(struct klist_node * n); + + +struct klist_iter { + struct klist * i_klist; + struct list_head * i_head; + struct klist_node * i_cur; +}; + + +extern void klist_iter_init(struct klist * k, struct klist_iter * i); +extern void klist_iter_init_node(struct klist * k, 
struct klist_iter * i, + struct klist_node * n); +extern void klist_iter_exit(struct klist_iter * i); +extern struct klist_node * klist_next(struct klist_iter * i); + +#endif diff --git a/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_device.h b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_device.h new file mode 100644 index 0000000..f353e0b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_device.h @@ -0,0 +1,19 @@ +#ifndef _SCSI_SCSI_DEVICE_H_BACKPORT +#define _SCSI_SCSI_DEVICE_H_BACKPORT + +#include_next + +#include +#include +#include +#include +#include + +struct scsi_lun; + +extern void int_to_scsilun(unsigned int, struct scsi_lun *); +extern void scsi_target_block(struct device *); +extern void scsi_target_unblock(struct device *); +extern void starget_for_each_device(struct scsi_target *, void *, + void (*fn)(struct scsi_device *, void *)); +#endif /* _SCSI_SCSI_DEVICE_H_BACKPORT */ diff --git a/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_transport.h b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_transport.h new file mode 100644 index 0000000..99c2b12 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_transport.h @@ -0,0 +1,8 @@ +#ifndef _SCSI_SCSI_TRANSPORT_H_BACKPORT +#define _SCSI_SCSI_TRANSPORT_H_BACKPORT + +#include_next + +#include + +#endif /* _SCSI_SCSI_TRANSPORT_H_BACKPORT */ diff --git a/kernel_addons/backport/2.6.9_U3/include/src/attribute_container.c b/kernel_addons/backport/2.6.9_U3/include/src/attribute_container.c new file mode 100644 index 0000000..44948d1 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/attribute_container.c @@ -0,0 +1,438 @@ +/* + * attribute_container.c - implementation of a simple container for classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + * + * The basic idea here is to enable a device to be attached to an + * aritrary numer of classes without having to allocate storage for them. 
+ * Instead, the contained classes select the devices they need to attach + * to via a matching function. + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "base.h" + +/* This is a private structure used to tie the classdev and the + * container .. it should never be visible outside this file */ +struct internal_container { + struct klist_node node; + struct attribute_container *cont; + struct class_device classdev; +}; + +static void internal_container_klist_get(struct klist_node *n) +{ + struct internal_container *ic = + container_of(n, struct internal_container, node); + class_device_get(&ic->classdev); +} + +static void internal_container_klist_put(struct klist_node *n) +{ + struct internal_container *ic = + container_of(n, struct internal_container, node); + class_device_put(&ic->classdev); +} + + +/** + * attribute_container_classdev_to_container - given a classdev, return the container + * + * @classdev: the class device created by attribute_container_add_device. + * + * Returns the container associated with this classdev. + */ +struct attribute_container * +attribute_container_classdev_to_container(struct class_device *classdev) +{ + struct internal_container *ic = + container_of(classdev, struct internal_container, classdev); + return ic->cont; +} +EXPORT_SYMBOL_GPL(attribute_container_classdev_to_container); + +static struct list_head attribute_container_list; + +static DECLARE_MUTEX(attribute_container_mutex); + +/** + * attribute_container_register - register an attribute container + * + * @cont: The container to register. This must be allocated by the + * callee and should also be zeroed by it. 
+ */ +int +attribute_container_register(struct attribute_container *cont) +{ + INIT_LIST_HEAD(&cont->node); + klist_init(&cont->containers,internal_container_klist_get, + internal_container_klist_put); + + down(&attribute_container_mutex); + list_add_tail(&cont->node, &attribute_container_list); + up(&attribute_container_mutex); + + return 0; +} +EXPORT_SYMBOL_GPL(attribute_container_register); + +/** + * attribute_container_unregister - remove a container registration + * + * @cont: previously registered container to remove + */ +int +attribute_container_unregister(struct attribute_container *cont) +{ + int retval = -EBUSY; + down(&attribute_container_mutex); + spin_lock(&cont->containers.k_lock); + if (!list_empty(&cont->containers.k_list)) + goto out; + retval = 0; + list_del(&cont->node); + out: + spin_unlock(&cont->containers.k_lock); + up(&attribute_container_mutex); + return retval; + +} +EXPORT_SYMBOL_GPL(attribute_container_unregister); + +/* private function used as class release */ +static void attribute_container_release(struct class_device *classdev) +{ + struct internal_container *ic + = container_of(classdev, struct internal_container, classdev); + struct device *dev = classdev->dev; + + kfree(ic); + put_device(dev); +} + +/** + * attribute_container_add_device - see if any container is interested in dev + * + * @dev: device to add attributes to + * @fn: function to trigger addition of class device. + * + * This function allocates storage for the class device(s) to be + * attached to dev (one for each matching attribute_container). If no + * fn is provided, the code will simply register the class device via + * class_device_add. If a function is provided, it is expected to add + * the class device at the appropriate time. One of the things that + * might be necessary is to allocate and initialise the classdev and + * then add it a later time. 
To do this, call this routine for + * allocation and initialisation and then use + * attribute_container_device_trigger() to call class_device_add() on + * it. Note: after this, the class device contains a reference to dev + * which is not relinquished until the release of the classdev. + */ +void +attribute_container_add_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + + if (attribute_container_no_classdevs(cont)) + continue; + + if (!cont->match(cont, dev)) + continue; + + ic = kzalloc(sizeof(*ic), GFP_KERNEL); + if (!ic) { + dev_printk(KERN_ERR, dev, "failed to allocate class container\n"); + continue; + } + + ic->cont = cont; + class_device_initialize(&ic->classdev); + ic->classdev.dev = get_device(dev); + ic->classdev.class = cont->class; + cont->class->release = attribute_container_release; + strcpy(ic->classdev.class_id, dev->bus_id); + if (fn) + fn(cont, dev, &ic->classdev); + else + attribute_container_add_class_device(&ic->classdev); + klist_add_tail(&ic->node, &cont->containers); + } + up(&attribute_container_mutex); +} + +/* FIXME: can't break out of this unless klist_iter_exit is also + * called before doing the break + */ +#define klist_for_each_entry(pos, head, member, iter) \ + for (klist_iter_init(head, iter); (pos = ({ \ + struct klist_node *n = klist_next(iter); \ + n ? container_of(n, typeof(*pos), member) : \ + ({ klist_iter_exit(iter) ; NULL; }); \ + }) ) != NULL; ) + + +/** + * attribute_container_remove_device - make device eligible for removal. + * + * @dev: The generic device + * @fn: A function to call to remove the device + * + * This routine triggers device removal. 
If fn is NULL, then it is + * simply done via class_device_unregister (note that if something + * still has a reference to the classdev, then the memory occupied + * will not be freed until the classdev is released). If you want a + * two phase release: remove from visibility and then delete the + * device, then you should use this routine with a fn that calls + * class_device_del() and then use + * attribute_container_device_trigger() to do the final put on the + * classdev. + */ +void +attribute_container_remove_device(struct device *dev, + void (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + struct klist_iter iter; + + if (attribute_container_no_classdevs(cont)) + continue; + + if (!cont->match(cont, dev)) + continue; + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (dev != ic->classdev.dev) + continue; + klist_del(&ic->node); + if (fn) + fn(cont, dev, &ic->classdev); + else { + attribute_container_remove_attrs(&ic->classdev); + class_device_unregister(&ic->classdev); + } + } + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_device_trigger - execute a trigger for each matching classdev + * + * @dev: The generic device to run the trigger for + * @fn the function to execute for each classdev. + * + * This funcion is for executing a trigger when you need to know both + * the container and the classdev. If you only care about the + * container, then use attribute_container_trigger() instead. 
+ */ +void +attribute_container_device_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + struct klist_iter iter; + + if (!cont->match(cont, dev)) + continue; + + if (attribute_container_no_classdevs(cont)) { + fn(cont, dev, NULL); + continue; + } + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (dev == ic->classdev.dev) + fn(cont, dev, &ic->classdev); + } + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_trigger - trigger a function for each matching container + * + * @dev: The generic device to activate the trigger for + * @fn: the function to trigger + * + * This routine triggers a function that only needs to know the + * matching containers (not the classdev) associated with a device. + * It is more lightweight than attribute_container_device_trigger, so + * should be used in preference unless the triggering function + * actually needs to know the classdev. 
+ */ +void +attribute_container_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + if (cont->match(cont, dev)) + fn(cont, dev); + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_add_attrs - add attributes + * + * @classdev: The class device + * + * This simply creates all the class device sysfs files from the + * attributes listed in the container + */ +int +attribute_container_add_attrs(struct class_device *classdev) +{ + struct attribute_container *cont = + attribute_container_classdev_to_container(classdev); + struct class_device_attribute **attrs = cont->attrs; + int i, error; + + if (!attrs) + return 0; + + for (i = 0; attrs[i]; i++) { + error = class_device_create_file(classdev, attrs[i]); + if (error) + return error; + } + + return 0; +} + +/** + * attribute_container_add_class_device - same function as class_device_add + * + * @classdev: the class device to add + * + * This performs essentially the same function as class_device_add except for + * attribute containers, namely add the classdev to the system and then + * create the attribute files + */ +int +attribute_container_add_class_device(struct class_device *classdev) +{ + int error = class_device_add(classdev); + if (error) + return error; + return attribute_container_add_attrs(classdev); +} + +/** + * attribute_container_add_class_device_adapter - simple adapter for triggers + * + * This function is identical to attribute_container_add_class_device except + * that it is designed to be called from the triggers + */ +int +attribute_container_add_class_device_adapter(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + return attribute_container_add_class_device(classdev); +} + +/** + * attribute_container_remove_attrs - remove any attribute files + * + * 
@classdev: The class device to remove the files from + * + */ +void +attribute_container_remove_attrs(struct class_device *classdev) +{ + struct attribute_container *cont = + attribute_container_classdev_to_container(classdev); + struct class_device_attribute **attrs = cont->attrs; + int i; + + if (!attrs) + return; + + for (i = 0; attrs[i]; i++) + class_device_remove_file(classdev, attrs[i]); +} + +/** + * attribute_container_class_device_del - equivalent of class_device_del + * + * @classdev: the class device + * + * This function simply removes all the attribute files and then calls + * class_device_del. + */ +void +attribute_container_class_device_del(struct class_device *classdev) +{ + attribute_container_remove_attrs(classdev); + class_device_del(classdev); +} + +/** + * attribute_container_find_class_device - find the corresponding class_device + * + * @cont: the container + * @dev: the generic device + * + * Looks up the device in the container's list of class devices and returns + * the corresponding class_device. 
+ */ +struct class_device * +attribute_container_find_class_device(struct attribute_container *cont, + struct device *dev) +{ + struct class_device *cdev = NULL; + struct internal_container *ic; + struct klist_iter iter; + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (ic->classdev.dev == dev) { + cdev = &ic->classdev; + /* FIXME: must exit iterator then break */ + klist_iter_exit(&iter); + break; + } + } + + return cdev; +} +EXPORT_SYMBOL_GPL(attribute_container_find_class_device); + +int +attribute_container_init(void) +{ + INIT_LIST_HEAD(&attribute_container_list); + return 0; +} +EXPORT_SYMBOL_GPL(attribute_container_init); diff --git a/kernel_addons/backport/2.6.9_U3/include/src/base.h b/kernel_addons/backport/2.6.9_U3/include/src/base.h new file mode 100644 index 0000000..a5f8936 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/base.h @@ -0,0 +1 @@ +extern int attribute_container_init(void); diff --git a/kernel_addons/backport/2.6.9_U3/include/src/init.c b/kernel_addons/backport/2.6.9_U3/include/src/init.c new file mode 100644 index 0000000..15f0bc6 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/init.c @@ -0,0 +1,26 @@ +/* + * + * Copyright (c) 2002-3 Patrick Mochel + * Copyright (c) 2002-3 Open Source Development Labs + * + * This file is released under the GPLv2 + * + */ + +#include +#include +#include + +#include "base.h" + +/** + * driver_init - initialize driver model. + * + * Call the driver model init functions to initialize their + * subsystems. Called early from init/main.c. + */ + +void __init driver_init(void) +{ + attribute_container_init(); +} diff --git a/kernel_addons/backport/2.6.9_U3/include/src/klist.c b/kernel_addons/backport/2.6.9_U3/include/src/klist.c new file mode 100644 index 0000000..3b29ebc --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/klist.c @@ -0,0 +1,287 @@ +/* + * klist.c - Routines for manipulating klists. 
+ * + * + * This klist interface provides a couple of structures that wrap around + * struct list_head to provide explicit list "head" (struct klist) and + * list "node" (struct klist_node) objects. For struct klist, a spinlock + * is included that protects access to the actual list itself. struct + * klist_node provides a pointer to the klist that owns it and a kref + * reference count that indicates the number of current users of that node + * in the list. + * + * The entire point is to provide an interface for iterating over a list + * that is safe and allows for modification of the list during the + * iteration (e.g. insertion and removal), including modification of the + * current node on the list. + * + * It works using a 3rd object type - struct klist_iter - that is declared + * and initialized before an iteration. klist_next() is used to acquire the + * next element in the list. It returns NULL if there are no more items. + * Internally, that routine takes the klist's lock, decrements the reference + * count of the previous klist_node and increments the count of the next + * klist_node. It then drops the lock and returns. + * + * There are primitives for adding and removing nodes to/from a klist. + * When deleting, klist_del() will simply decrement the reference count. + * Only when the count goes to 0 is the node removed from the list. + * klist_remove() will try to delete the node from the list and block + * until it is actually removed. This is useful for objects (like devices) + * that have been removed from the system and must be freed (but must wait + * until all accessors have finished). + * + * Copyright (C) 2005 Patrick Mochel + * + * This file is released under the GPL v2. + */ + +#include +#include + + +/** + * klist_init - Initialize a klist structure. + * @k: The klist we're initializing. 
+ * @get: The get function for the embedding object (NULL if none) + * @put: The put function for the embedding object (NULL if none) + * + * Initialises the klist structure. If the klist_node structures are + * going to be embedded in refcounted objects (necessary for safe + * deletion) then the get/put arguments are used to initialise + * functions that take and release references on the embedding + * objects. + */ + +void klist_init(struct klist * k, void (*get)(struct klist_node *), + void (*put)(struct klist_node *)) +{ + INIT_LIST_HEAD(&k->k_list); + spin_lock_init(&k->k_lock); + k->get = get; + k->put = put; +} + +EXPORT_SYMBOL_GPL(klist_init); + + +static void add_head(struct klist * k, struct klist_node * n) +{ + spin_lock(&k->k_lock); + list_add(&n->n_node, &k->k_list); + spin_unlock(&k->k_lock); +} + +static void add_tail(struct klist * k, struct klist_node * n) +{ + spin_lock(&k->k_lock); + list_add_tail(&n->n_node, &k->k_list); + spin_unlock(&k->k_lock); +} + + +static void klist_node_init(struct klist * k, struct klist_node * n) +{ + INIT_LIST_HEAD(&n->n_node); + init_completion(&n->n_removed); + kref_init(&n->n_ref); + n->n_klist = k; + if (k->get) + k->get(n); +} + + +/** + * klist_add_head - Initialize a klist_node and add it to front. + * @n: node we're adding. + * @k: klist it's going on. + */ + +void klist_add_head(struct klist_node * n, struct klist * k) +{ + klist_node_init(k, n); + add_head(k, n); +} + +EXPORT_SYMBOL_GPL(klist_add_head); + + +/** + * klist_add_tail - Initialize a klist_node and add it to back. + * @n: node we're adding. + * @k: klist it's going on. 
+ */ + +void klist_add_tail(struct klist_node * n, struct klist * k) +{ + klist_node_init(k, n); + add_tail(k, n); +} + +EXPORT_SYMBOL_GPL(klist_add_tail); + + +static void klist_release(struct kref * kref) +{ + struct klist_node * n = container_of(kref, struct klist_node, n_ref); + + list_del(&n->n_node); + complete(&n->n_removed); + n->n_klist = NULL; +} + +static int klist_dec_and_del(struct klist_node * n) +{ + return kref_put_new(&n->n_ref, klist_release); +} + + +/** + * klist_del - Decrement the reference count of node and try to remove. + * @n: node we're deleting. + */ + +void klist_del(struct klist_node * n) +{ + struct klist * k = n->n_klist; + void (*put)(struct klist_node *) = k->put; + + spin_lock(&k->k_lock); + if (!klist_dec_and_del(n)) + put = NULL; + spin_unlock(&k->k_lock); + if (put) + put(n); +} + +EXPORT_SYMBOL_GPL(klist_del); + + +/** + * klist_remove - Decrement the refcount of node and wait for it to go away. + * @n: node we're removing. + */ + +void klist_remove(struct klist_node * n) +{ + klist_del(n); + wait_for_completion(&n->n_removed); +} + +EXPORT_SYMBOL_GPL(klist_remove); + + +/** + * klist_node_attached - Say whether a node is bound to a list or not. + * @n: Node that we're testing. + */ + +int klist_node_attached(struct klist_node * n) +{ + return (n->n_klist != NULL); +} + +EXPORT_SYMBOL_GPL(klist_node_attached); + + +/** + * klist_iter_init_node - Initialize a klist_iter structure. + * @k: klist we're iterating. + * @i: klist_iter we're filling. + * @n: node to start with. + * + * Similar to klist_iter_init(), but starts the action off with @n, + * instead of with the list head. + */ + +void klist_iter_init_node(struct klist * k, struct klist_iter * i, struct klist_node * n) +{ + i->i_klist = k; + i->i_head = &k->k_list; + i->i_cur = n; + if (n) + kref_get(&n->n_ref); +} + +EXPORT_SYMBOL_GPL(klist_iter_init_node); + + +/** + * klist_iter_init - Initialize a klist_iter structure. + * @k: klist we're iterating. 
+ * @i: klist_iter structure we're filling. + * + * Similar to klist_iter_init_node(), but start with the list head. + */ + +void klist_iter_init(struct klist * k, struct klist_iter * i) +{ + klist_iter_init_node(k, i, NULL); +} + +EXPORT_SYMBOL_GPL(klist_iter_init); + + +/** + * klist_iter_exit - Finish a list iteration. + * @i: Iterator structure. + * + * Must be called when done iterating over list, as it decrements the + * refcount of the current node. Necessary in case iteration exited before + * the end of the list was reached, and always good form. + */ + +void klist_iter_exit(struct klist_iter * i) +{ + if (i->i_cur) { + klist_del(i->i_cur); + i->i_cur = NULL; + } +} + +EXPORT_SYMBOL_GPL(klist_iter_exit); + + +static struct klist_node * to_klist_node(struct list_head * n) +{ + return container_of(n, struct klist_node, n_node); +} + + +/** + * klist_next - Ante up next node in list. + * @i: Iterator structure. + * + * First grab list lock. Decrement the reference count of the previous + * node, if there was one. Grab the next node, increment its reference + * count, drop the lock, and return that next node. 
+ */ + +struct klist_node * klist_next(struct klist_iter * i) +{ + struct list_head * next; + struct klist_node * lnode = i->i_cur; + struct klist_node * knode = NULL; + void (*put)(struct klist_node *) = i->i_klist->put; + + spin_lock(&i->i_klist->k_lock); + if (lnode) { + next = lnode->n_node.next; + if (!klist_dec_and_del(lnode)) + put = NULL; + } else + next = i->i_head->next; + + if (next != i->i_head) { + knode = to_klist_node(next); + kref_get(&knode->n_ref); + } + i->i_cur = knode; + spin_unlock(&i->i_klist->k_lock); + if (put && lnode) + put(lnode); + return knode; +} + +EXPORT_SYMBOL_GPL(klist_next); + + diff --git a/kernel_addons/backport/2.6.9_U3/include/src/kref_new.c b/kernel_addons/backport/2.6.9_U3/include/src/kref_new.c new file mode 100644 index 0000000..d45bb3f --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/kref_new.c @@ -0,0 +1,29 @@ +#include +#include + +/** + * kref_put_new - decrement refcount for object. + * @kref: object. + * @release: pointer to the function that will clean up the object when the + * last reference to the object is released. + * This pointer is required, and it is not acceptable to pass kfree + * in as this function. + * + * Decrement the refcount, and if 0, call release(). + * Return 1 if the object was removed, otherwise return 0. Beware, if this + * function returns 0, you still cannot count on the kref remaining in + * memory. Only use the return value if you want to see if the kref is now + * gone, not present. 
+ */ +int kref_put_new(struct kref *kref, void (*release)(struct kref *kref)) +{ + WARN_ON(release == NULL); + WARN_ON(release == (void (*)(struct kref *))kfree); + + if (atomic_dec_and_test(&kref->refcount)) { + release(kref); + return 1; + } + return 0; +} +EXPORT_SYMBOL(kref_put_new); diff --git a/kernel_addons/backport/2.6.9_U3/include/src/scsi.c b/kernel_addons/backport/2.6.9_U3/include/src/scsi.c new file mode 100644 index 0000000..8c833c0 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/scsi.c @@ -0,0 +1,50 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +/** + * starget_for_each_device - helper to walk all devices of a target + * @starget: target whose devices we want to iterate over. + * + * This traverses over each devices of @shost. The devices have + * a reference that must be released by scsi_host_put when breaking + * out of the loop. 
+ */ +void starget_for_each_device(struct scsi_target *starget, void * data, + void (*fn)(struct scsi_device *, void *)) +{ + struct Scsi_Host *shost = dev_to_shost(starget->dev.parent); + struct scsi_device *sdev; + + printk("%s: entry\n", __FUNCTION__); + shost_for_each_device(sdev, shost) { + if ((sdev->channel == starget->channel) && + (sdev->id == starget->id)) + fn(sdev, data); + } + printk("%s: exit\n", __FUNCTION__); +} +EXPORT_SYMBOL(starget_for_each_device); diff --git a/kernel_addons/backport/2.6.9_U3/include/src/scsi_lib.c b/kernel_addons/backport/2.6.9_U3/include/src/scsi_lib.c new file mode 100644 index 0000000..327b53f --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/scsi_lib.c @@ -0,0 +1,164 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +int scsi_is_target_device(const struct device *dev) +{ + char *str = dev->bus_id; + + if (strncmp(str, "target", 6) == 0) { + return 1; + } + + return 0; +} + +/** + * scsi_internal_device_block - internal function to put a device + * temporarily into the SDEV_BLOCK state + * @sdev: device to block + * + * Block request made by scsi lld's to temporarily stop all + * scsi commands on the specified device. Called from interrupt + * or normal process context. + * + * Returns zero if successful or error if not + * + * Notes: + * This routine transitions the device to the SDEV_BLOCK state + * (which must be a legal transition). When the device is in this + * state, all commands are deferred until the scsi lld reenables + * the device with scsi_device_unblock or device_block_tmo fires. + * This routine assumes the host_lock is held on entry. 
+ **/ +int +scsi_internal_device_block(struct scsi_device *sdev) +{ + request_queue_t *q = sdev->request_queue; + unsigned long flags; + int err = 0; + + err = scsi_device_set_state(sdev, SDEV_BLOCK); + if (err) + return err; + + /* + * The device has transitioned to SDEV_BLOCK. Stop the + * block layer from calling the midlayer with this device's + * request queue. + */ + spin_lock_irqsave(q->queue_lock, flags); + blk_stop_queue(q); + spin_unlock_irqrestore(q->queue_lock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(scsi_internal_device_block); + +/** + * scsi_internal_device_unblock - resume a device after a block request + * @sdev: device to resume + * + * Called by scsi lld's or the midlayer to restart the device queue + * for the previously suspended scsi device. Called from interrupt or + * normal process context. + * + * Returns zero if successful or error if not. + * + * Notes: + * This routine transitions the device to the SDEV_RUNNING state + * (which must be a legal transition) allowing the midlayer to + * goose the queue for this device. This routine assumes the + * host_lock is held upon entry. + **/ +int +scsi_internal_device_unblock(struct scsi_device *sdev) +{ + request_queue_t *q = sdev->request_queue; + int err; + unsigned long flags; + + + /* + * Try to transition the scsi device to SDEV_RUNNING + * and goose the device queue if successful. 
+ */ + err = scsi_device_set_state(sdev, SDEV_RUNNING); + if (err) + return err; + + spin_lock_irqsave(q->queue_lock, flags); + blk_start_queue(q); + spin_unlock_irqrestore(q->queue_lock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(scsi_internal_device_unblock); + +static void +device_block(struct scsi_device *sdev, void *data) +{ + scsi_internal_device_block(sdev); +} + +static int +target_block(struct device *dev, void *data) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_block); + + return 0; +} + +void +scsi_target_block(struct device *dev) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_block); + else + device_for_each_child(dev, NULL, target_block); +} +EXPORT_SYMBOL_GPL(scsi_target_block); + +static void +device_unblock(struct scsi_device *sdev, void *data) +{ + scsi_internal_device_unblock(sdev); +} + +static int +target_unblock(struct device *dev, void *data) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_unblock); + return 0; +} + +void +scsi_target_unblock(struct device *dev) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_unblock); + else + device_for_each_child(dev, NULL, target_unblock); +} +EXPORT_SYMBOL_GPL(scsi_target_unblock); diff --git a/kernel_addons/backport/2.6.9_U3/include/src/scsi_scan.c b/kernel_addons/backport/2.6.9_U3/include/src/scsi_scan.c new file mode 100644 index 0000000..b7b7674 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/scsi_scan.c @@ -0,0 +1,48 @@ +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +/** + * int_to_scsilun: reverts an int into a scsi_lun + * @lun: integer to be reverted + * @scsilun: struct scsi_lun to be set. 
+ * + * Description: + * Reverts the functionality of the scsilun_to_int, which packed + * an 8-byte lun value into an int. This routine unpacks the int + * back into the lun value. + * Note: the scsilun_to_int() routine does not truly handle all + * 8 bytes of the lun value. This function restores only as much + * as was set by the routine. + * + * Notes: + * Given an integer : 0x0b030a04, this function returns a + * scsi_lun of : struct scsi_lun of: 0a 04 0b 03 00 00 00 00 + * + **/ +void int_to_scsilun(unsigned int lun, struct scsi_lun *scsilun) +{ + int i; + + memset(scsilun->scsi_lun, 0, sizeof(scsilun->scsi_lun)); + + for (i = 0; i < sizeof(lun); i += 2) { + scsilun->scsi_lun[i] = (lun >> 8) & 0xFF; + scsilun->scsi_lun[i+1] = lun & 0xFF; + lun = lun >> 16; + } +} +EXPORT_SYMBOL(int_to_scsilun); diff --git a/kernel_addons/backport/2.6.9_U3/include/src/transport_class.c b/kernel_addons/backport/2.6.9_U3/include/src/transport_class.c new file mode 100644 index 0000000..f25e7c6 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/transport_class.c @@ -0,0 +1,280 @@ +/* + * transport_class.c - implementation of generic transport classes + * using attribute_containers + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + * + * The basic idea here is to allow any "device controller" (which + * would most often be a Host Bus Adapter) to use the services of one + * or more transport classes for performing transport specific + * services. Transport specific services are things that the generic + * command layer doesn't want to know about (speed settings, line + * conditioning, etc), but which the user might be interested in. + * Thus, the HBA's use the routines exported by the transport classes + * to perform these functions. The transport classes export certain + * values to the user via sysfs using attribute containers. 
+ * + * Note: because not every HBA will care about every transport + * attribute, there's a many to one relationship that goes like this: + * + * transport class<-----attribute container<----class device + * + * Usually the attribute container is per-HBA, but the design doesn't + * mandate that. Although most of the services will be specific to + * the actual external storage connection used by the HBA, the generic + * transport class is framed entirely in terms of generic devices to + * allow it to be used by any physical HBA in the system. + */ +#include +#include + +/** + * transport_class_register - register an initial transport class + * + * @tclass: a pointer to the transport class structure to be initialised + * + * The transport class contains an embedded class which is used to + * identify it. The caller should initialise this structure with + * zeros and then generic class must have been initialised with the + * actual transport class unique name. There's a macro + * DECLARE_TRANSPORT_CLASS() to do this (declared classes still must + * be registered). + * + * Returns 0 on success or error on failure. + */ +int transport_class_register(struct transport_class *tclass) +{ + return class_register(&tclass->class); +} +EXPORT_SYMBOL_GPL(transport_class_register); + +/** + * transport_class_unregister - unregister a previously registered class + * + * @tclass: The transport class to unregister + * + * Must be called prior to deallocating the memory for the transport + * class. 
+ */ +void transport_class_unregister(struct transport_class *tclass) +{ + class_unregister(&tclass->class); +} +EXPORT_SYMBOL_GPL(transport_class_unregister); + +static int anon_transport_dummy_function(struct transport_container *tc, + struct device *dev, + struct class_device *cdev) +{ + /* do nothing */ + return 0; +} + +/** + * anon_transport_class_register - register an anonymous class + * + * @atc: The anon transport class to register + * + * The anonymous transport class contains both a transport class and a + * container. The idea of an anonymous class is that it never + * actually has any device attributes associated with it (and thus + * saves on container storage). So it can only be used for triggering + * events. Use prezero and then use DECLARE_ANON_TRANSPORT_CLASS() to + * initialise the anon transport class storage. + */ +int anon_transport_class_register(struct anon_transport_class *atc) +{ + int error; + atc->container.class = &atc->tclass.class; + attribute_container_set_no_classdevs(&atc->container); + error = attribute_container_register(&atc->container); + if (error) + return error; + atc->tclass.setup = anon_transport_dummy_function; + atc->tclass.remove = anon_transport_dummy_function; + return 0; +} +EXPORT_SYMBOL_GPL(anon_transport_class_register); + +/** + * anon_transport_class_unregister - unregister an anon class + * + * @atc: Pointer to the anon transport class to unregister + * + * Must be called prior to deallocating the memory for the anon + * transport class. 
+ */ +void anon_transport_class_unregister(struct anon_transport_class *atc) +{ + attribute_container_unregister(&atc->container); +} +EXPORT_SYMBOL_GPL(anon_transport_class_unregister); + +static int transport_setup_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + struct transport_container *tcont = attribute_container_to_transport_container(cont); + + if (tclass->setup) + tclass->setup(tcont, dev, classdev); + + return 0; +} + +/** + * transport_setup_device - declare a new dev for transport class association + * but don't make it visible yet. + * + * @dev: the generic device representing the entity being added + * + * Usually, dev represents some component in the HBA system (either + * the HBA itself or a device remote across the HBA bus). This + * routine is simply a trigger point to see if any set of transport + * classes wishes to associate with the added device. This allocates + * storage for the class device and initialises it, but does not yet + * add it to the system or add attributes to it (you do this with + * transport_add_device). If you have no need for a separate setup + * and add operations, use transport_register_device (see + * transport_class.h). 
+ */ + +void transport_setup_device(struct device *dev) +{ + attribute_container_add_device(dev, transport_setup_classdev); +} +EXPORT_SYMBOL_GPL(transport_setup_device); + +static int transport_add_class_device(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + int error = attribute_container_add_class_device(classdev); + struct transport_container *tcont = + attribute_container_to_transport_container(cont); + + if (!error && tcont->statistics) + error = sysfs_create_group(&classdev->kobj, tcont->statistics); + + return error; +} + + +/** + * transport_add_device - declare a new dev for transport class association + * + * @dev: the generic device representing the entity being added + * + * Usually, dev represents some component in the HBA system (either + * the HBA itself or a device remote across the HBA bus). This + * routine is simply a trigger point used to add the device to the + * system and register attributes for it. + */ + +void transport_add_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_add_class_device); +} +EXPORT_SYMBOL_GPL(transport_add_device); + +static int transport_configure(struct attribute_container *cont, + struct device *dev, + struct class_device *cdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + struct transport_container *tcont = attribute_container_to_transport_container(cont); + + if (tclass->configure) + tclass->configure(tcont, dev, cdev); + + return 0; +} + +/** + * transport_configure_device - configure an already set up device + * + * @dev: generic device representing device to be configured + * + * The idea of configure is simply to provide a point within the setup + * process to allow the transport class to extract information from a + * device after it has been setup. 
This is used in SCSI because we + * have to have a setup device to begin using the HBA, but after we + * send the initial inquiry, we use configure to extract the device + * parameters. The device need not have been added to be configured. + */ +void transport_configure_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_configure); +} +EXPORT_SYMBOL_GPL(transport_configure_device); + +static int transport_remove_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_container *tcont = + attribute_container_to_transport_container(cont); + struct transport_class *tclass = class_to_transport_class(cont->class); + + if (tclass->remove) + tclass->remove(tcont, dev, classdev); + + if (tclass->remove != anon_transport_dummy_function) { + if (tcont->statistics) + sysfs_remove_group(&classdev->kobj, tcont->statistics); + attribute_container_class_device_del(classdev); + } + + return 0; +} + + +/** + * transport_remove_device - remove the visibility of a device + * + * @dev: generic device to remove + * + * This call removes the visibility of the device (to the user from + * sysfs), but does not destroy it. To eliminate a device entirely + * you must also call transport_destroy_device. If you don't need to + * do remove and destroy as separate operations, use + * transport_unregister_device() (see transport_class.h) which will + * perform both calls for you. 
+ */ +void transport_remove_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_remove_classdev); +} +EXPORT_SYMBOL_GPL(transport_remove_device); + +static void transport_destroy_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + + if (tclass->remove != anon_transport_dummy_function) + class_device_put(classdev); +} + + +/** + * transport_destroy_device - destroy a removed device + * + * @dev: device to eliminate from the transport class. + * + * This call triggers the elimination of storage associated with the + * transport classdev. Note: all it really does is relinquish a + * reference to the classdev. The memory will not be freed until the + * last reference goes to zero. Note also that the classdev retains a + * reference count on dev, so dev too will remain for as long as the + * transport class device remains around. + */ +void transport_destroy_device(struct device *dev) +{ + attribute_container_remove_device(dev, transport_destroy_classdev); +} +EXPORT_SYMBOL_GPL(transport_destroy_device); diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/attribute_container.h b/kernel_addons/backport/2.6.9_U4/include/linux/attribute_container.h new file mode 100644 index 0000000..93bfb0b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/linux/attribute_container.h @@ -0,0 +1,71 @@ +/* + * class_container.h - a generic container for all classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + */ + +#ifndef _ATTRIBUTE_CONTAINER_H_ +#define _ATTRIBUTE_CONTAINER_H_ + +#include +#include +#include +#include + +struct attribute_container { + struct list_head node; + struct klist containers; + struct class *class; + struct class_device_attribute **attrs; + int (*match)(struct attribute_container *, struct device *); +#define ATTRIBUTE_CONTAINER_NO_CLASSDEVS 0x01 + 
unsigned long flags; +}; + +static inline int +attribute_container_no_classdevs(struct attribute_container *atc) +{ + return atc->flags & ATTRIBUTE_CONTAINER_NO_CLASSDEVS; +} + +static inline void +attribute_container_set_no_classdevs(struct attribute_container *atc) +{ + atc->flags |= ATTRIBUTE_CONTAINER_NO_CLASSDEVS; +} + +int attribute_container_register(struct attribute_container *cont); +int attribute_container_unregister(struct attribute_container *cont); +void attribute_container_create_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_add_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_remove_device(struct device *dev, + void (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_device_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *)); +int attribute_container_add_attrs(struct class_device *classdev); +int attribute_container_add_class_device(struct class_device *classdev); +int attribute_container_add_class_device_adapter(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev); +void attribute_container_remove_attrs(struct class_device *classdev); +void attribute_container_class_device_del(struct class_device *classdev); +struct attribute_container *attribute_container_classdev_to_container(struct class_device *); +struct class_device *attribute_container_find_class_device(struct attribute_container *, struct device *); +struct class_device_attribute **attribute_container_classdev_to_attrs(const struct class_device *classdev); + +#endif diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/klist.h 
b/kernel_addons/backport/2.6.9_U4/include/linux/klist.h new file mode 100644 index 0000000..7407125 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/linux/klist.h @@ -0,0 +1,61 @@ +/* + * klist.h - Some generic list helpers, extending struct list_head a bit. + * + * Implementations are found in lib/klist.c + * + * + * Copyright (C) 2005 Patrick Mochel + * + * This file is released under the GPL v2. + */ + +#ifndef _LINUX_KLIST_H +#define _LINUX_KLIST_H + +#include +#include +#include +#include + +struct klist_node; +struct klist { + spinlock_t k_lock; + struct list_head k_list; + void (*get)(struct klist_node *); + void (*put)(struct klist_node *); +}; + + +extern void klist_init(struct klist * k, void (*get)(struct klist_node *), + void (*put)(struct klist_node *)); + +struct klist_node { + struct klist * n_klist; + struct list_head n_node; + struct kref n_ref; + struct completion n_removed; +}; + +extern void klist_add_tail(struct klist_node * n, struct klist * k); +extern void klist_add_head(struct klist_node * n, struct klist * k); + +extern void klist_del(struct klist_node * n); +extern void klist_remove(struct klist_node * n); + +extern int klist_node_attached(struct klist_node * n); + + +struct klist_iter { + struct klist * i_klist; + struct list_head * i_head; + struct klist_node * i_cur; +}; + + +extern void klist_iter_init(struct klist * k, struct klist_iter * i); +extern void klist_iter_init_node(struct klist * k, struct klist_iter * i, + struct klist_node * n); +extern void klist_iter_exit(struct klist_iter * i); +extern struct klist_node * klist_next(struct klist_iter * i); + +#endif diff --git a/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_device.h b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_device.h new file mode 100644 index 0000000..f353e0b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_device.h @@ -0,0 +1,19 @@ +#ifndef _SCSI_SCSI_DEVICE_H_BACKPORT +#define _SCSI_SCSI_DEVICE_H_BACKPORT + +#include_next 
+ +#include +#include +#include +#include +#include + +struct scsi_lun; + +extern void int_to_scsilun(unsigned int, struct scsi_lun *); +extern void scsi_target_block(struct device *); +extern void scsi_target_unblock(struct device *); +extern void starget_for_each_device(struct scsi_target *, void *, + void (*fn)(struct scsi_device *, void *)); +#endif /* _SCSI_SCSI_DEVICE_H_BACKPORT */ diff --git a/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_transport.h b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_transport.h new file mode 100644 index 0000000..99c2b12 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_transport.h @@ -0,0 +1,8 @@ +#ifndef _SCSI_SCSI_TRANSPORT_H_BACKPORT +#define _SCSI_SCSI_TRANSPORT_H_BACKPORT + +#include_next + +#include + +#endif /* _SCSI_SCSI_TRANSPORT_H_BACKPORT */ diff --git a/kernel_addons/backport/2.6.9_U4/include/src/attribute_container.c b/kernel_addons/backport/2.6.9_U4/include/src/attribute_container.c new file mode 100644 index 0000000..44948d1 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/attribute_container.c @@ -0,0 +1,438 @@ +/* + * attribute_container.c - implementation of a simple container for classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + * + * The basic idea here is to enable a device to be attached to an + * arbitrary number of classes without having to allocate storage for them. + * Instead, the contained classes select the devices they need to attach + * to via a matching function. + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "base.h" + +/* This is a private structure used to tie the classdev and the + * container .. 
it should never be visible outside this file */ +struct internal_container { + struct klist_node node; + struct attribute_container *cont; + struct class_device classdev; +}; + +static void internal_container_klist_get(struct klist_node *n) +{ + struct internal_container *ic = + container_of(n, struct internal_container, node); + class_device_get(&ic->classdev); +} + +static void internal_container_klist_put(struct klist_node *n) +{ + struct internal_container *ic = + container_of(n, struct internal_container, node); + class_device_put(&ic->classdev); +} + + +/** + * attribute_container_classdev_to_container - given a classdev, return the container + * + * @classdev: the class device created by attribute_container_add_device. + * + * Returns the container associated with this classdev. + */ +struct attribute_container * +attribute_container_classdev_to_container(struct class_device *classdev) +{ + struct internal_container *ic = + container_of(classdev, struct internal_container, classdev); + return ic->cont; +} +EXPORT_SYMBOL_GPL(attribute_container_classdev_to_container); + +static struct list_head attribute_container_list; + +static DECLARE_MUTEX(attribute_container_mutex); + +/** + * attribute_container_register - register an attribute container + * + * @cont: The container to register. This must be allocated by the + * caller and should also be zeroed by it. 
+ */ +int +attribute_container_register(struct attribute_container *cont) +{ + INIT_LIST_HEAD(&cont->node); + klist_init(&cont->containers,internal_container_klist_get, + internal_container_klist_put); + + down(&attribute_container_mutex); + list_add_tail(&cont->node, &attribute_container_list); + up(&attribute_container_mutex); + + return 0; +} +EXPORT_SYMBOL_GPL(attribute_container_register); + +/** + * attribute_container_unregister - remove a container registration + * + * @cont: previously registered container to remove + */ +int +attribute_container_unregister(struct attribute_container *cont) +{ + int retval = -EBUSY; + down(&attribute_container_mutex); + spin_lock(&cont->containers.k_lock); + if (!list_empty(&cont->containers.k_list)) + goto out; + retval = 0; + list_del(&cont->node); + out: + spin_unlock(&cont->containers.k_lock); + up(&attribute_container_mutex); + return retval; + +} +EXPORT_SYMBOL_GPL(attribute_container_unregister); + +/* private function used as class release */ +static void attribute_container_release(struct class_device *classdev) +{ + struct internal_container *ic + = container_of(classdev, struct internal_container, classdev); + struct device *dev = classdev->dev; + + kfree(ic); + put_device(dev); +} + +/** + * attribute_container_add_device - see if any container is interested in dev + * + * @dev: device to add attributes to + * @fn: function to trigger addition of class device. + * + * This function allocates storage for the class device(s) to be + * attached to dev (one for each matching attribute_container). If no + * fn is provided, the code will simply register the class device via + * class_device_add. If a function is provided, it is expected to add + * the class device at the appropriate time. One of the things that + * might be necessary is to allocate and initialise the classdev and + * then add it a later time. 
To do this, call this routine for + * allocation and initialisation and then use + * attribute_container_device_trigger() to call class_device_add() on + * it. Note: after this, the class device contains a reference to dev + * which is not relinquished until the release of the classdev. + */ +void +attribute_container_add_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + + if (attribute_container_no_classdevs(cont)) + continue; + + if (!cont->match(cont, dev)) + continue; + + ic = kzalloc(sizeof(*ic), GFP_KERNEL); + if (!ic) { + dev_printk(KERN_ERR, dev, "failed to allocate class container\n"); + continue; + } + + ic->cont = cont; + class_device_initialize(&ic->classdev); + ic->classdev.dev = get_device(dev); + ic->classdev.class = cont->class; + cont->class->release = attribute_container_release; + strcpy(ic->classdev.class_id, dev->bus_id); + if (fn) + fn(cont, dev, &ic->classdev); + else + attribute_container_add_class_device(&ic->classdev); + klist_add_tail(&ic->node, &cont->containers); + } + up(&attribute_container_mutex); +} + +/* FIXME: can't break out of this unless klist_iter_exit is also + * called before doing the break + */ +#define klist_for_each_entry(pos, head, member, iter) \ + for (klist_iter_init(head, iter); (pos = ({ \ + struct klist_node *n = klist_next(iter); \ + n ? container_of(n, typeof(*pos), member) : \ + ({ klist_iter_exit(iter) ; NULL; }); \ + }) ) != NULL; ) + + +/** + * attribute_container_remove_device - make device eligible for removal. + * + * @dev: The generic device + * @fn: A function to call to remove the device + * + * This routine triggers device removal. 
If fn is NULL, then it is + * simply done via class_device_unregister (note that if something + * still has a reference to the classdev, then the memory occupied + * will not be freed until the classdev is released). If you want a + * two phase release: remove from visibility and then delete the + * device, then you should use this routine with a fn that calls + * class_device_del() and then use + * attribute_container_device_trigger() to do the final put on the + * classdev. + */ +void +attribute_container_remove_device(struct device *dev, + void (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + struct klist_iter iter; + + if (attribute_container_no_classdevs(cont)) + continue; + + if (!cont->match(cont, dev)) + continue; + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (dev != ic->classdev.dev) + continue; + klist_del(&ic->node); + if (fn) + fn(cont, dev, &ic->classdev); + else { + attribute_container_remove_attrs(&ic->classdev); + class_device_unregister(&ic->classdev); + } + } + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_device_trigger - execute a trigger for each matching classdev + * + * @dev: The generic device to run the trigger for + * @fn: the function to execute for each classdev. + * + * This function is for executing a trigger when you need to know both + * the container and the classdev. If you only care about the + * container, then use attribute_container_trigger() instead.
+ */ +void +attribute_container_device_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + struct klist_iter iter; + + if (!cont->match(cont, dev)) + continue; + + if (attribute_container_no_classdevs(cont)) { + fn(cont, dev, NULL); + continue; + } + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (dev == ic->classdev.dev) + fn(cont, dev, &ic->classdev); + } + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_trigger - trigger a function for each matching container + * + * @dev: The generic device to activate the trigger for + * @fn: the function to trigger + * + * This routine triggers a function that only needs to know the + * matching containers (not the classdev) associated with a device. + * It is more lightweight than attribute_container_device_trigger, so + * should be used in preference unless the triggering function + * actually needs to know the classdev. 
+ */ +void +attribute_container_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + if (cont->match(cont, dev)) + fn(cont, dev); + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_add_attrs - add attributes + * + * @classdev: The class device + * + * This simply creates all the class device sysfs files from the + * attributes listed in the container + */ +int +attribute_container_add_attrs(struct class_device *classdev) +{ + struct attribute_container *cont = + attribute_container_classdev_to_container(classdev); + struct class_device_attribute **attrs = cont->attrs; + int i, error; + + if (!attrs) + return 0; + + for (i = 0; attrs[i]; i++) { + error = class_device_create_file(classdev, attrs[i]); + if (error) + return error; + } + + return 0; +} + +/** + * attribute_container_add_class_device - same function as class_device_add + * + * @classdev: the class device to add + * + * This performs essentially the same function as class_device_add except for + * attribute containers, namely add the classdev to the system and then + * create the attribute files + */ +int +attribute_container_add_class_device(struct class_device *classdev) +{ + int error = class_device_add(classdev); + if (error) + return error; + return attribute_container_add_attrs(classdev); +} + +/** + * attribute_container_add_class_device_adapter - simple adapter for triggers + * + * This function is identical to attribute_container_add_class_device except + * that it is designed to be called from the triggers + */ +int +attribute_container_add_class_device_adapter(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + return attribute_container_add_class_device(classdev); +} + +/** + * attribute_container_remove_attrs - remove any attribute files + * + * 
@classdev: The class device to remove the files from + * + */ +void +attribute_container_remove_attrs(struct class_device *classdev) +{ + struct attribute_container *cont = + attribute_container_classdev_to_container(classdev); + struct class_device_attribute **attrs = cont->attrs; + int i; + + if (!attrs) + return; + + for (i = 0; attrs[i]; i++) + class_device_remove_file(classdev, attrs[i]); +} + +/** + * attribute_container_class_device_del - equivalent of class_device_del + * + * @classdev: the class device + * + * This function simply removes all the attribute files and then calls + * class_device_del. + */ +void +attribute_container_class_device_del(struct class_device *classdev) +{ + attribute_container_remove_attrs(classdev); + class_device_del(classdev); +} + +/** + * attribute_container_find_class_device - find the corresponding class_device + * + * @cont: the container + * @dev: the generic device + * + * Looks up the device in the container's list of class devices and returns + * the corresponding class_device. 
+ */ +struct class_device * +attribute_container_find_class_device(struct attribute_container *cont, + struct device *dev) +{ + struct class_device *cdev = NULL; + struct internal_container *ic; + struct klist_iter iter; + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (ic->classdev.dev == dev) { + cdev = &ic->classdev; + /* FIXME: must exit iterator then break */ + klist_iter_exit(&iter); + break; + } + } + + return cdev; +} +EXPORT_SYMBOL_GPL(attribute_container_find_class_device); + +int +attribute_container_init(void) +{ + INIT_LIST_HEAD(&attribute_container_list); + return 0; +} +EXPORT_SYMBOL_GPL(attribute_container_init); diff --git a/kernel_addons/backport/2.6.9_U4/include/src/base.h b/kernel_addons/backport/2.6.9_U4/include/src/base.h new file mode 100644 index 0000000..a5f8936 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/base.h @@ -0,0 +1 @@ +extern int attribute_container_init(void); diff --git a/kernel_addons/backport/2.6.9_U4/include/src/init.c b/kernel_addons/backport/2.6.9_U4/include/src/init.c new file mode 100644 index 0000000..15f0bc6 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/init.c @@ -0,0 +1,26 @@ +/* + * + * Copyright (c) 2002-3 Patrick Mochel + * Copyright (c) 2002-3 Open Source Development Labs + * + * This file is released under the GPLv2 + * + */ + +#include +#include +#include + +#include "base.h" + +/** + * driver_init - initialize driver model. + * + * Call the driver model init functions to initialize their + * subsystems. Called early from init/main.c. + */ + +void __init driver_init(void) +{ + attribute_container_init(); +} diff --git a/kernel_addons/backport/2.6.9_U4/include/src/klist.c b/kernel_addons/backport/2.6.9_U4/include/src/klist.c new file mode 100644 index 0000000..3b29ebc --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/klist.c @@ -0,0 +1,287 @@ +/* + * klist.c - Routines for manipulating klists. 
+ * + * + * This klist interface provides a couple of structures that wrap around + * struct list_head to provide explicit list "head" (struct klist) and + * list "node" (struct klist_node) objects. For struct klist, a spinlock + * is included that protects access to the actual list itself. struct + * klist_node provides a pointer to the klist that owns it and a kref + * reference count that indicates the number of current users of that node + * in the list. + * + * The entire point is to provide an interface for iterating over a list + * that is safe and allows for modification of the list during the + * iteration (e.g. insertion and removal), including modification of the + * current node on the list. + * + * It works using a 3rd object type - struct klist_iter - that is declared + * and initialized before an iteration. klist_next() is used to acquire the + * next element in the list. It returns NULL if there are no more items. + * Internally, that routine takes the klist's lock, decrements the reference + * count of the previous klist_node and increments the count of the next + * klist_node. It then drops the lock and returns. + * + * There are primitives for adding and removing nodes to/from a klist. + * When deleting, klist_del() will simply decrement the reference count. + * Only when the count goes to 0 is the node removed from the list. + * klist_remove() will try to delete the node from the list and block + * until it is actually removed. This is useful for objects (like devices) + * that have been removed from the system and must be freed (but must wait + * until all accessors have finished). + * + * Copyright (C) 2005 Patrick Mochel + * + * This file is released under the GPL v2. + */ + +#include +#include + + +/** + * klist_init - Initialize a klist structure. + * @k: The klist we're initializing. 
+ * @get: The get function for the embedding object (NULL if none) + * @put: The put function for the embedding object (NULL if none) + * + * Initialises the klist structure. If the klist_node structures are + * going to be embedded in refcounted objects (necessary for safe + * deletion) then the get/put arguments are used to initialise + * functions that take and release references on the embedding + * objects. + */ + +void klist_init(struct klist * k, void (*get)(struct klist_node *), + void (*put)(struct klist_node *)) +{ + INIT_LIST_HEAD(&k->k_list); + spin_lock_init(&k->k_lock); + k->get = get; + k->put = put; +} + +EXPORT_SYMBOL_GPL(klist_init); + + +static void add_head(struct klist * k, struct klist_node * n) +{ + spin_lock(&k->k_lock); + list_add(&n->n_node, &k->k_list); + spin_unlock(&k->k_lock); +} + +static void add_tail(struct klist * k, struct klist_node * n) +{ + spin_lock(&k->k_lock); + list_add_tail(&n->n_node, &k->k_list); + spin_unlock(&k->k_lock); +} + + +static void klist_node_init(struct klist * k, struct klist_node * n) +{ + INIT_LIST_HEAD(&n->n_node); + init_completion(&n->n_removed); + kref_init(&n->n_ref); + n->n_klist = k; + if (k->get) + k->get(n); +} + + +/** + * klist_add_head - Initialize a klist_node and add it to front. + * @n: node we're adding. + * @k: klist it's going on. + */ + +void klist_add_head(struct klist_node * n, struct klist * k) +{ + klist_node_init(k, n); + add_head(k, n); +} + +EXPORT_SYMBOL_GPL(klist_add_head); + + +/** + * klist_add_tail - Initialize a klist_node and add it to back. + * @n: node we're adding. + * @k: klist it's going on. 
+ */ + +void klist_add_tail(struct klist_node * n, struct klist * k) +{ + klist_node_init(k, n); + add_tail(k, n); +} + +EXPORT_SYMBOL_GPL(klist_add_tail); + + +static void klist_release(struct kref * kref) +{ + struct klist_node * n = container_of(kref, struct klist_node, n_ref); + + list_del(&n->n_node); + complete(&n->n_removed); + n->n_klist = NULL; +} + +static int klist_dec_and_del(struct klist_node * n) +{ + return kref_put_new(&n->n_ref, klist_release); +} + + +/** + * klist_del - Decrement the reference count of node and try to remove. + * @n: node we're deleting. + */ + +void klist_del(struct klist_node * n) +{ + struct klist * k = n->n_klist; + void (*put)(struct klist_node *) = k->put; + + spin_lock(&k->k_lock); + if (!klist_dec_and_del(n)) + put = NULL; + spin_unlock(&k->k_lock); + if (put) + put(n); +} + +EXPORT_SYMBOL_GPL(klist_del); + + +/** + * klist_remove - Decrement the refcount of node and wait for it to go away. + * @n: node we're removing. + */ + +void klist_remove(struct klist_node * n) +{ + klist_del(n); + wait_for_completion(&n->n_removed); +} + +EXPORT_SYMBOL_GPL(klist_remove); + + +/** + * klist_node_attached - Say whether a node is bound to a list or not. + * @n: Node that we're testing. + */ + +int klist_node_attached(struct klist_node * n) +{ + return (n->n_klist != NULL); +} + +EXPORT_SYMBOL_GPL(klist_node_attached); + + +/** + * klist_iter_init_node - Initialize a klist_iter structure. + * @k: klist we're iterating. + * @i: klist_iter we're filling. + * @n: node to start with. + * + * Similar to klist_iter_init(), but starts the action off with @n, + * instead of with the list head. + */ + +void klist_iter_init_node(struct klist * k, struct klist_iter * i, struct klist_node * n) +{ + i->i_klist = k; + i->i_head = &k->k_list; + i->i_cur = n; + if (n) + kref_get(&n->n_ref); +} + +EXPORT_SYMBOL_GPL(klist_iter_init_node); + + +/** + * klist_iter_init - Initialize a klist_iter structure. + * @k: klist we're iterating.
+ * @i: klist_iter structure we're filling. + * + * Similar to klist_iter_init_node(), but start with the list head. + */ + +void klist_iter_init(struct klist * k, struct klist_iter * i) +{ + klist_iter_init_node(k, i, NULL); +} + +EXPORT_SYMBOL_GPL(klist_iter_init); + + +/** + * klist_iter_exit - Finish a list iteration. + * @i: Iterator structure. + * + * Must be called when done iterating over list, as it decrements the + * refcount of the current node. Necessary in case iteration exited before + * the end of the list was reached, and always good form. + */ + +void klist_iter_exit(struct klist_iter * i) +{ + if (i->i_cur) { + klist_del(i->i_cur); + i->i_cur = NULL; + } +} + +EXPORT_SYMBOL_GPL(klist_iter_exit); + + +static struct klist_node * to_klist_node(struct list_head * n) +{ + return container_of(n, struct klist_node, n_node); +} + + +/** + * klist_next - Ante up next node in list. + * @i: Iterator structure. + * + * First grab list lock. Decrement the reference count of the previous + * node, if there was one. Grab the next node, increment its reference + * count, drop the lock, and return that next node. 
+ */ + +struct klist_node * klist_next(struct klist_iter * i) +{ + struct list_head * next; + struct klist_node * lnode = i->i_cur; + struct klist_node * knode = NULL; + void (*put)(struct klist_node *) = i->i_klist->put; + + spin_lock(&i->i_klist->k_lock); + if (lnode) { + next = lnode->n_node.next; + if (!klist_dec_and_del(lnode)) + put = NULL; + } else + next = i->i_head->next; + + if (next != i->i_head) { + knode = to_klist_node(next); + kref_get(&knode->n_ref); + } + i->i_cur = knode; + spin_unlock(&i->i_klist->k_lock); + if (put && lnode) + put(lnode); + return knode; +} + +EXPORT_SYMBOL_GPL(klist_next); + + diff --git a/kernel_addons/backport/2.6.9_U4/include/src/kref_new.c b/kernel_addons/backport/2.6.9_U4/include/src/kref_new.c new file mode 100644 index 0000000..d45bb3f --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/kref_new.c @@ -0,0 +1,29 @@ +#include +#include + +/** + * kref_put_new - decrement refcount for object. + * @kref: object. + * @release: pointer to the function that will clean up the object when the + * last reference to the object is released. + * This pointer is required, and it is not acceptable to pass kfree + * in as this function. + * + * Decrement the refcount, and if 0, call release(). + * Return 1 if the object was removed, otherwise return 0. Beware, if this + * function returns 0, you still cannot count on the kref remaining in + * memory. Only use the return value if you want to see if the kref is now + * gone, not present.
+ */ +int kref_put_new(struct kref *kref, void (*release)(struct kref *kref)) +{ + WARN_ON(release == NULL); + WARN_ON(release == (void (*)(struct kref *))kfree); + + if (atomic_dec_and_test(&kref->refcount)) { + release(kref); + return 1; + } + return 0; +} +EXPORT_SYMBOL(kref_put_new); diff --git a/kernel_addons/backport/2.6.9_U4/include/src/scsi.c b/kernel_addons/backport/2.6.9_U4/include/src/scsi.c new file mode 100644 index 0000000..8c833c0 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/scsi.c @@ -0,0 +1,50 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +/** + * starget_for_each_device - helper to walk all devices of a target + * @starget: target whose devices we want to iterate over. + * + * This traverses over each device of @starget. The devices have + * a reference that must be released by scsi_host_put when breaking + * out of the loop.
+ */ +void starget_for_each_device(struct scsi_target *starget, void * data, + void (*fn)(struct scsi_device *, void *)) +{ + struct Scsi_Host *shost = dev_to_shost(starget->dev.parent); + struct scsi_device *sdev; + + printk("%s: entry\n", __FUNCTION__); + shost_for_each_device(sdev, shost) { + if ((sdev->channel == starget->channel) && + (sdev->id == starget->id)) + fn(sdev, data); + } + printk("%s: exit\n", __FUNCTION__); +} +EXPORT_SYMBOL(starget_for_each_device); diff --git a/kernel_addons/backport/2.6.9_U4/include/src/scsi_lib.c b/kernel_addons/backport/2.6.9_U4/include/src/scsi_lib.c new file mode 100644 index 0000000..327b53f --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/scsi_lib.c @@ -0,0 +1,164 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +int scsi_is_target_device(const struct device *dev) +{ + char *str = dev->bus_id; + + if (strncmp(str, "target", 6) == 0) { + return 1; + } + + return 0; +} + +/** + * scsi_internal_device_block - internal function to put a device + * temporarily into the SDEV_BLOCK state + * @sdev: device to block + * + * Block request made by scsi lld's to temporarily stop all + * scsi commands on the specified device. Called from interrupt + * or normal process context. + * + * Returns zero if successful or error if not + * + * Notes: + * This routine transitions the device to the SDEV_BLOCK state + * (which must be a legal transition). When the device is in this + * state, all commands are deferred until the scsi lld reenables + * the device with scsi_device_unblock or device_block_tmo fires. + * This routine assumes the host_lock is held on entry. 
+ **/ +int +scsi_internal_device_block(struct scsi_device *sdev) +{ + request_queue_t *q = sdev->request_queue; + unsigned long flags; + int err = 0; + + err = scsi_device_set_state(sdev, SDEV_BLOCK); + if (err) + return err; + + /* + * The device has transitioned to SDEV_BLOCK. Stop the + * block layer from calling the midlayer with this device's + * request queue. + */ + spin_lock_irqsave(q->queue_lock, flags); + blk_stop_queue(q); + spin_unlock_irqrestore(q->queue_lock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(scsi_internal_device_block); + +/** + * scsi_internal_device_unblock - resume a device after a block request + * @sdev: device to resume + * + * Called by scsi lld's or the midlayer to restart the device queue + * for the previously suspended scsi device. Called from interrupt or + * normal process context. + * + * Returns zero if successful or error if not. + * + * Notes: + * This routine transitions the device to the SDEV_RUNNING state + * (which must be a legal transition) allowing the midlayer to + * goose the queue for this device. This routine assumes the + * host_lock is held upon entry. + **/ +int +scsi_internal_device_unblock(struct scsi_device *sdev) +{ + request_queue_t *q = sdev->request_queue; + int err; + unsigned long flags; + + + /* + * Try to transition the scsi device to SDEV_RUNNING + * and goose the device queue if successful. 
+ */ + err = scsi_device_set_state(sdev, SDEV_RUNNING); + if (err) + return err; + + spin_lock_irqsave(q->queue_lock, flags); + blk_start_queue(q); + spin_unlock_irqrestore(q->queue_lock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(scsi_internal_device_unblock); + +static void +device_block(struct scsi_device *sdev, void *data) +{ + scsi_internal_device_block(sdev); +} + +static int +target_block(struct device *dev, void *data) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_block); + + return 0; +} + +void +scsi_target_block(struct device *dev) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_block); + else + device_for_each_child(dev, NULL, target_block); +} +EXPORT_SYMBOL_GPL(scsi_target_block); + +static void +device_unblock(struct scsi_device *sdev, void *data) +{ + scsi_internal_device_unblock(sdev); +} + +static int +target_unblock(struct device *dev, void *data) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_unblock); + return 0; +} + +void +scsi_target_unblock(struct device *dev) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_unblock); + else + device_for_each_child(dev, NULL, target_unblock); +} +EXPORT_SYMBOL_GPL(scsi_target_unblock); diff --git a/kernel_addons/backport/2.6.9_U4/include/src/scsi_scan.c b/kernel_addons/backport/2.6.9_U4/include/src/scsi_scan.c new file mode 100644 index 0000000..b7b7674 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/scsi_scan.c @@ -0,0 +1,48 @@ +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +/** + * int_to_scsilun: reverts an int into a scsi_lun + * @lun: integer to be reverted + * @scsilun: struct scsi_lun to be set.
+ * + * Description: + * Reverts the functionality of the scsilun_to_int, which packed + * an 8-byte lun value into an int. This routine unpacks the int + * back into the lun value. + * Note: the scsilun_to_int() routine does not truly handle all + * 8 bytes of the lun value. This function restores only as much + * as was set by the routine. + * + * Notes: + * Given the integer 0x0b030a04, this function returns a + * struct scsi_lun of: 0a 04 0b 03 00 00 00 00 + * + **/ +void int_to_scsilun(unsigned int lun, struct scsi_lun *scsilun) +{ + int i; + + memset(scsilun->scsi_lun, 0, sizeof(scsilun->scsi_lun)); + + for (i = 0; i < sizeof(lun); i += 2) { + scsilun->scsi_lun[i] = (lun >> 8) & 0xFF; + scsilun->scsi_lun[i+1] = lun & 0xFF; + lun = lun >> 16; + } +} +EXPORT_SYMBOL(int_to_scsilun); diff --git a/kernel_addons/backport/2.6.9_U4/include/src/transport_class.c b/kernel_addons/backport/2.6.9_U4/include/src/transport_class.c new file mode 100644 index 0000000..f25e7c6 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/transport_class.c @@ -0,0 +1,280 @@ +/* + * transport_class.c - implementation of generic transport classes + * using attribute_containers + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + * + * The basic idea here is to allow any "device controller" (which + * would most often be a Host Bus Adapter) to use the services of one + * or more transport classes for performing transport specific + * services. Transport specific services are things that the generic + * command layer doesn't want to know about (speed settings, line + * conditioning, etc), but which the user might be interested in. + * Thus, the HBA's use the routines exported by the transport classes + * to perform these functions. The transport classes export certain + * values to the user via sysfs using attribute containers.
+ * + * Note: because not every HBA will care about every transport + * attribute, there's a many to one relationship that goes like this: + * + * transport class<-----attribute container<----class device + * + * Usually the attribute container is per-HBA, but the design doesn't + * mandate that. Although most of the services will be specific to + * the actual external storage connection used by the HBA, the generic + * transport class is framed entirely in terms of generic devices to + * allow it to be used by any physical HBA in the system. + */ +#include +#include + +/** + * transport_class_register - register an initial transport class + * + * @tclass: a pointer to the transport class structure to be initialised + * + * The transport class contains an embedded class which is used to + * identify it. The caller should initialise this structure with + * zeros and then generic class must have been initialised with the + * actual transport class unique name. There's a macro + * DECLARE_TRANSPORT_CLASS() to do this (declared classes still must + * be registered). + * + * Returns 0 on success or error on failure. + */ +int transport_class_register(struct transport_class *tclass) +{ + return class_register(&tclass->class); +} +EXPORT_SYMBOL_GPL(transport_class_register); + +/** + * transport_class_unregister - unregister a previously registered class + * + * @tclass: The transport class to unregister + * + * Must be called prior to deallocating the memory for the transport + * class. 
+ */ +void transport_class_unregister(struct transport_class *tclass) +{ + class_unregister(&tclass->class); +} +EXPORT_SYMBOL_GPL(transport_class_unregister); + +static int anon_transport_dummy_function(struct transport_container *tc, + struct device *dev, + struct class_device *cdev) +{ + /* do nothing */ + return 0; +} + +/** + * anon_transport_class_register - register an anonymous class + * + * @atc: The anon transport class to register + * + * The anonymous transport class contains both a transport class and a + * container. The idea of an anonymous class is that it never + * actually has any device attributes associated with it (and thus + * saves on container storage). So it can only be used for triggering + * events. Use prezero and then use DECLARE_ANON_TRANSPORT_CLASS() to + * initialise the anon transport class storage. + */ +int anon_transport_class_register(struct anon_transport_class *atc) +{ + int error; + atc->container.class = &atc->tclass.class; + attribute_container_set_no_classdevs(&atc->container); + error = attribute_container_register(&atc->container); + if (error) + return error; + atc->tclass.setup = anon_transport_dummy_function; + atc->tclass.remove = anon_transport_dummy_function; + return 0; +} +EXPORT_SYMBOL_GPL(anon_transport_class_register); + +/** + * anon_transport_class_unregister - unregister an anon class + * + * @atc: Pointer to the anon transport class to unregister + * + * Must be called prior to deallocating the memory for the anon + * transport class. 
+ */ +void anon_transport_class_unregister(struct anon_transport_class *atc) +{ + attribute_container_unregister(&atc->container); +} +EXPORT_SYMBOL_GPL(anon_transport_class_unregister); + +static int transport_setup_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + struct transport_container *tcont = attribute_container_to_transport_container(cont); + + if (tclass->setup) + tclass->setup(tcont, dev, classdev); + + return 0; +} + +/** + * transport_setup_device - declare a new dev for transport class association + * but don't make it visible yet. + * + * @dev: the generic device representing the entity being added + * + * Usually, dev represents some component in the HBA system (either + * the HBA itself or a device remote across the HBA bus). This + * routine is simply a trigger point to see if any set of transport + * classes wishes to associate with the added device. This allocates + * storage for the class device and initialises it, but does not yet + * add it to the system or add attributes to it (you do this with + * transport_add_device). If you have no need for a separate setup + * and add operations, use transport_register_device (see + * transport_class.h). 
+ */ + +void transport_setup_device(struct device *dev) +{ + attribute_container_add_device(dev, transport_setup_classdev); +} +EXPORT_SYMBOL_GPL(transport_setup_device); + +static int transport_add_class_device(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + int error = attribute_container_add_class_device(classdev); + struct transport_container *tcont = + attribute_container_to_transport_container(cont); + + if (!error && tcont->statistics) + error = sysfs_create_group(&classdev->kobj, tcont->statistics); + + return error; +} + + +/** + * transport_add_device - declare a new dev for transport class association + * + * @dev: the generic device representing the entity being added + * + * Usually, dev represents some component in the HBA system (either + * the HBA itself or a device remote across the HBA bus). This + * routine is simply a trigger point used to add the device to the + * system and register attributes for it. + */ + +void transport_add_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_add_class_device); +} +EXPORT_SYMBOL_GPL(transport_add_device); + +static int transport_configure(struct attribute_container *cont, + struct device *dev, + struct class_device *cdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + struct transport_container *tcont = attribute_container_to_transport_container(cont); + + if (tclass->configure) + tclass->configure(tcont, dev, cdev); + + return 0; +} + +/** + * transport_configure_device - configure an already set up device + * + * @dev: generic device representing device to be configured + * + * The idea of configure is simply to provide a point within the setup + * process to allow the transport class to extract information from a + * device after it has been setup. 
This is used in SCSI because we + * have to have a setup device to begin using the HBA, but after we + * send the initial inquiry, we use configure to extract the device + * parameters. The device need not have been added to be configured. + */ +void transport_configure_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_configure); +} +EXPORT_SYMBOL_GPL(transport_configure_device); + +static int transport_remove_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_container *tcont = + attribute_container_to_transport_container(cont); + struct transport_class *tclass = class_to_transport_class(cont->class); + + if (tclass->remove) + tclass->remove(tcont, dev, classdev); + + if (tclass->remove != anon_transport_dummy_function) { + if (tcont->statistics) + sysfs_remove_group(&classdev->kobj, tcont->statistics); + attribute_container_class_device_del(classdev); + } + + return 0; +} + + +/** + * transport_remove_device - remove the visibility of a device + * + * @dev: generic device to remove + * + * This call removes the visibility of the device (to the user from + * sysfs), but does not destroy it. To eliminate a device entirely + * you must also call transport_destroy_device. If you don't need to + * do remove and destroy as separate operations, use + * transport_unregister_device() (see transport_class.h) which will + * perform both calls for you. 
+ */
+void transport_remove_device(struct device *dev)
+{
+	attribute_container_device_trigger(dev, transport_remove_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_remove_device);
+
+static void transport_destroy_classdev(struct attribute_container *cont,
+				       struct device *dev,
+				       struct class_device *classdev)
+{
+	struct transport_class *tclass = class_to_transport_class(cont->class);
+
+	if (tclass->remove != anon_transport_dummy_function)
+		class_device_put(classdev);
+}
+
+
+/**
+ * transport_destroy_device - destroy a removed device
+ *
+ * @dev: device to eliminate from the transport class.
+ *
+ * This call triggers the elimination of storage associated with the
+ * transport classdev. Note: all it really does is relinquish a
+ * reference to the classdev. The memory will not be freed until the
+ * last reference goes to zero. Note also that the classdev retains a
+ * reference count on dev, so dev too will remain for as long as the
+ * transport class device remains around.
+ */
+void transport_destroy_device(struct device *dev)
+{
+	attribute_container_remove_device(dev, transport_destroy_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_destroy_device);
diff --git a/kernel_patches/backport/2.6.9_U3/add_iscsi_proto_h.patch b/kernel_patches/backport/2.6.9_U3/add_iscsi_proto_h.patch
new file mode 100644
index 0000000..c4df6bb
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/add_iscsi_proto_h.patch
@@ -0,0 +1,591 @@
+diff -rupN linux-2.6.20-like-rh4/include/scsi/iscsi_proto.h linux-2.6.20/include/scsi/iscsi_proto.h
+--- linux-2.6.20-like-rh4/include/scsi/iscsi_proto.h 1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.20/include/scsi/iscsi_proto.h 2007-02-04 20:44:54.000000000 +0200
+@@ -0,0 +1,587 @@
++/*
++ * RFC 3720 (iSCSI) protocol data types
++ *
++ * Copyright (C) 2005 Dmitry Yusupov
++ * Copyright (C) 2005 Alex Aizman
++ * maintained by open-iscsi at googlegroups.com
++ *
++ * This program is free software; you can redistribute it and/or modify
++ * it under
the terms of the GNU General Public License as published ++ * by the Free Software Foundation; either version 2 of the License, or ++ * (at your option) any later version. ++ * ++ * This program is distributed in the hope that it will be useful, but ++ * WITHOUT ANY WARRANTY; without even the implied warranty of ++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ++ * General Public License for more details. ++ * ++ * See the file COPYING included with this distribution for more details. ++ */ ++ ++#ifndef ISCSI_PROTO_H ++#define ISCSI_PROTO_H ++ ++#define ISCSI_DRAFT20_VERSION 0x00 ++ ++/* default iSCSI listen port for incoming connections */ ++#define ISCSI_LISTEN_PORT 3260 ++ ++/* Padding word length */ ++#define PAD_WORD_LEN 4 ++ ++/* ++ * useful common(control and data pathes) macro ++ */ ++#define ntoh24(p) (((p)[0] << 16) | ((p)[1] << 8) | ((p)[2])) ++#define hton24(p, v) { \ ++ p[0] = (((v) >> 16) & 0xFF); \ ++ p[1] = (((v) >> 8) & 0xFF); \ ++ p[2] = ((v) & 0xFF); \ ++} ++#define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;} ++ ++/* ++ * iSCSI Template Message Header ++ */ ++struct iscsi_hdr { ++ uint8_t opcode; ++ uint8_t flags; /* Final bit */ ++ uint8_t rsvd2[2]; ++ uint8_t hlength; /* AHSs total length */ ++ uint8_t dlength[3]; /* Data length */ ++ uint8_t lun[8]; ++ __be32 itt; /* Initiator Task Tag */ ++ __be32 ttt; /* Target Task Tag */ ++ __be32 statsn; ++ __be32 exp_statsn; ++ __be32 max_statsn; ++ uint8_t other[12]; ++}; ++ ++/************************* RFC 3720 Begin *****************************/ ++ ++#define ISCSI_RESERVED_TAG 0xffffffff ++ ++/* Opcode encoding bits */ ++#define ISCSI_OP_RETRY 0x80 ++#define ISCSI_OP_IMMEDIATE 0x40 ++#define ISCSI_OPCODE_MASK 0x3F ++ ++/* Initiator Opcode values */ ++#define ISCSI_OP_NOOP_OUT 0x00 ++#define ISCSI_OP_SCSI_CMD 0x01 ++#define ISCSI_OP_SCSI_TMFUNC 0x02 ++#define ISCSI_OP_LOGIN 0x03 ++#define ISCSI_OP_TEXT 0x04 ++#define ISCSI_OP_SCSI_DATA_OUT 0x05 ++#define ISCSI_OP_LOGOUT 0x06 ++#define 
ISCSI_OP_SNACK 0x10 ++ ++#define ISCSI_OP_VENDOR1_CMD 0x1c ++#define ISCSI_OP_VENDOR2_CMD 0x1d ++#define ISCSI_OP_VENDOR3_CMD 0x1e ++#define ISCSI_OP_VENDOR4_CMD 0x1f ++ ++/* Target Opcode values */ ++#define ISCSI_OP_NOOP_IN 0x20 ++#define ISCSI_OP_SCSI_CMD_RSP 0x21 ++#define ISCSI_OP_SCSI_TMFUNC_RSP 0x22 ++#define ISCSI_OP_LOGIN_RSP 0x23 ++#define ISCSI_OP_TEXT_RSP 0x24 ++#define ISCSI_OP_SCSI_DATA_IN 0x25 ++#define ISCSI_OP_LOGOUT_RSP 0x26 ++#define ISCSI_OP_R2T 0x31 ++#define ISCSI_OP_ASYNC_EVENT 0x32 ++#define ISCSI_OP_REJECT 0x3f ++ ++struct iscsi_ahs_hdr { ++ __be16 ahslength; ++ uint8_t ahstype; ++ uint8_t ahspec[5]; ++}; ++ ++#define ISCSI_AHSTYPE_CDB 1 ++#define ISCSI_AHSTYPE_RLENGTH 2 ++ ++/* iSCSI PDU Header */ ++struct iscsi_cmd { ++ uint8_t opcode; ++ uint8_t flags; ++ __be16 rsvd2; ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t lun[8]; ++ __be32 itt; /* Initiator Task Tag */ ++ __be32 data_length; ++ __be32 cmdsn; ++ __be32 exp_statsn; ++ uint8_t cdb[16]; /* SCSI Command Block */ ++ /* Additional Data (Command Dependent) */ ++}; ++ ++/* Command PDU flags */ ++#define ISCSI_FLAG_CMD_FINAL 0x80 ++#define ISCSI_FLAG_CMD_READ 0x40 ++#define ISCSI_FLAG_CMD_WRITE 0x20 ++#define ISCSI_FLAG_CMD_ATTR_MASK 0x07 /* 3 bits */ ++ ++/* SCSI Command Attribute values */ ++#define ISCSI_ATTR_UNTAGGED 0 ++#define ISCSI_ATTR_SIMPLE 1 ++#define ISCSI_ATTR_ORDERED 2 ++#define ISCSI_ATTR_HEAD_OF_QUEUE 3 ++#define ISCSI_ATTR_ACA 4 ++ ++struct iscsi_rlength_ahdr { ++ __be16 ahslength; ++ uint8_t ahstype; ++ uint8_t reserved; ++ __be32 read_length; ++}; ++ ++/* SCSI Response Header */ ++struct iscsi_cmd_rsp { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t response; ++ uint8_t cmd_status; ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t rsvd[8]; ++ __be32 itt; /* Initiator Task Tag */ ++ __be32 rsvd1; ++ __be32 statsn; ++ __be32 exp_cmdsn; ++ __be32 max_cmdsn; ++ __be32 exp_datasn; ++ __be32 bi_residual_count; ++ __be32 residual_count; ++ /* Response or Sense 
Data (optional) */ ++}; ++ ++/* Command Response PDU flags */ ++#define ISCSI_FLAG_CMD_BIDI_OVERFLOW 0x10 ++#define ISCSI_FLAG_CMD_BIDI_UNDERFLOW 0x08 ++#define ISCSI_FLAG_CMD_OVERFLOW 0x04 ++#define ISCSI_FLAG_CMD_UNDERFLOW 0x02 ++ ++/* iSCSI Status values. Valid if Rsp Selector bit is not set */ ++#define ISCSI_STATUS_CMD_COMPLETED 0 ++#define ISCSI_STATUS_TARGET_FAILURE 1 ++#define ISCSI_STATUS_SUBSYS_FAILURE 2 ++ ++/* Asynchronous Event Header */ ++struct iscsi_async { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t rsvd2[2]; ++ uint8_t rsvd3; ++ uint8_t dlength[3]; ++ uint8_t lun[8]; ++ uint8_t rsvd4[8]; ++ __be32 statsn; ++ __be32 exp_cmdsn; ++ __be32 max_cmdsn; ++ uint8_t async_event; ++ uint8_t async_vcode; ++ __be16 param1; ++ __be16 param2; ++ __be16 param3; ++ uint8_t rsvd5[4]; ++}; ++ ++/* iSCSI Event Codes */ ++#define ISCSI_ASYNC_MSG_SCSI_EVENT 0 ++#define ISCSI_ASYNC_MSG_REQUEST_LOGOUT 1 ++#define ISCSI_ASYNC_MSG_DROPPING_CONNECTION 2 ++#define ISCSI_ASYNC_MSG_DROPPING_ALL_CONNECTIONS 3 ++#define ISCSI_ASYNC_MSG_PARAM_NEGOTIATION 4 ++#define ISCSI_ASYNC_MSG_VENDOR_SPECIFIC 255 ++ ++/* NOP-Out Message */ ++struct iscsi_nopout { ++ uint8_t opcode; ++ uint8_t flags; ++ __be16 rsvd2; ++ uint8_t rsvd3; ++ uint8_t dlength[3]; ++ uint8_t lun[8]; ++ __be32 itt; /* Initiator Task Tag */ ++ __be32 ttt; /* Target Transfer Tag */ ++ __be32 cmdsn; ++ __be32 exp_statsn; ++ uint8_t rsvd4[16]; ++}; ++ ++/* NOP-In Message */ ++struct iscsi_nopin { ++ uint8_t opcode; ++ uint8_t flags; ++ __be16 rsvd2; ++ uint8_t rsvd3; ++ uint8_t dlength[3]; ++ uint8_t lun[8]; ++ __be32 itt; /* Initiator Task Tag */ ++ __be32 ttt; /* Target Transfer Tag */ ++ __be32 statsn; ++ __be32 exp_cmdsn; ++ __be32 max_cmdsn; ++ uint8_t rsvd4[12]; ++}; ++ ++/* SCSI Task Management Message Header */ ++struct iscsi_tm { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t rsvd1[2]; ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t lun[8]; ++ __be32 itt; /* Initiator Task Tag */ ++ __be32 rtt; /* 
Reference Task Tag */ ++ __be32 cmdsn; ++ __be32 exp_statsn; ++ __be32 refcmdsn; ++ __be32 exp_datasn; ++ uint8_t rsvd2[8]; ++}; ++ ++#define ISCSI_FLAG_TM_FUNC_MASK 0x7F ++ ++/* Function values */ ++#define ISCSI_TM_FUNC_ABORT_TASK 1 ++#define ISCSI_TM_FUNC_ABORT_TASK_SET 2 ++#define ISCSI_TM_FUNC_CLEAR_ACA 3 ++#define ISCSI_TM_FUNC_CLEAR_TASK_SET 4 ++#define ISCSI_TM_FUNC_LOGICAL_UNIT_RESET 5 ++#define ISCSI_TM_FUNC_TARGET_WARM_RESET 6 ++#define ISCSI_TM_FUNC_TARGET_COLD_RESET 7 ++#define ISCSI_TM_FUNC_TASK_REASSIGN 8 ++ ++/* SCSI Task Management Response Header */ ++struct iscsi_tm_rsp { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t response; /* see Response values below */ ++ uint8_t qualifier; ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t rsvd2[8]; ++ __be32 itt; /* Initiator Task Tag */ ++ __be32 rtt; /* Reference Task Tag */ ++ __be32 statsn; ++ __be32 exp_cmdsn; ++ __be32 max_cmdsn; ++ uint8_t rsvd3[12]; ++}; ++ ++/* Response values */ ++#define ISCSI_TMF_RSP_COMPLETE 0x00 ++#define ISCSI_TMF_RSP_NO_TASK 0x01 ++#define ISCSI_TMF_RSP_NO_LUN 0x02 ++#define ISCSI_TMF_RSP_TASK_ALLEGIANT 0x03 ++#define ISCSI_TMF_RSP_NO_FAILOVER 0x04 ++#define ISCSI_TMF_RSP_NOT_SUPPORTED 0x05 ++#define ISCSI_TMF_RSP_AUTH_FAILED 0x06 ++#define ISCSI_TMF_RSP_REJECTED 0xff ++ ++/* Ready To Transfer Header */ ++struct iscsi_r2t_rsp { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t rsvd2[2]; ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t lun[8]; ++ __be32 itt; /* Initiator Task Tag */ ++ __be32 ttt; /* Target Transfer Tag */ ++ __be32 statsn; ++ __be32 exp_cmdsn; ++ __be32 max_cmdsn; ++ __be32 r2tsn; ++ __be32 data_offset; ++ __be32 data_length; ++}; ++ ++/* SCSI Data Hdr */ ++struct iscsi_data { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t rsvd2[2]; ++ uint8_t rsvd3; ++ uint8_t dlength[3]; ++ uint8_t lun[8]; ++ __be32 itt; ++ __be32 ttt; ++ __be32 rsvd4; ++ __be32 exp_statsn; ++ __be32 rsvd5; ++ __be32 datasn; ++ __be32 offset; ++ __be32 rsvd6; ++ /* Payload */ 
++}; ++ ++/* SCSI Data Response Hdr */ ++struct iscsi_data_rsp { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t rsvd2; ++ uint8_t cmd_status; ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t lun[8]; ++ __be32 itt; ++ __be32 ttt; ++ __be32 statsn; ++ __be32 exp_cmdsn; ++ __be32 max_cmdsn; ++ __be32 datasn; ++ __be32 offset; ++ __be32 residual_count; ++}; ++ ++/* Data Response PDU flags */ ++#define ISCSI_FLAG_DATA_ACK 0x40 ++#define ISCSI_FLAG_DATA_OVERFLOW 0x04 ++#define ISCSI_FLAG_DATA_UNDERFLOW 0x02 ++#define ISCSI_FLAG_DATA_STATUS 0x01 ++ ++/* Text Header */ ++struct iscsi_text { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t rsvd2[2]; ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t rsvd4[8]; ++ __be32 itt; ++ __be32 ttt; ++ __be32 cmdsn; ++ __be32 exp_statsn; ++ uint8_t rsvd5[16]; ++ /* Text - key=value pairs */ ++}; ++ ++#define ISCSI_FLAG_TEXT_CONTINUE 0x40 ++ ++/* Text Response Header */ ++struct iscsi_text_rsp { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t rsvd2[2]; ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t rsvd4[8]; ++ __be32 itt; ++ __be32 ttt; ++ __be32 statsn; ++ __be32 exp_cmdsn; ++ __be32 max_cmdsn; ++ uint8_t rsvd5[12]; ++ /* Text Response - key:value pairs */ ++}; ++ ++/* Login Header */ ++struct iscsi_login { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t max_version; /* Max. version supported */ ++ uint8_t min_version; /* Min. 
version supported */ ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t isid[6]; /* Initiator Session ID */ ++ __be16 tsih; /* Target Session Handle */ ++ __be32 itt; /* Initiator Task Tag */ ++ __be16 cid; ++ __be16 rsvd3; ++ __be32 cmdsn; ++ __be32 exp_statsn; ++ uint8_t rsvd5[16]; ++}; ++ ++/* Login PDU flags */ ++#define ISCSI_FLAG_LOGIN_TRANSIT 0x80 ++#define ISCSI_FLAG_LOGIN_CONTINUE 0x40 ++#define ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK 0x0C /* 2 bits */ ++#define ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK 0x03 /* 2 bits */ ++ ++#define ISCSI_LOGIN_CURRENT_STAGE(flags) \ ++ ((flags & ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK) >> 2) ++#define ISCSI_LOGIN_NEXT_STAGE(flags) \ ++ (flags & ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK) ++ ++/* Login Response Header */ ++struct iscsi_login_rsp { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t max_version; /* Max. version supported */ ++ uint8_t active_version; /* Active version */ ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t isid[6]; /* Initiator Session ID */ ++ __be16 tsih; /* Target Session Handle */ ++ __be32 itt; /* Initiator Task Tag */ ++ __be32 rsvd3; ++ __be32 statsn; ++ __be32 exp_cmdsn; ++ __be32 max_cmdsn; ++ uint8_t status_class; /* see Login RSP ststus classes below */ ++ uint8_t status_detail; /* see Login RSP Status details below */ ++ uint8_t rsvd4[10]; ++}; ++ ++/* Login stage (phase) codes for CSG, NSG */ ++#define ISCSI_INITIAL_LOGIN_STAGE -1 ++#define ISCSI_SECURITY_NEGOTIATION_STAGE 0 ++#define ISCSI_OP_PARMS_NEGOTIATION_STAGE 1 ++#define ISCSI_FULL_FEATURE_PHASE 3 ++ ++/* Login Status response classes */ ++#define ISCSI_STATUS_CLS_SUCCESS 0x00 ++#define ISCSI_STATUS_CLS_REDIRECT 0x01 ++#define ISCSI_STATUS_CLS_INITIATOR_ERR 0x02 ++#define ISCSI_STATUS_CLS_TARGET_ERR 0x03 ++ ++/* Login Status response detail codes */ ++/* Class-0 (Success) */ ++#define ISCSI_LOGIN_STATUS_ACCEPT 0x00 ++ ++/* Class-1 (Redirection) */ ++#define ISCSI_LOGIN_STATUS_TGT_MOVED_TEMP 0x01 ++#define ISCSI_LOGIN_STATUS_TGT_MOVED_PERM 0x02 
++ ++/* Class-2 (Initiator Error) */ ++#define ISCSI_LOGIN_STATUS_INIT_ERR 0x00 ++#define ISCSI_LOGIN_STATUS_AUTH_FAILED 0x01 ++#define ISCSI_LOGIN_STATUS_TGT_FORBIDDEN 0x02 ++#define ISCSI_LOGIN_STATUS_TGT_NOT_FOUND 0x03 ++#define ISCSI_LOGIN_STATUS_TGT_REMOVED 0x04 ++#define ISCSI_LOGIN_STATUS_NO_VERSION 0x05 ++#define ISCSI_LOGIN_STATUS_ISID_ERROR 0x06 ++#define ISCSI_LOGIN_STATUS_MISSING_FIELDS 0x07 ++#define ISCSI_LOGIN_STATUS_CONN_ADD_FAILED 0x08 ++#define ISCSI_LOGIN_STATUS_NO_SESSION_TYPE 0x09 ++#define ISCSI_LOGIN_STATUS_NO_SESSION 0x0a ++#define ISCSI_LOGIN_STATUS_INVALID_REQUEST 0x0b ++ ++/* Class-3 (Target Error) */ ++#define ISCSI_LOGIN_STATUS_TARGET_ERROR 0x00 ++#define ISCSI_LOGIN_STATUS_SVC_UNAVAILABLE 0x01 ++#define ISCSI_LOGIN_STATUS_NO_RESOURCES 0x02 ++ ++/* Logout Header */ ++struct iscsi_logout { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t rsvd1[2]; ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t rsvd2[8]; ++ __be32 itt; /* Initiator Task Tag */ ++ __be16 cid; ++ uint8_t rsvd3[2]; ++ __be32 cmdsn; ++ __be32 exp_statsn; ++ uint8_t rsvd4[16]; ++}; ++ ++/* Logout PDU flags */ ++#define ISCSI_FLAG_LOGOUT_REASON_MASK 0x7F ++ ++/* logout reason_code values */ ++ ++#define ISCSI_LOGOUT_REASON_CLOSE_SESSION 0 ++#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION 1 ++#define ISCSI_LOGOUT_REASON_RECOVERY 2 ++#define ISCSI_LOGOUT_REASON_AEN_REQUEST 3 ++ ++/* Logout Response Header */ ++struct iscsi_logout_rsp { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t response; /* see Logout response values below */ ++ uint8_t rsvd2; ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t rsvd3[8]; ++ __be32 itt; /* Initiator Task Tag */ ++ __be32 rsvd4; ++ __be32 statsn; ++ __be32 exp_cmdsn; ++ __be32 max_cmdsn; ++ __be32 rsvd5; ++ __be16 t2wait; ++ __be16 t2retain; ++ __be32 rsvd6; ++}; ++ ++/* logout response status values */ ++ ++#define ISCSI_LOGOUT_SUCCESS 0 ++#define ISCSI_LOGOUT_CID_NOT_FOUND 1 ++#define ISCSI_LOGOUT_RECOVERY_UNSUPPORTED 2 ++#define 
ISCSI_LOGOUT_CLEANUP_FAILED 3 ++ ++/* SNACK Header */ ++struct iscsi_snack { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t rsvd2[14]; ++ __be32 itt; ++ __be32 begrun; ++ __be32 runlength; ++ __be32 exp_statsn; ++ __be32 rsvd3; ++ __be32 exp_datasn; ++ uint8_t rsvd6[8]; ++}; ++ ++/* SNACK PDU flags */ ++#define ISCSI_FLAG_SNACK_TYPE_MASK 0x0F /* 4 bits */ ++ ++/* Reject Message Header */ ++struct iscsi_reject { ++ uint8_t opcode; ++ uint8_t flags; ++ uint8_t reason; ++ uint8_t rsvd2; ++ uint8_t hlength; ++ uint8_t dlength[3]; ++ uint8_t rsvd3[8]; ++ __be32 ffffffff; ++ uint8_t rsvd4[4]; ++ __be32 statsn; ++ __be32 exp_cmdsn; ++ __be32 max_cmdsn; ++ __be32 datasn; ++ uint8_t rsvd5[8]; ++ /* Text - Rejected hdr */ ++}; ++ ++/* Reason for Reject */ ++#define ISCSI_REASON_CMD_BEFORE_LOGIN 1 ++#define ISCSI_REASON_DATA_DIGEST_ERROR 2 ++#define ISCSI_REASON_DATA_SNACK_REJECT 3 ++#define ISCSI_REASON_PROTOCOL_ERROR 4 ++#define ISCSI_REASON_CMD_NOT_SUPPORTED 5 ++#define ISCSI_REASON_IMM_CMD_REJECT 6 ++#define ISCSI_REASON_TASK_IN_PROGRESS 7 ++#define ISCSI_REASON_INVALID_SNACK 8 ++#define ISCSI_REASON_BOOKMARK_INVALID 9 ++#define ISCSI_REASON_BOOKMARK_NO_RESOURCES 10 ++#define ISCSI_REASON_NEGOTIATION_RESET 11 ++ ++/* Max. 
number of Key=Value pairs in a text message */ ++#define MAX_KEY_VALUE_PAIRS 8192 ++ ++/* maximum length for text keys/values */ ++#define KEY_MAXLEN 64 ++#define VALUE_MAXLEN 255 ++#define TARGET_NAME_MAXLEN VALUE_MAXLEN ++ ++#define DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH 8192 ++ ++/************************* RFC 3720 End *****************************/ ++ ++#endif /* ISCSI_PROTO_H */ diff --git a/kernel_patches/backport/2.6.9_U3/add_iser.patch b/kernel_patches/backport/2.6.9_U3/add_iser.patch new file mode 100644 index 0000000..0da53d2 --- /dev/null +++ b/kernel_patches/backport/2.6.9_U3/add_iser.patch @@ -0,0 +1,13 @@ +diff -rup linux-2.6.20/drivers/infiniband/ulp/iser/iser_initiator.c linux-2.6.20-backport-rh4-u3/drivers/infiniband/ulp/iser/iser_initiator.c +--- linux-2.6.20/drivers/infiniband/ulp/iser/iser_initiator.c 2007-02-04 20:44:54.000000000 +0200 ++++ linux-2.6.20-backport-rh4-u3/drivers/infiniband/ulp/iser/iser_initiator.c 2007-03-26 11:27:11.000000000 +0200 +@@ -618,7 +618,8 @@ void iser_snd_completion(struct iser_des + + if (resume_tx) { + iser_dbg("%ld resuming tx\n",jiffies); +- scsi_queue_work(conn->session->host, &conn->xmitwork); ++ //scsi_queue_work(conn->session->host, &conn->xmitwork); ++ schedule_work(&conn->xmitwork); + } + + if (tx_desc->type == ISCSI_TX_CONTROL) { diff --git a/kernel_patches/backport/2.6.9_U3/add_memory_h.patch b/kernel_patches/backport/2.6.9_U3/add_memory_h.patch new file mode 100644 index 0000000..5daad2e --- /dev/null +++ b/kernel_patches/backport/2.6.9_U3/add_memory_h.patch @@ -0,0 +1,93 @@ +diff -rupN linux-2.6.20-like-rh4/include/linux/memory.h linux-2.6.20/include/linux/memory.h +--- linux-2.6.20-like-rh4/include/linux/memory.h 1970-01-01 02:00:00.000000000 +0200 ++++ linux-2.6.20/include/linux/memory.h 2007-02-04 20:44:54.000000000 +0200 +@@ -0,0 +1,89 @@ ++/* ++ * include/linux/memory.h - generic memory definition ++ * ++ * This is mainly for topological representation. 
We define the ++ * basic "struct memory_block" here, which can be embedded in per-arch ++ * definitions or NUMA information. ++ * ++ * Basic handling of the devices is done in drivers/base/memory.c ++ * and system devices are handled in drivers/base/sys.c. ++ * ++ * Memory block are exported via sysfs in the class/memory/devices/ ++ * directory. ++ * ++ */ ++#ifndef _LINUX_MEMORY_H_ ++#define _LINUX_MEMORY_H_ ++ ++#include ++#include ++#include ++ ++#include ++ ++struct memory_block { ++ unsigned long phys_index; ++ unsigned long state; ++ /* ++ * This serializes all state change requests. It isn't ++ * held during creation because the control files are ++ * created long after the critical areas during ++ * initialization. ++ */ ++ struct semaphore state_sem; ++ int phys_device; /* to which fru does this belong? */ ++ void *hw; /* optional pointer to fw/hw data */ ++ int (*phys_callback)(struct memory_block *); ++ struct sys_device sysdev; ++}; ++ ++/* These states are exposed to userspace as text strings in sysfs */ ++#define MEM_ONLINE (1<<0) /* exposed to userspace */ ++#define MEM_GOING_OFFLINE (1<<1) /* exposed to userspace */ ++#define MEM_OFFLINE (1<<2) /* exposed to userspace */ ++ ++/* ++ * All of these states are currently kernel-internal for notifying ++ * kernel components and architectures. ++ * ++ * For MEM_MAPPING_INVALID, all notifier chains with priority >0 ++ * are called before pfn_to_page() becomes invalid. The priority=0 ++ * entry is reserved for the function that actually makes ++ * pfn_to_page() stop working. Any notifiers that want to be called ++ * after that should have priority <0. 
++ */ ++#define MEM_MAPPING_INVALID (1<<3) ++ ++struct notifier_block; ++struct mem_section; ++ ++#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE ++static inline int memory_dev_init(void) ++{ ++ return 0; ++} ++static inline int register_memory_notifier(struct notifier_block *nb) ++{ ++ return 0; ++} ++static inline void unregister_memory_notifier(struct notifier_block *nb) ++{ ++} ++#else ++extern int register_new_memory(struct mem_section *); ++extern int unregister_memory_section(struct mem_section *); ++extern int memory_dev_init(void); ++extern int remove_memory_block(unsigned long, struct mem_section *, int); ++ ++#define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<dd_data; + +- crypto_hash_digest(&tcp_conn->tx_hash, &buf->sg, buf->sg.length, crc); ++ crypto_digest_digest(tcp_conn->tx_tfm, &buf->sg, 1, crc); + buf->sg.length = tcp_conn->hdr_size; + } + +@@ -419,7 +419,7 @@ iscsi_r2t_rsp(struct iscsi_conn *conn, s + tcp_ctask->xmstate |= XMSTATE_SOL_HDR; + list_move_tail(&ctask->running, &conn->xmitqueue); + +- scsi_queue_work(session->host, &conn->xmitwork); ++ schedule_work(&conn->xmitwork); + conn->r2t_pdus_cnt++; + spin_unlock(&session->lock); + +@@ -468,8 +468,7 @@ iscsi_tcp_hdr_recv(struct iscsi_conn *co + + sg_init_one(&sg, (u8 *)hdr, + sizeof(struct iscsi_hdr) + ahslen); +- crypto_hash_digest(&tcp_conn->rx_hash, &sg, sg.length, +- (u8 *)&cdgst); ++ crypto_digest_digest(tcp_conn->rx_tfm, &sg, 1, (u8 *)&cdgst); + rdgst = *(uint32_t*)((char*)hdr + sizeof(struct iscsi_hdr) + + ahslen); + if (cdgst != rdgst) { +@@ -676,7 +675,7 @@ iscsi_tcp_copy(struct iscsi_conn *conn, + } + + static inline void +-partial_sg_digest_update(struct hash_desc *desc, struct scatterlist *sg, ++partial_sg_digest_update(struct crypto_tfm *tfm, struct scatterlist *sg, + int offset, int length) + { + struct scatterlist temp; +@@ -684,7 +683,7 @@ partial_sg_digest_update(struct hash_des + memcpy(&temp, sg, sizeof(struct scatterlist)); + temp.offset = offset; + temp.length = length; +- 
crypto_hash_update(desc, &temp, length); ++ crypto_digest_update(tfm, &temp, 1); + } + + static void +@@ -693,7 +692,7 @@ iscsi_recv_digest_update(struct iscsi_tc + struct scatterlist tmp; + + sg_init_one(&tmp, buf, len); +- crypto_hash_update(&tcp_conn->rx_hash, &tmp, len); ++ crypto_digest_update(tcp_conn->rx_tfm, &tmp, 1); + } + + static int iscsi_scsi_data_in(struct iscsi_conn *conn) +@@ -747,12 +746,12 @@ static int iscsi_scsi_data_in(struct isc + if (!rc) { + if (conn->datadgst_en) { + if (!offset) +- crypto_hash_update( +- &tcp_conn->rx_hash, +- &sg[i], sg[i].length); ++ crypto_digest_update( ++ tcp_conn->rx_tfm, ++ &sg[i], 1); + else + partial_sg_digest_update( +- &tcp_conn->rx_hash, ++ tcp_conn->rx_tfm, + &sg[i], + sg[i].offset + offset, + sg[i].length - offset); +@@ -766,10 +765,9 @@ static int iscsi_scsi_data_in(struct isc + /* + * data-in is complete, but buffer not... + */ +- partial_sg_digest_update(&tcp_conn->rx_hash, +- &sg[i], +- sg[i].offset, +- sg[i].length-rc); ++ partial_sg_digest_update(tcp_conn->rx_tfm, ++ &sg[i], ++ sg[i].offset, sg[i].length-rc); + rc = 0; + break; + } +@@ -887,7 +885,7 @@ more: + rc = iscsi_tcp_hdr_recv(conn); + if (!rc && tcp_conn->in.datalen) { + if (conn->datadgst_en) +- crypto_hash_init(&tcp_conn->rx_hash); ++ crypto_digest_init(tcp_conn->rx_tfm); + tcp_conn->in_progress = IN_PROGRESS_DATA_RECV; + } else if (rc) { + iscsi_conn_failure(conn, rc); +@@ -944,11 +942,11 @@ more: + tcp_conn->in.padding); + memset(pad, 0, tcp_conn->in.padding); + sg_init_one(&sg, pad, tcp_conn->in.padding); +- crypto_hash_update(&tcp_conn->rx_hash, +- &sg, sg.length); ++ crypto_digest_update(tcp_conn->rx_tfm, ++ &sg, 1); + } +- crypto_hash_final(&tcp_conn->rx_hash, +- (u8 *) &tcp_conn->in.datadgst); ++ crypto_digest_final(tcp_conn->rx_tfm, ++ (u8 *) &tcp_conn->in.datadgst); + debug_tcp("rx digest 0x%x\n", tcp_conn->in.datadgst); + tcp_conn->in_progress = IN_PROGRESS_DDIGEST_RECV; + tcp_conn->data_copied = 0; +@@ -1043,7 +1041,7 @@ 
iscsi_write_space(struct sock *sk) + + tcp_conn->old_write_space(sk); + debug_tcp("iscsi_write_space: cid %d\n", conn->id); +- scsi_queue_work(conn->session->host, &conn->xmitwork); ++ schedule_work(&conn->xmitwork); + } + + static void +@@ -1193,7 +1191,7 @@ static inline void + iscsi_data_digest_init(struct iscsi_tcp_conn *tcp_conn, + struct iscsi_tcp_cmd_task *tcp_ctask) + { +- crypto_hash_init(&tcp_conn->tx_hash); ++ crypto_digest_init(tcp_conn->tx_tfm); + tcp_ctask->digest_count = 4; + } + +@@ -1449,9 +1447,8 @@ iscsi_send_padding(struct iscsi_conn *co + iscsi_buf_init_iov(&tcp_ctask->sendbuf, (char*)&tcp_ctask->pad, + tcp_ctask->pad_count); + if (conn->datadgst_en) +- crypto_hash_update(&tcp_conn->tx_hash, +- &tcp_ctask->sendbuf.sg, +- tcp_ctask->sendbuf.sg.length); ++ crypto_digest_update(tcp_conn->tx_tfm, ++ &tcp_ctask->sendbuf.sg, 1); + } else if (!(tcp_ctask->xmstate & XMSTATE_W_RESEND_PAD)) + return 0; + +@@ -1483,7 +1480,7 @@ iscsi_send_digest(struct iscsi_conn *con + tcp_conn = conn->dd_data; + + if (!(tcp_ctask->xmstate & XMSTATE_W_RESEND_DATA_DIGEST)) { +- crypto_hash_final(&tcp_conn->tx_hash, (u8*)digest); ++ crypto_digest_final(tcp_conn->tx_tfm, (u8*)digest); + iscsi_buf_init_iov(buf, (char*)digest, 4); + } + tcp_ctask->xmstate &= ~XMSTATE_W_RESEND_DATA_DIGEST; +@@ -1517,7 +1514,7 @@ iscsi_send_data(struct iscsi_cmd_task *c + rc = iscsi_sendpage(conn, sendbuf, count, &buf_sent); + *sent = *sent + buf_sent; + if (buf_sent && conn->datadgst_en) +- partial_sg_digest_update(&tcp_conn->tx_hash, ++ partial_sg_digest_update(tcp_conn->tx_tfm, + &sendbuf->sg, sendbuf->sg.offset + offset, + buf_sent); + if (!iscsi_buf_left(sendbuf) && *sg != tcp_ctask->bad_sg) { +@@ -1774,22 +1771,18 @@ iscsi_tcp_conn_create(struct iscsi_cls_s + /* initial operational parameters */ + tcp_conn->hdr_size = sizeof(struct iscsi_hdr); + +- tcp_conn->tx_hash.tfm = crypto_alloc_hash("crc32c", 0, +- CRYPTO_ALG_ASYNC); +- tcp_conn->tx_hash.flags = 0; +- if 
(IS_ERR(tcp_conn->tx_hash.tfm)) ++ tcp_conn->tx_tfm = crypto_alloc_tfm("crc32c", 0); ++ if (!tcp_conn->tx_tfm) + goto free_tcp_conn; + +- tcp_conn->rx_hash.tfm = crypto_alloc_hash("crc32c", 0, +- CRYPTO_ALG_ASYNC); +- tcp_conn->rx_hash.flags = 0; +- if (IS_ERR(tcp_conn->rx_hash.tfm)) ++ tcp_conn->rx_tfm = crypto_alloc_tfm("crc32c", 0); ++ if (!tcp_conn->rx_tfm) + goto free_tx_tfm; + + return cls_conn; + + free_tx_tfm: +- crypto_free_hash(tcp_conn->tx_hash.tfm); ++ crypto_free_tfm(tcp_conn->tx_tfm); + free_tcp_conn: + kfree(tcp_conn); + tcp_conn_alloc_fail: +@@ -1823,10 +1816,10 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_ + iscsi_tcp_release_conn(conn); + iscsi_conn_teardown(cls_conn); + +- if (tcp_conn->tx_hash.tfm) +- crypto_free_hash(tcp_conn->tx_hash.tfm); +- if (tcp_conn->rx_hash.tfm) +- crypto_free_hash(tcp_conn->rx_hash.tfm); ++ if (tcp_conn->tx_tfm) ++ crypto_free_tfm(tcp_conn->tx_tfm); ++ if (tcp_conn->rx_tfm) ++ crypto_free_tfm(tcp_conn->rx_tfm); + + kfree(tcp_conn); + } +@@ -2017,7 +2010,7 @@ iscsi_tcp_conn_get_param(struct iscsi_cl + { + struct iscsi_conn *conn = cls_conn->dd_data; + struct iscsi_tcp_conn *tcp_conn = conn->dd_data; +- struct inet_sock *inet; ++ struct inet_opt *inet; + struct ipv6_pinfo *np; + struct sock *sk; + int len; +@@ -2044,11 +2037,13 @@ iscsi_tcp_conn_get_param(struct iscsi_cl + sk = tcp_conn->sock->sk; + if (sk->sk_family == PF_INET) { + inet = inet_sk(sk); +- len = sprintf(buf, NIPQUAD_FMT "\n", ++ len = sprintf(buf, "%u.%u.%u.%u\n", + NIPQUAD(inet->daddr)); + } else { + np = inet6_sk(sk); +- len = sprintf(buf, NIP6_FMT "\n", NIP6(np->daddr)); ++ len = sprintf(buf, ++ "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", ++ NIP6(np->daddr)); + } + mutex_unlock(&conn->xmitmutex); + break; +@@ -2135,7 +2130,6 @@ static void iscsi_tcp_session_destroy(st + static struct scsi_host_template iscsi_sht = { + .name = "iSCSI Initiator over TCP/IP", + .queuecommand = iscsi_queuecommand, +- .change_queue_depth = iscsi_change_queue_depth, + 
.can_queue = ISCSI_XMIT_CMDS_MAX - 1, + .sg_tablesize = ISCSI_SG_TABLESIZE, + .cmd_per_lun = ISCSI_DEF_CMD_PER_LUN, +diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.h linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.h +--- linux-2.6.20/drivers/scsi/iscsi_tcp.h 2007-02-04 20:44:54.000000000 +0200 ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.h 2007-04-01 13:11:55.000000000 +0300 +@@ -49,7 +49,6 @@ + #define ISCSI_SG_TABLESIZE SG_ALL + #define ISCSI_TCP_MAX_CMD_LEN 16 + +-struct crypto_hash; + struct socket; + + /* Socket connection recieve helper */ +@@ -93,8 +92,8 @@ struct iscsi_tcp_conn { + void (*old_write_space)(struct sock *); + + /* data and header digests */ +- struct hash_desc tx_hash; /* CRC32C (Tx) */ +- struct hash_desc rx_hash; /* CRC32C (Rx) */ ++ struct crypto_tfm *tx_tfm; /* CRC32C (Tx) */ ++ struct crypto_tfm *rx_tfm; /* CRC32C (Rx) */ + + /* MIB custom statistics */ + uint32_t sendpage_failures_cnt; +diff -rup linux-2.6.20/drivers/scsi/libiscsi.c linux-2.6.20-backport-rh4-u3/drivers/scsi/libiscsi.c +--- linux-2.6.20/drivers/scsi/libiscsi.c 2007-02-04 20:44:54.000000000 +0200 ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/libiscsi.c 2007-04-01 13:15:57.000000000 +0300 +@@ -23,6 +23,7 @@ + */ + #include + #include ++#include + #include + #include + #include +@@ -831,7 +832,7 @@ int iscsi_queuecommand(struct scsi_cmnd + session->cmdsn, session->max_cmdsn - session->exp_cmdsn + 1); + spin_unlock(&session->lock); + +- scsi_queue_work(host, &conn->xmitwork); ++ schedule_work(&conn->xmitwork); + return 0; + + reject: +@@ -932,7 +933,7 @@ iscsi_conn_send_generic(struct iscsi_con + else + __kfifo_put(conn->mgmtqueue, (void*)&mtask, sizeof(void*)); + +- scsi_queue_work(session->host, &conn->xmitwork); ++ schedule_work(&conn->xmitwork); + return 0; + } + +@@ -1370,7 +1371,6 @@ iscsi_session_setup(struct iscsi_transpo + shost->max_lun = iscsit->max_lun; + shost->max_cmd_len = iscsit->max_cmd_len; + shost->transportt = scsit; +- 
shost->transportt->create_work_queue = 1; + *hostno = shost->host_no; + + session = iscsi_hostdata(shost->hostdata); +diff -rup linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c linux-2.6.20-backport-rh4-u3/drivers/scsi/scsi_transport_iscsi.c +--- linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c 2007-02-04 20:44:54.000000000 +0200 ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/scsi_transport_iscsi.c 2007-04-01 13:18:33.000000000 +0300 +@@ -29,11 +29,15 @@ + #include + #include + #include ++#include ++#include + + #define ISCSI_SESSION_ATTRS 11 + #define ISCSI_CONN_ATTRS 11 + #define ISCSI_HOST_ATTRS 0 +-#define ISCSI_TRANSPORT_VERSION "2.0-724" ++#define ISCSI_TRANSPORT_VERSION "2.0-754" ++ ++#define SCAN_WILD_CARD ~0 + + struct iscsi_internal { + int daemon_pid; +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l + #define cdev_to_iscsi_internal(_cdev) \ + container_of(_cdev, struct iscsi_internal, cdev) + ++extern int attribute_container_init(void); ++ + static void iscsi_transport_release(struct class_device *cdev) + { + struct iscsi_internal *priv = cdev_to_iscsi_internal(cdev); +@@ -80,6 +86,17 @@ static struct class iscsi_transport_clas + .release = iscsi_transport_release, + }; + ++static void iscsi_host_class_release(struct class_device *class_dev) ++{ ++ struct Scsi_Host *shost = transport_class_to_shost(class_dev); ++ put_device(&shost->shost_gendev); ++} ++ ++struct class iscsi_host_class = { ++ .name = "iscsi_host", ++ .release = iscsi_host_class_release, ++}; ++ + static ssize_t + show_transport_handle(struct class_device *cdev, char *buf) + { +@@ -115,10 +132,8 @@ static struct attribute_group iscsi_tran + .attrs = iscsi_transport_attrs, + }; + +-static int iscsi_setup_host(struct transport_container *tc, struct device *dev, +- struct class_device *cdev) ++static int iscsi_setup_host(struct Scsi_Host *shost) + { +- struct Scsi_Host *shost = dev_to_shost(dev); + struct iscsi_host *ihost = shost->shost_data; + + memset(ihost, 0, sizeof(*ihost)); 
+@@ -127,12 +142,6 @@ static int iscsi_setup_host(struct trans + return 0; + } + +-static DECLARE_TRANSPORT_CLASS(iscsi_host_class, +- "iscsi_host", +- iscsi_setup_host, +- NULL, +- NULL); +- + static DECLARE_TRANSPORT_CLASS(iscsi_session_class, + "iscsi_session", + NULL, +@@ -216,28 +225,10 @@ static int iscsi_is_session_dev(const st + return dev->release == iscsi_session_release; + } + +-static int iscsi_user_scan(struct Scsi_Host *shost, uint channel, +- uint id, uint lun) +-{ +- struct iscsi_host *ihost = shost->shost_data; +- struct iscsi_cls_session *session; +- +- mutex_lock(&ihost->mutex); +- list_for_each_entry(session, &ihost->sessions, host_list) { +- if ((channel == SCAN_WILD_CARD || channel == 0) && +- (id == SCAN_WILD_CARD || id == session->target_id)) +- scsi_scan_target(&session->dev, 0, +- session->target_id, lun, 1); +- } +- mutex_unlock(&ihost->mutex); +- +- return 0; +-} +- +-static void session_recovery_timedout(struct work_struct *work) ++static void session_recovery_timedout(void *data) + { + struct iscsi_cls_session *session = +- container_of(work, struct iscsi_cls_session, ++ container_of(data, struct iscsi_cls_session, + recovery_work.work); + + dev_printk(KERN_INFO, &session->dev, "iscsi: session recovery timed " +@@ -362,8 +353,6 @@ void iscsi_remove_session(struct iscsi_c + list_del(&session->host_list); + mutex_unlock(&ihost->mutex); + +- scsi_remove_target(&session->dev); +- + transport_unregister_device(&session->dev); + device_del(&session->dev); + } +@@ -452,6 +441,7 @@ iscsi_create_conn(struct iscsi_cls_sessi + goto release_parent_ref; + } + transport_register_device(&conn->dev); ++ + return conn; + + release_parent_ref: +@@ -606,9 +596,8 @@ iscsi_if_send_reply(int pid, int seq, in + struct nlmsghdr *nlh; + int len = NLMSG_SPACE(size); + int flags = multi ? NLM_F_MULTI : 0; +- int t = done ? 
NLMSG_DONE : type; + +- skb = alloc_skb(len, GFP_ATOMIC); ++ skb = alloc_skb(len, GFP_KERNEL); + /* + * FIXME: + * user is supposed to react on iferror == -ENOMEM; +@@ -649,7 +638,7 @@ iscsi_if_get_stats(struct iscsi_transpor + do { + int actual_size; + +- skbstat = alloc_skb(len, GFP_ATOMIC); ++ skbstat = alloc_skb(len, GFP_KERNEL); + if (!skbstat) { + dev_printk(KERN_ERR, &conn->dev, "iscsi: can not " + "deliver stats: OOM\n"); +@@ -1269,24 +1258,6 @@ static int iscsi_conn_match(struct attri + return &priv->conn_cont.ac == cont; + } + +-static int iscsi_host_match(struct attribute_container *cont, +- struct device *dev) +-{ +- struct Scsi_Host *shost; +- struct iscsi_internal *priv; +- +- if (!scsi_is_host_device(dev)) +- return 0; +- +- shost = dev_to_shost(dev); +- if (!shost->transportt || +- shost->transportt->host_attrs.ac.class != &iscsi_host_class.class) +- return 0; +- +- priv = to_iscsi_internal(shost->transportt); +- return &priv->t.host_attrs.ac == cont; +-} +- + struct scsi_transport_template * + iscsi_register_transport(struct iscsi_transport *tt) + { +@@ -1306,7 +1277,6 @@ iscsi_register_transport(struct iscsi_tr + INIT_LIST_HEAD(&priv->list); + priv->daemon_pid = -1; + priv->iscsi_transport = tt; +- priv->t.user_scan = iscsi_user_scan; + + priv->cdev.class = &iscsi_transport_class; + snprintf(priv->cdev.class_id, BUS_ID_SIZE, "%s", tt->name); +@@ -1319,12 +1289,11 @@ iscsi_register_transport(struct iscsi_tr + goto unregister_cdev; + + /* host parameters */ +- priv->t.host_attrs.ac.attrs = &priv->host_attrs[0]; +- priv->t.host_attrs.ac.class = &iscsi_host_class.class; +- priv->t.host_attrs.ac.match = iscsi_host_match; ++ ++ priv->t.host_attrs = &priv->host_attrs[0]; ++ priv->t.host_class = &iscsi_host_class; ++ priv->t.host_setup = iscsi_setup_host; + priv->t.host_size = sizeof(struct iscsi_host); +- priv->host_attrs[0] = NULL; +- transport_container_register(&priv->t.host_attrs); + + /* connection parameters */ + priv->conn_cont.ac.attrs = 
&priv->conn_attrs[0]; +@@ -1402,7 +1371,6 @@ int iscsi_unregister_transport(struct is + + transport_container_unregister(&priv->conn_cont); + transport_container_unregister(&priv->session_cont); +- transport_container_unregister(&priv->t.host_attrs); + + sysfs_remove_group(&priv->cdev.kobj, &iscsi_transport_group); + class_device_unregister(&priv->cdev); +@@ -1419,6 +1387,8 @@ static __init int iscsi_transport_init(v + printk(KERN_INFO "Loading iSCSI transport class v%s.\n", + ISCSI_TRANSPORT_VERSION); + ++ attribute_container_init(); ++ + err = class_register(&iscsi_transport_class); + if (err) + return err; diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch new file mode 100644 index 0000000..6dd4429 --- /dev/null +++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch @@ -0,0 +1,60 @@ +diff -rupN linux-2.6.20-rc7/include/scsi/iscsi_compat.h linux-2.6.9/include/scsi/iscsi_compat.h +--- linux-2.6.20-rc7/include/scsi/iscsi_compat.h 1970-01-01 02:00:00.000000000 +0200 ++++ linux-2.6.9/include/scsi/iscsi_compat.h 2007-02-08 08:45:39.000000000 +0200 +@@ -0,0 +1,16 @@ ++#ifndef ISCSI_COMPAT ++#define ISCSI_COMPAT ++ ++#include ++#include ++#include ++ ++#define __nlmsg_put(skb, daemon_pid, seq, type, len, flags) \ ++ __nlmsg_put(skb, daemon_pid, 0, 0, len) ++ ++#define netlink_kernel_create(uint, groups, input, mod) \ ++ netlink_kernel_create(uint, input) ++ ++#define gfp_t unsigned ++ ++#endif /* ISCSI_COMPAT */ +diff -rupN linux-2.6.20-rc7/include/scsi/iscsi_if.h linux-2.6.9/include/scsi/iscsi_if.h +--- linux-2.6.20-rc7/include/scsi/iscsi_if.h 2006-11-29 23:57:37.000000000 +0200 ++++ linux-2.6.9/include/scsi/iscsi_if.h 2007-02-04 12:50:15.000000000 +0200 +@@ -277,7 +277,6 @@ enum iscsi_param { + * These flags describes reason of stop_conn() call + */ + #define STOP_CONN_TERM 0x1 +-#define STOP_CONN_SUSPEND 0x2 + #define STOP_CONN_RECOVER 0x3 + + #define ISCSI_STATS_CUSTOM_MAX 32 +diff 
-rupN linux-2.6.20-rc7/include/scsi/libiscsi.h linux-2.6.9/include/scsi/libiscsi.h +--- linux-2.6.20-rc7/include/scsi/libiscsi.h 2007-02-07 11:10:56.000000000 +0200 ++++ linux-2.6.9/include/scsi/libiscsi.h 2007-02-07 15:51:59.000000000 +0200 +@@ -25,10 +25,9 @@ + + #include + #include +-#include +-#include + #include + #include ++#include + + struct scsi_transport_template; + struct scsi_device; +diff -rupN linux-2.6.20-rc7/include/scsi/scsi_transport_iscsi.h linux-2.6.9/include/scsi/scsi_transport_iscsi.h +--- linux-2.6.20-rc7/include/scsi/scsi_transport_iscsi.h 2007-02-07 11:10:56.000000000 +0200 ++++ linux-2.6.9/include/scsi/scsi_transport_iscsi.h 2007-02-07 15:52:50.000000000 +0200 +@@ -24,7 +24,9 @@ + #define SCSI_TRANSPORT_ISCSI_H + + #include +-#include ++#include "iscsi_if.h" ++#include "iscsi_compat.h" ++//#include <../drivers/scsi/transport_class.h> + + struct scsi_transport_template; + struct iscsi_transport; diff --git a/kernel_patches/backport/2.6.9_U3/add_transport_class_h.patch b/kernel_patches/backport/2.6.9_U3/add_transport_class_h.patch new file mode 100644 index 0000000..f2425e0 --- /dev/null +++ b/kernel_patches/backport/2.6.9_U3/add_transport_class_h.patch @@ -0,0 +1,104 @@ +diff -rupN linux-2.6.20-like-rh4/include/linux/transport_class.h linux-2.6.20/include/linux/transport_class.h +--- linux-2.6.20-like-rh4/include/linux/transport_class.h 1970-01-01 02:00:00.000000000 +0200 ++++ linux-2.6.20/include/linux/transport_class.h 2007-02-04 20:44:54.000000000 +0200 +@@ -0,0 +1,100 @@ ++/* ++ * transport_class.h - a generic container for all transport classes ++ * ++ * Copyright (c) 2005 - James Bottomley ++ * ++ * This file is licensed under GPLv2 ++ */ ++ ++#ifndef _TRANSPORT_CLASS_H_ ++#define _TRANSPORT_CLASS_H_ ++ ++#include ++#include ++ ++struct transport_container; ++ ++struct transport_class { ++ struct class class; ++ int (*setup)(struct transport_container *, struct device *, ++ struct class_device *); ++ int (*configure)(struct 
transport_container *, struct device *, ++ struct class_device *); ++ int (*remove)(struct transport_container *, struct device *, ++ struct class_device *); ++}; ++ ++#define DECLARE_TRANSPORT_CLASS(cls, nm, su, rm, cfg) \ ++struct transport_class cls = { \ ++ .class = { \ ++ .name = nm, \ ++ }, \ ++ .setup = su, \ ++ .remove = rm, \ ++ .configure = cfg, \ ++} ++ ++ ++struct anon_transport_class { ++ struct transport_class tclass; ++ struct attribute_container container; ++}; ++ ++#define DECLARE_ANON_TRANSPORT_CLASS(cls, mtch, cfg) \ ++struct anon_transport_class cls = { \ ++ .tclass = { \ ++ .configure = cfg, \ ++ }, \ ++ . container = { \ ++ .match = mtch, \ ++ }, \ ++} ++ ++#define class_to_transport_class(x) \ ++ container_of(x, struct transport_class, class) ++ ++struct transport_container { ++ struct attribute_container ac; ++ struct attribute_group *statistics; ++}; ++ ++#define attribute_container_to_transport_container(x) \ ++ container_of(x, struct transport_container, ac) ++ ++void transport_remove_device(struct device *); ++void transport_add_device(struct device *); ++void transport_setup_device(struct device *); ++void transport_configure_device(struct device *); ++void transport_destroy_device(struct device *); ++ ++static inline void ++transport_register_device(struct device *dev) ++{ ++ transport_setup_device(dev); ++ transport_add_device(dev); ++} ++ ++static inline void ++transport_unregister_device(struct device *dev) ++{ ++ transport_remove_device(dev); ++ transport_destroy_device(dev); ++} ++ ++static inline int transport_container_register(struct transport_container *tc) ++{ ++ return attribute_container_register(&tc->ac); ++} ++ ++static inline int transport_container_unregister(struct transport_container *tc) ++{ ++ return attribute_container_unregister(&tc->ac); ++} ++ ++int transport_class_register(struct transport_class *); ++int anon_transport_class_register(struct anon_transport_class *); ++void transport_class_unregister(struct 
transport_class *); ++void anon_transport_class_unregister(struct anon_transport_class *); ++ ++ ++#endif diff --git a/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch new file mode 100644 index 0000000..3c2a969 --- /dev/null +++ b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch @@ -0,0 +1,13 @@ +--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:13:43.000000000 +0200 ++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:14:31.000000000 +0200 +@@ -70,9 +70,8 @@ + #include + #include + #include +-#include +- + #include "iscsi_iser.h" ++#include + + static unsigned int iscsi_max_lun = 512; + module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO); diff --git a/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch b/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch index e84b964..52c0136 100644 --- a/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch +++ b/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch @@ -19,6 +19,62 @@ index 0000000..58cf933 +++ b/drivers/infiniband/core/kfifo.c @@ -0,0 +1 @@ +#include "src/kfifo.c" +diff --git a/drivers/infiniband/core/init.c b/drivers/infiniband/core/init.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/init.c +@@ -0,0 +1 @@ ++#include "src/init.c" +diff --git a/drivers/infiniband/core/attribute_container.c b/drivers/infiniband/core/attribute_container.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/attribute_container.c +@@ -0,0 +1 @@ ++#include "src/attribute_container.c" +diff --git a/drivers/infiniband/core/transport_class.c b/drivers/infiniband/core/transport_class.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/transport_class.c +@@ -0,0 +1 @@ ++#include "src/transport_class.c" 
+diff --git a/drivers/infiniband/core/klist.c b/drivers/infiniband/core/klist.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/klist.c +@@ -0,0 +1 @@ ++#include "src/klist.c" +diff --git a/drivers/infiniband/core/scsi.c b/drivers/infiniband/core/scsi.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/scsi.c +@@ -0,0 +1 @@ ++#include "src/scsi.c" +diff --git a/drivers/infiniband/core/scsi_lib.c b/drivers/infiniband/core/scsi_lib.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/scsi_lib.c +@@ -0,0 +1 @@ ++#include "src/scsi_lib.c" +diff --git a/drivers/infiniband/core/scsi_scan.c b/drivers/infiniband/core/scsi_scan.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/scsi_scan.c +@@ -0,0 +1 @@ ++#include "src/scsi_scan.c" +diff --git a/drivers/infiniband/core/kref_new.c b/drivers/infiniband/core/kref_new.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/kref_new.c +@@ -0,0 +1 @@ ++#include "src/kref_new.c" diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 50fb1cd..456bfd0 100644 --- a/drivers/infiniband/core/Makefile @@ -28,4 +84,4 @@ index 50fb1cd..456bfd0 100644 ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ uverbs_marshall.o + -+ib_core-y += genalloc.o netevent.o kfifo.o ++ib_core-y += genalloc.o netevent.o kfifo.o scsi.o scsi_lib.o scsi_scan.o init.o attribute_container.o transport_class.o klist.o kref_new.o diff --git a/kernel_patches/backport/2.6.9_U3/netlink-01-add_netlink_h.patch b/kernel_patches/backport/2.6.9_U3/netlink-01-add_netlink_h.patch new file mode 100644 index 0000000..cc071ef --- /dev/null +++ b/kernel_patches/backport/2.6.9_U3/netlink-01-add_netlink_h.patch @@ -0,0 +1,247 @@ +diff -rupN linux-2.6.20-like-rh4/include/linux/netlink.h linux-2.6.20/include/linux/netlink.h +--- 
linux-2.6.20-like-rh4/include/linux/netlink.h 1970-01-01 02:00:00.000000000 +0200 ++++ linux-2.6.20/include/linux/netlink.h 2007-02-04 20:44:54.000000000 +0200 +@@ -0,0 +1,243 @@ ++#ifndef __LINUX_NETLINK_H ++#define __LINUX_NETLINK_H ++ ++#include /* for sa_family_t */ ++#include ++ ++#define NETLINK_ROUTE 0 /* Routing/device hook */ ++#define NETLINK_UNUSED 1 /* Unused number */ ++#define NETLINK_USERSOCK 2 /* Reserved for user mode socket protocols */ ++#define NETLINK_FIREWALL 3 /* Firewalling hook */ ++#define NETLINK_INET_DIAG 4 /* INET socket monitoring */ ++#define NETLINK_NFLOG 5 /* netfilter/iptables ULOG */ ++#define NETLINK_XFRM 6 /* ipsec */ ++#define NETLINK_SELINUX 7 /* SELinux event notifications */ ++#define NETLINK_ISCSI 8 /* Open-iSCSI */ ++#define NETLINK_AUDIT 9 /* auditing */ ++#define NETLINK_FIB_LOOKUP 10 ++#define NETLINK_CONNECTOR 11 ++#define NETLINK_NETFILTER 12 /* netfilter subsystem */ ++#define NETLINK_IP6_FW 13 ++#define NETLINK_DNRTMSG 14 /* DECnet routing messages */ ++#define NETLINK_KOBJECT_UEVENT 15 /* Kernel messages to userspace */ ++#define NETLINK_GENERIC 16 ++/* leave room for NETLINK_DM (DM Events) */ ++#define NETLINK_SCSITRANSPORT 18 /* SCSI Transports */ ++ ++#define MAX_LINKS 32 ++ ++struct sockaddr_nl ++{ ++ sa_family_t nl_family; /* AF_NETLINK */ ++ unsigned short nl_pad; /* zero */ ++ __u32 nl_pid; /* process pid */ ++ __u32 nl_groups; /* multicast groups mask */ ++}; ++ ++struct nlmsghdr ++{ ++ __u32 nlmsg_len; /* Length of message including header */ ++ __u16 nlmsg_type; /* Message content */ ++ __u16 nlmsg_flags; /* Additional flags */ ++ __u32 nlmsg_seq; /* Sequence number */ ++ __u32 nlmsg_pid; /* Sending process PID */ ++}; ++ ++/* Flags values */ ++ ++#define NLM_F_REQUEST 1 /* It is request message. 
*/ ++#define NLM_F_MULTI 2 /* Multipart message, terminated by NLMSG_DONE */ ++#define NLM_F_ACK 4 /* Reply with ack, with zero or error code */ ++#define NLM_F_ECHO 8 /* Echo this request */ ++ ++/* Modifiers to GET request */ ++#define NLM_F_ROOT 0x100 /* specify tree root */ ++#define NLM_F_MATCH 0x200 /* return all matching */ ++#define NLM_F_ATOMIC 0x400 /* atomic GET */ ++#define NLM_F_DUMP (NLM_F_ROOT|NLM_F_MATCH) ++ ++/* Modifiers to NEW request */ ++#define NLM_F_REPLACE 0x100 /* Override existing */ ++#define NLM_F_EXCL 0x200 /* Do not touch, if it exists */ ++#define NLM_F_CREATE 0x400 /* Create, if it does not exist */ ++#define NLM_F_APPEND 0x800 /* Add to end of list */ ++ ++/* ++ 4.4BSD ADD NLM_F_CREATE|NLM_F_EXCL ++ 4.4BSD CHANGE NLM_F_REPLACE ++ ++ True CHANGE NLM_F_CREATE|NLM_F_REPLACE ++ Append NLM_F_CREATE ++ Check NLM_F_EXCL ++ */ ++ ++#define NLMSG_ALIGNTO 4 ++#define NLMSG_ALIGN(len) ( ((len)+NLMSG_ALIGNTO-1) & ~(NLMSG_ALIGNTO-1) ) ++#define NLMSG_HDRLEN ((int) NLMSG_ALIGN(sizeof(struct nlmsghdr))) ++#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(NLMSG_HDRLEN)) ++#define NLMSG_SPACE(len) NLMSG_ALIGN(NLMSG_LENGTH(len)) ++#define NLMSG_DATA(nlh) ((void*)(((char*)nlh) + NLMSG_LENGTH(0))) ++#define NLMSG_NEXT(nlh,len) ((len) -= NLMSG_ALIGN((nlh)->nlmsg_len), \ ++ (struct nlmsghdr*)(((char*)(nlh)) + NLMSG_ALIGN((nlh)->nlmsg_len))) ++#define NLMSG_OK(nlh,len) ((len) >= (int)sizeof(struct nlmsghdr) && \ ++ (nlh)->nlmsg_len >= sizeof(struct nlmsghdr) && \ ++ (nlh)->nlmsg_len <= (len)) ++#define NLMSG_PAYLOAD(nlh,len) ((nlh)->nlmsg_len - NLMSG_SPACE((len))) ++ ++#define NLMSG_NOOP 0x1 /* Nothing. 
*/ ++#define NLMSG_ERROR 0x2 /* Error */ ++#define NLMSG_DONE 0x3 /* End of a dump */ ++#define NLMSG_OVERRUN 0x4 /* Data lost */ ++ ++#define NLMSG_MIN_TYPE 0x10 /* < 0x10: reserved control messages */ ++ ++struct nlmsgerr ++{ ++ int error; ++ struct nlmsghdr msg; ++}; ++ ++#define NETLINK_ADD_MEMBERSHIP 1 ++#define NETLINK_DROP_MEMBERSHIP 2 ++#define NETLINK_PKTINFO 3 ++ ++struct nl_pktinfo ++{ ++ __u32 group; ++}; ++ ++#define NET_MAJOR 36 /* Major 36 is reserved for networking */ ++ ++enum { ++ NETLINK_UNCONNECTED = 0, ++ NETLINK_CONNECTED, ++}; ++ ++/* ++ * <------- NLA_HDRLEN ------> <-- NLA_ALIGN(payload)--> ++ * +---------------------+- - -+- - - - - - - - - -+- - -+ ++ * | Header | Pad | Payload | Pad | ++ * | (struct nlattr) | ing | | ing | ++ * +---------------------+- - -+- - - - - - - - - -+- - -+ ++ * <-------------- nlattr->nla_len --------------> ++ */ ++ ++struct nlattr ++{ ++ __u16 nla_len; ++ __u16 nla_type; ++}; ++ ++#define NLA_ALIGNTO 4 ++#define NLA_ALIGN(len) (((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)) ++#define NLA_HDRLEN ((int) NLA_ALIGN(sizeof(struct nlattr))) ++ ++#ifdef __KERNEL__ ++ ++#include ++#include ++ ++struct netlink_skb_parms ++{ ++ struct ucred creds; /* Skb credentials */ ++ __u32 pid; ++ __u32 dst_group; ++ kernel_cap_t eff_cap; ++ __u32 loginuid; /* Login (audit) uid */ ++ __u32 sid; /* SELinux security id */ ++}; ++ ++#define NETLINK_CB(skb) (*(struct netlink_skb_parms*)&((skb)->cb)) ++#define NETLINK_CREDS(skb) (&NETLINK_CB((skb)).creds) ++ ++ ++extern struct sock *netlink_kernel_create(int unit, unsigned int groups, void (*input)(struct sock *sk, int len), struct module *module); ++extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err); ++extern int netlink_has_listeners(struct sock *sk, unsigned int group); ++extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 pid, int nonblock); ++extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 pid, ++ __u32 
group, gfp_t allocation); ++extern void netlink_set_err(struct sock *ssk, __u32 pid, __u32 group, int code); ++extern int netlink_register_notifier(struct notifier_block *nb); ++extern int netlink_unregister_notifier(struct notifier_block *nb); ++ ++/* finegrained unicast helpers: */ ++struct sock *netlink_getsockbyfilp(struct file *filp); ++int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock, ++ long timeo, struct sock *ssk); ++void netlink_detachskb(struct sock *sk, struct sk_buff *skb); ++int netlink_sendskb(struct sock *sk, struct sk_buff *skb, int protocol); ++ ++/* ++ * skb should fit one page. This choice is good for headerless malloc. ++ */ ++#define NLMSG_GOODORDER 0 ++#define NLMSG_GOODSIZE (SKB_MAX_ORDER(0, NLMSG_GOODORDER)) ++#define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN) ++ ++ ++struct netlink_callback ++{ ++ struct sk_buff *skb; ++ struct nlmsghdr *nlh; ++ int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); ++ int (*done)(struct netlink_callback *cb); ++ int family; ++ long args[5]; ++}; ++ ++struct netlink_notify ++{ ++ int pid; ++ int protocol; ++}; ++ ++static __inline__ struct nlmsghdr * ++__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len, int flags) ++{ ++ struct nlmsghdr *nlh; ++ int size = NLMSG_LENGTH(len); ++ ++ nlh = (struct nlmsghdr*)skb_put(skb, NLMSG_ALIGN(size)); ++ nlh->nlmsg_type = type; ++ nlh->nlmsg_len = size; ++ nlh->nlmsg_flags = flags; ++ nlh->nlmsg_pid = pid; ++ nlh->nlmsg_seq = seq; ++ memset(NLMSG_DATA(nlh) + len, 0, NLMSG_ALIGN(size) - size); ++ return nlh; ++} ++ ++#define NLMSG_NEW(skb, pid, seq, type, len, flags) \ ++({ if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) \ ++ goto nlmsg_failure; \ ++ __nlmsg_put(skb, pid, seq, type, len, flags); }) ++ ++#define NLMSG_PUT(skb, pid, seq, type, len) \ ++ NLMSG_NEW(skb, pid, seq, type, len, 0) ++ ++#define NLMSG_NEW_ANSWER(skb, cb, type, len, flags) \ ++ NLMSG_NEW(skb, NETLINK_CB((cb)->skb).pid, \ ++ 
(cb)->nlh->nlmsg_seq, type, len, flags) ++ ++#define NLMSG_END(skb, nlh) \ ++({ (nlh)->nlmsg_len = (skb)->tail - (unsigned char *) (nlh); \ ++ (skb)->len; }) ++ ++#define NLMSG_CANCEL(skb, nlh) \ ++({ skb_trim(skb, (unsigned char *) (nlh) - (skb)->data); \ ++ -1; }) ++ ++extern int netlink_dump_start(struct sock *ssk, struct sk_buff *skb, ++ struct nlmsghdr *nlh, ++ int (*dump)(struct sk_buff *skb, struct netlink_callback*), ++ int (*done)(struct netlink_callback*)); ++ ++ ++#define NL_NONROOT_RECV 0x1 ++#define NL_NONROOT_SEND 0x2 ++extern void netlink_set_nonroot(int protocol, unsigned flag); ++ ++#endif /* __KERNEL__ */ ++ ++#endif /* __LINUX_NETLINK_H */ diff --git a/kernel_patches/backport/2.6.9_U3/netlink-02-netlink_h_for_rh4.patch b/kernel_patches/backport/2.6.9_U3/netlink-02-netlink_h_for_rh4.patch new file mode 100644 index 0000000..d9ba403 --- /dev/null +++ b/kernel_patches/backport/2.6.9_U3/netlink-02-netlink_h_for_rh4.patch @@ -0,0 +1,200 @@ +diff -rup linux-2.6.20/include/linux/netlink.h linux-2.6.20-backport-rh4-u3/include/linux/netlink.h +--- linux-2.6.20/include/linux/netlink.h 2007-02-04 20:44:54.000000000 +0200 ++++ linux-2.6.20-backport-rh4-u3/include/linux/netlink.h 2007-03-08 10:09:43.000000000 +0200 +@@ -5,24 +5,19 @@ + #include + + #define NETLINK_ROUTE 0 /* Routing/device hook */ +-#define NETLINK_UNUSED 1 /* Unused number */ ++#define NETLINK_SKIP 1 /* Reserved for ENskip */ + #define NETLINK_USERSOCK 2 /* Reserved for user mode socket protocols */ + #define NETLINK_FIREWALL 3 /* Firewalling hook */ +-#define NETLINK_INET_DIAG 4 /* INET socket monitoring */ ++#define NETLINK_TCPDIAG 4 /* TCP socket monitoring */ + #define NETLINK_NFLOG 5 /* netfilter/iptables ULOG */ + #define NETLINK_XFRM 6 /* ipsec */ + #define NETLINK_SELINUX 7 /* SELinux event notifications */ +-#define NETLINK_ISCSI 8 /* Open-iSCSI */ ++#define NETLINK_ISCSI 8 + #define NETLINK_AUDIT 9 /* auditing */ +-#define NETLINK_FIB_LOOKUP 10 +-#define NETLINK_CONNECTOR 11 
+-#define NETLINK_NETFILTER 12 /* netfilter subsystem */ ++#define NETLINK_ROUTE6 11 /* af_inet6 route comm channel */ + #define NETLINK_IP6_FW 13 + #define NETLINK_DNRTMSG 14 /* DECnet routing messages */ +-#define NETLINK_KOBJECT_UEVENT 15 /* Kernel messages to userspace */ +-#define NETLINK_GENERIC 16 +-/* leave room for NETLINK_DM (DM Events) */ +-#define NETLINK_SCSITRANSPORT 18 /* SCSI Transports */ ++#define NETLINK_TAPBASE 16 /* 16 to 31 are ethertap */ + + #define MAX_LINKS 32 + +@@ -73,8 +68,7 @@ struct nlmsghdr + + #define NLMSG_ALIGNTO 4 + #define NLMSG_ALIGN(len) ( ((len)+NLMSG_ALIGNTO-1) & ~(NLMSG_ALIGNTO-1) ) +-#define NLMSG_HDRLEN ((int) NLMSG_ALIGN(sizeof(struct nlmsghdr))) +-#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(NLMSG_HDRLEN)) ++#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(sizeof(struct nlmsghdr))) + #define NLMSG_SPACE(len) NLMSG_ALIGN(NLMSG_LENGTH(len)) + #define NLMSG_DATA(nlh) ((void*)(((char*)nlh) + NLMSG_LENGTH(0))) + #define NLMSG_NEXT(nlh,len) ((len) -= NLMSG_ALIGN((nlh)->nlmsg_len), \ +@@ -89,23 +83,12 @@ struct nlmsghdr + #define NLMSG_DONE 0x3 /* End of a dump */ + #define NLMSG_OVERRUN 0x4 /* Data lost */ + +-#define NLMSG_MIN_TYPE 0x10 /* < 0x10: reserved control messages */ +- + struct nlmsgerr + { + int error; + struct nlmsghdr msg; + }; + +-#define NETLINK_ADD_MEMBERSHIP 1 +-#define NETLINK_DROP_MEMBERSHIP 2 +-#define NETLINK_PKTINFO 3 +- +-struct nl_pktinfo +-{ +- __u32 group; +-}; +- + #define NET_MAJOR 36 /* Major 36 is reserved for networking */ + + enum { +@@ -113,25 +96,6 @@ enum { + NETLINK_CONNECTED, + }; + +-/* +- * <------- NLA_HDRLEN ------> <-- NLA_ALIGN(payload)--> +- * +---------------------+- - -+- - - - - - - - - -+- - -+ +- * | Header | Pad | Payload | Pad | +- * | (struct nlattr) | ing | | ing | +- * +---------------------+- - -+- - - - - - - - - -+- - -+ +- * <-------------- nlattr->nla_len --------------> +- */ +- +-struct nlattr +-{ +- __u16 nla_len; +- __u16 nla_type; +-}; +- +-#define NLA_ALIGNTO 4 
+-#define NLA_ALIGN(len) (((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)) +-#define NLA_HDRLEN ((int) NLA_ALIGN(sizeof(struct nlattr))) +- + #ifdef __KERNEL__ + + #include +@@ -141,39 +105,42 @@ struct netlink_skb_parms + { + struct ucred creds; /* Skb credentials */ + __u32 pid; +- __u32 dst_group; ++ __u32 groups; ++ __u32 dst_pid; ++ __u32 dst_groups; + kernel_cap_t eff_cap; + __u32 loginuid; /* Login (audit) uid */ +- __u32 sid; /* SELinux security id */ + }; + + #define NETLINK_CB(skb) (*(struct netlink_skb_parms*)&((skb)->cb)) + #define NETLINK_CREDS(skb) (&NETLINK_CB((skb)).creds) + + +-extern struct sock *netlink_kernel_create(int unit, unsigned int groups, void (*input)(struct sock *sk, int len), struct module *module); ++extern int netlink_attach(int unit, int (*function)(int,struct sk_buff *skb)); ++extern void netlink_detach(int unit); ++extern int netlink_post(int unit, struct sk_buff *skb); ++extern struct sock *netlink_kernel_create(int unit, void (*input)(struct sock *sk, int len)); + extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err); +-extern int netlink_has_listeners(struct sock *sk, unsigned int group); + extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 pid, int nonblock); + extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 pid, +- __u32 group, gfp_t allocation); ++ __u32 group, int allocation); + extern void netlink_set_err(struct sock *ssk, __u32 pid, __u32 group, int code); + extern int netlink_register_notifier(struct notifier_block *nb); + extern int netlink_unregister_notifier(struct notifier_block *nb); + + /* finegrained unicast helpers: */ ++struct sock *netlink_getsockbypid(struct sock *ssk, u32 pid); + struct sock *netlink_getsockbyfilp(struct file *filp); +-int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock, +- long timeo, struct sock *ssk); ++int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock, long timeo); + void 
netlink_detachskb(struct sock *sk, struct sk_buff *skb); + int netlink_sendskb(struct sock *sk, struct sk_buff *skb, int protocol); + + /* + * skb should fit one page. This choice is good for headerless malloc. ++ * ++ * FIXME: What is the best size for SLAB???? --ANK + */ +-#define NLMSG_GOODORDER 0 +-#define NLMSG_GOODSIZE (SKB_MAX_ORDER(0, NLMSG_GOODORDER)) +-#define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN) ++#define NLMSG_GOODSIZE (PAGE_SIZE - ((sizeof(struct sk_buff)+0xF)&~0xF)) + + + struct netlink_callback +@@ -183,7 +150,7 @@ struct netlink_callback + int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); + int (*done)(struct netlink_callback *cb); + int family; +- long args[5]; ++ long args[4]; + }; + + struct netlink_notify +@@ -193,7 +160,7 @@ struct netlink_notify + }; + + static __inline__ struct nlmsghdr * +-__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len, int flags) ++__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len) + { + struct nlmsghdr *nlh; + int size = NLMSG_LENGTH(len); +@@ -201,32 +168,15 @@ __nlmsg_put(struct sk_buff *skb, u32 pid + nlh = (struct nlmsghdr*)skb_put(skb, NLMSG_ALIGN(size)); + nlh->nlmsg_type = type; + nlh->nlmsg_len = size; +- nlh->nlmsg_flags = flags; ++ nlh->nlmsg_flags = 0; + nlh->nlmsg_pid = pid; + nlh->nlmsg_seq = seq; +- memset(NLMSG_DATA(nlh) + len, 0, NLMSG_ALIGN(size) - size); + return nlh; + } + +-#define NLMSG_NEW(skb, pid, seq, type, len, flags) \ +-({ if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) \ +- goto nlmsg_failure; \ +- __nlmsg_put(skb, pid, seq, type, len, flags); }) +- + #define NLMSG_PUT(skb, pid, seq, type, len) \ +- NLMSG_NEW(skb, pid, seq, type, len, 0) +- +-#define NLMSG_NEW_ANSWER(skb, cb, type, len, flags) \ +- NLMSG_NEW(skb, NETLINK_CB((cb)->skb).pid, \ +- (cb)->nlh->nlmsg_seq, type, len, flags) +- +-#define NLMSG_END(skb, nlh) \ +-({ (nlh)->nlmsg_len = (skb)->tail - (unsigned char *) (nlh); \ +- (skb)->len; }) +- +-#define 
NLMSG_CANCEL(skb, nlh) \ +-({ skb_trim(skb, (unsigned char *) (nlh) - (skb)->data); \ +- -1; }) ++({ if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) goto nlmsg_failure; \ ++ __nlmsg_put(skb, pid, seq, type, len); }) + + extern int netlink_dump_start(struct sock *ssk, struct sk_buff *skb, + struct nlmsghdr *nlh, diff --git a/kernel_patches/backport/2.6.9_U4/add_iscsi_proto_h.patch b/kernel_patches/backport/2.6.9_U4/add_iscsi_proto_h.patch new file mode 100644 index 0000000..c4df6bb --- /dev/null +++ b/kernel_patches/backport/2.6.9_U4/add_iscsi_proto_h.patch @@ -0,0 +1,591 @@ +diff -rupN linux-2.6.20-like-rh4/include/scsi/iscsi_proto.h linux-2.6.20/include/scsi/iscsi_proto.h +--- linux-2.6.20-like-rh4/include/scsi/iscsi_proto.h 1970-01-01 02:00:00.000000000 +0200 ++++ linux-2.6.20/include/scsi/iscsi_proto.h 2007-02-04 20:44:54.000000000 +0200 +@@ -0,0 +1,587 @@ ++/* ++ * RFC 3720 (iSCSI) protocol data types ++ * ++ * Copyright (C) 2005 Dmitry Yusupov ++ * Copyright (C) 2005 Alex Aizman ++ * maintained by open-iscsi at googlegroups.com ++ * ++ * This program is free software; you can redistribute it and/or modify ++ * it under the terms of the GNU General Public License as published ++ * by the Free Software Foundation; either version 2 of the License, or ++ * (at your option) any later version. ++ * ++ * This program is distributed in the hope that it will be useful, but ++ * WITHOUT ANY WARRANTY; without even the implied warranty of ++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ++ * General Public License for more details. ++ * ++ * See the file COPYING included with this distribution for more details. 
++ */ ++ ++#ifndef ISCSI_PROTO_H ++#define ISCSI_PROTO_H ++ ++#define ISCSI_DRAFT20_VERSION 0x00 ++ ++/* default iSCSI listen port for incoming connections */ ++#define ISCSI_LISTEN_PORT 3260 ++ ++/* Padding word length */ ++#define PAD_WORD_LEN 4 ++ ++/* ++ * useful common(control and data pathes) macro ++ */ ++#define ntoh24(p) (((p)[0] << 16) | ((p)[1] << 8) | ((p)[2])) ++#define hton24(p, v) { \ ++ p[0] = (((v) >> 16) & 0xFF); \ ++ p[1] = (((v) >> 8) & 0xFF); \ ++ p[2] = ((v) & 0xFF); \ ++} ++#define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;} ++ ++/* ++ * iSCSI Template Message Header ++ */ ++struct iscsi_hdr { ++ uint8_t opcode; ++ uint8_t flags; /* Final bit */ ++ uint8_t rsvd2[2]; ++ uint8_t hlength; /* AHSs total length */ ++ uint8_t dlength[3]; /* Data length */ ++ uint8_t lun[8]; ++ __be32 itt; /* Initiator Task Tag */ ++ __be32 ttt; /* Target Task Tag */ ++ __be32 statsn; ++ __be32 exp_statsn; ++ __be32 max_statsn; ++ uint8_t other[12]; ++}; ++ ++/************************* RFC 3720 Begin *****************************/ ++ ++#define ISCSI_RESERVED_TAG 0xffffffff ++ ++/* Opcode encoding bits */ ++#define ISCSI_OP_RETRY 0x80 ++#define ISCSI_OP_IMMEDIATE 0x40 ++#define ISCSI_OPCODE_MASK 0x3F ++ ++/* Initiator Opcode values */ ++#define ISCSI_OP_NOOP_OUT 0x00 ++#define ISCSI_OP_SCSI_CMD 0x01 ++#define ISCSI_OP_SCSI_TMFUNC 0x02 ++#define ISCSI_OP_LOGIN 0x03 ++#define ISCSI_OP_TEXT 0x04 ++#define ISCSI_OP_SCSI_DATA_OUT 0x05 ++#define ISCSI_OP_LOGOUT 0x06 ++#define ISCSI_OP_SNACK 0x10 ++ ++#define ISCSI_OP_VENDOR1_CMD 0x1c ++#define ISCSI_OP_VENDOR2_CMD 0x1d ++#define ISCSI_OP_VENDOR3_CMD 0x1e ++#define ISCSI_OP_VENDOR4_CMD 0x1f ++ ++/* Target Opcode values */ ++#define ISCSI_OP_NOOP_IN 0x20 ++#define ISCSI_OP_SCSI_CMD_RSP 0x21 ++#define ISCSI_OP_SCSI_TMFUNC_RSP 0x22 ++#define ISCSI_OP_LOGIN_RSP 0x23 ++#define ISCSI_OP_TEXT_RSP 0x24 ++#define ISCSI_OP_SCSI_DATA_IN 0x25 ++#define ISCSI_OP_LOGOUT_RSP 0x26 ++#define ISCSI_OP_R2T 0x31 ++#define ISCSI_OP_ASYNC_EVENT 0x32 
++#define ISCSI_OP_REJECT 0x3f
++
++struct iscsi_ahs_hdr {
++ __be16 ahslength;
++ uint8_t ahstype;
++ uint8_t ahspec[5];
++};
++
++#define ISCSI_AHSTYPE_CDB 1
++#define ISCSI_AHSTYPE_RLENGTH 2
++
++/* iSCSI PDU Header */
++struct iscsi_cmd {
++ uint8_t opcode;
++ uint8_t flags;
++ __be16 rsvd2;
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t lun[8];
++ __be32 itt; /* Initiator Task Tag */
++ __be32 data_length;
++ __be32 cmdsn;
++ __be32 exp_statsn;
++ uint8_t cdb[16]; /* SCSI Command Block */
++ /* Additional Data (Command Dependent) */
++};
++
++/* Command PDU flags */
++#define ISCSI_FLAG_CMD_FINAL 0x80
++#define ISCSI_FLAG_CMD_READ 0x40
++#define ISCSI_FLAG_CMD_WRITE 0x20
++#define ISCSI_FLAG_CMD_ATTR_MASK 0x07 /* 3 bits */
++
++/* SCSI Command Attribute values */
++#define ISCSI_ATTR_UNTAGGED 0
++#define ISCSI_ATTR_SIMPLE 1
++#define ISCSI_ATTR_ORDERED 2
++#define ISCSI_ATTR_HEAD_OF_QUEUE 3
++#define ISCSI_ATTR_ACA 4
++
++struct iscsi_rlength_ahdr {
++ __be16 ahslength;
++ uint8_t ahstype;
++ uint8_t reserved;
++ __be32 read_length;
++};
++
++/* SCSI Response Header */
++struct iscsi_cmd_rsp {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t response;
++ uint8_t cmd_status;
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t rsvd[8];
++ __be32 itt; /* Initiator Task Tag */
++ __be32 rsvd1;
++ __be32 statsn;
++ __be32 exp_cmdsn;
++ __be32 max_cmdsn;
++ __be32 exp_datasn;
++ __be32 bi_residual_count;
++ __be32 residual_count;
++ /* Response or Sense Data (optional) */
++};
++
++/* Command Response PDU flags */
++#define ISCSI_FLAG_CMD_BIDI_OVERFLOW 0x10
++#define ISCSI_FLAG_CMD_BIDI_UNDERFLOW 0x08
++#define ISCSI_FLAG_CMD_OVERFLOW 0x04
++#define ISCSI_FLAG_CMD_UNDERFLOW 0x02
++
++/* iSCSI Status values.
Valid if Rsp Selector bit is not set */
++#define ISCSI_STATUS_CMD_COMPLETED 0
++#define ISCSI_STATUS_TARGET_FAILURE 1
++#define ISCSI_STATUS_SUBSYS_FAILURE 2
++
++/* Asynchronous Event Header */
++struct iscsi_async {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t rsvd2[2];
++ uint8_t rsvd3;
++ uint8_t dlength[3];
++ uint8_t lun[8];
++ uint8_t rsvd4[8];
++ __be32 statsn;
++ __be32 exp_cmdsn;
++ __be32 max_cmdsn;
++ uint8_t async_event;
++ uint8_t async_vcode;
++ __be16 param1;
++ __be16 param2;
++ __be16 param3;
++ uint8_t rsvd5[4];
++};
++
++/* iSCSI Event Codes */
++#define ISCSI_ASYNC_MSG_SCSI_EVENT 0
++#define ISCSI_ASYNC_MSG_REQUEST_LOGOUT 1
++#define ISCSI_ASYNC_MSG_DROPPING_CONNECTION 2
++#define ISCSI_ASYNC_MSG_DROPPING_ALL_CONNECTIONS 3
++#define ISCSI_ASYNC_MSG_PARAM_NEGOTIATION 4
++#define ISCSI_ASYNC_MSG_VENDOR_SPECIFIC 255
++
++/* NOP-Out Message */
++struct iscsi_nopout {
++ uint8_t opcode;
++ uint8_t flags;
++ __be16 rsvd2;
++ uint8_t rsvd3;
++ uint8_t dlength[3];
++ uint8_t lun[8];
++ __be32 itt; /* Initiator Task Tag */
++ __be32 ttt; /* Target Transfer Tag */
++ __be32 cmdsn;
++ __be32 exp_statsn;
++ uint8_t rsvd4[16];
++};
++
++/* NOP-In Message */
++struct iscsi_nopin {
++ uint8_t opcode;
++ uint8_t flags;
++ __be16 rsvd2;
++ uint8_t rsvd3;
++ uint8_t dlength[3];
++ uint8_t lun[8];
++ __be32 itt; /* Initiator Task Tag */
++ __be32 ttt; /* Target Transfer Tag */
++ __be32 statsn;
++ __be32 exp_cmdsn;
++ __be32 max_cmdsn;
++ uint8_t rsvd4[12];
++};
++
++/* SCSI Task Management Message Header */
++struct iscsi_tm {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t rsvd1[2];
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t lun[8];
++ __be32 itt; /* Initiator Task Tag */
++ __be32 rtt; /* Reference Task Tag */
++ __be32 cmdsn;
++ __be32 exp_statsn;
++ __be32 refcmdsn;
++ __be32 exp_datasn;
++ uint8_t rsvd2[8];
++};
++
++#define ISCSI_FLAG_TM_FUNC_MASK 0x7F
++
++/* Function values */
++#define ISCSI_TM_FUNC_ABORT_TASK 1
++#define
ISCSI_TM_FUNC_ABORT_TASK_SET 2
++#define ISCSI_TM_FUNC_CLEAR_ACA 3
++#define ISCSI_TM_FUNC_CLEAR_TASK_SET 4
++#define ISCSI_TM_FUNC_LOGICAL_UNIT_RESET 5
++#define ISCSI_TM_FUNC_TARGET_WARM_RESET 6
++#define ISCSI_TM_FUNC_TARGET_COLD_RESET 7
++#define ISCSI_TM_FUNC_TASK_REASSIGN 8
++
++/* SCSI Task Management Response Header */
++struct iscsi_tm_rsp {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t response; /* see Response values below */
++ uint8_t qualifier;
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t rsvd2[8];
++ __be32 itt; /* Initiator Task Tag */
++ __be32 rtt; /* Reference Task Tag */
++ __be32 statsn;
++ __be32 exp_cmdsn;
++ __be32 max_cmdsn;
++ uint8_t rsvd3[12];
++};
++
++/* Response values */
++#define ISCSI_TMF_RSP_COMPLETE 0x00
++#define ISCSI_TMF_RSP_NO_TASK 0x01
++#define ISCSI_TMF_RSP_NO_LUN 0x02
++#define ISCSI_TMF_RSP_TASK_ALLEGIANT 0x03
++#define ISCSI_TMF_RSP_NO_FAILOVER 0x04
++#define ISCSI_TMF_RSP_NOT_SUPPORTED 0x05
++#define ISCSI_TMF_RSP_AUTH_FAILED 0x06
++#define ISCSI_TMF_RSP_REJECTED 0xff
++
++/* Ready To Transfer Header */
++struct iscsi_r2t_rsp {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t rsvd2[2];
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t lun[8];
++ __be32 itt; /* Initiator Task Tag */
++ __be32 ttt; /* Target Transfer Tag */
++ __be32 statsn;
++ __be32 exp_cmdsn;
++ __be32 max_cmdsn;
++ __be32 r2tsn;
++ __be32 data_offset;
++ __be32 data_length;
++};
++
++/* SCSI Data Hdr */
++struct iscsi_data {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t rsvd2[2];
++ uint8_t rsvd3;
++ uint8_t dlength[3];
++ uint8_t lun[8];
++ __be32 itt;
++ __be32 ttt;
++ __be32 rsvd4;
++ __be32 exp_statsn;
++ __be32 rsvd5;
++ __be32 datasn;
++ __be32 offset;
++ __be32 rsvd6;
++ /* Payload */
++};
++
++/* SCSI Data Response Hdr */
++struct iscsi_data_rsp {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t rsvd2;
++ uint8_t cmd_status;
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t lun[8];
++ __be32 itt;
++ __be32 ttt;
++ __be32
statsn;
++ __be32 exp_cmdsn;
++ __be32 max_cmdsn;
++ __be32 datasn;
++ __be32 offset;
++ __be32 residual_count;
++};
++
++/* Data Response PDU flags */
++#define ISCSI_FLAG_DATA_ACK 0x40
++#define ISCSI_FLAG_DATA_OVERFLOW 0x04
++#define ISCSI_FLAG_DATA_UNDERFLOW 0x02
++#define ISCSI_FLAG_DATA_STATUS 0x01
++
++/* Text Header */
++struct iscsi_text {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t rsvd2[2];
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t rsvd4[8];
++ __be32 itt;
++ __be32 ttt;
++ __be32 cmdsn;
++ __be32 exp_statsn;
++ uint8_t rsvd5[16];
++ /* Text - key=value pairs */
++};
++
++#define ISCSI_FLAG_TEXT_CONTINUE 0x40
++
++/* Text Response Header */
++struct iscsi_text_rsp {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t rsvd2[2];
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t rsvd4[8];
++ __be32 itt;
++ __be32 ttt;
++ __be32 statsn;
++ __be32 exp_cmdsn;
++ __be32 max_cmdsn;
++ uint8_t rsvd5[12];
++ /* Text Response - key:value pairs */
++};
++
++/* Login Header */
++struct iscsi_login {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t max_version; /* Max. version supported */
++ uint8_t min_version; /* Min. version supported */
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t isid[6]; /* Initiator Session ID */
++ __be16 tsih; /* Target Session Handle */
++ __be32 itt; /* Initiator Task Tag */
++ __be16 cid;
++ __be16 rsvd3;
++ __be32 cmdsn;
++ __be32 exp_statsn;
++ uint8_t rsvd5[16];
++};
++
++/* Login PDU flags */
++#define ISCSI_FLAG_LOGIN_TRANSIT 0x80
++#define ISCSI_FLAG_LOGIN_CONTINUE 0x40
++#define ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK 0x0C /* 2 bits */
++#define ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK 0x03 /* 2 bits */
++
++#define ISCSI_LOGIN_CURRENT_STAGE(flags) \
++ ((flags & ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK) >> 2)
++#define ISCSI_LOGIN_NEXT_STAGE(flags) \
++ (flags & ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK)
++
++/* Login Response Header */
++struct iscsi_login_rsp {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t max_version; /* Max.
version supported */
++ uint8_t active_version; /* Active version */
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t isid[6]; /* Initiator Session ID */
++ __be16 tsih; /* Target Session Handle */
++ __be32 itt; /* Initiator Task Tag */
++ __be32 rsvd3;
++ __be32 statsn;
++ __be32 exp_cmdsn;
++ __be32 max_cmdsn;
++ uint8_t status_class; /* see Login RSP ststus classes below */
++ uint8_t status_detail; /* see Login RSP Status details below */
++ uint8_t rsvd4[10];
++};
++
++/* Login stage (phase) codes for CSG, NSG */
++#define ISCSI_INITIAL_LOGIN_STAGE -1
++#define ISCSI_SECURITY_NEGOTIATION_STAGE 0
++#define ISCSI_OP_PARMS_NEGOTIATION_STAGE 1
++#define ISCSI_FULL_FEATURE_PHASE 3
++
++/* Login Status response classes */
++#define ISCSI_STATUS_CLS_SUCCESS 0x00
++#define ISCSI_STATUS_CLS_REDIRECT 0x01
++#define ISCSI_STATUS_CLS_INITIATOR_ERR 0x02
++#define ISCSI_STATUS_CLS_TARGET_ERR 0x03
++
++/* Login Status response detail codes */
++/* Class-0 (Success) */
++#define ISCSI_LOGIN_STATUS_ACCEPT 0x00
++
++/* Class-1 (Redirection) */
++#define ISCSI_LOGIN_STATUS_TGT_MOVED_TEMP 0x01
++#define ISCSI_LOGIN_STATUS_TGT_MOVED_PERM 0x02
++
++/* Class-2 (Initiator Error) */
++#define ISCSI_LOGIN_STATUS_INIT_ERR 0x00
++#define ISCSI_LOGIN_STATUS_AUTH_FAILED 0x01
++#define ISCSI_LOGIN_STATUS_TGT_FORBIDDEN 0x02
++#define ISCSI_LOGIN_STATUS_TGT_NOT_FOUND 0x03
++#define ISCSI_LOGIN_STATUS_TGT_REMOVED 0x04
++#define ISCSI_LOGIN_STATUS_NO_VERSION 0x05
++#define ISCSI_LOGIN_STATUS_ISID_ERROR 0x06
++#define ISCSI_LOGIN_STATUS_MISSING_FIELDS 0x07
++#define ISCSI_LOGIN_STATUS_CONN_ADD_FAILED 0x08
++#define ISCSI_LOGIN_STATUS_NO_SESSION_TYPE 0x09
++#define ISCSI_LOGIN_STATUS_NO_SESSION 0x0a
++#define ISCSI_LOGIN_STATUS_INVALID_REQUEST 0x0b
++
++/* Class-3 (Target Error) */
++#define ISCSI_LOGIN_STATUS_TARGET_ERROR 0x00
++#define ISCSI_LOGIN_STATUS_SVC_UNAVAILABLE 0x01
++#define ISCSI_LOGIN_STATUS_NO_RESOURCES 0x02
++
++/* Logout Header */
++struct iscsi_logout {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t rsvd1[2];
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t rsvd2[8];
++ __be32 itt; /* Initiator Task Tag */
++ __be16 cid;
++ uint8_t rsvd3[2];
++ __be32 cmdsn;
++ __be32 exp_statsn;
++ uint8_t rsvd4[16];
++};
++
++/* Logout PDU flags */
++#define ISCSI_FLAG_LOGOUT_REASON_MASK 0x7F
++
++/* logout reason_code values */
++
++#define ISCSI_LOGOUT_REASON_CLOSE_SESSION 0
++#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION 1
++#define ISCSI_LOGOUT_REASON_RECOVERY 2
++#define ISCSI_LOGOUT_REASON_AEN_REQUEST 3
++
++/* Logout Response Header */
++struct iscsi_logout_rsp {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t response; /* see Logout response values below */
++ uint8_t rsvd2;
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t rsvd3[8];
++ __be32 itt; /* Initiator Task Tag */
++ __be32 rsvd4;
++ __be32 statsn;
++ __be32 exp_cmdsn;
++ __be32 max_cmdsn;
++ __be32 rsvd5;
++ __be16 t2wait;
++ __be16 t2retain;
++ __be32 rsvd6;
++};
++
++/* logout response status values */
++
++#define ISCSI_LOGOUT_SUCCESS 0
++#define ISCSI_LOGOUT_CID_NOT_FOUND 1
++#define ISCSI_LOGOUT_RECOVERY_UNSUPPORTED 2
++#define ISCSI_LOGOUT_CLEANUP_FAILED 3
++
++/* SNACK Header */
++struct iscsi_snack {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t rsvd2[14];
++ __be32 itt;
++ __be32 begrun;
++ __be32 runlength;
++ __be32 exp_statsn;
++ __be32 rsvd3;
++ __be32 exp_datasn;
++ uint8_t rsvd6[8];
++};
++
++/* SNACK PDU flags */
++#define ISCSI_FLAG_SNACK_TYPE_MASK 0x0F /* 4 bits */
++
++/* Reject Message Header */
++struct iscsi_reject {
++ uint8_t opcode;
++ uint8_t flags;
++ uint8_t reason;
++ uint8_t rsvd2;
++ uint8_t hlength;
++ uint8_t dlength[3];
++ uint8_t rsvd3[8];
++ __be32 ffffffff;
++ uint8_t rsvd4[4];
++ __be32 statsn;
++ __be32 exp_cmdsn;
++ __be32 max_cmdsn;
++ __be32 datasn;
++ uint8_t rsvd5[8];
++ /* Text - Rejected hdr */
++};
++
++/* Reason for Reject */
++#define ISCSI_REASON_CMD_BEFORE_LOGIN 1
++#define ISCSI_REASON_DATA_DIGEST_ERROR 2
++#define
ISCSI_REASON_DATA_SNACK_REJECT 3
++#define ISCSI_REASON_PROTOCOL_ERROR 4
++#define ISCSI_REASON_CMD_NOT_SUPPORTED 5
++#define ISCSI_REASON_IMM_CMD_REJECT 6
++#define ISCSI_REASON_TASK_IN_PROGRESS 7
++#define ISCSI_REASON_INVALID_SNACK 8
++#define ISCSI_REASON_BOOKMARK_INVALID 9
++#define ISCSI_REASON_BOOKMARK_NO_RESOURCES 10
++#define ISCSI_REASON_NEGOTIATION_RESET 11
++
++/* Max. number of Key=Value pairs in a text message */
++#define MAX_KEY_VALUE_PAIRS 8192
++
++/* maximum length for text keys/values */
++#define KEY_MAXLEN 64
++#define VALUE_MAXLEN 255
++#define TARGET_NAME_MAXLEN VALUE_MAXLEN
++
++#define DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH 8192
++
++/************************* RFC 3720 End *****************************/
++
++#endif /* ISCSI_PROTO_H */
diff --git a/kernel_patches/backport/2.6.9_U4/add_iser.patch b/kernel_patches/backport/2.6.9_U4/add_iser.patch
new file mode 100644
index 0000000..0da53d2
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_iser.patch
@@ -0,0 +1,13 @@
+diff -rup linux-2.6.20/drivers/infiniband/ulp/iser/iser_initiator.c linux-2.6.20-backport-rh4-u3/drivers/infiniband/ulp/iser/iser_initiator.c
+--- linux-2.6.20/drivers/infiniband/ulp/iser/iser_initiator.c 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/infiniband/ulp/iser/iser_initiator.c 2007-03-26 11:27:11.000000000 +0200
+@@ -618,7 +618,8 @@ void iser_snd_completion(struct iser_des
+
+ if (resume_tx) {
+ iser_dbg("%ld resuming tx\n",jiffies);
+- scsi_queue_work(conn->session->host, &conn->xmitwork);
++ //scsi_queue_work(conn->session->host, &conn->xmitwork);
++ schedule_work(&conn->xmitwork);
+ }
+
+ if (tx_desc->type == ISCSI_TX_CONTROL) {
diff --git a/kernel_patches/backport/2.6.9_U4/add_memory_h.patch b/kernel_patches/backport/2.6.9_U4/add_memory_h.patch
new file mode 100644
index 0000000..5daad2e
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_memory_h.patch
@@ -0,0 +1,93 @@
+diff -rupN
linux-2.6.20-like-rh4/include/linux/memory.h linux-2.6.20/include/linux/memory.h
+--- linux-2.6.20-like-rh4/include/linux/memory.h 1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.20/include/linux/memory.h 2007-02-04 20:44:54.000000000 +0200
+@@ -0,0 +1,89 @@
++/*
++ * include/linux/memory.h - generic memory definition
++ *
++ * This is mainly for topological representation. We define the
++ * basic "struct memory_block" here, which can be embedded in per-arch
++ * definitions or NUMA information.
++ *
++ * Basic handling of the devices is done in drivers/base/memory.c
++ * and system devices are handled in drivers/base/sys.c.
++ *
++ * Memory block are exported via sysfs in the class/memory/devices/
++ * directory.
++ *
++ */
++#ifndef _LINUX_MEMORY_H_
++#define _LINUX_MEMORY_H_
++
++#include
++#include
++#include
++
++#include
++
++struct memory_block {
++ unsigned long phys_index;
++ unsigned long state;
++ /*
++ * This serializes all state change requests. It isn't
++ * held during creation because the control files are
++ * created long after the critical areas during
++ * initialization.
++ */
++ struct semaphore state_sem;
++ int phys_device; /* to which fru does this belong? */
++ void *hw; /* optional pointer to fw/hw data */
++ int (*phys_callback)(struct memory_block *);
++ struct sys_device sysdev;
++};
++
++/* These states are exposed to userspace as text strings in sysfs */
++#define MEM_ONLINE (1<<0) /* exposed to userspace */
++#define MEM_GOING_OFFLINE (1<<1) /* exposed to userspace */
++#define MEM_OFFLINE (1<<2) /* exposed to userspace */
++
++/*
++ * All of these states are currently kernel-internal for notifying
++ * kernel components and architectures.
++ *
++ * For MEM_MAPPING_INVALID, all notifier chains with priority >0
++ * are called before pfn_to_page() becomes invalid. The priority=0
++ * entry is reserved for the function that actually makes
++ * pfn_to_page() stop working.
Any notifiers that want to be called
++ * after that should have priority <0.
++ */
++#define MEM_MAPPING_INVALID (1<<3)
++
++struct notifier_block;
++struct mem_section;
++
++#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE
++static inline int memory_dev_init(void)
++{
++ return 0;
++}
++static inline int register_memory_notifier(struct notifier_block *nb)
++{
++ return 0;
++}
++static inline void unregister_memory_notifier(struct notifier_block *nb)
++{
++}
++#else
++extern int register_new_memory(struct mem_section *);
++extern int unregister_memory_section(struct mem_section *);
++extern int memory_dev_init(void);
++extern int remove_memory_block(unsigned long, struct mem_section *, int);
++
++#define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<dd_data;
+
+- crypto_hash_digest(&tcp_conn->tx_hash, &buf->sg, buf->sg.length, crc);
++ crypto_digest_digest(tcp_conn->tx_tfm, &buf->sg, 1, crc);
+ buf->sg.length = tcp_conn->hdr_size;
+ }
+
+@@ -419,7 +419,7 @@ iscsi_r2t_rsp(struct iscsi_conn *conn, s
+ tcp_ctask->xmstate |= XMSTATE_SOL_HDR;
+ list_move_tail(&ctask->running, &conn->xmitqueue);
+
+- scsi_queue_work(session->host, &conn->xmitwork);
++ schedule_work(&conn->xmitwork);
+ conn->r2t_pdus_cnt++;
+ spin_unlock(&session->lock);
+
+@@ -468,8 +468,7 @@ iscsi_tcp_hdr_recv(struct iscsi_conn *co
+
+ sg_init_one(&sg, (u8 *)hdr,
+ sizeof(struct iscsi_hdr) + ahslen);
+- crypto_hash_digest(&tcp_conn->rx_hash, &sg, sg.length,
+- (u8 *)&cdgst);
++ crypto_digest_digest(tcp_conn->rx_tfm, &sg, 1, (u8 *)&cdgst);
+ rdgst = *(uint32_t*)((char*)hdr + sizeof(struct iscsi_hdr) +
+ ahslen);
+ if (cdgst != rdgst) {
+@@ -676,7 +675,7 @@ iscsi_tcp_copy(struct iscsi_conn *conn,
+ }
+
+ static inline void
+-partial_sg_digest_update(struct hash_desc *desc, struct scatterlist *sg,
++partial_sg_digest_update(struct crypto_tfm *tfm, struct scatterlist *sg,
+ int offset, int length)
+ {
+ struct scatterlist temp;
+@@ -684,7 +683,7 @@ partial_sg_digest_update(struct hash_des
+ memcpy(&temp, sg, sizeof(struct
scatterlist));
+ temp.offset = offset;
+ temp.length = length;
+- crypto_hash_update(desc, &temp, length);
++ crypto_digest_update(tfm, &temp, 1);
+ }
+
+ static void
+@@ -693,7 +692,7 @@ iscsi_recv_digest_update(struct iscsi_tc
+ struct scatterlist tmp;
+
+ sg_init_one(&tmp, buf, len);
+- crypto_hash_update(&tcp_conn->rx_hash, &tmp, len);
++ crypto_digest_update(tcp_conn->rx_tfm, &tmp, 1);
+ }
+
+ static int iscsi_scsi_data_in(struct iscsi_conn *conn)
+@@ -747,12 +746,12 @@ static int iscsi_scsi_data_in(struct isc
+ if (!rc) {
+ if (conn->datadgst_en) {
+ if (!offset)
+- crypto_hash_update(
+- &tcp_conn->rx_hash,
+- &sg[i], sg[i].length);
++ crypto_digest_update(
++ tcp_conn->rx_tfm,
++ &sg[i], 1);
+ else
+ partial_sg_digest_update(
+- &tcp_conn->rx_hash,
++ tcp_conn->rx_tfm,
+ &sg[i],
+ sg[i].offset + offset,
+ sg[i].length - offset);
+@@ -766,10 +765,9 @@ static int iscsi_scsi_data_in(struct isc
+ /*
+ * data-in is complete, but buffer not...
+ */
+- partial_sg_digest_update(&tcp_conn->rx_hash,
+- &sg[i],
+- sg[i].offset,
+- sg[i].length-rc);
++ partial_sg_digest_update(tcp_conn->rx_tfm,
++ &sg[i],
++ sg[i].offset, sg[i].length-rc);
+ rc = 0;
+ break;
+ }
+@@ -887,7 +885,7 @@ more:
+ rc = iscsi_tcp_hdr_recv(conn);
+ if (!rc && tcp_conn->in.datalen) {
+ if (conn->datadgst_en)
+- crypto_hash_init(&tcp_conn->rx_hash);
++ crypto_digest_init(tcp_conn->rx_tfm);
+ tcp_conn->in_progress = IN_PROGRESS_DATA_RECV;
+ } else if (rc) {
+ iscsi_conn_failure(conn, rc);
+@@ -944,11 +942,11 @@ more:
+ tcp_conn->in.padding);
+ memset(pad, 0, tcp_conn->in.padding);
+ sg_init_one(&sg, pad, tcp_conn->in.padding);
+- crypto_hash_update(&tcp_conn->rx_hash,
+- &sg, sg.length);
++ crypto_digest_update(tcp_conn->rx_tfm,
++ &sg, 1);
+ }
+- crypto_hash_final(&tcp_conn->rx_hash,
+- (u8 *) &tcp_conn->in.datadgst);
++ crypto_digest_final(tcp_conn->rx_tfm,
++ (u8 *) &tcp_conn->in.datadgst);
+ debug_tcp("rx digest 0x%x\n", tcp_conn->in.datadgst);
+ tcp_conn->in_progress =
IN_PROGRESS_DDIGEST_RECV;
+ tcp_conn->data_copied = 0;
+@@ -1043,7 +1041,7 @@ iscsi_write_space(struct sock *sk)
+
+ tcp_conn->old_write_space(sk);
+ debug_tcp("iscsi_write_space: cid %d\n", conn->id);
+- scsi_queue_work(conn->session->host, &conn->xmitwork);
++ schedule_work(&conn->xmitwork);
+ }
+
+ static void
+@@ -1193,7 +1191,7 @@ static inline void
+ iscsi_data_digest_init(struct iscsi_tcp_conn *tcp_conn,
+ struct iscsi_tcp_cmd_task *tcp_ctask)
+ {
+- crypto_hash_init(&tcp_conn->tx_hash);
++ crypto_digest_init(tcp_conn->tx_tfm);
+ tcp_ctask->digest_count = 4;
+ }
+
+@@ -1449,9 +1447,8 @@ iscsi_send_padding(struct iscsi_conn *co
+ iscsi_buf_init_iov(&tcp_ctask->sendbuf, (char*)&tcp_ctask->pad,
+ tcp_ctask->pad_count);
+ if (conn->datadgst_en)
+- crypto_hash_update(&tcp_conn->tx_hash,
+- &tcp_ctask->sendbuf.sg,
+- tcp_ctask->sendbuf.sg.length);
++ crypto_digest_update(tcp_conn->tx_tfm,
++ &tcp_ctask->sendbuf.sg, 1);
+ } else if (!(tcp_ctask->xmstate & XMSTATE_W_RESEND_PAD))
+ return 0;
+
+@@ -1483,7 +1480,7 @@ iscsi_send_digest(struct iscsi_conn *con
+ tcp_conn = conn->dd_data;
+
+ if (!(tcp_ctask->xmstate & XMSTATE_W_RESEND_DATA_DIGEST)) {
+- crypto_hash_final(&tcp_conn->tx_hash, (u8*)digest);
++ crypto_digest_final(tcp_conn->tx_tfm, (u8*)digest);
+ iscsi_buf_init_iov(buf, (char*)digest, 4);
+ }
+ tcp_ctask->xmstate &= ~XMSTATE_W_RESEND_DATA_DIGEST;
+@@ -1517,7 +1514,7 @@ iscsi_send_data(struct iscsi_cmd_task *c
+ rc = iscsi_sendpage(conn, sendbuf, count, &buf_sent);
+ *sent = *sent + buf_sent;
+ if (buf_sent && conn->datadgst_en)
+- partial_sg_digest_update(&tcp_conn->tx_hash,
++ partial_sg_digest_update(tcp_conn->tx_tfm,
+ &sendbuf->sg, sendbuf->sg.offset + offset,
+ buf_sent);
+ if (!iscsi_buf_left(sendbuf) && *sg != tcp_ctask->bad_sg) {
+@@ -1774,22 +1771,18 @@ iscsi_tcp_conn_create(struct iscsi_cls_s
+ /* initial operational parameters */
+ tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
+
+- tcp_conn->tx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+-
CRYPTO_ALG_ASYNC);
+- tcp_conn->tx_hash.flags = 0;
+- if (IS_ERR(tcp_conn->tx_hash.tfm))
++ tcp_conn->tx_tfm = crypto_alloc_tfm("crc32c", 0);
++ if (!tcp_conn->tx_tfm)
+ goto free_tcp_conn;
+
+- tcp_conn->rx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+- CRYPTO_ALG_ASYNC);
+- tcp_conn->rx_hash.flags = 0;
+- if (IS_ERR(tcp_conn->rx_hash.tfm))
++ tcp_conn->rx_tfm = crypto_alloc_tfm("crc32c", 0);
++ if (!tcp_conn->rx_tfm)
+ goto free_tx_tfm;
+
+ return cls_conn;
+
+ free_tx_tfm:
+- crypto_free_hash(tcp_conn->tx_hash.tfm);
++ crypto_free_tfm(tcp_conn->tx_tfm);
+ free_tcp_conn:
+ kfree(tcp_conn);
+ tcp_conn_alloc_fail:
+@@ -1823,10 +1816,10 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_
+ iscsi_tcp_release_conn(conn);
+ iscsi_conn_teardown(cls_conn);
+
+- if (tcp_conn->tx_hash.tfm)
+- crypto_free_hash(tcp_conn->tx_hash.tfm);
+- if (tcp_conn->rx_hash.tfm)
+- crypto_free_hash(tcp_conn->rx_hash.tfm);
++ if (tcp_conn->tx_tfm)
++ crypto_free_tfm(tcp_conn->tx_tfm);
++ if (tcp_conn->rx_tfm)
++ crypto_free_tfm(tcp_conn->rx_tfm);
+
+ kfree(tcp_conn);
+ }
+@@ -2017,7 +2010,7 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
+ {
+ struct iscsi_conn *conn = cls_conn->dd_data;
+ struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+- struct inet_sock *inet;
++ struct inet_opt *inet;
+ struct ipv6_pinfo *np;
+ struct sock *sk;
+ int len;
+@@ -2044,11 +2037,13 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
+ sk = tcp_conn->sock->sk;
+ if (sk->sk_family == PF_INET) {
+ inet = inet_sk(sk);
+- len = sprintf(buf, NIPQUAD_FMT "\n",
++ len = sprintf(buf, "%u.%u.%u.%u\n",
+ NIPQUAD(inet->daddr));
+ } else {
+ np = inet6_sk(sk);
+- len = sprintf(buf, NIP6_FMT "\n", NIP6(np->daddr));
++ len = sprintf(buf,
++ "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
++ NIP6(np->daddr));
+ }
+ mutex_unlock(&conn->xmitmutex);
+ break;
+@@ -2135,7 +2130,6 @@ static void iscsi_tcp_session_destroy(st
+ static struct scsi_host_template iscsi_sht = {
+ .name = "iSCSI Initiator over TCP/IP",
+ .queuecommand = iscsi_queuecommand,
+-
.change_queue_depth = iscsi_change_queue_depth,
+ .can_queue = ISCSI_XMIT_CMDS_MAX - 1,
+ .sg_tablesize = ISCSI_SG_TABLESIZE,
+ .cmd_per_lun = ISCSI_DEF_CMD_PER_LUN,
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.h linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.h
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.h 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.h 2007-04-01 13:11:55.000000000 +0300
+@@ -49,7 +49,6 @@
+ #define ISCSI_SG_TABLESIZE SG_ALL
+ #define ISCSI_TCP_MAX_CMD_LEN 16
+
+-struct crypto_hash;
+ struct socket;
+
+ /* Socket connection recieve helper */
+@@ -93,8 +92,8 @@ struct iscsi_tcp_conn {
+ void (*old_write_space)(struct sock *);
+
+ /* data and header digests */
+- struct hash_desc tx_hash; /* CRC32C (Tx) */
+- struct hash_desc rx_hash; /* CRC32C (Rx) */
++ struct crypto_tfm *tx_tfm; /* CRC32C (Tx) */
++ struct crypto_tfm *rx_tfm; /* CRC32C (Rx) */
+
+ /* MIB custom statistics */
+ uint32_t sendpage_failures_cnt;
+diff -rup linux-2.6.20/drivers/scsi/libiscsi.c linux-2.6.20-backport-rh4-u3/drivers/scsi/libiscsi.c
+--- linux-2.6.20/drivers/scsi/libiscsi.c 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/libiscsi.c 2007-04-01 13:15:57.000000000 +0300
+@@ -23,6 +23,7 @@
+ */
+ #include
+ #include
++#include
+ #include
+ #include
+ #include
+@@ -831,7 +832,7 @@ int iscsi_queuecommand(struct scsi_cmnd
+ session->cmdsn, session->max_cmdsn - session->exp_cmdsn + 1);
+ spin_unlock(&session->lock);
+
+- scsi_queue_work(host, &conn->xmitwork);
++ schedule_work(&conn->xmitwork);
+ return 0;
+
+ reject:
+@@ -932,7 +933,7 @@ iscsi_conn_send_generic(struct iscsi_con
+ else
+ __kfifo_put(conn->mgmtqueue, (void*)&mtask, sizeof(void*));
+
+- scsi_queue_work(session->host, &conn->xmitwork);
++ schedule_work(&conn->xmitwork);
+ return 0;
+ }
+
+@@ -1370,7 +1371,6 @@ iscsi_session_setup(struct iscsi_transpo
+ shost->max_lun = iscsit->max_lun;
+ shost->max_cmd_len = iscsit->max_cmd_len;
+
shost->transportt = scsit;
+- shost->transportt->create_work_queue = 1;
+ *hostno = shost->host_no;
+
+ session = iscsi_hostdata(shost->hostdata);
+diff -rup linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c linux-2.6.20-backport-rh4-u3/drivers/scsi/scsi_transport_iscsi.c
+--- linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/scsi_transport_iscsi.c 2007-04-01 13:18:33.000000000 +0300
+@@ -29,11 +29,15 @@
+ #include
+ #include
+ #include
++#include
++#include
+
+ #define ISCSI_SESSION_ATTRS 11
+ #define ISCSI_CONN_ATTRS 11
+ #define ISCSI_HOST_ATTRS 0
+-#define ISCSI_TRANSPORT_VERSION "2.0-724"
++#define ISCSI_TRANSPORT_VERSION "2.0-754"
++
++#define SCAN_WILD_CARD ~0
+
+ struct iscsi_internal {
+ int daemon_pid;
+@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
+ #define cdev_to_iscsi_internal(_cdev) \
+ container_of(_cdev, struct iscsi_internal, cdev)
+
++extern int attribute_container_init(void);
++
+ static void iscsi_transport_release(struct class_device *cdev)
+ {
+ struct iscsi_internal *priv = cdev_to_iscsi_internal(cdev);
+@@ -80,6 +86,17 @@ static struct class iscsi_transport_clas
+ .release = iscsi_transport_release,
+ };
+
++static void iscsi_host_class_release(struct class_device *class_dev)
++{
++ struct Scsi_Host *shost = transport_class_to_shost(class_dev);
++ put_device(&shost->shost_gendev);
++}
++
++struct class iscsi_host_class = {
++ .name = "iscsi_host",
++ .release = iscsi_host_class_release,
++};
++
+ static ssize_t
+ show_transport_handle(struct class_device *cdev, char *buf)
+ {
+@@ -115,10 +132,8 @@ static struct attribute_group iscsi_tran
+ .attrs = iscsi_transport_attrs,
+ };
+
+-static int iscsi_setup_host(struct transport_container *tc, struct device *dev,
+- struct class_device *cdev)
++static int iscsi_setup_host(struct Scsi_Host *shost)
+ {
+- struct Scsi_Host *shost = dev_to_shost(dev);
+ struct iscsi_host *ihost = shost->shost_data;
+
+
memset(ihost, 0, sizeof(*ihost));
+@@ -127,12 +142,6 @@ static int iscsi_setup_host(struct trans
+ return 0;
+ }
+
+-static DECLARE_TRANSPORT_CLASS(iscsi_host_class,
+- "iscsi_host",
+- iscsi_setup_host,
+- NULL,
+- NULL);
+-
+ static DECLARE_TRANSPORT_CLASS(iscsi_session_class,
+ "iscsi_session",
+ NULL,
+@@ -216,28 +225,10 @@ static int iscsi_is_session_dev(const st
+ return dev->release == iscsi_session_release;
+ }
+
+-static int iscsi_user_scan(struct Scsi_Host *shost, uint channel,
+- uint id, uint lun)
+-{
+- struct iscsi_host *ihost = shost->shost_data;
+- struct iscsi_cls_session *session;
+-
+- mutex_lock(&ihost->mutex);
+- list_for_each_entry(session, &ihost->sessions, host_list) {
+- if ((channel == SCAN_WILD_CARD || channel == 0) &&
+- (id == SCAN_WILD_CARD || id == session->target_id))
+- scsi_scan_target(&session->dev, 0,
+- session->target_id, lun, 1);
+- }
+- mutex_unlock(&ihost->mutex);
+-
+- return 0;
+-}
+-
+-static void session_recovery_timedout(struct work_struct *work)
++static void session_recovery_timedout(void *data)
+ {
+ struct iscsi_cls_session *session =
+- container_of(work, struct iscsi_cls_session,
++ container_of(data, struct iscsi_cls_session,
+ recovery_work.work);
+
+ dev_printk(KERN_INFO, &session->dev, "iscsi: session recovery timed "
+@@ -362,8 +353,6 @@ void iscsi_remove_session(struct iscsi_c
+ list_del(&session->host_list);
+ mutex_unlock(&ihost->mutex);
+
+- scsi_remove_target(&session->dev);
+-
+ transport_unregister_device(&session->dev);
+ device_del(&session->dev);
+ }
+@@ -452,6 +441,7 @@ iscsi_create_conn(struct iscsi_cls_sessi
+ goto release_parent_ref;
+ }
+ transport_register_device(&conn->dev);
++
+ return conn;
+
+ release_parent_ref:
+@@ -606,9 +596,8 @@ iscsi_if_send_reply(int pid, int seq, in
+ struct nlmsghdr *nlh;
+ int len = NLMSG_SPACE(size);
+ int flags = multi ? NLM_F_MULTI : 0;
+- int t = done ?
NLMSG_DONE : type;
+
+- skb = alloc_skb(len, GFP_ATOMIC);
++ skb = alloc_skb(len, GFP_KERNEL);
+ /*
+ * FIXME:
+ * user is supposed to react on iferror == -ENOMEM;
+@@ -649,7 +638,7 @@ iscsi_if_get_stats(struct iscsi_transpor
+ do {
+ int actual_size;
+
+- skbstat = alloc_skb(len, GFP_ATOMIC);
++ skbstat = alloc_skb(len, GFP_KERNEL);
+ if (!skbstat) {
+ dev_printk(KERN_ERR, &conn->dev, "iscsi: can not "
+ "deliver stats: OOM\n");
+@@ -1269,24 +1258,6 @@ static int iscsi_conn_match(struct attri
+ return &priv->conn_cont.ac == cont;
+ }
+
+-static int iscsi_host_match(struct attribute_container *cont,
+- struct device *dev)
+-{
+- struct Scsi_Host *shost;
+- struct iscsi_internal *priv;
+-
+- if (!scsi_is_host_device(dev))
+- return 0;
+-
+- shost = dev_to_shost(dev);
+- if (!shost->transportt ||
+- shost->transportt->host_attrs.ac.class != &iscsi_host_class.class)
+- return 0;
+-
+- priv = to_iscsi_internal(shost->transportt);
+- return &priv->t.host_attrs.ac == cont;
+-}
+-
+ struct scsi_transport_template *
+ iscsi_register_transport(struct iscsi_transport *tt)
+ {
+@@ -1306,7 +1277,6 @@ iscsi_register_transport(struct iscsi_tr
+ INIT_LIST_HEAD(&priv->list);
+ priv->daemon_pid = -1;
+ priv->iscsi_transport = tt;
+- priv->t.user_scan = iscsi_user_scan;
+
+ priv->cdev.class = &iscsi_transport_class;
+ snprintf(priv->cdev.class_id, BUS_ID_SIZE, "%s", tt->name);
+@@ -1319,12 +1289,11 @@ iscsi_register_transport(struct iscsi_tr
+ goto unregister_cdev;
+
+ /* host parameters */
+- priv->t.host_attrs.ac.attrs = &priv->host_attrs[0];
+- priv->t.host_attrs.ac.class = &iscsi_host_class.class;
+- priv->t.host_attrs.ac.match = iscsi_host_match;
++
++ priv->t.host_attrs = &priv->host_attrs[0];
++ priv->t.host_class = &iscsi_host_class;
++ priv->t.host_setup = iscsi_setup_host;
+ priv->t.host_size = sizeof(struct iscsi_host);
+- priv->host_attrs[0] = NULL;
+- transport_container_register(&priv->t.host_attrs);
+
+ /* connection parameters */
+ priv->conn_cont.ac.attrs =
&priv->conn_attrs[0];
+@@ -1402,7 +1371,6 @@ int iscsi_unregister_transport(struct is
+
+ transport_container_unregister(&priv->conn_cont);
+ transport_container_unregister(&priv->session_cont);
+- transport_container_unregister(&priv->t.host_attrs);
+
+ sysfs_remove_group(&priv->cdev.kobj, &iscsi_transport_group);
+ class_device_unregister(&priv->cdev);
+@@ -1419,6 +1387,8 @@ static __init int iscsi_transport_init(v
+ printk(KERN_INFO "Loading iSCSI transport class v%s.\n",
+ ISCSI_TRANSPORT_VERSION);
+
++ attribute_container_init();
++
+ err = class_register(&iscsi_transport_class);
+ if (err)
+ return err;
diff --git a/kernel_patches/backport/2.6.9_U4/add_open_iscsi_h.patch b/kernel_patches/backport/2.6.9_U4/add_open_iscsi_h.patch
new file mode 100644
index 0000000..6dd4429
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_open_iscsi_h.patch
@@ -0,0 +1,60 @@
+diff -rupN linux-2.6.20-rc7/include/scsi/iscsi_compat.h linux-2.6.9/include/scsi/iscsi_compat.h
+--- linux-2.6.20-rc7/include/scsi/iscsi_compat.h 1970-01-01 02:00:00.000000000 +0200
++++ linux-2.6.9/include/scsi/iscsi_compat.h 2007-02-08 08:45:39.000000000 +0200
+@@ -0,0 +1,16 @@
++#ifndef ISCSI_COMPAT
++#define ISCSI_COMPAT
++
++#include
++#include
++#include
++
++#define __nlmsg_put(skb, daemon_pid, seq, type, len, flags) \
++ __nlmsg_put(skb, daemon_pid, 0, 0, len)
++
++#define netlink_kernel_create(uint, groups, input, mod) \
++ netlink_kernel_create(uint, input)
++
++#define gfp_t unsigned
++
++#endif /* ISCSI_COMPAT */
+diff -rupN linux-2.6.20-rc7/include/scsi/iscsi_if.h linux-2.6.9/include/scsi/iscsi_if.h
+--- linux-2.6.20-rc7/include/scsi/iscsi_if.h 2006-11-29 23:57:37.000000000 +0200
++++ linux-2.6.9/include/scsi/iscsi_if.h 2007-02-04 12:50:15.000000000 +0200
+@@ -277,7 +277,6 @@ enum iscsi_param {
+ * These flags describes reason of stop_conn() call
+ */
+ #define STOP_CONN_TERM 0x1
+-#define STOP_CONN_SUSPEND 0x2
+ #define STOP_CONN_RECOVER 0x3
+
+ #define ISCSI_STATS_CUSTOM_MAX 32
+diff
-rupN linux-2.6.20-rc7/include/scsi/libiscsi.h linux-2.6.9/include/scsi/libiscsi.h +--- linux-2.6.20-rc7/include/scsi/libiscsi.h 2007-02-07 11:10:56.000000000 +0200 ++++ linux-2.6.9/include/scsi/libiscsi.h 2007-02-07 15:51:59.000000000 +0200 +@@ -25,10 +25,9 @@ + + #include + #include +-#include +-#include + #include + #include ++#include + + struct scsi_transport_template; + struct scsi_device; +diff -rupN linux-2.6.20-rc7/include/scsi/scsi_transport_iscsi.h linux-2.6.9/include/scsi/scsi_transport_iscsi.h +--- linux-2.6.20-rc7/include/scsi/scsi_transport_iscsi.h 2007-02-07 11:10:56.000000000 +0200 ++++ linux-2.6.9/include/scsi/scsi_transport_iscsi.h 2007-02-07 15:52:50.000000000 +0200 +@@ -24,7 +24,9 @@ + #define SCSI_TRANSPORT_ISCSI_H + + #include +-#include ++#include "iscsi_if.h" ++#include "iscsi_compat.h" ++//#include <../drivers/scsi/transport_class.h> + + struct scsi_transport_template; + struct iscsi_transport; diff --git a/kernel_patches/backport/2.6.9_U4/add_transport_class_h.patch b/kernel_patches/backport/2.6.9_U4/add_transport_class_h.patch new file mode 100644 index 0000000..f2425e0 --- /dev/null +++ b/kernel_patches/backport/2.6.9_U4/add_transport_class_h.patch @@ -0,0 +1,104 @@ +diff -rupN linux-2.6.20-like-rh4/include/linux/transport_class.h linux-2.6.20/include/linux/transport_class.h +--- linux-2.6.20-like-rh4/include/linux/transport_class.h 1970-01-01 02:00:00.000000000 +0200 ++++ linux-2.6.20/include/linux/transport_class.h 2007-02-04 20:44:54.000000000 +0200 +@@ -0,0 +1,100 @@ ++/* ++ * transport_class.h - a generic container for all transport classes ++ * ++ * Copyright (c) 2005 - James Bottomley ++ * ++ * This file is licensed under GPLv2 ++ */ ++ ++#ifndef _TRANSPORT_CLASS_H_ ++#define _TRANSPORT_CLASS_H_ ++ ++#include ++#include ++ ++struct transport_container; ++ ++struct transport_class { ++ struct class class; ++ int (*setup)(struct transport_container *, struct device *, ++ struct class_device *); ++ int (*configure)(struct 
transport_container *, struct device *, ++ struct class_device *); ++ int (*remove)(struct transport_container *, struct device *, ++ struct class_device *); ++}; ++ ++#define DECLARE_TRANSPORT_CLASS(cls, nm, su, rm, cfg) \ ++struct transport_class cls = { \ ++ .class = { \ ++ .name = nm, \ ++ }, \ ++ .setup = su, \ ++ .remove = rm, \ ++ .configure = cfg, \ ++} ++ ++ ++struct anon_transport_class { ++ struct transport_class tclass; ++ struct attribute_container container; ++}; ++ ++#define DECLARE_ANON_TRANSPORT_CLASS(cls, mtch, cfg) \ ++struct anon_transport_class cls = { \ ++ .tclass = { \ ++ .configure = cfg, \ ++ }, \ ++ . container = { \ ++ .match = mtch, \ ++ }, \ ++} ++ ++#define class_to_transport_class(x) \ ++ container_of(x, struct transport_class, class) ++ ++struct transport_container { ++ struct attribute_container ac; ++ struct attribute_group *statistics; ++}; ++ ++#define attribute_container_to_transport_container(x) \ ++ container_of(x, struct transport_container, ac) ++ ++void transport_remove_device(struct device *); ++void transport_add_device(struct device *); ++void transport_setup_device(struct device *); ++void transport_configure_device(struct device *); ++void transport_destroy_device(struct device *); ++ ++static inline void ++transport_register_device(struct device *dev) ++{ ++ transport_setup_device(dev); ++ transport_add_device(dev); ++} ++ ++static inline void ++transport_unregister_device(struct device *dev) ++{ ++ transport_remove_device(dev); ++ transport_destroy_device(dev); ++} ++ ++static inline int transport_container_register(struct transport_container *tc) ++{ ++ return attribute_container_register(&tc->ac); ++} ++ ++static inline int transport_container_unregister(struct transport_container *tc) ++{ ++ return attribute_container_unregister(&tc->ac); ++} ++ ++int transport_class_register(struct transport_class *); ++int anon_transport_class_register(struct anon_transport_class *); ++void transport_class_unregister(struct 
transport_class *); ++void anon_transport_class_unregister(struct anon_transport_class *); ++ ++ ++#endif diff --git a/kernel_patches/backport/2.6.9_U4/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U4/fix_inclusion_order_iscsi_iser.patch new file mode 100644 index 0000000..3c2a969 --- /dev/null +++ b/kernel_patches/backport/2.6.9_U4/fix_inclusion_order_iscsi_iser.patch @@ -0,0 +1,13 @@ +--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:13:43.000000000 +0200 ++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:14:31.000000000 +0200 +@@ -70,9 +70,8 @@ + #include + #include + #include +-#include +- + #include "iscsi_iser.h" ++#include + + static unsigned int iscsi_max_lun = 512; + module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO); diff --git a/kernel_patches/backport/2.6.9_U4/linux_stuff_to_2_6_17.patch b/kernel_patches/backport/2.6.9_U4/linux_stuff_to_2_6_17.patch index e84b964..52c0136 100644 --- a/kernel_patches/backport/2.6.9_U4/linux_stuff_to_2_6_17.patch +++ b/kernel_patches/backport/2.6.9_U4/linux_stuff_to_2_6_17.patch @@ -19,6 +19,62 @@ index 0000000..58cf933 +++ b/drivers/infiniband/core/kfifo.c @@ -0,0 +1 @@ +#include "src/kfifo.c" +diff --git a/drivers/infiniband/core/init.c b/drivers/infiniband/core/init.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/init.c +@@ -0,0 +1 @@ ++#include "src/init.c" +diff --git a/drivers/infiniband/core/attribute_container.c b/drivers/infiniband/core/attribute_container.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/attribute_container.c +@@ -0,0 +1 @@ ++#include "src/attribute_container.c" +diff --git a/drivers/infiniband/core/transport_class.c b/drivers/infiniband/core/transport_class.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/transport_class.c +@@ -0,0 +1 @@ ++#include "src/transport_class.c" 
+diff --git a/drivers/infiniband/core/klist.c b/drivers/infiniband/core/klist.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/klist.c +@@ -0,0 +1 @@ ++#include "src/klist.c" +diff --git a/drivers/infiniband/core/scsi.c b/drivers/infiniband/core/scsi.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/scsi.c +@@ -0,0 +1 @@ ++#include "src/scsi.c" +diff --git a/drivers/infiniband/core/scsi_lib.c b/drivers/infiniband/core/scsi_lib.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/scsi_lib.c +@@ -0,0 +1 @@ ++#include "src/scsi_lib.c" +diff --git a/drivers/infiniband/core/scsi_scan.c b/drivers/infiniband/core/scsi_scan.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/scsi_scan.c +@@ -0,0 +1 @@ ++#include "src/scsi_scan.c" +diff --git a/drivers/infiniband/core/kref_new.c b/drivers/infiniband/core/kref_new.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/infiniband/core/kref_new.c +@@ -0,0 +1 @@ ++#include "src/kref_new.c" diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 50fb1cd..456bfd0 100644 --- a/drivers/infiniband/core/Makefile @@ -28,4 +84,4 @@ index 50fb1cd..456bfd0 100644 ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ uverbs_marshall.o + -+ib_core-y += genalloc.o netevent.o kfifo.o ++ib_core-y += genalloc.o netevent.o kfifo.o scsi.o scsi_lib.o scsi_scan.o init.o attribute_container.o transport_class.o klist.o kref_new.o diff --git a/kernel_patches/backport/2.6.9_U4/netlink-01-add_netlink_h.patch b/kernel_patches/backport/2.6.9_U4/netlink-01-add_netlink_h.patch new file mode 100644 index 0000000..cc071ef --- /dev/null +++ b/kernel_patches/backport/2.6.9_U4/netlink-01-add_netlink_h.patch @@ -0,0 +1,247 @@ +diff -rupN linux-2.6.20-like-rh4/include/linux/netlink.h linux-2.6.20/include/linux/netlink.h +--- 
linux-2.6.20-like-rh4/include/linux/netlink.h 1970-01-01 02:00:00.000000000 +0200 ++++ linux-2.6.20/include/linux/netlink.h 2007-02-04 20:44:54.000000000 +0200 +@@ -0,0 +1,243 @@ ++#ifndef __LINUX_NETLINK_H ++#define __LINUX_NETLINK_H ++ ++#include /* for sa_family_t */ ++#include ++ ++#define NETLINK_ROUTE 0 /* Routing/device hook */ ++#define NETLINK_UNUSED 1 /* Unused number */ ++#define NETLINK_USERSOCK 2 /* Reserved for user mode socket protocols */ ++#define NETLINK_FIREWALL 3 /* Firewalling hook */ ++#define NETLINK_INET_DIAG 4 /* INET socket monitoring */ ++#define NETLINK_NFLOG 5 /* netfilter/iptables ULOG */ ++#define NETLINK_XFRM 6 /* ipsec */ ++#define NETLINK_SELINUX 7 /* SELinux event notifications */ ++#define NETLINK_ISCSI 8 /* Open-iSCSI */ ++#define NETLINK_AUDIT 9 /* auditing */ ++#define NETLINK_FIB_LOOKUP 10 ++#define NETLINK_CONNECTOR 11 ++#define NETLINK_NETFILTER 12 /* netfilter subsystem */ ++#define NETLINK_IP6_FW 13 ++#define NETLINK_DNRTMSG 14 /* DECnet routing messages */ ++#define NETLINK_KOBJECT_UEVENT 15 /* Kernel messages to userspace */ ++#define NETLINK_GENERIC 16 ++/* leave room for NETLINK_DM (DM Events) */ ++#define NETLINK_SCSITRANSPORT 18 /* SCSI Transports */ ++ ++#define MAX_LINKS 32 ++ ++struct sockaddr_nl ++{ ++ sa_family_t nl_family; /* AF_NETLINK */ ++ unsigned short nl_pad; /* zero */ ++ __u32 nl_pid; /* process pid */ ++ __u32 nl_groups; /* multicast groups mask */ ++}; ++ ++struct nlmsghdr ++{ ++ __u32 nlmsg_len; /* Length of message including header */ ++ __u16 nlmsg_type; /* Message content */ ++ __u16 nlmsg_flags; /* Additional flags */ ++ __u32 nlmsg_seq; /* Sequence number */ ++ __u32 nlmsg_pid; /* Sending process PID */ ++}; ++ ++/* Flags values */ ++ ++#define NLM_F_REQUEST 1 /* It is request message. 
*/ ++#define NLM_F_MULTI 2 /* Multipart message, terminated by NLMSG_DONE */ ++#define NLM_F_ACK 4 /* Reply with ack, with zero or error code */ ++#define NLM_F_ECHO 8 /* Echo this request */ ++ ++/* Modifiers to GET request */ ++#define NLM_F_ROOT 0x100 /* specify tree root */ ++#define NLM_F_MATCH 0x200 /* return all matching */ ++#define NLM_F_ATOMIC 0x400 /* atomic GET */ ++#define NLM_F_DUMP (NLM_F_ROOT|NLM_F_MATCH) ++ ++/* Modifiers to NEW request */ ++#define NLM_F_REPLACE 0x100 /* Override existing */ ++#define NLM_F_EXCL 0x200 /* Do not touch, if it exists */ ++#define NLM_F_CREATE 0x400 /* Create, if it does not exist */ ++#define NLM_F_APPEND 0x800 /* Add to end of list */ ++ ++/* ++ 4.4BSD ADD NLM_F_CREATE|NLM_F_EXCL ++ 4.4BSD CHANGE NLM_F_REPLACE ++ ++ True CHANGE NLM_F_CREATE|NLM_F_REPLACE ++ Append NLM_F_CREATE ++ Check NLM_F_EXCL ++ */ ++ ++#define NLMSG_ALIGNTO 4 ++#define NLMSG_ALIGN(len) ( ((len)+NLMSG_ALIGNTO-1) & ~(NLMSG_ALIGNTO-1) ) ++#define NLMSG_HDRLEN ((int) NLMSG_ALIGN(sizeof(struct nlmsghdr))) ++#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(NLMSG_HDRLEN)) ++#define NLMSG_SPACE(len) NLMSG_ALIGN(NLMSG_LENGTH(len)) ++#define NLMSG_DATA(nlh) ((void*)(((char*)nlh) + NLMSG_LENGTH(0))) ++#define NLMSG_NEXT(nlh,len) ((len) -= NLMSG_ALIGN((nlh)->nlmsg_len), \ ++ (struct nlmsghdr*)(((char*)(nlh)) + NLMSG_ALIGN((nlh)->nlmsg_len))) ++#define NLMSG_OK(nlh,len) ((len) >= (int)sizeof(struct nlmsghdr) && \ ++ (nlh)->nlmsg_len >= sizeof(struct nlmsghdr) && \ ++ (nlh)->nlmsg_len <= (len)) ++#define NLMSG_PAYLOAD(nlh,len) ((nlh)->nlmsg_len - NLMSG_SPACE((len))) ++ ++#define NLMSG_NOOP 0x1 /* Nothing. 
*/ ++#define NLMSG_ERROR 0x2 /* Error */ ++#define NLMSG_DONE 0x3 /* End of a dump */ ++#define NLMSG_OVERRUN 0x4 /* Data lost */ ++ ++#define NLMSG_MIN_TYPE 0x10 /* < 0x10: reserved control messages */ ++ ++struct nlmsgerr ++{ ++ int error; ++ struct nlmsghdr msg; ++}; ++ ++#define NETLINK_ADD_MEMBERSHIP 1 ++#define NETLINK_DROP_MEMBERSHIP 2 ++#define NETLINK_PKTINFO 3 ++ ++struct nl_pktinfo ++{ ++ __u32 group; ++}; ++ ++#define NET_MAJOR 36 /* Major 36 is reserved for networking */ ++ ++enum { ++ NETLINK_UNCONNECTED = 0, ++ NETLINK_CONNECTED, ++}; ++ ++/* ++ * <------- NLA_HDRLEN ------> <-- NLA_ALIGN(payload)--> ++ * +---------------------+- - -+- - - - - - - - - -+- - -+ ++ * | Header | Pad | Payload | Pad | ++ * | (struct nlattr) | ing | | ing | ++ * +---------------------+- - -+- - - - - - - - - -+- - -+ ++ * <-------------- nlattr->nla_len --------------> ++ */ ++ ++struct nlattr ++{ ++ __u16 nla_len; ++ __u16 nla_type; ++}; ++ ++#define NLA_ALIGNTO 4 ++#define NLA_ALIGN(len) (((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)) ++#define NLA_HDRLEN ((int) NLA_ALIGN(sizeof(struct nlattr))) ++ ++#ifdef __KERNEL__ ++ ++#include ++#include ++ ++struct netlink_skb_parms ++{ ++ struct ucred creds; /* Skb credentials */ ++ __u32 pid; ++ __u32 dst_group; ++ kernel_cap_t eff_cap; ++ __u32 loginuid; /* Login (audit) uid */ ++ __u32 sid; /* SELinux security id */ ++}; ++ ++#define NETLINK_CB(skb) (*(struct netlink_skb_parms*)&((skb)->cb)) ++#define NETLINK_CREDS(skb) (&NETLINK_CB((skb)).creds) ++ ++ ++extern struct sock *netlink_kernel_create(int unit, unsigned int groups, void (*input)(struct sock *sk, int len), struct module *module); ++extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err); ++extern int netlink_has_listeners(struct sock *sk, unsigned int group); ++extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 pid, int nonblock); ++extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 pid, ++ __u32 
group, gfp_t allocation); ++extern void netlink_set_err(struct sock *ssk, __u32 pid, __u32 group, int code); ++extern int netlink_register_notifier(struct notifier_block *nb); ++extern int netlink_unregister_notifier(struct notifier_block *nb); ++ ++/* finegrained unicast helpers: */ ++struct sock *netlink_getsockbyfilp(struct file *filp); ++int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock, ++ long timeo, struct sock *ssk); ++void netlink_detachskb(struct sock *sk, struct sk_buff *skb); ++int netlink_sendskb(struct sock *sk, struct sk_buff *skb, int protocol); ++ ++/* ++ * skb should fit one page. This choice is good for headerless malloc. ++ */ ++#define NLMSG_GOODORDER 0 ++#define NLMSG_GOODSIZE (SKB_MAX_ORDER(0, NLMSG_GOODORDER)) ++#define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN) ++ ++ ++struct netlink_callback ++{ ++ struct sk_buff *skb; ++ struct nlmsghdr *nlh; ++ int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); ++ int (*done)(struct netlink_callback *cb); ++ int family; ++ long args[5]; ++}; ++ ++struct netlink_notify ++{ ++ int pid; ++ int protocol; ++}; ++ ++static __inline__ struct nlmsghdr * ++__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len, int flags) ++{ ++ struct nlmsghdr *nlh; ++ int size = NLMSG_LENGTH(len); ++ ++ nlh = (struct nlmsghdr*)skb_put(skb, NLMSG_ALIGN(size)); ++ nlh->nlmsg_type = type; ++ nlh->nlmsg_len = size; ++ nlh->nlmsg_flags = flags; ++ nlh->nlmsg_pid = pid; ++ nlh->nlmsg_seq = seq; ++ memset(NLMSG_DATA(nlh) + len, 0, NLMSG_ALIGN(size) - size); ++ return nlh; ++} ++ ++#define NLMSG_NEW(skb, pid, seq, type, len, flags) \ ++({ if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) \ ++ goto nlmsg_failure; \ ++ __nlmsg_put(skb, pid, seq, type, len, flags); }) ++ ++#define NLMSG_PUT(skb, pid, seq, type, len) \ ++ NLMSG_NEW(skb, pid, seq, type, len, 0) ++ ++#define NLMSG_NEW_ANSWER(skb, cb, type, len, flags) \ ++ NLMSG_NEW(skb, NETLINK_CB((cb)->skb).pid, \ ++ 
(cb)->nlh->nlmsg_seq, type, len, flags) ++ ++#define NLMSG_END(skb, nlh) \ ++({ (nlh)->nlmsg_len = (skb)->tail - (unsigned char *) (nlh); \ ++ (skb)->len; }) ++ ++#define NLMSG_CANCEL(skb, nlh) \ ++({ skb_trim(skb, (unsigned char *) (nlh) - (skb)->data); \ ++ -1; }) ++ ++extern int netlink_dump_start(struct sock *ssk, struct sk_buff *skb, ++ struct nlmsghdr *nlh, ++ int (*dump)(struct sk_buff *skb, struct netlink_callback*), ++ int (*done)(struct netlink_callback*)); ++ ++ ++#define NL_NONROOT_RECV 0x1 ++#define NL_NONROOT_SEND 0x2 ++extern void netlink_set_nonroot(int protocol, unsigned flag); ++ ++#endif /* __KERNEL__ */ ++ ++#endif /* __LINUX_NETLINK_H */ diff --git a/kernel_patches/backport/2.6.9_U4/netlink-02-netlink_h_for_rh4.patch b/kernel_patches/backport/2.6.9_U4/netlink-02-netlink_h_for_rh4.patch new file mode 100644 index 0000000..d9ba403 --- /dev/null +++ b/kernel_patches/backport/2.6.9_U4/netlink-02-netlink_h_for_rh4.patch @@ -0,0 +1,200 @@ +diff -rup linux-2.6.20/include/linux/netlink.h linux-2.6.20-backport-rh4-u3/include/linux/netlink.h +--- linux-2.6.20/include/linux/netlink.h 2007-02-04 20:44:54.000000000 +0200 ++++ linux-2.6.20-backport-rh4-u3/include/linux/netlink.h 2007-03-08 10:09:43.000000000 +0200 +@@ -5,24 +5,19 @@ + #include + + #define NETLINK_ROUTE 0 /* Routing/device hook */ +-#define NETLINK_UNUSED 1 /* Unused number */ ++#define NETLINK_SKIP 1 /* Reserved for ENskip */ + #define NETLINK_USERSOCK 2 /* Reserved for user mode socket protocols */ + #define NETLINK_FIREWALL 3 /* Firewalling hook */ +-#define NETLINK_INET_DIAG 4 /* INET socket monitoring */ ++#define NETLINK_TCPDIAG 4 /* TCP socket monitoring */ + #define NETLINK_NFLOG 5 /* netfilter/iptables ULOG */ + #define NETLINK_XFRM 6 /* ipsec */ + #define NETLINK_SELINUX 7 /* SELinux event notifications */ +-#define NETLINK_ISCSI 8 /* Open-iSCSI */ ++#define NETLINK_ISCSI 8 + #define NETLINK_AUDIT 9 /* auditing */ +-#define NETLINK_FIB_LOOKUP 10 +-#define NETLINK_CONNECTOR 11 
+-#define NETLINK_NETFILTER 12 /* netfilter subsystem */ ++#define NETLINK_ROUTE6 11 /* af_inet6 route comm channel */ + #define NETLINK_IP6_FW 13 + #define NETLINK_DNRTMSG 14 /* DECnet routing messages */ +-#define NETLINK_KOBJECT_UEVENT 15 /* Kernel messages to userspace */ +-#define NETLINK_GENERIC 16 +-/* leave room for NETLINK_DM (DM Events) */ +-#define NETLINK_SCSITRANSPORT 18 /* SCSI Transports */ ++#define NETLINK_TAPBASE 16 /* 16 to 31 are ethertap */ + + #define MAX_LINKS 32 + +@@ -73,8 +68,7 @@ struct nlmsghdr + + #define NLMSG_ALIGNTO 4 + #define NLMSG_ALIGN(len) ( ((len)+NLMSG_ALIGNTO-1) & ~(NLMSG_ALIGNTO-1) ) +-#define NLMSG_HDRLEN ((int) NLMSG_ALIGN(sizeof(struct nlmsghdr))) +-#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(NLMSG_HDRLEN)) ++#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(sizeof(struct nlmsghdr))) + #define NLMSG_SPACE(len) NLMSG_ALIGN(NLMSG_LENGTH(len)) + #define NLMSG_DATA(nlh) ((void*)(((char*)nlh) + NLMSG_LENGTH(0))) + #define NLMSG_NEXT(nlh,len) ((len) -= NLMSG_ALIGN((nlh)->nlmsg_len), \ +@@ -89,23 +83,12 @@ struct nlmsghdr + #define NLMSG_DONE 0x3 /* End of a dump */ + #define NLMSG_OVERRUN 0x4 /* Data lost */ + +-#define NLMSG_MIN_TYPE 0x10 /* < 0x10: reserved control messages */ +- + struct nlmsgerr + { + int error; + struct nlmsghdr msg; + }; + +-#define NETLINK_ADD_MEMBERSHIP 1 +-#define NETLINK_DROP_MEMBERSHIP 2 +-#define NETLINK_PKTINFO 3 +- +-struct nl_pktinfo +-{ +- __u32 group; +-}; +- + #define NET_MAJOR 36 /* Major 36 is reserved for networking */ + + enum { +@@ -113,25 +96,6 @@ enum { + NETLINK_CONNECTED, + }; + +-/* +- * <------- NLA_HDRLEN ------> <-- NLA_ALIGN(payload)--> +- * +---------------------+- - -+- - - - - - - - - -+- - -+ +- * | Header | Pad | Payload | Pad | +- * | (struct nlattr) | ing | | ing | +- * +---------------------+- - -+- - - - - - - - - -+- - -+ +- * <-------------- nlattr->nla_len --------------> +- */ +- +-struct nlattr +-{ +- __u16 nla_len; +- __u16 nla_type; +-}; +- +-#define NLA_ALIGNTO 4 
+-#define NLA_ALIGN(len) (((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)) +-#define NLA_HDRLEN ((int) NLA_ALIGN(sizeof(struct nlattr))) +- + #ifdef __KERNEL__ + + #include +@@ -141,39 +105,42 @@ struct netlink_skb_parms + { + struct ucred creds; /* Skb credentials */ + __u32 pid; +- __u32 dst_group; ++ __u32 groups; ++ __u32 dst_pid; ++ __u32 dst_groups; + kernel_cap_t eff_cap; + __u32 loginuid; /* Login (audit) uid */ +- __u32 sid; /* SELinux security id */ + }; + + #define NETLINK_CB(skb) (*(struct netlink_skb_parms*)&((skb)->cb)) + #define NETLINK_CREDS(skb) (&NETLINK_CB((skb)).creds) + + +-extern struct sock *netlink_kernel_create(int unit, unsigned int groups, void (*input)(struct sock *sk, int len), struct module *module); ++extern int netlink_attach(int unit, int (*function)(int,struct sk_buff *skb)); ++extern void netlink_detach(int unit); ++extern int netlink_post(int unit, struct sk_buff *skb); ++extern struct sock *netlink_kernel_create(int unit, void (*input)(struct sock *sk, int len)); + extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err); +-extern int netlink_has_listeners(struct sock *sk, unsigned int group); + extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 pid, int nonblock); + extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 pid, +- __u32 group, gfp_t allocation); ++ __u32 group, int allocation); + extern void netlink_set_err(struct sock *ssk, __u32 pid, __u32 group, int code); + extern int netlink_register_notifier(struct notifier_block *nb); + extern int netlink_unregister_notifier(struct notifier_block *nb); + + /* finegrained unicast helpers: */ ++struct sock *netlink_getsockbypid(struct sock *ssk, u32 pid); + struct sock *netlink_getsockbyfilp(struct file *filp); +-int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock, +- long timeo, struct sock *ssk); ++int netlink_attachskb(struct sock *sk, struct sk_buff *skb, int nonblock, long timeo); + void 
netlink_detachskb(struct sock *sk, struct sk_buff *skb); + int netlink_sendskb(struct sock *sk, struct sk_buff *skb, int protocol); + + /* + * skb should fit one page. This choice is good for headerless malloc. ++ * ++ * FIXME: What is the best size for SLAB???? --ANK + */ +-#define NLMSG_GOODORDER 0 +-#define NLMSG_GOODSIZE (SKB_MAX_ORDER(0, NLMSG_GOODORDER)) +-#define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN) ++#define NLMSG_GOODSIZE (PAGE_SIZE - ((sizeof(struct sk_buff)+0xF)&~0xF)) + + + struct netlink_callback +@@ -183,7 +150,7 @@ struct netlink_callback + int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); + int (*done)(struct netlink_callback *cb); + int family; +- long args[5]; ++ long args[4]; + }; + + struct netlink_notify +@@ -193,7 +160,7 @@ struct netlink_notify + }; + + static __inline__ struct nlmsghdr * +-__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len, int flags) ++__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len) + { + struct nlmsghdr *nlh; + int size = NLMSG_LENGTH(len); +@@ -201,32 +168,15 @@ __nlmsg_put(struct sk_buff *skb, u32 pid + nlh = (struct nlmsghdr*)skb_put(skb, NLMSG_ALIGN(size)); + nlh->nlmsg_type = type; + nlh->nlmsg_len = size; +- nlh->nlmsg_flags = flags; ++ nlh->nlmsg_flags = 0; + nlh->nlmsg_pid = pid; + nlh->nlmsg_seq = seq; +- memset(NLMSG_DATA(nlh) + len, 0, NLMSG_ALIGN(size) - size); + return nlh; + } + +-#define NLMSG_NEW(skb, pid, seq, type, len, flags) \ +-({ if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) \ +- goto nlmsg_failure; \ +- __nlmsg_put(skb, pid, seq, type, len, flags); }) +- + #define NLMSG_PUT(skb, pid, seq, type, len) \ +- NLMSG_NEW(skb, pid, seq, type, len, 0) +- +-#define NLMSG_NEW_ANSWER(skb, cb, type, len, flags) \ +- NLMSG_NEW(skb, NETLINK_CB((cb)->skb).pid, \ +- (cb)->nlh->nlmsg_seq, type, len, flags) +- +-#define NLMSG_END(skb, nlh) \ +-({ (nlh)->nlmsg_len = (skb)->tail - (unsigned char *) (nlh); \ +- (skb)->len; }) +- +-#define 
NLMSG_CANCEL(skb, nlh) \ +-({ skb_trim(skb, (unsigned char *) (nlh) - (skb)->data); \ +- -1; }) ++({ if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) goto nlmsg_failure; \ ++ __nlmsg_put(skb, pid, seq, type, len); }) + + extern int netlink_dump_start(struct sock *ssk, struct sk_buff *skb, + struct nlmsghdr *nlh, -- 1.4.2 From swise at opengridcomputing.com Wed May 9 07:30:36 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 09 May 2007 09:30:36 -0500 Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM In-Reply-To: <46415DFE.9030807@voltaire.com> References: <1177791386.4615.8.camel@stevo-laptop> <98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com> <1178575761.30571.175.camel@stevo-desktop> <95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com> <463FCA42.3000104@indiana.edu> <46415DFE.9030807@voltaire.com> Message-ID: <1178721036.382.16.camel@stevo-desktop> On Wed, 2007-05-09 at 08:37 +0300, Or Gerlitz wrote: > Andrew Friedley wrote: > > Jeff Squyres wrote: > >>>> FWIW, yes, adding RDMA CM support has actually been on my to-do list > >>>> for a while, but it keeps getting bumped by higher priority items. > >>>> It would be *much* better if some iWARP companies got involved in > >>>> Open MPI... > > > Hmm I'm interested. I've already done some work switching over to RDMA > > CM for some research stuff I've been doing; it's not publicly accessible > > w/o the 3rd party agreement. I can help answer questions on what > > exactly needs to change, and do some testing. > > Doing a bit of zoom out from the "how to make ofed's udapl work for > ompi" thread, my thinking is that the ompi udapl btl enablement is > actually only the first step, where for production/longterm/etc you want > to have an rdmacm btl. 
Reasoning here is made of many arguments, among
> them the quickest I can make are:
>
> A) it seems that ompi would want to use not only RC but rather also UD
> multicast and unicast, which are not covered by udapl
>
> B) there's actually no real justification to maintain two APIs (namely
> udapl vs libibverbs/librdmacm), so down the road, only one of them would
> survive (udapl is implemented ***over*** libibverbs/librdmacm so if the
> latter dies, so does udapl). Specifically, I hear here and there that
> the OFED stack is now on its way to be deployed all over the place,
> specifically in commercial Unix OSs (which want modern! code that
> supports IPoIB-CM, RDS, SRP, iSER, etc., you name it) so eventually the
> rdmacm btl can be used also over Solaris et al.
>

Agreed. Enabling udapl will get OMPI over iwarp immediately (and
hopefully in ofed-1.2). Post ofed-1.2, I think OMPI _should_ create a
rdma-cm btl. That's the plan...

Steve.

From swise at opengridcomputing.com Wed May 9 07:37:56 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 09 May 2007 09:37:56 -0500
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178718090.382.4.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
	<1178718090.382.4.camel@stevo-desktop>
Message-ID: <1178721476.382.18.camel@stevo-desktop>

Although as Boris pointed out, perhaps the hack in OMPI is no longer
needed at all...

On Wed, 2007-05-09 at 08:41 -0500, Steve Wise wrote:
> 606 opened to track the udapl change.
>
> 607 opened to track the ompi change to remove the port number stashing
> hack.
>
> Status: I have a patch from Arlin to test today. I will test with that
> patch and with the OMPI port hack removed. Stay tuned...
>
>
>
> Steve.
>
> On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote:
> > Steve Wise wrote:
> >
> > >I would like the group to consider including changes needed to OMPI
> > >and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.
> > >
> > >This will provide OMPI support over iwarp devices via udapl until we can
> > >get rdma-cm support added to OMPI.
> > >
> > >
> > >Steve.
> > >
> > >
> > >
> > Steve, can you open a bug to track this?
>
> _______________________________________________
> devel mailing list
> devel at open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

From jsquyres at cisco.com Wed May 9 08:25:43 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 9 May 2007 11:25:43 -0400
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178721476.382.18.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com>
	<1178657765.11455.32.camel@stevo-desktop>
	<4640FDE9.9010000@ichips.intel.com>
	<1178718090.382.4.camel@stevo-desktop>
	<1178721476.382.18.camel@stevo-desktop>
Message-ID:

FWIW, I would marginally prefer that this bug be tracked in the Open MPI
trac ticket system, not the OFA bugzilla (Steve W. will have write
access there as soon as Chelsio submits their OMPI 3rd party
contribution agreement). We've traditionally [mostly] tracked OMPI bugs
in the OMPI bug system and OFED-specific OMPI packaging problems in the
OFA bugzilla.

It's a gray area, I admit. But since I'm not the uDAPL maintainer in
Open MPI, moving the bug over there will allow the Right people to see
it (some OMPI developers are cross-subscribed to the OFA general list,
but not all). For example, this udapl problem is likely related to the
existing OMPI trac ticket 890
(https://svn.open-mpi.org/trac/ompi/ticket/890).

On May 9, 2007, at 10:37 AM, Steve Wise wrote:

>
> Although as Boris pointed out, perhaps the hack in OMPI is no longer
> needed at all...
>
>
> On Wed, 2007-05-09 at 08:41 -0500, Steve Wise wrote:
>> 606 opened to track the udapl change.
>>
>> 607 opened to track the ompi change to remove the port number stashing
>> hack.
>>
>> Status: I have a patch from Arlin to test today. I will test with that
>> patch and with the OMPI port hack removed. Stay tuned...
>>
>>
>>
>> Steve.
>>
>> On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote:
>>> Steve Wise wrote:
>>>
>>>> I would like the group to consider including changes needed to OMPI
>>>> and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.
>>>>
>>>> This will provide OMPI support over iwarp devices via udapl until we can
>>>> get rdma-cm support added to OMPI.
>>>>
>>>>
>>>> Steve.
>>>>
>>>>
>>>>
>>> Steve, can you open a bug to track this?

--
Jeff Squyres
Cisco Systems

From jsquyres at cisco.com Wed May 9 08:25:46 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 9 May 2007 11:25:46 -0400
Subject: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
In-Reply-To: <1178721036.382.16.camel@stevo-desktop>
References: <1177791386.4615.8.camel@stevo-laptop>
	<98BECD4C-6C0B-4249-A152-D35A4C30763A@cisco.com>
	<1178575761.30571.175.camel@stevo-desktop>
	<95B68972-BCB7-4DB6-8E34-2BC558A0FC50@cisco.com>
	<463FCA42.3000104@indiana.edu>
	<46415DFE.9030807@voltaire.com>
	<1178721036.382.16.camel@stevo-desktop>
Message-ID:

On May 9, 2007, at 10:30 AM, Steve Wise wrote:

> Agreed. Enabling udapl will get OMPI over iwarp immediately (and
> hopefully in ofed-1.2). Post ofed-1.2, I think OMPI _should_ create a
> rdma-cm btl. That's the plan...

Yes and no. Please see my other reply about an "rdma cm" BTL...
--
Jeff Squyres
Cisco Systems

From pradeeps at linux.vnet.ibm.com Wed May 9 08:32:43 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Wed, 09 May 2007 08:32:43 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ)[PATCH V4] patch for review
Message-ID: <4641E99B.10706@linux.vnet.ibm.com>

Here is a fourth version of the IPOIB_CM_NOSRQ patch for review. This
patch will benefit adapters that do not support shared receive queues.

This patch incorporates the following review comments from v3:

1. Incorporated review comments (related to style) from Roland Dreier
   and Michael Tsirkin
2. Fixed a couple of leaks in the error path (thanks to Roland Dreier
   for pointing them out)
3. Eliminated spin lock in data path (as suggested by Michael Tsirkin)
4. Changes to avoid CQ overflow (issue pointed out by Michael Tsirkin)
5. Send REJ when no RC QPs remain (credit Michael Tsirkin for the idea)
6. I have reset the retry_count to 0 in ipoib_cm_send_req()

This patch has been tested with linux-2.6.21-rc7 (derived from Roland's
for-2.6.22 git tree on 05/07/2007) with Topspin and IBM HCAs on ppc64
machines. I have run netperf between two IBM HCAs as well as between an
IBM and a Topspin HCA.

Note 1: For interoperability, retry_count in ipoib_cm_send_req() may
need to be changed to a non-zero value (3 has worked for me). This is a
temporary workaround until the HCA and/or CM bug is fixed so that the
HCA local ACK delay is taken into account.

Note 2: I ran into problems trying to build Roland's git tree (on
ppc64) that I downloaded on 05/07/2007. Hence I used just the
infiniband/ subdirectory and had to change the code to use
skb->mac.raw = skb->data instead of skb_reset_mac_header(skb). I did
not want to submit a patch that was untested. This can be fixed with a
subsequent patch when I get the tree to build.
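
For readers following the NOSRQ receive path: post_receive_nosrq() in the
patch below recovers two fields from the 64-bit id it is handed, a slot in
the per-QP rx_ring (id >> 32) and an index into rx_index_table
(id & NOSRQ_INDEX_MASK), and then ORs in IPOIB_CM_OP_RECV. The following is
only a minimal user-space sketch of that layout. The constants are copied
from the patch, but the helper names (nosrq_pack_id, nosrq_id_to_index,
nosrq_id_to_slot) are hypothetical and do not appear in it, and the packing
direction is an assumption inferred from the decoding shown.

```c
#include <assert.h>
#include <stdint.h>

/* Constants as defined in the patch's ipoib.h hunk. */
#define NOSRQ_INDEX_TABLE_SIZE 1024
#define NOSRQ_INDEX_MASK       (NOSRQ_INDEX_TABLE_SIZE - 1)
/* The patch writes (1ul << 30); 1ull is used here for portability. */
#define IPOIB_CM_OP_RECV       (1ull << 30)

/*
 * Hypothetical helper (not in the patch): build an id from a ring slot
 * and a QP table index. Assumed to be the inverse of the decoding that
 * post_receive_nosrq() performs.
 */
static uint64_t nosrq_pack_id(uint32_t slot, uint32_t index)
{
	assert(index < NOSRQ_INDEX_TABLE_SIZE);
	return ((uint64_t)slot << 32) | index;
}

/* Low 10 bits: which ipoib_cm_rx entry (and hence which QP). */
static uint32_t nosrq_id_to_index(uint64_t id)
{
	return (uint32_t)(id & NOSRQ_INDEX_MASK);
}

/* High 32 bits: which buffer in that QP's rx_ring. */
static uint32_t nosrq_id_to_slot(uint64_t id)
{
	return (uint32_t)(id >> 32);
}
```

Because IPOIB_CM_OP_RECV occupies bit 30, it sits between the two fields
and does not disturb either of them when it is ORed into the id.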
Signed-off-by: Pradeep Satyanarayana --- --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-07 16:05:32.000000000 -0700 +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-07 17:42:14.000000000 -0700 @@ -97,9 +97,13 @@ enum { #define IPOIB_OP_RECV (1ul << 31) #ifdef CONFIG_INFINIBAND_IPOIB_CM -#define IPOIB_CM_OP_SRQ (1ul << 30) +#define IPOIB_CM_OP_RECV (1ul << 30) + +#define NOSRQ_INDEX_TABLE_SIZE 1024 +#define NOSRQ_INDEX_MASK (NOSRQ_INDEX_TABLE_SIZE -1) + #else -#define IPOIB_CM_OP_SRQ (0) +#define IPOIB_CM_OP_RECV (0) #endif /* structs */ @@ -133,11 +137,14 @@ struct ipoib_cm_data { }; struct ipoib_cm_rx { - struct ib_cm_id *id; - struct ib_qp *qp; - struct list_head list; - struct net_device *dev; - unsigned long jiffies; + struct ib_cm_id *id; + struct ib_qp *qp; + struct ipoib_cm_rx_buf *rx_ring; /* Used by NOSRQ only */ + struct list_head list; + struct net_device *dev; + unsigned long jiffies; + u32 index; /* wr_ids are distinguished by index + * to identify the QP -NOSRQ only */ }; struct ipoib_cm_tx { @@ -176,6 +183,8 @@ struct ipoib_cm_dev_priv { struct ib_wc ibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + struct ipoib_cm_rx **rx_index_table; /* See ipoib_cm_dev_init() + *for usage of this element */ }; /* @@ -521,10 +530,9 @@ static inline void ipoib_cm_skb_too_long dev_kfree_skb_any(skb); } -static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) { } - #endif #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-05-07 22:19:52.000000000 -0700 +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-05-08 18:07:15.000000000 -0700 @@ -76,20 +76,20 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int ipoib_cm_post_receive(struct net_device *dev, int id) +static int post_receive_srq(struct net_device *dev, u64 id) { 
struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_recv_wr *bad_wr; int i, ret; - priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; for (i = 0; i < IPOIB_CM_RX_SG; ++i) priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); if (unlikely(ret)) { - ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); + ipoib_warn(priv, "post srq failed for buf %ld (%d)\n", id, ret); ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[id].mapping); dev_kfree_skb_any(priv->cm.srq_ring[id].skb); @@ -99,12 +99,60 @@ static int ipoib_cm_post_receive(struct return ret; } -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, +static int post_receive_nosrq(struct net_device *dev, u64 id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_recv_wr *bad_wr; + int i, ret; + u32 index; + u32 wr_id; + struct ipoib_cm_rx *rx_ptr; + + index = id & NOSRQ_INDEX_MASK ; + wr_id = id >> 32; + + rx_ptr = priv->cm.rx_index_table[index]; + + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; + + for (i = 0; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i]; + + ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", + wr_id, ret); + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + rx_ptr->rx_ring[wr_id].mapping); + dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb); + rx_ptr->rx_ring[wr_id].skb = NULL; + } + + return ret; +} + +static int ipoib_cm_post_receive(struct net_device *dev, u64 id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + if (priv->cm.srq) + ret = post_receive_srq(dev, id); + else + ret = post_receive_nosrq(dev, id); + + return ret; +} + +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, + int frags, u64 mapping[IPOIB_CM_RX_SG]) { struct 
ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; int i; + struct ipoib_cm_rx *rx_ptr; + u32 index, wr_id; skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); if (unlikely(!skb)) @@ -136,7 +184,14 @@ static struct sk_buff *ipoib_cm_alloc_rx goto partial_error; } - priv->cm.srq_ring[id].skb = skb; + if (priv->cm.srq) + priv->cm.srq_ring[id].skb = skb; + else { + index = id & NOSRQ_INDEX_MASK ; + wr_id = id >> 32; + rx_ptr = priv->cm.rx_index_table[index]; + rx_ptr->rx_ring[wr_id].skb = skb; + } return skb; partial_error: @@ -159,11 +214,14 @@ static struct ib_qp *ipoib_cm_create_rx_ .recv_cq = priv->cq, .srq = priv->cm.srq, .cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */ + .cap.max_recv_wr = ipoib_recvq_size + 1, .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_RC, .qp_context = p, }; + if (!priv->cm.srq) + attr.cap.max_recv_sge = IPOIB_CM_RX_SG; return ib_create_qp(priv->pd, &attr); } @@ -217,12 +275,103 @@ static int ipoib_cm_send_rep(struct net_ rep.flow_control = 0; rep.rnr_retry_count = req->rnr_retry_count; rep.target_ack_delay = 20; /* FIXME */ - rep.srq = 1; rep.qp_num = qp->qp_num; rep.starting_psn = psn; + rep.srq = !!priv->cm.srq; return ib_send_cm_rep(cm_id, &rep); } +static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, unsigned psn) +{ + struct net_device *dev = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u32 qp_num, index; + u64 i; + + qp_num = p->qp->qp_num; + + /* In the SRQ case there is a common rx buffer called the srq_ring. + * However, for the NOSRQ we create an rx_ring for every + * struct ipoib_cm_rx. 
+ */ + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL); + if (!p->rx_ring) { + printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n", + qp_num); + return -ENOMEM; + } + + cm_id->context = p; + p->jiffies = jiffies; + spin_lock_irq(&priv->lock); + list_add(&p->list, &priv->cm.passive_ids); + + for (index = 0; index < NOSRQ_INDEX_TABLE_SIZE; index++) + if (priv->cm.rx_index_table[index] == NULL) + break; + + if ( index == NOSRQ_INDEX_TABLE_SIZE) { + spin_unlock_irq(&priv->lock); + ipoib_warn(priv, "NOSRQ supports a max of %d RC " + "QPs. That limit has now been reached\n", + NOSRQ_INDEX_TABLE_SIZE); + + /* We send a REJ to the remote side indicating that we + * have no more free RC QPs and leave it to the remote side + * to take appropriate action. This should leave the + * current set of QPs unaffected and any subsequent REQs + * will be able to use RC QPs if they are available. + */ + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0); + ret = -EINVAL; + goto err_send_rej; + } + + priv->cm.rx_index_table[index] = p; + spin_unlock_irq(&priv->lock); + + /* We will subsequently use this stored pointer while freeing + * resources in stale task */ + p->index = index; + + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) { + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret); + ipoib_cm_dev_cleanup(dev); + goto err_modify_nosrq; + } + + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping)) { + ipoib_warn(priv, "failed to allocate receive " + "buffer %ld\n", i); + ipoib_cm_dev_cleanup(dev); + ret = -ENOMEM; + goto err_alloc_and_post; + } + + if (ipoib_cm_post_receive(dev, i << 32 | index)) { + ipoib_warn(priv, "ipoib_ib_post_receive " + "failed for buf %ld\n", i); + ipoib_cm_dev_cleanup(dev); + ret = -EIO; + goto err_alloc_and_post; + } + } + + return 0; + +err_send_rej: +err_modify_nosrq: +err_alloc_and_post: + 
kfree(p->rx_ring); + return ret; +} + static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) { struct net_device *dev = cm_id->context; @@ -233,8 +382,11 @@ static int ipoib_cm_req_handler(struct i ipoib_dbg(priv, "REQ arrived\n"); p = kzalloc(sizeof *p, GFP_KERNEL); - if (!p) + if (!p) { + printk(KERN_WARNING "Failed to allocate RX control block when " + "REQ arrived\n"); return -ENOMEM; + } p->dev = dev; p->id = cm_id; p->qp = ipoib_cm_create_rx_qp(dev, p); @@ -244,9 +396,15 @@ static int ipoib_cm_req_handler(struct i } psn = random32() & 0xffffff; - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); - if (ret) - goto err_modify; + if (!priv->cm.srq) { + if (ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn)) + goto err_post_nosrq; + } else { + p->rx_ring = NULL; + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) + goto err_modify; + } ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); if (ret) { @@ -254,16 +412,19 @@ static int ipoib_cm_req_handler(struct i goto err_rep; } - cm_id->context = p; - p->jiffies = jiffies; - spin_lock_irq(&priv->lock); - list_add(&p->list, &priv->cm.passive_ids); - spin_unlock_irq(&priv->lock); + if (priv->cm.srq) { + cm_id->context = p; + p->jiffies = jiffies; + spin_lock_irq(&priv->lock); + list_add(&p->list, &priv->cm.passive_ids); + spin_unlock_irq(&priv->lock); + } queue_delayed_work(ipoib_workqueue, &priv->cm.stale_task, IPOIB_CM_RX_DELAY); return 0; err_rep: +err_post_nosrq: err_modify: ib_destroy_qp(p->qp); err_qp: @@ -339,48 +500,53 @@ static void skb_put_frags(struct sk_buff } } -void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +static void timer_check(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has 
been removed. */ + if (!list_empty(&p->list)) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + queue_delayed_work(ipoib_workqueue, + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); + } +} + +static void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; struct sk_buff *skb, *newskb; + u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV; struct ipoib_cm_rx *p; - unsigned long flags; - u64 mapping[IPOIB_CM_RX_SG]; - int frags; + int frags, ret; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); if (unlikely(wr_id >= ipoib_recvq_size)) { - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", - wr_id, ipoib_recvq_size); - return; + ipoib_warn(priv, "cm recv completion event with wrid %ld " + "(> %d)\n", wr_id, ipoib_recvq_size); + return; } skb = priv->cm.srq_ring[wr_id].skb; if (unlikely(wc->status != IB_WC_SUCCESS)) { ipoib_dbg(priv, "cm recv error " - "(status=%d, wrid=%d vend_err %x)\n", + "(status=%d, wrid=%ld vend_err %x)\n", wc->status, wr_id, wc->vendor_err); ++priv->stats.rx_dropped; - goto repost; + goto repost_srq; } if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { p = wc->qp->qp_context; - if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { - spin_lock_irqsave(&priv->lock, flags); - p->jiffies = jiffies; - /* Move this entry to list head, but do - * not re-add it if it has been removed. */ - if (!list_empty(&p->list)) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irqrestore(&priv->lock, flags); - queue_delayed_work(ipoib_workqueue, - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); - } + timer_check(priv, p); } frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, @@ -392,13 +558,113 @@ void ipoib_cm_handle_rx_wc(struct net_de * If we can't allocate a new RX buffer, dump * this packet and reuse the old buffer. 
*/ - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); + ipoib_dbg(priv, "failed to allocate receive buffer %ld\n", wr_id); + ++priv->stats.rx_dropped; + goto repost_srq; + } + + ipoib_cm_dma_unmap_rx(priv, frags, + priv->cm.srq_ring[wr_id].mapping); + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); + + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb->mac.raw = skb->data; + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + +repost_srq: + ret = ipoib_cm_post_receive(dev, wr_id); + + if (unlikely(ret)) + ipoib_warn(priv, "ipoib_cm_post_receive failed for buf %ld\n", + wr_id); + +} + +static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb, *newskb; + u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32; + u32 index; + struct ipoib_cm_rx *p, *rx_ptr; + int frags, ret; + + + ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", + wr_id, wc->status); + + if (unlikely(wr_id >= ipoib_recvq_size)) { + ipoib_warn(priv, "cm recv completion event with wrid %ld " + "(> %d)\n", wr_id, ipoib_recvq_size); + return; + } + + index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK ; + + /* This is the only place where rx_ptr could be a NULL - could + * have just received a packet from a connection that has become + * stale and so is going away. We will simply drop the packet and + * let the hardware (it s IB_QPT_RC) handle the dropped packet. + * In the timer_check() function below, p->jiffies is updated and + * hence the connection will not be stale after that. 
+ */ + rx_ptr = priv->cm.rx_index_table[index]; + if (unlikely(!rx_ptr)) { + ipoib_warn(priv, "Received packet from a connection " + "that is going away. Hardware will handle it.\n"); + return; + } + + skb = rx_ptr->rx_ring[wr_id].skb; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + ipoib_dbg(priv, "cm recv error " + "(status=%d, wrid=%ld vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + ++priv->stats.rx_dropped; + goto repost_nosrq; + } + + if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { + /* There are no guarantees that wc->qp is not NULL for HCAs + * that do not support SRQ. */ + p = rx_ptr; + timer_check(priv, p); + } + + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, + (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; + + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags, + mapping); + if (unlikely(!newskb)) { + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. + */ + ipoib_dbg(priv, "failed to allocate receive buffer %ld\n", wr_id); ++priv->stats.rx_dropped; - goto repost; + goto repost_nosrq; } - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); + ipoib_cm_dma_unmap_rx(priv, frags, + rx_ptr->rx_ring[wr_id].mapping); + memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); @@ -418,10 +684,22 @@ void ipoib_cm_handle_rx_wc(struct net_de skb->pkt_type = PACKET_HOST; netif_receive_skb(skb); -repost: - if (unlikely(ipoib_cm_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_cm_post_receive failed " - "for buf %d\n", wr_id); +repost_nosrq: + ret = ipoib_cm_post_receive(dev, wr_id << 32 | index); + + if (unlikely(ret)) + ipoib_warn(priv, "ipoib_cm_post_receive failed for buf %ld\n", + wr_id); +} + +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv 
*priv = netdev_priv(dev); + + if (priv->cm.srq) + handle_rx_wc_srq(dev, wc); + else + handle_rx_wc_nosrq(dev, wc); } static inline int post_send(struct ipoib_dev_priv *priv, @@ -609,6 +887,22 @@ int ipoib_cm_dev_open(struct net_device return 0; } +static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + int i; + + for(i = 0; i < ipoib_recvq_size; ++i) + if(p->rx_ring[i].skb) { + ipoib_cm_dma_unmap_rx(priv, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping); + dev_kfree_skb_any(p->rx_ring[i].skb); + p->rx_ring[i].skb = NULL; + } + kfree(p->rx_ring); +} + + void ipoib_cm_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -622,6 +916,8 @@ void ipoib_cm_dev_stop(struct net_device spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); + if (!priv->cm.srq) + free_resources_nosrq(priv, p); list_del_init(&p->list); spin_unlock_irq(&priv->lock); ib_destroy_cm_id(p->id); @@ -709,7 +1005,9 @@ static struct ib_qp *ipoib_cm_create_tx_ attr.recv_cq = priv->cq; attr.srq = priv->cm.srq; attr.cap.max_send_wr = ipoib_sendq_size; + attr.cap.max_recv_wr = 1; attr.cap.max_send_sge = 1; + attr.cap.max_recv_sge = 1; attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -749,7 +1047,7 @@ static int ipoib_cm_send_req(struct net_ req.retry_count = 0; /* RFC draft warns against retries */ req.rnr_retry_count = 0; /* RFC draft warns against retries */ req.max_cm_retries = 15; - req.srq = 1; + req.srq = !!priv->cm.srq; return ib_send_cm_req(id, &req); } @@ -1085,6 +1383,7 @@ static void ipoib_cm_stale_task(struct w struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, cm.stale_task.work); struct ipoib_cm_rx *p; + struct ib_qp_attr qp_attr; spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { @@ -1093,6 +1392,12 @@ static void ipoib_cm_stale_task(struct w p = 
list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; + if (!priv->cm.srq) { + free_resources_nosrq(priv, p); + priv->cm.rx_index_table[p->index] = NULL; + qp_attr.qp_state = IB_QPS_ERR; + ib_modify_qp(p->qp, &qp_attr, IB_QP_STATE); + } list_del_init(&p->list); spin_unlock_irq(&priv->lock); ib_destroy_cm_id(p->id); @@ -1147,16 +1452,40 @@ int ipoib_cm_add_mode_attr(struct net_de return device_create_file(&dev->dev, &dev_attr_mode); } +static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv) +{ + struct ib_srq_init_attr srq_init_attr; + int ret; + + srq_init_attr.attr.max_wr = ipoib_recvq_size; + srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG; + + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); + if (IS_ERR(priv->cm.srq)) { + ret = PTR_ERR(priv->cm.srq); + priv->cm.srq = NULL; + return ret; + } + + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * + sizeof *priv->cm.srq_ring, + GFP_KERNEL); + if (!priv->cm.srq_ring) { + printk(KERN_WARNING "%s: failed to allocate CM ring " + "(%d entries)\n", + priv->ca->name, ipoib_recvq_size); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + + return 0; +} + int ipoib_cm_dev_init(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_srq_init_attr srq_init_attr = { - .attr = { - .max_wr = ipoib_recvq_size, - .max_sge = IPOIB_CM_RX_SG - } - }; int ret, i; + struct ib_device_attr attr; INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); @@ -1168,20 +1497,30 @@ int ipoib_cm_dev_init(struct net_device skb_queue_head_init(&priv->cm.skb_queue); - priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); - if (IS_ERR(priv->cm.srq)) { - ret = PTR_ERR(priv->cm.srq); - priv->cm.srq = NULL; + if (ret = ib_query_device(priv->ca, &attr)) return ret; - } - priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, - GFP_KERNEL); - if (!priv->cm.srq_ring) { - printk(KERN_WARNING "%s: 
failed to allocate CM ring (%d entries)\n", - priv->ca->name, ipoib_recvq_size); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; + if (attr.max_srq) { + /* This device supports SRQ */ + if (ret = create_srq(dev, priv)) + return ret; + priv->cm.rx_index_table = NULL; + } else { + priv->cm.srq = NULL; + priv->cm.srq_ring = NULL; + + /* Every new REQ that arrives creates a struct ipoib_cm_rx. + * These structures form a link list starting with the + * passive_ids. For quick and easy access we maintain a table + * of pointers to struct ipoib_cm_rx called the rx_index_table + */ + priv->cm.rx_index_table = kzalloc(NOSRQ_INDEX_TABLE_SIZE * + sizeof *priv->cm.rx_index_table, + GFP_KERNEL); + if (!priv->cm.rx_index_table) { + printk(KERN_WARNING "Failed to allocate NOSRQ_INDEX_TABLE\n"); + return -ENOMEM; + } } for (i = 0; i < IPOIB_CM_RX_SG; ++i) @@ -1194,17 +1533,23 @@ int ipoib_cm_dev_init(struct net_device priv->cm.rx_wr.sg_list = priv->cm.rx_sge; priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; - for (i = 0; i < ipoib_recvq_size; ++i) { - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, + /* One can post receive buffers even before the RX QP is created + * only in the SRQ case. 
Therefore for NOSRQ we skip the rest of init + * and do that in ipoib_cm_req_handler() */ + + if (priv->cm.srq) { + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[i].mapping)) { - ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; - } - if (ipoib_cm_post_receive(dev, i)) { - ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -EIO; + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + if (ipoib_cm_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } } } --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-07 22:31:33.000000000 -0700 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-07 17:29:52.000000000 -0700 @@ -299,7 +299,7 @@ int ipoib_poll(struct net_device *dev, i for (i = 0; i < n; ++i) { struct ib_wc *wc = priv->ibwc + i; - if (wc->wr_id & IPOIB_CM_OP_SRQ) { + if (wc->wr_id & IPOIB_CM_OP_RECV) { ++done; --max; ipoib_cm_handle_rx_wc(dev, wc); @@ -607,7 +607,7 @@ int ipoib_ib_dev_stop(struct net_device do { n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc); for (i = 0; i < n; ++i) { - if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ) + if (priv->ibwc[i].wr_id & IPOIB_CM_OP_RECV) ipoib_cm_handle_rx_wc(dev, priv->ibwc + i); else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-07 16:05:32.000000000 -0700 +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-07 17:13:28.000000000 -0700 @@ -187,6 +187,15 @@ int ipoib_transport_dev_init(struct net_ if (!ret) size += ipoib_recvq_size; + /* We increase the size of the CQ in the NOSRQ case to prevent CQ + * overflow. Every new REQ creates a new RX QP and each QP has an + * RX ring associated with it. 
Therefore we could have + * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs + * in a CQ. + */ + if(!priv->cm.srq) + size += (NOSRQ_INDEX_TABLE_SIZE -1)* ipoib_recvq_size; + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); From Don.Kerr at Sun.COM Wed May 9 08:42:08 2007 From: Don.Kerr at Sun.COM (Donald Kerr) Date: Wed, 09 May 2007 11:42:08 -0400 Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> Message-ID: <4641EBD0.3000600@Sun.COM> I agree OMPI trac ticket #890 should cover this. I will test the suggested fix, just removing that one line from btl_udapl.c, on Solaris. I am still not set up on Linux so hopefully Steve can confirm there. -DON Jeff Squyres wrote: >FWIW, I would marginally prefer if this bug is tracked in the Open >MPI trac ticket system, not the OFA bugzilla (Steve W. will have >write access there as soon as Chelsio submits their OMPI 3rd party >contribution agreement). We've traditionally [mostly] tracked OMPI >bugs in the OMPI bug system and OFED-specific OMPI packaging problems >in the OFA bugzilla. It's a gray area, I admit. > >But since I'm not the uDAPL maintainer in Open MPI, moving the bug >over there will allow the Right people to see it (some OMPI >developers are cross subscribed to the OFA general list, but not >all). For example, this udapl problem is likely related to the >existing OMPI trac ticket 890 (https://svn.open-mpi.org/trac/ompi/ >ticket/890). > > >On May 9, 2007, at 10:37 AM, Steve Wise wrote: > > > >>Although as Boris pointed out, perhaps the hack in OMPI is no longer >>needed at all... 
>> >> >>On Wed, 2007-05-09 at 08:41 -0500, Steve Wise wrote: >> >> >>>606 opened to track the udapl change. >>> >>>607 opened to track the ompi change to remove the port number >>>stashing >>>hack. >>> >>>Status: I have a patch from Arlin to test today. I will test with >>>that >>>patch and with the OMPI port hack removed. Stay tuned... >>> >>> >>> >>>Steve. >>> >>>On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote: >>> >>> >>>>Steve Wise wrote: >>>> >>>> >>>> >>>>>I would like the group to consider including changes needed to OMPI >>>>>and/or ofa udapl to get OMPI working again on udapl for ofed-1.2. >>>>> >>>>>This will provide OMPI support over iwarp devices via udapl >>>>>until we can >>>>>get rdma-cm support added to OMPI. >>>>> >>>>> >>>>>Steve. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>Steve, can you open a bug to track this? >>>> >>>> >>>_______________________________________________ >>>devel mailing list >>>devel at open-mpi.org >>>http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> >>> > > > > From yosefe at voltaire.com Wed May 9 08:47:48 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 09 May 2007 18:47:48 +0300 Subject: [ofa-general] Re: [PATCHv4 for-2.6.22 0 of 2] pkey change handling - fix bug #420 In-Reply-To: <4641BBC5.7040106@voltaire.com> References: <4641BBC5.7040106@voltaire.com> Message-ID: <4641ED24.6030303@voltaire.com> Yosef Etigin wrote: > These two patches fix bug #577: PKey table reordering caused by SM failover stops ipoib traffic This should have been bug #420 > patch 1: add uncached device queries to core > patch 2: restart ipoib qp on pkey change event, and use uncached queries on qp init > > -- > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jimmy at
hillraiser.com (Jimmy Hill) Date: Wed, 09 May 2007 15:59:39 +0000 Subject: [ofa-general] verbs abi_compat Message-ID: <20070509155939.17788.qmail@station183.com> Under what conditions is the field abi_compat of struct ibv_context set to non-zero? I'm encountering a situation where it is set when coding to verbs on a clean OFED 1.2 install. Seems odd that it would be set since I suspected that it would only occur for verbs 1.0/1.1 compatibility. thanks! From ossrosch at linux.vnet.ibm.com Wed May 9 09:24:53 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Wed, 9 May 2007 18:24:53 +0200 Subject: [ofa-general] Build problem with RHEL-4.5 and OFED-1.2 Message-ID: <200705091824.54394.ossrosch@linux.vnet.ibm.com> Hi Doug, I installed RHEL-4.5 on one of our ppc64 systems and noticed that the asm-ppc directory is missing in /usr/src/kernels/2.6.9-55.EL/include. Normally I don't need this directory, but ibmebus.h includes asm-ppc64/of_device.h, which in turn includes asm-ppc/of_device.h. Because this file is missing I cannot build ehca and the ofed stack with today's ofed-1.2 daily build. Did I do something wrong during installation? Regards Stefan Roscher From cap at nsc.liu.se Wed May 9 09:28:54 2007 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Wed, 9 May 2007 18:28:54 +0200 Subject: [ofa-general] ofa_1_2_kernel 20070508-0200 daily build status In-Reply-To: <20070509124521.GI10068@mellanox.co.il> References: <20070508093812.9A193E603C1@openfabrics.org> <200705091440.01872.cap@nsc.liu.se> <20070509124521.GI10068@mellanox.co.il> Message-ID: <200705091828.54260.cap@nsc.liu.se> On Wednesday 09 May 2007, Michael S. Tsirkin wrote: > > Quoting Peter Kjellstrom : > > Subject: Re: [ofa-general] ofa_1_2_kernel 20070508-0200 daily build > > status > > > > Not related to the failed 2.6.21.1 below, but, are there any plans to add > > the EL5 and EL4u5 kernels to the list? (2.6.18-8.el5 and > > 2.6.9-55.{EL,ELsmp}).
> > We do test on them locally, haven't the time to prepare these for > cross-build yet. Can you do this? Unfortunately I lack both time and resources to maintain an automated build verification system ;-( /Peter > > Also, out of curiosity, what is the 2.6.16.43-0.3 (below)? > > SLES10 I think. -------------- next part -------------- A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From sean.hefty at intel.com Wed May 9 10:16:00 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 9 May 2007 10:16:00 -0700 Subject: [ofa-general] [GIT PULL] OFED 1.2 librdmacm Message-ID: <000001c7925d$b752cfe0$e598070a@amr.corp.intel.com> Please pull in the latest librdmacm ofed_1_2 tree. This will add a fix for rping and man pages. Signed-off-by: Sean Hefty From mst at dev.mellanox.co.il Wed May 9 10:41:38 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 May 2007 20:41:38 +0300 Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events In-Reply-To: <4641B63D.4010602@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> Message-ID: <20070509174138.GB17734@mellanox.co.il> > @@ -642,6 +651,11 @@ void ipoib_ib_dev_flush(struct work_stru > > ipoib_ib_dev_down(dev, 0); > > + if (restart_qp) { > + ipoib_ib_dev_stop(dev, 0); > + ipoib_ib_dev_open(dev); > + } > + > /* > * The device could have been brought down between the start and when > * we get here, don't bring it back up if it's not configured up By the way, I think I see a small issue now - if there's a pkey change event, this will flush all interfaces, even if the pkey changed is not used by ipoib at all. How about: - rename restart_qp flag to pkey_change_event - do something like this at the beginning of the flush routine if (pkey_change_event && query_pkey(current index) == current_pkey)) return; Need to think what to do if index is not valid, but you get the idea. This will remove all the extra flushes in the common case where pkeys are not moved around too much. 
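The check suggested above can be sketched as follows. This is a minimal, self-contained illustration rather than kernel code: struct pkey_table and flush_needed() are hypothetical names standing in for the port's P_Key table and the start of the flush routine. Treating an out-of-range index as "flush anyway" is an assumption, since the message leaves that case open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the port's P_Key table; not a kernel structure. */
struct pkey_table {
	const uint16_t *entries;
	size_t len;
};

/*
 * Decide whether the interface needs a flush.
 * For a P_Key change event, skip the flush when the entry at the
 * interface's current index still holds the P_Key it is using.
 */
static bool flush_needed(const struct pkey_table *tbl, size_t cur_index,
			 uint16_t cur_pkey, bool pkey_change_event)
{
	if (!pkey_change_event)
		return true;	/* other events keep flushing unconditionally */
	if (cur_index >= tbl->len)
		return true;	/* assumed policy: invalid index forces a flush */
	return tbl->entries[cur_index] != cur_pkey;
}
```

With this shape, an interface whose P_Key did not move is left alone, which matches the stated goal of avoiding extra flushes when the table is mostly stable.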
-- MST From mst at dev.mellanox.co.il Wed May 9 10:46:25 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 May 2007 20:46:25 +0300 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <4641E99B.10706@linux.vnet.ibm.com> References: <4641E99B.10706@linux.vnet.ibm.com> Message-ID: <20070509174625.GC17734@mellanox.co.il> > +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + > + if (priv->cm.srq) > + handle_rx_wc_srq(dev, wc); > + else > + handle_rx_wc_nosrq(dev, wc); > } I still think this conditional branch on datapath should be avoided by using separate RX handlers for SRQ/non SRQ cases. And same for the one on alloc_rx_skb. -- MST From mst at dev.mellanox.co.il Wed May 9 10:47:51 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 May 2007 20:47:51 +0300 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <4641E99B.10706@linux.vnet.ibm.com> References: <4641E99B.10706@linux.vnet.ibm.com> Message-ID: <20070509174751.GD17734@mellanox.co.il> > @@ -159,11 +214,14 @@ static struct ib_qp *ipoib_cm_create_rx_ > .recv_cq = priv->cq, > .srq = priv->cm.srq, > .cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */ > + .cap.max_recv_wr = ipoib_recvq_size + 1, > .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ > .sq_sig_type = IB_SIGNAL_ALL_WR, > .qp_type = IB_QPT_RC, > .qp_context = p, > }; Why aren't you using UC QPs here? With retry count 0, what is the benefit of RC? 
-- MST From pradeeps at linux.vnet.ibm.com Wed May 9 10:50:51 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 09 May 2007 10:50:51 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <20070509174625.GC17734@mellanox.co.il> References: <4641E99B.10706@linux.vnet.ibm.com> <20070509174625.GC17734@mellanox.co.il> Message-ID: <464209FB.8060006@linux.vnet.ibm.com> Michael S. Tsirkin wrote: >> +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) >> +{ >> + struct ipoib_dev_priv *priv = netdev_priv(dev); >> + >> + if (priv->cm.srq) >> + handle_rx_wc_srq(dev, wc); >> + else >> + handle_rx_wc_nosrq(dev, wc); >> } > > I still think this conditional branch on datapath should be avoided > by using separate RX handlers for SRQ/non SRQ cases. > And same for the one on alloc_rx_skb. > I attempted implementing this. With NAPI now included, the code looked really ugly, so I decided not to do so. Pradeep From mst at dev.mellanox.co.il Wed May 9 10:55:16 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 May 2007 20:55:16 +0300 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <464209FB.8060006@linux.vnet.ibm.com> References: <4641E99B.10706@linux.vnet.ibm.com> <20070509174625.GC17734@mellanox.co.il> <464209FB.8060006@linux.vnet.ibm.com> Message-ID: <20070509175516.GE17734@mellanox.co.il> > Quoting Pradeep Satyanarayana : > Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review > > Michael S. Tsirkin wrote: > >>+void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) > >>+{ > >>+ struct ipoib_dev_priv *priv = netdev_priv(dev); > >>+ > >>+ if (priv->cm.srq) > >>+ handle_rx_wc_srq(dev, wc); > >>+ else > >>+ handle_rx_wc_nosrq(dev, wc); > >> } > > > >I still think this conditional branch on datapath should be avoided > >by using separate RX handlers for SRQ/non SRQ cases. > >And same for the one on alloc_rx_skb.
> > > > I attempted implementing this. With NAPI now included, > the code looked real ugly and so decided not do so. Why? The only difference with NAPI is that instead of a separate completion handler, you should have a separate poll routine. -- MST From pradeeps at linux.vnet.ibm.com Wed May 9 10:56:51 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 09 May 2007 10:56:51 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <20070509174751.GD17734@mellanox.co.il> References: <4641E99B.10706@linux.vnet.ibm.com> <20070509174751.GD17734@mellanox.co.il> Message-ID: <46420B63.8080508@linux.vnet.ibm.com> Michael S. Tsirkin wrote: >> @@ -159,11 +214,14 @@ static struct ib_qp *ipoib_cm_create_rx_ >> .recv_cq = priv->cq, >> .srq = priv->cm.srq, >> .cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */ >> + .cap.max_recv_wr = ipoib_recvq_size + 1, >> .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ >> .sq_sig_type = IB_SIGNAL_ALL_WR, >> .qp_type = IB_QPT_RC, >> .qp_context = p, >> }; > > Why aren't you using UC QPs here? With retry count 0, what is > the benefit of RC? > The issue with switching only NOSRQ to UC is interoperability between HCAs. Switching IPOIB CM to UC mode would be good, but let us do all of it at one go. Pradeep From kschoche at scl.ameslab.gov Wed May 9 11:08:17 2007 From: kschoche at scl.ameslab.gov (Kyle Schochenmaier) Date: Wed, 09 May 2007 13:08:17 -0500 Subject: [ofa-general] ehca_mrmw patch In-Reply-To: <200705091446.23783.fenkes@de.ibm.com> References: <200705091446.23783.fenkes@de.ibm.com> Message-ID: <46420E11.7080902@scl.ameslab.gov> With the memory registration restrictions of the eHCA coupled with our applications which require large memory registrations, we've found that we can quickly trigger a case where ibv_reg_mr() will return -EINVAL, when it should be returning -ENOMEM. 
If we were able to differentiate this type of error from the default -EINVAL, we would be able to handle this in userspace by flushing cached entries and retrying the memory registration. I've attached a patch to start the process, if there are other paths back to userspace that can return ENOMEM on a resource limitation we should also add that. thanks, Kyle -- Kyle Schochenmaier kschoche at scl.ameslab.gov Research Assistant, Dr. Brett Bode AmesLab - US Dept.Energy Scalable Computing Laboratory -------------- next part -------------- A non-text attachment was scrubbed...
Name: ehca_mrmw.patch Type: text/x-patch Size: 920 bytes Desc: not available URL: From sean.hefty at intel.com Wed May 9 11:39:59 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 9 May 2007 11:39:59 -0700 Subject: [ofa-general] [GIT PULL] 2.6.22: please pull rdma-dev.git Message-ID: <000101c79269$72c67aa0$e598070a@amr.corp.intel.com> Roland, please pull from: git://git.openfabrics.org/~shefty/rdma-dev.git for-roland This will cleanup device removal synchronization in the rdma_cm. The changes are based on 2.6.21. Sean Hefty (3): rdma/cm: simplify device removal handling code rdma/cm: Fix synchronization with device removal in cma_iw_handler rdma/cm: Add check to validate that cm_id is bound to a device.
drivers/infiniband/core/cma.c | 106 +++++++++++++++++++++++------------------ 1 files changed, 59 insertions(+), 47 deletions(-) From dledford at redhat.com Wed May 9 12:05:35 2007 From: dledford at redhat.com (Doug Ledford) Date: Wed, 09 May 2007 15:05:35 -0400 Subject: [ofa-general] Re: Build problem with RHEL-4.5 and OFED-1.2 In-Reply-To: <200705091824.54394.ossrosch@linux.vnet.ibm.com> References: <200705091824.54394.ossrosch@linux.vnet.ibm.com> Message-ID: <1178737535.2848.152.camel@fc6.xsintricity.com> On Wed, 2007-05-09 at 18:24 +0200, Stefan Roscher wrote: > Hi Doug, > > I installed RHEL-4.5 on one of our ppc64 systems and recognized that asm-ppc > directory is missing in /usr/src/kernels/2.6.9-55.EL/include. > Normally I don't need this directory, but ibmebus.h includes > asm-ppc64/of_device.h. And there asm-ppc64/of_device.h includes > asm-ppc/of_device.h. Because this file is missing I can not build > ehca and ofed stack with ofed-1.2 daily build from today. > > Did I make something wrong during installation? > > Regards Stefan Roscher I'll look into it, but in the meantime, install the kernel src.rpm, go into /usr/src/redhat/SPECS and run rpmbuild -bp kernel-2.6.spec and it should create a complete source tree in /usr/src/redhat/BUILD/kernel-2.6.9 that you can then get the asm-ppc directory contents out of. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From ardavis at ichips.intel.com Wed May 9 12:45:52 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 09 May 2007 12:45:52 -0700 Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL Message-ID: <464224F0.6020408@ichips.intel.com> Vlad, please pull latest from uDAPL project (ofed_1_2 branch) Signed-off by: Arlin Davis ardavis at ichips.intel.com Bug Fixes: - 606: Return local and remote ports with dat_ep_query - 585: Add bonding example to dat.conf From swise at opengridcomputing.com Wed May 9 12:54:58 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 09 May 2007 14:54:58 -0500 Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: <4641EBD0.3000600@Sun.COM> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> Message-ID: <1178740498.382.97.camel@stevo-desktop> On Wed, 2007-05-09 at 11:42 -0400, Donald Kerr wrote: > I agree OMPI trac ticket #890 should cover this. I will test the > suggested fix, just removing that one line from btl_udapl.c, on Solaris. > I am still not set up on Linux so hopefully Steve can confirm there. > All, First, I haven't tested Arlin's dat_ep_query() fix yet as we have determined it's not needed. The OMPI udapl btl never calls dat_ep_query()... So running OMPI with the suggested fix (removing the overwriting of the hca_addr port field in btl_udapl.c) over ofed udapl on Chelsio's iwarp rnic still doesn't work. There are two new issues so far: 1) This has uncovered a connection migration issue in the Chelsio driver/firmware. We are developing and testing a fix for this now. Should be ready tomorrow hopefully.
2) OMPI is not adhering to the iwarp protocol requirement that the ULP, in this case OMPI, initiating the iwarp connection (the side issuing the dat_ep_connect() or rdma_connect()) _MUST_ be the first to send an RDMA message. So if an OMPI process _accepts_ an rdma connection, then it cannot send on that connection until it receives some sort of rdma operation from the client process. It appears the current OMPI connection setup model doesn't enforce this. This, combined with the bug above, causes an immediate connection failure on Chelsio's rnic. After I fix #1 above, things might get slightly better but my guess is we will still have connection setup problems if the server side sends before the client side finishes the streaming->rdma mode transition. There have been a series of discussions on the ofa general list about this issue, and the conclusion to date is that it cannot be resolved in the rdma-cm or iwarp-cm code of the linux rdma stack, mainly because sending an RDMA message involves the ULP's work queue and completion queue, so the CM cannot do this under the covers in a manner that doesn't affect the application. Thus, the applications must deal with this. Here is a possible solution: I assume in OMPI that connections are only initiated when the mpi application does a send operation. Given that, the udapl btl must ensure that if a given rank accepts a connection, it does not send anything until the rank at the other end of the connection sends first. Since the other side initiated the connection, it will have pending data to send... I haven't looked into how painful this will be to implement. Thoughts? FYI: IETF Draft requiring this behavior: http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-08.txt See section 7 for specifics. Steve.
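For illustration, the rule proposed in the message above (the accepting side holds its sends until the initiator has sent first) could be sketched roughly as below. This is a standalone sketch under assumptions, not Open MPI or uDAPL code: struct conn, conn_init(), conn_send() and conn_on_recv() are hypothetical names, and held-back sends are only counted here rather than queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which side opened the connection (hypothetical names). */
enum conn_role {
    CONN_ACTIVE,   /* issued the connect request */
    CONN_PASSIVE   /* accepted the connection */
};

struct conn {
    enum conn_role role;
    bool peer_has_sent;  /* set once anything arrives from the peer */
    size_t deferred;     /* sends held back on the passive side */
};

static void conn_init(struct conn *c, enum conn_role role)
{
    assert(c != NULL);
    c->role = role;
    c->peer_has_sent = false;
    c->deferred = 0;
}

/* The initiator may always send; the acceptor only after the peer has. */
static bool conn_may_send(const struct conn *c)
{
    return c->role == CONN_ACTIVE || c->peer_has_sent;
}

/* Returns true if the send may be posted now, false if it was deferred. */
static bool conn_send(struct conn *c)
{
    if (conn_may_send(c))
        return true;
    c->deferred++;
    return false;
}

/*
 * Call when a message from the peer completes. Returns how many
 * deferred sends are released and can now be posted.
 */
static size_t conn_on_recv(struct conn *c)
{
    size_t released = c->deferred;

    c->peer_has_sent = true;
    c->deferred = 0;
    return released;
}
```

The sketch relies on the same assumption as the proposal: the initiator connected because it has data pending, so the acceptor's deferred sends should not stay parked for long.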
From afriedle at open-mpi.org Wed May 9 16:15:51 2007 From: afriedle at open-mpi.org (Andrew Friedley) Date: Wed, 09 May 2007 16:15:51 -0700 Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: <1178740498.382.97.camel@stevo-desktop> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> Message-ID: <46425627.8000903@open-mpi.org> Steve Wise wrote: > There have been a series of discussions on the ofa general list about > this issue, and the conclusion to date is that it cannot be resolved in > the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly because > sending an RDMA message involves the ULP's work queue and completion > queue, so the CM cannot do this under the covers in a mannor that > doesn't affect the application. Thus, the applications must deal with > this. Why can't uDAPL deal with this? As a uDAPL user, I really don't care what API uDAPL is using under the hood to move data from one place to another, nor the quirks of that API. The whole point of uDAPL is to form a network-agnostic abstraction layer. AFAIK, the uDAPL spec doesn't enforce any such requirement on RDMA communication either. In my opinion, exposing such behavior above uDAPL is incorrect and is part of why uDAPL has seen limited adoption -- every single uDAPL implementation behaves in different ways, making it extremely difficult to write an application to work on any uDAPL implementation. Sorry if this sounds harsh, but this comes from many hours of banging my head on the wall due to working around these sorts of problems :) > > Here is a possible solution: > > I assume in OMPI that connections are only initiated when the mpi > application does a send operation. 
Given that, then udapl btl must > ensure that if a given rank accepts a connection, it cannot not send > anything until the rank at the other end of the connection sends first. > Since the other side initiated the connection, it will have pending data > to send... > > I haven't looked into how painful this will be to implement. > > Thoughts? Following on what I wrote above, I think Open MPI is the wrong place to be dealing with this. There are enough of these hacks as it is; I'm not interested in seeing more get added. Andrew From Don.Kerr at Sun.COM Wed May 9 13:20:23 2007 From: Don.Kerr at Sun.COM (Donald Kerr) Date: Wed, 09 May 2007 16:20:23 -0400 Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: <1178740498.382.97.camel@stevo-desktop> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> Message-ID: <46422D07.3050600@Sun.COM> I am missing some context here. Where are you plugging iwarp and OMPI together? Steve Wise wrote: >On Wed, 2007-05-09 at 11:42 -0400, Donald Kerr wrote: > > >>I agree OMPI trac ticket #890 should cover this. I will test the >>suggested fix, just removing that one line from btl_udapl.c, on Solaris. >>I am still not set up on Linux so hopefully Steve can confirm there. >> >> >> > >All, > >First, I haven't tested Arlins dat_ep_query() fix yet as we have >determined its not needed. The OMPI udapl btl never calls >dat_ep_query()... > >So running OMPI with the suggested fix (removing the overwriting of the >hca_addr port field in btl_udapl.c) over ofed udapl on chelsio's iwarp >rnic still doesn't work. > >There are two new issues so far: > >1) this has uncovered a connection migration issue in the Chelsio >driver/firmware. We are developing and testing a fix for this now.
>Should be ready tomorrow hopefully. > >2) OMPI is not adhering to the iwarp protocol requirement that the ULP, >in this case OMPI, initiating the iwarp connection (the side issuing the >dat_ep_connect() or rdma_connect()) _MUST_ be the first to send an RDMA >message. So if a OMPI process _accepts_ an rdma connection, then it >cannot send on that connection until it receives some sort of rdma >operation from the client process. It appears the current OMPI >connection setup model doesn't enforce this. > >This combined with the bug above causes an immediate connection failure >on chelsio's rnic. After I fix #1 above, things might get slightly >better but my guess is we will still have connection setup problems if >the server side sends before the client side finishes streaming->rdma >mode transition. > >There have been a series of discussions on the ofa general list about >this issue, and the conclusion to date is that it cannot be resolved in >the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly because >sending an RDMA message involves the ULP's work queue and completion >queue, so the CM cannot do this under the covers in a mannor that >doesn't affect the application. Thus, the applications must deal with >this. > > >Here is a possible solution: > >I assume in OMPI that connections are only initiated when the mpi >application does a send operation. Given that, then udapl btl must >ensure that if a given rank accepts a connection, it cannot not send >anything until the rank at the other end of the connection sends first. >Since the other side initiated the connection, it will have pending data >to send... > >I haven't looked into how painful this will be to implement. > >Thoughts? > > >FYI: > >IETF Draft requiring this behavior: > >http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-08.txt > >See section 7 for specifics. > >Steve. 
> > >_______________________________________________ >devel mailing list >devel at open-mpi.org >http://www.open-mpi.org/mailman/listinfo.cgi/devel > > From sashak at voltaire.com Wed May 9 13:30:22 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 9 May 2007 23:30:22 +0300 Subject: [ofa-general] [PATCH 0/3] opensm: osm_port_t structure simplification. Message-ID: <11787426251341-git-send-email-sashak@voltaire.com> Hi Hal, This simplifies osm_port_t structure and related API functions - the main idea is to not use duplicated (from osm_node_t) physical port pointers table, but only one direct pointer to appropriated physical port (osm_physp_t). Sasha From sashak at voltaire.com Wed May 9 13:30:24 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 9 May 2007 23:30:24 +0300 Subject: [ofa-general] [PATCH 2/3] opensm: eliminate node's physical ports table duplication in osm_port_t In-Reply-To: <11787426251341-git-send-email-sashak@voltaire.com> References: <11787426251341-git-send-email-sashak@voltaire.com> Message-ID: <1178742625769-git-send-email-sashak@voltaire.com> Eliminate duplication of osm_node's physical ports table in osm_port_t object. 
Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_port.h | 37 +++++++---------------- osm/opensm/osm_pkey_rcv.c | 2 +- osm/opensm/osm_port.c | 60 +++++++++----------------------------- osm/opensm/osm_sa_link_record.c | 4 +- osm/opensm/osm_sa_pkey_record.c | 2 +- osm/opensm/osm_sa_slvl_record.c | 4 +- osm/opensm/osm_sa_vlarb_record.c | 2 +- osm/opensm/osm_slvl_map_rcv.c | 2 +- osm/opensm/osm_sm_state_mgr.c | 2 +- osm/opensm/osm_subnet.c | 4 +- osm/opensm/osm_vl_arb_rcv.c | 2 +- 11 files changed, 37 insertions(+), 84 deletions(-) diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h index 134012c..19a8502 100644 --- a/osm/include/opensm/osm_port.h +++ b/osm/include/opensm/osm_port.h @@ -1274,10 +1274,8 @@ typedef struct _osm_port struct _osm_node *p_node; ib_net64_t guid; uint32_t discovery_count; - uint8_t default_port_num; - uint8_t physp_tbl_size; + osm_physp_t *p_physp; cl_qlist_t mcm_list; - osm_physp_t *tbl[1]; } osm_port_t; /* * FIELDS @@ -1295,20 +1293,13 @@ typedef struct _osm_port * during the current fabric sweep. This number is reset * to zero at the start of a sweep. * -* default_port_num -* Index of the physical port used when physical characteristics -* contained in the Physical Port are needed. -* -* physp_tbl_size -* Number of physical ports associated with this logical port. +* p_physp +* The pointer to physical port used when physical +* characteristics contained in the Physical Port are needed. * * mcm_list * Multicast member list * -* tbl -* Array of pointers to Physical Port objects contained by this node. -* MUST BE LAST ELEMENT SINCE IT CAN GROW !!! 
-* * SEE ALSO * Port, Physical Port, Physical Port Table *********/ @@ -1386,10 +1377,8 @@ static inline ib_net16_t osm_port_get_base_lid( IN const osm_port_t* const p_port ) { - const osm_physp_t* const p_physp = p_port->tbl[p_port->default_port_num]; - CL_ASSERT( p_physp ); - CL_ASSERT( osm_physp_is_valid( p_physp ) ); - return( osm_physp_get_base_lid( p_physp )); + CL_ASSERT( p_port->p_physp && osm_physp_is_valid( p_port->p_physp ) ); + return( osm_physp_get_base_lid( p_port->p_physp )); } /* * PARAMETERS @@ -1419,10 +1408,8 @@ static inline uint8_t osm_port_get_lmc( IN const osm_port_t* const p_port ) { - const osm_physp_t* const p_physp = p_port->tbl[p_port->default_port_num]; - CL_ASSERT( p_physp ); - CL_ASSERT( osm_physp_is_valid( p_physp ) ); - return( osm_physp_get_lmc( p_physp )); + CL_ASSERT( p_port->p_physp && osm_physp_is_valid( p_port->p_physp ) ); + return( osm_physp_get_lmc( p_port->p_physp )); } /* * PARAMETERS @@ -1481,8 +1468,7 @@ osm_port_get_phys_ptr( IN const osm_port_t* const p_port, IN const uint8_t port_num ) { - CL_ASSERT( port_num < p_port->physp_tbl_size ); - return( p_port->tbl[port_num] ); + return p_port->p_physp; } /* * PARAMETERS @@ -1519,9 +1505,8 @@ osm_physp_t* osm_port_get_default_phys_ptr( IN const osm_port_t* const p_port ) { - CL_ASSERT( p_port->tbl[p_port->default_port_num] ); - CL_ASSERT( osm_physp_is_valid( p_port->tbl[p_port->default_port_num] ) ); - return( p_port->tbl[p_port->default_port_num] ); + CL_ASSERT( osm_physp_is_valid( p_port->p_physp ) ); + return p_port->p_physp; } /* * PARAMETERS diff --git a/osm/opensm/osm_pkey_rcv.c b/osm/opensm/osm_pkey_rcv.c index 76af9fc..0e0ec46 100644 --- a/osm/opensm/osm_pkey_rcv.c +++ b/osm/opensm/osm_pkey_rcv.c @@ -172,7 +172,7 @@ osm_pkey_rcv_process( else { p_physp = osm_port_get_default_phys_ptr(p_port); - port_num = p_port->default_port_num; + port_num = p_physp->port_num; } CL_ASSERT( p_physp ); diff --git a/osm/opensm/osm_port.c b/osm/opensm/osm_port.c index 053fc22..b0949a0 
100644 --- a/osm/opensm/osm_port.c +++ b/osm/opensm/osm_port.c @@ -174,7 +174,6 @@ osm_port_init( uint32_t port_index; ib_net64_t port_guid; osm_physp_t *p_physp; - uint32_t size; CL_ASSERT( p_port ); CL_ASSERT( p_ni ); @@ -187,36 +186,24 @@ osm_port_init( p_port->guid = port_guid; /* - See comment in port_new for info about this... - */ - size = p_ni->num_ports; - - p_port->physp_tbl_size = (uint8_t)(size + 1); - - /* Get the pointers to the physical node objects "owned" by this logical port GUID. For switches, all the ports are owned; for HCA's and routers, only the singular part that has this GUID is owned. */ - p_port->default_port_num = 0xFF; - for( port_index = 0; port_index < p_port->physp_tbl_size; port_index++ ) + for( port_index = 0; port_index < p_parent_node->physp_tbl_size; port_index++ ) { p_physp = osm_node_get_physp_ptr( p_parent_node, port_index ); + /* + Because much of the PortInfo data is only valid + for port 0 on switches, try to keep the lowest + possible value of default_port_num. + */ if( osm_physp_is_valid( p_physp ) && - port_guid == osm_physp_get_port_guid( p_physp ) ) - { - p_port->tbl[port_index] = p_physp; - /* - Because much of the PortInfo data is only valid - for port 0 on switches, try to keep the lowest - possible value of default_port_num. - */ - if( port_index < p_port->default_port_num ) - p_port->default_port_num = (uint8_t)port_index; + port_guid == osm_physp_get_port_guid( p_physp ) ) { + p_port->p_physp = p_physp; + break; } - else - p_port->tbl[port_index] = NULL; } CL_ASSERT( p_port->default_port_num < 0xFF ); @@ -230,21 +217,11 @@ osm_port_new( IN const osm_node_t* const p_parent_node ) { osm_port_t* p_port; - uint32_t size; - - /* - The port object already contains one physical port object pointer. - Therefore, subtract 1 from the number of physical ports - used by the switch. 
This is not done for CA's since they - need to occupy 1 more physp pointer than they physically have since - we still reserve room for a "port 0". - */ - size = p_ni->num_ports; - p_port = malloc( sizeof(*p_port) + sizeof(void *) * size ); + p_port = malloc( sizeof(*p_port) ); if( p_port != NULL ) { - memset( p_port, 0, sizeof(*p_port) + sizeof(void *) * size ); + memset( p_port, 0, sizeof(*p_port) ); osm_port_init( p_port, p_ni, p_parent_node ); } @@ -326,7 +303,6 @@ osm_port_add_new_physp( p_physp = osm_node_get_physp_ptr( p_node, port_num ); CL_ASSERT( osm_physp_is_valid( p_physp ) ); CL_ASSERT( osm_physp_get_port_guid( p_physp ) == p_port->guid ); - p_port->tbl[port_num] = p_physp; /* For switches, we generally want to use Port 0, which is @@ -334,17 +310,9 @@ osm_port_add_new_physp( The LID value in the PortInfo for example, is only valid for port 0 on switches. */ - if( !osm_physp_is_valid( p_port->tbl[p_port->default_port_num] ) ) - { - p_port->default_port_num = port_num; - } - else - { - if( port_num < p_port->default_port_num ) - { - p_port->default_port_num = port_num; - } - } + if( !osm_physp_is_valid( osm_port_get_default_phys_ptr( p_port ) ) || + port_num < p_port->p_physp->port_num ) + p_port->p_physp = p_physp; } /********************************************************************** diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c index 18f655c..17df424 100644 --- a/osm/opensm/osm_sa_link_record.c +++ b/osm/opensm/osm_sa_link_record.c @@ -374,7 +374,7 @@ __osm_lr_rcv_get_port_links( port_num = p_lr->from_port_num; /* If the port number is out of the range of the p_src_port, then this couldn't be a relevant record. 
*/ - if (port_num < p_src_port->physp_tbl_size) + if (port_num < p_src_port->p_node->physp_tbl_size) { p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); if (p_src_physp) @@ -409,7 +409,7 @@ __osm_lr_rcv_get_port_links( port_num = p_lr->to_port_num; /* If the port number is out of the range of the p_dest_port, then this couldn't be a relevant record. */ - if (port_num < p_dest_port->physp_tbl_size ) + if (port_num < p_dest_port->p_node->physp_tbl_size ) { p_dest_physp = osm_port_get_phys_ptr( p_dest_port, port_num ); diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c index 0a199f1..8186603 100644 --- a/osm/opensm/osm_sa_pkey_record.c +++ b/osm/opensm/osm_sa_pkey_record.c @@ -239,7 +239,7 @@ __osm_sa_pkey_by_comp_mask( if ( p_port->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH ) { /* we put it in the comp mask and port num */ - port_num = p_port->default_port_num; + port_num = p_port->p_physp->port_num; osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_pkey_by_comp_mask: " "Using Physical Default Port Number: 0x%X (for End Node)\n", diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c index 3c4ff02..9fbb5c7 100644 --- a/osm/opensm/osm_sa_slvl_record.c +++ b/osm/opensm/osm_sa_slvl_record.c @@ -225,8 +225,8 @@ __osm_sa_slvl_by_comp_mask( osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_slvl_by_comp_mask: " "Using Physical Default Port Number: 0x%X (for End Node)\n", - p_port->default_port_num ); - p_out_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num ); + p_port->p_physp->port_num ); + p_out_physp = p_port->p_physp; /* check that the p_out_physp and the p_req_physp share a pkey */ if (osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_out_physp )) __osm_sa_slvl_create( p_rcv, p_out_physp, p_ctxt, 0 ); diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c index 6df5ed9..97fe060 100644 --- a/osm/opensm/osm_sa_vlarb_record.c +++ b/osm/opensm/osm_sa_vlarb_record.c 
@@ -243,7 +243,7 @@ __osm_sa_vl_arb_by_comp_mask(
   if ( p_port->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH)
   {
     /* we put it in the comp mask and port num */
-    port_num = p_port->default_port_num;
+    port_num = p_port->p_physp->port_num;
     osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
              "__osm_sa_vl_arb_by_comp_mask: "
              "Using Physical Default Port Number: 0x%X (for End Node)\n",
diff --git a/osm/opensm/osm_slvl_map_rcv.c b/osm/opensm/osm_slvl_map_rcv.c
index 3fa3a7e..b109f75 100644
--- a/osm/opensm/osm_slvl_map_rcv.c
+++ b/osm/opensm/osm_slvl_map_rcv.c
@@ -183,7 +183,7 @@ osm_slvl_rcv_process(
   else
   {
     p_physp = osm_port_get_default_phys_ptr(p_port);
-    out_port_num = p_port->default_port_num;
+    out_port_num = p_physp->port_num;
     in_port_num = 0;
   }
 
diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c
index 3aa92c8..0034320 100644
--- a/osm/opensm/osm_sm_state_mgr.c
+++ b/osm/opensm/osm_sm_state_mgr.c
@@ -194,7 +194,7 @@ __osm_sm_state_mgr_send_local_port_info_req(
                         osm_physp_get_dr_path_ptr
                         ( osm_port_get_default_phys_ptr( p_port ) ),
                         IB_MAD_ATTR_PORT_INFO,
-                        cl_hton32( p_port->default_port_num ),
+                        cl_hton32( p_port->p_physp->port_num ),
                         CL_DISP_MSGID_NONE, &context );
 
   if( status != IB_SUCCESS )
diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
index c8c3ddc..3d9fdca 100644
--- a/osm/opensm/osm_subnet.c
+++ b/osm/opensm/osm_subnet.c
@@ -266,7 +266,7 @@ osm_get_gid_by_mad_addr(
               );
       return(IB_INVALID_PARAMETER);
     }
-    p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num);
+    p_physp = p_port->p_physp;
     p_gid->unicast.interface_id = p_physp->port_guid;
     p_gid->unicast.prefix = p_subn->opt.subnet_prefix;
   }
@@ -316,7 +316,7 @@ osm_get_physp_by_mad_addr(
       goto Exit;
     }
 
-    p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num);
+    p_physp = p_port->p_physp;
   }
   else
   {
diff --git a/osm/opensm/osm_vl_arb_rcv.c b/osm/opensm/osm_vl_arb_rcv.c
index 930360a..ed8dfc5 100644
--- a/osm/opensm/osm_vl_arb_rcv.c
+++ b/osm/opensm/osm_vl_arb_rcv.c
@@ -184,7 +184,7 @@ osm_vla_rcv_process(
   else
   {
     p_physp = osm_port_get_default_phys_ptr(p_port);
-    port_num = p_port->default_port_num;
+    port_num = p_physp->port_num;
   }
 
   CL_ASSERT( p_physp );
-- 
1.5.2.rc2.20.gac2a

From sashak at voltaire.com Wed May 9 13:30:23 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 9 May 2007 23:30:23 +0300
Subject: [ofa-general] [PATCH 1/3] opensm: remove osm_port_get_num_physp() function
In-Reply-To: <11787426251341-git-send-email-sashak@voltaire.com>
References: <11787426251341-git-send-email-sashak@voltaire.com>
Message-ID: <11787426251658-git-send-email-sashak@voltaire.com>

This removes the osm_port_get_num_physp() function and uses the native
node-oriented osm_node_get_num_physp() instead.

Signed-off-by: Sasha Khapyorsky
---
 osm/include/opensm/osm_port.h | 29 -----------------------------
 osm/opensm/osm_drop_mgr.c | 2 +-
 osm/opensm/osm_link_mgr.c | 2 +-
 osm/opensm/osm_qos.c | 2 +-
 osm/opensm/osm_sa_link_record.c | 8 ++++----
 osm/opensm/osm_sa_pkey_record.c | 6 +++---
 osm/opensm/osm_sa_portinfo_record.c | 2 +-
 osm/opensm/osm_sa_slvl_record.c | 2 +-
 osm/opensm/osm_sa_vlarb_record.c | 6 +++---
 osm/opensm/osm_state_mgr.c | 2 +-
 osm/opensm/osm_trap_rcv.c | 2 +-
 11 files changed, 17 insertions(+), 46 deletions(-)

diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h
index 6d51d2b..134012c 100644
--- a/osm/include/opensm/osm_port.h
+++ b/osm/include/opensm/osm_port.h
@@ -1467,35 +1467,6 @@ osm_port_get_guid(
 * Port
 *********/
 
-/****f* OpenSM: Port/osm_port_get_num_physp
-* NAME
-* osm_port_get_num_physp
-*
-* DESCRIPTION
-* Returns the number of Physical Port objects associated with this port.
-*
-* SYNOPSIS
-*/
-static inline uint8_t
-osm_port_get_num_physp(
-  IN const osm_port_t* const p_port )
-{
-  return( p_port->physp_tbl_size );
-}
-/*
-* PARAMETERS
-* p_port
-* [in] Pointer to a Port object.
-*
-* RETURN VALUE
-* Returns the number of Physical Port objects associated with this port.
-* -* NOTES -* -* SEE ALSO -* Port -*********/ - /****f* OpenSM: Port/osm_port_get_phys_ptr * NAME * osm_port_get_phys_ptr diff --git a/osm/opensm/osm_drop_mgr.c b/osm/opensm/osm_drop_mgr.c index 0d08ff6..d091347 100644 --- a/osm/opensm/osm_drop_mgr.c +++ b/osm/opensm/osm_drop_mgr.c @@ -237,7 +237,7 @@ __osm_drop_mgr_remove_port( Re-initialize each Physical Port. */ - num_physp = osm_port_get_num_physp( p_port ); + num_physp = osm_node_get_num_physp( p_port->p_node ); for( port_num = 0; port_num < num_physp; port_num++ ) { p_physp = osm_port_get_phys_ptr( p_port, (uint8_t)port_num ); diff --git a/osm/opensm/osm_link_mgr.c b/osm/opensm/osm_link_mgr.c index a1081bd..71c0495 100644 --- a/osm/opensm/osm_link_mgr.c +++ b/osm/opensm/osm_link_mgr.c @@ -426,7 +426,7 @@ __osm_link_mgr_process_port( with this Port. Start iterating with port 1, since the linkstate is not applicable to the management port on switches. */ - num_physp = osm_port_get_num_physp( p_port ); + num_physp = osm_node_get_num_physp( p_port->p_node ); for( i = 0; i < num_physp; i ++ ) { /* diff --git a/osm/opensm/osm_qos.c b/osm/opensm/osm_qos.c index e71c053..11beaae 100644 --- a/osm/opensm/osm_qos.c +++ b/osm/opensm/osm_qos.c @@ -334,7 +334,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) p_node = p_port->p_node; if (p_node->sw) { - num_physp = osm_port_get_num_physp(p_port); + num_physp = osm_node_get_num_physp(p_node); for (i = 1; i < num_physp; i++) { p_physp = osm_port_get_phys_ptr(p_port, i); if (!p_physp || !osm_physp_is_valid(p_physp)) diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c index 169e75e..18f655c 100644 --- a/osm/opensm/osm_sa_link_record.c +++ b/osm/opensm/osm_sa_link_record.c @@ -346,8 +346,8 @@ __osm_lr_rcv_get_port_links( that do not actually connect. Don't bother screening for that here. 
*/ - num_ports = osm_port_get_num_physp( p_src_port ); - dest_num_ports = osm_port_get_num_physp( p_dest_port ); + num_ports = osm_node_get_num_physp( p_src_port->p_node ); + dest_num_ports = osm_node_get_num_physp( p_dest_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); @@ -385,7 +385,7 @@ __osm_lr_rcv_get_port_links( } else { - num_ports = osm_port_get_num_physp( p_src_port ); + num_ports = osm_node_get_num_physp( p_src_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); @@ -421,7 +421,7 @@ __osm_lr_rcv_get_port_links( } else { - num_ports = osm_port_get_num_physp( p_dest_port ); + num_ports = osm_node_get_num_physp( p_dest_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { p_dest_physp = osm_port_get_phys_ptr( diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c index 5eb15df..0a199f1 100644 --- a/osm/opensm/osm_sa_pkey_record.c +++ b/osm/opensm/osm_sa_pkey_record.c @@ -249,7 +249,7 @@ __osm_sa_pkey_by_comp_mask( if( comp_mask & IB_PKEY_COMPMASK_PORT ) { - if (port_num < osm_port_get_num_physp( p_port )) + if (port_num < osm_node_get_num_physp( p_port->p_node )) { p_physp = osm_port_get_phys_ptr( p_port, port_num ); /* Check that the p_physp is valid, and that is shares a pkey @@ -263,13 +263,13 @@ __osm_sa_pkey_by_comp_mask( osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_sa_pkey_by_comp_mask: ERR 4603: " "Given Physical Port Number: 0x%X is out of range should be < 0x%X\n", - port_num, osm_port_get_num_physp( p_port )); + port_num, osm_node_get_num_physp( p_port->p_node )); goto Exit; } } else { - num_ports = osm_port_get_num_physp( p_port ); + num_ports = osm_node_get_num_physp( p_port->p_node ); for( port_num = 0; port_num < num_ports; port_num++ ) { p_physp = osm_port_get_phys_ptr( p_port, port_num ); diff --git a/osm/opensm/osm_sa_portinfo_record.c 
b/osm/opensm/osm_sa_portinfo_record.c index 5d9b1b2..9d4f18e 100644 --- a/osm/opensm/osm_sa_portinfo_record.c +++ b/osm/opensm/osm_sa_portinfo_record.c @@ -538,7 +538,7 @@ __osm_sa_pir_by_comp_mask( comp_mask = p_ctxt->comp_mask; p_req_physp = p_ctxt->p_req_physp; - num_ports = osm_port_get_num_physp( p_port ); + num_ports = osm_node_get_num_physp( p_port->p_node ); if( comp_mask & IB_PIR_COMPMASK_PORTNUM ) { diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c index d831ffd..3c4ff02 100644 --- a/osm/opensm/osm_sa_slvl_record.c +++ b/osm/opensm/osm_sa_slvl_record.c @@ -213,7 +213,7 @@ __osm_sa_slvl_by_comp_mask( p_rcvd_rec = p_ctxt->p_rcvd_rec; comp_mask = p_ctxt->comp_mask; - num_ports = osm_port_get_num_physp( p_port ); + num_ports = osm_node_get_num_physp( p_port->p_node ); in_port_start = 0; in_port_end = num_ports; out_port_start = 0; diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c index f0ff957..6df5ed9 100644 --- a/osm/opensm/osm_sa_vlarb_record.c +++ b/osm/opensm/osm_sa_vlarb_record.c @@ -253,7 +253,7 @@ __osm_sa_vl_arb_by_comp_mask( if( comp_mask & IB_VLA_COMPMASK_OUT_PORT ) { - if (port_num < osm_port_get_num_physp( p_port )) + if (port_num < osm_node_get_num_physp( p_port->p_node )) { p_physp = osm_port_get_phys_ptr( p_port, port_num ); /* check that the p_physp is valid, and that the requester @@ -267,13 +267,13 @@ __osm_sa_vl_arb_by_comp_mask( osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_sa_vl_arb_by_comp_mask: ERR 2A03: " "Given Physical Port Number: 0x%X is out of range should be < 0x%X\n", - port_num, osm_port_get_num_physp( p_port ) ); + port_num, osm_node_get_num_physp( p_port->p_node ) ); goto Exit; } } else { - num_ports = osm_port_get_num_physp( p_port ); + num_ports = osm_node_get_num_physp( p_port->p_node ); for( port_num = 0; port_num < num_ports; port_num++ ) { p_physp = osm_port_get_phys_ptr( p_port, port_num ); diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c 
index ddec10c..6f53e60 100644
--- a/osm/opensm/osm_state_mgr.c
+++ b/osm/opensm/osm_state_mgr.c
@@ -1284,7 +1284,7 @@ __osm_state_mgr_report(
     else
       start_port = 1;
 
-    num_ports = osm_port_get_num_physp( p_port );
+    num_ports = osm_node_get_num_physp( p_node );
     for( port_num = start_port; port_num < num_ports; port_num++ )
     {
       p_physp = osm_port_get_phys_ptr( p_port, port_num );
diff --git a/osm/opensm/osm_trap_rcv.c b/osm/opensm/osm_trap_rcv.c
index 0858968..ed507b6 100644
--- a/osm/opensm/osm_trap_rcv.c
+++ b/osm/opensm/osm_trap_rcv.c
@@ -108,7 +108,7 @@ __get_physp_by_lid_and_num(
   if (! p_port)
     return NULL;
 
-  if (osm_port_get_num_physp(p_port) < num)
+  if (osm_node_get_num_physp(p_port->p_node) < num)
     return NULL;
 
   return( osm_port_get_phys_ptr(p_port, num) );
-- 
1.5.2.rc2.20.gac2a

From sashak at voltaire.com Wed May 9 13:30:25 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 9 May 2007 23:30:25 +0300
Subject: [ofa-general] [PATCH 3/3] opensm: remove some unneeded funcs
In-Reply-To: <11787426251341-git-send-email-sashak@voltaire.com>
References: <11787426251341-git-send-email-sashak@voltaire.com>
Message-ID: <11787426253080-git-send-email-sashak@voltaire.com>

This removes some functions that are not really needed:
osm_port_get_phys_ptr(), osm_port_get_default_phys_ptr() and
osm_port_get_parent_node().
Signed-off-by: Sasha Khapyorsky
---
 osm/include/opensm/osm_port.h | 101 ----------------------------------
 osm/opensm/osm_drop_mgr.c | 2 +-
 osm/opensm/osm_lid_mgr.c | 14 +----
 osm/opensm/osm_link_mgr.c | 2 +-
 osm/opensm/osm_mcast_mgr.c | 2 +-
 osm/opensm/osm_node_info_rcv.c | 6 +-
 osm/opensm/osm_pkey.c | 8 +-
 osm/opensm/osm_pkey_mgr.c | 8 +--
 osm/opensm/osm_pkey_rcv.c | 4 +-
 osm/opensm/osm_port.c | 6 +-
 osm/opensm/osm_port_info_rcv.c | 12 ++--
 osm/opensm/osm_prtn.c | 2 +-
 osm/opensm/osm_qos.c | 6 +-
 osm/opensm/osm_sa_informinfo.c | 6 +-
 osm/opensm/osm_sa_lft_record.c | 2 +-
 osm/opensm/osm_sa_link_record.c | 18 +++---
 osm/opensm/osm_sa_mcmember_record.c | 2 +-
 osm/opensm/osm_sa_mft_record.c | 2 +-
 osm/opensm/osm_sa_multipath_record.c | 10 ++--
 osm/opensm/osm_sa_path_record.c | 12 ++--
 osm/opensm/osm_sa_pkey_record.c | 4 +-
 osm/opensm/osm_sa_portinfo_record.c | 4 +-
 osm/opensm/osm_sa_service_record.c | 2 +-
 osm/opensm/osm_sa_slvl_record.c | 4 +-
 osm/opensm/osm_sa_sminfo_record.c | 2 +-
 osm/opensm/osm_sa_sw_info_record.c | 2 +-
 osm/opensm/osm_sa_vlarb_record.c | 4 +-
 osm/opensm/osm_slvl_map_rcv.c | 4 +-
 osm/opensm/osm_sm_state_mgr.c | 5 +-
 osm/opensm/osm_state_mgr.c | 13 ++--
 osm/opensm/osm_switch.c | 6 +-
 osm/opensm/osm_trap_rcv.c | 2 +-
 osm/opensm/osm_ucast_lash.c | 2 +-
 osm/opensm/osm_ucast_mgr.c | 7 +-
 osm/opensm/osm_ucast_updn.c | 2 +-
 osm/opensm/osm_vl_arb_rcv.c | 5 +-
 36 files changed, 88 insertions(+), 205 deletions(-)

diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h
index 19a8502..df9065e 100644
--- a/osm/include/opensm/osm_port.h
+++ b/osm/include/opensm/osm_port.h
@@ -1454,107 +1454,6 @@ osm_port_get_guid(
 * Port
 *********/
 
-/****f* OpenSM: Port/osm_port_get_phys_ptr
-* NAME
-* osm_port_get_phys_ptr
-*
-* DESCRIPTION
-* Gets the pointer to the specified Physical Port object.
-* -* SYNOPSIS -*/ -static inline osm_physp_t* -osm_port_get_phys_ptr( - IN const osm_port_t* const p_port, - IN const uint8_t port_num ) -{ - return p_port->p_physp; -} -/* -* PARAMETERS -* p_port -* [in] Pointer to a Port object. -* -* port_num -* [in] Number of physical port for which to return the -* osm_physp_t object. If this port is on an HCA, then -* this value is ignored. -* -* RETURN VALUE -* Pointer to the Physical Port object. -* -* NOTES -* -* SEE ALSO -* Port -*********/ - -/****f* OpenSM: Port/osm_port_get_default_phys_ptr -* NAME -* osm_port_get_default_phys_ptr -* -* DESCRIPTION -* Gets the pointer to the default Physical Port object. -* This call should only be used for non-switch ports in which there -* is a one-for-one mapping of port to physp. -* -* SYNOPSIS -*/ -static inline -osm_physp_t* -osm_port_get_default_phys_ptr( - IN const osm_port_t* const p_port ) -{ - CL_ASSERT( osm_physp_is_valid( p_port->p_physp ) ); - return p_port->p_physp; -} -/* -* PARAMETERS -* p_port -* [in] Pointer to a Port object. -* -* RETURN VALUE -* Pointer to the Physical Port object. -* -* NOTES -* -* SEE ALSO -* Port -*********/ - -/****f* OpenSM: Port/osm_port_get_parent_node -* NAME -* osm_port_get_parent_node -* -* DESCRIPTION -* Gets the pointer to the this port's Node object. -* -* SYNOPSIS -*/ -static inline struct _osm_node* -osm_port_get_parent_node( - IN const osm_port_t* const p_port ) -{ - return( p_port->p_node ); -} -/* -* PARAMETERS -* p_port -* [in] Pointer to a Port object. -* -* port_num -* [in] Number of physical port for which to return the -* osm_physp_t object. -* -* RETURN VALUE -* Pointer to the Physical Port object. 
-* -* NOTES -* -* SEE ALSO -* Port -*********/ - /****f* OpenSM: Port/osm_port_get_lid_range_ho * NAME * osm_port_get_lid_range_ho diff --git a/osm/opensm/osm_drop_mgr.c b/osm/opensm/osm_drop_mgr.c index d091347..97a95c2 100644 --- a/osm/opensm/osm_drop_mgr.c +++ b/osm/opensm/osm_drop_mgr.c @@ -240,7 +240,7 @@ __osm_drop_mgr_remove_port( num_physp = osm_node_get_num_physp( p_port->p_node ); for( port_num = 0; port_num < num_physp; port_num++ ) { - p_physp = osm_port_get_phys_ptr( p_port, (uint8_t)port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)port_num ); if( p_physp ) { diff --git a/osm/opensm/osm_lid_mgr.c b/osm/opensm/osm_lid_mgr.c index d856fb0..6712c6c 100644 --- a/osm/opensm/osm_lid_mgr.c +++ b/osm/opensm/osm_lid_mgr.c @@ -975,10 +975,7 @@ __osm_lid_mgr_set_physp_pi( Don't bother doing anything if this Physical Port is not valid. This allows simplified code in the caller. */ - if( p_physp == NULL ) - goto Exit; - - if( !osm_physp_is_valid( p_physp ) ) + if( p_physp == NULL || !osm_physp_is_valid( p_physp ) ) goto Exit; port_num = osm_physp_get_port_num( p_physp ); @@ -1283,7 +1280,6 @@ __osm_lid_mgr_process_our_sm_node( osm_port_t *p_port; uint16_t min_lid_ho; uint16_t max_lid_ho; - osm_physp_t *p_physp; boolean_t res = TRUE; OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_process_our_sm_node ); @@ -1336,9 +1332,7 @@ __osm_lid_mgr_process_our_sm_node( Set the PortInfo the Physical Port associated with this Port. 
*/ - p_physp = osm_port_get_default_phys_ptr( p_port ); - - __osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_physp, cl_hton16( min_lid_ho ) ); + __osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_port->p_physp, cl_hton16( min_lid_ho ) ); Exit: OSM_LOG_EXIT( p_mgr->p_log ); @@ -1404,7 +1398,6 @@ osm_lid_mgr_process_subnet( osm_port_t *p_port; ib_net64_t port_guid; uint16_t min_lid_ho, max_lid_ho; - osm_physp_t *p_physp; int lid_changed; CL_ASSERT( p_mgr ); @@ -1460,9 +1453,8 @@ osm_lid_mgr_process_subnet( ", LID [0x%X,0x%X]\n", cl_ntoh64( port_guid ), min_lid_ho, max_lid_ho ); - p_physp = osm_port_get_default_phys_ptr( p_port ); /* the proc returns the fact it sent a set port info */ - if (__osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_physp, cl_hton16( min_lid_ho ))) + if (__osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_port->p_physp, cl_hton16( min_lid_ho ))) p_mgr->send_set_reqs = TRUE; } } /* all ports */ diff --git a/osm/opensm/osm_link_mgr.c b/osm/opensm/osm_link_mgr.c index 71c0495..a38d179 100644 --- a/osm/opensm/osm_link_mgr.c +++ b/osm/opensm/osm_link_mgr.c @@ -434,7 +434,7 @@ __osm_link_mgr_process_port( or if the state of the port is already better then the specified state. 
*/ - p_physp = osm_port_get_phys_ptr( p_port, (uint8_t)i ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)i ); if( p_physp && osm_physp_is_valid( p_physp ) ) { current_state = osm_physp_get_port_state( p_physp ); diff --git a/osm/opensm/osm_mcast_mgr.c b/osm/opensm/osm_mcast_mgr.c index 0cdcc0e..f5059c9 100644 --- a/osm/opensm/osm_mcast_mgr.c +++ b/osm/opensm/osm_mcast_mgr.c @@ -1127,7 +1127,7 @@ osm_mcast_mgr_process_single( goto Exit; } - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; if( p_physp == NULL ) { osm_log( p_mgr->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_node_info_rcv.c b/osm/opensm/osm_node_info_rcv.c index 364b07c..2c79056 100644 --- a/osm/opensm/osm_node_info_rcv.c +++ b/osm/opensm/osm_node_info_rcv.c @@ -791,12 +791,10 @@ __osm_ni_rcv_process_new( "Duplicate Port GUID 0x%" PRIx64 "! Found by the two directed routes:\n", cl_ntoh64( p_ni->port_guid ) ); osm_dump_dr_path(p_rcv->p_log, - osm_physp_get_dr_path_ptr( - osm_port_get_default_phys_ptr ( p_port) ), + osm_physp_get_dr_path_ptr(p_port->p_physp), OSM_LOG_ERROR); osm_dump_dr_path(p_rcv->p_log, - osm_physp_get_dr_path_ptr( - osm_port_get_default_phys_ptr ( p_port_check) ), + osm_physp_get_dr_path_ptr(p_port_check->p_physp), OSM_LOG_ERROR); if ( p_rtr ) osm_router_delete( &p_rtr ); diff --git a/osm/opensm/osm_pkey.c b/osm/opensm/osm_pkey.c index be5578a..c0daa38 100644 --- a/osm/opensm/osm_pkey.c +++ b/osm/opensm/osm_pkey.c @@ -432,8 +432,8 @@ osm_port_share_pkey( goto Exit; } - p_physp1 = osm_port_get_default_phys_ptr(p_port_1); - p_physp2 = osm_port_get_default_phys_ptr(p_port_2); + p_physp1 = p_port_1->p_physp; + p_physp2 = p_port_2->p_physp; if (!p_physp1 || !p_physp2) { @@ -478,7 +478,7 @@ osm_lid_share_pkey( } else { - p_physp1 = osm_port_get_default_phys_ptr(p_port1); + p_physp1 = p_port1->p_physp; } if (osm_node_get_type( p_node2 ) == IB_NODE_TYPE_SWITCH) @@ -487,7 +487,7 @@ osm_lid_share_pkey( } else { - p_physp2 = 
osm_port_get_default_phys_ptr(p_port2); + p_physp2 = p_port2->p_physp; } return(osm_physp_share_pkey(p_log, p_physp1, p_physp2)); diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index bbbe192..33ac8b5 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -310,7 +310,7 @@ static boolean_t pkey_mgr_update_port( memset(&empty_block, 0, sizeof(ib_pkey_table_t)); - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; if ( !osm_physp_is_valid( p_physp ) ) return FALSE; @@ -449,7 +449,7 @@ pkey_mgr_update_peer_port( memset(&empty_block, 0, sizeof(ib_pkey_table_t)); - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; if (!osm_physp_is_valid( p_physp )) return FALSE; peer = osm_physp_get_remote( p_physp ); @@ -532,7 +532,6 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; - osm_node_t *p_node; CL_ASSERT( p_osm ); @@ -570,8 +569,7 @@ osm_pkey_mgr_process( p_next = cl_qmap_next( p_next ); if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) signal = OSM_SIGNAL_DONE_PENDING; - p_node = osm_port_get_parent_node( p_port ); - if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && + if ( ( osm_node_get_type( p_port->p_node ) != IB_NODE_TYPE_SWITCH ) && pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) diff --git a/osm/opensm/osm_pkey_rcv.c b/osm/opensm/osm_pkey_rcv.c index 0e0ec46..7c58d98 100644 --- a/osm/opensm/osm_pkey_rcv.c +++ b/osm/opensm/osm_pkey_rcv.c @@ -159,7 +159,7 @@ osm_pkey_rcv_process( goto Exit; } - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; CL_ASSERT( p_node ); block_num = (uint16_t)((cl_ntoh32(p_smp->attr_mod)) & 0x0000FFFF); @@ -171,7 +171,7 @@ osm_pkey_rcv_process( } else { - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; port_num = p_physp->port_num; } diff 
--git a/osm/opensm/osm_port.c b/osm/opensm/osm_port.c index b0949a0..d6ea9a1 100644 --- a/osm/opensm/osm_port.c +++ b/osm/opensm/osm_port.c @@ -310,7 +310,7 @@ osm_port_add_new_physp( The LID value in the PortInfo for example, is only valid for port 0 on switches. */ - if( !osm_physp_is_valid( osm_port_get_default_phys_ptr( p_port ) ) || + if( !osm_physp_is_valid( p_port->p_physp ) || port_num < p_port->p_physp->port_num ) p_port->p_physp = p_physp; } @@ -573,7 +573,7 @@ __osm_physp_get_dr_physp_set( } /* get the node of the SM */ - p_node = osm_port_get_parent_node(p_port); + p_node = p_port->p_node; /* traverse the path adding the nodes to the table @@ -740,7 +740,7 @@ osm_physp_replace_dr_path_with_alternate_dr_path( port we'll get the port connected to the rest of the subnet. If SM is running on SWITCH - we should try to get a dr path from all switch ports. */ - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; CL_ASSERT( p_physp ); CL_ASSERT( osm_physp_is_valid( p_physp ) ); diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c index 9bd75b5..0076b00 100644 --- a/osm/opensm/osm_port_info_rcv.c +++ b/osm/opensm/osm_port_info_rcv.c @@ -555,13 +555,13 @@ osm_pi_rcv_process_set( p_context = osm_madw_get_pi_context_ptr( p_madw ); - p_physp = osm_port_get_phys_ptr( p_port, port_num ); - CL_ASSERT( p_physp ); - CL_ASSERT( osm_physp_is_valid( p_physp ) ); + p_node = p_port->p_node; + CL_ASSERT( p_node ); + + p_physp = osm_node_get_physp_ptr( p_node, port_num ); + CL_ASSERT( p_physp && osm_physp_is_valid( p_physp ) ); port_guid = osm_physp_get_port_guid( p_physp ); - p_node = osm_port_get_parent_node( p_port ); - CL_ASSERT( p_node ); p_smp = osm_madw_get_smp_ptr( p_madw ); p_pi = (ib_port_info_t*)ib_smp_get_payload_ptr( p_smp ); @@ -743,7 +743,7 @@ osm_pi_rcv_process( cl_ntoh64( p_smp->trans_id ) ); } - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; p_physp = osm_node_get_physp_ptr( p_node, 
port_num ); CL_ASSERT( p_node ); diff --git a/osm/opensm/osm_prtn.c b/osm/opensm/osm_prtn.c index 4099cee..027a5a4 100644 --- a/osm/opensm/osm_prtn.c +++ b/osm/opensm/osm_prtn.c @@ -119,7 +119,7 @@ ib_api_status_t osm_prtn_add_port(osm_log_t *p_log, osm_subn_t *p_subn, return status; } - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; if (!p_physp) { osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " "no physical for port 0x%" PRIx64 "\n", diff --git a/osm/opensm/osm_qos.c b/osm/opensm/osm_qos.c index 11beaae..f426241 100644 --- a/osm/opensm/osm_qos.c +++ b/osm/opensm/osm_qos.c @@ -195,7 +195,7 @@ static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port, if (osm_node_get_type(osm_physp_get_node_ptr(p)) == IB_NODE_TYPE_SWITCH) { if (ib_port_info_get_vl_cap(&p->port_info) == 1) { /* Check port 0's capability mask */ - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; if (!(p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_SL_MAP)) return IB_SUCCESS; } @@ -336,7 +336,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) if (p_node->sw) { num_physp = osm_node_get_num_physp(p_node); for (i = 1; i < num_physp; i++) { - p_physp = osm_port_get_phys_ptr(p_port, i); + p_physp = osm_node_get_physp_ptr(p_node, i); if (!p_physp || !osm_physp_is_valid(p_physp)) continue; status = @@ -353,7 +353,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) else cfg = &ca_config; - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; if (!osm_physp_is_valid(p_physp)) continue; diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c index 340a7f1..6109c5d 100644 --- a/osm/opensm/osm_sa_informinfo.c +++ b/osm/opensm/osm_sa_informinfo.c @@ -194,7 +194,7 @@ __validate_ports_access_rights( } /* get the destination InformInfo physical port */ - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; /* make sure that the requester and destination 
port can access each other according to the current partitioning. */ @@ -244,7 +244,7 @@ __validate_ports_access_rights( if ( p_port == NULL ) continue; - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; /* make sure that the requester and destination port can access each other according to the current partitioning. */ if (! osm_physp_share_pkey( p_rcv->p_log, p_physp, p_requester_physp)) @@ -405,7 +405,7 @@ __osm_sa_inform_info_rec_by_comp_mask( } /* get the subscriber InformInfo physical port */ - p_subscriber_physp = osm_port_get_default_phys_ptr(p_subscriber_port); + p_subscriber_physp = p_subscriber_port->p_physp; /* make sure that the requester and subscriber port can access each other according to the current partitioning. */ if (! osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_subscriber_physp )) diff --git a/osm/opensm/osm_sa_lft_record.c b/osm/opensm/osm_sa_lft_record.c index b6333e7..c5cd9ca 100644 --- a/osm/opensm/osm_sa_lft_record.c +++ b/osm/opensm/osm_sa_lft_record.c @@ -244,7 +244,7 @@ __osm_lftr_rcv_by_comp_mask( /* check that the requester physp and the current physp are under the same partition. */ - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; if (! 
p_physp) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c index 17df424..5e4e35e 100644 --- a/osm/opensm/osm_sa_link_record.c +++ b/osm/opensm/osm_sa_link_record.c @@ -350,12 +350,12 @@ __osm_lr_rcv_get_port_links( dest_num_ports = osm_node_get_num_physp( p_dest_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { - p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); + p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num ); for( dest_port_num = 1; dest_port_num < dest_num_ports; dest_port_num++ ) { - p_dest_physp = osm_port_get_phys_ptr( p_dest_port, - dest_port_num ); + p_dest_physp = osm_node_get_physp_ptr( p_dest_port->p_node, + dest_port_num ); /* both physical ports should be with data */ if (p_src_physp && p_dest_physp) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, @@ -376,7 +376,7 @@ __osm_lr_rcv_get_port_links( this couldn't be a relevant record. */ if (port_num < p_src_port->p_node->physp_tbl_size) { - p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); + p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num ); if (p_src_physp) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, NULL, comp_mask, p_list, @@ -388,7 +388,7 @@ __osm_lr_rcv_get_port_links( num_ports = osm_node_get_num_physp( p_src_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { - p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); + p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num ); if (p_src_physp) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, NULL, comp_mask, p_list, @@ -411,8 +411,8 @@ __osm_lr_rcv_get_port_links( this couldn't be a relevant record. 
*/ if (port_num < p_dest_port->p_node->physp_tbl_size ) { - p_dest_physp = osm_port_get_phys_ptr( - p_dest_port, port_num ); + p_dest_physp = osm_node_get_physp_ptr( + p_dest_port->p_node, port_num ); if (p_dest_physp) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL, p_dest_physp, comp_mask, @@ -424,8 +424,8 @@ __osm_lr_rcv_get_port_links( num_ports = osm_node_get_num_physp( p_dest_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { - p_dest_physp = osm_port_get_phys_ptr( - p_dest_port, port_num ); + p_dest_physp = osm_node_get_physp_ptr( + p_dest_port->p_node, port_num ); if (p_dest_physp) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL, p_dest_physp, comp_mask, diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c index 50c4f22..8241129 100644 --- a/osm/opensm/osm_sa_mcmember_record.c +++ b/osm/opensm/osm_sa_mcmember_record.c @@ -1570,7 +1570,7 @@ __osm_mcmr_rcv_join_mgrp( goto Exit; } - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; /* Check that the p_physp and the requester physp are in the same partition. */ p_request_physp = diff --git a/osm/opensm/osm_sa_mft_record.c b/osm/opensm/osm_sa_mft_record.c index 005c9bd..7908583 100644 --- a/osm/opensm/osm_sa_mft_record.c +++ b/osm/opensm/osm_sa_mft_record.c @@ -250,7 +250,7 @@ __osm_mftr_rcv_by_comp_mask( /* check that the requester physp and the current physp are under the same partition. */ - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; if (! 
p_physp) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_multipath_record.c b/osm/opensm/osm_sa_multipath_record.c index 0c5643e..06640d9 100644 --- a/osm/opensm/osm_sa_multipath_record.c +++ b/osm/opensm/osm_sa_multipath_record.c @@ -154,7 +154,7 @@ __osm_sa_multipath_rec_is_tavor_port( osm_node_t const* p_node; ib_net32_t vend_id; - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; vend_id = ib_node_info_get_vendor_id( &p_node->node_info ); return( (p_node->node_info.device_id == CL_HTON16(23108)) && @@ -255,8 +255,8 @@ __osm_mpr_rcv_get_path_parms( dest_lid = cl_hton16( dest_lid_ho ); - p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port ); - p_physp = osm_port_get_default_phys_ptr( p_src_port ); + p_dest_physp = p_dest_port->p_physp; + p_physp = p_src_port->p_physp; p_pi = &p_physp->port_info; mtu = ib_port_info_get_mtu_cap( p_pi ); @@ -744,8 +744,8 @@ __osm_mpr_rcv_build_pr( OSM_LOG_ENTER( p_rcv->p_log, __osm_mpr_rcv_build_pr ); - p_src_physp = osm_port_get_default_phys_ptr( p_src_port ); - p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port ); + p_src_physp = p_src_port->p_physp; + p_dest_physp = p_dest_port->p_physp; p_pr->dgid.unicast.prefix = osm_physp_get_subnet_prefix( p_dest_physp ); p_pr->dgid.unicast.interface_id = osm_physp_get_port_guid( p_dest_physp ); diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c index 1b0f89f..47d9c33 100644 --- a/osm/opensm/osm_sa_path_record.c +++ b/osm/opensm/osm_sa_path_record.c @@ -171,7 +171,7 @@ __osm_sa_path_rec_is_tavor_port( osm_node_t const* p_node; ib_net32_t vend_id; - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; vend_id = ib_node_info_get_vendor_id( &p_node->node_info ); return( (p_node->node_info.device_id == CL_HTON16(23108)) && @@ -268,8 +268,8 @@ __osm_pr_rcv_get_path_parms( dest_lid = cl_hton16( dest_lid_ho ); - p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port ); - p_physp = 
osm_port_get_default_phys_ptr( p_src_port ); + p_dest_physp = p_dest_port->p_physp; + p_physp = p_src_port->p_physp; p_pi = &p_physp->port_info; mtu = ib_port_info_get_mtu_cap( p_pi ); @@ -753,9 +753,9 @@ __osm_pr_rcv_build_pr( OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_build_pr ); - p_src_physp = osm_port_get_default_phys_ptr( p_src_port ); + p_src_physp = p_src_port->p_physp; #ifndef ROUTER_EXP - p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port ); + p_dest_physp = p_dest_port->p_physp; p_pr->dgid.unicast.prefix = osm_physp_get_subnet_prefix( p_dest_physp ); p_pr->dgid.unicast.interface_id = osm_physp_get_port_guid( p_dest_physp ); @@ -770,7 +770,7 @@ __osm_pr_rcv_build_pr( p_pr->dgid = *p_dgid; else { - p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port ); + p_dest_physp = p_dest_port->p_physp; p_pr->dgid.unicast.prefix = osm_physp_get_subnet_prefix( p_dest_physp ); p_pr->dgid.unicast.interface_id = osm_physp_get_port_guid( p_dest_physp ); diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c index 8186603..8a71314 100644 --- a/osm/opensm/osm_sa_pkey_record.c +++ b/osm/opensm/osm_sa_pkey_record.c @@ -251,7 +251,7 @@ __osm_sa_pkey_by_comp_mask( { if (port_num < osm_node_get_num_physp( p_port->p_node )) { - p_physp = osm_port_get_phys_ptr( p_port, port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); /* Check that the p_physp is valid, and that is shares a pkey with the p_req_physp. 
*/ if( p_physp && osm_physp_is_valid( p_physp ) && @@ -272,7 +272,7 @@ __osm_sa_pkey_by_comp_mask( num_ports = osm_node_get_num_physp( p_port->p_node ); for( port_num = 0; port_num < num_ports; port_num++ ) { - p_physp = osm_port_get_phys_ptr( p_port, port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); if( p_physp == NULL ) continue; diff --git a/osm/opensm/osm_sa_portinfo_record.c b/osm/opensm/osm_sa_portinfo_record.c index 9d4f18e..74f53d6 100644 --- a/osm/opensm/osm_sa_portinfo_record.c +++ b/osm/opensm/osm_sa_portinfo_record.c @@ -544,7 +544,7 @@ __osm_sa_pir_by_comp_mask( { if (p_rcvd_rec->port_num < num_ports) { - p_physp = osm_port_get_phys_ptr( p_port, p_rcvd_rec->port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, p_rcvd_rec->port_num ); /* Check that the p_physp is valid, and that the p_physp and the p_req_physp share a pkey. */ if( p_physp && osm_physp_is_valid( p_physp ) && @@ -556,7 +556,7 @@ __osm_sa_pir_by_comp_mask( { for( port_num = 0; port_num < num_ports; port_num++ ) { - p_physp = osm_port_get_phys_ptr( p_port, port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); if( p_physp == NULL ) continue; diff --git a/osm/opensm/osm_sa_service_record.c b/osm/opensm/osm_sa_service_record.c index b23a12d..eff0b0a 100644 --- a/osm/opensm/osm_sa_service_record.c +++ b/osm/opensm/osm_sa_service_record.c @@ -213,7 +213,7 @@ __match_service_pkey_with_ports_pkey( /* check on the table of the default physical port of the service port */ if ( !osm_physp_has_pkey( p_rcv->p_log, p_service_rec->service_pkey, - osm_port_get_default_phys_ptr(service_port) ) ) + service_port->p_physp ) ) { valid = FALSE; goto Exit; diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c index 9fbb5c7..e40ad61 100644 --- a/osm/opensm/osm_sa_slvl_record.c +++ b/osm/opensm/osm_sa_slvl_record.c @@ -243,7 +243,7 @@ __osm_sa_slvl_by_comp_mask( } for( out_port_num = out_port_start; out_port_num <= out_port_end; 
out_port_num++ ) { - p_out_physp = osm_port_get_phys_ptr( p_port, out_port_num ); + p_out_physp = osm_node_get_physp_ptr( p_port->p_node, out_port_num ); if( p_out_physp == NULL ) continue; @@ -256,7 +256,7 @@ __osm_sa_slvl_by_comp_mask( continue; #endif - p_in_physp = osm_port_get_phys_ptr( p_port, in_port_num ); + p_in_physp = osm_node_get_physp_ptr( p_port->p_node, in_port_num ); if( p_in_physp == NULL ) continue; diff --git a/osm/opensm/osm_sa_sminfo_record.c b/osm/opensm/osm_sa_sminfo_record.c index 5e15f52..8c343b4 100644 --- a/osm/opensm/osm_sa_sminfo_record.c +++ b/osm/opensm/osm_sa_sminfo_record.c @@ -374,7 +374,7 @@ osm_smir_rcv_process( { if (FALSE == osm_physp_share_pkey( p_rcv->p_log, p_req_physp, - osm_port_get_default_phys_ptr( local_port ) ) ) + local_port->p_physp ) ) { cl_plock_release( p_rcv->p_lock ); osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_sw_info_record.c b/osm/opensm/osm_sa_sw_info_record.c index da65864..94b1ff9 100644 --- a/osm/opensm/osm_sa_sw_info_record.c +++ b/osm/opensm/osm_sa_sw_info_record.c @@ -245,7 +245,7 @@ __osm_sir_rcv_create_sir( /* check that the requester physp and the current physp are under the same partition. */ - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; if (! p_physp) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c index 97fe060..a462ee9 100644 --- a/osm/opensm/osm_sa_vlarb_record.c +++ b/osm/opensm/osm_sa_vlarb_record.c @@ -255,7 +255,7 @@ __osm_sa_vl_arb_by_comp_mask( { if (port_num < osm_node_get_num_physp( p_port->p_node )) { - p_physp = osm_port_get_phys_ptr( p_port, port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); /* check that the p_physp is valid, and that the requester and the p_physp share a pkey. 
*/ if( p_physp && osm_physp_is_valid( p_physp ) && @@ -276,7 +276,7 @@ __osm_sa_vl_arb_by_comp_mask( num_ports = osm_node_get_num_physp( p_port->p_node ); for( port_num = 0; port_num < num_ports; port_num++ ) { - p_physp = osm_port_get_phys_ptr( p_port, port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); if( p_physp == NULL ) continue; diff --git a/osm/opensm/osm_slvl_map_rcv.c b/osm/opensm/osm_slvl_map_rcv.c index b109f75..3352627 100644 --- a/osm/opensm/osm_slvl_map_rcv.c +++ b/osm/opensm/osm_slvl_map_rcv.c @@ -170,7 +170,7 @@ osm_slvl_rcv_process( goto Exit; } - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; CL_ASSERT( p_node ); /* in case of a non switch node the attr modifier should be ignored */ @@ -182,7 +182,7 @@ osm_slvl_rcv_process( } else { - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; out_port_num = p_physp->port_num; in_port_num = 0; } diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c index 0034320..51df1df 100644 --- a/osm/opensm/osm_sm_state_mgr.c +++ b/osm/opensm/osm_sm_state_mgr.c @@ -192,7 +192,7 @@ __osm_sm_state_mgr_send_local_port_info_req( status = osm_req_get( p_sm_mgr->p_req, osm_physp_get_dr_path_ptr - ( osm_port_get_default_phys_ptr( p_port ) ), + ( p_port->p_physp ), IB_MAD_ATTR_PORT_INFO, cl_hton32( p_port->p_physp->port_num ), CL_DISP_MSGID_NONE, &context ); @@ -261,8 +261,7 @@ __osm_sm_state_mgr_send_master_sm_info_req( context.smi_context.set_method = FALSE; status = osm_req_get( p_sm_mgr->p_req, - osm_physp_get_dr_path_ptr - ( osm_port_get_default_phys_ptr( p_port ) ), + osm_physp_get_dr_path_ptr(p_port->p_physp), IB_MAD_ATTR_SM_INFO, 0, CL_DISP_MSGID_NONE, &context ); diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 6f53e60..6681cfc 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -849,7 +849,7 @@ __osm_state_mgr_is_sm_port_down( goto Exit; } - p_physp = 
osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; CL_ASSERT( p_physp ); CL_ASSERT( osm_physp_is_valid( p_physp ) ); @@ -914,7 +914,7 @@ __osm_state_mgr_sweep_hop_1( goto Exit; } - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; CL_ASSERT( p_node ); port_num = ib_node_info_get_local_port_num( &p_node->node_info ); @@ -1277,7 +1277,7 @@ __osm_state_mgr_report( cl_ntoh64( osm_port_get_guid( p_port ) ) ); } - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; node_type = osm_node_get_type( p_node ); if( node_type == IB_NODE_TYPE_SWITCH ) start_port = 0; @@ -1287,7 +1287,7 @@ __osm_state_mgr_report( num_ports = osm_node_get_num_physp( p_node ); for( port_num = start_port; port_num < num_ports; port_num++ ) { - p_physp = osm_port_get_phys_ptr( p_port, port_num ); + p_physp = osm_node_get_physp_ptr( p_node, port_num ); if( ( p_physp == NULL ) || ( !osm_physp_is_valid( p_physp ) ) ) continue; @@ -1622,9 +1622,8 @@ __osm_state_mgr_send_handover( } status = osm_req_set( p_mgr->p_req, - osm_physp_get_dr_path_ptr - ( osm_port_get_default_phys_ptr( p_port ) ), payload, - sizeof(payload), + osm_physp_get_dr_path_ptr(p_port->p_physp), + payload, sizeof(payload), IB_MAD_ATTR_SM_INFO, IB_SMINFO_ATTR_MOD_HANDOVER, CL_DISP_MSGID_NONE, &context ); diff --git a/osm/opensm/osm_switch.c b/osm/opensm/osm_switch.c index 9273459..a79f5cd 100644 --- a/osm/opensm/osm_switch.c +++ b/osm/opensm/osm_switch.c @@ -291,7 +291,7 @@ osm_switch_recommend_path( } else { - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; if (!p_physp || !p_physp->p_remote_physp || !p_physp->p_remote_physp->p_node->sw) return OSM_NO_PATH; @@ -566,7 +566,7 @@ osm_switch_get_port_least_hops( } else { - osm_physp_t *p = osm_port_get_default_phys_ptr(p_port); + osm_physp_t *p = p_port->p_physp; uint8_t hops; if (!p || !p->p_remote_physp || !p->p_remote_physp->p_node->sw) @@ -604,7 +604,7 @@ osm_switch_recommend_mcast_path( } else { 
- osm_physp_t *p_physp = osm_port_get_default_phys_ptr(p_port); + osm_physp_t *p_physp = p_port->p_physp; if (!p_physp || !p_physp->p_remote_physp || !p_physp->p_remote_physp->p_node->sw) return OSM_NO_PATH; diff --git a/osm/opensm/osm_trap_rcv.c b/osm/opensm/osm_trap_rcv.c index ed507b6..0ec9a1f 100644 --- a/osm/opensm/osm_trap_rcv.c +++ b/osm/opensm/osm_trap_rcv.c @@ -111,7 +111,7 @@ __get_physp_by_lid_and_num( if (osm_node_get_num_physp(p_port->p_node) < num) return NULL; - return( osm_port_get_phys_ptr(p_port, num) ); + return( osm_node_get_physp_ptr(p_port->p_node, num) ); } /********************************************************************** diff --git a/osm/opensm/osm_ucast_lash.c b/osm/opensm/osm_ucast_lash.c index 4459d9f..5d32e89 100644 --- a/osm/opensm/osm_ucast_lash.c +++ b/osm/opensm/osm_ucast_lash.c @@ -170,7 +170,7 @@ static uint64_t osm_lash_get_switch_guid(IN const osm_switch_t *p_sw) static osm_switch_t *get_osm_switch_from_port(osm_port_t *port) { - osm_physp_t *p = osm_port_get_default_phys_ptr(port); + osm_physp_t *p = port->p_physp; if (p->p_node->sw) return p->p_node->sw; else if (p->p_remote_physp->p_node->sw) diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index 7d3916b..2860e66 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -306,7 +306,7 @@ __osm_ucast_mgr_dump_ucast_routes( } else { - osm_physp_t *p_physp = osm_port_get_default_phys_ptr(p_port); + osm_physp_t *p_physp = p_port->p_physp; if( !p_physp || !p_physp->p_remote_physp || !p_physp->p_remote_physp->p_node->sw ) num_hops = OSM_NO_PATH; @@ -413,7 +413,7 @@ ucast_mgr_dump_lfts(cl_map_item_t *p_map_item, void *cxt) p_port = cl_ptr_vector_get(&p_mgr->p_subn->port_lid_tbl, lid); if (p_port) { - p_node = osm_port_get_parent_node(p_port); + p_node = p_port->p_node; fprintf(file, "%s portguid 0x016%" PRIx64 ": \'%s\'", ib_get_node_type_str(osm_node_get_type(p_node)), cl_ntoh64(osm_port_get_guid(p_port)), @@ -671,8 +671,7 @@ 
__osm_ucast_mgr_process_port( if (!p_mgr->p_subn->opt.port_profile_switch_nodes) { is_ignored_by_port_prof |= - (osm_node_get_type(osm_port_get_parent_node(p_port)) == - IB_NODE_TYPE_SWITCH); + (osm_node_get_type(p_port->p_node) == IB_NODE_TYPE_SWITCH); } } diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c index b15fe5e..d9446e9 100644 --- a/osm/opensm/osm_ucast_updn.c +++ b/osm/opensm/osm_ucast_updn.c @@ -792,7 +792,7 @@ __osm_updn_find_root_nodes_by_min_hop( p_next_port = (osm_port_t*)cl_qmap_next( &p_next_port->map_item ); if ( osm_node_get_type(p_port->p_node) != IB_NODE_TYPE_SWITCH ) { - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; self_lid_ho = cl_ntoh16( osm_physp_get_base_lid(p_physp) ); numCas++; /* EZ: diff --git a/osm/opensm/osm_vl_arb_rcv.c b/osm/opensm/osm_vl_arb_rcv.c index ed8dfc5..f36751e 100644 --- a/osm/opensm/osm_vl_arb_rcv.c +++ b/osm/opensm/osm_vl_arb_rcv.c @@ -171,7 +171,7 @@ osm_vla_rcv_process( goto Exit; } - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; CL_ASSERT( p_node ); block_num = (uint8_t)(cl_ntoh32(p_smp->attr_mod) >> 16); @@ -183,7 +183,7 @@ osm_vla_rcv_process( } else { - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; port_num = p_physp->port_num; } @@ -239,4 +239,3 @@ osm_vla_rcv_process( OSM_LOG_EXIT( p_rcv->p_log ); } - -- 1.5.2.rc2.20.gac2a From swise at opengridcomputing.com Wed May 9 13:24:19 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 09 May 2007 15:24:19 -0500 Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: <46422D07.3050600@Sun.COM> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM> 
Message-ID: <1178742259.382.112.camel@stevo-desktop> On Wed, 2007-05-09 at 16:20 -0400, Donald Kerr wrote: > I missing some context here. Where are you plugging iwarp and OMPI > together? ofed-1.2 supports iwarp and the chelsio rnic. It can be accessed directly via the ofa verbs and ofa rdma-cm _as well as_ via udapl. I'm attempting to run OMPI over udapl over chelsio's rnic. Steve. From Don.Kerr at Sun.COM Wed May 9 13:27:18 2007 From: Don.Kerr at Sun.COM (Donald Kerr) Date: Wed, 09 May 2007 16:27:18 -0400 Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: <1178742259.382.112.camel@stevo-desktop> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop> Message-ID: <46422EA6.3020006@Sun.COM> So then I agree with Andrew, I think you are trying to impose restrictions on uDAPL which are not part of the Spec. -DON Steve Wise wrote: >On Wed, 2007-05-09 at 16:20 -0400, Donald Kerr wrote: > > >>I missing some context here. Where are you plugging iwarp and OMPI >>together? >> >> > >ofed-1.2 supports iwarp and the chelsio rnic. It can be accessed >directly via the ofa verbs and ofa rdma-cm _as well as_ via udapl. > >I'm attempting to run OMPI over udapl over chelsio's rnic. > >Steve. 
> > > > > From swise at opengridcomputing.com Wed May 9 13:33:39 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 09 May 2007 15:33:39 -0500 Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: <46422EA6.3020006@Sun.COM> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> Message-ID: <1178742819.382.114.camel@stevo-desktop> On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote: > So then I agree with Andrew, I think you are trying to impose > restrictions on uDAPL which are not part of the Spec. > true, but if you want a single btl for IB and IW, then you'll need to address this issue in some way... From Don.Kerr at Sun.COM Wed May 9 13:45:16 2007 From: Don.Kerr at Sun.COM (Donald Kerr) Date: Wed, 09 May 2007 16:45:16 -0400 Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <1178742819.382.114.camel@stevo-desktop> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> Message-ID: <464232DC.9010201@Sun.COM> I guess I have not read enough about iwarp yet but if iwarp is sitting below ib verbs or udapl in the stack and is trying to impose restrictions which ib verbs or udapl do not adhere to then maybe iwarp is in the wrong place in the ofed stack. 
Having said that, I do agree the OMPI community needs to consider where iwarp plays in its own stack, if it has not already.

Steve Wise wrote:

>On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote:
>
>>So then I agree with Andrew, I think you are trying to impose
>>restrictions on uDAPL which are not part of the Spec.
>>
>
>true, but if you want a single btl for IB and IW, then you'll need to
>address this issue in some way...
>
>_______________________________________________
>devel mailing list
>devel at open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

From caitlinb at broadcom.com Wed May 9 14:11:59 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Wed, 9 May 2007 14:11:59 -0700
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178740498.382.97.camel@stevo-desktop>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D03C331A7@NT-IRVA-0750.brcm.ad.broadcom.com>

>
> 2) OMPI is not adhering to the iwarp protocol requirement that the ULP,
> in this case OMPI, initiating the iwarp connection (the side issuing the
> dat_ep_connect() or rdma_connect()) _MUST_ be the first to send an RDMA
> message. So if an OMPI process _accepts_ an rdma connection, then it
> cannot send on that connection until it receives some sort of rdma
> operation from the client process. It appears the current OMPI
> connection setup model doesn't enforce this.
>

This is actually an MPA requirement, and according to *protocol* specs
having the active side send a zero length RDMA Write should be able
to fix the problem. However there is language in the RDMAC verbs
that clearly implies that the active side must Send something, and
that an RDMA Write is insufficient.

Therefore, the only truly safe thing for an iWARP btl to do (or a
udapl btl since that is also an iWARP btl) is to have the active
layer send an MPI Layer "nop" of some kind immediately after
establishing the connection if there is nothing else to send.
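
A minimal sketch of the rule described above, for illustration only: when the connection comes up, the active side sends whatever is pending, or a "nop" if nothing is pending, and the passive side sends nothing. Every name here (struct endpoint, post_send, on_established, FRAG_NOP) is hypothetical and not taken from the OMPI, uDAPL or verbs sources; the counters stand in for real send work requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message kinds; not taken from any real BTL. */
enum frag_kind { FRAG_NOP, FRAG_DATA };

/* Hypothetical endpoint holding only what the sketch needs. */
struct endpoint {
    bool   is_initiator;  /* true on the side that issued the connect */
    size_t pending;       /* fragments queued before the connection was up */
    size_t sent_nop;      /* counters standing in for posted sends */
    size_t sent_data;
};

/* Stand-in for posting a send work request on the endpoint. */
static void post_send(struct endpoint *ep, enum frag_kind kind)
{
    assert(ep != NULL);
    if (kind == FRAG_NOP)
        ep->sent_nop++;
    else
        ep->sent_data++;
}

/* Called once when the connection-established event is delivered. */
static void on_established(struct endpoint *ep)
{
    if (!ep->is_initiator)
        return;                   /* passive side must not send first */

    if (ep->pending == 0) {
        post_send(ep, FRAG_NOP);  /* nothing queued: send a nop */
        return;
    }
    while (ep->pending > 0) {     /* otherwise flush the queued fragments */
        post_send(ep, FRAG_DATA);
        ep->pending--;
    }
}
```

In this sketch the nop is an ordinary send, so the passive side would need a receive posted to absorb it, which is the cost to the application layer that the thread discusses.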
From sashak at voltaire.com Wed May 9 14:27:40 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 10 May 2007 00:27:40 +0300 Subject: [ofa-general] Re: [PATCH TRIVIAL] opensm: make osm_node_destroy() static In-Reply-To: <1178543690.32222.350646.camel@hal.voltaire.com> References: <20070506174138.GI9692@sashak.voltaire.com> <20070506174431.GJ9692@sashak.voltaire.com> <1178543690.32222.350646.camel@hal.voltaire.com> Message-ID: <20070509212740.GV9692@sashak.voltaire.com> On 09:16 Mon 07 May , Hal Rosenstock wrote: > On Sun, 2007-05-06 at 13:44, Sasha Khapyorsky wrote: > > This makes locally used osm_node_destroy() function static > > > > Signed-off-by: Sasha Khapyorsky > > Thanks. Applied (to master only). > > Isn't the same applicable for the other osm_xxx_destroy functions ? Only for those osm_xxx objects which have dynamic constructors/destructors osm_xxx_new() and osm_xxx_delete(). > If > so, shouldn't they also be made static ? Yes, good idea. Sasha From jsquyres at cisco.com Wed May 9 14:22:44 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 9 May 2007 17:22:44 -0400 Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <464232DC.9010201@Sun.COM> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM> Message-ID: <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> I talked with Steve a bunch on the phone about this. 1. This "connector must RDMA first" issue is an iWARP restriction -- it's not specific to udapl or verbs. 
For example, if you try to use udapl with iWARP on Solaris, you'll have the same issue (I have no idea whether you have iWARP drivers in Solaris or not). 2. Per his prior e-mail (which I didn't fully grok until I talked to him), using the RDMA CM in the openib BTL will not magically fix this issue for us. 3. So for any of the BTLs to support iWARP -- regardless of underlying protocol or OS -- they are going to have to obey this restriction. 4. Luckily, in iWARP, the restriction can be met by either send/ receive semantics *or* RDMA semantics. You don't have to specifically use RDMA verbs semantics, for example. This is good because of the way that OMPI works (the first fragment that will be transmitted is pretty much guaranteed to be a send/receive fragment, not an RDMA fragment) -- it makes the logistics slightly simpler. Galen Shipman and I talked about this a bit and suggest the following: - During the connection dance (probably for both the udapl and openib BTLs), whichever peer ends up being the connection initiator (don't forget about the race condition where 2 peers may simultaneously decide to initiate -- this case is handled properly in the OMPI code; but just make sure you modify the side that ends up being actual initiator), they can send their pending fragment immediately (and Steve is right that there will always be a pending fragment, because OMPI doesn't make a connection until the first send). - The other peer (the receiver of the connection) must wait to send its pending fragment(s) until it receives the first frag from the connection initiator. This can be accomplished either with another flag on the OMPI module struct or perhaps making it part of the connection protocol (i.e., don't transition the endpoint to be CONNECTED until the first fragment is received). Either of which can be used to queue up fragments on the receiver until the first fragment is received from the initiator. 
I'd have to look in the code deeper, but I'm *guessing* that it might be best to use the already-existing state flag (i.e., checking for CONNECTED) because then you won't be introducing any more conditionals in the critical path. On May 9, 2007, at 4:45 PM, Donald Kerr wrote: > I guess I have not read enough about iwarp yet but if iwarp is sitting > below ib verbs or udapl in the stack and is trying to impose > restrictions which ib verbs or udapl do not adhere to then maybe iwarp > is in the wrong place in the ofed stack. > > Having said that I do agree the OMPI community needs to consider where > iwarp plays in its own stack. If it has not already. > > Steve Wise wrote: > >> On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote: >> >> >>> So then I agree with Andrew, I think you are trying to impose >>> restrictions on uDAPL which are not part of the Spec. >>> >>> >>> >> >> true, but if you want a single btl for IB and IW, then you'll need to >> address this issue in some way... >> >> >> _______________________________________________ >> devel mailing list >> devel at open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> > _______________________________________________ > devel mailing list > devel at open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems From caitlinb at broadcom.com Wed May 9 14:33:38 2007 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 9 May 2007 14:33:38 -0700 Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D03C331D9@NT-IRVA-0750.brcm.ad.broadcom.com> Jeff Squyres wrote: > > - The other peer (the receiver of the connection) must wait > to send its pending fragment(s) until it receives the first > frag from the connection initiator. 
This can be accomplished
> either with another flag on the OMPI module struct or perhaps
> making it part of the connection protocol (i.e., don't
> transition the endpoint to be CONNECTED until the first
> fragment is received). Either of which can be used to queue
> up fragments on the receiver until the first fragment is
> received from the initiator. I'd have to look in the code
> deeper, but I'm *guessing* that it might be best to use the
> already-existing state flag (i.e., checking for CONNECTED)
> because then you won't be introducing any more conditionals
> in the critical path.
>

The transport provider has several options on ensuring that
the passive side does not put a message on the wire before
the first message is received.

What the transport layer cannot do is create the first message
from the active side. Because it will have send/recv semantics
it will complete a receive work request, which the application
layer has to post with that expectation.

This nop does not have to be visible above OMPI, but I'm pretty
sure OMPI has to generate it. That isn't exactly fair to the
application layer, but the RDMAC verbs are water under the
bridge. Assuming OMPI wants to work with *any* iWARP RNIC
then it needs to ensure that the active side will send something
promptly in all cases.

From jsquyres at cisco.com Wed May 9 14:38:02 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 9 May 2007 17:38:02 -0400
Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
In-Reply-To: <1EF1E44200D82B47BD5BA61171E8CE9D03C331D9@NT-IRVA-0750.brcm.ad.broadcom.com>
References: <1EF1E44200D82B47BD5BA61171E8CE9D03C331D9@NT-IRVA-0750.brcm.ad.broadcom.com>
Message-ID:

Understood, and I agree.

FWIW: note that the CONNECTED state that I referred to is internal to OMPI's endpoint abstraction (not an iwarp/udapl/verbs/etc. state). It's part of our connection dance protocol.
On May 9, 2007, at 5:33 PM, Caitlin Bestler wrote: > Jeff Squyres wrote: > >> >> - The other peer (the receiver of the connection) must wait >> to send its pending fragment(s) until it receives the first >> frag from the connection initiator. This can be accomplished >> either with another flag on the OMPI module struct or perhaps >> making it part of the connection protocol (i.e., don't >> transition the endpoint to be CONNECTED until the first >> fragment is received). Either of which can be used to queue >> up fragments on the receiver until the first fragment is >> received from the initiator. I'd have to look in the code >> deeper, but I'm *guessing* that it might be best to use the >> already-existing state flag (i.e., checking for CONNECTED) >> because then you won't be introducing any more conditionals >> in the critical path. >> > > The transport provider has several options on ensuring that > the passive side does not put a message on the wire before > the first message is received. > > What the transport layer cannot do is create the first message > from the active side. Because it will have send/recv semantics > it will complete a receive work request, which the application > layer has to post with that expectation. > > this nop does not have to be visible above OMPI, but I'm pretty > sure OMPI has to generate it. That isn't exactly fair to the > application layer, but the RDMAC verbs are water under the > bridge. Assuming OMPI wants to work with *any* iWARP RNIC > then it needs to ensure that the active side will send something > promptly in all cases. 
> > -- Jeff Squyres Cisco Systems From swise at opengridcomputing.com Wed May 9 14:44:52 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 09 May 2007 16:44:52 -0500 Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: <46425627.8000903@open-mpi.org> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46425627.8000903@open-mpi.org> Message-ID: <1178747092.382.125.camel@stevo-desktop> On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote: > > Steve Wise wrote: > > There have been a series of discussions on the ofa general list about > > this issue, and the conclusion to date is that it cannot be resolved in > > the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly because > > sending an RDMA message involves the ULP's work queue and completion > > queue, so the CM cannot do this under the covers in a mannor that > > doesn't affect the application. Thus, the applications must deal with > > this. > > Why can't uDAPL deal with this? As a uDAPL user, I really don't care > what API uDAPL is using under the hood to move data from one place to > another, nor the quirks of that API. The whole point of uDAPL is to > form a network-agnostic abstraction layer. AFAIK, the uDAPL spec > doesn't enforce any such requirement on RDMA communication either. In > my opinion, exposing such behavior above uDAPL is incorrect and is part > of why uDAPL has seen limited adoption -- every single uDAPL > implementation behaves in different ways, making it extremely difficult > to write an application to work on any uDAPL implementation. 
Sorry if
> this sounds harsh, but this comes from many hours of banging my head on
> the wall due to working around these sorts of problems :)
>

I understand your frustration. I think the MPA protocol is deficient in
this respect and should have required the necessary "first FPDU" to be
sent under the covers by the RNICs. An RTR packet if you will. To
resolve this issue "properly", in my opinion, would involve changing the
IETF MPA spec and also breaking all the existing iwarp HW. We can't do
that.

The reason it is hard or impossible to solve this in the DAPL layer is
that any rdma operation on the QP affects the state of that QP and the
associated CQs. In addition, if you use an RDMA send to enforce this you
impact the other side by consuming a RECV buffer. So it's hard if not
impossible to do this under the covers without affecting the
application's resources.

Also, the DAPL specification had a goal to not impose any additional
protocol on the wire. If you add this under the covers, then you add
such a "protocol" and break interoperability between a connection
accessed via DAPL on one end and some other API on the other end.

> >
> > Here is a possible solution:
> >
> > I assume in OMPI that connections are only initiated when the mpi
> > application does a send operation. Given that, then udapl btl must
> > ensure that if a given rank accepts a connection, it cannot send
> > anything until the rank at the other end of the connection sends first.
> > Since the other side initiated the connection, it will have pending data
> > to send...
> >
> > I haven't looked into how painful this will be to implement.
> >
> > Thoughts?
>
> Following on what I wrote above, I think Open MPI is the wrong place to
> be dealing with this. There's enough of these hacks as it is; I'm not
> interested in seeing more get added.
>

Unfortunately, I haven't been able to come up with a solution that works
with existing iWARP HW and is interoperable.

Steve.
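
For illustration, the "possible solution" quoted above (the accepting rank holds its sends until the initiator has sent first) might look roughly like the following. This is a sketch under stated assumptions, not the OMPI connection code: the state names, struct gated_ep and the ep_* functions are all hypothetical, and counters stand in for real posted sends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical endpoint states; the names are illustrative only. */
enum ep_state {
    EP_CONNECTING,       /* transport connection not yet up */
    EP_WAIT_FIRST_FRAG,  /* accepted side: up, but must not send yet */
    EP_CONNECTED         /* free to send */
};

struct gated_ep {
    enum ep_state state;
    bool   accepted;     /* true if this side accepted the connection */
    size_t queued;       /* fragments held back */
    size_t sent;         /* stands in for posted sends */
};

/* Post everything that was held back; only legal once CONNECTED. */
static void flush_queue(struct gated_ep *ep)
{
    assert(ep->state == EP_CONNECTED);
    ep->sent += ep->queued;
    ep->queued = 0;
}

/* Connection-established event from the transport. */
static void ep_established(struct gated_ep *ep)
{
    if (ep->accepted) {
        ep->state = EP_WAIT_FIRST_FRAG;  /* wait for the initiator */
    } else {
        ep->state = EP_CONNECTED;        /* initiator may send at once */
        flush_queue(ep);
    }
}

/* Upper layer asks to send one fragment. */
static void ep_send(struct gated_ep *ep)
{
    if (ep->state == EP_CONNECTED)
        ep->sent++;
    else
        ep->queued++;                    /* hold it until allowed */
}

/* A fragment arrived from the peer. */
static void ep_recv(struct gated_ep *ep)
{
    if (ep->state == EP_WAIT_FIRST_FRAG) {
        ep->state = EP_CONNECTED;        /* initiator has sent first */
        flush_queue(ep);
    }
}
```

The send path in this sketch tests a single state value, so the accepted side adds no extra conditional beyond the existing connected check; whether that matches the real code path is an assumption.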
From afriedle at open-mpi.org Wed May 9 17:46:12 2007 From: afriedle at open-mpi.org (Andrew Friedley) Date: Wed, 09 May 2007 17:46:12 -0700 Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <1EF1E44200D82B47BD5BA61171E8CE9D03C331A7@NT-IRVA-0750.brcm.ad.broadcom.com> References: <1EF1E44200D82B47BD5BA61171E8CE9D03C331A7@NT-IRVA-0750.brcm.ad.broadcom.com> Message-ID: <46426B54.3020105@open-mpi.org> > Therefore, the only truly safe thing for an iWARP btl to do (or a > udapl btl since that is also an iWARP btl) is to have the active > layer send an MPI Layer "nop" of some kind immediately after > establishing the connection if there is nothing else to send. This is fine for an iWARP/RDMACM/whatever BTL (or anything else that uses the OFA verbs interface(s)), but my argument is that uDAPL is NOT specifically there to support just iWARP (though it may include it), and that OFED's uDAPL should be adjusted to handle this. Again, uDAPL is a network *independent* abstraction, so requiring network-dependent behavior from the uDAPL consumer is wrong. A related question -- how does this 'connection initiator must send first' requirement relate to UD? Andrew From caitlinb at broadcom.com Wed May 9 14:54:53 2007 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 9 May 2007 14:54:53 -0700 Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <46426B54.3020105@open-mpi.org> Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D03C331F4@NT-IRVA-0750.brcm.ad.broadcom.com> general-bounces at lists.openfabrics.org wrote: >> Therefore, the only truly safe thing for an iWARP btl to do (or a >> udapl btl since that is also an iWARP btl) is to have the active >> layer send an MPI Layer "nop" of some kind immediately after >> establishing the connection if there is nothing else to send. 
> > This is fine for an iWARP/RDMACM/whatever BTL (or anything > else that uses the OFA verbs interface(s)), but my argument > is that uDAPL is NOT specifically there to support just iWARP > (though it may include it), and that OFED's uDAPL should be > adjusted to handle this. Again, uDAPL is a network > *independent* abstraction, so requiring network-dependent > behavior from the uDAPL consumer is wrong. > DAPL strives to define network independent solutions. In this case the network independent solution is that the active side *always* sends the first message. This works for both iWARP and InfiniBand. And away from the HPC market it is almost a non-requirement (which is why the RDMAC managed to goof on this in its specification. A zero-length RDMA Write is enough to deal with the wire protocol problem, but people implemented to the RDMAC verbs.) > > A related question -- how does this 'connection initiator > must send first' requirement relate to UD? > iWARP UD is called UDP. It has nothing to do with MPA or RDMA. An API that mapped to either IB UD or UDP is definitely feasible, but hasn't been important enough to anyone to draft as of yet. 
From afriedle at open-mpi.org Wed May 9 17:55:52 2007 From: afriedle at open-mpi.org (Andrew Friedley) Date: Wed, 09 May 2007 17:55:52 -0700 Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: <1178747092.382.125.camel@stevo-desktop> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46425627.8000903@open-mpi.org> <1178747092.382.125.camel@stevo-desktop> Message-ID: <46426D98.1030406@open-mpi.org> Steve Wise wrote: > On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote: >> Steve Wise wrote: >>> There have been a series of discussions on the ofa general list about >>> this issue, and the conclusion to date is that it cannot be resolved in >>> the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly because >>> sending an RDMA message involves the ULP's work queue and completion >>> queue, so the CM cannot do this under the covers in a mannor that >>> doesn't affect the application. Thus, the applications must deal with >>> this. >> Why can't uDAPL deal with this? As a uDAPL user, I really don't care >> what API uDAPL is using under the hood to move data from one place to >> another, nor the quirks of that API. The whole point of uDAPL is to >> form a network-agnostic abstraction layer. AFAIK, the uDAPL spec >> doesn't enforce any such requirement on RDMA communication either. In >> my opinion, exposing such behavior above uDAPL is incorrect and is part >> of why uDAPL has seen limited adoption -- every single uDAPL >> implementation behaves in different ways, making it extremely difficult >> to write an application to work on any uDAPL implementation. 
Sorry if >> this sounds harsh, but this comes from many hours of banging my head on >> the wall due to working around these sorts of problems :) >> > > I understand your frustration. I think the MPA protocol is deficient in > this respect and should have required the necessary "first FPDU" to be > sent under the covers by the RNICs. A RTR packet if you will. To > resolve this issue "properly", in my opinion, would involve changing the > IETF MPA spec and also breaking all the existing iwarp HW. We can't do > that. Understood. > The reason it is hard or impossible to solve this in the DAPL layer is > that any rdma operation on the QP affects the state of that QP and the > associate CQs. In addition, if you use an RDMA send to enforce this you > impact the other side by consuming a RECV buffer. So its hard if not > impossible to do this under the covers without affecting the > application's resources. Is there no way to do this before passing connection established events to the uDAPL consumer? I need to go read up on the uDAPL API to really understand why this wouldn't work. > > Also, the DAPL specification had a goal to not impose any additional > protocol on the wire. If you add this under the covers, then you add > such a "protocol" and break interoperability between a connection > accessed via DAPL on one end and some other API on the other end. So I guess there's no 'right' solution, at least at the uDAPL level. With RDMACM/OFA verbs, there's at least the argument that you can design the API/semantics however you please, while uDAPL is already standardized. I hope you guys are documenting this in a way that makes this issue extremely clear to both uDAPL and OFA verbs (is this the right naming?) users. Maybe it's been done already, but is it possible to emit some sort of loud warning/error when the accept()'ing side tries to send before a receive? 
Andrew

From swise at opengridcomputing.com Wed May 9 14:56:46 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 09 May 2007 16:56:46 -0500
Subject: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
In-Reply-To: <46426B54.3020105@open-mpi.org>
References: <1EF1E44200D82B47BD5BA61171E8CE9D03C331A7@NT-IRVA-0750.brcm.ad.broadcom.com> <46426B54.3020105@open-mpi.org>
Message-ID: <1178747806.382.128.camel@stevo-desktop>

On Wed, 2007-05-09 at 17:46 -0700, Andrew Friedley wrote:
> > Therefore, the only truly safe thing for an iWARP btl to do (or a
> > udapl btl since that is also an iWARP btl) is to have the active
> > layer send an MPI Layer "nop" of some kind immediately after
> > establishing the connection if there is nothing else to send.
>
> This is fine for an iWARP/RDMACM/whatever BTL (or anything else that
> uses the OFA verbs interface(s)), but my argument is that uDAPL is NOT
> specifically there to support just iWARP (though it may include it), and
> that OFED's uDAPL should be adjusted to handle this. Again, uDAPL is a
> network *independent* abstraction, so requiring network-dependent
> behavior from the uDAPL consumer is wrong.
>
> A related question -- how does this 'connection initiator must send
> first' requirement relate to UD?
>

It doesn't. UD isn't supported in iWARP.

From ossrosch at linux.vnet.ibm.com Wed May 9 14:57:59 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Wed, 9 May 2007 23:57:59 +0200
Subject: [ofa-general] Re: Build problem with RHEL-4.5 and OFED-1.2
In-Reply-To: <1178737535.2848.152.camel@fc6.xsintricity.com>
References: <200705091824.54394.ossrosch@linux.vnet.ibm.com> <1178737535.2848.152.camel@fc6.xsintricity.com>
Message-ID: <200705092357.59973.ossrosch@linux.vnet.ibm.com>

On Wednesday 09 May 2007 21:05, Doug Ledford wrote:
> On Wed, 2007-05-09 at 18:24 +0200, Stefan Roscher wrote:
> > Hi Doug,
> >
> > I installed RHEL-4.5 on one of our ppc64 systems and recognized that asm-ppc
> > directory is missing in /usr/src/kernels/2.6.9-55.EL/include.
> > Normally I don't need this directory, but ibmebus.h includes
> > asm-ppc64/of_device.h. And there asm-ppc64/of_device.h includes
> > asm-ppc/of_device.h. Because this file is missing I can not build
> > ehca and ofed stack with ofed-1.2 daily build from today.
> >
> > Did I make something wrong during installation?
> >
> > Regards Stefan Roscher
>
> I'll look into it, but in the meantime, install the kernel src.rpm, go
> into /usr/src/redhat/SPEC and run rpmbuild --bp kernel-2.6.spec and it
> should create a complete source tree
> in /usr/src/redhat/BUILD/kernel-2.6.18 that you can then get the asm-ppc
> directory contents out of.
>
> --
> Doug Ledford
> GPG KeyID: CFBFF194
> http://people.redhat.com/dledford
>
> Infiniband specific RPMs available at
> http://people.redhat.com/dledford/Infiniband
>

To create the backport patches for RHEL 4.5 I did what you describe, but the OFED build scripts don't use the kernel sources in /usr/src/redhat/BUILD. OFED-1.2 uses the source link under /lib/modules/kernel-x.x.x, which points into /usr/src/kernel; these kernel sources were created during the installation of RHEL-4.5. In this kernel source tree the directory include/asm-ppc is missing.
That is why I did not run into this problem while creating the backport patches.

regards
stefan

From mshefty at ichips.intel.com Wed May 9 15:01:12 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 09 May 2007 15:01:12 -0700
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178747092.382.125.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46425627.8000903@open-mpi.org> <1178747092.382.125.camel@stevo-desktop>
Message-ID: <464244A8.4070406@ichips.intel.com>

> The reason it is hard or impossible to solve this in the DAPL layer is
> that any rdma operation on the QP affects the state of that QP and the
> associate CQs. In addition, if you use an RDMA send to enforce this you
> impact the other side by consuming a RECV buffer. So its hard if not
> impossible to do this under the covers without affecting the
> application's resources.

I agree that this is hard, but I don't believe that it's impossible.

> Also, the DAPL specification had a goal to not impose any additional
> protocol on the wire. If you add this under the covers, then you add
> such a "protocol" and break interoperability between a connection
> accessed via DAPL on one end and some other API on the other end.

IMO, this is an unrealized dream. DAPL does generate wire protocol. For example, when running over IB, DAPL's selection of a service ID and CM protocol is visible on the wire. A DAPL that establishes connections using the RDMA CM will likely have a different wire protocol than a version of DAPL that establishes connections talking directly to the IB CM.
The two DAPLs will not interoperate unless they agree on how they will map to service IDs and, in the case of using the RDMA CM, the format of the private data carried in the CM messages. Even in the case of iWarp, DAPL's selection of a local port number affects the data visible on the wire. TO communicate, a remote end point must know how this mapping occurs. - Sean From caitlinb at broadcom.com Wed May 9 15:03:08 2007 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 9 May 2007 15:03:08 -0700 Subject: [ofa-general] RE: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: <46425627.8000903@open-mpi.org> Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D03C33203@NT-IRVA-0750.brcm.ad.broadcom.com> devel-bounces at open-mpi.org wrote: > Steve Wise wrote: >> There have been a series of discussions on the ofa general list about >> this issue, and the conclusion to date is that it cannot be resolved >> in the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly >> because sending an RDMA message involves the ULP's work queue and >> completion queue, so the CM cannot do this under the covers in a >> mannor that doesn't affect the application. Thus, the applications >> must deal with this. > > Why can't uDAPL deal with this? As a uDAPL user, I really > don't care what API uDAPL is using under the hood to move > data from one place to another, nor the quirks of that API. > The whole point of uDAPL is to form a network-agnostic > abstraction layer. AFAIK, the uDAPL spec doesn't enforce any > such requirement on RDMA communication either. In my > opinion, exposing such behavior above uDAPL is incorrect and > is part of why uDAPL has seen limited adoption -- every > single uDAPL implementation behaves in different ways, making > it extremely difficult to write an application to work on any > uDAPL implementation. 
Sorry if this sounds harsh, but this > comes from many hours of banging my head on the wall due to > working around these sorts of problems :) > The simple answer is that uDAPL cannot deal with this. The RDMAC verbs specification was overly focused on client/server and therefore did not realize that there was any harm in requiring that the active side did the first send. But given that DAPL could not rewrite either the RDMAC or InfiniBand verbs it had to come up with the best solution that matched the verbs as they were. One of the explicit ground rules was that DAPL MUST support all RDMA devices that were IBTA or RDMAC compliant. Given those rules, if the active side does not send a message the passive side might be held off indefinitely, and sending a message cause consumption of a receive buffer and therefore cannot be transparent to the uDAPL consumer. Given those constraints there is literally nothing that can be done to work around this problem by either DAPL or OFA. From swise at opengridcomputing.com Wed May 9 15:15:15 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 09 May 2007 17:15:15 -0500 Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: <46426D98.1030406@open-mpi.org> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46425627.8000903@open-mpi.org> <1178747092.382.125.camel@stevo-desktop> <46426D98.1030406@open-mpi.org> Message-ID: <1178748915.382.145.camel@stevo-desktop> On Wed, 2007-05-09 at 17:55 -0700, Andrew Friedley wrote: > > Steve Wise wrote: > > On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote: > >> Steve Wise wrote: > >>> There have been a series of discussions on the ofa general list about > >>> this issue, and the conclusion to date is 
that it cannot be resolved in > >>> the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly because > >>> sending an RDMA message involves the ULP's work queue and completion > >>> queue, so the CM cannot do this under the covers in a mannor that > >>> doesn't affect the application. Thus, the applications must deal with > >>> this. > >> Why can't uDAPL deal with this? As a uDAPL user, I really don't care > >> what API uDAPL is using under the hood to move data from one place to > >> another, nor the quirks of that API. The whole point of uDAPL is to > >> form a network-agnostic abstraction layer. AFAIK, the uDAPL spec > >> doesn't enforce any such requirement on RDMA communication either. In > >> my opinion, exposing such behavior above uDAPL is incorrect and is part > >> of why uDAPL has seen limited adoption -- every single uDAPL > >> implementation behaves in different ways, making it extremely difficult > >> to write an application to work on any uDAPL implementation. Sorry if > >> this sounds harsh, but this comes from many hours of banging my head on > >> the wall due to working around these sorts of problems :) > >> > > > > I understand your frustration. I think the MPA protocol is deficient in > > this respect and should have required the necessary "first FPDU" to be > > sent under the covers by the RNICs. A RTR packet if you will. To > > resolve this issue "properly", in my opinion, would involve changing the > > IETF MPA spec and also breaking all the existing iwarp HW. We can't do > > that. > > Understood. > > > The reason it is hard or impossible to solve this in the DAPL layer is > > that any rdma operation on the QP affects the state of that QP and the > > associate CQs. In addition, if you use an RDMA send to enforce this you > > impact the other side by consuming a RECV buffer. So its hard if not > > impossible to do this under the covers without affecting the > > application's resources. 
>
> Is there no way to do this before passing connection established events
> to the uDAPL consumer? I need to go read up on the uDAPL API to really
> understand why this wouldn't work.
>

Perhaps the dapl or maybe even an OFA iWARP CM could defer passing up the "established" event on the passive side until an incoming SEND is detected. I know we've discussed this before, but I'm not sure why this was not a workable solution. Perhaps Caitlin or some iwarp folks can recall?

> >
> > Also, the DAPL specification had a goal to not impose any additional
> > protocol on the wire. If you add this under the covers, then you add
> > such a "protocol" and break interoperability between a connection
> > accessed via DAPL on one end and some other API on the other end.
>
> So I guess there's no 'right' solution, at least at the uDAPL level.
> With RDMACM/OFA verbs, there's at least the argument that you can design
> the API/semantics however you please, while uDAPL is already standardized.

Yes, but it's still difficult to post a SEND under the covers because it consumes the application's resources in the form of QP and CQ space and a RECV buffer. So to date, we have... punted and pushed the problem to the ULP.

>
> I hope you guys are documenting this in a way that makes this issue
> extremely clear to both uDAPL and OFA verbs (is this the right naming?)
> users. Maybe it's been done already, but is it possible to emit some
> sort of loud warning/error when the accept()'ing side tries to send
> before a receive?
>

The connection comes tumbling down. How's that for loud? :)

Seriously though, it isn't documented well enough. But we're bleeding edge here. And I'm still hoping somebody will come up with an elegant solution that doesn't break interoperability, applications and/or iwarp hw (I'm a dreamer :).

Steve.
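The idea floated above, deferring the "established" event on the passive side until the first incoming SEND is detected, can be sketched in C as below. This is only an illustration under stated assumptions: `passive_conn` and `conn_should_raise_established()` are hypothetical names, not part of the OFA iWARP CM or of uDAPL, and the sketch covers the passive side only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical passive-side connection record; none of these names come
 * from the OFA CM or from uDAPL. */
struct passive_conn {
    bool handshake_done;     /* connection setup has completed */
    bool first_msg_arrived;  /* the active side's first message was received */
    bool established_raised; /* the event was already delivered upward */
};

/* Returns true exactly once: when setup is complete, the first inbound
 * message has been seen, and the event has not been delivered yet.
 * The caller would pass the "established" event up at that point. */
bool conn_should_raise_established(struct passive_conn *c)
{
    assert(c != NULL);
    if (c->established_raised || !c->handshake_done || !c->first_msg_arrived)
        return false;
    c->established_raised = true;
    return true;
}
```

The check would be called both when setup completes and when a message arrives, so the event fires on whichever happens last.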
From swise at opengridcomputing.com Wed May 9 15:18:00 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 09 May 2007 17:18:00 -0500 Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: <464244A8.4070406@ichips.intel.com> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46425627.8000903@open-mpi.org> <1178747092.382.125.camel@stevo-desktop> <464244A8.4070406@ichips.intel.com> Message-ID: <1178749080.382.148.camel@stevo-desktop> On Wed, 2007-05-09 at 15:01 -0700, Sean Hefty wrote: > > The reason it is hard or impossible to solve this in the DAPL layer is > > that any rdma operation on the QP affects the state of that QP and the > > associate CQs. In addition, if you use an RDMA send to enforce this you > > impact the other side by consuming a RECV buffer. So its hard if not > > impossible to do this under the covers without affecting the > > application's resources. > > I agree that this is hard, but I don't believe that it's impossible. > > > Also, the DAPL specification had a goal to not impose any additional > > protocol on the wire. If you add this under the covers, then you add > > such a "protocol" and break interoperability between a connection > > accessed via DAPL on one end and some other API on the other end. > > IMO, this is a unrealized dream. DAPL does generate wire protocol. For > example, when running over IB, DAPL's selection of a service ID and CM protocol > is visible on the wire. A DAPL that establishes connections using the RDMA CM > will likely have a different wire protocol than a version of DAPL that > establishes connections talking directly to the IB CM. 
The two DAPLs will not > interoperate unless they agree on how they will map to service IDs and, in the > case of using the RDMA CM, the format of the private data carried in the CM > messages. I wasn't aware of this. > > Even in the case of iWarp, DAPL's selection of a local port number affects the > data visible on the wire. TO communicate, a remote end point must know how this > mapping occurs. You mean the local port on the active side? The remote end point doesn't need to know this at all... Steve. From caitlinb at broadcom.com Wed May 9 15:25:06 2007 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 9 May 2007 15:25:06 -0700 Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened In-Reply-To: <1178748915.382.145.camel@stevo-desktop> Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D03C3322A@NT-IRVA-0750.brcm.ad.broadcom.com> general-bounces at lists.openfabrics.org wrote: > On Wed, 2007-05-09 at 17:55 -0700, Andrew Friedley wrote: >> >> Steve Wise wrote: >>> On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote: >>>> Steve Wise wrote: >>>>> There have been a series of discussions on the ofa general list >>>>> about this issue, and the conclusion to date is that it cannot be >>>>> resolved in the rdma-cm or iwarp-cm code of the linux rdma stack. >>>>> Mainly because sending an RDMA message involves the ULP's work >>>>> queue and completion queue, so the CM cannot do this under the >>>>> covers in a mannor that doesn't affect the application. > Thus, the >>>>> applications must deal with this. >>>> Why can't uDAPL deal with this? As a uDAPL user, I really don't >>>> care what API uDAPL is using under the hood to move data from one >>>> place to another, nor the quirks of that API. The whole point of >>>> uDAPL is to form a network-agnostic abstraction layer. AFAIK, the >>>> uDAPL spec doesn't enforce any such requirement on RDMA >>>> communication either. 
In my opinion, exposing such behavior above >>>> uDAPL is incorrect and is part of why uDAPL has seen limited >>>> adoption -- every single uDAPL implementation behaves in different >>>> ways, making it extremely difficult to write an application to work >>>> on any uDAPL implementation. Sorry if this sounds harsh, but this >>>> comes from many hours of banging my head on the wall due to working >>>> around these sorts of problems :) >>>> >>> >>> I understand your frustration. I think the MPA protocol is >>> deficient in this respect and should have required the necessary >>> "first FPDU" to be sent under the covers by the RNICs. A RTR packet >>> if you will. To resolve this issue "properly", in my opinion, would >>> involve changing the IETF MPA spec and also breaking all the >>> existing iwarp HW. We can't do that. >> >> Understood. >> >>> The reason it is hard or impossible to solve this in the DAPL layer >>> is that any rdma operation on the QP affects the state of that QP >>> and the associate CQs. In addition, if you use an RDMA send to >>> enforce this you impact the other side by consuming a RECV buffer. >>> So its hard if not impossible to do this under the covers without >>> affecting the application's resources. >> >> Is there no way to do this before passing connection established >> events to the uDAPL consumer? I need to go read up on the uDAPL API >> to really understand why this wouldn't work. >> > > Perhaps the dapl or maybe even a OFA iWARP CM could defer > passing up the "established" event on the passive side until > an incoming SEND is detected. I know we've discussed this > before, but I'm not sure why this was not a workable > solution. Perhaps Caitlin or some iwarp folks can recall? > That was what the RNIC-PI flag would have enabled. 
DAPL could check for that flag in a transport/device independent way, and delay the established event until it was safe to post (but no longer than required; for IB and iWARP NICs that fenced the first transmit, the Established Event could be generated immediately).

So yes, the transport layer (OFA or DAPL) CAN hide this on the passive side. But as you point out, that doesn't solve the problem of needing the Send from the active side. Since the Consumer posts RECV buffers *before* indicating whether the QP/EP will be used on the passive or active end, and there are no standard verbs to jam a receive buffer to the head of an RQ, there is no way to hide a send/recv exchange from the application layer.

The fact that it can't be made transparent on the active side certainly diminishes the value of making it transparent on the receive side. It's still a good idea, but I don't think it has percolated to the top of anyone's TODO list yet. When it does, the RNIC-PI proposed flag is a simple capability flag that is quite easy for any provider to statically set.

From afriedle at open-mpi.org Wed May 9 18:26:14 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Wed, 09 May 2007 18:26:14 -0700
Subject: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened
In-Reply-To: <1178748915.382.145.camel@stevo-desktop>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46425627.8000903@open-mpi.org> <1178747092.382.125.camel@stevo-desktop> <46426D98.1030406@open-mpi.org> <1178748915.382.145.camel@stevo-desktop>
Message-ID: <464274B6.2030508@open-mpi.org>

Steve Wise wrote:
>> I hope you guys are documenting this in a way that makes this issue
>> extremely clear to both uDAPL and OFA verbs (is this the right naming?)
>> users.
Maybe it's been done already, but is it possible to emit some >> sort of loud warning/error when the accept()'ing side tries to send >> before a receive? >> > > The connection comes tumbling down. How's that for loud? :) works :) > Seriously though, it isn't documented well enough. But we're bleeding > edge here. And I'm still hoping somebody will come up with an elegant > solution that doesn't break interoperability, applications and/or iwarp > hw (i'm a dreamer :). Well, if documenting it once saves someone a headache and a few hours of their time, it's probably worth it. Seems like everyone understands now what the problem is, that it sucks, and it can't be fixed lower down the stack :) Thanks for explaining Caitlin/Steve. As Jeff wrote, dealing with it in the BTLs really won't be that hard, just makes things a little more complicated to maintain. Andrew From sweitzen at cisco.com Wed May 9 16:45:33 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 9 May 2007 16:45:33 -0700 Subject: [ofa-general] RE: [PATCH] ipoib/cm: make stale task actually run once in a while In-Reply-To: <20070507200315.GD22341@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C90BFCB3@mtlexch01.mtl.com><6C2C79E72C305246B504CBA17B5500C9076E27@mtlexch01.mtl.com> <20070507200315.GD22341@mellanox.co.il> Message-ID: I see a new patch ipoib_correct_timers.patch in OFED-1.2-20070509-0600, which patch should I try? Scott > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] > Sent: Monday, May 07, 2007 1:03 PM > To: Scott Weitzenkamp (sweitzen) > Cc: Yohad Dickman; Amit Krig; Tziporet Koren; > mst at mellanox.co.il; general at lists.openfabrics.org; Roland Dreier > Subject: [PATCH] ipoib/cm: make stale task actually run once > in a while > > In the presence of some active passive connections, stale > task would never run, > since each 4 RX CQEs we repeat queue_delayed_work calls which > delays it for some > 10 minutes. 
As a result, on a noisy system with failing > ports, we slowly run > out of resources - slowing connection setup down and > eventually failing. > > What we actually want to do is - start stale task when a first > passive connection is added, rerun it every 10 min as long > as there are outstanding passive connections. > > As a happy side effect, this removes some code from RX data path. > > Signed-off-by: Michael S. Tsirkin > > --- > > Scott, I think this might address bugs 541 and 465: slow > IPoIB CM HA failover > and eventual failing IPoIB HA. Could you test this please? > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c > b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > index 2b242a4..b77e8d7 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > @@ -258,10 +258,11 @@ static int ipoib_cm_req_handler(struct > ib_cm_id *cm_id, struct ib_cm_event *even > cm_id->context = p; > p->jiffies = jiffies; > spin_lock_irqsave(&priv->lock, flags); > + if (list_empty(&priv->cm.passive_ids)) > + queue_delayed_work(ipoib_workqueue, > + &priv->cm.stale_task, > IPOIB_CM_RX_DELAY); > list_add(&p->list, &priv->cm.passive_ids); > spin_unlock_irqrestore(&priv->lock, flags); > - queue_delayed_work(ipoib_workqueue, > - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > return 0; > > err_rep: > @@ -380,8 +381,6 @@ void ipoib_cm_handle_rx_wc(struct > net_device *dev, struct ib_wc *wc) > if (!list_empty(&p->list)) > list_move(&p->list, > &priv->cm.passive_ids); > spin_unlock_irqrestore(&priv->lock, flags); > - queue_delayed_work(ipoib_workqueue, > - > &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > } > } > > @@ -1104,6 +1103,10 @@ static void ipoib_cm_stale_task(struct > work_struct *work) > kfree(p); > spin_lock_irqsave(&priv->lock, flags); > } > + > + if (!list_empty(&priv->cm.passive_ids)) > + queue_delayed_work(ipoib_workqueue, > + &priv->cm.stale_task, > IPOIB_CM_RX_DELAY); > spin_unlock_irqrestore(&priv->lock, flags); > } > > -- > MST > From 
rdreier at cisco.com Wed May 9 20:00:58 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 09 May 2007 20:00:58 -0700 Subject: [ofa-general] Re: [PATCH] IB/core user memory registrations In-Reply-To: <1178698393.26046.8.camel@mtls03> (Eli Cohen's message of "Wed, 09 May 2007 11:12:43 +0300") References: <1178698393.26046.8.camel@mtls03> Message-ID: > @@ -1020,7 +1020,7 @@ static struct ib_mr *mthca_reg_user_mr(s > int shift, n, len; > int i, j, k; > int err = 0; > - int write_mtt_size; > + int write_mtt_size = mthca_write_mtt_size(dev); > > mr = kmalloc(sizeof *mr, GFP_KERNEL); > if (!mr) Not sure I understand what this is fixing... can you be more explicit? As far as I can see, the first use of write_mtt_size in mthca_reg_user_mr() is the line write_mtt_size = min(mthca_write_mtt_size(dev), (int) (PAGE_SIZE / sizeof *pages)); so I'm not sure why we need another initialization? (I'm looking at Linus's latest tree, which contains the mlx4 merge) - R. From rdreier at cisco.com Wed May 9 20:02:35 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 09 May 2007 20:02:35 -0700 Subject: [ofa-general] [PATCH] IB/ipath -- shadow the gpio_mask register In-Reply-To: <20070509021904.GA16964@bauxite.pathscale.com> (Arthur Jones's message of "Tue, 8 May 2007 19:19:04 -0700") References: <20070508202557.27647.47035.stgit@bauxite.internal.keyresearch.com> <20070509021904.GA16964@bauxite.pathscale.com> Message-ID: > > A better changelog would be appreciated here... I can see deleting the > > unlikely() if it's no longer appropriate, but why keep a shadow copy > > of the register? Because this is now a hotter path and you want to > > avoid the MMIO read? > > exactly. shall i add that and resend? That would be great. Also I'm wondering what changed to make this a hotter path (just a reference to an earlier patch would be fine, or you could be more explicit). No rush because I'm traveling this week so I probably won't be able to apply anything until Monday anyway. - R. 
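The reason given above for shadowing the gpio_mask register (avoiding an MMIO read on a hot path) can be illustrated with a small C sketch. Everything here is an assumption for illustration: `fake_dev` simulates the register with an ordinary variable and counts reads, and none of the names are taken from the ipath driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device model: hw_gpio_mask stands in for an MMIO register
 * and mmio_reads counts how often it is read back. */
struct fake_dev {
    volatile uint64_t hw_gpio_mask; /* the simulated hardware register */
    uint64_t gpio_mask_shadow;      /* software copy of the last value written */
    unsigned mmio_reads;            /* number of simulated register reads */
};

/* Slow path: read the register itself. */
uint64_t gpio_mask_read_hw(struct fake_dev *d)
{
    assert(d != NULL);
    d->mmio_reads++;
    return d->hw_gpio_mask;
}

/* Set bits by updating the shadow, then writing it out; no read-back. */
void gpio_mask_set(struct fake_dev *d, uint64_t bits)
{
    assert(d != NULL);
    d->gpio_mask_shadow |= bits;
    d->hw_gpio_mask = d->gpio_mask_shadow;
}

/* Clear bits the same way. */
void gpio_mask_clear(struct fake_dev *d, uint64_t bits)
{
    assert(d != NULL);
    d->gpio_mask_shadow &= ~bits;
    d->hw_gpio_mask = d->gpio_mask_shadow;
}
```

The shadow is only valid if every write to the register goes through these helpers; a write that bypasses them would leave the two copies out of sync.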
From rdreier at cisco.com Wed May 9 20:03:03 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 09 May 2007 20:03:03 -0700 Subject: [ofa-general] Re: [PATCH 1/6] IB/ehca: Serialize hypervisor calls in ehca_register_mr() In-Reply-To: <200705091347.57470.fenkes@de.ibm.com> (Joachim Fenkes's message of "Wed, 9 May 2007 13:47:56 +0200") References: <200705091347.57470.fenkes@de.ibm.com> Message-ID: thanks, it all looks fine... I'll apply when I'm back from my trip on Monday. From rdreier at cisco.com Wed May 9 20:06:46 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 09 May 2007 20:06:46 -0700 Subject: [ofa-general] verbs abi_compat In-Reply-To: <20070509155939.17788.qmail@station183.com> (Jimmy Hill's message of "Wed, 09 May 2007 15:59:39 +0000") References: <20070509155939.17788.qmail@station183.com> Message-ID: > Under what conditions is the field abi_compat of struct ibv_context > set to non-zero? I'm encountering a situation where it is set > whencoding to verbs on a clean OFED 1.2 install. Seems odd that it > would be set since I suspected that it would only occur for verbs > 1.0/1.1 compatibility. Are you sure it's being set? I think most drivers just use malloc() to allocate the context structure so you could just be seeing uninitialized memory there. Anyway I'm not sure why you're looking at the field at all. It's really just internal to libibverbs. If you want to understand how things work, there's only one assignment to context->abi_compat in libibverbs, in src/cmd.c so it shouldn't be too hard to figure out (and add whatever tracing info you want). - R. 
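The point above about malloc() leaving a field the caller never assigns with leftover contents can be shown with a small C sketch. `fake_context` is a hypothetical stand-in and is NOT `struct ibv_context`; the sketch only shows that a zero-filling allocation makes such a field read as NULL (on common platforms, where all-bits-zero is a null pointer).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a driver-allocated context. */
struct fake_context {
    int   cmd_fd;     /* set by the allocator below */
    void *abi_compat; /* internal field the allocator never assigns */
};

/* calloc() returns zero-filled memory, so abi_compat reads as NULL here.
 * With malloc() the same field would hold whatever was on the heap. */
struct fake_context *fake_context_alloc(int cmd_fd)
{
    struct fake_context *ctx;

    assert(cmd_fd >= 0);
    ctx = calloc(1, sizeof *ctx);
    if (!ctx)
        return NULL;
    ctx->cmd_fd = cmd_fd;
    return ctx;
}
```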
From rdreier at cisco.com Wed May 9 20:07:08 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 09 May 2007 20:07:08 -0700 Subject: [ofa-general] Re: [GIT PULL] 2.6.22: please pull rdma-dev.git In-Reply-To: <000101c79269$72c67aa0$e598070a@amr.corp.intel.com> (Sean Hefty's message of "Wed, 9 May 2007 11:39:59 -0700") References: <000101c79269$72c67aa0$e598070a@amr.corp.intel.com> Message-ID: ok, I'll look at this when I get back home on Monday. From jimmy at hillraiser.com Wed May 9 21:08:01 2007 From: jimmy at hillraiser.com (Jimmy Hill) Date: Wed, 9 May 2007 23:08:01 -0500 Subject: [ofa-general] verbs abi_compat In-Reply-To: Message-ID: > -----Original Message----- > > > Under what conditions is the field abi_compat of struct ibv_context > > set to non-zero? I'm encountering a situation where it is set > > whencoding to verbs on a clean OFED 1.2 install. Seems odd that it > > would be set since I suspected that it would only occur for verbs > > 1.0/1.1 compatibility. > > Are you sure it's being set? I think most drivers just use malloc() > to allocate the context structure so you could just be seeing > uninitialized memory there. > > Anyway I'm not sure why you're looking at the field at all. It's > really just internal to libibverbs. If you want to understand how > things work, there's only one assignment to context->abi_compat in > libibverbs, in src/cmd.c so it shouldn't be too hard to figure out > (and add whatever tracing info you want). > It is set in that it is non-zero, but I agree, it has garbage in it...and that's part of the problem. It is not being set in src/cmd.c, and has a non-zero value. When I call ibv_alloc_pd, I'm ending up in __ibv_alloc_pd_1_0 and that attempts to use context->real_context which is non-zero garbage as well and I get a segmentation violation. The abi_compat flag was what I thought was redirecting me into __ibv_alloc_pd_1_0 instead of __ibv_alloc_pd (where it should be going). So, maybe I asked the wrong question. 
Let me try a different approach. What determines if ibv_alloc_pd resolves to __ibv_alloc_pd_1_0 or __ibv_alloc_pd? If I can find out what is redirecting my call to the "compat" code, maybe I can stop it and resolve the problem.

Thanks for the response - I appreciate your help!

From eli at mellanox.co.il Wed May 9 22:38:39 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 10 May 2007 08:38:39 +0300
Subject: [ofa-general] Re: [PATCH] IB/core user memory registrations
In-Reply-To:
References: <1178698393.26046.8.camel@mtls03>
Message-ID: <1178775549.7405.4.camel@mtls03>

On Wed, 2007-05-09 at 20:00 -0700, Roland Dreier wrote:
> > @@ -1020,7 +1020,7 @@ static struct ib_mr *mthca_reg_user_mr(s
> >         int shift, n, len;
> >         int i, j, k;
> >         int err = 0;
> > -       int write_mtt_size;
> > +       int write_mtt_size = mthca_write_mtt_size(dev);
> >
> >         mr = kmalloc(sizeof *mr, GFP_KERNEL);
> >         if (!mr)
>
> Not sure I understand what this is fixing... can you be more explicit?
> As far as I can see, the first use of write_mtt_size in mthca_reg_user_mr()
> is the line
>
>     write_mtt_size = min(mthca_write_mtt_size(dev), (int) (PAGE_SIZE / sizeof *pages));
>
> so I'm not sure why we need another initialization? (I'm looking at
> Linus's latest tree, which contains the mlx4 merge)
>
> - R.

This initialization was not in the tree I was working on. Now I see it is fixed in the updated tree. Thanks.
From yosefe at voltaire.com Wed May 9 22:51:21 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Thu, 10 May 2007 08:51:21 +0300 Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events In-Reply-To: <20070509174138.GB17734@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> <20070509174138.GB17734@mellanox.co.il> Message-ID: <4642B2D9.60309@voltaire.com> Michael S. Tsirkin wrote: >>@@ -642,6 +651,11 @@ void ipoib_ib_dev_flush(struct work_stru >> >> ipoib_ib_dev_down(dev, 0); >> >>+ if (restart_qp) { >>+ ipoib_ib_dev_stop(dev, 0); >>+ ipoib_ib_dev_open(dev); >>+ } >>+ >> /* >> * The device could have been brought down between the start and when >> * we get here, don't bring it back up if it's not configured up > > > By the way, I think I see a small issue now - if there's a > pkey change event, this will flush all interfaces, even if > the pkey changed is not used by ipoib at all. > > How about: > - rename restart_qp flag to pkey_change_event > - do something like this at the beginning of the flush routine > if (pkey_change_event && > query_pkey(current index) == current_pkey)) > return; I think we should do the following: hold the index in dev_priv, set it outside restart_qp, and use it in restart_qp as an input parameter. On flush, we find and set it. This will prevent ~64 pkey queries, which are not yet cache-optimized. > > Need to think what to do if index is not valid, but you get the idea. > We can give up, clear PKEY_ASSIGNED flag, and let the polling do its job. > This will remove all the extra flushes in the common case > where pkeys are not moved around too much. 
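The check discussed above (skip the flush when a pkey change event did not actually move the interface's pkey) can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the driver code: struct intf, find_pkey() and pkey_event_needs_restart() are hypothetical names, and the port pkey table is modelled as an ordinary array rather than queried through the verbs layer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-interface state; stands in for the driver's private data. */
struct intf {
	uint16_t pkey;        /* partition key this interface uses */
	uint16_t pkey_index;  /* cached index of pkey in the port table */
	int      assigned;    /* nonzero once pkey_index is known to be valid */
};

/*
 * Linear search of a (mock) port pkey table.
 * Returns 0 and stores the position in *index if pkey is present, -1 otherwise.
 */
static int find_pkey(const uint16_t *table, int len, uint16_t pkey,
		     uint16_t *index)
{
	assert(table != NULL && len > 0);

	for (int i = 0; i < len; i++) {
		if (table[i] == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -1;
}

/*
 * Decide what to do on a pkey change event:
 *    1  the index moved (or was never assigned): the QP needs a restart
 *    0  the index is unchanged: the event can be ignored
 *   -1  the pkey is no longer in the table: clear the assigned flag and
 *       leave recovery to whatever polls for the pkey later
 */
static int pkey_event_needs_restart(struct intf *p, const uint16_t *table,
				    int len)
{
	uint16_t new_index;
	int changed;

	if (find_pkey(table, len, p->pkey, &new_index)) {
		p->assigned = 0;
		return -1;
	}

	changed = !p->assigned || p->pkey_index != new_index;
	p->pkey_index = new_index;
	p->assigned = 1;
	return changed;
}
```

Keeping the index in the per-interface state means the table is searched once per event, and the restart path can reuse the stored index instead of searching again.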
> From yosefe at voltaire.com Thu May 10 00:25:15 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Thu, 10 May 2007 10:25:15 +0300 Subject: [ofa-general] [PATCHv3 2/2] ipoib: handle pkey change events In-Reply-To: <20070509174138.GB17734@mellanox.co.il> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> <20070509174138.GB17734@mellanox.co.il> Message-ID: <4642C8DB.1090205@voltaire.com> This issue was found during partitioning & SM failover testing. * Add a flush flag to ipoib_ib_dev_stop(), as ipoib_ib_dev_down() already has * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity * Obtain the pkey index prior to entering init_qp, and save it in dev_priv * Upon a PKEY_CHANGE event, schedule a work that restarts the qp. * Precondition the restart on whether the pkey index has really changed. Use the cached pkey_index to test this. * Restart child interfaces before the parent. They might be up even if the parent is down * Use an uncached pkey query upon qp initialization SM reconfiguration or failover can shuffle the values in the port pkey table. The current implementation queries for the index of the pkey only once, when it creates the device QP and moves it into the working state, and hence does not address this scenario. Fix this by using the PKEY_CHANGE event as a trigger to reconfigure the device QP. 
Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 7 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 95 +++++++++++++++++++++++------ drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 26 ++----- 4 files changed, 96 insertions(+), 39 deletions(-) Index: b/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-10 08:34:58.335171047 +0300 @@ -202,15 +202,17 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; + struct delayed_work pkey_poll_task; struct delayed_work mcast_task; struct work_struct flush_task; struct work_struct restart_task; struct delayed_work ah_reap_task; + struct work_struct pkey_event_task; struct ib_device *ca; u8 port; u16 pkey; + u16 pkey_index; struct ib_pd *pd; struct ib_mr *mr; struct ib_cq *cq; @@ -333,12 +335,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-10 10:19:35.041587502 +0300 @@ 
-408,11 +408,40 @@ void ipoib_reap_ah(struct work_struct *w queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); } +static int ipoib_find_pkey_index(struct ipoib_dev_priv *priv, int *is_changed) +{ + u16 new_index; + + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return -ENXIO; + } + + if (is_changed) + *is_changed = !test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) || + priv->pkey_index != new_index; + + priv->pkey_index = new_index; + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return 0; +} + int ipoib_ib_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. + */ + ret = ipoib_find_pkey_index(priv, NULL); + if (ret) { + ipoib_warn(priv, "pkey 0x%04x nof found\n", priv->pkey); + return -1; + } + ret = ipoib_init_qp(dev); if (ret) { ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret); @@ -422,14 +451,14 @@ int ipoib_ib_dev_open(struct net_device ret = ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -481,7 +510,7 @@ int ipoib_ib_dev_down(struct net_device if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { mutex_lock(&pkey_mutex); set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); + cancel_delayed_work(&priv->pkey_poll_task); mutex_unlock(&pkey_mutex); if (flush) flush_workqueue(ipoib_workqueue); @@ -508,7 +537,7 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { 
struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +610,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,13 +652,22 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct *work) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; + int is_index_changed; + + mutex_lock(&priv->vlan_mutex); - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { + /* Flush any child interfaces too - + * they might be up even if the parent is down */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, pkey_event); + + mutex_unlock(&priv->vlan_mutex); + + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); return; } @@ -638,10 +677,22 @@ void ipoib_ib_dev_flush(struct work_stru return; } + if (pkey_event && + !ipoib_find_pkey_index(priv, &is_index_changed) && + !is_index_changed) { + ipoib_dbg(priv, "Not flushing - pkey index not changed.\n"); + return; + } + ipoib_dbg(priv, "flushing\n"); ipoib_ib_dev_down(dev, 0); + if (pkey_event) { + ipoib_ib_dev_stop(dev, 0); + ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +701,24 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, 
flush_task); - /* Flush any child interfaces too */ - list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - mutex_unlock(&priv->vlan_mutex); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_event_task); + + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void ipoib_ib_dev_cleanup(struct net_device *dev) @@ -685,7 +746,7 @@ void ipoib_ib_dev_cleanup(struct net_dev void ipoib_pkey_poll(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); + container_of(work, struct ipoib_dev_priv, pkey_poll_task.work); struct net_device *dev = priv->dev; ipoib_pkey_dev_check_presence(dev); @@ -696,7 +757,7 @@ void ipoib_pkey_poll(struct work_struct mutex_lock(&pkey_mutex); if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); } @@ -715,7 +776,7 @@ int ipoib_pkey_dev_delay_open(struct net mutex_lock(&pkey_mutex); clear_bit(IPOIB_PKEY_STOP, &priv->flags); queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); return 1; Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-09 17:21:03.000000000 +0300 @@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev) return -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 
1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-10 09:13:28.997127223 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ -#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ret = -ENXIO; goto out; @@ -94,26 +92,17 @@ int ipoib_init_qp(struct net_device *dev { struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; - u16 pkey_index; struct ib_qp_attr qp_attr; int attr_mask; - /* - * Search through the port P_Key table for the requested pkey value. - * The port has to be assigned to the respective IB partition in - * advance. 
- */ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); - if (ret) { - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); - return ret; - } - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + /* Make sure we have a valid pkey_index in priv->pkey_index */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + return -1; qp_attr.qp_state = IB_QPS_INIT; qp_attr.qkey = 0; qp_attr.port_num = priv->port; - qp_attr.pkey_index = pkey_index; + qp_attr.pkey_index = priv->pkey_index; attr_mask = IB_QP_QKEY | IB_QP_PORT | @@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE || @@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler record->element.port_num == priv->port) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE && + record->element.port_num == priv->port) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_event_task); } } From vlad at mellanox.co.il Thu May 10 00:27:15 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 10 May 2007 10:27:15 +0300 Subject: [ofa-general] Re: [GIT PULL] OFED 1.2 librdmacm In-Reply-To: <000001c7925d$b752cfe0$e598070a@amr.corp.intel.com> References: <000001c7925d$b752cfe0$e598070a@amr.corp.intel.com> Message-ID: <1178782035.7967.2.camel@vladsk-laptop> On Wed, 2007-05-09 at 10:16 -0700, Sean Hefty wrote: > Please pull in the latest librdmacm ofed_1_2 tree. This will add a fix for > rping and man pages. > > Signed-off-by: Sean Hefty Done. -- Vladimir Sokolovsky Mellanox Technologies Ltd. 
From vlad at mellanox.co.il Thu May 10 00:28:11 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 10 May 2007 10:28:11 +0300 Subject: [ofa-general] Re: [GIT PULL] OFED 1.2 uDAPL In-Reply-To: <464224F0.6020408@ichips.intel.com> References: <464224F0.6020408@ichips.intel.com> Message-ID: <1178782091.7967.4.camel@vladsk-laptop> On Wed, 2007-05-09 at 12:45 -0700, Arlin Davis wrote: > Vlad, please pull latest from uDAPL project (ofed_1_2 branch) > > Signed-off by: Arlin Davis ardavis at ichips.intel.com > > Bug Fixes: > - 606: Return local and remote ports with dat_ep_query > - 585: Add bonding example to dat.conf Done, -- Vladimir Sokolovsky Mellanox Technologies Ltd. From fenkes at de.ibm.com Thu May 10 01:55:46 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 10 May 2007 10:55:46 +0200 Subject: [ofa-general] Re: [PATCH 1/6] IB/ehca: Serialize hypervisor calls in ehca_register_mr() In-Reply-To: <200705091347.57470.fenkes@de.ibm.com> References: <200705091347.57470.fenkes@de.ibm.com> Message-ID: <200705101055.46433.fenkes@de.ibm.com> On Wednesday 09 May 2007 13:47, Joachim Fenkes wrote: > --- a/drivers/infiniband/hw/ehca/hcp_if.c > +++ b/drivers/infiniband/hw/ehca/hcp_if.c > @@ -154,7 +154,9 @@ static long ehca_plpar_hcall9(unsigned l > unsigned long arg9) > { > long ret; > - int i, sleep_msecs; > + int i, sleep_msecs, lock_is_set = 0; > + unsigned long flags; > + > > ehca_gen_dbg("opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx " > "arg5=%lx arg6=%lx arg7=%lx arg8=%lx arg9=%lx", > @@ -162,10 +164,18 @@ static long ehca_plpar_hcall9(unsigned l Whoops, that's one empty line too many... 
Roland, when you apply this patch, could you apply the following patch on top: --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -157,7 +157,6 @@ static long ehca_plpar_hcall9(unsigned l int i, sleep_msecs, lock_is_set = 0; unsigned long flags; - ehca_gen_dbg("opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx " "arg5=%lx arg6=%lx arg7=%lx arg8=%lx arg9=%lx", opcode, arg1, arg2, arg3, arg4, arg5, arg6, arg7, Thanks! Joachim -- Joachim Fenkes  --  eHCA Linux Driver Developer and Hardware Tamer IBM Deutschland Entwicklung GmbH  --  Dept. 3627 (I/O Firmware Dev. 2) Schoenaicher Strasse 220  --  71032 Boeblingen  --  Germany eMail: fenkes at de.ibm.com  --  Phone: +49 7031 16 1239 From mst at dev.mellanox.co.il Thu May 10 02:29:25 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 May 2007 12:29:25 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4 In-Reply-To: <4641D38A.8040406@voltaire.com> References: <4641D295.5060907@voltaire.com> <4641D38A.8040406@voltaire.com> Message-ID: <20070510092925.GB13655@mellanox.co.il> > Quoting Erez Zilber : > Subject: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4 > > > Add the required backport patches & kernel addons for open-iscsi > over iSER in RHAS4 up3 and up4. > > Signed-off-by: Erez Zilber In addition to posting patches, could you pls publish a git tree to pull from, please? This makes it easy to test-build the patch as our build system knows how to do git checkout. --- Two comments, generally A: Please move code from kernel_patches to kernel_addons as much as possible. There are many places where you just add new headers, or add #include directives, or change the function called or remove extra parameters, all this can and should be done through addons. 
B: Please do not add code to core unless there is more than 1 user - add it to the iser module instead. This way if there is compilation failure there, you do not break core for people. Some specifics below: .... > diff --git a/kernel_patches/backport/2.6.9_U3/add_iscsi_proto_h.patch b/kernel_patches/backport/2.6.9_U3/add_iscsi_proto_h.patch > new file mode 100644 > index 0000000..c4df6bb > --- /dev/null > +++ b/kernel_patches/backport/2.6.9_U3/add_iscsi_proto_h.patch > @@ -0,0 +1,591 @@ > +diff -rupN linux-2.6.20-like-rh4/include/scsi/iscsi_proto.h linux-2.6.20/include/scsi/iscsi_proto.h > +--- linux-2.6.20-like-rh4/include/scsi/iscsi_proto.h 1970-01-01 02:00:00.000000000 +0200 > ++++ linux-2.6.20/include/scsi/iscsi_proto.h 2007-02-04 20:44:54.000000000 +0200 > +@@ -0,0 +1,587 @@ > ++/* > ++ * RFC 3720 (iSCSI) protocol data types > ++ * > ++ * Copyright (C) 2005 Dmitry Yusupov > ++ * Copyright (C) 2005 Alex Aizman > ++ * maintained by open-iscsi at googlegroups.com > ++ * > ++ * This program is free software; you can redistribute it and/or modify > ++ * it under the terms of the GNU General Public License as published > ++ * by the Free Software Foundation; either version 2 of the License, or > ++ * (at your option) any later version. > ++ * > ++ * This program is distributed in the hope that it will be useful, but > ++ * WITHOUT ANY WARRANTY; without even the implied warranty of > ++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > ++ * General Public License for more details. > ++ * > ++ * See the file COPYING included with this distribution for more details. 
> ++ */ > ++ > ++#ifndef ISCSI_PROTO_H > ++#define ISCSI_PROTO_H > ++ > ++#define ISCSI_DRAFT20_VERSION 0x00 > ++ > ++/* default iSCSI listen port for incoming connections */ > ++#define ISCSI_LISTEN_PORT 3260 > ++ > ++/* Padding word length */ > ++#define PAD_WORD_LEN 4 > ++ > ++/* > ++ * useful common(control and data pathes) macro > ++ */ > ++#define ntoh24(p) (((p)[0] << 16) | ((p)[1] << 8) | ((p)[2])) > ++#define hton24(p, v) { \ > ++ p[0] = (((v) >> 16) & 0xFF); \ > ++ p[1] = (((v) >> 8) & 0xFF); \ > ++ p[2] = ((v) & 0xFF); \ > ++} > ++#define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;} > ++ > ++/* > ++ * iSCSI Template Message Header > ++ */ > ++struct iscsi_hdr { > ++ uint8_t opcode; > ++ uint8_t flags; /* Final bit */ > ++ uint8_t rsvd2[2]; > ++ uint8_t hlength; /* AHSs total length */ > ++ uint8_t dlength[3]; /* Data length */ > ++ uint8_t lun[8]; > ++ __be32 itt; /* Initiator Task Tag */ > ++ __be32 ttt; /* Target Task Tag */ > ++ __be32 statsn; > ++ __be32 exp_statsn; > ++ __be32 max_statsn; > ++ uint8_t other[12]; > ++}; > ++ > ++/************************* RFC 3720 Begin *****************************/ > ++ > ++#define ISCSI_RESERVED_TAG 0xffffffff > ++ > ++/* Opcode encoding bits */ > ++#define ISCSI_OP_RETRY 0x80 > ++#define ISCSI_OP_IMMEDIATE 0x40 > ++#define ISCSI_OPCODE_MASK 0x3F > ++ > ++/* Initiator Opcode values */ > ++#define ISCSI_OP_NOOP_OUT 0x00 > ++#define ISCSI_OP_SCSI_CMD 0x01 > ++#define ISCSI_OP_SCSI_TMFUNC 0x02 > ++#define ISCSI_OP_LOGIN 0x03 > ++#define ISCSI_OP_TEXT 0x04 > ++#define ISCSI_OP_SCSI_DATA_OUT 0x05 > ++#define ISCSI_OP_LOGOUT 0x06 > ++#define ISCSI_OP_SNACK 0x10 > ++ > ++#define ISCSI_OP_VENDOR1_CMD 0x1c > ++#define ISCSI_OP_VENDOR2_CMD 0x1d > ++#define ISCSI_OP_VENDOR3_CMD 0x1e > ++#define ISCSI_OP_VENDOR4_CMD 0x1f > ++ > ++/* Target Opcode values */ > ++#define ISCSI_OP_NOOP_IN 0x20 > ++#define ISCSI_OP_SCSI_CMD_RSP 0x21 > ++#define ISCSI_OP_SCSI_TMFUNC_RSP 0x22 > ++#define ISCSI_OP_LOGIN_RSP 0x23 > ++#define ISCSI_OP_TEXT_RSP 
0x24 > ++#define ISCSI_OP_SCSI_DATA_IN 0x25 > ++#define ISCSI_OP_LOGOUT_RSP 0x26 > ++#define ISCSI_OP_R2T 0x31 > ++#define ISCSI_OP_ASYNC_EVENT 0x32 > ++#define ISCSI_OP_REJECT 0x3f > ++ > ++struct iscsi_ahs_hdr { > ++ __be16 ahslength; > ++ uint8_t ahstype; > ++ uint8_t ahspec[5]; > ++}; > ++ > ++#define ISCSI_AHSTYPE_CDB 1 > ++#define ISCSI_AHSTYPE_RLENGTH 2 > ++ > ++/* iSCSI PDU Header */ > ++struct iscsi_cmd { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ __be16 rsvd2; > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t lun[8]; > ++ __be32 itt; /* Initiator Task Tag */ > ++ __be32 data_length; > ++ __be32 cmdsn; > ++ __be32 exp_statsn; > ++ uint8_t cdb[16]; /* SCSI Command Block */ > ++ /* Additional Data (Command Dependent) */ > ++}; > ++ > ++/* Command PDU flags */ > ++#define ISCSI_FLAG_CMD_FINAL 0x80 > ++#define ISCSI_FLAG_CMD_READ 0x40 > ++#define ISCSI_FLAG_CMD_WRITE 0x20 > ++#define ISCSI_FLAG_CMD_ATTR_MASK 0x07 /* 3 bits */ > ++ > ++/* SCSI Command Attribute values */ > ++#define ISCSI_ATTR_UNTAGGED 0 > ++#define ISCSI_ATTR_SIMPLE 1 > ++#define ISCSI_ATTR_ORDERED 2 > ++#define ISCSI_ATTR_HEAD_OF_QUEUE 3 > ++#define ISCSI_ATTR_ACA 4 > ++ > ++struct iscsi_rlength_ahdr { > ++ __be16 ahslength; > ++ uint8_t ahstype; > ++ uint8_t reserved; > ++ __be32 read_length; > ++}; > ++ > ++/* SCSI Response Header */ > ++struct iscsi_cmd_rsp { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t response; > ++ uint8_t cmd_status; > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t rsvd[8]; > ++ __be32 itt; /* Initiator Task Tag */ > ++ __be32 rsvd1; > ++ __be32 statsn; > ++ __be32 exp_cmdsn; > ++ __be32 max_cmdsn; > ++ __be32 exp_datasn; > ++ __be32 bi_residual_count; > ++ __be32 residual_count; > ++ /* Response or Sense Data (optional) */ > ++}; > ++ > ++/* Command Response PDU flags */ > ++#define ISCSI_FLAG_CMD_BIDI_OVERFLOW 0x10 > ++#define ISCSI_FLAG_CMD_BIDI_UNDERFLOW 0x08 > ++#define ISCSI_FLAG_CMD_OVERFLOW 0x04 > ++#define 
ISCSI_FLAG_CMD_UNDERFLOW 0x02 > ++ > ++/* iSCSI Status values. Valid if Rsp Selector bit is not set */ > ++#define ISCSI_STATUS_CMD_COMPLETED 0 > ++#define ISCSI_STATUS_TARGET_FAILURE 1 > ++#define ISCSI_STATUS_SUBSYS_FAILURE 2 > ++ > ++/* Asynchronous Event Header */ > ++struct iscsi_async { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t rsvd2[2]; > ++ uint8_t rsvd3; > ++ uint8_t dlength[3]; > ++ uint8_t lun[8]; > ++ uint8_t rsvd4[8]; > ++ __be32 statsn; > ++ __be32 exp_cmdsn; > ++ __be32 max_cmdsn; > ++ uint8_t async_event; > ++ uint8_t async_vcode; > ++ __be16 param1; > ++ __be16 param2; > ++ __be16 param3; > ++ uint8_t rsvd5[4]; > ++}; > ++ > ++/* iSCSI Event Codes */ > ++#define ISCSI_ASYNC_MSG_SCSI_EVENT 0 > ++#define ISCSI_ASYNC_MSG_REQUEST_LOGOUT 1 > ++#define ISCSI_ASYNC_MSG_DROPPING_CONNECTION 2 > ++#define ISCSI_ASYNC_MSG_DROPPING_ALL_CONNECTIONS 3 > ++#define ISCSI_ASYNC_MSG_PARAM_NEGOTIATION 4 > ++#define ISCSI_ASYNC_MSG_VENDOR_SPECIFIC 255 > ++ > ++/* NOP-Out Message */ > ++struct iscsi_nopout { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ __be16 rsvd2; > ++ uint8_t rsvd3; > ++ uint8_t dlength[3]; > ++ uint8_t lun[8]; > ++ __be32 itt; /* Initiator Task Tag */ > ++ __be32 ttt; /* Target Transfer Tag */ > ++ __be32 cmdsn; > ++ __be32 exp_statsn; > ++ uint8_t rsvd4[16]; > ++}; > ++ > ++/* NOP-In Message */ > ++struct iscsi_nopin { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ __be16 rsvd2; > ++ uint8_t rsvd3; > ++ uint8_t dlength[3]; > ++ uint8_t lun[8]; > ++ __be32 itt; /* Initiator Task Tag */ > ++ __be32 ttt; /* Target Transfer Tag */ > ++ __be32 statsn; > ++ __be32 exp_cmdsn; > ++ __be32 max_cmdsn; > ++ uint8_t rsvd4[12]; > ++}; > ++ > ++/* SCSI Task Management Message Header */ > ++struct iscsi_tm { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t rsvd1[2]; > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t lun[8]; > ++ __be32 itt; /* Initiator Task Tag */ > ++ __be32 rtt; /* Reference Task Tag */ > ++ __be32 cmdsn; > ++ __be32 
exp_statsn; > ++ __be32 refcmdsn; > ++ __be32 exp_datasn; > ++ uint8_t rsvd2[8]; > ++}; > ++ > ++#define ISCSI_FLAG_TM_FUNC_MASK 0x7F > ++ > ++/* Function values */ > ++#define ISCSI_TM_FUNC_ABORT_TASK 1 > ++#define ISCSI_TM_FUNC_ABORT_TASK_SET 2 > ++#define ISCSI_TM_FUNC_CLEAR_ACA 3 > ++#define ISCSI_TM_FUNC_CLEAR_TASK_SET 4 > ++#define ISCSI_TM_FUNC_LOGICAL_UNIT_RESET 5 > ++#define ISCSI_TM_FUNC_TARGET_WARM_RESET 6 > ++#define ISCSI_TM_FUNC_TARGET_COLD_RESET 7 > ++#define ISCSI_TM_FUNC_TASK_REASSIGN 8 > ++ > ++/* SCSI Task Management Response Header */ > ++struct iscsi_tm_rsp { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t response; /* see Response values below */ > ++ uint8_t qualifier; > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t rsvd2[8]; > ++ __be32 itt; /* Initiator Task Tag */ > ++ __be32 rtt; /* Reference Task Tag */ > ++ __be32 statsn; > ++ __be32 exp_cmdsn; > ++ __be32 max_cmdsn; > ++ uint8_t rsvd3[12]; > ++}; > ++ > ++/* Response values */ > ++#define ISCSI_TMF_RSP_COMPLETE 0x00 > ++#define ISCSI_TMF_RSP_NO_TASK 0x01 > ++#define ISCSI_TMF_RSP_NO_LUN 0x02 > ++#define ISCSI_TMF_RSP_TASK_ALLEGIANT 0x03 > ++#define ISCSI_TMF_RSP_NO_FAILOVER 0x04 > ++#define ISCSI_TMF_RSP_NOT_SUPPORTED 0x05 > ++#define ISCSI_TMF_RSP_AUTH_FAILED 0x06 > ++#define ISCSI_TMF_RSP_REJECTED 0xff > ++ > ++/* Ready To Transfer Header */ > ++struct iscsi_r2t_rsp { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t rsvd2[2]; > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t lun[8]; > ++ __be32 itt; /* Initiator Task Tag */ > ++ __be32 ttt; /* Target Transfer Tag */ > ++ __be32 statsn; > ++ __be32 exp_cmdsn; > ++ __be32 max_cmdsn; > ++ __be32 r2tsn; > ++ __be32 data_offset; > ++ __be32 data_length; > ++}; > ++ > ++/* SCSI Data Hdr */ > ++struct iscsi_data { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t rsvd2[2]; > ++ uint8_t rsvd3; > ++ uint8_t dlength[3]; > ++ uint8_t lun[8]; > ++ __be32 itt; > ++ __be32 ttt; > ++ __be32 rsvd4; > ++ __be32 
exp_statsn; > ++ __be32 rsvd5; > ++ __be32 datasn; > ++ __be32 offset; > ++ __be32 rsvd6; > ++ /* Payload */ > ++}; > ++ > ++/* SCSI Data Response Hdr */ > ++struct iscsi_data_rsp { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t rsvd2; > ++ uint8_t cmd_status; > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t lun[8]; > ++ __be32 itt; > ++ __be32 ttt; > ++ __be32 statsn; > ++ __be32 exp_cmdsn; > ++ __be32 max_cmdsn; > ++ __be32 datasn; > ++ __be32 offset; > ++ __be32 residual_count; > ++}; > ++ > ++/* Data Response PDU flags */ > ++#define ISCSI_FLAG_DATA_ACK 0x40 > ++#define ISCSI_FLAG_DATA_OVERFLOW 0x04 > ++#define ISCSI_FLAG_DATA_UNDERFLOW 0x02 > ++#define ISCSI_FLAG_DATA_STATUS 0x01 > ++ > ++/* Text Header */ > ++struct iscsi_text { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t rsvd2[2]; > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t rsvd4[8]; > ++ __be32 itt; > ++ __be32 ttt; > ++ __be32 cmdsn; > ++ __be32 exp_statsn; > ++ uint8_t rsvd5[16]; > ++ /* Text - key=value pairs */ > ++}; > ++ > ++#define ISCSI_FLAG_TEXT_CONTINUE 0x40 > ++ > ++/* Text Response Header */ > ++struct iscsi_text_rsp { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t rsvd2[2]; > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t rsvd4[8]; > ++ __be32 itt; > ++ __be32 ttt; > ++ __be32 statsn; > ++ __be32 exp_cmdsn; > ++ __be32 max_cmdsn; > ++ uint8_t rsvd5[12]; > ++ /* Text Response - key:value pairs */ > ++}; > ++ > ++/* Login Header */ > ++struct iscsi_login { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t max_version; /* Max. version supported */ > ++ uint8_t min_version; /* Min. 
version supported */ > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t isid[6]; /* Initiator Session ID */ > ++ __be16 tsih; /* Target Session Handle */ > ++ __be32 itt; /* Initiator Task Tag */ > ++ __be16 cid; > ++ __be16 rsvd3; > ++ __be32 cmdsn; > ++ __be32 exp_statsn; > ++ uint8_t rsvd5[16]; > ++}; > ++ > ++/* Login PDU flags */ > ++#define ISCSI_FLAG_LOGIN_TRANSIT 0x80 > ++#define ISCSI_FLAG_LOGIN_CONTINUE 0x40 > ++#define ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK 0x0C /* 2 bits */ > ++#define ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK 0x03 /* 2 bits */ > ++ > ++#define ISCSI_LOGIN_CURRENT_STAGE(flags) \ > ++ ((flags & ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK) >> 2) > ++#define ISCSI_LOGIN_NEXT_STAGE(flags) \ > ++ (flags & ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK) > ++ > ++/* Login Response Header */ > ++struct iscsi_login_rsp { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t max_version; /* Max. version supported */ > ++ uint8_t active_version; /* Active version */ > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t isid[6]; /* Initiator Session ID */ > ++ __be16 tsih; /* Target Session Handle */ > ++ __be32 itt; /* Initiator Task Tag */ > ++ __be32 rsvd3; > ++ __be32 statsn; > ++ __be32 exp_cmdsn; > ++ __be32 max_cmdsn; > ++ uint8_t status_class; /* see Login RSP ststus classes below */ > ++ uint8_t status_detail; /* see Login RSP Status details below */ > ++ uint8_t rsvd4[10]; > ++}; > ++ > ++/* Login stage (phase) codes for CSG, NSG */ > ++#define ISCSI_INITIAL_LOGIN_STAGE -1 > ++#define ISCSI_SECURITY_NEGOTIATION_STAGE 0 > ++#define ISCSI_OP_PARMS_NEGOTIATION_STAGE 1 > ++#define ISCSI_FULL_FEATURE_PHASE 3 > ++ > ++/* Login Status response classes */ > ++#define ISCSI_STATUS_CLS_SUCCESS 0x00 > ++#define ISCSI_STATUS_CLS_REDIRECT 0x01 > ++#define ISCSI_STATUS_CLS_INITIATOR_ERR 0x02 > ++#define ISCSI_STATUS_CLS_TARGET_ERR 0x03 > ++ > ++/* Login Status response detail codes */ > ++/* Class-0 (Success) */ > ++#define ISCSI_LOGIN_STATUS_ACCEPT 0x00 > ++ > ++/* 
Class-1 (Redirection) */ > ++#define ISCSI_LOGIN_STATUS_TGT_MOVED_TEMP 0x01 > ++#define ISCSI_LOGIN_STATUS_TGT_MOVED_PERM 0x02 > ++ > ++/* Class-2 (Initiator Error) */ > ++#define ISCSI_LOGIN_STATUS_INIT_ERR 0x00 > ++#define ISCSI_LOGIN_STATUS_AUTH_FAILED 0x01 > ++#define ISCSI_LOGIN_STATUS_TGT_FORBIDDEN 0x02 > ++#define ISCSI_LOGIN_STATUS_TGT_NOT_FOUND 0x03 > ++#define ISCSI_LOGIN_STATUS_TGT_REMOVED 0x04 > ++#define ISCSI_LOGIN_STATUS_NO_VERSION 0x05 > ++#define ISCSI_LOGIN_STATUS_ISID_ERROR 0x06 > ++#define ISCSI_LOGIN_STATUS_MISSING_FIELDS 0x07 > ++#define ISCSI_LOGIN_STATUS_CONN_ADD_FAILED 0x08 > ++#define ISCSI_LOGIN_STATUS_NO_SESSION_TYPE 0x09 > ++#define ISCSI_LOGIN_STATUS_NO_SESSION 0x0a > ++#define ISCSI_LOGIN_STATUS_INVALID_REQUEST 0x0b > ++ > ++/* Class-3 (Target Error) */ > ++#define ISCSI_LOGIN_STATUS_TARGET_ERROR 0x00 > ++#define ISCSI_LOGIN_STATUS_SVC_UNAVAILABLE 0x01 > ++#define ISCSI_LOGIN_STATUS_NO_RESOURCES 0x02 > ++ > ++/* Logout Header */ > ++struct iscsi_logout { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t rsvd1[2]; > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t rsvd2[8]; > ++ __be32 itt; /* Initiator Task Tag */ > ++ __be16 cid; > ++ uint8_t rsvd3[2]; > ++ __be32 cmdsn; > ++ __be32 exp_statsn; > ++ uint8_t rsvd4[16]; > ++}; > ++ > ++/* Logout PDU flags */ > ++#define ISCSI_FLAG_LOGOUT_REASON_MASK 0x7F > ++ > ++/* logout reason_code values */ > ++ > ++#define ISCSI_LOGOUT_REASON_CLOSE_SESSION 0 > ++#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION 1 > ++#define ISCSI_LOGOUT_REASON_RECOVERY 2 > ++#define ISCSI_LOGOUT_REASON_AEN_REQUEST 3 > ++ > ++/* Logout Response Header */ > ++struct iscsi_logout_rsp { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t response; /* see Logout response values below */ > ++ uint8_t rsvd2; > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t rsvd3[8]; > ++ __be32 itt; /* Initiator Task Tag */ > ++ __be32 rsvd4; > ++ __be32 statsn; > ++ __be32 exp_cmdsn; > ++ __be32 max_cmdsn; > ++ 
__be32 rsvd5; > ++ __be16 t2wait; > ++ __be16 t2retain; > ++ __be32 rsvd6; > ++}; > ++ > ++/* logout response status values */ > ++ > ++#define ISCSI_LOGOUT_SUCCESS 0 > ++#define ISCSI_LOGOUT_CID_NOT_FOUND 1 > ++#define ISCSI_LOGOUT_RECOVERY_UNSUPPORTED 2 > ++#define ISCSI_LOGOUT_CLEANUP_FAILED 3 > ++ > ++/* SNACK Header */ > ++struct iscsi_snack { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t rsvd2[14]; > ++ __be32 itt; > ++ __be32 begrun; > ++ __be32 runlength; > ++ __be32 exp_statsn; > ++ __be32 rsvd3; > ++ __be32 exp_datasn; > ++ uint8_t rsvd6[8]; > ++}; > ++ > ++/* SNACK PDU flags */ > ++#define ISCSI_FLAG_SNACK_TYPE_MASK 0x0F /* 4 bits */ > ++ > ++/* Reject Message Header */ > ++struct iscsi_reject { > ++ uint8_t opcode; > ++ uint8_t flags; > ++ uint8_t reason; > ++ uint8_t rsvd2; > ++ uint8_t hlength; > ++ uint8_t dlength[3]; > ++ uint8_t rsvd3[8]; > ++ __be32 ffffffff; > ++ uint8_t rsvd4[4]; > ++ __be32 statsn; > ++ __be32 exp_cmdsn; > ++ __be32 max_cmdsn; > ++ __be32 datasn; > ++ uint8_t rsvd5[8]; > ++ /* Text - Rejected hdr */ > ++}; > ++ > ++/* Reason for Reject */ > ++#define ISCSI_REASON_CMD_BEFORE_LOGIN 1 > ++#define ISCSI_REASON_DATA_DIGEST_ERROR 2 > ++#define ISCSI_REASON_DATA_SNACK_REJECT 3 > ++#define ISCSI_REASON_PROTOCOL_ERROR 4 > ++#define ISCSI_REASON_CMD_NOT_SUPPORTED 5 > ++#define ISCSI_REASON_IMM_CMD_REJECT 6 > ++#define ISCSI_REASON_TASK_IN_PROGRESS 7 > ++#define ISCSI_REASON_INVALID_SNACK 8 > ++#define ISCSI_REASON_BOOKMARK_INVALID 9 > ++#define ISCSI_REASON_BOOKMARK_NO_RESOURCES 10 > ++#define ISCSI_REASON_NEGOTIATION_RESET 11 > ++ > ++/* Max. 
number of Key=Value pairs in a text message */ > ++#define MAX_KEY_VALUE_PAIRS 8192 > ++ > ++/* maximum length for text keys/values */ > ++#define KEY_MAXLEN 64 > ++#define VALUE_MAXLEN 255 > ++#define TARGET_NAME_MAXLEN VALUE_MAXLEN > ++ > ++#define DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH 8192 > ++ > ++/************************* RFC 3720 End *****************************/ > ++ > ++#endif /* ISCSI_PROTO_H */ Why isn't the above in addons? ... > diff --git a/kernel_patches/backport/2.6.9_U3/add_memory_h.patch b/kernel_patches/backport/2.6.9_U3/add_memory_h.patch > new file mode 100644 > index 0000000..5daad2e > --- /dev/null > +++ b/kernel_patches/backport/2.6.9_U3/add_memory_h.patch > @@ -0,0 +1,93 @@ > +diff -rupN linux-2.6.20-like-rh4/include/linux/memory.h linux-2.6.20/include/linux/memory.h > +--- linux-2.6.20-like-rh4/include/linux/memory.h 1970-01-01 02:00:00.000000000 +0200 > ++++ linux-2.6.20/include/linux/memory.h 2007-02-04 20:44:54.000000000 +0200 > +@@ -0,0 +1,89 @@ > ++/* > ++ * include/linux/memory.h - generic memory definition > ++ * > ++ * This is mainly for topological representation. We define the > ++ * basic "struct memory_block" here, which can be embedded in per-arch > ++ * definitions or NUMA information. > ++ * > ++ * Basic handling of the devices is done in drivers/base/memory.c > ++ * and system devices are handled in drivers/base/sys.c. > ++ * > ++ * Memory block are exported via sysfs in the class/memory/devices/ > ++ * directory. > ++ * > ++ */ > ++#ifndef _LINUX_MEMORY_H_ > ++#define _LINUX_MEMORY_H_ > ++ > ++#include > ++#include > ++#include > ++ > ++#include > ++ > ++struct memory_block { > ++ unsigned long phys_index; > ++ unsigned long state; > ++ /* > ++ * This serializes all state change requests. It isn't > ++ * held during creation because the control files are > ++ * created long after the critical areas during > ++ * initialization. > ++ */ > ++ struct semaphore state_sem; > ++ int phys_device; /* to which fru does this belong? 
*/ > ++ void *hw; /* optional pointer to fw/hw data */ > ++ int (*phys_callback)(struct memory_block *); > ++ struct sys_device sysdev; > ++}; > ++ > ++/* These states are exposed to userspace as text strings in sysfs */ > ++#define MEM_ONLINE (1<<0) /* exposed to userspace */ > ++#define MEM_GOING_OFFLINE (1<<1) /* exposed to userspace */ > ++#define MEM_OFFLINE (1<<2) /* exposed to userspace */ > ++ > ++/* > ++ * All of these states are currently kernel-internal for notifying > ++ * kernel components and architectures. > ++ * > ++ * For MEM_MAPPING_INVALID, all notifier chains with priority >0 > ++ * are called before pfn_to_page() becomes invalid. The priority=0 > ++ * entry is reserved for the function that actually makes > ++ * pfn_to_page() stop working. Any notifiers that want to be called > ++ * after that should have priority <0. > ++ */ > ++#define MEM_MAPPING_INVALID (1<<3) > ++ > ++struct notifier_block; > ++struct mem_section; > ++ > ++#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE > ++static inline int memory_dev_init(void) > ++{ > ++ return 0; > ++} > ++static inline int register_memory_notifier(struct notifier_block *nb) > ++{ > ++ return 0; > ++} > ++static inline void unregister_memory_notifier(struct notifier_block *nb) > ++{ > ++} > ++#else > ++extern int register_new_memory(struct mem_section *); > ++extern int unregister_memory_section(struct mem_section *); > ++extern int memory_dev_init(void); > ++extern int remove_memory_block(unsigned long, struct mem_section *, int); > ++ > ++#define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION< ++ > ++ > ++#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */ > ++ > ++#define hotplug_memory_notifier(fn, pri) { \ > ++ static struct notifier_block fn##_mem_nb = \ > ++ { .notifier_call = fn, .priority = pri }; \ > ++ register_memory_notifier(&fn##_mem_nb); \ > ++} > ++ > ++#endif /* _LINUX_MEMORY_H_ */ why isn't this in addons? 
> diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch > new file mode 100644 > index 0000000..d77c663 > --- /dev/null > +++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch > @@ -0,0 +1,504 @@ > +diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.c linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.c > +--- linux-2.6.20/drivers/scsi/iscsi_tcp.c 2007-02-04 20:44:54.000000000 +0200 > ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.c 2007-04-01 13:11:17.000000000 +0300 ... > +@@ -108,7 +108,7 @@ iscsi_hdr_digest(struct iscsi_conn *conn > + { > + struct iscsi_tcp_conn *tcp_conn = conn->dd_data; > + > +- crypto_hash_digest(&tcp_conn->tx_hash, &buf->sg, buf->sg.length, crc); > ++ crypto_digest_digest(tcp_conn->tx_tfm, &buf->sg, 1, crc); > + buf->sg.length = tcp_conn->hdr_size; > + } > + You could make it a macro in addons if you had named the new field tx_hash. > +@@ -419,7 +419,7 @@ iscsi_r2t_rsp(struct iscsi_conn *conn, s > + tcp_ctask->xmstate |= XMSTATE_SOL_HDR; > + list_move_tail(&ctask->running, &conn->xmitqueue); > + > +- scsi_queue_work(session->host, &conn->xmitwork); > ++ schedule_work(&conn->xmitwork); > + conn->r2t_pdus_cnt++; > + spin_unlock(&session->lock); > + Can not this be done with a macro in addons? Same for other places where this change was done. > +@@ -2044,11 +2037,13 @@ iscsi_tcp_conn_get_param(struct iscsi_cl > + sk = tcp_conn->sock->sk; > + if (sk->sk_family == PF_INET) { > + inet = inet_sk(sk); > +- len = sprintf(buf, NIPQUAD_FMT "\n", > ++ len = sprintf(buf, "%u.%u.%u.%u\n", > + NIPQUAD(inet->daddr)); > + } else { > + np = inet6_sk(sk); > +- len = sprintf(buf, NIP6_FMT "\n", NIP6(np->daddr)); > ++ len = sprintf(buf, > ++ "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", > ++ NIP6(np->daddr)); > + } > + mutex_unlock(&conn->xmitmutex); > + break; NIP6_FMT should be defined in addons. 
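For illustration, a definition of this kind could be guarded so that it only takes effect on kernels that lack it. The sketch below is user-space C, not the actual addons header: the two format strings are the ones spelled out in the hunk above, while DEMO_NIPQUAD, DEMO_NIP6 and the demo_format_* helpers are hypothetical stand-ins for the kernel's argument macros.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Provide the format strings only when the headers in use lack them. */
#ifndef NIPQUAD_FMT
#define NIPQUAD_FMT "%u.%u.%u.%u"
#endif

#ifndef NIP6_FMT
#define NIP6_FMT "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x"
#endif

/*
 * Hypothetical argument macros for this demo only. They expand a byte
 * array in network order into the arguments the format strings expect.
 */
#define DEMO_NIPQUAD(b) \
	(unsigned)(b)[0], (unsigned)(b)[1], (unsigned)(b)[2], (unsigned)(b)[3]

#define DEMO_WORD(b, i) \
	((unsigned)(((b)[2 * (i)] << 8) | (b)[2 * (i) + 1]))

#define DEMO_NIP6(b) \
	DEMO_WORD(b, 0), DEMO_WORD(b, 1), DEMO_WORD(b, 2), DEMO_WORD(b, 3), \
	DEMO_WORD(b, 4), DEMO_WORD(b, 5), DEMO_WORD(b, 6), DEMO_WORD(b, 7)

/* Format an IPv4 address the way the unpatched driver code does. */
int demo_format_ipv4(char *buf, size_t len, const uint8_t addr[4])
{
	return snprintf(buf, len, NIPQUAD_FMT "\n", DEMO_NIPQUAD(addr));
}

/* Format an IPv6 address as eight zero-padded 16-bit groups. */
int demo_format_ipv6(char *buf, size_t len, const uint8_t addr[16])
{
	return snprintf(buf, len, NIP6_FMT "\n", DEMO_NIP6(addr));
}
```

With the macros supplied by a compatibility header, the sprintf() calls in iscsi_tcp.c could stay as they are upstream and the hunk above would not be needed.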
> +@@ -2135,7 +2130,6 @@ static void iscsi_tcp_session_destroy(st > + static struct scsi_host_template iscsi_sht = { > + .name = "iSCSI Initiator over TCP/IP", > + .queuecommand = iscsi_queuecommand, > +- .change_queue_depth = iscsi_change_queue_depth, > + .can_queue = ISCSI_XMIT_CMDS_MAX - 1, > + .sg_tablesize = ISCSI_SG_TABLESIZE, > + .cmd_per_lun = ISCSI_DEF_CMD_PER_LUN, > +diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.h linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.h > +--- linux-2.6.20/drivers/scsi/iscsi_tcp.h 2007-02-04 20:44:54.000000000 +0200 > ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.h 2007-04-01 13:11:55.000000000 +0300 > +@@ -49,7 +49,6 @@ > + #define ISCSI_SG_TABLESIZE SG_ALL > + #define ISCSI_TCP_MAX_CMD_LEN 16 > + > +-struct crypto_hash; > + struct socket; > + > + /* Socket connection recieve helper */ > +@@ -93,8 +92,8 @@ struct iscsi_tcp_conn { > + void (*old_write_space)(struct sock *); > + > + /* data and header digests */ > +- struct hash_desc tx_hash; /* CRC32C (Tx) */ > +- struct hash_desc rx_hash; /* CRC32C (Rx) */ > ++ struct crypto_tfm *tx_tfm; /* CRC32C (Tx) */ > ++ struct crypto_tfm *rx_tfm; /* CRC32C (Rx) */ > + > + /* MIB custom statistics */ > + uint32_t sendpage_failures_cnt; Name the new field tx_hash (just change the type) and then you will be able to replace a lot of changes by a one liner in addons. > +diff -rup linux-2.6.20/drivers/scsi/libiscsi.c linux-2.6.20-backport-rh4-u3/drivers/scsi/libiscsi.c > +--- linux-2.6.20/drivers/scsi/libiscsi.c 2007-02-04 20:44:54.000000000 +0200 > ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/libiscsi.c 2007-04-01 13:15:57.000000000 +0300 > +@@ -23,6 +23,7 @@ > + */ > + #include > + #include > ++#include > + #include > + #include > + #include Why does this need to be added? Such stuff should be done through addons. .... 
> +diff -rup linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c linux-2.6.20-backport-rh4-u3/drivers/scsi/scsi_transport_iscsi.c > +--- linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c 2007-02-04 20:44:54.000000000 +0200 > ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/scsi_transport_iscsi.c 2007-04-01 13:18:33.000000000 +0300 > +@@ -29,11 +29,15 @@ > + #include > + #include > + #include > ++#include > ++#include Do this through addons. > + > + #define ISCSI_SESSION_ATTRS 11 > + #define ISCSI_CONN_ATTRS 11 > + #define ISCSI_HOST_ATTRS 0 > +-#define ISCSI_TRANSPORT_VERSION "2.0-724" > ++#define ISCSI_TRANSPORT_VERSION "2.0-754" Really a necessary change? > ++ > ++#define SCAN_WILD_CARD ~0 Do this through addons. > + > + struct iscsi_internal { > + int daemon_pid; > +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l > + #define cdev_to_iscsi_internal(_cdev) \ > + container_of(_cdev, struct iscsi_internal, cdev) > + > ++extern int attribute_container_init(void); > ++ This does not look scsi-related. Why does this belong here? 
> +@@ -216,28 +225,10 @@ static int iscsi_is_session_dev(const st
> + return dev->release == iscsi_session_release;
> + }
> +
> +-static int iscsi_user_scan(struct Scsi_Host *shost, uint channel,
> +- uint id, uint lun)
> +-{
> +- struct iscsi_host *ihost = shost->shost_data;
> +- struct iscsi_cls_session *session;
> +-
> +- mutex_lock(&ihost->mutex);
> +- list_for_each_entry(session, &ihost->sessions, host_list) {
> +- if ((channel == SCAN_WILD_CARD || channel == 0) &&
> +- (id == SCAN_WILD_CARD || id == session->target_id))
> +- scsi_scan_target(&session->dev, 0,
> +- session->target_id, lun, 1);
> +- }
> +- mutex_unlock(&ihost->mutex);
> +-
> +- return 0;
> +-}
> +-
> +-static void session_recovery_timedout(struct work_struct *work)
> ++static void session_recovery_timedout(void *data)
> + {
> + struct iscsi_cls_session *session =
> +- container_of(work, struct iscsi_cls_session,
> ++ container_of(data, struct iscsi_cls_session,
> + recovery_work.work);
> +
> + dev_printk(KERN_INFO, &session->dev, "iscsi: session recovery timed "

You should not need this. This looks like it duplicates the work we did on
backporting work_struct to old kernels.

> +@@ -452,6 +441,7 @@ iscsi_create_conn(struct iscsi_cls_sessi
> + goto release_parent_ref;
> + }
> + transport_register_device(&conn->dev);
> ++
> + return conn;
> +
> + release_parent_ref:

Really necessary in a backport?

> +@@ -606,9 +596,8 @@ iscsi_if_send_reply(int pid, int seq, in
> + struct nlmsghdr *nlh;
> + int len = NLMSG_SPACE(size);
> + int flags = multi ? NLM_F_MULTI : 0;
> +- int t = done ? NLMSG_DONE : type;
> +
> +- skb = alloc_skb(len, GFP_ATOMIC);
> ++ skb = alloc_skb(len, GFP_KERNEL);
> + /*
> + * FIXME:
> + * user is supposed to react on iferror == -ENOMEM;

This looks really strange in a backport.
> +@@ -649,7 +638,7 @@ iscsi_if_get_stats(struct iscsi_transpor > + do { > + int actual_size; > + > +- skbstat = alloc_skb(len, GFP_ATOMIC); > ++ skbstat = alloc_skb(len, GFP_KERNEL); > + if (!skbstat) { > + dev_printk(KERN_ERR, &conn->dev, "iscsi: can not " > + "deliver stats: OOM\n"); As does this. .... > diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch > new file mode 100644 > index 0000000..6dd4429 > --- /dev/null > +++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch > @@ -0,0 +1,60 @@ > +diff -rupN linux-2.6.20-rc7/include/scsi/iscsi_compat.h linux-2.6.9/include/scsi/iscsi_compat.h > +--- linux-2.6.20-rc7/include/scsi/iscsi_compat.h 1970-01-01 02:00:00.000000000 +0200 > ++++ linux-2.6.9/include/scsi/iscsi_compat.h 2007-02-08 08:45:39.000000000 +0200 Why isn't this in addons? > diff --git a/kernel_patches/backport/2.6.9_U3/add_transport_class_h.patch b/kernel_patches/backport/2.6.9_U3/add_transport_class_h.patch > new file mode 100644 > index 0000000..f2425e0 > --- /dev/null > +++ b/kernel_patches/backport/2.6.9_U3/add_transport_class_h.patch > @@ -0,0 +1,104 @@ > +diff -rupN linux-2.6.20-like-rh4/include/linux/transport_class.h linux-2.6.20/include/linux/transport_class.h > +--- linux-2.6.20-like-rh4/include/linux/transport_class.h 1970-01-01 02:00:00.000000000 +0200 > ++++ linux-2.6.20/include/linux/transport_class.h 2007-02-04 20:44:54.000000000 +0200 Why isn't this in addons? 
> diff --git a/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch > new file mode 100644 > index 0000000..3c2a969 > --- /dev/null > +++ b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch > @@ -0,0 +1,13 @@ > +--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:13:43.000000000 +0200 > ++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:14:31.000000000 +0200 > +@@ -70,9 +70,8 @@ > + #include > + #include > + #include > +-#include > +- > + #include "iscsi_iser.h" > ++#include > + > + static unsigned int iscsi_max_lun = 512; > + module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO); Looks like the right thing to do anyway. So put it in fixes instead, and post upstream. > diff --git a/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch b/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch > index e84b964..52c0136 100644 > --- a/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch > +++ b/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch > @@ -19,6 +19,62 @@ index 0000000..58cf933 > +++ b/drivers/infiniband/core/kfifo.c > @@ -0,0 +1 @@ > +#include "src/kfifo.c" > +diff --git a/drivers/infiniband/core/init.c b/drivers/infiniband/core/init.c > +new file mode 100644 > +index 0000000..58cf933 > +--- /dev/null > ++++ b/drivers/infiniband/core/init.c > +@@ -0,0 +1 @@ > ++#include "src/init.c" > +diff --git a/drivers/infiniband/core/attribute_container.c b/drivers/infiniband/core/attribute_container.c > +new file mode 100644 > +index 0000000..58cf933 > +--- /dev/null > ++++ b/drivers/infiniband/core/attribute_container.c > +@@ -0,0 +1 @@ > ++#include "src/attribute_container.c" > +diff --git a/drivers/infiniband/core/transport_class.c b/drivers/infiniband/core/transport_class.c > +new file mode 100644 > +index 0000000..58cf933 > +--- /dev/null > ++++ 
b/drivers/infiniband/core/transport_class.c > +@@ -0,0 +1 @@ > ++#include "src/transport_class.c" > +diff --git a/drivers/infiniband/core/klist.c b/drivers/infiniband/core/klist.c > +new file mode 100644 > +index 0000000..58cf933 > +--- /dev/null > ++++ b/drivers/infiniband/core/klist.c > +@@ -0,0 +1 @@ > ++#include "src/klist.c" > +diff --git a/drivers/infiniband/core/scsi.c b/drivers/infiniband/core/scsi.c > +new file mode 100644 > +index 0000000..58cf933 > +--- /dev/null > ++++ b/drivers/infiniband/core/scsi.c > +@@ -0,0 +1 @@ > ++#include "src/scsi.c" > +diff --git a/drivers/infiniband/core/scsi_lib.c b/drivers/infiniband/core/scsi_lib.c > +new file mode 100644 > +index 0000000..58cf933 > +--- /dev/null > ++++ b/drivers/infiniband/core/scsi_lib.c > +@@ -0,0 +1 @@ > ++#include "src/scsi_lib.c" > +diff --git a/drivers/infiniband/core/scsi_scan.c b/drivers/infiniband/core/scsi_scan.c > +new file mode 100644 > +index 0000000..58cf933 > +--- /dev/null > ++++ b/drivers/infiniband/core/scsi_scan.c > +@@ -0,0 +1 @@ > ++#include "src/scsi_scan.c" > +diff --git a/drivers/infiniband/core/kref_new.c b/drivers/infiniband/core/kref_new.c > +new file mode 100644 > +index 0000000..58cf933 > +--- /dev/null > ++++ b/drivers/infiniband/core/kref_new.c > +@@ -0,0 +1 @@ > ++#include "src/kref_new.c" > diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile > index 50fb1cd..456bfd0 100644 > --- a/drivers/infiniband/core/Makefile > @@ -28,4 +84,4 @@ index 50fb1cd..456bfd0 100644 > ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ > uverbs_marshall.o > + > -+ib_core-y += genalloc.o netevent.o kfifo.o > ++ib_core-y += genalloc.o netevent.o kfifo.o scsi.o scsi_lib.o scsi_scan.o init.o attribute_container.o transport_class.o klist.o kref_new.o Can we make these part of iser place? Linking scsi stuff into core does not look right. 
> diff --git a/kernel_patches/backport/2.6.9_U3/netlink-01-add_netlink_h.patch b/kernel_patches/backport/2.6.9_U3/netlink-01-add_netlink_h.patch > new file mode 100644 > index 0000000..cc071ef > --- /dev/null > +++ b/kernel_patches/backport/2.6.9_U3/netlink-01-add_netlink_h.patch > @@ -0,0 +1,247 @@ > +diff -rupN linux-2.6.20-like-rh4/include/linux/netlink.h linux-2.6.20/include/linux/netlink.h > +--- linux-2.6.20-like-rh4/include/linux/netlink.h 1970-01-01 02:00:00.000000000 +0200 > ++++ linux-2.6.20/include/linux/netlink.h 2007-02-04 20:44:54.000000000 +0200 Belongs in addons. -- MST From vlad at lists.openfabrics.org Thu May 10 02:31:48 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 10 May 2007 02:31:48 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070510-0200 daily build status Message-ID: <20070510093148.9781DE60828@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on ia64 with 
linux-2.6.18
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.14

Failed:

From rdreier at cisco.com Thu May 10 03:42:28 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 10 May 2007 03:42:28 -0700
Subject: [ofa-general] verbs abi_compat
In-Reply-To: (Jimmy Hill's message of "Wed, 9 May 2007 23:08:01 -0500")
References:
Message-ID:

> It is set in that it is non-zero, but I agree, it has garbage in it...and
> that's part of the problem. It is not being set in src/cmd.c, and has a
> non-zero value. When I call ibv_alloc_pd, I'm ending up in
> __ibv_alloc_pd_1_0 and that attempts to use context->real_context which is
> non-zero garbage as well and I get a segmentation violation. The abi_compat
> flag was what I thought was redirecting me into __ibv_alloc_pd_1_0 instead
> of __ibv_alloc_pd (where it should be going).
>
> So, maybe I asked the wrong question. Let me try a diff approach. What
> determines if ibv_alloc_pd resolves to __ibv_alloc_pd_1_0 or __ibv_alloc_pd?
> If I can find out what is redirecting my call to the "compat" code, maybe I
> can stop it and resolve the problem.

abi_compat has nothing to do with __ibv_alloc_pd vs. __ibv_alloc_pd_1_0.
Rather, that choice is made based on whether your app is linked against
the IBVERBS_1.1 or IBVERBS_1.0 ABI. If you link against the new library,
you should get all IBVERBS_1.1 symbols; if you link against libibverbs
1.0, you should get all IBVERBS_1.0 symbols.
Your problem might be that your app is getting __ibv_alloc_pd_1_0, but
it gets __ibv_open_device instead of __ibv_open_device_1_0, so the
context passed into __ibv_alloc_pd_1_0 is wrong. Are you possibly
relinking only part of your app or something?

 - R.

From yosefe at voltaire.com Thu May 10 04:26:31 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 14:26:31 +0300
Subject: [ofa-general] [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <20070509174138.GB17734@mellanox.co.il>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> <20070509174138.GB17734@mellanox.co.il>
Message-ID: <46430167.3010106@voltaire.com>

Added - handling the case when a pkey of an interface is deleted and then restored

--

This issue was found during partitioning & SM fail over testing.

* Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
* Obtain pkey index prior to entering init_qp, and save it in dev_priv
* Upon PKEY_CHANGE event, schedule a work that restarts the qp.
* Precondition the restart on whether the pkey index is really changed.
  Use the cached pkey_index to test this.
* Restart child interfaces before parent. They might be up even if the
  parent is down.
* When interface is restarted, queue delayed initialization, to handle
  the case that a pkey is deleted and later restored.
* Use uncached pkey query upon qp initialization

SM reconfiguration or failover possibly causes a shuffling of the values
in the port pkey table. The current implementation only queries for the
index of the pkey once, when it creates the device QP and after that moves
it into working state, and hence does not address this scenario.
Fix this by using the PKEY_CHANGE event as a trigger to reconfigure the device QP. Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 7 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 96 +++++++++++++++++++++++------ drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 26 ++----- 4 files changed, 97 insertions(+), 39 deletions(-) Index: b/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-10 08:34:58.335171047 +0300 @@ -202,15 +202,17 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; + struct delayed_work pkey_poll_task; struct delayed_work mcast_task; struct work_struct flush_task; struct work_struct restart_task; struct delayed_work ah_reap_task; + struct work_struct pkey_event_task; struct ib_device *ca; u8 port; u16 pkey; + u16 pkey_index; struct ib_pd *pd; struct ib_mr *mr; struct ib_cq *cq; @@ -333,12 +335,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 15:46:53.000000000 +0300 
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-10 13:01:10.737347938 +0300 @@ -408,11 +408,40 @@ void ipoib_reap_ah(struct work_struct *w queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); } +static int ipoib_find_pkey_index(struct ipoib_dev_priv *priv, int *is_changed) +{ + u16 new_index; + + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return -ENXIO; + } + + if (is_changed) + *is_changed = !test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) || + priv->pkey_index != new_index; + + priv->pkey_index = new_index; + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return 0; +} + int ipoib_ib_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. + */ + ret = ipoib_find_pkey_index(priv, NULL); + if (ret) { + ipoib_warn(priv, "pkey 0x%04x nof found\n", priv->pkey); + return -1; + } + ret = ipoib_init_qp(dev); if (ret) { ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret); @@ -422,14 +451,14 @@ int ipoib_ib_dev_open(struct net_device ret = ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -481,7 +510,7 @@ int ipoib_ib_dev_down(struct net_device if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { mutex_lock(&pkey_mutex); set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); + cancel_delayed_work(&priv->pkey_poll_task); mutex_unlock(&pkey_mutex); if (flush) flush_workqueue(ipoib_workqueue); @@ -508,7 +537,7 @@ static int recvs_pending(struct net_devi return pending; } -int 
ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +610,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,13 +652,22 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct *work) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; + int is_index_changed; + + mutex_lock(&priv->vlan_mutex); - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { + /* Flush any child interfaces too - + * they might be up even if the parent is down */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, pkey_event); + + mutex_unlock(&priv->vlan_mutex); + + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); return; } @@ -638,10 +677,23 @@ void ipoib_ib_dev_flush(struct work_stru return; } + if (pkey_event && + !ipoib_find_pkey_index(priv, &is_index_changed) && + !is_index_changed) { + ipoib_dbg(priv, "Not flushing - pkey index not changed.\n"); + return; + } + ipoib_dbg(priv, "flushing\n"); ipoib_ib_dev_down(dev, 0); + if (pkey_event) { + ipoib_ib_dev_stop(dev, 0); + ipoib_pkey_dev_delay_open(dev); + ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +702,24 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - 
mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task); - /* Flush any child interfaces too */ - list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - mutex_unlock(&priv->vlan_mutex); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_event_task); + + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void ipoib_ib_dev_cleanup(struct net_device *dev) @@ -685,7 +747,7 @@ void ipoib_ib_dev_cleanup(struct net_dev void ipoib_pkey_poll(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); + container_of(work, struct ipoib_dev_priv, pkey_poll_task.work); struct net_device *dev = priv->dev; ipoib_pkey_dev_check_presence(dev); @@ -696,7 +758,7 @@ void ipoib_pkey_poll(struct work_struct mutex_lock(&pkey_mutex); if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); } @@ -715,7 +777,7 @@ int ipoib_pkey_dev_delay_open(struct net mutex_lock(&pkey_mutex); clear_bit(IPOIB_PKEY_STOP, &priv->flags); queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); return 1; Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-09 17:21:03.000000000 +0300 @@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev) return -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + 
ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-10 09:13:28.997127223 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ -#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ret = -ENXIO; goto out; @@ -94,26 +92,17 @@ int ipoib_init_qp(struct net_device *dev { struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; - u16 pkey_index; struct ib_qp_attr qp_attr; int attr_mask; - /* - * Search through the port P_Key table for the requested pkey value. - * The port has to be assigned to the respective IB partition in - * advance. 
- */ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); - if (ret) { - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); - return ret; - } - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + /* Make sure we have a valid pkey_index in priv->pkey_index */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + return -1; qp_attr.qp_state = IB_QPS_INIT; qp_attr.qkey = 0; qp_attr.port_num = priv->port; - qp_attr.pkey_index = pkey_index; + qp_attr.pkey_index = priv->pkey_index; attr_mask = IB_QP_QKEY | IB_QP_PORT | @@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE || @@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler record->element.port_num == priv->port) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE && + record->element.port_num == priv->port) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_event_task); } } From mst at dev.mellanox.co.il Thu May 10 05:01:44 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 10 May 2007 15:01:44 +0300 Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events In-Reply-To: <46430167.3010106@voltaire.com> References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com> <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> <20070509174138.GB17734@mellanox.co.il> <46430167.3010106@voltaire.com> Message-ID: <20070510120144.GF13655@mellanox.co.il> OK, this is a whole different approach to the problem. Seems to make sense to me. > Quoting Yosef Etigin : > Subject: [PATCHv3 2/2] ipoib: handle pkey change events > > Added - handling the case when a pkey of an interface is deleted and then restored > > -- > > This issue was found during partitioning & SM fail over testing. > > * Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike > * Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity > * Obtain pkey index prior to entering init_qp, and save in in dev_priv > * Upon PKEY_CHANGE event, schedule a work that restarts the qp. > * Precondition the restart on whether the pkey index is really changed. > Use the cached pkey_index to test this. > * Restart child interfaces before parent. They might be up even if the > parent is down. > * When interface is restarted, queue delayed initiallization, to handle > the case that a pkey is deleted and later restored. > * Use uncached pkey query upon qp initiallization > > SM reconfiguration or failover possibly causes a shuffling of the values > in the port pkey table. The current implementation only queries for the > index of the pkey once, when it creates the device QP and after that moves > it into working state, and hence does not address this scenario. Fix this > by using the PKEY_CHANGE event as a trigger to reconfigure the device QP. 
> > Signed-off-by: Yosef Etigin > > --- > drivers/infiniband/ulp/ipoib/ipoib.h | 7 +- > drivers/infiniband/ulp/ipoib/ipoib_ib.c | 96 +++++++++++++++++++++++------ > drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +- > drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 26 ++----- > 4 files changed, 97 insertions(+), 39 deletions(-) > > Index: b/drivers/infiniband/ulp/ipoib/ipoib.h > =================================================================== > --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 15:46:53.000000000 +0300 > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-10 08:34:58.335171047 +0300 > @@ -202,15 +202,17 @@ struct ipoib_dev_priv { > struct list_head multicast_list; > struct rb_root multicast_tree; > > - struct delayed_work pkey_task; > + struct delayed_work pkey_poll_task; > struct delayed_work mcast_task; > struct work_struct flush_task; > struct work_struct restart_task; > struct delayed_work ah_reap_task; > + struct work_struct pkey_event_task; > > struct ib_device *ca; > u8 port; > u16 pkey; > + u16 pkey_index; > struct ib_pd *pd; > struct ib_mr *mr; > struct ib_cq *cq; > @@ -333,12 +335,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( > > int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); > void ipoib_ib_dev_flush(struct work_struct *work); > +void ipoib_pkey_event(struct work_struct *work); > void ipoib_ib_dev_cleanup(struct net_device *dev); > > int ipoib_ib_dev_open(struct net_device *dev); > int ipoib_ib_dev_up(struct net_device *dev); > int ipoib_ib_dev_down(struct net_device *dev, int flush); > -int ipoib_ib_dev_stop(struct net_device *dev); > +int ipoib_ib_dev_stop(struct net_device *dev, int flush); > > int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); > void ipoib_dev_cleanup(struct net_device *dev); > Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > =================================================================== > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 
15:46:53.000000000 +0300 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-10 13:01:10.737347938 +0300 > @@ -408,11 +408,40 @@ void ipoib_reap_ah(struct work_struct *w > queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); > } > > +static int ipoib_find_pkey_index(struct ipoib_dev_priv *priv, int *is_changed) > +{ > + u16 new_index; > + > + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { > + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > + return -ENXIO; > + } > + > + if (is_changed) > + *is_changed = !test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) || > + priv->pkey_index != new_index; > + > + priv->pkey_index = new_index; > + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > + return 0; > +} I suggest open-coding this - the name ipoib_find_pkey_index does not tell me that it actually sets flags, etc. > @@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler > container_of(handler, struct ipoib_dev_priv, event_handler); > > if ((record->event == IB_EVENT_PORT_ERR || > - record->event == IB_EVENT_PKEY_CHANGE || > record->event == IB_EVENT_PORT_ACTIVE || > record->event == IB_EVENT_LID_CHANGE || > record->event == IB_EVENT_SM_CHANGE || > @@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler > record->element.port_num == priv->port) { > ipoib_dbg(priv, "Port state change event\n"); > queue_work(ipoib_workqueue, &priv->flush_task); > + } else if (record->event == IB_EVENT_PKEY_CHANGE && > + record->element.port_num == priv->port) { > + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); > + queue_work(ipoib_workqueue, &priv->pkey_event_task); > } > } BTW, should we maybe do: if (record->element.port_num != priv->port) return; and then we won't have to do this test for each event type? 
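A rough user-space sketch of the early-return idea above, for illustration only. The enum, the trimmed-down struct ib_event, and the ipoib_event_action() helper are hypothetical stand-ins (the real handler queues work items rather than returning a value), and the sketch assumes every event handled here is port affiliated, so that port_num is meaningful:

```c
/* Hypothetical stand-ins for the kernel definitions; not the real ones. */
enum ib_event_type {
	IB_EVENT_PORT_ACTIVE,
	IB_EVENT_PORT_ERR,
	IB_EVENT_LID_CHANGE,
	IB_EVENT_PKEY_CHANGE,
	IB_EVENT_SM_CHANGE,
	IB_EVENT_CLIENT_REREGISTER,
	IB_EVENT_OTHER			/* placeholder for any other event */
};

struct ib_event {
	enum ib_event_type event;
	unsigned char port_num;		/* models record->element.port_num */
};

/* Which work item the handler would queue (hypothetical helper type). */
enum ipoib_action { ACTION_NONE, ACTION_FLUSH, ACTION_PKEY_EVENT };

static enum ipoib_action ipoib_event_action(unsigned char my_port,
					    const struct ib_event *record)
{
	/* Single port test up front instead of one test per branch. */
	if (record->port_num != my_port)
		return ACTION_NONE;

	switch (record->event) {
	case IB_EVENT_PORT_ERR:
	case IB_EVENT_PORT_ACTIVE:
	case IB_EVENT_LID_CHANGE:
	case IB_EVENT_SM_CHANGE:
	case IB_EVENT_CLIENT_REREGISTER:
		return ACTION_FLUSH;		/* would queue flush_task */
	case IB_EVENT_PKEY_CHANGE:
		return ACTION_PKEY_EVENT;	/* would queue pkey_event_task */
	default:
		return ACTION_NONE;
	}
}
```

With the port test hoisted, each event type needs only its own case label, and adding a new event does not repeat the port comparison.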
--
MST

From ogerlitz at voltaire.com Thu May 10 05:07:57 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 10 May 2007 15:07:57 +0300
Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <20070510120144.GF13655@mellanox.co.il>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com>
	<20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com>
	<20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com>
	<20070509174138.GB17734@mellanox.co.il> <46430167.3010106@voltaire.com>
	<20070510120144.GF13655@mellanox.co.il>
Message-ID: <46430B1D.1040905@voltaire.com>

Michael S. Tsirkin wrote:
>> @@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler
>>  		container_of(handler, struct ipoib_dev_priv, event_handler);
>>
>>  	if ((record->event == IB_EVENT_PORT_ERR ||
>> -	     record->event == IB_EVENT_PKEY_CHANGE ||
>>  	     record->event == IB_EVENT_PORT_ACTIVE ||
>>  	     record->event == IB_EVENT_LID_CHANGE ||
>>  	     record->event == IB_EVENT_SM_CHANGE ||
>> @@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler
>>  	    record->element.port_num == priv->port) {
>>  		ipoib_dbg(priv, "Port state change event\n");
>>  		queue_work(ipoib_workqueue, &priv->flush_task);
>> +	} else if (record->event == IB_EVENT_PKEY_CHANGE &&
>> +		   record->element.port_num == priv->port) {
>> +		ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
>> +		queue_work(ipoib_workqueue, &priv->pkey_event_task);
>>  	}
>>  }
>
> BTW, should we maybe do:
> 	if (record->element.port_num != priv->port)
> 		return;
>
> and then we won't have to do this test for each event type?

Just make sure that all the events covered by this check are port
affiliated, i.e., that none of them has a wider scope.

Or.

From mst at dev.mellanox.co.il Thu May 10 05:10:39 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 10 May 2007 15:10:39 +0300 Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events In-Reply-To: <46430B1D.1040905@voltaire.com> References: <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> <20070509174138.GB17734@mellanox.co.il> <46430167.3010106@voltaire.com> <20070510120144.GF13655@mellanox.co.il> <46430B1D.1040905@voltaire.com> Message-ID: <20070510121039.GI13655@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events > > Michael S. Tsirkin wrote: > >>@@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler > >> container_of(handler, struct ipoib_dev_priv, event_handler); > >> > >> if ((record->event == IB_EVENT_PORT_ERR || > >>- record->event == IB_EVENT_PKEY_CHANGE || > >> record->event == IB_EVENT_PORT_ACTIVE || > >> record->event == IB_EVENT_LID_CHANGE || > >> record->event == IB_EVENT_SM_CHANGE || > >>@@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler > >> record->element.port_num == priv->port) { > >> ipoib_dbg(priv, "Port state change event\n"); > >> queue_work(ipoib_workqueue, &priv->flush_task); > >>+ } else if (record->event == IB_EVENT_PKEY_CHANGE && > >>+ record->element.port_num == priv->port) { > >>+ ipoib_dbg(priv, "pkey change event on port:%d\n", > >>priv->port); > >>+ queue_work(ipoib_workqueue, &priv->pkey_event_task); > >> } > >> } > > > >BTW, should we maybe do: > >if (record->element.port_num != priv->port) > > return; > > > >and then we won't have to do this test for each event type? > > Just make sure that all the events covered by this check are port > affiliated, ie don't have a wider scope. > > Or. 
Well, we currently have:

void ipoib_event(struct ib_event_handler *handler,
		 struct ib_event *record)
{
	struct ipoib_dev_priv *priv =
		container_of(handler, struct ipoib_dev_priv, event_handler);

	if ((record->event == IB_EVENT_PORT_ERR ||
	     record->event == IB_EVENT_PKEY_CHANGE ||
	     record->event == IB_EVENT_PORT_ACTIVE ||
	     record->event == IB_EVENT_LID_CHANGE ||
	     record->event == IB_EVENT_SM_CHANGE ||
	     record->event == IB_EVENT_CLIENT_REREGISTER) &&
	    record->element.port_num == priv->port) {
		ipoib_dbg(priv, "Port state change event\n");
		queue_work(ipoib_workqueue, &priv->flush_task);
	}
}

So this would not change any behavior; it would just clean up the code a little.

--
MST

From ogerlitz at voltaire.com Thu May 10 05:12:00 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 10 May 2007 15:12:00 +0300
Subject: [ofa-general] Re: [PATCHv3 2/2] ipoib: handle pkey change events
In-Reply-To: <20070510121039.GI13655@mellanox.co.il>
References: <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com>
	<20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com>
	<20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com>
	<20070509174138.GB17734@mellanox.co.il> <46430167.3010106@voltaire.com>
	<20070510120144.GF13655@mellanox.co.il> <46430B1D.1040905@voltaire.com>
	<20070510121039.GI13655@mellanox.co.il>
Message-ID: <46430C10.7040702@voltaire.com>

Michael S.
Tsirkin wrote: > Well, we currently have: > > void ipoib_event(struct ib_event_handler *handler, > struct ib_event *record) > { > struct ipoib_dev_priv *priv = > container_of(handler, struct ipoib_dev_priv, event_handler); > > if ((record->event == IB_EVENT_PORT_ERR || > record->event == IB_EVENT_PKEY_CHANGE || > record->event == IB_EVENT_PORT_ACTIVE || > record->event == IB_EVENT_LID_CHANGE || > record->event == IB_EVENT_SM_CHANGE || > record->event == IB_EVENT_CLIENT_REREGISTER) && > record->element.port_num == priv->port) { > ipoib_dbg(priv, "Port state change event\n"); > queue_work(ipoib_workqueue, &priv->flush_task); > } > } > > So this would not change anything, just clean up code a little. OK From ogerlitz at voltaire.com Thu May 10 05:23:14 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 10 May 2007 15:23:14 +0300 Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> Message-ID: <46430EB2.7080703@voltaire.com> Jeff Squyres wrote: > Galen Shipman and I talked about this a bit and suggest the following: > > - During the connection dance (probably for both the udapl and openib > BTLs), whichever peer ends up being the connection initiator (don't > forget about the race condition where 2 peers may simultaneously decide > to initiate -- this case is handled properly in the OMPI code; but just > make sure you modify the side that ends up being actual 
initiator), they
> can send their pending fragment immediately (and Steve is right that
> there will always be a pending fragment, because OMPI doesn't make a
> connection until the first send).
>
> - The other peer (the receiver of the connection) must wait to send its
> pending fragment(s) until it receives the first frag from the connection
> initiator. This can be accomplished either with another flag on the
> OMPI module struct or perhaps making it part of the connection protocol
> (i.e., don't transition the endpoint to be CONNECTED until the first
> fragment is received). Either of which can be used to queue up
> fragments on the receiver until the first fragment is received from the
> initiator. I'd have to look in the code deeper, but I'm *guessing* that
> it might be best to use the already-existing state flag (i.e., checking
> for CONNECTED) because then you won't be introducing any more
> conditionals in the critical path.

A different approach you might want to consider is to have, at the BTL
level, --two-- connections per pair of ranks, so that if A wants to send
to B it does so through the A --> B connection, and if B wants to send to
A it does so through the B --> A connection. To some extent, this is the
approach taken by IPoIB-CM (I am not familiar enough with the RFC to
understand the reasoning, but I am quite sure this was the approach in
the initial implementation). At first thought it might not seem very
elegant, but once you work through the details (projected onto the OMPI
environment) you might even find it nice.

Or.
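For comparison with the two-connection idea, the single-connection rule from the quoted text (the initiator sends first; the accepting side holds its fragments until the first one arrives) could be modelled roughly as below. This is a user-space sketch under stated assumptions: struct endpoint, ep_send(), ep_wire_up() and ep_recv() are hypothetical names, fragments are plain ints, and none of this is Open MPI code.

```c
#include <stddef.h>

#define MAX_PENDING 8		/* arbitrary bound for the sketch */
#define MAX_SENT    16

enum ep_state { EP_CONNECTING, EP_CONNECTED };

/* Hypothetical endpoint model: fragments are just integer ids. */
struct endpoint {
	enum ep_state state;
	int is_initiator;
	int pending[MAX_PENDING];	/* fragments held back */
	size_t npending;
	int sent[MAX_SENT];		/* fragments "put on the wire" */
	size_t nsent;
};

static void transmit(struct endpoint *ep, int frag)
{
	if (ep->nsent < MAX_SENT)
		ep->sent[ep->nsent++] = frag;
}

static void flush_pending(struct endpoint *ep)
{
	size_t i;

	for (i = 0; i < ep->npending; i++)
		transmit(ep, ep->pending[i]);
	ep->npending = 0;
}

/* Send path: only one state check, queue while not CONNECTED. */
static int ep_send(struct endpoint *ep, int frag)
{
	if (ep->state == EP_CONNECTED) {
		transmit(ep, frag);
		return 0;
	}
	if (ep->npending == MAX_PENDING)
		return -1;		/* queue full */
	ep->pending[ep->npending++] = frag;
	return 0;
}

/* Connection handshake finished: only the initiator may start sending. */
static void ep_wire_up(struct endpoint *ep)
{
	if (ep->is_initiator) {
		ep->state = EP_CONNECTED;
		flush_pending(ep);
	}
}

/* A fragment arrived: the accepting side becomes CONNECTED on the first one. */
static void ep_recv(struct endpoint *ep)
{
	if (ep->state != EP_CONNECTED) {
		ep->state = EP_CONNECTED;
		flush_pending(ep);
	}
}
```

Reusing the CONNECTED state as the gate keeps the send path at a single conditional, which is the property the quoted text is guessing would matter on the critical path.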
From yosefe at voltaire.com Thu May 10 05:25:42 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 15:25:42 +0300
Subject: [ofa-general] [PATCHv4 2/2] ipoib: handle pkey change events
In-Reply-To: <20070510120144.GF13655@mellanox.co.il>
References: <4640812C.6060003@voltaire.com> <46408360.3040006@voltaire.com>
	<20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com>
	<20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com>
	<20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com>
	<20070509174138.GB17734@mellanox.co.il> <46430167.3010106@voltaire.com>
	<20070510120144.GF13655@mellanox.co.il>
Message-ID: <46430F46.1080002@voltaire.com>

Comments:
1. the return -1 is consistent with all the other "return -1" statements
   in ipoib_ib_dev_open
2. the polling thread is stopped in ipoib_ib_dev_down

Changes:
* remove ipoib_find_pkey()

--

This issue was found during partitioning & SM failover testing.

* Add a flush flag to ipoib_ib_dev_stop(), as ipoib_ib_dev_down() already has
* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
* Obtain the pkey index prior to entering init_qp, and save it in dev_priv
* Upon a PKEY_CHANGE event, schedule a work item that restarts the QP.
* Precondition the restart on whether the pkey index has actually changed.
  Use the cached pkey_index to test this.
* Restart child interfaces before the parent. They might be up even if the
  parent is down.
* When an interface is restarted, queue delayed initialization, to handle
  the case where a pkey is deleted and later restored.
* Use an uncached pkey query upon QP initialization

SM reconfiguration or failover can shuffle the values in the port pkey
table. The current implementation queries for the index of the pkey only
once, when it creates the device QP, and after that moves the QP into
working state; hence it does not address this scenario. Fix this by using
the PKEY_CHANGE event as a trigger to reconfigure the device QP.
Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 7 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 88 +++++++++++++++++++++++------ drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 26 ++------ 4 files changed, 89 insertions(+), 39 deletions(-) Index: b/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-10 08:34:58.335171047 +0300 @@ -202,15 +202,17 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; + struct delayed_work pkey_poll_task; struct delayed_work mcast_task; struct work_struct flush_task; struct work_struct restart_task; struct delayed_work ah_reap_task; + struct work_struct pkey_event_task; struct ib_device *ca; u8 port; u16 pkey; + u16 pkey_index; struct ib_pd *pd; struct ib_mr *mr; struct ib_cq *cq; @@ -333,12 +335,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-10 15:16:47.592982550 +0300 
@@ -413,6 +413,18 @@ int ipoib_ib_dev_open(struct net_device struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. + */ + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) { + ipoib_warn(priv, "pkey 0x%04x nof found\n", priv->pkey); + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return -1; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = ipoib_init_qp(dev); if (ret) { ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret); @@ -422,14 +434,14 @@ int ipoib_ib_dev_open(struct net_device ret = ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -481,7 +493,7 @@ int ipoib_ib_dev_down(struct net_device if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { mutex_lock(&pkey_mutex); set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); + cancel_delayed_work(&priv->pkey_poll_task); mutex_unlock(&pkey_mutex); if (flush) flush_workqueue(ipoib_workqueue); @@ -508,7 +520,7 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +593,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,13 +635,22 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct *work) 
+static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; + u16 new_index; - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { + mutex_lock(&priv->vlan_mutex); + + /* Flush any child interfaces too - + * they might be up even if the parent is down */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, pkey_event); + + mutex_unlock(&priv->vlan_mutex); + + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); return; } @@ -638,10 +660,32 @@ void ipoib_ib_dev_flush(struct work_stru return; } + if (pkey_event) { + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ipoib_ib_dev_down(dev, 0); + ipoib_pkey_dev_delay_open(dev); + return; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* restart qp only of pkey index is cahnged */ + if (new_index == priv->pkey_index) { + ipoib_dbg(priv, "Not flushing - pkey index not changed.\n"); + return; + } + priv->pkey_index = new_index; + } + ipoib_dbg(priv, "flushing\n"); ipoib_ib_dev_down(dev, 0); + if (pkey_event) { + ipoib_ib_dev_stop(dev, 0); + ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +694,24 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task); + + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - /* Flush any child interfaces too */ - 
list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_event_task); - mutex_unlock(&priv->vlan_mutex); + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void ipoib_ib_dev_cleanup(struct net_device *dev) @@ -685,7 +739,7 @@ void ipoib_ib_dev_cleanup(struct net_dev void ipoib_pkey_poll(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); + container_of(work, struct ipoib_dev_priv, pkey_poll_task.work); struct net_device *dev = priv->dev; ipoib_pkey_dev_check_presence(dev); @@ -696,7 +750,7 @@ void ipoib_pkey_poll(struct work_struct mutex_lock(&pkey_mutex); if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); } @@ -715,7 +769,7 @@ int ipoib_pkey_dev_delay_open(struct net mutex_lock(&pkey_mutex); clear_bit(IPOIB_PKEY_STOP, &priv->flags); queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); return 1; Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-09 17:21:03.000000000 +0300 @@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev) return -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; 
@@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-10 09:13:28.997127223 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ -#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ret = -ENXIO; goto out; @@ -94,26 +92,17 @@ int ipoib_init_qp(struct net_device *dev { struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; - u16 pkey_index; struct ib_qp_attr qp_attr; int attr_mask; - /* - * Search through the port P_Key table for the requested pkey value. - * The port has to be assigned to the respective IB partition in - * advance. 
- */ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); - if (ret) { - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); - return ret; - } - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + /* Make sure we have a valid pkey_index in priv->pkey_index */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + return -1; qp_attr.qp_state = IB_QPS_INIT; qp_attr.qkey = 0; qp_attr.port_num = priv->port; - qp_attr.pkey_index = pkey_index; + qp_attr.pkey_index = priv->pkey_index; attr_mask = IB_QP_QKEY | IB_QP_PORT | @@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE || @@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler record->element.port_num == priv->port) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE && + record->element.port_num == priv->port) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_event_task); } } From jsquyres at cisco.com Thu May 10 05:26:22 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 10 May 2007 08:26:22 -0400 Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <46430EB2.7080703@voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> 
<464232DC.9010201@Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> <46430EB2.7080703@voltaire.com> Message-ID: On May 10, 2007, at 8:23 AM, Or Gerlitz wrote: > A different approach which you might want to consider is to have at > the btl level --two-- connections per ranks. so if A > wants to send B it does so through the A --> B connection and if B > wants to send A it does so through the B --> A connection. To some > extent, this is the approach taken by IPoIB-CM (I am not enough > into the RFC to understand the reasoning but i am quite sure this > was the approach in the initial implementation). At first thought > it mights seems not very elegant, but taking it into the details > (projected on the ompi env) you might find it even nice. What is the advantage of this approach? -- Jeff Squyres Cisco Systems From mst at dev.mellanox.co.il Thu May 10 05:38:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 May 2007 15:38:55 +0300 Subject: [ofa-general] Re: [PATCHv4 2/2] ipoib: handle pkey change events In-Reply-To: <46430F46.1080002@voltaire.com> References: <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> <20070509174138.GB17734@mellanox.co.il> <46430167.3010106@voltaire.com> <20070510120144.GF13655@mellanox.co.il> <46430F46.1080002@voltaire.com> Message-ID: <20070510123855.GL13655@mellanox.co.il> > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 15:46:53.000000000 +0300 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-10 15:16:47.592982550 +0300 > @@ -413,6 +413,18 @@ int ipoib_ib_dev_open(struct net_device > struct ipoib_dev_priv *priv = netdev_priv(dev); > int ret; > > + /* > + * Search through the port P_Key table for the requested pkey value. > + * The port has to be assigned to the respective IB partition in > + * advance. 
> +	 */
> +	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) {
> +		ipoib_warn(priv, "pkey 0x%04x nof found\n", priv->pkey);
> +		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +		return -1;
> +	}
> +	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
> +
>  	ret = ipoib_init_qp(dev);
>  	if (ret) {
>  		ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret);

Return a real error code here, e.g. -ENXIO, rather than -1.

--
MST

From ossrosch at linux.vnet.ibm.com Thu May 10 05:41:33 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 14:41:33 +0200
Subject: [ofa-general] [PATCH ofed-1.2-rc3 0/3] ehca: backport for rhel-4.5
Message-ID: <200705101441.33813.ossrosch@linux.vnet.ibm.com>

Hi,

these are the patches to backport the ehca driver for rhel-4.5.
The patches switch the driver back to the old mmap style due to the lack
of vm_insert_page() support in kernel 2.6.9.
Additionally, we use the introduced dma_ops mechanism in order to
communicate with ibmebus.

regards
Stefan

From ossrosch at linux.vnet.ibm.com Thu May 10 05:41:43 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 14:41:43 +0200
Subject: [ofa-general] [PATCH ofed-1.2-rc3 1/3] ehca: backport for rhel-4.5 - hvcall.h
Message-ID: <200705101441.44286.ossrosch@linux.vnet.ibm.com>

Use kmem_cache_t instead of struct kmem_cache and update hvcall.h

Signed-off-by: Stefan Roscher
---
 drivers/infiniband/hw/ehca/ehca_av.c | 2
 drivers/infiniband/hw/ehca/ehca_cq.c | 2
 drivers/infiniband/hw/ehca/ehca_main.c | 2
 drivers/infiniband/hw/ehca/ehca_mrmw.c | 4
 drivers/infiniband/hw/ehca/ehca_pd.c | 2
 drivers/infiniband/hw/ehca/ehca_qp.c | 2
 kernel_addons/backport/2.6.9_U5/include/asm-powerpc/system.h | 1
 kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h | 167 +++++++++++
 8 files changed, 174 insertions(+), 8 deletions(-)

diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_av.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_av.c
---
ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_av.c 2007-05-09 12:42:01.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_av.c 2007-05-09 12:42:34.000000000 +0200 @@ -48,7 +48,7 @@ #include "ehca_iverbs.h" #include "hcp_if.h" -static struct kmem_cache *av_cache; +static kmem_cache_t *av_cache; struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) { diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c 2007-05-09 12:42:01.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c 2007-05-09 12:42:34.000000000 +0200 @@ -50,7 +50,7 @@ #include "ehca_irq.h" #include "hcp_if.h" -static struct kmem_cache *cq_cache; +static kmem_cache_t *cq_cache; int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp) { diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-09 12:42:01.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-09 12:42:34.000000000 +0200 @@ -465,7 +465,6 @@ void ehca_remove_driver_sysfs(struct ibm #define EHCA_RESOURCE_ATTR(name) \ static ssize_t ehca_show_##name(struct device *dev, \ - struct device_attribute *attr, \ char *buf) \ { \ struct ehca_shca *shca; \ @@ -513,7 +512,6 @@ EHCA_RESOURCE_ATTR(max_pd); EHCA_RESOURCE_ATTR(max_ah); static ssize_t ehca_show_adapter_handle(struct device *dev, - struct device_attribute *attr, char *buf) { struct ehca_shca *shca = dev->driver_data; diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_mrmw.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_mrmw.c --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_mrmw.c 2007-05-09 12:42:01.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_mrmw.c 2007-05-09 12:42:34.000000000 
+0200 @@ -46,8 +46,8 @@ #include "hcp_if.h" #include "hipz_hw.h" -static struct kmem_cache *mr_cache; -static struct kmem_cache *mw_cache; +static kmem_cache_t *mr_cache; +static kmem_cache_t *mw_cache; static struct ehca_mr *ehca_mr_new(void) { diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_pd.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_pd.c --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_pd.c 2007-05-09 12:42:01.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_pd.c 2007-05-09 12:42:34.000000000 +0200 @@ -43,7 +43,7 @@ #include "ehca_tools.h" #include "ehca_iverbs.h" -static struct kmem_cache *pd_cache; +static kmem_cache_t *pd_cache; struct ib_pd *ehca_alloc_pd(struct ib_device *device, struct ib_ucontext *context, struct ib_udata *udata) diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c 2007-05-09 12:42:01.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c 2007-05-09 12:42:34.000000000 +0200 @@ -51,7 +51,7 @@ #include "hcp_if.h" #include "hipz_fns.h" -static struct kmem_cache *qp_cache; +static kmem_cache_t *qp_cache; /* * attributes not supported by query qp diff -Nurp ofa_kernel-1.2_old/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h ofa_kernel-1.2_new/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h --- ofa_kernel-1.2_old/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h 2007-05-09 12:48:09.000000000 +0200 +++ ofa_kernel-1.2_new/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h 2007-05-09 12:51:00.000000000 +0200 @@ -137,6 +137,173 @@ inline static long plpar_hcall9(unsigned return regs[0]; } +inline static long plpar_hcall_7arg_7ret(unsigned long opcode, + unsigned long arg1, /* From ossrosch at linux.vnet.ibm.com Thu May 10 05:41:57 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Thu, 10 May 2007 14:41:57 +0200 
Subject: [ofa-general] [PATCH ofed-1.2-rc3 3/3] ehca: backport for rhel-4.5 - use introduced dma_ops Message-ID: <200705101441.58102.ossrosch@linux.vnet.ibm.com> use introduced dma_ops Signed-off-by: Stefan Roscher --- Makefile | 2 ehca_dma.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ehca_main.c | 2 3 files changed, 197 insertions(+), 1 deletion(-) diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_dma.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_dma.c --- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_dma.c 1970-01-01 01:00:00.000000000 +0100 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_dma.c 2007-05-03 16:25:30.000000000 +0200 @@ -0,0 +1,194 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * eHCA dma mapping via ibmebus + * + * Authors: Stefan Roscher + * Hoang-Nam Nguyen + * + * Copyright (c) 2007 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ + +#include +#include + +static int ehca_mapping_error(struct ib_device *dev, u64 dma_addr); + +static u64 ehca_dma_map_single(struct ib_device *dev, + void *cpu_addr, size_t size, + enum dma_data_direction direction); + +static void ehca_dma_unmap_single(struct ib_device *dev, + u64 addr, size_t size, + enum dma_data_direction direction); + +static u64 ehca_dma_map_page(struct ib_device *dev, + struct page *page, + unsigned long offset, + size_t size, + enum dma_data_direction direction); + +static void ehca_dma_unmap_page(struct ib_device *dev, + u64 addr, size_t size, + enum dma_data_direction direction); + +int ehca_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction direction); + +static void ehca_unmap_sg(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction); + +static u64 ehca_sg_dma_address(struct ib_device *dev, struct scatterlist *sg); + +static unsigned int ehca_sg_dma_len(struct ib_device *dev, + struct scatterlist *sg); + +static void ehca_sync_single_for_cpu(struct ib_device *dev, + u64 addr, + size_t size, + enum dma_data_direction dir); + +static void ehca_sync_single_for_device(struct ib_device *dev, + u64 addr, + size_t size, + enum dma_data_direction dir); + +static void *ehca_dma_alloc_coherent(struct ib_device *dev, size_t size, + u64 *dma_handle, gfp_t flag); + +static void ehca_dma_free_coherent(struct ib_device *dev, size_t 
size, + void *cpu_addr, dma_addr_t dma_handle); + +struct ib_dma_mapping_ops ehca_dma_mapping_ops = { + ehca_mapping_error, + ehca_dma_map_single, + ehca_dma_unmap_single, + ehca_dma_map_page, + ehca_dma_unmap_page, + ehca_map_sg, + ehca_unmap_sg, + ehca_sg_dma_address, + ehca_sg_dma_len, + ehca_sync_single_for_cpu, + ehca_sync_single_for_device, + ehca_dma_alloc_coherent, + ehca_dma_free_coherent +}; + +static int ehca_mapping_error(struct ib_device *dev, u64 dma_addr) +{ + return dma_addr == 0L; +} + +static u64 ehca_dma_map_single(struct ib_device *dev, + void *cpu_addr, size_t size, + enum dma_data_direction direction) +{ + return ibmebus_map_single(dev, cpu_addr, size, direction); +} + +static void ehca_dma_unmap_single(struct ib_device *dev, + u64 addr, size_t size, + enum dma_data_direction direction) +{ + ibmebus_unmap_single(dev, addr, size, direction); +} + +static u64 ehca_dma_map_page(struct ib_device *dev, + struct page *page, + unsigned long offset, + size_t size, + enum dma_data_direction direction) +{ + return dma_map_page(dev->dma_device, page, offset, size, direction); +} + +static void ehca_dma_unmap_page(struct ib_device *dev, + u64 addr, size_t size, + enum dma_data_direction direction) +{ + dma_unmap_page(dev->dma_device, addr, size, direction); +} + +int ehca_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction direction) +{ + return ibmebus_map_sg(dev, sg, nents, direction); +} + +static void ehca_unmap_sg(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction) +{ + ibmebus_unmap_sg(dev, sg, nents, direction); +} + +static u64 ehca_sg_dma_address(struct ib_device *dev, struct scatterlist *sg) +{ + return sg_dma_address(sg); +} + +static unsigned int ehca_sg_dma_len(struct ib_device *dev, + struct scatterlist *sg) +{ + return sg_dma_len(sg); +} + +static void ehca_sync_single_for_cpu(struct ib_device *dev, + u64 addr, + size_t size, + enum dma_data_direction dir) 
+{ + dma_sync_single_for_cpu(dev->dma_device, addr, size, dir); +} + +static void ehca_sync_single_for_device(struct ib_device *dev, + u64 addr, + size_t size, + enum dma_data_direction dir) +{ + dma_sync_single_for_device(dev->dma_device, addr, size, dir); +} + +static void *ehca_dma_alloc_coherent(struct ib_device *dev, size_t size, + u64 *dma_handle, gfp_t flag) +{ + return ibmebus_alloc_coherent(dev, size, dma_handle, flag); +} + +static void ehca_dma_free_coherent(struct ib_device *dev, size_t size, + void *cpu_addr, dma_addr_t dma_handle) +{ + ibmebus_free_coherent(dev, size, cpu_addr, dma_handle); +} diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c --- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_main.c 2007-04-29 15:10:56.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-03 16:19:28.000000000 +0200 @@ -279,6 +279,7 @@ init_node_guid1: int ehca_init_device(struct ehca_shca *shca) { + extern struct ib_dma_mapping_ops ehca_dma_mapping_ops; int ret; ret = init_node_guid(shca); @@ -354,6 +355,7 @@ int ehca_init_device(struct ehca_shca *s shca->ib_device.detach_mcast = ehca_detach_mcast; /* shca->ib_device.process_mad = ehca_process_mad; */ shca->ib_device.mmap = ehca_mmap; + shca->ib_device.dma_ops = &ehca_dma_mapping_ops; return ret; } diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/Makefile ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/Makefile --- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/Makefile 2007-04-29 15:10:56.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/Makefile 2007-05-03 16:26:13.000000000 +0200 @@ -12,5 +12,5 @@ obj-$(CONFIG_INFINIBAND_EHCA) += ib_ehca ib_ehca-objs = ehca_main.o ehca_hca.o ehca_mcast.o ehca_pd.o ehca_av.o ehca_eq.o \ ehca_cq.o ehca_qp.o ehca_sqp.o ehca_mrmw.o ehca_reqs.o ehca_irq.o \ - ehca_uverbs.o ipz_pt_fn.o hcp_if.o hcp_phyp.o + ehca_uverbs.o ehca_dma.o ipz_pt_fn.o 
hcp_if.o hcp_phyp.o From ossrosch at linux.vnet.ibm.com Thu May 10 05:41:52 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Thu, 10 May 2007 14:41:52 +0200 Subject: [ofa-general] [PATCH ofed-1.2-rc3 2/3] ehca: backport for rhel-4.5 - mmap functonality Message-ID: <200705101441.52922.ossrosch@linux.vnet.ibm.com> change ehca module to the older mmap functionality due to lack of vm_insert_page() support in kernel 2.6.9 Signed-off-by: Stefan Roscher --- ehca_classes.h | 29 +++- ehca_cq.c | 65 +++++++-- ehca_iverbs.h | 10 + ehca_main.c | 8 - ehca_qp.c | 78 +++++++++-- ehca_uverbs.c | 395 +++++++++++++++++++++++++++++++++------------------------ 6 files changed, 379 insertions(+), 206 deletions(-) diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_classes.h --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-05-04 10:38:23.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-05-04 10:40:06.000000000 +0200 @@ -126,14 +126,13 @@ struct ehca_qp { struct ipz_qp_handle ipz_qp_handle; struct ehca_pfqp pf; struct ib_qp_init_attr init_attr; + u64 uspace_squeue; + u64 uspace_rqueue; + u64 uspace_fwh; struct ehca_cq *send_cq; struct ehca_cq *recv_cq; unsigned int sqerr_purgeflag; struct hlist_node list_entries; - /* mmap counter for resources mapped into user space */ - u32 mm_count_squeue; - u32 mm_count_rqueue; - u32 mm_count_galpa; }; /* must be power of 2 */ @@ -150,6 +149,8 @@ struct ehca_cq { struct ipz_cq_handle ipz_cq_handle; struct ehca_pfcq pf; spinlock_t cb_lock; + u64 uspace_queue; + u64 uspace_fwh; struct hlist_head qp_hashtab[QP_HASHTAB_LEN]; struct list_head entry; u32 nr_callbacks; /* #events assigned to cpu by scaling code */ @@ -157,9 +158,6 @@ struct ehca_cq { wait_queue_head_t wait_completion; spinlock_t task_lock; u32 ownpid; - /* mmap counter for resources mapped into user space */ - u32 mm_count_queue; - u32 mm_count_galpa;
}; enum ehca_mr_flag { @@ -259,6 +257,20 @@ struct ehca_ucontext { struct ib_ucontext ib_ucontext; }; +struct ehca_module *ehca_module_new(void); + +int ehca_module_delete(struct ehca_module *me); + +int ehca_eq_ctor(struct ehca_eq *eq); + +int ehca_eq_dtor(struct ehca_eq *eq); + +struct ehca_shca *ehca_shca_new(void); + +int ehca_shca_delete(struct ehca_shca *me); + +struct ehca_sport *ehca_sport_new(struct ehca_shca *anchor); + int ehca_init_pd_cache(void); void ehca_cleanup_pd_cache(void); int ehca_init_cq_cache(void); @@ -282,6 +294,7 @@ extern int ehca_use_hp_mr; extern int ehca_scaling_code; struct ipzu_queue_resp { + u64 queue; /* points to first queue entry */ u32 qe_size; /* queue entry size */ u32 act_nr_of_sg; u32 queue_length; /* queue length allocated in bytes */ @@ -294,6 +307,7 @@ struct ehca_create_cq_resp { u32 cq_number; u32 token; struct ipzu_queue_resp ipz_queue; + struct h_galpas galpas; }; struct ehca_create_qp_resp { @@ -306,6 +320,7 @@ struct ehca_create_qp_resp { u32 dummy; /* padding for 8 byte alignment */ struct ipzu_queue_resp ipz_squeue; struct ipzu_queue_resp ipz_rqueue; + struct h_galpas galpas; }; struct ehca_alloc_cq_parms { diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c 2007-05-04 10:38:23.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c 2007-05-04 10:40:06.000000000 +0200 @@ -268,6 +268,7 @@ struct ib_cq *ehca_create_cq(struct ib_d if (context) { struct ipz_queue *ipz_queue = &my_cq->ipz_queue; struct ehca_create_cq_resp resp; + struct vm_area_struct *vma; memset(&resp, 0, sizeof(resp)); resp.cq_number = my_cq->cq_number; resp.token = my_cq->token; @@ -276,14 +277,40 @@ struct ib_cq *ehca_create_cq(struct ib_d resp.ipz_queue.queue_length = ipz_queue->queue_length; resp.ipz_queue.pagesize = ipz_queue->pagesize; resp.ipz_queue.toggle_state = ipz_queue->toggle_state; + ret 
= ehca_mmap_nopage(((u64)(my_cq->token) << 32) | 0x12000000, + ipz_queue->queue_length, + (void**)&resp.ipz_queue.queue, + &vma); + if (ret) { + ehca_err(device, "Could not mmap queue pages"); + cq = ERR_PTR(ret); + goto create_cq_exit4; + } + my_cq->uspace_queue = resp.ipz_queue.queue; + resp.galpas = my_cq->galpas; + ret = ehca_mmap_register(my_cq->galpas.user.fw_handle, + (void**)&resp.galpas.kernel.fw_handle, + &vma); + if (ret) { + ehca_err(device, "Could not mmap fw_handle"); + cq = ERR_PTR(ret); + goto create_cq_exit5; + } + my_cq->uspace_fwh = (u64)resp.galpas.kernel.fw_handle; if (ib_copy_to_udata(udata, &resp, sizeof(resp))) { ehca_err(device, "Copy to udata failed."); - goto create_cq_exit4; + goto create_cq_exit6; } } return cq; +create_cq_exit6: + ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE); + +create_cq_exit5: + ehca_munmap(my_cq->uspace_queue, my_cq->ipz_queue.queue_length); + create_cq_exit4: ipz_queue_dtor(&my_cq->ipz_queue); @@ -317,6 +344,7 @@ static int get_cq_nr_events(struct ehca_ int ehca_destroy_cq(struct ib_cq *cq) { u64 h_ret; + int ret; struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); int cq_num = my_cq->cq_number; struct ib_device *device = cq->device; @@ -326,20 +354,6 @@ int ehca_destroy_cq(struct ib_cq *cq) u32 cur_pid = current->tgid; unsigned long flags; - if (cq->uobject) { - if (my_cq->mm_count_galpa || my_cq->mm_count_queue) { - ehca_err(device, "Resources still referenced in " - "user space cq_num=%x", my_cq->cq_number); - return -EINVAL; - } - if (my_cq->ownpid != cur_pid) { - ehca_err(device, "Invalid caller pid=%x ownpid=%x " - "cq_num=%x", - cur_pid, my_cq->ownpid, my_cq->cq_number); - return -EINVAL; - } - } - spin_lock_irqsave(&ehca_cq_idr_lock, flags); while (my_cq->nr_events) { spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); @@ -351,6 +365,25 @@ int ehca_destroy_cq(struct ib_cq *cq) idr_remove(&ehca_cq_idr, my_cq->token); spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + if (my_cq->uspace_queue && 
my_cq->ownpid != cur_pid) { + ehca_err(device, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_cq->ownpid); + return -EINVAL; + } + + /* un-mmap if vma alloc */ + if (my_cq->uspace_queue ) { + ret = ehca_munmap(my_cq->uspace_queue, + my_cq->ipz_queue.queue_length); + if (ret) + ehca_err(device, "Could not munmap queue ehca_cq=%p " + "cq_num=%x", my_cq, cq_num); + ret = ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE); + if (ret) + ehca_err(device, "Could not munmap fwh ehca_cq=%p " + "cq_num=%x", my_cq, cq_num); + } + h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0); if (h_ret == H_R_STATE) { /* cq in err: read err data and destroy it forcibly */ @@ -379,7 +412,7 @@ int ehca_resize_cq(struct ib_cq *cq, int struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); u32 cur_pid = current->tgid; - if (cq->uobject && my_cq->ownpid != cur_pid) { + if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) { ehca_err(cq->device, "Invalid caller pid=%x ownpid=%x", cur_pid, my_cq->ownpid); return -EINVAL; diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_iverbs.h ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_iverbs.h --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_iverbs.h 2007-05-04 10:38:23.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_iverbs.h 2007-04-29 15:10:56.000000000 +0200 @@ -171,11 +171,19 @@ int ehca_mmap(struct ib_ucontext *contex void ehca_poll_eqs(unsigned long data); +int ehca_mmap_nopage(u64 foffset,u64 length,void **mapped, + struct vm_area_struct **vma); + +int ehca_mmap_register(u64 physical,void **mapped, + struct vm_area_struct **vma); + +int ehca_munmap(unsigned long addr, size_t len); + #ifdef CONFIG_PPC_64K_PAGES void *ehca_alloc_fw_ctrlblock(gfp_t flags); void ehca_free_fw_ctrlblock(void *ptr); #else -#define ehca_alloc_fw_ctrlblock(flags) ((void*) get_zeroed_page(flags)) +#define ehca_alloc_fw_ctrlblock(flags) ((void *) get_zeroed_page(flags)) #define ehca_free_fw_ctrlblock(ptr) 
free_page((unsigned long)(ptr)) #endif diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-04 10:38:23.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-04 10:40:06.000000000 +0200 @@ -52,7 +52,7 @@ MODULE_LICENSE("Dual BSD/GPL"); MODULE_AUTHOR("Christoph Raisch "); MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver"); -MODULE_VERSION("SVNEHCA_0022"); +MODULE_VERSION("SVNEHCA_0019"); int ehca_open_aqp1 = 0; int ehca_debug_level = 0; @@ -293,7 +293,7 @@ int ehca_init_device(struct ehca_shca *s strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX); shca->ib_device.owner = THIS_MODULE; - shca->ib_device.uverbs_abi_ver = 6; + shca->ib_device.uverbs_abi_ver = 5; shca->ib_device.uverbs_cmd_mask = (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | @@ -357,7 +357,7 @@ int ehca_init_device(struct ehca_shca *s shca->ib_device.dealloc_fmr = ehca_dealloc_fmr; shca->ib_device.attach_mcast = ehca_attach_mcast; shca->ib_device.detach_mcast = ehca_detach_mcast; - /* shca->ib_device.process_mad = ehca_process_mad; */ + /* shca->ib_device.process_mad = ehca_process_mad; */ shca->ib_device.mmap = ehca_mmap; return ret; @@ -811,7 +811,7 @@ int __init ehca_module_init(void) int ret; printk(KERN_INFO "eHCA Infiniband Device Driver " - "(Rel.: SVNEHCA_0022)\n"); + "(Rel.: SVNEHCA_0019)\n"); idr_init(&ehca_qp_idr); idr_init(&ehca_cq_idr); spin_lock_init(&ehca_qp_idr_lock); diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c 2007-05-04 10:38:23.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c 2007-04-29 15:10:56.000000000 +0200 @@ -637,6 +637,7 @@ struct ib_qp *ehca_create_qp(struct ib_p struct ipz_queue 
*ipz_rqueue = &my_qp->ipz_rqueue; struct ipz_queue *ipz_squeue = &my_qp->ipz_squeue; struct ehca_create_qp_resp resp; + struct vm_area_struct * vma; memset(&resp, 0, sizeof(resp)); resp.qp_num = my_qp->real_qp_num; @@ -650,21 +651,59 @@ struct ib_qp *ehca_create_qp(struct ib_p resp.ipz_rqueue.queue_length = ipz_rqueue->queue_length; resp.ipz_rqueue.pagesize = ipz_rqueue->pagesize; resp.ipz_rqueue.toggle_state = ipz_rqueue->toggle_state; + ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x22000000, + ipz_rqueue->queue_length, + (void**)&resp.ipz_rqueue.queue, + &vma); + if (ret) { + ehca_err(pd->device, "Could not mmap rqueue pages"); + goto create_qp_exit3; + } + my_qp->uspace_rqueue = resp.ipz_rqueue.queue; /* squeue properties */ resp.ipz_squeue.qe_size = ipz_squeue->qe_size; resp.ipz_squeue.act_nr_of_sg = ipz_squeue->act_nr_of_sg; resp.ipz_squeue.queue_length = ipz_squeue->queue_length; resp.ipz_squeue.pagesize = ipz_squeue->pagesize; resp.ipz_squeue.toggle_state = ipz_squeue->toggle_state; + ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x23000000, + ipz_squeue->queue_length, + (void**)&resp.ipz_squeue.queue, + &vma); + if (ret) { + ehca_err(pd->device, "Could not mmap squeue pages"); + goto create_qp_exit4; + } + my_qp->uspace_squeue = resp.ipz_squeue.queue; + /* fw_handle */ + resp.galpas = my_qp->galpas; + ret = ehca_mmap_register(my_qp->galpas.user.fw_handle, + (void**)&resp.galpas.kernel.fw_handle, + &vma); + if (ret) { + ehca_err(pd->device, "Could not mmap fw_handle"); + goto create_qp_exit5; + } + my_qp->uspace_fwh = (u64)resp.galpas.kernel.fw_handle; + if (ib_copy_to_udata(udata, &resp, sizeof resp)) { ehca_err(pd->device, "Copy to udata failed"); ret = -EINVAL; - goto create_qp_exit3; + goto create_qp_exit6; } } return &my_qp->ib_qp; +create_qp_exit6: + ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE); + +create_qp_exit5: + ehca_munmap(my_qp->uspace_squeue, my_qp->ipz_squeue.queue_length); + +create_qp_exit4: + 
ehca_munmap(my_qp->uspace_rqueue, my_qp->ipz_rqueue.queue_length); + create_qp_exit3: ipz_queue_dtor(&my_qp->ipz_rqueue); ipz_queue_dtor(&my_qp->ipz_squeue); @@ -892,7 +931,7 @@ static int internal_modify_qp(struct ib_ my_qp->qp_type == IB_QPT_SMI) && statetrans == IB_QPST_SQE2RTS) { /* mark next free wqe if kernel */ - if (!ibqp->uobject) { + if (my_qp->uspace_squeue == 0) { struct ehca_wqe *wqe; /* lock send queue */ spin_lock_irqsave(&my_qp->spinlock_s, spl_flags); @@ -1378,18 +1417,11 @@ int ehca_destroy_qp(struct ib_qp *ibqp) enum ib_qp_type qp_type; unsigned long flags; - if (ibqp->uobject) { - if (my_qp->mm_count_galpa || - my_qp->mm_count_rqueue || my_qp->mm_count_squeue) { - ehca_err(ibqp->device, "Resources still referenced in " - "user space qp_num=%x", ibqp->qp_num); - return -EINVAL; - } - if (my_pd->ownpid != cur_pid) { - ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x", - cur_pid, my_pd->ownpid); - return -EINVAL; - } + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + my_pd->ownpid != cur_pid) { + ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; } if (my_qp->send_cq) { @@ -1407,6 +1439,24 @@ int ehca_destroy_qp(struct ib_qp *ibqp) idr_remove(&ehca_qp_idr, my_qp->token); spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + /* un-mmap if vma alloc */ + if (my_qp->uspace_rqueue) { + ret = ehca_munmap(my_qp->uspace_rqueue, + my_qp->ipz_rqueue.queue_length); + if (ret) + ehca_err(ibqp->device, "Could not munmap rqueue " + "qp_num=%x", qp_num); + ret = ehca_munmap(my_qp->uspace_squeue, + my_qp->ipz_squeue.queue_length); + if (ret) + ehca_err(ibqp->device, "Could not munmap squeue " + "qp_num=%x", qp_num); + ret = ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE); + if (ret) + ehca_err(ibqp->device, "Could not munmap fwh qp_num=%x", + qp_num); + } + h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); if (h_ret != H_SUCCESS) { ehca_err(ibqp->device, "hipz_h_destroy_qp() failed rc=%lx 
" diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_uverbs.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_uverbs.c --- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_uverbs.c 2007-05-04 10:38:23.000000000 +0200 +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_uverbs.c 2007-04-29 15:10:56.000000000 +0200 @@ -68,183 +68,105 @@ int ehca_dealloc_ucontext(struct ib_ucon return 0; } -static void ehca_mm_open(struct vm_area_struct *vma) +struct page *ehca_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) { - u32 *count = (u32*)vma->vm_private_data; - if (!count) { - ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx", - vma->vm_start, vma->vm_end); - return; - } - (*count)++; - if (!(*count)) - ehca_gen_err("Use count overflow vm_start=%lx vm_end=%lx", - vma->vm_start, vma->vm_end); - ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x", - vma->vm_start, vma->vm_end, *count); -} - -static void ehca_mm_close(struct vm_area_struct *vma) -{ - u32 *count = (u32*)vma->vm_private_data; - if (!count) { - ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx", - vma->vm_start, vma->vm_end); - return; - } - (*count)--; - ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x", - vma->vm_start, vma->vm_end, *count); -} - -static struct vm_operations_struct vm_ops = { - .open = ehca_mm_open, - .close = ehca_mm_close, -}; - -static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas, - u32 *mm_count) -{ - int ret; - u64 vsize, physical; - - vsize = vma->vm_end - vma->vm_start; - if (vsize != EHCA_PAGESIZE) { - ehca_gen_err("invalid vsize=%lx", vma->vm_end - vma->vm_start); - return -EINVAL; - } - - physical = galpas->user.fw_handle; - vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); - ehca_gen_dbg("vsize=%lx physical=%lx", vsize, physical); - /* VM_IO | VM_RESERVED are set by remap_pfn_range() */ - ret = remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT, - vsize, vma->vm_page_prot); - if (unlikely(ret)) { 
- ehca_gen_err("remap_pfn_range() failed ret=%x", ret); - return -ENOMEM; - } - - vma->vm_private_data = mm_count; - (*mm_count)++; - vma->vm_ops = &vm_ops; - - return 0; -} + struct page *mypage = NULL; + u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; + u32 idr_handle = fileoffset >> 32; + u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... */ + u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + u32 cur_pid = current->tgid; + unsigned long flags; + struct ehca_cq *cq; + struct ehca_qp *qp; + struct ehca_pd *pd; + u64 offset; + void *vaddr; -static int ehca_mmap_queue(struct vm_area_struct *vma, struct ipz_queue *queue, - u32 *mm_count) -{ - int ret; - u64 start, ofs; - struct page *page; + switch (q_type) { + case 1: /* CQ */ + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + cq = idr_find(&ehca_cq_idr, idr_handle); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); - vma->vm_flags |= VM_RESERVED; - start = vma->vm_start; - for (ofs = 0; ofs < queue->queue_length; ofs += PAGE_SIZE) { - u64 virt_addr = (u64)ipz_qeit_calc(queue, ofs); - page = virt_to_page(virt_addr); - ret = vm_insert_page(vma, start, page); - if (unlikely(ret)) { - ehca_gen_err("vm_insert_page() failed rc=%x", ret); - return ret; + /* make sure this mmap really belongs to the authorized user */ + if (!cq) { + ehca_gen_err("cq is NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; } - start += PAGE_SIZE; - } - vma->vm_private_data = mm_count; - (*mm_count)++; - vma->vm_ops = &vm_ops; - return 0; -} - -static int ehca_mmap_cq(struct vm_area_struct *vma, struct ehca_cq *cq, - u32 rsrc_type) -{ - int ret; - - switch (rsrc_type) { - case 1: /* galpa fw handle */ - ehca_dbg(cq->ib_cq.device, "cq_num=%x fw", cq->cq_number); - ret = ehca_mmap_fw(vma, &cq->galpas, &cq->mm_count_galpa); - if (unlikely(ret)) { + if (cq->ownpid != cur_pid) { ehca_err(cq->ib_cq.device, - "ehca_mmap_fw() failed rc=%x cq_num=%x", - ret, cq->cq_number); - return ret; + "Invalid caller pid=%x ownpid=%x", + cur_pid, 
cq->ownpid); + return NOPAGE_SIGBUS; } - break; - case 2: /* cq queue_addr */ - ehca_dbg(cq->ib_cq.device, "cq_num=%x queue", cq->cq_number); - ret = ehca_mmap_queue(vma, &cq->ipz_queue, &cq->mm_count_queue); - if (unlikely(ret)) { - ehca_err(cq->ib_cq.device, - "ehca_mmap_queue() failed rc=%x cq_num=%x", - ret, cq->cq_number); - return ret; + if (rsrc_type == 2) { + ehca_dbg(cq->ib_cq.device, "cq=%p cq queuearea", cq); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&cq->ipz_queue, offset); + ehca_dbg(cq->ib_cq.device, "offset=%lx vaddr=%p", + offset, vaddr); + mypage = virt_to_page(vaddr); } break; - default: - ehca_err(cq->ib_cq.device, "bad resource type=%x cq_num=%x", - rsrc_type, cq->cq_number); - return -EINVAL; - } - - return 0; -} - -static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp, - u32 rsrc_type) -{ - int ret; + case 2: /* QP */ + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + qp = idr_find(&ehca_qp_idr, idr_handle); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); - switch (rsrc_type) { - case 1: /* galpa fw handle */ - ehca_dbg(qp->ib_qp.device, "qp_num=%x fw", qp->ib_qp.qp_num); - ret = ehca_mmap_fw(vma, &qp->galpas, &qp->mm_count_galpa); - if (unlikely(ret)) { - ehca_err(qp->ib_qp.device, - "remap_pfn_range() failed ret=%x qp_num=%x", - ret, qp->ib_qp.qp_num); - return -ENOMEM; + /* make sure this mmap really belongs to the authorized user */ + if (!qp) { + ehca_gen_err("qp is NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; } - break; - case 2: /* qp rqueue_addr */ - ehca_dbg(qp->ib_qp.device, "qp_num=%x rqueue", - qp->ib_qp.qp_num); - ret = ehca_mmap_queue(vma, &qp->ipz_rqueue, &qp->mm_count_rqueue); - if (unlikely(ret)) { + pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (pd->ownpid != cur_pid) { ehca_err(qp->ib_qp.device, - "ehca_mmap_queue(rq) failed rc=%x qp_num=%x", - ret, qp->ib_qp.qp_num); - return ret; + "Invalid caller pid=%x ownpid=%x", + cur_pid, pd->ownpid); + return NOPAGE_SIGBUS; } - 
break; - case 3: /* qp squeue_addr */ - ehca_dbg(qp->ib_qp.device, "qp_num=%x squeue", - qp->ib_qp.qp_num); - ret = ehca_mmap_queue(vma, &qp->ipz_squeue, &qp->mm_count_squeue); - if (unlikely(ret)) { - ehca_err(qp->ib_qp.device, - "ehca_mmap_queue(sq) failed rc=%x qp_num=%x", - ret, qp->ib_qp.qp_num); - return ret; + if (rsrc_type == 2) { /* rqueue */ + ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueuearea", qp); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&qp->ipz_rqueue, offset); + ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p", + offset, vaddr); + mypage = virt_to_page(vaddr); + } else if (rsrc_type == 3) { /* squeue */ + ehca_dbg(qp->ib_qp.device, "qp=%p qp squeuearea", qp); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&qp->ipz_squeue, offset); + ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p", + offset, vaddr); + mypage = virt_to_page(vaddr); } break; default: - ehca_err(qp->ib_qp.device, "bad resource type=%x qp=num=%x", - rsrc_type, qp->ib_qp.qp_num); - return -EINVAL; + ehca_gen_err("bad queue type %x", q_type); + return NOPAGE_SIGBUS; } - return 0; + if (!mypage) { + ehca_gen_err("Invalid page adr==NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + get_page(mypage); + + return mypage; } +static struct vm_operations_struct ehcau_vm_ops = { + .nopage = ehca_nopage, +}; + int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) { u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; @@ -253,6 +175,7 @@ int ehca_mmap(struct ib_ucontext *contex u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ u32 cur_pid = current->tgid; u32 ret; + u64 vsize, physical; unsigned long flags; struct ehca_cq *cq; struct ehca_qp *qp; @@ -278,12 +201,44 @@ int ehca_mmap(struct ib_ucontext *contex if (!cq->ib_cq.uobject || cq->ib_cq.uobject->context != context) return -EINVAL; - ret = ehca_mmap_cq(vma, cq, rsrc_type); - if (unlikely(ret)) { - ehca_err(cq->ib_cq.device, - "ehca_mmap_cq() failed rc=%x cq_num=%x", - ret, 
cq->cq_number); - return ret; + switch (rsrc_type) { + case 1: /* galpa fw handle */ + ehca_dbg(cq->ib_cq.device, "cq=%p cq triggerarea", cq); + vma->vm_flags |= VM_RESERVED; + vsize = vma->vm_end - vma->vm_start; + if (vsize != EHCA_PAGESIZE) { + ehca_err(cq->ib_cq.device, "invalid vsize=%lx", + vma->vm_end - vma->vm_start); + return -EINVAL; + } + + physical = cq->galpas.user.fw_handle; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_IO | VM_RESERVED; + + ehca_dbg(cq->ib_cq.device, + "vsize=%lx physical=%lx", vsize, physical); + ret = remap_pfn_range(vma, vma->vm_start, + physical >> PAGE_SHIFT, vsize, + vma->vm_page_prot); + if (ret) { + ehca_err(cq->ib_cq.device, + "remap_pfn_range() failed ret=%x", + ret); + return -ENOMEM; + } + break; + + case 2: /* cq queue_addr */ + ehca_dbg(cq->ib_cq.device, "cq=%p cq q_addr", cq); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + break; + + default: + ehca_err(cq->ib_cq.device, "bad resource type %x", + rsrc_type); + return -EINVAL; } break; @@ -307,12 +262,50 @@ int ehca_mmap(struct ib_ucontext *contex if (!qp->ib_qp.uobject || qp->ib_qp.uobject->context != context) return -EINVAL; - ret = ehca_mmap_qp(vma, qp, rsrc_type); - if (unlikely(ret)) { - ehca_err(qp->ib_qp.device, - "ehca_mmap_qp() failed rc=%x qp_num=%x", - ret, qp->ib_qp.qp_num); - return ret; + switch (rsrc_type) { + case 1: /* galpa fw handle */ + ehca_dbg(qp->ib_qp.device, "qp=%p qp triggerarea", qp); + vma->vm_flags |= VM_RESERVED; + vsize = vma->vm_end - vma->vm_start; + if (vsize != EHCA_PAGESIZE) { + ehca_err(qp->ib_qp.device, "invalid vsize=%lx", + vma->vm_end - vma->vm_start); + return -EINVAL; + } + + physical = qp->galpas.user.fw_handle; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_IO | VM_RESERVED; + + ehca_dbg(qp->ib_qp.device, "vsize=%lx physical=%lx", + vsize, physical); + ret = remap_pfn_range(vma, vma->vm_start, + physical >> PAGE_SHIFT, vsize, + 
vma->vm_page_prot); + if (ret) { + ehca_err(qp->ib_qp.device, + "remap_pfn_range() failed ret=%x", + ret); + return -ENOMEM; + } + break; + + case 2: /* qp rqueue_addr */ + ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueue_addr", qp); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + break; + + case 3: /* qp squeue_addr */ + ehca_dbg(qp->ib_qp.device, "qp=%p qp squeue_addr", qp); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + break; + + default: + ehca_err(qp->ib_qp.device, "bad resource type %x", + rsrc_type); + return -EINVAL; } break; @@ -323,3 +316,77 @@ int ehca_mmap(struct ib_ucontext *contex return 0; } + +int ehca_mmap_nopage(u64 foffset, u64 length, void **mapped, + struct vm_area_struct **vma) +{ + down_write(&current->mm->mmap_sem); + *mapped = (void*)do_mmap(NULL,0, length, PROT_WRITE, + MAP_SHARED | MAP_ANONYMOUS, + foffset); + up_write(&current->mm->mmap_sem); + if (!(*mapped)) { + ehca_gen_err("couldn't mmap foffset=%lx length=%lx", + foffset, length); + return -EINVAL; + } + + *vma = find_vma(current->mm, (u64)*mapped); + if (!(*vma)) { + down_write(&current->mm->mmap_sem); + do_munmap(current->mm, 0, length); + up_write(&current->mm->mmap_sem); + ehca_gen_err("couldn't find vma queue=%p", *mapped); + return -EINVAL; + } + (*vma)->vm_flags |= VM_RESERVED; + (*vma)->vm_ops = &ehcau_vm_ops; + + return 0; +} + +int ehca_mmap_register(u64 physical, void **mapped, + struct vm_area_struct **vma) +{ + int ret; + unsigned long vsize; + /* ehca hw supports only 4k page */ + ret = ehca_mmap_nopage(0, EHCA_PAGESIZE, mapped, vma); + if (ret) { + ehca_gen_err("could'nt mmap physical=%lx", physical); + return ret; + } + + (*vma)->vm_flags |= VM_RESERVED; + vsize = (*vma)->vm_end - (*vma)->vm_start; + if (vsize != EHCA_PAGESIZE) { + ehca_gen_err("invalid vsize=%lx", + (*vma)->vm_end - (*vma)->vm_start); + return -EINVAL; + } + + (*vma)->vm_page_prot = pgprot_noncached((*vma)->vm_page_prot); + (*vma)->vm_flags |= VM_IO | VM_RESERVED; + + ret =
remap_pfn_range((*vma), (*vma)->vm_start, + physical >> PAGE_SHIFT, vsize, + (*vma)->vm_page_prot); + if (ret) { + ehca_gen_err("remap_pfn_range() failed ret=%x", ret); + return -ENOMEM; + } + + return 0; + +} + +int ehca_munmap(unsigned long addr, size_t len) { + int ret = 0; + struct mm_struct *mm = current->mm; + if (mm) { + down_write(&mm->mmap_sem); + ret = do_munmap(mm, addr, len); + up_write(&mm->mmap_sem); + } + return ret; +} From mst at dev.mellanox.co.il Thu May 10 05:49:29 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 May 2007 15:49:29 +0300 Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 1/3] ehca: backport for rhel-4.5 - hvcall.h In-Reply-To: <200705101441.44286.ossrosch@linux.vnet.ibm.com> References: <200705101441.44286.ossrosch@linux.vnet.ibm.com> Message-ID: <20070510124929.GA22029@mellanox.co.il> > Quoting Stefan Roscher : > Subject: [PATCH ofed-1.2-rc3 1/3] ehca: backport for rhel-4.5 - hvcall.h > > use kmem_cache_t instead of struct kmem_cache and update hvcall.h > > > > Signed-off-by: Stefan Roscher > --- Format's wrong here: > drivers/infiniband/hw/ehca/ehca_av.c | 2 > drivers/infiniband/hw/ehca/ehca_cq.c | 2 > drivers/infiniband/hw/ehca/ehca_main.c | 2 > drivers/infiniband/hw/ehca/ehca_mrmw.c | 4 > drivers/infiniband/hw/ehca/ehca_pd.c | 2 > drivers/infiniband/hw/ehca/ehca_qp.c | 2 These should be a patch in kernel_patches/backports/2.6.9_U5. > kernel_addons/backport/2.6.9_U5/include/asm-powerpc/system.h | 1 > kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h | 167 +++++++++++ And this part creates files under include/asm/hvcall.h so can be applied directly. 
--
MST

From yosefe at voltaire.com Thu May 10 05:52:34 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 15:52:34 +0300
Subject: [ofa-general] Re: [PATCHv4 2/2] ipoib: handle pkey change events
In-Reply-To: <20070510123855.GL13655@mellanox.co.il>
References: <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> <20070509174138.GB17734@mellanox.co.il> <46430167.3010106@voltaire.com> <20070510120144.GF13655@mellanox.co.il> <46430F46.1080002@voltaire.com> <20070510123855.GL13655@mellanox.co.il>
Message-ID: <46431592.6080401@voltaire.com>

>
> Return some error code -ENXIO.
>
All other branches in this function return -1 (see next hunk)
Anyway, let it be -ENXIO.

--

This issue was found during partitioning & SM failover testing.

* Added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity
* Obtain pkey index prior to entering init_qp, and save it in dev_priv
* Upon PKEY_CHANGE event, schedule a work that restarts the qp.
* Precondition the restart on whether the pkey index is really changed.
  Use the cached pkey_index to test this.
* Restart child interfaces before parent. They might be up even if the
  parent is down.
* When interface is restarted, queue delayed initialization, to handle
  the case that a pkey is deleted and later restored.
* Use uncached pkey query upon qp initialization

SM reconfiguration or failover possibly causes a shuffling of the values
in the port pkey table. The current implementation only queries for the
index of the pkey once, when it creates the device QP and after that moves
it into working state, and hence does not address this scenario. Fix this
by using the PKEY_CHANGE event as a trigger to reconfigure the device QP.
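
The decision described above (look the pkey up again, compare against the cached index, restart only on a real change) can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the patch itself: `pkey_state`, `pkey_action`, `find_pkey` and `handle_pkey_change` are hypothetical names that do not exist in the kernel, the P_Key table is passed in as an ordinary array, and the lookup matches the full 16-bit value for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of handling a P_Key change; not a kernel type. */
enum pkey_action {
	PKEY_ACTION_NONE,    /* index unchanged: leave the QP alone */
	PKEY_ACTION_RESTART, /* index moved: the QP has to be reinitialized */
	PKEY_ACTION_WAIT     /* pkey missing: go down and poll until it returns */
};

/* Hypothetical stand-in for the per-device state the patch caches. */
struct pkey_state {
	uint16_t pkey;       /* partition key the interface belongs to */
	uint16_t pkey_index; /* table index resolved when the QP was set up */
	int assigned;        /* plays the role of the IPOIB_PKEY_ASSIGNED flag */
};

/*
 * Linear search of a port P_Key table (assumption: the table is a plain
 * array here). Returns 0 and stores the index on success, -1 if absent.
 */
static int find_pkey(const uint16_t *table, size_t len, uint16_t pkey,
		     uint16_t *index)
{
	for (size_t i = 0; i < len; i++) {
		if (table[i] == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -1;
}

/* Decide what a P_Key change event requires for one interface. */
static enum pkey_action handle_pkey_change(struct pkey_state *st,
					   const uint16_t *table, size_t len)
{
	uint16_t new_index;

	assert(st != NULL);

	if (find_pkey(table, len, st->pkey, &new_index)) {
		/* pkey was removed from the table */
		st->assigned = 0;
		return PKEY_ACTION_WAIT;
	}
	st->assigned = 1;

	/* restart only if the index really changed */
	if (new_index == st->pkey_index)
		return PKEY_ACTION_NONE;

	st->pkey_index = new_index;
	return PKEY_ACTION_RESTART;
}
```

A caller would run `handle_pkey_change()` once per interface when the event arrives and act on the returned value; an event that leaves the table order intact costs nothing beyond the lookup.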
Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 7 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 88 +++++++++++++++++++++++------ drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 26 ++------ 4 files changed, 89 insertions(+), 39 deletions(-) Index: b/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-10 08:34:58.335171047 +0300 @@ -202,15 +202,17 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; + struct delayed_work pkey_poll_task; struct delayed_work mcast_task; struct work_struct flush_task; struct work_struct restart_task; struct delayed_work ah_reap_task; + struct work_struct pkey_event_task; struct ib_device *ca; u8 port; u16 pkey; + u16 pkey_index; struct ib_pd *pd; struct ib_mr *mr; struct ib_cq *cq; @@ -333,12 +335,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-10 15:50:29.315183358 +0300 
@@ -413,6 +413,18 @@ int ipoib_ib_dev_open(struct net_device struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. + */ + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) { + ipoib_warn(priv, "pkey 0x%04x nof found\n", priv->pkey); + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return -ENXIO; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = ipoib_init_qp(dev); if (ret) { ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret); @@ -422,14 +434,14 @@ int ipoib_ib_dev_open(struct net_device ret = ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -481,7 +493,7 @@ int ipoib_ib_dev_down(struct net_device if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { mutex_lock(&pkey_mutex); set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); + cancel_delayed_work(&priv->pkey_poll_task); mutex_unlock(&pkey_mutex); if (flush) flush_workqueue(ipoib_workqueue); @@ -508,7 +520,7 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +593,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,13 +635,22 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct 
*work) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; + u16 new_index; + + mutex_lock(&priv->vlan_mutex); + + /* Flush any child interfaces too - + * they might be up even if the parent is down */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, pkey_event); + + mutex_unlock(&priv->vlan_mutex); - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); return; } @@ -638,10 +660,32 @@ void ipoib_ib_dev_flush(struct work_stru return; } + if (pkey_event) { + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ipoib_ib_dev_down(dev, 0); + ipoib_pkey_dev_delay_open(dev); + return; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* restart qp only of pkey index is cahnged */ + if (new_index == priv->pkey_index) { + ipoib_dbg(priv, "Not flushing - pkey index not changed.\n"); + return; + } + priv->pkey_index = new_index; + } + ipoib_dbg(priv, "flushing\n"); ipoib_ib_dev_down(dev, 0); + if (pkey_event) { + ipoib_ib_dev_stop(dev, 0); + ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +694,24 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task); - /* Flush any child interfaces too */ - list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); + 
ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - mutex_unlock(&priv->vlan_mutex); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_event_task); + + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void ipoib_ib_dev_cleanup(struct net_device *dev) @@ -685,7 +739,7 @@ void ipoib_ib_dev_cleanup(struct net_dev void ipoib_pkey_poll(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); + container_of(work, struct ipoib_dev_priv, pkey_poll_task.work); struct net_device *dev = priv->dev; ipoib_pkey_dev_check_presence(dev); @@ -696,7 +750,7 @@ void ipoib_pkey_poll(struct work_struct mutex_lock(&pkey_mutex); if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); } @@ -715,7 +769,7 @@ int ipoib_pkey_dev_delay_open(struct net mutex_lock(&pkey_mutex); clear_bit(IPOIB_PKEY_STOP, &priv->flags); queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); return 1; Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-09 17:21:03.000000000 +0300 @@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev) return -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ 
-990,7 +990,8 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-10 09:13:28.997127223 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ -#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ret = -ENXIO; goto out; @@ -94,26 +92,17 @@ int ipoib_init_qp(struct net_device *dev { struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; - u16 pkey_index; struct ib_qp_attr qp_attr; int attr_mask; - /* - * Search through the port P_Key table for the requested pkey value. - * The port has to be assigned to the respective IB partition in - * advance. 
- */ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); - if (ret) { - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); - return ret; - } - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + /* Make sure we have a valid pkey_index in priv->pkey_index */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + return -1; qp_attr.qp_state = IB_QPS_INIT; qp_attr.qkey = 0; qp_attr.port_num = priv->port; - qp_attr.pkey_index = pkey_index; + qp_attr.pkey_index = priv->pkey_index; attr_mask = IB_QP_QKEY | IB_QP_PORT | @@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE || @@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler record->element.port_num == priv->port) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE && + record->element.port_num == priv->port) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_event_task); } } From mst at dev.mellanox.co.il Thu May 10 06:01:23 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin)
Date: Thu, 10 May 2007 16:01:23 +0300
Subject: [ofa-general] Re: [PATCHv4 2/2] ipoib: handle pkey change events
In-Reply-To: <46431592.6080401@voltaire.com>
References: <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> <20070509174138.GB17734@mellanox.co.il> <46430167.3010106@voltaire.com> <20070510120144.GF13655@mellanox.co.il> <46430F46.1080002@voltaire.com> <20070510123855.GL13655@mellanox.co.il> <46431592.6080401@voltaire.com>
Message-ID: <20070510130123.GC22029@mellanox.co.il>

> Quoting Yosef Etigin :
> Subject: Re: [PATCHv4 2/2] ipoib: handle pkey change events
>
> >
> > Return some error code -ENXIO.
> >
> All other branches in this function return -1 (see next hunk)

Oh. Right. I haven't looked.

> Anyway, let it be -ENXIO.

Up to you really, I take it back.

> @@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler
>  container_of(handler, struct ipoib_dev_priv, event_handler);
>
>  if ((record->event == IB_EVENT_PORT_ERR ||
> - record->event == IB_EVENT_PKEY_CHANGE ||
>  record->event == IB_EVENT_PORT_ACTIVE ||
>  record->event == IB_EVENT_LID_CHANGE ||
>  record->event == IB_EVENT_SM_CHANGE ||
> @@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler
>  record->element.port_num == priv->port) {
>  ipoib_dbg(priv, "Port state change event\n");
>  queue_work(ipoib_workqueue, &priv->flush_task);
> + } else if (record->event == IB_EVENT_PKEY_CHANGE &&
> + record->element.port_num == priv->port) {
> + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
> + queue_work(ipoib_workqueue, &priv->pkey_event_task);
>  }
>  }

What do you think about my idea to do
	if (record->element.port_num != priv->port)
		return
at the top?

Anyway, I think you've addressed all real issues - could you pls
post final version of both patches for OFED and 2.6.22?
--
MST

From ogerlitz at voltaire.com Thu May 10 06:02:01 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 10 May 2007 16:02:01 +0300
Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
In-Reply-To:
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> <46430EB2.7080703@voltaire.com>
Message-ID: <464317C9.3020601@voltaire.com>

Jeff Squyres wrote:
> On May 10, 2007, at 8:23 AM, Or Gerlitz wrote:
>
>> A different approach which you might want to consider is to have at
>> the btl level --two-- connections per ranks. so if A wants
>> to send B it does so through the A --> B connection and if B wants to
>> send A it does so through the B --> A connection. To some extent, this
>> is the approach taken by IPoIB-CM (I am not enough into the RFC to
>> understand the reasoning but I am quite sure this was the approach in
>> the initial implementation). At first thought it might seem not very
>> elegant, but taking it into the details (projected on the ompi env)
>> you might find it even nice.
>
> What is the advantage of this approach?

To start with, my hope here is at least to be able to play defensive
here, that is convince you that the disadvantages are minor, where
only if this fails, would schedule myself some reading into the
ipoib-cm rfc to dig the advantages.

Or.

From mst at dev.mellanox.co.il Thu May 10 06:04:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 16:04:00 +0300
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 3/3] ehca: backport for rhel-4.5 - use introduced dma_ops
In-Reply-To: <200705101441.58102.ossrosch@linux.vnet.ibm.com>
References: <200705101441.58102.ossrosch@linux.vnet.ibm.com>
Message-ID: <20070510130400.GD22029@mellanox.co.il>

> Quoting Stefan Roscher :
> Subject: [PATCH ofed-1.2-rc3 3/3] ehca: backport for rhel-4.5 - use introduced dma_ops
>
> use introduced dma_ops
>
>
> Signed-off-by: Stefan Roscher
> ---
>
>
> Makefile | 2
> ehca_dma.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ehca_main.c | 2
> 3 files changed, 197 insertions(+), 1 deletion(-)
>
>
>
> diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_dma.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_dma.c
> --- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_dma.c 1970-01-01 01:00:00.000000000 +0100
> +++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_dma.c 2007-05-03 16:25:30.000000000 +0200

These patches belong in kernel_patches/backports.
So please post as such.

--
MST

From yosefe at voltaire.com Thu May 10 06:11:35 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 16:11:35 +0300
Subject: [ofa-general] Re: [PATCHv4 2/2] ipoib: handle pkey change events
In-Reply-To: <20070510130123.GC22029@mellanox.co.il>
References: <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> <20070509174138.GB17734@mellanox.co.il> <46430167.3010106@voltaire.com> <20070510120144.GF13655@mellanox.co.il> <46430F46.1080002@voltaire.com> <20070510123855.GL13655@mellanox.co.il> <46431592.6080401@voltaire.com> <20070510130123.GC22029@mellanox.co.il>
Message-ID: <46431A07.4080205@voltaire.com>

Michael S. Tsirkin wrote:
>>Quoting Yosef Etigin :
>>Subject: Re: [PATCHv4 2/2] ipoib: handle pkey change events
>>
>>
>>>Return some error code -ENXIO.
>>>
>>
>>All other branches in this function return -1 (see next hunk)
>
>
> Oh. Right. I haven't looked.
>
>
>>Anyway, let it be -ENXIO.
>
>
> Up to you really, I take it back.
>
>

I'd leave it -1, for consistency.

>>@@ -260,7 +249,6 @@ void ipoib_event(struct ib_event_handler
>> container_of(handler, struct ipoib_dev_priv, event_handler);
>>
>> if ((record->event == IB_EVENT_PORT_ERR ||
>>- record->event == IB_EVENT_PKEY_CHANGE ||
>> record->event == IB_EVENT_PORT_ACTIVE ||
>> record->event == IB_EVENT_LID_CHANGE ||
>> record->event == IB_EVENT_SM_CHANGE ||
>>@@ -268,5 +256,9 @@ void ipoib_event(struct ib_event_handler
>> record->element.port_num == priv->port) {
>> ipoib_dbg(priv, "Port state change event\n");
>> queue_work(ipoib_workqueue, &priv->flush_task);
>>+ } else if (record->event == IB_EVENT_PKEY_CHANGE &&
>>+ record->element.port_num == priv->port) {
>>+ ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port);
>>+ queue_work(ipoib_workqueue, &priv->pkey_event_task);
>> }
>> }
>
>
> What do you think about my idea to do
> 	if (record->element.port_num != priv->port)
> 		return
> at the top?
>
> Anyway, I think you've addressed all real issues - could you pls
> post final version of both patches for OFED and 2.6.22?
>

What should be the difference between the patches for OFED and for 2.6.22?

From jsquyres at cisco.com Thu May 10 06:11:38 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 10 May 2007 09:11:38 -0400
Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
In-Reply-To: <464317C9.3020601@voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> <46430EB2.7080703@voltaire.com> <464317C9.3020601@voltaire.com>
Message-ID:

On May 10, 2007, at 9:02 AM, Or Gerlitz wrote:

>>> A different approach which you might want to consider is to have
>>> at the btl level --two-- connections per ranks. so if A
>>> wants to send B it does so through the A --> B connection and if
>>> B wants to send A it does so through the B --> A connection. To
>>> some extent, this is the approach taken by IPoIB-CM (I am not
>>> enough into the RFC to understand the reasoning but I am quite
>>> sure this was the approach in the initial implementation). At
>>> first thought it might seem not very elegant, but taking it
>>> into the details (projected on the ompi env) you might find it
>>> even nice.
>> What is the advantage of this approach?
>
> To start with, my hope here is at least to be able to play defensive
> here, that is convince you that the disadvantages are minor, where
> only if this fails, would schedule myself some reading into the
> ipoib-cm rfc to dig the advantages.

I ask about the advantages because OMPI currently treats QP's as
bi-directional. Having OMPI treat them as unidirectional would be a
change. I'm not against such a change, but I think we'd need to be
convinced that there are good reasons to do so. For example, on the
surface, it seems like this scheme would simply consume more QPs and
potentially more registered memory (and is therefore unattractive).

--
Jeff Squyres
Cisco Systems

From ossrosch at linux.vnet.ibm.com Thu May 10 06:12:34 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Thu, 10 May 2007 15:12:34 +0200
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 1/3] ehca: backport for rhel-4.5 - hvcall.h
In-Reply-To: <20070510124929.GA22029@mellanox.co.il>
References: <200705101441.44286.ossrosch@linux.vnet.ibm.com> <20070510124929.GA22029@mellanox.co.il>
Message-ID: <200705101512.35152.ossrosch@linux.vnet.ibm.com>

On Thursday 10 May 2007 14:49, Michael S. Tsirkin wrote:
> > Quoting Stefan Roscher :
> > Subject: [PATCH ofed-1.2-rc3 1/3] ehca: backport for rhel-4.5 - hvcall.h
> >
> > use kmem_cache_t instead of struct kmem_cache and update hvcall.h
> >
> >
> >
> > Signed-off-by: Stefan Roscher
> > ---
>
> Format's wrong here:

What is wrong with the format?

>
> > drivers/infiniband/hw/ehca/ehca_av.c | 2
> > drivers/infiniband/hw/ehca/ehca_cq.c | 2
> > drivers/infiniband/hw/ehca/ehca_main.c | 2
> > drivers/infiniband/hw/ehca/ehca_mrmw.c | 4
> > drivers/infiniband/hw/ehca/ehca_pd.c | 2
> > drivers/infiniband/hw/ehca/ehca_qp.c | 2
>
> These should be a patch in kernel_patches/backports/2.6.9_U5.

Yes, you are right, the correct patches will follow.

regards
Stefan

From ogerlitz at voltaire.com Thu May 10 06:30:27 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 10 May 2007 16:30:27 +0300
Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
In-Reply-To:
References: <6C2C79E72C305246B504CBA17B5500C901563288@mtlexch01.mtl.com> <1178657765.11455.32.camel@stevo-desktop> <4640FDE9.9010000@ichips.intel.com> <1178718090.382.4.camel@stevo-desktop> <1178721476.382.18.camel@stevo-desktop> <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> <46430EB2.7080703@voltaire.com> <464317C9.3020601@voltaire.com>
Message-ID: <46431E73.70505@voltaire.com>

Jeff Squyres wrote:
> On May 10, 2007, at 9:02 AM, Or Gerlitz wrote:

>> To start with, my hope here is at least to be able to play defensive
>> here, that is convince you that the disadvantages are minor, where
>> only if this fails, would schedule myself some reading into the
>> ipoib-cm rfc to dig the advantages.

> I ask about the advantages because OMPI currently treats QP's as
> bi-directional. Having OMPI treat them as unidirectional would be a
> change. I'm not against such a change, but I think we'd need to be
> convinced that there are good reasons to do so. For example, on the
> surface, it seems like this scheme would simply consume more QPs and
> potentially more registered memory (and is therefore unattractive).

Indeed you would need two QPs per btl connection, however, for each
direction you can make the relevant QP consume ~zero resources per the
other direction, ie on side A:

for the A --> B QP : RX WR num = 0, RX SG size = 0
for the B --> A QP : TX WR num = 0, TX SG size = 0

and on side B the other way. I think that IB disallows to have zero len
WR num so you set it actually to 1. Note that since you use SRQ for
large jobs you have zero overhead for RX resources and this one TX WR
overhead for the "RX" connection on each side. This is the only memory
related overhead since you don't have to allocate any extra buffers over
what you do now.

Or.

From mst at dev.mellanox.co.il Thu May 10 06:36:14 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 May 2007 16:36:14 +0300
Subject: [ofa-general] Re: [PATCHv4 2/2] ipoib: handle pkey change events
In-Reply-To: <46431A07.4080205@voltaire.com>
References: <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> <20070509174138.GB17734@mellanox.co.il> <46430167.3010106@voltaire.com> <20070510120144.GF13655@mellanox.co.il> <46430F46.1080002@voltaire.com> <20070510123855.GL13655@mellanox.co.il> <46431592.6080401@voltaire.com> <20070510130123.GC22029@mellanox.co.il> <46431A07.4080205@voltaire.com>
Message-ID: <20070510133614.GM13655@mellanox.co.il>

> > Anyway, I think you've addressed all real issues - could you pls
> > post final version of both patches for OFED and 2.6.22?
> >
>
> What should be the difference between the patches for OFED and for 2.6.22?

I wouldn't expect any difference.

-- MST From glebn at voltaire.com Thu May 10 06:44:05 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Thu, 10 May 2007 16:44:05 +0300 Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <46431E73.70505@voltaire.com> References: <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> <46430EB2.7080703@voltaire.com> <464317C9.3020601@voltaire.com> <46431E73.70505@voltaire.com> Message-ID: <20070510134405.GH24497@minantech.com> On Thu, May 10, 2007 at 04:30:27PM +0300, Or Gerlitz wrote: > Jeff Squyres wrote: > >On May 10, 2007, at 9:02 AM, Or Gerlitz wrote: > > >>To start with, my hope here is at least to be able play defensive > >>here, that is convince you that the disadvantages are minor, where > >>only if this fails, would schedule myself some reading into the > >>ipoib-cm rfc to dig the advantages. > > >I ask about the advantages because OMPI currently treats QP's as > >bi-directional. Having OMPI treat them at unidirectional would be a > >change. I'm not against such a change, but I think we'd need to be > >convinced that there are good reasons to do so. For example, on the > >surface, it seems like this scheme would simply consume more QPs and > >potentially more registered memory (and is therefore unattractive). > > Indeed you would need two QPs per btl connection, however, for each > direction you can make the relevant QP consume ~zero resources per the > other direction, ie on side A: > > for the A --> B QP : RX WR num = 0, RX SG size = 0 > for the B --> A QP : TX WR num = 0, TX SG size = 0 > > and on side B the other way. I think that IB disallows to have zero len > WR num so you set it actually to 1. Note that since you use SRQ for > large jobs you have zero overhead for RX resources and this one TX WR > overhead for the "RX" connection on each side. 
This is the only memory > related overhead since you don't have to allocate any extra buffers over > what you do now. > QP is a limited resource and we already have 2 per connection (and many more if LMC is in use), so I don't see any reason to use this scheme only to overcome the brain-damaged design of iWarp. -- Gleb. From mst at dev.mellanox.co.il Thu May 10 06:54:03 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 May 2007 16:54:03 +0300 Subject: [ofa-general] Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <46431E73.70505@voltaire.com> References: <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> <46430EB2.7080703@voltaire.com> <464317C9.3020601@voltaire.com> <46431E73.70505@voltaire.com> Message-ID: <20070510135403.GP13655@mellanox.co.il> > I think that IB disallows to have zero len > WR num so you set it actually to 1. I don't think such limitation exists. -- MST From ogerlitz at voltaire.com Thu May 10 06:57:09 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 10 May 2007 16:57:09 +0300 Subject: [ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <20070510134405.GH24497@minantech.com> References: <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> <46430EB2.7080703@voltaire.com> <464317C9.3020601@voltaire.com> <46431E73.70505@voltaire.com> <20070510134405.GH24497@minantech.com> Message-ID: <464324B5.1080209@voltaire.com> Gleb Natapov wrote: > QP is a limited resource and we already have 2 per connection (and many > more if LMC is in use), so I don't see any reason to use this scheme only > to overcome the brain-damaged design of iWarp.
Fair enough, just note that **some** damage (which I understand is just to the extent of adding a flag somewhere) would be experienced by ompi people to support iwarp... Or. From mst at dev.mellanox.co.il Thu May 10 06:58:16 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 May 2007 16:58:16 +0300 Subject: [ofa-general] Fwd: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened Message-ID: <20070510135816.GQ13655@mellanox.co.il> Not sure who first added the open-mpi list to Cc:, but please don't do it for messages sent to openib-general in the future, since the open-mpi list is subscriber-only (see below). ----- Forwarded message from devel-owner at open-mpi.org ----- Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened From: devel-owner at open-mpi.org Date: Thu, 10 May 2007 09:54:06 -0400 You are not allowed to post to this mailing list, and your message has been automatically rejected. If you think that your messages are being rejected in error, contact the mailing list owner at devel-owner at open-mpi.org. Date: Thu, 10 May 2007 16:54:03 +0300 From: "Michael S. Tsirkin" Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened Reply-To: "Michael S. Tsirkin" References: <1178742259.382.112.camel at stevo-desktop> <46422EA6.3020006 at Sun.COM> <1178742819.382.114.camel at stevo-desktop> <464232DC.9010201 at Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD at cisco.com> <46430EB2.7080703 at voltaire.com> <464317C9.3020601 at voltaire.com> <46431E73.70505 at voltaire.com> In-Reply-To: <46431E73.70505 at voltaire.com> > I think that IB disallows to have zero len > WR num so you set it actually to 1. I don't think such limitation exists.
-- MST ----- End forwarded message ----- -- MST From swise at opengridcomputing.com Thu May 10 07:09:16 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 10 May 2007 09:09:16 -0500 Subject: [ofa-general] Fwd: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <20070510135816.GQ13655@mellanox.co.il> References: <20070510135816.GQ13655@mellanox.co.il> Message-ID: <1178806156.1519.12.camel@stevo-desktop> My fault, but the issue in question is pertinent to both lists (OMPI over ofa iwarp). But since the ompi list is a closed list, I'll refrain from doing this in the future. Steve. On Thu, 2007-05-10 at 16:58 +0300, Michael S. Tsirkin wrote: > Not sure who first added open-mpi list to Cc:, but please don't > do it for messages sent to openib-general in the future > since this is a subscriber-only list (see below). > > ----- Forwarded message from devel-owner at open-mpi.org ----- > > Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened > From: devel-owner at open-mpi.org > Date: Thu, 10 May 2007 09:54:06 -0400 > > You are not allowed to post to this mailing list, and your message has > been automatically rejected. If you think that your messages are > being rejected in error, contact the mailing list owner at > devel-owner at open-mpi.org. > > > Date: Thu, 10 May 2007 16:54:03 +0300 > From: "Michael S. Tsirkin" > Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened > Reply-To: "Michael S. Tsirkin" > References: <1178742259.382.112.camel at stevo-desktop> <46422EA6.3020006 at Sun.COM> > <1178742819.382.114.camel at stevo-desktop> <464232DC.9010201 at Sun.COM> > <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD at cisco.com> > <46430EB2.7080703 at voltaire.com> > > <464317C9.3020601 at voltaire.com> > > <46431E73.70505 at voltaire.com> > In-Reply-To: <46431E73.70505 at voltaire.com> > > > I think that IB disallows to have zero len > > WR num so you set it actually to 1.
> > I don't think such limitation exists. > > -- > MST > > > ----- End forwarded message ----- > From mst at dev.mellanox.co.il Thu May 10 07:09:33 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 May 2007 17:09:33 +0300 Subject: [ofa-general] Re: [PATCHv4 2/2] ipoib: handle pkey change events In-Reply-To: <46430F46.1080002@voltaire.com> References: <20070508162727.GD5845@mellanox.co.il> <4640A8BD.4000405@voltaire.com> <20070509093548.GA7683@mellanox.co.il> <4641AA06.1050002@voltaire.com> <20070509112626.GA10068@mellanox.co.il> <4641B63D.4010602@voltaire.com> <20070509174138.GB17734@mellanox.co.il> <46430167.3010106@voltaire.com> <20070510120144.GF13655@mellanox.co.il> <46430F46.1080002@voltaire.com> Message-ID: <20070510140932.GB26302@mellanox.co.il> > @@ -638,10 +660,32 @@ void ipoib_ib_dev_flush(struct work_stru > return; > } > > + if (pkey_event) { > + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { > + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > + ipoib_ib_dev_down(dev, 0); > + ipoib_pkey_dev_delay_open(dev); > + return; > + } > + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > + > + /* restart qp only if pkey index is changed */ > + if (new_index == priv->pkey_index) { > + ipoib_dbg(priv, "Not flushing - pkey index not changed.\n"); > + return; > + } > + priv->pkey_index = new_index; > + } > + > ipoib_dbg(priv, "flushing\n"); > Say, what if IPOIB_PKEY_ASSIGNED was cleared previously? priv->pkey_index will be wrong, won't it? -- MST From ossrosch at linux.vnet.ibm.com Thu May 10 07:26:55 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Thu, 10 May 2007 16:26:55 +0200 Subject: [ofa-general] [PATCH ofed-1.2-rc3 0/4] ehca: backport for rhel-4.5 Message-ID: <200705101626.56308.ossrosch@linux.vnet.ibm.com> Because the patches I sent before (http://lists.openfabrics.org/pipermail/general/2007-May/036125.html) were in the wrong format and did not apply to the backport directory, I am now sending the corrected patches.
Regards Stefan From ossrosch at linux.vnet.ibm.com Thu May 10 07:28:02 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Thu, 10 May 2007 16:28:02 +0200 Subject: [ofa-general] [PATCH ofed-1.2-rc3 1/4] ehca: backport for rhel-4.5 - use kmem_cache_t instead of struct kmem_cache Message-ID: <200705101628.03034.ossrosch@linux.vnet.ibm.com> Signed-off-by: Stefan Roscher --- backport_ehca_1_2.6.9.patch | 82 ++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 82 insertions(+) diff -Nurp ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_1_2.6.9.patch ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_1_2.6.9.patch --- ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_1_2.6.9.patch 1970-01-01 01:00:00.000000000 +0100 +++ ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_1_2.6.9.patch 2007-05-10 17:25:58.000000000 +0200 @@ -0,0 +1,82 @@ +diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_av.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_av.c +--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_av.c 2007-05-09 12:42:01.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_av.c 2007-05-09 12:42:34.000000000 +0200 +@@ -48,7 +48,7 @@ + #include "ehca_iverbs.h" + #include "hcp_if.h" + +-static struct kmem_cache *av_cache; ++static kmem_cache_t *av_cache; + + struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) + { +diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c +--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c 2007-05-09 12:42:01.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c 2007-05-09 12:42:34.000000000 +0200 +@@ -50,7 +50,7 @@ + #include "ehca_irq.h" + #include "hcp_if.h" + +-static struct kmem_cache *cq_cache; ++static kmem_cache_t *cq_cache; + + int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp) + { 
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c +--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-09 12:42:01.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-09 12:42:34.000000000 +0200 +@@ -465,7 +465,6 @@ void ehca_remove_driver_sysfs(struct ibm + + #define EHCA_RESOURCE_ATTR(name) \ + static ssize_t ehca_show_##name(struct device *dev, \ +- struct device_attribute *attr, \ + char *buf) \ + { \ + struct ehca_shca *shca; \ +@@ -513,7 +512,6 @@ EHCA_RESOURCE_ATTR(max_pd); + EHCA_RESOURCE_ATTR(max_ah); + + static ssize_t ehca_show_adapter_handle(struct device *dev, +- struct device_attribute *attr, + char *buf) + { + struct ehca_shca *shca = dev->driver_data; +diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_mrmw.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_mrmw.c +--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_mrmw.c 2007-05-09 12:42:01.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_mrmw.c 2007-05-09 12:42:34.000000000 +0200 +@@ -46,8 +46,8 @@ + #include "hcp_if.h" + #include "hipz_hw.h" + +-static struct kmem_cache *mr_cache; +-static struct kmem_cache *mw_cache; ++static kmem_cache_t *mr_cache; ++static kmem_cache_t *mw_cache; + + static struct ehca_mr *ehca_mr_new(void) + { +diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_pd.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_pd.c +--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_pd.c 2007-05-09 12:42:01.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_pd.c 2007-05-09 12:42:34.000000000 +0200 +@@ -43,7 +43,7 @@ + #include "ehca_tools.h" + #include "ehca_iverbs.h" + +-static struct kmem_cache *pd_cache; ++static kmem_cache_t *pd_cache; + + struct ib_pd *ehca_alloc_pd(struct ib_device *device, + struct ib_ucontext *context, struct ib_udata *udata) +diff -Nurp 
ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c +--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c 2007-05-09 12:42:01.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c 2007-05-09 12:42:34.000000000 +0200 +@@ -51,7 +51,7 @@ + #include "hcp_if.h" + #include "hipz_fns.h" + +-static struct kmem_cache *qp_cache; ++static kmem_cache_t *qp_cache; + + /* + * attributes not supported by query qp + From mst at dev.mellanox.co.il Thu May 10 07:28:26 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 May 2007 17:28:26 +0300 Subject: [ofa-general] Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened In-Reply-To: References: <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> <46430EB2.7080703@voltaire.com> Message-ID: <20070510142826.GE22029@mellanox.co.il> > Quoting Jeff Squyres : > Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened > > On May 10, 2007, at 8:23 AM, Or Gerlitz wrote: > > >A different approach which you might want to consider is to have at > >the btl level --two-- connections per ranks. so if A > >wants to send B it does so through the A --> B connection and if B > >wants to send A it does so through the B --> A connection. To some > >extent, this is the approach taken by IPoIB-CM (I am not enough > >into the RFC to understand the reasoning but i am quite sure this > >was the approach in the initial implementation). At first thought > >it might seem not very elegant, but taking it into the details > >(projected on the ompi env) you might find it even nice. > > What is the advantage of this approach? Current ipoib cm uses this approach to simplify the implementation.
Overhead seems insignificant. -- MST From ossrosch at linux.vnet.ibm.com Thu May 10 07:28:42 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Thu, 10 May 2007 16:28:42 +0200 Subject: [ofa-general] [PATCH ofed-1.2-rc3 2/4] ehca: backport for rhel-4.5 - mmap functonality Message-ID: <200705101628.43095.ossrosch@linux.vnet.ibm.com> Signed-off-by: Stefan Roscher --- backport_ehca_2_rhel45_umap.patch | 850 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 850 insertions(+) diff -Nurp ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch --- ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch 1970-01-01 01:00:00.000000000 +0100 +++ ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch 2007-05-10 17:27:33.000000000 +0200 @@ -0,0 +1,850 @@ +diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-05-04 10:38:23.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-05-04 10:40:06.000000000 +0200 +@@ -126,14 +126,13 @@ struct ehca_qp { + struct ipz_qp_handle ipz_qp_handle; + struct ehca_pfqp pf; + struct ib_qp_init_attr init_attr; ++ u64 uspace_squeue; ++ u64 uspace_rqueue; ++ u64 uspace_fwh; + struct ehca_cq *send_cq; + struct ehca_cq *recv_cq; + unsigned int sqerr_purgeflag; + struct hlist_node list_entries; +- /* mmap counter for resources mapped into user space */ +- u32 mm_count_squeue; +- u32 mm_count_rqueue; +- u32 mm_count_galpa; + }; + + /* must be power of 2 */ +@@ -150,6 +149,8 @@ struct ehca_cq { + struct ipz_cq_handle ipz_cq_handle; + struct ehca_pfcq pf; + spinlock_t cb_lock; ++ u64 uspace_queue; ++ u64 uspace_fwh; + struct hlist_head qp_hashtab[QP_HASHTAB_LEN]; + struct list_head 
entry; + u32 nr_callbacks; /* #events assigned to cpu by scaling code */ +@@ -157,9 +158,6 @@ struct ehca_cq { + wait_queue_head_t wait_completion; + spinlock_t task_lock; + u32 ownpid; +- /* mmap counter for resources mapped into user space */ +- u32 mm_count_queue; +- u32 mm_count_galpa; + }; + + enum ehca_mr_flag { +@@ -259,6 +257,20 @@ struct ehca_ucontext { + struct ib_ucontext ib_ucontext; + }; + ++struct ehca_module *ehca_module_new(void); ++ ++int ehca_module_delete(struct ehca_module *me); ++ ++int ehca_eq_ctor(struct ehca_eq *eq); ++ ++int ehca_eq_dtor(struct ehca_eq *eq); ++ ++struct ehca_shca *ehca_shca_new(void); ++ ++int ehca_shca_delete(struct ehca_shca *me); ++ ++struct ehca_sport *ehca_sport_new(struct ehca_shca *anchor); ++ + int ehca_init_pd_cache(void); + void ehca_cleanup_pd_cache(void); + int ehca_init_cq_cache(void); +@@ -282,6 +294,7 @@ extern int ehca_use_hp_mr; + extern int ehca_scaling_code; + + struct ipzu_queue_resp { ++ u64 queue; /* points to first queue entry */ + u32 qe_size; /* queue entry size */ + u32 act_nr_of_sg; + u32 queue_length; /* queue length allocated in bytes */ +@@ -294,6 +307,7 @@ struct ehca_create_cq_resp { + u32 cq_number; + u32 token; + struct ipzu_queue_resp ipz_queue; ++ struct h_galpas galpas; + }; + + struct ehca_create_qp_resp { +@@ -306,6 +320,7 @@ struct ehca_create_qp_resp { + u32 dummy; /* padding for 8 byte alignment */ + struct ipzu_queue_resp ipz_squeue; + struct ipzu_queue_resp ipz_rqueue; ++ struct h_galpas galpas; + }; + + struct ehca_alloc_cq_parms { +diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c +--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c 2007-05-04 10:38:23.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c 2007-05-04 10:40:06.000000000 +0200 +@@ -268,6 +268,7 @@ struct ib_cq *ehca_create_cq(struct ib_d + if (context) { + struct ipz_queue *ipz_queue = &my_cq->ipz_queue; + struct 
ehca_create_cq_resp resp; ++ struct vm_area_struct *vma; + memset(&resp, 0, sizeof(resp)); + resp.cq_number = my_cq->cq_number; + resp.token = my_cq->token; +@@ -276,14 +277,40 @@ struct ib_cq *ehca_create_cq(struct ib_d + resp.ipz_queue.queue_length = ipz_queue->queue_length; + resp.ipz_queue.pagesize = ipz_queue->pagesize; + resp.ipz_queue.toggle_state = ipz_queue->toggle_state; ++ ret = ehca_mmap_nopage(((u64)(my_cq->token) << 32) | 0x12000000, ++ ipz_queue->queue_length, ++ (void**)&resp.ipz_queue.queue, ++ &vma); ++ if (ret) { ++ ehca_err(device, "Could not mmap queue pages"); ++ cq = ERR_PTR(ret); ++ goto create_cq_exit4; ++ } ++ my_cq->uspace_queue = resp.ipz_queue.queue; ++ resp.galpas = my_cq->galpas; ++ ret = ehca_mmap_register(my_cq->galpas.user.fw_handle, ++ (void**)&resp.galpas.kernel.fw_handle, ++ &vma); ++ if (ret) { ++ ehca_err(device, "Could not mmap fw_handle"); ++ cq = ERR_PTR(ret); ++ goto create_cq_exit5; ++ } ++ my_cq->uspace_fwh = (u64)resp.galpas.kernel.fw_handle; + if (ib_copy_to_udata(udata, &resp, sizeof(resp))) { + ehca_err(device, "Copy to udata failed."); +- goto create_cq_exit4; ++ goto create_cq_exit6; + } + } + + return cq; + ++create_cq_exit6: ++ ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE); ++ ++create_cq_exit5: ++ ehca_munmap(my_cq->uspace_queue, my_cq->ipz_queue.queue_length); ++ + create_cq_exit4: + ipz_queue_dtor(&my_cq->ipz_queue); + +@@ -317,6 +344,7 @@ static int get_cq_nr_events(struct ehca_ + int ehca_destroy_cq(struct ib_cq *cq) + { + u64 h_ret; ++ int ret; + struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); + int cq_num = my_cq->cq_number; + struct ib_device *device = cq->device; +@@ -326,20 +354,6 @@ int ehca_destroy_cq(struct ib_cq *cq) + u32 cur_pid = current->tgid; + unsigned long flags; + +- if (cq->uobject) { +- if (my_cq->mm_count_galpa || my_cq->mm_count_queue) { +- ehca_err(device, "Resources still referenced in " +- "user space cq_num=%x", my_cq->cq_number); +- return -EINVAL; +- } +- if 
(my_cq->ownpid != cur_pid) { +- ehca_err(device, "Invalid caller pid=%x ownpid=%x " +- "cq_num=%x", +- cur_pid, my_cq->ownpid, my_cq->cq_number); +- return -EINVAL; +- } +- } +- + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + while (my_cq->nr_events) { + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); +@@ -351,6 +365,25 @@ int ehca_destroy_cq(struct ib_cq *cq) + idr_remove(&ehca_cq_idr, my_cq->token); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + ++ if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) { ++ ehca_err(device, "Invalid caller pid=%x ownpid=%x", ++ cur_pid, my_cq->ownpid); ++ return -EINVAL; ++ } ++ ++ /* un-mmap if vma alloc */ ++ if (my_cq->uspace_queue ) { ++ ret = ehca_munmap(my_cq->uspace_queue, ++ my_cq->ipz_queue.queue_length); ++ if (ret) ++ ehca_err(device, "Could not munmap queue ehca_cq=%p " ++ "cq_num=%x", my_cq, cq_num); ++ ret = ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE); ++ if (ret) ++ ehca_err(device, "Could not munmap fwh ehca_cq=%p " ++ "cq_num=%x", my_cq, cq_num); ++ } ++ + h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0); + if (h_ret == H_R_STATE) { + /* cq in err: read err data and destroy it forcibly */ +@@ -379,7 +412,7 @@ int ehca_resize_cq(struct ib_cq *cq, int + struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); + u32 cur_pid = current->tgid; + +- if (cq->uobject && my_cq->ownpid != cur_pid) { ++ if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) { + ehca_err(cq->device, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_cq->ownpid); + return -EINVAL; +diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_iverbs.h ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_iverbs.h +--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_iverbs.h 2007-05-04 10:38:23.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_iverbs.h 2007-04-29 15:10:56.000000000 +0200 +@@ -171,11 +171,19 @@ int ehca_mmap(struct ib_ucontext *contex + + void ehca_poll_eqs(unsigned long data); + ++int 
ehca_mmap_nopage(u64 foffset,u64 length,void **mapped, ++ struct vm_area_struct **vma); ++ ++int ehca_mmap_register(u64 physical,void **mapped, ++ struct vm_area_struct **vma); ++ ++int ehca_munmap(unsigned long addr, size_t len); ++ + #ifdef CONFIG_PPC_64K_PAGES + void *ehca_alloc_fw_ctrlblock(gfp_t flags); + void ehca_free_fw_ctrlblock(void *ptr); + #else +-#define ehca_alloc_fw_ctrlblock(flags) ((void*) get_zeroed_page(flags)) ++#define ehca_alloc_fw_ctrlblock(flags) ((void *) get_zeroed_page(flags)) + #define ehca_free_fw_ctrlblock(ptr) free_page((unsigned long)(ptr)) + #endif + +diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c +--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-04 10:38:23.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-04 10:40:06.000000000 +0200 +@@ -52,7 +52,7 @@ + MODULE_LICENSE("Dual BSD/GPL"); + MODULE_AUTHOR("Christoph Raisch "); + MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver"); +-MODULE_VERSION("SVNEHCA_0022"); ++MODULE_VERSION("SVNEHCA_0019"); + + int ehca_open_aqp1 = 0; + int ehca_debug_level = 0; +@@ -293,7 +293,7 @@ int ehca_init_device(struct ehca_shca *s + strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX); + shca->ib_device.owner = THIS_MODULE; + +- shca->ib_device.uverbs_abi_ver = 6; ++ shca->ib_device.uverbs_abi_ver = 5; + shca->ib_device.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | +@@ -357,7 +357,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.dealloc_fmr = ehca_dealloc_fmr; + shca->ib_device.attach_mcast = ehca_attach_mcast; + shca->ib_device.detach_mcast = ehca_detach_mcast; +- /* shca->ib_device.process_mad = ehca_process_mad; */ ++ /* shca->ib_device.process_mad = ehca_process_mad; */ + shca->ib_device.mmap = ehca_mmap; + + return ret; +@@ -811,7 +811,7 @@ int __init 
ehca_module_init(void) + int ret; + + printk(KERN_INFO "eHCA Infiniband Device Driver " +- "(Rel.: SVNEHCA_0022)\n"); ++ "(Rel.: SVNEHCA_0019)\n"); + idr_init(&ehca_qp_idr); + idr_init(&ehca_cq_idr); + spin_lock_init(&ehca_qp_idr_lock); +diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c +--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c 2007-05-04 10:38:23.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c 2007-04-29 15:10:56.000000000 +0200 +@@ -637,6 +637,7 @@ struct ib_qp *ehca_create_qp(struct ib_p + struct ipz_queue *ipz_rqueue = &my_qp->ipz_rqueue; + struct ipz_queue *ipz_squeue = &my_qp->ipz_squeue; + struct ehca_create_qp_resp resp; ++ struct vm_area_struct * vma; + memset(&resp, 0, sizeof(resp)); + + resp.qp_num = my_qp->real_qp_num; +@@ -650,21 +651,59 @@ struct ib_qp *ehca_create_qp(struct ib_p + resp.ipz_rqueue.queue_length = ipz_rqueue->queue_length; + resp.ipz_rqueue.pagesize = ipz_rqueue->pagesize; + resp.ipz_rqueue.toggle_state = ipz_rqueue->toggle_state; ++ ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x22000000, ++ ipz_rqueue->queue_length, ++ (void**)&resp.ipz_rqueue.queue, ++ &vma); ++ if (ret) { ++ ehca_err(pd->device, "Could not mmap rqueue pages"); ++ goto create_qp_exit3; ++ } ++ my_qp->uspace_rqueue = resp.ipz_rqueue.queue; + /* squeue properties */ + resp.ipz_squeue.qe_size = ipz_squeue->qe_size; + resp.ipz_squeue.act_nr_of_sg = ipz_squeue->act_nr_of_sg; + resp.ipz_squeue.queue_length = ipz_squeue->queue_length; + resp.ipz_squeue.pagesize = ipz_squeue->pagesize; + resp.ipz_squeue.toggle_state = ipz_squeue->toggle_state; ++ ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x23000000, ++ ipz_squeue->queue_length, ++ (void**)&resp.ipz_squeue.queue, ++ &vma); ++ if (ret) { ++ ehca_err(pd->device, "Could not mmap squeue pages"); ++ goto create_qp_exit4; ++ } ++ my_qp->uspace_squeue = resp.ipz_squeue.queue; ++ /* fw_handle */ 
++ resp.galpas = my_qp->galpas; ++ ret = ehca_mmap_register(my_qp->galpas.user.fw_handle, ++ (void**)&resp.galpas.kernel.fw_handle, ++ &vma); ++ if (ret) { ++ ehca_err(pd->device, "Could not mmap fw_handle"); ++ goto create_qp_exit5; ++ } ++ my_qp->uspace_fwh = (u64)resp.galpas.kernel.fw_handle; ++ + if (ib_copy_to_udata(udata, &resp, sizeof resp)) { + ehca_err(pd->device, "Copy to udata failed"); + ret = -EINVAL; +- goto create_qp_exit3; ++ goto create_qp_exit6; + } + } + + return &my_qp->ib_qp; + ++create_qp_exit6: ++ ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE); ++ ++create_qp_exit5: ++ ehca_munmap(my_qp->uspace_squeue, my_qp->ipz_squeue.queue_length); ++ ++create_qp_exit4: ++ ehca_munmap(my_qp->uspace_rqueue, my_qp->ipz_rqueue.queue_length); ++ + create_qp_exit3: + ipz_queue_dtor(&my_qp->ipz_rqueue); + ipz_queue_dtor(&my_qp->ipz_squeue); +@@ -892,7 +931,7 @@ static int internal_modify_qp(struct ib_ + my_qp->qp_type == IB_QPT_SMI) && + statetrans == IB_QPST_SQE2RTS) { + /* mark next free wqe if kernel */ +- if (!ibqp->uobject) { ++ if (my_qp->uspace_squeue == 0) { + struct ehca_wqe *wqe; + /* lock send queue */ + spin_lock_irqsave(&my_qp->spinlock_s, spl_flags); +@@ -1378,18 +1417,11 @@ int ehca_destroy_qp(struct ib_qp *ibqp) + enum ib_qp_type qp_type; + unsigned long flags; + +- if (ibqp->uobject) { +- if (my_qp->mm_count_galpa || +- my_qp->mm_count_rqueue || my_qp->mm_count_squeue) { +- ehca_err(ibqp->device, "Resources still referenced in " +- "user space qp_num=%x", ibqp->qp_num); +- return -EINVAL; +- } +- if (my_pd->ownpid != cur_pid) { +- ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x", +- cur_pid, my_pd->ownpid); +- return -EINVAL; +- } ++ if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && ++ my_pd->ownpid != cur_pid) { ++ ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x", ++ cur_pid, my_pd->ownpid); ++ return -EINVAL; + } + + if (my_qp->send_cq) { +@@ -1407,6 +1439,24 @@ int ehca_destroy_qp(struct ib_qp *ibqp) + 
idr_remove(&ehca_qp_idr, my_qp->token); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + ++ /* un-mmap if vma alloc */ ++ if (my_qp->uspace_rqueue) { ++ ret = ehca_munmap(my_qp->uspace_rqueue, ++ my_qp->ipz_rqueue.queue_length); ++ if (ret) ++ ehca_err(ibqp->device, "Could not munmap rqueue " ++ "qp_num=%x", qp_num); ++ ret = ehca_munmap(my_qp->uspace_squeue, ++ my_qp->ipz_squeue.queue_length); ++ if (ret) ++ ehca_err(ibqp->device, "Could not munmap squeue " ++ "qp_num=%x", qp_num); ++ ret = ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE); ++ if (ret) ++ ehca_err(ibqp->device, "Could not munmap fwh qp_num=%x", ++ qp_num); ++ } ++ + h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); + if (h_ret != H_SUCCESS) { + ehca_err(ibqp->device, "hipz_h_destroy_qp() failed rc=%lx " +diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_uverbs.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_uverbs.c +--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_uverbs.c 2007-05-04 10:38:23.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_uverbs.c 2007-04-29 15:10:56.000000000 +0200 +@@ -68,183 +68,105 @@ int ehca_dealloc_ucontext(struct ib_ucon + return 0; + } + +-static void ehca_mm_open(struct vm_area_struct *vma) ++struct page *ehca_nopage(struct vm_area_struct *vma, ++ unsigned long address, int *type) + { +- u32 *count = (u32*)vma->vm_private_data; +- if (!count) { +- ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx", +- vma->vm_start, vma->vm_end); +- return; +- } +- (*count)++; +- if (!(*count)) +- ehca_gen_err("Use count overflow vm_start=%lx vm_end=%lx", +- vma->vm_start, vma->vm_end); +- ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x", +- vma->vm_start, vma->vm_end, *count); +-} +- +-static void ehca_mm_close(struct vm_area_struct *vma) +-{ +- u32 *count = (u32*)vma->vm_private_data; +- if (!count) { +- ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx", +- vma->vm_start, vma->vm_end); +- return; +- } +- (*count)--; 
+- ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x", +- vma->vm_start, vma->vm_end, *count); +-} +- +-static struct vm_operations_struct vm_ops = { +- .open = ehca_mm_open, +- .close = ehca_mm_close, +-}; +- +-static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas, +- u32 *mm_count) +-{ +- int ret; +- u64 vsize, physical; +- +- vsize = vma->vm_end - vma->vm_start; +- if (vsize != EHCA_PAGESIZE) { +- ehca_gen_err("invalid vsize=%lx", vma->vm_end - vma->vm_start); +- return -EINVAL; +- } +- +- physical = galpas->user.fw_handle; +- vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); +- ehca_gen_dbg("vsize=%lx physical=%lx", vsize, physical); +- /* VM_IO | VM_RESERVED are set by remap_pfn_range() */ +- ret = remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT, +- vsize, vma->vm_page_prot); +- if (unlikely(ret)) { +- ehca_gen_err("remap_pfn_range() failed ret=%x", ret); +- return -ENOMEM; +- } +- +- vma->vm_private_data = mm_count; +- (*mm_count)++; +- vma->vm_ops = &vm_ops; +- +- return 0; +-} ++ struct page *mypage = NULL; ++ u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; ++ u32 idr_handle = fileoffset >> 32; ++ u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... 
*/ ++ u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ ++ u32 cur_pid = current->tgid; ++ unsigned long flags; ++ struct ehca_cq *cq; ++ struct ehca_qp *qp; ++ struct ehca_pd *pd; ++ u64 offset; ++ void *vaddr; + +-static int ehca_mmap_queue(struct vm_area_struct *vma, struct ipz_queue *queue, +- u32 *mm_count) +-{ +- int ret; +- u64 start, ofs; +- struct page *page; ++ switch (q_type) { ++ case 1: /* CQ */ ++ spin_lock_irqsave(&ehca_cq_idr_lock, flags); ++ cq = idr_find(&ehca_cq_idr, idr_handle); ++ spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + +- vma->vm_flags |= VM_RESERVED; +- start = vma->vm_start; +- for (ofs = 0; ofs < queue->queue_length; ofs += PAGE_SIZE) { +- u64 virt_addr = (u64)ipz_qeit_calc(queue, ofs); +- page = virt_to_page(virt_addr); +- ret = vm_insert_page(vma, start, page); +- if (unlikely(ret)) { +- ehca_gen_err("vm_insert_page() failed rc=%x", ret); +- return ret; ++ /* make sure this mmap really belongs to the authorized user */ ++ if (!cq) { ++ ehca_gen_err("cq is NULL ret=NOPAGE_SIGBUS"); ++ return NOPAGE_SIGBUS; + } +- start += PAGE_SIZE; +- } +- vma->vm_private_data = mm_count; +- (*mm_count)++; +- vma->vm_ops = &vm_ops; + +- return 0; +-} +- +-static int ehca_mmap_cq(struct vm_area_struct *vma, struct ehca_cq *cq, +- u32 rsrc_type) +-{ +- int ret; +- +- switch (rsrc_type) { +- case 1: /* galpa fw handle */ +- ehca_dbg(cq->ib_cq.device, "cq_num=%x fw", cq->cq_number); +- ret = ehca_mmap_fw(vma, &cq->galpas, &cq->mm_count_galpa); +- if (unlikely(ret)) { ++ if (cq->ownpid != cur_pid) { + ehca_err(cq->ib_cq.device, +- "ehca_mmap_fw() failed rc=%x cq_num=%x", +- ret, cq->cq_number); +- return ret; ++ "Invalid caller pid=%x ownpid=%x", ++ cur_pid, cq->ownpid); ++ return NOPAGE_SIGBUS; + } +- break; + +- case 2: /* cq queue_addr */ +- ehca_dbg(cq->ib_cq.device, "cq_num=%x queue", cq->cq_number); +- ret = ehca_mmap_queue(vma, &cq->ipz_queue, &cq->mm_count_queue); +- if (unlikely(ret)) { +- ehca_err(cq->ib_cq.device, +- 
"ehca_mmap_queue() failed rc=%x cq_num=%x", +- ret, cq->cq_number); +- return ret; ++ if (rsrc_type == 2) { ++ ehca_dbg(cq->ib_cq.device, "cq=%p cq queuearea", cq); ++ offset = address - vma->vm_start; ++ vaddr = ipz_qeit_calc(&cq->ipz_queue, offset); ++ ehca_dbg(cq->ib_cq.device, "offset=%lx vaddr=%p", ++ offset, vaddr); ++ mypage = virt_to_page(vaddr); + } + break; + +- default: +- ehca_err(cq->ib_cq.device, "bad resource type=%x cq_num=%x", +- rsrc_type, cq->cq_number); +- return -EINVAL; +- } +- +- return 0; +-} +- +-static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp, +- u32 rsrc_type) +-{ +- int ret; ++ case 2: /* QP */ ++ spin_lock_irqsave(&ehca_qp_idr_lock, flags); ++ qp = idr_find(&ehca_qp_idr, idr_handle); ++ spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + +- switch (rsrc_type) { +- case 1: /* galpa fw handle */ +- ehca_dbg(qp->ib_qp.device, "qp_num=%x fw", qp->ib_qp.qp_num); +- ret = ehca_mmap_fw(vma, &qp->galpas, &qp->mm_count_galpa); +- if (unlikely(ret)) { +- ehca_err(qp->ib_qp.device, +- "remap_pfn_range() failed ret=%x qp_num=%x", +- ret, qp->ib_qp.qp_num); +- return -ENOMEM; ++ /* make sure this mmap really belongs to the authorized user */ ++ if (!qp) { ++ ehca_gen_err("qp is NULL ret=NOPAGE_SIGBUS"); ++ return NOPAGE_SIGBUS; + } +- break; + +- case 2: /* qp rqueue_addr */ +- ehca_dbg(qp->ib_qp.device, "qp_num=%x rqueue", +- qp->ib_qp.qp_num); +- ret = ehca_mmap_queue(vma, &qp->ipz_rqueue, &qp->mm_count_rqueue); +- if (unlikely(ret)) { ++ pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd); ++ if (pd->ownpid != cur_pid) { + ehca_err(qp->ib_qp.device, +- "ehca_mmap_queue(rq) failed rc=%x qp_num=%x", +- ret, qp->ib_qp.qp_num); +- return ret; ++ "Invalid caller pid=%x ownpid=%x", ++ cur_pid, pd->ownpid); ++ return NOPAGE_SIGBUS; + } +- break; + +- case 3: /* qp squeue_addr */ +- ehca_dbg(qp->ib_qp.device, "qp_num=%x squeue", +- qp->ib_qp.qp_num); +- ret = ehca_mmap_queue(vma, &qp->ipz_squeue, &qp->mm_count_squeue); +- if 
(unlikely(ret)) { +- ehca_err(qp->ib_qp.device, +- "ehca_mmap_queue(sq) failed rc=%x qp_num=%x", +- ret, qp->ib_qp.qp_num); +- return ret; ++ if (rsrc_type == 2) { /* rqueue */ ++ ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueuearea", qp); ++ offset = address - vma->vm_start; ++ vaddr = ipz_qeit_calc(&qp->ipz_rqueue, offset); ++ ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p", ++ offset, vaddr); ++ mypage = virt_to_page(vaddr); ++ } else if (rsrc_type == 3) { /* squeue */ ++ ehca_dbg(qp->ib_qp.device, "qp=%p qp squeuearea", qp); ++ offset = address - vma->vm_start; ++ vaddr = ipz_qeit_calc(&qp->ipz_squeue, offset); ++ ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p", ++ offset, vaddr); ++ mypage = virt_to_page(vaddr); + } + break; + + default: +- ehca_err(qp->ib_qp.device, "bad resource type=%x qp=num=%x", +- rsrc_type, qp->ib_qp.qp_num); +- return -EINVAL; ++ ehca_gen_err("bad queue type %x", q_type); ++ return NOPAGE_SIGBUS; + } + +- return 0; ++ if (!mypage) { ++ ehca_gen_err("Invalid page adr==NULL ret=NOPAGE_SIGBUS"); ++ return NOPAGE_SIGBUS; ++ } ++ get_page(mypage); ++ ++ return mypage; + } + ++static struct vm_operations_struct ehcau_vm_ops = { ++ .nopage = ehca_nopage, ++}; ++ + int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) + { + u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; +@@ -253,6 +175,7 @@ int ehca_mmap(struct ib_ucontext *contex + u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + u32 cur_pid = current->tgid; + u32 ret; ++ u64 vsize, physical; + unsigned long flags; + struct ehca_cq *cq; + struct ehca_qp *qp; +@@ -278,12 +201,44 @@ int ehca_mmap(struct ib_ucontext *contex + if (!cq->ib_cq.uobject || cq->ib_cq.uobject->context != context) + return -EINVAL; + +- ret = ehca_mmap_cq(vma, cq, rsrc_type); +- if (unlikely(ret)) { +- ehca_err(cq->ib_cq.device, +- "ehca_mmap_cq() failed rc=%x cq_num=%x", +- ret, cq->cq_number); +- return ret; ++ switch (rsrc_type) { ++ case 1: /* galpa fw handle */ ++ 
ehca_dbg(cq->ib_cq.device, "cq=%p cq triggerarea", cq); ++ vma->vm_flags |= VM_RESERVED; ++ vsize = vma->vm_end - vma->vm_start; ++ if (vsize != EHCA_PAGESIZE) { ++ ehca_err(cq->ib_cq.device, "invalid vsize=%lx", ++ vma->vm_end - vma->vm_start); ++ return -EINVAL; ++ } ++ ++ physical = cq->galpas.user.fw_handle; ++ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); ++ vma->vm_flags |= VM_IO | VM_RESERVED; ++ ++ ehca_dbg(cq->ib_cq.device, ++ "vsize=%lx physical=%lx", vsize, physical); ++ ret = remap_pfn_range(vma, vma->vm_start, ++ physical >> PAGE_SHIFT, vsize, ++ vma->vm_page_prot); ++ if (ret) { ++ ehca_err(cq->ib_cq.device, ++ "remap_pfn_range() failed ret=%x", ++ ret); ++ return -ENOMEM; ++ } ++ break; ++ ++ case 2: /* cq queue_addr */ ++ ehca_dbg(cq->ib_cq.device, "cq=%p cq q_addr", cq); ++ vma->vm_flags |= VM_RESERVED; ++ vma->vm_ops = &ehcau_vm_ops; ++ break; ++ ++ default: ++ ehca_err(cq->ib_cq.device, "bad resource type %x", ++ rsrc_type); ++ return -EINVAL; + } + break; + +@@ -307,12 +262,50 @@ int ehca_mmap(struct ib_ucontext *contex + if (!qp->ib_qp.uobject || qp->ib_qp.uobject->context != context) + return -EINVAL; + +- ret = ehca_mmap_qp(vma, qp, rsrc_type); +- if (unlikely(ret)) { +- ehca_err(qp->ib_qp.device, +- "ehca_mmap_qp() failed rc=%x qp_num=%x", +- ret, qp->ib_qp.qp_num); +- return ret; ++ switch (rsrc_type) { ++ case 1: /* galpa fw handle */ ++ ehca_dbg(qp->ib_qp.device, "qp=%p qp triggerarea", qp); ++ vma->vm_flags |= VM_RESERVED; ++ vsize = vma->vm_end - vma->vm_start; ++ if (vsize != EHCA_PAGESIZE) { ++ ehca_err(qp->ib_qp.device, "invalid vsize=%lx", ++ vma->vm_end - vma->vm_start); ++ return -EINVAL; ++ } ++ ++ physical = qp->galpas.user.fw_handle; ++ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); ++ vma->vm_flags |= VM_IO | VM_RESERVED; ++ ++ ehca_dbg(qp->ib_qp.device, "vsize=%lx physical=%lx", ++ vsize, physical); ++ ret = remap_pfn_range(vma, vma->vm_start, ++ physical >> PAGE_SHIFT, vsize, ++ vma->vm_page_prot); ++ if 
(ret) { ++ ehca_err(qp->ib_qp.device, ++ "remap_pfn_range() failed ret=%x", ++ ret); ++ return -ENOMEM; ++ } ++ break; ++ ++ case 2: /* qp rqueue_addr */ ++ ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueue_addr", qp); ++ vma->vm_flags |= VM_RESERVED; ++ vma->vm_ops = &ehcau_vm_ops; ++ break; ++ ++ case 3: /* qp squeue_addr */ ++ ehca_dbg(qp->ib_qp.device, "qp=%p qp squeue_addr", qp); ++ vma->vm_flags |= VM_RESERVED; ++ vma->vm_ops = &ehcau_vm_ops; ++ break; ++ ++ default: ++ ehca_err(qp->ib_qp.device, "bad resource type %x", ++ rsrc_type); ++ return -EINVAL; + } + break; + +@@ -323,3 +316,77 @@ int ehca_mmap(struct ib_ucontext *contex + + return 0; + } ++ ++int ehca_mmap_nopage(u64 foffset, u64 length, void **mapped, ++ struct vm_area_struct **vma) ++{ ++ down_write(&current->mm->mmap_sem); ++ *mapped = (void*)do_mmap(NULL,0, length, PROT_WRITE, ++ MAP_SHARED | MAP_ANONYMOUS, ++ foffset); ++ up_write(&current->mm->mmap_sem); ++ if (!(*mapped)) { ++ ehca_gen_err("couldn't mmap foffset=%lx length=%lx", ++ foffset, length); ++ return -EINVAL; ++ } ++ ++ *vma = find_vma(current->mm, (u64)*mapped); ++ if (!(*vma)) { ++ down_write(&current->mm->mmap_sem); ++ do_munmap(current->mm, 0, length); ++ up_write(&current->mm->mmap_sem); ++ ehca_gen_err("couldn't find vma queue=%p", *mapped); ++ return -EINVAL; ++ } ++ (*vma)->vm_flags |= VM_RESERVED; ++ (*vma)->vm_ops = &ehcau_vm_ops; ++ ++ return 0; ++} ++ ++int ehca_mmap_register(u64 physical, void **mapped, ++ struct vm_area_struct **vma) ++{ ++ int ret; ++ unsigned long vsize; ++ /* ehca hw supports only 4k page */ ++ ret = ehca_mmap_nopage(0, EHCA_PAGESIZE, mapped, vma); ++ if (ret) { ++ ehca_gen_err("couldn't mmap physical=%lx", physical); ++ return ret; ++ } ++ ++ (*vma)->vm_flags |= VM_RESERVED; ++ vsize = (*vma)->vm_end - (*vma)->vm_start; ++ if (vsize != EHCA_PAGESIZE) { ++ ehca_gen_err("invalid vsize=%lx", ++ (*vma)->vm_end - (*vma)->vm_start); ++ return -EINVAL; ++ } ++ ++ (*vma)->vm_page_prot = pgprot_noncached((*vma)->vm_page_prot); ++
(*vma)->vm_flags |= VM_IO | VM_RESERVED; ++ ++ ret = remap_pfn_range((*vma), (*vma)->vm_start, ++ physical >> PAGE_SHIFT, vsize, ++ (*vma)->vm_page_prot); ++ if (ret) { ++ ehca_gen_err("remap_pfn_range() failed ret=%x", ret); ++ return -ENOMEM; ++ } ++ ++ return 0; ++ ++} ++ ++int ehca_munmap(unsigned long addr, size_t len) { ++ int ret = 0; ++ struct mm_struct *mm = current->mm; ++ if (mm) { ++ down_write(&mm->mmap_sem); ++ ret = do_munmap(mm, addr, len); ++ up_write(&mm->mmap_sem); ++ } ++ return ret; ++} From ossrosch at linux.vnet.ibm.com Thu May 10 07:29:08 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Thu, 10 May 2007 16:29:08 +0200 Subject: [ofa-general] [PATCH ofed-1.2-rc3 3/4] ehca: backport for rhel-4.5 - use introduced dma_ops Message-ID: <200705101629.09382.ossrosch@linux.vnet.ibm.com> Signed-off-by: Stefan Roscher --- backport_ehca_3_rhel45_dma.patch | 226 +++++++++++++++++++++++++++++++++++++++ 1 files changed, 226 insertions(+) diff -Nurp ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_3_rhel45_dma.patch ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_3_rhel45_dma.patch --- ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_3_rhel45_dma.patch 1970-01-01 01:00:00.000000000 +0100 +++ ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_3_rhel45_dma.patch 2007-05-10 17:30:24.000000000 +0200 @@ -0,0 +1,226 @@ +diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_dma.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_dma.c +--- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_dma.c 1970-01-01 01:00:00.000000000 +0100 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_dma.c 2007-05-03 16:25:30.000000000 +0200 +@@ -0,0 +1,194 @@ ++/* ++ * IBM eServer eHCA Infiniband device driver for Linux on POWER ++ * ++ * eHCA dma mapping via ibmebus ++ * ++ * Authors: Stefan Roscher ++ * Hoang-Nam Nguyen ++ * ++ * Copyright (c) 2007 IBM Corporation ++ * ++ * All rights 
reserved. ++ * ++ * This source code is distributed under a dual license of GPL v2.0 and OpenIB ++ * BSD. ++ * ++ * OpenIB BSD License ++ * ++ * Redistribution and use in source and binary forms, with or without ++ * modification, are permitted provided that the following conditions are met: ++ * ++ * Redistributions of source code must retain the above copyright notice, this ++ * list of conditions and the following disclaimer. ++ * ++ * Redistributions in binary form must reproduce the above copyright notice, ++ * this list of conditions and the following disclaimer in the documentation ++ * and/or other materials ++ * provided with the distribution. ++ * ++ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" ++ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE ++ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ++ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE ++ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR ++ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF ++ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR ++ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER ++ * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ++ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE ++ * POSSIBILITY OF SUCH DAMAGE. 
++ */ ++ ++#include ++#include ++ ++static int ehca_mapping_error(struct ib_device *dev, u64 dma_addr); ++ ++static u64 ehca_dma_map_single(struct ib_device *dev, ++ void *cpu_addr, size_t size, ++ enum dma_data_direction direction); ++ ++static void ehca_dma_unmap_single(struct ib_device *dev, ++ u64 addr, size_t size, ++ enum dma_data_direction direction); ++ ++static u64 ehca_dma_map_page(struct ib_device *dev, ++ struct page *page, ++ unsigned long offset, ++ size_t size, ++ enum dma_data_direction direction); ++ ++static void ehca_dma_unmap_page(struct ib_device *dev, ++ u64 addr, size_t size, ++ enum dma_data_direction direction); ++ ++int ehca_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents, ++ enum dma_data_direction direction); ++ ++static void ehca_unmap_sg(struct ib_device *dev, ++ struct scatterlist *sg, int nents, ++ enum dma_data_direction direction); ++ ++static u64 ehca_sg_dma_address(struct ib_device *dev, struct scatterlist *sg); ++ ++static unsigned int ehca_sg_dma_len(struct ib_device *dev, ++ struct scatterlist *sg); ++ ++static void ehca_sync_single_for_cpu(struct ib_device *dev, ++ u64 addr, ++ size_t size, ++ enum dma_data_direction dir); ++ ++static void ehca_sync_single_for_device(struct ib_device *dev, ++ u64 addr, ++ size_t size, ++ enum dma_data_direction dir); ++ ++static void *ehca_dma_alloc_coherent(struct ib_device *dev, size_t size, ++ u64 *dma_handle, gfp_t flag); ++ ++static void ehca_dma_free_coherent(struct ib_device *dev, size_t size, ++ void *cpu_addr, dma_addr_t dma_handle); ++ ++struct ib_dma_mapping_ops ehca_dma_mapping_ops = { ++ ehca_mapping_error, ++ ehca_dma_map_single, ++ ehca_dma_unmap_single, ++ ehca_dma_map_page, ++ ehca_dma_unmap_page, ++ ehca_map_sg, ++ ehca_unmap_sg, ++ ehca_sg_dma_address, ++ ehca_sg_dma_len, ++ ehca_sync_single_for_cpu, ++ ehca_sync_single_for_device, ++ ehca_dma_alloc_coherent, ++ ehca_dma_free_coherent ++}; ++ ++static int ehca_mapping_error(struct ib_device *dev, u64 
dma_addr) ++{ ++ return dma_addr == 0L; ++} ++ ++static u64 ehca_dma_map_single(struct ib_device *dev, ++ void *cpu_addr, size_t size, ++ enum dma_data_direction direction) ++{ ++ return ibmebus_map_single(dev, cpu_addr, size, direction); ++} ++ ++static void ehca_dma_unmap_single(struct ib_device *dev, ++ u64 addr, size_t size, ++ enum dma_data_direction direction) ++{ ++ ibmebus_unmap_single(dev, addr, size, direction); ++} ++ ++static u64 ehca_dma_map_page(struct ib_device *dev, ++ struct page *page, ++ unsigned long offset, ++ size_t size, ++ enum dma_data_direction direction) ++{ ++ return dma_map_page(dev->dma_device, page, offset, size, direction); ++} ++ ++static void ehca_dma_unmap_page(struct ib_device *dev, ++ u64 addr, size_t size, ++ enum dma_data_direction direction) ++{ ++ dma_unmap_page(dev->dma_device, addr, size, direction); ++} ++ ++int ehca_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents, ++ enum dma_data_direction direction) ++{ ++ return ibmebus_map_sg(dev, sg, nents, direction); ++} ++ ++static void ehca_unmap_sg(struct ib_device *dev, ++ struct scatterlist *sg, int nents, ++ enum dma_data_direction direction) ++{ ++ ibmebus_unmap_sg(dev, sg, nents, direction); ++} ++ ++static u64 ehca_sg_dma_address(struct ib_device *dev, struct scatterlist *sg) ++{ ++ return sg_dma_address(sg); ++} ++ ++static unsigned int ehca_sg_dma_len(struct ib_device *dev, ++ struct scatterlist *sg) ++{ ++ return sg_dma_len(sg); ++} ++ ++static void ehca_sync_single_for_cpu(struct ib_device *dev, ++ u64 addr, ++ size_t size, ++ enum dma_data_direction dir) ++{ ++ dma_sync_single_for_cpu(dev->dma_device, addr, size, dir); ++} ++ ++static void ehca_sync_single_for_device(struct ib_device *dev, ++ u64 addr, ++ size_t size, ++ enum dma_data_direction dir) ++{ ++ dma_sync_single_for_device(dev->dma_device, addr, size, dir); ++} ++ ++static void *ehca_dma_alloc_coherent(struct ib_device *dev, size_t size, ++ u64 *dma_handle, gfp_t flag) ++{ ++ return 
ibmebus_alloc_coherent(dev, size, dma_handle, flag); ++} ++ ++static void ehca_dma_free_coherent(struct ib_device *dev, size_t size, ++ void *cpu_addr, dma_addr_t dma_handle) ++{ ++ ibmebus_free_coherent(dev, size, cpu_addr, dma_handle); ++} +diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c +--- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/ehca_main.c 2007-04-29 15:10:56.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-03 16:19:28.000000000 +0200 +@@ -279,6 +279,7 @@ init_node_guid1: + + int ehca_init_device(struct ehca_shca *shca) + { ++ extern struct ib_dma_mapping_ops ehca_dma_mapping_ops; + int ret; + + ret = init_node_guid(shca); +@@ -354,6 +355,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.detach_mcast = ehca_detach_mcast; + /* shca->ib_device.process_mad = ehca_process_mad; */ + shca->ib_device.mmap = ehca_mmap; ++ shca->ib_device.dma_ops = &ehca_dma_mapping_ops; + + return ret; + } +diff -Nurp ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/Makefile ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/Makefile +--- ofa_kernel-1.2_orig/drivers/infiniband/hw/ehca/Makefile 2007-04-29 15:10:56.000000000 +0200 ++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/Makefile 2007-05-03 16:26:13.000000000 +0200 +@@ -12,5 +12,5 @@ obj-$(CONFIG_INFINIBAND_EHCA) += ib_ehca + + ib_ehca-objs = ehca_main.o ehca_hca.o ehca_mcast.o ehca_pd.o ehca_av.o ehca_eq.o \ + ehca_cq.o ehca_qp.o ehca_sqp.o ehca_mrmw.o ehca_reqs.o ehca_irq.o \ +- ehca_uverbs.o ipz_pt_fn.o hcp_if.o hcp_phyp.o ++ ehca_uverbs.o ehca_dma.o ipz_pt_fn.o hcp_if.o hcp_phyp.o From ossrosch at linux.vnet.ibm.com Thu May 10 07:30:08 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Thu, 10 May 2007 16:30:08 +0200 Subject: [ofa-general] [PATCH ofed-1.2-rc3 4/4] ehca: backport for rhel-4.5 - create hvcall.h in kernel_addons Message-ID: 
<200705101630.09153.ossrosch@linux.vnet.ibm.com> creates file hvcall.h and system.h in kernel_addons/backport/2.6.9_U5/include Signed-off-by: Stefan Roscher --- asm-powerpc/system.h | 1 asm/hvcall.h | 309 +++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 310 insertions(+) diff -Nurp ofa_kernel-1.2_old/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h ofa_kernel-1.2_new/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h --- ofa_kernel-1.2_old/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h 1970-01-01 01:00:00.000000000 +0100 +++ ofa_kernel-1.2_new/kernel_addons/backport/2.6.9_U5/include/asm/hvcall.h 2007-05-10 18:14:12.000000000 +0200 @@ -0,0 +1,309 @@ +#ifndef ASM_HVCALL_BACKPORT_2616_H +#define ASM_HVCALL_BACKPORT_2616_H + +#include_next + +#ifdef __KERNEL__ + +#define H_SUCCESS H_Success +#define H_BUSY H_Busy +#define H_CONSTRAINED H_Constrained +#define H_PAGE_REGISTERED 15 + +#define H_PARAMETER H_Parameter +#define H_NO_MEM H_NoMem +#define H_RESOURCE H_Resource +#define H_HARDWARE H_Hardware +#define H_ADAPTER_PARM -17 +#define H_RH_PARM -18 +#define H_RT_PARM -22 +#define H_MLENGTH_PARM -27 +#define H_MEM_PARM -28 +#define H_MEM_ACCESS_PARM -29 +#define H_ALIAS_EXIST -39 +#define H_TABLE_FULL -41 +#define H_NOT_ENOUGH_RESOURCES -44 +#define H_R_STATE -45 + +#define H_CB_ALIGNMENT 4096 + +#define H_RESET_EVENTS 0x15C +#define H_ALLOC_RESOURCE 0x160 +#define H_FREE_RESOURCE 0x164 +#define H_MODIFY_QP 0x168 +#define H_QUERY_QP 0x16C +#define H_REREGISTER_PMR 0x170 +#define H_REGISTER_SMR 0x174 +#define H_QUERY_MR 0x178 +#define H_QUERY_MW 0x17C +#define H_QUERY_HCA 0x180 +#define H_QUERY_PORT 0x184 +#define H_MODIFY_PORT 0x188 +#define H_DEFINE_AQP1 0x18C +#define H_DEFINE_AQP0 0x194 +#define H_RESIZE_MR 0x198 +#define H_ATTACH_MCQP 0x19C +#define H_DETACH_MCQP 0x1A0 +#define H_REGISTER_RPAGES 0x1AC +#define H_DISABLE_AND_GETC 0x1B0 +#define H_ERROR_DATA 0x1B4 +#define H_QUERY_INT_STATE 0x1E4 + +#define H_LONG_BUSY_ORDER_1_MSEC 
H_LongBusyOrder1msec
+#define H_LONG_BUSY_ORDER_10_MSEC H_LongBusyOrder10msec
+#define H_LONG_BUSY_ORDER_100_MSEC H_LongBusyOrder100msec
+#define H_LONG_BUSY_ORDER_1_SEC H_LongBusyOrder1sec
+#define H_LONG_BUSY_ORDER_10_SEC H_LongBusyOrder10sec
+#define H_LONG_BUSY_ORDER_100_SEC H_LongBusyOrder100sec
+#define H_IS_LONG_BUSY(x) ((x >= H_LongBusyStartRange) && (x <= H_LongBusyEndRange))
+
+
+#ifndef __ASSEMBLY__
+#include
+
+#define PLPAR_HCALL9_BUFSIZE 9
+inline static long plpar_hcall9(unsigned long opcode,
+				unsigned long *retbuf,
+				unsigned long arg1, /*

From yosefe at voltaire.com Thu May 10 07:34:50 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 17:34:50 +0300
Subject: [ofa-general] [PATCHv4 for 2.6.22 0/2] fix bug #420: ipoib handling of pkey change events
Message-ID: <46432D8A.8030007@voltaire.com>

These two patches fix bug #420: PKey table reordering caused by SM failover
stops ipoib traffic.

patch 1: add uncached device queries to core
patch 2: restart ipoib qp on pkey change event, and use uncached queries on
         qp init

--

From yosefe at voltaire.com Thu May 10 07:36:43 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 17:36:43 +0300
Subject: [ofa-general] [PATCHv4 for 2.6.22 1/2] core: uncached "find gid" and "find pkey" queries
In-Reply-To: <46432D8A.8030007@voltaire.com>
References: <46432D8A.8030007@voltaire.com>
Message-ID: <46432DFB.9070007@voltaire.com>

* Add ib_find_gid and ib_find_pkey over uncached device queries.
  The calls might block, but the values they return are always up to date.
* Cache pkey and gid table lengths in core to avoid port info queries.
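The uncached lookup described above can be illustrated outside the kernel.
What follows is only a minimal userspace sketch, not the patch itself:
struct mock_port, mock_query_pkey() and mock_find_pkey() are hypothetical
names invented for this example, and the array stands in for whatever the
device would report on a real per-entry query. It shows the same shape as
the proposed ib_find_pkey(): walk the table up to a cached length, query
each entry afresh, and return -ENOENT if the value is absent.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for one port's P_Key table; not a kernel type. */
struct mock_port {
	const uint16_t *pkeys;	/* current table contents, as the device reports them */
	int tbl_len;		/* table length, read once and cached (cf. pkey_tbl_len) */
};

/* Stand-in for an uncached per-entry query such as ib_query_pkey(). */
static int mock_query_pkey(const struct mock_port *port, int index,
			   uint16_t *pkey)
{
	if (index < 0 || index >= port->tbl_len)
		return -EINVAL;
	*pkey = port->pkeys[index];
	return 0;
}

/* Linear search: returns 0 and sets *index on success, -ENOENT if absent. */
static int mock_find_pkey(const struct mock_port *port, uint16_t pkey,
			  uint16_t *index)
{
	uint16_t tmp;
	int i, ret;

	for (i = 0; i < port->tbl_len; ++i) {
		ret = mock_query_pkey(port, i, &tmp);
		if (ret)
			return ret;	/* propagate a failed query unchanged */
		if (tmp == pkey) {
			*index = (uint16_t)i;
			return 0;
		}
	}
	return -ENOENT;
}
```

Because every entry is read at call time, the result follows a reshuffled
table; the cost is one query per entry, which is why only the table length
is cached.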
Signed-off-by: Yosef Etigin --- drivers/infiniband/core/device.c | 138 +++++++++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 25 +++++++ 2 files changed, 163 insertions(+) Index: b/drivers/infiniband/core/device.c =================================================================== --- a/drivers/infiniband/core/device.c 2007-05-08 15:46:36.000000000 +0300 +++ b/drivers/infiniband/core/device.c 2007-05-09 11:47:22.096064221 +0300 @@ -149,6 +149,18 @@ static int alloc_name(char *name) return 0; } +static inline int start_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; +} + + +static inline int end_port(struct ib_device *device) +{ + return (device->node_type == RDMA_NODE_IB_SWITCH) ? + 0 : device->phys_port_cnt; +} + /** * ib_alloc_device - allocate an IB device struct * @size:size of structure to allocate @@ -208,6 +220,55 @@ static int add_client_context(struct ib_ return 0; } +/* read the lengths of pkey,gid tables on each port */ +static int read_port_table_lengths(struct ib_device *device) +{ + struct ib_port_attr *tprops = NULL; + int num_ports, ret = -ENOMEM; + u8 port_index; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + goto out; + + num_ports = end_port(device) - start_port(device) + 1; + + device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len * + num_ports, GFP_KERNEL); + if (!device->pkey_tbl_len) + goto out; + + device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len * + num_ports, GFP_KERNEL); + if (!device->gid_tbl_len) + goto err1; + + for (port_index = 0; port_index < num_ports; ++port_index) { + ret = ib_query_port(device, port_index + start_port(device), + tprops); + if (ret) + goto err2; + device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len; + device->gid_tbl_len[port_index] = tprops->gid_tbl_len; + } + + ret = 0; + goto out; +err2: + kfree(device->gid_tbl_len); +err1: + kfree(device->pkey_tbl_len); +out: + kfree(tprops); + return ret; +} + +static inline void 
free_port_table_lengths(struct ib_device *device) +{ + kfree(device->gid_tbl_len); + kfree(device->pkey_tbl_len); +} + /** * ib_register_device - Register an IB device with IB core * @device:Device to register @@ -239,6 +300,13 @@ int ib_register_device(struct ib_device spin_lock_init(&device->event_handler_lock); spin_lock_init(&device->client_data_lock); + ret = read_port_table_lengths(device); + if (ret) { + printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n", + device->name); + goto out; + } + ret = ib_device_register_sysfs(device); if (ret) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", @@ -284,6 +352,8 @@ void ib_unregister_device(struct ib_devi list_del(&device->core_list); + free_port_table_lengths(device); + mutex_unlock(&device_mutex); spin_lock_irqsave(&device->client_data_lock, flags); @@ -592,6 +662,74 @@ int ib_modify_port(struct ib_device *dev } EXPORT_SYMBOL(ib_modify_port); +/** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index) +{ + union ib_gid tmp_gid; + int ret, port, i, tbl_len; + + for (port = start_port(device); port <= end_port(device); ++port) { + tbl_len = device->gid_tbl_len[port - start_port(device)]; + for (i = 0; i < tbl_len; ++i) { + ret = ib_query_gid(device, port, i, &tmp_gid); + if (ret) + goto out; + if (!memcmp(&tmp_gid, gid, sizeof *gid)) { + *port_num = port; + *index = i; + ret = 0; + goto out; + } + } + } + ret = -ENOENT; +out: + return ret; +} +EXPORT_SYMBOL(ib_find_gid); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. 
+ * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. + */ +int ib_find_pkey(struct ib_device *device, + u8 port_num, u16 pkey, u16 *index) +{ + int ret, i, tbl_len; + u16 tmp_pkey; + + tbl_len = device->pkey_tbl_len[port_num - start_port(device)]; + for (i = 0; i < tbl_len; ++i) { + ret = ib_query_pkey(device, port_num, i, &tmp_pkey); + if (ret) + goto out; + + if (pkey == tmp_pkey) { + *index = i; + ret = 0; + goto out; + } + } + ret = -ENOENT; + +out: + return ret; +} +EXPORT_SYMBOL(ib_find_pkey); + static int __init ib_core_init(void) { int ret; Index: b/include/rdma/ib_verbs.h =================================================================== --- a/include/rdma/ib_verbs.h 2007-05-08 15:45:45.000000000 +0300 +++ b/include/rdma/ib_verbs.h 2007-05-09 11:47:55.006221894 +0300 @@ -1058,6 +1058,8 @@ struct ib_device { __be64 node_guid; u8 node_type; u8 phys_port_cnt; + int *pkey_tbl_len; + int *gid_tbl_len; }; struct ib_client { @@ -1134,6 +1136,29 @@ int ib_modify_port(struct ib_device *dev struct ib_port_modify *port_modify); /** + * ib_find_gid - Returns the port number and GID table index where + * a specified GID value occurs. + * @device: The device to query. + * @gid: The GID value to search for. + * @port_num: The port number of the device where the GID value was found. + * @index: The index into the GID table where the GID was found. This + * parameter may be NULL. + */ +int ib_find_gid(struct ib_device *device, union ib_gid *gid, + u8 *port_num, u16 *index); + +/** + * ib_find_pkey - Returns the PKey table index where a specified + * PKey value occurs. + * @device: The device to query. + * @port_num: The port number of the device to search for the PKey. + * @pkey: The PKey value to search for. + * @index: The index into the PKey table where the PKey was found. 
+ */
+int ib_find_pkey(struct ib_device *device,
+		 u8 port_num, u16 pkey, u16 *index);
+
+/**
  * ib_alloc_pd - Allocates an unused protection domain.
  * @device: The device on which to allocate the protection domain.
  *

From yosefe at voltaire.com Thu May 10 07:38:02 2007
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 10 May 2007 17:38:02 +0300
Subject: [ofa-general] [PATCHv4 for 2.6.22 2/2] ipoib: handle pkey change events
In-Reply-To: <46432D8A.8030007@voltaire.com>
References: <46432D8A.8030007@voltaire.com>
Message-ID: <46432E4A.9080301@voltaire.com>

This issue was found during partitioning & SM failover testing.

* Added a flush flag to ipoib_ib_dev_stop(), like the one ipoib_ib_dev_down()
  already takes.
* Rename the polling thread work to 'pkey_poll_task' to avoid ambiguity.
* Obtain the pkey index prior to entering init_qp, and save it in dev_priv.
* Upon a PKEY_CHANGE event, schedule a work item that restarts the qp.
* Precondition the restart on whether the pkey index has really changed.
  Use the cached pkey_index to test this.
* Restart child interfaces before the parent. They might be up even if the
  parent is down.
* When an interface is restarted, queue delayed initialization, to handle
  the case where a pkey is deleted and later restored.
* Use an uncached pkey query upon qp initialization.

SM reconfiguration or failover may shuffle the values in the port pkey
table. The current implementation queries for the index of the pkey only
once, when it creates the device QP and moves it into working state, and
hence does not address this scenario. Fix this by using the PKEY_CHANGE
event as a trigger to reconfigure the device QP.
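The decision taken on a PKEY_CHANGE event, as described in the list above,
can be sketched in plain C. This is an illustration under stated
assumptions, not the driver code: enum pkey_action, struct mock_priv and
on_pkey_change() are hypothetical names, the table is a plain array instead
of a device query, and the three outcomes stand in for what the patch does
by calling ipoib functions directly (delay the open, leave the QP alone, or
stop and reopen it).

```c
#include <stdint.h>

/* Hypothetical outcome names; the patch calls driver functions instead. */
enum pkey_action {
	PKEY_KEEP_QP,		/* index unchanged: nothing to restart */
	PKEY_RESTART_QP,	/* index moved: stop and reopen the QP */
	PKEY_DELAY_OPEN		/* pkey missing: go down and retry later */
};

/* Hypothetical subset of the per-device state used by the decision. */
struct mock_priv {
	uint16_t pkey;		/* pkey value this interface belongs to */
	uint16_t pkey_index;	/* index cached when the QP was initialized */
	int pkey_assigned;	/* mirrors the IPOIB_PKEY_ASSIGNED flag */
};

/* Re-resolve the pkey against the current table and pick an action. */
static enum pkey_action on_pkey_change(struct mock_priv *priv,
				       const uint16_t *table, int tbl_len)
{
	int i;

	for (i = 0; i < tbl_len; ++i)
		if (table[i] == priv->pkey)
			break;

	if (i == tbl_len) {
		/* pkey was removed from the table */
		priv->pkey_assigned = 0;
		return PKEY_DELAY_OPEN;
	}
	priv->pkey_assigned = 1;

	if ((uint16_t)i == priv->pkey_index)
		return PKEY_KEEP_QP;

	/* remember the new position before restarting */
	priv->pkey_index = (uint16_t)i;
	return PKEY_RESTART_QP;
}
```

Comparing against the cached index keeps an event that did not move this
interface's pkey from tearing down a working QP.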
Signed-off-by: Yosef Etigin --- drivers/infiniband/ulp/ipoib/ipoib.h | 7 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 88 +++++++++++++++++++++++------ drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 41 +++++-------- 4 files changed, 97 insertions(+), 46 deletions(-) Index: b/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-10 08:34:58.335171047 +0300 @@ -202,15 +202,17 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; - struct delayed_work pkey_task; + struct delayed_work pkey_poll_task; struct delayed_work mcast_task; struct work_struct flush_task; struct work_struct restart_task; struct delayed_work ah_reap_task; + struct work_struct pkey_event_task; struct ib_device *ca; u8 port; u16 pkey; + u16 pkey_index; struct ib_pd *pd; struct ib_mr *mr; struct ib_cq *cq; @@ -333,12 +335,13 @@ struct ipoib_dev_priv *ipoib_intf_alloc( int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev, int flush); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_dev_cleanup(struct net_device *dev); Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-10 16:09:38.296297842 
+0300 @@ -413,6 +413,18 @@ int ipoib_ib_dev_open(struct net_device struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. + */ + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) { + ipoib_warn(priv, "pkey 0x%04x nof found\n", priv->pkey); + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return -1; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = ipoib_init_qp(dev); if (ret) { ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret); @@ -422,14 +434,14 @@ int ipoib_ib_dev_open(struct net_device ret = ipoib_ib_post_receives(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } ret = ipoib_cm_dev_open(dev); if (ret) { ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -1; } @@ -481,7 +493,7 @@ int ipoib_ib_dev_down(struct net_device if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { mutex_lock(&pkey_mutex); set_bit(IPOIB_PKEY_STOP, &priv->flags); - cancel_delayed_work(&priv->pkey_task); + cancel_delayed_work(&priv->pkey_poll_task); mutex_unlock(&pkey_mutex); if (flush) flush_workqueue(ipoib_workqueue); @@ -508,7 +520,7 @@ static int recvs_pending(struct net_devi return pending; } -int ipoib_ib_dev_stop(struct net_device *dev) +int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -581,7 +593,8 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); - flush_workqueue(ipoib_workqueue); + if (flush) + flush_workqueue(ipoib_workqueue); begin = jiffies; @@ -622,13 +635,22 @@ int ipoib_ib_dev_init(struct net_device return 0; } -void ipoib_ib_dev_flush(struct work_struct 
*work) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) { - struct ipoib_dev_priv *cpriv, *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; + u16 new_index; - if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags) ) { + mutex_lock(&priv->vlan_mutex); + + /* Flush any child interfaces too - + * they might be up even if the parent is down */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + __ipoib_ib_dev_flush(cpriv, pkey_event); + + mutex_unlock(&priv->vlan_mutex); + + if (!test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) { ipoib_dbg(priv, "Not flushing - IPOIB_FLAG_INITIALIZED not set.\n"); return; } @@ -638,10 +660,32 @@ void ipoib_ib_dev_flush(struct work_stru return; } + if (pkey_event) { + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ipoib_ib_dev_down(dev, 0); + ipoib_pkey_dev_delay_open(dev); + return; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* restart qp only of pkey index is cahnged */ + if (new_index == priv->pkey_index) { + ipoib_dbg(priv, "Not flushing - pkey index not changed.\n"); + return; + } + priv->pkey_index = new_index; + } + ipoib_dbg(priv, "flushing\n"); ipoib_ib_dev_down(dev, 0); + if (pkey_event) { + ipoib_ib_dev_stop(dev, 0); + ipoib_ib_dev_open(dev); + } + /* * The device could have been brought down between the start and when * we get here, don't bring it back up if it's not configured up @@ -650,14 +694,24 @@ void ipoib_ib_dev_flush(struct work_stru ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } +} - mutex_lock(&priv->vlan_mutex); +void ipoib_ib_dev_flush(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task); + + ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 0); +} - /* Flush any child interfaces too */ - 
list_for_each_entry(cpriv, &priv->child_intfs, list) - ipoib_ib_dev_flush(&cpriv->flush_task); +void ipoib_pkey_event(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, pkey_event_task); - mutex_unlock(&priv->vlan_mutex); + ipoib_dbg(priv, "Flushing %s and restarting it's QP\n", priv->dev->name); + __ipoib_ib_dev_flush(priv, 1); } void ipoib_ib_dev_cleanup(struct net_device *dev) @@ -685,7 +739,7 @@ void ipoib_ib_dev_cleanup(struct net_dev void ipoib_pkey_poll(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_task.work); + container_of(work, struct ipoib_dev_priv, pkey_poll_task.work); struct net_device *dev = priv->dev; ipoib_pkey_dev_check_presence(dev); @@ -696,7 +750,7 @@ void ipoib_pkey_poll(struct work_struct mutex_lock(&pkey_mutex); if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); } @@ -715,7 +769,7 @@ int ipoib_pkey_dev_delay_open(struct net mutex_lock(&pkey_mutex); clear_bit(IPOIB_PKEY_STOP, &priv->flags); queue_delayed_work(ipoib_workqueue, - &priv->pkey_task, + &priv->pkey_poll_task, HZ); mutex_unlock(&pkey_mutex); return 1; Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-09 17:21:03.000000000 +0300 @@ -107,7 +107,7 @@ int ipoib_open(struct net_device *dev) return -EINVAL; if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); return -EINVAL; } @@ -152,7 +152,7 @@ static int ipoib_stop(struct net_device flush_workqueue(ipoib_workqueue); ipoib_ib_dev_down(dev, 1); - ipoib_ib_dev_stop(dev); + ipoib_ib_dev_stop(dev, 1); if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; 
@@ -990,7 +990,8 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_DELAYED_WORK(&priv->pkey_task, ipoib_pkey_poll); + INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); + INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-08 15:46:53.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-10 16:08:15.551192622 +0300 @@ -33,8 +33,6 @@ * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ -#include - #include "ipoib.h" int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) @@ -49,7 +47,7 @@ int ipoib_mcast_attach(struct net_device if (!qp_attr) goto out; - if (ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ret = -ENXIO; goto out; @@ -94,26 +92,17 @@ int ipoib_init_qp(struct net_device *dev { struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; - u16 pkey_index; struct ib_qp_attr qp_attr; int attr_mask; - /* - * Search through the port P_Key table for the requested pkey value. - * The port has to be assigned to the respective IB partition in - * advance. 
- */ - ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &pkey_index); - if (ret) { - clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); - return ret; - } - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + /* Make sure we have a valid pkey_index in priv->pkey_index */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + return -1; qp_attr.qp_state = IB_QPS_INIT; qp_attr.qkey = 0; qp_attr.port_num = priv->port; - qp_attr.pkey_index = pkey_index; + qp_attr.pkey_index = priv->pkey_index; attr_mask = IB_QP_QKEY | IB_QP_PORT | @@ -258,15 +247,19 @@ void ipoib_event(struct ib_event_handler { struct ipoib_dev_priv *priv = container_of(handler, struct ipoib_dev_priv, event_handler); - - if ((record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PKEY_CHANGE || - record->event == IB_EVENT_PORT_ACTIVE || - record->event == IB_EVENT_LID_CHANGE || - record->event == IB_EVENT_SM_CHANGE || - record->event == IB_EVENT_CLIENT_REREGISTER) && - record->element.port_num == priv->port) { + + if (record->element.port_num != priv->port) + return; + + if (record->event == IB_EVENT_PORT_ERR || + record->event == IB_EVENT_PORT_ACTIVE || + record->event == IB_EVENT_LID_CHANGE || + record->event == IB_EVENT_SM_CHANGE || + record->event == IB_EVENT_CLIENT_REREGISTER) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); + } else if (record->event == IB_EVENT_PKEY_CHANGE) { + ipoib_dbg(priv, "pkey change event on port:%d\n", priv->port); + queue_work(ipoib_workqueue, &priv->pkey_event_task); } } From mst at dev.mellanox.co.il Thu May 10 07:42:12 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 10 May 2007 17:42:12 +0300 Subject: [ofa-general] Re: [PATCHv4 for 2.6.22 0/2] fix bug #420: ippib handling of pkey change events In-Reply-To: <46432D8A.8030007@voltaire.com> References: <46432D8A.8030007@voltaire.com> Message-ID: <20070510144212.GH22029@mellanox.co.il> > Quoting Yosef Etigin : > Subject: [PATCHv4 for 2.6.22 0/2] fix bug #420: ippib handling of pkey change events > > These two patches fix bug #420: PKey table reordering caused by SM failover stops ipoib traffic > patch 1: add uncached device queries to core > patch 2: restart ipoib qp on pkey change event, and use uncached queries on qp init OK, that's pretty clean. Acked-by: Michael S. Tsirkin -- MST From jsquyres at cisco.com Thu May 10 07:42:01 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 10 May 2007 10:42:01 -0400 Subject: [ofa-general] Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <20070510142826.GE22029@mellanox.co.il> References: <4641EBD0.3000600@Sun.COM> <1178740498.382.97.camel@stevo-desktop> <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> <46430EB2.7080703@voltaire.com> <20070510142826.GE22029@mellanox.co.il> Message-ID: On May 10, 2007, at 10:28 AM, Michael S. Tsirkin wrote: >> What is the advantage of this approach? > > Current ipoib cm uses this approach to simplify the implementation. > Overhead seems insignificant. I think MPI's requirements are a bit different than IPoIB. See Gleb's response. It is not uncommon for MPI apps to have connections open to many peers simultaneously. Registered memory / internal buffering usage is a Big Deal in the MPI / HPC community. -- Jeff Squyres Cisco Systems From mst at dev.mellanox.co.il Thu May 10 07:56:13 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 10 May 2007 17:56:13 +0300 Subject: [ofa-general] Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened In-Reply-To: References: <46422D07.3050600@Sun.COM> <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> <46430EB2.7080703@voltaire.com> <20070510142826.GE22029@mellanox.co.il> Message-ID: <20070510145613.GR13655@mellanox.co.il> > Quoting Jeff Squyres : > Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened > > On May 10, 2007, at 10:28 AM, Michael S. Tsirkin wrote: > > >>What is the advantage of this approach? > > > >Current ipoib cm uses this approach to simplify the implementation. > >Overhead seems insignificant. > > I think MPI's requirements are a bit different than IPoIB. See > Gleb's response. It is not uncommon for MPI apps to have connections > open to many peers simultaneously. You mean, hundreds of QPs between the same pair of hosts? Yes, in this case you might start running out of QPs. > Registered memory / internal > buffering usage is a Big Deal in the MPI / HPC community. I don't see the connection with the # of QPs. They are very cheap in memory. -- MST From rick.jones2 at hp.com Thu May 10 10:01:37 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 10 May 2007 10:01:37 -0700 Subject: [ofa-general] is there an OFED way to putt VPD from an HCA? Message-ID: <46434FF1.9020005@hp.com> Hi - I would like to pull vital product data (e.g. serial number) from an IB HCA which is "driven" via OFED bits. Is there any OFED tool to do that, or do I have to go hunt down something HCA-vendor specific (Mellanox in this case)?
thanks, rick jones From glebn at voltaire.com Thu May 10 10:02:40 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Thu, 10 May 2007 20:02:40 +0300 Subject: [ofa-general] Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened In-Reply-To: <20070510145613.GR13655@mellanox.co.il> References: <1178742259.382.112.camel@stevo-desktop> <46422EA6.3020006@Sun.COM> <1178742819.382.114.camel@stevo-desktop> <464232DC.9010201@Sun.COM> <3EDA4C71-A9F7-4D45-B06D-88080CAC0CDD@cisco.com> <46430EB2.7080703@voltaire.com> <20070510142826.GE22029@mellanox.co.il> <20070510145613.GR13655@mellanox.co.il> Message-ID: <20070510170240.GA32053@minantech.com> On Thu, May 10, 2007 at 05:56:13PM +0300, Michael S. Tsirkin wrote: > > Quoting Jeff Squyres : > > Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened > > > > On May 10, 2007, at 10:28 AM, Michael S. Tsirkin wrote: > > > > >>What is the advantage of this approach? > > > > > >Current ipoib cm uses this approach to simplify the implementation. > > >Overhead seems insignificant. > > > > I think MPI's requirements are a bit different than IPoIB. See > > Gleb's response. It is not uncommon for MPI apps to have connections > > open to many peers simultaneously. > > You mean, hundreds of QPs between the same pair of hosts? > Yes, in this case you might start running out of QPs. Why does it matter whether the QPs are between the same pair of hosts or not? QPs are a global resource, aren't they? > > > Registered memory / internal > > buffering usage is a Big Deal in the MPI / HPC community. > > I don't see the connection with the # of QPs. > They are very cheap in memory. > 4K is cheap? -- Gleb. From sweitzen at cisco.com Thu May 10 10:59:27 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 10 May 2007 10:59:27 -0700 Subject: [ofa-general] is there an OFED way to putt VPD from an HCA?
In-Reply-To: <46434FF1.9020005@hp.com> References: <46434FF1.9020005@hp.com> Message-ID: For Mellanox HCAs, you can use mstflint (Mellanox util) or tvflash (Cisco util). For example, from a Cisco-produced Mellanox HCA (YMMV): [root at svbu-qa1850-1 ~]# mstflint -d mthca0 q Image type: Failsafe FW Version: 1.2.0 I.S. Version: 1 Device ID: 25204 Chip Revision: A0 GUID Des: Node Port1 Sys image GUIDs: 0002c90200218140 0002c90200218141 0005ad000100d050 Board ID: É,­ VSD: É,­ PSID: [root at svbu-qa1850-1 ~]# tvflash -i HCA #0: MT25204, Cheetah DDR, revision 20 Primary image is v1.2.000 build 3.2.0.140, with label 'HCA.Cheetah-DDR.20' Secondary image is v1.2.000 build 3.2.0.139, with label 'HCA.Cheetah-DDR.20' Vital Product Data Product Name: Cheetah DDR P/N: MHGS18-XTC E/C: A1 S/N: MT0612X01178 Freq/Power: PCIe x8 Checksum: Ok Date Code: N/A Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Rick Jones > Sent: Thursday, May 10, 2007 10:02 AM > To: general at lists.openfabrics.org > Subject: [ofa-general] is there an OFED way to putt VPD from an HCA? > > Hi - > > I would like to pull vital product data (eg serial number) > from an IB HCA which > is "driven" via OFED bits. Is there any OFED tool to do that > or do I have to go > hunt-down something HCA-vendor specific (mellanox in this case)? 
> > thanks, > > rick jones > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Thu May 10 11:14:07 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 May 2007 11:14:07 -0700 Subject: [ofa-general] Re: [PATCHv4 for 2.6.22 0/2] fix bug #420: ippib handling of pkey change events In-Reply-To: <20070510144212.GH22029@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 10 May 2007 17:42:12 +0300") References: <46432D8A.8030007@voltaire.com> <20070510144212.GH22029@mellanox.co.il> Message-ID: thanks ... I haven't had a chance to follow all the discussion while I'm traveling this week, but I'll deal with this next week. From pradeeps at linux.vnet.ibm.com Thu May 10 11:18:44 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 10 May 2007 11:18:44 -0700 Subject: [ofa-general] is there an OFED way to putt VPD from an HCA? In-Reply-To: References: <46434FF1.9020005@hp.com> Message-ID: <46436204.40600@linux.vnet.ibm.com> Scott Weitzenkamp (sweitzen) wrote: > For Mellanox HCAs, you can use mstflint (Mellanox util) or tvflash (Cisco util). For example, from a Cisco-produced Mellanox HCA (YMMV): > > [root at svbu-qa1850-1 ~]# mstflint -d mthca0 q > Image type: Failsafe > FW Version: 1.2.0 > I.S. 
Version: 1 > Device ID: 25204 > Chip Revision: A0 > GUID Des: Node Port1 Sys image > GUIDs: 0002c90200218140 0002c90200218141 0005ad000100d050 > Board ID: É,­ > VSD: É,­ > PSID: > [root at svbu-qa1850-1 ~]# tvflash -i > HCA #0: MT25204, Cheetah DDR, revision 20 > Primary image is v1.2.000 build 3.2.0.140, with label 'HCA.Cheetah-DDR.20' > Secondary image is v1.2.000 build 3.2.0.139, with label 'HCA.Cheetah-DDR.20' > > Vital Product Data > Product Name: Cheetah DDR > P/N: MHGS18-XTC > E/C: A1 > S/N: MT0612X01178 > Freq/Power: PCIe x8 > Checksum: Ok > Date Code: N/A > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > > >> -----Original Message----- >> From: general-bounces at lists.openfabrics.org >> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Rick Jones >> Sent: Thursday, May 10, 2007 10:02 AM >> To: general at lists.openfabrics.org >> Subject: [ofa-general] is there an OFED way to putt VPD from an HCA? >> >> Hi - >> >> I would like to pull vital product data (eg serial number) >> from an IB HCA which >> is "driven" via OFED bits. Is there any OFED tool to do that >> or do I have to go >> hunt-down something HCA-vendor specific (mellanox in this case)? >> >> thanks, >> >> rick jones There is also some similar information available in the /sys/class/infiniband directory (without using Mellanox or Cisco tools). 
For example, on a p5 (ppc64) system I get the following: [root at elm3b37 mthca0]# pwd /sys/class/infiniband/mthca0 [root at elm3b37 mthca0]# ls -l total 0 -r--r--r-- 1 root root 4096 May 10 14:10 board_id lrwxrwxrwx 1 root root 0 May 8 20:19 device -> ../../../devices/pci0002:00/0002:00:02.6/0002:d8:01.0/0002:d9:00.0 -r--r--r-- 1 root root 4096 May 10 14:10 fw_ver -r--r--r-- 1 root root 4096 May 10 14:10 hca_type -r--r--r-- 1 root root 4096 May 10 14:10 hw_rev -rw-r--r-- 1 root root 4096 May 10 14:10 node_desc -r--r--r-- 1 root root 4096 May 10 14:10 node_guid -r--r--r-- 1 root root 4096 May 10 14:10 node_type drwxr-xr-x 4 root root 0 May 8 20:19 ports lrwxrwxrwx 1 root root 0 May 10 14:10 subsystem -> ../../../class/infiniband -r--r--r-- 1 root root 4096 May 10 14:10 sys_image_guid --w------- 1 root root 4096 May 10 14:10 uevent [root at elm3b37 mthca0]# cat board_id MT_0030000001 [root at elm3b37 mthca0]# cat fw_ver 3.5.0 [root at elm3b37 mthca0]# cat hca_type MT23108 [root at elm3b37 mthca0]# cat hw_rev a1 [root at elm3b37 mthca0]# cat node_desc MT23108 InfiniHost Mellanox Technologies [root at elm3b37 mthca0]# cat node_guid 0005:ad00:0003:0564 [root at elm3b37 mthca0]# cat node_type 1: CA [root at elm3b37 mthca0]# cat sys_image_guid 0005:ad00:0003:0567 [root at elm3b37 mthca0]# cat uevent Pradeep From raleigh at systemfabricworks.com Thu May 10 11:20:37 2007 From: raleigh at systemfabricworks.com (Raleigh F Rinehart) Date: Thu, 10 May 2007 13:20:37 -0500 Subject: [ofa-general] is there an OFED way to putt VPD from an HCA? In-Reply-To: References: <46434FF1.9020005@hp.com> Message-ID: <46436275.80406@systemfabricworks.com> Just FYI: for older versions of mstflint, such as the one I have (ofed_1_1), specifying the device name 'mthcaX' doesn't work; one has to specify the PCI ID, i.e. [root at tiger ~]> mstflint -d 07:00.0 q Image type: Failsafe I.S.
Version: 1 Chip Revision: A1 GUID Des: Node Port1 Port2 Sys image GUIDs: 0002c901078ce000 0002c901078ce001 0002c901078ce002 0002c901078ce000 Board ID: (MT_0000000001) VSD: PSID: MT_0000000001 or alternatively [root at tiger ~]> mstflint -d /proc/bus/pci/07/00.0 q ... -raleigh Scott Weitzenkamp (sweitzen) wrote: > For Mellanox HCAs, you can use mstflint (Mellanox util) or tvflash (Cisco util). For example, from a Cisco-produced Mellanox HCA (YMMV): > > [root at svbu-qa1850-1 ~]# mstflint -d mthca0 q > Image type: Failsafe > FW Version: 1.2.0 > I.S. Version: 1 > Device ID: 25204 > Chip Revision: A0 > GUID Des: Node Port1 Sys image > GUIDs: 0002c90200218140 0002c90200218141 0005ad000100d050 > Board ID: É,­ > VSD: É,­ > PSID: > [root at svbu-qa1850-1 ~]# tvflash -i > HCA #0: MT25204, Cheetah DDR, revision 20 > Primary image is v1.2.000 build 3.2.0.140, with label 'HCA.Cheetah-DDR.20' > Secondary image is v1.2.000 build 3.2.0.139, with label 'HCA.Cheetah-DDR.20' > > Vital Product Data > Product Name: Cheetah DDR > P/N: MHGS18-XTC > E/C: A1 > S/N: MT0612X01178 > Freq/Power: PCIe x8 > Checksum: Ok > Date Code: N/A > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > > > >> -----Original Message----- >> From: general-bounces at lists.openfabrics.org >> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Rick Jones >> Sent: Thursday, May 10, 2007 10:02 AM >> To: general at lists.openfabrics.org >> Subject: [ofa-general] is there an OFED way to putt VPD from an HCA? >> >> Hi - >> >> I would like to pull vital product data (eg serial number) >> from an IB HCA which >> is "driven" via OFED bits. Is there any OFED tool to do that >> or do I have to go >> hunt-down something HCA-vendor specific (mellanox in this case)? 
>> >> thanks, >> >> rick jones From rdreier at cisco.com Thu May 10 11:27:56 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 May 2007 11:27:56 -0700 Subject: [ofa-general] Re: [HOWTO] accessing the DMA mapped data In-Reply-To: <464304DD.9070600@cdac.in> (Mahesh's message of "Thu, 10 May 2007 17:11:17 +0530") References: <4642D095.6070305@cdac.in> <20070510.014751.74749340.davem@daveml oft.net> <4642FFA7.4040303@cdac.in> <20070510.043233.08323827.davem@davemloft.net> <464304DD.9070600@cdac.in> Message-ID: > Here I am dealing with a infiniband (see www.openfabrics.org) > network device driver. The layer above the driver is the standard > infiniband core interface. Now I have a situation where I need to > peek into the packets and do some modifications(some hacking). So I > just want know whether I can access the original data region using > the bus address generated by the dma_map_single. If you give more details about what you're trying to do, maybe I can suggest a good way to accomplish it. Also for IB-related stuff you may want to at least CC to get the best info.
From arthur.jones at qlogic.com Thu May 10 12:10:49 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Thu, 10 May 2007 12:10:49 -0700 Subject: [ofa-general] [PATCH take2] IB/ipath -- shadow the gpio_mask register Message-ID: <20070510191047.6876.80760.stgit@bauxite.internal.keyresearch.com> Once upon a time, GPIO interrupts were rare. But then a chip bug in the waldo series forced the use of a GPIO interrupt to signal packet reception. This greatly increased the frequency of GPIO interrupts which have the gpio_mask bits set on the waldo chips. Other bits in the gpio_status register are used for I2C clock and data lines, these bits are usually on. An "unlikely" annotation leftover from the old days was improperly applied to these bits, and an unnecessary chip mmio read was being accessed in the interrupt fast path on waldo. Remove the stagnant unlikely annotation in the interrupt handler and keep a shadow copy of the gpio_mask register to avoid the slow mmio read when testing for interruptable GPIO bits. 
Signed-off-by: Arthur Jones --- drivers/infiniband/hw/ipath/ipath_iba6120.c | 7 +++---- drivers/infiniband/hw/ipath/ipath_intr.c | 7 +++---- drivers/infiniband/hw/ipath/ipath_kernel.h | 2 ++ drivers/infiniband/hw/ipath/ipath_verbs.c | 12 ++++++------ 4 files changed, 14 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c index fb58154..c21d99b 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c @@ -747,7 +747,6 @@ static void ipath_pe_quiet_serdes(struct ipath_devdata *dd) static int ipath_pe_intconfig(struct ipath_devdata *dd) { - u64 val; u32 chiprev; /* @@ -760,9 +759,9 @@ static int ipath_pe_intconfig(struct ipath_devdata *dd) if ((chiprev & INFINIPATH_R_CHIPREVMINOR_MASK) > 1) { /* Rev2+ reports extra errors via internal GPIO pins */ dd->ipath_flags |= IPATH_GPIO_ERRINTRS; - val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask); - val |= IPATH_GPIO_ERRINTR_MASK; - ipath_write_kreg( dd, dd->ipath_kregs->kr_gpio_mask, val); + dd->ipath_gpio_mask |= IPATH_GPIO_ERRINTR_MASK; + ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, + dd->ipath_gpio_mask); } return 0; } diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 45d0331..a90d3b5 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -1056,7 +1056,7 @@ irqreturn_t ipath_intr(int irq, void *data) gpiostatus &= ~(1 << IPATH_GPIO_PORT0_BIT); chk0rcv = 1; } - if (unlikely(gpiostatus)) { + if (gpiostatus) { /* * Some unexpected bits remain. If they could have * caused the interrupt, complain and clear. @@ -1065,9 +1065,8 @@ irqreturn_t ipath_intr(int irq, void *data) * GPIO interrupts, possibly on a "three strikes" * basis. 
*/ - u32 mask; - mask = ipath_read_kreg32( - dd, dd->ipath_kregs->kr_gpio_mask); + const u32 mask = (u32) dd->ipath_gpio_mask; + if (mask & gpiostatus) { ipath_dbg("Unexpected GPIO IRQ bits %x\n", gpiostatus & mask); diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index e900c25..12194f3 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -397,6 +397,8 @@ struct ipath_devdata { unsigned long ipath_pioavailshadow[8]; /* shadow of kr_gpio_out, for rmw ops */ u64 ipath_gpio_out; + /* shadow the gpio mask register */ + u64 ipath_gpio_mask; /* kr_revision shadow */ u64 ipath_revision; /* diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 12933e7..bb70845 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -1387,13 +1387,12 @@ static int enable_timer(struct ipath_devdata *dd) * processing. 
*/ if (dd->ipath_flags & IPATH_GPIO_INTR) { - u64 val; ipath_write_kreg(dd, dd->ipath_kregs->kr_debugportselect, 0x2074076542310ULL); /* Enable GPIO bit 2 interrupt */ - val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask); - val |= (u64) (1 << IPATH_GPIO_PORT0_BIT); - ipath_write_kreg( dd, dd->ipath_kregs->kr_gpio_mask, val); + dd->ipath_gpio_mask |= (u64) (1 << IPATH_GPIO_PORT0_BIT); + ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, + dd->ipath_gpio_mask); } init_timer(&dd->verbs_timer); @@ -1412,8 +1411,9 @@ static int disable_timer(struct ipath_devdata *dd) u64 val; /* Disable GPIO bit 2 interrupt */ val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask); - val &= ~((u64) (1 << IPATH_GPIO_PORT0_BIT)); - ipath_write_kreg( dd, dd->ipath_kregs->kr_gpio_mask, val); + dd->ipath_gpio_mask &= ~((u64) (1 << IPATH_GPIO_PORT0_BIT)); + ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, + dd->ipath_gpio_mask); /* * We might want to undo changes to debugportselect, * but how? From sean.hefty at intel.com Thu May 10 12:39:28 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 10 May 2007 12:39:28 -0700 Subject: [ofa-general] RFC: location for IB CM statistics Message-ID: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com> I'd like to start adding some statistical information to the IB CM to help identify scalability or connectivity issues. Some example statistics that I would like to expose now are number of retried MADs, unmatched requests, total number of connections, etc. Longer term, additional statistics and information on each connection could be added. I'm looking for ideas on the best way to expose this sort of data. Any thoughts? 
- Sean From pradeeps at linux.vnet.ibm.com Thu May 10 13:54:05 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 10 May 2007 13:54:05 -0700 Subject: [ofa-general] RFC: location for IB CM statistics In-Reply-To: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com> References: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com> Message-ID: <4643866D.9010203@linux.vnet.ibm.com> Sean Hefty wrote: > I'd like to start adding some statistical information to the IB CM to help > identify scalability or connectivity issues. Some example statistics that I > would like to expose now are number of retried MADs, unmatched requests, total > number of connections, etc. Longer term, additional statistics and information > on each connection could be added. > > I'm looking for ideas on the best way to expose this sort of data. Any > thoughts? > > - Sean This is a great idea and would be a big help in debugging problems and identifying performance issues. How about an approach similar to /sys/class/net/, where stats for the various devices are exposed, plus a utility akin to netstat that consolidates them: would that be appealing? Pradeep From sashak at voltaire.com Thu May 10 14:14:36 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 11 May 2007 00:14:36 +0300 Subject: [ofa-general] [PATCH 0/3 v2] opensm: osm_port_t structure simplification. Message-ID: <11788316803259-git-send-email-sashak@voltaire.com> Hi Hal, This simplifies the osm_port_t structure and related API functions - the main idea is to not use a table of physical port pointers duplicated from osm_node_t, but only one direct pointer to the appropriate physical port (osm_physp_t). The changes in the patch series are slightly reordered relative to the original version, so that each patch "works" (in the original version, patch 2/3 breaks things if applied separately without 3/3).
Sasha From sashak at voltaire.com Thu May 10 14:14:37 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 11 May 2007 00:14:37 +0300 Subject: [ofa-general] [PATCH 1/4 v2] opensm: remove osm_port_get_num_physp() function In-Reply-To: <11788316803259-git-send-email-sashak@voltaire.com> References: <11788316803259-git-send-email-sashak@voltaire.com> Message-ID: <11788316803779-git-send-email-sashak@voltaire.com> This removes osm_port_get_num_physp() function and instead uses native node oriented osm_node_get_num_physp(). Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_port.h | 29 ----------------------------- osm/opensm/osm_drop_mgr.c | 2 +- osm/opensm/osm_link_mgr.c | 2 +- osm/opensm/osm_qos.c | 2 +- osm/opensm/osm_sa_link_record.c | 8 ++++---- osm/opensm/osm_sa_pkey_record.c | 6 +++--- osm/opensm/osm_sa_portinfo_record.c | 2 +- osm/opensm/osm_sa_slvl_record.c | 2 +- osm/opensm/osm_sa_vlarb_record.c | 6 +++--- osm/opensm/osm_state_mgr.c | 2 +- osm/opensm/osm_trap_rcv.c | 2 +- 11 files changed, 17 insertions(+), 46 deletions(-) diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h index 6d51d2b..134012c 100644 --- a/osm/include/opensm/osm_port.h +++ b/osm/include/opensm/osm_port.h @@ -1467,35 +1467,6 @@ osm_port_get_guid( * Port *********/ -/****f* OpenSM: Port/osm_port_get_num_physp -* NAME -* osm_port_get_num_physp -* -* DESCRIPTION -* Returns the number of Physical Port objects associated with this port. -* -* SYNOPSIS -*/ -static inline uint8_t -osm_port_get_num_physp( - IN const osm_port_t* const p_port ) -{ - return( p_port->physp_tbl_size ); -} -/* -* PARAMETERS -* p_port -* [in] Pointer to a Port object. -* -* RETURN VALUE -* Returns the number of Physical Port objects associated with this port. 
-* -* NOTES -* -* SEE ALSO -* Port -*********/ - /****f* OpenSM: Port/osm_port_get_phys_ptr * NAME * osm_port_get_phys_ptr diff --git a/osm/opensm/osm_drop_mgr.c b/osm/opensm/osm_drop_mgr.c index 0d08ff6..d091347 100644 --- a/osm/opensm/osm_drop_mgr.c +++ b/osm/opensm/osm_drop_mgr.c @@ -237,7 +237,7 @@ __osm_drop_mgr_remove_port( Re-initialize each Physical Port. */ - num_physp = osm_port_get_num_physp( p_port ); + num_physp = osm_node_get_num_physp( p_port->p_node ); for( port_num = 0; port_num < num_physp; port_num++ ) { p_physp = osm_port_get_phys_ptr( p_port, (uint8_t)port_num ); diff --git a/osm/opensm/osm_link_mgr.c b/osm/opensm/osm_link_mgr.c index a1081bd..71c0495 100644 --- a/osm/opensm/osm_link_mgr.c +++ b/osm/opensm/osm_link_mgr.c @@ -426,7 +426,7 @@ __osm_link_mgr_process_port( with this Port. Start iterating with port 1, since the linkstate is not applicable to the management port on switches. */ - num_physp = osm_port_get_num_physp( p_port ); + num_physp = osm_node_get_num_physp( p_port->p_node ); for( i = 0; i < num_physp; i ++ ) { /* diff --git a/osm/opensm/osm_qos.c b/osm/opensm/osm_qos.c index e71c053..11beaae 100644 --- a/osm/opensm/osm_qos.c +++ b/osm/opensm/osm_qos.c @@ -334,7 +334,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) p_node = p_port->p_node; if (p_node->sw) { - num_physp = osm_port_get_num_physp(p_port); + num_physp = osm_node_get_num_physp(p_node); for (i = 1; i < num_physp; i++) { p_physp = osm_port_get_phys_ptr(p_port, i); if (!p_physp || !osm_physp_is_valid(p_physp)) diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c index 169e75e..18f655c 100644 --- a/osm/opensm/osm_sa_link_record.c +++ b/osm/opensm/osm_sa_link_record.c @@ -346,8 +346,8 @@ __osm_lr_rcv_get_port_links( that do not actually connect. Don't bother screening for that here. 
*/ - num_ports = osm_port_get_num_physp( p_src_port ); - dest_num_ports = osm_port_get_num_physp( p_dest_port ); + num_ports = osm_node_get_num_physp( p_src_port->p_node ); + dest_num_ports = osm_node_get_num_physp( p_dest_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); @@ -385,7 +385,7 @@ __osm_lr_rcv_get_port_links( } else { - num_ports = osm_port_get_num_physp( p_src_port ); + num_ports = osm_node_get_num_physp( p_src_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); @@ -421,7 +421,7 @@ __osm_lr_rcv_get_port_links( } else { - num_ports = osm_port_get_num_physp( p_dest_port ); + num_ports = osm_node_get_num_physp( p_dest_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { p_dest_physp = osm_port_get_phys_ptr( diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c index 5eb15df..0a199f1 100644 --- a/osm/opensm/osm_sa_pkey_record.c +++ b/osm/opensm/osm_sa_pkey_record.c @@ -249,7 +249,7 @@ __osm_sa_pkey_by_comp_mask( if( comp_mask & IB_PKEY_COMPMASK_PORT ) { - if (port_num < osm_port_get_num_physp( p_port )) + if (port_num < osm_node_get_num_physp( p_port->p_node )) { p_physp = osm_port_get_phys_ptr( p_port, port_num ); /* Check that the p_physp is valid, and that is shares a pkey @@ -263,13 +263,13 @@ __osm_sa_pkey_by_comp_mask( osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_sa_pkey_by_comp_mask: ERR 4603: " "Given Physical Port Number: 0x%X is out of range should be < 0x%X\n", - port_num, osm_port_get_num_physp( p_port )); + port_num, osm_node_get_num_physp( p_port->p_node )); goto Exit; } } else { - num_ports = osm_port_get_num_physp( p_port ); + num_ports = osm_node_get_num_physp( p_port->p_node ); for( port_num = 0; port_num < num_ports; port_num++ ) { p_physp = osm_port_get_phys_ptr( p_port, port_num ); diff --git a/osm/opensm/osm_sa_portinfo_record.c 
b/osm/opensm/osm_sa_portinfo_record.c index 5d9b1b2..9d4f18e 100644 --- a/osm/opensm/osm_sa_portinfo_record.c +++ b/osm/opensm/osm_sa_portinfo_record.c @@ -538,7 +538,7 @@ __osm_sa_pir_by_comp_mask( comp_mask = p_ctxt->comp_mask; p_req_physp = p_ctxt->p_req_physp; - num_ports = osm_port_get_num_physp( p_port ); + num_ports = osm_node_get_num_physp( p_port->p_node ); if( comp_mask & IB_PIR_COMPMASK_PORTNUM ) { diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c index d831ffd..3c4ff02 100644 --- a/osm/opensm/osm_sa_slvl_record.c +++ b/osm/opensm/osm_sa_slvl_record.c @@ -213,7 +213,7 @@ __osm_sa_slvl_by_comp_mask( p_rcvd_rec = p_ctxt->p_rcvd_rec; comp_mask = p_ctxt->comp_mask; - num_ports = osm_port_get_num_physp( p_port ); + num_ports = osm_node_get_num_physp( p_port->p_node ); in_port_start = 0; in_port_end = num_ports; out_port_start = 0; diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c index f0ff957..6df5ed9 100644 --- a/osm/opensm/osm_sa_vlarb_record.c +++ b/osm/opensm/osm_sa_vlarb_record.c @@ -253,7 +253,7 @@ __osm_sa_vl_arb_by_comp_mask( if( comp_mask & IB_VLA_COMPMASK_OUT_PORT ) { - if (port_num < osm_port_get_num_physp( p_port )) + if (port_num < osm_node_get_num_physp( p_port->p_node )) { p_physp = osm_port_get_phys_ptr( p_port, port_num ); /* check that the p_physp is valid, and that the requester @@ -267,13 +267,13 @@ __osm_sa_vl_arb_by_comp_mask( osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_sa_vl_arb_by_comp_mask: ERR 2A03: " "Given Physical Port Number: 0x%X is out of range should be < 0x%X\n", - port_num, osm_port_get_num_physp( p_port ) ); + port_num, osm_node_get_num_physp( p_port->p_node ) ); goto Exit; } } else { - num_ports = osm_port_get_num_physp( p_port ); + num_ports = osm_node_get_num_physp( p_port->p_node ); for( port_num = 0; port_num < num_ports; port_num++ ) { p_physp = osm_port_get_phys_ptr( p_port, port_num ); diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c 
index ddec10c..6f53e60 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -1284,7 +1284,7 @@ __osm_state_mgr_report( else start_port = 1; - num_ports = osm_port_get_num_physp( p_port ); + num_ports = osm_node_get_num_physp( p_node ); for( port_num = start_port; port_num < num_ports; port_num++ ) { p_physp = osm_port_get_phys_ptr( p_port, port_num ); diff --git a/osm/opensm/osm_trap_rcv.c b/osm/opensm/osm_trap_rcv.c index 0858968..ed507b6 100644 --- a/osm/opensm/osm_trap_rcv.c +++ b/osm/opensm/osm_trap_rcv.c @@ -108,7 +108,7 @@ __get_physp_by_lid_and_num( if (! p_port) return NULL; - if (osm_port_get_num_physp(p_port) < num) + if (osm_node_get_num_physp(p_port->p_node) < num) return NULL; return( osm_port_get_phys_ptr(p_port, num) ); -- 1.5.2.rc2.20.gac2a From sashak at voltaire.com Thu May 10 14:14:38 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 11 May 2007 00:14:38 +0300 Subject: [ofa-general] [PATCH 2/4 v2] opensm: remove osm_port_get_phys_ptr() In-Reply-To: <11788316803259-git-send-email-sashak@voltaire.com> References: <11788316803259-git-send-email-sashak@voltaire.com> Message-ID: <11788316803195-git-send-email-sashak@voltaire.com> Function osm_port_get_phys_ptr() returns pointer to physical port by its number - and this should be node and not port related routine. So this patch replaces osm_port_get_phys_ptr() by osm_node_get_physp_ptr(). 
Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_port.h | 36 ----------------------------------- osm/opensm/osm_drop_mgr.c | 2 +- osm/opensm/osm_link_mgr.c | 2 +- osm/opensm/osm_port_info_rcv.c | 10 ++++---- osm/opensm/osm_qos.c | 2 +- osm/opensm/osm_sa_link_record.c | 18 ++++++++-------- osm/opensm/osm_sa_pkey_record.c | 4 +- osm/opensm/osm_sa_portinfo_record.c | 4 +- osm/opensm/osm_sa_slvl_record.c | 6 ++-- osm/opensm/osm_sa_vlarb_record.c | 4 +- osm/opensm/osm_state_mgr.c | 2 +- osm/opensm/osm_subnet.c | 4 +- osm/opensm/osm_trap_rcv.c | 2 +- 13 files changed, 30 insertions(+), 66 deletions(-) diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h index 134012c..feebf63 100644 --- a/osm/include/opensm/osm_port.h +++ b/osm/include/opensm/osm_port.h @@ -1467,42 +1467,6 @@ osm_port_get_guid( * Port *********/ -/****f* OpenSM: Port/osm_port_get_phys_ptr -* NAME -* osm_port_get_phys_ptr -* -* DESCRIPTION -* Gets the pointer to the specified Physical Port object. -* -* SYNOPSIS -*/ -static inline osm_physp_t* -osm_port_get_phys_ptr( - IN const osm_port_t* const p_port, - IN const uint8_t port_num ) -{ - CL_ASSERT( port_num < p_port->physp_tbl_size ); - return( p_port->tbl[port_num] ); -} -/* -* PARAMETERS -* p_port -* [in] Pointer to a Port object. -* -* port_num -* [in] Number of physical port for which to return the -* osm_physp_t object. If this port is on an HCA, then -* this value is ignored. -* -* RETURN VALUE -* Pointer to the Physical Port object. 
-* -* NOTES -* -* SEE ALSO -* Port -*********/ - /****f* OpenSM: Port/osm_port_get_default_phys_ptr * NAME * osm_port_get_default_phys_ptr diff --git a/osm/opensm/osm_drop_mgr.c b/osm/opensm/osm_drop_mgr.c index d091347..97a95c2 100644 --- a/osm/opensm/osm_drop_mgr.c +++ b/osm/opensm/osm_drop_mgr.c @@ -240,7 +240,7 @@ __osm_drop_mgr_remove_port( num_physp = osm_node_get_num_physp( p_port->p_node ); for( port_num = 0; port_num < num_physp; port_num++ ) { - p_physp = osm_port_get_phys_ptr( p_port, (uint8_t)port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)port_num ); if( p_physp ) { diff --git a/osm/opensm/osm_link_mgr.c b/osm/opensm/osm_link_mgr.c index 71c0495..a38d179 100644 --- a/osm/opensm/osm_link_mgr.c +++ b/osm/opensm/osm_link_mgr.c @@ -434,7 +434,7 @@ __osm_link_mgr_process_port( or if the state of the port is already better then the specified state. */ - p_physp = osm_port_get_phys_ptr( p_port, (uint8_t)i ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)i ); if( p_physp && osm_physp_is_valid( p_physp ) ) { current_state = osm_physp_get_port_state( p_physp ); diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c index 9bd75b5..5d9c5c7 100644 --- a/osm/opensm/osm_port_info_rcv.c +++ b/osm/opensm/osm_port_info_rcv.c @@ -555,13 +555,13 @@ osm_pi_rcv_process_set( p_context = osm_madw_get_pi_context_ptr( p_madw ); - p_physp = osm_port_get_phys_ptr( p_port, port_num ); - CL_ASSERT( p_physp ); - CL_ASSERT( osm_physp_is_valid( p_physp ) ); + p_node = p_port->p_node; + CL_ASSERT( p_node ); + + p_physp = osm_node_get_physp_ptr( p_node, port_num ); + CL_ASSERT( p_physp && osm_physp_is_valid( p_physp ) ); port_guid = osm_physp_get_port_guid( p_physp ); - p_node = osm_port_get_parent_node( p_port ); - CL_ASSERT( p_node ); p_smp = osm_madw_get_smp_ptr( p_madw ); p_pi = (ib_port_info_t*)ib_smp_get_payload_ptr( p_smp ); diff --git a/osm/opensm/osm_qos.c b/osm/opensm/osm_qos.c index 11beaae..1255169 100644 --- 
a/osm/opensm/osm_qos.c +++ b/osm/opensm/osm_qos.c @@ -336,7 +336,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) if (p_node->sw) { num_physp = osm_node_get_num_physp(p_node); for (i = 1; i < num_physp; i++) { - p_physp = osm_port_get_phys_ptr(p_port, i); + p_physp = osm_node_get_physp_ptr(p_node, i); if (!p_physp || !osm_physp_is_valid(p_physp)) continue; status = diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c index 18f655c..c6b7a7c 100644 --- a/osm/opensm/osm_sa_link_record.c +++ b/osm/opensm/osm_sa_link_record.c @@ -350,12 +350,12 @@ __osm_lr_rcv_get_port_links( dest_num_ports = osm_node_get_num_physp( p_dest_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { - p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); + p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num ); for( dest_port_num = 1; dest_port_num < dest_num_ports; dest_port_num++ ) { - p_dest_physp = osm_port_get_phys_ptr( p_dest_port, - dest_port_num ); + p_dest_physp = osm_node_get_physp_ptr( p_dest_port->p_node, + dest_port_num ); /* both physical ports should be with data */ if (p_src_physp && p_dest_physp) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, @@ -376,7 +376,7 @@ __osm_lr_rcv_get_port_links( this couldn't be a relevant record. 
*/ if (port_num < p_src_port->physp_tbl_size) { - p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); + p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num ); if (p_src_physp) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, NULL, comp_mask, p_list, @@ -388,7 +388,7 @@ __osm_lr_rcv_get_port_links( num_ports = osm_node_get_num_physp( p_src_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { - p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); + p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num ); if (p_src_physp) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, NULL, comp_mask, p_list, @@ -411,8 +411,8 @@ __osm_lr_rcv_get_port_links( this couldn't be a relevant record. */ if (port_num < p_dest_port->physp_tbl_size ) { - p_dest_physp = osm_port_get_phys_ptr( - p_dest_port, port_num ); + p_dest_physp = osm_node_get_physp_ptr( + p_dest_port->p_node, port_num ); if (p_dest_physp) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL, p_dest_physp, comp_mask, @@ -424,8 +424,8 @@ __osm_lr_rcv_get_port_links( num_ports = osm_node_get_num_physp( p_dest_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { - p_dest_physp = osm_port_get_phys_ptr( - p_dest_port, port_num ); + p_dest_physp = osm_node_get_physp_ptr( + p_dest_port->p_node, port_num ); if (p_dest_physp) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL, p_dest_physp, comp_mask, diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c index 0a199f1..a943fe0 100644 --- a/osm/opensm/osm_sa_pkey_record.c +++ b/osm/opensm/osm_sa_pkey_record.c @@ -251,7 +251,7 @@ __osm_sa_pkey_by_comp_mask( { if (port_num < osm_node_get_num_physp( p_port->p_node )) { - p_physp = osm_port_get_phys_ptr( p_port, port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); /* Check that the p_physp is valid, and that is shares a pkey with the p_req_physp. 
*/ if( p_physp && osm_physp_is_valid( p_physp ) && @@ -272,7 +272,7 @@ __osm_sa_pkey_by_comp_mask( num_ports = osm_node_get_num_physp( p_port->p_node ); for( port_num = 0; port_num < num_ports; port_num++ ) { - p_physp = osm_port_get_phys_ptr( p_port, port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); if( p_physp == NULL ) continue; diff --git a/osm/opensm/osm_sa_portinfo_record.c b/osm/opensm/osm_sa_portinfo_record.c index 9d4f18e..74f53d6 100644 --- a/osm/opensm/osm_sa_portinfo_record.c +++ b/osm/opensm/osm_sa_portinfo_record.c @@ -544,7 +544,7 @@ __osm_sa_pir_by_comp_mask( { if (p_rcvd_rec->port_num < num_ports) { - p_physp = osm_port_get_phys_ptr( p_port, p_rcvd_rec->port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, p_rcvd_rec->port_num ); /* Check that the p_physp is valid, and that the p_physp and the p_req_physp share a pkey. */ if( p_physp && osm_physp_is_valid( p_physp ) && @@ -556,7 +556,7 @@ __osm_sa_pir_by_comp_mask( { for( port_num = 0; port_num < num_ports; port_num++ ) { - p_physp = osm_port_get_phys_ptr( p_port, port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); if( p_physp == NULL ) continue; diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c index 3c4ff02..2f250d9 100644 --- a/osm/opensm/osm_sa_slvl_record.c +++ b/osm/opensm/osm_sa_slvl_record.c @@ -226,7 +226,7 @@ __osm_sa_slvl_by_comp_mask( "__osm_sa_slvl_by_comp_mask: " "Using Physical Default Port Number: 0x%X (for End Node)\n", p_port->default_port_num ); - p_out_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num ); + p_out_physp = osm_port_get_default_phys_ptr( p_port ); /* check that the p_out_physp and the p_req_physp share a pkey */ if (osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_out_physp )) __osm_sa_slvl_create( p_rcv, p_out_physp, p_ctxt, 0 ); @@ -243,7 +243,7 @@ __osm_sa_slvl_by_comp_mask( } for( out_port_num = out_port_start; out_port_num <= out_port_end; out_port_num++ ) { 
- p_out_physp = osm_port_get_phys_ptr( p_port, out_port_num ); + p_out_physp = osm_node_get_physp_ptr( p_port->p_node, out_port_num ); if( p_out_physp == NULL ) continue; @@ -256,7 +256,7 @@ __osm_sa_slvl_by_comp_mask( continue; #endif - p_in_physp = osm_port_get_phys_ptr( p_port, in_port_num ); + p_in_physp = osm_node_get_physp_ptr( p_port->p_node, in_port_num ); if( p_in_physp == NULL ) continue; diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c index 6df5ed9..9cd346c 100644 --- a/osm/opensm/osm_sa_vlarb_record.c +++ b/osm/opensm/osm_sa_vlarb_record.c @@ -255,7 +255,7 @@ __osm_sa_vl_arb_by_comp_mask( { if (port_num < osm_node_get_num_physp( p_port->p_node )) { - p_physp = osm_port_get_phys_ptr( p_port, port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); /* check that the p_physp is valid, and that the requester and the p_physp share a pkey. */ if( p_physp && osm_physp_is_valid( p_physp ) && @@ -276,7 +276,7 @@ __osm_sa_vl_arb_by_comp_mask( num_ports = osm_node_get_num_physp( p_port->p_node ); for( port_num = 0; port_num < num_ports; port_num++ ) { - p_physp = osm_port_get_phys_ptr( p_port, port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); if( p_physp == NULL ) continue; diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 6f53e60..9aeec74 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -1287,7 +1287,7 @@ __osm_state_mgr_report( num_ports = osm_node_get_num_physp( p_node ); for( port_num = start_port; port_num < num_ports; port_num++ ) { - p_physp = osm_port_get_phys_ptr( p_port, port_num ); + p_physp = osm_node_get_physp_ptr( p_node, port_num ); if( ( p_physp == NULL ) || ( !osm_physp_is_valid( p_physp ) ) ) continue; diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index c8c3ddc..8e0c53b 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -266,7 +266,7 @@ osm_get_gid_by_mad_addr( ); 
return(IB_INVALID_PARAMETER); } - p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num); + p_physp = osm_port_get_default_phys_ptr( p_port ); p_gid->unicast.interface_id = p_physp->port_guid; p_gid->unicast.prefix = p_subn->opt.subnet_prefix; } @@ -316,7 +316,7 @@ osm_get_physp_by_mad_addr( goto Exit; } - p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num); + p_physp = osm_port_get_default_phys_ptr( p_port ); } else { diff --git a/osm/opensm/osm_trap_rcv.c b/osm/opensm/osm_trap_rcv.c index ed507b6..0ec9a1f 100644 --- a/osm/opensm/osm_trap_rcv.c +++ b/osm/opensm/osm_trap_rcv.c @@ -111,7 +111,7 @@ __get_physp_by_lid_and_num( if (osm_node_get_num_physp(p_port->p_node) < num) return NULL; - return( osm_port_get_phys_ptr(p_port, num) ); + return( osm_node_get_physp_ptr(p_port->p_node, num) ); } /********************************************************************** -- 1.5.2.rc2.20.gac2a From sashak at voltaire.com Thu May 10 14:14:39 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 11 May 2007 00:14:39 +0300 Subject: [ofa-general] [PATCH 3/4 v2] opensm: eliminate node's physical ports table duplication in osm_port_t In-Reply-To: <11788316803259-git-send-email-sashak@voltaire.com> References: <11788316803259-git-send-email-sashak@voltaire.com> Message-ID: <11788316801832-git-send-email-sashak@voltaire.com> Eliminate duplication of osm_node's physical ports table in osm_port_t object. 
Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_port.h | 34 +++++------------- osm/opensm/osm_pkey_rcv.c | 2 +- osm/opensm/osm_port.c | 70 ++++++++----------------------------- osm/opensm/osm_sa_link_record.c | 4 +- osm/opensm/osm_sa_pkey_record.c | 2 +- osm/opensm/osm_sa_slvl_record.c | 2 +- osm/opensm/osm_sa_vlarb_record.c | 2 +- osm/opensm/osm_slvl_map_rcv.c | 2 +- osm/opensm/osm_sm_state_mgr.c | 2 +- osm/opensm/osm_vl_arb_rcv.c | 2 +- 10 files changed, 34 insertions(+), 88 deletions(-) diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h index feebf63..0873a7e 100644 --- a/osm/include/opensm/osm_port.h +++ b/osm/include/opensm/osm_port.h @@ -1274,10 +1274,8 @@ typedef struct _osm_port struct _osm_node *p_node; ib_net64_t guid; uint32_t discovery_count; - uint8_t default_port_num; - uint8_t physp_tbl_size; + osm_physp_t *p_physp; cl_qlist_t mcm_list; - osm_physp_t *tbl[1]; } osm_port_t; /* * FIELDS @@ -1295,20 +1293,13 @@ typedef struct _osm_port * during the current fabric sweep. This number is reset * to zero at the start of a sweep. * -* default_port_num -* Index of the physical port used when physical characteristics -* contained in the Physical Port are needed. -* -* physp_tbl_size -* Number of physical ports associated with this logical port. +* p_physp +* The pointer to physical port used when physical +* characteristics contained in the Physical Port are needed. * * mcm_list * Multicast member list * -* tbl -* Array of pointers to Physical Port objects contained by this node. -* MUST BE LAST ELEMENT SINCE IT CAN GROW !!! 
-* * SEE ALSO * Port, Physical Port, Physical Port Table *********/ @@ -1386,10 +1377,8 @@ static inline ib_net16_t osm_port_get_base_lid( IN const osm_port_t* const p_port ) { - const osm_physp_t* const p_physp = p_port->tbl[p_port->default_port_num]; - CL_ASSERT( p_physp ); - CL_ASSERT( osm_physp_is_valid( p_physp ) ); - return( osm_physp_get_base_lid( p_physp )); + CL_ASSERT( p_port->p_physp && osm_physp_is_valid( p_port->p_physp ) ); + return( osm_physp_get_base_lid( p_port->p_physp )); } /* * PARAMETERS @@ -1419,10 +1408,8 @@ static inline uint8_t osm_port_get_lmc( IN const osm_port_t* const p_port ) { - const osm_physp_t* const p_physp = p_port->tbl[p_port->default_port_num]; - CL_ASSERT( p_physp ); - CL_ASSERT( osm_physp_is_valid( p_physp ) ); - return( osm_physp_get_lmc( p_physp )); + CL_ASSERT( p_port->p_physp && osm_physp_is_valid( p_port->p_physp ) ); + return( osm_physp_get_lmc( p_port->p_physp )); } /* * PARAMETERS @@ -1483,9 +1470,8 @@ osm_physp_t* osm_port_get_default_phys_ptr( IN const osm_port_t* const p_port ) { - CL_ASSERT( p_port->tbl[p_port->default_port_num] ); - CL_ASSERT( osm_physp_is_valid( p_port->tbl[p_port->default_port_num] ) ); - return( p_port->tbl[p_port->default_port_num] ); + CL_ASSERT( osm_physp_is_valid( p_port->p_physp ) ); + return p_port->p_physp; } /* * PARAMETERS diff --git a/osm/opensm/osm_pkey_rcv.c b/osm/opensm/osm_pkey_rcv.c index 76af9fc..0e0ec46 100644 --- a/osm/opensm/osm_pkey_rcv.c +++ b/osm/opensm/osm_pkey_rcv.c @@ -172,7 +172,7 @@ osm_pkey_rcv_process( else { p_physp = osm_port_get_default_phys_ptr(p_port); - port_num = p_port->default_port_num; + port_num = p_physp->port_num; } CL_ASSERT( p_physp ); diff --git a/osm/opensm/osm_port.c b/osm/opensm/osm_port.c index 053fc22..30e2ab2 100644 --- a/osm/opensm/osm_port.c +++ b/osm/opensm/osm_port.c @@ -174,7 +174,6 @@ osm_port_init( uint32_t port_index; ib_net64_t port_guid; osm_physp_t *p_physp; - uint32_t size; CL_ASSERT( p_port ); CL_ASSERT( p_ni ); @@ -187,39 +186,25 
@@ osm_port_init( p_port->guid = port_guid; /* - See comment in port_new for info about this... - */ - size = p_ni->num_ports; - - p_port->physp_tbl_size = (uint8_t)(size + 1); - - /* Get the pointers to the physical node objects "owned" by this logical port GUID. For switches, all the ports are owned; for HCA's and routers, only the singular part that has this GUID is owned. */ - p_port->default_port_num = 0xFF; - for( port_index = 0; port_index < p_port->physp_tbl_size; port_index++ ) + for( port_index = 0; port_index < p_parent_node->physp_tbl_size; port_index++ ) { p_physp = osm_node_get_physp_ptr( p_parent_node, port_index ); + /* + Because much of the PortInfo data is only valid + for port 0 on switches, try to keep the lowest + possible value of default_port_num. + */ if( osm_physp_is_valid( p_physp ) && - port_guid == osm_physp_get_port_guid( p_physp ) ) - { - p_port->tbl[port_index] = p_physp; - /* - Because much of the PortInfo data is only valid - for port 0 on switches, try to keep the lowest - possible value of default_port_num. - */ - if( port_index < p_port->default_port_num ) - p_port->default_port_num = (uint8_t)port_index; + port_guid == osm_physp_get_port_guid( p_physp ) ) { + p_port->p_physp = p_physp; + break; } - else - p_port->tbl[port_index] = NULL; } - - CL_ASSERT( p_port->default_port_num < 0xFF ); } /********************************************************************** @@ -230,21 +215,11 @@ osm_port_new( IN const osm_node_t* const p_parent_node ) { osm_port_t* p_port; - uint32_t size; - - /* - The port object already contains one physical port object pointer. - Therefore, subtract 1 from the number of physical ports - used by the switch. This is not done for CA's since they - need to occupy 1 more physp pointer than they physically have since - we still reserve room for a "port 0". 
- */ - size = p_ni->num_ports; - p_port = malloc( sizeof(*p_port) + sizeof(void *) * size ); + p_port = malloc( sizeof(*p_port) ); if( p_port != NULL ) { - memset( p_port, 0, sizeof(*p_port) + sizeof(void *) * size ); + memset( p_port, 0, sizeof(*p_port) ); osm_port_init( p_port, p_ni, p_parent_node ); } @@ -315,18 +290,11 @@ osm_port_add_new_physp( IN osm_port_t* const p_port, IN const uint8_t port_num ) { - osm_node_t *p_node; osm_physp_t *p_physp; - CL_ASSERT( port_num < p_port->physp_tbl_size ); - - p_node = p_port->p_node; - CL_ASSERT( p_node ); - - p_physp = osm_node_get_physp_ptr( p_node, port_num ); + p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); CL_ASSERT( osm_physp_is_valid( p_physp ) ); CL_ASSERT( osm_physp_get_port_guid( p_physp ) == p_port->guid ); - p_port->tbl[port_num] = p_physp; /* For switches, we generally want to use Port 0, which is @@ -334,17 +302,9 @@ osm_port_add_new_physp( The LID value in the PortInfo for example, is only valid for port 0 on switches. */ - if( !osm_physp_is_valid( p_port->tbl[p_port->default_port_num] ) ) - { - p_port->default_port_num = port_num; - } - else - { - if( port_num < p_port->default_port_num ) - { - p_port->default_port_num = port_num; - } - } + if( !osm_physp_is_valid( osm_port_get_default_phys_ptr( p_port ) ) || + port_num < p_port->p_physp->port_num ) + p_port->p_physp = p_physp; } /********************************************************************** diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c index c6b7a7c..5e4e35e 100644 --- a/osm/opensm/osm_sa_link_record.c +++ b/osm/opensm/osm_sa_link_record.c @@ -374,7 +374,7 @@ __osm_lr_rcv_get_port_links( port_num = p_lr->from_port_num; /* If the port number is out of the range of the p_src_port, then this couldn't be a relevant record. 
*/ - if (port_num < p_src_port->physp_tbl_size) + if (port_num < p_src_port->p_node->physp_tbl_size) { p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num ); if (p_src_physp) @@ -409,7 +409,7 @@ __osm_lr_rcv_get_port_links( port_num = p_lr->to_port_num; /* If the port number is out of the range of the p_dest_port, then this couldn't be a relevant record. */ - if (port_num < p_dest_port->physp_tbl_size ) + if (port_num < p_dest_port->p_node->physp_tbl_size ) { p_dest_physp = osm_node_get_physp_ptr( p_dest_port->p_node, port_num ); diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c index a943fe0..8a71314 100644 --- a/osm/opensm/osm_sa_pkey_record.c +++ b/osm/opensm/osm_sa_pkey_record.c @@ -239,7 +239,7 @@ __osm_sa_pkey_by_comp_mask( if ( p_port->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH ) { /* we put it in the comp mask and port num */ - port_num = p_port->default_port_num; + port_num = p_port->p_physp->port_num; osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_pkey_by_comp_mask: " "Using Physical Default Port Number: 0x%X (for End Node)\n", diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c index 2f250d9..168901e 100644 --- a/osm/opensm/osm_sa_slvl_record.c +++ b/osm/opensm/osm_sa_slvl_record.c @@ -225,7 +225,7 @@ __osm_sa_slvl_by_comp_mask( osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_slvl_by_comp_mask: " "Using Physical Default Port Number: 0x%X (for End Node)\n", - p_port->default_port_num ); + p_port->p_physp->port_num ); p_out_physp = osm_port_get_default_phys_ptr( p_port ); /* check that the p_out_physp and the p_req_physp share a pkey */ if (osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_out_physp )) diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c index 9cd346c..a462ee9 100644 --- a/osm/opensm/osm_sa_vlarb_record.c +++ b/osm/opensm/osm_sa_vlarb_record.c @@ -243,7 +243,7 @@ __osm_sa_vl_arb_by_comp_mask( if ( p_port->p_node->node_info.node_type 
!= IB_NODE_TYPE_SWITCH) { /* we put it in the comp mask and port num */ - port_num = p_port->default_port_num; + port_num = p_port->p_physp->port_num; osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_vl_arb_by_comp_mask: " "Using Physical Default Port Number: 0x%X (for End Node)\n", diff --git a/osm/opensm/osm_slvl_map_rcv.c b/osm/opensm/osm_slvl_map_rcv.c index 3fa3a7e..b109f75 100644 --- a/osm/opensm/osm_slvl_map_rcv.c +++ b/osm/opensm/osm_slvl_map_rcv.c @@ -183,7 +183,7 @@ osm_slvl_rcv_process( else { p_physp = osm_port_get_default_phys_ptr(p_port); - out_port_num = p_port->default_port_num; + out_port_num = p_physp->port_num; in_port_num = 0; } diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c index 3aa92c8..0034320 100644 --- a/osm/opensm/osm_sm_state_mgr.c +++ b/osm/opensm/osm_sm_state_mgr.c @@ -194,7 +194,7 @@ __osm_sm_state_mgr_send_local_port_info_req( osm_physp_get_dr_path_ptr ( osm_port_get_default_phys_ptr( p_port ) ), IB_MAD_ATTR_PORT_INFO, - cl_hton32( p_port->default_port_num ), + cl_hton32( p_port->p_physp->port_num ), CL_DISP_MSGID_NONE, &context ); if( status != IB_SUCCESS ) diff --git a/osm/opensm/osm_vl_arb_rcv.c b/osm/opensm/osm_vl_arb_rcv.c index 930360a..ed8dfc5 100644 --- a/osm/opensm/osm_vl_arb_rcv.c +++ b/osm/opensm/osm_vl_arb_rcv.c @@ -184,7 +184,7 @@ osm_vla_rcv_process( else { p_physp = osm_port_get_default_phys_ptr(p_port); - port_num = p_port->default_port_num; + port_num = p_physp->port_num; } CL_ASSERT( p_physp ); -- 1.5.2.rc2.20.gac2a From sashak at voltaire.com Thu May 10 14:14:40 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 11 May 2007 00:14:40 +0300 Subject: [ofa-general] [PATCH 4/4 v2] opensm: remove some unneeded funcs In-Reply-To: <11788316803259-git-send-email-sashak@voltaire.com> References: <11788316803259-git-send-email-sashak@voltaire.com> Message-ID: <11788316801642-git-send-email-sashak@voltaire.com> This removes some not really needed functions osm_port_get_default_phys_ptr() 
and osm_port_get_parent_node(). Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_port.h | 66 ---------------------------------- osm/opensm/osm_lid_mgr.c | 14 ++------ osm/opensm/osm_mcast_mgr.c | 2 +- osm/opensm/osm_node_info_rcv.c | 6 +-- osm/opensm/osm_pkey.c | 8 ++-- osm/opensm/osm_pkey_mgr.c | 8 ++--- osm/opensm/osm_pkey_rcv.c | 4 +- osm/opensm/osm_port.c | 6 ++-- osm/opensm/osm_port_info_rcv.c | 2 +- osm/opensm/osm_prtn.c | 2 +- osm/opensm/osm_qos.c | 4 +- osm/opensm/osm_sa_informinfo.c | 6 ++-- osm/opensm/osm_sa_lft_record.c | 2 +- osm/opensm/osm_sa_mcmember_record.c | 2 +- osm/opensm/osm_sa_mft_record.c | 2 +- osm/opensm/osm_sa_multipath_record.c | 10 +++--- osm/opensm/osm_sa_path_record.c | 12 +++--- osm/opensm/osm_sa_service_record.c | 2 +- osm/opensm/osm_sa_slvl_record.c | 2 +- osm/opensm/osm_sa_sminfo_record.c | 2 +- osm/opensm/osm_sa_sw_info_record.c | 2 +- osm/opensm/osm_slvl_map_rcv.c | 4 +- osm/opensm/osm_sm_state_mgr.c | 5 +-- osm/opensm/osm_state_mgr.c | 11 +++--- osm/opensm/osm_subnet.c | 6 +-- osm/opensm/osm_switch.c | 6 ++-- osm/opensm/osm_ucast_lash.c | 2 +- osm/opensm/osm_ucast_mgr.c | 7 ++-- osm/opensm/osm_ucast_updn.c | 2 +- osm/opensm/osm_vl_arb_rcv.c | 5 +-- 30 files changed, 64 insertions(+), 148 deletions(-) diff --git a/osm/include/opensm/osm_port.h b/osm/include/opensm/osm_port.h index 0873a7e..df9065e 100644 --- a/osm/include/opensm/osm_port.h +++ b/osm/include/opensm/osm_port.h @@ -1454,72 +1454,6 @@ osm_port_get_guid( * Port *********/ -/****f* OpenSM: Port/osm_port_get_default_phys_ptr -* NAME -* osm_port_get_default_phys_ptr -* -* DESCRIPTION -* Gets the pointer to the default Physical Port object. -* This call should only be used for non-switch ports in which there -* is a one-for-one mapping of port to physp. 
-* -* SYNOPSIS -*/ -static inline -osm_physp_t* -osm_port_get_default_phys_ptr( - IN const osm_port_t* const p_port ) -{ - CL_ASSERT( osm_physp_is_valid( p_port->p_physp ) ); - return p_port->p_physp; -} -/* -* PARAMETERS -* p_port -* [in] Pointer to a Port object. -* -* RETURN VALUE -* Pointer to the Physical Port object. -* -* NOTES -* -* SEE ALSO -* Port -*********/ - -/****f* OpenSM: Port/osm_port_get_parent_node -* NAME -* osm_port_get_parent_node -* -* DESCRIPTION -* Gets the pointer to the this port's Node object. -* -* SYNOPSIS -*/ -static inline struct _osm_node* -osm_port_get_parent_node( - IN const osm_port_t* const p_port ) -{ - return( p_port->p_node ); -} -/* -* PARAMETERS -* p_port -* [in] Pointer to a Port object. -* -* port_num -* [in] Number of physical port for which to return the -* osm_physp_t object. -* -* RETURN VALUE -* Pointer to the Physical Port object. -* -* NOTES -* -* SEE ALSO -* Port -*********/ - /****f* OpenSM: Port/osm_port_get_lid_range_ho * NAME * osm_port_get_lid_range_ho diff --git a/osm/opensm/osm_lid_mgr.c b/osm/opensm/osm_lid_mgr.c index d856fb0..6712c6c 100644 --- a/osm/opensm/osm_lid_mgr.c +++ b/osm/opensm/osm_lid_mgr.c @@ -975,10 +975,7 @@ __osm_lid_mgr_set_physp_pi( Don't bother doing anything if this Physical Port is not valid. This allows simplified code in the caller. */ - if( p_physp == NULL ) - goto Exit; - - if( !osm_physp_is_valid( p_physp ) ) + if( p_physp == NULL || !osm_physp_is_valid( p_physp ) ) goto Exit; port_num = osm_physp_get_port_num( p_physp ); @@ -1283,7 +1280,6 @@ __osm_lid_mgr_process_our_sm_node( osm_port_t *p_port; uint16_t min_lid_ho; uint16_t max_lid_ho; - osm_physp_t *p_physp; boolean_t res = TRUE; OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_process_our_sm_node ); @@ -1336,9 +1332,7 @@ __osm_lid_mgr_process_our_sm_node( Set the PortInfo the Physical Port associated with this Port. 
*/ - p_physp = osm_port_get_default_phys_ptr( p_port ); - - __osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_physp, cl_hton16( min_lid_ho ) ); + __osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_port->p_physp, cl_hton16( min_lid_ho ) ); Exit: OSM_LOG_EXIT( p_mgr->p_log ); @@ -1404,7 +1398,6 @@ osm_lid_mgr_process_subnet( osm_port_t *p_port; ib_net64_t port_guid; uint16_t min_lid_ho, max_lid_ho; - osm_physp_t *p_physp; int lid_changed; CL_ASSERT( p_mgr ); @@ -1460,9 +1453,8 @@ osm_lid_mgr_process_subnet( ", LID [0x%X,0x%X]\n", cl_ntoh64( port_guid ), min_lid_ho, max_lid_ho ); - p_physp = osm_port_get_default_phys_ptr( p_port ); /* the proc returns the fact it sent a set port info */ - if (__osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_physp, cl_hton16( min_lid_ho ))) + if (__osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_port->p_physp, cl_hton16( min_lid_ho ))) p_mgr->send_set_reqs = TRUE; } } /* all ports */ diff --git a/osm/opensm/osm_mcast_mgr.c b/osm/opensm/osm_mcast_mgr.c index 0cdcc0e..f5059c9 100644 --- a/osm/opensm/osm_mcast_mgr.c +++ b/osm/opensm/osm_mcast_mgr.c @@ -1127,7 +1127,7 @@ osm_mcast_mgr_process_single( goto Exit; } - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; if( p_physp == NULL ) { osm_log( p_mgr->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_node_info_rcv.c b/osm/opensm/osm_node_info_rcv.c index 364b07c..2c79056 100644 --- a/osm/opensm/osm_node_info_rcv.c +++ b/osm/opensm/osm_node_info_rcv.c @@ -791,12 +791,10 @@ __osm_ni_rcv_process_new( "Duplicate Port GUID 0x%" PRIx64 "! 
Found by the two directed routes:\n", cl_ntoh64( p_ni->port_guid ) ); osm_dump_dr_path(p_rcv->p_log, - osm_physp_get_dr_path_ptr( - osm_port_get_default_phys_ptr ( p_port) ), + osm_physp_get_dr_path_ptr(p_port->p_physp), OSM_LOG_ERROR); osm_dump_dr_path(p_rcv->p_log, - osm_physp_get_dr_path_ptr( - osm_port_get_default_phys_ptr ( p_port_check) ), + osm_physp_get_dr_path_ptr(p_port_check->p_physp), OSM_LOG_ERROR); if ( p_rtr ) osm_router_delete( &p_rtr ); diff --git a/osm/opensm/osm_pkey.c b/osm/opensm/osm_pkey.c index be5578a..c0daa38 100644 --- a/osm/opensm/osm_pkey.c +++ b/osm/opensm/osm_pkey.c @@ -432,8 +432,8 @@ osm_port_share_pkey( goto Exit; } - p_physp1 = osm_port_get_default_phys_ptr(p_port_1); - p_physp2 = osm_port_get_default_phys_ptr(p_port_2); + p_physp1 = p_port_1->p_physp; + p_physp2 = p_port_2->p_physp; if (!p_physp1 || !p_physp2) { @@ -478,7 +478,7 @@ osm_lid_share_pkey( } else { - p_physp1 = osm_port_get_default_phys_ptr(p_port1); + p_physp1 = p_port1->p_physp; } if (osm_node_get_type( p_node2 ) == IB_NODE_TYPE_SWITCH) @@ -487,7 +487,7 @@ osm_lid_share_pkey( } else { - p_physp2 = osm_port_get_default_phys_ptr(p_port2); + p_physp2 = p_port2->p_physp; } return(osm_physp_share_pkey(p_log, p_physp1, p_physp2)); diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index bbbe192..33ac8b5 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -310,7 +310,7 @@ static boolean_t pkey_mgr_update_port( memset(&empty_block, 0, sizeof(ib_pkey_table_t)); - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; if ( !osm_physp_is_valid( p_physp ) ) return FALSE; @@ -449,7 +449,7 @@ pkey_mgr_update_peer_port( memset(&empty_block, 0, sizeof(ib_pkey_table_t)); - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; if (!osm_physp_is_valid( p_physp )) return FALSE; peer = osm_physp_get_remote( p_physp ); @@ -532,7 +532,6 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; 
osm_signal_t signal = OSM_SIGNAL_DONE; - osm_node_t *p_node; CL_ASSERT( p_osm ); @@ -570,8 +569,7 @@ osm_pkey_mgr_process( p_next = cl_qmap_next( p_next ); if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) signal = OSM_SIGNAL_DONE_PENDING; - p_node = osm_port_get_parent_node( p_port ); - if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && + if ( ( osm_node_get_type( p_port->p_node ) != IB_NODE_TYPE_SWITCH ) && pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) diff --git a/osm/opensm/osm_pkey_rcv.c b/osm/opensm/osm_pkey_rcv.c index 0e0ec46..7c58d98 100644 --- a/osm/opensm/osm_pkey_rcv.c +++ b/osm/opensm/osm_pkey_rcv.c @@ -159,7 +159,7 @@ osm_pkey_rcv_process( goto Exit; } - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; CL_ASSERT( p_node ); block_num = (uint16_t)((cl_ntoh32(p_smp->attr_mod)) & 0x0000FFFF); @@ -171,7 +171,7 @@ osm_pkey_rcv_process( } else { - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; port_num = p_physp->port_num; } diff --git a/osm/opensm/osm_port.c b/osm/opensm/osm_port.c index 30e2ab2..eab86e1 100644 --- a/osm/opensm/osm_port.c +++ b/osm/opensm/osm_port.c @@ -302,7 +302,7 @@ osm_port_add_new_physp( The LID value in the PortInfo for example, is only valid for port 0 on switches. */ - if( !osm_physp_is_valid( osm_port_get_default_phys_ptr( p_port ) ) || + if( !osm_physp_is_valid( p_port->p_physp ) || port_num < p_port->p_physp->port_num ) p_port->p_physp = p_physp; } @@ -565,7 +565,7 @@ __osm_physp_get_dr_physp_set( } /* get the node of the SM */ - p_node = osm_port_get_parent_node(p_port); + p_node = p_port->p_node; /* traverse the path adding the nodes to the table @@ -732,7 +732,7 @@ osm_physp_replace_dr_path_with_alternate_dr_path( port we'll get the port connected to the rest of the subnet. If SM is running on SWITCH - we should try to get a dr path from all switch ports. 
*/ - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; CL_ASSERT( p_physp ); CL_ASSERT( osm_physp_is_valid( p_physp ) ); diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c index 5d9c5c7..0076b00 100644 --- a/osm/opensm/osm_port_info_rcv.c +++ b/osm/opensm/osm_port_info_rcv.c @@ -743,7 +743,7 @@ osm_pi_rcv_process( cl_ntoh64( p_smp->trans_id ) ); } - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; p_physp = osm_node_get_physp_ptr( p_node, port_num ); CL_ASSERT( p_node ); diff --git a/osm/opensm/osm_prtn.c b/osm/opensm/osm_prtn.c index 4099cee..027a5a4 100644 --- a/osm/opensm/osm_prtn.c +++ b/osm/opensm/osm_prtn.c @@ -119,7 +119,7 @@ ib_api_status_t osm_prtn_add_port(osm_log_t *p_log, osm_subn_t *p_subn, return status; } - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; if (!p_physp) { osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " "no physical for port 0x%" PRIx64 "\n", diff --git a/osm/opensm/osm_qos.c b/osm/opensm/osm_qos.c index 1255169..f426241 100644 --- a/osm/opensm/osm_qos.c +++ b/osm/opensm/osm_qos.c @@ -195,7 +195,7 @@ static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port, if (osm_node_get_type(osm_physp_get_node_ptr(p)) == IB_NODE_TYPE_SWITCH) { if (ib_port_info_get_vl_cap(&p->port_info) == 1) { /* Check port 0's capability mask */ - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; if (!(p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_SL_MAP)) return IB_SUCCESS; } @@ -353,7 +353,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) else cfg = &ca_config; - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; if (!osm_physp_is_valid(p_physp)) continue; diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c index 340a7f1..6109c5d 100644 --- a/osm/opensm/osm_sa_informinfo.c +++ b/osm/opensm/osm_sa_informinfo.c @@ -194,7 +194,7 @@ 
__validate_ports_access_rights( } /* get the destination InformInfo physical port */ - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; /* make sure that the requester and destination port can access each other according to the current partitioning. */ @@ -244,7 +244,7 @@ __validate_ports_access_rights( if ( p_port == NULL ) continue; - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; /* make sure that the requester and destination port can access each other according to the current partitioning. */ if (! osm_physp_share_pkey( p_rcv->p_log, p_physp, p_requester_physp)) @@ -405,7 +405,7 @@ __osm_sa_inform_info_rec_by_comp_mask( } /* get the subscriber InformInfo physical port */ - p_subscriber_physp = osm_port_get_default_phys_ptr(p_subscriber_port); + p_subscriber_physp = p_subscriber_port->p_physp; /* make sure that the requester and subscriber port can access each other according to the current partitioning. */ if (! osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_subscriber_physp )) diff --git a/osm/opensm/osm_sa_lft_record.c b/osm/opensm/osm_sa_lft_record.c index b6333e7..c5cd9ca 100644 --- a/osm/opensm/osm_sa_lft_record.c +++ b/osm/opensm/osm_sa_lft_record.c @@ -244,7 +244,7 @@ __osm_lftr_rcv_by_comp_mask( /* check that the requester physp and the current physp are under the same partition. */ - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; if (! p_physp) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c index 50c4f22..8241129 100644 --- a/osm/opensm/osm_sa_mcmember_record.c +++ b/osm/opensm/osm_sa_mcmember_record.c @@ -1570,7 +1570,7 @@ __osm_mcmr_rcv_join_mgrp( goto Exit; } - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; /* Check that the p_physp and the requester physp are in the same partition. 
*/ p_request_physp = diff --git a/osm/opensm/osm_sa_mft_record.c b/osm/opensm/osm_sa_mft_record.c index 005c9bd..7908583 100644 --- a/osm/opensm/osm_sa_mft_record.c +++ b/osm/opensm/osm_sa_mft_record.c @@ -250,7 +250,7 @@ __osm_mftr_rcv_by_comp_mask( /* check that the requester physp and the current physp are under the same partition. */ - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; if (! p_physp) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_multipath_record.c b/osm/opensm/osm_sa_multipath_record.c index 0c5643e..06640d9 100644 --- a/osm/opensm/osm_sa_multipath_record.c +++ b/osm/opensm/osm_sa_multipath_record.c @@ -154,7 +154,7 @@ __osm_sa_multipath_rec_is_tavor_port( osm_node_t const* p_node; ib_net32_t vend_id; - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; vend_id = ib_node_info_get_vendor_id( &p_node->node_info ); return( (p_node->node_info.device_id == CL_HTON16(23108)) && @@ -255,8 +255,8 @@ __osm_mpr_rcv_get_path_parms( dest_lid = cl_hton16( dest_lid_ho ); - p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port ); - p_physp = osm_port_get_default_phys_ptr( p_src_port ); + p_dest_physp = p_dest_port->p_physp; + p_physp = p_src_port->p_physp; p_pi = &p_physp->port_info; mtu = ib_port_info_get_mtu_cap( p_pi ); @@ -744,8 +744,8 @@ __osm_mpr_rcv_build_pr( OSM_LOG_ENTER( p_rcv->p_log, __osm_mpr_rcv_build_pr ); - p_src_physp = osm_port_get_default_phys_ptr( p_src_port ); - p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port ); + p_src_physp = p_src_port->p_physp; + p_dest_physp = p_dest_port->p_physp; p_pr->dgid.unicast.prefix = osm_physp_get_subnet_prefix( p_dest_physp ); p_pr->dgid.unicast.interface_id = osm_physp_get_port_guid( p_dest_physp ); diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c index 1b0f89f..47d9c33 100644 --- a/osm/opensm/osm_sa_path_record.c +++ b/osm/opensm/osm_sa_path_record.c @@ -171,7 +171,7 @@ 
__osm_sa_path_rec_is_tavor_port( osm_node_t const* p_node; ib_net32_t vend_id; - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; vend_id = ib_node_info_get_vendor_id( &p_node->node_info ); return( (p_node->node_info.device_id == CL_HTON16(23108)) && @@ -268,8 +268,8 @@ __osm_pr_rcv_get_path_parms( dest_lid = cl_hton16( dest_lid_ho ); - p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port ); - p_physp = osm_port_get_default_phys_ptr( p_src_port ); + p_dest_physp = p_dest_port->p_physp; + p_physp = p_src_port->p_physp; p_pi = &p_physp->port_info; mtu = ib_port_info_get_mtu_cap( p_pi ); @@ -753,9 +753,9 @@ __osm_pr_rcv_build_pr( OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_build_pr ); - p_src_physp = osm_port_get_default_phys_ptr( p_src_port ); + p_src_physp = p_src_port->p_physp; #ifndef ROUTER_EXP - p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port ); + p_dest_physp = p_dest_port->p_physp; p_pr->dgid.unicast.prefix = osm_physp_get_subnet_prefix( p_dest_physp ); p_pr->dgid.unicast.interface_id = osm_physp_get_port_guid( p_dest_physp ); @@ -770,7 +770,7 @@ __osm_pr_rcv_build_pr( p_pr->dgid = *p_dgid; else { - p_dest_physp = osm_port_get_default_phys_ptr( p_dest_port ); + p_dest_physp = p_dest_port->p_physp; p_pr->dgid.unicast.prefix = osm_physp_get_subnet_prefix( p_dest_physp ); p_pr->dgid.unicast.interface_id = osm_physp_get_port_guid( p_dest_physp ); diff --git a/osm/opensm/osm_sa_service_record.c b/osm/opensm/osm_sa_service_record.c index b23a12d..eff0b0a 100644 --- a/osm/opensm/osm_sa_service_record.c +++ b/osm/opensm/osm_sa_service_record.c @@ -213,7 +213,7 @@ __match_service_pkey_with_ports_pkey( /* check on the table of the default physical port of the service port */ if ( !osm_physp_has_pkey( p_rcv->p_log, p_service_rec->service_pkey, - osm_port_get_default_phys_ptr(service_port) ) ) + service_port->p_physp ) ) { valid = FALSE; goto Exit; diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c index 
168901e..e40ad61 100644 --- a/osm/opensm/osm_sa_slvl_record.c +++ b/osm/opensm/osm_sa_slvl_record.c @@ -226,7 +226,7 @@ __osm_sa_slvl_by_comp_mask( "__osm_sa_slvl_by_comp_mask: " "Using Physical Default Port Number: 0x%X (for End Node)\n", p_port->p_physp->port_num ); - p_out_physp = osm_port_get_default_phys_ptr( p_port ); + p_out_physp = p_port->p_physp; /* check that the p_out_physp and the p_req_physp share a pkey */ if (osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_out_physp )) __osm_sa_slvl_create( p_rcv, p_out_physp, p_ctxt, 0 ); diff --git a/osm/opensm/osm_sa_sminfo_record.c b/osm/opensm/osm_sa_sminfo_record.c index 5e15f52..8c343b4 100644 --- a/osm/opensm/osm_sa_sminfo_record.c +++ b/osm/opensm/osm_sa_sminfo_record.c @@ -374,7 +374,7 @@ osm_smir_rcv_process( { if (FALSE == osm_physp_share_pkey( p_rcv->p_log, p_req_physp, - osm_port_get_default_phys_ptr( local_port ) ) ) + local_port->p_physp ) ) { cl_plock_release( p_rcv->p_lock ); osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_sw_info_record.c b/osm/opensm/osm_sa_sw_info_record.c index da65864..94b1ff9 100644 --- a/osm/opensm/osm_sa_sw_info_record.c +++ b/osm/opensm/osm_sa_sw_info_record.c @@ -245,7 +245,7 @@ __osm_sir_rcv_create_sir( /* check that the requester physp and the current physp are under the same partition. */ - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; if (! 
p_physp) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_slvl_map_rcv.c b/osm/opensm/osm_slvl_map_rcv.c index b109f75..3352627 100644 --- a/osm/opensm/osm_slvl_map_rcv.c +++ b/osm/opensm/osm_slvl_map_rcv.c @@ -170,7 +170,7 @@ osm_slvl_rcv_process( goto Exit; } - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; CL_ASSERT( p_node ); /* in case of a non switch node the attr modifier should be ignored */ @@ -182,7 +182,7 @@ osm_slvl_rcv_process( } else { - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; out_port_num = p_physp->port_num; in_port_num = 0; } diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c index 0034320..51df1df 100644 --- a/osm/opensm/osm_sm_state_mgr.c +++ b/osm/opensm/osm_sm_state_mgr.c @@ -192,7 +192,7 @@ __osm_sm_state_mgr_send_local_port_info_req( status = osm_req_get( p_sm_mgr->p_req, osm_physp_get_dr_path_ptr - ( osm_port_get_default_phys_ptr( p_port ) ), + ( p_port->p_physp ), IB_MAD_ATTR_PORT_INFO, cl_hton32( p_port->p_physp->port_num ), CL_DISP_MSGID_NONE, &context ); @@ -261,8 +261,7 @@ __osm_sm_state_mgr_send_master_sm_info_req( context.smi_context.set_method = FALSE; status = osm_req_get( p_sm_mgr->p_req, - osm_physp_get_dr_path_ptr - ( osm_port_get_default_phys_ptr( p_port ) ), + osm_physp_get_dr_path_ptr(p_port->p_physp), IB_MAD_ATTR_SM_INFO, 0, CL_DISP_MSGID_NONE, &context ); diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 9aeec74..6681cfc 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -849,7 +849,7 @@ __osm_state_mgr_is_sm_port_down( goto Exit; } - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; CL_ASSERT( p_physp ); CL_ASSERT( osm_physp_is_valid( p_physp ) ); @@ -914,7 +914,7 @@ __osm_state_mgr_sweep_hop_1( goto Exit; } - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; CL_ASSERT( p_node ); port_num = 
ib_node_info_get_local_port_num( &p_node->node_info ); @@ -1277,7 +1277,7 @@ __osm_state_mgr_report( cl_ntoh64( osm_port_get_guid( p_port ) ) ); } - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; node_type = osm_node_get_type( p_node ); if( node_type == IB_NODE_TYPE_SWITCH ) start_port = 0; @@ -1622,9 +1622,8 @@ __osm_state_mgr_send_handover( } status = osm_req_set( p_mgr->p_req, - osm_physp_get_dr_path_ptr - ( osm_port_get_default_phys_ptr( p_port ) ), payload, - sizeof(payload), + osm_physp_get_dr_path_ptr(p_port->p_physp), + payload, sizeof(payload), IB_MAD_ATTR_SM_INFO, IB_SMINFO_ATTR_MOD_HANDOVER, CL_DISP_MSGID_NONE, &context ); diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 8e0c53b..0484530 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -238,7 +238,6 @@ osm_get_gid_by_mad_addr( { const cl_ptr_vector_t* p_tbl; const osm_port_t* p_port = NULL; - const osm_physp_t* p_physp = NULL; if ( p_gid == NULL ) { @@ -266,8 +265,7 @@ osm_get_gid_by_mad_addr( ); return(IB_INVALID_PARAMETER); } - p_physp = osm_port_get_default_phys_ptr( p_port ); - p_gid->unicast.interface_id = p_physp->port_guid; + p_gid->unicast.interface_id = p_port->p_physp->port_guid; p_gid->unicast.prefix = p_subn->opt.subnet_prefix; } else @@ -316,7 +314,7 @@ osm_get_physp_by_mad_addr( goto Exit; } - p_physp = osm_port_get_default_phys_ptr( p_port ); + p_physp = p_port->p_physp; } else { diff --git a/osm/opensm/osm_switch.c b/osm/opensm/osm_switch.c index 9273459..a79f5cd 100644 --- a/osm/opensm/osm_switch.c +++ b/osm/opensm/osm_switch.c @@ -291,7 +291,7 @@ osm_switch_recommend_path( } else { - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; if (!p_physp || !p_physp->p_remote_physp || !p_physp->p_remote_physp->p_node->sw) return OSM_NO_PATH; @@ -566,7 +566,7 @@ osm_switch_get_port_least_hops( } else { - osm_physp_t *p = osm_port_get_default_phys_ptr(p_port); + osm_physp_t *p = p_port->p_physp; uint8_t 
hops; if (!p || !p->p_remote_physp || !p->p_remote_physp->p_node->sw) @@ -604,7 +604,7 @@ osm_switch_recommend_mcast_path( } else { - osm_physp_t *p_physp = osm_port_get_default_phys_ptr(p_port); + osm_physp_t *p_physp = p_port->p_physp; if (!p_physp || !p_physp->p_remote_physp || !p_physp->p_remote_physp->p_node->sw) return OSM_NO_PATH; diff --git a/osm/opensm/osm_ucast_lash.c b/osm/opensm/osm_ucast_lash.c index 4459d9f..5d32e89 100644 --- a/osm/opensm/osm_ucast_lash.c +++ b/osm/opensm/osm_ucast_lash.c @@ -170,7 +170,7 @@ static uint64_t osm_lash_get_switch_guid(IN const osm_switch_t *p_sw) static osm_switch_t *get_osm_switch_from_port(osm_port_t *port) { - osm_physp_t *p = osm_port_get_default_phys_ptr(port); + osm_physp_t *p = port->p_physp; if (p->p_node->sw) return p->p_node->sw; else if (p->p_remote_physp->p_node->sw) diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index 7d3916b..2860e66 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -306,7 +306,7 @@ __osm_ucast_mgr_dump_ucast_routes( } else { - osm_physp_t *p_physp = osm_port_get_default_phys_ptr(p_port); + osm_physp_t *p_physp = p_port->p_physp; if( !p_physp || !p_physp->p_remote_physp || !p_physp->p_remote_physp->p_node->sw ) num_hops = OSM_NO_PATH; @@ -413,7 +413,7 @@ ucast_mgr_dump_lfts(cl_map_item_t *p_map_item, void *cxt) p_port = cl_ptr_vector_get(&p_mgr->p_subn->port_lid_tbl, lid); if (p_port) { - p_node = osm_port_get_parent_node(p_port); + p_node = p_port->p_node; fprintf(file, "%s portguid 0x016%" PRIx64 ": \'%s\'", ib_get_node_type_str(osm_node_get_type(p_node)), cl_ntoh64(osm_port_get_guid(p_port)), @@ -671,8 +671,7 @@ __osm_ucast_mgr_process_port( if (!p_mgr->p_subn->opt.port_profile_switch_nodes) { is_ignored_by_port_prof |= - (osm_node_get_type(osm_port_get_parent_node(p_port)) == - IB_NODE_TYPE_SWITCH); + (osm_node_get_type(p_port->p_node) == IB_NODE_TYPE_SWITCH); } } diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c 
index b15fe5e..d9446e9 100644 --- a/osm/opensm/osm_ucast_updn.c +++ b/osm/opensm/osm_ucast_updn.c @@ -792,7 +792,7 @@ __osm_updn_find_root_nodes_by_min_hop( p_next_port = (osm_port_t*)cl_qmap_next( &p_next_port->map_item ); if ( osm_node_get_type(p_port->p_node) != IB_NODE_TYPE_SWITCH ) { - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; self_lid_ho = cl_ntoh16( osm_physp_get_base_lid(p_physp) ); numCas++; /* EZ: diff --git a/osm/opensm/osm_vl_arb_rcv.c b/osm/opensm/osm_vl_arb_rcv.c index ed8dfc5..f36751e 100644 --- a/osm/opensm/osm_vl_arb_rcv.c +++ b/osm/opensm/osm_vl_arb_rcv.c @@ -171,7 +171,7 @@ osm_vla_rcv_process( goto Exit; } - p_node = osm_port_get_parent_node( p_port ); + p_node = p_port->p_node; CL_ASSERT( p_node ); block_num = (uint8_t)(cl_ntoh32(p_smp->attr_mod) >> 16); @@ -183,7 +183,7 @@ osm_vla_rcv_process( } else { - p_physp = osm_port_get_default_phys_ptr(p_port); + p_physp = p_port->p_physp; port_num = p_physp->port_num; } @@ -239,4 +239,3 @@ osm_vla_rcv_process( OSM_LOG_EXIT( p_rcv->p_log ); } - -- 1.5.2.rc2.20.gac2a From jsquyres at cisco.com Thu May 10 14:10:52 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 10 May 2007 17:10:52 -0400 Subject: [ofa-general] Fwd: [Netpipe] NetPIPE-3.7.1 released! References: <46434F9E.6060206@scl.ameslab.gov> Message-ID: <011D5F75-0AF2-41E8-A876-FD688D845066@cisco.com> FYI. Begin forwarded message: > From: Troy Benjegerdes > Date: May 10, 2007 1:00:14 PM EDT > To: netpipe at source.iprt.iastate.edu > Subject: [Netpipe] NetPIPE-3.7.1 released! > > NetPIPE-3.7.1 has been released. See the trac wiki > > http://source.scl.ameslab.gov/trac/netpipe > > and the direct download link > > http://source.scl.ameslab.gov/NetPIPE/NetPIPE-3.7.1.tar.gz > > > > > The major change from NetPIPE-3.7 is the ability of the ibv module to > select the OpenFabrics adapter and port to use. 
> > FYI, for those of you wanting to send me patches against NetPIPE, I am > planning on doing a large cosmetic code clean up to indent all the > source files consistently. This means you will have some patch cleanup > to do, so either send your patches now, or plan on some manual work to > get them to re-apply. > > My preference is to follow the Linux coding style, with the > exception of > using 4-space tabs. If you have a strong opinion about codingstyle, > speak now or forever hold your peace ;) > _______________________________________________ > Netpipe mailing list > Netpipe at lists.scl.ameslab.gov > https://lists.scl.ameslab.gov/cgi-bin/mailman/listinfo/netpipe -- Jeff Squyres Cisco Systems From pradeeps at linux.vnet.ibm.com Thu May 10 14:26:10 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 10 May 2007 14:26:10 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <4641E99B.10706@linux.vnet.ibm.com> References: <4641E99B.10706@linux.vnet.ibm.com> Message-ID: <46438DF2.3080601@linux.vnet.ibm.com> If there are no other issues than the small restructure suggestion that Michael had, can this patch be merged into the for-2.6.22 tree? Pradeep From rdreier at cisco.com Thu May 10 14:31:36 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 May 2007 14:31:36 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <46438DF2.3080601@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Thu, 10 May 2007 14:26:10 -0700") References: <4641E99B.10706@linux.vnet.ibm.com> <46438DF2.3080601@linux.vnet.ibm.com> Message-ID: I need to read over the whole thread when I get back from my trip... so it will be next week at the earliest. 
From rdreier at cisco.com Thu May 10 14:32:12 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 10 May 2007 14:32:12 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <46438DF2.3080601@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Thu, 10 May 2007 14:26:10 -0700") References: <4641E99B.10706@linux.vnet.ibm.com> <46438DF2.3080601@linux.vnet.ibm.com> Message-ID: > If there are no other issues than the small restructure suggestion that > Michael had, can this patch be merged into the for-2.6.22 tree? by the way, have you adressed that suggestion? From xma at us.ibm.com Thu May 10 14:34:49 2007 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 10 May 2007 14:34:49 -0700 Subject: [ofa-general] RFC: location for IB CM statistics In-Reply-To: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com> Message-ID: Another place is /proc. Networking uses /proc/net for statistics, can we have something like /proc/infiniband? Thanks Shirley Ma -------------- next part -------------- An HTML attachment was scrubbed... URL:
From pradeeps at linux.vnet.ibm.com Thu May 10 14:49:01 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 10 May 2007 14:49:01 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: References: <4641E99B.10706@linux.vnet.ibm.com> <46438DF2.3080601@linux.vnet.ibm.com> Message-ID: <4643934D.4080306@linux.vnet.ibm.com> Roland Dreier wrote: > > If there are no other issues than the small restructure suggestion that > > Michael had, can this patch be merged into the for-2.6.22 tree? > > by the way, have you adressed that suggestion? > In the submitted patch I have conditional branch in ipoib_cm_handle_rx_wc(), to handle_rx_wc_srq() or handle_rx_wc_nosrq(), and he wanted me to have separate handlers for the SRQ and NOSRQ case. Changing the code to do that would make ipoib_poll() very messy and so I have not done that. I feel that should not stand in the way of merging this patch. Pradeep From pradeeps at linux.vnet.ibm.com Thu May 10 14:50:45 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 10 May 2007 14:50:45 -0700 Subject: [ofa-general] RFC: location for IB CM statistics In-Reply-To: References: Message-ID: <464393B5.9030109@linux.vnet.ibm.com> Shirley Ma wrote: > Another place is /proc. Networking uses /proc/net for statistics, can we > have something like /proc/infiniband? > > Thanks > Shirley Ma > My understanding is that the current thinking is /proc is for process related info and the rest goes into sysfs. Pradeep From rick.jones2 at hp.com Thu May 10 15:00:19 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 10 May 2007 15:00:19 -0700 Subject: [ofa-general] is there an OFED way to putt VPD from an HCA? 
In-Reply-To: <46436275.80406@systemfabricworks.com>
References: <46434FF1.9020005@hp.com> <46436275.80406@systemfabricworks.com>
Message-ID: <464395F3.9080004@hp.com>

Seems that none of those utilities went into Debian, which was the base
distro I installed, and then on top of which I put the 2.6.21.1 kernel.
Of course I'm still having that "gcc rpm" not found issue trying to grab
the whole OFED 1.2 from 5/10, and an attempt to compile the ofa_kernel
from 5/10 ended up with some asm-related errors which sadly I've not
saved, but could I suppose recreate.

At this point I'm not sure whether I need to lay down a fresh set of
kernel sources to allow things to patch correctly.

rick jones

From raleigh at systemfabricworks.com Thu May 10 15:11:14 2007
From: raleigh at systemfabricworks.com (Raleigh F Rinehart)
Date: Thu, 10 May 2007 17:11:14 -0500
Subject: [ofa-general] is there an OFED way to putt VPD from an HCA?
In-Reply-To: <464395F3.9080004@hp.com>
References: <46434FF1.9020005@hp.com> <46436275.80406@systemfabricworks.com> <464395F3.9080004@hp.com>
Message-ID: <46439882.7070601@systemfabricworks.com>

I don't know about the other issues as I haven't tried installing on
Debian, but on my SLES 10 machine I had to build mstflint by hand from
the sources in the OFED-1.1 release tarball. The standard ofed
install.sh script wouldn't build if I included it. A simple 'make' run
in "OFED-1.1/SOURCES/openib-1.1/src/userspace/mstflint" worked just fine
for me.

thanks,
-raleigh

Rick Jones wrote:
> Seems that none of those utilities went into Debian, which was the
> base distro I installed, and then on top of which I put the 2.6.21.1
> kernel. Of course I'm still having that "gcc rpm" not found issue
> trying to grab the whole OFED 1.2 from 5/10, and an attempt to compile
> the ofa_kernel from 5/10 ended up with some asm-related errors which
> sadly I've not saved, but could I suppose recreate.
>
> At this point I'm not sure whether I need to lay down a fresh set of
> kernel sources to allow things to patch correctly.
>
> rick jones
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 3285 bytes
Desc: S/MIME Cryptographic Signature
URL:

From jimmy at hillraiser.com Thu May 10 15:33:08 2007
From: jimmy at hillraiser.com (Jimmy Hill)
Date: Thu, 10 May 2007 22:33:08 +0000
Subject: [ofa-general] verbs abi_compat
Message-ID: <20070510223308.11276.qmail@station183.com>

>
> abi_compat has nothing to do with __ibv_alloc_pd vs. __ibv_alloc_pd_1_0.
> Rather, that choice is made based on whether your app is linked
> against the IBVERBS_1.1 or IBVERBS_1.0 ABI. If you link against the
> new library, you should get all IBVERBS_1.1 symbols; if you link
> against libibverbs 1.0, you should get all IBVERBS_1.0 symbols.
>
> Your problem might be that your app is getting __ibv_alloc_pd_1_0, but
> it gets __ibv_open_device instead of __ibv_open_device_1_0 so the
> context passed into __ibv_alloc_pd_1_0 is wrong. Are you possibly
> relinking only part of your app or something?
>
In my makefile I was linking against ibverbs (libibverbs.so.1) and
rdmacm (librdmacm.so.1). In the code, I create an RDMA CM ID and use
the context from it to create my other IB resources (PDs, etc.). (It is
a long story why this is necessary, but suffice it to say I am doing
something similar to the uDAPL implementation.) I initially thought I
had a compatibility issue between the ibverbs I was linking against and
the ibverbs that rdmacm was using, but there was only one copy of each
on my system. That is what pushed me down the abi_compat path. It turns
out that another module in my exe was linking with the DAT library,
which apparently pulled in the 1.0 verbs and thus the compat stuff.
I removed the reference to the DAT library, and now my context from the
RDMA CM ID gets me to the correct verbs.

Thanks for your help!

From sashak at voltaire.com Thu May 10 15:43:08 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 11 May 2007 01:43:08 +0300
Subject: [ofa-general] [PATCH] opensm: more osm_*_construct/_init/_destroy cleanups
In-Reply-To: <20070509212740.GV9692@sashak.voltaire.com>
References: <20070506174138.GI9692@sashak.voltaire.com> <20070506174431.GJ9692@sashak.voltaire.com> <1178543690.32222.350646.camel@hal.voltaire.com> <20070509212740.GV9692@sashak.voltaire.com>
Message-ID: <20070510224308.GH9692@sashak.voltaire.com>

Hi Hal,

As suggested :)

This removes, or makes static, the unused
osm_*_construct/_init/_destroy initializers for OpenSM objects where
osm_*_new/_delete are actually used.

Signed-off-by: Sasha Khapyorsky
---
 osm/include/opensm/osm_inform.h      |  57 ++---------------
 osm/include/opensm/osm_lin_fwd_tbl.h |   5 +-
 osm/include/opensm/osm_mcm_info.h    |  93 +--------------------------
 osm/include/opensm/osm_mcm_port.h    | 117 +++-------------------------------
 osm/include/opensm/osm_mtree.h       |  88 -------------------------
 osm/include/opensm/osm_multicast.h   | 104 ++----------------------------
 osm/include/opensm/osm_router.h      | 106 ++----------------------------
 osm/include/opensm/osm_service.h     |  39 ++---------
 osm/opensm/osm_inform.c              |  32 +--------
 osm/opensm/osm_mcast_mgr.c           |   2 +-
 osm/opensm/osm_mcm_info.c            |  23 +------
 osm/opensm/osm_mcm_port.c            |  28 +--------
 osm/opensm/osm_mtree.c               |   4 +-
 osm/opensm/osm_multicast.c           |  18 +----
 osm/opensm/osm_router.c              |  46 +------------
 osm/opensm/osm_sa_mcmember_record.c  |   4 +-
 osm/opensm/osm_sa_service_record.c   |   4 +-
 osm/opensm/osm_service.c             |  13 +---
 osm/opensm/osm_subnet.c              |   4 +-
 19 files changed, 64 insertions(+), 723 deletions(-)

diff --git a/osm/include/opensm/osm_inform.h b/osm/include/opensm/osm_inform.h
index 3e8e122..57ab05c 100644
--- a/osm/include/opensm/osm_inform.h
+++
b/osm/include/opensm/osm_inform.h @@ -154,57 +154,12 @@ osm_infr_new( * Allows calling other service record methods. * * SEE ALSO -* Inform Record, osm_infr_construct, osm_infr_destroy +* Inform Record, osm_infr_delete *********/ -/****f* OpenSM: Inform Record/osm_infr_init +/****f* OpenSM: Inform Record/osm_infr_delete * NAME -* osm_infr_new -* -* DESCRIPTION -* Initializes the osm_infr_t structure. -* -* SYNOPSIS -*/ -void -osm_infr_init( - IN osm_infr_t* const p_infr, - IN const osm_infr_t *p_infr_rec ); -/* -* PARAMETERS -* p_infr -* [in] Pointer to osm_infr_t structure -* p_inf_rec -* [in] Pointer to the ib_inform_info_record_t -* -* SEE ALSO -* Inform Record, osm_infr_construct, osm_infr_destroy -*********/ - -/****f* OpenSM: Inform Record/osm_infr_construct -* NAME -* osm_infr_construct -* -* DESCRIPTION -* Constructs the osm_infr_t structure. -* -* SYNOPSIS -*/ -void -osm_infr_construct( - IN osm_infr_t* const p_infr ); -/* -* PARAMETERS -* p_infr -* [in] Pointer to osm_infr_t structure -* -* SEE ALSO -* Inform Record, osm_infr_construct, osm_infr_destroy -*********/ - -/****f* OpenSM: Inform Record/osm_infr_destroy -* NAME -* osm_infr_destroy +* osm_infr_delete * * DESCRIPTION * Constructs the osm_infr_t structure. 
@@ -212,7 +167,7 @@ osm_infr_construct( * SYNOPSIS */ void -osm_infr_destroy( +osm_infr_delete( IN osm_infr_t* const p_infr ); /* * PARAMETERS @@ -220,7 +175,7 @@ osm_infr_destroy( * [in] Pointer to osm_infr_t structure * * SEE ALSO -* Inform Record, osm_infr_construct, osm_infr_destroy +* Inform Record, osm_infr_new *********/ /****f* OpenSM: Inform Record/osm_infr_get_by_rec @@ -251,7 +206,7 @@ osm_infr_get_by_rec( * RETURN * The matching osm_infr_t * SEE ALSO -* Inform Record, osm_infr_construct, osm_infr_destroy +* Inform Record, osm_infr_new, osm_infr_delete *********/ void diff --git a/osm/include/opensm/osm_lin_fwd_tbl.h b/osm/include/opensm/osm_lin_fwd_tbl.h index e059020..26d0465 100644 --- a/osm/include/opensm/osm_lin_fwd_tbl.h +++ b/osm/include/opensm/osm_lin_fwd_tbl.h @@ -93,9 +93,8 @@ BEGIN_C_DECLS */ typedef struct _osm_lin_fwd_tbl { - uint16_t size; - uint8_t port_tbl[1]; - + uint16_t size; + uint8_t port_tbl[1]; } osm_lin_fwd_tbl_t; /* * FIELDS diff --git a/osm/include/opensm/osm_mcm_info.h b/osm/include/opensm/osm_mcm_info.h index b6f6ee2..48d61f5 100644 --- a/osm/include/opensm/osm_mcm_info.h +++ b/osm/include/opensm/osm_mcm_info.h @@ -79,9 +79,8 @@ BEGIN_C_DECLS */ typedef struct _osm_mcm_info { - cl_list_item_t list_item; - ib_net16_t mlid; - + cl_list_item_t list_item; + ib_net16_t mlid; } osm_mcm_info_t; /* * FIELDS @@ -94,94 +93,6 @@ typedef struct _osm_mcm_info * SEE ALSO *********/ -/****f* OpenSM: Multicast Member Info/osm_mcm_info_construct -* NAME -* osm_mcm_info_construct -* -* DESCRIPTION -* This function constructs a Multicast Member Info object. -* -* SYNOPSIS -*/ -static inline void -osm_mcm_info_construct( - IN osm_mcm_info_t* const p_mcm ) -{ - memset( p_mcm, 0, sizeof(*p_mcm) ); -} -/* -* PARAMETERS -* p_mcm -* [in] Pointer to a Multicast Member Info object to construct. -* -* RETURN VALUE -* This function does not return a value. 
-* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Multicast Member Info/osm_mcm_info_destroy -* NAME -* osm_mcm_info_destroy -* -* DESCRIPTION -* The osm_mcm_info_destroy function destroys the object, releasing -* all resources. -* -* SYNOPSIS -*/ -void -osm_mcm_info_destroy( - IN osm_mcm_info_t* const p_mcm ); -/* -* PARAMETERS -* p_mcm -* [in] Pointer to a Multicast Member Info object to destroy. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Performs any necessary cleanup of the specified Multicast Member Info object. -* Further operations should not be attempted on the destroyed object. -* This function should only be called after a call to osm_mtree_construct or -* osm_mtree_init. -* -* SEE ALSO -* Multicast Member Info object, osm_mtree_construct, osm_mtree_init -*********/ - -/****f* OpenSM: Multicast Member Info/osm_mcm_info_init -* NAME -* osm_mcm_info_init -* -* DESCRIPTION -* Initializes a Multicast Member Info object for use. -* -* SYNOPSIS -*/ -void -osm_mcm_info_init( - IN osm_mcm_info_t* const p_mcm, - IN const ib_net16_t mlid ); -/* -* PARAMETERS -* p_mcm -* [in] Pointer to an osm_mcm_info_t object to initialize. -* -* mlid -* [in] MLID value for this multicast group. -* -* RETURN VALUES -* None. -* -* NOTES -* -* SEE ALSO -*********/ - /****f* OpenSM: Multicast Member Info/osm_mcm_info_new * NAME * osm_mcm_info_new diff --git a/osm/include/opensm/osm_mcm_port.h b/osm/include/opensm/osm_mcm_port.h index df30b84..634b3c7 100644 --- a/osm/include/opensm/osm_mcm_port.h +++ b/osm/include/opensm/osm_mcm_port.h @@ -103,112 +103,13 @@ typedef struct _osm_mcm_port * MCM Port Object *********/ -/****f* OpenSM: MCM Port Object/osm_mcm_port_construct +/****f* OpenSM: MCM Port Object/osm_mcm_port_new * NAME -* osm_mcm_port_construct +* osm_mcm_port_new * * DESCRIPTION -* This function constructs a MCM Port object. 
-* -* SYNOPSIS -*/ -void -osm_mcm_port_construct( - IN osm_mcm_port_t* const p_mcm ); -/* -* PARAMETERS -* p_mcm -* [in] Pointer to a MCM Port Object to construct. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Allows calling osm_mcm_port_init, osm_mcm_port_destroy. -* -* Calling osm_mcm_port_construct is a prerequisite to calling any other -* method except osm_mcm_port_init. -* -* SEE ALSO -* MCM Port Object, osm_mcm_port_init, osm_mcm_port_destroy -*********/ - -/****f* OpenSM: MCM Port Object/osm_mcm_port_destroy -* NAME -* osm_mcm_port_destroy -* -* DESCRIPTION -* The osm_mcm_port_destroy function destroys a MCM Port Object, releasing -* all resources. -* -* SYNOPSIS -*/ -void -osm_mcm_port_destroy( - IN osm_mcm_port_t* const p_mcm ); -/* -* PARAMETERS -* p_mcm -* [in] Pointer to a MCM Port Object to destroy. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Performs any necessary cleanup of the specified MCM Port Object. -* Further operations should not be attempted on the destroyed object. -* This function should only be called after a call to -* osm_mcm_port_construct or osm_mcm_port_init. -* -* SEE ALSO -* MCM Port Object, osm_mcm_port_construct, osm_mcm_port_init -*********/ - -/****f* OpenSM: MCM Port Object/osm_mcm_port_init -* NAME -* osm_mcm_port_init -* -* DESCRIPTION -* The osm_mcm_port_init function initializes a MCM Port Object for use. -* -* SYNOPSIS -*/ -void -osm_mcm_port_init( - IN osm_mcm_port_t* const p_mcm, - IN const ib_gid_t* const p_port_gid, - IN const uint8_t scope_state, - IN const boolean_t proxy_join ); -/* -* PARAMETERS -* p_mcm -* [in] Pointer to an osm_mcm_port_t object to initialize. -* -* p_port_gid -* [in] Pointer to the GID of the port to add to the multicast group. -* -* scope_state -* [in] scope state of the join request -* -* proxy_join -* [in] proxy_join state analyzed from the request -* -* RETURN VALUES -* None. 
-* -* NOTES -* Allows calling other MCM Port Object methods. -* -* SEE ALSO -* MCM Port Object, osm_mcm_port_construct, osm_mcm_port_destroy, -*********/ - -/****f* OpenSM: MCM Port Object/osm_mcm_port_init -* NAME -* osm_mcm_port_init -* -* DESCRIPTION -* The osm_mcm_port_init function initializes a MCM Port Object for use. +* The osm_mcm_port_new function allocates and initializes a +* MCM Port Object for use. * * SYNOPSIS */ @@ -234,15 +135,15 @@ osm_mcm_port_new( * NOTES * * SEE ALSO -* MCM Port Object, osm_mcm_port_construct, osm_mcm_port_destroy, +* MCM Port Object, osm_mcm_port_delete, *********/ -/****f* OpenSM: MCM Port Object/osm_mcm_port_destroy +/****f* OpenSM: MCM Port Object/osm_mcm_port_delete * NAME -* osm_mcm_port_destroy +* osm_mcm_port_delete * * DESCRIPTION -* The osm_mcm_port_destroy function destroys and dellallocates an +* The osm_mcm_port_delete function destroys and dellallocates an * MCM Port Object, releasing all resources. * * SYNOPSIS @@ -261,7 +162,7 @@ osm_mcm_port_delete( * NOTES * * SEE ALSO -* MCM Port Object, osm_mcm_port_construct, osm_mcm_port_init +* MCM Port Object, osm_mcm_port_new *********/ END_C_DECLS diff --git a/osm/include/opensm/osm_mtree.h b/osm/include/opensm/osm_mtree.h index d349dc8..aa02cbb 100644 --- a/osm/include/opensm/osm_mtree.h +++ b/osm/include/opensm/osm_mtree.h @@ -138,94 +138,6 @@ typedef struct _osm_mtree_node * SEE ALSO *********/ -/****f* OpenSM: Multicast Tree/osm_mtree_node_construct -* NAME -* osm_mtree_node_construct -* -* DESCRIPTION -* This function constructs a Multicast Tree Node object. -* -* SYNOPSIS -*/ -static inline void -osm_mtree_node_construct( - IN osm_mtree_node_t* const p_mtn ) -{ - memset( p_mtn, 0, sizeof(*p_mtn) ); -} -/* -* PARAMETERS -* p_mtn -* [in] Pointer to a Multicast Tree Node object to construct. -* -* RETURN VALUE -* This function does not return a value. 
-* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Multicast Tree/osm_mtree_node_destroy -* NAME -* osm_mtree_node_destroy -* -* DESCRIPTION -* The osm_mtree_node_destroy function destroys a node, releasing -* all resources. -* -* SYNOPSIS -*/ -void -osm_mtree_node_destroy( - IN osm_mtree_node_t* const p_mtn ); -/* -* PARAMETERS -* p_mtn -* [in] Pointer to a Multicast Tree Node object to destroy. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Performs any necessary cleanup of the specified Multicast Tree object. -* Further operations should not be attempted on the destroyed object. -* This function should only be called after a call to osm_mtree_construct or -* osm_mtree_init. -* -* SEE ALSO -* Multicast Tree object, osm_mtree_construct, osm_mtree_init -*********/ - -/****f* OpenSM: Multicast Tree/osm_mtree_node_init -* NAME -* osm_mtree_node_init -* -* DESCRIPTION -* Initializes a Multicast Tree Node object for use. -* -* SYNOPSIS -*/ -void -osm_mtree_node_init( - IN osm_mtree_node_t* const p_mtn, - IN const osm_switch_t* const p_sw ); -/* -* PARAMETERS -* p_mtn -* [in] Pointer to an osm_mtree_node_t object to initialize. -* -* p_sw -* [in] Pointer to the switch represented by this node. -* -* RETURN VALUES -* None. 
-* -* NOTES -* -* SEE ALSO -*********/ - /****f* OpenSM: Multicast Tree/osm_mtree_node_new * NAME * osm_mtree_node_new diff --git a/osm/include/opensm/osm_multicast.h b/osm/include/opensm/osm_multicast.h index b247e01..13a6fd1 100644 --- a/osm/include/opensm/osm_multicast.h +++ b/osm/include/opensm/osm_multicast.h @@ -126,9 +126,9 @@ osm_get_mcast_req_type_str( */ typedef struct osm_mcast_mgr_ctxt { - ib_net16_t mlid; - osm_mcast_req_type_t req_type; - ib_net64_t port_guid; + ib_net16_t mlid; + osm_mcast_req_type_t req_type; + ib_net64_t port_guid; } osm_mcast_mgr_ctxt_t; /* * FIELDS @@ -246,98 +246,6 @@ typedef void (*osm_mgrp_func_t)( * SEE ALSO *********/ -/****f* OpenSM: Multicast Group/osm_mgrp_construct -* NAME -* osm_mgrp_construct -* -* DESCRIPTION -* This function constructs a Multicast Group. -* -* SYNOPSIS -*/ -void -osm_mgrp_construct( - IN osm_mgrp_t* const p_mgrp ); -/* -* PARAMETERS -* p_mgrp -* [in] Pointer to a Multicast Group to construct. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Allows calling osm_mgrp_init, osm_mgrp_destroy. -* -* Calling osm_mgrp_construct is a prerequisite to calling any other -* method except osm_mgrp_init. -* -* SEE ALSO -* Multicast Group, osm_mgrp_init, osm_mgrp_destroy -*********/ - -/****f* OpenSM: Multicast Group/osm_mgrp_destroy -* NAME -* osm_mgrp_destroy -* -* DESCRIPTION -* The osm_mgrp_destroy function destroys a Multicast Group, releasing -* all resources. -* -* SYNOPSIS -*/ -void -osm_mgrp_destroy( - IN osm_mgrp_t* const p_mgrp ); -/* -* PARAMETERS -* p_mgrp -* [in] Pointer to a Muticast Group to destroy. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Performs any necessary cleanup of the specified Multicast Group. -* Further operations should not be attempted on the destroyed object. -* This function should only be called after a call to osm_mgrp_construct or -* osm_mgrp_init. 
-* -* SEE ALSO -* Multicast Group, osm_mgrp_construct, osm_mgrp_init -*********/ - -/****f* OpenSM: Multicast Group/osm_mgrp_init -* NAME -* osm_mgrp_init -* -* DESCRIPTION -* The osm_mgrp_init function initializes a Multicast Group for use. -* -* SYNOPSIS -*/ -void -osm_mgrp_init( - IN osm_mgrp_t* const p_mgrp, - IN const ib_net16_t mlid ); -/* -* PARAMETERS -* p_mgrp -* [in] Pointer to an osm_mgrp_t object to initialize. -* -* mlid -* [in] Multicast LID for this multicast group. -* -* RETURN VALUES -* None. -* -* NOTES -* Allows calling other Multicast Group methods. -* -* SEE ALSO -* Multicast Group, osm_mgrp_construct, osm_mgrp_destroy, -*********/ - /****f* OpenSM: Multicast Group/osm_mgrp_new * NAME * osm_mgrp_new @@ -362,7 +270,7 @@ osm_mgrp_new( * Allows calling other Multicast Group methods. * * SEE ALSO -* Multicast Group, osm_mgrp_construct, osm_mgrp_destroy, +* Multicast Group, osm_mgrp_delete, *********/ /****f* OpenSM: Multicast Group/osm_mgrp_delete @@ -388,7 +296,7 @@ osm_mgrp_delete( * NOTES * * SEE ALSO -* Multicast Group, osm_mgrp_construct, osm_mgrp_destroy, +* Multicast Group, osm_mgrp_new *********/ /****f* OpenSM: Multicast Group/osm_mgrp_is_guid @@ -568,7 +476,7 @@ osm_mgrp_is_port_present( void osm_mgrp_remove_port( IN osm_subn_t* const p_subn, - IN osm_log_t* const p_log, + IN osm_log_t* const p_log, IN osm_mgrp_t* const p_mgrp, IN const ib_net64_t port_guid ); /* diff --git a/osm/include/opensm/osm_router.h b/osm/include/opensm/osm_router.h index 49c5b46..db1dc13 100644 --- a/osm/include/opensm/osm_router.h +++ b/osm/include/opensm/osm_router.h @@ -100,8 +100,8 @@ BEGIN_C_DECLS */ typedef struct _osm_router { - cl_map_item_t map_item; - osm_port_t *p_port; + cl_map_item_t map_item; + osm_port_t *p_port; } osm_router_t; /* * FIELDS @@ -115,70 +115,9 @@ typedef struct _osm_router * Router object *********/ -/****f* OpenSM: Router/osm_router_construct +/****f* OpenSM: Router/osm_router_delete * NAME -* osm_router_construct -* -* DESCRIPTION 
-* This function constructs a Router object. -* -* SYNOPSIS -*/ -void -osm_router_construct( - IN osm_router_t* const p_rtr ); -/* -* PARAMETERS -* p_rtr -* [in] Pointer to a Router object to construct. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Allows calling osm_router_init, and osm_router_destroy. -* -* Calling osm_router_construct is a prerequisite to calling any other -* method except osm_router_init. -* -* SEE ALSO -* Router object, osm_router_init, osm_router_destroy -*********/ - -/****f* OpenSM: Router/osm_router_destroy -* NAME -* osm_router_destroy -* -* DESCRIPTION -* The osm_router_destroy function destroys the object, releasing -* all resources. -* -* SYNOPSIS -*/ -void -osm_router_destroy( - IN osm_router_t* const p_rtr ); -/* -* PARAMETERS -* p_rtr -* [in] Pointer to the object to destroy. -* -* RETURN VALUE -* None. -* -* NOTES -* Performs any necessary cleanup of the specified object. -* Further operations should not be attempted on the destroyed object. -* This function should only be called after a call to osm_router_construct -* or osm_router_init. -* -* SEE ALSO -* Router object, osm_router_construct, osm_router_init -*********/ - -/****f* OpenSM: Router/osm_router_destroy -* NAME -* osm_router_destroy +* osm_router_delete * * DESCRIPTION * Destroys and deallocates the object. @@ -199,38 +138,7 @@ osm_router_delete( * NOTES * * SEE ALSO -* Router object, osm_router_construct, osm_router_init -*********/ - -/****f* OpenSM: Router/osm_router_init -* NAME -* osm_router_init -* -* DESCRIPTION -* The osm_router_init function initializes a Router object for use. -* -* SYNOPSIS -*/ -ib_api_status_t -osm_router_init( - IN osm_router_t* const p_rtr, - IN osm_port_t* const p_port ); -/* -* PARAMETERS -* p_rtr -* [in] Pointer to an osm_router_t object to initialize. -* -* p_port -* [in] Pointer to the port object of this router -* -* RETURN VALUES -* IB_SUCCESS if the Router object was initialized successfully. 
-* -* NOTES -* Allows calling other node methods. -* -* SEE ALSO -* Router object, osm_router_construct, osm_router_destroy +* Router object, osm_router_new *********/ /****f* OpenSM: Router/osm_router_new @@ -238,7 +146,7 @@ osm_router_init( * osm_router_new * * DESCRIPTION -* The osm_router_init function initializes a Router object for use. +* The osm_router_new function initializes a Router object for use. * * SYNOPSIS */ @@ -256,7 +164,7 @@ osm_router_new( * NOTES * * SEE ALSO -* Router object, osm_router_construct, osm_router_destroy, +* Router object, osm_router_new, *********/ /****f* OpenSM: Router/osm_router_get_port_ptr diff --git a/osm/include/opensm/osm_service.h b/osm/include/opensm/osm_service.h index 2470650..7c7434c 100644 --- a/osm/include/opensm/osm_service.h +++ b/osm/include/opensm/osm_service.h @@ -146,13 +146,13 @@ osm_svcr_new( * Allows calling other service record methods. * * SEE ALSO -* Service Record, osm_svcr_construct, osm_svcr_destroy +* Service Record, osm_svcr_delete *********/ /****f* OpenSM: Service Record/osm_svcr_init * NAME -* osm_svcr_new +* osm_svcr_init * * DESCRIPTION * Initializes the osm_svcr_t structure. @@ -171,41 +171,20 @@ osm_svcr_init( * [in] Pointer to the ib_service_record_t * * SEE ALSO -* Service Record, osm_svcr_construct, osm_svcr_destroy -*********/ - -/****f* OpenSM: Service Record/osm_svcr_construct -* NAME -* osm_svcr_construct -* -* DESCRIPTION -* Constructs the osm_svcr_t structure. -* -* SYNOPSIS -*/ -void -osm_svcr_construct( - IN osm_svcr_t* const p_svcr ); -/* -* PARAMETERS -* p_svc_rec -* [in] Pointer to osm_svcr_t structure -* -* SEE ALSO -* Service Record, osm_svcr_construct, osm_svcr_destroy +* Service Record *********/ -/****f* OpenSM: Service Record/osm_svcr_destroy +/****f* OpenSM: Service Record/osm_svcr_delete * NAME -* osm_svcr_destroy +* osm_svcr_delete * * DESCRIPTION -* Constructs the osm_svcr_t structure. +* Deallocates the osm_svcr_t structure. 
* * SYNOPSIS */ void -osm_svcr_destroy( +osm_svcr_delete( IN osm_svcr_t* const p_svcr ); /* * PARAMETERS @@ -213,11 +192,9 @@ osm_svcr_destroy( * [in] Pointer to osm_svcr_t structure * * SEE ALSO -* Service Record, osm_svcr_construct, osm_svcr_destroy +* Service Record, osm_svcr_new *********/ - - osm_svcr_t* osm_svcr_get_by_rid( IN osm_subn_t const *p_subn, diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c index e66c259..0069b59 100644 --- a/osm/opensm/osm_inform.c +++ b/osm/opensm/osm_inform.c @@ -67,16 +67,7 @@ typedef struct _osm_infr_match_ctxt /********************************************************************** **********************************************************************/ void -osm_infr_construct( - IN osm_infr_t* const p_infr ) -{ - memset( p_infr, 0, sizeof(osm_infr_t) ); -} - -/********************************************************************** - **********************************************************************/ -void -osm_infr_destroy( +osm_infr_delete( IN osm_infr_t* const p_infr ) { free( p_infr ); @@ -84,21 +75,6 @@ osm_infr_destroy( /********************************************************************** **********************************************************************/ -void -osm_infr_init( - IN osm_infr_t* const p_infr, - IN const osm_infr_t *p_infr_rec ) -{ - CL_ASSERT( p_infr ); - - /* what else do we need in the inform_record ??? 
*/ - - /* copy the contents of the provided informinfo */ - memcpy( p_infr, p_infr_rec, sizeof(osm_infr_t) ); -} - -/********************************************************************** - **********************************************************************/ osm_infr_t* osm_infr_new( IN const osm_infr_t *p_infr_rec ) @@ -109,9 +85,7 @@ osm_infr_new( p_infr = (osm_infr_t*)malloc( sizeof(osm_infr_t) ); if( p_infr ) - { - osm_infr_init( p_infr, p_infr_rec ); - } + memcpy( p_infr, p_infr_rec, sizeof(osm_infr_t) ); return( p_infr ); } @@ -369,7 +343,7 @@ osm_infr_remove_from_db( cl_qlist_remove_item( &p_subn->sa_infr_list, &p_infr->list_item ); - osm_infr_destroy( p_infr ); + osm_infr_delete( p_infr ); OSM_LOG_EXIT( p_log ); } diff --git a/osm/opensm/osm_mcast_mgr.c b/osm/opensm/osm_mcast_mgr.c index f5059c9..508dd72 100644 --- a/osm/opensm/osm_mcast_mgr.c +++ b/osm/opensm/osm_mcast_mgr.c @@ -1697,7 +1697,7 @@ osm_mcast_mgr_process_mgrp_cb( cl_qmap_remove_item(&p_mgr->p_subn->mgrp_mlid_tbl, (cl_map_item_t *)p_mgrp ); - osm_mgrp_destroy(p_mgrp); + osm_mgrp_delete(p_mgrp); } } diff --git a/osm/opensm/osm_mcm_info.c b/osm/opensm/osm_mcm_info.c index f250c36..a550a1c 100644 --- a/osm/opensm/osm_mcm_info.c +++ b/osm/opensm/osm_mcm_info.c @@ -55,26 +55,6 @@ /********************************************************************** **********************************************************************/ -void -osm_mcm_info_destroy( - IN osm_mcm_info_t* const p_mcm ) -{ - CL_ASSERT( p_mcm ); -} - -/********************************************************************** - **********************************************************************/ -void -osm_mcm_info_init( - IN osm_mcm_info_t* const p_mcm, - IN const ib_net16_t mlid ) -{ - CL_ASSERT( p_mcm ); - p_mcm->mlid = mlid; -} - -/********************************************************************** - **********************************************************************/ osm_mcm_info_t* osm_mcm_info_new( IN const ib_net16_t 
mlid ) @@ -85,7 +65,7 @@ osm_mcm_info_new( if( p_mcm ) { memset(p_mcm, 0, sizeof(*p_mcm) ); - osm_mcm_info_init( p_mcm, mlid ); + p_mcm->mlid = mlid; } return( p_mcm ); @@ -97,6 +77,5 @@ void osm_mcm_info_delete( IN osm_mcm_info_t* const p_mcm ) { - osm_mcm_info_destroy( p_mcm ); free( p_mcm ); } diff --git a/osm/opensm/osm_mcm_port.c b/osm/opensm/osm_mcm_port.c index b617a9e..9e4dfe0 100644 --- a/osm/opensm/osm_mcm_port.c +++ b/osm/opensm/osm_mcm_port.c @@ -56,39 +56,16 @@ /********************************************************************** **********************************************************************/ -void -osm_mcm_port_construct( - IN osm_mcm_port_t* const p_mcm ) -{ - memset( p_mcm, 0, sizeof(*p_mcm) ); -} - -/********************************************************************** - **********************************************************************/ -void -osm_mcm_port_destroy( - IN osm_mcm_port_t* const p_mcm ) -{ - /* - Nothing to do? - */ - UNUSED_PARAM( p_mcm ); -} - -/********************************************************************** - **********************************************************************/ -void +static void osm_mcm_port_init( IN osm_mcm_port_t* const p_mcm, IN const ib_gid_t* const p_port_gid, IN const uint8_t scope_state, IN const boolean_t proxy_join ) { - CL_ASSERT( p_mcm ); CL_ASSERT( p_port_gid ); CL_ASSERT( scope_state ); - osm_mcm_port_construct( p_mcm ); p_mcm->port_gid = *p_port_gid; p_mcm->scope_state = scope_state; p_mcm->proxy_join = proxy_join; @@ -107,6 +84,7 @@ osm_mcm_port_new( p_mcm = malloc( sizeof(*p_mcm) ); if( p_mcm ) { + memset( p_mcm, 0, sizeof(*p_mcm) ); osm_mcm_port_init( p_mcm, p_port_gid, scope_state, proxy_join ); } @@ -121,7 +99,5 @@ osm_mcm_port_delete( IN osm_mcm_port_t* const p_mcm ) { CL_ASSERT( p_mcm ); - - osm_mcm_port_destroy( p_mcm ); free( p_mcm ); } diff --git a/osm/opensm/osm_mtree.c b/osm/opensm/osm_mtree.c index 90576d3..c69f46c 100644 --- a/osm/opensm/osm_mtree.c +++ 
b/osm/opensm/osm_mtree.c @@ -55,7 +55,7 @@ /********************************************************************** **********************************************************************/ -void +static void osm_mtree_node_init( IN osm_mtree_node_t* const p_mtn, IN const osm_switch_t* const p_sw ) @@ -65,7 +65,7 @@ osm_mtree_node_init( CL_ASSERT( p_mtn ); CL_ASSERT( p_sw ); - osm_mtree_node_construct( p_mtn ); + memset( p_mtn, 0, sizeof(*p_mtn) ); p_mtn->p_sw = (osm_switch_t*)p_sw; p_mtn->max_children = p_sw->num_ports; diff --git a/osm/opensm/osm_multicast.c b/osm/opensm/osm_multicast.c index 2538a9a..db79fcd 100644 --- a/osm/opensm/osm_multicast.c +++ b/osm/opensm/osm_multicast.c @@ -77,17 +77,7 @@ osm_get_mcast_req_type_str( /********************************************************************** **********************************************************************/ void -osm_mgrp_construct( - IN osm_mgrp_t* const p_mgrp ) -{ - memset( p_mgrp, 0, sizeof(*p_mgrp) ); - cl_qmap_init( &p_mgrp->mcm_port_tbl ); -} - -/********************************************************************** - **********************************************************************/ -void -osm_mgrp_destroy( +osm_mgrp_delete( IN osm_mgrp_t* const p_mgrp ) { osm_mcm_port_t *p_mcm_port; @@ -110,15 +100,15 @@ osm_mgrp_destroy( /********************************************************************** **********************************************************************/ -void +static void osm_mgrp_init( IN osm_mgrp_t* const p_mgrp, IN const ib_net16_t mlid ) { - CL_ASSERT( p_mgrp ); CL_ASSERT( cl_ntoh16( mlid ) >= IB_LID_MCAST_START_HO ); - osm_mgrp_construct( p_mgrp ); + memset( p_mgrp, 0, sizeof(*p_mgrp) ); + cl_qmap_init( &p_mgrp->mcm_port_tbl ); p_mgrp->mlid = mlid; p_mgrp->last_change_id = 0; p_mgrp->last_tree_id = 0; diff --git a/osm/opensm/osm_router.c b/osm/opensm/osm_router.c index 4b6470c..544afec 100644 --- a/osm/opensm/osm_router.c +++ b/osm/opensm/osm_router.c @@ -50,54 +50,15 @@ 
#include #include -#include #include #include /********************************************************************** **********************************************************************/ void -osm_router_construct( - IN osm_router_t* const p_rtr ) -{ - CL_ASSERT( p_rtr ); - memset( p_rtr, 0, sizeof(*p_rtr) ); -} - -/********************************************************************** - **********************************************************************/ -ib_api_status_t -osm_router_init( - IN osm_router_t* const p_rtr, - IN osm_port_t* const p_port ) -{ - ib_api_status_t status = IB_SUCCESS; - - CL_ASSERT( p_rtr ); - CL_ASSERT( p_port ); - - osm_router_construct( p_rtr ); - - p_rtr->p_port = p_port; - - return( status ); -} - -/********************************************************************** - **********************************************************************/ -void -osm_router_destroy( - IN osm_router_t* const p_rtr ) -{ -} - -/********************************************************************** - **********************************************************************/ -void osm_router_delete( IN OUT osm_router_t** const pp_rtr ) { - osm_router_destroy( *pp_rtr ); free( *pp_rtr ); *pp_rtr = NULL; } @@ -108,16 +69,15 @@ osm_router_t* osm_router_new( IN osm_port_t* const p_port ) { - ib_api_status_t status; osm_router_t *p_rtr; + CL_ASSERT( p_port ); + p_rtr = (osm_router_t*)malloc( sizeof(*p_rtr) ); if( p_rtr ) { memset( p_rtr, 0, sizeof(*p_rtr) ); - status = osm_router_init( p_rtr, p_port ); - if( status != IB_SUCCESS ) - osm_router_delete( &p_rtr ); + p_rtr->p_port = p_port; } return( p_rtr ); diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c index 8241129..d27de5e 100644 --- a/osm/opensm/osm_sa_mcmember_record.c +++ b/osm/opensm/osm_sa_mcmember_record.c @@ -419,7 +419,7 @@ __cleanup_mgrp( { cl_qmap_remove_item(&p_rcv->p_subn->mgrp_mlid_tbl, (cl_map_item_t *)p_mgrp ); - osm_mgrp_destroy(p_mgrp); + 
osm_mgrp_delete(p_mgrp); } } } @@ -1358,7 +1358,7 @@ osm_mcmr_rcv_create_new_mgrp( cl_ntoh16(mlid) ); cl_qmap_remove_item(&p_rcv->p_subn->mgrp_mlid_tbl, (cl_map_item_t *)p_prev_mgrp ); - osm_mgrp_destroy( p_prev_mgrp ); + osm_mgrp_delete( p_prev_mgrp ); } cl_qmap_insert(&p_rcv->p_subn->mgrp_mlid_tbl, diff --git a/osm/opensm/osm_sa_service_record.c b/osm/opensm/osm_sa_service_record.c index eff0b0a..ded1cd2 100644 --- a/osm/opensm/osm_sa_service_record.c +++ b/osm/opensm/osm_sa_service_record.c @@ -1038,7 +1038,7 @@ osm_sr_rcv_process_delete_method( cl_qlist_insert_tail( &sr_list, (cl_list_item_t*)&p_sr_item->pool_item ); if(p_svcr) - osm_svcr_destroy(p_svcr); + osm_svcr_delete(p_svcr); __osm_sr_rcv_respond( p_rcv, p_madw, &sr_list ); @@ -1186,7 +1186,7 @@ osm_sr_rcv_lease_cb( p_rcv->p_log, p_svcr); - osm_svcr_destroy(p_svcr); + osm_svcr_delete(p_svcr); p_list_item = p_next_list_item; continue; diff --git a/osm/opensm/osm_service.c b/osm/opensm/osm_service.c index e97d8c6..ba422b3 100644 --- a/osm/opensm/osm_service.c +++ b/osm/opensm/osm_service.c @@ -56,16 +56,7 @@ /********************************************************************** **********************************************************************/ void -osm_svcr_construct( - IN osm_svcr_t* const p_svcr ) -{ - memset( p_svcr, 0, sizeof(*p_svcr) ); -} - -/********************************************************************** - **********************************************************************/ -void -osm_svcr_destroy( +osm_svcr_delete( IN osm_svcr_t* const p_svcr ) { free( p_svcr); @@ -102,7 +93,7 @@ osm_svcr_new( p_svcr = (osm_svcr_t*)malloc( sizeof(*p_svcr) ); if( p_svcr ) { - osm_svcr_construct( p_svcr ); + memset( p_svcr, 0, sizeof(*p_svcr) ); osm_svcr_init( p_svcr, p_svc_rec ); } diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 0484530..855d1ab 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -158,7 +158,7 @@ osm_subn_destroy( { p_mgrp = p_next_mgrp; 
     p_next_mgrp = (osm_mgrp_t*)cl_qmap_next( &p_mgrp->map_item );
-    osm_mgrp_destroy( p_mgrp );
+    osm_mgrp_delete( p_mgrp );
   }

   p_next_infr = (osm_infr_t*)cl_qlist_head( &p_subn->sa_infr_list );
@@ -166,7 +166,7 @@ osm_subn_destroy(
   {
     p_infr = p_next_infr;
     p_next_infr = (osm_infr_t*)cl_qlist_next( &p_infr->list_item );
-    osm_infr_destroy( p_infr );
+    osm_infr_delete( p_infr );
   }

   cl_list_remove_all( &p_subn->new_ports_list );
--
1.5.2.rc2.20.gac2a

From ardavis at ichips.intel.com Thu May 10 15:40:47 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Thu, 10 May 2007 15:40:47 -0700
Subject: [ofa-general] [ANNOUNCE] DAT 2.0 release branch available for OFED
Message-ID: <46439F6F.4030900@ichips.intel.com>

A new DAPL branch based on the new DAT 2.0 specification is ready for
testing. This version requires the OFED 1.2 verbs and rdmacm libraries.

git://git.openfabrics.org/~ardavis/scm/dapl.git branch == dat2.0

This version can be built with or without IB extensions. The DAT 2.0 IB
extension addendum is attached for reference. Basically, rdma_write with
immediate and atomic operations are supported through the new 2.0
extended interface. I have included a new test/dtest/dtestx.c that
includes new extended operation examples.

To build with IB extensions:

./autogen.sh && ./configure --enable-ext-type=ib && make

-arlin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DAT_IB_Extensions_Final_draft.pdf
Type: application/pdf
Size: 107506 bytes
Desc: not available
URL:

From devesh28 at gmail.com Thu May 10 23:31:46 2007
From: devesh28 at gmail.com (Devesh Sharma)
Date: Fri, 11 May 2007 12:01:46 +0530
Subject: [ofa-general] [Query] ib add path record cache
Message-ID: <309a667c0705102331w7839d7et688f9bc00827338@mail.gmail.com>

Hello Sean,

With reference to the discussion we had (Ref: [ofa-general] [RFC]
[PATCH 0/3] 2.6.22 or 23 ib: add path record cache) about "adding dummy
entries into the local_sa_cache", will the following idea do?
A user command reads path records from a file and passes them to the
local_sa_cache module through a standard entry point (read/write). The
local_cache module treats each one as an incoming resolved path_record
and adds it to the cache in the normal fashion. Some device interface
will possibly need to be added, and port agent related issues need to be
looked into.

Or, if you have a better idea, can we discuss it?

-Devesh

From vlad at lists.openfabrics.org Fri May 11 02:30:47 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri, 11 May 2007 02:30:47 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070511-0200 daily build status
Message-ID: <20070511093047.5A21DE60824@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17

Failed:

From chevchenkovic at gmail.com Fri May 11 02:54:05 2007
From: chevchenkovic at gmail.com (Chevchenkovic Chevchenkovic)
Date: Fri, 11 May 2007 15:24:05 +0530
Subject: [ofa-general] LMC read_bw test
Message-ID: <1c16cdf90705110254o7acf9996ye2bacbb995351b2e@mail.gmail.com>

Hi,

I have a problem with the following configuration:

node 1 : port 1 : LMC = 1 , LIDs = 12,13
node 2 : port 1 : LMC = 1 , LIDs = 18,19

When I run the read_bw test with the LIDs set to 12 and 18, the test
runs fine with no errors. But if I set the LIDs to 13 and 19, I get an
error during execution. The error is in the completion queue. How do I
get past this?

Awaiting reply,
-chev

From halr at voltaire.com Fri May 11 04:34:27 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 May 2007 07:34:27 -0400
Subject: [ofa-general] Re: [PATCH 1/3] opensm: remove osm_port_get_num_physp() function
In-Reply-To: <11787426251658-git-send-email-sashak@voltaire.com>
References: <11787426251341-git-send-email-sashak@voltaire.com> <11787426251658-git-send-email-sashak@voltaire.com>
Message-ID: <1178883254.25974.76454.camel@hal.voltaire.com>

On Wed, 2007-05-09 at 16:30, Sasha Khapyorsky wrote:
> This removes osm_port_get_num_physp() function and instead uses native
> node oriented osm_node_get_num_physp().
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied (to master only).
-- Hal

From halr at voltaire.com Fri May 11 04:34:56 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 May 2007 07:34:56 -0400
Subject: [ofa-general] Re: [PATCH 2/4 v2] opensm: remove osm_port_get_phys_ptr()
In-Reply-To: <11788316803195-git-send-email-sashak@voltaire.com>
References: <11788316803259-git-send-email-sashak@voltaire.com> <11788316803195-git-send-email-sashak@voltaire.com>
Message-ID: <1178883259.25974.76456.camel@hal.voltaire.com>

On Thu, 2007-05-10 at 17:14, Sasha Khapyorsky wrote:
> Function osm_port_get_phys_ptr() returns pointer to physical port by its
> number - and this should be node and not port related routine. So this
> patch replaces osm_port_get_phys_ptr() by osm_node_get_physp_ptr().
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied (to master only).

-- Hal

From halr at voltaire.com Fri May 11 04:35:53 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 May 2007 07:35:53 -0400
Subject: [ofa-general] Re: [PATCH 3/4 v2] opensm: eliminate node's physical ports table duplication in osm_port_t
In-Reply-To: <11788316801832-git-send-email-sashak@voltaire.com>
References: <11788316803259-git-send-email-sashak@voltaire.com> <11788316801832-git-send-email-sashak@voltaire.com>
Message-ID: <1178883342.25974.76542.camel@hal.voltaire.com>

On Thu, 2007-05-10 at 17:14, Sasha Khapyorsky wrote:
> Eliminate duplication of osm_node's physical ports table in osm_port_t
> object.
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied (to master only).

-- Hal

From halr at voltaire.com Fri May 11 04:37:42 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 May 2007 07:37:42 -0400
Subject: [ofa-general] Re: [PATCH 4/4 v2] opensm: remove some unneeded funcs
In-Reply-To: <11788316801642-git-send-email-sashak@voltaire.com>
References: <11788316803259-git-send-email-sashak@voltaire.com> <11788316801642-git-send-email-sashak@voltaire.com>
Message-ID: <1178883355.25974.76544.camel@hal.voltaire.com>

On Thu, 2007-05-10 at 17:14, Sasha Khapyorsky wrote:
> This removes some not really needed functions
> osm_port_get_default_phys_ptr() and osm_port_get_parent_node().
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied (to master only).

-- Hal

From mst at dev.mellanox.co.il Fri May 11 05:30:36 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 11 May 2007 15:30:36 +0300
Subject: [ofa-general] Re: is there an OFED way to putt VPD from an HCA?
In-Reply-To: <46434FF1.9020005@hp.com>
References: <46434FF1.9020005@hp.com>
Message-ID: <20070511123036.GB30092@mellanox.co.il>

> Quoting Rick Jones :
> Subject: is there an OFED way to putt VPD from an HCA?
>
> Hi -
>
> I would like to pull vital product data (eg serial number) from an IB HCA
> which is "driven" via OFED bits. Is there any OFED tool to do that or do I
> have to go hunt-down something HCA-vendor specific (mellanox in this case)?

Under the mstflint directory, there is an "mstvpd" tool which will dump
out all of vpd. It works for any PCI device, actually.

--
MST

From mst at dev.mellanox.co.il Fri May 11 05:40:10 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 11 May 2007 15:40:10 +0300
Subject: [ofa-general] Re: is there an OFED way to putt VPD from an HCA?
In-Reply-To: <464395F3.9080004@hp.com>
References: <46434FF1.9020005@hp.com> <46436275.80406@systemfabricworks.com> <464395F3.9080004@hp.com>
Message-ID: <20070511124010.GC30092@mellanox.co.il>

> Quoting Rick Jones :
> Subject: Re: is there an OFED way to putt VPD from an HCA?
>
> Seems that none of those utilities went into Debian, which was the base
> distro I installed, and then on top of which I put the 2.6.21.1 kernel.

If you just want the utility, get it from source:

git clone git.openfabrics.org/~mst/mstflint.git
cd mstflint
make

This will generate mstvpd in the current directory. Then give it the
PCI device address:

./mstvpd 0000:02:00.0

> course I'm still having that "gcc rpm" not found issue trying to grab the
> whole OFED 1.2 from 5/10, and an attempt to compile the ofa_kernel from
> 5.10 ended-up with some asm related errors which sadly I've not saved, but
> could I suppose recreate.

I think at that date 2.6.21 wasn't yet supported, but it should work now.

> At this point I'm not sure if I don't need to lay-down a fresh set of
> kernel sources to allow things to patch correctly.

Whether you need the OFED kernel bits depends on what you need to do;
you certainly do not need them just to look up the VPD.

--
MST

From mst at dev.mellanox.co.il Fri May 11 06:36:39 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 11 May 2007 16:36:39 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
In-Reply-To: <46438DF2.3080601@linux.vnet.ibm.com>
References: <4641E99B.10706@linux.vnet.ibm.com> <46438DF2.3080601@linux.vnet.ibm.com>
Message-ID: <20070511133639.GD30092@mellanox.co.il>

> Quoting Pradeep Satyanarayana :
> Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
>
> If there are no other issues than the small restructure suggestion that
> Michael had, can this patch be merged into the for-2.6.22 tree?

I'm not sure. I don't have the time, at the moment, to go over the patch
again in depth. Have the issues from this message been addressed?

http://www.mail-archive.com/general at lists.openfabrics.org/msg02056.html

From a quick review, it seems that the two most important issues have
not been addressed yet:

1. Testing device SRQ capability twice on each RX packet is just too
   ugly, and it *should* be possible to structure the code by moving
   common functionality into separate functions instead of scattering
   if (srq) tests around.

2. Once the number of created connections exceeds the constant that you
   allow, all attempts to communicate with this node over IP over IB
   will fail. A way needs to be designed to switch to the datagram mode,
   and to retry going back to connected after some time.
   [We actually have this theoretical issue in SRQ as well - it is just
   much more severe in the nonSRQ case].

If connected mode works much worse than datagram in some setups, people
won't be able to enable it by default.

--
MST

From mst at dev.mellanox.co.il Fri May 11 07:14:29 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 11 May 2007 17:14:29 +0300
Subject: [ofa-general] Re: RFC: location for IB CM statistics
In-Reply-To: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com>
References: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com>
Message-ID: <20070511141429.GF30092@mellanox.co.il>

> Quoting Sean Hefty :
> Subject: RFC: location for IB CM statistics
>
> I'd like to start adding some statistical information to the IB CM to help
> identify scalability or connectivity issues. Some example statistics that I
> would like to expose now are number of retried MADs, unmatched requests, total
> number of connections, etc. Longer term, additional statistics and information
> on each connection could be added.
>
> I'm looking for ideas on the best way to expose this sort of data. Any
> thoughts?

I would start by adding the data to debugfs - we don't have to maintain
format stability there. When we are convinced we got the format right,
we'll be able to add the data to sysfs/proc as well.

--
MST

From mst at dev.mellanox.co.il Fri May 11 07:35:10 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 11 May 2007 17:35:10 +0300
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 2/4] ehca: backport for rhel-4.5 - mmap functonality
In-Reply-To: <200705101628.43095.ossrosch@linux.vnet.ibm.com>
References: <200705101628.43095.ossrosch@linux.vnet.ibm.com>
Message-ID: <20070511143510.GG30092@mellanox.co.il>

Some questions:

+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-04 10:40:06.000000000 +0200
+@@ -52,7 +52,7 @@
+ MODULE_LICENSE("Dual BSD/GPL");
+ MODULE_AUTHOR("Christoph Raisch ");
+ MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver");
+-MODULE_VERSION("SVNEHCA_0022");
++MODULE_VERSION("SVNEHCA_0019");
+
+ int ehca_open_aqp1 = 0;
+ int ehca_debug_level = 0;

Is this intentional?

+@@ -293,7 +293,7 @@ int ehca_init_device(struct ehca_shca *s
+ 	strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX);
+ 	shca->ib_device.owner = THIS_MODULE;
+
+-	shca->ib_device.uverbs_abi_ver = 6;
++	shca->ib_device.uverbs_abi_ver = 5;
+ 	shca->ib_device.uverbs_cmd_mask =
+ 		(1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+ 		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+@@ -357,7 +357,7 @@ int ehca_init_device(struct ehca_shca *s
+ 	shca->ib_device.dealloc_fmr = ehca_dealloc_fmr;
+ 	shca->ib_device.attach_mcast = ehca_attach_mcast;
+ 	shca->ib_device.detach_mcast = ehca_detach_mcast;
+-	/* shca->ib_device.process_mad = ehca_process_mad; */
++	/* shca->ib_device.process_mad = ehca_process_mad; */
+ 	shca->ib_device.mmap = ehca_mmap;
+
+ 	return ret;

Is this really necessary?
+@@ -811,7 +811,7 @@ int __init ehca_module_init(void)
+ 	int ret;
+
+ 	printk(KERN_INFO "eHCA Infiniband Device Driver "
+-	       "(Rel.: SVNEHCA_0022)\n");
++	       "(Rel.: SVNEHCA_0019)\n");
+ 	idr_init(&ehca_qp_idr);
+ 	idr_init(&ehca_cq_idr);
+ 	spin_lock_init(&ehca_qp_idr_lock);

Is this intentional?

--
MST

From suri at baymicrosystems.com Fri May 11 08:06:38 2007
From: suri at baymicrosystems.com (Suresh Shelvapille)
Date: Fri, 11 May 2007 11:06:38 -0400
Subject: [ofa-general] Sonoma Conf presentations
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403011C1559@G3W0634.americas.hpqcorp.net>
References: <000001c75454$523660f0$eed4180a@amr.corp.intel.com><45DAB3FD.8060606@voltaire.com><349DCDA352EACF42A0C49FA6DCEA84030118B82B@G3W0634.americas.hpqcorp.net><4625C1C6.6040709@voltaire.com><349DCDA352EACF42A0C49FA6DCEA84030118C3FA@G3W0634.americas.hpqcorp.net><462755BD.5020305@voltaire.com><349DCDA352EACF42A0C49FA6DCEA8403011C1415@G3W0634.americas.hpqcorp.net><4627804E.2040004@voltaire.com><349DCDA352EACF42A0C49FA6DCEA8403011C1497@G3W0634.americas.hpqcorp.net><46278780.2010900@voltaire.com> <349DCDA352EACF42A0C49FA6DCEA8403011C1559@G3W0634.americas.hpqcorp.net>
Message-ID: <05f401c793dd$fcc25c40$1914a8c0@surioffice>

Folks:

I was able to access the Day-1 presentations on the openfabrics.org site
but not the Day-2/3 presentations (404 page not found error). Is it an
issue on my side, or have the pages not been posted yet?

Thanks,
Suri

From ossrosch at linux.vnet.ibm.com Fri May 11 08:38:12 2007
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Fri, 11 May 2007 17:38:12 +0200
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 2/4] ehca: backport for rhel-4.5 - mmap functonality
In-Reply-To: <20070511143510.GG30092@mellanox.co.il>
References: <200705101628.43095.ossrosch@linux.vnet.ibm.com> <20070511143510.GG30092@mellanox.co.il>
Message-ID: <200705111738.12797.ossrosch@linux.vnet.ibm.com>

Hi Michael,

thanks for reviewing. I have made the changes and am sending the new
patch below.

On Friday 11 May 2007 16:35, Michael S. Tsirkin wrote:
> Some questions:
> + MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver");
> +-MODULE_VERSION("SVNEHCA_0022");
> ++MODULE_VERSION("SVNEHCA_0019");
> +
> + int ehca_open_aqp1 = 0;
> + int ehca_debug_level = 0;
>
> Is this intentional?

No, I have deleted this hunk.

>
> + shca->ib_device.detach_mcast = ehca_detach_mcast;
> +- /* shca->ib_device.process_mad = ehca_process_mad; */
> ++ /* shca->ib_device.process_mad = ehca_process_mad; */
> + shca->ib_device.mmap = ehca_mmap;
> +
> + return ret;
>
> Is this really necessary?

No, I think there was an unintentional whitespace change; I have deleted
this hunk.

>
> +@@ -811,7 +811,7 @@ int __init ehca_module_init(void)
> + int ret;
> +
> + printk(KERN_INFO "eHCA Infiniband Device Driver "
> +- "(Rel.: SVNEHCA_0022)\n");
> ++ "(Rel.: SVNEHCA_0019)\n");
> + idr_init(&ehca_qp_idr);
> + idr_init(&ehca_cq_idr);
> + spin_lock_init(&ehca_qp_idr_lock);
>
> Is this intentional?

No, I have deleted this hunk too.

Below is the new patch.
Signed-off-by: Stefan Roscher
---

 backport_ehca_2_rhel45_umap.patch | 823 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 823 insertions(+)

diff -Nurp ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch
--- ofa_kernel-1.2_old/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch 1970-01-01 01:00:00.000000000 +0100
+++ ofa_kernel-1.2_new/kernel_patches/backport/2.6.9_U5/backport_ehca_2_rhel45_umap.patch 2007-05-11 19:13:51.000000000 +0200
@@ -0,0 +1,823 @@
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_classes.h
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-05-04 10:40:06.000000000 +0200
+@@ -126,14 +126,13 @@ struct ehca_qp {
+ 	struct ipz_qp_handle ipz_qp_handle;
+ 	struct ehca_pfqp pf;
+ 	struct ib_qp_init_attr init_attr;
++	u64 uspace_squeue;
++	u64 uspace_rqueue;
++	u64 uspace_fwh;
+ 	struct ehca_cq *send_cq;
+ 	struct ehca_cq *recv_cq;
+ 	unsigned int sqerr_purgeflag;
+ 	struct hlist_node list_entries;
+-	/* mmap counter for resources mapped into user space */
+-	u32 mm_count_squeue;
+-	u32 mm_count_rqueue;
+-	u32 mm_count_galpa;
+ };
+
+ /* must be power of 2 */
+@@ -150,6 +149,8 @@ struct ehca_cq {
+ 	struct ipz_cq_handle ipz_cq_handle;
+ 	struct ehca_pfcq pf;
+ 	spinlock_t cb_lock;
++	u64 uspace_queue;
++	u64 uspace_fwh;
+ 	struct hlist_head qp_hashtab[QP_HASHTAB_LEN];
+ 	struct list_head entry;
+ 	u32 nr_callbacks; /* #events assigned to cpu by scaling code */
+@@ -157,9 +158,6 @@ struct ehca_cq {
+ 	wait_queue_head_t wait_completion;
+ 	spinlock_t task_lock;
+ 	u32 ownpid;
+-	/* mmap counter for resources mapped into user space */
+-	u32 mm_count_queue;
+-	u32 mm_count_galpa;
+ };
+
+ enum ehca_mr_flag {
+@@ -259,6 +257,20 @@ struct ehca_ucontext {
+ 	struct ib_ucontext ib_ucontext;
+ };
+
++struct ehca_module *ehca_module_new(void);
++
++int ehca_module_delete(struct ehca_module *me);
++
++int ehca_eq_ctor(struct ehca_eq *eq);
++
++int ehca_eq_dtor(struct ehca_eq *eq);
++
++struct ehca_shca *ehca_shca_new(void);
++
++int ehca_shca_delete(struct ehca_shca *me);
++
++struct ehca_sport *ehca_sport_new(struct ehca_shca *anchor);
++
+ int ehca_init_pd_cache(void);
+ void ehca_cleanup_pd_cache(void);
+ int ehca_init_cq_cache(void);
+@@ -282,6 +294,7 @@ extern int ehca_use_hp_mr;
+ extern int ehca_scaling_code;
+
+ struct ipzu_queue_resp {
++	u64 queue; /* points to first queue entry */
+ 	u32 qe_size; /* queue entry size */
+ 	u32 act_nr_of_sg;
+ 	u32 queue_length; /* queue length allocated in bytes */
+@@ -294,6 +307,7 @@ struct ehca_create_cq_resp {
+ 	u32 cq_number;
+ 	u32 token;
+ 	struct ipzu_queue_resp ipz_queue;
++	struct h_galpas galpas;
+ };
+
+ struct ehca_create_qp_resp {
+@@ -306,6 +320,7 @@ struct ehca_create_qp_resp {
+ 	u32 dummy; /* padding for 8 byte alignment */
+ 	struct ipzu_queue_resp ipz_squeue;
+ 	struct ipzu_queue_resp ipz_rqueue;
++	struct h_galpas galpas;
+ };
+
+ struct ehca_alloc_cq_parms {
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_cq.c 2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_cq.c 2007-05-04 10:40:06.000000000 +0200
+@@ -268,6 +268,7 @@ struct ib_cq *ehca_create_cq(struct ib_d
+ 	if (context) {
+ 		struct ipz_queue *ipz_queue = &my_cq->ipz_queue;
+ 		struct ehca_create_cq_resp resp;
++		struct vm_area_struct *vma;
+ 		memset(&resp, 0, sizeof(resp));
+ 		resp.cq_number = my_cq->cq_number;
+ 		resp.token = my_cq->token;
+@@ -276,14 +277,40 @@ struct ib_cq *ehca_create_cq(struct ib_d
+ 		resp.ipz_queue.queue_length = ipz_queue->queue_length;
+ 		resp.ipz_queue.pagesize = ipz_queue->pagesize;
+ 		resp.ipz_queue.toggle_state = ipz_queue->toggle_state;
++		ret = ehca_mmap_nopage(((u64)(my_cq->token) << 32) | 0x12000000,
++				       ipz_queue->queue_length,
++				       (void**)&resp.ipz_queue.queue,
++				       &vma);
++		if (ret) {
++			ehca_err(device, "Could not mmap queue pages");
++			cq = ERR_PTR(ret);
++			goto create_cq_exit4;
++		}
++		my_cq->uspace_queue = resp.ipz_queue.queue;
++		resp.galpas = my_cq->galpas;
++		ret = ehca_mmap_register(my_cq->galpas.user.fw_handle,
++					 (void**)&resp.galpas.kernel.fw_handle,
++					 &vma);
++		if (ret) {
++			ehca_err(device, "Could not mmap fw_handle");
++			cq = ERR_PTR(ret);
++			goto create_cq_exit5;
++		}
++		my_cq->uspace_fwh = (u64)resp.galpas.kernel.fw_handle;
+ 		if (ib_copy_to_udata(udata, &resp, sizeof(resp))) {
+ 			ehca_err(device, "Copy to udata failed.");
+-			goto create_cq_exit4;
++			goto create_cq_exit6;
+ 		}
+ 	}
+
+ 	return cq;
+
++create_cq_exit6:
++	ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE);
++
++create_cq_exit5:
++	ehca_munmap(my_cq->uspace_queue, my_cq->ipz_queue.queue_length);
++
+ create_cq_exit4:
+ 	ipz_queue_dtor(&my_cq->ipz_queue);
+
+@@ -317,6 +344,7 @@ static int get_cq_nr_events(struct ehca_
+ int ehca_destroy_cq(struct ib_cq *cq)
+ {
+ 	u64 h_ret;
++	int ret;
+ 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
+ 	int cq_num = my_cq->cq_number;
+ 	struct ib_device *device = cq->device;
+@@ -326,20 +354,6 @@ int ehca_destroy_cq(struct ib_cq *cq)
+ 	u32 cur_pid = current->tgid;
+ 	unsigned long flags;
+
+-	if (cq->uobject) {
+-		if (my_cq->mm_count_galpa || my_cq->mm_count_queue) {
+-			ehca_err(device, "Resources still referenced in "
+-				 "user space cq_num=%x", my_cq->cq_number);
+-			return -EINVAL;
+-		}
+-		if (my_cq->ownpid != cur_pid) {
+-			ehca_err(device, "Invalid caller pid=%x ownpid=%x "
+-				 "cq_num=%x",
+-				 cur_pid, my_cq->ownpid, my_cq->cq_number);
+-			return -EINVAL;
+-		}
+-	}
+-
+ 	spin_lock_irqsave(&ehca_cq_idr_lock, flags);
+ 	while (my_cq->nr_events) {
+ 		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+@@ -351,6 +365,25 @@ int ehca_destroy_cq(struct ib_cq *cq)
+ 	idr_remove(&ehca_cq_idr, my_cq->token);
+ 	spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+
++	if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) {
++		ehca_err(device, "Invalid caller pid=%x ownpid=%x",
++			 cur_pid, my_cq->ownpid);
++		return -EINVAL;
++	}
++
++	/* un-mmap if vma alloc */
++	if (my_cq->uspace_queue ) {
++		ret = ehca_munmap(my_cq->uspace_queue,
++				  my_cq->ipz_queue.queue_length);
++		if (ret)
++			ehca_err(device, "Could not munmap queue ehca_cq=%p "
++				 "cq_num=%x", my_cq, cq_num);
++		ret = ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE);
++		if (ret)
++			ehca_err(device, "Could not munmap fwh ehca_cq=%p "
++				 "cq_num=%x", my_cq, cq_num);
++	}
++
+ 	h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0);
+ 	if (h_ret == H_R_STATE) {
+ 		/* cq in err: read err data and destroy it forcibly */
+@@ -379,7 +412,7 @@ int ehca_resize_cq(struct ib_cq *cq, int
+ 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
+ 	u32 cur_pid = current->tgid;
+
+-	if (cq->uobject && my_cq->ownpid != cur_pid) {
++	if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) {
+ 		ehca_err(cq->device, "Invalid caller pid=%x ownpid=%x",
+ 			 cur_pid, my_cq->ownpid);
+ 		return -EINVAL;
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_iverbs.h ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_iverbs.h
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_iverbs.h 2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_iverbs.h 2007-04-29 15:10:56.000000000 +0200
+@@ -171,11 +171,19 @@ int ehca_mmap(struct ib_ucontext *contex
+
+ void ehca_poll_eqs(unsigned long data);
+
++int ehca_mmap_nopage(u64 foffset,u64 length,void **mapped,
++		     struct vm_area_struct **vma);
++
++int ehca_mmap_register(u64 physical,void **mapped,
++		       struct vm_area_struct **vma);
++
++int ehca_munmap(unsigned long addr, size_t len);
++
+ #ifdef CONFIG_PPC_64K_PAGES
+ void *ehca_alloc_fw_ctrlblock(gfp_t flags);
+ void ehca_free_fw_ctrlblock(void *ptr);
+ #else
+-#define ehca_alloc_fw_ctrlblock(flags) ((void*) get_zeroed_page(flags))
++#define ehca_alloc_fw_ctrlblock(flags) ((void *) get_zeroed_page(flags))
+ #define ehca_free_fw_ctrlblock(ptr) free_page((unsigned long)(ptr))
+ #endif
+
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-05-04 10:40:06.000000000 +0200
+@@ -293,7 +293,7 @@ int ehca_init_device(struct ehca_shca *s
+ 	strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX);
+ 	shca->ib_device.owner = THIS_MODULE;
+
+-	shca->ib_device.uverbs_abi_ver = 6;
++	shca->ib_device.uverbs_abi_ver = 5;
+ 	shca->ib_device.uverbs_cmd_mask =
+ 		(1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+ 		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_qp.c 2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_qp.c 2007-04-29 15:10:56.000000000 +0200
+@@ -637,6 +637,7 @@ struct ib_qp *ehca_create_qp(struct ib_p
+ 		struct ipz_queue *ipz_rqueue = &my_qp->ipz_rqueue;
+ 		struct ipz_queue *ipz_squeue = &my_qp->ipz_squeue;
+ 		struct ehca_create_qp_resp resp;
++		struct vm_area_struct * vma;
+ 		memset(&resp, 0, sizeof(resp));
+
+ 		resp.qp_num = my_qp->real_qp_num;
+@@ -650,21 +651,59 @@ struct ib_qp *ehca_create_qp(struct ib_p
+ 		resp.ipz_rqueue.queue_length = ipz_rqueue->queue_length;
+ 		resp.ipz_rqueue.pagesize = ipz_rqueue->pagesize;
+ 		resp.ipz_rqueue.toggle_state = ipz_rqueue->toggle_state;
++		ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x22000000,
++				       ipz_rqueue->queue_length,
++				       (void**)&resp.ipz_rqueue.queue,
++				       &vma);
++		if (ret) {
++			ehca_err(pd->device, "Could not mmap rqueue pages");
++			goto create_qp_exit3;
++		}
++		my_qp->uspace_rqueue = resp.ipz_rqueue.queue;
+ 		/* squeue properties */
+ 		resp.ipz_squeue.qe_size = ipz_squeue->qe_size;
+ 		resp.ipz_squeue.act_nr_of_sg = ipz_squeue->act_nr_of_sg;
+ 		resp.ipz_squeue.queue_length = ipz_squeue->queue_length;
+ 		resp.ipz_squeue.pagesize = ipz_squeue->pagesize;
+ 		resp.ipz_squeue.toggle_state = ipz_squeue->toggle_state;
++		ret = ehca_mmap_nopage(((u64)(my_qp->token) << 32) | 0x23000000,
++				       ipz_squeue->queue_length,
++				       (void**)&resp.ipz_squeue.queue,
++				       &vma);
++		if (ret) {
++			ehca_err(pd->device, "Could not mmap squeue pages");
++			goto create_qp_exit4;
++		}
++		my_qp->uspace_squeue = resp.ipz_squeue.queue;
++		/* fw_handle */
++		resp.galpas = my_qp->galpas;
++		ret = ehca_mmap_register(my_qp->galpas.user.fw_handle,
++					 (void**)&resp.galpas.kernel.fw_handle,
++					 &vma);
++		if (ret) {
++			ehca_err(pd->device, "Could not mmap fw_handle");
++			goto create_qp_exit5;
++		}
++		my_qp->uspace_fwh = (u64)resp.galpas.kernel.fw_handle;
++
+ 		if (ib_copy_to_udata(udata, &resp, sizeof resp)) {
+ 			ehca_err(pd->device, "Copy to udata failed");
+ 			ret = -EINVAL;
+-			goto create_qp_exit3;
++			goto create_qp_exit6;
+ 		}
+ 	}
+
+ 	return &my_qp->ib_qp;
+
++create_qp_exit6:
++	ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE);
++
++create_qp_exit5:
++	ehca_munmap(my_qp->uspace_squeue, my_qp->ipz_squeue.queue_length);
++
++create_qp_exit4:
++	ehca_munmap(my_qp->uspace_rqueue, my_qp->ipz_rqueue.queue_length);
++
+ create_qp_exit3:
+ 	ipz_queue_dtor(&my_qp->ipz_rqueue);
+ 	ipz_queue_dtor(&my_qp->ipz_squeue);
+@@ -892,7 +931,7 @@ static int internal_modify_qp(struct ib_
+ 	     my_qp->qp_type == IB_QPT_SMI) &&
+ 	    statetrans == IB_QPST_SQE2RTS) {
+ 		/* mark next free wqe if kernel */
+-		if (!ibqp->uobject) {
++		if (my_qp->uspace_squeue == 0) {
+ 			struct ehca_wqe *wqe;
+ 			/* lock send queue */
+ 			spin_lock_irqsave(&my_qp->spinlock_s, spl_flags);
+@@ -1378,18 +1417,11 @@ int ehca_destroy_qp(struct ib_qp *ibqp)
+ 	enum ib_qp_type qp_type;
+ 	unsigned long flags;
+
+-	if (ibqp->uobject) {
+-		if (my_qp->mm_count_galpa ||
+-		    my_qp->mm_count_rqueue || my_qp->mm_count_squeue) {
+-			ehca_err(ibqp->device, "Resources still referenced in "
+-				 "user space qp_num=%x", ibqp->qp_num);
+-			return -EINVAL;
+-		}
+-		if (my_pd->ownpid != cur_pid) {
+-			ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x",
+-				 cur_pid, my_pd->ownpid);
+-			return -EINVAL;
+-		}
++	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
++	    my_pd->ownpid != cur_pid) {
++		ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x",
++			 cur_pid, my_pd->ownpid);
++		return -EINVAL;
+ 	}
+
+ 	if (my_qp->send_cq) {
+@@ -1407,6 +1439,24 @@ int ehca_destroy_qp(struct ib_qp *ibqp)
+ 	idr_remove(&ehca_qp_idr, my_qp->token);
+ 	spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+
++	/* un-mmap if vma alloc */
++	if (my_qp->uspace_rqueue) {
++		ret = ehca_munmap(my_qp->uspace_rqueue,
++				  my_qp->ipz_rqueue.queue_length);
++		if (ret)
++			ehca_err(ibqp->device, "Could not munmap rqueue "
++				 "qp_num=%x", qp_num);
++		ret = ehca_munmap(my_qp->uspace_squeue,
++				  my_qp->ipz_squeue.queue_length);
++		if (ret)
++			ehca_err(ibqp->device, "Could not munmap squeue "
++				 "qp_num=%x", qp_num);
++		ret = ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE);
++		if (ret)
++			ehca_err(ibqp->device, "Could not munmap fwh qp_num=%x",
++				 qp_num);
++	}
++
+ 	h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp);
+ 	if (h_ret != H_SUCCESS) {
+ 		ehca_err(ibqp->device, "hipz_h_destroy_qp() failed rc=%lx "
+diff -Nurp ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_uverbs.c ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_uverbs.c
+--- ofa_kernel-1.2_old/drivers/infiniband/hw/ehca/ehca_uverbs.c 2007-05-04 10:38:23.000000000 +0200
++++ ofa_kernel-1.2_new/drivers/infiniband/hw/ehca/ehca_uverbs.c 2007-04-29 15:10:56.000000000 +0200
+@@ -68,183 +68,105 @@ int ehca_dealloc_ucontext(struct ib_ucon
+ 	return 0;
+ }
+
+-static void ehca_mm_open(struct vm_area_struct *vma)
++struct page *ehca_nopage(struct vm_area_struct *vma,
++			 unsigned long address, int *type)
+ {
+-	u32 *count = (u32*)vma->vm_private_data;
+-	if (!count) {
+-		ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx",
+-			     vma->vm_start, vma->vm_end);
+-		return;
+-	}
+-	(*count)++;
+-	if (!(*count))
+-		ehca_gen_err("Use count overflow vm_start=%lx vm_end=%lx",
+-			     vma->vm_start, vma->vm_end);
+-	ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x",
+-		     vma->vm_start, vma->vm_end, *count);
+-}
+-
+-static void ehca_mm_close(struct vm_area_struct *vma)
+-{
+-	u32 *count = (u32*)vma->vm_private_data;
+-	if (!count) {
+-		ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx",
+-			     vma->vm_start, vma->vm_end);
+-		return;
+-	}
+-	(*count)--;
+-	ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x",
+-		     vma->vm_start, vma->vm_end, *count);
+-}
+-
+-static struct vm_operations_struct vm_ops = {
+-	.open = ehca_mm_open,
+-	.close = ehca_mm_close,
+-};
+-
+-static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas,
+-			u32 *mm_count)
+-{
+-	int ret;
+-	u64 vsize, physical;
+-
+-	vsize = vma->vm_end - vma->vm_start;
+-	if (vsize != EHCA_PAGESIZE) {
+-		ehca_gen_err("invalid vsize=%lx", vma->vm_end - vma->vm_start);
+-		return -EINVAL;
+-	}
+-
+-	physical = galpas->user.fw_handle;
+-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+-	ehca_gen_dbg("vsize=%lx physical=%lx", vsize, physical);
+-	/* VM_IO | VM_RESERVED are set by remap_pfn_range() */
+-	ret = remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT,
+-			      vsize, vma->vm_page_prot);
+-	if (unlikely(ret)) {
+-		ehca_gen_err("remap_pfn_range() failed ret=%x", ret);
+-		return -ENOMEM;
+-	}
+-
+-	vma->vm_private_data = mm_count;
+-	(*mm_count)++;
+-	vma->vm_ops = &vm_ops;
+-
+-	return 0;
+-}
++	struct page *mypage = NULL;
++	u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT;
++	u32 idr_handle = fileoffset >> 32;
++	u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... */
++	u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */
++	u32 cur_pid = current->tgid;
++	unsigned long flags;
++	struct ehca_cq *cq;
++	struct ehca_qp *qp;
++	struct ehca_pd *pd;
++	u64 offset;
++	void *vaddr;
+
+-static int ehca_mmap_queue(struct vm_area_struct *vma, struct ipz_queue *queue,
+-			   u32 *mm_count)
+-{
+-	int ret;
+-	u64 start, ofs;
+-	struct page *page;
++	switch (q_type) {
++	case 1: /* CQ */
++		spin_lock_irqsave(&ehca_cq_idr_lock, flags);
++		cq = idr_find(&ehca_cq_idr, idr_handle);
++		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+
+-	vma->vm_flags |= VM_RESERVED;
+-	start = vma->vm_start;
+-	for (ofs = 0; ofs < queue->queue_length; ofs += PAGE_SIZE) {
+-		u64 virt_addr = (u64)ipz_qeit_calc(queue, ofs);
+-		page = virt_to_page(virt_addr);
+-		ret = vm_insert_page(vma, start, page);
+-		if (unlikely(ret)) {
+-			ehca_gen_err("vm_insert_page() failed rc=%x", ret);
+-			return ret;
++		/* make sure this mmap really belongs to the authorized user */
++		if (!cq) {
++			ehca_gen_err("cq is NULL ret=NOPAGE_SIGBUS");
++			return NOPAGE_SIGBUS;
+ 		}
+-		start += PAGE_SIZE;
+-	}
+-	vma->vm_private_data = mm_count;
+-	(*mm_count)++;
+-	vma->vm_ops = &vm_ops;
+
+-	return 0;
+-}
+-
+-static int ehca_mmap_cq(struct vm_area_struct *vma, struct ehca_cq *cq,
+-			u32 rsrc_type)
+-{
+-	int ret;
+-
+-	switch (rsrc_type) {
+-	case 1: /* galpa fw handle */
+-		ehca_dbg(cq->ib_cq.device, "cq_num=%x fw", cq->cq_number);
+-		ret = ehca_mmap_fw(vma, &cq->galpas, &cq->mm_count_galpa);
+-		if (unlikely(ret)) {
++		if (cq->ownpid != cur_pid) {
+ 			ehca_err(cq->ib_cq.device,
+-				 "ehca_mmap_fw() failed rc=%x cq_num=%x",
+-				 ret, cq->cq_number);
+-			return ret;
++				 "Invalid caller pid=%x ownpid=%x",
++				 cur_pid, cq->ownpid);
++			return NOPAGE_SIGBUS;
+ 		}
+-		break;
+
+-	case 2: /* cq queue_addr */
+-		ehca_dbg(cq->ib_cq.device, "cq_num=%x queue", cq->cq_number);
+-		ret = ehca_mmap_queue(vma, &cq->ipz_queue, &cq->mm_count_queue);
+-		if (unlikely(ret)) {
+-			ehca_err(cq->ib_cq.device,
+-				 "ehca_mmap_queue() failed rc=%x cq_num=%x",
+-				 ret, cq->cq_number);
+-			return ret;
++		if (rsrc_type == 2) {
++			ehca_dbg(cq->ib_cq.device, "cq=%p cq queuearea", cq);
++			offset = address - vma->vm_start;
++			vaddr = ipz_qeit_calc(&cq->ipz_queue, offset);
++			ehca_dbg(cq->ib_cq.device, "offset=%lx vaddr=%p",
++				 offset, vaddr);
++			mypage = virt_to_page(vaddr);
+ 		}
+ 		break;
+
+-	default:
+-		ehca_err(cq->ib_cq.device, "bad resource type=%x cq_num=%x",
+-			 rsrc_type, cq->cq_number);
+-		return -EINVAL;
+-	}
+-
+-	return 0;
+-}
+-
+-static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp,
+-			u32 rsrc_type)
+-{
+-	int ret;
++	case 2: /* QP */
++		spin_lock_irqsave(&ehca_qp_idr_lock, flags);
++		qp = idr_find(&ehca_qp_idr, idr_handle);
++		spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+
+-	switch (rsrc_type) {
+-	case 1: /* galpa fw handle */
+-		ehca_dbg(qp->ib_qp.device, "qp_num=%x fw", qp->ib_qp.qp_num);
+-		ret = ehca_mmap_fw(vma, &qp->galpas, &qp->mm_count_galpa);
+-		if (unlikely(ret)) {
+-			ehca_err(qp->ib_qp.device,
+-				 "remap_pfn_range() failed ret=%x qp_num=%x",
+-				 ret, qp->ib_qp.qp_num);
+-			return -ENOMEM;
++		/* make sure this mmap really belongs to the authorized user */
++		if (!qp) {
++			ehca_gen_err("qp is NULL ret=NOPAGE_SIGBUS");
++			return NOPAGE_SIGBUS;
+ 		}
+-		break;
+
+-	case 2: /* qp rqueue_addr */
+-		ehca_dbg(qp->ib_qp.device, "qp_num=%x rqueue",
+-			 qp->ib_qp.qp_num);
+-		ret = ehca_mmap_queue(vma, &qp->ipz_rqueue, &qp->mm_count_rqueue);
+-		if (unlikely(ret)) {
++		pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd);
++		if (pd->ownpid != cur_pid) {
+ 			ehca_err(qp->ib_qp.device,
+-				 "ehca_mmap_queue(rq) failed rc=%x qp_num=%x",
+-				 ret, qp->ib_qp.qp_num);
+-			return ret;
++				 "Invalid caller pid=%x ownpid=%x",
++				 cur_pid, pd->ownpid);
++			return NOPAGE_SIGBUS;
+ 		}
+-		break;
+
+-	case 3: /* qp squeue_addr */
+-		ehca_dbg(qp->ib_qp.device, "qp_num=%x squeue",
+-			 qp->ib_qp.qp_num);
+-		ret = ehca_mmap_queue(vma, &qp->ipz_squeue, &qp->mm_count_squeue);
+-		if
(unlikely(ret)) { +- ehca_err(qp->ib_qp.device, +- "ehca_mmap_queue(sq) failed rc=%x qp_num=%x", +- ret, qp->ib_qp.qp_num); +- return ret; ++ if (rsrc_type == 2) { /* rqueue */ ++ ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueuearea", qp); ++ offset = address - vma->vm_start; ++ vaddr = ipz_qeit_calc(&qp->ipz_rqueue, offset); ++ ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p", ++ offset, vaddr); ++ mypage = virt_to_page(vaddr); ++ } else if (rsrc_type == 3) { /* squeue */ ++ ehca_dbg(qp->ib_qp.device, "qp=%p qp squeuearea", qp); ++ offset = address - vma->vm_start; ++ vaddr = ipz_qeit_calc(&qp->ipz_squeue, offset); ++ ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p", ++ offset, vaddr); ++ mypage = virt_to_page(vaddr); + } + break; + + default: +- ehca_err(qp->ib_qp.device, "bad resource type=%x qp=num=%x", +- rsrc_type, qp->ib_qp.qp_num); +- return -EINVAL; ++ ehca_gen_err("bad queue type %x", q_type); ++ return NOPAGE_SIGBUS; + } + +- return 0; ++ if (!mypage) { ++ ehca_gen_err("Invalid page adr==NULL ret=NOPAGE_SIGBUS"); ++ return NOPAGE_SIGBUS; ++ } ++ get_page(mypage); ++ ++ return mypage; + } + ++static struct vm_operations_struct ehcau_vm_ops = { ++ .nopage = ehca_nopage, ++}; ++ + int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) + { + u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; +@@ -253,6 +175,7 @@ int ehca_mmap(struct ib_ucontext *contex + u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + u32 cur_pid = current->tgid; + u32 ret; ++ u64 vsize, physical; + unsigned long flags; + struct ehca_cq *cq; + struct ehca_qp *qp; +@@ -278,12 +201,44 @@ int ehca_mmap(struct ib_ucontext *contex + if (!cq->ib_cq.uobject || cq->ib_cq.uobject->context != context) + return -EINVAL; + +- ret = ehca_mmap_cq(vma, cq, rsrc_type); +- if (unlikely(ret)) { +- ehca_err(cq->ib_cq.device, +- "ehca_mmap_cq() failed rc=%x cq_num=%x", +- ret, cq->cq_number); +- return ret; ++ switch (rsrc_type) { ++ case 1: /* galpa fw handle */ ++ 
ehca_dbg(cq->ib_cq.device, "cq=%p cq triggerarea", cq); ++ vma->vm_flags |= VM_RESERVED; ++ vsize = vma->vm_end - vma->vm_start; ++ if (vsize != EHCA_PAGESIZE) { ++ ehca_err(cq->ib_cq.device, "invalid vsize=%lx", ++ vma->vm_end - vma->vm_start); ++ return -EINVAL; ++ } ++ ++ physical = cq->galpas.user.fw_handle; ++ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); ++ vma->vm_flags |= VM_IO | VM_RESERVED; ++ ++ ehca_dbg(cq->ib_cq.device, ++ "vsize=%lx physical=%lx", vsize, physical); ++ ret = remap_pfn_range(vma, vma->vm_start, ++ physical >> PAGE_SHIFT, vsize, ++ vma->vm_page_prot); ++ if (ret) { ++ ehca_err(cq->ib_cq.device, ++ "remap_pfn_range() failed ret=%x", ++ ret); ++ return -ENOMEM; ++ } ++ break; ++ ++ case 2: /* cq queue_addr */ ++ ehca_dbg(cq->ib_cq.device, "cq=%p cq q_addr", cq); ++ vma->vm_flags |= VM_RESERVED; ++ vma->vm_ops = &ehcau_vm_ops; ++ break; ++ ++ default: ++ ehca_err(cq->ib_cq.device, "bad resource type %x", ++ rsrc_type); ++ return -EINVAL; + } + break; + +@@ -307,12 +262,50 @@ int ehca_mmap(struct ib_ucontext *contex + if (!qp->ib_qp.uobject || qp->ib_qp.uobject->context != context) + return -EINVAL; + +- ret = ehca_mmap_qp(vma, qp, rsrc_type); +- if (unlikely(ret)) { +- ehca_err(qp->ib_qp.device, +- "ehca_mmap_qp() failed rc=%x qp_num=%x", +- ret, qp->ib_qp.qp_num); +- return ret; ++ switch (rsrc_type) { ++ case 1: /* galpa fw handle */ ++ ehca_dbg(qp->ib_qp.device, "qp=%p qp triggerarea", qp); ++ vma->vm_flags |= VM_RESERVED; ++ vsize = vma->vm_end - vma->vm_start; ++ if (vsize != EHCA_PAGESIZE) { ++ ehca_err(qp->ib_qp.device, "invalid vsize=%lx", ++ vma->vm_end - vma->vm_start); ++ return -EINVAL; ++ } ++ ++ physical = qp->galpas.user.fw_handle; ++ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); ++ vma->vm_flags |= VM_IO | VM_RESERVED; ++ ++ ehca_dbg(qp->ib_qp.device, "vsize=%lx physical=%lx", ++ vsize, physical); ++ ret = remap_pfn_range(vma, vma->vm_start, ++ physical >> PAGE_SHIFT, vsize, ++ vma->vm_page_prot); ++ if 
(ret) { ++ ehca_err(qp->ib_qp.device, ++ "remap_pfn_range() failed ret=%x", ++ ret); ++ return -ENOMEM; ++ } ++ break; ++ ++ case 2: /* qp rqueue_addr */ ++ ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueue_addr", qp); ++ vma->vm_flags |= VM_RESERVED; ++ vma->vm_ops = &ehcau_vm_ops; ++ break; ++ ++ case 3: /* qp squeue_addr */ ++ ehca_dbg(qp->ib_qp.device, "qp=%p qp squeue_addr", qp); ++ vma->vm_flags |= VM_RESERVED; ++ vma->vm_ops = &ehcau_vm_ops; ++ break; ++ ++ default: ++ ehca_err(qp->ib_qp.device, "bad resource type %x", ++ rsrc_type); ++ return -EINVAL; + } + break; + +@@ -323,3 +316,77 @@ int ehca_mmap(struct ib_ucontext *contex + + return 0; + } ++ ++int ehca_mmap_nopage(u64 foffset, u64 length, void **mapped, ++ struct vm_area_struct **vma) ++{ ++ down_write(&current->mm->mmap_sem); ++ *mapped = (void*)do_mmap(NULL,0, length, PROT_WRITE, ++ MAP_SHARED | MAP_ANONYMOUS, ++ foffset); ++ up_write(&current->mm->mmap_sem); ++ if (!(*mapped)) { ++ ehca_gen_err("couldn't mmap foffset=%lx length=%lx", ++ foffset, length); ++ return -EINVAL; ++ } ++ ++ *vma = find_vma(current->mm, (u64)*mapped); ++ if (!(*vma)) { ++ down_write(&current->mm->mmap_sem); ++ do_munmap(current->mm, 0, length); ++ up_write(&current->mm->mmap_sem); ++ ehca_gen_err("couldn't find vma queue=%p", *mapped); ++ return -EINVAL; ++ } ++ (*vma)->vm_flags |= VM_RESERVED; ++ (*vma)->vm_ops = &ehcau_vm_ops; ++ ++ return 0; ++} ++ ++int ehca_mmap_register(u64 physical, void **mapped, ++ struct vm_area_struct **vma) ++{ ++ int ret; ++ unsigned long vsize; ++ /* ehca hw supports only 4k page */ ++ ret = ehca_mmap_nopage(0, EHCA_PAGESIZE, mapped, vma); ++ if (ret) { ++ ehca_gen_err("could'nt mmap physical=%lx", physical); ++ return ret; ++ } ++ ++ (*vma)->vm_flags |= VM_RESERVED; ++ vsize = (*vma)->vm_end - (*vma)->vm_start; ++ if (vsize != EHCA_PAGESIZE) { ++ ehca_gen_err("invalid vsize=%lx", ++ (*vma)->vm_end - (*vma)->vm_start); ++ return -EINVAL; ++ } ++ ++ (*vma)->vm_page_prot = pgprot_noncached((*vma)->vm_page_prot); ++
(*vma)->vm_flags |= VM_IO | VM_RESERVED; ++ ++ ret = remap_pfn_range((*vma), (*vma)->vm_start, ++ physical >> PAGE_SHIFT, vsize, ++ (*vma)->vm_page_prot); ++ if (ret) { ++ ehca_gen_err("remap_pfn_range() failed ret=%x", ret); ++ return -ENOMEM; ++ } ++ ++ return 0; ++ ++} ++ ++int ehca_munmap(unsigned long addr, size_t len) { ++ int ret = 0; ++ struct mm_struct *mm = current->mm; ++ if (mm) { ++ down_write(&mm->mmap_sem); ++ ret = do_munmap(mm, addr, len); ++ up_write(&mm->mmap_sem); ++ } ++ return ret; ++} From rdreier at cisco.com Fri May 11 08:41:25 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 11 May 2007 08:41:25 -0700 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: <20070508141727.GR21591@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 8 May 2007 17:17:27 +0300") References: <20070508141727.GR21591@mellanox.co.il> Message-ID: By the way, do you know what the best way to flush WC buffers for i386 is? I know on x86-64 sfence is the way to go, and on ia64 I think we want fc, but I'm not sure what the right thing is for for old 32-bit processors. Also, does it make sense to think about using non-temporal stores (movntq) to get the effect of WC without having to worry about setting up PAT? - R. From sean.hefty at intel.com Fri May 11 09:44:18 2007 From: sean.hefty at intel.com (sean.hefty) Date: Fri, 11 May 2007 09:44:18 -0700 Subject: [ofa-general] RE: [Query] ib add path record cache In-Reply-To: <309a667c0705102331w7839d7et688f9bc00827338@mail.gmail.com> Message-ID: <000001c793eb$9e410f50$ff0da8c0@amr.corp.intel.com> >One user command, reading path records from some file and passing this >to local_sa_cache module using standard entry point (read/write), >local_cache module is >assuming it as a incoming resolved path_record, and adding it to the >cache in normal fashion. possibly some device interface will be >required to be added. >port agent related issues needs to be looked into. > >Or some better idea if you have, can we discuss? 
This sounds fine. I am still just not understanding the reasoning behind populating the cache with dummy entries. (I do think that being able to populate it from a file could be useful to initially load the cache in fairly static configurations.) I should note that the cache flushes old entries after it performs a full update. So if these are entries that you want to remain in the cache after an update, additional changes to the cache would be required. - Sean From sweitzen at cisco.com Fri May 11 10:29:05 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 11 May 2007 10:29:05 -0700 Subject: [ofa-general] RE: [PATCH] ipoib/cm: make stale task actually run once in a while In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C90BFCB3@mtlexch01.mtl.com><6C2C79E72C305246B504CBA17B5500C9076E27@mtlexch01.mtl.com> <20070507200315.GD22341@mellanox.co.il> Message-ID: I see the first patch is in OFED-1.2-20070511-0600 now, I'll try it out. Scott > -----Original Message----- > From: Scott Weitzenkamp (sweitzen) > Sent: Wednesday, May 09, 2007 4:46 PM > To: Michael S. Tsirkin; Scott Weitzenkamp (sweitzen) > Cc: Yohad Dickman; Amit Krig; Tziporet Koren; > mst at mellanox.co.il; general at lists.openfabrics.org; Roland Dreier > Subject: RE: [PATCH] ipoib/cm: make stale task actually run > once in a while > > I see a new patch ipoib_correct_timers.patch in > OFED-1.2-20070509-0600, which patch should I try? > > Scott > > > -----Original Message----- > > From: Michael S. 
Tsirkin [mailto:mst at dev.mellanox.co.il] > > Sent: Monday, May 07, 2007 1:03 PM > > To: Scott Weitzenkamp (sweitzen) > > Cc: Yohad Dickman; Amit Krig; Tziporet Koren; > > mst at mellanox.co.il; general at lists.openfabrics.org; Roland Dreier > > Subject: [PATCH] ipoib/cm: make stale task actually run once > > in a while > > > > In the presence of some active passive connections, stale > > task would never run, > > since each 4 RX CQEs we repeat queue_delayed_work calls which > > delays it for some > > 10 minutes. As a result, on a noisy system with failing > > ports, we slowly run > > out of resources - slowing connection setup down and > > eventually failing. > > > > What we actually want to do is - start stale task when a first > > passive connection is added, rerun it every 10 min as long > > as there are outstanding passive connections. > > > > As a happy side effect, this removes some code from RX data path. > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > Scott, I think this might address bugs 541 and 465: slow > > IPoIB CM HA failover > > and eventual failing IPoIB HA. Could you test this please? 
> > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c > > b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > > index 2b242a4..b77e8d7 100644 > > --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > > @@ -258,10 +258,11 @@ static int ipoib_cm_req_handler(struct > > ib_cm_id *cm_id, struct ib_cm_event *even > > cm_id->context = p; > > p->jiffies = jiffies; > > spin_lock_irqsave(&priv->lock, flags); > > + if (list_empty(&priv->cm.passive_ids)) > > + queue_delayed_work(ipoib_workqueue, > > + &priv->cm.stale_task, > > IPOIB_CM_RX_DELAY); > > list_add(&p->list, &priv->cm.passive_ids); > > spin_unlock_irqrestore(&priv->lock, flags); > > - queue_delayed_work(ipoib_workqueue, > > - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > > return 0; > > > > err_rep: > > @@ -380,8 +381,6 @@ void ipoib_cm_handle_rx_wc(struct > > net_device *dev, struct ib_wc *wc) > > if (!list_empty(&p->list)) > > list_move(&p->list, > > &priv->cm.passive_ids); > > spin_unlock_irqrestore(&priv->lock, flags); > > - queue_delayed_work(ipoib_workqueue, > > - > > &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > > } > > } > > > > @@ -1104,6 +1103,10 @@ static void ipoib_cm_stale_task(struct > > work_struct *work) > > kfree(p); > > spin_lock_irqsave(&priv->lock, flags); > > } > > + > > + if (!list_empty(&priv->cm.passive_ids)) > > + queue_delayed_work(ipoib_workqueue, > > + &priv->cm.stale_task, > > IPOIB_CM_RX_DELAY); > > spin_unlock_irqrestore(&priv->lock, flags); > > } > > > > -- > > MST > > From pradeeps at linux.vnet.ibm.com Fri May 11 11:03:47 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Fri, 11 May 2007 11:03:47 -0700 Subject: [ofa-general] UC mode benefits Message-ID: <4644B003.2040707@linux.vnet.ibm.com> This is an offshoot of the discussions that we have had on this mailing list about moving IPOIB CM to use UC mode at some point in the future. 
Is it speculation that moving to UC mode will get us better performance than RC mode? Or, if you do have some hard data to that effect, can you please share it? Pradeep From raleigh at systemfabricworks.com Fri May 11 11:58:39 2007 From: raleigh at systemfabricworks.com (Raleigh F Rinehart) Date: Fri, 11 May 2007 13:58:39 -0500 Subject: [ofa-general] RFC: location for IB CM statistics In-Reply-To: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com> References: <000001c7933a$ec2fa480$8698070a@amr.corp.intel.com> Message-ID: <4644BCDF.1000300@systemfabricworks.com> Sean Hefty wrote: > I'd like to start adding some statistical information to the IB CM to help > identify scalability or connectivity issues. Some example statistics that I > would like to expose now are number of retried MADs, unmatched requests, total > number of connections, etc. Longer term, additional statistics and information > on each connection could be added. > > I'm looking for ideas on the best way to expose this sort of data. Any > thoughts? > > - Sean > > This may be way off on a tangent, but some users may also benefit from the CM exposing these in an mmapable region so that one can do RDMA reads to gather them. This is useful in monitoring large clusters of IB nodes as efficiently as possible. This type of scenario has been discussed and researched (see the OSU paper by Panda et al.). -raleigh -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/x-pkcs7-signature Size: 3285 bytes Desc: S/MIME Cryptographic Signature URL: From pradeeps at linux.vnet.ibm.com Fri May 11 12:19:46 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Fri, 11 May 2007 12:19:46 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <20070511133639.GD30092@mellanox.co.il> References: <4641E99B.10706@linux.vnet.ibm.com> <46438DF2.3080601@linux.vnet.ibm.com> <20070511133639.GD30092@mellanox.co.il> Message-ID: <4644C1D2.6040103@linux.vnet.ibm.com> Michael S. Tsirkin wrote: >> Quoting Pradeep Satyanarayana : >> Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review >> >> If there are no other issues than the small restructure suggestion that >> Michael had, can this patch be merged into the for-2.6.22 tree? > > I'm not sure. > > I haven't the time, at the moment, to go over the patch again in depth. > Have the issues from this message been addressed? > > http://www.mail-archive.com/general at lists.openfabrics.org/msg02056.html > > Just a quick review, it seems that two most important issues have > apparently not been addressed yet: > > 1. Testing device SRQ capability twice on each RX packet is just too ugly, > and it *should* be possible to structure the code > by separating common functionality in separate > functions instead of scattering if (srq) tests around. I have restructured the code as suggested. In the latest code, there are only two places where SRQ capability is tested upon receipt of a packet: a) ipoib_cm_handle_rx_wc() b) ipoib_cm_post_receive() Instead of the suggested change to ipoib_cm_handle_rx_packet() it is possible to change ipoib_cm_post_receive() and call the srq and nosrq versions directly, without mangling the code. However, I do not believe that this should stop the code from being merged. This can be handled as a separate patch. > > 2. 
Once the number of created connections exceeds > the constant that you allow, all attempts to communicate > with this node over IP over IB will fail. > A way needs to be designed to switch to the datagram mode, > and to retry going back to connected after some time. > [We actually have this theoretical issue in SRQ > as well - it is just much more severe in the nonSRQ case]. Firstly, this has now been changed to send a REJ message to the remote side indicating that there are no more free QPs. It is up to the application to handle the situation. Previously, this was flagged as an error that appeared in /var/log/messages. However, here are a few other things we need to consider. Let us compute the amount of memory consumed when we run into this situation: In CM mode we use 64K packets. Assuming the rx_ring has 256 entries and the current limitation of 1024 QPs, NOSRQ alone will consume 16GB of memory. All else remaining the same, if we change the rx_ring size to 1024, NOSRQ will consume 64GB of memory. This is huge and my guess is that on most systems, the application will run out of memory before it runs out of RC QPs (with NOSRQ). Aside from this I would like to understand how we would switch just the "current" QP to datagram mode; we would not want to switch all the existing QPs to datagram mode, as that would be unacceptable. Also, we should not prevent subsequent connections using RC QPs. Is there anything in the IB spec about this? I think solving this is a fairly big issue and not just specific to NOSRQ. NOSRQ is just exacerbating the situation. This can be dealt with all at once with SRQ and NOSRQ, if need be. Hence, I do not see these as impediments to the merge. 
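For reference, the memory figures quoted in this message can be reproduced with a small sketch. It assumes 64 KiB receive buffers, one buffer per rx_ring entry, and one private ring per QP (the NOSRQ case); the function and parameter names below are hypothetical and are not taken from the patch.

```c
#include <stdint.h>

/*
 * Receive-buffer memory consumed when every connection owns its own
 * receive ring (no SRQ): one buffer per ring entry, one ring per QP.
 * Names are illustrative only, not from the IPoIB CM code.
 */
static uint64_t nosrq_rx_bytes(uint64_t buf_bytes, uint64_t ring_entries,
                               uint64_t num_qps)
{
    return buf_bytes * ring_entries * num_qps;
}

/* Convert a byte count to whole GiB (2^30 bytes). */
static uint64_t bytes_to_gib(uint64_t bytes)
{
    return bytes >> 30;
}
```

With 64 KiB buffers, 256 ring entries and 1024 QPs, nosrq_rx_bytes() gives 2^34 bytes, i.e. 16 GiB; raising the ring to 1024 entries gives 2^36 bytes, i.e. 64 GiB. Both match the figures stated above.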
Pradeep From sweitzen at cisco.com Fri May 11 15:32:05 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 11 May 2007 15:32:05 -0700 Subject: [ofa-general] RE: [PATCH] ipoib/cm: make stale task actually run once in a while (DOES NOT HELP) In-Reply-To: <20070507200315.GD22341@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C90BFCB3@mtlexch01.mtl.com><6C2C79E72C305246B504CBA17B5500C9076E27@mtlexch01.mtl.com> <20070507200315.GD22341@mellanox.co.il> Message-ID: This patch, which is in OFED-1.2-20070511-0600, does NOT help. I am still seeing 105-second port failover times. Amit, did you try it? Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] > Sent: Monday, May 07, 2007 1:03 PM > To: Scott Weitzenkamp (sweitzen) > Cc: Yohad Dickman; Amit Krig; Tziporet Koren; > mst at mellanox.co.il; general at lists.openfabrics.org; Roland Dreier > Subject: [PATCH] ipoib/cm: make stale task actually run once > in a while > > In the presence of some active passive connections, stale > task would never run, > since each 4 RX CQEs we repeat queue_delayed_work calls which > delays it for some > 10 minutes. As a result, on a noisy system with failing > ports, we slowly run > out of resources - slowing connection setup down and > eventually failing. > > What we actually want to do is - start stale task when a first > passive connection is added, rerun it every 10 min as long > as there are outstanding passive connections. > > As a happy side effect, this removes some code from RX data path. > > Signed-off-by: Michael S. Tsirkin > > --- > > Scott, I think this might address bugs 541 and 465: slow > IPoIB CM HA failover > and eventual failing IPoIB HA. Could you test this please? 
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c > b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > index 2b242a4..b77e8d7 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > @@ -258,10 +258,11 @@ static int ipoib_cm_req_handler(struct > ib_cm_id *cm_id, struct ib_cm_event *even > cm_id->context = p; > p->jiffies = jiffies; > spin_lock_irqsave(&priv->lock, flags); > + if (list_empty(&priv->cm.passive_ids)) > + queue_delayed_work(ipoib_workqueue, > + &priv->cm.stale_task, > IPOIB_CM_RX_DELAY); > list_add(&p->list, &priv->cm.passive_ids); > spin_unlock_irqrestore(&priv->lock, flags); > - queue_delayed_work(ipoib_workqueue, > - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > return 0; > > err_rep: > @@ -380,8 +381,6 @@ void ipoib_cm_handle_rx_wc(struct > net_device *dev, struct ib_wc *wc) > if (!list_empty(&p->list)) > list_move(&p->list, > &priv->cm.passive_ids); > spin_unlock_irqrestore(&priv->lock, flags); > - queue_delayed_work(ipoib_workqueue, > - > &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > } > } > > @@ -1104,6 +1103,10 @@ static void ipoib_cm_stale_task(struct > work_struct *work) > kfree(p); > spin_lock_irqsave(&priv->lock, flags); > } > + > + if (!list_empty(&priv->cm.passive_ids)) > + queue_delayed_work(ipoib_workqueue, > + &priv->cm.stale_task, > IPOIB_CM_RX_DELAY); > spin_unlock_irqrestore(&priv->lock, flags); > } > > -- > MST > From devesh28 at gmail.com Fri May 11 23:39:19 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Sat, 12 May 2007 12:09:19 +0530 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <000001c793eb$9e410f50$ff0da8c0@amr.corp.intel.com> References: <309a667c0705102331w7839d7et688f9bc00827338@mail.gmail.com> <000001c793eb$9e410f50$ff0da8c0@amr.corp.intel.com> Message-ID: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> Thanks for replying, My comments are as follows On 5/11/07, sean.hefty wrote: > >One user command, reading path records from some 
file and passing this > >to local_sa_cache module using standard entry point (read/write), > >local_cache module is > >assuming it as a incoming resolved path_record, and adding it to the > >cache in normal fashion. possibly some device interface will be > >required to be added. > >port agent related issues needs to be looked into. > > > >Or some better idea if you have, can we discuss? > > This sounds fine. I still just not understanding the reasoning behind This can be treated as a facility similar to what we have in the ARP table for TCP/IP. Secondly, this will help in debugging some new upcoming, partially InfiniBand-compliant hardware. > populating the cache with dummy entries. (I do think that being able to > populate it from a file could be useful to initially load the cache in fairly > static configurations.) This is one more benefit we will get: it will prevent the initial traffic generated by the local_sa_cache module on the network. Assume that the cluster is big and every node is creating its cache DB; this will generate a huge traffic burst, and multi-pathing will make the case even worse. Generating a static path_record initially is an issue! > > I should note that the cache flushes old entries after it performs a full > update. So if these are entries that you want to remain in the cache after an Yes, I want them to remain in the DB; my idea is similar to the hard-coding of ARP table entries in TCP/IP. How do you see this being achieved? > update, additional changes of the cache would be required. 
> > - Sean > From vlad at lists.openfabrics.org Sat May 12 02:30:24 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 12 May 2007 02:30:24 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070512-0200 daily build status Message-ID: <20070512093025.32400E60823@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on ia64 with 
linux-2.6.16 Failed: From ianjiang.ict at gmail.com Sat May 12 02:47:30 2007 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Sat, 12 May 2007 17:47:30 +0800 Subject: [ofa-general] [SRPT]multiple initiators supported? Message-ID: <7b2fa1820705120247t1b232345w8373bb72416c5b28@mail.gmail.com> Does the SRP target support multiple initiators? I am using the SRP initiator and IB drivers in linux-2.6.20. The SRP target is at http://www.openfabrics.org/git/?p=~vu/srpt.git;a=summary and the IB driver is OFED-1.1 with linux-2.6.16.13-4-default of Suse-10.1. -- Ian Jiang From mst at dev.mellanox.co.il Sat May 12 10:29:27 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sat, 12 May 2007 20:29:27 +0300 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: References: <20070508141727.GR21591@mellanox.co.il> Message-ID: <20070512172927.GA5908@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: libmlx4 wc flash > > By the way, do you know what the best way to flush WC buffers for i386 > is? I know on x86-64 sfence is the way to go, and on ia64 I think we > want fc, but I'm not sure what the right thing is for for old 32-bit > processors. Maybe just disable WC there? > Also, does it make sense to think about using non-temporal stores > (movntq) to get the effect of WC without having to worry about setting > up PAT? I don't think it works this way: if PAT is programmed to UC, I think you get UC access with movntq. No? -- MST From mst at dev.mellanox.co.il Sat May 12 13:06:35 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sat, 12 May 2007 23:06:35 +0300 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <4644C1D2.6040103@linux.vnet.ibm.com> References: <4641E99B.10706@linux.vnet.ibm.com> <46438DF2.3080601@linux.vnet.ibm.com> <20070511133639.GD30092@mellanox.co.il> <4644C1D2.6040103@linux.vnet.ibm.com> Message-ID: <20070512200635.GB5908@mellanox.co.il> > Quoting Pradeep Satyanarayana : > Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review > > Michael S. Tsirkin wrote: > >>Quoting Pradeep Satyanarayana : > >>Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review > >> > >>If there are no other issues than the small restructure suggestion that > >>Michael had, can this patch be merged into the for-2.6.22 tree? > > > >I'm not sure. > > > >I haven't the time, at the moment, to go over the patch again in depth. > >Have the issues from this message been addressed? > > > >http://www.mail-archive.com/general at lists.openfabrics.org/msg02056.html > > > >Just a quick review, it seems that two most important issues have > >apparently not been addressed yet: > > > >1. Testing device SRQ capability twice on each RX packet is just too ugly, > > and it *should* be possible to structure the code > > by separating common functionality in separate > > functions instead of scattering if (srq) tests around. > > I have restructured the code as suggested. In the latest code, there are > only two places where SRQ capability is tested upon receipt of a packet: > a) ipoib_cm_handle_rx_wc() > b)ipoib_cm_post_receive() > > Instead of the suggested change to ipoib_cm_handle_rx_packet() it is > possible to change ipoib_cm_post_receive() and call the srq and nosrq > versions directly, without mangling the code. However, I do not believe > that this should be stopping us from the code being merged. This can > handled as a separate patch. I actually suggested implementing separate poll routines for srq and non-srq code. 
This way we won't have *any* if(srq) tests on datapath.

> >
> >2. Once the number of created connections exceeds
> > the constant that you allow, all attempts to communicate
> > with this node over IP over IB will fail.
> > A way needs to be designed to switch to the datagram mode,
> > and to retry going back to connected after some time.
> > [We actually have this theoretical issue in SRQ
> > as well - it is just much more severe in the nonSRQ case].
>
> Firstly, this has now been changed to send a REJ message to the remote
> side indicating that there no more free QPs.

Since the HCA actually has free QPs - you are actually running out of buffers
that you are ready to prepost - one might argue about whether this is spec
compliant behaviour. This is something that might better be checked up with at
IBTA.

> It is up to the application
> to handle the situation.

The application here being kernel IP over IB here, it currently handles the
reject by dropping outstanding packets and retrying the connection on the next
packet to this dst. So the specific node might be denied connectivity
potentially forever.

> Previously, this was flagged as an error that
> appeared in /var/log/messages.
>
> However, here are a few other things we need to consider. Lets us
> compute the amount of memory consumed when we run into this situation:
>
> In CM mode we use 64K packets. Assuming, the rx_ring has 256 entries and
> the current limitation of 1024 QPs, NOSRQ only will consume 16GB of
> memory. All else remaining the same if we change the rx_ring size to
> 1024, NOSRQ will consume 64GB of memory.
>
> This is huge and my guess is that on most systems, the application will
> run out of memory before it runs out of RC QPs (with NOSRQ).
>
> Aside from this I would like to understand how do we switch just the
> "current" QP to datagram mode; we would not want to switch all the
> existing QPs to datagram mode -that would be unacceptable. Also, we
> should not prevent subsequent connections using RC QPs. Is there
> anything in the IB spec about this?

Yes, this might need a solution at the protocol level, as you indicate above.

> I think solving this is a fairly big issue and not just specific to
> NOSRQ. NOSRQ is just exacerbating the situation. This can be dealt with
> all at once with SRQ and NOSRQ, if need be.

IMO, the memory scalability issue is specific to your code.
With current code using shared RQ, each connection needs
an order of 1KByte of memory. So we need just 10MByte for a typical
10000 node cluster.

> Hence, I do not see these as impediments to the merge.

In my humble opinion, we need a handle on the scalability issue
(other than crashing or denying service) before merging this, otherwise IBM will
be the first to object to making connected mode the default.

--
MST

From mst at dev.mellanox.co.il Sat May 12 13:15:19 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 12 May 2007 23:15:19 +0300
Subject: [ofa-general] Re: [PATCH] ipoib/cm: make stale task actually run once in a while (DOES NOT HELP)
In-Reply-To:
References: <20070507200315.GD22341@mellanox.co.il>
Message-ID: <20070512201519.GD5908@mellanox.co.il>

> Quoting Scott Weitzenkamp (sweitzen) :
> Subject: RE: [PATCH] ipoib/cm: make stale task actually run once in a while (DOES NOT HELP)
>
> This patch, which is in OFED-1.2-20070511-0600, does NOT help. I am
> still seeing 105-second port failover times. Amit, did you try it?

Same here. Still debugging.

--
MST

From rdreier at cisco.com Sat May 12 14:20:18 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 12 May 2007 14:20:18 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070512172927.GA5908@mellanox.co.il> (Michael S.
 Tsirkin's message of "Sat, 12 May 2007 20:29:27 +0300")
References: <20070508141727.GR21591@mellanox.co.il> <20070512172927.GA5908@mellanox.co.il>
Message-ID:

> > By the way, do you know what the best way to flush WC buffers for i386
> > is? I know on x86-64 sfence is the way to go, and on ia64 I think we
> > want fc, but I'm not sure what the right thing is for old 32-bit
> > processors.
>
> Maybe just disable WC there?

I think we want to use write combining on 32-bit kernels or 32-bit
userspace. But I don't want to rely on SSE2 instructions for i386 binaries.

> I don't think it works this way: if PAT is programmed to UC,
> I think you get UC access with movntq. No?

You're right -- I misremembered what the non-temporal stuff does, but
I just checked and the manual says:

"The memory type of the region being written to can override the
non-temporal hint, if the memory address specified for the
non-temporal store is in an uncacheable (UC) or write protected (WP)
memory region."

 - R.

From rdreier at cisco.com Sat May 12 14:21:44 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 12 May 2007 14:21:44 -0700
Subject: [ofa-general] Re: UC mode benefits
In-Reply-To: <4644B003.2040707@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Fri, 11 May 2007 11:03:47 -0700")
References: <4644B003.2040707@linux.vnet.ibm.com>
Message-ID:

> Is it speculation that moving to UC mode will get us better performance
> than RC mode or, if you do have some hard data to that effect can you
> please share the same?

I don't think there will be any performance benefit. The advantage is
just that dropped up messages will just be dropped without retries or
transitioning the QP to error.

From mst at dev.mellanox.co.il Sat May 12 21:59:38 2007
From: mst at dev.mellanox.co.il (Michael S.
 Tsirkin)
Date: Sun, 13 May 2007 07:59:38 +0300
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To:
References: <20070508141727.GR21591@mellanox.co.il> <20070512172927.GA5908@mellanox.co.il>
Message-ID: <20070513045921.GA7402@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: libmlx4 wc flash
>
> > > By the way, do you know what the best way to flush WC buffers for i386
> > > is? I know on x86-64 sfence is the way to go, and on ia64 I think we
> > > want fc, but I'm not sure what the right thing is for old 32-bit
> > > processors.
> >
> > Maybe just disable WC there?
>
> I think we want to use write combining on 32-bit kernels or 32-bit
> userspace. But I don't want to rely on SSE2 instructions for i386 binaries.
>
> > I don't think it works this way: if PAT is programmed to UC,
> > I think you get UC access with movntq. No?
>
> You're right -- I misremembered what the non-temporal stuff does, but
> I just checked and the manual says:
>
> "The memory type of the region being written to can override the
> non-temporal hint, if the memory address specified for the
> non-temporal store is in an uncacheable (UC) or write protected (WP)
> memory region."

I just found this:

• Write Combining (WC) — System memory locations are not cached (as with
uncacheable memory) and coherency is not enforced by the processor’s bus
coherency protocol. Speculative reads are allowed. Writes may be delayed and
combined in the write combining buffer (WC buffer) to reduce memory accesses.
If the WC buffer is partially filled, the writes may be delayed until the next
occurrence of a serializing event; such as, an SFENCE or MFENCE instruction,
CPUID execution, a read or write to uncached memory, an interrupt occurrence,
or a LOCK instruction execution. This type of cache control is appropriate for
video frame buffers, where the order of writes is unimportant as long as the
writes update memory so they can be seen on the graphics display.
See Section 10.3.1, “Buffering of Write Combining Memory Locations,” for more
information about caching the WC memory type. This memory type is available in
the Pentium Pro and Pentium II processors by programming the MTRRs or in the
Pentium III, Pentium 4, and Intel Xeon processors by programming the MTRRs or
by selecting it through the PAT.

But in another place it says confusingly:

Software should access semaphores (shared memory used for signalling between
multiple processors) using identical addresses and operand lengths. For
example, if one processor accesses a semaphore using a word access, other
processors should not access the semaphore using a byte access. Do not use
semaphores on the WC memory type.

So, could we use a lock instruction to fence WC writes out?

--
MST

From mst at dev.mellanox.co.il Sat May 12 22:18:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 13 May 2007 08:18:06 +0300
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To:
References: <20070508141727.GR21591@mellanox.co.il>
Message-ID: <20070513051806.GB7402@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: libmlx4 wc flash
>
> By the way, do you know what the best way to flush WC buffers for i386
> is? I know on x86-64 sfence is the way to go, and on ia64 I think we
> want fc, but I'm not sure what the right thing is for old 32-bit
> processors.

By the way, I just re-checked and it seems that WC support first appeared in
Pentium II systems. So I think we should be able to use sfence if WC is
enabled.

--
MST

From kliteyn at dev.mellanox.co.il Sun May 13 01:03:43 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 13 May 2007 11:03:43 +0300
Subject: [ofa-general] [PATCH] osm: two similar bugs in Up/Dn routing
Message-ID: <4646C65F.1090705@dev.mellanox.co.il>

Hi Hal,

This small patch fixes two similar bugs in Up/Dn routing in OpenSM.
A 8-bits integers were used as indexes when scanning subnet, which
in one case caused OpenSM to crash when ranking "path" is longer
than 256 switches, and in other case caused OpenSM to go into infinite
loop when fabric has more than 256 roots.

Please apply both to ofed_1_2 and to master.

-- Yevgeny

Signed-off-by: Yevgeny Kliteynik
---
 osm/opensm/osm_ucast_updn.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c
index 78e8363..398a2b2 100644
--- a/osm/opensm/osm_ucast_updn.c
+++ b/osm/opensm/osm_ucast_updn.c
@@ -473,7 +473,7 @@ updn_subn_rank(
   IN updn_t* p_updn )
 {
   osm_switch_t *p_sw;
-  uint8_t rank = base_rank;
+  uint32_t rank = base_rank;
   osm_physp_t *p_physp, *p_remote_physp;
   cl_qlist_t list;
   cl_status_t did_cause_update;
@@ -636,7 +636,7 @@ __osm_subn_calc_up_down_min_hop_table(
   IN uint64_t* guid_list,
   IN updn_t* p_updn )
 {
-  uint8_t idx = 0;
+  uint32_t idx = 0;
   int status;
 
   OSM_LOG_ENTER( &p_updn->p_osm->log, osm_subn_calc_up_down_min_hop_table );
--
1.4.4.1.GIT

From dotanb at dev.mellanox.co.il Sun May 13 01:59:09 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 13 May 2007 11:59:09 +0300
Subject: [ofa-general] LMC read_bw test
In-Reply-To: <1c16cdf90705110254o7acf9996ye2bacbb995351b2e@mail.gmail.com>
References: <1c16cdf90705110254o7acf9996ye2bacbb995351b2e@mail.gmail.com>
Message-ID: <4646D35D.3050708@dev.mellanox.co.il>

Chevchenkovic Chevchenkovic wrote:
> Hi,
> I had this problem. I had the following configuration problem:
> node 1 : port 1 : LMC = 1 , LIDs = 12,13
> node 2 : port 1 : LMC = 1 , LIDs = 18,19
>
> Now when I run the read_bw test with the lids set as 12 and 18, the
> test runs fine with no errors. But if I set the value to 13 and 19 , I
> get an error in execution. The error is in completion queue.
> How do i get over this?

How do you "tell" the test to use the LMC value?
(i have a tip: when the extra LIDs (which were given to the port due to the
LMC) are being used, src_path_bits needs to be set as well).

thanks
Dotan

From vlad at mellanox.co.il Sun May 13 02:01:03 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Sun, 13 May 2007 12:01:03 +0300
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 0/4] ehca: backport for rhel-4.5
In-Reply-To: <200705101626.56308.ossrosch@linux.vnet.ibm.com>
References: <200705101626.56308.ossrosch@linux.vnet.ibm.com>
Message-ID: <1179046863.9023.0.camel@vladsk-laptop>

On Thu, 2007-05-10 at 16:26 +0200, Stefan Roscher wrote:
> Because these patches
> http://lists.openfabrics.org/pipermail/general/2007-May/036125.html
> I send before were in wrong format and did not patch into backport directory I
> send now the changed patches.
>
> Regards Stefan
>

Applied patches 1-4 and ofed-build-scripts patch.

--
Vladimir Sokolovsky
Mellanox Technologies Ltd.

From vlad at lists.openfabrics.org Sun May 13 02:31:18 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun, 13 May 2007 02:31:18 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070513-0200 daily build status
Message-ID: <20070513093118.A366EE60836@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.19

Failed:

From erezz at voltaire.com Sun May 13 02:38:38 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Sun, 13 May 2007 12:38:38 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <20070510092925.GB13655@mellanox.co.il>
References: <4641D295.5060907@voltaire.com> <4641D38A.8040406@voltaire.com> <20070510092925.GB13655@mellanox.co.il>
Message-ID: <4646DC9E.9030706@voltaire.com>

Michael S. Tsirkin wrote:
>> Quoting Erez Zilber :
>> Subject: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
>>
>>
>> Add the required backport patches & kernel addons for open-iscsi
>> over iSER in RHAS4 up3 and up4.
>>
>> Signed-off-by: Erez Zilber
>>
>
> In addition to posting patches, could you pls publish a git tree to pull from,
> please? This makes it easy to test-build the patch as our build system
> knows how to do git checkout.
>
> ---
>
> Two comments, generally
> A: Please move code from kernel_patches to kernel_addons as much
> as possible. There are many places where you just add new headers,
> or add #include directives, or change the function called or
> remove extra parameters, all this can and should be done through addons.
>
> B: Please do not add code to core unless there is more than 1 user -
> add it to the iser module instead. This way if there is
> compilation failure there, you do not break core for people.
>
>

Thanks for the feedback. I will make the fixes and post a new version soon.

Erez

From halr at voltaire.com Sun May 13 05:39:35 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 May 2007 08:39:35 -0400
Subject: [ofa-general] Re: [PATCH] osm: two similar bugs in Up/Dn routing
In-Reply-To: <4646C65F.1090705@dev.mellanox.co.il>
References: <4646C65F.1090705@dev.mellanox.co.il>
Message-ID: <1179059958.1540.81616.camel@hal.voltaire.com>

Hi Yevgeny,

On Sun, 2007-05-13 at 04:03, Yevgeny Kliteynik wrote:
> Hi Hal,
>
> This small patch fixes two similar bugs in Up/Dn routing in OpenSM.
> A 8-bits integers were used as indexes when scanning subnet, which
> in one case caused OpenSM to crash when ranking "path" is longer
> than 256 switches, and in other case caused OpenSM to go into infinite
> loop when fabric has more than 256 roots.

Good catch.

> Please apply both to ofed_1_2 and to master.
>
> -- Yevgeny
>
> Signed-off-by: Yevgeny Kliteynik

Thanks. Applied (to both master and ofed_1_2).

-- Hal

From dotanb at dev.mellanox.co.il Sun May 13 07:06:40 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 13 May 2007 17:06:40 +0300
Subject: [ofa-general] RE: man pages for the rdma-cm
In-Reply-To: <000201c79196$72e8a680$39d1180a@amr.corp.intel.com>
References: <000201c79196$72e8a680$39d1180a@amr.corp.intel.com>
Message-ID: <46471B70.1070303@dev.mellanox.co.il>

Hi Sean & Steve.

> I added an rdma_cm man page that gives an overview. I still need to add
> references to this man page from the other API man pages, which I'll do before
> pushing into OFED.
>

The man pages for the rdma-cm are really great news, but their style is
different from the man pages of the libibverbs utils and verbs.

For example: you can compare the man pages of rping and ibv_devinfo and view
the differences (OPTIONS section was added, AUTHORS section is missing, the
first line and the last line in the man pages are different).

The man pages style of the verbs/functions are different as well.

I believe that we wish that all of the man pages that are being installed with
the OFED package will have the same style.

Can you please fix this issue?

Thanks
Dotan

From jackm at dev.mellanox.co.il Sun May 13 07:18:23 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 13 May 2007 17:18:23 +0300
Subject: [ofa-general] [PATCH] mlx4: fix uninitialized spinlock for 32-bit architectures
Message-ID: <200705131718.23298.jackm@dev.mellanox.co.il>

uar_lock spinlock was used in mlx4_ib_cq_arm without being initialized
(on 32-bit hosts, in 64-bit atomic writes).

Signed-off-by: Jack Morgenstein

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 7959698..ad107f2 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -561,6 +561,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 
 	spin_lock_init(&ibdev->sm_lock);
 	mutex_init(&ibdev->cap_mask_mutex);
+	MLX4_INIT_DOORBELL_LOCK(&ibdev->uar_lock);
 
 	if (ib_register_device(&ibdev->ib_dev))
 		goto err_map;

From mst at dev.mellanox.co.il Sun May 13 08:56:47 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 13 May 2007 18:56:47 +0300
Subject: [ofa-general] [PATCH] mthca: fix posting >255 recv WRs
Message-ID: <20070513155647.GD16537@mellanox.co.il>

The following fixes posting lists of > 255 WRs for tavor:
rq.next_ind must be updated each doorbell, otherwise the next doorbell
will use an incorrect index.
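To illustrate why the index has to advance with every doorbell, here is a toy
user-space model, not the driver code: it only tracks which start index each
doorbell would carry when a long receive list is posted in chunks. The struct,
the function name and the update_each_db switch are assumptions made for the
illustration; MAX_WQES_PER_DB stands in for MTHCA_TAVOR_MAX_WQES_PER_RECV_DB.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_WQES_PER_DB 256 /* stand-in for MTHCA_TAVOR_MAX_WQES_PER_RECV_DB */
#define MAX_DB_LOG 16       /* capacity of the doorbell log used below */

/* Toy receive queue: only the fields this model needs. */
struct toy_rq {
	unsigned max;      /* ring size in WQEs */
	unsigned next_ind; /* start index the next doorbell will report */
};

/*
 * Post n_wrs receive entries, ringing a "doorbell" every MAX_WQES_PER_DB
 * entries. Each doorbell's start index is stored in db_log. Returns the
 * number of doorbells rung. With update_each_db false, next_ind is only
 * refreshed at the very end, which models the behaviour before the fix.
 */
static int toy_post_recv(struct toy_rq *rq, unsigned n_wrs,
			 bool update_each_db, unsigned *db_log)
{
	unsigned ind = rq->next_ind;
	unsigned nreq = 0;
	int n_db = 0;

	assert(rq->max > 0);
	for (unsigned i = 0; i < n_wrs; i++) {
		ind = (ind + 1) % rq->max;
		if (++nreq == MAX_WQES_PER_DB) {
			assert(n_db < MAX_DB_LOG);
			db_log[n_db++] = rq->next_ind; /* index sent to HW */
			if (update_each_db)
				rq->next_ind = ind;    /* the one-line fix */
			nreq = 0;
		}
	}
	if (nreq) {
		/* Final, partial batch. */
		assert(n_db < MAX_DB_LOG);
		db_log[n_db++] = rq->next_ind;
	}
	rq->next_ind = ind;
	return n_db;
}
```

Posting 300 entries with update_each_db set rings two doorbells carrying
indices 0 and 256; with it cleared, both doorbells carry 0, which is the stale
index this patch is meant to remove.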
Found by Ronni Zimmermann at Mellanox.

Signed-off-by: Michael S. Tsirkin

--

diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index fee60c8..72fabb8 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -1862,6 +1862,7 @@ int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
 				      dev->kar + MTHCA_RECEIVE_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 
+			qp->rq.next_ind = ind;
 			qp->rq.head += MTHCA_TAVOR_MAX_WQES_PER_RECV_DB;
 			size0 = 0;
 		}

--
MST

From mst at dev.mellanox.co.il Sun May 13 08:57:08 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 13 May 2007 18:57:08 +0300
Subject: [ofa-general] [PATCH] libmthca: fix posting >255 recv WRs
Message-ID: <20070513155708.GE16537@mellanox.co.il>

The following fixes posting lists of > 255 WRs for tavor:
rq.next_ind must be updated each doorbell, otherwise the next doorbell
will use an incorrect index.

Found by Ronni Zimmermann at Mellanox.

Signed-off-by: Michael S. Tsirkin

--

Same as the kernel patch, really.
diff --git a/src/qp.c b/src/qp.c index f2483e9..372a418 100644 --- a/src/qp.c +++ b/src/qp.c @@ -412,6 +412,7 @@ int mthca_tavor_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_RECV_DOORBELL); + qp->rq.next_ind = ind; qp->rq.head += MTHCA_TAVOR_MAX_WQES_PER_RECV_DB; size0 = 0; } -- MST From sashak at voltaire.com Sun May 13 10:29:38 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 13 May 2007 20:29:38 +0300 Subject: [ofa-general] Re: [PATCH] osm: two similar bugs in Up/Dn routing In-Reply-To: <1179059958.1540.81616.camel@hal.voltaire.com> References: <4646C65F.1090705@dev.mellanox.co.il> <1179059958.1540.81616.camel@hal.voltaire.com> Message-ID: <20070513172938.GG29746@sashak.voltaire.com> On 08:39 Sun 13 May , Hal Rosenstock wrote: > Hi Yevgeny, > > On Sun, 2007-05-13 at 04:03, Yevgeny Kliteynik wrote: > > Hi Hal, > > > > This small patch fixes two similar bugs in Up/Dn routing in OpenSM. > > A 8-bits integers were used as indexes when scanning subnet, which > > in one case caused OpenSM to crash when ranking "path" is longer > > than 256 switches, and in other case caused OpenSM to go into infinite > > loop when fabric has more than 256 roots. > > Good catch. Yes. And IMO this shows how fixed-size integers overusing can hurt. Sasha From sashak at voltaire.com Sun May 13 12:55:39 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 13 May 2007 22:55:39 +0300 Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager In-Reply-To: <20070508184938.311b1c8f.weiny2@llnl.gov> References: <20070508184938.311b1c8f.weiny2@llnl.gov> Message-ID: <20070513195539.GH29746@sashak.voltaire.com> Hi Ira, Thanks for the great work! On 18:49 Tue 08 May , Ira Weiny wrote: > I would like to submit to the list a performance manager which I have been > working on for OpenSM. > > It is implemented as the first proposed architecture model set forth by Hal (As > an integrated thread to OpenSM.) 
As such it works fine on our small test > cluster but there is some concern about its scalability. > > I have extended this architecture with an idea of my own. This idea is to have > a plug-able module for the "event database". With this interface one could > write their own Data reduction, logging, and tracking methods. Here at LLNL I > propose to use this to add counter and subnet events directly to our management > database which is used to show system status to our operators. Other > installations might prefer other methods of logging, SNMP for example. This > patch includes a "reference" implementation of this "event database" which > stores the information internally until the user requests a "dump". I like this event db idea, but not sure this should not be integral part of the low level perfmgr stuff - as it is currently implemented without such plugin loaded PerfMgr just doesn't work - this unconditionally tries to pull all ports counters, but has nothing to do with it without plugin. Instead I would purpose to have a builtin PerfMgr which will be able to pull and store performance related data and then to call "generic" event manager which can process such data. This also will help to have simpler generic API for such event db plugin so other parts of OpenSM will be able to report events using same method(s). What do you think? Some patch related comments are inlined below. Sasha > > Let the flames begin, > Ira Weiny > weiny2 at llnl.gov > > > > >From 4ce288b6a5a371872cf160f6d4e29e768a065cb9 Mon Sep 17 00:00:00 2001 > From: Ira K. 
Weiny > Date: Tue, 24 Apr 2007 23:44:15 -0700 > Subject: [PATCH] OpenSM Proposed Perf Manager > > Features include: > * Create "PerfMgr" thread and sweep all ports on the subnet every > sweep_time seconds > * port counter clear on overflow > * plugable architecture for the "event" database > * Output machine and human readable output in the default event database > dump > * Control using the "perfmgr" command in the console > > Known Issues > * Not tested at scale. > * Event database should record trap events and other "intresting" subnet > events. > * port counter log warnings should be configureable not hard coded. > * partitions are not handled yet. > * Code might not be as pristine as I would like > > Enable using --enable-perf-mgr > > Signed-off-by: Ira K. Weiny > --- > osm/Makefile.am | 3 +- > osm/config/osmvsel.m4 | 26 ++ > osm/configure.in | 5 +- > osm/eventdb/Makefile.am | 37 ++ > osm/eventdb/autogen.sh | 15 + > osm/eventdb/configure.in | 70 ++++ > osm/eventdb/libibeventdb.map | 5 + > osm/eventdb/libibeventdb.spec.in | 38 ++ > osm/eventdb/libibeventdb.ver | 9 + > osm/eventdb/src/ibeventdb.c | 622 +++++++++++++++++++++++++++++++++ > osm/include/Makefile.am | 2 + > osm/include/iba/ib_types.h | 74 ++++ > osm/include/opensm/osm_base.h | 23 ++ > osm/include/opensm/osm_event_db.h | 151 ++++++++ > osm/include/opensm/osm_madw.h | 40 +++ > osm/include/opensm/osm_msgdef.h | 1 + > osm/include/opensm/osm_opensm.h | 4 + > osm/include/opensm/osm_perfmgr.h | 223 ++++++++++++ > osm/include/opensm/osm_subnet.h | 18 + > osm/opensm.spec.in | 11 +- > osm/opensm/Makefile.am | 5 +- > osm/opensm/configure.in | 3 + > osm/opensm/main.c | 19 + > osm/opensm/osm_console.c | 78 +++++ > osm/opensm/osm_event_db.c | 172 +++++++++ > osm/opensm/osm_opensm.c | 24 ++ > osm/opensm/osm_perfmgr.c | 686 +++++++++++++++++++++++++++++++++++++ > osm/opensm/osm_subnet.c | 51 +++ > osm/opensm/osm_trap_rcv.c | 15 + > 29 files changed, 2425 insertions(+), 5 deletions(-) > > diff --git a/osm/Makefile.am 
b/osm/Makefile.am > index ec66883..32f5f64 100644 > --- a/osm/Makefile.am > +++ b/osm/Makefile.am > @@ -1,6 +1,7 @@ > > # note that order matters: make the libs first then use them > -SUBDIRS = complib libvendor opensm osmtest include > +SUBDIRS = complib libvendor opensm osmtest include $(EVENTDB) > +DIST_SUBDIRS = complib libvendor opensm osmtest include eventdb > > # this will control the update of the files in order > MAINTAINERCLEANFILES = Makefile.in aclocal.m4 configure config-h.in > diff --git a/osm/config/osmvsel.m4 b/osm/config/osmvsel.m4 > index 9234f36..ce6039c 100644 > --- a/osm/config/osmvsel.m4 > +++ b/osm/config/osmvsel.m4 > @@ -180,3 +180,29 @@ if test "$disable_libcheck" != "yes"; th > fi > # --- END OPENIB_APP_OSMV_CHECK_HEADER --- > ]) dnl OPENIB_APP_OSMV_CHECK_HEADER > + > + > +AC_DEFUN([OPENIB_OSM_PERF_MGR_SEL], [ > +# --- BEGIN OPENIB_OSM_PERF_MGR_SEL --- > + > +dnl enable the perf-mgr > +AC_ARG_ENABLE(perf-mgr, > +[ --enable-perf-mgr Enable the performance manager (default no)], > + [case $enableval in > + yes) perf_mgr=yes ;; > + no) perf_mgr=no ;; > + esac], > + perf_mgr=no) > +if test $perf_mgr = yes; then > + AC_DEFINE(ENABLE_OSM_PERF_MGR, > + 1, > + [Define as 1 if you want to enable the performance manager]) > + EVENTDB=eventdb > +else > + EVENTDB= > +fi > +AC_SUBST([EVENTDB]) > + > +# --- END OPENIB_OSM_PERF_MGR_SEL --- > +]) dnl OPENIB_OSM_PERF_MGR_SEL > + > diff --git a/osm/configure.in b/osm/configure.in > index eb6552f..94d4483 100644 > --- a/osm/configure.in > +++ b/osm/configure.in > @@ -27,11 +27,14 @@ AC_ARG_ENABLE(debug, > esac],[debug=false]) > AM_CONDITIONAL(DEBUG, test x$debug = xtrue) > > +dnl select performance manager or not > +OPENIB_OSM_PERF_MGR_SEL > + > dnl Provide user option to select vendor > OPENIB_APP_OSMV_SEL > > dnl Configure the following subdirs > -AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest include) > +AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest include eventdb) > > dnl Create the following 
Makefiles > AC_OUTPUT(Makefile) > diff --git a/osm/eventdb/Makefile.am b/osm/eventdb/Makefile.am > new file mode 100644 > index 0000000..18f2db9 > --- /dev/null > +++ b/osm/eventdb/Makefile.am > @@ -0,0 +1,37 @@ > + > +INCLUDES = -I$(srcdir)/../include \ > + -I$(includedir)/infiniband > + > +lib_LTLIBRARIES = libibeventdb.la > + > +if DEBUG > +DBGFLAGS = -ggdb -D_DEBUG_ > +else > +DBGFLAGS = -g > +endif > + > +libibeventdb_la_CFLAGS = -Wall $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -Wno-deprecated-declarations > + > +if HAVE_LD_VERSION_SCRIPT > + libibeventdb_version_script = -Wl,--version-script=$(srcdir)/libibeventdb.map > +else > + libibeventdb_version_script = > +endif > + > +libibeventdb_la_SOURCES = src/ibeventdb.c > +libibeventdb_la_LDFLAGS = -version-info $(ibeventdb_api_version) \ > + -export-dynamic $(libibeventdb_version_script) > +libibeventdb_la_LIBADD = -L../complib $(OSMV_LDADD) -losmcomp > +libibeventdb_la_DEPENDENCIES = $(srcdir)/libibeventdb.map > + > +libibeventdbincludedir = $(includedir)/infiniband/complib > + > +libibeventdbinclude_HEADERS = > + > +# headers are distributed as part of the include dir > +EXTRA_DIST = $(srcdir)/libibeventdb.spec.in $(srcdir)/libibeventdb.map \ > + $(srcdir)/libibeventdb.ver > + > +dist-hook: libibeventdb.spec > + cp libibeventdb.spec $(distdir) > + > diff --git a/osm/eventdb/autogen.sh b/osm/eventdb/autogen.sh > new file mode 100755 > index 0000000..ec20fc5 > --- /dev/null > +++ b/osm/eventdb/autogen.sh > @@ -0,0 +1,15 @@ > +#! 
/bin/sh > + > +# We change dir since the later utilities assume to work in the project dir > +cd ${0%*/*} > + > +# create config dir if not exist > +test -d config || mkdir config > + > +set -x > +(aclocal -I config -I ../config 2>&1 ) && \ > +(libtoolize --force --copy) && \ > +(autoheader) && \ > +(automake --foreign --add-missing --copy) && \ > +autoconf > + > diff --git a/osm/eventdb/configure.in b/osm/eventdb/configure.in > new file mode 100644 > index 0000000..f5fa345 > --- /dev/null > +++ b/osm/eventdb/configure.in > @@ -0,0 +1,70 @@ > +dnl Process this file with autoconf to produce a configure script. > + > +AC_PREREQ(2.57) > +AC_INIT(libibeventdb, 1.0.0, openib-general at openib.org) > +AC_CONFIG_AUX_DIR(config) > +AM_CONFIG_HEADER(config.h) > +AM_INIT_AUTOMAKE > + > +dnl the library version info is available in the file: libibeventdb.ver > +ibeventdb_api_version=`grep LIBVERSION $srcdir/libibeventdb.ver | sed 's/LIBVERSION=//'` > +if test -z $ibeventdb_api_version; then > + ibeventdb_api_version=1:0:0 > +fi > +AC_SUBST(ibeventdb_api_version) > + > +dnl Checks for programs > +AC_PROG_CC > +AC_PROG_GCC_TRADITIONAL > +AC_PROG_LIBTOOL > + > +dnl Checks for libraries > +AC_CHECK_LIB(pthread, pthread_mutex_init, [], > + AC_MSG_ERROR([pthread_mutex_init() not found. libibeventdb requires libpthread.])) > + > +dnl Checks for header files. > +AC_HEADER_STDC > +AC_CHECK_HEADERS([fcntl.h stdlib.h string.h sys/ioctl.h sys/time.h syslog.h unistd.h]) > + > +dnl Checks for library functions > +AC_FUNC_MALLOC > +AC_FUNC_MEMCMP > +AC_CHECK_FUNC([time]) > +dnl AC_CHECK_FUNC([cl_plock_excl_acquire], [], > +dnl AC_MSG_ERROR([cl_plock_excl_acquire not found, libibeventdb requires libosmcomp])) > + > +dnl Checks for typedefs, structures, and compiler characteristics. 
> +AC_C_CONST > +AC_C_INLINE > +AC_TYPE_SIZE_T > +AC_HEADER_TIME > + > +dnl We use --version-script with ld if possible > +AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, > + if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then > + ac_cv_version_script=yes > + else > + ac_cv_version_script=no > + fi) > + > +AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") > + > +dnl Support debug mode build - if enable-debug provided the DEBUG variable is set > +AC_ARG_ENABLE(debug, > +[ --enable-debug Turn on debug mode], > +[case "${enableval}" in > + yes) debug=true ;; > + no) debug=false ;; > + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; > +esac],[debug=false]) > +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) > + > +# we have to revive the env CFLAGS as some how they are being overwritten... > +# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering > +# for why they should NEVER be modified by the configure to allow for user > +# overrides. 
> +CFLAGS=$ac_env_CFLAGS_value
> +
> +
> +AC_CONFIG_FILES([Makefile libibeventdb.spec])
> +AC_OUTPUT
> diff --git a/osm/eventdb/libibeventdb.map b/osm/eventdb/libibeventdb.map
> new file mode 100644
> index 0000000..ca4f78c
> --- /dev/null
> +++ b/osm/eventdb/libibeventdb.map
> @@ -0,0 +1,5 @@
> +OSMPMDB_1.0 {
> +	global:
> +		__osm_event_db;
> +	local: *;
> +};
> diff --git a/osm/eventdb/libibeventdb.spec.in b/osm/eventdb/libibeventdb.spec.in
> new file mode 100644
> index 0000000..ac66545
> --- /dev/null
> +++ b/osm/eventdb/libibeventdb.spec.in
> @@ -0,0 +1,38 @@
> +
> +%define ver @VERSION@
> +%define RELEASE 1
> +%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE}
> +
> +Summary: OpenIB InfiniBand OpenSM Component Library
> +Name: libibeventdb
> +Version: %ver
> +Release: %rel%{?dist}
> +License: GPL/BSD
> +Group: System Environment/Libraries
> +BuildRoot: %{_tmppath}/%{name}-%{version}-root
> +Source: http://openib.org/downloads/%{name}-%{version}.tar.gz
> +Url: http://openib.org/
> +Requires: opensm
> +
> +%description
> +libibeventdb provides a default plugin for the OpenSM event database
> +
> +%prep
> +%setup -q
> +
> +%build
> +%configure
> +make
> +
> +%install
> +make DESTDIR=${RPM_BUILD_ROOT} install
> +# remove unpackaged files from the buildroot
> +rm -f $RPM_BUILD_ROOT%{_libdir}/*.la
> +
> +%clean
> +rm -rf $RPM_BUILD_ROOT
> +
> +%files
> +%defattr(-,root,root)
> +%{_libdir}/libibeventdb*.so.*
> +%doc ChangeLog
> diff --git a/osm/eventdb/libibeventdb.ver b/osm/eventdb/libibeventdb.ver
> new file mode 100644
> index 0000000..7a703b7
> --- /dev/null
> +++ b/osm/eventdb/libibeventdb.ver
> @@ -0,0 +1,9 @@
> +# In this file we track the current API version
> +# of the vendor interface (and libraries)
> +# The version is built of the following
> +# three numbers:
> +# API_REV:RUNNING_REV:AGE
> +# API_REV - advance on any added API
> +# RUNNING_REV - advance on any change to the vendor files
> +# AGE - number of backward versions the API still supports
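
For reference, and assuming this triple ends up as the argument to libtool's -version-info (the Makefile.am is not part of this hunk, so that is a guess), libtool on Linux names the shared object lib<name>.so.(current - age).(age).(revision). A small sketch of that mapping in C; so_suffix() is a made-up helper name used only for illustration:

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of the patch: turn a libtool-style
 * "current:revision:age" string into the numeric suffix libtool uses
 * for the shared object name on Linux, i.e. (current - age).age.revision.
 * Returns 0 on success, -1 if the string is malformed or age > current.
 */
static int so_suffix(const char *ver, char *out, size_t len)
{
	unsigned int cur, rev, age;

	if (sscanf(ver, "%u:%u:%u", &cur, &rev, &age) != 3)
		return -1;
	if (age > cur)		/* libtool rejects age greater than current */
		return -1;
	snprintf(out, len, "%u.%u.%u", cur - age, age, rev);
	return 0;
}
```

With LIBVERSION=1:0:0 this yields the suffix 1.0.0, which would be covered by the libibeventdb*.so.* glob in the spec file.
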
> +LIBVERSION=1:0:0 > diff --git a/osm/eventdb/src/ibeventdb.c b/osm/eventdb/src/ibeventdb.c > new file mode 100644 > index 0000000..e98f85c > --- /dev/null > +++ b/osm/eventdb/src/ibeventdb.c > @@ -0,0 +1,622 @@ > +/* > + * Copyright (c) 2007 The Regents of the University of California. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > + > +#if HAVE_CONFIG_H > +# include > +#endif /* HAVE_CONFIG_H */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +/** > + * Port counter object. > + * Store all the port counters for a single port. 
> + */
> +typedef struct _osm_event_pc {
> +	struct {
> +		uint64_t symbol_err_cnt;
> +		uint64_t link_err_recover;
> +		uint64_t link_downed;
> +		uint64_t rcv_err;
> +		uint64_t rcv_rem_phys_err;
> +		uint64_t rcv_switch_relay_err;
> +		uint64_t xmit_discards;
> +		uint64_t xmit_constraint_err;
> +		uint64_t rcv_constraint_err;
> +		uint64_t link_int_err;
> +		uint64_t buffer_overrun_err;
> +		uint64_t vl15_dropped;
> +		uint64_t xmit_data;
> +		uint64_t rcv_data;
> +		uint64_t xmit_pkts;
> +		uint64_t rcv_pkts;
> +		time_t last_reset;
> +	} totals;
> +	osm_pc_reading_t previous;
> +} osm_event_pc_t;
> +
> +/**
> + * group port counters for ports into the nodes
> + */
> +typedef struct _osm_pc_node {
> +	cl_map_item_t map_item; /* must be first */
> +	uint64_t node_guid;
> +	osm_event_pc_t *ports;
> +	uint8_t num_ports;
> +} osm_pc_node_t;

Is it really necessary to keep the osm_pc_node_t nodes in a separate db
(qmap)? Why not reuse the maps that already exist in osm_subn_t (we could
add a 'void *pm_data' field, or something similar, to the osm_physp_t
structure)?

> +
> +/**
> + * all nodes in the system.
> + */ > +typedef struct _osm_pc_db { > + cl_qmap_t pc_data; /* stores type (osm_pc_node_t *) */ > + cl_plock_t lock; > + osm_log_t *osm_log; > +} osm_pc_db_t; > + > + > +/** ========================================================================= > + */ > +static void * > +db_construct(osm_log_t *osm_log) > +{ > + /* use the default */ > + osm_pc_db_t *db = malloc(sizeof(*db)); > + if (!db) { > + return (NULL); > + } > + cl_plock_construct(&(db->lock)); > + cl_plock_init(&(db->lock)); > + cl_qmap_init(&(db->pc_data)); > + db->osm_log = osm_log; > + return ((void *)db); > +} > + > +/** ========================================================================= > + */ > +static void > +db_destroy(void *_db) > +{ > + osm_pc_db_t *db = (osm_pc_db_t *)_db; > + cl_plock_excl_acquire(&(db->lock)); > + /* remove all the items in the qmap */ > + while (!cl_is_qmap_empty(&(db->pc_data))) { > + cl_map_item_t *rc = cl_qmap_head(&(db->pc_data)); > + cl_qmap_remove_item(&(db->pc_data), rc); > + } > + cl_plock_release(&(db->lock)); > + cl_plock_destroy(&(db->lock)); > + free(db); > +} > + > +/** ========================================================================= > + */ > +static osm_pc_node_t * > +malloc_node(void *_db, uint64_t guid, uint8_t num_ports) > +{ > + int i = 0; > + time_t cur_time = 0; > + osm_pc_node_t *rc = malloc(sizeof(*rc)); > + if (!rc) > + return (NULL); > + > + rc->ports = calloc(num_ports, sizeof(osm_event_pc_t)); > + if (!rc->ports) { > + goto free_rc; > + } > + rc->num_ports = num_ports; > + rc->node_guid = guid; > + > + cur_time = time(NULL); > + for (i = 0; i < num_ports; i++) { > + rc->ports[i].totals.last_reset = cur_time; > + rc->ports[i].previous.time = cur_time; > + } > + > + return (rc); > +free_rc: > + free(rc); > + return (NULL); > +} > + > +/** ========================================================================= > + */ > +static void > +free_node(osm_pc_node_t *node) > +{ > + if (!node) > + return; > + if (node->ports) > + 
free(node->ports); > + free(node); > +} > + > +/* insert nodes to the database */ > +static osm_event_db_err_t > +insert(void *_db, osm_pc_node_t *node) > +{ > + osm_pc_db_t *db = (osm_pc_db_t *)_db; > + cl_map_item_t *rc = cl_qmap_insert(&(db->pc_data), node->node_guid, (cl_map_item_t *)node); > + if ((void *)rc != (void *)node) > + return (OSM_EVENT_DB_FAIL); > + return (OSM_EVENT_DB_SUCCESS); > +} > + > +/********************************************************************** > + * Internal call db->lock should be held when calling > + **********************************************************************/ > +static inline osm_pc_node_t * > +get(void *_db, uint64_t guid) > +{ > + osm_pc_db_t *db = (osm_pc_db_t *)_db; > + cl_map_item_t *rc = cl_qmap_get(&(db->pc_data), guid); > + const cl_map_item_t *end = cl_qmap_end(&(db->pc_data)); > + if (rc == end) > + return (NULL); > + return ((osm_pc_node_t *)rc); > +} > + > +/** ========================================================================= > + */ > +static osm_event_db_err_t > +db_create_entry(void *_db, uint64_t guid, uint8_t num_ports) > +{ > + osm_pc_db_t *db = (osm_pc_db_t *)_db; > + osm_event_db_err_t rc = OSM_EVENT_DB_SUCCESS; > + cl_plock_excl_acquire(&(db->lock)); > + if (!get(db, guid)) { > + osm_pc_node_t *pc_node = malloc_node(db, guid, num_ports); > + if (!pc_node) { > + rc = OSM_EVENT_DB_NOMEM; > + goto Exit; > + } > + if (insert(db, pc_node)) { > + free_node(pc_node); > + rc = OSM_EVENT_DB_FAIL; > + goto Exit; > + } > + } > +Exit: > + cl_plock_release(&(db->lock)); > + return (rc); > +} > + > +/********************************************************************** > + **********************************************************************/ > +static osm_event_db_err_t > +db_get_prev(void *_db, uint64_t guid, > + uint8_t port, osm_pc_reading_t *reading) > +{ > + osm_pc_db_t *db = (osm_pc_db_t *)_db; > + osm_pc_node_t *node = NULL; > + cl_map_item_t *rc = NULL; > + const cl_map_item_t *end = NULL; > 
+
> +	cl_plock_acquire(&(db->lock));
> +
> +	rc = cl_qmap_get(&(db->pc_data), guid);
> +	end = cl_qmap_end(&(db->pc_data));
> +	if (rc == end) {
> +		/* drop the lock on the error paths too */
> +		cl_plock_release(&(db->lock));
> +		return (OSM_EVENT_DB_GUIDNOTFOUND);
> +	}
> +
> +	node = (osm_pc_node_t *)rc;
> +	if (port >= node->num_ports) {
> +		cl_plock_release(&(db->lock));
> +		return (OSM_EVENT_DB_PORTNOTFOUND);
> +	}
> +
> +	*reading = node->ports[port].previous;
> +
> +	cl_plock_release(&(db->lock));
> +	return (OSM_EVENT_DB_SUCCESS);
> +}
> +
> +/**********************************************************************
> + * Output a tab delimited output of the port counters
> + **********************************************************************/
> +static void
> +__dump_node_mr(osm_pc_node_t *node, FILE *fp)
> +{
> +	int i = 0;
> +
> +	fprintf(fp, "\nGUID Port\t%s\t%s\t"
> +		"%s\t%s\t%s\t%s\t%s\t%s\t%s\t"
> +		"%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
> +		"symbol_err_cnt",
> +		"link_err_recover",
> +		"link_downed",
> +		"rcv_err",
> +		"rcv_rem_phys_err",
> +		"rcv_switch_relay_err",
> +		"xmit_discards",
> +		"xmit_constraint_err",
> +		"rcv_constraint_err",
> +		"link_int_err",
> +		"buf_overrun_err",
> +		"vl15_dropped",
> +		"xmit_data",
> +		"rcv_data",
> +		"xmit_pkts",
> +		"rcv_pkts");
> +	for (i = 1; i < node->num_ports; i++)
> +	{
> +		fprintf(fp, "0x%" PRIx64 "\t%d\t%"PRIu64"\t%"PRIu64"\t"
> +			"%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t"
> +			"%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t"
> +			"%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t"
> +			"%"PRIu64"\t%"PRIu64"\t%"PRIu64"\t%"PRIu64"\n",
> +			node->node_guid,
> +			i,
> +			node->ports[i].totals.symbol_err_cnt,
> +			node->ports[i].totals.link_err_recover,
> +			node->ports[i].totals.link_downed,
> +			node->ports[i].totals.rcv_err,
> +			node->ports[i].totals.rcv_rem_phys_err,
> +			node->ports[i].totals.rcv_switch_relay_err,
> +			node->ports[i].totals.xmit_discards,
> +			node->ports[i].totals.xmit_constraint_err,
> +			node->ports[i].totals.rcv_constraint_err,
> +			node->ports[i].totals.link_int_err,
> +			node->ports[i].totals.buffer_overrun_err,
> +			node->ports[i].totals.vl15_dropped,
>
+ node->ports[i].totals.xmit_data, > + node->ports[i].totals.rcv_data, > + node->ports[i].totals.xmit_pkts, > + node->ports[i].totals.rcv_pkts > + ); > + } > +} > + > +/********************************************************************** > + * Output a human readable output of the port counters > + **********************************************************************/ > +static void > +__dump_node_hr(osm_pc_node_t *node, FILE *fp) > +{ > + int i = 0; > + > + fprintf(fp, "\n"); > + for (i = 1; i < node->num_ports; i++) > + { > + fprintf(fp, "GUID 0x%"PRIx64": Port %d:\n" > + " symbol_err_cnt: %"PRIu64"\n" > + " link_err_recover: %"PRIu64"\n" > + " link_downed: %"PRIu64"\n" > + " rcv_err: %"PRIu64"\n" > + " rcv_rem_phys_err: %"PRIu64"\n" > + " rcv_switch_relay_err: %"PRIu64"\n" > + " xmit_discards: %"PRIu64"\n" > + " xmit_constraint_err: %"PRIu64"\n" > + " rcv_constraint_err: %"PRIu64"\n" > + " link_int_err: %"PRIu64"\n" > + " buf_overrun_err: %"PRIu64"\n" > + " vl15_dropped: %"PRIu64"\n" > + " xmit_data: %"PRIu64"\n" > + " rcv_data: %"PRIu64"\n" > + " xmit_pkts: %"PRIu64"\n" > + " rcv_pkts: %"PRIu64"\n" > + , > + node->node_guid, > + i, > + node->ports[i].totals.symbol_err_cnt, > + node->ports[i].totals.link_err_recover, > + node->ports[i].totals.link_downed, > + node->ports[i].totals.rcv_err, > + node->ports[i].totals.rcv_rem_phys_err, > + node->ports[i].totals.rcv_switch_relay_err, > + node->ports[i].totals.xmit_discards, > + node->ports[i].totals.xmit_constraint_err, > + node->ports[i].totals.rcv_constraint_err, > + node->ports[i].totals.link_int_err, > + node->ports[i].totals.buffer_overrun_err, > + node->ports[i].totals.vl15_dropped, > + node->ports[i].totals.xmit_data, > + node->ports[i].totals.rcv_data, > + node->ports[i].totals.xmit_pkts, > + node->ports[i].totals.rcv_pkts > + ); > + } > +} > + > +/* Define a context for the __db_dump callback */ > +typedef struct { > + FILE *fp; > + osm_event_db_dump_t dump_type; > +} dump_context_t; > + > 
+/********************************************************************** > + **********************************************************************/ > +static void > +__db_dump(cl_map_item_t * const p_map_item, void *context ) > +{ > + osm_pc_node_t *node = (osm_pc_node_t *)p_map_item; > + dump_context_t *c = (dump_context_t *)context; > + FILE *fp = c->fp; > + > + switch (c->dump_type) > + { > + case OSM_EVENT_DB_DUMP_MR: > + __dump_node_mr(node, fp); > + break; > + case OSM_EVENT_DB_DUMP_HR: > + default: > + __dump_node_hr(node, fp); > + break; > + } > +} > + > +/********************************************************************** > + * dump the data to the file "file" > + **********************************************************************/ > +static osm_event_db_err_t > +db_dump(void *_db, char *file, osm_event_db_dump_t dump_type) > +{ > + osm_pc_db_t *db = (osm_pc_db_t *)_db; > + dump_context_t context; > + > + context.fp = fopen(file, "w+"); > + if (!context.fp) > + return (OSM_EVENT_DB_FAIL); > + context.dump_type = dump_type; > + > + cl_plock_acquire(&(db->lock)); > + cl_qmap_apply_func(&(db->pc_data), __db_dump, (void *)&context); > + cl_plock_release(&(db->lock)); > + fclose(context.fp); > + return (OSM_EVENT_DB_SUCCESS); > +} > + > +/********************************************************************** > + * call back to support the below > + **********************************************************************/ > +static void > +__clear_counters(cl_map_item_t * const p_map_item, void *context ) > +{ > + osm_pc_node_t *node = (osm_pc_node_t *)p_map_item; > + int i = 0; > + for (i = 0; i < node->num_ports; i++) { > + node->ports[i].totals.symbol_err_cnt = 0; > + node->ports[i].totals.link_err_recover = 0; > + node->ports[i].totals.link_downed = 0; > + node->ports[i].totals.rcv_err = 0; > + node->ports[i].totals.rcv_rem_phys_err = 0; > + node->ports[i].totals.rcv_switch_relay_err = 0; > + node->ports[i].totals.xmit_discards = 0; > + 
node->ports[i].totals.xmit_constraint_err = 0; > + node->ports[i].totals.rcv_constraint_err = 0; > + node->ports[i].totals.link_int_err = 0; > + node->ports[i].totals.buffer_overrun_err = 0; > + node->ports[i].totals.vl15_dropped = 0; > + node->ports[i].totals.xmit_data = 0; > + node->ports[i].totals.rcv_data = 0; > + node->ports[i].totals.xmit_pkts = 0; > + node->ports[i].totals.rcv_pkts = 0; > + node->ports[i].totals.last_reset = time(NULL); > + } > +} > + > +/********************************************************************** > + * Clear the counters from the db > + **********************************************************************/ > +static void > +db_clear_port_counters(void *_db) > +{ > + osm_pc_db_t *db = (osm_pc_db_t *)_db; > + cl_plock_excl_acquire(&(db->lock)); > + cl_qmap_apply_func(&(db->pc_data), __clear_counters, (void *)db); > + cl_plock_release(&(db->lock)); > +} > + > +#if 0 > +/********************************************************************** > + * Dump a reading vs the previous reading to stdout > + **********************************************************************/ > +static void > +dump_reading(osm_event_pc_t *port, ib_port_counters_t *cur) > +{ > + printf("sym %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->symbol_err_cnt), > + cl_ntoh16(port->previous.reading.symbol_err_cnt), port->totals.symbol_err_cnt); > + printf("ler %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->link_err_recover), > + cl_ntoh16(port->previous.reading.link_err_recover), port->totals.link_err_recover); > + printf("ld %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->link_downed), > + cl_ntoh16(port->previous.reading.link_downed), port->totals.link_downed); > + printf("re %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->rcv_err), > + cl_ntoh16(port->previous.reading.rcv_err), port->totals.rcv_err); > + printf("rrp %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->rcv_rem_phys_err), > + cl_ntoh16(port->previous.reading.rcv_rem_phys_err), port->totals.rcv_rem_phys_err); > + printf("rsr %u - %u (%" 
PRIx64 ")\n", > + cl_ntoh16(cur->rcv_switch_relay_err), > + cl_ntoh16(port->previous.reading.rcv_switch_relay_err), port->totals.rcv_switch_relay_err); > + printf("xd %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->xmit_discards), > + cl_ntoh16(port->previous.reading.xmit_discards), port->totals.xmit_discards); > + printf("xce %u - %u (%" PRIx64 ")\n", > + cl_ntoh16(cur->xmit_constraint_err), > + cl_ntoh16(port->previous.reading.xmit_constraint_err), port->totals.xmit_constraint_err); > + printf("rce %u - %u (%" PRIx64 ")\n", > + cl_ntoh16(cur->rcv_constraint_err), > + cl_ntoh16(port->previous.reading.rcv_constraint_err), port->totals.rcv_constraint_err); > + printf("li %x - %x (%" PRIx64 ")\n", > + cl_ntoh16(cur->link_int_buffer_overrun), > + cl_ntoh16(port->previous.reading.link_int_buffer_overrun), port->totals.link_int_err); > + printf("bo %x - %x (%" PRIx64 ")\n", > + cl_ntoh16(cur->link_int_buffer_overrun), > + cl_ntoh16(port->previous.reading.link_int_buffer_overrun), port->totals.buffer_overrun_err); > + printf("vld %u - %u (%" PRIx64 ")\n", cl_ntoh16(cur->vl15_dropped), > + cl_ntoh16(port->previous.reading.vl15_dropped), port->totals.vl15_dropped); > + > + printf("xd %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->xmit_data), > + cl_ntoh32(port->previous.reading.xmit_data), port->totals.xmit_data); > + printf("rd %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->rcv_data), > + cl_ntoh32(port->previous.reading.rcv_data), port->totals.rcv_data); > + printf("xp %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->xmit_pkts), > + cl_ntoh32(port->previous.reading.xmit_pkts), port->totals.xmit_pkts); > + printf("rp %u - %u (%" PRIx64 ")\n", cl_ntoh32(cur->rcv_pkts), > + cl_ntoh32(port->previous.reading.rcv_pkts), port->totals.rcv_pkts); > +} > +#endif > + > +/********************************************************************** > + * Add the reading to the osm_pc_node_t > + **********************************************************************/ > +static osm_event_db_err_t > 
+db_clear_prev_pc(void *_db, uint64_t guid, uint8_t port)
> +{
> +	osm_pc_db_t *db = (osm_pc_db_t *)_db;
> +	osm_event_pc_t *p_port = NULL;
> +	osm_pc_node_t *p_node = NULL;
> +	ib_port_counters_t *previous = NULL;
> +	osm_event_db_err_t rc = OSM_EVENT_DB_SUCCESS;
> +
> +	cl_plock_excl_acquire(&(db->lock));
> +	p_node = get(db, guid);
> +
> +	if (!p_node) {
> +		/* drop the lock on the error paths too */
> +		cl_plock_release(&(db->lock));
> +		return (OSM_EVENT_DB_GUIDNOTFOUND);
> +	}
> +
> +	if (port >= p_node->num_ports) {
> +		cl_plock_release(&(db->lock));
> +		return (OSM_EVENT_DB_PORTNOTFOUND);
> +	}
> +
> +	p_port = &(p_node->ports[port]);
> +	previous = &(p_node->ports[port].previous.reading);
> +
> +	memset(previous, 0, sizeof(*previous));
> +	p_port->previous.time = time(NULL);
> +
> +	cl_plock_release(&(db->lock));
> +	return (rc);
> +}
> +
> +/**********************************************************************
> + * Add the reading to the osm_pc_node_t
> + **********************************************************************/
> +static osm_event_db_err_t
> +db_add_reading(void *_db, uint64_t guid,
> +		uint8_t port, ib_port_counters_t *reading)
> +{
> +	osm_pc_db_t *db = (osm_pc_db_t *)_db;
> +	osm_event_pc_t *p_port = NULL;
> +	osm_pc_node_t *p_node = NULL;
> +	ib_port_counters_t *previous = NULL;
> +	osm_event_db_err_t rc = OSM_EVENT_DB_SUCCESS;
> +
> +	cl_plock_excl_acquire(&(db->lock));
> +	p_node = get(db, guid);
> +
> +	if (!p_node) {
> +		/* drop the lock on the error paths too */
> +		cl_plock_release(&(db->lock));
> +		return (OSM_EVENT_DB_GUIDNOTFOUND);
> +	}
> +
> +	if (port >= p_node->num_ports) {
> +		cl_plock_release(&(db->lock));
> +		return (OSM_EVENT_DB_PORTNOTFOUND);
> +	}
> +
> +	p_port = &(p_node->ports[port]);
> +	previous = &(p_node->ports[port].previous.reading);
> +
> +#if 0
> +	dump_reading(p_port, reading);
> +#endif
> +
> +	/* calculate changes from previous reading */
> +	p_port->totals.symbol_err_cnt
> +		+= (cl_ntoh16(reading->symbol_err_cnt)
> +		- cl_ntoh16(previous->symbol_err_cnt));
> +	p_port->totals.link_err_recover
> +		+= (reading->link_err_recover - previous->link_err_recover);
> +	p_port->totals.link_downed
> +		+= (reading->link_downed - previous->link_downed);
> +
p_port->totals.rcv_err > + += (cl_ntoh16(reading->rcv_err) > + - cl_ntoh16(previous->rcv_err)); > + p_port->totals.rcv_rem_phys_err > + += (cl_ntoh16(reading->rcv_rem_phys_err) > + - cl_ntoh16(previous->rcv_rem_phys_err)); > + p_port->totals.rcv_switch_relay_err > + += (cl_ntoh16(reading->rcv_switch_relay_err) > + - cl_ntoh16(previous->rcv_switch_relay_err)); > + p_port->totals.xmit_discards > + += (cl_ntoh16(reading->xmit_discards) > + - cl_ntoh16(previous->xmit_discards)); > + p_port->totals.xmit_constraint_err > + += (reading->xmit_constraint_err - previous->xmit_constraint_err); > + p_port->totals.rcv_constraint_err > + += (reading->rcv_constraint_err - previous->rcv_constraint_err); > + p_port->totals.link_int_err > + += PC_LINK_INT(reading->link_int_buffer_overrun) > + - PC_LINK_INT(previous->link_int_buffer_overrun); > + p_port->totals.buffer_overrun_err > + += PC_BUF_OVERRUN(reading->link_int_buffer_overrun) > + - PC_BUF_OVERRUN(previous->link_int_buffer_overrun); > + p_port->totals.vl15_dropped > + += (cl_ntoh16(reading->vl15_dropped) > + - cl_ntoh16(previous->vl15_dropped)); > + > + p_port->totals.xmit_data > + += (cl_ntoh32(reading->xmit_data) > + - cl_ntoh32(previous->xmit_data)); > + p_port->totals.rcv_data > + += (cl_ntoh32(reading->rcv_data) > + - cl_ntoh32(previous->rcv_data)); > + p_port->totals.xmit_pkts > + += (cl_ntoh32(reading->xmit_pkts) > + - cl_ntoh32(previous->xmit_pkts)); > + p_port->totals.rcv_pkts > + += (cl_ntoh32(reading->rcv_pkts) > + - cl_ntoh32(previous->rcv_pkts)); > + > + p_port->previous.reading = *reading; > + p_port->previous.time = time(NULL); > + > + cl_plock_release(&(db->lock)); > + return (rc); > +} > + > +/** ========================================================================= > + * Define the object symbol for loading > + */ > +__osm_event_db_t __osm_event_db = > +{ > +interface_version: OSM_EVENT_DB_INTERFACE_VER, > +construct : db_construct, > +destroy : db_destroy, > +create_entry : db_create_entry, > 
+get_prev_pc : db_get_prev, > +dump : db_dump, > +clear_port_counters : db_clear_port_counters, > +add_pc_reading : db_add_reading, > +clear_prev_pc : db_clear_prev_pc > +}; > + > diff --git a/osm/include/Makefile.am b/osm/include/Makefile.am > index 8499d3b..fd874c8 100644 > --- a/osm/include/Makefile.am > +++ b/osm/include/Makefile.am > @@ -87,6 +87,8 @@ EXTRA_DIST = \ > $(srcdir)/opensm/osm_drop_mgr.h \ > $(srcdir)/opensm/osm_port_info_rcv.h \ > $(srcdir)/opensm/osm_state_mgr_ctrl.h \ > + $(srcdir)/opensm/osm_perfmgr.h \ > + $(srcdir)/opensm/osm_event_db.h \ > $(srcdir)/complib/cl_thread_osd.h \ > $(srcdir)/complib/cl_packon.h \ > $(srcdir)/complib/cl_atomic_osd.h \ > diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h > index b3937cb..2a4057b 100644 > --- a/osm/include/iba/ib_types.h > +++ b/osm/include/iba/ib_types.h > @@ -7353,6 +7353,80 @@ typedef struct _ib_inform_info_record > } PACK_SUFFIX ib_inform_info_record_t; > #include > > +/****s* IBA Base: Types/ib_perfmgr_mad_t > +* NAME > +* ib_perfmgr_mad_t > +* > +* DESCRIPTION > +* IBA defined Perf Management MAD (16.3.1) > +* > +* SYNOPSIS > +*/ > +#include > +typedef struct _ib_perfmgr_mad > +{ > + ib_mad_t header; > + uint8_t resv[40]; > + > +#define IB_PM_DATA_SIZE 192 > + uint8_t data[IB_PM_DATA_SIZE]; > + > +} PACK_SUFFIX ib_perfmgr_mad_t; > +#include > +/* > +* FIELDS > +* header > +* Common MAD header. > +* > +* resv > +* Reserved. > +* > +* data > +* Performance Management payload. The structure and content of this field > +* depend upon the method, attr_id, and attr_mod fields in the header. > +* > +* SEE ALSO > +* ib_mad_t > +*********/ > + > +/****s* IBA Base: Types/ib_port_counters > +* NAME > +* ib_port_counters_t > +* > +* DESCRIPTION > +* IBA defined PortCounters Attribute. 
(16.1.3.5) > +* > +* SYNOPSIS > +*/ > +#include > +typedef struct _ib_port_counters > +{ > + uint8_t reserved; > + uint8_t port_select; > + ib_net16_t counter_select; > + ib_net16_t symbol_err_cnt; > + uint8_t link_err_recover; > + uint8_t link_downed; > + ib_net16_t rcv_err; > + ib_net16_t rcv_rem_phys_err; > + ib_net16_t rcv_switch_relay_err; > + ib_net16_t xmit_discards; > + uint8_t xmit_constraint_err; > + uint8_t rcv_constraint_err; > + uint8_t res1; > + uint8_t link_int_buffer_overrun; > + ib_net16_t res2; > + ib_net16_t vl15_dropped; > + ib_net32_t xmit_data; > + ib_net32_t rcv_data; > + ib_net32_t xmit_pkts; > + ib_net32_t rcv_pkts; > +} PACK_SUFFIX ib_port_counters_t; > +#include > + > +#define PC_LINK_INT(integ_buf_over) ((integ_buf_over & 0xF0) >> 4) > +#define PC_BUF_OVERRUN(integ_buf_over) (integ_buf_over & 0x0F) > + > /****d* IBA Base: Types/DM_SVC_NAME > * NAME > * DM_SVC_NAME > diff --git a/osm/include/opensm/osm_base.h b/osm/include/opensm/osm_base.h > index b38b511..51cef49 100644 > --- a/osm/include/opensm/osm_base.h > +++ b/osm/include/opensm/osm_base.h > @@ -448,6 +448,29 @@ BEGIN_C_DECLS > */ > #define OSM_SM_DEFAULT_QP1_SEND_SIZE 256 > > +/****d* OpenSM: Base/OSM_PM_DEFAULT_QP1_RCV_SIZE > +* NAME > +* OSM_PM_DEFAULT_QP1_RCV_SIZE > +* > +* DESCRIPTION > +* Specifies the default size (in MADs) of the QP1 receive queue > +* > +* SYNOPSIS > +*/ > +#define OSM_PM_DEFAULT_QP1_RCV_SIZE 256 > +/***********/ > + > +/****d* OpenSM: Base/OSM_PM_DEFAULT_QP1_SEND_SIZE > +* NAME > +* OSM_PM_DEFAULT_QP1_SEND_SIZE > +* > +* DESCRIPTION > +* Specifies the default size (in MADs) of the QP1 send queue > +* > +* SYNOPSIS > +*/ > +#define OSM_PM_DEFAULT_QP1_SEND_SIZE 256 > + > > /****d* OpenSM: Base/OSM_SM_DEFAULT_POLLING_TIMEOUT_MILLISECS > * NAME > diff --git a/osm/include/opensm/osm_event_db.h b/osm/include/opensm/osm_event_db.h > new file mode 100644 > index 0000000..17effaf > --- /dev/null > +++ b/osm/include/opensm/osm_event_db.h > @@ -0,0 +1,151 @@ > +/* > 
+ * Copyright (c) 2007 The Regents of the University of California. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > + > +#ifndef _OSM_EVENT_DB_H_ > +#define _OSM_EVENT_DB_H_ > + > +#include > +#include > +#include > + > +#ifdef __cplusplus > +# define BEGIN_C_DECLS extern "C" { > +# define END_C_DECLS } > +#else /* !__cplusplus */ > +# define BEGIN_C_DECLS > +# define END_C_DECLS > +#endif /* __cplusplus */ > + > +BEGIN_C_DECLS > + > +/****h* OpenSM/Event Database > +* DESCRIPTION > +* Database interface to record subnet events > +* > +* Implementations of this object _MUST_ be thread safe. 
> +* > +* AUTHOR > +* Ira Weiny, LLNL > +* > +*********/ > + > +typedef enum { > + OSM_EVENT_DB_SUCCESS = 0, > + OSM_EVENT_DB_FAIL, > + OSM_EVENT_DB_NOMEM, > + OSM_EVENT_DB_GUIDNOTFOUND, > + OSM_EVENT_DB_PORTNOTFOUND > +} osm_event_db_err_t; > + > +/** ========================================================================= > + * Port counter reading > + */ > +typedef struct { > + ib_port_counters_t reading; > + time_t time; > +} osm_pc_reading_t; > + > +/** ========================================================================= > + * Dump output options > + */ > +typedef enum { > + OSM_EVENT_DB_DUMP_HR = 0, /* Human readable */ > + OSM_EVENT_DB_DUMP_MR /* Machine readable */ > +} osm_event_db_dump_t; > + > +/** ========================================================================= > + * Plugin creators should allocate an object of this type > + * (name __osm_event_db_t) > + * The version should be set to OSM_EVENT_DB_INTERFACE_VER > + */ > +#define OSM_EVENT_DB_INTERFACE_VER (1) > +typedef struct > +{ > + int interface_version; > + void *(*construct)(osm_log_t *osm_log); > + void (*destroy)(void *db); > + osm_event_db_err_t (*create_entry)(void *db, uint64_t guid, uint8_t num_ports); > + osm_event_db_err_t (*get_prev_pc)(void *db, uint64_t guid, > + uint8_t port, osm_pc_reading_t *reading); > + osm_event_db_err_t (*dump)(void *db, char *file, osm_event_db_dump_t dump_type); > + void (*clear_port_counters)(void *db); > + osm_event_db_err_t (*add_pc_reading)(void *db, uint64_t guid, > + uint8_t port, ib_port_counters_t *reading); > + osm_event_db_err_t (*clear_prev_pc)(void *db, uint64_t guid, uint8_t port); > +} __osm_event_db_t; > + > +/** ========================================================================= > + * The database structure which should be considered opaque > + */ > +typedef struct { > + void *handle; > + __osm_event_db_t *db_impl; > + void *db_data; > + osm_log_t *p_log; > +} osm_event_db_t; > + > + > +/** > + * functions > + */ > 
+osm_event_db_t *osm_event_db_construct(osm_log_t *p_log, char *type); > +void osm_event_db_destroy(osm_event_db_t *db); > + > +osm_event_db_err_t osm_event_db_create_entry(osm_event_db_t *db, uint64_t guid, > + uint8_t num_ports); > +osm_event_db_err_t osm_event_db_get_prev_pc(osm_event_db_t *db, > + uint64_t guid, uint8_t port, > + osm_pc_reading_t *reading); > +osm_event_db_err_t osm_event_db_dump(osm_event_db_t *db, char *file, > + osm_event_db_dump_t dump_type); > +osm_event_db_err_t osm_event_db_add_pc_reading(osm_event_db_t *db, uint64_t guid, > + uint8_t port, ib_port_counters_t *reading); > +void osm_event_db_clear_port_counters(osm_event_db_t *db); > +osm_event_db_err_t osm_event_db_clear_prev_pc(osm_event_db_t *db, uint64_t guid, > + uint8_t port); > + > +#if 0 > +/* work out the tracking of notice (trap) events. */ > + > +typedef struct { > + ib_mad_notice_attr_t reading; > + time_t time; > +} osm_notice_reading_t; > +osm_event_db_err_t osm_event_db_add_notice_reading(osm_event_db_t *db, uint64_t guid, > + uint8_t port, ib_mad_notice_attr_t *reading); > +#endif > + > +END_C_DECLS > + > +#endif /* _OSM_PM_DB_H_ */ > + > diff --git a/osm/include/opensm/osm_madw.h b/osm/include/opensm/osm_madw.h > index 95be0f4..80258f4 100644 > --- a/osm/include/opensm/osm_madw.h > +++ b/osm/include/opensm/osm_madw.h > @@ -315,6 +315,19 @@ typedef struct _osm_vla_context > } osm_vla_context_t; > /*********/ > > +/****s* OpenSM: MAD Wrapper/osm_perfmgr_context_t > +* DESCRIPTION > +* Context for Performance manager queries > +*/ > +typedef struct _osm_perfmgr_context { > + uint64_t node_guid; > + uint16_t port; > + uint8_t num_ports; > + uint8_t mad_method; /* was this a get or a set */ > + struct timeval query_start; > +} osm_perfmgr_context_t; > +/*********/ > + > #ifndef OSM_VENDOR_INTF_OPENIB > /****s* OpenSM: MAD Wrapper/osm_arbitrary_context_t > * NAME > @@ -354,6 +367,7 @@ typedef union _osm_madw_context > osm_slvl_context_t slvl_context; > osm_pkey_context_t 
pkey_context; > osm_vla_context_t vla_context; > + osm_perfmgr_context_t perfmgr_context; > #ifndef OSM_VENDOR_INTF_OPENIB > osm_arbitrary_context_t arb_context; > #endif > @@ -639,6 +653,32 @@ osm_madw_get_sa_mad_ptr( > * MAD Wrapper object, osm_madw_construct, osm_madw_destroy > *********/ > > +/****f* OpenSM: MAD Wrapper/osm_madw_get_perfmgr_mad_ptr > +* DESCRIPTION > +* Gets a pointer to the PerfMgr MAD in this MAD wrapper. > +* > +* SYNOPSIS > +*/ > +static inline ib_perfmgr_mad_t* > +osm_madw_get_perfmgr_mad_ptr( > + IN const osm_madw_t* const p_madw ) > +{ > + return((ib_perfmgr_mad_t*)p_madw->p_mad); > +} > +/* > +* PARAMETERS > +* p_madw > +* [in] Pointer to an osm_madw_t object. > +* > +* RETURN VALUES > +* Pointer to the start of the PM MAD. > +* > +* NOTES > +* > +* SEE ALSO > +* MAD Wrapper object, osm_madw_construct, osm_madw_destroy > +*********/ > + > /****f* OpenSM: MAD Wrapper/osm_madw_get_ni_context_ptr > * NAME > * osm_madw_get_ni_context_ptr > diff --git a/osm/include/opensm/osm_msgdef.h b/osm/include/opensm/osm_msgdef.h > index a90e3b9..6732992 100644 > --- a/osm/include/opensm/osm_msgdef.h > +++ b/osm/include/opensm/osm_msgdef.h > @@ -186,6 +186,7 @@ enum > #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) > OSM_MSG_MAD_MULTIPATH_RECORD, > #endif > + OSM_MSG_MAD_PORT_COUNTERS, > OSM_MSG_MAX > }; > > diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h > index 482de28..bdaa8f3 100644 > --- a/osm/include/opensm/osm_opensm.h > +++ b/osm/include/opensm/osm_opensm.h > @@ -57,6 +57,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -157,6 +158,9 @@ typedef struct _osm_opensm_t > osm_subn_t subn; > osm_sm_t sm; > osm_sa_t sa; > +#ifdef ENABLE_OSM_PERF_MGR > + osm_perfmgr_t perfmgr; > +#endif /* ENABLE_OSM_PERF_MGR */ > osm_db_t db; > osm_mad_pool_t mad_pool; > osm_vendor_t *p_vendor; > diff --git a/osm/include/opensm/osm_perfmgr.h b/osm/include/opensm/osm_perfmgr.h > new 
file mode 100644 > index 0000000..6138ec3 > --- /dev/null > +++ b/osm/include/opensm/osm_perfmgr.h > @@ -0,0 +1,223 @@ > +/* > + * Copyright (c) 2007 The Regents of the University of California. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + *
> + */
> +
> +#ifndef _OSM_PERFMGR_H_
> +#define _OSM_PERFMGR_H_
> +
> +#if HAVE_CONFIG_H
> +# include
> +#endif /* HAVE_CONFIG_H */
> +
> +#ifdef ENABLE_OSM_PERF_MGR
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif /* __cplusplus */
> +
> +/****h* OpenSM/PERFMGR
> +* NAME
> +* PERFMGR
> +*
> +* DESCRIPTION
> +* Performance manager thread which takes care of polling the fabric for
> +* Port counters values.
> +*
> +* The PERFMGR object is thread safe.
> +*
> +* AUTHOR
> +* Ira Weiny, LLNL
> +*
> +*********/
> +
> +#define OSM_PERFMGR_DEFAULT_SWEEP_TIME_S 180
> +#define OSM_PERFMGR_DEFAULT_DUMP_FILE OSM_DEFAULT_TMP_DIR "/osm_port_counters.log"
> +#define OSM_DEFAULT_EVENT_PLUGIN "ibeventdb"
> +
> +/****s* OpenSM: PERFMGR/osm_perfmgr_state_t */
> +typedef enum
> +{
> + PERFMGR_STATE_DISABLE,
> + PERFMGR_STATE_ENABLED,
> + PERFMGR_STATE_NO_DB

Why is PERFMGR_STATE_NO_DB needed? Isn't it duplicated by (pm->db == NULL)?
As a side effect of this duplication, when the DB was not found I can still
enable perfmgr with the console command, but it obviously crashes during a
following 'dump'.

> +} osm_perfmgr_state_t;
> +
> +/****s* OpenSM: PERFMGR/osm_perfmgr_t
> +* This object should be treated as opaque and should
> +* be manipulated only through the provided functions.
> +*/
> +typedef struct _osm_perfmgr
> +{
> + osm_thread_state_t thread_state;
> + cl_event_t sig_sweep;
> + cl_thread_t sweeper;
> + osm_subn_t *subn;
> + osm_sm_t *sm;
> + cl_plock_t *lock;
> + osm_log_t *log;
> + osm_mad_pool_t *mad_pool;
> + atomic32_t trans_id;

Do we need a separate transaction ID generator for PerfMgr?
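On the PERFMGR_STATE_NO_DB question above, a minimal sketch of one possible alternative: derive the "no database" condition from the db pointer and refuse to enable the manager without one. All names here (sketch_state_t, sketch_perfmgr_t, sketch_set_state, sketch_state_str) are made up for illustration and are not the patch's types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the patch's types. */
typedef enum {
	SKETCH_STATE_DISABLE,
	SKETCH_STATE_ENABLED
} sketch_state_t;

typedef struct {
	sketch_state_t state;
	void *db;	/* NULL when no event database plugin was loaded */
} sketch_perfmgr_t;

/* Refuse to enable the manager when there is no database behind it.
 * Returns 0 on success, -1 if the request was rejected. */
static int sketch_set_state(sketch_perfmgr_t *pm, sketch_state_t state)
{
	assert(pm != NULL);
	if (state == SKETCH_STATE_ENABLED && pm->db == NULL)
		return -1;
	pm->state = state;
	return 0;
}

/* "No Database" is reported from the pointer, not from a third state. */
static const char *sketch_state_str(const sketch_perfmgr_t *pm)
{
	assert(pm != NULL);
	if (pm->db == NULL)
		return "No Database";
	return pm->state == SKETCH_STATE_ENABLED ? "Enabled" : "Disabled";
}
```

With something like this, a console 'enable' issued while the plugin is missing would be rejected instead of leading to a crash on a later 'dump'.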
> + osm_vendor_t *vendor; > + osm_bind_handle_t bind_handle; > + cl_disp_reg_handle_t pc_disp_h; > + osm_perfmgr_state_t state; > + uint16_t sweep_time_s; > + char *db_file; > + char *event_db_dump_file; > + char *event_db_plugin; > + osm_event_db_t *db; > +} osm_perfmgr_t; > +/* > +* FIELDS > +* subn > +* Subnet object for this subnet. > +* > +* log > +* Pointer to the log object. > +* > +* mad_pool > +* Pointer to the MAD pool. > +* > +* event_db_dump_file > +* File to be used to dump the Port Counters > +* > +* mad_ctrl > +* Mad Controller > +*********/ > + > +/****f* OpenSM: Creation Functions */ > +void osm_perfmgr_shutdown(osm_perfmgr_t *const p_perfmgr ); > +void osm_perfmgr_destroy(osm_perfmgr_t * const p_perfmgr ); > + > +/****f* OpenSM: Inline accessor functions */ > +inline static void osm_perfmgr_set_state(osm_perfmgr_t *p_perfmgr, > + osm_perfmgr_state_t state) > +{ > + p_perfmgr->state = state; > +} > +inline static osm_perfmgr_state_t osm_perfmgr_get_state(osm_perfmgr_t > + *p_perfmgr) { return (p_perfmgr->state); } > +inline static char *osm_perfmgr_get_state_str(osm_perfmgr_t *p_perfmgr) > +{ > + switch (p_perfmgr->state) > + { > + case PERFMGR_STATE_DISABLE: return ("Disabled"); break; > + case PERFMGR_STATE_ENABLED: return ("Enabled"); break; > + case PERFMGR_STATE_NO_DB: return ("No Database"); break; > + } > + return ("UNKNOWN"); > +} > +inline static void osm_perfmgr_set_sweep_time_s(osm_perfmgr_t *p_perfmgr, uint16_t time_s) > +{ > + p_perfmgr->sweep_time_s = time_s; > + cl_event_signal(&p_perfmgr->sig_sweep); > +} > +inline static uint16_t osm_perfmgr_get_sweep_time_s(osm_perfmgr_t *p_perfmgr) > +{ > + return (p_perfmgr->sweep_time_s); > +} > +void osm_perfmgr_clear_counters(osm_perfmgr_t *p_perfmgr); > +void osm_perfmgr_dump_counters(osm_perfmgr_t *p_perfmgr, > + osm_event_db_dump_t dump_type); > + > +ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * const p_perfmgr, const ib_net64_t port_guid); > + > +#if 0 > +/* Work out the tracking of 
notice events */
> +ib_api_status_t osm_report_notice_to_perfmgr(osm_log_t *const p_log, osm_subn_t *p_subn,
> + ib_mad_notice_attr_t *p_ntc )
> +#endif
> +
> +/****f* OpenSM: PERFMGR/osm_perfmgr_init */
> +ib_api_status_t
> +osm_perfmgr_init(
> + osm_perfmgr_t* const perfmgr,
> + osm_subn_t* const subn,
> + osm_sm_t * const sm,
> + osm_log_t* const log,
> + osm_mad_pool_t * const mad_pool,
> + osm_vendor_t * const vendor,
> + cl_dispatcher_t* const disp,
> + cl_plock_t* const lock,
> + const osm_subn_opt_t * const p_opt );

The indentation is not consistent (the tab character is preferred) here and
in other places; there is also a lot of trailing whitespace in the patch.
You can run 'git-diff --color' to see the formatting issues.

> +/*
> +* PARAMETERS
> +* perfmgr
> +* [in] Pointer to an osm_perfmgr_t object to initialize.
> +*
> +* subn
> +* [in] Pointer to the Subnet object for this subnet.
> +*
> +* sm
> +* [in] Pointer to the Subnet object for this subnet.
> +*
> +* log
> +* [in] Pointer to the log object.
> +*
> +* mad_pool
> +* [in] Pointer to the MAD pool.
> +*
> +* vendor
> +* [in] Pointer to the vendor specific interfaces object.
> +*
> +* disp
> +* [in] Pointer to the OpenSM central Dispatcher.
> +*
> +* lock
> +* [in] Pointer to the OpenSM serializing lock.
> +*
> +* p_opt
> +* [in] Starting options
> +*
> +* RETURN VALUES
> +* IB_SUCCESS if the PERFMGR object was initialized successfully.
> +*********/ > + > +#ifdef __cplusplus > +} > +#endif /* __cplusplus */ > + > +#endif /* ENABLE_OSM_PERF_MGR */ > + > +#endif /* _OSM_PERFMGR_H_ */ > + > diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h > index fc52b5e..0fdc18b 100644 > --- a/osm/include/opensm/osm_subnet.h > +++ b/osm/include/opensm/osm_subnet.h > @@ -291,6 +291,12 @@ typedef struct _osm_subn_opt > osm_qos_options_t qos_rtr_options; > boolean_t enable_quirks; > boolean_t no_clients_rereg; > +#ifdef ENABLE_OSM_PERF_MGR > + boolean_t perfmgr; > + uint16_t perfmgr_sweep_time_s; > + char * event_db_dump_file; > + char * event_db_plugin; > +#endif /* ENABLE_OSM_PERF_MGR */ > } osm_subn_opt_t; > /* > * FIELDS > @@ -468,6 +474,18 @@ typedef struct _osm_subn_opt > * sm_inactive > * OpenSM will start with SM in not active state. > * > +* perfmgr > +* Enable or disable the performance manager > +* > +* perfmgr_sweep_time_s > +* Define the period of PM sweep (in seconds). > +* > +* event_db_dump_file > +* File to dump the event database to > +* > +* event_db_plugin > +* specify the name of the event plugin > +* > * qos_options > * Default set of QoS options > * > diff --git a/osm/opensm.spec.in b/osm/opensm.spec.in > index c4e1798..8857a7b 100644 > --- a/osm/opensm.spec.in > +++ b/osm/opensm.spec.in > @@ -38,10 +38,19 @@ Static libraries and header files for Op > %define _disable_console_socket --disable-console-socket > %endif > > +%if %{?_with_perf_mgr:1}%{!?_with_perf_mgr:0} > +%define _enable_perf_mgr --enable-perf-mgr > +%endif > +%if %{?_without_perf_mgr:1}%{!?_without_perf_mgr:0} > +%define _disable_perf_mgr --disable-perf-mgr > +%endif > + > %build > %configure \ > %{?_enable_console_socket} \ > - %{?_disable_console_socket} > + %{?_disable_console_socket} \ > + %{?_enable_perf_mgr} \ > + %{?_disable_perf_mgr} > make %{?_smp_mflags} > > %install > diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am > index e2520b8..9a1f6f4 100644 > --- a/osm/opensm/Makefile.am 
> +++ b/osm/opensm/Makefile.am > @@ -55,7 +55,8 @@ opensm_SOURCES = main.c osm_console.c os > osm_trap_rcv.c osm_ucast_mgr.c osm_ucast_updn.c \ > osm_ucast_lash.c osm_ucast_file.c osm_ucast_ftree.c \ > osm_vl15intf.c osm_vl_arb_rcv.c \ > - st.c > + st.c \ > + osm_perfmgr.c osm_event_db.c > if OSMV_OPENIB > opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > @@ -78,7 +79,7 @@ endif > # we always give precedence to local tree libs and then use the pre-installed ones. > opensm_LDADD = -L../complib -L../libvendor -L. $(OSMV_LDADD) -lopensm -losmcomp -losmvendor > > -opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread > +opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread -ldl > > opensmincludedir = $(includedir)/infiniband/opensm > > diff --git a/osm/opensm/configure.in b/osm/opensm/configure.in > index ad3333a..9e23719 100644 > --- a/osm/opensm/configure.in > +++ b/osm/opensm/configure.in > @@ -78,6 +78,9 @@ if test $console_socket = yes; then > [Define as 1 if you want to enable a console on a socket connection]) > fi > > +dnl select performance manager or not > +OPENIB_OSM_PERF_MGR_SEL > + > dnl Provide user option to select vendor > OPENIB_APP_OSMV_SEL > > diff --git a/osm/opensm/main.c b/osm/opensm/main.c > index 153e44d..4fa3563 100644 > --- a/osm/opensm/main.c > +++ b/osm/opensm/main.c > @@ -59,6 +59,7 @@ > #include > #include > #include > +#include > > volatile unsigned int osm_exit_flag = 0; > > @@ -273,6 +274,13 @@ show_usage(void) > printf("-I\n" > "--inactive\n" > " Start SM in inactive rather than normal init SM state.\n\n"); > +#ifdef ENABLE_OSM_PERF_MGR > + printf( "--pm\n" > + " Activate the performance manager.\n\n"); > + printf( "--pm_sweep_time_s\n" > + " Define the period for PerfMgr sweeps (in seconds) default %ds.\n\n", > + 
OSM_PERFMGR_DEFAULT_SWEEP_TIME_S);
> +#endif /* ENABLE_OSM_PERF_MGR */
> printf( "-v\n"
> "--verbose\n"
> " This option increases the log verbosity level.\n"
> @@ -630,6 +638,8 @@ main(
> #endif
> { "daemon", 0, NULL, 'B'},
> { "inactive", 0, NULL, 'I'},
> + { "pm", 0, NULL, 1}, /* no short options for PM stuff */
> + { "pm_sweep_time_s", 1, NULL, 2},
> { NULL, 0, NULL, 0 } /* Required at the end of the array */
> };
>
> @@ -907,6 +917,15 @@ main(
> printf(" SM started in inactive state\n");
> break;
>
> +#ifdef ENABLE_OSM_PERF_MGR
> + case 1:
> + opt.perfmgr = TRUE;
> + break;
> + case 2:
> + opt.perfmgr_sweep_time_s = atoi(optarg);

In case of user error we can get opt.perfmgr_sweep_time_s = 0 (or another
strange value). I think at least minimal validation is needed here.

> + break;
> +#endif /* ENABLE_OSM_PERF_MGR */
> +
> case 'h':
> case '?':
> case ':':
> diff --git a/osm/opensm/osm_console.c b/osm/opensm/osm_console.c
> index 38b978a..d6c30d8 100644
> --- a/osm/opensm/osm_console.c
> +++ b/osm/opensm/osm_console.c
> @@ -52,6 +52,7 @@
> #include
> #include
> #include
> +#include
>
> struct command {
> char *name;
> @@ -136,6 +137,20 @@ static void help_logflush(FILE *out, int
> fprintf(out, "logflush -- flush the osm.log file\n");
> }
>
> +#ifdef ENABLE_OSM_PERF_MGR
> +static void help_perfmgr(FILE *out, int detail)
> +{
> + fprintf(out, "perfmgr [enable|disable|clear_counters|dump_counters|sweep_time][seconds]\n");
> + if (detail) {
> + fprintf(out, "perfmgr -- print the performance manager state\n");
> + fprintf(out, " [enable|disable] -- change the perfmgr state\n");
> + fprintf(out, " [sweep_time] -- change the perfmgr sweep time (requires [seconds] option)\n");
> + fprintf(out, " [clear_counters] -- clear the counters stored\n");
> + fprintf(out, " [dump_counters [mach]] -- dump the counters\n");
> + }
> +}
> +#endif /* ENABLE_OSM_PERF_MGR */
> +
> /* more help routines go here */
>
> static void help_parse(char **p_last, osm_opensm_t *p_osm, FILE *out)
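For the sweep-time validation raised above for the --pm_sweep_time_s option, a minimal sketch of what such a check could look like, using strtoul() instead of atoi() so that bad input can be detected. parse_sweep_time_s is a hypothetical helper, not existing OpenSM code, and rejecting zero is an assumption about the desired policy.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a sweep time given in seconds.
 * Returns 0 and stores the value in *out on success, -1 on bad input. */
static int parse_sweep_time_s(const char *arg, uint16_t *out)
{
	char *end = NULL;
	unsigned long val;

	assert(out != NULL);

	/* Require a leading digit: rejects NULL, empty strings, signs
	 * and leading whitespace, which strtoul() would otherwise accept. */
	if (arg == NULL || arg[0] < '0' || arg[0] > '9')
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 10);
	if (errno != 0 || *end != '\0')
		return -1;	/* overflow or trailing garbage */
	if (val == 0 || val > UINT16_MAX)
		return -1;	/* zero, or does not fit in uint16_t */

	*out = (uint16_t)val;
	return 0;
}
```

The caller would keep the default sweep time and print an error when the helper returns -1.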
> @@ -427,6 +442,66 @@ static void logflush_parse(char **p_last > fflush(p_osm->log.out_port); > } > > +#ifdef ENABLE_OSM_PERF_MGR > +static void perfmgr_parse(char **p_last, osm_opensm_t *p_osm, FILE *out) > +{ > + char *p_cmd; > + > + p_cmd = next_token(p_last); > + if (p_cmd) > + { > + if (strcmp(p_cmd, "enable") == 0) > + { > + osm_perfmgr_set_state(&(p_osm->perfmgr), PERFMGR_STATE_ENABLED); > + } > + else if (strcmp(p_cmd, "disable") == 0) > + { > + osm_perfmgr_set_state(&(p_osm->perfmgr), PERFMGR_STATE_DISABLE); > + } > + else if (strcmp(p_cmd, "clear_counters") == 0) > + { > + osm_perfmgr_clear_counters(&(p_osm->perfmgr)); > + } > + else if (strcmp(p_cmd, "dump_counters") == 0) > + { > + p_cmd = next_token(p_last); > + if (p_cmd && (strcmp(p_cmd, "mach") == 0)) { > + osm_perfmgr_dump_counters(&(p_osm->perfmgr), > + OSM_EVENT_DB_DUMP_MR); > + } else { > + osm_perfmgr_dump_counters(&(p_osm->perfmgr), > + OSM_EVENT_DB_DUMP_HR); > + } > + } > + else if (strcmp(p_cmd, "sweep_time") == 0) > + { > + p_cmd = next_token(p_last); > + if (p_cmd) > + { > + uint16_t time_s = atoi(p_cmd); > + osm_perfmgr_set_sweep_time_s(&(p_osm->perfmgr), time_s); > + } > + else > + { > + fprintf(out, "sweep_time requires a time specified\n"); > + } > + } > + else > + { > + fprintf(out, "\"%s\" option not found\n", p_cmd); > + } > + } else { > + fprintf(out, "Performance Manager status:\n" > + "state : %s\n" > + "sweep time : %us\n" > + , > + osm_perfmgr_get_state_str(&(p_osm->perfmgr)), > + osm_perfmgr_get_sweep_time_s(&(p_osm->perfmgr)) > + ); > + } > +} > +#endif /* ENABLE_OSM_PERF_MGR */ > + > /* This is public to be able to close it on exit */ > void osm_console_close_socket(osm_opensm_t *p_osm) > { > @@ -456,6 +531,9 @@ static const struct command console_cmds > { "resweep", &help_resweep, &resweep_parse}, > { "status", &help_status, &status_parse}, > { "logflush", &help_logflush, &logflush_parse}, > +#ifdef ENABLE_OSM_PERF_MGR > + { "perfmgr", &help_perfmgr, &perfmgr_parse}, > 
+#endif /* ENABLE_OSM_PERF_MGR */ > { NULL, NULL, NULL} /* end of array */ > }; > > diff --git a/osm/opensm/osm_event_db.c b/osm/opensm/osm_event_db.c > new file mode 100644 > index 0000000..90ca8da > --- /dev/null > +++ b/osm/opensm/osm_event_db.c > @@ -0,0 +1,172 @@ > +/* > + * Copyright (c) 2007 The Regents of the University of California. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + * > + */ > + > + > +#if HAVE_CONFIG_H > +# include > +#endif /* HAVE_CONFIG_H */ > + > +#include > +#include > +#include > +#include > +#include > + > +#include > + > +/** ========================================================================= > + */ > +osm_event_db_t * > +osm_event_db_construct(osm_log_t *p_log, char *type) > +{ > + char lib_name[PATH_MAX]; > + osm_event_db_t *rc = NULL; > + > + if (!type) > + return (NULL); > + > + /* find the plugin */ > + snprintf(lib_name, PATH_MAX, "lib%s.so", type); > + > + rc = malloc(sizeof(*rc)); > + if (!rc) > + return (NULL); > + > + rc->handle = dlopen(lib_name, RTLD_LAZY); > + if (!rc->handle) > + { > + osm_log(p_log, OSM_LOG_ERROR, > + "Failed to open PM Database \"%s\" : \"%s\"\n", > + lib_name, dlerror()); > + goto DLOPENFAIL; > + } > + > + rc->db_impl = (__osm_event_db_t *)dlsym(rc->handle, "__osm_event_db"); > + if (!rc->db_impl) > + { > + osm_log(p_log, OSM_LOG_ERROR, > + "Failed to find __osm_event_db symbol in \"%s\" : \"%s\"\n", > + lib_name, dlerror()); > + goto Exit; > + } > + > + /* Check the version to make sure this module will work with us */ > + if (rc->db_impl->interface_version != OSM_EVENT_DB_INTERFACE_VER) > + { > + osm_log(p_log, OSM_LOG_ERROR, > + "__osm_event_db symbol is the wrong version %d != %d\n", > + rc->db_impl->interface_version, > + OSM_EVENT_DB_INTERFACE_VER); > + goto Exit; > + } > + > + rc->db_data = rc->db_impl->construct(p_log); > + > + if (!rc->db_data) > + goto Exit; > + > + rc->p_log = p_log; > + return (rc); > + > +Exit: > + dlclose(rc->handle); > +DLOPENFAIL: > + free(rc); > + return (NULL); > +} > + > +/** ========================================================================= > + */ > +void > +osm_event_db_destroy(osm_event_db_t *db) > +{ > + if (db) > + { > + db->db_impl->destroy(db->db_data); > + free(db); > + } > +} > + > +/** ========================================================================= > + */ > +osm_event_db_err_t > 
+osm_event_db_create_entry(osm_event_db_t *db, uint64_t guid, uint8_t num_ports) > +{ > + return(db->db_impl->create_entry(db->db_data, guid, num_ports)); > +} > + > +/********************************************************************** > + **********************************************************************/ > +osm_event_db_err_t osm_event_db_get_prev_pc(osm_event_db_t *db, uint64_t guid, > + uint8_t port, osm_pc_reading_t *reading) > +{ > + return (db->db_impl->get_prev_pc(db->db_data, guid, port, reading)); > +} > + > +/********************************************************************** > + * dump the data to the file "file" > + **********************************************************************/ > +osm_event_db_err_t > +osm_event_db_dump(osm_event_db_t *db, char *file, osm_event_db_dump_t dump_type) > +{ > + return (db->db_impl->dump(db->db_data, file, dump_type)); > +} > + > +/********************************************************************** > + * Clear the port counters from the db > + **********************************************************************/ > +void osm_event_db_clear_port_counters(osm_event_db_t *db) > +{ > + db->db_impl->clear_port_counters(db->db_data); > +} > + > +/********************************************************************** > + * Add the reading to the osm_pm_node_t > + **********************************************************************/ > +osm_event_db_err_t > +osm_event_db_add_pc_reading(osm_event_db_t *db, uint64_t guid, > + uint8_t port, ib_port_counters_t *reading) > +{ > + return (db->db_impl->add_pc_reading(db->db_data, guid, > + port, reading)); > +} > + > +/********************************************************************** > + * Add the reading to the osm_pm_node_t > + **********************************************************************/ > +osm_event_db_err_t > +osm_event_db_clear_prev_pc(osm_event_db_t *db, uint64_t guid, uint8_t port) > +{ > + return (db->db_impl->clear_prev_pc(db->db_data, 
guid, port)); > +} > + > diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c > index 8430605..fa572c5 100644 > --- a/osm/opensm/osm_opensm.c > +++ b/osm/opensm/osm_opensm.c > @@ -172,6 +172,9 @@ osm_opensm_destroy( > p_osm->routing_engine.delete(p_osm->routing_engine.context); > osm_sa_destroy( &p_osm->sa ); > osm_sm_destroy( &p_osm->sm ); > +#ifdef ENABLE_OSM_PERF_MGR > + osm_perfmgr_destroy( &p_osm->perfmgr ); > +#endif /* ENABLE_OSM_PERF_MGR */ > osm_db_destroy( &p_osm->db ); > osm_vl15_destroy( &p_osm->vl15, &p_osm->mad_pool ); > osm_mad_pool_destroy( &p_osm->mad_pool ); > @@ -286,6 +289,21 @@ osm_opensm_init( > if( status != IB_SUCCESS ) > goto Exit; > > +#ifdef ENABLE_OSM_PERF_MGR > + status = osm_perfmgr_init( &p_osm->perfmgr, > + &p_osm->subn, > + &p_osm->sm, > + &p_osm->log, > + &p_osm->mad_pool, > + p_osm->p_vendor, > + &p_osm->disp, > + &p_osm->lock, > + p_opt); > + > + if( status != IB_SUCCESS ) > + goto Exit; > +#endif /* ENABLE_OSM_PERF_MGR */ > + > if( p_opt->routing_engine_name && > setup_routing_engine(p_osm, p_opt->routing_engine_name)) { > osm_log( &p_osm->log, OSM_LOG_VERBOSE, > @@ -319,6 +337,12 @@ osm_opensm_bind( > if( status != IB_SUCCESS ) > goto Exit; > > +#ifdef ENABLE_OSM_PERF_MGR > + status = osm_perfmgr_bind( &p_osm->perfmgr, guid ); > + if( status != IB_SUCCESS ) > + goto Exit; > +#endif /* ENABLE_OSM_PERF_MGR */ > + > Exit: > OSM_LOG_EXIT( &p_osm->log ); > return ( status ); > diff --git a/osm/opensm/osm_perfmgr.c b/osm/opensm/osm_perfmgr.c > new file mode 100644 > index 0000000..297a0e2 > --- /dev/null > +++ b/osm/opensm/osm_perfmgr.c > @@ -0,0 +1,686 @@ > +/* > + * Copyright (c) 2007 The Regents of the University of California. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > + > + > +/* > + * Abstract: > + * Implementation of osm_perfmgr_t. > + * > + * Author: > + * Ira Weiny, LLNL > + */ > + > +#if HAVE_CONFIG_H > +# include > +#endif /* HAVE_CONFIG_H */ > + > +#ifdef ENABLE_OSM_PERF_MGR > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define OSM_PERFMGR_INITIAL_TID_VALUE 0xcafe > + > +/********************************************************************** > + * Recieve the MAD from the vendor layer and post it for processing by the > + * dispatcher. 
> + **********************************************************************/
> +static void
> +osm_perfmgr_mad_recv_callback(osm_madw_t *p_madw, void* bind_context,
> + osm_madw_t *p_req_madw )
> +{
> + osm_perfmgr_t *pm = (osm_perfmgr_t *)bind_context;
> + cl_status_t cl_status = CL_SUCCESS;
> +
> + OSM_LOG_ENTER( pm->log, osm_pm_mad_recv_callback );
                            ^^^^^^^^^^^^^^^^^^^^^^^^
I guess this should be 'osm_perfmgr_mad_recv_callback'.

> +
> + osm_madw_copy_context( p_madw, p_req_madw );
> + osm_mad_pool_put( pm->mad_pool, p_req_madw );
> +
> + /* post this message for later processing. */
> + cl_status = cl_disp_post(pm->pc_disp_h, OSM_MSG_MAD_PORT_COUNTERS,
> + (void *)p_madw, NULL, NULL);
> +#if 0
> + do {
> + struct timeval rcv_time;
> + gettimeofday(&rcv_time, NULL);
> + osm_log(pm->log, OSM_LOG_INFO,
> + "perfmgr rcv time %ld\n",
> + rcv_time.tv_usec -
> + p_madw->context.perfmgr_context.query_start.tv_usec);
> + } while (0);
> +#endif
> + OSM_LOG_EXIT( pm->log );
> +}
> +
> +/**********************************************************************
> + * Process errors from the MAD send.
> + **********************************************************************/
> +static void
> +osm_perfmgr_mad_send_err_callback(void* bind_context, osm_madw_t *p_madw)
> +{
> + osm_perfmgr_t *pm = (osm_perfmgr_t *)bind_context;
> + osm_madw_context_t *context = &(p_madw->context);
> +
> + OSM_LOG_ENTER( pm->log, osm_pm_mad_send_err_callback );
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Ditto (the same applies to the other perfmgr functions).

> +
> + osm_log( pm->log, OSM_LOG_ERROR,
> + "osm_pm_mad_send_err_callback: 0x%" PRIx64 " port %d\n",
> + context->perfmgr_context.node_guid,
> + context->perfmgr_context.port);
> +
> + osm_mad_pool_put( pm->mad_pool, p_madw );
> +
> + OSM_LOG_EXIT( pm->log );
> +}
> +
> +/**********************************************************************
> + * Bind the PM to the vendor layer for MAD sends/receives
> + **********************************************************************/
> +ib_api_status_t
> +osm_perfmgr_bind(osm_perfmgr_t * const pm, const ib_net64_t port_guid)
> +{
> + osm_bind_info_t bind_info;
> + ib_api_status_t status = IB_SUCCESS;
> +
> + OSM_LOG_ENTER( pm->log, osm_pm_bind );
> +
> + if( pm->bind_handle != OSM_BIND_INVALID_HANDLE ) {
> + osm_log( pm->log, OSM_LOG_ERROR,
> + "osm_pm_mad_ctrl_bind: Multiple binds not allowed\n" );
> + status = IB_ERROR;
> + goto Exit;
> + }
> +
> + bind_info.port_guid = port_guid;
> + bind_info.mad_class = IB_MCLASS_PERF;
> + bind_info.class_version = 1;
> + bind_info.is_responder = FALSE;
> + bind_info.is_report_processor = FALSE;
> + bind_info.is_trap_processor = FALSE;
> + bind_info.recv_q_size = OSM_PM_DEFAULT_QP1_RCV_SIZE;
> + bind_info.send_q_size = OSM_PM_DEFAULT_QP1_SEND_SIZE;
> +
> + osm_log( pm->log, OSM_LOG_VERBOSE,
> + "osm_pm_mad_bind: "
> + "Binding to port GUID 0x%" PRIx64 "\n",
> + cl_ntoh64( port_guid ) );
> +
> + pm->bind_handle = osm_vendor_bind( pm->vendor,
> + &bind_info,
> + pm->mad_pool,
> + osm_perfmgr_mad_recv_callback,
> + osm_perfmgr_mad_send_err_callback,
> + pm );
> +
> + if(
pm->bind_handle == OSM_BIND_INVALID_HANDLE ) { > + status = IB_ERROR; > + osm_log( pm->log, OSM_LOG_ERROR, > + "osm_pm_mad_bind: Vendor specific bind failed (%s)\n", > + ib_get_err_str(status) ); > + goto Exit; > + } > + > +Exit: > + OSM_LOG_EXIT( pm->log ); > + return( status ); > +} > + > +/********************************************************************** > + * Unbind the PM to the vendor layer for MAD sends/receives > + **********************************************************************/ > +void > +osm_perfmgr_mad_unbind(osm_perfmgr_t * const pm) > +{ > + OSM_LOG_ENTER( pm->log, osm_sa_mad_ctrl_unbind ); > + if( pm->bind_handle == OSM_BIND_INVALID_HANDLE ) { > + osm_log( pm->log, OSM_LOG_ERROR, > + "osm_pm_mad_unbind: No previous bind\n" ); > + goto Exit; > + } > + osm_vendor_unbind( pm->bind_handle ); > +Exit: > + OSM_LOG_EXIT( pm->log ); > +} > + > +/********************************************************************** > + * Given a node and a port return the appropriate lid to query that port > + **********************************************************************/ > +static ib_net16_t > +get_lid(osm_node_t *p_node, uint8_t port) > +{ > + ib_net16_t lid = 0; > + > + switch (p_node->node_info.node_type) > + { > + case IB_NODE_TYPE_CA: > + case IB_NODE_TYPE_ROUTER: > + lid = osm_node_get_base_lid(p_node, port); > + break; > + case IB_NODE_TYPE_SWITCH: > + lid = osm_node_get_base_lid(p_node, 0); > + break; > + default: > + break; > + } > + return (lid); > +} > + > +/********************************************************************** > + * Form the Port Counter MAD and send the MAD for a single port. 
> + **********************************************************************/ > +static ib_api_status_t > +osm_perfmgr_send_pc_mad(osm_perfmgr_t *perfmgr, ib_net16_t dest_lid, uint8_t port, > + uint8_t mad_method, osm_madw_context_t* const p_context ) > +{ > + ib_api_status_t status = IB_SUCCESS; > + ib_port_counters_t *port_counter = NULL; > + ib_perfmgr_mad_t *pm_mad = NULL; > + osm_madw_t *p_madw = NULL; > + > + OSM_LOG_ENTER(perfmgr->log, osm_perfmgr_send_pc_mad); > + > + p_madw = osm_mad_pool_get(perfmgr->mad_pool, perfmgr->bind_handle, MAD_BLOCK_SIZE, NULL); > + if (p_madw == NULL) > + return (IB_INSUFFICIENT_MEMORY); > + > + pm_mad = osm_madw_get_perfmgr_mad_ptr(p_madw); > + > + /* build the mad */ > + pm_mad->header.base_ver = 1; > + pm_mad->header.mgmt_class = IB_MCLASS_PERF; > + pm_mad->header.class_ver = 1; > + pm_mad->header.method = mad_method; > + pm_mad->header.status = 0; > + pm_mad->header.class_spec = 0; > + pm_mad->header.trans_id = cl_hton64((uint64_t)cl_atomic_inc(&(perfmgr->trans_id))); > + pm_mad->header.attr_id = IB_MAD_ATTR_PORT_CNTRS; > + pm_mad->header.resv = 0; > + pm_mad->header.attr_mod = 0; > + > + port_counter = (ib_port_counters_t *)&(pm_mad->data); > + memset(port_counter, 0, sizeof(*port_counter)); > + port_counter->port_select = port; > + port_counter->counter_select = 0xFFFF; > + > + p_madw->mad_addr.dest_lid = dest_lid; > + p_madw->mad_addr.addr_type.gsi.remote_qp = cl_hton32(1); > + p_madw->mad_addr.addr_type.gsi.remote_qkey = cl_hton32(IB_QP1_WELL_KNOWN_Q_KEY); > + /* FIXME what about other partitions */ > + p_madw->mad_addr.addr_type.gsi.pkey = cl_hton16(0xFFFF); > + p_madw->mad_addr.addr_type.gsi.service_level = 0; > + p_madw->mad_addr.addr_type.gsi.global_route = FALSE; > + p_madw->resp_expected = TRUE; > + > + if( p_context ) > + p_madw->context = *p_context; > + > + status = osm_vendor_send(perfmgr->bind_handle, p_madw, TRUE); > + > + OSM_LOG_EXIT(perfmgr->log); > + return( status ); > +} > + > 
+/********************************************************************** > + * query the Port Counters of all the nodes in the subnet. > + **********************************************************************/ > +static void > +__osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context ) > +{ > + ib_api_status_t status = IB_SUCCESS; > + uint8_t port = 0; > + osm_perfmgr_t *pm = (osm_perfmgr_t *)context; > + osm_node_t *p_node = (osm_node_t *)p_map_item; > + uint8_t node_desc[IB_NODE_DESCRIPTION_SIZE]; > + osm_madw_context_t mad_context; > + uint8_t num_ports = 0; > + uint64_t node_guid = 0; > + > + OSM_LOG_ENTER( pm->log, __osm_pm_query_counters ); > + > + memcpy(node_desc, p_node->node_desc.description, > + IB_NODE_DESCRIPTION_SIZE); > + node_desc[IB_NODE_DESCRIPTION_SIZE-1] = '\0'; We have null terminated 'print_desc' field in osm_node_t structure > + > + num_ports = osm_node_get_num_physp(p_node); > + node_guid = cl_ntoh64(p_node->node_info.node_guid); > + > + /* make sure we have a database object ready to store this information */ > + if (osm_event_db_create_entry(pm->db, node_guid, num_ports) != > + OSM_EVENT_DB_SUCCESS) > + { > + osm_log(pm->log, OSM_LOG_ERROR, > + "PerfMgr DB create entry failed for 0x%" PRIx64 " : %s\n", > + node_guid, strerror(errno)); > + goto Exit; > + } > + > + /* issue the queries for each port */ > + for (port = 1; port < num_ports; port++) > + { > + ib_net16_t lid = get_lid(p_node, port); > + if (lid == 0) > + { > + osm_log(pm->log, OSM_LOG_DEBUG, > + "WARN: node 0x%" PRIx64 " port %d (%s): port out of range, skipping\n", > + cl_ntoh64(p_node->node_info.node_guid), port, node_desc); > + continue; > + } > + > + mad_context.perfmgr_context.node_guid = node_guid; > + mad_context.perfmgr_context.port = port; > + mad_context.perfmgr_context.num_ports = num_ports; > + mad_context.perfmgr_context.mad_method = IB_MAD_METHOD_GET; > +#if 0 > + gettimeofday(&(mad_context.perfmgr_context.query_start), NULL); > +#endif > + 
osm_log(pm->log, OSM_LOG_VERBOSE, > + " Getting stats for node 0x%" PRIx64 " port %d (lid %X) (%s)\n", > + node_guid, port, cl_ntoh16(lid), node_desc); > + status = osm_perfmgr_send_pc_mad(pm, lid, port, IB_MAD_METHOD_GET, &mad_context); > + if (status != IB_SUCCESS) > + { > + osm_log(pm->log, OSM_LOG_ERROR, > + "Failed to issue port counter query for node 0x%" PRIx64 " port %d (%s)\n", > + p_node->node_info.node_guid, port, node_desc); > + } > + } > +Exit: > + OSM_LOG_EXIT( pm->log ); > +} > + > +/********************************************************************** > + * Main PerfMgr Thread. > + * Loop continuously and query the performance counters. > + **********************************************************************/ > +void > +__osm_perfmgr_sweeper(void *p_ptr) > +{ > + ib_api_status_t status; > + osm_perfmgr_t *const pm = ( osm_perfmgr_t * ) p_ptr; > + > + OSM_LOG_ENTER( pm->log, __osm_pm_sweeper ); > + > + if( pm->thread_state == OSM_THREAD_STATE_INIT ) > + pm->thread_state = OSM_THREAD_STATE_RUN; > + > + while( pm->thread_state == OSM_THREAD_STATE_RUN ) { > + /* do the sweep only if we are in MASTER state > + * AND we have been activated. > + * FIXME put something in here to try and reduce the load on the system > + * when it is not IDLE.
> + if (pm->sm->state_mgr.state != OSM_SM_STATE_IDLE) > + */ > + if( pm->subn->sm_state == IB_SMINFO_STATE_MASTER > + && pm->state == PERFMGR_STATE_ENABLED) { > +#if 0 > + struct timeval before, after; > + gettimeofday(&before, NULL); > +#endif > + /* for each node query their counters */ > + cl_plock_acquire(pm->lock); > + osm_log(pm->log, OSM_LOG_VERBOSE, "Gathering PerfMgr stats\n"); > + cl_qmap_apply_func(&(pm->subn->node_guid_tbl), > + __osm_perfmgr_query_counters, (void *)pm); > + cl_plock_release(pm->lock); > +#if 0 > + gettimeofday(&after, NULL); > + osm_log(pm->log, OSM_LOG_INFO, > + "total sweep time : %ld us\n", after.tv_usec - before.tv_usec); > +#endif > + } > + > + /* Wait for a forced sweep or period timeout. */ > + status = cl_event_wait_on( &pm->sig_sweep, > + pm->sweep_time_s * 1000000, > + TRUE ); > + } > + > + OSM_LOG_EXIT( pm->log ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +void > +osm_perfmgr_shutdown(osm_perfmgr_t * const pm) > +{ > + OSM_LOG_ENTER( pm->log, osm_perfmgr_shutdown ); > + osm_perfmgr_mad_unbind(pm); > + OSM_LOG_EXIT( pm->log ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +void > +osm_perfmgr_destroy(osm_perfmgr_t * const pm) > +{ > + OSM_LOG_ENTER( pm->log, osm_perfmgr_destroy ); > + free(pm->event_db_dump_file); > + free(pm->event_db_plugin); > + osm_event_db_destroy(pm->db); > + OSM_LOG_EXIT( pm->log ); > +} > + > +/********************************************************************** > + * Return 1 if the value has overflowed > + **********************************************************************/ > +int counter_overflow_4(uint8_t val) > +{ > + return (val >= 10); > +} > +int counter_overflow_8(uint8_t val) > +{ > + return (val >= (UINT8_MAX - (UINT8_MAX/4))); > +} > +int 
counter_overflow_16(uint16_t val) > +{ > + return (cl_ntoh16(val) >= (UINT16_MAX - (UINT16_MAX/4))); > +} > +int counter_overflow_32(uint32_t val) > +{ > + return (cl_ntoh32(val) >= (UINT32_MAX - (UINT32_MAX/4))); > +} > + > +/********************************************************************** > + * Check if the port counters have overflowed and if so issue a clear MAD to > + * the port. > + **********************************************************************/ > +static void > +osm_perfmgr_check_clear(osm_perfmgr_t *pm, uint64_t node_guid, > + uint8_t port, int num_ports, ib_port_counters_t *cr) > +{ > + osm_madw_context_t mad_context; > + > + OSM_LOG_ENTER( pm->log, osm_pm_check_clear ); > + if (counter_overflow_16(cr->symbol_err_cnt) > + || counter_overflow_8(cr->link_err_recover) > + || counter_overflow_8(cr->link_downed) > + || counter_overflow_16(cr->rcv_err) > + || counter_overflow_16(cr->rcv_rem_phys_err) > + || counter_overflow_16(cr->rcv_switch_relay_err) > + || counter_overflow_16(cr->xmit_discards) > + || counter_overflow_8(cr->xmit_constraint_err) > + || counter_overflow_8(cr->rcv_constraint_err) > + || counter_overflow_4(PC_LINK_INT(cr->link_int_buffer_overrun)) > + || counter_overflow_4(PC_BUF_OVERRUN(cr->link_int_buffer_overrun)) > + || counter_overflow_16(cr->vl15_dropped) > + || counter_overflow_32(cr->xmit_data) > + || counter_overflow_32(cr->rcv_data) > + || counter_overflow_32(cr->xmit_pkts) > + || counter_overflow_32(cr->rcv_pkts) > + ) > + { > + osm_log(pm->log, OSM_LOG_INFO, > + "Counter overflow: 0x%" PRIx64 " port %d; clearing counters\n", > + node_guid, port); > + osm_node_t *p_node = NULL; > + ib_net16_t lid = 0; > + cl_plock_acquire(pm->lock); > + p_node = (osm_node_t *)cl_qmap_get(&(pm->subn->node_guid_tbl), > + cl_hton64(node_guid)); > + lid = get_lid(p_node, port); > + cl_plock_release(pm->lock); > + if (lid == 0) > + { > + osm_log(pm->log, OSM_LOG_INFO, > + "Failed to clear counters for node 0x%" PRIx64 " port %d; failed to get 
lid\n", > + node_guid, port); > + goto Exit; > + } > + mad_context.perfmgr_context.node_guid = node_guid; > + mad_context.perfmgr_context.port = port; > + mad_context.perfmgr_context.num_ports = num_ports; > + mad_context.perfmgr_context.mad_method = IB_MAD_METHOD_SET; > + /* clear port counter */ > + osm_perfmgr_send_pc_mad(pm, lid, port, IB_MAD_METHOD_SET, &mad_context); > + } > +Exit: > + OSM_LOG_EXIT( pm->log ); > +} > + > +/********************************************************************** > + * Check values for logging of errors > + **********************************************************************/ > +static void > +osm_perfmgr_log_events(osm_perfmgr_t *pm, uint64_t node_guid, uint8_t port, > + ib_port_counters_t *reading) > +{ > + osm_pc_reading_t prev_read; > + ib_port_counters_t *prev; > + time_t time_diff = 0; > + osm_event_db_err_t err = osm_event_db_get_prev_pc(pm->db, node_guid, port, &prev_read); > + if (err != OSM_EVENT_DB_SUCCESS) > + { > + osm_log(pm->log, OSM_LOG_VERBOSE, > + "failed to find previous reading for 0x%" PRIx64 " port %u\n", > + node_guid, port); > + return; > + } > + time_diff = (time(NULL) - prev_read.time); > + prev = &(prev_read.reading); > + > + /* FIXME these events should be defineable by the user in a config > + * file somewhere. 
*/ > + if (reading->symbol_err_cnt > prev->symbol_err_cnt) { > + osm_log(pm->log, OSM_LOG_ERROR, > + "Found %u Symbol errors in %lu sec on node 0x%" PRIx64 " port %u\n", > + (cl_ntoh16(reading->symbol_err_cnt) - cl_ntoh16(prev->symbol_err_cnt)), > + time_diff, > + node_guid, > + port); > + } > + if (reading->rcv_err > prev->rcv_err) { > + osm_log(pm->log, OSM_LOG_ERROR, > + "Found %u Receive errors in %lu sec on node 0x%" PRIx64 " port %u\n", > + (cl_ntoh16(reading->rcv_err) - cl_ntoh16(prev->rcv_err)), > + time_diff, > + node_guid, > + port); > + } > + if (reading->xmit_discards > prev->xmit_discards) { > + osm_log(pm->log, OSM_LOG_ERROR, > + "Found %u XMIT Discards in %lu sec on node 0x%" PRIx64 " port %u\n", > + (cl_ntoh16(reading->xmit_discards) - cl_ntoh16(prev->xmit_discards)), > + time_diff, > + node_guid, > + port); > + } > +} > + > + > +/********************************************************************** > + * The dispatcher uses a thread pool which will call this function when we have > + * a thread available to process our mad received from the wire.
> + **********************************************************************/ > +static void > +osm_pc_rcv_process(void *context, void *data) > +{ > + osm_perfmgr_t *const pm = (osm_perfmgr_t *)context; > + osm_madw_t *p_madw = (osm_madw_t *)data; > + osm_madw_context_t *mad_context = &(p_madw->context); > + ib_port_counters_t *counter_reading = > + (ib_port_counters_t *)&(osm_madw_get_perfmgr_mad_ptr(p_madw)->data); > + uint64_t node_guid = mad_context->perfmgr_context.node_guid; > + uint8_t port_num = mad_context->perfmgr_context.port; > + int num_ports = mad_context->perfmgr_context.num_ports; > + > + OSM_LOG_ENTER( pm->log, osm_pc_rcv_process ); > + > + osm_log(pm->log, OSM_LOG_VERBOSE, > + "Processing recieved MAD context 0x%" PRIx64 " port %u/%d\n", > + node_guid, port_num, num_ports); > + > + /* log any critical events from this reading */ > + osm_perfmgr_log_events(pm, node_guid, port_num, counter_reading); > + > + if (mad_context->perfmgr_context.mad_method == IB_MAD_METHOD_GET) > + osm_event_db_add_pc_reading(pm->db, node_guid, port_num, counter_reading); > + else > + osm_event_db_clear_prev_pc(pm->db, node_guid, port_num); > + osm_perfmgr_check_clear(pm, node_guid, port_num, num_ports, counter_reading); > + > +#if 0 > + do { > + struct timeval proc_time; > + gettimeofday(&proc_time, NULL); > + osm_log(pm->log, OSM_LOG_INFO, > + "perfmgr done processing time %ld\n", > + proc_time.tv_usec - > + p_madw->context.perfmgr_context.query_start.tv_usec); > + } while (0); > +#endif > + > + osm_mad_pool_put( pm->mad_pool, p_madw ); > + > + OSM_LOG_EXIT( pm->log ); > +} > + > +/********************************************************************** > + * Initialize the PERFMGR object > + **********************************************************************/ > +ib_api_status_t > +osm_perfmgr_init( > + osm_perfmgr_t * const pm, > + osm_subn_t * const subn, > + osm_sm_t * const sm, > + osm_log_t * const log, > + osm_mad_pool_t * const mad_pool, > + osm_vendor_t * const 
vendor, > + cl_dispatcher_t* const disp, > + cl_plock_t* const lock, > + const osm_subn_opt_t * const p_opt ) > +{ > + ib_api_status_t status = IB_SUCCESS; > + > + OSM_LOG_ENTER( log, osm_pm_init ); > + > + osm_log(log, OSM_LOG_VERBOSE, "initing PM\n"); > + > + memset( pm, 0, sizeof( *pm ) ); > + > + cl_event_construct(&pm->sig_sweep); > + cl_event_init(&pm->sig_sweep, FALSE); > + pm->subn = subn; > + pm->sm = sm; > + pm->log = log; > + pm->mad_pool = mad_pool; > + pm->vendor = vendor; > + pm->trans_id = OSM_PERFMGR_INITIAL_TID_VALUE; > + pm->lock = lock; > + pm->state = p_opt->perfmgr ? PERFMGR_STATE_ENABLED : PERFMGR_STATE_DISABLE; > + pm->sweep_time_s = p_opt->perfmgr_sweep_time_s; > + pm->event_db_dump_file = strdup(p_opt->event_db_dump_file); > + pm->event_db_plugin = strdup(p_opt->event_db_plugin); > + > + pm->db = osm_event_db_construct(pm->log, pm->event_db_plugin); > + if (!pm->db) > + { > + pm->state = PERFMGR_STATE_NO_DB; > + goto Exit; > + } > + > + pm->pc_disp_h = cl_disp_register(disp, OSM_MSG_MAD_PORT_COUNTERS, > + osm_pc_rcv_process, pm); > + if( pm->pc_disp_h == CL_DISP_INVALID_HANDLE ) > + goto Exit; > + > + pm->thread_state = OSM_THREAD_STATE_INIT; > + status = cl_thread_init( &pm->sweeper, __osm_perfmgr_sweeper, pm, > + "PerfMgr sweeper" ); > + if( status != IB_SUCCESS ) > + goto Exit; > + > +Exit: > + OSM_LOG_EXIT( log ); > + return ( status ); > +} > + > +/********************************************************************** > + * Clear the counters from the db > + **********************************************************************/ > +void > +osm_perfmgr_clear_counters(osm_perfmgr_t *pm) > +{ > + /** > + * FIXME todo issue clear on the fabric? > + */ > + osm_event_db_clear_port_counters(pm->db); > + osm_log( pm->log, OSM_LOG_INFO, "PerfMgr counters cleared\n"); > +} > + > +/******************************************************************* > + * Have the DB dump it's information to the file specified. 
> + *******************************************************************/ > +void > +osm_perfmgr_dump_counters(osm_perfmgr_t *pm, osm_event_db_dump_t dump_type) > +{ > + if (osm_event_db_dump(pm->db, pm->event_db_dump_file, dump_type) != 0) > + { > + osm_log( pm->log, OSM_LOG_ERROR, > + "PB dump port counters: Failed to file %s : %s", > + pm->event_db_dump_file, strerror(errno)); > + } > +} > + > +#if 0 > +/******************************************************************* > + * Use this later to track events on the fabric > + **********************************************************************/ > +ib_api_status_t > +osm_report_notice_to_perfmgr(osm_log_t* const log, osm_subn_t* subn, > + ib_mad_notice_attr_t *p_ntc ) > +{ > + OSM_LOG_ENTER( log, osm_report_trap_to_pm ); > + if ((p_ntc->generic_type & 0x80) > + && (cl_ntoh16(p_ntc->g_or_v.generic.trap_num) == 128)) { > + osm_log( log, OSM_LOG_INFO, "PerfMgr notified of trap 128\n"); > + } > + OSM_LOG_EXIT( log ); > + return (IB_SUCCESS); > +} > +#endif > + > +#endif /* ENABLE_OSM_PERF_MGR */ > + > diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c > index c8c3ddc..77c19a5 100644 > --- a/osm/opensm/osm_subnet.c > +++ b/osm/opensm/osm_subnet.c > @@ -66,6 +66,7 @@ > #include > #include > #include > +#include > > #if defined(PATH_MAX) > #define OSM_PATH_MAX (PATH_MAX + 1) > @@ -471,6 +472,12 @@ osm_subn_set_default_opt( > p_opt->honor_guid2lid_file = FALSE; > p_opt->daemon = FALSE; > p_opt->sm_inactive = FALSE; > +#ifdef ENABLE_OSM_PERF_MGR > + p_opt->perfmgr = FALSE; > + p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S; > + p_opt->event_db_dump_file = OSM_PERFMGR_DEFAULT_DUMP_FILE; > + p_opt->event_db_plugin = OSM_DEFAULT_EVENT_PLUGIN; > +#endif /* ENABLE_OSM_PERF_MGR */ > > p_opt->dump_files_dir = getenv("OSM_TMP_DIR"); > if (!p_opt->dump_files_dir || !(*p_opt->dump_files_dir)) > @@ -1076,6 +1083,24 @@ osm_subn_parse_conf_file( > "sm_inactive", > p_key, p_val, &p_opts->sm_inactive); > > 
+#ifdef ENABLE_OSM_PERF_MGR > + __osm_subn_opts_unpack_boolean( > + "perfmgr", > + p_key, p_val, &p_opts->perfmgr); > + > + __osm_subn_opts_unpack_uint16( > + "perfmgr_sweep_time_s", > + p_key, p_val, &p_opts->perfmgr_sweep_time_s); > + > + __osm_subn_opts_unpack_charp( > + "event_db_dump_file", > + p_key, p_val, &p_opts->event_db_dump_file); > + > + __osm_subn_opts_unpack_charp( > + "event_db_plugin", > + p_key, p_val, &p_opts->event_db_plugin); > +#endif /* ENABLE_OSM_PERF_MGR */ > + > subn_parse_qos_options("qos", > p_key, p_val, &p_opts->qos_options); > > @@ -1321,6 +1346,32 @@ osm_subn_write_conf_file( > p_opts->sm_inactive ? "TRUE" : "FALSE" > ); > > +#ifdef ENABLE_OSM_PERF_MGR > + fprintf( > + opts_file, > + "#\n# Performance Manager Options\n#\n" > + "# perfmgr enable\n" > + "perfmgr %s\n\n" > + "# sweep time in seconds\n" > + "perfmgr_sweep_time_s %d\n\n" > + , > + p_opts->perfmgr ? "TRUE" : "FALSE", > + p_opts->perfmgr_sweep_time_s > + ); > + > + fprintf( > + opts_file, > + "#\n# Event DB Options\n#\n" > + "# Dump file to dump the events to\n" > + "event_db_dump_file %s\n\n" > + "# Event db plugin\n" > + "event_db_plugin %s\n\n" > + , > + p_opts->event_db_dump_file, > + p_opts->event_db_plugin > + ); > +#endif /* ENABLE_OSM_PERF_MGR */ > + > fprintf( > opts_file, > "#\n# DEBUG FEATURES\n#\n" > diff --git a/osm/opensm/osm_trap_rcv.c b/osm/opensm/osm_trap_rcv.c > index 0858968..19be781 100644 > --- a/osm/opensm/osm_trap_rcv.c > +++ b/osm/opensm/osm_trap_rcv.c > @@ -698,6 +698,21 @@ __osm_trap_rcv_process_request( > goto Exit; > } > > +#ifdef ENABLE_OSM_PERF_MGR > +#if 0 > + /* we still need to work out how this will work */ > + status = osm_report_notice_to_perfmgr(p_rcv->p_log, p_rcv->p_subn, p_ntci); > + if( status != IB_SUCCESS ) > + { > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_trap_rcv_process_request: ERR 3803: " > + "Error sending trap reports (%s)\n", > + ib_get_err_str( status ) ); > + goto Exit; > + } > +#endif > +#endif /* 
ENABLE_OSM_PERF_MGR */ > + > Exit: > OSM_LOG_EXIT( p_rcv->p_log ); > } > -- > 1.4.4 > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at dev.mellanox.co.il Sun May 13 21:58:32 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 May 2007 07:58:32 +0300 Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git Message-ID: <20070514045832.GA18615@mellanox.co.il> Roland, please pick up the patches from: git://git.openfabrics.org/~mst/linux-2.6/.git master This will pull in the following outstanding patches intended for 2.6.22: all of them have been posted previously, I just cleaned up a couple of whitespace errors reported by git-apply (let me know if you like me to re-post the patches): Michael S. Tsirkin (3): IB/mthca: fix posting >255 recv WRs for Tavor IB/mthca: fix RESET to ERROR transition ipoib/cm: optimize stale connection detection Yosef Etigin (2): IB/core: add helpers for uncached gid/pkey queries IB/ipoib: handle pkey re-shuffling -- MST
From kliteyn at dev.mellanox.co.il Mon May 14 00:30:45 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 14 May 2007 10:30:45 +0300 Subject: [ofa-general] [PATCH] osm: integer indexes in fat-tree Message-ID: <46481025.9080209@dev.mellanox.co.il> Hi Hal, Enhancing integer indexes in fat-tree to 32 bits. I'm not sure whether it's a bug - fat-tree routing makes indexing not the same way as up/down. It marks rank on all the leaf switches, and only then starts BFS (starting from all the leaf switches and not from roots), so I don't think that it can really overflow the existing indexes. But who knows... Fixing this won't hurt anyway. Please apply to master. -- Yevgeny Signed-off-by: Yevgeny Kliteynik --- osm/opensm/osm_ucast_ftree.c | 12 ++++++------ 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c index 7b6a6a5..ca51484 100644 --- a/osm/opensm/osm_ucast_ftree.c +++ b/osm/opensm/osm_ucast_ftree.c @@ -174,7 +174,7 @@ typedef struct ftree_sw_t_ { cl_map_item_t map_item; osm_switch_t * p_osm_sw; - uint8_t rank; + uint32_t rank; ftree_tuple_t tuple; ib_net16_t base_lid; ftree_port_group_t ** down_port_groups; @@ -588,7 +588,7 @@ __osm_ftree_sw_create( memset(p_sw, 0, sizeof(ftree_sw_t)); p_sw->p_osm_sw = p_osm_sw; - p_sw->rank = 0xFF; + p_sw->rank = 0xFFFFFFFF; __osm_ftree_tuple_init(p_sw->tuple); p_sw->base_lid = osm_node_get_base_lid(p_sw->p_osm_sw->p_node, 0); @@ -678,7 +678,7 @@ static boolean_t __osm_ftree_sw_ranked( IN ftree_sw_t * p_sw) { - return (p_sw->rank != 0xFF); + return (p_sw->rank != 0xFFFFFFFF); } /***************************************************/ @@ -1025,7 +1025,7 @@ __osm_ftree_fabric_destroy(ftree_fabric_ /***************************************************/ static void
-__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint8_t rank) +__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint32_t rank) { if (rank > p_ftree->tree_rank) p_ftree->tree_rank = rank; @@ -1314,7 +1314,7 @@ __osm_ftree_fabric_assign_first_tuple( ftree_tuple_t new_tuple; __osm_ftree_tuple_init(new_tuple); - new_tuple[0] = p_sw->rank; + new_tuple[0] = (uint8_t)p_sw->rank; for (i = 1; i <= p_sw->rank; i++) new_tuple[i] = 0; @@ -1374,7 +1374,7 @@ __osm_ftree_fabric_calculate_rank( { ftree_sw_t * p_sw; ftree_sw_t * p_next_sw; - uint16_t max_rank = 0; + uint32_t max_rank = 0; /* go over all the switches and find maximal switch rank */ -- 1.4.4.1.GIT From vlad at lists.openfabrics.org Mon May 14 02:32:13 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 14 May 2007 02:32:13 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070514-0200 daily build status Message-ID: <20070514093213.51F90E6082D@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on x86_64 with 
linux-2.6.13 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.13 Failed: From eli at mellanox.co.il Mon May 14 01:35:43 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 14 May 2007 11:35:43 +0300 Subject: [ofa-general] [PATCH] IB/core free umem when mm is destroyed Message-ID: <1179131773.7405.39.camel@mtls03> free umem when task's mm is already destroyed by the time ib_umem_release gets called. 
Found by Dotan Barak at Mellanox Signed-off-by: Eli Cohen --- Index: connectx_kernel/drivers/infiniband/core/umem.c =================================================================== --- connectx_kernel.orig/drivers/infiniband/core/umem.c 2007-05-14 09:43:02.000000000 +0300 +++ connectx_kernel/drivers/infiniband/core/umem.c 2007-05-14 10:26:26.000000000 +0300 @@ -261,8 +261,10 @@ void ib_umem_release(struct ib_umem *ume __ib_umem_release(umem->context->device, umem, 1); mm = get_task_mm(current); - if (!mm) + if (!mm) { + kfree(umem); return; + } diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; From halr at voltaire.com Mon May 14 03:58:34 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 14 May 2007 06:58:34 -0400 Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager In-Reply-To: <20070513195539.GH29746@sashak.voltaire.com> References: <20070508184938.311b1c8f.weiny2@llnl.gov> <20070513195539.GH29746@sashak.voltaire.com> Message-ID: <1179140285.1540.167239.camel@hal.voltaire.com> On Sun, 2007-05-13 at 15:55, Sasha Khapyorsky wrote: > Hi Ira, > > Thanks for the great work! Indeed :-) > On 18:49 Tue 08 May , Ira Weiny wrote: > > I would like to submit to the list a performance manager which I have been > > working on for OpenSM. > > > > It is implemented as the first proposed architecture model set forth by Hal (As > > an integrated thread to OpenSM.) As such it works fine on our small test > > cluster but there is some concern about its scalability. > > > > I have extended this architecture with an idea of my own. This idea is to have > > a plug-able module for the "event database". With this interface one could > > write their own Data reduction, logging, and tracking methods. Here at LLNL I > > propose to use this to add counter and subnet events directly to our management > > database which is used to show system status to our operators. Other > > installations might prefer other methods of logging, SNMP for example. 
This > > patch includes a "reference" implementation of this "event database" which > > stores the information internally until the user requests a "dump". > > I like this event db idea, but not sure this should not be integral part > of the low level perfmgr stuff - as it is currently implemented without > such plugin loaded PerfMgr just doesn't work - this unconditionally tries > to pull all ports counters, but has nothing to do with it without plugin. > > Instead I would propose to have a builtin PerfMgr which will be able to > pull and store performance related data and then to call "generic" event > manager which can process such data. This also will help to have simpler > generic API for such event db plugin so other parts of OpenSM will be > able to report events using same method(s). What do you think? Sounds better to me. Ira ? > Some patch related comments are inlined below. > > Sasha > > > > > Let the flames begin, > > Ira Weiny > > weiny2 at llnl.gov > > > > > > > > >From 4ce288b6a5a371872cf160f6d4e29e768a065cb9 Mon Sep 17 00:00:00 2001 > > From: Ira K. Weiny > > Date: Tue, 24 Apr 2007 23:44:15 -0700 > > Subject: [PATCH] OpenSM Proposed Perf Manager > > > > Features include: > > * Create "PerfMgr" thread and sweep all ports on the subnet every > > sweep_time seconds > > * port counter clear on overflow > > * pluggable architecture for the "event" database > > * Output machine and human readable output in the default event database > > dump > > * Control using the "perfmgr" command in the console > > > > Known Issues > > * Not tested at scale. > > * Event database should record trap events and other "interesting" subnet > > events. > > * port counter log warnings should be configurable not hard coded. > > * partitions are not handled yet. > > * Code might not be as pristine as I would like > > > > Enable using --enable-perf-mgr > > > > Signed-off-by: Ira K.
Weiny > > --- > > osm/Makefile.am | 3 +- > > osm/config/osmvsel.m4 | 26 ++ > > osm/configure.in | 5 +- > > osm/eventdb/Makefile.am | 37 ++ > > osm/eventdb/autogen.sh | 15 + > > osm/eventdb/configure.in | 70 ++++ > > osm/eventdb/libibeventdb.map | 5 + > > osm/eventdb/libibeventdb.spec.in | 38 ++ > > osm/eventdb/libibeventdb.ver | 9 + > > osm/eventdb/src/ibeventdb.c | 622 +++++++++++++++++++++++++++++++++ > > osm/include/Makefile.am | 2 + > > osm/include/iba/ib_types.h | 74 ++++ > > osm/include/opensm/osm_base.h | 23 ++ > > osm/include/opensm/osm_event_db.h | 151 ++++++++ > > osm/include/opensm/osm_madw.h | 40 +++ > > osm/include/opensm/osm_msgdef.h | 1 + > > osm/include/opensm/osm_opensm.h | 4 + > > osm/include/opensm/osm_perfmgr.h | 223 ++++++++++++ > > osm/include/opensm/osm_subnet.h | 18 + > > osm/opensm.spec.in | 11 +- > > osm/opensm/Makefile.am | 5 +- > > osm/opensm/configure.in | 3 + > > osm/opensm/main.c | 19 + > > osm/opensm/osm_console.c | 78 +++++ > > osm/opensm/osm_event_db.c | 172 +++++++++ > > osm/opensm/osm_opensm.c | 24 ++ > > osm/opensm/osm_perfmgr.c | 686 +++++++++++++++++++++++++++++++++++++ > > osm/opensm/osm_subnet.c | 51 +++ > > osm/opensm/osm_trap_rcv.c | 15 + > > 29 files changed, 2425 insertions(+), 5 deletions(-) > > [snip...] > > diff --git a/osm/eventdb/src/ibeventdb.c b/osm/eventdb/src/ibeventdb.c > > new file mode 100644 > > index 0000000..e98f85c > > --- /dev/null > > +++ b/osm/eventdb/src/ibeventdb.c > > @@ -0,0 +1,622 @@ > > +/* > > + * Copyright (c) 2007 The Regents of the University of California. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. 
You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + * > > + */ > > + > > +#if HAVE_CONFIG_H > > +# include > > +#endif /* HAVE_CONFIG_H */ > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +/** > > + * Port counter object. > > + * Store all the port counters for a single port. 
> > + */ > > +typedef struct _osm_event_pc { > > + struct { > > + uint64_t symbol_err_cnt; > > + uint64_t link_err_recover; > > + uint64_t link_downed; > > + uint64_t rcv_err; > > + uint64_t rcv_rem_phys_err; > > + uint64_t rcv_switch_relay_err; > > + uint64_t xmit_discards; > > + uint64_t xmit_constraint_err; > > + uint64_t rcv_constraint_err; > > + uint64_t link_int_err; > > + uint64_t buffer_overrun_err; > > + uint64_t vl15_dropped; > > + uint64_t xmit_data; > > + uint64_t rcv_data; > > + uint64_t xmit_pkts; > > + uint64_t rcv_pkts; > > + time_t last_reset; > > + } totals; > > + osm_pc_reading_t previous; > > +} osm_event_pc_t; > > + > > +/** > > + * group port counters for ports into the nodes > > + */ > > +typedef struct _osm_pc_node { > > + cl_map_item_t map_item; /* must be first */ > > + uint64_t node_guid; > > + osm_event_pc_t *ports; > > + uint8_t num_ports; > > +} osm_pc_node_t; > > Is it really needed to keep osm_pc_node_t nodes in separate db (qmap)? > Why not to reuse already existed maps in osm_subn_t (we could add > 'void *pm_data' or so field to osm_physp_t structure)? My one concern would be evolving the PerfMgr. This is better now but is this better when the PerfMgr is separated from the SM functionality ? I know there are other things to untangle to get there. -- Hal [snip...] From halr at voltaire.com Mon May 14 04:04:56 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 14 May 2007 07:04:56 -0400 Subject: [ofa-general] Re: [PATCH] osm: integer indexes in fat-tree In-Reply-To: <46481025.9080209@dev.mellanox.co.il> References: <46481025.9080209@dev.mellanox.co.il> Message-ID: <1179140689.1540.167579.camel@hal.voltaire.com> Hi Yevgeny, On Mon, 2007-05-14 at 03:30, Yevgeny Kliteynik wrote: > Hi Hal, > > Enhancing integer indexes in fat-tree to 32 bits. > I'm not sure whether it's a bug - fat-tree routing makes indexing > not the same way as up/down. 
It marks rank on all the leaf switches, > and only then starts BFS (starting from all the leaf switches and not > from roots), so I don't think that it can really overflow the existing > indexes. But who knows... > Fixing this won't hurt anyway. No, it won't. > Please apply to master. See comment/question below. -- Hal > -- Yevgeny > > Signed-off-by: Yevgeny Kliteynik > --- > osm/opensm/osm_ucast_ftree.c | 12 ++++++------ > 1 files changed, 6 insertions(+), 6 deletions(-) > > diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c > index 7b6a6a5..ca51484 100644 > --- a/osm/opensm/osm_ucast_ftree.c > +++ b/osm/opensm/osm_ucast_ftree.c > @@ -174,7 +174,7 @@ typedef struct ftree_sw_t_ > { > cl_map_item_t map_item; > osm_switch_t * p_osm_sw; > - uint8_t rank; > + uint32_t rank; > ftree_tuple_t tuple; > ib_net16_t base_lid; > ftree_port_group_t ** down_port_groups; > @@ -588,7 +588,7 @@ __osm_ftree_sw_create( > memset(p_sw, 0, sizeof(ftree_sw_t)); > > p_sw->p_osm_sw = p_osm_sw; > - p_sw->rank = 0xFF; > + p_sw->rank = 0xFFFFFFFF; > __osm_ftree_tuple_init(p_sw->tuple); > > p_sw->base_lid = osm_node_get_base_lid(p_sw->p_osm_sw->p_node, 0); > @@ -678,7 +678,7 @@ static boolean_t > __osm_ftree_sw_ranked( > IN ftree_sw_t * p_sw) > { > - return (p_sw->rank != 0xFF); > + return (p_sw->rank != 0xFFFFFFFF); > } > > /***************************************************/ > @@ -1025,7 +1025,7 @@ __osm_ftree_fabric_destroy(ftree_fabric_ > /***************************************************/ > > static void > -__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint8_t rank) > +__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint32_t rank) > { > if (rank > p_ftree->tree_rank) > p_ftree->tree_rank = rank; > @@ -1314,7 +1314,7 @@ __osm_ftree_fabric_assign_first_tuple( > ftree_tuple_t new_tuple; > > __osm_ftree_tuple_init(new_tuple); > - new_tuple[0] = p_sw->rank; > + new_tuple[0] = (uint8_t)p_sw->rank; Should the declaration of ftree_tuple_t change ? 
> for (i = 1; i <= p_sw->rank; i++) > new_tuple[i] = 0; > > @@ -1374,7 +1374,7 @@ __osm_ftree_fabric_calculate_rank( > { > ftree_sw_t * p_sw; > ftree_sw_t * p_next_sw; > - uint16_t max_rank = 0; > + uint32_t max_rank = 0; > > /* go over all the switches and find maximal switch rank */ > From halr at voltaire.com Mon May 14 04:17:08 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 14 May 2007 07:17:08 -0400 Subject: [ofa-general] Re: [PATCH] opensm: more osm_*_construct/_init/_destroy cleanups In-Reply-To: <20070510224308.GH9692@sashak.voltaire.com> References: <20070506174138.GI9692@sashak.voltaire.com> <20070506174431.GJ9692@sashak.voltaire.com> <1178543690.32222.350646.camel@hal.voltaire.com> <20070509212740.GV9692@sashak.voltaire.com> <20070510224308.GH9692@sashak.voltaire.com> Message-ID: <1179141423.1540.168253.camel@hal.voltaire.com> Hi Sasha, On Thu, 2007-05-10 at 18:43, Sasha Khapyorsky wrote: > Hi Hal, > > As suggested :) > > > This removes/makes static non used osm_*_construct/_init/_destroy > initializers for OpenSM objects where osm*_new/_delete are actually > used. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (to master only). -- Hal From kliteyn at dev.mellanox.co.il Mon May 14 04:33:05 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 14 May 2007 14:33:05 +0300 Subject: [ofa-general] Re: [PATCH] osm: integer indexes in fat-tree In-Reply-To: <1179140689.1540.167579.camel@hal.voltaire.com> References: <46481025.9080209@dev.mellanox.co.il> <1179140689.1540.167579.camel@hal.voltaire.com> Message-ID: <464848F1.207@dev.mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > Hi Yevgeny, > > On Mon, 2007-05-14 at 03:30, Yevgeny Kliteynik wrote: >> Hi Hal, >> >> Enhancing integer indexes in fat-tree to 32 bits. >> I'm not sure whether it's a bug - fat-tree routing makes indexing >> not the same way as up/down. 
It marks rank on all the leaf switches, >> and only then starts BFS (starting from all the leaf switches and not >> from roots), so I don't think that it can really overflow the existing >> indexes. But who knows... >> Fixing this won't hurt anyway. > > No, it won't. > >> Please apply to master. > > See comment/question below. > > -- Hal > >> -- Yevgeny >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> osm/opensm/osm_ucast_ftree.c | 12 ++++++------ >> 1 files changed, 6 insertions(+), 6 deletions(-) >> >> diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c >> index 7b6a6a5..ca51484 100644 >> --- a/osm/opensm/osm_ucast_ftree.c >> +++ b/osm/opensm/osm_ucast_ftree.c >> @@ -174,7 +174,7 @@ typedef struct ftree_sw_t_ >> { >> cl_map_item_t map_item; >> osm_switch_t * p_osm_sw; >> - uint8_t rank; >> + uint32_t rank; >> ftree_tuple_t tuple; >> ib_net16_t base_lid; >> ftree_port_group_t ** down_port_groups; >> @@ -588,7 +588,7 @@ __osm_ftree_sw_create( >> memset(p_sw, 0, sizeof(ftree_sw_t)); >> >> p_sw->p_osm_sw = p_osm_sw; >> - p_sw->rank = 0xFF; >> + p_sw->rank = 0xFFFFFFFF; >> __osm_ftree_tuple_init(p_sw->tuple); >> >> p_sw->base_lid = osm_node_get_base_lid(p_sw->p_osm_sw->p_node, 0); >> @@ -678,7 +678,7 @@ static boolean_t >> __osm_ftree_sw_ranked( >> IN ftree_sw_t * p_sw) >> { >> - return (p_sw->rank != 0xFF); >> + return (p_sw->rank != 0xFFFFFFFF); >> } >> >> /***************************************************/ >> @@ -1025,7 +1025,7 @@ __osm_ftree_fabric_destroy(ftree_fabric_ >> /***************************************************/ >> >> static void >> -__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint8_t rank) >> +__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint32_t rank) >> { >> if (rank > p_ftree->tree_rank) >> p_ftree->tree_rank = rank; >> @@ -1314,7 +1314,7 @@ __osm_ftree_fabric_assign_first_tuple( >> ftree_tuple_t new_tuple; >> >> __osm_ftree_tuple_init(new_tuple); >> - new_tuple[0] = p_sw->rank; >> + new_tuple[0] = 
(uint8_t)p_sw->rank;
>
> Should the declaration of ftree_tuple_t change ?

Only if there's a chance that we will build a switch with more than 256 ports.
And when I say "switch" I mean a single unit, not the one that has internal
topology with several building blocks.
But in that case lots of things should be fixed. For instance,
ib_node_info_t.num_ports is 8 bits too.

-- Yevgeny

>>    for (i = 1; i <= p_sw->rank; i++)
>>       new_tuple[i] = 0;
>>
>> @@ -1374,7 +1374,7 @@ __osm_ftree_fabric_calculate_rank(
>>  {
>>     ftree_sw_t * p_sw;
>>     ftree_sw_t * p_next_sw;
>> -   uint16_t max_rank = 0;
>> +   uint32_t max_rank = 0;
>>
>>     /* go over all the switches and find maximal switch rank */
>>
>
>

From halr at voltaire.com Mon May 14 04:45:55 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 07:45:55 -0400
Subject: [ofa-general] Re: [PATCH] osm: integer indexes in fat-tree
In-Reply-To: <46481025.9080209@dev.mellanox.co.il>
References: <46481025.9080209@dev.mellanox.co.il>
Message-ID: <1179143150.1540.170000.camel@hal.voltaire.com>

On Mon, 2007-05-14 at 03:30, Yevgeny Kliteynik wrote:
> Hi Hal,
>
> Enhancing integer indexes in fat-tree to 32 bits.
> I'm not sure whether it's a bug - fat-tree routing makes indexing
> not the same way as up/down. It marks rank on all the leaf switches,
> and only then starts BFS (starting from all the leaf switches and not
> from roots), so I don't think that it can really overflow the existing
> indexes. But who knows...
> Fixing this won't hurt anyway.
>
> Please apply to master.
>
> -- Yevgeny
>
> Signed-off-by: Yevgeny Kliteynik

Thanks. Applied (to master only).

-- Hal

From k_mahesh85 at yahoo.co.in Mon May 14 04:54:12 2007
From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh)
Date: Mon, 14 May 2007 12:54:12 +0100 (BST)
Subject: [ofa-general] [query] addressing the the IB switches using LID.
Message-ID: <572616.31409.qm@web8324.mail.in.yahoo.com>

In the case of an IB switch which is not running an IB subnet manager,
is there any requirement for a LID?
I mean, is there any situation where an IB switch will be addressed directly
(using LID) by the SM or any other node after the subnet initialization is
complete?

-Mahesh

From dotanb at dev.mellanox.co.il Mon May 14 05:15:49 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Mon, 14 May 2007 15:15:49 +0300
Subject: [ofa-general] [query] addressing the the IB switches using LID.
In-Reply-To: <572616.31409.qm@web8324.mail.in.yahoo.com>
References: <572616.31409.qm@web8324.mail.in.yahoo.com>
Message-ID: <464852F5.2010409@dev.mellanox.co.il>

Keshetti Mahesh wrote:
> In the case of a IB switch which is not running an IB subnet manager
> is there any
> requirement of LID.
> I mean, is there any situation where an IB switch will be addressed
> directly
> (using LID) by SM or any other node after the subnet initialization is
> complete?
>
IB switches are not transparent and every IB switch should get a LID
(at least one port of the switch is connected to the subnet).

It doesn't matter if the SM is being executed on this switch or not.

Dotan

From eli at mellanox.co.il Mon May 14 05:17:52 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 14 May 2007 15:17:52 +0300
Subject: [ofa-general] [PATCH] libmlx4: WQE shift calculation
Message-ID: <1179145102.25749.11.camel@mtls03>

For RC QPs we need to add the atomic header size when calculating the WQE size.

Found by Dotan Barak at Mellanox

Signed-off-by: Eli Cohen
---

Roland, the code that calculates the WQ size is quite different between
kernel and user. I think it would be in order to write that code in a way
that allows copying it as is between kernel and user. Would you like me
to send such a patch?
Index: connectx_user/src/userspace/libmlx4/src/qp.c
===================================================================
--- connectx_user.orig/src/userspace/libmlx4/src/qp.c 2007-05-14 17:43:10.000000000 +0300
+++ connectx_user/src/userspace/libmlx4/src/qp.c 2007-05-14 17:44:04.000000000 +0300
@@ -439,7 +439,8 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd,
                 break;
 
         case IBV_QPT_RC:
-                size += sizeof (struct mlx4_wqe_raddr_seg);
+                size += sizeof (struct mlx4_wqe_raddr_seg) +
+                        sizeof (struct mlx4_wqe_atomic_seg);
                 /*
                  * An atomic op will require an atomic segment, a
                  * remote address segment and one scatter entry.

From k_mahesh85 at yahoo.co.in Mon May 14 05:26:28 2007
From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh)
Date: Mon, 14 May 2007 13:26:28 +0100 (BST)
Subject: [ofa-general] [query] addressing the the IB switches using LID.
In-Reply-To: <464852F5.2010409@dev.mellanox.co.il>
Message-ID: <129174.21096.qm@web8320.mail.in.yahoo.com>

> IB switches are not transparent and every IB switch should get a LID
> (at least one port of the switch is connected to the subnet).
>
> It doesn't matter if the SM is being executed on this switch or not.

Yes, according to the IB architecture an IB switch should get a LID. Just out
of curiosity, I want to know whether the IB switch will be addressed using
its LID or not.

I mentioned the SM here because if the IB switch is running the SM, then
it needs to be reachable by LID from all nodes in order to answer SA
queries.
But in the other case (no SM) I haven't yet seen any situation where the
IB switch is addressed using the LID assigned to it.

-Mahesh

From halr at voltaire.com Mon May 14 05:27:37 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 08:27:37 -0400
Subject: [ofa-general] [query] addressing the the IB switches using LID.
In-Reply-To: <464852F5.2010409@dev.mellanox.co.il>
References: <572616.31409.qm@web8324.mail.in.yahoo.com> <464852F5.2010409@dev.mellanox.co.il>
Message-ID: <1179145646.1540.172583.camel@hal.voltaire.com>

On Mon, 2007-05-14 at 08:15, Dotan Barak wrote:
> Keshetti Mahesh wrote:
> > In the case of a IB switch which is not running an IB subnet manager
> > is there any
> > requirement of LID.
> > I mean, is there any situation where an IB switch will be addressed
> > directly
> > (using LID) by SM or any other node after the subnet initialization is
> > complete?
> >
> IB switches are not transparent and every IB switch should get a LID
> (at least one port of the switch is connected to the subnet).

Switch port 0 needs a LID. Switch external/physical ports do not get
them.

-- Hal

> It doesn't matter if the SM is being executed on this switch or not.
>
>
> Dotan

From halr at voltaire.com Mon May 14 05:28:54 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 08:28:54 -0400
Subject: [ofa-general] Re: [query] addressing the the IB switches using LID.
In-Reply-To: <572616.31409.qm@web8324.mail.in.yahoo.com>
References: <572616.31409.qm@web8324.mail.in.yahoo.com>
Message-ID: <1179145661.1540.172585.camel@hal.voltaire.com>

On Mon, 2007-05-14 at 07:54, Keshetti Mahesh wrote:
> In the case of a IB switch which is not running an IB subnet manager
> is there any
> requirement of LID.
> I mean, is there any situation where an IB switch will be addressed
> directly
> (using LID) by SM or any other node after the subnet initialization is
> complete?

This is purely SM policy. IBA does not dictate this and leaves it up to
the SM in question as to whether it uses LID routing or direct routing
to "talk" with nodes (including switches). Clearly, initialization
requires direct routing.

-- Hal

> -Mahesh
>
>
>
> ______________________________________________________________________
> Here's a new way to find what you're looking for - Yahoo!
Answers From kliteyn at dev.mellanox.co.il Mon May 14 05:30:51 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 14 May 2007 15:30:51 +0300 Subject: [ofa-general] [PATCH] osm: fat-tree optimization - creating internal nodes Message-ID: <4648567B.3060809@dev.mellanox.co.il> Hi Hal, A small optimization to creation of fat-tree internal data structures. Using the pointers from osm_node to osm_switch that Sasha has added a while ago, it is enough to scan the OSM node_guid table only once to create all the fat-tree internal nodes. Please apply to master only. -- Yevgeny Signed-off-by: Yevgeny Kliteynik --- osm/opensm/osm_ucast_ftree.c | 51 +++++------------------------------------ 1 files changed, 7 insertions(+), 44 deletions(-) diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c index ca51484..3bad2fc 100644 --- a/osm/opensm/osm_ucast_ftree.c +++ b/osm/opensm/osm_ucast_ftree.c @@ -2365,36 +2365,13 @@ __osm_ftree_fabric_route_to_switches( ***************************************************/ static int -__osm_ftree_fabric_populate_switches( - IN ftree_fabric_t * p_ftree) -{ - osm_switch_t * p_osm_sw; - osm_switch_t * p_next_osm_sw; - - OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_switches); - - p_next_osm_sw = (osm_switch_t *)cl_qmap_head(&p_ftree->p_osm->subn.sw_guid_tbl); - while( p_next_osm_sw != (osm_switch_t *)cl_qmap_end(&p_ftree->p_osm->subn.sw_guid_tbl) ) - { - p_osm_sw = p_next_osm_sw; - p_next_osm_sw = (osm_switch_t *)cl_qmap_next(&p_osm_sw->map_item ); - __osm_ftree_fabric_add_sw(p_ftree,p_osm_sw); - } - OSM_LOG_EXIT(&p_ftree->p_osm->log); - return 0; -} /* __osm_ftree_fabric_populate_switches() */ - -/*************************************************** - ***************************************************/ - -static int -__osm_ftree_fabric_populate_hcas( +__osm_ftree_fabric_populate_nodes( IN ftree_fabric_t * p_ftree) { osm_node_t * p_osm_node; osm_node_t * p_next_osm_node; - 
OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_hcas); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_nodes); p_next_osm_node = (osm_node_t *)cl_qmap_head(&p_ftree->p_osm->subn.node_guid_tbl); while( p_next_osm_node != (osm_node_t *)cl_qmap_end(&p_ftree->p_osm->subn.node_guid_tbl) ) @@ -2409,11 +2386,11 @@ __osm_ftree_fabric_populate_hcas( case IB_NODE_TYPE_ROUTER: break; case IB_NODE_TYPE_SWITCH: - /* all the switches added separately */ + __osm_ftree_fabric_add_sw(p_ftree,p_osm_node->sw); break; default: osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, - "__osm_ftree_fabric_populate_hcas: ERR AB0E: " + "__osm_ftree_fabric_populate_nodes: ERR AB0E: " "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", cl_ntoh64(osm_node_get_node_guid(p_osm_node)), ib_get_node_type_str(osm_node_get_type(p_osm_node))); @@ -2424,7 +2401,7 @@ __osm_ftree_fabric_populate_hcas( OSM_LOG_EXIT(&p_ftree->p_osm->log); return 0; -} /* __osm_ftree_fabric_populate_hcas() */ +} /* __osm_ftree_fabric_populate_nodes() */ /*************************************************** ***************************************************/ @@ -2962,22 +2939,8 @@ __osm_ftree_construct_fabric( osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, "__osm_ftree_construct_fabric: " - "Populating FatTree switch table\n"); - /* ToDo: now that the pointer from node to switch exists, - no need to fill the switch table in a separate loop */ - if (__osm_ftree_fabric_populate_switches(p_ftree) != 0) - { - osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, - "Fabric topology is not fat-tree - " - "falling back to default routing\n"); - status = -1; - goto Exit; - } - - osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, - "__osm_ftree_construct_fabric: " - "Populating FatTree HCA table\n"); - if (__osm_ftree_fabric_populate_hcas(p_ftree) != 0) + "Populating FatTree Switch and HCA tables\n"); + if (__osm_ftree_fabric_populate_nodes(p_ftree) != 0) { osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric topology is 
not fat-tree - "
--
1.4.4.1.GIT

From k_mahesh85 at yahoo.co.in Mon May 14 05:35:09 2007
From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh)
Date: Mon, 14 May 2007 13:35:09 +0100 (BST)
Subject: [ofa-general] Re: [query] addressing the the IB switches using LID.
In-Reply-To: <1179145661.1540.172585.camel@hal.voltaire.com>
Message-ID: <374012.81252.qm@web8316.mail.in.yahoo.com>

> This is purely SM policy. IBA does not dictate this and leaves it up to
> the SM in question as to whether it uses LID routing or direct routing
> to "talk" with nodes (including switches). Clearly, initialization
> requires direct routing.

What is the policy of the current subnet manager implementation, i.e. OpenSM?

-Mahesh

From halr at voltaire.com Mon May 14 05:38:53 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 May 2007 08:38:53 -0400
Subject: [ofa-general] [query] addressing the the IB switches using LID.
In-Reply-To: <129174.21096.qm@web8320.mail.in.yahoo.com>
References: <129174.21096.qm@web8320.mail.in.yahoo.com>
Message-ID: <1179146303.1540.173170.camel@hal.voltaire.com>

On Mon, 2007-05-14 at 08:26, Keshetti Mahesh wrote:
>         IB switches are not transparent and every IB switch should get
>         a LID
>         (at least one port of the switch is connected to the subnet).
>
>         It doesn't matter if the SM is being executed on this switch
>         or not.
> Yes, According to the IB architecture IB switch should get a LID. Just
> out
> of curiosity I want to know whether the IB switch will be addressed
> using LID
> or not.
>
> I have mentioned SM here beacuse if the IB switch is running the SM
> then
> it needs to be reachable using LID by all nodes inorder to answer the
> SA
> queries.

More than this.

> But in the other case (no SM) I didn't see any situation yet where
> the
> IB switch will be addressed using the LID assigned to it.

Operationally, it depends on the SM. You would also be relying on
something beyond the spec, so if the SM changes (such a change
being valid), things would stop working. Also, there are some port 0
features which require the LID to be set.

Compliance-wise, it is a non-compliance for switch port 0 not to
have a LID.

-- Hal

> -Mahesh
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From k_mahesh85 at yahoo.co.in Mon May 14 05:56:06 2007
From: k_mahesh85 at yahoo.co.in (Keshetti Mahesh)
Date: Mon, 14 May 2007 13:56:06 +0100 (BST)
Subject: [ofa-general] Re: [query] addressing the the IB switches using LID.
In-Reply-To: <1179145661.1540.172585.camel@hal.voltaire.com>
Message-ID: <422405.87014.qm@web8316.mail.in.yahoo.com>

>This is purely SM policy. IBA does not dictate this and
>leaves it up to the SM in question as to whether it uses
>LID routing or direct routing to "talk" with nodes
>(including switches). Clearly, initialization requires
>direct routing.

What is the policy of the current subnet manager implementation, i.e. OpenSM?

-Mahesh

From kliteyn at dev.mellanox.co.il Mon May 14 06:02:31 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 14 May 2007 16:02:31 +0300
Subject: [ofa-general] [PATCH] osm: fat-tree optimization - improved ranking
Message-ID: <46485DE7.1050506@dev.mellanox.co.il>

Hi Hal,

This patch optimizes fabric ranking.
All the leaf switches are marked with rank and added to the BFS list,
and only then does ranking of the rest of the fabric begin.
I actually thought that this was the way I had originally implemented it,
as I mentioned in the patch that was dealing with 8 and 16 bit integers :)

A similar optimization may be applicable to up/dn routing: the roots
should be marked with rank 0, and only then should ranking of the rest of
the switches begin. But frankly, it practically doesn't reduce the routing
time, because ranking is only a small fraction of the routing runtime
(I checked it on a 4K+ subnet).
In the case of fat-tree I'm going to need it anyway when I enhance the
routing to consider only a subset of HCAs in the routing balancing
(compute nodes vs. management nodes).

Please apply to master.
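
[Editorial sketch, not part of the posted patch.] The ranking scheme
described above (seed one BFS queue with every leaf switch at rank 0, then
run a single scan over the rest of the fabric) can be illustrated with a
small stand-alone example. Everything named toy_* below, the adjacency
matrix, and the MAX_SW cap are assumptions made for illustration only; the
real code works on ftree_sw_t objects and a cl_list_t. The only ideas taken
from the thread are the 0xFFFFFFFF "unranked" sentinel and the rule that a
switch is re-ranked only when the new rank is lower.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_SW   16          /* hypothetical cap, for this sketch only */
#define UNRANKED 0xFFFFFFFFu /* "not ranked yet" sentinel, as in the thread */

/* Hypothetical stand-in for the fabric: a plain adjacency matrix. */
typedef struct {
    int      num_sw;
    uint8_t  link[MAX_SW][MAX_SW]; /* link[a][b] != 0: a and b are connected */
    uint32_t rank[MAX_SW];
} toy_fabric_t;

static void toy_fabric_init(toy_fabric_t *f, int num_sw)
{
    assert(num_sw > 0 && num_sw <= MAX_SW);
    memset(f, 0, sizeof(*f));
    f->num_sw = num_sw;
    for (int i = 0; i < num_sw; i++)
        f->rank[i] = UNRANKED;
}

static void toy_fabric_link(toy_fabric_t *f, int a, int b)
{
    f->link[a][b] = f->link[b][a] = 1;
}

/* Set the rank only if the switch is unranked or the new rank is lower.
   Returns 1 when the rank was changed, 0 otherwise. */
static int toy_update_rank(toy_fabric_t *f, int sw, uint32_t new_rank)
{
    if (f->rank[sw] != UNRANKED && f->rank[sw] <= new_rank)
        return 0;
    f->rank[sw] = new_rank;
    return 1;
}

/* Seed the queue with every leaf at rank 0, then run one BFS.
   With all seeds queued first, each switch is queued at most once,
   so a queue of MAX_SW entries is enough. */
static void toy_rank_from_leafs(toy_fabric_t *f, const int *leafs, int num_leafs)
{
    int queue[MAX_SW];
    int head = 0, tail = 0;

    for (int i = 0; i < num_leafs; i++)
        if (toy_update_rank(f, leafs[i], 0))
            queue[tail++] = leafs[i];

    while (head < tail) {
        int sw = queue[head++];
        for (int peer = 0; peer < f->num_sw; peer++)
            if (f->link[sw][peer] &&
                toy_update_rank(f, peer, f->rank[sw] + 1))
                queue[tail++] = peer;
    }
}
```

In this sketch the resulting rank is the hop distance to the nearest leaf,
which corresponds to the "REVERSED rank" the patch comments mention; the
real code then derives the tree rank from the maximum value found.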
-- Yevgeny Signed-off-by: Yevgeny Kliteynik >From dfa455f86d9ac48ff5cefd38a009718e5aeab1f9 Mon Sep 17 00:00:00 2001 From: Yevgeny Kliteynik Date: Mon, 14 May 2007 15:45:00 +0300 Subject: [PATCH] DELETE AFTER UPDATE: ranking Signed-off-by: Yevgeny Kliteynik --- osm/opensm/osm_ucast_ftree.c | 83 +++++++++++++++++++++++++----------------- 1 files changed, 49 insertions(+), 34 deletions(-) diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c index 3bad2fc..84da3d7 100644 --- a/osm/opensm/osm_ucast_ftree.c +++ b/osm/opensm/osm_ucast_ftree.c @@ -2406,10 +2406,24 @@ __osm_ftree_fabric_populate_nodes( /*************************************************** ***************************************************/ +static boolean_t +__osm_ftree_sw_update_rank( + IN ftree_sw_t * p_sw, + IN uint32_t new_rank) +{ + if (__osm_ftree_sw_ranked(p_sw) && p_sw->rank <= new_rank) + return FALSE; + p_sw->rank = new_rank; + return TRUE; + +} + +/***************************************************/ + static void -__osm_ftree_rank_from_switch( +__osm_ftree_rank_switches_from_leafs( IN ftree_fabric_t * p_ftree, - IN ftree_sw_t * p_starting_sw) + IN cl_list_t * p_ranking_bfs_list) { ftree_sw_t * p_sw; ftree_sw_t * p_remote_sw; @@ -2417,19 +2431,11 @@ __osm_ftree_rank_from_switch( osm_node_t * p_remote_node; osm_physp_t * p_osm_port; uint8_t i; - cl_list_t bfs_list; ftree_sw_tbl_element_t * p_sw_tbl_element = NULL; - p_starting_sw->rank = 0; - - /* Run BFS scan of the tree, starting from this switch */ - - cl_list_init(&bfs_list, cl_qmap_count(&p_ftree->sw_tbl)); - cl_list_insert_tail(&bfs_list, &__osm_ftree_sw_tbl_element_create(p_starting_sw)->map_item); - - while (!cl_is_list_empty(&bfs_list)) + while (!cl_is_list_empty(p_ranking_bfs_list)) { - p_sw_tbl_element = (ftree_sw_tbl_element_t *)cl_list_remove_head(&bfs_list); + p_sw_tbl_element = (ftree_sw_tbl_element_t *) cl_list_remove_head(p_ranking_bfs_list); p_sw = p_sw_tbl_element->p_sw; 
__osm_ftree_sw_tbl_element_destroy(p_sw_tbl_element); @@ -2457,26 +2463,23 @@ __osm_ftree_rank_from_switch( /* remote node is not a switch */ continue; } - if (__osm_ftree_sw_ranked(p_remote_sw) && p_remote_sw->rank <= (p_sw->rank + 1)) - continue; - /* rank the remote switch and add it to the BFS list */ - p_remote_sw->rank = p_sw->rank + 1; - cl_list_insert_tail(&bfs_list, - &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item); + /* if needed, rank the remote switch and add it to the BFS list */ + if (__osm_ftree_sw_update_rank(p_remote_sw, p_sw->rank + 1)) + cl_list_insert_tail(p_ranking_bfs_list, + &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item); } } - cl_list_destroy(&bfs_list); -} /* __osm_ftree_rank_from_switch() */ +} /* __osm_ftree_rank_switches_from_leafs() */ -/*************************************************** - ***************************************************/ +/***************************************************/ static int -__osm_ftree_rank_switches_from_hca( +__osm_ftree_rank_leaf_switches( IN ftree_fabric_t * p_ftree, - IN ftree_hca_t * p_hca) + IN ftree_hca_t * p_hca, + IN cl_list_t * p_ranking_bfs_list) { ftree_sw_t * p_sw; osm_node_t * p_osm_node = p_hca->p_osm_node; @@ -2502,7 +2505,7 @@ __osm_ftree_rank_switches_from_hca( case IB_NODE_TYPE_CA: /* HCA connected directly to another HCA - not FatTree */ osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, - "__osm_ftree_rank_switches_from_hca: ERR AB0F: " + "__osm_ftree_rank_leaf_switches: ERR AB0F: " "HCA conected directly to another HCA: " "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n", cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), @@ -2520,7 +2523,7 @@ __osm_ftree_rank_switches_from_hca( default: osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, - "__osm_ftree_rank_switches_from_hca: ERR AB10: " + "__osm_ftree_rank_leaf_switches: ERR AB10: " "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node)), 
ib_get_node_type_str(osm_node_get_type(p_remote_osm_node))); @@ -2535,11 +2538,12 @@ __osm_ftree_rank_switches_from_hca( CL_ASSERT(p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl)); - if (__osm_ftree_sw_ranked(p_sw) && p_sw->rank == 0) + if ( !__osm_ftree_sw_update_rank(p_sw,0) ) continue; + /* if needed, rank the remote switch and add it to the BFS list */ osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, - "__osm_ftree_rank_switches_from_hca: " + "__osm_ftree_rank_leaf_switches: " "Marking rank of switch that is directly connected to HCA:\n" " - HCA guid : 0x%016" PRIx64 "\n" " - Switch guid: 0x%016" PRIx64 "\n" @@ -2547,13 +2551,14 @@ __osm_ftree_rank_switches_from_hca( cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)), cl_ntoh16(p_sw->base_lid)); - __osm_ftree_rank_from_switch(p_ftree, p_sw); + cl_list_insert_tail(p_ranking_bfs_list, + &__osm_ftree_sw_tbl_element_create(p_sw)->map_item); } Exit: OSM_LOG_EXIT(&p_ftree->p_osm->log); return res; -} /* __osm_ftree_rank_switches_from_hca() */ +} /* __osm_ftree_rank_leaf_switches() */ /***************************************************/ @@ -2789,18 +2794,21 @@ __osm_ftree_fabric_construct_sw_ports( /*************************************************** ***************************************************/ -/* ToDo: improve ranking algorithm complexity - by propogating BFS from more nodes */ static int __osm_ftree_fabric_perform_ranking( IN ftree_fabric_t * p_ftree) { ftree_hca_t * p_hca; ftree_hca_t * p_next_hca; + cl_list_t ranking_bfs_list; int res = 0; OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_perform_ranking); + /* Init the bfs list - the list of the switches that will be + initially filled with the leaf switches */ + cl_list_init(&ranking_bfs_list, cl_qmap_count(&p_ftree->sw_tbl)); + /* Mark REVERSED rank of all the switches in the subnet. Start from switches that are connected to hca's, and scan all the switches in the subnet. 
*/ @@ -2809,7 +2817,7 @@ __osm_ftree_fabric_perform_ranking( { p_hca = p_next_hca; p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item ); - if (__osm_ftree_rank_switches_from_hca(p_ftree,p_hca) != 0) + if (__osm_ftree_rank_leaf_switches(p_ftree, p_hca, &ranking_bfs_list) != 0) { res = -1; osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, @@ -2819,7 +2827,14 @@ __osm_ftree_fabric_perform_ranking( } } - /* calculate and set FatTree rank */ + /* Now rank rest of the switches in the fabric, while the + list already contains all the ranked leaf switches */ + __osm_ftree_rank_switches_from_leafs(p_ftree, &ranking_bfs_list); + cl_list_destroy(&ranking_bfs_list); + + /* REVERSED ranking of all the switches completed. + Calculate and set FatTree rank */ + __osm_ftree_fabric_calculate_rank(p_ftree); osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, "__osm_ftree_fabric_perform_ranking: " -- 1.4.4.1.GIT From halr at voltaire.com Mon May 14 06:07:19 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 14 May 2007 09:07:19 -0400 Subject: [ofa-general] Re: [PATCH] osm: fat-tree optimization - creating internal nodes In-Reply-To: <4648567B.3060809@dev.mellanox.co.il> References: <4648567B.3060809@dev.mellanox.co.il> Message-ID: <1179148023.1540.174674.camel@hal.voltaire.com> Hi Yevgeny, On Mon, 2007-05-14 at 08:30, Yevgeny Kliteynik wrote: > Hi Hal, > > A small optimization to creation of fat-tree internal data structures. > Using the pointers from osm_node to osm_switch that Sasha has added > a while ago, it is enough to scan the OSM node_guid table only once > to create all the fat-tree internal nodes. > > Please apply to master only. > > -- Yevgeny > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied (to master only). 
-- Hal From halr at voltaire.com Mon May 14 06:13:05 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 14 May 2007 09:13:05 -0400 Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM/osm_ucast_ftree.c: Change HCA to CA in log messages Message-ID: <1179148361.1540.175012.camel@hal.voltaire.com> OpenSM/osm_ucast_ftree.c: Change HCA to CA in log messages Signed-off-by: Hal Rosenstock diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c index 3bad2fc..eb33e5a 100644 --- a/osm/opensm/osm_ucast_ftree.c +++ b/osm/opensm/osm_ucast_ftree.c @@ -850,7 +850,7 @@ __osm_ftree_hca_dump( osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_hca_dump: " - "HCA GUID: 0x%016" PRIx64 ", Ports: %u UP\n", + "CA GUID: 0x%016" PRIx64 ", Ports: %u UP\n", cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), p_hca->up_port_groups_num); @@ -1124,7 +1124,7 @@ __osm_ftree_fabric_dump(ftree_fabric_t * " |-------------------------------|\n\n"); osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, - "__osm_ftree_fabric_dump: -- HCAs:\n"); + "__osm_ftree_fabric_dump: -- CAs:\n"); for ( p_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); p_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl); @@ -1174,7 +1174,7 @@ __osm_ftree_fabric_dump_general_info( p_ftree->tree_rank); osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, "__osm_ftree_fabric_dump_general_info: " - " - Fabric has %u HCAs, %u switches\n", + " - Fabric has %u CAs, %u switches\n", cl_qmap_count(&p_ftree->hca_tbl), cl_qmap_count(&p_ftree->sw_tbl)); @@ -1886,7 +1886,7 @@ __osm_ftree_fabric_route_upgoing_by_goin p_min_port->remote_port_num); osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_upgoing_by_going_down: " - "Switch %s: set path to HCA LID 0x%x through port %u\n", + "Switch %s: set path to CA LID 0x%x through port %u\n", __osm_ftree_tuple_to_str(p_remote_sw->tuple), cl_ntoh16(target_lid), p_min_port->remote_port_num); @@ -2067,7 +2067,7 @@ __osm_ftree_fabric_route_downgoing_by_go { 
osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_downgoing_by_going_up: " - " - Routing MAIN path for %s HCA LID 0x%x: %s --> %s\n", + " - Routing MAIN path for %s CA LID 0x%x: %s --> %s\n", (is_real_lid)? "real" : "DUMMY", cl_ntoh16(target_lid), __osm_ftree_tuple_to_str(p_sw->tuple), @@ -2084,7 +2084,7 @@ __osm_ftree_fabric_route_downgoing_by_go p_min_port->remote_port_num); osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_downgoing_by_going_up: " - "Switch %s: set path to HCA LID 0x%x through port %u\n", + "Switch %s: set path to CA LID 0x%x through port %u\n", __osm_ftree_tuple_to_str(p_remote_sw->tuple), cl_ntoh16(target_lid),p_min_port->remote_port_num); @@ -2250,7 +2250,7 @@ __osm_ftree_fabric_route_to_hcas( p_port->port_num); osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_to_hcas: " - "Switch %s: set path to HCA LID 0x%x through port %u\n", + "Switch %s: set path to CA LID 0x%x through port %u\n", __osm_ftree_tuple_to_str(p_sw->tuple), cl_ntoh16(remote_lid), p_port->port_num); @@ -2279,7 +2279,7 @@ __osm_ftree_fabric_route_to_hcas( if (p_ftree->max_hcas_per_leaf > p_sw->down_port_groups_num) { osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: " - "Routing %u dummy HCAs\n", + "Routing %u dummy CAs\n", p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); for ( j = 0; ((int)j) < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); @@ -2503,7 +2503,7 @@ __osm_ftree_rank_switches_from_hca( /* HCA connected directly to another HCA - not FatTree */ osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, "__osm_ftree_rank_switches_from_hca: ERR AB0F: " - "HCA conected directly to another HCA: " + "CA conected directly to another CA: " "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n", cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node))); @@ -2540,8 +2540,8 @@ __osm_ftree_rank_switches_from_hca( 
osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_rank_switches_from_hca: " - "Marking rank of switch that is directly connected to HCA:\n" - " - HCA guid : 0x%016" PRIx64 "\n" + "Marking rank of switch that is directly connected to CA:\n" + " - CA guid : 0x%016" PRIx64 "\n" " - Switch guid: 0x%016" PRIx64 "\n" " - Switch LID : 0x%x\n", cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), @@ -2613,7 +2613,7 @@ __osm_ftree_fabric_construct_hca_ports( /* HCA connected directly to another HCA - not FatTree */ osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, "__osm_ftree_fabric_construct_hca_ports: ERR AB11: " - "HCA conected directly to another HCA: " + "CA conected directly to another CA: " "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n", cl_ntoh64(osm_node_get_node_guid(p_node)), cl_ntoh64(remote_node_guid)); @@ -2939,7 +2939,7 @@ __osm_ftree_construct_fabric( osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, "__osm_ftree_construct_fabric: " - "Populating FatTree Switch and HCA tables\n"); + "Populating FatTree Switch and CA tables\n"); if (__osm_ftree_fabric_populate_nodes(p_ftree) != 0) { osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, @@ -2952,7 +2952,7 @@ __osm_ftree_construct_fabric( if (cl_qmap_count(&p_ftree->hca_tbl) < 2) { osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, - "Fabric has %u HCAa - topology is not fat-tree.\n" + "Fabric has %u CAa - topology is not fat-tree.\n" "Falling back to default routing.\n", cl_qmap_count(&p_ftree->hca_tbl)); status = -1; @@ -2983,7 +2983,7 @@ __osm_ftree_construct_fabric( we want the ports to have pointers to ftree_{sw,hca}_t objects.*/ osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, "__osm_ftree_construct_fabric: " - "Populating HCA & switch ports\n"); + "Populating CA & switch ports\n"); if (__osm_ftree_fabric_populate_ports(p_ftree) != 0) { osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, @@ -3061,7 +3061,7 @@ __osm_ftree_do_routing( "Starting FatTree routing\n"); osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " 
- "Filling switch forwarding tables for routes to HCAs\n"); + "Filling switch forwarding tables for routes to CAs\n"); __osm_ftree_fabric_route_to_hcas(p_ftree); osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " From kliteyn at dev.mellanox.co.il Mon May 14 06:21:26 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 14 May 2007 16:21:26 +0300 Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM/osm_ucast_ftree.c: Change HCA to CA in log messages In-Reply-To: <1179148361.1540.175012.camel@hal.voltaire.com> References: <1179148361.1540.175012.camel@hal.voltaire.com> Message-ID: <46486256.6050806@dev.mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > OpenSM/osm_ucast_ftree.c: Change HCA to CA in log messages Sure, makes sense. --Yevgeny > Signed-off-by: Hal Rosenstock > > diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c > index 3bad2fc..eb33e5a 100644 > --- a/osm/opensm/osm_ucast_ftree.c > +++ b/osm/opensm/osm_ucast_ftree.c > @@ -850,7 +850,7 @@ __osm_ftree_hca_dump( > > osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > "__osm_ftree_hca_dump: " > - "HCA GUID: 0x%016" PRIx64 ", Ports: %u UP\n", > + "CA GUID: 0x%016" PRIx64 ", Ports: %u UP\n", > cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), > p_hca->up_port_groups_num); > > @@ -1124,7 +1124,7 @@ __osm_ftree_fabric_dump(ftree_fabric_t * > " |-------------------------------|\n\n"); > > osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > - "__osm_ftree_fabric_dump: -- HCAs:\n"); > + "__osm_ftree_fabric_dump: -- CAs:\n"); > > for ( p_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); > p_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl); > @@ -1174,7 +1174,7 @@ __osm_ftree_fabric_dump_general_info( > p_ftree->tree_rank); > osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, > "__osm_ftree_fabric_dump_general_info: " > - " - Fabric has %u HCAs, %u switches\n", > + " - Fabric has %u CAs, %u switches\n", > cl_qmap_count(&p_ftree->hca_tbl), > 
cl_qmap_count(&p_ftree->sw_tbl)); > > @@ -1886,7 +1886,7 @@ __osm_ftree_fabric_route_upgoing_by_goin > p_min_port->remote_port_num); > osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > "__osm_ftree_fabric_route_upgoing_by_going_down: " > - "Switch %s: set path to HCA LID 0x%x through port %u\n", > + "Switch %s: set path to CA LID 0x%x through port %u\n", > __osm_ftree_tuple_to_str(p_remote_sw->tuple), > cl_ntoh16(target_lid), > p_min_port->remote_port_num); > @@ -2067,7 +2067,7 @@ __osm_ftree_fabric_route_downgoing_by_go > { > osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > "__osm_ftree_fabric_route_downgoing_by_going_up: " > - " - Routing MAIN path for %s HCA LID 0x%x: %s --> %s\n", > + " - Routing MAIN path for %s CA LID 0x%x: %s --> %s\n", > (is_real_lid)? "real" : "DUMMY", > cl_ntoh16(target_lid), > __osm_ftree_tuple_to_str(p_sw->tuple), > @@ -2084,7 +2084,7 @@ __osm_ftree_fabric_route_downgoing_by_go > p_min_port->remote_port_num); > osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > "__osm_ftree_fabric_route_downgoing_by_going_up: " > - "Switch %s: set path to HCA LID 0x%x through port %u\n", > + "Switch %s: set path to CA LID 0x%x through port %u\n", > __osm_ftree_tuple_to_str(p_remote_sw->tuple), > cl_ntoh16(target_lid),p_min_port->remote_port_num); > > @@ -2250,7 +2250,7 @@ __osm_ftree_fabric_route_to_hcas( > p_port->port_num); > osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > "__osm_ftree_fabric_route_to_hcas: " > - "Switch %s: set path to HCA LID 0x%x through port %u\n", > + "Switch %s: set path to CA LID 0x%x through port %u\n", > __osm_ftree_tuple_to_str(p_sw->tuple), > cl_ntoh16(remote_lid), > p_port->port_num); > @@ -2279,7 +2279,7 @@ __osm_ftree_fabric_route_to_hcas( > if (p_ftree->max_hcas_per_leaf > p_sw->down_port_groups_num) > { > osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: " > - "Routing %u dummy HCAs\n", > + "Routing %u dummy CAs\n", > p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); > for ( j = 0; > ((int)j) 
< (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); > @@ -2503,7 +2503,7 @@ __osm_ftree_rank_switches_from_hca( > /* HCA connected directly to another HCA - not FatTree */ > osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, > "__osm_ftree_rank_switches_from_hca: ERR AB0F: " > - "HCA conected directly to another HCA: " > + "CA conected directly to another CA: " > "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n", > cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), > cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node))); > @@ -2540,8 +2540,8 @@ __osm_ftree_rank_switches_from_hca( > > osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > "__osm_ftree_rank_switches_from_hca: " > - "Marking rank of switch that is directly connected to HCA:\n" > - " - HCA guid : 0x%016" PRIx64 "\n" > + "Marking rank of switch that is directly connected to CA:\n" > + " - CA guid : 0x%016" PRIx64 "\n" > " - Switch guid: 0x%016" PRIx64 "\n" > " - Switch LID : 0x%x\n", > cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), > @@ -2613,7 +2613,7 @@ __osm_ftree_fabric_construct_hca_ports( > /* HCA connected directly to another HCA - not FatTree */ > osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, > "__osm_ftree_fabric_construct_hca_ports: ERR AB11: " > - "HCA conected directly to another HCA: " > + "CA conected directly to another CA: " > "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n", > cl_ntoh64(osm_node_get_node_guid(p_node)), > cl_ntoh64(remote_node_guid)); > @@ -2939,7 +2939,7 @@ __osm_ftree_construct_fabric( > > osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, > "__osm_ftree_construct_fabric: " > - "Populating FatTree Switch and HCA tables\n"); > + "Populating FatTree Switch and CA tables\n"); > if (__osm_ftree_fabric_populate_nodes(p_ftree) != 0) > { > osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, > @@ -2952,7 +2952,7 @@ __osm_ftree_construct_fabric( > if (cl_qmap_count(&p_ftree->hca_tbl) < 2) > { > osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, > - "Fabric has %u HCAa - topology is not fat-tree.\n" > + 
"Fabric has %u CAa - topology is not fat-tree.\n" > "Falling back to default routing.\n", > cl_qmap_count(&p_ftree->hca_tbl)); > status = -1; > @@ -2983,7 +2983,7 @@ __osm_ftree_construct_fabric( > we want the ports to have pointers to ftree_{sw,hca}_t objects.*/ > osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, > "__osm_ftree_construct_fabric: " > - "Populating HCA & switch ports\n"); > + "Populating CA & switch ports\n"); > if (__osm_ftree_fabric_populate_ports(p_ftree) != 0) > { > osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, > @@ -3061,7 +3061,7 @@ __osm_ftree_do_routing( > "Starting FatTree routing\n"); > > osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " > - "Filling switch forwarding tables for routes to HCAs\n"); > + "Filling switch forwarding tables for routes to CAs\n"); > __osm_ftree_fabric_route_to_hcas(p_ftree); > > osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ossrosch at linux.vnet.ibm.com Mon May 14 06:31:26 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Mon, 14 May 2007 15:31:26 +0200 Subject: [ofa-general] Re: [ewg] Re: Build problem with RHEL-4.5 and OFED-1.2 In-Reply-To: <200705092357.59973.ossrosch@linux.vnet.ibm.com> References: <200705091824.54394.ossrosch@linux.vnet.ibm.com> <1178737535.2848.152.camel@fc6.xsintricity.com> <200705092357.59973.ossrosch@linux.vnet.ibm.com> Message-ID: <200705141531.26635.ossrosch@linux.vnet.ibm.com> He Doug, are there any news for this problem? Is it a problem of the OFED-build or a problem with redhat? Should I open a bugzilla to track this? 
Regards Stefan On Wednesday 09 May 2007 23:57, Stefan Roscher wrote: > On Wednesday 09 May 2007 21:05, Doug Ledford wrote: > > On Wed, 2007-05-09 at 18:24 +0200, Stefan Roscher wrote: > > > Hi Doug, > > > > > > I installed RHEL-4.5 on one of our ppc64 systems and recognized that asm-ppc > > > directory is missing in /usr/src/kernels/2.6.9-55.EL/include. > > > Normally I don't need this directory, but ibmebus.h includes > > > asm-ppc64/of_device.h. And there asm-ppc64/of_device.h includes > > > asm-ppc/of_device.h. Because this file is missing I can not build > > > ehca and ofed stack with ofed-1.2 daily build from today. > > > > > > Did I make something wrong during installation? > > > > > > Regards Stefan Roscher > > > > I'll look into it, but in the meantime, install the kernel src.rpm, go > > into /usr/src/redhat/SPEC and run rpmbuild --bp kernel-2.6.spec and it > > should create a complete source tree > > in /usr/src/redhat/BUILD/kernel-2.6.18 that you can then get the asm-ppc > > directory contents out of. > > > > -- > > Doug Ledford > > GPG KeyID: CFBFF194 > > http://people.redhat.com/dledford > > > > Infiniband specific RPMs available at > > http://people.redhat.com/dledford/Infiniband > > > To create the backportpatches for rhel4.5 I did it like you say, but the > buildscripts of ofed dont uses the kernelsources in > /usr/src/redhat/BUILD. OFED-1.2 use the source link within > /lib/modules/kernel-x.x.x and this points into /usr/src/kernel this > kernelsources were created during installation of rhel-4.5. In this kernel > source the directory include/asm-ppc is missing. > This is the reason why I found this problem not during creation of the > backport patches. 
> > regards stefan > From halr at voltaire.com Mon May 14 06:30:52 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 14 May 2007 09:30:52 -0400 Subject: [ofa-general] Re: [query] addressing the the IB switches using LID. In-Reply-To: <374012.81252.qm@web8316.mail.in.yahoo.com> References: <374012.81252.qm@web8316.mail.in.yahoo.com> Message-ID: <1179149388.1540.175937.camel@hal.voltaire.com> On Mon, 2007-05-14 at 08:35, Keshetti Mahesh wrote: > This is purely SM policy. IBA does not dictate this and leaves it up to > the SM in question as to whether it uses LID routing or direct routing > to "talk" with nodes (including switches). Clearly, initialization > requires direct routing. > What is the policy of the current subnet manager implementation, > i.e. OpenSM? OpenSM currently uses DR, except when polling standby SMs, but relying on this is error-prone and not compliant with IBA. -- Hal > -Mahesh From halr at voltaire.com Mon May 14 07:02:47 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 14 May 2007 10:02:47 -0400 Subject: [ofa-general] Re: [PATCH] osm: fat-tree optimization - improved ranking In-Reply-To: <46485DE7.1050506@dev.mellanox.co.il> References: <46485DE7.1050506@dev.mellanox.co.il> Message-ID: <1179151362.1540.177796.camel@hal.voltaire.com> Hi Yevgeny, On Mon, 2007-05-14 at 09:02, Yevgeny Kliteynik wrote: > Hi Hal, > > This patch optimizes fabric ranking. > All the leaf switches are marked with rank and added to the BFS list, > and only then ranking of rest of the fabric begins. 
> > I actually thought that this is the way I've originally > implemented it, as I mentioned in the patch that was dealing > with 8 and 16 bit integers :) > > Similar optimization may be applicable to up/dn routing - the roots > should be marked with rank 0 and only then ranking of rest of the > switches should begin, but frankly, it practically doesn't reduce > the routing time, because ranking is only a small fraction of the > routing runtime (I checked it on a 4K+ subnet). It's still worth doing IMO. Can you look into this for up/down ? > In case of fat-tree I'm going to need it anyway when I enhance > the routing to consider only subset of HCAs in the routing balancing > (compute nodes vs. management nodes). > > Please apply to master. > > -- Yevgeny > > Signed-off-by: Yevgeny Kliteynik > > >From dfa455f86d9ac48ff5cefd38a009718e5aeab1f9 Mon Sep 17 00:00:00 2001 > From: Yevgeny Kliteynik > Date: Mon, 14 May 2007 15:45:00 +0300 > Subject: [PATCH] DELETE AFTER UPDATE: ranking > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied (to master only). -- Hal From kliteyn at dev.mellanox.co.il Mon May 14 07:07:26 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 14 May 2007 17:07:26 +0300 Subject: [ofa-general] Error message in OSM log when cached op file doesn't exist Message-ID: <46486D1E.6010408@dev.mellanox.co.il> Hi Hal. 
[snip] > Date: 03/30/2007 12:24:12 AM > OpenSM: Handle conf file open failures better > > diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c > index 46315a5..746fbd1 100644 > --- a/osm/opensm/osm_subnet.c > +++ b/osm/opensm/osm_subnet.c > @@ -732,7 +732,7 @@ subn_dump_qos_options( > > /********************************************************************** > **********************************************************************/ > -void > +ib_api_status_t > osm_subn_rescan_conf_file( > IN osm_subn_opt_t* const p_opts ) > { > @@ -751,7 +751,7 @@ osm_subn_rescan_conf_file( > > opts_file = fopen(file_name, "r"); > if (!opts_file) > - return; > + return IB_ERROR; [/snip] This patch was applied a month and a half ago (master). It handles opening the cached options file and prints an error message when OSM fails to open it. I don't like this behavior, because now every time you run OpenSM on a machine that has no cached options file (which is usually the case), you get an error message. There's no point in checking whether the file exists, because osm runs as root, and if it fails to open this file, it means that the file doesn't exist or is inaccessible (broken mount, etc). In any case, the user gets info in stdout on whether or not OpenSM is using a cached options file. Do you agree? Should I issue a patch? -- Yevgeny From kliteyn at dev.mellanox.co.il Mon May 14 07:08:37 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 14 May 2007 17:08:37 +0300 Subject: [ofa-general] Re: [PATCH] osm: fat-tree optimization - improved ranking In-Reply-To: <1179151362.1540.177796.camel@hal.voltaire.com> References: <46485DE7.1050506@dev.mellanox.co.il> <1179151362.1540.177796.camel@hal.voltaire.com> Message-ID: <46486D65.5000304@dev.mellanox.co.il> Hal Rosenstock wrote: > Hi Yevgeny, > > On Mon, 2007-05-14 at 09:02, Yevgeny Kliteynik wrote: >> Hi Hal, >> >> This patch optimizes fabric ranking. 
>> All the leaf switches are marked with rank and added to the BFS list, >> and only then ranking of rest of the fabric begins. >> >> I actually thought that this is the way I've originally >> implemented it, as I mentioned in the patch that was dealing >> with 8 and 16 bit integers :) >> >> Similar optimization may be applicable to up/dn routing - the roots >> should be marked with rank 0 and only then ranking of rest of the >> switches should begin, but frankly, it practically doesn't reduce >> the routing time, because ranking is only a small fraction of the >> routing runtime (I checked it on a 4K+ subnet). > > It's still worth doing IMO. Can you look into this for up/down ? Sure. --Yevgeny > >> In case of fat-tree I'm going to need it anyway when I enhance >> the routing to consider only subset of HCAs in the routing balancing >> (compute nodes vs. management nodes). >> >> Please apply to master. >> >> -- Yevgeny >> >> Signed-off-by: Yevgeny Kliteynik >> >> >From dfa455f86d9ac48ff5cefd38a009718e5aeab1f9 Mon Sep 17 00:00:00 2001 >> From: Yevgeny Kliteynik >> Date: Mon, 14 May 2007 15:45:00 +0300 >> Subject: [PATCH] DELETE AFTER UPDATE: ranking >> >> Signed-off-by: Yevgeny Kliteynik > > Thanks. Applied (to master only). > > -- Hal > > From mst at dev.mellanox.co.il Mon May 14 07:14:50 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 May 2007 17:14:50 +0300 Subject: [ofa-general] [PATCH] IB/mthca: fix cq cleanup Message-ID: <20070514141450.GB7989@mellanox.co.il> mthca_cq_clean updated CQ consumer index without moving CQEs to HW ownership. As a result, the same WRID might get reported twice, resulting in use-after-free. This was observed in IPoIB CM. Fix by moving all freed CQEs to HW ownership. This fixes this bug: https://bugs.openfabrics.org/show_bug.cgi?id=617 Signed-off-by: Michael S. 
Tsirkin --- Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2007-05-14 14:22:58.000000000 +0300 +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_cq.c 2007-05-14 14:42:05.000000000 +0300 @@ -284,7 +284,7 @@ void mthca_cq_clean(struct mthca_dev *de { struct mthca_cqe *cqe; u32 prod_index; - int nfreed = 0; + int i, nfreed = 0; spin_lock_irq(&cq->lock); @@ -321,6 +321,8 @@ void mthca_cq_clean(struct mthca_dev *de } if (nfreed) { + for (i = 0; i < nfreed; ++i) + set_cqe_hw(get_cqe(cq, (cq->cons_index + i) & cq->ibcq.cqe)); wmb(); cq->cons_index += nfreed; update_cons_index(dev, cq, nfreed); -- MST From mst at dev.mellanox.co.il Mon May 14 07:15:14 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 May 2007 17:15:14 +0300 Subject: [ofa-general] [PATCH] libmthca: fix cq cleanup Message-ID: <20070514141514.GC7989@mellanox.co.il> mthca_cq_clean updated CQ consumer index without moving CQEs to HW ownership. As a result, the same WRID might get reported twice, resulting in use-after-free. This was observed in IPoIB CM. Fix by moving all freed CQEs to HW ownership. This fixes this bug: https://bugs.openfabrics.org/show_bug.cgi?id=617 Signed-off-by: Michael S. 
Tsirkin --- diff --git a/src/cq.c b/src/cq.c index 0aeb7a9..ecaed9c 100644 --- a/src/cq.c +++ b/src/cq.c @@ -550,7 +546,7 @@ void mthca_cq_clean(struct mthca_cq *cq, uint32_t qpn, struct mthca_srq *srq) { struct mthca_cqe *cqe; uint32_t prod_index; - int nfreed = 0; + int i, nfreed = 0; pthread_spin_lock(&cq->lock); @@ -584,6 +580,8 @@ void mthca_cq_clean(struct mthca_cq *cq, uint32_t qpn, struct mthca_srq *srq) } if (nfreed) { + for (i = 0; i < nfreed; ++i) + set_cqe_hw(get_cqe(cq, (cq->cons_index + i) & cq->ibv_cq.cqe)); wmb(); cq->cons_index += nfreed; update_cons_index(cq, nfreed); -- MST From mst at dev.mellanox.co.il Mon May 14 07:22:33 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 May 2007 17:22:33 +0300 Subject: [ofa-general] weird kconfig output Message-ID: <20070514142233.GD7989@mellanox.co.il> Doing make oldconfig on 2.6.22-rc1 (.config came from 2.6.21) gave me this prompt, among the list of 10G/s adapters: Verbose debugging output (MLX4_DEBUG) [Y/n/?] (NEW) Shouldn't I get prompted for mlx4 eth first? -- MST From halr at voltaire.com Mon May 14 07:21:53 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 14 May 2007 10:21:53 -0400 Subject: [ofa-general] Re: Error message in OSM log when cached op file doesn't exist In-Reply-To: <46486D1E.6010408@dev.mellanox.co.il> References: <46486D1E.6010408@dev.mellanox.co.il> Message-ID: <1179152459.1540.178811.camel@hal.voltaire.com> Hi Yevgeny, On Mon, 2007-05-14 at 10:07, Yevgeny Kliteynik wrote: > Hi Hal. 
> > [snip] > > Date: 03/30/2007 12:24:12 AM > > OpenSM: Handle conf file open failures better > > > > diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c > > index 46315a5..746fbd1 100644 > > --- a/osm/opensm/osm_subnet.c > > +++ b/osm/opensm/osm_subnet.c > > @@ -732,7 +732,7 @@ subn_dump_qos_options( > > > > /********************************************************************** > > **********************************************************************/ > > -void > > +ib_api_status_t > > osm_subn_rescan_conf_file( > > IN osm_subn_opt_t* const p_opts ) > > { > > @@ -751,7 +751,7 @@ osm_subn_rescan_conf_file( > > > > opts_file = fopen(file_name, "r"); > > if (!opts_file) > > - return; > > + return IB_ERROR; > [/snip] > > This patch was applied a month and a half ago (master). > It handles opening cached options file, and prints error messages > when OSM failed opening such file. > > I actually don't like this thing, because now every time you run > OpenSM on the machine that doesn't have any cached options file > (which is usually the case) you get an error message. Perhaps error is too severe as one can run just fine without this file and there is no requirement to have it. Should it be some other type of message instead ? > There's no point checking whether the file exists, because osm runs > as root, and if it fails opening this file, it means that the file > doesn't exist or is inaccessible (broken mount, etc). That's the most common use case (running OpenSM as root, but not the only one). > In any case, user gets info in stdout whether or now OpenSM is using > cached options file. Is there always a message in the log as well indicating this ? -- Hal > Do you agree? Should I issue a patch? 
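(Editorial sketch, not part of the thread.) One way to act on the question raised above, whether a missing cached options file deserves an error at all, is to separate "file is absent" from "file exists but could not be opened". The sketch below is a hedged illustration only: toy_conf_status_t, toy_open_conf and toy_conf_status_str are hypothetical names, not OpenSM functions, and the code assumes a POSIX system where fopen() sets errno to ENOENT for a nonexistent path.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical result codes; OpenSM itself returns ib_api_status_t. */
typedef enum {
    TOY_CONF_LOADED,  /* file opened; caller parses and closes it */
    TOY_CONF_ABSENT,  /* no file: normal case, log at info/verbose level */
    TOY_CONF_FAILED   /* file may exist but is inaccessible: log an error */
} toy_conf_status_t;

/*
 * Try to open the cached options file and classify the outcome.
 * Assumption: POSIX fopen() semantics for errno (ENOENT on a missing path).
 */
static toy_conf_status_t toy_open_conf(const char *file_name, FILE **p_file)
{
    assert(file_name != NULL && p_file != NULL);

    errno = 0;
    *p_file = fopen(file_name, "r");
    if (*p_file != NULL)
        return TOY_CONF_LOADED;

    return (errno == ENOENT) ? TOY_CONF_ABSENT : TOY_CONF_FAILED;
}

/* Text for a log line; only TOY_CONF_FAILED would warrant error severity. */
static const char *toy_conf_status_str(toy_conf_status_t status)
{
    switch (status) {
    case TOY_CONF_LOADED: return "using cached options file";
    case TOY_CONF_ABSENT: return "no cached options file, using defaults";
    default:              return "cached options file is inaccessible";
    }
}
```

With a split like this, the common "no file" case can be reported at a lower severity while a broken mount or a permission problem still shows up as an error, which also covers runs where OpenSM is not started as root.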
> > -- Yevgeny From tziporet at dev.mellanox.co.il Mon May 14 08:06:16 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 14 May 2007 18:06:16 +0300 Subject: [ofa-general] OFED 1.2 rc3 release In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com> Message-ID: <46487AE8.1020005@mellanox.co.il> Hi, OFED 1.2-RC3 is available on _http://www.openfabrics.org/builds/ofed-1.2/_ File: OFED-1.2-rc3.tgz To get BUILD_ID run ofed_info Please report any issues in bugzilla _https://bugs.openfabrics.org/_ *RC4 due date is May 21* Tziporet & Vlad ==================================================================================== *Release information:* *OS support: * Novell: - SLES 9.0 SP3 - SLES10 (and SP1 RC2 partially tested) Redhat: - Redhat EL4 up3, up4 and up5 - Redhat EL5 kernel.org: - 2.6.20 - 2.6.19 Note: Fedora C6 and SuSE Pro 10 are not part of the official list. We keep the backport patches for these OSes and make sure OFED compile and loaded properly but will not do full QA cycle. *Systems: * * x86_64 * x86 * ia64 * ppc64 *Main changes from OFED-1.1-rc2:* 1. Fixed 49 bugs (see attachment for all bugs fixed) 2. Replace Open MPI to version 1.2.1 3. Added support for RHEL 4 up5 4. 
Updated documents (but not yet completed) *Major limitations and known issues: * 567 blocker rolandd at cisco.com MPI does not work on RHEL5 ppc64 420 critical monil at voltaire.com PKey table reordering caused by SM failover stops ipoib traffic 607 critical jsquyres at cisco.com remove the hack to save the port number in the ia hca_address 608 critical monis at voltaire.com traffic fails to resume after SM failover with bonding interfaces 611 critical swise at opengridcomputing.com cxgb3: passive side connection transition from streaming to RDMA is broken 577 critical rolandd at cisco.com SRP multipath failover too slow (minutes, not seconds) 465 critical mst at mellanox.co.il IPoIB HA fails after several hours of failovers 549 critical amip at dev.mellanox.co.il SDP Policy need to be consistent 604 critical mst at mellanox.co.il Oops running UDP traffic with IPoIB CM 605 major sean.hefty at intel.com kernel oops in rdma_cm during module unload 614 major halr at voltaire.com All of the CM definitions should be removed from ib_types.h 534 major vlad at mellanox.co.il SLES9 - Installer fails on declarations - OFED 1.2-20070409 530 major dannyz at mellanox.co.il ibdiagnet -r fails on RHEL5 i686 538 major monis at voltaire.com integrate IPoIB bonding with IPoIB CM 541 major mst at mellanox.co.il slow failover with IPoIB CM bonding/ipoibtools HA 558 major rolandd at cisco.com tvflash configure fails on SLES10 SP1 RC2 See bugzilla for all open issues. Tasks that should be completed for RC4 (due date is 21-May): 1. Support SLES10 SP1 RC1 3. Fix all blocker, critical and major bugs 4. Prepare all documentation (release notes, README, etc.) -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: fixed_bugs-rc3.csv URL: From sean.hefty at intel.com Mon May 14 08:11:53 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 14 May 2007 08:11:53 -0700 Subject: [ofa-general] RE: [Query] ib add path record cache In-Reply-To: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> Message-ID: <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com> >This can be treated as a facility similar to what we have in ARP table >for TCP/IP. Secondly this will help in debugging of some new up-coming >partially infiniband complaint hardware. But unless such a path actually exists to the remote node, I don't see that it's useful. And if such a path exists, I would expect it to be returned by the SA. Can you clarify its use more wrt the subnet in general? >yes, I want them to remain in the DB, my idea is similar to the hard >coding of ARP table entries in TCP/IP. >How do you see this can be achieved? A simple flag or setting the update counter on the added path to the maximum should be sufficient. - Sean From philippe.gregoire at cea.fr Mon May 14 08:20:06 2007 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Mon, 14 May 2007 17:20:06 +0200 Subject: [ofa-general] suggested patch for partition membership definitiion in osm-partitions.conf Message-ID: <46487E26.4040501@cea.fr> Hi Hal, the way to define in osm-partitions.conf file partition membership for port guids is quite very verbose, specially when you have a lot of (full member) ports. Here is a patch to allow a more compact partition membership definition. It allows definition of a default membership partition for the port guid list. The old syntax is still usable. 
old way G1 = 0x01 : 0x123=full, 0x124=full, 0x0x125=full, 0x126=full, 0x127=full ; G1 = 0x01 : 0x128=full, 0x129=full, 0x567, 0x569=full new way : G1 = 0x01 , defmember=full : 0x123, 0x124, 0x125, 0x126, 0x127 ; G1 = 0x01 , defmember=full : 0x128, 0x129, 0x567=limited, 0x569 I changed also the opensm man page as some lines (arround limited/full membership) are not well formatted. This patch has been compiled and tested on our cluster with the following osm-partitions.conf : G1 = 0x0001 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9, 0x0005ad00000168ad, 0x0005ad0000000cb7=limited, 0x0008f10403962eb1 ; G2 = 0x0002 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ; G3 = 0x0003 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9, 0x0008f10403962eb1 ; G5 = 0x0005 , defmember=full : 0x0008f10403962eb1 ; G10 = 0x0010 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ; G70 = 0x0070 , defmember=full : 0x0005ad00000165f1 ; G80 = 0x0080 , defmember=full : 0x0005ad00000165f1; G80 = 0x0080 : 0x0005ad00000168ad; G80 = 0x0080 , defmember=full : 0x0005ad0000016cb9; G80 = 0x0080 , defmember=limited : 0x0005ad0000000cb7, 0x0008f10403962eb1; Philippe -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: defmember.patch URL: From philippe.gregoire at cea.fr Mon May 14 08:26:55 2007 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Mon, 14 May 2007 17:26:55 +0200 Subject: [ofa-general] suggested patch for partition membership definitiion in osm-partitions.conf (fix) Message-ID: <46487FBF.7020300@cea.fr> This time , with the definitive patch (sorry) Hi Hal, the way to define in osm-partitions.conf file partition membership for port guids is quite very verbose, specially when you have a lot of (full member) ports. Here is a patch to allow a more compact partition membership definition. It allows definition of a default membership partition for the port guid list. The old syntax is still usable. 
old way
G1 = 0x01 : 0x123=full, 0x124=full, 0x125=full, 0x126=full, 0x127=full ;
G1 = 0x01 : 0x128=full, 0x129=full, 0x567, 0x569=full

new way :
G1 = 0x01 , defmember=full : 0x123, 0x124, 0x125, 0x126, 0x127 ;
G1 = 0x01 , defmember=full : 0x128, 0x129, 0x567=limited, 0x569

I also changed the opensm man page, as some lines (around limited/full
membership) are not well formatted.

This patch has been compiled and tested on our cluster with the
following osm-partitions.conf :
G1 = 0x0001 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9,
0x0005ad00000168ad, 0x0005ad0000000cb7=limited, 0x0008f10403962eb1 ;
G2 = 0x0002 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ;
G3 = 0x0003 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9,
0x0008f10403962eb1 ;
G5 = 0x0005 , defmember=full : 0x0008f10403962eb1 ;
G10 = 0x0010 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ;
G70 = 0x0070 , defmember=full : 0x0005ad00000165f1 ;
G80 = 0x0080 , defmember=full : 0x0005ad00000165f1;
G80 = 0x0080 : 0x0005ad00000168ad;
G80 = 0x0080 , defmember=full : 0x0005ad0000016cb9;
G80 = 0x0080 , defmember=limited : 0x0005ad0000000cb7, 0x0008f10403962eb1;

Philippe
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: defmember.patch URL: From monil at voltaire.com Mon May 14 08:44:02 2007 From: monil at voltaire.com (Moni Levy) Date: Mon, 14 May 2007 17:44:02 +0200 Subject: [ofa-general] Re: [ewg] OFED 1.2 rc3 release In-Reply-To: <46487AE8.1020005@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com> <46487AE8.1020005@mellanox.co.il> Message-ID: <6a122cc00705140844k7f65a988x8746c4ea8474b920@mail.gmail.com> On 5/14/07, Tziporet Koren wrote: > > *Major limitations and known issues: > * > 567 blocker rolandd at cisco.com MPI does not work on RHEL5 ppc64 420 > critical monil at voltaire.com PKey table reordering caused by SM failover > stops ipoib traffic > Tziporet, bug #420 was fixed and bugzilla was updated this morning Moni > 607 critical jsquyres at cisco.com remove the hack to save the port number > in the ia hca_address 608 critical monis at voltaire.com traffic fails to > resume after SM failover with bonding interfaces 611 critical > swise at opengridcomputing.com cxgb3: passive side connection transition from > streaming to RDMA is broken 577 critical rolandd at cisco.com SRP multipath > failover too slow (minutes, not seconds) 465 critical mst at mellanox.co.il IPoIB > HA fails after several hours of failovers 549 critical > amip at dev.mellanox.co.il SDP Policy need to be consistent 604 critical > mst at mellanox.co.il Oops running UDP traffic with IPoIB CM 605 major > sean.hefty at intel.com kernel oops in rdma_cm during module unload 614 major > halr at voltaire.com All of the CM definitions should be removed from > ib_types.h 534 major vlad at mellanox.co.il SLES9 - Installer fails on > declarations - OFED 1.2-20070409 530 major dannyz at mellanox.co.il ibdiagnet > -r fails on RHEL5 i686 538 major monis at voltaire.com integrate IPoIB > bonding with IPoIB CM 541 major mst at mellanox.co.il slow failover with > IPoIB CM bonding/ipoibtools HA 558 major rolandd at 
cisco.com tvflash > configure fails on SLES10 SP1 RC2 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Mon May 14 08:51:22 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 14 May 2007 11:51:22 -0400 Subject: [ofa-general] Re: suggested patch for partition membership definitiion in osm-partitions.conf (fix) In-Reply-To: <46487FBF.7020300@cea.fr> References: <46487FBF.7020300@cea.fr> Message-ID: <1179157835.1540.183713.camel@hal.voltaire.com> Hi Philippe, On Mon, 2007-05-14 at 11:26, Philippe Gregoire wrote: > This time , with the definitive patch (sorry) Can you resubmit this with a S-O-B line ? > Hi Hal, > the way to define in osm-partitions.conf file partition membership for > port guids is quite very verbose, > specially when you have a lot of (full member) ports. or lots of limited members, either way. This is an improvement in the allowed syntax. > Here is a patch to allow a more compact partition membership definition. > It allows definition of a default > membership partition for the port guid list. The old syntax is still usable. > old way > G1 = 0x01 : 0x123=full, 0x124=full, 0x0x125=full, 0x126=full, 0x127=full ; > G1 = 0x01 : 0x128=full, 0x129=full, 0x567, 0x569=full > > new way : > G1 = 0x01 , defmember=full : 0x123, 0x124, 0x125, 0x126, 0x127 ; > G1 = 0x01 , defmember=full : 0x128, 0x129, 0x567=limited, 0x569 > > I changed also the opensm man page as some lines (arround limited/full > membership) are not well formatted. Can you break this piece into 2 parts: fix formatting, and then add defmember ? 
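For readers following the syntax change quoted above, here is a minimal sketch of how a per-partition default and a per-GUID suffix might combine. The helper name resolve_membership() is hypothetical and is not part of OpenSM; it also uses exact string matching as a simplification, whereas the patch in this thread uses strncmp() and so accepts abbreviations.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not OpenSM code: decide whether one port GUID
 * entry is a full member of the partition.
 *
 * defmember_full: partition-wide default from "defmember=full|limited"
 *                 (false when defmember is absent, i.e. limited).
 * flag:           optional per-GUID suffix ("full" or "limited"),
 *                 or NULL when the GUID is listed without a suffix.
 */
static bool resolve_membership(bool defmember_full, const char *flag)
{
	if (flag == NULL || *flag == '\0')
		return defmember_full;	/* no suffix: use the default */
	if (strcmp(flag, "full") == 0)
		return true;		/* explicit override to full */
	return false;			/* "limited" or unrecognized */
}
```

With defmember=full, an entry such as 0x567=limited resolves to limited while a bare 0x569 resolves to full, which matches the "new way" example.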
> This patch has been compiled and tested on our cluster with the > following osm-partitions.conf : > G1 = 0x0001 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9, > 0x0005ad00000168ad, 0x0005ad0000000cb7=limited, 0x0008f10403962eb1 ; > G2 = 0x0002 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ; > G3 = 0x0003 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9, > 0x0008f10403962eb1 ; > G5 = 0x0005 , defmember=full : 0x0008f10403962eb1 ; > G10 = 0x0010 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ; > G70 = 0x0070 , defmember=full : 0x0005ad00000165f1 ; > G80 = 0x0080 , defmember=full : 0x0005ad00000165f1; > G80 = 0x0080 : 0x0005ad00000168ad; > G80 = 0x0080 , defmember=full : 0x0005ad0000016cb9; > G80 = 0x0080 , defmember=limited : 0x0005ad0000000cb7, 0x0008f10403962eb1; Thanks. -- Hal > Philippe > > > > ______________________________________________________________________ > > --- opensm/osm_prtn_config.old.c 2007-04-18 11:54:29.000000000 +0200 > +++ opensm/osm_prtn_config.c 2007-05-14 17:14:42.228813361 +0200 > @@ -70,6 +70,7 @@ > osm_subn_t *p_subn; > osm_prtn_t *p_prtn; > unsigned is_ipoib, mtu, rate, sl, scope; > + boolean_t full; > }; > > extern osm_prtn_t *osm_prtn_make_new(osm_log_t *p_log, osm_subn_t *p_subn, > @@ -163,6 +164,14 @@ > " - skipped\n", lineno); > else > conf->sl = sl; > + } else if (!strncmp(flag, "defmember", len)) { > + if (!val || (strcmp(val, "limited") && strcmp(val, "full"))) > + osm_log(conf->p_log, OSM_LOG_VERBOSE, > + "PARSE WARN: line %d: " > + "flag \'defmember\' requires valid value (limited or full)" > + " - skipped\n", lineno); > + else > + conf->full = strcmp(val, "full") == 0; > } else { > osm_log(conf->p_log, OSM_LOG_VERBOSE, > "PARSE WARN: line %d: " > @@ -177,12 +186,14 @@ > { > osm_prtn_t *p = conf->p_prtn; > ib_net64_t guid; > - boolean_t full = FALSE; > + boolean_t full = conf->full; > > if (!name || !*name || !strncmp(name, "NONE", strlen(name))) > return 0; > > if (flag) { > + /* 
reset default membership to limited */ > + full = FALSE; > if (!strncmp(flag, "full", strlen(flag))) > full = TRUE; > else if (strncmp(flag, "limited", strlen(flag))) { > @@ -275,6 +286,7 @@ > conf->p_prtn = NULL; > conf->is_ipoib = 0; > conf->sl = OSM_DEFAULT_SL; > + conf->full = FALSE; > return conf; > } > > --- man/opensm.8.old 2007-04-18 11:54:29.000000000 +0200 > +++ man/opensm.8 2007-05-14 16:19:11.747555126 +0200 > @@ -291,13 +291,15 @@ > > Partition Definition: > > -[PartitionName][=PKey][,flag[=value]] > +[PartitionName][=PKey][,flag[=value]][,defmember=full|limited] > > PartitionName - string, will be used with logging. When omitted > empty string will be used. > PKey - P_Key value for this partition. Only low 15 bits will > be used. When omitted will be autogenerated. > flag - used to indicate IPoIB capability of this partition. > + defmember=full|limited - specifies default membership for port guid. > + Default is limited. > > Currently recognized flags are: > > @@ -317,10 +319,10 @@ > > PortGUIDs list: > > -PortGUID - GUID of partition member EndPort. Hexadecimal numbers > - should start from 0x, decimal numbers are accepted too. > -full or - indicates full or limited membership for this port. When > - limited omitted (or unrecognized) limited membership is assumed. > + PortGUID - GUID of partition member EndPort. Hexadecimal numbers > + should start from 0x, decimal numbers are accepted too. > + full or limited - indicates full or limited membership for this port. > + When omitted (or unrecognized) default (defmember) membership is assumed. 
> > There are two useful keywords for PortGUID definition: > > @@ -346,12 +348,20 @@ > > Examples: > > -Default=0x7fff : ALL, SELF=full ; > + Default=0x7fff : ALL, SELF=full ; > > -NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ; > + NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ; > > -YetAnotherOne = 0x300 : SELF=full ; > -YetAnotherOne = 0x300 : ALL=limited ; > + YetAnotherOne = 0x300 : SELF=full ; > + YetAnotherOne = 0x300 : ALL=limited ; > + > + ShareIO = 0x80 , defmember=full : 0x123451, 0x123452; > + # 0x123453, 0x123454 will be limited > + ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full; > + # 0x123456, 0x123457 will be limited > + ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full; > + ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a; > + ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d; > > Note: > From weiny2 at llnl.gov Mon May 14 09:55:53 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 14 May 2007 09:55:53 -0700 Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager In-Reply-To: <20070513195539.GH29746@sashak.voltaire.com> References: <20070508184938.311b1c8f.weiny2@llnl.gov> <20070513195539.GH29746@sashak.voltaire.com> Message-ID: <20070514095553.3ec3bec7.weiny2@llnl.gov> On Sun, 13 May 2007 22:55:39 +0300 Sasha Khapyorsky wrote: > Hi Ira, > > Thanks for the great work! > > On 18:49 Tue 08 May , Ira Weiny wrote: > > I would like to submit to the list a performance manager which I have been > > working on for OpenSM. > > > > It is implemented as the first proposed architecture model set forth by Hal (As > > an integrated thread to OpenSM.) As such it works fine on our small test > > cluster but there is some concern about its scalability. > > > > I have extended this architecture with an idea of my own. This idea is to have > > a plug-able module for the "event database". 
With this interface one could > > write their own Data reduction, logging, and tracking methods. Here at LLNL I > > propose to use this to add counter and subnet events directly to our management > > database which is used to show system status to our operators. Other > > installations might prefer other methods of logging, SNMP for example. This > > patch includes a "reference" implementation of this "event database" which > > stores the information internally until the user requests a "dump". > > I like this event db idea, but not sure this should not be integral part > of the low level perfmgr stuff - as it is currently implemented without > such plugin loaded PerfMgr just doesn't work - this unconditionally tries > to pull all ports counters, but has nothing to do with it without plugin. > > Instead I would purpose to have a builtin PerfMgr which will be able to > pull and store performance related data and then to call "generic" event > manager which can process such data. This also will help to have simpler > generic API for such event db plugin so other parts of OpenSM will be > able to report events using same method(s). What do you think? This is a good idea. I will think about how to make it work. > > + > > +/** > > + * group port counters for ports into the nodes > > + */ > > +typedef struct _osm_pc_node { > > + cl_map_item_t map_item; /* must be first */ > > + uint64_t node_guid; > > + osm_event_pc_t *ports; > > + uint8_t num_ports; > > +} osm_pc_node_t; > > Is it really needed to keep osm_pc_node_t nodes in separate db (qmap)? > Why not to reuse already existed maps in osm_subn_t (we could add > 'void *pm_data' or so field to osm_physp_t structure)? > I did not want to complicate the SM data structures. Also these structures were part of the plugin. This reference plugin used the compatibility lib qmap structures to store the data. But other plugins may use SQL or other data stores. 
I think I agree with Hal that we should keep these data structures
separate from the SM structures.

> > +
> > +/****s* OpenSM: PERFMGR/osm_perfmgr_state_t */
> > +typedef enum
> > +{
> > +	PERFMGR_STATE_DISABLE,
> > +	PERFMGR_STATE_ENABLED,
> > +	PERFMGR_STATE_NO_DB
>
> Why is PERFMGR_STATE_NO_DB needed? Isn't it duplicated by
> (pm->db == NULL)?
>
> As side effect of this duplication - now when DB was not found I can
> enable perfmgr with console command, but it obviously crashes during
> the following 'dump'.

Ah I did not catch that. If we separate out the plugin to be generic with a
perfmgr internal store, this will go away. I did add checks for NULL DB
functions so that plugins could decide to not receive some types of data, but
this only makes sense with the refactoring I did on the DB interface.

>
> > +} osm_perfmgr_state_t;
> > +
> > +/****s* OpenSM: PERFMGR/osm_perfmgr_t
> > +* This object should be treated as opaque and should
> > +* be manipulated only through the provided functions.
> > +*/
> > +typedef struct _osm_perfmgr
> > +{
> > +	osm_thread_state_t thread_state;
> > +	cl_event_t sig_sweep;
> > +	cl_thread_t sweeper;
> > +	osm_subn_t *subn;
> > +	osm_sm_t *sm;
> > +	cl_plock_t *lock;
> > +	osm_log_t *log;
> > +	osm_mad_pool_t *mad_pool;
> > +	atomic32_t trans_id;
>
> Do we need separate transaction id generator for PerfMgr?

Probably not here but if we separate out the perfmgr we might.

> > +
> > +/****f* OpenSM: PERFMGR/osm_perfmgr_init */
> > +ib_api_status_t
> > +osm_perfmgr_init(
> > +	osm_perfmgr_t* const perfmgr,
> > +	osm_subn_t* const subn,
> > +	osm_sm_t * const sm,
> > +	osm_log_t* const log,
> > +	osm_mad_pool_t * const mad_pool,
> > +	osm_vendor_t * const vendor,
> > +	cl_dispatcher_t* const disp,
> > +	cl_plock_t* const lock,
> > +	const osm_subn_opt_t * const p_opt );
>
> The indentation is not unified (tab character is preferred) here and in
> other places, also there are a lot of trailing white spaces in the patch.
> You can run 'git-diff --color' to see formatting issues.

Yes, sorry. I have been trying to follow the new coding standard but I have
not done a great job. Thanks for the git tip.

> >
> > +#ifdef ENABLE_OSM_PERF_MGR
> > +		case 1:
> > +			opt.perfmgr = TRUE;
> > +			break;
> > +		case 2:
> > +			opt.perfmgr_sweep_time_s = atoi(optarg);
>
> In case of user error we can get opt.perfmgr_sweep_time_s = 0 (or another
> strange value), I think at least minimal verification is needed here.

Yes, good catch. I am actually going to remove these from the command line
options. I think one can control these better from the opensm.opts file.
There seem to be too many options which must be set for this to work
correctly right now. Also what would you guys think of having a separate
perfmgr config file? I am not sure about that idea. On one hand it helps to
keep the opensm.opts file clean but on the other hand it means the user has
to deal with another config file. :-/

> > +
> > +/**********************************************************************
> > + * Process errors from the MAD send.
> > + **********************************************************************/
> > +static void
> > +osm_perfmgr_mad_send_err_callback(void* bind_context, osm_madw_t *p_madw)
> > +{
> > +	osm_perfmgr_t *pm = (osm_perfmgr_t *)bind_context;
> > +	osm_madw_context_t *context = &(p_madw->context);
> > +
> > +	OSM_LOG_ENTER( pm->log, osm_pm_mad_send_err_callback );
>                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Ditto (the same for another perfmgr functions)

Yep, I already caught these, thanks.

> > +
> > +	OSM_LOG_ENTER( pm->log, __osm_pm_query_counters );
> > +
> > +	memcpy(node_desc, p_node->node_desc.description,
> > +	       IB_NODE_DESCRIPTION_SIZE);
> > +	node_desc[IB_NODE_DESCRIPTION_SIZE-1] = '\0';
>
> We have null terminated 'print_desc' field in osm_node_t structure

Yea, I put it there, I should have known that ;-) I changed this already...

Thanks for the comments.
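On the atoi() point raised above: one possible form of the "minimal verification" is to parse with strtol() and reject anything that is not a clean number in an accepted range. This is only a sketch; the helper name parse_sweep_time() and the idea of passing explicit bounds are assumptions, not OpenSM code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not OpenSM code: parse a sweep interval given in
 * seconds. Returns 0 and stores the value in *out on success; returns -1
 * and leaves *out untouched on any error (NULL input, empty string,
 * trailing junk, overflow, or a value outside [min, max]).
 */
static int parse_sweep_time(const char *arg, long min, long max, long *out)
{
	char *end = NULL;
	long val;

	if (arg == NULL || out == NULL)
		return -1;

	errno = 0;
	val = strtol(arg, &end, 10);
	if (end == arg || *end != '\0')	/* no digits, or trailing junk */
		return -1;
	if (errno == ERANGE)		/* value does not fit in a long */
		return -1;
	if (val < min || val > max)	/* e.g. 0 or a negative interval */
		return -1;

	*out = val;
	return 0;
}
```

Unlike atoi(), this distinguishes a literal "0" from unparsable input, so a caller could keep the existing default and log a warning when it returns -1.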
I will get a framework done for the general event plugin... I do agree that would be better for other types of events. My original idea was to have these events reported to the "perfmgr". But that is somewhat invasive on the perfmgr object. I am not sure what the best way to do this is at the moment. I have cleaned up the DB interface quite a bit, including making it more generic. So I think this might fit in nicely. I can reissue a patch if you would like to see it. Or I can just submit the header file to see the interface. Thanks, Ira From vuhuong at mellanox.com Mon May 14 09:55:43 2007 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 14 May 2007 09:55:43 -0700 Subject: [ofa-general] [SRPT]multiple initiators supported? In-Reply-To: <7b2fa1820705120247t1b232345w8373bb72416c5b28@mail.gmail.com> References: <7b2fa1820705120247t1b232345w8373bb72416c5b28@mail.gmail.com> Message-ID: <4648948F.5000802@mellanox.com> Ian Jiang wrote: > Does the SRP target support multiple initiators? Yes, it does. > I am using the SRR initiator and IB drivers in linux-2.6.20. > The SRP target is at > http://www.openfabrics.org/git/?p=~vu/srpt.git;a=summary > and the IB driver is OFED-1.1 with linux-2.6.16.13-4-default of Suse-10.1. > > From weiny2 at llnl.gov Mon May 14 10:02:24 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 14 May 2007 10:02:24 -0700 Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager In-Reply-To: <1179140285.1540.167239.camel@hal.voltaire.com> References: <20070508184938.311b1c8f.weiny2@llnl.gov> <20070513195539.GH29746@sashak.voltaire.com> <1179140285.1540.167239.camel@hal.voltaire.com> Message-ID: <20070514100224.7c399438.weiny2@llnl.gov> On 14 May 2007 06:58:34 -0400 Hal Rosenstock wrote: > On Sun, 2007-05-13 at 15:55, Sasha Khapyorsky wrote: > > Hi Ira, > > > > Thanks for the great work! 
> > Indeed :-) > > > On 18:49 Tue 08 May , Ira Weiny wrote: > > > I would like to submit to the list a performance manager which I have been > > > working on for OpenSM. > > > > > > It is implemented as the first proposed architecture model set forth by Hal (As > > > an integrated thread to OpenSM.) As such it works fine on our small test > > > cluster but there is some concern about its scalability. > > > > > > I have extended this architecture with an idea of my own. This idea is to have > > > a plug-able module for the "event database". With this interface one could > > > write their own Data reduction, logging, and tracking methods. Here at LLNL I > > > propose to use this to add counter and subnet events directly to our management > > > database which is used to show system status to our operators. Other > > > installations might prefer other methods of logging, SNMP for example. This > > > patch includes a "reference" implementation of this "event database" which > > > stores the information internally until the user requests a "dump". > > > > I like this event db idea, but not sure this should not be integral part > > of the low level perfmgr stuff - as it is currently implemented without > > such plugin loaded PerfMgr just doesn't work - this unconditionally tries > > to pull all ports counters, but has nothing to do with it without plugin. > > > > Instead I would purpose to have a builtin PerfMgr which will be able to > > pull and store performance related data and then to call "generic" event > > manager which can process such data. This also will help to have simpler > > generic API for such event db plugin so other parts of OpenSM will be > > able to report events using same method(s). What do you think? > > Sounds better to me. Ira ? Yes, except that I am concerned with storing the data in the perfmgr as well as the plugin. But I like the idea of a more generic plugin for getting events from OSM. 
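To make the "generic plugin" idea a little more concrete, here is a rough sketch of what such an interface could look like, with callbacks that a plugin may leave NULL to opt out (the thread mentions checks for NULL DB functions). Every name below (event_id, event_plugin, event_plugin_report, count_events) is invented for illustration and is not taken from the patch or from OpenSM.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event kinds; a real list would come from OpenSM itself. */
enum event_id {
	EVENT_PORT_COUNTERS,
	EVENT_TRAP,
	EVENT_MAX
};

/* Hypothetical plugin descriptor: any callback may be left NULL. */
struct event_plugin {
	void *ctx;	/* plugin-private state (SQL handle, qmap, ...) */
	void (*report)(void *ctx, enum event_id id,
		       uint64_t node_guid, const void *data);
	void (*dump)(void *ctx);
};

/*
 * Deliver one event to a plugin. Returns 1 if the plugin's report
 * callback was called, 0 if there is no plugin or it opted out.
 */
static int event_plugin_report(const struct event_plugin *p,
			       enum event_id id, uint64_t node_guid,
			       const void *data)
{
	if (p == NULL || p->report == NULL)
		return 0;
	p->report(p->ctx, id, node_guid, data);
	return 1;
}

/* Example callback: count events in the int that ctx points to. */
static void count_events(void *ctx, enum event_id id,
			 uint64_t node_guid, const void *data)
{
	(void)id;
	(void)node_guid;
	(void)data;
	++*(int *)ctx;
}
```

The caller never needs to know whether a plugin stores data internally, in SQL, or not at all, which is the separation discussed in this thread.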
My mind is already full of ideas after responding to Sasha... ;-) > > > + > > > +/** > > > + * group port counters for ports into the nodes > > > + */ > > > +typedef struct _osm_pc_node { > > > + cl_map_item_t map_item; /* must be first */ > > > + uint64_t node_guid; > > > + osm_event_pc_t *ports; > > > + uint8_t num_ports; > > > +} osm_pc_node_t; > > > > Is it really needed to keep osm_pc_node_t nodes in separate db (qmap)? > > Why not to reuse already existed maps in osm_subn_t (we could add > > 'void *pm_data' or so field to osm_physp_t structure)? > > My one concern would be evolving the PerfMgr. This is better now but is > this better when the PerfMgr is separated from the SM functionality ? I > know there are other things to untangle to get there. > I fully agree. I don't think we want intertwine the SM structures with the PerfMgr structures. BTW in the new code I have this is named _db_node_t. Ira From amitk at mellanox.co.il Mon May 14 10:59:58 2007 From: amitk at mellanox.co.il (Amit Krig) Date: Mon, 14 May 2007 20:59:58 +0300 Subject: [ofa-general] RE: [PATCH] ipoib/cm: make stale task actually run once in a while (DOES NOT HELP) In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C90BFCB3@mtlexch01.mtl.com><6C2C79E72C305246B504CBA17B5500C9076E27@mtlexch01.mtl.com> <20070507200315.GD22341@mellanox.co.il> Message-ID: <6C2C79E72C305246B504CBA17B5500C90179C4AA@mtlexch01.mtl.com> Still failing in our test as well Amit -----Original Message----- From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] Sent: Saturday, May 12, 2007 1:32 AM To: Michael S. Tsirkin; Scott Weitzenkamp (sweitzen) Cc: Yohad Dickman; Amit Krig; Tziporet Koren; Michael S. Tsirkin; general at lists.openfabrics.org; Roland Dreier Subject: RE: [PATCH] ipoib/cm: make stale task actually run once in a while (DOES NOT HELP) Importance: High This patch, which is in OFED-1.2-20070511-0600, does NOT help. I am still seeing 105-second port failover times. Amit, did you try it? 
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] > Sent: Monday, May 07, 2007 1:03 PM > To: Scott Weitzenkamp (sweitzen) > Cc: Yohad Dickman; Amit Krig; Tziporet Koren; mst at mellanox.co.il; > general at lists.openfabrics.org; Roland Dreier > Subject: [PATCH] ipoib/cm: make stale task actually run once in a > while > > In the presence of some active passive connections, stale task would > never run, since each 4 RX CQEs we repeat queue_delayed_work calls > which delays it for some 10 minutes. As a result, on a noisy system > with failing ports, we slowly run out of resources - slowing > connection setup down and eventually failing. > > What we actually want to do is - start stale task when a first passive > connection is added, rerun it every 10 min as long as there are > outstanding passive connections. > > As a happy side effect, this removes some code from RX data path. > > Signed-off-by: Michael S. Tsirkin > > --- > > Scott, I think this might address bugs 541 and 465: slow IPoIB CM HA > failover and eventual failing IPoIB HA. Could you test this please? 
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c > b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > index 2b242a4..b77e8d7 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > @@ -258,10 +258,11 @@ static int ipoib_cm_req_handler(struct ib_cm_id > *cm_id, struct ib_cm_event *even > cm_id->context = p; > p->jiffies = jiffies; > spin_lock_irqsave(&priv->lock, flags); > + if (list_empty(&priv->cm.passive_ids)) > + queue_delayed_work(ipoib_workqueue, > + &priv->cm.stale_task, > IPOIB_CM_RX_DELAY); > list_add(&p->list, &priv->cm.passive_ids); > spin_unlock_irqrestore(&priv->lock, flags); > - queue_delayed_work(ipoib_workqueue, > - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > return 0; > > err_rep: > @@ -380,8 +381,6 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, > struct ib_wc *wc) > if (!list_empty(&p->list)) > list_move(&p->list, > &priv->cm.passive_ids); > spin_unlock_irqrestore(&priv->lock, flags); > - queue_delayed_work(ipoib_workqueue, > - > &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > } > } > > @@ -1104,6 +1103,10 @@ static void ipoib_cm_stale_task(struct > work_struct *work) > kfree(p); > spin_lock_irqsave(&priv->lock, flags); > } > + > + if (!list_empty(&priv->cm.passive_ids)) > + queue_delayed_work(ipoib_workqueue, > + &priv->cm.stale_task, > IPOIB_CM_RX_DELAY); > spin_unlock_irqrestore(&priv->lock, flags); } > > -- > MST > From sashak at voltaire.com Mon May 14 11:24:00 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 14 May 2007 21:24:00 +0300 Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager In-Reply-To: <1179140285.1540.167239.camel@hal.voltaire.com> References: <20070508184938.311b1c8f.weiny2@llnl.gov> <20070513195539.GH29746@sashak.voltaire.com> <1179140285.1540.167239.camel@hal.voltaire.com> Message-ID: <20070514182400.GL29746@sashak.voltaire.com> On 06:58 Mon 14 May , Hal Rosenstock wrote: > > > + > > > +/** > > > + * group port counters for ports into the nodes 
> > > + */ > > > +typedef struct _osm_pc_node { > > > + cl_map_item_t map_item; /* must be first */ > > > + uint64_t node_guid; > > > + osm_event_pc_t *ports; > > > + uint8_t num_ports; > > > +} osm_pc_node_t; > > > > Is it really needed to keep osm_pc_node_t nodes in separate db (qmap)? > > Why not to reuse already existed maps in osm_subn_t (we could add > > 'void *pm_data' or so field to osm_physp_t structure)? > > My one concern would be evolving the PerfMgr. This is better now but is > this better when the PerfMgr is separated from the SM functionality ? I > know there are other things to untangle to get there. PerfMgr "sweep" is based on discovered fabric topology structures anyway. So what is a reason to duplicate nodes/ports qmaps? Sasha From rick.jones2 at hp.com Mon May 14 11:52:08 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 14 May 2007 11:52:08 -0700 Subject: [ofa-general] OFED 1.2 rc3 release In-Reply-To: <46487AE8.1020005@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com> <46487AE8.1020005@mellanox.co.il> Message-ID: <4648AFD8.2060101@hp.com> Tziporet Koren wrote: > Hi, > > OFED 1.2-RC3 is available on _http://www.openfabrics.org/builds/ofed-1.2/_ > File: OFED-1.2-rc3.tgz > To get BUILD_ID run ofed_info > > Please report any issues in bugzilla _https://bugs.openfabrics.org/_ > > *RC4 due date is May 21* It could be that I need new bifocals, but there does not appear to be a 1.2rc3 version listed against which we can submit reports. 
rick jones From sashak at voltaire.com Mon May 14 12:04:46 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 14 May 2007 22:04:46 +0300 Subject: [ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager In-Reply-To: <20070514095553.3ec3bec7.weiny2@llnl.gov> References: <20070508184938.311b1c8f.weiny2@llnl.gov> <20070513195539.GH29746@sashak.voltaire.com> <20070514095553.3ec3bec7.weiny2@llnl.gov> Message-ID: <20070514190446.GM29746@sashak.voltaire.com> On 09:55 Mon 14 May , Ira Weiny wrote: > > > > Instead I would purpose to have a builtin PerfMgr which will be able to > > pull and store performance related data and then to call "generic" event > > manager which can process such data. This also will help to have simpler > > generic API for such event db plugin so other parts of OpenSM will be > > able to report events using same method(s). What do you think? > > This is a good idea. I will think about how to make it work. Thanks. > > > > > + > > > +/** > > > + * group port counters for ports into the nodes > > > + */ > > > +typedef struct _osm_pc_node { > > > + cl_map_item_t map_item; /* must be first */ > > > + uint64_t node_guid; > > > + osm_event_pc_t *ports; > > > + uint8_t num_ports; > > > +} osm_pc_node_t; > > > > Is it really needed to keep osm_pc_node_t nodes in separate db (qmap)? > > Why not to reuse already existed maps in osm_subn_t (we could add > > 'void *pm_data' or so field to osm_physp_t structure)? > > > > I did not want to complicate the SM data structures. Also these structures > were part of the plugin. This reference plugin used the compatibility lib qmap > structures to store the data. But other plugins may use SQL or other data > stores. Right, but plugin can access OpenSM data structures in the same way as its internal stuff, and just qmaps duplication affects performance. > I think I agree with Hal that we should keep these data structures > separate from the SM structures. [snip..] 
> I think one can control these better from the opensm.opts file. > There seems to be too many options which must be set for this to work correctly > right now. Also what would you guys think of having a separate perfmgr config > file? I am not sure about that idea. On one hand it helps to keep the > opensm.opts file clean but on the other hand it means the user has to deal with > another config file. :-/ Probably we need to think about /etc/*/opensm.conf instead of option cached /var/*/osm/opensm.opts? > > > Thanks for the comments. I will get a framework done for the general event > plugin... I do agree that would be better for other types of events. My > original idea was to have these events reported to the "perfmgr". But that is > somewhat invasive on the perfmgr object. > > I am not sure what the best way to do this is at the moment. I have cleaned up > the DB interface quite a bit, including making it more generic. So I think > this might fit in nicely. I can reissue a patch if you would like to see it. > Or I can just submit the header file to see the interface. A "header file" way looks fine. Probably we may want to separate PerfMgr and EventMgr things to be separate patch sets. But it is up to you. Thanks again for the great work! Sasha From sweitzen at cisco.com Mon May 14 12:19:24 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 14 May 2007 12:19:24 -0700 Subject: [ofa-general] OFED 1.2 rc3 release In-Reply-To: <4648AFD8.2060101@hp.com> References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com><46487AE8.1020005@mellanox.co.il> <4648AFD8.2060101@hp.com> Message-ID: I added 1.2rc3 to bugzilla. 
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Rick Jones > Sent: Monday, May 14, 2007 11:52 AM > To: Tziporet Koren > Cc: ewg at lists.openfabrics.org; general at lists.openfabrics.org > Subject: Re: [ofa-general] OFED 1.2 rc3 release > > Tziporet Koren wrote: > > Hi, > > > > OFED 1.2-RC3 is available on > _http://www.openfabrics.org/builds/ofed-1.2/_ > > File: OFED-1.2-rc3.tgz > > To get BUILD_ID run ofed_info > > > > Please report any issues in bugzilla _https://bugs.openfabrics.org/_ > > > > *RC4 due date is May 21* > > It could be that I need new bifocals, but there does not > appear to be a 1.2rc3 > version listed against which we can submit reports. > > rick jones > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rick.jones2 at hp.com Mon May 14 12:29:48 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 14 May 2007 12:29:48 -0700 Subject: [ofa-general] OFED 1.2 rc3 release In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com><46487AE8.1020005@mellanox.co.il> <4648AFD8.2060101@hp.com> Message-ID: <4648B8AC.4030306@hp.com> Scott Weitzenkamp (sweitzen) wrote: > I added 1.2rc3 to bugzilla. Splendid - bug 618 added :) I didn't have a good feel for severity and/or priority so left them at the defaults. rick jones From rdreier at cisco.com Mon May 14 12:39:31 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 May 2007 12:39:31 -0700 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: <20070513051806.GB7402@mellanox.co.il> (Michael S. 
Tsirkin's message of "Sun, 13 May 2007 08:18:06 +0300")
References: <20070508141727.GR21591@mellanox.co.il> <20070513051806.GB7402@mellanox.co.il>
Message-ID:

> By the way, I just re-checked and it seems that WC support first
> appeared in Pentium II systems. So I think we should be able to
> use sfence if WC is enabled.

That's actually doubly wrong: WC support was added in Pentium Pro, and
sfence was added in Pentium III.

 - R.

From rdreier at cisco.com Mon May 14 12:38:35 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 12:38:35 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070513045921.GA7402@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 13 May 2007 07:59:38 +0300")
References: <20070508141727.GR21591@mellanox.co.il> <20070512172927.GA5908@mellanox.co.il> <20070513045921.GA7402@mellanox.co.il>
Message-ID:

> So, could we use a lock instructions to fence WC writes out?

Yes, the right thing seems to be to use the same thing for wc_wmb() as
for mb() on i386, namely "lock; addl $0,0(%%esp)". That is definitely
a serializing instruction that will flush WC buffers.

 - R.

From rdreier at cisco.com Mon May 14 12:41:28 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 12:41:28 -0700
Subject: [ofa-general] Re: weird kconfig output
In-Reply-To: <20070514142233.GD7989@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 May 2007 17:22:33 +0300")
References: <20070514142233.GD7989@mellanox.co.il>
Message-ID:

> Doing make oldconfig on 2.6.22-rc1 (.config came from 2.6.21)
> gave me this prompt, among the list of 10G/s adapters:
> Verbose debugging output (MLX4_DEBUG) [Y/n/?] (NEW)
>
> Shouldn't I get prompted for mlx4 eth first?

mlx4 eth isn't upstream (since it doesn't do anything, FW isn't ready,
etc etc). I'm not sure if there's a way to improve this until then.

 - R.
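
[Editorial sketch, not part of the original messages.] The wc_wmb() suggestion in the libmlx4 thread above could look roughly like the following in user space. Only the i386 instruction sequence comes from the thread; the x86_64 variant (same locked add, using %rsp), the generic fallback, and the post_descriptor() helper with its doorbell use case are assumptions added so the sketch is self-contained.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical wc_wmb(): make earlier write-combining stores visible
 * before any later store. The i386 branch is the sequence quoted in the
 * thread; the other two branches are assumptions for portability.
 */
#if defined(__i386__)
#define wc_wmb() __asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory", "cc")
#elif defined(__x86_64__)
#define wc_wmb() __asm__ __volatile__("lock; addl $0,0(%%rsp)" ::: "memory", "cc")
#else
#define wc_wmb() __sync_synchronize()
#endif

/*
 * Illustrative helper (hypothetical): copy a descriptor into a buffer that
 * would be WC-mapped on real hardware, fence, then write a doorbell word.
 */
static void post_descriptor(volatile uint32_t *wc_buf,
                            volatile uint32_t *doorbell,
                            const uint32_t *desc, int words)
{
    int i;

    for (i = 0; i < words; i++)
        wc_buf[i] = desc[i];

    wc_wmb();       /* descriptor stores are ordered before the doorbell */
    *doorbell = 1;
}
```

Adding zero to the word at the top of the stack leaves memory unchanged; the locked read-modify-write is used only for its ordering effect.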
From swise at opengridcomputing.com Mon May 14 12:55:39 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 14 May 2007 14:55:39 -0500 Subject: [ofa-general] [GIT PULL] ofed_1_2 iw_cxgb3 - fix for bug 611 Message-ID: <1179172539.25841.57.camel@stevo-desktop> Vlad, Please pull the cxgb3 driver fix for bug 611 from git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2 Thanks, Steve. ---- commit 1e6d99bddf75465a6c05b74074278f2691edcc37 Author: Steve Wise Date: Mon May 14 13:27:27 2007 -0500 iw_cxgb3: Streaming -> RDMA mode transition fixes. Due to a HW issue, our current scheme to transition the connection from streaming to rdma mode was broken on the passive side. The firmware and driver now support a new transition scheme for the passive side: - driver posts rdma_init_wr (now including the initial receive seqno) - driver posts last streaming message via TX_DATA message (MPA start response) - uP atomically sends the last streaming message and transitions the tcb to rdma mode. - driver waits for wr_ack indicating the last streaming message was ACKed. This change also bumps the required firmware version... 
Signed-off-by: Steve Wise diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index ce05db5..62998d3 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -862,7 +862,7 @@ int cxio_rdma_init(struct cxio_rdev *rde wqe->ird = cpu_to_be32(attr->ird); wqe->qp_dma_addr = cpu_to_be64(attr->qp_dma_addr); wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size); - wqe->rsvd = 0; + wqe->irs = cpu_to_be32(attr->irs); skb->priority = 0; /* 0=>ToeQ; 1=>CtrlQ */ return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); } diff --git a/drivers/infiniband/hw/cxgb3/cxio_wr.h b/drivers/infiniband/hw/cxgb3/cxio_wr.h index e7ea455..9094147 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_wr.h +++ b/drivers/infiniband/hw/cxgb3/cxio_wr.h @@ -294,6 +294,7 @@ struct t3_rdma_init_attr { u64 qp_dma_addr; u32 qp_dma_size; u32 flags; + u32 irs; }; struct t3_rdma_init_wr { @@ -314,7 +315,7 @@ struct t3_rdma_init_wr { __be32 ird; __be64 qp_dma_addr; /* 7 */ __be32 qp_dma_size; /* 8 */ - u32 rsvd; + u32 irs; }; struct t3_genbit { diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 0d81e2f..ed56d55 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -516,7 +516,7 @@ static void send_mpa_req(struct iwch_ep req->len = htonl(len); req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | V_TX_SNDBUF(snd_win>>15)); - req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->flags = htonl(F_TX_INIT); req->sndseq = htonl(ep->snd_seq); BUG_ON(ep->mpa_skb); ep->mpa_skb = skb; @@ -567,7 +567,7 @@ static int send_mpa_reject(struct iwch_e req->len = htonl(mpalen); req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | V_TX_SNDBUF(snd_win>>15)); - req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->flags = htonl(F_TX_INIT); req->sndseq = htonl(ep->snd_seq); BUG_ON(ep->mpa_skb); ep->mpa_skb = skb; @@ -619,7 +619,7 @@ static int send_mpa_reply(struct iwch_ep req->len = 
htonl(len); req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | V_TX_SNDBUF(snd_win>>15)); - req->flags = htonl(F_TX_MORE | F_TX_IMM_ACK | F_TX_INIT); + req->flags = htonl(F_TX_INIT); req->sndseq = htonl(ep->snd_seq); ep->mpa_skb = skb; state_set(&ep->com, MPA_REP_SENT); @@ -642,6 +642,7 @@ static int act_establish(struct t3cdev * cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid); ep->snd_seq = ntohl(req->snd_isn); + ep->rcv_seq = ntohl(req->rcv_isn); set_emss(ep, ntohs(req->tcp_opt)); @@ -1021,6 +1022,9 @@ static int rx_data(struct t3cdev *tdev, skb_pull(skb, sizeof(*hdr)); skb_trim(skb, dlen); + + ep->rcv_seq += dlen; + BUG_ON(ep->rcv_seq != (ntohl(hdr->seq) + dlen)); switch (state_read(&ep->com)) { case MPA_REQ_SENT: @@ -1059,7 +1063,6 @@ static int tx_ack(struct t3cdev *tdev, s struct iwch_ep *ep = ctx; struct cpl_wr_ack *hdr = cplhdr(skb); unsigned int credits = ntohs(hdr->credits); - enum iwch_qp_attr_mask mask; PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits); @@ -1071,30 +1074,6 @@ static int tx_ack(struct t3cdev *tdev, s ep->mpa_skb = NULL; dst_confirm(ep->dst); if (state_read(&ep->com) == MPA_REP_SENT) { - struct iwch_qp_attributes attrs; - - /* bind QP to EP and move to RTS */ - attrs.mpa_attr = ep->mpa_attr; - attrs.max_ird = ep->ord; - attrs.max_ord = ep->ord; - attrs.llp_stream_handle = ep; - attrs.next_state = IWCH_QP_STATE_RTS; - - /* bind QP and TID with INIT_WR */ - mask = IWCH_QP_ATTR_NEXT_STATE | - IWCH_QP_ATTR_LLP_STREAM_HANDLE | - IWCH_QP_ATTR_MPA_ATTR | - IWCH_QP_ATTR_MAX_IRD | - IWCH_QP_ATTR_MAX_ORD; - - ep->com.rpl_err = iwch_modify_qp(ep->com.qp->rhp, - ep->com.qp, mask, &attrs, 1); - - if (!ep->com.rpl_err) { - state_set(&ep->com, FPDU_MODE); - established_upcall(ep); - } - ep->com.rpl_done = 1; PDBG("waking up ep %p\n", ep); wake_up(&ep->com.waitq); @@ -1377,6 +1356,7 @@ static int pass_establish(struct t3cdev PDBG("%s ep %p\n", __FUNCTION__, ep); ep->snd_seq = ntohl(req->snd_isn); + ep->rcv_seq = ntohl(req->rcv_isn); 
set_emss(ep, ntohs(req->tcp_opt)); @@ -1730,10 +1710,8 @@ int iwch_accept_cr(struct iw_cm_id *cm_i struct iwch_qp *qp = get_qhp(h, conn_param->qpn); PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); - if (state_read(&ep->com) == DEAD) { - put_ep(&ep->com); + if (state_read(&ep->com) == DEAD) return -ECONNRESET; - } BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); BUG_ON(!qp); @@ -1753,18 +1731,9 @@ int iwch_accept_cr(struct iw_cm_id *cm_i ep->ird = conn_param->ird; ep->ord = conn_param->ord; PDBG("%s %d ird %d ord %d\n", __FUNCTION__, __LINE__, ep->ird, ep->ord); + get_ep(&ep->com); - err = send_mpa_reply(ep, conn_param->private_data, - conn_param->private_data_len); - if (err) { - ep->com.cm_id = NULL; - ep->com.qp = NULL; - cm_id->rem_ref(cm_id); - abort_connection(ep, NULL, GFP_KERNEL); - put_ep(&ep->com); - return err; - } - + /* bind QP to EP and move to RTS */ attrs.mpa_attr = ep->mpa_attr; attrs.max_ird = ep->ord; @@ -1781,16 +1750,29 @@ int iwch_accept_cr(struct iw_cm_id *cm_i err = iwch_modify_qp(ep->com.qp->rhp, ep->com.qp, mask, &attrs, 1); + if (err) + goto err; - if (err) { - ep->com.cm_id = NULL; - ep->com.qp = NULL; - cm_id->rem_ref(cm_id); - abort_connection(ep, NULL, GFP_KERNEL); - } else { - state_set(&ep->com, FPDU_MODE); - established_upcall(ep); - } + err = send_mpa_reply(ep, conn_param->private_data, + conn_param->private_data_len); + if (err) + goto err; + + /* wait for wr_ack */ + wait_event(ep->com.waitq, ep->com.rpl_done); + err = ep->com.rpl_err; + if (err) + goto err; + + state_set(&ep->com, FPDU_MODE); + established_upcall(ep); + put_ep(&ep->com); + return 0; +err: + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + abort_connection(ep, NULL, GFP_KERNEL); put_ep(&ep->com); return err; } diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h index 1d4a1a5..5462331 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.h +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -175,6 +175,7 @@ 
struct iwch_ep { unsigned int atid; u32 hwtid; u32 snd_seq; + u32 rcv_seq; struct l2t_entry *l2t; struct dst_entry *dst; struct sk_buff *mpa_skb; diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 7530dc0..162d1fa 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -734,6 +734,7 @@ #endif init_attr.qp_dma_addr = qhp->wq.dma_addr; init_attr.qp_dma_size = (1UL << qhp->wq.size_log2); init_attr.flags = rqes_posted(qhp) ? RECVS_POSTED : 0; + init_attr.irs = qhp->ep->rcv_seq; PDBG("%s init_attr.rq_addr 0x%x init_attr.rq_size = %d " "flags 0x%x qpcaps 0x%x\n", __FUNCTION__, init_attr.rq_addr, init_attr.rq_size, diff --git a/drivers/net/cxgb3/version.h b/drivers/net/cxgb3/version.h index 17b9801..7ef2193 100644 --- a/drivers/net/cxgb3/version.h +++ b/drivers/net/cxgb3/version.h @@ -39,6 +39,6 @@ #define DRV_VERSION "1.0-ofed" /* Firmware version */ #define FW_VERSION_MAJOR 4 -#define FW_VERSION_MINOR 0 +#define FW_VERSION_MINOR 2 #define FW_VERSION_MICRO 0 #endif /* __CHELSIO_VERSION_H */ From rdreier at cisco.com Mon May 14 13:03:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 May 2007 13:03:13 -0700 Subject: [ofa-general] [PATCH] mlx4: fix uninitialized spinlock for 32-bit architectures In-Reply-To: <200705131718.23298.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Sun, 13 May 2007 17:18:23 +0300") References: <200705131718.23298.jackm@dev.mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Mon May 14 13:23:37 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 May 2007 13:23:37 -0700 Subject: [ofa-general] Re: [PATCH take2] IB/ipath -- shadow the gpio_mask register In-Reply-To: <20070510191047.6876.80760.stgit@bauxite.internal.keyresearch.com> (Arthur Jones's message of "Thu, 10 May 2007 12:10:49 -0700") References: <20070510191047.6876.80760.stgit@bauxite.internal.keyresearch.com> Message-ID: Thanks, applied. 
That changelog was what I dream of seeing with a patch -- it was so
perfect I choked up a little.

From rdreier at cisco.com Mon May 14 13:42:08 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 13:42:08 -0700
Subject: [ofa-general] Re: [PATCH 6/6] IB/ehca: disable scaling code by default, bump version number
In-Reply-To: <200705091348.31742.fenkes@de.ibm.com> (Joachim Fenkes's message of "Wed, 9 May 2007 13:48:31 +0200")
References: <200705091348.31742.fenkes@de.ibm.com>
Message-ID:

Thanks, applied 1-6.

From mst at dev.mellanox.co.il Mon May 14 13:42:41 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 14 May 2007 23:42:41 +0300
Subject: [ofa-general] Re: weird kconfig output
In-Reply-To:
References: <20070514142233.GD7989@mellanox.co.il>
Message-ID: <20070514204241.GB12462@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: weird kconfig output
>
> > Doing make oldconfig on 2.6.22-rc1 (.config came from 2.6.21)
> > gave me this prompt, among the list of 10G/s adapters:
> > Verbose debugging output (MLX4_DEBUG) [Y/n/?] (NEW)
> >
> > Shouldn't I get prompted for mlx4 eth first?
>
> mlx4 eth isn't upstream (since it doesn't do anything, FW isn't ready,
> etc etc). I'm not sure if there's a way to improve this until then.

Maybe just change the help text?

--
MST

From rdreier at cisco.com Mon May 14 13:46:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 13:46:07 -0700
Subject: [ofa-general] Re: weird kconfig output
In-Reply-To: <20070514204241.GB12462@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 May 2007 23:42:41 +0300")
References: <20070514142233.GD7989@mellanox.co.il> <20070514204241.GB12462@mellanox.co.il>
Message-ID:

> Maybe just change the help text?

Right now we have:

	  This option causes debugging code to be compiled into the
	  mlx4_core driver. The output can be turned on via the
	  debug_level module parameter (which can also be set after
	  the driver is loaded through sysfs).
what would you suggest changing?

From sashak at voltaire.com Mon May 14 14:05:41 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 15 May 2007 00:05:41 +0300
Subject: [ofa-general] Re: Error message in OSM log when cached op file doesn't exist
In-Reply-To: <46486D1E.6010408@dev.mellanox.co.il>
References: <46486D1E.6010408@dev.mellanox.co.il>
Message-ID: <20070514210541.GR29746@sashak.voltaire.com>

Hi Yevgeny,

On 17:07 Mon 14 May , Yevgeny Kliteynik wrote:
>
> I actually don't like this thing, because now every time you run
> OpenSM on the machine that doesn't have any cached options file
> (which is usually the case) you get an error message.
>
> There's no point checking whether the file exists, because osm runs
> as root,

Not necessarily.

> and if it fails opening this file, it means that the file
> doesn't exist or is inaccessible (broken mount, etc).

or user provided OSM_CACHE_DIR environment variable is broken or malloc
failed, or other error (see: man 3 fopen, man 2 open, man 3 malloc)

Probably just this solves your issue:

diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
index 855d1ab..f7ddf7d 100644
--- a/osm/opensm/osm_subnet.c
+++ b/osm/opensm/osm_subnet.c
@@ -51,6 +51,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -856,7 +857,7 @@ osm_subn_parse_conf_file(
 	opts_file = fopen(file_name, "r");
 	if (!opts_file)
-		return IB_ERROR;
+		return errno == ENOENT ? IB_SUCCESS : IB_ERROR;
 	while (fgets(line, 1023, opts_file) != NULL) {

Or yet another IB_* status value and less aggressive warning message?
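
[Editorial sketch, not part of the original message.] The idea in the patch above, treating a missing cached options file as a normal condition while still reporting other open failures, can be shown outside OpenSM as follows. The function name parse_cached_opts() and the CONF_OK / CONF_ERROR values are hypothetical stand-ins for osm_subn_parse_conf_file() and IB_SUCCESS / IB_ERROR.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for IB_SUCCESS / IB_ERROR. */
enum { CONF_OK = 0, CONF_ERROR = 1 };

/*
 * Read an optional options file. A file that does not exist is not an
 * error (there is simply nothing cached yet); any other fopen() failure,
 * such as a permission problem, is still reported to the caller.
 */
static int parse_cached_opts(const char *file_name)
{
    char line[1024];
    FILE *opts_file = fopen(file_name, "r");

    if (!opts_file)
        return errno == ENOENT ? CONF_OK : CONF_ERROR;

    while (fgets(line, sizeof(line), opts_file) != NULL) {
        /* option parsing would go here */
    }

    fclose(opts_file);
    return CONF_OK;
}
```

Checking errno right after the failed fopen() matters: a later library call could overwrite it.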
Sasha From rdreier at cisco.com Mon May 14 13:57:23 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 May 2007 13:57:23 -0700 Subject: [ofa-general] Re: [GIT PULL] 2.6.22: please pull rdma-dev.git In-Reply-To: <000101c79269$72c67aa0$e598070a@amr.corp.intel.com> (Sean Hefty's message of "Wed, 9 May 2007 11:39:59 -0700") References: <000101c79269$72c67aa0$e598070a@amr.corp.intel.com> Message-ID: Thanks, applied 1-3 From mst at dev.mellanox.co.il Mon May 14 13:58:23 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 May 2007 23:58:23 +0300 Subject: [ofa-general] Re: weird kconfig output In-Reply-To: References: <20070514142233.GD7989@mellanox.co.il> <20070514204241.GB12462@mellanox.co.il> Message-ID: <20070514205823.GC12462@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: weird kconfig output > > > Maybe just change the help text? > > Right now we have: > > This option causes debugging code to be compiled into the > mlx4_core driver. The output can be turned on via the > debug_level module parameter (which can also be set after > the driver is loaded through sysfs). > > what would you suggest changing? - bool "Verbose debugging output" if (MLX4_CORE && EMBEDDED) + bool "Mellanox ConnectX coremodule: verbose debugging output" if (MLX4_CORE && EMBEDDED) or, is it possible to move the menu option to mlx4_ib? -- MST From rdreier at cisco.com Mon May 14 14:00:57 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 May 2007 14:00:57 -0700 Subject: [ofa-general] Re: [PATCH] libmthca: fix cq cleanup In-Reply-To: <20070514141514.GC7989@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 May 2007 17:15:14 +0300") References: <20070514141514.GC7989@mellanox.co.il> Message-ID: Thanks, applied both libmthca fixes. 
From rdreier at cisco.com Mon May 14 14:10:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 14:10:47 -0700
Subject: [ofa-general] Re: [PATCH] IB/mthca: fix cq cleanup
In-Reply-To: <20070514141450.GB7989@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 May 2007 17:14:50 +0300")
References: <20070514141450.GB7989@mellanox.co.il>
Message-ID:

thanks, applied.

From rdreier at cisco.com Mon May 14 14:14:04 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 14 May 2007 14:14:04 -0700
Subject: [ofa-general] Re: [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <20070514045832.GA18615@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 May 2007 07:58:32 +0300")
References: <20070514045832.GA18615@mellanox.co.il>
Message-ID:

thanks...

> Michael S. Tsirkin (3):
>       IB/mthca: fix posting >255 recv WRs for Tavor
>       ipoib/cm: optimize stale connection detection

I applied this one.

>       IB/mthca: fix RESET to ERROR transition

I will read this over more carefully -- it seems to be a rather big
patch that adds constification various places etc.

> Yosef Etigin (2):
>       IB/core: add helpers for uncached gid/pkey queries
>       IB/ipoib: handle pkey re-shuffling

I need to catch up on the discussion that I did not read while I was
traveling last week, so I'll hold these two as well.
From rdreier at cisco.com Mon May 14 14:18:00 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 May 2007 14:18:00 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get the following post 2.6.22-rc1 fixes: Arthur Jones (1): IB/ipath: Shadow the gpio_mask register Hoang-Nam Nguyen (1): IB/ehca: Fix AQP0/1 QP number Jack Morgenstein (1): IB/mlx4: Fix uninitialized spinlock for 32-bit archs Joachim Fenkes (4): IB/ehca: Correctly set GRH mask bit in ehca_modify_qp() IB/ehca: Remove _irqsave, move #ifdef IB/ehca: Beautify sysfs attribute code and fix compiler warnings IB/ehca: Disable scaling code by default, bump version number Michael S. Tsirkin (3): IB/mthca: Fix posting >255 recv WRs for Tavor IB/mthca: Set cleaned CQEs back to HW ownership when cleaning CQ IPoIB/cm: Optimize stale connection detection Paul Mundt (1): net: Trivial MLX4_DEBUG dependency fix. 
Roland Dreier (1): mlx4_core: Remove unused doorbell_lock Sean Hefty (3): RDMA/cma: Simplify device removal handling code RDMA/cma: Fix synchronization with device removal in cma_iw_handler RDMA/cma: Add check to validate that cm_id is bound to a device Stefan Roscher (1): IB/ehca: Serialize hypervisor calls in ehca_register_mr() drivers/infiniband/core/cma.c | 106 +++++++++++++++------------ drivers/infiniband/hw/ehca/ehca_classes.h | 1 + drivers/infiniband/hw/ehca/ehca_irq.c | 7 +- drivers/infiniband/hw/ehca/ehca_main.c | 94 +++++++++++------------- drivers/infiniband/hw/ehca/ehca_qp.c | 17 +++-- drivers/infiniband/hw/ehca/hcp_if.c | 13 +++- drivers/infiniband/hw/ipath/ipath_iba6120.c | 7 +- drivers/infiniband/hw/ipath/ipath_intr.c | 7 +- drivers/infiniband/hw/ipath/ipath_kernel.h | 2 + drivers/infiniband/hw/ipath/ipath_verbs.c | 12 ++-- drivers/infiniband/hw/mlx4/main.c | 1 + drivers/infiniband/hw/mthca/mthca_cq.c | 4 +- drivers/infiniband/hw/mthca/mthca_qp.c | 1 + drivers/infiniband/ulp/ipoib/ipoib_cm.c | 11 ++- drivers/net/Kconfig | 1 + drivers/net/mlx4/main.c | 2 - drivers/net/mlx4/mlx4.h | 1 - 17 files changed, 154 insertions(+), 133 deletions(-) From halr at voltaire.com Mon May 14 14:21:52 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 14 May 2007 17:21:52 -0400 Subject: [ofa-general] IB/core: Enhance SMI for switch support Message-ID: <1179177711.4531.10290.camel@hal.voltaire.com> IB/core: Enhance SMI for switch support SMI is extended for switch (intermediate hop) support. Care has been taken to ensure that the CA (and router) code paths are as identical as possible as to how they were prior to adding this support. Signed-off-by: Suresh Shelvapille Signed-off-by: Hal Rosenstock diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c index ecd1a30..7583941 100644 --- a/drivers/infiniband/core/agent.c +++ b/drivers/infiniband/core/agent.c @@ -3,7 +3,7 @@ * Copyright (c) 2004, 2005 Infinicon Corporation. All rights reserved. 
* Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. * Copyright (c) 2004, 2005 Topspin Corporation. All rights reserved. - * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * Copyright (c) 2004-2007 Voltaire Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -34,7 +34,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: agent.c 1389 2004-12-27 22:56:47Z roland $ */ #include @@ -42,6 +41,7 @@ #include "agent.h" #include "smi.h" +#include "mad_priv.h" #define SPFX "ib_agent: " @@ -87,8 +87,13 @@ int agent_send_response(struct ib_mad *m struct ib_mad_send_buf *send_buf; struct ib_ah *ah; int ret; + struct ib_mad_send_wr_private *mad_send_wr; + + if (device->node_type == RDMA_NODE_IB_SWITCH) + port_priv = ib_get_agent_port(device, 0); + else + port_priv = ib_get_agent_port(device, port_num); - port_priv = ib_get_agent_port(device, port_num); if (!port_priv) { printk(KERN_ERR SPFX "Unable to find port agent\n"); return -ENODEV; @@ -113,6 +118,14 @@ int agent_send_response(struct ib_mad *m memcpy(send_buf->mad, mad, sizeof *mad); send_buf->ah = ah; + + if (device->node_type == RDMA_NODE_IB_SWITCH){ + mad_send_wr = container_of(send_buf, + struct ib_mad_send_wr_private, + send_buf); + mad_send_wr->send_wr.wr.ud.port_num = port_num; + } + if ((ret = ib_post_send_mad(send_buf, NULL))) { printk(KERN_ERR SPFX "ib_post_send_mad error:%d\n", ret); goto err2; diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 85ccf13..6b8faca 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -675,10 +675,16 @@ static int handle_outgoing_dr_smp(struct struct ib_mad_port_private *port_priv; struct ib_mad_agent_private *recv_mad_agent = NULL; struct ib_device *device = mad_agent_priv->agent.device; - u8 port_num = 
mad_agent_priv->agent.port_num; + u8 port_num; struct ib_wc mad_wc; struct ib_send_wr *send_wr = &mad_send_wr->send_wr; + if (device->node_type == RDMA_NODE_IB_SWITCH && + smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + port_num = send_wr->wr.ud.port_num; + else + port_num = mad_agent_priv->agent.port_num; + /* * Directed route handling starts if the initial LID routed part of * a request or the ending LID routed part of a response is empty. @@ -1839,6 +1845,7 @@ static void ib_mad_recv_done_handler(str struct ib_mad_private *recv, *response; struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent; + int port_num; response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); if (!response) @@ -1872,25 +1879,50 @@ static void ib_mad_recv_done_handler(str if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) goto out; + if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) + port_num = wc->port_num; + else + port_num = port_priv->port_num; + if (recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + enum smi_forward_action retsmi; + if (smi_handle_dr_smp_recv(&recv->mad.smp, port_priv->device->node_type, - port_priv->port_num, + port_num, port_priv->device->phys_port_cnt) == IB_SMI_DISCARD) goto out; - if (smi_check_forward_dr_smp(&recv->mad.smp) == IB_SMI_LOCAL) + retsmi = smi_check_forward_dr_smp(&recv->mad.smp); + if (retsmi == IB_SMI_LOCAL) goto local; - if (smi_handle_dr_smp_send(&recv->mad.smp, - port_priv->device->node_type, - port_priv->port_num) == IB_SMI_DISCARD) - goto out; + if (retsmi == IB_SMI_SEND) { /* don't forward */ + if (smi_handle_dr_smp_send(&recv->mad.smp, + port_priv->device->node_type, + port_num) == IB_SMI_DISCARD) + goto out; + + if (smi_check_local_smp(&recv->mad.smp, port_priv->device) == IB_SMI_DISCARD) + goto out; + } else if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) { + /* forward case for switches */ + memcpy(response, recv, sizeof(*response)); + response->header.recv_wc.wc = 
&response->header.wc; + response->header.recv_wc.recv_buf.mad = &response->mad.mad; + response->header.recv_wc.recv_buf.grh = &response->grh; + + if (!agent_send_response(&response->mad.mad, + &response->grh, wc, + port_priv->device, + smi_get_fwd_port(&recv->mad.smp), + qp_info->qp->qp_num)) + response = NULL; - if (smi_check_local_smp(&recv->mad.smp, port_priv->device) == IB_SMI_DISCARD) goto out; + } } local: @@ -1919,7 +1951,7 @@ local: agent_send_response(&response->mad.mad, &recv->grh, wc, port_priv->device, - port_priv->port_num, + port_num, qp_info->qp->qp_num); goto out; } diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c index 2bca753..8723675 100644 --- a/drivers/infiniband/core/smi.c +++ b/drivers/infiniband/core/smi.c @@ -192,7 +192,7 @@ enum smi_action smi_handle_dr_smp_recv(s } /* smp->hop_ptr updated when sending */ return (node_type == RDMA_NODE_IB_SWITCH ? - IB_SMI_HANDLE: IB_SMI_DISCARD); + IB_SMI_HANDLE : IB_SMI_DISCARD); } /* C14-13:4 -- hop_ptr = 0 -> give to SM */ @@ -211,7 +211,7 @@ enum smi_forward_action smi_check_forwar if (!ib_get_smp_direction(smp)) { /* C14-9:2 -- intermediate hop */ if (hop_ptr && hop_ptr < hop_cnt) - return IB_SMI_SEND; + return IB_SMI_FORWARD; /* C14-9:3 -- at the end of the DR segment of path */ if (hop_ptr == hop_cnt) @@ -224,7 +224,7 @@ enum smi_forward_action smi_check_forwar } else { /* C14-13:2 -- intermediate hop */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) - return IB_SMI_SEND; + return IB_SMI_FORWARD; /* C14-13:3 -- at the end of the DR segment of path */ if (hop_ptr == 1) @@ -233,3 +233,13 @@ enum smi_forward_action smi_check_forwar } return IB_SMI_LOCAL; } + +/* + * Return the forwarding port number from initial_path for outgoing SMP and + * from return_path for returning SMP + */ +int smi_get_fwd_port(struct ib_smp *smp) +{ + return (!ib_get_smp_direction(smp) ? 
smp->initial_path[smp->hop_ptr+1] : + smp->return_path[smp->hop_ptr-1]); +} diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h index 9a4b349..1cfc298 100644 --- a/drivers/infiniband/core/smi.h +++ b/drivers/infiniband/core/smi.h @@ -48,10 +48,12 @@ enum smi_action { enum smi_forward_action { IB_SMI_LOCAL, /* SMP should be completed up the stack */ IB_SMI_SEND, /* received DR SMP should be forwarded to the send queue */ + IB_SMI_FORWARD /* SMP should be forwarded (for switches only) */ }; enum smi_action smi_handle_dr_smp_recv(struct ib_smp *smp, u8 node_type, int port_num, int phys_port_cnt); +int smi_get_fwd_port(struct ib_smp *smp); extern enum smi_forward_action smi_check_forward_dr_smp(struct ib_smp *smp); extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp, u8 node_type, int port_num); From rdreier at cisco.com Mon May 14 14:24:53 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 14 May 2007 14:24:53 -0700 Subject: [ofa-general] IB/core: Enhance SMI for switch support In-Reply-To: <1179177711.4531.10290.camel@hal.voltaire.com> (Hal Rosenstock's message of "14 May 2007 17:21:52 -0400") References: <1179177711.4531.10290.camel@hal.voltaire.com> Message-ID: Sorry, I lost this one in my queue. However when I was thinking about applying it, I couldn't help but wonder whether it was a good idea or not. Is there any prospect of an in-tree driver that would use the code? Current Mellanox switches do intermediate hop SMI handling in firmware, so a Mellanox switch driver wouldn't use this code. And I'm not sure we can justify this change (which after all carries some risk) just to make it easier for baymicrosystems's proprietary driver. - R. 
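
[Editorial sketch, not part of the original messages.] For readers following the SMI patch discussed above, the forwarding-port lookup it adds can be illustrated in isolation. The struct below is a simplified, hypothetical stand-in for the kernel's struct ib_smp, and the function names are invented for this sketch; only the hop_ptr comparisons and the path indexing mirror the patch.

```c
#include <assert.h>
#include <stdint.h>

#define DR_PATH_LEN 64  /* assumed path array size for this sketch */

/* Simplified, hypothetical stand-in for the kernel's struct ib_smp. */
struct dr_smp {
    int     returning;  /* 0 = outgoing request, 1 = returning response */
    uint8_t hop_ptr;
    uint8_t hop_cnt;
    uint8_t initial_path[DR_PATH_LEN];
    uint8_t return_path[DR_PATH_LEN];
};

/*
 * Intermediate-hop test, following the C14-9:2 (outgoing) and C14-13:2
 * (returning) conditions that yield IB_SMI_FORWARD in the patch.
 */
static int dr_is_intermediate_hop(const struct dr_smp *smp)
{
    if (!smp->returning)
        return smp->hop_ptr && smp->hop_ptr < smp->hop_cnt;
    return 2 <= smp->hop_ptr && smp->hop_ptr <= smp->hop_cnt;
}

/*
 * Egress port for a forwarded SMP: the next initial_path entry for an
 * outgoing SMP, the previous return_path entry for a returning one.
 */
static int dr_fwd_port(const struct dr_smp *smp)
{
    assert(dr_is_intermediate_hop(smp));
    return !smp->returning ? smp->initial_path[smp->hop_ptr + 1]
                           : smp->return_path[smp->hop_ptr - 1];
}
```

Because the lookup is only meaningful at an intermediate hop, the indices hop_ptr + 1 and hop_ptr - 1 stay within the path arrays.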
From sashak at voltaire.com Mon May 14 14:36:24 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 15 May 2007 00:36:24 +0300 Subject: [ofa-general] suggested patch for partition membership definitiion in osm-partitions.conf (fix) In-Reply-To: <46487FBF.7020300@cea.fr> References: <46487FBF.7020300@cea.fr> Message-ID: <20070514213624.GS29746@sashak.voltaire.com> Hi Philippe, On 17:26 Mon 14 May , Philippe Gregoire wrote: > > Here is a patch to allow a more compact partition membership definition. > It allows definition of a default > membership partition for the port guid list. The old syntax is still usable. > old way > G1 = 0x01 : 0x123=full, 0x124=full, 0x0x125=full, 0x126=full, 0x127=full ; > G1 = 0x01 : 0x128=full, 0x129=full, 0x567, 0x569=full > > new way : > G1 = 0x01 , defmember=full : 0x123, 0x124, 0x125, 0x126, 0x127 ; > G1 = 0x01 , defmember=full : 0x128, 0x129, 0x567=limited, 0x569 I think this can be useful. Minor comment below. > --- opensm/osm_prtn_config.old.c 2007-04-18 11:54:29.000000000 +0200 > +++ opensm/osm_prtn_config.c 2007-05-14 17:14:42.228813361 +0200 > @@ -70,6 +70,7 @@ > osm_subn_t *p_subn; > osm_prtn_t *p_prtn; > unsigned is_ipoib, mtu, rate, sl, scope; > + boolean_t full; > }; > > extern osm_prtn_t *osm_prtn_make_new(osm_log_t *p_log, osm_subn_t *p_subn, > @@ -163,6 +164,14 @@ > " - skipped\n", lineno); > else > conf->sl = sl; > + } else if (!strncmp(flag, "defmember", len)) { > + if (!val || (strcmp(val, "limited") && strcmp(val, "full"))) With strncmp(val, "limited"/"full", strlen(val)) user will be able to use "limi" and "fu" (or shorter :)) substrings. 
> + osm_log(conf->p_log, OSM_LOG_VERBOSE, > + "PARSE WARN: line %d: " > + "flag \'defmember\' requires valid value (limited or full)" > + " - skipped\n", lineno); > + else > + conf->full = strcmp(val, "full") == 0; > } else { > osm_log(conf->p_log, OSM_LOG_VERBOSE, > "PARSE WARN: line %d: " Sasha From halr at voltaire.com Mon May 14 14:32:52 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 14 May 2007 17:32:52 -0400 Subject: [ofa-general] IB/core: Enhance SMI for switch support In-Reply-To: References: <1179177711.4531.10290.camel@hal.voltaire.com> Message-ID: <1179178372.4531.10975.camel@hal.voltaire.com> On Mon, 2007-05-14 at 17:24, Roland Dreier wrote: > Sorry, I lost this one in my queue. > > However when I was thinking about applying it, I couldn't help but > wonder whether it was a good idea or not. Is there any prospect of an > in-tree driver that would use the code? I'm not sure; I've heard rumors of other OpenIB based switches. > Current Mellanox switches do > intermediate hop SMI handling in firmware, so a Mellanox switch driver > wouldn't use this code. And I'm not sure we can justify this change > (which after all carries some risk) The risk is primarily on the switch side, rather than the CA/router side, right ? So isn't the downside of this minimal ? -- Hal > just to make it easier for baymicrosystems's proprietary driver. > > - R. From mst at dev.mellanox.co.il Mon May 14 14:50:30 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 May 2007 00:50:30 +0300 Subject: [ofa-general] Re: [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git In-Reply-To: References: <20070514045832.GA18615@mellanox.co.il> Message-ID: <20070514215030.GE12462@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git > > thanks... > > > Michael S. Tsirkin (3): > > IB/mthca: fix posting >255 recv WRs for Tavor > > ipoib/cm: optimize stale connection detection > > I applied this one. 
>
> > IB/mthca: fix RESET to ERROR transition
>
> I will read this over more carefully -- it seems to be a rather big
> patch that adds constification various places etc.

Thanks. Some explanations:
The only reason for the const change is that there's a table of attribute
structs that's always the same, so I decided it's nice to have it as a global
const, and the change rippled over.
The rest is just splitting up the modify command so that on RESET->ERROR
I can perform 2 commands without code duplication.

> > Yosef Etigin (2):
> > IB/core: add helpers for uncached gid/pkey queries
> > IB/ipoib: handle pkey re-shuffling
>
> I need to catch up on the discussion that I did not read while I was
> traveling last week, so I'll hold these two as well.

Here's a summary:
The last time we all agreed that long term we want to get rid of ib_cache,
which will solve all kinds of coherency issues.
So the ipoib patch is a minimal patch to do this with respect to pkey, fixing
the bug Voltaire is seeing with their partitioning setup.

The core patch just adds helpers for this bit, but since query_port can't be
cached by the provider (it gives e.g. physical port state), it seemed
worthwhile to query table lengths at startup only, rather than have each
ib_find_pkey call re-do this.

Yosef also has more patches cooking to remove the rest of the cache usage
and speed up query_gid/query_pkey in providers, and also clean the pkey polling
thread in ipoib, but it seemed like a good idea to have the bugfix out first.
-- MST From pradeeps at linux.vnet.ibm.com Mon May 14 18:21:32 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Mon, 14 May 2007 18:21:32 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <20070512200635.GB5908@mellanox.co.il> References: <4641E99B.10706@linux.vnet.ibm.com> <46438DF2.3080601@linux.vnet.ibm.com> <20070511133639.GD30092@mellanox.co.il> <4644C1D2.6040103@linux.vnet.ibm.com> <20070512200635.GB5908@mellanox.co.il> Message-ID: <46490B1C.20808@linux.vnet.ibm.com> Michael S. Tsirkin wrote: >> Quoting Pradeep Satyanarayana : >> Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review >> >> Michael S. Tsirkin wrote: >>>> Quoting Pradeep Satyanarayana : >>>> Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review >>>> >>>> If there are no other issues than the small restructure suggestion that >>>> Michael had, can this patch be merged into the for-2.6.22 tree? >>> I'm not sure. >>> >>> I haven't the time, at the moment, to go over the patch again in depth. >>> Have the issues from this message been addressed? >>> >>> http://www.mail-archive.com/general at lists.openfabrics.org/msg02056.html >>> >>> Just a quick review, it seems that two most important issues have >>> apparently not been addressed yet: >>> >>> 1. Testing device SRQ capability twice on each RX packet is just too ugly, >>> and it *should* be possible to structure the code >>> by separating common functionality in separate >>> functions instead of scattering if (srq) tests around. >> I have restructured the code as suggested. In the latest code, there are >> only two places where SRQ capability is tested upon receipt of a packet: >> a) ipoib_cm_handle_rx_wc() >> b)ipoib_cm_post_receive() >> >> Instead of the suggested change to ipoib_cm_handle_rx_packet() it is >> possible to change ipoib_cm_post_receive() and call the srq and nosrq >> versions directly, without mangling the code. 
However, I do not believe >> that this should stop the code from being merged. This can >> be handled as a separate patch. > > I actually suggested implementing separate poll routines > for srq and non-srq code. This way we won't have *any* if(srq) > tests on datapath. Right, I remember you suggested that. From a maintainability perspective I use as much common code as possible. Therefore I did not implement separate polling routines as suggested. So, it boils down to one if(srq) in the data path. I really do not think that should be a point of contention. > >>> 2. Once the number of created connections exceeds >>> the constant that you allow, all attempts to communicate >>> with this node over IP over IB will fail. >>> A way needs to be designed to switch to the datagram mode, >>> and to retry going back to connected after some time. >>> [We actually have this theoretical issue in SRQ >>> as well - it is just much more severe in the nonSRQ case]. >> Firstly, this has now been changed to send a REJ message to the remote >> side indicating that there are no more free QPs. > > Since the HCA actually has free QPs - you are actually running out of buffers that > you are ready to prepost - one might argue about whether this is spec compliant > behaviour. This is something that might better be checked with the IBTA. > >> It is up to the application >> to handle the situation. > > The application here being kernel IP over IB, it currently handles the > reject by dropping outstanding packets and retrying the connection on the next > packet to this dst. So the specific node might be denied connectivity > potentially forever. When I stated application, I did not mean IPoIB. I meant the user level app. Yes, the app will keep on retrying to establish a connection to the specified node using Connected Mode and then subsequently time out. See more comments below. > >> Previously, this was flagged as an error that >> appeared in /var/log/messages.
>> >> However, here are a few other things we need to consider. Let us >> compute the amount of memory consumed when we run into this situation: >> >> In CM mode we use 64K packets. Assuming the rx_ring has 256 entries and >> the current limitation of 1024 QPs, NOSRQ alone will consume 16GB of >> memory. All else remaining the same, if we change the rx_ring size to >> 1024, NOSRQ will consume 64GB of memory. >> >> This is huge and my guess is that on most systems, the application will >> run out of memory before it runs out of RC QPs (with NOSRQ). >> >> Aside from this I would like to understand how we would switch just the >> "current" QP to datagram mode; we would not want to switch all the >> existing QPs to datagram mode - that would be unacceptable. Also, we >> should not prevent subsequent connections using RC QPs. Is there >> anything in the IB spec about this? > > Yes, this might need a solution at the protocol level, as you indicate above. I thought through this some more and I do not believe that this is such a good idea (i.e. switching to datagram mode). The app (user level) is expecting to use RC and we silently (or even with warnings) switch to UD mode - I do not think that is appropriate. The app should time out or be returned an error and maybe the app can switch to using another node that has the requested resources. The onus is on the user level app to take appropriate action. The equivalent situation in a non IB environment would be when the recipient node has no more memory to respond to an arp request. The app receives a "node unreachable" message. Therefore I am inclined to say we should leave this as is. > >> I think solving this is a fairly big issue and not just specific to >> NOSRQ. NOSRQ is just exacerbating the situation. This can be dealt with >> all at once with SRQ and NOSRQ, if need be. > > IMO, the memory scalability issue is specific to your code. > With current code using shared RQ, each connection needs > an order of 1KByte of memory.
So we need just 10MByte > for a typical 10000 node cluster. > Right, I have always maintained that NOSRQ is indeed a memory hog. I think we must revisit this memory computation for the srq case too - I would say the receive buffers consumed would be 64K (packet size) * 1000 (srq_ring_size), which is 64MBytes, irrespective of the number of nodes in the cluster. However, the question that is still unanswered (at least in my mind) is, will 1000 buffers be sufficient to support a 10,000 or even a 1000 node cluster. On just a 2 node cluster (using UD) we had seen previously that a receiveq_size of 256 was inadequate. I would guess even in the SRQ case that would be true. To support large clusters one will run into memory issues even in the SRQ case, but it will occur much sooner in the NOSRQ case. >> Hence, I do not see these as impediments to the merge. > > In my humble opinion, we need a handle on the scalability issue > (other than crashing or denying service) before merging this, > otherwise IBM will be the first to object to making connected mode the default. I will seek opinions from folks who use applications on large clusters within IBM. I have always stated that NOSRQ should be used only when there are a handful or at most a few dozen clusters. I will try and make this well known so that this does not come as a surprise. Pradeep From mst at dev.mellanox.co.il Mon May 14 23:26:46 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 15 May 2007 09:26:46 +0300 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <46490B1C.20808@linux.vnet.ibm.com> References: <4641E99B.10706@linux.vnet.ibm.com> <46438DF2.3080601@linux.vnet.ibm.com> <20070511133639.GD30092@mellanox.co.il> <4644C1D2.6040103@linux.vnet.ibm.com> <20070512200635.GB5908@mellanox.co.il> <46490B1C.20808@linux.vnet.ibm.com> Message-ID: <20070515062646.GD5437@mellanox.co.il> > Quoting Pradeep Satyanarayana : > Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review > > Michael S. Tsirkin wrote: > >>Quoting Pradeep Satyanarayana : > >>Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review > >> > >>Michael S. Tsirkin wrote: > >>>>Quoting Pradeep Satyanarayana : > >>>>Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review > >>>> > >>>>If there are no other issues than the small restructure suggestion that > >>>>Michael had, can this patch be merged into the for-2.6.22 tree? > >>>I'm not sure. > >>> > >>>I haven't the time, at the moment, to go over the patch again in depth. > >>>Have the issues from this message been addressed? > >>> > >>>http://www.mail-archive.com/general at lists.openfabrics.org/msg02056.html > >>> > >>>Just a quick review, it seems that two most important issues have > >>>apparently not been addressed yet: > >>> > >>>1. Testing device SRQ capability twice on each RX packet is just too > >>>ugly, > >>> and it *should* be possible to structure the code > >>> by separating common functionality in separate > >>> functions instead of scattering if (srq) tests around. > >>I have restructured the code as suggested. 
In the latest code, there are > >>only two places where SRQ capability is tested upon receipt of a packet: > >>a) ipoib_cm_handle_rx_wc() > >>b) ipoib_cm_post_receive() > >> > >>Instead of the suggested change to ipoib_cm_handle_rx_packet() it is > >>possible to change ipoib_cm_post_receive() and call the srq and nosrq > >>versions directly, without mangling the code. However, I do not believe > >>that this should stop the code from being merged. This can > >>be handled as a separate patch. > > > >I actually suggested implementing separate poll routines > >for srq and non-srq code. This way we won't have *any* if(srq) > >tests on datapath. > > Right, I remember you suggested that. From a maintainability perspective > I use as much common code as possible. Sprinkling if (srq) all over the code is not necessarily the best way to reuse code. Moving common code to separate functions is a better way IMO. > Therefore I did not implement > separate polling routines as suggested. So, it boils down to one if(srq) > in the data path. Which patch are we discussing? Patch V4 has 3 of these on data path. The one in alloc_rx_skb also seems to be open-coded - bad for cache usage. > I really do not think that should be a point of contention. True, it's not a *major* point, scalability is still a larger issue. But IMO fixing this would make the patch less ugly. > >>>2. Once the number of created connections exceeds > >>> the constant that you allow, all attempts to communicate > >>> with this node over IP over IB will fail. > >>> A way needs to be designed to switch to the datagram mode, > >>> and to retry going back to connected after some time. > >>> [We actually have this theoretical issue in SRQ > >>> as well - it is just much more severe in the nonSRQ case]. > >>Firstly, this has now been changed to send a REJ message to the remote > >>side indicating that there are no more free QPs.
> > > >Since the HCA actually has free QPs - you are actually running out of buffers > >that you are ready to prepost - one might argue about whether this is spec > >compliant behaviour. This is something that might better be checked with > >the IBTA. > > > >>It is up to the application to handle the situation. > > > >The application here being kernel IP over IB, it currently handles the > >reject by dropping outstanding packets and retrying the connection on the > >next packet to this dst. So the specific node might be denied connectivity > >potentially forever. > > When I stated application, I did not mean IPoIB. I meant the user level > app. Yes, the app will keep on retrying to establish connection to the > specified node using Connected Mode and then subsequently time out. So, how would an application handle the situation? > See more comments below. > > > > >>Previously, this was flagged as an error that > >>appeared in /var/log/messages. > >> > >>However, here are a few other things we need to consider. Let us > >>compute the amount of memory consumed when we run into this situation: > >> > >>In CM mode we use 64K packets. Assuming the rx_ring has 256 entries and > >>the current limitation of 1024 QPs, NOSRQ alone will consume 16GB of > >>memory. All else remaining the same, if we change the rx_ring size to > >>1024, NOSRQ will consume 64GB of memory. > >> > >>This is huge and my guess is that on most systems, the application will > >>run out of memory before it runs out of RC QPs (with NOSRQ). > >> > >>Aside from this I would like to understand how we would switch just the > >>"current" QP to datagram mode; we would not want to switch all the > >>existing QPs to datagram mode - that would be unacceptable. Also, we > >>should not prevent subsequent connections using RC QPs. Is there > >>anything in the IB spec about this? > > > >Yes, this might need a solution at the protocol level, as you indicate > >above.
> > I thought through this some more and I do not believe that this is such > a good idea (i.e. switching to datagram mode). The app (user level) is > expecting to use RC and we silently (or even with warnings) switch to > UD mode - I do not think that is appropriate. Which app is this? > The app should time out or be returned an error and maybe the app can > switch to using another node that has the requested resources. The onus > is on the user level app to take appropriate action. Most applications can't do this, however. So your patch will break them. > The equivalent situation in a non IB environment would be when the > recipient node has no more memory to respond to an arp request. The > app receives a "node unreachable" message. I think you mean that on a TCP socket, connect will return ENETUNREACH, rather than a message? But since ARP is normally using multicast, if the remote won't accept connections, this is *not* what will happen here, is it? > Therefore I am inclined to say we should leave this as is. The main difference is that on a LAN, arp timeouts don't really occur too often in practice - they are sufficiently rare that lots of applications regard TCP errors as a "node is down" indication. > > > >>I think solving this is a fairly big issue and not just specific to > >>NOSRQ. NOSRQ is just exacerbating the situation. This can be dealt with > >>all at once with SRQ and NOSRQ, if need be. > > > >IMO, the memory scalability issue is specific to your code. > >With current code using shared RQ, each connection needs > >an order of 1KByte of memory. So we need just 10MByte > >for a typical 10000 node cluster. > > > > Right, I have always maintained that NOSRQ is indeed a memory hog. I > think we must revisit this memory computation for the srq case too - > I would say the receive buffers consumed would be 64K (packet size) * > 1000 (srq_ring_size), which is 64MBytes, irrespective of the > number of nodes in the cluster.
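However, the question that is still > unanswered (at least in my mind) is, will 1000 buffers be sufficient > to support a 10,000 or even a 1000 node cluster. On just a 2 node > cluster (using UD) we had seen previously that a receiveq_size of 256 > was inadequate. You should distinguish between occasional packet drops due to RQ overrun, which happens all the time on the internet, so protocols are built to handle it, and dropping *all* packets to a specific destination, which is a quality of implementation issue. > I would guess even in the SRQ case that would be true. Less likely, since each buffer is 32 times larger now. Further, with SRQ we can auto-tune the buffer size by using watermark events. Stay tuned. > To support large clusters one will run into memory issues even in the > SRQ case, I don't really think IPoIB with SRQ will run into memory issues even with large clusters. > but it will occur much sooner in the NOSRQ case. > > >>Hence, I do not see these as impediments to the merge. > > > >In my humble opinion, we need a handle on the scalability issue > >(other than crashing or denying service) before merging this, > >otherwise IBM will be the first to object to making connected mode the > >default. > > I will seek the opinion from folks who use applications on large > clusters within IBM. I have always stated that NOSRQ should be used > only when there are a handful or at most a few dozen clusters. I will > try and make this well known so that this does not come as a surprise. One of my targets is to make connected mode the default, eventually. My concern is that if enabling connected mode breaks applications, as your patch does, people will be afraid to turn it on. -- MST From ianjiang.ict at gmail.com Tue May 15 00:12:39 2007 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Tue, 15 May 2007 15:12:39 +0800 Subject: [ofa-general] [SRPT]multiple initiators supported? In-Reply-To: <4648948F.5000802@mellanox.com> References: <7b2fa1820705120247t1b232345w8373bb72416c5b28@mail.gmail.com> <4648948F.5000802@mellanox.com> Message-ID: <7b2fa1820705150012l743b817fn1eefeaca290789a@mail.gmail.com> Hi Vu, Thanks for your reply. But something goes wrong when I use two initiators. Two initiators and one target are on three separate nodes. The first initiator connected to the target correctly. However, the second one was aborted 1 minute after its login, and then required to *reset_host*, but it failed to send the CM Connect Request when trying to reconnect to the target. Here are the logs of the second initiator: May 15 13:58:59 cluster5 kernel: ib_srp: new target: id_ext 0002c90200206bd8 ioc_guid 0002c90200206bd8 pkey ffff service_id 0002c90200206bd8 dgid fe80:0000:0000:0000:0002:c902:0020:6bd9 May 15 13:58:59 cluster5 kernel: scsi2 : SRP.T10:0002C90200206BD8 May 15 13:58:59 cluster5 kernel: Vendor: SCST_FIO Model: fdisk_128M Rev: 095 May 15 13:58:59 cluster5 kernel: Type: Direct-Access ANSI SCSI revision: 04 May 15 13:58:59 cluster5 kernel: SCSI device sdb: 262144 512-byte hdwr sectors (134 MB) May 15 13:58:59 cluster5 kernel: sdb: Write Protect is off May 15 13:58:59 cluster5 kernel: sdb: Mode Sense: 6b 00 10 08 May 15 13:58:59 cluster5 kernel: SCSI device sdb: drive cache: write back w/ FUA May 15 13:58:59 cluster5 kernel: SCSI device sdb: 262144 512-byte hdwr sectors (134 MB) May 15 13:58:59 cluster5 kernel: sdb: Write Protect is off May 15 13:58:59 cluster5 kernel: sdb: Mode Sense: 6b 00 10 08 May 15 13:58:59 cluster5 kernel: SCSI device sdb: drive cache: write back w/ FUA May 15 13:58:59 cluster5 kernel: sdb: unknown partition table May 15 13:58:59 cluster5 kernel: sd 2:0:0:0: Attached scsi disk sdb May 15 13:58:59 cluster5 kernel: sd 2:0:0:0: Attached scsi generic sg1 type 0 May 15 13:59:59 cluster5 kernel: SRP abort called May 15 13:59:59 cluster5 kernel: SRP reset_device called May 15 14:00:29 cluster5 kernel: SRP abort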
In-Reply-To: <4648948F.5000802@mellanox.com> References: <7b2fa1820705120247t1b232345w8373bb72416c5b28@mail.gmail.com> <4648948F.5000802@mellanox.com> Message-ID: <7b2fa1820705150012l743b817fn1eefeaca290789a@mail.gmail.com> Hi Vu, Thanks for your replay. But I have got something wrong when using two initiators. Two initiators and one target are at three separated nodes. The first initiator connected to the target correctly. However, the second one was aborted 1 minute after its login, and then required to *reset_host*, but it failed to send the CM Connect Request when trying to reconnect to the target. Here are the logs of the second initiator: May 15 13:58:59 cluster5 kernel: ib_srp: new target: id_ext 0002c90200206bd8 ioc_guid 0002c90200206bd8 pkey ffff service_id 0002c90200206bd8 dgid fe80:0000:0000:0000:0002:c902:0020:6bd9 May 15 13:58:59 cluster5 kernel: scsi2 : SRP.T10:0002C90200206BD8 May 15 13:58:59 cluster5 kernel: Vendor: SCST_FIO Model: fdisk_128M Rev: 095 May 15 13:58:59 cluster5 kernel: Type: Direct-Access ANSI SCSI revision: 04 May 15 13:58:59 cluster5 kernel: SCSI device sdb: 262144 512-byte hdwr sectors (134 MB) May 15 13:58:59 cluster5 kernel: sdb: Write Protect is off May 15 13:58:59 cluster5 kernel: sdb: Mode Sense: 6b 00 10 08 May 15 13:58:59 cluster5 kernel: SCSI device sdb: drive cache: write back w/ FUA May 15 13:58:59 cluster5 kernel: SCSI device sdb: 262144 512-byte hdwr sectors (134 MB) May 15 13:58:59 cluster5 kernel: sdb: Write Protect is off May 15 13:58:59 cluster5 kernel: sdb: Mode Sense: 6b 00 10 08 May 15 13:58:59 cluster5 kernel: SCSI device sdb: drive cache: write back w/ FUA May 15 13:58:59 cluster5 kernel: sdb: unknown partition table May 15 13:58:59 cluster5 kernel: sd 2:0:0:0: Attached scsi disk sdb May 15 13:58:59 cluster5 kernel: sd 2:0:0:0: Attached scsi generic sg1 type 0 May 15 13:59:59 cluster5 kernel: SRP abort called May 15 13:59:59 cluster5 kernel: SRP reset_device called May 15 14:00:29 cluster5 kernel: SRP abort 
called May 15 14:00:34 cluster5 kernel: ib_srp: SRP reset_host called May 15 14:00:36 cluster5 kernel: ib_srp: connection closed May 15 14:02:15 cluster5 kernel: ib_srp: Sending CM REQ failed May 15 14:02:15 cluster5 kernel: ib_srp: reconnect failed (-104), removing target port. May 15 14:02:15 cluster5 kernel: sd 2:0:0:0: scsi: Device offlined - not ready after error recovery May 15 14:02:15 cluster5 kernel: sd 2:0:0:0: rejecting I/O to offline device May 15 14:02:15 cluster5 kernel: Buffer I/O error on device sdb, logical block 32760 May 15 14:02:15 cluster5 kernel: 2:0:0:0: rejecting I/O to dead device May 15 14:02:15 cluster5 kernel: Buffer I/O error on device sdb, logical block 32760 Here are the logs of the target during the second initiator's connection. It seemed that it did not receive the reconnect request. May 15 13:57:27 cluster4 kernel: ib_srpt: Host i_port_id=0x100000000000000:0xcc6b200002c90200 login with t_port_id=0xd86b200002c90200:0xd86b200002c90200 it_iu_len=260 May 15 13:57:27 cluster4 kernel: ib_srpt: srpt_create_ch_ib[1105] max_cqe= 255 max_sge= 29 cm_id= da9b7200 May 15 13:57:27 cluster4 kernel: [3966]: scst_init_session:scst: Name 0x00000000000000010002c90200206bcc not found, using default group May 15 13:57:27 cluster4 kernel: [3966]: scst_alloc_add_tgt_dev:Virtual device SCST lun=0 May 15 13:57:27 cluster4 kernel: [3966]: tm_dbg_init_tgt_dev:LUN 0 connected from initiator ib_srpt is under TM debugging May 15 13:57:27 cluster4 kernel: ib_srpt: Establish connection sess= c9a677a8 name= 0x00000000000000010002c90200206bcc cm_id= da9b7200 May 15 13:57:27 cluster4 kernel: [3964]: scst_set_pending_UA:Setting pending UA cmd dabb3ec0 May 15 13:57:27 cluster4 kernel: [3964]: tm_dbg_delay_cmd:tm_dbg_delay_cmd: delaying timed cmd dabb3ec0 (tag 35) for 60.96 seconds (15241 HZ) May 15 13:58:27 cluster4 kernel: ib_srpt: recv_tsk_mgmt= 1 for task_tag= 35 using tag= 163 cm_id= da9b7200 sess= c9a677a8 May 15 13:58:27 cluster4 kernel: [0]: 
scst_rx_mgmt_fn_tag:sess=c9a677a8, tag=35 May 15 13:58:27 cluster4 kernel: [0]: scst_post_rx_mgmt_cmd:Adding mgmt cmd c70486a0 to active mgmt cmd list May 15 13:58:27 cluster4 kernel: [3965]: scst_mgmt_cmd_thread:Moving mgmt cmd c70486a0 to mgmt cmd list May 15 13:58:27 cluster4 kernel: [3965]: scst_mgmt_cmd_init:Cmd dabb3ec0 for tag 35 (sn 35) found, aborting it May 15 13:58:27 cluster4 kernel: [3965]: scst_abort_cmd:Aborting cmd dabb3ec0 (tag 35) May 15 13:58:27 cluster4 kernel: [3965]: scst_call_dev_task_mgmt_fn:Calling dev handler disk_fileio task_mgmt_fn(fn=0) May 15 13:58:27 cluster4 kernel: [3965]: scst_call_dev_task_mgmt_fn:Dev handler disk_fileio task_mgmt_fn() returned 0 May 15 13:58:27 cluster4 kernel: [3965]: tm_dbg_release_cmd:Abort request for delayed cmd dabb3ec0 (tag=35), moving it to active cmd list (delayed_cmds_count=1) May 15 13:58:27 cluster4 kernel: ib_srpt: srpt_tsk_mgmt_done[1972] tsk_mgmt_done for tag= 163 status=0 May 15 13:58:27 cluster4 kernel: [3965]: scst_mgmt_cmd_send_done:Dev handler ib_srpt task_mgmt_fn_done() returned May 15 13:58:27 cluster4 kernel: ib_srpt: recv_tsk_mgmt= 8 for task_tag= 35 using tag= 163 cm_id= da9b7200 sess= c9a677a8 May 15 13:58:27 cluster4 kernel: [3964]: tm_dbg_check_cmd:Processing delayed cmd dabb3ec0 (tag 35), delayed_cmds_count=1 May 15 13:58:27 cluster4 kernel: [3964]: tm_dbg_change_state:Deleting timer May 15 13:58:27 cluster4 kernel: ib_srpt: srpt_xmit_response[1898] tag= 35 already get aborted May 15 13:58:57 cluster4 kernel: ib_srpt: recv_tsk_mgmt= 1 for task_tag= 36 using tag= 164 cm_id= da9b7200 sess= c9a677a8 May 15 13:58:57 cluster4 kernel: [0]: scst_rx_mgmt_fn_tag:sess=c9a677a8, tag=36 May 15 13:58:57 cluster4 kernel: [0]: scst_post_rx_mgmt_cmd:Adding mgmt cmd c7048240 to active mgmt cmd list May 15 13:58:57 cluster4 kernel: [3965]: scst_mgmt_cmd_thread:Moving mgmt cmd c7048240 to mgmt cmd list May 15 13:58:57 cluster4 kernel: [3965]: scst_mgmt_cmd_init:Cmd dabb3050 for tag 36 (sn 36) found, 
aborting it May 15 13:58:57 cluster4 kernel: [3965]: scst_abort_cmd:Aborting cmd dabb3050 (tag 36) May 15 13:58:57 cluster4 kernel: [3965]: scst_call_dev_task_mgmt_fn:Calling dev handler disk_fileio task_mgmt_fn(fn=0) May 15 13:58:57 cluster4 kernel: [3965]: scst_call_dev_task_mgmt_fn:Dev handler disk_fileio task_mgmt_fn() returned 0 May 15 13:58:57 cluster4 kernel: [3965]: scst_abort_cmd:cmd dabb3050 (tag 36) being executed/xmitted (state 12), deferring ABORT... May 15 13:58:57 cluster4 kernel: [3965]: scst_set_mcmd_next_state:cmd_wait_count(1) not 0, preparing to wait May 15 13:59:02 cluster4 kernel: ib_srpt: srpt_cm_dreq_recv[1523] cm_id= da9b7200 ch->state= 1 May 15 13:59:04 cluster4 kernel: ib_srpt: srpt_cm_timewait_exit[1502] cm_id= da9b7200 May 15 13:59:04 cluster4 kernel: ib_srpt: srpt_release_channel[1154] Release channel cm_id= da9b7200 May 15 13:59:04 cluster4 kernel: ib_srpt: srpt_release_channel[1159] Release sess= c9a677a8 sess_name= 0x00000000000000010002c90200206bcc May 15 13:59:21 cluster4 kernel: ib_srpt: failed send status= 12 May 15 13:59:21 cluster4 kernel: [0]: scst_complete_cmd_mgmt:cmd dabb3050 completed (tag 36, mcmd c7048240, mcmd->cmd_wait_count 1) May 15 13:59:21 cluster4 kernel: [0]: scst_complete_cmd_mgmt:Moving mgmt cmd c7048240 to active mgmt cmd list May 15 13:59:21 cluster4 kernel: [3965]: scst_mgmt_cmd_thread:Moving mgmt cmd c7048240 to mgmt cmd list May 15 13:59:21 cluster4 kernel: ib_srpt: srpt_tsk_mgmt_done[1972] tsk_mgmt_done for tag= 164 status=-1 May 15 13:59:21 cluster4 kernel: [3965]: scst_mgmt_cmd_send_done:Dev handler ib_srpt task_mgmt_fn_done() returned May 15 13:59:21 cluster4 kernel: ib_srpt: srpt_unregister_session_done[1143] sess= c9a677a8 May 15 13:59:21 cluster4 kernel: [3966]: scst_free_all_UA:Clearing UA for tgt_dev lun 0 May 15 13:59:21 cluster4 kernel: ib_srpt: failed send status= 5 May 15 13:59:21 cluster4 kernel: ib_srpt: QP event 16 on cm_id= da9b7200 sess_name= 0x00000000000000010002c90200206bcc state= 2 I 
have no idea why the *abort* was called at the second initiator. Could you please offer some suggestions? Thanks a lot! On 5/15/07, Vu Pham wrote: > Ian Jiang wrote: > > Does the SRP target support multiple initiators? > > Yes, it does. > > > > I am using the SRP initiator and IB drivers in linux-2.6.20. > > The SRP target is at > > http://www.openfabrics.org/git/?p=~vu/srpt.git;a=summary > > and the IB driver is OFED-1.1 with linux-2.6.16.13-4-default of Suse-10.1. > > -- Ian Jiang From vlad at lists.openfabrics.org Tue May 15 02:29:54 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 15 May 2007 02:29:54 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070515-0200 daily build status Message-ID: <20070515092955.65D46E60821@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.16 Passed on ppc64 with
linux-2.6.19 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.15 Failed: From bs at q-leap.de Tue May 15 02:48:50 2007 From: bs at q-leap.de (Bernd Schubert) Date: Tue, 15 May 2007 11:48:50 +0200 Subject: [ofa-general] possible irq lock inversion dependency detected Message-ID: <200705151148.50607.bs@q-leap.de> Hi, with 2.6.20 I get this message: [263206.999448] ========================================================= [263207.007607] [ INFO: possible irq lock inversion dependency detected ] [263207.014153] 2.6.20.3-debug #9 [263207.017230] --------------------------------------------------------- [263207.023775] ipoib/6662 just changed the state of lock: [263207.029020] (&idev->n_mcast_grps_lock){-...}, at: [_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath] [263207.039866] but this lock was taken by another, hard-irq-safe lock in the past: [263207.047294] (mcast_lock){++..} [263207.050380] [263207.050381] and interrupts could create inverse lock ordering between them. 
[263207.050382] [263207.060846] [263207.060847] other info that might help us debug this: [263207.067609] 1 lock held by ipoib/6662: [263207.071468] #0: (&priv->mcast_mutex){--..}, at: [mutex_lock+35/39] mutex_lock+0x23/0x27 [263207.080061] [263207.080062] the first lock's dependencies: [263207.085862] -> (&idev->n_mcast_grps_lock){-...} ops: 3 { [263207.091371] initial-use at: [263207.094647] [mark_lock+135/1127] mark_lock+0x87/0x467 [263207.104352] [__lock_acquire+1476/3168] __lock_acquire+0x5c4/0xc60 [263207.114571] [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath] [263207.126446] [lock_acquire+124/160] lock_acquire+0x7c/0xa0 [263207.136317] [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath] [263207.148192] [_spin_lock+34/46] _spin_lock+0x22/0x2e [263207.157891] [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath] [263207.169766] [_end+122587907/2124917936] ib_attach_mcast+0x2f/0x31 [ib_core] [263207.187046] [_end+123887919/2124917936] ipoib_mcast_attach+0xe3/0x123 [ib_ipoib] [263207.198481] [_end+123880632/2124917936] ipoib_mcast_join_finish+0x125/0x3fa [ib_ipoib] [263207.210441] [check_usage+53/661] check_usage+0x35/0x295 [263207.220328] [lock_timer_base+35/72] lock_timer_base+0x23/0x48 [263207.230454] [__lock_acquire+3080/3168] __lock_acquire+0xc08/0xc60 [263207.240674] [_end+123881884/2124917936] ipoib_mcast_join_complete+0xcb/0x31a [ib_ipoib] [263207.252720] [_read_unlock_irqrestore+56/71] _read_unlock_irqrestore+0x38/0x47 [263207.263544] [_end+122585085/2124917936] ib_unpack+0xad/0x11c [ib_core] [263207.274117] [_end+123844451/2124917936] ib_sa_mcmember_rec_callback+0x4c/0x57 [ib_sa] [263207.285983] [_end+123844738/2124917936] recv_handler+0x3f/0x4b [ib_sa] [263207.296547] [_end+123718406/2124917936] ib_mad_completion_handler+0x3c7/0x59b [ib_mad] [263207.308511] [run_workqueue+134/380] run_workqueue+0x86/0x17c [263207.318552] [_end+123717439/2124917936] ib_mad_completion_handler+0x0/0x59b 
[ib_mad] [263207.330333] [run_workqueue+177/380] run_workqueue+0xb1/0x17c [263207.340377] [worker_thread+0/349] worker_thread+0x0/0x15d [263207.350336] [keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a [263207.360983] [worker_thread+294/349] worker_thread+0x126/0x15d [263207.371113] [default_wake_function+0/15] default_wake_function+0x0/0xf [263207.381584] [default_wake_function+0/15] default_wake_function+0x0/0xf [263207.392063] [worker_thread+0/349] worker_thread+0x0/0x15d [263207.402020] [kthread+208/252] kthread+0xd0/0xfc [263207.411459] [child_rip+10/18] child_rip+0xa/0x12 [263207.420985] [_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d [263207.431204] [restore_args+0/48] restore_args+0x0/0x30 [263207.440998] [kthread+0/252] kthread+0x0/0xfc [263207.450344] [child_rip+0/18] child_rip+0x0/0x12 [263207.459862] [] 0xffffffffffffffff [263207.469386] hardirq-on-W at: [263207.472661] [mark_lock+135/1127] mark_lock+0x87/0x467 [263207.482357] [__lock_acquire+1412/3168] __lock_acquire+0x584/0xc60 [263207.492567] [kfree+525/541] kfree+0x20d/0x21d [263207.501996] [_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath] [263207.513865] [lock_acquire+124/160] lock_acquire+0x7c/0xa0 [263207.523735] [_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath] [263207.535601] [_spin_lock+34/46] _spin_lock+0x22/0x2e [263207.545298] [_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath] [263207.557167] [debug_mutex_free_waiter+88/92] debug_mutex_free_waiter+0x58/0x5c [263207.567990] [__mutex_lock_slowpath+624/637] __mutex_lock_slowpath+0x270/0x27d [263207.578814] [wait_for_completion+189/198] wait_for_completion+0xbd/0xc6 [263207.589292] [_end+122587956/2124917936] ib_detach_mcast+0x2f/0x33 [ib_core] [263207.600298] [_end+123888044/2124917936] ipoib_mcast_detach+0x3d/0x6e [ib_ipoib] [263207.611650] [_end+123884728/2124917936] ipoib_mcast_leave+0x12d/0x1c8 [ib_ipoib] [263207.623091] [_end+123886246/2124917936] 
ipoib_mcast_dev_flush+0x100/0x14e [ib_ipoib] [263207.634877] [_end+123886283/2124917936] ipoib_mcast_dev_flush+0x125/0x14e [ib_ipoib] [263207.646661] [_end+123878948/2124917936] ipoib_ib_dev_flush+0x0/0x11f [ib_ipoib] [263207.658013] [_end+123877784/2124917936] ipoib_ib_dev_down+0xa8/0xb7 [ib_ipoib] [263207.669300] [_end+123879087/2124917936] ipoib_ib_dev_flush+0x8b/0x11f [ib_ipoib] [263207.680739] [run_workqueue+177/380] run_workqueue+0xb1/0x17c [263207.690781] [worker_thread+0/349] worker_thread+0x0/0x15d [263207.700732] [keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a [263207.711393] [worker_thread+294/349] worker_thread+0x126/0x15d [263207.721523] [default_wake_function+0/15] default_wake_function+0x0/0xf [263207.732000] [default_wake_function+0/15] default_wake_function+0x0/0xf [263207.742478] [worker_thread+0/349] worker_thread+0x0/0x15d [263207.752436] [kthread+208/252] kthread+0xd0/0xfc [263207.761874] [child_rip+10/18] child_rip+0xa/0x12 [263207.771393] [_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d [263207.781603] [restore_args+0/48] restore_args+0x0/0x30 [263207.791388] [kthread+0/252] kthread+0x0/0xfc [263207.800730] [child_rip+0/18] child_rip+0x0/0x12 [263207.810251] [] 0xffffffffffffffff [263207.819776] } [263207.821548] ... key at: [_end+122893256/2124917936] __key.5+0x0/0xfffffffffffe369f [ib_ipath] [263207.830154] [263207.830155] the second lock's dependencies: [263207.836049] -> (mcast_lock){++..} ops: 15329 { [263207.840692] initial-use at: [263207.843966] [] 0xffffffffffffffff [263207.853492] in-hardirq-W at: [263207.856757] [] 0xffffffffffffffff [263207.866283] in-softirq-W at: [263207.869550] [] 0xffffffffffffffff [263207.879071] } [263207.880843] ... 
key at: [_end+122888936/2124917936] mcast_lock+0x18/0xfffffffffffe4797 [ib_ipath] [263207.889798] -> (&idev->n_mcast_grps_lock){-...} ops: 3 { [263207.895398] initial-use at: [263207.898758] [mark_lock+135/1127] mark_lock+0x87/0x467 [263207.908679] [__lock_acquire+1476/3168] __lock_acquire+0x5c4/0xc60 [263207.919118] [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath] [263207.931220] [lock_acquire+124/160] lock_acquire+0x7c/0xa0 [263207.941313] [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath] [263207.953414] [_spin_lock+34/46] _spin_lock+0x22/0x2e [263207.963335] [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath] [263207.975429] [_end+122587907/2124917936] ib_attach_mcast+0x2f/0x31 [ib_core] [263207.986661] [_end+123887919/2124917936] ipoib_mcast_attach+0xe3/0x123 [ib_ipoib] [263207.998334] [_end+123880632/2124917936] ipoib_mcast_join_finish+0x125/0x3fa [ib_ipoib] [263208.010527] [check_usage+53/661] check_usage+0x35/0x295 [263208.020622] [lock_timer_base+35/72] lock_timer_base+0x23/0x48 [263208.030971] [__lock_acquire+3080/3168] __lock_acquire+0xc08/0xc60 [263208.041409] [_end+123881884/2124917936] ipoib_mcast_join_complete+0xcb/0x31a [ib_ipoib] [263208.053680] [_read_unlock_irqrestore+56/71] _read_unlock_irqrestore+0x38/0x47 [263208.064729] [_end+122585085/2124917936] ib_unpack+0xad/0x11c [ib_core] [263208.075526] [_end+123844451/2124917936] ib_sa_mcmember_rec_callback+0x4c/0x57 [ib_sa] [263208.087611] [_end+123844738/2124917936] recv_handler+0x3f/0x4b [ib_sa] [263208.098399] [_end+123718406/2124917936] ib_mad_completion_handler+0x3c7/0x59b [ib_mad] [263208.110585] [run_workqueue+134/380] run_workqueue+0x86/0x17c [263208.127122] [_end+123717439/2124917936] ib_mad_completion_handler+0x0/0x59b [ib_mad] [263208.139119] [run_workqueue+177/380] run_workqueue+0xb1/0x17c [263208.149389] [worker_thread+0/349] worker_thread+0x0/0x15d [263208.159571] [keventd_create_kthread+0/122] 
keventd_create_kthread+0x0/0x7a [263208.170456] [worker_thread+294/349] worker_thread+0x126/0x15d [263208.180814] [default_wake_function+0/15] default_wake_function+0x0/0xf [263208.191516] [default_wake_function+0/15] default_wake_function+0x0/0xf [263208.202231] [worker_thread+0/349] worker_thread+0x0/0x15d [263208.212414] [kthread+208/252] kthread+0xd0/0xfc [263208.222076] [child_rip+10/18] child_rip+0xa/0x12 [263208.231818] [_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d [263208.242254] [restore_args+0/48] restore_args+0x0/0x30 [263208.252263] [kthread+0/252] kthread+0x0/0xfc [263208.261833] [child_rip+0/18] child_rip+0x0/0x12 [263208.271575] [] 0xffffffffffffffff [263208.281317] hardirq-on-W at: [263208.284671] [mark_lock+135/1127] mark_lock+0x87/0x467 [263208.294593] [__lock_acquire+1412/3168] __lock_acquire+0x584/0xc60 [263208.305037] [kfree+525/541] kfree+0x20d/0x21d [263208.314692] [_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath] [263208.326786] [lock_acquire+124/160] lock_acquire+0x7c/0xa0 [263208.336880] [_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath] [263208.348974] [_spin_lock+34/46] _spin_lock+0x22/0x2e [263208.358896] [_end+122750808/2124917936] ipath_multicast_detach+0x211/0x231 [ib_ipath] [263208.370987] [debug_mutex_free_waiter+88/92] debug_mutex_free_waiter+0x58/0x5c [263208.382037] [__mutex_lock_slowpath+624/637] __mutex_lock_slowpath+0x270/0x27d [263208.393086] [wait_for_completion+189/198] wait_for_completion+0xbd/0xc6 [263208.403789] [_end+122587956/2124917936] ib_detach_mcast+0x2f/0x33 [ib_core] [263208.415020] [_end+123888044/2124917936] ipoib_mcast_detach+0x3d/0x6e [ib_ipoib] [263208.426608] [_end+123884728/2124917936] ipoib_mcast_leave+0x12d/0x1c8 [ib_ipoib] [263208.438274] [_end+123886246/2124917936] ipoib_mcast_dev_flush+0x100/0x14e [ib_ipoib] [263208.450286] [_end+123886283/2124917936] ipoib_mcast_dev_flush+0x125/0x14e [ib_ipoib] [263208.462296] [_end+123878948/2124917936] 
ipoib_ib_dev_flush+0x0/0x11f [ib_ipoib] [263208.473872] [_end+123877784/2124917936] ipoib_ib_dev_down+0xa8/0xb7 [ib_ipoib] [263208.485366] [_end+123879087/2124917936] ipoib_ib_dev_flush+0x8b/0x11f [ib_ipoib] [263208.497030] [run_workqueue+177/380] run_workqueue+0xb1/0x17c [263208.507298] [worker_thread+0/349] worker_thread+0x0/0x15d [263208.517475] [keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a [263208.528355] [worker_thread+294/349] worker_thread+0x126/0x15d [263208.538708] [default_wake_function+0/15] default_wake_function+0x0/0xf [263208.549412] [default_wake_function+0/15] default_wake_function+0x0/0xf [263208.560116] [worker_thread+0/349] worker_thread+0x0/0x15d [263208.570299] [kthread+208/252] kthread+0xd0/0xfc [263208.579968] [child_rip+10/18] child_rip+0xa/0x12 [263208.589713] [_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d [263208.600150] [restore_args+0/48] restore_args+0x0/0x30 [263208.610157] [kthread+0/252] kthread+0x0/0xfc [263208.619724] [child_rip+0/18] child_rip+0x0/0x12 [263208.629469] [] 0xffffffffffffffff [263208.639220] } [263208.641084] ... key at: [_end+122893256/2124917936] __key.5+0x0/0xfffffffffffe369f [ib_ipath] [263208.649786] ... 
acquired at: [263208.652862] [add_lock_to_list+125/169] add_lock_to_list+0x7d/0xa9 [263208.658899] [__lock_acquire+2822/3168] __lock_acquire+0xb06/0xc60 [263208.664936] [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath] [263208.672640] [lock_acquire+124/160] lock_acquire+0x7c/0xa0 [263208.678330] [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath] [263208.686036] [_spin_lock+34/46] _spin_lock+0x22/0x2e [263208.691553] [_end+122750035/2124917936] ipath_multicast_attach+0x178/0x26c [ib_ipath] [263208.699261] [_end+122587907/2124917936] ib_attach_mcast+0x2f/0x31 [ib_core] [263208.706086] [_end+123887919/2124917936] ipoib_mcast_attach+0xe3/0x123 [ib_ipoib] [263208.713342] [_end+123880632/2124917936] ipoib_mcast_join_finish+0x125/0x3fa [ib_ipoib] [263208.721134] [check_usage+53/661] check_usage+0x35/0x295 [263208.726832] [lock_timer_base+35/72] lock_timer_base+0x23/0x48 [263208.732782] [__lock_acquire+3080/3168] __lock_acquire+0xc08/0xc60 [263208.738818] [_end+123881884/2124917936] ipoib_mcast_join_complete+0xcb/0x31a [ib_ipoib] [263208.746688] [_read_unlock_irqrestore+56/71] _read_unlock_irqrestore+0x38/0x47 [263208.753333] [_end+122585085/2124917936] ib_unpack+0xad/0x11c [ib_core] [263208.759722] [_end+123844451/2124917936] ib_sa_mcmember_rec_callback+0x4c/0x57 [ib_sa] [263208.767435] [_end+123844738/2124917936] recv_handler+0x3f/0x4b [ib_sa] [263208.773822] [_end+123718406/2124917936] ib_mad_completion_handler+0x3c7/0x59b [ib_mad] [263208.781622] [run_workqueue+134/380] run_workqueue+0x86/0x17c [263208.787486] [_end+123717439/2124917936] ib_mad_completion_handler+0x0/0x59b [ib_mad] [263208.795106] [run_workqueue+177/380] run_workqueue+0xb1/0x17c [263208.800969] [worker_thread+0/349] worker_thread+0x0/0x15d [263208.806744] [keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a [263208.813215] [worker_thread+294/349] worker_thread+0x126/0x15d [263208.819164] [default_wake_function+0/15] default_wake_function+0x0/0xf 
[263208.825469] [default_wake_function+0/15] default_wake_function+0x0/0xf [263208.831764] [worker_thread+0/349] worker_thread+0x0/0x15d [263208.837541] [kthread+208/252] kthread+0xd0/0xfc [263208.842796] [child_rip+10/18] child_rip+0xa/0x12 [263208.848137] [_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d [263208.854174] [restore_args+0/48] restore_args+0x0/0x30 [263208.859778] [kthread+0/252] kthread+0x0/0xfc [263208.864946] [child_rip+0/18] child_rip+0x0/0x12 [263208.870289] [] 0xffffffffffffffff [263208.875632] [263208.877236] [263208.877236] stack backtrace: [263208.881817] [263208.881818] Call Trace: [263208.885968] [print_irq_inversion_bug+292/307] print_irq_inversion_bug+0x124/0x133 [263208.892513] [check_usage_backwards+65/74] check_usage_backwards+0x41/0x4a [263208.898711] [mark_lock+630/1127] mark_lock+0x276/0x467 [263208.904045] [__lock_acquire+1412/3168] __lock_acquire+0x584/0xc60 [263208.909813] [kfree+525/541] kfree+0x20d/0x21d [263208.914808] [_end+122750808/2124917936] :ib_ipath:ipath_multicast_detach+0x211/0x231 [263208.922153] [lock_acquire+124/160] lock_acquire+0x7c/0xa0 [263208.927582] [_end+122750808/2124917936] :ib_ipath:ipath_multicast_detach+0x211/0x231 [263208.934927] [_spin_lock+34/46] _spin_lock+0x22/0x2e [263208.940182] [_end+122750808/2124917936] :ib_ipath:ipath_multicast_detach+0x211/0x231 [263208.947526] [debug_mutex_free_waiter+88/92] debug_mutex_free_waiter+0x58/0x5c [263208.953904] [__mutex_lock_slowpath+624/637] __mutex_lock_slowpath+0x270/0x27d [263208.960277] [wait_for_completion+189/198] wait_for_completion+0xbd/0xc6 [263208.966307] [_end+122587956/2124917936] :ib_core:ib_detach_mcast+0x2f/0x33 [263208.972766] [_end+123888044/2124917936] :ib_ipoib:ipoib_mcast_detach+0x3d/0x6e [263208.979575] [_end+123884728/2124917936] :ib_ipoib:ipoib_mcast_leave+0x12d/0x1c8 [263208.986469] [_end+123886246/2124917936] :ib_ipoib:ipoib_mcast_dev_flush+0x100/0x14e [263208.993724] [_end+123886283/2124917936] 
:ib_ipoib:ipoib_mcast_dev_flush+0x125/0x14e [263209.000981] [_end+123878948/2124917936] :ib_ipoib:ipoib_ib_dev_flush+0x0/0x11f [263209.007790] [_end+123877784/2124917936] :ib_ipoib:ipoib_ib_dev_down+0xa8/0xb7 [263209.014510] [_end+123879087/2124917936] :ib_ipoib:ipoib_ib_dev_flush+0x8b/0x11f [263209.021409] [run_workqueue+177/380] run_workqueue+0xb1/0x17c [263209.027001] [worker_thread+0/349] worker_thread+0x0/0x15d [263209.032508] [keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a [263209.038711] [worker_thread+294/349] worker_thread+0x126/0x15d [263209.050667] [default_wake_function+0/15] default_wake_function+0x0/0xf [263209.056697] [default_wake_function+0/15] default_wake_function+0x0/0xf [263209.062726] [worker_thread+0/349] worker_thread+0x0/0x15d [263209.068233] [kthread+208/252] kthread+0xd0/0xfc [263209.073219] [child_rip+10/18] child_rip+0xa/0x12 [263209.078294] [_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d [263209.084062] [restore_args+0/48] restore_args+0x0/0x30 [263209.089396] [kthread+0/252] kthread+0x0/0xfc [263209.094294] [child_rip+0/18] child_rip+0x0/0x12 [263209.099362] [263209.101175] BUG: workqueue leaked lock or atomic: ipoib/0x00000000/6662 [263209.107904] last function: ipoib_ib_dev_flush+0x0/0x11f [ib_ipoib] [263209.114575] 1 lock held by ipoib/6662: [263209.118430] #0: (&priv->mcast_mutex){--..}, at: [mutex_lock+35/39] mutex_lock+0x23/0x27 [263209.127012] [263209.127013] Call Trace: [263209.131160] [run_workqueue+302/380] run_workqueue+0x12e/0x17c [263209.136841] [worker_thread+0/349] worker_thread+0x0/0x15d [263209.142354] [keventd_create_kthread+0/122] keventd_create_kthread+0x0/0x7a [263209.148554] [worker_thread+294/349] worker_thread+0x126/0x15d [263209.154237] [default_wake_function+0/15] default_wake_function+0x0/0xf [263209.160264] [default_wake_function+0/15] default_wake_function+0x0/0xf [263209.166297] [worker_thread+0/349] worker_thread+0x0/0x15d [263209.171807] [kthread+208/252] kthread+0xd0/0xfc 
[263209.176801] [child_rip+10/18] child_rip+0xa/0x12 [263209.181875] [_spin_unlock_irq+40/45] _spin_unlock_irq+0x28/0x2d [263209.187646] [restore_args+0/48] restore_args+0x0/0x30 [263209.192977] [kthread+0/252] kthread+0x0/0xfc [263209.197875] [child_rip+0/18] child_rip+0x0/0x12 [263209.202951] [263209.204559] BUG: workqueue leaked lock or atomic: ipoib/0x00000000/6662 [263209.211282] last function: ipoib_ib_dev_flush+0x0/0x11f [ib_ipoib] [263209.217946] 1 lock held by ipoib/6662: [263209.221807] #0: (&priv->mcast_mutex){--..}, at: [mutex_lock+35/39] mutex_lock+0x23/0x27 ... [many more repeating traces] -- Bernd Schubert Q-Leap Networks GmbH From kliteyn at dev.mellanox.co.il Tue May 15 04:20:09 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 15 May 2007 14:20:09 +0300 Subject: [ofa-general] Re: Error message in OSM log when cached op file doesn't exist In-Reply-To: <1179152459.1540.178811.camel@hal.voltaire.com> References: <46486D1E.6010408@dev.mellanox.co.il> <1179152459.1540.178811.camel@hal.voltaire.com> Message-ID: <46499769.1070404@dev.mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > Hi Yevgeny, > > On Mon, 2007-05-14 at 10:07, Yevgeny Kliteynik wrote: >> Hi Hal. 
>> >> [snip] >>> Date: 03/30/2007 12:24:12 AM >>> OpenSM: Handle conf file open failures better >>> >>> diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c >>> index 46315a5..746fbd1 100644 >>> --- a/osm/opensm/osm_subnet.c >>> +++ b/osm/opensm/osm_subnet.c >>> @@ -732,7 +732,7 @@ subn_dump_qos_options( >>> >>> /********************************************************************** >>> **********************************************************************/ >>> -void >>> +ib_api_status_t >>> osm_subn_rescan_conf_file( >>> IN osm_subn_opt_t* const p_opts ) >>> { >>> @@ -751,7 +751,7 @@ osm_subn_rescan_conf_file( >>> >>> opts_file = fopen(file_name, "r"); >>> if (!opts_file) >>> - return; >>> + return IB_ERROR; >> [/snip] >> >> This patch was applied a month and a half ago (master). >> It handles opening cached options file, and prints error messages >> when OSM failed opening such file. >> >> I actually don't like this thing, because now every time you run >> OpenSM on the machine that doesn't have any cached options file >> (which is usually the case) you get an error message. > > Perhaps error is too severe as one can run just fine without this file > and there is no requirement to have it. Should it be some other type of > message instead ? I think that the message should appear when OpenSM *does* find cached option file, and no message should appear when such file wasn't found (which is the most common use case). >> There's no point checking whether the file exists, because osm runs >> as root, and if it fails opening this file, it means that the file >> doesn't exist or is inaccessible (broken mount, etc). > > That's the most common use case (running OpenSM as root, but not the > only one). > >> In any case, user gets info in stdout whether or now OpenSM is using >> cached options file. > > Is there always a message in the log as well indicating this ? Nope. When this file is parsed, osm_log is not yet initialized. 
-- Yevgeny > -- Hal > >> Do you agree? Should I issue a patch? >> >> -- Yevgeny > > From kliteyn at dev.mellanox.co.il Tue May 15 04:23:17 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 15 May 2007 14:23:17 +0300 Subject: [ofa-general] Re: Error message in OSM log when cached op file doesn't exist In-Reply-To: <20070514210541.GR29746@sashak.voltaire.com> References: <46486D1E.6010408@dev.mellanox.co.il> <20070514210541.GR29746@sashak.voltaire.com> Message-ID: <46499825.9090204@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 17:07 Mon 14 May , Yevgeny Kliteynik wrote: >> I actually don't like this thing, because now every time you run >> OpenSM on the machine that doesn't have any cached options file >> (which is usually the case) you get an error message. >> >> There's no point checking whether the file exists, because osm runs >> as root, > > Not necessary. > >> and if it fails opening this file, it means that the file >> doesn't exist or is inaccessible (broken mount, etc). > > or user provided OSM_CACHE_DIR environment variable is broken or malloc > failed, or other error (see: man 3 fopen, man 2 open, man 3 malloc) > > Probably just this solves your issue: > > diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c > index 855d1ab..f7ddf7d 100644 > --- a/osm/opensm/osm_subnet.c > +++ b/osm/opensm/osm_subnet.c > @@ -51,6 +51,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -856,7 +857,7 @@ osm_subn_parse_conf_file( > > opts_file = fopen(file_name, "r"); > if (!opts_file) > - return IB_ERROR; > + return errno == ENOENT ? IB_SUCCESS : IB_ERROR; > > while (fgets(line, 1023, opts_file) != NULL) > { I think that this one is a good solution. -- Yevgeny > Or yet another IB_* status value and less aggressive warning message? 
> > Sasha > From kliteyn at dev.mellanox.co.il Tue May 15 04:35:43 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 15 May 2007 14:35:43 +0300 Subject: [ofa-general] [PATCH] osm: error message when failed opening cached options file Message-ID: <46499B0F.2090902@dev.mellanox.co.il> Hi Hal, As suggested by Sasha, printing error message when failed opening cached options file only when the file was found, but osm failed opening it. Signed-off-by: Yevgeny Kliteynik --- osm/opensm/osm_subnet.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 855d1ab..9bba1b4 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -52,6 +52,7 @@ #include #include #include +#include #include #include #include @@ -856,7 +857,7 @@ osm_subn_parse_conf_file( opts_file = fopen(file_name, "r"); if (!opts_file) - return IB_ERROR; + return (errno == ENOENT) ? IB_SUCCESS : IB_ERROR; while (fgets(line, 1023, opts_file) != NULL) { -- 1.4.4.1.GIT From sashak at voltaire.com Tue May 15 05:54:01 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 15 May 2007 15:54:01 +0300 Subject: [ofa-general] Re: Error message in OSM log when cached op file doesn't exist In-Reply-To: <46499769.1070404@dev.mellanox.co.il> References: <46486D1E.6010408@dev.mellanox.co.il> <1179152459.1540.178811.camel@hal.voltaire.com> <46499769.1070404@dev.mellanox.co.il> Message-ID: <20070515125401.GD23240@sashak.voltaire.com> On 14:20 Tue 15 May , Yevgeny Kliteynik wrote: > > I think that the message should appear when OpenSM *does* find cached > option file, and no message should appear when such file wasn't found > (which is the most common use case). AFAIK OpenSM which used in the labs' clusters almost always uses this file, so I'm not sure about common case. 
Sasha From philippe.gregoire at cea.fr Tue May 15 05:48:39 2007 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Tue, 15 May 2007 14:48:39 +0200 Subject: [ofa-general] git over http Message-ID: <4649AC27.8010903@cea.fr> I can't get the git clone command working because of our firewall. Is there a git HTTP server configured? If so, how do I translate git clone git://git.openfabrics.org/~halr/management into an HTTP path for git clone? Thanks Philippe From kliteyn at dev.mellanox.co.il Tue May 15 05:56:32 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 15 May 2007 15:56:32 +0300 Subject: [ofa-general] Re: Error message in OSM log when cached op file doesn't exist In-Reply-To: <20070515125401.GD23240@sashak.voltaire.com> References: <46486D1E.6010408@dev.mellanox.co.il> <1179152459.1540.178811.camel@hal.voltaire.com> <46499769.1070404@dev.mellanox.co.il> <20070515125401.GD23240@sashak.voltaire.com> Message-ID: <4649AE00.8080806@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 14:20 Tue 15 May , Yevgeny Kliteynik wrote: >> I think that the message should appear when OpenSM *does* find cached >> option file, and no message should appear when such file wasn't found >> (which is the most common use case). > > AFAIK OpenSM which used in the labs' clusters almost always uses this > file, so I'm not sure about common case. If the file is found, the user sees "Using cached bla-bla" and "Loading cached option bla-bla" messages. If the file wasn't found, these messages are not printed, so the absence of these messages means that the file wasn't found. The only thing we can do is to add a new message that will explicitly inform the user about this, something like "No cached options file". Is this necessary? IMHO, it's not. Do you think otherwise?
-- Yevgeny > Sasha > From bhartner at us.ibm.com Tue May 15 06:26:16 2007 From: bhartner at us.ibm.com (Bill Hartner) Date: Tue, 15 May 2007 08:26:16 -0500 Subject: [ofa-general] binary compatibility ofed 1.1 and 1.2 Message-ID: Hi, Will apps built with OFED 1.1 verbs.h run on an OFED 1.2 install ? -Bill From halr at voltaire.com Tue May 15 07:05:20 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 15 May 2007 10:05:20 -0400 Subject: [ofa-general] [NOTICE] IB management changes Message-ID: <1179237918.4531.74522.camel@hal.voltaire.com> As discussed last month, the following changes have now been made for IB management (master branch of my management git tree): In order to better match package names, the following directory names have been changed from->to: osm->opensm diags->infiniband-diags Still pending are the following changes: Since opensm is a system daemon, opensm is to be moved from /usr/bin to /usr/sbin Similarly for the infiniband-diags. For consistency with the package name, /var/cache/osm moved to /var/cache/opensm Also, for consistency with the package name, all config, log, and dump files named osm* to be changed to opensm* To avoid confusion and possible conflicts in configuring daemon options, only have 1 configuration file (existence of both /etc/sysconfig/opensm and /etc/opensm.conf is problematic). Remove the /etc/sysconfig/opensm file and only use opensm.conf. Move opensm.conf to /etc/rdma (as discussed in the thread labeled "Location and naming of RDMA enablement stack rpm" on general at lists.openfabrics.org.
-- Hal From kliteyn at dev.mellanox.co.il Tue May 15 07:08:52 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 15 May 2007 17:08:52 +0300 Subject: [ofa-general] Re: Error message in OSM log when cached op file doesn't exist In-Reply-To: <46499769.1070404@dev.mellanox.co.il> References: <46486D1E.6010408@dev.mellanox.co.il> <1179152459.1540.178811.camel@hal.voltaire.com> <46499769.1070404@dev.mellanox.co.il> Message-ID: <4649BEF4.9000801@dev.mellanox.co.il> Yevgeny Kliteynik wrote: > Hi Hal, > > Hal Rosenstock wrote: >> Hi Yevgeny, >> >> On Mon, 2007-05-14 at 10:07, Yevgeny Kliteynik wrote: >>> Hi Hal. >>> >>> [snip] >>>> Date: 03/30/2007 12:24:12 AM >>>> OpenSM: Handle conf file open failures better >>>> diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c >>>> index 46315a5..746fbd1 100644 >>>> --- a/osm/opensm/osm_subnet.c >>>> +++ b/osm/opensm/osm_subnet.c >>>> @@ -732,7 +732,7 @@ subn_dump_qos_options( >>>> >>>> /********************************************************************** >>>> >>>> >>>> **********************************************************************/ >>>> -void >>>> +ib_api_status_t >>>> osm_subn_rescan_conf_file( >>>> IN osm_subn_opt_t* const p_opts ) >>>> { >>>> @@ -751,7 +751,7 @@ osm_subn_rescan_conf_file( >>>> opts_file = fopen(file_name, "r"); >>>> if (!opts_file) >>>> - return; >>>> + return IB_ERROR; >>> [/snip] >>> >>> This patch was applied a month and a half ago (master). >>> It handles opening cached options file, and prints error messages >>> when OSM failed opening such file. >>> >>> I actually don't like this thing, because now every time you run >>> OpenSM on the machine that doesn't have any cached options file >>> (which is usually the case) you get an error message. >> >> Perhaps error is too severe as one can run just fine without this file >> and there is no requirement to have it. Should it be some other type of >> message instead ? 
> > I think that the message should appear when OpenSM *does* find cached > option file, and no message should appear when such file wasn't found > (which is the most common use case). > >>> There's no point checking whether the file exists, because osm runs >>> as root, and if it fails opening this file, it means that the file >>> doesn't exist or is inaccessible (broken mount, etc). >> >> That's the most common use case (running OpenSM as root, but not the >> only one). >> >>> In any case, user gets info in stdout whether or now OpenSM is using >>> cached options file. >> >> Is there always a message in the log as well indicating this ? > > Nope. > When this file is parsed, osm_log is not yet initialized. Correction: There are two places where this file is parsed: 1. osm_subn_parse_conf_file() - called from main(), osm log is not yet initialized when the function is called 2. osm_subn_rescan_conf_file() - called from osm_state_mgr_process() before every heavy sweep (when the log is already initialized), and logs error message about the missing file every time. -- Yevgeny > -- Yevgeny > >> -- Hal >> >>> Do you agree? Should I issue a patch? >>> >>> -- Yevgeny >> >> > > > From tziporet at dev.mellanox.co.il Tue May 15 07:15:45 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 15 May 2007 17:15:45 +0300 Subject: [ofa-general] [GIT PULL] please pull infiniband.git In-Reply-To: References: Message-ID: <4649C091.5050804@mellanox.co.il> Roland Dreier wrote: > > Jack Morgenstein (1): > IB/mlx4: Fix uninitialized spinlock for 32-bit archs > > Hi Roland, There were other 2 fixes from Eli and I see they are not in. Can you take them too? 
Thanks, Tziporet
From rdreier at cisco.com Tue May 15 07:21:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 07:21:13 -0700 Subject: [ofa-general] Re: [PATCH] IB/core free umem when mm is destroyed In-Reply-To: <1179131773.7405.39.camel@mtls03> (Eli Cohen's message of "Mon, 14 May 2007 11:35:43 +0300") References: <1179131773.7405.39.camel@mtls03> Message-ID: thanks, applied. From rdreier at cisco.com Tue May 15 07:26:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 07:26:49 -0700 Subject: [ofa-general] Re: [PATCH] libmlx4: WQE shift calculation In-Reply-To: <1179145102.25749.11.camel@mtls03> (Eli Cohen's message of "Mon, 14 May 2007 15:17:52 +0300") References: <1179145102.25749.11.camel@mtls03> Message-ID: > the code that calculates WQ size is quite different between kernel and > user. I think that writing the code in a way that will allow to copy it > as is between kernel and user is in place. Would like me to send such a > patch? Actually I've been thinking that perhaps we should let libmlx4 tell mlx4 in the kernel what WQE sizes it wants to use. Otherwise it will probably be a pain if we want to use a small BB for SQs, etc. > case IBV_QPT_RC: > - size += sizeof (struct mlx4_wqe_raddr_seg); > + size += sizeof (struct mlx4_wqe_raddr_seg) + > + sizeof (struct mlx4_wqe_atomic_seg); > /* > * An atomic op will require an atomic segment, a > * remote address segment and one scatter entry. This looks wrong. Why do we have to allow for an atomic segment for normal operations? The code that starts with the context above: /* * An atomic op will require an atomic segment, a * remote address segment and one scatter entry. */ if (size < (sizeof (struct mlx4_wqe_atomic_seg) + sizeof (struct mlx4_wqe_raddr_seg) + sizeof (struct mlx4_wqe_data_seg))) size = (sizeof (struct mlx4_wqe_atomic_seg) + sizeof (struct mlx4_wqe_raddr_seg) + sizeof (struct mlx4_wqe_data_seg)); seems to take into account leaving space for atomic operations. - R.
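[Archive note] The sizing rule argued over in the message above can be sketched in a few lines of C. This is only an illustration of the "raise the size to an atomic-capable minimum" logic quoted by Roland, not the libmlx4 or kernel code: the segment sizes are placeholder values, the function names are hypothetical, the control segment is left out, and the power-of-two shift helper is an assumption about how a WQE shift might be derived from the size.

```c
#include <stddef.h>

/* Hypothetical segment sizes in bytes. The real values come from the
 * mlx4 WQE structure definitions and are NOT taken from this thread. */
enum {
    RADDR_SEG_SIZE  = 16,   /* stands in for struct mlx4_wqe_raddr_seg  */
    ATOMIC_SEG_SIZE = 16,   /* stands in for struct mlx4_wqe_atomic_seg */
    DATA_SEG_SIZE   = 16    /* stands in for struct mlx4_wqe_data_seg   */
};

/* Bytes needed by one RC send WQE with max_sge scatter entries
 * (illustrative helper; control segment deliberately omitted). */
static size_t rc_send_wqe_size(unsigned max_sge)
{
    /* Normal operation: one remote address segment plus the scatter list. */
    size_t size = (size_t)max_sge * DATA_SEG_SIZE + RADDR_SEG_SIZE;

    /* An atomic op needs an atomic segment, a remote address segment
     * and one scatter entry, so never go below that. */
    size_t atomic_min = ATOMIC_SEG_SIZE + RADDR_SEG_SIZE + DATA_SEG_SIZE;
    if (size < atomic_min)
        size = atomic_min;

    return size;
}

/* Smallest shift >= min_shift with (1 << shift) >= size.
 * Assumes WQEs are laid out with a power-of-two stride. */
static unsigned wqe_shift_for(size_t size, unsigned min_shift)
{
    unsigned shift = min_shift;
    while (((size_t)1 << shift) < size)
        shift++;
    return shift;
}
```

With these placeholder sizes, a queue with a single scatter entry is raised from 32 to 48 bytes by the atomic minimum, while larger scatter lists are unaffected by it. Keeping one such helper shared between the kernel driver and the userspace library is the kind of arrangement the thread suggests, since a mismatch between the two calculations is what the patch set out to fix.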
From mst at dev.mellanox.co.il Tue May 15 07:47:08 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 May 2007 17:47:08 +0300 Subject: [ofa-general] Re: [PATCH] libmlx4: WQE shift calculation In-Reply-To: References: <1179145102.25749.11.camel@mtls03> Message-ID: <20070515144708.GH13649@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH] libmlx4: WQE shift calculation > > > the code that calculates WQ size is quite different between kernel and > > user. I think that writing the code in a way that will allow to copy it > > as is between kernel and user is in place. Would like me to send such a > > patch? > > Actually I've been thinking that perhaps we should let libmlx4 tell > mlx4 in the kernel what WQE sizes it wants to use. Otherwise it will > probably be a pain if we want to use a small BB for SQs, etc. I've been thinking about this, too. But this is a separate issue from what Eli proposes - we'll still need to have this math in both kernel and user-space. -- MST From eli at mellanox.co.il Tue May 15 08:13:32 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 15 May 2007 18:13:32 +0300 Subject: [ofa-general] Re: [PATCH] libmlx4: WQE shift calculation In-Reply-To: References: <1179145102.25749.11.camel@mtls03> Message-ID: <1179242042.25749.33.camel@mtls03> On Tue, 2007-05-15 at 07:26 -0700, Roland Dreier wrote: > > the code that calculates WQ size is quite different between kernel and > > user. I think that writing the code in a way that will allow to copy it > > as is between kernel and user is in place. Would like me to send such a > > patch? First I should add the case that triggered this patch: the userspace code calculated a smaller buffer size than the kernel code, which caused get_user_pages() to fail since part of the buffer did not belong to the process's address space. > > Actually I've been thinking that perhaps we should let libmlx4 tell > mlx4 in the kernel what WQE sizes it wants to use.
Otherwise it will
> probably be a pain if we want to use a small BB for SQs, etc.

As Michael said in a subsequent post, we still need this code both in
user and in kernel.

>
> > 	case IBV_QPT_RC:
> > -		size += sizeof (struct mlx4_wqe_raddr_seg);
> > +		size += sizeof (struct mlx4_wqe_raddr_seg) +
> > +			sizeof (struct mlx4_wqe_atomic_seg);
> > 		/*
> > 		 * An atomic op will require an atomic segment, a
> > 		 * remote address segment and one scatter entry.
>
> This looks wrong. Why do we have to allow for an atomic segment for
> normal operations? The code that starts with the context above:

The kernel code in send_wqe_overhead() always adds atomic headers. Maybe
the fix should have gone there. Still, I think we should have the same
code for this calculation.

>
> 	/*
> 	 * An atomic op will require an atomic segment, a
> 	 * remote address segment and one scatter entry.
> 	 */
> 	if (size < (sizeof (struct mlx4_wqe_atomic_seg) +
> 		    sizeof (struct mlx4_wqe_raddr_seg) +
> 		    sizeof (struct mlx4_wqe_data_seg)))
> 		size = (sizeof (struct mlx4_wqe_atomic_seg) +
> 			sizeof (struct mlx4_wqe_raddr_seg) +
> 			sizeof (struct mlx4_wqe_data_seg));
>
> seems to take into account leaving space for atomic operations.
>
>  - R.
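[Editorial sketch, not part of the original message.] The mismatch described above can be illustrated with a small C model. It assumes that every WQE segment is 16 bytes, that the library sizes an RC send WQE with the clamp quoted in this thread, that the kernel side adds control, atomic and remote address segments unconditionally (as the message says send_wqe_overhead() does), and that both sides round the WQE stride up to a power of two starting at 64 bytes. All function names here are hypothetical.

```c
#include <stddef.h>

/* Assumed segment sizes in bytes (hypothetical, for illustration only). */
enum {
    CTRL_SEG_SIZE   = 16,
    RADDR_SEG_SIZE  = 16,
    ATOMIC_SEG_SIZE = 16,
    DATA_SEG_SIZE   = 16
};

/* Smallest shift >= 6 such that (1 << shift) covers size bytes. */
unsigned wqe_shift_for(size_t size)
{
    unsigned shift = 6;

    while (((size_t)1 << shift) < size)
        shift++;
    return shift;
}

/* Library-side model: atomic space is reserved only through the clamp. */
unsigned user_rc_sq_shift(size_t max_sge)
{
    size_t size = max_sge * DATA_SEG_SIZE + RADDR_SEG_SIZE;
    size_t atomic_min = ATOMIC_SEG_SIZE + RADDR_SEG_SIZE + DATA_SEG_SIZE;

    if (size < atomic_min)
        size = atomic_min;
    return wqe_shift_for(size + CTRL_SEG_SIZE);
}

/* Kernel-side model: the atomic segment is always part of the overhead. */
unsigned kernel_rc_sq_shift(size_t max_sge)
{
    size_t overhead = CTRL_SEG_SIZE + ATOMIC_SEG_SIZE + RADDR_SEG_SIZE;

    return wqe_shift_for(overhead + max_sge * DATA_SEG_SIZE);
}
```

With these assumed sizes, a queue with two scatter entries per WQE gets a 64-byte stride in the library model but a 128-byte stride in the kernel model. The kernel would then expect a buffer twice as large as the one the library allocated, which is consistent with pinning the user buffer failing because it runs past the mapped region.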
From halr at voltaire.com Tue May 15 08:14:52 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 15 May 2007 11:14:52 -0400 Subject: [ofa-general] [NOTICE] IB management changes In-Reply-To: <1179237918.4531.74522.camel@hal.voltaire.com> References: <1179237918.4531.74522.camel@hal.voltaire.com> Message-ID: <1179242091.4531.78861.camel@hal.voltaire.com> On Tue, 2007-05-15 at 10:05, Hal Rosenstock wrote: > As discussed last month, the following changes have now been made for IB > management (master branch of my management git tree): > > In order to better match package names, the following directory names have > been changed from->to: > osm->opensm > diags->infiniband-diags > > Still pending are the following changes: > > Since opensm is a system daemon, opensm is to be moved from /usr/bin > to /usr/sbin This was done. > Similarly for the infiniband-diags. Pending. > For consistency with the package name, /var/cache/osm moved to > /var/cache/opensm Done. > Also, for consistency with the package name, all config, log, and dump files named osm* > to be changed to opensm* Done. -- Hal > To avoid confusion and possible conflicts in configuring daemon options, > only have 1 configuration file (existence of both /etc/sysconfig/opensm > and /etc/opensm.conf is problematic). Remove the /etc/sysconfig/opensm > file and only use opensm.conf. Move opensm.conf to /etc/rdma (as > discussed in the thread labeled "Location and naming of RDMA enablement > stack rpm" on general at lists.openfabrics.org. 
> > -- Hal > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From kliteyn at dev.mellanox.co.il Tue May 15 08:14:39 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 15 May 2007 18:14:39 +0300 Subject: [ofa-general] [PATCHv2] osm: error message when failed opening cached options file Message-ID: <4649CE5F.70102@dev.mellanox.co.il> Hi Hal, [V2 of the patch] As suggested by Sasha, printing error message when failed opening cached options file only when the file was found, but osm failed opening it. Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_subnet.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 855d1ab..c785923 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -52,6 +52,7 @@ #include #include #include +#include #include #include #include @@ -758,7 +759,7 @@ osm_subn_rescan_conf_file( opts_file = fopen(file_name, "r"); if (!opts_file) - return IB_ERROR; + return (errno == ENOENT) ? IB_SUCCESS : IB_ERROR; while (fgets(line, 1023, opts_file) != NULL) { @@ -856,7 +857,7 @@ osm_subn_parse_conf_file( opts_file = fopen(file_name, "r"); if (!opts_file) - return IB_ERROR; + return (errno == ENOENT) ? 
IB_SUCCESS : IB_ERROR;

 	while (fgets(line, 1023, opts_file) != NULL)
 	{
-- 
1.5.1.4

From halr at voltaire.com Tue May 15 08:29:11 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 May 2007 11:29:11 -0400
Subject: [ofa-general] Re: [PATCHv2] osm: error message when failed opening cached options file
In-Reply-To: <4649CE5F.70102@dev.mellanox.co.il>
References: <4649CE5F.70102@dev.mellanox.co.il>
Message-ID: <1179242950.4531.79784.camel@hal.voltaire.com>

Hi Yevgeny,

On Tue, 2007-05-15 at 11:14, Yevgeny Kliteynik wrote:
> Hi Hal,
>
> [V2 of the patch]
>
> As suggested by Sasha, printing error message when failed
> opening cached options file only when the file was found, but
> osm failed opening it.
>
> Signed-off-by: Yevgeny Kliteynik

Thanks. Applied (to master only).

-- Hal

From philippe.gregoire at cea.fr Tue May 15 08:32:25 2007
From: philippe.gregoire at cea.fr (Philippe Gregoire)
Date: Tue, 15 May 2007 17:32:25 +0200
Subject: [ofa-general] Re: suggested patch for partition membership definitiion in osm-partitions.conf (fix)
In-Reply-To: <1179157835.1540.183713.camel@hal.voltaire.com>
References: <46487FBF.7020300@cea.fr> <1179157835.1540.183713.camel@hal.voltaire.com>
Message-ID: <4649D289.3070301@cea.fr>

Here are the patches as you asked.
I changed the code to use strncmp as suggested by Sasha.

Philippe

Hal Rosenstock wrote:
> Hi Philippe,
>
> On Mon, 2007-05-14 at 11:26, Philippe Gregoire wrote:
>
>> This time , with the definitive patch (sorry)
>>
>
> Can you resubmit this with a S-O-B line ?
>
>
>> Hi Hal,
>> the way to define in osm-partitions.conf file partition membership for
>> port guids is quite very verbose,
>> specially when you have a lot of (full member) ports.
>>
>
> or lots of limited members, either way. This is an improvement in the
> allowed syntax.
>
>
>> Here is a patch to allow a more compact partition membership definition.
>> It allows definition of a default
The old syntax is still usable. >> old way >> G1 = 0x01 : 0x123=full, 0x124=full, 0x0x125=full, 0x126=full, 0x127=full ; >> G1 = 0x01 : 0x128=full, 0x129=full, 0x567, 0x569=full >> >> new way : >> G1 = 0x01 , defmember=full : 0x123, 0x124, 0x125, 0x126, 0x127 ; >> G1 = 0x01 , defmember=full : 0x128, 0x129, 0x567=limited, 0x569 >> >> I changed also the opensm man page as some lines (arround limited/full >> membership) are not well formatted. >> > > Can you break this piece into 2 parts: fix formatting, and then add > defmember ? > > >> This patch has been compiled and tested on our cluster with the >> following osm-partitions.conf : >> G1 = 0x0001 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9, >> 0x0005ad00000168ad, 0x0005ad0000000cb7=limited, 0x0008f10403962eb1 ; >> G2 = 0x0002 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ; >> G3 = 0x0003 , defmember=full : 0x0005ad00000165f1, 0x0005ad0000016cb9, >> 0x0008f10403962eb1 ; >> G5 = 0x0005 , defmember=full : 0x0008f10403962eb1 ; >> G10 = 0x0010 , defmember=full : 0x0005ad00000165f1, 0x0008f10403962eb1 ; >> G70 = 0x0070 , defmember=full : 0x0005ad00000165f1 ; >> G80 = 0x0080 , defmember=full : 0x0005ad00000165f1; >> G80 = 0x0080 : 0x0005ad00000168ad; >> G80 = 0x0080 , defmember=full : 0x0005ad0000016cb9; >> G80 = 0x0080 , defmember=limited : 0x0005ad0000000cb7, 0x0008f10403962eb1; >> > > Thanks. 
> > -- Hal > > >> Philippe >> >> >> >> ______________________________________________________________________ >> >> --- opensm/osm_prtn_config.old.c 2007-04-18 11:54:29.000000000 +0200 >> +++ opensm/osm_prtn_config.c 2007-05-14 17:14:42.228813361 +0200 >> @@ -70,6 +70,7 @@ >> osm_subn_t *p_subn; >> osm_prtn_t *p_prtn; >> unsigned is_ipoib, mtu, rate, sl, scope; >> + boolean_t full; >> }; >> >> extern osm_prtn_t *osm_prtn_make_new(osm_log_t *p_log, osm_subn_t *p_subn, >> @@ -163,6 +164,14 @@ >> " - skipped\n", lineno); >> else >> conf->sl = sl; >> + } else if (!strncmp(flag, "defmember", len)) { >> + if (!val || (strcmp(val, "limited") && strcmp(val, "full"))) >> + osm_log(conf->p_log, OSM_LOG_VERBOSE, >> + "PARSE WARN: line %d: " >> + "flag \'defmember\' requires valid value (limited or full)" >> + " - skipped\n", lineno); >> + else >> + conf->full = strcmp(val, "full") == 0; >> } else { >> osm_log(conf->p_log, OSM_LOG_VERBOSE, >> "PARSE WARN: line %d: " >> @@ -177,12 +186,14 @@ >> { >> osm_prtn_t *p = conf->p_prtn; >> ib_net64_t guid; >> - boolean_t full = FALSE; >> + boolean_t full = conf->full; >> >> if (!name || !*name || !strncmp(name, "NONE", strlen(name))) >> return 0; >> >> if (flag) { >> + /* reset default membership to limited */ >> + full = FALSE; >> if (!strncmp(flag, "full", strlen(flag))) >> full = TRUE; >> else if (strncmp(flag, "limited", strlen(flag))) { >> @@ -275,6 +286,7 @@ >> conf->p_prtn = NULL; >> conf->is_ipoib = 0; >> conf->sl = OSM_DEFAULT_SL; >> + conf->full = FALSE; >> return conf; >> } >> >> --- man/opensm.8.old 2007-04-18 11:54:29.000000000 +0200 >> +++ man/opensm.8 2007-05-14 16:19:11.747555126 +0200 >> @@ -291,13 +291,15 @@ >> >> Partition Definition: >> >> -[PartitionName][=PKey][,flag[=value]] >> +[PartitionName][=PKey][,flag[=value]][,defmember=full|limited] >> >> PartitionName - string, will be used with logging. When omitted >> empty string will be used. >> PKey - P_Key value for this partition. Only low 15 bits will >> be used. 
When omitted will be autogenerated. >> flag - used to indicate IPoIB capability of this partition. >> + defmember=full|limited - specifies default membership for port guid. >> + Default is limited. >> >> Currently recognized flags are: >> >> @@ -317,10 +319,10 @@ >> >> PortGUIDs list: >> >> -PortGUID - GUID of partition member EndPort. Hexadecimal numbers >> - should start from 0x, decimal numbers are accepted too. >> -full or - indicates full or limited membership for this port. When >> - limited omitted (or unrecognized) limited membership is assumed. >> + PortGUID - GUID of partition member EndPort. Hexadecimal numbers >> + should start from 0x, decimal numbers are accepted too. >> + full or limited - indicates full or limited membership for this port. >> + When omitted (or unrecognized) default (defmember) membership is assumed. >> >> There are two useful keywords for PortGUID definition: >> >> @@ -346,12 +348,20 @@ >> >> Examples: >> >> -Default=0x7fff : ALL, SELF=full ; >> + Default=0x7fff : ALL, SELF=full ; >> >> -NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ; >> + NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ; >> >> -YetAnotherOne = 0x300 : SELF=full ; >> -YetAnotherOne = 0x300 : ALL=limited ; >> + YetAnotherOne = 0x300 : SELF=full ; >> + YetAnotherOne = 0x300 : ALL=limited ; >> + >> + ShareIO = 0x80 , defmember=full : 0x123451, 0x123452; >> + # 0x123453, 0x123454 will be limited >> + ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full; >> + # 0x123456, 0x123457 will be limited >> + ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full; >> + ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a; >> + ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d; >> >> Note: >> >> > > > -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: defmember.patch URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: osm-man1.patch URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: osm-man2.patch URL: From rdreier at cisco.com Tue May 15 08:58:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 08:58:19 -0700 Subject: [ofa-general] Re: [PATCH] libmlx4: WQE shift calculation In-Reply-To: <1179242042.25749.33.camel@mtls03> (Eli Cohen's message of "Tue, 15 May 2007 18:13:32 +0300") References: <1179145102.25749.11.camel@mtls03> <1179242042.25749.33.camel@mtls03> Message-ID: > First I should add the case that triggered his patch: the userspace code > calculated a smaller buffer size than the kernel code, which caused > get_user_pages() to fail since part of the buffer did not belong to the > process's address space. OK, in this case it seems the bug is in the kernel -- since it is overestimating the size of the WQEs needed. So we might as well fix it in the kernel. > As Mihcael said in a subsequent post, we still need this code both in > user and in kernel. Yes, but I think this issue really convinces me that we should decouple the two calculations, so the kernel code is only used for kernel QPs. And then change the mlx4 ABI so that userspace tells the kernel the wqe buffer size and rq/sq wqe shift/offset. That will allow for different SQ BB sizes and also make things more robust against bugs like this. - R. From rdreier at cisco.com Tue May 15 08:58:46 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 08:58:46 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git In-Reply-To: <4649C091.5050804@mellanox.co.il> (Tziporet Koren's message of "Tue, 15 May 2007 17:15:45 +0300") References: <4649C091.5050804@mellanox.co.il> Message-ID: Thanks for the reminder. I put them in the wrong folder but I think I found them now. 
From xma at us.ibm.com Tue May 15 10:41:23 2007 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 15 May 2007 10:41:23 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <20070515062646.GD5437@mellanox.co.il> Message-ID: Hello Michael, Regarding the memory issue w/o SRQ, do you think there is a way to use low watermark to release prepost buffer in large connections? I think most of the prepost buffers are empty in that case because of the BW. Thanks Shirley Ma -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Tue May 15 11:54:58 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 May 2007 21:54:58 +0300 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: References: <20070515062646.GD5437@mellanox.co.il> Message-ID: <20070515185445.GD4161@mellanox.co.il> > Quoting Shirley Ma : > Subject: Re: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review > > Hello Michael, > > Regarding the memory issue w/o SRQ, do you think there is a way to use low > watermark to release prepost buffer in large connections? Maybe with UC - with RC you'll get RNR and connection'll get closed before you have time to handle the low watermark. So sure, might be an interesting idea, but isn't low watermark a SRQ feature? > I think most of the prepost buffers are empty in that case because of the BW. I don't really get the argument. 
-- 
MST

From parks at lanl.gov Tue May 15 11:55:32 2007
From: parks at lanl.gov (Parks Fields)
Date: Tue, 15 May 2007 12:55:32 -0600
Subject: [ofa-general] Re: [ewg] OFED 1.2 rc3 release
In-Reply-To: <6a122cc00705140844k7f65a988x8746c4ea8474b920@mail.gmail.com>
References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com> <46487AE8.1020005@mellanox.co.il> <6a122cc00705140844k7f65a988x8746c4ea8474b920@mail.gmail.com>
Message-ID: <7.0.1.0.2.20070515125402.02784f60@lanl.gov>

OFED 1.2 rc3 is running IPoIB slower than rc2 or earlier.

I used to get ~800MB/sec with no tuning and now I get ~650MB/sec.

Ideas?

thanks

At 09:44 AM 5/14/2007, Moni Levy wrote:
>On 5/14/07, Tziporet Koren <tziporet at dev.mellanox.co.il> wrote:
>
>
>Major limitations and known issues:
>567 blocker rolandd at cisco.com MPI does not work on RHEL5 ppc64
>420 critical monil at voltaire.com PKey table reordering caused by SM failover stops ipoib traffic
>
>
>
>Tziporet, bug #420 was fixed and bugzilla was updated this morning
>
>Moni
>
>
>
>607 critical jsquyres at cisco.com remove the hack to save the port number in the ia hca_address
>608 critical monis at voltaire.com traffic fails to resume after SM failover with bonding interfaces
>611 critical swise at opengridcomputing.com cxgb3: passive side connection transition from streaming to RDMA is broken
>577 critical rolandd at cisco.com SRP multipath failover too slow (minutes, not seconds)
>465 critical mst at mellanox.co.il IPoIB HA fails after several hours of failovers
>549 critical amip at dev.mellanox.co.il SDP Policy need to be consistent
>604 critical mst at mellanox.co.il Oops running UDP traffic with IPoIB CM
>605 major sean.hefty at intel.com kernel oops in rdma_cm during module unload
>614 major halr at voltaire.com All of the CM definitions should be removed from ib_types.h
>534 major vlad at mellanox.co.il SLES9 - Installer fails on declarations - OFED
1.2-20070409 >530 major dannyz at mellanox.co.il >ibdiagnet -r fails on RHEL5 i686 >538 major monis at voltaire.com integrate >IPoIB bonding with IPoIB CM >541 major mst at mellanox.co.il slow >failover with IPoIB CM bonding/ipoibtools HA >558 major rolandd at cisco.com tvflash >configure fails on SLES10 SP1 RC2 > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ***** Correspondence ***** This email contains no programmatic content that requires independent ADC review -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Tue May 15 11:58:54 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 11:58:54 -0700 Subject: [ofa-general] possible irq lock inversion dependency detected In-Reply-To: <200705151148.50607.bs@q-leap.de> (Bernd Schubert's message of "Tue, 15 May 2007 11:48:50 +0200") References: <200705151148.50607.bs@q-leap.de> Message-ID: Thanks for the report... looks like a real bug. Can you check whether this patch makes the lockdep warnings go away? commit 4b7eed244c032ce963be543a63e3100b96bc2d87 Author: Roland Dreier Date: Tue May 15 11:56:05 2007 -0700 IB/ipath: Fix potential deadlock with multicast spinlocks Lockdep found the following potential deadlock between mcast_lock and n_mcast_grps_lock: mcast_lock is taken from both interrupt context and process context, so spin_lock_irqsave() must be used to take it. 
n_mcast_grps_lock is only taken from process context, so at first it seems safe to take it with plain spin_lock(); however, it also nests inside mcast_lock, and hence we could deadlock: cpu A cpu B ipath_mcast_add(): spin_lock_irq(&mcast_lock); ipath_mcast_detach(): spin_lock(&n_mcast_grps_lock); ipath_mcast_find(): spin_lock_irqsave(&mcast_lock); spin_lock(&n_mcast_grps_lock); Fix this by using spin_lock_irq() to take n_mcast_grps_lock. Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c index 085e28b..dd691cf 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c @@ -165,10 +165,9 @@ static int ipath_mcast_add(struct ipath_ibdev *dev, { struct rb_node **n = &mcast_tree.rb_node; struct rb_node *pn = NULL; - unsigned long flags; int ret; - spin_lock_irqsave(&mcast_lock, flags); + spin_lock_irq(&mcast_lock); while (*n) { struct ipath_mcast *tmcast; @@ -228,7 +227,7 @@ static int ipath_mcast_add(struct ipath_ibdev *dev, ret = 0; bail: - spin_unlock_irqrestore(&mcast_lock, flags); + spin_unlock_irq(&mcast_lock); return ret; } @@ -289,17 +288,16 @@ int ipath_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) struct ipath_mcast *mcast = NULL; struct ipath_mcast_qp *p, *tmp; struct rb_node *n; - unsigned long flags; int last = 0; int ret; - spin_lock_irqsave(&mcast_lock, flags); + spin_lock_irq(&mcast_lock); /* Find the GID in the mcast table. 
*/ n = mcast_tree.rb_node; while (1) { if (n == NULL) { - spin_unlock_irqrestore(&mcast_lock, flags); + spin_unlock_irq(&mcast_lock); ret = -EINVAL; goto bail; } @@ -334,7 +332,7 @@ int ipath_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) break; } - spin_unlock_irqrestore(&mcast_lock, flags); + spin_unlock_irq(&mcast_lock); if (p) { /* @@ -348,9 +346,9 @@ int ipath_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) atomic_dec(&mcast->refcount); wait_event(mcast->wait, !atomic_read(&mcast->refcount)); ipath_mcast_free(mcast); - spin_lock(&dev->n_mcast_grps_lock); + spin_lock_irq(&dev->n_mcast_grps_lock); dev->n_mcast_grps_allocated--; - spin_unlock(&dev->n_mcast_grps_lock); + spin_unlock_irq(&dev->n_mcast_grps_lock); } ret = 0; From rdreier at cisco.com Tue May 15 12:00:28 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 12:00:28 -0700 Subject: [ofa-general] New ipath MAINTAINERS entry? Message-ID: Do you guys want to update the maintainers entry for ipath? Right now we have: IPATH DRIVER: P: Bryan O'Sullivan M: support at pathscale.com L: openib-general at openib.org S: Supported Qlogic bought Pathscale and Bryan no longer works for Qlogic so it seems some fresher information might be appropriate. From rdreier at cisco.com Tue May 15 12:41:31 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 12:41:31 -0700 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: <20070508141727.GR21591@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 8 May 2007 17:17:27 +0300") References: <20070508141727.GR21591@mellanox.co.il> Message-ID: OK, how about this for libibverbs: diff --git a/include/infiniband/arch.h b/include/infiniband/arch.h index 6a04287..df4c949 100644 --- a/include/infiniband/arch.h +++ b/include/infiniband/arch.h @@ -56,13 +56,17 @@ static inline uint64_t ntohll(uint64_t x) { return x; } * macro by either the compiler or the CPU. * wmb() - write memory barrier. 
No stores may be reordered across * this macro by either the compiler or the CPU. + * wc_wmb() - flush write combine buffers. No write-combined writes + * will be reordered across this macro by either the compiler or + * the CPU. */ #if defined(__i386__) -#define mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory") -#define rmb() mb() -#define wmb() asm volatile("" ::: "memory") +#define mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory") +#define rmb() mb() +#define wmb() asm volatile("" ::: "memory") +#define wc_wmb() mb() #elif defined(__x86_64__) @@ -70,47 +74,54 @@ static inline uint64_t ntohll(uint64_t x) { return x; } * Only use lfence for mb() and rmb() because we don't care about * ordering against non-temporal stores (for now at least). */ -#define mb() asm volatile("lfence" ::: "memory") -#define rmb() mb() -#define wmb() asm volatile("" ::: "memory") +#define mb() asm volatile("lfence" ::: "memory") +#define rmb() mb() +#define wmb() asm volatile("" ::: "memory") +#define wc_wmb() asm volatile("sfence" ::: "memory") #elif defined(__PPC64__) -#define mb() asm volatile("sync" ::: "memory") -#define rmb() asm volatile("lwsync" ::: "memory") -#define wmb() mb() +#define mb() asm volatile("sync" ::: "memory") +#define rmb() asm volatile("lwsync" ::: "memory") +#define wmb() mb() +#define wc_wmb() wmb() #elif defined(__ia64__) -#define mb() asm volatile("mf" ::: "memory") -#define rmb() mb() -#define wmb() mb() +#define mb() asm volatile("mf" ::: "memory") +#define rmb() mb() +#define wmb() mb() +#define wc_wmb() asm volatile("fwb" ::: "memory") #elif defined(__PPC__) -#define mb() asm volatile("sync" ::: "memory") -#define rmb() mb() -#define wmb() asm volatile("eieio" ::: "memory") +#define mb() asm volatile("sync" ::: "memory") +#define rmb() mb() +#define wmb() asm volatile("eieio" ::: "memory") +#define wc_wmb() wmb() #elif defined(__sparc_v9__) -#define mb() asm volatile("membar #LoadLoad | #LoadStore | #StoreStore | #StoreLoad" ::: "memory") 
-#define rmb() asm volatile("membar #LoadLoad" ::: "memory") -#define wmb() asm volatile("membar #StoreStore" ::: "memory") +#define mb() asm volatile("membar #LoadLoad | #LoadStore | #StoreStore | #StoreLoad" ::: "memory") +#define rmb() asm volatile("membar #LoadLoad" ::: "memory") +#define wmb() asm volatile("membar #StoreStore" ::: "memory") +#define wc_wmb() wmb() #elif defined(__sparc__) -#define mb() asm volatile("" ::: "memory") -#define rmb() mb() -#define wmb() mb() +#define mb() asm volatile("" ::: "memory") +#define rmb() mb() +#define wmb() mb() +#define wc_wmb() wmb() #else #warning No architecture specific defines found. Using generic implementation. -#define mb() asm volatile("" ::: "memory") -#define rmb() mb() -#define wmb() mb() +#define mb() asm volatile("" ::: "memory") +#define rmb() mb() +#define wmb() mb() +#define wc_wmb() wmb() #endif From rdreier at cisco.com Tue May 15 12:42:08 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 12:42:08 -0700 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: <20070508141727.GR21591@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 8 May 2007 17:17:27 +0300") References: <20070508141727.GR21591@mellanox.co.il> Message-ID: ...and this for libmlx4? 
diff --git a/src/mlx4.h b/src/mlx4.h index c4d389f..1e92b88 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -65,6 +65,20 @@ # define wmb() mb() #endif +#ifndef wc_wmb + +#if defined(__i386__) +#define wc_wmb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory") +#elif defined(__x86_64__) +#define wc_wmb() asm volatile("sfence" ::: "memory") +#elif defined(__ia64__) +#define wc_wmb() asm volatile("fwb" ::: "memory") +#else +#define wc_wmb() wmb() +#endif + +#endif + #define HIDDEN __attribute__((visibility ("hidden"))) #define PFX "mlx4: " diff --git a/src/qp.c b/src/qp.c index a70e5f2..a4384f9 100644 --- a/src/qp.c +++ b/src/qp.c @@ -282,9 +282,12 @@ out: ++qp->sq.head; pthread_spin_lock(&ctx->bf_lock); + memcpy(ctx->bf_page + ctx->bf_offset, ctrl, align(size * 16, 64)); - /* FIXME flush wc buffers */ + wc_wmb(); + ctx->bf_offset ^= ctx->bf_buf_size; + pthread_spin_unlock(&ctx->bf_lock); } else if (nreq) { qp->sq.head += nreq; From rdreier at cisco.com Tue May 15 12:51:26 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 12:51:26 -0700 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: (Roland Dreier's message of "Tue, 15 May 2007 12:42:08 -0700") References: <20070508141727.GR21591@mellanox.co.il> Message-ID: > memcpy(ctx->bf_page + ctx->bf_offset, ctrl, align(size * 16, 64)); By the way, why are we aligning the size of the WQE we copy to 64 bytes? I copied this from Jack's code but I don't see anything that requires it. We already have: if (nreq == 1 && inl && size > 1 && size < ctx->bf_buf_size / 16) { so we will always have at least 32 bytes to copy. - R. From mst at dev.mellanox.co.il Tue May 15 13:06:18 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 15 May 2007 23:06:18 +0300 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: References: <20070508141727.GR21591@mellanox.co.il> <20070513051806.GB7402@mellanox.co.il> Message-ID: <20070515200618.GF4161@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: libmlx4 wc flash > > > By the way, I just re-checked and it seems that WC support first > > appeared in Pentium II systems. So I think we should be able to > > use sfence if WC is enabled. > > That's actually doubly wrong: WC support was added in Pentium Pro, and > sfence was added in Pentium III. OK, here's what I remembered, after checking the sources: This memory type is available in the Pentium Pro and Pentium II processors by programming the MTRRs or in the Pentium III, Pentium 4, and Intel Xeon processors by programming the MTRRs or by selecting it through the PAT. so what it comes down to, is that if we assume that WC will *only be enabled through PAT* then it's safe to use sfence in this case. Right? -- MST From rick.jones2 at hp.com Tue May 15 13:08:24 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Tue, 15 May 2007 13:08:24 -0700 Subject: [ofa-general] Re: [ewg] OFED 1.2 rc3 release In-Reply-To: <7.0.1.0.2.20070515125402.02784f60@lanl.gov> References: <6C2C79E72C305246B504CBA17B5500C9A0E1C7@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C9015630E2@mtlexch01.mtl.com> <46487AE8.1020005@mellanox.co.il> <6a122cc00705140844k7f65a988x8746c4ea8474b920@mail.gmail.com> <7.0.1.0.2.20070515125402.02784f60@lanl.gov> Message-ID: <464A1338.7040405@hp.com> Parks Fields wrote: > > > Ofed 1.2 rc3 is running IPoIB slower that rc2 or earlier > > I use to get ~800MB/sec no tuning and now I get ~650MB/sec ?? > > Ideas?? Not specific to IPoIB, but whenever something like that happens to me I start with things like: *) CPU utilization - did that and the netperf (assuming netperf) service demand increase? *) packet losses? *) change in MTU? *) change in interrupt behaviour by the I/O card? 
(ie did netperf TCP_RR change much?)

rick jones

From mst at dev.mellanox.co.il Tue May 15 13:11:05 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 May 2007 23:11:05 +0300
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To:
References: <20070508141727.GR21591@mellanox.co.il>
Message-ID: <20070515201105.GG4161@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: libmlx4 wc flash
>
> > memcpy(ctx->bf_page + ctx->bf_offset, ctrl, align(size * 16, 64));
>
> By the way, why are we aligning the size of the WQE we copy to 64
> bytes? I copied this from Jack's code but I don't see anything that
> requires it. We already have:
>
> 	if (nreq == 1 && inl && size > 1 && size < ctx->bf_buf_size / 16) {
>
> so we will always have at least 32 bytes to copy.

This is an Intel-specific optimization (for new Intel processors):

	Once the processor has started to evict data from the WC buffer into system
	memory, it will make a bus-transaction style decision based on how much of the
	buffer contains valid data. If the buffer is full (for example, all bytes are
	valid) the processor will execute a burst-write transaction on the bus that
	will result in all 32 bytes (P6 family processors) or 64 bytes (Pentium 4 and
	Intel Xeon processor) being transmitted on the data bus in a single burst
	transaction. If one or more of the WC buffer’s bytes are invalid (for example,
	have not been written by software) then the processor will transmit the data to
	memory using “partial write” transactions (one chunk at a time, where a “chunk”
	is 8 bytes).

In other words, it is important to fill the full WC buffer to get good speed.
Need to check which sizes are good for AMD, PPC, ...

-- 
MST

From rdreier at cisco.com Tue May 15 13:11:55 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 15 May 2007 13:11:55 -0700
Subject: [ofa-general] Re: libmlx4 wc flash
In-Reply-To: <20070515200618.GF4161@mellanox.co.il> (Michael S.
Tsirkin's message of "Tue, 15 May 2007 23:06:18 +0300") References: <20070508141727.GR21591@mellanox.co.il> <20070513051806.GB7402@mellanox.co.il> <20070515200618.GF4161@mellanox.co.il> Message-ID: > so what it comes down to, is that if we assume that WC > will *only be enabled through PAT* then it's safe to use sfence > in this case. Right? I don't think we can really make any assumptions about what instructions a 32-bit x86 processor has available. Who knows what wacky stuff VIA or someone like that will come up with? The best thing seems to be just to stick to "lock; addl $0,0(%%esp)" for 32-bit x86. We now have the infrastructure to support multiple builds of libraries and have ld.so select automatically at runtime but I'm not sure it's really worth it. - R. From mst at dev.mellanox.co.il Tue May 15 13:16:34 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 May 2007 23:16:34 +0300 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: References: <20070508141727.GR21591@mellanox.co.il> Message-ID: <20070515201634.GH4161@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: libmlx4 wc flash > > ...and this for libmlx4? 
> > diff --git a/src/mlx4.h b/src/mlx4.h > index c4d389f..1e92b88 100644 > --- a/src/mlx4.h > +++ b/src/mlx4.h > @@ -65,6 +65,20 @@ > # define wmb() mb() > #endif > > +#ifndef wc_wmb > + > +#if defined(__i386__) > +#define wc_wmb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory") > +#elif defined(__x86_64__) > +#define wc_wmb() asm volatile("sfence" ::: "memory") > +#elif defined(__ia64__) > +#define wc_wmb() asm volatile("fwb" ::: "memory") > +#else > +#define wc_wmb() wmb() > +#endif > + > +#endif > + > #define HIDDEN __attribute__((visibility ("hidden"))) > > #define PFX "mlx4: " > diff --git a/src/qp.c b/src/qp.c > index a70e5f2..a4384f9 100644 > --- a/src/qp.c > +++ b/src/qp.c > @@ -282,9 +282,12 @@ out: > ++qp->sq.head; > > pthread_spin_lock(&ctx->bf_lock); > + > memcpy(ctx->bf_page + ctx->bf_offset, ctrl, align(size * 16, 64)); > - /* FIXME flush wc buffers */ > + wc_wmb(); > + > ctx->bf_offset ^= ctx->bf_buf_size; > + > pthread_spin_unlock(&ctx->bf_lock); > } else if (nreq) { > qp->sq.head += nreq; Since both the need for fencing and the size being copied are architecture-dependent, it might be that a better API would be memcpy_wc() that does the size alignment tricks and the flush in one go. -- MST From rdreier at cisco.com Tue May 15 13:22:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 13:22:49 -0700 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: <20070515201105.GG4161@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 15 May 2007 23:11:05 +0300") References: <20070508141727.GR21591@mellanox.co.il> <20070515201105.GG4161@mellanox.co.il> Message-ID: > Once the processor has started to evict data from the WC buffer into system > memory, it will make a bus-transaction style decision based on how much of the > buffer contains valid data. 
If the buffer is full (for example, all bytes are > valid) the processor will execute a burst-write transaction on the bus that > will result in all 32 bytes (P6 family processors) or 64 bytes (Pentium 4 and > Intel Xeon processor) being transmitted on the data bus in a single burst > transaction. If one or more of the WC buffer’s bytes are invalid (for example, > have not been written by software) then the processor will transmit the data to > memory using “partial write” transactions (one chunk at a time, where a “chunk” > is 8 bytes). OK, thanks. Do you have any idea how WC works on ppc? Is the lwsync instruction necessary/sufficient to flush WC buffers? - R. From rdreier at cisco.com Tue May 15 13:24:50 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 13:24:50 -0700 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: <20070515201105.GG4161@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 15 May 2007 23:11:05 +0300") References: <20070508141727.GR21591@mellanox.co.il> <20070515201105.GG4161@mellanox.co.il> Message-ID: > > memcpy(ctx->bf_page + ctx->bf_offset, ctrl, align(size * 16, 64)); By the way, if we have an SQ with 32-byte WQEs, and we do blueflame from a WQE at the end of the buffer, we might end up reading off the end of the buffer. Not very likely, I guess. I wonder if memset(,0,) for the remaining bytes might be faster anyway? - R. From mst at dev.mellanox.co.il Tue May 15 13:43:35 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 May 2007 23:43:35 +0300 Subject: [ofa-general] movnt (Was Re: libmlx4 wc flash) In-Reply-To: References: <20070508141727.GR21591@mellanox.co.il> <20070512172927.GA5908@mellanox.co.il> Message-ID: <20070515204335.GI4161@mellanox.co.il> > > I don't think it works this way: if PAT is programmed to UC, > > I think you get UC access with movntq. No?
> > You're right -- I misremembered what the non-temporal stuff does, but > I just checked and the manual says: > > "The memory type of the region being written to can override the > non-temporal hint, if the memory address specified for the > non-temporal store is in an uncacheable (UC) or write protected (WP) > memory region." Actually, I think I just thought up a way to solve this, and I quote in full: Vol. 1 10-19 PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) These SSE and SSE2 non-temporal store instructions minimize cache pollutions by treating the memory being accessed as the write combining (WC) type. If a program specifies a non-temporal store with one of these instructions and the destination region is mapped as cacheable memory (write back [WB], write through [WT] or WC memory type), the processor will do the following: • If the memory location being written to is present in the cache hierarchy, the data in the caches is evicted. • The non-temporal data is written to memory with WC semantics. See also: Chapter 10, “Memory Cache Control,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A. Using the WC semantics, the store transaction will be weakly ordered, meaning that the data may not be written to memory in program order, and the store will not write allocate (that is, the processor will not fetch the corresponding cache line into the cache hierarchy, prior to performing the store). Also, different processor implementations may choose to collapse and combine these stores. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in uncacheable memory. Uncacheable as referred to here means that the region being written to has been mapped with either an uncacheable (UC) or write protected (WP) memory type. ------------- So we can map the device memory with WB or WT semantics, and movnt will enable WC.
And the nice thing about this trick is that both WB and WT *are already programmed into PAT after reset*, which means that we can use them for pages we map for userspace, without stepping on anyone's toes or waiting for the generic in-kernel support for WC to materialize. Another nice thing is that all WRs are 16-byte aligned so we can use the aligned instructions there. Given that full WC support in kernel is likely to take quite a while to materialize, maybe that's the way to go for now? What do you think? I attach a header file that implements WC memcpy with these instructions for lengths from 16 to 128 bytes (and one can, naturally, just call xmm_copy64 in a loop), that I wrote for fun at some point. Feel free to read/flame/reuse in any way you like. As far as I remember, replacing memcpy with this hack resulted in a marginal latency speedup on Intel, likely on account of loop unrolling I did there. -- MST -------------- next part -------------- A non-text attachment was scrubbed... Name: xmm.h Type: text/x-chdr Size: 2123 bytes Desc: not available URL: From mst at dev.mellanox.co.il Tue May 15 13:48:04 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 May 2007 23:48:04 +0300 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: References: <20070508141727.GR21591@mellanox.co.il> <20070515201105.GG4161@mellanox.co.il> Message-ID: <20070515204804.GJ4161@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: libmlx4 wc flash > > > > memcpy(ctx->bf_page + ctx->bf_offset, ctrl, align(size * 16, 64)); > > By the way, if we have an SQ with 32-byte WQEs, and we do blueflame > from a WQE at the end of the buffer, we might end up reading off the > end of the buffer. Not very likely, I guess. Hmm, is 32-byte SQ wqe shift actually possible? Which parameters give this? > I wonder if memset(,0,) for the remaining bytes might be faster anyway? In some early testing it seemed much slower.
Try it :) -- MST From mst at dev.mellanox.co.il Tue May 15 13:50:49 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 May 2007 23:50:49 +0300 Subject: [ofa-general] Re: libmlx4 wc flash In-Reply-To: References: <20070508141727.GR21591@mellanox.co.il> <20070515201105.GG4161@mellanox.co.il> Message-ID: <20070515205049.GK4161@mellanox.co.il> > Do you have any idea how WC works on ppc? Is the lwsync instruction > necessary/sufficient to flush WC buffers? Dunno yet. I hear Jack here plans to start looking at ppc RSN. -- MST From mst at dev.mellanox.co.il Tue May 15 14:04:53 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 May 2007 00:04:53 +0300 Subject: [ofa-general] [PATCH RFC/untested v0] IPoIB/CM: fix SRQ WR leak Message-ID: <20070515210453.GL4161@mellanox.co.il> If the Consumer does not wait for the Affiliated Asynchronous Last WQE Reached Event, then WQE and Data Segment leakage may occur. This leakage has been observed with IPoIB/CM: flipping ports on and off will, with time, leak out all WRs and then all connections will start getting RNR NACKs. Fix this in the way suggested by the spec: create a "drain QP" in error state, wait for the last WQE reached event on a QP and then post a send on the "drain QP". Once we observe a completion on the drain QP, it's safe to call ib_destroy_qp. --- Roland, all. Here's a largish, and untested, patch that fixes a design bug in the way IPoIB/CM destroyed passive connections. Unfortunately, doing it by the spec kind of forces us to add a "state" for passive connections, and split the passive list per connection state. That's why the patch grew to be so large. I expect to post a fully tested version by the beginning of next week. The issue addressed is very severe (work-around is to unload the ipoib module once in a while) and I do think we need this fixed in 2.6.22, so given how large the patch is, I'd like to ask everyone to review and comment. NB: this is on top of 2.6.22-rc1.
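The teardown order described in the patch notes (QP to error, wait for Last WQE Reached, wait for a marker completion on the shared CQ, then destroy) can be modeled in plain user-space C. This is a minimal sketch under stated assumptions: the three state names mirror the ones the patch introduces, but `rx_conn`, the `rx_*()` helpers and the extra `RX_DESTROYED` state are hypothetical stand-ins rather than kernel or verbs API, and the strict ERROR-before-FLUSH check is a simplification.

```c
#include <assert.h>
#include <stdbool.h>

/* LIVE/ERROR/FLUSH follow the patch; RX_DESTROYED is added for the model. */
enum rx_state { RX_LIVE, RX_ERROR, RX_FLUSH, RX_DESTROYED };

/* Hypothetical stand-in for one passive (receive side) connection. */
struct rx_conn {
	enum rx_state state;
};

/* Step 1: the connection is going away, so its QP is moved to Error. */
static void rx_move_to_error(struct rx_conn *c)
{
	if (c->state == RX_LIVE)
		c->state = RX_ERROR;
}

/* Step 2: the Last WQE Reached async event arrived for this QP. */
static void rx_last_wqe_reached(struct rx_conn *c)
{
	if (c->state == RX_ERROR)
		c->state = RX_FLUSH;
}

/*
 * Step 3: a marker WR posted after step 2 has completed on the same CQ.
 * Returns true only if the QP may now be destroyed.
 */
static bool rx_drain_completed(struct rx_conn *c)
{
	if (c->state != RX_FLUSH)
		return false;
	c->state = RX_DESTROYED;
	return true;
}

/* Walk one connection through the whole sequence. */
static void rx_teardown_demo(void)
{
	struct rx_conn c = { RX_LIVE };

	assert(!rx_drain_completed(&c)); /* too early: still live */
	rx_move_to_error(&c);
	assert(!rx_drain_completed(&c)); /* too early: event not seen yet */
	rx_last_wqe_reached(&c);
	assert(rx_drain_completed(&c));  /* now destroying is allowed */
}
```

The point of the model is that destroying the QP is refused at every step before the marker completion, which is the condition the spec text ties to avoiding WQE leakage.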
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 87310ee..e300c75 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -132,12 +132,39 @@ struct ipoib_cm_data { __be32 mtu; }; +/* Quoting spec: + * + * If the Consumer does not wait for the Affiliated Asynchronous Last WQE Reached + * Event, then WQE and Data Segment leakage may occur. Therefore, it is good + * programming practice to tear down a QP that is associated with an SRQ by using + * the following process: + * + * + * Put the QP in the Error State + * Wait for the Affiliated Asynchronous Last WQE Reached Event; + * either: + * drain the CQ by invoking the Poll CQ verb and either wait for CQ + * to be empty or the number of Poll CQ operations has exceeded + * CQ capacity size; + * or + * post another WR that completes on the same CQ and wait for this + * WR to return as a WC; + * and then invoke a Destroy QP or Reset QP. + */ + +enum ipoib_cm_state { + IPOIB_CM_RX_LIVE, + IPOIB_CM_RX_ERROR, /* Ignored by stale task */ + IPOIB_CM_RX_FLUSH /* Last WQE Reached event observed */ +}; + struct ipoib_cm_rx { struct ib_cm_id *id; struct ib_qp *qp; struct list_head list; struct net_device *dev; unsigned long jiffies; + enum ipoib_cm_state state; }; struct ipoib_cm_tx { @@ -165,10 +192,15 @@ struct ipoib_cm_dev_priv { struct ib_srq *srq; struct ipoib_cm_rx_buf *srq_ring; struct ib_cm_id *id; - struct list_head passive_ids; + struct ib_qp *rx_drain_qp; + struct list_head passive_ids; /* state: LIVE */ + struct list_head rx_error_list; /* state: ERROR */ + struct list_head rx_flush_list; /* state: FLUSH, drain not started */ + struct list_head rx_drain_list; /* state: FLUSH, drain started */ struct work_struct start_task; struct work_struct reap_task; struct work_struct skb_task; + struct work_struct rx_drain_task; struct delayed_work stale_task; struct sk_buff_head skb_queue; struct list_head start_list; diff --git 
a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 785bc85..f6a1405 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -62,6 +62,17 @@ struct ipoib_cm_id { u32 remote_mtu; }; +static struct ib_qp_attr ipoib_cm_err_attr __read_mostly = { + .qp_state = IB_QPS_ERR +}; + +#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff + +static struct ib_send_wr ipoib_cm_rx_drain_wr __read_mostly = { + .wr_id = 0xfff /* todo */, + .opcode = IB_WR_SEND +}; + static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); @@ -150,11 +161,42 @@ partial_error: return NULL; } +static void ipoib_cm_rx_drain(struct ipoib_dev_priv* priv) +{ + struct ib_send_wr *bad_send_wr; + + if (list_empty(&priv->cm.rx_flush_list) || + !list_empty(&priv->cm.rx_drain_list)) + return; + + if (ib_post_send(priv->cm.rx_drain_qp, &ipoib_cm_rx_drain_wr, &bad_send_wr)) + ipoib_warn(priv, "failed to start rx flush\n"); + + list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list); +} + +static void ipoib_cm_rx_event_handler(struct ib_event *event, void *ctx) +{ + struct ipoib_cm_rx *p = ctx; + struct ipoib_dev_priv *priv = netdev_priv(p->dev); + unsigned long flags; + + if (event->event != IB_EVENT_QP_LAST_WQE_REACHED) + return; + + spin_lock_irqsave(&priv->lock, flags); + list_move(&p->list, &priv->cm.rx_flush_list); + p->state = IPOIB_CM_RX_FLUSH; + ipoib_cm_rx_drain(priv); + spin_unlock_irqrestore(&priv->lock, flags); +} + static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev, struct ipoib_cm_rx *p) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr attr = { + .event_handler = ipoib_cm_rx_event_handler, .send_cq = priv->cq, /* does not matter, we never send anything */ .recv_cq = priv->cq, .srq = priv->cm.srq, @@ -256,6 +298,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even cm_id->context = p; p->jiffies = jiffies; + p->state = 
IPOIB_CM_RX_LIVE; spin_lock_irq(&priv->lock); list_add(&p->list, &priv->cm.passive_ids); spin_unlock_irq(&priv->lock); @@ -271,12 +314,12 @@ err_qp: return ret; } + static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) { struct ipoib_cm_rx *p; struct ipoib_dev_priv *priv; - int ret; switch (event->event) { case IB_CM_REQ_RECEIVED: @@ -288,20 +331,9 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, case IB_CM_REJ_RECEIVED: p = cm_id->context; priv = netdev_priv(p->dev); - spin_lock_irq(&priv->lock); - if (list_empty(&p->list)) - ret = 0; /* Connection is going away already. */ - else { - list_del_init(&p->list); - ret = -ECONNRESET; - } - spin_unlock_irq(&priv->lock); - if (ret) { - ib_destroy_qp(p->qp); - kfree(p); - return ret; - } - return 0; + if (ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE)) + ipoib_warn(priv, "unable to move qp to error state\n"); + /* Fall through */ default: return 0; } @@ -353,8 +385,11 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) wr_id, wc->status); if (unlikely(wr_id >= ipoib_recvq_size)) { - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", - wr_id, ipoib_recvq_size); + if (wr_id == IPOIB_CM_RX_DRAIN_WRID) + queue_work(ipoib_workqueue, &priv->cm.rx_drain_task); + else + ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", + wr_id, ipoib_recvq_size); return; } @@ -373,9 +408,9 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { spin_lock_irqsave(&priv->lock, flags); p->jiffies = jiffies; - /* Move this entry to list head, but do - * not re-add it if it has been removed. */ - if (!list_empty(&p->list)) + /* Move this entry to list head, but do not re-add it + * if it has been moved out of list. 
*/ + if (p->state == IPOIB_CM_RX_LIVE) list_move(&p->list, &priv->cm.passive_ids); spin_unlock_irqrestore(&priv->lock, flags); queue_delayed_work(ipoib_workqueue, @@ -584,17 +619,40 @@ static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr) int ipoib_cm_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; int ret; if (!IPOIB_CM_SUPPORTED(dev->dev_addr)) return 0; + priv->cm.rx_drain_qp = ipoib_cm_create_rx_qp(dev, NULL); + if (IS_ERR(priv->cm.rx_drain_qp)) { + printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); + ret = PTR_ERR(priv->cm.rx_drain_qp); + return ret; + } + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.port_num = priv->port; + qp_attr.qkey = 0; + qp_attr.qp_access_flags = 0; + ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr, + IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PORT | IB_QP_QKEY); + if (ret) { + ipoib_warn(priv, "failed to modify drain QP to INIT: %d\n", ret); + goto err_qp; + } + qp_attr.qp_state = IB_QPS_ERR; + ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr, IB_QP_STATE); + if (ret) { + ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret); + goto err_qp; + } + priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev); if (IS_ERR(priv->cm.id)) { printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); - ret = PTR_ERR(priv->cm.id); - priv->cm.id = NULL; - return ret; + goto err_cm; } ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num), @@ -602,35 +660,76 @@ int ipoib_cm_dev_open(struct net_device *dev) if (ret) { printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name, IPOIB_CM_IETF_ID | priv->qp->qp_num); - ib_destroy_cm_id(priv->cm.id); - priv->cm.id = NULL; - return ret; + goto err_cm; } + return 0; + +err_cm: + ib_destroy_cm_id(priv->cm.id); + priv->cm.id = NULL; +err_qp: + ib_destroy_qp(priv->cm.rx_drain_qp); + return ret; } void ipoib_cm_dev_stop(struct net_device *dev) { 
struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_cm_rx *p; + struct ipoib_cm_rx *p, *n; + unsigned long begin; + LIST_HEAD(list); + int ret; if (!IPOIB_CM_SUPPORTED(dev->dev_addr) || !priv->cm.id) return; ib_destroy_cm_id(priv->cm.id); priv->cm.id = NULL; + spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { - p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); - list_del_init(&p->list); + p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; spin_unlock_irq(&priv->lock); + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + spin_lock_irq(&priv->lock); + } + + /* Wait for all RX to be drained */ + begin = jiffies; + + while (!list_empty(&priv->cm.rx_error_list) || + !list_empty(&priv->cm.rx_flush_list) || + !list_empty(&priv->cm.rx_drain_list)) { + if (time_after(jiffies, begin + 5 * HZ)) { + ipoib_warn(priv, "RX drain timing out\n"); + + /* + * assume the HW is wedged and just free up everything.
+ */ + list_splice_init(&priv->cm.rx_flush_list, &list); + list_splice_init(&priv->cm.rx_error_list, &list); + list_splice_init(&priv->cm.rx_drain_list, &list); + break; + } + spin_unlock_irq(&priv->lock); + msleep(1); + spin_lock_irq(&priv->lock); + } + + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(p, n, &list, list) { ib_destroy_cm_id(p->id); ib_destroy_qp(p->qp); kfree(p); - spin_lock_irq(&priv->lock); } - spin_unlock_irq(&priv->lock); + ib_destroy_qp(priv->cm.rx_drain_qp); cancel_delayed_work(&priv->cm.stale_task); } @@ -1080,24 +1179,45 @@ void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb, queue_work(ipoib_workqueue, &priv->cm.skb_task); } +static void ipoib_cm_rx_drain_task(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, + cm.rx_drain_task); + struct ipoib_cm_rx *p, *n; + LIST_HEAD(list); + + spin_lock_irq(&priv->lock); + list_splice_init(&priv->cm.rx_drain_list, &list); + ipoib_cm_rx_drain(priv); + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(p, n, &list, list) { + ib_destroy_cm_id(p->id); + ib_destroy_qp(p->qp); + kfree(p); + } +} + static void ipoib_cm_stale_task(struct work_struct *work) { struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, cm.stale_task.work); struct ipoib_cm_rx *p; + int ret; spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { - /* List if sorted by LRU, start from tail, + /* List is sorted by LRU, start from tail, * stop when we see a recently used entry */ p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; - list_del_init(&p->list); + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; spin_unlock_irq(&priv->lock); - ib_destroy_cm_id(p->id); - ib_destroy_qp(p->qp); - kfree(p); + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp 
to error state: %d\n", ret); spin_lock_irq(&priv->lock); } spin_unlock_irq(&priv->lock); @@ -1161,9 +1281,12 @@ int ipoib_cm_dev_init(struct net_device *dev) INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); INIT_LIST_HEAD(&priv->cm.start_list); + INIT_LIST_HEAD(&priv->cm.rx_flush_list); + INIT_LIST_HEAD(&priv->cm.rx_error_list); INIT_WORK(&priv->cm.start_task, ipoib_cm_tx_start); INIT_WORK(&priv->cm.reap_task, ipoib_cm_tx_reap); INIT_WORK(&priv->cm.skb_task, ipoib_cm_skb_reap); + INIT_WORK(&priv->cm.rx_drain_task, ipoib_cm_rx_drain_task); INIT_DELAYED_WORK(&priv->cm.stale_task, ipoib_cm_stale_task); skb_queue_head_init(&priv->cm.skb_queue); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 5c3c6a4..af8a6d4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -185,7 +185,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) size = ipoib_sendq_size + ipoib_recvq_size + 1; ret = ipoib_cm_dev_init(dev); if (!ret) - size += ipoib_recvq_size; + size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */; priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { -- MST From eli at dev.mellanox.co.il Tue May 15 14:05:44 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 16 May 2007 00:05:44 +0300 Subject: [ofa-general] Re: [PATCH] libmlx4: WQE shift calculation In-Reply-To: References: <1179145102.25749.11.camel@mtls03> <1179242042.25749.33.camel@mtls03> Message-ID: <4e6a6b3c0705151405n18afdebna815a3ee1c117f74@mail.gmail.com> On 5/15/07, Roland Dreier wrote: > > > First I should add the case that triggered his patch: the userspace code > > calculated a smaller buffer size than the kernel code, which caused > > get_user_pages() to fail since part of the buffer did not belong to the > > process's address space. 
> > OK, in this case it seems the bug is in the kernel -- since it is > overestimating the size of the WQEs needed. So we might as well fix > it in the kernel. > > > As Michael said in a subsequent post, we still need this code both in > > user and in kernel. > > Yes, but I think this issue really convinces me that we should > decouple the two calculations, so the kernel code is only used for > kernel QPs. And then change the mlx4 ABI so that userspace tells the > kernel the wqe buffer size and rq/sq wqe shift/offset. That will > allow for different SQ BB sizes and also make things more robust > against bugs like this. > > - R. So it looks like we can start by: 1. Change the user code to pass the size to kernel 2. Fix calculations in kernel. Would you like me to send patches or do you prefer to add your code? If you prefer to code this can you tell when that would be? From rdreier at cisco.com Tue May 15 14:55:03 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 14:55:03 -0700 Subject: [ofa-general] Re: [PATCH] libmlx4: WQE shift calculation In-Reply-To: <4e6a6b3c0705151405n18afdebna815a3ee1c117f74@mail.gmail.com> (Eli Cohen's message of "Wed, 16 May 2007 00:05:44 +0300") References: <1179145102.25749.11.camel@mtls03> <1179242042.25749.33.camel@mtls03> <4e6a6b3c0705151405n18afdebna815a3ee1c117f74@mail.gmail.com> Message-ID: > So it looks like we can start by: > 1. Change the user code to pass the size to kernel > 2. Fix calculations in kernel. > > Would you like me to send patches or do you prefer to add your code? If you > prefer to code this can you tell when that would be? It would be great if you implement it. Otherwise if you don't get to it, I will probably look at it on Thursday or Friday (your weekend). - R.
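The direction agreed on in this exchange (userspace passes the WQE buffer size and the rq/sq shift and offset, and the kernel only validates them instead of recomputing them) might look roughly like the sketch below. Everything in it is an assumption made for illustration: `example_create_qp_cmd`, its fields, `example_check_layout()` and the arbitrary cap of 16 on the shift values are invented here and are not the mlx4 ABI.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical create-QP command filled in by userspace (not the real ABI). */
struct example_create_qp_cmd {
	uint64_t buf_addr;     /* start of the user WQE buffer */
	uint32_t buf_size;     /* total bytes userspace allocated */
	uint32_t sq_offset;    /* where the send queue starts in the buffer */
	uint32_t rq_offset;    /* where the receive queue starts */
	uint8_t  sq_wqe_shift; /* log2 of the send WQE stride */
	uint8_t  rq_wqe_shift; /* log2 of the receive WQE stride */
	uint8_t  reserved[2];
};

/*
 * Kernel-side sanity check in this model: both queues must fit inside the
 * buffer userspace described and must not overlap. Returns 0 or -1.
 */
static int example_check_layout(const struct example_create_qp_cmd *cmd,
				uint32_t sq_wqes, uint32_t rq_wqes)
{
	uint64_t sq_end, rq_end;

	if (cmd->sq_wqe_shift > 16 || cmd->rq_wqe_shift > 16)
		return -1;

	sq_end = cmd->sq_offset + ((uint64_t)sq_wqes << cmd->sq_wqe_shift);
	rq_end = cmd->rq_offset + ((uint64_t)rq_wqes << cmd->rq_wqe_shift);

	if (sq_end > cmd->buf_size || rq_end > cmd->buf_size)
		return -1;
	if (cmd->sq_offset < rq_end && cmd->rq_offset < sq_end)
		return -1; /* queues overlap */
	return 0;
}

/* 64 receive and 64 send WQEs of 64 bytes each in an 8 KB buffer. */
static void example_layout_demo(void)
{
	struct example_create_qp_cmd cmd = {
		.buf_size = 8192,
		.rq_offset = 0,    .rq_wqe_shift = 6,
		.sq_offset = 4096, .sq_wqe_shift = 6,
	};

	assert(example_check_layout(&cmd, 64, 64) == 0);

	/* Doubling the send stride no longer fits, so it is rejected. */
	cmd.sq_wqe_shift = 7;
	assert(example_check_layout(&cmd, 64, 64) == -1);
}
```

With a layout like this, a size mismatch between the two sides shows up as a rejected command rather than as a later get_user_pages() failure on memory the process never allocated.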
From xma at us.ibm.com Tue May 15 15:31:04 2007 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 15 May 2007 15:31:04 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review In-Reply-To: <20070515185445.GD4161@mellanox.co.il> Message-ID: > > Regarding the memory issue w/o SRQ, do you think there is a way to use low > > watermark to release prepost buffer in large connections? > > Maybe with UC - with RC you'll get RNR and connection'll get closed before you > have time to handle the low watermark. So sure, might be an interesting idea, > but isn't low watermark a SRQ feature? > > > I think most of the prepost buffers are empty in that case because of the BW. > > I don't really get the argument. > > -- > MST That's just some random idea. :) Some other ideas: share the RQ buffer based on source-destination address, or use a per-CPU RQ buffer, which might hurt performance too much? It might be too complicated to have UD/RC modes coexist? Maybe it's better to set up a small RQ size for now, and later when the high watermark patch is available we can use it to address RQ overrun? Thanks Shirley From guthridg at us.ibm.com Tue May 15 15:46:36 2007 From: guthridg at us.ibm.com (Scott Guthridge) Date: Tue, 15 May 2007 18:46:36 -0400 Subject: [ofa-general] ibv_modify_port? Message-ID: I'm working on implementing a DM agent at user-level for an experimental I/O controller. Registering the DM agent via umad_register seems to work -- I can receive DM MADs. But there doesn't appear to be a way to set the IB_PORT_DEVICE_MGMT_SUP bit in the port's SA PortInfo.CapabilityMask, so mask-match SA port queries do not find my device. I noticed that the "ib_srpt" driver does an explicit ib_modify_port in order to set this flag. If there were a user-level version of this function, I could do the same. But... this leads to another point.
Implementing the DMA in each target driver isn't a particularly general approach. The problem is that you can't implement more than one target driver behind the same channel adapter. For example, I cannot register my DM agent if the ib_srpt module happens to be loaded. I would like to propose a better interface. What if there were a generic DM agent in the kernel that provided an API for target devices (kernel and user) to register IOCs with it? It might look something like this: struct ib_dm_ioc { ... u8 ioc_slot; ... };.. struct ib_dm_ioc *ib_dm_register_ioc(struct ib_device *device, u8 port_num, const struct ib_dm_ioc_profile *ioprof); void ib_dm_unregister_ioc(struct ib_dm_ioc *iocp); /* returns service entry slot number */ int ib_dm_add_svcent(struct ib_dm_ioc *, const char *svc_name, u64 service_id); void ib_dm_del_svcent(struct ib_dm_ioc *, int svcent_slot); /* additional registration fn's for diag support could be added later if someone feels ambitious */ The generic DMA would set the IB_PORT_DEVICE_MGMT_SUP flag in PortInfo.CapabilityMask whenever at least one IOC is registered. This interface would allow the DMA functionality to be removed from target drivers, simplifying them somewhat. And it would make it possible to support more than one type of IOC within the same target CA. Comments? Scott From mshefty at ichips.intel.com Tue May 15 16:19:19 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 15 May 2007 16:19:19 -0700 Subject: [ofa-general] ibv_modify_port? In-Reply-To: References: Message-ID: <464A3FF7.6090101@ichips.intel.com> > I would like to propose a better interface. What if there were a generic > DM agent in the kernel that provided an API for target devices (kernel and > user) to register IOCs with it? It might look something like this: A generic DM makes sense. There are existing interfaces / implementations available in some of the legacy code that might be of use for a starting point.
I know there's some DM related code in the svn database in the gen1 branch. There may be other implementations under the trunk/contrib directories as well, but I didn't actually check there. - Sean From rdreier at cisco.com Tue May 15 19:32:15 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 15 May 2007 19:32:15 -0700 Subject: [ofa-general] ibv_modify_port? In-Reply-To: (Scott Guthridge's message of "Tue, 15 May 2007 18:46:36 -0400") References: Message-ID: > I noticed that the "ib_srpt" driver does an explicit ib_modify_port in > order to set this flag. If there were a user-level version of this > function, I could do the same. It would be pretty straightforward to add something like /dev/infiniband/isdmX that behaves like the issmX files we already have. Or we could even have something that automatically sets the IsDM bit when the first agent for DM class is created and clears it when the last agent is destroyed. (In fact we could do the same thing for IsCM if we wanted to) > I would like to propose a better interface. What if there were a generic > DM agent in the kernel that provided an API for target devices (kernel and > user) to register IOC's with it? It might look something like this: I'm not sure having a DM agent in the kernel is worth it. Why not have a generic daemon in userspace to do all the DMA stuff? I don't see a strong reason to put it in the kernel, and userspace code is quite a bit easier to write... - R. 
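The idea of setting the IsDM bit automatically when the first DM-class agent is created and clearing it when the last one is destroyed amounts to a per-port reference count. A minimal sketch, assuming a single-threaded caller: `example_port`, `example_dm_agent_register()`, `example_dm_agent_unregister()` and the `EXAMPLE_CAP_DEVICE_MGMT` bit value are hypothetical names chosen for illustration, not existing kernel or umad interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative capability bit; the position is an assumption of this sketch. */
#define EXAMPLE_CAP_DEVICE_MGMT (1u << 19)

/* Hypothetical per-port bookkeeping. */
struct example_port {
	uint32_t cap_mask;      /* stands in for PortInfo.CapabilityMask */
	unsigned int dm_agents; /* number of registered DM-class agents */
};

/* The first agent to register turns the capability bit on. */
static void example_dm_agent_register(struct example_port *p)
{
	if (p->dm_agents++ == 0)
		p->cap_mask |= EXAMPLE_CAP_DEVICE_MGMT;
}

/* The last agent to go away turns the capability bit off again. */
static void example_dm_agent_unregister(struct example_port *p)
{
	assert(p->dm_agents > 0);
	if (--p->dm_agents == 0)
		p->cap_mask &= ~EXAMPLE_CAP_DEVICE_MGMT;
}
```

A real implementation would need locking around the counter and would have to push the mask change to the port through whatever mechanism the driver provides; the sketch only shows the first-in, last-out rule, which would let several DM agents share one port without fighting over the flag.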
From vlad at lists.openfabrics.org Wed May 16 02:39:41 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 16 May 2007 02:39:41 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070516-0200 daily build status Message-ID: <20070516093941.F0B71E6082D@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with 
linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From mst at dev.mellanox.co.il Wed May 16 03:14:57 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 May 2007 13:14:57 +0300 Subject: [ofa-general] [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR leak In-Reply-To: <20070515210453.GL4161@mellanox.co.il> References: <20070515210453.GL4161@mellanox.co.il> Message-ID: <20070516101457.GA5091@mellanox.co.il> SRQ WR leakage has been observed with IPoIB/CM: e.g. flipping ports on and off will, with time, leak out all WRs and then all connections will start getting RNR NACKs. Fix this in the way suggested by spec: move QP to error, wait for last wqe reached event and then post send on "drain QP" connected to the same CQ. Once we observe a completion on the drain QP, it's safe to call ib_destroy_qp. Signed-off-by: Michael S. Tsirkin --- Changes from v0: fixed drain WR ID, comment on the algorithm used, cleaned up the patch description. Roland, all. This is a largish, and still untested, patch that fixes a design bug in the way IPoIB/CM destroyed QPs connected to SRQ. Unfortunately, doing it by the spec kind of forces us to add a "state" for passive connections, and split the connection list per connection state. That's why the patch grew to be so large. The issue addressed is very severe (the only work-around is to unload the ipoib module once in a while), so given how large the patch is, I'd like to ask everyone to review and comment. NB: this is on top of 2.6.22-rc1. 
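One consequence of using a single "drain QP" is that a single marker WR can cover every QP whose Last WQE Reached event arrived before the marker was posted, so QPs can be destroyed in batches. The user-space model below is a sketch of that batching as read from the earlier v0 posting; the counters stand in for the flush and drain lists, and `drain_model` with its helper functions is hypothetical, not code from the patch.

```c
#include <assert.h>

/* Hypothetical counters standing in for the flush and drain lists. */
struct drain_model {
	int flush_pending;  /* saw Last WQE Reached, no marker covers them yet */
	int drain_pending;  /* covered by the marker WR currently in flight */
	int markers_posted; /* total marker WRs posted on the drain QP */
	int destroyed;      /* QPs that were safe to destroy */
};

/* Post a marker only if there is work and no marker is already in flight. */
static void drain_start(struct drain_model *m)
{
	if (m->flush_pending == 0 || m->drain_pending != 0)
		return;
	m->markers_posted++;
	m->drain_pending = m->flush_pending;
	m->flush_pending = 0;
}

/* A QP reported Last WQE Reached. */
static void drain_last_wqe_event(struct drain_model *m)
{
	m->flush_pending++;
	drain_start(m);
}

/* The in-flight marker completed: destroy its batch, start the next one. */
static void drain_marker_completed(struct drain_model *m)
{
	m->destroyed += m->drain_pending;
	m->drain_pending = 0;
	drain_start(m);
}

/* Three QPs are torn down with only two marker WRs. */
static void drain_batch_demo(void)
{
	struct drain_model m = { 0, 0, 0, 0 };

	drain_last_wqe_event(&m);   /* posts marker 1, covering QP 1 */
	drain_last_wqe_event(&m);   /* QP 2 waits: a marker is in flight */
	drain_last_wqe_event(&m);   /* QP 3 waits as well */
	assert(m.markers_posted == 1);

	drain_marker_completed(&m); /* QP 1 destroyed, marker 2 covers 2 and 3 */
	assert(m.destroyed == 1 && m.markers_posted == 2);

	drain_marker_completed(&m); /* QPs 2 and 3 destroyed */
	assert(m.destroyed == 3 && m.markers_posted == 2);
}
```

Keeping at most one marker outstanding is also what lets the drain QP get by with a very small send queue and one extra CQ entry.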
ipoib.h | 38 ++++++++++ ipoib_cm.c | 208 ++++++++++++++++++++++++++++++++++++++++++++++++---------- ipoib_verbs.c | 2 3 files changed, 212 insertions(+), 36 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 87310ee..087bbfc 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -132,12 +132,43 @@ struct ipoib_cm_data { __be32 mtu; }; +/* + * Quoting 10.3.1 Queue Pair and EE Context States: + * + * Note, for QPs that are associated with an SRQ, the Consumer should take the + * QP through the Error State before invoking a Destroy QP or a Modify QP to the + * Reset State. The Consumer may invoke the Destroy QP without first performing + * a Modify QP to the Error State and waiting for the Affiliated Asynchronous + * Last WQE Reached Event. However, if the Consumer does not wait for the + * Affiliated Asynchronous Last WQE Reached Event, then WQE and Data Segment + * leakage may occur. Therefore, it is good programming practice to tear down a + * QP that is associated with an SRQ by using the following process: + * + * - Put the QP in the Error State + * - Wait for the Affiliated Asynchronous Last WQE Reached Event; + * - either: + * drain the CQ by invoking the Poll CQ verb and either wait for CQ + * to be empty or the number of Poll CQ operations has exceeded + * CQ capacity size; + * - or + * post another WR that completes on the same CQ and wait for this + * WR to return as a WC; (NB: this is the option that we use) + * and then invoke a Destroy QP or Reset QP. 
+ */ + +enum ipoib_cm_state { + IPOIB_CM_RX_LIVE, + IPOIB_CM_RX_ERROR, /* Ignored by stale task */ + IPOIB_CM_RX_FLUSH /* Last WQE Reached event observed */ +}; + struct ipoib_cm_rx { struct ib_cm_id *id; struct ib_qp *qp; struct list_head list; struct net_device *dev; unsigned long jiffies; + enum ipoib_cm_state state; }; struct ipoib_cm_tx { @@ -165,10 +196,15 @@ struct ipoib_cm_dev_priv { struct ib_srq *srq; struct ipoib_cm_rx_buf *srq_ring; struct ib_cm_id *id; - struct list_head passive_ids; + struct ib_qp *rx_drain_qp; /* generates WR described in 10.3.1 */ + struct list_head passive_ids; /* state: LIVE */ + struct list_head rx_error_list; /* state: ERROR */ + struct list_head rx_flush_list; /* state: FLUSH, drain not started */ + struct list_head rx_drain_list; /* state: FLUSH, drain started */ struct work_struct start_task; struct work_struct reap_task; struct work_struct skb_task; + struct work_struct rx_drain_task; struct delayed_work stale_task; struct sk_buff_head skb_queue; struct list_head start_list; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 785bc85..d4e4cf3 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -62,6 +62,17 @@ struct ipoib_cm_id { u32 remote_mtu; }; +static struct ib_qp_attr ipoib_cm_err_attr __read_mostly = { + .qp_state = IB_QPS_ERR +}; + +#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff + +static struct ib_send_wr ipoib_cm_rx_drain_wr __read_mostly = { + .wr_id = IPOIB_CM_RX_DRAIN_WRID, + .opcode = IB_WR_SEND +}; + static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); @@ -150,11 +161,44 @@ partial_error: return NULL; } +static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv* priv) +{ + struct ib_send_wr *bad_send_wr; + + /* rx_drain_qp send queue depth is 1, so + * make sure we have at most 1 outstanding WR. 
*/ + if (list_empty(&priv->cm.rx_flush_list) || + !list_empty(&priv->cm.rx_drain_list)) + return; + + if (ib_post_send(priv->cm.rx_drain_qp, &ipoib_cm_rx_drain_wr, &bad_send_wr)) + ipoib_warn(priv, "failed to post rx_drain wr\n"); + + list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list); +} + +static void ipoib_cm_rx_event_handler(struct ib_event *event, void *ctx) +{ + struct ipoib_cm_rx *p = ctx; + struct ipoib_dev_priv *priv = netdev_priv(p->dev); + unsigned long flags; + + if (event->event != IB_EVENT_QP_LAST_WQE_REACHED) + return; + + spin_lock_irqsave(&priv->lock, flags); + list_move(&p->list, &priv->cm.rx_flush_list); + p->state = IPOIB_CM_RX_FLUSH; + ipoib_cm_start_rx_drain(priv); + spin_unlock_irqrestore(&priv->lock, flags); +} + static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev, struct ipoib_cm_rx *p) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr attr = { + .event_handler = ipoib_cm_rx_event_handler, .send_cq = priv->cq, /* does not matter, we never send anything */ .recv_cq = priv->cq, .srq = priv->cm.srq, @@ -256,6 +300,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even cm_id->context = p; p->jiffies = jiffies; + p->state = IPOIB_CM_RX_LIVE; spin_lock_irq(&priv->lock); list_add(&p->list, &priv->cm.passive_ids); spin_unlock_irq(&priv->lock); @@ -276,7 +321,6 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, { struct ipoib_cm_rx *p; struct ipoib_dev_priv *priv; - int ret; switch (event->event) { case IB_CM_REQ_RECEIVED: @@ -288,20 +332,9 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, case IB_CM_REJ_RECEIVED: p = cm_id->context; priv = netdev_priv(p->dev); - spin_lock_irq(&priv->lock); - if (list_empty(&p->list)) - ret = 0; /* Connection is going away already. 
*/ - else { - list_del_init(&p->list); - ret = -ECONNRESET; - } - spin_unlock_irq(&priv->lock); - if (ret) { - ib_destroy_qp(p->qp); - kfree(p); - return ret; - } - return 0; + if (ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE)) + ipoib_warn(priv, "unable to move qp to error state\n"); + /* Fall through */ default: return 0; } @@ -353,8 +386,11 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) wr_id, wc->status); if (unlikely(wr_id >= ipoib_recvq_size)) { - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", - wr_id, ipoib_recvq_size); + if (wr_id == IPOIB_CM_RX_DRAIN_WRID) + queue_work(ipoib_workqueue, &priv->cm.rx_drain_task); + else + ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", + wr_id, ipoib_recvq_size); return; } @@ -373,9 +409,9 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { spin_lock_irqsave(&priv->lock, flags); p->jiffies = jiffies; - /* Move this entry to list head, but do - * not re-add it if it has been removed. */ - if (!list_empty(&p->list)) + /* Move this entry to list head, but do not re-add it + * if it has been moved out of list. 
*/ + if (p->state == IPOIB_CM_RX_LIVE) list_move(&p->list, &priv->cm.passive_ids); spin_unlock_irqrestore(&priv->lock, flags); queue_delayed_work(ipoib_workqueue, @@ -584,17 +620,54 @@ static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr) int ipoib_cm_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + struct ib_qp_init_attr qp_init_attr = { + .send_cq = priv->cq, + .recv_cq = priv->cq, /* does not matter, we never get anything */ + .srq = priv->cm.srq, /* does not matter, we never get anything */ + .cap.max_send_wr = 1, + .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ + .sq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UC, + }; int ret; if (!IPOIB_CM_SUPPORTED(dev->dev_addr)) return 0; + priv->cm.rx_drain_qp = ib_create_qp(priv->pd, &qp_init_attr); + if (IS_ERR(priv->cm.rx_drain_qp)) { + printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); + ret = PTR_ERR(priv->cm.rx_drain_qp); + return ret; + } + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.port_num = priv->port; + qp_attr.qkey = 0; + qp_attr.qp_access_flags = 0; + ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr, + IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PORT | IB_QP_QKEY); + if (ret) { + ipoib_warn(priv, "failed to modify drain QP to INIT: %d\n", ret); + goto err_qp; + } + + /* We put the QP in error state directly: this way, hardware + * will immediately generate WC for each WR we post, without + * sending anything on the wire. 
*/ + qp_attr.qp_state = IB_QPS_ERR; + ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr, IB_QP_STATE); + if (ret) { + ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret); + goto err_qp; + } + priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev); if (IS_ERR(priv->cm.id)) { printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); ret = PTR_ERR(priv->cm.id); - priv->cm.id = NULL; - return ret; + goto err_cm; } ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num), @@ -602,35 +675,77 @@ int ipoib_cm_dev_open(struct net_device *dev) if (ret) { printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name, IPOIB_CM_IETF_ID | priv->qp->qp_num); - ib_destroy_cm_id(priv->cm.id); - priv->cm.id = NULL; - return ret; + goto err_listen; } + return 0; + +err_listen: + ib_destroy_cm_id(priv->cm.id); +err_cm: + priv->cm.id = NULL; +err_qp: + ib_destroy_qp(priv->cm.rx_drain_qp); + return ret; } void ipoib_cm_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_cm_rx *p; + struct ipoib_cm_rx *p, *n; + unsigned long begin; + LIST_HEAD(list); + int ret; if (!IPOIB_CM_SUPPORTED(dev->dev_addr) || !priv->cm.id) return; ib_destroy_cm_id(priv->cm.id); priv->cm.id = NULL; + spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); - list_del_init(&p->list); + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; spin_unlock_irq(&priv->lock); + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + spin_lock_irq(&priv->lock); + } + + /* Wait for all RX to be drained */ + begin = jiffies; + + while (!list_empty(&priv->cm.rx_error_list) || + !list_empty(&priv->cm.rx_flush_list) || + !list_empty(&priv->cm.rx_drain_list)) { + if (time_after(jiffies, begin + 5 * HZ)) { + ipoib_warn(priv, "RX
drain timing out\n"); + + /* + * assume the HW is wedged and just free up everything. + */ + list_splice_init(&priv->cm.rx_flush_list, &list); + list_splice_init(&priv->cm.rx_error_list, &list); + list_splice_init(&priv->cm.rx_drain_list, &list); + break; + } + spin_unlock_irq(&priv->lock); + msleep(1); + spin_lock_irq(&priv->lock); + } + + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(p, n, &list, list) { ib_destroy_cm_id(p->id); ib_destroy_qp(p->qp); kfree(p); - spin_lock_irq(&priv->lock); } - spin_unlock_irq(&priv->lock); + ib_destroy_qp(priv->cm.rx_drain_qp); cancel_delayed_work(&priv->cm.stale_task); } @@ -1080,24 +1195,45 @@ void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb, queue_work(ipoib_workqueue, &priv->cm.skb_task); } +static void ipoib_cm_rx_drain_task(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, + cm.rx_drain_task); + struct ipoib_cm_rx *p, *n; + LIST_HEAD(list); + + spin_lock_irq(&priv->lock); + list_splice_init(&priv->cm.rx_drain_list, &list); + ipoib_cm_start_rx_drain(priv); + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(p, n, &list, list) { + ib_destroy_cm_id(p->id); + ib_destroy_qp(p->qp); + kfree(p); + } +} + static void ipoib_cm_stale_task(struct work_struct *work) { struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, cm.stale_task.work); struct ipoib_cm_rx *p; + int ret; spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { - /* List if sorted by LRU, start from tail, + /* List is sorted by LRU, start from tail, * stop when we see a recently used entry */ p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; - list_del_init(&p->list); + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; spin_unlock_irq(&priv->lock); - ib_destroy_cm_id(p->id); - ib_destroy_qp(p->qp); - kfree(p); + ret = 
ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); spin_lock_irq(&priv->lock); } spin_unlock_irq(&priv->lock); @@ -1161,9 +1297,13 @@ int ipoib_cm_dev_init(struct net_device *dev) INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); INIT_LIST_HEAD(&priv->cm.start_list); + INIT_LIST_HEAD(&priv->cm.rx_error_list); + INIT_LIST_HEAD(&priv->cm.rx_flush_list); + INIT_LIST_HEAD(&priv->cm.rx_drain_list); INIT_WORK(&priv->cm.start_task, ipoib_cm_tx_start); INIT_WORK(&priv->cm.reap_task, ipoib_cm_tx_reap); INIT_WORK(&priv->cm.skb_task, ipoib_cm_skb_reap); + INIT_WORK(&priv->cm.rx_drain_task, ipoib_cm_rx_drain_task); INIT_DELAYED_WORK(&priv->cm.stale_task, ipoib_cm_stale_task); skb_queue_head_init(&priv->cm.skb_queue); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 5c3c6a4..af8a6d4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -185,7 +185,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) size = ipoib_sendq_size + ipoib_recvq_size + 1; ret = ipoib_cm_dev_init(dev); if (!ret) - size += ipoib_recvq_size; + size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */; priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { -- MST From halr at voltaire.com Wed May 16 04:12:15 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 16 May 2007 07:12:15 -0400 Subject: [ofa-general] Re: suggested patch for partition membership definitiion in osm-partitions.conf (fix) In-Reply-To: <4649D289.3070301@cea.fr> References: <46487FBF.7020300@cea.fr> <1179157835.1540.183713.camel@hal.voltaire.com> <4649D289.3070301@cea.fr> Message-ID: <1179313934.4531.155801.camel@hal.voltaire.com> On Tue, 2007-05-15 at 11:32, Philippe Gregoire wrote: > Here are the patches as you asked. 
> I changed the code to use strncmp as suggested by Sasha. Thanks! Applied (to master only). -- Hal > Philippe From keshetti.mahesh at gmail.com Wed May 16 05:00:07 2007 From: keshetti.mahesh at gmail.com (Keshetti Mahesh) Date: Wed, 16 May 2007 17:30:07 +0530 Subject: [ofa-general] problem with loading IB modules in a IB node with OFED. Message-ID: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com> I am facing a problem while loading any IB module (madeye.ko) on an IB node with OFED-1.0 installed. While loading the module, lots of "disagrees about symbol version" errors appear, whereas the same module loads successfully on 2.6.9-42.ELsmp (which contains OFED-1.0??). Has this already been discussed? -- Thanks and regards, Mahesh. From hnguyen at linux.vnet.ibm.com Wed May 16 05:50:55 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Wed, 16 May 2007 14:50:55 +0200 Subject: [ofa-general] [PATCH 2.6.22] ehca: return proper error code if register_mr fails Message-ID: <200705161450.55848.hnguyen@linux.vnet.ibm.com> This patch sets the return code of ehca_register_mr() to ENOMEM if the corresponding firmware call fails because it is out of resources. Some of the error codes were explicitly mapped to EINVAL. They now fall through to the default case, which already returns EINVAL anyway.
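The effect of the remapping can be shown with a small standalone sketch. The FW_* codes below are placeholders invented for this example (their numeric values are not the real hcall return codes), and map_alloc_rc() is a hypothetical stand-in for the driver's mapping function; only the grouping into ENOMEM, EBUSY and a default of EINVAL follows the patch description.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder firmware return codes; the values are NOT the real hcall ones. */
enum fw_rc {
	FW_SUCCESS,
	FW_NOT_ENOUGH_RESOURCES,
	FW_CONSTRAINED,
	FW_NO_MEM,
	FW_BUSY,
	FW_ADAPTER_PARM,
	FW_MLENGTH_PARM
};

/* Map a firmware allocation result to a negative errno value. */
static int map_alloc_rc(enum fw_rc rc)
{
	switch (rc) {
	case FW_SUCCESS:
		return 0;
	case FW_NOT_ENOUGH_RESOURCES:	/* all "out of resources" cases */
	case FW_CONSTRAINED:
	case FW_NO_MEM:
		return -ENOMEM;
	case FW_BUSY:
		return -EBUSY;
	default:			/* parameter errors and anything unknown */
		return -EINVAL;
	}
}

/* Small self-check showing the intended mapping. */
static void map_alloc_rc_selfcheck(void)
{
	assert(map_alloc_rc(FW_NO_MEM) == -ENOMEM);
	assert(map_alloc_rc(FW_MLENGTH_PARM) == -EINVAL);
}
```

With this split a caller can tell resource exhaustion, where releasing unused MRs and retrying may help, apart from a plain invalid argument.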
Signed-off-by: Hoang-Nam Nguyen --- ehca_mrmw.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 84c5bb4..add79bd 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -2050,13 +2050,10 @@ int ehca_mrmw_map_hrc_alloc(const u64 hi switch (hipz_rc) { case H_SUCCESS: /* successful completion */ return 0; - case H_ADAPTER_PARM: /* invalid adapter handle */ - case H_RT_PARM: /* invalid resource type */ case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */ - case H_MLENGTH_PARM: /* invalid memory length */ - case H_MEM_ACCESS_PARM: /* invalid access controls */ case H_CONSTRAINED: /* resource constraint */ - return -EINVAL; + case H_NO_MEM: + return -ENOMEM; case H_BUSY: /* long busy */ return -EBUSY; default: From hnguyen at linux.vnet.ibm.com Wed May 16 06:05:08 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Wed, 16 May 2007 15:05:08 +0200 Subject: [ofa-general] [PATCH ofed-1.2] ehca (kernel space): return proper error code if register_mr fails Message-ID: <200705161505.09214.hnguyen@linux.vnet.ibm.com> Hello Tziporet! Please accept below patch for ofed-1.2, because it fixes a mr resources limitation problem reported by Troy and Kyle on this mailing list. Only with this patch their application is able to release no longer used mrs properly. Thanks! Regards Nam This patch sets the return code of ehca_register_mr() to ENOMEM if corresponding firmware call fails due to out of resources. Some of error codes were mapped to EINVAL. They are now mapped to default case, which already returns EINVAL anyway. 
Signed-off-by: Hoang-Nam Nguyen --- ehca_mrmw.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index cfb362a..b3bbe3b 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -2045,13 +2045,10 @@ int ehca_mrmw_map_hrc_alloc(const u64 hi switch (hipz_rc) { case H_SUCCESS: /* successful completion */ return 0; - case H_ADAPTER_PARM: /* invalid adapter handle */ - case H_RT_PARM: /* invalid resource type */ case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */ - case H_MLENGTH_PARM: /* invalid memory length */ - case H_MEM_ACCESS_PARM: /* invalid access controls */ case H_CONSTRAINED: /* resource constraint */ - return -EINVAL; + case H_NO_MEM: + return -ENOMEM; case H_BUSY: /* long busy */ return -EBUSY; default: From devesh28 at gmail.com Wed May 16 06:13:50 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Wed, 16 May 2007 18:43:50 +0530 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com> Message-ID: <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com> On 5/14/07, Sean Hefty wrote: > >This can be treated as a facility similar to what we have in ARP table > >for TCP/IP. Secondly this will help in debugging of some new up-coming > >partially infiniband complaint hardware. > > But unless such a path actually exists to the remote node, I don't see that it's > useful. And if such a path exists, I would expect it to be returned by the SA. But initially this will generate a packet for each path, while sys admin knows that path is there and he can hard-code the entries for it. Other thing is that why Admin will care about creating such record while SA is itself taking care, right? 
> Can you clarify its use more wrt the subnet in general? Again the same point: in most cases the administrator knows that a path exists between Node A and Node B, so why introduce more delay in bringing the stack up by sending extra packets (generated by sa_cache_module)? If something changes at a later stage, it may take only a few packets to update the cache. Another point I want to understand: when will the local_sa_cache module be inserted, after the SM comes up or before? I think it is after the SM is up, which means extra effort for the admin: he has to wait for the SM to come up and then insert the sa_cache module. If it is inserted before the SM comes up (I am assuming the SM is running on some node, not on a switch), then the first forced schedule_update() is wasted, and for the first application the presence of the cache is meaningless. Why not keep the cache effective right from the start? CMIIW > > >yes, I want them to remain in the DB, my idea is similar to the hard > >coding of ARP table entries in TCP/IP. > >How do you see this can be achieved? > > A simple flag or setting the update counter on the added path to the maximum > should be sufficient. > > - Sean > From tziporet at dev.mellanox.co.il Wed May 16 06:43:06 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 16 May 2007 16:43:06 +0300 Subject: [ofa-general] Re: [PATCH ofed-1.2] ehca (kernel space): return proper error code if register_mr fails In-Reply-To: <200705161505.09214.hnguyen@linux.vnet.ibm.com> References: <200705161505.09214.hnguyen@linux.vnet.ibm.com> Message-ID: <464B0A6A.6090303@mellanox.co.il> Hoang-Nam Nguyen wrote: > Hello Tziporet! > Please accept below patch for ofed-1.2, because it fixes a mr resources > limitation problem reported by Troy and Kyle on this mailing list. Only > with this patch their application is able to release no longer used mrs > properly. > Thanks!
> Regards > Nam > approved Tziporet From changquing.tang at hp.com Wed May 16 07:18:59 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Wed, 16 May 2007 14:18:59 -0000 Subject: [ofa-general] OFED HA related question In-Reply-To: References: <1179145102.25749.11.camel@mtls03> <1179242042.25749.33.camel@mtls03> Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net> Roland: Suppose I get an IBV_EVENT_DEVICE_FATAL async event from the first HCA on my node. Can I continue to call ibv_poll_cq() to get back all the work requests I posted before, or do I need to keep track of these work requests myself? I am afraid ibv_poll_cq() itself will return an error. Also, can I call ibv_dereg_mr() to free the memory I registered with this HCA? If I continue to use the second HCA, does the failure of the first HCA affect the operation of the second HCA (from the driver's point of view)? Thanks --CQ From vlad at mellanox.co.il Wed May 16 07:33:14 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 16 May 2007 17:33:14 +0300 Subject: [ofa-general] Re: [PATCH ofed-1.2] ehca (kernel space): return proper error code if register_mr fails In-Reply-To: <200705161505.09214.hnguyen@linux.vnet.ibm.com> References: <200705161505.09214.hnguyen@linux.vnet.ibm.com> Message-ID: <1179325994.7463.56.camel@vladsk-laptop> On Wed, 2007-05-16 at 15:05 +0200, Hoang-Nam Nguyen wrote: > Hello Tziporet! > Please accept below patch for ofed-1.2, because it fixes a mr resources > limitation problem reported by Troy and Kyle on this mailing list. Only > with this patch their application is able to release no longer used mrs > properly. > Thanks! > Regards > Nam > > > > This patch sets the return code of ehca_register_mr() to ENOMEM > if corresponding firmware call fails due to out of resources. > Some of error codes were mapped to EINVAL. They are now mapped > to default case, which already returns EINVAL anyway.
> > > Signed-off-by: Hoang-Nam Nguyen Added: kernel_patches/fixes/ehca_8_fix_mr_resources_limitation.patch -- Vladimir Sokolovsky Mellanox Technologies Ltd. From bs at q-leap.de Wed May 16 07:41:18 2007 From: bs at q-leap.de (Bernd Schubert) Date: Wed, 16 May 2007 16:41:18 +0200 Subject: [ofa-general] possible irq lock inversion dependency detected In-Reply-To: References: <200705151148.50607.bs@q-leap.de> Message-ID: <200705161641.18749.bs@q-leap.de> On Tuesday 15 May 2007 20:58:54 Roland Dreier wrote: > Thanks for the report... looks like a real bug. > > Can you check whether this patch makes the lockdep warnings go away? The warnings usually appeared after a few hours of uptime, even though the system was idle. After applying your patch and a couple of hours of uptime there have been no warnings so far, so I guess your patch fixed it. If you don't hear anything from me by Thursday, it definitely fixed it. Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH From kliteyn at dev.mellanox.co.il Wed May 16 08:03:29 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 16 May 2007 18:03:29 +0300 Subject: [ofa-general] [PATCH] osm: up/dn optimization - improved ranking Message-ID: <464B1D41.8080905@dev.mellanox.co.il> Hi Hal, This patch optimizes fabric ranking in a way similar to the fat-tree ranking. All the root switches are marked with rank 0 and added to the BFS list, and only then does ranking of the rest of the fabric begin. Please apply to master.
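The idea of seeding the BFS with all roots before ranking anything else can be sketched in standalone C. Everything here is an assumption made for illustration: struct fabric, the adjacency matrix, MAX_SW and rank_fabric() are hypothetical and much simpler than OpenSM's switch and port structures, and the sketch ranks each switch once instead of using an update-if-lower helper.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SW   16		/* illustrative limit on fabric size */
#define UNRANKED 0xffffffffu

struct fabric {
	size_t nsw;				/* number of switches */
	unsigned char link[MAX_SW][MAX_SW];	/* nonzero: switches are connected */
	unsigned rank[MAX_SW];
};

/*
 * Rank every switch by its BFS distance from the nearest root.
 * Returns the maximum rank assigned, or UNRANKED if no root was valid.
 */
static unsigned rank_fabric(struct fabric *f, const size_t *roots, size_t nroots)
{
	size_t queue[MAX_SW];
	size_t head = 0, tail = 0, i;
	unsigned max_rank = UNRANKED;

	assert(f->nsw <= MAX_SW);
	for (i = 0; i < f->nsw; i++)
		f->rank[i] = UNRANKED;

	/* Phase 1: give every root rank 0 and queue it before any BFS step. */
	for (i = 0; i < nroots; i++) {
		size_t r = roots[i];

		if (r >= f->nsw || f->rank[r] != UNRANKED)
			continue;	/* skip illegal or duplicate roots */
		f->rank[r] = 0;
		queue[tail++] = r;
		max_rank = 0;
	}

	/* Phase 2: plain BFS; a neighbour gets the dequeued switch's rank + 1. */
	while (head < tail) {
		size_t u = queue[head++];
		size_t v;

		for (v = 0; v < f->nsw; v++) {
			if (!f->link[u][v] || f->rank[v] != UNRANKED)
				continue;
			f->rank[v] = f->rank[u] + 1;
			if (f->rank[v] > max_rank)
				max_rank = f->rank[v];
			queue[tail++] = v;	/* each switch is queued at most once */
		}
	}
	return max_rank;
}
```

Because all roots sit at the head of the queue with rank 0, one pass visits each switch once, whereas running a separate BFS per root re-walks the fabric for every root. The sketch also tracks the maximum rank explicitly rather than reading it from the last switch visited.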
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_ucast_updn.c | 66 +++++++++++++++++---------------------- 1 files changed, 29 insertions(+), 37 deletions(-) diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c index 5cebd9b..9574216 100644 --- a/opensm/opensm/osm_ucast_updn.c +++ b/opensm/opensm/osm_ucast_updn.c @@ -408,53 +408,49 @@ Exit : /* rank is a SWITCH for BFS purpose */ static int updn_subn_rank( - IN uint64_t root_guid, - IN uint8_t base_rank, + IN uint32_t num_guids, + IN uint64_t* guid_list, IN updn_t* p_updn ) { osm_switch_t *p_sw; - uint32_t rank = base_rank; osm_physp_t *p_physp, *p_remote_physp; cl_qlist_t list; cl_status_t did_cause_update; struct updn_node *u, *remote_u; uint8_t num_ports, port_num; osm_log_t *p_log = &p_updn->p_osm->log; + uint32_t idx = 0; OSM_LOG_ENTER( p_log, updn_subn_rank ); + cl_qlist_init(&list); - p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, cl_hton64(root_guid)); - if(!p_sw) - { - osm_log( p_log, OSM_LOG_ERROR, - "updn_subn_rank: ERR AA05: " - "Root switch GUID 0x%" PRIx64 " not found\n", root_guid ); - OSM_LOG_EXIT( p_log ); - return 1; - } - - osm_log( p_log, OSM_LOG_VERBOSE, - "updn_subn_rank: " - "Ranking starts from GUID 0x%" PRIx64 "\n", root_guid ); - - u = p_sw->priv; - u->is_root = 1; + /* Rank all the roots and add them to list */ - /* Rank the first guid chosen anyway since it's the base rank */ - osm_log( p_log, OSM_LOG_DEBUG, - "updn_subn_rank: " - "Ranking port GUID 0x%" PRIx64 "\n", root_guid ); + for (idx = 0; idx < num_guids; idx++) + { + /* Apply the ranking for each guid given by user - bypass illegal ones */ + p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, cl_hton64(guid_list[idx])); + if(!p_sw) + { + osm_log( p_log, OSM_LOG_ERROR, + "updn_subn_rank: ERR AA05: " + "Root switch GUID 0x%" PRIx64 " not found\n", guid_list[idx] ); + continue; + } - __updn_update_rank(u, rank); + u = p_sw->priv; + u->is_root = 1; - cl_qlist_init(&list); - 
cl_qlist_insert_tail(&list, &u->list); + osm_log( p_log, OSM_LOG_DEBUG, + "updn_subn_rank: " + "Ranking root port GUID 0x%" PRIx64 "\n", guid_list[idx] ); + __updn_update_rank(u, 0); + cl_qlist_insert_tail(&list, &u->list); + } /* BFS the list till it's empty */ while (!cl_is_qlist_empty(&list)) { - rank++; - u = (struct updn_node *)cl_qlist_remove_head(&list); /* Go over all remote nodes and rank them (if not already visited) */ p_sw = u->sw; @@ -483,7 +479,7 @@ updn_subn_rank( { remote_u = p_remote_physp->p_node->sw->priv; port_guid = p_remote_physp->port_guid; - did_cause_update = __updn_update_rank(remote_u, rank); + did_cause_update = __updn_update_rank(remote_u, u->rank+1); osm_log( p_log, OSM_LOG_DEBUG, "updn_subn_rank: " @@ -500,8 +496,8 @@ updn_subn_rank( /* Print Summary of ranking */ osm_log( p_log, OSM_LOG_VERBOSE, "updn_subn_rank: " - "Rank Info :\n\t Root Guid = 0x%" PRIx64 "\n\t Max Node Rank = %d\n", - root_guid, rank ); + "Subnet ranking completed. Max Node Rank = %d\n", + remote_u->rank ); OSM_LOG_EXIT( p_log ); return 0; } @@ -566,7 +562,6 @@ __osm_subn_calc_up_down_min_hop_table( IN uint64_t* guid_list, IN updn_t* p_updn ) { - uint32_t idx = 0; int status; OSM_LOG_ENTER( &p_updn->p_osm->log, osm_subn_calc_up_down_min_hop_table ); @@ -593,11 +588,8 @@ __osm_subn_calc_up_down_min_hop_table( goto _exit; } - for (idx = 0; idx < num_guids; idx++) - { - /* Apply the ranking for each guid given by user - bypass illegal ones */ - updn_subn_rank(guid_list[idx], 0, p_updn); - } + /* Rank the subnet switches */ + updn_subn_rank(num_guids, guid_list, p_updn); /* After multiple ranking need to set Min Hop Table by UpDn algorithm */ osm_log( &p_updn->p_osm->log, OSM_LOG_VERBOSE, -- 1.5.1.4 From dledford at redhat.com Wed May 16 09:13:06 2007 From: dledford at redhat.com (Doug Ledford) Date: Wed, 16 May 2007 12:13:06 -0400 Subject: [ofa-general] Re: [ewg] Re: Build problem with RHEL-4.5 and OFED-1.2 In-Reply-To: <200705141531.26635.ossrosch@linux.vnet.ibm.com> 
References: <200705091824.54394.ossrosch@linux.vnet.ibm.com> <1178737535.2848.152.camel@fc6.xsintricity.com> <200705092357.59973.ossrosch@linux.vnet.ibm.com> <200705141531.26635.ossrosch@linux.vnet.ibm.com> Message-ID: <464B2D92.3010400@redhat.com> Stefan Roscher wrote: > He Doug, (Sorry for the late reply, I'm out of town at the moment) > are there any news for this problem? Is it a problem of the OFED-build or a > problem with redhat? > Should I open a bugzilla to track this? Well, yes, there should be a bugzilla. What exactly the bug is depends on the kernel RPM maintainer's intended wishes for the kernel-devel package. If he wanted the ppc kernel-devel to be self sufficient, then he should have included all asm-ppc64 header files that were included by asm-ppc header files and not having them in the kernel-devel.ppc package is the bug. However, doing things this way probably precludes installing both the kernel-devel.ppc and kernel-devel.ppc64 packages at the same time. If, on the other hand, he wanted the ppc devel package to be pure and not include ppc64 header files, then he needs to make sure that A) both the kernel-devel.ppc and kernel-devel.ppc64 rpms can be installed at the same time and B) that the kernel-devel.ppc RPM has a Requires: kernel-devel.ppc64 item in there to avoid this dangling header include problem that currently exists. > Regards Stefan > On Wednesday 09 May 2007 23:57, Stefan Roscher wrote: >> On Wednesday 09 May 2007 21:05, Doug Ledford wrote: >>> On Wed, 2007-05-09 at 18:24 +0200, Stefan Roscher wrote: >>>> Hi Doug, >>>> >>>> I installed RHEL-4.5 on one of our ppc64 systems and recognized that asm-ppc >>>> directory is missing in /usr/src/kernels/2.6.9-55.EL/include. >>>> Normally I don't need this directory, but ibmebus.h includes >>>> asm-ppc64/of_device.h. And there asm-ppc64/of_device.h includes >>>> asm-ppc/of_device.h. Because this file is missing I can not build >>>> ehca and ofed stack with ofed-1.2 daily build from today. 
>>>> >>>> Did I make something wrong during installation? >>>> >>>> Regards Stefan Roscher >>> I'll look into it, but in the meantime, install the kernel src.rpm, go >>> into /usr/src/redhat/SPEC and run rpmbuild --bp kernel-2.6.spec and it >>> should create a complete source tree >>> in /usr/src/redhat/BUILD/kernel-2.6.18 that you can then get the asm-ppc >>> directory contents out of. >>> >>> -- >>> Doug Ledford >>> GPG KeyID: CFBFF194 >>> http://people.redhat.com/dledford >>> >>> Infiniband specific RPMs available at >>> http://people.redhat.com/dledford/Infiniband >>> >> To create the backportpatches for rhel4.5 I did it like you say, but the >> buildscripts of ofed dont uses the kernelsources in >> /usr/src/redhat/BUILD. OFED-1.2 use the source link within >> /lib/modules/kernel-x.x.x and this points into /usr/src/kernel this >> kernelsources were created during installation of rhel-4.5. In this kernel >> source the directory include/asm-ppc is missing. >> This is the reason why I found this problem not during creation of the >> backport patches. 
>> >> regards stefan >> >> _______________________________________________ >> ewg mailing list >> ewg at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg >> -- Doug Ledford http://people.redhat.com/dledford Infiniband specific RPMs can be found at http://people.redhat.com/dledford/Infiniband From halr at voltaire.com Wed May 16 09:38:06 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 16 May 2007 12:38:06 -0400 Subject: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h Message-ID: <1179333484.4531.176519.camel@hal.voltaire.com> OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h Signed-off-by: Hal Rosenstock diff --git a/opensm/include/iba/ib_cm_types.h b/opensm/include/iba/ib_cm_types.h new file mode 100644 index 0000000..f4fb139 --- /dev/null +++ b/opensm/include/iba/ib_cm_types.h @@ -0,0 +1,210 @@ +/* + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#if !defined(__IB_CM_TYPES_H__) +#define __IB_CM_TYPES_H__ + +#ifndef WIN32 + +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +/* + * Defines known Communication management class versions + */ +#define IB_MCLASS_CM_VER_2 2 +#define IB_MCLASS_CM_VER_1 1 + +/* + * Defines the size of user available data in communication management MADs + */ +#define IB_REQ_PDATA_SIZE_VER2 92 +#define IB_MRA_PDATA_SIZE_VER2 222 +#define IB_REJ_PDATA_SIZE_VER2 148 +#define IB_REP_PDATA_SIZE_VER2 196 +#define IB_RTU_PDATA_SIZE_VER2 224 +#define IB_LAP_PDATA_SIZE_VER2 168 +#define IB_APR_PDATA_SIZE_VER2 148 +#define IB_DREQ_PDATA_SIZE_VER2 220 +#define IB_DREP_PDATA_SIZE_VER2 224 +#define IB_SIDR_REQ_PDATA_SIZE_VER2 216 +#define IB_SIDR_REP_PDATA_SIZE_VER2 136 + +#define IB_REQ_PDATA_SIZE_VER1 92 +#define IB_MRA_PDATA_SIZE_VER1 222 +#define IB_REJ_PDATA_SIZE_VER1 148 +#define IB_REP_PDATA_SIZE_VER1 204 +#define IB_RTU_PDATA_SIZE_VER1 224 +#define IB_LAP_PDATA_SIZE_VER1 168 +#define IB_APR_PDATA_SIZE_VER1 151 +#define IB_DREQ_PDATA_SIZE_VER1 220 +#define IB_DREP_PDATA_SIZE_VER1 224 +#define IB_SIDR_REQ_PDATA_SIZE_VER1 216 +#define IB_SIDR_REP_PDATA_SIZE_VER1 140 + +#define IB_ARI_SIZE 72 // redefine +#define IB_APR_INFO_SIZE 72 + +/****d* Access Layer/ib_rej_status_t +* NAME +* ib_rej_status_t +* +* DESCRIPTION +* Rejection 
reasons. +* +* SYNOPSIS +*/ +typedef ib_net16_t ib_rej_status_t; +/* +* SEE ALSO +* ib_cm_rej, ib_cm_rej_rec_t +* +* SOURCE +*/ +#define IB_REJ_INSUF_QP CL_HTON16(1) +#define IB_REJ_INSUF_EEC CL_HTON16(2) +#define IB_REJ_INSUF_RESOURCES CL_HTON16(3) +#define IB_REJ_TIMEOUT CL_HTON16(4) +#define IB_REJ_UNSUPPORTED CL_HTON16(5) +#define IB_REJ_INVALID_COMM_ID CL_HTON16(6) +#define IB_REJ_INVALID_COMM_INSTANCE CL_HTON16(7) +#define IB_REJ_INVALID_SID CL_HTON16(8) +#define IB_REJ_INVALID_XPORT CL_HTON16(9) +#define IB_REJ_STALE_CONN CL_HTON16(10) +#define IB_REJ_RDC_NOT_EXIST CL_HTON16(11) +#define IB_REJ_INVALID_GID CL_HTON16(12) +#define IB_REJ_INVALID_LID CL_HTON16(13) +#define IB_REJ_INVALID_SL CL_HTON16(14) +#define IB_REJ_INVALID_TRAFFIC_CLASS CL_HTON16(15) +#define IB_REJ_INVALID_HOP_LIMIT CL_HTON16(16) +#define IB_REJ_INVALID_PKT_RATE CL_HTON16(17) +#define IB_REJ_INVALID_ALT_GID CL_HTON16(18) +#define IB_REJ_INVALID_ALT_LID CL_HTON16(19) +#define IB_REJ_INVALID_ALT_SL CL_HTON16(20) +#define IB_REJ_INVALID_ALT_TRAFFIC_CLASS CL_HTON16(21) +#define IB_REJ_INVALID_ALT_HOP_LIMIT CL_HTON16(22) +#define IB_REJ_INVALID_ALT_PKT_RATE CL_HTON16(23) +#define IB_REJ_PORT_REDIRECT CL_HTON16(24) +#define IB_REJ_INVALID_MTU CL_HTON16(26) +#define IB_REJ_INSUFFICIENT_RESP_RES CL_HTON16(27) +#define IB_REJ_USER_DEFINED CL_HTON16(28) +#define IB_REJ_INVALID_RNR_RETRY CL_HTON16(29) +#define IB_REJ_DUPLICATE_LOCAL_COMM_ID CL_HTON16(30) +#define IB_REJ_INVALID_CLASS_VER CL_HTON16(31) +#define IB_REJ_INVALID_FLOW_LBL CL_HTON16(32) +#define IB_REJ_INVALID_ALT_FLOW_LBL CL_HTON16(33) + +#define IB_REJ_SERVICE_HANDOFF CL_HTON16(65535) +/******/ + +/****d* Access Layer/ib_apr_status_t +* NAME +* ib_apr_status_t +* +* DESCRIPTION +* Automatic path migration status information. 
+* +* SYNOPSIS +*/ +typedef uint8_t ib_apr_status_t; +/* +* SEE ALSO +* ib_cm_apr, ib_cm_apr_rec_t +* +* SOURCE + */ +#define IB_AP_SUCCESS 0 +#define IB_AP_INVALID_COMM_ID 1 +#define IB_AP_UNSUPPORTED 2 +#define IB_AP_REJECT 3 +#define IB_AP_REDIRECT 4 +#define IB_AP_IS_CURRENT 5 +#define IB_AP_INVALID_QPN_EECN 6 +#define IB_AP_INVALID_LID 7 +#define IB_AP_INVALID_GID 8 +#define IB_AP_INVALID_FLOW_LBL 9 +#define IB_AP_INVALID_TCLASS 10 +#define IB_AP_INVALID_HOP_LIMIT 11 +#define IB_AP_INVALID_PKT_RATE 12 +#define IB_AP_INVALID_SL 13 +/******/ + +/****d* Access Layer/ib_cm_cap_mask_t +* NAME +* ib_cm_cap_mask_t +* +* DESCRIPTION +* Capability mask values in ClassPortInfo. +* +* SYNOPSIS +*/ +#define IB_CM_RELIABLE_CONN_CAPABLE CL_HTON16(9) +#define IB_CM_RELIABLE_DGRM_CAPABLE CL_HTON16(10) +#define IB_CM_RDGRM_CAPABLE CL_HTON16(11) +#define IB_CM_UNRELIABLE_CONN_CAPABLE CL_HTON16(12) +#define IB_CM_SIDR_CAPABLE CL_HTON16(13) +/* +* SEE ALSO +* ib_cm_rep, ib_class_port_info_t +* +* SOURCE +* +*******/ + +/* + * Service ID resolution status + */ +typedef uint16_t ib_sidr_status_t; +#define IB_SIDR_SUCCESS 0 +#define IB_SIDR_UNSUPPORTED 1 +#define IB_SIDR_REJECT 2 +#define IB_SIDR_NO_QP 3 +#define IB_SIDR_REDIRECT 4 +#define IB_SIDR_UNSUPPORTED_VER 5 + +END_C_DECLS + +#endif /* ndef WIN32 */ + +#endif /* __IB_CM_TYPES_H__ */ From halr at voltaire.com Wed May 16 09:38:18 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 16 May 2007 12:38:18 -0400 Subject: [ofa-general] [PATCH 2/2] OpenSM/ib_types.h: Remove CM defines Message-ID: <1179333485.4531.176520.camel@hal.voltaire.com> OpenSM/ib_types.h: Remove CM definitions as now in ib_cm_types.h Signed-off-by: Hal Rosenstock diff --git a/opensm/include/Makefile.am b/opensm/include/Makefile.am index 8499d3b..3428d9a 100644 --- a/opensm/include/Makefile.am +++ b/opensm/include/Makefile.am @@ -1,7 +1,7 @@ SUBDIRS = . 
-nobase_pkginclude_HEADERS = iba/ib_types.h +nobase_pkginclude_HEADERS = iba/ib_types.h iba/ib_cm_types.h EXTRA_DIST = \ $(srcdir)/opensm/osm_version.h \ @@ -128,6 +128,7 @@ EXTRA_DIST = \ $(srcdir)/complib/cl_fleximap.h \ $(srcdir)/complib/cl_qcomppool.h \ $(srcdir)/iba/ib_types.h \ + $(srcdir)/iba/ib_cm_types.h \ $(srcdir)/vendor/osm_vendor_mlx_transport_anafa.h \ $(srcdir)/vendor/osm_vendor_mlx.h \ $(srcdir)/vendor/osm_vendor_mlx_sender.h \ diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index b3937cb..aee7024 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -31,7 +31,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id$ */ #if !defined(__IB_TYPES_H__) @@ -7773,163 +7772,8 @@ typedef struct _ib_ioc_info #include /* - * Defines known Communication management class versions - */ -#define IB_MCLASS_CM_VER_2 2 -#define IB_MCLASS_CM_VER_1 1 - -/* - * Defines the size of user available data in communication management MADs - */ -#define IB_REQ_PDATA_SIZE_VER2 92 -#define IB_MRA_PDATA_SIZE_VER2 222 -#define IB_REJ_PDATA_SIZE_VER2 148 -#define IB_REP_PDATA_SIZE_VER2 196 -#define IB_RTU_PDATA_SIZE_VER2 224 -#define IB_LAP_PDATA_SIZE_VER2 168 -#define IB_APR_PDATA_SIZE_VER2 148 -#define IB_DREQ_PDATA_SIZE_VER2 220 -#define IB_DREP_PDATA_SIZE_VER2 224 -#define IB_SIDR_REQ_PDATA_SIZE_VER2 216 -#define IB_SIDR_REP_PDATA_SIZE_VER2 136 - -#define IB_REQ_PDATA_SIZE_VER1 92 -#define IB_MRA_PDATA_SIZE_VER1 222 -#define IB_REJ_PDATA_SIZE_VER1 148 -#define IB_REP_PDATA_SIZE_VER1 204 -#define IB_RTU_PDATA_SIZE_VER1 224 -#define IB_LAP_PDATA_SIZE_VER1 168 -#define IB_APR_PDATA_SIZE_VER1 151 -#define IB_DREQ_PDATA_SIZE_VER1 220 -#define IB_DREP_PDATA_SIZE_VER1 224 -#define IB_SIDR_REQ_PDATA_SIZE_VER1 216 -#define IB_SIDR_REP_PDATA_SIZE_VER1 140 - -#define IB_ARI_SIZE 72 // redefine -#define IB_APR_INFO_SIZE 72 - -/****d* Access Layer/ib_rej_status_t -* NAME -* ib_rej_status_t -* -* 
DESCRIPTION -* Rejection reasons. -* -* SYNOPSIS -*/ -typedef ib_net16_t ib_rej_status_t; -/* -* SEE ALSO -* ib_cm_rej, ib_cm_rej_rec_t -* -* SOURCE -*/ -#define IB_REJ_INSUF_QP CL_HTON16(1) -#define IB_REJ_INSUF_EEC CL_HTON16(2) -#define IB_REJ_INSUF_RESOURCES CL_HTON16(3) -#define IB_REJ_TIMEOUT CL_HTON16(4) -#define IB_REJ_UNSUPPORTED CL_HTON16(5) -#define IB_REJ_INVALID_COMM_ID CL_HTON16(6) -#define IB_REJ_INVALID_COMM_INSTANCE CL_HTON16(7) -#define IB_REJ_INVALID_SID CL_HTON16(8) -#define IB_REJ_INVALID_XPORT CL_HTON16(9) -#define IB_REJ_STALE_CONN CL_HTON16(10) -#define IB_REJ_RDC_NOT_EXIST CL_HTON16(11) -#define IB_REJ_INVALID_GID CL_HTON16(12) -#define IB_REJ_INVALID_LID CL_HTON16(13) -#define IB_REJ_INVALID_SL CL_HTON16(14) -#define IB_REJ_INVALID_TRAFFIC_CLASS CL_HTON16(15) -#define IB_REJ_INVALID_HOP_LIMIT CL_HTON16(16) -#define IB_REJ_INVALID_PKT_RATE CL_HTON16(17) -#define IB_REJ_INVALID_ALT_GID CL_HTON16(18) -#define IB_REJ_INVALID_ALT_LID CL_HTON16(19) -#define IB_REJ_INVALID_ALT_SL CL_HTON16(20) -#define IB_REJ_INVALID_ALT_TRAFFIC_CLASS CL_HTON16(21) -#define IB_REJ_INVALID_ALT_HOP_LIMIT CL_HTON16(22) -#define IB_REJ_INVALID_ALT_PKT_RATE CL_HTON16(23) -#define IB_REJ_PORT_REDIRECT CL_HTON16(24) -#define IB_REJ_INVALID_MTU CL_HTON16(26) -#define IB_REJ_INSUFFICIENT_RESP_RES CL_HTON16(27) -#define IB_REJ_USER_DEFINED CL_HTON16(28) -#define IB_REJ_INVALID_RNR_RETRY CL_HTON16(29) -#define IB_REJ_DUPLICATE_LOCAL_COMM_ID CL_HTON16(30) -#define IB_REJ_INVALID_CLASS_VER CL_HTON16(31) -#define IB_REJ_INVALID_FLOW_LBL CL_HTON16(32) -#define IB_REJ_INVALID_ALT_FLOW_LBL CL_HTON16(33) - -#define IB_REJ_SERVICE_HANDOFF CL_HTON16(65535) -/******/ - -/****d* Access Layer/ib_apr_status_t -* NAME -* ib_apr_status_t -* -* DESCRIPTION -* Automatic path migration status information. 
-* -* SYNOPSIS -*/ -typedef uint8_t ib_apr_status_t; -/* -* SEE ALSO -* ib_cm_apr, ib_cm_apr_rec_t -* -* SOURCE - */ -#define IB_AP_SUCCESS 0 -#define IB_AP_INVALID_COMM_ID 1 -#define IB_AP_UNSUPPORTED 2 -#define IB_AP_REJECT 3 -#define IB_AP_REDIRECT 4 -#define IB_AP_IS_CURRENT 5 -#define IB_AP_INVALID_QPN_EECN 6 -#define IB_AP_INVALID_LID 7 -#define IB_AP_INVALID_GID 8 -#define IB_AP_INVALID_FLOW_LBL 9 -#define IB_AP_INVALID_TCLASS 10 -#define IB_AP_INVALID_HOP_LIMIT 11 -#define IB_AP_INVALID_PKT_RATE 12 -#define IB_AP_INVALID_SL 13 -/******/ - -/****d* Access Layer/ib_cm_cap_mask_t -* NAME -* ib_cm_cap_mask_t -* -* DESCRIPTION -* Capability mask values in ClassPortInfo. -* -* SYNOPSIS -*/ -#define IB_CM_RELIABLE_CONN_CAPABLE CL_HTON16(9) -#define IB_CM_RELIABLE_DGRM_CAPABLE CL_HTON16(10) -#define IB_CM_RDGRM_CAPABLE CL_HTON16(11) -#define IB_CM_UNRELIABLE_CONN_CAPABLE CL_HTON16(12) -#define IB_CM_SIDR_CAPABLE CL_HTON16(13) -/* -* SEE ALSO -* ib_cm_rep, ib_class_port_info_t -* -* SOURCE -* -*******/ - -/* - * Service ID resolution status - */ -typedef uint16_t ib_sidr_status_t; -#define IB_SIDR_SUCCESS 0 -#define IB_SIDR_UNSUPPORTED 1 -#define IB_SIDR_REJECT 2 -#define IB_SIDR_NO_QP 3 -#define IB_SIDR_REDIRECT 4 -#define IB_SIDR_UNSUPPORTED_VER 5 - -/* * The following definitions are shared between the Access Layer and VPD */ - - typedef struct _ib_ca* __ptr64 ib_ca_handle_t; typedef struct _ib_pd* __ptr64 ib_pd_handle_t; typedef struct _ib_rdd* __ptr64 ib_rdd_handle_t; @@ -10467,7 +10311,8 @@ typedef struct _ib_ci_op END_C_DECLS -#endif /* ndef WIN */ +#endif /* ndef WIN32 */ + #if defined( __WIN__ ) #include #endif From mshefty at ichips.intel.com Wed May 16 09:45:43 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 May 2007 09:45:43 -0700 Subject: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h In-Reply-To: <1179333484.4531.176519.camel@hal.voltaire.com> References: 
<1179333484.4531.176519.camel@hal.voltaire.com> Message-ID: <464B3537.7060405@ichips.intel.com> Hal Rosenstock wrote: > OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h CM types are defined in the libibcm library. Why not remove them completely from the opensm code? - Sean
From halr at voltaire.com Wed May 16 10:22:36 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 16 May 2007 13:22:36 -0400 Subject: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h In-Reply-To: <464B3537.7060405@ichips.intel.com> References: <1179333484.4531.176519.camel@hal.voltaire.com> <464B3537.7060405@ichips.intel.com> Message-ID: <1179336155.23882.604.camel@hal.voltaire.com> On Wed, 2007-05-16 at 12:45, Sean Hefty wrote: > Hal Rosenstock wrote: > > OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h > > CM types are defined in the libibcm library. Why not remove them > completely from the opensm code? I think they are needed in the Windows environment. I believe Linux userspace would never include this header. -- Hal > - Sean From sashak at voltaire.com Wed May 16 10:30:47 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 16 May 2007 20:30:47 +0300 Subject: [ofa-general] Re: Error message in OSM log when cached op file doesn't exist In-Reply-To: <4649AE00.8080806@dev.mellanox.co.il> References: <46486D1E.6010408@dev.mellanox.co.il> <1179152459.1540.178811.camel@hal.voltaire.com> <46499769.1070404@dev.mellanox.co.il> <20070515125401.GD23240@sashak.voltaire.com> <4649AE00.8080806@dev.mellanox.co.il> Message-ID: <20070516173047.GK19271@sashak.voltaire.com> On 15:56 Tue 15 May , Yevgeny Kliteynik wrote: > Sasha Khapyorsky wrote: > >On 14:20 Tue 15 May , Yevgeny Kliteynik wrote: > >>I think that the message should appear when OpenSM *does* find cached > >>option file, and no message should appear when such file wasn't found > >>(which is the most common use case). > > > >AFAIK OpenSM which used in the labs' clusters almost always uses this > >file, so I'm not sure about common case. > > If the file is found, user sees "Using cached bla-bla" and > "Loading cached option bla-bla" messages.
> If the file wasn't found, these messages are not printed, > so absence of these messages means that the file wasn't found. > The only thing we can do is to add a new message that will > explicitly inform the user about this, something like > "No cached options file". > Is this necessary? IMHO, it's not. Do you think otherwise? Yes, explicit message would be cleaner. Sasha From sashak at voltaire.com Wed May 16 11:29:02 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 16 May 2007 21:29:02 +0300 Subject: [ofa-general] git over http In-Reply-To: <4649AC27.8010903@cea.fr> References: <4649AC27.8010903@cea.fr> Message-ID: <20070516182902.GL19271@sashak.voltaire.com> On 14:48 Tue 15 May , Philippe Gregoire wrote: > I can't get git clone command working due to our firewall. > Is there any git http server configured ? > If any, how do I translate > git clone git://git.openfabrics.org/~halr/management > > in git clone http path ? Try this: git clone http://git.openfabrics.org/pub/scm/~halr/management.git Also you can ask Hal to put symbolic link to his repo under ~halr/public_html and then it will be accessible similar to git:// as: git clone http://git.openfabrics.org/~halr/management.git Sasha From xma at us.ibm.com Wed May 16 12:09:36 2007 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 16 May 2007 12:09:36 -0700 Subject: [ofa-general] binary compatibility ofed 1.1 and 1.2 In-Reply-To: Message-ID: Hello Roland, > Hi, > > Will apps built with OFED 1.1 verbs.h run on an OFED 1.2 install ? > > -Bill It seems the binary apps are broken between OFED-1.1 and OFED-1.2. Any reason why we can't maintain struct like ibv_cq binary compatibility? Thanks Shirley
From mshefty at ichips.intel.com Wed May 16 12:31:19 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 May 2007 12:31:19 -0700 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com> <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com> Message-ID: <464B5C07.8040601@ichips.intel.com> > But initially this will generate a packet for each path, while sys > admin knows that path is there and he can hard-code the entries for > it. Other thing is that why Admin will care about creating such record > while SA is itself taking care, right? In your original message you asked about adding 'dummy entries' to the cache. I agree that pre-loading the cache can be useful. What I still am not understanding is the reasoning for adding 'dummy entries'. By 'dummy entries', I've been assuming that these are invalid path records, but maybe that's not what you meant. > Another point I want to know is, > When local_sa_cache module will be inserted? After SM comes up or > Before SM comes up? It can occur either way. There is no restriction. The cache responds to port up and GID in/out of service events to update itself. > If it's inserted before SM is coming up (I am assuming SM is running on > some node not on switch) then First Forced schedule_update() is > wasted, and for the first application presence of cache is > meaningless. Why not to keep cache effective right from the start? Pre-loading the cache with path records doesn't guarantee that those paths are usable. If the SM has not come up, then the path records will be unusable until the SM configures the subnet, plus there's no guarantee that the remote endpoints specified by the paths are running. The main benefit I see to pre-loading the cache is to avoid SA storms when booting a large cluster.
- Sean From sashak at voltaire.com Wed May 16 12:49:19 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 16 May 2007 22:49:19 +0300 Subject: [ofa-general] [PATCH] osm: up/dn optimization - improved ranking In-Reply-To: <464B1D41.8080905@dev.mellanox.co.il> References: <464B1D41.8080905@dev.mellanox.co.il> Message-ID: <20070516194919.GO19271@sashak.voltaire.com> Hi Yevgeny, On 18:03 Wed 16 May , Yevgeny Kliteynik wrote: > Hi Hal, > > This patch optimizes fabric ranking similar to the fat-tree ranking. > All the root switches are marked with rank and added to the BFS list, > and only then ranking of rest of the fabric begins. > > Please apply to master. > > Signed-off-by: Yevgeny Kliteynik > --- Basically looks good. However couple comments below. > opensm/opensm/osm_ucast_updn.c | 66 > +++++++++++++++++---------------------- > 1 files changed, 29 insertions(+), 37 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c > index 5cebd9b..9574216 100644 > --- a/opensm/opensm/osm_ucast_updn.c > +++ b/opensm/opensm/osm_ucast_updn.c > @@ -408,53 +408,49 @@ Exit : > /* rank is a SWITCH for BFS purpose */ > static int > updn_subn_rank( > - IN uint64_t root_guid, > - IN uint8_t base_rank, > + IN uint32_t num_guids, 'num_guids' should not be fixed-size integer just compiler friendly 'unsigned' is fine. > + IN uint64_t* guid_list, > IN updn_t* p_updn ) > { > osm_switch_t *p_sw; > - uint32_t rank = base_rank; > osm_physp_t *p_physp, *p_remote_physp; > cl_qlist_t list; > cl_status_t did_cause_update; > struct updn_node *u, *remote_u; > uint8_t num_ports, port_num; > osm_log_t *p_log = &p_updn->p_osm->log; > + uint32_t idx = 0; Ditto. 
> > OSM_LOG_ENTER( p_log, updn_subn_rank ); > + cl_qlist_init(&list); > > - p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, > cl_hton64(root_guid)); > - if(!p_sw) > - { > - osm_log( p_log, OSM_LOG_ERROR, > - "updn_subn_rank: ERR AA05: " > - "Root switch GUID 0x%" PRIx64 " not found\n", root_guid ); > - OSM_LOG_EXIT( p_log ); > - return 1; > - } > - > - osm_log( p_log, OSM_LOG_VERBOSE, > - "updn_subn_rank: " > - "Ranking starts from GUID 0x%" PRIx64 "\n", root_guid ); > - > - u = p_sw->priv; > - u->is_root = 1; > + /* Rank all the roots and add them to list */ > > - /* Rank the first guid chosen anyway since it's the base rank */ > - osm_log( p_log, OSM_LOG_DEBUG, > - "updn_subn_rank: " > - "Ranking port GUID 0x%" PRIx64 "\n", root_guid ); > + for (idx = 0; idx < num_guids; idx++) > + { > + /* Apply the ranking for each guid given by user - bypass illegal ones > */ > + p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, > cl_hton64(guid_list[idx])); > + if(!p_sw) > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "updn_subn_rank: ERR AA05: " > + "Root switch GUID 0x%" PRIx64 " not found\n", > guid_list[idx] ); > + continue; > + } > > - __updn_update_rank(u, rank); > + u = p_sw->priv; > + u->is_root = 1; Now when root switches are always ranked first 'is_root' field is not needed anymore, (!u->rank) answers this. 
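[Editor's sketch] The ranking scheme this patch moves to (give every root switch rank 0 and queue it first, then run one BFS so each remaining switch gets its parent's rank plus one) can be illustrated outside OpenSM as below. This is a minimal standalone sketch, not the OpenSM code: the adjacency-matrix representation and the names rank_fabric and UNRANKED are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdlib.h>

#define UNRANKED (-1)   /* hypothetical marker: switch not visited yet */

/*
 * Rank every switch by its hop distance from the nearest root.
 * adj is a row-major n*n adjacency matrix (an assumed representation,
 * not the OpenSM data structures). Returns the maximum rank assigned,
 * or -1 if the work queue cannot be allocated.
 */
static int rank_fabric(int n, const unsigned char *adj,
                       const int *roots, int num_roots, int *rank)
{
    int *queue;
    int head = 0, tail = 0, max_rank = 0;
    int i, v;

    assert(n > 0);
    queue = malloc((size_t)n * sizeof(*queue));
    if (!queue)
        return -1;

    for (i = 0; i < n; i++)
        rank[i] = UNRANKED;

    /* Step 1: rank all roots and queue them before the BFS starts. */
    for (i = 0; i < num_roots; i++) {
        int r = roots[i];
        if (r < 0 || r >= n || rank[r] != UNRANKED)
            continue;   /* bypass illegal or duplicate roots */
        rank[r] = 0;
        queue[tail++] = r;
    }

    /* Step 2: BFS; each node is queued at most once, so n slots suffice. */
    while (head < tail) {
        int u = queue[head++];
        for (v = 0; v < n; v++) {
            if (adj[u * n + v] && rank[v] == UNRANKED) {
                rank[v] = rank[u] + 1;
                if (rank[v] > max_rank)
                    max_rank = rank[v];   /* track the maximum explicitly */
                queue[tail++] = v;
            }
        }
    }

    free(queue);
    return max_rank;
}
```

Because the rank comes from the dequeued node rather than from a loop counter, no per-iteration rank increment is needed, and a switch reachable from two roots ends up with the distance to the nearer one.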
> > - cl_qlist_init(&list); > - cl_qlist_insert_tail(&list, &u->list); > + osm_log( p_log, OSM_LOG_DEBUG, > + "updn_subn_rank: " > + "Ranking root port GUID 0x%" PRIx64 "\n", guid_list[idx] ); > + __updn_update_rank(u, 0); > + cl_qlist_insert_tail(&list, &u->list); > + } > > /* BFS the list till it's empty */ > while (!cl_is_qlist_empty(&list)) > { > - rank++; > - > u = (struct updn_node *)cl_qlist_remove_head(&list); > /* Go over all remote nodes and rank them (if not already visited) */ > p_sw = u->sw; > @@ -483,7 +479,7 @@ updn_subn_rank( > { > remote_u = p_remote_physp->p_node->sw->priv; > port_guid = p_remote_physp->port_guid; > - did_cause_update = __updn_update_rank(remote_u, rank); > + did_cause_update = __updn_update_rank(remote_u, u->rank+1); > > osm_log( p_log, OSM_LOG_DEBUG, > "updn_subn_rank: " > @@ -500,8 +496,8 @@ updn_subn_rank( > /* Print Summary of ranking */ > osm_log( p_log, OSM_LOG_VERBOSE, > "updn_subn_rank: " > - "Rank Info :\n\t Root Guid = 0x%" PRIx64 "\n\t Max Node Rank = > %d\n", > - root_guid, rank ); > + "Subnet ranking completed. Max Node Rank = %d\n", > + remote_u->rank ); 'remote_u' can be not initialized here. Another issue is that it can be initialized but to remote switch which has lower than max rank (when did_cause_update = 0). The rest is fine. Sasha From rdreier at cisco.com Wed May 16 13:21:31 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 May 2007 13:21:31 -0700 Subject: [ofa-general] binary compatibility ofed 1.1 and 1.2 In-Reply-To: (Shirley Ma's message of "Wed, 16 May 2007 12:09:36 -0700") References: Message-ID: > > Will apps built with OFED 1.1 verbs.h run on an OFED 1.2 install ? Yes, unless you do something to defeat the ABI versioning. > It seems the binary apps are broken between OFED-1.1 and OFED-1.2. > Any reason why we can't maintain struct like ibv_cq binary compatibility? libibverbs has a versioned ABI. 
So if you link your app against libibverbs 1.0, it will be linked against IBVERBS_1.0 symbols and still work with the libibverbs 1.1 dynamic library. A number of changes required struct layout differences etc., so a new IBVERBS_1.1 ABI was introduced as well. But you will only get that by linking against libibverbs 1.1. - R. From rdreier at cisco.com Wed May 16 13:26:00 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 May 2007 13:26:00 -0700 Subject: [ofa-general] Re: OFED HA related question In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net> (Changqing Tang's message of "Wed, 16 May 2007 14:18:59 -0000") References: <1179145102.25749.11.camel@mtls03> <1179242042.25749.33.camel@mtls03> <349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net> Message-ID: Changqing> Suppose I get IBV_EVENT_DEVICE_FATAL async event from Changqing> the first HCA on my node, can I continue to call Changqing> ibv_poll_cq() to get back all the work-requests I Changqing> posted before ? or do I need to keep track these Changqing> work-requests? I am afraid ibv_poll_cq() will return Changqing> error by itself. Also can I call ibv_dereg_mr() to free Changqing> the memory I registered to this HCA ? Once you get a catastrophic error, all bets are off. Work request processing is in an undetermined state, since basically the HCA crashed in an unknown way. Polling CQs is probably not a good idea. I guess you do need to deregister memory regions to unpin the memory as part of your cleanup.... Changqing> If I continue to use the second HCA, does the failure Changqing> of first HCA affect the operation of second HCA (from Changqing> driver point of view) ? No. - R. From rdreier at cisco.com Wed May 16 13:39:16 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 May 2007 13:39:16 -0700 Subject: [ofa-general] Re: [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR leak In-Reply-To: <20070516101457.GA5091@mellanox.co.il> (Michael S. 
Tsirkin's message of "Wed, 16 May 2007 13:14:57 +0300") References: <20070515210453.GL4161@mellanox.co.il> <20070516101457.GA5091@mellanox.co.il> Message-ID: > + * - Put the QP in the Error State > + * - Wait for the Affiliated Asynchronous Last WQE Reached Event; > + * - either: > + * drain the CQ by invoking the Poll CQ verb and either wait for CQ > + * to be empty or the number of Poll CQ operations has exceeded > + * CQ capacity size; > + * - or > + * post another WR that completes on the same CQ and wait for this > + * WR to return as a WC; (NB: this is the option that we use) > + * and then invoke a Destroy QP or Reset QP. I guess this last line would look better as * - invoke a Destroy QP or Reset QP. > +static struct ib_qp_attr ipoib_cm_err_attr __read_mostly = { > + .qp_state = IB_QPS_ERR > +}; > + > +#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff > + > +static struct ib_send_wr ipoib_cm_rx_drain_wr __read_mostly = { > + .wr_id = IPOIB_CM_RX_DRAIN_WRID, > + .opcode = IB_WR_SEND > +}; I don't think these are hot enough to be worth marking as __read_mostly. (better to leave them in normal .data so that stuff that is written to ends up getting spaced out more) > + qp_attr.qp_state = IB_QPS_INIT; > + qp_attr.port_num = priv->port; > + qp_attr.qkey = 0; > + qp_attr.qp_access_flags = 0; > + ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr, > + IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PORT | IB_QP_QKEY); > + if (ret) { > + ipoib_warn(priv, "failed to modify drain QP to INIT: %d\n", ret); > + goto err_qp; > + } > + > + /* We put the QP in error state directly: this way, hardware > + * will immediately generate WC for each WR we post, without > + * sending anything on the wire. */ > + qp_attr.qp_state = IB_QPS_ERR; > + ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr, IB_QP_STATE); > + if (ret) { > + ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret); > + goto err_qp; > + } This actually seems like a good motivation for the mthca RESET -> ERROR fix. 
We could avoid the transition to INIT if we fixed mthca and mlx4, right? (By the way, any interest in making an mlx4 patch to fix that too?) - R. From minich at ornl.gov Wed May 16 13:46:49 2007 From: minich at ornl.gov (Makia Minich) Date: Wed, 16 May 2007 16:46:49 -0400 Subject: [ofa-general] problem with loading IB modules in a IB node with OFED. In-Reply-To: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com> References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com> Message-ID: <200705161646.49910.minich@ornl.gov> The question is what kernel are you trying to load the module into? Whether or not it's OFED-1.0 is irrelevant if one system is 2.6.9-42Elsmp and the other is not. What is the result of the following (on your system where it's not loading): uname -r modinfo madeye.ko | grep vermagic Also, you might want to check dmesg. On Wednesday 16 May 2007 8:00:07 am Keshetti Mahesh wrote: > I am facing problem while loading any IB module ( madeye.ko) into an > IB node with OFED-1.0 > installed. while loading module lots of "disagrees about symbol > version" errors appeared. > where as the same module gets successfully loaded into 2.6.9-42Elsmp ( > which contains > OFED-1.0??). > Is this already discussed? -- Makia Minich National Center for Computation Science Oak Ridge National Laboratory --*-- Imagine no possessions I wonder if you can - John Lennon From rdreier at cisco.com Wed May 16 13:56:46 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 May 2007 13:56:46 -0700 Subject: [ofa-general] Re: movnt In-Reply-To: <20070515204335.GI4161@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 15 May 2007 23:43:35 +0300") References: <20070508141727.GR21591@mellanox.co.il> <20070512172927.GA5908@mellanox.co.il> <20070515204335.GI4161@mellanox.co.il> Message-ID: > So we can map the device memory with WB or WT semantics, and movnt will enable > WC.
And the nice thing about this trick, is that both WB and WT *are already > programmed into PAT after reset*, which means that we can use them for pages we > map for userspace, without stepping on anyone's toes or waiting for > the generic in-kernel support for WC to materialize. I'm not sure whether this is much of an advantage. There's no generic way to map memory with WB that I know of. I don't think that setting a PAT entry for WC is the hold-up -- the problem is more in the right infrastructure for pgprot_xxx(). I don't think it's very nice to have #ifdef __x86_64__ in a driver. > I attach a header file that implements WC memcpy with these > instructions for lengths from 16 to 128 bytes (and one can, > naturally, just call xmm_copy64 in a loop), that I wrote for fun > at some point. Feel free to read/flame/reuse in any way you like. Using movntdq means we have to save off xmm's, and it's a hassle to get a properly aligned block to be able to use movdqa to save them (you can't rely on the stack being 16-byte aligned). I'd be curious to see whether it's even worth it for a 64-byte copy (which is probably the most common case for BF), since you need 8 extra movdqa to save/restore the xmms on top of 4 movdqa to load the WQE and 4 movntdq to write it. Just plain movnti might be the simplest thing to do, since 16 movnti is all you would need, and I think that comes out to be smaller code than 12 movdqa + 4 movntdq. (Optimizing the WQE copy in assembly might be worth it independent of how we map the BF page for WC, since obviously posting BF sends is a super-hot path. And it's fun to write SSE code anyway) - R. 
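
[Editor's sketch] The "16 movnti" variant mentioned above for a 64-byte copy can be approximated in user space with the SSE2 intrinsic _mm_stream_si32, which compilers normally lower to a 32-bit movnti store. This is only an illustration under stated assumptions: copy64_nt is a hypothetical name, the code is not from any driver, it does nothing about how the destination page is mapped, and it falls back to memcpy when SSE2 is unavailable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32; pulls in _mm_sfence as well */
#endif

/*
 * Copy one 64-byte block using sixteen 32-bit non-temporal stores.
 * dst must be at least 4-byte aligned. No xmm registers are touched,
 * so nothing has to be saved or restored around the copy.
 */
static void copy64_nt(void *dst, const void *src)
{
    assert(((uintptr_t)dst & 3) == 0);
#if defined(__SSE2__)
    {
        int *d = dst;
        int w[16];
        int i;

        memcpy(w, src, sizeof(w));          /* load the 64 source bytes */
        for (i = 0; i < 16; i++)
            _mm_stream_si32(d + i, w[i]);   /* one movnti per 32-bit word */
        _mm_sfence();                       /* order the streaming stores */
    }
#else
    memcpy(dst, src, 64);                   /* portable fallback */
#endif
}
```

Whether this beats the movdqa/movntdq version would have to be measured on the hardware in question; the sketch only shows the shape of the simpler alternative.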
From kliteyn at mellanox.co.il Wed May 16 14:24:25 2007 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 17 May 2007 00:24:25 +0300 Subject: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move CMdefinitions from ib_types.h In-Reply-To: <1179336155.23882.604.camel@hal.voltaire.com> References: <1179333484.4531.176519.camel@hal.voltaire.com> <464B3537.7060405@ichips.intel.com> <1179336155.23882.604.camel@hal.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901825156@mtlexch01.mtl.com> > From: Hal Rosenstock [mailto:halr at voltaire.com] > Subject: Re: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move CMdefinitions from ib_types.h > > On Wed, 2007-05-16 at 12:45, Sean Hefty wrote: > > Hal Rosenstock wrote: > > > OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h > > > > CM types are defined in the libibcm library. Why not remove them > > completely from the opensm code? > > I think they are needed in the Windows environment. I'm not sure about this. Windows has another ib_types header, and I think that all the other applications are using this header and not the management ib_types. What are the rest of the CM defines that you want to remove? -- Yevgeny > I believe Linux userspace would never include this header. > > -- Hal > > > - Sean > > From changquing.tang at hp.com Wed May 16 14:43:30 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Wed, 16 May 2007 21:43:30 -0000 Subject: [ofa-general] RE: OFED HA related question In-Reply-To: References: <1179145102.25749.11.camel@mtls03> <1179242042.25749.33.camel@mtls03> <349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net> Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net> > > Changqing> Suppose I get IBV_EVENT_DEVICE_FATAL > async event from > Changqing> the first HCA on my node, can I continue to call > Changqing> ibv_poll_cq() to get back all the work-requests I > Changqing> posted before ? 
or do I need to keep track these > Changqing> work-requests? I am afraid ibv_poll_cq() will return > Changqing> error by itself. Also can I call ibv_dereg_mr() to free > Changqing> the memory I registered to this HCA ? > > Once you get a catastrophic error, all bets are off. Work > request processing is in an undetermined state, since > basically the HCA crashed in an unknown way. Polling CQs is > probably not a good idea. > I guess you do need to deregister memory regions to unpin the > memory as part of your cleanup.... Thanks. However, when catastrophic error occurs, there are some entries in CQ, can I continue to peek them using ibv_poll_cq() ? Also does ibv_dereg_mr() work when fatal error occurs ? --CQ > > Changqing> If I continue to use the second HCA, > does the failure > Changqing> of first HCA affect the operation of second HCA (from > Changqing> driver point of view) ? > > No. > > - R. > From halr at voltaire.com Wed May 16 15:02:52 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 16 May 2007 18:02:52 -0400 Subject: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move CMdefinitions from ib_types.h In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901825156@mtlexch01.mtl.com> References: <1179333484.4531.176519.camel@hal.voltaire.com> <464B3537.7060405@ichips.intel.com> <1179336155.23882.604.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901825156@mtlexch01.mtl.com> Message-ID: <1179352971.23882.18793.camel@hal.voltaire.com> On Wed, 2007-05-16 at 17:24, Yevgeny Kliteynik wrote: > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Subject: Re: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move > CMdefinitions from ib_types.h > > > > On Wed, 2007-05-16 at 12:45, Sean Hefty wrote: > > > Hal Rosenstock wrote: > > > > OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h > > > > > > CM types are defined in the libibcm library. Why not remove them > > > completely from the opensm code? 
> > > > I think they are needed in the Windows environment. > > I'm not sure about this. Windows has another ib_types header, and those two ib_types.h are unrelated ? If so, what are the various WIN conditionalizations doing in ib_types.h ? Can they be removed ? > and I think that all the > other applications are using this header and not the management > ib_types. Are you referring to Windows, Linux, or both here ? > What are the rest of the CM defines that you want to remove? Huh ? What are you referring to here ? -- Hal > -- Yevgeny > > > I believe Linux userspace would never include this header. > > > > -- Hal > > > > > - Sean > > > > From gsadasiv7 at gmail.com Wed May 16 16:00:54 2007 From: gsadasiv7 at gmail.com (Ganesh Sadasivan) Date: Wed, 16 May 2007 16:00:54 -0700 Subject: [ofa-general] Running multiple SM Message-ID: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com> Hi, I have a setup with 2 HCAs connected back to back and am running opensm ( ofed1.1, running at the same priority) on both of them. Is there any utility to see who is the master? The smlid in ibv_devinfo, seems to be changing whenever an SM does a sweep. Is this expected? Thanks Ganesh -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Wed May 16 16:22:00 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 16 May 2007 19:22:00 -0400 Subject: [ofa-general] Running multiple SM In-Reply-To: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com> References: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com> Message-ID: <1179357718.23882.23845.camel@hal.voltaire.com> Hi Ganesh, On Wed, 2007-05-16 at 19:00, Ganesh Sadasivan wrote: > Hi, > > I have a setup with 2 HCAs connected back to back and am running > opensm (ofed1.1, running at the same priority) on both of them. Is > there any utility to see who is the master? sminfo will show the SM state for a LID/GUID. 
> The smlid in ibv_devinfo, seems to be changing whenever an SM does a > sweep. Is this expected? Nope. If they are both at the same priority, the lower GUID should win the SM election. Not sure what is going wrong in your (back to back HCA) subnet. Do your ports stay active ? -- Hal > Thanks > Ganesh > > ______________________________________________________________________ > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From gsadasiv7 at gmail.com Wed May 16 18:42:19 2007 From: gsadasiv7 at gmail.com (Ganesh Sadasivan) Date: Wed, 16 May 2007 18:42:19 -0700 Subject: [ofa-general] Running multiple SM In-Reply-To: <1179357718.23882.23845.camel@hal.voltaire.com> References: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com> <1179357718.23882.23845.camel@hal.voltaire.com> Message-ID: <532b813a0705161842q59038b1dx69cfa642d989789@mail.gmail.com> Hi Hal, Please see inline. On 16 May 2007 19:22:00 -0400, Hal Rosenstock wrote: > > Hi Ganesh, > > On Wed, 2007-05-16 at 19:00, Ganesh Sadasivan wrote: > > Hi, > > > > I have a setup with 2 HCAs connected back to back and am running > > opensm (ofed1.1, running at the same priority) on both of them. Is > > there any utility to see who is the master? Even with priority differences I am seeing the same behavior. Am I missing an option? I am setting "opensm -s 30" and "opensm -s 60" on the respective sides. sminfo will show the SM state for a LID/GUID. Thanks. > The smlid in ibv_devinfo, seems to be changing whenever an SM does a > > sweep. Is this expected? > > Nope. If they are both at the same priority, the lower GUID should win > the SM election. > > Not sure what is going wrong in your (back to back HCA) subnet. Do your > ports stay active ? Yes, both ports are active.
Thanks Ganesh -- Hal > > > Thanks > > Ganesh > > > > ______________________________________________________________________ > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Wed May 16 18:57:27 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 16 May 2007 21:57:27 -0400 Subject: [ofa-general] Running multiple SM In-Reply-To: <532b813a0705161842q59038b1dx69cfa642d989789@mail.gmail.com> References: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com> <1179357718.23882.23845.camel@hal.voltaire.com> <532b813a0705161842q59038b1dx69cfa642d989789@mail.gmail.com> Message-ID: <1179367045.23882.33850.camel@hal.voltaire.com> Hi again Ganesh, On Wed, 2007-05-16 at 21:42, Ganesh Sadasivan wrote: > Hi Hal, > > Please see inline. > > On 16 May 2007 19:22:00 -0400, Hal Rosenstock > wrote: > Hi Ganesh, > > On Wed, 2007-05-16 at 19:00, Ganesh Sadasivan wrote: > > Hi, > > > > I have a setup with 2 HCAs connected back to back and am > running > > opensm (ofed1.1, running at the same priority) on both of > them. Is > > there any utility to see who is the master? > > Even with priority difeferences I am seeing the same behavior.Am I > missing any option. I am setting "opensm -s 30" and "opensm -s 60" on > the respective sides. Why not use the default (10 secs) or at least the same on both sides ? > sminfo will show the SM state for a LID/GUID. > > > Thanks. > > > The smlid in ibv_devinfo, seems to be changing whenever an > SM does a > > sweep. Is this expected? > > Nope. If they are both at the same priority, the lower GUID > should win > the SM election. > > Not sure what is going wrong in your (back to back HCA) > subnet. 
Do you > ports stay active ? > > > Yes both ports are active. And they stay active (no LED color changes) ? If not, can you run both OpenSMs in verbose mode (-V) and see if there is anything interesting/relevant in the logs ? -- Hal > Thanks > Ganesh > > -- Hal > > > Thanks > > Ganesh > > > > > ______________________________________________________________________ > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From rdreier at cisco.com Wed May 16 19:03:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 16 May 2007 19:03:13 -0700 Subject: [ofa-general] Re: OFED HA related question In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net> (Changqing Tang's message of "Wed, 16 May 2007 21:43:30 -0000") References: <1179145102.25749.11.camel@mtls03> <1179242042.25749.33.camel@mtls03> <349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net> Message-ID: > Thanks. However, when catastrophic error occurs, there are some > entries in CQ, can I continue to peek them using ibv_poll_cq() ? Not necessarily. It's better not to do anything once a catastrophic error is reported, because everything is in an indeterminate state. > Also does ibv_dereg_mr() work when fatal error occurs ? It will probably fail but you should try to destroy all your resources I guess. - R. 
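The advice above (stop polling, but still attempt to release everything) amounts to a best-effort teardown that does not stop at the first failure. A small sketch, assuming a hypothetical teardown_step table: in a real libibverbs program the callbacks would wrap calls such as ibv_dereg_mr() or ibv_destroy_qp(), which return 0 on success. The fake_destroy_* helpers are stand-ins so the sketch builds without an HCA.

```c
#include <stddef.h>

/* Hypothetical table entry: one resource to release after a fatal event. */
struct teardown_step {
    const char *name;              /* label, useful for logging only */
    int (*destroy)(void *handle);  /* returns 0 on success, nonzero on failure */
    void *handle;
};

/*
 * Call every destroy callback in order and keep going when one fails.
 * Returns the number of steps that reported failure.
 */
static size_t teardown_all(const struct teardown_step *steps, size_t n)
{
    size_t failed = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (steps[i].destroy == NULL)
            continue;
        if (steps[i].destroy(steps[i].handle) != 0)
            failed++;
    }
    return failed;
}

/* Stand-ins for real destroy calls; each one bumps a call counter. */
static int fake_destroy_ok(void *handle)
{
    ++*(int *)handle;
    return 0;
}

static int fake_destroy_fail(void *handle)
{
    ++*(int *)handle;
    return -1;
}
```

A caller would list dependent objects first (for example memory regions and QPs before the protection domain) and treat a nonzero return as something to log, not as a reason to stop.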
From gsadasiv7 at gmail.com Wed May 16 21:18:15 2007 From: gsadasiv7 at gmail.com (Ganesh Sadasivan) Date: Wed, 16 May 2007 21:18:15 -0700 Subject: [ofa-general] Running multiple SM In-Reply-To: <1179367045.23882.33850.camel@hal.voltaire.com> References: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com> <1179357718.23882.23845.camel@hal.voltaire.com> <532b813a0705161842q59038b1dx69cfa642d989789@mail.gmail.com> <1179367045.23882.33850.camel@hal.voltaire.com> Message-ID: <532b813a0705162118t1ad20d8al65f940116971e50e@mail.gmail.com> The reason is: Jan 01 01:46:17 321555 [58F3E280] -> osm_vendor_set_sm: ERR 5431: setting IS_SM capability mask failed; errno 2 From the code it looks like /dev/infiniband/issm needs to be created and I did that. But still the SM with higher GUID seems to become the master whenever it does a sweep. The logs are too detailed. So I am sending snippets. Local port (with a high GUID) Jan 01 02:49:56 332142 [5873E280] -> osm_pi_rcv_process: Discovered port num 0x1 with GUID = 0x2c901097682d1 for parent node GUID = 0x2c901097682d0, TID = 0x1236 Jan 01 02:49:56 332197 [5873E280] -> PortInfo dump: port number.............0x1 node_guid...............0x0002c901097682d0 port_guid...............0x0002c901097682d1 m_key...................0x0000000000000000 subnet_prefix...........0xfe80000000000000 base_lid................0x1 master_sm_base_lid......0x2 capability_mask.........0x2510A68 diag_code...............0x0 m_key_lease_period......0x0 local_port_num..........0x1 link_width_enabled......0x3 link_width_supported....0x3 link_width_active.......0x2 link_speed_supported....0x1 port_state..............ACTIVE state_info2.............0x52 m_key_protect_bits......0x0 lmc.....................0x0 link_speed..............0x11 mtu_smsl................0x40 vl_cap_init_type........0x40 vl_high_limit...........0x0 vl_arb_high_cap.........0x8 vl_arb_low_cap..........0x8 init_rep_mtu_cap........0x4 vl_stall_life...........0xFF
vl_enforce..............0x40 m_key_violations........0x0 p_key_violations........0x0 q_key_violations........0x0 guid_cap................0x20 client_reregister.......0x0 subnet_timeout..........0x12 resp_time_value.........0x10 error_threshold.........0x88 Jan 01 02:49:56 332337 [5873E280] -> Capabilities Mask: IB_PORT_CAP_HAS_TRAP IB_PORT_CAP_HAS_AUTO_MIG IB_PORT_CAP_HAS_SL_MAP IB_PORT_CAP_HAS_LED_INFO IB_PORT_CAP_HAS_SYS_IMG_GUID IB_PORT_CAP_HAS_COM_MGT IB_PORT_CAP_HAS_VEND_CLS IB_PORT_CAP_HAS_CAP_NTC IB_PORT_CAP_HAS_CLIENT_REREG Remote Port which hosts the SM: Jan 01 02:49:56 500638 [5AF3E280] -> osm_pi_rcv_process: Discovered port num 0x1 with GUID = 0x2c90109765da1 for parent node GUID = 0x2c90109765da0, TID = 0x123b Jan 01 02:49:56 500690 [5AF3E280] -> PortInfo dump: Jan 01 02:49:56 500638 [5AF3E280] -> osm_pi_rcv_process: Discovered port num 0x1 with GUID = 0x2c90109765da1 for parent node GUID = 0x2c90109765da0, TID = 0x123b Jan 01 02:49:56 500690 [5AF3E280] -> PortInfo dump: port number.............0x1 node_guid...............0x0002c90109765da0 port_guid...............0x0002c90109765da1 m_key...................0x0000000000000000 subnet_prefix...........0xfe80000000000000 base_lid................0x2 master_sm_base_lid......0x2 capability_mask.........0x2510A68 diag_code...............0x0 m_key_lease_period......0x0 local_port_num..........0x1 link_width_enabled......0x3 link_width_supported....0x3 link_width_active.......0x2 link_speed_supported....0x1 port_state..............ACTIVE state_info2.............0x52 m_key_protect_bits......0x0 lmc.....................0x0 link_speed..............0x11 mtu_smsl................0x40 vl_cap_init_type........0x40 vl_high_limit...........0x0 vl_arb_high_cap.........0x8 vl_arb_low_cap..........0x8 init_rep_mtu_cap........0x4 vl_stall_life...........0xFF vl_enforce..............0x40 m_key_violations........0x0 p_key_violations........0x0 q_key_violations........0x0 guid_cap................0x20 client_reregister.......0x0 
subnet_timeout..........0x12 resp_time_value.........0x10 error_threshold.........0x88 Jan 01 02:49:56 500831 [5AF3E280] -> Capabilities Mask: IB_PORT_CAP_HAS_TRAP IB_PORT_CAP_HAS_AUTO_MIG IB_PORT_CAP_HAS_SL_MAP IB_PORT_CAP_HAS_LED_INFO IB_PORT_CAP_HAS_SYS_IMG_GUID IB_PORT_CAP_HAS_COM_MGT IB_PORT_CAP_HAS_VEND_CLS IB_PORT_CAP_HAS_CAP_NTC IB_PORT_CAP_HAS_CLIENT_REREG Please let me know if I look at some specific portion. Thanks Ganesh On 16 May 2007 21:57:27 -0400, Hal Rosenstock wrote: > > Hi again Ganesh, > > On Wed, 2007-05-16 at 21:42, Ganesh Sadasivan wrote: > > Hi Hal, > > > > Please see inline. > > > > On 16 May 2007 19:22:00 -0400, Hal Rosenstock > > wrote: > > Hi Ganesh, > > > > On Wed, 2007-05-16 at 19:00, Ganesh Sadasivan wrote: > > > Hi, > > > > > > I have a setup with 2 HCAs connected back to back and am > > running > > > opensm (ofed1.1, running at the same priority) on both of > > them. Is > > > there any utility to see who is the master? > > > > Even with priority difeferences I am seeing the same behavior.Am I > > missing any option. I am setting "opensm -s 30" and "opensm -s 60" on > > the respective sides. > > Why not use the default (10 secs) or at least the same on both sides ? > > > sminfo will show the SM state for a LID/GUID. > > > > > > Thanks. > > > > > The smlid in ibv_devinfo, seems to be changing whenever an > > SM does a > > > sweep. Is this expected? > > > > Nope. If they are both at the same priority, the lower GUID > > should win > > the SM election. > > > > Not sure what is going wrong in your (back to back HCA) > > subnet. Do you > > ports stay active ? > > > > > > Yes both ports are active. > > And they stay active (no LED color changes) ? > > If not, can you run both OpenSMs in verbose mode (-V) and see if there > is anything interesting/relevant in the logs ? 
> > -- Hal > > > Thanks > > Ganesh > > > > -- Hal > > > > > Thanks > > > Ganesh From keshetti.mahesh at gmail.com Wed May 16 21:57:28 2007 From: keshetti.mahesh at gmail.com (Keshetti Mahesh) Date: Thu, 17 May 2007 10:27:28 +0530 Subject: [ofa-general] problem with loading IB modules in a IB node with OFED. In-Reply-To: <200705161646.49910.minich@ornl.gov> References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com> <200705161646.49910.minich@ornl.gov> Message-ID: <829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com> > uname -r 2.6.9-34.ELsmp > modinfo madeye.ko | grep vermagic vermagic: 2.6.9-34.ELsmp SMP gcc-3.4 > Also, you might want to check dmesg. dmesg output: madeye: disagrees about version of symbol ib_unregister_client madeye: Unknown symbol ib_unregister_client madeye: disagrees about version of symbol ib_register_mad_snoop madeye: Unknown symbol ib_register_mad_snoop madeye: disagrees about version of symbol ib_register_client madeye: Unknown symbol ib_register_client madeye: disagrees about version of symbol ib_unregister_mad_agent madeye: Unknown symbol ib_unregister_mad_agent madeye: disagrees about version of symbol ib_set_client_data madeye: Unknown symbol ib_set_client_data madeye: disagrees about version of symbol ib_get_client_data madeye: Unknown symbol ib_get_client_data I think the problem is with the IB headers against which it is being compiled. In my case I am compiling the 'madeye' module with the IB headers available in the 2.6.9-34.ELsmp source code.
where as the IB verbs are compiled against the sources from OFED-1.0. But I don't know how to compile my module with the OFED-1.0 headers(because the headers are not available after the compilation). -- Thanks and regards, Mahesh. From devesh28 at gmail.com Wed May 16 22:21:53 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Thu, 17 May 2007 10:51:53 +0530 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <464B5C07.8040601@ichips.intel.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com> <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com> <464B5C07.8040601@ichips.intel.com> Message-ID: <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com> On 5/17/07, Sean Hefty wrote: > > But initially this will generate a packet for each path, while sys > > admin knows that path is there and he can hard-code the entries for > > it. Other thing is that why Admin will care about creating such record > > while SA is itself taking care, right? > > In your original message you asked about adding 'dummy entries' to the > cache. I agree that pre-loading the cache can be useful. What I still > am not understanding is the reasoning for adding 'dummy entries'. By > 'dummy entries', I've been assuming that these are invalid path records, > but maybe that's not what you meant. Ok if "dummy entries" word as such has created confusion then I am sorry for that, But with that I mean that, those are valid path records which Administrator knows in advance and while loading the module, Admin is loading this info in the cache with user command. > > > Another point I want to know is, > > When local_sa_cache module will be inserted? After SM comes up or > > Before SM comes up? > > It can occur either way. There is no restriction. The cache responds > to port up and GID in/out of service events to update itself. 
Do you mean cache module will start building cache only after Port is UP? > > > If it's inserted before SM is coming up (I am assuming SM is running on > > some node not on switch) then First Forced schedule_update() is > > wasted, and for the first application presence of cache is > > meaningless. Why not to keep cache effective right from the start? > > Pre-loading the cache with path records doesn't guarantee that those > paths are usable. If the SM has not come up, then the path records will > be unusable until the SM configures the subnet, plus there's no > guarantee that the remote endpoints specified by the paths are running. You mean there is no guarantee that even if SM is UP and we have some hard coded entries of path record corresponding to some node X, we are not sure that node X has actually come up or not? In that case actually that path resolving should fail if node has not come up, but with the hard coding still path will be resolved? > > The main benefit I see to pre-loading the cache is to avoid SA storms > when booting a large cluster. That's true. Also cache will get valid entries only if network is configured by SM otherwise every node SA will, possibly, drop SA packets. > > - Sean > From glebn at voltaire.com Wed May 16 23:26:38 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Thu, 17 May 2007 09:26:38 +0300 Subject: [ofa-general] Re: OFED HA related question In-Reply-To: References: <1179145102.25749.11.camel@mtls03> <1179242042.25749.33.camel@mtls03> <349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net> Message-ID: <20070517062637.GL6273@minantech.com> On Wed, May 16, 2007 at 07:03:13PM -0700, Roland Dreier wrote: > > Also does ibv_dereg_mr() work when fatal error occurs ? > > It will probably fail but you should try to destroy all your resources > I guess. > This is a very good question.
Application should be able to unpin memory even if HCA is completely dead. AFAIR you introduced (or want to introduce) separation of pinning and registering APIs in libibverbs and then unpinning will be totally independent of HCA state, but what will happen in the current driver from ofed1.2? -- Gleb. From mst at dev.mellanox.co.il Thu May 17 00:30:28 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 May 2007 10:30:28 +0300 Subject: [ofa-general] Re: [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR leak In-Reply-To: References: <20070515210453.GL4161@mellanox.co.il> <20070516101457.GA5091@mellanox.co.il> Message-ID: <20070517073017.GA4205@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR leak > > > + * - Put the QP in the Error State > > + * - Wait for the Affiliated Asynchronous Last WQE Reached Event; > > + * - either: > > + * drain the CQ by invoking the Poll CQ verb and either wait for CQ > > + * to be empty or the number of Poll CQ operations has exceeded > > + * CQ capacity size; > > + * - or > > + * post another WR that completes on the same CQ and wait for this > > + * WR to return as a WC; (NB: this is the option that we use) > > + * and then invoke a Destroy QP or Reset QP. > > I guess this last line would look better as > > * - invoke a Destroy QP or Reset QP. Hmm, I would like to quote the spec *literally*. Maybe - and then invoke a Destroy QP or Reset QP. > > +static struct ib_qp_attr ipoib_cm_err_attr __read_mostly = { > > + .qp_state = IB_QPS_ERR > > +}; > > + > > +#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff > > + > > +static struct ib_send_wr ipoib_cm_rx_drain_wr __read_mostly = { > > + .wr_id = IPOIB_CM_RX_DRAIN_WRID, > > + .opcode = IB_WR_SEND > > +}; > > I don't think these are hot enough to be worth marking as __read_mostly. > (better to leave them in normal .data so that stuff that is written to > ends up getting spaced out more) OK, thanks for the suggestion. 
> > + qp_attr.qp_state = IB_QPS_INIT; > > + qp_attr.port_num = priv->port; > > + qp_attr.qkey = 0; > > + qp_attr.qp_access_flags = 0; > > + ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr, > > + IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PORT | IB_QP_QKEY); > > + if (ret) { > > + ipoib_warn(priv, "failed to modify drain QP to INIT: %d\n", ret); > > + goto err_qp; > > + } > > + > > + /* We put the QP in error state directly: this way, hardware > > + * will immediately generate WC for each WR we post, without > > + * sending anything on the wire. */ > > + qp_attr.qp_state = IB_QPS_ERR; > > + ret = ib_modify_qp(priv->cm.rx_drain_qp, &qp_attr, IB_QP_STATE); > > + if (ret) { > > + ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret); > > + goto err_qp; > > + } > > This actually seems like a good motivation for the mthca RESET -> > ERROR fix. We could avoid the transition to INIT if we fixed mthca > and mlx4, right? Yes. That was the motivation. > (By the way, any interest in making an mlx4 patch to > fix that too?) Easy (I also fixed reset to reset on the way). IB/mlx4: fix RESET -> ERROR and RESET -> RESET transitions Signed-off-by: Michael S. 
Tsirkin --- diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 5cd7069..c93daab 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -573,7 +573,7 @@ static int to_mlx4_st(enum ib_qp_type type) } } -static __be32 to_mlx4_access_flags(struct mlx4_ib_qp *qp, struct ib_qp_attr *attr, +static __be32 to_mlx4_access_flags(struct mlx4_ib_qp *qp, const struct ib_qp_attr *attr, int attr_mask) { u8 dest_rd_atomic; @@ -603,7 +603,7 @@ static __be32 to_mlx4_access_flags(struct mlx4_ib_qp *qp, struct ib_qp_attr *att return cpu_to_be32(hw_access_flags); } -static void store_sqp_attrs(struct mlx4_ib_sqp *sqp, struct ib_qp_attr *attr, +static void store_sqp_attrs(struct mlx4_ib_sqp *sqp, const struct ib_qp_attr *attr, int attr_mask) { if (attr_mask & IB_QP_PKEY_INDEX) @@ -619,7 +619,7 @@ static void mlx4_set_sched(struct mlx4_qp_path *path, u8 port) path->sched_queue = (path->sched_queue & 0xbf) | ((port - 1) << 6); } -static int mlx4_set_path(struct mlx4_ib_dev *dev, struct ib_ah_attr *ah, +static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah, struct mlx4_qp_path *path, u8 port) { path->grh_mylmc = ah->src_path_bits & 0x7f; @@ -655,14 +655,14 @@ static int mlx4_set_path(struct mlx4_ib_dev *dev, struct ib_ah_attr *ah, return 0; } -int mlx4_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask, struct ib_udata *udata) +static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, + const struct ib_qp_attr *attr, int attr_mask, + enum ib_qp_state cur_state, enum ib_qp_state new_state) { struct mlx4_ib_dev *dev = to_mdev(ibqp->device); struct mlx4_ib_qp *qp = to_mqp(ibqp); struct mlx4_qp_context *context; enum mlx4_qp_optpar optpar = 0; - enum ib_qp_state cur_state, new_state; int sqd_event; int err = -EINVAL; @@ -670,34 +670,6 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, if (!context) return -ENOMEM; - mutex_lock(&qp->mutex); - - cur_state = attr_mask & 
IB_QP_CUR_STATE ? attr->cur_qp_state : qp->state; - new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state; - - if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask)) - goto out; - - if ((attr_mask & IB_QP_PKEY_INDEX) && - attr->pkey_index >= dev->dev->caps.pkey_table_len) { - goto out; - } - - if ((attr_mask & IB_QP_PORT) && - (attr->port_num == 0 || attr->port_num > dev->dev->caps.num_ports)) { - goto out; - } - - if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC && - attr->max_rd_atomic > dev->dev->caps.max_qp_init_rdma) { - goto out; - } - - if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC && - attr->max_dest_rd_atomic > 1 << dev->dev->caps.max_qp_dest_rdma) { - goto out; - } - context->flags = cpu_to_be32((to_mlx4_state(new_state) << 28) | (to_mlx4_st(ibqp->qp_type) << 16)); context->flags |= cpu_to_be32(1 << 8); /* DE? */ @@ -920,11 +892,83 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, } out: - mutex_unlock(&qp->mutex); kfree(context); return err; } +static const struct ib_qp_attr mlx4_ib_qp_attr = { .port_num = 1 }; +static const int mlx4_ib_qp_attr_mask_table[IB_QPT_UD + 1] = { + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [IB_QPT_RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), +}; + +int mlx4_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask, struct ib_udata *udata) +{ + struct mlx4_ib_dev *dev = to_mdev(ibqp->device); + struct mlx4_ib_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + int err = -EINVAL; + + mutex_lock(&qp->mutex); + + cur_state = attr_mask & IB_QP_CUR_STATE ? attr->cur_qp_state : qp->state; + new_state = attr_mask & IB_QP_STATE ? 
attr->qp_state : cur_state; + + if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask)) + goto out; + + if ((attr_mask & IB_QP_PKEY_INDEX) && + attr->pkey_index >= dev->dev->caps.pkey_table_len) { + goto out; + } + + if ((attr_mask & IB_QP_PORT) && + (attr->port_num == 0 || attr->port_num > dev->dev->caps.num_ports)) { + goto out; + } + + if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC && + attr->max_rd_atomic > dev->dev->caps.max_qp_init_rdma) { + goto out; + } + + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC && + attr->max_dest_rd_atomic > 1 << dev->dev->caps.max_qp_dest_rdma) { + goto out; + } + + if (cur_state == new_state && cur_state == IB_QPS_RESET) { + err = 0; + goto out; + } + + if (cur_state == IB_QPS_RESET && new_state == IB_QPS_ERR) { + err = __mlx4_ib_modify_qp(ibqp, &mlx4_ib_qp_attr, + mlx4_ib_qp_attr_mask_table[ibqp->qp_type], + IB_QPS_RESET, IB_QPS_INIT); + if (err) + goto out; + cur_state = IB_QPS_INIT; + } + + err = __mlx4_ib_modify_qp(ibqp, attr, attr_mask, cur_state, new_state); +out: + mutex_unlock(&qp->mutex); + return err; +} + static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr, void *wqe) { -- MST From eli at mellanox.co.il Thu May 17 00:32:37 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 17 May 2007 10:32:37 +0300 Subject: [ofa-general] [PATCH 1/2] libmlx4: pass more data from user to kernel Message-ID: <1179387187.25749.61.camel@mtls03> Pass to kernel user calcuated values which will be used by the kernel to configure the qp and pin memory. 
Signed-off-by: Eli Cohen --- Index: libmlx4/src/mlx4-abi.h =================================================================== --- libmlx4.orig/src/mlx4-abi.h 2007-05-16 16:37:01.000000000 +0300 +++ libmlx4/src/mlx4-abi.h 2007-05-17 09:46:56.000000000 +0300 @@ -35,7 +35,7 @@ #include -#define MLX4_UVERBS_ABI_VERSION 1 +#define MLX4_UVERBS_ABI_VERSION 2 struct mlx4_alloc_ucontext_resp { struct ibv_get_context_resp ibv_resp; @@ -83,6 +83,10 @@ struct mlx4_create_qp { struct ibv_create_qp ibv_cmd; __u64 buf_addr; __u64 db_addr; + __u64 rq_size; + __u64 sq_size; + __u8 rcv_wqe_shift; + __u8 log_wqe_bb; }; #endif /* MLX4_ABI_H */ Index: libmlx4/src/verbs.c =================================================================== --- libmlx4.orig/src/verbs.c 2007-05-16 16:37:01.000000000 +0300 +++ libmlx4/src/verbs.c 2007-05-17 09:37:46.000000000 +0300 @@ -385,6 +385,11 @@ struct ibv_qp *mlx4_create_qp(struct ibv cmd.buf_addr = (uintptr_t) qp->buf.buf; cmd.db_addr = (uintptr_t) qp->db; + cmd.rq_size = (uintptr_t) qp->rq.max; + cmd.sq_size = (uintptr_t) qp->sq.max; + cmd.rcv_wqe_shift = qp->rq.wqe_shift; + cmd.log_wqe_bb = qp->sq.wqe_shift; + qp->max_inline_data = attr->cap.max_inline_data; ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd, &resp, sizeof resp); @@ -395,12 +400,6 @@ struct ibv_qp *mlx4_create_qp(struct ibv if (ret) goto err_destroy; - qp->sq.max = attr->cap.max_send_wr; - qp->rq.max = attr->cap.max_recv_wr; - qp->sq.max_gs = attr->cap.max_send_sge; - qp->rq.max_gs = attr->cap.max_recv_sge; - qp->max_inline_data = attr->cap.max_inline_data; - qp->doorbell_qpn = htonl(qp->ibv_qp.qp_num << 8); if (attr->sq_sig_all) qp->sq_signal_bits = htonl(MLX4_WQE_CTRL_CQ_UPDATE); From eli at mellanox.co.il Thu May 17 00:32:41 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 17 May 2007 10:32:41 +0300 Subject: [ofa-general] [PATCH 2/2] IB/mlx4: pass more data from user to kernel Message-ID: <1179387217.25749.62.camel@mtls03> kernel code make minimum 
calculations to evaluate the WQ size of user space consumers when calculating the buffer size. Signed-off-by: Eli Cohen --- Index: connectx_kernel/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- connectx_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-05-16 16:37:35.000000000 +0300 +++ connectx_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-05-17 09:13:49.000000000 +0300 @@ -188,8 +188,8 @@ static int send_wqe_overhead(enum ib_qp_ } } -static int set_qp_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, - enum ib_qp_type type, struct mlx4_ib_qp *qp) +static int set_kernel_qp_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, + enum ib_qp_type type, struct mlx4_ib_qp *qp) { /* Sanity check QP size before proceeding */ if (cap->max_send_wr > dev->dev->caps.max_wqes || @@ -249,6 +249,23 @@ static int set_qp_size(struct mlx4_ib_de return 0; } +static int set_user_qp_size(struct mlx4_ib_qp *qp, + struct mlx4_ib_create_qp *ucmd) +{ + if (ucmd->rq_size & ucmd->rq_size - 1 || ucmd->sq_size & ucmd->sq_size - 1) + return -EINVAL; + + qp->rq.max = ucmd->rq_size; + qp->rq.wqe_shift = ucmd->rcv_wqe_shift; + qp->sq.wqe_shift = ucmd->log_wqe_bb; + qp->sq.max = ucmd->sq_size; + + qp->buf_size = (qp->rq.max << qp->rq.wqe_shift) + + (qp->sq.max << qp->sq.wqe_shift); + + return 0; +} + static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, struct ib_qp_init_attr *init_attr, struct ib_udata *udata, int sqpn, struct mlx4_ib_qp *qp) @@ -270,10 +287,6 @@ static int create_qp_common(struct mlx4_ qp->sq.head = 0; qp->sq.tail = 0; - err = set_qp_size(dev, &init_attr->cap, init_attr->qp_type, qp); - if (err) - goto err; - if (pd->uobject) { struct mlx4_ib_create_qp ucmd; @@ -282,6 +295,10 @@ static int create_qp_common(struct mlx4_ goto err; } + err = set_user_qp_size(qp, &ucmd); + if (err) + goto err; + qp->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr, qp->buf_size, 0); if (IS_ERR(qp->umem)) { @@ -303,6 +320,10
@@ static int create_qp_common(struct mlx4_ if (err) goto err_mtt; } else { + err = set_kernel_qp_size(dev, &init_attr->cap, init_attr->qp_type, qp); + if (err) + goto err; + err = mlx4_ib_db_alloc(dev, &qp->db, 0); if (err) goto err; Index: connectx_kernel/drivers/infiniband/hw/mlx4/user.h =================================================================== --- connectx_kernel.orig/drivers/infiniband/hw/mlx4/user.h 2007-05-16 16:37:35.000000000 +0300 +++ connectx_kernel/drivers/infiniband/hw/mlx4/user.h 2007-05-17 09:45:45.000000000 +0300 @@ -39,7 +39,7 @@ * Increment this value if any changes that break userspace ABI * compatibility are made. */ -#define MLX4_IB_UVERBS_ABI_VERSION 1 +#define MLX4_IB_UVERBS_ABI_VERSION 2 /* * Make sure that all structs defined in this file remain laid out so @@ -87,6 +87,10 @@ struct mlx4_ib_create_srq_resp { struct mlx4_ib_create_qp { __u64 buf_addr; __u64 db_addr; + __u64 rq_size; + __u64 sq_size; + __u8 rcv_wqe_shift; + __u8 log_wqe_bb; }; #endif /* MLX4_IB_USER_H */ From mst at dev.mellanox.co.il Thu May 17 01:06:23 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 May 2007 11:06:23 +0300 Subject: [ofa-general] Re: movnt In-Reply-To: References: <20070508141727.GR21591@mellanox.co.il> <20070512172927.GA5908@mellanox.co.il> <20070515204335.GI4161@mellanox.co.il> Message-ID: <20070517080623.GB4205@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: movnt > > > So we can map the device memory with WB or WT semantics, and movnt will enable > > WC. And the nice thing about this trick, is that both WB and WT *are already > > programmed into PAT after reset*, which means that we can use them for pages we > > map for userspace, without stepping on anyone's toes or waiting for > > the generic in-kernel support for WC to materialize. > > I'm not sure whether this is much of an advantage. There's no generic > way to map memory with WB that I know of. 
I think we just need to avoid setting the noncacheable flag - this does not sound too bad. > I don't think that setting > a PAT entry for WC is the hold-up I think it is - editing PAT will affect all of the system, you start hitting various errata ... > -- the problem is more in the right > infrastructure for pgprot_xxx(). I don't think it's very nice to have > #ifdef __x86_64__ in a driver. Not nice, but I think doing it in a platform-dependent way first will be a good way for whoever implements the portable infrastructure to see the patterns. So far, most people think about X as the main user for the infrastructure-to-emerge - an easier to experiment with user such as libmlx4 will be a good testcase I think. Look at drivers/infiniband/hw/ipath/ipath_file_ops.c for example - so why make an exception for mthca? Isn't the following a good start? } else if (vma->vm_pgoff == 1 && dev->dev->caps.bf_reg_size != 0) { - /* FIXME want pgprot_writecombine() for BlueFlame pages */ +#if defined(__x86_64__) vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); +#endif if (io_remap_pfn_range(vma, vma->vm_start, to_mucontext(context)->uar.pfn + dev->dev->caps.num_uars, PAGE_SIZE, vma->vm_page_prot)) return -EAGAIN; } else This does not look too bad, does it? All we need to do is require libmlx4 to always use movnti on x86_64. -- MST From mst at dev.mellanox.co.il Thu May 17 01:10:51 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 May 2007 11:10:51 +0300 Subject: [ofa-general] [PATCH] IB/mthca: fix use-after-free In-Reply-To: <6C2C79E72C305246B504CBA17B5500C90182527E@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C90182527E@mtlexch01.mtl.com> Message-ID: <20070517081050.GC4205@mellanox.co.il> From: Ali Ayoub Fix use-after-free on hardware restart. Signed-off-by: Michael S.
Tsirkin --- --- ./drivers/infiniband/hw/mthca/mthca_main.c.orig 2007-05-17 11:01:28.000000000 +0300 +++ ./drivers/infiniband/hw/mthca/mthca_main.c 2007-05-17 11:02:36.000000000 +0300 @@ -1250,12 +1250,14 @@ int __mthca_restart_one(struct pci_dev *pdev) { struct mthca_dev *mdev; + int hca_type; mdev = pci_get_drvdata(pdev); + hca_type = mdev->hca_type; if (!mdev) return -ENODEV; __mthca_remove_one(pdev); - return __mthca_init_one(pdev, mdev->hca_type); + return __mthca_init_one(pdev, hca_type); } static int __devinit mthca_init_one(struct pci_dev *pdev, -- Michael S. Tsirkin - Staff Engineer, Mellanox Technologies Ltd. Eternity is a very long time, especially towards the end. From mst at dev.mellanox.co.il Thu May 17 01:26:03 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 May 2007 11:26:03 +0300 Subject: [ofa-general] Re: Re: OFED HA related question In-Reply-To: <20070517062637.GL6273@minantech.com> References: <1179145102.25749.11.camel@mtls03> <1179242042.25749.33.camel@mtls03> <349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net> <20070517062637.GL6273@minantech.com> Message-ID: <20070517082603.GE4205@mellanox.co.il> > Quoting Gleb Natapov : > Subject: Re: Re: OFED HA related question > > On Wed, May 16, 2007 at 07:03:13PM -0700, Roland Dreier wrote: > > > Also does ibv_dereg_mr() work when fatal error occurs ? > > > > It will probably fail but you should try to destroy all your resources > > I guess. > > > This is very good question. Application should be able to unpin memory > even if HCA is completely dead. This is only safe after you reset the HCA, otherwise it might be writing over this memory. 
-- MST From glebn at voltaire.com Thu May 17 01:30:44 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Thu, 17 May 2007 11:30:44 +0300 Subject: [ofa-general] Re: Re: OFED HA related question In-Reply-To: <20070517082603.GE4205@mellanox.co.il> References: <1179145102.25749.11.camel@mtls03> <1179242042.25749.33.camel@mtls03> <349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net> <20070517062637.GL6273@minantech.com> <20070517082603.GE4205@mellanox.co.il> Message-ID: <20070517083044.GM6273@minantech.com> On Thu, May 17, 2007 at 11:26:03AM +0300, Michael S. Tsirkin wrote: > > Quoting Gleb Natapov : > > Subject: Re: Re: OFED HA related question > > > > On Wed, May 16, 2007 at 07:03:13PM -0700, Roland Dreier wrote: > > > > Also does ibv_dereg_mr() work when fatal error occurs ? > > > > > > It will probably fail but you should try to destroy all your resources > > > I guess. > > > > > This is very good question. Application should be able to unpin memory > > even if HCA is completely dead. > > This is only safe after you reset the HCA, otherwise it might be > writing over this memory. > Right. Good point. So to recover memory after HCA failure event we need to reset HCA and only after that unpin memory. What current mthca driver does in case it cannot unregister memory from HCA? Does it proceed to unpin it? -- Gleb. 
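The ordering agreed on in the exchange above (reset the HCA first, unpin the memory only afterwards) can be sketched as a small guard. This is a toy user-space model written for illustration, not mthca driver code: the names toy_hca, toy_mr, toy_hca_reset and toy_dereg_mr, the three device states, and the choice of -EBUSY as the refusal code are all assumptions, and the thread does not say what the real driver does when deregistration fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device states; not taken from the mthca driver. */
enum hca_state { HCA_RUNNING, HCA_FATAL, HCA_RESET_DONE };

struct toy_hca {
	enum hca_state state;
};

struct toy_mr {
	bool hw_registered; /* the device may still DMA into this buffer */
	bool pinned;        /* the pages are still pinned in memory */
};

/* Model a reset: afterwards the device holds no old registrations. */
static void toy_hca_reset(struct toy_hca *hca)
{
	assert(hca != NULL);
	hca->state = HCA_RESET_DONE;
}

/*
 * Deregister and unpin one region.
 * Returns 0 on success, or -EBUSY when the device is in a fatal state
 * and still holds the registration, so unpinning would be unsafe.
 */
static int toy_dereg_mr(struct toy_hca *hca, struct toy_mr *mr)
{
	assert(hca != NULL && mr != NULL);

	if (mr->hw_registered) {
		if (hca->state == HCA_FATAL)
			return -EBUSY; /* keep the pages pinned until a reset */
		/* Running device: the command succeeds. After reset: nothing left to undo. */
		mr->hw_registered = false;
	}
	mr->pinned = false;
	return 0;
}
```

In this model, calling toy_dereg_mr() while the device is in HCA_FATAL returns -EBUSY and leaves the pages pinned; after toy_hca_reset() the same call succeeds and the region is unpinned.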
From vlad at lists.openfabrics.org Thu May 17 02:40:42 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 17 May 2007 02:40:42 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070517-0200 daily build status Message-ID: <20070517094042.E9717E60835@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on ppc64 
with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From halr at voltaire.com Thu May 17 03:42:16 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 17 May 2007 06:42:16 -0400 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com> <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com> <464B5C07.8040601@ichips.intel.com> <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com> Message-ID: <1179398534.23882.67542.camel@hal.voltaire.com> On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote: > On 5/17/07, Sean Hefty wrote: > > > But initially this will generate a packet for each path, while sys > > > admin knows that path is there and he can hard-code the entries for > > > it. Other thing is that why Admin will care about creating such record > > > while SA is itself taking care, right? > > > > In your original message you asked about adding 'dummy entries' to the > > cache. I agree that pre-loading the cache can be useful. What I still > > am not understanding is the reasoning for adding 'dummy entries'. By > > 'dummy entries', I've been assuming that these are invalid path records, > > but maybe that's not what you meant. 
> Ok if "dummy entries" word as such has created confusion then I am > sorry for that, But with that I mean that, those are valid path > records which Administrator knows in advance and while loading the > module, How does the admin know they are valid ? Are they somehow preconfigured at the SM ? Doesn't each SM have its own policy for generating valid PRs ? Or are these from a live SM and just loaded "out of band" to bypass/preclude the SA PR mechanism ? -- Hal > Admin is loading this info in the cache with user command. > > > > > Another point I want to know is, > > > When local_sa_cache module will be inserted? After SM comes up or > > > Before SM comes up? > > > > It can occur either way. There is no restriction. The cache responds > > to port up and GID in/out of service events to update itself. > Do you mean cache module will start building cache only after Port is UP? > > > > > If Its inserted before SM is coming up (I am assuming SM is running on > > > some node not on switch) then First Forced schedule_update() is > > > waisted, and for the first application presence of cache is > > > meaningless. Why not to keep cache effective right from the start? > > > > Pre-loading the cache with path records doesn't guarantee that those > > paths are usable. If the SM has not come up, then the path records will > > be unusable until the SM configures the subnet, plus there's no > > guarantee that the remote endpoints specified by the paths are running. > You mean there is no guarantee that even if SM is UP and we have some > hard coded entries of path record corresponding to some node X, we are > not sure that node X has actually come up or not? In that case > actually that path resolving should fail if node has not come up, but > with the hard coding still path will be resolved? > > > > The main benefit I see to pre-loading the cache is to avoid SA storms > > when booting a large cluster. > that's true. 
Also cache will get valid entries only if network is > configured by SM otherwise every node SA will, possibly, drop SA > packets. > > > > - Sean > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Thu May 17 03:46:47 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 17 May 2007 06:46:47 -0400 Subject: [ofa-general] problem with loading IB modules in a IB node with OFED. In-Reply-To: <829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com> References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com> <200705161646.49910.minich@ornl.gov> <829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com> Message-ID: <1179398805.23882.67884.camel@hal.voltaire.com> On Thu, 2007-05-17 at 00:57, Keshetti Mahesh wrote: > > uname -r > 2.6.9-34.ELsmp > > > modinfo madeye.ko | grep vermagic > vermagic: 2.6.9-34.ELsmp SMP gcc-3.4 > > > Also, you might want to check dmesg. > > dmesg output: > madeye: disagrees about version of symbol ib_unregister_client > madeye: Unknown symbol ib_unregister_client > madeye: disagrees about version of symbol ib_register_mad_snoop > madeye: Unknown symbol ib_register_mad_snoop > madeye: disagrees about version of symbol ib_register_client > madeye: Unknown symbol ib_register_client > madeye: disagrees about version of symbol ib_unregister_mad_agent > madeye: Unknown symbol ib_unregister_mad_agent > madeye: disagrees about version of symbol ib_set_client_data > madeye: Unknown symbol ib_set_client_data > madeye: disagrees about version of symbol ib_get_client_data > madeye: Unknown symbol ib_get_client_data > > I think the problem with the IB headers with which it is being > compiled. In my case I am > compiling the 'madeye' module withe IB headers available in the > 2.6.9.34Elsmp source > code. 
where as the IB verbs are compiled against the sources from > OFED-1.0. I didn't think that OFED 1.0 included madeye. Where did madeye come from ? It was first made part of OFED at 1.1. -- Hal > But I don't > know how to compile my module with the OFED-1.0 headers(because the > headers are not > available after the compilation). From halr at voltaire.com Thu May 17 03:51:15 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 17 May 2007 06:51:15 -0400 Subject: [ofa-general] Running multiple SM In-Reply-To: <532b813a0705162118t1ad20d8al65f940116971e50e@mail.gmail.com> References: <532b813a0705161600m743a83ecua4ea8a3893853824@mail.gmail.com> <1179357718.23882.23845.camel@hal.voltaire.com> <532b813a0705161842q59038b1dx69cfa642d989789@mail.gmail.com> <1179367045.23882.33850.camel@hal.voltaire.com> <532b813a0705162118t1ad20d8al65f940116971e50e@mail.gmail.com> Message-ID: <1179399074.23882.68142.camel@hal.voltaire.com> On Thu, 2007-05-17 at 00:18, Ganesh Sadasivan wrote: > The reason is: > Jan 01 01:46:17 321555 [58F3E280] -> osm_vendor_set_sm: ERR 5431: > setting IS_SM capability mask failed; errno 2 Yes, this makes sense now and explains what you are seeing. > From the code it looks like /dev/infiniband/issm needs to > be created and I did that. This should be done via udev rather than manually. Do you have udev setup ? If not, please follow the instructions on the wiki. -- Hal > But still the SM with higher GUID seem to become the master whenever > it does a sweep. The logs are too detailed. So I am sending snippets. 
> > Local port (with a high GUID) > Jan 01 02:49:56 332142 [5873E280] -> osm_pi_rcv_process: Discovered > port num 0x1 with GUID = 0x2c901097682d1 for parent node GUID = > 0x2c901097682d0, TID = 0x1236 > Jan 01 02:49:56 332197 [5873E280] -> PortInfo dump: > port number.............0x1 > > node_guid...............0x0002c901097682d0 > > port_guid...............0x0002c901097682d1 > > m_key...................0x0000000000000000 > > subnet_prefix...........0xfe80000000000000 > base_lid................0x1 > master_sm_base_lid......0x2 > capability_mask.........0x2510A68 > diag_code...............0x0 > m_key_lease_period......0x0 > local_port_num..........0x1 > link_width_enabled......0x3 > link_width_supported....0x3 > link_width_active.......0x2 > link_speed_supported....0x1 > port_state..............ACTIVE > state_info2.............0x52 > m_key_protect_bits......0x0 > lmc.....................0x0 > link_speed..............0x11 > mtu_smsl................0x40 > vl_cap_init_type........0x40 > vl_high_limit...........0x0 > vl_arb_high_cap.........0x8 > vl_arb_low_cap..........0x8 > init_rep_mtu_cap........0x4 > vl_stall_life...........0xFF > vl_enforce..............0x40 > m_key_violations........0x0 > p_key_violations........0x0 > q_key_violations........0x0 > guid_cap................0x20 > client_reregister.......0x0 > subnet_timeout..........0x12 > resp_time_value.........0x10 > error_threshold.........0x88 > Jan 01 02:49:56 332337 [5873E280] -> Capabilities Mask: > IB_PORT_CAP_HAS_TRAP > IB_PORT_CAP_HAS_AUTO_MIG > IB_PORT_CAP_HAS_SL_MAP > IB_PORT_CAP_HAS_LED_INFO > IB_PORT_CAP_HAS_SYS_IMG_GUID > IB_PORT_CAP_HAS_COM_MGT > IB_PORT_CAP_HAS_VEND_CLS > IB_PORT_CAP_HAS_CAP_NTC > IB_PORT_CAP_HAS_CLIENT_REREG > > Remote Port which hosts the SM: > Jan 01 02:49:56 500638 [5AF3E280] -> osm_pi_rcv_process: Discovered > port num 0x1 with GUID = 0x2c90109765da1 for parent node GUID = > 0x2c90109765da0, TID = 0x123b > Jan 01 02:49:56 500690 [5AF3E280] -> PortInfo dump: > Jan 01 02:49:56 
500638 [5AF3E280] -> osm_pi_rcv_process: Discovered > port num 0x1 with GUID = 0x2c90109765da1 for parent node GUID = > 0x2c90109765da0, TID = 0x123b > Jan 01 02:49:56 500690 [5AF3E280] -> PortInfo dump: > port number.............0x1 > > node_guid...............0x0002c90109765da0 > > port_guid...............0x0002c90109765da1 > > m_key...................0x0000000000000000 > > subnet_prefix...........0xfe80000000000000 > base_lid................0x2 > master_sm_base_lid......0x2 > capability_mask.........0x2510A68 > diag_code...............0x0 > m_key_lease_period......0x0 > local_port_num..........0x1 > link_width_enabled......0x3 > link_width_supported....0x3 > link_width_active.......0x2 > link_speed_supported....0x1 > port_state..............ACTIVE > state_info2.............0x52 > m_key_protect_bits......0x0 > lmc.....................0x0 > link_speed..............0x11 > mtu_smsl................0x40 > vl_cap_init_type........0x40 > vl_high_limit...........0x0 > vl_arb_high_cap.........0x8 > vl_arb_low_cap..........0x8 > init_rep_mtu_cap........0x4 > vl_stall_life...........0xFF > vl_enforce..............0x40 > m_key_violations........0x0 > p_key_violations........0x0 > q_key_violations........0x0 > guid_cap................0x20 > client_reregister.......0x0 > subnet_timeout..........0x12 > resp_time_value.........0x10 > error_threshold.........0x88 > Jan 01 02:49:56 500831 [5AF3E280] -> Capabilities Mask: > IB_PORT_CAP_HAS_TRAP > IB_PORT_CAP_HAS_AUTO_MIG > IB_PORT_CAP_HAS_SL_MAP > IB_PORT_CAP_HAS_LED_INFO > IB_PORT_CAP_HAS_SYS_IMG_GUID > IB_PORT_CAP_HAS_COM_MGT > IB_PORT_CAP_HAS_VEND_CLS > IB_PORT_CAP_HAS_CAP_NTC > IB_PORT_CAP_HAS_CLIENT_REREG > > Please let me know if I look at some specific portion. > > Thanks > Ganesh > > > > On 16 May 2007 21:57:27 -0400, Hal Rosenstock > wrote: > Hi again Ganesh, > > On Wed, 2007-05-16 at 21:42, Ganesh Sadasivan wrote: > > Hi Hal, > > > > Please see inline. 
> > > > On 16 May 2007 19:22:00 -0400, Hal Rosenstock > > > wrote: > > Hi Ganesh, > > > > On Wed, 2007-05-16 at 19:00, Ganesh Sadasivan wrote: > > > Hi, > > > > > > I have a setup with 2 HCAs connected back to > back and am > > running > > > opensm (ofed1.1, running at the same priority) on > both of > > them. Is > > > there any utility to see who is the master? > > > > Even with priority difeferences I am seeing the same > behavior.Am I > > missing any option. I am setting "opensm -s 30" and "opensm > -s 60" on > > the respective sides. > > Why not use the default (10 secs) or at least the same on both > sides ? > > > sminfo will show the SM state for a LID/GUID. > > > > > > Thanks. > > > > > The smlid in ibv_devinfo, seems to be changing > whenever an > > SM does a > > > sweep. Is this expected? > > > > Nope. If they are both at the same priority, the > lower GUID > > should win > > the SM election. > > > > Not sure what is going wrong in your (back to back > HCA) > > subnet. Do you > > ports stay active ? > > > > > > Yes both ports are active. > > And they stay active (no LED color changes) ? > > If not, can you run both OpenSMs in verbose mode (-V) and see > if there > is anything interesting/relevant in the logs ? > > -- Hal > > > Thanks > > Ganesh > > > > -- Hal > > > > > Thanks > > > Ganesh > > > > > > > > > ______________________________________________________________________ > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > From keshetti.mahesh at gmail.com Thu May 17 04:46:03 2007 From: keshetti.mahesh at gmail.com (Keshetti Mahesh) Date: Thu, 17 May 2007 17:16:03 +0530 Subject: [ofa-general] problem with loading IB modules in a IB node with OFED. 
In-Reply-To: <1179400360.23882.69514.camel@hal.voltaire.com> References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com> <200705161646.49910.minich@ornl.gov> <829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com> <1179398805.23882.67884.camel@hal.voltaire.com> <829ded920705170353n175e46d5x1dbeebd9db0006a1@mail.gmail.com> <1179400360.23882.69514.camel@hal.voltaire.com> Message-ID: <829ded920705170446y22509c22m6cd678adabad77a4@mail.gmail.com> > But... > > You need to find the proper header for ib_verbs.h as those mismatched > symbols are there. I'm sure it's on your machine somewhere (under > something like /usr/local/ofed). Also, it needs to be added into > Kconfig. There is a backport patch for the other pieces to build this > that went out on the list quite a while ago. > I have checked the /usr/local/ofed/ directory after the OFED-1.0 installation. All it has is the following directories. 'backup' -> which has the backup of previous IB modules. 'etc' -> configuration file 'lib' -> user space libraries 'lib64' -> 64 bit user space libraries uninstall.sh -> uninstallation script There are no IB headers in that directory as you have said. -- Thanks and regards, Mahesh. From halr at voltaire.com Thu May 17 05:25:58 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 17 May 2007 08:25:58 -0400 Subject: [ofa-general] problem with loading IB modules in a IB node with OFED.
In-Reply-To: <829ded920705170446y22509c22m6cd678adabad77a4@mail.gmail.com> References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com> <200705161646.49910.minich@ornl.gov> <829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com> <1179398805.23882.67884.camel@hal.voltaire.com> <829ded920705170353n175e46d5x1dbeebd9db0006a1@mail.gmail.com> <1179400360.23882.69514.camel@hal.voltaire.com> <829ded920705170446y22509c22m6cd678adabad77a4@mail.gmail.com> Message-ID: <1179404751.23882.74229.camel@hal.voltaire.com> On Thu, 2007-05-17 at 07:46, Keshetti Mahesh wrote: > > But... > > > > You need to find the proper header for ib_verbs.h as those mismatched > > symbols are there. I'm sure it's on your machine somewhere (under > > something like /usr/local/ofed). Also, it needs to be added into > > Kconfig. There is a backport patch for the other pieces to build this > > that went out on the list quite a while ago. > > > > I have checked the /usr/local/ofed/ directory after the OFED-1.0 installation. All it has is the following directories. > > 'backup' -> which has the backup of previous IB modules. > 'etc' -> configuration file > 'lib' -> user space libraries > 'lib64' -> 64 bit user space libraries > uninstall.sh -> uninstallation script > > There are no IB headers in that directory as you have said. Is there no include/infiniband dir there ?
-- Hal From devesh28 at gmail.com Thu May 17 05:28:45 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Thu, 17 May 2007 17:58:45 +0530 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <1179398534.23882.67542.camel@hal.voltaire.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com> <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com> <464B5C07.8040601@ichips.intel.com> <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com> <1179398534.23882.67542.camel@hal.voltaire.com> Message-ID: <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com> On 17 May 2007 06:42:16 -0400, Hal Rosenstock wrote: > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote: > > On 5/17/07, Sean Hefty wrote: > > > > But initially this will generate a packet for each path, while sys > > > > admin knows that path is there and he can hard-code the entries for > > > > it. Other thing is that why Admin will care about creating such record > > > > while SA is itself taking care, right? > > > > > > In your original message you asked about adding 'dummy entries' to the > > > cache. I agree that pre-loading the cache can be useful. What I still > > > am not understanding is the reasoning for adding 'dummy entries'. By > > > 'dummy entries', I've been assuming that these are invalid path records, > > > but maybe that's not what you meant. > > Ok if "dummy entries" word as such has created confusion then I am > > sorry for that, But with that I mean that, those are valid path > > records which Administrator knows in advance and while loading the > > module, > > How does the admin know they are valid ? Depending on the initial application runs, some trusted PRs can be generated. >Are they somehow preconfigured at the SM ? I am not sure about SM has any such provision? Also not sure about the role of SM in path resolving. 
I mean, once a node has initiated an SA query, does the SM answer the SA from some database of its own, or is the destination node contacted on the fly to get the requested path record? >Doesn't each SM have its own policy for generating valid PRs ? Ultimately a path record is in Path_Record object format, and the SA cache is going to store it in a fixed manner, so how does the generation policy matter? CMIIW. Also I am assuming a homogeneous cluster where certain parameters can be assumed to always be the same. >are these from a live SM and just loaded "out of band" to bypass/preclude the SA PR mechanism ? Maybe. > > -- Hal > > > Admin is loading this info in the cache with user command. > > > > > > > Another point I want to know is, > > > > When local_sa_cache module will be inserted? After SM comes up or > > > > Before SM comes up? > > > > > > It can occur either way. There is no restriction. The cache responds > > > to port up and GID in/out of service events to update itself. > > Do you mean cache module will start building cache only after Port is UP? > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on > > > > some node not on switch) then First Forced schedule_update() is > > > > wasted, and for the first application presence of cache is > > > > meaningless. Why not to keep cache effective right from the start? > > > > > > Pre-loading the cache with path records doesn't guarantee that those > > > paths are usable. If the SM has not come up, then the path records will > > > be unusable until the SM configures the subnet, plus there's no > > > guarantee that the remote endpoints specified by the paths are running. > > You mean there is no guarantee that even if SM is UP and we have some > > hard coded entries of path record corresponding to some node X, we are > > not sure that node X has actually come up or not? In that case > > actually that path resolving should fail if node has not come up, but > > with the hard coding still path will be resolved?
> > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms > > > when booting a large cluster. > > that's true. Also cache will get valid entries only if network is > > configured by SM otherwise every node SA will, possibly, drop SA > > packets. > > > > > > - Sean > > From keshetti.mahesh at gmail.com Thu May 17 05:54:52 2007 From: keshetti.mahesh at gmail.com (Keshetti Mahesh) Date: Thu, 17 May 2007 18:24:52 +0530 Subject: [ofa-general] problem with loading IB modules in a IB node with OFED.
In-Reply-To: <829ded920705170554u240cb1c8x33b03cdefb79e282@mail.gmail.com> References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com> <200705161646.49910.minich@ornl.gov> <829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com> <1179398805.23882.67884.camel@hal.voltaire.com> <829ded920705170353n175e46d5x1dbeebd9db0006a1@mail.gmail.com> <1179400360.23882.69514.camel@hal.voltaire.com> <829ded920705170446y22509c22m6cd678adabad77a4@mail.gmail.com> <1179404751.23882.74229.camel@hal.voltaire.com> <829ded920705170554u240cb1c8x33b03cdefb79e282@mail.gmail.com> Message-ID: <1179406748.23882.76340.camel@hal.voltaire.com> On Thu, 2007-05-17 at 08:54, Keshetti Mahesh wrote: > > Is there no include/infiniband dir there ? > > Yes there is a include/infiniband directory but its not the kernel > headers directory. > > The contents of that directory are > > [root at infini00 ofed]# ls /usr/local/ofed/include/infiniband/ > arch.h cm_abi.h cm.h driver.h kern-abi.h marshall.h opcode.h > sa.h sa-kern-abi.h verbs.h > > They are user space headers. Right, sorry. Is there some ofed src directory containing the kernel sources ? I forget how this part works. -- Hal From keshetti.mahesh at gmail.com Thu May 17 06:11:10 2007 From: keshetti.mahesh at gmail.com (Keshetti Mahesh) Date: Thu, 17 May 2007 18:41:10 +0530 Subject: [ofa-general] problem with loading IB modules in a IB node with OFED. 
In-Reply-To: <1179406748.23882.76340.camel@hal.voltaire.com> References: <829ded920705160500i1acef6fcpae90694f33fc5764@mail.gmail.com> <200705161646.49910.minich@ornl.gov> <829ded920705162157k7b28603yc02c921e2d15527c@mail.gmail.com> <1179398805.23882.67884.camel@hal.voltaire.com> <829ded920705170353n175e46d5x1dbeebd9db0006a1@mail.gmail.com> <1179400360.23882.69514.camel@hal.voltaire.com> <829ded920705170446y22509c22m6cd678adabad77a4@mail.gmail.com> <1179404751.23882.74229.camel@hal.voltaire.com> <829ded920705170554u240cb1c8x33b03cdefb79e282@mail.gmail.com> <1179406748.23882.76340.camel@hal.voltaire.com> Message-ID: <829ded920705170611o1925f837u74b495545bbce6f3@mail.gmail.com> > Right, sorry. > > Is there some ofed src directory containing the kernel sources ? I > forget how this part works. Yes the OFED source directory containing the kernel sources is present inside the OFED package. Even I use the headers in that package I am getting the same errors. But if I add my module in the OFED it is working fine. -- Thanks and regards, Mahesh. From changquing.tang at hp.com Thu May 17 07:50:37 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Thu, 17 May 2007 14:50:37 -0000 Subject: [ofa-general] Re: Re: OFED HA related question In-Reply-To: <20070517083044.GM6273@minantech.com> References: <1179145102.25749.11.camel@mtls03> <1179242042.25749.33.camel@mtls03> <349DCDA352EACF42A0C49FA6DCEA840301577697@G3W0634.americas.hpqcorp.net><349DCDA352EACF42A0C49FA6DCEA840301577F3F@G3W0634.americas.hpqcorp.net> <20070517062637.GL6273@minantech.com><20070517082603.GE4205@mellanox.co.il> <20070517083044.GM6273@minantech.com> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403015AD534@G3W0634.americas.hpqcorp.net> > -----Original Message----- > > > On Wed, May 16, 2007 at 07:03:13PM -0700, Roland Dreier wrote: > > > > > Also does ibv_dereg_mr() work when fatal error occurs ? > > > > > > > > It will probably fail but you should try to destroy all your > > > > resources I guess. 
> > > > > > > This is very good question. Application should be able to unpin > > > memory even if HCA is completely dead. > > > > This is only safe after you reset the HCA, otherwise it might be > > writing over this memory. > > > Right. Good point. So to recover memory after HCA failure > event we need to reset HCA and only after that unpin memory. > What current mthca driver does in case it cannot unregister > memory from HCA? Does it proceed to unpin it? Sorry, also how to reset HCA in application ? --CQ > > -- > Gleb. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From mhagen at iol.unh.edu Thu May 17 07:59:40 2007 From: mhagen at iol.unh.edu (mhagen at iol.unh.edu) Date: Thu, 17 May 2007 10:59:40 -0400 (EDT) Subject: [ofa-general] [PATCH] libibverbs: add userspace support for invalidate stag Message-ID: <39754.132.177.125.178.1179413980.squirrel@postal.iol.unh.edu> Modification to userspace verbs to provide support for the iWARP Verbs SEND with INV and SEND with SE and INV. 
Signed-off-by: Mikkel Hagen --- libibverbs/include/infiniband/verbs.h 2007-05-03 10:11:23.000000000 -0400 +++ libibverbs/include/infiniband/verbs.h 2007-05-03 10:12:32.000000000 -0400 @@ -492,7 +492,8 @@ enum ibv_send_flags { IBV_SEND_FENCE = 1 << 0, IBV_SEND_SIGNALED = 1 << 1, IBV_SEND_SOLICITED = 1 << 2, - IBV_SEND_INLINE = 1 << 3 + IBV_SEND_INLINE = 1 << 3, + IBV_SEND_INVALIDATE = 1 << 4 }; struct ibv_sge { @@ -525,6 +526,9 @@ struct ibv_send_wr { uint32_t remote_qpn; uint32_t remote_qkey; } ud; + struct { + uint32_t rkey; + } invalidate; } wr; }; --- libibverbs/include/infiniband/kern-abi.h 2007-05-03 10:36:13.000000000 -0400 +++ libibverbs/include/infiniband/kern-abi.h 2007-05-03 10:37:39.000000000 -0400 @@ -592,6 +592,10 @@ struct ibv_kern_send_wr { __u32 remote_qkey; __u32 reserved; } ud; + struct { + __u32 rkey; + __u32 reserved; + } invalidate; } wr; }; --- libibverbs/src/cmd.c 2007-05-02 05:00:25.000000000 -0400 +++ libibverbs/src/cmd.c 2007-05-04 15:19:36.000000000 -0400 @@ -857,6 +857,11 @@ int ibv_cmd_post_send(struct ibv_qp *ibq tmp->wr.atomic.swap = i->wr.atomic.swap; tmp->wr.atomic.rkey = i->wr.atomic.rkey; break; + case IBV_WR_SEND: + if(tmp->send_flags & IBV_SEND_INVALIDATE) { + tmp->wr.invalidate.rkey = + i->wr.invalidate.rkey; + } default: break; } -- Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 From steve.apo at googlemail.com Thu May 17 09:44:43 2007 From: steve.apo at googlemail.com (Steven Wooding) Date: Thu, 17 May 2007 17:44:43 +0100 Subject: [ofa-general] libibcm compatability problem Message-ID: <2cfcf21e0705170944h4ccddd52udef97b59d32452be@mail.gmail.com> Hi, I'm using a 2.6.20.1 kernel with OFED 1.1. I get the following message when running my application: "libibcm: Kernel ABi version 5 doesn't match library version 4". 
Could someone tell me what version of the library in terms of OFED release I should be using? Cheers, Steve. From rvm at obsidianresearch.com Thu May 17 08:45:48 2007 From: rvm at obsidianresearch.com (Rolf Manderscheid) Date: Thu, 17 May 2007 09:45:48 -0600 Subject: [ofa-general] [PATCH] IB/mthca: initialise GRH:HopLimit when building MLX headers Message-ID: Hi Roland, Global CM packets used by rdma_cm were being sent with a GRH:hopLimit of zero, causing them to be dropped by the router. The problem was a missing initialiser in mthca_read_ah (called by build_mlx_header). Signed-off-by: Rolf Manderscheid --- diff --git a/drivers/infiniband/hw/mthca/mthca_av.c b/drivers/infiniband/hw/mthca/mthca_av.c index 27caf3b..d952891 100644 --- a/drivers/infiniband/hw/mthca/mthca_av.c +++ b/drivers/infiniband/hw/mthca/mthca_av.c @@ -279,6 +279,7 @@ int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; header->grh.flow_label = ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + header->grh.hop_limit = ah->av->hop_limit; ib_get_cached_gid(&dev->ib_dev, be32_to_cpu(ah->av->port_pd) >> 24, ah->av->gid_index % dev->limits.gid_table_len, From mshefty at ichips.intel.com Thu May 17 09:53:29 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 17 May 2007 09:53:29 -0700 Subject: [ofa-general] libibcm compatability problem In-Reply-To: <2cfcf21e0705170944h4ccddd52udef97b59d32452be@mail.gmail.com> References: <2cfcf21e0705170944h4ccddd52udef97b59d32452be@mail.gmail.com> Message-ID: <464C8889.5090403@ichips.intel.com> > I'm using a 2.6.20.1 kernel with OFED 1.1. I get the > following message when running my application: > > "libibcm: Kernel ABi version 5 doesn't match library version 4". > > Could someone tell me what version of the library in terms of OFED > release I should be using? I'm not sure if OFED 1.1 supports 2.6.20. 
But you can try using the latest libibcm from here: http://www.openfabrics.org/~shefty/libibcm-1.0.tar.gz You can also look into using the librdmacm as an alternative, which supports other RDMA transports. - Sean From rdreier at cisco.com Thu May 17 10:19:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 May 2007 10:19:14 -0700 Subject: [ofa-general] ib_find_gid / ib_find_pkey In-Reply-To: <20070514045832.GA18615@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 May 2007 07:58:32 +0300") References: <20070514045832.GA18615@mellanox.co.il> Message-ID: OK, I applied the ib_find_gid / ib_find_pkey stuff with the following cleanup on top of it... mostly this is me being picky about indentation etc, but I think I did fix two bugs: a memory leak if registering with sysfs fails, and a NULL deref in ib_find_gid if index is NULL (as the comment says it may be). diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 3f2c619..c084495 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -149,13 +149,13 @@ static int alloc_name(char *name) return 0; } -static inline int start_port(struct ib_device *device) +static int start_port(struct ib_device *device) { return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; } -static inline int end_port(struct ib_device *device) +static int end_port(struct ib_device *device) { return (device->node_type == RDMA_NODE_IB_SWITCH) ? 
0 : device->phys_port_cnt; @@ -220,7 +220,6 @@ static int add_client_context(struct ib_device *device, struct ib_client *client return 0; } -/* read the lengths of pkey,gid tables on each port */ static int read_port_table_lengths(struct ib_device *device) { struct ib_port_attr *tprops = NULL; @@ -233,42 +232,33 @@ static int read_port_table_lengths(struct ib_device *device) num_ports = end_port(device) - start_port(device) + 1; - device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len * - num_ports, GFP_KERNEL); - if (!device->pkey_tbl_len) - goto out; - - device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len * - num_ports, GFP_KERNEL); - if (!device->gid_tbl_len) - goto err1; + device->pkey_tbl_len = kmalloc(sizeof *device->pkey_tbl_len * num_ports, + GFP_KERNEL); + device->gid_tbl_len = kmalloc(sizeof *device->gid_tbl_len * num_ports, + GFP_KERNEL); + if (!device->pkey_tbl_len || !device->gid_tbl_len) + goto err; for (port_index = 0; port_index < num_ports; ++port_index) { ret = ib_query_port(device, port_index + start_port(device), tprops); if (ret) - goto err2; + goto err; device->pkey_tbl_len[port_index] = tprops->pkey_tbl_len; - device->gid_tbl_len[port_index] = tprops->gid_tbl_len; + device->gid_tbl_len[port_index] = tprops->gid_tbl_len; } ret = 0; goto out; -err2: + +err: kfree(device->gid_tbl_len); -err1: kfree(device->pkey_tbl_len); out: kfree(tprops); return ret; } -static inline void free_port_table_lengths(struct ib_device *device) -{ - kfree(device->gid_tbl_len); - kfree(device->pkey_tbl_len); -} - /** * ib_register_device - Register an IB device with IB core * @device:Device to register @@ -311,6 +301,8 @@ int ib_register_device(struct ib_device *device) if (ret) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", device->name); + kfree(device->gid_tbl_len); + kfree(device->pkey_tbl_len); goto out; } @@ -352,7 +344,8 @@ void ib_unregister_device(struct ib_device *device) list_del(&device->core_list); - 
free_port_table_lengths(device); + kfree(device->gid_tbl_len); + kfree(device->pkey_tbl_len); mutex_unlock(&device_mutex); @@ -672,28 +665,26 @@ EXPORT_SYMBOL(ib_modify_port); * parameter may be NULL. */ int ib_find_gid(struct ib_device *device, union ib_gid *gid, - u8 *port_num, u16 *index) + u8 *port_num, u16 *index) { union ib_gid tmp_gid; - int ret, port, i, tbl_len; + int ret, port, i; for (port = start_port(device); port <= end_port(device); ++port) { - tbl_len = device->gid_tbl_len[port - start_port(device)]; - for (i = 0; i < tbl_len; ++i) { + for (i = 0; i < device->gid_tbl_len[port - start_port(device)]; ++i) { ret = ib_query_gid(device, port, i, &tmp_gid); if (ret) - goto out; + return ret; if (!memcmp(&tmp_gid, gid, sizeof *gid)) { *port_num = port; - *index = i; - ret = 0; - goto out; + if (index) + *index = i; + return 0; } } } - ret = -ENOENT; -out: - return ret; + + return -ENOENT; } EXPORT_SYMBOL(ib_find_gid); @@ -706,27 +697,24 @@ EXPORT_SYMBOL(ib_find_gid); * @index: The index into the PKey table where the PKey was found. 
*/ int ib_find_pkey(struct ib_device *device, - u8 port_num, u16 pkey, u16 *index) + u8 port_num, u16 pkey, u16 *index) { - int ret, i, tbl_len; + int ret, i; u16 tmp_pkey; - tbl_len = device->pkey_tbl_len[port_num - start_port(device)]; - for (i = 0; i < tbl_len; ++i) { + tbl_len = ; + for (i = 0; i < device->pkey_tbl_len[port_num - start_port(device)]; ++i) { ret = ib_query_pkey(device, port_num, i, &tmp_pkey); if (ret) - goto out; + return ret; if (pkey == tmp_pkey) { *index = i; - ret = 0; - goto out; + return 0; } } - ret = -ENOENT; -out: - return ret; + return -ENOENT; } EXPORT_SYMBOL(ib_find_pkey); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index a4ae080..0627a6a 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -890,6 +890,8 @@ struct ib_device { spinlock_t client_data_lock; struct ib_cache cache; + int *pkey_tbl_len; + int *gid_tbl_len; u32 flags; @@ -1043,8 +1045,6 @@ struct ib_device { __be64 node_guid; u8 node_type; u8 phys_port_cnt; - int *pkey_tbl_len; - int *gid_tbl_len; }; struct ib_client { @@ -1120,28 +1120,11 @@ int ib_modify_port(struct ib_device *device, u8 port_num, int port_modify_mask, struct ib_port_modify *port_modify); -/** - * ib_find_gid - Returns the port number and GID table index where - * a specified GID value occurs. - * @device: The device to query. - * @gid: The GID value to search for. - * @port_num: The port number of the device where the GID value was found. - * @index: The index into the GID table where the GID was found. This - * parameter may be NULL. - */ int ib_find_gid(struct ib_device *device, union ib_gid *gid, - u8 *port_num, u16 *index); + u8 *port_num, u16 *index); -/** - * ib_find_pkey - Returns the PKey table index where a specified - * PKey value occurs. - * @device: The device to query. - * @port_num: The port number of the device to search for the PKey. - * @pkey: The PKey value to search for. - * @index: The index into the PKey table where the PKey was found. 
- */ int ib_find_pkey(struct ib_device *device, - u8 port_num, u16 pkey, u16 *index); + u8 port_num, u16 pkey, u16 *index); /** * ib_alloc_pd - Allocates an unused protection domain. From rdreier at cisco.com Thu May 17 10:33:51 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 May 2007 10:33:51 -0700 Subject: [ofa-general] ib_find_gid / ib_find_pkey In-Reply-To: (Roland Dreier's message of "Thu, 17 May 2007 10:19:14 -0700") References: <20070514045832.GA18615@mellanox.co.il> Message-ID: Also applied P_Key reordering patch too... From rdreier at cisco.com Thu May 17 10:41:08 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 May 2007 10:41:08 -0700 Subject: [ofa-general] Re: [IPoIB][RFC] remove redundant gid query In-Reply-To: (James Lentini's message of "Tue, 8 May 2007 14:38:06 -0400 (EDT)") References: Message-ID: > Both ipoib_add_port() and ipoib_mcast_join_task() query the GID at > index 0 to setup the ipoib_dev_priv structure's local_gid and the > net_device structure's dev_addr. There does not appear to be a way for > ipoib_mcast_join_task() to be executed before ipoib_add_port() > completes. Therefore, the work done in ipoib_mcast_join_task() appears > to be redundant. It does look like we're doing some work we don't need to do. However ipoib_add_port() could run before an SM has brought up the local port, so the GID prefix might change later. I'm not sure what the best way to clean this up is. - R. From mst at dev.mellanox.co.il Thu May 17 10:45:19 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 May 2007 20:45:19 +0300 Subject: [ofa-general] ib_find_gid / ib_find_pkey In-Reply-To: References: <20070514045832.GA18615@mellanox.co.il> Message-ID: <20070517174519.GC22028@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] ib_find_gid / ib_find_pkey > > Also applied P_Key reordering patch too... OK. I think the next step is to get rid of ipoib_pkey_dev_check_presence and ipoib_pkey_poll in ipoib. 
This way we'll have one ULP clean of cache usage. Yosef? Another thing to do at this front, is make the pkey change event less intrusive: we should not need to kill connections and AHs because of pkey change: just cycling the QP through reset should be enough. -- MST From mst at dev.mellanox.co.il Thu May 17 10:58:30 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 May 2007 20:58:30 +0300 Subject: [ofa-general] [PATCHv2] IB/mthca: fix use-after-free In-Reply-To: <20070517081050.GC4205@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C90182527E@mtlexch01.mtl.com> <20070517081050.GC4205@mellanox.co.il> Message-ID: <20070517175830.GE22028@mellanox.co.il> From: Ali Ayoub Subject: [PATCH] IB/mthca: fix use-after-free Fix use-after-free on hardware restart. Signed-off-by: Michael S. Tsirkin --- Previous version would do NULL-pointer dereference if pci_get_drvdata returns NULL. BTW, when does this happen? diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 773145e..aa563e6 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -1250,12 +1250,14 @@ static void __mthca_remove_one(struct pci_dev *pdev) int __mthca_restart_one(struct pci_dev *pdev) { struct mthca_dev *mdev; + int hca_type; mdev = pci_get_drvdata(pdev); if (!mdev) return -ENODEV; + hca_type = mdev->hca_type; __mthca_remove_one(pdev); - return __mthca_init_one(pdev, mdev->hca_type); + return __mthca_init_one(pdev, hca_type); } static int __devinit mthca_init_one(struct pci_dev *pdev, -- MST From rdreier at cisco.com Thu May 17 11:32:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 May 2007 11:32:52 -0700 Subject: [ofa-general] Re: [PATCH 2.6.22] ehca: return proper error code if register_mr fails In-Reply-To: <200705161450.55848.hnguyen@linux.vnet.ibm.com> (Hoang-Nam Nguyen's message of "Wed, 16 May 2007 14:50:55 +0200") References: 
<200705161450.55848.hnguyen@linux.vnet.ibm.com> Message-ID: thanks, applied From rdreier at cisco.com Thu May 17 11:40:22 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 May 2007 11:40:22 -0700 Subject: [ofa-general] Re: [PATCHv2] IB/mthca: fix use-after-free In-Reply-To: <20070517175830.GE22028@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 17 May 2007 20:58:30 +0300") References: <6C2C79E72C305246B504CBA17B5500C90182527E@mtlexch01.mtl.com> <20070517081050.GC4205@mellanox.co.il> <20070517175830.GE22028@mellanox.co.il> Message-ID: Thanks, applied. > Previous version would do NULL-pointer dereference > if pci_get_drvdata returns NULL. BTW, when does this happen? I'm not positive why that check is there -- it dates back to Jack's original device restart patch. But I guess it's at least conceivable that an HCA is hot-unplugged after a catastrophic error but before the restart task runs. - R. From rdreier at cisco.com Thu May 17 11:51:07 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 May 2007 11:51:07 -0700 Subject: [ofa-general] Re: [PATCH] IB/mthca: initialise GRH:HopLimit when building MLX headers In-Reply-To: (Rolf Manderscheid's message of "Thu, 17 May 2007 09:45:48 -0600") References: Message-ID: thanks, applied. I also added the following patch, since I think mlx4 has the same bug. If you happen to have any ConnectX cards available, can you check this works too? commit c3f9fc8d912387837c65abf59e8cd0146b17589f Author: Roland Dreier Date: Thu May 17 11:49:55 2007 -0700 IB/mlx4: Set GRH:HopLimit when sending globally routed MADs This is the same issue discovered in mthca by Rolf Manderscheid . 
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 9c362fa..0cf8b95 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -952,6 +952,7 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr, (be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 20) & 0xff; sqp->ud_header.grh.flow_label = ah->av.sl_tclass_flowlabel & cpu_to_be32(0xfffff); + sqp->ud_header.grh.hop_limit = ah->av.hop_limit; ib_get_cached_gid(ib_dev, be32_to_cpu(ah->av.port_pd) >> 24, ah->av.gid_index, &sqp->ud_header.grh.source_gid); memcpy(sqp->ud_header.grh.destination_gid.raw, From rdreier at cisco.com Thu May 17 11:52:45 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 May 2007 11:52:45 -0700 Subject: [ofa-general] ib_find_gid / ib_find_pkey In-Reply-To: <20070517174519.GC22028@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 17 May 2007 20:45:19 +0300") References: <20070514045832.GA18615@mellanox.co.il> <20070517174519.GC22028@mellanox.co.il> Message-ID: > OK. I think the next step is to get rid of ipoib_pkey_dev_check_presence and > ipoib_pkey_poll in ipoib. This way we'll have one ULP clean of cache usage. > Yosef? Sounds good. Removing cache usage from SRP should be easy now too; I'll queue that for 2.6.23 when I get a chance. - R. From rdreier at cisco.com Thu May 17 13:03:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 May 2007 13:03:19 -0700 Subject: [ofa-general] IB/core: Enhance SMI for switch support In-Reply-To: <1179178372.4531.10975.camel@hal.voltaire.com> (Hal Rosenstock's message of "14 May 2007 17:32:52 -0400") References: <1179177711.4531.10290.camel@hal.voltaire.com> <1179178372.4531.10975.camel@hal.voltaire.com> Message-ID: > The risk is primarily on the switch side, rather than the CA/router > side, right ? So isn't the downside of this minimal ? 
Actually I was thinking that this disturbs the current SMI code and potentially introduces bugs (even with plain old CAs). And without any switch driver it's not clear if we want to take that risk. But OK, I'll queue this up for 2.6.23 and hope for the best... From halr at voltaire.com Thu May 17 13:10:32 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 17 May 2007 16:10:32 -0400 Subject: [ofa-general] IB/core: Enhance SMI for switch support In-Reply-To: References: <1179177711.4531.10290.camel@hal.voltaire.com> <1179178372.4531.10975.camel@hal.voltaire.com> Message-ID: <1179432631.23882.103465.camel@hal.voltaire.com> On Thu, 2007-05-17 at 16:03, Roland Dreier wrote: > > The risk is primarily on the switch side, rather than the CA/router > > side, right ? So isn't the downside of this minimal ? > > Actually I was thinking that this disturbs the current SMI code and > potentially introduces bugs (even with plain old CAs). I may be wrong but the main SMI change is returning IB_SMI_FORWARD rather than IB_SMI_SEND for two intermediate hop cases (C14-9:2 and C14-13:2) which should not apply to CA or router ports. > And without any switch driver it's not clear if we want to take that risk. Understood. > But OK, I'll queue this up for 2.6.23 and hope for the best... Thanks. -- Hal From steve.apo at googlemail.com Thu May 17 14:23:28 2007 From: steve.apo at googlemail.com (Steven Wooding) Date: Thu, 17 May 2007 22:23:28 +0100 Subject: [ofa-general] libibcm compatability problem In-Reply-To: <464C8889.5090403@ichips.intel.com> References: <2cfcf21e0705170944h4ccddd52udef97b59d32452be@mail.gmail.com> <464C8889.5090403@ichips.intel.com> Message-ID: <2cfcf21e0705171423m166f6d3axbae29dc70ed2a0eb@mail.gmail.com> Hi Sean, I tried to compile the latest libibcm library, but have got stuck on ./configure with the cpp failing its sanity check. I'm on Scientific Linux 4.3 (aka RHEL 4.3) and have gcc installed. 
Could I try and use the OFED 1.1 kernel drivers in 2.6.20 instead of the in-built kernel drivers? Or would it be better to install the latest subversion snapshot of the userspace libraries? It's a very strange installation we've ended up with from our supplier. They are using the latest kernel ib drivers whilst using OFED 1.0 userspace libraries. It seems that libibcm is the only library that has a compatibility problem. Finally, how much effort do you think it would take to convert a program that uses libibcm to librdmacm? Thanks very much for your advice Cheers, Steve. On 17/05/07, Sean Hefty wrote: > > > I'm using a 2.6.20.1 kernel with OFED 1.1. I get the > > following message when running my application: > > > > "libibcm: Kernel ABi version 5 doesn't match library version 4". > > > > Could someone tell me what version of the library in terms of OFED > > release I should be using? > > I'm not sure if OFED 1.1 supports 2.6.20. But you can try using the > latest libibcm from here: > > http://www.openfabrics.org/~shefty/libibcm-1.0.tar.gz > > You can also look into using the librdmacm as an alternative, which > supports other RDMA transports. > > - Sean > From rdreier at cisco.com Thu May 17 15:05:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 May 2007 15:05:33 -0700 Subject: [ofa-general] Re: [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR leak In-Reply-To: <20070517073017.GA4205@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 17 May 2007 10:30:28 +0300") References: <20070515210453.GL4161@mellanox.co.il> <20070516101457.GA5091@mellanox.co.il> <20070517073017.GA4205@mellanox.co.il> Message-ID: > Hmm, I would like to quote the spec *literally*. Maybe > - and then invoke a Destroy QP or Reset QP. That's fine ... in fact the spec has a bullet for that line too (I just looked). 
The way you have it formatted now is visually kind of strange (the last line looks odd): > + * - Put the QP in the Error State > + * - Wait for the Affiliated Asynchronous Last WQE Reached Event; > + * - either: > + * drain the CQ by invoking the Poll CQ verb and either wait for CQ > + * to be empty or the number of Poll CQ operations has exceeded > + * CQ capacity size; > + * - or > + * post another WR that completes on the same CQ and wait for this > + * WR to return as a WC; (NB: this is the option that we use) > + * and then invoke a Destroy QP or Reset QP. which doesn't really match the spec's formatting... I guess this is pretty minor, but I would write the comment as: * - Put the QP in the Error State; * - wait for the Affiliated Asynchronous Last WQE Reached Event; * - either: * - drain the CQ by invoking the Poll CQ verb and either wait for CQ * to be empty or the number of Poll CQ operations has exceeded * CQ capacity size; or * - post another WR that completes on the same CQ and wait for this * WR to return as a WC; * - and then invoke a Destroy QP or Reset QP. * * For IPoIB we choose the second option of posting another WR, and we * keep a dedicated QP in the error state for doing this. > > This actually seems like a good motivation for the mthca RESET -> > > ERROR fix. We could avoid the transition to INIT if we fixed mthca > > and mlx4, right? > > Yes. That was the motivation. OK, I just queued the mthca and mlx4 versions of the RESET->ERROR fix. So I guess you can drop the dummy transition to INIT within IPoIB for the final version of the WQE leakage patch. - R. 
From rdreier at cisco.com Thu May 17 15:15:06 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 May 2007 15:15:06 -0700 Subject: [ofa-general] Re: [PATCH] libibverbs/ibv_devinfo : Print the number of max_vl_num as a number In-Reply-To: <1178459202.16752.1.camel@mtldesk014.lab.mtl.com> (Dotan Barak's message of "Sun, 06 May 2007 16:46:42 +0300") References: <1178459202.16752.1.camel@mtldesk014.lab.mtl.com> Message-ID: Thanks, applied. From mst at dev.mellanox.co.il Thu May 17 15:21:39 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 18 May 2007 01:21:39 +0300 Subject: [ofa-general] Re: [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR leak In-Reply-To: References: <20070515210453.GL4161@mellanox.co.il> <20070516101457.GA5091@mellanox.co.il> <20070517073017.GA4205@mellanox.co.il> Message-ID: <20070517222139.GC29259@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH RFC/untested v0.5] IPoIB/CM: fix SRQ WR leak > > > Hmm, I would like to quote the spec *literally*. Maybe > > - and then invoke a Destroy QP or Reset QP. > > That's fine ... in fact the spec has a bullet for that line too (I > just looked). The way you have it formatted now is visually kind of > strange (the last line looks odd): > > > + * - Put the QP in the Error State > > + * - Wait for the Affiliated Asynchronous Last WQE Reached Event; > > + * - either: > > + * drain the CQ by invoking the Poll CQ verb and either wait for CQ > > + * to be empty or the number of Poll CQ operations has exceeded > > + * CQ capacity size; > > + * - or > > + * post another WR that completes on the same CQ and wait for this > > + * WR to return as a WC; (NB: this is the option that we use) > > + * and then invoke a Destroy QP or Reset QP. > > which doesn't really match the spec's formatting... 
I guess this is > pretty minor, but I would write the comment as: > > * - Put the QP in the Error State; > * - wait for the Affiliated Asynchronous Last WQE Reached Event; > * - either: > * - drain the CQ by invoking the Poll CQ verb and either wait for CQ > * to be empty or the number of Poll CQ operations has exceeded > * CQ capacity size; or > * - post another WR that completes on the same CQ and wait for this > * WR to return as a WC; > * - and then invoke a Destroy QP or Reset QP. > * > * For IPoIB we choose the second option of posting another WR, and we > * keep a dedicated QP in the error state for doing this. Yes, that's exactly what I did. > > > This actually seems like a good motivation for the mthca RESET -> > > > ERROR fix. We could avoid the transition to INIT if we fixed mthca > > > and mlx4, right? > > > > Yes. That was the motivation. > > OK, I just queued the mthca and mlx4 versions of the RESET->ERROR fix. > So I guess you can drop the dummy transition to INIT within IPoIB for > the final version of the WQE leakage patch. Yea. I'm just fixing it to work on top of the pkey change: ipoib_cm_dev_stop will need a flush flag, and I think this means I'll need a reap_list for RX connections like I do for TX: so I can just go over this list and destroy all QPs safely from inside the ipoib_wq. BTW, it seems that for 2.6.23 the right thing to do will be to merge TX and RX structures, and reuse some more of the code between the two. 
-- MST From rdreier at cisco.com Thu May 17 15:56:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 May 2007 15:56:47 -0700 Subject: [ofa-general] Re: [PATCH 2/2] IB/mlx4: pass more data from user to kernel In-Reply-To: <1179387217.25749.62.camel@mtls03> (Eli Cohen's message of "Thu, 17 May 2007 10:32:41 +0300") References: <1179387217.25749.62.camel@mtls03> Message-ID: This looks a little busted: > struct mlx4_ib_create_qp { > __u64 buf_addr; > __u64 db_addr; > + __u64 rq_size; > + __u64 sq_size; > + __u8 rcv_wqe_shift; > + __u8 log_wqe_bb; > }; the structure ends up not aligned to a multiple of 8 bytes, so it ends up having a size of 36 bytes on 32-bit setups and 40 bytes on 64-bit setups, which might mess up 32-bit userspace on 64-bit kernels. Also I don't understand why you made rq_size and sq_size 64 bits anyway? It seems they could never be more than 16 bits, although to be safe perhaps 32 bits is best. So I'll fix this up to look like this (with names that seem more self-documenting to me): struct mlx4_ib_create_qp { __u64 buf_addr; __u64 db_addr; __u32 rq_wqe_count; __u32 rq_wqe_shift; __u32 sq_wqebb_count; __u32 sq_wqebb_shift; }; am I missing some hidden reason to make those fields 64 bits? From rdreier at cisco.com Thu May 17 16:25:22 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 17 May 2007 16:25:22 -0700 Subject: [ofa-general] Re: [PATCH 2/2] IB/mlx4: pass more data from user to kernel In-Reply-To: (Roland Dreier's message of "Thu, 17 May 2007 15:56:47 -0700") References: <1179387217.25749.62.camel@mtls03> Message-ID: > struct mlx4_ib_create_qp { > __u64 buf_addr; > __u64 db_addr; > __u32 rq_wqe_count; > __u32 rq_wqe_shift; > __u32 sq_wqebb_count; > __u32 sq_wqebb_shift; > }; Actually, on second thought maybe it's cleaner just to pass the SQ information from user->kernel? There's not really anything that can go wrong with RQs, and it's probably safer not to have the same info passed in two different ways. 
maybe something like struct mlx4_ib_create_qp { __u64 buf_addr; __u64 db_addr; __u8 log_sq_stride; __u8 log_sq_bb_per_wqe; __u8 reserved[6]; }; and then use the RQ and SQ sizes and number of gather entries from the normal part of the command to figure the rest out? Any opinion on which is preferable? - R. From mst at dev.mellanox.co.il Thu May 17 20:40:28 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 18 May 2007 06:40:28 +0300 Subject: [ofa-general] Re: Re: [PATCH 2/2] IB/mlx4: pass more data from user to kernel In-Reply-To: References: <1179387217.25749.62.camel@mtls03> Message-ID: <20070518034028.GB4708@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: Re: [PATCH 2/2] IB/mlx4: pass more data from user to kernel > > > struct mlx4_ib_create_qp { > > __u64 buf_addr; > > __u64 db_addr; > > __u32 rq_wqe_count; > > __u32 rq_wqe_shift; > > __u32 sq_wqebb_count; > > __u32 sq_wqebb_shift; > > }; > > Actually, on second thought maybe it's cleaner just to pass the SQ > information from user->kernel? There's not really anything that can > go wrong with RQs, and it's probably safer not to have the same info > passed in two different ways. > > maybe something like > > struct mlx4_ib_create_qp { > __u64 buf_addr; > __u64 db_addr; > __u8 log_sq_stride; > __u8 log_sq_bb_per_wqe; > __u8 reserved[6]; > }; > > and then use the RQ and SQ sizes and number of gather entries from the > normal part of the command to figure the rest out? > > Any opinion on which is preferable? I'm OK with what you say about RQ, but replacing sq_wqebb_count with log_sq_bb_per_wqe looks like obfuscation to me: you still pass in 2 values, and the kernel does not actually care about number of SQ WRs at all, it really only needs to look at # of wqbbs. 
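The size arithmetic in this thread can be checked from userspace. What follows is only a sketch under stated assumptions: the struct names `create_qp_orig`, `create_qp_u32` and `create_qp_compact` and the helper `print_create_qp_sizes()` are made up for illustration, and `uint64_t`/`uint32_t`/`uint8_t` stand in for the kernel's `__u64`/`__u32`/`__u8`. The field lists are copied from the three layouts quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Layout from the original patch: 32 bytes of u64 fields plus two u8
 * fields, i.e. 34 bytes of data before tail padding. */
struct create_qp_orig {
	uint64_t buf_addr;
	uint64_t db_addr;
	uint64_t rq_size;
	uint64_t sq_size;
	uint8_t  rcv_wqe_shift;
	uint8_t  log_wqe_bb;
};

/* First proposed fix: four u32 fields, 32 bytes with no padding. */
struct create_qp_u32 {
	uint64_t buf_addr;
	uint64_t db_addr;
	uint32_t rq_wqe_count;
	uint32_t rq_wqe_shift;
	uint32_t sq_wqebb_count;
	uint32_t sq_wqebb_shift;
};

/* Compact variant: explicit reserved bytes pad the struct to 24 bytes. */
struct create_qp_compact {
	uint64_t buf_addr;
	uint64_t db_addr;
	uint8_t  log_sq_stride;
	uint8_t  log_sq_bb_per_wqe;
	uint8_t  reserved[6];
};

/* These two layouts have the same size whether u64 is aligned to 4 or 8. */
_Static_assert(sizeof(struct create_qp_u32) == 32, "u32 layout is 32 bytes");
_Static_assert(sizeof(struct create_qp_compact) == 24, "compact layout is 24 bytes");

/* Print the sizes as seen by the compiler and ABI in use. */
void print_create_qp_sizes(void)
{
	printf("orig:    %zu\n", sizeof(struct create_qp_orig));
	printf("u32:     %zu\n", sizeof(struct create_qp_u32));
	printf("compact: %zu\n", sizeof(struct create_qp_compact));
}
```

Calling `print_create_qp_sizes()` from a `main()` built once for a 64-bit target and once for 32-bit x86 should show the first struct changing size between the two builds (the 40 versus 36 bytes mentioned earlier in the thread), while the other two stay fixed. That difference is why a 32-bit userspace on a 64-bit kernel is a concern for the original layout.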
-- MST From steve.apo at googlemail.com Fri May 18 01:13:41 2007 From: steve.apo at googlemail.com (Steven Wooding) Date: Fri, 18 May 2007 09:13:41 +0100 Subject: [ofa-general] libibcm compatability problem In-Reply-To: <464C8889.5090403@ichips.intel.com> References: <2cfcf21e0705170944h4ccddd52udef97b59d32452be@mail.gmail.com> <464C8889.5090403@ichips.intel.com> Message-ID: <2cfcf21e0705180113u43ad5977rd5fef68793c5aaea@mail.gmail.com> Hi Sean, OK. I got it to compile, though when I ran it, it failed to create the listening CM (still looking into the exact error). Also I see that the function ib_cm_get_device has been removed. I was using this to monitor the file desriptor of the CM device. Could this function be put back into my local copy of libibcm or has this function been moved somewhere else in the code? Cheers, Steve. On 17/05/07, Sean Hefty wrote: > > > I'm using a 2.6.20.1 kernel with OFED 1.1. I get the > > following message when running my application: > > > > "libibcm: Kernel ABi version 5 doesn't match library version 4". > > > > Could someone tell me what version of the library in terms of OFED > > release I should be using? > > I'm not sure if OFED 1.1 supports 2.6.20.
But you can try using the > latest libibcm from here: > > http://www.openfabrics.org/~shefty/libibcm-1.0.tar.gz > > You can also look into using the librdmacm as an alternative, which > supports other RDMA transports. > > - Sean > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad at lists.openfabrics.org Fri May 18 02:41:25 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 18 May 2007 02:41:25 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070518-0200 daily build status Message-ID: <20070518094126.0D1F2E6082B@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on x86_64 
with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From halr at voltaire.com Fri May 18 03:21:05 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 18 May 2007 06:21:05 -0400 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com> <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com> <464B5C07.8040601@ichips.intel.com> <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com> <1179398534.23882.67542.camel@hal.voltaire.com> <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com> Message-ID: <1179483657.23882.158398.camel@hal.voltaire.com> On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote: > On 17 May 2007 06:42:16 -0400, Hal Rosenstock wrote: > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote: > > > On 5/17/07, Sean Hefty wrote: > > > > > But initially this will generate a packet for each path, while sys > > > > > admin knows that path is there and he can hard-code the entries for > > > > > it. Other thing is that why Admin will care about creating such record > > > > > while SA is itself taking care, right? 
> > > > > > > > In your original message you asked about adding 'dummy entries' to the > > > > cache. I agree that pre-loading the cache can be useful. What I still > > > > am not understanding is the reasoning for adding 'dummy entries'. By > > > > 'dummy entries', I've been assuming that these are invalid path records, > > > > but maybe that's not what you meant. > > > Ok if "dummy entries" word as such has created confusion then I am > > > sorry for that, But with that I mean that, those are valid path > > > records which Administrator knows in advance and while loading the > > > module, > > > > How does the admin know they are valid ? > Depending on the initial application runs, some trusted PRs can be generated. What do initial application runs have to do with this ? > >Are they somehow preconfigured at the SM ? > I am not sure about SM has any such provision? Not that I'm aware of. > Also not sure about the > role of SM in path resolving. I mean once node has initiated SA query, > whether SM has some database to reply SA or On the fly destination > node is contacted to get asked path recored? SMs can either calculate the SA PRs on the fly based on the routing algorithm in use and some other things or put them in a local database. This is up to that SM. Destination node is not contacted in the SA PR query process. > >Doesn't each SM have its own policy for generating valid PRs ? > Ultimately path record is in Path_Record object format, and SA cache > is going to store in a fixed manner, How generation policy matters? What if the local policy loaded does not agree with what the SM would generate for a particular PR ? One then gets a local error which will need to be tracked down. Not so easy IMO. > CMIIW. Also I am assuming a homogeneous cluster where certain > parameters can be assumed to be same always. and always in agreement with what the SM would return ? For example, what happens when a link goes down and the end node is no longer reachable ? 
> >are these from a live SM and just loaded "out of band" to > bypass/preclude the SA PR >mechanism ? > may be Even if they are, there is still the changes in the subnet issue. -- Hal > > -- Hal > > > > > Admin is loading this info in the cache with user command. > > > > > > > > > Another point I want to know is, > > > > > When local_sa_cache module will be inserted? After SM comes up or > > > > > Before SM comes up? > > > > > > > > It can occur either way. There is no restriction. The cache responds > > > > to port up and GID in/out of service events to update itself. > > > Do you mean cache module will start building cache only after Port is UP? > > > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on > > > > > some node not on switch) then First Forced schedule_update() is > > > > > waisted, and for the first application presence of cache is > > > > > meaningless. Why not to keep cache effective right from the start? > > > > > > > > Pre-loading the cache with path records doesn't guarantee that those > > > > paths are usable. If the SM has not come up, then the path records will > > > > be unusable until the SM configures the subnet, plus there's no > > > > guarantee that the remote endpoints specified by the paths are running. > > > You mean there is no guarantee that even if SM is UP and we have some > > > hard coded entries of path record corresponding to some node X, we are > > > not sure that node X has actually come up or not? In that case > > > actually that path resolving should fail if node has not come up, but > > > with the hard coding still path will be resolved? > > > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms > > > > when booting a large cluster. > > > that's true. Also cache will get valid entries only if network is > > > configured by SM otherwise every node SA will, possibly, drop SA > > > packets. 
> > > > > > > > - Sean > > > > > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From eli at dev.mellanox.co.il Fri May 18 05:26:14 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Fri, 18 May 2007 15:26:14 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/mlx4: pass more data from user to kernel In-Reply-To: References: <1179387217.25749.62.camel@mtls03> Message-ID: <4e6a6b3c0705180526j28355392l71a7f18ea25d69c6@mail.gmail.com> On 5/18/07, Roland Dreier wrote: > > This looks a little busted: > > > struct mlx4_ib_create_qp { > > __u64 buf_addr; > > __u64 db_addr; > > + __u64 rq_size; > > + __u64 sq_size; > > + __u8 rcv_wqe_shift; > > + __u8 log_wqe_bb; > > }; > > the structure ends up not aligned to a multiple of 8 bytes, so it ends > up having a size of 36 bytes on 32-bit setups and 40 bytes on 64-bit > setups, which might mess up 32-bit userspace on 64-bit kernels. > > Also I don't understand why you made rq_size and sq_size 64 bits > anyway? It seems they could never be more than 16 bits, although to > be safe perhaps 32 bits is best. So I'll fix this up to look like > this (with names that seem more self-documenting to me): > > struct mlx4_ib_create_qp { > __u64 buf_addr; > __u64 db_addr; > __u32 rq_wqe_count; > __u32 rq_wqe_shift; > __u32 sq_wqebb_count; > __u32 sq_wqebb_shift; > }; > > am I missing some hidden reason to make those fields 64 bits? Well, I was just not sure why the cast to uintptr_t was needed, so I decided to use 64-bit variables, which are obviously overkill.
_______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Fri May 18 06:12:54 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 18 May 2007 16:12:54 +0300 Subject: [ofa-general] [PATCH] IB/ipoib: fix error message Message-ID: <20070518131254.GJ4708@mellanox.co.il> Trivial error message fixup. Signed-off-by: Michael S. Tsirkin --- diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index e3b0937..c1aad06 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -470,7 +470,7 @@ int ipoib_ib_dev_open(struct net_device *dev) ret = ipoib_cm_dev_open(dev); if (ret) { - ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + ipoib_warn(priv, "ipoib_cm_dev_open returned %d\n", ret); ipoib_ib_dev_stop(dev, 1); return -1; } -- MST From mst at dev.mellanox.co.il Fri May 18 08:05:25 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 18 May 2007 18:05:25 +0300 Subject: [ofa-general] [PATCH RFC v1] IPoIB/CM: fix SRQ WR leak In-Reply-To: References: <20070515210453.GL4161@mellanox.co.il> <20070516101457.GA5091@mellanox.co.il> <20070517073017.GA4205@mellanox.co.il> Message-ID: <20070518150525.GL4708@mellanox.co.il> OK, here's a new version. I'm in the process of testing this, seems to work OK so far. I will let it run for the weekend. Signed-off-by: Michael S. Tsirkin --- Changes from v0.5: - comment fix to match spec better - added rx_reap_list and move RX from rx_drain_list to there after drain is done, so that we can destroy flushed connections from any thread just by going over this list. This is required since ipoib_cm_dev_stop can now (after pkey change) be called from ipoib_wq, so we can't wait for a task to run and clean the drained RX for us.
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 93d4a9a..69f1c25 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -132,12 +132,43 @@ struct ipoib_cm_data { __be32 mtu; }; +/* + * Quoting 10.3.1 Queue Pair and EE Context States: + * + * Note, for QPs that are associated with an SRQ, the Consumer should take the + * QP through the Error State before invoking a Destroy QP or a Modify QP to the + * Reset State. The Consumer may invoke the Destroy QP without first performing + * a Modify QP to the Error State and waiting for the Affiliated Asynchronous + * Last WQE Reached Event. However, if the Consumer does not wait for the + * Affiliated Asynchronous Last WQE Reached Event, then WQE and Data Segment + * leakage may occur. Therefore, it is good programming practice to tear down a + * QP that is associated with an SRQ by using the following process: + * + * - Put the QP in the Error State + * - Wait for the Affiliated Asynchronous Last WQE Reached Event; + * - either: + * drain the CQ by invoking the Poll CQ verb and either wait for CQ + * to be empty or the number of Poll CQ operations has exceeded + * CQ capacity size; + * - or + * post another WR that completes on the same CQ and wait for this + * WR to return as a WC; (NB: this is the option that we use) + * - and then invoke a Destroy QP or Reset QP. 
+ */ + +enum ipoib_cm_state { + IPOIB_CM_RX_LIVE, + IPOIB_CM_RX_ERROR, /* Ignored by stale task */ + IPOIB_CM_RX_FLUSH /* Last WQE Reached event observed */ +}; + struct ipoib_cm_rx { struct ib_cm_id *id; struct ib_qp *qp; struct list_head list; struct net_device *dev; unsigned long jiffies; + enum ipoib_cm_state state; }; struct ipoib_cm_tx { @@ -165,10 +196,16 @@ struct ipoib_cm_dev_priv { struct ib_srq *srq; struct ipoib_cm_rx_buf *srq_ring; struct ib_cm_id *id; - struct list_head passive_ids; + struct ib_qp *rx_drain_qp; /* generates WR described in 10.3.1 */ + struct list_head passive_ids; /* state: LIVE */ + struct list_head rx_error_list; /* state: ERROR */ + struct list_head rx_flush_list; /* state: FLUSH, drain not started */ + struct list_head rx_drain_list; /* state: FLUSH, drain started */ + struct list_head rx_reap_list; /* state: FLUSH, drain done */ struct work_struct start_task; struct work_struct reap_task; struct work_struct skb_task; + struct work_struct rx_reap_task; struct delayed_work stale_task; struct sk_buff_head skb_queue; struct list_head start_list; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index eec833b..46121cf 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -62,6 +62,17 @@ struct ipoib_cm_id { u32 remote_mtu; }; +static struct ib_qp_attr ipoib_cm_err_attr = { + .qp_state = IB_QPS_ERR +}; + +#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff + +static struct ib_send_wr ipoib_cm_rx_drain_wr = { + .wr_id = IPOIB_CM_RX_DRAIN_WRID, + .opcode = IB_WR_SEND +}; + static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); @@ -150,11 +161,44 @@ partial_error: return NULL; } +static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv* priv) +{ + struct ib_send_wr *bad_send_wr; + + /* rx_drain_qp send queue depth is 1, so + * make sure we have at most 1 outstanding WR. 
*/ + if (list_empty(&priv->cm.rx_flush_list) || + !list_empty(&priv->cm.rx_drain_list)) + return; + + if (ib_post_send(priv->cm.rx_drain_qp, &ipoib_cm_rx_drain_wr, &bad_send_wr)) + ipoib_warn(priv, "failed to post rx_drain wr\n"); + + list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list); +} + +static void ipoib_cm_rx_event_handler(struct ib_event *event, void *ctx) +{ + struct ipoib_cm_rx *p = ctx; + struct ipoib_dev_priv *priv = netdev_priv(p->dev); + unsigned long flags; + + if (event->event != IB_EVENT_QP_LAST_WQE_REACHED) + return; + + spin_lock_irqsave(&priv->lock, flags); + list_move(&p->list, &priv->cm.rx_flush_list); + p->state = IPOIB_CM_RX_FLUSH; + ipoib_cm_start_rx_drain(priv); + spin_unlock_irqrestore(&priv->lock, flags); +} + static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev, struct ipoib_cm_rx *p) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr attr = { + .event_handler = ipoib_cm_rx_event_handler, .send_cq = priv->cq, /* does not matter, we never send anything */ .recv_cq = priv->cq, .srq = priv->cm.srq, @@ -256,6 +300,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even cm_id->context = p; p->jiffies = jiffies; + p->state = IPOIB_CM_RX_LIVE; spin_lock_irq(&priv->lock); if (list_empty(&priv->cm.passive_ids)) queue_delayed_work(ipoib_workqueue, @@ -277,7 +322,6 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, { struct ipoib_cm_rx *p; struct ipoib_dev_priv *priv; - int ret; switch (event->event) { case IB_CM_REQ_RECEIVED: @@ -289,20 +333,9 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, case IB_CM_REJ_RECEIVED: p = cm_id->context; priv = netdev_priv(p->dev); - spin_lock_irq(&priv->lock); - if (list_empty(&p->list)) - ret = 0; /* Connection is going away already. 
*/ - else { - list_del_init(&p->list); - ret = -ECONNRESET; - } - spin_unlock_irq(&priv->lock); - if (ret) { - ib_destroy_qp(p->qp); - kfree(p); - return ret; - } - return 0; + if (ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE)) + ipoib_warn(priv, "unable to move qp to error state\n"); + /* Fall through */ default: return 0; } @@ -354,8 +387,15 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) wr_id, wc->status); if (unlikely(wr_id >= ipoib_recvq_size)) { - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", - wr_id, ipoib_recvq_size); + if (wr_id == IPOIB_CM_RX_DRAIN_WRID) { + spin_lock_irqsave(&priv->lock, flags); + list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); + ipoib_cm_start_rx_drain(priv); + queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); + spin_unlock_irqrestore(&priv->lock, flags); + } else + ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", + wr_id, ipoib_recvq_size); return; } @@ -374,9 +414,9 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { spin_lock_irqsave(&priv->lock, flags); p->jiffies = jiffies; - /* Move this entry to list head, but do - * not re-add it if it has been removed. */ - if (!list_empty(&p->list)) + /* Move this entry to list head, but do not re-add it + * if it has been moved out of list. 
*/ + if (p->state == IPOIB_CM_RX_LIVE) list_move(&p->list, &priv->cm.passive_ids); spin_unlock_irqrestore(&priv->lock, flags); } @@ -583,17 +623,41 @@ static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr) int ipoib_cm_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr qp_init_attr = { + .send_cq = priv->cq, + .recv_cq = priv->cq, /* does not matter, we never get anything */ + .srq = priv->cm.srq, /* does not matter, we never get anything */ + .cap.max_send_wr = 1, + .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ + .sq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UC, + }; int ret; if (!IPOIB_CM_SUPPORTED(dev->dev_addr)) return 0; + priv->cm.rx_drain_qp = ib_create_qp(priv->pd, &qp_init_attr); + if (IS_ERR(priv->cm.rx_drain_qp)) { + printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); + ret = PTR_ERR(priv->cm.rx_drain_qp); + return ret; + } + + /* We put the QP in error state directly: this way, hardware + * will immediately generate WC for each WR we post, without + * sending anything on the wire. 
*/ + ret = ib_modify_qp(priv->cm.rx_drain_qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) { + ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret); + goto err_qp; + } + priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev); if (IS_ERR(priv->cm.id)) { printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); ret = PTR_ERR(priv->cm.id); - priv->cm.id = NULL; - return ret; + goto err_cm; } ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num), @@ -601,35 +665,79 @@ int ipoib_cm_dev_open(struct net_device *dev) if (ret) { printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name, IPOIB_CM_IETF_ID | priv->qp->qp_num); - ib_destroy_cm_id(priv->cm.id); - priv->cm.id = NULL; - return ret; + goto err_listen; } + return 0; + +err_listen: + ib_destroy_cm_id(priv->cm.id); +err_cm: + priv->cm.id = NULL; +err_qp: + ib_destroy_qp(priv->cm.rx_drain_qp); + return ret; } void ipoib_cm_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_cm_rx *p; + struct ipoib_cm_rx *p, *n; + unsigned long begin; + LIST_HEAD(list); + int ret; if (!IPOIB_CM_SUPPORTED(dev->dev_addr) || !priv->cm.id) return; ib_destroy_cm_id(priv->cm.id); priv->cm.id = NULL; + spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); - list_del_init(&p->list); + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; + spin_unlock_irq(&priv->lock); + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + spin_lock_irq(&priv->lock); + } + + /* Wait for all RX to be drained */ + begin = jiffies; + + while (!list_empty(&priv->cm.rx_error_list) || + !list_empty(&priv->cm.rx_flush_list) || + !list_empty(&priv->cm.rx_drain_list)) { + if (time_after(jiffies, begin + 5 * HZ)) { + ipoib_warn(priv, "RX drain timing out\n");
+ + /* + * assume the HW is wedged and just free up everything. + */ + list_splice_init(&priv->cm.rx_flush_list, &list); + list_splice_init(&priv->cm.rx_error_list, &list); + list_splice_init(&priv->cm.rx_drain_list, &list); + break; + } spin_unlock_irq(&priv->lock); + msleep(1); + spin_lock_irq(&priv->lock); + } + + list_splice_init(&priv->cm.rx_reap_list, &list); + + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(p, n, &list, list) { ib_destroy_cm_id(p->id); ib_destroy_qp(p->qp); kfree(p); - spin_lock_irq(&priv->lock); } - spin_unlock_irq(&priv->lock); + ib_destroy_qp(priv->cm.rx_drain_qp); cancel_delayed_work(&priv->cm.stale_task); } @@ -1079,24 +1187,44 @@ void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb, queue_work(ipoib_workqueue, &priv->cm.skb_task); } +static void ipoib_cm_rx_reap(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, + cm.rx_reap_task); + struct ipoib_cm_rx *p, *n; + LIST_HEAD(list); + + spin_lock_irq(&priv->lock); + list_splice_init(&priv->cm.rx_reap_list, &list); + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(p, n, &list, list) { + ib_destroy_cm_id(p->id); + ib_destroy_qp(p->qp); + kfree(p); + } +} + static void ipoib_cm_stale_task(struct work_struct *work) { struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, cm.stale_task.work); struct ipoib_cm_rx *p; + int ret; spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { - /* List if sorted by LRU, start from tail, + /* List is sorted by LRU, start from tail, * stop when we see a recently used entry */ p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; - list_del_init(&p->list); + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; spin_unlock_irq(&priv->lock); - ib_destroy_cm_id(p->id); - ib_destroy_qp(p->qp); - kfree(p); + ret = ib_modify_qp(p->qp, 
&ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); spin_lock_irq(&priv->lock); } @@ -1164,9 +1292,14 @@ int ipoib_cm_dev_init(struct net_device *dev) INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); INIT_LIST_HEAD(&priv->cm.start_list); + INIT_LIST_HEAD(&priv->cm.rx_error_list); + INIT_LIST_HEAD(&priv->cm.rx_flush_list); + INIT_LIST_HEAD(&priv->cm.rx_drain_list); + INIT_LIST_HEAD(&priv->cm.rx_reap_list); INIT_WORK(&priv->cm.start_task, ipoib_cm_tx_start); INIT_WORK(&priv->cm.reap_task, ipoib_cm_tx_reap); INIT_WORK(&priv->cm.skb_task, ipoib_cm_skb_reap); + INIT_WORK(&priv->cm.rx_reap_task, ipoib_cm_rx_reap); INIT_DELAYED_WORK(&priv->cm.stale_task, ipoib_cm_stale_task); skb_queue_head_init(&priv->cm.skb_queue); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index e3b0937..c1aad06 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -470,7 +470,7 @@ int ipoib_ib_dev_open(struct net_device *dev) ret = ipoib_cm_dev_open(dev); if (ret) { - ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + ipoib_warn(priv, "ipoib_cm_dev_open returned %d\n", ret); ipoib_ib_dev_stop(dev, 1); return -1; } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 7912526..982eb88 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -173,7 +173,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) size = ipoib_sendq_size + ipoib_recvq_size + 1; ret = ipoib_cm_dev_init(dev); if (!ret) - size += ipoib_recvq_size; + size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */; priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { -- MST From mshefty at ichips.intel.com Fri May 18 08:55:49 2007 From: mshefty at ichips.intel.com 
(Sean Hefty) Date: Fri, 18 May 2007 08:55:49 -0700 Subject: [ofa-general] libibcm compatability problem In-Reply-To: <2cfcf21e0705180113u43ad5977rd5fef68793c5aaea@mail.gmail.com> References: <2cfcf21e0705170944h4ccddd52udef97b59d32452be@mail.gmail.com> <464C8889.5090403@ichips.intel.com> <2cfcf21e0705180113u43ad5977rd5fef68793c5aaea@mail.gmail.com> Message-ID: <464DCC85.6090308@ichips.intel.com> > Also I see that the function ib_cm_get_device has been removed. I was > using this to monitor the file descriptor of the CM device. Could this > function be put back into my local copy of libibcm or has this function > been moved somewhere else in the code? The fd is exposed directly by walking ib_cm_id->device->fd. Btw, to respond to your other email, converting from the ib_cm to the rdma_cm shouldn't be overly difficult. You get fewer CM-related events, QP transitions done for you, use of actual path records, and the ability to reference nodes using an IP address. There are some example programs in the librdmacm/example directory if you want to take a quick look at what the code would look like. The IB device is acquired dynamically though, so depending on how you allocate resources in your code, you may need some rework in this area. - Sean From rvm at obsidianresearch.com Fri May 18 09:56:36 2007 From: rvm at obsidianresearch.com (Rolf Manderscheid) Date: Fri, 18 May 2007 10:56:36 -0600 Subject: [ofa-general] Re: [PATCH] IB/mthca: initialise GRH:HopLimit when building MLX headers In-Reply-To: References: Message-ID: <464DDAC4.305@obsidianresearch.com> Roland Dreier wrote: > If you happen to have any ConnectX cards available, can you check this works too? > We don't have any, but we are getting some. I'll report back after they show up. 
Rolf From rdreier at cisco.com Fri May 18 14:51:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 18 May 2007 14:51:13 -0700 Subject: [ofa-general] Re: [PATCH 2/2] IB/mlx4: pass more data from user to kernel In-Reply-To: <20070518034028.GB4708@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 18 May 2007 06:40:28 +0300") References: <1179387217.25749.62.camel@mtls03> <20070518034028.GB4708@mellanox.co.il> Message-ID: > I'm OK with what you say about RQ, but replacing sq_wqebb_count with log_sq_bb_per_wqe > looks like obfuscation to me: you still pass in 2 values, and the kernel does > not actually care about number of SQ WRs at all, it really only needs to look at > # of wqbbs. Makes sense... how about: struct mlx4_ib_create_qp { __u64 buf_addr; __u64 db_addr; __u8 log_sq_stride; __u8 log_sq_bb_count; __u8 reserved[6]; }; From sean.hefty at intel.com Fri May 18 15:14:07 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 18 May 2007 15:14:07 -0700 Subject: [ofa-general] [RFC] [PATCH 0/3] 2.6.23: basic support for IB routers Message-ID: <000401c79999$dad2e9d0$b4d4180a@amr.corp.intel.com> Re-sending - typo in mailing list name... I'd like to get feedback about incorporating the following changes to support IB routers into 2.6.23. The goal of the patches is to allow for IB router development and prototyping within the current framework of IBA. The changes themselves are fairly minimal, but based on the following concepts: * Routing data is maintained by the local SA. No assumption is made regarding how the SA obtains routing information. The SA is only expected to respond to cross subnet PR queries by providing a path to the local router. This matches the behavior in opensm. * A ULP connecting to a remote subnet provides path information about both subnets. For now the implementation simply assumes that the properties of the remote path match that of the local path. This allows the active side CM to properly format the CM REQ. 
* If the SLID/DLID values in the CM REQ are set to the permissive LID, then the passive side CM uses the SLID/DLID/SL values from the received CM REQ LRH to configure the passive side QP. This is done to meet C9-54 without requiring communication with the remote SA, but I should note that this behavior is non-compliant. These changes were tested by establishing a connection and transferring data between two IB subnets connected by an Obsidian router. These patches are also available in the ib_router branch of my rdma-dev.git tree. The tree is based on 2.6.21, so it includes a couple of additional patches that were already pushed for 2.6.22. Signed-off-by: Sean Hefty From sean.hefty at intel.com Fri May 18 15:15:36 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 18 May 2007 15:15:36 -0700 Subject: [ofa-general] [RFC] [PATCH 1/3] 2.6.23: IB/CM: add support for paths with hop_limit > 1 In-Reply-To: <000401c79999$dad2e9d0$b4d4180a@amr.corp.intel.com> Message-ID: <000501c7999a$0fb2d520$b4d4180a@amr.corp.intel.com> Paths with hop_limit > 1 indicate that the connection will be routed between IB subnets. To support routed connections, the ib_cm requires two paths: one from the active side to the active side router, and a second from the passive side to the passive side router. Modify the ib_cm interface to support multiple paths, and format the CM REQ message with the correct passive side information. 
Signed-off-by: Sean Hefty --- drivers/infiniband/core/cm.c | 50 ++++++++++++++++++++++++------------------ include/rdma/ib_cm.h | 5 ++++ 2 files changed, 33 insertions(+), 22 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 842cd0b..1e2010e 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -877,6 +877,8 @@ static void cm_format_req(struct cm_req_msg *req_msg, struct cm_id_private *cm_id_priv, struct ib_cm_req_param *param) { + struct ib_sa_path_rec *pri_path, *alt_path; + cm_format_mad_hdr(&req_msg->hdr, CM_REQ_ATTR_ID, cm_form_tid(cm_id_priv, CM_MSG_SEQUENCE_REQ)); @@ -900,33 +902,37 @@ static void cm_format_req(struct cm_req_msg *req_msg, cm_req_set_max_cm_retries(req_msg, param->max_cm_retries); cm_req_set_srq(req_msg, param->srq); - req_msg->primary_local_lid = param->primary_path->slid; - req_msg->primary_remote_lid = param->primary_path->dlid; - req_msg->primary_local_gid = param->primary_path->sgid; - req_msg->primary_remote_gid = param->primary_path->dgid; - cm_req_set_primary_flow_label(req_msg, param->primary_path->flow_label); - cm_req_set_primary_packet_rate(req_msg, param->primary_path->rate); - req_msg->primary_traffic_class = param->primary_path->traffic_class; - req_msg->primary_hop_limit = param->primary_path->hop_limit; - cm_req_set_primary_sl(req_msg, param->primary_path->sl); - cm_req_set_primary_subnet_local(req_msg, 1); /* local only... */ + pri_path = (param->primary_path->hop_limit <= 1) ? 
+ param->primary_path : &param->primary_path[1]; + req_msg->primary_local_lid = pri_path->slid; + req_msg->primary_remote_lid = pri_path->dlid; + req_msg->primary_local_gid = pri_path->sgid; + req_msg->primary_remote_gid = pri_path->dgid; + cm_req_set_primary_flow_label(req_msg, pri_path->flow_label); + cm_req_set_primary_packet_rate(req_msg, pri_path->rate); + req_msg->primary_traffic_class = pri_path->traffic_class; + req_msg->primary_hop_limit = pri_path->hop_limit; + cm_req_set_primary_sl(req_msg, pri_path->sl); + cm_req_set_primary_subnet_local(req_msg, (pri_path->hop_limit <= 1)); cm_req_set_primary_local_ack_timeout(req_msg, - min(31, param->primary_path->packet_life_time + 1)); + min(31, pri_path->packet_life_time + 1)); if (param->alternate_path) { - req_msg->alt_local_lid = param->alternate_path->slid; - req_msg->alt_remote_lid = param->alternate_path->dlid; - req_msg->alt_local_gid = param->alternate_path->sgid; - req_msg->alt_remote_gid = param->alternate_path->dgid; + alt_path = (param->alternate_path->hop_limit <= 1) ? + param->alternate_path : &param->alternate_path[1]; + req_msg->alt_local_lid = alt_path->slid; + req_msg->alt_remote_lid = alt_path->dlid; + req_msg->alt_local_gid = alt_path->sgid; + req_msg->alt_remote_gid = alt_path->dgid; cm_req_set_alt_flow_label(req_msg, - param->alternate_path->flow_label); - cm_req_set_alt_packet_rate(req_msg, param->alternate_path->rate); - req_msg->alt_traffic_class = param->alternate_path->traffic_class; - req_msg->alt_hop_limit = param->alternate_path->hop_limit; - cm_req_set_alt_sl(req_msg, param->alternate_path->sl); - cm_req_set_alt_subnet_local(req_msg, 1); /* local only... 
*/ + alt_path->flow_label); + cm_req_set_alt_packet_rate(req_msg, alt_path->rate); + req_msg->alt_traffic_class = alt_path->traffic_class; + req_msg->alt_hop_limit = alt_path->hop_limit; + cm_req_set_alt_sl(req_msg, alt_path->sl); + cm_req_set_alt_subnet_local(req_msg, (alt_path->hop_limit <= 1)); cm_req_set_alt_local_ack_timeout(req_msg, - min(31, param->alternate_path->packet_life_time + 1)); + min(31, alt_path->packet_life_time + 1)); } if (param->private_data && param->private_data_len) diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index 5c07017..f715ba5 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -347,6 +347,11 @@ struct ib_cm_compare_data { int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, struct ib_cm_compare_data *compare_data); +/* + * If hop_limit > 1 or reversible = 0, then primary/alternate path fields + * point to an array of paths. The first path is relative to the active + * side, and the second path is relative to the passive side. + */ struct ib_cm_req_param { struct ib_sa_path_rec *primary_path; struct ib_sa_path_rec *alternate_path; From sean.hefty at intel.com Fri May 18 15:17:18 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 18 May 2007 15:17:18 -0700 Subject: [ofa-general] [RFC] [PATCH 2/3] 2.6.23: IB/cm: Modify passive side to use LIDs from LRH for routed connections In-Reply-To: <000501c7999a$0fb2d520$b4d4180a@amr.corp.intel.com> Message-ID: <000601c7999a$4cb74140$b4d4180a@amr.corp.intel.com> To support inter-subnet connections, the passive endpoint needs to use its subnet local LIDs. The LIDs carried in the REQ are currently the LIDs from the active subnet (SLID and router LID). Replace LIDs in the REQ with subnet local LIDs from LRH. 
Signed-off-by: Sean Hefty --- drivers/infiniband/core/cm.c | 29 +++++++++++++++++++++++++++++ 1 files changed, 29 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 1e2010e..4d30e49 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1345,6 +1345,34 @@ out: return listen_cm_id_priv; } +/* + * Work-around for inter-subnet connections. If the LIDs are permissive, + * we need to override the LID/SL data in the REQ with the LID information + * in the work completion. + */ +static void cm_process_routed_req(struct cm_req_msg *req_msg, struct ib_wc *wc) +{ + if (!cm_req_get_primary_subnet_local(req_msg)) { + if (req_msg->primary_local_lid == IB_LID_PERMISSIVE) { + req_msg->primary_local_lid = cpu_to_be16(wc->slid); + cm_req_set_primary_sl(req_msg, wc->sl); + } + + if (req_msg->primary_remote_lid == IB_LID_PERMISSIVE) + req_msg->primary_remote_lid = cpu_to_be16(wc->dlid_path_bits); + } + + if (!cm_req_get_alt_subnet_local(req_msg)) { + if (req_msg->alt_local_lid == IB_LID_PERMISSIVE) { + req_msg->alt_local_lid = cpu_to_be16(wc->slid); + cm_req_set_alt_sl(req_msg, wc->sl); + } + + if (req_msg->alt_remote_lid == IB_LID_PERMISSIVE) + req_msg->alt_remote_lid = cpu_to_be16(wc->dlid_path_bits); + } +} + static int cm_req_handler(struct cm_work *work) { struct ib_cm_id *cm_id; @@ -1385,6 +1413,7 @@ static int cm_req_handler(struct cm_work *work) cm_id_priv->id.service_id = req_msg->service_id; cm_id_priv->id.service_mask = __constant_cpu_to_be64(~0ULL); + cm_process_routed_req(req_msg, work->mad_recv_wc->wc); cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]); ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av); if (ret) { From sean.hefty at intel.com Fri May 18 15:18:37 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 18 May 2007 15:18:37 -0700 Subject: [ofa-general] [RFC] [PATCH 3/3] 2.6.23: RDMA/cma: Add support for routed paths In-Reply-To: 
<000601c7999a$4cb74140$b4d4180a@amr.corp.intel.com> Message-ID: <000701c7999a$7bf62250$b4d4180a@amr.corp.intel.com> In order to support IB-to-IB routers, we need to provide path information about the remote subnet to the ib_cm. For now, we simply copy our local path information, but use permissive LIDs in place of the actual, remote LIDs. This indicates to the remote ib_cm that it should use the LIDs/SL data from the LRH received with CM REQ in place of the actual data carried in the REQ message when configuring the remote QP. Signed-off-by: Sean Hefty --- drivers/infiniband/core/cma.c | 17 ++++++++++++++++- 1 files changed, 16 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index fde92ce..430f104 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -2170,7 +2170,19 @@ static int cma_connect_ib(struct rdma_id_private *id_priv, goto out; req.private_data = private_data; - req.primary_path = &route->path_rec[0]; + if (route->path_rec[0].hop_limit > 1) { + req.primary_path = kmalloc(sizeof *req.primary_path * 2, + GFP_ATOMIC); + if (!req.primary_path) { + ret = -ENOMEM; + goto out; + } + req.primary_path[0] = route->path_rec[0]; + req.primary_path[1] = route->path_rec[0]; + req.primary_path[1].slid = IB_LID_PERMISSIVE; + req.primary_path[1].dlid = IB_LID_PERMISSIVE; + } else + req.primary_path = &route->path_rec[0]; if (route->num_paths == 2) req.alternate_path = &route->path_rec[1]; @@ -2190,6 +2202,9 @@ static int cma_connect_ib(struct rdma_id_private *id_priv, req.srq = id_priv->srq ? 
1 : 0; ret = ib_send_cm_req(id_priv->cm_id.ib, &req); + + if (route->path_rec[0].hop_limit > 1) + kfree(req.primary_path); out: if (ret && !IS_ERR(id_priv->cm_id.ib)) { ib_destroy_cm_id(id_priv->cm_id.ib); From vlad at lists.openfabrics.org Sat May 19 02:39:56 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 19 May 2007 02:39:56 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070519-0200 daily build status Message-ID: <20070519093957.5001CE60826@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed 
on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From yosefe at voltaire.com Sun May 20 01:07:00 2007 From: yosefe at voltaire.com (Yosef Etigin) Date: Sun, 20 May 2007 11:07:00 +0300 Subject: [ofa-general] ib_find_gid / ib_find_pkey In-Reply-To: <20070517174519.GC22028@mellanox.co.il> References: <20070514045832.GA18615@mellanox.co.il> <20070517174519.GC22028@mellanox.co.il> Message-ID: <465001A4.2020502@voltaire.com> Michael S. Tsirkin wrote: >>Quoting Roland Dreier : >>Subject: Re: [ofa-general] ib_find_gid / ib_find_pkey >> >>Also applied P_Key reordering patch too... > > > OK. I think the next step is to get rid of ipoib_pkey_dev_check_presence and > ipoib_pkey_poll in ipoib. This way we'll have one ULP clean of cache usage. > Yosef? > I'll rebase the patch on the last version of pkey patch and repost. 
> Another thing to do at this front, is make the pkey change event > less intrusive: we should not need to kill connections and AHs > because of pkey change: just cycling the QP through reset should be > enough. > like modify->reset and call ipoib_init_qp? From mst at dev.mellanox.co.il Sun May 20 01:11:08 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 20 May 2007 11:11:08 +0300 Subject: [ofa-general] Re: ib_find_gid / ib_find_pkey In-Reply-To: <465001A4.2020502@voltaire.com> References: <20070514045832.GA18615@mellanox.co.il> <20070517174519.GC22028@mellanox.co.il> <465001A4.2020502@voltaire.com> Message-ID: <20070520081108.GB16863@mellanox.co.il> > Quoting Yosef Etigin : > Subject: Re: ib_find_gid / ib_find_pkey > > Michael S. Tsirkin wrote: > >>Quoting Roland Dreier : > >>Subject: Re: [ofa-general] ib_find_gid / ib_find_pkey > >> > >>Also applied P_Key reordering patch too... > > > > > > OK. I think the next step is to get rid of ipoib_pkey_dev_check_presence and > > ipoib_pkey_poll in ipoib. This way we'll have one ULP clean of cache usage. > > Yosef? > > > I'll rebase the patch on the last version of pkey patch and repost. > > > Another thing to do at this front, is make the pkey change event > > less intrusive: we should not need to kill connections and AHs > > because of pkey change: just cycling the QP through reset should be > > enough. > > > > like modify->reset and call ipoib_init_qp? Careful: you must ->ERR and flush posted WRs. 
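To make the ordering concrete, here is a rough standalone model of the cycle (move to ERR, let the posted WRs flush, only then reset and re-init). It is a toy sketch and not ipoib or verbs code: every type and function name below is made up for illustration, and the model simply refuses a reset while WRs are still outstanding so that the ordering requirement is visible.

```c
#include <assert.h>

/* Toy model only: hypothetical names, not the ipoib or verbs API. */
enum toy_qp_state { TOY_QP_RESET, TOY_QP_RTS, TOY_QP_ERR };

struct toy_qp {
    enum toy_qp_state state;
    unsigned posted;   /* WRs posted and not yet completed */
    unsigned flushed;  /* WRs that completed with a flush error */
};

/* Moving to ERR makes every outstanding WR complete in error. */
static void toy_qp_to_err(struct toy_qp *qp)
{
    qp->state = TOY_QP_ERR;
    qp->flushed += qp->posted;
    qp->posted = 0;
}

/* Model choice: refuse a reset while WRs are outstanding. Returns 0 or -1. */
static int toy_qp_to_reset(struct toy_qp *qp)
{
    if (qp->posted)
        return -1;
    qp->state = TOY_QP_RESET;
    return 0;
}

/* Cycle the QP: ERR first, flush, then reset and bring it back up. */
static int toy_qp_cycle(struct toy_qp *qp)
{
    toy_qp_to_err(qp);
    /* Real code would wait here until all flush completions are reaped. */
    assert(qp->posted == 0);
    if (toy_qp_to_reset(qp))
        return -1;
    /* Stands in for re-running the QP init sequence (e.g. with a new pkey). */
    qp->state = TOY_QP_RTS;
    return 0;
}
```

With three WRs posted, calling toy_qp_to_reset() directly is rejected by the model, while toy_qp_cycle() accounts for all three as flushed before the QP comes back up.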
-- MST From vlad at lists.openfabrics.org Sun May 20 02:41:53 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 20 May 2007 02:41:53 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070520-0200 daily build status Message-ID: <20070520094154.0C836E6082A@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.17 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.12 Passed on ppc64 
with linux-2.6.14 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From kliteyn at dev.mellanox.co.il Sun May 20 03:17:07 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 20 May 2007 13:17:07 +0300 Subject: [ofa-general] [PATCH] osm: up/dn optimization - improved ranking In-Reply-To: <20070516194919.GO19271@sashak.voltaire.com> References: <464B1D41.8080905@dev.mellanox.co.il> <20070516194919.GO19271@sashak.voltaire.com> Message-ID: <46502023.20703@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 18:03 Wed 16 May , Yevgeny Kliteynik wrote: >> Hi Hal, >> >> This patch optimizes fabric ranking similar to the fat-tree ranking. >> All the root switches are marked with rank and added to the BFS list, >> and only then ranking of rest of the fabric begins. >> >> Please apply to master. >> >> Signed-off-by: Yevgeny Kliteynik >> --- > > Basically looks good. However couple comments below. > >> opensm/opensm/osm_ucast_updn.c | 66 >> +++++++++++++++++---------------------- >> 1 files changed, 29 insertions(+), 37 deletions(-) >> >> diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c >> index 5cebd9b..9574216 100644 >> --- a/opensm/opensm/osm_ucast_updn.c >> +++ b/opensm/opensm/osm_ucast_updn.c >> @@ -408,53 +408,49 @@ Exit : >> /* rank is a SWITCH for BFS purpose */ >> static int >> updn_subn_rank( >> - IN uint64_t root_guid, >> - IN uint8_t base_rank, >> + IN uint32_t num_guids, > > 'num_guids' should not be fixed-size integer just compiler friendly > 'unsigned' is fine. NP. 
>> + IN uint64_t* guid_list, >> IN updn_t* p_updn ) >> { >> osm_switch_t *p_sw; >> - uint32_t rank = base_rank; >> osm_physp_t *p_physp, *p_remote_physp; >> cl_qlist_t list; >> cl_status_t did_cause_update; >> struct updn_node *u, *remote_u; >> uint8_t num_ports, port_num; >> osm_log_t *p_log = &p_updn->p_osm->log; >> + uint32_t idx = 0; > > Ditto. NP >> OSM_LOG_ENTER( p_log, updn_subn_rank ); >> + cl_qlist_init(&list); >> >> - p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, >> cl_hton64(root_guid)); >> - if(!p_sw) >> - { >> - osm_log( p_log, OSM_LOG_ERROR, >> - "updn_subn_rank: ERR AA05: " >> - "Root switch GUID 0x%" PRIx64 " not found\n", root_guid ); >> - OSM_LOG_EXIT( p_log ); >> - return 1; >> - } >> - >> - osm_log( p_log, OSM_LOG_VERBOSE, >> - "updn_subn_rank: " >> - "Ranking starts from GUID 0x%" PRIx64 "\n", root_guid ); >> - >> - u = p_sw->priv; >> - u->is_root = 1; >> + /* Rank all the roots and add them to list */ >> >> - /* Rank the first guid chosen anyway since it's the base rank */ >> - osm_log( p_log, OSM_LOG_DEBUG, >> - "updn_subn_rank: " >> - "Ranking port GUID 0x%" PRIx64 "\n", root_guid ); >> + for (idx = 0; idx < num_guids; idx++) >> + { >> + /* Apply the ranking for each guid given by user - bypass illegal ones >> */ >> + p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, >> cl_hton64(guid_list[idx])); >> + if(!p_sw) >> + { >> + osm_log( p_log, OSM_LOG_ERROR, >> + "updn_subn_rank: ERR AA05: " >> + "Root switch GUID 0x%" PRIx64 " not found\n", >> guid_list[idx] ); >> + continue; >> + } >> >> - __updn_update_rank(u, rank); >> + u = p_sw->priv; >> + u->is_root = 1; > > Now when root switches are always ranked first 'is_root' field is not > needed anymore, (!u->rank) answers this. 
Agree >> - cl_qlist_init(&list); >> - cl_qlist_insert_tail(&list, &u->list); >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "updn_subn_rank: " >> + "Ranking root port GUID 0x%" PRIx64 "\n", guid_list[idx] ); >> + __updn_update_rank(u, 0); >> + cl_qlist_insert_tail(&list, &u->list); >> + } >> >> /* BFS the list till it's empty */ >> while (!cl_is_qlist_empty(&list)) >> { >> - rank++; >> - >> u = (struct updn_node *)cl_qlist_remove_head(&list); >> /* Go over all remote nodes and rank them (if not already visited) */ >> p_sw = u->sw; >> @@ -483,7 +479,7 @@ updn_subn_rank( >> { >> remote_u = p_remote_physp->p_node->sw->priv; >> port_guid = p_remote_physp->port_guid; >> - did_cause_update = __updn_update_rank(remote_u, rank); >> + did_cause_update = __updn_update_rank(remote_u, u->rank+1); >> >> osm_log( p_log, OSM_LOG_DEBUG, >> "updn_subn_rank: " >> @@ -500,8 +496,8 @@ updn_subn_rank( >> /* Print Summary of ranking */ >> osm_log( p_log, OSM_LOG_VERBOSE, >> "updn_subn_rank: " >> - "Rank Info :\n\t Root Guid = 0x%" PRIx64 "\n\t Max Node Rank = >> %d\n", >> - root_guid, rank ); >> + "Subnet ranking completed. Max Node Rank = %d\n", >> + remote_u->rank ); > > 'remote_u' can be not initialized here. Another issue is that it can be > initialized but to remote switch which has lower than max rank (when > did_cause_update = 0). Right, good catch. I'll issue a new patch. Thanks. -- Yevgeny > The rest is fine. > > Sasha > From kliteyn at dev.mellanox.co.il Sun May 20 04:26:28 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 20 May 2007 14:26:28 +0300 Subject: [ofa-general] [PATCHv2] osm: up/dn optimization - improved ranking Message-ID: <46503064.7010107@dev.mellanox.co.il> Hi Hal, This patch optimizes fabric ranking similar to the fat-tree ranking. All the root switches are marked with rank and added to the BFS list, and only then ranking of rest of the fabric begins. This version of the patch is updated in accordance with Sasha's suggestions. 
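For readers who want the idea without the OpenSM types, a minimal standalone sketch of "rank all roots first, then BFS" could look like the following. This is not the OpenSM code: the adjacency-matrix fabric, the toy_ names and the fixed node limit are assumptions made only for illustration. The maximum rank is kept with an explicit comparison rather than taken from the last node updated.

```c
#include <assert.h>
#include <limits.h>

#define TOY_MAX_NODES 16
#define TOY_UNRANKED  UINT_MAX

/* Hypothetical toy fabric: switches are indices, links are a matrix. */
struct toy_fabric {
    unsigned n;                                       /* number of switches */
    unsigned char adj[TOY_MAX_NODES][TOY_MAX_NODES];  /* 1 if linked */
    unsigned rank[TOY_MAX_NODES];
};

/* Rank every switch by BFS distance from its nearest root; return max rank. */
static unsigned toy_rank_fabric(struct toy_fabric *f,
                                const unsigned *roots, unsigned num_roots)
{
    unsigned queue[TOY_MAX_NODES];
    unsigned head = 0, tail = 0, max_rank = 0, i;

    assert(f->n <= TOY_MAX_NODES);
    for (i = 0; i < f->n; i++)
        f->rank[i] = TOY_UNRANKED;

    /* Rank all the roots and queue them before touching anything else. */
    for (i = 0; i < num_roots; i++) {
        if (roots[i] >= f->n || f->rank[roots[i]] == 0)
            continue;               /* bypass illegal or repeated roots */
        f->rank[roots[i]] = 0;
        queue[tail++] = roots[i];
    }

    /* BFS the queue till it's empty; each switch is queued at most once. */
    while (head < tail) {
        unsigned u = queue[head++];

        for (i = 0; i < f->n; i++) {
            if (!f->adj[u][i] || f->rank[i] != TOY_UNRANKED)
                continue;
            f->rank[i] = f->rank[u] + 1;
            if (f->rank[i] > max_rank)
                max_rank = f->rank[i];
            queue[tail++] = i;
        }
    }
    return max_rank;
}
```

On a five-switch chain with both end switches given as roots, the middle switch gets rank 2 and its neighbours rank 1, which is the "similar to fat-tree" behaviour described above.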
Please apply to master. Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_ucast_updn.c | 80 +++++++++++++++++---------------------- 1 files changed, 35 insertions(+), 45 deletions(-) diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c index 5cebd9b..95a0622 100644 --- a/opensm/opensm/osm_ucast_updn.c +++ b/opensm/opensm/osm_ucast_updn.c @@ -93,7 +93,6 @@ struct updn_node { osm_switch_t *sw; updn_switch_dir_t dir; unsigned rank; - unsigned is_root; unsigned visited; }; @@ -111,15 +110,13 @@ __updn_get_dir( IN unsigned cur_rank, IN unsigned rem_rank, IN uint64_t cur_guid, - IN uint64_t rem_guid, - IN unsigned cur_is_root, - IN unsigned rem_is_root ) + IN uint64_t rem_guid ) { /* HACK: comes to solve root nodes connection, in a classic subnet root nodes do not connect directly, but in case they are we assign to root node an UP direction to allow UPDN to discover the subnet correctly (and not from the point of view of the last root node). */ - if (cur_is_root && rem_is_root) + if (!cur_rank && !rem_rank) return UP; if (cur_rank < rem_rank) @@ -215,8 +212,7 @@ __updn_bfs_by_node( rem_u = p_remote_sw->priv; /* Decide which direction to mark it (UP/DOWN) */ next_dir = __updn_get_dir(u->rank, rem_u->rank, - current_guid, remote_guid, - u->is_root, rem_u->is_root); + current_guid, remote_guid); /* Check if this is a legal step : the only illegal step is going from DOWN to UP */ @@ -408,53 +404,48 @@ Exit : /* rank is a SWITCH for BFS purpose */ static int updn_subn_rank( - IN uint64_t root_guid, - IN uint8_t base_rank, + IN unsigned num_guids, + IN uint64_t* guid_list, IN updn_t* p_updn ) { osm_switch_t *p_sw; - uint32_t rank = base_rank; osm_physp_t *p_physp, *p_remote_physp; cl_qlist_t list; cl_status_t did_cause_update; struct updn_node *u, *remote_u; uint8_t num_ports, port_num; osm_log_t *p_log = &p_updn->p_osm->log; + unsigned idx = 0; + unsigned max_rank = 0; OSM_LOG_ENTER( p_log, updn_subn_rank ); + cl_qlist_init(&list); - p_sw = 
osm_get_switch_by_guid(&p_updn->p_osm->subn, cl_hton64(root_guid)); - if(!p_sw) - { - osm_log( p_log, OSM_LOG_ERROR, - "updn_subn_rank: ERR AA05: " - "Root switch GUID 0x%" PRIx64 " not found\n", root_guid ); - OSM_LOG_EXIT( p_log ); - return 1; - } - - osm_log( p_log, OSM_LOG_VERBOSE, - "updn_subn_rank: " - "Ranking starts from GUID 0x%" PRIx64 "\n", root_guid ); - - u = p_sw->priv; - u->is_root = 1; + /* Rank all the roots and add them to list */ - /* Rank the first guid chosen anyway since it's the base rank */ - osm_log( p_log, OSM_LOG_DEBUG, - "updn_subn_rank: " - "Ranking port GUID 0x%" PRIx64 "\n", root_guid ); - - __updn_update_rank(u, rank); + for (idx = 0; idx < num_guids; idx++) + { + /* Apply the ranking for each guid given by user - bypass illegal ones */ + p_sw = osm_get_switch_by_guid(&p_updn->p_osm->subn, cl_hton64(guid_list[idx])); + if(!p_sw) + { + osm_log( p_log, OSM_LOG_ERROR, + "updn_subn_rank: ERR AA05: " + "Root switch GUID 0x%" PRIx64 " not found\n", guid_list[idx] ); + continue; + } - cl_qlist_init(&list); - cl_qlist_insert_tail(&list, &u->list); + u = p_sw->priv; + osm_log( p_log, OSM_LOG_DEBUG, + "updn_subn_rank: " + "Ranking root port GUID 0x%" PRIx64 "\n", guid_list[idx] ); + __updn_update_rank(u, 0); + cl_qlist_insert_tail(&list, &u->list); + } /* BFS the list till it's empty */ while (!cl_is_qlist_empty(&list)) { - rank++; - u = (struct updn_node *)cl_qlist_remove_head(&list); /* Go over all remote nodes and rank them (if not already visited) */ p_sw = u->sw; @@ -483,7 +474,7 @@ updn_subn_rank( { remote_u = p_remote_physp->p_node->sw->priv; port_guid = p_remote_physp->port_guid; - did_cause_update = __updn_update_rank(remote_u, rank); + did_cause_update = __updn_update_rank(remote_u, u->rank+1); osm_log( p_log, OSM_LOG_DEBUG, "updn_subn_rank: " @@ -492,7 +483,10 @@ updn_subn_rank( remote_u->rank ); if (did_cause_update) + { cl_qlist_insert_tail(&list, &remote_u->list); + max_rank = remote_u->rank; + } } } } @@ -500,8 +494,8 @@ 
updn_subn_rank( /* Print Summary of ranking */ osm_log( p_log, OSM_LOG_VERBOSE, "updn_subn_rank: " - "Rank Info :\n\t Root Guid = 0x%" PRIx64 "\n\t Max Node Rank = %d\n", - root_guid, rank ); + "Subnet ranking completed. Max Node Rank = %d\n", + max_rank ); OSM_LOG_EXIT( p_log ); return 0; } @@ -566,7 +560,6 @@ __osm_subn_calc_up_down_min_hop_table( IN uint64_t* guid_list, IN updn_t* p_updn ) { - uint32_t idx = 0; int status; OSM_LOG_ENTER( &p_updn->p_osm->log, osm_subn_calc_up_down_min_hop_table ); @@ -593,11 +586,8 @@ __osm_subn_calc_up_down_min_hop_table( goto _exit; } - for (idx = 0; idx < num_guids; idx++) - { - /* Apply the ranking for each guid given by user - bypass illegal ones */ - updn_subn_rank(guid_list[idx], 0, p_updn); - } + /* Rank the subnet switches */ + updn_subn_rank(num_guids, guid_list, p_updn); /* After multiple ranking need to set Min Hop Table by UpDn algorithm */ osm_log( &p_updn->p_osm->log, OSM_LOG_VERBOSE, -- 1.5.1.4 From mst at dev.mellanox.co.il Sun May 20 06:44:41 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 20 May 2007 16:44:41 +0300 Subject: [ofa-general] IB/cm: bug in stale connection detection logic? Message-ID: <20070520134441.GI20649@mellanox.co.il> Sean, Roland, pls comment on the following 2 questions: 1. 
I see this in cm_match_req:

	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
	if (!timewait_info)
		timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
	if (timewait_info) {
		cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
					   timewait_info->work.remote_id);
		cm_cleanup_timewait(cm_id_priv->timewait_info);
		spin_unlock_irqrestore(&cm.lock, flags);
		if (cur_cm_id_priv) {
			cm_dup_req_handler(work, cur_cm_id_priv);
			cm_deref_id(cur_cm_id_priv);
		} else
			cm_issue_rej(work->port, work->mad_recv_wc,
				     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
				     NULL, 0);

Note that cm_get_id is passed data from timewait_info and not from the
request: thus, if the QPN in the request matches the QPN of an existing
connection, this is mis-detected as a duplicate request even if the IDs do
not match; thus, the request is dropped or a "duplicate" reject is sent
instead of a "stale connection" reject.

Am I missing something?

Suggestion: Why is an extra call to cm_get_id required to detect a duplicate?
Shouldn't we just

	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
	if (timewait_info) {
		/* handle duplicate */
		return;
	}
	timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
	if (timewait_info) {
		/* handle stale */
		return;
	}
	/* not a duplicate and not a stale connection */

2. Another question: cm_dup_req_handler does this:

	/* Quick state check to discard duplicate REQs. */
	if (cm_id_priv->id.state == IB_CM_REQ_RCVD)
		return;

Why is this code here? IB_CM_REQ_RCVD is an ephemeral state, going to
IB_CM_REP_SENT immediately. Why are duplicate REQs discarded? Should not the
REP be re-sent? See 12.9.6 COMMUNICATION ESTABLISHMENT - PASSIVE

The spec also says:
	The general rule for handling any input message that is received
	while in an ephemeral state is to hold that message pending until
	the CM protocol enters a non-ephemeral state.

So this code looks wrong. What am I missing?
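To make the suggested ordering concrete, here is a minimal sketch in plain C. It is not the ib_cm code: the connection table is a flat array standing in for the CM's lookup structures, and `struct conn`, `enum req_class` and `classify_req()` are hypothetical names invented for this illustration. Keying both lookups on the remote CA GUID as well is also an assumption of the sketch. It only shows the two-stage decision: a matching remote ID is treated as a duplicate REQ, and only when the ID is new does a matching remote QPN mark the old connection as stale.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an entry in the CM lookup tables. */
struct conn {
	uint64_t remote_ca_guid;
	uint32_t remote_id;	/* remote communication ID */
	uint32_t remote_qpn;	/* remote QP number */
};

enum req_class { REQ_NEW, REQ_DUPLICATE, REQ_STALE };

/*
 * Classify an incoming REQ against the known connections.
 * Stage 1 looks only at the remote ID, stage 2 only at the remote QPN,
 * so a QPN match can never be reported as a duplicate.
 */
enum req_class classify_req(const struct conn *table, size_t n,
			    const struct conn *req)
{
	size_t i;

	/* Stage 1: same remote ID means the REQ was retransmitted. */
	for (i = 0; i < n; i++)
		if (table[i].remote_ca_guid == req->remote_ca_guid &&
		    table[i].remote_id == req->remote_id)
			return REQ_DUPLICATE;

	/* Stage 2: new ID but a known QPN means the old connection is stale. */
	for (i = 0; i < n; i++)
		if (table[i].remote_ca_guid == req->remote_ca_guid &&
		    table[i].remote_qpn == req->remote_qpn)
			return REQ_STALE;

	return REQ_NEW;
}
```

With this split, a REQ that reuses a QPN under a different communication ID always takes the stale path, which is the case the combined lookup above can misreport.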
-- 
MST

From sashak at voltaire.com Sun May 20 09:10:34 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 20 May 2007 19:10:34 +0300
Subject: [ofa-general] Re: [PATCHv2] osm: up/dn optimization - improved ranking
In-Reply-To: <46503064.7010107@dev.mellanox.co.il>
References: <46503064.7010107@dev.mellanox.co.il>
Message-ID: <20070520161034.GY19271@sashak.voltaire.com>

Hi Yevgeny,

On 14:26 Sun 20 May , Yevgeny Kliteynik wrote:
> Hi Hal,
>
> This patch optimizes fabric ranking similar to the fat-tree ranking.
> All the root switches are marked with rank and added to the BFS list,
> and only then ranking of rest of the fabric begins.
> This version of the patch is updated in accordance with Sasha's suggestions.
>
> Please apply to master.
>
> Signed-off-by: Yevgeny Kliteynik
> ---

Looks fine to me. Nice optimization. Thanks.

I guess there is still an issue with the max_rank calculation (details are
below), which affects only the log message, and for me it is OK to fix it in
an incremental patch.

> opensm/opensm/osm_ucast_updn.c | 80
> +++++++++++++++++----------------------
> 1 files changed, 35 insertions(+), 45 deletions(-)
>
> diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c
> index 5cebd9b..95a0622 100644
> --- a/opensm/opensm/osm_ucast_updn.c
> +++ b/opensm/opensm/osm_ucast_updn.c

[snip...]

> @@ -483,7 +474,7 @@ updn_subn_rank(
>  {
>  remote_u = p_remote_physp->p_node->sw->priv;
>  port_guid = p_remote_physp->port_guid;
> - did_cause_update = __updn_update_rank(remote_u, rank);
> + did_cause_update = __updn_update_rank(remote_u, u->rank+1);
>
>  osm_log( p_log, OSM_LOG_DEBUG,
>  "updn_subn_rank: "
> @@ -492,7 +483,10 @@ updn_subn_rank(
>  remote_u->rank );
>
>  if (did_cause_update)
> + {
>  cl_qlist_insert_tail(&list, &remote_u->list);
> + max_rank = remote_u->rank;
> + }

I think this is still not accurate.
For instance, with a topology like:

	A <-> B <-> C <-> D <-> E

where the roots are A and E, we will get max_rank = 1, which obviously should
be 2. Probably we need something like this instead:

	if (did_cause_update)
		cl_qlist_insert_tail(&list, &remote_u->list);
	if (remote_u->rank <= u->rank + 1)
		max_rank = remote_u->rank;

(and after such an intervention in the rank updating technique we may also
want to remove the __updn_update_rank() function)

And again, this nit affects only the value reported in the log message (and
just removing this log message can be an option too :)) and doesn't touch the
optimization itself - good stuff, Yevgeny!

Sasha

From ianjiang.ict at gmail.com Sun May 20 17:49:18 2007
From: ianjiang.ict at gmail.com (Ian Jiang)
Date: Mon, 21 May 2007 08:49:18 +0800
Subject: [ofa-general] [SRPT]multiple initiators supported?
In-Reply-To: <7b2fa1820705150012l743b817fn1eefeaca290789a@mail.gmail.com>
References: <7b2fa1820705120247t1b232345w8373bb72416c5b28@mail.gmail.com> <4648948F.5000802@mellanox.com> <7b2fa1820705150012l743b817fn1eefeaca290789a@mail.gmail.com>
Message-ID: <7b2fa1820705201749t5cb9cee2gb336412cbc88e958@mail.gmail.com>

(1) I found where my problem was. I should not define DEBUG_TM in the
Makefile of Generic SCSI target mid-level for Linux (SCST).

Reference from README of scst-0.9.5.2:

 - DEBUG_TM - turns on task management functions debugging, when on
   LUN 0 in the "Default" group some of the commands will be delayed
   for about 60 sec., so making the remote initiator send TM functions,
   eg ABORT TASK and TARGET RESET. Also set TM_DBG_GO_OFFLINE symbol in
   the Makefile to 1 if you want that the device eventually become
   completely unresponsive, or to 0 otherwise to circle around ABORTs
   and RESETs code. Needs DEBUG turned on.

(2) Likely a bug in SRPT: the CM ID must not be destroyed in
srpt_release_channel(). If it is, and an initiator disconnects from the
target, the following connection request will fail, because no CM ID exists
at that moment.
So the line ib_destroy_cm_id(ch->cm_id); in srpt_release_channel() of ib_srpt.c should be removed. And the CM ID would be destroyed only when the entire SRP target is removed. On 5/15/07, Ian Jiang wrote: > Hi Vu, > Thanks for your replay. But I have got something wrong when using two > initiators. > > Two initiators and one target are at three separated nodes. The first > initiator connected to the target correctly. However, the second one > was aborted 1 minute after its login, and then required to > *reset_host*, but it failed to send the CM Connect Request when trying > to reconnect to the target. > > Here are the logs of the second initiator: > > May 15 13:58:59 cluster5 kernel: ib_srp: new target: id_ext > 0002c90200206bd8 ioc_guid 0002c90200206bd8 pkey ffff service_id > 0002c90200206bd8 dgid fe80:0000:0000:0000:0002:c902:0020:6bd9 > May 15 13:58:59 cluster5 kernel: scsi2 : SRP.T10:0002C90200206BD8 > May 15 13:58:59 cluster5 kernel: Vendor: SCST_FIO Model: fdisk_128M > Rev: 095 > May 15 13:58:59 cluster5 kernel: Type: Direct-Access > ANSI SCSI revision: 04 > May 15 13:58:59 cluster5 kernel: SCSI device sdb: 262144 512-byte hdwr > sectors (134 MB) > May 15 13:58:59 cluster5 kernel: sdb: Write Protect is off > May 15 13:58:59 cluster5 kernel: sdb: Mode Sense: 6b 00 10 08 > May 15 13:58:59 cluster5 kernel: SCSI device sdb: drive cache: write back w/ FUA > May 15 13:58:59 cluster5 kernel: SCSI device sdb: 262144 512-byte hdwr > sectors (134 MB) > May 15 13:58:59 cluster5 kernel: sdb: Write Protect is off > May 15 13:58:59 cluster5 kernel: sdb: Mode Sense: 6b 00 10 08 > May 15 13:58:59 cluster5 kernel: SCSI device sdb: drive cache: write back w/ FUA > May 15 13:58:59 cluster5 kernel: sdb: unknown partition table > May 15 13:58:59 cluster5 kernel: sd 2:0:0:0: Attached scsi disk sdb > May 15 13:58:59 cluster5 kernel: sd 2:0:0:0: Attached scsi generic sg1 type 0 > May 15 13:59:59 cluster5 kernel: SRP abort called > May 15 13:59:59 cluster5 kernel: SRP reset_device 
called > May 15 14:00:29 cluster5 kernel: SRP abort called > May 15 14:00:34 cluster5 kernel: ib_srp: SRP reset_host called > May 15 14:00:36 cluster5 kernel: ib_srp: connection closed > May 15 14:02:15 cluster5 kernel: ib_srp: Sending CM REQ failed > May 15 14:02:15 cluster5 kernel: ib_srp: reconnect failed (-104), > removing target port. > May 15 14:02:15 cluster5 kernel: sd 2:0:0:0: scsi: Device offlined - > not ready after error recovery > May 15 14:02:15 cluster5 kernel: sd 2:0:0:0: rejecting I/O to offline device > May 15 14:02:15 cluster5 kernel: Buffer I/O error on device sdb, > logical block 32760 > May 15 14:02:15 cluster5 kernel: 2:0:0:0: rejecting I/O to dead device > May 15 14:02:15 cluster5 kernel: Buffer I/O error on device sdb, > logical block 32760 > > Here are the logs of the target during the second initiator's > connection. It seemed that it did not receive the reconnect request. > > May 15 13:57:27 cluster4 kernel: ib_srpt: Host > i_port_id=0x100000000000000:0xcc6b200002c90200 login with > t_port_id=0xd86b200002c90200:0xd86b200002c90200 it_iu_len=260 > May 15 13:57:27 cluster4 kernel: ib_srpt: srpt_create_ch_ib[1105] > max_cqe= 255 max_sge= 29 cm_id= da9b7200 > May 15 13:57:27 cluster4 kernel: [3966]: scst_init_session:scst: Name > 0x00000000000000010002c90200206bcc not found, using default group > May 15 13:57:27 cluster4 kernel: [3966]: > scst_alloc_add_tgt_dev:Virtual device SCST lun=0 > May 15 13:57:27 cluster4 kernel: [3966]: tm_dbg_init_tgt_dev:LUN 0 > connected from initiator ib_srpt is under TM debugging > May 15 13:57:27 cluster4 kernel: ib_srpt: Establish connection sess= > c9a677a8 name= 0x00000000000000010002c90200206bcc cm_id= da9b7200 > May 15 13:57:27 cluster4 kernel: [3964]: scst_set_pending_UA:Setting > pending UA cmd dabb3ec0 > May 15 13:57:27 cluster4 kernel: [3964]: > tm_dbg_delay_cmd:tm_dbg_delay_cmd: delaying timed cmd dabb3ec0 (tag > 35) for 60.96 seconds (15241 HZ) > > May 15 13:58:27 cluster4 kernel: ib_srpt: 
recv_tsk_mgmt= 1 for > task_tag= 35 using tag= 163 cm_id= da9b7200 sess= c9a677a8 > May 15 13:58:27 cluster4 kernel: [0]: scst_rx_mgmt_fn_tag:sess=c9a677a8, tag=35 > May 15 13:58:27 cluster4 kernel: [0]: scst_post_rx_mgmt_cmd:Adding > mgmt cmd c70486a0 to active mgmt cmd list > May 15 13:58:27 cluster4 kernel: [3965]: scst_mgmt_cmd_thread:Moving > mgmt cmd c70486a0 to mgmt cmd list > May 15 13:58:27 cluster4 kernel: [3965]: scst_mgmt_cmd_init:Cmd > dabb3ec0 for tag 35 (sn 35) found, aborting it > May 15 13:58:27 cluster4 kernel: [3965]: scst_abort_cmd:Aborting cmd > dabb3ec0 (tag 35) > May 15 13:58:27 cluster4 kernel: [3965]: > scst_call_dev_task_mgmt_fn:Calling dev handler disk_fileio > task_mgmt_fn(fn=0) > May 15 13:58:27 cluster4 kernel: [3965]: > scst_call_dev_task_mgmt_fn:Dev handler disk_fileio task_mgmt_fn() > returned 0 > May 15 13:58:27 cluster4 kernel: [3965]: tm_dbg_release_cmd:Abort > request for delayed cmd dabb3ec0 (tag=35), moving it to active cmd > list (delayed_cmds_count=1) > May 15 13:58:27 cluster4 kernel: ib_srpt: srpt_tsk_mgmt_done[1972] > tsk_mgmt_done for tag= 163 status=0 > May 15 13:58:27 cluster4 kernel: [3965]: scst_mgmt_cmd_send_done:Dev > handler ib_srpt task_mgmt_fn_done() returned > May 15 13:58:27 cluster4 kernel: ib_srpt: recv_tsk_mgmt= 8 for > task_tag= 35 using tag= 163 cm_id= da9b7200 sess= c9a677a8 > May 15 13:58:27 cluster4 kernel: [3964]: tm_dbg_check_cmd:Processing > delayed cmd dabb3ec0 (tag 35), delayed_cmds_count=1 > May 15 13:58:27 cluster4 kernel: [3964]: tm_dbg_change_state:Deleting timer > May 15 13:58:27 cluster4 kernel: ib_srpt: srpt_xmit_response[1898] > tag= 35 already get aborted > May 15 13:58:57 cluster4 kernel: ib_srpt: recv_tsk_mgmt= 1 for > task_tag= 36 using tag= 164 cm_id= da9b7200 sess= c9a677a8 > May 15 13:58:57 cluster4 kernel: [0]: scst_rx_mgmt_fn_tag:sess=c9a677a8, tag=36 > May 15 13:58:57 cluster4 kernel: [0]: scst_post_rx_mgmt_cmd:Adding > mgmt cmd c7048240 to active mgmt cmd list > May 15 13:58:57 
cluster4 kernel: [3965]: scst_mgmt_cmd_thread:Moving > mgmt cmd c7048240 to mgmt cmd list > May 15 13:58:57 cluster4 kernel: [3965]: scst_mgmt_cmd_init:Cmd > dabb3050 for tag 36 (sn 36) found, aborting it > May 15 13:58:57 cluster4 kernel: [3965]: scst_abort_cmd:Aborting cmd > dabb3050 (tag 36) > May 15 13:58:57 cluster4 kernel: [3965]: > scst_call_dev_task_mgmt_fn:Calling dev handler disk_fileio > task_mgmt_fn(fn=0) > May 15 13:58:57 cluster4 kernel: [3965]: > scst_call_dev_task_mgmt_fn:Dev handler disk_fileio task_mgmt_fn() > returned 0 > May 15 13:58:57 cluster4 kernel: [3965]: scst_abort_cmd:cmd dabb3050 > (tag 36) being executed/xmitted (state 12), deferring ABORT... > May 15 13:58:57 cluster4 kernel: [3965]: > scst_set_mcmd_next_state:cmd_wait_count(1) not 0, preparing to wait > May 15 13:59:02 cluster4 kernel: ib_srpt: srpt_cm_dreq_recv[1523] > cm_id= da9b7200 ch->state= 1 > May 15 13:59:04 cluster4 kernel: ib_srpt: srpt_cm_timewait_exit[1502] > cm_id= da9b7200 > May 15 13:59:04 cluster4 kernel: ib_srpt: srpt_release_channel[1154] > Release channel cm_id= da9b7200 > May 15 13:59:04 cluster4 kernel: ib_srpt: srpt_release_channel[1159] > Release sess= c9a677a8 sess_name= 0x00000000000000010002c90200206bcc > May 15 13:59:21 cluster4 kernel: ib_srpt: failed send status= 12 > May 15 13:59:21 cluster4 kernel: [0]: scst_complete_cmd_mgmt:cmd > dabb3050 completed (tag 36, mcmd c7048240, mcmd->cmd_wait_count 1) > May 15 13:59:21 cluster4 kernel: [0]: scst_complete_cmd_mgmt:Moving > mgmt cmd c7048240 to active mgmt cmd list > May 15 13:59:21 cluster4 kernel: [3965]: scst_mgmt_cmd_thread:Moving > mgmt cmd c7048240 to mgmt cmd list > May 15 13:59:21 cluster4 kernel: ib_srpt: srpt_tsk_mgmt_done[1972] > tsk_mgmt_done for tag= 164 status=-1 > May 15 13:59:21 cluster4 kernel: [3965]: scst_mgmt_cmd_send_done:Dev > handler ib_srpt task_mgmt_fn_done() returned > May 15 13:59:21 cluster4 kernel: ib_srpt: > srpt_unregister_session_done[1143] sess= c9a677a8 > May 15 13:59:21 
cluster4 kernel: [3966]: scst_free_all_UA:Clearing UA > for tgt_dev lun 0 > May 15 13:59:21 cluster4 kernel: ib_srpt: failed send status= 5 > May 15 13:59:21 cluster4 kernel: ib_srpt: QP event 16 on cm_id= > da9b7200 sess_name= 0x00000000000000010002c90200206bcc state= 2 > > > I have no idea why the *abort* was called at the second initiator. > Could you please give some suggestion? Thanks a lot! > > > On 5/15/07, Vu Pham wrote: > > Ian Jiang wrote: > > > Does the SRP target support multiple initiators? > > > > Yes, it does. > > > > > > > I am using the SRR initiator and IB drivers in linux-2.6.20. > > > The SRP target is at > > > http://www.openfabrics.org/git/?p=~vu/srpt.git;a=summary > > > and the IB driver is OFED-1.1 with linux-2.6.16.13-4-default of Suse-10.1. > > > > > > > -- > Ian Jiang > -- Ian Jiang From devesh28 at gmail.com Sun May 20 22:58:09 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Mon, 21 May 2007 11:28:09 +0530 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <1179483657.23882.158398.camel@hal.voltaire.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com> <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com> <464B5C07.8040601@ichips.intel.com> <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com> <1179398534.23882.67542.camel@hal.voltaire.com> <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com> <1179483657.23882.158398.camel@hal.voltaire.com> Message-ID: <309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com> On 18 May 2007 06:21:05 -0400, Hal Rosenstock wrote: > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote: > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock wrote: > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote: > > > > On 5/17/07, Sean Hefty wrote: > > > > > > But initially this will generate a packet for each path, while sys > > > > > > admin knows that path is there and he can hard-code the 
entries for > > > > > > it. Other thing is that why Admin will care about creating such record > > > > > > while SA is itself taking care, right? > > > > > > > > > > In your original message you asked about adding 'dummy entries' to the > > > > > cache. I agree that pre-loading the cache can be useful. What I still > > > > > am not understanding is the reasoning for adding 'dummy entries'. By > > > > > 'dummy entries', I've been assuming that these are invalid path records, > > > > > but maybe that's not what you meant. > > > > Ok if "dummy entries" word as such has created confusion then I am > > > > sorry for that, But with that I mean that, those are valid path > > > > records which Administrator knows in advance and while loading the > > > > module, > > > > > > How does the admin know they are valid ? > > Depending on the initial application runs, some trusted PRs can be generated. > > What do initial application runs have to do with this ? My understanding is that, once the cluster is UP, and if between Node A and Node B there is only one path, then, SA query always going to return same values in PR. On this basis Initial application runs will generate PRs, these PRs can be saved in some file, and can be loaded when cache_module comes in. > > > >Are they somehow preconfigured at the SM ? > > I am not sure about SM has any such provision? > > Not that I'm aware of. Ok, So, currently no such support is there in SM? > > > Also not sure about the > > role of SM in path resolving. I mean once node has initiated SA query, > > whether SM has some database to reply SA or On the fly destination > > node is contacted to get asked path recored? > > SMs can either calculate the SA PRs on the fly based on the routing > algorithm in use and some other things or put them in a local database. > This is up to that SM. Ok > > Destination node is not contacted in the SA PR query process. > > > >Doesn't each SM have its own policy for generating valid PRs ? 
> > Ultimately path record is in Path_Record object format, and SA cache > > is going to store in a fixed manner, How generation policy matters? > > What if the local policy loaded does not agree with what the SM would > generate for a particular PR ? One then gets a local error which will > need to be tracked down. Not so easy IMO. SM policies in a subnet to generate PRs, changes dynamically? at run time? if Not then depending on the local SM policy static PR can be generated to load initially. > > > CMIIW. Also I am assuming a homogeneous cluster where certain > > parameters can be assumed to be same always. > > and always in agreement with what the SM would return ? For example, yes > what happens when a link goes down and the end node is no longer > reachable ? If node is not reachable then, after first timeout of sa_cache, that entry will be removed from cache. > > > >are these from a live SM and just loaded "out of band" to > > bypass/preclude the SA PR >mechanism ? > > may be > > Even if they are, there is still the changes in the subnet issue. > > -- Hal > > > > -- Hal > > > > > > > Admin is loading this info in the cache with user command. > > > > > > > > > > > Another point I want to know is, > > > > > > When local_sa_cache module will be inserted? After SM comes up or > > > > > > Before SM comes up? > > > > > > > > > > It can occur either way. There is no restriction. The cache responds > > > > > to port up and GID in/out of service events to update itself. > > > > Do you mean cache module will start building cache only after Port is UP? > > > > > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on > > > > > > some node not on switch) then First Forced schedule_update() is > > > > > > waisted, and for the first application presence of cache is > > > > > > meaningless. Why not to keep cache effective right from the start? 
> > > > > > > > > > Pre-loading the cache with path records doesn't guarantee that those > > > > > paths are usable. If the SM has not come up, then the path records will > > > > > be unusable until the SM configures the subnet, plus there's no > > > > > guarantee that the remote endpoints specified by the paths are running. > > > > You mean there is no guarantee that even if SM is UP and we have some > > > > hard coded entries of path record corresponding to some node X, we are > > > > not sure that node X has actually come up or not? In that case > > > > actually that path resolving should fail if node has not come up, but > > > > with the hard coding still path will be resolved? > > > > > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms > > > > > when booting a large cluster. > > > > that's true. Also cache will get valid entries only if network is > > > > configured by SM otherwise every node SA will, possibly, drop SA > > > > packets. > > > > > > > > > > - Sean > > > > > > > > > _______________________________________________ > > > > general mailing list > > > > general at lists.openfabrics.org > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > From erezz at voltaire.com Sun May 20 23:28:35 2007 From: erezz at voltaire.com (Erez Zilber) Date: Mon, 21 May 2007 09:28:35 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4 In-Reply-To: <20070510092925.GB13655@mellanox.co.il> References: <4641D295.5060907@voltaire.com> <4641D38A.8040406@voltaire.com> <20070510092925.GB13655@mellanox.co.il> Message-ID: <46513C13.3010100@voltaire.com> Michael S. 
Tsirkin wrote: >> Quoting Erez Zilber : >> Subject: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4 >> >> >> Add the required backport patches & kernel addons for open-iscsi >> over iSER in RHAS4 up3 and up4. >> >> Signed-off-by: Erez Zilber > > In addition to posting patches, could you pls publish a git tree to pull from, > please? This makes it easy to test-build the patch as our build system > knows how to do git checkout. Added a git tree: http://www.openfabrics.org/git/?p=~erezz/ofed_1_2_iser_rh4.git;a=summary > > --- > > Two comments, generally > A: Please move code from kernel_patches to kernel_addons as much > as possible. There are many places where you just add new headers, > or add #include directives, or change the function called or > remove extra parameters, all this can and should be done through addons. > Done > B: Please do not add code to core unless there is more than 1 user - > add it to the iser module instead. This way if there is > compilation failure there, you do not break core for people. Done > > Some specifics below: > .... > >> >> diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch >> new file mode 100644 >> index 0000000..d77c663 >> --- /dev/null >> +++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch >> @@ -0,0 +1,504 @@ >> +diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.c linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.c >> +--- linux-2.6.20/drivers/scsi/iscsi_tcp.c 2007-02-04 20:44:54.000000000 +0200 >> ++++ linux-2.6.20-backport-rh4-u3/drivers/scsi/iscsi_tcp.c 2007-04-01 13:11:17.000000000 +0300 > > ... 
>> +@@ -108,7 +108,7 @@ iscsi_hdr_digest(struct iscsi_conn *conn >> + { >> + struct iscsi_tcp_conn *tcp_conn = conn->dd_data; >> + >> +- crypto_hash_digest(&tcp_conn->tx_hash, &buf->sg, buf->sg.length, crc); >> ++ crypto_digest_digest(tcp_conn->tx_tfm, &buf->sg, 1, crc); >> + buf->sg.length = tcp_conn->hdr_size; >> + } >> + > > You could make it a macro in addons if you had named the new field tx_hash. I fixed that and other crypto function calls whenever possible. >> + >> + struct iscsi_internal { >> + int daemon_pid; >> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l >> + #define cdev_to_iscsi_internal(_cdev) \ >> + container_of(_cdev, struct iscsi_internal, cdev) >> + >> ++extern int attribute_container_init(void); >> ++ > > This does not look scsi-related. Why does this belong here? This is a hack. In 2.6.20, attribute_container_init is called from drivers/base/init.c. Since I cannot do that, I'm calling it from the init function in scsi_transport_iscsi (because scsi_transport_iscsi uses the attribute container). Do you have a better suggestion? > >> diff --git a/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch >> new file mode 100644 >> index 0000000..3c2a969 >> --- /dev/null >> +++ b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch >> @@ -0,0 +1,13 @@ >> +--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:13:43.000000000 +0200 >> ++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:14:31.000000000 +0200 >> +@@ -70,9 +70,8 @@ >> + #include >> + #include >> + #include >> +-#include >> +- >> + #include "iscsi_iser.h" >> ++#include >> + >> + static unsigned int iscsi_max_lun = 512; >> + module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO); > > Looks like the right thing to do anyway. > So put it in fixes instead, and post upstream. 
It is not required in newer kernels: mutex.h is included from include/scsi/scsi_host.h. Therefore, I don't want to post a patch upstream. Maybe we can add this inclusion to kernel_addons/backport/2.6.9_U4/include/scsi/scsi_host.h. What do you think? > >> diff --git a/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch b/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch >> index e84b964..52c0136 100644 >> --- a/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch >> +++ b/kernel_patches/backport/2.6.9_U3/linux_stuff_to_2_6_17.patch >> @@ -19,6 +19,62 @@ index 0000000..58cf933 >> +++ b/drivers/infiniband/core/kfifo.c >> @@ -0,0 +1 @@ >> +#include "src/kfifo.c" >> +diff --git a/drivers/infiniband/core/init.c b/drivers/infiniband/core/init.c >> +new file mode 100644 >> +index 0000000..58cf933 >> +--- /dev/null >> ++++ b/drivers/infiniband/core/init.c >> +@@ -0,0 +1 @@ >> ++#include "src/init.c" >> +diff --git a/drivers/infiniband/core/attribute_container.c b/drivers/infiniband/core/attribute_container.c >> +new file mode 100644 >> +index 0000000..58cf933 >> +--- /dev/null >> ++++ b/drivers/infiniband/core/attribute_container.c >> +@@ -0,0 +1 @@ >> ++#include "src/attribute_container.c" ... 
>> +diff --git a/drivers/infiniband/core/kref_new.c b/drivers/infiniband/core/kref_new.c >> +new file mode 100644 >> +index 0000000..58cf933 >> +--- /dev/null >> ++++ b/drivers/infiniband/core/kref_new.c >> +@@ -0,0 +1 @@ >> ++#include "src/kref_new.c" >> diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile >> index 50fb1cd..456bfd0 100644 >> --- a/drivers/infiniband/core/Makefile >> @@ -28,4 +84,4 @@ index 50fb1cd..456bfd0 100644 >> ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ >> uverbs_marshall.o >> + >> -+ib_core-y += genalloc.o netevent.o kfifo.o >> ++ib_core-y += genalloc.o netevent.o kfifo.o scsi.o scsi_lib.o scsi_scan.o init.o attribute_container.o transport_class.o klist.o kref_new.o > > Can we make these part of iser place? > Linking scsi stuff into core does not look right. Moved that into open-iscsi modules. This code is required for open-iscsi over any transport (not just iSER). Here's the fixed patch (also available on my git tree): diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/attribute_container.h b/kernel_addons/backport/2.6.9_U3/include/linux/attribute_container.h new file mode 100644 index 0000000..93bfb0b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/linux/attribute_container.h @@ -0,0 +1,71 @@ +/* + * class_container.h - a generic container for all classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + */ + +#ifndef _ATTRIBUTE_CONTAINER_H_ +#define _ATTRIBUTE_CONTAINER_H_ + +#include +#include +#include +#include + +struct attribute_container { + struct list_head node; + struct klist containers; + struct class *class; + struct class_device_attribute **attrs; + int (*match)(struct attribute_container *, struct device *); +#define ATTRIBUTE_CONTAINER_NO_CLASSDEVS 0x01 + unsigned long flags; +}; + +static inline int +attribute_container_no_classdevs(struct attribute_container *atc) +{ + return atc->flags & ATTRIBUTE_CONTAINER_NO_CLASSDEVS; 
+} + +static inline void +attribute_container_set_no_classdevs(struct attribute_container *atc) +{ + atc->flags |= ATTRIBUTE_CONTAINER_NO_CLASSDEVS; +} + +int attribute_container_register(struct attribute_container *cont); +int attribute_container_unregister(struct attribute_container *cont); +void attribute_container_create_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_add_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_remove_device(struct device *dev, + void (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_device_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *)); +int attribute_container_add_attrs(struct class_device *classdev); +int attribute_container_add_class_device(struct class_device *classdev); +int attribute_container_add_class_device_adapter(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev); +void attribute_container_remove_attrs(struct class_device *classdev); +void attribute_container_class_device_del(struct class_device *classdev); +struct attribute_container *attribute_container_classdev_to_container(struct class_device *); +struct class_device *attribute_container_find_class_device(struct attribute_container *, struct device *); +struct class_device_attribute **attribute_container_classdev_to_attrs(const struct class_device *classdev); + +#endif diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/crypto.h b/kernel_addons/backport/2.6.9_U3/include/linux/crypto.h new file mode 100644 index 0000000..aecccde --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/linux/crypto.h @@ 
-0,0 +1,11 @@ +#ifndef LINUX_CRYPTO_BACKPORT_H +#define LINUX_CRYPTO_BACKPORT_H + +#include_next + +#define crypto_hash_init(desc) crypto_digest_init(*desc) +#define crypto_hash_digest(desc, sg, nbytes, out) crypto_digest_digest(*desc, sg, 1, out) +#define crypto_hash_update(desc, sg, nbytes) crypto_digest_update(*desc, sg, 1) +#define crypto_hash_final(desc, out) crypto_digest_final(*desc, out) + +#endif diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/kernel.h b/kernel_addons/backport/2.6.9_U3/include/linux/kernel.h index a37dcd5..02a5907 100644 --- a/kernel_addons/backport/2.6.9_U3/include/linux/kernel.h +++ b/kernel_addons/backport/2.6.9_U3/include/linux/kernel.h @@ -4,4 +4,7 @@ #define BACKPORT_KERNEL_H_2_6_19 #include_next #include +#define NIP6_FMT "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x" +#define NIPQUAD_FMT "%u.%u.%u.%u" + #endif diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/kfifo.h b/kernel_addons/backport/2.6.9_U3/include/linux/kfifo.h index 48eccd8..2b94461 100644 --- a/kernel_addons/backport/2.6.9_U3/include/linux/kfifo.h +++ b/kernel_addons/backport/2.6.9_U3/include/linux/kfifo.h @@ -25,6 +25,7 @@ #ifdef __KERNEL__ #include #include +#include struct kfifo { unsigned char *buffer; /* the buffer holding the data */ diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/klist.h b/kernel_addons/backport/2.6.9_U3/include/linux/klist.h new file mode 100644 index 0000000..7407125 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/linux/klist.h @@ -0,0 +1,61 @@ +/* + * klist.h - Some generic list helpers, extending struct list_head a bit. + * + * Implementations are found in lib/klist.c + * + * + * Copyright (C) 2005 Patrick Mochel + * + * This file is rleased under the GPL v2. 
+ */ + +#ifndef _LINUX_KLIST_H +#define _LINUX_KLIST_H + +#include +#include +#include +#include + +struct klist_node; +struct klist { + spinlock_t k_lock; + struct list_head k_list; + void (*get)(struct klist_node *); + void (*put)(struct klist_node *); +}; + + +extern void klist_init(struct klist * k, void (*get)(struct klist_node *), + void (*put)(struct klist_node *)); + +struct klist_node { + struct klist * n_klist; + struct list_head n_node; + struct kref n_ref; + struct completion n_removed; +}; + +extern void klist_add_tail(struct klist_node * n, struct klist * k); +extern void klist_add_head(struct klist_node * n, struct klist * k); + +extern void klist_del(struct klist_node * n); +extern void klist_remove(struct klist_node * n); + +extern int klist_node_attached(struct klist_node * n); + + +struct klist_iter { + struct klist * i_klist; + struct list_head * i_head; + struct klist_node * i_cur; +}; + + +extern void klist_iter_init(struct klist * k, struct klist_iter * i); +extern void klist_iter_init_node(struct klist * k, struct klist_iter * i, + struct klist_node * n); +extern void klist_iter_exit(struct klist_iter * i); +extern struct klist_node * klist_next(struct klist_iter * i); + +#endif diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/memory.h b/kernel_addons/backport/2.6.9_U3/include/linux/memory.h new file mode 100644 index 0000000..654ef55 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/linux/memory.h @@ -0,0 +1,89 @@ +/* + * include/linux/memory.h - generic memory definition + * + * This is mainly for topological representation. We define the + * basic "struct memory_block" here, which can be embedded in per-arch + * definitions or NUMA information. + * + * Basic handling of the devices is done in drivers/base/memory.c + * and system devices are handled in drivers/base/sys.c. + * + * Memory blocks are exported via sysfs in the class/memory/devices/ + * directory. 
+ * + */ +#ifndef _LINUX_MEMORY_H_ +#define _LINUX_MEMORY_H_ + +#include +#include +#include + +#include + +struct memory_block { + unsigned long phys_index; + unsigned long state; + /* + * This serializes all state change requests. It isn't + * held during creation because the control files are + * created long after the critical areas during + * initialization. + */ + struct semaphore state_sem; + int phys_device; /* to which fru does this belong? */ + void *hw; /* optional pointer to fw/hw data */ + int (*phys_callback)(struct memory_block *); + struct sys_device sysdev; +}; + +/* These states are exposed to userspace as text strings in sysfs */ +#define MEM_ONLINE (1<<0) /* exposed to userspace */ +#define MEM_GOING_OFFLINE (1<<1) /* exposed to userspace */ +#define MEM_OFFLINE (1<<2) /* exposed to userspace */ + +/* + * All of these states are currently kernel-internal for notifying + * kernel components and architectures. + * + * For MEM_MAPPING_INVALID, all notifier chains with priority >0 + * are called before pfn_to_page() becomes invalid. The priority=0 + * entry is reserved for the function that actually makes + * pfn_to_page() stop working. Any notifiers that want to be called + * after that should have priority <0. 
+ */ +#define MEM_MAPPING_INVALID (1<<3) + +struct notifier_block; +struct mem_section; + +#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE +static inline int memory_dev_init(void) +{ + return 0; +} +static inline int register_memory_notifier(struct notifier_block *nb) +{ + return 0; +} +static inline void unregister_memory_notifier(struct notifier_block *nb) +{ +} +#else +extern int register_new_memory(struct mem_section *); +extern int unregister_memory_section(struct mem_section *); +extern int memory_dev_init(void); +extern int remove_memory_block(unsigned long, struct mem_section *, int); + +#define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION< + +#define __nlmsg_put(skb, daemon_pid, seq, type, len, flags) \ + __nlmsg_put(skb, daemon_pid, 0, 0, len) + +#define netlink_kernel_create(uint, groups, input, mod) \ + netlink_kernel_create(uint, input) + +#define NETLINK_ISCSI 8 + +#endif diff --git a/kernel_addons/backport/2.6.9_U3/include/linux/transport_class.h b/kernel_addons/backport/2.6.9_U3/include/linux/transport_class.h new file mode 100644 index 0000000..1d6cc22 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/linux/transport_class.h @@ -0,0 +1,100 @@ +/* + * transport_class.h - a generic container for all transport classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + */ + +#ifndef _TRANSPORT_CLASS_H_ +#define _TRANSPORT_CLASS_H_ + +#include +#include + +struct transport_container; + +struct transport_class { + struct class class; + int (*setup)(struct transport_container *, struct device *, + struct class_device *); + int (*configure)(struct transport_container *, struct device *, + struct class_device *); + int (*remove)(struct transport_container *, struct device *, + struct class_device *); +}; + +#define DECLARE_TRANSPORT_CLASS(cls, nm, su, rm, cfg) \ +struct transport_class cls = { \ + .class = { \ + .name = nm, \ + }, \ + .setup = su, \ + .remove = rm, \ + .configure = cfg, \ +} + + +struct anon_transport_class 
{ + struct transport_class tclass; + struct attribute_container container; +}; + +#define DECLARE_ANON_TRANSPORT_CLASS(cls, mtch, cfg) \ +struct anon_transport_class cls = { \ + .tclass = { \ + .configure = cfg, \ + }, \ + . container = { \ + .match = mtch, \ + }, \ +} + +#define class_to_transport_class(x) \ + container_of(x, struct transport_class, class) + +struct transport_container { + struct attribute_container ac; + struct attribute_group *statistics; +}; + +#define attribute_container_to_transport_container(x) \ + container_of(x, struct transport_container, ac) + +void transport_remove_device(struct device *); +void transport_add_device(struct device *); +void transport_setup_device(struct device *); +void transport_configure_device(struct device *); +void transport_destroy_device(struct device *); + +static inline void +transport_register_device(struct device *dev) +{ + transport_setup_device(dev); + transport_add_device(dev); +} + +static inline void +transport_unregister_device(struct device *dev) +{ + transport_remove_device(dev); + transport_destroy_device(dev); +} + +static inline int transport_container_register(struct transport_container *tc) +{ + return attribute_container_register(&tc->ac); +} + +static inline int transport_container_unregister(struct transport_container *tc) +{ + return attribute_container_unregister(&tc->ac); +} + +int transport_class_register(struct transport_class *); +int anon_transport_class_register(struct anon_transport_class *); +void transport_class_unregister(struct transport_class *); +void anon_transport_class_unregister(struct anon_transport_class *); + + +#endif diff --git a/kernel_addons/backport/2.6.9_U3/include/scsi/iscsi_proto.h b/kernel_addons/backport/2.6.9_U3/include/scsi/iscsi_proto.h new file mode 100644 index 0000000..02f6e4b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/scsi/iscsi_proto.h @@ -0,0 +1,587 @@ +/* + * RFC 3720 (iSCSI) protocol data types + * + * Copyright (C) 2005 Dmitry Yusupov 
+ * Copyright (C) 2005 Alex Aizman + * maintained by open-iscsi at googlegroups.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published + * by the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * See the file COPYING included with this distribution for more details. + */ + +#ifndef ISCSI_PROTO_H +#define ISCSI_PROTO_H + +#define ISCSI_DRAFT20_VERSION 0x00 + +/* default iSCSI listen port for incoming connections */ +#define ISCSI_LISTEN_PORT 3260 + +/* Padding word length */ +#define PAD_WORD_LEN 4 + +/* + * useful common (control and data paths) macros + */ +#define ntoh24(p) (((p)[0] << 16) | ((p)[1] << 8) | ((p)[2])) +#define hton24(p, v) { \ + p[0] = (((v) >> 16) & 0xFF); \ + p[1] = (((v) >> 8) & 0xFF); \ + p[2] = ((v) & 0xFF); \ +} +#define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;} + +/* + * iSCSI Template Message Header + */ +struct iscsi_hdr { + uint8_t opcode; + uint8_t flags; /* Final bit */ + uint8_t rsvd2[2]; + uint8_t hlength; /* AHSs total length */ + uint8_t dlength[3]; /* Data length */ + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 ttt; /* Target Task Tag */ + __be32 statsn; + __be32 exp_statsn; + __be32 max_statsn; + uint8_t other[12]; +}; + +/************************* RFC 3720 Begin *****************************/ + +#define ISCSI_RESERVED_TAG 0xffffffff + +/* Opcode encoding bits */ +#define ISCSI_OP_RETRY 0x80 +#define ISCSI_OP_IMMEDIATE 0x40 +#define ISCSI_OPCODE_MASK 0x3F + +/* Initiator Opcode values */ +#define ISCSI_OP_NOOP_OUT 0x00 +#define ISCSI_OP_SCSI_CMD 0x01 +#define ISCSI_OP_SCSI_TMFUNC 0x02 +#define ISCSI_OP_LOGIN 0x03 
+#define ISCSI_OP_TEXT 0x04 +#define ISCSI_OP_SCSI_DATA_OUT 0x05 +#define ISCSI_OP_LOGOUT 0x06 +#define ISCSI_OP_SNACK 0x10 + +#define ISCSI_OP_VENDOR1_CMD 0x1c +#define ISCSI_OP_VENDOR2_CMD 0x1d +#define ISCSI_OP_VENDOR3_CMD 0x1e +#define ISCSI_OP_VENDOR4_CMD 0x1f + +/* Target Opcode values */ +#define ISCSI_OP_NOOP_IN 0x20 +#define ISCSI_OP_SCSI_CMD_RSP 0x21 +#define ISCSI_OP_SCSI_TMFUNC_RSP 0x22 +#define ISCSI_OP_LOGIN_RSP 0x23 +#define ISCSI_OP_TEXT_RSP 0x24 +#define ISCSI_OP_SCSI_DATA_IN 0x25 +#define ISCSI_OP_LOGOUT_RSP 0x26 +#define ISCSI_OP_R2T 0x31 +#define ISCSI_OP_ASYNC_EVENT 0x32 +#define ISCSI_OP_REJECT 0x3f + +struct iscsi_ahs_hdr { + __be16 ahslength; + uint8_t ahstype; + uint8_t ahspec[5]; +}; + +#define ISCSI_AHSTYPE_CDB 1 +#define ISCSI_AHSTYPE_RLENGTH 2 + +/* iSCSI PDU Header */ +struct iscsi_cmd { + uint8_t opcode; + uint8_t flags; + __be16 rsvd2; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 data_length; + __be32 cmdsn; + __be32 exp_statsn; + uint8_t cdb[16]; /* SCSI Command Block */ + /* Additional Data (Command Dependent) */ +}; + +/* Command PDU flags */ +#define ISCSI_FLAG_CMD_FINAL 0x80 +#define ISCSI_FLAG_CMD_READ 0x40 +#define ISCSI_FLAG_CMD_WRITE 0x20 +#define ISCSI_FLAG_CMD_ATTR_MASK 0x07 /* 3 bits */ + +/* SCSI Command Attribute values */ +#define ISCSI_ATTR_UNTAGGED 0 +#define ISCSI_ATTR_SIMPLE 1 +#define ISCSI_ATTR_ORDERED 2 +#define ISCSI_ATTR_HEAD_OF_QUEUE 3 +#define ISCSI_ATTR_ACA 4 + +struct iscsi_rlength_ahdr { + __be16 ahslength; + uint8_t ahstype; + uint8_t reserved; + __be32 read_length; +}; + +/* SCSI Response Header */ +struct iscsi_cmd_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t response; + uint8_t cmd_status; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 rsvd1; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 exp_datasn; + __be32 bi_residual_count; + __be32 residual_count; + 
/* Response or Sense Data (optional) */ +}; + +/* Command Response PDU flags */ +#define ISCSI_FLAG_CMD_BIDI_OVERFLOW 0x10 +#define ISCSI_FLAG_CMD_BIDI_UNDERFLOW 0x08 +#define ISCSI_FLAG_CMD_OVERFLOW 0x04 +#define ISCSI_FLAG_CMD_UNDERFLOW 0x02 + +/* iSCSI Status values. Valid if Rsp Selector bit is not set */ +#define ISCSI_STATUS_CMD_COMPLETED 0 +#define ISCSI_STATUS_TARGET_FAILURE 1 +#define ISCSI_STATUS_SUBSYS_FAILURE 2 + +/* Asynchronous Event Header */ +struct iscsi_async { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t rsvd3; + uint8_t dlength[3]; + uint8_t lun[8]; + uint8_t rsvd4[8]; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t async_event; + uint8_t async_vcode; + __be16 param1; + __be16 param2; + __be16 param3; + uint8_t rsvd5[4]; +}; + +/* iSCSI Event Codes */ +#define ISCSI_ASYNC_MSG_SCSI_EVENT 0 +#define ISCSI_ASYNC_MSG_REQUEST_LOGOUT 1 +#define ISCSI_ASYNC_MSG_DROPPING_CONNECTION 2 +#define ISCSI_ASYNC_MSG_DROPPING_ALL_CONNECTIONS 3 +#define ISCSI_ASYNC_MSG_PARAM_NEGOTIATION 4 +#define ISCSI_ASYNC_MSG_VENDOR_SPECIFIC 255 + +/* NOP-Out Message */ +struct iscsi_nopout { + uint8_t opcode; + uint8_t flags; + __be16 rsvd2; + uint8_t rsvd3; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 ttt; /* Target Transfer Tag */ + __be32 cmdsn; + __be32 exp_statsn; + uint8_t rsvd4[16]; +}; + +/* NOP-In Message */ +struct iscsi_nopin { + uint8_t opcode; + uint8_t flags; + __be16 rsvd2; + uint8_t rsvd3; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 ttt; /* Target Transfer Tag */ + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t rsvd4[12]; +}; + +/* SCSI Task Management Message Header */ +struct iscsi_tm { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd1[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 rtt; /* Reference Task Tag */ + __be32 cmdsn; + __be32 exp_statsn; + 
__be32 refcmdsn; + __be32 exp_datasn; + uint8_t rsvd2[8]; +}; + +#define ISCSI_FLAG_TM_FUNC_MASK 0x7F + +/* Function values */ +#define ISCSI_TM_FUNC_ABORT_TASK 1 +#define ISCSI_TM_FUNC_ABORT_TASK_SET 2 +#define ISCSI_TM_FUNC_CLEAR_ACA 3 +#define ISCSI_TM_FUNC_CLEAR_TASK_SET 4 +#define ISCSI_TM_FUNC_LOGICAL_UNIT_RESET 5 +#define ISCSI_TM_FUNC_TARGET_WARM_RESET 6 +#define ISCSI_TM_FUNC_TARGET_COLD_RESET 7 +#define ISCSI_TM_FUNC_TASK_REASSIGN 8 + +/* SCSI Task Management Response Header */ +struct iscsi_tm_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t response; /* see Response values below */ + uint8_t qualifier; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd2[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 rtt; /* Reference Task Tag */ + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t rsvd3[12]; +}; + +/* Response values */ +#define ISCSI_TMF_RSP_COMPLETE 0x00 +#define ISCSI_TMF_RSP_NO_TASK 0x01 +#define ISCSI_TMF_RSP_NO_LUN 0x02 +#define ISCSI_TMF_RSP_TASK_ALLEGIANT 0x03 +#define ISCSI_TMF_RSP_NO_FAILOVER 0x04 +#define ISCSI_TMF_RSP_NOT_SUPPORTED 0x05 +#define ISCSI_TMF_RSP_AUTH_FAILED 0x06 +#define ISCSI_TMF_RSP_REJECTED 0xff + +/* Ready To Transfer Header */ +struct iscsi_r2t_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 ttt; /* Target Transfer Tag */ + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 r2tsn; + __be32 data_offset; + __be32 data_length; +}; + +/* SCSI Data Hdr */ +struct iscsi_data { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t rsvd3; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; + __be32 ttt; + __be32 rsvd4; + __be32 exp_statsn; + __be32 rsvd5; + __be32 datasn; + __be32 offset; + __be32 rsvd6; + /* Payload */ +}; + +/* SCSI Data Response Hdr */ +struct iscsi_data_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2; + uint8_t cmd_status; + uint8_t 
hlength; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; + __be32 ttt; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 datasn; + __be32 offset; + __be32 residual_count; +}; + +/* Data Response PDU flags */ +#define ISCSI_FLAG_DATA_ACK 0x40 +#define ISCSI_FLAG_DATA_OVERFLOW 0x04 +#define ISCSI_FLAG_DATA_UNDERFLOW 0x02 +#define ISCSI_FLAG_DATA_STATUS 0x01 + +/* Text Header */ +struct iscsi_text { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd4[8]; + __be32 itt; + __be32 ttt; + __be32 cmdsn; + __be32 exp_statsn; + uint8_t rsvd5[16]; + /* Text - key=value pairs */ +}; + +#define ISCSI_FLAG_TEXT_CONTINUE 0x40 + +/* Text Response Header */ +struct iscsi_text_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd4[8]; + __be32 itt; + __be32 ttt; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t rsvd5[12]; + /* Text Response - key:value pairs */ +}; + +/* Login Header */ +struct iscsi_login { + uint8_t opcode; + uint8_t flags; + uint8_t max_version; /* Max. version supported */ + uint8_t min_version; /* Min. 
version supported */ + uint8_t hlength; + uint8_t dlength[3]; + uint8_t isid[6]; /* Initiator Session ID */ + __be16 tsih; /* Target Session Handle */ + __be32 itt; /* Initiator Task Tag */ + __be16 cid; + __be16 rsvd3; + __be32 cmdsn; + __be32 exp_statsn; + uint8_t rsvd5[16]; +}; + +/* Login PDU flags */ +#define ISCSI_FLAG_LOGIN_TRANSIT 0x80 +#define ISCSI_FLAG_LOGIN_CONTINUE 0x40 +#define ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK 0x0C /* 2 bits */ +#define ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK 0x03 /* 2 bits */ + +#define ISCSI_LOGIN_CURRENT_STAGE(flags) \ + ((flags & ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK) >> 2) +#define ISCSI_LOGIN_NEXT_STAGE(flags) \ + (flags & ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK) + +/* Login Response Header */ +struct iscsi_login_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t max_version; /* Max. version supported */ + uint8_t active_version; /* Active version */ + uint8_t hlength; + uint8_t dlength[3]; + uint8_t isid[6]; /* Initiator Session ID */ + __be16 tsih; /* Target Session Handle */ + __be32 itt; /* Initiator Task Tag */ + __be32 rsvd3; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t status_class; /* see Login RSP status classes below */ + uint8_t status_detail; /* see Login RSP Status details below */ + uint8_t rsvd4[10]; +}; + +/* Login stage (phase) codes for CSG, NSG */ +#define ISCSI_INITIAL_LOGIN_STAGE -1 +#define ISCSI_SECURITY_NEGOTIATION_STAGE 0 +#define ISCSI_OP_PARMS_NEGOTIATION_STAGE 1 +#define ISCSI_FULL_FEATURE_PHASE 3 + +/* Login Status response classes */ +#define ISCSI_STATUS_CLS_SUCCESS 0x00 +#define ISCSI_STATUS_CLS_REDIRECT 0x01 +#define ISCSI_STATUS_CLS_INITIATOR_ERR 0x02 +#define ISCSI_STATUS_CLS_TARGET_ERR 0x03 + +/* Login Status response detail codes */ +/* Class-0 (Success) */ +#define ISCSI_LOGIN_STATUS_ACCEPT 0x00 + +/* Class-1 (Redirection) */ +#define ISCSI_LOGIN_STATUS_TGT_MOVED_TEMP 0x01 +#define ISCSI_LOGIN_STATUS_TGT_MOVED_PERM 0x02 + +/* Class-2 (Initiator Error) */ +#define 
ISCSI_LOGIN_STATUS_INIT_ERR 0x00 +#define ISCSI_LOGIN_STATUS_AUTH_FAILED 0x01 +#define ISCSI_LOGIN_STATUS_TGT_FORBIDDEN 0x02 +#define ISCSI_LOGIN_STATUS_TGT_NOT_FOUND 0x03 +#define ISCSI_LOGIN_STATUS_TGT_REMOVED 0x04 +#define ISCSI_LOGIN_STATUS_NO_VERSION 0x05 +#define ISCSI_LOGIN_STATUS_ISID_ERROR 0x06 +#define ISCSI_LOGIN_STATUS_MISSING_FIELDS 0x07 +#define ISCSI_LOGIN_STATUS_CONN_ADD_FAILED 0x08 +#define ISCSI_LOGIN_STATUS_NO_SESSION_TYPE 0x09 +#define ISCSI_LOGIN_STATUS_NO_SESSION 0x0a +#define ISCSI_LOGIN_STATUS_INVALID_REQUEST 0x0b + +/* Class-3 (Target Error) */ +#define ISCSI_LOGIN_STATUS_TARGET_ERROR 0x00 +#define ISCSI_LOGIN_STATUS_SVC_UNAVAILABLE 0x01 +#define ISCSI_LOGIN_STATUS_NO_RESOURCES 0x02 + +/* Logout Header */ +struct iscsi_logout { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd1[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd2[8]; + __be32 itt; /* Initiator Task Tag */ + __be16 cid; + uint8_t rsvd3[2]; + __be32 cmdsn; + __be32 exp_statsn; + uint8_t rsvd4[16]; +}; + +/* Logout PDU flags */ +#define ISCSI_FLAG_LOGOUT_REASON_MASK 0x7F + +/* logout reason_code values */ + +#define ISCSI_LOGOUT_REASON_CLOSE_SESSION 0 +#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION 1 +#define ISCSI_LOGOUT_REASON_RECOVERY 2 +#define ISCSI_LOGOUT_REASON_AEN_REQUEST 3 + +/* Logout Response Header */ +struct iscsi_logout_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t response; /* see Logout response values below */ + uint8_t rsvd2; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd3[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 rsvd4; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 rsvd5; + __be16 t2wait; + __be16 t2retain; + __be32 rsvd6; +}; + +/* logout response status values */ + +#define ISCSI_LOGOUT_SUCCESS 0 +#define ISCSI_LOGOUT_CID_NOT_FOUND 1 +#define ISCSI_LOGOUT_RECOVERY_UNSUPPORTED 2 +#define ISCSI_LOGOUT_CLEANUP_FAILED 3 + +/* SNACK Header */ +struct iscsi_snack { + uint8_t opcode; + uint8_t flags; + uint8_t 
rsvd2[14]; + __be32 itt; + __be32 begrun; + __be32 runlength; + __be32 exp_statsn; + __be32 rsvd3; + __be32 exp_datasn; + uint8_t rsvd6[8]; +}; + +/* SNACK PDU flags */ +#define ISCSI_FLAG_SNACK_TYPE_MASK 0x0F /* 4 bits */ + +/* Reject Message Header */ +struct iscsi_reject { + uint8_t opcode; + uint8_t flags; + uint8_t reason; + uint8_t rsvd2; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd3[8]; + __be32 ffffffff; + uint8_t rsvd4[4]; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 datasn; + uint8_t rsvd5[8]; + /* Text - Rejected hdr */ +}; + +/* Reason for Reject */ +#define ISCSI_REASON_CMD_BEFORE_LOGIN 1 +#define ISCSI_REASON_DATA_DIGEST_ERROR 2 +#define ISCSI_REASON_DATA_SNACK_REJECT 3 +#define ISCSI_REASON_PROTOCOL_ERROR 4 +#define ISCSI_REASON_CMD_NOT_SUPPORTED 5 +#define ISCSI_REASON_IMM_CMD_REJECT 6 +#define ISCSI_REASON_TASK_IN_PROGRESS 7 +#define ISCSI_REASON_INVALID_SNACK 8 +#define ISCSI_REASON_BOOKMARK_INVALID 9 +#define ISCSI_REASON_BOOKMARK_NO_RESOURCES 10 +#define ISCSI_REASON_NEGOTIATION_RESET 11 + +/* Max. 
number of Key=Value pairs in a text message */ +#define MAX_KEY_VALUE_PAIRS 8192 + +/* maximum length for text keys/values */ +#define KEY_MAXLEN 64 +#define VALUE_MAXLEN 255 +#define TARGET_NAME_MAXLEN VALUE_MAXLEN + +#define DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH 8192 + +/************************* RFC 3720 End *****************************/ + +#endif /* ISCSI_PROTO_H */ diff --git a/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_device.h b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_device.h new file mode 100644 index 0000000..f353e0b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_device.h @@ -0,0 +1,19 @@ +#ifndef _SCSI_SCSI_DEVICE_H_BACKPORT +#define _SCSI_SCSI_DEVICE_H_BACKPORT + +#include_next + +#include +#include +#include +#include +#include + +struct scsi_lun; + +extern void int_to_scsilun(unsigned int, struct scsi_lun *); +extern void scsi_target_block(struct device *); +extern void scsi_target_unblock(struct device *); +extern void starget_for_each_device(struct scsi_target *, void *, + void (*fn)(struct scsi_device *, void *)); +#endif /* _SCSI_SCSI_DEVICE_H_BACKPORT */ diff --git a/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_host.h b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_host.h new file mode 100644 index 0000000..b7e019b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_host.h @@ -0,0 +1,8 @@ +#ifndef _SCSI_SCSI_HOST_H_BACKPORT +#define _SCSI_SCSI_HOST_H_BACKPORT + +#include_next + +#define scsi_queue_work(shost, work) schedule_work(work) + +#endif diff --git a/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_transport.h b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_transport.h new file mode 100644 index 0000000..99c2b12 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/scsi/scsi_transport.h @@ -0,0 +1,8 @@ +#ifndef _SCSI_SCSI_TRANSPORT_H_BACKPORT +#define _SCSI_SCSI_TRANSPORT_H_BACKPORT + +#include_next + +#include + +#endif /* _SCSI_SCSI_TRANSPORT_H_BACKPORT */ diff 
--git a/kernel_addons/backport/2.6.9_U3/include/src/attribute_container.c b/kernel_addons/backport/2.6.9_U3/include/src/attribute_container.c new file mode 100644 index 0000000..44948d1 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/attribute_container.c @@ -0,0 +1,438 @@ +/* + * attribute_container.c - implementation of a simple container for classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + * + * The basic idea here is to enable a device to be attached to an + * arbitrary number of classes without having to allocate storage for them. + * Instead, the contained classes select the devices they need to attach + * to via a matching function. + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "base.h" + +/* This is a private structure used to tie the classdev and the + * container .. it should never be visible outside this file */ +struct internal_container { + struct klist_node node; + struct attribute_container *cont; + struct class_device classdev; +}; + +static void internal_container_klist_get(struct klist_node *n) +{ + struct internal_container *ic = + container_of(n, struct internal_container, node); + class_device_get(&ic->classdev); +} + +static void internal_container_klist_put(struct klist_node *n) +{ + struct internal_container *ic = + container_of(n, struct internal_container, node); + class_device_put(&ic->classdev); +} + + +/** + * attribute_container_classdev_to_container - given a classdev, return the container + * + * @classdev: the class device created by attribute_container_add_device. + * + * Returns the container associated with this classdev. 
+ */ +struct attribute_container * +attribute_container_classdev_to_container(struct class_device *classdev) +{ + struct internal_container *ic = + container_of(classdev, struct internal_container, classdev); + return ic->cont; +} +EXPORT_SYMBOL_GPL(attribute_container_classdev_to_container); + +static struct list_head attribute_container_list; + +static DECLARE_MUTEX(attribute_container_mutex); + +/** + * attribute_container_register - register an attribute container + * + * @cont: The container to register. This must be allocated by the + * callee and should also be zeroed by it. + */ +int +attribute_container_register(struct attribute_container *cont) +{ + INIT_LIST_HEAD(&cont->node); + klist_init(&cont->containers,internal_container_klist_get, + internal_container_klist_put); + + down(&attribute_container_mutex); + list_add_tail(&cont->node, &attribute_container_list); + up(&attribute_container_mutex); + + return 0; +} +EXPORT_SYMBOL_GPL(attribute_container_register); + +/** + * attribute_container_unregister - remove a container registration + * + * @cont: previously registered container to remove + */ +int +attribute_container_unregister(struct attribute_container *cont) +{ + int retval = -EBUSY; + down(&attribute_container_mutex); + spin_lock(&cont->containers.k_lock); + if (!list_empty(&cont->containers.k_list)) + goto out; + retval = 0; + list_del(&cont->node); + out: + spin_unlock(&cont->containers.k_lock); + up(&attribute_container_mutex); + return retval; + +} +EXPORT_SYMBOL_GPL(attribute_container_unregister); + +/* private function used as class release */ +static void attribute_container_release(struct class_device *classdev) +{ + struct internal_container *ic + = container_of(classdev, struct internal_container, classdev); + struct device *dev = classdev->dev; + + kfree(ic); + put_device(dev); +} + +/** + * attribute_container_add_device - see if any container is interested in dev + * + * @dev: device to add attributes to + * @fn: function to 
trigger addition of class device. + * + * This function allocates storage for the class device(s) to be + * attached to dev (one for each matching attribute_container). If no + * fn is provided, the code will simply register the class device via + * class_device_add. If a function is provided, it is expected to add + * the class device at the appropriate time. One of the things that + * might be necessary is to allocate and initialise the classdev and + * then add it at a later time. To do this, call this routine for + * allocation and initialisation and then use + * attribute_container_device_trigger() to call class_device_add() on + * it. Note: after this, the class device contains a reference to dev + * which is not relinquished until the release of the classdev. + */ +void +attribute_container_add_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + + if (attribute_container_no_classdevs(cont)) + continue; + + if (!cont->match(cont, dev)) + continue; + + ic = kzalloc(sizeof(*ic), GFP_KERNEL); + if (!ic) { + dev_printk(KERN_ERR, dev, "failed to allocate class container\n"); + continue; + } + + ic->cont = cont; + class_device_initialize(&ic->classdev); + ic->classdev.dev = get_device(dev); + ic->classdev.class = cont->class; + cont->class->release = attribute_container_release; + strcpy(ic->classdev.class_id, dev->bus_id); + if (fn) + fn(cont, dev, &ic->classdev); + else + attribute_container_add_class_device(&ic->classdev); + klist_add_tail(&ic->node, &cont->containers); + } + up(&attribute_container_mutex); +} + +/* FIXME: can't break out of this unless klist_iter_exit is also + * called before doing the break + */ +#define klist_for_each_entry(pos, head, member, iter) \ + for (klist_iter_init(head, iter); (pos = ({ \ + struct 
klist_node *n = klist_next(iter); \ + n ? container_of(n, typeof(*pos), member) : \ + ({ klist_iter_exit(iter) ; NULL; }); \ + }) ) != NULL; ) + + +/** + * attribute_container_remove_device - make device eligible for removal. + * + * @dev: The generic device + * @fn: A function to call to remove the device + * + * This routine triggers device removal. If fn is NULL, then it is + * simply done via class_device_unregister (note that if something + * still has a reference to the classdev, then the memory occupied + * will not be freed until the classdev is released). If you want a + * two phase release: remove from visibility and then delete the + * device, then you should use this routine with a fn that calls + * class_device_del() and then use + * attribute_container_device_trigger() to do the final put on the + * classdev. + */ +void +attribute_container_remove_device(struct device *dev, + void (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + struct klist_iter iter; + + if (attribute_container_no_classdevs(cont)) + continue; + + if (!cont->match(cont, dev)) + continue; + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (dev != ic->classdev.dev) + continue; + klist_del(&ic->node); + if (fn) + fn(cont, dev, &ic->classdev); + else { + attribute_container_remove_attrs(&ic->classdev); + class_device_unregister(&ic->classdev); + } + } + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_device_trigger - execute a trigger for each matching classdev + * + * @dev: The generic device to run the trigger for + * @fn: the function to execute for each classdev. + * + * This function is for executing a trigger when you need to know both + * the container and the classdev. 
If you only care about the + * container, then use attribute_container_trigger() instead. + */ +void +attribute_container_device_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + struct klist_iter iter; + + if (!cont->match(cont, dev)) + continue; + + if (attribute_container_no_classdevs(cont)) { + fn(cont, dev, NULL); + continue; + } + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (dev == ic->classdev.dev) + fn(cont, dev, &ic->classdev); + } + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_trigger - trigger a function for each matching container + * + * @dev: The generic device to activate the trigger for + * @fn: the function to trigger + * + * This routine triggers a function that only needs to know the + * matching containers (not the classdev) associated with a device. + * It is more lightweight than attribute_container_device_trigger, so + * should be used in preference unless the triggering function + * actually needs to know the classdev. 
+ */ +void +attribute_container_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + if (cont->match(cont, dev)) + fn(cont, dev); + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_add_attrs - add attributes + * + * @classdev: The class device + * + * This simply creates all the class device sysfs files from the + * attributes listed in the container + */ +int +attribute_container_add_attrs(struct class_device *classdev) +{ + struct attribute_container *cont = + attribute_container_classdev_to_container(classdev); + struct class_device_attribute **attrs = cont->attrs; + int i, error; + + if (!attrs) + return 0; + + for (i = 0; attrs[i]; i++) { + error = class_device_create_file(classdev, attrs[i]); + if (error) + return error; + } + + return 0; +} + +/** + * attribute_container_add_class_device - same function as class_device_add + * + * @classdev: the class device to add + * + * This performs essentially the same function as class_device_add except for + * attribute containers, namely add the classdev to the system and then + * create the attribute files + */ +int +attribute_container_add_class_device(struct class_device *classdev) +{ + int error = class_device_add(classdev); + if (error) + return error; + return attribute_container_add_attrs(classdev); +} + +/** + * attribute_container_add_class_device_adapter - simple adapter for triggers + * + * This function is identical to attribute_container_add_class_device except + * that it is designed to be called from the triggers + */ +int +attribute_container_add_class_device_adapter(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + return attribute_container_add_class_device(classdev); +} + +/** + * attribute_container_remove_attrs - remove any attribute files + * + * 
@classdev: The class device to remove the files from + * + */ +void +attribute_container_remove_attrs(struct class_device *classdev) +{ + struct attribute_container *cont = + attribute_container_classdev_to_container(classdev); + struct class_device_attribute **attrs = cont->attrs; + int i; + + if (!attrs) + return; + + for (i = 0; attrs[i]; i++) + class_device_remove_file(classdev, attrs[i]); +} + +/** + * attribute_container_class_device_del - equivalent of class_device_del + * + * @classdev: the class device + * + * This function simply removes all the attribute files and then calls + * class_device_del. + */ +void +attribute_container_class_device_del(struct class_device *classdev) +{ + attribute_container_remove_attrs(classdev); + class_device_del(classdev); +} + +/** + * attribute_container_find_class_device - find the corresponding class_device + * + * @cont: the container + * @dev: the generic device + * + * Looks up the device in the container's list of class devices and returns + * the corresponding class_device. 
+ */ +struct class_device * +attribute_container_find_class_device(struct attribute_container *cont, + struct device *dev) +{ + struct class_device *cdev = NULL; + struct internal_container *ic; + struct klist_iter iter; + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (ic->classdev.dev == dev) { + cdev = &ic->classdev; + /* FIXME: must exit iterator then break */ + klist_iter_exit(&iter); + break; + } + } + + return cdev; +} +EXPORT_SYMBOL_GPL(attribute_container_find_class_device); + +int +attribute_container_init(void) +{ + INIT_LIST_HEAD(&attribute_container_list); + return 0; +} +EXPORT_SYMBOL_GPL(attribute_container_init); diff --git a/kernel_addons/backport/2.6.9_U3/include/src/base.h b/kernel_addons/backport/2.6.9_U3/include/src/base.h new file mode 100644 index 0000000..a5f8936 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/base.h @@ -0,0 +1 @@ +extern int attribute_container_init(void); diff --git a/kernel_addons/backport/2.6.9_U3/include/src/init.c b/kernel_addons/backport/2.6.9_U3/include/src/init.c new file mode 100644 index 0000000..15f0bc6 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/init.c @@ -0,0 +1,26 @@ +/* + * + * Copyright (c) 2002-3 Patrick Mochel + * Copyright (c) 2002-3 Open Source Development Labs + * + * This file is released under the GPLv2 + * + */ + +#include +#include +#include + +#include "base.h" + +/** + * driver_init - initialize driver model. + * + * Call the driver model init functions to initialize their + * subsystems. Called early from init/main.c. + */ + +void __init driver_init(void) +{ + attribute_container_init(); +} diff --git a/kernel_addons/backport/2.6.9_U3/include/src/klist.c b/kernel_addons/backport/2.6.9_U3/include/src/klist.c new file mode 100644 index 0000000..3b29ebc --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/klist.c @@ -0,0 +1,287 @@ +/* + * klist.c - Routines for manipulating klists. 
+ * + * + * This klist interface provides a couple of structures that wrap around + * struct list_head to provide explicit list "head" (struct klist) and + * list "node" (struct klist_node) objects. For struct klist, a spinlock + * is included that protects access to the actual list itself. struct + * klist_node provides a pointer to the klist that owns it and a kref + * reference count that indicates the number of current users of that node + * in the list. + * + * The entire point is to provide an interface for iterating over a list + * that is safe and allows for modification of the list during the + * iteration (e.g. insertion and removal), including modification of the + * current node on the list. + * + * It works using a 3rd object type - struct klist_iter - that is declared + * and initialized before an iteration. klist_next() is used to acquire the + * next element in the list. It returns NULL if there are no more items. + * Internally, that routine takes the klist's lock, decrements the reference + * count of the previous klist_node and increments the count of the next + * klist_node. It then drops the lock and returns. + * + * There are primitives for adding and removing nodes to/from a klist. + * When deleting, klist_del() will simply decrement the reference count. + * Only when the count goes to 0 is the node removed from the list. + * klist_remove() will try to delete the node from the list and block + * until it is actually removed. This is useful for objects (like devices) + * that have been removed from the system and must be freed (but must wait + * until all accessors have finished). + * + * Copyright (C) 2005 Patrick Mochel + * + * This file is released under the GPL v2. + */ + +#include +#include + + +/** + * klist_init - Initialize a klist structure. + * @k: The klist we're initializing. 
+ * @get: The get function for the embedding object (NULL if none) + * @put: The put function for the embedding object (NULL if none) + * + * Initialises the klist structure. If the klist_node structures are + * going to be embedded in refcounted objects (necessary for safe + * deletion) then the get/put arguments are used to initialise + * functions that take and release references on the embedding + * objects. + */ + +void klist_init(struct klist * k, void (*get)(struct klist_node *), + void (*put)(struct klist_node *)) +{ + INIT_LIST_HEAD(&k->k_list); + spin_lock_init(&k->k_lock); + k->get = get; + k->put = put; +} + +EXPORT_SYMBOL_GPL(klist_init); + + +static void add_head(struct klist * k, struct klist_node * n) +{ + spin_lock(&k->k_lock); + list_add(&n->n_node, &k->k_list); + spin_unlock(&k->k_lock); +} + +static void add_tail(struct klist * k, struct klist_node * n) +{ + spin_lock(&k->k_lock); + list_add_tail(&n->n_node, &k->k_list); + spin_unlock(&k->k_lock); +} + + +static void klist_node_init(struct klist * k, struct klist_node * n) +{ + INIT_LIST_HEAD(&n->n_node); + init_completion(&n->n_removed); + kref_init(&n->n_ref); + n->n_klist = k; + if (k->get) + k->get(n); +} + + +/** + * klist_add_head - Initialize a klist_node and add it to front. + * @n: node we're adding. + * @k: klist it's going on. + */ + +void klist_add_head(struct klist_node * n, struct klist * k) +{ + klist_node_init(k, n); + add_head(k, n); +} + +EXPORT_SYMBOL_GPL(klist_add_head); + + +/** + * klist_add_tail - Initialize a klist_node and add it to back. + * @n: node we're adding. + * @k: klist it's going on. 
+ */ + +void klist_add_tail(struct klist_node * n, struct klist * k) +{ + klist_node_init(k, n); + add_tail(k, n); +} + +EXPORT_SYMBOL_GPL(klist_add_tail); + + +static void klist_release(struct kref * kref) +{ + struct klist_node * n = container_of(kref, struct klist_node, n_ref); + + list_del(&n->n_node); + complete(&n->n_removed); + n->n_klist = NULL; +} + +static int klist_dec_and_del(struct klist_node * n) +{ + return kref_put_new(&n->n_ref, klist_release); +} + + +/** + * klist_del - Decrement the reference count of node and try to remove. + * @n: node we're deleting. + */ + +void klist_del(struct klist_node * n) +{ + struct klist * k = n->n_klist; + void (*put)(struct klist_node *) = k->put; + + spin_lock(&k->k_lock); + if (!klist_dec_and_del(n)) + put = NULL; + spin_unlock(&k->k_lock); + if (put) + put(n); +} + +EXPORT_SYMBOL_GPL(klist_del); + + +/** + * klist_remove - Decrement the refcount of node and wait for it to go away. + * @n: node we're removing. + */ + +void klist_remove(struct klist_node * n) +{ + klist_del(n); + wait_for_completion(&n->n_removed); +} + +EXPORT_SYMBOL_GPL(klist_remove); + + +/** + * klist_node_attached - Say whether a node is bound to a list or not. + * @n: Node that we're testing. + */ + +int klist_node_attached(struct klist_node * n) +{ + return (n->n_klist != NULL); +} + +EXPORT_SYMBOL_GPL(klist_node_attached); + + +/** + * klist_iter_init_node - Initialize a klist_iter structure. + * @k: klist we're iterating. + * @i: klist_iter we're filling. + * @n: node to start with. + * + * Similar to klist_iter_init(), but starts the action off with @n, + * instead of with the list head. + */ + +void klist_iter_init_node(struct klist * k, struct klist_iter * i, struct klist_node * n) +{ + i->i_klist = k; + i->i_head = &k->k_list; + i->i_cur = n; + if (n) + kref_get(&n->n_ref); +} + +EXPORT_SYMBOL_GPL(klist_iter_init_node); + + +/** + * klist_iter_init - Initialize a klist_iter structure. + * @k: klist we're iterating.
+ * @i: klist_iter structure we're filling. + * + * Similar to klist_iter_init_node(), but start with the list head. + */ + +void klist_iter_init(struct klist * k, struct klist_iter * i) +{ + klist_iter_init_node(k, i, NULL); +} + +EXPORT_SYMBOL_GPL(klist_iter_init); + + +/** + * klist_iter_exit - Finish a list iteration. + * @i: Iterator structure. + * + * Must be called when done iterating over list, as it decrements the + * refcount of the current node. Necessary in case iteration exited before + * the end of the list was reached, and always good form. + */ + +void klist_iter_exit(struct klist_iter * i) +{ + if (i->i_cur) { + klist_del(i->i_cur); + i->i_cur = NULL; + } +} + +EXPORT_SYMBOL_GPL(klist_iter_exit); + + +static struct klist_node * to_klist_node(struct list_head * n) +{ + return container_of(n, struct klist_node, n_node); +} + + +/** + * klist_next - Ante up next node in list. + * @i: Iterator structure. + * + * First grab list lock. Decrement the reference count of the previous + * node, if there was one. Grab the next node, increment its reference + * count, drop the lock, and return that next node. 
+ */ + +struct klist_node * klist_next(struct klist_iter * i) +{ + struct list_head * next; + struct klist_node * lnode = i->i_cur; + struct klist_node * knode = NULL; + void (*put)(struct klist_node *) = i->i_klist->put; + + spin_lock(&i->i_klist->k_lock); + if (lnode) { + next = lnode->n_node.next; + if (!klist_dec_and_del(lnode)) + put = NULL; + } else + next = i->i_head->next; + + if (next != i->i_head) { + knode = to_klist_node(next); + kref_get(&knode->n_ref); + } + i->i_cur = knode; + spin_unlock(&i->i_klist->k_lock); + if (put && lnode) + put(lnode); + return knode; +} + +EXPORT_SYMBOL_GPL(klist_next); + + diff --git a/kernel_addons/backport/2.6.9_U3/include/src/kref_new.c b/kernel_addons/backport/2.6.9_U3/include/src/kref_new.c new file mode 100644 index 0000000..d45bb3f --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/kref_new.c @@ -0,0 +1,29 @@ +#include +#include + +/** + * kref_put_new - decrement refcount for object. + * @kref: object. + * @release: pointer to the function that will clean up the object when the + * last reference to the object is released. + * This pointer is required, and it is not acceptable to pass kfree + * in as this function. + * + * Decrement the refcount, and if 0, call release(). + * Return 1 if the object was removed, otherwise return 0. Beware, if this + * function returns 0, you still cannot count on the kref remaining in + * memory. Only use the return value if you want to see if the kref is now + * gone, not present.
+ */ +int kref_put_new(struct kref *kref, void (*release)(struct kref *kref)) +{ + WARN_ON(release == NULL); + WARN_ON(release == (void (*)(struct kref *))kfree); + + if (atomic_dec_and_test(&kref->refcount)) { + release(kref); + return 1; + } + return 0; +} +EXPORT_SYMBOL(kref_put_new); diff --git a/kernel_addons/backport/2.6.9_U3/include/src/scsi.c b/kernel_addons/backport/2.6.9_U3/include/src/scsi.c new file mode 100644 index 0000000..8c833c0 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/scsi.c @@ -0,0 +1,50 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +/** + * starget_for_each_device - helper to walk all devices of a target + * @starget: target whose devices we want to iterate over. + * + * This traverses over each devices of @shost. The devices have + * a reference that must be released by scsi_host_put when breaking + * out of the loop. 
+ */ +void starget_for_each_device(struct scsi_target *starget, void * data, + void (*fn)(struct scsi_device *, void *)) +{ + struct Scsi_Host *shost = dev_to_shost(starget->dev.parent); + struct scsi_device *sdev; + + printk("%s: entry\n", __FUNCTION__); + shost_for_each_device(sdev, shost) { + if ((sdev->channel == starget->channel) && + (sdev->id == starget->id)) + fn(sdev, data); + } + printk("%s: exit\n", __FUNCTION__); +} +EXPORT_SYMBOL(starget_for_each_device); diff --git a/kernel_addons/backport/2.6.9_U3/include/src/scsi_lib.c b/kernel_addons/backport/2.6.9_U3/include/src/scsi_lib.c new file mode 100644 index 0000000..f53f824 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/scsi_lib.c @@ -0,0 +1,166 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +int scsi_is_target_device(const struct device *dev) +{ + char *str = dev->bus_id; + + if (strncmp(str, "target", 6) == 0) { + return 1; + } + + return 0; +} + +/** + * scsi_internal_device_block - internal function to put a device + * temporarily into the SDEV_BLOCK state + * @sdev: device to block + * + * Block request made by scsi lld's to temporarily stop all + * scsi commands on the specified device. Called from interrupt + * or normal process context. + * + * Returns zero if successful or error if not + * + * Notes: + * This routine transitions the device to the SDEV_BLOCK state + * (which must be a legal transition). When the device is in this + * state, all commands are deferred until the scsi lld reenables + * the device with scsi_device_unblock or device_block_tmo fires. + * This routine assumes the host_lock is held on entry. 
+ **/ +int +scsi_internal_device_block(struct scsi_device *sdev) +{ + request_queue_t *q = sdev->request_queue; + unsigned long flags; + int err = 0; + + err = scsi_device_set_state(sdev, SDEV_BLOCK); + if (err) + return err; + + /* + * The device has transitioned to SDEV_BLOCK. Stop the + * block layer from calling the midlayer with this device's + * request queue. + */ + spin_lock_irqsave(q->queue_lock, flags); + blk_stop_queue(q); + spin_unlock_irqrestore(q->queue_lock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(scsi_internal_device_block); + +/** + * scsi_internal_device_unblock - resume a device after a block request + * @sdev: device to resume + * + * Called by scsi lld's or the midlayer to restart the device queue + * for the previously suspended scsi device. Called from interrupt or + * normal process context. + * + * Returns zero if successful or error if not. + * + * Notes: + * This routine transitions the device to the SDEV_RUNNING state + * (which must be a legal transition) allowing the midlayer to + * goose the queue for this device. This routine assumes the + * host_lock is held upon entry. + **/ +int +scsi_internal_device_unblock(struct scsi_device *sdev) +{ + request_queue_t *q = sdev->request_queue; + int err; + unsigned long flags; + + + /* + * Try to transition the scsi device to SDEV_RUNNING + * and goose the device queue if successful. 
+ */ + err = scsi_device_set_state(sdev, SDEV_RUNNING); + if (err) + return err; + + spin_lock_irqsave(q->queue_lock, flags); + blk_start_queue(q); + spin_unlock_irqrestore(q->queue_lock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(scsi_internal_device_unblock); + +static void +device_block(struct scsi_device *sdev, void *data) +{ + scsi_internal_device_block(sdev); +} + +static int +target_block(struct device *dev, void *data) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_block); + + return 0; +} + +void +scsi_target_block(struct device *dev) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_block); + else + device_for_each_child(dev, NULL, target_block); +} +EXPORT_SYMBOL_GPL(scsi_target_block); + +static void +device_unblock(struct scsi_device *sdev, void *data) +{ + scsi_internal_device_unblock(sdev); +} + +static int +target_unblock(struct device *dev, void *data) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_unblock); + return 0; +} + +void +scsi_target_unblock(struct device *dev) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_unblock); + else + device_for_each_child(dev, NULL, target_unblock); +} +EXPORT_SYMBOL_GPL(scsi_target_unblock); + +MODULE_LICENSE("GPL"); diff --git a/kernel_addons/backport/2.6.9_U3/include/src/scsi_scan.c b/kernel_addons/backport/2.6.9_U3/include/src/scsi_scan.c new file mode 100644 index 0000000..b7b7674 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/scsi_scan.c @@ -0,0 +1,48 @@ +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +/** + * int_to_scsilun: reverts an int into a scsi_lun + * @int: integer to be reverted + * @scsilun: struct scsi_lun to be set. 
+ * + * Description: + * Reverts the functionality of the scsilun_to_int, which packed + * an 8-byte lun value into an int. This routine unpacks the int + * back into the lun value. + * Note: the scsilun_to_int() routine does not truly handle all + * 8 bytes of the lun value. This function restores only as much + * as was set by the routine. + * + * Notes: + * Given an integer: 0x0b030a04, this function returns a + * struct scsi_lun of: 0a 04 0b 03 00 00 00 00 + * + **/ +void int_to_scsilun(unsigned int lun, struct scsi_lun *scsilun) +{ + int i; + + memset(scsilun->scsi_lun, 0, sizeof(scsilun->scsi_lun)); + + for (i = 0; i < sizeof(lun); i += 2) { + scsilun->scsi_lun[i] = (lun >> 8) & 0xFF; + scsilun->scsi_lun[i+1] = lun & 0xFF; + lun = lun >> 16; + } +} +EXPORT_SYMBOL(int_to_scsilun); diff --git a/kernel_addons/backport/2.6.9_U3/include/src/transport_class.c b/kernel_addons/backport/2.6.9_U3/include/src/transport_class.c new file mode 100644 index 0000000..f25e7c6 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U3/include/src/transport_class.c @@ -0,0 +1,280 @@ +/* + * transport_class.c - implementation of generic transport classes + * using attribute_containers + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + * + * The basic idea here is to allow any "device controller" (which + * would most often be a Host Bus Adapter) to use the services of one + * or more transport classes for performing transport specific + * services. Transport specific services are things that the generic + * command layer doesn't want to know about (speed settings, line + * conditioning, etc), but which the user might be interested in. + * Thus, the HBA's use the routines exported by the transport classes + * to perform these functions. The transport classes export certain + * values to the user via sysfs using attribute containers.
+ * + * Note: because not every HBA will care about every transport + * attribute, there's a many to one relationship that goes like this: + * + * transport class<-----attribute container<----class device + * + * Usually the attribute container is per-HBA, but the design doesn't + * mandate that. Although most of the services will be specific to + * the actual external storage connection used by the HBA, the generic + * transport class is framed entirely in terms of generic devices to + * allow it to be used by any physical HBA in the system. + */ +#include +#include + +/** + * transport_class_register - register an initial transport class + * + * @tclass: a pointer to the transport class structure to be initialised + * + * The transport class contains an embedded class which is used to + * identify it. The caller should initialise this structure with + * zeros and then generic class must have been initialised with the + * actual transport class unique name. There's a macro + * DECLARE_TRANSPORT_CLASS() to do this (declared classes still must + * be registered). + * + * Returns 0 on success or error on failure. + */ +int transport_class_register(struct transport_class *tclass) +{ + return class_register(&tclass->class); +} +EXPORT_SYMBOL_GPL(transport_class_register); + +/** + * transport_class_unregister - unregister a previously registered class + * + * @tclass: The transport class to unregister + * + * Must be called prior to deallocating the memory for the transport + * class. 
+ */ +void transport_class_unregister(struct transport_class *tclass) +{ + class_unregister(&tclass->class); +} +EXPORT_SYMBOL_GPL(transport_class_unregister); + +static int anon_transport_dummy_function(struct transport_container *tc, + struct device *dev, + struct class_device *cdev) +{ + /* do nothing */ + return 0; +} + +/** + * anon_transport_class_register - register an anonymous class + * + * @atc: The anon transport class to register + * + * The anonymous transport class contains both a transport class and a + * container. The idea of an anonymous class is that it never + * actually has any device attributes associated with it (and thus + * saves on container storage). So it can only be used for triggering + * events. Use prezero and then use DECLARE_ANON_TRANSPORT_CLASS() to + * initialise the anon transport class storage. + */ +int anon_transport_class_register(struct anon_transport_class *atc) +{ + int error; + atc->container.class = &atc->tclass.class; + attribute_container_set_no_classdevs(&atc->container); + error = attribute_container_register(&atc->container); + if (error) + return error; + atc->tclass.setup = anon_transport_dummy_function; + atc->tclass.remove = anon_transport_dummy_function; + return 0; +} +EXPORT_SYMBOL_GPL(anon_transport_class_register); + +/** + * anon_transport_class_unregister - unregister an anon class + * + * @atc: Pointer to the anon transport class to unregister + * + * Must be called prior to deallocating the memory for the anon + * transport class. 
+ */ +void anon_transport_class_unregister(struct anon_transport_class *atc) +{ + attribute_container_unregister(&atc->container); +} +EXPORT_SYMBOL_GPL(anon_transport_class_unregister); + +static int transport_setup_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + struct transport_container *tcont = attribute_container_to_transport_container(cont); + + if (tclass->setup) + tclass->setup(tcont, dev, classdev); + + return 0; +} + +/** + * transport_setup_device - declare a new dev for transport class association + * but don't make it visible yet. + * + * @dev: the generic device representing the entity being added + * + * Usually, dev represents some component in the HBA system (either + * the HBA itself or a device remote across the HBA bus). This + * routine is simply a trigger point to see if any set of transport + * classes wishes to associate with the added device. This allocates + * storage for the class device and initialises it, but does not yet + * add it to the system or add attributes to it (you do this with + * transport_add_device). If you have no need for a separate setup + * and add operations, use transport_register_device (see + * transport_class.h). 
+ */ + +void transport_setup_device(struct device *dev) +{ + attribute_container_add_device(dev, transport_setup_classdev); +} +EXPORT_SYMBOL_GPL(transport_setup_device); + +static int transport_add_class_device(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + int error = attribute_container_add_class_device(classdev); + struct transport_container *tcont = + attribute_container_to_transport_container(cont); + + if (!error && tcont->statistics) + error = sysfs_create_group(&classdev->kobj, tcont->statistics); + + return error; +} + + +/** + * transport_add_device - declare a new dev for transport class association + * + * @dev: the generic device representing the entity being added + * + * Usually, dev represents some component in the HBA system (either + * the HBA itself or a device remote across the HBA bus). This + * routine is simply a trigger point used to add the device to the + * system and register attributes for it. + */ + +void transport_add_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_add_class_device); +} +EXPORT_SYMBOL_GPL(transport_add_device); + +static int transport_configure(struct attribute_container *cont, + struct device *dev, + struct class_device *cdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + struct transport_container *tcont = attribute_container_to_transport_container(cont); + + if (tclass->configure) + tclass->configure(tcont, dev, cdev); + + return 0; +} + +/** + * transport_configure_device - configure an already set up device + * + * @dev: generic device representing device to be configured + * + * The idea of configure is simply to provide a point within the setup + * process to allow the transport class to extract information from a + * device after it has been setup. 
This is used in SCSI because we + * have to have a setup device to begin using the HBA, but after we + * send the initial inquiry, we use configure to extract the device + * parameters. The device need not have been added to be configured. + */ +void transport_configure_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_configure); +} +EXPORT_SYMBOL_GPL(transport_configure_device); + +static int transport_remove_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_container *tcont = + attribute_container_to_transport_container(cont); + struct transport_class *tclass = class_to_transport_class(cont->class); + + if (tclass->remove) + tclass->remove(tcont, dev, classdev); + + if (tclass->remove != anon_transport_dummy_function) { + if (tcont->statistics) + sysfs_remove_group(&classdev->kobj, tcont->statistics); + attribute_container_class_device_del(classdev); + } + + return 0; +} + + +/** + * transport_remove_device - remove the visibility of a device + * + * @dev: generic device to remove + * + * This call removes the visibility of the device (to the user from + * sysfs), but does not destroy it. To eliminate a device entirely + * you must also call transport_destroy_device. If you don't need to + * do remove and destroy as separate operations, use + * transport_unregister_device() (see transport_class.h) which will + * perform both calls for you. 
+ */ +void transport_remove_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_remove_classdev); +} +EXPORT_SYMBOL_GPL(transport_remove_device); + +static void transport_destroy_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + + if (tclass->remove != anon_transport_dummy_function) + class_device_put(classdev); +} + + +/** + * transport_destroy_device - destroy a removed device + * + * @dev: device to eliminate from the transport class. + * + * This call triggers the elimination of storage associated with the + * transport classdev. Note: all it really does is relinquish a + * reference to the classdev. The memory will not be freed until the + * last reference goes to zero. Note also that the classdev retains a + * reference count on dev, so dev too will remain for as long as the + * transport class device remains around. + */ +void transport_destroy_device(struct device *dev) +{ + attribute_container_remove_device(dev, transport_destroy_classdev); +} +EXPORT_SYMBOL_GPL(transport_destroy_device); diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/attribute_container.h b/kernel_addons/backport/2.6.9_U4/include/linux/attribute_container.h new file mode 100644 index 0000000..93bfb0b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/linux/attribute_container.h @@ -0,0 +1,71 @@ +/* + * class_container.h - a generic container for all classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + */ + +#ifndef _ATTRIBUTE_CONTAINER_H_ +#define _ATTRIBUTE_CONTAINER_H_ + +#include +#include +#include +#include + +struct attribute_container { + struct list_head node; + struct klist containers; + struct class *class; + struct class_device_attribute **attrs; + int (*match)(struct attribute_container *, struct device *); +#define ATTRIBUTE_CONTAINER_NO_CLASSDEVS 0x01 + 
unsigned long flags; +}; + +static inline int +attribute_container_no_classdevs(struct attribute_container *atc) +{ + return atc->flags & ATTRIBUTE_CONTAINER_NO_CLASSDEVS; +} + +static inline void +attribute_container_set_no_classdevs(struct attribute_container *atc) +{ + atc->flags |= ATTRIBUTE_CONTAINER_NO_CLASSDEVS; +} + +int attribute_container_register(struct attribute_container *cont); +int attribute_container_unregister(struct attribute_container *cont); +void attribute_container_create_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_add_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_remove_device(struct device *dev, + void (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_device_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *)); +int attribute_container_add_attrs(struct class_device *classdev); +int attribute_container_add_class_device(struct class_device *classdev); +int attribute_container_add_class_device_adapter(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev); +void attribute_container_remove_attrs(struct class_device *classdev); +void attribute_container_class_device_del(struct class_device *classdev); +struct attribute_container *attribute_container_classdev_to_container(struct class_device *); +struct class_device *attribute_container_find_class_device(struct attribute_container *, struct device *); +struct class_device_attribute **attribute_container_classdev_to_attrs(const struct class_device *classdev); + +#endif diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/crypto.h 
b/kernel_addons/backport/2.6.9_U4/include/linux/crypto.h new file mode 100644 index 0000000..aecccde --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/linux/crypto.h @@ -0,0 +1,11 @@ +#ifndef LINUX_CRYPTO_BACKPORT_H +#define LINUX_CRYPTO_BACKPORT_H + +#include_next + +#define crypto_hash_init(desc) crypto_digest_init(*desc) +#define crypto_hash_digest(desc, sg, nbytes, out) crypto_digest_digest(*desc, sg, 1, out) +#define crypto_hash_update(desc, sg, nbytes) crypto_digest_update(*desc, sg, 1) +#define crypto_hash_final(desc, out) crypto_digest_final(*desc, out) + +#endif diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/kernel.h b/kernel_addons/backport/2.6.9_U4/include/linux/kernel.h index a37dcd5..02a5907 100644 --- a/kernel_addons/backport/2.6.9_U4/include/linux/kernel.h +++ b/kernel_addons/backport/2.6.9_U4/include/linux/kernel.h @@ -4,4 +4,7 @@ #define BACKPORT_KERNEL_H_2_6_19 #include_next #include +#define NIP6_FMT "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x" +#define NIPQUAD_FMT "%u.%u.%u.%u" + #endif diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/kfifo.h b/kernel_addons/backport/2.6.9_U4/include/linux/kfifo.h index 48eccd8..2b94461 100644 --- a/kernel_addons/backport/2.6.9_U4/include/linux/kfifo.h +++ b/kernel_addons/backport/2.6.9_U4/include/linux/kfifo.h @@ -25,6 +25,7 @@ #ifdef __KERNEL__ #include #include +#include struct kfifo { unsigned char *buffer; /* the buffer holding the data */ diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/klist.h b/kernel_addons/backport/2.6.9_U4/include/linux/klist.h new file mode 100644 index 0000000..7407125 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/linux/klist.h @@ -0,0 +1,61 @@ +/* + * klist.h - Some generic list helpers, extending struct list_head a bit. + * + * Implementations are found in lib/klist.c + * + * + * Copyright (C) 2005 Patrick Mochel + * + * This file is released under the GPL v2. 
+ */ + +#ifndef _LINUX_KLIST_H +#define _LINUX_KLIST_H + +#include +#include +#include +#include + +struct klist_node; +struct klist { + spinlock_t k_lock; + struct list_head k_list; + void (*get)(struct klist_node *); + void (*put)(struct klist_node *); +}; + + +extern void klist_init(struct klist * k, void (*get)(struct klist_node *), + void (*put)(struct klist_node *)); + +struct klist_node { + struct klist * n_klist; + struct list_head n_node; + struct kref n_ref; + struct completion n_removed; +}; + +extern void klist_add_tail(struct klist_node * n, struct klist * k); +extern void klist_add_head(struct klist_node * n, struct klist * k); + +extern void klist_del(struct klist_node * n); +extern void klist_remove(struct klist_node * n); + +extern int klist_node_attached(struct klist_node * n); + + +struct klist_iter { + struct klist * i_klist; + struct list_head * i_head; + struct klist_node * i_cur; +}; + + +extern void klist_iter_init(struct klist * k, struct klist_iter * i); +extern void klist_iter_init_node(struct klist * k, struct klist_iter * i, + struct klist_node * n); +extern void klist_iter_exit(struct klist_iter * i); +extern struct klist_node * klist_next(struct klist_iter * i); + +#endif diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/memory.h b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h new file mode 100644 index 0000000..654ef55 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h @@ -0,0 +1,89 @@ +/* + * include/linux/memory.h - generic memory definition + * + * This is mainly for topological representation. We define the + * basic "struct memory_block" here, which can be embedded in per-arch + * definitions or NUMA information. + * + * Basic handling of the devices is done in drivers/base/memory.c + * and system devices are handled in drivers/base/sys.c. + * + * Memory block are exported via sysfs in the class/memory/devices/ + * directory. 
+ * + */ +#ifndef _LINUX_MEMORY_H_ +#define _LINUX_MEMORY_H_ + +#include +#include +#include + +#include + +struct memory_block { + unsigned long phys_index; + unsigned long state; + /* + * This serializes all state change requests. It isn't + * held during creation because the control files are + * created long after the critical areas during + * initialization. + */ + struct semaphore state_sem; + int phys_device; /* to which fru does this belong? */ + void *hw; /* optional pointer to fw/hw data */ + int (*phys_callback)(struct memory_block *); + struct sys_device sysdev; +}; + +/* These states are exposed to userspace as text strings in sysfs */ +#define MEM_ONLINE (1<<0) /* exposed to userspace */ +#define MEM_GOING_OFFLINE (1<<1) /* exposed to userspace */ +#define MEM_OFFLINE (1<<2) /* exposed to userspace */ + +/* + * All of these states are currently kernel-internal for notifying + * kernel components and architectures. + * + * For MEM_MAPPING_INVALID, all notifier chains with priority >0 + * are called before pfn_to_page() becomes invalid. The priority=0 + * entry is reserved for the function that actually makes + * pfn_to_page() stop working. Any notifiers that want to be called + * after that should have priority <0. 
+ */ +#define MEM_MAPPING_INVALID (1<<3) + +struct notifier_block; +struct mem_section; + +#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE +static inline int memory_dev_init(void) +{ + return 0; +} +static inline int register_memory_notifier(struct notifier_block *nb) +{ + return 0; +} +static inline void unregister_memory_notifier(struct notifier_block *nb) +{ +} +#else +extern int register_new_memory(struct mem_section *); +extern int unregister_memory_section(struct mem_section *); +extern int memory_dev_init(void); +extern int remove_memory_block(unsigned long, struct mem_section *, int); + +#define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION< + +#define __nlmsg_put(skb, daemon_pid, seq, type, len, flags) \ + __nlmsg_put(skb, daemon_pid, 0, 0, len) + +#define netlink_kernel_create(uint, groups, input, mod) \ + netlink_kernel_create(uint, input) + +#define NETLINK_ISCSI 8 + +#endif diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/transport_class.h b/kernel_addons/backport/2.6.9_U4/include/linux/transport_class.h new file mode 100644 index 0000000..1d6cc22 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/linux/transport_class.h @@ -0,0 +1,100 @@ +/* + * transport_class.h - a generic container for all transport classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + */ + +#ifndef _TRANSPORT_CLASS_H_ +#define _TRANSPORT_CLASS_H_ + +#include +#include + +struct transport_container; + +struct transport_class { + struct class class; + int (*setup)(struct transport_container *, struct device *, + struct class_device *); + int (*configure)(struct transport_container *, struct device *, + struct class_device *); + int (*remove)(struct transport_container *, struct device *, + struct class_device *); +}; + +#define DECLARE_TRANSPORT_CLASS(cls, nm, su, rm, cfg) \ +struct transport_class cls = { \ + .class = { \ + .name = nm, \ + }, \ + .setup = su, \ + .remove = rm, \ + .configure = cfg, \ +} + + +struct anon_transport_class 
{ + struct transport_class tclass; + struct attribute_container container; +}; + +#define DECLARE_ANON_TRANSPORT_CLASS(cls, mtch, cfg) \ +struct anon_transport_class cls = { \ + .tclass = { \ + .configure = cfg, \ + }, \ + . container = { \ + .match = mtch, \ + }, \ +} + +#define class_to_transport_class(x) \ + container_of(x, struct transport_class, class) + +struct transport_container { + struct attribute_container ac; + struct attribute_group *statistics; +}; + +#define attribute_container_to_transport_container(x) \ + container_of(x, struct transport_container, ac) + +void transport_remove_device(struct device *); +void transport_add_device(struct device *); +void transport_setup_device(struct device *); +void transport_configure_device(struct device *); +void transport_destroy_device(struct device *); + +static inline void +transport_register_device(struct device *dev) +{ + transport_setup_device(dev); + transport_add_device(dev); +} + +static inline void +transport_unregister_device(struct device *dev) +{ + transport_remove_device(dev); + transport_destroy_device(dev); +} + +static inline int transport_container_register(struct transport_container *tc) +{ + return attribute_container_register(&tc->ac); +} + +static inline int transport_container_unregister(struct transport_container *tc) +{ + return attribute_container_unregister(&tc->ac); +} + +int transport_class_register(struct transport_class *); +int anon_transport_class_register(struct anon_transport_class *); +void transport_class_unregister(struct transport_class *); +void anon_transport_class_unregister(struct anon_transport_class *); + + +#endif diff --git a/kernel_addons/backport/2.6.9_U4/include/scsi/iscsi_proto.h b/kernel_addons/backport/2.6.9_U4/include/scsi/iscsi_proto.h new file mode 100644 index 0000000..02f6e4b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/scsi/iscsi_proto.h @@ -0,0 +1,587 @@ +/* + * RFC 3720 (iSCSI) protocol data types + * + * Copyright (C) 2005 Dmitry Yusupov 
+ * Copyright (C) 2005 Alex Aizman + * maintained by open-iscsi at googlegroups.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published + * by the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * See the file COPYING included with this distribution for more details. + */ + +#ifndef ISCSI_PROTO_H +#define ISCSI_PROTO_H + +#define ISCSI_DRAFT20_VERSION 0x00 + +/* default iSCSI listen port for incoming connections */ +#define ISCSI_LISTEN_PORT 3260 + +/* Padding word length */ +#define PAD_WORD_LEN 4 + +/* + * useful common (control and data path) macros + */ +#define ntoh24(p) (((p)[0] << 16) | ((p)[1] << 8) | ((p)[2])) +#define hton24(p, v) { \ + p[0] = (((v) >> 16) & 0xFF); \ + p[1] = (((v) >> 8) & 0xFF); \ + p[2] = ((v) & 0xFF); \ +} +#define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;} + +/* + * iSCSI Template Message Header + */ +struct iscsi_hdr { + uint8_t opcode; + uint8_t flags; /* Final bit */ + uint8_t rsvd2[2]; + uint8_t hlength; /* AHSs total length */ + uint8_t dlength[3]; /* Data length */ + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 ttt; /* Target Transfer Tag */ + __be32 statsn; + __be32 exp_statsn; + __be32 max_statsn; + uint8_t other[12]; +}; + +/************************* RFC 3720 Begin *****************************/ + +#define ISCSI_RESERVED_TAG 0xffffffff + +/* Opcode encoding bits */ +#define ISCSI_OP_RETRY 0x80 +#define ISCSI_OP_IMMEDIATE 0x40 +#define ISCSI_OPCODE_MASK 0x3F + +/* Initiator Opcode values */ +#define ISCSI_OP_NOOP_OUT 0x00 +#define ISCSI_OP_SCSI_CMD 0x01 +#define ISCSI_OP_SCSI_TMFUNC 0x02 +#define ISCSI_OP_LOGIN 0x03 
+#define ISCSI_OP_TEXT 0x04 +#define ISCSI_OP_SCSI_DATA_OUT 0x05 +#define ISCSI_OP_LOGOUT 0x06 +#define ISCSI_OP_SNACK 0x10 + +#define ISCSI_OP_VENDOR1_CMD 0x1c +#define ISCSI_OP_VENDOR2_CMD 0x1d +#define ISCSI_OP_VENDOR3_CMD 0x1e +#define ISCSI_OP_VENDOR4_CMD 0x1f + +/* Target Opcode values */ +#define ISCSI_OP_NOOP_IN 0x20 +#define ISCSI_OP_SCSI_CMD_RSP 0x21 +#define ISCSI_OP_SCSI_TMFUNC_RSP 0x22 +#define ISCSI_OP_LOGIN_RSP 0x23 +#define ISCSI_OP_TEXT_RSP 0x24 +#define ISCSI_OP_SCSI_DATA_IN 0x25 +#define ISCSI_OP_LOGOUT_RSP 0x26 +#define ISCSI_OP_R2T 0x31 +#define ISCSI_OP_ASYNC_EVENT 0x32 +#define ISCSI_OP_REJECT 0x3f + +struct iscsi_ahs_hdr { + __be16 ahslength; + uint8_t ahstype; + uint8_t ahspec[5]; +}; + +#define ISCSI_AHSTYPE_CDB 1 +#define ISCSI_AHSTYPE_RLENGTH 2 + +/* iSCSI PDU Header */ +struct iscsi_cmd { + uint8_t opcode; + uint8_t flags; + __be16 rsvd2; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 data_length; + __be32 cmdsn; + __be32 exp_statsn; + uint8_t cdb[16]; /* SCSI Command Block */ + /* Additional Data (Command Dependent) */ +}; + +/* Command PDU flags */ +#define ISCSI_FLAG_CMD_FINAL 0x80 +#define ISCSI_FLAG_CMD_READ 0x40 +#define ISCSI_FLAG_CMD_WRITE 0x20 +#define ISCSI_FLAG_CMD_ATTR_MASK 0x07 /* 3 bits */ + +/* SCSI Command Attribute values */ +#define ISCSI_ATTR_UNTAGGED 0 +#define ISCSI_ATTR_SIMPLE 1 +#define ISCSI_ATTR_ORDERED 2 +#define ISCSI_ATTR_HEAD_OF_QUEUE 3 +#define ISCSI_ATTR_ACA 4 + +struct iscsi_rlength_ahdr { + __be16 ahslength; + uint8_t ahstype; + uint8_t reserved; + __be32 read_length; +}; + +/* SCSI Response Header */ +struct iscsi_cmd_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t response; + uint8_t cmd_status; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 rsvd1; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 exp_datasn; + __be32 bi_residual_count; + __be32 residual_count; + 
/* Response or Sense Data (optional) */ +}; + +/* Command Response PDU flags */ +#define ISCSI_FLAG_CMD_BIDI_OVERFLOW 0x10 +#define ISCSI_FLAG_CMD_BIDI_UNDERFLOW 0x08 +#define ISCSI_FLAG_CMD_OVERFLOW 0x04 +#define ISCSI_FLAG_CMD_UNDERFLOW 0x02 + +/* iSCSI Status values. Valid if Rsp Selector bit is not set */ +#define ISCSI_STATUS_CMD_COMPLETED 0 +#define ISCSI_STATUS_TARGET_FAILURE 1 +#define ISCSI_STATUS_SUBSYS_FAILURE 2 + +/* Asynchronous Event Header */ +struct iscsi_async { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t rsvd3; + uint8_t dlength[3]; + uint8_t lun[8]; + uint8_t rsvd4[8]; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t async_event; + uint8_t async_vcode; + __be16 param1; + __be16 param2; + __be16 param3; + uint8_t rsvd5[4]; +}; + +/* iSCSI Event Codes */ +#define ISCSI_ASYNC_MSG_SCSI_EVENT 0 +#define ISCSI_ASYNC_MSG_REQUEST_LOGOUT 1 +#define ISCSI_ASYNC_MSG_DROPPING_CONNECTION 2 +#define ISCSI_ASYNC_MSG_DROPPING_ALL_CONNECTIONS 3 +#define ISCSI_ASYNC_MSG_PARAM_NEGOTIATION 4 +#define ISCSI_ASYNC_MSG_VENDOR_SPECIFIC 255 + +/* NOP-Out Message */ +struct iscsi_nopout { + uint8_t opcode; + uint8_t flags; + __be16 rsvd2; + uint8_t rsvd3; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 ttt; /* Target Transfer Tag */ + __be32 cmdsn; + __be32 exp_statsn; + uint8_t rsvd4[16]; +}; + +/* NOP-In Message */ +struct iscsi_nopin { + uint8_t opcode; + uint8_t flags; + __be16 rsvd2; + uint8_t rsvd3; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 ttt; /* Target Transfer Tag */ + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t rsvd4[12]; +}; + +/* SCSI Task Management Message Header */ +struct iscsi_tm { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd1[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 rtt; /* Reference Task Tag */ + __be32 cmdsn; + __be32 exp_statsn; + 
__be32 refcmdsn; + __be32 exp_datasn; + uint8_t rsvd2[8]; +}; + +#define ISCSI_FLAG_TM_FUNC_MASK 0x7F + +/* Function values */ +#define ISCSI_TM_FUNC_ABORT_TASK 1 +#define ISCSI_TM_FUNC_ABORT_TASK_SET 2 +#define ISCSI_TM_FUNC_CLEAR_ACA 3 +#define ISCSI_TM_FUNC_CLEAR_TASK_SET 4 +#define ISCSI_TM_FUNC_LOGICAL_UNIT_RESET 5 +#define ISCSI_TM_FUNC_TARGET_WARM_RESET 6 +#define ISCSI_TM_FUNC_TARGET_COLD_RESET 7 +#define ISCSI_TM_FUNC_TASK_REASSIGN 8 + +/* SCSI Task Management Response Header */ +struct iscsi_tm_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t response; /* see Response values below */ + uint8_t qualifier; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd2[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 rtt; /* Reference Task Tag */ + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t rsvd3[12]; +}; + +/* Response values */ +#define ISCSI_TMF_RSP_COMPLETE 0x00 +#define ISCSI_TMF_RSP_NO_TASK 0x01 +#define ISCSI_TMF_RSP_NO_LUN 0x02 +#define ISCSI_TMF_RSP_TASK_ALLEGIANT 0x03 +#define ISCSI_TMF_RSP_NO_FAILOVER 0x04 +#define ISCSI_TMF_RSP_NOT_SUPPORTED 0x05 +#define ISCSI_TMF_RSP_AUTH_FAILED 0x06 +#define ISCSI_TMF_RSP_REJECTED 0xff + +/* Ready To Transfer Header */ +struct iscsi_r2t_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 ttt; /* Target Transfer Tag */ + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 r2tsn; + __be32 data_offset; + __be32 data_length; +}; + +/* SCSI Data Hdr */ +struct iscsi_data { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t rsvd3; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; + __be32 ttt; + __be32 rsvd4; + __be32 exp_statsn; + __be32 rsvd5; + __be32 datasn; + __be32 offset; + __be32 rsvd6; + /* Payload */ +}; + +/* SCSI Data Response Hdr */ +struct iscsi_data_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2; + uint8_t cmd_status; + uint8_t 
hlength; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; + __be32 ttt; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 datasn; + __be32 offset; + __be32 residual_count; +}; + +/* Data Response PDU flags */ +#define ISCSI_FLAG_DATA_ACK 0x40 +#define ISCSI_FLAG_DATA_OVERFLOW 0x04 +#define ISCSI_FLAG_DATA_UNDERFLOW 0x02 +#define ISCSI_FLAG_DATA_STATUS 0x01 + +/* Text Header */ +struct iscsi_text { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd4[8]; + __be32 itt; + __be32 ttt; + __be32 cmdsn; + __be32 exp_statsn; + uint8_t rsvd5[16]; + /* Text - key=value pairs */ +}; + +#define ISCSI_FLAG_TEXT_CONTINUE 0x40 + +/* Text Response Header */ +struct iscsi_text_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd4[8]; + __be32 itt; + __be32 ttt; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t rsvd5[12]; + /* Text Response - key:value pairs */ +}; + +/* Login Header */ +struct iscsi_login { + uint8_t opcode; + uint8_t flags; + uint8_t max_version; /* Max. version supported */ + uint8_t min_version; /* Min. 
version supported */ + uint8_t hlength; + uint8_t dlength[3]; + uint8_t isid[6]; /* Initiator Session ID */ + __be16 tsih; /* Target Session Handle */ + __be32 itt; /* Initiator Task Tag */ + __be16 cid; + __be16 rsvd3; + __be32 cmdsn; + __be32 exp_statsn; + uint8_t rsvd5[16]; +}; + +/* Login PDU flags */ +#define ISCSI_FLAG_LOGIN_TRANSIT 0x80 +#define ISCSI_FLAG_LOGIN_CONTINUE 0x40 +#define ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK 0x0C /* 2 bits */ +#define ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK 0x03 /* 2 bits */ + +#define ISCSI_LOGIN_CURRENT_STAGE(flags) \ + ((flags & ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK) >> 2) +#define ISCSI_LOGIN_NEXT_STAGE(flags) \ + (flags & ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK) + +/* Login Response Header */ +struct iscsi_login_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t max_version; /* Max. version supported */ + uint8_t active_version; /* Active version */ + uint8_t hlength; + uint8_t dlength[3]; + uint8_t isid[6]; /* Initiator Session ID */ + __be16 tsih; /* Target Session Handle */ + __be32 itt; /* Initiator Task Tag */ + __be32 rsvd3; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t status_class; /* see Login RSP status classes below */ + uint8_t status_detail; /* see Login RSP Status details below */ + uint8_t rsvd4[10]; +}; + +/* Login stage (phase) codes for CSG, NSG */ +#define ISCSI_INITIAL_LOGIN_STAGE -1 +#define ISCSI_SECURITY_NEGOTIATION_STAGE 0 +#define ISCSI_OP_PARMS_NEGOTIATION_STAGE 1 +#define ISCSI_FULL_FEATURE_PHASE 3 + +/* Login Status response classes */ +#define ISCSI_STATUS_CLS_SUCCESS 0x00 +#define ISCSI_STATUS_CLS_REDIRECT 0x01 +#define ISCSI_STATUS_CLS_INITIATOR_ERR 0x02 +#define ISCSI_STATUS_CLS_TARGET_ERR 0x03 + +/* Login Status response detail codes */ +/* Class-0 (Success) */ +#define ISCSI_LOGIN_STATUS_ACCEPT 0x00 + +/* Class-1 (Redirection) */ +#define ISCSI_LOGIN_STATUS_TGT_MOVED_TEMP 0x01 +#define ISCSI_LOGIN_STATUS_TGT_MOVED_PERM 0x02 + +/* Class-2 (Initiator Error) */ +#define 
ISCSI_LOGIN_STATUS_INIT_ERR 0x00 +#define ISCSI_LOGIN_STATUS_AUTH_FAILED 0x01 +#define ISCSI_LOGIN_STATUS_TGT_FORBIDDEN 0x02 +#define ISCSI_LOGIN_STATUS_TGT_NOT_FOUND 0x03 +#define ISCSI_LOGIN_STATUS_TGT_REMOVED 0x04 +#define ISCSI_LOGIN_STATUS_NO_VERSION 0x05 +#define ISCSI_LOGIN_STATUS_ISID_ERROR 0x06 +#define ISCSI_LOGIN_STATUS_MISSING_FIELDS 0x07 +#define ISCSI_LOGIN_STATUS_CONN_ADD_FAILED 0x08 +#define ISCSI_LOGIN_STATUS_NO_SESSION_TYPE 0x09 +#define ISCSI_LOGIN_STATUS_NO_SESSION 0x0a +#define ISCSI_LOGIN_STATUS_INVALID_REQUEST 0x0b + +/* Class-3 (Target Error) */ +#define ISCSI_LOGIN_STATUS_TARGET_ERROR 0x00 +#define ISCSI_LOGIN_STATUS_SVC_UNAVAILABLE 0x01 +#define ISCSI_LOGIN_STATUS_NO_RESOURCES 0x02 + +/* Logout Header */ +struct iscsi_logout { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd1[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd2[8]; + __be32 itt; /* Initiator Task Tag */ + __be16 cid; + uint8_t rsvd3[2]; + __be32 cmdsn; + __be32 exp_statsn; + uint8_t rsvd4[16]; +}; + +/* Logout PDU flags */ +#define ISCSI_FLAG_LOGOUT_REASON_MASK 0x7F + +/* logout reason_code values */ + +#define ISCSI_LOGOUT_REASON_CLOSE_SESSION 0 +#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION 1 +#define ISCSI_LOGOUT_REASON_RECOVERY 2 +#define ISCSI_LOGOUT_REASON_AEN_REQUEST 3 + +/* Logout Response Header */ +struct iscsi_logout_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t response; /* see Logout response values below */ + uint8_t rsvd2; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd3[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 rsvd4; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 rsvd5; + __be16 t2wait; + __be16 t2retain; + __be32 rsvd6; +}; + +/* logout response status values */ + +#define ISCSI_LOGOUT_SUCCESS 0 +#define ISCSI_LOGOUT_CID_NOT_FOUND 1 +#define ISCSI_LOGOUT_RECOVERY_UNSUPPORTED 2 +#define ISCSI_LOGOUT_CLEANUP_FAILED 3 + +/* SNACK Header */ +struct iscsi_snack { + uint8_t opcode; + uint8_t flags; + uint8_t 
rsvd2[14]; + __be32 itt; + __be32 begrun; + __be32 runlength; + __be32 exp_statsn; + __be32 rsvd3; + __be32 exp_datasn; + uint8_t rsvd6[8]; +}; + +/* SNACK PDU flags */ +#define ISCSI_FLAG_SNACK_TYPE_MASK 0x0F /* 4 bits */ + +/* Reject Message Header */ +struct iscsi_reject { + uint8_t opcode; + uint8_t flags; + uint8_t reason; + uint8_t rsvd2; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd3[8]; + __be32 ffffffff; + uint8_t rsvd4[4]; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 datasn; + uint8_t rsvd5[8]; + /* Text - Rejected hdr */ +}; + +/* Reason for Reject */ +#define ISCSI_REASON_CMD_BEFORE_LOGIN 1 +#define ISCSI_REASON_DATA_DIGEST_ERROR 2 +#define ISCSI_REASON_DATA_SNACK_REJECT 3 +#define ISCSI_REASON_PROTOCOL_ERROR 4 +#define ISCSI_REASON_CMD_NOT_SUPPORTED 5 +#define ISCSI_REASON_IMM_CMD_REJECT 6 +#define ISCSI_REASON_TASK_IN_PROGRESS 7 +#define ISCSI_REASON_INVALID_SNACK 8 +#define ISCSI_REASON_BOOKMARK_INVALID 9 +#define ISCSI_REASON_BOOKMARK_NO_RESOURCES 10 +#define ISCSI_REASON_NEGOTIATION_RESET 11 + +/* Max. 
number of Key=Value pairs in a text message */ +#define MAX_KEY_VALUE_PAIRS 8192 + +/* maximum length for text keys/values */ +#define KEY_MAXLEN 64 +#define VALUE_MAXLEN 255 +#define TARGET_NAME_MAXLEN VALUE_MAXLEN + +#define DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH 8192 + +/************************* RFC 3720 End *****************************/ + +#endif /* ISCSI_PROTO_H */ diff --git a/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_device.h b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_device.h new file mode 100644 index 0000000..f353e0b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_device.h @@ -0,0 +1,19 @@ +#ifndef _SCSI_SCSI_DEVICE_H_BACKPORT +#define _SCSI_SCSI_DEVICE_H_BACKPORT + +#include_next + +#include +#include +#include +#include +#include + +struct scsi_lun; + +extern void int_to_scsilun(unsigned int, struct scsi_lun *); +extern void scsi_target_block(struct device *); +extern void scsi_target_unblock(struct device *); +extern void starget_for_each_device(struct scsi_target *, void *, + void (*fn)(struct scsi_device *, void *)); +#endif /* _SCSI_SCSI_DEVICE_H_BACKPORT */ diff --git a/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_host.h b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_host.h new file mode 100644 index 0000000..b7e019b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_host.h @@ -0,0 +1,8 @@ +#ifndef _SCSI_SCSI_HOST_H_BACKPORT +#define _SCSI_SCSI_HOST_H_BACKPORT + +#include_next + +#define scsi_queue_work(shost, work) schedule_work(work) + +#endif diff --git a/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_transport.h b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_transport.h new file mode 100644 index 0000000..99c2b12 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/scsi/scsi_transport.h @@ -0,0 +1,8 @@ +#ifndef _SCSI_SCSI_TRANSPORT_H_BACKPORT +#define _SCSI_SCSI_TRANSPORT_H_BACKPORT + +#include_next + +#include + +#endif /* _SCSI_SCSI_TRANSPORT_H_BACKPORT */ diff 
--git a/kernel_addons/backport/2.6.9_U4/include/src/attribute_container.c b/kernel_addons/backport/2.6.9_U4/include/src/attribute_container.c new file mode 100644 index 0000000..44948d1 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/attribute_container.c @@ -0,0 +1,438 @@ +/* + * attribute_container.c - implementation of a simple container for classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + * + * The basic idea here is to enable a device to be attached to an + * arbitrary number of classes without having to allocate storage for them. + * Instead, the contained classes select the devices they need to attach + * to via a matching function. + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "base.h" + +/* This is a private structure used to tie the classdev and the + * container .. it should never be visible outside this file */ +struct internal_container { + struct klist_node node; + struct attribute_container *cont; + struct class_device classdev; +}; + +static void internal_container_klist_get(struct klist_node *n) +{ + struct internal_container *ic = + container_of(n, struct internal_container, node); + class_device_get(&ic->classdev); +} + +static void internal_container_klist_put(struct klist_node *n) +{ + struct internal_container *ic = + container_of(n, struct internal_container, node); + class_device_put(&ic->classdev); +} + + +/** + * attribute_container_classdev_to_container - given a classdev, return the container + * + * @classdev: the class device created by attribute_container_add_device. + * + * Returns the container associated with this classdev. 
+ */ +struct attribute_container * +attribute_container_classdev_to_container(struct class_device *classdev) +{ + struct internal_container *ic = + container_of(classdev, struct internal_container, classdev); + return ic->cont; +} +EXPORT_SYMBOL_GPL(attribute_container_classdev_to_container); + +static struct list_head attribute_container_list; + +static DECLARE_MUTEX(attribute_container_mutex); + +/** + * attribute_container_register - register an attribute container + * + * @cont: The container to register. This must be allocated by the + * callee and should also be zeroed by it. + */ +int +attribute_container_register(struct attribute_container *cont) +{ + INIT_LIST_HEAD(&cont->node); + klist_init(&cont->containers,internal_container_klist_get, + internal_container_klist_put); + + down(&attribute_container_mutex); + list_add_tail(&cont->node, &attribute_container_list); + up(&attribute_container_mutex); + + return 0; +} +EXPORT_SYMBOL_GPL(attribute_container_register); + +/** + * attribute_container_unregister - remove a container registration + * + * @cont: previously registered container to remove + */ +int +attribute_container_unregister(struct attribute_container *cont) +{ + int retval = -EBUSY; + down(&attribute_container_mutex); + spin_lock(&cont->containers.k_lock); + if (!list_empty(&cont->containers.k_list)) + goto out; + retval = 0; + list_del(&cont->node); + out: + spin_unlock(&cont->containers.k_lock); + up(&attribute_container_mutex); + return retval; + +} +EXPORT_SYMBOL_GPL(attribute_container_unregister); + +/* private function used as class release */ +static void attribute_container_release(struct class_device *classdev) +{ + struct internal_container *ic + = container_of(classdev, struct internal_container, classdev); + struct device *dev = classdev->dev; + + kfree(ic); + put_device(dev); +} + +/** + * attribute_container_add_device - see if any container is interested in dev + * + * @dev: device to add attributes to + * @fn: function to 
trigger addition of class device. + * + * This function allocates storage for the class device(s) to be + * attached to dev (one for each matching attribute_container). If no + * fn is provided, the code will simply register the class device via + * class_device_add. If a function is provided, it is expected to add + * the class device at the appropriate time. One of the things that + * might be necessary is to allocate and initialise the classdev and + * then add it at a later time. To do this, call this routine for + * allocation and initialisation and then use + * attribute_container_device_trigger() to call class_device_add() on + * it. Note: after this, the class device contains a reference to dev + * which is not relinquished until the release of the classdev. + */ +void +attribute_container_add_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + + if (attribute_container_no_classdevs(cont)) + continue; + + if (!cont->match(cont, dev)) + continue; + + ic = kzalloc(sizeof(*ic), GFP_KERNEL); + if (!ic) { + dev_printk(KERN_ERR, dev, "failed to allocate class container\n"); + continue; + } + + ic->cont = cont; + class_device_initialize(&ic->classdev); + ic->classdev.dev = get_device(dev); + ic->classdev.class = cont->class; + cont->class->release = attribute_container_release; + strcpy(ic->classdev.class_id, dev->bus_id); + if (fn) + fn(cont, dev, &ic->classdev); + else + attribute_container_add_class_device(&ic->classdev); + klist_add_tail(&ic->node, &cont->containers); + } + up(&attribute_container_mutex); +} + +/* FIXME: can't break out of this unless klist_iter_exit is also + * called before doing the break + */ +#define klist_for_each_entry(pos, head, member, iter) \ + for (klist_iter_init(head, iter); (pos = ({ \ + struct 
klist_node *n = klist_next(iter); \ + n ? container_of(n, typeof(*pos), member) : \ + ({ klist_iter_exit(iter) ; NULL; }); \ + }) ) != NULL; ) + + +/** + * attribute_container_remove_device - make device eligible for removal. + * + * @dev: The generic device + * @fn: A function to call to remove the device + * + * This routine triggers device removal. If fn is NULL, then it is + * simply done via class_device_unregister (note that if something + * still has a reference to the classdev, then the memory occupied + * will not be freed until the classdev is released). If you want a + * two phase release: remove from visibility and then delete the + * device, then you should use this routine with a fn that calls + * class_device_del() and then use + * attribute_container_device_trigger() to do the final put on the + * classdev. + */ +void +attribute_container_remove_device(struct device *dev, + void (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + struct klist_iter iter; + + if (attribute_container_no_classdevs(cont)) + continue; + + if (!cont->match(cont, dev)) + continue; + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (dev != ic->classdev.dev) + continue; + klist_del(&ic->node); + if (fn) + fn(cont, dev, &ic->classdev); + else { + attribute_container_remove_attrs(&ic->classdev); + class_device_unregister(&ic->classdev); + } + } + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_device_trigger - execute a trigger for each matching classdev + * + * @dev: The generic device to run the trigger for + * @fn: the function to execute for each classdev. + * + * This function is for executing a trigger when you need to know both + * the container and the classdev. 
If you only care about the + * container, then use attribute_container_trigger() instead. + */ +void +attribute_container_device_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + struct klist_iter iter; + + if (!cont->match(cont, dev)) + continue; + + if (attribute_container_no_classdevs(cont)) { + fn(cont, dev, NULL); + continue; + } + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (dev == ic->classdev.dev) + fn(cont, dev, &ic->classdev); + } + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_trigger - trigger a function for each matching container + * + * @dev: The generic device to activate the trigger for + * @fn: the function to trigger + * + * This routine triggers a function that only needs to know the + * matching containers (not the classdev) associated with a device. + * It is more lightweight than attribute_container_device_trigger, so + * should be used in preference unless the triggering function + * actually needs to know the classdev. 
+ */ +void +attribute_container_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + if (cont->match(cont, dev)) + fn(cont, dev); + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_add_attrs - add attributes + * + * @classdev: The class device + * + * This simply creates all the class device sysfs files from the + * attributes listed in the container + */ +int +attribute_container_add_attrs(struct class_device *classdev) +{ + struct attribute_container *cont = + attribute_container_classdev_to_container(classdev); + struct class_device_attribute **attrs = cont->attrs; + int i, error; + + if (!attrs) + return 0; + + for (i = 0; attrs[i]; i++) { + error = class_device_create_file(classdev, attrs[i]); + if (error) + return error; + } + + return 0; +} + +/** + * attribute_container_add_class_device - same function as class_device_add + * + * @classdev: the class device to add + * + * This performs essentially the same function as class_device_add except for + * attribute containers, namely add the classdev to the system and then + * create the attribute files + */ +int +attribute_container_add_class_device(struct class_device *classdev) +{ + int error = class_device_add(classdev); + if (error) + return error; + return attribute_container_add_attrs(classdev); +} + +/** + * attribute_container_add_class_device_adapter - simple adapter for triggers + * + * This function is identical to attribute_container_add_class_device except + * that it is designed to be called from the triggers + */ +int +attribute_container_add_class_device_adapter(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + return attribute_container_add_class_device(classdev); +} + +/** + * attribute_container_remove_attrs - remove any attribute files + * + * 
@classdev: The class device to remove the files from + * + */ +void +attribute_container_remove_attrs(struct class_device *classdev) +{ + struct attribute_container *cont = + attribute_container_classdev_to_container(classdev); + struct class_device_attribute **attrs = cont->attrs; + int i; + + if (!attrs) + return; + + for (i = 0; attrs[i]; i++) + class_device_remove_file(classdev, attrs[i]); +} + +/** + * attribute_container_class_device_del - equivalent of class_device_del + * + * @classdev: the class device + * + * This function simply removes all the attribute files and then calls + * class_device_del. + */ +void +attribute_container_class_device_del(struct class_device *classdev) +{ + attribute_container_remove_attrs(classdev); + class_device_del(classdev); +} + +/** + * attribute_container_find_class_device - find the corresponding class_device + * + * @cont: the container + * @dev: the generic device + * + * Looks up the device in the container's list of class devices and returns + * the corresponding class_device. 
+ */ +struct class_device * +attribute_container_find_class_device(struct attribute_container *cont, + struct device *dev) +{ + struct class_device *cdev = NULL; + struct internal_container *ic; + struct klist_iter iter; + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (ic->classdev.dev == dev) { + cdev = &ic->classdev; + /* FIXME: must exit iterator then break */ + klist_iter_exit(&iter); + break; + } + } + + return cdev; +} +EXPORT_SYMBOL_GPL(attribute_container_find_class_device); + +int +attribute_container_init(void) +{ + INIT_LIST_HEAD(&attribute_container_list); + return 0; +} +EXPORT_SYMBOL_GPL(attribute_container_init); diff --git a/kernel_addons/backport/2.6.9_U4/include/src/base.h b/kernel_addons/backport/2.6.9_U4/include/src/base.h new file mode 100644 index 0000000..a5f8936 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/base.h @@ -0,0 +1 @@ +extern int attribute_container_init(void); diff --git a/kernel_addons/backport/2.6.9_U4/include/src/init.c b/kernel_addons/backport/2.6.9_U4/include/src/init.c new file mode 100644 index 0000000..15f0bc6 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/init.c @@ -0,0 +1,26 @@ +/* + * + * Copyright (c) 2002-3 Patrick Mochel + * Copyright (c) 2002-3 Open Source Development Labs + * + * This file is released under the GPLv2 + * + */ + +#include +#include +#include + +#include "base.h" + +/** + * driver_init - initialize driver model. + * + * Call the driver model init functions to initialize their + * subsystems. Called early from init/main.c. + */ + +void __init driver_init(void) +{ + attribute_container_init(); +} diff --git a/kernel_addons/backport/2.6.9_U4/include/src/klist.c b/kernel_addons/backport/2.6.9_U4/include/src/klist.c new file mode 100644 index 0000000..3b29ebc --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/klist.c @@ -0,0 +1,287 @@ +/* + * klist.c - Routines for manipulating klists. 
+ * + * + * This klist interface provides a couple of structures that wrap around + * struct list_head to provide explicit list "head" (struct klist) and + * list "node" (struct klist_node) objects. For struct klist, a spinlock + * is included that protects access to the actual list itself. struct + * klist_node provides a pointer to the klist that owns it and a kref + * reference count that indicates the number of current users of that node + * in the list. + * + * The entire point is to provide an interface for iterating over a list + * that is safe and allows for modification of the list during the + * iteration (e.g. insertion and removal), including modification of the + * current node on the list. + * + * It works using a 3rd object type - struct klist_iter - that is declared + * and initialized before an iteration. klist_next() is used to acquire the + * next element in the list. It returns NULL if there are no more items. + * Internally, that routine takes the klist's lock, decrements the reference + * count of the previous klist_node and increments the count of the next + * klist_node. It then drops the lock and returns. + * + * There are primitives for adding and removing nodes to/from a klist. + * When deleting, klist_del() will simply decrement the reference count. + * Only when the count goes to 0 is the node removed from the list. + * klist_remove() will try to delete the node from the list and block + * until it is actually removed. This is useful for objects (like devices) + * that have been removed from the system and must be freed (but must wait + * until all accessors have finished). + * + * Copyright (C) 2005 Patrick Mochel + * + * This file is released under the GPL v2. + */ + +#include +#include + + +/** + * klist_init - Initialize a klist structure. + * @k: The klist we're initializing. 
+ * @get: The get function for the embedding object (NULL if none) + * @put: The put function for the embedding object (NULL if none) + * + * Initialises the klist structure. If the klist_node structures are + * going to be embedded in refcounted objects (necessary for safe + * deletion) then the get/put arguments are used to initialise + * functions that take and release references on the embedding + * objects. + */ + +void klist_init(struct klist * k, void (*get)(struct klist_node *), + void (*put)(struct klist_node *)) +{ + INIT_LIST_HEAD(&k->k_list); + spin_lock_init(&k->k_lock); + k->get = get; + k->put = put; +} + +EXPORT_SYMBOL_GPL(klist_init); + + +static void add_head(struct klist * k, struct klist_node * n) +{ + spin_lock(&k->k_lock); + list_add(&n->n_node, &k->k_list); + spin_unlock(&k->k_lock); +} + +static void add_tail(struct klist * k, struct klist_node * n) +{ + spin_lock(&k->k_lock); + list_add_tail(&n->n_node, &k->k_list); + spin_unlock(&k->k_lock); +} + + +static void klist_node_init(struct klist * k, struct klist_node * n) +{ + INIT_LIST_HEAD(&n->n_node); + init_completion(&n->n_removed); + kref_init(&n->n_ref); + n->n_klist = k; + if (k->get) + k->get(n); +} + + +/** + * klist_add_head - Initialize a klist_node and add it to front. + * @n: node we're adding. + * @k: klist it's going on. + */ + +void klist_add_head(struct klist_node * n, struct klist * k) +{ + klist_node_init(k, n); + add_head(k, n); +} + +EXPORT_SYMBOL_GPL(klist_add_head); + + +/** + * klist_add_tail - Initialize a klist_node and add it to back. + * @n: node we're adding. + * @k: klist it's going on. 
+ */ + +void klist_add_tail(struct klist_node * n, struct klist * k) +{ + klist_node_init(k, n); + add_tail(k, n); +} + +EXPORT_SYMBOL_GPL(klist_add_tail); + + +static void klist_release(struct kref * kref) +{ + struct klist_node * n = container_of(kref, struct klist_node, n_ref); + + list_del(&n->n_node); + complete(&n->n_removed); + n->n_klist = NULL; +} + +static int klist_dec_and_del(struct klist_node * n) +{ + return kref_put_new(&n->n_ref, klist_release); +} + + +/** + * klist_del - Decrement the reference count of node and try to remove. + * @n: node we're deleting. + */ + +void klist_del(struct klist_node * n) +{ + struct klist * k = n->n_klist; + void (*put)(struct klist_node *) = k->put; + + spin_lock(&k->k_lock); + if (!klist_dec_and_del(n)) + put = NULL; + spin_unlock(&k->k_lock); + if (put) + put(n); +} + +EXPORT_SYMBOL_GPL(klist_del); + + +/** + * klist_remove - Decrement the refcount of node and wait for it to go away. + * @n: node we're removing. + */ + +void klist_remove(struct klist_node * n) +{ + klist_del(n); + wait_for_completion(&n->n_removed); +} + +EXPORT_SYMBOL_GPL(klist_remove); + + +/** + * klist_node_attached - Say whether a node is bound to a list or not. + * @n: Node that we're testing. + */ + +int klist_node_attached(struct klist_node * n) +{ + return (n->n_klist != NULL); +} + +EXPORT_SYMBOL_GPL(klist_node_attached); + + +/** + * klist_iter_init_node - Initialize a klist_iter structure. + * @k: klist we're iterating. + * @i: klist_iter we're filling. + * @n: node to start with. + * + * Similar to klist_iter_init(), but starts the action off with @n, + * instead of with the list head. + */ + +void klist_iter_init_node(struct klist * k, struct klist_iter * i, struct klist_node * n) +{ + i->i_klist = k; + i->i_head = &k->k_list; + i->i_cur = n; + if (n) + kref_get(&n->n_ref); +} + +EXPORT_SYMBOL_GPL(klist_iter_init_node); + + +/** + * klist_iter_init - Initialize a klist_iter structure. + * @k: klist we're iterating. 
+ * @i: klist_iter structure we're filling. + * + * Similar to klist_iter_init_node(), but start with the list head. + */ + +void klist_iter_init(struct klist * k, struct klist_iter * i) +{ + klist_iter_init_node(k, i, NULL); +} + +EXPORT_SYMBOL_GPL(klist_iter_init); + + +/** + * klist_iter_exit - Finish a list iteration. + * @i: Iterator structure. + * + * Must be called when done iterating over list, as it decrements the + * refcount of the current node. Necessary in case iteration exited before + * the end of the list was reached, and always good form. + */ + +void klist_iter_exit(struct klist_iter * i) +{ + if (i->i_cur) { + klist_del(i->i_cur); + i->i_cur = NULL; + } +} + +EXPORT_SYMBOL_GPL(klist_iter_exit); + + +static struct klist_node * to_klist_node(struct list_head * n) +{ + return container_of(n, struct klist_node, n_node); +} + + +/** + * klist_next - Ante up next node in list. + * @i: Iterator structure. + * + * First grab list lock. Decrement the reference count of the previous + * node, if there was one. Grab the next node, increment its reference + * count, drop the lock, and return that next node. 
+ */ + +struct klist_node * klist_next(struct klist_iter * i) +{ + struct list_head * next; + struct klist_node * lnode = i->i_cur; + struct klist_node * knode = NULL; + void (*put)(struct klist_node *) = i->i_klist->put; + + spin_lock(&i->i_klist->k_lock); + if (lnode) { + next = lnode->n_node.next; + if (!klist_dec_and_del(lnode)) + put = NULL; + } else + next = i->i_head->next; + + if (next != i->i_head) { + knode = to_klist_node(next); + kref_get(&knode->n_ref); + } + i->i_cur = knode; + spin_unlock(&i->i_klist->k_lock); + if (put && lnode) + put(lnode); + return knode; +} + +EXPORT_SYMBOL_GPL(klist_next); + + diff --git a/kernel_addons/backport/2.6.9_U4/include/src/kref_new.c b/kernel_addons/backport/2.6.9_U4/include/src/kref_new.c new file mode 100644 index 0000000..d45bb3f --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/kref_new.c @@ -0,0 +1,29 @@ +#include +#include + +/** + * kref_put_new - decrement refcount for object. + * @kref: object. + * @release: pointer to the function that will clean up the object when the + * last reference to the object is released. + * This pointer is required, and it is not acceptable to pass kfree + * in as this function. + * + * Decrement the refcount, and if 0, call release(). + * Return 1 if the object was removed, otherwise return 0. Beware, if this + * function returns 0, you still can not count on the kref remaining in + * memory. Only use the return value if you want to see if the kref is now + * gone, not present. 
+ */ +int kref_put_new(struct kref *kref, void (*release)(struct kref *kref)) +{ + WARN_ON(release == NULL); + WARN_ON(release == (void (*)(struct kref *))kfree); + + if (atomic_dec_and_test(&kref->refcount)) { + release(kref); + return 1; + } + return 0; +} +EXPORT_SYMBOL(kref_put_new); diff --git a/kernel_addons/backport/2.6.9_U4/include/src/scsi.c b/kernel_addons/backport/2.6.9_U4/include/src/scsi.c new file mode 100644 index 0000000..8c833c0 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/scsi.c @@ -0,0 +1,50 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +/** + * starget_for_each_device - helper to walk all devices of a target + * @starget: target whose devices we want to iterate over. + * + * This traverses over each devices of @shost. The devices have + * a reference that must be released by scsi_host_put when breaking + * out of the loop. 
+ */ +void starget_for_each_device(struct scsi_target *starget, void * data, + void (*fn)(struct scsi_device *, void *)) +{ + struct Scsi_Host *shost = dev_to_shost(starget->dev.parent); + struct scsi_device *sdev; + + printk("%s: entry\n", __FUNCTION__); + shost_for_each_device(sdev, shost) { + if ((sdev->channel == starget->channel) && + (sdev->id == starget->id)) + fn(sdev, data); + } + printk("%s: exit\n", __FUNCTION__); +} +EXPORT_SYMBOL(starget_for_each_device); diff --git a/kernel_addons/backport/2.6.9_U4/include/src/scsi_lib.c b/kernel_addons/backport/2.6.9_U4/include/src/scsi_lib.c new file mode 100644 index 0000000..f53f824 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/scsi_lib.c @@ -0,0 +1,166 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +int scsi_is_target_device(const struct device *dev) +{ + char *str = dev->bus_id; + + if (strncmp(str, "target", 6) == 0) { + return 1; + } + + return 0; +} + +/** + * scsi_internal_device_block - internal function to put a device + * temporarily into the SDEV_BLOCK state + * @sdev: device to block + * + * Block request made by scsi lld's to temporarily stop all + * scsi commands on the specified device. Called from interrupt + * or normal process context. + * + * Returns zero if successful or error if not + * + * Notes: + * This routine transitions the device to the SDEV_BLOCK state + * (which must be a legal transition). When the device is in this + * state, all commands are deferred until the scsi lld reenables + * the device with scsi_device_unblock or device_block_tmo fires. + * This routine assumes the host_lock is held on entry. 
+ **/ +int +scsi_internal_device_block(struct scsi_device *sdev) +{ + request_queue_t *q = sdev->request_queue; + unsigned long flags; + int err = 0; + + err = scsi_device_set_state(sdev, SDEV_BLOCK); + if (err) + return err; + + /* + * The device has transitioned to SDEV_BLOCK. Stop the + * block layer from calling the midlayer with this device's + * request queue. + */ + spin_lock_irqsave(q->queue_lock, flags); + blk_stop_queue(q); + spin_unlock_irqrestore(q->queue_lock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(scsi_internal_device_block); + +/** + * scsi_internal_device_unblock - resume a device after a block request + * @sdev: device to resume + * + * Called by scsi lld's or the midlayer to restart the device queue + * for the previously suspended scsi device. Called from interrupt or + * normal process context. + * + * Returns zero if successful or error if not. + * + * Notes: + * This routine transitions the device to the SDEV_RUNNING state + * (which must be a legal transition) allowing the midlayer to + * goose the queue for this device. This routine assumes the + * host_lock is held upon entry. + **/ +int +scsi_internal_device_unblock(struct scsi_device *sdev) +{ + request_queue_t *q = sdev->request_queue; + int err; + unsigned long flags; + + + /* + * Try to transition the scsi device to SDEV_RUNNING + * and goose the device queue if successful. 
+ */ + err = scsi_device_set_state(sdev, SDEV_RUNNING); + if (err) + return err; + + spin_lock_irqsave(q->queue_lock, flags); + blk_start_queue(q); + spin_unlock_irqrestore(q->queue_lock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(scsi_internal_device_unblock); + +static void +device_block(struct scsi_device *sdev, void *data) +{ + scsi_internal_device_block(sdev); +} + +static int +target_block(struct device *dev, void *data) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_block); + + return 0; +} + +void +scsi_target_block(struct device *dev) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_block); + else + device_for_each_child(dev, NULL, target_block); +} +EXPORT_SYMBOL_GPL(scsi_target_block); + +static void +device_unblock(struct scsi_device *sdev, void *data) +{ + scsi_internal_device_unblock(sdev); +} + +static int +target_unblock(struct device *dev, void *data) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_unblock); + return 0; +} + +void +scsi_target_unblock(struct device *dev) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_unblock); + else + device_for_each_child(dev, NULL, target_unblock); +} +EXPORT_SYMBOL_GPL(scsi_target_unblock); + +MODULE_LICENSE("GPL"); diff --git a/kernel_addons/backport/2.6.9_U4/include/src/scsi_scan.c b/kernel_addons/backport/2.6.9_U4/include/src/scsi_scan.c new file mode 100644 index 0000000..b7b7674 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/scsi_scan.c @@ -0,0 +1,48 @@ +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +/** + * int_to_scsilun: reverts an int into a scsi_lun + * @int: integer to be reverted + * @scsilun: struct scsi_lun to be set. 
+ * + * Description: + * Reverts the functionality of the scsilun_to_int, which packed + * an 8-byte lun value into an int. This routine unpacks the int + * back into the lun value. + * Note: the scsilun_to_int() routine does not truly handle all + * 8 bytes of the lun value. This function restores only as much + * as was set by the routine. + * + * Notes: + * Given an integer : 0x0b030a04, this function returns a + * scsi_lun of : struct scsi_lun of: 0a 04 0b 03 00 00 00 00 + * + **/ +void int_to_scsilun(unsigned int lun, struct scsi_lun *scsilun) +{ + int i; + + memset(scsilun->scsi_lun, 0, sizeof(scsilun->scsi_lun)); + + for (i = 0; i < sizeof(lun); i += 2) { + scsilun->scsi_lun[i] = (lun >> 8) & 0xFF; + scsilun->scsi_lun[i+1] = lun & 0xFF; + lun = lun >> 16; + } +} +EXPORT_SYMBOL(int_to_scsilun); diff --git a/kernel_addons/backport/2.6.9_U4/include/src/transport_class.c b/kernel_addons/backport/2.6.9_U4/include/src/transport_class.c new file mode 100644 index 0000000..f25e7c6 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U4/include/src/transport_class.c @@ -0,0 +1,280 @@ +/* + * transport_class.c - implementation of generic transport classes + * using attribute_containers + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + * + * The basic idea here is to allow any "device controller" (which + * would most often be a Host Bus Adapter) to use the services of one + * or more transport classes for performing transport specific + * services. Transport specific services are things that the generic + * command layer doesn't want to know about (speed settings, line + * conditioning, etc), but which the user might be interested in. + * Thus, the HBAs use the routines exported by the transport classes + * to perform these functions. The transport classes export certain + * values to the user via sysfs using attribute containers. 
+ * + * Note: because not every HBA will care about every transport + * attribute, there's a many to one relationship that goes like this: + * + * transport class<-----attribute container<----class device + * + * Usually the attribute container is per-HBA, but the design doesn't + * mandate that. Although most of the services will be specific to + * the actual external storage connection used by the HBA, the generic + * transport class is framed entirely in terms of generic devices to + * allow it to be used by any physical HBA in the system. + */ +#include +#include + +/** + * transport_class_register - register an initial transport class + * + * @tclass: a pointer to the transport class structure to be initialised + * + * The transport class contains an embedded class which is used to + * identify it. The caller should initialise this structure with + * zeros and then the generic class must be initialised with the + * actual transport class unique name. There's a macro + * DECLARE_TRANSPORT_CLASS() to do this (declared classes still must + * be registered). + * + * Returns 0 on success or error on failure. + */ +int transport_class_register(struct transport_class *tclass) +{ + return class_register(&tclass->class); +} +EXPORT_SYMBOL_GPL(transport_class_register); + +/** + * transport_class_unregister - unregister a previously registered class + * + * @tclass: The transport class to unregister + * + * Must be called prior to deallocating the memory for the transport + * class. 
+ */ +void transport_class_unregister(struct transport_class *tclass) +{ + class_unregister(&tclass->class); +} +EXPORT_SYMBOL_GPL(transport_class_unregister); + +static int anon_transport_dummy_function(struct transport_container *tc, + struct device *dev, + struct class_device *cdev) +{ + /* do nothing */ + return 0; +} + +/** + * anon_transport_class_register - register an anonymous class + * + * @atc: The anon transport class to register + * + * The anonymous transport class contains both a transport class and a + * container. The idea of an anonymous class is that it never + * actually has any device attributes associated with it (and thus + * saves on container storage). So it can only be used for triggering + * events. Use prezero and then use DECLARE_ANON_TRANSPORT_CLASS() to + * initialise the anon transport class storage. + */ +int anon_transport_class_register(struct anon_transport_class *atc) +{ + int error; + atc->container.class = &atc->tclass.class; + attribute_container_set_no_classdevs(&atc->container); + error = attribute_container_register(&atc->container); + if (error) + return error; + atc->tclass.setup = anon_transport_dummy_function; + atc->tclass.remove = anon_transport_dummy_function; + return 0; +} +EXPORT_SYMBOL_GPL(anon_transport_class_register); + +/** + * anon_transport_class_unregister - unregister an anon class + * + * @atc: Pointer to the anon transport class to unregister + * + * Must be called prior to deallocating the memory for the anon + * transport class. 
+ */ +void anon_transport_class_unregister(struct anon_transport_class *atc) +{ + attribute_container_unregister(&atc->container); +} +EXPORT_SYMBOL_GPL(anon_transport_class_unregister); + +static int transport_setup_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + struct transport_container *tcont = attribute_container_to_transport_container(cont); + + if (tclass->setup) + tclass->setup(tcont, dev, classdev); + + return 0; +} + +/** + * transport_setup_device - declare a new dev for transport class association + * but don't make it visible yet. + * + * @dev: the generic device representing the entity being added + * + * Usually, dev represents some component in the HBA system (either + * the HBA itself or a device remote across the HBA bus). This + * routine is simply a trigger point to see if any set of transport + * classes wishes to associate with the added device. This allocates + * storage for the class device and initialises it, but does not yet + * add it to the system or add attributes to it (you do this with + * transport_add_device). If you have no need for a separate setup + * and add operations, use transport_register_device (see + * transport_class.h). 
+ */ + +void transport_setup_device(struct device *dev) +{ + attribute_container_add_device(dev, transport_setup_classdev); +} +EXPORT_SYMBOL_GPL(transport_setup_device); + +static int transport_add_class_device(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + int error = attribute_container_add_class_device(classdev); + struct transport_container *tcont = + attribute_container_to_transport_container(cont); + + if (!error && tcont->statistics) + error = sysfs_create_group(&classdev->kobj, tcont->statistics); + + return error; +} + + +/** + * transport_add_device - declare a new dev for transport class association + * + * @dev: the generic device representing the entity being added + * + * Usually, dev represents some component in the HBA system (either + * the HBA itself or a device remote across the HBA bus). This + * routine is simply a trigger point used to add the device to the + * system and register attributes for it. + */ + +void transport_add_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_add_class_device); +} +EXPORT_SYMBOL_GPL(transport_add_device); + +static int transport_configure(struct attribute_container *cont, + struct device *dev, + struct class_device *cdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + struct transport_container *tcont = attribute_container_to_transport_container(cont); + + if (tclass->configure) + tclass->configure(tcont, dev, cdev); + + return 0; +} + +/** + * transport_configure_device - configure an already set up device + * + * @dev: generic device representing device to be configured + * + * The idea of configure is simply to provide a point within the setup + * process to allow the transport class to extract information from a + * device after it has been setup. 
This is used in SCSI because we + * have to have a setup device to begin using the HBA, but after we + * send the initial inquiry, we use configure to extract the device + * parameters. The device need not have been added to be configured. + */ +void transport_configure_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_configure); +} +EXPORT_SYMBOL_GPL(transport_configure_device); + +static int transport_remove_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_container *tcont = + attribute_container_to_transport_container(cont); + struct transport_class *tclass = class_to_transport_class(cont->class); + + if (tclass->remove) + tclass->remove(tcont, dev, classdev); + + if (tclass->remove != anon_transport_dummy_function) { + if (tcont->statistics) + sysfs_remove_group(&classdev->kobj, tcont->statistics); + attribute_container_class_device_del(classdev); + } + + return 0; +} + + +/** + * transport_remove_device - remove the visibility of a device + * + * @dev: generic device to remove + * + * This call removes the visibility of the device (to the user from + * sysfs), but does not destroy it. To eliminate a device entirely + * you must also call transport_destroy_device. If you don't need to + * do remove and destroy as separate operations, use + * transport_unregister_device() (see transport_class.h) which will + * perform both calls for you. 
+ */ +void transport_remove_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_remove_classdev); +} +EXPORT_SYMBOL_GPL(transport_remove_device); + +static void transport_destroy_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + + if (tclass->remove != anon_transport_dummy_function) + class_device_put(classdev); +} + + +/** + * transport_destroy_device - destroy a removed device + * + * @dev: device to eliminate from the transport class. + * + * This call triggers the elimination of storage associated with the + * transport classdev. Note: all it really does is relinquish a + * reference to the classdev. The memory will not be freed until the + * last reference goes to zero. Note also that the classdev retains a + * reference count on dev, so dev too will remain for as long as the + * transport class device remains around. + */ +void transport_destroy_device(struct device *dev) +{ + attribute_container_remove_device(dev, transport_destroy_classdev); +} +EXPORT_SYMBOL_GPL(transport_destroy_device); diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/attribute_container.h b/kernel_addons/backport/2.6.9_U5/include/linux/attribute_container.h new file mode 100644 index 0000000..93bfb0b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/linux/attribute_container.h @@ -0,0 +1,71 @@ +/* + * class_container.h - a generic container for all classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + */ + +#ifndef _ATTRIBUTE_CONTAINER_H_ +#define _ATTRIBUTE_CONTAINER_H_ + +#include +#include +#include +#include + +struct attribute_container { + struct list_head node; + struct klist containers; + struct class *class; + struct class_device_attribute **attrs; + int (*match)(struct attribute_container *, struct device *); +#define ATTRIBUTE_CONTAINER_NO_CLASSDEVS 0x01 + 
unsigned long flags; +}; + +static inline int +attribute_container_no_classdevs(struct attribute_container *atc) +{ + return atc->flags & ATTRIBUTE_CONTAINER_NO_CLASSDEVS; +} + +static inline void +attribute_container_set_no_classdevs(struct attribute_container *atc) +{ + atc->flags |= ATTRIBUTE_CONTAINER_NO_CLASSDEVS; +} + +int attribute_container_register(struct attribute_container *cont); +int attribute_container_unregister(struct attribute_container *cont); +void attribute_container_create_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_add_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_remove_device(struct device *dev, + void (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_device_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)); +void attribute_container_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *)); +int attribute_container_add_attrs(struct class_device *classdev); +int attribute_container_add_class_device(struct class_device *classdev); +int attribute_container_add_class_device_adapter(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev); +void attribute_container_remove_attrs(struct class_device *classdev); +void attribute_container_class_device_del(struct class_device *classdev); +struct attribute_container *attribute_container_classdev_to_container(struct class_device *); +struct class_device *attribute_container_find_class_device(struct attribute_container *, struct device *); +struct class_device_attribute **attribute_container_classdev_to_attrs(const struct class_device *classdev); + +#endif diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/crypto.h 
b/kernel_addons/backport/2.6.9_U5/include/linux/crypto.h new file mode 100644 index 0000000..aecccde --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/linux/crypto.h @@ -0,0 +1,11 @@ +#ifndef LINUX_CRYPTO_BACKPORT_H +#define LINUX_CRYPTO_BACKPORT_H + +#include_next + +#define crypto_hash_init(desc) crypto_digest_init(*desc) +#define crypto_hash_digest(desc, sg, nbytes, out) crypto_digest_digest(*desc, sg, 1, out) +#define crypto_hash_update(desc, sg, nbytes) crypto_digest_update(*desc, sg, 1) +#define crypto_hash_final(desc, out) crypto_digest_final(*desc, out) + +#endif diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/kernel.h b/kernel_addons/backport/2.6.9_U5/include/linux/kernel.h index a37dcd5..02a5907 100644 --- a/kernel_addons/backport/2.6.9_U5/include/linux/kernel.h +++ b/kernel_addons/backport/2.6.9_U5/include/linux/kernel.h @@ -4,4 +4,7 @@ #define BACKPORT_KERNEL_H_2_6_19 #include_next #include +#define NIP6_FMT "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x" +#define NIPQUAD_FMT "%u.%u.%u.%u" + #endif diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/kfifo.h b/kernel_addons/backport/2.6.9_U5/include/linux/kfifo.h index 48eccd8..2b94461 100644 --- a/kernel_addons/backport/2.6.9_U5/include/linux/kfifo.h +++ b/kernel_addons/backport/2.6.9_U5/include/linux/kfifo.h @@ -25,6 +25,7 @@ #ifdef __KERNEL__ #include #include +#include struct kfifo { unsigned char *buffer; /* the buffer holding the data */ diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/klist.h b/kernel_addons/backport/2.6.9_U5/include/linux/klist.h new file mode 100644 index 0000000..7407125 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/linux/klist.h @@ -0,0 +1,61 @@ +/* + * klist.h - Some generic list helpers, extending struct list_head a bit. + * + * Implementations are found in lib/klist.c + * + * + * Copyright (C) 2005 Patrick Mochel + * + * This file is released under the GPL v2.
+ */ + +#ifndef _LINUX_KLIST_H +#define _LINUX_KLIST_H + +#include +#include +#include +#include + +struct klist_node; +struct klist { + spinlock_t k_lock; + struct list_head k_list; + void (*get)(struct klist_node *); + void (*put)(struct klist_node *); +}; + + +extern void klist_init(struct klist * k, void (*get)(struct klist_node *), + void (*put)(struct klist_node *)); + +struct klist_node { + struct klist * n_klist; + struct list_head n_node; + struct kref n_ref; + struct completion n_removed; +}; + +extern void klist_add_tail(struct klist_node * n, struct klist * k); +extern void klist_add_head(struct klist_node * n, struct klist * k); + +extern void klist_del(struct klist_node * n); +extern void klist_remove(struct klist_node * n); + +extern int klist_node_attached(struct klist_node * n); + + +struct klist_iter { + struct klist * i_klist; + struct list_head * i_head; + struct klist_node * i_cur; +}; + + +extern void klist_iter_init(struct klist * k, struct klist_iter * i); +extern void klist_iter_init_node(struct klist * k, struct klist_iter * i, + struct klist_node * n); +extern void klist_iter_exit(struct klist_iter * i); +extern struct klist_node * klist_next(struct klist_iter * i); + +#endif diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/memory.h b/kernel_addons/backport/2.6.9_U5/include/linux/memory.h new file mode 100644 index 0000000..654ef55 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/linux/memory.h @@ -0,0 +1,89 @@ +/* + * include/linux/memory.h - generic memory definition + * + * This is mainly for topological representation. We define the + * basic "struct memory_block" here, which can be embedded in per-arch + * definitions or NUMA information. + * + * Basic handling of the devices is done in drivers/base/memory.c + * and system devices are handled in drivers/base/sys.c. + * + * Memory blocks are exported via sysfs in the class/memory/devices/ + * directory.
+ * + */ +#ifndef _LINUX_MEMORY_H_ +#define _LINUX_MEMORY_H_ + +#include +#include +#include + +#include + +struct memory_block { + unsigned long phys_index; + unsigned long state; + /* + * This serializes all state change requests. It isn't + * held during creation because the control files are + * created long after the critical areas during + * initialization. + */ + struct semaphore state_sem; + int phys_device; /* to which fru does this belong? */ + void *hw; /* optional pointer to fw/hw data */ + int (*phys_callback)(struct memory_block *); + struct sys_device sysdev; +}; + +/* These states are exposed to userspace as text strings in sysfs */ +#define MEM_ONLINE (1<<0) /* exposed to userspace */ +#define MEM_GOING_OFFLINE (1<<1) /* exposed to userspace */ +#define MEM_OFFLINE (1<<2) /* exposed to userspace */ + +/* + * All of these states are currently kernel-internal for notifying + * kernel components and architectures. + * + * For MEM_MAPPING_INVALID, all notifier chains with priority >0 + * are called before pfn_to_page() becomes invalid. The priority=0 + * entry is reserved for the function that actually makes + * pfn_to_page() stop working. Any notifiers that want to be called + * after that should have priority <0. 
+ */ +#define MEM_MAPPING_INVALID (1<<3) + +struct notifier_block; +struct mem_section; + +#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE +static inline int memory_dev_init(void) +{ + return 0; +} +static inline int register_memory_notifier(struct notifier_block *nb) +{ + return 0; +} +static inline void unregister_memory_notifier(struct notifier_block *nb) +{ +} +#else +extern int register_new_memory(struct mem_section *); +extern int unregister_memory_section(struct mem_section *); +extern int memory_dev_init(void); +extern int remove_memory_block(unsigned long, struct mem_section *, int); + +#define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION< + +#define __nlmsg_put(skb, daemon_pid, seq, type, len, flags) \ + __nlmsg_put(skb, daemon_pid, 0, 0, len) + +#define netlink_kernel_create(uint, groups, input, mod) \ + netlink_kernel_create(uint, input) + +#define NETLINK_ISCSI 8 + +#endif diff --git a/kernel_addons/backport/2.6.9_U5/include/linux/transport_class.h b/kernel_addons/backport/2.6.9_U5/include/linux/transport_class.h new file mode 100644 index 0000000..1d6cc22 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/linux/transport_class.h @@ -0,0 +1,100 @@ +/* + * transport_class.h - a generic container for all transport classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + */ + +#ifndef _TRANSPORT_CLASS_H_ +#define _TRANSPORT_CLASS_H_ + +#include +#include + +struct transport_container; + +struct transport_class { + struct class class; + int (*setup)(struct transport_container *, struct device *, + struct class_device *); + int (*configure)(struct transport_container *, struct device *, + struct class_device *); + int (*remove)(struct transport_container *, struct device *, + struct class_device *); +}; + +#define DECLARE_TRANSPORT_CLASS(cls, nm, su, rm, cfg) \ +struct transport_class cls = { \ + .class = { \ + .name = nm, \ + }, \ + .setup = su, \ + .remove = rm, \ + .configure = cfg, \ +} + + +struct anon_transport_class 
{ + struct transport_class tclass; + struct attribute_container container; +}; + +#define DECLARE_ANON_TRANSPORT_CLASS(cls, mtch, cfg) \ +struct anon_transport_class cls = { \ + .tclass = { \ + .configure = cfg, \ + }, \ + . container = { \ + .match = mtch, \ + }, \ +} + +#define class_to_transport_class(x) \ + container_of(x, struct transport_class, class) + +struct transport_container { + struct attribute_container ac; + struct attribute_group *statistics; +}; + +#define attribute_container_to_transport_container(x) \ + container_of(x, struct transport_container, ac) + +void transport_remove_device(struct device *); +void transport_add_device(struct device *); +void transport_setup_device(struct device *); +void transport_configure_device(struct device *); +void transport_destroy_device(struct device *); + +static inline void +transport_register_device(struct device *dev) +{ + transport_setup_device(dev); + transport_add_device(dev); +} + +static inline void +transport_unregister_device(struct device *dev) +{ + transport_remove_device(dev); + transport_destroy_device(dev); +} + +static inline int transport_container_register(struct transport_container *tc) +{ + return attribute_container_register(&tc->ac); +} + +static inline int transport_container_unregister(struct transport_container *tc) +{ + return attribute_container_unregister(&tc->ac); +} + +int transport_class_register(struct transport_class *); +int anon_transport_class_register(struct anon_transport_class *); +void transport_class_unregister(struct transport_class *); +void anon_transport_class_unregister(struct anon_transport_class *); + + +#endif diff --git a/kernel_addons/backport/2.6.9_U5/include/scsi/iscsi_proto.h b/kernel_addons/backport/2.6.9_U5/include/scsi/iscsi_proto.h new file mode 100644 index 0000000..02f6e4b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/scsi/iscsi_proto.h @@ -0,0 +1,587 @@ +/* + * RFC 3720 (iSCSI) protocol data types + * + * Copyright (C) 2005 Dmitry Yusupov 
+ * Copyright (C) 2005 Alex Aizman + * maintained by open-iscsi at googlegroups.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published + * by the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * See the file COPYING included with this distribution for more details. + */ + +#ifndef ISCSI_PROTO_H +#define ISCSI_PROTO_H + +#define ISCSI_DRAFT20_VERSION 0x00 + +/* default iSCSI listen port for incoming connections */ +#define ISCSI_LISTEN_PORT 3260 + +/* Padding word length */ +#define PAD_WORD_LEN 4 + +/* + * useful common (control and data paths) macros + */ +#define ntoh24(p) (((p)[0] << 16) | ((p)[1] << 8) | ((p)[2])) +#define hton24(p, v) { \ + p[0] = (((v) >> 16) & 0xFF); \ + p[1] = (((v) >> 8) & 0xFF); \ + p[2] = ((v) & 0xFF); \ +} +#define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;} + +/* + * iSCSI Template Message Header + */ +struct iscsi_hdr { + uint8_t opcode; + uint8_t flags; /* Final bit */ + uint8_t rsvd2[2]; + uint8_t hlength; /* AHSs total length */ + uint8_t dlength[3]; /* Data length */ + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 ttt; /* Target Task Tag */ + __be32 statsn; + __be32 exp_statsn; + __be32 max_statsn; + uint8_t other[12]; +}; + +/************************* RFC 3720 Begin *****************************/ + +#define ISCSI_RESERVED_TAG 0xffffffff + +/* Opcode encoding bits */ +#define ISCSI_OP_RETRY 0x80 +#define ISCSI_OP_IMMEDIATE 0x40 +#define ISCSI_OPCODE_MASK 0x3F + +/* Initiator Opcode values */ +#define ISCSI_OP_NOOP_OUT 0x00 +#define ISCSI_OP_SCSI_CMD 0x01 +#define ISCSI_OP_SCSI_TMFUNC 0x02 +#define ISCSI_OP_LOGIN 0x03
+#define ISCSI_OP_TEXT 0x04 +#define ISCSI_OP_SCSI_DATA_OUT 0x05 +#define ISCSI_OP_LOGOUT 0x06 +#define ISCSI_OP_SNACK 0x10 + +#define ISCSI_OP_VENDOR1_CMD 0x1c +#define ISCSI_OP_VENDOR2_CMD 0x1d +#define ISCSI_OP_VENDOR3_CMD 0x1e +#define ISCSI_OP_VENDOR4_CMD 0x1f + +/* Target Opcode values */ +#define ISCSI_OP_NOOP_IN 0x20 +#define ISCSI_OP_SCSI_CMD_RSP 0x21 +#define ISCSI_OP_SCSI_TMFUNC_RSP 0x22 +#define ISCSI_OP_LOGIN_RSP 0x23 +#define ISCSI_OP_TEXT_RSP 0x24 +#define ISCSI_OP_SCSI_DATA_IN 0x25 +#define ISCSI_OP_LOGOUT_RSP 0x26 +#define ISCSI_OP_R2T 0x31 +#define ISCSI_OP_ASYNC_EVENT 0x32 +#define ISCSI_OP_REJECT 0x3f + +struct iscsi_ahs_hdr { + __be16 ahslength; + uint8_t ahstype; + uint8_t ahspec[5]; +}; + +#define ISCSI_AHSTYPE_CDB 1 +#define ISCSI_AHSTYPE_RLENGTH 2 + +/* iSCSI PDU Header */ +struct iscsi_cmd { + uint8_t opcode; + uint8_t flags; + __be16 rsvd2; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 data_length; + __be32 cmdsn; + __be32 exp_statsn; + uint8_t cdb[16]; /* SCSI Command Block */ + /* Additional Data (Command Dependent) */ +}; + +/* Command PDU flags */ +#define ISCSI_FLAG_CMD_FINAL 0x80 +#define ISCSI_FLAG_CMD_READ 0x40 +#define ISCSI_FLAG_CMD_WRITE 0x20 +#define ISCSI_FLAG_CMD_ATTR_MASK 0x07 /* 3 bits */ + +/* SCSI Command Attribute values */ +#define ISCSI_ATTR_UNTAGGED 0 +#define ISCSI_ATTR_SIMPLE 1 +#define ISCSI_ATTR_ORDERED 2 +#define ISCSI_ATTR_HEAD_OF_QUEUE 3 +#define ISCSI_ATTR_ACA 4 + +struct iscsi_rlength_ahdr { + __be16 ahslength; + uint8_t ahstype; + uint8_t reserved; + __be32 read_length; +}; + +/* SCSI Response Header */ +struct iscsi_cmd_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t response; + uint8_t cmd_status; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 rsvd1; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 exp_datasn; + __be32 bi_residual_count; + __be32 residual_count; + 
/* Response or Sense Data (optional) */ +}; + +/* Command Response PDU flags */ +#define ISCSI_FLAG_CMD_BIDI_OVERFLOW 0x10 +#define ISCSI_FLAG_CMD_BIDI_UNDERFLOW 0x08 +#define ISCSI_FLAG_CMD_OVERFLOW 0x04 +#define ISCSI_FLAG_CMD_UNDERFLOW 0x02 + +/* iSCSI Status values. Valid if Rsp Selector bit is not set */ +#define ISCSI_STATUS_CMD_COMPLETED 0 +#define ISCSI_STATUS_TARGET_FAILURE 1 +#define ISCSI_STATUS_SUBSYS_FAILURE 2 + +/* Asynchronous Event Header */ +struct iscsi_async { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t rsvd3; + uint8_t dlength[3]; + uint8_t lun[8]; + uint8_t rsvd4[8]; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t async_event; + uint8_t async_vcode; + __be16 param1; + __be16 param2; + __be16 param3; + uint8_t rsvd5[4]; +}; + +/* iSCSI Event Codes */ +#define ISCSI_ASYNC_MSG_SCSI_EVENT 0 +#define ISCSI_ASYNC_MSG_REQUEST_LOGOUT 1 +#define ISCSI_ASYNC_MSG_DROPPING_CONNECTION 2 +#define ISCSI_ASYNC_MSG_DROPPING_ALL_CONNECTIONS 3 +#define ISCSI_ASYNC_MSG_PARAM_NEGOTIATION 4 +#define ISCSI_ASYNC_MSG_VENDOR_SPECIFIC 255 + +/* NOP-Out Message */ +struct iscsi_nopout { + uint8_t opcode; + uint8_t flags; + __be16 rsvd2; + uint8_t rsvd3; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 ttt; /* Target Transfer Tag */ + __be32 cmdsn; + __be32 exp_statsn; + uint8_t rsvd4[16]; +}; + +/* NOP-In Message */ +struct iscsi_nopin { + uint8_t opcode; + uint8_t flags; + __be16 rsvd2; + uint8_t rsvd3; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 ttt; /* Target Transfer Tag */ + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t rsvd4[12]; +}; + +/* SCSI Task Management Message Header */ +struct iscsi_tm { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd1[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 rtt; /* Reference Task Tag */ + __be32 cmdsn; + __be32 exp_statsn; + 
__be32 refcmdsn; + __be32 exp_datasn; + uint8_t rsvd2[8]; +}; + +#define ISCSI_FLAG_TM_FUNC_MASK 0x7F + +/* Function values */ +#define ISCSI_TM_FUNC_ABORT_TASK 1 +#define ISCSI_TM_FUNC_ABORT_TASK_SET 2 +#define ISCSI_TM_FUNC_CLEAR_ACA 3 +#define ISCSI_TM_FUNC_CLEAR_TASK_SET 4 +#define ISCSI_TM_FUNC_LOGICAL_UNIT_RESET 5 +#define ISCSI_TM_FUNC_TARGET_WARM_RESET 6 +#define ISCSI_TM_FUNC_TARGET_COLD_RESET 7 +#define ISCSI_TM_FUNC_TASK_REASSIGN 8 + +/* SCSI Task Management Response Header */ +struct iscsi_tm_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t response; /* see Response values below */ + uint8_t qualifier; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd2[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 rtt; /* Reference Task Tag */ + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t rsvd3[12]; +}; + +/* Response values */ +#define ISCSI_TMF_RSP_COMPLETE 0x00 +#define ISCSI_TMF_RSP_NO_TASK 0x01 +#define ISCSI_TMF_RSP_NO_LUN 0x02 +#define ISCSI_TMF_RSP_TASK_ALLEGIANT 0x03 +#define ISCSI_TMF_RSP_NO_FAILOVER 0x04 +#define ISCSI_TMF_RSP_NOT_SUPPORTED 0x05 +#define ISCSI_TMF_RSP_AUTH_FAILED 0x06 +#define ISCSI_TMF_RSP_REJECTED 0xff + +/* Ready To Transfer Header */ +struct iscsi_r2t_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 ttt; /* Target Transfer Tag */ + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 r2tsn; + __be32 data_offset; + __be32 data_length; +}; + +/* SCSI Data Hdr */ +struct iscsi_data { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t rsvd3; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; + __be32 ttt; + __be32 rsvd4; + __be32 exp_statsn; + __be32 rsvd5; + __be32 datasn; + __be32 offset; + __be32 rsvd6; + /* Payload */ +}; + +/* SCSI Data Response Hdr */ +struct iscsi_data_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2; + uint8_t cmd_status; + uint8_t 
hlength; + uint8_t dlength[3]; + uint8_t lun[8]; + __be32 itt; + __be32 ttt; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 datasn; + __be32 offset; + __be32 residual_count; +}; + +/* Data Response PDU flags */ +#define ISCSI_FLAG_DATA_ACK 0x40 +#define ISCSI_FLAG_DATA_OVERFLOW 0x04 +#define ISCSI_FLAG_DATA_UNDERFLOW 0x02 +#define ISCSI_FLAG_DATA_STATUS 0x01 + +/* Text Header */ +struct iscsi_text { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd4[8]; + __be32 itt; + __be32 ttt; + __be32 cmdsn; + __be32 exp_statsn; + uint8_t rsvd5[16]; + /* Text - key=value pairs */ +}; + +#define ISCSI_FLAG_TEXT_CONTINUE 0x40 + +/* Text Response Header */ +struct iscsi_text_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd2[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd4[8]; + __be32 itt; + __be32 ttt; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t rsvd5[12]; + /* Text Response - key:value pairs */ +}; + +/* Login Header */ +struct iscsi_login { + uint8_t opcode; + uint8_t flags; + uint8_t max_version; /* Max. version supported */ + uint8_t min_version; /* Min. 
version supported */ + uint8_t hlength; + uint8_t dlength[3]; + uint8_t isid[6]; /* Initiator Session ID */ + __be16 tsih; /* Target Session Handle */ + __be32 itt; /* Initiator Task Tag */ + __be16 cid; + __be16 rsvd3; + __be32 cmdsn; + __be32 exp_statsn; + uint8_t rsvd5[16]; +}; + +/* Login PDU flags */ +#define ISCSI_FLAG_LOGIN_TRANSIT 0x80 +#define ISCSI_FLAG_LOGIN_CONTINUE 0x40 +#define ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK 0x0C /* 2 bits */ +#define ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK 0x03 /* 2 bits */ + +#define ISCSI_LOGIN_CURRENT_STAGE(flags) \ + ((flags & ISCSI_FLAG_LOGIN_CURRENT_STAGE_MASK) >> 2) +#define ISCSI_LOGIN_NEXT_STAGE(flags) \ + (flags & ISCSI_FLAG_LOGIN_NEXT_STAGE_MASK) + +/* Login Response Header */ +struct iscsi_login_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t max_version; /* Max. version supported */ + uint8_t active_version; /* Active version */ + uint8_t hlength; + uint8_t dlength[3]; + uint8_t isid[6]; /* Initiator Session ID */ + __be16 tsih; /* Target Session Handle */ + __be32 itt; /* Initiator Task Tag */ + __be32 rsvd3; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + uint8_t status_class; /* see Login RSP status classes below */ + uint8_t status_detail; /* see Login RSP Status details below */ + uint8_t rsvd4[10]; +}; + +/* Login stage (phase) codes for CSG, NSG */ +#define ISCSI_INITIAL_LOGIN_STAGE -1 +#define ISCSI_SECURITY_NEGOTIATION_STAGE 0 +#define ISCSI_OP_PARMS_NEGOTIATION_STAGE 1 +#define ISCSI_FULL_FEATURE_PHASE 3 + +/* Login Status response classes */ +#define ISCSI_STATUS_CLS_SUCCESS 0x00 +#define ISCSI_STATUS_CLS_REDIRECT 0x01 +#define ISCSI_STATUS_CLS_INITIATOR_ERR 0x02 +#define ISCSI_STATUS_CLS_TARGET_ERR 0x03 + +/* Login Status response detail codes */ +/* Class-0 (Success) */ +#define ISCSI_LOGIN_STATUS_ACCEPT 0x00 + +/* Class-1 (Redirection) */ +#define ISCSI_LOGIN_STATUS_TGT_MOVED_TEMP 0x01 +#define ISCSI_LOGIN_STATUS_TGT_MOVED_PERM 0x02 + +/* Class-2 (Initiator Error) */ +#define
ISCSI_LOGIN_STATUS_INIT_ERR 0x00 +#define ISCSI_LOGIN_STATUS_AUTH_FAILED 0x01 +#define ISCSI_LOGIN_STATUS_TGT_FORBIDDEN 0x02 +#define ISCSI_LOGIN_STATUS_TGT_NOT_FOUND 0x03 +#define ISCSI_LOGIN_STATUS_TGT_REMOVED 0x04 +#define ISCSI_LOGIN_STATUS_NO_VERSION 0x05 +#define ISCSI_LOGIN_STATUS_ISID_ERROR 0x06 +#define ISCSI_LOGIN_STATUS_MISSING_FIELDS 0x07 +#define ISCSI_LOGIN_STATUS_CONN_ADD_FAILED 0x08 +#define ISCSI_LOGIN_STATUS_NO_SESSION_TYPE 0x09 +#define ISCSI_LOGIN_STATUS_NO_SESSION 0x0a +#define ISCSI_LOGIN_STATUS_INVALID_REQUEST 0x0b + +/* Class-3 (Target Error) */ +#define ISCSI_LOGIN_STATUS_TARGET_ERROR 0x00 +#define ISCSI_LOGIN_STATUS_SVC_UNAVAILABLE 0x01 +#define ISCSI_LOGIN_STATUS_NO_RESOURCES 0x02 + +/* Logout Header */ +struct iscsi_logout { + uint8_t opcode; + uint8_t flags; + uint8_t rsvd1[2]; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd2[8]; + __be32 itt; /* Initiator Task Tag */ + __be16 cid; + uint8_t rsvd3[2]; + __be32 cmdsn; + __be32 exp_statsn; + uint8_t rsvd4[16]; +}; + +/* Logout PDU flags */ +#define ISCSI_FLAG_LOGOUT_REASON_MASK 0x7F + +/* logout reason_code values */ + +#define ISCSI_LOGOUT_REASON_CLOSE_SESSION 0 +#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION 1 +#define ISCSI_LOGOUT_REASON_RECOVERY 2 +#define ISCSI_LOGOUT_REASON_AEN_REQUEST 3 + +/* Logout Response Header */ +struct iscsi_logout_rsp { + uint8_t opcode; + uint8_t flags; + uint8_t response; /* see Logout response values below */ + uint8_t rsvd2; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd3[8]; + __be32 itt; /* Initiator Task Tag */ + __be32 rsvd4; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 rsvd5; + __be16 t2wait; + __be16 t2retain; + __be32 rsvd6; +}; + +/* logout response status values */ + +#define ISCSI_LOGOUT_SUCCESS 0 +#define ISCSI_LOGOUT_CID_NOT_FOUND 1 +#define ISCSI_LOGOUT_RECOVERY_UNSUPPORTED 2 +#define ISCSI_LOGOUT_CLEANUP_FAILED 3 + +/* SNACK Header */ +struct iscsi_snack { + uint8_t opcode; + uint8_t flags; + uint8_t 
rsvd2[14]; + __be32 itt; + __be32 begrun; + __be32 runlength; + __be32 exp_statsn; + __be32 rsvd3; + __be32 exp_datasn; + uint8_t rsvd6[8]; +}; + +/* SNACK PDU flags */ +#define ISCSI_FLAG_SNACK_TYPE_MASK 0x0F /* 4 bits */ + +/* Reject Message Header */ +struct iscsi_reject { + uint8_t opcode; + uint8_t flags; + uint8_t reason; + uint8_t rsvd2; + uint8_t hlength; + uint8_t dlength[3]; + uint8_t rsvd3[8]; + __be32 ffffffff; + uint8_t rsvd4[4]; + __be32 statsn; + __be32 exp_cmdsn; + __be32 max_cmdsn; + __be32 datasn; + uint8_t rsvd5[8]; + /* Text - Rejected hdr */ +}; + +/* Reason for Reject */ +#define ISCSI_REASON_CMD_BEFORE_LOGIN 1 +#define ISCSI_REASON_DATA_DIGEST_ERROR 2 +#define ISCSI_REASON_DATA_SNACK_REJECT 3 +#define ISCSI_REASON_PROTOCOL_ERROR 4 +#define ISCSI_REASON_CMD_NOT_SUPPORTED 5 +#define ISCSI_REASON_IMM_CMD_REJECT 6 +#define ISCSI_REASON_TASK_IN_PROGRESS 7 +#define ISCSI_REASON_INVALID_SNACK 8 +#define ISCSI_REASON_BOOKMARK_INVALID 9 +#define ISCSI_REASON_BOOKMARK_NO_RESOURCES 10 +#define ISCSI_REASON_NEGOTIATION_RESET 11 + +/* Max. 
number of Key=Value pairs in a text message */ +#define MAX_KEY_VALUE_PAIRS 8192 + +/* maximum length for text keys/values */ +#define KEY_MAXLEN 64 +#define VALUE_MAXLEN 255 +#define TARGET_NAME_MAXLEN VALUE_MAXLEN + +#define DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH 8192 + +/************************* RFC 3720 End *****************************/ + +#endif /* ISCSI_PROTO_H */ diff --git a/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_device.h b/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_device.h new file mode 100644 index 0000000..f353e0b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_device.h @@ -0,0 +1,19 @@ +#ifndef _SCSI_SCSI_DEVICE_H_BACKPORT +#define _SCSI_SCSI_DEVICE_H_BACKPORT + +#include_next + +#include +#include +#include +#include +#include + +struct scsi_lun; + +extern void int_to_scsilun(unsigned int, struct scsi_lun *); +extern void scsi_target_block(struct device *); +extern void scsi_target_unblock(struct device *); +extern void starget_for_each_device(struct scsi_target *, void *, + void (*fn)(struct scsi_device *, void *)); +#endif /* _SCSI_SCSI_DEVICE_H_BACKPORT */ diff --git a/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_host.h b/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_host.h new file mode 100644 index 0000000..b7e019b --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_host.h @@ -0,0 +1,8 @@ +#ifndef _SCSI_SCSI_HOST_H_BACKPORT +#define _SCSI_SCSI_HOST_H_BACKPORT + +#include_next + +#define scsi_queue_work(shost, work) schedule_work(work) + +#endif diff --git a/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_transport.h b/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_transport.h new file mode 100644 index 0000000..99c2b12 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/scsi/scsi_transport.h @@ -0,0 +1,8 @@ +#ifndef _SCSI_SCSI_TRANSPORT_H_BACKPORT +#define _SCSI_SCSI_TRANSPORT_H_BACKPORT + +#include_next + +#include + +#endif /* _SCSI_SCSI_TRANSPORT_H_BACKPORT */ diff 
--git a/kernel_addons/backport/2.6.9_U5/include/src/attribute_container.c b/kernel_addons/backport/2.6.9_U5/include/src/attribute_container.c new file mode 100644 index 0000000..44948d1 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/src/attribute_container.c @@ -0,0 +1,438 @@ +/* + * attribute_container.c - implementation of a simple container for classes + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + * + * The basic idea here is to enable a device to be attached to an + * arbitrary number of classes without having to allocate storage for them. + * Instead, the contained classes select the devices they need to attach + * to via a matching function. + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "base.h" + +/* This is a private structure used to tie the classdev and the + * container .. it should never be visible outside this file */ +struct internal_container { + struct klist_node node; + struct attribute_container *cont; + struct class_device classdev; +}; + +static void internal_container_klist_get(struct klist_node *n) +{ + struct internal_container *ic = + container_of(n, struct internal_container, node); + class_device_get(&ic->classdev); +} + +static void internal_container_klist_put(struct klist_node *n) +{ + struct internal_container *ic = + container_of(n, struct internal_container, node); + class_device_put(&ic->classdev); +} + + +/** + * attribute_container_classdev_to_container - given a classdev, return the container + * + * @classdev: the class device created by attribute_container_add_device. + * + * Returns the container associated with this classdev.
+ */ +struct attribute_container * +attribute_container_classdev_to_container(struct class_device *classdev) +{ + struct internal_container *ic = + container_of(classdev, struct internal_container, classdev); + return ic->cont; +} +EXPORT_SYMBOL_GPL(attribute_container_classdev_to_container); + +static struct list_head attribute_container_list; + +static DECLARE_MUTEX(attribute_container_mutex); + +/** + * attribute_container_register - register an attribute container + * + * @cont: The container to register. This must be allocated by the + * caller and should also be zeroed by it. + */ +int +attribute_container_register(struct attribute_container *cont) +{ + INIT_LIST_HEAD(&cont->node); + klist_init(&cont->containers,internal_container_klist_get, + internal_container_klist_put); + + down(&attribute_container_mutex); + list_add_tail(&cont->node, &attribute_container_list); + up(&attribute_container_mutex); + + return 0; +} +EXPORT_SYMBOL_GPL(attribute_container_register); + +/** + * attribute_container_unregister - remove a container registration + * + * @cont: previously registered container to remove + */ +int +attribute_container_unregister(struct attribute_container *cont) +{ + int retval = -EBUSY; + down(&attribute_container_mutex); + spin_lock(&cont->containers.k_lock); + if (!list_empty(&cont->containers.k_list)) + goto out; + retval = 0; + list_del(&cont->node); + out: + spin_unlock(&cont->containers.k_lock); + up(&attribute_container_mutex); + return retval; + +} +EXPORT_SYMBOL_GPL(attribute_container_unregister); + +/* private function used as class release */ +static void attribute_container_release(struct class_device *classdev) +{ + struct internal_container *ic + = container_of(classdev, struct internal_container, classdev); + struct device *dev = classdev->dev; + + kfree(ic); + put_device(dev); +} + +/** + * attribute_container_add_device - see if any container is interested in dev + * + * @dev: device to add attributes to + * @fn: function to
trigger addition of class device. + * + * This function allocates storage for the class device(s) to be + * attached to dev (one for each matching attribute_container). If no + * fn is provided, the code will simply register the class device via + * class_device_add. If a function is provided, it is expected to add + * the class device at the appropriate time. One of the things that + * might be necessary is to allocate and initialise the classdev and + * then add it at a later time. To do this, call this routine for + * allocation and initialisation and then use + * attribute_container_device_trigger() to call class_device_add() on + * it. Note: after this, the class device contains a reference to dev + * which is not relinquished until the release of the classdev. + */ +void +attribute_container_add_device(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + + if (attribute_container_no_classdevs(cont)) + continue; + + if (!cont->match(cont, dev)) + continue; + + ic = kzalloc(sizeof(*ic), GFP_KERNEL); + if (!ic) { + dev_printk(KERN_ERR, dev, "failed to allocate class container\n"); + continue; + } + + ic->cont = cont; + class_device_initialize(&ic->classdev); + ic->classdev.dev = get_device(dev); + ic->classdev.class = cont->class; + cont->class->release = attribute_container_release; + strcpy(ic->classdev.class_id, dev->bus_id); + if (fn) + fn(cont, dev, &ic->classdev); + else + attribute_container_add_class_device(&ic->classdev); + klist_add_tail(&ic->node, &cont->containers); + } + up(&attribute_container_mutex); +} + +/* FIXME: can't break out of this unless klist_iter_exit is also + * called before doing the break + */ +#define klist_for_each_entry(pos, head, member, iter) \ + for (klist_iter_init(head, iter); (pos = ({ \ + struct
klist_node *n = klist_next(iter); \ + n ? container_of(n, typeof(*pos), member) : \ + ({ klist_iter_exit(iter) ; NULL; }); \ + }) ) != NULL; ) + + +/** + * attribute_container_remove_device - make device eligible for removal. + * + * @dev: The generic device + * @fn: A function to call to remove the device + * + * This routine triggers device removal. If fn is NULL, then it is + * simply done via class_device_unregister (note that if something + * still has a reference to the classdev, then the memory occupied + * will not be freed until the classdev is released). If you want a + * two phase release: remove from visibility and then delete the + * device, then you should use this routine with a fn that calls + * class_device_del() and then use + * attribute_container_device_trigger() to do the final put on the + * classdev. + */ +void +attribute_container_remove_device(struct device *dev, + void (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + struct klist_iter iter; + + if (attribute_container_no_classdevs(cont)) + continue; + + if (!cont->match(cont, dev)) + continue; + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (dev != ic->classdev.dev) + continue; + klist_del(&ic->node); + if (fn) + fn(cont, dev, &ic->classdev); + else { + attribute_container_remove_attrs(&ic->classdev); + class_device_unregister(&ic->classdev); + } + } + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_device_trigger - execute a trigger for each matching classdev + * + * @dev: The generic device to run the trigger for + * @fn: the function to execute for each classdev. + * + * This function is for executing a trigger when you need to know both + * the container and the classdev. 
If you only care about the + * container, then use attribute_container_trigger() instead. + */ +void +attribute_container_device_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *, + struct class_device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + struct internal_container *ic; + struct klist_iter iter; + + if (!cont->match(cont, dev)) + continue; + + if (attribute_container_no_classdevs(cont)) { + fn(cont, dev, NULL); + continue; + } + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (dev == ic->classdev.dev) + fn(cont, dev, &ic->classdev); + } + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_trigger - trigger a function for each matching container + * + * @dev: The generic device to activate the trigger for + * @fn: the function to trigger + * + * This routine triggers a function that only needs to know the + * matching containers (not the classdev) associated with a device. + * It is more lightweight than attribute_container_device_trigger, so + * should be used in preference unless the triggering function + * actually needs to know the classdev. 
+ */ +void +attribute_container_trigger(struct device *dev, + int (*fn)(struct attribute_container *, + struct device *)) +{ + struct attribute_container *cont; + + down(&attribute_container_mutex); + list_for_each_entry(cont, &attribute_container_list, node) { + if (cont->match(cont, dev)) + fn(cont, dev); + } + up(&attribute_container_mutex); +} + +/** + * attribute_container_add_attrs - add attributes + * + * @classdev: The class device + * + * This simply creates all the class device sysfs files from the + * attributes listed in the container + */ +int +attribute_container_add_attrs(struct class_device *classdev) +{ + struct attribute_container *cont = + attribute_container_classdev_to_container(classdev); + struct class_device_attribute **attrs = cont->attrs; + int i, error; + + if (!attrs) + return 0; + + for (i = 0; attrs[i]; i++) { + error = class_device_create_file(classdev, attrs[i]); + if (error) + return error; + } + + return 0; +} + +/** + * attribute_container_add_class_device - same function as class_device_add + * + * @classdev: the class device to add + * + * This performs essentially the same function as class_device_add except for + * attribute containers, namely add the classdev to the system and then + * create the attribute files + */ +int +attribute_container_add_class_device(struct class_device *classdev) +{ + int error = class_device_add(classdev); + if (error) + return error; + return attribute_container_add_attrs(classdev); +} + +/** + * attribute_container_add_class_device_adapter - simple adapter for triggers + * + * This function is identical to attribute_container_add_class_device except + * that it is designed to be called from the triggers + */ +int +attribute_container_add_class_device_adapter(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + return attribute_container_add_class_device(classdev); +} + +/** + * attribute_container_remove_attrs - remove any attribute files + * + * 
@classdev: The class device to remove the files from + * + */ +void +attribute_container_remove_attrs(struct class_device *classdev) +{ + struct attribute_container *cont = + attribute_container_classdev_to_container(classdev); + struct class_device_attribute **attrs = cont->attrs; + int i; + + if (!attrs) + return; + + for (i = 0; attrs[i]; i++) + class_device_remove_file(classdev, attrs[i]); +} + +/** + * attribute_container_class_device_del - equivalent of class_device_del + * + * @classdev: the class device + * + * This function simply removes all the attribute files and then calls + * class_device_del. + */ +void +attribute_container_class_device_del(struct class_device *classdev) +{ + attribute_container_remove_attrs(classdev); + class_device_del(classdev); +} + +/** + * attribute_container_find_class_device - find the corresponding class_device + * + * @cont: the container + * @dev: the generic device + * + * Looks up the device in the container's list of class devices and returns + * the corresponding class_device. 
+ */ +struct class_device * +attribute_container_find_class_device(struct attribute_container *cont, + struct device *dev) +{ + struct class_device *cdev = NULL; + struct internal_container *ic; + struct klist_iter iter; + + klist_for_each_entry(ic, &cont->containers, node, &iter) { + if (ic->classdev.dev == dev) { + cdev = &ic->classdev; + /* FIXME: must exit iterator then break */ + klist_iter_exit(&iter); + break; + } + } + + return cdev; +} +EXPORT_SYMBOL_GPL(attribute_container_find_class_device); + +int +attribute_container_init(void) +{ + INIT_LIST_HEAD(&attribute_container_list); + return 0; +} +EXPORT_SYMBOL_GPL(attribute_container_init); diff --git a/kernel_addons/backport/2.6.9_U5/include/src/base.h b/kernel_addons/backport/2.6.9_U5/include/src/base.h new file mode 100644 index 0000000..a5f8936 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/src/base.h @@ -0,0 +1 @@ +extern int attribute_container_init(void); diff --git a/kernel_addons/backport/2.6.9_U5/include/src/init.c b/kernel_addons/backport/2.6.9_U5/include/src/init.c new file mode 100644 index 0000000..15f0bc6 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/src/init.c @@ -0,0 +1,26 @@ +/* + * + * Copyright (c) 2002-3 Patrick Mochel + * Copyright (c) 2002-3 Open Source Development Labs + * + * This file is released under the GPLv2 + * + */ + +#include +#include +#include + +#include "base.h" + +/** + * driver_init - initialize driver model. + * + * Call the driver model init functions to initialize their + * subsystems. Called early from init/main.c. + */ + +void __init driver_init(void) +{ + attribute_container_init(); +} diff --git a/kernel_addons/backport/2.6.9_U5/include/src/klist.c b/kernel_addons/backport/2.6.9_U5/include/src/klist.c new file mode 100644 index 0000000..3b29ebc --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/src/klist.c @@ -0,0 +1,287 @@ +/* + * klist.c - Routines for manipulating klists. 
+ * + * + * This klist interface provides a couple of structures that wrap around + * struct list_head to provide explicit list "head" (struct klist) and + * list "node" (struct klist_node) objects. For struct klist, a spinlock + * is included that protects access to the actual list itself. struct + * klist_node provides a pointer to the klist that owns it and a kref + * reference count that indicates the number of current users of that node + * in the list. + * + * The entire point is to provide an interface for iterating over a list + * that is safe and allows for modification of the list during the + * iteration (e.g. insertion and removal), including modification of the + * current node on the list. + * + * It works using a 3rd object type - struct klist_iter - that is declared + * and initialized before an iteration. klist_next() is used to acquire the + * next element in the list. It returns NULL if there are no more items. + * Internally, that routine takes the klist's lock, decrements the reference + * count of the previous klist_node and increments the count of the next + * klist_node. It then drops the lock and returns. + * + * There are primitives for adding and removing nodes to/from a klist. + * When deleting, klist_del() will simply decrement the reference count. + * Only when the count goes to 0 is the node removed from the list. + * klist_remove() will try to delete the node from the list and block + * until it is actually removed. This is useful for objects (like devices) + * that have been removed from the system and must be freed (but must wait + * until all accessors have finished). + * + * Copyright (C) 2005 Patrick Mochel + * + * This file is released under the GPL v2. + */ + +#include +#include + + +/** + * klist_init - Initialize a klist structure. + * @k: The klist we're initializing. 
+ * @get: The get function for the embedding object (NULL if none) + * @put: The put function for the embedding object (NULL if none) + * + * Initialises the klist structure. If the klist_node structures are + * going to be embedded in refcounted objects (necessary for safe + * deletion) then the get/put arguments are used to initialise + * functions that take and release references on the embedding + * objects. + */ + +void klist_init(struct klist * k, void (*get)(struct klist_node *), + void (*put)(struct klist_node *)) +{ + INIT_LIST_HEAD(&k->k_list); + spin_lock_init(&k->k_lock); + k->get = get; + k->put = put; +} + +EXPORT_SYMBOL_GPL(klist_init); + + +static void add_head(struct klist * k, struct klist_node * n) +{ + spin_lock(&k->k_lock); + list_add(&n->n_node, &k->k_list); + spin_unlock(&k->k_lock); +} + +static void add_tail(struct klist * k, struct klist_node * n) +{ + spin_lock(&k->k_lock); + list_add_tail(&n->n_node, &k->k_list); + spin_unlock(&k->k_lock); +} + + +static void klist_node_init(struct klist * k, struct klist_node * n) +{ + INIT_LIST_HEAD(&n->n_node); + init_completion(&n->n_removed); + kref_init(&n->n_ref); + n->n_klist = k; + if (k->get) + k->get(n); +} + + +/** + * klist_add_head - Initialize a klist_node and add it to front. + * @n: node we're adding. + * @k: klist it's going on. + */ + +void klist_add_head(struct klist_node * n, struct klist * k) +{ + klist_node_init(k, n); + add_head(k, n); +} + +EXPORT_SYMBOL_GPL(klist_add_head); + + +/** + * klist_add_tail - Initialize a klist_node and add it to back. + * @n: node we're adding. + * @k: klist it's going on. 
+ */ + +void klist_add_tail(struct klist_node * n, struct klist * k) +{ + klist_node_init(k, n); + add_tail(k, n); +} + +EXPORT_SYMBOL_GPL(klist_add_tail); + + +static void klist_release(struct kref * kref) +{ + struct klist_node * n = container_of(kref, struct klist_node, n_ref); + + list_del(&n->n_node); + complete(&n->n_removed); + n->n_klist = NULL; +} + +static int klist_dec_and_del(struct klist_node * n) +{ + return kref_put_new(&n->n_ref, klist_release); +} + + +/** + * klist_del - Decrement the reference count of node and try to remove. + * @n: node we're deleting. + */ + +void klist_del(struct klist_node * n) +{ + struct klist * k = n->n_klist; + void (*put)(struct klist_node *) = k->put; + + spin_lock(&k->k_lock); + if (!klist_dec_and_del(n)) + put = NULL; + spin_unlock(&k->k_lock); + if (put) + put(n); +} + +EXPORT_SYMBOL_GPL(klist_del); + + +/** + * klist_remove - Decrement the refcount of node and wait for it to go away. + * @n: node we're removing. + */ + +void klist_remove(struct klist_node * n) +{ + klist_del(n); + wait_for_completion(&n->n_removed); +} + +EXPORT_SYMBOL_GPL(klist_remove); + + +/** + * klist_node_attached - Say whether a node is bound to a list or not. + * @n: Node that we're testing. + */ + +int klist_node_attached(struct klist_node * n) +{ + return (n->n_klist != NULL); +} + +EXPORT_SYMBOL_GPL(klist_node_attached); + + +/** + * klist_iter_init_node - Initialize a klist_iter structure. + * @k: klist we're iterating. + * @i: klist_iter we're filling. + * @n: node to start with. + * + * Similar to klist_iter_init(), but starts the action off with @n, + * instead of with the list head. + */ + +void klist_iter_init_node(struct klist * k, struct klist_iter * i, struct klist_node * n) +{ + i->i_klist = k; + i->i_head = &k->k_list; + i->i_cur = n; + if (n) + kref_get(&n->n_ref); +} + +EXPORT_SYMBOL_GPL(klist_iter_init_node); + + +/** + * klist_iter_init - Initialize a klist_iter structure. + * @k: klist we're iterating. 
+ * @i: klist_iter structure we're filling. + * + * Similar to klist_iter_init_node(), but start with the list head. + */ + +void klist_iter_init(struct klist * k, struct klist_iter * i) +{ + klist_iter_init_node(k, i, NULL); +} + +EXPORT_SYMBOL_GPL(klist_iter_init); + + +/** + * klist_iter_exit - Finish a list iteration. + * @i: Iterator structure. + * + * Must be called when done iterating over list, as it decrements the + * refcount of the current node. Necessary in case iteration exited before + * the end of the list was reached, and always good form. + */ + +void klist_iter_exit(struct klist_iter * i) +{ + if (i->i_cur) { + klist_del(i->i_cur); + i->i_cur = NULL; + } +} + +EXPORT_SYMBOL_GPL(klist_iter_exit); + + +static struct klist_node * to_klist_node(struct list_head * n) +{ + return container_of(n, struct klist_node, n_node); +} + + +/** + * klist_next - Ante up next node in list. + * @i: Iterator structure. + * + * First grab list lock. Decrement the reference count of the previous + * node, if there was one. Grab the next node, increment its reference + * count, drop the lock, and return that next node. 
+ */ + +struct klist_node * klist_next(struct klist_iter * i) +{ + struct list_head * next; + struct klist_node * lnode = i->i_cur; + struct klist_node * knode = NULL; + void (*put)(struct klist_node *) = i->i_klist->put; + + spin_lock(&i->i_klist->k_lock); + if (lnode) { + next = lnode->n_node.next; + if (!klist_dec_and_del(lnode)) + put = NULL; + } else + next = i->i_head->next; + + if (next != i->i_head) { + knode = to_klist_node(next); + kref_get(&knode->n_ref); + } + i->i_cur = knode; + spin_unlock(&i->i_klist->k_lock); + if (put && lnode) + put(lnode); + return knode; +} + +EXPORT_SYMBOL_GPL(klist_next); + + diff --git a/kernel_addons/backport/2.6.9_U5/include/src/kref_new.c b/kernel_addons/backport/2.6.9_U5/include/src/kref_new.c new file mode 100644 index 0000000..d45bb3f --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/src/kref_new.c @@ -0,0 +1,29 @@ +#include +#include + +/** + * kref_put_new - decrement refcount for object. + * @kref: object. + * @release: pointer to the function that will clean up the object when the + * last reference to the object is released. + * This pointer is required, and it is not acceptable to pass kfree + * in as this function. + * + * Decrement the refcount, and if 0, call release(). + * Return 1 if the object was removed, otherwise return 0. Beware, if this + * function returns 0, you still cannot count on the kref remaining in + * memory. Only use the return value if you want to see if the kref is now + * gone, not present. 
+ */ +int kref_put_new(struct kref *kref, void (*release)(struct kref *kref)) +{ + WARN_ON(release == NULL); + WARN_ON(release == (void (*)(struct kref *))kfree); + + if (atomic_dec_and_test(&kref->refcount)) { + release(kref); + return 1; + } + return 0; +} +EXPORT_SYMBOL(kref_put_new); diff --git a/kernel_addons/backport/2.6.9_U5/include/src/scsi.c b/kernel_addons/backport/2.6.9_U5/include/src/scsi.c new file mode 100644 index 0000000..8c833c0 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/src/scsi.c @@ -0,0 +1,50 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +/** + * starget_for_each_device - helper to walk all devices of a target + * @starget: target whose devices we want to iterate over. + * + * This traverses over each devices of @shost. The devices have + * a reference that must be released by scsi_host_put when breaking + * out of the loop. 
+ */ +void starget_for_each_device(struct scsi_target *starget, void * data, + void (*fn)(struct scsi_device *, void *)) +{ + struct Scsi_Host *shost = dev_to_shost(starget->dev.parent); + struct scsi_device *sdev; + + printk("%s: entry\n", __FUNCTION__); + shost_for_each_device(sdev, shost) { + if ((sdev->channel == starget->channel) && + (sdev->id == starget->id)) + fn(sdev, data); + } + printk("%s: exit\n", __FUNCTION__); +} +EXPORT_SYMBOL(starget_for_each_device); diff --git a/kernel_addons/backport/2.6.9_U5/include/src/scsi_lib.c b/kernel_addons/backport/2.6.9_U5/include/src/scsi_lib.c new file mode 100644 index 0000000..f53f824 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/src/scsi_lib.c @@ -0,0 +1,166 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +int scsi_is_target_device(const struct device *dev) +{ + char *str = dev->bus_id; + + if (strncmp(str, "target", 6) == 0) { + return 1; + } + + return 0; +} + +/** + * scsi_internal_device_block - internal function to put a device + * temporarily into the SDEV_BLOCK state + * @sdev: device to block + * + * Block request made by scsi lld's to temporarily stop all + * scsi commands on the specified device. Called from interrupt + * or normal process context. + * + * Returns zero if successful or error if not + * + * Notes: + * This routine transitions the device to the SDEV_BLOCK state + * (which must be a legal transition). When the device is in this + * state, all commands are deferred until the scsi lld reenables + * the device with scsi_device_unblock or device_block_tmo fires. + * This routine assumes the host_lock is held on entry. 
+ **/ +int +scsi_internal_device_block(struct scsi_device *sdev) +{ + request_queue_t *q = sdev->request_queue; + unsigned long flags; + int err = 0; + + err = scsi_device_set_state(sdev, SDEV_BLOCK); + if (err) + return err; + + /* + * The device has transitioned to SDEV_BLOCK. Stop the + * block layer from calling the midlayer with this device's + * request queue. + */ + spin_lock_irqsave(q->queue_lock, flags); + blk_stop_queue(q); + spin_unlock_irqrestore(q->queue_lock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(scsi_internal_device_block); + +/** + * scsi_internal_device_unblock - resume a device after a block request + * @sdev: device to resume + * + * Called by scsi lld's or the midlayer to restart the device queue + * for the previously suspended scsi device. Called from interrupt or + * normal process context. + * + * Returns zero if successful or error if not. + * + * Notes: + * This routine transitions the device to the SDEV_RUNNING state + * (which must be a legal transition) allowing the midlayer to + * goose the queue for this device. This routine assumes the + * host_lock is held upon entry. + **/ +int +scsi_internal_device_unblock(struct scsi_device *sdev) +{ + request_queue_t *q = sdev->request_queue; + int err; + unsigned long flags; + + + /* + * Try to transition the scsi device to SDEV_RUNNING + * and goose the device queue if successful. 
+ */ + err = scsi_device_set_state(sdev, SDEV_RUNNING); + if (err) + return err; + + spin_lock_irqsave(q->queue_lock, flags); + blk_start_queue(q); + spin_unlock_irqrestore(q->queue_lock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(scsi_internal_device_unblock); + +static void +device_block(struct scsi_device *sdev, void *data) +{ + scsi_internal_device_block(sdev); +} + +static int +target_block(struct device *dev, void *data) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_block); + + return 0; +} + +void +scsi_target_block(struct device *dev) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_block); + else + device_for_each_child(dev, NULL, target_block); +} +EXPORT_SYMBOL_GPL(scsi_target_block); + +static void +device_unblock(struct scsi_device *sdev, void *data) +{ + scsi_internal_device_unblock(sdev); +} + +static int +target_unblock(struct device *dev, void *data) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_unblock); + return 0; +} + +void +scsi_target_unblock(struct device *dev) +{ + if (scsi_is_target_device(dev)) + starget_for_each_device(to_scsi_target(dev), NULL, + device_unblock); + else + device_for_each_child(dev, NULL, target_unblock); +} +EXPORT_SYMBOL_GPL(scsi_target_unblock); + +MODULE_LICENSE("GPL"); diff --git a/kernel_addons/backport/2.6.9_U5/include/src/scsi_scan.c b/kernel_addons/backport/2.6.9_U5/include/src/scsi_scan.c new file mode 100644 index 0000000..b7b7674 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/src/scsi_scan.c @@ -0,0 +1,48 @@ +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +/** + * int_to_scsilun: reverts an int into a scsi_lun + * @int: integer to be reverted + * @scsilun: struct scsi_lun to be set. 
+ * + * Description: + * Reverts the functionality of the scsilun_to_int, which packed + * an 8-byte lun value into an int. This routine unpacks the int + * back into the lun value. + * Note: the scsilun_to_int() routine does not truly handle all + * 8 bytes of the lun value. This function restores only as much + * as was set by the routine. + * + * Notes: + * Given an integer : 0x0b030a04, this function returns a + * struct scsi_lun of: 0a 04 0b 03 00 00 00 00 + * + **/ +void int_to_scsilun(unsigned int lun, struct scsi_lun *scsilun) +{ + int i; + + memset(scsilun->scsi_lun, 0, sizeof(scsilun->scsi_lun)); + + for (i = 0; i < sizeof(lun); i += 2) { + scsilun->scsi_lun[i] = (lun >> 8) & 0xFF; + scsilun->scsi_lun[i+1] = lun & 0xFF; + lun = lun >> 16; + } +} +EXPORT_SYMBOL(int_to_scsilun); diff --git a/kernel_addons/backport/2.6.9_U5/include/src/transport_class.c b/kernel_addons/backport/2.6.9_U5/include/src/transport_class.c new file mode 100644 index 0000000..f25e7c6 --- /dev/null +++ b/kernel_addons/backport/2.6.9_U5/include/src/transport_class.c @@ -0,0 +1,280 @@ +/* + * transport_class.c - implementation of generic transport classes + * using attribute_containers + * + * Copyright (c) 2005 - James Bottomley + * + * This file is licensed under GPLv2 + * + * The basic idea here is to allow any "device controller" (which + * would most often be a Host Bus Adapter) to use the services of one + * or more transport classes for performing transport specific + * services. Transport specific services are things that the generic + * command layer doesn't want to know about (speed settings, line + * conditioning, etc), but which the user might be interested in. + * Thus, the HBAs use the routines exported by the transport classes + * to perform these functions. The transport classes export certain + * values to the user via sysfs using attribute containers. 
+ * + * Note: because not every HBA will care about every transport + * attribute, there's a many to one relationship that goes like this: + * + * transport class<-----attribute container<----class device + * + * Usually the attribute container is per-HBA, but the design doesn't + * mandate that. Although most of the services will be specific to + * the actual external storage connection used by the HBA, the generic + * transport class is framed entirely in terms of generic devices to + * allow it to be used by any physical HBA in the system. + */ +#include +#include + +/** + * transport_class_register - register an initial transport class + * + * @tclass: a pointer to the transport class structure to be initialised + * + * The transport class contains an embedded class which is used to + * identify it. The caller should initialise this structure with + * zeros and then generic class must have been initialised with the + * actual transport class unique name. There's a macro + * DECLARE_TRANSPORT_CLASS() to do this (declared classes still must + * be registered). + * + * Returns 0 on success or error on failure. + */ +int transport_class_register(struct transport_class *tclass) +{ + return class_register(&tclass->class); +} +EXPORT_SYMBOL_GPL(transport_class_register); + +/** + * transport_class_unregister - unregister a previously registered class + * + * @tclass: The transport class to unregister + * + * Must be called prior to deallocating the memory for the transport + * class. 
+ */ +void transport_class_unregister(struct transport_class *tclass) +{ + class_unregister(&tclass->class); +} +EXPORT_SYMBOL_GPL(transport_class_unregister); + +static int anon_transport_dummy_function(struct transport_container *tc, + struct device *dev, + struct class_device *cdev) +{ + /* do nothing */ + return 0; +} + +/** + * anon_transport_class_register - register an anonymous class + * + * @atc: The anon transport class to register + * + * The anonymous transport class contains both a transport class and a + * container. The idea of an anonymous class is that it never + * actually has any device attributes associated with it (and thus + * saves on container storage). So it can only be used for triggering + * events. Use prezero and then use DECLARE_ANON_TRANSPORT_CLASS() to + * initialise the anon transport class storage. + */ +int anon_transport_class_register(struct anon_transport_class *atc) +{ + int error; + atc->container.class = &atc->tclass.class; + attribute_container_set_no_classdevs(&atc->container); + error = attribute_container_register(&atc->container); + if (error) + return error; + atc->tclass.setup = anon_transport_dummy_function; + atc->tclass.remove = anon_transport_dummy_function; + return 0; +} +EXPORT_SYMBOL_GPL(anon_transport_class_register); + +/** + * anon_transport_class_unregister - unregister an anon class + * + * @atc: Pointer to the anon transport class to unregister + * + * Must be called prior to deallocating the memory for the anon + * transport class. 
+ */ +void anon_transport_class_unregister(struct anon_transport_class *atc) +{ + attribute_container_unregister(&atc->container); +} +EXPORT_SYMBOL_GPL(anon_transport_class_unregister); + +static int transport_setup_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + struct transport_container *tcont = attribute_container_to_transport_container(cont); + + if (tclass->setup) + tclass->setup(tcont, dev, classdev); + + return 0; +} + +/** + * transport_setup_device - declare a new dev for transport class association + * but don't make it visible yet. + * + * @dev: the generic device representing the entity being added + * + * Usually, dev represents some component in the HBA system (either + * the HBA itself or a device remote across the HBA bus). This + * routine is simply a trigger point to see if any set of transport + * classes wishes to associate with the added device. This allocates + * storage for the class device and initialises it, but does not yet + * add it to the system or add attributes to it (you do this with + * transport_add_device). If you have no need for a separate setup + * and add operations, use transport_register_device (see + * transport_class.h). 
+ */ + +void transport_setup_device(struct device *dev) +{ + attribute_container_add_device(dev, transport_setup_classdev); +} +EXPORT_SYMBOL_GPL(transport_setup_device); + +static int transport_add_class_device(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + int error = attribute_container_add_class_device(classdev); + struct transport_container *tcont = + attribute_container_to_transport_container(cont); + + if (!error && tcont->statistics) + error = sysfs_create_group(&classdev->kobj, tcont->statistics); + + return error; +} + + +/** + * transport_add_device - declare a new dev for transport class association + * + * @dev: the generic device representing the entity being added + * + * Usually, dev represents some component in the HBA system (either + * the HBA itself or a device remote across the HBA bus). This + * routine is simply a trigger point used to add the device to the + * system and register attributes for it. + */ + +void transport_add_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_add_class_device); +} +EXPORT_SYMBOL_GPL(transport_add_device); + +static int transport_configure(struct attribute_container *cont, + struct device *dev, + struct class_device *cdev) +{ + struct transport_class *tclass = class_to_transport_class(cont->class); + struct transport_container *tcont = attribute_container_to_transport_container(cont); + + if (tclass->configure) + tclass->configure(tcont, dev, cdev); + + return 0; +} + +/** + * transport_configure_device - configure an already set up device + * + * @dev: generic device representing device to be configured + * + * The idea of configure is simply to provide a point within the setup + * process to allow the transport class to extract information from a + * device after it has been setup. 
This is used in SCSI because we + * have to have a setup device to begin using the HBA, but after we + * send the initial inquiry, we use configure to extract the device + * parameters. The device need not have been added to be configured. + */ +void transport_configure_device(struct device *dev) +{ + attribute_container_device_trigger(dev, transport_configure); +} +EXPORT_SYMBOL_GPL(transport_configure_device); + +static int transport_remove_classdev(struct attribute_container *cont, + struct device *dev, + struct class_device *classdev) +{ + struct transport_container *tcont = + attribute_container_to_transport_container(cont); + struct transport_class *tclass = class_to_transport_class(cont->class); + + if (tclass->remove) + tclass->remove(tcont, dev, classdev); + + if (tclass->remove != anon_transport_dummy_function) { + if (tcont->statistics) + sysfs_remove_group(&classdev->kobj, tcont->statistics); + attribute_container_class_device_del(classdev); + } + + return 0; +} + + +/** + * transport_remove_device - remove the visibility of a device + * + * @dev: generic device to remove + * + * This call removes the visibility of the device (to the user from + * sysfs), but does not destroy it. To eliminate a device entirely + * you must also call transport_destroy_device. If you don't need to + * do remove and destroy as separate operations, use + * transport_unregister_device() (see transport_class.h) which will + * perform both calls for you. 
+ */
+void transport_remove_device(struct device *dev)
+{
+ attribute_container_device_trigger(dev, transport_remove_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_remove_device);
+
+static void transport_destroy_classdev(struct attribute_container *cont,
+ struct device *dev,
+ struct class_device *classdev)
+{
+ struct transport_class *tclass = class_to_transport_class(cont->class);
+
+ if (tclass->remove != anon_transport_dummy_function)
+ class_device_put(classdev);
+}
+
+
+/**
+ * transport_destroy_device - destroy a removed device
+ *
+ * @dev: device to eliminate from the transport class.
+ *
+ * This call triggers the elimination of storage associated with the
+ * transport classdev. Note: all it really does is relinquish a
+ * reference to the classdev. The memory will not be freed until the
+ * last reference goes to zero. Note also that the classdev retains a
+ * reference count on dev, so dev too will remain for as long as the
+ * transport class device remains around.
+ */
+void transport_destroy_device(struct device *dev)
+{
+ attribute_container_remove_device(dev, transport_destroy_classdev);
+}
+EXPORT_SYMBOL_GPL(transport_destroy_device);
diff --git a/kernel_patches/backport/2.6.16_sles10/iscsi_scsi_makefile.patch b/kernel_patches/backport/2.6.16_sles10/iscsi_scsi_makefile.patch
new file mode 100644
index 0000000..30b6f0e
--- /dev/null
+++ b/kernel_patches/backport/2.6.16_sles10/iscsi_scsi_makefile.patch
@@ -0,0 +1,9 @@
+diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile
+--- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile 1970-01-01 02:00:00.000000000 +0200
++++ ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile 2006-12-28 17:01:22.000000000 +0200
+@@ -0,0 +1,5 @@
++obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
++obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o
++
++scsi_transport_iscsi-y := scsi_transport_iscsi_f.o
++libiscsi-y := libiscsi_f.o
diff --git a/kernel_patches/backport/2.6.16_sles10_sp1/iscsi_scsi_makefile.patch b/kernel_patches/backport/2.6.16_sles10_sp1/iscsi_scsi_makefile.patch
new file mode 100644
index 0000000..30b6f0e
--- /dev/null
+++ b/kernel_patches/backport/2.6.16_sles10_sp1/iscsi_scsi_makefile.patch
@@ -0,0 +1,9 @@
+diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile
+--- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile 1970-01-01 02:00:00.000000000 +0200
++++ ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile 2006-12-28 17:01:22.000000000 +0200
+@@ -0,0 +1,5 @@
++obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
++obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o
++
++scsi_transport_iscsi-y := scsi_transport_iscsi_f.o
++libiscsi-y := libiscsi_f.o
diff --git a/kernel_patches/backport/2.6.18/iscsi_scsi_makefile.patch b/kernel_patches/backport/2.6.18/iscsi_scsi_makefile.patch
new file mode 100644
index 0000000..30b6f0e
--- /dev/null
+++ b/kernel_patches/backport/2.6.18/iscsi_scsi_makefile.patch
@@ -0,0 +1,9 @@
+diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile
+--- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile 1970-01-01 02:00:00.000000000 +0200
++++ ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile 2006-12-28 17:01:22.000000000 +0200
+@@ -0,0 +1,5 @@
++obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
++obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o
++
++scsi_transport_iscsi-y := scsi_transport_iscsi_f.o
++libiscsi-y := libiscsi_f.o
diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch
new file mode 100644
index 0000000..a339163
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi.patch
@@ -0,0 +1,270 @@
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.c linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.c
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.c 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.c 2007-05-17 16:55:43.000000000 +0300
+@@ -676,7 +676,7 @@ iscsi_tcp_copy(struct iscsi_conn *conn,
+ }
+
+ static inline void
+-partial_sg_digest_update(struct hash_desc *desc, struct scatterlist *sg,
++partial_sg_digest_update(struct crypto_tfm *desc, struct scatterlist *sg,
+ int offset, int length)
+ {
+ struct scatterlist temp;
+@@ -684,7 +684,7 @@ partial_sg_digest_update(struct hash_des
+ memcpy(&temp, sg, sizeof(struct scatterlist));
+ temp.offset = offset;
+ temp.length = length;
+- crypto_hash_update(desc, &temp, length);
++ crypto_hash_update(&desc, &temp, length);
+ }
+
+ static void
+@@ -1774,22 +1774,18 @@ iscsi_tcp_conn_create(struct iscsi_cls_s
+ /* initial operational parameters */
+ tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
+
+- tcp_conn->tx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+- CRYPTO_ALG_ASYNC);
+- tcp_conn->tx_hash.flags = 0;
+- if (IS_ERR(tcp_conn->tx_hash.tfm))
++ tcp_conn->tx_hash = crypto_alloc_tfm("crc32c", 0);
++ if (!tcp_conn->tx_hash)
+ goto free_tcp_conn;
+
+- tcp_conn->rx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+- CRYPTO_ALG_ASYNC);
+- tcp_conn->rx_hash.flags = 0;
+- if (IS_ERR(tcp_conn->rx_hash.tfm))
++ tcp_conn->rx_hash = crypto_alloc_tfm("crc32c", 0);
++ if (!tcp_conn->rx_hash)
+ goto free_tx_tfm;
+
+ return cls_conn;
+
+ free_tx_tfm:
+- crypto_free_hash(tcp_conn->tx_hash.tfm);
++ crypto_free_tfm(tcp_conn->tx_hash);
+ free_tcp_conn:
+ kfree(tcp_conn);
+ tcp_conn_alloc_fail:
+@@ -1823,10 +1819,10 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_
+ iscsi_tcp_release_conn(conn);
+ iscsi_conn_teardown(cls_conn);
+
+- if (tcp_conn->tx_hash.tfm)
+- crypto_free_hash(tcp_conn->tx_hash.tfm);
+- if (tcp_conn->rx_hash.tfm)
+- crypto_free_hash(tcp_conn->rx_hash.tfm);
++ if (tcp_conn->tx_hash)
++ crypto_free_tfm(tcp_conn->tx_hash);
++ if (tcp_conn->rx_hash)
++ crypto_free_tfm(tcp_conn->rx_hash);
+
+ kfree(tcp_conn);
+ }
+@@ -2017,7 +2013,7 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
+ {
+ struct iscsi_conn *conn = cls_conn->dd_data;
+ struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+- struct inet_sock *inet;
++ struct inet_opt *inet;
+ struct ipv6_pinfo *np;
+ struct sock *sk;
+ int len;
+@@ -2135,7 +2131,6 @@ static void iscsi_tcp_session_destroy(st
+ static struct scsi_host_template iscsi_sht = {
+ .name = "iSCSI Initiator over TCP/IP",
+ .queuecommand = iscsi_queuecommand,
+- .change_queue_depth = iscsi_change_queue_depth,
+ .can_queue = ISCSI_XMIT_CMDS_MAX - 1,
+ .sg_tablesize = ISCSI_SG_TABLESIZE,
+ .cmd_per_lun = ISCSI_DEF_CMD_PER_LUN,
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.h linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.h
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.h 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.h 2007-05-17 16:38:14.000000000 +0300
+@@ -49,7 +49,6 @@
+ #define ISCSI_SG_TABLESIZE SG_ALL
+ #define ISCSI_TCP_MAX_CMD_LEN 16
+
+-struct crypto_hash;
+ struct socket;
+
+ /* Socket connection recieve helper */
+@@ -93,8 +92,8 @@ struct iscsi_tcp_conn {
+ void (*old_write_space)(struct sock *);
+
+ /* data and header digests */
+- struct hash_desc tx_hash; /* CRC32C (Tx) */
+- struct hash_desc rx_hash; /* CRC32C (Rx) */
++ struct crypto_tfm *tx_hash; /* CRC32C (Tx) */
++ struct crypto_tfm *rx_hash; /* CRC32C (Rx) */
+
+ /* MIB custom statistics */
+ uint32_t sendpage_failures_cnt;
+diff -rup linux-2.6.20/drivers/scsi/libiscsi.c linux-2.6.20-rh4-backport/drivers/scsi/libiscsi.c
+--- linux-2.6.20/drivers/scsi/libiscsi.c 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/libiscsi.c 2007-05-17 16:38:14.000000000 +0300
+@@ -1370,7 +1370,6 @@ iscsi_session_setup(struct iscsi_transpo
+ shost->max_lun = iscsit->max_lun;
+ shost->max_cmd_len = iscsit->max_cmd_len;
+ shost->transportt = scsit;
+- shost->transportt->create_work_queue = 1;
+ *hostno = shost->host_no;
+
+ session = iscsi_hostdata(shost->hostdata);
+diff -rup linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c linux-2.6.20-rh4-backport/drivers/scsi/scsi_transport_iscsi.c
+--- linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/scsi_transport_iscsi.c 2007-05-17 16:38:14.000000000 +0300
+@@ -65,6 +65,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
+ #define cdev_to_iscsi_internal(_cdev) \
+ container_of(_cdev, struct iscsi_internal, cdev)
+
++extern int attribute_container_init(void);
++
+ static void iscsi_transport_release(struct class_device *cdev)
+ {
+ struct iscsi_internal *priv = cdev_to_iscsi_internal(cdev);
+@@ -80,6 +82,17 @@ static struct class iscsi_transport_clas
+ .release = iscsi_transport_release,
+ };
+
++static void iscsi_host_class_release(struct class_device *class_dev)
++{
++ struct Scsi_Host *shost = transport_class_to_shost(class_dev);
++ put_device(&shost->shost_gendev);
++}
++
++struct class iscsi_host_class = {
++ .name = "iscsi_host",
++ .release = iscsi_host_class_release,
++};
++
+ static ssize_t
+ show_transport_handle(struct class_device *cdev, char *buf)
+ {
+@@ -115,10 +128,8 @@ static struct attribute_group iscsi_tran
+ .attrs = iscsi_transport_attrs,
+ };
+
+-static int iscsi_setup_host(struct transport_container *tc, struct device *dev,
+- struct class_device *cdev)
++static int iscsi_setup_host(struct Scsi_Host *shost)
+ {
+- struct Scsi_Host *shost = dev_to_shost(dev);
+ struct iscsi_host *ihost = shost->shost_data;
+
+ memset(ihost, 0, sizeof(*ihost));
+@@ -127,12 +138,6 @@ static int iscsi_setup_host(struct trans
+ return 0;
+ }
+
+-static DECLARE_TRANSPORT_CLASS(iscsi_host_class,
+- "iscsi_host",
+- iscsi_setup_host,
+- NULL,
+- NULL);
+-
+ static DECLARE_TRANSPORT_CLASS(iscsi_session_class,
+ "iscsi_session",
+ NULL,
+@@ -216,24 +221,6 @@ static int iscsi_is_session_dev(const st
+ return dev->release == iscsi_session_release;
+ }
+
+-static int iscsi_user_scan(struct Scsi_Host *shost, uint channel,
+- uint id, uint lun)
+-{
+- struct iscsi_host *ihost = shost->shost_data;
+- struct iscsi_cls_session *session;
+-
+- mutex_lock(&ihost->mutex);
+- list_for_each_entry(session, &ihost->sessions, host_list) {
+- if ((channel == SCAN_WILD_CARD || channel == 0) &&
+- (id == SCAN_WILD_CARD || id == session->target_id))
+- scsi_scan_target(&session->dev, 0,
+- session->target_id, lun, 1);
+- }
+- mutex_unlock(&ihost->mutex);
+-
+- return 0;
+-}
+-
+ static void session_recovery_timedout(struct work_struct *work)
+ {
+ struct iscsi_cls_session *session =
+@@ -362,8 +349,6 @@ void iscsi_remove_session(struct iscsi_c
+ list_del(&session->host_list);
+ mutex_unlock(&ihost->mutex);
+
+- scsi_remove_target(&session->dev);
+-
+ transport_unregister_device(&session->dev);
+ device_del(&session->dev);
+ }
+@@ -1269,24 +1254,6 @@ static int iscsi_conn_match(struct attri
+ return &priv->conn_cont.ac == cont;
+ }
+
+-static int iscsi_host_match(struct attribute_container *cont,
+- struct device *dev)
+-{
+- struct Scsi_Host *shost;
+- struct iscsi_internal *priv;
+-
+- if (!scsi_is_host_device(dev))
+- return 0;
+-
+- shost = dev_to_shost(dev);
+- if (!shost->transportt ||
+- shost->transportt->host_attrs.ac.class != &iscsi_host_class.class)
+- return 0;
+-
+- priv = to_iscsi_internal(shost->transportt);
+- return &priv->t.host_attrs.ac == cont;
+-}
+-
+ struct scsi_transport_template *
+ iscsi_register_transport(struct iscsi_transport *tt)
+ {
+@@ -1306,7 +1273,6 @@ iscsi_register_transport(struct iscsi_tr
+ INIT_LIST_HEAD(&priv->list);
+ priv->daemon_pid = -1;
+ priv->iscsi_transport = tt;
+- priv->t.user_scan = iscsi_user_scan;
+
+ priv->cdev.class = &iscsi_transport_class;
+ snprintf(priv->cdev.class_id, BUS_ID_SIZE, "%s", tt->name);
+@@ -1319,12 +1285,11 @@ iscsi_register_transport(struct iscsi_tr
+ goto unregister_cdev;
+
+ /* host parameters */
+- priv->t.host_attrs.ac.attrs = &priv->host_attrs[0];
+- priv->t.host_attrs.ac.class = &iscsi_host_class.class;
+- priv->t.host_attrs.ac.match = iscsi_host_match;
++
++ priv->t.host_attrs = &priv->host_attrs[0];
++ priv->t.host_class = &iscsi_host_class;
++ priv->t.host_setup = iscsi_setup_host;
+ priv->t.host_size = sizeof(struct iscsi_host);
+- priv->host_attrs[0] = NULL;
+- transport_container_register(&priv->t.host_attrs);
+
+ /* connection parameters */
+ priv->conn_cont.ac.attrs = &priv->conn_attrs[0];
+@@ -1402,7 +1367,6 @@ int iscsi_unregister_transport(struct is
+
+ transport_container_unregister(&priv->conn_cont);
+ transport_container_unregister(&priv->session_cont);
+- transport_container_unregister(&priv->t.host_attrs);
+
+ sysfs_remove_group(&priv->cdev.kobj, &iscsi_transport_group);
+ class_device_unregister(&priv->cdev);
+@@ -1419,6 +1383,8 @@ static __init int iscsi_transport_init(v
+ printk(KERN_INFO "Loading iSCSI transport class v%s.\n",
+ ISCSI_TRANSPORT_VERSION);
+
++ attribute_container_init();
++
+ err = class_register(&iscsi_transport_class);
+ if (err)
+ return err;
diff --git a/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch b/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch
new file mode 100644
index 0000000..21715fd
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/add_open_iscsi_h.patch
@@ -0,0 +1,35 @@
+diff -rup linux-2.6.20/include/scsi/iscsi_if.h linux-2.6.20-rh4-backport/include/scsi/iscsi_if.h
+--- linux-2.6.20/include/scsi/iscsi_if.h 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/iscsi_if.h 2007-05-15 08:49:53.000000000 +0300
+@@ -277,7 +277,6 @@ enum iscsi_param {
+ * These flags describes reason of stop_conn() call
+ */
+ #define STOP_CONN_TERM 0x1
+-#define STOP_CONN_SUSPEND 0x2
+ #define STOP_CONN_RECOVER 0x3
+
+ #define ISCSI_STATS_CUSTOM_MAX 32
+diff -rup linux-2.6.20/include/scsi/libiscsi.h linux-2.6.20-rh4-backport/include/scsi/libiscsi.h
+--- linux-2.6.20/include/scsi/libiscsi.h 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/libiscsi.h 2007-05-15 08:54:49.000000000 +0300
+@@ -25,8 +25,6 @@
+
+ #include
+ #include
+-#include
+-#include
+ #include
+ #include
+
+diff -rup linux-2.6.20/include/scsi/scsi_transport_iscsi.h linux-2.6.20-rh4-backport/include/scsi/scsi_transport_iscsi.h
+--- linux-2.6.20/include/scsi/scsi_transport_iscsi.h 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/scsi_transport_iscsi.h 2007-05-15 08:54:24.000000000 +0300
+@@ -24,7 +24,7 @@
+ #define SCSI_TRANSPORT_ISCSI_H
+
+ #include
+-#include
++#include "iscsi_if.h"
+
+ struct scsi_transport_template;
+ struct iscsi_transport;
diff --git a/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch
new file mode 100644
index 0000000..3c2a969
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/fix_inclusion_order_iscsi_iser.patch
@@ -0,0 +1,13 @@
+--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:13:43.000000000 +0200
++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:14:31.000000000 +0200
+@@ -70,9 +70,8 @@
+ #include
+ #include
+ #include
+-#include
+-
+ #include "iscsi_iser.h"
++#include
+
+ static unsigned int iscsi_max_lun = 512;
+ module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO);
diff --git a/kernel_patches/backport/2.6.9_U3/iscsi_scsi_addons.patch b/kernel_patches/backport/2.6.9_U3/iscsi_scsi_addons.patch
new file mode 100644
index 0000000..ffa0598
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U3/iscsi_scsi_addons.patch
@@ -0,0 +1,65 @@
+diff --git a/drivers/scsi/init.c b/drivers/scsi/init.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/init.c
+@@ -0,0 +1 @@
++#include "src/init.c"
+diff --git a/drivers/scsi/attribute_container.c b/drivers/scsi/attribute_container.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/attribute_container.c
+@@ -0,0 +1 @@
++#include "src/attribute_container.c"
+diff --git a/drivers/scsi/transport_class.c b/drivers/scsi/transport_class.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/transport_class.c
+@@ -0,0 +1 @@
++#include "src/transport_class.c"
+diff --git a/drivers/scsi/klist.c b/drivers/scsi/klist.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/klist.c
+@@ -0,0 +1 @@
++#include "src/klist.c"
+diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi.c
+@@ -0,0 +1 @@
++#include "src/scsi.c"
+diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi_lib.c
+@@ -0,0 +1 @@
++#include "src/scsi_lib.c"
+diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi_scan.c
+@@ -0,0 +1 @@
++#include "src/scsi_scan.c"
+diff --git a/drivers/scsi/kref_new.c b/drivers/scsi/kref_new.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/kref_new.c
+@@ -0,0 +1 @@
++#include "src/kref_new.c"
+diff -rupN ofa_kernel-1.2/drivers/scsi/Makefile ofa_kernel-1.2-iscsi/drivers/scsi/Makefile
+--- ofa_kernel-1.2/drivers/scsi/Makefile 1970-01-01 02:00:00.000000000 +0200
++++ ofa_kernel-1.2-iscsi/drivers/scsi/Makefile 2007-05-16 14:12:22.000000000 +0300
+@@ -0,0 +1,5 @@
++obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
++obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o
++
++scsi_transport_iscsi-y := scsi_transport_iscsi_f.o scsi.o scsi_lib.o init.o kref_new.o klist.o attribute_container.o transport_class.o
++libiscsi-y := libiscsi_f.o scsi_scan.o
diff --git a/kernel_patches/backport/2.6.9_U4/add_open_iscsi.patch b/kernel_patches/backport/2.6.9_U4/add_open_iscsi.patch
new file mode 100644
index 0000000..a339163
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_open_iscsi.patch
@@ -0,0 +1,270 @@
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.c linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.c
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.c 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.c 2007-05-17 16:55:43.000000000 +0300
+@@ -676,7 +676,7 @@ iscsi_tcp_copy(struct iscsi_conn *conn,
+ }
+
+ static inline void
+-partial_sg_digest_update(struct hash_desc *desc, struct scatterlist *sg,
++partial_sg_digest_update(struct crypto_tfm *desc, struct scatterlist *sg,
+ int offset, int length)
+ {
+ struct scatterlist temp;
+@@ -684,7 +684,7 @@ partial_sg_digest_update(struct hash_des
+ memcpy(&temp, sg, sizeof(struct scatterlist));
+ temp.offset = offset;
+ temp.length = length;
+- crypto_hash_update(desc, &temp, length);
++ crypto_hash_update(&desc, &temp, length);
+ }
+
+ static void
+@@ -1774,22 +1774,18 @@ iscsi_tcp_conn_create(struct iscsi_cls_s
+ /* initial operational parameters */
+ tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
+
+- tcp_conn->tx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+- CRYPTO_ALG_ASYNC);
+- tcp_conn->tx_hash.flags = 0;
+- if (IS_ERR(tcp_conn->tx_hash.tfm))
++ tcp_conn->tx_hash = crypto_alloc_tfm("crc32c", 0);
++ if (!tcp_conn->tx_hash)
+ goto free_tcp_conn;
+
+- tcp_conn->rx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+- CRYPTO_ALG_ASYNC);
+- tcp_conn->rx_hash.flags = 0;
+- if (IS_ERR(tcp_conn->rx_hash.tfm))
++ tcp_conn->rx_hash = crypto_alloc_tfm("crc32c", 0);
++ if (!tcp_conn->rx_hash)
+ goto free_tx_tfm;
+
+ return cls_conn;
+
+ free_tx_tfm:
+- crypto_free_hash(tcp_conn->tx_hash.tfm);
++ crypto_free_tfm(tcp_conn->tx_hash);
+ free_tcp_conn:
+ kfree(tcp_conn);
+ tcp_conn_alloc_fail:
+@@ -1823,10 +1819,10 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_
+ iscsi_tcp_release_conn(conn);
+ iscsi_conn_teardown(cls_conn);
+
+- if (tcp_conn->tx_hash.tfm)
+- crypto_free_hash(tcp_conn->tx_hash.tfm);
+- if (tcp_conn->rx_hash.tfm)
+- crypto_free_hash(tcp_conn->rx_hash.tfm);
++ if (tcp_conn->tx_hash)
++ crypto_free_tfm(tcp_conn->tx_hash);
++ if (tcp_conn->rx_hash)
++ crypto_free_tfm(tcp_conn->rx_hash);
+
+ kfree(tcp_conn);
+ }
+@@ -2017,7 +2013,7 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
+ {
+ struct iscsi_conn *conn = cls_conn->dd_data;
+ struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+- struct inet_sock *inet;
++ struct inet_opt *inet;
+ struct ipv6_pinfo *np;
+ struct sock *sk;
+ int len;
+@@ -2135,7 +2131,6 @@ static void iscsi_tcp_session_destroy(st
+ static struct scsi_host_template iscsi_sht = {
+ .name = "iSCSI Initiator over TCP/IP",
+ .queuecommand = iscsi_queuecommand,
+- .change_queue_depth = iscsi_change_queue_depth,
+ .can_queue = ISCSI_XMIT_CMDS_MAX - 1,
+ .sg_tablesize = ISCSI_SG_TABLESIZE,
+ .cmd_per_lun = ISCSI_DEF_CMD_PER_LUN,
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.h linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.h
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.h 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.h 2007-05-17 16:38:14.000000000 +0300
+@@ -49,7 +49,6 @@
+ #define ISCSI_SG_TABLESIZE SG_ALL
+ #define ISCSI_TCP_MAX_CMD_LEN 16
+
+-struct crypto_hash;
+ struct socket;
+
+ /* Socket connection recieve helper */
+@@ -93,8 +92,8 @@ struct iscsi_tcp_conn {
+ void (*old_write_space)(struct sock *);
+
+ /* data and header digests */
+- struct hash_desc tx_hash; /* CRC32C (Tx) */
+- struct hash_desc rx_hash; /* CRC32C (Rx) */
++ struct crypto_tfm *tx_hash; /* CRC32C (Tx) */
++ struct crypto_tfm *rx_hash; /* CRC32C (Rx) */
+
+ /* MIB custom statistics */
+ uint32_t sendpage_failures_cnt;
+diff -rup linux-2.6.20/drivers/scsi/libiscsi.c linux-2.6.20-rh4-backport/drivers/scsi/libiscsi.c
+--- linux-2.6.20/drivers/scsi/libiscsi.c 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/libiscsi.c 2007-05-17 16:38:14.000000000 +0300
+@@ -1370,7 +1370,6 @@ iscsi_session_setup(struct iscsi_transpo
+ shost->max_lun = iscsit->max_lun;
+ shost->max_cmd_len = iscsit->max_cmd_len;
+ shost->transportt = scsit;
+- shost->transportt->create_work_queue = 1;
+ *hostno = shost->host_no;
+
+ session = iscsi_hostdata(shost->hostdata);
+diff -rup linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c linux-2.6.20-rh4-backport/drivers/scsi/scsi_transport_iscsi.c
+--- linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/scsi_transport_iscsi.c 2007-05-17 16:38:14.000000000 +0300
+@@ -65,6 +65,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
+ #define cdev_to_iscsi_internal(_cdev) \
+ container_of(_cdev, struct iscsi_internal, cdev)
+
++extern int attribute_container_init(void);
++
+ static void iscsi_transport_release(struct class_device *cdev)
+ {
+ struct iscsi_internal *priv = cdev_to_iscsi_internal(cdev);
+@@ -80,6 +82,17 @@ static struct class iscsi_transport_clas
+ .release = iscsi_transport_release,
+ };
+
++static void iscsi_host_class_release(struct class_device *class_dev)
++{
++ struct Scsi_Host *shost = transport_class_to_shost(class_dev);
++ put_device(&shost->shost_gendev);
++}
++
++struct class iscsi_host_class = {
++ .name = "iscsi_host",
++ .release = iscsi_host_class_release,
++};
++
+ static ssize_t
+ show_transport_handle(struct class_device *cdev, char *buf)
+ {
+@@ -115,10 +128,8 @@ static struct attribute_group iscsi_tran
+ .attrs = iscsi_transport_attrs,
+ };
+
+-static int iscsi_setup_host(struct transport_container *tc, struct device *dev,
+- struct class_device *cdev)
++static int iscsi_setup_host(struct Scsi_Host *shost)
+ {
+- struct Scsi_Host *shost = dev_to_shost(dev);
+ struct iscsi_host *ihost = shost->shost_data;
+
+ memset(ihost, 0, sizeof(*ihost));
+@@ -127,12 +138,6 @@ static int iscsi_setup_host(struct trans
+ return 0;
+ }
+
+-static DECLARE_TRANSPORT_CLASS(iscsi_host_class,
+- "iscsi_host",
+- iscsi_setup_host,
+- NULL,
+- NULL);
+-
+ static DECLARE_TRANSPORT_CLASS(iscsi_session_class,
+ "iscsi_session",
+ NULL,
+@@ -216,24 +221,6 @@ static int iscsi_is_session_dev(const st
+ return dev->release == iscsi_session_release;
+ }
+
+-static int iscsi_user_scan(struct Scsi_Host *shost, uint channel,
+- uint id, uint lun)
+-{
+- struct iscsi_host *ihost = shost->shost_data;
+- struct iscsi_cls_session *session;
+-
+- mutex_lock(&ihost->mutex);
+- list_for_each_entry(session, &ihost->sessions, host_list) {
+- if ((channel == SCAN_WILD_CARD || channel == 0) &&
+- (id == SCAN_WILD_CARD || id == session->target_id))
+- scsi_scan_target(&session->dev, 0,
+- session->target_id, lun, 1);
+- }
+- mutex_unlock(&ihost->mutex);
+-
+- return 0;
+-}
+-
+ static void session_recovery_timedout(struct work_struct *work)
+ {
+ struct iscsi_cls_session *session =
+@@ -362,8 +349,6 @@ void iscsi_remove_session(struct iscsi_c
+ list_del(&session->host_list);
+ mutex_unlock(&ihost->mutex);
+
+- scsi_remove_target(&session->dev);
+-
+ transport_unregister_device(&session->dev);
+ device_del(&session->dev);
+ }
+@@ -1269,24 +1254,6 @@ static int iscsi_conn_match(struct attri
+ return &priv->conn_cont.ac == cont;
+ }
+
+-static int iscsi_host_match(struct attribute_container *cont,
+- struct device *dev)
+-{
+- struct Scsi_Host *shost;
+- struct iscsi_internal *priv;
+-
+- if (!scsi_is_host_device(dev))
+- return 0;
+-
+- shost = dev_to_shost(dev);
+- if (!shost->transportt ||
+- shost->transportt->host_attrs.ac.class != &iscsi_host_class.class)
+- return 0;
+-
+- priv = to_iscsi_internal(shost->transportt);
+- return &priv->t.host_attrs.ac == cont;
+-}
+-
+ struct scsi_transport_template *
+ iscsi_register_transport(struct iscsi_transport *tt)
+ {
+@@ -1306,7 +1273,6 @@ iscsi_register_transport(struct iscsi_tr
+ INIT_LIST_HEAD(&priv->list);
+ priv->daemon_pid = -1;
+ priv->iscsi_transport = tt;
+- priv->t.user_scan = iscsi_user_scan;
+
+ priv->cdev.class = &iscsi_transport_class;
+ snprintf(priv->cdev.class_id, BUS_ID_SIZE, "%s", tt->name);
+@@ -1319,12 +1285,11 @@ iscsi_register_transport(struct iscsi_tr
+ goto unregister_cdev;
+
+ /* host parameters */
+- priv->t.host_attrs.ac.attrs = &priv->host_attrs[0];
+- priv->t.host_attrs.ac.class = &iscsi_host_class.class;
+- priv->t.host_attrs.ac.match = iscsi_host_match;
++
++ priv->t.host_attrs = &priv->host_attrs[0];
++ priv->t.host_class = &iscsi_host_class;
++ priv->t.host_setup = iscsi_setup_host;
+ priv->t.host_size = sizeof(struct iscsi_host);
+- priv->host_attrs[0] = NULL;
+- transport_container_register(&priv->t.host_attrs);
+
+ /* connection parameters */
+ priv->conn_cont.ac.attrs = &priv->conn_attrs[0];
+@@ -1402,7 +1367,6 @@ int iscsi_unregister_transport(struct is
+
+ transport_container_unregister(&priv->conn_cont);
+ transport_container_unregister(&priv->session_cont);
+- transport_container_unregister(&priv->t.host_attrs);
+
+ sysfs_remove_group(&priv->cdev.kobj, &iscsi_transport_group);
+ class_device_unregister(&priv->cdev);
+@@ -1419,6 +1383,8 @@ static __init int iscsi_transport_init(v
+ printk(KERN_INFO "Loading iSCSI transport class v%s.\n",
+ ISCSI_TRANSPORT_VERSION);
+
++ attribute_container_init();
++
+ err = class_register(&iscsi_transport_class);
+ if (err)
+ return err;
diff --git a/kernel_patches/backport/2.6.9_U4/add_open_iscsi_h.patch b/kernel_patches/backport/2.6.9_U4/add_open_iscsi_h.patch
new file mode 100644
index 0000000..21715fd
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/add_open_iscsi_h.patch
@@ -0,0 +1,35 @@
+diff -rup linux-2.6.20/include/scsi/iscsi_if.h linux-2.6.20-rh4-backport/include/scsi/iscsi_if.h
+--- linux-2.6.20/include/scsi/iscsi_if.h 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/iscsi_if.h 2007-05-15 08:49:53.000000000 +0300
+@@ -277,7 +277,6 @@ enum iscsi_param {
+ * These flags describes reason of stop_conn() call
+ */
+ #define STOP_CONN_TERM 0x1
+-#define STOP_CONN_SUSPEND 0x2
+ #define STOP_CONN_RECOVER 0x3
+
+ #define ISCSI_STATS_CUSTOM_MAX 32
+diff -rup linux-2.6.20/include/scsi/libiscsi.h linux-2.6.20-rh4-backport/include/scsi/libiscsi.h
+--- linux-2.6.20/include/scsi/libiscsi.h 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/libiscsi.h 2007-05-15 08:54:49.000000000 +0300
+@@ -25,8 +25,6 @@
+
+ #include
+ #include
+-#include
+-#include
+ #include
+ #include
+
+diff -rup linux-2.6.20/include/scsi/scsi_transport_iscsi.h linux-2.6.20-rh4-backport/include/scsi/scsi_transport_iscsi.h
+--- linux-2.6.20/include/scsi/scsi_transport_iscsi.h 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/include/scsi/scsi_transport_iscsi.h 2007-05-15 08:54:24.000000000 +0300
+@@ -24,7 +24,7 @@
+ #define SCSI_TRANSPORT_ISCSI_H
+
+ #include
+-#include
++#include "iscsi_if.h"
+
+ struct scsi_transport_template;
+ struct iscsi_transport;
diff --git a/kernel_patches/backport/2.6.9_U4/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U4/fix_inclusion_order_iscsi_iser.patch
new file mode 100644
index 0000000..3c2a969
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/fix_inclusion_order_iscsi_iser.patch
@@ -0,0 +1,13 @@
+--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:13:43.000000000 +0200
++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:14:31.000000000 +0200
+@@ -70,9 +70,8 @@
+ #include
+ #include
+ #include
+-#include
+-
+ #include "iscsi_iser.h"
++#include
+
+ static unsigned int iscsi_max_lun = 512;
+ module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO);
diff --git a/kernel_patches/backport/2.6.9_U4/iscsi_scsi_addons.patch b/kernel_patches/backport/2.6.9_U4/iscsi_scsi_addons.patch
new file mode 100644
index 0000000..ffa0598
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U4/iscsi_scsi_addons.patch
@@ -0,0 +1,65 @@
+diff --git a/drivers/scsi/init.c b/drivers/scsi/init.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/init.c
+@@ -0,0 +1 @@
++#include "src/init.c"
+diff --git a/drivers/scsi/attribute_container.c b/drivers/scsi/attribute_container.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/attribute_container.c
+@@ -0,0 +1 @@
++#include "src/attribute_container.c"
+diff --git a/drivers/scsi/transport_class.c b/drivers/scsi/transport_class.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/transport_class.c
+@@ -0,0 +1 @@
++#include "src/transport_class.c"
+diff --git a/drivers/scsi/klist.c b/drivers/scsi/klist.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/klist.c
+@@ -0,0 +1 @@
++#include "src/klist.c"
+diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi.c
+@@ -0,0 +1 @@
++#include "src/scsi.c"
+diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi_lib.c
+@@ -0,0 +1 @@
++#include "src/scsi_lib.c"
+diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/scsi_scan.c
+@@ -0,0 +1 @@
++#include "src/scsi_scan.c"
+diff --git a/drivers/scsi/kref_new.c b/drivers/scsi/kref_new.c
+new file mode 100644
+index 0000000..58cf933
+--- /dev/null
++++ b/drivers/scsi/kref_new.c
+@@ -0,0 +1 @@
++#include "src/kref_new.c"
+diff -rupN ofa_kernel-1.2/drivers/scsi/Makefile ofa_kernel-1.2-iscsi/drivers/scsi/Makefile
+--- ofa_kernel-1.2/drivers/scsi/Makefile 1970-01-01 02:00:00.000000000 +0200
++++ ofa_kernel-1.2-iscsi/drivers/scsi/Makefile 2007-05-16 14:12:22.000000000 +0300
+@@ -0,0 +1,5 @@
++obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o
++obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o
++
++scsi_transport_iscsi-y := scsi_transport_iscsi_f.o scsi.o scsi_lib.o init.o kref_new.o klist.o attribute_container.o transport_class.o
++libiscsi-y := libiscsi_f.o scsi_scan.o
diff --git a/kernel_patches/backport/2.6.9_U5/add_open_iscsi.patch b/kernel_patches/backport/2.6.9_U5/add_open_iscsi.patch
new file mode 100644
index 0000000..a339163
--- /dev/null
+++ b/kernel_patches/backport/2.6.9_U5/add_open_iscsi.patch
@@ -0,0 +1,270 @@
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.c linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.c
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.c 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.c 2007-05-17 16:55:43.000000000 +0300
+@@ -676,7 +676,7 @@ iscsi_tcp_copy(struct iscsi_conn *conn,
+ }
+
+ static inline void
+-partial_sg_digest_update(struct hash_desc *desc, struct scatterlist *sg,
++partial_sg_digest_update(struct crypto_tfm *desc, struct scatterlist *sg,
+ int offset, int length)
+ {
+ struct scatterlist temp;
+@@ -684,7 +684,7 @@ partial_sg_digest_update(struct hash_des
+ memcpy(&temp, sg, sizeof(struct scatterlist));
+ temp.offset = offset;
+ temp.length = length;
+- crypto_hash_update(desc, &temp, length);
++ crypto_hash_update(&desc, &temp, length);
+ }
+
+ static void
+@@ -1774,22 +1774,18 @@ iscsi_tcp_conn_create(struct iscsi_cls_s
+ /* initial operational parameters */
+ tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
+
+- tcp_conn->tx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+- CRYPTO_ALG_ASYNC);
+- tcp_conn->tx_hash.flags = 0;
+- if (IS_ERR(tcp_conn->tx_hash.tfm))
++ tcp_conn->tx_hash = crypto_alloc_tfm("crc32c", 0);
++ if (!tcp_conn->tx_hash)
+ goto free_tcp_conn;
+
+- tcp_conn->rx_hash.tfm = crypto_alloc_hash("crc32c", 0,
+- CRYPTO_ALG_ASYNC);
+- tcp_conn->rx_hash.flags = 0;
+- if (IS_ERR(tcp_conn->rx_hash.tfm))
++ tcp_conn->rx_hash = crypto_alloc_tfm("crc32c", 0);
++ if (!tcp_conn->rx_hash)
+ goto free_tx_tfm;
+
+ return cls_conn;
+
+ free_tx_tfm:
+- crypto_free_hash(tcp_conn->tx_hash.tfm);
++ crypto_free_tfm(tcp_conn->tx_hash);
+ free_tcp_conn:
+ kfree(tcp_conn);
+ tcp_conn_alloc_fail:
+@@ -1823,10 +1819,10 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_
+ iscsi_tcp_release_conn(conn);
+ iscsi_conn_teardown(cls_conn);
+
+- if (tcp_conn->tx_hash.tfm)
+- crypto_free_hash(tcp_conn->tx_hash.tfm);
+- if (tcp_conn->rx_hash.tfm)
+- crypto_free_hash(tcp_conn->rx_hash.tfm);
++ if (tcp_conn->tx_hash)
++ crypto_free_tfm(tcp_conn->tx_hash);
++ if (tcp_conn->rx_hash)
++ crypto_free_tfm(tcp_conn->rx_hash);
+
+ kfree(tcp_conn);
+ }
+@@ -2017,7 +2013,7 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
+ {
+ struct iscsi_conn *conn = cls_conn->dd_data;
+ struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+- struct inet_sock *inet;
++ struct inet_opt *inet;
+ struct ipv6_pinfo *np;
+ struct sock *sk;
+ int len;
+@@ -2135,7 +2131,6 @@ static void iscsi_tcp_session_destroy(st
+ static struct scsi_host_template iscsi_sht = {
+ .name = "iSCSI Initiator over TCP/IP",
+ .queuecommand = iscsi_queuecommand,
+- .change_queue_depth = iscsi_change_queue_depth,
+ .can_queue = ISCSI_XMIT_CMDS_MAX - 1,
+ .sg_tablesize = ISCSI_SG_TABLESIZE,
+ .cmd_per_lun = ISCSI_DEF_CMD_PER_LUN,
+diff -rup linux-2.6.20/drivers/scsi/iscsi_tcp.h linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.h
+--- linux-2.6.20/drivers/scsi/iscsi_tcp.h 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/iscsi_tcp.h 2007-05-17 16:38:14.000000000 +0300
+@@ -49,7 +49,6 @@
+ #define ISCSI_SG_TABLESIZE SG_ALL
+ #define ISCSI_TCP_MAX_CMD_LEN 16
+
+-struct crypto_hash;
+ struct socket;
+
+ /* Socket connection recieve helper */
+@@ -93,8 +92,8 @@ struct iscsi_tcp_conn {
+ void (*old_write_space)(struct sock *);
+
+ /* data and header digests */
+- struct hash_desc tx_hash; /* CRC32C (Tx) */
+- struct hash_desc rx_hash; /* CRC32C (Rx) */
++ struct crypto_tfm *tx_hash; /* CRC32C (Tx) */
++ struct crypto_tfm *rx_hash; /* CRC32C (Rx) */
+
+ /* MIB custom statistics */
+ uint32_t sendpage_failures_cnt;
+diff -rup linux-2.6.20/drivers/scsi/libiscsi.c linux-2.6.20-rh4-backport/drivers/scsi/libiscsi.c
+--- linux-2.6.20/drivers/scsi/libiscsi.c 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/libiscsi.c 2007-05-17 16:38:14.000000000 +0300
+@@ -1370,7 +1370,6 @@ iscsi_session_setup(struct iscsi_transpo
+ shost->max_lun = iscsit->max_lun;
+ shost->max_cmd_len = iscsit->max_cmd_len;
+ shost->transportt = scsit;
+- shost->transportt->create_work_queue = 1;
+ *hostno = shost->host_no;
+
+ session = iscsi_hostdata(shost->hostdata);
+diff -rup linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c linux-2.6.20-rh4-backport/drivers/scsi/scsi_transport_iscsi.c
+--- linux-2.6.20/drivers/scsi/scsi_transport_iscsi.c 2007-02-04 20:44:54.000000000 +0200
++++ linux-2.6.20-rh4-backport/drivers/scsi/scsi_transport_iscsi.c 2007-05-17 16:38:14.000000000 +0300
+@@ -65,6 +65,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l
+ #define cdev_to_iscsi_internal(_cdev) \
+ container_of(_cdev, struct iscsi_internal, cdev)
+
++extern int attribute_container_init(void);
++
+ static void iscsi_transport_release(struct class_device *cdev)
+ {
+ struct iscsi_internal *priv = cdev_to_iscsi_internal(cdev);
+@@ -80,6 +82,17 @@ static struct class iscsi_transport_clas
+ .release = iscsi_transport_release,
+ };
+
++static void iscsi_host_class_release(struct class_device *class_dev)
++{
++ struct Scsi_Host *shost = transport_class_to_shost(class_dev);
++ put_device(&shost->shost_gendev);
++}
++
++struct class iscsi_host_class = {
++ .name = "iscsi_host",
++ .release = iscsi_host_class_release,
++};
++
+ static ssize_t
+ show_transport_handle(struct class_device *cdev, char *buf)
+ {
+@@ -115,10 +128,8 @@ static struct attribute_group iscsi_tran
+ .attrs = iscsi_transport_attrs,
+ };
+
+-static int iscsi_setup_host(struct transport_container *tc, struct device *dev,
+- struct class_device *cdev)
++static int iscsi_setup_host(struct Scsi_Host *shost)
+ {
+- struct Scsi_Host *shost = dev_to_shost(dev);
+ struct iscsi_host *ihost = shost->shost_data;
+
+ memset(ihost, 0, sizeof(*ihost));
+@@ -127,12 +138,6 @@ static int iscsi_setup_host(struct trans
+ return 0;
+ }
+
+-static DECLARE_TRANSPORT_CLASS(iscsi_host_class,
+- "iscsi_host",
+- iscsi_setup_host,
+- NULL,
+- NULL);
+-
+ static DECLARE_TRANSPORT_CLASS(iscsi_session_class,
+ "iscsi_session",
+ NULL,
+@@ -216,24 +221,6 @@ static int iscsi_is_session_dev(const st
+ return dev->release == iscsi_session_release;
+ }
+
+-static int iscsi_user_scan(struct Scsi_Host *shost, uint channel,
+- uint id, uint lun)
+-{
+- struct iscsi_host *ihost = shost->shost_data;
+- struct iscsi_cls_session *session;
+-
+- mutex_lock(&ihost->mutex);
+- list_for_each_entry(session, &ihost->sessions, host_list) {
+- if ((channel == SCAN_WILD_CARD || channel == 0) &&
+- (id == SCAN_WILD_CARD || id == session->target_id))
+- scsi_scan_target(&session->dev, 0,
+- session->target_id, lun, 1);
+- }
+- mutex_unlock(&ihost->mutex);
+-
+- return 0;
+-}
+-
+ static void session_recovery_timedout(struct work_struct *work)
+ {
+ struct iscsi_cls_session *session =
+@@ -362,8 +349,6 @@ void iscsi_remove_session(struct iscsi_c
+ list_del(&session->host_list);
+ mutex_unlock(&ihost->mutex);
+
+- scsi_remove_target(&session->dev);
+-
+ transport_unregister_device(&session->dev);
+ device_del(&session->dev);
+ }
+@@ -1269,24 +1254,6 @@ static int iscsi_conn_match(struct attri
+ return &priv->conn_cont.ac == cont;
+ }
+
+-static int iscsi_host_match(struct attribute_container *cont,
+- struct device *dev)
+-{
+- struct Scsi_Host *shost;
+- struct iscsi_internal *priv;
+-
+- if (!scsi_is_host_device(dev))
+- return 0;
+-
+- shost = dev_to_shost(dev);
+- if (!shost->transportt ||
+- shost->transportt->host_attrs.ac.class != &iscsi_host_class.class)
+- return 0;
+-
+- priv = to_iscsi_internal(shost->transportt);
+- return &priv->t.host_attrs.ac == cont;
+-}
+-
+ struct scsi_transport_template *
+ iscsi_register_transport(struct iscsi_transport *tt)
+ {
+@@ -1306,7 +1273,6 @@ iscsi_register_transport(struct iscsi_tr
+
INIT_LIST_HEAD(&priv->list); + priv->daemon_pid = -1; + priv->iscsi_transport = tt; +- priv->t.user_scan = iscsi_user_scan; + + priv->cdev.class = &iscsi_transport_class; + snprintf(priv->cdev.class_id, BUS_ID_SIZE, "%s", tt->name); +@@ -1319,12 +1285,11 @@ iscsi_register_transport(struct iscsi_tr + goto unregister_cdev; + + /* host parameters */ +- priv->t.host_attrs.ac.attrs = &priv->host_attrs[0]; +- priv->t.host_attrs.ac.class = &iscsi_host_class.class; +- priv->t.host_attrs.ac.match = iscsi_host_match; ++ ++ priv->t.host_attrs = &priv->host_attrs[0]; ++ priv->t.host_class = &iscsi_host_class; ++ priv->t.host_setup = iscsi_setup_host; + priv->t.host_size = sizeof(struct iscsi_host); +- priv->host_attrs[0] = NULL; +- transport_container_register(&priv->t.host_attrs); + + /* connection parameters */ + priv->conn_cont.ac.attrs = &priv->conn_attrs[0]; +@@ -1402,7 +1367,6 @@ int iscsi_unregister_transport(struct is + + transport_container_unregister(&priv->conn_cont); + transport_container_unregister(&priv->session_cont); +- transport_container_unregister(&priv->t.host_attrs); + + sysfs_remove_group(&priv->cdev.kobj, &iscsi_transport_group); + class_device_unregister(&priv->cdev); +@@ -1419,6 +1383,8 @@ static __init int iscsi_transport_init(v + printk(KERN_INFO "Loading iSCSI transport class v%s.\n", + ISCSI_TRANSPORT_VERSION); + ++ attribute_container_init(); ++ + err = class_register(&iscsi_transport_class); + if (err) + return err; diff --git a/kernel_patches/backport/2.6.9_U5/add_open_iscsi_h.patch b/kernel_patches/backport/2.6.9_U5/add_open_iscsi_h.patch new file mode 100644 index 0000000..21715fd --- /dev/null +++ b/kernel_patches/backport/2.6.9_U5/add_open_iscsi_h.patch @@ -0,0 +1,35 @@ +diff -rup linux-2.6.20/include/scsi/iscsi_if.h linux-2.6.20-rh4-backport/include/scsi/iscsi_if.h +--- linux-2.6.20/include/scsi/iscsi_if.h 2007-02-04 20:44:54.000000000 +0200 ++++ linux-2.6.20-rh4-backport/include/scsi/iscsi_if.h 2007-05-15 08:49:53.000000000 +0300 +@@ 
-277,7 +277,6 @@ enum iscsi_param { + * These flags describes reason of stop_conn() call + */ + #define STOP_CONN_TERM 0x1 +-#define STOP_CONN_SUSPEND 0x2 + #define STOP_CONN_RECOVER 0x3 + + #define ISCSI_STATS_CUSTOM_MAX 32 +diff -rup linux-2.6.20/include/scsi/libiscsi.h linux-2.6.20-rh4-backport/include/scsi/libiscsi.h +--- linux-2.6.20/include/scsi/libiscsi.h 2007-02-04 20:44:54.000000000 +0200 ++++ linux-2.6.20-rh4-backport/include/scsi/libiscsi.h 2007-05-15 08:54:49.000000000 +0300 +@@ -25,8 +25,6 @@ + + #include + #include +-#include +-#include + #include + #include + +diff -rup linux-2.6.20/include/scsi/scsi_transport_iscsi.h linux-2.6.20-rh4-backport/include/scsi/scsi_transport_iscsi.h +--- linux-2.6.20/include/scsi/scsi_transport_iscsi.h 2007-02-04 20:44:54.000000000 +0200 ++++ linux-2.6.20-rh4-backport/include/scsi/scsi_transport_iscsi.h 2007-05-15 08:54:24.000000000 +0300 +@@ -24,7 +24,7 @@ + #define SCSI_TRANSPORT_ISCSI_H + + #include +-#include ++#include "iscsi_if.h" + + struct scsi_transport_template; + struct iscsi_transport; diff --git a/kernel_patches/backport/2.6.9_U5/fix_inclusion_order_iscsi_iser.patch b/kernel_patches/backport/2.6.9_U5/fix_inclusion_order_iscsi_iser.patch new file mode 100644 index 0000000..3c2a969 --- /dev/null +++ b/kernel_patches/backport/2.6.9_U5/fix_inclusion_order_iscsi_iser.patch @@ -0,0 +1,13 @@ +--- linux-2.6.20-rc7-orig/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:13:43.000000000 +0200 ++++ linux-2.6.20-rc7/drivers/infiniband/ulp/iser/iscsi_iser.c 2007-02-08 09:14:31.000000000 +0200 +@@ -70,9 +70,8 @@ + #include + #include + #include +-#include +- + #include "iscsi_iser.h" ++#include + + static unsigned int iscsi_max_lun = 512; + module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO); diff --git a/kernel_patches/backport/2.6.9_U5/iscsi_scsi_addons.patch b/kernel_patches/backport/2.6.9_U5/iscsi_scsi_addons.patch new file mode 100644 index 0000000..ffa0598 --- /dev/null +++ 
b/kernel_patches/backport/2.6.9_U5/iscsi_scsi_addons.patch @@ -0,0 +1,65 @@ +diff --git a/drivers/scsi/init.c b/drivers/scsi/init.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/scsi/init.c +@@ -0,0 +1 @@ ++#include "src/init.c" +diff --git a/drivers/scsi/attribute_container.c b/drivers/scsi/attribute_container.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/scsi/attribute_container.c +@@ -0,0 +1 @@ ++#include "src/attribute_container.c" +diff --git a/drivers/scsi/transport_class.c b/drivers/scsi/transport_class.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/scsi/transport_class.c +@@ -0,0 +1 @@ ++#include "src/transport_class.c" +diff --git a/drivers/scsi/klist.c b/drivers/scsi/klist.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/scsi/klist.c +@@ -0,0 +1 @@ ++#include "src/klist.c" +diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/scsi/scsi.c +@@ -0,0 +1 @@ ++#include "src/scsi.c" +diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/scsi/scsi_lib.c +@@ -0,0 +1 @@ ++#include "src/scsi_lib.c" +diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/scsi/scsi_scan.c +@@ -0,0 +1 @@ ++#include "src/scsi_scan.c" +diff --git a/drivers/scsi/kref_new.c b/drivers/scsi/kref_new.c +new file mode 100644 +index 0000000..58cf933 +--- /dev/null ++++ b/drivers/scsi/kref_new.c +@@ -0,0 +1 @@ ++#include "src/kref_new.c" +diff -rupN ofa_kernel-1.2/drivers/scsi/Makefile ofa_kernel-1.2-iscsi/drivers/scsi/Makefile +--- ofa_kernel-1.2/drivers/scsi/Makefile 1970-01-01 02:00:00.000000000 +0200 ++++ ofa_kernel-1.2-iscsi/drivers/scsi/Makefile 2007-05-16 14:12:22.000000000 +0300 +@@ -0,0 +1,5 @@ 
++obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o ++obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o ++ ++scsi_transport_iscsi-y := scsi_transport_iscsi_f.o scsi.o scsi_lib.o init.o kref_new.o klist.o attribute_container.o transport_class.o ++libiscsi-y := libiscsi_f.o scsi_scan.o diff --git a/kernel_patches/fixes/iscsi_scsi_makefile.patch b/kernel_patches/fixes/iscsi_scsi_makefile.patch deleted file mode 100644 index 9c4fd01..0000000 --- a/kernel_patches/fixes/iscsi_scsi_makefile.patch +++ /dev/null @@ -1,10 +0,0 @@ -Add a Makefile based on the kernel's drivers/scsi/Makefile in order to build open-iscsi. - -Signed-off-by: Erez Zilber - -diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile ---- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile 1970-01-01 02:00:00.000000000 +0200 -+++ ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile 2006-12-28 17:01:22.000000000 +0200 -@@ -0,0 +1,2 @@ -+obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o -+obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o diff --git a/ofed_scripts/makefile b/ofed_scripts/makefile index 34a8996..62abe2c 100644 --- a/ofed_scripts/makefile +++ b/ofed_scripts/makefile @@ -60,6 +60,12 @@ kernel: @echo "Kernel version: $(KVERSION)" @echo "Modules directory: $(DESTDIR)/$(MODULES_DIR)" @echo "Kernel sources: $(KSRC)" + if [ -e $(CWD)/drivers/scsi/libiscsi.c ]; then \ + mv $(CWD)/drivers/scsi/libiscsi.c $(CWD)/drivers/scsi/libiscsi_f.c; \ + fi + if [ -e $(CWD)/drivers/scsi/scsi_transport_iscsi.c ]; then \ + mv $(CWD)/drivers/scsi/scsi_transport_iscsi.c $(CWD)/drivers/scsi/scsi_transport_iscsi_f.c; \ + fi env EXTRA_CFLAGS="$(OPENIB_KERNEL_EXTRA_CFLAGS) $(KERNEL_MEMTRACK_CFLAGS) -I$(CWD)/include -I$(CWD)/drivers/infiniband/include \ -I$(CWD)/drivers/infiniband/ulp/ipoib \ -I$(CWD)/drivers/infiniband/debug \ From mst at dev.mellanox.co.il Mon May 21 01:16:25 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 21 May 2007 11:16:25 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4 In-Reply-To: <46513C13.3010100@voltaire.com> References: <4641D295.5060907@voltaire.com> <4641D38A.8040406@voltaire.com> <20070510092925.GB13655@mellanox.co.il> <46513C13.3010100@voltaire.com> Message-ID: <20070521081625.GA20400@mellanox.co.il> > Quoting Erez Zilber : > Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4 > > Michael S. Tsirkin wrote: > >> Quoting Erez Zilber : > >> Subject: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4 > >> > >> > >> Add the required backport patches & kernel addons for open-iscsi > >> over iSER in RHAS4 up3 and up4. > >> > >> Signed-off-by: Erez Zilber > > > > In addition to posting patches, could you pls publish a git tree to pull from, > > please? This makes it easy to test-build the patch as our build system > > knows how to do git checkout. > > Added a git tree: > > http://www.openfabrics.org/git/?p=~erezz/ofed_1_2_iser_rh4.git;a=summary Looks reasonable. However, you are copying a ton of files from upstream kernel. Sticking extra files in include might interfere with newer kernels, so I don't have better ideas for this for 1.2 (for 1.3 I am hoping we'll use the submodule support in git, so we'll be able to re-use headers as well). But, for files *not* in "include/", I suggest that, instead of sticking our own version in addons, we should check out the files from upstream and tweak makefiles to pick them up: maintaining these in OFED tree long-term will be a problem. 
> >> + > >> + struct iscsi_internal { > >> + int daemon_pid; > >> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l > >> + #define cdev_to_iscsi_internal(_cdev) \ > >> + container_of(_cdev, struct iscsi_internal, cdev) > >> + > >> ++extern int attribute_container_init(void); > >> ++ > > > > This does not look scsi-related. Why does this belong here? > > This is a hack. In 2.6.20, attribute_container_init is called from drivers/base/init.c. Since I cannot do that, I'm calling it from the init function in scsi_transport_iscsi (because scsi_transport_iscsi uses the attribute container). Do you have a better suggestion? Aha. No better ideas for the header, let it be for now. But the code in drivers/base/init.c can be checked out rather than copied over. > diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/memory.h b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h > new file mode 100644 > index 0000000..654ef55 > --- /dev/null > +++ b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h > @@ -0,0 +1,89 @@ > +/* > + * include/linux/memory.h - generic memory definition > + * > + * This is mainly for topological representation. We define the > + * basic "struct memory_block" here, which can be embedded in per-arch > + * definitions or NUMA information. > + * > + * Basic handling of the devices is done in drivers/base/memory.c > + * and system devices are handled in drivers/base/sys.c. > + * > + * Memory block are exported via sysfs in the class/memory/devices/ > + * directory. > + * > + */ Sorry, why are we copying this here? Are you actually trying to work with hotplug memory? > --- a/kernel_patches/fixes/iscsi_scsi_makefile.patch > +++ /dev/null > @@ -1,10 +0,0 @@ > -Add a Makefile based on the kernel's drivers/scsi/Makefile in order to build open-iscsi. 
> - > -Signed-off-by: Erez Zilber > - > -diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile > ---- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile 1970-01-01 02:00:00.000000000 +0200 > -+++ ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile 2006-12-28 17:01:22.000000000 +0200 > -@@ -0,0 +1,2 @@ > -+obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o > -+obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o > diff --git a/ofed_scripts/makefile b/ofed_scripts/makefile > index 34a8996..62abe2c 100644 > --- a/ofed_scripts/makefile > +++ b/ofed_scripts/makefile > @@ -60,6 +60,12 @@ kernel: > @echo "Kernel version: $(KVERSION)" > @echo "Modules directory: $(DESTDIR)/$(MODULES_DIR)" > @echo "Kernel sources: $(KSRC)" > + if [ -e $(CWD)/drivers/scsi/libiscsi.c ]; then \ > + mv $(CWD)/drivers/scsi/libiscsi.c $(CWD)/drivers/scsi/libiscsi_f.c; \ > + fi > + if [ -e $(CWD)/drivers/scsi/scsi_transport_iscsi.c ]; then \ > + mv $(CWD)/drivers/scsi/scsi_transport_iscsi.c $(CWD)/drivers/scsi/scsi_transport_iscsi_f.c; \ > + fi > env EXTRA_CFLAGS="$(OPENIB_KERNEL_EXTRA_CFLAGS) $(KERNEL_MEMTRACK_CFLAGS) -I$(CWD)/include -I$(CWD)/drivers/infiniband/include \ > -I$(CWD)/drivers/infiniband/ulp/ipoib \ > -I$(CWD)/drivers/infiniband/debug \ This looks pretty hacky. Moving files around during make will interfere with people trying to e.g. create a patch. What is this doing? Can't makefile just point to the right files? 
-- MST From kliteyn at dev.mellanox.co.il Mon May 21 01:17:02 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 21 May 2007 11:17:02 +0300 Subject: [ofa-general] Re: [PATCHv2] osm: up/dn optimization - improved ranking In-Reply-To: <20070520161034.GY19271@sashak.voltaire.com> References: <46503064.7010107@dev.mellanox.co.il> <20070520161034.GY19271@sashak.voltaire.com> Message-ID: <4651557E.2080400@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 14:26 Sun 20 May , Yevgeny Kliteynik wrote: >> Hi Hal, >> >> This patch optimizes fabric ranking similar to the fat-tree ranking. >> All the root switches are marked with rank and added to the BFS list, >> and only then ranking of rest of the fabric begins. >> This version of the patch is updated in accordance with Sasha's suggestions. >> >> Please apply to master. >> >> Signed-off-by: Yevgeny Kliteynik >> --- > > Looks fine for me. Nice optimization. Thanks. > > I guess there still be issue with max_rank calculation (details are > below), which affects only log message and for me it is ok to fix it in > the incremental patch. > >> opensm/opensm/osm_ucast_updn.c | 80 >> +++++++++++++++++---------------------- >> 1 files changed, 35 insertions(+), 45 deletions(-) >> >> diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c >> index 5cebd9b..95a0622 100644 >> --- a/opensm/opensm/osm_ucast_updn.c >> +++ b/opensm/opensm/osm_ucast_updn.c > > [snip...] 
>
>> @@ -483,7 +474,7 @@ updn_subn_rank(
>> {
>> remote_u = p_remote_physp->p_node->sw->priv;
>> port_guid = p_remote_physp->port_guid;
>> - did_cause_update = __updn_update_rank(remote_u, rank);
>> + did_cause_update = __updn_update_rank(remote_u, u->rank+1);
>>
>> osm_log( p_log, OSM_LOG_DEBUG,
>> "updn_subn_rank: "
>> @@ -492,7 +483,10 @@ updn_subn_rank(
>> remote_u->rank );
>>
>> if (did_cause_update)
>> + {
>> cl_qlist_insert_tail(&list, &remote_u->list);
>> + max_rank = remote_u->rank;
>> + }
>
> I think this still be not accurate. For instance with topology like:
> A <-> B <-> C <-> D <-> E , where roots are A and E we will get
> max_rank= 1, which obviously should be 2.

Not exactly. What you're describing would happen if the scan were
DFS-like, not BFS.
In your example there are two roots: A and E.
They both got rank 0 and were added to the BFS list.
Now, starting the BFS scan:

 - Removing head of the list - A
   - Discovering B
   - Assigning B with rank 1 -------> updating max_rank
   - Adding B to the end of the list
 - Removing head of the list - E
   - Discovering D
   - Assigning D with rank 1 -------> updating max_rank
   - Adding D to the end of the list
 - Removing head of the list - B
   - Discovering C
   - Assigning C with rank 2 -------> updating max_rank
   - Adding C to the end of the list
 - Removing head of the list - D
   - Nothing to discover (C has already been discovered)
 - Removing head of the list - C
   - BFS list is empty

As you can see, the last rank was 2.
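The scan described above can be sketched as a small standalone C fragment. This is only an illustration under stated assumptions, not the OpenSM code: `struct graph`, `graph_link()`, `rank_from_roots()` and `rank_line_example()` are hypothetical names made up for this sketch, and a plain array queue stands in for the cl_qlist used in osm_ucast_updn.c. It queues every root with rank 0 before the scan starts and uses the "rank > u->rank + 1" style of update discussed in this thread.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_NODES 16
#define UNRANKED  UINT_MAX

/* Hypothetical adjacency-matrix graph; not an OpenSM structure. */
struct graph {
    size_t n;                                   /* number of nodes in use */
    unsigned char adj[MAX_NODES][MAX_NODES];    /* adj[a][b] != 0: link a<->b */
};

static void graph_link(struct graph *g, size_t a, size_t b)
{
    g->adj[a][b] = 1;
    g->adj[b][a] = 1;
}

/*
 * Rank every node by its hop distance to the nearest root.
 * All roots get rank 0 and are queued before the scan begins,
 * so the scan is one BFS that starts from several sources.
 * Returns the last (and therefore largest) rank assigned.
 */
static unsigned rank_from_roots(const struct graph *g,
                                const size_t *roots, size_t n_roots,
                                unsigned rank[MAX_NODES])
{
    size_t queue[MAX_NODES];    /* each node is queued at most once */
    size_t head = 0, tail = 0;
    unsigned max_rank = 0;
    size_t i, v;

    assert(g->n <= MAX_NODES);

    for (i = 0; i < g->n; i++)
        rank[i] = UNRANKED;

    for (i = 0; i < n_roots; i++) {
        if (rank[roots[i]] == UNRANKED) {       /* ignore duplicate roots */
            rank[roots[i]] = 0;
            queue[tail++] = roots[i];
        }
    }

    while (head < tail) {
        size_t u = queue[head++];               /* remove head of the list */

        for (v = 0; v < g->n; v++) {
            if (!g->adj[u][v])
                continue;
            if (rank[v] > rank[u] + 1) {        /* newly discovered node */
                rank[v] = rank[u] + 1;
                max_rank = rank[v];             /* BFS: ranks never decrease */
                queue[tail++] = v;              /* add to the end of the list */
            }
        }
    }
    return max_rank;
}

/* Line topology from the thread: A - B - C - D - E, roots A and E. */
static unsigned rank_line_example(unsigned rank[MAX_NODES])
{
    struct graph g = { 5, {{0}} };
    const size_t roots[] = { 0, 4 };            /* A and E */

    graph_link(&g, 0, 1);   /* A <-> B */
    graph_link(&g, 1, 2);   /* B <-> C */
    graph_link(&g, 2, 3);   /* C <-> D */
    graph_link(&g, 3, 4);   /* D <-> E */

    return rank_from_roots(&g, roots, 2, rank);
}
```

For the five-switch line, rank_line_example() leaves ranks 0, 1, 2, 1, 0 for A through E and returns 2, matching the trace above. Because a BFS dequeues nodes in non-decreasing rank order, the last rank assigned is also the maximum, which is why a simple assignment to max_rank is enough in this sketch.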
I was actually expecting this mail, because I thought of something like
this initially :)

> Probably we need something like this instead:
>
> if (did_cause_update)
> cl_qlist_insert_tail(&list, &remote_u->list);
> if (remote_u->rank <= u->rank + 1)
> max_rank = remote_u->rank;
>
> (and after such intervention into rank updating technique we may want to
> remove also __updn_update_rank() function)

Although I can't think of any scenario that would prove me wrong, I do
think that to make the code more "intuitive" we might want to remove
__updn_update_rank() and do something like this:

    if (remote_u->rank > u->rank + 1)
    {
        remote_u->rank = u->rank + 1;
        max_rank = remote_u->rank;
        cl_qlist_insert_tail(&list, &remote_u->list);
    }

> And again, this nit affects only reported value in the log message (and
> just this log message removing can be option too :)) and doesn't touch
> the optimization itself - good stuff, Yevgeny!

True, all this is for the log message only :)
We also might want to remove the message :)
I'm OK with either of the two options.
-- Yevgeny > Sasha > From vlad at lists.openfabrics.org Mon May 21 02:40:34 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 21 May 2007 02:40:34 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070521-0200 daily build status Message-ID: <20070521094034.B6A33E6082C@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.12 Passed on ppc64 with 
linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-42.ELsmp Failed: From erezz at voltaire.com Mon May 21 04:16:08 2007 From: erezz at voltaire.com (Erez Zilber) Date: Mon, 21 May 2007 14:16:08 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsiover iSER support for RHAS4 up3 and up4 In-Reply-To: <20070521081625.GA20400@mellanox.co.il> References: <4641D295.5060907@voltaire.com> <4641D38A.8040406@voltaire.com><20070510092925.GB13655@mellanox.co.il><46513C13.3010100@voltaire.com> <20070521081625.GA20400@mellanox.co.il> Message-ID: <46517F78.8000805@voltaire.com> >> >> Added a git tree: >> >> http://www.openfabrics.org/git/?p=~erezz/ofed_1_2_iser_rh4.git;a=summary > > Looks reasonable. > > However, you are copying a ton of files from upstream kernel. > Sticking extra files in include might interfere with newer > kernels, so I don't have better ideas for this for 1.2 > (for 1.3 I am hoping we'll use the submodule support in git, > so we'll be able to re-use headers as well). > > But, for files *not* in "include/", I suggest that, instead of sticking our > own version in addons, we should check out the files from upstream and tweak > makefiles to pick them up: maintaining these in OFED tree long-term will > be a > problem. Do you suggest to add a new mechanism to OFED that will do that? 
> >> >> + >> >> + struct iscsi_internal { >> >> + int daemon_pid; >> >> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l >> >> + #define cdev_to_iscsi_internal(_cdev) \ >> >> + container_of(_cdev, struct iscsi_internal, cdev) >> >> + >> >> ++extern int attribute_container_init(void); >> >> ++ >> > >> > This does not look scsi-related. Why does this belong here? >> >> This is a hack. In 2.6.20, attribute_container_init is called from > drivers/base/init.c. Since I cannot do that, I'm calling it from the > init function in scsi_transport_iscsi (because scsi_transport_iscsi uses > the attribute container). Do you have a better suggestion? > > Aha. No better ideas for the header, let it be for now. > But the code in drivers/base/init.c can be checked out rather than > copied over. I'm using only a very small part of init.c. I'm not sure that we should copy it. > >> diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/memory.h > b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h >> new file mode 100644 >> index 0000000..654ef55 >> --- /dev/null >> +++ b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h >> @@ -0,0 +1,89 @@ >> +/* >> + * include/linux/memory.h - generic memory definition >> + * >> + * This is mainly for topological representation. We define the >> + * basic "struct memory_block" here, which can be embedded in per-arch >> + * definitions or NUMA information. >> + * >> + * Basic handling of the devices is done in drivers/base/memory.c >> + * and system devices are handled in drivers/base/sys.c. >> + * >> + * Memory block are exported via sysfs in the class/memory/devices/ >> + * directory. >> + * >> + */ > > > Sorry, why are we copying this here? > Are you actually trying to work with hotplug memory? Sorry, it seems that I don't really need memory.h. It was included from init.c, but it is not necessary. I made the fix on ofed_1_2_iser_rh4.git. 
> > >> --- a/kernel_patches/fixes/iscsi_scsi_makefile.patch >> +++ /dev/null >> @@ -1,10 +0,0 @@ >> -Add a Makefile based on the kernel's drivers/scsi/Makefile in order > to build open-iscsi. >> - >> -Signed-off-by: Erez Zilber >> - >> -diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile > ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile >> ---- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile 1970-01-01 > 02:00:00.000000000 +0200 >> -+++ > ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile > 2006-12-28 17:01:22.000000000 +0200 >> -@@ -0,0 +1,2 @@ >> -+obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o >> -+obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o >> diff --git a/ofed_scripts/makefile b/ofed_scripts/makefile >> index 34a8996..62abe2c 100644 >> --- a/ofed_scripts/makefile >> +++ b/ofed_scripts/makefile >> @@ -60,6 +60,12 @@ kernel: >> @echo "Kernel version: $(KVERSION)" >> @echo "Modules directory: $(DESTDIR)/$(MODULES_DIR)" >> @echo "Kernel sources: $(KSRC)" >> + if [ -e $(CWD)/drivers/scsi/libiscsi.c ]; then \ >> + mv $(CWD)/drivers/scsi/libiscsi.c > $(CWD)/drivers/scsi/libiscsi_f.c; \ >> + fi >> + if [ -e $(CWD)/drivers/scsi/scsi_transport_iscsi.c ]; then \ >> + mv $(CWD)/drivers/scsi/scsi_transport_iscsi.c > $(CWD)/drivers/scsi/scsi_transport_iscsi_f.c; \ >> + fi >> env EXTRA_CFLAGS="$(OPENIB_KERNEL_EXTRA_CFLAGS) > $(KERNEL_MEMTRACK_CFLAGS) -I$(CWD)/include > -I$(CWD)/drivers/infiniband/include \ >> -I$(CWD)/drivers/infiniband/ulp/ipoib \ >> -I$(CWD)/drivers/infiniband/debug \ > > This looks pretty hacky. Moving files around during make will > interfere with people trying to e.g. create a patch. > What is this doing? Can't makefile just point to the right files? > Here's the problem: I'm trying to build a module that contains multiple object files (e.g. libiscsi). libiscsi contains libiscsi.o & scsi_scan.o. 
Something like:

libiscsi-y := libiscsi_f.o scsi_scan.o

The problem is that if I'm doing something like:

libiscsi-y := libiscsi.o scsi_scan.o

then, libiscsi.ko doesn't contain the symbols from libiscsi.o (only
symbols from scsi_scan.o). We found 2 solutions for this problem:

1. Change the module name - this is problematic because open-iscsi
   startup script uses the original module name.
2. Change the file name (libiscsi.c -> libiscsi_f.c) - this is what I did.

I don't really like this hack, but I wasn't able to come up with
something better. Do you know how to overcome this problem?

Thanks,
Erez

From mst at dev.mellanox.co.il Mon May 21 04:44:10 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 14:44:10 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsiover iSER support for RHAS4 up3 and up4
In-Reply-To: <46517F78.8000805@voltaire.com>
References: <4641D295.5060907@voltaire.com> <20070521081625.GA20400@mellanox.co.il> <46517F78.8000805@voltaire.com>
Message-ID: <20070521114410.GG20400@mellanox.co.il>

> Quoting Erez Zilber :
> Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsiover iSER support for RHAS4 up3 and up4
>
> >>
> >> Added a git tree:
> >>
> >> http://www.openfabrics.org/git/?p=~erezz/ofed_1_2_iser_rh4.git;a=summary
> >
> > Looks reasonable.
> >
> > However, you are copying a ton of files from upstream kernel.
> > Sticking extra files in include might interfere with newer
> > kernels, so I don't have better ideas for this for 1.2
> > (for 1.3 I am hoping we'll use the submodule support in git,
> > so we'll be able to re-use headers as well).
> >
> > But, for files *not* in "include/", I suggest that, instead of sticking our
> > own version in addons, we should check out the files from upstream and tweak
> > makefiles to pick them up: maintaining these in OFED tree long-term will
> > be a
> > problem.
> > Do you suggest to add a new mechanism to OFED that will do that? No, this is the same mechanism that we use for the rest of the files: check them out of the kernel tree. Look at file ofed_scripts/ofed_checkout.sh But I stress that we can not do this for files under include/ *unless* they only include packet structure definitions. Otherwise we'll get weird data corruption on newer kernels. > > > >> >> + > >> >> + struct iscsi_internal { > >> >> + int daemon_pid; > >> >> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l > >> >> + #define cdev_to_iscsi_internal(_cdev) \ > >> >> + container_of(_cdev, struct iscsi_internal, cdev) > >> >> + > >> >> ++extern int attribute_container_init(void); > >> >> ++ > >> > > >> > This does not look scsi-related. Why does this belong here? > >> > >> This is a hack. In 2.6.20, attribute_container_init is called from > > drivers/base/init.c. Since I cannot do that, I'm calling it from the > > init function in scsi_transport_iscsi (because scsi_transport_iscsi uses > > the attribute container). Do you have a better suggestion? > > > > Aha. No better ideas for the header, let it be for now. > > But the code in drivers/base/init.c can be checked out rather than > > copied over. > > I'm using only a very small part of init.c. I'm not sure that we should copy it. OK then. What about the stuff like scsi.c? > >> diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/memory.h > > b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h > >> new file mode 100644 > >> index 0000000..654ef55 > >> --- /dev/null > >> +++ b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h > >> @@ -0,0 +1,89 @@ > >> +/* > >> + * include/linux/memory.h - generic memory definition > >> + * > >> + * This is mainly for topological representation. We define the > >> + * basic "struct memory_block" here, which can be embedded in per-arch > >> + * definitions or NUMA information. 
> >> + * > >> + * Basic handling of the devices is done in drivers/base/memory.c > >> + * and system devices are handled in drivers/base/sys.c. > >> + * > >> + * Memory block are exported via sysfs in the class/memory/devices/ > >> + * directory. > >> + * > >> + */ > > > > > > Sorry, why are we copying this here? > > Are you actually trying to work with hotplug memory? > > Sorry, it seems that I don't really need memory.h. It was included from init.c, but it is not necessary. I made the fix on ofed_1_2_iser_rh4.git. Pls check other headers you pull in - is there something you can skip? > > > >> --- a/kernel_patches/fixes/iscsi_scsi_makefile.patch > >> +++ /dev/null > >> @@ -1,10 +0,0 @@ > >> -Add a Makefile based on the kernel's drivers/scsi/Makefile in order > > to build open-iscsi. > >> - > >> -Signed-off-by: Erez Zilber > >> - > >> -diff -ruN ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile > > ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile > >> ---- ofa_1_2_kernel-20061228-0200/drivers/scsi/Makefile 1970-01-01 > > 02:00:00.000000000 +0200 > >> -+++ > > ofa_1_2_kernel-20061228-0200-open-iscsi/drivers/scsi/Makefile > > 2006-12-28 17:01:22.000000000 +0200 > >> -@@ -0,0 +1,2 @@ > >> -+obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o > >> -+obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o > >> diff --git a/ofed_scripts/makefile b/ofed_scripts/makefile > >> index 34a8996..62abe2c 100644 > >> --- a/ofed_scripts/makefile > >> +++ b/ofed_scripts/makefile > >> @@ -60,6 +60,12 @@ kernel: > >> @echo "Kernel version: $(KVERSION)" > >> @echo "Modules directory: $(DESTDIR)/$(MODULES_DIR)" > >> @echo "Kernel sources: $(KSRC)" > >> + if [ -e $(CWD)/drivers/scsi/libiscsi.c ]; then \ > >> + mv $(CWD)/drivers/scsi/libiscsi.c > > $(CWD)/drivers/scsi/libiscsi_f.c; \ > >> + fi > >> + if [ -e $(CWD)/drivers/scsi/scsi_transport_iscsi.c ]; then \ > >> + mv $(CWD)/drivers/scsi/scsi_transport_iscsi.c > > $(CWD)/drivers/scsi/scsi_transport_iscsi_f.c; \ > >> + fi > 
>> env EXTRA_CFLAGS="$(OPENIB_KERNEL_EXTRA_CFLAGS) > > $(KERNEL_MEMTRACK_CFLAGS) -I$(CWD)/include > > -I$(CWD)/drivers/infiniband/include \ > >> -I$(CWD)/drivers/infiniband/ulp/ipoib \ > >> -I$(CWD)/drivers/infiniband/debug \ > > > > This looks pretty hacky. Moving files around during make will > > interfere with people trying to e.g. create a patch. > > What is this doing? Can't makefile just point to the right files? > > > > Here's the problem: > > I'm trying to build a module that contains multiple object files (e.g. libiscsi). libiscsi contains libiscsi.o & scsi_scan.o. Something like: > > libiscsi-y := libiscsi_f.o scsi_scan.o > > The problem is that if I'm doing something like: > > libiscsi-y := libiscsi.o scsi_scan.o > > then, libiscsi.ko doesn't contain the symbols from libiscsi.o (only symbols from scsi_scan.o). We found 2 solutions for this problem: > > 1. Change the module name - this is problematic because open-iscsi startup script uses the original module name. > 2. Change the file name (libiscsi.c -> libiscsi_f.c) - this is what I did. > > I don't really like this hack, but I wasn't able to come up with something better. Do you know how to overcome this problem? I do not have the time to look into this in a deep way. But it seems that you can just add a file libiscsi_f.c with #include "libiscsi.c" would this work? -- MST From kliteyn at dev.mellanox.co.il Mon May 21 04:53:59 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 21 May 2007 14:53:59 +0300 Subject: [ofa-general] [PATCH] osm: fixing coredump in drop manager Message-ID: <46518857.2060308@dev.mellanox.co.il> Hi Hal. This patch fixes a coredump in a drop manager when trying to clear unititialized physical ports. It happens only in master (the code in this area is a bit different in ofed_1_2). Please apply to master. Thanks. 
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_drop_mgr.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c index 97a95c2..7ec185c 100644 --- a/opensm/opensm/osm_drop_mgr.c +++ b/opensm/opensm/osm_drop_mgr.c @@ -242,7 +242,7 @@ __osm_drop_mgr_remove_port( { p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)port_num ); - if( p_physp ) + if( p_physp && osm_physp_is_valid(p_physp) ) { p_remote_physp = osm_physp_get_remote( p_physp ); if( p_remote_physp && osm_physp_is_valid( p_remote_physp ) ) -- 1.5.1.4 From mst at dev.mellanox.co.il Mon May 21 05:04:59 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 21 May 2007 15:04:59 +0300 Subject: [ofa-general] [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak Message-ID: <20070521120459.GI20400@mellanox.co.il> SRQ WR leakage that has been observed with IPoIB/CM: e.g. flipping ports on and off will, with time, leak out all WRs and then all connections will start getting RNR NACKs. Fix this in the way suggested by spec: move QP to error, wait for last wqe reached event and then post WR on "drain QP" connected to the same CQ. Once we observe a completion on the drain QP, it's safe to call ib_destroy_qp. Signed-off-by: Michael S. Tsirkin --- The following has been working well for me. Please consider for 2.6.22. 
ipoib.h | 39 ++++++++++- ipoib_cm.c | 201 ++++++++++++++++++++++++++++++++++++++++++++++++---------- ipoib_verbs.c | 2 3 files changed, 206 insertions(+), 36 deletions(-) Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-18 15:13:21.000000000 +0300 +++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-20 16:22:00.000000000 +0300 @@ -132,12 +135,43 @@ struct ipoib_cm_data { __be32 mtu; }; +/* + * Quoting 10.3.1 Queue Pair and EE Context States: + * + * Note, for QPs that are associated with an SRQ, the Consumer should take the + * QP through the Error State before invoking a Destroy QP or a Modify QP to the + * Reset State. The Consumer may invoke the Destroy QP without first performing + * a Modify QP to the Error State and waiting for the Affiliated Asynchronous + * Last WQE Reached Event. However, if the Consumer does not wait for the + * Affiliated Asynchronous Last WQE Reached Event, then WQE and Data Segment + * leakage may occur. Therefore, it is good programming practice to tear down a + * QP that is associated with an SRQ by using the following process: + * + * - Put the QP in the Error State + * - Wait for the Affiliated Asynchronous Last WQE Reached Event; + * - either: + * drain the CQ by invoking the Poll CQ verb and either wait for CQ + * to be empty or the number of Poll CQ operations has exceeded + * CQ capacity size; + * - or + * post another WR that completes on the same CQ and wait for this + * WR to return as a WC; (NB: this is the option that we use) + * - and then invoke a Destroy QP or Reset QP. 
+ */ + +enum ipoib_cm_state { + IPOIB_CM_RX_LIVE, + IPOIB_CM_RX_ERROR, /* Ignored by stale task */ + IPOIB_CM_RX_FLUSH /* Last WQE Reached event observed */ +}; + struct ipoib_cm_rx { struct ib_cm_id *id; struct ib_qp *qp; struct list_head list; struct net_device *dev; unsigned long jiffies; + enum ipoib_cm_state state; }; struct ipoib_cm_tx { @@ -165,10 +199,16 @@ struct ipoib_cm_dev_priv { struct ib_srq *srq; struct ipoib_cm_rx_buf *srq_ring; struct ib_cm_id *id; - struct list_head passive_ids; + struct ib_qp *rx_drain_qp; /* generates WR described in 10.3.1 */ + struct list_head passive_ids; /* state: LIVE */ + struct list_head rx_error_list; /* state: ERROR */ + struct list_head rx_flush_list; /* state: FLUSH, drain not started */ + struct list_head rx_drain_list; /* state: FLUSH, drain started */ + struct list_head rx_reap_list; /* state: FLUSH, drain done */ struct work_struct start_task; struct work_struct reap_task; struct work_struct skb_task; + struct work_struct rx_reap_task; struct delayed_work stale_task; struct sk_buff_head skb_queue; struct list_head start_list; Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-05-18 15:13:21.000000000 +0300 +++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-05-21 07:44:50.000000000 +0300 @@ -37,6 +37,7 @@ #include #include #include +#include #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA static int data_debug_level; @@ -62,6 +63,16 @@ struct ipoib_cm_id { u32 remote_mtu; }; +static struct ib_qp_attr ipoib_cm_err_attr = { + .qp_state = IB_QPS_ERR +}; + +#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff + +static struct ib_recv_wr ipoib_cm_rx_drain_wr = { + .wr_id = IPOIB_CM_RX_DRAIN_WRID +}; + static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); @@ -150,11 +161,44 @@ partial_error: return NULL; } +static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv* 
priv) +{ + struct ib_recv_wr *bad_wr; + + /* rx_drain_qp send queue depth is 1, so + * make sure we have at most 1 outstanding WR. */ + if (list_empty(&priv->cm.rx_flush_list) || + !list_empty(&priv->cm.rx_drain_list)) + return; + + if (ib_post_recv(priv->cm.rx_drain_qp, &ipoib_cm_rx_drain_wr, &bad_wr)) + ipoib_warn(priv, "failed to post rx_drain wr\n"); + + list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list); +} + +static void ipoib_cm_rx_event_handler(struct ib_event *event, void *ctx) +{ + struct ipoib_cm_rx *p = ctx; + struct ipoib_dev_priv *priv = netdev_priv(p->dev); + unsigned long flags; + + if (event->event != IB_EVENT_QP_LAST_WQE_REACHED) + return; + + spin_lock_irqsave(&priv->lock, flags); + list_move(&p->list, &priv->cm.rx_flush_list); + p->state = IPOIB_CM_RX_FLUSH; + ipoib_cm_start_rx_drain(priv); + spin_unlock_irqrestore(&priv->lock, flags); +} + static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev, struct ipoib_cm_rx *p) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr attr = { + .event_handler = ipoib_cm_rx_event_handler, .send_cq = priv->cq, /* does not matter, we never send anything */ .recv_cq = priv->cq, .srq = priv->cm.srq, @@ -256,6 +300,7 @@ static int ipoib_cm_req_handler(struct i cm_id->context = p; p->jiffies = jiffies; + p->state = IPOIB_CM_RX_LIVE; spin_lock_irq(&priv->lock); if (list_empty(&priv->cm.passive_ids)) queue_delayed_work(ipoib_workqueue, @@ -277,7 +322,6 @@ static int ipoib_cm_rx_handler(struct ib { struct ipoib_cm_rx *p; struct ipoib_dev_priv *priv; - int ret; switch (event->event) { case IB_CM_REQ_RECEIVED: @@ -289,20 +333,9 @@ static int ipoib_cm_rx_handler(struct ib case IB_CM_REJ_RECEIVED: p = cm_id->context; priv = netdev_priv(p->dev); - spin_lock_irq(&priv->lock); - if (list_empty(&p->list)) - ret = 0; /* Connection is going away already. 
*/ - else { - list_del_init(&p->list); - ret = -ECONNRESET; - } - spin_unlock_irq(&priv->lock); - if (ret) { - ib_destroy_qp(p->qp); - kfree(p); - return ret; - } - return 0; + if (ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE)) + ipoib_warn(priv, "unable to move qp to error state\n"); + /* Fall through */ default: return 0; } @@ -354,8 +387,15 @@ void ipoib_cm_handle_rx_wc(struct net_de wr_id, wc->status); if (unlikely(wr_id >= ipoib_recvq_size)) { - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", - wr_id, ipoib_recvq_size); + if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) { + spin_lock_irqsave(&priv->lock, flags); + list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); + ipoib_cm_start_rx_drain(priv); + queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); + spin_unlock_irqrestore(&priv->lock, flags); + } else + ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", + wr_id, ipoib_recvq_size); return; } @@ -374,9 +414,9 @@ void ipoib_cm_handle_rx_wc(struct net_de if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { spin_lock_irqsave(&priv->lock, flags); p->jiffies = jiffies; - /* Move this entry to list head, but do - * not re-add it if it has been removed. */ - if (!list_empty(&p->list)) + /* Move this entry to list head, but do not re-add it + * if it has been moved out of list. 
*/ + if (p->state == IPOIB_CM_RX_LIVE) list_move(&p->list, &priv->cm.passive_ids); spin_unlock_irqrestore(&priv->lock, flags); } @@ -583,17 +623,41 @@ static void ipoib_cm_tx_completion(struc int ipoib_cm_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr qp_init_attr = { + .send_cq = priv->cq, /* does not matter, we never send anything */ + .recv_cq = priv->cq, + .cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */ + .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ + .cap.max_recv_wr = 1, + .cap.max_recv_sge = 1, /* FIXME: 0 Seems not to work */ + .sq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UC, + }; int ret; if (!IPOIB_CM_SUPPORTED(dev->dev_addr)) return 0; + priv->cm.rx_drain_qp = ib_create_qp(priv->pd, &qp_init_attr); + if (IS_ERR(priv->cm.rx_drain_qp)) { + printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); + ret = PTR_ERR(priv->cm.rx_drain_qp); + return ret; + } + + /* We put the QP in error state directly: this way, hardware + * will immediately generate WC for each WR we post */ + ret = ib_modify_qp(priv->cm.rx_drain_qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) { + ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret); + goto err_qp; + } + priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev); if (IS_ERR(priv->cm.id)) { printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); ret = PTR_ERR(priv->cm.id); - priv->cm.id = NULL; - return ret; + goto err_cm; } ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num), @@ -601,35 +665,79 @@ int ipoib_cm_dev_open(struct net_device if (ret) { printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name, IPOIB_CM_IETF_ID | priv->qp->qp_num); - ib_destroy_cm_id(priv->cm.id); - priv->cm.id = NULL; - return ret; + goto err_listen; } + return 0; + +err_listen: + ib_destroy_cm_id(priv->cm.id); +err_cm: + priv->cm.id = NULL; +err_qp: + 
ib_destroy_qp(priv->cm.rx_drain_qp); + return ret; } void ipoib_cm_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_cm_rx *p; + struct ipoib_cm_rx *p, *n; + unsigned long begin; + LIST_HEAD(list); + int ret; if (!IPOIB_CM_SUPPORTED(dev->dev_addr) || !priv->cm.id) return; ib_destroy_cm_id(priv->cm.id); priv->cm.id = NULL; + spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); - list_del_init(&p->list); + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; spin_unlock_irq(&priv->lock); + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + spin_lock_irq(&priv->lock); + } + + /* Wait for all RX to be drained */ + begin = jiffies; + + while (!list_empty(&priv->cm.rx_error_list) || + !list_empty(&priv->cm.rx_flush_list) || + !list_empty(&priv->cm.rx_drain_list)) { + if (!time_after(jiffies, begin + 5 * HZ)) { + ipoib_warn(priv, "RX drain timing out\n"); + + /* + * assume the HW is wedged and just free up everything. 
+ */ + list_splice_init(&priv->cm.rx_flush_list, &list); + list_splice_init(&priv->cm.rx_error_list, &list); + list_splice_init(&priv->cm.rx_drain_list, &list); + break; + } + spin_unlock_irq(&priv->lock); + msleep(1); + spin_lock_irq(&priv->lock); + } + + list_splice_init(&priv->cm.rx_reap_list, &list); + + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(p, n, &list, list) { ib_destroy_cm_id(p->id); ib_destroy_qp(p->qp); kfree(p); - spin_lock_irq(&priv->lock); } - spin_unlock_irq(&priv->lock); + ib_destroy_qp(priv->cm.rx_drain_qp); cancel_delayed_work(&priv->cm.stale_task); } @@ -1079,24 +1187,44 @@ void ipoib_cm_skb_too_long(struct net_de queue_work(ipoib_workqueue, &priv->cm.skb_task); } +static void ipoib_cm_rx_reap(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, + cm.rx_reap_task); + struct ipoib_cm_rx *p, *n; + LIST_HEAD(list); + + spin_lock_irq(&priv->lock); + list_splice_init(&priv->cm.rx_reap_list, &list); + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(p, n, &list, list) { + ib_destroy_cm_id(p->id); + ib_destroy_qp(p->qp); + kfree(p); + } +} + static void ipoib_cm_stale_task(struct work_struct *work) { struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, cm.stale_task.work); struct ipoib_cm_rx *p; + int ret; spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { - /* List if sorted by LRU, start from tail, + /* List is sorted by LRU, start from tail, * stop when we see a recently used entry */ p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; - list_del_init(&p->list); + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; spin_unlock_irq(&priv->lock); - ib_destroy_cm_id(p->id); - ib_destroy_qp(p->qp); - kfree(p); + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error 
state: %d\n", ret); spin_lock_irq(&priv->lock); } @@ -1164,9 +1292,14 @@ int ipoib_cm_dev_init(struct net_device INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); INIT_LIST_HEAD(&priv->cm.start_list); + INIT_LIST_HEAD(&priv->cm.rx_error_list); + INIT_LIST_HEAD(&priv->cm.rx_flush_list); + INIT_LIST_HEAD(&priv->cm.rx_drain_list); + INIT_LIST_HEAD(&priv->cm.rx_reap_list); INIT_WORK(&priv->cm.start_task, ipoib_cm_tx_start); INIT_WORK(&priv->cm.reap_task, ipoib_cm_tx_reap); INIT_WORK(&priv->cm.skb_task, ipoib_cm_skb_reap); + INIT_WORK(&priv->cm.rx_reap_task, ipoib_cm_rx_reap); INIT_DELAYED_WORK(&priv->cm.stale_task, ipoib_cm_stale_task); skb_queue_head_init(&priv->cm.skb_queue); Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-18 15:13:21.000000000 +0300 +++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-18 15:13:39.000000000 +0300 @@ -173,7 +173,7 @@ int ipoib_transport_dev_init(struct net_ size = ipoib_sendq_size + ipoib_recvq_size + 1; ret = ipoib_cm_dev_init(dev); if (!ret) - size += ipoib_recvq_size; + size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */; priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { -- MST From mst at dev.mellanox.co.il Mon May 21 05:06:33 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 21 May 2007 15:06:33 +0300 Subject: [ofa-general] [PATCH for-2.6.22] IB/cm: improve local id allocation Message-ID: <20070521120633.GJ20400@mellanox.co.il> IB/cm uses idr for local id allocations, with a running counter as start_id. This fails to generate distinct ids in the scenario where 1. An id is constantly created and destroyed 2. 
A chunk of ids just beyond the current next_id value is occupied

This in turn leads to an increased chance of connection request being mis-detected
as a duplicate, sometimes for several retries, until next_id gets past
the block of allocated ids. This has been observed in practice.

As a fix, remember the last id allocated and start immediately above it.

Signed-off-by: Michael S. Tsirkin

---

BTW, cast to unsigned here is to prevent integer overflow and make
language lawyers happy.

Sean, can you ack this pls?

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index d446998..9032cd3 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -308,7 +308,9 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 	do {
 		spin_lock_irqsave(&cm.lock, flags);
 		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
-					next_id++, &id);
+					next_id, &id);
+		if (!ret)
+			next_id = (unsigned)id + 1;
 		spin_unlock_irqrestore(&cm.lock, flags);
 	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );

-- 
MST

From ogerlitz at voltaire.com Mon May 21 05:28:37 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 21 May 2007 15:28:37 +0300
Subject: [ofa-general] [PATCH for-2.6.22] IB/cm: improve local id allocation
In-Reply-To: <20070521120633.GJ20400@mellanox.co.il>
References: <20070521120633.GJ20400@mellanox.co.il>
Message-ID: <46519075.9030303@voltaire.com>

Michael S. Tsirkin wrote:
> IB/cm uses idr for local id allocations, with a running counter
> as start_id. This fails to generate distinct ids

> static int cm_alloc_id(struct cm_id_private *cm_id_priv)
> {
> 	unsigned long flags;
> 	int ret, id;
> 	static int next_id;
>
> 	do {
> 		spin_lock_irqsave(&cm.lock, flags);
> 		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
> 					next_id++, &id);
> 		spin_unlock_irqrestore(&cm.lock, flags);
> 	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );
>
> 	cm_id_priv->id.local_id = (__force __be32) (id ^ cm.random_id_operand);

Doesn't this XOR of the resulting ID with a random value, done after the
idr allocation, cause the cm to --always-- produce distinct ids?

Or.

From mst at dev.mellanox.co.il Mon May 21 05:34:22 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 15:34:22 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id allocation
In-Reply-To: <46519075.9030303@voltaire.com>
References: <20070521120633.GJ20400@mellanox.co.il> <46519075.9030303@voltaire.com>
Message-ID: <20070521123422.GK20400@mellanox.co.il>

> Quoting Or Gerlitz :
> Subject: Re: [PATCH for-2.6.22] IB/cm: improve local id allocation
>
> Michael S. Tsirkin wrote:
> >IB/cm uses idr for local id allocations, with a running counter
> >as start_id. This fails to generate distinct ids
>
> >static int cm_alloc_id(struct cm_id_private *cm_id_priv)
> >{
> >	unsigned long flags;
> >	int ret, id;
> >	static int next_id;
> >
> >	do {
> >		spin_lock_irqsave(&cm.lock, flags);
> >		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
> >					next_id++, &id);
> >		spin_unlock_irqrestore(&cm.lock, flags);
> >	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table,
> >	GFP_KERNEL) );
> >
> >	cm_id_priv->id.local_id = (__force __be32) (id ^
> >	cm.random_id_operand);
>
> Doesn't this XOR of the resulting ID with a random value, done after the
> idr allocation, cause the cm to --always-- produce distinct ids?
>
> Or.

No - the "cm.random_id_operand" is initialized at module load time.
That's why we have the static next_id iterator. -- MST From Koen.SEGERS at VRT.BE Mon May 21 06:04:08 2007 From: Koen.SEGERS at VRT.BE (SEGERS Koen) Date: Mon, 21 May 2007 15:04:08 +0200 Subject: [ofa-general] GPFS node loses IB-connection Message-ID: Hi, We are running GPFS with SDP. For this we use OFED 1.2-rc1. The machines are IBM x3755's and x3655's. The IB-switch is a SFS-7000P. The HCA's are all "Mellanox Technologies MT25208 InfiniHost III Ex (rev a0)". Under heavy load, we sometimes lose a node from our GPFS cluster. The machine that lost connection (=10.224.158.104 or gpfswhbe1s1) gave this error: May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901: Reason code 668 Failure Reason Lost membership in cluster enterprise.universe. Unmounting file systems. May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901: After this, we got the following message on some of the nodes that are part of the cluster (including the failing node): GPFS Deadman Switch timer [0] has expired; IOs in progress: 0 Badness in do_exit at kernel/exit.c:807 Call Trace: {do_exit+80} {sys_exit_group+0} {system_call+126} Badness in do_exit at kernel/exit.c:807 Call Trace: {do_exit+80} {sys_exit_group+0} {system_call+126} idr_remove called for id=0 which is not allocated. Call Trace: {idr_remove+228} {kill_anon_super+41} {deactivate_super+111} {sys_umount+624} {sys_newstat+25} {__fput+348} {mntput_no_expire+25} {filp_close+89} {system_call+126} Not all of them give the tracelog at the end. 
GPFS then gives the following errors: 10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.1 cic-gpfswhbe1n1 10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.1 cic-gpfswhbe1n1 10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.15 cic-gpfswhbe1s2 10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.15 cic-gpfswhbe1s2 10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.16 cic-gpfswhbe1s3 10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.16 cic-gpfswhbe1s3 10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.17 cic-gpfswhbe1s4 10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.17 cic-gpfswhbe1s4 10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.2 cic-gpfswhbe1n2 10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.2 cic-gpfswhbe1n2 10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.3 cic-gpfswhbe1n3 10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.3 cic-gpfswhbe1n3 10.224.158.104 Fri May 18 13:02:51 2007: Lost membership in cluster enterprise.universe. Unmounting file systems. 10.224.158.104 Fri May 18 13:02:51 2007: Lost membership in cluster enterprise.universe. Unmounting file systems. 10.224.158.106 Fri May 18 13:02:53 2007: Close connection to 192.168.1.14 cic-gpfswhbe1s1 10.224.158.106 Fri May 18 13:02:53 2007: Close connection to 192.168.1.14 cic-gpfswhbe1s1 etc. 
We found this in the logs of the switch: May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1 May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:98:d1 May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering removed ports May 18 11:02:51 topspin-120sc ib_sm.x[628]: %IB-6-INFO: Program switch port state to down, node=00:05:ad:00:00:0b:a2:cc, port= 1, due to non-responding CA May 18 11:02:51 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port down - port=1/1, type=ib4xTXP May 18 11:02:51 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: in portTblFindEntry() - IfIndex=65(1/1) May 18 11:02:51 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: cannot find entry - IfIndex=65(1/1) May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering new ports May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1 May 18 11:02:53 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port up - port=1/1, type=ib4xTXP May 18 11:02:54 topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:98:d1 May 18 11:02:54 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change We are sure this is gpfswhbe1s1, as the number is the same as the node_guid+1: gpfswhbe1s1:~ # ibv_devinfo libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. 
This will severely limit memory registrations.
hca_id: mthca0
        fw_ver:             5.1.0
        node_guid:          0005:ad00:0008:98d0
        sys_image_guid:     0005:ad00:0008:98d3
        vendor_id:          0x05ad
        vendor_part_id:     25218
        hw_ver:             0xA0
        board_id:           HCA.LionMini.A0
        phys_port_cnt:      2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         2
                        port_lid:       6
                        port_lmc:       0x00
                port:   2
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         2
                        port_lid:       4
                        port_lmc:       0x00

Does anyone have a clue what happened? The error does not come up very
often, so we can't reproduce it easily. We believe the HCA on gpfswhbe1s1
caused the problem, but we can't really see it.

All help is appreciated!

Regards,

Koen

*** Disclaimer ***
Vlaamse Radio- en Televisieomroep
Auguste Reyerslaan 52, 1043 Brussel
nv van publiek recht
BTW BE 0244.142.664 RPR Brussel
http://www.vrt.be/disclaimer

From rdreier at cisco.com Mon May 21 06:54:24 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 06:54:24 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id allocation
In-Reply-To: <20070521120633.GJ20400@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 21 May 2007 15:06:33 +0300")
References: <20070521120633.GJ20400@mellanox.co.il>
Message-ID:

> IB/cm uses idr for local id allocations, with a running counter
> as start_id. This fails to generate distinct ids in the scenario where
> 1. An id is constantly created and destroyed
> 2. A chunk of ids just beyond the current next_id value is occupied
>
> This in turn leads to an increased chance of connection request being mis-detected
> as a duplicate, sometimes for several retries, until next_id gets past
> the block of allocated ids. This has been observed in practice.
>
> As a fix, remember the last id allocated and start immediately above it.
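
A minimal userspace sketch of the scenario quoted above, for illustration only. It assumes a toy allocator that, like idr_get_new_above(), hands out the lowest free id at or above a starting point. The names alloc_above(), used[] and count_repeats() are hypothetical and are not kernel APIs; locking, idr_pre_get() and the XOR with random_id_operand are left out.

```c
#include <stdbool.h>

#define TABLE_SIZE 64            /* toy id space, large enough for this demo */
#define BUSY_FIRST 1             /* long-lived ids 1..8 model the occupied chunk */
#define BUSY_LAST  8

static bool used[TABLE_SIZE];

/* Toy stand-in for idr_get_new_above(): lowest free id >= start, or -1. */
static int alloc_above(int start)
{
	for (int id = start; id < TABLE_SIZE; id++) {
		if (!used[id]) {
			used[id] = true;
			return id;
		}
	}
	return -1;
}

/*
 * Create and destroy one id per round and count how often a round gets
 * the same id as the round before it.
 * remember_last == false: start from a running counter (next_id++).
 * remember_last == true:  start just above the last id actually returned.
 */
int count_repeats(bool remember_last, int rounds)
{
	int next_id = 0, prev = -1, repeats = 0;

	for (int i = 0; i < TABLE_SIZE; i++)
		used[i] = (i >= BUSY_FIRST && i <= BUSY_LAST);

	for (int r = 0; r < rounds; r++) {
		int id;

		if (remember_last) {
			id = alloc_above(next_id);
			if (id >= 0)
				next_id = id + 1;
		} else {
			id = alloc_above(next_id++);
		}
		if (id < 0)
			break;          /* toy table exhausted */
		if (id == prev)
			repeats++;
		prev = id;
		used[id] = false;       /* destroy the id again */
	}
	return repeats;
}
```

Under this toy model, count_repeats(false, 12) reports 8 back-to-back reuses (id 9 is handed out again and again while the counter crawls through the occupied chunk), while count_repeats(true, 12) reports none.
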
OK I guess but this needs some explanation about why the impact is so
severe we want to merge it after rc2 is already out.

> + next_id = (unsigned)id + 1;

what happens when this wraps and becomes negative? in fact the idr
stuff all works with plain signed ints -- could idr_get_new() ever
give a negative id? (too lazy to look at the source right now)

- R.

From jsquyres at cisco.com Mon May 21 07:37:18 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 21 May 2007 07:37:18 -0700
Subject: [ofa-general] Today's OFED teleconf
Message-ID:

Greetings all.

Today's OFED teleconference starts in approximately 90 minutes (9am US Pacific, 11am US central, noon US Eastern, 7pm Israel).

Code: 2102061

Dial in numbers:
US/Canada: +1.866.432.9903
India: +91.80.4103.3979
Israel: +972.9.892.7026
Others: http://cisco.com/en/US/about/doing_business/conferencing/index.html

-- 
Jeff Squyres
Cisco Systems

From jim.ryan at intel.com Mon May 21 07:45:59 2007
From: jim.ryan at intel.com (Ryan, Jim)
Date: Mon, 21 May 2007 07:45:59 -0700
Subject: [ofa-general] RE: [ewg] Today's OFED teleconf
In-Reply-To:
Message-ID: <55CE0347B98FCA468923E5FBC25CB4DCF6D55B@orsmsx413.amr.corp.intel.com>

Jeff, I think we have a pretty light agenda. I'll try to wrap the board
meeting up early so there's no conflict

Jim

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Jeff Squyres
Sent: Monday, May 21, 2007 7:37 AM
To: OpenFabrics EWG; OpenFabrics General
Subject: [ewg] Today's OFED teleconf

Greetings all.

Today's OFED teleconference starts in approximately 90 minutes (9am US Pacific, 11am US central, noon US Eastern, 7pm Israel).

Code: 2102061

Dial in numbers:
US/Canada: +1.866.432.9903
India: +91.80.4103.3979
Israel: +972.9.892.7026
Others: http://cisco.com/en/US/about/doing_business/conferencing/index.html

-- 
Jeff Squyres
Cisco Systems

_______________________________________________
ewg mailing list
ewg at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

From tziporet at dev.mellanox.co.il Mon May 21 07:49:34 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 21 May 2007 17:49:34 +0300
Subject: [ofa-general] Reminder: OFED 1.2 meeting today at 9am PST
Message-ID: <4651B17E.40200@mellanox.co.il>

Hi All,

Note: Woody will run the meeting today since I will not be able to attend.
Vlad will represent Mellanox in the meeting.

I suggest we set a new target date for RC4: May 30

Tziporet

These are the bugs that should be reviewed:

567  blocker   jsquyres at cisco.com            MPI does not work on RHEL5 ppc64
611  critical  swise at opengridcomputing.com  cxgb3: passive side connection transition from streaming to RDMA is broken
577  critical  ishai at mellanox.co.il         SRP multipath failover too slow (minutes, not seconds)
465  critical  mst at mellanox.co.il           IPoIB HA fails after several hours of failovers
604  critical  mst at mellanox.co.il           Oops running UDP traffic with IPoIB CM
608  major     monis at voltaire.com           traffic fails to resume after SM failover with bonding interfaces
626  major     monis at voltaire.com           wrong network/broadcast address set by ib-bond script
629  major     monis at voltaire.com           ib-bonding: sometimes slow failover is noticed
632  major     mee at pathscale.com            Intel MPI Benchmark fails on InfiniPath with 4 or more PPN

From tziporet at dev.mellanox.co.il Mon May 21 07:52:03 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 21 May 2007 17:52:03 +0300
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
References:
Message-ID: <4651B213.8040803@mellanox.co.il>

SEGERS Koen wrote:
> Hi,
>
> We are running GPFS with SDP. For this we use OFED 1.2-rc1. The
> machines are IBM x3755's and x3655's. The IB-switch is a SFS-7000P.
> The HCA's are all "Mellanox Technologies MT25208 InfiniHost III Ex
> (rev a0)".
>

Can you try OFED 1.2-rc3?

Tziporet

From jim.ryan at intel.com Mon May 21 07:51:01 2007
From: jim.ryan at intel.com (Ryan, Jim)
Date: Mon, 21 May 2007 07:51:01 -0700
Subject: [ofa-general] RE: [ewg] Today's OFED teleconf -- board bridge reminder added
In-Reply-To: <55CE0347B98FCA468923E5FBC25CB4DCF6D55B@orsmsx413.amr.corp.intel.com>
Message-ID: <55CE0347B98FCA468923E5FBC25CB4DCF6D561@orsmsx413.amr.corp.intel.com>

Monday, May 21, 2007, 08:00 AM US Pacific Time
916-356-2663, Bridge: 1, Passcode: 3094290

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Ryan, Jim
Sent: Monday, May 21, 2007 7:46 AM
To: Jeff Squyres; OpenFabrics EWG; OpenFabrics General
Subject: RE: [ewg] Today's OFED teleconf

Jeff, I think we have a pretty light agenda. I'll try to wrap the board
meeting up early so there's no conflict

Jim

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Jeff Squyres
Sent: Monday, May 21, 2007 7:37 AM
To: OpenFabrics EWG; OpenFabrics General
Subject: [ewg] Today's OFED teleconf

Greetings all.

Today's OFED teleconference starts in approximately 90 minutes (9am US Pacific, 11am US central, noon US Eastern, 7pm Israel).
Code: 2102061
Dial in numbers:

US/Canada: +1.866.432.9903
India: +91.80.4103.3979
Israel: +972.9.892.7026
Others: http://cisco.com/en/US/about/doing_business/conferencing/index.html

--
Jeff Squyres
Cisco Systems

_______________________________________________
ewg mailing list
ewg at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

From mst at dev.mellanox.co.il Mon May 21 07:54:36 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 17:54:36 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id allocation
In-Reply-To:
References: <20070521120633.GJ20400@mellanox.co.il>
Message-ID: <20070521145436.GA31097@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: [PATCH for-2.6.22] IB/cm: improve local id allocation
>
> > IB/cm uses idr for local id allocations, with a running counter
> > as start_id. This fails to generate distinct ids in the scenario where
> > 1. An id is constantly created and destroyed
> > 2. A chunk of ids just beyond the current next_id value is occupied
> >
> > This in turn leads to an increased chance of connection request being mis-detected
> > as a duplicate, sometimes for several retries, until next_id gets past
> > the block of allocated ids. This has been observed in practice.
> >
> > As a fix, remember the last id allocated and start immediately above it.
>
> OK I guess but this needs some explanation about why the impact is so
> severe we want to merge it after rc2 is already out.

Well, it's a single-liner, so it seemed safe.
The impact currently is that the CM times out and we re-create a connection;
then either the application aborts, or this process repeats until we get
a good id, which can take a couple of minutes.

> > +		next_id = (unsigned)id + 1;
>
> what happens when this wraps and becomes negative?
>
> in fact the idr stuff all works with plain signed ints -- could
> idr_get_new() ever give a negative id? (too lazy to look at the
> source right now)

Good point, I'll check.

--
MST

From Koen.SEGERS at VRT.BE Mon May 21 07:55:58 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Mon, 21 May 2007 16:55:58 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <4651B213.8040803@mellanox.co.il>
Message-ID:

Is this a common problem with RC1?
I can change it, but it will take a while... I'll start building rpms anyway.

Greetz

Koen

-----Original Message-----
From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il]
Sent: Monday, 21 May 2007 16:52
To: SEGERS Koen
CC: general at lists.openfabrics.org
Subject: Re: [ofa-general] GPFS node loses IB-connection

SEGERS Koen wrote:
> Hi,
>
> We are running GPFS with SDP. For this we use OFED 1.2-rc1. The
> machines are IBM x3755's and x3655's. The IB-switch is a SFS-7000P.
> The HCA's are all "Mellanox Technologies MT25208 InfiniHost III Ex
> (rev a0)".
>

Can you try OFED 1.2-rc3?

Tziporet

*** Disclaimer ***

Vlaamse Radio- en Televisieomroep
Auguste Reyerslaan 52, 1043 Brussel

Public-law limited company (nv van publiek recht)
VAT BE 0244.142.664
RPR Brussel
http://www.vrt.be/disclaimer

From xma at us.ibm.com Mon May 21 08:41:10 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Mon, 21 May 2007 08:41:10 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
Message-ID:

Hello,

What's the output of /var/log/messages when you hit this problem?

Shirley Ma
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rdreier at cisco.com Mon May 21 09:01:28 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 09:01:28 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id allocation
In-Reply-To: <20070521120633.GJ20400@mellanox.co.il> (Michael S.
Tsirkin's message of "Mon, 21 May 2007 15:06:33 +0300")
References: <20070521120633.GJ20400@mellanox.co.il>
Message-ID:

lib/idr.c says it returns positive IDs always (actually the comments
say "in the range 0 ... 0x7fffffff"). So I guess we would want
something like:

		if (!ret)
			next_id = id == INT_MAX ? 0 : id + 1;

(current code has a similar bug, plus exposes undefined behavior of
signed overflow).

 - R.

From mst at dev.mellanox.co.il Mon May 21 09:06:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 May 2007 19:06:54 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id allocation
In-Reply-To:
References: <20070521120633.GJ20400@mellanox.co.il>
Message-ID: <20070521160654.GD31097@mellanox.co.il>

> > +		next_id = (unsigned)id + 1;
>
> what happens when this wraps and becomes negative?
>
> in fact the idr stuff all works with plain signed ints -- could
> idr_get_new() ever give a negative id? (too lazy too look at the
> source right now)

A quick look suggests that the idr stuff is *really* not designed to
get a negative input; note that the old code has the wrap-around problem, too.
So, I think the following would be a better fix:

Hmm?

Signed-off-by: Michael S. Tsirkin

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index eff591d..5e77b01 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -306,7 +306,9 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 	do {
 		spin_lock_irqsave(&cm.lock, flags);
 		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
-					next_id++, &id);
+					next_id, &id);
+		if (!ret)
+			next_id = id == 0x7ffffff ? 0 : id + 1;
 		spin_unlock_irqrestore(&cm.lock, flags);
 	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );

--
MST

From mst at dev.mellanox.co.il Mon May 21 09:11:30 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 21 May 2007 19:11:30 +0300 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id allocation In-Reply-To: References: <20070521120633.GJ20400@mellanox.co.il> Message-ID: <20070521161130.GE31097@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH for-2.6.22] IB/cm: improve local id allocation > > lib/idr.c says it returns positive IDs always (actually the comments > say "in the range 0 ... 0x7fffffff"). So I guess we would want > something like: > > if (!ret) > next_id = id == INT_MAX ? 0 : id + 1; True, except INT_MAX isn't defined in kernel headers I think, so I just put 0x7fffffff there. > (current code has a similar bug, plus exposes undefined behavior of > signed overflow). -- MST From rdreier at cisco.com Mon May 21 09:13:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 21 May 2007 09:13:14 -0700 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id allocation In-Reply-To: <20070521160654.GD31097@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 21 May 2007 19:06:54 +0300") References: <20070521120633.GJ20400@mellanox.co.il> <20070521160654.GD31097@mellanox.co.il> Message-ID: > A quick looks makes it look like idr stuff is *really* not designed to > get a negative input: and note that old code has the wrap-around problem, too. > So, I think the following would be a better fix: Yes, that's basically what I just proposed (although see below). It all looks pretty safe to me... Sean, what do you think about this for 2.6.22? > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c > index eff591d..5e77b01 100644 > --- a/drivers/infiniband/core/cm.c > +++ b/drivers/infiniband/core/cm.c > @@ -306,7 +306,9 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv) > do { > spin_lock_irqsave(&cm.lock, flags); > ret = idr_get_new_above(&cm.local_id_table, cm_id_priv, > - next_id++, &id); > + next_id, &id); > + if (!ret) > + next_id = id == 0x7ffffff ? 
0 : id + 1; ...except I used MAX_INT here, and indeed your patch only has 6 'f's in that constant. Actually digging a little I see that we should use MAX_ID_MASK to be really correct. From rdreier at cisco.com Mon May 21 09:14:03 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 21 May 2007 09:14:03 -0700 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id allocation In-Reply-To: <20070521161130.GE31097@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 21 May 2007 19:11:30 +0300") References: <20070521120633.GJ20400@mellanox.co.il> <20070521161130.GE31097@mellanox.co.il> Message-ID: > True, except INT_MAX isn't defined in kernel headers I think, > so I just put 0x7fffffff there. It doesn't really matter (see my other reply) but actually INT_MAX and others are in - R. From swise at opengridcomputing.com Mon May 21 09:14:14 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 21 May 2007 09:14:14 -0700 Subject: [ofa-general] Reminder: OFED 1.2 meeting today at 9am PST In-Reply-To: <4651B17E.40200@mellanox.co.il> References: <4651B17E.40200@mellanox.co.il> Message-ID: <4651C556.4020904@opengridcomputing.com> I cannot attend this meeting. Bug 611 has been closed today. The fix went in last week. Thanks, Steve. Tziporet Koren wrote: > Hi All, > > Note: Woody will run the meeting today since I will not be able to attend. 
> Vlad will represent Mellanox in the meeting > I suggest we set a new target date for RC4: May 30 > > Tziporet > > These are the bugs that should be reviewed: > > 567 blocker jsquyres at cisco.com MPI does not work on RHEL5 ppc64 > 611 critical swise at opengridcomputing.com cxgb3: passive side > connection transition from streaming to RDMA is broken > 577 critical ishai at mellanox.co.il SRP multipath failover too slow > (minutes, not seconds) > 465 critical mst at mellanox.co.il IPoIB HA fails after several hours > of failovers > 604 critical mst at mellanox.co.il Oops running UDP traffic with IPoIB CM > 608 major monis at voltaire.com traffic fails to resume after SM > failover with bonding interfaces > 626 major monis at voltaire.com wrong network /broadcast address set > by ib-bond script > 629 major monis at voltaire.com ib-bonding: sometimes slow failover is > noticed > 632 major mee at pathscale.com Intel MPI Benchmark fails on InfiniPath > with 4 or more PPN > > > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From parks at lanl.gov Mon May 21 09:16:16 2007 From: parks at lanl.gov (parks) Date: Mon, 21 May 2007 10:16:16 -0600 Subject: [ofa-general] Today's OFED teleconf In-Reply-To: References: Message-ID: <7.0.1.0.2.20070521101557.02a49e50@lanl.gov> ON travel will not be able to make it. parks At 08:37 AM 5/21/2007, Jeff Squyres wrote: >Greetings all. > >Today's OFED teleconference starts in approximately 90 minutes (9am >US Pacific, 11am US central, noon US Eastern, 7pm Israel). 
> >Code: 2102061 >Dial in numbers: > >US/Canada: +1.866.432.9903 >India: +91.80.4103.3979 >Israel: +972.9.892.7026 >Others: >http://cisco.com/en/US/about/doing_business/conferencing/ index.html > >-- >Jeff Squyres >Cisco Systems > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Mon May 21 09:23:37 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 21 May 2007 19:23:37 +0300 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id allocation In-Reply-To: References: <20070521120633.GJ20400@mellanox.co.il> <20070521160654.GD31097@mellanox.co.il> Message-ID: <20070521162336.GF31097@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH for-2.6.22] IB/cm: improve local id allocation > > > A quick looks makes it look like idr stuff is *really* not designed to > > get a negative input: and note that old code has the wrap-around problem, too. > > So, I think the following would be a better fix: > > Yes, that's basically what I just proposed (although see below). It > all looks pretty safe to me... Sean, what do you think about this for > 2.6.22? > > > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c > > index eff591d..5e77b01 100644 > > --- a/drivers/infiniband/core/cm.c > > +++ b/drivers/infiniband/core/cm.c > > @@ -306,7 +306,9 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv) > > do { > > spin_lock_irqsave(&cm.lock, flags); > > ret = idr_get_new_above(&cm.local_id_table, cm_id_priv, > > - next_id++, &id); > > + next_id, &id); > > + if (!ret) > > + next_id = id == 0x7ffffff ? 0 : id + 1; > > ...except I used MAX_INT here, and indeed your patch only has 6 'f's > in that constant. 
Actually digging a little I see that we should use
> MAX_ID_MASK to be really correct.

And since it's a *mask*, we can do it this way if you like:

> > +		if (!ret)
> > +			next_id = ((unsigned)id + 1) & MAX_ID_MASK;

which might generate a bit less code.

--
MST

From mshefty at ichips.intel.com Mon May 21 09:34:45 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 21 May 2007 09:34:45 -0700
Subject: [ofa-general] IB/cm: bug in stale connection detection logic?
In-Reply-To: <20070520134441.GI20649@mellanox.co.il>
References: <20070520134441.GI20649@mellanox.co.il>
Message-ID: <4651CA25.9050309@ichips.intel.com>

> 1. I see this in cm_match_req:
>
>	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
>	if (!timewait_info)
>		timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
>
>	if (timewait_info) {
>		cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
>					   timewait_info->work.remote_id);
>		cm_cleanup_timewait(cm_id_priv->timewait_info);
>		spin_unlock_irqrestore(&cm.lock, flags);
>		if (cur_cm_id_priv) {
>			cm_dup_req_handler(work, cur_cm_id_priv);
>			cm_deref_id(cur_cm_id_priv);
>		} else
>			cm_issue_rej(work->port, work->mad_recv_wc,
>				     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
>				     NULL, 0);
>
> Note that cm_get_id is passed data from timewait_info and not from the request:
> thus, if the QPN in request matches QPN in an existing connection, this is
> mis-detected as a duplicate request even if the IDs do not match;
> thus, the request is dropped or "duplicate" reject is sent instead of
> a "stale connection" reject.
>
> Am I missing something?

I think you may be right on the QPN check, so I'll look into it more. Note
that a REQ doesn't carry the local ID, which is why cm_get_id doesn't use the
IDs from the REQ.

> Suggestion:
> Why is an extra call to cm_get_id required to detect a duplicate?
> Shouldn't we just
>
>	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
>	if (timewait_info) {
>		/* handle duplicate */

This isn't necessarily a duplicate. We need to check the state of the local
connection endpoint (hence the extra call to cm_get_id). Also if the remote CM
lost its state information, it could re-use the remote ID for a new connection.

>		return;
>	}
>
>	timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
>	if (timewait_info) {
>		/* handle stale */

If the previous check fails, it does look like this should be stale.

>		return;
>	}
>
>	not a duplicate and not a stale connection
>
> 2. Another question:
>
> cm_dup_req_handler does this:
>	/* Quick state check to discard duplicate REQs. */
>	if (cm_id_priv->id.state == IB_CM_REQ_RCVD)
>		return;
>
> Why is this code here? IB_CM_REQ_RCVD is an ephemeral state,
> going to IB_CM_REP_SENT immediately.

The transition from REQ_RCVD to REP_SENT requires user intervention. It is not
immediate and can take several seconds to up to a minute depending on how
quickly the user responds to connection requests. For userspace apps, the
timing depends on how quickly the user retrieves and processes CM events, which
can take longer than the retry timeout.

> Why are duplicate REQs discarded? Should not REP be re-sent?
> See 12.9.6 COMMUNICATION ESTABLISHMENT - PASSIVE

A REP can only be re-sent if we're in the REP_SENT state, which is not the
state being checked. If the remote side has sent 3 REQs in the time that it
takes to respond to the first REQ, it's inefficient to generate 2 duplicate
REPs when finally sending the first REP.

- Sean

From robert.j.woodruff at intel.com Mon May 21 09:40:37 2007
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 21 May 2007 09:40:37 -0700
Subject: [ofa-general] OFED 1.2 meeting today at 9am PST - Meeting Minutes
In-Reply-To: <4651B17E.40200@mellanox.co.il>
Message-ID:

We discussed the May 30 date for RC4 and people were OK with that date.
We reviewed the outstanding bugs,

Status                        ID   Severity  Owner                           Summary
Still open                    567  blocker   jsquyres at cisco.com            MPI does not work on RHEL5 ppc64
Fixed                         611  critical  swise at opengridcomputing.com  cxgb3: passive side
Still open                    577  critical  ishai at mellanox.co.il          SRP multipath failover too slow (minutes, not seconds)
Fixed                         465  critical  mst at mellanox.co.il            IPoIB HA fails after several hours of failovers
Still open                    604  critical  mst at mellanox.co.il            Oops running UDP traffic with IPoIB CM
Cannot reproduce              608  major     monis at voltaire.com            traffic fails to resume after SM failover with bonding
Still open                    626  major     monis at voltaire.com            wrong network /broadcast address set by ib-bond script
Still open                    629  major     monis at voltaire.com            ib-bonding: sometimes slow failover is noticed
May not fix for this release  632  major     mee at pathscale.com             Intel MPI Benchmark fails on InfiniPath with 4 or more PPN

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mshefty at ichips.intel.com Mon May 21 09:48:37 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 21 May 2007 09:48:37 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id allocation
In-Reply-To:
References: <20070521120633.GJ20400@mellanox.co.il> <20070521160654.GD31097@mellanox.co.il>
Message-ID: <4651CD65.7070303@ichips.intel.com>

> Yes, that's basically what I just proposed (although see below). It
> all looks pretty safe to me... Sean, what do you think about this for
> 2.6.22?
>
> > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> > index eff591d..5e77b01 100644
> > --- a/drivers/infiniband/core/cm.c
> > +++ b/drivers/infiniband/core/cm.c
> > @@ -306,7 +306,9 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv)
> >  	do {
> >  		spin_lock_irqsave(&cm.lock, flags);
> >  		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
> > -					next_id++, &id);
> > +					next_id, &id);
> > +		if (!ret)
> > +			next_id = id == 0x7ffffff ?
0 : id + 1; > > ...except I used MAX_INT here, and indeed your patch only has 6 'f's > in that constant. Actually digging a little I see that we should use > MAX_ID_MASK to be really correct. Looks good. Thanks Acked by: Sean Hefty From halr at voltaire.com Mon May 21 10:26:33 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 21 May 2007 13:26:33 -0400 Subject: [ofa-general] [RFC] [PATCH 0/3] 2.6.23: basic support for IB routers In-Reply-To: <000401c79999$dad2e9d0$b4d4180a@amr.corp.intel.com> References: <000401c79999$dad2e9d0$b4d4180a@amr.corp.intel.com> Message-ID: <1179768393.15940.8104.camel@hal.voltaire.com> On Fri, 2007-05-18 at 18:14, Sean Hefty wrote: > Re-sending - typo in mailing list name... > > I'd like to get feedback about incorporating the following changes to support > IB routers into 2.6.23. The goal of the patches is to allow for IB router > development and prototyping within the current framework of IBA. The changes > themselves are fairly minimal, but based on the following concepts: > > * Routing data is maintained by the local SA. No assumption is made regarding > how the SA obtains routing information. The SA is only expected to respond > to cross subnet PR queries by providing a path to the local router. This > matches the behavior in opensm. > > * A ULP connecting to a remote subnet provides path information about both > subnets. For now the implementation simply assumes that the properties of > the remote path match that of the local path. This allows the active side > CM to properly format the CM REQ. > > * If the SLID/DLID values in the CM REQ are set to the permissive LID, then > the passive side CM uses the SLID/DLID/SL values from the received CM REQ > LRH to configure the passive side QP. This is done to meet C9-54 without > requiring communication with the remote SA, but I should note that this > behavior is non-compliant. 
Should there be some conditionalization of any non IBA compliant code so it is only turned on if someone really wants this ? I presume this is to be replaced by the real code some time in the future once the IBA spec for these router issues is decided. -- Hal > These changes were tested by establishing a connection and transferring data > between two IB subnets connected by an Obsidian router. > > These patches are also available in the ib_router branch of my rdma-dev.git > tree. The tree is based on 2.6.21, so include a couple of additional patches > that were already pushed for 2.6.22. > > Signed-off-by: Sean Hefty > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Mon May 21 10:50:05 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 21 May 2007 10:50:05 -0700 Subject: [ofa-general] [RFC] [PATCH 0/3] 2.6.23: basic support for IBrouters In-Reply-To: <1179768393.15940.8104.camel@hal.voltaire.com> Message-ID: <000201c79bd0$7751f070$14d0180a@amr.corp.intel.com> >Should there be some conditionalization of any non IBA compliant code so >it is only turned on if someone really wants this ? That is doable. AFIK, the only non-compliance is using the permissive LIDs in the CM REQ. I don't believe this causes any interoperability issues with fully compliant code. It should just result in a rejected connection request. >I presume this is to be replaced by the real code some time in the >future once the IBA spec for these router issues is decided. Yes. The changes to the RDMA CM and passive side IB CM will likely need to be replaced (patches 2 & 3). The active side IB CM changes (patch 1) may be okay unless wording to the 1.2 spec changes. 
(Patch 1 could also be used to support non-reversible paths at some point, but more work may be needed.) - Sean From halr at voltaire.com Mon May 21 10:52:11 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 21 May 2007 13:52:11 -0400 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com> <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com> <464B5C07.8040601@ichips.intel.com> <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com> <1179398534.23882.67542.camel@hal.voltaire.com> <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com> <1179483657.23882.158398.camel@hal.voltaire.com> <309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com> Message-ID: <1179769930.15940.9823.camel@hal.voltaire.com> On Mon, 2007-05-21 at 01:58, Devesh Sharma wrote: > On 18 May 2007 06:21:05 -0400, Hal Rosenstock wrote: > > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote: > > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock wrote: > > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote: > > > > > On 5/17/07, Sean Hefty wrote: > > > > > > > But initially this will generate a packet for each path, while sys > > > > > > > admin knows that path is there and he can hard-code the entries for > > > > > > > it. Other thing is that why Admin will care about creating such record > > > > > > > while SA is itself taking care, right? > > > > > > > > > > > > In your original message you asked about adding 'dummy entries' to the > > > > > > cache. I agree that pre-loading the cache can be useful. What I still > > > > > > am not understanding is the reasoning for adding 'dummy entries'. By > > > > > > 'dummy entries', I've been assuming that these are invalid path records, > > > > > > but maybe that's not what you meant. 
> > > > > Ok if "dummy entries" word as such has created confusion then I am > > > > > sorry for that, But with that I mean that, those are valid path > > > > > records which Administrator knows in advance and while loading the > > > > > module, > > > > > > > > How does the admin know they are valid ? > > > Depending on the initial application runs, some trusted PRs can be generated. > > > > What do initial application runs have to do with this ? > My understanding is that, once the cluster is UP, and if between Node > A and Node B there is only one path, So this is a feature for such one path subnets. I wonder what percentage of deployed subnets fits this case. > then, SA query always going to return same values in PR. If subnet topology is changed, these PRs might change. There are other cases where they change too. > On this basis Initial application runs will generate PRs, That's what confused me before (Applications don't generate PRs but rather request them.) but I think I see what you mean now. > these PRs can be saved in some file, and can be loaded > when cache_module comes in. > > > > > >Are they somehow preconfigured at the SM ? > > > I am not sure about SM has any such provision? > > > > Not that I'm aware of. > Ok, So, currently no such support is there in SM? I can speak definitively for OpenSM and there is no such support. As to the vendor SMs, I don't think so but don't know for absolute certainty. Someone can correct me if I'm wrong but I wouldn't assume no response means correctness as some may not be listening nor want to respond as to "value added" vendor specific features. > > > Also not sure about the > > > role of SM in path resolving. I mean once node has initiated SA query, > > > whether SM has some database to reply SA or On the fly destination > > > node is contacted to get asked path recored? 
> > > > SMs can either calculate the SA PRs on the fly based on the routing > > algorithm in use and some other things or put them in a local database. > > This is up to that SM. > Ok > > > > Destination node is not contacted in the SA PR query process. > > > > > >Doesn't each SM have its own policy for generating valid PRs ? > > > Ultimately path record is in Path_Record object format, and SA cache > > > is going to store in a fixed manner, How generation policy matters? > > > > What if the local policy loaded does not agree with what the SM would > > generate for a particular PR ? One then gets a local error which will > > need to be tracked down. Not so easy IMO. > SM policies in a subnet to generate PRs, changes dynamically? at run time? The policy doesn't change dynamically but the data to be returned in the SA PR response might. > if Not then depending on the local SM policy static PR can be > generated to load initially. Just as one question related to this, how would link failures be handled ? There are others. > > > CMIIW. Also I am assuming a homogeneous cluster where certain > > > parameters can be assumed to be same always. > > > > and always in agreement with what the SM would return ? For example, > yes > > what happens when a link goes down and the end node is no longer > > reachable ? > If node is not reachable then, after first timeout of sa_cache, that > entry will be removed from cache. OK; that's another aspect to add into this feature. I don't think that is currently done. I think there would need to be an API added to do this. -- Hal > > > >are these from a live SM and just loaded "out of band" to > > > bypass/preclude the SA PR >mechanism ? > > > may be > > > > Even if they are, there is still the changes in the subnet issue. > > > > -- Hal > > > > > > -- Hal > > > > > > > > > Admin is loading this info in the cache with user command. 
> > > > > > > > > > > > > Another point I want to know is, > > > > > > > When local_sa_cache module will be inserted? After SM comes up or > > > > > > > Before SM comes up? > > > > > > > > > > > > It can occur either way. There is no restriction. The cache responds > > > > > > to port up and GID in/out of service events to update itself. > > > > > Do you mean cache module will start building cache only after Port is UP? > > > > > > > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on > > > > > > > some node not on switch) then First Forced schedule_update() is > > > > > > > waisted, and for the first application presence of cache is > > > > > > > meaningless. Why not to keep cache effective right from the start? > > > > > > > > > > > > Pre-loading the cache with path records doesn't guarantee that those > > > > > > paths are usable. If the SM has not come up, then the path records will > > > > > > be unusable until the SM configures the subnet, plus there's no > > > > > > guarantee that the remote endpoints specified by the paths are running. > > > > > You mean there is no guarantee that even if SM is UP and we have some > > > > > hard coded entries of path record corresponding to some node X, we are > > > > > not sure that node X has actually come up or not? In that case > > > > > actually that path resolving should fail if node has not come up, but > > > > > with the hard coding still path will be resolved? > > > > > > > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms > > > > > > when booting a large cluster. > > > > > that's true. Also cache will get valid entries only if network is > > > > > configured by SM otherwise every node SA will, possibly, drop SA > > > > > packets. 
> > > > > > > > > > > > - Sean > > > > > > > > > > > _______________________________________________ > > > > > general mailing list > > > > > general at lists.openfabrics.org > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > From halr at voltaire.com Mon May 21 11:00:11 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 21 May 2007 14:00:11 -0400 Subject: [ofa-general] Re: [PATCH] osm: fixing coredump in drop manager In-Reply-To: <46518857.2060308@dev.mellanox.co.il> References: <46518857.2060308@dev.mellanox.co.il> Message-ID: <1179770406.15940.10338.camel@hal.voltaire.com> Hi Yevgeny, On Mon, 2007-05-21 at 07:53, Yevgeny Kliteynik wrote: > Hi Hal. > > This patch fixes a coredump in a drop manager when trying to clear > unititialized physical ports. > It happens only in master (the code in this area is a bit different in ofed_1_2). > > Please apply to master. > Thanks. > > Signed-off-by: Yevgeny Kliteynik > --- Thanks. Applied (to master only). -- Hal From halr at voltaire.com Mon May 21 11:17:13 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 21 May 2007 14:17:13 -0400 Subject: [ofa-general] Re: [PATCHv2] osm: up/dn optimization - improved ranking In-Reply-To: <46503064.7010107@dev.mellanox.co.il> References: <46503064.7010107@dev.mellanox.co.il> Message-ID: <1179771431.15940.11448.camel@hal.voltaire.com> Hi Yevgeny, On Sun, 2007-05-20 at 07:26, Yevgeny Kliteynik wrote: > Hi Hal, > > This patch optimizes fabric ranking similar to the fat-tree ranking. > All the root switches are marked with rank and added to the BFS list, > and only then ranking of rest of the fabric begins. > This version of the patch is updated in accordance with Sasha's suggestions. > > Please apply to master. > > Signed-off-by: Yevgeny Kliteynik > --- Nice work. Thanks! Applied (to master only). 
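
For readers following the thread, the ranking order described in the quoted patch summary (mark every root switch with its rank, add all of them to the BFS list, and only then rank the rest of the fabric) can be sketched as a multi-source breadth-first search. This is an illustration only, not the OpenSM code: the adjacency-matrix layout and the names MAX_SW, UNRANKED and rank_fabric() are assumptions made up for this example.

```c
#include <assert.h>

#define MAX_SW   16    /* hypothetical upper bound on switches */
#define UNRANKED (-1)  /* marks a switch no root can reach */

/*
 * Rank each switch by its hop distance to the nearest root switch.
 * adj[i][j] != 0 means switches i and j are directly linked
 * (an assumed layout, chosen only to keep the sketch short).
 */
static void rank_fabric(int n, unsigned char adj[MAX_SW][MAX_SW],
                        const int *roots, int n_roots, int rank[MAX_SW])
{
    int queue[MAX_SW];   /* each switch is queued at most once */
    int head = 0, tail = 0;
    int i;

    assert(n >= 0 && n <= MAX_SW);

    for (i = 0; i < n; i++)
        rank[i] = UNRANKED;

    /* Step 1: mark all roots with rank 0 and add them to the BFS list. */
    for (i = 0; i < n_roots; i++) {
        int r = roots[i];

        if (r < 0 || r >= n || rank[r] != UNRANKED)
            continue;    /* skip invalid or duplicate roots */
        rank[r] = 0;
        queue[tail++] = r;
    }

    /* Step 2: only now rank the rest of the fabric, level by level. */
    while (head < tail) {
        int cur = queue[head++];

        for (i = 0; i < n; i++) {
            if (adj[cur][i] && rank[i] == UNRANKED) {
                rank[i] = rank[cur] + 1;
                queue[tail++] = i;
            }
        }
    }
}
```

Because every root is queued before any neighbor is visited, a single pass leaves each reachable switch with its distance to the nearest root, instead of a distance that depends on which root happened to be processed first.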
-- Hal From jlentini at netapp.com Mon May 21 11:50:32 2007 From: jlentini at netapp.com (James Lentini) Date: Mon, 21 May 2007 14:50:32 -0400 (EDT) Subject: [ofa-general] Re: [IPoIB][RFC] remove redundant gid query In-Reply-To: References: Message-ID: On Thu, 17 May 2007, Roland Dreier wrote: > > Both ipoib_add_port() and ipoib_mcast_join_task() query the GID at > > index 0 to setup the ipoib_dev_priv structure's local_gid and the > > net_device structure's dev_addr. There does not appear to be a way for > > ipoib_mcast_join_task() to be executed before ipoib_add_port() > > completes. Therefore, the work done in ipoib_mcast_join_task() appears > > to be redundant. > > It does look like we're doing some work we don't need to do. However > ipoib_add_port() could run before an SM has brought up the local port, The same could be true for ipoib_mcast_join_task() These are both instances of the general problem that if the GID at index 0 changes, the IPoIB code is not automatically notified. Agree? > so the GID prefix might change later. > > I'm not sure what the best way to clean this up is. As an aside: Why does ipoib_add_port() treat an error return from ib_query_gid() as fatal while ipoib_mcast_join_task() only emits a warning? james From Koen.SEGERS at VRT.BE Mon May 21 11:50:31 2007 From: Koen.SEGERS at VRT.BE (SEGERS Koen) Date: Mon, 21 May 2007 20:50:31 +0200 Subject: [ofa-general] GPFS node loses IB-connection References: Message-ID: The same as in dmesg. The output for the failing node: May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901: Reason code 668 Failure Reason Lost membership in cluster enterprise.universe. Unmounting file systems. 
May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901:
May 18 13:03:36 gpfswhbe1s1 kernel: GPFS Deadman Switch timer [0] has expired; IOs in progress: 0
May 18 13:04:11 gpfswhbe1s1 kernel: Badness in do_exit at kernel/exit.c:807
May 18 13:04:11 gpfswhbe1s1 kernel:
May 18 13:04:11 gpfswhbe1s1 kernel: Call Trace: {do_exit+80} {sys_exit_group+0}
May 18 13:04:11 gpfswhbe1s1 kernel: {system_call+126}
May 18 13:04:11 gpfswhbe1s1 kernel: Badness in do_exit at kernel/exit.c:807
May 18 13:04:11 gpfswhbe1s1 kernel:
May 18 13:04:11 gpfswhbe1s1 kernel: Call Trace: {do_exit+80} {sys_exit_group+0}
May 18 13:04:11 gpfswhbe1s1 kernel: {system_call+126}
May 18 13:18:57 gpfswhbe1s1 sshd[15090]: Accepted publickey for root from 192.168.1.1 port 52281 ssh2
May 18 13:25:12 gpfswhbe1s1 syslog-ng[3705]: STATS: dropped 0

Today we also ran some tests with iperf over SDP. The tests worked fine as long as we didn't use the parallel option (-P). This option starts multiple client threads that connect to the server. As soon as we started the command with it, the interface died. I found that very strange. Has anyone else seen this problem? Is it still a problem in RC3?

Tomorrow we will run more tests to pinpoint the problem further. We will also build RPMs for RC3.

Hopefully this helps.

Regards,
Koen

________________________________

From: Shirley Ma [mailto:xma at us.ibm.com]
Sent: Mon 21/05/2007 17:41
To: SEGERS Koen
CC: general at lists.openfabrics.org; general-bounces at lists.openfabrics.org; Tziporet Koren
Subject: RE: [ofa-general] GPFS node loses IB-connection

Hello,

What's the output of /var/log/messages when you hit this problem?

Shirley Ma

From mshefty at ichips.intel.com Mon May 21 12:37:34 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 21 May 2007 12:37:34 -0700
Subject: [ofa-general] IB/cm: bug in stale connection detection logic?
In-Reply-To: <20070520134441.GI20649@mellanox.co.il>
References: <20070520134441.GI20649@mellanox.co.il>
Message-ID: <4651F4FE.3090307@ichips.intel.com>

> Why is an extra call to cm_get_id required to detect a duplicate?
> Shouldn't we just
>
>	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
>	if (timewait_info) {
>		/* handle duplicate */
>		return;
>	}
>
>	timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
>	if (timewait_info) {
>		/* handle stale */
>		return;
>	}
>
>	not a duplicate and not a stale connection

After looking at this more, I think we want something structured closer to what's listed above, with the duplicate handling enhanced to check that the QPN in the potential duplicate REQ matches what's already associated with the remote ID.

Did you run into an actual problem with the current code? It seems like the only issue is that a possible stale request would time out, rather than be immediately rejected. If so, I will queue up a patch for 2.6.23.

- Sean

From rdreier at cisco.com Mon May 21 13:29:20 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 13:29:20 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipoib: fix error message
In-Reply-To: <20070518131254.GJ4708@mellanox.co.il> (Michael S.
Tsirkin's message of "Fri, 18 May 2007 16:12:54 +0300") References: <20070518131254.GJ4708@mellanox.co.il> Message-ID: thanks applied From rdreier at cisco.com Mon May 21 13:30:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 21 May 2007 13:30:39 -0700 Subject: [ofa-general] Re: [PATCH 1/2] libmlx4: pass more data from user to kernel In-Reply-To: <1179387187.25749.61.camel@mtls03> (Eli Cohen's message of "Thu, 17 May 2007 10:32:37 +0300") References: <1179387187.25749.61.camel@mtls03> Message-ID: thanks, I applied a new version of this with my changes to the ABI, and also I added code to libmlx4 so it calculates max_inline_data etc correctly. - R. From rdreier at cisco.com Mon May 21 13:35:57 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 21 May 2007 13:35:57 -0700 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak In-Reply-To: <20070521120459.GI20400@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 21 May 2007 15:04:59 +0300") References: <20070521120459.GI20400@mellanox.co.il> Message-ID: OK, I crossed my fingers and merged this for 2.6.22 From mst at dev.mellanox.co.il Mon May 21 13:40:41 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 21 May 2007 23:40:41 +0300 Subject: [ofa-general] IB/cm: bug in stale connection detection logic? In-Reply-To: <4651F4FE.3090307@ichips.intel.com> References: <20070520134441.GI20649@mellanox.co.il> <4651F4FE.3090307@ichips.intel.com> Message-ID: <20070521204041.GG31097@mellanox.co.il> > Quoting Sean Hefty : > Subject: Re: [ofa-general] IB/cm: bug in stale connection detection logic? > > >Why is an extra call to cm_get_id required to detect a duplicate? 
> >Shouldn't we just > > > > timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info); > > if (timewait_info) { > > /* handle duplicate */ > > return; > > } > > > > timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info); > > if (timewait_info) { > > /* handle stale */ > > return; > > } > > > > not a duplicate and not a stale connection > > After looking at this more, I think we want something structured closer > to what's listed above, with the duplicate handling enhanced to check > that the QPN in the potential duplicate REQ matches what's already > associated with the remote ID. Yes, that's what I thought too. > Did you hit into an actual problem with the current code? It seems like > the only issue is that a possible stale request would timeout, rather > then be immediately rejected. If so, I will queue up a patch for 2.6.23. Exactly. This is a serious problem for IPoIB CM since packet drop rates and recovery times go up radically: sockets get closed, etc. With a reject we would just retry connecting on the next packet. Could you please post a patch? Let's discuss whether it's appropriate for 2.6.22 separately. -- MST From rdreier at cisco.com Mon May 21 13:45:12 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 21 May 2007 13:45:12 -0700 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id allocation In-Reply-To: <20070521162336.GF31097@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 21 May 2007 19:23:37 +0300") References: <20070521120633.GJ20400@mellanox.co.il> <20070521160654.GD31097@mellanox.co.il> <20070521162336.GF31097@mellanox.co.il> Message-ID: > And since it's a *mask*, we can do it this way if you like: > > > > + if (!ret) > > > + next_id = ((unsigned)id + 1) & MAX_ID_MASK; > > which might generate a bit less code. Good point. In fact it is 8 bytes smaller for x86-64 at least, so this is what I just merged: commit 9f81036c54ed1f860d2807c5a6aa4f2b30c21204 Author: Michael S. 
Tsirkin
Date: Mon May 21 19:06:54 2007 +0300

    IB/cm: Improve local id allocation

    The IB CM uses an idr for local id allocations, with a running
    counter as start_id. This fails to generate distinct ids if

    1. An id is constantly created and destroyed
    2. A chunk of ids just beyond the current next_id value is occupied

    This in turn leads to an increased chance of connection request being
    mis-detected as a duplicate, sometimes for several retries, until
    next_id gets past the block of allocated ids. This has been observed
    in practice.

    As a fix, remember the last id allocated and start immediately above
    it. This also fixes a problem with the old code, where next_id might
    overflow and become negative.

    Signed-off-by: Michael S. Tsirkin
    Acked-by: Sean Hefty
    Signed-off-by: Roland Dreier

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index eff591d..e840434 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -306,7 +306,9 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 	do {
 		spin_lock_irqsave(&cm.lock, flags);
 		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
-					next_id++, &id);
+					next_id, &id);
+		if (!ret)
+			next_id = ((unsigned) id + 1) & MAX_ID_MASK;
 		spin_unlock_irqrestore(&cm.lock, flags);
 	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );

From rdreier at cisco.com Mon May 21 13:48:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 21 May 2007 13:48:12 -0700
Subject: [ofa-general] Re: [IPoIB][RFC] remove redundant gid query
In-Reply-To: (James Lentini's message of "Mon, 21 May 2007 14:50:32 -0400 (EDT)")
References:
Message-ID:

> > It does look like we're doing some work we don't need to do. However
> > ipoib_add_port() could run before an SM has brought up the local port,
>
> The same could be true for ipoib_mcast_join_task()
>
> These are both instances of the general problem that if the GID at
> index 0 changes, the IPoIB code is not automatically notified. Agree?

Yes, although what is there now should be semi-OK: a multicast join can't succeed until the port is up, so ipoib should eventually get the right GID. And I would argue that an SM that changes a port's GID prefix without at least generating a client reregister event is broken. > > so the GID prefix might change later. > > > > I'm not sure what the best way to clean this up is. > > As an aside: Why does ipoib_add_port() treat an error return from > ib_query_gid() as fatal while ipoib_mcast_join_task() only emits a > warning? I guess because it's much easier to bail out of ipoib_add_port() than it is to do something intelligent in ipoib_mcast_join_task(). From rdreier at cisco.com Mon May 21 13:51:43 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 21 May 2007 13:51:43 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get the following post 2.6.22-rc2 fixes. (This batch is bigger than I would like, but I think it's all legitimately post-rc2 material: we've had some fixes for fairly serious problems cooking for a while, and those fixes involve largish patches. The rest is either trivial stuff or fixes for the just-merged mlx4 driver) Ali Ayoub (1): IB/mthca: Fix use-after-free on device restart Eli Cohen (3): IB/core: Free umem when mm is already gone IB/mlx4: Fix check of max_qp_dest_rdma in modify QP IB/mlx4: Pass send queue sizes from userspace to kernel Hoang-Nam Nguyen (1): IB/ehca: Return proper error code if register_mr fails Michael S. 
Tsirkin (5): IB/mthca: Fix RESET to ERROR transition IB/mlx4: Fix RESET to RESET and RESET to ERROR transitions IB/ipoib: Fix typos in error messages IPoIB/cm: Fix SRQ WR leak IB/cm: Improve local id allocation Roland Dreier (6): IB/ipath: Fix potential deadlock with multicast spinlocks IB/core: Use start_port() and end_port() IB/mlx4: Set GRH:HopLimit when sending globally routed MADs mlx4_core: Fix array overrun in dump_dev_cap_flags() IB/mlx4: Fix check of opcode in mlx4_ib_post_send() IB/mlx4: Check if SRQ is full when posting receive Rolf Manderscheid (1): IB/mthca: Set GRH:HopLimit when building MLX headers Yosef Etigin (2): IB/core: Add helpers for uncached GID and P_Key searches IPoIB: Handle P_Key table reordering drivers/infiniband/core/cm.c | 4 +- drivers/infiniband/core/device.c | 135 ++++++++++++++- drivers/infiniband/core/umem.c | 4 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 7 +- drivers/infiniband/hw/ipath/ipath_verbs_mcast.c | 16 +- drivers/infiniband/hw/mlx4/qp.c | 181 ++++++++++++++------ drivers/infiniband/hw/mlx4/srq.c | 6 + drivers/infiniband/hw/mlx4/user.h | 5 +- drivers/infiniband/hw/mthca/mthca_av.c | 1 + drivers/infiniband/hw/mthca/mthca_main.c | 4 +- drivers/infiniband/hw/mthca/mthca_qp.c | 158 +++++++++++------- drivers/infiniband/ulp/ipoib/ipoib.h | 49 +++++- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 203 +++++++++++++++++++---- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 87 ++++++++-- drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 2 +- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 40 ++--- drivers/net/mlx4/fw.c | 2 +- include/rdma/ib_verbs.h | 8 + 19 files changed, 697 insertions(+), 222 deletions(-) From mshefty at ichips.intel.com Mon May 21 14:20:29 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 21 May 2007 14:20:29 -0700 Subject: [ofa-general] IB/cm: bug in stale connection detection logic? 
In-Reply-To: <20070521204041.GG31097@mellanox.co.il>
References: <20070520134441.GI20649@mellanox.co.il> <4651F4FE.3090307@ichips.intel.com> <20070521204041.GG31097@mellanox.co.il>
Message-ID: <46520D1D.3000001@ichips.intel.com>

> Could you please post a patch? Let's discuss whether it's appropriate
> for 2.6.22 separately.

I mentioned 2.6.23 because it affects when I have to generate the patch. :) I will try to get to this tomorrow then.

- Sean

From venkatesh.babu at 3leafnetworks.com Mon May 21 15:00:31 2007
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Mon, 21 May 2007 15:00:31 -0700
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master
Message-ID: <4652167F.9040709@3leafnetworks.com>

Configuration:
- 4 nodes in the IB network, with two of the nodes running OpenSM.
- Each node has a Mellanox MT25218 InfiniHostEx with two IB ports and firmware version 5.2.0
- Every node's IB port 1 is connected to IB Switch1, say subnet1
- Every node's IB port 2 is connected to IB Switch2, say subnet2
- vortex3l-83 runs two opensm instances, one per subnet, with priority 0
- vortex3l-84 runs two opensm instances, one per subnet, with priority 13

Problem:

The opensm instances on both machines are in the Standby state and neither claims mastership, even though they have different priorities (0 and 13). Most of the time this configuration works fine, but occasionally it gets into this state. The problem is hard to reproduce.

I tried to force mastership of the opensm, but it didn't work.
[root at vortex3l-83 ~]# sminfo -s3 sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5043 priority 0 state 2 SMINFO_STANDBY After couple of minutes [root at vortex3l-83 ~]# sminfo sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5938 priority 0 state 3 SMINFO_MASTER Data: [root at vortex3l-83 ~]# sminfo sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4937 priority 0 state 2 SMINFO_STANDBY [root at vortex3l-83 ~]# ps -aux | grep open Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ root 5237 0.0 0.0 92848 1692 ? Sl 00:39 0:00 /usr/bin/opensm -g 0x005045014a3a0001 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm1.log root 5250 0.0 0.0 92848 1700 ? Sl 00:39 0:00 /usr/bin/opensm -g 0x005045014a3a0002 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm2.log root 8356 0.0 0.0 51064 708 pts/0 S+ 13:40 0:00 grep open [root at vortex3l-84 ~]# sminfo sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4939 priority 0 state 2 SMINFO_STANDBY [root at vortex3l-84 ~]# ps -aux | grep open Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ root 5871 0.0 0.0 92848 1560 ? Sl 00:40 0:00 /usr/bin/opensm -g 0x005045014a2e0001 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm1.log root 5884 0.0 0.0 92848 1568 ? Sl 00:40 0:00 /usr/bin/opensm -g 0x005045014a2e0002 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm2.log root 8845 0.0 0.0 51084 668 pts/0 S+ 13:40 0:00 grep open But ibv_devinfo on vortex3l-83 shows that both ports are active and sm_lid and lid are same, indicating it is master. Looks like it is the stale information. 
[root at vortex3l-83 ~]# ibv_devinfo hca_id: mthca0 fw_ver: 5.2.0 node_guid: 0050:4501:4a3a:0000 sys_image_guid: 0050:4501:4a3a:0003 vendor_id: 0x02c9 vendor_part_id: 25218 hw_ver: 0xA0 board_id: ARM0020000001 phys_port_cnt: 2 port: 1 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 6 port_lid: 6 port_lmc: 0x00 port: 2 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 1 port_lid: 1 port_lmc: 0x00 [root at vortex3l-84 ~]# ibv_devinfo hca_id: mthca0 fw_ver: 5.2.0 node_guid: 0050:4501:4a2e:0000 sys_image_guid: 0050:4501:4a2e:0003 vendor_id: 0x02c9 vendor_part_id: 25218 hw_ver: 0xA0 board_id: ARM0020000001 phys_port_cnt: 2 port: 1 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 6 port_lid: 7 port_lmc: 0x00 port: 2 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 1 port_lid: 2 port_lmc: 0x00 And also in /var/log/opensm[1/2].log I see the following error messages - May 21 00:40:28 250119 [95A9A160] -> OpenSM Rev:openib-2.0.5 OpenIB svn 4954M May 21 00:40:28 484648 [95A9A160] -> osm_vendor_bind: Binding to port 0x5045014a2e0001 May 21 00:40:28 487418 [95A9A160] -> osm_vendor_bind: Binding to port 0x5045014a2e0001 May 21 00:40:29 292689 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x100000125f) -- dropping May 21 00:40:29 292728 [45007960] -> umad_receiver: ERR 5411: DR SMP May 21 00:40:29 292741 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) May 21 00:40:29 292818 [45007960] -> SMP dump: I found that for both ports on both vortex boxes I see the port_xmit_discards counter was 1. Other error counters seems to be zero. Looks like some packets has been transmitted and received on both machines. 
[root at vortex3l-83 ~]# cat /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards 1 [root at vortex3l-83 ~]# cat /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards 1 [root at vortex3l-84 ~]# cat /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards 1 [root at vortex3l-84 ~]# cat /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards 1 [root at sqasmd ~]# cat /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards 1 Link speed seems to be set to 10 Gb/sec (4X) on all machines. I have the opensm logs and gdb output for all the opensms. If you want I can send it to you. Just attaching one sample gdb output with stack traces of all threads. [root at vortex3l-83 ~]# gdb /usr/bin/opensm 5237 GNU gdb Red Hat Linux (6.3.0.0-1.63rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu"... (no debugging symbols found) Using host libthread_db library "/lib64/tls/libthread_db.so.1". Attaching to program: /usr/bin/opensm, process 5237 Reading symbols from /usr/lib/libopensm.so.1...done. Loaded symbols for /usr/lib/libopensm.so.1 Reading symbols from /usr/lib/libosmcomp.so.1...done. Loaded symbols for /usr/lib/libosmcomp.so.1 Reading symbols from /lib64/tls/libpthread.so.0...done. 
[Thread debugging using libthread_db enabled] [New Thread 182904123744 (LWP 5237)] [New Thread 1157658976 (LWP 5267)] [New Thread 1147169120 (LWP 5266)] [New Thread 1136679264 (LWP 5265)] [New Thread 1126189408 (LWP 5264)] [New Thread 1115699552 (LWP 5263)] [New Thread 1105209696 (LWP 5262)] [New Thread 1094719840 (LWP 5261)] [New Thread 1084229984 (LWP 5253)] Loaded symbols for /lib64/tls/libpthread.so.0 Reading symbols from /usr/lib/libosmvendor.so.2...done. Loaded symbols for /usr/lib/libosmvendor.so.2 Reading symbols from /usr/lib/libibcommon.so.1...done. Loaded symbols for /usr/lib/libibcommon.so.1 Reading symbols from /usr/lib/libibumad.so.1...done. Loaded symbols for /usr/lib/libibumad.so.1 Reading symbols from /lib64/tls/libc.so.6...done. Loaded symbols for /lib64/tls/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 (gdb) info threads 9 Thread 1084229984 (LWP 5253) 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 8 Thread 1094719840 (LWP 5261) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 7 Thread 1105209696 (LWP 5262) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 6 Thread 1115699552 (LWP 5263) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 5 Thread 1126189408 (LWP 5264) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 4 Thread 1136679264 (LWP 5265) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 3 Thread 1147169120 (LWP 5266) 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 2 Thread 1157658976 (LWP 5267) 0x0000002a95d7fd22 in poll () from /lib64/tls/libc.so.6 1 Thread 182904123744 (LWP 5237) 0x0000002a95d51d65 in __nanosleep_nocancel 
() from /lib64/tls/libc.so.6 (gdb) thread 1 [Switching to thread 1 (Thread 182904123744 (LWP 5237))]#0 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 (gdb) bt #0 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 #1 0x0000002a95d82368 in usleep () from /lib64/tls/libc.so.6 #2 0x0000002a9578a05e in cl_thread_suspend (pause_ms=10000) at cl_thread.c:125 #3 0x0000000000405ba1 in main () (gdb) thread 2 [Switching to thread 2 (Thread 1157658976 (LWP 5267))]#0 0x0000002a95d7fd22 in poll () from /lib64/tls/libc.so.6 (gdb) bt #0 0x0000002a95d7fd22 in poll () from /lib64/tls/libc.so.6 #1 0x0000002a95bbb94d in dev_poll (fd=Variable "fd" is not available. ) at src/umad.c:775 #2 0x0000002a95bbba6d in umad_recv (portid=Variable "portid" is not available. ) at src/umad.c:805 #3 0x0000002a959ae68b in umad_receiver (p_ptr=0x5c3000) at osm_vendor_ibumad.c:266 #4 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x5c3070) at cl_thread.c:61 #5 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 #6 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 #7 0x0000000000000000 in ?? () (gdb) thread 3 [Switching to thread 3 (Thread 1147169120 (LWP 5266))]#0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a95783b4b in cl_event_wait_on (p_event=0x5887b8, wait_us=10000000, interruptible=1) at cl_event.c:181 #2 0x000000000043630c in __osm_sm_sweeper () #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x588898) at cl_thread.c:61 #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? 
() (gdb) thread 4 [Switching to thread 4 (Thread 1136679264 (LWP 5265))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a278, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x000000000044d7a1 in __osm_vl15_poller () #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58a2e8) at cl_thread.c:61 #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? () (gdb) thread 5 [Switching to thread 5 (Thread 1126189408 (LWP 5264))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488) at cl_threadpool.c:71 #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x5900e0) at cl_thread.c:61 #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? 
() (gdb) thread 6 [Switching to thread 6 (Thread 1115699552 (LWP 5263))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488) at cl_threadpool.c:71 #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x590010) at cl_thread.c:61 #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? () (gdb) thread 7 [Switching to thread 7 (Thread 1105209696 (LWP 5262))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488) at cl_threadpool.c:71 #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58ff40) at cl_thread.c:61 #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? 
() (gdb) thread 8 [Switching to thread 8 (Thread 1094719840 (LWP 5261))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488) at cl_threadpool.c:71 #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58b760) at cl_thread.c:61 #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? () (gdb) thread 9 [Switching to thread 9 (Thread 1084229984 (LWP 5253))]#0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9578a9dd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168 #2 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 #3 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 #4 0x0000000000000000 in ?? () (gdb) thread 10 Thread ID 10 not known. (gdb) Thread ID 10 not known. (gdb) q The program is running. Quit anyway (and detach it)? (y or n) y Detaching from program: /usr/bin/opensm, process 5237 [root at vortex3l-83 ~]# From halr at voltaire.com Mon May 21 15:16:38 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 21 May 2007 18:16:38 -0400 Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master In-Reply-To: <4652167F.9040709@3leafnetworks.com> References: <4652167F.9040709@3leafnetworks.com> Message-ID: <1179785796.15940.27092.camel@hal.voltaire.com> On Mon, 2007-05-21 at 18:00, Venkatesh Babu wrote: > Configuration: > - 4 nodes in the IB network with two nodes running OpenSMs. 
> - Each node has MT25218 InfiniHostEx Mellanox with two IB ports and > with firmware version 5.2.0 > - All node's IB port 1 is connected to IB Switch1, say subnet1 > - All node's IB port 2 is connected to IB Switch2, say subnet2 So there is no link between the 2 switches, right ? > - vortex3l-83 has two opensm's for each subnet with priority 0 > - vortex3l-84 has two opensm's for each subnet with priority 13 > > Problem: > > The problem is opensm's on both the machines are in Standy state and none of them are > claiming the mastership, though they have different priorities 0 and 13. Most of the times > this configuration works fine, but ocassionally it is getting into this problem. It is hard > to reproduce this problem. Is there anything being done ? Cables pulled and reinserted ? Is anything changing or is this a "stable" configuration in terms of the topology ? Is this the only thing going on on the subnet ? > I tried to set the mastership of the opensm but it didn't worked. > [root at vortex3l-83 ~]# sminfo -s3 > sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5043 priority 0 > state 2 SMINFO_STANDBY > > After couple of minutes > [root at vortex3l-83 ~]# sminfo > sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5938 priority 0 state 3 SMINFO_MASTER So it did finally become master ? I take it LID 6 is local (vortex31-83). > Data: > > [root at vortex3l-83 ~]# sminfo > sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4937 priority 0 state > 2 SMINFO_STANDBY > [root at vortex3l-83 ~]# ps -aux | grep open > Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ > root 5237 0.0 0.0 92848 1692 ? Sl 00:39 0:00 /usr/bin/opensm > -g 0x005045014a3a0001 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm1.log > root 5250 0.0 0.0 92848 1700 ? 
Sl 00:39 0:00 /usr/bin/opensm > -g 0x005045014a3a0002 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm2.log > root 8356 0.0 0.0 51064 708 pts/0 S+ 13:40 0:00 grep open > > > [root at vortex3l-84 ~]# sminfo > sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4939 priority 0 state > 2 SMINFO_STANDBY > [root at vortex3l-84 ~]# ps -aux | grep open > Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ > root 5871 0.0 0.0 92848 1560 ? Sl 00:40 0:00 /usr/bin/opensm > -g 0x005045014a2e0001 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm1.log > root 5884 0.0 0.0 92848 1568 ? Sl 00:40 0:00 /usr/bin/opensm > -g 0x005045014a2e0002 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm2.log > root 8845 0.0 0.0 51084 668 pts/0 S+ 13:40 0:00 grep open > > > But ibv_devinfo on vortex3l-83 shows that both ports are active and sm_lid and > lid are same, indicating it is master. Looks like it is the stale information. > > [root at vortex3l-83 ~]# ibv_devinfo > hca_id: mthca0 > fw_ver: 5.2.0 > node_guid: 0050:4501:4a3a:0000 > sys_image_guid: 0050:4501:4a3a:0003 > vendor_id: 0x02c9 > vendor_part_id: 25218 > hw_ver: 0xA0 > board_id: ARM0020000001 > phys_port_cnt: 2 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 6 > port_lid: 6 > port_lmc: 0x00 > > port: 2 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 1 > port_lid: 1 > port_lmc: 0x00 > > [root at vortex3l-84 ~]# ibv_devinfo > hca_id: mthca0 > fw_ver: 5.2.0 > node_guid: 0050:4501:4a2e:0000 > sys_image_guid: 0050:4501:4a2e:0003 > vendor_id: 0x02c9 > vendor_part_id: 25218 > hw_ver: 0xA0 > board_id: ARM0020000001 > phys_port_cnt: 2 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 6 > port_lid: 7 > port_lmc: 0x00 > > port: 2 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 1 > port_lid: 2 > port_lmc: 0x00 > > > > And also in /var/log/opensm[1/2].log I see the following error messages - 
> > May 21 00:40:28 250119 [95A9A160] -> OpenSM Rev:openib-2.0.5 OpenIB svn 4954M This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you try OFED 1.2 ? What kernel is being used ? What distro ? What processor architecture ? > May 21 00:40:28 484648 [95A9A160] -> osm_vendor_bind: Binding to port 0x5045014a2e0001 > May 21 00:40:28 487418 [95A9A160] -> osm_vendor_bind: Binding to port 0x5045014a2e0001 > May 21 00:40:29 292689 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x100000125f) -- dropping > May 21 00:40:29 292728 [45007960] -> umad_receiver: ERR 5411: DR SMP > May 21 00:40:29 292741 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) > May 21 00:40:29 292818 [45007960] -> SMP dump: Is this around the time of the error or just an error in the OpenSM log ? > I found that for both ports on both vortex boxes I see the port_xmit_discards > counter was 1. Did this change from 0 to 1 around the time of the issue with the SM mastership ? Also, what are the port counters for the switch ports in use ? > Other error counters seems to be zero. Looks like some packets > has been transmitted and received on both machines. > > [root at vortex3l-83 ~]# cat > /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards > 1 > [root at vortex3l-83 ~]# cat > /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards > 1 > > [root at vortex3l-84 ~]# cat > /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards > 1 > [root at vortex3l-84 ~]# cat > /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards > 1 > > [root at sqasmd ~]# cat > /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards > 1 > > > Link speed seems to be set to 10 Gb/sec (4X) on all machines. > > > I have the opensm logs and gdb output for all the opensms. If you want I > can send it to you. Perhaps later; not just yet. 
> Just attaching one sample gdb output with stack traces of all threads. Are they all the same ? -- Hal > [root at vortex3l-83 ~]# gdb /usr/bin/opensm 5237 > GNU gdb Red Hat Linux (6.3.0.0-1.63rh) > Copyright 2004 Free Software Foundation, Inc. > GDB is free software, covered by the GNU General Public License, and you are > welcome to change it and/or distribute copies of it under certain conditions. > Type "show copying" to see the conditions. > There is absolutely no warranty for GDB. Type "show warranty" for details. > This GDB was configured as "x86_64-redhat-linux-gnu"... > (no debugging symbols found) > Using host libthread_db library "/lib64/tls/libthread_db.so.1". > > Attaching to program: /usr/bin/opensm, process 5237 > Reading symbols from /usr/lib/libopensm.so.1...done. > Loaded symbols for /usr/lib/libopensm.so.1 > Reading symbols from /usr/lib/libosmcomp.so.1...done. > Loaded symbols for /usr/lib/libosmcomp.so.1 > Reading symbols from /lib64/tls/libpthread.so.0...done. > [Thread debugging using libthread_db enabled] > [New Thread 182904123744 (LWP 5237)] > [New Thread 1157658976 (LWP 5267)] > [New Thread 1147169120 (LWP 5266)] > [New Thread 1136679264 (LWP 5265)] > [New Thread 1126189408 (LWP 5264)] > [New Thread 1115699552 (LWP 5263)] > [New Thread 1105209696 (LWP 5262)] > [New Thread 1094719840 (LWP 5261)] > [New Thread 1084229984 (LWP 5253)] > Loaded symbols for /lib64/tls/libpthread.so.0 > Reading symbols from /usr/lib/libosmvendor.so.2...done. > Loaded symbols for /usr/lib/libosmvendor.so.2 > Reading symbols from /usr/lib/libibcommon.so.1...done. > Loaded symbols for /usr/lib/libibcommon.so.1 > Reading symbols from /usr/lib/libibumad.so.1...done. > Loaded symbols for /usr/lib/libibumad.so.1 > Reading symbols from /lib64/tls/libc.so.6...done. > Loaded symbols for /lib64/tls/libc.so.6 > Reading symbols from /lib64/ld-linux-x86-64.so.2...done. 
> Loaded symbols for /lib64/ld-linux-x86-64.so.2 > 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > (gdb) info threads > 9 Thread 1084229984 (LWP 5253) 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 8 Thread 1094719840 (LWP 5261) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 7 Thread 1105209696 (LWP 5262) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 6 Thread 1115699552 (LWP 5263) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 5 Thread 1126189408 (LWP 5264) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 4 Thread 1136679264 (LWP 5265) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 3 Thread 1147169120 (LWP 5266) 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 2 Thread 1157658976 (LWP 5267) 0x0000002a95d7fd22 in poll () > from /lib64/tls/libc.so.6 > 1 Thread 182904123744 (LWP 5237) 0x0000002a95d51d65 in __nanosleep_nocancel > () from /lib64/tls/libc.so.6 > (gdb) thread 1 > [Switching to thread 1 (Thread 182904123744 (LWP 5237))]#0 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > (gdb) bt > #0 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > #1 0x0000002a95d82368 in usleep () from /lib64/tls/libc.so.6 > #2 0x0000002a9578a05e in cl_thread_suspend (pause_ms=10000) at cl_thread.c:125 > #3 0x0000000000405ba1 in main () > (gdb) thread 2 > [Switching to thread 2 (Thread 1157658976 (LWP 5267))]#0 0x0000002a95d7fd22 in poll () from /lib64/tls/libc.so.6 > (gdb) bt > #0 0x0000002a95d7fd22 in poll () from /lib64/tls/libc.so.6 > #1 0x0000002a95bbb94d in dev_poll (fd=Variable "fd" is not available. > ) at src/umad.c:775 > #2 0x0000002a95bbba6d in umad_recv (portid=Variable "portid" is not available. 
> ) at src/umad.c:805 > #3 0x0000002a959ae68b in umad_receiver (p_ptr=0x5c3000) > at osm_vendor_ibumad.c:266 > #4 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x5c3070) at cl_thread.c:61 > #5 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #6 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #7 0x0000000000000000 in ?? () > (gdb) thread 3 > [Switching to thread 3 (Thread 1147169120 (LWP 5266))]#0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95783b4b in cl_event_wait_on (p_event=0x5887b8, > wait_us=10000000, interruptible=1) at cl_event.c:181 > #2 0x000000000043630c in __osm_sm_sweeper () > #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x588898) at cl_thread.c:61 > #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 4 > [Switching to thread 4 (Thread 1136679264 (LWP 5265))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a278, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x000000000044d7a1 in __osm_vl15_poller () > #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58a2e8) at cl_thread.c:61 > #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? 
() > (gdb) thread 5 > [Switching to thread 5 (Thread 1126189408 (LWP 5264))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488) > at cl_threadpool.c:71 > #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x5900e0) at cl_thread.c:61 > #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 6 > [Switching to thread 6 (Thread 1115699552 (LWP 5263))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488) > at cl_threadpool.c:71 > #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x590010) at cl_thread.c:61 > #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? 
() > (gdb) thread 7 > [Switching to thread 7 (Thread 1105209696 (LWP 5262))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488) > at cl_threadpool.c:71 > #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58ff40) at cl_thread.c:61 > #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 8 > [Switching to thread 8 (Thread 1094719840 (LWP 5261))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488) > at cl_threadpool.c:71 > #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58b760) at cl_thread.c:61 > #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 9 > [Switching to thread 9 (Thread 1084229984 (LWP 5253))]#0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9578a9dd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168 > #2 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #3 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #4 0x0000000000000000 in ?? 
() > (gdb) thread 10 > Thread ID 10 not known. > (gdb) > Thread ID 10 not known. > (gdb) q > The program is running. Quit anyway (and detach it)? (y or n) y > Detaching from program: /usr/bin/opensm, process 5237 > [root at vortex3l-83 ~]# From xma at us.ibm.com Mon May 21 15:34:54 2007 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 21 May 2007 15:34:54 -0700 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: Message-ID: Thanks. There is no info to show why the connection got lost. Let's wait to see whether you can reproduce this problem in rc3. Thanks Shirley Ma From sean.hefty at intel.com Mon May 21 17:38:02 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 21 May 2007 17:38:02 -0700 Subject: [ofa-general] [PATCH] ib/cm: fix stale connection detection In-Reply-To: <20070520134441.GI20649@mellanox.co.il> Message-ID: <000101c79c09$74f15440$54c8180a@amr.corp.intel.com> The ib_cm can incorrectly detect a stale connection (a new connection request for a QPN that is already connected) as a duplicate connection request. Separate the handling of potential duplicate REQs from stale connections. Signed-off-by: Sean Hefty --- Can you let me know if this fixes the issues for you? I reworked the code only to detect the stale connection properly. More work is needed to force the local QP into timewait if that is needed. This would likely require adding a new CM event to report that a stale connection was detected on the QP. Also, I left the duplicate request handling as it was, since that should go in as a separate patch. 
drivers/infiniband/core/cm.c | 25 ++++++++++++++----------- 1 files changed, 14 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index eff591d..c53d486 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1295,26 +1295,29 @@ static struct cm_id_private * cm_match_req(struct cm_work *work, req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad; - /* Check for duplicate REQ and stale connections. */ + /* Check for possible duplicate REQ. */ spin_lock_irqsave(&cm.lock, flags); timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info); - if (!timewait_info) - timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info); - if (timewait_info) { cur_cm_id_priv = cm_get_id(timewait_info->work.local_id, timewait_info->work.remote_id); - cm_cleanup_timewait(cm_id_priv->timewait_info); spin_unlock_irqrestore(&cm.lock, flags); if (cur_cm_id_priv) { cm_dup_req_handler(work, cur_cm_id_priv); cm_deref_id(cur_cm_id_priv); - } else - cm_issue_rej(work->port, work->mad_recv_wc, - IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ, - NULL, 0); - listen_cm_id_priv = NULL; - goto out; + } + return NULL; + } + + /* Check for stale connections. */ + timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info); + if (timewait_info) { + cm_cleanup_timewait(cm_id_priv->timewait_info); + spin_unlock_irqrestore(&cm.lock, flags); + cm_issue_rej(work->port, work->mad_recv_wc, + IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ, + NULL, 0); + return NULL; } /* Find matching listen request. */ From mst at dev.mellanox.co.il Mon May 21 17:48:53 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 22 May 2007 03:48:53 +0300 Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection In-Reply-To: <000101c79c09$74f15440$54c8180a@amr.corp.intel.com> References: <20070520134441.GI20649@mellanox.co.il> <000101c79c09$74f15440$54c8180a@amr.corp.intel.com> Message-ID: <20070522004853.GA13311@mellanox.co.il> Thanks, Sean! The patch looks OK to me. I hope to be able to test it Sunday/Monday. Quoting Sean Hefty : Subject: [PATCH] ib/cm: fix stale connection detection The ib_cm can incorrectly detect a stale connection (a new connection request for a QPN that is already connected) as a duplicate connection request. Separate the handling of potential duplicate REQs from stale connections. Signed-off-by: Sean Hefty --- Can you let me know if this fixes the issues for you? I reworked the code only to detect the stale connection properly. More work is needed to force the local QP into timewait if that is needed. This would likely require adding a new CM event to report that a stale connection was detected on the QP. Also, I left the duplicate request handling as it was, since that should go in as a separate patch. drivers/infiniband/core/cm.c | 25 ++++++++++++++----------- 1 files changed, 14 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index eff591d..c53d486 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1295,26 +1295,29 @@ static struct cm_id_private * cm_match_req(struct cm_work *work, req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad; - /* Check for duplicate REQ and stale connections. */ + /* Check for possible duplicate REQ. 
*/ spin_lock_irqsave(&cm.lock, flags); timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info); - if (!timewait_info) - timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info); - if (timewait_info) { cur_cm_id_priv = cm_get_id(timewait_info->work.local_id, timewait_info->work.remote_id); - cm_cleanup_timewait(cm_id_priv->timewait_info); spin_unlock_irqrestore(&cm.lock, flags); if (cur_cm_id_priv) { cm_dup_req_handler(work, cur_cm_id_priv); cm_deref_id(cur_cm_id_priv); - } else - cm_issue_rej(work->port, work->mad_recv_wc, - IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ, - NULL, 0); - listen_cm_id_priv = NULL; - goto out; + } + return NULL; + } + + /* Check for stale connections. */ + timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info); + if (timewait_info) { + cm_cleanup_timewait(cm_id_priv->timewait_info); + spin_unlock_irqrestore(&cm.lock, flags); + cm_issue_rej(work->port, work->mad_recv_wc, + IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ, + NULL, 0); + return NULL; } /* Find matching listen request. */ -- MST From mst at dev.mellanox.co.il Mon May 21 17:59:18 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 22 May 2007 03:59:18 +0300 Subject: [ofa-general] [PATCH RFC] IB/ipoib: fix to_ipoib_neigh access race Message-ID: <20070522005918.GB13311@mellanox.co.il> hard_start_xmit dereferences to_ipoib_neigh when only tx_lock is taken. This would only be safe if all calls that modify *to_ipoib_neigh take tx_lock too. Currently this is not always true for ipoib_neigh_free and path_rec_completion, which results in memory corruption. Fix this race, making sure path_rec_completion and ipoib_neigh_free are always called under tx_lock. Signed-off-by: Michael S. Tsirkin --- I'm looking at https://bugs.openfabrics.org/show_bug.cgi?id=604 and I think this could explain the crashes. In any case, Roland, is there a race or am I imagining things? NB: The patch is untested (I'm not at the lab now). 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 0a428f2..ef9845a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -364,9 +364,9 @@ void ipoib_flush_paths(struct net_device *dev) spin_unlock(&priv->lock); spin_unlock_irq(&priv->tx_lock); wait_for_completion(&path->done); - path_free(dev, path); spin_lock_irq(&priv->tx_lock); spin_lock(&priv->lock); + path_free(dev, path); } spin_unlock(&priv->lock); spin_unlock_irq(&priv->tx_lock); @@ -401,7 +401,8 @@ static void path_rec_completion(int status, ah = ipoib_create_ah(dev, priv->pd, &av); } - spin_lock_irqsave(&priv->lock, flags); + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); path->ah = ah; @@ -442,7 +443,8 @@ static void path_rec_completion(int status, path->query = NULL; complete(&path->done); - spin_unlock_irqrestore(&priv->lock, flags); + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); while ((skb = __skb_dequeue(&skqueue))) { skb->dev = dev; @@ -822,7 +824,8 @@ static void ipoib_neigh_cleanup(struct neighbour *n) IPOIB_QPN(n->ha), IPOIB_GID_RAW_ARG(n->ha + 4)); - spin_lock_irqsave(&priv->lock, flags); + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); neigh = *to_ipoib_neigh(n); if (neigh) { @@ -832,7 +835,8 @@ static void ipoib_neigh_cleanup(struct neighbour *n) ipoib_neigh_free(n->dev, neigh); } - spin_unlock_irqrestore(&priv->lock, flags); + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); if (ah) ipoib_put_ah(ah); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 54fbead..d2e6a1a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -100,7 +100,8 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast) "deleting multicast group " IPOIB_GID_FMT "\n", 
IPOIB_GID_ARG(mcast->mcmember.mgid)); - spin_lock_irqsave(&priv->lock, flags); + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { /* @@ -114,7 +115,8 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast) ipoib_neigh_free(dev, neigh); } - spin_unlock_irqrestore(&priv->lock, flags); + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); if (mcast->ah) ipoib_put_ah(mcast->ah); -- MST From venkatesh.babu at 3leafnetworks.com Mon May 21 19:23:40 2007 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Mon, 21 May 2007 19:23:40 -0700 Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master In-Reply-To: <1179785796.15940.27092.camel@hal.voltaire.com> References: <4652167F.9040709@3leafnetworks.com> <1179785796.15940.27092.camel@hal.voltaire.com> Message-ID: <4652542C.3010400@3leafnetworks.com> Hal Rosenstock wrote: >So there is no link between the 2 switches, right ? > > That is right. > >Is there anything being done ? Cables pulled and reinserted ? Is >anything changing or is this a "stable" configuration in terms of the >topology ? > > There were no configuration changes from the cable or switch perspective. But nodes were being rebooted. >Is this the only thing going on on the subnet ? > > That was ipoib but no other ulp modules. There was a proprietary ulp module which creates udqp, joins the broadcast group, discovers nodes and sets up rcqps. There was no traffic being run. >So it did finally become master ? > > Yes, from the /var/log/opensm1.log it looks like it became master. But it was not responding to link local broadcast join operations. It was failing with -110, Connection timed out. >I take it LID 6 is local (vortex31-83). > >This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you >try OFED 1.2 ? > > It is the OFED 1.1 released stack. I have seen this problem with OFED 1.0 also. 
Trying with OFED 1.2 may take much longer time, since we need to port our stuff. >What kernel is being used ? What distro ? What processor architecture ? > > 2.6.9-22.EL RHEL 4.2 Dual Core AMD Opteron(tm) Processor 270 HE > >Is this around the time of the error or just an error in the OpenSM log >? > > The logs were frozen after these error messages. No new entries were being written to the log files. After doing "sminfo -s3" I saw the some messages indicating that it moved to MASTER state and other messages. May 21 00:40:28 013290 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0007 TID:0x0000000000000003 May 21 00:40:28 013431 [41401960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0007 GID:0xfe80000000000000,0x005045014a2e0001 May 21 00:40:28 818202 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x100000135b) -- dropping May 21 00:40:28 819089 [45007960] -> umad_receiver: ERR 5411: DR SMP May 21 00:40:28 819110 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) May 21 00:40:28 819145 [45007960] -> SMP dump: ... May 21 00:40:28 819247 [41E02960] -> Entering STANDBY state May 21 14:04:17 204871 [45007960] -> umad_receiver: ERR 5404: recv error on MAD sized umad (Interrupted system call) May 21 14:06:08 022096 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x20 trans_id=0x100000264f) -- dropping May 21 14:06:08 022132 [45007960] -> umad_receiver: ERR 5411: DR SMP May 21 14:06:08 022145 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) May 21 14:06:08 022182 [45007960] -> SMP dump: ... 
May 21 14:06:38 035957 [41401960] -> Entering MASTER state May 21 14:06:38 038818 [42803960] -> osm_subn_set_up_down_min_hop_table: BFS through all port guids in the subnet ] May 21 14:06:38 038886 [42803960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches May 21 14:06:38 046438 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000C TID:0x0000000000000ec4 May 21 14:06:38 046565 [41401960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x000C GID:0xfe80000000000000,0x000b8cffff0024f9 May 21 14:06:38 108660 [42803960] -> SUBNET UP May 21 14:06:38 402900 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000000 May 21 14:06:38 403007 [41401960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c9020020f5c5 May 21 14:06:38 914806 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x20 trans_id=0x10000026f0) -- dropping May 21 14:06:38 914823 [45007960] -> umad_receiver: ERR 5411: DR SMP May 21 14:06:38 914864 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) May 21 14:06:38 914899 [45007960] -> SMP dump: >Did this change from 0 to 1 around the time of the issue with the SM >mastership ? > > Not sure, I just got the snapshot when I saw this problem. >Also, what are the port counters for the switch ports in use ? 
> > [root at vortex3l-83 ~]# ibnetdiscover ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed, skipping port # # Topology file: generated on Mon May 21 02:11:34 2007 # # Max of 2 hops discovered # Initiated from node 005045014a3a0000 port 005045014a3a0001 vendid=0x2c9 devid=0xb924 sysimgguid=0xb8cffff0024f9 switchguid=0xb8cffff0024f9 Switch 24 "S-000b8cffff0024f9" # MT47396 Infiniscale-III Mellanox Technologies base port 0 lid 12 lmc 0 [18] "H-005045014a2e0000"[1] [11] "H-0002c902002048b0"[1] [10] "H-0002c9020020f584"[1] [19] "H-005045014a3a0000"[1] vendid=0x2c9 devid=0x6282 sysimgguid=0x5045014a2e0003 caguid=0x5045014a2e0000 Ca 2 "H-005045014a2e0000" # vortex3l-84 HCA-1 [1] "S-000b8cffff0024f9"[18] # lid 7 lmc 0 vendid=0x2c9 devid=0x6282 sysimgguid=0x2c902002048b3 caguid=0x2c902002048b0 Ca 2 "H-0002c902002048b0" # MT25218 InfiniHostEx Mellanox Technologies [1] "S-000b8cffff0024f9"[11] # lid 5 lmc 0 vendid=0x2c9 devid=0x6282 sysimgguid=0x2c9020020f587 caguid=0x2c9020020f584 Ca 2 "H-0002c9020020f584" # MT25218 InfiniHostEx Mellanox Technologies [1] "S-000b8cffff0024f9"[10] # lid 8 lmc 0 vendid=0x2c9 devid=0x6282 sysimgguid=0x5045014a3a0003 caguid=0x5045014a3a0000 Ca 2 "H-005045014a3a0000" # vortex3l-83 HCA-1 [1] "S-000b8cffff0024f9"[19] # lid 6 lmc 0 [root at vortex3l-83 ~]# >Perhaps later; not just yet. > > >Are they all the same ? > > More or less they are the same. All of them have 9 threads and each thread is blocking on some event. 
VBabu From halr at voltaire.com Mon May 21 20:45:57 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 21 May 2007 23:45:57 -0400 Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master In-Reply-To: <4652542C.3010400@3leafnetworks.com> References: <4652167F.9040709@3leafnetworks.com> <1179785796.15940.27092.camel@hal.voltaire.com> <4652542C.3010400@3leafnetworks.com> Message-ID: <1179805556.15940.47640.camel@hal.voltaire.com> On Mon, 2007-05-21 at 22:23, Venkatesh Babu wrote: > Hal Rosenstock wrote: > > >So there is no link between the 2 switches, right ? > > > > > That is right. > > > > >Is there anything being done ? Cables pulled and reinserted ? Is > >anything changing or is this a "stable" configuration in terms of the > >topology ? > > > > > There were no configuration changes from the cable or switch > perspective. But nodes were being rebooted. > > >Is this the only thing going on on the subnet ? > > > > > That was ipoib but no other ulp modules. There was a proprietary ulp > module which creates udqp, joins the broadcast > group, discovers nodes and sets up rcqps. There was no traffic being run. > > >So it did finally become master ? > > > > > Yes, from the /var/log/opensm1.log it looks like it became master. But > it was not responding to > link local broadcast join operations. It was failing with -110, > Connection timed out. > > >I take it LID 6 is local (vortex31-83). > > > >This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you > >try OFED 1.2 ? > > > > > It is the OFED 1.1 released stack. I have seen this problem with OFED 1.0 > also. > Trying with OFED 1.2 may take much longer time, since we need to port > our stuff. Can you at least use OFED 1.2 management (OpenSM and management libraries) with the rest being OFED 1.1 ? There are a number of bugs which have been fixed which might affect this. The one I can think of off the top of my head is a fix to atomics in OpenSM's complib. 
I think that was found and fixed post OFED 1.1. I'll confirm this tomorrow. There may also be some important kernel differences (in user_mad.c or mad.c) which might be relevant. > >What kernel is being used ? What distro ? What processor architecture ? > > > > > 2.6.9-22.EL RHEL 4.2 Dual Core AMD Opteron(tm) Processor > 270 HE > > > > >Is this around the time of the error or just an error in the OpenSM log > >? > > > > > The logs were frozen after these error messages. No new entries were > being written to the log files. > After doing "sminfo -s3" I saw the some messages indicating that it > moved to MASTER state and other messages. > > May 21 00:40:28 013290 [41401960] -> __osm_trap_rcv_process_request: > Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0007 > TID:0x0000000000000003 > May 21 00:40:28 013431 [41401960] -> osm_report_notice: Reporting > Generic Notice type:4 num:144 from LID:0x0007 > GID:0xfe80000000000000,0x005045014a2e0001 > May 21 00:40:28 818202 [45007960] -> umad_receiver: ERR 5409: send > completed with error (method=0x1 attr=0x11 trans_id=0x100000135b) -- > dropping > May 21 00:40:28 819089 [45007960] -> umad_receiver: ERR 5411: DR SMP > May 21 00:40:28 819110 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR > 3113: MAD completed in error (IB_TIMEOUT) > May 21 00:40:28 819145 [45007960] -> SMP dump: > ... > May 21 00:40:28 819247 [41E02960] -> Entering STANDBY state > May 21 14:04:17 204871 [45007960] -> umad_receiver: ERR 5404: recv error > on MAD sized umad (Interrupted system call) > May 21 14:06:08 022096 [45007960] -> umad_receiver: ERR 5409: send > completed with error (method=0x1 attr=0x20 trans_id=0x100000264f) -- > dropping > May 21 14:06:08 022132 [45007960] -> umad_receiver: ERR 5411: DR SMP > May 21 14:06:08 022145 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR > 3113: MAD completed in error (IB_TIMEOUT) > May 21 14:06:08 022182 [45007960] -> SMP dump: > ... 
> May 21 14:06:38 035957 [41401960] -> Entering MASTER state > May 21 14:06:38 038818 [42803960] -> osm_subn_set_up_down_min_hop_table: > BFS through all port guids in the subnet ] > May 21 14:06:38 038886 [42803960] -> osm_ucast_mgr_process: Min Hop > Tables configured on all switches > May 21 14:06:38 046438 [41401960] -> __osm_trap_rcv_process_request: > Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000C > TID:0x0000000000000ec4 > May 21 14:06:38 046565 [41401960] -> osm_report_notice: Reporting > Generic Notice type:1 num:128 from LID:0x000C > GID:0xfe80000000000000,0x000b8cffff0024f9 > May 21 14:06:38 108660 [42803960] -> SUBNET UP > May 21 14:06:38 402900 [41401960] -> __osm_trap_rcv_process_request: > Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 > TID:0x0000000000000000 > May 21 14:06:38 403007 [41401960] -> osm_report_notice: Reporting > Generic Notice type:4 num:144 from LID:0x0001 > GID:0xfe80000000000000,0x0002c9020020f5c5 > May 21 14:06:38 914806 [45007960] -> umad_receiver: ERR 5409: send > completed with error (method=0x1 attr=0x20 trans_id=0x10000026f0) -- > dropping > May 21 14:06:38 914823 [45007960] -> umad_receiver: ERR 5411: DR SMP > May 21 14:06:38 914864 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR > 3113: MAD completed in error (IB_TIMEOUT) > May 21 14:06:38 914899 [45007960] -> SMP dump: > > >Did this change from 0 to 1 around the time of the issue with the SM > >mastership ? > > > > > Not sure, I just got the snapshot when I saw this problem. > > >Also, what are the port counters for the switch ports in use ? > > > > > [root at vortex3l-83 ~]# ibnetdiscover I was referring to using perfquery, not ibnetdiscover. > ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed, > skipping port Was this node rebooting while you did this or is there some other issue ? 
> # > # Topology file: generated on Mon May 21 02:11:34 2007 > # > # Max of 2 hops discovered > # Initiated from node 005045014a3a0000 port 005045014a3a0001 > > vendid=0x2c9 > devid=0xb924 > sysimgguid=0xb8cffff0024f9 > switchguid=0xb8cffff0024f9 > Switch 24 "S-000b8cffff0024f9" # MT47396 Infiniscale-III Mellanox > Technologies base port 0 lid 12 lmc 0 > [18] "H-005045014a2e0000"[1] > [11] "H-0002c902002048b0"[1] > [10] "H-0002c9020020f584"[1] > [19] "H-005045014a3a0000"[1] So run these (before and after): perfquery 12 18 perfquery 12 11 perfquery 12 10 perfquery 12 19 and perfquery 12 9 -- Hal > vendid=0x2c9 > devid=0x6282 > sysimgguid=0x5045014a2e0003 > caguid=0x5045014a2e0000 > Ca 2 "H-005045014a2e0000" # vortex3l-84 HCA-1 > [1] "S-000b8cffff0024f9"[18] # lid 7 lmc 0 > > vendid=0x2c9 > devid=0x6282 > sysimgguid=0x2c902002048b3 > caguid=0x2c902002048b0 > Ca 2 "H-0002c902002048b0" # MT25218 InfiniHostEx Mellanox > Technologies > [1] "S-000b8cffff0024f9"[11] # lid 5 lmc 0 > > vendid=0x2c9 > devid=0x6282 > sysimgguid=0x2c9020020f587 > caguid=0x2c9020020f584 > Ca 2 "H-0002c9020020f584" # MT25218 InfiniHostEx Mellanox > Technologies > [1] "S-000b8cffff0024f9"[10] # lid 8 lmc 0 > > vendid=0x2c9 > devid=0x6282 > sysimgguid=0x5045014a3a0003 > caguid=0x5045014a3a0000 > Ca 2 "H-005045014a3a0000" # vortex3l-83 HCA-1 > [1] "S-000b8cffff0024f9"[19] # lid 6 lmc 0 > [root at vortex3l-83 ~]# > > >Perhaps later; not just yet. > > > > > >Are they all the same ? > > > > > More or less they are same. All of them have 9 threads and each thread > is blocking form some event. > > VBabu From mst at dev.mellanox.co.il Mon May 21 22:16:40 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin)
Date: Tue, 22 May 2007 08:16:40 +0300
Subject: [ofa-general] cm.c and irqsave (was Re: [PATCH] ib/cm: fix stale connection detection)
In-Reply-To: <000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
References: <20070520134441.GI20649@mellanox.co.il> <000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
Message-ID: <20070522051640.GA23066@mellanox.co.il>

> @@ -1295,26 +1295,29 @@ static struct cm_id_private * cm_match_req(struct cm_work *work,
>
>  	req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
>
> -	/* Check for duplicate REQ and stale connections. */
> +	/* Check for possible duplicate REQ. */
>  	spin_lock_irqsave(&cm.lock, flags);
>  	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);

On an unrelated note, it looks like cm.c would benefit from an irqsave
diet: it seems to perform work almost exclusively from thread context,
so just spin_lock_irq is sure to be enough.

And if *everything* is done from a thread context, I think we
can go one step further and avoid disabling interrupts as well.

-- 
MST

From venkatesh.babu at 3leafnetworks.com Mon May 21 23:31:24 2007
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Mon, 21 May 2007 23:31:24 -0700
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master
In-Reply-To: <1179805556.15940.47640.camel@hal.voltaire.com>
References: <4652167F.9040709@3leafnetworks.com> <1179785796.15940.27092.camel@hal.voltaire.com> <4652542C.3010400@3leafnetworks.com> <1179805556.15940.47640.camel@hal.voltaire.com>
Message-ID: <46528E3C.8090305@3leafnetworks.com>

Hal Rosenstock wrote:

>Can you at least use OFED 1.2 management (OpenSM and management
>libraries) with the rest being OFED 1.1 ?

Are these backward compatible ?

>There are a number of bugs which have been fixed which might affect
>this. The one I can think of off the top of my head is a fix to atomics
>in OpenSM's complib. I think that was found and fixed post OFED 1.1.
>I'll confirm this tomorrow.
>
>There may also be some important kernel differences (in user_mad.c or
>mad.c) which might be relevant.

It would be great if you could find these particular patches; we could
apply them to OFED 1.1 instead of migrating to OFED 1.2.
By the way, when is a production-quality OFED 1.2 supposed to be released ?

>I was referring to using perfquery, not ibnetdiscover.

I don't have that output right now. But I found that all error
counters were zero except port_xmit_discards.

>>ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed,
>>skipping port
>
>Was this node rebooting while you did this or is there some other issue
>?

Yes, it is quite possible that the node was being rebooted.

>So run these (before and after):
>perfquery 12 18
>perfquery 12 11
>perfquery 12 10
>perfquery 12 19
>
>and
>
>perfquery 12 9

Unfortunately the systems got rebooted and the issue is lost. I was able
to collect the perfquery output. It looks like it is now seeing some errors.
[root at vortex3l-83 ~]# perfquery 12 9 # Port counters: Lid 12 port 9 PortSelect:......................9 CounterSelect:...................0x0100 SymbolErrors:....................65535 LinkRecovers:....................2 LinkDowned:......................255 RcvErrors:.......................1 RcvRemotePhysErrors:.............0 RcvSwRelayErrors:................41484 XmtDiscards:.....................4918 XmtConstraintErrors:.............0 RcvConstraintErrors:.............0 LinkIntegrityErrors:.............0 ExcBufOverrunErrors:.............0 VL15Dropped:.....................1 XmtBytes:........................2050081143 RcvBytes:........................4294967295 XmtPkts:.........................14539343 RcvPkts:.........................37028545 [root at vortex3l-83 ~]# perfquery 12 10 # Port counters: Lid 12 port 10 PortSelect:......................10 CounterSelect:...................0x0100 SymbolErrors:....................65535 LinkRecovers:....................27 LinkDowned:......................255 RcvErrors:.......................0 RcvRemotePhysErrors:.............0 RcvSwRelayErrors:................19936 XmtDiscards:.....................5192 XmtConstraintErrors:.............0 RcvConstraintErrors:.............0 LinkIntegrityErrors:.............0 ExcBufOverrunErrors:.............0 VL15Dropped:.....................0 XmtBytes:........................4294967295 RcvBytes:........................4294967295 XmtPkts:.........................1739931538 RcvPkts:.........................1794380558 [root at vortex3l-83 ~]# perfquery 12 11 # Port counters: Lid 12 port 11 PortSelect:......................11 CounterSelect:...................0x0100 SymbolErrors:....................65535 LinkRecovers:....................0 LinkDowned:......................255 RcvErrors:.......................1 RcvRemotePhysErrors:.............0 RcvSwRelayErrors:................8963 XmtDiscards:.....................5636 XmtConstraintErrors:.............0 RcvConstraintErrors:.............0 
LinkIntegrityErrors:.............0 ExcBufOverrunErrors:.............0 VL15Dropped:.....................0 XmtBytes:........................4294967295 RcvBytes:........................4294967295 XmtPkts:.........................2375935494 RcvPkts:.........................2714377528 [root at vortex3l-83 ~]# perfquery 12 18 # Port counters: Lid 12 port 18 PortSelect:......................18 CounterSelect:...................0x0100 SymbolErrors:....................65535 LinkRecovers:....................24 LinkDowned:......................220 RcvErrors:.......................0 RcvRemotePhysErrors:.............0 RcvSwRelayErrors:................65535 XmtDiscards:.....................23628 XmtConstraintErrors:.............0 RcvConstraintErrors:.............0 LinkIntegrityErrors:.............0 ExcBufOverrunErrors:.............0 VL15Dropped:.....................0 XmtBytes:........................4294967295 RcvBytes:........................4294967295 XmtPkts:.........................604709394 RcvPkts:.........................448409077 [root at vortex3l-83 ~]# perfquery 12 19 # Port counters: Lid 12 port 19 PortSelect:......................19 CounterSelect:...................0x0100 SymbolErrors:....................65535 LinkRecovers:....................21 LinkDowned:......................247 RcvErrors:.......................0 RcvRemotePhysErrors:.............0 RcvSwRelayErrors:................65535 XmtDiscards:.....................37754 XmtConstraintErrors:.............0 RcvConstraintErrors:.............0 LinkIntegrityErrors:.............0 ExcBufOverrunErrors:.............0 VL15Dropped:.....................0 XmtBytes:........................4294967295 RcvBytes:........................4294967295 XmtPkts:.........................3958092428 RcvPkts:.........................3679343076 [root at vortex3l-83 ~]# -VBabu From mst at dev.mellanox.co.il Mon May 21 23:36:34 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin)
Date: Tue, 22 May 2007 09:36:34 +0300
Subject: [ofa-general] skb queue management in ipoib
Message-ID: <20070522063634.GB3331@mellanox.co.il>

Roland, all,
currently, IPoIB keeps skb queues while SA query/connection request
is outstanding. These queues have a length limit, but once the limit
is reached, new packets are dropped.

Example:
	if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE)
		__skb_queue_tail(&neigh->queue, skb);
	else {
		ipoib_warn(priv, "queue length limit %d. Packet drop.\n",
			   skb_queue_len(&neigh->queue));
		goto err_drop;
	}

I think that managing this queue in a FIFO manner, dropping old packets
and inserting new ones instead would be better: an older packet has more
chance to have been timed out.

So we would do something along the lines of:

	__skb_queue_tail(&neigh->queue, skb);
	if (skb_queue_len(&neigh->queue) > IPOIB_MAX_PATH_REC_QUEUE) {
		/* over the limit: drop the oldest packet, at the head */
		skb = __skb_dequeue(&neigh->queue);
		ipoib_warn(priv, "queue length limit %d. Packet drop.\n",
			   skb_queue_len(&neigh->queue));
		goto err_drop;
	}

Does this make sense?

-- 
MST

From mst at dev.mellanox.co.il Tue May 22 00:59:52 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 22 May 2007 10:59:52 +0300
Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
References: <20070520134441.GI20649@mellanox.co.il> <000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
Message-ID: <20070522075952.GC3331@mellanox.co.il>

The ib_cm can incorrectly detect a stale connection (a new connection
request for a QPN that is already connected) as a duplicate connection
request. Separate the handling of potential duplicate REQs from stale
connections.

Signed-off-by: Sean Hefty
Acked-by: Michael S. Tsirkin

> ---
>
> Can you let me know if this fixes the issues for you? I reworked the
> code only to detect the stale connection properly.

Yes, this has fixed the issue for me.
I have not seen any timeouts yet: netperf seems to recover in at most 15 sec,
where previously it needed up to 2 minutes.

The patch looks obvious enough for 2.6.22, safe enough in that it replaces
a timeout with a reject, and it addresses a real problem.

Sean? Roland? What do you think?

> More work is needed
> to force the local QP into timewait if that is needed.

Yes, this is needed: IPoIB has its own stale connection detection logic,
so it will, after several minutes of inactivity, clean out the connection;
however, if the number of QPs is increased, this timeout might become too long:
and handling this only at the ULP level is wrong anyway.

In practice I don't think we have seen this yet, but the spec is quite
explicit about this point:

	When a CM receives such a REQ/REP it shall abort the connection
	establishment by issuing REJ to the REQ/REP. It shall then issue
	DREQ, with “DREQ:remote QPN” set to the remote QPN from the
	REQ/REP, until DREP is received or Max Retries is exceeded, and
	place the local QP in the timeWait state.

I agree this is 2.6.23 material, however.

Something that I think would be very useful for 2.6.22 already:
could you please document which portions of chapter 12 are not
currently implemented in cm.c, and put this in some file in kernel tree?
This way people will be able to figure out whether something that they need
is missing, and contribute.

> This would
> likely require adding a new CM event to report that a stale connection
> was detected on the QP.

Yes, this looks like a reasonable way to do this.

> Also, I left the duplicate request handling
> as it was, since that should go in as a separate patch.

Could you please describe what is missing currently?
Is the missing handling likely to cause timeouts?
I hope we have reduced the chance of duplicate request misdetections
with the local id patch sufficiently, and fixing this can wait till 2.6.23.
> drivers/infiniband/core/cm.c | 25 ++++++++++++++----------- > 1 files changed, 14 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index eff591d..c53d486 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1295,26 +1295,29 @@ static struct cm_id_private * cm_match_req(struct cm_work *work, req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad; - /* Check for duplicate REQ and stale connections. */ + /* Check for possible duplicate REQ. */ spin_lock_irqsave(&cm.lock, flags); timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info); - if (!timewait_info) - timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info); - if (timewait_info) { cur_cm_id_priv = cm_get_id(timewait_info->work.local_id, timewait_info->work.remote_id); - cm_cleanup_timewait(cm_id_priv->timewait_info); spin_unlock_irqrestore(&cm.lock, flags); if (cur_cm_id_priv) { cm_dup_req_handler(work, cur_cm_id_priv); cm_deref_id(cur_cm_id_priv); - } else - cm_issue_rej(work->port, work->mad_recv_wc, - IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ, - NULL, 0); - listen_cm_id_priv = NULL; - goto out; + } + return NULL; + } + + /* Check for stale connections. */ + timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info); + if (timewait_info) { + cm_cleanup_timewait(cm_id_priv->timewait_info); + spin_unlock_irqrestore(&cm.lock, flags); + cm_issue_rej(work->port, work->mad_recv_wc, + IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ, + NULL, 0); + return NULL; } /* Find matching listen request. */ -- MST From amip at dev.mellanox.co.il Tue May 22 01:33:12 2007 From: amip at dev.mellanox.co.il (Ami Perlmutter) Date: Tue, 22 May 2007 11:33:12 +0300 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: References: Message-ID: does the application constantly open and close connections? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Koen.SEGERS at VRT.BE Tue May 22 01:54:41 2007 From: Koen.SEGERS at VRT.BE (SEGERS Koen) Date: Tue, 22 May 2007 10:54:41 +0200 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: Message-ID: GPFS keeps its connection constantly open. We did some more tests with iperf: If we don't run bidirectional tests, all connections keeps running smoothly. If we add bidirectional tests, it becomes unstable. Certainly if this is done on multiple nodes. Is this normal? The failed iperf tests give the same error in the switch log: May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71 May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71 May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering removed ports May 22 08:15:00 topspin-120sc ib_sm.x[621]: %IB-6-INFO: Program switch port state to down, node=00:05:ad:00:00:0b:a2:cc, port= 6, due to non-responding CA May 22 08:15:00 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port down - port=1/6, type=ib4xTXP May 22 08:15:00 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: in portTblFindEntry() - IfIndex=70(1/6) May 22 08:15:00 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: cannot find entry - IfIndex=70(1/6) May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering new ports May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71 May 22 08:15:05 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port up - port=1/6, type=ib4xTXP May 22 08:15:07 topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71 May 22 08:15:08 
topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change RC3 is just installed. Results will follow soon. Regards, Koen ________________________________ Van: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] Verzonden: dinsdag 22 mei 2007 10:33 Aan: Shirley Ma CC: SEGERS Koen; general-bounces at lists.openfabrics.org; general at lists.openfabrics.org Onderwerp: Re: [ofa-general] GPFS node loses IB-connection does the application constantly open and close connections? *** Disclaimer *** Vlaamse Radio- en Televisieomroep Auguste Reyerslaan 52, 1043 Brussel nv van publiek recht BTW BE 0244.142.664 RPR Brussel http://www.vrt.be/disclaimer -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad at lists.openfabrics.org Tue May 22 02:41:37 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 22 May 2007 02:41:37 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070522-0200 daily build status Message-ID: <20070522094137.660F4E6081F@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.18 Passed on ppc64 with linux-2.6.15 Passed 
on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From halr at voltaire.com Tue May 22 03:53:02 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 22 May 2007 06:53:02 -0400 Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master In-Reply-To: <46528E3C.8090305@3leafnetworks.com> References: <4652167F.9040709@3leafnetworks.com> <1179785796.15940.27092.camel@hal.voltaire.com> <4652542C.3010400@3leafnetworks.com> <1179805556.15940.47640.camel@hal.voltaire.com> <46528E3C.8090305@3leafnetworks.com> Message-ID: <1179831181.15940.74121.camel@hal.voltaire.com> On Tue, 2007-05-22 at 02:31, Venkatesh Babu wrote: > Hal Rosenstock wrote: > > > > >Can you at least use OFED 1.2 management (OpenSM and management > >libraries) with the rest being OFED 1.1 ? 
> >
> Are these backward compatible ?

Yes, user_mad kernel module has been at ABI version 5 for quite some
time now.

> >There are a number of bugs which have been fixed which might affect
> >this. The one I can think of off the top of my head is a fix to atomics
> >in OpenSM's complib. I think that was found and fixed post OFED 1.1.
> >I'll confirm this tomorrow.

The atomic fix was in OpenSM 2.0.5 but there are numerous other fixes
(see OpenSM release notes for OFED 1.2).

> >There may also be some important kernel differences (in user_mad.c or
> >mad.c) which might be relevant.
> >
> It would be great if you can find these particular patches, we could
> apply these onto OFED 1.1
> instead of migrating to OFED 1.2.

The one I see that might be related is the following:

commit 39798695b4bcc7b145f8910ca56195808d3a7637
Author: Roland Dreier
Date:   Mon Nov 13 09:38:07 2006 -0800

    IB/mad: Fix race between cancel and receive completion

    When ib_cancel_mad() is called, it puts the canceled send on a list
    and schedules a "flushed" callback from process context. However,
    this leaves a window where a receive completion could be processed
    before the send is fully flushed.

    This is fine, except that ib_find_send_mad() will find the MAD and
    return it to the receive processing, which results in the sender
    getting both a successful receive and a "flushed" send completion for
    the same request. Understandably, this confuses the sender, which is
    expecting only one of these two callbacks, and leads to grief such as
    a use-after-free in IPoIB.

    Fix this by changing ib_find_send_mad() to return a send struct only
    if the status is still successful (and not "flushed"). The search of
    the send_list already had this check, so this patch just adds the same
    check to the search of the wait_list.

    Signed-off-by: Roland Dreier

My search was not exhaustive.

> By the way, when is production quality OFED 1.2 is supposed to be
> released ?

It was supposed to be released already but we are closing in on rc4 (May
30) with the release to follow shortly thereafter (1-2 weeks).

> >I was referring to using perfquery, not ibnetdiscover.
> >
> I don't have that output right now. But I found that all other error
> counters were zero except port_xmit_discards.

It would be useful to get these to be sure after the problem occurs.

> >>ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed,
> >>skipping port
> >
> >Was this node rebooting while you did this or is there some other issue
> >?
> >
> Yes, it is quite possible that node was being rebooted.
>
> >So run these (before and after):
> >perfquery 12 18
> >perfquery 12 11
> >perfquery 12 10
> >perfquery 12 19
> >
> >and
> >
> >perfquery 12 9
> >
> Unfortunately the systems got rebooted and issue is lost. I was able
> to collect the perfquery output. It looks like now it is seeing some errors.

Are they incrementing ? Which node is this ? I think some of them would
increment on node reboot.
-- Hal > [root at vortex3l-83 ~]# perfquery 12 9 > # Port counters: Lid 12 port 9 > PortSelect:......................9 > CounterSelect:...................0x0100 > SymbolErrors:....................65535 > LinkRecovers:....................2 > LinkDowned:......................255 > RcvErrors:.......................1 > RcvRemotePhysErrors:.............0 > RcvSwRelayErrors:................41484 > XmtDiscards:.....................4918 > XmtConstraintErrors:.............0 > RcvConstraintErrors:.............0 > LinkIntegrityErrors:.............0 > ExcBufOverrunErrors:.............0 > VL15Dropped:.....................1 > XmtBytes:........................2050081143 > RcvBytes:........................4294967295 > XmtPkts:.........................14539343 > RcvPkts:.........................37028545 > [root at vortex3l-83 ~]# perfquery 12 10 > # Port counters: Lid 12 port 10 > PortSelect:......................10 > CounterSelect:...................0x0100 > SymbolErrors:....................65535 > LinkRecovers:....................27 > LinkDowned:......................255 > RcvErrors:.......................0 > RcvRemotePhysErrors:.............0 > RcvSwRelayErrors:................19936 > XmtDiscards:.....................5192 > XmtConstraintErrors:.............0 > RcvConstraintErrors:.............0 > LinkIntegrityErrors:.............0 > ExcBufOverrunErrors:.............0 > VL15Dropped:.....................0 > XmtBytes:........................4294967295 > RcvBytes:........................4294967295 > XmtPkts:.........................1739931538 > RcvPkts:.........................1794380558 > [root at vortex3l-83 ~]# perfquery 12 11 > # Port counters: Lid 12 port 11 > PortSelect:......................11 > CounterSelect:...................0x0100 > SymbolErrors:....................65535 > LinkRecovers:....................0 > LinkDowned:......................255 > RcvErrors:.......................1 > RcvRemotePhysErrors:.............0 > RcvSwRelayErrors:................8963 > 
XmtDiscards:.....................5636 > XmtConstraintErrors:.............0 > RcvConstraintErrors:.............0 > LinkIntegrityErrors:.............0 > ExcBufOverrunErrors:.............0 > VL15Dropped:.....................0 > XmtBytes:........................4294967295 > RcvBytes:........................4294967295 > XmtPkts:.........................2375935494 > RcvPkts:.........................2714377528 > [root at vortex3l-83 ~]# perfquery 12 18 > # Port counters: Lid 12 port 18 > PortSelect:......................18 > CounterSelect:...................0x0100 > SymbolErrors:....................65535 > LinkRecovers:....................24 > LinkDowned:......................220 > RcvErrors:.......................0 > RcvRemotePhysErrors:.............0 > RcvSwRelayErrors:................65535 > XmtDiscards:.....................23628 > XmtConstraintErrors:.............0 > RcvConstraintErrors:.............0 > LinkIntegrityErrors:.............0 > ExcBufOverrunErrors:.............0 > VL15Dropped:.....................0 > XmtBytes:........................4294967295 > RcvBytes:........................4294967295 > XmtPkts:.........................604709394 > RcvPkts:.........................448409077 > [root at vortex3l-83 ~]# perfquery 12 19 > # Port counters: Lid 12 port 19 > PortSelect:......................19 > CounterSelect:...................0x0100 > SymbolErrors:....................65535 > LinkRecovers:....................21 > LinkDowned:......................247 > RcvErrors:.......................0 > RcvRemotePhysErrors:.............0 > RcvSwRelayErrors:................65535 > XmtDiscards:.....................37754 > XmtConstraintErrors:.............0 > RcvConstraintErrors:.............0 > LinkIntegrityErrors:.............0 > ExcBufOverrunErrors:.............0 > VL15Dropped:.....................0 > XmtBytes:........................4294967295 > RcvBytes:........................4294967295 > XmtPkts:.........................3958092428 > 
RcvPkts:.........................3679343076
> [root at vortex3l-83 ~]#
>
> -VBabu

From Koen.SEGERS at VRT.BE Tue May 22 06:43:59 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 22 May 2007 15:43:59 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: 
Message-ID: 

I did the iperf tests on servers with OFED-1.2-RC3. It also gives the same result. Actually, it is even worse: when the interface dies, it gets in PORT_INIT state, but it doesn't go to PORT_ACTIVE again. At least not within 10 minutes.

I'll give you the test script I ran:

ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 5001 &
ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 5002 &
ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 5003 &
ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 6001 &
ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 6002 &
ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 6003 &
ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 7001 &
ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 7002 &
ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 7003 &
ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 8001 &
ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 8002 &
ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 8003 &
sleep 5
for i in 14 15 16 17
do
	ssh 10.224.158.111 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))001 -t 120 -d -P 5 &
	ssh 10.224.158.112 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))002 -t 120 -d -P 5 &
	ssh 10.224.158.113 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))003 -t 120 -d -P 5 &
done

Any ideas?
Regards,
Koen

From sweitzen at cisco.com Tue May 22 08:34:24 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 22 May 2007 08:34:24 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: 
References: 
Message-ID: 

What server model and CPU model do you have?

This could be https://bugs.openfabrics.org//show_bug.cgi?id=229. Try setting RENICE_IB_MAD=yes in /etc/infiniband/openibd.conf, then reboot or run /etc/init.d/openibd restart, and see if that helps.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

________________________________
From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
Sent: Tuesday, May 22, 2007 6:44 AM
To: Ami Perlmutter; Shirley Ma
Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

I did the iperf tests on servers with OFED-1.2-RC3.

It also gives the same result. Actually, it is even worse: when the interface dies, it gets in PORT_INIT state, but it doesn't go to PORT_ACTIVE again. At least not within 10 minutes.

I'll give you the test script I ran:

ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 5001 &
ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 5002 &
ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 5003 &
ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 6001 &
ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 6002 &
ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 6003 &
ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 7001 &
ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 7002 &
ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 7003 &
ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 8001 &
ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 8002 &
ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 8003 &

sleep 5

for i in 14 15 16 17
do
    ssh 10.224.158.111 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))001 -t 120 -d -P 5 &
    ssh 10.224.158.112 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))002 -t 120 -d -P 5 &
    ssh 10.224.158.113 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))003 -t 120 -d -P 5 &
done

Any ideas?

Regards, Koen

From jlentini at netapp.com Tue May 22 08:43:44 2007
From: jlentini at netapp.com (James Lentini)
Date: Tue, 22 May 2007 11:43:44 -0400 (EDT)
Subject: [ofa-general] Re: [IPoIB][RFC] remove redundant gid query
In-Reply-To:
References:
Message-ID:

On Mon, 21 May 2007, Roland Dreier wrote:

> > > It does look like we're doing some work we don't need to do. However
> > > ipoib_add_port() could run before an SM has brought up the local port,
> >
> > The same could be true for ipoib_mcast_join_task()
> >
> > These are both instances of the general problem that if the GID at
> > index 0 changes, the IPoIB code is not automatically notified. Agree?
>
> Yes, although what is there now should be semi-OK: a multicast join
> can't succeed until the port is up, so ipoib should eventually get the
> right GID.
> And I would argue that an SM that changes a port's GID
> prefix without at least generating a client reregister event is broken.

Expecting the SM to request a client reregister is reasonable.

From IPoIB down, everything seems OK. I'm wondering about the layers above IPoIB. When ipoib_add_port() calls register_netdev(), there is at least one place where the networking stack examines the dev_addr value (see rtnl_fill_ifinfo(), where a netlink message is created with the device and broadcast hardware addresses). If the GID changes, the IPoIB net_device's dev_addr changes. IPoIB doesn't inform the upper layers when this happens.

> > > so the GID prefix might change later.
> > >
> > > I'm not sure what the best way to clean this up is.
> >
> > As an aside: Why does ipoib_add_port() treat an error return from
> > ib_query_gid() as fatal while ipoib_mcast_join_task() only emits a
> > warning?
>
> I guess because it's much easier to bail out of ipoib_add_port() than
> it is to do something intelligent in ipoib_mcast_join_task().

Would adding a warning if the GID changes be of use?

Signed-off-by: James Lentini [jlentini at netapp.com]

--- a/drivers/infiniband/ulp/ipoib/ipoib.h 2007-04-25 23:08:32.000000000 -0400
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-22 11:17:52.000000000 -0400
@@ -563,11 +563,18 @@
 		if (mcast_debug_level > 0) \
 			ipoib_printk(KERN_DEBUG, priv, format , ## arg); \
 	} while (0)
+
+#define ipoib_dbg_chkgid(priv, a, b) \
+	do { \
+		if (memcmp((a), (b), sizeof (union ib_gid))) \
+			ipoib_warn(priv, "gid changed\n"); \
+	} while (0)
 #else /* CONFIG_INFINIBAND_IPOIB_DEBUG */
 #define ipoib_dbg(priv, format, arg...) \
 	do { (void) (priv); } while (0)
 #define ipoib_dbg_mcast(priv, format, arg...) \
 	do { (void) (priv); } while (0)
+#define ipoib_dbg_chkgid(priv, a, b)
 #endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */

 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-05-22 11:07:30.000000000 -0400
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-05-22 11:17:22.000000000 -0400
@@ -525,8 +525,10 @@

 	if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid))
 		ipoib_warn(priv, "ib_gid_entry_get() failed\n");
-	else
+	else {
+		ipoib_dbg_chkgid(priv, priv->dev->dev_addr + 4, priv->local_gid.raw);
 		memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid));
+	}

 	{
 		struct ib_port_attr attr;

From mshefty at ichips.intel.com Tue May 22 09:38:18 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 22 May 2007 09:38:18 -0700
Subject: [ofa-general] cm.c and irqsave (was Re: [PATCH] ib/cm: fix stale connection detection)
In-Reply-To: <20070522051640.GA23066@mellanox.co.il>
References: <20070520134441.GI20649@mellanox.co.il> <000101c79c09$74f15440$54c8180a@amr.corp.intel.com> <20070522051640.GA23066@mellanox.co.il>
Message-ID: <46531C7A.3060201@ichips.intel.com>

> On an unrelated note, it looks like cm.c would benefit from an irqsave
> diet: it seems to perform work almost exclusively from thread
> context, so just spin_lock_irq is sure to be enough.

I don't think everything is done at thread context (that depends on the ULP), but it could definitely replace irqsave with just irq in any of the message processing code.

- Sean

From xma at us.ibm.com Tue May 22 09:45:28 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 22 May 2007 09:45:28 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
Message-ID:

Hello Koen,

Judging from the switch log, it looks like an SM issue to me. The node was kicked out of the membership. Which SM are you using in your fabric?

Thanks
Shirley Ma

From mshefty at ichips.intel.com Tue May 22 10:12:42 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 22 May 2007 10:12:42 -0700
Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <20070522075952.GC3331@mellanox.co.il>
References: <20070520134441.GI20649@mellanox.co.il> <000101c79c09$74f15440$54c8180a@amr.corp.intel.com> <20070522075952.GC3331@mellanox.co.il>
Message-ID: <4653248A.1040108@ichips.intel.com>

> The patch looks obvious enough for 2.6.22, safe enough in that it replaces a
> timeout with a reject, and it addresses a real problem. Sean? Roland? What do
> you think?

To make it easier, I've added the patch to:

git://git.openfabrics.org/~shefty/rdma-dev.git for-roland

commit 2fbe169db0c6bddcc7b28d03eb51d057277ffd6a

I'm comfortable with this merging into 2.6.22 myself.

> Something that I think would be very useful for 2.6.22 already: could you please
> document which portions of chapter 12 are not currently implemented in cm.c, and
> put this in some file in kernel tree? This way people will be able to figure
> out whether something that they need is missing, and contribute.

This isn't something that I know without comparing the code against the spec.

>> Also, I left the duplicate request handling
>> as it was, since that should go in as a separate patch.
>
> Could you please describe what is missing currently?
> Is the missing handling likely to cause timeouts?

If two REQs are received with matching local IDs, but the REQs themselves differ on one or more fields, such as the QPN, the second REQ is dropped as a duplicate. This causes timeouts, so I need to figure out what the correct behavior should be here.
- Sean

From weiny2 at llnl.gov Tue May 22 10:23:27 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 22 May 2007 10:23:27 -0700
Subject: [ofa-general] [PATCH] ib_types.h: Change macros to convert from "host" byte order to "network"
Message-ID: <20070522102327.0cea4153.weiny2@llnl.gov>

From 7e53267d5bc9389f5f1a4dae3a2d290c69c6e1d4 Mon Sep 17 00:00:00 2001
From: Ira K. Weiny
Date: Tue, 24 Apr 2007 16:07:19 -0700
Subject: [PATCH] Change macros to convert from "host" byte order to "network"

Although the macros CL_HTON* and CL_NTOH* are defined to be the same operation it is technically incorrect to convert a constant from network byte order. The constant should be converted from host byte order to network byte order.

Signed-off-by: Ira K. Weiny
---
 opensm/include/iba/ib_types.h | 180 ++++++++++++++++++++--------------------
 1 files changed, 90 insertions(+), 90 deletions(-)

diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index aee7024..f6e85a4 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -157,13 +157,13 @@ BEGIN_C_DECLS * * SOURCE */ -#define IB_QP1_WELL_KNOWN_Q_KEY CL_NTOH32(0x80010000) +#define IB_QP1_WELL_KNOWN_Q_KEY CL_HTON32(0x80010000) /*********/ #define IB_QP0 0 -#define IB_QP1 CL_NTOH32(1) +#define IB_QP1 CL_HTON32(1) -#define IB_QP_PRIVILEGED_Q_KEY CL_NTOH32(0x80000000) +#define IB_QP_PRIVILEGED_Q_KEY CL_HTON32(0x80000000) /****d* IBA Base: Constants/IB_LID_UCAST_START * NAME @@ -405,7 +405,7 @@ BEGIN_C_DECLS * * SOURCE */ -#define IB_PKEY_TYPE_MASK (CL_NTOH16(0x8000)) +#define IB_PKEY_TYPE_MASK (CL_HTON16(0x8000)) /*********/ /****d* IBA Base: Constants/IB_DEFAULT_PARTIAL_PKEY @@ -967,7 +967,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_CLASS_PORT_INFO (CL_NTOH16(0x0001)) +#define IB_MAD_ATTR_CLASS_PORT_INFO (CL_HTON16(0x0001)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_NOTICE @@ -979,7 +979,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_NOTICE 
(CL_NTOH16(0x0002)) +#define IB_MAD_ATTR_NOTICE (CL_HTON16(0x0002)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_INFORM_INFO @@ -991,7 +991,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_INFORM_INFO (CL_NTOH16(0x0003)) +#define IB_MAD_ATTR_INFORM_INFO (CL_HTON16(0x0003)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_NODE_DESC @@ -1003,7 +1003,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_NODE_DESC (CL_NTOH16(0x0010)) +#define IB_MAD_ATTR_NODE_DESC (CL_HTON16(0x0010)) /****d* IBA Base: Constants/IB_MAD_ATTR_PORT_SMPL_CTRL * NAME @@ -1014,7 +1014,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_PORT_SMPL_CTRL (CL_NTOH16(0x0010)) +#define IB_MAD_ATTR_PORT_SMPL_CTRL (CL_HTON16(0x0010)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_NODE_INFO @@ -1026,7 +1026,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_NODE_INFO (CL_NTOH16(0x0011)) +#define IB_MAD_ATTR_NODE_INFO (CL_HTON16(0x0011)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_PORT_SMPL_RSLT @@ -1038,7 +1038,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_PORT_SMPL_RSLT (CL_NTOH16(0x0011)) +#define IB_MAD_ATTR_PORT_SMPL_RSLT (CL_HTON16(0x0011)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_SWITCH_INFO @@ -1050,7 +1050,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_SWITCH_INFO (CL_NTOH16(0x0012)) +#define IB_MAD_ATTR_SWITCH_INFO (CL_HTON16(0x0012)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_PORT_CNTRS @@ -1062,7 +1062,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_PORT_CNTRS (CL_NTOH16(0x0012)) +#define IB_MAD_ATTR_PORT_CNTRS (CL_HTON16(0x0012)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_GUID_INFO @@ -1074,7 +1074,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_GUID_INFO (CL_NTOH16(0x0014)) +#define IB_MAD_ATTR_GUID_INFO (CL_HTON16(0x0014)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_PORT_INFO @@ -1086,7 +1086,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define 
IB_MAD_ATTR_PORT_INFO (CL_NTOH16(0x0015)) +#define IB_MAD_ATTR_PORT_INFO (CL_HTON16(0x0015)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_P_KEY_TABLE @@ -1098,7 +1098,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_P_KEY_TABLE (CL_NTOH16(0x0016)) +#define IB_MAD_ATTR_P_KEY_TABLE (CL_HTON16(0x0016)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_SLVL_TABLE @@ -1110,7 +1110,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_SLVL_TABLE (CL_NTOH16(0x0017)) +#define IB_MAD_ATTR_SLVL_TABLE (CL_HTON16(0x0017)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_VL_ARBITRATION @@ -1122,7 +1122,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_VL_ARBITRATION (CL_NTOH16(0x0018)) +#define IB_MAD_ATTR_VL_ARBITRATION (CL_HTON16(0x0018)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_LIN_FWD_TBL @@ -1134,7 +1134,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_LIN_FWD_TBL (CL_NTOH16(0x0019)) +#define IB_MAD_ATTR_LIN_FWD_TBL (CL_HTON16(0x0019)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_RND_FWD_TBL @@ -1146,7 +1146,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_RND_FWD_TBL (CL_NTOH16(0x001A)) +#define IB_MAD_ATTR_RND_FWD_TBL (CL_HTON16(0x001A)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_MCAST_FWD_TBL @@ -1158,7 +1158,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_MCAST_FWD_TBL (CL_NTOH16(0x001B)) +#define IB_MAD_ATTR_MCAST_FWD_TBL (CL_HTON16(0x001B)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_NODE_RECORD @@ -1170,7 +1170,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_NODE_RECORD (CL_NTOH16(0x0011)) +#define IB_MAD_ATTR_NODE_RECORD (CL_HTON16(0x0011)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_PORTINFO_RECORD @@ -1182,7 +1182,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_PORTINFO_RECORD (CL_NTOH16(0x0012)) +#define IB_MAD_ATTR_PORTINFO_RECORD (CL_HTON16(0x0012)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_SWITCH_INFO_RECORD @@ -1194,7 
+1194,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_SWITCH_INFO_RECORD (CL_NTOH16(0x0014)) +#define IB_MAD_ATTR_SWITCH_INFO_RECORD (CL_HTON16(0x0014)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_LINK_RECORD @@ -1206,7 +1206,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_LINK_RECORD (CL_NTOH16(0x0020)) +#define IB_MAD_ATTR_LINK_RECORD (CL_HTON16(0x0020)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_SM_INFO @@ -1218,7 +1218,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_SM_INFO (CL_NTOH16(0x0020)) +#define IB_MAD_ATTR_SM_INFO (CL_HTON16(0x0020)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_SMINFO_RECORD @@ -1230,7 +1230,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_SMINFO_RECORD (CL_NTOH16(0x0018)) +#define IB_MAD_ATTR_SMINFO_RECORD (CL_HTON16(0x0018)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_GUIDINFO_RECORD @@ -1242,7 +1242,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_GUIDINFO_RECORD (CL_NTOH16(0x0030)) +#define IB_MAD_ATTR_GUIDINFO_RECORD (CL_HTON16(0x0030)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_VENDOR_DIAG @@ -1254,7 +1254,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_VENDOR_DIAG (CL_NTOH16(0x0030)) +#define IB_MAD_ATTR_VENDOR_DIAG (CL_HTON16(0x0030)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_LED_INFO @@ -1266,7 +1266,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_LED_INFO (CL_NTOH16(0x0031)) +#define IB_MAD_ATTR_LED_INFO (CL_HTON16(0x0031)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_SERVICE_RECORD @@ -1278,7 +1278,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_SERVICE_RECORD (CL_NTOH16(0x0031)) +#define IB_MAD_ATTR_SERVICE_RECORD (CL_HTON16(0x0031)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_LFT_RECORD @@ -1290,7 +1290,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_LFT_RECORD (CL_NTOH16(0x0015)) +#define IB_MAD_ATTR_LFT_RECORD (CL_HTON16(0x0015)) /**********/ /****d* IBA Base: 
Constants/IB_MAD_ATTR_MFT_RECORD @@ -1302,7 +1302,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_MFT_RECORD (CL_NTOH16(0x0017)) +#define IB_MAD_ATTR_MFT_RECORD (CL_HTON16(0x0017)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_PKEYTBL_RECORD @@ -1314,7 +1314,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_PKEY_TBL_RECORD (CL_NTOH16(0x0033)) +#define IB_MAD_ATTR_PKEY_TBL_RECORD (CL_HTON16(0x0033)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_PATH_RECORD @@ -1326,7 +1326,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_PATH_RECORD (CL_NTOH16(0x0035)) +#define IB_MAD_ATTR_PATH_RECORD (CL_HTON16(0x0035)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_VLARB_RECORD @@ -1338,7 +1338,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_VLARB_RECORD (CL_NTOH16(0x0036)) +#define IB_MAD_ATTR_VLARB_RECORD (CL_HTON16(0x0036)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_SLVL_RECORD @@ -1350,7 +1350,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_SLVL_RECORD (CL_NTOH16(0x0013)) +#define IB_MAD_ATTR_SLVL_RECORD (CL_HTON16(0x0013)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_MCMEMBER_RECORD @@ -1362,7 +1362,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_MCMEMBER_RECORD (CL_NTOH16(0x0038)) +#define IB_MAD_ATTR_MCMEMBER_RECORD (CL_HTON16(0x0038)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_TRACE_RECORD @@ -1374,7 +1374,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_TRACE_RECORD (CL_NTOH16(0x0039)) +#define IB_MAD_ATTR_TRACE_RECORD (CL_HTON16(0x0039)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_MULTIPATH_RECORD @@ -1386,7 +1386,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_MULTIPATH_RECORD (CL_NTOH16(0x003A)) +#define IB_MAD_ATTR_MULTIPATH_RECORD (CL_HTON16(0x003A)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_SVC_ASSOCIATION_RECORD @@ -1398,7 +1398,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_SVC_ASSOCIATION_RECORD 
(CL_NTOH16(0x003B)) +#define IB_MAD_ATTR_SVC_ASSOCIATION_RECORD (CL_HTON16(0x003B)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_INFORM_INFO_RECORD @@ -1410,7 +1410,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_INFORM_INFO_RECORD (CL_NTOH16(0x00F3)) +#define IB_MAD_ATTR_INFORM_INFO_RECORD (CL_HTON16(0x00F3)) /****d* IBA Base: Constants/IB_MAD_ATTR_IO_UNIT_INFO * NAME @@ -1421,7 +1421,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_IO_UNIT_INFO (CL_NTOH16(0x0010)) +#define IB_MAD_ATTR_IO_UNIT_INFO (CL_HTON16(0x0010)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_IO_CONTROLLER_PROFILE @@ -1433,7 +1433,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_IO_CONTROLLER_PROFILE (CL_NTOH16(0x0011)) +#define IB_MAD_ATTR_IO_CONTROLLER_PROFILE (CL_HTON16(0x0011)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_SERVICE_ENTRIES @@ -1445,7 +1445,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_SERVICE_ENTRIES (CL_NTOH16(0x0012)) +#define IB_MAD_ATTR_SERVICE_ENTRIES (CL_HTON16(0x0012)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_DIAGNOSTIC_TIMEOUT @@ -1457,7 +1457,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_DIAGNOSTIC_TIMEOUT (CL_NTOH16(0x0020)) +#define IB_MAD_ATTR_DIAGNOSTIC_TIMEOUT (CL_HTON16(0x0020)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_PREPARE_TO_TEST @@ -1469,7 +1469,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_PREPARE_TO_TEST (CL_NTOH16(0x0021)) +#define IB_MAD_ATTR_PREPARE_TO_TEST (CL_HTON16(0x0021)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_TEST_DEVICE_ONCE @@ -1481,7 +1481,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_TEST_DEVICE_ONCE (CL_NTOH16(0x0022)) +#define IB_MAD_ATTR_TEST_DEVICE_ONCE (CL_HTON16(0x0022)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_TEST_DEVICE_LOOP @@ -1493,7 +1493,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_TEST_DEVICE_LOOP (CL_NTOH16(0x0023)) +#define IB_MAD_ATTR_TEST_DEVICE_LOOP 
(CL_HTON16(0x0023)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_DIAG_CODE @@ -1505,7 +1505,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_DIAG_CODE (CL_NTOH16(0x0024)) +#define IB_MAD_ATTR_DIAG_CODE (CL_HTON16(0x0024)) /**********/ /****d* IBA Base: Constants/IB_MAD_ATTR_SVC_ASSOCIATION_RECORD @@ -1517,7 +1517,7 @@ ib_class_is_rmpp( * * SOURCE */ -#define IB_MAD_ATTR_SVC_ASSOCIATION_RECORD (CL_NTOH16(0x003B)) +#define IB_MAD_ATTR_SVC_ASSOCIATION_RECORD (CL_HTON16(0x003B)) /**********/ /****d* IBA Base: Constants/IB_NODE_TYPE_CA @@ -4084,8 +4084,8 @@ ib_sa_mad_get_payload_ptr( * ib_mad_t *********/ -#define IB_NODE_INFO_PORT_NUM_MASK (CL_NTOH32(0xFF000000)) -#define IB_NODE_INFO_VEND_ID_MASK (CL_NTOH32(0x00FFFFFF)) +#define IB_NODE_INFO_PORT_NUM_MASK (CL_HTON32(0xFF000000)) +#define IB_NODE_INFO_VEND_ID_MASK (CL_HTON32(0x00FFFFFF)) #if CPU_LE #define IB_NODE_INFO_PORT_NUM_SHIFT 0 #else @@ -4246,38 +4246,38 @@ typedef struct _ib_port_info #define IB_PORT_PHYS_STATE_PHYTEST 7 #define IB_PORT_LNKDWNDFTSTATE_MASK 0x0F -#define IB_PORT_CAP_RESV0 (CL_NTOH32(0x00000001)) -#define IB_PORT_CAP_IS_SM (CL_NTOH32(0x00000002)) -#define IB_PORT_CAP_HAS_NOTICE (CL_NTOH32(0x00000004)) -#define IB_PORT_CAP_HAS_TRAP (CL_NTOH32(0x00000008)) -#define IB_PORT_CAP_HAS_IPD (CL_NTOH32(0x00000010)) -#define IB_PORT_CAP_HAS_AUTO_MIG (CL_NTOH32(0x00000020)) -#define IB_PORT_CAP_HAS_SL_MAP (CL_NTOH32(0x00000040)) -#define IB_PORT_CAP_HAS_NV_MKEY (CL_NTOH32(0x00000080)) -#define IB_PORT_CAP_HAS_NV_PKEY (CL_NTOH32(0x00000100)) -#define IB_PORT_CAP_HAS_LED_INFO (CL_NTOH32(0x00000200)) -#define IB_PORT_CAP_SM_DISAB (CL_NTOH32(0x00000400)) -#define IB_PORT_CAP_HAS_SYS_IMG_GUID (CL_NTOH32(0x00000800)) -#define IB_PORT_CAP_HAS_PKEY_SW_EXT_PORT_TRAP (CL_NTOH32(0x00001000)) -#define IB_PORT_CAP_RESV13 (CL_NTOH32(0x00002000)) -#define IB_PORT_CAP_RESV14 (CL_NTOH32(0x00004000)) -#define IB_PORT_CAP_RESV15 (CL_NTOH32(0x00008000)) -#define IB_PORT_CAP_HAS_COM_MGT (CL_NTOH32(0x00010000)) 
-#define IB_PORT_CAP_HAS_SNMP (CL_NTOH32(0x00020000)) -#define IB_PORT_CAP_REINIT (CL_NTOH32(0x00040000)) -#define IB_PORT_CAP_HAS_DEV_MGT (CL_NTOH32(0x00080000)) -#define IB_PORT_CAP_HAS_VEND_CLS (CL_NTOH32(0x00100000)) -#define IB_PORT_CAP_HAS_DR_NTC (CL_NTOH32(0x00200000)) -#define IB_PORT_CAP_HAS_CAP_NTC (CL_NTOH32(0x00400000)) -#define IB_PORT_CAP_HAS_BM (CL_NTOH32(0x00800000)) -#define IB_PORT_CAP_HAS_LINK_RT_LATENCY (CL_NTOH32(0x01000000)) -#define IB_PORT_CAP_HAS_CLIENT_REREG (CL_NTOH32(0x02000000)) -#define IB_PORT_CAP_RESV26 (CL_NTOH32(0x04000000)) -#define IB_PORT_CAP_RESV27 (CL_NTOH32(0x08000000)) -#define IB_PORT_CAP_RESV28 (CL_NTOH32(0x10000000)) -#define IB_PORT_CAP_RESV29 (CL_NTOH32(0x20000000)) -#define IB_PORT_CAP_RESV30 (CL_NTOH32(0x40000000)) -#define IB_PORT_CAP_RESV31 (CL_NTOH32(0x80000000)) +#define IB_PORT_CAP_RESV0 (CL_HTON32(0x00000001)) +#define IB_PORT_CAP_IS_SM (CL_HTON32(0x00000002)) +#define IB_PORT_CAP_HAS_NOTICE (CL_HTON32(0x00000004)) +#define IB_PORT_CAP_HAS_TRAP (CL_HTON32(0x00000008)) +#define IB_PORT_CAP_HAS_IPD (CL_HTON32(0x00000010)) +#define IB_PORT_CAP_HAS_AUTO_MIG (CL_HTON32(0x00000020)) +#define IB_PORT_CAP_HAS_SL_MAP (CL_HTON32(0x00000040)) +#define IB_PORT_CAP_HAS_NV_MKEY (CL_HTON32(0x00000080)) +#define IB_PORT_CAP_HAS_NV_PKEY (CL_HTON32(0x00000100)) +#define IB_PORT_CAP_HAS_LED_INFO (CL_HTON32(0x00000200)) +#define IB_PORT_CAP_SM_DISAB (CL_HTON32(0x00000400)) +#define IB_PORT_CAP_HAS_SYS_IMG_GUID (CL_HTON32(0x00000800)) +#define IB_PORT_CAP_HAS_PKEY_SW_EXT_PORT_TRAP (CL_HTON32(0x00001000)) +#define IB_PORT_CAP_RESV13 (CL_HTON32(0x00002000)) +#define IB_PORT_CAP_RESV14 (CL_HTON32(0x00004000)) +#define IB_PORT_CAP_RESV15 (CL_HTON32(0x00008000)) +#define IB_PORT_CAP_HAS_COM_MGT (CL_HTON32(0x00010000)) +#define IB_PORT_CAP_HAS_SNMP (CL_HTON32(0x00020000)) +#define IB_PORT_CAP_REINIT (CL_HTON32(0x00040000)) +#define IB_PORT_CAP_HAS_DEV_MGT (CL_HTON32(0x00080000)) +#define IB_PORT_CAP_HAS_VEND_CLS (CL_HTON32(0x00100000)) 
+#define IB_PORT_CAP_HAS_DR_NTC (CL_HTON32(0x00200000)) +#define IB_PORT_CAP_HAS_CAP_NTC (CL_HTON32(0x00400000)) +#define IB_PORT_CAP_HAS_BM (CL_HTON32(0x00800000)) +#define IB_PORT_CAP_HAS_LINK_RT_LATENCY (CL_HTON32(0x01000000)) +#define IB_PORT_CAP_HAS_CLIENT_REREG (CL_HTON32(0x02000000)) +#define IB_PORT_CAP_RESV26 (CL_HTON32(0x04000000)) +#define IB_PORT_CAP_RESV27 (CL_HTON32(0x08000000)) +#define IB_PORT_CAP_RESV28 (CL_HTON32(0x10000000)) +#define IB_PORT_CAP_RESV29 (CL_HTON32(0x20000000)) +#define IB_PORT_CAP_RESV30 (CL_HTON32(0x40000000)) +#define IB_PORT_CAP_RESV31 (CL_HTON32(0x80000000)) /****f* IBA Base: Types/ib_port_info_get_port_state * NAME @@ -10208,7 +10208,7 @@ typedef uint32_t ib_mr_mod_t; * * SOURCE */ -#define IB_SMINFO_ATTR_MOD_HANDOVER (CL_NTOH32(0x000001)) +#define IB_SMINFO_ATTR_MOD_HANDOVER (CL_HTON32(0x000001)) /**********/ /****d* IBA Base: Constants/IB_SMINFO_ATTR_MOD_ACKNOWLEDGE @@ -10220,7 +10220,7 @@ typedef uint32_t ib_mr_mod_t; * * SOURCE */ -#define IB_SMINFO_ATTR_MOD_ACKNOWLEDGE (CL_NTOH32(0x000002)) +#define IB_SMINFO_ATTR_MOD_ACKNOWLEDGE (CL_HTON32(0x000002)) /**********/ /****d* IBA Base: Constants/IB_SMINFO_ATTR_MOD_DISABLE @@ -10232,7 +10232,7 @@ typedef uint32_t ib_mr_mod_t; * * SOURCE */ -#define IB_SMINFO_ATTR_MOD_DISABLE (CL_NTOH32(0x000003)) +#define IB_SMINFO_ATTR_MOD_DISABLE (CL_HTON32(0x000003)) /**********/ /****d* IBA Base: Constants/IB_SMINFO_ATTR_MOD_STANDBY @@ -10244,7 +10244,7 @@ typedef uint32_t ib_mr_mod_t; * * SOURCE */ -#define IB_SMINFO_ATTR_MOD_STANDBY (CL_NTOH32(0x000004)) +#define IB_SMINFO_ATTR_MOD_STANDBY (CL_HTON32(0x000004)) /**********/ /****d* IBA Base: Constants/IB_SMINFO_ATTR_MOD_DISCOVER @@ -10256,7 +10256,7 @@ typedef uint32_t ib_mr_mod_t; * * SOURCE */ -#define IB_SMINFO_ATTR_MOD_DISCOVER (CL_NTOH32(0x000005)) +#define IB_SMINFO_ATTR_MOD_DISCOVER (CL_HTON32(0x000005)) /**********/ /****s* Access Layer/ib_ci_op_t -- 1.4.4 -------------- next part -------------- An embedded and 
charset-unspecified text was scrubbed... Name: 0001-Change-macros-to-convert-from-host-byte-order-to-network.txt URL:

From rdreier at cisco.com Tue May 22 11:09:46 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 22 May 2007 11:09:46 -0700
Subject: [ofa-general] Re: skb queue management in ipoib
In-Reply-To: <20070522063634.GB3331@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 22 May 2007 09:36:34 +0300")
References: <20070522063634.GB3331@mellanox.co.il>
Message-ID:

> I think that managing this queue in a FIFO manner, dropping
> old packets and inserting new ones instead would be better:
> an older packet has more chance to have been timed out.

Yes, that probably makes sense.

> So we would do something along the lines of:
>
> 	__skb_queue_tail(&neigh->queue, skb);
> 	if (skb_queue_len(&neigh->queue) > IPOIB_MAX_PATH_REC_QUEUE) {
> 		skb = __skb_dequeue_tail(&neigh->queue);

this should just be __skb_dequeue though...

> 		ipoib_warn(priv, "queue length limit %d. Packet drop.\n",
> 			   skb_queue_len(&neigh->queue));
> 		goto err_drop;
> 	}

From koen.segers at vrt.be Tue May 22 11:17:25 2007
From: koen.segers at vrt.be (Koen Segers)
Date: Tue, 22 May 2007 20:17:25 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
References:
Message-ID: <1179857845.9528.6.camel@KOEN>

On Tue, 2007-05-22 at 08:34 -0700, Scott Weitzenkamp (sweitzen) wrote:
> What server model and CPU model do you have?

cat /proc/cpuinfo
processor : 7
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 8218
stepping : 2
cpu MHz : 2600.202
cache size : 1024 KB
physical id : 3
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips : 5200.54
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

>
> This could be https://bugs.openfabrics.org//show_bug.cgi?id=229. Try
> setting RENICE_IB_MAD=yes in /etc/infiniband/openibd.conf, then reboot
> or run /etc/init.d/openibd restart, and see if that helps.

AHA, this is interesting. I'll do it tomorrow!

>
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>
> ______________________________________________________________
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> SEGERS Koen
> Sent: Tuesday, May 22, 2007 6:44 AM
> To: Ami Perlmutter; Shirley Ma
> Cc: general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> I did the iperf tests on servers with OFED-1.2-RC3.
>
> It also gives the same result. Actually, it is even worse:
> when the interface dies, it gets in PORT_INIT state, but it
> doesn’t go to PORT_ACTIVE again. At least not within 10
> minutes.
> > > > I’ll give you the test script I ran: > > > > ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf > -s -p 5001 & > > ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf > -s -p 5002 & > > ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf > -s -p 5003 & > > ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf > -s -p 6001 & > > ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf > -s -p 6002 & > > ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf > -s -p 6003 & > > ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf > -s -p 7001 & > > ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf > -s -p 7002 & > > ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf > -s -p 7003 & > > ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf > -s -p 8001 & > > ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf > -s -p 8002 & > > ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf > -s -p 8003 & > > > > sleep 5 > > > > for i in 14 15 16 17 > > do > > ssh 10.224.158.111 LD_PRELOAD=libsdp.so > SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))001 -t 120 > -d -P 5 & > > ssh 10.224.158.112 LD_PRELOAD=libsdp.so > SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))002 -t 120 > -d -P 5 & > > ssh 10.224.158.113 LD_PRELOAD=libsdp.so > SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))003 -t 120 > -d -P 5 & > > done > > > > Any ideas? > > > > Regards, > > > > Koen > > > ______________________________________________________________ > Van: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] Namens SEGERS > Koen > Verzonden: dinsdag 22 mei 2007 10:55 > Aan: Ami Perlmutter; Shirley Ma > CC: general-bounces at lists.openfabrics.org; > general at lists.openfabrics.org > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > GPFS keeps its connection constantly open. 
> > > > We did some more tests with iperf: > > If we don’t run bidirectional tests, all connections keeps > running smoothly. If we add bidirectional tests, it becomes > unstable. Certainly if this is done on multiple nodes. Is this > normal? > > > > The failed iperf tests give the same error in the switch log: > > May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: > Generate SM OUT_OF_SERVICE trap for > GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71 > > May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: > Generate SM DELETE_MC_GROUP trap for > GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71 > > May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: > Configuration caused by discovering removed ports > > May 22 08:15:00 topspin-120sc ib_sm.x[621]: %IB-6-INFO: > Program switch port state to down, > node=00:05:ad:00:00:0b:a2:cc, port= 6, due to non-responding > CA > > May 22 08:15:00 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: > port down - port=1/6, type=ib4xTXP > > May 22 08:15:00 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: > in portTblFindEntry() - IfIndex=70(1/6) > > May 22 08:15:00 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: > cannot find entry - IfIndex=70(1/6) > > May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: > Configuration caused by discovering new ports > > May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: > Configuration caused by multicast membership change > > May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: > Generate SM IN_SERVICE trap for > GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71 > > May 22 08:15:05 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: > port up - port=1/6, type=ib4xTXP > > May 22 08:15:07 topspin-120sc ib_sm.x[632]: %IB-6-INFO: > Generate SM CREATE_MC_GROUP trap for > GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71 > > May 22 08:15:08 topspin-120sc ib_sm.x[618]: %IB-6-INFO: > Configuration caused by multicast membership change > > > > RC3 is just installed. 
Results will follow soon. > > > > Regards, > > > > Koen > > > > > ______________________________________________________________ > Van: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] > Verzonden: dinsdag 22 mei 2007 10:33 > Aan: Shirley Ma > CC: SEGERS Koen; general-bounces at lists.openfabrics.org; > general at lists.openfabrics.org > Onderwerp: Re: [ofa-general] GPFS node loses IB-connection > > > > > does the application constantly open and close connections? > > *** Disclaimer *** > > Vlaamse Radio- en Televisieomroep > Auguste Reyerslaan 52, 1043 Brussel > > nv van publiek recht > BTW BE 0244.142.664 > RPR Brussel > http://www.vrt.be/disclaimer > > > *** Disclaimer *** > > Vlaamse Radio- en Televisieomroep > Auguste Reyerslaan 52, 1043 Brussel > > nv van publiek recht > BTW BE 0244.142.664 > RPR Brussel > http://www.vrt.be/disclaimer > *** Disclaimer *** Vlaamse Radio- en Televisieomroep Auguste Reyerslaan 52, 1043 Brussel nv van publiek recht BTW BE 0244.142.664 RPR Brussel http://www.vrt.be/disclaimer From koen.segers at vrt.be Tue May 22 11:14:46 2007 From: koen.segers at vrt.be (Koen Segers) Date: Tue, 22 May 2007 20:14:46 +0200 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: References: Message-ID: <1179857686.9528.3.camel@KOEN> Hi, It is the Cisco SM. SFS-7000P> show version ================================================================================ System Version Information ================================================================================ system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32 contact : tac at cisco.com name : SFS-7000P location : 170 West Tasman Drive, San Jose, CA 95134 up-time : 11(d):7(h):49(m):3(s) last-change : none last-config-save : none action : none result : none oper-mode : normal There is also a command that gives the SM version, but I can't find it right now. 

On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> Hello Koen,
>
> From the switch log, it looks like an SM issue to me. The node was kicked
> out of the membership. Which SM are you using in your fabric?
>
> Thanks
> Shirley Ma

From xma at us.ibm.com Tue May 22 11:29:43 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 22 May 2007 11:29:43 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1179857686.9528.3.camel@KOEN>
Message-ID:

Koen,

So it is most likely that you hit the same bug as 229 (which Scott pointed out earlier). The same workaround, renicing ib_mad as Scott suggested, might work for you.

I think this should be a tunable SM query timeout value in the Cisco SM. Am I right, Scott?

Thanks
Shirley Ma
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From halr at voltaire.com Tue May 22 11:30:07 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 22 May 2007 14:30:07 -0400
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
References:
Message-ID: <1179858607.16831.20544.camel@hal.voltaire.com>

On Tue, 2007-05-22 at 12:45, Shirley Ma wrote:
> Hello Koen,
>
> From the switch log, it looks like an SM issue to me. The node was kicked
> out of the membership.

Are you referring to the following messages:

May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71
May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71

followed later by:

May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71
May 22 08:15:05 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port up - port=1/6, type=ib4xTXP
May 22 08:15:07 topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71

This is the IPv6 SNM (solicited-node multicast) group for that node, which is coming and going as the node itself comes and goes (see the port up and down events in the log as well).

-- Hal

> Which SM are you using in your fabric?
> Thanks
> Shirley Ma

From umaxx at oleco.net Tue May 22 12:04:30 2007
From: umaxx at oleco.net (Joerg Zinke)
Date: Tue, 22 May 2007 21:04:30 +0200
Subject: [ofa-general] mmap() and ibv_reg_mr() and RDMA
Message-ID: <20070522210430.5df75050@marvin.local>

> resend, first try did not arrive

Hi,

I want to do RDMA-write on mmap'ed memory.
but it fails to register the memory region.

is there something special, to use ibv_reg_mr() on memory which I got from mmap()?
it works fine with plain allocated memory (with memalign()).
memory is page-aligned in both cases.

Regards, Joerg

From sweitzen at cisco.com Tue May 22 12:59:27 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 22 May 2007 12:59:27 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
References: <1179857686.9528.3.camel@KOEN>
Message-ID:

Yes, you can tune it. Here's an example via the switch CLI:

SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout

The default is 10 seconds, it can be configured up to 2000 seconds. If a HCA is completely unresponsive for longer than the node-timeout value, then we consider that HCA out of service.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sweitzen at cisco.com Tue May 22 13:23:16 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 22 May 2007 13:23:16 -0700
Subject: [ofa-general] What causes "ib0: packet len 65520 (> 2048) too long to send, dropping" messages?
Message-ID:

I see a small number of these types of messages, when I send large messages via IP multicast. Why do I only see a few of the messages?

Scott
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rdreier at cisco.com Tue May 22 13:43:58 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 22 May 2007 13:43:58 -0700
Subject: [ofa-general] Re: [PATCH RFC] IB/ipoib: fix to_ipoib_neigh access race
In-Reply-To: <20070522005918.GB13311@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 22 May 2007 03:59:18 +0300")
References: <20070522005918.GB13311@mellanox.co.il>
Message-ID:

> hard_start_xmit dereferences to_ipoib_neigh when only tx_lock is taken. This
> would only be safe if all calls that modify *to_ipoib_neigh take tx_lock too.
> Currently this is not always true for ipoib_neigh_free and path_rec_completion,
> which results in memory corruption. Fix this race, making sure
> path_rec_completion and ipoib_neigh_free are always called under
> tx_lock.
>
> Signed-off-by: Michael S. Tsirkin
>
> ---
>
> I'm looking at
> https://bugs.openfabrics.org/show_bug.cgi?id=604
> and I think this could explain the crashes.
> In any case, Roland, is there a race or am I imagining things?
>
> NB: The patch is untested (I'm not at the lab now).

Yes, it does seem that there is a problem here. However, I think the first part of this needs to be handled another way -- for example:

> -	path_free(dev, path);
>  	spin_lock_irq(&priv->tx_lock);
>  	spin_lock(&priv->lock);
> +	path_free(dev, path);

path_free already takes priv->lock internally, and also calls ipoib_put_ah(), which may end up in ipoib_free_ah(), which also might take priv->lock.

It's not immediately obvious what the right fix is...

 - R.

From rdreier at cisco.com Tue May 22 13:46:04 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 22 May 2007 13:46:04 -0700
Subject: [ofa-general] mmap() and ibv_reg_mr() and RDMA
In-Reply-To: <20070522210430.5df75050@marvin.local> (Joerg Zinke's message of "Tue, 22 May 2007 21:04:30 +0200")
References: <20070522210430.5df75050@marvin.local>
Message-ID:

> I want to do RDMA-write on mmap'ed memory.
> but it fails to register the memory region.
>
> is there something special, to use ibv_reg_mr() on memory which I got
> from mmap()?
> it works fine with plain allocated memory (with memalign()).
> memory is page-aligned in both cases.

How exactly are you mmap()ing the memory? memalign(), malloc() etc are implemented with mmap() internally, so obviously memory registration of some mmap()ed memory is fine.

 - R.

From umaxx at oleco.net Tue May 22 14:25:59 2007
From: umaxx at oleco.net (Joerg Zinke)
Date: Tue, 22 May 2007 23:25:59 +0200
Subject: [ofa-general] mmap() and ibv_reg_mr() and RDMA
In-Reply-To:
References: <20070522210430.5df75050@marvin.local>
Message-ID: <20070522232559.7785a331@marvin.local>

On Tue, 22 May 2007 13:46:04 -0700 Roland Dreier wrote:

> > I want to do RDMA-write on mmap'ed memory.
> > but it fails to register the memory region.
> >
> > is there something special, to use ibv_reg_mr() on memory which I
> > got from mmap()?
> > it works fine with plain allocated memory (with memalign()).
> > memory is page-aligned in both cases.
>
> How exactly are you mmap()ing the memory? memalign(), malloc() etc
> are implemented with mmap() internally, so obviously memory
> registration of some mmap()ed memory is fine.

i created a character device in the kernel and allocated memory with kzalloc():

    if ((kmalloc_ptr = kzalloc((NPAGES + 2) * PAGE_SIZE, GFP_KERNEL | __GFP_DMA)) == NULL) {
        return -ENOMEM;
    }

rounded to page boundary:

    kmalloc_area = (struct serverinfo *)((((unsigned long)kmalloc_ptr) + PAGE_SIZE - 1) & PAGE_MASK);

this area is mapped via the character device and with the help of remap_pfn_range() into userspace... this works fine, i can access it from userspace and write/read from it:

    #define MMAP_AREA_LEN (NPAGES*getpagesize())
    ...
    mmap_area = (struct serverinfo*)mmap(0, MMAP_AREA_LEN, PROT_READ|PROT_WRITE, MAP_SHARED| MAP_LOCKED, fd, MMAP_AREA_LEN);

but when i try to register the mmap_area with ibv_reg_mr() it fails.

regards, joerg

From sean.hefty at intel.com Tue May 22 14:47:45 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 22 May 2007 14:47:45 -0700
Subject: [ofa-general] locating the index of the default PKey - possible sa_query bug?
Message-ID: <001101c79cba$d5409ca0$ff0da8c0@amr.corp.intel.com>

I've been asked to verify partition support in the IB stack. Everything I've checked so far appears fine, with one possible exception.

The sa_query module always sends MADs using pkey index 0. According to section 15.4.2 of the spec, SA MADs should be sent using the index of the default pkey. Is there any requirement that the default pkey be located at index 0? If not, are we fine placing this requirement on the SM? (I'm not aware of any actual problems occurring with the existing code.)

- Sean

From rdreier at cisco.com Tue May 22 15:19:15 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 22 May 2007 15:19:15 -0700
Subject: [ofa-general] mmap() and ibv_reg_mr() and RDMA
In-Reply-To: <20070522232559.7785a331@marvin.local> (Joerg Zinke's message of "Tue, 22 May 2007 23:25:59 +0200")
References: <20070522210430.5df75050@marvin.local> <20070522232559.7785a331@marvin.local>
Message-ID:

> this area is mapped via the character device and with
> the help of remap_pfn_range() into userspace... this works fine i can
> access it from userspace and write/read from it:

I think that's the problem. remap_pfn_range() sets VM_PFNMAP on the vma used to map the pfns. When ibv_reg_mr() calls into the kernel to do the actual mapping, it ends up doing get_user_pages() which fails in vm_normal_page() for such a vma.

I don't immediately see a good way to handle this.

 - R.
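
For comparison with the failing case above, here is a minimal userspace sketch in C of the kind of buffer that, according to the thread, registers without trouble: ordinary page-aligned heap memory rather than a remap_pfn_range() mapping. The helper names round_up_to_page() and alloc_page_aligned() are made up for illustration, and the sketch deliberately stops short of calling ibv_reg_mr(), since that needs libibverbs and an HCA. It is not a fix for the VM_PFNMAP case, only an illustration of the page-rounding arithmetic quoted earlier and of the allocation path that was reported to work.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Round addr up to the next multiple of page_size, which must be a
 * power of two. This is the same arithmetic as the
 * (ptr + PAGE_SIZE - 1) & PAGE_MASK expression quoted in the thread. */
static uintptr_t round_up_to_page(uintptr_t addr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (addr + page_size - 1) & ~(page_size - 1);
}

/* Hypothetical helper: allocate npages of zeroed, page-aligned heap
 * memory in userspace. Returns NULL on failure; release with free(). */
static void *alloc_page_aligned(size_t npages)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *buf = NULL;

    if (page_size <= 0 || npages == 0)
        return NULL;
    if (npages > SIZE_MAX / (size_t)page_size)
        return NULL; /* size computation would overflow */
    if (posix_memalign(&buf, (size_t)page_size,
                       npages * (size_t)page_size) != 0)
        return NULL;

    memset(buf, 0, npages * (size_t)page_size);
    return buf;
}
```

A buffer returned by alloc_page_aligned() is what one would then hand to ibv_reg_mr() together with its length; whether registration succeeds still depends on the HCA, the driver and locked-memory limits.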

From koen.segers at vrt.be Tue May 22 15:34:32 2007
From: koen.segers at vrt.be (Koen Segers)
Date: Wed, 23 May 2007 00:34:32 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
References: <1179857686.9528.3.camel@KOEN>
Message-ID: <1179873272.9528.27.camel@KOEN>

If I understand it right, the switch is actually polling (=pinging) the interfaces every 10s. This means that when the interface is handling other traffic, the poll can fail and the port could be considered out of service. My question is then: "How can the timeout be reached while packets are being sent/received to/from the interface?"

Anyway, what timeout value would you recommend for us? And why?

To recapitulate, these are the actions I'll take tomorrow:
1) change the MAD niceness of the servers
2) change the timeout on the switches

Are these changes sufficient for the HCAs to keep their ports in the PORT_ACTIVE state?

Regards,

Koen

From rdreier at cisco.com Tue May 22 15:35:39 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 22 May 2007 15:35:39 -0700
Subject: [ofa-general] locating the index of the default PKey - possible sa_query bug?
In-Reply-To: <001101c79cba$d5409ca0$ff0da8c0@amr.corp.intel.com> (Sean Hefty's message of "Tue, 22 May 2007 14:47:45 -0700")
References: <001101c79cba$d5409ca0$ff0da8c0@amr.corp.intel.com>
Message-ID:

> The sa_query module always sends MADs using pkey index 0.
> According to section 15.4.2 of the spec, SA MADs should be sent
> using the index of the default pkey. Is there any requirement that
> the default pkey be located at index 0? If not, are we fine
> placing this requirement on the SM? (I'm not aware of any actual
> problems occurring with the existing code.)

It does look like a (minor) bug. I don't know of any compliance statement that says the default P_Key has to be at index 0. We probably should fix sa_query to do the right thing, but I don't see it as a high priority.

 - R.

From pradeeps at linux.vnet.ibm.com Tue May 22 15:36:49 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Tue, 22 May 2007 15:36:49 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
Message-ID: <46537081.30906@linux.vnet.ibm.com>

Here are my thoughts about limiting the memory footprint for IPOIB CM (NOSRQ) patch:

By default, cap the NOSRQ memory usage to 1GB. The default recvq_size is set to 128. Therefore for 64KB packets this would imply a maximum of 128 endpoints.

-Make the maximum number of endpoints a module parameter with a default value of 128.

-The NOSRQ limit of 1GB is also made a module parameter. However, 1GB is the default limit and could be changed as needed (by the administrator) depending on the system configuration, application needs and so on.

The server would return a "REJ" message upon receiving a "REQ", whenever one of these limits (i.e. max number of endpoints or the max NOSRQ memory usage) is reached. Currently, we only check for the maximum number of endpoints -hard coded to 1024.
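
The 128-endpoint figure above can be reproduced with a little arithmetic. The following C sketch is only an illustration, under the assumptions that "1GB" means 2^30 bytes, that every connection pins recvq_size receive buffers of the full buffer size, and that no other per-connection overhead counts against the cap. The function name max_nosrq_endpoints() is hypothetical and is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many NOSRQ connections fit under a memory
 * cap if each connection holds recvq_size receive buffers of
 * buf_bytes bytes each. */
static uint64_t max_nosrq_endpoints(uint64_t mem_cap_bytes,
                                    uint64_t recvq_size,
                                    uint64_t buf_bytes)
{
    assert(recvq_size > 0 && buf_bytes > 0);
    assert(recvq_size <= UINT64_MAX / buf_bytes); /* no overflow */
    return mem_cap_bytes / (recvq_size * buf_bytes);
}
```

With a 2^30-byte cap, recvq_size = 128 and 64KB buffers, each connection needs 128 * 65536 = 8388608 bytes, so 128 connections fit. Halving the buffer size to 32KB under the same assumptions doubles that to 256.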

-The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that support SRQ, like the Topspin HCA, and such HCAs should not be impacted at all.

-Currently we allocate a default of 64KB for the ring buffer elements, and this buffer size is not linked to the mtu. In the future, we could allocate buffers based on the mtu and link that into the computation of the memory cap. This way customers who might want to use a smaller mtu could use a larger number of endpoints, or a larger recvq_size, without exceeding the memory cap.

Would this approach address the issues of scalability and enable IPOIB CM to be turned on by default?

Pradeep

From sweitzen at cisco.com Tue May 22 15:38:48 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 22 May 2007 15:38:48 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1179873272.9528.27.camel@KOEN>
References: <1179857686.9528.3.camel@KOEN> <1179873272.9528.27.camel@KOEN>
Message-ID:

It's not so much pinging every 10 seconds as expecting a response within 10 seconds (Clive, correct me if I'm wrong).

You only need to do 1) or 2), not both. Cisco configures 1) in the OFED binary RPMs we release at http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I prefer to have the host be more responsive.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

From venkatesh.babu at 3leafnetworks.com Tue May 22 17:01:32 2007
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Tue, 22 May 2007 17:01:32 -0700
Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master
In-Reply-To: <1179831181.15940.74121.camel@hal.voltaire.com>
References: <4652167F.9040709@3leafnetworks.com> <1179785796.15940.27092.camel@hal.voltaire.com> <4652542C.3010400@3leafnetworks.com> <1179805556.15940.47640.camel@hal.voltaire.com> <46528E3C.8090305@3leafnetworks.com> <1179831181.15940.74121.camel@hal.voltaire.com>
Message-ID: <4653845C.1090507@3leafnetworks.com>

Hal Rosenstock wrote:

>The one I see that might be related is the following:
>
>commit 39798695b4bcc7b145f8910ca56195808d3a7637
>Author: Roland Dreier
>Date: Mon Nov 13 09:38:07 2006 -0800
>
>    IB/mad: Fix race between cancel and receive completion
>
>    When ib_cancel_mad() is called, it puts the canceled send on a list
>    and schedules a "flushed" callback from process context. However,
>    this leaves a window where a receive completion could be processed
>    before the send is fully flushed.
>
>    This is fine, except that ib_find_send_mad() will find the MAD and
>    return it to the receive processing, which results in the sender
>    getting both a successful receive and a "flushed" send completion for
>    the same request. Understandably, this confuses the sender, which is
>    expecting only one of these two callbacks, and leads to grief such as
>    a use-after-free in IPoIB.
> > Fix this by changing ib_find_send_mad() to return a send struct only > if the status is still successful (and not "flushed"). The search of > the send_list already had this check, so this patch just adds the same > check to the search of the wait_list. > > Signed-off-by: Roland Dreier > >My search was not exhaustive. > > It looks like this may be the fix for the MAD send errors. Do you think this is the cause of opensm not grabbing the mastership from the other ? > >Are they incrementing ? Which node is this ? I think some of them would >increment on node reboot. > > Looks like some counters (Symbol errors, link downed) are reached the top ceiling. This output was captured on node vortex3l-83, the one who runs opensm. Do you want the perfquery output before and after some time interval ? VBabu From xma at us.ibm.com Tue May 22 17:01:05 2007 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 22 May 2007 17:01:05 -0700 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: <1179858607.16831.20544.camel@hal.voltaire.com> Message-ID: Thanks Hal. Thanks for the clarification. I meant to say the port up and down kicked the node from GPFS membership. The port up and down was managed by SM. Thanks Shirley Ma -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Tue May 22 17:01:10 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 22 May 2007 20:01:10 -0400 Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master In-Reply-To: <4653845C.1090507@3leafnetworks.com> References: <4652167F.9040709@3leafnetworks.com> <1179785796.15940.27092.camel@hal.voltaire.com> <4652542C.3010400@3leafnetworks.com> <1179805556.15940.47640.camel@hal.voltaire.com> <46528E3C.8090305@3leafnetworks.com> <1179831181.15940.74121.camel@hal.voltaire.com> <4653845C.1090507@3leafnetworks.com> Message-ID: <1179878469.16831.42580.camel@hal.voltaire.com> On Tue, 2007-05-22 at 20:01, Venkatesh Babu wrote: > Hal Rosenstock wrote: > > >The one I see that might be related is the following: > > > >commit 39798695b4bcc7b145f8910ca56195808d3a7637 > >Author: Roland Dreier > >Date: Mon Nov 13 09:38:07 2006 -0800 > > > > IB/mad: Fix race between cancel and receive completion > > > > When ib_cancel_mad() is called, it puts the canceled send on a list > > and schedules a "flushed" callback from process context. However, > > this leaves a window where a receive completion could be processed > > before the send is fully flushed. > > > > This is fine, except that ib_find_send_mad() will find the MAD and > > return it to the receive processing, which results in the sender > > getting both a successful receive and a "flushed" send completion for > > the same request. Understandably, this confuses the sender, which is > > expecting only one of these two callbacks, and leads to grief such as > > a use-after-free in IPoIB. > > > > Fix this by changing ib_find_send_mad() to return a send struct only > > if the status is still successful (and not "flushed"). The search of > > the send_list already had this check, so this patch just adds the same > > check to the search of the wait_list. > > > > Signed-off-by: Roland Dreier > > > >My search was not exhaustive. 
> > > > > It looks like this may be the fix for the MAD send errors. Perhaps. > Do you > think this is the cause of opensm not grabbing the mastership from the > other ? Unlikely but don't know for sure. > >Are they incrementing ? Which node is this ? I think some of them would > >increment on node reboot. > > > > > Looks like some counters (Symbol errors, link downed) are reached the > top ceiling. You should replace the cable and see if symbol errors improves. You may need to clear these with perfquery -R. I think Link downed will increment when the node reboots. > This output was captured on node vortex3l-83, the one who runs opensm. > Do you want the perfquery output before and after some time interval ? I'm interested in VL15 drops to make sure that is not going on. -- Hal > VBabu From halr at voltaire.com Tue May 22 17:06:10 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 22 May 2007 20:06:10 -0400 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: References: Message-ID: <1179878769.16831.42940.camel@hal.voltaire.com> On Tue, 2007-05-22 at 20:01, Shirley Ma wrote: > Thanks Hal. > > Thanks for the clarification. I meant to say the port up and down > kicked the node from GPFS membership. The port up and down was managed > by SM. Got it and that port down/up appears to be caused by MAD starvation at the host. 
-- Hal > Thanks > Shirley Ma From vlad at lists.openfabrics.org Wed May 23 02:40:24 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 23 May 2007 02:40:24 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070523-0200 daily build status Message-ID: <20070523094024.C6340E60814@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.16 
Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From Koen.SEGERS at VRT.BE Wed May 23 06:48:19 2007 From: Koen.SEGERS at VRT.BE (SEGERS Koen) Date: Wed, 23 May 2007 15:48:19 +0200 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: Message-ID: So far, all tests seem to work. Thanks for the help! Scott, Are there more bug fixes that Cisco applies in its RPMs? Greetz Koen -----Original Message----- From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] Sent: Wednesday, May 23, 2007 0:39 To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; general-bounces at lists.openfabrics.org Subject: RE: [ofa-general] GPFS node loses IB-connection It's not so much pinging every 10 seconds as expecting a response within 10 seconds (Clive, correct me if I'm wrong). You only need to do 1) or 2), not both. Cisco configures 1) in the OFED binary RPMs we release at http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I prefer to have the host be more responsive.
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: Koen Segers [mailto:koen.segers at VRT.BE] > Sent: Tuesday, May 22, 2007 3:35 PM > To: Scott Weitzenkamp (sweitzen) > Cc: Shirley Ma; Ami Perlmutter; > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org > Subject: RE: [ofa-general] GPFS node loses IB-connection > > If I understand it wright, the switch is actually polling > (=pinging) the > interfaces every 10s. This means that when the interface is handling > other traffic, the poll can fail and the port could be > considered out of > service. My question is then: "How can the timeout be reached while > packets are being sent/received to/from the interface?" > > Anyway, what timeout-value would you recommend for us? And why? > > To recapitulate: these are the actions I'll take tomorrow > 1) change the MAD niceness of the servers > 2) change the timeout on the switches > > Are these changes sufficient for the HCA's to keep their ports in > PORT_ACTIVE state? > > Regards, > > Koen > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote: > > Yes, you can tune it. Here's an example via the switch CLI: > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 > > node-timeout > > > > The default is 10 seconds, it can be configured up to 2000 seconds. > > If a HCA is completely unresponsive for longer than the node-timeout > > value, then we consider that HCA out of service. 
> > > > Scott Weitzenkamp > > SQA and Release Manager > > Server Virtualization Business Unit > > Cisco Systems > > > > > > > > > ______________________________________________________________ > > From: Shirley Ma [mailto:xma at us.ibm.com] > > Sent: Tuesday, May 22, 2007 11:30 AM > > To: koen.segers at VRT.BE > > Cc: Ami Perlmutter; general at lists.openfabrics.org; > > general-bounces at lists.openfabrics.org; Scott Weitzenkamp > > (sweitzen) > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > Koen, > > > > So it is most likely you hit the same bug as 229 (Scott > > pointed out earlier). The same workaround might work for you > > by renicing ib_mad as Scott suggested. > > > > I think this should be a SM query timeout tunable value in > > Cisco SM. Am I right, Scott? > > > > Thanks > > Shirley Ma > > > > > > Inactive hide details for Koen Segers > Koen > > Segers > > > > > > Koen Segers > > > > > 05/22/07 11:14 AM > > Please respond to > > koen.segers at VRT.BE > > > > > > To > > > > Shirley > > Ma/Beaverton/IBM at IBMUS > > > > cc > > > > Ami Perlmutter > > , > general at lists.openfabrics.org, general-bounces at lists.openfabrics.org > > > > Subject > > > > RE: > > [ofa-general] > > GPFS node loses > > IB-connection > > > > > > > > Hi, > > > > It is the Cisco SM. 
> > > > SFS-7000P> show version > > > > > > > ============================================================== > ================== > > System Version Information > > > ============================================================== > ================== > > system-version : SFS-7000P TopspinOS 2.9.0 releng > > #147 > > 10/25/2006 02:01:32 > > contact : tac at cisco.com > > name : SFS-7000P > > location : 170 West Tasman Drive, > San Jose, CA > > 95134 > > up-time : 11(d):7(h):49(m):3(s) > > last-change : none > > last-config-save : none > > action : none > > result : none > > oper-mode : normal > > > > There is also a command that gives the SM version, > but I can't > > find it > > right now. > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote: > > > Hello Koen, > > > > > > From the switch log, it looks a SM issue to me. > The node was > > kicked > > > out of the membership. Which SM you are using in your > > fabric? > > > > > > Thanks > > > Shirley Ma > > > > > *** Disclaimer *** > > > > Vlaamse Radio- en Televisieomroep > > Auguste Reyerslaan 52, 1043 Brussel > > > > nv van publiek recht > > BTW BE 0244.142.664 > > RPR Brussel > > http://www.vrt.be/disclaimer > > > > > > > > > > > *** Disclaimer *** > > Vlaamse Radio- en Televisieomroep > Auguste Reyerslaan 52, 1043 Brussel > > nv van publiek recht > BTW BE 0244.142.664 > RPR Brussel > http://www.vrt.be/disclaimer > > *** Disclaimer *** Vlaamse Radio- en Televisieomroep Auguste Reyerslaan 52, 1043 Brussel nv van publiek recht BTW BE 0244.142.664 RPR Brussel http://www.vrt.be/disclaimer From sweitzen at cisco.com Wed May 23 06:51:55 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 23 May 2007 06:51:55 -0700 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: References: Message-ID: No C code changes, just a few config file changes (RENICE_IB_MAD=yes in openib.conf, memlock in /etc/security/limits.conf, fix /etc/hosts on SLES10 for bug 267, etc.). 
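As a rough illustration of the first two changes listed above, the settings might look like the sketch below. The RENICE_IB_MAD=yes line and both file names come from this message; the /etc/infiniband/ location of openib.conf and the "unlimited" memlock values are assumptions (a commonly used choice), not a copy of what the Cisco RPMs actually install.

```
# openib.conf (assumed location: /etc/infiniband/openib.conf)
# Ask the OFED startup script to renice the ib_mad kernel thread
RENICE_IB_MAD=yes

# /etc/security/limits.conf
# Assumed values: lift the locked-memory limit for all users
* soft memlock unlimited
* hard memlock unlimited
```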
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > Sent: Wednesday, May 23, 2007 6:48 AM > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > Cc: Shirley Ma; Ami Perlmutter; > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org > Subject: RE: [ofa-general] GPFS node loses IB-connection > > This far, all tests seem to work. > > Thanks for the help! > > Scott, > Are there more bugfixes that cisco does in its rpms? > > Greetz > > Koen > > -----Oorspronkelijk bericht----- > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > Verzonden: woensdag 23 mei 2007 0:39 > Aan: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; > general-bounces at lists.openfabrics.org > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > It's not so much pinging every 10 seconds as expecting a > response within > 10 seconds (Clive, correct me if I'm wrong). > > You only need to do 1) or 2), not both. Cisco configures 1) > in the OFED > binary RPMs we release at > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I > prefer to have > the host be more responsive. > > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > > > > -----Original Message----- > > From: Koen Segers [mailto:koen.segers at VRT.BE] > > Sent: Tuesday, May 22, 2007 3:35 PM > > To: Scott Weitzenkamp (sweitzen) > > Cc: Shirley Ma; Ami Perlmutter; > > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > If I understand it wright, the switch is actually polling > > (=pinging) the > > interfaces every 10s. 
From halr at voltaire.com Wed May 23 07:11:38 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 23 May 2007 10:11:38 -0400 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: References: Message-ID: <1179929493.16831.98786.camel@hal.voltaire.com> On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote: > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in > openib.conf, Does the host really not respond to MAD requests for over 10 seconds in some cases? -- Hal > memlock in /etc/security/limits.conf, fix /etc/hosts on > SLES10 for bug 267, etc.). > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > > > > -----Original Message----- > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > Sent: Wednesday, May 23, 2007 6:48 AM > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > > Cc: Shirley Ma; Ami Perlmutter; > > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > This far, all tests seem to work. > > > > Thanks for the help! > > > > Scott, > > Are there more bugfixes that cisco does in its rpms?
From devesh28 at gmail.com Wed May 23 07:27:55 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Wed, 23 May 2007 19:57:55 +0530 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <1179769930.15940.9823.camel@hal.voltaire.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com> <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com> <464B5C07.8040601@ichips.intel.com> <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com> <1179398534.23882.67542.camel@hal.voltaire.com> <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com> <1179483657.23882.158398.camel@hal.voltaire.com> <309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com> <1179769930.15940.9823.camel@hal.voltaire.com> Message-ID: <309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com> On 21 May 2007 13:52:11 -0400, Hal Rosenstock wrote: > On Mon,
2007-05-21 at 01:58, Devesh Sharma wrote: > > On 18 May 2007 06:21:05 -0400, Hal Rosenstock wrote: > > > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote: > > > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock wrote: > > > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote: > > > > > > On 5/17/07, Sean Hefty wrote: > > > > > > > > But initially this will generate a packet for each path, while sys > > > > > > > > admin knows that path is there and he can hard-code the entries for > > > > > > > > it. Other thing is that why Admin will care about creating such record > > > > > > > > while SA is itself taking care, right? > > > > > > > > > > > > > > In your original message you asked about adding 'dummy entries' to the > > > > > > > cache. I agree that pre-loading the cache can be useful. What I still > > > > > > > am not understanding is the reasoning for adding 'dummy entries'. By > > > > > > > 'dummy entries', I've been assuming that these are invalid path records, > > > > > > > but maybe that's not what you meant. > > > > > > Ok if "dummy entries" word as such has created confusion then I am > > > > > > sorry for that, But with that I mean that, those are valid path > > > > > > records which Administrator knows in advance and while loading the > > > > > > module, > > > > > > > > > > How does the admin know they are valid ? > > > > Depending on the initial application runs, some trusted PRs can be generated. > > > > > > What do initial application runs have to do with this ? > > My understanding is that, once the cluster is UP, and if between Node > > A and Node B there is only one path, > > So this is a feature for such one path subnets. I wonder what percentage > of deployed subnets fits this case. You never know; it may also be used for debugging. > > > then, SA query always going to return same values in PR. > > If subnet topology is changed, these PRs might change. There are other > cases where they change too. I am not sure about that. Any suggestions?
> > > On this basis Initial application runs will generate PRs, > > That's what confused me before (Applications don't generate PRs but > rather request them.) but I think I see what you mean now. Ok > > > these PRs can be saved in some file, and can be loaded > > when cache_module comes in. > > > > > > > >Are they somehow preconfigured at the SM ? > > > > I am not sure about SM has any such provision? > > > > > > Not that I'm aware of. > > Ok, So, currently no such support is there in SM? > > I can speak definitively for OpenSM and there is no such support. As to > the vendor SMs, I don't think so but don't know for absolute certainty. > Someone can correct me if I'm wrong but I wouldn't assume no response > means correctness as some may not be listening nor want to respond as to > "value added" vendor specific features. What is the issue if OpenSM provides this? > > > > > Also not sure about the > > > > role of SM in path resolving. I mean once node has initiated SA query, > > > > whether SM has some database to reply SA or On the fly destination > > > > node is contacted to get asked path recored? > > > > > > SMs can either calculate the SA PRs on the fly based on the routing > > > algorithm in use and some other things or put them in a local database. > > > This is up to that SM. > > Ok > > > > > > Destination node is not contacted in the SA PR query process. > > > > > > > >Doesn't each SM have its own policy for generating valid PRs ? > > > > Ultimately path record is in Path_Record object format, and SA cache > > > > is going to store in a fixed manner, How generation policy matters? > > > > > > What if the local policy loaded does not agree with what the SM would > > > generate for a particular PR ? One then gets a local error which will > > > need to be tracked down. Not so easy IMO. > > SM policies in a subnet to generate PRs, changes dynamically? at run time? 
> > The policy doesn't change dynamically but the data to be returned in the > SA PR response might. > > > if Not then depending on the local SM policy static PR can be > > generated to load initially. > > Just as one question related to this, how would link failures be handled > ? There are others. It is just a matter of avoiding the initial PR query packets by loading the cache with static PRs. Later on, the cache module will function in the normal fashion. I expect that initially everything will come up in a trusted cluster. > > > > > CMIIW. Also I am assuming a homogeneous cluster where certain > > > > parameters can be assumed to be same always. > > > > > > and always in agreement with what the SM would return ? For example, > > yes > > > what happens when a link goes down and the end node is no longer > > > reachable ? > > If node is not reachable then, after first timeout of sa_cache, that > > entry will be removed from cache. > > OK; that's another aspect to add into this feature. I don't think that > is currently done. I think there would need to be an API added to do > this. Yes, this has been discussed with Sean. We can add a char_dev interface to the existing sa_cache module implementation. Its write entry point will generate an SA_PR_response packet, and this packet will be passed to the update_cache() function. We also need to remove the initial schedule_update() call in the add_one() function. A user command is also required to read from a user file and write to this device. > > -- Hal > > > > > >are these from a live SM and just loaded "out of band" to > > > > bypass/preclude the SA PR >mechanism ? > > > > may be > > > > > > Even if they are, there is still the changes in the subnet issue. > > > > > > -- Hal > > > > > > > > -- Hal > > > > > > > > > > > Admin is loading this info in the cache with user command. > > > > > > > > > > > > > > > Another point I want to know is, > > > > > > > > When local_sa_cache module will be inserted?
After SM comes up or > > > > > > > > Before SM comes up? > > > > > > > > > > > > > > It can occur either way. There is no restriction. The cache responds > > > > > > > to port up and GID in/out of service events to update itself. > > > > > > Do you mean cache module will start building cache only after Port is UP? > > > > > > > > > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on > > > > > > > > some node not on switch) then First Forced schedule_update() is > > > > > > > > waisted, and for the first application presence of cache is > > > > > > > > meaningless. Why not to keep cache effective right from the start? > > > > > > > > > > > > > > Pre-loading the cache with path records doesn't guarantee that those > > > > > > > paths are usable. If the SM has not come up, then the path records will > > > > > > > be unusable until the SM configures the subnet, plus there's no > > > > > > > guarantee that the remote endpoints specified by the paths are running. > > > > > > You mean there is no guarantee that even if SM is UP and we have some > > > > > > hard coded entries of path record corresponding to some node X, we are > > > > > > not sure that node X has actually come up or not? In that case > > > > > > actually that path resolving should fail if node has not come up, but > > > > > > with the hard coding still path will be resolved? > > > > > > > > > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms > > > > > > > when booting a large cluster. > > > > > > that's true. Also cache will get valid entries only if network is > > > > > > configured by SM otherwise every node SA will, possibly, drop SA > > > > > > packets. 
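A rough sketch of the pre-loading idea discussed in this thread, for illustration only. It is a toy user-space model in Python, not the local SA cache kernel code; `PathRecordCache`, `preload`, `resolve`, `invalidate_gid` and the record fields are hypothetical names. Under those assumptions it shows pre-loaded entries answering lookups without an SA query, expiring after a timeout, and being dropped when a GID goes out of service.

```python
import time


class PathRecordCache:
    """Toy model of a local path record (PR) cache that can be pre-loaded."""

    def __init__(self, query_sa, ttl_seconds, clock=time.monotonic):
        self._query_sa = query_sa    # callable (src_gid, dst_gid) -> record or None
        self._ttl = ttl_seconds      # lifetime of an entry before it is re-queried
        self._clock = clock          # injectable clock, handy for testing
        self._entries = {}           # (src_gid, dst_gid) -> (record, expiry time)
        self.sa_queries = 0          # number of queries actually sent to the SA

    def preload(self, records):
        """Seed the cache from admin-supplied records; generates no SA traffic."""
        expiry = self._clock() + self._ttl
        for key, record in records.items():
            self._entries[key] = (record, expiry)

    def resolve(self, src_gid, dst_gid):
        """Return a cached record if still fresh, otherwise fall back to the SA."""
        key = (src_gid, dst_gid)
        entry = self._entries.get(key)
        if entry is not None and self._clock() < entry[1]:
            return entry[0]
        # Missing or expired: drop the stale entry and ask the SA.
        self._entries.pop(key, None)
        self.sa_queries += 1
        record = self._query_sa(src_gid, dst_gid)
        if record is not None:
            self._entries[key] = (record, self._clock() + self._ttl)
        return record

    def invalidate_gid(self, gid):
        """Drop every entry involving a GID reported out of service."""
        self._entries = {k: v for k, v in self._entries.items() if gid not in k}
```

Pre-loaded records are only as good as the file they came from: after a topology change they may disagree with what the SM would return, until the entry expires or is invalidated.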
> > > > > > > > > > > > > > - Sean > > > > > > > > > > > > > _______________________________________________ > > > > > > general mailing list > > > > > > general at lists.openfabrics.org > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > > > > > > From halr at voltaire.com Wed May 23 07:38:25 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 23 May 2007 10:38:25 -0400 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <000101c7963a$3474ae00$49c9180a@amr.corp.intel.com> <309a667c0705160613j686deb47hf51aacf9f74de0e5@mail.gmail.com> <464B5C07.8040601@ichips.intel.com> <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com> <1179398534.23882.67542.camel@hal.voltaire.com> <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com> <1179483657.23882.158398.camel@hal.voltaire.com> <309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com> <1179769930.15940.9823.camel@hal.voltaire.com> <309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com> Message-ID: <1179931104.16831.100554.camel@hal.voltaire.com> On Wed, 2007-05-23 at 10:27, Devesh Sharma wrote: > On 21 May 2007 13:52:11 -0400, Hal Rosenstock wrote: > > On Mon, 2007-05-21 at 01:58, Devesh Sharma wrote: > > > On 18 May 2007 06:21:05 -0400, Hal Rosenstock wrote: > > > > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote: > > > > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock wrote: > > > > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote: > > > > > > > On 5/17/07, Sean Hefty wrote: > > > > > > > > > But initially this will generate a packet for each path, while sys > > > > > > > > > admin knows that path is there and he can hard-code the entries for > > > > > > > > > it. 
Other thing is that why Admin will care about creating such record > > > > > > > > > while SA is itself taking care, right? > > > > > > > > > > > > > > > > In your original message you asked about adding 'dummy entries' to the > > > > > > > > cache. I agree that pre-loading the cache can be useful. What I still > > > > > > > > am not understanding is the reasoning for adding 'dummy entries'. By > > > > > > > > 'dummy entries', I've been assuming that these are invalid path records, > > > > > > > > but maybe that's not what you meant. > > > > > > > Ok if "dummy entries" word as such has created confusion then I am > > > > > > > sorry for that, But with that I mean that, those are valid path > > > > > > > records which Administrator knows in advance and while loading the > > > > > > > module, > > > > > > > > > > > > How does the admin know they are valid ? > > > > > Depending on the initial application runs, some trusted PRs can be generated. > > > > > > > > What do initial application runs have to do with this ? > > > My understanding is that, once the cluster is UP, and if between Node > > > A and Node B there is only one path, > > > > So this is a feature for such one path subnets. I wonder what percentage > > of deployed subnets fits this case. > You never know, It may be used for debugging also. I still don't have a good feel for how common/generally useful this will really be. > > > then, SA query always going to return same values in PR. > > > > If subnet topology is changed, these PRs might change. There are other > > cases where they change too. > Not sure about it...some suggestion? > > > > > On this basis Initial application runs will generate PRs, > > > > That's what confused me before (Applications don't generate PRs but > > rather request them.) but I think I see what you mean now. > Ok > > > > > these PRs can be saved in some file, and can be loaded > > > when cache_module comes in. > > > > > > > > > >Are they somehow preconfigured at the SM ? 
> > > > > I am not sure about SM has any such provision? > > > > > > > > Not that I'm aware of. > > > Ok, So, currently no such support is there in SM? > > > > I can speak definitively for OpenSM and there is no such support. As to > > the vendor SMs, I don't think so but don't know for absolute certainty. > > Someone can correct me if I'm wrong but I wouldn't assume no response > > means correctness as some may not be listening nor want to respond as to > > "value added" vendor specific features. > What is the issue if OpenSM provides this? I'm not following you. What does/should OpenSM provide ? OpenIB works in configurations with other SMs. > > > > > > > Also not sure about the > > > > > role of SM in path resolving. I mean once node has initiated SA query, > > > > > whether SM has some database to reply SA or On the fly destination > > > > > node is contacted to get asked path recored? > > > > > > > > SMs can either calculate the SA PRs on the fly based on the routing > > > > algorithm in use and some other things or put them in a local database. > > > > This is up to that SM. > > > Ok > > > > > > > > Destination node is not contacted in the SA PR query process. > > > > > > > > > >Doesn't each SM have its own policy for generating valid PRs ? > > > > > Ultimately path record is in Path_Record object format, and SA cache > > > > > is going to store in a fixed manner, How generation policy matters? > > > > > > > > What if the local policy loaded does not agree with what the SM would > > > > generate for a particular PR ? One then gets a local error which will > > > > need to be tracked down. Not so easy IMO. > > > SM policies in a subnet to generate PRs, changes dynamically? at run time? > > > > The policy doesn't change dynamically but the data to be returned in the > > SA PR response might. > > > > > if Not then depending on the local SM policy static PR can be > > > generated to load initially. 
> >
> > Just as one question related to this, how would link failures be handled
> > ? There are others.
> Its just a matter of avoiding initial PR query packets by loading the
> cache with static PRs.....Later on cache module will function in
> normal fashion. I expect, initially every thing will come up in a
> trusted cluster.

So you're saying the cache would still react to GIDs out and in service,
right ?

If the cache is loaded from a file, does it bypass querying the SA
initially for PRs ? If that is the case, then the file is required to be
the full set of PRs for this node otherwise there would be incomplete
connectivity.

-- Hal

> > > > > CMIIW. Also I am assuming a homogeneous cluster where certain
> > > > > parameters can be assumed to be same always.
> > > >
> > > > and always in agreement with what the SM would return ? For example,
> > > yes
> > > > what happens when a link goes down and the end node is no longer
> > > > reachable ?
> > > If node is not reachable then, after first timeout of sa_cache, that
> > > entry will be removed from cache.
> >
> > OK; that's another aspect to add into this feature. I don't think that
> > is currently done. I think there would need to be an API added to do
> > this.
> Yes, this has been discussed with Sean, we can add one char_dev
> interface to the existing sa_cache module implementation, Write entry
> point will generate a SA_PR_response packet and this packet will be
> passed to update_cache() function.
>
> Also we need to remove the initial schedule_update() call in the
> add_one() function.
> One user command is also required to read from user file and write
> onto this device.
> >
> > -- Hal
> >
> > > > > >are these from a live SM and just loaded "out of band" to
> > > > > bypass/preclude the SA PR >mechanism ?
> > > > > may be
> > > >
> > > > Even if they are, there is still the changes in the subnet issue.
> > > > > > > > -- Hal > > > > > > > > > > -- Hal > > > > > > > > > > > > > Admin is loading this info in the cache with user command. > > > > > > > > > > > > > > > > > Another point I want to know is, > > > > > > > > > When local_sa_cache module will be inserted? After SM comes up or > > > > > > > > > Before SM comes up? > > > > > > > > > > > > > > > > It can occur either way. There is no restriction. The cache responds > > > > > > > > to port up and GID in/out of service events to update itself. > > > > > > > Do you mean cache module will start building cache only after Port is UP? > > > > > > > > > > > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on > > > > > > > > > some node not on switch) then First Forced schedule_update() is > > > > > > > > > waisted, and for the first application presence of cache is > > > > > > > > > meaningless. Why not to keep cache effective right from the start? > > > > > > > > > > > > > > > > Pre-loading the cache with path records doesn't guarantee that those > > > > > > > > paths are usable. If the SM has not come up, then the path records will > > > > > > > > be unusable until the SM configures the subnet, plus there's no > > > > > > > > guarantee that the remote endpoints specified by the paths are running. > > > > > > > You mean there is no guarantee that even if SM is UP and we have some > > > > > > > hard coded entries of path record corresponding to some node X, we are > > > > > > > not sure that node X has actually come up or not? In that case > > > > > > > actually that path resolving should fail if node has not come up, but > > > > > > > with the hard coding still path will be resolved? > > > > > > > > > > > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms > > > > > > > > when booting a large cluster. > > > > > > > that's true. 
Also cache will get valid entries only if network is > > > > > > > configured by SM otherwise every node SA will, possibly, drop SA > > > > > > > packets. > > > > > > > > > > > > > > > > - Sean > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > general mailing list > > > > > > > general at lists.openfabrics.org > > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > > > > > > > > > > > > From umaxx at oleco.net Wed May 23 08:03:36 2007 From: umaxx at oleco.net (Joerg Zinke) Date: Wed, 23 May 2007 17:03:36 +0200 Subject: [ofa-general] mmap() and ibv_reg_mr() and RDMA In-Reply-To: References: <20070522210430.5df75050@marvin.local> <20070522232559.7785a331@marvin.local> Message-ID: <20070523170336.2df4e755@marvin.local> On Tue, 22 May 2007 15:19:15 -0700 Roland Dreier wrote: > > this area is mapped via the character device and with > > the help of remap_pfn_range() into userspace... this works fine i > > can access it from userspace and write/read from it: > > I think that's the problem. remap_pfn_range() sets VM_PFNMAP on the > vma used to map the pfns. When ibv_reg_mr() calls into the kernel to > do the actual mapping, it ends up doing get_user_pages() which fails > in vm_normal_page() for such a vma. > > I don't immediately see a good way to handle this. > many thanks for your fast answer. i will try to access the memory via get_user_pages() too instead of mmap'ing it... just the other way around - should be no problem. 
regards,
joerg

From Koen.SEGERS at VRT.BE Wed May 23 08:20:20 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Wed, 23 May 2007 17:20:20 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1179929493.16831.98786.camel@hal.voltaire.com>
Message-ID:

After a whole day of stress testing with the MAD renicing turned on, we
got the error once. So I think I should raise the timeout on the switch
also.

It takes about 2 minutes to boot the system. Do you agree that this is a
good value for the timeout?

Scott,
Can you explain the memlock problem to me?

I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
install this, the bug is not related to us. This is correct, isn't it?

Greetz

Koen

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Wednesday, May 23, 2007 16:12
To: Scott Weitzenkamp (sweitzen)
CC: SEGERS Koen; Clive Hall (clivhall);
general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> openib.conf,

Does the host really not respond to MAD requests for over 10 seconds in
some cases ?

-- Hal

> memlock in /etc/security/limits.conf, fix /etc/hosts on
> SLES10 for bug 267, etc.).
>
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>
> > -----Original Message-----
> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > Sent: Wednesday, May 23, 2007 6:48 AM
> > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > Cc: Shirley Ma; Ami Perlmutter;
> > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > This far, all tests seem to work.
> >
> > Thanks for the help!
> > > > Scott, > > Are there more bugfixes that cisco does in its rpms? > > > > Greetz > > > > Koen > > > > -----Oorspronkelijk bericht----- > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > Verzonden: woensdag 23 mei 2007 0:39 > > Aan: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; > > general-bounces at lists.openfabrics.org > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > It's not so much pinging every 10 seconds as expecting a > > response within > > 10 seconds (Clive, correct me if I'm wrong). > > > > You only need to do 1) or 2), not both. Cisco configures 1) > > in the OFED > > binary RPMs we release at > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I > > prefer to have > > the host be more responsive. > > > > > > Scott Weitzenkamp > > SQA and Release Manager > > Server Virtualization Business Unit > > Cisco Systems > > > > > > > -----Original Message----- > > > From: Koen Segers [mailto:koen.segers at VRT.BE] > > > Sent: Tuesday, May 22, 2007 3:35 PM > > > To: Scott Weitzenkamp (sweitzen) > > > Cc: Shirley Ma; Ami Perlmutter; > > > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > If I understand it wright, the switch is actually polling > > > (=pinging) the > > > interfaces every 10s. This means that when the interface is handling > > > other traffic, the poll can fail and the port could be > > > considered out of > > > service. My question is then: "How can the timeout be reached while > > > packets are being sent/received to/from the interface?" > > > > > > Anyway, what timeout-value would you recommend for us? And why? 
> > > > > > To recapitulate: these are the actions I'll take tomorrow > > > 1) change the MAD niceness of the servers > > > 2) change the timeout on the switches > > > > > > Are these changes sufficient for the HCA's to keep their ports in > > > PORT_ACTIVE state? > > > > > > Regards, > > > > > > Koen > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp > > (sweitzen) wrote: > > > > Yes, you can tune it. Here's an example via the switch CLI: > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 > > > > node-timeout > > > > > > > > The default is 10 seconds, it can be configured up to > > 2000 seconds. > > > > If a HCA is completely unresponsive for longer than the > > node-timeout > > > > value, then we consider that HCA out of service. > > > > > > > > Scott Weitzenkamp > > > > SQA and Release Manager > > > > Server Virtualization Business Unit > > > > Cisco Systems > > > > > > > > > > > > > > > > > > > ______________________________________________________________ > > > > From: Shirley Ma [mailto:xma at us.ibm.com] > > > > Sent: Tuesday, May 22, 2007 11:30 AM > > > > To: koen.segers at VRT.BE > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org; > > > > general-bounces at lists.openfabrics.org; Scott Weitzenkamp > > > > (sweitzen) > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > > > > > Koen, > > > > > > > > So it is most likely you hit the same bug as 229 (Scott > > > > pointed out earlier). The same workaround might > > work for you > > > > by renicing ib_mad as Scott suggested. > > > > > > > > I think this should be a SM query timeout tunable value in > > > > Cisco SM. Am I right, Scott? 
> > > >
> > > > Thanks
> > > > Shirley Ma
> > > >
> > > > From: Koen Segers
> > > > Date: 05/22/07 11:14 AM
> > > > Please respond to: koen.segers at VRT.BE
> > > > To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > cc: Ami Perlmutter, general at lists.openfabrics.org,
> > > > general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > Hi,
> > > >
> > > > It is the Cisco SM.
> > > >
> > > > SFS-7000P> show version
> > > >
> > > > ================================================================================
> > > > System Version Information
> > > > ================================================================================
> > > > system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > contact : tac at cisco.com
> > > > name : SFS-7000P
> > > > location : 170 West Tasman Drive, San Jose, CA 95134
> > > > up-time : 11(d):7(h):49(m):3(s)
> > > > last-change : none
> > > > last-config-save : none
> > > > action : none
> > > > result : none
> > > > oper-mode : normal
> > > >
> > > > There is also a command that gives the SM version, but I can't find it
> > > > right now.
> > > >
> > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > Hello Koen,
> > > > >
> > > > > From the switch log, it looks a SM issue to me. The node was kicked
> > > > > out of the membership. Which SM you are using in your fabric?
> > > > >
> > > > > Thanks
> > > > > Shirley Ma

*** Disclaimer ***

Vlaamse Radio- en Televisieomroep
Auguste Reyerslaan 52, 1043 Brussel

nv van publiek recht
BTW BE 0244.142.664
RPR Brussel
http://www.vrt.be/disclaimer

From sweitzen at cisco.com Wed May 23 08:37:15 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 23 May 2007 08:37:15 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1179929493.16831.98786.camel@hal.voltaire.com>
References: <1179929493.16831.98786.camel@hal.voltaire.com>
Message-ID:

> Does the host really not respond to MAD requests for over 10 seconds in
> some cases ?

With recent Xeon, Opteron, and Power processors, yes.
Scott

From sweitzen at cisco.com Wed May 23 08:37:54 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 23 May 2007 08:37:54 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
References: <1179929493.16831.98786.camel@hal.voltaire.com>
Message-ID:

The boot time of the host doesn't matter for this timeout. While the
host is booting, the IB link is down anyway.

Scott

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> Sent: Wednesday, May 23, 2007 8:20 AM
> To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> Cc: Clive Hall (clivhall);
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> After a whole day of stress testing with the MAD renicing turned on, we
> got the error once. So I think I should raise the timeout on the switch
> also.
>
> It takes about 2 minutes to boot the system. Do you agree that this is a
> good value for the timeout?
>
> Scott,
> Can you explain the memlock problem to me?
>
> I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> install this, the bug is not related to us. This is correct, isn't it?
>
> Greetz
>
> Koen
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Wednesday, May 23, 2007 16:12
> To: Scott Weitzenkamp (sweitzen)
> CC: SEGERS Koen; Clive Hall (clivhall);
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > openib.conf,
>
> Does the host really not respond to MAD requests for over 10 seconds in
> some cases ?
>
> -- Hal
>
> > memlock in /etc/security/limits.conf, fix /etc/hosts on
> > SLES10 for bug 267, etc.).
From Koen.SEGERS at VRT.BE Wed May 23 08:39:06 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Wed, 23 May 2007 17:39:06 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
Message-ID:

What value would you recommend then?

Koen

-----Original Message-----
From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
Sent: Wednesday, May 23, 2007 17:38
To: SEGERS Koen; Hal Rosenstock
CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

The boot time of the host doesn't matter for this timeout.
While the host is booting, the IB link is down anyway.

Scott

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> Sent: Wednesday, May 23, 2007 8:20 AM
> To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> After a whole day of stress testing with the MAD renicing turned on, we
> got the error once. So I think I should also raise the timeout on the
> switch.
>
> It takes about 2 minutes to boot the system. Do you agree that this is a
> good value for the timeout?
>
> Scott,
> Can you explain the memlock problem to me?
>
> I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> install this, the bug is not related to us. This is correct, isn't it?
>
> Greetz
>
> Koen
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Wednesday, May 23, 2007 16:12
> To: Scott Weitzenkamp (sweitzen)
> CC: SEGERS Koen; Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > openib.conf,
>
> Does the host really not respond to MAD requests for over 10 seconds in
> some cases?
>
> -- Hal
>
> > memlock in /etc/security/limits.conf, fix /etc/hosts on
> > SLES10 for bug 267, etc.).
> >
> > Scott Weitzenkamp
> > SQA and Release Manager
> > Server Virtualization Business Unit
> > Cisco Systems
> >
> > > -----Original Message-----
> > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >
> > > So far, all tests seem to work.
> > >
> > > Thanks for the help!
> > >
> > > Scott,
> > > Are there more bug fixes that Cisco makes in its RPMs?
> > >
> > > Greetz
> > >
> > > Koen
> > >
> > > -----Original Message-----
> > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > Sent: Wednesday, May 23, 2007 0:39
> > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >
> > > It's not so much pinging every 10 seconds as expecting a response within
> > > 10 seconds (Clive, correct me if I'm wrong).
> > >
> > > You only need to do 1) or 2), not both. Cisco configures 1) in the OFED
> > > binary RPMs we release at
> > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I prefer to have
> > > the host be more responsive.
> > >
> > >
> > > Scott Weitzenkamp
> > > SQA and Release Manager
> > > Server Virtualization Business Unit
> > > Cisco Systems
> > >
> > > > -----Original Message-----
> > > > From: Koen Segers [mailto:koen.segers at VRT.BE]
> > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > To: Scott Weitzenkamp (sweitzen)
> > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > interfaces every 10s. This means that when the interface is handling
> > > > other traffic, the poll can fail and the port could be considered out of
> > > > service. My question is then: "How can the timeout be reached while
> > > > packets are being sent/received to/from the interface?"
> > > >
> > > > Anyway, what timeout value would you recommend for us? And why?
> > > >
> > > > To recapitulate: these are the actions I'll take tomorrow
> > > > 1) change the MAD niceness of the servers
> > > > 2) change the timeout on the switches
> > > >
> > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > PORT_ACTIVE state?
> > > >
> > > > Regards,
> > > >
> > > > Koen
> > > >
> > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > Yes, you can tune it. Here's an example via the switch CLI:
> > > > >
> > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout
> > > > >
> > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > If an HCA is completely unresponsive for longer than the node-timeout
> > > > > value, then we consider that HCA out of service.
> > > > >
> > > > > Scott Weitzenkamp
> > > > > SQA and Release Manager
> > > > > Server Virtualization Business Unit
> > > > > Cisco Systems
> > > > >
> > > > > ______________________________________________________________
> > > > > From: Shirley Ma [mailto:xma at us.ibm.com]
> > > > > Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > To: koen.segers at VRT.BE
> > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org; general-bounces at lists.openfabrics.org; Scott Weitzenkamp (sweitzen)
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > Koen,
> > > > >
> > > > > So it is most likely you hit the same bug as 229 (Scott
> > > > > pointed out earlier). The same workaround might work for you
> > > > > by renicing ib_mad as Scott suggested.
> > > > >
> > > > > I think this should be a SM query timeout tunable value in
> > > > > Cisco SM. Am I right, Scott?
> > > > >
> > > > > Thanks
> > > > > Shirley Ma
> > > > >
> > > > > From: Koen Segers
> > > > > Date: 05/22/07 11:14 AM
> > > > > Please respond to: koen.segers at VRT.BE
> > > > > To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > cc: Ami Perlmutter, general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > Hi,
> > > > >
> > > > > It is the Cisco SM.
> > > > >
> > > > > SFS-7000P> show version
> > > > >
> > > > > ================================================================================
> > > > >                     System Version Information
> > > > > ================================================================================
> > > > >    system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > >           contact : tac at cisco.com
> > > > >              name : SFS-7000P
> > > > >          location : 170 West Tasman Drive, San Jose, CA 95134
> > > > >           up-time : 11(d):7(h):49(m):3(s)
> > > > >       last-change : none
> > > > >  last-config-save : none
> > > > >            action : none
> > > > >            result : none
> > > > >         oper-mode : normal
> > > > >
> > > > > There is also a command that gives the SM version, but I can't find it
> > > > > right now.
> > > > >
> > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > Hello Koen,
> > > > > >
> > > > > > From the switch log, it looks like an SM issue to me. The node was
> > > > > > kicked out of the membership. Which SM are you using in your fabric?
> > > > > >
> > > > > > Thanks
> > > > > > Shirley Ma

From sweitzen at cisco.com Wed May 23 08:40:54 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 23 May 2007 08:40:54 -0700
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
References:
Message-ID:

Try 20 seconds; I'm curious whether you are barely crossing the 10-second threshold.
Scott

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> Sent: Wednesday, May 23, 2007 8:39 AM
> To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> What value would you recommend then?
>
> Koen

From pradeeps at linux.vnet.ibm.com Wed May 23 09:17:19 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Wed, 23 May 2007 09:17:19 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory
footprint
In-Reply-To: <46537081.30906@linux.vnet.ibm.com>
References: <46537081.30906@linux.vnet.ibm.com>
Message-ID: <4654690F.1040305@linux.vnet.ibm.com>

If this proposal is acceptable, would you want me to generate a patch against
Roland's for-2.6.22 git tree, or would the for-2.6.23 tree be better?

Pradeep

Pradeep Satyanarayana wrote:
> Here are my thoughts about limiting the memory footprint for the IPOIB CM
> (NOSRQ) patch:
>
> By default, cap the NOSRQ memory usage to 1GB. The default recvq_size
> is set to 128. Therefore, for 64KB packets this would imply a maximum of
> 128 endpoints.
>
> - Make the maximum number of endpoints a module parameter with a default
> value of 128.
>
> - The NOSRQ limit of 1GB is also made a module parameter. However, 1GB is
> the default limit and could be changed as needed (by the administrator)
> depending on the system configuration, application needs and so on. The
> server would return a "REJ" message upon receiving a "REQ" whenever one
> of these limits (i.e. max number of endpoints or the max NOSRQ memory
> usage) is reached. Currently, we only check for the maximum number of
> endpoints, which is hard-coded to 1024.
>
> - The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that
> support SRQ, like the Topspin HCA, and such HCAs should not be
> impacted at all.
>
> - Currently we allocate a default of 64KB for the ring buffer elements,
> and this buffer size is not linked to the MTU. In the future, we could
> allocate buffers based on the MTU and link that into the computation of
> the memory cap. This way customers who might want to use a smaller MTU
> could use a larger number of endpoints, or a larger recvq_size, without
> exceeding the memory cap.
>
>
> Would this approach address the issues of scalability and enable IPOIB
> CM to be turned on as the default?
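
The arithmetic behind the proposed cap can be checked with a small sketch. This is illustrative only and is not taken from the IPoIB patch: the function names are hypothetical, "1GB" and "64KB" are assumed to mean 2^30 and 2^16 bytes, and the admission rule simply combines the two limits the proposal describes (endpoint count and total NOSRQ memory).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): how many NOSRQ endpoints fit
 * under cap_bytes when each endpoint owns recvq_size receive buffers of
 * buf_bytes each. */
static inline uint64_t nosrq_max_endpoints(uint64_t cap_bytes,
                                           uint64_t recvq_size,
                                           uint64_t buf_bytes)
{
    uint64_t per_endpoint = recvq_size * buf_bytes;

    if (per_endpoint == 0)
        return 0; /* bad configuration: avoid dividing by zero */
    return cap_bytes / per_endpoint;
}

/* Hypothetical admission check: returns nonzero when a new REQ should be
 * answered with a REJ, i.e. when either the endpoint limit or the memory
 * cap is already reached. */
static inline int nosrq_should_reject(uint64_t active_endpoints,
                                      uint64_t endpoint_limit,
                                      uint64_t cap_bytes,
                                      uint64_t recvq_size,
                                      uint64_t buf_bytes)
{
    uint64_t by_memory = nosrq_max_endpoints(cap_bytes, recvq_size, buf_bytes);
    uint64_t limit = endpoint_limit < by_memory ? endpoint_limit : by_memory;

    return active_endpoints >= limit;
}

/* Numbers quoted in the proposal: 1 GB cap, recvq_size 128, 64 KB buffers. */
static inline void nosrq_self_check(void)
{
    assert(nosrq_max_endpoints(1ULL << 30, 128, 64 * 1024) == 128);
}
```

With these assumptions each endpoint pins 128 x 64 KB = 8 MB, so 1 GB allows 128 endpoints, which matches the figure in the proposal. It also shows why tying the buffer size to a smaller MTU would raise the endpoint count under the same cap.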
>
>
> Pradeep

From narravul at cse.ohio-state.edu Wed May 23 10:27:01 2007
From: narravul at cse.ohio-state.edu (Sundeep Narravula)
Date: Wed, 23 May 2007 13:27:01 -0400 (EDT)
Subject: [ofa-general] Problem with using two interfaces with rdma-cm
Message-ID:

Hi Sean,

I currently have a setup with two nodes connected with two HCAs. Both
HCAs have IP addresses in different subnets.

Rail 1 (ib0): 192.168.1.*
Rail 2 (ib2): 192.168.3.*

When I try to connect two QPs over these rails (one on each), many times
the address resolutions for both QPs return the context of just one of
the rails, i.e. I am not able to use both rails.

Is there anything I am missing here? We are using OFED-1.1 on this
cluster.

Thanks,
--Sundeep.

From sean.hefty at intel.com Wed May 23 10:37:53 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 23 May 2007 10:37:53 -0700
Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm
In-Reply-To:
Message-ID: <000301c79d61$17db1210$ff0da8c0@amr.corp.intel.com>

>Rail 1 (ib0): 192.168.1.*
>Rail 2 (ib2): 192.168.3.*
>
>When I try to connect two qps over these rails (one on each), many times
>the address resolutions for both the qps return me the context of just
>one of the rails. i.e. I am not able to use both the rails.
>
>Is there any thing I am missing here?

Can you provide more details on how you are establishing your connections?
Are you calling rdma_resolve_addr() with 192.168.1.x in one case, and
192.168.3.x in the second case, and both of those resolve back to the same
local IP address? Can you tell if ping routes the same way?

You can try binding to a specific local address, but based on your setup, I
would expect this to work.
So, I'd like to understand better what the issue could be.

- Sean

From mst at dev.mellanox.co.il Wed May 23 10:50:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 23 May 2007 20:50:30 +0300
Subject: [ofa-general] Re: skb queue management in ipoib
In-Reply-To:
References: <20070522063634.GB3331@mellanox.co.il>
Message-ID: <20070523175030.GA6019@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: skb queue management in ipoib
>
> > I think that managing this queue in a FIFO manner, dropping
> > old packets and inserting new ones instead would be better:
> > an older packet has more chance to have been timed out.
>
> Yes, that probably makes sense.
>
> > So we would do something along the lines of:
> >
> > __skb_queue_tail(&neigh->queue, skb);
> > if (skb_queue_len(&neigh->queue) > IPOIB_MAX_PATH_REC_QUEUE) {
> > 	skb = __skb_dequeue_tail(&neigh->queue);
>
> this should just be __skb_dequeue though...

Ugh, sure. I'll post something like this for 2.6.23 then?

--
MST

From mst at dev.mellanox.co.il Wed May 23 11:36:50 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 23 May 2007 21:36:50 +0300
Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <4653248A.1040108@ichips.intel.com>
References: <20070520134441.GI20649@mellanox.co.il> <000101c79c09$74f15440$54c8180a@amr.corp.intel.com> <20070522075952.GC3331@mellanox.co.il> <4653248A.1040108@ichips.intel.com>
Message-ID: <20070523183650.GB6019@mellanox.co.il>

> >>Also, I left the duplicate request handling
> >>as it was, since that should go in as a separate patch.
> >
> >Could you please describe what is missing currently?
> >Is the missing handling likely to cause timeouts?
>
> If two REQs are received with matching local IDs, but the REQs
> themselves differ on one or more fields, such as the QPN, the second REQ
> is dropped as a duplicate.

Why do you speak about dropping duplicates as a valid response?
As far as I can tell, the 2 legal responses to a duplicate REQ
are resending a REP and rejecting with code 30.

> This causes timeouts, so I need to figure
> out what the correct behavior should be here.

I agree that it seems that we could use this as a hint that the remote has
rebooted.

--
MST

From sean.hefty at intel.com Wed May 23 12:39:33 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 23 May 2007 12:39:33 -0700
Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <20070523183650.GB6019@mellanox.co.il>
Message-ID: <000501c79d72$17260a30$ff0da8c0@amr.corp.intel.com>

>> If two REQs are received with matching local IDs, but the REQs
>> themselves differ on one or more fields, such as the QPN, the second REQ
>> is dropped as a duplicate.
>
>Why do you speak about dropping duplicates as a valid response?

I was only mentioning the current behavior.

>As far as I can tell, the 2 legal responses to a duplicate REQ
>are resending a REP and rejecting with code 30.

It's possible to receive a duplicate REQ before processing has completed and a
REP has been generated for the first REQ. In this case, it makes sense simply
to discard the duplicate REQ.

When processing completes on the first REQ, the CM will generate either a REP or
a REJ, so I believe that the behavior is compliant when handling actual
duplicate REQs.

- Sean

From mst at dev.mellanox.co.il Wed May 23 12:50:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 23 May 2007 22:50:11 +0300
Subject: [ofa-general] Re: [PATCH RFC] IB/ipoib: fix to_ipoib_neigh access race
In-Reply-To:
References: <20070522005918.GB13311@mellanox.co.il>
Message-ID: <20070523195011.GC6019@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: [PATCH RFC] IB/ipoib: fix to_ipoib_neigh access race
>
> > hard_start_xmit dereferences to_ipoib_neigh when only tx_lock is taken. This
> > would only be safe if all calls that modify *to_ipoib_neigh take tx_lock too.
> > Currently this is not always true for ipoib_neigh_free and path_rec_completion,
> > which results in memory corruption. Fix this race, making sure
> > path_rec_completion and ipoib_neigh_free are always called under
> > tx_lock.
> >
> > Signed-off-by: Michael S. Tsirkin
> >
> > ---
> >
> > I'm looking at
> > https://bugs.openfabrics.org/show_bug.cgi?id=604
> > and I think this could explain the crashes.
> > In any case, Roland, is there a race or am I imagining things?
> >
> > NB: The patch is untested (I'm not at the lab now).
>
> Yes, it does seem that there is a problem here. However, I think the first
> part of this needs to be handled another way -- for example:
>
> > -	path_free(dev, path);
> >  	spin_lock_irq(&priv->tx_lock);
> >  	spin_lock(&priv->lock);
> > +	path_free(dev, path);
>
> path_free already takes priv->lock internally, and also calls
> ipoib_put_ah(), which may end up in ipoib_free_ah(), which also might
> take priv->lock.

Interesting point: note how unicast_arp_send is called under tx_lock,
and calls path_free from there. It seems to be safe simply because
we never have an AH or any neighbours there, but it does look a bit ugly,
and there's a bit of code duplication in that function.

> It's not immediately obvious what the right fix is...

Maybe
1. avoid doing path_free in unicast_arp_send: just do __path_add unconditionally
   like we do for regular packets.
and
2. make path_free take both tx_lock and priv->lock?

Something along the following lines (NB: untested):

Signed-off-by: Michael S.
Tsirkin

---

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-22 01:46:54.000000000 +0300
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-23 22:45:18.000000000 +0300
@@ -262,7 +262,8 @@ static void path_free(struct net_device
 	while ((skb = __skb_dequeue(&path->queue)))
 		dev_kfree_skb_irq(skb);
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
 		/*
@@ -277,7 +278,8 @@ static void path_free(struct net_device
 		ipoib_neigh_free(dev, neigh);
 	}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (path->ah)
 		ipoib_put_ah(path->ah);
@@ -401,7 +403,8 @@ static void path_rec_completion(int stat
 		ah = ipoib_create_ah(dev, priv->pd, &av);
 	}
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	path->ah = ah;
 
@@ -442,7 +445,8 @@ static void path_rec_completion(int stat
 	path->query = NULL;
 	complete(&path->done);
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	while ((skb = __skb_dequeue(&skqueue))) {
 		skb->dev = dev;
@@ -614,32 +618,16 @@ static void unicast_arp_send(struct sk_b
 	path = __path_find(dev, phdr->hwaddr + 4);
 	if (!path) {
 		path = path_rec_create(dev, phdr->hwaddr + 4);
-		if (path) {
-			/* put pseudoheader back on for next time */
-			skb_push(skb, sizeof *phdr);
-			__skb_queue_tail(&path->queue, skb);
-
-			if (path_rec_start(dev, path)) {
-				spin_unlock(&priv->lock);
-				path_free(dev, path);
-				return;
-			} else
-				__path_add(dev, path);
-		} else {
-			++priv->stats.tx_dropped;
-			dev_kfree_skb_any(skb);
-		}
-
-		spin_unlock(&priv->lock);
-		return;
+		if (path)
+			__path_add(dev, path);
 	}
 
-	if (path->ah) {
+	if (path && path->ah) {
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
 			  be16_to_cpu(path->pathrec.dlid));
 
 		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
-	} else if ((path->query || !path_rec_start(dev, path)) &&
+	} else if (path && (path->query || !path_rec_start(dev, path)) &&
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		/* put pseudoheader back on for next time */
 		skb_push(skb, sizeof *phdr);
@@ -822,7 +810,8 @@ static void ipoib_neigh_cleanup(struct n
 		  IPOIB_QPN(n->ha),
 		  IPOIB_GID_RAW_ARG(n->ha + 4));
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	neigh = *to_ipoib_neigh(n);
 	if (neigh) {
@@ -832,7 +821,8 @@ static void ipoib_neigh_cleanup(struct n
 		ipoib_neigh_free(n->dev, neigh);
 	}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (ah)
 		ipoib_put_ah(ah);
Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-22 01:46:54.000000000 +0300
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-23 21:38:28.000000000 +0300
@@ -100,7 +100,8 @@ static void ipoib_mcast_free(struct ipoi
 		  "deleting multicast group " IPOIB_GID_FMT "\n",
 		  IPOIB_GID_ARG(mcast->mcmember.mgid));
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) {
 		/*
@@ -114,7 +115,8 @@ static void ipoib_mcast_free(struct ipoi
 		ipoib_neigh_free(dev, neigh);
 	}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (mcast->ah)
 		ipoib_put_ah(mcast->ah);

--
MST

From mst at dev.mellanox.co.il Wed May 23 12:57:00 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 23 May 2007 22:57:00 +0300 Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection In-Reply-To: <000501c79d72$17260a30$ff0da8c0@amr.corp.intel.com> References: <20070523183650.GB6019@mellanox.co.il> <000501c79d72$17260a30$ff0da8c0@amr.corp.intel.com> Message-ID: <20070523195700.GD6019@mellanox.co.il> > Quoting Sean Hefty : > Subject: RE: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection > > >> If two REQs are received with matching local IDs, but the REQs > >> themselves differ on one or more fields, such as the QPN, the second REQ > >> is dropped as a duplicate. > > > >Why do you speak about dropping duplicates as a valid response? > > I was only mentioning the current behavior. > > >As far as I can tell, the 2 legal responses to a duplicate REQ > >are resending a REP and rejecting with code 30. > > It's possible to receive a duplicate REQ before processing has completed and a > REP generated to the first REQ. In this case, it makes sense simply to discard > the duplicate REQ. > > When processing completes on the first REQ, the CM will generate either a REP or > a REJ, so I believe that the behavior is compliant when handling actual > duplicate REQs. Well, in case the second REQ differs from the first one, discarding it might not be the best option: I think we might want to queue it and process it when you exit the ephemeral state. -- MST From narravul at cse.ohio-state.edu Wed May 23 13:07:33 2007 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Wed, 23 May 2007 16:07:33 -0400 (EDT) Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm In-Reply-To: <000301c79d61$17db1210$ff0da8c0@amr.corp.intel.com> Message-ID: On Wed, 23 May 2007, Sean Hefty wrote: > >Rail 1 (ib0): 192.168.1.* > >Rail 2 (ib2): 192.168.3.* > > > >When I try to connect two qps over these rails (one on each), many times > >the address resolutions for both the qps return me the context of just > >one of the rails. 
i.e. I am not able to use both the rails. > > > >Is there anything I am missing here? > > Can you provide more details on how you are establishing your connections? > > Are you calling rdma_resolve_addr() with 192.168.1.x in one case, and > 192.168.3.x in the second case, and both of those resolve back to the same local > IP address? Can you tell if ping routes the same way? Basically I have the following sequence of steps. Process 1: rdma_bind_addr (someport, 0.0.0.0) rdma_listen () .... rdma_get_cm_event() if (RDMA_CM_EVENT_CONNECT_REQUEST) rdma_accept() .... Process 2: rdma_connect (192.168.1.x) rdma_connect (192.168.3.x) .... wait for connections. I am able to ping on both the interfaces. The ping messages go over both the interfaces. In fact, after pinging the interfaces from each other a few times I am able to connect properly over both the rails for some time. After a few minutes it falls back to just one interface. > You can try binding to a specific local address, but based on your setup, I > would expect this to work. So, I'd like to understand better what the issue > could be. hmm.. I can try this but as a last resort. Ideally I would like to use just one listen cm_id bound to 0.0.0.0. --Sundeep. > > - Sean > From sean.hefty at intel.com Wed May 23 13:25:28 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 23 May 2007 13:25:28 -0700 Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm In-Reply-To: Message-ID: <000601c79d78$810037e0$ff0da8c0@amr.corp.intel.com> >I am able to ping on both the interfaces. The ping messages go over both >the interfaces. In fact, after pinging the interfaces from each other a few >times I am able to connect properly over both the rails for some time. >After a few minutes it falls back to just one interface. Odd - I will see if I can reproduce this. Are the HCAs sharing the same IB subnet? >hmm.. I can try this but as a last resort. 
Ideally I would like to use >just one listen cm_id bound to 0.0.0.0. I was thinking about binding on the active side, before calling connect. But I still want to look into this more. - Sean From narravul at cse.ohio-state.edu Wed May 23 13:30:33 2007 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Wed, 23 May 2007 16:30:33 -0400 (EDT) Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm In-Reply-To: <000601c79d78$810037e0$ff0da8c0@amr.corp.intel.com> Message-ID: > Odd - I will see if I can reproduce this. > > Are the HCAs sharing the same IB subnet? Yes. They are in the same IB subnet. > > >hmm.. I can try this but as a last resort. Ideally I would like to use > >just one listen cm_id bound to 0.0.0.0. > > I was thinking about binding on the active side, before calling connect. But I > still want to look into this more. I can try this one out. --Sundeep. > > - Sean > From pradeeps at linux.vnet.ibm.com Wed May 23 14:12:19 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 23 May 2007 14:12:19 -0700 Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint In-Reply-To: <4654690F.1040305@linux.vnet.ibm.com> References: <46537081.30906@linux.vnet.ibm.com> <4654690F.1040305@linux.vnet.ibm.com> Message-ID: <4654AE33.20506@linux.vnet.ibm.com> Roland, Is it too late to get this into 2.6.22? If so, I will try for 2.6.23 -please let me know. Pradeep Pradeep Satyanarayana wrote: > > If this proposal is acceptable, would you want me to generate a patch > against Roland's for-2.6.22 git tree, or would for-2.6.23 tree be > better? > > Pradeep > > Pradeep Satyanarayana wrote: >> Here are my thoughts about limiting the memory footprint for IPOIB CM >> (NOSRQ) patch: >> >> By default, cap the NOSRQ memory usage to 1GB. The default recvq_size >> is set to 128. Therefore for 64KB packets this would imply a maximum of >> 128 endpoints. 
>> >> -Make the maximum number of endpoints a module parameter with a default >> value of 128. >> >> -The NOSRQ limit of 1GB is also made a module parameter. However, 1GB is >> the default limit and could be changed as needed (by the administrator) >> depending on the system configuration, application needs and so on. The >> server would return a "REJ" message upon receiving a "REQ", whenever one >> of these limits (i.e. max number of endpoints or the max NOSRQ memory >> usage) is reached. Currently, we only check for the maximum number of >> endpoints -hard coded to 1024. >> >> -The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that >> support SRQ, like the Topspin HCA, and such HCAs should not be >> impacted at all. >> >> -Currently we allocate a default of 64KB for the ring buffer elements, >> and this buffer size is not linked to the MTU. In the future, we could >> allocate buffers based on the MTU and link that into the computation of >> the memory cap. This way customers who might want to use a smaller MTU >> could use a larger number of endpoints, or a larger recvq_size without >> exceeding the memory cap. >> >> >> Would this approach address the issues of scalability and enable IPOIB >> CM to be turned on as the default? 
>> >> >> Pradeep >> > > From sean.hefty at intel.com Wed May 23 14:14:40 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 23 May 2007 14:14:40 -0700 Subject: [ofa-general] [PATCH] for-2.6.23 ib/sa: use correct index for default pkey In-Reply-To: Message-ID: <000701c79d7f$60460f00$ff0da8c0@amr.corp.intel.com> MADs sent to the SA should use the index for the default pkey. There's no requirement that the default pkey be stored at index 0. Signed-off-by: Sean Hefty --- Patch requires the latest changes to the pkey cache. This fix is not a priority, but it appears to be the only issue in the stack with supporting multiple partitions, which is a requirement for the labs. 
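The lookup order this patch describes (try the full-membership default pkey, then the partial-membership one, and otherwise stay at index 0) can be sketched in plain userspace C. This is only an illustration under stated assumptions: `find_pkey` and `default_pkey_index` are hypothetical helpers, the pkey table is a simple array, and the exact-match comparison is a simplification that may differ from what the kernel's `ib_find_pkey` does.

```c
#include <stddef.h>
#include <stdint.h>

/* Same values the patch assigns to IB_DEFAULT_PKEY_FULL / _PARTIAL. */
#define DEFAULT_PKEY_FULL    0xFFFF
#define DEFAULT_PKEY_PARTIAL 0x7FFF

/*
 * Hypothetical helper: exact-match linear search of a pkey table.
 * Returns 0 and stores the index on success, -1 if the pkey is absent.
 */
static int find_pkey(const uint16_t *table, size_t len,
                     uint16_t pkey, uint16_t *index)
{
    for (size_t i = 0; i < len; i++) {
        if (table[i] == pkey) {
            *index = (uint16_t)i;
            return 0;
        }
    }
    return -1;
}

/*
 * Hypothetical helper mirroring the order used in the patch:
 * start from index 0, try the full default pkey, then the partial one.
 * Returns -1 (leaving *index at 0) when neither is present.
 */
static int default_pkey_index(const uint16_t *table, size_t len,
                              uint16_t *index)
{
    *index = 0;
    if (find_pkey(table, len, DEFAULT_PKEY_FULL, index) &&
        find_pkey(table, len, DEFAULT_PKEY_PARTIAL, index))
        return -1;
    return 0;
}
```

With a table such as { 0x8001, 0xFFFF } the sketch yields index 1 rather than 0, which is the case the patch is meant to handle.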
drivers/infiniband/core/sa_query.c | 85 +++++++++++++++++++++--------------- include/rdma/ib_mad.h | 3 + 2 files changed, 53 insertions(+), 35 deletions(-) diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 6469406..4791d01 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -56,6 +56,7 @@ MODULE_LICENSE("Dual BSD/GPL"); struct ib_sa_sm_ah { struct ib_ah *ah; struct kref ref; + u16 pkey_index; u8 src_path_mask; }; @@ -382,6 +383,13 @@ static void update_sm_ah(struct work_struct *work) kref_init(&new_ah->ref); new_ah->src_path_mask = (1 << port_attr.lmc) - 1; + new_ah->pkey_index = 0; + if (ib_find_pkey(port->agent->device, port->port_num, + IB_DEFAULT_PKEY_FULL, &new_ah->pkey_index) && + ib_find_pkey(port->agent->device, port->port_num, + IB_DEFAULT_PKEY_PARTIAL, &new_ah->pkey_index)) + printk(KERN_ERR "Couldn't find index for default PKey\n"); + memset(&ah_attr, 0, sizeof ah_attr); ah_attr.dlid = port_attr.sm_lid; ah_attr.sl = port_attr.sm_sl; @@ -512,6 +520,35 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num, } EXPORT_SYMBOL(ib_init_ah_from_path); +static int alloc_mad(struct ib_sa_query *query, gfp_t gfp_mask) +{ + unsigned long flags; + + spin_lock_irqsave(&query->port->ah_lock, flags); + kref_get(&query->port->sm_ah->ref); + query->sm_ah = query->port->sm_ah; + spin_unlock_irqrestore(&query->port->ah_lock, flags); + + query->mad_buf = ib_create_send_mad(query->port->agent, 1, + query->sm_ah->pkey_index, + 0, IB_MGMT_SA_HDR, IB_MGMT_SA_DATA, + gfp_mask); + if (!query->mad_buf) { + kref_put(&query->sm_ah->ref, free_sm_ah); + return -ENOMEM; + } + + query->mad_buf->ah = query->sm_ah->ah; + + return 0; +} + +static void free_mad(struct ib_sa_query *query) +{ + ib_free_send_mad(query->mad_buf); + kref_put(&query->sm_ah->ref, free_sm_ah); +} + static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent) { unsigned long flags; @@ -548,20 +585,11 @@ retry: 
query->mad_buf->context[0] = query; query->id = id; - spin_lock_irqsave(&query->port->ah_lock, flags); - kref_get(&query->port->sm_ah->ref); - query->sm_ah = query->port->sm_ah; - spin_unlock_irqrestore(&query->port->ah_lock, flags); - - query->mad_buf->ah = query->sm_ah->ah; - ret = ib_post_send_mad(query->mad_buf, NULL); if (ret) { spin_lock_irqsave(&idr_lock, flags); idr_remove(&query_idr, id); spin_unlock_irqrestore(&idr_lock, flags); - - kref_put(&query->sm_ah->ref, free_sm_ah); } /* @@ -647,13 +675,10 @@ int ib_sa_path_rec_get(struct ib_sa_client *client, if (!query) return -ENOMEM; - query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, - IB_MGMT_SA_DATA, gfp_mask); - if (!query->sa_query.mad_buf) { - ret = -ENOMEM; + query->sa_query.port = port; + ret = alloc_mad(&query->sa_query, gfp_mask); + if (ret) goto err1; - } ib_sa_client_get(client); query->sa_query.client = client; @@ -665,7 +690,6 @@ int ib_sa_path_rec_get(struct ib_sa_client *client, query->sa_query.callback = callback ? 
ib_sa_path_rec_callback : NULL; query->sa_query.release = ib_sa_path_rec_release; - query->sa_query.port = port; mad->mad_hdr.method = IB_MGMT_METHOD_GET; mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); mad->sa_hdr.comp_mask = comp_mask; @@ -683,7 +707,7 @@ int ib_sa_path_rec_get(struct ib_sa_client *client, err2: *sa_query = NULL; ib_sa_client_put(query->sa_query.client); - ib_free_send_mad(query->sa_query.mad_buf); + free_mad(&query->sa_query); err1: kfree(query); @@ -773,13 +797,10 @@ int ib_sa_service_rec_query(struct ib_sa_client *client, if (!query) return -ENOMEM; - query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, - IB_MGMT_SA_DATA, gfp_mask); - if (!query->sa_query.mad_buf) { - ret = -ENOMEM; + query->sa_query.port = port; + ret = alloc_mad(&query->sa_query, gfp_mask); + if (ret) goto err1; - } ib_sa_client_get(client); query->sa_query.client = client; @@ -791,7 +812,6 @@ int ib_sa_service_rec_query(struct ib_sa_client *client, query->sa_query.callback = callback ? 
ib_sa_service_rec_callback : NULL; query->sa_query.release = ib_sa_service_rec_release; - query->sa_query.port = port; mad->mad_hdr.method = method; mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_SERVICE_REC); mad->sa_hdr.comp_mask = comp_mask; @@ -810,7 +830,7 @@ int ib_sa_service_rec_query(struct ib_sa_client *client, err2: *sa_query = NULL; ib_sa_client_put(query->sa_query.client); - ib_free_send_mad(query->sa_query.mad_buf); + free_mad(&query->sa_query); err1: kfree(query); @@ -869,13 +889,10 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, if (!query) return -ENOMEM; - query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, - 0, IB_MGMT_SA_HDR, - IB_MGMT_SA_DATA, gfp_mask); - if (!query->sa_query.mad_buf) { - ret = -ENOMEM; + query->sa_query.port = port; + ret = alloc_mad(&query->sa_query, gfp_mask); + if (ret) goto err1; - } ib_sa_client_get(client); query->sa_query.client = client; @@ -887,7 +904,6 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, query->sa_query.callback = callback ? 
ib_sa_mcmember_rec_callback : NULL; query->sa_query.release = ib_sa_mcmember_rec_release; - query->sa_query.port = port; mad->mad_hdr.method = method; mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); mad->sa_hdr.comp_mask = comp_mask; @@ -906,7 +922,7 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, err2: *sa_query = NULL; ib_sa_client_put(query->sa_query.client); - ib_free_send_mad(query->sa_query.mad_buf); + free_mad(&query->sa_query); err1: kfree(query); @@ -939,8 +955,7 @@ static void send_handler(struct ib_mad_agent *agent, idr_remove(&query_idr, query->id); spin_unlock_irqrestore(&idr_lock, flags); - ib_free_send_mad(mad_send_wc->send_buf); - kref_put(&query->sm_ah->ref, free_sm_ah); + free_mad(query); ib_sa_client_put(query->client); query->release(query); } diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h index 739fa4d..30712dd 100644 --- a/include/rdma/ib_mad.h +++ b/include/rdma/ib_mad.h @@ -111,6 +111,9 @@ #define IB_QP1_QKEY 0x80010000 #define IB_QP_SET_QKEY 0x80000000 +#define IB_DEFAULT_PKEY_PARTIAL 0x7FFF +#define IB_DEFAULT_PKEY_FULL 0xFFFF + enum { IB_MGMT_MAD_HDR = 24, IB_MGMT_MAD_DATA = 232, From swise at opengridcomputing.com Wed May 23 14:25:31 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 23 May 2007 14:25:31 -0700 Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm In-Reply-To: References: Message-ID: <4654B14B.4000208@opengridcomputing.com> Guys, this reminds me of an issue we have with rnics and regular nics on the same physical network. By default linux responds to arp queries on all ports it receives the query on. This leads to very bad results with you're trying to do offloaded connections. When resolving the address/route, the rdma client can end up getting the mac address of the dumb nic instead of the rnic. I don't know if route resolution in the ib cm has this issue, but it might since they use ipoib for some part of the resolution, no? 
You might try this (snippet from the cxgb3 release notes file to be included in -rc4): 2) If you have a multi-homed host and the physical ethernet networks are bridged, then you need to configure arp to only send replies on the interface with the target ip address: sysctl -w net.ipv4.conf.all.arp_ignore=2 Steve. Sundeep Narravula wrote: >> Odd - I will see if I can reproduce this. >> >> Are the HCAs sharing the same IB subnet? >> > > Yes. They are in the same IB subnet. > > >>> hmm.. I can try this but as a last resort. Ideally I would like to use >>> just one listen cm_id bound to 0.0.0.0. >>> >> I was thinking about binding on the active side, before calling connect. But I >> still want to look into this more. >> > > I can try this one out. > > --Sundeep. > > >> - Sean >> >> > From rdreier at cisco.com Wed May 23 15:24:37 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 23 May 2007 15:24:37 -0700 Subject: [ofa-general] [PATCH] IB/mlx4: Don't allocate RQ doorbell if using SRQ Message-ID: Mellanox people, does this look good to you? If a QP is attached to a shared receive queue (SRQ), then it doesn't have a receive queue (RQ). So don't allocate an RQ doorbell (or map a doorbell from userspace for userspace QPs) for that QP. 
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index a824bc5..88a994d 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -319,20 +319,24 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, if (err) goto err_mtt; - err = mlx4_ib_db_map_user(to_mucontext(pd->uobject->context), - ucmd.db_addr, &qp->db); - if (err) - goto err_mtt; + if (!init_attr->srq) { + err = mlx4_ib_db_map_user(to_mucontext(pd->uobject->context), + ucmd.db_addr, &qp->db); + if (err) + goto err_mtt; + } } else { err = set_kernel_sq_size(dev, &init_attr->cap, init_attr->qp_type, qp); if (err) goto err; - err = mlx4_ib_db_alloc(dev, &qp->db, 0); - if (err) - goto err; + if (!init_attr->srq) { + err = mlx4_ib_db_alloc(dev, &qp->db, 0); + if (err) + goto err; - *qp->db.db = 0; + *qp->db.db = 0; + } if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf)) { err = -ENOMEM; @@ -386,7 +390,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, return 0; err_wrid: - if (pd->uobject) + if (pd->uobject && !init_attr->srq) mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), &qp->db); else { kfree(qp->sq.wrid); @@ -403,7 +407,7 @@ err_buf: mlx4_buf_free(dev->dev, qp->buf_size, &qp->buf); err_db: - if (!pd->uobject) + if (!pd->uobject && !init_attr->srq) mlx4_ib_db_free(dev, &qp->db); err: @@ -481,14 +485,16 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp, mlx4_mtt_cleanup(dev->dev, &qp->mtt); if (is_user) { - mlx4_ib_db_unmap_user(to_mucontext(qp->ibqp.uobject->context), - &qp->db); + if (!qp->ibqp.srq) + mlx4_ib_db_unmap_user(to_mucontext(qp->ibqp.uobject->context), + &qp->db); ib_umem_release(qp->umem); } else { kfree(qp->sq.wrid); kfree(qp->rq.wrid); mlx4_buf_free(dev->dev, qp->buf_size, &qp->buf); - mlx4_ib_db_free(dev, &qp->db); + if (!qp->ibqp.srq) + mlx4_ib_db_free(dev, &qp->db); } } @@ -852,7 +858,7 @@ static int 
__mlx4_ib_modify_qp(struct ib_qp *ibqp, if (ibqp->srq) context->srqn = cpu_to_be32(1 << 24 | to_msrq(ibqp->srq)->msrq.srqn); - if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) + if (!ibqp->srq && cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) context->db_rec_addr = cpu_to_be64(qp->db.dma); if (cur_state == IB_QPS_INIT && @@ -919,7 +925,8 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; - *qp->db.db = 0; + if (!ibqp->srq) + *qp->db.db = 0; } out: From mshefty at ichips.intel.com Wed May 23 15:38:48 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 23 May 2007 15:38:48 -0700 Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm In-Reply-To: <4654B14B.4000208@opengridcomputing.com> References: <4654B14B.4000208@opengridcomputing.com> Message-ID: <4654C278.6030703@ichips.intel.com> Steve Wise wrote: > Guys, this reminds me of an issue we have with rnics and regular nics on > the same physical network. By default linux responds to arp queries on > all ports it receives the query on. This leads to very bad results with > you're trying to do offloaded connections. When resolving the > address/route, the rdma client can end up getting the mac address of the > dumb nic instead of the rnic. I don't know if route resolution in the > ib cm has this issue, but it might since they use ipoib for some part of > the resolution, no? I think this could be the problem. (And could have taken me a long time to figure it if it is, so thanks!) > 2) If you have a multi-homed host and the physical ethernet networks are > bridged, then you need to configure arp to only send replies on the > interface with the target ip address: > > sysctl -w net.ipv4.conf.all.arp_ignore=2 Sundeep, can you try this and see if it fixes the problem for you? 
- Sean From xma at us.ibm.com Wed May 23 16:28:57 2007 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 23 May 2007 16:28:57 -0700 Subject: [ofa-general] IPoIB NETIF_F_SG flag for GSO In-Reply-To: <20070523195700.GD6019@mellanox.co.il> Message-ID: Hello Roland, Michael, I tried GSO for IPoIB last year, but I didn't see much BW improvement for UD, only some decrease in CPU utilization. I looked at the GSO patch carefully and found that there are additional skb copies in skb segment. If the device supports SG, then the copies can be avoided. IPoIB does support SG, so I am planning to enable it and test GSO again. Have you tried this before? Can you think of any possible issues? Thanks Shirley Ma From sean.hefty at intel.com Wed May 23 17:00:35 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 23 May 2007 17:00:35 -0700 Subject: [ofa-general] [PATCH] ib/cm: optimize locking In-Reply-To: <46531C7A.3060201@ichips.intel.com> Message-ID: <003101c79d96$8ea19d80$5bd4180a@amr.corp.intel.com> The ib_cm is a little overzealous about using spin_lock_irqsave, when spin_lock_irq would do. Signed-off-by: Sean Hefty --- This patch applies on top of "ib/cm: fix stale connection detection". It has only been lightly tested using the librdmacm. Additional testing with ipoib cm is still needed. (I will try to get to that tomorrow.) I will request that this be pulled for 2.6.23 if there are no objections. 
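For anyone weighing the two lock variants mentioned above, the difference can be shown with a toy userspace model. This is a sketch only, not kernel code: the interrupt-enable state is modeled as a single boolean, and every `model_*` and `section_*` name is hypothetical. It illustrates that the save/restore form puts the interrupt state back as it found it, while the plain disable/enable form always leaves interrupts enabled, so it is only appropriate where interrupts are known to be enabled on entry.

```c
#include <stdbool.h>

/* Toy stand-in for the CPU interrupt-enable flag (assumption of this model). */
static bool irqs_enabled = true;

/* Models the irqsave half of spin_lock_irqsave: remember state, then disable. */
static unsigned long model_irq_save(void)
{
    unsigned long flags = irqs_enabled ? 1UL : 0UL;
    irqs_enabled = false;
    return flags;
}

/* Models the irqrestore half: put back whatever state was saved. */
static void model_irq_restore(unsigned long flags)
{
    irqs_enabled = (flags != 0);
}

/* Models the irq halves of spin_lock_irq / spin_unlock_irq. */
static void model_irq_disable(void) { irqs_enabled = false; }
static void model_irq_enable(void)  { irqs_enabled = true; }

/* Critical section using save/restore; returns the state seen afterwards. */
static bool section_irqsave(bool enabled_on_entry)
{
    irqs_enabled = enabled_on_entry;
    unsigned long flags = model_irq_save();
    /* ... protected work would go here ... */
    model_irq_restore(flags);
    return irqs_enabled;
}

/* Critical section using unconditional disable/enable. */
static bool section_irq(bool enabled_on_entry)
{
    irqs_enabled = enabled_on_entry;
    model_irq_disable();
    /* ... protected work would go here ... */
    model_irq_enable();
    return irqs_enabled;
}
```

In this model section_irq(false) returns true: a caller that entered with interrupts disabled would come out with them enabled, which is why the cheaper form is limited to paths that never run with interrupts already off.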
drivers/infiniband/core/cm.c | 171 ++++++++++++++++++------------------------ 1 files changed, 75 insertions(+), 96 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 40c004a..16181d6 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -318,12 +318,10 @@ static int cm_alloc_id(struct cm_id_private *cm_id_priv) static void cm_free_id(__be32 local_id) { - unsigned long flags; - - spin_lock_irqsave(&cm.lock, flags); + spin_lock_irq(&cm.lock); idr_remove(&cm.local_id_table, (__force int) (local_id ^ cm.random_id_operand)); - spin_unlock_irqrestore(&cm.lock, flags); + spin_unlock_irq(&cm.lock); } static struct cm_id_private * cm_get_id(__be32 local_id, __be32 remote_id) @@ -345,11 +343,10 @@ static struct cm_id_private * cm_get_id(__be32 local_id, __be32 remote_id) static struct cm_id_private * cm_acquire_id(__be32 local_id, __be32 remote_id) { struct cm_id_private *cm_id_priv; - unsigned long flags; - spin_lock_irqsave(&cm.lock, flags); + spin_lock_irq(&cm.lock); cm_id_priv = cm_get_id(local_id, remote_id); - spin_unlock_irqrestore(&cm.lock, flags); + spin_unlock_irq(&cm.lock); return cm_id_priv; } @@ -713,31 +710,30 @@ static void cm_destroy_id(struct ib_cm_id *cm_id, int err) { struct cm_id_private *cm_id_priv; struct cm_work *work; - unsigned long flags; cm_id_priv = container_of(cm_id, struct cm_id_private, id); retest: - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); switch (cm_id->state) { case IB_CM_LISTEN: cm_id->state = IB_CM_IDLE; - spin_unlock_irqrestore(&cm_id_priv->lock, flags); - spin_lock_irqsave(&cm.lock, flags); + spin_unlock_irq(&cm_id_priv->lock); + spin_lock_irq(&cm.lock); rb_erase(&cm_id_priv->service_node, &cm.listen_service_table); - spin_unlock_irqrestore(&cm.lock, flags); + spin_unlock_irq(&cm.lock); break; case IB_CM_SIDR_REQ_SENT: cm_id->state = IB_CM_IDLE; ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); - 
spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); break; case IB_CM_SIDR_REQ_RCVD: - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); cm_reject_sidr_req(cm_id_priv, IB_SIDR_REJECT); break; case IB_CM_REQ_SENT: ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); ib_send_cm_rej(cm_id, IB_CM_REJ_TIMEOUT, &cm_id_priv->id.device->node_guid, sizeof cm_id_priv->id.device->node_guid, @@ -747,9 +743,9 @@ retest: if (err == -ENOMEM) { /* Do not reject to allow future retries. */ cm_reset_to_idle(cm_id_priv); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); } else { - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); } @@ -762,25 +758,25 @@ retest: case IB_CM_MRA_REQ_SENT: case IB_CM_REP_RCVD: case IB_CM_MRA_REP_SENT: - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); break; case IB_CM_ESTABLISHED: - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); ib_send_cm_dreq(cm_id, NULL, 0); goto retest; case IB_CM_DREQ_SENT: ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); cm_enter_timewait(cm_id_priv); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); break; case IB_CM_DREQ_RCVD: - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); ib_send_cm_drep(cm_id, NULL, 0); break; default: - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); break; } @@ -1169,7 +1165,6 @@ static void cm_format_req_event(struct cm_work *work, static void cm_process_work(struct cm_id_private *cm_id_priv, struct cm_work 
*work) { - unsigned long flags; int ret; /* We will typically only have the current event to report. */ @@ -1177,9 +1172,9 @@ static void cm_process_work(struct cm_id_private *cm_id_priv, cm_free_work(work); while (!ret && !atomic_add_negative(-1, &cm_id_priv->work_count)) { - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); work = cm_dequeue_work(cm_id_priv); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); BUG_ON(!work); ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->cm_event); @@ -1250,7 +1245,6 @@ static void cm_dup_req_handler(struct cm_work *work, struct cm_id_private *cm_id_priv) { struct ib_mad_send_buf *msg = NULL; - unsigned long flags; int ret; /* Quick state check to discard duplicate REQs. */ @@ -1261,7 +1255,7 @@ static void cm_dup_req_handler(struct cm_work *work, if (ret) return; - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); switch (cm_id_priv->id.state) { case IB_CM_MRA_REQ_SENT: cm_format_mra((struct cm_mra_msg *) msg->mad, cm_id_priv, @@ -1276,14 +1270,14 @@ static void cm_dup_req_handler(struct cm_work *work, default: goto unlock; } - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); ret = ib_post_send_mad(msg, NULL); if (ret) goto free; return; -unlock: spin_unlock_irqrestore(&cm_id_priv->lock, flags); +unlock: spin_unlock_irq(&cm_id_priv->lock); free: cm_free_msg(msg); } @@ -1293,17 +1287,16 @@ static struct cm_id_private * cm_match_req(struct cm_work *work, struct cm_id_private *listen_cm_id_priv, *cur_cm_id_priv; struct cm_timewait_info *timewait_info; struct cm_req_msg *req_msg; - unsigned long flags; req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad; /* Check for possible duplicate REQ. 
*/ - spin_lock_irqsave(&cm.lock, flags); + spin_lock_irq(&cm.lock); timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info); if (timewait_info) { cur_cm_id_priv = cm_get_id(timewait_info->work.local_id, timewait_info->work.remote_id); - spin_unlock_irqrestore(&cm.lock, flags); + spin_unlock_irq(&cm.lock); if (cur_cm_id_priv) { cm_dup_req_handler(work, cur_cm_id_priv); cm_deref_id(cur_cm_id_priv); @@ -1315,7 +1308,7 @@ static struct cm_id_private * cm_match_req(struct cm_work *work, timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info); if (timewait_info) { cm_cleanup_timewait(cm_id_priv->timewait_info); - spin_unlock_irqrestore(&cm.lock, flags); + spin_unlock_irq(&cm.lock); cm_issue_rej(work->port, work->mad_recv_wc, IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ, NULL, 0); @@ -1328,7 +1321,7 @@ static struct cm_id_private * cm_match_req(struct cm_work *work, req_msg->private_data); if (!listen_cm_id_priv) { cm_cleanup_timewait(cm_id_priv->timewait_info); - spin_unlock_irqrestore(&cm.lock, flags); + spin_unlock_irq(&cm.lock); cm_issue_rej(work->port, work->mad_recv_wc, IB_CM_REJ_INVALID_SERVICE_ID, CM_MSG_RESPONSE_REQ, NULL, 0); @@ -1338,7 +1331,7 @@ static struct cm_id_private * cm_match_req(struct cm_work *work, atomic_inc(&cm_id_priv->refcount); cm_id_priv->id.state = IB_CM_REQ_RCVD; atomic_inc(&cm_id_priv->work_count); - spin_unlock_irqrestore(&cm.lock, flags); + spin_unlock_irq(&cm.lock); out: return listen_cm_id_priv; } @@ -1591,7 +1584,6 @@ static void cm_dup_rep_handler(struct cm_work *work) struct cm_id_private *cm_id_priv; struct cm_rep_msg *rep_msg; struct ib_mad_send_buf *msg = NULL; - unsigned long flags; int ret; rep_msg = (struct cm_rep_msg *) work->mad_recv_wc->recv_buf.mad; @@ -1604,7 +1596,7 @@ static void cm_dup_rep_handler(struct cm_work *work) if (ret) goto deref; - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); if (cm_id_priv->id.state == IB_CM_ESTABLISHED) cm_format_rtu((struct cm_rtu_msg *) 
msg->mad, cm_id_priv, cm_id_priv->private_data, @@ -1616,14 +1608,14 @@ static void cm_dup_rep_handler(struct cm_work *work) cm_id_priv->private_data_len); else goto unlock; - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); ret = ib_post_send_mad(msg, NULL); if (ret) goto free; goto deref; -unlock: spin_unlock_irqrestore(&cm_id_priv->lock, flags); +unlock: spin_unlock_irq(&cm_id_priv->lock); free: cm_free_msg(msg); deref: cm_deref_id(cm_id_priv); } @@ -1632,7 +1624,6 @@ static int cm_rep_handler(struct cm_work *work) { struct cm_id_private *cm_id_priv; struct cm_rep_msg *rep_msg; - unsigned long flags; int ret; rep_msg = (struct cm_rep_msg *)work->mad_recv_wc->recv_buf.mad; @@ -1644,13 +1635,13 @@ static int cm_rep_handler(struct cm_work *work) cm_format_rep_event(work); - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); switch (cm_id_priv->id.state) { case IB_CM_REQ_SENT: case IB_CM_MRA_REQ_RCVD: break; default: - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); ret = -EINVAL; goto error; } @@ -1663,7 +1654,7 @@ static int cm_rep_handler(struct cm_work *work) /* Check for duplicate REP. 
*/ if (cm_insert_remote_id(cm_id_priv->timewait_info)) { spin_unlock(&cm.lock); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); ret = -EINVAL; goto error; } @@ -1673,7 +1664,7 @@ static int cm_rep_handler(struct cm_work *work) &cm.remote_id_table); cm_id_priv->timewait_info->inserted_remote_id = 0; spin_unlock(&cm.lock); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); cm_issue_rej(work->port, work->mad_recv_wc, IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REP, NULL, 0); @@ -1696,7 +1687,7 @@ static int cm_rep_handler(struct cm_work *work) ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); if (ret) cm_process_work(cm_id_priv, work); @@ -1712,7 +1703,6 @@ error: static int cm_establish_handler(struct cm_work *work) { struct cm_id_private *cm_id_priv; - unsigned long flags; int ret; /* See comment in cm_establish about lookup. 
*/ @@ -1720,9 +1710,9 @@ static int cm_establish_handler(struct cm_work *work) if (!cm_id_priv) return -EINVAL; - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); if (cm_id_priv->id.state != IB_CM_ESTABLISHED) { - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); goto out; } @@ -1730,7 +1720,7 @@ static int cm_establish_handler(struct cm_work *work) ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); if (ret) cm_process_work(cm_id_priv, work); @@ -1746,7 +1736,6 @@ static int cm_rtu_handler(struct cm_work *work) { struct cm_id_private *cm_id_priv; struct cm_rtu_msg *rtu_msg; - unsigned long flags; int ret; rtu_msg = (struct cm_rtu_msg *)work->mad_recv_wc->recv_buf.mad; @@ -1757,10 +1746,10 @@ static int cm_rtu_handler(struct cm_work *work) work->cm_event.private_data = &rtu_msg->private_data; - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); if (cm_id_priv->id.state != IB_CM_REP_SENT && cm_id_priv->id.state != IB_CM_MRA_REP_RCVD) { - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); goto out; } cm_id_priv->id.state = IB_CM_ESTABLISHED; @@ -1769,7 +1758,7 @@ static int cm_rtu_handler(struct cm_work *work) ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); if (ret) cm_process_work(cm_id_priv, work); @@ -1932,7 +1921,6 @@ static int cm_dreq_handler(struct cm_work *work) struct cm_id_private *cm_id_priv; struct cm_dreq_msg *dreq_msg; struct ib_mad_send_buf *msg = NULL; - unsigned long flags; int ret; dreq_msg = (struct cm_dreq_msg *)work->mad_recv_wc->recv_buf.mad; @@ -1945,7 +1933,7 @@ static int cm_dreq_handler(struct cm_work 
*work) work->cm_event.private_data = &dreq_msg->private_data; - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); if (cm_id_priv->local_qpn != cm_dreq_get_remote_qpn(dreq_msg)) goto unlock; @@ -1964,7 +1952,7 @@ static int cm_dreq_handler(struct cm_work *work) cm_format_drep((struct cm_drep_msg *) msg->mad, cm_id_priv, cm_id_priv->private_data, cm_id_priv->private_data_len); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); if (ib_post_send_mad(msg, NULL)) cm_free_msg(msg); @@ -1977,7 +1965,7 @@ static int cm_dreq_handler(struct cm_work *work) ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); if (ret) cm_process_work(cm_id_priv, work); @@ -1985,7 +1973,7 @@ static int cm_dreq_handler(struct cm_work *work) cm_deref_id(cm_id_priv); return 0; -unlock: spin_unlock_irqrestore(&cm_id_priv->lock, flags); +unlock: spin_unlock_irq(&cm_id_priv->lock); deref: cm_deref_id(cm_id_priv); return -EINVAL; } @@ -1994,7 +1982,6 @@ static int cm_drep_handler(struct cm_work *work) { struct cm_id_private *cm_id_priv; struct cm_drep_msg *drep_msg; - unsigned long flags; int ret; drep_msg = (struct cm_drep_msg *)work->mad_recv_wc->recv_buf.mad; @@ -2005,10 +1992,10 @@ static int cm_drep_handler(struct cm_work *work) work->cm_event.private_data = &drep_msg->private_data; - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); if (cm_id_priv->id.state != IB_CM_DREQ_SENT && cm_id_priv->id.state != IB_CM_DREQ_RCVD) { - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); goto out; } cm_enter_timewait(cm_id_priv); @@ -2017,7 +2004,7 @@ static int cm_drep_handler(struct cm_work *work) ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); - 
spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); if (ret) cm_process_work(cm_id_priv, work); @@ -2107,17 +2094,16 @@ static struct cm_id_private * cm_acquire_rejected_id(struct cm_rej_msg *rej_msg) { struct cm_timewait_info *timewait_info; struct cm_id_private *cm_id_priv; - unsigned long flags; __be32 remote_id; remote_id = rej_msg->local_comm_id; if (__be16_to_cpu(rej_msg->reason) == IB_CM_REJ_TIMEOUT) { - spin_lock_irqsave(&cm.lock, flags); + spin_lock_irq(&cm.lock); timewait_info = cm_find_remote_id( *((__be64 *) rej_msg->ari), remote_id); if (!timewait_info) { - spin_unlock_irqrestore(&cm.lock, flags); + spin_unlock_irq(&cm.lock); return NULL; } cm_id_priv = idr_find(&cm.local_id_table, (__force int) @@ -2129,7 +2115,7 @@ static struct cm_id_private * cm_acquire_rejected_id(struct cm_rej_msg *rej_msg) else cm_id_priv = NULL; } - spin_unlock_irqrestore(&cm.lock, flags); + spin_unlock_irq(&cm.lock); } else if (cm_rej_get_msg_rejected(rej_msg) == CM_MSG_RESPONSE_REQ) cm_id_priv = cm_acquire_id(rej_msg->remote_comm_id, 0); else @@ -2142,7 +2128,6 @@ static int cm_rej_handler(struct cm_work *work) { struct cm_id_private *cm_id_priv; struct cm_rej_msg *rej_msg; - unsigned long flags; int ret; rej_msg = (struct cm_rej_msg *)work->mad_recv_wc->recv_buf.mad; @@ -2152,7 +2137,7 @@ static int cm_rej_handler(struct cm_work *work) cm_format_rej_event(work); - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); switch (cm_id_priv->id.state) { case IB_CM_REQ_SENT: case IB_CM_MRA_REQ_RCVD: @@ -2176,7 +2161,7 @@ static int cm_rej_handler(struct cm_work *work) cm_enter_timewait(cm_id_priv); break; default: - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); ret = -EINVAL; goto out; } @@ -2184,7 +2169,7 @@ static int cm_rej_handler(struct cm_work *work) ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); - 
spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); if (ret) cm_process_work(cm_id_priv, work); @@ -2295,7 +2280,6 @@ static int cm_mra_handler(struct cm_work *work) { struct cm_id_private *cm_id_priv; struct cm_mra_msg *mra_msg; - unsigned long flags; int timeout, ret; mra_msg = (struct cm_mra_msg *)work->mad_recv_wc->recv_buf.mad; @@ -2309,7 +2293,7 @@ static int cm_mra_handler(struct cm_work *work) timeout = cm_convert_to_ms(cm_mra_get_service_timeout(mra_msg)) + cm_convert_to_ms(cm_id_priv->av.packet_life_time); - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); switch (cm_id_priv->id.state) { case IB_CM_REQ_SENT: if (cm_mra_get_msg_mraed(mra_msg) != CM_MSG_RESPONSE_REQ || @@ -2342,7 +2326,7 @@ static int cm_mra_handler(struct cm_work *work) ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); if (ret) cm_process_work(cm_id_priv, work); @@ -2350,7 +2334,7 @@ static int cm_mra_handler(struct cm_work *work) cm_deref_id(cm_id_priv); return 0; out: - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); cm_deref_id(cm_id_priv); return -EINVAL; } @@ -2465,7 +2449,6 @@ static int cm_lap_handler(struct cm_work *work) struct cm_lap_msg *lap_msg; struct ib_cm_lap_event_param *param; struct ib_mad_send_buf *msg = NULL; - unsigned long flags; int ret; /* todo: verify LAP request and send reject APR if invalid. 
*/ @@ -2480,7 +2463,7 @@ static int cm_lap_handler(struct cm_work *work) cm_format_path_from_lap(cm_id_priv, param->alternate_path, lap_msg); work->cm_event.private_data = &lap_msg->private_data; - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); if (cm_id_priv->id.state != IB_CM_ESTABLISHED) goto unlock; @@ -2497,7 +2480,7 @@ static int cm_lap_handler(struct cm_work *work) cm_id_priv->service_timeout, cm_id_priv->private_data, cm_id_priv->private_data_len); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); if (ib_post_send_mad(msg, NULL)) cm_free_msg(msg); @@ -2515,7 +2498,7 @@ static int cm_lap_handler(struct cm_work *work) ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); if (ret) cm_process_work(cm_id_priv, work); @@ -2523,7 +2506,7 @@ static int cm_lap_handler(struct cm_work *work) cm_deref_id(cm_id_priv); return 0; -unlock: spin_unlock_irqrestore(&cm_id_priv->lock, flags); +unlock: spin_unlock_irq(&cm_id_priv->lock); deref: cm_deref_id(cm_id_priv); return -EINVAL; } @@ -2598,7 +2581,6 @@ static int cm_apr_handler(struct cm_work *work) { struct cm_id_private *cm_id_priv; struct cm_apr_msg *apr_msg; - unsigned long flags; int ret; apr_msg = (struct cm_apr_msg *)work->mad_recv_wc->recv_buf.mad; @@ -2612,11 +2594,11 @@ static int cm_apr_handler(struct cm_work *work) work->cm_event.param.apr_rcvd.info_len = apr_msg->info_length; work->cm_event.private_data = &apr_msg->private_data; - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); if (cm_id_priv->id.state != IB_CM_ESTABLISHED || (cm_id_priv->id.lap_state != IB_CM_LAP_SENT && cm_id_priv->id.lap_state != IB_CM_MRA_LAP_RCVD)) { - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); goto out; } cm_id_priv->id.lap_state = 
IB_CM_LAP_IDLE; @@ -2626,7 +2608,7 @@ static int cm_apr_handler(struct cm_work *work) ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); if (ret) cm_process_work(cm_id_priv, work); @@ -2761,7 +2743,6 @@ static int cm_sidr_req_handler(struct cm_work *work) struct cm_id_private *cm_id_priv, *cur_cm_id_priv; struct cm_sidr_req_msg *sidr_req_msg; struct ib_wc *wc; - unsigned long flags; cm_id = ib_create_cm_id(work->port->cm_dev->device, NULL, NULL); if (IS_ERR(cm_id)) @@ -2782,10 +2763,10 @@ static int cm_sidr_req_handler(struct cm_work *work) cm_id_priv->tid = sidr_req_msg->hdr.tid; atomic_inc(&cm_id_priv->work_count); - spin_lock_irqsave(&cm.lock, flags); + spin_lock_irq(&cm.lock); cur_cm_id_priv = cm_insert_remote_sidr(cm_id_priv); if (cur_cm_id_priv) { - spin_unlock_irqrestore(&cm.lock, flags); + spin_unlock_irq(&cm.lock); goto out; /* Duplicate message. */ } cur_cm_id_priv = cm_find_listen(cm_id->device, @@ -2793,12 +2774,12 @@ static int cm_sidr_req_handler(struct cm_work *work) sidr_req_msg->private_data); if (!cur_cm_id_priv) { rb_erase(&cm_id_priv->sidr_id_node, &cm.remote_sidr_table); - spin_unlock_irqrestore(&cm.lock, flags); + spin_unlock_irq(&cm.lock); /* todo: reply with no match */ goto out; /* No match. */ } atomic_inc(&cur_cm_id_priv->refcount); - spin_unlock_irqrestore(&cm.lock, flags); + spin_unlock_irq(&cm.lock); cm_id_priv->id.cm_handler = cur_cm_id_priv->id.cm_handler; cm_id_priv->id.context = cur_cm_id_priv->id.context; @@ -2899,7 +2880,6 @@ static int cm_sidr_rep_handler(struct cm_work *work) { struct cm_sidr_rep_msg *sidr_rep_msg; struct cm_id_private *cm_id_priv; - unsigned long flags; sidr_rep_msg = (struct cm_sidr_rep_msg *) work->mad_recv_wc->recv_buf.mad; @@ -2907,14 +2887,14 @@ static int cm_sidr_rep_handler(struct cm_work *work) if (!cm_id_priv) return -EINVAL; /* Unmatched reply. 
*/ - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); if (cm_id_priv->id.state != IB_CM_SIDR_REQ_SENT) { - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); goto out; } cm_id_priv->id.state = IB_CM_IDLE; ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); cm_format_sidr_rep_event(work); cm_process_work(cm_id_priv, work); @@ -2930,14 +2910,13 @@ static void cm_process_send_error(struct ib_mad_send_buf *msg, struct cm_id_private *cm_id_priv; struct ib_cm_event cm_event; enum ib_cm_state state; - unsigned long flags; int ret; memset(&cm_event, 0, sizeof cm_event); cm_id_priv = msg->context[0]; /* Discard old sends or ones without a response. */ - spin_lock_irqsave(&cm_id_priv->lock, flags); + spin_lock_irq(&cm_id_priv->lock); state = (enum ib_cm_state) (unsigned long) msg->context[1]; if (msg != cm_id_priv->msg || state != cm_id_priv->id.state) goto discard; @@ -2964,7 +2943,7 @@ static void cm_process_send_error(struct ib_mad_send_buf *msg, default: goto discard; } - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); cm_event.param.send_status = wc_status; /* No other events can occur on the cm_id at this point. */ @@ -2974,7 +2953,7 @@ static void cm_process_send_error(struct ib_mad_send_buf *msg, ib_destroy_cm_id(&cm_id_priv->id); return; discard: - spin_unlock_irqrestore(&cm_id_priv->lock, flags); + spin_unlock_irq(&cm_id_priv->lock); cm_free_msg(msg); }

From narravul at cse.ohio-state.edu Wed May 23 18:40:10 2007
From: narravul at cse.ohio-state.edu (Sundeep Narravula)
Date: Wed, 23 May 2007 21:40:10 -0400 (EDT)
Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm
In-Reply-To: <4654B14B.4000208@opengridcomputing.com>
Message-ID:

This suggestion seems to be working. I have been able to run the test successfully several times so far without any of the earlier problems.

Thanks,
--Sundeep.

On Wed, 23 May 2007, Steve Wise wrote:

> Guys, this reminds me of an issue we have with rnics and regular nics on
> the same physical network. By default linux responds to arp queries on
> all ports it receives the query on. This leads to very bad results when
> you're trying to do offloaded connections. When resolving the
> address/route, the rdma client can end up getting the mac address of the
> dumb nic instead of the rnic. I don't know if route resolution in the
> ib cm has this issue, but it might since they use ipoib for some part of
> the resolution, no?
>
> You might try this (snippet from the cxgb3 release notes file to be included
> in -rc4):
>
> 2) If you have a multi-homed host and the physical ethernet networks are
> bridged, then you need to configure arp to only send replies on the
> interface with the target ip address:
>
> sysctl -w net.ipv4.conf.all.arp_ignore=2
>
> Steve.
>
> Sundeep Narravula wrote:
> >> Odd - I will see if I can reproduce this.
> >>
> >> Are the HCAs sharing the same IB subnet?
> >>
> >
> > Yes. They are in the same IB subnet.
> >
> >
> >>> hmm.. I can try this but as a last resort. Ideally I would like to use
> >>> just one listen cm_id bound to 0.0.0.0.
> >>>
> >> I was thinking about binding on the active side, before calling connect. But I
> >> still want to look into this more.
> >>
> >
> > I can try this one out.
> >
> > --Sundeep.
> >
> >
> >> - Sean
> >>
> >>
> >
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >
>

From swise at opengridcomputing.com Wed May 23 19:15:43 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 23 May 2007 19:15:43 -0700
Subject: [ofa-general] RE: Problem with using two interfaces with rdma-cm
In-Reply-To:
References:
Message-ID: <4654F54F.9060501@opengridcomputing.com>

BTW: I _think_ the ipoib module can set this option on its specific interfaces programmatically.

Sundeep Narravula wrote:
> This suggestion seems to be working. I have been able to run the test
> successfully several times so far without any of the earlier problems.
>
> Thanks,
> --Sundeep.
>
> On Wed, 23 May 2007, Steve Wise wrote:
>
>
>> Guys, this reminds me of an issue we have with rnics and regular nics on
>> the same physical network. By default linux responds to arp queries on
>> all ports it receives the query on. This leads to very bad results when
>> you're trying to do offloaded connections. When resolving the
>> address/route, the rdma client can end up getting the mac address of the
>> dumb nic instead of the rnic. I don't know if route resolution in the
>> ib cm has this issue, but it might since they use ipoib for some part of
>> the resolution, no?
>>
>> You might try this (snippet from the cxgb3 release notes file to be included
>> in -rc4):
>>
>> 2) If you have a multi-homed host and the physical ethernet networks are
>> bridged, then you need to configure arp to only send replies on the
>> interface with the target ip address:
>>
>> sysctl -w net.ipv4.conf.all.arp_ignore=2
>>
>> Steve.
>>
>> Sundeep Narravula wrote:
>>
>>>> Odd - I will see if I can reproduce this.
>>>>
>>>> Are the HCAs sharing the same IB subnet?
>>>>
>>> Yes. They are in the same IB subnet.
>>>
>>>>> hmm.. I can try this but as a last resort. Ideally I would like to use
>>>>> just one listen cm_id bound to 0.0.0.0.
>>>>>
>>>> I was thinking about binding on the active side, before calling connect. But I
>>>> still want to look into this more.
>>>>
>>> I can try this one out.
>>>
>>> --Sundeep.
>>>
>>>> - Sean
>>>>
>
>

From mst at dev.mellanox.co.il Wed May 23 20:36:15 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 06:36:15 +0300
Subject: [ofa-general] Re: IPoIB NETIF_F_SG flag for GSO
In-Reply-To:
References: <20070523195700.GD6019@mellanox.co.il>
Message-ID: <20070524033615.GE6019@mellanox.co.il>

> Quoting Shirley Ma :
> Subject: IPoIB NETIF_F_SG flag for GSO
>
> Hello Roland, Michael,
>
> I tried GSO for IPoIB last year. I didn't see much BW gain for UD, only some
> decrease in cpu utilization. I looked at the GSO patch carefully and found
> there are additional skb copies in skb segment. If the device supports SG, then
> the copies can be avoided. IPoIB does support SG, so I am planning to enable it to
> test GSO again. I would like to know whether you have tried this before?
> Any possible issue you can think of?
>
> Thanks
> Shirley Ma

Yes. SG currently needs csum offloading, and IPoIB does not support that.

--
MST

From xma at us.ibm.com Wed May 23 21:56:33 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 23 May 2007 21:56:33 -0700
Subject: [ofa-general] Re: IPoIB NETIF_F_SG flag for GSO
In-Reply-To: <20070524033615.GE6019@mellanox.co.il>
Message-ID:

Hello Michael,

> Yes. SG currently needs csum offloading, and IPoIB does not
> support that.
> --
> MST

SG should have nothing to do with CSUM. They are two different features: one is for scatter/gather I/O, the other means the HW can checksum all the packets. If you look at the net device features, NETIF_F_SG and NETIF_F_HW_CSUM are different flags.

What did you get when enabling IPoIB SG? I remember there was a discussion in net-dev before.

Thanks
Shirley Ma

From jackieict at gmail.com Wed May 23 22:01:44 2007
From: jackieict at gmail.com (zhang Jackie)
Date: Thu, 24 May 2007 13:01:44 +0800
Subject: [ofa-general] how to write a IB user level multicast application
Message-ID: <13432ab00705232201s5f7d5a5h5ecaaddf57ead11b@mail.gmail.com>

Hi everyone!

I want to write an IB user-level multicast application, but I can find only two functions related to multicast: ibv_attach_mcast and ibv_detach_mcast. I can't find any function, or any information in the work request, for sending packets to a multicast group. For example, in struct ibv_send_wr, the ud struct must have a remote_qpn, not a QP group.

Does anyone know how to write a user-level multicast application? If so, please let me know. Thanks.

From mst at dev.mellanox.co.il Wed May 23 22:38:19 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin)
Date: Thu, 24 May 2007 08:38:19 +0300
Subject: [ofa-general] Re: IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <46537081.30906@linux.vnet.ibm.com>
References: <46537081.30906@linux.vnet.ibm.com>
Message-ID: <20070524053819.GF6019@mellanox.co.il>

> Here are my thoughts about limiting the memory footprint for IPOIB CM
> (NOSRQ) patch:
>
> By default, cap the NOSRQ memory usage to 1GB.

The ppc systems I have start crashing if you map as much as 300MB for DMA.

> The default recvq_size
> is set to 128. Therefore for 64KB packets this would imply a maximum of
> 128 endpoints.
>
> -Make the maximum number of endpoints a module parameter with a default
> value of 128.
>
> -The NOSRQ limit of 1GB is also made a module parameter. However, 1GB is
> the default limit and could be changed as needed (by the administrator)
> depending on the system configuration, application needs and so on.

All this need for manual tuning is really going in the wrong direction: we should be looking for ways to get rid of existing module parameters, like using the low watermark event to dynamically tune the RQ depth.

> The
> server would return a "REJ" message upon receiving a "REQ", whenever one
> of these limits (i.e. max number of endpoints or the max NOSRQ memory
> usage) is reached. Currently, we only check for the maximum number of
> endpoints -hard coded to 1024.

So with the limit sufficiently low, we will hopefully avoid crashing the server. That's progress, but what happens to the client when it gets this reject?

> -The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that
> support SRQ like the Topspin HCA and, such HCAs should not be
> impacted at all.

I don't think it's that clean yet.

Here's an idea: implement a "fake SRQ" for ehca in software: make post recv on the SRQ queue the WR, and spread the WRs evenly between QPs as they connect. Once the number of QPs goes above some limit, the create QP command will fail.
This would contain the mess nicely inside ehca (I think you'll want to add a flag that lets software figure out that the SRQ is fake). We will still be left with the basic problem of what to do at the active side upon the reject, though.

> -Currently we allocate a default of 64KB for the ring buffer elements,
> and this buffer size is not linked to the mtu. In the future, we could
> allocate buffers based on the mtu and link that into the computation of
> the memory cap. This way customers who might want to use a smaller mtu
> could use a larger number of endpoints, or a larger recvq_size without
> exceeding the memory cap.

I think that conceptually, global MTU config is intended for outgoing packets, not for the RX buffers. For example, how would we handle MTU changes?

> Would this approach address the issues of scalability and enable IPOIB
> CM to be turned as the default?

For IPoIB CM to be the default, it needs to work as well as datagram mode for most usage scenarios. Unfortunately, your proposal above seems to fail this requirement: it will improve speed in some scenarios, but in others it will either drastically increase the need for manual configuration, cause denial of service, or use up a huge amount of memory.

I think that to be able to use connected mode on ehca, what you need is

1. Find a way to make IPoIB fall back on datagram mode when you run out of resources. This might need to be addressed at the protocol level.
2. Separate the noSRQ hacks more cleanly. I suggested some ways to do this earlier. Maybe the "fake srq" above will be a good way to solve it.

--
MST

From mst at dev.mellanox.co.il Wed May 23 22:51:08 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin)
Date: Thu, 24 May 2007 08:51:08 +0300
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <4654690F.1040305@linux.vnet.ibm.com>
References: <46537081.30906@linux.vnet.ibm.com> <4654690F.1040305@linux.vnet.ibm.com>
Message-ID: <20070524055108.GG6019@mellanox.co.il>

> Quoting Pradeep Satyanarayana :
> Subject: Re: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
>
>
> If this proposal is acceptable, would you want me to generate a patch
> against Roland's for-2.6.22 git tree, or would for-2.6.23 tree be
> better?

I've just answered in another thread. Summary:

I think that to enable connected mode on ehca, what we need is

1. A way to make IPoIB fall back on datagram mode when you run out of resources. This might need to be addressed at the protocol level.
2. A way to separate the noSRQ hacks more cleanly. This is not just me being a micro-optimization freak. I suggested some ways to do this better.

--
MST

From mst at dev.mellanox.co.il Wed May 23 23:22:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 09:22:54 +0300
Subject: [ofa-general] Re: IPoIB NETIF_F_SG flag for GSO
In-Reply-To:
References: <20070524033615.GE6019@mellanox.co.il>
Message-ID: <20070524061736.GI6019@mellanox.co.il>

> Quoting Shirley Ma :
> Subject: Re: IPoIB NETIF_F_SG flag for GSO
>
> Hello Michael,
>
> > Yes. SG currently needs csum offloading, and IPoIB does not
> > support that.
>
> SG should have nothing to do with CSUM. They are two different features: one is
> for scatter/gather I/O, the other means the HW can checksum all the packets. If you
> look at the net device features, NETIF_F_SG and NETIF_F_HW_CSUM are different flags.

Look at register_netdevice:

    /* Fix illegal SG+CSUM combinations. */
    if ((dev->features & NETIF_F_SG) &&
        !(dev->features & NETIF_F_ALL_CSUM)) {
        printk(KERN_NOTICE "%s: Dropping NETIF_F_SG since no checksum feature.\n",
               dev->name);
        dev->features &= ~NETIF_F_SG;
    }

> What did you get when enabling IPoIB SG?
> I remember there was a discussion in net-dev before.
>
> Thanks
> Shirley Ma

Google it.

--
MST

From dotanb at dev.mellanox.co.il Wed May 23 23:44:24 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Thu, 24 May 2007 09:44:24 +0300
Subject: [ofa-general] how to write a IB user level multicast application
In-Reply-To: <13432ab00705232201s5f7d5a5h5ecaaddf57ead11b@mail.gmail.com>
References: <13432ab00705232201s5f7d5a5h5ecaaddf57ead11b@mail.gmail.com>
Message-ID: <46553448.6020508@dev.mellanox.co.il>

Hi.

zhang Jackie wrote:
> Hi everyone!
> I want to write an IB user-level multicast application, but I can find
> only two functions related to multicast: ibv_attach_mcast and
> ibv_detach_mcast. I can't find any function, or any information in the
> work request, for sending packets to a multicast group. For example, in
> struct ibv_send_wr, the ud struct must have a remote_qpn, not a QP group.
> Does anyone know how to write a user-level multicast application? If
> so, please let me know. Thanks.

In the following URL you can find a very simple example of how to use multicast:
https://svn.openfabrics.org/svn/openib/trunk/contrib/mellanox/ibtp/gen2/userspace/useraccess/multicast_test/multicast_test.c

If you want to use multicast in IB, you need to do the following things:

receiver side:
--------------
create a UD QP
attach this QP to a multicast group
post RRs to the RQ of this QP

sender side:
--------------
post the message to remote QP number 0xffffff, with a dlid which is the multicast LID, and with the GID of the multicast group (in the GRH of the AH).

This test doesn't send an SA query (to get the multicast properties) or an SA multicast join (to make the SM configure the subnet so that the port this QP is attached to gets the multicast messages). This example will work on a back-to-back topology.
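
The sender-side addressing described above can be sketched in C. This is a minimal, self-contained sketch and not real verbs code: the structs mcast_ah_attr and mcast_ud_wr and the helper fill_mcast_send() are hypothetical stand-ins that only mirror the relevant fields of libibverbs' struct ibv_ah_attr and the wr.ud member of struct ibv_send_wr. In a real program the address handle would presumably be created with ibv_create_ah(), and the receiver would attach its UD QP with ibv_attach_mcast().

```c
#include <stdint.h>
#include <string.h>

#define MCAST_QPN 0xffffffu /* destination QP number used for multicast sends */

/* Hypothetical stand-in mirroring a few fields of struct ibv_ah_attr. */
struct mcast_ah_attr {
    uint8_t  dgid[16];  /* multicast GID, carried in the GRH */
    uint16_t dlid;      /* multicast LID */
    uint8_t  is_global; /* 1 = the packet carries a GRH */
    uint8_t  port_num;  /* local port to send from */
};

/* Hypothetical stand-in mirroring the wr.ud member of struct ibv_send_wr. */
struct mcast_ud_wr {
    const struct mcast_ah_attr *ah;
    uint32_t remote_qpn;
    uint32_t remote_qkey;
};

/* IB multicast GIDs start with 0xff; multicast LIDs are 0xC000..0xFFFE. */
static int is_mcast_gid(const uint8_t gid[16]) { return gid[0] == 0xff; }
static int is_mcast_lid(uint16_t lid) { return lid >= 0xC000 && lid <= 0xFFFE; }

/*
 * Fill in the address and work-request fields for a multicast send.
 * Returns 0 on success, -1 if the GID or LID is not a multicast address
 * (in which case neither output struct is touched).
 */
static int fill_mcast_send(struct mcast_ah_attr *ah, struct mcast_ud_wr *wr,
                           const uint8_t mgid[16], uint16_t mlid,
                           uint8_t port, uint32_t qkey)
{
    if (!is_mcast_gid(mgid) || !is_mcast_lid(mlid))
        return -1;

    memset(ah, 0, sizeof *ah);
    memcpy(ah->dgid, mgid, 16);  /* GID of the group goes into the GRH */
    ah->dlid = mlid;             /* dlid is the multicast LID */
    ah->is_global = 1;           /* multicast packets need a GRH */
    ah->port_num = port;

    wr->ah = ah;
    wr->remote_qpn = MCAST_QPN;  /* always 0xffffff for multicast */
    wr->remote_qkey = qkey;
    return 0;
}
```

A caller would pass the group's GID and LID (values it would normally learn from the SA, or hard-code for a back-to-back test) and then post the resulting work request on its UD QP.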
I hope this helps.
Dotan

From ogerlitz at voltaire.com Thu May 24 01:18:34 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 24 May 2007 11:18:34 +0300
Subject: [ofa-general] RE: two interfaces with ipoib
In-Reply-To: <4654C278.6030703@ichips.intel.com>
References: <4654B14B.4000208@opengridcomputing.com> <4654C278.6030703@ichips.intel.com>
Message-ID: <46554A5A.3050607@voltaire.com>

Sean Hefty wrote:
> Steve Wise wrote:
>> Guys, this reminds me of an issue we have with rnics and regular nics
>> on the same physical network. By default linux responds to arp
>> queries on all ports it receives the query on. This leads to very bad
>> results when you're trying to do offloaded connections. When
>> resolving the address/route, the rdma client can end up getting the
>> mac address of the dumb nic instead of the rnic. I don't know if
>> route resolution in the ib cm has this issue, but it might since they
>> use ipoib for some part of the resolution, no?
>
> I think this could be the problem. (And could have taken me a long time
> to figure out if it is, so thanks!)
>
>> 2) If you have a multi-homed host and the physical ethernet networks are
>> bridged, then you need to configure arp to only send replies on the
>> interface with the target ip address:
>>
>> sysctl -w net.ipv4.conf.all.arp_ignore=2

OK, Sean, sorry for not mentioning this to you. We resolved this with a customer some time ago, and I communicated it to Mellanox at Sonoma so that it would be added to the OFED 1.2 documentation.

Generally speaking, it's a bad habit (somewhat of a dead-on-arrival test for a system administrator) to have two IP (=L3) subnets sharing the same L2 (specifically broadcast) domain. In InfiniBand (IPoIB) it means having two IP subnets over the same partition, and in Ethernet it means having two IP subnets over the same VLAN.
I understand the default setting of arp_ignore = 0 is a religious argument held once in a while on the netdev mailing list. If people from here want to try it, I am crossing my fingers for them, but again, it has nothing special to do with IPoIB.

Or.

From ogerlitz at voltaire.com Thu May 24 02:31:11 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 24 May 2007 12:31:11 +0300
Subject: [ofa-general] What causes "ib0: packet len 65520 (> 2048) too long to send, dropping" messages?
In-Reply-To: References:
Message-ID: <46555B5F.6000302@voltaire.com>

Scott Weitzenkamp (sweitzen) wrote:
> I see a small number of these types of messages, when I send large
> messages via IP multicast.
>
> Why do I only see a few of the messages?

Because IPoIB CM makes the stack learn that for this neighbour the MTU is 2K (2044) and not the 64K (65520) device MTU published to the stack. The IPoIB code uses the update_pmtu callback of the neighbour for that purpose, emulating the case where a "path MTU" ICMP packet has been received from a router.

Or.
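
The numbers in the reply above and in the log message fit together under one assumption: the 2044 figure is the 2048-byte IB MTU minus the 4-byte IPoIB encapsulation header. The sketch below only illustrates that arithmetic. The names are hypothetical and the drop test is a simplified model, not the driver's actual code.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    ENCAP_LEN     = 4,      /* assumed IPoIB encapsulation header size in bytes */
    CM_DEVICE_MTU = 65520   /* device MTU published to the stack in connected mode */
};

/* IP MTU that fits in a single UD packet for a given IB MTU. */
static uint32_t ud_ip_mtu(uint32_t ib_mtu)
{
    return ib_mtu - ENCAP_LEN;
}

/* Simplified model of the drop decision behind the log message:
 * a packet longer than the IB MTU cannot be sent as one UD packet. */
static bool ud_send_would_drop(uint32_t packet_len, uint32_t ib_mtu)
{
    return packet_len > ib_mtu;
}
```

With a 2048-byte IB MTU, `ud_ip_mtu` gives the 2044 mentioned in the reply, and a 65520-byte multicast packet is dropped until the stack has learned the smaller MTU for that destination.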
From vlad at lists.openfabrics.org Thu May 24 02:42:29 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 24 May 2007 02:42:29 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070524-0200 daily build status Message-ID: <20070524094229.67E89E60830@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.15 Passed on ppc64 with 
linux-2.6.17 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.16 Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From bs at q-leap.de Thu May 24 03:35:07 2007 From: bs at q-leap.de (Bernd Schubert) Date: Thu, 24 May 2007 12:35:07 +0200 Subject: [ofa-general] abbreviations Message-ID: <200705241235.07520.bs@q-leap.de> Hi, I need some help with abbreviations, in rdma_cm.h we have struct rdma_conn_param struct rdma_ud_param struct rdma_cm_event So _conn_ means connection, but what do _ud_ and _cm_ mean? Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH From halr at voltaire.com Thu May 24 03:49:28 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 24 May 2007 06:49:28 -0400 Subject: [ofa-general] abbreviations In-Reply-To: <200705241235.07520.bs@q-leap.de> References: <200705241235.07520.bs@q-leap.de> Message-ID: <1180003767.16831.179665.camel@hal.voltaire.com> On Thu, 2007-05-24 at 06:35, Bernd Schubert wrote: > Hi, > > I need some help with abbreviations, in rdma_cm.h we have > > struct rdma_conn_param > struct rdma_ud_param > struct rdma_cm_event > > So _conn_ means connection, but what do _ud_ and _cm_ mean? 
_ud_: unreliable datagram
_cm_: communication (some say connection) manager

-- Hal

> Thanks,
> Bernd
>
>

From dotanb at dev.mellanox.co.il Thu May 24 03:58:38 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Thu, 24 May 2007 13:58:38 +0300
Subject: [ofa-general] abbreviations
In-Reply-To: <200705241235.07520.bs@q-leap.de>
References: <200705241235.07520.bs@q-leap.de>
Message-ID: <46556FDE.3000701@dev.mellanox.co.il>

Bernd Schubert wrote:
> Hi,
>
> I need some help with abbreviations, in rdma_cm.h we have
>
> struct rdma_conn_param
> struct rdma_ud_param
> struct rdma_cm_event
>
> So _conn_ means connection, but what do _ud_ and _cm_ mean?
>

UD: Unreliable Datagram
CM: Communication Manager

Is this what you meant?

thanks
Dotan

From mst at dev.mellanox.co.il Thu May 24 04:47:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 14:47:11 +0300
Subject: [ofa-general] wmb missing in libmthca?
Message-ID: <20070524114711.GB4585@mellanox.co.il>

Roland, I see this in the kernel:

	((struct mthca_next_seg *) prev_wqe)->nda_op =
		cpu_to_be32((ind << qp->rq.wqe_shift) | 1);
	wmb();
	((struct mthca_next_seg *) prev_wqe)->ee_nds =
		cpu_to_be32(MTHCA_NEXT_DBD | size);

but userspace does not have wmb here. Is it needed?

-- 
MST

From erezz at voltaire.com Thu May 24 04:49:31 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Thu, 24 May 2007 14:49:31 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons foropen-iscsiover iSER support for RHAS4 up3 and up4
In-Reply-To: <20070521114410.GG20400@mellanox.co.il>
References: <4641D295.5060907@voltaire.com><20070521081625.GA20400@mellanox.co.il><46517F78.8000805@voltaire.com> <20070521114410.GG20400@mellanox.co.il>
Message-ID: <46557BCB.7030102@voltaire.com>

>> > >> > However, you are copying a ton of files from upstream kernel.
>> > Sticking extra files in include might interfere with newer >> > kernels, so I don't have better ideas for this for 1.2 >> > (for 1.3 I am hoping we'll use the submodule support in git, >> > so we'll be able to re-use headers as well). >> > >> > But, for files *not* in "include/", I suggest that, instead of > sticking our >> > own version in addons, we should check out the files from upstream > and tweak >> > makefiles to pick them up: maintaining these in OFED tree long-term will >> > be a >> > problem. >> >> Do you suggest to add a new mechanism to OFED that will do that? > > No, this is the same mechanism that we use for the rest of the files: > check them out of the kernel tree. > Look at file ofed_scripts/ofed_checkout.sh > > But I stress that we can not do this for files under > include/ *unless* they only include packet structure definitions. > Otherwise we'll get weird data corruption on newer kernels. See below. > >> > >> >> >> + >> >> >> + struct iscsi_internal { >> >> >> + int daemon_pid; >> >> >> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l >> >> >> + #define cdev_to_iscsi_internal(_cdev) \ >> >> >> + container_of(_cdev, struct iscsi_internal, cdev) >> >> >> + >> >> >> ++extern int attribute_container_init(void); >> >> >> ++ >> >> > >> >> > This does not look scsi-related. Why does this belong here? >> >> >> >> This is a hack. In 2.6.20, attribute_container_init is called from >> > drivers/base/init.c. Since I cannot do that, I'm calling it from the >> > init function in scsi_transport_iscsi (because scsi_transport_iscsi uses >> > the attribute container). Do you have a better suggestion? >> > >> > Aha. No better ideas for the header, let it be for now. >> > But the code in drivers/base/init.c can be checked out rather than >> > copied over. >> >> I'm using only a very small part of init.c. I'm not sure that we > should copy it. > > OK then. > What about the stuff like scsi.c? 
I have the following files in backport/2.6.9_UX/include/src/: attribute_container.c - almost identical to the file on 2.6.20. I had to change one line in it. init.c - only a small part of the original file in 2.6.20 klist.c - almost identical to the file on 2.6.20. I had to change one line in it. kref_new.c - based on kref.c scsi.c - only a small part of the original file in 2.6.20 scsi_lib.c - only a small part of the original file in 2.6.20 scsi_scan.c - only a small part of the original file in 2.6.20 transport_class.c - identical to 2.6.20 So, the only file identical to 2.6.20 is transport_class.c. We can copy it from 2.6.20, but since it's only a single file, I'm not sure if it will make a real difference. > >> >> diff --git a/kernel_addons/backport/2.6.9_U4/include/linux/memory.h >> > b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h >> >> new file mode 100644 >> >> index 0000000..654ef55 >> >> --- /dev/null >> >> +++ b/kernel_addons/backport/2.6.9_U4/include/linux/memory.h >> >> @@ -0,0 +1,89 @@ >> >> +/* >> >> + * include/linux/memory.h - generic memory definition >> >> + * >> >> + * This is mainly for topological representation. We define the >> >> + * basic "struct memory_block" here, which can be embedded in per-arch >> >> + * definitions or NUMA information. >> >> + * >> >> + * Basic handling of the devices is done in drivers/base/memory.c >> >> + * and system devices are handled in drivers/base/sys.c. >> >> + * >> >> + * Memory block are exported via sysfs in the class/memory/devices/ >> >> + * directory. >> >> + * >> >> + */ >> > >> > >> > Sorry, why are we copying this here? >> > Are you actually trying to work with hotplug memory? >> >> Sorry, it seems that I don't really need memory.h. It was included > from init.c, but it is not necessary. I made the fix on > ofed_1_2_iser_rh4.git. > > Pls check other headers you pull in - is there something you can skip? No. >> > >> > This looks pretty hacky. 
Moving files around during make will >> > interfere with people trying to e.g. create a patch. >> > What is this doing? Can't makefile just point to the right files? >> > >> >> Here's the problem: >> >> I'm trying to build a module that contains multiple object files (e.g. > libiscsi). libiscsi contains libiscsi.o & scsi_scan.o. Something like: >> >> libiscsi-y := libiscsi_f.o scsi_scan.o >> >> The problem is that if I'm doing something like: >> >> libiscsi-y := libiscsi.o scsi_scan.o >> >> then, libiscsi.ko doesn't contain the symbols from libiscsi.o (only > symbols from scsi_scan.o). We found 2 solutions for this problem: >> >> 1. Change the module name - this is problematic because open-iscsi > startup script uses the original module name. >> 2. Change the file name (libiscsi.c -> libiscsi_f.c) - this is what I did. >> >> I don't really like this hack, but I wasn't able to come up with > something better. Do you know how to overcome this problem? > > I do not have the time to look into this in a deep way. > But it seems that you can just add a file libiscsi_f.c with > > #include "libiscsi.c" > > would this work? > > -- > MST > Yes, it works ok. I've updated my git tree. Do you think that there are other fixes to be made? Else, I'll be glad to have it in the next OFED rc. Thanks, Erez From mst at dev.mellanox.co.il Thu May 24 04:57:15 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 24 May 2007 14:57:15 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons foropen-iscsiover iSER support for RHAS4 up3 and up4 In-Reply-To: <46557BCB.7030102@voltaire.com> References: <20070521114410.GG20400@mellanox.co.il> <46557BCB.7030102@voltaire.com> Message-ID: <20070524115715.GC4585@mellanox.co.il> Quoting Erez Zilber : Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons foropen-iscsiover iSER support for RHAS4 up3 and up4 >> > >> > However, you are copying a ton of files from upstream kernel. 
>> > Sticking extra files in include might interfere with newer >> > kernels, so I don't have better ideas for this for 1.2 >> > (for 1.3 I am hoping we'll use the submodule support in git, >> > so we'll be able to re-use headers as well). >> > >> > But, for files *not* in "include/", I suggest that, instead of > sticking our >> > own version in addons, we should check out the files from upstream > and tweak >> > makefiles to pick them up: maintaining these in OFED tree long-term will >> > be a >> > problem. >> >> Do you suggest to add a new mechanism to OFED that will do that? > > No, this is the same mechanism that we use for the rest of the files: > check them out of the kernel tree. > Look at file ofed_scripts/ofed_checkout.sh > > But I stress that we can not do this for files under > include/ *unless* they only include packet structure definitions. > Otherwise we'll get weird data corruption on newer kernels. See below. > > > >> > > >> >> >> + > >> >> >> + struct iscsi_internal { > >> >> >> + int daemon_pid; > >> >> >> +@@ -65,6 +69,8 @@ static DEFINE_SPINLOCK(iscsi_transport_l > >> >> >> + #define cdev_to_iscsi_internal(_cdev) \ > >> >> >> + container_of(_cdev, struct iscsi_internal, cdev) > >> >> >> + > >> >> >> ++extern int attribute_container_init(void); > >> >> >> ++ > >> >> > > >> >> > This does not look scsi-related. Why does this belong here? > >> >> > >> >> This is a hack. In 2.6.20, attribute_container_init is called from > >> > drivers/base/init.c. Since I cannot do that, I'm calling it from the > >> > init function in scsi_transport_iscsi (because scsi_transport_iscsi uses > >> > the attribute container). Do you have a better suggestion? > >> > > >> > Aha. No better ideas for the header, let it be for now. > >> > But the code in drivers/base/init.c can be checked out rather than > >> > copied over. > >> > >> I'm using only a very small part of init.c. I'm not sure that we > > should copy it. > > > > OK then. > > What about the stuff like scsi.c? 
> > I have the following files in backport/2.6.9_UX/include/src/: > > attribute_container.c - almost identical to the file on 2.6.20. I had to change one line in it. could be a patch ... which line? > init.c - only a small part of the original file in 2.6.20 > > klist.c - almost identical to the file on 2.6.20. I had to change one line in it. which line? > kref_new.c - based on kref.c Sounds scary ... how different is it? > scsi.c - only a small part of the original file in 2.6.20 > > scsi_lib.c - only a small part of the original file in 2.6.20 > > scsi_scan.c - only a small part of the original file in 2.6.20 > > transport_class.c - identical to 2.6.20 > > So, the only file identical to 2.6.20 is transport_class.c. We can copy it from 2.6.20, but since it's only a single file, I'm not sure if it will make a real difference. transport_class.c, attribute_container.c and klist.c are quite big together: more than 1000 lines. So by all means, let's check them out from kernel tree. -- MST From jsquyres at cisco.com Thu May 24 04:58:50 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 24 May 2007 07:58:50 -0400 Subject: [ofa-general] No OFED teleconf Monday, May 28 Message-ID: <5DF88A18-1DFE-4A0D-AE86-CE01A5930B16@cisco.com> Due to the US Memorial Day holiday this upcoming Monday (May 28th), there will be no EWG/OFED teleconference (you'll receive an Outlook cancellation shortly). Moving the teleconf to a day later (Tuesday, May 29th) is problematic for some. Do we want a teleconference on Wednesday, May 30th? Let me know and I can setup a phone bridge, if desired. 
-- Jeff Squyres Cisco Systems From devesh28 at gmail.com Thu May 24 05:22:16 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Thu, 24 May 2007 17:52:16 +0530 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <1179930909.16831.100286.camel@hal.voltaire.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <464B5C07.8040601@ichips.intel.com> <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com> <1179398534.23882.67542.camel@hal.voltaire.com> <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com> <1179483657.23882.158398.camel@hal.voltaire.com> <309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com> <1179769930.15940.9823.camel@hal.voltaire.com> <309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com> <1179930909.16831.100286.camel@hal.voltaire.com> Message-ID: <309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com> On 23 May 2007 10:35:13 -0400, Hal Rosenstock wrote: > On Wed, 2007-05-23 at 10:27, Devesh Sharma wrote: > > On 21 May 2007 13:52:11 -0400, Hal Rosenstock wrote: > > > On Mon, 2007-05-21 at 01:58, Devesh Sharma wrote: > > > > On 18 May 2007 06:21:05 -0400, Hal Rosenstock wrote: > > > > > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote: > > > > > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock wrote: > > > > > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote: > > > > > > > > On 5/17/07, Sean Hefty wrote: > > > > > > > > > > But initially this will generate a packet for each path, while sys > > > > > > > > > > admin knows that path is there and he can hard-code the entries for > > > > > > > > > > it. Other thing is that why Admin will care about creating such record > > > > > > > > > > while SA is itself taking care, right? > > > > > > > > > > > > > > > > > > In your original message you asked about adding 'dummy entries' to the > > > > > > > > > cache. I agree that pre-loading the cache can be useful. 
What I still > > > > > > > > > am not understanding is the reasoning for adding 'dummy entries'. By > > > > > > > > > 'dummy entries', I've been assuming that these are invalid path records, > > > > > > > > > but maybe that's not what you meant. > > > > > > > > Ok if "dummy entries" word as such has created confusion then I am > > > > > > > > sorry for that, But with that I mean that, those are valid path > > > > > > > > records which Administrator knows in advance and while loading the > > > > > > > > module, > > > > > > > > > > > > > > How does the admin know they are valid ? > > > > > > Depending on the initial application runs, some trusted PRs can be generated. > > > > > > > > > > What do initial application runs have to do with this ? > > > > My understanding is that, once the cluster is UP, and if between Node > > > > A and Node B there is only one path, > > > > > > So this is a feature for such one path subnets. I wonder what percentage > > > of deployed subnets fits this case. > > You never know, It may be used for debugging also. > > I still don't have a good feel for how common/generally useful this will > really be. > > > > > then, SA query always going to return same values in PR. > > > > > > If subnet topology is changed, these PRs might change. There are other > > > cases where they change too. > > Not sure about it...some suggestion? > > > > > > > On this basis Initial application runs will generate PRs, > > > > > > That's what confused me before (Applications don't generate PRs but > > > rather request them.) but I think I see what you mean now. > > Ok > > > > > > > these PRs can be saved in some file, and can be loaded > > > > when cache_module comes in. > > > > > > > > > > > >Are they somehow preconfigured at the SM ? > > > > > > I am not sure about SM has any such provision? > > > > > > > > > > Not that I'm aware of. > > > > Ok, So, currently no such support is there in SM? 
> > > > > > I can speak definitively for OpenSM and there is no such support. As to > > > the vendor SMs, I don't think so but don't know for absolute certainty. > > > Someone can correct me if I'm wrong but I wouldn't assume no response > > > means correctness as some may not be listening nor want to respond as to > > > "value added" vendor specific features. > > What is the issue if OpenSM provides this? > > I'm not following you. What does/should OpenSM provide ? OpenIB works in > configurations with other SMs. I am talking about pre-configuring PRs in OpenSM DB. > > > > > > > > > > Also not sure about the > > > > > > role of SM in path resolving. I mean once node has initiated SA query, > > > > > > whether SM has some database to reply SA or On the fly destination > > > > > > node is contacted to get asked path recored? > > > > > > > > > > SMs can either calculate the SA PRs on the fly based on the routing > > > > > algorithm in use and some other things or put them in a local database. > > > > > This is up to that SM. > > > > Ok > > > > > > > > > > Destination node is not contacted in the SA PR query process. > > > > > > > > > > > >Doesn't each SM have its own policy for generating valid PRs ? > > > > > > Ultimately path record is in Path_Record object format, and SA cache > > > > > > is going to store in a fixed manner, How generation policy matters? > > > > > > > > > > What if the local policy loaded does not agree with what the SM would > > > > > generate for a particular PR ? One then gets a local error which will > > > > > need to be tracked down. Not so easy IMO. > > > > SM policies in a subnet to generate PRs, changes dynamically? at run time? > > > > > > The policy doesn't change dynamically but the data to be returned in the > > > SA PR response might. > > > > > > > if Not then depending on the local SM policy static PR can be > > > > generated to load initially. 
> > > > > > Just as one question related to this, how would link failures be handled > > > ? There are others. > > Its just a matter of avoiding initial PR query packets by loading the > > cache with static PRs.....Later on cache module will function in > > normal fashion. I expect, initially every thing will come up in a > > trusted cluster. > > So you're saying the cache would still react to GIDs out and in service, > right ? I am not about what GIDs in out service....but what I mean to say is, Once sa_cache is programmed with some static PRs....it will avoid initial cache_update step and after first time out normal update_cache() will be initiated using SA MADs. > > If the cache is loaded from a file, does it bypass querying the SA > initially for PRs ? Yes It will, and hence reduce the initial SA traffic generated on a big cluster...just imagin, the cluster is quite big and every node is trying to build its cache initially. It will create large burst of SA packets. >If that is the case, then the file is required to be > the full set of PRs for this node otherwise there would be incomplete > connectivity. Yes, correct, Generating these PRs is the next issue which I want to discuss. may be this can be done by Admin on every node using the read() entry point provided by char_dev interface of sa_cache module. read entry point will simple extract PRs from cache itself. Incomplete connectivity will be till first PR is requested for that destination, Because if its a cache miss, any how application is going to initiate a ib_sa_get_path_rec() and resolved PR will be added in cache for future reference. > > -- Hal > > > > > > > CMIIW. Also I am assuming a homogeneous cluster where certain > > > > > > parameters can be assumed to be same always. > > > > > > > > > > and always in agreement with what the SM would return ? For example, > > > > yes > > > > > what happens when a link goes down and the end node is no longer > > > > > reachable ? 
> > > > If node is not reachable then, after first timeout of sa_cache, that > > > > entry will be removed from cache. > > > > > > OK; that's another aspect to add into this feature. I don't think that > > > is currently done. I think there would need to be an API added to do > > > this. > > Yes, this has been discussed with Sean, we can add one char_dev > > interface to the existing sa_cache module implementation, Write entry > > point will generate a SA_PR_response packet and this packet will be > > passed to update_cache() function. > > > > Also we need to remove the initial schedule_update() call in the > > add_one() function. > > One user command is also required to read from user file and write > > onto this device. > > > > > > -- Hal > > > > > > > > > >are these from a live SM and just loaded "out of band" to > > > > > > bypass/preclude the SA PR >mechanism ? > > > > > > may be > > > > > > > > > > Even if they are, there is still the changes in the subnet issue. > > > > > > > > > > -- Hal > > > > > > > > > > > > -- Hal > > > > > > > > > > > > > > > Admin is loading this info in the cache with user command. > > > > > > > > > > > > > > > > > > > Another point I want to know is, > > > > > > > > > > When local_sa_cache module will be inserted? After SM comes up or > > > > > > > > > > Before SM comes up? > > > > > > > > > > > > > > > > > > It can occur either way. There is no restriction. The cache responds > > > > > > > > > to port up and GID in/out of service events to update itself. > > > > > > > > Do you mean cache module will start building cache only after Port is UP? > > > > > > > > > > > > > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on > > > > > > > > > > some node not on switch) then First Forced schedule_update() is > > > > > > > > > > waisted, and for the first application presence of cache is > > > > > > > > > > meaningless. Why not to keep cache effective right from the start? 
> > > > > > > > > > > > > > > > > > Pre-loading the cache with path records doesn't guarantee that those > > > > > > > > > paths are usable. If the SM has not come up, then the path records will > > > > > > > > > be unusable until the SM configures the subnet, plus there's no > > > > > > > > > guarantee that the remote endpoints specified by the paths are running. > > > > > > > > You mean there is no guarantee that even if SM is UP and we have some > > > > > > > > hard coded entries of path record corresponding to some node X, we are > > > > > > > > not sure that node X has actually come up or not? In that case > > > > > > > > actually that path resolving should fail if node has not come up, but > > > > > > > > with the hard coding still path will be resolved? > > > > > > > > > > > > > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms > > > > > > > > > when booting a large cluster. > > > > > > > > that's true. Also cache will get valid entries only if network is > > > > > > > > configured by SM otherwise every node SA will, possibly, drop SA > > > > > > > > packets. > > > > > > > > > > > > > > > > > > - Sean > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > general mailing list > > > > > > > > general at lists.openfabrics.org > > > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > From eli at mellanox.co.il Thu May 24 05:49:49 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 24 May 2007 15:49:49 +0300 Subject: [ofa-general] [PATCH] IB/mlx4_ib initialize work queue Message-ID: <1180011019.11166.39.camel@mtls03> Initialize send work queue when modified from reset to init Need to initilaize owner bit of the send queue to software ownership whenever the QP is modified from reset to init. 
This is required for the cases that the QP is moved to reset but not destroyed and then modified to init again. Signed-off-by: Eli Cohen --- Index: connectx_kernel/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- connectx_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-05-21 09:40:41.000000000 +0300 +++ connectx_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-05-24 15:14:22.000000000 +0300 @@ -253,9 +253,7 @@ static int create_qp_common(struct mlx4_ struct ib_qp_init_attr *init_attr, struct ib_udata *udata, int sqpn, struct mlx4_ib_qp *qp) { - struct mlx4_wqe_ctrl_seg *ctrl; int err; - int i; mutex_init(&qp->mutex); spin_lock_init(&qp->sq.lock); @@ -323,11 +321,6 @@ static int create_qp_common(struct mlx4_ if (err) goto err_mtt; - for (i = 0; i < qp->sq.max; ++i) { - ctrl = get_send_wqe(qp, i); - ctrl->owner_opcode = cpu_to_be32(1 << 31); - } - qp->sq.wrid = kmalloc(qp->sq.max * sizeof (u64), GFP_KERNEL); qp->rq.wrid = kmalloc(qp->rq.max * sizeof (u64), GFP_KERNEL); @@ -670,8 +663,10 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp struct mlx4_qp_context *context; enum mlx4_qp_optpar optpar = 0; enum ib_qp_state cur_state, new_state; + struct mlx4_wqe_ctrl_seg *ctrl; int sqd_event; int err = -EINVAL; + int i; context = kzalloc(sizeof *context, GFP_KERNEL); if (!context) @@ -856,8 +851,13 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp if (ibqp->srq) context->srqn = cpu_to_be32(1 << 24 | to_msrq(ibqp->srq)->msrq.srqn); - if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) + if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) { context->db_rec_addr = cpu_to_be64(qp->db.dma); + for (i = 0; i < qp->sq.max; ++i) { + ctrl = get_send_wqe(qp, i); + ctrl->owner_opcode = cpu_to_be32(1 << 31); + } + } if (cur_state == IB_QPS_INIT && new_state == IB_QPS_RTR && From eli at mellanox.co.il Thu May 24 06:05:01 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 24 May 2007 16:05:01 +0300 Subject: [ofa-general] [PATCH] 
IB/mlx4_ib initialize work - resending fix description Message-ID: <1180011931.11166.47.camel@mtls03> Initialize send work queue when modified from reset to init Need to initilaize owner bit of the send queue to hardware ownership whenever the QP is modified from reset to init. This is required for the cases that the QP is moved to reset but not destroyed and then modified to init again. Signed-off-by: Eli Cohen --- Index: connectx_kernel/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- connectx_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-05-21 09:40:41.000000000 +0300 +++ connectx_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-05-24 15:14:22.000000000 +0300 @@ -253,9 +253,7 @@ static int create_qp_common(struct mlx4_ struct ib_qp_init_attr *init_attr, struct ib_udata *udata, int sqpn, struct mlx4_ib_qp *qp) { - struct mlx4_wqe_ctrl_seg *ctrl; int err; - int i; mutex_init(&qp->mutex); spin_lock_init(&qp->sq.lock); @@ -323,11 +321,6 @@ static int create_qp_common(struct mlx4_ if (err) goto err_mtt; - for (i = 0; i < qp->sq.max; ++i) { - ctrl = get_send_wqe(qp, i); - ctrl->owner_opcode = cpu_to_be32(1 << 31); - } - qp->sq.wrid = kmalloc(qp->sq.max * sizeof (u64), GFP_KERNEL); qp->rq.wrid = kmalloc(qp->rq.max * sizeof (u64), GFP_KERNEL); @@ -670,8 +663,10 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp struct mlx4_qp_context *context; enum mlx4_qp_optpar optpar = 0; enum ib_qp_state cur_state, new_state; + struct mlx4_wqe_ctrl_seg *ctrl; int sqd_event; int err = -EINVAL; + int i; context = kzalloc(sizeof *context, GFP_KERNEL); if (!context) @@ -856,8 +851,13 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp if (ibqp->srq) context->srqn = cpu_to_be32(1 << 24 | to_msrq(ibqp->srq)->msrq.srqn); - if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) + if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) { context->db_rec_addr = cpu_to_be64(qp->db.dma); + for (i = 0; i < qp->sq.max; ++i) { + ctrl = get_send_wqe(qp, 
i); + ctrl->owner_opcode = cpu_to_be32(1 << 31); + } + } if (cur_state == IB_QPS_INIT && new_state == IB_QPS_RTR && From mst at dev.mellanox.co.il Thu May 24 06:11:54 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 24 May 2007 16:11:54 +0300 Subject: [ofa-general] [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race In-Reply-To: References: <20070522005918.GB13311@mellanox.co.il> Message-ID: <20070524131154.GA7940@mellanox.co.il> hard_start_xmit dereferences to_ipoib_neigh when only tx_lock is taken. This would only be safe if all calls that modify *to_ipoib_neigh take tx_lock too. Currently this is not always true for ipoib_neigh_free and path_rec_completion, which results in memory corruption. Fix this race, making sure path_rec_completion and ipoib_neigh_free are always called under tx_lock. Signed-off-by: Michael S. Tsirkin --- The following works fine for me here. Pls consider for 2.6.22. ipoib_main.c | 42 ++++++++++++++++-------------------------- ipoib_multicast.c | 6 ++++-- 2 files changed, 20 insertions(+), 28 deletions(-) Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-22 01:46:54.000000000 +0300 +++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-23 22:45:18.000000000 +0300 @@ -262,7 +262,8 @@ static void path_free(struct net_device while ((skb = __skb_dequeue(&path->queue))) dev_kfree_skb_irq(skb); - spin_lock_irqsave(&priv->lock, flags); + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { /* @@ -277,7 +278,8 @@ static void path_free(struct net_device ipoib_neigh_free(dev, neigh); } - spin_unlock_irqrestore(&priv->lock, flags); + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); if (path->ah) ipoib_put_ah(path->ah); @@ -401,7 +403,8 @@ static void 
path_rec_completion(int stat ah = ipoib_create_ah(dev, priv->pd, &av); } - spin_lock_irqsave(&priv->lock, flags); + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); path->ah = ah; @@ -442,7 +445,8 @@ static void path_rec_completion(int stat path->query = NULL; complete(&path->done); - spin_unlock_irqrestore(&priv->lock, flags); + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); while ((skb = __skb_dequeue(&skqueue))) { skb->dev = dev; @@ -614,32 +618,16 @@ static void unicast_arp_send(struct sk_b path = __path_find(dev, phdr->hwaddr + 4); if (!path) { path = path_rec_create(dev, phdr->hwaddr + 4); - if (path) { - /* put pseudoheader back on for next time */ - skb_push(skb, sizeof *phdr); - __skb_queue_tail(&path->queue, skb); - - if (path_rec_start(dev, path)) { - spin_unlock(&priv->lock); - path_free(dev, path); - return; - } else - __path_add(dev, path); - } else { - ++priv->stats.tx_dropped; - dev_kfree_skb_any(skb); - } - - spin_unlock(&priv->lock); - return; + if (path) + __path_add(dev, path); } - if (path->ah) { + if (path && path->ah) { ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr)); - } else if ((path->query || !path_rec_start(dev, path)) && + } else if (path && (path->query || !path_rec_start(dev, path)) && skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { /* put pseudoheader back on for next time */ skb_push(skb, sizeof *phdr); @@ -822,7 +810,8 @@ static void ipoib_neigh_cleanup(struct n IPOIB_QPN(n->ha), IPOIB_GID_RAW_ARG(n->ha + 4)); - spin_lock_irqsave(&priv->lock, flags); + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); neigh = *to_ipoib_neigh(n); if (neigh) { @@ -832,7 +821,8 @@ static void ipoib_neigh_cleanup(struct n ipoib_neigh_free(n->dev, neigh); } - spin_unlock_irqrestore(&priv->lock, flags); + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); if (ah) 
 		ipoib_put_ah(ah);
Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-22 01:46:54.000000000 +0300
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-05-23 21:38:28.000000000 +0300
@@ -100,7 +100,8 @@ static void ipoib_mcast_free(struct ipoi
 			"deleting multicast group " IPOIB_GID_FMT "\n",
 			IPOIB_GID_ARG(mcast->mcmember.mgid));
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
 
 	list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) {
 		/*
@@ -114,7 +115,8 @@ static void ipoib_mcast_free(struct ipoi
 		ipoib_neigh_free(dev, neigh);
 	}
 
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (mcast->ah)
 		ipoib_put_ah(mcast->ah);

-- 
MST

From mst at dev.mellanox.co.il Thu May 24 07:04:10 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 17:04:10 +0300
Subject: [ofa-general] question on netpoll
Message-ID: <20070524140410.GB7940@mellanox.co.il>

Roland, there's something I don't understand in ipoib:
ipoib_ib_dev_stop currently moves QP to error, and then
waits for it to be drained of WRs.

However, if this is done during dev_stop,
netpoll is turned off (bit __LINK_STATE_START is cleared) -
so what is draining the CQ?

-- 
MST

From glebn at voltaire.com Thu May 24 07:19:28 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Thu, 24 May 2007 17:19:28 +0300
Subject: [ofa-general] RDMA write completion question
Message-ID: <20070524141928.GI20691@minantech.com>

Hi,

Does a local RDMA write completion guarantee that the data that was
RDMAed is already accessible in the destination host's _memory_?

--
			Gleb.
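
A minimal userspace sketch of the locking rule behind the to_ipoib_neigh
race fix posted above, assuming pthread mutexes in place of kernel
spinlocks. All names here (demo_priv, demo_neigh_free, demo_xmit) are
hypothetical stand-ins and are not the driver's code. The point it
illustrates: if the transmit path dereferences a pointer while holding
only the outer lock, every writer of that pointer must also take the
outer lock, and all paths must nest the two locks in the same order.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's private state (assumption). */
struct demo_priv {
	pthread_mutex_t tx_lock; /* outer lock: held by the transmit path */
	pthread_mutex_t lock;    /* inner lock: protects the neighbour lists */
	int *neigh;              /* pointer the transmit path dereferences */
};

static int demo_value = 42;

void demo_init(struct demo_priv *p)
{
	pthread_mutex_init(&p->tx_lock, NULL);
	pthread_mutex_init(&p->lock, NULL);
	p->neigh = &demo_value;
}

/* Writer: takes tx_lock first, then lock, and releases in reverse order. */
void demo_neigh_free(struct demo_priv *p)
{
	pthread_mutex_lock(&p->tx_lock);
	pthread_mutex_lock(&p->lock);
	p->neigh = NULL;
	pthread_mutex_unlock(&p->lock);
	pthread_mutex_unlock(&p->tx_lock);
}

/*
 * Reader: holds only tx_lock. This is safe only because every writer
 * above also holds tx_lock while changing p->neigh.
 * Returns the pointed-to value, or -1 if the neighbour is gone.
 */
int demo_xmit(struct demo_priv *p)
{
	int ret;

	pthread_mutex_lock(&p->tx_lock);
	ret = p->neigh ? *p->neigh : -1;
	pthread_mutex_unlock(&p->tx_lock);
	return ret;
}
```

Taking the outer lock in the writer is what closes the window; keeping
the nesting order identical everywhere is what avoids introducing a
deadlock while doing so.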

From fenkes at de.ibm.com Thu May 24 07:51:08 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 24 May 2007 16:51:08 +0200
Subject: [ofa-general] [PATCH] IB/ehca: fix wrong number of send WRs returned
Message-ID: <200705241651.09411.fenkes@de.ibm.com>

From: Stefan Roscher

Due to a typo, the driver was reporting the wrong number of "actual send WRs"
after ehca_create_qp(). Fixed.

Signed-off-by: Joachim Fenkes
---
 drivers/infiniband/hw/ehca/hcp_if.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index 7f0beec..5766ae3 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -331,7 +331,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
 				0);
 	qp->ipz_qp_handle.handle = outs[0];
 	qp->real_qp_num = (u32)outs[1];
-	parms->act_nr_send_sges =
+	parms->act_nr_send_wqes =
 		(u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_SEND_WR, outs[2]);
 	parms->act_nr_recv_wqes =
 		(u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_RECV_WR, outs[2]);
-- 
1.5.2

From fenkes at de.ibm.com Thu May 24 07:51:05 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 24 May 2007 16:51:05 +0200
Subject: [ofa-general] [PATCH] IB/ehca: Refactor "maybe missed event" code
Message-ID: <200705241651.05860.fenkes@de.ibm.com>

Refactored Roland's patch so the queue arithmetic is done in slightly
fewer lines. Also, moved the spinlock inside the block it's used in.

Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_reqs.c | 2 +- drivers/infiniband/hw/ehca/ipz_pt_fn.h | 28 ++++++++++------------------ 2 files changed, 11 insertions(+), 19 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index caec9de..56c4527 100644 --- a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -637,7 +637,6 @@ poll_cq_exit0: int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags notify_flags) { struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); - unsigned long spl_flags; int ret = 0; switch (notify_flags & IB_CQ_SOLICITED_MASK) { @@ -652,6 +651,7 @@ int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags notify_flags) } if (notify_flags & IB_CQ_REPORT_MISSED_EVENTS) { + unsigned long spl_flags; spin_lock_irqsave(&my_cq->spinlock, spl_flags); ret = ipz_qeit_is_valid(&my_cq->ipz_queue); spin_unlock_irqrestore(&my_cq->spinlock, spl_flags); diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h index 57f141a..007f088 100644 --- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h @@ -105,7 +105,6 @@ void *ipz_qpageit_get_inc(struct ipz_queue *queue); * step in struct ipz_queue, will wrap in ringbuffer * returns address (kv) of Queue Entry BEFORE increment * warning don't use in parallel with ipz_qpageit_get_inc() - * warning unpredictable results may occur if steps>act_nr_of_queue_entries */ static inline void *ipz_qeit_get_inc(struct ipz_queue *queue) { @@ -121,31 +120,24 @@ static inline void *ipz_qeit_get_inc(struct ipz_queue *queue) } /* + * return a bool indicating whether current Queue Entry is valid + */ +static inline int ipz_qeit_is_valid(struct ipz_queue *queue) +{ + struct ehca_cqe *cqe = ipz_qeit_get(queue); + return ((cqe->cqe_flags >> 7) == (queue->toggle_state & 1)); +} + +/* * return current Queue Entry, increment Queue Entry 
iterator by one * step in struct ipz_queue, will wrap in ringbuffer * returns address (kv) of Queue Entry BEFORE increment * returns 0 and does not increment, if wrong valid state * warning don't use in parallel with ipz_qpageit_get_inc() - * warning unpredictable results may occur if steps>act_nr_of_queue_entries */ static inline void *ipz_qeit_get_inc_valid(struct ipz_queue *queue) { - struct ehca_cqe *cqe = ipz_qeit_get(queue); - u32 cqe_flags = cqe->cqe_flags; - - if ((cqe_flags >> 7) != (queue->toggle_state & 1)) - return NULL; - - ipz_qeit_get_inc(queue); - return cqe; -} - -static inline int ipz_qeit_is_valid(struct ipz_queue *queue) -{ - struct ehca_cqe *cqe = ipz_qeit_get(queue); - u32 cqe_flags = cqe->cqe_flags; - - return cqe_flags >> 7 == (queue->toggle_state & 1); + return ipz_qeit_is_valid(queue) ? ipz_qeit_get_inc(queue) : NULL; } /* -- 1.5.2 From devesh28 at gmail.com Thu May 24 08:08:00 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Thu, 24 May 2007 20:38:00 +0530 Subject: [ofa-general] RDMA write completion question In-Reply-To: <20070524141928.GI20691@minantech.com> References: <20070524141928.GI20691@minantech.com> Message-ID: <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com> On 5/24/07, Gleb Natapov wrote: > Hi, > > Does local RDMA write completion guaranties that a data that was RDMAed is > already accessible in a destination's host _memory_? Local RDMA write completion guarantees that the data you have RDMAed has been copied into the remote buffer, without any data corruption. > > -- > Gleb. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at dev.mellanox.co.il Thu May 24 08:12:54 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin)
Date: Thu, 24 May 2007 18:12:54 +0300
Subject: [ofa-general] Re: question on netpoll
In-Reply-To: <20070524140410.GB7940@mellanox.co.il>
References: <20070524140410.GB7940@mellanox.co.il>
Message-ID: <20070524151254.GC7940@mellanox.co.il>

> Quoting Michael S. Tsirkin :
> Subject: question on netpoll
>
> Roland, there's something I don't understand in ipoib:
> ipoib_ib_dev_stop currently moves QP to error, and then
> waits for it to be drained of WRs.
>
> However, if this is done during dev_stop,
> netpoll is turned off (bit __LINK_STATE_START is cleared) -
> so what is draining the CQ?

OK, I noticed the following at the end of dev stop:

	do {
		n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
		for (i = 0; i < n; ++i) {
			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
			else
				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
		}
	} while (n == IPOIB_NUM_WC);

However: this could call netif_receive_skb - would that be a problem?
For example, what if we don't have any quota left?

-- 
MST

From changquing.tang at hp.com Thu May 24 08:13:33 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Thu, 24 May 2007 15:13:33 -0000
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
References: <20070524141928.GI20691@minantech.com> <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301696F8A@G3W0634.americas.hpqcorp.net>

But I learned a while back that a local RDMA completion only means that
the data has been received by the remote HCA and that an ACK has been
returned; the remote HCA may or may NOT have delivered the data to host
memory yet.

Is this still true?

--CQ

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> Devesh Sharma
> Sent: Thursday, May 24, 2007 10:08 AM
> To: Gleb Natapov
> Cc: general at lists.openfabrics.org
> Subject: Re: [ofa-general] RDMA write completion question
>
> On 5/24/07, Gleb Natapov wrote:
> > Hi,
> >
> > Does local RDMA write completion guaranties that a data that was
> > RDMAed is already accessible in a destination's host _memory_?
> Local RDMA write completion guarantees that the data you have
> RDMAed has been copied into the remote buffer, without any
> data corruption.
> >
> > --
> > Gleb.
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
> >
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From afriedle at open-mpi.org Thu May 24 08:18:26 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Thu, 24 May 2007 08:18:26 -0700
Subject: [ofa-general] how to write a IB user level multicast application
In-Reply-To: <46553448.6020508@dev.mellanox.co.il>
References: <13432ab00705232201s5f7d5a5h5ecaaddf57ead11b@mail.gmail.com> <46553448.6020508@dev.mellanox.co.il>
Message-ID: <4655ACC2.9030401@open-mpi.org>

Dotan Barak wrote:
> In the following URL you can find a very simple example on how to use
> multicast:
> https://svn.openfabrics.org/svn/openib/trunk/contrib/mellanox/ibtp/gen2/userspace/useraccess/multicast_test/multicast_test.c

I seem to be missing v1.h on my OFED v1.2 nightly install, where can I
find it?

> this test doesn't send an SA query (to get the multicast props) or an SA
> multicast join (to make the SM configure the subnet to make the port
> that this QP is attached to) to get the multicast messages.
>
> This example will work on a back-to-back topology.

An alternative that I've had pretty good success with is to use the RDMA
CM. It uses an IP(v6) abstraction, does the SA queries/joins for you,
and also supports selection of an unused multicast address if you join
the '0' address (and port? not sure if it's required, I always zero it).

Check out the 'mckey.c' example included with the RDMA CM source, it
will likely answer your questions.

Andrew

From sweitzen at cisco.com Thu May 24 08:21:27 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Thu, 24 May 2007 08:21:27 -0700
Subject: [ofa-general] RE: two interfaces with ipoib
In-Reply-To: <46554A5A.3050607@voltaire.com>
References: <4654B14B.4000208@opengridcomputing.com><4654C278.6030703@ichips.intel.com> <46554A5A.3050607@voltaire.com>
Message-ID:

How is:

  sysctl -w net.ipv4.conf.all.arp_ignore=2

different from:

  for i in /proc/sys/net/ipv4/conf/ib*/arp_filter; do echo 1 > $i; done

I have been using the latter successfully regarding this issue.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Or Gerlitz
> Sent: Thursday, May 24, 2007 1:19 AM
> To: Sean Hefty
> Cc: Roland Dreier (rdreier); general at lists.openfabrics.org
> Subject: [ofa-general] RE: two interfaces with ipoib
>
> Sean Hefty wrote:
> > Steve Wise wrote:
> >> Guys, this reminds me of an issue we have with rnics and
> regular nics
> >> on the same physical network. By default linux responds to arp
> >> queries on all ports it receives the query on. This leads
> to very bad
> >> results with you're trying to do offloaded connections.
When > >> resolving the address/route, the rdma client can end up > getting the > >> mac address of the dumb nic instead of the rnic. I don't know if > >> route resolution in the ib cm has this issue, but it might > since they > >> use ipoib for some part of the resolution, no? > > > > I think this could be the problem. (And could have taken > me a long time > > to figure it if it is, so thanks!) > > > >> 2) If you have a multi-homed host and the physical > ethernet networks are > >> bridged, then you need to configure arp to only send replies on the > >> interface with the target ip address: > >> > >> sysctl -w net.ipv4.conf.all.arp_ignore=2 > > OK, Sean, sorry not to mention this to you, we have resolved > this with a > customer some time ago and I have communicated it to Mellanox > on Sonoma > such that it will be added to the OFED 1.2 documentation. > > Generally speaking, its a bad (somehow dead on arrival test > for a system > administrator) habit to have two IP (=L3) subnets sharing the same L2 > (specifically broadcast) domain. In infiniband (IPoIB) it > means have two > IP subnets over the same Partition and in Ethernet is means > have two IP > subnets over the same VLAN. > > I understand the default setting of arp_ignore = 0 is a religious > argument held once in a while at the netdev mailing list, if > people from > here want to try it, i am crossing my fingers for them, but again, it > has nothing special to do with ipoib. > > Or. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From mst at dev.mellanox.co.il Thu May 24 08:32:46 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin)
Date: Thu, 24 May 2007 18:32:46 +0300
Subject: [ofa-general] [PATCH] IB/ipoib: drain cq in dev_stop
Message-ID: <20070524153246.GD7940@mellanox.co.il>

Fix 2 bugs in RX draining code:
1. The logic in time_after is reversed, so it was timing out immediately
2. Since netpoll is disabled while ipoib_cm_dev_stop is running,
   ipoib_cm_dev_stop must poll the CQ in order to see the draining packet.

Signed-off-by: Michael S. Tsirkin

---

Pls review the above.

I'm still uncomfortable with the fact that ipoib_ib_dev_stop could cause
packets to be passed up without poll being called. Is this OK?

It is possible we never saw problems in practice because the race window is
small, but it seems that we should pass a flag to handle_rx_wc routines to have
it drop all packets. Roland, what do you think?

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index a0b3782..158759e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -429,6 +429,7 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
 
 void ipoib_pkey_poll(struct work_struct *work);
 int ipoib_pkey_dev_delay_open(struct net_device *dev);
+void ipoib_drain_cq(struct net_device *dev);
 
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index ffec794..dc299db 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -713,7 +713,7 @@ void ipoib_cm_dev_stop(struct net_device *dev)
 	while (!list_empty(&priv->cm.rx_error_list) ||
 	       !list_empty(&priv->cm.rx_flush_list) ||
 	       !list_empty(&priv->cm.rx_drain_list)) {
-		if (!time_after(jiffies, begin + 5 * HZ)) {
+		if (time_after(jiffies, begin + 5 * HZ)) {
 			ipoib_warn(priv, "RX drain timing out\n");
 
 			/*
@@ -725,6 +725,7 @@ void ipoib_cm_dev_stop(struct net_device *dev)
 			break;
 		}
 		spin_unlock_irq(&priv->lock);
+		ipoib_drain_cq(dev);
 		msleep(1);
 		spin_lock_irq(&priv->lock);
 	}
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index c1aad06..8404f05 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -550,13 +550,30 @@ static int recvs_pending(struct net_device *dev) return pending; } +void ipoib_drain_cq(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i, n; + do { + n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < n; ++i) { + if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ) + ipoib_cm_handle_rx_wc(dev, priv->ibwc + i); + else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) + ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); + else + ipoib_ib_handle_tx_wc(dev, priv->ibwc + i); + } + } while (n == IPOIB_NUM_WC); +} + int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; unsigned long begin; struct ipoib_tx_buf *tx_req; - int i, n; + int i; clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags); netif_poll_disable(dev); @@ -611,17 +628,7 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush) goto timeout; } - do { - n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc); - for (i = 0; i < n; ++i) { - if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ) - ipoib_cm_handle_rx_wc(dev, priv->ibwc + i); - else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) - ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); - else - ipoib_ib_handle_tx_wc(dev, priv->ibwc + i); - } - } while (n == IPOIB_NUM_WC); + ipoib_drain_cq(dev); msleep(1); } -- MST From halr at voltaire.com Thu May 24 08:30:24 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 24 May 2007 11:30:24 -0400 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <464B5C07.8040601@ichips.intel.com> <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com> 
<1179398534.23882.67542.camel@hal.voltaire.com> <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com> <1179483657.23882.158398.camel@hal.voltaire.com> <309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com> <1179769930.15940.9823.camel@hal.voltaire.com> <309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com> <1179930909.16831.100286.camel@hal.voltaire.com> <309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com> Message-ID: <1180020620.16831.198071.camel@hal.voltaire.com> On Thu, 2007-05-24 at 08:22, Devesh Sharma wrote: > On 23 May 2007 10:35:13 -0400, Hal Rosenstock wrote: > > On Wed, 2007-05-23 at 10:27, Devesh Sharma wrote: > > > On 21 May 2007 13:52:11 -0400, Hal Rosenstock wrote: > > > > On Mon, 2007-05-21 at 01:58, Devesh Sharma wrote: > > > > > On 18 May 2007 06:21:05 -0400, Hal Rosenstock wrote: > > > > > > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote: > > > > > > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock wrote: > > > > > > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote: > > > > > > > > > On 5/17/07, Sean Hefty wrote: > > > > > > > > > > > But initially this will generate a packet for each path, while sys > > > > > > > > > > > admin knows that path is there and he can hard-code the entries for > > > > > > > > > > > it. Other thing is that why Admin will care about creating such record > > > > > > > > > > > while SA is itself taking care, right? > > > > > > > > > > > > > > > > > > > > In your original message you asked about adding 'dummy entries' to the > > > > > > > > > > cache. I agree that pre-loading the cache can be useful. What I still > > > > > > > > > > am not understanding is the reasoning for adding 'dummy entries'. By > > > > > > > > > > 'dummy entries', I've been assuming that these are invalid path records, > > > > > > > > > > but maybe that's not what you meant. 
> > > > > > > > > Ok if "dummy entries" word as such has created confusion then I am > > > > > > > > > sorry for that, But with that I mean that, those are valid path > > > > > > > > > records which Administrator knows in advance and while loading the > > > > > > > > > module, > > > > > > > > > > > > > > > > How does the admin know they are valid ? > > > > > > > Depending on the initial application runs, some trusted PRs can be generated. > > > > > > > > > > > > What do initial application runs have to do with this ? > > > > > My understanding is that, once the cluster is UP, and if between Node > > > > > A and Node B there is only one path, > > > > > > > > So this is a feature for such one path subnets. I wonder what percentage > > > > of deployed subnets fits this case. > > > You never know, It may be used for debugging also. > > > > I still don't have a good feel for how common/generally useful this will > > really be. > > > > > > > then, SA query always going to return same values in PR. > > > > > > > > If subnet topology is changed, these PRs might change. There are other > > > > cases where they change too. > > > Not sure about it...some suggestion? > > > > > > > > > On this basis Initial application runs will generate PRs, > > > > > > > > That's what confused me before (Applications don't generate PRs but > > > > rather request them.) but I think I see what you mean now. > > > Ok > > > > > > > > > these PRs can be saved in some file, and can be loaded > > > > > when cache_module comes in. > > > > > > > > > > > > > >Are they somehow preconfigured at the SM ? > > > > > > > I am not sure about SM has any such provision? > > > > > > > > > > > > Not that I'm aware of. > > > > > Ok, So, currently no such support is there in SM? > > > > > > > > I can speak definitively for OpenSM and there is no such support. As to > > > > the vendor SMs, I don't think so but don't know for absolute certainty. 
> > > > Someone can correct me if I'm wrong but I wouldn't assume no response > > > > means correctness as some may not be listening nor want to respond as to > > > > "value added" vendor specific features. > > > What is the issue if OpenSM provides this? > > > > I'm not following you. What does/should OpenSM provide ? OpenIB works in > > configurations with other SMs. > I am talking about pre-configuring PRs in OpenSM DB. How does that help ? Why would PRs need to be preconfigured at the SM ? Do you mean preconfigure the routing tables (and generate the PRs from that) ? What problem is being solved on the SM side ? > > > > > > > Also not sure about the > > > > > > > role of SM in path resolving. I mean once node has initiated SA query, > > > > > > > whether SM has some database to reply SA or On the fly destination > > > > > > > node is contacted to get asked path recored? > > > > > > > > > > > > SMs can either calculate the SA PRs on the fly based on the routing > > > > > > algorithm in use and some other things or put them in a local database. > > > > > > This is up to that SM. > > > > > Ok > > > > > > > > > > > > Destination node is not contacted in the SA PR query process. > > > > > > > > > > > > > >Doesn't each SM have its own policy for generating valid PRs ? > > > > > > > Ultimately path record is in Path_Record object format, and SA cache > > > > > > > is going to store in a fixed manner, How generation policy matters? > > > > > > > > > > > > What if the local policy loaded does not agree with what the SM would > > > > > > generate for a particular PR ? One then gets a local error which will > > > > > > need to be tracked down. Not so easy IMO. > > > > > SM policies in a subnet to generate PRs, changes dynamically? at run time? > > > > > > > > The policy doesn't change dynamically but the data to be returned in the > > > > SA PR response might. 
> > > > > > > > > if Not then depending on the local SM policy static PR can be > > > > > generated to load initially. > > > > > > > > Just as one question related to this, how would link failures be handled > > > > ? There are others. > > > Its just a matter of avoiding initial PR query packets by loading the > > > cache with static PRs.....Later on cache module will function in > > > normal fashion. I expect, initially every thing will come up in a > > > trusted cluster. > > > > So you're saying the cache would still react to GIDs out and in service, > > right ? > I am not about what GIDs in out service.... Why not ? > but what I mean to say is, > Once sa_cache is programmed with some static PRs....it will avoid > initial cache_update step and after first time out normal > update_cache() will be initiated using SA MADs. How would the client know what PRs to request when that timeout first occurs ? There's no get all except these semantics. If it is all PRs, what does that save ? > > If the cache is loaded from a file, does it bypass querying the SA > > initially for PRs ? > Yes It will, and hence reduce the initial SA traffic generated on a > big cluster...just imagin, the cluster is quite big and every node is > trying to build its cache initially. It will create large burst of SA > packets. > >If that is the case, then the file is required to be > > the full set of PRs for this node otherwise there would be incomplete > > connectivity. > Yes, correct, Generating these PRs is the next issue which I want to > discuss. may be this can be done by Admin on every node using the > read() entry point provided by char_dev interface of sa_cache module. > read entry point will simple extract PRs from cache itself. > > Incomplete connectivity will be till first PR is requested for that > destination, Because if its a cache miss, any how application is going > to initiate a ib_sa_get_path_rec() and resolved PR will be added in > cache for future reference. 
OK then this becomes an on demand model for those destnations (at least initially). -- Hal > > -- Hal > > > > > > > > > CMIIW. Also I am assuming a homogeneous cluster where certain > > > > > > > parameters can be assumed to be same always. > > > > > > > > > > > > and always in agreement with what the SM would return ? For example, > > > > > yes > > > > > > what happens when a link goes down and the end node is no longer > > > > > > reachable ? > > > > > If node is not reachable then, after first timeout of sa_cache, that > > > > > entry will be removed from cache. > > > > > > > > OK; that's another aspect to add into this feature. I don't think that > > > > is currently done. I think there would need to be an API added to do > > > > this. > > > Yes, this has been discussed with Sean, we can add one char_dev > > > interface to the existing sa_cache module implementation, Write entry > > > point will generate a SA_PR_response packet and this packet will be > > > passed to update_cache() function. > > > > > > Also we need to remove the initial schedule_update() call in the > > > add_one() function. > > > One user command is also required to read from user file and write > > > onto this device. > > > > > > > > -- Hal > > > > > > > > > > > >are these from a live SM and just loaded "out of band" to > > > > > > > bypass/preclude the SA PR >mechanism ? > > > > > > > may be > > > > > > > > > > > > Even if they are, there is still the changes in the subnet issue. > > > > > > > > > > > > -- Hal > > > > > > > > > > > > > > -- Hal > > > > > > > > > > > > > > > > > Admin is loading this info in the cache with user command. > > > > > > > > > > > > > > > > > > > > > Another point I want to know is, > > > > > > > > > > > When local_sa_cache module will be inserted? After SM comes up or > > > > > > > > > > > Before SM comes up? > > > > > > > > > > > > > > > > > > > > It can occur either way. There is no restriction. 
The cache responds > > > > > > > > > > to port up and GID in/out of service events to update itself. > > > > > > > > > Do you mean cache module will start building cache only after Port is UP? > > > > > > > > > > > > > > > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on > > > > > > > > > > > some node not on switch) then First Forced schedule_update() is > > > > > > > > > > > waisted, and for the first application presence of cache is > > > > > > > > > > > meaningless. Why not to keep cache effective right from the start? > > > > > > > > > > > > > > > > > > > > Pre-loading the cache with path records doesn't guarantee that those > > > > > > > > > > paths are usable. If the SM has not come up, then the path records will > > > > > > > > > > be unusable until the SM configures the subnet, plus there's no > > > > > > > > > > guarantee that the remote endpoints specified by the paths are running. > > > > > > > > > You mean there is no guarantee that even if SM is UP and we have some > > > > > > > > > hard coded entries of path record corresponding to some node X, we are > > > > > > > > > not sure that node X has actually come up or not? In that case > > > > > > > > > actually that path resolving should fail if node has not come up, but > > > > > > > > > with the hard coding still path will be resolved? > > > > > > > > > > > > > > > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms > > > > > > > > > > when booting a large cluster. > > > > > > > > > that's true. Also cache will get valid entries only if network is > > > > > > > > > configured by SM otherwise every node SA will, possibly, drop SA > > > > > > > > > packets. 
> > > > > > > > > > > > > > > > > > > > - Sean > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > general mailing list > > > > > > > > > general at lists.openfabrics.org > > > > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > > > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > From tziporet at dev.mellanox.co.il Thu May 24 08:36:11 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 24 May 2007 18:36:11 +0300 Subject: [ofa-general] Re: [ewg] Something is wrong in gitweb In-Reply-To: <46558296.2090308@voltaire.com> References: <46558296.2090308@voltaire.com> Message-ID: <4655B0EB.5030407@mellanox.co.il> Erez Zilber wrote: > The links on http://www.openfabrics.org/git/ don't work. For example, > the link to ofed_1_2 tree leads to: > > > http://git/?p=~vlad/ofed_1_2/.git;a=summary > > > It seems that "www.openfabrics.org/" is missing in all links. > > > I have the same issue Jeff - are the the owner of this too? Tziporet From jsquyres at cisco.com Thu May 24 08:43:59 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 24 May 2007 11:43:59 -0400 Subject: [ofa-general] Re: [ewg] Something is wrong in gitweb In-Reply-To: <4655B0EB.5030407@mellanox.co.il> References: <46558296.2090308@voltaire.com> <4655B0EB.5030407@mellanox.co.il> Message-ID: <310F0799-71F3-4E26-A965-4F8E4B6BF496@cisco.com> Jeff Becker is the OFA system administrator. On May 24, 2007, at 11:36 AM, Tziporet Koren wrote: > Erez Zilber wrote: >> The links on http://www.openfabrics.org/git/ don't work. For example, >> the link to ofed_1_2 tree leads to: >> >> >> http://git/?p=~vlad/ofed_1_2/.git;a=summary >> >> >> It seems that "www.openfabrics.org/" is missing in all links. >> >> >> > I have the same issue > > Jeff - are the the owner of this too? 
>
> Tziporet
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

--
Jeff Squyres
Cisco Systems

From mst at dev.mellanox.co.il Thu May 24 08:50:08 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 24 May 2007 18:50:08 +0300
Subject: [ofa-general] Re: [PATCH] IB/ipoib: drain cq in dev_stop
In-Reply-To: <20070524153246.GD7940@mellanox.co.il>
References: <20070524153246.GD7940@mellanox.co.il>
Message-ID: <20070524155008.GB23535@mellanox.co.il>

> I'm still uncomfortable with the fact that ipoib_ib_dev_stop could cause
> packets to be passed up without poll being called. Is this OK?
>
> It is possible we never saw problems in practice because the race window is
> small, but it seems that we should pass a flag to handle_rx_wc routines to have
> it drop all packets. Roland, what do you think?

Maybe the following is needed on top of this patch?
Roland, what do you think?

Signed-off-by: Michael S. Tsirkin

---

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 8404f05..92a2655 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -165,7 +165,7 @@ static int ipoib_ib_post_receives(struct net_device *dev)
 	return 0;
 }
 
-static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
@@ -184,7 +184,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	skb = priv->rx_ring[wr_id].skb;
 	addr = priv->rx_ring[wr_id].mapping;
 
-	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+	if (unlikely(wc->status != IB_WC_SUCCESS || flush)) {
 		if (wc->status != IB_WC_WR_FLUSH_ERR)
 			ipoib_warn(priv, "failed recv event "
 				   "(status=%d, wrid=%d vend_err %x)\n",
@@ -302,11 +302,11 @@ int ipoib_poll(struct net_device *dev, int *budget)
 			if (wc->wr_id & IPOIB_CM_OP_SRQ) {
 				++done;
 				--max;
-				ipoib_cm_handle_rx_wc(dev, wc);
+				ipoib_cm_handle_rx_wc(dev, wc, 0);
 			} else if (wc->wr_id & IPOIB_OP_RECV) {
 				++done;
 				--max;
-				ipoib_ib_handle_rx_wc(dev, wc);
+				ipoib_ib_handle_rx_wc(dev, wc, 0);
 			} else
 				ipoib_ib_handle_tx_wc(dev, wc);
 		}
@@ -558,9 +558,9 @@ void ipoib_drain_cq(struct net_device *dev)
 		n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
 		for (i = 0; i < n; ++i) {
 			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
-				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
+				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i, 1);
 			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
-				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
+				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i, 1);
 			else
 				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
 		}

--
MST

From glebn at voltaire.com Thu May 24 09:09:06 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Thu, 24 May 2007 19:09:06 +0300
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
References: <20070524141928.GI20691@minantech.com> <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
Message-ID: <20070524160905.GA29313@minantech.com>

On Thu, May 24, 2007 at 08:38:00PM +0530, Devesh Sharma wrote:
> On 5/24/07, Gleb Natapov wrote:
> >Hi,
> >
> > Does local RDMA write completion guarantee that the data that was RDMAed is
> >already accessible in the destination host's _memory_?
> Local RDMA write completion guarantees that the data you have RDMAed
> has been copied into the remote buffer, without any data corruption.

Is this required by the IB spec? How can an HCA guarantee that the data is
actually committed to memory and is not still travelling through a twisty
maze of PCI buffers, all alike?

--
			Gleb.

From krause at cup.hp.com Thu May 24 09:15:33 2007
From: krause at cup.hp.com (Michael Krause)
Date: Thu, 24 May 2007 09:15:33 -0700
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301696F8A@G3W0634.americas.hpqcorp.net>
References: <20070524141928.GI20691@minantech.com> <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com> <349DCDA352EACF42A0C49FA6DCEA840301696F8A@G3W0634.americas.hpqcorp.net>
Message-ID: <6.2.0.14.2.20070524091327.066ad300@esmail.cup.hp.com>

At 08:13 AM 5/24/2007, Tang, Changqing wrote:
>But I learned a while back that local RDMA completion only means that
>the data has been received by the remote HCA and an ACK has been sent;
>the remote HCA may have delivered the data to host memory, or may NOT.
>
>Is this still true ?

Yes. Unless an RDMA Read is done to flush all prior operations to host
memory, acknowledgement of an RDMA Write via the IB protocol only
indicates that the packet arrived at the CA with a valid CRC. There is no
guarantee that anything has reached host memory, or that data corruption
has been prevented: a CRC only guarantees that the packet traversed the
fabric without a CRC-detectable error occurring.

Mike

>--CQ
>
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> > Devesh Sharma
> > Sent: Thursday, May 24, 2007 10:08 AM
> > To: Gleb Natapov
> > Cc: general at lists.openfabrics.org
> > Subject: Re: [ofa-general] RDMA write completion question
> >
> > On 5/24/07, Gleb Natapov wrote:
> > > Hi,
> > >
> > > Does local RDMA write completion guarantee that the data that was
> > > RDMAed is already accessible in the destination host's _memory_?
> > Local RDMA write completion guarantees that the data you have
> > RDMAed has been copied into the remote buffer, without any
> > data corruption.
> > >
> > > --
> > > Gleb.

From krause at cup.hp.com Thu May 24 09:19:19 2007
From: krause at cup.hp.com (Michael Krause)
Date: Thu, 24 May 2007 09:19:19 -0700
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <20070524160905.GA29313@minantech.com>
References: <20070524141928.GI20691@minantech.com> <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com> <20070524160905.GA29313@minantech.com>
Message-ID: <6.2.0.14.2.20070524091541.066ad070@esmail.cup.hp.com>

At 09:09 AM 5/24/2007, Gleb Natapov wrote:
>On Thu, May 24, 2007 at 08:38:00PM +0530, Devesh Sharma wrote:
> > On 5/24/07, Gleb Natapov wrote:
> > >Hi,
> > >
> > > Does local RDMA write completion guarantee that the data that was
> > > RDMAed is already accessible in the destination host's _memory_?
> > Local RDMA write completion guarantees that the data you have RDMAed
> > has been copied into the remote buffer, without any data corruption.
>Is this required by the IB spec? How can an HCA guarantee that the data is
>actually committed to memory and is not still travelling through a twisty
>maze of PCI buffers, all alike?

There are no guarantees. In fact, data corruption can occur within the CA
as well as in the PCI fabric, etc. IHVs and chipsets do take steps to
prevent corruption, or at a minimum to detect when it occurs. However,
detection and prevention are not possible in every design, and there is a
cost to be paid for either or both. Technology such as IB does a
reasonable job at the fabric level but has no impact on anything else in
the end-to-end data integrity requirements.

Mike

>--
> Gleb.

From becker at nas.nasa.gov Thu May 24 09:33:14 2007
From: becker at nas.nasa.gov (Jeff Becker)
Date: Thu, 24 May 2007 09:33:14 -0700
Subject: [ofa-general] Re: [ewg] Something is wrong in gitweb
In-Reply-To: <4655B0EB.5030407@mellanox.co.il>
References: <46558296.2090308@voltaire.com> <4655B0EB.5030407@mellanox.co.il>
Message-ID: <795c49870705240933w276e41fbu941e822047ab5e25@mail.gmail.com>

Hi Tziporet. I just tried getting to the git tree from my web browser,
and this seems to work, including the link you tried below. Does it work
for you now? Thanks.

-jeff

On 5/24/07, Tziporet Koren wrote:
> Erez Zilber wrote:
> > The links on http://www.openfabrics.org/git/ don't work. For example,
> > the link to ofed_1_2 tree leads to:
> >
> > http://git/?p=~vlad/ofed_1_2/.git;a=summary
> >
> > It seems that "www.openfabrics.org/" is missing in all links.
> >
> I have the same issue
>
> Jeff - are you the owner of this too?
>
> Tziporet

From mshefty at ichips.intel.com Thu May 24 09:34:23 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 24 May 2007 09:34:23 -0700
Subject: [ofa-general] Re: [Query] ib add path record cache
In-Reply-To: <309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com>
References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <464B5C07.8040601@ichips.intel.com> <309a667c0705162221l1830afanc05b6e8371a8290e@mail.gmail.com> <1179398534.23882.67542.camel@hal.voltaire.com> <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com> <1179483657.23882.158398.camel@hal.voltaire.com> <309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com> <1179769930.15940.9823.camel@hal.voltaire.com> <309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com> <1179930909.16831.100286.camel@hal.voltaire.com> <309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com>
Message-ID: <4655BE8F.2080102@ichips.intel.com>

> Yes It will, and hence reduce the initial SA traffic generated on a
> big cluster...just imagine, the cluster is quite big and every node is
> trying to build its cache initially. It will create large burst of SA
> packets.

In general I agree with the notion of enhancing the cache to allow it to
load locally from a file. But I'd really like to get a framework merged
upstream first before trying to add in these sorts of enhancements.

Initial loading of caches on a large fabric can be limited to a single
GetTable PR query per node, and by enabling the caches across the fabric
at different times, the single burst to the SA can be avoided.
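
[Illustration, not part of the original message.] One way to picture
"enabling the caches across the fabric at different times" is to derive a
per-node start-up delay from the node GUID, so that the initial GetTable
queries are spread over a window instead of arriving at the SA together.
The sketch below is only that: the function names, the hash, and the idea
of keying on the GUID are assumptions made for the example and do not come
from the local SA cache module discussed in this thread.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, NOT part of the local SA cache code: they only
 * illustrate staggering the first cache fill across nodes.
 */

/* splitmix64 finalizer: scrambles the GUID so that nodes with
 * consecutive GUIDs do not end up with adjacent delays. */
static uint64_t mix64(uint64_t x)
{
	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

/*
 * Delay, in milliseconds, that a node waits before issuing its initial
 * GetTable path record query. For a non-zero window the result is
 * always smaller than window_ms; a window of 0 disables staggering.
 */
uint32_t cache_start_delay_ms(uint64_t node_guid, uint32_t window_ms)
{
	if (window_ms == 0)
		return 0;
	return (uint32_t)(mix64(node_guid) % window_ms);
}
```

With a 30000 ms window, for example, each node would wait
cache_start_delay_ms(guid, 30000) milliseconds after port-up before
enabling its cache. The delay is deterministic per node, so no
coordination between nodes is needed; whether a fixed window is adequate
for a given fabric size is left open here.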

> Incomplete connectivity will be till first PR is requested for that
> destination, Because if its a cache miss, any how application is going
> to initiate a ib_sa_get_path_rec() and resolved PR will be added in
> cache for future reference.

ib_sa_get_path_rec() only returns a single path. If multiple paths exist,
adding a single path to the cache may cause all applications to make use
of that single path. Updating the cache on demand isn't as simple as it
seems.

- Sean

From rdreier at cisco.com Thu May 24 10:38:08 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 10:38:08 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <4654AE33.20506@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Wed, 23 May 2007 14:12:19 -0700")
References: <46537081.30906@linux.vnet.ibm.com> <4654690F.1040305@linux.vnet.ibm.com> <4654AE33.20506@linux.vnet.ibm.com>
Message-ID:

 > Is it too late to get this into 2.6.22? If so, I will try for 2.6.23
 > -please let me know.

Yes, it is too late. This is not a fix, and I think by the time we have
all the issues ironed out it will be even later.

From rdreier at cisco.com Thu May 24 10:40:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 10:40:00 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <46537081.30906@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Tue, 22 May 2007 15:36:49 -0700")
References: <46537081.30906@linux.vnet.ibm.com>
Message-ID:

 > By default, cap the NOSRQ memory usage to 1GB. The default recvq_size
 > is set to 128. Therefore for 64KB packets this would imply a maximum of
 > 128 endpoints.

1 GB is a pretty eye-popping amount to tie up in receive buffers.

 > -The NOSRQ limit of 1GB is also made a module parameter.

There are too many module parameters already I think...

 > -Currently we allocate a default of 64KB for the ring buffer elements,
 > and this buffer size is not linked to the mtu. In the future, we could
 > allocate buffers based on the mtu and link that into the computation of
 > the memory cap. This way customers who might want to use a smaller mtu
 > could use a larger number of endpoints, or a larger recvq_size without
 > exceeding the memory cap.

It would be nice, but to handle increasing the MTU, you need some way
to handle the receives you already posted (which would be too small
all of a sudden).

 - R.

From rdreier at cisco.com Thu May 24 10:40:39 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 10:40:39 -0700
Subject: [ofa-general] Re: [PATCH] IB/ehca: Refactor "maybe missed event" code
In-Reply-To: <200705241651.05860.fenkes@de.ibm.com> (Joachim Fenkes's message of "Thu, 24 May 2007 16:51:05 +0200")
References: <200705241651.05860.fenkes@de.ibm.com>
Message-ID:

This isn't fixing anything, is it? I think it's 2.6.23 material;
correct me if I'm wrong.

 - R.

From xma at us.ibm.com Thu May 24 10:56:32 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 24 May 2007 10:56:32 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <20070524055108.GG6019@mellanox.co.il>
Message-ID:

Hello Michael,

> I've just answered in another thread. Summary:
> I think that to enable connected mode on ehca, what we need is
>
> 1. A way to make IPoIB fall back on datagram mode when you run out of
> resources. This might need to be addressed at the protocol level.

My point of view on this: if we run out of resources, there is nothing we
can do. There won't be any new connections, just as in native RC mode. I
think this is a generic RC issue; without SRQ we will just hit it sooner.
In practice, our PPC cluster won't hit this because of its size and
memory configuration. In any case, this is an issue we should address in
the future. Can we defer finding a solution for RC, with or without SRQ,
until later? We do want IPoIB-CM mode for the performance gain and for
interoperability between our xCluster (Mellanox) and pCluster (eHCA) in
the coming OFED 1.3. So let's keep it as it is, without any parameter
tuning. Does this make sense?

> 2. A way to separate the noSRQ hacks more cleanly. This is not just me
> being a micro-optimization freak. I suggested some ways to do this better.
>
>
> --
> MST

Yes, that needs to be fixed.

Thanks
Shirley Ma
IBM Linux Technology Center

From Koen.SEGERS at VRT.BE Thu May 24 11:03:22 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Thu, 24 May 2007 20:03:22 +0200
Subject: [ofa-general] GPFS node loses IB-connection
References:
Message-ID:

After changing the switch timeout value, we never got the error again.
Yesterday, we started a 24h stress test. This test was successful. I
think we can conclude that the problem is fixed now.

But there is a strange message in the logs of the switch:

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration
caused by multicast membership change With xx,yy = (e.g) ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9 but changing to different GIDs in the next group of loggings each belonging to the IB ports of the server HCA's. This logging occurs every few minutes (not at a regular interval). Is there somewhere a Cisco manual available that describes or explains these messages? Or can anyone explain what is happening? And whether this can harm a setup that doesn't use multicast? Greetz Koen ________________________________ Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] Verzonden: wo 23/05/2007 17:40 Aan: SEGERS Koen; Hal Rosenstock CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org Onderwerp: RE: [ofa-general] GPFS node loses IB-connection Try 20 seconds, I'm curious if if you are barely crossing the 10-second threshold. Scott > -----Original Message----- > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > Sent: Wednesday, May 23, 2007 8:39 AM > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock > Cc: Clive Hall (clivhall); > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > Subject: RE: [ofa-general] GPFS node loses IB-connection > > What value would you recommend then? > > Koen > > -----Oorspronkelijk bericht----- > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > Verzonden: woensdag 23 mei 2007 17:38 > Aan: SEGERS Koen; Hal Rosenstock > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; > general at lists.openfabrics.org > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > The boot time of the host doesn't matter for this timeout. While the > host is booting, the IB link is down anyway. 
> > Scott > > > -----Original Message----- > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > Sent: Wednesday, May 23, 2007 8:20 AM > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen) > > Cc: Clive Hall (clivhall); > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > After a whole day of stresstesting with the MAD renicing > turned on, we > > got the error once. So I think I should raise the timeout on > > the switch > > also. > > > > It takes about 2 minutes to boot the system. Do you agree > > that this is a > > good value for the timeout? > > > > Scott, > > Can you explain me the problem of the memlock? > > > > I saw that the SLES10 bug is only an issue in MVAPICH. > Since we didn't > > install this, the bug is not related to us. This is > correct, isn't it? > > > > Greetz > > > > Koen > > > > -----Oorspronkelijk bericht----- > > Van: Hal Rosenstock [mailto:halr at voltaire.com] > > Verzonden: woensdag 23 mei 2007 16:12 > > Aan: Scott "Weitzenkamp (sweitzen) > > CC: SEGERS Koen; Clive Hall (clivhall); > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote: > > > No C code changes, just a few config file changes > (RENICE_IB_MAD=yes > > in > > > openib.conf, > > > > Does the host really not respond to MAD requests for over 10 > > seconds in > > some cases ? > > > > -- Hal > > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on > > > SLES10 for bug 267, etc.). 
> > > > > > Scott Weitzenkamp > > > SQA and Release Manager > > > Server Virtualization Business Unit > > > Cisco Systems > > > > > > > > > > -----Original Message----- > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > Sent: Wednesday, May 23, 2007 6:48 AM > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > general at lists.openfabrics.org; > > general-bounces at lists.openfabrics.org > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > This far, all tests seem to work. > > > > > > > > Thanks for the help! > > > > > > > > Scott, > > > > Are there more bugfixes that cisco does in its rpms? > > > > > > > > Greetz > > > > > > > > Koen > > > > > > > > -----Oorspronkelijk bericht----- > > > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > > > Verzonden: woensdag 23 mei 2007 0:39 > > > > Aan: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall > > (clivhall) > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; > > > > general-bounces at lists.openfabrics.org > > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > It's not so much pinging every 10 seconds as expecting a > > > > response within > > > > 10 seconds (Clive, correct me if I'm wrong). > > > > > > > > You only need to do 1) or 2), not both. Cisco configures 1) > > > > in the OFED > > > > binary RPMs we release at > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I > > > > prefer to have > > > > the host be more responsive. 
> > > > > > > > > > > > Scott Weitzenkamp > > > > SQA and Release Manager > > > > Server Virtualization Business Unit > > > > Cisco Systems > > > > > > > > > > > > > -----Original Message----- > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] > > > > > Sent: Tuesday, May 22, 2007 3:35 PM > > > > > To: Scott Weitzenkamp (sweitzen) > > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > > general at lists.openfabrics.org; > > general-bounces at lists.openfabrics.org > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > If I understand it wright, the switch is actually polling > > > > > (=pinging) the > > > > > interfaces every 10s. This means that when the interface is > > handling > > > > > other traffic, the poll can fail and the port could be > > > > > considered out of > > > > > service. My question is then: "How can the timeout be reached > > while > > > > > packets are being sent/received to/from the interface?" > > > > > > > > > > Anyway, what timeout-value would you recommend for > us? And why? > > > > > > > > > > To recapitulate: these are the actions I'll take tomorrow > > > > > 1) change the MAD niceness of the servers > > > > > 2) change the timeout on the switches > > > > > > > > > > Are these changes sufficient for the HCA's to keep > > their ports in > > > > > PORT_ACTIVE state? > > > > > > > > > > Regards, > > > > > > > > > > Koen > > > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp > > > > (sweitzen) wrote: > > > > > > Yes, you can tune it. Here's an example via the switch CLI: > > > > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix > fe:80:00:00:00:00:00:00 > > > > > > node-timeout > > > > > > > > > > > > The default is 10 seconds, it can be configured up to > > > > 2000 seconds. > > > > > > If a HCA is completely unresponsive for longer than the > > > > node-timeout > > > > > > value, then we consider that HCA out of service. 
> > > > > > > > > > > > Scott Weitzenkamp > > > > > > SQA and Release Manager > > > > > > Server Virtualization Business Unit > > > > > > Cisco Systems > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ______________________________________________________________ > > > > > > From: Shirley Ma [mailto:xma at us.ibm.com] > > > > > > Sent: Tuesday, May 22, 2007 11:30 AM > > > > > > To: koen.segers at VRT.BE > > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org; > > > > > > general-bounces at lists.openfabrics.org; Scott > > Weitzenkamp > > > > > > (sweitzen) > > > > > > Subject: RE: [ofa-general] GPFS node loses > > IB-connection > > > > > > > > > > > > > > > > > > > > > > > > Koen, > > > > > > > > > > > > So it is most likely you hit the same bug as > > 229 (Scott > > > > > > pointed out earlier). The same workaround might > > > > work for you > > > > > > by renicing ib_mad as Scott suggested. > > > > > > > > > > > > I think this should be a SM query timeout > > tunable value > > in > > > > > > Cisco SM. Am I right, Scott? > > > > > > > > > > > > Thanks > > > > > > Shirley Ma > > > > > > > > > > > > > > > > > > Inactive hide details for Koen Segers > > > > > Koen > > > > > > Segers > > > > > > > > > > > > > > > > > > Koen Segers > > > > > > > > > > > > > > > > > 05/22/07 11:14 AM > > > > > > Please respond to > > > > > > koen.segers at VRT.BE > > > > > > > > > > > > > > > > > > To > > > > > > > > > > > > Shirley > > > > > > Ma/Beaverton/IBM at IBMUS > > > > > > > > > > > > cc > > > > > > > > > > > > Ami Perlmutter > > > > > > , > > > > > general at lists.openfabrics.org, > > general-bounces at lists.openfabrics.org > > > > > > > > > > > > Subject > > > > > > > > > > > > RE: > > > > > > [ofa-general] > > > > > > GPFS node loses > > > > > > IB-connection > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > It is the Cisco SM. 
> > > > > > > > > > > > SFS-7000P> show version > > > > > > > > > > > > > > > > > > > > > > > ============================================================== > > > > > ================== > > > > > > System Version Information > > > > > > > > > > > ============================================================== > > > > > ================== > > > > > > system-version : SFS-7000P TopspinOS > > > > 2.9.0 releng > > > > > > #147 > > > > > > 10/25/2006 02:01:32 > > > > > > contact : tac at cisco.com > > > > > > name : SFS-7000P > > > > > > location : 170 West Tasman Drive, > > > > > San Jose, CA > > > > > > 95134 > > > > > > up-time : 11(d):7(h):49(m):3(s) > > > > > > last-change : none > > > > > > last-config-save : none > > > > > > action : none > > > > > > result : none > > > > > > oper-mode : normal > > > > > > > > > > > > There is also a command that gives the SM version, > > > > > but I can't > > > > > > find it > > > > > > right now. > > > > > > > > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote: > > > > > > > Hello Koen, > > > > > > > > > > > > > > From the switch log, it looks a SM issue to me. > > > > > The node was > > > > > > kicked > > > > > > > out of the membership. Which SM you are > > using in your > > > > > > fabric? 
> > > > > > > > > > > > > > Thanks > > > > > > > Shirley Ma > > > > > > > > > > > > > *** Disclaimer *** > > > > > > > > > > > > Vlaamse Radio- en Televisieomroep > > > > > > Auguste Reyerslaan 52, 1043 Brussel > > > > > > > > > > > > nv van publiek recht > > > > > > BTW BE 0244.142.664 > > > > > > RPR Brussel > > > > > > http://www.vrt.be/disclaimer > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *** Disclaimer *** > > > > > > > > > > Vlaamse Radio- en Televisieomroep > > > > > Auguste Reyerslaan 52, 1043 Brussel > > > > > > > > > > nv van publiek recht > > > > > BTW BE 0244.142.664 > > > > > RPR Brussel > > > > > http://www.vrt.be/disclaimer > > > > > > > > > > > > > > *** Disclaimer *** > > > > > > > > Vlaamse Radio- en Televisieomroep > > > > Auguste Reyerslaan 52, 1043 Brussel > > > > > > > > nv van publiek recht > > > > BTW BE 0244.142.664 > > > > RPR Brussel > > > > http://www.vrt.be/disclaimer > > > > > > > > > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > *** Disclaimer *** > > > > Vlaamse Radio- en Televisieomroep > > Auguste Reyerslaan 52, 1043 Brussel > > > > nv van publiek recht > > BTW BE 0244.142.664 > > RPR Brussel > > http://www.vrt.be/disclaimer > > > > > *** Disclaimer *** > > Vlaamse Radio- en Televisieomroep > Auguste Reyerslaan 52, 1043 Brussel > > nv van publiek recht > BTW BE 0244.142.664 > RPR Brussel > http://www.vrt.be/disclaimer > > *** Disclaimer *** Vlaamse Radio- en Televisieomroep Auguste Reyerslaan 52, 1043 Brussel nv van publiek recht BTW BE 0244.142.664 RPR Brussel http://www.vrt.be/disclaimer -------------- next part -------------- An HTML attachment was scrubbed... 

From rdreier at cisco.com Thu May 24 11:03:54 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 11:03:54 -0700
Subject: [ofa-general] Re: [PATCH] IB/mlx4_ib initialize work - resending fix description
In-Reply-To: <1180011931.11166.47.camel@mtls03> (Eli Cohen's message of "Thu, 24 May 2007 16:05:01 +0300")
References: <1180011931.11166.47.camel@mtls03>
Message-ID:

Thanks, good catch. I think this will fix some weird bugs I was seeing
(with IPoIB CM in my case).

I'll also push out the same fix for libmlx4.

From rdreier at cisco.com Thu May 24 11:05:10 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 11:05:10 -0700
Subject: [ofa-general] Re: [PATCH] IB/ehca: fix wrong number of send WRs returned
In-Reply-To: <200705241651.09411.fenkes@de.ibm.com> (Joachim Fenkes's message of "Thu, 24 May 2007 16:51:08 +0200")
References: <200705241651.09411.fenkes@de.ibm.com>
Message-ID:

thanks, applied.

From rdreier at cisco.com Thu May 24 11:09:26 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 11:09:26 -0700
Subject: [ofa-general] Re: question on netpoll
In-Reply-To: <20070524151254.GC7940@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 24 May 2007 18:12:54 +0300")
References: <20070524140410.GB7940@mellanox.co.il> <20070524151254.GC7940@mellanox.co.il>
Message-ID:

 > However: this could call netif_receive_skb - would that be a problem?
 > For example, what if we don't have any quota left?

I never thought of it before. I don't think the quota is an issue per
se, since the quota accounting is done elsewhere. The main issue, I
think, would be if netif_receive_skb() expected to be called only from
a certain context (a poll routine), but looking at the code, that
doesn't appear to be the case.

From xma at us.ibm.com Thu May 24 11:15:52 2007 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 24 May 2007 11:15:52 -0700 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: Message-ID:
Koen,

Are you using IPv6? If not, then this is not harmful. If you don't use it, you can simply disable loading the IPv6 module on your nodes when rebooting.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

"SEGERS Koen"
Sent by: general-bounces at lists.openfabrics.org
05/24/07 11:03 AM
To: "Scott Weitzenkamp (sweitzen)", "Hal Rosenstock"
cc: general-bounces at lists.openfabrics.org, general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

After changing the switch timeout value, we never got the error again. Yesterday, we started a 24h stress test. This test was successful. I think we can conclude that the problem is fixed now.

But there is a strange message in the switch logs:

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

Here xx and yy are GIDs such as ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9, changing to different GIDs in the next group of log entries, each belonging to the IB ports of the server HCAs.

These entries appear every few minutes (not at a regular interval). Is there a Cisco manual available somewhere that describes or explains these messages? Or can anyone explain what is happening, and whether this can harm a setup that doesn't use multicast?

Greetz

Koen

From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
Sent: Wed 23/05/2007 17:40
To: SEGERS Koen; Hal Rosenstock
CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

Try 20 seconds, I'm curious if you are barely crossing the 10-second threshold.

Scott

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> Sent: Wednesday, May 23, 2007 8:39 AM
> To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> Cc: Clive Hall (clivhall);
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> What value would you recommend then?
>
> Koen
>
> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> Sent: Wednesday, 23 May 2007 17:38
> To: SEGERS Koen; Hal Rosenstock
> CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> The boot time of the host doesn't matter for this timeout. While the
> host is booting, the IB link is down anyway.
> > Scott > > > -----Original Message----- > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > Sent: Wednesday, May 23, 2007 8:20 AM > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen) > > Cc: Clive Hall (clivhall); > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > After a whole day of stresstesting with the MAD renicing > turned on, we > > got the error once. So I think I should raise the timeout on > > the switch > > also. > > > > It takes about 2 minutes to boot the system. Do you agree > > that this is a > > good value for the timeout? > > > > Scott, > > Can you explain me the problem of the memlock? > > > > I saw that the SLES10 bug is only an issue in MVAPICH. > Since we didn't > > install this, the bug is not related to us. This is > correct, isn't it? > > > > Greetz > > > > Koen > > > > -----Oorspronkelijk bericht----- > > Van: Hal Rosenstock [mailto:halr at voltaire.com] > > Verzonden: woensdag 23 mei 2007 16:12 > > Aan: Scott "Weitzenkamp (sweitzen) > > CC: SEGERS Koen; Clive Hall (clivhall); > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote: > > > No C code changes, just a few config file changes > (RENICE_IB_MAD=yes > > in > > > openib.conf, > > > > Does the host really not respond to MAD requests for over 10 > > seconds in > > some cases ? > > > > -- Hal > > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on > > > SLES10 for bug 267, etc.). 
> > > > > > Scott Weitzenkamp > > > SQA and Release Manager > > > Server Virtualization Business Unit > > > Cisco Systems > > > > > > > > > > -----Original Message----- > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > Sent: Wednesday, May 23, 2007 6:48 AM > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > general at lists.openfabrics.org; > > general-bounces at lists.openfabrics.org > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > This far, all tests seem to work. > > > > > > > > Thanks for the help! > > > > > > > > Scott, > > > > Are there more bugfixes that cisco does in its rpms? > > > > > > > > Greetz > > > > > > > > Koen > > > > > > > > -----Oorspronkelijk bericht----- > > > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > > > Verzonden: woensdag 23 mei 2007 0:39 > > > > Aan: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall > > (clivhall) > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; > > > > general-bounces at lists.openfabrics.org > > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > It's not so much pinging every 10 seconds as expecting a > > > > response within > > > > 10 seconds (Clive, correct me if I'm wrong). > > > > > > > > You only need to do 1) or 2), not both. Cisco configures 1) > > > > in the OFED > > > > binary RPMs we release at > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I > > > > prefer to have > > > > the host be more responsive. 
> > > > > > > > > > > > Scott Weitzenkamp > > > > SQA and Release Manager > > > > Server Virtualization Business Unit > > > > Cisco Systems > > > > > > > > > > > > > -----Original Message----- > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] > > > > > Sent: Tuesday, May 22, 2007 3:35 PM > > > > > To: Scott Weitzenkamp (sweitzen) > > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > > general at lists.openfabrics.org; > > general-bounces at lists.openfabrics.org > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > If I understand it wright, the switch is actually polling > > > > > (=pinging) the > > > > > interfaces every 10s. This means that when the interface is > > handling > > > > > other traffic, the poll can fail and the port could be > > > > > considered out of > > > > > service. My question is then: "How can the timeout be reached > > while > > > > > packets are being sent/received to/from the interface?" > > > > > > > > > > Anyway, what timeout-value would you recommend for > us? And why? > > > > > > > > > > To recapitulate: these are the actions I'll take tomorrow > > > > > 1) change the MAD niceness of the servers > > > > > 2) change the timeout on the switches > > > > > > > > > > Are these changes sufficient for the HCA's to keep > > their ports in > > > > > PORT_ACTIVE state? > > > > > > > > > > Regards, > > > > > > > > > > Koen > > > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp > > > > (sweitzen) wrote: > > > > > > Yes, you can tune it. Here's an example via the switch CLI: > > > > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix > fe:80:00:00:00:00:00:00 > > > > > > node-timeout > > > > > > > > > > > > The default is 10 seconds, it can be configured up to > > > > 2000 seconds. > > > > > > If a HCA is completely unresponsive for longer than the > > > > node-timeout > > > > > > value, then we consider that HCA out of service. 
> > > > > > > > > > > > Scott Weitzenkamp > > > > > > SQA and Release Manager > > > > > > Server Virtualization Business Unit > > > > > > Cisco Systems > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ______________________________________________________________ > > > > > > From: Shirley Ma [mailto:xma at us.ibm.com] > > > > > > Sent: Tuesday, May 22, 2007 11:30 AM > > > > > > To: koen.segers at VRT.BE > > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org; > > > > > > general-bounces at lists.openfabrics.org; Scott > > Weitzenkamp > > > > > > (sweitzen) > > > > > > Subject: RE: [ofa-general] GPFS node loses > > IB-connection > > > > > > > > > > > > > > > > > > > > > > > > Koen, > > > > > > > > > > > > So it is most likely you hit the same bug as > > 229 (Scott > > > > > > pointed out earlier). The same workaround might > > > > work for you > > > > > > by renicing ib_mad as Scott suggested. > > > > > > > > > > > > I think this should be a SM query timeout > > tunable value > > in > > > > > > Cisco SM. Am I right, Scott? > > > > > > > > > > > > Thanks > > > > > > Shirley Ma > > > > > > > > > > > > > > > > > > Inactive hide details for Koen Segers > > > > > Koen > > > > > > Segers > > > > > > > > > > > > > > > > > > Koen Segers > > > > > > > > > > > > > > > > > 05/22/07 11:14 AM > > > > > > Please respond to > > > > > > koen.segers at VRT.BE > > > > > > > > > > > > > > > > > > To > > > > > > > > > > > > Shirley > > > > > > Ma/Beaverton/IBM at IBMUS > > > > > > > > > > > > cc > > > > > > > > > > > > Ami Perlmutter > > > > > > , > > > > > general at lists.openfabrics.org, > > general-bounces at lists.openfabrics.org > > > > > > > > > > > > Subject > > > > > > > > > > > > RE: > > > > > > [ofa-general] > > > > > > GPFS node loses > > > > > > IB-connection > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > It is the Cisco SM. 
> > > > > > > > > > > > SFS-7000P> show version > > > > > > > > > > > > > > > > > > > > > > > ============================================================== > > > > > ================== > > > > > > System Version Information > > > > > > > > > > > ============================================================== > > > > > ================== > > > > > > system-version : SFS-7000P TopspinOS > > > > 2.9.0 releng > > > > > > #147 > > > > > > 10/25/2006 02:01:32 > > > > > > contact : tac at cisco.com > > > > > > name : SFS-7000P > > > > > > location : 170 West Tasman Drive, > > > > > San Jose, CA > > > > > > 95134 > > > > > > up-time : 11(d):7(h):49(m):3(s) > > > > > > last-change : none > > > > > > last-config-save : none > > > > > > action : none > > > > > > result : none > > > > > > oper-mode : normal > > > > > > > > > > > > There is also a command that gives the SM version, > > > > > but I can't > > > > > > find it > > > > > > right now. > > > > > > > > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote: > > > > > > > Hello Koen, > > > > > > > > > > > > > > From the switch log, it looks a SM issue to me. > > > > > The node was > > > > > > kicked > > > > > > > out of the membership. Which SM you are > > using in your > > > > > > fabric? 
> > > > > > > Thanks > > > > > > > Shirley Ma *** Disclaimer *** Vlaamse Radio- en Televisieomroep Auguste Reyerslaan 52, 1043 Brussel nv van publiek recht BTW BE 0244.142.664 RPR Brussel http://www.vrt.be/disclaimer _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe,
please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: pic11723.gif Type: image/gif Size: 1255 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL:
From rdreier at cisco.com Thu May 24 11:38:05 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 24 May 2007 11:38:05 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipoib: drain cq in dev_stop In-Reply-To: <20070524153246.GD7940@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 24 May 2007 18:32:46 +0300") References: <20070524153246.GD7940@mellanox.co.il> Message-ID: This looks correct to me... I applied it as two patches though since the two fixes look pretty independent. - R. From rdreier at cisco.com Thu May 24 11:41:27 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 24 May 2007 11:41:27 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipoib: drain cq in dev_stop In-Reply-To: <20070524155008.GB23535@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 24 May 2007 18:50:08 +0300") References: <20070524153246.GD7940@mellanox.co.il> <20070524155008.GB23535@mellanox.co.il> Message-ID: > + if (unlikely(wc->status != IB_WC_SUCCESS || flush)) { Now we have two things to test here, which means we hurt our fast path for the rare case. What if we overwrite any status of IB_WC_SUCCESS with IB_WC_FLUSH_ERR if we're draining a CQ?
I don't see anything obviously wrong with this (on top of the patches I just applied):

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 8404f05..e24ccb4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -557,6 +557,14 @@ void ipoib_drain_cq(struct net_device *dev)
 	do {
 		n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
 		for (i = 0; i < n; ++i) {
+			/*
+			 * Convert any successful completions to flush
+			 * errors to avoid passing packets up the
+			 * stack after bringing the device down.
+			 */
+			if (priv->ibwc[i].status == IB_WC_SUCCESS)
+				priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR;
+
 			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
 				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
 			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)

From mst at dev.mellanox.co.il Thu May 24 11:46:08 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 24 May 2007 21:46:08 +0300 Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint In-Reply-To: References: <20070524055108.GG6019@mellanox.co.il> Message-ID: <20070524184608.GC23535@mellanox.co.il>
> Quoting Shirley Ma :
> Subject: Re: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
>
> Hello Michael,
>
> > I've just answered in another thread. Summary:
> > I think that to enable connected mode on ehca, what we need is
> >
> > 1. A way to make IPoIB fall back on datagram mode when you run out of
> > resources. This might need to be addressed at the protocol level.
>
> My point of view this, if we run out of resource, there is nothing we can do.

Yes. So we need to fall back to datagram before running out of resources.

> There won't be any new connections, just like native RC mode. I think this is
> a generic RC issue, w/o SRQ just will hit this sooner.

I don't think that's true: with SRQ we *never* run out of memory for RX buffers.
-- MST
From mst at dev.mellanox.co.il Thu May 24 11:47:44 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 24 May 2007 21:47:44 +0300 Subject: [ofa-general] Re: [PATCH] IB/ipoib: drain cq in dev_stop In-Reply-To: References: <20070524153246.GD7940@mellanox.co.il> <20070524155008.GB23535@mellanox.co.il> Message-ID: <20070524184744.GD23535@mellanox.co.il>
> Quoting Roland Dreier :
> Subject: Re: [PATCH] IB/ipoib: drain cq in dev_stop
>
> > +	if (unlikely(wc->status != IB_WC_SUCCESS || flush)) {
>
> Now we have two things to test here, which means we hurt our fast path
> for the rare case.
>
> What if we overwrite any status of IB_WC_SUCCESS with IB_WC_FLUSH_ERR
> if we're draining a CQ? I don't see anything obviously wrong with
> this (on top of the patches I just applied):
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index 8404f05..e24ccb4 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -557,6 +557,14 @@ void ipoib_drain_cq(struct net_device *dev)
>  	do {
>  		n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
>  		for (i = 0; i < n; ++i) {
> +			/*
> +			 * Convert any successful completions to flush
> +			 * errors to avoid passing packets up the
> +			 * stack after bringing the device down.
> +			 */
> +			if (priv->ibwc[i].status == IB_WC_SUCCESS)
> +				priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR;
> +
>  			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
>  				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
>  			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)

I love this! Go for it.
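A toy model of the idea in the patch quoted above, written in Python rather than kernel C purely for illustration. The names (`WcStatus`, `WorkCompletion`, `handle_rx_wc`, `drain_cq`) are made up for this sketch and only loosely mirror the kernel's. It assumes the receive handler already drops anything whose status is not success, which is what would let the drain loop reuse it without an extra flag test in the fast path.

```python
from dataclasses import dataclass
from enum import Enum, auto


class WcStatus(Enum):
    # Hypothetical stand-ins for IB_WC_SUCCESS and IB_WC_WR_FLUSH_ERR.
    SUCCESS = auto()
    WR_FLUSH_ERR = auto()


@dataclass
class WorkCompletion:
    wr_id: int
    status: WcStatus


def handle_rx_wc(wc, delivered):
    """Normal receive path: a single status test, then hand the packet up."""
    if wc.status is not WcStatus.SUCCESS:
        return  # error path: the buffer is dropped, nothing reaches the stack
    delivered.append(wc.wr_id)


def drain_cq(cq, delivered):
    """Drain path: rewrite successes to flush errors before dispatching."""
    while cq:
        wc = cq.pop(0)
        if wc.status is WcStatus.SUCCESS:
            wc.status = WcStatus.WR_FLUSH_ERR
        handle_rx_wc(wc, delivered)


delivered = []
# Device up: a successful completion is passed up the stack.
handle_rx_wc(WorkCompletion(1, WcStatus.SUCCESS), delivered)
# Device going down: everything still in the CQ is treated as flushed.
pending = [WorkCompletion(2, WcStatus.SUCCESS),
           WorkCompletion(3, WcStatus.WR_FLUSH_ERR)]
drain_cq(pending, delivered)
print(delivered)  # [1]
```

The handler keeps its one comparison against success; only the rarely executed drain loop pays for the extra assignment.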
-- MST From pradeeps at linux.vnet.ibm.com Thu May 24 12:16:52 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 24 May 2007 12:16:52 -0700 Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint In-Reply-To: References: <46537081.30906@linux.vnet.ibm.com> Message-ID: <4655E4A4.6020408@linux.vnet.ibm.com> Roland Dreier wrote: > > By default, cap the NOSRQ memory usage to 1GB. The default recvq_size > > is set to 128. Therefore for 64KB packets this would imply a maximum of > > 128 endpoints. > > 1 GB is a pretty eye-popping amount to tie up in receive buffers. It is 8MB per endpoint of receive buffers, and so you would use up 1GB when all 128 endpoints are active at the same time. This proposal is only for PPC systems when used with IBM HCA. It has no impact on Topspin HCAs (even when used on PPC systems). IBM cluster nodes are "fat" systems and have large quantities of memory. Hence using up 1GB should not be an issue. > > > -The NOSRQ limit of 1GB is also made a module parameter. > > There are too many module parameters already I think... If we choose the defaults correctly, most customers should not have to tune these parameters. This way we provide the flexibility to systems with large memory (say 64 GB) to raise the limits if need be. > > > -Currently we allocate a default of 64KB for the ring buffer elements, > > and this buffer size is not linked to the mtu. In the future, we could > > allocate buffers based on the mtu and link that into the computation of > > the memory cap. This way customers who might want to use a smaller mtu > > could use a larger number of endpoints, or a larger recvq_size without > > exceeding the memory cap. > > It would be nice, but to handle increasing the MTU, you need some way > to handle the receives you already posted (which would be too small > all of a sudden). > Can you expand on this a little more -I do not catch the drift. 
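The arithmetic behind the figures in the message above can be checked with a short sketch. The function name `nosrq_footprint` and the 16 KB alternative buffer size are assumptions chosen for illustration; they are not taken from the patch. Only the 64 KB buffer, the recvq_size of 128 and the 1 GB cap come from the message.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def nosrq_footprint(buf_size=64 * KIB, recvq_size=128, mem_cap=1 * GIB):
    """Return (receive-buffer bytes per endpoint, endpoints that fit under the cap)."""
    per_endpoint = buf_size * recvq_size
    return per_endpoint, mem_cap // per_endpoint


# Defaults from the message: 64 KB buffers, recvq_size 128, 1 GB cap.
per_ep, max_eps = nosrq_footprint()
print(per_ep // MIB, max_eps)  # 8 128

# Hypothetical smaller buffers (as if sized from a smaller MTU): more endpoints fit.
per_ep_small, max_eps_small = nosrq_footprint(buf_size=16 * KIB)
print(per_ep_small // MIB, max_eps_small)  # 2 512
```

The second call shows the trade-off mentioned at the end of the message: with buffers sized from a smaller MTU, the same cap would cover more endpoints or a deeper receive queue.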
Pradeep From xma at us.ibm.com Thu May 24 12:36:11 2007 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 24 May 2007 12:36:11 -0700 Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint In-Reply-To: <20070524184608.GC23535@mellanox.co.il> Message-ID: Hello Michael, > > My point of view this, if we run out of resource, there is nothing we can do. > > Yes. So we need to fall back to datagram before running out of resources. Only high-end PPC systems support eHCA, and in a high-end PPC cluster each node will be configured with enough memory to handle the connections within the cluster, so this patch will work OK. We would like to have IPoIB-CM without SRQ available so that IPoIB-CM mode can be turned on by default soon. Handling resource exhaustion would be good, but it is not a simple effort. Let's handle running out of resources in an independent patch later, without blocking this patch from going upstream. I hope you can agree with this. Thanks Shirley Ma -------------- next part -------------- An HTML attachment was scrubbed... URL: From or.gerlitz at gmail.com Thu May 24 12:51:17 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Thu, 24 May 2007 22:51:17 +0300 Subject: [ofa-general] RE: two interfaces with ipoib In-Reply-To: References: <4654B14B.4000208@opengridcomputing.com> <4654C278.6030703@ichips.intel.com> <46554A5A.3050607@voltaire.com> Message-ID: <15ddcffd0705241251s4f7ee850qd98b60b33989328a@mail.gmail.com> On 5/24/07, Scott Weitzenkamp (sweitzen) wrote: > > How is: > > sysctl -w net.ipv4.conf.all.arp_ignore=2 > > different from: > > for i in /proc/sys/net/ipv4/conf/ib*/arp_filter; do echo 1 > $i; done > > I have been using the latter successfully regarding this issue. Reading Documentation/networking/ip-sysctl.txt, my understanding is that setting arp_ignore to 1 or 2 gives more or less the same behavior as setting arp_filter to 1. The arp_ignore parameter is a somewhat more refined version of arp_filter. Or.
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From clivhall at cisco.com Thu May 24 13:37:48 2007 From: clivhall at cisco.com (Clive Hall (clivhall)) Date: Thu, 24 May 2007 13:37:48 -0700 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: Message-ID: <5BD9FA70F5EDAC43AB816A5FDE30F6AC0455380D@xmb-sjc-21a.amer.cisco.com>
Those particular log messages are just informational messages. They're logged when multicast groups are created (when the first group member joins) and when multicast groups are deleted (when the last group member leaves). As Shirley said, if you're not using IPv6 anyway then those messages aren't harmful. Even if you are using IPv6 it will quite possibly still be fine, although I don't know why hosts would be leaving/rejoining the multicast groups.

Clive.

________________________________

From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Shirley Ma
Sent: Thursday, May 24, 2007 11:16 AM
To: SEGERS Koen
Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

Koen,

Are you using IPv6? If not, then this is not harmful. If you don't use it, you can simply disable loading the IPv6 module on your nodes when rebooting.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

"SEGERS Koen"
Sent by: general-bounces at lists.openfabrics.org
05/24/07 11:03 AM
To: "Scott Weitzenkamp (sweitzen)", "Hal Rosenstock"
cc: general-bounces at lists.openfabrics.org, general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

After changing the switch timeout value, we never got the error again. Yesterday, we started a 24h stress test. This test was successful. I think we can conclude that the problem is fixed now.

But there is a strange message in the switch logs:

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

Here xx and yy are GIDs such as ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9, changing to different GIDs in the next group of log entries, each belonging to the IB ports of the server HCAs.

These entries appear every few minutes (not at a regular interval). Is there a Cisco manual available somewhere that describes or explains these messages? Or can anyone explain what is happening, and whether this can harm a setup that doesn't use multicast?

Greetz

Koen

________________________________

From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
Sent: Wed 23/05/2007 17:40
To: SEGERS Koen; Hal Rosenstock
CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

Try 20 seconds, I'm curious if you are barely crossing the 10-second threshold.
Scott

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> Sent: Wednesday, May 23, 2007 8:39 AM
> To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> What value would you recommend then?
>
> Koen
>
> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> Sent: Wednesday, 23 May 2007 17:38
> To: SEGERS Koen; Hal Rosenstock
> CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> The boot time of the host doesn't matter for this timeout. While the
> host is booting, the IB link is down anyway.
>
> Scott
>
> > -----Original Message-----
> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > Sent: Wednesday, May 23, 2007 8:20 AM
> > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > After a whole day of stress testing with the MAD renicing turned on, we
> > got the error once. So I think I should raise the timeout on the switch
> > as well.
> >
> > It takes about 2 minutes to boot the system. Do you agree that this is a
> > good value for the timeout?
> >
> > Scott,
> > Can you explain the memlock problem to me?
> >
> > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > install this, the bug is not related to us. This is correct, isn't it?
> >
> > Greetz
> >
> > Koen
> >
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Wednesday, 23 May 2007 16:12
> > To: Scott Weitzenkamp (sweitzen)
> > CC: SEGERS Koen; Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > > openib.conf,
> >
> > Does the host really not respond to MAD requests for over 10 seconds in
> > some cases?
> >
> > -- Hal
> >
> > > memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > SLES10 for bug 267, etc.).
> > >
> > > Scott Weitzenkamp
> > > SQA and Release Manager
> > > Server Virtualization Business Unit
> > > Cisco Systems
> > >
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > So far, all tests seem to work.
> > > >
> > > > Thanks for the help!
> > > >
> > > > Scott,
> > > > Are there more bug fixes that Cisco applies in its RPMs?
> > > >
> > > > Greetz
> > > >
> > > > Koen
> > > >
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > Sent: Wednesday, 23 May 2007 0:39
> > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > 10 seconds (Clive, correct me if I'm wrong).
> > > >
> > > > You only need to do 1) or 2), not both. Cisco configures 1) in the OFED
> > > > binary RPMs we release at
> > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux . I prefer to have
> > > > the host be more responsive.
> > > >
> > > > Scott Weitzenkamp
> > > > SQA and Release Manager
> > > > Server Virtualization Business Unit
> > > > Cisco Systems
> > > >
> > > > > -----Original Message-----
> > > > > From: Koen Segers [mailto:koen.segers at VRT.BE]
> > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > If I understand it correctly, the switch is actually polling (=pinging) the
> > > > > interfaces every 10s. This means that when the interface is handling
> > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > service. My question is then: "How can the timeout be reached while
> > > > > packets are being sent/received to/from the interface?"
> > > > >
> > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > >
> > > > > To recapitulate, these are the actions I'll take tomorrow:
> > > > > 1) change the MAD niceness of the servers
> > > > > 2) change the timeout on the switches
> > > > >
> > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > PORT_ACTIVE state?
> > > > >
> > > > > Regards,
> > > > >
> > > > > Koen
> > > > >
> > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > Yes, you can tune it. Here's an example via the switch CLI:
> > > > > >
> > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout
> > > > > >
> > > > > > The default is 10 seconds, it can be configured up to 2000 seconds.
> > > > > > If a HCA is completely unresponsive for longer than the node-timeout
> > > > > > value, then we consider that HCA out of service.
> > > > > >
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems
> > > > > >
> > > > > > ______________________________________________________________
> > > > > > From: Shirley Ma [mailto:xma at us.ibm.com]
> > > > > > Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > To: koen.segers at VRT.BE
> > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org; general-bounces at lists.openfabrics.org; Scott Weitzenkamp (sweitzen)
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > Koen,
> > > > > >
> > > > > > So it is most likely you hit the same bug as 229 (Scott
> > > > > > pointed out earlier). The same workaround might work for you
> > > > > > by renicing ib_mad as Scott suggested.
> > > > > >
> > > > > > I think this should be a SM query timeout tunable value in
> > > > > > Cisco SM. Am I right, Scott?
> > > > > >
> > > > > > Thanks
> > > > > > Shirley Ma
> > > > > >
> > > > > > From: Koen Segers
> > > > > > Date: 05/22/07 11:14 AM
> > > > > > Please respond to: koen.segers at VRT.BE
> > > > > > To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > Cc: Ami Perlmutter, general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > It is the Cisco SM.
> > > > > >
> > > > > > SFS-7000P> show version
> > > > > >
> > > > > > ================================================================================
> > > > > >                           System Version Information
> > > > > > ================================================================================
> > > > > >           system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > >                  contact : tac at cisco.com
> > > > > >                     name : SFS-7000P
> > > > > >                 location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > >                  up-time : 11(d):7(h):49(m):3(s)
> > > > > >              last-change : none
> > > > > >         last-config-save : none
> > > > > >                   action : none
> > > > > >                   result : none
> > > > > >                oper-mode : normal
> > > > > >
> > > > > > There is also a command that gives the SM version, but I can't
> > > > > > find it right now.
> > > > > >
> > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > Hello Koen,
> > > > > > >
> > > > > > > From the switch log, it looks like an SM issue to me. The node was
> > > > > > > kicked out of the membership. Which SM are you using in your
> > > > > > > fabric?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Shirley Ma
> > > > > >
> > > > > > *** Disclaimer ***
> > > > > >
> > > > > > Vlaamse Radio- en Televisieomroep
> > > > > > Auguste Reyerslaan 52, 1043 Brussel
> > > > > >
> > > > > > public limited company under public law
> > > > > > VAT BE 0244.142.664
> > > > > > Register of Legal Entities, Brussels
> > > > > > http://www.vrt.be/disclaimer

_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From rdreier at cisco.com Thu May 24 14:03:22 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 14:03:22 -0700
Subject: [ofa-general] Re: [PATCH] IB/mlx4_ib initialize work - resending fix description
In-Reply-To: <1180011931.11166.47.camel@mtls03> (Eli Cohen's message of "Thu, 24 May 2007 16:05:01 +0300")
References: <1180011931.11166.47.camel@mtls03>
Message-ID:

 > +	if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) {
 >  		context->db_rec_addr = cpu_to_be64(qp->db.dma);
 > +		for (i = 0; i < qp->sq.max; ++i) {
 > +			ctrl = get_send_wqe(qp, i);
 > +			ctrl->owner_opcode = cpu_to_be32(1 << 31);
 > +		}
 > +	}

er... actually we only want to do this for kernel QPs (since we don't have access to the buffer for user QPs). I fixed up the patch in my tree.

From mbloom at tervela.com Thu May 24 14:06:57 2007
From: mbloom at tervela.com (Michael Bloom)
Date: Thu, 24 May 2007 17:06:57 -0400
Subject: [ofa-general] newbie throughput question
Message-ID:

I'm new to IB and the OFED stack, but I'm using a proprietary packet blaster program and getting very low throughput. I'm expecting 200k - 500k packets per second, but I'm only achieving ~ 15k pps. This program is running on 2 dedicated RH boxes (transmitter on one, receiver on the other). We're using the standard OFED 1.1.1 stack.
Has anyone seen high packet throughput levels using a similar environment? Are there certain kernel parameters that I need to tweak here? Any help would be appreciated.

Thanks,
Mike

From Koen.SEGERS at VRT.BE Thu May 24 14:24:27 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Thu, 24 May 2007 23:24:27 +0200
Subject: [ofa-general] GPFS node loses IB-connection
References: <5BD9FA70F5EDAC43AB816A5FDE30F6AC0455380D@xmb-sjc-21a.amer.cisco.com>
Message-ID:

We are not using IPoIB. We only use SDP, but IPoIB is compiled just in case we need it (when SDP is not sufficient). All interfaces are given an IPv4 address, so the messages aren't harmful I guess.

Thanks!

Koen

________________________________
From: Clive Hall (clivhall) [mailto:clivhall at cisco.com]
Sent: Thu 24/05/2007 22:37
To: Shirley Ma; SEGERS Koen
CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

Those particular log messages are just informational messages. They're logged when multicast groups are created (when the first group member joins) and when multicast groups are deleted (when the last group member leaves). As Shirley said, if you're not using IPv6 anyway then those messages aren't harmful. Even if you are using IPv6 it will quite possibly still be fine, although I don't know why hosts would be leaving/rejoining the multicast groups.

Clive.

________________________________
From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Shirley Ma
Sent: Thursday, May 24, 2007 11:16 AM
To: SEGERS Koen
Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

Koen,

Are you using IPv6? If not, then this is not harmful. If you don't use it, you can simply disable loading the IPv6 module on your nodes when rebooting.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From: "SEGERS Koen"
Sent by: general-bounces at lists.openfabrics.org
Date: 05/24/07 11:03 AM
To: "Scott Weitzenkamp (sweitzen)", "Hal Rosenstock"
Cc: general-bounces at lists.openfabrics.org, general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

After changing the switch timeout value, we never got the error again. Yesterday, we started a 24h stress test. This test was successful. I think we can conclude that the problem is fixed now.

But there is a strange message in the logs of the switch:

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

With xx,yy = (e.g.) ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9, but changing to different GIDs in the next group of log entries, each belonging to the IB ports of the server HCAs. This logging occurs every few minutes (not at a regular interval).

Is there a Cisco manual available somewhere that describes or explains these messages? Or can anyone explain what is happening, and whether this can harm a setup that doesn't use multicast?

Greetz

Koen

________________________________
From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
Sent: Wed 23/05/2007 17:40
To: SEGERS Koen; Hal Rosenstock
CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

Try 20 seconds, I'm curious if you are barely crossing the 10-second threshold.

Scott

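
On the multicast trap messages discussed in this thread: the example GID ff:12:60:1b:ff:ff:... can be read field by field, which is presumably why the first question back was whether IPv6 is in use. The following is only a minimal sketch, assuming the IPoIB multicast GID layout from RFC 4391 (0xff, flags/scope, a 16-bit IP signature where 0x401b is IPv4 and 0x601b is IPv6, a 16-bit P_Key, then the group ID). The struct and function names are hypothetical and are not taken from OFED or the Cisco SM.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical holder for the leading fields of an IPoIB multicast GID. */
struct ipoib_mgid_info {
    uint8_t  flags;      /* high nibble of byte 1 */
    uint8_t  scope;      /* low nibble of byte 1 (2 = link-local) */
    uint16_t signature;  /* 0x401b = IPv4, 0x601b = IPv6 (RFC 4391) */
    uint16_t pkey;       /* partition key */
};

/* Parse 16 colon-separated hex bytes ("ff:12:60:1b:...") into gid[].
 * Returns 0 on success, -1 on malformed input. */
static int parse_gid(const char *text, uint8_t gid[16])
{
    for (int i = 0; i < 16; i++) {
        char *end;
        unsigned long v = strtoul(text, &end, 16);

        if (end == text || v > 0xff)
            return -1;
        gid[i] = (uint8_t)v;
        if (i < 15) {
            if (*end != ':')
                return -1;
            text = end + 1;
        } else if (*end != '\0') {
            return -1;
        }
    }
    return 0;
}

/* Split out the header fields; returns -1 if this is not a multicast GID. */
static int decode_ipoib_mgid(const uint8_t gid[16], struct ipoib_mgid_info *out)
{
    if (gid[0] != 0xff)
        return -1;
    out->flags = gid[1] >> 4;
    out->scope = gid[1] & 0x0f;
    out->signature = (uint16_t)((gid[2] << 8) | gid[3]);
    out->pkey = (uint16_t)((gid[4] << 8) | gid[5]);
    return 0;
}

/* Map the IP signature to a protocol name. */
static const char *mgid_protocol(uint16_t signature)
{
    switch (signature) {
    case 0x401b: return "IPv4";
    case 0x601b: return "IPv6";
    default:     return "unknown";
    }
}
```

Fed the GID from the log excerpt, this reports the IPv6 signature with P_Key 0xffff and link-local scope. The trailing bytes 00:01:ff:05:87:d9 look like the low bits of an IPv6 solicited-node address (ff02::1:ffXX:XXXX), which would fit hosts with the IPv6 module loaded on an IPoIB interface, but that reading is an inference from the address format, not something stated in the thread.
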

From rdreier at cisco.com Thu May 24 14:26:38 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 14:26:38 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint
In-Reply-To: <4655E4A4.6020408@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Thu, 24 May 2007 12:16:52 -0700")
References: <46537081.30906@linux.vnet.ibm.com> <4655E4A4.6020408@linux.vnet.ibm.com>
Message-ID:

 > > It would be nice, but to handle increasing the MTU, you need some way
 > > to handle the receives you already posted (which would be too small
 > > all of a sudden).

 > Can you expand on this a little more -I do not catch the drift.

OK, suppose I configure the interface with a 16K MTU. I assume your plan would be to queue up a bunch of 16K receives. Now suppose I change the MTU to 64K. What do you do about the receives you already queued up that can't handle the new MTU?

 - R.

From sashak at voltaire.com Thu May 24 15:54:28 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 25 May 2007 01:54:28 +0300
Subject: [ofa-general] Re: [PATCHv2] osm: up/dn optimization - improved ranking
In-Reply-To: <4651557E.2080400@dev.mellanox.co.il>
References: <46503064.7010107@dev.mellanox.co.il> <20070520161034.GY19271@sashak.voltaire.com> <4651557E.2080400@dev.mellanox.co.il>
Message-ID: <20070524225428.GK837@sashak.voltaire.com>

Hi Yevgeny,

On 11:17 Mon 21 May , Yevgeny Kliteynik wrote:
> >>@@ -492,7 +483,10 @@ updn_subn_rank(
> >>                          remote_u->rank );
> >>
> >>            if (did_cause_update)
> >>+           {
> >>               cl_qlist_insert_tail(&list, &remote_u->list);
> >>+              max_rank = remote_u->rank;
> >>+           }
> >
> >I think this still be not accurate. For instance with topology like:
> >A <-> B <-> C <-> D <-> E , where roots are A and E we will get
> >max_rank= 1, which obviously should be 2.
>
> Not exactly. What you're describing would happen if the scan would be
> DFS-like, not BFS.

You are right, I used broken logic :(

> I do think that to make the code more "intuitive" we might want to
> remove the __updn_update_rank() and do something like this:
>
>    if (remote_u->rank > u->rank + 1)
>    {
>       remote_u->rank = u->rank + 1;
>       max_rank = remote_u->rank;
>       cl_qlist_insert_tail(&list, &remote_u->list);
>    }

Agree, it looks cleaner.

Sasha

From rdreier at cisco.com Thu May 24 16:51:55 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 24 May 2007 16:51:55 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race
In-Reply-To: <20070524131154.GA7940@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 24 May 2007 16:11:54 +0300")
References: <20070522005918.GB13311@mellanox.co.il> <20070524131154.GA7940@mellanox.co.il>
Message-ID:

 > The following works fine for me here. Pls consider for 2.6.22.

Does it help with https://bugs.openfabrics.org//show_bug.cgi?id=604 ?
Or are we still looking?

 - R.

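
For the up/down ranking exchange between Sasha and Yevgeny above, the A <-> B <-> C <-> D <-> E case can be checked with a small stand-alone model. This is a rough sketch only, assuming a plain adjacency matrix and a caller-supplied array as the FIFO; the function name and data layout are hypothetical and this is not the OpenSM code. It applies the same update rule as the snippet Yevgeny proposed.

```c
#include <stddef.h>

#define UPDN_NO_RANK 0xffffffffu  /* marker for "not ranked yet" */

/* Rank every node by BFS distance from its nearest root and return the
 * largest rank assigned. adj is an n*n adjacency matrix in row-major
 * order, rank[] and queue[] are caller-provided arrays of n entries,
 * and the roots are assumed to be distinct. */
static unsigned updn_rank_sketch(size_t n, const unsigned char *adj,
                                 const size_t *roots, size_t n_roots,
                                 unsigned *rank, size_t *queue)
{
    size_t head = 0, tail = 0;
    unsigned max_rank = 0;

    for (size_t i = 0; i < n; i++)
        rank[i] = UPDN_NO_RANK;

    /* All roots start at rank 0 and seed the FIFO together. */
    for (size_t i = 0; i < n_roots; i++) {
        rank[roots[i]] = 0;
        queue[tail++] = roots[i];
    }

    while (head < tail) {
        size_t u = queue[head++];

        for (size_t v = 0; v < n; v++) {
            if (!adj[u * n + v])
                continue;
            /* Same rule as in the proposed snippet: only lower a rank. */
            if (rank[v] > rank[u] + 1) {
                rank[v] = rank[u] + 1;
                max_rank = rank[v];
                queue[tail++] = v;
            }
        }
    }
    return max_rank;
}
```

With roots A and E on the five-node chain, the FIFO order visits B and D (rank 1) before C (rank 2), so the last rank written is also the maximum, which is the BFS argument made in the reply. A depth-first traversal would not give that ordering guarantee.
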
From swise at opengridcomputing.com Thu May 24 18:59:53 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 24 May 2007 18:59:53 -0700
Subject: [ofa-general] RDMA write completion question
In-Reply-To: <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
References: <20070524141928.GI20691@minantech.com> <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com>
Message-ID: <46564319.4040800@opengridcomputing.com>

Devesh Sharma wrote:
> On 5/24/07, Gleb Natapov wrote:
>> Hi,
>>
>> Does local RDMA write completion guarantee that the data that was
>> RDMAed is already accessible in the destination host's _memory_?
> Local RDMA write completion guarantees that the data you have RDMAed
> has been copied into the remote buffer, without any data corruption.
>>

For iWARP, the local write completion simply means you can reuse the local buffer, and the transport will either deliver the data or kill the connection. The data _could_ be queued in the local RNIC and anywhere else in the TCP cloud.

From vlad at lists.openfabrics.org Fri May 25 02:41:31 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri, 25 May 2007 02:41:31 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070525-0200 daily build status
Message-ID: <20070525094131.6BEAAE6082A@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.19
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:

From mst at dev.mellanox.co.il Fri May 25 02:47:05 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 25 May 2007 12:47:05 +0300
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race
In-Reply-To:
References: <20070522005918.GB13311@mellanox.co.il> <20070524131154.GA7940@mellanox.co.il>
Message-ID: <20070525094705.GC15942@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race
>
>  > The following works fine for me here. Pls consider for 2.6.22.
>
> Does it help with https://bugs.openfabrics.org//show_bug.cgi?id=604 ?
> Or are we still looking?

Still looking.

-- MST From devesh28 at gmail.com Fri May 25 06:52:59 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Fri, 25 May 2007 19:22:59 +0530 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <1180020620.16831.198071.camel@hal.voltaire.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <1179398534.23882.67542.camel@hal.voltaire.com> <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com> <1179483657.23882.158398.camel@hal.voltaire.com> <309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com> <1179769930.15940.9823.camel@hal.voltaire.com> <309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com> <1179930909.16831.100286.camel@hal.voltaire.com> <309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com> <1180020620.16831.198071.camel@hal.voltaire.com> Message-ID: <309a667c0705250652m2ddbfd31v5e8947d9b28882c2@mail.gmail.com> On 24 May 2007 11:30:24 -0400, Hal Rosenstock wrote: > On Thu, 2007-05-24 at 08:22, Devesh Sharma wrote: > > On 23 May 2007 10:35:13 -0400, Hal Rosenstock wrote: > > > On Wed, 2007-05-23 at 10:27, Devesh Sharma wrote: > > > > On 21 May 2007 13:52:11 -0400, Hal Rosenstock wrote: > > > > > On Mon, 2007-05-21 at 01:58, Devesh Sharma wrote: > > > > > > On 18 May 2007 06:21:05 -0400, Hal Rosenstock wrote: > > > > > > > On Thu, 2007-05-17 at 08:28, Devesh Sharma wrote: > > > > > > > > On 17 May 2007 06:42:16 -0400, Hal Rosenstock wrote: > > > > > > > > > On Thu, 2007-05-17 at 01:21, Devesh Sharma wrote: > > > > > > > > > > On 5/17/07, Sean Hefty wrote: > > > > > > > > > > > > But initially this will generate a packet for each path, while sys > > > > > > > > > > > > admin knows that path is there and he can hard-code the entries for > > > > > > > > > > > > it. Other thing is that why Admin will care about creating such record > > > > > > > > > > > > while SA is itself taking care, right? 
> > > > > > > > > > > > > > > > > > > > > > In your original message you asked about adding 'dummy entries' to the > > > > > > > > > > > cache. I agree that pre-loading the cache can be useful. What I still > > > > > > > > > > > am not understanding is the reasoning for adding 'dummy entries'. By > > > > > > > > > > > 'dummy entries', I've been assuming that these are invalid path records, > > > > > > > > > > > but maybe that's not what you meant. > > > > > > > > > > Ok if "dummy entries" word as such has created confusion then I am > > > > > > > > > > sorry for that, But with that I mean that, those are valid path > > > > > > > > > > records which Administrator knows in advance and while loading the > > > > > > > > > > module, > > > > > > > > > > > > > > > > > > How does the admin know they are valid ? > > > > > > > > Depending on the initial application runs, some trusted PRs can be generated. > > > > > > > > > > > > > > What do initial application runs have to do with this ? > > > > > > My understanding is that, once the cluster is UP, and if between Node > > > > > > A and Node B there is only one path, > > > > > > > > > > So this is a feature for such one path subnets. I wonder what percentage > > > > > of deployed subnets fits this case. > > > > You never know, It may be used for debugging also. > > > > > > I still don't have a good feel for how common/generally useful this will > > > really be. > > > > > > > > > then, SA query always going to return same values in PR. > > > > > > > > > > If subnet topology is changed, these PRs might change. There are other > > > > > cases where they change too. > > > > Not sure about it...some suggestion? > > > > > > > > > > > On this basis Initial application runs will generate PRs, > > > > > > > > > > That's what confused me before (Applications don't generate PRs but > > > > > rather request them.) but I think I see what you mean now. 
> > > > Ok > > > > > > > > > > > these PRs can be saved in some file, and can be loaded > > > > > > when cache_module comes in. > > > > > > > > > > > > > > > >Are they somehow preconfigured at the SM ? > > > > > > > > I am not sure about SM has any such provision? > > > > > > > > > > > > > > Not that I'm aware of. > > > > > > Ok, So, currently no such support is there in SM? > > > > > > > > > > I can speak definitively for OpenSM and there is no such support. As to > > > > > the vendor SMs, I don't think so but don't know for absolute certainty. > > > > > Someone can correct me if I'm wrong but I wouldn't assume no response > > > > > means correctness as some may not be listening nor want to respond as to > > > > > "value added" vendor specific features. > > > > What is the issue if OpenSM provides this? > > > > > > I'm not following you. What does/should OpenSM provide ? OpenIB works in > > > configurations with other SMs. > > I am talking about pre-configuring PRs in OpenSM DB. > > How does that help ? Why would PRs need to be preconfigured at the SM ? > Do you mean preconfigure the routing tables (and generate the PRs from > that) ? What problem is being solved on the SM side ? I just queried out of curiosity......nothing special.:) > > > > > > > > > Also not sure about the > > > > > > > > role of SM in path resolving. I mean once node has initiated SA query, > > > > > > > > whether SM has some database to reply SA or On the fly destination > > > > > > > > node is contacted to get asked path recored? > > > > > > > > > > > > > > SMs can either calculate the SA PRs on the fly based on the routing > > > > > > > algorithm in use and some other things or put them in a local database. > > > > > > > This is up to that SM. > > > > > > Ok > > > > > > > > > > > > > > Destination node is not contacted in the SA PR query process. > > > > > > > > > > > > > > > >Doesn't each SM have its own policy for generating valid PRs ? 
> > > > > > > > Ultimately path record is in Path_Record object format, and SA cache > > > > > > > > is going to store in a fixed manner, How generation policy matters? > > > > > > > > > > > > > > What if the local policy loaded does not agree with what the SM would > > > > > > > generate for a particular PR ? One then gets a local error which will > > > > > > > need to be tracked down. Not so easy IMO. > > > > > > SM policies in a subnet to generate PRs, changes dynamically? at run time? > > > > > > > > > > The policy doesn't change dynamically but the data to be returned in the > > > > > SA PR response might. > > > > > > > > > > > if Not then depending on the local SM policy static PR can be > > > > > > generated to load initially. > > > > > > > > > > Just as one question related to this, how would link failures be handled > > > > > ? There are others. > > > > Its just a matter of avoiding initial PR query packets by loading the > > > > cache with static PRs.....Later on cache module will function in > > > > normal fashion. I expect, initially every thing will come up in a > > > > trusted cluster. > > > > > > So you're saying the cache would still react to GIDs out and in service, > > > right ? > > I am not about what GIDs in out service.... > > Why not ? Actually it was a typing mistake....I am trying to say that I am not sure about what GID out and in service is. > > > but what I mean to say is, > > Once sa_cache is programmed with some static PRs....it will avoid > > initial cache_update step and after first time out normal > > update_cache() will be initiated using SA MADs. > > How would the client know what PRs to request when that timeout first > occurs ? There's no get all except these semantics. If it is all PRs, > what does that save ? I think my statement has again confused you.....sorry my falt.."and after first time out normal update_cache() will be initiated using SA MADs." 
I mean to say, after first time out....only the requested PR will be resolved....not all. > > > > If the cache is loaded from a file, does it bypass querying the SA > > > initially for PRs ? > > Yes It will, and hence reduce the initial SA traffic generated on a > > big cluster...just imagin, the cluster is quite big and every node is > > trying to build its cache initially. It will create large burst of SA > > packets. > > >If that is the case, then the file is required to be > > > the full set of PRs for this node otherwise there would be incomplete > > > connectivity. > > Yes, correct, Generating these PRs is the next issue which I want to > > discuss. may be this can be done by Admin on every node using the > > read() entry point provided by char_dev interface of sa_cache module. > > read entry point will simple extract PRs from cache itself. > > > > Incomplete connectivity will be till first PR is requested for that > > destination, Because if its a cache miss, any how application is going > > to initiate a ib_sa_get_path_rec() and resolved PR will be added in > > cache for future reference. > > OK then this becomes an on demand model for those destnations (at least > initially). By "on demand" do you mean.....normal cluster without cache? if yes than it will be on demand PR resolve model for those incomplete paths. > > -- Hal > > > > -- Hal > > > > > > > > > > > CMIIW. Also I am assuming a homogeneous cluster where certain > > > > > > > > parameters can be assumed to be same always. > > > > > > > > > > > > > > and always in agreement with what the SM would return ? For example, > > > > > > yes > > > > > > > what happens when a link goes down and the end node is no longer > > > > > > > reachable ? > > > > > > If node is not reachable then, after first timeout of sa_cache, that > > > > > > entry will be removed from cache. > > > > > > > > > > OK; that's another aspect to add into this feature. I don't think that > > > > > is currently done. 
I think there would need to be an API added to do > > > > > this. > > > > Yes, this has been discussed with Sean, we can add one char_dev > > > > interface to the existing sa_cache module implementation, Write entry > > > > point will generate a SA_PR_response packet and this packet will be > > > > passed to update_cache() function. > > > > > > > > Also we need to remove the initial schedule_update() call in the > > > > add_one() function. > > > > One user command is also required to read from user file and write > > > > onto this device. > > > > > > > > > > -- Hal > > > > > > > > > > > > > >are these from a live SM and just loaded "out of band" to > > > > > > > > bypass/preclude the SA PR >mechanism ? > > > > > > > > may be > > > > > > > > > > > > > > Even if they are, there is still the changes in the subnet issue. > > > > > > > > > > > > > > -- Hal > > > > > > > > > > > > > > > > -- Hal > > > > > > > > > > > > > > > > > > > Admin is loading this info in the cache with user command. > > > > > > > > > > > > > > > > > > > > > > > Another point I want to know is, > > > > > > > > > > > > When local_sa_cache module will be inserted? After SM comes up or > > > > > > > > > > > > Before SM comes up? > > > > > > > > > > > > > > > > > > > > > > It can occur either way. There is no restriction. The cache responds > > > > > > > > > > > to port up and GID in/out of service events to update itself. > > > > > > > > > > Do you mean cache module will start building cache only after Port is UP? > > > > > > > > > > > > > > > > > > > > > > > If Its inserted before SM is coming up (I am assuming SM is running on > > > > > > > > > > > > some node not on switch) then First Forced schedule_update() is > > > > > > > > > > > > waisted, and for the first application presence of cache is > > > > > > > > > > > > meaningless. Why not to keep cache effective right from the start? 
> > > > > > > > > > > > > > > > > > > > > > Pre-loading the cache with path records doesn't guarantee that those > > > > > > > > > > > paths are usable. If the SM has not come up, then the path records will > > > > > > > > > > > be unusable until the SM configures the subnet, plus there's no > > > > > > > > > > > guarantee that the remote endpoints specified by the paths are running. > > > > > > > > > > You mean there is no guarantee that even if SM is UP and we have some > > > > > > > > > > hard coded entries of path record corresponding to some node X, we are > > > > > > > > > > not sure that node X has actually come up or not? In that case > > > > > > > > > > actually that path resolving should fail if node has not come up, but > > > > > > > > > > with the hard coding still path will be resolved? > > > > > > > > > > > > > > > > > > > > > > The main benefit I see to pre-loading the cache is to avoid SA storms > > > > > > > > > > > when booting a large cluster. > > > > > > > > > > that's true. Also cache will get valid entries only if network is > > > > > > > > > > configured by SM otherwise every node SA will, possibly, drop SA > > > > > > > > > > packets. 
> > > > > > > > > > > > > > > > > > > > > > - Sean > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > general mailing list > > > > > > > > > > general at lists.openfabrics.org > > > > > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > > > > > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From caitlinb at broadcom.com Fri May 25 07:34:57 2007 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 25 May 2007 07:34:57 -0700 Subject: [ofa-general] RDMA write completion question References: <20070524141928.GI20691@minantech.com> <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com> <46564319.4040800@opengridcomputing.com> Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D0306ED58@NT-IRVA-0750.brcm.ad.broadcom.com> Steve Wise Wrote: -----Original Message----- Devesh Sharma wrote: > On 5/24/07, Gleb Natapov wrote: >> Hi, >> >> Does local RDMA write completion guaranties that a data that was >> RDMAed is >> already accessible in a destination's host _memory_?
> Local RDMA write completion guarantees that the data you have RDMAed > has been copied into the remote buffer, without any data corruption. >> For iWARP, the local write completion simply means you can reuse the local buffer and the transport will deliver it or kill the connection. The data _could_ be queued in the local rnic and anywhere else in the tcp cloud. _______________________________________________ And the only real difference with InfiniBand is that the uncertainty cloud is limited to the gap between the HCA and the application. Protocol designers can debate the tradeoffs InfiniBand takes to achieve that, but the important thing to Application Designers is that "smaller" is not "zero". Generally, once a Send has been posted, all other interactions with the remote peer over the same connection can assume that the prior actions have completed, but if your application needs an absolute guarantee that something has happened (for checkpointing or other purposes) then you really can only rely on a peer-to-peer message. From wombat2 at us.ibm.com Fri May 25 08:03:57 2007 From: wombat2 at us.ibm.com (Bernard King-Smith) Date: Fri, 25 May 2007 11:03:57 -0400 Subject: [ofa-general] IPOIB CM (NOSRQ) patch -memory footprint In-Reply-To: <20070525135325.E84E5E6083B@openfabrics.org> Message-ID: Roland Dreier wrote: > > > It would be nice, but to handle increasing the MTU, you need some way > > > to handle the receives you already posted (which would be too small > > all of a sudden). > > > Can you expand on this a little more -I do not catch the drift. > > OK, suppose I configure the interface with a 16K MTU. I assume your > plan would be to queue up a bunch of 16K receives. Now suppose I > change the MTU to 64K. What do you do about the receives you already > queued up that can't handle the new MTU?
> When you change the MTU you have to build a new set of receive buffers at the new MTU before advertising it, and associate the new buffers and associated structures with the current interfaces. This leaves the old buffer structures to be handled appropriately by separate threads. When all older buffers are released/returned, you tear everything down and terminate threads associated with the old MTU. If you find that a set of receive buffers is empty when starting to change the MTU, you can go immediately to the new size. This minimizes the memory needed during the change in MTU. As soon as you change the local interface to a different MTU, it will be a while before the remote connections find out and change the MTU they send. What happens to Ethernet when you turn on or off jumbo frames? > - R. > Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." William Shatne From pradeeps at linux.vnet.ibm.com Fri May 25 13:01:29 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Fri, 25 May 2007 13:01:29 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ) patch -memory footprint In-Reply-To: <20070524053819.GF6019@mellanox.co.il> References: <46537081.30906@linux.vnet.ibm.com> <20070524053819.GF6019@mellanox.co.il> Message-ID: <46574099.3090601@linux.vnet.ibm.com> Michael S. Tsirkin wrote: >> Here are my thoughts about limiting the memory footprint for IPOIB CM >> (NOSRQ) patch: >> >> By default, cap the NOSRQ memory usage to 1GB. > > ppc systems I have, start crashing if you map as much as 300MB for DMA.
If PPC systems start crashing when you map as much as 300MB, then yes that would be a gating factor when you deploy this patch for certain configurations. Then MPI applications (on UD) allocating more than 300 MB should be crashing the systems even without this patch -right? This is a separate problem and clouds this discussions. Please post the problem on the ppc/ppc64 mailing list. > >> The default recvq_size >> is set to 128. Therefore for 64KB packets this would imply a maximum of >> 128 endpoints. >> >> -Make the maximum number of endpoints a module parameter with a default >> value of 128. >> >> -The NOSRQ limit of 1GB is also made a module parameter. However, 1GB is >> the default limit and could be changed as needed (by the administrator) >> depending on the system configuration, application needs and so on. > > All this need for manual tuning is really going in the wrong direction: > we should be looking for ways to get rid of existing module > parameters, like using low watermark event to dynamically tune the RQ > depth. > >> The >> server would return a "REJ" message upon receiving a "REQ", whenever one >> of these limits (i.e. max number of endpoints or the max NOSRQ memory >> usage) is reached. Currently, we only check for the maximum number of >> endpoints -hard coded to 1024. > > So with limit sufficiently low, we hopefully will avoid crashing the server. > That's a progress, but what happens to the client when it gets this reject? > >> -The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that >> support SRQ like the Topspin HCA and, such HCAs should not be >> impacted at all. > > I don't think it's that clean yet. > > Here's an idea: implement "fake SRQ" for ehca in software: make post recv on srq > queue the WR, spread them evenly between QPs as they connect. Once # of QPs > goes above some limit, create QP command will fail. 
This would contain the mess > nicely inside ehca (I think you'll want to add a flag that lets software > figure out that SRQ is fake). > > We will still be left with the basic problem of what to do > at the active side upon the reject, though. As you indicate this will not solve the problem, so it is not an option. > >> -Currently we allocate a default of 64KB for the ring buffer elements, >> and this buffer size is not linked to the mtu. In the future, we could >> allocate buffers based on the mtu and link that into the computation of >> the memory cap. This way customers who might want to use a smaller mtu >> could use a larger number of endpoints, or a larger recvq_size without >> exceeding the memory cap. > > I think that conceptually, global MTU config is intended for outgoing packets, > not for the RX buffers. For example, how would we handle MTU changes? > >> Would this approach address the issues of scalability and enable IPOIB >> CM to be turned as the default? > > For IPoIB CM to be the default, it needs to work as well as datagram mode for > most usage scenarious. Unfortunately, your proposal above seems to fail to > satisfy this requirement: it will improve speed in some scenarious, > but will either increase the need for manual configuration drastically > or cause denial of service or use up huge amount of memory, > in others. My viewpoint is that this is akin to a Qos issue. If the request cannot be satisfied then return an error to the user level application and let the application decide, what to do. Don't do anything under the covers. I have tried to explain that this corner case you cite will be encountered by PPC systems using IBM HCA only. The rest will be unaffected. The PPC systems deployed as cluster nodes are unlikely to encounter this problem. However, we seem to have ideas that are at the opposite end of the spectrum and any amount of debate will not resolve this issue. 
One idea to move this discussion forward is to implement both options in the corner case and let the user/sys admin choose: a) return error to user level application and leave it to the application when there are no more RC QPs b) switch the active side to using datagram mode when there are no RC QPs. If we choose to go this route then that will mean we need yet another module parameter to let the user decide, or worse compile time options - neither of which is palatable. Any other suggestions? If we can agree upon this approach I will start another thread to discuss just this corner case and with this patch (or a minor variant) permit IPOIB CM to be used as the default. I do not want the corner case to be the gating factor for this patch. Pradeep From jwong at datallegro.com Fri May 25 13:12:02 2007 From: jwong at datallegro.com (Jeffrey Wong) Date: Fri, 25 May 2007 16:12:02 -0400 Subject: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with kernel 2-6.18-8.1.4.el5 In-Reply-To: <1E3DCD1C63492545881FACB6063A57C1D524A3@mtiexch01.mti.com> Message-ID: Hello, I am installing the OFED 1.2-rc3. Everything else builds except for ib-bonding. Thanks in advance. 
I am getting the following error messages: + make -C /lib/modules/2.6.18-8.1.4.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding make: Entering directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64' CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.o In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:78: /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h: In function 'bond_set_slave_inactive_flags': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:262: error: (Each undeclared identifier is reported only once /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:262: error: for each function it appears in.) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h: In function 'bond_set_slave_active_flags': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_compute_features': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1233: warning: comparison of distinct pointer types lacks a cast /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_enslave': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1449: error: 'IFF_BONDING' undeclared (first use in this fu nction) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_release': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1848: error: 'IFF_BONDING' undeclared (first use in this fu nction) 
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in t his function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_arp_rcv': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:2548: error: 'IFF_BONDING' undeclared (first use in this fu nction) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_netdev_event': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:3390: error: 'IFF_BONDING' undeclared (first use in this fu nction) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_init': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:4374: warning: assignment discards qualifiers from pointer target type /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:4386: error: 'IFF_BONDING' undeclared (first use in this fu nction) make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_ main.o] Error 1 make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bondi ng] Error 2 make: Leaving directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64' + echo ' Building IB bonding driver failed' Building IB bonding driver failed + exit 1 error: Bad exit status from /var/tmp/rpm-tmp.23876 (%build) Jeff Wong -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From krause at cup.hp.com Fri May 25 14:03:08 2007 From: krause at cup.hp.com (Michael Krause) Date: Fri, 25 May 2007 14:03:08 -0700 Subject: [ofa-general] RDMA write completion question In-Reply-To: <1EF1E44200D82B47BD5BA61171E8CE9D0306ED58@NT-IRVA-0750.brcm .ad.broadcom.com> References: <20070524141928.GI20691@minantech.com> <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com> <46564319.4040800@opengridcomputing.com> <1EF1E44200D82B47BD5BA61171E8CE9D0306ED58@NT-IRVA-0750.brcm.ad.broadcom.com> Message-ID: <6.2.0.14.2.20070525140103.0316e610@esmail.cup.hp.com> At 07:34 AM 5/25/2007, Caitlin Bestler wrote: >Steve Wise Wrote: > >-----Original Message----- > >Devesh Sharma wrote: > > On 5/24/07, Gleb Natapov wrote: > >> Hi, > >> > >> Does local RDMA write completion guaranties that a data that was > >> RDMAed is > >> already accessible in a destination's host _memory_? > > Local RDMA write completion guarantees that the data you have RDMAed > > has been copied into the remote buffer, without any data corruption. > >> >For iWARP, the local write completion simply means you can reuse the >local buffer and the the transport will deliver it or kill the >connection. The data _could_ be queued in the local rnic and anywhere >else in the tcp cloud. > > >_______________________________________________ > > >And The only real difference with InfiniBand is that the uncertainty >cloud is limited to the gap between the HCA and the application. >Protocol designers can debate the tradeoffs InfiniBand takes to >achieve that, but the import thing to Application Designers is >that "smaller" is not "zero". > >Generally, once a Send has been posted that all other interactions >with the remote peer over the same connection can assume that the prior >actions have completed, but if your application needs an absolute >guarantee that something has happened (for checkpointing or other >purposes) then you really can only rely on a peer-to-peer message. 
The peer-to-peer message may be either an RDMA Read, which is not visible to the application / ULP, or an application / ULP interaction. As a general guideline, application developers should not assume that end-to-end data integrity or delivery is guaranteed by the hardware, and should take appropriate steps to design their communication pattern to validate that the data was correctly exchanged. This is not difficult and is often built into many ULPs or applications already. Mike From krause at cup.hp.com Fri May 25 13:59:59 2007 From: krause at cup.hp.com (Michael Krause) Date: Fri, 25 May 2007 13:59:59 -0700 Subject: [ofa-general] RDMA write completion question In-Reply-To: <46564319.4040800@opengridcomputing.com> References: <20070524141928.GI20691@minantech.com> <309a667c0705240808je663d95jeb0fb84ec45c49a9@mail.gmail.com> <46564319.4040800@opengridcomputing.com> Message-ID: <6.2.0.14.2.20070525135739.03544150@esmail.cup.hp.com> At 06:59 PM 5/24/2007, Steve Wise wrote: >Devesh Sharma wrote: >>On 5/24/07, Gleb Natapov wrote: >>>Hi, >>> >>> Does local RDMA write completion guaranties that a data that was RDMAed is >>>already accessible in a destination's host _memory_? >>Local RDMA write completion guarantees that the data you have RDMAed >>has been copied into the remote buffer, without any data corruption. >For iWARP, the local write completion simply means you can reuse the local >buffer and the transport will deliver it or kill the connection. The >data _could_ be queued in the local rnic and anywhere else in the tcp cloud. This is where IB and iWARP do differ slightly.
iWARP may indeed not have transmitted anything to the remote RNIC while in IB, a completion should equate to the data being received by the remote CA. As a general point though, the source RNIC is unlikely to issue a RDMA Write completion if it has only injected the packet into the Ethernet fabric. It may issue prior to injection or in response to a TCP ACK indicating the RDMA Write was received by the remote RNIC but I doubt any do something in between.

Mike

From jimmy at hillraiser.com Fri May 25 14:22:14 2007
From: jimmy at hillraiser.com (Jimmy Hill)
Date: Fri, 25 May 2007 21:22:14 +0000
Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send...
Message-ID: <20070525212214.20500.qmail@station183.com>

I have verbs code that is modeled after the first usage model described on the ibv_get_cq_event() man page. That is, I have created all the verbs resources (e.g., completion channel, QP, CQ, etc.) and then followed the sequence of:

ibv_req_notify_cq(cq, 0);

ibv_post_send(qp, &work_req, &bad_work_req);

ibv_get_cq_event(channel, &ev_cq, &ev_ctx);

ibv_ack_cq_events(ev_cq, 1);

ibv_req_notify_cq(cq, 0);

ibv_poll_cq(cq, 1, &wc); // loop to drain - but due to upper protocol, will only ever be 1 at a time


The QP is created with the following attributes:
qp_init_attr.qp_context = &this->conn_ref;
qp_init_attr.send_cq = this->send_cq;
qp_init_attr.recv_cq = this->recv_cq;
qp_init_attr.srq = NULL;
qp_init_attr.cap.max_send_wr = 128;
qp_init_attr.cap.max_recv_wr = 4;
qp_init_attr.cap.max_send_sge = 16;
qp_init_attr.cap.max_recv_sge = 4;
qp_init_attr.cap.max_inline_data = 0;
qp_init_attr.qp_type = IBV_QPT_RC;
qp_init_attr.sq_sig_all = 0;
// I have also used sq_sig_all set to 1 and then removed the SIGNALED flag in the send request

The Send request (RDMA Write) is formatted as:
sge.lkey = response_mr->lkey;
sge.addr = response;
sge.length = 256;

send_work_req.opcode = IBV_WR_RDMA_WRITE;
send_work_req.next = NULL;
send_work_req.sg_list = &sge;
send_work_req.num_sge = 1;
send_work_req.wr_id = 0;
send_work_req.imm_data = 0;
send_work_req.wr.rdma.remote_addr = client_rmr->addr;
send_work_req.wr.rdma.rkey = client_rmr->rkey;
send_work_req.send_flags = IBV_SEND_SIGNALED;
// I have used IBV_SEND_SIGNALED and IBV_SEND_SIGNALED | IBV_SEND_FENCE

This QP will be used to RDMA Write a response back to a client. With the current setup, only one RDMA write will be outstanding per QP at any given time. That is, I issue the RDMA Write and wait for its completion prior to continuing processing. The eventual goal is to request and process a completion event every "n" RDMA Writes.

The current problem is that everything runs along fine and then I end up in a situation where I block forever on the ibv_get_cq_event() call. The ibv_post_send() just prior to the ibv_get_cq_event() call returned "0" indicating that it successfully processed the command. However, the completion event for that operation never arrives. The data associated with that RDMA write does not appear on the client side, so it seems that even though the ibv_post_send() reported success, it really did not successfully process the request.

In order to debug the problem, I changed the completion channel to non-blocking and put the ibv_get_cq_event() call in a loop and dumped out the number of passes through the loop (i.e., number of calls to ibv_get_cq_event()) prior to the arrival of an event (good status from the call). When all is working fine, it only takes one or two calls for the event to arrive. When I encounter the situation where it blocked forever, it loops forever calling ibv_get_cq_event(). I added a counter there and after a large (e.g., 500) number of retries, I looped back up and tried the ibv_post_send() again. For the most part, the request makes it out the second time. But, given enough time, the send queue work requests entries are consumed.
That is, if I lower the max_send_wr attribute to 10, after 10 failed event collection attempts and ibv_post_send() retries, the 11th ibv_post_send() will fail with -1 status code. So, the work request entries are not leaving the send queue.

Any ideas on why the ibv_get_cq_event() would never see an event after a "successful" send requesting a completion event?

thanks,
jimmy

From rdreier at cisco.com Fri May 25 15:06:06 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 25 May 2007 15:06:06 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID:

Linus, please pull from

master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will get a few more 2.6.22-rc2 fixes:

Eli Cohen (1):
IB/mlx4: Initialize send queue entry ownership bits

Michael S.
Tsirkin (2): IPoIB/cm: Fix timeout check in ipoib_cm_dev_stop() IPoIB/cm: Drain cq in ipoib_cm_dev_stop() Roland Dreier (1): IB/mlx4: Don't allocate RQ doorbell if using SRQ Stefan Roscher (1): IB/ehca: Fix number of send WRs reported for new QP drivers/infiniband/hw/ehca/hcp_if.c | 2 +- drivers/infiniband/hw/mlx4/qp.c | 59 +++++++++++++++++++----------- drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_cm.c | 3 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 31 ++++++++++------ 5 files changed, 60 insertions(+), 36 deletions(-) diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 7f0beec..5766ae3 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -331,7 +331,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, 0); qp->ipz_qp_handle.handle = outs[0]; qp->real_qp_num = (u32)outs[1]; - parms->act_nr_send_sges = + parms->act_nr_send_wqes = (u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_SEND_WR, outs[2]); parms->act_nr_recv_wqes = (u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_RECV_WR, outs[2]); diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index a824bc5..dc137de 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -270,9 +270,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, struct ib_qp_init_attr *init_attr, struct ib_udata *udata, int sqpn, struct mlx4_ib_qp *qp) { - struct mlx4_wqe_ctrl_seg *ctrl; int err; - int i; mutex_init(&qp->mutex); spin_lock_init(&qp->sq.lock); @@ -319,20 +317,24 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, if (err) goto err_mtt; - err = mlx4_ib_db_map_user(to_mucontext(pd->uobject->context), - ucmd.db_addr, &qp->db); - if (err) - goto err_mtt; + if (!init_attr->srq) { + err = mlx4_ib_db_map_user(to_mucontext(pd->uobject->context), + ucmd.db_addr, &qp->db); + if (err) + goto err_mtt; + } } else { 
err = set_kernel_sq_size(dev, &init_attr->cap, init_attr->qp_type, qp); if (err) goto err; - err = mlx4_ib_db_alloc(dev, &qp->db, 0); - if (err) - goto err; + if (!init_attr->srq) { + err = mlx4_ib_db_alloc(dev, &qp->db, 0); + if (err) + goto err; - *qp->db.db = 0; + *qp->db.db = 0; + } if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf)) { err = -ENOMEM; @@ -348,11 +350,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, if (err) goto err_mtt; - for (i = 0; i < qp->sq.max; ++i) { - ctrl = get_send_wqe(qp, i); - ctrl->owner_opcode = cpu_to_be32(1 << 31); - } - qp->sq.wrid = kmalloc(qp->sq.max * sizeof (u64), GFP_KERNEL); qp->rq.wrid = kmalloc(qp->rq.max * sizeof (u64), GFP_KERNEL); @@ -386,7 +383,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, return 0; err_wrid: - if (pd->uobject) + if (pd->uobject && !init_attr->srq) mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), &qp->db); else { kfree(qp->sq.wrid); @@ -403,7 +400,7 @@ err_buf: mlx4_buf_free(dev->dev, qp->buf_size, &qp->buf); err_db: - if (!pd->uobject) + if (!pd->uobject && !init_attr->srq) mlx4_ib_db_free(dev, &qp->db); err: @@ -481,14 +478,16 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp, mlx4_mtt_cleanup(dev->dev, &qp->mtt); if (is_user) { - mlx4_ib_db_unmap_user(to_mucontext(qp->ibqp.uobject->context), - &qp->db); + if (!qp->ibqp.srq) + mlx4_ib_db_unmap_user(to_mucontext(qp->ibqp.uobject->context), + &qp->db); ib_umem_release(qp->umem); } else { kfree(qp->sq.wrid); kfree(qp->rq.wrid); mlx4_buf_free(dev->dev, qp->buf_size, &qp->buf); - mlx4_ib_db_free(dev, &qp->db); + if (!qp->ibqp.srq) + mlx4_ib_db_free(dev, &qp->db); } } @@ -852,7 +851,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, if (ibqp->srq) context->srqn = cpu_to_be32(1 << 24 | to_msrq(ibqp->srq)->msrq.srqn); - if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) + if (!ibqp->srq && cur_state == IB_QPS_RESET && new_state == 
IB_QPS_INIT) context->db_rec_addr = cpu_to_be64(qp->db.dma); if (cur_state == IB_QPS_INIT && @@ -872,6 +871,21 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, else sqd_event = 0; + /* + * Before passing a kernel QP to the HW, make sure that the + * ownership bits of the send queue are set so that the + * hardware doesn't start processing stale work requests. + */ + if (!ibqp->uobject && cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) { + struct mlx4_wqe_ctrl_seg *ctrl; + int i; + + for (i = 0; i < qp->sq.max; ++i) { + ctrl = get_send_wqe(qp, i); + ctrl->owner_opcode = cpu_to_be32(1 << 31); + } + } + err = mlx4_qp_modify(dev->dev, &qp->mtt, to_mlx4_state(cur_state), to_mlx4_state(new_state), context, optpar, sqd_event, &qp->mqp); @@ -919,7 +933,8 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; - *qp->db.db = 0; + if (!ibqp->srq) + *qp->db.db = 0; } out: diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index a0b3782..158759e 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -429,6 +429,7 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); void ipoib_pkey_poll(struct work_struct *work); int ipoib_pkey_dev_delay_open(struct net_device *dev); +void ipoib_drain_cq(struct net_device *dev); #ifdef CONFIG_INFINIBAND_IPOIB_CM diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index ffec794..f133b56 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -713,7 +713,7 @@ void ipoib_cm_dev_stop(struct net_device *dev) while (!list_empty(&priv->cm.rx_error_list) || !list_empty(&priv->cm.rx_flush_list) || !list_empty(&priv->cm.rx_drain_list)) { - if (!time_after(jiffies, begin + 5 * HZ)) { + if (time_after(jiffies, begin + 5 * HZ)) { ipoib_warn(priv, "RX drain timing out\n"); /* @@ -726,6 +726,7 @@ void 
ipoib_cm_dev_stop(struct net_device *dev) } spin_unlock_irq(&priv->lock); msleep(1); + ipoib_drain_cq(dev); spin_lock_irq(&priv->lock); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index c1aad06..8404f05 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -550,13 +550,30 @@ static int recvs_pending(struct net_device *dev) return pending; } +void ipoib_drain_cq(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i, n; + do { + n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < n; ++i) { + if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ) + ipoib_cm_handle_rx_wc(dev, priv->ibwc + i); + else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) + ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); + else + ipoib_ib_handle_tx_wc(dev, priv->ibwc + i); + } + } while (n == IPOIB_NUM_WC); +} + int ipoib_ib_dev_stop(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; unsigned long begin; struct ipoib_tx_buf *tx_req; - int i, n; + int i; clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags); netif_poll_disable(dev); @@ -611,17 +628,7 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush) goto timeout; } - do { - n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc); - for (i = 0; i < n; ++i) { - if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ) - ipoib_cm_handle_rx_wc(dev, priv->ibwc + i); - else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) - ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); - else - ipoib_ib_handle_tx_wc(dev, priv->ibwc + i); - } - } while (n == IPOIB_NUM_WC); + ipoib_drain_cq(dev); msleep(1); } From sean.hefty at intel.com Fri May 25 16:26:11 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 25 May 2007 16:26:11 -0700 Subject: [ofa-general] [PATCH] 2.6.23 ib/cm: include HCA ACK delay in local ACK timeout Message-ID: <003801c79f24$14857c80$ff0da8c0@amr.corp.intel.com> The ib_cm should 
include the HCA ACK delay when calculating the local ACK timeout value. If the HCA ACK delay is large enough relative to the packet life time, then the calculated timeout value is too small, which can result in connections timing out or excessive retries. Signed-off-by: Sean Hefty --- If there are no issues, I will queue this up along with my other patches for 2.6.23. This patch applies on top of the fix for detecting stale connections, and the changes to the CM locking. The local CA ACK delay is moved internally to the CM, removing it from the external API. If someone could perform a sanity check on the ACK delay, I'd appreciate it. drivers/infiniband/core/cm.c | 71 +++++++++++++++++++++++++------ drivers/infiniband/core/cma.c | 1 drivers/infiniband/core/ucm.c | 1 drivers/infiniband/ulp/ipoib/ipoib_cm.c | 1 include/rdma/ib_cm.h | 1 5 files changed, 57 insertions(+), 18 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 16181d6..c7007c4 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -87,6 +87,7 @@ struct cm_port { struct cm_device { struct list_head list; struct ib_device *device; + u8 ack_delay; struct cm_port port[0]; }; @@ -95,7 +96,7 @@ struct cm_av { union ib_gid dgid; struct ib_ah_attr ah_attr; u16 pkey_index; - u8 packet_life_time; + u8 timeout; }; struct cm_work { @@ -154,6 +155,7 @@ struct cm_id_private { u8 retry_count; u8 rnr_retry_count; u8 service_timeout; + u8 target_ack_delay; struct list_head work_list; atomic_t work_count; @@ -293,7 +295,7 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av) av->port = port; ib_init_ah_from_path(cm_dev->device, port->port_num, path, &av->ah_attr); - av->packet_life_time = path->packet_life_time; + av->timeout = path->packet_life_time + 1; return 0; } @@ -643,6 +645,25 @@ static inline int cm_convert_to_ms(int iba_time) return 1 << max(iba_time - 8, 0); } +/* + * calculate: 4.096x2^ack_timeout = 4.096x2^ack_delay + 
2x4.096x2^life_time + * Because of how ack_timeout is stored, adding one doubles the timeout. + * To avoid large timeouts, select the max(ack_delay, life_time + 1), and + * increment it (round up) only if the other is within 50%. + */ +static u8 cm_ack_timeout(u8 ca_ack_delay, u8 packet_life_time) +{ + int ack_timeout = packet_life_time + 1; + + if (ack_timeout >= ca_ack_delay) + ack_timeout += (ca_ack_delay >= (ack_timeout - 1)); + else + ack_timeout = ca_ack_delay + + (ack_timeout >= (ca_ack_delay - 1)); + + return min(31, ack_timeout); +} + static void cm_cleanup_timewait(struct cm_timewait_info *timewait_info) { if (timewait_info->inserted_remote_id) { @@ -686,7 +707,7 @@ static void cm_enter_timewait(struct cm_id_private *cm_id_priv) * timewait before notifying the user that we've exited timewait. */ cm_id_priv->id.state = IB_CM_TIMEWAIT; - wait_time = cm_convert_to_ms(cm_id_priv->av.packet_life_time + 1); + wait_time = cm_convert_to_ms(cm_id_priv->av.timeout); queue_delayed_work(cm.wq, &cm_id_priv->timewait_info->work.work, msecs_to_jiffies(wait_time)); cm_id_priv->timewait_info = NULL; @@ -908,7 +929,8 @@ static void cm_format_req(struct cm_req_msg *req_msg, cm_req_set_primary_sl(req_msg, param->primary_path->sl); cm_req_set_primary_subnet_local(req_msg, 1); /* local only... */ cm_req_set_primary_local_ack_timeout(req_msg, - min(31, param->primary_path->packet_life_time + 1)); + cm_ack_timeout(cm_id_priv->av.port->cm_dev->ack_delay, + param->primary_path->packet_life_time)); if (param->alternate_path) { req_msg->alt_local_lid = param->alternate_path->slid; @@ -923,7 +945,8 @@ static void cm_format_req(struct cm_req_msg *req_msg, cm_req_set_alt_sl(req_msg, param->alternate_path->sl); cm_req_set_alt_subnet_local(req_msg, 1); /* local only... 
*/ cm_req_set_alt_local_ack_timeout(req_msg, - min(31, param->alternate_path->packet_life_time + 1)); + cm_ack_timeout(cm_id_priv->av.port->cm_dev->ack_delay, + param->alternate_path->packet_life_time)); } if (param->private_data && param->private_data_len) @@ -1433,7 +1456,8 @@ static void cm_format_rep(struct cm_rep_msg *rep_msg, cm_rep_set_starting_psn(rep_msg, cpu_to_be32(param->starting_psn)); rep_msg->resp_resources = param->responder_resources; rep_msg->initiator_depth = param->initiator_depth; - cm_rep_set_target_ack_delay(rep_msg, param->target_ack_delay); + cm_rep_set_target_ack_delay(rep_msg, + cm_id_priv->av.port->cm_dev->ack_delay); cm_rep_set_failover(rep_msg, param->failover_accepted); cm_rep_set_flow_ctrl(rep_msg, param->flow_control); cm_rep_set_rnr_retry_count(rep_msg, param->rnr_retry_count); @@ -1680,6 +1704,13 @@ static int cm_rep_handler(struct cm_work *work) cm_id_priv->responder_resources = rep_msg->initiator_depth; cm_id_priv->sq_psn = cm_rep_get_starting_psn(rep_msg); cm_id_priv->rnr_retry_count = cm_rep_get_rnr_retry_count(rep_msg); + cm_id_priv->target_ack_delay = cm_rep_get_target_ack_delay(rep_msg); + cm_id_priv->av.timeout = + cm_ack_timeout(cm_id_priv->target_ack_delay, + cm_id_priv->av.timeout - 1); + cm_id_priv->alt_av.timeout = + cm_ack_timeout(cm_id_priv->target_ack_delay, + cm_id_priv->alt_av.timeout - 1); /* todo: handle peer_to_peer */ @@ -2291,7 +2322,7 @@ static int cm_mra_handler(struct cm_work *work) work->cm_event.param.mra_rcvd.service_timeout = cm_mra_get_service_timeout(mra_msg); timeout = cm_convert_to_ms(cm_mra_get_service_timeout(mra_msg)) + - cm_convert_to_ms(cm_id_priv->av.packet_life_time); + cm_convert_to_ms(cm_id_priv->av.timeout); spin_lock_irq(&cm_id_priv->lock); switch (cm_id_priv->id.state) { @@ -2363,7 +2394,8 @@ static void cm_format_lap(struct cm_lap_msg *lap_msg, cm_lap_set_sl(lap_msg, alternate_path->sl); cm_lap_set_subnet_local(lap_msg, 1); /* local only... 
*/ cm_lap_set_local_ack_timeout(lap_msg, - min(31, alternate_path->packet_life_time + 1)); + cm_ack_timeout(cm_id_priv->av.port->cm_dev->ack_delay, + alternate_path->packet_life_time)); if (private_data && private_data_len) memcpy(lap_msg->private_data, private_data, private_data_len); @@ -2394,6 +2426,9 @@ int ib_send_cm_lap(struct ib_cm_id *cm_id, ret = cm_init_av_by_path(alternate_path, &cm_id_priv->alt_av); if (ret) goto out; + cm_id_priv->alt_av.timeout = + cm_ack_timeout(cm_id_priv->target_ack_delay, + cm_id_priv->alt_av.timeout - 1); ret = cm_alloc_msg(cm_id_priv, &msg); if (ret) @@ -3248,8 +3283,7 @@ static int cm_init_qp_rtr_attr(struct cm_id_private *cm_id_priv, *qp_attr_mask |= IB_QP_ALT_PATH; qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; qp_attr->alt_pkey_index = cm_id_priv->alt_av.pkey_index; - qp_attr->alt_timeout = - cm_id_priv->alt_av.packet_life_time + 1; + qp_attr->alt_timeout = cm_id_priv->alt_av.timeout; qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr; } ret = 0; @@ -3287,8 +3321,7 @@ static int cm_init_qp_rts_attr(struct cm_id_private *cm_id_priv, *qp_attr_mask |= IB_QP_TIMEOUT | IB_QP_RETRY_CNT | IB_QP_RNR_RETRY | IB_QP_MAX_QP_RD_ATOMIC; - qp_attr->timeout = - cm_id_priv->av.packet_life_time + 1; + qp_attr->timeout = cm_id_priv->av.timeout; qp_attr->retry_cnt = cm_id_priv->retry_count; qp_attr->rnr_retry = cm_id_priv->rnr_retry_count; qp_attr->max_rd_atomic = @@ -3302,8 +3335,7 @@ static int cm_init_qp_rts_attr(struct cm_id_private *cm_id_priv, *qp_attr_mask = IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE; qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; qp_attr->alt_pkey_index = cm_id_priv->alt_av.pkey_index; - qp_attr->alt_timeout = - cm_id_priv->alt_av.packet_life_time + 1; + qp_attr->alt_timeout = cm_id_priv->alt_av.timeout; qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr; qp_attr->path_mig_state = IB_MIG_REARM; } @@ -3343,6 +3375,16 @@ int ib_cm_init_qp_attr(struct ib_cm_id *cm_id, } EXPORT_SYMBOL(ib_cm_init_qp_attr); 
+void cm_get_ack_delay(struct cm_device *cm_dev) +{ + struct ib_device_attr attr; + + if (ib_query_device(cm_dev->device, &attr)) + cm_dev->ack_delay = 0; /* acks will rely on packet life time */ + else + cm_dev->ack_delay = attr.local_ca_ack_delay; +} + static void cm_add_one(struct ib_device *device) { struct cm_device *cm_dev; @@ -3367,6 +3409,7 @@ static void cm_add_one(struct ib_device *device) return; cm_dev->device = device; + cm_get_ack_delay(cm_dev); set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); for (i = 1; i <= device->phys_port_cnt; i++) { diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 2eb52b7..eb15119 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -2326,7 +2326,6 @@ static int cma_accept_ib(struct rdma_id_private *id_priv, rep.private_data_len = conn_param->private_data_len; rep.responder_resources = conn_param->responder_resources; rep.initiator_depth = conn_param->initiator_depth; - rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT; rep.failover_accepted = 0; rep.flow_control = conn_param->flow_control; rep.rnr_retry_count = conn_param->rnr_retry_count; diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index 2586a3e..424983f 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -823,7 +823,6 @@ static ssize_t ib_ucm_send_rep(struct ib_ucm_file *file, param.private_data_len = cmd.len; param.responder_resources = cmd.responder_resources; param.initiator_depth = cmd.initiator_depth; - param.target_ack_delay = cmd.target_ack_delay; param.failover_accepted = cmd.failover_accepted; param.flow_control = cmd.flow_control; param.rnr_retry_count = cmd.rnr_retry_count; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index ffec794..4a8117f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -260,7 +260,6 @@ static int ipoib_cm_send_rep(struct 
net_device *dev, struct ib_cm_id *cm_id, rep.private_data_len = sizeof data; rep.flow_control = 0; rep.rnr_retry_count = req->rnr_retry_count; - rep.target_ack_delay = 20; /* FIXME */ rep.srq = 1; rep.qp_num = qp->qp_num; rep.starting_psn = psn; diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index 5c07017..12243e8 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -385,7 +385,6 @@ struct ib_cm_rep_param { u8 private_data_len; u8 responder_resources; u8 initiator_depth; - u8 target_ack_delay; u8 failover_accepted; u8 flow_control; u8 rnr_retry_count; -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 7654 bytes Desc: not available URL: From ralph.campbell at qlogic.com Fri May 25 17:33:43 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 25 May 2007 17:33:43 -0700 Subject: [ofa-general] [PATCH] ofed_1_2/sdp - SDP can lose receive data Message-ID: <1180139623.3407.373.camel@brick.pathscale.com> Can this fix be considered for OFED 1.2? Thanks. If a receive work completion is processed but there is no room in a previously queued skb, the data is dropped. This patch fixes the problem by queuing the skb. 
Signed-off-by: Ralph Campbell diff -r 074340d1892d drivers/infiniband/ulp/sdp/sdp_bcopy.c --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c Fri May 25 17:04:51 2007 -0700 +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c Fri May 25 17:07:02 2007 -0700 @@ -314,7 +314,8 @@ static inline struct sk_buff *sdp_sock_q skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len); __kfree_skb(skb); skb = tail; - } + } else + skb_queue_tail(&sk->sk_receive_queue, skb); } else skb_queue_tail(&sk->sk_receive_queue, skb); From vlad at lists.openfabrics.org Sat May 26 02:41:02 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 26 May 2007 02:41:02 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070526-0200 daily build status Message-ID: <20070526094102.B0D85E60845@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with 
linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.18 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Failed:

From mst at dev.mellanox.co.il Sat May 26 12:40:49 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 26 May 2007 22:40:49 +0300
Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git
In-Reply-To:
References:
Message-ID: <20070526194049.GD15942@mellanox.co.il>

> Michael S. Tsirkin (2):
> IPoIB/cm: Fix timeout check in ipoib_cm_dev_stop()
> IPoIB/cm: Drain cq in ipoib_cm_dev_stop()

don't we want the patch that sets status to flushed with error?
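
For readers following the quoted shortlog: the "Fix timeout check" entry comes down to one dropped negation in the patch earlier in this thread, where the wait loop in ipoib_cm_dev_stop() should warn once jiffies has passed begin + 5 * HZ rather than while it has not. Below is a minimal user-space sketch of that test. The names tick_after and drain_timed_out are hypothetical stand-ins for the kernel's time_after() and the loop condition, and the plain unsigned long tick counter is an assumption; this is not the driver code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the kernel's time_after(a, b): true when
 * tick a is later than tick b, even if the counter has wrapped around.
 */
static bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Hypothetical model of the loop's exit test: report a timeout only
 * once more than `limit` ticks have elapsed since `begin`.
 */
static bool drain_timed_out(unsigned long now, unsigned long begin,
                            unsigned long limit)
{
    /* The signed-difference trick only holds for intervals shorter
     * than half the counter range. */
    assert(limit <= (unsigned long)LONG_MAX);
    return tick_after(now, begin + limit);
}
```

With the stray negation left in, the same test would fire on the first pass through the loop, long before the limit had elapsed, which is consistent with the one-character change in the patch.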
--
MST

From dotanb at dev.mellanox.co.il Sat May 26 23:26:26 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 27 May 2007 09:26:26 +0300
Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send...
In-Reply-To: <20070525212214.20500.qmail@station183.com>
References: <20070525212214.20500.qmail@station183.com>
Message-ID: <46592492.6000404@dev.mellanox.co.il>

Jimmy Hill wrote:
> I have verbs code that is modeled after the first usage model described on the ibv_get_cq_event() man page. That is, I have created all the verbs resources (e.g., completion channel, QP, CQ, etc.) and then followed the sequence of:
>
> ibv_req_notify_cq(cq, 0);
>
> ibv_post_send(qp, &work_req, &bad_work_req);
>
> ibv_get_cq_event(channel, &ev_cq, &ev_ctx);
>
> ibv_ack_cq_events(ev_cq, 1);
>
> ibv_req_notify_cq(cq, 0);
>
> ibv_poll_cq(cq, 1, &wc); // loop to drain - but due to upper protocol, will only ever be 1 at a time
>
>
> The QP is created with the following attributes:
> qp_init_attr.qp_context = &this->conn_ref;
> qp_init_attr.send_cq = this->send_cq;
> qp_init_attr.recv_cq = this->recv_cq;
> qp_init_attr.srq = NULL;
> qp_init_attr.cap.max_send_wr = 128;
> qp_init_attr.cap.max_recv_wr = 4;
> qp_init_attr.cap.max_send_sge = 16;
> qp_init_attr.cap.max_recv_sge = 4;
> qp_init_attr.cap.max_inline_data = 0;
> qp_init_attr.qp_type = IBV_QPT_RC;
> qp_init_attr.sq_sig_all = 0;
> // I have also used sq_sig_all set to 1 and then removed the SIGNALED flag in the send request
>
> The Send request (RDMA Write) is formatted as:
> sge.lkey = response_mr->lkey;
> sge.addr = response;
> sge.length = 256;
>
> send_work_req.opcode = IBV_WR_RDMA_WRITE;
> send_work_req.next = NULL;
> send_work_req.sg_list = &sge;
> send_work_req.num_sge = 1;
> send_work_req.wr_id = 0;
> send_work_req.imm_data = 0;
> send_work_req.wr.rdma.remote_addr = client_rmr->addr;
> send_work_req.wr.rdma.rkey = client_rmr->rkey;
> send_work_req.send_flags = IBV_SEND_SIGNALED;
> // I have used IBV_SEND_SIGNALED and IBV_SEND_SIGNALED | IBV_SEND_FENCE
>
> This QP will be used to RDMA Write a response back to a client. With the current setup, only one RDMA write will be outstanding per QP at any given time. That is, I issue the RDMA Write and wait for its completion prior to continuing processing. The eventual goal is to request and process a completion event every "n" RDMA Writes.
>
> The current problem is that everything runs along fine and then I end up in a situation where I block forever on the ibv_get_cq_event() call. The ibv_post_send() just prior to the ibv_get_cq_event() call returned "0" indicating that it successfully processed the command. However, the completion event for that operation never arrives. The data associated with that RDMA write does not appear on the client side, so it seems that even though the ibv_post_send() reported success, it really did not successfully process the request.
>
> In order to debug the problem, I changed the completion channel to non-blocking and put the ibv_get_cq_event() call in a loop and dumped out the number of passes through the loop (i.e., number of calls to ibv_get_cq_event()) prior to the arrival of an event (good status from the call). When all is working fine, it only takes one or two calls for the event to arrive. When I encounter the situation where it blocked forever, it loops forever calling ibv_get_cq_event(). I added a counter there and after a large (e.g., 500) number of retries, I looped back up and tried the ibv_post_send() again. For the most part, the request makes it out the second time. But, given enough time, the send queue work requests entries are consumed. That is, if I lower the max_send_wr attribute to 10, after 10 failed event collection attempts and ibv_post_send() retries, the 11th ibv_post_send() will fail with -1 status code. So, the work request entries are not leaving the send queue.
> > Any ideas on why the ibv_get_cq_event() would never see an event after a "successful" send requesting a completion Try to do the following scenario: ibv_req_notify_cq(cq, 0); ibv_post_send(qp, &work_req, &bad_work_req); ibv_get_cq_event(channel, &ev_cq, &ev_ctx); ibv_ack_cq_events(ev_cq, 1); ibv_req_notify_cq(cq, 0); in a loop until the CQ is empty: ibv_poll_cq(cq, 1, &wc); // loop to drain - but due to upper protocol, will only ever be 1 at a time Dotan From tziporet at dev.mellanox.co.il Sat May 26 23:58:19 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 27 May 2007 09:58:19 +0300 Subject: [ofa-general] Re: [ewg] [PATCH] ofed_1_2/sdp - SDP can lose receive data In-Reply-To: <1180139623.3407.373.camel@brick.pathscale.com> References: <1180139623.3407.373.camel@brick.pathscale.com> Message-ID: <46592C0B.3070709@mellanox.co.il> Michael Please review Tziporet Ralph Campbell wrote: > Can this fix be considered for OFED 1.2? > Thanks. > > > If a receive work completion is processed but there is no room > in a previously queued skb, the data is dropped. > This patch fixes the problem by queuing the skb. 
> > Signed-off-by: Ralph Campbell > > diff -r 074340d1892d drivers/infiniband/ulp/sdp/sdp_bcopy.c > --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c Fri May 25 17:04:51 2007 -0700 > +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c Fri May 25 17:07:02 2007 -0700 > @@ -314,7 +314,8 @@ static inline struct sk_buff *sdp_sock_q > skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len); > __kfree_skb(skb); > skb = tail; > - } > + } else > + skb_queue_tail(&sk->sk_receive_queue, skb); > } else > skb_queue_tail(&sk->sk_receive_queue, skb); > > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > From tziporet at dev.mellanox.co.il Sun May 27 00:08:56 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 27 May 2007 10:08:56 +0300 Subject: [ofa-general] Re: [ewg] Something is wrong in gitweb In-Reply-To: <795c49870705240933w276e41fbu941e822047ab5e25@mail.gmail.com> References: <46558296.2090308@voltaire.com> <4655B0EB.5030407@mellanox.co.il> <795c49870705240933w276e41fbu941e822047ab5e25@mail.gmail.com> Message-ID: <46592E88.9010709@mellanox.co.il> Jeff Becker wrote: > Hi Tziporet. I just tried getting to the git tree from my web browser > and this seems to work, including the link you tried below. Does it > work for you now? Thanks. > Working for me now. Thanks, Tziporet From mst at dev.mellanox.co.il Sun May 27 01:39:32 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 27 May 2007 11:39:32 +0300 Subject: [ofa-general] Re: IPOIB CM (NOSRQ) patch -memory footprint In-Reply-To: <46574099.3090601@linux.vnet.ibm.com> References: <46537081.30906@linux.vnet.ibm.com> <20070524053819.GF6019@mellanox.co.il> <46574099.3090601@linux.vnet.ibm.com> Message-ID: <20070527083932.GC8342@mellanox.co.il> > >>-The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that > >>support SRQ like the Topspin HCA and, such HCAs should not be > >>impacted at all. > > > >I don't think it's that clean yet. > > > >Here's an idea: implement "fake SRQ" for ehca in software: make post recv on > >srq queue the WR, spread them evenly between QPs as they connect. Once # of > >QPs goes above some limit, create QP command will fail. This would contain > >the mess nicely inside ehca (I think you'll want to add a flag that lets > >software figure out that SRQ is fake). > > > >We will still be left with the basic problem of what to do at the active side > >upon the reject, though. > > As you indicate this will not solve the problem, so it is not an option. Above, I have outlined how it can be done, so it certainly *is* an option. In this thread, you basically keep saying that ehca will ever be the only HCA without SRQ support, so you can make a lot of assumptions about how IPoIB is used. Fine, but if you follow this logic, it makes sense to hide the mess under the ehca provider interface. -- MST From amip at dev.mellanox.co.il Sun May 27 02:07:00 2007 From: amip at dev.mellanox.co.il (Ami Perlmutter) Date: Sun, 27 May 2007 12:07:00 +0300 Subject: [ofa-general] Re: [ewg] [PATCH] ofed_1_2/sdp - SDP can lose receive data In-Reply-To: <1180139623.3407.373.camel@brick.pathscale.com> References: <1180139623.3407.373.camel@brick.pathscale.com> Message-ID: <1180256850.15464.1.camel@localhost> Ralph, this is how the code is now. Where are you getting this code from?
On Fri, 2007-05-25 at 17:33 -0700, Ralph Campbell wrote: > Can this fix be considered for OFED 1.2? > Thanks. > > > If a receive work completion is processed but there is no room > in a previously queued skb, the data is dropped. > This patch fixes the problem by queuing the skb. > > Signed-off-by: Ralph Campbell > > diff -r 074340d1892d drivers/infiniband/ulp/sdp/sdp_bcopy.c > --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c Fri May 25 17:04:51 2007 -0700 > +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c Fri May 25 17:07:02 2007 -0700 > @@ -314,7 +314,8 @@ static inline struct sk_buff *sdp_sock_q > skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len); > __kfree_skb(skb); > skb = tail; > - } > + } else > + skb_queue_tail(&sk->sk_receive_queue, skb); > } else > skb_queue_tail(&sk->sk_receive_queue, skb); > > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From vlad at lists.openfabrics.org Sun May 27 02:41:04 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 27 May 2007 02:41:04 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070527-0200 daily build status Message-ID: <20070527094104.BC09CE60852@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on x86_64 with 
linux-2.6.21.1 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From ogerlitz at voltaire.com Sun May 27 03:11:28 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 27 May 2007 13:11:28 +0300 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak In-Reply-To: References: <20070521120459.GI20400@mellanox.co.il> Message-ID: <46595950.6080106@voltaire.com> Roland Dreier wrote: > OK, I crossed my fingers and merged this for 2.6.22 Somehow it seems when applying this patch to OFED something goes wrong, please see https://bugs.openfabrics.org/show_bug.cgi?id=636 Or. 
From sobebike at gmail.com Sun May 27 05:45:43 2007 From: sobebike at gmail.com (SoBeBike) Date: Sun, 27 May 2007 07:45:43 -0500 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: <46592492.6000404@dev.mellanox.co.il> References: <20070525212214.20500.qmail@station183.com> <46592492.6000404@dev.mellanox.co.il> Message-ID: That is what I am already doing (note the comment, "// loop to drain..."). I loop calling ibv_poll_cq until it is empty. I just noted that due to the usage model, I only see it pull one CQE and then on the 2nd pass through the loop the CQ is empty.
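The arm / wait / re-arm / drain sequence discussed in this thread can be pictured with a small toy model. This is only a sketch of the notification semantics and not libibverbs code: struct toy_cq and every toy_* function are hypothetical names invented for illustration, and the model assumes that arming a CQ covers only the next completion and that a completion arriving while the CQ is not armed raises no event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model only: hypothetical names, NOT part of libibverbs. */
struct toy_cq {
    int  pending;  /* completions queued but not yet polled */
    bool armed;    /* notification requested and not yet fired */
    int  events;   /* events waiting on the "completion channel" */
};

/* Stands in for ibv_req_notify_cq(): arms for the NEXT completion only. */
void toy_req_notify(struct toy_cq *cq) { cq->armed = true; }

/* Stands in for the HCA writing a CQE. An event fires only if the CQ
 * is armed, and firing disarms it again. */
void toy_complete(struct toy_cq *cq)
{
    cq->pending++;
    if (cq->armed) {
        cq->armed = false;
        cq->events++;
    }
}

/* Stands in for ibv_poll_cq(cq, 1, &wc): 1 if a CQE was dequeued, else 0. */
int toy_poll(struct toy_cq *cq)
{
    if (cq->pending == 0)
        return 0;
    cq->pending--;
    return 1;
}

/* Consume one event, re-arm, then drain. Returns the number of CQEs
 * drained, or -1 when no event is waiting (a real blocking
 * ibv_get_cq_event() would sleep here instead of returning). */
int toy_wait_and_drain(struct toy_cq *cq)
{
    int n = 0;

    if (cq->events == 0)
        return -1;
    cq->events--;            /* get + ack the event */
    toy_req_notify(cq);      /* re-arm BEFORE draining */
    while (toy_poll(cq))
        n++;
    return n;
}

#ifdef TOY_CQ_DEMO_MAIN
int main(void)
{
    struct toy_cq cq = {0, false, 0};

    toy_req_notify(&cq);
    toy_complete(&cq);       /* fires the single event */
    toy_complete(&cq);       /* CQ is disarmed now: no second event */
    assert(cq.events == 1);
    printf("drained %d\n", toy_wait_and_drain(&cq));
    return 0;
}
#endif
```

In this model a completion that lands while the CQ is not armed stays in the queue without raising an event, which is why polling the CQ directly is a reasonable debugging step when the event never arrives. Building with -DTOY_CQ_DEMO_MAIN enables the small demo main().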
On 5/27/07, Dotan Barak wrote: > Try to do the following scenario: > > > ibv_req_notify_cq(cq, 0); > > ibv_post_send(qp, &work_req, &bad_work_req); > > ibv_get_cq_event(channel, &ev_cq, &ev_ctx); > > ibv_ack_cq_events(ev_cq, 1); > > ibv_req_notify_cq(cq, 0); > > in a loop until the CQ is empty: > ibv_poll_cq(cq, 1, &wc); // loop to drain - but due to upper protocol, will only ever be 1 at a time > From mst at dev.mellanox.co.il Sun May 27 05:53:37 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 27 May 2007 15:53:37 +0300 Subject: [ofa-general] Re: Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak In-Reply-To: <46595950.6080106@voltaire.com> References: <20070521120459.GI20400@mellanox.co.il> <46595950.6080106@voltaire.com> Message-ID: <20070527125337.GF8342@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak > > Roland Dreier wrote: > >OK, I crossed my fingers and merged this for 2.6.22 > > Somehow it seems when applying this patch to OFED something goes wrong, > please see https://bugs.openfabrics.org/show_bug.cgi?id=636 Yes, it seems that we shouldn't keep a QP in error state for extended periods of time, since that moves hardware to the slow path. It seems that the right approach might be to create a loopback QP in RTS, and perform post sends there. 
-- MST From dotanb at dev.mellanox.co.il Sun May 27 06:06:41 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 27 May 2007 16:06:41 +0300 Subject: [ofa-general] how to write a IB user level multicast application In-Reply-To: <4655ACC2.9030401@open-mpi.org> References: <13432ab00705232201s5f7d5a5h5ecaaddf57ead11b@mail.gmail.com> <46553448.6020508@dev.mellanox.co.il> <4655ACC2.9030401@open-mpi.org> Message-ID: <46598261.4030005@dev.mellanox.co.il> Andrew Friedley wrote: > Dotan Barak wrote: >> In the following URL you can find a very simple example on how to use >> multicast: >> https://svn.openfabrics.org/svn/openib/trunk/contrib/mellanox/ibtp/gen2/userspace/useraccess/multicast_test/multicast_test.c > > > I seem to be missing v1.h on my OFED v1.2 nightly install, where can I > find it? VL can be found in: https://svn.openfabrics.org/svn/openib/trunk/contrib/mellanox/ibtp/common/tools/vl/ > >> this test doesn't send an SA query (to get the multicast props) or an >> SA multicast join (to make the SM configure the subnet to make the >> port that this QP is attached to) to get the multicast messages. >> >> This example will work on a back-to-back topology. > > An alternative that I've had pretty good success with is to use the > RDMA CM. It uses an IP(v6) abstraction, does the SA queries/joins for > you, and also supports selection of an unused multicast address if you > join the '0' address (and port? not sure if its required, I always > zero it). Check out the 'mckey.c' example included with the RDMA CM > source, it will likely answer your questions. This test was written in order to check the verbs layer without any dependency on any ULP. thanks Dotan From dotanb at dev.mellanox.co.il Sun May 27 06:34:43 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 27 May 2007 16:34:43 +0300 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... 
In-Reply-To: References: <20070525212214.20500.qmail@station183.com> <46592492.6000404@dev.mellanox.co.il> Message-ID: <465988F3.4000100@dev.mellanox.co.il> SoBeBike wrote: > That is what I am already doing (note the comment, "// loop to > drain..."). I loop calling ibv_poll_cq until it is empty. I just noted > that due to the usage model, I only see it pull one CQE and then on > the 2nd pass through the loop the CQ is empty. > When you get to this scenario (for many times you don't get the CQ event) did you try to poll the CQ and check if there is any completion in it? (maybe the problem is that this WR just didn't create any completion when it ended). thanks Dotan From or.gerlitz at gmail.com Sun May 27 07:13:07 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Sun, 27 May 2007 17:13:07 +0300 Subject: [ofa-general] Re: Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak In-Reply-To: <20070527125337.GF8342@mellanox.co.il> References: <20070521120459.GI20400@mellanox.co.il> <46595950.6080106@voltaire.com> <20070527125337.GF8342@mellanox.co.il> Message-ID: <15ddcffd0705270713h52449106x7b5654d558cbbda2@mail.gmail.com> On 5/27/07, Michael S. Tsirkin wrote: > > > Quoting Or Gerlitz : > > Subject: Re: Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak > > > > Roland Dreier wrote: > > >OK, I crossed my fingers and merged this for 2.6.22 > > > > Somehow it seems when applying this patch to OFED something goes wrong, > > please see https://bugs.openfabrics.org/show_bug.cgi?id=636 > > Yes, it seems that we shouldn't keep a QP in error state > for extended periods of time, since that moves hardware > to the slow path. > what actually is the "hardware slow path" ? Or. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Sun May 27 07:57:04 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 27 May 2007 17:57:04 +0300 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race In-Reply-To: References: <20070522005918.GB13311@mellanox.co.il> <20070524131154.GA7940@mellanox.co.il> Message-ID: <20070527145704.GB26933@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race > > > The following works fine for me here. Pls consider for 2.6.22. > > Does it help with https://bugs.openfabrics.org//show_bug.cgi?id=604 ? > Or are we still looking? 604 turns out to be a bug in mthca. I'll post a patch RSN. Still, I think it's a good idea to apply this. Do you agree? I also have put this patch in OFED. -- MST From jwong at datallegro.com Sun May 27 07:59:06 2007 From: jwong at datallegro.com (Jeffrey Wong) Date: Sun, 27 May 2007 10:59:06 -0400 Subject: [ofa-general] Trouble install OFED 1.2-rc3 - ib-bonding Message-ID: Hello, I am installing the OFED 1.2-rc3. Everything else builds except for ib-bonding. Thanks in advance. 
I am getting the following error messages: + make -C /lib/modules/2.6.18-8.1.4.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding make: Entering directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64' CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.o In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:78: /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h: In function 'bond_set_slave_inactive_flags': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:262: error: (Each undeclared identifier is reported only once /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:262: error: for each function it appears in.) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h: In function 'bond_set_slave_active_flags': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin g.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_compute_features': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1233: warning: comparison of distinct pointer types lacks a cast /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_enslave': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1449: error: 'IFF_BONDING' undeclared (first use in this fu nction) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_release': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1848: error: 'IFF_BONDING' undeclared (first use in this fu nction) 
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in t his function) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_arp_rcv': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:2548: error: 'IFF_BONDING' undeclared (first use in this fu nction) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_netdev_event': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:3390: error: 'IFF_BONDING' undeclared (first use in this fu nction) /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c: In function 'bond_init': /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:4374: warning: assignment discards qualifiers from pointer target type /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m ain.c:4386: error: 'IFF_BONDING' undeclared (first use in this fu nction) make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_ main.o] Error 1 make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bondi ng] Error 2 make: Leaving directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64' + echo ' Building IB bonding driver failed' Building IB bonding driver failed + exit 1 error: Bad exit status from /var/tmp/rpm-tmp.23876 (%build) Jeff Wong -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Sun May 27 08:06:42 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 27 May 2007 18:06:42 +0300 Subject: [ofa-general] [PATCH for-2.6.22] IB/mthca: fix send CQE with error for QP connected to SRQ Message-ID: <20070527150642.GC26933@mellanox.co.il> mthca_free_err_wqe currently treats both send and receive CQEs identically in case of SRQ.
But for tavor mode hardware, send CQEs with error can be chained together even if the RQ is part of SRQ, so we miss some CQEs. This, in turn, triggers crashes in IPoIB CM: https://bugs.openfabrics.org//show_bug.cgi?id=604. Fix by following the WQE chain for all send CQEs, same as non-SRQ QP. Signed-off-by: Michael S. Tsirkin --- This is a fix for bug 604 in ofa bugzilla. Pls consider for 2.6.22. diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 0276649..7474646 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -2287,7 +2287,7 @@ void mthca_free_err_wqe(struct mthca_dev *dev, struct mthca_qp *qp, int is_send, * For SRQs, all WQEs generate a CQE, so we're always at the * end of the doorbell chain. */ - if (qp->ibqp.srq) { + if (qp->ibqp.srq && !is_send) { *new_wqe = 0; return; } -- MST From jimmy at hillraiser.com Sun May 27 15:45:08 2007 From: jimmy at hillraiser.com (Jimmy Hill) Date: Sun, 27 May 2007 17:45:08 -0500 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: <465988F3.4000100@dev.mellanox.co.il> Message-ID: > -----Original Message----- > > SoBeBike wrote: > > That is what I am already doing (note the comment, "// loop to > > drain..."). I loop calling ibv_poll_cq until it is empty. I just noted > > that due to the usage model, I only see it pull one CQE and then on > > the 2nd pass through the loop the CQ is empty. > > > > When you get to this scenario (for many times you don't get the CQ > event) did you try to poll the CQ and check if there is any completion > in it? > (maybe the problem is that this WR just didn't create any completion > when it ended). > My code currently blocks (either a blocking call, or looping with a non-blocking FD) waiting for an event (ibv_get_cq_event) before attempting to dequeue (ibv_poll_cq) any entries from the CQ. 
I assumed that if I did not get a completion event, there would not be a CQ entry waiting. So, no, I have not tried that. I can try that, but it may be a week or more before I am able to get back to my machines. Can I not rely on getting an event when I request signalled completions? Thanks. From rdreier at cisco.com Sun May 27 18:18:34 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 27 May 2007 18:18:34 -0700 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak In-Reply-To: <20070527125337.GF8342@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 27 May 2007 15:53:37 +0300") References: <20070521120459.GI20400@mellanox.co.il> <46595950.6080106@voltaire.com> <20070527125337.GF8342@mellanox.co.il> Message-ID: > Yes, it seems that we shouldn't keep a QP in error state > for extended periods of time, since that moves hardware > to the slow path. Ugh, so I can do a local DoS by just creating a QP and moving it to error and then going to sleep for a long time? What hardware is susceptible to this? > It seems that the right approach might be to create > a loopback QP in RTS, and perform post sends there. How about using the send queue of the QP we're trying to flush? I'll try to code this up tomorrow if no one beats me to the fix. - R.
From mst at dev.mellanox.co.il Sun May 27 20:41:03 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 May 2007 06:41:03 +0300 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak In-Reply-To: References: <20070521120459.GI20400@mellanox.co.il> <46595950.6080106@voltaire.com> <20070527125337.GF8342@mellanox.co.il> Message-ID: <20070528034103.GB2945@mellanox.co.il> > > It seems that the right approach might be to create > > a loopback QP in RTS, and perform post sends there. > > How about using the send queue of the QP we're trying to flush? > I'll try to code this up tomorrow if no one beats me to the fix. Great idea - since we got last WQE reached, that QP will already be in error. Another alternative I thought about is to keep the drain QP in reset state, and move it to error only when we have some work to do. But this looks like a better way to do it. -- MST From dotanb at dev.mellanox.co.il Sun May 27 22:25:25 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Mon, 28 May 2007 08:25:25 +0300 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: References: Message-ID: <465A67C5.2060101@dev.mellanox.co.il> Jimmy Hill wrote: > My code currently blocks (either a blocking call, or looping with a > non-blocking FD) waiting for an event (ibv_get_cq_event) before attempting > to dequeue (ibv_poll_cq) any entries from the CQ. I assumed that if I did > not get a completion event, there would not be a CQ entry waiting. So, no, I > have not tried that.
I can try that, but it may be a week or more before I > am able to get back to my machines. > > Can I not rely on getting an event when I request signalled completions? > This is the tricky part in IB: when you ask for a completion event, this event will be produced for the NEXT completion that will be produced after you asked for the event. But the answer is yes: if you asked for a completion notification and a completion is being produced you will get an event. I have a question: if you enable in all of the SRs (Send Requests) that you are posting the SIGNAL bit, why don't you just enable the sq_sig_all when creating the QP? Dotan From devesh28 at gmail.com Sun May 27 22:50:24 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Mon, 28 May 2007 11:20:24 +0530 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <4655BE8F.2080102@ichips.intel.com> References: <309a667c0705112339k56adbfb5l18beb65412ea5dbb@mail.gmail.com> <1179398534.23882.67542.camel@hal.voltaire.com> <309a667c0705170528s1d7a7e0as19ee1ecc68b40f61@mail.gmail.com> <1179483657.23882.158398.camel@hal.voltaire.com> <309a667c0705202258k55d2077ch5b1a182031156d18@mail.gmail.com> <1179769930.15940.9823.camel@hal.voltaire.com> <309a667c0705230727n26f2fafetb6986ea60777e073@mail.gmail.com> <1179930909.16831.100286.camel@hal.voltaire.com> <309a667c0705240522p5f5ab88aq601d2d3737f0a7dd@mail.gmail.com> <4655BE8F.2080102@ichips.intel.com> Message-ID: <309a667c0705272250q68aa4064l40454db5b266a967@mail.gmail.com> On 5/24/07, Sean Hefty wrote: > > Yes It will, and hence reduce the initial SA traffic generated on a > > big cluster...just imagin, the cluster is quite big and every node is > > trying to build its cache initially. It will create large burst of SA > > packets. > > In general I agree with the notion of enhancing the cache to allow it to > load locally from a file. But I'd really like to get a framework merged > upstream first before trying to add in these sort of enhancements. 
Ok, but, by that time we can keep the framework ready? > > Initially loading of caches on a large fabric can be limited to a single > GetTable PR query per node, and by enabling the caches across the fabric > at different times, the single burst to the SA can be avoided. How this will be managed? This will add extra startup time in the cluster, because cluster will be usable only after last cache has been enabled. Am I right? > > > Incomplete connectivity will be till first PR is requested for that > > destination, Because if its a cache miss, any how application is going > > to initiate a ib_sa_get_path_rec() and resolved PR will be added in > > cache for future reference. > > ib_sa_get_path_rec() only returns a single path. If multiple paths > exist, adding a single path to the cache may cause all applications to How multi-pathing is handled in current cache_module? > make use of that single path. Updating the cache on demand isn't as > simple as it seems. > > - Sean > From mst at dev.mellanox.co.il Sun May 27 23:45:18 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 May 2007 09:45:18 +0300 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix SRQ WR leak In-Reply-To: References: <20070521120459.GI20400@mellanox.co.il> <46595950.6080106@voltaire.com> <20070527125337.GF8342@mellanox.co.il> Message-ID: <20070528064518.GF2945@mellanox.co.il> > > It seems that the right approach might be to create > > a loopback QP in RTS, and perform post sends there. > > How about using the send queue of the QP we're trying to flush? > I'll try to code this up tomorrow if no one beats me to the fix. Unfortunately, this won't work, as it hits another firmware problem: it won't generate CQE with error for send WR unless the QP was in RTS at some point. 
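
The firmware behaviour reported here can be pictured with a small toy model. This is only a sketch under stated assumptions: the toy_* names are hypothetical, and none of this is the kernel verbs API or the ipoib code. It merely encodes the rule from the message above, that a send WR posted to a QP in the error state is flushed with an error completion only if the QP was in RTS at some point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the kernel verbs API. */
enum toy_qp_state { TOY_RESET, TOY_INIT, TOY_RTR, TOY_RTS, TOY_ERR };

struct toy_qp {
	enum toy_qp_state state;
	bool was_rts;	/* true once the QP has ever been in RTS */
};

/* Move the QP to a new state and remember whether RTS was ever reached. */
static void toy_modify_qp(struct toy_qp *qp, enum toy_qp_state next)
{
	assert(qp != NULL);
	qp->state = next;
	if (next == TOY_RTS)
		qp->was_rts = true;
}

/*
 * Assumed rule, taken from the report in this thread: a send WR posted
 * to a QP in the error state yields a flush-error completion only if
 * the QP passed through RTS earlier.
 */
static bool toy_drain_wr_completes(const struct toy_qp *qp)
{
	return qp->state == TOY_ERR && qp->was_rts;
}

/* Walks both scenarios; returns true if the model behaves as described. */
static bool toy_demo(void)
{
	struct toy_qp rtr_only = { TOY_RESET, false };
	struct toy_qp via_rts = { TOY_RESET, false };

	/* Receive-side QP that stops at RTR and then moves to error. */
	toy_modify_qp(&rtr_only, TOY_INIT);
	toy_modify_qp(&rtr_only, TOY_RTR);
	toy_modify_qp(&rtr_only, TOY_ERR);

	/* Same sequence, but with an RTS transition before the error. */
	toy_modify_qp(&via_rts, TOY_INIT);
	toy_modify_qp(&via_rts, TOY_RTR);
	toy_modify_qp(&via_rts, TOY_RTS);
	toy_modify_qp(&via_rts, TOY_ERR);

	return !toy_drain_wr_completes(&rtr_only) &&
	       toy_drain_wr_completes(&via_rts);
}
```

Calling toy_demo() from a small main() exercises both paths. In this model the QP that only reached RTR never produces the drain completion, which is why posting on the send queue of such a QP would not be enough on its own.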

-- 
MST

From vlad at lists.openfabrics.org Mon May 28 02:49:22 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 28 May 2007 02:49:22 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070528-0200 daily build status
Message-ID: <20070528094922.99289E60856@openfabrics.org>

This email was generated automatically, please do not reply

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:

From jimmy at hillraiser.com Mon May 28 04:32:25 2007
From: jimmy at hillraiser.com (Jimmy Hill)
Date: Mon, 28 May 2007 06:32:25 -0500
Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send...
In-Reply-To: <465A67C5.2060101@dev.mellanox.co.il>
Message-ID:

>
> I have a question: if you enable in all of the SRs (Send Requests) that
> you are posting the SIGNAL bit, why don't you just
> enable the sq_sig_all when creating the QP?
>

I have. That was one of the things I tried.

From mst at dev.mellanox.co.il Mon May 28 04:37:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 28 May 2007 14:37:27 +0300
Subject: [ofa-general] [PATCH for-2.6.22] IB/ipoib: fix performance regression on Mellanox
In-Reply-To:
References: <20070521120459.GI20400@mellanox.co.il>
	<46595950.6080106@voltaire.com>
	<20070527125337.GF8342@mellanox.co.il>
Message-ID: <20070528113727.GP2945@mellanox.co.il>

commit 518b1646f8a31904ca637b8df0c1e31c34a7a3c2:
    IPoIB/cm: Fix SRQ WR leak
introduced performance regression on Mellanox cards:
keeping a QP in error state for extended periods of time,
moves hardware to the slow path (until QP is destroyed).

Fix this by posting a send WR on one of the QPs that are being flushed,
instead of using a separate drain QP.

This fixes bug
Reported and bisected by Scott Weitzenkamp at Cisco.
Debugged by Sasha Mikheev at Voltaire.

Signed-off-by: Michael S. Tsirkin

---

> How about using the send queue of the QP we're trying to flush?
> I'll try to code this up tomorrow if no one beats me to the fix.

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index f133b56..253ece1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -69,8 +69,9 @@ static struct ib_qp_attr ipoib_cm_err_attr = {
 
 #define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff
 
-static struct ib_recv_wr ipoib_cm_rx_drain_wr = {
-	.wr_id = IPOIB_CM_RX_DRAIN_WRID
+static struct ib_send_wr ipoib_cm_rx_drain_wr = {
+	.wr_id = IPOIB_CM_RX_DRAIN_WRID,
+	.opcode = IB_WR_SEND,
 };
 
 static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
@@ -163,16 +164,22 @@ partial_error:
 
 static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv* priv)
 {
-	struct ib_recv_wr *bad_wr;
+	struct ib_send_wr *bad_wr;
+	struct ipoib_cm_rx *p;
 
-	/* rx_drain_qp send queue depth is 1, so
+	/* We only reserved 1 extra slot in CQ for drain WRs, so
 	 * make sure we have at most 1 outstanding WR. */
 	if (list_empty(&priv->cm.rx_flush_list) ||
 	    !list_empty(&priv->cm.rx_drain_list))
 		return;
 
-	if (ib_post_recv(priv->cm.rx_drain_qp, &ipoib_cm_rx_drain_wr, &bad_wr))
-		ipoib_warn(priv, "failed to post rx_drain wr\n");
+	/*
+	 * QPs on flush list are error state. This way, a "flush
+	 * error" WC will be immediately generated for each WR we post.
+	 */
+	p = list_entry(priv->cm.rx_flush_list.next, typeof(*p), list);
+	if (ib_post_send(p->qp, &ipoib_cm_rx_drain_wr, &bad_wr))
+		ipoib_warn(priv, "failed to post drain wr\n");
 
 	list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list);
 }
@@ -199,10 +206,10 @@ static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
 		.event_handler = ipoib_cm_rx_event_handler,
-		.send_cq = priv->cq, /* does not matter, we never send anything */
+		.send_cq = priv->cq, /* For drain WR */
 		.recv_cq = priv->cq,
 		.srq = priv->cm.srq,
-		.cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */
+		.cap.max_send_wr = 1, /* For drain WR */
 		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
 		.sq_sig_type = IB_SIGNAL_ALL_WR,
 		.qp_type = IB_QPT_RC,
@@ -242,6 +249,24 @@ static int ipoib_cm_modify_rx_qp(struct net_device *dev,
 		ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret);
 		return ret;
 	}
+
+	/* Mellanox firmware won't generate completions with error for drain WRs
+	 * unless the QP has been moved to RTS first. This work-around leaves a
+	 * window where a QP has moved to error asynchronously, but this will
+	 * eventually get fixed in firmware, so let's not error out if modify QP
+	 * fails. */
+	qp_attr.qp_state = IB_QPS_RTS;
+	ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to init QP attr for RTS: %d\n", ret);
+		return 0;
+	}
+	ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify QP to RTS: %d\n", ret);
+		return 0;
+	}
+
 	return 0;
 }
 
@@ -623,38 +648,11 @@ static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
 int ipoib_cm_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ib_qp_init_attr qp_init_attr = {
-		.send_cq = priv->cq, /* does not matter, we never send anything */
-		.recv_cq = priv->cq,
-		.cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */
-		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
-		.cap.max_recv_wr = 1,
-		.cap.max_recv_sge = 1, /* FIXME: 0 Seems not to work */
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
-		.qp_type = IB_QPT_UC,
-	};
 	int ret;
 
 	if (!IPOIB_CM_SUPPORTED(dev->dev_addr))
 		return 0;
 
-	priv->cm.rx_drain_qp = ib_create_qp(priv->pd, &qp_init_attr);
-	if (IS_ERR(priv->cm.rx_drain_qp)) {
-		printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name);
-		ret = PTR_ERR(priv->cm.rx_drain_qp);
-		return ret;
-	}
-
-	/*
-	 * We put the QP in error state directly. This way, a "flush
-	 * error" WC will be immediately generated for each WR we post.
-	 */
-	ret = ib_modify_qp(priv->cm.rx_drain_qp, &ipoib_cm_err_attr, IB_QP_STATE);
-	if (ret) {
-		ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret);
-		goto err_qp;
-	}
-
 	priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev);
 	if (IS_ERR(priv->cm.id)) {
 		printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name);
@@ -676,8 +674,6 @@ err_listen:
 	ib_destroy_cm_id(priv->cm.id);
 err_cm:
 	priv->cm.id = NULL;
-err_qp:
-	ib_destroy_qp(priv->cm.rx_drain_qp);
 	return ret;
 }
 
@@ -740,7 +736,6 @@ void ipoib_cm_dev_stop(struct net_device *dev)
 		kfree(p);
 	}
 
-	ib_destroy_qp(priv->cm.rx_drain_qp);
 	cancel_delayed_work(&priv->cm.stale_task);
 }
 
-- 
MST

From mst at dev.mellanox.co.il Mon May 28 05:12:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 28 May 2007 15:12:06 +0300
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <20070514045832.GA18615@mellanox.co.il>
References: <20070514045832.GA18615@mellanox.co.il>
Message-ID: <20070528121206.GA1847@mellanox.co.il>

Roland, please pick up the patches from:
    git://git.openfabrics.org/~mst/linux-2.6/.git master

This will pull in the following outstanding patches intended for 2.6.22:
all of them have been posted previously (let me know if you like me to
re-post the patches):

Michael S. Tsirkin (3):
      IB/ipoib: fix to_ipoib_neigh access race
      IB/mthca: fix send CQE with error for QP connected to SRQ
      IB/ipoib: fix performance regression on Mellanox

Sean Hefty (1):
      ib/cm: fix stale connection detection

-- 
MST

From erezz at voltaire.com Mon May 28 06:02:09 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Mon, 28 May 2007 16:02:09 +0300
Subject: [ofa-general] OFED 1.x (Gen 2) based SRP target code released!
In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F6F91AB@mtiexch01.mti.com>
References: <9FA59C95FFCBB34EA5E42C1A8573784F6F91AB@mtiexch01.mti.com>
Message-ID: <465AD2D1.2070100@voltaire.com>

Sujal Das wrote:
> Hello all,
>
>
>
> Mellanox is pleased to release the OFED 1.x (Gen 2) - based SRP Target
> source code to the OpenFabrics community, OEMs and end users.
>
>
>
> This release is an upgrade to the previously released SRP Target source
> code that was based on the Mellanox IBGold driver and Gen 1 software
> interface. The code has been tested to work with Mellanox InfiniBand
> adapters and is available under Open Fabrics open source license terms.
>
>
I'm trying to build srpt according to the instructions, but it does not
get built at all. Here's what I did:

tar xzf OFED-1.2-rc3.tgz
cd OFED-1.2-rc3/SRPMS
rpm2cpio ofa_kernel-1.2-rc3.src.rpm |cpio -i
tar xzf ofa_kernel-1.2.tgz
cd ofa_kernel-1.2
patch -p1 < ~/srpt_inc/add_srpt_01.patch
patch -p1 < ~/srpt_inc/add_srpt_03.patch
cp -r ~/srpt drivers/infiniband/ulp/srpt
./configure --with-core-mod --with-ipoib-mod --with-srp-target-mod --with-mthca-mod

Here's the autoconf.h file that was generated:

#undef CONFIG_INFINIBAND
#undef CONFIG_INFINIBAND_IPOIB
#undef CONFIG_INFINIBAND_IPOIB_CM
#undef CONFIG_INFINIBAND_SDP
#undef CONFIG_INFINIBAND_SRP
#undef CONFIG_INFINIBAND_SRPT
#undef CONFIG_INFINIBAND_USER_MAD
#undef CONFIG_INFINIBAND_USER_ACCESS
#undef CONFIG_INFINIBAND_ADDR_TRANS
#undef CONFIG_INFINIBAND_MTHCA
#undef CONFIG_INFINIBAND_IPOIB_DEBUG
#undef CONFIG_INFINIBAND_ISER
#undef CONFIG_INFINIBAND_EHCA
#undef CONFIG_INFINIBAND_EHCA_SCALING
#undef CONFIG_RDS
#undef CONFIG_RDS_IB
#undef CONFIG_RDS_TCP
#undef CONFIG_RDS_DEBUG
#undef CONFIG_INFINIBAND_MADEYE
#undef CONFIG_INFINIBAND_VNIC
#undef CONFIG_INFINIBAND_VNIC_DEBUG
#undef CONFIG_INFINIBAND_VNIC_STATS
#undef CONFIG_INFINIBAND_CXGB3
#undef CONFIG_INFINIBAND_CXGB3_DEBUG
#undef CONFIG_CHELSIO_T3
#undef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
#undef CONFIG_INFINIBAND_SDP_SEND_ZCOPY
#undef CONFIG_INFINIBAND_SDP_RECV_ZCOPY
#undef CONFIG_INFINIBAND_SDP_DEBUG
#undef CONFIG_INFINIBAND_SDP_DEBUG_DATA
#undef CONFIG_INFINIBAND_IPATH
#undef CONFIG_INFINIBAND_MTHCA_DEBUG
#define CONFIG_INFINIBAND 1
#define CONFIG_INFINIBAND_IPOIB 1
#define CONFIG_INFINIBAND_IPOIB_CM 1
#undef CONFIG_INFINIBAND_SDP
#undef CONFIG_INFINIBAND_SRP
#define CONFIG_INFINIBAND_SRPT 1
#undef CONFIG_INFINIBAND_USER_MAD
#undef CONFIG_INFINIBAND_USER_ACCESS
#undef CONFIG_INFINIBAND_ADDR_TRANS
#define CONFIG_INFINIBAND_MTHCA 1
#undef CONFIG_INFINIBAND_VNIC
#undef CONFIG_INFINIBAND_CXGB3
#undef CONFIG_CHELSIO_T3
#define CONFIG_INFINIBAND_IPOIB_DEBUG 1
#undef CONFIG_INFINIBAND_ISER
#undef CONFIG_SCSI_ISCSI_ATTRS
#undef CONFIG_ISCSI_TCP
#undef CONFIG_INFINIBAND_EHCA
#undef CONFIG_RDS
#undef CONFIG_RDS_IB
#undef CONFIG_RDS_TCP
#undef CONFIG_RDS_DEBUG
#undef CONFIG_INFINIBAND_VNIC_DEBUG
#undef CONFIG_INFINIBAND_VNIC_STATS
#undef CONFIG_INFINIBAND_CXGB3_DEBUG
#undef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
#undef CONFIG_INFINIBAND_SDP_SEND_ZCOPY
#undef CONFIG_INFINIBAND_SDP_RECV_ZCOPY
#undef CONFIG_INFINIBAND_SDP_DEBUG
#undef CONFIG_INFINIBAND_SDP_DEBUG_DATA
#undef CONFIG_INFINIBAND_IPATH
#define CONFIG_INFINIBAND_MTHCA_DEBUG 1
#undef CONFIG_INFINIBAND_MADEYE

Now, I ran "make" and srpt wasn't built:

Building kernel modules
Kernel version: 2.6.16.21-0.8-smp
Modules directory: //lib/modules/2.6.16.21-0.8-smp/updates
Kernel sources: /lib/modules/2.6.16.21-0.8-smp/build
env EXTRA_CFLAGS=" -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include \
 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib \
 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug \
 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core \
 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 \
 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds " \
 make -C /lib/modules/2.6.16.21-0.8-smp/build SUBDIRS="/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2" KERNELRELEASE=2.6.16.21-0.8-smp \
 EXTRAVERSION=.21-0.8-smp V=1 \
 CONFIG_INFINIBAND=m \
 CONFIG_INFINIBAND_IPOIB=m \
 CONFIG_INFINIBAND_IPOIB_CM=y \
 CONFIG_INFINIBAND_SDP= \
 CONFIG_INFINIBAND_SRP= \
 CONFIG_INFINIBAND_USER_MAD= \
 CONFIG_INFINIBAND_USER_ACCESS= \
 CONFIG_INFINIBAND_ADDR_TRANS= \
 CONFIG_INFINIBAND_MTHCA=m \
 CONFIG_INFINIBAND_IPOIB_DEBUG=y \
 CONFIG_INFINIBAND_ISER= \
 CONFIG_SCSI_ISCSI_ATTRS= \
 CONFIG_ISCSI_TCP= \
 CONFIG_INFINIBAND_EHCA= \
 CONFIG_INFINIBAND_EHCA_SCALING= \
 CONFIG_RDS= \
 CONFIG_RDS_IB= \
 CONFIG_RDS_TCP= \
 CONFIG_RDS_DEBUG= \
 CONFIG_INFINIBAND_IPOIB_DEBUG_DATA= \
 CONFIG_INFINIBAND_SDP_SEND_ZCOPY= \
 CONFIG_INFINIBAND_SDP_RECV_ZCOPY= \
 CONFIG_INFINIBAND_SDP_DEBUG= \
CONFIG_INFINIBAND_SDP_DEBUG_DATA= \ CONFIG_INFINIBAND_IPATH= \ CONFIG_INFINIBAND_MTHCA_DEBUG=y \ CONFIG_INFINIBAND_MADEYE= \ CONFIG_INFINIBAND_VNIC= \ CONFIG_INFINIBAND_VNIC_DEBUG= \ CONFIG_INFINIBAND_VNIC_STATS= \ CONFIG_CHELSIO_T3= \ CONFIG_INFINIBAND_CXGB3= \ CONFIG_INFINIBAND_CXGB3_DEBUG= \ LINUXINCLUDE=' \ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ \ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include \ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include \ -Iinclude \ $(if $(KBUILD_SRC),-Iinclude2 -I$(srctree)/include) \ -include include/linux/autoconf.h \ -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h \ ' \ modules make[1]: Entering directory `/usr/src/linux-2.6.16.21-0.8-obj/x86_64/smp' make -C ../../../linux-2.6.16.21-0.8 O=../linux-2.6.16.21-0.8-obj/x86_64/smp modules make -C /usr/src/linux-2.6.16.21-0.8-obj/x86_64/smp \ KBUILD_SRC=/usr/src/linux-2.6.16.21-0.8 \ KBUILD_EXTMOD="/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2" -f /usr/src/linux-2.6.16.21-0.8/Makefile modules rm -rf /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/.tmp_versions mkdir -p /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/.tmp_versions make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build obj=/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2 make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build obj=/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build obj=/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.cm.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include 
include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-p ointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(cm)" -D"KBUILD_MODNAME=KBUILD_STR(ib_cm)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/cm.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.packer.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os 
-fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -W no-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(packer)" -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_packer.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/packer.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.ud_header.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ud_header)" -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_ud_header.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ud_header.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.verbs.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wn o-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(verbs)" -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_verbs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/verbs.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.sysfs.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wn o-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(sysfs)" -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o 
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_sysfs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/sysfs.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.device.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -W no-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(device)" -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_device.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/device.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.fmr_pool.o.d -nostdinc -isystem 
/usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(fmr_pool)" -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_fmr_pool.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/fmr_pool.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.cache.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 
-I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(cache)" -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_cache.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/cache.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.genalloc.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration 
-fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(genalloc)" -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_genalloc.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/genalloc.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.netevent.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement 
-Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(netevent)" -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_netevent.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/netevent.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.local_sa.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(local_sa)" -D"KBUILD_MODNAME=KBUILD_STR(ib_local_sa)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_local_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/local_sa.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.mad.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mad)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mad)" 
-c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/mad.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.smi.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(smi)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mad)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_smi.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/smi.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.agent.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(agent)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mad)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_agent.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/agent.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.mad_rmpp.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h 
-include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mad_rmpp)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mad)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_mad_rmpp.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/mad_rmpp.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.sa_query.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os 
-fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(sa_query)" -D"KBUILD_MODNAME=KBUILD_STR(ib_sa)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_sa_query.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/sa_query.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.multicast.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(multicast)" -D"KBUILD_MODNAME=KBUILD_STR(ib_sa)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_multicast.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/multicast.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.iwcm.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(iwcm)" -D"KBUILD_MODNAME=KBUILD_STR(iw_cm)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.tmp_iwcm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iwcm.c ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/packer.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ud_header.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/verbs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/sysfs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/device.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/fmr_pool.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/cache.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/genalloc.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/netevent.o ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/smi.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/agent.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/mad_rmpp.o ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/sa_query.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/multicast.o ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/local_sa.o ld -m 
elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/cm.o ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iwcm.o make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build obj=/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_main.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_main)" 
-D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_main.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_main.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_cmd.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_cmd)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_cmd.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cmd.c gcc 
-Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_profile.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_profile)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_profile.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_profile.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_reset.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_reset)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_reset.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_reset.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_allocator.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 
-I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_allocator)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_allocator.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_allocator.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_eq.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes 
-Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_eq)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_eq.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_eq.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_pd.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time 
-mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_pd)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_pd.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_pd.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_cq.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_cq)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_cq.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cq.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_mr.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-st atement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE 
-D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_mr)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_mr.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_mr.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_qp.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-st atement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_qp)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_qp.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c 
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c: In function 'mthca_tavor_post_send':
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c:1587: warning: 'f0' may be used uninitialized in this function
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c: In function 'mthca_arbel_post_send':
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c:1941: warning: 'f0' may be used uninitialized in this function
gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_av.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s"
-D"KBUILD_BASENAME=KBUILD_STR(mthca_av)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_av.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_av.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_mcg.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-s tatement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_mcg)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_mcg.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_mcg.c gcc 
-Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_mad.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-s tatement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_mad)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_mad.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_provider.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-af ter-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_provider)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_provider.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_provider.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_memfree.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include 
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-aft er-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_memfree)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_memfree.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_memfree.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_uar.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common 
-ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-s tatement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_uar)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_uar.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_uar.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_srq.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-s tatement -Wno-pointer-sign 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_srq)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_srq.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_srq.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.mthca_catas.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after -statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(mthca_catas)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.tmp_mthca_catas.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_catas.c
ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_main.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cmd.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_profile.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_reset.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_allocator.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_eq.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_pd.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cq.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_mr.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_av.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_mcg.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_provider.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_memfree.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_uar.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_srq.o
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_catas.o make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build obj=/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_main.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-afte r-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_main)" -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_main.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_main.c 
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_main.c: In function 'ipoib_neigh_destructor':
/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_main.c:867: warning: ISO C90 forbids mixed declarations and code
gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_ib.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_ib)" -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_ib.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_ib.c
gcc
-Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_multicast.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration -after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_multicast)" -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_multicast.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_multicast.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_verbs.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-aft er-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_verbs)" -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_verbs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_verbs.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_vlan.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 
-I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-afte r-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_vlan)" -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_vlan.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_vlan.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_cm.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs 
-Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after- statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_cm)" -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_cm.c gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ipoib_fs.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx 
-mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipoib_fs)" -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.tmp_ipoib_fs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_fs.c
ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_main.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_ib.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_multicast.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_verbs.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_vlan.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_fs.o
Building modules, stage 2.
make -rR -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.modpost scripts/mod/modpost -m -a -i /usr/src/linux-2.6.16.21-0.8-obj/x86_64/smp/Module.symvers -I /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/Modules.symvers -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/Modules.symvers -s /dev/null /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.o gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.ib_cm.mod.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED- 1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ib_cm)" -D"KBUILD_MODNAME=KBUILD_STR(ib_cm)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.mod.c ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_cm.mod.o gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.ib_core.mod.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFE D-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ib_core)" -D"KBUILD_MODNAME=KBUILD_STR(ib_core)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.mod.c ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_core.mod.o gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.ib_local_sa.mod.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp /OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ib_local_sa)" -D"KBUILD_MODNAME=KBUILD_STR(ib_local_sa)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.mod.c ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_local_sa.mod.o gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.ib_mad.mod.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED -1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ib_mad)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mad)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.mod.c ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_mad.mod.o gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.ib_sa.mod.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED- 1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ib_sa)" -D"KBUILD_MODNAME=KBUILD_STR(ib_sa)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.mod.c ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/ib_sa.mod.o gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/.iw_cm.mod.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tmp/OFED- 1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 
-I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(iw_cm)" -D"KBUILD_MODNAME=KBUILD_STR(iw_cm)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.mod.c ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/core/iw_cm.mod.o gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/.ib_mthca.mod.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/tm p/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -D"KBUILD_STR(s)=#s" 
-D"KBUILD_BASENAME=KBUILD_STR(ib_mthca)" -D"KBUILD_MODNAME=KBUILD_STR(ib_mthca)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.mod.c ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/mthca/ib_mthca.mod.o gcc -Wp,-MD,/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/.ib_ipoib.mod.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.0/include -D__KERNEL__ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/kernel_addons/backport/2.6.16_sles10/include/ -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16.21-0.8/include -include include/linux/autoconf.h -include /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include/linux/autoconf.h -I -I/usr/src/linux-2.6.16.21-0.8/ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/t mp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/include -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/debug -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/hw/cxgb3/core -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/cxgb3 -I/tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/net/rds -D"KBUILD_STR(s)=#s" 
-D"KBUILD_BASENAME=KBUILD_STR(ib_ipoib)" -D"KBUILD_MODNAME=KBUILD_STR(ib_ipoib)" -DMODULE -c -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.mod.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.mod.c ld -m elf_x86_64 -r -o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.ko /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.o /tmp/OFED-1.2-rc3/SRPMS/ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ib_ipoib.mod.o make[1]: Leaving directory `/usr/src/linux-2.6.16.21-0.8-obj/x86_64/smp' From tziporet at dev.mellanox.co.il Mon May 28 07:30:41 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 28 May 2007 17:30:41 +0300 Subject: [ofa-general] OFED 1.2 status & RC4 Message-ID: <465AE791.5040003@mellanox.co.il> Hi All, Most of critical and major bugs are fixed thus we plan to have RC4 this Wed (or Thursday if some other important fix will be available) 567 blocker rolandd at cisco.com RHEL5 ppc64 UD verbs failures 577 critical ishai at mellanox.co.il SRP multipath failover too slow (minutes, not seconds) 626 major monis at voltaire.com wrong network /broadcast address set by ib-bond script All - if you have any fix that must be applied to RC4 please send this till end of Tuesday (US time) Roland, Ishai and Moni - please update me regarding status of your bugs Thanks, Tziporet -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at dev.mellanox.co.il Mon May 28 08:22:59 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Mon, 28 May 2007 18:22:59 +0300 Subject: [ofa-general] there is a warning message in every use of the library libibverbs Message-ID: <465AF3D3.10205@dev.mellanox.co.il> Hi Roland. In every test/application that uses the libibverbs (i think when the libibverbs init function is being called) i see the following warning: <-start-> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. 
This will severely limit memory registrations. <-end-> Why did you add this warning message? Even when i executed this test as root i got this warning .... Can you add an environment variable that will prevent this warning? (or i can send it to you if you agree ...) thanks Dotan From rdreier at cisco.com Mon May 28 10:02:37 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 May 2007 10:02:37 -0700 Subject: [ofa-general] Re: there is a warning message in every use of the library libibverbs In-Reply-To: <465AF3D3.10205@dev.mellanox.co.il> (Dotan Barak's message of "Mon, 28 May 2007 18:22:59 +0300") References: <465AF3D3.10205@dev.mellanox.co.il> Message-ID: > In every test/application that uses the libibverbs (i think when the > libibverbs init function is being called) > i see the following warning: > > <-start-> > libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. > This will severely limit memory registrations. > <-end-> > > Why did you add this warning message? To avoid the FAQ of "memory registration / CQ creation fails and I don't know why". - R. From sweitzen at cisco.com Mon May 28 10:09:00 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 28 May 2007 10:09:00 -0700 Subject: [ofa-general] RE: [ewg] OFED 1.2 status & RC4 In-Reply-To: <465AE791.5040003@mellanox.co.il> References: <465AE791.5040003@mellanox.co.il> Message-ID: There were several IPoIB bugs marked fixed today, are all the IPoIB fixes in OFED-1.2-20070528-0600.tgz or do I need to wait another day? 
Scott ________________________________ From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren Sent: Monday, May 28, 2007 7:31 AM To: EWG Cc: Moni Levy; Roland Dreier (rdreier); Ishai Rabinovitz; OpenFabrics General Subject: [ewg] OFED 1.2 status & RC4 Hi All, Most of critical and major bugs are fixed thus we plan to have RC4 this Wed (or Thursday if some other important fix will be available) 567 blocker rolandd at cisco.com RHEL5 ppc64 UD verbs failures 577 critical ishai at mellanox.co.il SRP multipath failover too slow (minutes, not seconds) 626 major monis at voltaire.com wrong network /broadcast address set by ib-bond script All - if you have any fix that must be applied to RC4 please send this till end of Tuesday (US time) Roland, Ishai and Moni - please update me regarding status of your bugs Thanks, Tziporet From sashak at voltaire.com Mon May 28 13:07:42 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 28 May 2007 23:07:42 +0300 Subject: [ofa-general] [PATCH] opensm/console: portstatus command for only initialized ports Message-ID: <20070528200742.GA13193@sashak.voltaire.com> Run portstatus command for only initialized ports + minor identation fixes.
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_console.c | 18 ++++++++++-------- 1 files changed, 10 insertions(+), 8 deletions(-) diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index 2802c38..3415262 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -598,15 +598,17 @@ __get_stats(cl_map_item_t * const p_map_item, void *context) fs->total_nodes++; for (port = 1; port < num_ports; port++) { - osm_physp_t *phys = osm_node_get_physp_ptr(node, port); + osm_physp_t *phys = osm_node_get_physp_ptr(node, port); ib_port_info_t *pi = &(phys->port_info); - - uint8_t active_speed = ib_port_info_get_link_speed_active(pi); - uint8_t enabled_speed = ib_port_info_get_link_speed_enabled(pi); - uint8_t active_width = pi->link_width_active; - uint8_t enabled_width = pi->link_width_enabled; - uint8_t port_state = ib_port_info_get_port_state(pi); - uint8_t port_phys_state = ib_port_info_get_port_phys_state(pi); + uint8_t active_speed = ib_port_info_get_link_speed_active(pi); + uint8_t enabled_speed = ib_port_info_get_link_speed_enabled(pi); + uint8_t active_width = pi->link_width_active; + uint8_t enabled_width = pi->link_width_enabled; + uint8_t port_state = ib_port_info_get_port_state(pi); + uint8_t port_phys_state = ib_port_info_get_port_phys_state(pi); + + if (!osm_physp_is_valid(phys)) + continue; if ((enabled_width ^ active_width) > active_width) { __tag_port_report(&(fs->reduced_width_ports), -- 1.5.2.109.g802f From mst at dev.mellanox.co.il Mon May 28 21:27:41 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 May 2007 07:27:41 +0300 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race In-Reply-To: <20070524131154.GA7940@mellanox.co.il> References: <20070522005918.GB13311@mellanox.co.il> <20070524131154.GA7940@mellanox.co.il> Message-ID: <20070529042741.GB13866@mellanox.co.il> > Quoting Michael S. 
Tsirkin : > Subject: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race > > hard_start_xmit dereferences to_ipoib_neigh when only tx_lock is taken. This > would only be safe if all calls that modify *to_ipoib_neigh take tx_lock too. > Currently this is not always true for ipoib_neigh_free and path_rec_completion, > which results in memory corruption. Fix this race, making sure > path_rec_completion and ipoib_neigh_free are always called under > tx_lock. > > Signed-off-by: Michael S. Tsirkin Could you on this patch please? -- MST From rdreier at cisco.com Mon May 28 21:28:32 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 May 2007 21:28:32 -0700 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: <20070525212214.20500.qmail@station183.com> (Jimmy Hill's message of "Fri, 25 May 2007 21:22:14 +0000") References: <20070525212214.20500.qmail@station183.com> Message-ID: > Any ideas on why the ibv_get_cq_event() would never see an event > after a "successful" send requesting a completion event? It's either a bug in your code or a bug in the stack below your code. The best way to debug this would be for you to post your actual code (in a form that someone else can run), so that we can either point out what's wrong with your code, or have a test case for the real bug. - R. From rdreier at cisco.com Mon May 28 21:33:17 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 May 2007 21:33:17 -0700 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race In-Reply-To: <20070529042741.GB13866@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 May 2007 07:27:41 +0300") References: <20070522005918.GB13311@mellanox.co.il> <20070524131154.GA7940@mellanox.co.il> <20070529042741.GB13866@mellanox.co.il> Message-ID: > Could you on this patch please? ?? 
From rdreier at cisco.com Mon May 28 21:40:27 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 May 2007 21:40:27 -0700 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix performance regression on Mellanox In-Reply-To: <20070528113727.GP2945@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 28 May 2007 14:37:27 +0300") References: <20070521120459.GI20400@mellanox.co.il> <46595950.6080106@voltaire.com> <20070527125337.GF8342@mellanox.co.il> <20070528113727.GP2945@mellanox.co.il> Message-ID: seems like this leaves rx_drain_qp in the data structure and also in the comment in ipoib.h... not sure if there are any other remnants of the previous approach that should be cleaned up. From mst at dev.mellanox.co.il Mon May 28 21:44:42 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 May 2007 07:44:42 +0300 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race In-Reply-To: References: <20070522005918.GB13311@mellanox.co.il> <20070524131154.GA7940@mellanox.co.il> <20070529042741.GB13866@mellanox.co.il> Message-ID: <20070529044442.GC13866@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH for-2.6.22] IB/ipoib: fix to_ipoib_neigh access race > > > Could you on this patch please? > > ?? Could you comment on this patch please? -- MST From rdreier at cisco.com Mon May 28 21:45:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 May 2007 21:45:19 -0700 Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git In-Reply-To: <20070528121206.GA1847@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 28 May 2007 15:12:06 +0300") References: <20070514045832.GA18615@mellanox.co.il> <20070528121206.GA1847@mellanox.co.il> Message-ID: > IB/ipoib: fix to_ipoib_neigh access race I'm not convinced this is 2.6.22 material at this point -- it doesn't fix any observed problem that I know of. 
(And the SRQ drain patch shows how even safe-looking patches can cause big problems) - R. From rdreier at cisco.com Mon May 28 21:46:26 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 28 May 2007 21:46:26 -0700 Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git In-Reply-To: <20070526194049.GD15942@mellanox.co.il> (Michael S. Tsirkin's message of "Sat, 26 May 2007 22:40:49 +0300") References: <20070526194049.GD15942@mellanox.co.il> Message-ID: > don't we want he patch that sets status to flushed with error? I figured I would test it a little and queue it for 2.6.23. I don't see a justification for putting in 2.6.22 since it's just paranoia not driven by any observed issue. From mst at dev.mellanox.co.il Mon May 28 21:48:15 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 May 2007 07:48:15 +0300 Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git In-Reply-To: References: <20070514045832.GA18615@mellanox.co.il> <20070528121206.GA1847@mellanox.co.il> Message-ID: <20070529044815.GD13866@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git > > > IB/ipoib: fix to_ipoib_neigh access race > > I'm not convinced this is 2.6.22 material at this point -- it doesn't > fix any observed problem that I know of. (And the SRQ drain patch > shows how even safe-looking patches can cause big problems) Fine, but we do have it in OFED - could you spare some cycles to review it? -- MST From mst at dev.mellanox.co.il Mon May 28 21:51:34 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 29 May 2007 07:51:34 +0300 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix performance regression on Mellanox In-Reply-To: References: <20070521120459.GI20400@mellanox.co.il> <46595950.6080106@voltaire.com> <20070527125337.GF8342@mellanox.co.il> <20070528113727.GP2945@mellanox.co.il> Message-ID: <20070529045134.GE13866@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH for-2.6.22] IB/ipoib: fix performance regression on Mellanox > > seems like this leaves rx_drain_qp in the data structure and also in > the comment in ipoib.h... Right, add this on top of it. > not sure if there are any other remnants of > the previous approach that should be cleaned up. Hopefully not - compiler'd notice any uses of rx_drain_qp, and that really is the only change. -----> Remove unused rx_drain_qp. Signed-off-by: Michael S. Tsirkin --- diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 158759e..285c143 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -156,7 +156,7 @@ struct ipoib_cm_data { * - and then invoke a Destroy QP or Reset QP. * * We use the second option and wait for a completion on the - * rx_drain_qp before destroying QPs attached to our SRQ. + * same CQ before destroying QPs attached to our SRQ. */ enum ipoib_cm_state { @@ -199,7 +199,6 @@ struct ipoib_cm_dev_priv { struct ib_srq *srq; struct ipoib_cm_rx_buf *srq_ring; struct ib_cm_id *id; - struct ib_qp *rx_drain_qp; /* generates WR described in 10.3.1 */ struct list_head passive_ids; /* state: LIVE */ struct list_head rx_error_list; /* state: ERROR */ struct list_head rx_flush_list; /* state: FLUSH, drain not started */ -- MST From mst at dev.mellanox.co.il Mon May 28 23:06:50 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 29 May 2007 09:06:50 +0300 Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git In-Reply-To: References: <20070514045832.GA18615@mellanox.co.il> <20070528121206.GA1847@mellanox.co.il> Message-ID: <20070529060626.GB6032@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git > > > IB/ipoib: fix to_ipoib_neigh access race > > I'm not convinced this is 2.6.22 material at this point -- it doesn't > fix any observed problem that I know of. OK, I removed it for now, and cleaned the unused rx_drain_qp field in the performance fix patch. What's left is: Michael S. Tsirkin (2): IB/mthca: fix send CQE with error for QP connected to SRQ IB/ipoib: fix performance regression on Mellanox Sean Hefty (1): ib/cm: fix stale connection detection These are all fixes for observed problems. > (And the SRQ drain patch > shows how even safe-looking patches can cause big problems) Yea. We did know it's a risky, big change - it just seemed we must fix it for IPoIB CM to be useful. -- MST From monisonlists at gmail.com Mon May 28 23:41:02 2007 From: monisonlists at gmail.com (Moni Shoua) Date: Tue, 29 May 2007 09:41:02 +0300 Subject: [ofa-general] Trouble install OFED 1.2-rc3 - ib-bonding In-Reply-To: References: Message-ID: <465BCAFE.2030001@gmail.com> Jeffrey Wong wrote: > Hello, > > I am installing the OFED 1.2-rc3. > > Everything else builds except for ib-bonding. I see you have kernel 2.6.18-8.1.4.el5 which is not supported by ib-bonding. It seems like a beta of RHEL5. Am I right? > > Thanks in advance.
> > I am getting the following error messages: > > + make -C /lib/modules/2.6.18-8.1.4.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding > > make: Entering directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64' > > CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o > > In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:78: > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_inactive_flags': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: (Each undeclared identifier is reported only once > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: for each function it appears in.)
> > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin > g.h: In function 'bond_set_slave_active_flags': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bondin > g.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this > > function) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c: In function 'bond_compute_features': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:1233: warning: comparison of distinct pointer types lacks a > > cast > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c: In function 'bond_enslave': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:1449: error: 'IFF_BONDING' undeclared (first use in this fu > > nction) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c: In function 'bond_release': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:1848: error: 'IFF_BONDING' undeclared (first use in this fu > > nction) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in t > > his function) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c: In function 'bond_arp_rcv': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:2548: error: 'IFF_BONDING' undeclared (first use in this fu > > nction) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c: In function 'bond_netdev_event': > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:3390: error: 'IFF_BONDING' undeclared (first use in this fu > > nction) > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c: In function 'bond_init': > > 
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:4374: warning: assignment discards qualifiers from pointer > > target type > > /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_m > ain.c:4386: error: 'IFF_BONDING' undeclared (first use in this fu > > nction) > > make[1]: *** > [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_ > main.o] Error 1 > > make: *** > [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bondi > ng] Error 2 > > make: Leaving directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64' > > + echo ' Building IB bonding driver failed' > > Building IB bonding driver failed > > + exit 1 > > error: Bad exit status from /var/tmp/rpm-tmp.23876 (%build) > > = > > > = > > > = > > > Jeff Wong > > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at dev.mellanox.co.il Tue May 29 00:17:01 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 May 2007 10:17:01 +0300 Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git In-Reply-To: References: <20070514045832.GA18615@mellanox.co.il> <20070528121206.GA1847@mellanox.co.il> Message-ID: <20070529071701.GA8159@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git > > > IB/ipoib: fix to_ipoib_neigh access race > > I'm not convinced this is 2.6.22 material at this point -- it doesn't > fix any observed problem that I know of. (And the SRQ drain patch > shows how even safe-looking patches can cause big problems) for-2.6.23 for now? 
-- MST

From tziporet at dev.mellanox.co.il Tue May 29 00:26:44 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 29 May 2007 10:26:44 +0300
Subject: [ofa-general] Re: [ewg] OFED 1.2 status & RC4
In-Reply-To:
References: <465AE791.5040003@mellanox.co.il>
Message-ID: <465BD5B4.50003@mellanox.co.il>

Scott Weitzenkamp (sweitzen) wrote:
> There were several IPoIB bugs marked fixed today, are all the IPoIB
> fixes in OFED-1.2-20070528-0600.tgz or do I need to wait another day?
>
> Scott

The IPoIB fixes are in OFED-1.2-20070528-0600.tgz.
The SRP fix will be in the next build.

Tziporet

From cap at nsc.liu.se Tue May 29 00:38:30 2007
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Tue, 29 May 2007 09:38:30 +0200
Subject: [ofa-general] Trouble install OFED 1.2-rc3 - ib-bonding
In-Reply-To: <465BCAFE.2030001@gmail.com>
References: <465BCAFE.2030001@gmail.com>
Message-ID: <200705290938.35022.cap@nsc.liu.se>

On Tuesday 29 May 2007, Moni Shoua wrote:
> Jeffrey Wong wrote:
> > Hello,
> >
> > I am installing the OFED 1.2-rc3.
> >
> > Everything else builds except for ib-bonding.
>
> I see you have kernel 2.6.18-8.1.4.el5 which is not supported by
> ib-bonding. It seems like a beta of RHEL5. Am I right?

No, that is _the_ current RHEL5 kernel (release + security updates).

/Peter

From ogerlitz at voltaire.com Tue May 29 00:56:00 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 29 May 2007 10:56:00 +0300
Subject: [ofa-general] Re: ipoib / bonding and OFED
In-Reply-To: <4657373E.2030903@hp.com>
References: <3857BB049D83424D9DB82753D37CEA55459C41@taurus.voltaire.com> <4657373E.2030903@hp.com>
Message-ID: <465BDC90.5080305@voltaire.com>

Bob Kossey wrote:
> I copied OR since I think this is related to his OFED HA work, and
> he might have some insights.
A few more questions for Or:
> I was trying to use ipoib bonding with OFED 1.2 rc2 and a 2.6.9 kernel,
> but was not able to get it to work so far. I saw your Sonoma bonding
> slides, and you mention kernel bonding driver changes were needed.

> 2. Is there a minimum kernel version, with the kernel bonding driver
> changes, that is required to use bonding with OFED ipoib?

Just to have a baseline here: to get bonding to work with IPoIB, you should use the bonding driver provided with OFED 1.2. This driver is the upstream one (from 2.6.20), patched to support IPoIB and backported to RH5, SLES10 and RH4 U3/4/5; other kernels are not supported.

If you are using the OFED bonding on a system that matches the support matrix, it should work. If you do have problems under this config, please either open a bug at the OFED bugzilla (bugs.openfabrics.org) assigned to monis at voltaire.com (Moni Shoua), or first send a report/question to Moni and CC ewg at lists.openfabrics.org.

Please note that some bugs were fixed between RC2 and RC4 (to be released today); you can search the bugzilla to see which.

> 3. The bonding driver uses the HWADDR from the underlying ipoib
> devices, how does it obtain the HWADDR? Does it use the full 20 bytes,
> or some subset?

When enslaving IPoIB devices, the bonding driver uses the full hw address of the active slave; it simply looks at the dev_addr field of the slave's struct net_device (see include/linux/netdevice.h).

> 4. What use_carrier options for link status detection does OFED ipoib
> support,
> MII, ETHTOOL or netif_carrier_ok?

The mii/ethtool etc. local link detection methods of the bonding driver are somewhat deprecated, since nowadays almost every network device supports the netif_carrier_ok call. The default of the upstream bonding driver (e.g. the one we use in OFED and the 2.6.21 one listed below) is to set the use_carrier module parameter to 1, that is, MII is not used anymore.
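
[Editorial illustration, not part of the original message.] For readers who want to see the carrier state that use_carrier=1 relies on, here is a minimal user-space C sketch. It is not code from the bonding driver or from OFED. It assumes the kernel exposes the standard sysfs attribute /sys/class/net/<ifname>/carrier, which reports the netif_carrier_ok() state while the interface is up. The helper name read_carrier is made up for this example.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not an OFED or bonding API): parse a sysfs-style
 * carrier file.
 * Returns 1 if the file reports carrier on, 0 if it reports carrier off,
 * and -1 on any error (file missing, or the read fails, which is what
 * happens for the real sysfs attribute when the interface is down).
 */
int read_carrier(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (!f)
        return -1;

    if (fscanf(f, "%d", &value) != 1) {
        fclose(f);
        return -1;
    }

    fclose(f);
    return value ? 1 : 0;
}
```

A caller would pass a path such as /sys/class/net/ib0/carrier, where ib0 is only an assumed interface name, and treat -1 as "state unknown" rather than "link down".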
> author: Thomas Davis, tadavis at lbl.gov and many others
> description: Ethernet Channel Bonding Driver, v3.1.2
> version: 3.1.2
> parm: use_carrier:Use netif_carrier_ok (vs MII ioctls) in miimon; 0 for off, 1 for on (default) (int)
> parm: miimon:Link check interval in milliseconds (int)

> If you have any good examples of bonding configuration settings that work
> with OFED, I'd appreciate that also.

The bonding RPM provided with OFED consists of a driver, a script and some help text containing usage examples. Please take a look there and let me know if you have further questions.

> $ rpm -ql ib-bonding-0.9.0-2.6.9_42.ELsmp
> /lib/modules/2.6.9-42.ELsmp/updates/kernel/drivers/net/bonding/bonding.ko
> /usr/bin/ib-bond
> /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt

The OFED service (/etc/init.d/openibd) was enhanced to allow for persistent bonding configuration; please see the bonding section of docs/ipoib_release_notes.txt to see how to do it.

Or.

From mst at dev.mellanox.co.il Tue May 29 02:12:46 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 12:12:46 +0300
Subject: [ofa-general] [PATCH] suppress RLIMIT warning for root user (was Re: there is a warning message in every use of the library libibverbs)
In-Reply-To:
References: <465AF3D3.10205@dev.mellanox.co.il>
Message-ID: <20070529091246.GF8159@mellanox.co.il>

root can register as much memory as he likes, so the rlimit value shouldn't matter in this case. Do not print a warning about RLIMIT being too low in this case.

Signed-off-by: Michael S. Tsirkin

---

> Quoting Roland Dreier :
> Subject: Re: there is a warning message in every use of the library libibverbs
>
> > In every test/application that uses the libibverbs (i think when the
> > libibverbs init function is being called)
> > i see the following warning:
> >
> > <-start->
> > libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> > This will severely limit memory registrations.
> > <-end->
> >
> > Why did you add this warning message?
>
> To avoid the FAQ of "memory registration / CQ creation fails and I
> don't know why".

OK, but kernel side actually ignores the rlimit value for the root user, so let's not print a warning in this case?

diff --git a/src/init.c b/src/init.c
index a17ae16..de485cb 100644
--- a/src/init.c
+++ b/src/init.c
@@ -417,10 +417,15 @@ static void check_memlock_limit(void)
 		return;
 	}
 
-	if (rlim.rlim_cur <= 32768)
-		fprintf(stderr, PFX "Warning: RLIMIT_MEMLOCK is %lu bytes.\n"
-			"    This will severely limit memory registrations.\n",
-			rlim.rlim_cur);
+	if (rlim.rlim_cur > 32768)
+		return;
+
+	if (!getuid())
+		return;
+
+	fprintf(stderr, PFX "Warning: RLIMIT_MEMLOCK is %lu bytes.\n"
+		"    This will severely limit memory registrations.\n",
+		rlim.rlim_cur);
 }
 
 static void add_device(struct ibv_device *dev,

-- MST

From mst at dev.mellanox.co.il Tue May 29 02:15:43 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 12:15:43 +0300
Subject: [ofa-general] libibverbs autogen failures in ubuntu dapper
Message-ID: <20070529091543.GG8159@mellanox.co.il>

Attempt to run autogen.sh on an ubuntu dapper laptop gave me this:

aclocal -I config
+ libtoolize --force --copy
Putting files in AC_CONFIG_AUX_DIR, `config'.
+ autoheader
+ automake --foreign --add-missing --copy
automake: Makefile.am: `src/libibverbs.la' is not a standard libtool library name
automake: Makefile.am: not supported: source file `src/cmd.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/compat-1_0.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/device.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/init.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/marshall.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/memory.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/sysfs.c' is in subdirectory
automake: Makefile.am: not supported: source file `src/verbs.c' is in subdirectory
Bareword found where operator expected at (eval 336) line 1, near "s/\@LTLIBRARY\@/src/libibverbs"
.........

and it fails to produce a working build:

./configure ... make

mst at mst-lt:~/scm/libibverbs$ make
Makefile:343: warning: overriding commands for target `@PROGRAM@'
Makefile:339: warning: ignoring old commands for target `@PROGRAM@'
Makefile:347: warning: overriding commands for target `@PROGRAM@'
Makefile:343: warning: ignoring old commands for target `@PROGRAM@'
Makefile:351: warning: overriding commands for target `@PROGRAM@'
Makefile:347: warning: ignoring old commands for target `@PROGRAM@'
Makefile:355: warning: overriding commands for target `@PROGRAM@'
Makefile:351: warning: ignoring old commands for target `@PROGRAM@'
Makefile:359: warning: overriding commands for target `@PROGRAM@'
Makefile:355: warning: ignoring old commands for target `@PROGRAM@'
Makefile:363: warning: overriding commands for target `@PROGRAM@'
Makefile:359: warning: ignoring old commands for target `@PROGRAM@'
make: *** No rule to make target `src/libibverbs.la', needed by `all-am'. Stop.

I think this worked at some point - any idea what's wrong now?
-- MST From vlad at lists.openfabrics.org Tue May 29 02:44:13 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 29 May 2007 02:44:13 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070529-0200 daily build status Message-ID: <20070529094414.392E3E6089D@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on ia64 with
linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From Koen.SEGERS at VRT.BE Tue May 29 03:03:27 2007 From: Koen.SEGERS at VRT.BE (SEGERS Koen) Date: Tue, 29 May 2007 12:03:27 +0200 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: Message-ID: Hi, Saturday we did a different stresstest. This is what we see in the /var/log/messages: May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11 There were errors from that time on. Can someone explain me what this message does? 
Koen -----Oorspronkelijk bericht----- Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] Verzonden: woensdag 23 mei 2007 17:41 Aan: SEGERS Koen; Hal Rosenstock CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org Onderwerp: RE: [ofa-general] GPFS node loses IB-connection Try 20 seconds, I'm curious if if you are barely crossing the 10-second threshold. Scott > -----Original Message----- > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > Sent: Wednesday, May 23, 2007 8:39 AM > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock > Cc: Clive Hall (clivhall); > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > Subject: RE: [ofa-general] GPFS node loses IB-connection > > What value would you recommend then? > > Koen > > -----Oorspronkelijk bericht----- > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > Verzonden: woensdag 23 mei 2007 17:38 > Aan: SEGERS Koen; Hal Rosenstock > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; > general at lists.openfabrics.org > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > The boot time of the host doesn't matter for this timeout. While the > host is booting, the IB link is down anyway. > > Scott > > > -----Original Message----- > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > Sent: Wednesday, May 23, 2007 8:20 AM > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen) > > Cc: Clive Hall (clivhall); > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > After a whole day of stresstesting with the MAD renicing > turned on, we > > got the error once. So I think I should raise the timeout on > > the switch > > also. > > > > It takes about 2 minutes to boot the system. Do you agree > > that this is a > > good value for the timeout? > > > > Scott, > > Can you explain me the problem of the memlock? 
> > > > I saw that the SLES10 bug is only an issue in MVAPICH. > Since we didn't > > install this, the bug is not related to us. This is > correct, isn't it? > > > > Greetz > > > > Koen > > > > -----Oorspronkelijk bericht----- > > Van: Hal Rosenstock [mailto:halr at voltaire.com] > > Verzonden: woensdag 23 mei 2007 16:12 > > Aan: Scott "Weitzenkamp (sweitzen) > > CC: SEGERS Koen; Clive Hall (clivhall); > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote: > > > No C code changes, just a few config file changes > (RENICE_IB_MAD=yes > > in > > > openib.conf, > > > > Does the host really not respond to MAD requests for over 10 > > seconds in > > some cases ? > > > > -- Hal > > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on > > > SLES10 for bug 267, etc.). > > > > > > Scott Weitzenkamp > > > SQA and Release Manager > > > Server Virtualization Business Unit > > > Cisco Systems > > > > > > > > > > -----Original Message----- > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > Sent: Wednesday, May 23, 2007 6:48 AM > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > general at lists.openfabrics.org; > > general-bounces at lists.openfabrics.org > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > This far, all tests seem to work. > > > > > > > > Thanks for the help! > > > > > > > > Scott, > > > > Are there more bugfixes that cisco does in its rpms? 
> > > > > > > > Greetz > > > > > > > > Koen > > > > > > > > -----Oorspronkelijk bericht----- > > > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > > > Verzonden: woensdag 23 mei 2007 0:39 > > > > Aan: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall > > (clivhall) > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; > > > > general-bounces at lists.openfabrics.org > > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > It's not so much pinging every 10 seconds as expecting a > > > > response within > > > > 10 seconds (Clive, correct me if I'm wrong). > > > > > > > > You only need to do 1) or 2), not both. Cisco configures 1) > > > > in the OFED > > > > binary RPMs we release at > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I > > > > prefer to have > > > > the host be more responsive. > > > > > > > > > > > > Scott Weitzenkamp > > > > SQA and Release Manager > > > > Server Virtualization Business Unit > > > > Cisco Systems > > > > > > > > > > > > > -----Original Message----- > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] > > > > > Sent: Tuesday, May 22, 2007 3:35 PM > > > > > To: Scott Weitzenkamp (sweitzen) > > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > > general at lists.openfabrics.org; > > general-bounces at lists.openfabrics.org > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > If I understand it wright, the switch is actually polling > > > > > (=pinging) the > > > > > interfaces every 10s. This means that when the interface is > > handling > > > > > other traffic, the poll can fail and the port could be > > > > > considered out of > > > > > service. My question is then: "How can the timeout be reached > > while > > > > > packets are being sent/received to/from the interface?" > > > > > > > > > > Anyway, what timeout-value would you recommend for > us? And why? 
> > > > > > > > > > To recapitulate: these are the actions I'll take tomorrow > > > > > 1) change the MAD niceness of the servers > > > > > 2) change the timeout on the switches > > > > > > > > > > Are these changes sufficient for the HCA's to keep > > their ports in > > > > > PORT_ACTIVE state? > > > > > > > > > > Regards, > > > > > > > > > > Koen > > > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp > > > > (sweitzen) wrote: > > > > > > Yes, you can tune it. Here's an example via the switch CLI: > > > > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix > fe:80:00:00:00:00:00:00 > > > > > > node-timeout > > > > > > > > > > > > The default is 10 seconds, it can be configured up to > > > > 2000 seconds. > > > > > > If a HCA is completely unresponsive for longer than the > > > > node-timeout > > > > > > value, then we consider that HCA out of service. > > > > > > > > > > > > Scott Weitzenkamp > > > > > > SQA and Release Manager > > > > > > Server Virtualization Business Unit > > > > > > Cisco Systems > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ______________________________________________________________ > > > > > > From: Shirley Ma [mailto:xma at us.ibm.com] > > > > > > Sent: Tuesday, May 22, 2007 11:30 AM > > > > > > To: koen.segers at VRT.BE > > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org; > > > > > > general-bounces at lists.openfabrics.org; Scott > > Weitzenkamp > > > > > > (sweitzen) > > > > > > Subject: RE: [ofa-general] GPFS node loses > > IB-connection > > > > > > > > > > > > > > > > > > > > > > > > Koen, > > > > > > > > > > > > So it is most likely you hit the same bug as > > 229 (Scott > > > > > > pointed out earlier). The same workaround might > > > > work for you > > > > > > by renicing ib_mad as Scott suggested. > > > > > > > > > > > > I think this should be a SM query timeout > > tunable value > > in > > > > > > Cisco SM. Am I right, Scott? 
> > > > > > > > > > > > Thanks > > > > > > Shirley Ma > > > > > > > > > > > > > > > > > > Inactive hide details for Koen Segers > > > > > Koen > > > > > > Segers > > > > > > > > > > > > > > > > > > Koen Segers > > > > > > > > > > > > > > > > > 05/22/07 11:14 AM > > > > > > Please respond to > > > > > > koen.segers at VRT.BE > > > > > > > > > > > > > > > > > > To > > > > > > > > > > > > Shirley > > > > > > Ma/Beaverton/IBM at IBMUS > > > > > > > > > > > > cc > > > > > > > > > > > > Ami Perlmutter > > > > > > , > > > > > general at lists.openfabrics.org, > > general-bounces at lists.openfabrics.org > > > > > > > > > > > > Subject > > > > > > > > > > > > RE: > > > > > > [ofa-general] > > > > > > GPFS node loses > > > > > > IB-connection > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > It is the Cisco SM. > > > > > > > > > > > > SFS-7000P> show version > > > > > > > > > > > > > > > > > > > > > > > ============================================================== > > > > > ================== > > > > > > System Version Information > > > > > > > > > > > ============================================================== > > > > > ================== > > > > > > system-version : SFS-7000P TopspinOS > > > > 2.9.0 releng > > > > > > #147 > > > > > > 10/25/2006 02:01:32 > > > > > > contact : tac at cisco.com > > > > > > name : SFS-7000P > > > > > > location : 170 West Tasman Drive, > > > > > San Jose, CA > > > > > > 95134 > > > > > > up-time : 11(d):7(h):49(m):3(s) > > > > > > last-change : none > > > > > > last-config-save : none > > > > > > action : none > > > > > > result : none > > > > > > oper-mode : normal > > > > > > > > > > > > There is also a command that gives the SM version, > > > > > but I can't > > > > > > find it > > > > > > right now. > > > > > > > > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote: > > > > > > > Hello Koen, > > > > > > > > > > > > > > From the switch log, it looks a SM issue to me. 
> > > > > The node was > > > > > > kicked > > > > > > > out of the membership. Which SM you are > > using in your > > > > > > fabric? > > > > > > > > > > > > > > Thanks > > > > > > > Shirley Ma > > > > > > > > > > > > > *** Disclaimer *** > > > > > > > > > > > > Vlaamse Radio- en Televisieomroep > > > > > > Auguste Reyerslaan 52, 1043 Brussel > > > > > > > > > > > > nv van publiek recht > > > > > > BTW BE 0244.142.664 > > > > > > RPR Brussel > > > > > > http://www.vrt.be/disclaimer > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *** Disclaimer *** > > > > > > > > > > Vlaamse Radio- en Televisieomroep > > > > > Auguste Reyerslaan 52, 1043 Brussel > > > > > > > > > > nv van publiek recht > > > > > BTW BE 0244.142.664 > > > > > RPR Brussel > > > > > http://www.vrt.be/disclaimer > > > > > > > > > > > > > > *** Disclaimer *** > > > > > > > > Vlaamse Radio- en Televisieomroep > > > > Auguste Reyerslaan 52, 1043 Brussel > > > > > > > > nv van publiek recht > > > > BTW BE 0244.142.664 > > > > RPR Brussel > > > > http://www.vrt.be/disclaimer > > > > > > > > > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > *** Disclaimer *** > > > > Vlaamse Radio- en Televisieomroep > > Auguste Reyerslaan 52, 1043 Brussel > > > > nv van publiek recht > > BTW BE 0244.142.664 > > RPR Brussel > > http://www.vrt.be/disclaimer > > > > > *** Disclaimer *** > > Vlaamse Radio- en Televisieomroep > Auguste Reyerslaan 52, 1043 Brussel > > nv van publiek recht > BTW BE 0244.142.664 > RPR Brussel > http://www.vrt.be/disclaimer > > *** Disclaimer *** Vlaamse Radio- en Televisieomroep Auguste Reyerslaan 52, 1043 Brussel nv van publiek recht BTW BE 0244.142.664 RPR Brussel http://www.vrt.be/disclaimer From amip at dev.mellanox.co.il Tue May 29 
04:35:07 2007
From: amip at dev.mellanox.co.il (Ami Perlmutter)
Date: Tue, 29 May 2007 14:35:07 +0300
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
References:
Message-ID: <1180438537.12048.5.camel@localhost>

This means you are getting a message your SDP does not recognize. Message 11 is a resize request, which was added to SDP a few days ago. Could it be that you are running two different versions of OFED?
Anyway, this doesn't pose any problem, so you can ignore it.

On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> Hi,
>
> Saturday we did a different stresstest.
> This is what we see in the /var/log/messages:
>
> May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
>
> There were errors from that time on. Can someone explain me what this
> message does?
>
> Koen
>
> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> Sent: Wednesday, May 23, 2007 17:41
> To: SEGERS Koen; Hal Rosenstock
> CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> Try 20 seconds, I'm curious if if you are barely crossing the 10-second
> threshold.
>
> Scott
>
> > -----Original Message-----
> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > Sent: Wednesday, May 23, 2007 8:39 AM
> > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > Cc: Clive Hall (clivhall);
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > What value would you recommend then?
> > > > Koen > > > > -----Oorspronkelijk bericht----- > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > Verzonden: woensdag 23 mei 2007 17:38 > > Aan: SEGERS Koen; Hal Rosenstock > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; > > general at lists.openfabrics.org > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > The boot time of the host doesn't matter for this timeout. While the > > host is booting, the IB link is down anyway. > > > > Scott > > > > > -----Original Message----- > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > Sent: Wednesday, May 23, 2007 8:20 AM > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen) > > > Cc: Clive Hall (clivhall); > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > After a whole day of stresstesting with the MAD renicing > > turned on, we > > > got the error once. So I think I should raise the timeout on > > > the switch > > > also. > > > > > > It takes about 2 minutes to boot the system. Do you agree > > > that this is a > > > good value for the timeout? > > > > > > Scott, > > > Can you explain me the problem of the memlock? > > > > > > I saw that the SLES10 bug is only an issue in MVAPICH. > > Since we didn't > > > install this, the bug is not related to us. This is > > correct, isn't it? 
> > > > > > Greetz > > > > > > Koen > > > > > > -----Oorspronkelijk bericht----- > > > Van: Hal Rosenstock [mailto:halr at voltaire.com] > > > Verzonden: woensdag 23 mei 2007 16:12 > > > Aan: Scott "Weitzenkamp (sweitzen) > > > CC: SEGERS Koen; Clive Hall (clivhall); > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote: > > > > No C code changes, just a few config file changes > > (RENICE_IB_MAD=yes > > > in > > > > openib.conf, > > > > > > Does the host really not respond to MAD requests for over 10 > > > seconds in > > > some cases ? > > > > > > -- Hal > > > > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on > > > > SLES10 for bug 267, etc.). > > > > > > > > Scott Weitzenkamp > > > > SQA and Release Manager > > > > Server Virtualization Business Unit > > > > Cisco Systems > > > > > > > > > > > > > -----Original Message----- > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > > Sent: Wednesday, May 23, 2007 6:48 AM > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > > general at lists.openfabrics.org; > > > general-bounces at lists.openfabrics.org > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > This far, all tests seem to work. > > > > > > > > > > Thanks for the help! > > > > > > > > > > Scott, > > > > > Are there more bugfixes that cisco does in its rpms? 
> > > > > > > > > > Greetz > > > > > > > > > > Koen > > > > > > > > > > -----Oorspronkelijk bericht----- > > > > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > > > > Verzonden: woensdag 23 mei 2007 0:39 > > > > > Aan: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall > > > (clivhall) > > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; > > > > > general-bounces at lists.openfabrics.org > > > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > It's not so much pinging every 10 seconds as expecting a > > > > > response within > > > > > 10 seconds (Clive, correct me if I'm wrong). > > > > > > > > > > You only need to do 1) or 2), not both. Cisco configures 1) > > > > > in the OFED > > > > > binary RPMs we release at > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I > > > > > prefer to have > > > > > the host be more responsive. > > > > > > > > > > > > > > > Scott Weitzenkamp > > > > > SQA and Release Manager > > > > > Server Virtualization Business Unit > > > > > Cisco Systems > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM > > > > > > To: Scott Weitzenkamp (sweitzen) > > > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > > > general at lists.openfabrics.org; > > > general-bounces at lists.openfabrics.org > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > If I understand it wright, the switch is actually polling > > > > > > (=pinging) the > > > > > > interfaces every 10s. This means that when the interface is > > > handling > > > > > > other traffic, the poll can fail and the port could be > > > > > > considered out of > > > > > > service. My question is then: "How can the timeout be reached > > > while > > > > > > packets are being sent/received to/from the interface?" 
> > > > > > > > > > > > Anyway, what timeout-value would you recommend for > > us? And why? > > > > > > > > > > > > To recapitulate: these are the actions I'll take tomorrow > > > > > > 1) change the MAD niceness of the servers > > > > > > 2) change the timeout on the switches > > > > > > > > > > > > Are these changes sufficient for the HCA's to keep > > > their ports in > > > > > > PORT_ACTIVE state? > > > > > > > > > > > > Regards, > > > > > > > > > > > > Koen > > > > > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp > > > > > (sweitzen) wrote: > > > > > > > Yes, you can tune it. Here's an example via the switch CLI: > > > > > > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix > > fe:80:00:00:00:00:00:00 > > > > > > > node-timeout > > > > > > > > > > > > > > The default is 10 seconds, it can be configured up to > > > > > 2000 seconds. > > > > > > > If a HCA is completely unresponsive for longer than the > > > > > node-timeout > > > > > > > value, then we consider that HCA out of service. > > > > > > > > > > > > > > Scott Weitzenkamp > > > > > > > SQA and Release Manager > > > > > > > Server Virtualization Business Unit > > > > > > > Cisco Systems > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ______________________________________________________________ > > > > > > > From: Shirley Ma [mailto:xma at us.ibm.com] > > > > > > > Sent: Tuesday, May 22, 2007 11:30 AM > > > > > > > To: koen.segers at VRT.BE > > > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org; > > > > > > > general-bounces at lists.openfabrics.org; Scott > > > Weitzenkamp > > > > > > > (sweitzen) > > > > > > > Subject: RE: [ofa-general] GPFS node loses > > > IB-connection > > > > > > > > > > > > > > > > > > > > > > > > > > > > Koen, > > > > > > > > > > > > > > So it is most likely you hit the same bug as > > > 229 (Scott > > > > > > > pointed out earlier). 
The same workaround might > > > > > work for you > > > > > > > by renicing ib_mad as Scott suggested. > > > > > > > > > > > > > > I think this should be a SM query timeout > > > tunable value > > > in > > > > > > > Cisco SM. Am I right, Scott? > > > > > > > > > > > > > > Thanks > > > > > > > Shirley Ma > > > > > > > > > > > > > > > > > > > > > Inactive hide details for Koen Segers > > > > > > Koen > > > > > > > Segers > > > > > > > > > > > > > > > > > > > > > Koen Segers > > > > > > > > > > > > > > > > > > > > 05/22/07 11:14 AM > > > > > > > Please respond to > > > > > > > koen.segers at VRT.BE > > > > > > > > > > > > > > > > > > > > > To > > > > > > > > > > > > > > Shirley > > > > > > > Ma/Beaverton/IBM at IBMUS > > > > > > > > > > > > > > cc > > > > > > > > > > > > > > Ami Perlmutter > > > > > > > , > > > > > > general at lists.openfabrics.org, > > > general-bounces at lists.openfabrics.org > > > > > > > > > > > > > > Subject > > > > > > > > > > > > > > RE: > > > > > > > [ofa-general] > > > > > > > GPFS node loses > > > > > > > IB-connection > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > It is the Cisco SM. 
> > > > > > > > > > > > > > SFS-7000P> show version > > > > > > > > > > > > > > > > > > > > > > > > > > > ============================================================== > > > > > > ================== > > > > > > > System Version Information > > > > > > > > > > > > > ============================================================== > > > > > > ================== > > > > > > > system-version : SFS-7000P TopspinOS > > > > > 2.9.0 releng > > > > > > > #147 > > > > > > > 10/25/2006 02:01:32 > > > > > > > contact : tac at cisco.com > > > > > > > name : SFS-7000P > > > > > > > location : 170 West Tasman Drive, > > > > > > San Jose, CA > > > > > > > 95134 > > > > > > > up-time : 11(d):7(h):49(m):3(s) > > > > > > > last-change : none > > > > > > > last-config-save : none > > > > > > > action : none > > > > > > > result : none > > > > > > > oper-mode : normal > > > > > > > > > > > > > > There is also a command that gives the SM version, > > > > > > but I can't > > > > > > > find it > > > > > > > right now. > > > > > > > > > > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote: > > > > > > > > Hello Koen, > > > > > > > > > > > > > > > > From the switch log, it looks a SM issue to me. > > > > > > The node was > > > > > > > kicked > > > > > > > > out of the membership. Which SM you are > > > using in your > > > > > > > fabric? 
> > > > > > > > > > > > > > > > Thanks > > > > > > > > Shirley Ma > > > > > > > > > > > > > > > *** Disclaimer *** > > > > > > > > > > > > > > Vlaamse Radio- en Televisieomroep > > > > > > > Auguste Reyerslaan 52, 1043 Brussel > > > > > > > > > > > > > > nv van publiek recht > > > > > > > BTW BE 0244.142.664 > > > > > > > RPR Brussel > > > > > > > http://www.vrt.be/disclaimer > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *** Disclaimer *** > > > > > > > > > > > > Vlaamse Radio- en Televisieomroep > > > > > > Auguste Reyerslaan 52, 1043 Brussel > > > > > > > > > > > > nv van publiek recht > > > > > > BTW BE 0244.142.664 > > > > > > RPR Brussel > > > > > > http://www.vrt.be/disclaimer > > > > > > > > > > > > > > > > > *** Disclaimer *** > > > > > > > > > > Vlaamse Radio- en Televisieomroep > > > > > Auguste Reyerslaan 52, 1043 Brussel > > > > > > > > > > nv van publiek recht > > > > > BTW BE 0244.142.664 > > > > > RPR Brussel > > > > > http://www.vrt.be/disclaimer > > > > > > > > > > > > > > _______________________________________________ > > > > general mailing list > > > > general at lists.openfabrics.org > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib-general > > > > > > *** Disclaimer *** > > > > > > Vlaamse Radio- en Televisieomroep > > > Auguste Reyerslaan 52, 1043 Brussel > > > > > > nv van publiek recht > > > BTW BE 0244.142.664 > > > RPR Brussel > > > http://www.vrt.be/disclaimer > > > > > > > > *** Disclaimer *** > > > > Vlaamse Radio- en Televisieomroep > > Auguste Reyerslaan 52, 1043 Brussel > > > > nv van publiek recht > > BTW BE 0244.142.664 > > RPR Brussel > > http://www.vrt.be/disclaimer > > > > > *** Disclaimer *** > > Vlaamse Radio- en Televisieomroep > Auguste Reyerslaan 52, 1043 Brussel > > nv van publiek recht > BTW BE 0244.142.664 > RPR Brussel > http://www.vrt.be/disclaimer > > > 
From Koen.SEGERS at VRT.BE Tue May 29 04:37:18 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 29 May 2007 13:37:18 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1180438537.12048.5.camel@localhost>
Message-ID:

We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
this message was added only a few days ago.

How can you be so sure that this doesn't pose any problems?

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
Sent: Tuesday, May 29, 2007 13:35
To: SEGERS Koen
Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

this means you are getting a message your SDP does not recognize.
message 11 is resize request which was added to sdp a few days ago.
can it be that you are running 2 different versions of OFED?
anyway, this doesn't pose any problem so you can ignore it.

On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> Hi,
>
> Saturday we did a different stresstest.
> This is what we see in the /var/log/messages:
>
> May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
>
> There were errors from that time on. Can someone explain me what this
> message does?
>
> Koen
From amip at dev.mellanox.co.il Tue May 29 05:02:42 2007
From: amip at dev.mellanox.co.il (Ami Perlmutter)
Date: Tue, 29 May 2007 15:02:42 +0300
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
References:
Message-ID: <1180440193.12048.9.camel@localhost>

If this is an actual resize request, then there is no problem when it is
dropped. Since you are running RC1, no resize requests should be sent, so
this means there is a problem, since data could be dropped.
Do you notice lost data?

On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> this message was added only a few days ago.
>
> How can you be so sure that this doesn't pose any problems?
>
> Koen
>
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> Sent: Tuesday, May 29, 2007 13:35
> To: SEGERS Koen
> Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> this means you are getting a message your SDP does not recognize.
> message 11 is resize request which was added to sdp a few days ago.
> can it be that you are running 2 different versions of OFED?
> anyway, this doesn't pose any problem so you can ignore it.
>
> On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > Hi,
> >
> > Saturday we did a different stresstest.
> > This is what we see in the /var/log/messages: > > > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11 > > > > There were errors from that time on. Can someone explain me what this > > message does? > > > > Koen > > > > -----Oorspronkelijk bericht----- > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > Verzonden: woensdag 23 mei 2007 17:41 > > Aan: SEGERS Koen; Hal Rosenstock > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; > > general at lists.openfabrics.org > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > Try 20 seconds, I'm curious if if you are barely crossing the > 10-second > > threshold. > > > > Scott > > > > > -----Original Message----- > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > Sent: Wednesday, May 23, 2007 8:39 AM > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock > > > Cc: Clive Hall (clivhall); > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > What value would you recommend then? > > > > > > Koen > > > > > > -----Oorspronkelijk bericht----- > > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > > Verzonden: woensdag 23 mei 2007 17:38 > > > Aan: SEGERS Koen; Hal Rosenstock > > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; > > > general at lists.openfabrics.org > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > The boot time of the host doesn't matter for this timeout. While > the > > > host is booting, the IB link is down anyway. 
> > > > > > Scott > > > > > > > -----Original Message----- > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > Sent: Wednesday, May 23, 2007 8:20 AM > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen) > > > > Cc: Clive Hall (clivhall); > > > > general-bounces at lists.openfabrics.org; > general at lists.openfabrics.org > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > After a whole day of stresstesting with the MAD renicing > > > turned on, we > > > > got the error once. So I think I should raise the timeout on > > > > the switch > > > > also. > > > > > > > > It takes about 2 minutes to boot the system. Do you agree > > > > that this is a > > > > good value for the timeout? > > > > > > > > Scott, > > > > Can you explain me the problem of the memlock? > > > > > > > > I saw that the SLES10 bug is only an issue in MVAPICH. > > > Since we didn't > > > > install this, the bug is not related to us. This is > > > correct, isn't it? > > > > > > > > Greetz > > > > > > > > Koen > > > > > > > > -----Oorspronkelijk bericht----- > > > > Van: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Verzonden: woensdag 23 mei 2007 16:12 > > > > Aan: Scott "Weitzenkamp (sweitzen) > > > > CC: SEGERS Koen; Clive Hall (clivhall); > > > > general-bounces at lists.openfabrics.org; > general at lists.openfabrics.org > > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote: > > > > > No C code changes, just a few config file changes > > > (RENICE_IB_MAD=yes > > > > in > > > > > openib.conf, > > > > > > > > Does the host really not respond to MAD requests for over 10 > > > > seconds in > > > > some cases ? > > > > > > > > -- Hal > > > > > > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on > > > > > SLES10 for bug 267, etc.). 
> > > > > > > > > > Scott Weitzenkamp > > > > > SQA and Release Manager > > > > > Server Virtualization Business Unit > > > > > Cisco Systems > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > > > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > > > general at lists.openfabrics.org; > > > > general-bounces at lists.openfabrics.org > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > This far, all tests seem to work. > > > > > > > > > > > > Thanks for the help! > > > > > > > > > > > > Scott, > > > > > > Are there more bugfixes that cisco does in its rpms? > > > > > > > > > > > > Greetz > > > > > > > > > > > > Koen > > > > > > > > > > > > -----Oorspronkelijk bericht----- > > > > > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > > > > > Verzonden: woensdag 23 mei 2007 0:39 > > > > > > Aan: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall > > > > (clivhall) > > > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; > > > > > > general-bounces at lists.openfabrics.org > > > > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > It's not so much pinging every 10 seconds as expecting a > > > > > > response within > > > > > > 10 seconds (Clive, correct me if I'm wrong). > > > > > > > > > > > > You only need to do 1) or 2), not both. Cisco configures 1) > > > > > > in the OFED > > > > > > binary RPMs we release at > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I > > > > > > prefer to have > > > > > > the host be more responsive. 
> > > > > > > > > > > > > > > > > > Scott Weitzenkamp > > > > > > SQA and Release Manager > > > > > > Server Virtualization Business Unit > > > > > > Cisco Systems > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] > > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM > > > > > > > To: Scott Weitzenkamp (sweitzen) > > > > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > > > > general at lists.openfabrics.org; > > > > general-bounces at lists.openfabrics.org > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > > > If I understand it wright, the switch is actually polling > > > > > > > (=pinging) the > > > > > > > interfaces every 10s. This means that when the interface is > > > > handling > > > > > > > other traffic, the poll can fail and the port could be > > > > > > > considered out of > > > > > > > service. My question is then: "How can the timeout be > reached > > > > while > > > > > > > packets are being sent/received to/from the interface?" > > > > > > > > > > > > > > Anyway, what timeout-value would you recommend for > > > us? And why? > > > > > > > > > > > > > > To recapitulate: these are the actions I'll take tomorrow > > > > > > > 1) change the MAD niceness of the servers > > > > > > > 2) change the timeout on the switches > > > > > > > > > > > > > > Are these changes sufficient for the HCA's to keep > > > > their ports in > > > > > > > PORT_ACTIVE state? > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > Koen > > > > > > > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp > > > > > > (sweitzen) wrote: > > > > > > > > Yes, you can tune it. Here's an example via the switch > CLI: > > > > > > > > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix > > > fe:80:00:00:00:00:00:00 > > > > > > > > node-timeout > > > > > > > > > > > > > > > > The default is 10 seconds, it can be configured up to > > > > > > 2000 seconds. 
> > > > > > > > If a HCA is completely unresponsive for longer than the > > > > > > node-timeout > > > > > > > > value, then we consider that HCA out of service. > > > > > > > > > > > > > > > > Scott Weitzenkamp > > > > > > > > SQA and Release Manager > > > > > > > > Server Virtualization Business Unit > > > > > > > > Cisco Systems > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ______________________________________________________________ > > > > > > > > From: Shirley Ma [mailto:xma at us.ibm.com] > > > > > > > > Sent: Tuesday, May 22, 2007 11:30 AM > > > > > > > > To: koen.segers at VRT.BE > > > > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org; > > > > > > > > general-bounces at lists.openfabrics.org; Scott > > > > Weitzenkamp > > > > > > > > (sweitzen) > > > > > > > > Subject: RE: [ofa-general] GPFS node loses > > > > IB-connection > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Koen, > > > > > > > > > > > > > > > > So it is most likely you hit the same bug as > > > > 229 (Scott > > > > > > > > pointed out earlier). The same workaround might > > > > > > work for you > > > > > > > > by renicing ib_mad as Scott suggested. > > > > > > > > > > > > > > > > I think this should be a SM query timeout > > > > tunable value > > > > in > > > > > > > > Cisco SM. Am I right, Scott? 
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Shirley Ma
> > > > > > > >
> > > > > > > > Koen Segers
> > > > > > > > 05/22/07 11:14 AM
> > > > > > > > Please respond to koen.segers at VRT.BE
> > > > > > > >
> > > > > > > > To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > cc: Ami Perlmutter, general at lists.openfabrics.org,
> > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > It is the Cisco SM.
> > > > > > > >
> > > > > > > > SFS-7000P> show version
> > > > > > > >
> > > > > > > > ================================================================================
> > > > > > > >                         System Version Information
> > > > > > > > ================================================================================
> > > > > > > >   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > >          contact : tac at cisco.com
> > > > > > > >             name : SFS-7000P
> > > > > > > >         location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > >          up-time : 11(d):7(h):49(m):3(s)
> > > > > > > >      last-change : none
> > > > > > > > last-config-save : none
> > > > > > > >           action : none
> > > > > > > >           result : none
> > > > > > > >        oper-mode : normal
> > > > > > > >
> > > > > > > > There is also a command that gives the SM version, but I can't find it
> > > > > > > > right now.
> > > > > > > >
> > > > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > > Hello Koen,
> > > > > > > > >
> > > > > > > > > From the switch log, it looks like an SM issue to me. The node was
> > > > > > > > > kicked out of the membership. Which SM are you using in your
> > > > > > > > > fabric?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Shirley Ma
> > > > > > > >
> > > > > > > > *** Disclaimer ***
> > > > > > > >
> > > > > > > > Vlaamse Radio- en Televisieomroep
> > > > > > > > Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > >
> > > > > > > > nv van publiek recht
> > > > > > > > BTW BE 0244.142.664
> > > > > > > > RPR Brussel
> > > > > > > > http://www.vrt.be/disclaimer
> > > > >
> > > > > _______________________________________________
> > > > > general mailing list
> > > > > general at lists.openfabrics.org
> > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > >
> > > > > To unsubscribe, please visit
> > > > > http://openib.org/mailman/listinfo/openib-general

From Koen.SEGERS at VRT.BE Tue May 29 05:28:57 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 29 May 2007 14:28:57 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1180440193.12048.9.camel@localhost>
Message-ID:

One of the machines has 2 dropped packets:

gpfswhbe2n1:~ # ifconfig ib0
ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
          inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
          TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)

Can this be related? Does anyone know how this is possible with SDP?
I thought SDP ran over an RC (reliable connection) transport.
I'm also curious how GPFS reacts to this.

Do you know where I can find the timestamp of these dropped packets?

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
Sent: Tuesday, 29 May 2007 14:03
To: SEGERS Koen
CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

If this is an actual resize request, then there is no problem when it is
dropped.
Since you are running rc1, no resize requests should be sent, so this
means there is a problem, since data could be dropped.
Do you notice lost data?

On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> this message was added only a few days ago.
>
> How can you be so sure that this doesn't pose any problems?
>
> Koen
>
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> Sent: Tuesday, 29 May 2007 13:35
> To: SEGERS Koen
> CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> This means you are getting a message your SDP does not recognize.
> Message 11 is a resize request, which was added to SDP a few days ago.
> Can it be that you are running 2 different versions of OFED?
> Anyway, this doesn't pose any problem, so you can ignore it.
>
> On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > Hi,
> >
> > Saturday we did a different stress test.
> > This is what we see in /var/log/messages:
> >
> > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> >
> > There were errors from that time on. Can someone explain to me what
> > this message means?
> >
> > Koen
> >
> > -----Original Message-----
> > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > Sent: Wednesday, 23 May 2007 17:41
> > To: SEGERS Koen; Hal Rosenstock
> > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > Try 20 seconds, I'm curious if you are barely crossing the 10-second
> > threshold.
> >
> > Scott
> >
> > > -----Original Message-----
> > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >
> > > What value would you recommend then?
> > >
> > > Koen
> > >
> > > -----Original Message-----
> > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > Sent: Wednesday, 23 May 2007 17:38
> > > To: SEGERS Koen; Hal Rosenstock
> > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >
> > > The boot time of the host doesn't matter for this timeout. While the
> > > host is booting, the IB link is down anyway.
> > >
> > > Scott
> > >
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > After a whole day of stress testing with the MAD renicing turned on,
> > > > we got the error once. So I think I should raise the timeout on the
> > > > switch as well.
> > > >
> > > > It takes about 2 minutes to boot the system. Do you agree that this
> > > > is a good value for the timeout?
> > > >
> > > > Scott,
> > > > Can you explain the memlock problem to me?
> > > >
> > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > > > install this, the bug is not related to us. This is correct, isn't it?
> > > >
> > > > Greetz
> > > >
> > > > Koen
> > > >
> > > > -----Original Message-----
> > > > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > > > Sent: Wednesday, 23 May 2007 16:12
> > > > To: Scott Weitzenkamp (sweitzen)
> > > > CC: SEGERS Koen; Clive Hall (clivhall);
> > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > No C code changes, just a few config file changes
> > > > > (RENICE_IB_MAD=yes in openib.conf,
> > > >
> > > > Does the host really not respond to MAD requests for over 10 seconds
> > > > in some cases?
> > > >
> > > > -- Hal
> > > >
> > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > SLES10 for bug 267, etc.).
> > > > >
> > > > > Scott Weitzenkamp
> > > > > SQA and Release Manager
> > > > > Server Virtualization Business Unit
> > > > > Cisco Systems
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > So far, all tests seem to work.
> > > > > >
> > > > > > Thanks for the help!
> > > > > >
> > > > > > Scott,
> > > > > > Are there more bug fixes that Cisco makes in its RPMs?
> > > > > >
> > > > > > Greetz
> > > > > >
> > > > > > Koen
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > > Sent: Wednesday, 23 May 2007 0:39
> > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > It's not so much pinging every 10 seconds as expecting a response
> > > > > > within 10 seconds (Clive, correct me if I'm wrong).
> > > > > >
> > > > > > You only need to do 1) or 2), not both. Cisco configures 1) in the
> > > > > > OFED binary RPMs we release at
> > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I prefer to
> > > > > > have the host be more responsive.
> > > > > >
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems

*** Disclaimer ***

Vlaamse Radio- en Televisieomroep
Auguste Reyerslaan 52, 1043 Brussel

nv van publiek recht
BTW BE 0244.142.664
RPR Brussel
http://www.vrt.be/disclaimer

From FENKES at de.ibm.com Tue May 29 05:44:32 2007
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Tue, 29 May 2007 14:44:32 +0200
Subject: [ofa-general] Re: [PATCH] IB/ehca: Refactor "maybe missed event" code
In-Reply-To:
Message-ID:

Roland Dreier wrote on 24.05.2007 19:40:39:

> This isn't fixing anything is it? I think it's 2.6.23 material;
> correct me if I'm wrong.

Right, it doesn't fix things, just coalesces some code.

Joachim

From eli at mellanox.co.il Tue May 29 06:00:46 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Tue, 29 May 2007 16:00:46 +0300
Subject: [ofa-general] [PATCH] libmlx4: fix qp capabilities
Message-ID: <1180443676.6825.8.camel@mtls03>

Fix calculation of max inline returned to the user. Without this fix,
the size of inline may increase every time create qp is called with
the previous values returned. For example, here is a quote from the
output of the test showing the problem:

request: cap.max_send_sge = 1, cap.max_inline_data = 0
got:     cap.max_send_sge = 5, cap.max_inline_data = 76
request: cap.max_send_sge = 5, cap.max_inline_data = 76
got:     cap.max_send_sge = 13, cap.max_inline_data = 204

Signed-off-by: Eli Cohen
---
Index: libmlx4/src/qp.c
===================================================================
--- libmlx4.orig/src/qp.c	2007-05-29 13:13:57.000000000 +0300
+++ libmlx4/src/qp.c	2007-05-29 14:41:33.000000000 +0300
@@ -396,12 +396,13 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd,
 		cap->max_send_sge = 1;
 
 	qp->rq.max_gs = cap->max_recv_sge;
-	qp->sq.max_gs = cap->max_send_sge;
 	max_sq_sge = align(cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg),
 			   sizeof (struct mlx4_wqe_data_seg)) / sizeof (struct mlx4_wqe_data_seg);
 	if (max_sq_sge < cap->max_send_sge)
 		max_sq_sge = cap->max_send_sge;
 
+	qp->sq.max_gs = max_sq_sge;
+
 	qp->sq.wrid = malloc(qp->sq.max * sizeof (uint64_t));
 	if (!qp->sq.wrid)
 		return -1;
@@ -419,6 +420,7 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd,
 		; /* nothing */
 
 	size = max_sq_sge * sizeof (struct mlx4_wqe_data_seg);
+	qp->max_inline_data = size - sizeof (struct mlx4_wqe_inline_seg);
 	switch (type) {
 	case IBV_QPT_UD:
 		size += sizeof (struct mlx4_wqe_datagram_seg);
@@ -482,26 +484,7 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd,
 void mlx4_set_sq_sizes(struct mlx4_qp *qp, struct ibv_qp_cap *cap,
 		       enum ibv_qp_type type)
 {
-	int wqe_size;
-
-	wqe_size = 1 << qp->sq.wqe_shift;
-	switch (type) {
-	case IBV_QPT_UD:
-		wqe_size -= sizeof (struct mlx4_wqe_datagram_seg);
-		break;
-
-	case IBV_QPT_UC:
-	case IBV_QPT_RC:
-		wqe_size -= sizeof (struct mlx4_wqe_raddr_seg);
-		break;
-
-	default:
-		break;
-	}
-
-	qp->sq.max_gs = wqe_size / sizeof (struct mlx4_wqe_data_seg);
 	cap->max_send_sge = qp->sq.max_gs;
-	qp->max_inline_data = wqe_size - sizeof (struct mlx4_wqe_inline_seg);
 	cap->max_inline_data = qp->max_inline_data;
 }
 

From jsquyres at cisco.com Tue May 29 06:06:17 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 29 May 2007 09:06:17 -0400
Subject: [ofa-general] Updating OFED teleconferences
Message-ID: <4D7D1F84-28E1-4CF4-8340-1645E0FBF4B6@cisco.com>

You're about to get some Outlook invites for
upcoming OFED teleconferences. I'll send a summary after the invites
are sent (I need to get the teleconference codes before I can send the
summary).

--
Jeff Squyres
Cisco Systems

From jsquyres at cisco.com Tue May 29 06:19:03 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 29 May 2007 09:19:03 -0400
Subject: [ofa-general] Upcoming OFED teleconferences
Message-ID: <338EEB24-0F0E-443D-94AA-61E512611F8B@cisco.com>

Short version:
--------------

Upcoming OFED teleconferences, all at noon US Eastern / 9am US Pacific /
7pm Israel.

1. Wednesday, May 30 (*TOMORROW*), code 210262040
2. Monday, June 4, code 2102061
3. Monday, June 11, code 210213621
4. Monday, June 18, code 2102061
5. Monday, June 25, code 210213621

US/Canada: +1.866.432.9903
India: +91.80.4103.3979
Israel: +972.9.892.7026
Others: http://cisco.com/en/US/about/doing_business/conferencing/

Longer version:
---------------

You just got 2 Outlook invites for upcoming OFED teleconferences:

1. Due to the US holiday, there was no OFED teleconference yesterday.
Today is also not good for several OF members, so there will be an
OFED teleconference tomorrow (Wednesday, 30 May 2007) at the normal
time.

2. There will also be weekly teleconferences throughout June 2007. We
already had teleconferences scheduled for the 4th and 18th; I just
added teleconferences for the 11th and 25th.

Once OFED v1.2 is released, we'll be returning to bi-weekly OFED
teleconferences.

--
Jeff Squyres
Cisco Systems

From Koen.SEGERS at VRT.BE Tue May 29 06:29:48 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 29 May 2007 15:29:48 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
Message-ID:

I just remembered that, with SDP, these values aren't related anymore.
SDP doesn't give this kind of information to the OS.
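
On the timestamp question raised in this thread: the ifconfig counters are
cumulative totals and carry no per-packet timestamps, so the closest
approximation is to sample a counter periodically and note when it changes.
The sketch below does that. It is only an illustration under assumptions
that are not from the thread: that the kernel exposes the counter as a
plain-text file such as /sys/class/net/ib0/statistics/tx_dropped, and the
helper names read_counter, report_increase and watch_counter are made up
here. It watches the IPoIB interface counter only, so it says nothing about
SDP traffic, and the reported time is accurate only to the polling interval.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Read one integer counter from a sysfs-style text file.
 * Returns the counter value, or -1 if the file cannot be opened or parsed.
 * The path is supplied by the caller; nothing here is specific to ib0. */
long long read_counter(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    long long value = -1;
    if (fscanf(f, "%lld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Compare two samples of the same counter. If it grew, print a line
 * stamped with `now` and return the increase; otherwise return 0. */
long long report_increase(const char *name, long long prev, long long cur,
                          time_t now)
{
    if (prev < 0 || cur <= prev)
        return 0;

    char stamp[32] = "unknown time";
    struct tm *tm = localtime(&now);
    if (tm != NULL)
        strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", tm);

    printf("%s %s increased by %lld (now %lld)\n", stamp, name,
           cur - prev, cur);
    return cur - prev;
}

/* Sample the counter `samples` times, `interval_s` seconds apart,
 * reporting every increase seen between two consecutive samples. */
void watch_counter(const char *path, unsigned interval_s, int samples)
{
    long long prev = read_counter(path);

    for (int i = 0; i < samples; i++) {
        sleep(interval_s);
        long long cur = read_counter(path);
        report_increase(path, prev, cur, time(NULL));
        if (cur >= 0)
            prev = cur;
    }
}
```

A call such as watch_counter("/sys/class/net/ib0/statistics/tx_dropped", 1, 3600)
would sample once per second for an hour; a shorter interval narrows the
time window in which a drop can be placed, at the cost of more polling.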
Koen -----Oorspronkelijk bericht----- Van: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] Namens SEGERS Koen Verzonden: dinsdag 29 mei 2007 14:29 Aan: amip at dev.mellanox.co.il CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org Onderwerp: RE: [ofa-general] GPFS node loses IB-connection One of the machines has 2 dropped packets: gpfswhbe2n1:~ # ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:192.168.2.1 Bcast:192.168.4.255 Mask:255.255.255.0 inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:17311 errors:0 dropped:0 overruns:0 frame:0 TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:148363444 (141.4 Mb) TX bytes:6715076 (6.4 Mb) Can this be related? Does anyone now how this is possible with sdp? I thought SDP was a RC. I'm also curious how gpfs reacts to this. Do you know where I can find the timestamp of these dropped packets? Koen -----Oorspronkelijk bericht----- Van: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] Verzonden: dinsdag 29 mei 2007 14:03 Aan: SEGERS Koen CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock; general-bounces at lists.openfabrics.org; general at lists.openfabrics.org Onderwerp: RE: [ofa-general] GPFS node loses IB-connection if this is an actual resize request than there is no problem when it is dropped. since you are running rc1, no resize requests should be sent so this means there is a problem since data could be dropped. do you notice lost data? On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote: > We are running ofed-1.2.RC1 on all machines. Hence it is impossible that > this message is added only a few days ago. > > How can you be so sure that this doesn't pose any problems? 
> > Koen > > -----Oorspronkelijk bericht----- > Van: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] > Verzonden: dinsdag 29 mei 2007 13:35 > Aan: SEGERS Koen > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock; > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > this means you are getting a message your SDP does not recognize. > message 11 is resize request which was added to sdp a few days ago. > can it be that you are running 2 different versions of OFED? > anywas, this doesn't pose any problem so you can ignore it. > > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote: > > Hi, > > > > Saturday we did a different stresstest. > > This is what we see in the /var/log/messages: > > > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11 > > > > There were errors from that time on. Can someone explain me what this > > message does? > > > > Koen > > > > -----Oorspronkelijk bericht----- > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > Verzonden: woensdag 23 mei 2007 17:41 > > Aan: SEGERS Koen; Hal Rosenstock > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; > > general at lists.openfabrics.org > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > Try 20 seconds, I'm curious if if you are barely crossing the > 10-second > > threshold. > > > > Scott > > > > > -----Original Message----- > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > Sent: Wednesday, May 23, 2007 8:39 AM > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock > > > Cc: Clive Hall (clivhall); > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > What value would you recommend then? 
> > > > > > Koen > > > > > > -----Oorspronkelijk bericht----- > > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > > Verzonden: woensdag 23 mei 2007 17:38 > > > Aan: SEGERS Koen; Hal Rosenstock > > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; > > > general at lists.openfabrics.org > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > The boot time of the host doesn't matter for this timeout. While > the > > > host is booting, the IB link is down anyway. > > > > > > Scott > > > > > > > -----Original Message----- > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > Sent: Wednesday, May 23, 2007 8:20 AM > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen) > > > > Cc: Clive Hall (clivhall); > > > > general-bounces at lists.openfabrics.org; > general at lists.openfabrics.org > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > After a whole day of stresstesting with the MAD renicing > > > turned on, we > > > > got the error once. So I think I should raise the timeout on > > > > the switch > > > > also. > > > > > > > > It takes about 2 minutes to boot the system. Do you agree > > > > that this is a > > > > good value for the timeout? > > > > > > > > Scott, > > > > Can you explain me the problem of the memlock? > > > > > > > > I saw that the SLES10 bug is only an issue in MVAPICH. > > > Since we didn't > > > > install this, the bug is not related to us. This is > > > correct, isn't it? 
> > > > > > > > Greetz > > > > > > > > Koen > > > > > > > > -----Oorspronkelijk bericht----- > > > > Van: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Verzonden: woensdag 23 mei 2007 16:12 > > > > Aan: Scott "Weitzenkamp (sweitzen) > > > > CC: SEGERS Koen; Clive Hall (clivhall); > > > > general-bounces at lists.openfabrics.org; > general at lists.openfabrics.org > > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote: > > > > > No C code changes, just a few config file changes > > > (RENICE_IB_MAD=yes > > > > in > > > > > openib.conf, > > > > > > > > Does the host really not respond to MAD requests for over 10 > > > > seconds in > > > > some cases ? > > > > > > > > -- Hal > > > > > > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on > > > > > SLES10 for bug 267, etc.). > > > > > > > > > > Scott Weitzenkamp > > > > > SQA and Release Manager > > > > > Server Virtualization Business Unit > > > > > Cisco Systems > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > > > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > > > general at lists.openfabrics.org; > > > > general-bounces at lists.openfabrics.org > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > This far, all tests seem to work. > > > > > > > > > > > > Thanks for the help! > > > > > > > > > > > > Scott, > > > > > > Are there more bugfixes that cisco does in its rpms? 
> > > > > >
> > > > > > Greetz
> > > > > >
> > > > > > Koen
> > > > > >
> > > > > > -----Original message-----
> > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > > Sent: Wednesday, 23 May 2007 0:39
> > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > > > 10 seconds (Clive, correct me if I'm wrong).
> > > > > >
> > > > > > You only need to do 1) or 2), not both. Cisco configures 1) in the OFED
> > > > > > binary RPMs we release at
> > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I prefer to have
> > > > > > the host be more responsive.
> > > > > >
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE]
> > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > > > > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > >
> > > > > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > > > > interfaces every 10s. This means that when the interface is handling
> > > > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > > > service.
> > > > > > > My question is then: "How can the timeout be reached while
> > > > > > > packets are being sent/received to/from the interface?"
> > > > > > >
> > > > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > > > >
> > > > > > > To recapitulate, these are the actions I'll take tomorrow:
> > > > > > > 1) change the MAD niceness of the servers
> > > > > > > 2) change the timeout on the switches
> > > > > > >
> > > > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > > > PORT_ACTIVE state?
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Koen
> > > > > > >
> > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > Yes, you can tune it. Here's an example via the switch CLI:
> > > > > > > >
> > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00
> > > > > > > > node-timeout
> > > > > > > >
> > > > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > > > > If an HCA is completely unresponsive for longer than the node-timeout
> > > > > > > > value, then we consider that HCA out of service.
> > > > > > > >
> > > > > > > > Scott Weitzenkamp
> > > > > > > > SQA and Release Manager
> > > > > > > > Server Virtualization Business Unit
> > > > > > > > Cisco Systems
> > > > > > > >
> > > > > > > > ______________________________________________________________
> > > > > > > > From: Shirley Ma [mailto:xma at us.ibm.com]
> > > > > > > > Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > > To: koen.segers at VRT.BE
> > > > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > general-bounces at lists.openfabrics.org; Scott Weitzenkamp (sweitzen)
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >
> > > > > > > > Koen,
> > > > > > > >
> > > > > > > > So it is most likely you hit the same bug as 229 (Scott
> > > > > > > > pointed out earlier). The same workaround might work for you
> > > > > > > > by renicing ib_mad as Scott suggested.
> > > > > > > >
> > > > > > > > I think this should be a SM query timeout tunable value in
> > > > > > > > Cisco SM. Am I right, Scott?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Shirley Ma
> > > > > > > >
> > > > > > > > Koen Segers
> > > > > > > > 05/22/07 11:14 AM
> > > > > > > > Please respond to koen.segers at VRT.BE
> > > > > > > >
> > > > > > > > To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > cc: Ami Perlmutter, general at lists.openfabrics.org,
> > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > It is the Cisco SM.
> > > > > > > >
> > > > > > > > SFS-7000P> show version
> > > > > > > >
> > > > > > > > ================================================================================
> > > > > > > > System Version Information
> > > > > > > > ================================================================================
> > > > > > > > system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > > contact : tac at cisco.com
> > > > > > > > name : SFS-7000P
> > > > > > > > location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > > up-time : 11(d):7(h):49(m):3(s)
> > > > > > > > last-change : none
> > > > > > > > last-config-save : none
> > > > > > > > action : none
> > > > > > > > result : none
> > > > > > > > oper-mode : normal
> > > > > > > >
> > > > > > > > There is also a command that gives the SM version, but I can't
> > > > > > > > find it right now.
> > > > > > > >
> > > > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > > Hello Koen,
> > > > > > > > >
> > > > > > > > > From the switch log, it looks like an SM issue to me. The node was
> > > > > > > > > kicked out of the membership. Which SM are you using in your fabric?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Shirley Ma
> > > > > > > >
> > > > > > > > *** Disclaimer ***
> > > > > > > >
> > > > > > > > Vlaamse Radio- en Televisieomroep
> > > > > > > > Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > >
> > > > > > > > nv van publiek recht
> > > > > > > > BTW BE 0244.142.664
> > > > > > > > RPR Brussel
> > > > > > > > http://www.vrt.be/disclaimer
> > > > >
> > > > > _______________________________________________
> > > > > general mailing list
> > > > > general at lists.openfabrics.org
> > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > >
> > > > > To unsubscribe, please visit
> > > > > http://openib.org/mailman/listinfo/openib-general

From erezz at voltaire.com Tue May 29 06:41:12 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Tue, 29 May 2007 16:41:12 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <20070524115715.GC4585@mellanox.co.il>
References: <20070521114410.GG20400@mellanox.co.il> <46557BCB.7030102@voltaire.com> <20070524115715.GC4585@mellanox.co.il>
Message-ID: <465C2D78.30100@voltaire.com>

>> I have the following files in backport/2.6.9_UX/include/src/:
>>
>> attribute_container.c - almost identical to the file on 2.6.20. I had to change one line in it.
>>
>
> could be a patch ...
> which line?
>

Now, attribute_container.c, klist.c & transport_class.c are copied from the kernel tree.
I've committed the required changes in ~erezz/ofabuild_iser_rh4.git &
~erezz/ofed_1_2_iser_rh4.git. The main change is a new directory called
"kernel_addons_patches". It contains patches for kernel tree files in order
to create the required addons from them. The rest of the files that I added
to include/src are very small. I hope it's OK now.

>> init.c - only a small part of the original file in 2.6.20
>>
>> klist.c - almost identical to the file on 2.6.20. I had to change one line in it.
>>
>
> which line?
>

See above.

>
>> kref_new.c - based on kref.c
>>
>
> Sounds scary ... how different is it?
>

This file was removed.

>
>> scsi.c - only a small part of the original file in 2.6.20
>>
>> scsi_lib.c - only a small part of the original file in 2.6.20
>>
>> scsi_scan.c - only a small part of the original file in 2.6.20
>>
>> transport_class.c - identical to 2.6.20
>>
>> So, the only file identical to 2.6.20 is transport_class.c. We can copy it from 2.6.20, but since it's only a single file, I'm not sure if it will make a real difference.
>>
>
> transport_class.c, attribute_container.c and klist.c are quite big together:
> more than 1000 lines. So by all means, let's check them out from kernel tree.
>

Done.

Erez

From mst at dev.mellanox.co.il Tue May 29 07:11:43 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 17:11:43 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <465C2D78.30100@voltaire.com>
References: <20070521114410.GG20400@mellanox.co.il> <46557BCB.7030102@voltaire.com> <20070524115715.GC4585@mellanox.co.il> <465C2D78.30100@voltaire.com>
Message-ID: <20070529141143.GD27671@mellanox.co.il>

> Quoting Erez Zilber:
> Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
>
> >> I have the following files in backport/2.6.9_UX/include/src/:
> >>
> >> attribute_container.c - almost identical to the file on 2.6.20. I had to change one line in it.
> >>
> >
> > could be a patch ...
> > which line?
> >
>
> Now, attribute_container.c, klist.c & transport_class.c are copied from
> the kernel tree. I've committed the required changes in
> ~erezz/ofabuild_iser_rh4.git & ~erezz/ofed_1_2_iser_rh4.git.

git fetch git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git.
fatal: The remote end hung up unexpectedly
Cannot get the repository state from git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git.

> The main
> change is a new dir called "kernel_addons_patches". It contains patches
> for kernel tree files in order to create the required addons from them.

Sorry, but I really don't think we can touch build scripts at this point.
Doing cp in build scripts is also a problem since it interferes with
development (there are 2 places to edit each file).
And adding a kernel version dependency there is also really messy.

Suggestion: why can't these patches be part of the regular backport directory?

You copy stuff to include/src and then include it, but this just looks
like an unnecessary extra step. Can't we include the source file from
its original directory, like this:
#include "../drivers/base/attribute_container.c"

> The rest of the files that I added to include/src are very small.
> I hope it's ok now.

Yes, the rest looks OK to me.

--
MST

From mst at dev.mellanox.co.il Tue May 29 07:24:22 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 17:24:22 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
In-Reply-To: <20070529141143.GD27671@mellanox.co.il>
References: <20070521114410.GG20400@mellanox.co.il> <46557BCB.7030102@voltaire.com> <20070524115715.GC4585@mellanox.co.il> <465C2D78.30100@voltaire.com> <20070529141143.GD27671@mellanox.co.il>
Message-ID: <20070529142422.GE27671@mellanox.co.il>

> Quoting Michael S. Tsirkin:
> Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
>
> > Quoting Erez Zilber:
> > Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons for open-iscsi over iSER support for RHAS4 up3 and up4
> >
> > >> I have the following files in backport/2.6.9_UX/include/src/:
> > >>
> > >> attribute_container.c - almost identical to the file on 2.6.20. I had to change one line in it.
> > >>
> > >
> > > could be a patch ...
> > > which line?
> > >
> >
> > Now, attribute_container.c, klist.c & transport_class.c are copied from
> > the kernel tree. I've committed the required changes in
> > ~erezz/ofabuild_iser_rh4.git & ~erezz/ofed_1_2_iser_rh4.git.
>
> git fetch git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git.
> fatal: The remote end hung up unexpectedly
> Cannot get the repository state from
> git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git.
>
> > The main
> > change is a new dir called "kernel_addons_patches". It contains patches
> > for kernel tree files in order to create the required addons from them.
>
> sorry, but I really don't think we can touch build scripts at this point.
> Doing cp in build scripts is also a problem since it interferes with
> development (there are 2 places to edit each file).
> And adding kernel version dependency there is also really messy.

BTW, one important principle is that *all* information about the kernel
build process must be contained inside the kernel tree. Keeping lists of
files in an external tree is *evil*.

> Suggestion: why can't these patches be part of the regular backport directory?
>
> you copy stuff to include/src and then include it, but this just looks
> like and unnecessary extra step. Can't we include the source file from
> it original place directory, like this:
> #include "../drivers/base/attribute_container.c"

--
MST

From tziporet at dev.mellanox.co.il Tue May 29 07:34:54 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 29 May 2007 17:34:54 +0300
Subject: [ofa-general] Re: [ewg] Upcoming OFED teleconferences
In-Reply-To: <338EEB24-0F0E-443D-94AA-61E512611F8B@cisco.com>
References: <338EEB24-0F0E-443D-94AA-61E512611F8B@cisco.com>
Message-ID: <465C3A0E.6070602@mellanox.co.il>

Jeff Squyres wrote:
> Short version:
> --------------
>
> Upcoming OFED teleconferences, all at noon US eastern / 9am US Pacific
> / 7pm Israel.
>
> 1. Wednesday, May 30 (*TOMORROW*), code 210262040

I cannot make it on Wednesday at 9am PST.
Can you change it to 11:30am PST?

Thanks,
Tziporet

> 2. Monday, June 4, code 2102061
> 3. Monday, June 11, code 210213621
> 4. Monday, June 18, code 2102061
> 5. Monday, June 25, code 210213621
>
> US/Canada: +1.866.432.9903
> India: +91.80.4103.3979
> Israel: +972.9.892.7026
> Others: http://cisco.com/en/US/about/doing_business/conferencing/

From amip at dev.mellanox.co.il Tue May 29 07:39:54 2007
From: amip at dev.mellanox.co.il (Ami Perlmutter)
Date: Tue, 29 May 2007 17:39:54 +0300
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
References:
Message-ID: <1180449624.12048.13.camel@localhost>

Can you describe the scenario in which you see data loss?
Does the "SDP: FIXME MID 11" message correlate with the data loss?
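
One way to check the correlation asked about here is to pull the timestamps of the "SDP: FIXME MID" lines out of /var/log/messages and compare them with the times at which the application reported errors. The following is only a minimal sketch under stated assumptions: it assumes the syslog line format quoted later in this thread, and the names find_sdp_fixme and events_near, the 60-second window, and the separately supplied list of error times are all hypothetical, not part of OFED, SDP or GPFS. The year is passed in because syslog timestamps omit it.

```python
import re
from datetime import datetime, timedelta
from typing import NamedTuple

# Matches kernel log lines of the form quoted in this thread, e.g.
#   May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
SDP_FIXME = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+SDP: FIXME MID (?P<mid>\d+)"
)


class SdpEvent(NamedTuple):
    when: datetime  # time of the log entry
    host: str       # host name as written by syslog
    mid: int        # SDP message ID reported in the log line


def find_sdp_fixme(lines, year):
    """Return one SdpEvent per 'SDP: FIXME MID n' line (hypothetical helper)."""
    events = []
    for line in lines:
        m = SDP_FIXME.match(line)
        if m is None:
            continue
        # Syslog timestamps carry no year, so the caller supplies it.
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        events.append(SdpEvent(when, m["host"], int(m["mid"])))
    return events


def events_near(events, error_times, window=timedelta(seconds=60)):
    """Keep the events that have at least one error time within the window."""
    return [
        e for e in events
        if any(abs(e.when - t) <= window for t in error_times)
    ]


if __name__ == "__main__":
    log = ["May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11"]
    for event in find_sdp_fixme(log, year=2007):
        print(event)
```

If the timestamps returned by events_near line up with the application errors, that would support a link between the two; an empty result would not rule one out, since the window size is an arbitrary choice.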

On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> I just remembered that, with SDP, these values aren't related anymore.
> SDP doesn't give this kind of information to the OS.
>
> Koen
>
> -----Original message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On behalf of SEGERS Koen
> Sent: Tuesday, 29 May 2007 14:29
> To: amip at dev.mellanox.co.il
> CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> One of the machines has 2 dropped packets:
>
> gpfswhbe2n1:~ # ifconfig ib0
> ib0  Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
>      inet addr:192.168.2.1 Bcast:192.168.4.255 Mask:255.255.255.0
>      inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
>      UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
>      RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
>      TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
>      collisions:0 txqueuelen:128
>      RX bytes:148363444 (141.4 Mb) TX bytes:6715076 (6.4 Mb)
>
> Can this be related?
>
> Does anyone know how this is possible with SDP? I thought SDP was RC.
> I'm also curious how GPFS reacts to this. Do you know where I can find
> the timestamp of these dropped packets?
>
> Koen
>
> -----Original message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> Sent: Tuesday, 29 May 2007 14:03
> To: SEGERS Koen
> CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> If this is an actual resize request, then there is no problem when it is
> dropped.
> Since you are running RC1, no resize requests should be sent, so this
> means there is a problem, since data could be dropped. Do you notice lost
> data?
>
> On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> > this message was added only a few days ago.
> >
> > How can you be so sure that this doesn't pose any problems?
> >
> > Koen
> >
> > -----Original message-----
> > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> > Sent: Tuesday, 29 May 2007 13:35
> > To: SEGERS Koen
> > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > This means you are getting a message your SDP does not recognize.
> > Message 11 is a resize request, which was added to SDP a few days ago.
> > Can it be that you are running 2 different versions of OFED?
> > Anyway, this doesn't pose any problem, so you can ignore it.
> >
> > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > Hi,
> > >
> > > Saturday we did a different stress test.
> > > This is what we see in /var/log/messages:
> > >
> > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > >
> > > There were errors from that time on. Can someone explain to me what this
> > > message means?
> > >
> > > Koen
> > >
> > > -----Original message-----
> > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > Sent: Wednesday, 23 May 2007 17:41
> > > To: SEGERS Koen; Hal Rosenstock
> > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >
> > > Try 20 seconds; I'm curious whether you are barely crossing the 10-second
> > > threshold.
> > >
> > > Scott
> > >
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > > Cc: Clive Hall (clivhall);
> > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > What value would you recommend then?
> > > >
> > > > Koen
> > > >
> > > > *** Disclaimer ***
> > > >
> > > > Vlaamse Radio- en Televisieomroep
> > > > Auguste Reyerslaan 52, 1043 Brussel
> > > >
> > > > nv van publiek recht
> > > > BTW BE 0244.142.664
From Koen.SEGERS at VRT.BE Tue May 29 07:56:58 2007
From: Koen.SEGERS at VRT.BE (SEGERS Koen)
Date: Tue, 29 May 2007 16:56:58 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1180449624.12048.13.camel@localhost>
Message-ID:

We don't really see data getting lost. We don't get an error in the log
files of gpfs. We only got a system that was not able to read its
filesystem anymore. It was exactly at the time this FIXME error
occurred.

Therefore I think there must be some kind of correlation. But I don't
really know what ... :(

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
Sent: Tuesday, May 29, 2007 16:40
To: SEGERS Koen
CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

can you describe the scenario in which you see data lost?
does the "SDP: FIXME MID 11" message correlate with the data loss?

On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> I just remembered that, with SDP, these values aren't related anymore.
> SDP doesn't give this kind of information to the OS.
>
> Koen
>
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
> Sent: Tuesday, May 29, 2007 14:29
> To: amip at dev.mellanox.co.il
> CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> One of the machines has 2 dropped packets:
>
> gpfswhbe2n1:~ # ifconfig ib0
> ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
>           inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
>           inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>           RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
>           collisions:0 txqueuelen:128
>           RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)
>
> Can this be related?
>
> Does anyone know how this is possible with sdp? I thought SDP was a RC.
> I'm also curious how gpfs reacts to this. Do you know where I can find
> the timestamp of these dropped packets?
>
> Koen
>
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> Sent: Tuesday, May 29, 2007 14:03
> To: SEGERS Koen
> CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> if this is an actual resize request then there is no problem when it is
> dropped.
> since you are running rc1, no resize requests should be sent so this
> means there is a problem since data could be dropped. do you notice lost
> data?
>
> On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> > this message was added only a few days ago.
> >
> > How can you be so sure that this doesn't pose any problems?
> >
> > Koen
> >
> > -----Original Message-----
> > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> > Sent: Tuesday, May 29, 2007 13:35
> > To: SEGERS Koen
> > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > this means you are getting a message your SDP does not recognize.
> > message 11 is resize request which was added to sdp a few days ago.
> > can it be that you are running 2 different versions of OFED?
> > anyway, this doesn't pose any problem so you can ignore it.
> >
> > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > Hi,
> > >
> > > Saturday we did a different stresstest.
> > > This is what we see in the /var/log/messages:
> > >
> > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > >
> > > There were errors from that time on. Can someone explain to me what
> > > this message does?
*** Disclaimer ***

Vlaamse Radio- en Televisieomroep
Auguste Reyerslaan 52, 1043 Brussel

nv van publiek recht
BTW BE 0244.142.664
RPR Brussel
http://www.vrt.be/disclaimer

From amip at dev.mellanox.co.il Tue May 29 08:05:24 2007
From: amip at dev.mellanox.co.il (Ami Perlmutter)
Date: Tue, 29 May 2007 18:05:24 +0300
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To:
References:
Message-ID: <1180451154.12048.15.camel@localhost>

Any chance of moving to rc3 (or waiting for rc4)?
On Tue, 2007-05-29 at 16:56 +0200, SEGERS Koen wrote: > We don't really see data getting lost. We don't get an error in the log > files of gpfs. We only got a system that was not able to read its > filesystem anymore. It was exactly at the time this FIXME error > occurred. > > Therefore I think there must me some kind of correlation. But I don't > really know what ... :( > > Koen > > -----Oorspronkelijk bericht----- > Van: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] > Verzonden: dinsdag 29 mei 2007 16:40 > Aan: SEGERS Koen > CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > can you describe the scenario in which you see data lost? > does the "SDP: FIXME MID 11" message correlate with the data loss? > > On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote: > > I just remembered that, with SDP, these values aren't related anymore. > > SDP doesn't give this kind of information to the OS. > > > > Koen > > > > -----Oorspronkelijk bericht----- > > Van: general-bounces at lists.openfabrics.org > > [mailto:general-bounces at lists.openfabrics.org] Namens SEGERS Koen > > Verzonden: dinsdag 29 mei 2007 14:29 > > Aan: amip at dev.mellanox.co.il > > CC: general-bounces at lists.openfabrics.org; > general at lists.openfabrics.org > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > One of the machines has 2 dropped packets: > > > > gpfswhbe2n1:~ # ifconfig ib0 > > ib0 Link encap:UNSPEC HWaddr > > 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > > inet addr:192.168.2.1 Bcast:192.168.4.255 > Mask:255.255.255.0 > > inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link > > UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 > > RX packets:17311 errors:0 dropped:0 overruns:0 frame:0 > > TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0 > > collisions:0 txqueuelen:128 > > RX bytes:148363444 (141.4 Mb) TX bytes:6715076 (6.4 Mb) > > > > Can this be related? 
> > > > Does anyone now how this is possible with sdp? I thought SDP was a RC. > > I'm also curious how gpfs reacts to this. Do you know where I can find > > the timestamp of these dropped packets? > > > > Koen > > > > -----Oorspronkelijk bericht----- > > Van: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] > > Verzonden: dinsdag 29 mei 2007 14:03 > > Aan: SEGERS Koen > > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock; > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > if this is an actual resize request than there is no problem when it > is > > dropped. > > since you are running rc1, no resize requests should be sent so this > > means there is a problem since data could be dropped. do you notice > lost > > data? > > > > On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote: > > > We are running ofed-1.2.RC1 on all machines. Hence it is impossible > > that > > > this message is added only a few days ago. > > > > > > How can you be so sure that this doesn't pose any problems? > > > > > > Koen > > > > > > -----Oorspronkelijk bericht----- > > > Van: Ami Perlmutter [mailto:amip at dev.mellanox.co.il] > > > Verzonden: dinsdag 29 mei 2007 13:35 > > > Aan: SEGERS Koen > > > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock; > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > this means you are getting a message your SDP does not recognize. > > > message 11 is resize request which was added to sdp a few days ago. > > > can it be that you are running 2 different versions of OFED? > > > anywas, this doesn't pose any problem so you can ignore it. > > > > > > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote: > > > > Hi, > > > > > > > > Saturday we did a different stresstest. 
> > > > This is what we see in the /var/log/messages: > > > > > > > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11 > > > > > > > > There were errors from that time on. Can someone explain me what > > this > > > > message does? > > > > > > > > Koen > > > > > > > > -----Oorspronkelijk bericht----- > > > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > > > Verzonden: woensdag 23 mei 2007 17:41 > > > > Aan: SEGERS Koen; Hal Rosenstock > > > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; > > > > general at lists.openfabrics.org > > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > Try 20 seconds, I'm curious if if you are barely crossing the > > > 10-second > > > > threshold. > > > > > > > > Scott > > > > > > > > > -----Original Message----- > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > > Sent: Wednesday, May 23, 2007 8:39 AM > > > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock > > > > > Cc: Clive Hall (clivhall); > > > > > general-bounces at lists.openfabrics.org; > > general at lists.openfabrics.org > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > What value would you recommend then? > > > > > > > > > > Koen > > > > > > > > > > -----Oorspronkelijk bericht----- > > > > > Van: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > > > > Verzonden: woensdag 23 mei 2007 17:38 > > > > > Aan: SEGERS Koen; Hal Rosenstock > > > > > CC: Clive Hall (clivhall); > general-bounces at lists.openfabrics.org; > > > > > general at lists.openfabrics.org > > > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > The boot time of the host doesn't matter for this timeout. > While > > > the > > > > > host is booting, the IB link is down anyway. 
> > > > > > > > > > Scott > > > > > > > > > > > -----Original Message----- > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > > > Sent: Wednesday, May 23, 2007 8:20 AM > > > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen) > > > > > > Cc: Clive Hall (clivhall); > > > > > > general-bounces at lists.openfabrics.org; > > > general at lists.openfabrics.org > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > After a whole day of stresstesting with the MAD renicing > > > > > turned on, we > > > > > > got the error once. So I think I should raise the timeout on > > > > > > the switch > > > > > > also. > > > > > > > > > > > > It takes about 2 minutes to boot the system. Do you agree > > > > > > that this is a > > > > > > good value for the timeout? > > > > > > > > > > > > Scott, > > > > > > Can you explain me the problem of the memlock? > > > > > > > > > > > > I saw that the SLES10 bug is only an issue in MVAPICH. > > > > > Since we didn't > > > > > > install this, the bug is not related to us. This is > > > > > correct, isn't it? > > > > > > > > > > > > Greetz > > > > > > > > > > > > Koen > > > > > > > > > > > > -----Oorspronkelijk bericht----- > > > > > > Van: Hal Rosenstock [mailto:halr at voltaire.com] > > > > > > Verzonden: woensdag 23 mei 2007 16:12 > > > > > > Aan: Scott "Weitzenkamp (sweitzen) > > > > > > CC: SEGERS Koen; Clive Hall (clivhall); > > > > > > general-bounces at lists.openfabrics.org; > > > general at lists.openfabrics.org > > > > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) > wrote: > > > > > > > No C code changes, just a few config file changes > > > > > (RENICE_IB_MAD=yes > > > > > > in > > > > > > > openib.conf, > > > > > > > > > > > > Does the host really not respond to MAD requests for over 10 > > > > > > seconds in > > > > > > some cases ? 
> > > > > > > > > > > > -- Hal > > > > > > > > > > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on > > > > > > > SLES10 for bug 267, etc.). > > > > > > > > > > > > > > Scott Weitzenkamp > > > > > > > SQA and Release Manager > > > > > > > Server Virtualization Business Unit > > > > > > > Cisco Systems > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM > > > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > > > > > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > > > > > general at lists.openfabrics.org; > > > > > > general-bounces at lists.openfabrics.org > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > > > > > This far, all tests seem to work. > > > > > > > > > > > > > > > > Thanks for the help! > > > > > > > > > > > > > > > > Scott, > > > > > > > > Are there more bugfixes that cisco does in its rpms? > > > > > > > > > > > > > > > > Greetz > > > > > > > > > > > > > > > > Koen > > > > > > > > > > > > > > > > -----Oorspronkelijk bericht----- > > > > > > > > Van: Scott Weitzenkamp (sweitzen) > > [mailto:sweitzen at cisco.com] > > > > > > > > Verzonden: woensdag 23 mei 2007 0:39 > > > > > > > > Aan: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall > > > > > > (clivhall) > > > > > > > > CC: Shirley Ma; Ami Perlmutter; > > general at lists.openfabrics.org; > > > > > > > > general-bounces at lists.openfabrics.org > > > > > > > > Onderwerp: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > > > > > It's not so much pinging every 10 seconds as expecting a > > > > > > > > response within > > > > > > > > 10 seconds (Clive, correct me if I'm wrong). > > > > > > > > > > > > > > > > You only need to do 1) or 2), not both. 
> > > > > > > > Cisco configures 1) in the OFED binary RPMs we release at
> > > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I prefer to have
> > > > > > > > the host be more responsive.
> > > > > > > >
> > > > > > > > Scott Weitzenkamp
> > > > > > > > SQA and Release Manager
> > > > > > > > Server Virtualization Business Unit
> > > > > > > > Cisco Systems
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE]
> > > > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > >
> > > > > > > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > > > > > > interfaces every 10s. This means that when the interface is handling
> > > > > > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > > > > > service. My question is then: "How can the timeout be reached while
> > > > > > > > > packets are being sent/received to/from the interface?"
> > > > > > > > >
> > > > > > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > > > > > >
> > > > > > > > > To recapitulate, these are the actions I'll take tomorrow:
> > > > > > > > > 1) change the MAD niceness of the servers
> > > > > > > > > 2) change the timeout on the switches
> > > > > > > > >
> > > > > > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > > > > > PORT_ACTIVE state?
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Koen
> > > > > > > > >
> > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > > > Yes, you can tune it. Here's an example via the switch CLI:
> > > > > > > > > >
> > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout
> > > > > > > > > >
> > > > > > > > > > The default is 10 seconds, it can be configured up to 2000 seconds.
> > > > > > > > > > If a HCA is completely unresponsive for longer than the node-timeout
> > > > > > > > > > value, then we consider that HCA out of service.
> > > > > > > > > >
> > > > > > > > > > Scott Weitzenkamp
> > > > > > > > > > SQA and Release Manager
> > > > > > > > > > Server Virtualization Business Unit
> > > > > > > > > > Cisco Systems
> > > > > > > > > >
> > > > > > > > > > ______________________________________________________________
> > > > > > > > > > From: Shirley Ma [mailto:xma at us.ibm.com]
> > > > > > > > > > Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > > > > To: koen.segers at VRT.BE
> > > > > > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > > general-bounces at lists.openfabrics.org; Scott Weitzenkamp (sweitzen)
> > > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >
> > > > > > > > > > Koen,
> > > > > > > > > >
> > > > > > > > > > So it is most likely you hit the same bug as 229 (Scott
> > > > > > > > > > pointed out earlier). The same workaround might work for you
> > > > > > > > > > by renicing ib_mad as Scott suggested.
> > > > > > > > > >
> > > > > > > > > > I think this should be a SM query timeout tunable value in
> > > > > > > > > > Cisco SM. Am I right, Scott?
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Shirley Ma
> > > > > > > > > >
> > > > > > > > > > Koen Segers
> > > > > > > > > > 05/22/07 11:14 AM
> > > > > > > > > > Please respond to koen.segers at VRT.BE
> > > > > > > > > >
> > > > > > > > > > To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > > > cc: Ami Perlmutter, general at lists.openfabrics.org,
> > > > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > It is the Cisco SM.
> > > > > > > > > >
> > > > > > > > > > SFS-7000P> show version
> > > > > > > > > >
> > > > > > > > > > ================================================================================
> > > > > > > > > >                           System Version Information
> > > > > > > > > > ================================================================================
> > > > > > > > > >   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > > > >          contact : tac at cisco.com
> > > > > > > > > >             name : SFS-7000P
> > > > > > > > > >         location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > > > >          up-time : 11(d):7(h):49(m):3(s)
> > > > > > > > > >      last-change : none
> > > > > > > > > > last-config-save : none
> > > > > > > > > >           action : none
> > > > > > > > > >           result : none
> > > > > > > > > >        oper-mode : normal
> > > > > > > > > >
> > > > > > > > > > There is also a command that gives the SM version, but I can't
> > > > > > > > > > find it right now.
> > > > > > > > > >
> > > > > > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > > > > Hello Koen,
> > > > > > > > > > >
> > > > > > > > > > > From the switch log, it looks a SM issue to me. The node was
> > > > > > > > > > > kicked out of the membership. Which SM you are using in your
> > > > > > > > > > > fabric?
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > > Shirley Ma
> > > > > > > > > >
> > > > > > > > > > *** Disclaimer ***
> > > > > > > > > >
> > > > > > > > > > Vlaamse Radio- en Televisieomroep
> > > > > > > > > > Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > > > >
> > > > > > > > > > nv van publiek recht
> > > > > > > > > > BTW BE 0244.142.664
> > > > > > > > > > RPR Brussel
> > > > > > > > > > http://www.vrt.be/disclaimer
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > general mailing list
> > > > > > > general at lists.openfabrics.org
> > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > > > >
> > > > > > > To unsubscribe, please visit
> > > > > > > http://openib.org/mailman/listinfo/openib-general

From Koen.SEGERS at VRT.BE Tue May 29 08:12:07 2007
From: Koen.SEGERS at VRT.BE
(SEGERS Koen)
Date: Tue, 29 May 2007 17:12:07 +0200
Subject: [ofa-general] GPFS node loses IB-connection
In-Reply-To: <1180451154.12048.15.camel@localhost>
Message-ID:

That is very difficult. This system is supposed to go into production within
a few weeks. Changing the OFED drivers requires rebuilding a lot of other
programs. If it isn't really necessary, I prefer not to do this...

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
Sent: Tuesday, May 29, 2007 17:05
To: SEGERS Koen
Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

any chance of moving to rc3 (or waiting till rc4)?

On Tue, 2007-05-29 at 16:56 +0200, SEGERS Koen wrote:
> We don't really see data getting lost. We don't get an error in the log
> files of gpfs. We only got a system that was not able to read its
> filesystem anymore. It was exactly at the time this FIXME error
> occurred.
>
> Therefore I think there must be some kind of correlation. But I don't
> really know what ... :(
>
> Koen
>
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> Sent: Tuesday, May 29, 2007 16:40
> To: SEGERS Koen
> Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> can you describe the scenario in which you see data lost?
> does the "SDP: FIXME MID 11" message correlate with the data loss?
>
> On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> > I just remembered that, with SDP, these values aren't related anymore.
> > SDP doesn't give this kind of information to the OS.
> >
> > Koen
> >
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
> > Sent: Tuesday, May 29, 2007 14:29
> > To: amip at dev.mellanox.co.il
> > Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > One of the machines has 2 dropped packets:
> >
> > gpfswhbe2n1:~ # ifconfig ib0
> > ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> >           inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
> >           inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
> >           RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
> >           collisions:0 txqueuelen:128
> >           RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)
> >
> > Can this be related?
> >
> > Does anyone know how this is possible with SDP? I thought SDP was RC.
> > I'm also curious how gpfs reacts to this. Do you know where I can find
> > the timestamp of these dropped packets?
> >
> > Koen
> >
> > -----Original Message-----
> > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> > Sent: Tuesday, May 29, 2007 14:03
> > To: SEGERS Koen
> > Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > if this is an actual resize request then there is no problem when it is
> > dropped.
> > since you are running rc1, no resize requests should be sent so this
> > means there is a problem since data could be dropped. do you notice lost
> > data?
> >
> > On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > > We are running ofed-1.2.RC1 on all machines.
> > > Hence it is impossible that this message was added only a few days ago.
> > >
> > > How can you be so sure that this doesn't pose any problems?
> > >
> > > Koen
> > >
> > > -----Original Message-----
> > > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> > > Sent: Tuesday, May 29, 2007 13:35
> > > To: SEGERS Koen
> > > Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >
> > > this means you are getting a message your SDP does not recognize.
> > > message 11 is resize request which was added to sdp a few days ago.
> > > can it be that you are running 2 different versions of OFED?
> > > anyway, this doesn't pose any problem so you can ignore it.
> > >
> > > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > > Hi,
> > > >
> > > > Saturday we did a different stress test.
> > > > This is what we see in /var/log/messages:
> > > >
> > > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > > >
> > > > There were errors from that time on. Can someone explain to me what
> > > > this message means?
> > > >
> > > > Koen
> > > >
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > Sent: Wednesday, May 23, 2007 17:41
> > > > To: SEGERS Koen; Hal Rosenstock
> > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > Try 20 seconds, I'm curious if you are barely crossing the 10-second
> > > > threshold.
> > > >
> > > > Scott
> > > >
> > > > > -----Original Message-----
> > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > What value would you recommend then?
> > > > >
> > > > > Koen
> > > > >
> > > > > -----Original Message-----
> > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > Sent: Wednesday, May 23, 2007 17:38
> > > > > To: SEGERS Koen; Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > The boot time of the host doesn't matter for this timeout. While the
> > > > > host is booting, the IB link is down anyway.
> > > > >
> > > > > Scott

*** Disclaimer ***

Vlaamse Radio- en Televisieomroep
Auguste Reyerslaan 52, 1043
Brussel

nv van publiek recht
BTW BE 0244.142.664
RPR Brussel
http://www.vrt.be/disclaimer

From sweitzen at cisco.com Tue May 29 08:30:16 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 29 May 2007 08:30:16 -0700
Subject: [ofa-general] OFED-1.2-20070529-0600 won't build due to srptools changes
Message-ID:

I have reopened https://bugs.openfabrics.org/show_bug.cgi?id=533, Ishai
please fix ASAP. This bug is now a P1 blocker.

RPM build errors:
    user vlad does not exist - using root
    group vlad does not exist - using root
    user vlad does not exist - using root
    group vlad does not exist - using root
    File not found: /var/tmp/OFED/usr/sbin/execute_multipath_or_kpartx.sh
    File not found: /var/tmp/OFED/usr/sbin/srp_dm_multipath_daemon
    File not found: /var/tmp/OFED/usr/sbin/srp_post_multipath
ERROR: Failed executing "rpmbuild --rebuild \
    --define '_topdir /var/tmp/OFEDRPM' \
    --define '_prefix /usr' \
    --define 'build_root /var/tmp/OFED' \
    --define 'configure_options --with-dapl --with-libibcommon --with-libibmad \
        --with-libibumad --with-libibverbs --with-libmthca --with-opensm \
        --with-librdmacm --with-libsdp --with-openib-diags --with-sdpnetstat \
        --with-srptools --with-mstflint --with-perftest --with-tvflash \
        --sysconfdir=/etc --mandir=/usr/share/man' \
    --define 'configure_options32 --with-dapl --with-libibcommon --with-libibmad \
        --with-libibumad --with-libibverbs --with-libmthca --with-opensm \
        --with-librdmacm --with-libsdp --with-openib-diags --with-sdpnetstat \
        --with-srptools --sysconfdir=/etc --mandir=/usr/share/man' \
    --define 'build_32bit 1' \
    --define '_mandir /usr/share/man' \
    /tmp/OFED-1.2-20070529-0600/SRPMS/ofa_user-1.2-rc2.src.rpm"
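When triaging a failure like the one reported above, it can help to pull just the missing payload paths out of the rpmbuild output. What follows is only a minimal sketch: `missing_files` is a hypothetical helper name (not part of OFED or rpm), and it assumes the log is already available as a string containing `File not found: <path>` lines like those in the report.

```python
import re
from typing import List


def missing_files(build_log: str) -> List[str]:
    """Return every path that the log reports as 'File not found: <path>'.

    Hypothetical helper for illustration; it only scans the text and does
    not invoke rpmbuild or touch the filesystem.
    """
    return re.findall(r"File not found:\s*(\S+)", build_log)


# Excerpt of the log lines quoted in the report.
SAMPLE_LOG = """\
RPM build errors:
    user vlad does not exist - using root
    File not found: /var/tmp/OFED/usr/sbin/execute_multipath_or_kpartx.sh
    File not found: /var/tmp/OFED/usr/sbin/srp_dm_multipath_daemon
    File not found: /var/tmp/OFED/usr/sbin/srp_post_multipath
"""

for path in missing_files(SAMPLE_LOG):
    print(path)
```

All three paths belong to srptools, which matches the subject line of the report.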

From sweitzen at cisco.com Tue May 29 08:44:37 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 29 May 2007 08:44:37 -0700
Subject: [ofa-general] Re: ipoib / bonding and OFED
In-Reply-To: <465BDC90.5080305@voltaire.com>
References: <3857BB049D83424D9DB82753D37CEA55459C41@taurus.voltaire.com><4657373E.2030903@hp.com> <465BDC90.5080305@voltaire.com>
Message-ID:

Bob, it is now possible to configure IPoIB bonding in /etc/infiniband/openib.conf. This configuration file includes the following boilerplate:

    # Enable the bonding driver on startup
    IPOIBBOND_ENABLE=no

    # Set bond interface names
    #IPOIB_BONDS=bond0,bond1

    # Set specific bond params; address and slaves
    #bond0_IP=10.10.10.1
    #bond0_SLAVES=ib0,ib1
    #bond1_IP=20.10.10.1
    #bond1_SLAVES=ib2,ib3,ib4

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Or Gerlitz
> Sent: Tuesday, May 29, 2007 12:56 AM
> To: Bob Kossey
> Cc: OpenFabrics General
> Subject: [ofa-general] Re: ipoib / bonding and OFED
>
> Bob Kossey wrote:
> > I copied Or since I think this is related to his OFED HA work, and
> > he might have some insights. A few more questions for Or:
> > I was trying to use ipoib bonding with OFED 1.2 rc2 and a 2.6.9 kernel,
> > but was not able to get it to work so far. I saw your Sonoma bonding
> > slides, and you mention kernel bonding driver changes were needed.
> > 2. Is there a minimum kernel version, with the kernel bonding driver
> > changes, that is required to use bonding with OFED ipoib?
>
> Just to have a base line here: to get bonding to work with IPoIB, you
> should use the bonding driver provided with OFED 1.2. This driver is the
> upstream one (of 2.6.20), patched to support IPoIB and backported
> to RH5, SLES10 and RH4 U3/4/5; other kernels are not supported.
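The bonding keys in the openib.conf boilerplate quoted earlier in this message can be read back with a few lines of script. What follows is only a minimal sketch, assuming the file consists of plain KEY=VALUE lines with # comments and that "yes" is the value that enables bonding. The function name parse_bond_config and the sample text (the boilerplate with its comment markers removed) are illustrative assumptions, not part of OFED or of the openibd script.

```python
def parse_bond_config(text):
    """Return {bond_name: {"ip": ..., "slaves": [...]}} from openib.conf-style text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything that is not KEY=VALUE
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()

    bonds = {}
    # Assumption: bonding is only configured when the switch is set to "yes".
    if settings.get("IPOIBBOND_ENABLE", "no") != "yes":
        return bonds
    for name in settings.get("IPOIB_BONDS", "").split(","):
        name = name.strip()
        if not name:
            continue
        slaves = settings.get(name + "_SLAVES", "")
        bonds[name] = {
            "ip": settings.get(name + "_IP"),
            "slaves": [s.strip() for s in slaves.split(",") if s.strip()],
        }
    return bonds


# Hypothetical sample: the quoted boilerplate with bonding switched on.
SAMPLE = """
# Enable the bonding driver on startup
IPOIBBOND_ENABLE=yes
IPOIB_BONDS=bond0,bond1
bond0_IP=10.10.10.1
bond0_SLAVES=ib0,ib1
bond1_IP=20.10.10.1
bond1_SLAVES=ib2,ib3,ib4
"""

if __name__ == "__main__":
    for bond, cfg in parse_bond_config(SAMPLE).items():
        print(bond, cfg["ip"], ",".join(cfg["slaves"]))
```

A check like this is mainly useful for catching a bond that is listed in IPOIB_BONDS but has no matching _IP or _SLAVES line before the service is restarted.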
>
> If you were using the ofed bonding on a system that matches the support
> matrix, it should work. If you do have problems under this config, please
> either open a bug at the ofed bugzilla @ bugs.openfabrics.org assigned to
> monis at voltaire.com (Moni Shoua) or first send a report/question to Moni
> and CC ewg at lists.openfabrics.org
>
> Please note that between RC2 and RC4 (to be released today etc) some
> bugs were fixed; you can search in the bugzilla to see what.
>
> > 3. The bonding driver uses the HWADDR from the underlying ipoib
> > devices, how does it obtain the HWADDR? Does it use the full 20 bytes,
> > or some subset?
>
> When enslaving IPoIB devices, the bonding driver uses the full hw
> address of the active slave; it simply looks at the dev_addr field of
> the slave's struct net_device (see include/linux/netdevice.h).
>
> > 4. What use_carrier options for link status detection does OFED ipoib
> > support,
> > MII, ETHTOOL or netif_carrier_ok?
>
> The mii/ethtool etc local link detection methods of the bonding driver
> are somewhat deprecated, since nowadays almost every network device
> supports the netif_carrier_ok call. The --default-- of the upstream
> bonding driver (e.g. the one we use in OFED and the 2.6.21 listed below)
> is to set the use_carrier mod param to 1, that is, mii is not used anymore.
>
> > author: Thomas Davis, tadavis at lbl.gov and many others
> > description: Ethernet Channel Bonding Driver, v3.1.2
> > version: 3.1.2
> > parm: use_carrier:Use netif_carrier_ok (vs MII ioctls) in miimon; 0 for off, 1 for on (default) (int)
> > parm: miimon:Link check interval in milliseconds (int)
>
> > If you have any good examples of bonding configuration settings that work
> > with OFED, I'd appreciate that also.
>
> The bonding RPM provided with OFED is made of a driver, script and some
> help text containing usage examples; please take a look there and let me
> know if you have further questions.
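To make "the full 20 bytes" concrete: per RFC 4391, an IPoIB link-layer address is 20 octets, made up of one reserved/flags octet, a 3-octet queue pair number and the 16-octet port GID. The sketch below only splits such an address into those fields. The function name and the sample bytes are made up for illustration, and this is not how the bonding driver handles the address (as described above, it just uses dev_addr as a whole).

```python
import ipaddress


def split_ipoib_hwaddr(addr: bytes):
    """Split a 20-byte IPoIB hardware address into (flags, qpn, gid)."""
    if len(addr) != 20:
        raise ValueError("IPoIB hardware address must be 20 bytes, got %d" % len(addr))
    flags = addr[0]                          # reserved/flags octet
    qpn = int.from_bytes(addr[1:4], "big")   # 24-bit queue pair number
    gid = ipaddress.IPv6Address(addr[4:])    # 128-bit GID, printed like an IPv6 address
    return flags, qpn, gid


# Invented example address: flags 0x80, QPN 0x000404, link-local style GID.
sample = bytes.fromhex("80000404" + "fe80000000000000" + "0002c90200000001")

if __name__ == "__main__":
    flags, qpn, gid = split_ipoib_hwaddr(sample)
    print(hex(flags), hex(qpn), gid)
```

Printing the GID through the IPv6 formatter is just a convenience, since a GID has the same 128-bit layout as an IPv6 address.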
>
> > $ rpm -ql ib-bonding-0.9.0-2.6.9_42.ELsmp
> > /lib/modules/2.6.9-42.ELsmp/updates/kernel/drivers/net/bonding/bonding.ko
> > /usr/bin/ib-bond
> > /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt
>
> The ofed service (/etc/init.d/openibd) was enhanced to allow for
> --persistent-- bonding configuration; please see the bonding section at
> docs/ipoib_release_notes.txt to see how to do it.
>
> Or.

From mst at dev.mellanox.co.il Tue May 29 08:44:58 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 18:44:58 +0300
Subject: [ofa-general] Re: GPFS node loses IB-connection
In-Reply-To:
References: <1180451154.12048.15.camel@localhost>
Message-ID: <20070529154458.GA7101@mellanox.co.il>

> Changing the OFED drivers requires rebuilding a lot of other programs.

It does? Why does it?

Quoting SEGERS Koen :
Subject: RE: GPFS node loses IB-connection

That is very difficult. This system is supposed to go into production
within a few weeks. Changing the OFED drivers requires rebuilding a lot
of other programs. If it isn't really necessary, I prefer not to do this...

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
Sent: Tuesday, 29 May 2007 17:05
To: SEGERS Koen
Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

any chance of moving to rc3 (or wait till rc4)?

On Tue, 2007-05-29 at 16:56 +0200, SEGERS Koen wrote:
> We don't really see data getting lost. We don't get an error in the log
> files of gpfs. We only got a system that was not able to read its
> filesystem anymore. It was exactly at the time this FIXME error
> occurred.
>
> Therefore I think there must be some kind of correlation. But I don't
> really know what ... :(
>
> Koen
>
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> Sent: Tuesday, 29 May 2007 16:40
> To: SEGERS Koen
> Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> can you describe the scenario in which you see data lost?
> does the "SDP: FIXME MID 11" message correlate with the data loss?
>
> On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> > I just remembered that, with SDP, these values aren't related anymore.
> > SDP doesn't give this kind of information to the OS.
> >
> > Koen
> >
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
> > Sent: Tuesday, 29 May 2007 14:29
> > To: amip at dev.mellanox.co.il
> > Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > One of the machines has 2 dropped packets:
> >
> > gpfswhbe2n1:~ # ifconfig ib0
> > ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> >           inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
> >           inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
> >           RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
> >           collisions:0 txqueuelen:128
> >           RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)
> >
> > Can this be related?
> >
> > Does anyone know how this is possible with sdp? I thought SDP was RC
> > (reliable connection).
> > I'm also curious how gpfs reacts to this. Do you know where I can find
> > the timestamp of these dropped packets?
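On the question of when the drops happened: the interface counters in the ifconfig output above carry no timestamps, so one option is to sample them periodically and log the time at which a counter changes. The sketch below covers only the parsing half. It assumes the older net-tools ifconfig layout shown in this message, and the function name parse_counters is hypothetical.

```python
import re

# Matches the "RX packets:... errors:... dropped:... overruns:..." lines of
# the older net-tools ifconfig output format.
COUNTER_RE = re.compile(
    r"\b(RX|TX) packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+)"
)


def parse_counters(ifconfig_text):
    """Return {"RX": {...}, "TX": {...}} with integer packet counters."""
    counters = {}
    for direction, packets, errors, dropped, overruns in COUNTER_RE.findall(ifconfig_text):
        counters[direction] = {
            "packets": int(packets),
            "errors": int(errors),
            "dropped": int(dropped),
            "overruns": int(overruns),
        }
    return counters


# The two counter lines from the ifconfig output quoted in this thread.
SAMPLE = """\
          RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
          TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
"""

if __name__ == "__main__":
    for direction, values in parse_counters(SAMPLE).items():
        print(direction, values)
```

Calling this from a loop that runs ifconfig every few seconds and records the wall-clock time whenever "dropped" increases would give an approximate timestamp, accurate to the sampling interval.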
> >
> > Koen
> >
> > -----Original Message-----
> > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> > Sent: Tuesday, 29 May 2007 14:03
> > To: SEGERS Koen
> > Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > if this is an actual resize request then there is no problem when it is
> > dropped.
> > since you are running rc1, no resize requests should be sent so this
> > means there is a problem since data could be dropped. do you notice lost
> > data?
> >
> > On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > > We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> > > this message was added only a few days ago.
> > >
> > > How can you be so sure that this doesn't pose any problems?
> > >
> > > Koen
> > >
> > > -----Original Message-----
> > > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> > > Sent: Tuesday, 29 May 2007 13:35
> > > To: SEGERS Koen
> > > Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >
> > > this means you are getting a message your SDP does not recognize.
> > > message 11 is resize request which was added to sdp a few days ago.
> > > can it be that you are running 2 different versions of OFED?
> > > anyway, this doesn't pose any problem so you can ignore it.
> > >
> > > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > > Hi,
> > > >
> > > > Saturday we did a different stresstest.
> > > > This is what we see in the /var/log/messages:
> > > >
> > > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > > >
> > > > There were errors from that time on. Can someone explain to me what
> > > > this message means?
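To check whether the "SDP: FIXME MID" messages line up in time with the GPFS failures, the kernel log can be scanned for them and the timestamps compared with the GPFS logs. This is a minimal sketch, assuming the classic syslog line layout shown above (month, day, time, host, then the message). The function name is illustrative, the second sample line is invented, and the reading of MID 11 as a resize request comes from Ami's reply in this thread rather than from the SDP source.

```python
import re

# "May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11"
SDP_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) kernel: SDP: FIXME MID (\d+)"
)


def find_sdp_fixme(lines):
    """Return (timestamp, host, mid) tuples for every SDP FIXME log line."""
    hits = []
    for line in lines:
        match = SDP_RE.match(line)
        if match:
            hits.append((match.group(1), match.group(2), int(match.group(3))))
    return hits


SAMPLE = [
    "May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11",
    "May 25 10:41:20 gpfswhbe2s1 kernel: unrelated message",  # invented line
]

if __name__ == "__main__":
    for timestamp, host, mid in find_sdp_fixme(SAMPLE):
        print(timestamp, host, "unrecognized SDP message id", mid)
```

In practice the input would be the lines of /var/log/messages on each node, so that hosts running different OFED versions can be spotted by which of them log the message.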
> > > >
> > > > Koen
> > > >
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > Sent: Wednesday, 23 May 2007 17:41
> > > > To: SEGERS Koen; Hal Rosenstock
> > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > Try 20 seconds, I'm curious if you are barely crossing the 10-second
> > > > threshold.
> > > >
> > > > Scott
> > > >
> > > > > -----Original Message-----
> > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall);
> > > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > What value would you recommend then?
> > > > >
> > > > > Koen
> > > > >
> > > > > -----Original Message-----
> > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > Sent: Wednesday, 23 May 2007 17:38
> > > > > To: SEGERS Koen; Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > The boot time of the host doesn't matter for this timeout. While the
> > > > > host is booting, the IB link is down anyway.
> > > > >
> > > > > Scott
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > > > Cc: Clive Hall (clivhall);
> > > > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > After a whole day of stresstesting with the MAD renicing turned on, we
> > > > > > got the error once. So I think I should raise the timeout on the switch
> > > > > > also.
> > > > > >
> > > > > > It takes about 2 minutes to boot the system. Do you agree that this is a
> > > > > > good value for the timeout?
> > > > > >
> > > > > > Scott,
> > > > > > Can you explain to me the problem of the memlock?
> > > > > >
> > > > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > > > > > install this, the bug is not related to us. This is correct, isn't it?
> > > > > >
> > > > > > Greetz
> > > > > >
> > > > > > Koen
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > > > > > Sent: Wednesday, 23 May 2007 16:12
> > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > Cc: SEGERS Koen; Clive Hall (clivhall);
> > > > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > > > > > > openib.conf,
> > > > > >
> > > > > > Does the host really not respond to MAD requests for over 10 seconds in
> > > > > > some cases?
> > > > > >
> > > > > > -- Hal
> > > > > >
> > > > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > > > SLES10 for bug 267, etc.).
> > > > > > >
> > > > > > > Scott Weitzenkamp
> > > > > > > SQA and Release Manager
> > > > > > > Server Virtualization Business Unit
> > > > > > > Cisco Systems
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > > > > > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >
> > > > > > > > So far, all tests seem to work.
> > > > > > > >
> > > > > > > > Thanks for the help!
> > > > > > > >
> > > > > > > > Scott,
> > > > > > > > Are there more bugfixes that cisco does in its rpms?
> > > > > > > >
> > > > > > > > Greetz
> > > > > > > >
> > > > > > > > Koen
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > > > > Sent: Wednesday, 23 May 2007 0:39
> > > > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >
> > > > > > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > > > > > 10 seconds (Clive, correct me if I'm wrong).
> > > > > > > >
> > > > > > > > You only need to do 1) or 2), not both.
> > > > > > > > Cisco configures 1) in the OFED binary RPMs we release at
> > > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I prefer to have
> > > > > > > > the host be more responsive.
> > > > > > > >
> > > > > > > > Scott Weitzenkamp
> > > > > > > > SQA and Release Manager
> > > > > > > > Server Virtualization Business Unit
> > > > > > > > Cisco Systems
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE]
> > > > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > > > > > > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > >
> > > > > > > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > > > > > > interfaces every 10s. This means that when the interface is handling
> > > > > > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > > > > > service. My question is then: "How can the timeout be reached while
> > > > > > > > > packets are being sent/received to/from the interface?"
> > > > > > > > >
> > > > > > > > > Anyway, what timeout-value would you recommend for us? And why?
> > > > > > > > >
> > > > > > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > > > > > 1) change the MAD niceness of the servers
> > > > > > > > > 2) change the timeout on the switches
> > > > > > > > >
> > > > > > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > > > > > PORT_ACTIVE state?
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Koen
> > > > > > > > >
> > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > > > Yes, you can tune it. Here's an example via the switch CLI:
> > > > > > > > > >
> > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout
> > > > > > > > > >
> > > > > > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > > > > > > If an HCA is completely unresponsive for longer than the node-timeout
> > > > > > > > > > value, then we consider that HCA out of service.
> > > > > > > > > >
> > > > > > > > > > Scott Weitzenkamp
> > > > > > > > > > SQA and Release Manager
> > > > > > > > > > Server Virtualization Business Unit
> > > > > > > > > > Cisco Systems
> > > > > > > > > >
> > > > > > > > > > ______________________________________________________________
> > > > > > > > > > From: Shirley Ma [mailto:xma at us.ibm.com]
> > > > > > > > > > Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > > > > To: koen.segers at VRT.BE
> > > > > > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > > general-bounces at lists.openfabrics.org; Scott Weitzenkamp (sweitzen)
> > > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >
> > > > > > > > > > Koen,
> > > > > > > > > >
> > > > > > > > > > So it is most likely you hit the same bug as 229 (Scott
> > > > > > > > > > pointed out earlier). The same workaround might work for you
> > > > > > > > > > by renicing ib_mad as Scott suggested.
> > > > > > > > > >
> > > > > > > > > > I think this should be a SM query timeout tunable value in
> > > > > > > > > > Cisco SM. Am I right, Scott?
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Shirley Ma
> > > > > > > > > >
> > > > > > > > > > From: Koen Segers
> > > > > > > > > > Date: 05/22/07 11:14 AM
> > > > > > > > > > Please respond to koen.segers at VRT.BE
> > > > > > > > > > To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > > > cc: Ami Perlmutter, general at lists.openfabrics.org,
> > > > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > It is the Cisco SM.
> > > > > > > > > >
> > > > > > > > > > SFS-7000P> show version
> > > > > > > > > >
> > > > > > > > > > ================================================================================
> > > > > > > > > >                          System Version Information
> > > > > > > > > > ================================================================================
> > > > > > > > > >   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > > > >          contact : tac at cisco.com
> > > > > > > > > >             name : SFS-7000P
> > > > > > > > > >         location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > > > >          up-time : 11(d):7(h):49(m):3(s)
> > > > > > > > > >      last-change : none
> > > > > > > > > > last-config-save : none
> > > > > > > > > >           action : none
> > > > > > > > > >           result : none
> > > > > > > > > >        oper-mode : normal
> > > > > > > > > >
> > > > > > > > > > There is also a command that gives the SM version, but I can't
> > > > > > > > > > find it right now.
> > > > > > > > > >
> > > > > > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > > > > Hello Koen,
> > > > > > > > > > >
> > > > > > > > > > > From the switch log, it looks like an SM issue to me. The node was
> > > > > > > > > > > kicked out of the membership. Which SM are you using in your
> > > > > > > > > > > fabric?
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > > Shirley Ma

--
MST

From mst at dev.mellanox.co.il Tue May 29 08:47:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 May 2007 18:47:00 +0300
Subject: [ofa-general] Re: GPFS node loses IB-connection
In-Reply-To:
References: <1180451154.12048.15.camel@localhost>
Message-ID: <20070529154700.GA8321@mellanox.co.il>

What Ami is asking you to do is to try to reproduce the problem
with -RC3, or with -RC4 when it's out. If the problem goes away, we'll know
it's one of the bugs that got fixed since then; if not, it'll be easier
to debug on a recent RC.

Quoting SEGERS Koen :
Subject: RE: GPFS node loses IB-connection

That is very difficult. This system is supposed to go into production
within a few weeks. Changing the OFED drivers requires rebuilding a lot
of other programs. If it isn't really necessary, I prefer not to do this...

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
Sent: Tuesday, 29 May 2007 17:05
To: SEGERS Koen
Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection

any chance of moving to rc3 (or wait till rc4)?

On Tue, 2007-05-29 at 16:56 +0200, SEGERS Koen wrote:
> We don't really see data getting lost. We don't get an error in the log
> files of gpfs. We only got a system that was not able to read its
> filesystem anymore. It was exactly at the time this FIXME error
> occurred.
>
> Therefore I think there must be some kind of correlation. But I don't
> really know what ... :(
>
> Koen
>
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> Sent: Tuesday, 29 May 2007 16:40
> To: SEGERS Koen
> Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> can you describe the scenario in which you see data lost?
> does the "SDP: FIXME MID 11" message correlate with the data loss?
>
> On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> > I just remembered that, with SDP, these values aren't related anymore.
> > SDP doesn't give this kind of information to the OS.
> >
> > Koen
> >
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
> > Sent: Tuesday, 29 May 2007 14:29
> > To: amip at dev.mellanox.co.il
> > Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > One of the machines has 2 dropped packets:
> >
> > gpfswhbe2n1:~ # ifconfig ib0
> > ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> >           inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
> >           inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
> >           RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
> >           collisions:0 txqueuelen:128
> >           RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)
> >
> > Can this be related?
> >
> > Does anyone know how this is possible with sdp? I thought SDP was RC
> > (reliable connection).
> > I'm also curious how gpfs reacts to this. Do you know where I can find
> > the timestamp of these dropped packets?
> >
> > Koen
> >
> > -----Original Message-----
> > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> > Sent: Tuesday, 29 May 2007 14:03
> > To: SEGERS Koen
> > Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > if this is an actual resize request then there is no problem when it is
> > dropped.
> > since you are running rc1, no resize requests should be sent so this
> > means there is a problem since data could be dropped. do you notice lost
> > data?
> >
> > On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > > We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> > > this message was added only a few days ago.
> > >
> > > How can you be so sure that this doesn't pose any problems?
> > >
> > > Koen
> > >
> > > -----Original Message-----
> > > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> > > Sent: Tuesday, 29 May 2007 13:35
> > > To: SEGERS Koen
> > > Cc: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >
> > > this means you are getting a message your SDP does not recognize.
> > > message 11 is resize request which was added to sdp a few days ago.
> > > can it be that you are running 2 different versions of OFED?
> > > anyway, this doesn't pose any problem so you can ignore it.
> > >
> > > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > > Hi,
> > > >
> > > > Saturday we did a different stresstest.
> > > > This is what we see in the /var/log/messages:
> > > >
> > > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > > >
> > > > There were errors from that time on. Can someone explain to me what
> > > > this message means?
> > > >
> > > > Koen
> > > >
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > Sent: Wednesday, 23 May 2007 17:41
> > > > To: SEGERS Koen; Hal Rosenstock
> > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > Try 20 seconds, I'm curious if you are barely crossing the 10-second
> > > > threshold.
> > > >
> > > > Scott
> > > >
> > > > > -----Original Message-----
> > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall);
> > > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > What value would you recommend then?
> > > > >
> > > > > Koen
> > > > >
> > > > > -----Original Message-----
> > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > Sent: Wednesday, 23 May 2007 17:38
> > > > > To: SEGERS Koen; Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > The boot time of the host doesn't matter for this timeout. While the
> > > > > host is booting, the IB link is down anyway.
> > > > > > > > > > Scott > > > > > > > > > > > -----Original Message----- > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > > > Sent: Wednesday, May 23, 2007 8:20 AM > > > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen) > > > > > > Cc: Clive Hall (clivhall); > > > > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > After a whole day of stresstesting with the MAD renicing turned on, we > > > > > > got the error once. So I think I should raise the timeout on the switch > > > > > > also. > > > > > > > > > > > > It takes about 2 minutes to boot the system. Do you agree that this is a > > > > > > good value for the timeout? > > > > > > > > > > > > Scott, > > > > > > Can you explain the memlock problem to me? > > > > > > > > > > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't > > > > > > install this, the bug is not related to us. This is correct, isn't it? > > > > > > > > > > > > Greetings > > > > > > > > > > > > Koen > > > > > > > > > > > > -----Original Message----- > > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > > > Sent: Wednesday, May 23, 2007 16:12 > > > > > > To: Scott Weitzenkamp (sweitzen) > > > > > > CC: SEGERS Koen; Clive Hall (clivhall); > > > > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote: > > > > > > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in > > > > > > > openib.conf, > > > > > > > > > > > > Does the host really not respond to MAD requests for over 10 seconds in > > > > > > some cases?
> > > > > > > > > > > > -- Hal > > > > > > > > > > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on > > > > > > > SLES10 for bug 267, etc.). > > > > > > > > > > > > > > Scott Weitzenkamp > > > > > > > SQA and Release Manager > > > > > > > Server Virtualization Business Unit > > > > > > > Cisco Systems > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] > > > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM > > > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > > > > > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > > > > > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > > > > > So far, all tests seem to work. > > > > > > > > > > > > > > > > Thanks for the help! > > > > > > > > > > > > > > > > Scott, > > > > > > > > Are there more bugfixes that Cisco applies in its RPMs? > > > > > > > > > > > > > > > > Greetings > > > > > > > > > > > > > > > > Koen > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > > > > > > > > Sent: Wednesday, May 23, 2007 0:39 > > > > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall) > > > > > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org; > > > > > > > > general-bounces at lists.openfabrics.org > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > > > > > It's not so much pinging every 10 seconds as expecting a response within > > > > > > > > 10 seconds (Clive, correct me if I'm wrong). > > > > > > > > > > > > > > > > You only need to do 1) or 2), not both.
Cisco configures 1) in the OFED > > > > > > > > binary RPMs we release at > > > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I prefer to have > > > > > > > > the host be more responsive. > > > > > > > > > > > > > > > > Scott Weitzenkamp > > > > > > > > SQA and Release Manager > > > > > > > > Server Virtualization Business Unit > > > > > > > > Cisco Systems > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE] > > > > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM > > > > > > > > > To: Scott Weitzenkamp (sweitzen) > > > > > > > > > Cc: Shirley Ma; Ami Perlmutter; > > > > > > > > > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org > > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection > > > > > > > > > > > > > > > > > > If I understand it right, the switch is actually polling (=pinging) the > > > > > > > > > interfaces every 10s. This means that when the interface is handling > > > > > > > > > other traffic, the poll can fail and the port could be considered out of > > > > > > > > > service. My question is then: "How can the timeout be reached while > > > > > > > > > packets are being sent/received to/from the interface?" > > > > > > > > > > > > > > > > > > Anyway, what timeout-value would you recommend for us? And why? > > > > > > > > > > > > > > > > > > To recapitulate: these are the actions I'll take tomorrow > > > > > > > > > 1) change the MAD niceness of the servers > > > > > > > > > 2) change the timeout on the switches > > > > > > > > > > > > > > > > > > Are these changes sufficient for the HCAs to keep their ports in > > > > > > > > > PORT_ACTIVE state?
> > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > > > Koen > > > > > > > > > > > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp > > > > > > > > (sweitzen) wrote: > > > > > > > > > > Yes, you can tune it. Here's an example via the > switch > > > CLI: > > > > > > > > > > > > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix > > > > > fe:80:00:00:00:00:00:00 > > > > > > > > > > node-timeout > > > > > > > > > > > > > > > > > > > > The default is 10 seconds, it can be configured up to > > > > > > > > 2000 seconds. > > > > > > > > > > If a HCA is completely unresponsive for longer than > the > > > > > > > > node-timeout > > > > > > > > > > value, then we consider that HCA out of service. > > > > > > > > > > > > > > > > > > > > Scott Weitzenkamp > > > > > > > > > > SQA and Release Manager > > > > > > > > > > Server Virtualization Business Unit > > > > > > > > > > Cisco Systems > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ______________________________________________________________ > > > > > > > > > > From: Shirley Ma [mailto:xma at us.ibm.com] > > > > > > > > > > Sent: Tuesday, May 22, 2007 11:30 AM > > > > > > > > > > To: koen.segers at VRT.BE > > > > > > > > > > Cc: Ami Perlmutter; > > general at lists.openfabrics.org; > > > > > > > > > > general-bounces at lists.openfabrics.org; Scott > > > > > > Weitzenkamp > > > > > > > > > > (sweitzen) > > > > > > > > > > Subject: RE: [ofa-general] GPFS node loses > > > > > > IB-connection > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Koen, > > > > > > > > > > > > > > > > > > > > So it is most likely you hit the same bug as > > > > > > 229 (Scott > > > > > > > > > > pointed out earlier). The same workaround > might > > > > > > > > work for you > > > > > > > > > > by renicing ib_mad as Scott suggested. 
> > > > > > > > > > > > > > > > > > > > I think this should be a SM query timeout tunable value in Cisco SM. Am I right, Scott? > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > Shirley Ma
> > > > > > > > > >
> > > > > > > > > > From: Koen Segers
> > > > > > > > > > Date: 05/22/07 11:14 AM
> > > > > > > > > > Please respond to: koen.segers at VRT.BE
> > > > > > > > > > To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > > > cc: Ami Perlmutter, general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
> > > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >
> > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > It is the Cisco SM.
> > > > > > > > > > > > > > > > > > > > SFS-7000P> show version
> > > > > > > > > >
> > > > > > > > > > ================================================================================
> > > > > > > > > >                           System Version Information
> > > > > > > > > > ================================================================================
> > > > > > > > > >      system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > > > >             contact : tac at cisco.com
> > > > > > > > > >                name : SFS-7000P
> > > > > > > > > >            location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > > > >             up-time : 11(d):7(h):49(m):3(s)
> > > > > > > > > >         last-change : none
> > > > > > > > > >    last-config-save : none
> > > > > > > > > >              action : none
> > > > > > > > > >              result : none
> > > > > > > > > >           oper-mode : normal
> > > > > > > > > >
> > > > > > > > > > There is also a command that gives the SM version, but I can't find it right now. > > > > > > > > > > > > > > > > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote: > > > > > > > > > > > Hello Koen, > > > > > > > > > > > > > > > > > > > > > > From the switch log, it looks like an SM issue to me. The node was kicked > > > > > > > > > > > out of the membership. Which SM are you using in your fabric?
> > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > Shirley Ma > > > > > > > > > > > > > > > > > > > > > *** Disclaimer *** > > > > > > > > > > > > > > > > > > > > Vlaamse Radio- en Televisieomroep > > > > > > > > > > Auguste Reyerslaan 52, 1043 Brussel > > > > > > > > > > > > > > > > > > > > nv van publiek recht > > > > > > > > > > BTW BE 0244.142.664 > > > > > > > > > > RPR Brussel > > > > > > > > > > http://www.vrt.be/disclaimer > > > > > > > _______________________________________________ > > > > > > > general mailing list > > > > > > > general at lists.openfabrics.org > > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST
From sobebike at gmail.com Tue May 29 09:19:16 2007 From: sobebike at gmail.com (SoBeBike) Date: Tue, 29 May 2007 11:19:16 -0500 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: References: <20070525212214.20500.qmail@station183.com> Message-ID: OK. I'll try to create a simple test case which exhibits the problem. It'll be a bit - I'm on vacation and probably won't mess with this much until I return. On 5/28/07, Roland Dreier wrote: > > Any ideas on why the ibv_get_cq_event() would never see an event > > after a "successful" send requesting a completion event? > > It's either a bug in your code or a bug in the stack below your code. > The best way to debug this would be for you to post your actual code > (in a form that someone else can run), so that we can either point out > what's wrong with your code, or have a test case for the real bug. > > - R.
From sean.hefty at intel.com Tue May 29 10:14:54 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 29 May 2007 10:14:54 -0700 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <309a667c0705272250q68aa4064l40454db5b266a967@mail.gmail.com> Message-ID: <000001c7a214$e0d30580$86cc180a@amr.corp.intel.com> >Ok, but, by that time we can keep the framework ready? I plan on re-submitting the cache for 2.6.23.
Beyond that I won't have the time to work on enhancements for a few weeks. I will happily review any patch submissions though. >How this will be managed? This will add extra startup time in the >cluster, because cluster will be usable only after last cache has been >enabled. Am I right? I would word this differently: we can improve the time required to load the cache, versus stating that the cache adds extra startup time. The cache is not necessary to use the cluster, so doesn't force extra startup time. Cache misses would simply be forwarded directly to the SA. If the first application to run on the cluster isn't establishing all-to-all communication between the nodes then there may not be any reason to delay starting the app. Even if the first app does establish all-to-all communication, waiting for the caches to load can delay the start of the app, but cache use may decrease the overall execution time of the app by more than this delay. (Loading the cache is likely to be more efficient than applications obtaining the path records themselves.) >How multi-pathing is handled in current cache_module? A kernel ULP can request all paths, then select the one they want. Beyond that, the cache can either return paths to the user round robin or randomly, based on the cache settings. - Sean
From ralph.campbell at qlogic.com Tue May 29 10:24:48 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 29 May 2007 10:24:48 -0700 Subject: [ofa-general] Re: [ewg] [PATCH] ofed_1_2/sdp - SDP can lose receive data In-Reply-To: <1180256850.15464.1.camel@localhost> References: <1180139623.3407.373.camel@brick.pathscale.com> <1180256850.15464.1.camel@localhost> Message-ID: <1180459488.3407.376.camel@brick.pathscale.com> It is from git://git.openfabrics.org/~vlad/ofed_1_2 commit 726c6827ac31c0b2f40acd804dc53362289bd21f On Sun, 2007-05-27 at 12:07 +0300, Ami Perlmutter wrote: > Ralph, > this is how the code is now. > Where are you getting this code from?
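For readers following the SDP thread, the receive-path behaviour at issue in the patch quoted in this thread can be modelled in user space. This is only a sketch under stated assumptions: `rx_deliver`, `queue_tail`, `rx_total`, `BUF_CAP` and `QUEUE_CAP` are hypothetical names invented for illustration, and the fixed-size buffers stand in for kernel skbs; it is not the SDP code itself. It shows the rule the patch description asks for: coalesce into the tail buffer when there is room, otherwise queue the new buffer instead of dropping it.

```c
#include <stddef.h>
#include <string.h>

#define BUF_CAP   8   /* capacity of one buffer (hypothetical value) */
#define QUEUE_CAP 4   /* number of buffers the queue can hold (hypothetical) */

struct buf {
    char   data[BUF_CAP];
    size_t len;
};

struct rx_queue {
    struct buf bufs[QUEUE_CAP];
    size_t     count;
};

/* Append the data as a new buffer at the tail; returns -1 if the queue is full. */
static int queue_tail(struct rx_queue *q, const char *data, size_t len)
{
    if (q->count == QUEUE_CAP)
        return -1;
    memcpy(q->bufs[q->count].data, data, len);
    q->bufs[q->count].len = len;
    q->count++;
    return 0;
}

/*
 * Deliver received data: copy it into the tail buffer when it fits,
 * otherwise queue it as a new buffer. The "otherwise" branch is the
 * case the patch description says was being dropped.
 */
static int rx_deliver(struct rx_queue *q, const char *data, size_t len)
{
    if (len > BUF_CAP)
        return -1;
    if (q->count > 0) {
        struct buf *tail = &q->bufs[q->count - 1];
        if (BUF_CAP - tail->len >= len) {
            memcpy(tail->data + tail->len, data, len);
            tail->len += len;
            return 0;
        }
    }
    return queue_tail(q, data, len);
}

/* Total number of bytes currently held in the queue. */
static size_t rx_total(const struct rx_queue *q)
{
    size_t total = 0;
    for (size_t i = 0; i < q->count; i++)
        total += q->bufs[i].len;
    return total;
}
```

Delivering 5, 2 and then 3 bytes into an empty queue leaves two buffers holding 10 bytes in total; without the fallback to `queue_tail`, the last 3 bytes would be lost.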
> > On Fri, 2007-05-25 at 17:33 -0700, Ralph Campbell wrote: > > Can this fix be considered for OFED 1.2? > > Thanks. > > > > > > If a receive work completion is processed but there is no room > > in a previously queued skb, the data is dropped. > > This patch fixes the problem by queuing the skb. > > > > Signed-off-by: Ralph Campbell > > > > diff -r 074340d1892d drivers/infiniband/ulp/sdp/sdp_bcopy.c > > --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c Fri May 25 17:04:51 2007 -0700 > > +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c Fri May 25 17:07:02 2007 -0700 > > @@ -314,7 +314,8 @@ static inline struct sk_buff *sdp_sock_q > > skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len); > > __kfree_skb(skb); > > skb = tail; > > - } > > + } else > > + skb_queue_tail(&sk->sk_receive_queue, skb); > > } else > > skb_queue_tail(&sk->sk_receive_queue, skb); > > > > > > > > _______________________________________________ > > ewg mailing list > > ewg at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bob.kossey at hp.com Tue May 29 10:35:30 2007 From: bob.kossey at hp.com (Bob Kossey) Date: Tue, 29 May 2007 13:35:30 -0400 Subject: [ofa-general] Re: ipoib / bonding and OFED In-Reply-To: References: <3857BB049D83424D9DB82753D37CEA55459C41@taurus.voltaire.com><4657373E.2030903@hp.com> <465BDC90.5080305@voltaire.com> Message-ID: <465C6462.6000904@hp.com> Thanks guys, I'll have to update my bits and try again. Another related question. Does OFED 1.2 now support multiple independent IB fabrics (multiple SMs, etc) connected to multiple HCAs on the same node? Are there any qualifications about which dimensions are supported with this, such as ipoib HA, SRP HA, other types of failover, etc.? 
Thanks, Bob Scott Weitzenkamp (sweitzen) wrote: > Bob, it is now possible to configure IPoIB bonding in > /etc/infiniband/openib.conf, this configuration file includes the > following boilerplate.
>
> # Enable the bonding driver on startup
> IPOIBBOND_ENABLE=no
> # Set bond interface names
> #IPOIB_BONDS=bond0,bond1
> # Set specific bond params; address and slaves
> #bond0_IP=10.10.10.1
> #bond0_SLAVES=ib0,ib1
> #bond1_IP=20.10.10.1
> #bond1_SLAVES=ib2,ib3,ib4
>
> Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > >> -----Original Message----- >> From: general-bounces at lists.openfabrics.org >> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Or Gerlitz >> Sent: Tuesday, May 29, 2007 12:56 AM >> To: Bob Kossey >> Cc: OpenFabrics General >> Subject: [ofa-general] Re: ipoib / bonding and OFED >> >> Bob Kossey wrote: >>> I copied OR since I think this is related to his OFED HA work, and >>> he might have some insights. A few more questions for Or: >>> I was trying to use ipoib bonding with OFED 1.2 rc2 and a 2.6.9 kernel, >>> but was not able to get it to work so far. I saw your Sonoma bonding >>> slides, and you mention kernel bonding driver changes were needed. >>> 2. Is there a minimum kernel version, with the kernel bonding driver >>> changes, that is required to use bonding with OFED ipoib? >>> >> Just to have a base line here: to get bonding to work with IPoIB, you >> should use the bonding driver provided with OFED 1.2. This driver is the >> upstream one (of 2.6.20) being patched to support IPoIB and backported >> to RH5, SLES10 and RH4 U3/4/5, other kernels are not supported. >> >> If you were using the ofed bonding on a system that matches the support >> matrix it should work.
If you do have problems under this config, please >> either open a bug at the ofed bugzilla >> @ bugs.openfabrics.org assigned to monis at voltaire.com (Moni Shoua) or >> send first report/question to Moni and CC ewg at lists.openfabrics.org >> >> Please note that between RC2 and RC4 (to be released today etc) some >> bugs were fixed, you can search in the bugzilla to see what. >> >>> 3. The bonding driver uses the HWADDR from the underlying ipoib >>> devices, how does it obtain the HWADDR? Does it use the full 20 bytes, >>> or some subset? >>> >> when enslaving IPoIB devices, the bonding driver uses the full hw >> address of the active slave, it simply looks at the dev_addr field of >> the slave struct net_device (see include/linux/netdevice.h) >> >>> 4. What use_carrier options for link status detection does OFED ipoib >>> support, >>> MII, ETHTOOL or netif_carrier_ok? >>> >> the mii/ethtool etc local link detection methods of the bonding driver >> are somewhat deprecated, since nowadays almost any network device >> supports the netif_carrier_ok call. The --default-- of the upstream >> bonding driver (eg the one we use in OFED and the 2.6.21 listed below) >> is to set the use_carrier mod param to 1, that is, mii is not used anymore. >> >>> author: Thomas Davis, tadavis at lbl.gov and many others >>> description: Ethernet Channel Bonding Driver, v3.1.2 >>> version: 3.1.2 >>> parm: use_carrier:Use netif_carrier_ok (vs MII ioctls) in miimon; 0 for off, 1 for on (default) (int) >>> parm: miimon:Link check interval in milliseconds (int) >>> >>> If you have any good examples of bonding configuration settings that work >>> with OFED, I'd appreciate that also. >>> >> The bonding RPM provided with OFED is made of a driver, script and some >> help text containing usage examples, please take a look there and let me >> know if you have further questions.
>> >> >>> $ rpm -ql ib-bonding-0.9.0-2.6.9_42.ELsmp >>> >>> >> /lib/modules/2.6.9-42.ELsmp/updates/kernel/drivers/net/bonding >> /bonding.ko >> >>> /usr/bin/ib-bond >>> /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt >>> >> The ofed service (/etc/init.d/openibd) was enhanced to allow for >> --persistent-- bonding configuration, please see the bonding >> section at >> docs/ipoib_release_notes.txt to see how to do it. >> >> Or. >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> >> From rdreier at cisco.com Tue May 29 11:27:46 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 29 May 2007 11:27:46 -0700 Subject: [ofa-general] Re: [PATCH] libmlx4: fix qp capabilities In-Reply-To: <1180443676.6825.8.camel@mtls03> (Eli Cohen's message of "Tue, 29 May 2007 16:00:46 +0300") References: <1180443676.6825.8.camel@mtls03> Message-ID: thanks, that bug looks familiar from libmthca. I prefer to fix it like as below, though, since that gives the true capabilities of the QP being created. Also, how did you create your patch? > --- libmlx4.orig/src/qp.c 2007-05-29 13:13:57.000000000 +0300 > +++ libmlx4/src/qp.c 2007-05-29 14:41:33.000000000 +0300 > @@ -396,12 +396,13 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd, > cap->max_send_sge = 1; I couldn't find that context line in any version of src/qp.c that I had. 
diff --git a/src/qp.c b/src/qp.c index fa20dfa..8e2a3d3 100644 --- a/src/qp.c +++ b/src/qp.c @@ -390,7 +390,6 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap, int max_sq_sge; qp->rq.max_gs = cap->max_recv_sge; - qp->sq.max_gs = cap->max_send_sge; max_sq_sge = align(cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg), sizeof (struct mlx4_wqe_data_seg)) / sizeof (struct mlx4_wqe_data_seg); if (max_sq_sge < cap->max_send_sge) @@ -478,7 +477,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *qp, struct ibv_qp_cap *cap, { int wqe_size; - wqe_size = 1 << qp->sq.wqe_shift; + wqe_size = (1 << qp->sq.wqe_shift) - sizeof (struct mlx4_wqe_ctrl_seg); switch (type) { case IBV_QPT_UD: wqe_size -= sizeof (struct mlx4_wqe_datagram_seg); @@ -493,7 +492,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *qp, struct ibv_qp_cap *cap, break; } - qp->sq.max_gs = wqe_size / sizeof (struct mlx4_wqe_data_seg); + qp->sq.max_gs = wqe_size / sizeof (struct mlx4_wqe_data_seg); cap->max_send_sge = qp->sq.max_gs; qp->max_inline_data = wqe_size - sizeof (struct mlx4_wqe_inline_seg); cap->max_inline_data = qp->max_inline_data; From xma at us.ibm.com Tue May 29 13:12:28 2007 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 29 May 2007 13:12:28 -0700 Subject: [ofa-general] GPFS node loses IB-connection In-Reply-To: Message-ID: Hello Koen, >That is very difficult. This system is supposed to go in production >within a few weeks. Changing the OFED drivers requires rebuilding a lot >of other programs. If it isn't really necessary, I prefer not to do >this... I don't think there are some major changes in RC3 or RC4 to prevent you from running programs built against RC2. Please point out if wrong. You can try one node first. Thanks Shirley Ma -------------- next part -------------- An HTML attachment was scrubbed... 
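Referring back to the libmlx4 patch in this digest: the arithmetic it applies in `mlx4_set_sq_sizes` can be worked through with a small stand-alone sketch. The segment sizes below are illustrative assumptions, not values read from the libmlx4 headers, and `sq_sizes`, `struct sq_caps` and the `QPT_*` names are hypothetical. Only the UD case is visible in the patch context; the RC/UC case here is an assumption. The idea shown is the one the patch states: subtract the control segment and the transport-specific segment from the WQE size first, then derive the scatter/gather and inline limits from what is left.

```c
#include <stddef.h>

/* Illustrative segment sizes in bytes (assumed for this sketch). */
enum {
    CTRL_SEG_SIZE     = 16,
    DATAGRAM_SEG_SIZE = 48,
    RADDR_SEG_SIZE    = 16,
    DATA_SEG_SIZE     = 16,
    INLINE_SEG_SIZE   = 4
};

/* Hypothetical stand-ins for the verbs QP types. */
enum qp_type { QPT_RC, QPT_UC, QPT_UD };

struct sq_caps {
    int max_gs;           /* scatter/gather entries that fit in one WQE */
    int max_inline_data;  /* inline payload bytes that fit in one WQE */
};

/* Compute send-queue limits for a WQE of (1 << wqe_shift) bytes. */
static struct sq_caps sq_sizes(int wqe_shift, enum qp_type type)
{
    struct sq_caps caps;
    /* Every WQE starts with a control segment. */
    int wqe_size = (1 << wqe_shift) - CTRL_SEG_SIZE;

    switch (type) {
    case QPT_UD:
        /* UD sends carry an address (datagram) segment. */
        wqe_size -= DATAGRAM_SEG_SIZE;
        break;
    case QPT_RC:
    case QPT_UC:
        /* Assumed: room is reserved for a remote-address segment. */
        wqe_size -= RADDR_SEG_SIZE;
        break;
    }

    caps.max_gs = wqe_size / DATA_SEG_SIZE;
    caps.max_inline_data = wqe_size - INLINE_SEG_SIZE;
    return caps;
}
```

With these assumed sizes and a 128-byte WQE (`wqe_shift` of 7), an RC queue pair gets 6 gather entries and 92 bytes of inline data, while a UD queue pair gets 4 entries and 60 bytes.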
URL: From xma at us.ibm.com Tue May 29 13:13:55 2007 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 29 May 2007 13:13:55 -0700 Subject: [ofa-general] Re: GPFS node loses IB-connection In-Reply-To: <20070529154700.GA8321@mellanox.co.il> Message-ID: Hello Michael, >What Ami is asking you to do is to try to reproduce the problem with -RC3 or -RC4 when it's out. Is there a known bug fix related to this issue in RC3? Thanks Shirley Ma -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Tue May 29 13:32:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 29 May 2007 13:32:47 -0700 Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/mthca: fix send CQE with error for QP connected to SRQ In-Reply-To: <20070527150642.GC26933@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 27 May 2007 18:06:42 +0300") References: <20070527150642.GC26933@mellanox.co.il> Message-ID: thanks, applied for 2.6.22 (and also fixed the same bug in libmthca) From sweitzen at cisco.com Tue May 29 13:39:54 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 29 May 2007 13:39:54 -0700 Subject: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with kernel2-6.18-8.1.4.el5 In-Reply-To: References: <1E3DCD1C63492545881FACB6063A57C1D524A3@mtiexch01.mti.com> Message-ID: Moni S, The ib-bonding configuration process seems too picky, should we just apply RHEL5 patches if we see a *el5* kernel? In other words, change: $ fgrep 2.6.18 linux/configure 2.6.18-1.2747.el5*|2.6.18-8.el5*|2.6.18-*.*.fc6) to: 2.6.18-*el5*|2.6.18-*.*.fc6) ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jeffrey Wong Sent: Friday, May 25, 2007 1:12 PM To: general at lists.openfabrics.org Subject: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with kernel2-6.18-8.1.4.el5 Hello, I am installing the OFED 1.2-rc3. 
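The effect of the pattern change Scott proposes above can be checked with POSIX `fnmatch(3)`, which implements the same glob matching a shell `case` statement uses for each alternative. This is a minimal sketch: `kernel_matches`, `old_patterns` and `new_patterns` are hypothetical names, the `|` alternatives from `linux/configure` are split into separate strings by hand, and the release string `2.6.18-8.1.4.el5` is taken from the subject line of the report.

```c
#include <fnmatch.h>
#include <stddef.h>

/* Alternatives currently listed in linux/configure, one glob per entry. */
static const char *const old_patterns[] = {
    "2.6.18-1.2747.el5*", "2.6.18-8.el5*", "2.6.18-*.*.fc6"
};

/* Alternatives after the proposed change. */
static const char *const new_patterns[] = {
    "2.6.18-*el5*", "2.6.18-*.*.fc6"
};

/* Returns 1 if `release` matches any of the `n` glob patterns, else 0. */
static int kernel_matches(const char *release,
                          const char *const *patterns, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (fnmatch(patterns[i], release, 0) == 0)
            return 1;
    return 0;
}
```

`kernel_matches("2.6.18-8.1.4.el5", old_patterns, 3)` returns 0, because `2.6.18-8.el5*` requires `.el5` directly after the `8`, whereas the same call with `new_patterns` returns 1.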
Everything else builds except for ib-bonding. Thanks in advance.

I am getting the following error messages:

+ make -C /lib/modules/2.6.18-8.1.4.el5/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding
make: Entering directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64'
  CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o
In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:78:
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_inactive_flags':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: (Each undeclared identifier is reported only once
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:262: error: for each function it appears in.)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function 'bond_set_slave_active_flags':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:268: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_compute_features':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1233: warning: comparison of distinct pointer types lacks a cast
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_enslave':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1449: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_release':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1848: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:1849: error: 'IFF_SLAVE_NEEDARP' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_arp_rcv':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:2548: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_netdev_event':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:3390: error: 'IFF_BONDING' undeclared (first use in this function)
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c: In function 'bond_init':
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4374: warning: assignment discards qualifiers from pointer target type
/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:4386: error: 'IFF_BONDING' undeclared (first use in this function)
make[1]: *** [/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o] Error 1
make: *** [_module_/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding] Error 2
make: Leaving directory `/usr/src/kernels/2.6.18-8.1.4.el5-x86_64'
+ echo ' Building IB bonding driver failed'
Building IB bonding driver failed
+ exit 1
error: Bad exit status from /var/tmp/rpm-tmp.23876 (%build)

Jeff Wong

From rdreier at cisco.com Tue May 29 13:48:09 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 13:48:09 -0700
Subject: [ofa-general] Re: [PATCH for-2.6.22] IB/ipoib: fix performance regression on Mellanox
In-Reply-To: <20070528113727.GP2945@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 28 May 2007 14:37:27 +0300")
References: <20070521120459.GI20400@mellanox.co.il> <46595950.6080106@voltaire.com> <20070527125337.GF8342@mellanox.co.il> <20070528113727.GP2945@mellanox.co.il>
Message-ID:

thanks, I queued this

From rdreier at cisco.com Tue May 29 13:49:22 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 13:49:22 -0700
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <20070529044815.GD13866@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 May 2007 07:48:15 +0300")
References: <20070514045832.GA18615@mellanox.co.il> <20070528121206.GA1847@mellanox.co.il> <20070529044815.GD13866@mellanox.co.il>
Message-ID:

> > > IB/ipoib: fix to_ipoib_neigh access race
> >
> > I'm not convinced this is 2.6.22 material at this point -- it doesn't
> > fix any observed problem that I know of. (And the SRQ drain patch
> > shows how even safe-looking patches can cause big problems)
>
> Fine, but we do have it in OFED - could you spare some cycles to review it?
I plan to review it, but I question the decision to put it in OFED. I
would have thought that OFED 1.2 was even more frozen than 2.6.22, and
I'm not sure why you would want to stick a patch like this in when you
don't know of anything that it fixes.

From rdreier at cisco.com Tue May 29 13:49:33 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 13:49:33 -0700
Subject: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection
In-Reply-To: <000101c79c09$74f15440$54c8180a@amr.corp.intel.com> (Sean Hefty's message of "Mon, 21 May 2007 17:38:02 -0700")
References: <000101c79c09$74f15440$54c8180a@amr.corp.intel.com>
Message-ID:

thanks, applied for 2.6.22

From rdreier at cisco.com Tue May 29 13:50:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 13:50:12 -0700
Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git
In-Reply-To: <20070529071701.GA8159@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 May 2007 10:17:01 +0300")
References: <20070514045832.GA18615@mellanox.co.il> <20070528121206.GA1847@mellanox.co.il> <20070529071701.GA8159@mellanox.co.il>
Message-ID:

> > IB/ipoib: fix to_ipoib_neigh access race
> for-2.6.23 for now?

Yes, I plan to review it more carefully and queue it for 2.6.23.

From rdreier at cisco.com Tue May 29 13:57:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 29 May 2007 13:57:12 -0700
Subject: [ofa-general] Re: libibverbs autogen failures in ubuntu dapper
In-Reply-To: <20070529091543.GG8159@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 May 2007 12:15:43 +0300")
References: <20070529091543.GG8159@mellanox.co.il>
Message-ID:

> automake: Makefile.am: `src/libibverbs.la' is not a standard libtool library name

I think you need a newer automake.

BTW...

> Attempt to run autogen.sh on an ubuntu dapper laptop gave me this:

dapper?? isn't it already time to update to gutsy?

 - R.

From rdreier at cisco.com Tue May 29 13:59:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 29 May 2007 13:59:47 -0700 Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user In-Reply-To: <20070529091246.GF8159@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 May 2007 12:12:46 +0300") References: <465AF3D3.10205@dev.mellanox.co.il> <20070529091246.GF8159@mellanox.co.il> Message-ID: makes sense but: > - if (rlim.rlim_cur <= 32768) > - fprintf(stderr, PFX "Warning: RLIMIT_MEMLOCK is %lu bytes.\n" > - " This will severely limit memory registrations.\n", > - rlim.rlim_cur); > + if (rlim.rlim_cur > 32768) > + return; > + > + if (!getuid()) > + return; I think it would be more natural to check the UID before getting the rlimit. And shouldn't this be geteuid() to handle processes that have dropped their privileges? From weiny2 at llnl.gov Tue May 29 15:07:27 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 29 May 2007 15:07:27 -0700 Subject: [ofa-general] Re: [PATCH] opensm/console: portstatus command for only initialized ports In-Reply-To: <20070528200742.GA13193@sashak.voltaire.com> References: <20070528200742.GA13193@sashak.voltaire.com> Message-ID: <20070529150727.7d6be9c1.weiny2@llnl.gov> Looks fine to me. I guess you don't like this formatting? I like things to line up. It is easier to read. No big deal. Thanks, Ira On Mon, 28 May 2007 23:07:42 +0300 Sasha Khapyorsky wrote: > > Run portstatus command for only initialized ports + minor identation > fixes. 
> > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_console.c | 18 ++++++++++-------- > 1 files changed, 10 insertions(+), 8 deletions(-) > > diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c > index 2802c38..3415262 100644 > --- a/opensm/opensm/osm_console.c > +++ b/opensm/opensm/osm_console.c > @@ -598,15 +598,17 @@ __get_stats(cl_map_item_t * const p_map_item, void *context) > fs->total_nodes++; > > for (port = 1; port < num_ports; port++) { > - osm_physp_t *phys = osm_node_get_physp_ptr(node, port); > + osm_physp_t *phys = osm_node_get_physp_ptr(node, port); > ib_port_info_t *pi = &(phys->port_info); > - > - uint8_t active_speed = ib_port_info_get_link_speed_active(pi); > - uint8_t enabled_speed = ib_port_info_get_link_speed_enabled(pi); > - uint8_t active_width = pi->link_width_active; > - uint8_t enabled_width = pi->link_width_enabled; > - uint8_t port_state = ib_port_info_get_port_state(pi); > - uint8_t port_phys_state = ib_port_info_get_port_phys_state(pi); > + uint8_t active_speed = ib_port_info_get_link_speed_active(pi); > + uint8_t enabled_speed = ib_port_info_get_link_speed_enabled(pi); > + uint8_t active_width = pi->link_width_active; > + uint8_t enabled_width = pi->link_width_enabled; > + uint8_t port_state = ib_port_info_get_port_state(pi); > + uint8_t port_phys_state = ib_port_info_get_port_phys_state(pi); > + > + if (!osm_physp_is_valid(phys)) > + continue; > > if ((enabled_width ^ active_width) > active_width) { > __tag_port_report(&(fs->reduced_width_ports), > -- > 1.5.2.109.g802f From hanafim.ctr at asc.hpc.mil Tue May 29 15:57:53 2007 From: hanafim.ctr at asc.hpc.mil (MAHMOUD HANAFI) Date: Tue, 29 May 2007 18:57:53 -0400 Subject: [ofa-general] Need OFED1.1 ib_srp max_hw_sectors_kb help! Message-ID: <465CAFF1.9000603@asc.hpc.mil> All, I am using OFED1.1 with CISCO HCA/switch and DDN Storage. I am able to load and perform IO to the DDN via srp driver. But, the max_hw_sectors_kb for the device is getting set to 64kb. 
Any one else seen this issue? Same host and storage with fiber channel doesn't have this problem. It set max_hw_sectors_kb correctly to 4096KB. Thanks, -- Mahmoud Hanafi Senior System Administrator ASC/MSRC www.asc.hpc.mil 2435 5th Street WPAFB, OHIO 45433 (937) 255-1536 From rdreier at cisco.com Tue May 29 16:20:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 29 May 2007 16:20:39 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get a few more nicely balanced ("55 insertions(+), 55 deletions(-)") 2.6.22-rc3 fixes, mostly for IPoIB connected mode: Michael S. Tsirkin (2): IB/mthca: Fix handling of send CQE with error for QPs connected to SRQ IPoIB/cm: Fix performance regression on Mellanox Roland Dreier (1): IB/mlx4: Fix last allocated object tracking in bitmap allocator Sean Hefty (1): IB/cm: Fix stale connection detection drivers/infiniband/core/cm.c | 25 ++++++----- drivers/infiniband/hw/mthca/mthca_qp.c | 6 +- drivers/infiniband/ulp/ipoib/ipoib.h | 3 +- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 74 +++++++++++++++---------------- drivers/net/mlx4/alloc.c | 2 +- 5 files changed, 55 insertions(+), 55 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index e840434..40c004a 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1297,26 +1297,29 @@ static struct cm_id_private * cm_match_req(struct cm_work *work, req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad; - /* Check for duplicate REQ and stale connections. */ + /* Check for possible duplicate REQ. 
*/ spin_lock_irqsave(&cm.lock, flags); timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info); - if (!timewait_info) - timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info); - if (timewait_info) { cur_cm_id_priv = cm_get_id(timewait_info->work.local_id, timewait_info->work.remote_id); - cm_cleanup_timewait(cm_id_priv->timewait_info); spin_unlock_irqrestore(&cm.lock, flags); if (cur_cm_id_priv) { cm_dup_req_handler(work, cur_cm_id_priv); cm_deref_id(cur_cm_id_priv); - } else - cm_issue_rej(work->port, work->mad_recv_wc, - IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ, - NULL, 0); - listen_cm_id_priv = NULL; - goto out; + } + return NULL; + } + + /* Check for stale connections. */ + timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info); + if (timewait_info) { + cm_cleanup_timewait(cm_id_priv->timewait_info); + spin_unlock_irqrestore(&cm.lock, flags); + cm_issue_rej(work->port, work->mad_recv_wc, + IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ, + NULL, 0); + return NULL; } /* Find matching listen request. */ diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 0276649..eef415b 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -2284,10 +2284,10 @@ void mthca_free_err_wqe(struct mthca_dev *dev, struct mthca_qp *qp, int is_send, struct mthca_next_seg *next; /* - * For SRQs, all WQEs generate a CQE, so we're always at the - * end of the doorbell chain. + * For SRQs, all receive WQEs generate a CQE, so we're always + * at the end of the doorbell chain. */ - if (qp->ibqp.srq) { + if (qp->ibqp.srq && !is_send) { *new_wqe = 0; return; } diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 158759e..285c143 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -156,7 +156,7 @@ struct ipoib_cm_data { * - and then invoke a Destroy QP or Reset QP. 
* * We use the second option and wait for a completion on the - * rx_drain_qp before destroying QPs attached to our SRQ. + * same CQ before destroying QPs attached to our SRQ. */ enum ipoib_cm_state { @@ -199,7 +199,6 @@ struct ipoib_cm_dev_priv { struct ib_srq *srq; struct ipoib_cm_rx_buf *srq_ring; struct ib_cm_id *id; - struct ib_qp *rx_drain_qp; /* generates WR described in 10.3.1 */ struct list_head passive_ids; /* state: LIVE */ struct list_head rx_error_list; /* state: ERROR */ struct list_head rx_flush_list; /* state: FLUSH, drain not started */ diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index f133b56..076a0bb 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -69,8 +69,9 @@ static struct ib_qp_attr ipoib_cm_err_attr = { #define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff -static struct ib_recv_wr ipoib_cm_rx_drain_wr = { - .wr_id = IPOIB_CM_RX_DRAIN_WRID +static struct ib_send_wr ipoib_cm_rx_drain_wr = { + .wr_id = IPOIB_CM_RX_DRAIN_WRID, + .opcode = IB_WR_SEND, }; static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, @@ -163,16 +164,22 @@ partial_error: static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv* priv) { - struct ib_recv_wr *bad_wr; + struct ib_send_wr *bad_wr; + struct ipoib_cm_rx *p; - /* rx_drain_qp send queue depth is 1, so + /* We only reserved 1 extra slot in CQ for drain WRs, so * make sure we have at most 1 outstanding WR. */ if (list_empty(&priv->cm.rx_flush_list) || !list_empty(&priv->cm.rx_drain_list)) return; - if (ib_post_recv(priv->cm.rx_drain_qp, &ipoib_cm_rx_drain_wr, &bad_wr)) - ipoib_warn(priv, "failed to post rx_drain wr\n"); + /* + * QPs on flush list are error state. This way, a "flush + * error" WC will be immediately generated for each WR we post. 
+ */ + p = list_entry(priv->cm.rx_flush_list.next, typeof(*p), list); + if (ib_post_send(p->qp, &ipoib_cm_rx_drain_wr, &bad_wr)) + ipoib_warn(priv, "failed to post drain wr\n"); list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list); } @@ -199,10 +206,10 @@ static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev, struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr attr = { .event_handler = ipoib_cm_rx_event_handler, - .send_cq = priv->cq, /* does not matter, we never send anything */ + .send_cq = priv->cq, /* For drain WR */ .recv_cq = priv->cq, .srq = priv->cm.srq, - .cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */ + .cap.max_send_wr = 1, /* For drain WR */ .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_RC, @@ -242,6 +249,27 @@ static int ipoib_cm_modify_rx_qp(struct net_device *dev, ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret); return ret; } + + /* + * Current Mellanox HCA firmware won't generate completions + * with error for drain WRs unless the QP has been moved to + * RTS first. This work-around leaves a window where a QP has + * moved to error asynchronously, but this will eventually get + * fixed in firmware, so let's not error out if modify QP + * fails. 
+ */ + qp_attr.qp_state = IB_QPS_RTS; + ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to init QP attr for RTS: %d\n", ret); + return 0; + } + ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS: %d\n", ret); + return 0; + } + return 0; } @@ -623,38 +651,11 @@ static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr) int ipoib_cm_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_qp_init_attr qp_init_attr = { - .send_cq = priv->cq, /* does not matter, we never send anything */ - .recv_cq = priv->cq, - .cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */ - .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ - .cap.max_recv_wr = 1, - .cap.max_recv_sge = 1, /* FIXME: 0 Seems not to work */ - .sq_sig_type = IB_SIGNAL_ALL_WR, - .qp_type = IB_QPT_UC, - }; int ret; if (!IPOIB_CM_SUPPORTED(dev->dev_addr)) return 0; - priv->cm.rx_drain_qp = ib_create_qp(priv->pd, &qp_init_attr); - if (IS_ERR(priv->cm.rx_drain_qp)) { - printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); - ret = PTR_ERR(priv->cm.rx_drain_qp); - return ret; - } - - /* - * We put the QP in error state directly. This way, a "flush - * error" WC will be immediately generated for each WR we post. 
- */ - ret = ib_modify_qp(priv->cm.rx_drain_qp, &ipoib_cm_err_attr, IB_QP_STATE); - if (ret) { - ipoib_warn(priv, "failed to modify drain QP to error: %d\n", ret); - goto err_qp; - } - priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev); if (IS_ERR(priv->cm.id)) { printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); @@ -676,8 +677,6 @@ err_listen: ib_destroy_cm_id(priv->cm.id); err_cm: priv->cm.id = NULL; -err_qp: - ib_destroy_qp(priv->cm.rx_drain_qp); return ret; } @@ -740,7 +739,6 @@ void ipoib_cm_dev_stop(struct net_device *dev) kfree(p); } - ib_destroy_qp(priv->cm.rx_drain_qp); cancel_delayed_work(&priv->cm.stale_task); } diff --git a/drivers/net/mlx4/alloc.c b/drivers/net/mlx4/alloc.c index dfbd580..f8d63d3 100644 --- a/drivers/net/mlx4/alloc.c +++ b/drivers/net/mlx4/alloc.c @@ -51,8 +51,8 @@ u32 mlx4_bitmap_alloc(struct mlx4_bitmap *bitmap) if (obj < bitmap->max) { set_bit(obj, bitmap->table); + bitmap->last = (obj + 1) & (bitmap->max - 1); obj |= bitmap->top; - bitmap->last = obj + 1; } else obj = -1; From pradeeps at linux.vnet.ibm.com Tue May 29 18:01:21 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 29 May 2007 18:01:21 -0700 Subject: [ofa-general] Re: IPOIB CM (NOSRQ) patch -memory footprint In-Reply-To: <20070527083932.GC8342@mellanox.co.il> References: <46537081.30906@linux.vnet.ibm.com> <20070524053819.GF6019@mellanox.co.il> <46574099.3090601@linux.vnet.ibm.com> <20070527083932.GC8342@mellanox.co.il> Message-ID: <465CCCE1.1020106@linux.vnet.ibm.com> Michael S. Tsirkin wrote: >>>> -The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that >>>> support SRQ like the Topspin HCA and, such HCAs should not be >>>> impacted at all. >>> I don't think it's that clean yet. >>> >>> Here's an idea: implement "fake SRQ" for ehca in software: make post recv on >>> srq queue the WR, spread them evenly between QPs as they connect. Once # of >>> QPs goes above some limit, create QP command will fail. 
This already exists in the last couple of versions of the patch. We send a REJ command when a predetermined threshold is crossed. What we have been debating is what to do on the active side when the REJ command is received. This would contain >>> the mess nicely inside ehca (I think you'll want to add a flag that lets >>> software figure out that SRQ is fake). >>> >>> We will still be left with the basic problem of what to do at the active side >>> upon the reject, though. >> As you indicate this will not solve the problem, so it is not an option. > > Above, I have outlined how it can be done, so it certainly *is* an option. In the previous mail I proposed a method to address both viewpoints: a) let the active side return an error to the user level app and leave the onus to the application b) switch to datagram mode when the QP threshold is crossed. There has been no response to that proposal. > > In this thread, you basically keep saying that ehca will ever be the only HCA without SRQ > support, so you can make a lot of assumptions about how IPoIB is used. > > Fine, but if you follow this logic, it makes sense to hide the mess under the ehca > provider interface. > > Every time I address the issues you have raised previously, it appears that something else crops up. I have said that I can provide a patch that addresses both alternatives a) and b) above. Let us just stick to that and limit our discussions and proceed to close the issues out and not diverge any further. 
Pradeep From jsquyres at cisco.com Tue May 29 18:47:11 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 29 May 2007 21:47:11 -0400 Subject: [ofa-general] Re: [ewg] Upcoming OFED teleconferences In-Reply-To: <465C3A0E.6070602@mellanox.co.il> References: <338EEB24-0F0E-443D-94AA-61E512611F8B@cisco.com> <465C3A0E.6070602@mellanox.co.il> Message-ID: Since no one objected, I moved tomorrow's teleconference to meet Tziporet's schedule (and OFED teleconference without the RM would be kinda meaningless!). 2:30pm US Eastern, 11:30am US Pacific, 9:30pm Israel 1. Wednesday, May 30 (*TOMORROW*), code 210262040 All others: noon US eastern / 9am US Pacific / 7pm Israel 2. Monday, June 4, code 2102061 3. Monday, June 11, code 210213621 4. Monday, June 18, code 2102061 5. Monday, June 25, code 210213621 US/Canada: +1.866.432.9903 India: +91.80.4103.3979 Israel: +972.9.892.7026 Others: http://cisco.com/en/US/about/doing_business/conferencing/ On May 29, 2007, at 10:34 AM, Tziporet Koren wrote: > Jeff Squyres wrote: >> Short version: >> -------------- >> >> Upcoming OFED teleconferences, all at noon US eastern / 9am US >> Pacific / 7pm Israel. >> >> 1. Wednesday, May 30 (*TOMORROW*), code 210262040 > I cannot make it at Wed 9am PST > Can you change to 11:30am PST > > Thanks, > Tziporet >> 2. Monday, June 4, code 2102061 >> 3. Monday, June 11, code 210213621 >> 4. Monday, June 18, code 2102061 >> 5. Monday, June 25, code 210213621 >> >> US/Canada: +1.866.432.9903 >> India: +91.80.4103.3979 >> Israel: +972.9.892.7026 >> Others: http://cisco.com/en/US/about/doing_business/conferencing/ -- Jeff Squyres Cisco Systems From mst at dev.mellanox.co.il Tue May 29 20:37:43 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 30 May 2007 06:37:43 +0300 Subject: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git In-Reply-To: References: <20070514045832.GA18615@mellanox.co.il> <20070528121206.GA1847@mellanox.co.il> <20070529044815.GD13866@mellanox.co.il> Message-ID: <20070530033743.GC9036@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] [GIT PULL] 2.6.22: please pull ~mst/linux-2.6/.git > > > > > IB/ipoib: fix to_ipoib_neigh access race > > > > > > I'm not convinced this is 2.6.22 material at this point -- it doesn't > > > fix any observed problem that I know of. (And the SRQ drain patch > > > shows how even safe-looking patches can cause big problems) > > > > Fine, but we do have it in OFED - could you spare some cycles to review it? > > I plan to review it, but I question the decision to put it in OFED. I > would have thought that OFED 1.2 was even more frozen then 2.6.22, and > I'm not sure why you would want to stick a patch like this in when you > don't know of anything that it fixes. Point taken - I took this out. -- MST From mst at dev.mellanox.co.il Tue May 29 20:39:45 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 May 2007 06:39:45 +0300 Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user In-Reply-To: References: <465AF3D3.10205@dev.mellanox.co.il> <20070529091246.GF8159@mellanox.co.il> Message-ID: <20070530033945.GD9036@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH] suppress RLIMIT warning for root user > > makes sense but: > > > - if (rlim.rlim_cur <= 32768) > > - fprintf(stderr, PFX "Warning: RLIMIT_MEMLOCK is %lu bytes.\n" > > - " This will severely limit memory registrations.\n", > > - rlim.rlim_cur); > > + if (rlim.rlim_cur > 32768) > > + return; > > + > > + if (!getuid()) > > + return; > > I think it would be more natural to check the UID before getting the > rlimit. And shouldn't this be geteuid() to handle processes that have > dropped their privileges? 
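A minimal sketch of the ordering suggested in the quoted review above: look at the effective UID first, and only then query the limit. This is not the patch that was eventually applied. The function names and the MEMLOCK_WARN_BYTES constant are assumptions made for illustration; only the 32768-byte threshold and the warning text come from the quoted diff.

```c
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed name; the value is the 32768-byte threshold from the quoted patch. */
#define MEMLOCK_WARN_BYTES 32768UL

/*
 * Hypothetical helper: decide whether a warning is needed for a given
 * effective UID and soft limit. Kept free of syscalls so it can be tested.
 */
int memlock_warning_needed(uid_t euid, rlim_t cur)
{
	if (euid == 0)		/* effective root: no warning */
		return 0;
	return cur <= MEMLOCK_WARN_BYTES;
}

/* Hypothetical wrapper: check the effective UID before reading the rlimit. */
void check_memlock_limit(void)
{
	struct rlimit rlim;

	if (geteuid() == 0)
		return;
	if (getrlimit(RLIMIT_MEMLOCK, &rlim))
		return;
	if (memlock_warning_needed(geteuid(), rlim.rlim_cur))
		fprintf(stderr, "Warning: RLIMIT_MEMLOCK is %lu bytes.\n"
			"    This will severely limit memory registrations.\n",
			(unsigned long) rlim.rlim_cur);
}
```

Using geteuid() rather than getuid() means a process that started as root but dropped its privileges still gets the warning, which is the case raised in the review.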
Agree on both points. -- MST From devesh28 at gmail.com Tue May 29 21:44:21 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Wed, 30 May 2007 10:14:21 +0530 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <000001c7a214$e0d30580$86cc180a@amr.corp.intel.com> References: <309a667c0705272250q68aa4064l40454db5b266a967@mail.gmail.com> <000001c7a214$e0d30580$86cc180a@amr.corp.intel.com> Message-ID: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com> On 5/29/07, Sean Hefty wrote: > >Ok, but, by that time we can keep the framework ready? > > I plan on re-submitting the cache for 2.6.23. Beyond that I won't have the time > to work on enhancements for a few weeks. I will happily review any patch > submissions though. Ok, Soon I will post a patch related to this. How static PR file will be generated? Needs to be discussed. > > >How this will be managed? This will add extra startup time in the > >cluster, because cluster will be usable only after last cache has been > >enabled. Am I right? > > I would word this differently: we can improve the time required to load the > cache, versus stating that the cache adds extra startup time. > > The cache is not necessary to use the cluster, so doesn't force extra startup > time. Cache misses would simply be forwarded directly to the SA. If the first > application to run on the cluster isn't establishing all-to-all communication > between the nodes then there may not be any reason to delay starting the app. > > Even if the first app does establish all-to-all communication, waiting for the > caches to load can delay the start of the app, but cache use may decrease the > overall execution time of the app by more than this delay. (Loading the cache > is likely to be more efficient than applications obtaining the path records > themselves.) > > >How multi-pathing is handled in current cache_module? > > A kernel ULP can request all paths, then select the one they want. 
Beyond that,
> the cache can either return paths to the user round robin or randomly, based on
> the cache settings.
>
> - Sean
>

From eli at mellanox.co.il Tue May 29 23:23:38 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 30 May 2007 09:23:38 +0300
Subject: [ofa-general] Re: [PATCH] libmlx4: fix qp capabilities
In-Reply-To:
References: <1180443676.6825.8.camel@mtls03>
Message-ID: <1180506248.6825.19.camel@mtls03>

On Tue, 2007-05-29 at 11:27 -0700, Roland Dreier wrote:
> Also, how did you create your patch?
>
> > --- libmlx4.orig/src/qp.c 2007-05-29 13:13:57.000000000 +0300
> > +++ libmlx4/src/qp.c 2007-05-29 14:41:33.000000000 +0300
> > @@ -396,12 +396,13 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd,
> > cap->max_send_sge = 1;
>
> I couldn't find that context line in any version of src/qp.c that I had.
>

This is part of a patch that we apply on OFED builds; it is required to
ensure the send queue is greater than zero.

From cap at nsc.liu.se Tue May 29 23:43:09 2007
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Wed, 30 May 2007 08:43:09 +0200
Subject: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with kernel2-6.18-8.1.4.el5
In-Reply-To:
References: <1E3DCD1C63492545881FACB6063A57C1D524A3@mtiexch01.mti.com>
Message-ID: <200705300843.13575.cap@nsc.liu.se>

On Tuesday 29 May 2007, Scott Weitzenkamp (sweitzen) wrote:
> Moni S,
>
> The ib-bonding configuration process seems too picky, should we just
> apply RHEL5 patches if we see a *el5* kernel? In other words, change:
>
> $ fgrep 2.6.18 linux/configure
> 2.6.18-1.2747.el5*|2.6.18-8.el5*|2.6.18-*.*.fc6)
>
> to:
>
> 2.6.18-*el5*|2.6.18-*.*.fc6)

Why mix in non-el5 kernels? 2.6.18-2747 and similar are beta kernels (fc
naming left-over) and stuff with fc6 in it is clearly not el5.

Update kernels for el5 (before el5u1) should be named 2.6.18-8.x.y.el5, so
maybe: "2.6.18-8.*el5"?

/Peter

From amip at dev.mellanox.co.il Tue May 29 23:43:06 2007
From: amip at dev.mellanox.co.il (Ami Perlmutter)
Date: Wed, 30 May 2007 09:43:06 +0300
Subject: [ofa-general] Re: [ewg] [PATCH] ofed_1_2/sdp - SDP can lose receive data
In-Reply-To: <1180459488.3407.376.camel@brick.pathscale.com>
References: <1180139623.3407.373.camel@brick.pathscale.com> <1180256850.15464.1.camel@localhost> <1180459488.3407.376.camel@brick.pathscale.com>
Message-ID: <1180507416.12048.19.camel@localhost>

This is how the code looks now:

	if (likely(skb_len && (tail = skb_peek_tail(&sk->sk_receive_queue))) &&
	    unlikely(skb_tailroom(tail) >= skb_len)) {
		skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len);
		__kfree_skb(skb);
		skb = tail;
	} else
		skb_queue_tail(&sk->sk_receive_queue, skb);

Could you point out the problem?

On Tue, 2007-05-29 at 10:24 -0700, Ralph Campbell wrote:
> It is from git://git.openfabrics.org/~vlad/ofed_1_2
> commit 726c6827ac31c0b2f40acd804dc53362289bd21f
>
> On Sun, 2007-05-27 at 12:07 +0300, Ami Perlmutter wrote:
> > Ralph,
> > this is how the code is now.
> > Where are you getting this code from?
> >
> > On Fri, 2007-05-25 at 17:33 -0700, Ralph Campbell wrote:
> > > Can this fix be considered for OFED 1.2?
> > > Thanks.
> > >
> > >
> > > If a receive work completion is processed but there is no room
> > > in a previously queued skb, the data is dropped.
> > > This patch fixes the problem by queuing the skb.
> > > > > > Signed-off-by: Ralph Campbell > > > > > > diff -r 074340d1892d drivers/infiniband/ulp/sdp/sdp_bcopy.c > > > --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c Fri May 25 17:04:51 2007 -0700 > > > +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c Fri May 25 17:07:02 2007 -0700 > > > @@ -314,7 +314,8 @@ static inline struct sk_buff *sdp_sock_q > > > skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len); > > > __kfree_skb(skb); > > > skb = tail; > > > - } > > > + } else > > > + skb_queue_tail(&sk->sk_receive_queue, skb); > > > } else > > > skb_queue_tail(&sk->sk_receive_queue, skb); > > > > > > > > > > > > _______________________________________________ > > > ewg mailing list > > > ewg at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From notting at redhat.com Wed May 30 01:05:18 2007 From: notting at redhat.com (Bill Nottingham) Date: Wed, 30 May 2007 04:05:18 -0400 Subject: [ofa-general] [PATCH] drivers/infiniband: fix comparsion between unsigned and negative Message-ID: <20070530080518.GA29195@nostromo.devel.redhat.com> Recent gcc versions emit warnings when unsigned variables are compared < 0 or >= 0. 
Signed-off-by: Bill Nottingham --- core/sysfs.c | 2 +- core/ucm.c | 2 +- core/ucma.c | 2 +- core/user_mad.c | 5 ++--- core/uverbs_main.c | 3 +-- core/verbs.c | 3 +-- hw/mlx4/qp.c | 2 +- 7 files changed, 8 insertions(+), 11 deletions(-) diff -ru linux-2.6.21-old/drivers/infiniband/core/sysfs.c linux-2.6.21/drivers/infiniband/core/sysfs.c --- linux-2.6.21-old/drivers/infiniband/core/sysfs.c 2007-05-30 02:52:52.000000000 -0400 +++ linux-2.6.21/drivers/infiniband/core/sysfs.c 2007-05-30 02:07:31.000000000 -0400 @@ -112,7 +112,7 @@ return ret; return sprintf(buf, "%d: %s\n", attr.state, - attr.state >= 0 && attr.state < ARRAY_SIZE(state_name) ? + attr.state < ARRAY_SIZE(state_name) ? state_name[attr.state] : "UNKNOWN"); } diff -ru linux-2.6.21-old/drivers/infiniband/core/ucma.c linux-2.6.21/drivers/infiniband/core/ucma.c --- linux-2.6.21-old/drivers/infiniband/core/ucma.c 2007-05-30 02:52:52.000000000 -0400 +++ linux-2.6.21/drivers/infiniband/core/ucma.c 2007-05-30 02:09:34.000000000 -0400 @@ -955,7 +955,7 @@ if (copy_from_user(&hdr, buf, sizeof(hdr))) return -EFAULT; - if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) + if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) return -EINVAL; if (hdr.in + sizeof(hdr) > len) diff -ru linux-2.6.21-old/drivers/infiniband/core/ucm.c linux-2.6.21/drivers/infiniband/core/ucm.c --- linux-2.6.21-old/drivers/infiniband/core/ucm.c 2007-05-30 02:52:52.000000000 -0400 +++ linux-2.6.21/drivers/infiniband/core/ucm.c 2007-05-30 02:08:01.000000000 -0400 @@ -1125,7 +1125,7 @@ if (copy_from_user(&hdr, buf, sizeof(hdr))) return -EFAULT; - if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucm_cmd_table)) + if (hdr.cmd >= ARRAY_SIZE(ucm_cmd_table)) return -EINVAL; if (hdr.in + sizeof(hdr) > len) diff -ru linux-2.6.21-old/drivers/infiniband/core/user_mad.c linux-2.6.21/drivers/infiniband/core/user_mad.c --- linux-2.6.21-old/drivers/infiniband/core/user_mad.c 2007-05-30 02:52:52.000000000 -0400 +++ linux-2.6.21/drivers/infiniband/core/user_mad.c 2007-05-30 
02:08:32.000000000 -0400 @@ -455,8 +455,7 @@ goto err; } - if (packet->mad.hdr.id < 0 || - packet->mad.hdr.id >= IB_UMAD_MAX_AGENTS) { + if (packet->mad.hdr.id >= IB_UMAD_MAX_AGENTS) { ret = -EINVAL; goto err; } @@ -665,7 +664,7 @@ down_write(&file->port->mutex); - if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !__get_agent(file, id)) { + if (id >= IB_UMAD_MAX_AGENTS || !__get_agent(file, id)) { ret = -EINVAL; goto out; } diff -ru linux-2.6.21-old/drivers/infiniband/core/uverbs_main.c linux-2.6.21/drivers/infiniband/core/uverbs_main.c --- linux-2.6.21-old/drivers/infiniband/core/uverbs_main.c 2007-05-30 02:52:52.000000000 -0400 +++ linux-2.6.21/drivers/infiniband/core/uverbs_main.c 2007-05-30 02:09:07.000000000 -0400 @@ -592,8 +592,7 @@ if (hdr.in_words * 4 != count) return -EINVAL; - if (hdr.command < 0 || - hdr.command >= ARRAY_SIZE(uverbs_cmd_table) || + if (hdr.command >= ARRAY_SIZE(uverbs_cmd_table) || !uverbs_cmd_table[hdr.command] || !(file->device->ib_dev->uverbs_cmd_mask & (1ull << hdr.command))) return -EINVAL; diff -ru linux-2.6.21-old/drivers/infiniband/core/verbs.c linux-2.6.21/drivers/infiniband/core/verbs.c --- linux-2.6.21-old/drivers/infiniband/core/verbs.c 2007-05-30 02:52:52.000000000 -0400 +++ linux-2.6.21/drivers/infiniband/core/verbs.c 2007-05-30 02:07:06.000000000 -0400 @@ -535,8 +535,7 @@ { enum ib_qp_attr_mask req_param, opt_param; - if (cur_state < 0 || cur_state > IB_QPS_ERR || - next_state < 0 || next_state > IB_QPS_ERR) + if (cur_state > IB_QPS_ERR || next_state > IB_QPS_ERR) return 0; if (mask & IB_QP_CUR_STATE && diff -ru linux-2.6.21-old/drivers/infiniband/hw/mlx4/qp.c linux-2.6.21/drivers/infiniband/hw/mlx4/qp.c --- linux-2.6.21-old/drivers/infiniband/hw/mlx4/qp.c 2007-05-30 02:52:52.000000000 -0400 +++ linux-2.6.21/drivers/infiniband/hw/mlx4/qp.c 2007-05-30 02:10:18.000000000 -0400 @@ -1284,7 +1284,7 @@ */ wmb(); - if (wr->opcode < 0 || wr->opcode >= ARRAY_SIZE(mlx4_ib_opcode)) { + if (wr->opcode >= ARRAY_SIZE(mlx4_ib_opcode)) { err = 
-EINVAL; goto out; } From vlad at lists.openfabrics.org Wed May 30 02:41:53 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 30 May 2007 02:41:53 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070530-0200 daily build status Message-ID: <20070530094153.EE7F2E607FA@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with 
linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From eli at mellanox.co.il Wed May 30 03:14:31 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 30 May 2007 13:14:31 +0300 Subject: [ofa-general] [PATCH] mlx4_core: fix CQ mailbox layout Message-ID: <1180520101.6825.26.camel@mtls03> Fix CQ inbox layout Signed-off-by: Eli Cohen --- Index: connectx_kernel/drivers/net/mlx4/cq.c =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/cq.c 2007-05-29 16:20:17.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/cq.c 2007-05-30 12:50:51.000000000 +0300 @@ -61,7 +61,7 @@ __be32 solicit_producer_index; __be32 consumer_index; __be32 producer_index; - u8 reserved6[2]; + u32 reserved6[2]; __be64 db_rec_addr; }; From halr at voltaire.com Wed May 30 03:38:50 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 30 May 2007 06:38:50 -0400 Subject: [ofa-general] Re: [PATCH] opensm/console: portstatus command for only initialized ports In-Reply-To: <20070528200742.GA13193@sashak.voltaire.com> References: <20070528200742.GA13193@sashak.voltaire.com> Message-ID: <1180521528.7116.53237.camel@hal.voltaire.com> On Mon, 2007-05-28 at 16:07, Sasha Khapyorsky wrote: > Run portstatus command for only initialized ports + minor identation > fixes. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. 
-- Hal From tziporet at mellanox.co.il Wed May 30 03:43:26 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 30 May 2007 13:43:26 +0300 Subject: [ofa-general] RE: [ewg] OFED-1.2-20070529-0600 won't build due to srptools changes In-Reply-To: References: Message-ID: <6C2C79E72C305246B504CBA17B5500C901563477@mtlexch01.mtl.com> We noticed this too and it was already fixed yesterday in the build of 6am Tziporet ________________________________ From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Scott Weitzenkamp (sweitzen) Sent: Tuesday, May 29, 2007 6:30 PM To: OpenFabrics EWG Cc: OpenFabrics General Subject: [ewg] OFED-1.2-20070529-0600 won't build due to srptools changes Importance: High I have reopened https://bugs.openfabrics.org/show_bug.cgi?id=533, Ishai please fix ASAP. This bug is now a P1 blocker. RPM build errors: user vlad does not exist - using root group vlad does not exist - using root user vlad does not exist - using root group vlad does not exist - using root File not found: /var/tmp/OFED/usr/sbin/execute_multipath_or_kpartx.sh File not found: /var/tmp/OFED/usr/sbin/srp_dm_multipath_daemon File not found: /var/tmp/OFED/usr/sbin/srp_post_multipath ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM'\ --define '_prefix /usr' --define 'build_root /var/tmp/OFED' --define 'configur\ e_options --with-dapl --with-libibcommon --with-libibmad --with-libibumad --wit\ h-libibverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --wit\ h-openib-diags --with-sdpnetstat --with-srptools --with-mstflint --with-perftes\ t --with-tvflash --sysconfdir=/etc --mandir=/usr/share/man' --define 'configure\ _options32 --with-dapl --with-libibcommon --with-libibmad --with-libibumad --wi\ th-libibverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --wi\ th-openib-diags --with-sdpnetstat --with-srptools --sysconfdir=/etc --mandir=/u\ sr/share/man' --define 
'build_32bit 1' --define '_mandir /usr/share/man' /tmp/O\ FED-1.2-20070529-0600/SRPMS/ofa_user-1.2-rc2.src.rpm" From erezz at voltaire.com Wed May 30 05:38:30 2007 From: erezz at voltaire.com (Erez Zilber) Date: Wed, 30 May 2007 15:38:30 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons foropen-iscsiover iSER support for RHAS4 up3 and up4 In-Reply-To: <20070529141143.GD27671@mellanox.co.il> References: <20070521114410.GG20400@mellanox.co.il> <46557BCB.7030102@voltaire.com> <20070524115715.GC4585@mellanox.co.il> <465C2D78.30100@voltaire.com> <20070529141143.GD27671@mellanox.co.il> Message-ID: <465D7046.3080109@voltaire.com> Michael S. Tsirkin wrote: >> Quoting Erez Zilber : >> Subject: Re: [PATCH 2/2] IB/iser: add backport & kernel addons?foropen-iscsiover iSER support for RHAS4 up3 and up4 >> >> >> >>>> I have the following files in backport/2.6.9_UX/include/src/: >>>> >>>> attribute_container.c - almost identical to the file on 2.6.20. I had to change one line in it. >>>> >>>> >>> could be a patch ... >>> which line? >>> >>> >>> >> Now, attribute_container.c, klist.c & transport_class.c are copied from >> the kernel tree. I've committed the required changes in >> ~erezz/ofabuild_iser_rh4.git & ~erezz/ofed_1_2_iser_rh4.git. >> > > git fetch git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git. > fatal: The remote end hung up unexpectedly > Cannot get the repository state from > git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git. > > > Here's what I was able to do: [root at hydrus t]# git-clone git://git.openfabrics.org/~vlad/ofed_1_2/.git remote: Generating pack... remote: Done counting 418996 objects. remote: Deltifying 418996 objects... remote: 100% (418996/418996) done remote: Total 418996 (delta 333751), reused 399530 (delta 318605) Checking files out...)
100% (22588/22588) done [root at hydrus t]# cd ofed_1_2 [root at hydrus ofed_1_2]# git fetch git://git.openfabrics.org/~erezz/ofed_1_2_iser_rh4.git remote: Generating pack... remote: Done counting 156 objects. Result has 133 objects. remote: Deltifying 133 objects... remote: 100% (133/133) done Unpacking 133 objects remote: Total 133 (delta 74), reused 24 (delta 5) 100% (133/133) done >> The main >> change is a new dir called "kernel_addons_patches". It contains patches >> for kernel tree files in order to create the required addons from them. >> > > sorry, but I really don't think we can touch build scripts at this point. > Doing cp in build scripts is also a problem since it interferes with > development (there are 2 places to edit each file). > And adding kernel version dependency there is also really messy. > > Suggestion: why can't these patches be part of the regular backport directory? > > you copy stuff to include/src and then include it, but this just looks > like and unnecessary extra step. Can't we include the source file from > it original place directory, like this: > #include "../drivers/base/attribute_container.c" > I can use attribute_container.c from drivers/base. However, having some of the addons in drivers/base while most of the addons are in kernel_addons is confusing, isn't it? 
It will also require ugly adjustments like: kernel_patches/backport/2.6.9_U3/iscsi_scsi_addons.patch: diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile index e212608..3bf2015 100644 --- a/drivers/scsi/Makefile +++ b/drivers/scsi/Makefile @@ -1,2 +1,7 @@ obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o + +CFLAGS_attribute_container.o = -I$(PWD)/kernel_addons/backport/2.6.9_U3/include/src/ + +scsi_transport_iscsi-y := scsi_transport_iscsi_f.o scsi.o scsi_lib.o init.o klist.o attribute_container.o transport_class.o +libiscsi-y := libiscsi_f.o scsi_scan.o (because base.h is in kernel_addons/backport/2.6.9_U3/include/src) From mst at dev.mellanox.co.il Wed May 30 05:54:56 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 May 2007 15:54:56 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/iser: add backport & kernel addons foropen-iscsiover iSER support for RHAS4 up3 and up4 In-Reply-To: <465D7046.3080109@voltaire.com> References: <20070521114410.GG20400@mellanox.co.il> <46557BCB.7030102@voltaire.com> <20070524115715.GC4585@mellanox.co.il> <465C2D78.30100@voltaire.com> <20070529141143.GD27671@mellanox.co.il> <465D7046.3080109@voltaire.com> Message-ID: <20070530125456.GF9036@mellanox.co.il> > >> The main > >> change is a new dir called "kernel_addons_patches". It contains patches > >> for kernel tree files in order to create the required addons from them. > >> > > > > sorry, but I really don't think we can touch build scripts at this point. > > Doing cp in build scripts is also a problem since it interferes with > > development (there are 2 places to edit each file). > > And adding kernel version dependency there is also really messy. > > > > Suggestion: why can't these patches be part of the regular backport directory? > > > > you copy stuff to include/src and then include it, but this just looks > > like and unnecessary extra step. 
Can't we include the source file from > > it original place directory, like this: > > #include "../drivers/base/attribute_container.c" > > > > I can use attribute_container.c from drivers/base. However, having some > of the addons in drivers/base while most of the addons are in > kernel_addons is confusing, isn't it? No, because it's not an addon - this is actual upstream kernel code. Addons is stuff that we *add* to kernel. > It will also require ugly > adjustments like: > > kernel_patches/backport/2.6.9_U3/iscsi_scsi_addons.patch: > > diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile > index e212608..3bf2015 100644 > --- a/drivers/scsi/Makefile > +++ b/drivers/scsi/Makefile > @@ -1,2 +1,7 @@ > obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o > obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o > + > +CFLAGS_attribute_container.o = > -I$(PWD)/kernel_addons/backport/2.6.9_U3/include/src/ > + > +scsi_transport_iscsi-y := scsi_transport_iscsi_f.o scsi.o scsi_lib.o > init.o klist.o attribute_container.o transport_class.o > +libiscsi-y := libiscsi_f.o scsi_scan.o > > (because base.h is in kernel_addons/backport/2.6.9_U3/include/src) This is one approach, and I think it's not too bad. Alternative is to use the relative path for include directive: #include "../drivers/base/attribute_container.c" Wouldn't this work? -- MST From rdreier at cisco.com Wed May 30 08:23:35 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 30 May 2007 08:23:35 -0700 Subject: [ofa-general] Re: [PATCH] mlx4_core: fix CQ mailbox layout In-Reply-To: <1180520101.6825.26.camel@mtls03> (Eli Cohen's message of "Wed, 30 May 2007 13:14:31 +0300") References: <1180520101.6825.26.camel@mtls03> Message-ID: yikes... applied, thanks. 
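For anyone following the mlx4 CQ mailbox fix acknowledged above: the patch changes reserved6 from u8[2] to u32[2], which moves every field that comes after it. Below is a minimal sketch of that effect. It assumes trimmed, hypothetical structs (cq_tail_u8 and cq_tail_u32 are made-up names; the real context in drivers/net/mlx4/cq.c has more fields in front), and plain uint32_t/uint64_t stand in for __be32/__be64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the tail of the CQ context BEFORE the patch.
 * Only the fields visible in the posted hunk are reproduced here. */
struct cq_tail_u8 {
	uint32_t solicit_producer_index;
	uint32_t consumer_index;
	uint32_t producer_index;
	uint8_t  reserved6[2];	/* 2 reserved bytes, then compiler padding */
	uint64_t db_rec_addr;
};

/* Same tail AFTER the patch: reserved6 now covers two 32-bit words. */
struct cq_tail_u32 {
	uint32_t solicit_producer_index;
	uint32_t consumer_index;
	uint32_t producer_index;
	uint32_t reserved6[2];	/* 8 reserved bytes */
	uint64_t db_rec_addr;
};

/* Byte offset of db_rec_addr in each variant. */
size_t db_rec_offset_u8(void)
{
	return offsetof(struct cq_tail_u8, db_rec_addr);
}

size_t db_rec_offset_u32(void)
{
	return offsetof(struct cq_tail_u32, db_rec_addr);
}

/* Sanity check: widening reserved6 pushes db_rec_addr further out. */
void cq_layout_self_check(void)
{
	assert(db_rec_offset_u32() > db_rec_offset_u8());
}
```

The exact offsets depend on the ABI's alignment rules, so the sketch only checks that they differ. A mailbox is read by the device at fixed offsets, so a field that lands a few bytes early is enough to break it.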
From rdreier at cisco.com Wed May 30 08:30:00 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 30 May 2007 08:30:00 -0700 Subject: [ofa-general] dealing with gcc 'comparison is always false' warnings (was: [PATCH] drivers/infiniband: fix comparsion between unsigned and negative) In-Reply-To: <20070530080518.GA29195@nostromo.devel.redhat.com> (Bill Nottingham's message of "Wed, 30 May 2007 04:05:18 -0400") References: <20070530080518.GA29195@nostromo.devel.redhat.com> Message-ID: thanks... I'm wondering if there's a consensus among kernel hackers about changes like: > - if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) > + if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) > return -EINVAL; I understand that new gcc sees that hdr.cmd is unsigned and hence can't be < 0, and generates a warning for that, and having a build cluttered with warnings hides bugs and so on. However the code here looks quite sensible to me -- otherwise we end up with missing range checking if hdr.cmd ever changes to a signed type. This seems like a good way to introduce bugs: delete valid range checking code to shut up a silly gcc warning, and then change the type of a variable. Can't we just make gcc shut up about the comparison and generate no code for it because it knows it can't be true? - R. From satyam.sharma at gmail.com Wed May 30 08:41:29 2007 From: satyam.sharma at gmail.com (Satyam Sharma) Date: Wed, 30 May 2007 21:11:29 +0530 Subject: [ofa-general] Re: dealing with gcc 'comparison is always false' warnings (was: [PATCH] drivers/infiniband: fix comparsion between unsigned and negative) In-Reply-To: References: <20070530080518.GA29195@nostromo.devel.redhat.com> Message-ID: On 5/30/07, Roland Dreier wrote: > thanks... 
I'm wondering if there's a consensus among kernel hackers > about changes like: > > > - if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) > > + if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) > > return -EINVAL; > > I understand that new gcc sees that hdr.cmd is unsigned and hence > can't be < 0, and generates a warning for that, and having a build > cluttered with warnings hides bugs and so on. However the code here > looks quite sensible to me -- otherwise we end up with missing range > checking if hdr.cmd ever changes to a signed type. This seems like a > good way to introduce bugs: delete valid range checking code to shut > up a silly gcc warning, and then change the type of a variable. You're *absolutely* correct that these "fixes" that remove such conditions end up removing range checking, making the code flakier and less readable. However, gcc is _just as correct_. It is only complaining about a condition that the programmer presumably wrote with some purpose in mind, but which gcc compiles away completely when generating code because it is a tautology / contradiction ... > Can't we just make gcc shut up about the comparison and generate no > code for it because it knows it can't be true? No, shutting gcc up wouldn't be the right thing, IMHO. These warnings are a good reminder to the programmer to go and see if there is a real bug somewhere and if something really needs to be done with the code (it could be as simple as changing to signed the type of a variable that was mistakenly declared unsigned, for example). But yes, the kind of "fixes" you pointed out that _remove_ these conditions are definitely *not* what we would want to do.
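One idiom that sometimes comes up for this trade-off, offered only as a sketch and not as something proposed in this thread: cast the index to an unsigned type once and do a single upper-bound comparison. A negative value of a signed index wraps to a very large unsigned value and is rejected, and there is no "< 0" test left for gcc to warn about. The names below (cmd_table, INDEX_IN_TABLE, cmd_ok_signed, cmd_ok_unsigned) are hypothetical, and the sketch assumes the index type is no wider than size_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical command table, standing in for something like ucma_cmd_table. */
static const char *const cmd_table[] = { "create", "destroy", "bind" };

/*
 * One comparison that works for signed and unsigned indices alike.
 * Assumption: the type of idx is no wider than size_t, otherwise the
 * cast could truncate a large value into the valid range.
 */
#define INDEX_IN_TABLE(idx, table) ((size_t)(idx) < ARRAY_SIZE(table))

/* Index declared signed: -1 becomes SIZE_MAX after the cast and fails. */
int cmd_ok_signed(int32_t idx)
{
	return INDEX_IN_TABLE(idx, cmd_table);
}

/* Index declared unsigned: same expression, no always-false "< 0" test. */
int cmd_ok_unsigned(uint32_t idx)
{
	return INDEX_IN_TABLE(idx, cmd_table);
}

/* Sanity check covering both edges of the valid range. */
void cmd_range_self_check(void)
{
	assert(cmd_ok_signed(0) && cmd_ok_signed(2));
	assert(!cmd_ok_signed(-1) && !cmd_ok_signed(3));
	assert(cmd_ok_unsigned(2) && !cmd_ok_unsigned(3));
}
```

With this form the range check survives a later change of the variable's type in either direction, which was the concern raised above. Whether it reads better than an explicit "< 0" test is a matter of taste.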
Satyam From erezz at voltaire.com Wed May 30 08:43:49 2007 From: erezz at voltaire.com (Erez Zilber) Date: Wed, 30 May 2007 18:43:49 +0300 Subject: [ofa-general] RE: [PATCH 2/2] IB/iser: add backport & kerneladdons foropen-iscsiover iSER support for RHAS4 up3 and up4 References: <20070521114410.GG20400@mellanox.co.il><46557BCB.7030102@voltaire.com><20070524115715.GC4585@mellanox.co.il><465C2D78.30100@voltaire.com><20070529141143.GD27671@mellanox.co.il><465D7046.3080109@voltaire.com> <20070530125456.GF9036@mellanox.co.il> Message-ID: <39C75744D164D948A170E9792AF8E7CA1109F5@exil.voltaire.com> >> It will also require ugly >> adjustments like: >> >> kernel_patches/backport/2.6.9_U3/iscsi_scsi_addons.patch: >> >> diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile >> index e212608..3bf2015 100644 >> --- a/drivers/scsi/Makefile >> +++ b/drivers/scsi/Makefile >> @@ -1,2 +1,7 @@ >> obj-$(CONFIG_SCSI_ISCSI_ATTRS) += scsi_transport_iscsi.o >> obj-$(CONFIG_ISCSI_TCP) += libiscsi.o iscsi_tcp.o >> + >> +CFLAGS_attribute_container.o = >> -I$(PWD)/kernel_addons/backport/2.6.9_U3/include/src/ >> + >> +scsi_transport_iscsi-y := scsi_transport_iscsi_f.o scsi.o scsi_lib.o >> init.o klist.o attribute_container.o transport_class.o >> +libiscsi-y := libiscsi_f.o scsi_scan.o >> >> (because base.h is in kernel_addons/backport/2.6.9_U3/include/src) > > This is one approach, and I think it's not too bad. > Alternative is to use the relative path for include directive: > #include "../drivers/base/attribute_container.c" > > Wouldn't this work? I am doing that. However, attribute_container.c includes base.h which is in the kernel_addons dir. Since attribute_container.c is no longer there, I need to add the following line: -I$(PWD)/kernel_addons/backport/2.6.9_U3/include/src/ It is not very very ugly, so I think that we can do that. I will make the required fixes according to this approach. 
Erez From rdreier at cisco.com Wed May 30 08:56:37 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 30 May 2007 08:56:37 -0700 Subject: [ofa-general] Re: dealing with gcc 'comparison is always false' warnings In-Reply-To: (Satyam Sharma's message of "Wed, 30 May 2007 21:11:29 +0530") References: <20070530080518.GA29195@nostromo.devel.redhat.com> Message-ID: > However, gcc is _just as correct_. It is only crying about seeing a condition > that the programmer could have written with some purpose in mind but which > is being completely compiled away by it when generating the code because > of it being a tautology / contradiction ... Well, OK, but there's lots of things gcc could warn about. How about while (1) { ... By your argument gcc should warn that '1' always evaluates to true. Or how about #if 0 why shouldn't the preprocessor warn that the conditional is always false? > No, shutting gcc up wouldn't be the right thing, IMHO. These warnings are > a good reminder to the programmer to go and see if there is a real bug > somewhere and if something really needs to be done with the code (could > be simply to change the type of a variable to signed that was mistakenly > declared unsigned, f.e.). OK, but suppose I looked at it and there's no bug. Leaving the warning has a cost too: it hides useful warnings (that might be showing real bugs) in all the clutter. - R. From satyam.sharma at gmail.com Wed May 30 09:06:05 2007 From: satyam.sharma at gmail.com (Satyam Sharma) Date: Wed, 30 May 2007 21:36:05 +0530 Subject: [ofa-general] Re: dealing with gcc 'comparison is always false' warnings (was: [PATCH] drivers/infiniband: fix comparsion between unsigned and negative) In-Reply-To: References: <20070530080518.GA29195@nostromo.devel.redhat.com> Message-ID: On 5/30/07, Satyam Sharma wrote: > On 5/30/07, Roland Dreier wrote: > > thanks... 
I'm wondering if there's a consensus among kernel hackers > > about changes like: > > > > > - if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) > > > + if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) > > > return -EINVAL; > > > > I understand that new gcc sees that hdr.cmd is unsigned and hence > > can't be < 0, and generates a warning for that, and having a build > > cluttered with warnings hides bugs and so on. However the code here > > looks quite sensible to me -- otherwise we end up with missing range > > checking if hdr.cmd ever changes to a signed type. This seems like a > > good way to introduce bugs: delete valid range checking code to shut > > up a silly gcc warning, and then change the type of a variable. > > You're *absolutely* correct about the issue that these "fixes" that remove > such conditions end up remove range-checking making the code more > flakey / less readable. > > However, gcc is _just as correct_. It is only crying about seeing a condition > that the programmer could have written with some purpose in mind but which > is being completely compiled away by it when generating the code because > of it being a tautology / contradiction ... > > > Can't we just make gcc shut up about the comparison and generate no > > code for it because it knows it can't be true? [ BTW gcc does not generate code for such cases already; either for the condition whose truth value is already known, or for the codepath that will never be executed as a result. ] > No, shutting gcc up wouldn't be the right thing, IMHO. These warnings are > a good reminder to the programmer to go and see if there is a real bug > somewhere and if something really needs to be done with the code (could > be simply to change the type of a variable to signed that was mistakenly > declared unsigned, f.e.). A common scenario I could imagine for the above would be where a typo makes someone declare a var as size_t when it should've been ssize_t. 
This is clearly a real bug that would get caught with this gcc warning (but not with -Wall). > But yes, the kind of "fixes" you pointed out that _remove_ these conditions > are definitely *not* what we would want to do. Erm, to qualify my rather strong opinion above: there could perhaps be exceptions where the condition being removed could be truly redundant, of course :-) Satyam From jwong at datallegro.com Wed May 30 09:22:07 2007 From: jwong at datallegro.com (Jeffrey Wong) Date: Wed, 30 May 2007 12:22:07 -0400 Subject: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with kernel2-6.18-8.1.4.el5 In-Reply-To: <200705300843.13575.cap@nsc.liu.se> Message-ID: Is there a workaround for this problem? Will there be a patch to the current release? Will this be fixed in the next release? Thanks, Jeff -----Original Message----- From: Peter Kjellstrom [mailto:cap at nsc.liu.se] Sent: Tuesday, May 29, 2007 11:43 PM To: general at lists.openfabrics.org Cc: Scott Weitzenkamp (sweitzen); Jeffrey Wong; Moni Shoua; Moni Levy Subject: Re: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with kernel2-6.18-8.1.4.el5 On Tuesday 29 May 2007, Scott Weitzenkamp (sweitzen) wrote: > Moni S, > > The ib-bonding configuration process seems too picky, should we just > apply RHEL5 patches if we see a *el5* kernel? In other words, change: > > $ fgrep 2.6.18 linux/configure > 2.6.18-1.2747.el5*|2.6.18-8.el5*|2.6.18-*.*.fc6) > > to: > > 2.6.18-*el5*|2.6.18-*.*.fc6) Why mix in non-el5 kernels? 2.6.18-2747 and similar are beta kernels (fc naming left-over) and stuff with fc6 in it is clearly not el5. Update kernels for el5 (before el5u1) should be named 2.6.18-8.x.y.el5, so maybe: "2.6.18-8.*el5"? 
/Peter From sweitzen at cisco.com Wed May 30 09:25:12 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 30 May 2007 09:25:12 -0700 Subject: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding with kernel2-6.18-8.1.4.el5 In-Reply-To: References: <200705300843.13575.cap@nsc.liu.se> Message-ID: Looks fixed in http://www.openfabrics.org/builds/ofed-1.2/OFED-1.2-20070530-0809.tgz. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: Jeffrey Wong [mailto:jwong at datallegro.com] > Sent: Wednesday, May 30, 2007 9:22 AM > To: Peter Kjellstrom; general at lists.openfabrics.org > Cc: Scott Weitzenkamp (sweitzen); Moni Shoua; Moni Levy > Subject: RE: [ofa-general] Trouble installing OFED1.2-rc3 > ib-bonding with kernel2-6.18-8.1.4.el5 > > Is there a workaround for this problem? Will there be a patch to the > current release? Will this be fixed in the next release? > Thanks, > > Jeff > > > > -----Original Message----- > From: Peter Kjellstrom [mailto:cap at nsc.liu.se] > Sent: Tuesday, May 29, 2007 11:43 PM > To: general at lists.openfabrics.org > Cc: Scott Weitzenkamp (sweitzen); Jeffrey Wong; Moni Shoua; Moni Levy > Subject: Re: [ofa-general] Trouble installing OFED1.2-rc3 ib-bonding > with kernel2-6.18-8.1.4.el5 > > On Tuesday 29 May 2007, Scott Weitzenkamp (sweitzen) wrote: > > Moni S, > > > > The ib-bonding configuration process seems too picky, should we just > > apply RHEL5 patches if we see a *el5* kernel? In other > words, change: > > > > $ fgrep 2.6.18 linux/configure > > 2.6.18-1.2747.el5*|2.6.18-8.el5*|2.6.18-*.*.fc6) > > > > to: > > > > 2.6.18-*el5*|2.6.18-*.*.fc6) > > Why mix in non-el5 kernels? 2.6.18-2747 and similar are beta > kernels (fc > > naming left-over) and stuff with fc6 in it is clearly not el5. Update > kernels > for el5 (before el5u1) should be named 2.6.18-8.x.y.el5, so > maybe: "2.6.18-8.*el5"? 
> > /Peter > From satyam.sharma at gmail.com Wed May 30 10:00:14 2007 From: satyam.sharma at gmail.com (Satyam Sharma) Date: Wed, 30 May 2007 22:30:14 +0530 Subject: [ofa-general] Re: dealing with gcc 'comparison is always false' warnings In-Reply-To: References: <20070530080518.GA29195@nostromo.devel.redhat.com> Message-ID: [ Sorry, the threading broke because the subject changed, so I missed seeing this mail earlier. ] On 5/30/07, Roland Dreier wrote: > > However, gcc is _just as correct_. It is only crying about seeing a condition > > that the programmer could have written with some purpose in mind but which > > is being completely compiled away by it when generating the code because > > of it being a tautology / contradiction ... > > Well, OK, but there's lots of things gcc could warn about. How about > > while (1) { ... Umm ... perhaps because gcc does not compile away any code for such cases, but only the condition? Or because gcc knows this is a common idiom in a *lot* of C code? I don't know (or care!) ... the precise cases for which the warning is emitted would be known only by reading gcc sources (which I have no intention of doing :-) > By your argument gcc should warn that '1' always evaluates to true. Note that my "argument" was about conditions that weren't as simplistic as #if 0 or while (1) and that involved not merely 1 or 0, but variables whose values might not be available at compile-time ... > Or how about > > #if 0 > > why shouldn't the preprocessor warn that the conditional is always false? Perhaps because gcc knows programmers often use this common method to disable some code? I can't answer all these questions, of course (better ask the gcc folks), but I don't care either. Clearly, none of the above are any reason why gcc should *not* complain when it sees a _seemingly_ meaningful condition conceivably written by the programmer with something in mind but being completely optimized away by it. 
[ BTW, perhaps the reason why the gcc folks did *not* put a warning for while (1) or #if 0 is also because they know that programmers often write such conditions with something meaningful in mind. ] > > No, shutting gcc up wouldn't be the right thing, IMHO. These warnings are > > a good reminder to the programmer to go and see if there is a real bug > > somewhere and if something really needs to be done with the code (could > > be simply to change the type of a variable to signed that was mistakenly > > declared unsigned, f.e.). > > OK, but suppose I looked at it and there's no bug. Leaving the > warning has a cost too: it hides useful warnings (that might be > showing real bugs) in all the clutter. Agreed, this warning emits a lot of false positives. But this warning isn't enabled with -Wall either, or is it (now)? I remember the only way to enable this was with -Wextra, and last I heard the top-level Makefile did not specify that ... (?) Satyam From ralph.campbell at qlogic.com Wed May 30 10:09:25 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 30 May 2007 10:09:25 -0700 Subject: [ofa-general] Re: [ewg] [PATCH] ofed_1_2/sdp - SDP can lose receive data In-Reply-To: <1180507416.12048.19.camel@localhost> References: <1180139623.3407.373.camel@brick.pathscale.com> <1180256850.15464.1.camel@localhost> <1180459488.3407.376.camel@brick.pathscale.com> <1180507416.12048.19.camel@localhost> Message-ID: <1180544965.3407.424.camel@brick.pathscale.com> I guess I'm still getting used to git. Somehow, I was looking at an earlier version. The current code looks OK to me. On Wed, 2007-05-30 at 09:43 +0300, Ami Perlmutter wrote: > this is how the code looks now: > > if (likely(skb_len && (tail = skb_peek_tail(&sk->sk_receive_queue))) && > unlikely(skb_tailroom(tail) >= skb_len)) { > skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len); > __kfree_skb(skb); > skb = tail; > } else > skb_queue_tail(&sk->sk_receive_queue, skb); > > could you point out the problem? 
> > On Tue, 2007-05-29 at 10:24 -0700, Ralph Campbell wrote: > > It is from git://git.openfabrics.org/~vlad/ofed_1_2 > > commit 726c6827ac31c0b2f40acd804dc53362289bd21f > > > > On Sun, 2007-05-27 at 12:07 +0300, Ami Perlmutter wrote: > > > Ralph, > > > this is how the code is now. > > > Where are you getting this code from? > > > > > > On Fri, 2007-05-25 at 17:33 -0700, Ralph Campbell wrote: > > > > Can this fix be considered for OFED 1.2? > > > > Thanks. > > > > > > > > > > > > If a receive work completion is processed but there is no room > > > > in a previously queued skb, the data is dropped. > > > > This patch fixes the problem by queuing the skb. > > > > > > > > Signed-off-by: Ralph Campbell > > > > > > > > diff -r 074340d1892d drivers/infiniband/ulp/sdp/sdp_bcopy.c > > > > --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c Fri May 25 17:04:51 2007 -0700 > > > > +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c Fri May 25 17:07:02 2007 -0700 > > > > @@ -314,7 +314,8 @@ static inline struct sk_buff *sdp_sock_q > > > > skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len); > > > > __kfree_skb(skb); > > > > skb = tail; > > > > - } > > > > + } else > > > > + skb_queue_tail(&sk->sk_receive_queue, skb); > > > > } else > > > > skb_queue_tail(&sk->sk_receive_queue, skb); From halr at voltaire.com Wed May 30 10:58:02 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 30 May 2007 13:58:02 -0400 Subject: [ofa-general] [PATCH] ib_types.h: Change macros to convert from "host" byte order to
"network" In-Reply-To: <20070522102327.0cea4153.weiny2@llnl.gov> References: <20070522102327.0cea4153.weiny2@llnl.gov> Message-ID: <1180547880.7116.81717.camel@hal.voltaire.com> On Tue, 2007-05-22 at 13:23, Ira Weiny wrote: > From 7e53267d5bc9389f5f1a4dae3a2d290c69c6e1d4 Mon Sep 17 00:00:00 2001 > From: Ira K. Weiny > Date: Tue, 24 Apr 2007 16:07:19 -0700 > Subject: [PATCH] Change macros to convert from "host" byte order to "network" > > Although the macros CL_HTON* and CL_NTOH* are defined to be the same > operation it is technically incorrect to convert a constant from network > byte order. The constant should be converted from host byte order to > network byte order. > > Signed-off-by: Ira K. Weiny Thanks. Applied. -- Hal From rdreier at cisco.com Wed May 30 10:58:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 30 May 2007 10:58:49 -0700 Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user In-Reply-To: <20070529091246.GF8159@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 May 2007 12:12:46 +0300") References: <465AF3D3.10205@dev.mellanox.co.il> <20070529091246.GF8159@mellanox.co.il> Message-ID: ok, I applied the patch with changes as discussed From sashak at voltaire.com Wed May 30 11:23:47 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 30 May 2007 21:23:47 +0300 Subject: [ofa-general] [PATCH] opensm: drop_mgr: clean only associated with port physical obj Message-ID: <20070530182347.GF13193@sashak.voltaire.com> When removing an osm_port_t, clean up only the associated osm_physp_t object and not all of the node's osm_physp_t objects. If all osm_physp_t objects should be removed, do it in the node removal routine. This fix prevents random crashes in post drop manager flows when a CA node had two ports connected and one was disconnected.
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_drop_mgr.c | 171 +++++++++++++++++++----------------------- 1 files changed, 78 insertions(+), 93 deletions(-) diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c index 7ec185c..1890696 100644 --- a/opensm/opensm/osm_drop_mgr.c +++ b/opensm/opensm/osm_drop_mgr.c @@ -137,6 +137,78 @@ __osm_drop_mgr_remove_router( } } + +/********************************************************************** + **********************************************************************/ +static void +drop_mgr_clean_physp( + IN const osm_drop_mgr_t* const p_mgr, + IN osm_physp_t *p_physp) +{ + cl_qmap_t *p_port_guid_tbl = &p_mgr->p_subn->port_guid_tbl; + osm_physp_t *p_remote_physp; + osm_port_t* p_remote_port; + + p_remote_physp = osm_physp_get_remote( p_physp ); + if( p_remote_physp && osm_physp_is_valid( p_remote_physp ) ) + { + p_remote_port = (osm_port_t*)cl_qmap_get( p_port_guid_tbl, + p_remote_physp->port_guid ); + + if ( p_remote_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl ) ) + { + /* Let's check if this is a case of link that is lost (both ports + weren't recognized), or a "hiccup" in the subnet - in which case + the remote port was recognized, and its state is ACTIVE. + If this is just a "hiccup" - force a heavy sweep in the next sweep. + We don't want to lose that part of the subnet. */ + if (osm_port_discovery_count_get( p_remote_port ) && + osm_physp_get_port_state( p_remote_physp ) == IB_LINK_ACTIVE ) + { + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "drop_mgr_clean_physp: " + "Forcing delayed heavy sweep. Remote " + "port 0x%016" PRIx64 " port num: 0x%X " + "was recognized in ACTIVE state\n", + cl_ntoh64( p_remote_physp->port_guid ), + p_remote_physp->port_num ); + p_mgr->p_subn->force_delayed_heavy_sweep = TRUE; + } + + /* If the remote node is ca or router - need to remove the remote port, + since it is no longer reachable. This can be done if we reset the + discovery count of the remote port. 
*/ + if ( !p_remote_physp->p_node->sw ) + { + osm_port_discovery_count_reset( p_remote_port ); + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, + "drop_mgr_clean_physp: Resetting discovery count of node: " + "0x%016" PRIx64 " port num:0x%X\n", + cl_ntoh64( osm_node_get_node_guid( p_remote_physp->p_node ) ), + p_remote_physp->port_num ); + } + } + + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "drop_mgr_clean_physp: " + "Unlinking local node 0x%016" PRIx64 ", port 0x%X" + "\n\t\t\t\tand remote node 0x%016" PRIx64 ", port 0x%X\n", + cl_ntoh64( osm_node_get_node_guid( p_physp->p_node ) ), + p_physp->port_num, + cl_ntoh64( osm_node_get_node_guid( p_remote_physp->p_node ) ), + p_remote_physp->port_num ); + + osm_physp_unlink( p_physp, p_remote_physp ); + + } + + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, + "drop_mgr_clean_physp: Clearing physical port number 0x%X\n", + p_physp->port_num ); + + osm_physp_destroy( p_physp ); +} + /********************************************************************** **********************************************************************/ static void @@ -156,17 +228,11 @@ __osm_drop_mgr_remove_port( uint16_t min_lid_ho; uint16_t max_lid_ho; uint16_t lid_ho; - uint32_t port_num; - uint32_t remote_port_num; - uint32_t num_physp; osm_node_t *p_node; - osm_node_t *p_remote_node; - osm_physp_t *p_physp; - osm_physp_t *p_remote_physp; osm_remote_sm_t *p_sm; ib_gid_t port_gid; - ib_mad_notice_attr_t notice; - ib_api_status_t status; + ib_mad_notice_attr_t notice; + ib_api_status_t status; OSM_LOG_ENTER( p_mgr->p_log, __osm_drop_mgr_remove_port ); @@ -231,89 +297,7 @@ __osm_drop_mgr_remove_port( for( lid_ho = min_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) cl_ptr_vector_set( p_port_lid_tbl, lid_ho, NULL ); - /* - For each Physical Port associated with this port: - Unlink the remote Physical Port, if any - Re-initialize each Physical Port. 
- */ - - num_physp = osm_node_get_num_physp( p_port->p_node ); - for( port_num = 0; port_num < num_physp; port_num++ ) - { - p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)port_num ); - - if( p_physp && osm_physp_is_valid(p_physp) ) - { - p_remote_physp = osm_physp_get_remote( p_physp ); - if( p_remote_physp && osm_physp_is_valid( p_remote_physp ) ) - { - osm_port_t* p_remote_port; - - p_node = osm_physp_get_node_ptr( p_physp ); - p_remote_node = osm_physp_get_node_ptr( p_remote_physp ); - remote_port_num = osm_physp_get_port_num( p_remote_physp ); - p_remote_port = (osm_port_t*)cl_qmap_get( p_port_guid_tbl, p_remote_physp->port_guid ); - - if ( p_remote_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl ) ) - { - /* Let's check if this is a case of link that is lost (both ports - weren't recognized), or a "hiccup" in the subnet - in which case - the remote port was recognized, and its state is ACTIVE. - If this is just a "hiccup" - force a heavy sweep in the next sweep. - We don't want to lose that part of the subnet. */ - if (osm_port_discovery_count_get( p_remote_port ) && - osm_physp_get_port_state( p_remote_physp ) == IB_LINK_ACTIVE ) - { - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, - "__osm_drop_mgr_remove_port: " - "Forcing delayed heavy sweep. 
Remote " - "port 0x%016" PRIx64 " port num: 0x%X " - "was recognized in ACTIVE state\n", - cl_ntoh64( p_remote_physp->port_guid ), - remote_port_num ); - p_mgr->p_subn->force_delayed_heavy_sweep = TRUE; - } - } - - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, - "__osm_drop_mgr_remove_port: " - "Unlinking local node 0x%016" PRIx64 ", port 0x%X" - "\n\t\t\t\tand remote node 0x%016" PRIx64 - ", port 0x%X\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - port_num, - cl_ntoh64( osm_node_get_node_guid( p_remote_node ) ), - remote_port_num ); - - osm_node_unlink( p_node, (uint8_t)port_num, - p_remote_node, (uint8_t)remote_port_num ); - - /* If the remote node is ca or router - need to remove the remote port, - since it is no longer reachable. This can be done if we reset the - discovery count of the remote port. */ - if ( osm_node_get_type( p_remote_node ) != IB_NODE_TYPE_SWITCH ) - { - if ( p_remote_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl ) ) - { - osm_port_discovery_count_reset( p_remote_port ); - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, - "__osm_drop_mgr_remove_port: " - "Resetting discovery count of node: " - "0x%016" PRIx64 " port num:0x%X\n", - cl_ntoh64( osm_node_get_node_guid( p_remote_node ) ), - remote_port_num ); - } - } - } - - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, - "__osm_drop_mgr_remove_port: " - "Clearing physical port number 0x%X\n", - port_num ); - - osm_physp_destroy( p_physp ); - } - } + drop_mgr_clean_physp(p_mgr, p_port->p_physp); p_mcm = (osm_mcm_info_t*)cl_qlist_remove_head( &p_port->mcm_list ); while( p_mcm != (osm_mcm_info_t *)cl_qlist_end( &p_port->mcm_list ) ) @@ -454,6 +438,8 @@ __osm_drop_mgr_process_node( if( p_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl ) ) __osm_drop_mgr_remove_port( p_mgr, p_port ); + else + drop_mgr_clean_physp( p_mgr, p_physp ); } } @@ -535,8 +521,7 @@ __osm_drop_mgr_check_node( port_guid = osm_physp_get_port_guid( p_physp ); - p_port = (osm_port_t*)cl_qmap_get( - p_port_guid_tbl, port_guid ); + p_port = 
(osm_port_t*)cl_qmap_get( p_port_guid_tbl, port_guid ); if( p_port == (osm_port_t*)cl_qmap_end( p_port_guid_tbl ) ) { -- 1.5.2.109.g802f From tilman at imap.cc Wed May 30 12:00:57 2007 From: tilman at imap.cc (Tilman Schmidt) Date: Wed, 30 May 2007 21:00:57 +0200 Subject: [ofa-general] Re: dealing with gcc 'comparison is always false' warnings In-Reply-To: References: <20070530080518.GA29195@nostromo.devel.redhat.com> Message-ID: <465DC9E9.3040904@imap.cc> On 30.05.2007 17:41, Satyam Sharma wrote: > On 5/30/07, Roland Dreier wrote: >> thanks... I'm wondering if there's a consensus among kernel hackers >> about changes like: >> >> > - if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) >> > + if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) >> > return -EINVAL; >> >> I understand that new gcc sees that hdr.cmd is unsigned and hence >> can't be < 0, and generates a warning for that, and having a build >> cluttered with warnings hides bugs and so on. However the code here >> looks quite sensible to me -- otherwise we end up with missing range >> checking if hdr.cmd ever changes to a signed type. This seems like a >> good way to introduce bugs: delete valid range checking code to shut >> up a silly gcc warning, and then change the type of a variable. > > You're *absolutely* correct about the issue that these "fixes" that remove > such conditions end up removing range-checking, making the code more > flaky / less readable. I disagree. Changing the type of a variable is a significant modification. If someone does that, he or she *must* check every use of that variable, at which point he or she will also modify any range checks accordingly. Having checks that don't fit with the previous type *distracts* from that job. "Oh, did I modify that part already? Guess I can skip checking the rest of that function then." Oops. Nor is readability a suitable argument. 
Checking if hdr.cmd is less than zero gives the misleading impression that it *could* be less than zero, thus *impairing* readability. jm2c T. -- Tilman Schmidt E-Mail: tilman at imap.cc Bonn, Germany This message consists of 100% recycled bits. Unopened, best before: (see reverse side) From notting at redhat.com Wed May 30 12:09:37 2007 From: notting at redhat.com (Bill Nottingham) Date: Wed, 30 May 2007 15:09:37 -0400 Subject: [ofa-general] Re: dealing with gcc 'comparison is always false' warnings (was: [PATCH] drivers/infiniband: fix comparsion between unsigned and negative) In-Reply-To: References: <20070530080518.GA29195@nostromo.devel.redhat.com> Message-ID: <20070530190937.GA5444@nostromo.devel.redhat.com> Satyam Sharma (satyam.sharma at gmail.com) said: > But yes, the kind of "fixes" you pointed out that _remove_ these conditions > are definitely *not* what we would want to do. I can see that - but I think it should at least be brought up for each warning, to determine either: 1) if it should be ignored 2) if a signed type is actually intended 3) if the code should be elided While not necessarily in the IB instances, there are cases where there are entire blocks of code (with debugging output, error returns, etc) that can never get run, and it may make sense to remove those. 
Bill From satyam.sharma at gmail.com Wed May 30 12:14:30 2007 From: satyam.sharma at gmail.com (Satyam Sharma) Date: Thu, 31 May 2007 00:44:30 +0530 Subject: [ofa-general] Re: dealing with gcc 'comparison is always false' warnings In-Reply-To: <465DC9E9.3040904@imap.cc> References: <20070530080518.GA29195@nostromo.devel.redhat.com> <465DC9E9.3040904@imap.cc> Message-ID: Hi, On 5/31/07, Tilman Schmidt wrote: > On 30.05.2007 17:41, Satyam Sharma wrote: > > On 5/30/07, Roland Dreier wrote: > >> thanks... I'm wondering if there's a consensus among kernel hackers > >> about changes like: > >> > >> > - if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) > >> > + if (hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) > >> > return -EINVAL; > >> > >> I understand that new gcc sees that hdr.cmd is unsigned and hence > >> can't be < 0, and generates a warning for that, and having a build > >> cluttered with warnings hides bugs and so on. However the code here > >> looks quite sensible to me -- otherwise we end up with missing range > >> checking if hdr.cmd ever changes to a signed type. This seems like a > >> good way to introduce bugs: delete valid range checking code to shut > >> up a silly gcc warning, and then change the type of a variable. > > > > You're *absolutely* correct about the issue that these "fixes" that remove > > such conditions end up removing range-checking, making the code more > > flaky / less readable. > > I disagree. Changing the type of a variable is a significant > modification. If someone does that, he or she *must* check every > use of that variable, at which point he or she will also modify > any range checks accordingly. Having checks that don't fit with > the previous type *distracts* from that job. "Oh, did I modify > that part already? Guess I can skip checking the rest of that > function then." Oops. I did not suggest the change-variable-type-from-unsigned-to-signed thing as a "general" solution to such cases! ... 
in fact what I said was that such cases do _not_ have a general solution at all, and that shutting gcc up might not be a good idea, because a lot of times such warnings do un-hide bugs. [ BTW when I gave the change-type-from-unsigned-to-signed example, I had the size_t vs ssize_t typo/bug in mind, for which changing the type is the proper fix; and note that similar bugs can occur for non-size_t cases too. ] > Nor is readability a suitable argument. Checking if hdr.cmd is > less than zero gives the misleading impression that it *could* > be less than zero, thus *impairing* readability. Hmmm, but I tend to agree with the sentiment expressed in: http://lkml.org/lkml/2006/11/28/206 Satyam From satyam.sharma at gmail.com Wed May 30 12:23:20 2007 From: satyam.sharma at gmail.com (Satyam Sharma) Date: Thu, 31 May 2007 00:53:20 +0530 Subject: [ofa-general] Re: dealing with gcc 'comparison is always false' warnings (was: [PATCH] drivers/infiniband: fix comparsion between unsigned and negative) In-Reply-To: <20070530190937.GA5444@nostromo.devel.redhat.com> References: <20070530080518.GA29195@nostromo.devel.redhat.com> <20070530190937.GA5444@nostromo.devel.redhat.com> Message-ID: Hi Bill, On 5/31/07, Bill Nottingham wrote: > Satyam Sharma (satyam.sharma at gmail.com) said: > > But yes, the kind of "fixes" you pointed out that _remove_ these conditions > > are definitely *not* what we would want to do. > > I can see that - but I think it should be at least be brought up for each > warning, to determine either: > > 1) if it should be ignored > 2) if a signed type is actually intended > 3) if the code should be elided Agreed. The extract you've pointed out above was too strongly worded unnecessarily / wrong generalization, and I corrected it later. > While not necessarily in the IB instances, there are cases where there > are entire blocks of code (with debugging output, error returns, etc) > that can never get run, and it may make sense to remove those. Agreed, again. 
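[Editor's illustration, not part of the original thread.] The range-check question debated above can be sketched in a few lines of C. This is a minimal sketch under stated assumptions, not the ucma code: `cmd_table`, `cmd_in_range_unsigned`, `cmd_in_range_signed`, and the local `ARRAY_SIZE` macro are hypothetical names introduced only for illustration. It shows that for an unsigned command field the upper-bound test alone rejects every out-of-range value (a negative value converted to unsigned becomes very large), whereas a signed field would need the explicit lower-bound test.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the kernel's ARRAY_SIZE macro (assumption). */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical command table; only its length (4) matters here. */
static const char *const cmd_table[] = { "create", "destroy", "bind", "query" };

/*
 * Unsigned command: "cmd < 0" can never be true, so the upper-bound
 * comparison is the complete range check.
 */
static int cmd_in_range_unsigned(uint32_t cmd)
{
	return cmd < ARRAY_SIZE(cmd_table);
}

/*
 * Signed command: the lower-bound test is required. The cast to size_t
 * happens only after cmd is known to be non-negative.
 */
static int cmd_in_range_signed(int32_t cmd)
{
	return cmd >= 0 && (size_t)cmd < ARRAY_SIZE(cmd_table);
}
```

If the type of the field were ever changed from unsigned to signed, the first form would silently accept negative values on its own only when the comparison converts them to a large unsigned number; whether that conversion happens depends on the operand types, which is the maintenance concern raised in the thread.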
Satyam From halr at voltaire.com Wed May 30 12:51:45 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 30 May 2007 15:51:45 -0400 Subject: [ofa-general] Re: [PATCH] opensm: drop_mgr: clean only associated with port physical obj In-Reply-To: <20070530182347.GF13193@sashak.voltaire.com> References: <20070530182347.GF13193@sashak.voltaire.com> Message-ID: <1180554704.7116.88960.camel@hal.voltaire.com> On Wed, 2007-05-30 at 14:23, Sasha Khapyorsky wrote: > When removing an osm_port_t, clean up only the associated osm_physp_t object, > not all of the node's osm_physp_t objects. If all osm_physp_t objects should be > removed, do it in the node removal routine. > > This fix prevents random crashes in post drop manager flows when a CA node > had two ports connected and one was disconnected. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From sean.hefty at intel.com Wed May 30 13:23:06 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 30 May 2007 13:23:06 -0700 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com> Message-ID: <000801c7a2f8$55749000$3c98070a@amr.corp.intel.com> >Ok, Soon I will post a patch related to this. >How static PR file will be generated? Needs to be discussed. Please look at my latest changes to the local SA when generating the patches: git://git.openfabrics.org/~shefty/rdma-dev.git sa_cache I'm not sure about the best way to communicate PRs to the cache. I haven't given it more than about 2 minutes of thought, but as an idea, we could look at trying to make use of the userspace MAD interface. For example, we could send MADs to the local SA with the PRs to load. More details would obviously need to be worked out, but this could provide an extensible solution. 
- Sean From sean.hefty at intel.com Wed May 30 13:34:13 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 30 May 2007 13:34:13 -0700 Subject: [ofa-general] [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local path record caching Message-ID: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com> I've updated the local SA patches based on previous feedback. The most significant change is to integrate the local SA with the ib_sa module. This allows all apps to make use of the local SA without changes. The use of a device file was also replaced with simple module parameters. I've also pushed these changes to: git://git.openfabrics.org/~shefty/rdma-dev.git sa_cache I would like to close any open issues with this approach in time to pull it into 2.6.23. Signed-off-by: Sean Hefty From sean.hefty at intel.com Wed May 30 13:36:37 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 30 May 2007 13:36:37 -0700 Subject: [ofa-general] [RFC] [PATCH 1/2] for 2.6.23: ib/sa - add InformInfo/Notice support In-Reply-To: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com> Message-ID: <000a01c7a2fa$383b42c0$3c98070a@amr.corp.intel.com> Add SA client support for notice/trap registration using InformInfo. Clients can use the ib_sa interface to register for SA events based on trap numbers, and receive SA event notification. This allows clients to receive notification, such as GID in/out of service. 
Signed-off-by: Sean Hefty --- drivers/infiniband/core/Makefile | 2 drivers/infiniband/core/notice.c | 749 ++++++++++++++++++++++++++++++++++++ drivers/infiniband/core/sa.h | 16 + drivers/infiniband/core/sa_query.c | 316 +++++++++++++++ include/rdma/ib_sa.h | 171 ++++++++ 5 files changed, 1251 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index cb1ab3e..7c5b5ed 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -13,7 +13,7 @@ ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o ib_mad-y := mad.o smi.o agent.o mad_rmpp.o -ib_sa-y := sa_query.o multicast.o +ib_sa-y := sa_query.o multicast.o notice.o ib_cm-y := cm.o diff --git a/drivers/infiniband/core/notice.c b/drivers/infiniband/core/notice.c new file mode 100644 index 0000000..e4c73c8 --- /dev/null +++ b/drivers/infiniband/core/notice.c @@ -0,0 +1,749 @@ +/* + * Copyright (c) 2006 Intel Corporation.  All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "sa.h" + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand InformInfo & Notice event handling"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void inform_add_one(struct ib_device *device); +static void inform_remove_one(struct ib_device *device); + +static struct ib_client inform_client = { + .name = "ib_notice", + .add = inform_add_one, + .remove = inform_remove_one +}; + +static struct ib_sa_client sa_client; +static struct workqueue_struct *inform_wq; + +struct inform_device; + +struct inform_port { + struct inform_device *dev; + spinlock_t lock; + struct rb_root table; + atomic_t refcount; + struct completion comp; + u8 port_num; +}; + +struct inform_device { + struct ib_device *device; + struct ib_event_handler event_handler; + int start_port; + int end_port; + struct inform_port port[0]; +}; + +enum inform_state { + INFORM_IDLE, + INFORM_REGISTERING, + INFORM_MEMBER, + INFORM_BUSY, + INFORM_ERROR +}; + +struct inform_member; + +struct inform_group { + u16 trap_number; + struct rb_node node; + struct inform_port *port; + spinlock_t lock; + struct work_struct work; + struct list_head pending_list; + struct list_head active_list; + struct list_head notice_list; + struct inform_member *last_join; + int members; + enum inform_state join_state; /* State relative to SA */ + atomic_t refcount; + enum inform_state state; + struct ib_sa_query *query; + int query_id; +}; + +struct 
inform_member { + struct ib_inform_info info; + struct ib_sa_client *client; + struct inform_group *group; + struct list_head list; + enum inform_state state; + atomic_t refcount; + struct completion comp; +}; + +struct inform_notice { + struct list_head list; + struct ib_sa_notice notice; +}; + +static void reg_handler(int status, struct ib_sa_inform *inform, + void *context); +static void unreg_handler(int status, struct ib_sa_inform *inform, + void *context); + +static struct inform_group *inform_find(struct inform_port *port, + u16 trap_number) +{ + struct rb_node *node = port->table.rb_node; + struct inform_group *group; + + while (node) { + group = rb_entry(node, struct inform_group, node); + if (trap_number < group->trap_number) + node = node->rb_left; + else if (trap_number > group->trap_number) + node = node->rb_right; + else + return group; + } + return NULL; +} + +static struct inform_group *inform_insert(struct inform_port *port, + struct inform_group *group) +{ + struct rb_node **link = &port->table.rb_node; + struct rb_node *parent = NULL; + struct inform_group *cur_group; + + while (*link) { + parent = *link; + cur_group = rb_entry(parent, struct inform_group, node); + if (group->trap_number < cur_group->trap_number) + link = &(*link)->rb_left; + else if (group->trap_number > cur_group->trap_number) + link = &(*link)->rb_right; + else + return cur_group; + } + rb_link_node(&group->node, parent, link); + rb_insert_color(&group->node, &port->table); + return NULL; +} + +static void deref_port(struct inform_port *port) +{ + if (atomic_dec_and_test(&port->refcount)) + complete(&port->comp); +} + +static void release_group(struct inform_group *group) +{ + struct inform_port *port = group->port; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + if (atomic_dec_and_test(&group->refcount)) { + rb_erase(&group->node, &port->table); + spin_unlock_irqrestore(&port->lock, flags); + kfree(group); + deref_port(port); + } else + 
spin_unlock_irqrestore(&port->lock, flags); +} + +static void deref_member(struct inform_member *member) +{ + if (atomic_dec_and_test(&member->refcount)) + complete(&member->comp); +} + +static void queue_reg(struct inform_member *member) +{ + struct inform_group *group = member->group; + unsigned long flags; + + spin_lock_irqsave(&group->lock, flags); + list_add(&member->list, &group->pending_list); + if (group->state == INFORM_IDLE) { + group->state = INFORM_BUSY; + atomic_inc(&group->refcount); + queue_work(inform_wq, &group->work); + } + spin_unlock_irqrestore(&group->lock, flags); +} + +static int send_reg(struct inform_group *group, struct inform_member *member) +{ + struct inform_port *port = group->port; + struct ib_sa_inform inform; + int ret; + + memset(&inform, 0, sizeof inform); + inform.lid_range_begin = cpu_to_be16(0xFFFF); + inform.is_generic = 1; + inform.subscribe = 1; + inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL); + inform.trap.generic.trap_num = cpu_to_be16(member->info.trap_number); + inform.trap.generic.resp_time = 19; + inform.trap.generic.producer_type = + cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL); + + group->last_join = member; + ret = ib_sa_informinfo_query(&sa_client, port->dev->device, + port->port_num, &inform, 3000, GFP_KERNEL, + reg_handler, group,&group->query); + if (ret >= 0) { + group->query_id = ret; + ret = 0; + } + return ret; +} + +static int send_unreg(struct inform_group *group) +{ + struct inform_port *port = group->port; + struct ib_sa_inform inform; + int ret; + + memset(&inform, 0, sizeof inform); + inform.lid_range_begin = cpu_to_be16(0xFFFF); + inform.is_generic = 1; + inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL); + inform.trap.generic.trap_num = cpu_to_be16(group->trap_number); + inform.trap.generic.qpn = IB_QP1; + inform.trap.generic.resp_time = 19; + inform.trap.generic.producer_type = + cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL); + + ret = ib_sa_informinfo_query(&sa_client, port->dev->device, + 
port->port_num, &inform, 3000, GFP_KERNEL, + unreg_handler, group, &group->query); + if (ret >= 0) { + group->query_id = ret; + ret = 0; + } + return ret; +} + +static void join_group(struct inform_group *group, struct inform_member *member) +{ + member->state = INFORM_MEMBER; + group->members++; + list_move(&member->list, &group->active_list); +} + +static int fail_join(struct inform_group *group, struct inform_member *member, + int status) +{ + spin_lock_irq(&group->lock); + list_del_init(&member->list); + spin_unlock_irq(&group->lock); + return member->info.callback(status, &member->info, NULL); +} + +static void process_group_error(struct inform_group *group) +{ + struct inform_member *member; + int ret; + + spin_lock_irq(&group->lock); + while (!list_empty(&group->active_list)) { + member = list_entry(group->active_list.next, + struct inform_member, list); + atomic_inc(&member->refcount); + list_del_init(&member->list); + group->members--; + member->state = INFORM_ERROR; + spin_unlock_irq(&group->lock); + + ret = member->info.callback(-ENETRESET, &member->info, NULL); + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + spin_lock_irq(&group->lock); + } + + group->join_state = INFORM_IDLE; + group->state = INFORM_BUSY; + spin_unlock_irq(&group->lock); +} + +/* + * Report a notice to all active subscribers. We use a temporary list to + * handle unsubscription requests while the notice is being reported, which + * avoids holding the group lock while in the user's callback. 
+ */ +static void process_notice(struct inform_group *group, + struct inform_notice *info_notice) +{ + struct inform_member *member; + struct list_head list; + int ret; + + INIT_LIST_HEAD(&list); + + spin_lock_irq(&group->lock); + list_splice_init(&group->active_list, &list); + while (!list_empty(&list)) { + + member = list_entry(list.next, struct inform_member, list); + atomic_inc(&member->refcount); + list_move(&member->list, &group->active_list); + spin_unlock_irq(&group->lock); + + ret = member->info.callback(0, &member->info, + &info_notice->notice); + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + spin_lock_irq(&group->lock); + } + spin_unlock_irq(&group->lock); +} + +static void inform_work_handler(struct work_struct *work) +{ + struct inform_group *group; + struct inform_member *member; + struct ib_inform_info *info; + struct inform_notice *info_notice; + int status, ret; + + group = container_of(work, typeof(*group), work); +retest: + spin_lock_irq(&group->lock); + while (!list_empty(&group->pending_list) || + !list_empty(&group->notice_list) || + (group->state == INFORM_ERROR)) { + + if (group->state == INFORM_ERROR) { + spin_unlock_irq(&group->lock); + process_group_error(group); + goto retest; + } + + if (!list_empty(&group->notice_list)) { + info_notice = list_entry(group->notice_list.next, + struct inform_notice, list); + list_del(&info_notice->list); + spin_unlock_irq(&group->lock); + process_notice(group, info_notice); + kfree(info_notice); + goto retest; + } + + member = list_entry(group->pending_list.next, + struct inform_member, list); + info = &member->info; + atomic_inc(&member->refcount); + + if (group->join_state == INFORM_MEMBER) { + join_group(group, member); + spin_unlock_irq(&group->lock); + ret = info->callback(0, info, NULL); + } else { + spin_unlock_irq(&group->lock); + status = send_reg(group, member); + if (!status) { + deref_member(member); + return; + } + ret = fail_join(group, member, status); + 
} + + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + spin_lock_irq(&group->lock); + } + + if (!group->members && (group->join_state == INFORM_MEMBER)) { + group->join_state = INFORM_IDLE; + spin_unlock_irq(&group->lock); + if (send_unreg(group)) + goto retest; + } else { + group->state = INFORM_IDLE; + spin_unlock_irq(&group->lock); + release_group(group); + } +} + +/* + * Fail a join request if it is still active - at the head of the pending queue. + */ +static void process_join_error(struct inform_group *group, int status) +{ + struct inform_member *member; + int ret; + + spin_lock_irq(&group->lock); + member = list_entry(group->pending_list.next, + struct inform_member, list); + if (group->last_join == member) { + atomic_inc(&member->refcount); + list_del_init(&member->list); + spin_unlock_irq(&group->lock); + ret = member->info.callback(status, &member->info, NULL); + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + } else + spin_unlock_irq(&group->lock); +} + +static void reg_handler(int status, struct ib_sa_inform *inform, void *context) +{ + struct inform_group *group = context; + + if (status) + process_join_error(group, status); + else + group->join_state = INFORM_MEMBER; + + inform_work_handler(&group->work); +} + +static void unreg_handler(int status, struct ib_sa_inform *rec, void *context) +{ + struct inform_group *group = context; + + inform_work_handler(&group->work); +} + +int notice_dispatch(struct ib_device *device, u8 port_num, + struct ib_sa_notice *notice) +{ + struct inform_device *dev; + struct inform_port *port; + struct inform_group *group; + struct inform_notice *info_notice; + + dev = ib_get_client_data(device, &inform_client); + if (!dev) + return 0; /* No one to give notice to. */ + + port = &dev->port[port_num - dev->start_port]; + spin_lock_irq(&port->lock); + group = inform_find(port, __be16_to_cpu(notice->trap. 
+ generic.trap_num)); + if (!group) { + spin_unlock_irq(&port->lock); + return 0; + } + + atomic_inc(&group->refcount); + spin_unlock_irq(&port->lock); + + info_notice = kmalloc(sizeof *info_notice, GFP_KERNEL); + if (!info_notice) { + release_group(group); + return -ENOMEM; + } + + info_notice->notice = *notice; + + spin_lock_irq(&group->lock); + list_add(&info_notice->list, &group->notice_list); + if (group->state == INFORM_IDLE) { + group->state = INFORM_BUSY; + spin_unlock_irq(&group->lock); + inform_work_handler(&group->work); + } else { + spin_unlock_irq(&group->lock); + release_group(group); + } + + return 0; +} + +static struct inform_group *acquire_group(struct inform_port *port, + u16 trap_number, gfp_t gfp_mask) +{ + struct inform_group *group, *cur_group; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + group = inform_find(port, trap_number); + if (group) + goto found; + spin_unlock_irqrestore(&port->lock, flags); + + group = kzalloc(sizeof *group, gfp_mask); + if (!group) + return NULL; + + group->port = port; + group->trap_number = trap_number; + INIT_LIST_HEAD(&group->pending_list); + INIT_LIST_HEAD(&group->active_list); + INIT_LIST_HEAD(&group->notice_list); + INIT_WORK(&group->work, inform_work_handler); + spin_lock_init(&group->lock); + + spin_lock_irqsave(&port->lock, flags); + cur_group = inform_insert(port, group); + if (cur_group) { + kfree(group); + group = cur_group; + } else + atomic_inc(&port->refcount); +found: + atomic_inc(&group->refcount); + spin_unlock_irqrestore(&port->lock, flags); + return group; +} + +/* + * We serialize all join requests to a single group to make our lives much + * easier. Otherwise, two users could try to join the same group + * simultaneously, with different configurations, one could leave while the + * join is in progress, etc., which makes locking around error recovery + * difficult. 
+ */ +struct ib_inform_info * +ib_sa_register_inform_info(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + u16 trap_number, gfp_t gfp_mask, + int (*callback)(int status, + struct ib_inform_info *info, + struct ib_sa_notice *notice), + void *context) +{ + struct inform_device *dev; + struct inform_member *member; + struct ib_inform_info *info; + int ret; + + dev = ib_get_client_data(device, &inform_client); + if (!dev) + return ERR_PTR(-ENODEV); + + member = kzalloc(sizeof *member, gfp_mask); + if (!member) + return ERR_PTR(-ENOMEM); + + ib_sa_client_get(client); + member->client = client; + member->info.trap_number = trap_number; + member->info.callback = callback; + member->info.context = context; + init_completion(&member->comp); + atomic_set(&member->refcount, 1); + member->state = INFORM_REGISTERING; + + member->group = acquire_group(&dev->port[port_num - dev->start_port], + trap_number, gfp_mask); + if (!member->group) { + ret = -ENOMEM; + goto err; + } + + /* + * The user will get the info structure in their callback. They + * could then free the info structure before we can return from + * this routine. So we save the pointer to return before queuing + * any callback. 
+ */ + info = &member->info; + queue_reg(member); + return info; + +err: + ib_sa_client_put(member->client); + kfree(member); + return ERR_PTR(ret); +} +EXPORT_SYMBOL(ib_sa_register_inform_info); + +void ib_sa_unregister_inform_info(struct ib_inform_info *info) +{ + struct inform_member *member; + struct inform_group *group; + + member = container_of(info, struct inform_member, info); + group = member->group; + + spin_lock_irq(&group->lock); + if (member->state == INFORM_MEMBER) + group->members--; + + list_del_init(&member->list); + + if (group->state == INFORM_IDLE) { + group->state = INFORM_BUSY; + spin_unlock_irq(&group->lock); + /* Continue to hold reference on group until callback */ + queue_work(inform_wq, &group->work); + } else { + spin_unlock_irq(&group->lock); + release_group(group); + } + + deref_member(member); + wait_for_completion(&member->comp); + ib_sa_client_put(member->client); + kfree(member); +} +EXPORT_SYMBOL(ib_sa_unregister_inform_info); + +static void inform_groups_lost(struct inform_port *port) +{ + struct inform_group *group; + struct rb_node *node; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + for (node = rb_first(&port->table); node; node = rb_next(node)) { + group = rb_entry(node, struct inform_group, node); + spin_lock(&group->lock); + if (group->state == INFORM_IDLE) { + atomic_inc(&group->refcount); + queue_work(inform_wq, &group->work); + } + group->state = INFORM_ERROR; + spin_unlock(&group->lock); + } + spin_unlock_irqrestore(&port->lock, flags); +} + +static void inform_event_handler(struct ib_event_handler *handler, + struct ib_event *event) +{ + struct inform_device *dev; + + dev = container_of(handler, struct inform_device, event_handler); + + switch (event->event) { + case IB_EVENT_PORT_ERR: + case IB_EVENT_LID_CHANGE: + case IB_EVENT_SM_CHANGE: + case IB_EVENT_CLIENT_REREGISTER: + inform_groups_lost(&dev->port[event->element.port_num - + dev->start_port]); + break; + default: + break; + } +} + +static 
void inform_add_one(struct ib_device *device) +{ + struct inform_device *dev; + struct inform_port *port; + int i; + + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, + GFP_KERNEL); + if (!dev) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) + dev->start_port = dev->end_port = 0; + else { + dev->start_port = 1; + dev->end_port = device->phys_port_cnt; + } + + for (i = 0; i <= dev->end_port - dev->start_port; i++) { + port = &dev->port[i]; + port->dev = dev; + port->port_num = dev->start_port + i; + spin_lock_init(&port->lock); + port->table = RB_ROOT; + init_completion(&port->comp); + atomic_set(&port->refcount, 1); + } + + dev->device = device; + ib_set_client_data(device, &inform_client, dev); + + INIT_IB_EVENT_HANDLER(&dev->event_handler, device, inform_event_handler); + ib_register_event_handler(&dev->event_handler); +} + +static void inform_remove_one(struct ib_device *device) +{ + struct inform_device *dev; + struct inform_port *port; + int i; + + dev = ib_get_client_data(device, &inform_client); + if (!dev) + return; + + ib_unregister_event_handler(&dev->event_handler); + flush_workqueue(inform_wq); + + for (i = 0; i <= dev->end_port - dev->start_port; i++) { + port = &dev->port[i]; + deref_port(port); + wait_for_completion(&port->comp); + } + + kfree(dev); +} + +int notice_init(void) +{ + int ret; + + inform_wq = create_singlethread_workqueue("ib_inform"); + if (!inform_wq) + return -ENOMEM; + + ib_sa_register_client(&sa_client); + + ret = ib_register_client(&inform_client); + if (ret) + goto err; + return 0; + +err: + ib_sa_unregister_client(&sa_client); + destroy_workqueue(inform_wq); + return ret; +} + +void notice_cleanup(void) +{ + ib_unregister_client(&inform_client); + ib_sa_unregister_client(&sa_client); + destroy_workqueue(inform_wq); +} diff --git a/drivers/infiniband/core/sa.h b/drivers/infiniband/core/sa.h index 24c93fd..b8eac66 
100644 --- a/drivers/infiniband/core/sa.h +++ b/drivers/infiniband/core/sa.h @@ -63,4 +63,20 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, int mcast_init(void); void mcast_cleanup(void); +int ib_sa_informinfo_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_inform *rec, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_inform *resp, + void *context), + void *context, + struct ib_sa_query **sa_query); + +int notice_dispatch(struct ib_device *device, u8 port_num, + struct ib_sa_notice *notice); + +int notice_init(void); +void notice_cleanup(void); + #endif /* SA_H */ diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 6469406..369fe60 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -61,10 +61,12 @@ struct ib_sa_sm_ah { struct ib_sa_port { struct ib_mad_agent *agent; + struct ib_mad_agent *notice_agent; struct ib_sa_sm_ah *sm_ah; struct work_struct update_task; spinlock_t ah_lock; u8 port_num; + struct ib_device *device; }; struct ib_sa_device { @@ -101,6 +103,12 @@ struct ib_sa_mcmember_query { struct ib_sa_query sa_query; }; +struct ib_sa_inform_query { + void (*callback)(int, struct ib_sa_inform *, void *); + void *context; + struct ib_sa_query sa_query; +}; + static void ib_sa_add_one(struct ib_device *device); static void ib_sa_remove_one(struct ib_device *device); @@ -352,6 +360,110 @@ static const struct ib_field service_rec_table[] = { .size_bits = 2*64 }, }; +#define INFORM_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_inform, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_inform *) 0)->field, \ + .field_name = "sa_inform:" #field + +static const struct ib_field inform_table[] = { + { INFORM_FIELD(gid), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 128 }, + { INFORM_FIELD(lid_range_begin), + .offset_words = 4, + .offset_bits = 0, + .size_bits = 16 }, + { 
INFORM_FIELD(lid_range_end), + .offset_words = 4, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 5, + .offset_bits = 0, + .size_bits = 16 }, + { INFORM_FIELD(is_generic), + .offset_words = 5, + .offset_bits = 16, + .size_bits = 8 }, + { INFORM_FIELD(subscribe), + .offset_words = 5, + .offset_bits = 24, + .size_bits = 8 }, + { INFORM_FIELD(type), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 16 }, + { INFORM_FIELD(trap.generic.trap_num), + .offset_words = 6, + .offset_bits = 16, + .size_bits = 16 }, + { INFORM_FIELD(trap.generic.qpn), + .offset_words = 7, + .offset_bits = 0, + .size_bits = 24 }, + { RESERVED, + .offset_words = 7, + .offset_bits = 24, + .size_bits = 3 }, + { INFORM_FIELD(trap.generic.resp_time), + .offset_words = 7, + .offset_bits = 27, + .size_bits = 5 }, + { RESERVED, + .offset_words = 8, + .offset_bits = 0, + .size_bits = 8 }, + { INFORM_FIELD(trap.generic.producer_type), + .offset_words = 8, + .offset_bits = 8, + .size_bits = 24 }, +}; + +#define NOTICE_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_notice, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_notice *) 0)->field, \ + .field_name = "sa_notice:" #field + +static const struct ib_field notice_table[] = { + { NOTICE_FIELD(is_generic), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 1 }, + { NOTICE_FIELD(type), + .offset_words = 0, + .offset_bits = 1, + .size_bits = 7 }, + { NOTICE_FIELD(trap.generic.producer_type), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 24 }, + { NOTICE_FIELD(trap.generic.trap_num), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { NOTICE_FIELD(issuer_lid), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 }, + { NOTICE_FIELD(notice_toggle), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 1 }, + { NOTICE_FIELD(notice_count), + .offset_words = 2, + .offset_bits = 1, + .size_bits = 15 }, + { NOTICE_FIELD(data_details), + .offset_words = 2, + .offset_bits = 
16, + .size_bits = 432 }, + { NOTICE_FIELD(issuer_gid), + .offset_words = 16, + .offset_bits = 0, + .size_bits = 128 }, +}; + static void free_sm_ah(struct kref *kref) { struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); @@ -913,6 +1025,153 @@ err1: return ret; } +static void ib_sa_inform_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_inform_query *query = + container_of(sa_query, struct ib_sa_inform_query, sa_query); + + if (mad) { + struct ib_sa_inform rec; + + ib_unpack(inform_table, ARRAY_SIZE(inform_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_inform_release(struct ib_sa_query *sa_query) +{ + kfree(container_of(sa_query, struct ib_sa_inform_query, sa_query)); +} + +/** + * ib_sa_informinfo_query - Start an InformInfo registration. + * @client:SA client + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Inform record to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when notice handler registration completes, + * times out or is canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * This function sends inform info to register with the SA to receive + * in-service notices. + * The callback function will be called when the query completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_informinfo_query() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query. 
+ */ +int ib_sa_informinfo_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_inform *rec, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_inform *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_inform_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port; + struct ib_mad_agent *agent; + struct ib_sa_mad *mad; + int ret; + + if (!sa_dev) + return -ENODEV; + + port = &sa_dev->port[port_num - sa_dev->start_port]; + agent = port->agent; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + + query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + 0, IB_MGMT_SA_HDR, + IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; + } + + ib_sa_client_get(client); + query->sa_query.client = client; + query->callback = callback; + query->context = context; + + mad = query->sa_query.mad_buf->mad; + init_mad(mad, agent); + + query->sa_query.callback = callback ? 
ib_sa_inform_callback : NULL; + query->sa_query.release = ib_sa_inform_release; + query->sa_query.port = port; + mad->mad_hdr.method = IB_MGMT_METHOD_SET; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_INFORM_INFO); + + ib_pack(inform_table, ARRAY_SIZE(inform_table), rec, mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms, gfp_mask); + if (ret < 0) + goto err2; + + return ret; + +err2: + *sa_query = NULL; + ib_sa_client_put(query->sa_query.client); + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kfree(query); + return ret; +} + +static void ib_sa_notice_resp(struct ib_sa_port *port, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_send_buf *mad_buf; + struct ib_sa_mad *mad; + int ret; + + mad_buf = ib_create_send_mad(port->notice_agent, 1, 0, 0, + IB_MGMT_SA_HDR, IB_MGMT_SA_DATA, + GFP_KERNEL); + if (IS_ERR(mad_buf)) + return; + + mad = mad_buf->mad; + memcpy(mad, mad_recv_wc->recv_buf.mad, sizeof *mad); + mad->mad_hdr.method = IB_MGMT_METHOD_REPORT_RESP; + + spin_lock_irq(&port->ah_lock); + kref_get(&port->sm_ah->ref); + mad_buf->context[0] = &port->sm_ah->ref; + mad_buf->ah = port->sm_ah->ah; + spin_unlock_irq(&port->ah_lock); + + ret = ib_post_send_mad(mad_buf, NULL); + if (ret) + goto err; + + return; +err: + kref_put(mad_buf->context[0], free_sm_ah); + ib_free_send_mad(mad_buf); +} + static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) { @@ -967,9 +1226,36 @@ static void recv_handler(struct ib_mad_agent *mad_agent, ib_free_recv_mad(mad_recv_wc); } +static void notice_resp_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + kref_put(mad_send_wc->send_buf->context[0], free_sm_ah); + ib_free_send_mad(mad_send_wc->send_buf); +} + +static void notice_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_port *port; + struct ib_sa_mad *mad; + struct ib_sa_notice notice; + + port = mad_agent->context; + mad = 
(struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + ib_unpack(notice_table, ARRAY_SIZE(notice_table), mad->data, &notice); + + if (!notice_dispatch(port->device, port->port_num, &notice)) + ib_sa_notice_resp(port, mad_recv_wc); + ib_free_recv_mad(mad_recv_wc); +} + static void ib_sa_add_one(struct ib_device *device) { struct ib_sa_device *sa_dev; + struct ib_mad_reg_req reg_req = { + .mgmt_class = IB_MGMT_CLASS_SUBN_ADM, + .mgmt_class_version = 2 + }; int s, e, i; if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) @@ -1003,6 +1289,16 @@ static void ib_sa_add_one(struct ib_device *device) if (IS_ERR(sa_dev->port[i].agent)) goto err; + sa_dev->port[i].device = device; + set_bit(IB_MGMT_METHOD_REPORT, reg_req.method_mask); + sa_dev->port[i].notice_agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, + &reg_req, 0, notice_resp_handler, + notice_handler, &sa_dev->port[i]); + + if (IS_ERR(sa_dev->port[i].notice_agent)) + goto err; + INIT_WORK(&sa_dev->port[i].update_task, update_sm_ah); } @@ -1025,8 +1321,14 @@ static void ib_sa_add_one(struct ib_device *device) return; err: - while (--i >= 0) - ib_unregister_mad_agent(sa_dev->port[i].agent); + while (--i >= 0) { + if (!IS_ERR(sa_dev->port[i].notice_agent)) { + ib_unregister_mad_agent(sa_dev->port[i].notice_agent); + } + if (!IS_ERR(sa_dev->port[i].agent)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + } + } kfree(sa_dev); @@ -1046,6 +1348,7 @@ static void ib_sa_remove_one(struct ib_device *device) flush_scheduled_work(); for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_unregister_mad_agent(sa_dev->port[i].notice_agent); ib_unregister_mad_agent(sa_dev->port[i].agent); kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); } @@ -1074,7 +1377,15 @@ static int __init ib_sa_init(void) goto err2; } + ret = notice_init(); + if (ret) { + printk(KERN_ERR "Couldn't initialize notice handling\n"); + goto err3; + } + return 0; +err3: + mcast_cleanup(); err2: ib_unregister_client(&sa_client); err1: 
@@ -1084,6 +1395,7 @@ err1: static void __exit ib_sa_cleanup(void) { mcast_cleanup(); + notice_cleanup(); ib_unregister_client(&sa_client); idr_destroy(&query_idr); } diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index 5e26b2f..83d8157 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -254,6 +254,127 @@ struct ib_sa_service_rec { u64 data64[2]; }; +enum { + IB_SA_EVENT_TYPE_FATAL = 0x0, + IB_SA_EVENT_TYPE_URGENT = 0x1, + IB_SA_EVENT_TYPE_SECURITY = 0x2, + IB_SA_EVENT_TYPE_SM = 0x3, + IB_SA_EVENT_TYPE_INFO = 0x4, + IB_SA_EVENT_TYPE_EMPTY = 0x7F, + IB_SA_EVENT_TYPE_ALL = 0xFFFF +}; + +enum { + IB_SA_EVENT_PRODUCER_TYPE_CA = 0x1, + IB_SA_EVENT_PRODUCER_TYPE_SWITCH = 0x2, + IB_SA_EVENT_PRODUCER_TYPE_ROUTER = 0x3, + IB_SA_EVENT_PRODUCER_TYPE_CLASS_MANAGER = 0x4, + IB_SA_EVENT_PRODUCER_TYPE_ALL = 0xFFFFFF +}; + +enum { + IB_SA_SM_TRAP_GID_IN_SERVICE = 64, + IB_SA_SM_TRAP_GID_OUT_OF_SERVICE = 65, + IB_SA_SM_TRAP_CREATE_MC_GROUP = 66, + IB_SA_SM_TRAP_DELETE_MC_GROUP = 67, + IB_SA_SM_TRAP_PORT_CHANGE_STATE = 128, + IB_SA_SM_TRAP_LINK_INTEGRITY = 129, + IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN = 130, + IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED = 131, + IB_SA_SM_TRAP_BAD_M_KEY = 256, + IB_SA_SM_TRAP_BAD_P_KEY = 257, + IB_SA_SM_TRAP_BAD_Q_KEY = 258, + IB_SA_SM_TRAP_SWITCH_BAD_P_KEY = 259, + IB_SA_SM_TRAP_ALL = 0xFFFF +}; + +struct ib_sa_inform { + union ib_gid gid; + __be16 lid_range_begin; + __be16 lid_range_end; + u8 is_generic; + u8 subscribe; + __be16 type; + union { + struct { + __be16 trap_num; + __be32 qpn; + u8 resp_time; + __be32 producer_type; + } generic; + struct { + __be16 device_id; + __be32 qpn; + u8 resp_time; + __be32 vendor_id; + } vendor; + } trap; +}; + +struct ib_sa_notice { + u8 is_generic; + u8 type; + union { + struct { + __be32 producer_type; + __be16 trap_num; + } generic; + struct { + __be32 vendor_id; + __be16 device_id; + } vendor; + } trap; + __be16 issuer_lid; + __be16 notice_count; + u8 notice_toggle; + /* + * Align data 
16 bits off 64 bit field to match InformInfo definition. + * Data contained within this field will then align properly. + * See IB spec 1.2, sections 13.4.8.2 and 14.2.5.1. + */ + u8 reserved[5]; + u8 data_details[54]; + union ib_gid issuer_gid; +}; + +/* + * SM notice data details for: + * + * IB_SA_SM_TRAP_GID_IN_SERVICE = 64 + * IB_SA_SM_TRAP_GID_OUT_OF_SERVICE = 65 + * IB_SA_SM_TRAP_CREATE_MC_GROUP = 66 + * IB_SA_SM_TRAP_DELETE_MC_GROUP = 67 + */ +struct ib_sa_notice_data_gid { + u8 reserved[6]; + u8 gid[16]; + u8 padding[32]; +}; + +/* + * SM notice data details for: + * + * IB_SA_SM_TRAP_PORT_CHANGE_STATE = 128 + */ +struct ib_sa_notice_data_port_change { + __be16 lid; + u8 padding[52]; +}; + +/* + * SM notice data details for: + * + * IB_SA_SM_TRAP_LINK_INTEGRITY = 129 + * IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN = 130 + * IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED = 131 + */ +struct ib_sa_notice_data_port_error { + u8 reserved[2]; + __be16 lid; + u8 port_num; + u8 padding[49]; +}; + struct ib_sa_client { atomic_t users; struct completion comp; @@ -382,4 +503,54 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr); +struct ib_inform_info { + void *context; + int (*callback)(int status, + struct ib_inform_info *info, + struct ib_sa_notice *notice); + u16 trap_number; +}; + +/** + * ib_sa_register_inform_info - Registers to receive notice events. + * @device: Device associated with the registration. + * @port_num: Port on the specified device to associate with the registration. + * @trap_number: InformInfo trap number to register for. + * @gfp_mask: GFP mask for memory allocations. + * @callback: User callback invoked once the registration completes and to + * report notice events. + * @context: User specified context stored with the ib_inform_info structure. + * + * This call initiates a registration request with the SA for the specified + * trap number. 
If the operation is started successfully, it returns + * an ib_inform_info structure that is used to track the registration operation. + * Users must free this structure by calling ib_sa_unregister_inform_info, + * even if the operation later fails. (The callback status is non-zero.) + * + * If the registration fails, the callback status will be non-zero. If the + * registration succeeds, the callback status will be zero, but the notice + * parameter will be NULL. If the notice parameter is not NULL, a trap or + * notice is being reported to the user. + * + * A status of -ENETRESET indicates that an error occurred which requires + * reregistration. + */ +struct ib_inform_info * +ib_sa_register_inform_info(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + u16 trap_number, gfp_t gfp_mask, + int (*callback)(int status, + struct ib_inform_info *info, + struct ib_sa_notice *notice), + void *context); + +/** + * ib_sa_unregister_inform_info - Releases an InformInfo registration. + * @info: InformInfo registration tracking structure. + * + * This call blocks until the registration request is destroyed. It may + * not be called from within the registration callback. + */ +void ib_sa_unregister_inform_info(struct ib_inform_info *info); + #endif /* IB_SA_H */ From sean.hefty at intel.com Wed May 30 13:45:18 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 30 May 2007 13:45:18 -0700 Subject: [ofa-general] [RFC] [PATCH 2/2] for 2.6.23: ib/sa - add local path record caching In-Reply-To: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com> Message-ID: <000b01c7a2fb$6ebb05f0$3c98070a@amr.corp.intel.com> Query and store path records locally to decrease path record query time and offload SA flooding during the start-up of large clustered jobs. 
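To make the caching idea concrete, here is a rough user-space sketch of a per-destination path cache that hands out the least-used path on each lookup. It is only an illustration under stated assumptions: the names path_cache, cache_add and cache_lookup, the fixed-size arrays, and the linear search are all hypothetical and are not taken from the patch, which keeps its records in an rbtree keyed by DGID inside the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space model; none of these names come from the patch. */
#define MAX_DESTS 8	/* destinations (DGIDs) the toy cache can hold */
#define MAX_PATHS 4	/* cached paths per destination */

struct path {
	uint16_t dlid;		/* stand-in for a full path record */
	unsigned long lookups;	/* how often this path was handed out */
};

struct dest {
	uint8_t dgid[16];
	int npaths;
	struct path paths[MAX_PATHS];
};

struct path_cache {
	int ndests;
	struct dest dests[MAX_DESTS];
};

/* Linear search by DGID; a real cache would use a tree or a hash. */
static struct dest *cache_find(struct path_cache *c, const uint8_t *dgid)
{
	for (int i = 0; i < c->ndests; i++)
		if (!memcmp(c->dests[i].dgid, dgid, 16))
			return &c->dests[i];
	return NULL;
}

/* Store one path for dgid; returns 0 on success, -1 if there is no room. */
static int cache_add(struct path_cache *c, const uint8_t *dgid, uint16_t dlid)
{
	struct dest *d;

	assert(c && dgid);
	d = cache_find(c, dgid);
	if (!d) {
		if (c->ndests == MAX_DESTS)
			return -1;
		d = &c->dests[c->ndests++];
		memset(d, 0, sizeof *d);
		memcpy(d->dgid, dgid, 16);
	}
	if (d->npaths == MAX_PATHS)
		return -1;
	d->paths[d->npaths].dlid = dlid;
	d->paths[d->npaths].lookups = 0;
	d->npaths++;
	return 0;
}

/*
 * Return the least-used cached path for dgid and count the lookup.
 * NULL means a cache miss; the caller would then fall back to an SA query.
 */
static struct path *cache_lookup(struct path_cache *c, const uint8_t *dgid)
{
	struct dest *d = cache_find(c, dgid);
	struct path *best = NULL;

	if (!d)
		return NULL;
	for (int i = 0; i < d->npaths; i++)
		if (!best || d->paths[i].lookups < best->lookups)
			best = &d->paths[i];
	if (best)
		best->lookups++;
	return best;
}
```

With two paths cached for one destination, repeated calls to cache_lookup alternate between them, which is the load-spreading effect a least-used policy is meant to give; a miss returns NULL so the caller can still go to the SA.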
Signed-off-by: Sean Hefty --- drivers/infiniband/core/Makefile | 2 drivers/infiniband/core/local_sa.c | 1273 +++++++++++++++++++++++++++++++++++ drivers/infiniband/core/multicast.c | 50 - drivers/infiniband/core/sa.h | 23 + drivers/infiniband/core/sa_query.c | 107 ++- 5 files changed, 1379 insertions(+), 76 deletions(-) diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 7c5b5ed..f646040 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -13,7 +13,7 @@ ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o ib_mad-y := mad.o smi.o agent.o mad_rmpp.o -ib_sa-y := sa_query.o multicast.o notice.o +ib_sa-y := sa_query.o multicast.o notice.o local_sa.o ib_cm-y := cm.o diff --git a/drivers/infiniband/core/local_sa.c b/drivers/infiniband/core/local_sa.c new file mode 100644 index 0000000..3b5bb8f --- /dev/null +++ b/drivers/infiniband/core/local_sa.c @@ -0,0 +1,1273 @@ +/* + * Copyright (c) 2006 Intel Corporation.  All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include "sa.h" + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand subnet administration caching"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + SA_DB_MAX_PATHS_PER_DEST = 0x7F, + SA_DB_MIN_RETRY_TIMER = 4000, /* 4 sec */ + SA_DB_MAX_RETRY_TIMER = 256000 /* 256 sec */ +}; + +static int set_paths_per_dest(const char *val, struct kernel_param *kp); +static unsigned long paths_per_dest = SA_DB_MAX_PATHS_PER_DEST; +module_param_call(paths_per_dest, set_paths_per_dest, param_get_ulong, + &paths_per_dest, 0644); +MODULE_PARM_DESC(paths_per_dest, "Maximum number of paths to retrieve " + "to each destination (DGID). 
Set to 0 " + "to disable cache."); + +static int set_subscribe_inform_info(const char *val, struct kernel_param *kp); +static char subscribe_inform_info = 1; +module_param_call(subscribe_inform_info, set_subscribe_inform_info, + param_get_bool, &subscribe_inform_info, 0644); +MODULE_PARM_DESC(subscribe_inform_info, + "Subscribe for SA InformInfo/Notice events."); + +static int do_refresh(const char *val, struct kernel_param *kp); +module_param_call(refresh, do_refresh, NULL, NULL, 0200); + +static unsigned long retry_timer = SA_DB_MIN_RETRY_TIMER; + +enum sa_db_lookup_method { + SA_DB_LOOKUP_LEAST_USED, + SA_DB_LOOKUP_RANDOM +}; + +static int set_lookup_method(const char *val, struct kernel_param *kp); +static int get_lookup_method(char *buf, struct kernel_param *kp); +static unsigned long lookup_method; +module_param_call(lookup_method, set_lookup_method, get_lookup_method, + &lookup_method, 0644); +MODULE_PARM_DESC(lookup_method, "Method used to return path records when " + "multiple paths exist to a given destination."); + +static void sa_db_add_dev(struct ib_device *device); +static void sa_db_remove_dev(struct ib_device *device); + +static struct ib_client sa_db_client = { + .name = "local_sa", + .add = sa_db_add_dev, + .remove = sa_db_remove_dev +}; + +static LIST_HEAD(dev_list); +static DEFINE_MUTEX(lock); +static rwlock_t rwlock; +static struct workqueue_struct *sa_wq; +static struct ib_sa_client sa_client; + +enum sa_db_state { + SA_DB_IDLE, + SA_DB_REFRESH, + SA_DB_DESTROY +}; + +struct sa_db_port { + struct sa_db_device *dev; + struct ib_mad_agent *agent; + /* Limit number of outstanding MADs to SA to reduce SA flooding */ + struct ib_mad_send_buf *msg; + u16 sm_lid; + u8 sm_sl; + struct ib_inform_info *in_info; + struct ib_inform_info *out_info; + struct rb_root paths; + struct list_head update_list; + unsigned long update_id; + enum sa_db_state state; + struct work_struct work; + union ib_gid gid; + int port_num; +}; + +struct sa_db_device { + struct 
list_head list; + struct ib_device *device; + struct ib_event_handler event_handler; + int start_port; + int port_count; + struct sa_db_port port[0]; +}; + +struct ib_sa_iterator { + struct ib_sa_iterator *next; +}; + +struct ib_sa_attr_iter { + struct ib_sa_iterator *iter; + unsigned long flags; +}; + +struct ib_sa_attr_list { + struct ib_sa_iterator iter; + struct ib_sa_iterator *tail; + int update_id; + union ib_gid gid; + struct rb_node node; +}; + +struct ib_path_rec_info { + struct ib_sa_iterator iter; /* keep first */ + struct ib_sa_path_rec rec; + unsigned long lookups; +}; + +struct ib_sa_mad_iter { + struct ib_mad_recv_wc *recv_wc; + struct ib_mad_recv_buf *recv_buf; + int attr_size; + int attr_offset; + int data_offset; + int data_left; + void *attr; + u8 attr_data[0]; +}; + +enum sa_update_type { + SA_UPDATE_FULL, + SA_UPDATE_ADD, + SA_UPDATE_REMOVE +}; + +struct update_info { + struct list_head list; + union ib_gid gid; + enum sa_update_type type; +}; + +struct sa_path_request { + struct work_struct work; + struct ib_sa_client *client; + void (*callback)(int, struct ib_sa_path_rec *, void *); + void *context; + struct ib_sa_path_rec path_rec; +}; + +static void process_updates(struct sa_db_port *port); + +static void free_attr_list(struct ib_sa_attr_list *attr_list) +{ + struct ib_sa_iterator *cur; + + for (cur = attr_list->iter.next; cur; cur = attr_list->iter.next) { + attr_list->iter.next = cur->next; + kfree(cur); + } + attr_list->tail = &attr_list->iter; +} + +static void remove_attr(struct rb_root *root, struct ib_sa_attr_list *attr_list) +{ + rb_erase(&attr_list->node, root); + free_attr_list(attr_list); + kfree(attr_list); +} + +static void remove_all_attrs(struct rb_root *root) +{ + struct rb_node *node, *next_node; + struct ib_sa_attr_list *attr_list; + + write_lock_irq(&rwlock); + for (node = rb_first(root); node; node = next_node) { + next_node = rb_next(node); + attr_list = rb_entry(node, struct ib_sa_attr_list, node); + remove_attr(root, 
attr_list); + } + write_unlock_irq(&rwlock); +} + +static void remove_old_attrs(struct rb_root *root, unsigned long update_id) +{ + struct rb_node *node, *next_node; + struct ib_sa_attr_list *attr_list; + + write_lock_irq(&rwlock); + for (node = rb_first(root); node; node = next_node) { + next_node = rb_next(node); + attr_list = rb_entry(node, struct ib_sa_attr_list, node); + if (attr_list->update_id != update_id) + remove_attr(root, attr_list); + } + write_unlock_irq(&rwlock); +} + +static struct ib_sa_attr_list *insert_attr_list(struct rb_root *root, + struct ib_sa_attr_list *attr_list) +{ + struct rb_node **link = &root->rb_node; + struct rb_node *parent = NULL; + struct ib_sa_attr_list *cur_attr_list; + int cmp; + + while (*link) { + parent = *link; + cur_attr_list = rb_entry(parent, struct ib_sa_attr_list, node); + cmp = memcmp(&cur_attr_list->gid, &attr_list->gid, + sizeof attr_list->gid); + if (cmp < 0) + link = &(*link)->rb_left; + else if (cmp > 0) + link = &(*link)->rb_right; + else + return cur_attr_list; + } + rb_link_node(&attr_list->node, parent, link); + rb_insert_color(&attr_list->node, root); + return NULL; +} + +static struct ib_sa_attr_list *find_attr_list(struct rb_root *root, u8 *gid) +{ + struct rb_node *node = root->rb_node; + struct ib_sa_attr_list *attr_list; + int cmp; + + while (node) { + attr_list = rb_entry(node, struct ib_sa_attr_list, node); + cmp = memcmp(&attr_list->gid, gid, sizeof attr_list->gid); + if (cmp < 0) + node = node->rb_left; + else if (cmp > 0) + node = node->rb_right; + else + return attr_list; + } + return NULL; +} + +static int insert_attr(struct rb_root *root, unsigned long update_id, void *key, + struct ib_sa_iterator *iter) +{ + struct ib_sa_attr_list *attr_list; + void *err; + + write_lock_irq(&rwlock); + attr_list = find_attr_list(root, key); + if (!attr_list) { + write_unlock_irq(&rwlock); + attr_list = kmalloc(sizeof *attr_list, GFP_KERNEL); + if (!attr_list) + return -ENOMEM; + + attr_list->iter.next = NULL; 
+ attr_list->tail = &attr_list->iter; + attr_list->update_id = update_id; + memcpy(attr_list->gid.raw, key, sizeof attr_list->gid); + + write_lock_irq(&rwlock); + err = insert_attr_list(root, attr_list); + if (err) { + write_unlock_irq(&rwlock); + kfree(attr_list); + return PTR_ERR(err); + } + } else if (attr_list->update_id != update_id) { + free_attr_list(attr_list); + attr_list->update_id = update_id; + } + + attr_list->tail->next = iter; + iter->next = NULL; + attr_list->tail = iter; + write_unlock_irq(&rwlock); + return 0; +} + +static struct ib_sa_mad_iter *ib_sa_iter_create(struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_mad_iter *iter; + struct ib_sa_mad *mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + int attr_size, attr_offset; + + attr_offset = be16_to_cpu(mad->sa_hdr.attr_offset) * 8; + attr_size = 64; /* path record length */ + if (attr_offset < attr_size) + return ERR_PTR(-EINVAL); + + iter = kzalloc(sizeof *iter + attr_size, GFP_KERNEL); + if (!iter) + return ERR_PTR(-ENOMEM); + + iter->data_left = mad_recv_wc->mad_len - IB_MGMT_SA_HDR; + iter->recv_wc = mad_recv_wc; + iter->recv_buf = &mad_recv_wc->recv_buf; + iter->attr_offset = attr_offset; + iter->attr_size = attr_size; + return iter; +} + +static void ib_sa_iter_free(struct ib_sa_mad_iter *iter) +{ + kfree(iter); +} + +static void *ib_sa_iter_next(struct ib_sa_mad_iter *iter) +{ + struct ib_sa_mad *mad; + int left, offset = 0; + + while (iter->data_left >= iter->attr_offset) { + while (iter->data_offset < IB_MGMT_SA_DATA) { + mad = (struct ib_sa_mad *) iter->recv_buf->mad; + + left = IB_MGMT_SA_DATA - iter->data_offset; + if (left < iter->attr_size) { + /* copy first piece of the attribute */ + iter->attr = &iter->attr_data; + memcpy(iter->attr, + &mad->data[iter->data_offset], left); + offset = left; + break; + } else if (offset) { + /* copy the second piece of the attribute */ + memcpy(iter->attr + offset, &mad->data[0], + iter->attr_size - offset); + iter->data_offset = 
iter->attr_size - offset; + offset = 0; + } else { + iter->attr = &mad->data[iter->data_offset]; + iter->data_offset += iter->attr_size; + } + + iter->data_left -= iter->attr_offset; + goto out; + } + iter->data_offset = 0; + iter->recv_buf = list_entry(iter->recv_buf->list.next, + struct ib_mad_recv_buf, list); + } + iter->attr = NULL; +out: + return iter->attr; +} + +/* + * Copy path records from a received response and insert them into our cache. + * A path record in the MADs are in network order, packed, and may + * span multiple MAD buffers, just to make our life hard. + */ +static void update_path_db(struct sa_db_port *port, + struct ib_mad_recv_wc *mad_recv_wc, + enum sa_update_type type) +{ + struct ib_sa_mad_iter *iter; + struct ib_path_rec_info *path_info; + void *attr; + int ret; + + iter = ib_sa_iter_create(mad_recv_wc); + if (IS_ERR(iter)) + return; + + port->update_id += (type == SA_UPDATE_FULL); + + while ((attr = ib_sa_iter_next(iter)) && + (path_info = kmalloc(sizeof *path_info, GFP_KERNEL))) { + + ib_sa_unpack_attr(&path_info->rec, attr, IB_SA_ATTR_PATH_REC); + + ret = insert_attr(&port->paths, port->update_id, + path_info->rec.dgid.raw, &path_info->iter); + if (ret) { + kfree(path_info); + break; + } + } + ib_sa_iter_free(iter); + + if (type == SA_UPDATE_FULL) + remove_old_attrs(&port->paths, port->update_id); +} + +static struct ib_mad_send_buf *get_sa_msg(struct sa_db_port *port, + struct update_info *update) +{ + struct ib_ah_attr ah_attr; + struct ib_mad_send_buf *msg; + + msg = ib_create_send_mad(port->agent, 1, 0, 0, IB_MGMT_SA_HDR, + IB_MGMT_SA_DATA, GFP_KERNEL); + if (IS_ERR(msg)) + return NULL; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = port->sm_lid; + ah_attr.sl = port->sm_sl; + ah_attr.port_num = port->port_num; + + msg->ah = ib_create_ah(port->agent->qp->pd, &ah_attr); + if (IS_ERR(msg->ah)) { + ib_free_send_mad(msg); + return NULL; + } + + msg->timeout_ms = retry_timer; + msg->retries = 0; + msg->context[0] = port; + 
msg->context[1] = update; + return msg; +} + +static __be64 form_tid(u32 hi_tid) +{ + static atomic_t tid; + return cpu_to_be64((((u64) hi_tid) << 32) | + ((u32) atomic_inc_return(&tid))); +} + +static void format_path_req(struct sa_db_port *port, + struct update_info *update, + struct ib_mad_send_buf *msg) +{ + struct ib_sa_mad *mad = msg->mad; + struct ib_sa_path_rec path_rec; + + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + mad->mad_hdr.method = IB_SA_METHOD_GET_TABLE; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + mad->mad_hdr.tid = form_tid(msg->mad_agent->hi_tid); + + mad->sa_hdr.comp_mask = IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH; + + path_rec.sgid = port->gid; + path_rec.numb_path = (u8) paths_per_dest; + + if (update->type == SA_UPDATE_ADD) { + mad->sa_hdr.comp_mask |= IB_SA_PATH_REC_DGID; + memcpy(&path_rec.dgid, &update->gid, sizeof path_rec.dgid); + } + + ib_sa_pack_attr(mad->data, &path_rec, IB_SA_ATTR_PATH_REC); +} + +static int send_query(struct sa_db_port *port, + struct update_info *update) +{ + int ret; + + port->msg = get_sa_msg(port, update); + if (!port->msg) + return -ENOMEM; + + format_path_req(port, update, port->msg); + + ret = ib_post_send_mad(port->msg, NULL); + if (ret) + goto err; + + return 0; + +err: + ib_destroy_ah(port->msg->ah); + ib_free_send_mad(port->msg); + return ret; +} + +static void add_update(struct sa_db_port *port, u8 *gid, + enum sa_update_type type) +{ + struct update_info *update; + + update = kmalloc(sizeof *update, GFP_KERNEL); + if (update) { + if (gid) + memcpy(&update->gid, gid, sizeof update->gid); + update->type = type; + list_add(&update->list, &port->update_list); + } + + if (port->state == SA_DB_IDLE) { + port->state = SA_DB_REFRESH; + process_updates(port); + } +} + +static void clean_update_list(struct sa_db_port *port) +{ + struct update_info *update; + + while 
(!list_empty(&port->update_list)) { + update = list_entry(port->update_list.next, + struct update_info, list); + list_del(&update->list); + kfree(update); + } +} + +static int notice_handler(int status, struct ib_inform_info *info, + struct ib_sa_notice *notice) +{ + struct sa_db_port *port = info->context; + struct ib_sa_notice_data_gid *gid_data; + struct ib_inform_info **pinfo; + enum sa_update_type type; + + if (info->trap_number == IB_SA_SM_TRAP_GID_IN_SERVICE) { + pinfo = &port->in_info; + type = SA_UPDATE_ADD; + } else { + pinfo = &port->out_info; + type = SA_UPDATE_REMOVE; + } + + mutex_lock(&lock); + if (port->state == SA_DB_DESTROY || !*pinfo) { + mutex_unlock(&lock); + return 0; + } + + if (notice) { + gid_data = (struct ib_sa_notice_data_gid *) + &notice->data_details; + add_update(port, gid_data->gid, type); + mutex_unlock(&lock); + } else if (status == -ENETRESET) { + *pinfo = NULL; + mutex_unlock(&lock); + } else { + if (status) + *pinfo = ERR_PTR(-EINVAL); + port->state = SA_DB_IDLE; + clean_update_list(port); + mutex_unlock(&lock); + queue_work(sa_wq, &port->work); + } + + return status; +} + +static int reg_in_info(struct sa_db_port *port) +{ + int ret = 0; + + port->in_info = ib_sa_register_inform_info(&sa_client, + port->dev->device, + port->port_num, + IB_SA_SM_TRAP_GID_IN_SERVICE, + GFP_KERNEL, notice_handler, + port); + if (IS_ERR(port->in_info)) + ret = PTR_ERR(port->in_info); + + return ret; +} + +static int reg_out_info(struct sa_db_port *port) +{ + int ret = 0; + + port->out_info = ib_sa_register_inform_info(&sa_client, + port->dev->device, + port->port_num, + IB_SA_SM_TRAP_GID_OUT_OF_SERVICE, + GFP_KERNEL, notice_handler, + port); + if (IS_ERR(port->out_info)) + ret = PTR_ERR(port->out_info); + + return ret; +} + +static void unsubscribe_port(struct sa_db_port *port) +{ + if (port->in_info && !IS_ERR(port->in_info)) + ib_sa_unregister_inform_info(port->in_info); + + if (port->out_info && !IS_ERR(port->out_info)) + 
ib_sa_unregister_inform_info(port->out_info); + + port->out_info = NULL; + port->in_info = NULL; + +} + +static void cleanup_port(struct sa_db_port *port) +{ + unsubscribe_port(port); + flush_workqueue(sa_wq); + + clean_update_list(port); + remove_all_attrs(&port->paths); +} + +static int update_port_info(struct sa_db_port *port) +{ + struct ib_port_attr port_attr; + int ret; + + ret = ib_query_port(port->dev->device, port->port_num, &port_attr); + if (ret) + return ret; + + if (port_attr.state != IB_PORT_ACTIVE) + return -ENODATA; + + port->sm_lid = port_attr.sm_lid; + port->sm_sl = port_attr.sm_sl; + return 0; +} + +static void process_updates(struct sa_db_port *port) +{ + struct update_info *update; + struct ib_sa_attr_list *attr_list; + int ret; + + if (!paths_per_dest || update_port_info(port)) { + cleanup_port(port); + goto out; + } + + /* Event registration is an optimization, so ignore failures. */ + if (subscribe_inform_info) { + if (!port->out_info) { + ret = reg_out_info(port); + if (!ret) + return; + } + + if (!port->in_info) { + ret = reg_in_info(port); + if (!ret) + return; + } + } else + unsubscribe_port(port); + + while (!list_empty(&port->update_list)) { + update = list_entry(port->update_list.next, + struct update_info, list); + + if (update->type == SA_UPDATE_REMOVE) { + write_lock_irq(&rwlock); + attr_list = find_attr_list(&port->paths, + update->gid.raw); + if (attr_list) + remove_attr(&port->paths, attr_list); + write_unlock_irq(&rwlock); + } else { + ret = send_query(port, update); + if (!ret) + return; + + } + list_del(&update->list); + kfree(update); + } +out: + port->state = SA_DB_IDLE; +} + +static void refresh_port_db(struct sa_db_port *port) +{ + if (port->state == SA_DB_DESTROY) + return; + + if (port->state == SA_DB_REFRESH) { + clean_update_list(port); + ib_cancel_mad(port->agent, port->msg); + } + + add_update(port, NULL, SA_UPDATE_FULL); +} + +static void refresh_dev_db(struct sa_db_device *dev) +{ + int i; + + for (i = 0; i < 
dev->port_count; i++) + refresh_port_db(&dev->port[i]); +} + +static void refresh_db(void) +{ + struct sa_db_device *dev; + + list_for_each_entry(dev, &dev_list, list) + refresh_dev_db(dev); +} + +static int do_refresh(const char *val, struct kernel_param *kp) +{ + mutex_lock(&lock); + refresh_db(); + mutex_unlock(&lock); + return 0; +} + +static int get_lookup_method(char *buf, struct kernel_param *kp) +{ + return sprintf(buf, + "%c %d round robin\n" + "%c %d random", + (lookup_method == SA_DB_LOOKUP_LEAST_USED) ? '*' : ' ', + SA_DB_LOOKUP_LEAST_USED, + (lookup_method == SA_DB_LOOKUP_RANDOM) ? '*' : ' ', + SA_DB_LOOKUP_RANDOM); +} + +static int set_lookup_method(const char *val, struct kernel_param *kp) +{ + unsigned long method; + int ret = 0; + + method = simple_strtoul(val, NULL, 0); + + switch (method) { + case SA_DB_LOOKUP_LEAST_USED: + case SA_DB_LOOKUP_RANDOM: + lookup_method = method; + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} + +static int set_paths_per_dest(const char *val, struct kernel_param *kp) +{ + int ret; + + mutex_lock(&lock); + ret = param_set_ulong(val, kp); + if (ret) + goto out; + + if (paths_per_dest > SA_DB_MAX_PATHS_PER_DEST) + paths_per_dest = SA_DB_MAX_PATHS_PER_DEST; + refresh_db(); +out: + mutex_unlock(&lock); + return ret; +} + +static int set_subscribe_inform_info(const char *val, struct kernel_param *kp) +{ + int ret; + + ret = param_set_bool(val, kp); + if (ret) + return ret; + + return do_refresh(val, kp); +} + +static void port_work_handler(struct work_struct *work) +{ + struct sa_db_port *port; + + port = container_of(work, typeof(*port), work); + mutex_lock(&lock); + refresh_port_db(port); + mutex_unlock(&lock); +} + +static void handle_event(struct ib_event_handler *event_handler, + struct ib_event *event) +{ + struct sa_db_device *dev; + struct sa_db_port *port; + + dev = container_of(event_handler, typeof(*dev), event_handler); + port = &dev->port[event->element.port_num - dev->start_port]; + + 
switch (event->event) { + case IB_EVENT_PORT_ERR: + case IB_EVENT_LID_CHANGE: + case IB_EVENT_SM_CHANGE: + case IB_EVENT_CLIENT_REREGISTER: + case IB_EVENT_PKEY_CHANGE: + case IB_EVENT_PORT_ACTIVE: + queue_work(sa_wq, &port->work); + break; + default: + break; + } +} + +static void ib_free_path_iter(struct ib_sa_attr_iter *iter) +{ + read_unlock_irqrestore(&rwlock, iter->flags); +} + +static int ib_create_path_iter(struct ib_device *device, u8 port_num, + union ib_gid *dgid, struct ib_sa_attr_iter *iter) +{ + struct sa_db_device *dev; + struct sa_db_port *port; + struct ib_sa_attr_list *list; + + dev = ib_get_client_data(device, &sa_db_client); + if (!dev) + return -ENODEV; + + port = &dev->port[port_num - dev->start_port]; + + read_lock_irqsave(&rwlock, iter->flags); + list = find_attr_list(&port->paths, dgid->raw); + if (!list) { + ib_free_path_iter(iter); + return -ENODATA; + } + + iter->iter = &list->iter; + return 0; +} + +static struct ib_sa_path_rec *ib_get_next_path(struct ib_sa_attr_iter *iter) +{ + struct ib_path_rec_info *next_path; + + iter->iter = iter->iter->next; + if (iter->iter) { + next_path = container_of(iter->iter, struct ib_path_rec_info, iter); + return &next_path->rec; + } else + return NULL; +} + +static int cmp_rec(struct ib_sa_path_rec *src, + struct ib_sa_path_rec *dst, ib_sa_comp_mask comp_mask) +{ + /* DGID check already done */ + if (comp_mask & IB_SA_PATH_REC_SGID && + memcmp(&src->sgid, &dst->sgid, sizeof src->sgid)) + return -EINVAL; + if (comp_mask & IB_SA_PATH_REC_DLID && src->dlid != dst->dlid) + return -EINVAL; + if (comp_mask & IB_SA_PATH_REC_SLID && src->slid != dst->slid) + return -EINVAL; + if (comp_mask & IB_SA_PATH_REC_RAW_TRAFFIC && + src->raw_traffic != dst->raw_traffic) + return -EINVAL; + + if (comp_mask & IB_SA_PATH_REC_FLOW_LABEL && + src->flow_label != dst->flow_label) + return -EINVAL; + if (comp_mask & IB_SA_PATH_REC_HOP_LIMIT && + src->hop_limit != dst->hop_limit) + return -EINVAL; + if (comp_mask & 
IB_SA_PATH_REC_TRAFFIC_CLASS && + src->traffic_class != dst->traffic_class) + return -EINVAL; + if (comp_mask & IB_SA_PATH_REC_REVERSIBLE && + dst->reversible && !src->reversible) + return -EINVAL; + /* Numb path check already done */ + if (comp_mask & IB_SA_PATH_REC_PKEY && src->pkey != dst->pkey) + return -EINVAL; + + if (comp_mask & IB_SA_PATH_REC_SL && src->sl != dst->sl) + return -EINVAL; + + if (ib_sa_check_selector(comp_mask, IB_SA_PATH_REC_MTU_SELECTOR, + IB_SA_PATH_REC_MTU, dst->mtu_selector, + src->mtu, dst->mtu)) + return -EINVAL; + if (ib_sa_check_selector(comp_mask, IB_SA_PATH_REC_RATE_SELECTOR, + IB_SA_PATH_REC_RATE, dst->rate_selector, + src->rate, dst->rate)) + return -EINVAL; + if (ib_sa_check_selector(comp_mask, + IB_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR, + IB_SA_PATH_REC_PACKET_LIFE_TIME, + dst->packet_life_time_selector, + src->packet_life_time, dst->packet_life_time)) + return -EINVAL; + + return 0; +} + +static struct ib_sa_path_rec *get_random_path(struct ib_sa_attr_iter *iter, + struct ib_sa_path_rec *req_path, + ib_sa_comp_mask comp_mask) +{ + struct ib_sa_path_rec *path, *rand_path = NULL; + int num, count = 0; + + for (path = ib_get_next_path(iter); path; + path = ib_get_next_path(iter)) { + if (!cmp_rec(path, req_path, comp_mask)) { + get_random_bytes(&num, sizeof num); + if ((num % ++count) == 0) + rand_path = path; + } + } + + return rand_path; +} + +static struct ib_sa_path_rec *get_next_path(struct ib_sa_attr_iter *iter, + struct ib_sa_path_rec *req_path, + ib_sa_comp_mask comp_mask) +{ + struct ib_path_rec_info *cur_path, *next_path = NULL; + struct ib_sa_path_rec *path; + unsigned long lookups = ~0; + + for (path = ib_get_next_path(iter); path; + path = ib_get_next_path(iter)) { + if (!cmp_rec(path, req_path, comp_mask)) { + + cur_path = container_of(iter->iter, struct ib_path_rec_info, + iter); + if (cur_path->lookups < lookups) { + lookups = cur_path->lookups; + next_path = cur_path; + } + } + } + + if (next_path) { + 
next_path->lookups++; + return &next_path->rec; + } else + return NULL; +} + +static void report_path(struct work_struct *work) +{ + struct sa_path_request *req; + + req = container_of(work, struct sa_path_request, work); + req->callback(0, &req->path_rec, req->context); + ib_sa_client_put(req->client); + kfree(req); +} + +/** + * ib_sa_path_rec_get - Start a Path get query + * @client:SA client + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Path Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send a Path Record Get query to the SA to look up a path. The + * callback function will be called when the query completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_path_rec_get() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query. 
+ */ +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct sa_path_request *req; + struct ib_sa_attr_iter iter; + struct ib_sa_path_rec *path_rec; + int ret; + + if (!paths_per_dest) + goto query_sa; + + if (!(comp_mask & IB_SA_PATH_REC_DGID) || + !(comp_mask & IB_SA_PATH_REC_NUMB_PATH) || rec->numb_path != 1) + goto query_sa; + + req = kmalloc(sizeof *req, gfp_mask); + if (!req) + goto query_sa; + + ret = ib_create_path_iter(device, port_num, &rec->dgid, &iter); + if (ret) + goto free_req; + + if (lookup_method == SA_DB_LOOKUP_RANDOM) + path_rec = get_random_path(&iter, rec, comp_mask); + else + path_rec = get_next_path(&iter, rec, comp_mask); + + if (!path_rec) + goto free_iter; + + memcpy(&req->path_rec, path_rec, sizeof *path_rec); + ib_free_path_iter(&iter); + + INIT_WORK(&req->work, report_path); + req->client = client; + req->callback = callback; + req->context = context; + + ib_sa_client_get(client); + queue_work(sa_wq, &req->work); + *sa_query = ERR_PTR(-EEXIST); + return 0; + +free_iter: + ib_free_path_iter(&iter); +free_req: + kfree(req); +query_sa: + return ib_sa_path_rec_query(client, device, port_num, rec, comp_mask, + timeout_ms, gfp_mask, callback, context, + sa_query); +} +EXPORT_SYMBOL(ib_sa_path_rec_get); + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct sa_db_port *port; + struct update_info *update; + struct ib_mad_send_buf *msg; + enum sa_update_type type; + + msg = (struct ib_mad_send_buf *) (unsigned long) mad_recv_wc->wc->wr_id; + port = msg->context[0]; + update = msg->context[1]; + + mutex_lock(&lock); + if (port->state == SA_DB_DESTROY || + update != list_entry(port->update_list.next, + struct update_info, 
list)) { + mutex_unlock(&lock); + } else { + type = update->type; + mutex_unlock(&lock); + update_path_db(mad_agent->context, mad_recv_wc, type); + } + + ib_free_recv_mad(mad_recv_wc); +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_mad_send_buf *msg; + struct sa_db_port *port; + struct update_info *update; + int ret; + + msg = mad_send_wc->send_buf; + port = msg->context[0]; + update = msg->context[1]; + + mutex_lock(&lock); + if (port->state == SA_DB_DESTROY) + goto unlock; + + if (update == list_entry(port->update_list.next, + struct update_info, list)) { + + if (mad_send_wc->status == IB_WC_RESP_TIMEOUT_ERR && + msg->timeout_ms < SA_DB_MAX_RETRY_TIMER) { + + msg->timeout_ms <<= 1; + ret = ib_post_send_mad(msg, NULL); + if (!ret) { + mutex_unlock(&lock); + return; + } + } + list_del(&update->list); + kfree(update); + } + process_updates(port); +unlock: + mutex_unlock(&lock); + + ib_destroy_ah(msg->ah); + ib_free_send_mad(msg); +} + +static int init_port(struct sa_db_device *dev, int port_num) +{ + struct sa_db_port *port; + int ret; + + port = &dev->port[port_num - dev->start_port]; + port->dev = dev; + port->port_num = port_num; + INIT_WORK(&port->work, port_work_handler); + port->paths = RB_ROOT; + INIT_LIST_HEAD(&port->update_list); + + ret = ib_get_cached_gid(dev->device, port_num, 0, &port->gid); + if (ret) + return ret; + + port->agent = ib_register_mad_agent(dev->device, port_num, IB_QPT_GSI, + NULL, IB_MGMT_RMPP_VERSION, + send_handler, recv_handler, port); + if (IS_ERR(port->agent)) + ret = PTR_ERR(port->agent); + + return ret; +} + +static void destroy_port(struct sa_db_port *port) +{ + mutex_lock(&lock); + port->state = SA_DB_DESTROY; + mutex_unlock(&lock); + + ib_unregister_mad_agent(port->agent); + cleanup_port(port); +} + +static void sa_db_add_dev(struct ib_device *device) +{ + struct sa_db_device *dev; + struct sa_db_port *port; + int s, e, i, ret; + + if 
(rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) { + s = e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + dev = kzalloc(sizeof *dev + (e - s + 1) * sizeof *port, GFP_KERNEL); + if (!dev) + return; + + dev->start_port = s; + dev->port_count = e - s + 1; + dev->device = device; + for (i = 0; i < dev->port_count; i++) { + ret = init_port(dev, s + i); + if (ret) + goto err; + } + + ib_set_client_data(device, &sa_db_client, dev); + + INIT_IB_EVENT_HANDLER(&dev->event_handler, device, handle_event); + + mutex_lock(&lock); + list_add_tail(&dev->list, &dev_list); + refresh_dev_db(dev); + mutex_unlock(&lock); + + ib_register_event_handler(&dev->event_handler); + return; +err: + while (i--) + destroy_port(&dev->port[i]); + kfree(dev); +} + +static void sa_db_remove_dev(struct ib_device *device) +{ + struct sa_db_device *dev; + int i; + + dev = ib_get_client_data(device, &sa_db_client); + if (!dev) + return; + + ib_unregister_event_handler(&dev->event_handler); + flush_workqueue(sa_wq); + + for (i = 0; i < dev->port_count; i++) + destroy_port(&dev->port[i]); + + mutex_lock(&lock); + list_del(&dev->list); + mutex_unlock(&lock); + + kfree(dev); +} + +int sa_db_init(void) +{ + int ret; + + rwlock_init(&rwlock); + sa_wq = create_singlethread_workqueue("local_sa"); + if (!sa_wq) + return -ENOMEM; + + ib_sa_register_client(&sa_client); + ret = ib_register_client(&sa_db_client); + if (ret) + goto err; + + return 0; + +err: + ib_sa_unregister_client(&sa_client); + destroy_workqueue(sa_wq); + return ret; +} + +void sa_db_cleanup(void) +{ + ib_unregister_client(&sa_db_client); + ib_sa_unregister_client(&sa_client); + destroy_workqueue(sa_wq); +} diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c index 1e13ab4..f49eb75 100644 --- a/drivers/infiniband/core/multicast.c +++ b/drivers/infiniband/core/multicast.c @@ -238,34 +238,6 @@ static u8 get_leave_state(struct 
mcast_group *group) return leave_state & group->rec.join_state; } -static int check_selector(ib_sa_comp_mask comp_mask, - ib_sa_comp_mask selector_mask, - ib_sa_comp_mask value_mask, - u8 selector, u8 src_value, u8 dst_value) -{ - int err; - - if (!(comp_mask & selector_mask) || !(comp_mask & value_mask)) - return 0; - - switch (selector) { - case IB_SA_GT: - err = (src_value <= dst_value); - break; - case IB_SA_LT: - err = (src_value >= dst_value); - break; - case IB_SA_EQ: - err = (src_value != dst_value); - break; - default: - err = 0; - break; - } - - return err; -} - static int cmp_rec(struct ib_sa_mcmember_rec *src, struct ib_sa_mcmember_rec *dst, ib_sa_comp_mask comp_mask) { @@ -278,24 +250,24 @@ static int cmp_rec(struct ib_sa_mcmember_rec *src, return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_MLID && src->mlid != dst->mlid) return -EINVAL; - if (check_selector(comp_mask, IB_SA_MCMEMBER_REC_MTU_SELECTOR, - IB_SA_MCMEMBER_REC_MTU, dst->mtu_selector, - src->mtu, dst->mtu)) + if (ib_sa_check_selector(comp_mask, IB_SA_MCMEMBER_REC_MTU_SELECTOR, + IB_SA_MCMEMBER_REC_MTU, dst->mtu_selector, + src->mtu, dst->mtu)) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_TRAFFIC_CLASS && src->traffic_class != dst->traffic_class) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_PKEY && src->pkey != dst->pkey) return -EINVAL; - if (check_selector(comp_mask, IB_SA_MCMEMBER_REC_RATE_SELECTOR, - IB_SA_MCMEMBER_REC_RATE, dst->rate_selector, - src->rate, dst->rate)) + if (ib_sa_check_selector(comp_mask, IB_SA_MCMEMBER_REC_RATE_SELECTOR, + IB_SA_MCMEMBER_REC_RATE, dst->rate_selector, + src->rate, dst->rate)) return -EINVAL; - if (check_selector(comp_mask, - IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR, - IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME, - dst->packet_life_time_selector, - src->packet_life_time, dst->packet_life_time)) + if (ib_sa_check_selector(comp_mask, + IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR, + IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME, + 
dst->packet_life_time_selector, + src->packet_life_time, dst->packet_life_time)) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_SL && src->sl != dst->sl) return -EINVAL; diff --git a/drivers/infiniband/core/sa.h b/drivers/infiniband/core/sa.h index b8eac66..0f19dde 100644 --- a/drivers/infiniband/core/sa.h +++ b/drivers/infiniband/core/sa.h @@ -48,6 +48,29 @@ static inline void ib_sa_client_put(struct ib_sa_client *client) complete(&client->comp); } +int ib_sa_check_selector(ib_sa_comp_mask comp_mask, + ib_sa_comp_mask selector_mask, + ib_sa_comp_mask value_mask, + u8 selector, u8 src_value, u8 dst_value); + +int ib_sa_pack_attr(void *dst, void *src, int attr_id); + +int ib_sa_unpack_attr(void *dst, void *src, int attr_id); + +int ib_sa_path_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query); + +int sa_db_init(void); +void sa_db_cleanup(void); + int ib_sa_mcmember_rec_query(struct ib_sa_client *client, struct ib_device *device, u8 port_num, u8 method, diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 369fe60..cb7a503 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -464,6 +464,58 @@ static const struct ib_field notice_table[] = { .size_bits = 128 }, }; +int ib_sa_check_selector(ib_sa_comp_mask comp_mask, + ib_sa_comp_mask selector_mask, + ib_sa_comp_mask value_mask, + u8 selector, u8 src_value, u8 dst_value) +{ + int err; + + if (!(comp_mask & selector_mask) || !(comp_mask & value_mask)) + return 0; + + switch (selector) { + case IB_SA_GT: + err = (src_value <= dst_value); + break; + case IB_SA_LT: + err = (src_value >= dst_value); + break; + case IB_SA_EQ: + err = (src_value != dst_value); + break; + default: + err = 0; + break; + 
} + + return err; +} + +int ib_sa_pack_attr(void *dst, void *src, int attr_id) +{ + switch (attr_id) { + case IB_SA_ATTR_PATH_REC: + ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst); + break; + default: + return -EINVAL; + } + return 0; +} + +int ib_sa_unpack_attr(void *dst, void *src, int attr_id) +{ + switch (attr_id) { + case IB_SA_ATTR_PATH_REC: + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst); + break; + default: + return -EINVAL; + } + return 0; +} + static void free_sm_ah(struct kref *kref) { struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); @@ -706,41 +758,16 @@ static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); } -/** - * ib_sa_path_rec_get - Start a Path get query - * @client:SA client - * @device:device to send query on - * @port_num: port number to send query on - * @rec:Path Record to send in query - * @comp_mask:component mask to send in query - * @timeout_ms:time to wait for response - * @gfp_mask:GFP mask to use for internal allocations - * @callback:function called when query completes, times out or is - * canceled - * @context:opaque user context passed to callback - * @sa_query:query context, used to cancel query - * - * Send a Path Record Get query to the SA to look up a path. The - * callback function will be called when the query completes (or - * fails); status is 0 for a successful response, -EINTR if the query - * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error - * occurred sending the query. The resp parameter of the callback is - * only valid if status is 0. - * - * If the return value of ib_sa_path_rec_get() is negative, it is an - * error code. Otherwise it is a query ID that can be used to cancel - * the query. 
- */ -int ib_sa_path_rec_get(struct ib_sa_client *client, - struct ib_device *device, u8 port_num, - struct ib_sa_path_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_path_rec *resp, - void *context), - void *context, - struct ib_sa_query **sa_query) +int ib_sa_path_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) { struct ib_sa_path_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); @@ -801,7 +828,6 @@ err1: kfree(query); return ret; } -EXPORT_SYMBOL(ib_sa_path_rec_get); static void ib_sa_service_rec_callback(struct ib_sa_query *sa_query, int status, @@ -1383,7 +1409,15 @@ static int __init ib_sa_init(void) goto err3; } + ret = sa_db_init(); + if (ret) { + printk(KERN_ERR "Couldn't initialize local SA\n"); + goto err4; + } + return 0; +err4: + notice_cleanup(); err3: mcast_cleanup(); err2: @@ -1394,6 +1428,7 @@ err1: static void __exit ib_sa_cleanup(void) { + sa_db_cleanup(); mcast_cleanup(); notice_cleanup(); ib_unregister_client(&sa_client); From rdreier at cisco.com Wed May 30 13:48:29 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 30 May 2007 13:48:29 -0700 Subject: [ofa-general] [PATCH] drivers/infiniband: fix comparsion between unsigned and negative In-Reply-To: <20070530080518.GA29195@nostromo.devel.redhat.com> (Bill Nottingham's message of "Wed, 30 May 2007 04:05:18 -0400") References: <20070530080518.GA29195@nostromo.devel.redhat.com> Message-ID: I just went through this patch, and all the changes are of the form of removing the < 0 test from code like if (x < 0 || x > MAX) return -ERROR; which Linus said we don't change in the email that Satyam just pointed out. So I'll drop this patch. - R. 
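For illustration only, and not part of the dropped patch: a minimal C sketch of why the "< 0" half of a range check like the one quoted above is redundant, though harmless, when x is unsigned. A negative value converted to an unsigned type wraps to a very large one, so the upper-bound test alone already rejects it. The names MAX_INDEX, check_signed(), check_unsigned() and forms_agree() are hypothetical and do not come from the drivers/infiniband code.

```c
#include <assert.h>
#include <errno.h>

#define MAX_INDEX 15	/* hypothetical upper bound, for illustration only */

/* Signed argument: both halves of the range check do real work. */
int check_signed(int x)
{
	if (x < 0 || x > MAX_INDEX)
		return -EINVAL;
	return 0;
}

/*
 * Unsigned argument: "x < 0" could never be true here, so only the
 * upper bound is tested.  A value that was negative before conversion
 * arrives as a huge unsigned number and is rejected by "x > MAX_INDEX".
 */
int check_unsigned(unsigned int x)
{
	if (x > MAX_INDEX)
		return -EINVAL;
	return 0;
}

/* Returns 1 when both forms give the same verdict for the same input. */
int forms_agree(int v)
{
	return check_signed(v) == check_unsigned((unsigned int) v);
}

/* Small self-check over a few representative inputs. */
void self_test(void)
{
	assert(forms_agree(-1));
	assert(forms_agree(0));
	assert(forms_agree(MAX_INDEX));
	assert(forms_agree(MAX_INDEX + 1));
}
```

Since the two forms accept and reject the same inputs, keeping the explicit "< 0" test in existing code costs nothing at run time and documents the intended range, which is presumably why such checks are left alone.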
From rdreier at cisco.com Wed May 30 13:56:03 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 30 May 2007 13:56:03 -0700 Subject: [ofa-general] Re: wmb missing in libmthca? In-Reply-To: <20070524114711.GB4585@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 24 May 2007 14:47:11 +0300") References: <20070524114711.GB4585@mellanox.co.il> Message-ID: > Roland, I see this in kernel: > > ((struct mthca_next_seg *) prev_wqe)->nda_op = > cpu_to_be32((ind << qp->rq.wqe_shift) | 1); > wmb(); > ((struct mthca_next_seg *) prev_wqe)->ee_nds = > cpu_to_be32(MTHCA_NEXT_DBD | size); > > but userspace does not have wmb here. > Is it needed? It does seem that way -- otherwise the hardware might read prev_wqe and see the ee_nds field as set before the nda_op field has the right value. Does this look right to you as a libmthca fix? diff --git a/src/qp.c b/src/qp.c index 2d03d49..2ea9dc0 100644 --- a/src/qp.c +++ b/src/qp.c @@ -292,7 +292,10 @@ int mthca_tavor_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, htonl(((ind << qp->sq.wqe_shift) + qp->send_wqe_offset) | mthca_opcode[wr->opcode]); - + /* + * Make sure that nda_op is written before setting ee_nds. + */ + wmb(); ((struct mthca_next_seg *) prev_wqe)->ee_nds = htonl((size0 ? 0 : MTHCA_NEXT_DBD) | size | ((wr->send_flags & IBV_SEND_FENCE) ? From sashak at voltaire.com Wed May 30 15:01:27 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 31 May 2007 01:01:27 +0300 Subject: [ofa-general] [PATCH] opensm: osm_node_get_physp_ptr() usage fixes Message-ID: <20070530220127.GP13193@sashak.voltaire.com> The function osm_node_get_physp_ptr() cannot return NULL, but it can return a pointer to a non-initialized object. This patch fixes the cases where the resulting pointer was not verified properly. 
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_link_mgr.c | 2 +- opensm/opensm/osm_mcast_mgr.c | 1 - opensm/opensm/osm_node.c | 28 ++++++--------------- opensm/opensm/osm_node_info_rcv.c | 41 +++++++++++-------------------- opensm/opensm/osm_pkey_rcv.c | 2 - opensm/opensm/osm_port.c | 5 +-- opensm/opensm/osm_port_info_rcv.c | 7 ++--- opensm/opensm/osm_qos.c | 2 +- opensm/opensm/osm_sa_link_record.c | 19 +++++++------- opensm/opensm/osm_sa_pkey_record.c | 5 +--- opensm/opensm/osm_sa_portinfo_record.c | 5 +--- opensm/opensm/osm_sa_slvl_record.c | 6 ---- opensm/opensm/osm_sa_vlarb_record.c | 5 +--- opensm/opensm/osm_state_mgr.c | 8 +---- opensm/opensm/osm_switch.c | 5 ++- opensm/opensm/osm_trap_rcv.c | 5 +++- opensm/opensm/osm_ucast_lash.c | 12 ++++---- 17 files changed, 58 insertions(+), 100 deletions(-) diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c index a38d179..73bebce 100644 --- a/opensm/opensm/osm_link_mgr.c +++ b/opensm/opensm/osm_link_mgr.c @@ -435,7 +435,7 @@ __osm_link_mgr_process_port( specified state. 
*/ p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)i ); - if( p_physp && osm_physp_is_valid( p_physp ) ) + if( osm_physp_is_valid( p_physp ) ) { current_state = osm_physp_get_port_state( p_physp ); diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c index da787b4..2ecb34e 100644 --- a/opensm/opensm/osm_mcast_mgr.c +++ b/opensm/opensm/osm_mcast_mgr.c @@ -818,7 +818,6 @@ __osm_mcast_mgr_branch( CL_ASSERT( p_remote_node->sw ); p_physp = osm_node_get_physp_ptr( p_node, i ); - CL_ASSERT( p_physp ); CL_ASSERT( osm_physp_is_valid( p_physp ) ); p_remote_physp = osm_physp_get_remote( p_physp ); diff --git a/opensm/opensm/osm_node.c b/opensm/opensm/osm_node.c index cd4ccfa..8d2c3f5 100644 --- a/opensm/opensm/osm_node.c +++ b/opensm/opensm/osm_node.c @@ -61,7 +61,6 @@ osm_node_init_physp( IN osm_node_t* const p_node, IN const osm_madw_t* const p_madw ) { - osm_physp_t *p_physp; ib_net64_t port_guid; ib_smp_t *p_smp; ib_node_info_t *p_ni; @@ -80,9 +79,8 @@ osm_node_init_physp( CL_ASSERT( port_num < p_node->physp_tbl_size ); - p_physp = osm_node_get_physp_ptr( p_node, port_num ); - - osm_physp_init( p_physp, port_guid, port_num, p_node, + osm_physp_init( &p_node->physp_table[port_num], + port_guid, port_num, p_node, osm_madw_get_bind_handle( p_madw ), p_smp->hop_count, p_smp->initial_path ); } @@ -133,7 +131,7 @@ osm_node_new( Get(NodeInfo). 
*/ for( i = 0; i < p_node->physp_tbl_size; i++ ) - osm_physp_construct( osm_node_get_physp_ptr( p_node, i ) ); + osm_physp_construct( &p_node->physp_table[i] ); osm_node_init_physp( p_node, p_madw ); } @@ -147,18 +145,13 @@ static void osm_node_destroy( IN osm_node_t *p_node ) { - osm_physp_t *p_physp; uint16_t i; /* Cleanup all physports */ for( i = 0; i < p_node->physp_tbl_size; i++ ) - { - p_physp = osm_node_get_physp_ptr( p_node, i ); - if (p_physp) - osm_physp_destroy( p_physp ); - } + osm_physp_destroy( &p_node->physp_table[i] ); } /********************************************************************** @@ -189,8 +182,7 @@ osm_node_link( CL_ASSERT( remote_port_num < p_remote_node->physp_tbl_size ); p_physp = osm_node_get_physp_ptr( p_node, port_num ); - p_remote_physp = osm_node_get_physp_ptr( p_remote_node, - remote_port_num ); + p_remote_physp = osm_node_get_physp_ptr( p_remote_node, remote_port_num ); if (p_physp->p_remote_physp) p_physp->p_remote_physp->p_remote_physp = NULL; @@ -220,8 +212,7 @@ osm_node_unlink( { p_physp = osm_node_get_physp_ptr( p_node, port_num ); - p_remote_physp = osm_node_get_physp_ptr( p_remote_node, - remote_port_num ); + p_remote_physp = osm_node_get_physp_ptr( p_remote_node, remote_port_num ); osm_physp_unlink( p_physp, p_remote_physp ); } @@ -243,8 +234,7 @@ osm_node_link_exists( CL_ASSERT( remote_port_num < p_remote_node->physp_tbl_size ); p_physp = osm_node_get_physp_ptr( p_node, port_num ); - p_remote_physp = osm_node_get_physp_ptr( p_remote_node, - remote_port_num ); + p_remote_physp = osm_node_get_physp_ptr( p_remote_node, remote_port_num ); return( osm_physp_link_exists( p_physp, p_remote_physp ) ); } @@ -265,8 +255,7 @@ osm_node_link_has_valid_ports( CL_ASSERT( remote_port_num < p_remote_node->physp_tbl_size ); p_physp = osm_node_get_physp_ptr( p_node, port_num ); - p_remote_physp = osm_node_get_physp_ptr( p_remote_node, - remote_port_num ); + p_remote_physp = osm_node_get_physp_ptr( p_remote_node, remote_port_num ); 
return( osm_physp_is_valid( p_physp ) && osm_physp_is_valid( p_remote_physp ) ); @@ -329,4 +318,3 @@ osm_node_get_remote_base_lid( return( 0 ); } - diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c index 2c79056..2486ffb 100644 --- a/opensm/opensm/osm_node_info_rcv.c +++ b/opensm/opensm/osm_node_info_rcv.c @@ -144,17 +144,14 @@ __osm_ni_rcv_set_links( p_physp = osm_node_get_physp_ptr( p_node, port_num ); sprintf( dr_new_path, "no_path_available" ); - if (p_physp) + p_path = osm_physp_get_dr_path_ptr( p_physp ); + if ( p_path ) { - p_path = osm_physp_get_dr_path_ptr( p_physp ); - if ( p_path ) + sprintf( dr_new_path, "new path:" ); + for (i = 0; i <= p_path->hop_count; i++ ) { - sprintf( dr_new_path, "new path:" ); - for (i = 0; i <= p_path->hop_count; i++ ) - { - sprintf( line, "[%X]", p_path->path[i] ); - strcat( dr_new_path, line ); - } + sprintf( line, "[%X]", p_path->path[i] ); + strcat( dr_new_path, line ); } } @@ -164,17 +161,14 @@ __osm_ni_rcv_set_links( p_old_neighbor_node, old_neighbor_port_num); sprintf( dr_old_path, "no_path_available" ); - if (p_old_physp) + p_old_path = osm_physp_get_dr_path_ptr( p_old_physp ); + if ( p_old_path ) { - p_old_path = osm_physp_get_dr_path_ptr( p_old_physp ); - if ( p_old_path ) + sprintf( dr_old_path, "old_path:" ); + for (i = 0; i <= p_old_path->hop_count; i++ ) { - sprintf( dr_old_path, "old_path:" ); - for (i = 0; i <= p_old_path->hop_count; i++ ) - { - sprintf( line, "[%X]", p_old_path->path[i] ); - strcat( dr_old_path, line ); - } + sprintf( line, "[%X]", p_old_path->path[i] ); + strcat( dr_old_path, line ); } } @@ -226,10 +220,9 @@ __osm_ni_rcv_set_links( cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num ); p_physp = osm_node_get_physp_ptr( p_node, port_num ); - if (p_physp) - osm_dump_dr_path(p_rcv->p_log, - osm_physp_get_dr_path_ptr(p_physp), - OSM_LOG_ERROR); + osm_dump_dr_path(p_rcv->p_log, + osm_physp_get_dr_path_ptr(p_physp), + OSM_LOG_ERROR); osm_log( p_rcv->p_log, 
OSM_LOG_SYS, "Errors on subnet. Duplicate GUID found " @@ -313,7 +306,6 @@ __osm_ni_rcv_process_new_node( */ p_physp = osm_node_get_physp_ptr( p_node, port_num ); - CL_ASSERT( p_physp ); CL_ASSERT( osm_physp_is_valid( p_physp ) ); CL_ASSERT( osm_madw_get_bind_handle( p_madw ) == osm_dr_path_get_bind_handle( @@ -379,7 +371,6 @@ __osm_ni_rcv_get_node_desc( */ p_physp = osm_node_get_physp_ptr( p_node, port_num ); - CL_ASSERT( p_physp ); CL_ASSERT( osm_physp_is_valid( p_physp ) ); CL_ASSERT( osm_madw_get_bind_handle( p_madw ) == osm_dr_path_get_bind_handle( @@ -539,8 +530,6 @@ __osm_ni_rcv_process_existing_ca_or_router( { p_physp = osm_node_get_physp_ptr( p_node, port_num ); - CL_ASSERT( p_physp ); - if ( !osm_physp_is_valid( p_physp ) ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/opensm/opensm/osm_pkey_rcv.c b/opensm/opensm/osm_pkey_rcv.c index 7c87d7e..67fe067 100644 --- a/opensm/opensm/osm_pkey_rcv.c +++ b/opensm/opensm/osm_pkey_rcv.c @@ -174,8 +174,6 @@ osm_pkey_rcv_process( port_num = p_physp->port_num; } - CL_ASSERT( p_physp ); - /* We do not mind if this is a result of a set or get - all we want is to update the subnet. diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c index eab86e1..9e86ca5 100644 --- a/opensm/opensm/osm_port.c +++ b/opensm/opensm/osm_port.c @@ -589,7 +589,7 @@ __osm_physp_get_dr_physp_set( p_path->path[hop]); /* make sure we got a valid port and it has a remote port */ - if (!(p_physp && osm_physp_is_valid( p_physp ))) + if (!osm_physp_is_valid( p_physp )) { osm_log( p_log, OSM_LOG_ERROR, "__osm_physp_get_dr_nodes_set: ERR 4104: " @@ -770,8 +770,7 @@ osm_physp_replace_dr_path_with_alternate_dr_path( 4. The port is not in the physp_map 5. 
This port haven't been visited before */ - if ( p_remote_physp && - osm_physp_is_valid ( p_remote_physp ) && + if ( osm_physp_is_valid ( p_remote_physp ) && p_remote_physp != p_physp && cl_map_get( &physp_map, __osm_ptr_to_key(p_remote_physp)) == NULL && cl_map_get( &visited_map, __osm_ptr_to_key(p_remote_physp)) == NULL ) diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index 0076b00..a53044f 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -559,7 +559,7 @@ osm_pi_rcv_process_set( CL_ASSERT( p_node ); p_physp = osm_node_get_physp_ptr( p_node, port_num ); - CL_ASSERT( p_physp && osm_physp_is_valid( p_physp ) ); + CL_ASSERT( osm_physp_is_valid( p_physp ) ); port_guid = osm_physp_get_port_guid( p_physp ); @@ -744,10 +744,9 @@ osm_pi_rcv_process( } p_node = p_port->p_node; - p_physp = osm_node_get_physp_ptr( p_node, port_num ); - CL_ASSERT( p_node ); - CL_ASSERT( p_physp ); + + p_physp = osm_node_get_physp_ptr( p_node, port_num ); /* Determine if we encountered a new Physical Port. 
diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c index f426241..bbb1608 100644 --- a/opensm/opensm/osm_qos.c +++ b/opensm/opensm/osm_qos.c @@ -337,7 +337,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) num_physp = osm_node_get_num_physp(p_node); for (i = 1; i < num_physp; i++) { p_physp = osm_node_get_physp_ptr(p_node, i); - if (!p_physp || !osm_physp_is_valid(p_physp)) + if (!osm_physp_is_valid(p_physp)) continue; status = qos_physp_setup(&p_osm->log, &p_osm->sm.req, diff --git a/opensm/opensm/osm_sa_link_record.c b/opensm/opensm/osm_sa_link_record.c index 5e4e35e..81d3877 100644 --- a/opensm/opensm/osm_sa_link_record.c +++ b/opensm/opensm/osm_sa_link_record.c @@ -357,7 +357,8 @@ __osm_lr_rcv_get_port_links( p_dest_physp = osm_node_get_physp_ptr( p_dest_port->p_node, dest_port_num ); /* both physical ports should be with data */ - if (p_src_physp && p_dest_physp) + if (osm_physp_is_valid(p_src_physp) && + osm_physp_is_valid(p_dest_physp)) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, p_dest_physp, comp_mask, p_list, p_req_physp ); @@ -377,7 +378,7 @@ __osm_lr_rcv_get_port_links( if (port_num < p_src_port->p_node->physp_tbl_size) { p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num ); - if (p_src_physp) + if (osm_physp_is_valid(p_src_physp)) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, NULL, comp_mask, p_list, p_req_physp ); @@ -389,7 +390,7 @@ __osm_lr_rcv_get_port_links( for( port_num = 1; port_num < num_ports; port_num++ ) { p_src_physp = osm_node_get_physp_ptr( p_src_port->p_node, port_num ); - if (p_src_physp) + if (osm_physp_is_valid(p_src_physp)) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, NULL, comp_mask, p_list, p_req_physp ); @@ -411,9 +412,9 @@ __osm_lr_rcv_get_port_links( this couldn't be a relevant record. 
*/ if (port_num < p_dest_port->p_node->physp_tbl_size ) { - p_dest_physp = osm_node_get_physp_ptr( - p_dest_port->p_node, port_num ); - if (p_dest_physp) + p_dest_physp = osm_node_get_physp_ptr( p_dest_port->p_node, + port_num ); + if (osm_physp_is_valid(p_dest_physp)) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL, p_dest_physp, comp_mask, p_list, p_req_physp ); @@ -424,9 +425,9 @@ __osm_lr_rcv_get_port_links( num_ports = osm_node_get_num_physp( p_dest_port->p_node ); for( port_num = 1; port_num < num_ports; port_num++ ) { - p_dest_physp = osm_node_get_physp_ptr( - p_dest_port->p_node, port_num ); - if (p_dest_physp) + p_dest_physp = osm_node_get_physp_ptr( p_dest_port->p_node, + port_num ); + if (osm_physp_is_valid(p_dest_physp)) __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL, p_dest_physp, comp_mask, p_list, p_req_physp ); diff --git a/opensm/opensm/osm_sa_pkey_record.c b/opensm/opensm/osm_sa_pkey_record.c index 8a71314..49606bb 100644 --- a/opensm/opensm/osm_sa_pkey_record.c +++ b/opensm/opensm/osm_sa_pkey_record.c @@ -254,7 +254,7 @@ __osm_sa_pkey_by_comp_mask( p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); /* Check that the p_physp is valid, and that is shares a pkey with the p_req_physp. 
*/ - if( p_physp && osm_physp_is_valid( p_physp ) && + if( osm_physp_is_valid( p_physp ) && (osm_physp_share_pkey(p_rcv->p_log, p_req_physp, p_physp)) ) __osm_sa_pkey_check_physp( p_rcv, p_physp, p_ctxt ); } @@ -273,9 +273,6 @@ __osm_sa_pkey_by_comp_mask( for( port_num = 0; port_num < num_ports; port_num++ ) { p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); - if( p_physp == NULL ) - continue; - if( !osm_physp_is_valid( p_physp ) ) continue; diff --git a/opensm/opensm/osm_sa_portinfo_record.c b/opensm/opensm/osm_sa_portinfo_record.c index 74f53d6..a1f3fcb 100644 --- a/opensm/opensm/osm_sa_portinfo_record.c +++ b/opensm/opensm/osm_sa_portinfo_record.c @@ -547,7 +547,7 @@ __osm_sa_pir_by_comp_mask( p_physp = osm_node_get_physp_ptr( p_port->p_node, p_rcvd_rec->port_num ); /* Check that the p_physp is valid, and that the p_physp and the p_req_physp share a pkey. */ - if( p_physp && osm_physp_is_valid( p_physp ) && + if( osm_physp_is_valid( p_physp ) && osm_physp_share_pkey(p_rcv->p_log, p_req_physp, p_physp)) __osm_sa_pir_check_physp( p_rcv, p_physp, p_ctxt ); } @@ -557,9 +557,6 @@ __osm_sa_pir_by_comp_mask( for( port_num = 0; port_num < num_ports; port_num++ ) { p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); - if( p_physp == NULL ) - continue; - if( !osm_physp_is_valid( p_physp ) ) continue; diff --git a/opensm/opensm/osm_sa_slvl_record.c b/opensm/opensm/osm_sa_slvl_record.c index e40ad61..010f23e 100644 --- a/opensm/opensm/osm_sa_slvl_record.c +++ b/opensm/opensm/osm_sa_slvl_record.c @@ -244,9 +244,6 @@ __osm_sa_slvl_by_comp_mask( for( out_port_num = out_port_start; out_port_num <= out_port_end; out_port_num++ ) { p_out_physp = osm_node_get_physp_ptr( p_port->p_node, out_port_num ); - if( p_out_physp == NULL ) - continue; - if( !osm_physp_is_valid( p_out_physp ) ) continue; @@ -257,9 +254,6 @@ __osm_sa_slvl_by_comp_mask( #endif p_in_physp = osm_node_get_physp_ptr( p_port->p_node, in_port_num ); - if( p_in_physp == NULL ) - continue; - if( 
!osm_physp_is_valid( p_in_physp ) ) continue; diff --git a/opensm/opensm/osm_sa_vlarb_record.c b/opensm/opensm/osm_sa_vlarb_record.c index a462ee9..8f60d8d 100644 --- a/opensm/opensm/osm_sa_vlarb_record.c +++ b/opensm/opensm/osm_sa_vlarb_record.c @@ -258,7 +258,7 @@ __osm_sa_vl_arb_by_comp_mask( p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); /* check that the p_physp is valid, and that the requester and the p_physp share a pkey. */ - if( p_physp && osm_physp_is_valid( p_physp ) && + if( osm_physp_is_valid( p_physp ) && osm_physp_share_pkey(p_rcv->p_log, p_req_physp, p_physp) ) __osm_sa_vl_arb_check_physp( p_rcv, p_physp, p_ctxt ); } @@ -277,9 +277,6 @@ __osm_sa_vl_arb_by_comp_mask( for( port_num = 0; port_num < num_ports; port_num++ ) { p_physp = osm_node_get_physp_ptr( p_port->p_node, port_num ); - if( p_physp == NULL ) - continue; - if( !osm_physp_is_valid( p_physp ) ) continue; diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index 46c1cd0..73980b8 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -925,7 +925,6 @@ __osm_state_mgr_sweep_hop_1( p_physp = osm_node_get_physp_ptr( p_node, port_num ); - CL_ASSERT( p_physp ); CL_ASSERT( osm_physp_is_valid( p_physp ) ); p_dr_path = osm_physp_get_dr_path_ptr( p_physp ); @@ -972,9 +971,6 @@ __osm_state_mgr_sweep_hop_1( { /* go through the port only if the port is not DOWN */ p_ext_physp = osm_node_get_physp_ptr( p_node, port_num ); - /* Make sure the physp object exists */ - if( !p_ext_physp ) - continue; if( ib_port_info_get_port_state( &( p_ext_physp->port_info ) ) > IB_LINK_DOWN ) { @@ -1119,7 +1115,7 @@ __osm_topology_file_create( p_physp = osm_node_get_physp_ptr( p_node, cPort ); - if( ( p_physp == NULL ) || ( !osm_physp_is_valid( p_physp ) ) ) + if( !osm_physp_is_valid( p_physp ) ) continue; p_rphysp = p_physp->p_remote_physp; @@ -1288,7 +1284,7 @@ __osm_state_mgr_report( for( port_num = start_port; port_num < num_ports; port_num++ ) { p_physp 
= osm_node_get_physp_ptr( p_node, port_num ); - if( ( p_physp == NULL ) || ( !osm_physp_is_valid( p_physp ) ) ) + if( !osm_physp_is_valid( p_physp ) ) continue; osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, "%s : %s : %02X :", diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c index a79f5cd..2a8d1c2 100644 --- a/opensm/opensm/osm_switch.c +++ b/opensm/opensm/osm_switch.c @@ -321,14 +321,15 @@ osm_switch_recommend_path( if (port_num != OSM_NO_PATH) { + CL_ASSERT(port_num < num_ports); + p_physp = osm_node_get_physp_ptr(p_sw->p_node, port_num); /* Don't be too trusting of the current forwarding table! Verify that the port number is legal and that the LID is reachable through this port. */ - if( (port_num < num_ports ) && - osm_physp_is_valid(p_physp) && + if( osm_physp_is_valid(p_physp) && osm_physp_is_healthy(p_physp) && osm_physp_get_remote(p_physp) ) { diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c index 309cdd5..c0cab76 100644 --- a/opensm/opensm/osm_trap_rcv.c +++ b/opensm/opensm/osm_trap_rcv.c @@ -100,6 +100,7 @@ __get_physp_by_lid_and_num( { cl_ptr_vector_t *p_vec = &(p_rcv->p_subn->port_lid_tbl); osm_port_t *p_port; + osm_physp_t *p_physp; if (lid > cl_ptr_vector_get_size(p_vec)) return NULL; @@ -111,7 +112,9 @@ __get_physp_by_lid_and_num( if (osm_node_get_num_physp(p_port->p_node) < num) return NULL; - return( osm_node_get_physp_ptr(p_port->p_node, num) ); + p_physp = osm_node_get_physp_ptr(p_port->p_node, num); + + return osm_physp_is_valid(p_physp) ? 
p_physp : NULL; } /********************************************************************** diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 5d32e89..04f32d5 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -162,7 +162,7 @@ static uint64_t osm_lash_get_switch_guid(IN const osm_switch_t *p_sw) uint64_t switch_guid = -1; osm_physp_t* p_physp = osm_node_get_physp_ptr(p_sw->p_node, 0); - if (p_physp && osm_physp_is_valid (p_physp)) + if (osm_physp_is_valid(p_physp)) switch_guid = osm_physp_get_port_guid(p_physp); return switch_guid; @@ -215,7 +215,7 @@ static uint8_t find_port_from_lid(IN const ib_net16_t lid_no, p_current_physp = osm_node_get_physp_ptr(p_sw->p_node, i); - if (p_current_physp && osm_physp_is_valid (p_current_physp)) { + if (osm_physp_is_valid(p_current_physp)) { p_remote_physp = p_current_physp->p_remote_physp; @@ -1251,10 +1251,10 @@ static void osm_lash_process_switch(lash_t *p_lash, osm_switch_t *p_sw) p_current_physp = osm_node_get_physp_ptr(p_sw->p_node, i); - if (osm_physp_is_valid (p_current_physp)) { + if (osm_physp_is_valid(p_current_physp)) { p_remote_physp = p_current_physp->p_remote_physp; - if (p_remote_physp && osm_physp_is_valid ( p_remote_physp ) && + if (p_remote_physp && osm_physp_is_valid(p_remote_physp) && p_remote_physp->p_node->sw) { int physical_port_a_num = osm_physp_get_port_num(p_current_physp); int physical_port_b_num = osm_physp_get_port_num(p_remote_physp); @@ -1342,8 +1342,8 @@ static int discover_network_properties(lash_t *p_lash) for (i=1; ip_node, i); - if (p_current_physp && osm_physp_is_valid (p_current_physp) && - p_current_physp->p_remote_physp) { + if (osm_physp_is_valid(p_current_physp) && + p_current_physp->p_remote_physp) { ib_port_info_t *p_port_info = &p_current_physp->port_info; uint8_t port_vl_min = ib_port_info_get_op_vls(p_port_info); -- 1.5.2.160.g10a94
From halr at voltaire.com Wed May 30 20:59:32 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 30 May 2007 23:59:32 -0400 Subject: [ofa-general] Re: [PATCH] opensm: osm_node_get_physp_ptr() usage fixes In-Reply-To: <20070530220127.GP13193@sashak.voltaire.com> References: <20070530220127.GP13193@sashak.voltaire.com> Message-ID: <1180583970.7116.120376.camel@hal.voltaire.com> On Wed, 2007-05-30 at 18:01, Sasha Khapyorsky wrote: > Function osm_node_get_physp_ptr() cannot return NULL, but can return > pointer to non-initialized object. This patch fixes cases where resulted > pointer was not verified properly. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From mst at dev.mellanox.co.il Wed May 30 21:34:47 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 May 2007 07:34:47 +0300 Subject: [ofa-general] Re: wmb missing in libmthca? In-Reply-To: References: <20070524114711.GB4585@mellanox.co.il> Message-ID: <20070531043447.GA11669@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: wmb missing in libmthca?
> > > Roland, I see this in kernel: > > > > ((struct mthca_next_seg *) prev_wqe)->nda_op = > > cpu_to_be32((ind << qp->rq.wqe_shift) | 1); > > wmb(); > > ((struct mthca_next_seg *) prev_wqe)->ee_nds = > > cpu_to_be32(MTHCA_NEXT_DBD | size); > > > > but userspace does not have wmb here. > > Is it needed? > > It does seem that way -- otherwise the hardware might read prev_wqe > and see the ee_nds field as set before the nda_op field has the right > variable. Does this look right to you as a libmthca fix? Looks ok. -- MST From mst at dev.mellanox.co.il Wed May 30 21:54:00 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 May 2007 07:54:00 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local path record caching In-Reply-To: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com> References: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com> Message-ID: <20070531045400.GB11669@mellanox.co.il> > Quoting Sean Hefty : > Subject: [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local path record caching > > I've updated the local SA patches based on previous feedback. > The most significant change is to integrate the local SA with > the ib_sa module. This allows all apps to make use of the local > SA without changes. > > The use of a device file was also replaced with simple module > parameters. > > I've also pushed these changes to: > > git://git.openfabrics.org/~shefty/rdma-dev.git sa_cache > > I would like to close any open issues with this approach in time > to pull it into 2.6.23. This code seems to be significantly different from what OFED 1.2 has (and on the flip side, it does look better), but this means that it's all new code in a core component. Might it be prudent to have this sit in -mm for awhile? Another approach might be to have it disabled by default in 2.6.23 - that would make it a low-risk change, and give people time to experiment with it. 
-- MST From sean.hefty at intel.com Wed May 30 22:40:10 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 30 May 2007 22:40:10 -0700 Subject: [ofa-general] RE: [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local path recordcaching In-Reply-To: <20070531045400.GB11669@mellanox.co.il> Message-ID: <000001c7a346$28212ed0$cec9180a@amr.corp.intel.com> >This code seems to be significantly different from what >OFED 1.2 has (and on the flip side, it does look better), >but this means that it's all new code in a core component. >Might it be prudent to have this sit in -mm for awhile? >Another approach might be to have it disabled by default in 2.6.23 - >that would make it a low-risk change, and give people time to experiment >with it. I would rather disable it by default, since I'm not sure how many IB people run -mm. I go back and forth on whether the cache should be enabled by default and figured we could decide on the list. - Sean From mst at dev.mellanox.co.il Wed May 30 22:46:22 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 May 2007 08:46:22 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local path recordcaching In-Reply-To: <000001c7a346$28212ed0$cec9180a@amr.corp.intel.com> References: <20070531045400.GB11669@mellanox.co.il> <000001c7a346$28212ed0$cec9180a@amr.corp.intel.com> Message-ID: <20070531054622.GC11669@mellanox.co.il> > Quoting Sean Hefty : > Subject: RE: [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local path recordcaching > > >This code seems to be significantly different from what > >OFED 1.2 has (and on the flip side, it does look better), > >but this means that it's all new code in a core component. > >Might it be prudent to have this sit in -mm for awhile? > >Another approach might be to have it disabled by default in 2.6.23 - > >that would make it a low-risk change, and give people time to experiment > >with it. 
> > I would rather disable it by default, since I'm not sure how many IB people run > -mm. Yes, seems to be true. > I go back and forth on whether the cache should be enabled by default and > figured we could decide on the list. -- MST From mst at dev.mellanox.co.il Wed May 30 22:52:51 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 May 2007 08:52:51 +0300 Subject: [ofa-general] Re: [RFC] [PATCH 2/2] for 2.6.23: ib/sa - add local path record caching In-Reply-To: <000b01c7a2fb$6ebb05f0$3c98070a@amr.corp.intel.com> References: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com> <000b01c7a2fb$6ebb05f0$3c98070a@amr.corp.intel.com> Message-ID: <20070531055251.GD11669@mellanox.co.il> It seems that below you try to get 0x7F paths to each dest: +enum { + SA_DB_MAX_PATHS_PER_DEST = 0x7F, + SA_DB_MIN_RETRY_TIMER = 4000, /* 4 sec */ + SA_DB_MAX_RETRY_TIMER = 256000 /* 256 sec */ +}; + +static int set_paths_per_dest(const char *val, struct kernel_param *kp); +static unsigned long paths_per_dest = SA_DB_MAX_PATHS_PER_DEST; +module_param_call(paths_per_dest, set_paths_per_dest, param_get_ulong, + &paths_per_dest, 0644); +MODULE_PARM_DESC(paths_per_dest, "Maximum number of paths to retrieve " + "to each destination (DGID). Set to 0 " + "to disable cache."); But here you seem to bypass cache for multi-path queries: +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct sa_path_request *req; + struct ib_sa_attr_iter iter; + struct ib_sa_path_rec *path_rec; + int ret; + + if (!paths_per_dest) + goto query_sa; + + if (!(comp_mask & IB_SA_PATH_REC_DGID) || + !(comp_mask & IB_SA_PATH_REC_NUMB_PATH) || rec->numb_path != 1) + goto query_sa; how are multiple paths used? 
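To make the question concrete, the two checks quoted above can be reduced to a standalone predicate. This is only a minimal sketch for discussion, not code from the patch: the mask bit values `MASK_DGID` and `MASK_NUMB_PATH` and the helper name `served_from_cache` are hypothetical stand-ins for the `IB_SA_PATH_REC_*` bits and the logic inside `ib_sa_path_rec_get()`.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the IB_SA_PATH_REC_DGID and
 * IB_SA_PATH_REC_NUMB_PATH component-mask bits (values are made up). */
#define MASK_DGID      (1u << 0)
#define MASK_NUMB_PATH (1u << 1)

/* Returns 1 if a request would be answered from the local cache and
 * 0 if it would fall through to the SA ("goto query_sa" in the patch).
 * Mirrors the order of the two checks quoted from the patch. */
int served_from_cache(unsigned long paths_per_dest,
                      uint32_t comp_mask, uint8_t numb_path)
{
    /* paths_per_dest == 0 disables the cache entirely. */
    if (!paths_per_dest)
        return 0;

    /* Only single-path lookups keyed by DGID are served locally. */
    if (!(comp_mask & MASK_DGID) ||
        !(comp_mask & MASK_NUMB_PATH) || numb_path != 1)
        return 0;

    return 1;
}
```

With these stand-ins, a request with both mask bits set and numb_path = 2 goes to the SA even when paths_per_dest is 0x7F, which is the case the question is about.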
-- MST From monil at voltaire.com Thu May 31 00:35:19 2007 From: monil at voltaire.com (Moni Levy) Date: Thu, 31 May 2007 10:35:19 +0300 Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user In-Reply-To: References: <465AF3D3.10205@dev.mellanox.co.il> <20070529091246.GF8159@mellanox.co.il> Message-ID: <6a122cc00705310035o400580d9q890a5a21b40d2170@mail.gmail.com> Michael, can you please add this to OFED 1.2.RC4, Tziporet, please approve. Moni On 5/30/07, Roland Dreier wrote: > ok, I applied the patch with changes as discussed > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at dev.mellanox.co.il Thu May 31 00:39:45 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 May 2007 10:39:45 +0300 Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user In-Reply-To: <6a122cc00705310035o400580d9q890a5a21b40d2170@mail.gmail.com> References: <465AF3D3.10205@dev.mellanox.co.il> <20070529091246.GF8159@mellanox.co.il> <6a122cc00705310035o400580d9q890a5a21b40d2170@mail.gmail.com> Message-ID: <20070531073945.GB26309@mellanox.co.il> Vlad does this stuff. Quoting Moni Levy : Subject: Re: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user Michael, can you please add this to OFED 1.2.RC4, Tziporet, please approve. Moni On 5/30/07, Roland Dreier wrote: >ok, I applied the patch with changes as discussed > -- MST From mst at dev.mellanox.co.il Thu May 31 00:46:35 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 31 May 2007 10:46:35 +0300 Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user In-Reply-To: References: <465AF3D3.10205@dev.mellanox.co.il> <20070529091246.GF8159@mellanox.co.il> Message-ID: <20070531074635.GC26309@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH] suppress RLIMIT warning for root user > > ok, I applied the patch with changes as discussed Thanks. I saw you put this in master - can this go into stable branch? This warning is very annoying to people ... Alternatively, how about rolling libibverbs release so that OFED can use that? -- MST From tziporet at dev.mellanox.co.il Thu May 31 01:52:48 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 31 May 2007 11:52:48 +0300 Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user In-Reply-To: <6a122cc00705310035o400580d9q890a5a21b40d2170@mail.gmail.com> References: <465AF3D3.10205@dev.mellanox.co.il> <20070529091246.GF8159@mellanox.co.il> <6a122cc00705310035o400580d9q890a5a21b40d2170@mail.gmail.com> Message-ID: <465E8CE0.5010303@mellanox.co.il> Moni Levy wrote: > Michael, can you please add this to OFED 1.2.RC4, Tziporet, please > approve. 
> > Moni > Approved - Vlad please take this patch Thanks, Tziporet From vlad at lists.openfabrics.org Thu May 31 02:44:11 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 31 May 2007 02:44:11 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070531-0200 daily build status Message-ID: <20070531094412.25E54E6083A@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.14 Passed on powerpc with 
linux-2.6.19 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From erezz at voltaire.com Thu May 31 03:32:40 2007 From: erezz at voltaire.com (Erez Zilber) Date: Thu, 31 May 2007 13:32:40 +0300 Subject: [ofa-general] Re: [ewg] RE: [PATCH 2/2] IB/iser: add backport & kerneladdons foropen-iscsiover iSER support for RHAS4 up3 and up4 In-Reply-To: <39C75744D164D948A170E9792AF8E7CA1109F5@exil.voltaire.com> References: <20070521114410.GG20400@mellanox.co.il><46557BCB.7030102@voltaire.com><20070524115715.GC4585@mellanox.co.il><465C2D78.30100@voltaire.com><20070529141143.GD27671@mellanox.co.il><465D7046.3080109@voltaire.com> <20070530125456.GF9036@mellanox.co.il> <39C75744D164D948A170E9792AF8E7CA1109F5@exil.voltaire.com> Message-ID: <465EA448.6090907@voltaire.com> > I am doing that. However, attribute_container.c includes base.h which is in the kernel_addons dir. Since attribute_container.c is no longer there, I need to add the following line: > > -I$(PWD)/kernel_addons/backport/2.6.9_U3/include/src/ > > It is not very very ugly, so I think that we can do that. I will make the required fixes according to this approach. > Done. I removed kernel_addons_patches and I don't make any changes in the build scripts. Can you please review it and let me know if you have more comments? 
Thanks, Erez From devesh28 at gmail.com Thu May 31 03:36:58 2007 From: devesh28 at gmail.com (Devesh Sharma) Date: Thu, 31 May 2007 16:06:58 +0530 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <000801c7a2f8$55749000$3c98070a@amr.corp.intel.com> References: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com> <000801c7a2f8$55749000$3c98070a@amr.corp.intel.com> Message-ID: <309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com> On 5/31/07, Sean Hefty wrote: > >Ok, Soon I will post a patch related to this. > >How static PR file will be generated? Needs to be discussed. > > Please look at my latest changes to the local SA when generating the patches. > > git://git.openfabrics.org/~shefty/rdma-dev.git sa_cache > Do you have a pointer or doc on the design of the current SA_CACHE module? It would make things faster to understand. If not, I will need your help to understand it, though I do have a top-level view. Thanks > I'm not sure about the best way to communicate PRs to the cache. I haven't > given it more than about 2 minutes of thought, but as an idea, we could look at > trying to make use of the userspace MAD interface. For example, we could send > MADs to the local SA with the PRs to load. More details would obviously need to > be worked out, but this could provide an extensible solution. OK, you mean it is not required to create a separate device interface in the cache module as such. I think this is a good idea. Just to confirm: is /dev/umad active on every node other than the SM node?
> > - Sean > From halr at voltaire.com Thu May 31 03:42:12 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 06:42:12 -0400 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com> References: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com> <000801c7a2f8$55749000$3c98070a@amr.corp.intel.com> <309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com> Message-ID: <1180608131.7116.145947.camel@hal.voltaire.com> On Thu, 2007-05-31 at 06:36, Devesh Sharma wrote: > On 5/31/07, Sean Hefty wrote: > > >Ok, Soon I will post a patch related to this. > > >How static PR file will be generated? Needs to be discussed. > > > > Please look at my latest changes to the local SA in when generating the patches. > > > > git://git.openfabrics.org/~shefty/rdma-dev.git sa_cache > > > Do you have some pointer/doc related to the design of current SA_CACHE > module....It will make things faster to understand........if not then > I will require your support to understand the things, Though I have > some top level view. > Thanks > > I'm not sure about the best way to communicate PRs to the cache. I haven't > > given it more than about 2 minutes of thought, but as an idea, we could look at > > trying to make use of the userspace MAD interface. For example, we could send > > MADs to the local SA with the PRs to load. More details would obviously need to > > be worked out, but this could provide an extensible solution. > Ok you mean Its not required to create a separate device interface in > cache module as such. I think this is a good idea......Just for > confirmation...whether /dev/umad is active on each node other than SM > node? It's not required to be but is for things that require userspace SA client access (like RDMA CM or local cache). 
-- Hal > > - Sean > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From monil at voltaire.com Thu May 31 04:24:55 2007 From: monil at voltaire.com (Moni Levy) Date: Thu, 31 May 2007 14:24:55 +0300 Subject: [ofa-general] RE: [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local path recordcaching In-Reply-To: <000001c7a346$28212ed0$cec9180a@amr.corp.intel.com> References: <20070531045400.GB11669@mellanox.co.il> <000001c7a346$28212ed0$cec9180a@amr.corp.intel.com> Message-ID: <6a122cc00705310424j3594c944i74f8f454ba928c96@mail.gmail.com> On 5/31/07, Sean Hefty wrote: > > I go back and forth on whether the cache should be enabled by default and > figured we could decide on the list. I vote for disabling it by default in OFED also until we have an agreed solution. It's very easy to enable it after all. What do you think ? --Moni > > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >
From mst at dev.mellanox.co.il Thu May 31 05:12:39 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 May 2007 15:12:39 +0300 Subject: [ofa-general] Re: Re: [Query] ib add path record cache In-Reply-To: <1180608131.7116.145947.camel@hal.voltaire.com> References: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com> <000801c7a2f8$55749000$3c98070a@amr.corp.intel.com> <309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com> <1180608131.7116.145947.camel@hal.voltaire.com> Message-ID: <20070531121239.GG26309@mellanox.co.il> > > Ok you mean Its not required to create a separate device interface in > > cache module as such. I think this is a good idea......Just for > > confirmation...whether /dev/umad is active on each node other than SM > > node? > > It's not required to be but is for things that require userspace SA > client access (like RDMA CM or local cache). AFAIK /dev/umad is not required to use "rdma cm" - this module has its own device.
-- MST From halr at voltaire.com Thu May 31 05:23:38 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 08:23:38 -0400 Subject: [ofa-general] Re: Re: [Query] ib add path record cache In-Reply-To: <20070531121239.GG26309@mellanox.co.il> References: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com> <000801c7a2f8$55749000$3c98070a@amr.corp.intel.com> <309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com> <1180608131.7116.145947.camel@hal.voltaire.com> <20070531121239.GG26309@mellanox.co.il> Message-ID: <1180614217.7116.152409.camel@hal.voltaire.com> On Thu, 2007-05-31 at 08:12, Michael S. Tsirkin wrote: > > > Ok you mean Its not required to create a separate device interface in > > > cache module as such. I think this is a good idea......Just for > > > confirmation...whether /dev/umad is active on each node other than SM > > > node? > > > > It's not required to be but is for things that require userspace SA > > client access (like RDMA CM or local cache). > > AFAIK /dev/umad is not required to use "rdma cm" - this module > has its own device. My mistake. It's also used by infiniband-diags and ibutils. -- Hal From eli at mellanox.co.il Thu May 31 05:29:54 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 31 May 2007 15:29:54 +0300 Subject: [ofa-general] [PATCH] libibverbs/examples: free invalid pointer Message-ID: <1180614624.7053.14.camel@mtls03> the dev_list pointer is allocated, incremented and then freed. 
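The class of bug described here (a list pointer is advanced while walking it, then the advanced pointer is passed to the free routine) can be shown without libibverbs. The following is only a minimal sketch: demo_get_list(), demo_free_list() and demo_find() are hypothetical helpers invented for illustration, standing in for ibv_get_device_list() and ibv_free_device_list(). The point is that the head pointer is kept untouched and a separate cursor does the walking.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ibv_get_device_list(): returns a
 * NULL-terminated array that must later be released through the
 * exact pointer returned here. */
static const char **demo_get_list(void)
{
    static const char *names[] = { "dev0", "dev1", "dev2" };
    size_t n = sizeof(names) / sizeof(names[0]);
    const char **list = malloc((n + 1) * sizeof(*list));

    if (!list)
        return NULL;
    memcpy(list, names, n * sizeof(*list));
    list[n] = NULL;
    return list;
}

/* Hypothetical stand-in for ibv_free_device_list(). */
static void demo_free_list(const char **list)
{
    free(list);
}

/* Look up a name and return its index, or -1 if it is not present.
 * Only the cursor "cur" is advanced; "head" still points at the
 * start of the allocation when it is freed. */
static int demo_find(const char *wanted)
{
    const char **head;
    const char **cur;
    int index = -1;

    assert(wanted != NULL);
    head = demo_get_list();
    if (!head)
        return -1;

    for (cur = head; *cur; ++cur) {
        if (!strcmp(*cur, wanted)) {
            index = (int)(cur - head);
            break;
        }
    }

    demo_free_list(head);   /* free the original pointer, never cur */
    return index;
}
```

Calling demo_find("dev1") walks two entries and still frees the allocation through its original address. Passing the advanced cursor to free() instead would be undefined behaviour.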
Signed-off-by: Eli Cohen --- Index: libibverbs/examples/srq_pingpong.c =================================================================== --- libibverbs.orig/examples/srq_pingpong.c 2007-05-31 15:18:10.000000000 +0300 +++ libibverbs/examples/srq_pingpong.c 2007-05-31 15:21:47.000000000 +0300 @@ -549,7 +549,7 @@ static void usage(const char *argv0) int main(int argc, char *argv[]) { - struct ibv_device **dev_list; + struct ibv_device **dev_list, **__dev_list; struct ibv_device *ib_dev; struct ibv_wc *wc; struct pingpong_context *ctx; @@ -668,7 +668,7 @@ int main(int argc, char *argv[]) page_size = sysconf(_SC_PAGESIZE); - dev_list = ibv_get_device_list(NULL); + dev_list = __dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; @@ -863,7 +863,7 @@ int main(int argc, char *argv[]) if (pp_close_ctx(ctx, num_qp)) return 1; - ibv_free_device_list(dev_list); + ibv_free_device_list(__dev_list); free(rem_dest); return 0; From halr at voltaire.com Thu May 31 08:39:13 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 11:39:13 -0400 Subject: [ofa-general] [PATCH 1/2] management/*.spec.in: Change source Message-ID: <1180625949.7116.164563.camel@hal.voltaire.com> management/*.spec.in: Change source Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/infiniband-diags.spec.in b/infiniband-diags/infiniband-diags.spec.in index cc9de3b..0a9c7bc 100644 --- a/infiniband-diags/infiniband-diags.spec.in +++ b/infiniband-diags/infiniband-diags.spec.in @@ -9,7 +9,7 @@ Release: %rel%{?dist} License: GPL/BSD Group: System Environment/Libraries BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) -Source: git://git.openfabrics.org/~halr/management/@TARBALL@ +Source: git://git.openfabrics.org/~halr/management/infiniband-diags-git.tgz Url: http://openfabrics.org/ BuildRequires: libibmad-devel, opensm-devel, autoconf, automake Provides: perl(IBswcountlimits) diff --git 
a/libibcommon/libibcommon.spec.in b/libibcommon/libibcommon.spec.in index 73542fa..6ab806f 100644 --- a/libibcommon/libibcommon.spec.in +++ b/libibcommon/libibcommon.spec.in @@ -9,7 +9,7 @@ Release: %rel%{?dist} License: GPL/BSD Group: System Environment/Libraries BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) -Source: git://git.openfabrics.org/~halr/management/@TARBALL@ +Source: git://git.openfabrics.org/~halr/management/libibcommon-git.tgz Url: http://openfabrics.org/ Requires(post): /sbin/ldconfig Requires(postun): /sbin/ldconfig diff --git a/libibmad/libibmad.spec.in b/libibmad/libibmad.spec.in index 0ca9ac3..8d2b10a 100644 --- a/libibmad/libibmad.spec.in +++ b/libibmad/libibmad.spec.in @@ -9,7 +9,7 @@ Release: %rel%{?dist} License: GPL/BSD Group: System Environment/Libraries BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) -Source: git://git.openfabrics.org/~halr/management/@TARBALL@ +Source: git://git.openfabrics.org/~halr/management/libibmad-git.tgz Url: http://openfabrics.org/ BuildRequires: libibumad-devel, autoconf, libtool, automake Requires(post): /sbin/ldconfig diff --git a/libibumad/libibumad.spec.in b/libibumad/libibumad.spec.in index f2641e2..e5890d7 100644 --- a/libibumad/libibumad.spec.in +++ b/libibumad/libibumad.spec.in @@ -9,7 +9,7 @@ Release: %rel%{?dist} License: GPL/BSD Group: System Environment/Libraries BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) -Source: git://git.openfabrics.org/~halr/management/@TARBALL@ +Source: git://git.openfabrics.org/~halr/management/libibumad-git.tgz Url: http://openfabrics.org Requires(post): /sbin/ldconfig Requires(postun): /sbin/ldconfig diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in index ea86756..f72c5b9 100644 --- a/opensm/opensm.spec.in +++ b/opensm/opensm.spec.in @@ -21,7 +21,7 @@ Release: %rel%{?dist} License: GPL/BSD Group: System Environment/Daemons URL: http://openfabrics.org/ -Source: 
git://git.openfabrics.org/~halr/management/@TARBALL@ +Source: git://git.openfabrics.org/~halr/management/opensm-git.tgz BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) BuildRequires: libibumad-devel, autoconf, libtool, automake Requires: %{name}-libs = %{version}-%{release}, logrotate From halr at voltaire.com Thu May 31 08:41:38 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 11:41:38 -0400 Subject: [ofa-general] [PATCH 2/2] management/make.dist: Handle .spec files differently Message-ID: <1180625959.7116.164565.camel@hal.voltaire.com> management/make.dist: Handle .spec files differently No longer checkin .spec files and remove them after tarball is created. Signed-off-by: Hal Rosenstock diff --git a/make.dist b/make.dist index 99481de..f65c7e2 100755 --- a/make.dist +++ b/make.dist @@ -24,10 +24,10 @@ echo "code as released before even if yo echo "around." echo echo " As part of this process, the script will parse the .spec.in" -echo "file and output a .spec file and check that into the git repo" -echo "so it is included in the tag. Since this script isn't smart enough" -echo "to deal with other random changes that should have their own checkin," -echo "the script will refuse to run if the current repo state is not clean." +echo "file and output a .spec file. Since this script isn't smart" +echo "enough to deal with other random changes that should have their own" +echo "checkin the script will refuse to run if the current repo state is not" +echo "clean." 
echo echo " NOTE: the script has no clue if you are tagging on the right branch," echo "it will however show you the git branch output so you can confirm it" @@ -107,7 +107,7 @@ for target in $TARGETS; do else TARBALL=$target-$VERSION.tgz fi - sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $target/$target.spec.in > $target/$target.spec + sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/' < $target/$target.spec.in > $target/$target.spec cp -a $target $target-$VERSION echo "Creating $TMPDIR/$TARBALL" tar -czf $TMPDIR/$TARBALL --exclude=.git $target-$VERSION @@ -115,11 +115,10 @@ for target in $TARGETS; do done if [ $1 = release ]; then - echo "Checking in modified spec files and tagging release." + echo "Removing modified spec files and tagging release." for target in $TARGETS; do - git add $target/$target.spec + rm -rf $target/$target.spec done - git commit -m "Automatic check-in of target spec files after version processing" for target in $TARGETS; do VERSION=`grep "AC_INIT.*$target" $target/configure.in | cut -f 2 -d ',' | sed -e 's/ //g'` if [ ! -z "$2" ]; then @@ -134,4 +133,3 @@ if [ $1 = release ]; then done fi -
From dledford at redhat.com Thu May 31 08:47:11 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 31 May 2007 11:47:11 -0400 Subject: [ofa-general] Re: [PATCH 1/2] management/*.spec.in: Change source In-Reply-To: <1180625949.7116.164563.camel@hal.voltaire.com> References: <1180625949.7116.164563.camel@hal.voltaire.com> Message-ID: <1180626431.4120.44.camel@firewall.xsintricity.com> On Thu, 2007-05-31 at 11:39 -0400, Hal Rosenstock wrote: > management/*.spec.in: Change source > > Signed-off-by: Hal Rosenstock Nak. If you check this in, then the automatic version update of the file for the different RPMs won't work. You need to leave the @TARBALL@ sed substitution in place in the spec.in file, and keep the sed substition in the make.dist script, otherwise your release tarballs will be named eg.
opensm-3.3.0.tgz and in the spec file it will say opensm-git.tgz > diff --git a/infiniband-diags/infiniband-diags.spec.in b/infiniband-diags/infiniband-diags.spec.in > index cc9de3b..0a9c7bc 100644 > --- a/infiniband-diags/infiniband-diags.spec.in > +++ b/infiniband-diags/infiniband-diags.spec.in > @@ -9,7 +9,7 @@ Release: %rel%{?dist} > License: GPL/BSD > Group: System Environment/Libraries > BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) > -Source: git://git.openfabrics.org/~halr/management/@TARBALL@ > +Source: git://git.openfabrics.org/~halr/management/infiniband-diags-git.tgz > Url: http://openfabrics.org/ > BuildRequires: libibmad-devel, opensm-devel, autoconf, automake > Provides: perl(IBswcountlimits) > diff --git a/libibcommon/libibcommon.spec.in b/libibcommon/libibcommon.spec.in > index 73542fa..6ab806f 100644 > --- a/libibcommon/libibcommon.spec.in > +++ b/libibcommon/libibcommon.spec.in > @@ -9,7 +9,7 @@ Release: %rel%{?dist} > License: GPL/BSD > Group: System Environment/Libraries > BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) > -Source: git://git.openfabrics.org/~halr/management/@TARBALL@ > +Source: git://git.openfabrics.org/~halr/management/libibcommon-git.tgz > Url: http://openfabrics.org/ > Requires(post): /sbin/ldconfig > Requires(postun): /sbin/ldconfig > diff --git a/libibmad/libibmad.spec.in b/libibmad/libibmad.spec.in > index 0ca9ac3..8d2b10a 100644 > --- a/libibmad/libibmad.spec.in > +++ b/libibmad/libibmad.spec.in > @@ -9,7 +9,7 @@ Release: %rel%{?dist} > License: GPL/BSD > Group: System Environment/Libraries > BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) > -Source: git://git.openfabrics.org/~halr/management/@TARBALL@ > +Source: git://git.openfabrics.org/~halr/management/libibmad-git.tgz > Url: http://openfabrics.org/ > BuildRequires: libibumad-devel, autoconf, libtool, automake > Requires(post): /sbin/ldconfig > diff --git a/libibumad/libibumad.spec.in 
b/libibumad/libibumad.spec.in > index f2641e2..e5890d7 100644 > --- a/libibumad/libibumad.spec.in > +++ b/libibumad/libibumad.spec.in > @@ -9,7 +9,7 @@ Release: %rel%{?dist} > License: GPL/BSD > Group: System Environment/Libraries > BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) > -Source: git://git.openfabrics.org/~halr/management/@TARBALL@ > +Source: git://git.openfabrics.org/~halr/management/libibumad-git.tgz > Url: http://openfabrics.org > Requires(post): /sbin/ldconfig > Requires(postun): /sbin/ldconfig > diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in > index ea86756..f72c5b9 100644 > --- a/opensm/opensm.spec.in > +++ b/opensm/opensm.spec.in > @@ -21,7 +21,7 @@ Release: %rel%{?dist} > License: GPL/BSD > Group: System Environment/Daemons > URL: http://openfabrics.org/ > -Source: git://git.openfabrics.org/~halr/management/@TARBALL@ > +Source: git://git.openfabrics.org/~halr/management/opensm-git.tgz > BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) > BuildRequires: libibumad-devel, autoconf, libtool, automake > Requires: %{name}-libs = %{version}-%{release}, logrotate > > -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu May 31 09:05:30 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 12:05:30 -0400 Subject: [ofa-general] Re: [PATCH 1/2] management/*.spec.in: Change source In-Reply-To: <1180626431.4120.44.camel@firewall.xsintricity.com> References: <1180625949.7116.164563.camel@hal.voltaire.com> <1180626431.4120.44.camel@firewall.xsintricity.com> Message-ID: <1180627527.7116.166220.camel@hal.voltaire.com> On Thu, 2007-05-31 at 11:47, Doug Ledford wrote: > On Thu, 2007-05-31 at 11:39 -0400, Hal Rosenstock wrote: > > management/*.spec.in: Change source > > > > Signed-off-by: Hal Rosenstock > > Nak. If you check this in, then the automatic version update of the > file for the different RPMs won't work. You need to leave the @TARBALL@ > sed substitution in place in the spec.in file, and keep the sed > substition in the make.dist script, otherwise your release tarballs will > be named eg. opensm-3.3.0.tgz and in the spec file it will say > opensm-git.tgz OK; I'm dropping this part. I'll reissue the second part (make.dist) shortly. -- Hal From halr at voltaire.com Thu May 31 09:05:59 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 12:05:59 -0400 Subject: [ofa-general] [PATCH v2] management/make.dist: Handle .spec files differently Message-ID: <1180627536.7116.166222.camel@hal.voltaire.com> management/make.dist: Handle .spec files differently No longer commit .spec files and remove them after tarball is created. Signed-off-by: Hal Rosenstock diff --git a/make.dist b/make.dist index 99481de..20d9ca4 100755 --- a/make.dist +++ b/make.dist @@ -24,10 +24,10 @@ echo "code as released before even if yo echo "around." echo echo " As part of this process, the script will parse the .spec.in" -echo "file and output a .spec file and check that into the git repo" -echo "so it is included in the tag. 
Since this script isn't smart enough" -echo "to deal with other random changes that should have their own checkin," -echo "the script will refuse to run if the current repo state is not clean." +echo "file and output a .spec file. Since this script isn't smart" +echo "enough to deal with other random changes that should have their own" +echo "checkin the script will refuse to run if the current repo state is not" +echo "clean." echo echo " NOTE: the script has no clue if you are tagging on the right branch," echo "it will however show you the git branch output so you can confirm it" @@ -115,11 +115,10 @@ for target in $TARGETS; do done if [ $1 = release ]; then - echo "Checking in modified spec files and tagging release." + echo "Removing modified spec files and tagging release." for target in $TARGETS; do - git add $target/$target.spec + rm -rf $target/$target.spec done - git commit -m "Automatic check-in of target spec files after version processing" for target in $TARGETS; do VERSION=`grep "AC_INIT.*$target" $target/configure.in | cut -f 2 -d ',' | sed -e 's/ //g'` if [ ! -z "$2" ]; then From dledford at redhat.com Thu May 31 09:12:00 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 31 May 2007 12:12:00 -0400 Subject: [ofa-general] Re: [PATCH v2] management/make.dist: Handle .spec files differently In-Reply-To: <1180627536.7116.166222.camel@hal.voltaire.com> References: <1180627536.7116.166222.camel@hal.voltaire.com> Message-ID: <1180627920.4120.48.camel@firewall.xsintricity.com> On Thu, 2007-05-31 at 12:05 -0400, Hal Rosenstock wrote: > management/make.dist: Handle .spec files differently > > No longer commit .spec files and remove them after tarball is created. > > Signed-off-by: Hal Rosenstock Ack-by: Doug Ledford > diff --git a/make.dist b/make.dist > index 99481de..20d9ca4 100755 > --- a/make.dist > +++ b/make.dist > @@ -24,10 +24,10 @@ echo "code as released before even if yo > echo "around." 
> echo > echo " As part of this process, the script will parse the .spec.in" > -echo "file and output a .spec file and check that into the git repo" > -echo "so it is included in the tag. Since this script isn't smart enough" > -echo "to deal with other random changes that should have their own checkin," > -echo "the script will refuse to run if the current repo state is not clean." > +echo "file and output a .spec file. Since this script isn't smart" > +echo "enough to deal with other random changes that should have their own" > +echo "checkin the script will refuse to run if the current repo state is not" > +echo "clean." > echo > echo " NOTE: the script has no clue if you are tagging on the right branch," > echo "it will however show you the git branch output so you can confirm it" > @@ -115,11 +115,10 @@ for target in $TARGETS; do > done > > if [ $1 = release ]; then > - echo "Checking in modified spec files and tagging release." > + echo "Removing modified spec files and tagging release." > for target in $TARGETS; do > - git add $target/$target.spec > + rm -rf $target/$target.spec > done > - git commit -m "Automatic check-in of target spec files after version processing" > for target in $TARGETS; do > VERSION=`grep "AC_INIT.*$target" $target/configure.in | cut -f 2 -d ',' | sed -e 's/ //g'` > if [ ! -z "$2" ]; then > > -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Thu May 31 09:28:24 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 31 May 2007 09:28:24 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 2/2] for 2.6.23: ib/sa - add local path record caching In-Reply-To: <20070531055251.GD11669@mellanox.co.il> References: <000901c7a2f9$e2d0a730$3c98070a@amr.corp.intel.com> <000b01c7a2fb$6ebb05f0$3c98070a@amr.corp.intel.com> <20070531055251.GD11669@mellanox.co.il> Message-ID: <465EF7A8.2010909@ichips.intel.com> Michael S. Tsirkin wrote: > It seems that below you try to get 0x7F paths to each dest: This is the maximum number that a PR can request. Note that you only get that many if that many exist. I would expect most subnets to only have a couple of paths between each destination. > But here you seem to bypass cache for multi-path queries: > > +int ib_sa_path_rec_get(struct ib_sa_client *client, > + struct ib_device *device, u8 port_num, > + struct ib_sa_path_rec *rec, > + ib_sa_comp_mask comp_mask, > + int timeout_ms, gfp_t gfp_mask, > + void (*callback)(int status, > + struct ib_sa_path_rec *resp, > + void *context), > + void *context, > + struct ib_sa_query **sa_query) This is the existing ib_sa API, which only returns one path. You could change the behavior to return an array of paths, but I did not do that at this time. > +{ > + struct sa_path_request *req; > + struct ib_sa_attr_iter iter; > + struct ib_sa_path_rec *path_rec; > + int ret; > + > + if (!paths_per_dest) > + goto query_sa; > + > + if (!(comp_mask & IB_SA_PATH_REC_DGID) || > + !(comp_mask & IB_SA_PATH_REC_NUMB_PATH) || rec->numb_path != 1) > + goto query_sa; > > how are multiple paths used? The cache returns paths using one of two algorithms. Paths are either returned in a round robin fashion or randomly. 
See further down in this same function: + if (lookup_method == SA_DB_LOOKUP_RANDOM) + path_rec = get_random_path(&iter, rec, comp_mask); + else + path_rec = get_next_path(&iter, rec, comp_mask); The check for rec->numb_path != 1 should probably return a failure, since neither the API nor the underlying sa_query code supports it. - Sean From mshefty at ichips.intel.com Thu May 31 10:16:47 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 31 May 2007 10:16:47 -0700 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com> References: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com> <000801c7a2f8$55749000$3c98070a@amr.corp.intel.com> <309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com> Message-ID: <465F02FF.60401@ichips.intel.com> > Do you have some pointer/doc related to the design of current SA_CACHE > module....It will make things faster to understand........if not then > I will require your support to understand the things, Though I have > some top level view. I don't have any design docs. But I will happily answer any questions. > Ok you mean Its not required to create a separate device interface in > cache module as such. I think this is a good idea......Just for > confirmation...whether /dev/umad is active on each node other than SM > node? After giving this more thought, I like this approach. If we use a vendor/application specific MAD class that used the SA class as a template, we can begin creating a distributed SA. I haven't worked through details, but as an example, to load a PR into the local SA, you could send it a 'Set' MAD with a PR in the data portion. To load multiple paths, we could add a 'SetTable' method. To remove a path, we would send a 'Delete' MAD. Whether or not the MADs are sent from the local node, or some other node wouldn't matter. We can use this mechanism to pre-load the cache or simply push updates to it. 
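The two lookup methods mentioned earlier in this message (round robin and random selection among the cached paths to one destination) can be sketched as follows. This is not the code from the sa_cache patch: struct demo_dest, enum demo_lookup and demo_pick_path() are hypothetical names made up for illustration, and plain integers stand in for path records. It only assumes what the message states, namely that each call hands back one path and that the choice is either the next one in turn or a random one.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_PATHS 4

/* Hypothetical lookup policies, mirroring the two described above. */
enum demo_lookup {
    DEMO_LOOKUP_ROUND_ROBIN,
    DEMO_LOOKUP_RANDOM
};

/* Hypothetical per-destination cache entry; ints stand in for
 * cached path records. */
struct demo_dest {
    int paths[DEMO_MAX_PATHS];
    unsigned int count;     /* number of valid entries in paths[] */
    unsigned int next;      /* round-robin cursor */
};

/* Return one cached path for the destination, or -1 if none exist. */
static int demo_pick_path(struct demo_dest *d, enum demo_lookup method)
{
    unsigned int i;

    assert(d != NULL);
    if (d->count == 0)
        return -1;

    if (method == DEMO_LOOKUP_RANDOM) {
        i = (unsigned int)rand() % d->count;
    } else {
        i = d->next;
        d->next = (d->next + 1) % d->count;   /* wrap to the first path */
    }
    return d->paths[i];
}
```

With three cached paths, successive round-robin calls cycle through them in order and wrap around, while the random method returns any one of the three. A destination with no cached paths yields -1, which a caller could treat as "query the SA instead".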
- Sean From halr at voltaire.com Thu May 31 10:22:15 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 13:22:15 -0400 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <465F02FF.60401@ichips.intel.com> References: <309a667c0705292144s65f2e15q5074c7e318d56cf2@mail.gmail.com> <000801c7a2f8$55749000$3c98070a@amr.corp.intel.com> <309a667c0705310336x241eadc1xd4694e344df9aa1d@mail.gmail.com> <465F02FF.60401@ichips.intel.com> Message-ID: <1180632135.7116.170924.camel@hal.voltaire.com> On Thu, 2007-05-31 at 13:16, Sean Hefty wrote: > > Do you have some pointer/doc related to the design of current SA_CACHE > > module....It will make things faster to understand........if not then > > I will require your support to understand the things, Though I have > > some top level view. > > I don't have any design docs. But I will happily answer any questions. > > > Ok you mean Its not required to create a separate device interface in > > cache module as such. I think this is a good idea......Just for > > confirmation...whether /dev/umad is active on each node other than SM > > node? > > After giving this more thought, I like this approach. If we use a > vendor/application specific MAD class that used the SA class as a > template, we can begin creating a distributed SA. > > I haven't worked through details, but as an example, to load a PR into > the local SA, you could send it a 'Set' MAD with a PR in the data > portion. To load multiple paths, we could add a 'SetTable' method. To > remove a path, we would send a 'Delete' MAD. > > Whether or not the MADs are sent from the local node, or some other node > wouldn't matter. Would there be some sort of weak authorization needed to do this (like some key of some sort) ? -- Hal > We can use this mechanism to pre-load the cache or > simply push updates to it. 
> > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Thu May 31 10:31:31 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 31 May 2007 10:31:31 -0700 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <1180632135.7116.170924.camel@hal.voltaire.com> Message-ID: <000101c7a3a9$876d8290$ff0da8c0@amr.corp.intel.com> >Would there be some sort of weak authorization needed to do this (like >some key of some sort) ? I was thinking of matching the SA class MAD format, which includes the SM_Key field. I wouldn't use the SA class, since we'd could be defining a new method, and a different attribute / method map than what's in the spec. But I would re-use as much of the SA class design as possible, to avoid re-inventing things. - Sean From rdreier at cisco.com Thu May 31 10:35:08 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 31 May 2007 10:35:08 -0700 Subject: [ofa-general] Re: [PATCH] libibverbs/examples: free invalid pointer In-Reply-To: <1180614624.7053.14.camel@mtls03> (Eli Cohen's message of "Thu, 31 May 2007 15:29:54 +0300") References: <1180614624.7053.14.camel@mtls03> Message-ID: Thanks, but I think I fixed this bug in all the pingpong examples (not just srq_pingpong) at the beginning of May. - R. 
From halr at voltaire.com Thu May 31 10:42:15 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 13:42:15 -0400 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <000101c7a3a9$876d8290$ff0da8c0@amr.corp.intel.com> References: <000101c7a3a9$876d8290$ff0da8c0@amr.corp.intel.com> Message-ID: <1180633333.7116.172147.camel@hal.voltaire.com> On Thu, 2007-05-31 at 13:31, Sean Hefty wrote: > >Would there be some sort of weak authorization needed to do this (like > >some key of some sort) ? > > I was thinking of matching the SA class MAD format, which includes the SM_Key > field. I wouldn't use the SA class, since we'd could be defining a new method, > and a different attribute / method map than what's in the spec. But I would > re-use as much of the SA class design as possible, to avoid re-inventing things. You'd need to use a vendor class 2 if you wanted to use RMPP as the SA does. However, there is some rearranging you would need to do if you compare the relevant MAD formats. -- Hal > - Sean From mshefty at ichips.intel.com Thu May 31 10:54:56 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 31 May 2007 10:54:56 -0700 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <1180633333.7116.172147.camel@hal.voltaire.com> References: <000101c7a3a9$876d8290$ff0da8c0@amr.corp.intel.com> <1180633333.7116.172147.camel@hal.voltaire.com> Message-ID: <465F0BF0.3040002@ichips.intel.com> > You'd need to use a vendor class 2 if you wanted to use RMPP as the SA > does. However, there is some rearranging you would need to do if you > compare the relevant MAD formats. Vendor class 2 just adds the OUI, correct? I guess you could either move the SA specific header by 4-bytes, or use a 32-bit key. It's not clear to me when to use vendor-specific versus application specific, but I think that we could use application specific as well. 
The mad layer just wouldn't know to enable RMPP as easily as it does with vendor class 2. - Sean From rdreier at cisco.com Thu May 31 11:00:09 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 31 May 2007 11:00:09 -0700 Subject: [ofa-general] Re: [PATCH] suppress RLIMIT warning for root user In-Reply-To: <20070531074635.GC26309@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 31 May 2007 10:46:35 +0300") References: <465AF3D3.10205@dev.mellanox.co.il> <20070529091246.GF8159@mellanox.co.il> <20070531074635.GC26309@mellanox.co.il> Message-ID: > Thanks. I saw you put this in master - can this go into stable branch? > This warning is very annoying to people ... Good point. I haven't been very good about getting fixes onto the stable branch. I just pulled everything from libibverbs master branch onto the stable branch, since all the fixes looked appropriate. > Alternatively, how about rolling libibverbs release so that OFED can use > that? It's a good idea... I'll put out libibverbs 1.1.1 soon. - R. From rdreier at cisco.com Thu May 31 11:01:56 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 31 May 2007 11:01:56 -0700 Subject: [ofa-general] Re: wmb missing in libmthca? In-Reply-To: <20070531043447.GA11669@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 31 May 2007 07:34:47 +0300") References: <20070524114711.GB4585@mellanox.co.il> <20070531043447.GA11669@mellanox.co.il> Message-ID: OK, I committed my patch. 
From halr at voltaire.com Thu May 31 11:04:22 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 14:04:22 -0400 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <465F0BF0.3040002@ichips.intel.com> References: <000101c7a3a9$876d8290$ff0da8c0@amr.corp.intel.com> <1180633333.7116.172147.camel@hal.voltaire.com> <465F0BF0.3040002@ichips.intel.com> Message-ID: <1180634660.7116.173529.camel@hal.voltaire.com> On Thu, 2007-05-31 at 13:54, Sean Hefty wrote: > > You'd need to use a vendor class 2 if you wanted to use RMPP as the SA > > does. However, there is some rearranging you would need to do if you > > compare the relevant MAD formats. > > Vendor class 2 just adds the OUI, correct? Yes, so there are 4 less bytes available. > I guess you could either move the SA specific header by 4-bytes, Yes, everything starting with SM_Key. > or use a 32-bit key. I think that was deemed too weak and why 64 bits was chosen to begin with. > It's not clear to me when to use vendor-specific versus application > specific, but I think that we could use application specific as well. Yes, but there would be more work involved for RMPP. There is no way to know if such a class uses RMPP. > The mad layer just wouldn't know to enable RMPP as easily as it does > with vendor class 2. Ugh... That's an understatement. 
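The 4-byte difference discussed above can be sketched as follows. This is only an illustration, assuming the usual 256-byte MAD with a 24-byte common header and a 12-byte RMPP header, a 20-byte SA header (SM_Key, AttributeOffset, reserved, ComponentMask), and a vendor class 2 field made of one reserved byte plus a 3-byte OUI. All struct, macro and function names below are hypothetical; they are not taken from ib_mad.h or from any patch posted in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes, in bytes (see the lead-in for where they come from). */
#define MAD_SIZE        256  /* total size of one MAD */
#define MAD_HDR_SIZE     24  /* common MAD header */
#define RMPP_HDR_SIZE    12  /* RMPP header */
#define SA_HDR_SIZE      20  /* SM_Key + AttributeOffset + reserved + ComponentMask */
#define OUI_FIELD_SIZE    4  /* vendor class 2: 1 reserved byte + 3-byte OUI */

#define SA_DATA_SIZE \
	(MAD_SIZE - MAD_HDR_SIZE - RMPP_HDR_SIZE - SA_HDR_SIZE)
#define VENDOR2_SA_DATA_SIZE (SA_DATA_SIZE - OUI_FIELD_SIZE)

/* Hypothetical SA-style header: "everything starting with SM_Key". */
struct sa_hdr_sketch {
	uint8_t sm_key[8];
	uint8_t attr_offset[2];
	uint8_t reserved[2];
	uint8_t comp_mask[8];
};

/* SA class layout: the SA header directly follows the RMPP header. */
struct sa_mad_sketch {
	uint8_t mad_hdr[MAD_HDR_SIZE];
	uint8_t rmpp_hdr[RMPP_HDR_SIZE];
	struct sa_hdr_sketch sa_hdr;
	uint8_t data[SA_DATA_SIZE];
};

/* Vendor class 2 layout carrying the same SA-style header after the OUI. */
struct vendor2_sa_like_mad_sketch {
	uint8_t mad_hdr[MAD_HDR_SIZE];
	uint8_t rmpp_hdr[RMPP_HDR_SIZE];
	uint8_t reserved;
	uint8_t oui[3];
	struct sa_hdr_sketch sa_hdr;   /* moved down by OUI_FIELD_SIZE bytes */
	uint8_t data[VENDOR2_SA_DATA_SIZE];
};

/* Only byte arrays are used, so no padding is expected; verify anyway. */
static_assert(sizeof(struct sa_hdr_sketch) == SA_HDR_SIZE, "SA header size");
static_assert(sizeof(struct sa_mad_sketch) == MAD_SIZE, "SA MAD size");
static_assert(sizeof(struct vendor2_sa_like_mad_sketch) == MAD_SIZE,
	      "vendor class 2 MAD size");

/* Attribute payload bytes left in each layout. */
size_t sa_payload_bytes(void)
{
	return SA_DATA_SIZE;
}

size_t vendor2_payload_bytes(void)
{
	return VENDOR2_SA_DATA_SIZE;
}

/* Byte offset of the SA-style header (i.e. of SM_Key) in each layout. */
size_t sa_hdr_offset_in_sa_mad(void)
{
	return offsetof(struct sa_mad_sketch, sa_hdr);
}

size_t sa_hdr_offset_in_vendor2_mad(void)
{
	return offsetof(struct vendor2_sa_like_mad_sketch, sa_hdr);
}
```

Under these assumptions the SA-style header starts at byte 36 in an SA MAD and at byte 40 behind the OUI, which leaves 196 rather than 200 bytes of attribute data. A stored SA response could therefore not be forwarded byte-for-byte in a vendor class 2 MAD without re-packing or truncating it.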
-- Hal > - Sean From venkatesh.babu at 3leafnetworks.com Thu May 31 11:35:06 2007 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Thu, 31 May 2007 11:35:06 -0700 Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master In-Reply-To: <1179878469.16831.42580.camel@hal.voltaire.com> References: <4652167F.9040709@3leafnetworks.com> <1179785796.15940.27092.camel@hal.voltaire.com> <4652542C.3010400@3leafnetworks.com> <1179805556.15940.47640.camel@hal.voltaire.com> <46528E3C.8090305@3leafnetworks.com> <1179831181.15940.74121.camel@hal.voltaire.com> <4653845C.1090507@3leafnetworks.com> <1179878469.16831.42580.camel@hal.voltaire.com> Message-ID: <465F155A.5030508@3leafnetworks.com> Hal Rosenstock wrote: >>This output was captured on node vortex3l-83, the one who runs opensm. >>Do you want the perfquery output before and after some time interval ? >> >> > >I'm interested in VL15 drops to make sure that is not going on. > > I am seeing non zero (0 - 10) VL15 drops counter. What is the significance and cause of these errors ? How can I get rid or correct them ? VBabu From halr at voltaire.com Thu May 31 11:21:40 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 14:21:40 -0400 Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master In-Reply-To: <465F155A.5030508@3leafnetworks.com> References: <4652167F.9040709@3leafnetworks.com> <1179785796.15940.27092.camel@hal.voltaire.com> <4652542C.3010400@3leafnetworks.com> <1179805556.15940.47640.camel@hal.voltaire.com> <46528E3C.8090305@3leafnetworks.com> <1179831181.15940.74121.camel@hal.voltaire.com> <4653845C.1090507@3leafnetworks.com> <1179878469.16831.42580.camel@hal.voltaire.com> <465F155A.5030508@3leafnetworks.com> Message-ID: <1180635698.7116.174586.camel@hal.voltaire.com> On Thu, 2007-05-31 at 14:35, Venkatesh Babu wrote: > Hal Rosenstock wrote: > > >>This output was captured on node vortex3l-83, the one who runs opensm. 
> >>Do you want the perfquery output before and after some time interval ? > >> > >> > > > >I'm interested in VL15 drops to make sure that is not going on. > > > > > > I am seeing non zero (0 - 10) VL15 drops counter. What is the > significance and cause of these errors ? This means that some VL15 packets arrive at the switch with no available VL15 buffers so they are dropped. These could be any SM packets (SMInfo is just one possibility). > How can I get rid or correct them ? You would need to contact your switch vendor to see if the VL15 buffering can be reconfigured. I'm not sure whether or not this is related to your standby issue or not. Are you seeing any other errors on any of the ports ? -- Hal > VBabu From halr at voltaire.com Thu May 31 11:31:59 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 14:31:59 -0400 Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master Message-ID: <1180636318.7116.175237.camel@hal.voltaire.com> On Thu, 2007-05-31 at 14:35, Venkatesh Babu wrote: > Hal Rosenstock wrote: > > >>This output was captured on node vortex3l-83, the one who runs opensm. > >>Do you want the perfquery output before and after some time interval ? > >> > >> > > > >I'm interested in VL15 drops to make sure that is not going on. > > > > > > I am seeing non zero (0 - 10) VL15 drops counter. What is the > significance and cause of these errors ? This means that some VL15 packets arrive at the switch with no available VL15 buffers so they are dropped. These could be any SM packets (SMInfo is just one possibility). > How can I get rid or correct them ? You would need to contact your switch vendor to see if the VL15 buffering can be reconfigured. I'm not sure whether or not this is related to your standby issue or not. Are you seeing any other errors on any of the ports ? 
-- Hal > VBabu From mshefty at ichips.intel.com Thu May 31 11:41:51 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 31 May 2007 11:41:51 -0700 Subject: [ofa-general] Re: [Query] ib add path record cache In-Reply-To: <1180634660.7116.173529.camel@hal.voltaire.com> References: <000101c7a3a9$876d8290$ff0da8c0@amr.corp.intel.com> <1180633333.7116.172147.camel@hal.voltaire.com> <465F0BF0.3040002@ichips.intel.com> <1180634660.7116.173529.camel@hal.voltaire.com> Message-ID: <465F16EF.2060605@ichips.intel.com> > Yes, so there are 4 less bytes available. ..and we may want to shift everything down another 4 bytes for alignment, if that's needed. It could be very convenient to have the exact same layout as the SA MADs. This would let an app issue a normal SA GetTable query, store away the response, then later forward the response to the local SA only changing a couple of fields without having to re-pack everything. >> The mad layer just wouldn't know to enable RMPP as easily as it does >> with vendor class 2. > > Ugh... That's an understatement. We could add a check for the local SA class, but I'm hoping there's a better way. What would be nice are vendor/application specific extensions to the existing classes... or even a standardized framework for supporting a distributed SA. (Neither of these seem likely anytime soon though.) - Sean From venkatesh.babu at 3leafnetworks.com Thu May 31 12:12:22 2007 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Thu, 31 May 2007 12:12:22 -0700 Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master In-Reply-To: <1180636318.7116.175237.camel@hal.voltaire.com> References: <1180636318.7116.175237.camel@hal.voltaire.com> Message-ID: <465F1E16.2080309@3leafnetworks.com> Hal Rosenstock wrote: >> I am seeing non zero (0 - 10) VL15 drops counter. What is the >>significance and cause of these errors ? 
> >> >> > >This means that some VL15 packets arrive at the switch with no available >VL15 buffers so they are dropped. These could be any SM packets (SMInfo >is just one possibility). > > > >>How can I get rid or correct them ? >> >> > >You would need to contact your switch vendor to see if the VL15 >buffering can be reconfigured. > >I'm not sure whether or not this is related to your standby issue or >not. > > At least opensm is not working correctly. Even though ibv_devinfo shows it as master, it is not responding to the broadcast join operations and it doesn't assign LIDs to other nodes. >Are you seeing any other errors on any of the ports ? > > I do see non zero port_xmit_discards error counters on some ports. Could these errors be caused by bad cables or ports ? VBabu >-- Hal > >
From tziporet at mellanox.co.il Thu May 31 11:54:34 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 31 May 2007 21:54:34 +0300 Subject: [ofa-general] OFED 1.2 rc4 release In-Reply-To: <43AA3CB3C1BF5A499F5AAD31CA5023AC06624A26@mtlexch01.mtl.com> References: <43AA3CB3C1BF5A499F5AAD31CA5023AC06624A26@mtlexch01.mtl.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C9015634B7@mtlexch01.mtl.com>
Hi,
OFED 1.2-RC4 is available on http://www.openfabrics.org/builds/ofed-1.2/
File: OFED-1.2-rc4.tgz
To get BUILD_ID run ofed_info
Please report any issues in bugzilla https://bugs.openfabrics.org/
Next RC or official release will be decided in the coming Monday coordination meeting
Tziporet & Vlad
========================================================================
Release information:
OS support:
Novell:
- SLES 9.0 SP3
- SLES10 (and SP1 RC2 partially tested)
Redhat:
- Redhat EL4 up3, up4 and up5
- Redhat EL5
kernel.org:
- 2.6.20
- 2.6.19
Note: Fedora C6 and SuSE Pro 10 are not part of the official list. We keep the backport patches for these OSes and make sure OFED compiles and loads properly, but will not do a full QA cycle.
Systems:
* x86_64
* x86
* ia64
* ppc64
Main changes from OFED-1.2-rc3:
1. Fixed 23 bugs (see attachment for all bugs fixed)
2. Updated documents - all owners please review to make sure the docs of your component are updated.
Major limitations and known issues:
-----------------------------------
Bug  Severity  Owner                    Summary
567  blocker   rolandd at cisco.com     RHEL5 ppc64 UD verbs failures
577  critical  ishai at mellanox.co.il  SRP multipath failover too slow (minutes, not seconds)
626  major     monis at voltaire.com    wrong network/broadcast address set by ib-bond script
629  major     monis at voltaire.com    ib-bonding: sometimes slow failover is noticed
541  major     mst at mellanox.co.il    slow failover with IPoIB CM bonding/ipoibtools HA
650  major     pasha at mellanox.co.il  error on install rc3 openmpi with pathscale compiler
See bugzilla for all open issues.
Tasks that should be completed:
1. Fix all blocker, critical and major bugs
2. Complete all documentation (release notes, README, etc.)
-------------- next part -------------- A non-text attachment was scrubbed... Name: fixed_rc4_bugs.csv Type: application/octet-stream Size: 2999 bytes Desc: fixed_rc4_bugs.csv URL:
From halr at voltaire.com Thu May 31 11:57:47 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 14:57:47 -0400 Subject: [ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master In-Reply-To: <465F1E16.2080309@3leafnetworks.com> References: <1180636318.7116.175237.camel@hal.voltaire.com> <465F1E16.2080309@3leafnetworks.com> Message-ID: <1180637866.7116.176862.camel@hal.voltaire.com> On Thu, 2007-05-31 at 15:12, Venkatesh Babu wrote: > Hal Rosenstock wrote: > > >> I am seeing non zero (0 - 10) VL15 drops counter. What is the > >>significance and cause of these errors ? > >> > >> > > > >This means that some VL15 packets arrive at the switch with no available > >VL15 buffers so they are dropped. These could be any SM packets (SMInfo > >is just one possibility). > > > > > > > >>How can I get rid or correct them ?
> >> > >> > > > >You would need to contact your switch vendor to see if the VL15 > >buffering can be reconfigured. > > > >I'm not sure whether or not this is related to your standby issue or > >not. > > > > > At least opensm is not working correctly. Eventhough ibv_devinfo shows > it as master and it is not responding to the broadcast join operations > or it doesn't assign LIDs to other nodes. ibv_devinfo only indicates the SMLID of the last master which claimed this node. So if there is no real current master... In this state, there is no master so no SA queries will be responded to. Only an SM which was master would respond. So if some local node thinks the SM is foo, and foo's SM is not in the master state, it will not respond. This may be an OpenSM issue or might be some lower level issue which OpenSM is not handling well. I'm not sure which, as I cannot recreate this and am not sure what is going on in your environment. > >Are you seeing any other errors on any of the ports ? > > > > > I do see non zero port_xmit_discards error counters on some ports. > > Are these errors could be because of the bad cables or ports ? I would try swapping in known good cables and see what happens. -- Hal > VBabu > > >-- Hal > > > > From sashak at voltaire.com Thu May 31 13:45:24 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 31 May 2007 23:45:24 +0300 Subject: [ofa-general] [PATCH] opensm: sminfo self query check Message-ID: <20070531204524.GX13193@sashak.voltaire.com> OpenSM can query itself for SMInfo because it is just legal, or occasionally due to port moving during subnet discovery process. Don't create remote SM entry in this case in order to prevent deadlocks.
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_sminfo_rcv.c | 9 +++++++++ 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/osm_sminfo_rcv.c b/opensm/opensm/osm_sminfo_rcv.c index 776c70b..99a716e 100644 --- a/opensm/opensm/osm_sminfo_rcv.c +++ b/opensm/opensm/osm_sminfo_rcv.c @@ -632,6 +632,15 @@ __osm_sminfo_rcv_process_get_response( goto Exit; } + if( port_guid == p_rcv->p_subn->sm_port_guid ) + { + osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, + "__osm_sminfo_rcv_process_get_response: " + "Self query response received - SM port 0x%016" PRIx64 "\n", + cl_ntoh64( port_guid ) ); + goto Exit; + } + p_sm = (osm_remote_sm_t*)cl_qmap_get( p_sm_tbl, port_guid ); if( p_sm == (osm_remote_sm_t*)cl_qmap_end( p_sm_tbl ) ) { -- 1.5.2.171.gf509 From sashak at voltaire.com Thu May 31 13:47:05 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 31 May 2007 23:47:05 +0300 Subject: [ofa-general] [PATCH] opensm: cleanup discovery count functions Message-ID: <20070531204705.GY13193@sashak.voltaire.com> This removes discovery count functions for osm_port_t and osm_node_t and makes discovery_count handling similar to osm_switch_t. Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_node.h | 89 ------------------------------------ opensm/include/opensm/osm_port.h | 90 ------------------------------------- opensm/opensm/osm_drop_mgr.c | 10 ++-- opensm/opensm/osm_node_info_rcv.c | 8 ++-- opensm/opensm/osm_port_info_rcv.c | 2 +- opensm/opensm/osm_state_mgr.c | 4 +- 6 files changed, 12 insertions(+), 191 deletions(-) diff --git a/opensm/include/opensm/osm_node.h b/opensm/include/opensm/osm_node.h index a841de7..b2d03a2 100644 --- a/opensm/include/opensm/osm_node.h +++ b/opensm/include/opensm/osm_node.h @@ -591,95 +591,6 @@ osm_node_init_physp( * Node object, Physical Port object. 
*********/ -/****f* OpenSM: Node/osm_node_discovery_count_get -* NAME -* osm_node_discovery_count_get -* -* DESCRIPTION -* Returns a pointer to the physical port object at the -* specified local port number. -* -* SYNOPSIS -*/ -static inline uint32_t -osm_node_discovery_count_get( - IN const osm_node_t* const p_node ) -{ - return( p_node->discovery_count ); -} -/* -* PARAMETERS -* p_node -* [in] Pointer to an osm_node_t object. -* -* RETURN VALUES -* Returns the discovery count for this node. -* -* NOTES -* -* SEE ALSO -* Node object -*********/ - -/****f* OpenSM: Node/osm_node_discovery_count_reset -* NAME -* osm_node_discovery_count_reset -* -* DESCRIPTION -* Resets the discovery count for this node to zero. -* This operation should be performed at the start of a sweep. -* -* SYNOPSIS -*/ -static inline void -osm_node_discovery_count_reset( - IN osm_node_t* const p_node ) -{ - p_node->discovery_count = 0; -} -/* -* PARAMETERS -* p_node -* [in] Pointer to an osm_node_t object. -* -* RETURN VALUES -* None. -* -* NOTES -* -* SEE ALSO -* Node object -*********/ - -/****f* OpenSM: Node/osm_node_discovery_count_inc -* NAME -* osm_node_discovery_count_inc -* -* DESCRIPTION -* Increments the discovery count for this node. -* -* SYNOPSIS -*/ -static inline void -osm_node_discovery_count_inc( - IN osm_node_t* const p_node ) -{ - p_node->discovery_count++; -} -/* -* PARAMETERS -* p_node -* [in] Pointer to an osm_node_t object. -* -* RETURN VALUES -* None. 
-* -* NOTES -* -* SEE ALSO -* Node object -*********/ - /****f* OpenSM: Node/osm_node_get_node_guid * NAME * osm_node_get_node_guid diff --git a/opensm/include/opensm/osm_port.h b/opensm/include/opensm/osm_port.h index df9065e..54ebcfc 100644 --- a/opensm/include/opensm/osm_port.h +++ b/opensm/include/opensm/osm_port.h @@ -1556,96 +1556,6 @@ osm_port_add_new_physp( * Port *********/ -/****f* OpenSM: Port/osm_port_discovery_count_reset -* NAME -* osm_port_discovery_count_reset -* -* DESCRIPTION -* Resets the discovery count for this Port to zero. -* This operation should be performed at the start of a sweep. -* -* SYNOPSIS -*/ -static inline void -osm_port_discovery_count_reset( - IN osm_port_t* const p_port ) -{ - p_port->discovery_count = 0; -} -/* -* PARAMETERS -* p_port -* [in] Pointer to an osm_port_t object. -* -* RETURN VALUES -* None. -* -* NOTES -* -* SEE ALSO -* Port object -*********/ - -/****f* OpenSM: Port/osm_port_discovery_count_get -* NAME -* osm_port_discovery_count_get -* -* DESCRIPTION -* Returns the number of times this port has been discovered -* since the last time the discovery count was reset. -* -* SYNOPSIS -*/ -static inline uint32_t -osm_port_discovery_count_get( - IN const osm_port_t* const p_port ) -{ - return( p_port->discovery_count ); -} -/* -* PARAMETERS -* p_port -* [in] Pointer to an osm_port_t object. -* -* RETURN VALUES -* Returns the number of times this port has been discovered -* since the last time the discovery count was reset. -* -* NOTES -* -* SEE ALSO -* Port object -*********/ - -/****f* OpenSM: Port/osm_port_discovery_count_inc -* NAME -* osm_port_discovery_count_inc -* -* DESCRIPTION -* Increments the discovery count for this Port. -* -* SYNOPSIS -*/ -static inline void -osm_port_discovery_count_inc( - IN osm_port_t* const p_port ) -{ - p_port->discovery_count++; -} -/* -* PARAMETERS -* p_port -* [in] Pointer to an osm_port_t object. -* -* RETURN VALUES -* None. 
-* -* NOTES -* -* SEE ALSO -* Port object -*********/ - /****f* OpenSM: Port/osm_port_add_mgrp * NAME * osm_port_add_mgrp diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c index 7689728..9d91b6b 100644 --- a/opensm/opensm/osm_drop_mgr.c +++ b/opensm/opensm/osm_drop_mgr.c @@ -161,7 +161,7 @@ drop_mgr_clean_physp( the remote port was recognized, and its state is ACTIVE. If this is just a "hiccup" - force a heavy sweep in the next sweep. We don't want to lose that part of the subnet. */ - if (osm_port_discovery_count_get( p_remote_port ) && + if (p_remote_port->discovery_count && osm_physp_get_port_state( p_remote_physp ) == IB_LINK_ACTIVE ) { osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, @@ -179,7 +179,7 @@ drop_mgr_clean_physp( discovery count of the remote port. */ if ( !p_remote_physp->p_node->sw ) { - osm_port_discovery_count_reset( p_remote_port ); + p_remote_port->discovery_count = 0; osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "drop_mgr_clean_physp: Resetting discovery count of node: " "0x%016" PRIx64 " port num:0x%X\n", @@ -534,7 +534,7 @@ __osm_drop_mgr_check_node( goto Exit; } - if ( osm_port_discovery_count_get( p_port ) == 0 ) + if ( p_port->discovery_count == 0 ) { osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, "__osm_drop_mgr_check_node: " @@ -601,7 +601,7 @@ osm_drop_mgr_process( If not, it is unreachable in the current subnet, and should therefore be removed from the subnet object. */ - if( osm_node_discovery_count_get( p_node ) == 0 ) + if( p_node->discovery_count == 0 ) __osm_drop_mgr_process_node( p_mgr, p_node ); } @@ -655,7 +655,7 @@ osm_drop_mgr_process( /* If the port is unreachable, remove it from the guid table. 
*/ - if( osm_port_discovery_count_get( p_port ) == 0 ) + if( p_port->discovery_count == 0 ) __osm_drop_mgr_remove_port( p_mgr, p_port ); } diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c index 2486ffb..1eca625 100644 --- a/opensm/opensm/osm_node_info_rcv.c +++ b/opensm/opensm/osm_node_info_rcv.c @@ -641,7 +641,7 @@ __osm_ni_rcv_process_existing_switch( if the SwitchInfo mad didn't reach the SM) then we want to retry to probe the switch. */ - if( osm_node_discovery_count_get( p_node ) == 1 ) + if( p_node->discovery_count == 1 ) __osm_ni_rcv_process_switch( p_rcv, p_node, p_madw ); else { @@ -862,7 +862,7 @@ __osm_ni_rcv_process_new( else __osm_ni_rcv_set_links( p_rcv, p_node, port_num, p_ni_context ); - osm_node_discovery_count_inc( p_node ); + p_node->discovery_count++; __osm_ni_rcv_get_node_desc( p_rcv, p_node, p_madw ); switch( p_ni->node_type ) @@ -916,14 +916,14 @@ __osm_ni_rcv_process_existing( ib_get_node_type_str(p_ni->node_type), cl_ntoh64( p_ni->node_guid ), cl_ntoh64( p_smp->trans_id ), - osm_node_discovery_count_get( p_node ) ); + p_node->discovery_count ); } /* If we haven't already encountered this existing node on this particular sweep, then process further. 
*/ - osm_node_discovery_count_inc( p_node ); + p_node->discovery_count++; switch( p_ni->node_type ) { diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index a53044f..7b241d6 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -724,7 +724,7 @@ osm_pi_rcv_process( } else { - osm_port_discovery_count_inc( p_port ); + p_port->discovery_count++; /* This PortInfo arrived because we did a Get() method, diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index 73980b8..a9aef36 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -524,7 +524,7 @@ __osm_state_mgr_reset_node_count( cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); } - osm_node_discovery_count_reset( p_node ); + p_node->discovery_count = 0; } /********************************************************************** @@ -545,7 +545,7 @@ __osm_state_mgr_reset_port_count( cl_ntoh64( osm_port_get_guid( p_port ) ) ); } - osm_port_discovery_count_reset( p_port ); + p_port->discovery_count = 0; } /********************************************************************** -- 1.5.2.171.gf509 From sashak at voltaire.com Thu May 31 14:39:34 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 Jun 2007 00:39:34 +0300 Subject: [ofa-general] [PATCH] opensm: less iterations in osm_link_mgr_process() loop Message-ID: <20070531213933.GZ13193@sashak.voltaire.com> Instead of looping over endports in order to get node's physical ports list (and to repeat scanning), just use nodes. IOW - replace __osm_link_mgr_process_port() by __osm_link_mgr_process_node(). 
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_link_mgr.c | 34 +++++++++++++++++----------------- 1 files changed, 17 insertions(+), 17 deletions(-) diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c index 73bebce..640ed38 100644 --- a/opensm/opensm/osm_link_mgr.c +++ b/opensm/opensm/osm_link_mgr.c @@ -399,9 +399,9 @@ __osm_link_mgr_set_physp_pi( /********************************************************************** **********************************************************************/ static osm_signal_t -__osm_link_mgr_process_port( +__osm_link_mgr_process_node( IN osm_link_mgr_t* const p_mgr, - IN osm_port_t* const p_port, + IN osm_node_t* const p_node, IN const uint8_t link_state ) { uint32_t i; @@ -410,14 +410,14 @@ __osm_link_mgr_process_port( uint8_t current_state; osm_signal_t signal = OSM_SIGNAL_DONE; - OSM_LOG_ENTER( p_mgr->p_log, __osm_link_mgr_process_port ); + OSM_LOG_ENTER( p_mgr->p_log, __osm_link_mgr_process_node ); if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, - "__osm_link_mgr_process_port: " - "Port 0x%" PRIx64 " going to %s\n", - cl_ntoh64( osm_port_get_guid( p_port ) ), + "__osm_link_mgr_process_node: " + "Node 0x%" PRIx64 " going to %s\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), ib_get_port_state_str( link_state ) ); } @@ -426,7 +426,7 @@ __osm_link_mgr_process_port( with this Port. Start iterating with port 1, since the linkstate is not applicable to the management port on switches. */ - num_physp = osm_node_get_num_physp( p_port->p_node ); + num_physp = osm_node_get_num_physp( p_node ); for( i = 0; i < num_physp; i ++ ) { /* @@ -434,7 +434,7 @@ __osm_link_mgr_process_port( or if the state of the port is already better then the specified state. 
*/ - p_physp = osm_node_get_physp_ptr( p_port->p_node, (uint8_t)i ); + p_physp = osm_node_get_physp_ptr( p_node, (uint8_t)i ); if( osm_physp_is_valid( p_physp ) ) { current_state = osm_physp_get_port_state( p_physp ); @@ -464,9 +464,9 @@ __osm_link_mgr_process_port( if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, - "__osm_link_mgr_process_port: " + "__osm_link_mgr_process_node: " "Physical port 0x%X already %s. Skipping\n", - osm_physp_get_port_num( p_physp ), + p_physp->port_num, ib_get_port_state_str( current_state ) ); } } @@ -484,21 +484,21 @@ osm_link_mgr_process( IN osm_link_mgr_t* const p_mgr, IN const uint8_t link_state ) { - cl_qmap_t *p_port_guid_tbl; - osm_port_t *p_port; + cl_qmap_t *p_node_guid_tbl; + osm_node_t *p_node; osm_signal_t signal = OSM_SIGNAL_DONE; OSM_LOG_ENTER( p_mgr->p_log, osm_link_mgr_process ); - p_port_guid_tbl = &p_mgr->p_subn->port_guid_tbl; + p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); - for( p_port = (osm_port_t*)cl_qmap_head( p_port_guid_tbl ); - p_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl ); - p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ) ) + for( p_node = (osm_node_t*)cl_qmap_head( p_node_guid_tbl ); + p_node != (osm_node_t*)cl_qmap_end( p_node_guid_tbl ); + p_node = (osm_node_t*)cl_qmap_next( &p_node->map_item ) ) { - if( __osm_link_mgr_process_port( p_mgr, p_port, link_state ) == + if( __osm_link_mgr_process_node( p_mgr, p_node, link_state ) == OSM_SIGNAL_DONE_PENDING ) signal = OSM_SIGNAL_DONE_PENDING; } -- 1.5.2.171.gf509 From sashak at voltaire.com Thu May 31 15:33:41 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 Jun 2007 01:33:41 +0300 Subject: [ofa-general] [PATCH] opensm/sminfo: mutex cleanup fix In-Reply-To: <20070531204524.GX13193@sashak.voltaire.com> References: <20070531204524.GX13193@sashak.voltaire.com> Message-ID: <20070531223341.GA23029@sashak.voltaire.com> This fixes mutex cleanups 
in SMInfo processor. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_sminfo_rcv.c | 12 +++++++----- 1 files changed, 7 insertions(+), 5 deletions(-) diff --git a/opensm/opensm/osm_sminfo_rcv.c b/opensm/opensm/osm_sminfo_rcv.c index 99a716e..b26b6bf 100644 --- a/opensm/opensm/osm_sminfo_rcv.c +++ b/opensm/opensm/osm_sminfo_rcv.c @@ -617,7 +617,7 @@ __osm_sminfo_rcv_process_get_response( osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_sminfo_rcv_process_get_response: ERR 2F12: " "No port object for this SM\n" ); - goto Exit; + goto _unlock_and_exit; } if( osm_port_get_guid( p_port ) != p_smi->guid ) @@ -629,7 +629,7 @@ __osm_sminfo_rcv_process_get_response( ", Received 0x%016" PRIx64 "\n", cl_ntoh64( osm_port_get_guid( p_port ) ), cl_ntoh64( p_smi->guid ) ); - goto Exit; + goto _unlock_and_exit; } if( port_guid == p_rcv->p_subn->sm_port_guid ) @@ -638,7 +638,7 @@ __osm_sminfo_rcv_process_get_response( "__osm_sminfo_rcv_process_get_response: " "Self query response received - SM port 0x%016" PRIx64 "\n", cl_ntoh64( port_guid ) ); - goto Exit; + goto _unlock_and_exit; } p_sm = (osm_remote_sm_t*)cl_qmap_get( p_sm_tbl, port_guid ); @@ -650,7 +650,7 @@ __osm_sminfo_rcv_process_get_response( osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_sminfo_rcv_process_get_response: ERR 2F14: " "Unable to allocate SM object\n" ); - goto Exit; + goto _unlock_and_exit; } osm_remote_sm_init( p_sm, p_port, p_smi ); @@ -668,7 +668,7 @@ __osm_sminfo_rcv_process_get_response( process_get_sm_ret_val = __osm_sminfo_rcv_process_get_sm( p_rcv, p_sm ); - Exit: + _unlock_and_exit: CL_PLOCK_RELEASE( p_rcv->p_lock ); /* If process_get_sm_ret_val != OSM_SIGNAL_NONE then we have to signal @@ -676,6 +676,8 @@ __osm_sminfo_rcv_process_get_response( if (process_get_sm_ret_val != OSM_SIGNAL_NONE) osm_state_mgr_process( p_rcv->p_state_mgr, process_get_sm_ret_val ); + + Exit: OSM_LOG_EXIT( p_rcv->p_log ); } -- 1.5.2.171.gf509 From halr at voltaire.com Thu May 31 15:25:27 2007 From: halr at voltaire.com (Hal 
Rosenstock) Date: 31 May 2007 18:25:27 -0400 Subject: [ofa-general] Re: [PATCH] opensm: cleanup discovery count functions In-Reply-To: <20070531204705.GY13193@sashak.voltaire.com> References: <20070531204705.GY13193@sashak.voltaire.com> Message-ID: <1180650323.7116.189578.camel@hal.voltaire.com> On Thu, 2007-05-31 at 16:47, Sasha Khapyorsky wrote: > This removes discovery count functions for osm_port_t and osm_node_t > and makes discovery_count handling similar to osm_switch_t. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From halr at voltaire.com Thu May 31 15:26:04 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 18:26:04 -0400 Subject: [ofa-general] Re: [PATCH] opensm: less iterations in osm_link_mgr_process() loop In-Reply-To: <20070531213933.GZ13193@sashak.voltaire.com> References: <20070531213933.GZ13193@sashak.voltaire.com> Message-ID: <1180650363.7116.189663.camel@hal.voltaire.com> On Thu, 2007-05-31 at 17:39, Sasha Khapyorsky wrote: > Instead of looping over endports in order to get node's physical > ports list (and to repeat scanning), just use nodes. IOW - replace > __osm_link_mgr_process_port() by __osm_link_mgr_process_node(). > > Signed-off-by: Sasha Khapyorsky Nice optimization. Thanks. Applied. -- Hal From halr at voltaire.com Thu May 31 15:26:22 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 31 May 2007 18:26:22 -0400 Subject: [ofa-general] Re: [PATCH] opensm: sminfo self query check In-Reply-To: <20070531204524.GX13193@sashak.voltaire.com> References: <20070531204524.GX13193@sashak.voltaire.com> Message-ID: <1180650368.7116.189665.camel@hal.voltaire.com> On Thu, 2007-05-31 at 16:45, Sasha Khapyorsky wrote: > OpenSM can query itself for SMInfo because it is just legal, or > occasionally due to port moving during subnet discovery process. > Don't create remote SM entry in this case in order to prevent > deadlocks. > > Signed-off-by: Sasha Khapyorsky Good catch. Thanks. Applied. 
-- Hal From pradeeps at linux.vnet.ibm.com Thu May 31 16:37:23 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 31 May 2007 16:37:23 -0700 Subject: [ofa-general] ipoib drain cq question Message-ID: <465F5C33.1050202@linux.vnet.ibm.com> ipoib_cm_start_rx_drain() posts a DRAIN_WRID to be sent. So, why does ipoib_cm_handle_rx_wc() have to call ipoib_cm_start_rx_drain() upon receipt of the WC? Pradeep